
use rotation to remove branching in inner loop #63

Merged 6 commits into main on Apr 16, 2024

Conversation

@mcroomp (Collaborator) commented Apr 12, 2024

Felt inspired for some reason. Who knew there was another 20% of optimization left in the inner loop, @danielrh?

This change removes 2 branches from the inner loop:

  • instead of having two paths for updating the counter (one for the high byte and one for the low byte), use a bit rotation at the beginning and end (depending on whether the bit was a 1 or a 0) so that the same code handles both cases
  • use a different mask for the special casing, which allows the optimizer to use a conditional move instead of a jump instruction (a sketch of both ideas follows below)
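
A hedged sketch of how these two ideas could look, assuming the two 8-bit counters are packed into a single u16 with the counter being bumped rotated into the high byte; the real code lives in src/structs/branch.rs and may differ in details:

```rust
// Illustrative only: the counter layout, constants and normalization rule
// here are assumptions, not the exact contents of src/structs/branch.rs.
fn record_bit(counts: u16, bit: bool) -> u16 {
    // 1. rotate so the counter for this bit value always sits in the high
    //    byte, letting one add serve both the `true` and `false` paths
    let rotated = if bit { counts } else { counts.rotate_left(8) };

    // 2. bump the high-byte counter; the carry out of the u16 doubles as the
    //    "needs normalization" flag, so no extra mask-and-compare is needed
    let (bumped, overflowed) = rotated.overflowing_add(0x0100);

    // when the counter saturates, halve both counters, keeping them >= 1;
    // a simple select like this can compile to a conditional move, not a jump
    let normalized = ((rotated >> 1) & 0x7F7F) | 0x0101;
    let updated = if overflowed { normalized } else { bumped };

    // undo the rotation so both counters end up back in their original bytes
    if bit { updated } else { updated.rotate_left(8) }
}
```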

@danielrh commented

Nice find! It's amazing how adding a few computations before the branch has to be taken can allow the processor to do things while the PC finds its new home :-)

@mcroomp (Collaborator, Author) commented Apr 12, 2024

> Nice find! It's amazing how adding a few computations before the branch has to be taken can allow the processor to do things while the PC finds its new home :-)

Thanks! Effectively there is now only one branch left in the inner loop, taken when the counters need to be normalized, and branch prediction handles it well since it happens infrequently. Avoiding the 50%-probability jump was the biggest gain.


@Melirius (Collaborator) commented

Really nice! I tried to improve this part by interleaving the counters' bits, which would also improve LUT cache locality, but it was too much work per update.

@mcroomp (Collaborator, Author) commented Apr 12, 2024

> Really nice! I tried to improve this part by interleaving the counters' bits, which would also improve LUT cache locality, but it was too much work per update.

From the profiling, most of the memory latency is coming from the Model tables. One idea would be a much smaller table that contains most of the common coefficients/numzeros/etc., or maybe even a hashtable to use memory more efficiently.

@Melirius (Collaborator) commented Apr 12, 2024

> Really nice! I tried to improve this part by interleaving the counters' bits, which would also improve LUT cache locality, but it was too much work per update.
>
> From the profiling, most of the memory latency is coming from the Model tables. One idea would be a much smaller table that contains most of the common coefficients/numzeros/etc., or maybe even a hashtable to use memory more efficiently.

Hashtable does not help, I tried. But some of the table values in the initial Lepton tables are not used, so the tables can be made more compact:

  • The num_non_zeros_to_bin function has an output range of [0,9] inclusive, so the first dimension of num_non_zeros_counts7x7 can be shrunk accordingly (note also that the maximum predictor for non-zeros is not 49 but 25, so the used range is actually [0,8]).
  • The 0th bin of exponent_counts is not used: nothing is decoded if no 7x7 coefficients are present.
  • residual_noise_counts can be split into two tables, for 7x7 and edge coefficients, with lower dimensions for each, etc.
  • The get_grid function arrays can be compacted by a trick: they access only [0][0], then [1][0-1], then [2][0-3], ... [n][0-(2^n-1)], the second index being the n-bit decoded_so_far value. This can be converted to a one-dimensional array with (n+1)-bit indices: start with decoded_so_far = 1 and then do decoded_so_far <<= 1; decoded_so_far |= cur_bit as usize;. Finally the value is obtained simply by zeroing the leading bit of decoded_so_far (a sketch follows below).
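
A small sketch of that flattened-indexing idea for get_grid; the grid slice, the next_bit closure, and the loop shape are illustrative assumptions rather than the actual lepton code:

```rust
// Row i of the old triangular table [i][0..2^i - 1] maps to the flat slots
// [2^i, 2^(i+1) - 1]: the index is simply decoded_so_far seeded with a
// leading 1 bit, so `grid` needs 2^n entries for n decoded bits.
fn decode_grid_value(grid: &[u16], n: usize, mut next_bit: impl FnMut(u16) -> bool) -> usize {
    let mut decoded_so_far: usize = 1; // leading 1 marks how many bits were consumed
    for _ in 0..n {
        // grid[decoded_so_far] replaces the 2D access grid[row][prefix]
        let cur_bit = next_bit(grid[decoded_so_far]);
        decoded_so_far <<= 1;
        decoded_so_far |= cur_bit as usize;
    }
    // zero the leading marker bit to recover the decoded n-bit value
    decoded_so_far & !(1usize << n)
}
```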

@danielrh left a comment


This is mega clever. I like how you always use the upper half so you can utilize the overflowing add to accomplish the same math for both true and false.

I wonder if a little comment blurb showing a very branch-unoptimized equivalent would help new people understand the algorithm, or if the version in the exhaustive test is sufficient. Maybe a link to the exhaustive test is just as good.
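
For what it's worth, a hypothetical branch-heavy equivalent of the counter update, under the same packed-counter assumptions as the earlier sketch (not the project's actual code), might look like this:

```rust
// Illustrative only: same assumed layout as before (true count in the high
// byte, false count in the low byte), written with explicit branches.
fn record_bit_branchy(counts: u16, bit: bool) -> u16 {
    let mut true_count = (counts >> 8) as u8;
    let mut false_count = (counts & 0xFF) as u8;
    if bit {
        if true_count == 0xFF {
            // normalize: halve both counters, keeping them >= 1
            true_count = (true_count >> 1) | 1;
            false_count = (false_count >> 1) | 1;
        } else {
            true_count += 1;
        }
    } else if false_count == 0xFF {
        true_count = (true_count >> 1) | 1;
        false_count = (false_count >> 1) | 1;
    } else {
        false_count += 1;
    }
    ((true_count as u16) << 8) | false_count as u16
}
```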

@mcroomp (Collaborator, Author) commented Apr 14, 2024

@Melirius For the lookup of the 64K probability table, it would be nice to shrink it so that it fits in 32K in most cases. The usage is a bit uneven, so maybe a minor transform before the lookup (if it is a lightweight bit operation) might help with the cache lines.

The graph below shows the number of times each value in the probability table is looked up (red the most, green the least):

[image: heat map of probability-table lookup counts]

@Melirius (Collaborator) commented Apr 14, 2024

@mcroomp Yes, I've thought about it, but as I said before, interleaving does not help: the counter update becomes cumbersome (2 mask shifts, 3 ORs, 1 AND and 1 increment in the general case, and roughly the same number of ops for the overflow case) and slow. Most of the time the next value will be taken from the same cache line (next in the row) or from the next line of the LUT (next in the column), so I also tried a manual prefetch of that lower line; the results were inconclusive on x64 and really bad on ARM.

The pattern is understandable: after the first overflow, at least one of the counters is never less than 129, so the top-left quadrant is never referenced again. The red patch at the corner comes from rarely accessed branches that do not accumulate many hits. The tantalizing thing is that the differences between adjacent values in those 3/4 of the square are only in [-2,+2] inclusive, but I cannot figure out how to use that for LUT compaction.

@Melirius (Collaborator) commented

The problem here is that these functions are queried for every decoded bit of the file, so every additional operation substantially increases the total number of instructions. For example, I also tried a "division by multiplication" approach, where you only need a 512-element LUT of 32-bit values (actually even fewer: only frequency sums in [2,510] inclusive occur); cache misses then drop from ~1.6% to ~0.5%, but the number of instructions executed increases by ~6% and performance drops by ~3%.
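
For illustration, a minimal sketch of that kind of reciprocal LUT; the table size, the shift amount, and the function names are assumptions about the experiment, not code from the repository:

```rust
// Replace the per-bit division by the frequency sum with a multiply by a
// precomputed fixed-point reciprocal; the sum can only fall in [2, 510]
// when both 8-bit counters stay >= 1.
const RECIP_SHIFT: u32 = 24;

fn build_reciprocal_lut() -> Vec<u32> {
    (0u32..=510)
        .map(|sum| if sum >= 2 { (1u32 << RECIP_SHIFT) / sum } else { 0 })
        .collect()
}

// approximates (false_count << 16) / (false_count + true_count) without a divide
fn prob_false(false_count: u32, true_count: u32, recip: &[u32]) -> u32 {
    let sum = (false_count + true_count) as usize;
    ((false_count as u64 * recip[sum] as u64) >> (RECIP_SHIFT - 16)) as u32
}
```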

@mcroomp (Collaborator, Author) commented Apr 16, 2024 via email

@mcroomp merged commit 983873c into main on Apr 16, 2024
3 checks passed
@mcroomp deleted the rotationoptimization branch on April 16, 2024 at 18:23
@Melirius mentioned this pull request on Apr 19, 2024