use rotation to remove branching in inner loop #63
Nice find! It's amazing how adding a few computations before the branch must be taken can allow the processor to do things while the PC finds its new home :-)
Thanks! Effectively there is now only one branch in the inner loop, taken when the counters need to be normalized, and branch prediction deals with it effectively since it happens infrequently. Avoiding the 50%-probability jump was the biggest gain.
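For readers without the diff open, a minimal sketch of the rotation trick under an assumed packing (the layout, constants, and normalization policy here are illustrative, not the PR's exact code): both adaptive-coder counters live in one `u16`, and rotating by 8 when the bit is set moves the counter to bump into the high byte, so one overflowing add serves both the true and false cases.

```rust
// Hypothetical packing: high byte = zeros seen, low byte = ones seen.
// Rotating by 8 when `bit` is set moves the counter to increment into the
// high byte, so a single overflowing add handles both bit values; the only
// remaining branch is the rare normalization on overflow.
fn update(counts: u16, bit: bool) -> u16 {
    let rot = (bit as u32) * 8;
    let r = counts.rotate_left(rot);
    // Bump the counter now sitting in the high byte; `overflowing_add`
    // flags the rare case where it was already saturated at 255.
    let (sum, overflow) = r.overflowing_add(0x0100);
    let new = if overflow {
        // Normalize: the bumped counter would be 256, so halve it to 128
        // (0x80), and halve the other counter too, flooring it at 1.
        0x8000 | ((r & 0x00FF) >> 1).max(1)
    } else {
        sum
    };
    new.rotate_right(rot)
}

fn main() {
    // one zero and one one seen so far; observe another one
    assert_eq!(update(0x0101, true), 0x0102);
    // observe a zero instead
    assert_eq!(update(0x0101, false), 0x0201);
    // ones counter saturated at 255: both counters get renormalized
    assert_eq!(update(0x01FF, true), 0x0180);
    println!("ok");
}
```

Note how the hot path (no overflow) executes the same straight-line instructions regardless of the bit value.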
Really nice! I tried to improve this part by interleaving the counter bits, which would also improve LUT cache locality, but it was too much work per update.
From the profiling, most of the memory latency comes from the Model tables. One idea would be a much smaller table that contains the most common coefficients/numzeros/etc., or maybe even a hashtable to use memory more efficiently.
A hashtable does not help; I tried. But some values of the initial Lepton tables are never used, so the tables can be made more compact. For example,
This is mega clever. I like how you always use the upper half so you can utilize the overflowing add to accomplish the same math for both true and false.
I wonder if a little comment blurb showing a very branch-unoptimized equivalent could help new people understand the algorithm, or if the version in the exhaustive test is sufficient. Maybe a link to the exhaustive test is as good.
@Melirius for the lookup of the 64K probability table, it would be nice to reduce it so it fits in 32K in most cases. The usage is a bit uneven, so a minor transform before the lookup (if it is a lightweight bit operation) might help with the cache lines. The graph below shows the number of times each value in the probability table is looked up (red the most, green the least).
@mcroomp Yes, I've thought about it, but as I said before, interleaving does not help: the counter update becomes cumbersome (2 mask shifts, 3 ORs, 1 AND, 1 increment in the general case, and about the same number of ops for overflow) and slow. Most of the time the next value is taken from the same cache line (next in row) or the next line of the LUT (next in column), so I also tried manual prefetch of that lower line; the results were inconclusive on x64 and really bad on ARM. The pattern is understandable: after the first overflow at least one of the counters is never less than 129, so the top-left quadrant is never referenced again. The red patch at the corner comes from rarely accessed branches that do not accumulate many hits. The tantalizing thing is that the differences between adjacent values over these 3/4 of the square are only in [-2, +2] inclusive, but I cannot figure out how to use that to compact the LUT.
The problem here is that these functions are queried for every decoded bit of the file, and every additional operation here substantially increases the total number of instructions. For example, I also tried a "division-by-multiplication" approach, where you can use a LUT of only 512 32-bit elements (actually even fewer: only frequency sums in [2, 510] inclusive occur); cache misses then drop from ~1.6% to ~0.5%, but the number of instructions executed increases by ~6% and performance drops by ~3%.
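The division-by-multiplication experiment can be sketched like this (only the 512-entry / 32-bit / [2, 510] figures come from the comment; the 24-bit fixed-point shift is my assumption, chosen so the floor division stays exact for sums up to 510):

```rust
// Hypothetical sketch of the "division by multiplication" experiment: one
// 32-bit fixed-point reciprocal per possible frequency sum, so each per-bit
// division becomes a multiply plus a shift. SHIFT = 24 is an assumed
// precision, sufficient for numerators and denominators up to 510.
const SHIFT: u32 = 24;

fn build_recip() -> Vec<u32> {
    (0u64..=510)
        .map(|d| {
            if d == 0 {
                0 // unused slot, real sums start at 2
            } else {
                // ceil(2^SHIFT / d): rounding up keeps floor division exact
                (((1u64 << SHIFT) + d - 1) / d) as u32
            }
        })
        .collect()
}

fn fast_div(n: u32, d: u32, recip: &[u32]) -> u32 {
    ((n as u64 * recip[d as usize] as u64) >> SHIFT) as u32
}

fn main() {
    let recip = build_recip();
    // exhaustive check over the range the comment says actually occurs
    for d in 2..=510u32 {
        for n in 0..=510u32 {
            assert_eq!(fast_div(n, d, &recip), n / d);
        }
    }
    println!("ok");
}
```

This is the classic trade-off the comment measures: the LUT is tiny and cache-friendly, but the extra multiply and shift on every bit raise the instruction count.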
One other idea was to try to improve the unary read function (since this is one of the main costs), since it only reads 1s until it exits the loop. As soon as we see the next bit is going to be a 1, we can in theory already start doing the work for the next bit if we can get the CPU to interleave the instructions. My main problem has been that the compiler is being too smart and optimizing away my optimizations…
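As an aside on the unary idea: it does not map directly onto the arithmetic coder (every bit still has to go through the adaptive model), but for a plain bitstream the "peek at the coming ones" idea collapses into a single count-leading-ones operation. A sketch of that degenerate case:

```rust
// Illustration only: decoding a unary value (n one-bits followed by a
// terminating zero) from the top of a plain 64-bit buffer. Inverting the
// word turns the run of ones into leading zeros, which `leading_zeros`
// counts in one instruction: no per-bit loop, hence no per-bit branch.
fn read_unary(buf: u64) -> u32 {
    (!buf).leading_zeros()
}

fn main() {
    assert_eq!(read_unary(0b111u64 << 61), 3); // 1110... encodes 3
    assert_eq!(read_unary(0), 0);              // 0...    encodes 0
    println!("ok");
}
```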
Felt inspired for some reason. Who knew that there was another 20% of optimization left in the inner loop, @danielrh?
This change removes 2 branches from the inner loop: