
use rotation to remove branching in inner loop #63

Merged 6 commits into main on Apr 16, 2024

Conversation

@mcroomp (Collaborator) commented Apr 12, 2024

Felt inspired for some reason. Who knew there was another 20% of optimization left in the inner loop, @danielrh?

This change removes 2 branches from the inner loop:

  • instead of having two paths for updating the counter (one for the high byte and one for the low byte), use a bit rotation at the beginning and end (depending on whether the bit was a 1 or a 0) so that the same code handles both cases
  • use a different mask for the special casing, which allows the optimizer to use a conditional move instead of a jump instruction (a sketch of both ideas follows below)
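
A hedged sketch of how these two ideas could look, assuming the two 8-bit counters are packed into a single u16 with the counter being bumped rotated into the high byte; the real code lives in src/structs/branch.rs and may differ in details:

```rust
// Illustrative only: the counter layout, constants and normalization rule
// here are assumptions, not the exact contents of src/structs/branch.rs.
fn record_bit(counts: u16, bit: bool) -> u16 {
    // 1. rotate so the counter for this bit value always sits in the high
    //    byte, letting one add serve both the `true` and `false` paths
    let rotated = if bit { counts } else { counts.rotate_left(8) };

    // 2. bump the high-byte counter; the carry out of the u16 doubles as the
    //    "needs normalization" flag, so no extra mask-and-compare is needed
    let (bumped, overflowed) = rotated.overflowing_add(0x0100);

    // when the counter saturates, halve both counters, keeping them >= 1;
    // a simple select like this can compile to a conditional move, not a jump
    let normalized = ((rotated >> 1) & 0x7F7F) | 0x0101;
    let updated = if overflowed { normalized } else { bumped };

    // undo the rotation so both counters end up back in their original bytes
    if bit { updated } else { updated.rotate_left(8) }
}
```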

@danielrh commented

Nice find! It's amazing how adding a few computations before the branch has to be taken can allow the processor to do things while the PC finds its new home :-)

@mcroomp (Collaborator, Author) commented Apr 12, 2024

> Nice find! It's amazing how adding a few computations before the branch has to be taken can allow the processor to do things while the PC finds its new home :-)

Thanks! Effectively there is now only one branch left in the inner loop, taken when the counters need to be normalized, and branch prediction handles it well since it happens infrequently. Avoiding the 50%-probability jump was the biggest gain.


@Melirius (Collaborator) commented

Really nice! I tried to improve this part by interleaving the counters' bits, which would also improve LUT cache locality, but it was too much work per update.

@mcroomp (Collaborator, Author) commented Apr 12, 2024

> Really nice! I tried to improve this part by interleaving the counters' bits, which would also improve LUT cache locality, but it was too much work per update.

From the profiling, most of the memory latency is coming from the Model tables. One idea would be a much smaller table that contains most of the common coefficients/numzeros/etc., or maybe even a hashtable to use memory more efficiently.

@Melirius (Collaborator) commented Apr 12, 2024

> Really nice! I tried to improve this part by interleaving the counters' bits, which would also improve LUT cache locality, but it was too much work per update.
>
> From the profiling, most of the memory latency is coming from the Model tables. One idea would be a much smaller table that contains most of the common coefficients/numzeros/etc., or maybe even a hashtable to use memory more efficiently.

Hashtable does not help, I tried. But some of the table values in the initial Lepton tables are not used, so the tables can be made more compact:

  • The num_non_zeros_to_bin function has an output range of [0,9] inclusive, so the first dimension of num_non_zeros_counts7x7 can be shrunk accordingly (note also that the maximum predictor for non-zeros is not 49 but 25, so the used range is actually [0,8]).
  • The 0th bin of exponent_counts is not used: nothing is decoded if no 7x7 coefficients are present.
  • residual_noise_counts can be split into two tables, for 7x7 and edge coefficients, with lower dimensions for each, etc.
  • The get_grid function arrays can be compacted by a trick: they access only [0][0], then [1][0-1], then [2][0-3], ... [n][0-(2^n-1)], the second index being the n-bit decoded_so_far value. This can be converted to a one-dimensional array with (n+1)-bit indices: start with decoded_so_far = 1 and then do decoded_so_far <<= 1; decoded_so_far |= cur_bit as usize;. Finally the value is obtained simply by zeroing the leading bit of decoded_so_far (a sketch follows below).
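
A small sketch of that flattened-indexing idea for get_grid; the grid slice, the next_bit closure, and the loop shape are illustrative assumptions rather than the actual lepton code:

```rust
// Row i of the old triangular table [i][0..2^i - 1] maps to the flat slots
// [2^i, 2^(i+1) - 1]: the index is simply decoded_so_far seeded with a
// leading 1 bit, so `grid` needs 2^n entries for n decoded bits.
fn decode_grid_value(grid: &[u16], n: usize, mut next_bit: impl FnMut(u16) -> bool) -> usize {
    let mut decoded_so_far: usize = 1; // leading 1 marks how many bits were consumed
    for _ in 0..n {
        // grid[decoded_so_far] replaces the 2D access grid[row][prefix]
        let cur_bit = next_bit(grid[decoded_so_far]);
        decoded_so_far <<= 1;
        decoded_so_far |= cur_bit as usize;
    }
    // zero the leading marker bit to recover the decoded n-bit value
    decoded_so_far & !(1usize << n)
}
```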

@danielrh left a comment


This is mega clever. I like how you always use the upper half so you can utilize the overflowing add to accomplish the same math for both true and false.

I wonder if a little comment blurb showing a very branch-unoptimized equivalent would help new people understand the algorithm, or if the version in the exhaustive test is sufficient. Maybe a link to the exhaustive test is just as good.
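
For what it's worth, a hypothetical branch-heavy equivalent of the counter update, under the same packed-counter assumptions as the earlier sketch (not the project's actual code), might look like this:

```rust
// Illustrative only: same assumed layout as before (true count in the high
// byte, false count in the low byte), written with explicit branches.
fn record_bit_branchy(counts: u16, bit: bool) -> u16 {
    let mut true_count = (counts >> 8) as u8;
    let mut false_count = (counts & 0xFF) as u8;
    if bit {
        if true_count == 0xFF {
            // normalize: halve both counters, keeping them >= 1
            true_count = (true_count >> 1) | 1;
            false_count = (false_count >> 1) | 1;
        } else {
            true_count += 1;
        }
    } else if false_count == 0xFF {
        true_count = (true_count >> 1) | 1;
        false_count = (false_count >> 1) | 1;
    } else {
        false_count += 1;
    }
    ((true_count as u16) << 8) | false_count as u16
}
```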

@mcroomp (Collaborator, Author) commented Apr 14, 2024

@Melirius For the lookup of the 64K probability table, it would be nice to shrink it so that it fits in 32K in most cases. The usage is a bit uneven, so maybe a minor transform before the lookup (if it is a lightweight bit operation) might help with the cache lines.

The graph below shows the number of times each value in the probability table is looked up (red the most, green the least):

[image: heat map of probability-table lookup counts]

@Melirius (Collaborator) commented Apr 14, 2024

@mcroomp Yes, I've thought about it, but as I said before, interleaving does not help: the counter update becomes cumbersome (2 mask shifts, 3 ORs, 1 AND and 1 increment in the general case, and roughly the same number of ops for the overflow case) and slow. Most of the time the next value will be taken from the same cache line (next in the row) or from the next line of the LUT (next in the column), so I also tried a manual prefetch of that lower line; the results were inconclusive on x64 and really bad on ARM.

The pattern is understandable: after the first overflow, at least one of the counters is never less than 129, so the top-left quadrant is never referenced again. The red patch at the corner comes from rarely accessed branches that do not accumulate many hits. The tantalizing thing is that the differences between adjacent values in those 3/4 of the square are only in [-2,+2] inclusive, but I cannot figure out how to use that for LUT compaction.

@Melirius (Collaborator) commented

The problem here is that these functions are queried for every decoded bit of the file, so every additional operation substantially increases the total number of instructions. For example, I also tried a "division by multiplication" approach, where you only need a 512-element LUT of 32-bit values (actually even fewer: only frequency sums in [2,510] inclusive occur); cache misses then drop from ~1.6% to ~0.5%, but the number of instructions executed increases by ~6% and performance drops by ~3%.
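
For illustration, a minimal sketch of that kind of reciprocal LUT; the table size, the shift amount, and the function names are assumptions about the experiment, not code from the repository:

```rust
// Replace the per-bit division by the frequency sum with a multiply by a
// precomputed fixed-point reciprocal; the sum can only fall in [2, 510]
// when both 8-bit counters stay >= 1.
const RECIP_SHIFT: u32 = 24;

fn build_reciprocal_lut() -> Vec<u32> {
    (0u32..=510)
        .map(|sum| if sum >= 2 { (1u32 << RECIP_SHIFT) / sum } else { 0 })
        .collect()
}

// approximates (false_count << 16) / (false_count + true_count) without a divide
fn prob_false(false_count: u32, true_count: u32, recip: &[u32]) -> u32 {
    let sum = (false_count + true_count) as usize;
    ((false_count as u64 * recip[sum] as u64) >> (RECIP_SHIFT - 16)) as u32
}
```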

@mcroomp (Collaborator, Author) commented Apr 16, 2024 via email

@mcroomp merged commit 983873c into main on Apr 16, 2024
3 checks passed
@mcroomp deleted the rotationoptimization branch on April 16, 2024 at 18:23
@Melirius mentioned this pull request on Apr 19, 2024