Shrink Model #67

Melirius · 2024-04-19T20:01:58Z

The size of Model affects number of cache misses. Some of the Branches are never referenced, then more dense arrays for get_grid/put_grid and num_non_zeros_counts7x7 can be used along the lines of this comment. Performance is not much changed on x64, but memory consumption is lower by ~10 kB per Model and Branches requested in succession are grouped together, that improve cache locality.

…ng unused branches from num_non_zeros_counts7x7

Melirius · 2024-04-22T16:36:02Z

Performance is better though, compilation with +avx2+lzcnt:
HEAD 5ac1daf

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       390 126 953      cache-references                                                        (45,24%)
        28 850 992      cache-misses                     #    7,40% of all cache refs           (45,42%)
     9 172 030 367      cycles                                                                  (45,42%)
       395 459 199      stalled-cycles-backend           #    4,31% backend cycles idle         (45,43%)
        20 297 434      stalled-cycles-frontend          #    0,22% frontend cycles idle        (45,55%)
    23 285 910 196      instructions                     #    2,54  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,70%)
     2 387 202 436      branch-instructions                                                     (45,76%)
        72 726 804      branch-misses                    #    3,05% of all branches             (45,76%)
     4 172 462 359      ic_fetch_stall.ic_stall_any                                             (45,75%)
        23 132 818      l2_cache_misses_from_ic_miss                                            (45,56%)
       761 995 310      l2_latency.l2_cycles_waiting_on_fills                                        (45,35%)
            83 319      faults                                                                
                 1      migrations                                                            

       2,034150060 seconds time elapsed

       1,875813000 seconds user
       0,155984000 seconds sys

d692023

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       409 828 089      cache-references                                                        (45,22%)
        27 165 356      cache-misses                     #    6,63% of all cache refs           (45,22%)
     8 971 453 852      cycles                                                                  (45,22%)
       376 186 450      stalled-cycles-backend           #    4,19% backend cycles idle         (45,50%)
        22 944 563      stalled-cycles-frontend          #    0,26% frontend cycles idle        (45,82%)
    22 665 330 779      instructions                     #    2,53  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,90%)
     2 444 459 853      branch-instructions                                                     (45,90%)
        75 619 279      branch-misses                    #    3,09% of all branches             (45,90%)
     3 927 762 385      ic_fetch_stall.ic_stall_any                                             (45,72%)
        20 967 746      l2_cache_misses_from_ic_miss                                            (45,34%)
       715 590 263      l2_latency.l2_cycles_waiting_on_fills                                        (45,09%)
            83 297      faults                                                                
                 1      migrations                                                            

       1,981634374 seconds time elapsed

       1,816212000 seconds user
       0,163666000 seconds sys

mcroomp · 2024-04-22T19:46:12Z

we could make ModelPerColor::counts a single array of NON_ZERO_7X7_BINS * 49 size and pass a single usize that is incrementally update during the loop, adding 1 each iteration and using a NON_ZERO_TO_BIN_7X7 * 49 to incrementally update the index.

Melirius · 2024-04-22T23:00:31Z

we could make ModelPerColor::counts a single array of NON_ZERO_7X7_BINS * 49 size and pass a single usize that is incrementally update during the loop, adding 1 each iteration and using a NON_ZERO_TO_BIN_7X7 * 49 to incrementally update the index.

Not quite get it. Do you have in mind that we can use an one-dimensional array there and update index by NON_ZERO_TO_BIN_7X7 always and by -1 if coef !=0 and bin is changed?

mcroomp · 2024-04-23T04:57:32Z

we could make ModelPerColor::counts a single array of NON_ZERO_7X7_BINS * 49 size and pass a single usize that is incrementally update during the loop, adding 1 each iteration and using a NON_ZERO_TO_BIN_7X7 * 49 to incrementally update the index.

Not quite get it. Do you have in mind that we can use an one-dimensional array there and update index by NON_ZERO_TO_BIN_7X7 always and by -1 if coef !=0 and bin is changed?

Yeah, so in serialize_token/parse_token we can incrementally update the index into the one-dimensional array and pass a single parameter to read/write_coef (nz_zig49 or something like that). You can either have non_zero_to_bin as the most significant or the zig49. In one case you do the above, I did it the oppose way:

Each iteration increment the array counter always += 1
then add ProbabilityTables::num_non_zeros_to_bin_7x7(num_non_zeros_left_7x7) * 49 (which can be precalculated and stored as a separate variable only updated in the nonzero case.

Don't know which is faster. We might also shrink the table to some degree since we can't have more non-zeros than there are coefficients left to process.

Shrink Model using more dense arrays for get_grid/put_grid and removi…

ba721d6

…ng unused branches from num_non_zeros_counts7x7

Melirius requested review from mcroomp and gbrovman April 19, 2024 20:02

Melirius added 7 commits April 20, 2024 14:46

Remove unnecessary part of NON_ZERO_TO_BIN, formatting

30ff0f9

Remove unused table

4407640

Formatting

62ad628

Shorter Branch arrays, explanations

a5d98b4

2 separate arrays NON_ZERO_TO_BIN

363eef6

Structs for better cache locality

57a17b2

Clear unused constants and exports

d692023

Finishing touches

8dc2d8f

Merge branch 'microsoft:main' into Shorter-Model

243b5d9

mcroomp approved these changes Apr 23, 2024

View reviewed changes

mcroomp merged commit c2aaf2c into microsoft:main Apr 23, 2024
2 checks passed

mcroomp mentioned this pull request May 18, 2024

Add test cases for handling overflows on dequantization and fix size of count7x7 #78

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shrink Model #67

Shrink Model #67

Melirius commented Apr 19, 2024 •

edited

Loading

Melirius commented Apr 22, 2024 •

edited

Loading

mcroomp commented Apr 22, 2024

Melirius commented Apr 22, 2024 •

edited

Loading

mcroomp commented Apr 23, 2024

Shrink Model #67

Shrink Model #67

Conversation

Melirius commented Apr 19, 2024 • edited Loading

Melirius commented Apr 22, 2024 • edited Loading

mcroomp commented Apr 22, 2024

Melirius commented Apr 22, 2024 • edited Loading

mcroomp commented Apr 23, 2024

Melirius commented Apr 19, 2024 •

edited

Loading

Melirius commented Apr 22, 2024 •

edited

Loading

Melirius commented Apr 22, 2024 •

edited

Loading