Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shrink Model #67

Merged
merged 10 commits into from
Apr 23, 2024
Merged

Shrink Model #67

merged 10 commits into from
Apr 23, 2024

Conversation

Melirius
Copy link
Collaborator

@Melirius Melirius commented Apr 19, 2024

The size of Model affects number of cache misses. Some of the Branches are never referenced, then more dense arrays for get_grid/put_grid and num_non_zeros_counts7x7 can be used along the lines of this comment. Performance is not much changed on x64, but memory consumption is lower by ~10 kB per Model and Branches requested in succession are grouped together, that improve cache locality.

…ng unused branches from num_non_zeros_counts7x7
@Melirius Melirius requested review from mcroomp and gbrovman April 19, 2024 20:02
@Melirius
Copy link
Collaborator Author

Melirius commented Apr 22, 2024

Performance is better though, compilation with +avx2+lzcnt:
HEAD 5ac1daf

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       390 126 953      cache-references                                                        (45,24%)
        28 850 992      cache-misses                     #    7,40% of all cache refs           (45,42%)
     9 172 030 367      cycles                                                                  (45,42%)
       395 459 199      stalled-cycles-backend           #    4,31% backend cycles idle         (45,43%)
        20 297 434      stalled-cycles-frontend          #    0,22% frontend cycles idle        (45,55%)
    23 285 910 196      instructions                     #    2,54  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,70%)
     2 387 202 436      branch-instructions                                                     (45,76%)
        72 726 804      branch-misses                    #    3,05% of all branches             (45,76%)
     4 172 462 359      ic_fetch_stall.ic_stall_any                                             (45,75%)
        23 132 818      l2_cache_misses_from_ic_miss                                            (45,56%)
       761 995 310      l2_latency.l2_cycles_waiting_on_fills                                        (45,35%)
            83 319      faults                                                                
                 1      migrations                                                            

       2,034150060 seconds time elapsed

       1,875813000 seconds user
       0,155984000 seconds sys

d692023

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       409 828 089      cache-references                                                        (45,22%)
        27 165 356      cache-misses                     #    6,63% of all cache refs           (45,22%)
     8 971 453 852      cycles                                                                  (45,22%)
       376 186 450      stalled-cycles-backend           #    4,19% backend cycles idle         (45,50%)
        22 944 563      stalled-cycles-frontend          #    0,26% frontend cycles idle        (45,82%)
    22 665 330 779      instructions                     #    2,53  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,90%)
     2 444 459 853      branch-instructions                                                     (45,90%)
        75 619 279      branch-misses                    #    3,09% of all branches             (45,90%)
     3 927 762 385      ic_fetch_stall.ic_stall_any                                             (45,72%)
        20 967 746      l2_cache_misses_from_ic_miss                                            (45,34%)
       715 590 263      l2_latency.l2_cycles_waiting_on_fills                                        (45,09%)
            83 297      faults                                                                
                 1      migrations                                                            

       1,981634374 seconds time elapsed

       1,816212000 seconds user
       0,163666000 seconds sys

@mcroomp
Copy link
Collaborator

mcroomp commented Apr 22, 2024

we could make ModelPerColor::counts a single array of NON_ZERO_7X7_BINS * 49 size and pass a single usize that is incrementally update during the loop, adding 1 each iteration and using a NON_ZERO_TO_BIN_7X7 * 49 to incrementally update the index.

@Melirius
Copy link
Collaborator Author

Melirius commented Apr 22, 2024

we could make ModelPerColor::counts a single array of NON_ZERO_7X7_BINS * 49 size and pass a single usize that is incrementally update during the loop, adding 1 each iteration and using a NON_ZERO_TO_BIN_7X7 * 49 to incrementally update the index.

Not quite get it. Do you have in mind that we can use an one-dimensional array there and update index by NON_ZERO_TO_BIN_7X7 always and by -1 if coef !=0 and bin is changed?

@mcroomp
Copy link
Collaborator

mcroomp commented Apr 23, 2024

we could make ModelPerColor::counts a single array of NON_ZERO_7X7_BINS * 49 size and pass a single usize that is incrementally update during the loop, adding 1 each iteration and using a NON_ZERO_TO_BIN_7X7 * 49 to incrementally update the index.

Not quite get it. Do you have in mind that we can use an one-dimensional array there and update index by NON_ZERO_TO_BIN_7X7 always and by -1 if coef !=0 and bin is changed?

Yeah, so in serialize_token/parse_token we can incrementally update the index into the one-dimensional array and pass a single parameter to read/write_coef (nz_zig49 or something like that). You can either have non_zero_to_bin as the most significant or the zig49. In one case you do the above, I did it the oppose way:

Each iteration increment the array counter always += 1
then add ProbabilityTables::num_non_zeros_to_bin_7x7(num_non_zeros_left_7x7) * 49 (which can be precalculated and stored as a separate variable only updated in the nonzero case.

Don't know which is faster. We might also shrink the table to some degree since we can't have more non-zeros than there are coefficients left to process.

@mcroomp mcroomp merged commit c2aaf2c into microsoft:main Apr 23, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants