Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build the string directly instead of copying and then pruning tabs or newlines #536

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

PragmaTwice
Copy link

@PragmaTwice PragmaTwice commented Oct 13, 2023

Following the comment below, I add a new function called get_ascii_tab_or_newline_removed to build the string directly instead of copying and then pruning tabs or newlines.

ada/src/url.cpp

Lines 500 to 502 in d8f77a1

// Optimization opportunity: Instead of copying and then pruning, we could
// just directly build the string from user_input.
helpers::remove_ascii_tab_or_newline(tmp_buffer);

Note:

  • Due to NRVO in C++ compilers, returning a local variable back to caller should be effective as pass pointers. (neither copy ctor nor move ctor will be called)
  • If we add something like inline or always_inline attribute to a function, it is better to put the function definition to the header file instead of source file. Since all inline optimizitions cannot take effect if the definition cannot be reached inside the current translation unit (except link-time optimization enabled). (the definition may be in another translation unit if we put them in a source file)

@lemire
Copy link
Member

lemire commented Oct 13, 2023

Thanks.

Please consider running with ADA_BENCHMARKS=ON (e.g., cmake -B build -D ADA_BENCHMARKS=ON && cmake --build build) and then execute ./build/benchmarks/benchdata before and after your changes.

We need some evidence that the change is beneficial (meaning that the code runs faster).

@PragmaTwice
Copy link
Author

Please consider running with ADA_BENCHMARKS=ON (e.g., cmake -B build -D ADA_BENCHMARKS=ON && cmake --build build) and then execute ./build/benchmarks/benchdata before and after your changes.

Sure. Here is the benchmark result:

Before:

Loading /home/twice/projects/ada/dependencies/.cache/url-dataset/out.txt
2023-10-14T10:41:21+09:00
Running ./benchmarks/benchdata
Run on (20 X 2918.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.05, 0.64, 0.40
ada spec: Ada follows whatwg/url
bad urls: ---------------------
ada---count of bad URLs       26
whatwg---count of bad URLs    26
-------------------------------

bytes/URL: 86.859205
curl : OMITTED
input bytes: 8688092
number of URLs: 100025
performance counters: No privileged access (sudo may help).
zuri : OMITTED
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href              23837537 ns     23837653 ns           30 speed=364.469M/s time/byte=2.74372ns time/url=238.317ns url/s=4.19609M/s
BasicBench_AdaURL_aggregator_href   15135664 ns     15135570 ns           47 speed=574.018M/s time/byte=1.74211ns time/url=151.318ns url/s=6.6086M/s
BasicBench_whatwg                   41866413 ns     41866288 ns           16 speed=207.52M/s time/byte=4.81881ns time/url=418.558ns url/s=2.38915M/s

After:

Loading /home/twice/projects/ada/dependencies/.cache/url-dataset/out.txt
2023-10-14T10:41:26+09:00
Running ./benchmarks/benchdata-new
Run on (20 X 2918.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.04, 0.65, 0.40
ada spec: Ada follows whatwg/url
bad urls: ---------------------
ada---count of bad URLs       26
whatwg---count of bad URLs    26
-------------------------------

bytes/URL: 86.859205
curl : OMITTED
input bytes: 8688092
number of URLs: 100025
performance counters: No privileged access (sudo may help).
zuri : OMITTED
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href              23819007 ns     23819062 ns           29 speed=364.754M/s time/byte=2.74158ns time/url=238.131ns url/s=4.19937M/s
BasicBench_AdaURL_aggregator_href   14740361 ns     14740285 ns           47 speed=589.411M/s time/byte=1.69661ns time/url=147.366ns url/s=6.78583M/s
BasicBench_whatwg                   40722269 ns     40722100 ns           17 speed=213.351M/s time/byte=4.68712ns time/url=407.119ns url/s=2.45628M/s

The benchmark results are unstable (and only for x86), so I think you can run on your side if you are interested : )

@lemire
Copy link
Member

lemire commented Oct 15, 2023

@anonrig I will assess this soon. I recommend waiting before we include it in a release. (Not because I think it is bad, but because we want to make sure we document the benefits.)

@anonrig
Copy link
Member

anonrig commented Oct 15, 2023

It provides a different output for me. If you run the script with sudo, our benchmarks enable performance counters, which provide much better output.

Before:

BasicBench_AdaURL_aggregator_href   17023800 ns     17022634 ns           41 GHz=3.22958 cycle/byte=6.28781 cycles/url=546.155 instructions/byte=28.6539 instructions/cycle=4.55705 instructions/ns=14.7173 instructions/url=2.48885k ns/url=169.11 speed=510.385M/s time/byte=1.95931ns time/url=170.184ns url/s=5.876M/s

After:

BasicBench_AdaURL_aggregator_href   17118766 ns     17118780 ns           41 GHz=3.22878 cycle/byte=6.3365 cycles/url=550.383 instructions/byte=28.6567 instructions/cycle=4.52248 instructions/ns=14.6021 instructions/url=2.4891k ns/url=170.462 speed=507.518M/s time/byte=1.97037ns time/url=171.145ns url/s=5.843M/s```

@lemire
Copy link
Member

lemire commented Oct 15, 2023

I am also getting so-so results with GCC 12 and a recent Intel compiler.

Before:

BasicBench_AdaURL_aggregator_href   22788603 ns     22752369 ns           31 GHz=3.18841 cycle/byte=8.375 cycles/url=727.446 instructions/byte=25.1422 instructions/cycle=3.00205 instructions/ns=9.57178 instructions/url=2.18383k ns/url=228.153 speed=381.854M/s time/byte=2.6188ns time/url=227.467ns url/s=4.39625M/s

After:

BasicBench_AdaURL_aggregator_href   23516792 ns     23485895 ns           30 GHz=3.18849 cycle/byte=8.83776 cycles/url=767.641 instructions/byte=25.1306 instructions/cycle=2.84355 instructions/ns=9.06662 instructions/url=2.18282k ns/url=240.754 speed=369.928M/s time/byte=2.70323ns time/url=234.8ns url/s=4.25894M/s

As you can see, you save about 0.01 instructions per byte (which is zero) and the performance is down. So this PR does not seem to improve the performance.

@darowny
Copy link

darowny commented Apr 30, 2024

When I was trying to improve can_parse I tried doing optimizations like here, especially allocation of strings and stuff, got same results, either worse performance or almost no difference

I think the biggest performance improvement would be achieved by changing stuff on line 499 tmp_buffer=input
and I think the comment relates to that piece of code, as it copies chars, so what has to be changed IMO is tmp_buffer creation/initialization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants