Improve convert_hex_to_binary x86_64 codegen. #473
Conversation
This results in shorter and branchless x86_64 assembly: https://godbolt.org/z/5jqTbYWWh

Original:

```
convert_hex_to_binary(char):            # @convert_hex_to_binary(char)
        mov     eax, edi
        cmp     al, 57
        jg      .LBB0_2
        add     eax, -48
        ret
.LBB0_2:
        xor     ecx, ecx
        cmp     al, 97
        setb    cl
        shl     ecx, 5
        add     eax, ecx
        add     eax, -87
        ret
```

New version:

```
convert_hex_to_binary(char):            # @convert_hex_to_binary(char)
        xor     ecx, ecx
        cmp     dil, 97
        setl    cl
        shl     ecx, 5
        add     ecx, -87
        cmp     dil, 58
        mov     eax, -48
        cmovge  eax, ecx
        add     eax, edi
        ret
```

It also results in one fewer instruction on ARM64 with GCC (https://godbolt.org/z/x9oYoG93h): specifically, the original version emits an extra `add w0, w0, 10`.
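Reading the assembly back into C, the logic being compiled is roughly the following (a reconstruction from the codegen above, not the literal source diff; the `_original`/`_new` names are illustrative):

```
// Reconstructed from the assembly above -- illustrative, not the exact diff.
// Original shape: an early return for '0'-'9' forces a conditional jump.
unsigned convert_hex_to_binary_original(const char c) {
  if (c <= '9') return c - '0';
  return c - (c < 'a' ? 'A' - 10 : 'a' - 10);
}

// New shape: a single expression lets the compiler select the offset with
// setl/cmovge instead of branching.
unsigned convert_hex_to_binary_new(const char c) {
  return c - (c <= '9' ? '0' : (c < 'a' ? 'A' - 10 : 'a' - 10));
}
```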
When convert_hex_to_binary is called, we know that the character is a hexadecimal digit, so it is a nice function to optimize. Your new code should sustain a throughput of one routine per slightly more than 2 cycles. Can we do better? What about...

```
unsigned convert_hex_to_binary(const char c) {
  static const unsigned char table[] = {
      0,  1,  2,  3,  4,  5,  6,  7,  8,  9,   // '0'..'9'
      0,  0,  0,  0,  0,  0,  0,               // ':'..'@'
      10, 11, 12, 13, 14, 15,                  // 'A'..'F'
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,                   // 'G'..'`'
      10, 11, 12, 13, 14, 15};                 // 'a'..'f'
  return table[c - '0'];
}
```

This should compile to something like...

```
movsxd  rax, edi
lea     rcx, [rip + convert_hex_to_binary(char)::table]
movzx   eax, byte ptr [rax + rcx - 48]
ret
```

It is a fraction of the µops and should support a throughput of 1 per cycle, so it is unbeatable from that point of view. The table fits in a cache line, so it is going to be efficient.
Note that we still have, as homework, to further optimize this code via... https://github.com/ada-url/ada/pull/459/files We don't yet have good benchmarks for this (at least none that are in place).
Thanks for the suggestion, @lemire! I've also considered your approach, but I'm not sure that trading a few CPU instructions for a memory lookup is worth it. That's also a reason why compilers don't always prefer lookup tables when implementing switches. Depending on the access pattern, this table may end up in cache, which would make its cost very low, but if the cache hit rate is not good, it may end up being slower.
The function you propose compiles to about 30 bytes (x64). The function I propose compiles to about 16 bytes, plus a table of 55 bytes... so about 70 bytes in total. Not counting alignment and function headers, that is a difference of about 40 bytes. Of course, it is a short function, so it is likely to get inlined. The table will not be duplicated, but the code itself might be. So it is unclear to me which is going to generate a smaller binary. Let us try it out. I use GCC 11. Current main branch:
This PR:
PR with a table:
So the function with a table actually saves 8 bytes. Does it matter? Well, it is 0.0016%, and current processors have megabytes of cache. Now, we can of course measure this empirically by running benchmarks. We don't yet have good benchmarks for this problem, but I hope to have some later today, so we will be able to measure the difference objectively.
I wasn't making any claims about binary size. What I'm saying is that a lookup table requires an extra memory load, and in my experience memory is more often a bottleneck than the CPU, so in practice I usually observe a runtime reduction when a few extra CPU instructions can replace a memory access. Obviously, only a benchmark using a representative workload can answer this, but in any case this PR seems like an improvement over the original implementation, whereas the change to a lookup table, while it may be even better, is much harder to compare in terms of performance.
I think we need a benchmark before we can consider such a PR. I have the following for consideration: #477
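For readers who want a starting point, a micro-benchmark for such a routine could look roughly like the sketch below (a hypothetical harness using Google Benchmark, not the actual code in #477; substitute whichever variant is being measured):

```
#include <benchmark/benchmark.h>
#include <cstddef>

// Hypothetical stand-in for the routine under test (branchy, cmov-friendly,
// or table-based).
static unsigned convert_hex_to_binary(const char c) {
  return c - (c <= '9' ? '0' : (c < 'a' ? 'A' - 10 : 'a' - 10));
}

static void BM_convert_hex(benchmark::State& state) {
  static const char digits[] = "0123456789abcdefABCDEF0123456789";
  for (auto _ : state) {
    unsigned sum = 0;
    // Touch every digit so the conversion dominates the loop body.
    for (std::size_t i = 0; i + 1 < sizeof(digits); i++) {
      sum += convert_hex_to_binary(digits[i]);
    }
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(BM_convert_hex);
BENCHMARK_MAIN();
```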
Now that we have a benchmark, we can run it on the three different options (current code, this PR, and the table-based approach). Setup: GCC 11, Ice Lake processor. Main:
This PR:
Table approach:
The PR is about 1% faster than the main branch, but the table-based approach is 13% faster than the PR.
If you want to avoid both branches and a lookup table, this might be shorter:
x86-64 codegen:
ARM64 codegen:
Gah, scrap that, found a shorter version:
x86-64:
ARM64:
@zingaburga Very nice. Very slightly slower than a table in my tests, but seemingly faster than this PR.
Thanks! On x86, it's +2 instructions to avoid a lookup, but on ARM64 it should be roughly the same. Actually, it should be possible to eliminate the sign extension, but I can't seem to get the compiler to comply. This seems to work (unless I misunderstand something), but I don't know how to get the compiler to produce it naturally. It would reduce the function to 4 instructions, or effectively 3 with move elimination, which should put it almost on par with the lookup table (but without a lookup).
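For reference, a well-known branchless formulation in this spirit (a sketch, not necessarily the exact code posted in this thread) exploits the fact that, in ASCII, bit 6 is clear for the digits and set for the letters:

```
// Sketch of a branchless hex-digit decode, assuming the input is a valid
// hexadecimal digit. '0'-'9' are 0x30-0x39 (bit 6 clear), while 'A'-'F'
// (0x41-0x46) and 'a'-'f' (0x61-0x66) have bit 6 set; the low nibble of a
// letter is 1-6, so letters need a +9 correction.
unsigned convert_hex_to_binary(const char c) {
  return (c & 0xF) + 9 * ((c >> 6) & 1);
}
```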
Let us try it out experimentally. LLVM 14, Apple M2 (aarch64). Benchmark: bench_search_params. You can run our benchmarks like so:
Version with a table:
The ANDNOT version:
Analysis: On the M2 with LLVM 14, the performance is indistinguishable, and so is the binary size of the produced library. However, the version with a table executes measurably fewer instructions (54.0 instructions per byte vs. 54.5 instructions per byte). You end up with the same speed because the ANDNOT version retires very slightly more instructions per cycle. Practically speaking, the difference is probably insignificant, meaning that one can go with your approach or the table, and it makes no difference, at least on Apple Silicon.
Thanks for the benchmark - interesting result! My statement was just based on the output of the compiler, not on any benchmark. Putting in your code with Clang 14 gives me 4 instructions, the same number as my version, and all the instructions are reasonably fast. In terms of binary size, one would expect the tableless variant to be slightly smaller, since it has no table. I haven't looked at the benchmark code, but could it be inlining a bunch of stuff?