A friend of mine brought to my attention an interesting article on habrahabr.ru - a Russian translation of Daniel Lemire’s Pruning Spaces from Strings Quickly on ARM Processors. That article intrigued me for two reasons: first, somebody had actually made an effort to seek an optimal solution of a common problem on a non-x86 architecture (yay!), and second, the results the author presented at the end of his article puzzled me: nearly 6x per-clock advantage for Intel? That was emphasized by the author’s conclusion that perhaps ARM couldn’t possibly be nearly as clock-efficient as the ‘big iron’ Intel on this simple task.
Well, challenge accepted!
The author had started with a baseline - a serial implementation, so I too decided to start from there and move up. Let’s call this baseline testee00
and get familiar with it before we move on:
inline size_t testee00() {
size_t i = 0, pos = 0;
while (i < 16) {
const char c = input[i++];
output[pos] = c;
pos += (c > 32 ? 1 : 0);
}
return pos;
}
Routine processes a batch of 16 charactes per invocation and returns the count of non-blanks from that batch. Apparently the output string is missing its nil termination, but that is a trivial operation which does not change the performance characteristics of the code. This holds true for the rest of the routines in this survey.
I ran testee00
on a bunch of amd64 CPUs and a few arm64 CPUs, using different GCC and Clang compiler versions, always taking the best compiler result. Here are the clocks/character results, computed from perf -e cycles
divided by the number of processed chars (in our case - 5 * 10^7 * 16), and truncated to the 4th digit after the decimal point:
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
Intel Xeon E5-2687W (Sandy Bridge) | g++-4.8 -Ofast | 1.6363 |
Intel Xeon E3-1270v2 (Ivy Bridge) | g++-5.1 -Ofast | 1.6186 |
Intel i7-5820K (Haswell) | g++-4.8 -Ofast | 1.5223 |
AMD Ryzen 7 1700 (Zen) | g++-5.4 -Ofast | 1.4113 |
Marvell 8040 (Cortex-A72) | g++-5.4 -Ofast | 1.3805 |
Table 1. Performance of testee00
on desktop-level cores
Interesting, isn’t it - ARM's 'big' A72 core (3-decode, 8-issue OoO) actually does better clocks/char than the wider desktop chips (you can see the actual perf stat sessions at the end of this writing). It's as if the ISA of the ARM (A64) is better suited for the task (see the instruction counts in the perf logs below) and the IPC of the A72, while lower, is still sufficiently high to out-perform the Intels per clock.
So, let’s move on to SIMD. Now, I don’t claim to be a seasoned NEON coder, but I get my hands dirty with ARM SIMD occasionally. I will not inline the SIMD routines here since that would choke the reader; instead, the entire testbed and participating test routines can be found in the supplied code.
Daniel’s original SSSE3 pruning routine uses a look-up table to do the input vector sampling. But I used another algorithm for the test. The reason? I cannot easily take 2^16 * 2^4 = 1MB look-up tables in my code - that would be a major cache thrasher for any scenarios where we don’t just prune ascii streams, but call the routine amids other work. The LUT-less SSSE3 version comes at the price of a tad more computations, but runs entirely off registers, and as you’ll see, the price for dropping the table is not prohibitive even on sheer pruning workloads. Moreover, both the new SSSE3 version and the NEON (ASIMD2) version use the same algorithm now, so the comparison is as direct as physically possible. And lastly, all tests run entirely off L1 cache.
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
Intel Xeon E5-2687W (Sandy Bridge) | clang++-3.9 -Ofast -mssse3 -mpopcnt | .9268 |
Intel Xeon E3-1270v2 (Ivy Bridge) | clang++-3.7 -Ofast -mssse3 -mpopcnt | .8223 |
Intel i7-5820K (Haswell) | clang++-3.9 -Ofast -mavx2 | .8232 1 |
AMD Ryzen 7 1700 (Zen) | clang++-4.0 -Ofast -mssse3 -mpopcnt | .6671 2 |
Marvell 8040 (Cortex-A72) | clang++-3.6 -Ofast -mcpu=cortex-a57 | 1.4603 3 |
Table 2. Performance of testee04
on desktop-level cores
As you see, the per-clock efficiency advantage is approx. 2x for the desktop amd64 cores - cores that at the same (or similar) fabnode would be 4x the area of the A72. That said, the employed SIMD algorithm does not perform well on A72; actually, A72's SIMD does not scale at all, let alone nearly as good as Intel's or AMD's, with this algorithm. This appears to be due to an uarch issue with A72 - its SIMD exhibits high latencies for the permutation ops employed by our algorithm. As a result, the scalar version performs better than the 32-wide SIMD version! As we will see below, that is not the case with other ARMv8 uarchs, though.
Same test on entry-level arm64 and amd64 CPUs:
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
AMD C60 (Bobcat) | g++-4.8 -Ofast | 3.5733 |
MediaTek MT8163 (Cortex-A53) | clang++-3.6 -Ofast -mcpu=cortex-a53 | 2.6568 |
Table 3. Performance of testee00
on entry-level cores
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
AMD C60 (Bobcat) | clang++-3.7 -Ofast -mssse3 -mpopcnt | 4.3284 4 |
MediaTek MT8163 (Cortex-A53) | clang++-3.8 -Ofast -mcpu=cortex-a53 | 2.0850 5 |
Table 4. Performance of testee04
on entry-level cores
Going wider, from 16-batch to 32-batch:
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
AMD C60 (Bobcat) | TODO | |
MediaTek MT8163 (Cortex-A53) | clang++-3.8 -Ofast -mcpu=cortex-a53 | 1.4559 |
Table 5. Performance of testee07
on entry-level cores
Surprisingly enough, the per-clock efficiency of A72 on this test is nearly identical to that of A53. But don't let this fool you into thinking that the integer SIMD pipelines of the two uarchs are identical - that is not the case at all, and the close performance in this scenario is a mere happenensance. This is demonstrated by the fact testee07
has divergent tuning for A72/A57 compared to A53. Perhaps we can compare the A72 to another 'big' core by an architecture licensee?
Same test on Apple's custom cores:
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
Apple A7 (Cyclone) | apple clang++-8.1 -Ofast | 1.0507 |
Apple A8 (Typhoon) | apple clang++-8.1 -Ofast | 1.1102 |
Apple A9 (Twister) | apple clang++-8.1 -Ofast | 1.0485 |
Apple M1 (Firestorm) | apple clang++-12.0 -Ofast | .8800 |
Table 6. Performance of testee00
on Apple ARMv8
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
Apple A7 (Cyclone) | apple clang++-8.1 -Ofast | .9130 |
Apple A8 (Typhoon) | apple clang++-8.1 -Ofast | .9153 |
Apple A9 (Twister) | apple clang++-8.1 -Ofast | .8598 |
Apple M1 (Firestorm) | apple clang++-12.0 -Ofast | .5600 |
Table 7. Performance of testee06
on Apple ARMv8
CPU | Compiler & codegen flags | clocks/character |
---|---|---|
Apple A7 (Cyclone) | apple clang++-8.1 -Ofast | .6868 |
Apple A8 (Typhoon) | apple clang++-8.1 -Ofast | .7206 |
Apple A9 (Twister) | apple clang++-8.1 -Ofast | .6552 |
Apple M1 (Firestorm) | apple clang++-12.0 -Ofast | .4000 |
Table 8. Performance of testee07
on Apple ARMv8
Clearly Apple's uarchs behave quite differently to ARM's A72 - their behaviour in this test is much more in-line with the desktop chips and the 'little' A53, leaving the A72 as the outlier at this SIMD algorithm. All thanks to "outstanding" permute latencies.
Xeon E5-2687W @ 3.10GHz
Scalar version
$ g++-4.8 prune.cpp -Ofast
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
Performance counter stats for './a.out':
421.886991 task-clock (msec) # 0.998 CPUs utilized
1,309,087,898 cycles # 3.103 GHz
4,603,132,268 instructions # 3.52 insns per cycle
0.422602570 seconds time elapsed
$ echo "scale=4; 1309087898 / (5 * 10^7 * 16)" | bc
1.6363
SSSE3 version, 16-batch
$ clang++-3.9 -Ofast -mssse3 -mpopcnt prune.cpp -DTESTEE=4
$ perf stat -e task-clock,cycles,instructions -- ./a.out
0123456789abc
Performance counter stats for './a.out':
238.949065 task-clock (msec) # 0.998 CPUs utilized
741,459,257 cycles # 3.103 GHz
2,102,528,489 instructions # 2.84 insns per cycle
0.239405193 seconds time elapsed
$ echo "scale=4; 741459257 / (5 * 10^7 * 16)" | bc
.9268
Xeon E3-1270v2 @ 1.60GHz
Scalar version
$ g++-5 -Ofast prune.cpp
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
Performance counter stats for './a.out':
810.515709 task-clock (msec) # 0.999 CPUs utilized
1,294,903,960 cycles # 1.598 GHz
4,601,118,631 instructions # 3.55 insns per cycle
0.811646618 seconds time elapsed
$ echo "scale=4; 1294903960 / (5 * 10^7 * 16)" | bc
1.6186
SSSE3 version, 16-batch
$ clang++-3.7 -Ofast prune.cpp -mssse3 -mpopcnt -DTESTEE=4
$ perf stat -e task-clock,cycles,instructions -- ./a.out
0123456789abc
Performance counter stats for './a.out':
411.788221 task-clock (msec) # 0.998 CPUs utilized
657,871,749 cycles # 1.598 GHz
2,102,394,157 instructions # 3.20 insns per cycle
0.412536582 seconds time elapsed
$ echo "scale=4; 657871749 / (5 * 10^7 * 16)" | bc
.8223
Intel i7-5820K
Scalar version
$ g++-4.8 -Ofast prune.cpp
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
Performance counter stats for './a.out':
339.202545 task-clock (msec) # 0.997 CPUs utilized
1,204,872,493 cycles # 3.552 GHz
4,602,943,398 instructions # 3.82 insn per cycle
0.340089829 seconds time elapsed
$ echo "scale=4; 1204872493 / (5 * 10^7 * 16)" | bc
1.5060
AVX2-128 version, 16-batch
$ clang++-3.9 -Ofast prune.cpp -mavx2 -DTESTEE=4
$ perf stat -e task-clock,cycles,instructions -- ./a.out
0123456789abc
Performance counter stats for './a.out':
185.266773 task-clock (msec) # 0.998 CPUs utilized
658,574,857 cycles # 3.555 GHz
1,652,476,471 instructions # 2.51 insn per cycle
0.185700333 seconds time elapsed
$ echo "scale=4; 658574857 / (5 * 10^7 * 16)" | bc
.8232
AMD Ryzen 7 1700
Scalar version
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
Performance counter stats for './a.out':
356,169901 task-clock:u (msec) # 0,999 CPUs utilized
1129098820 cycles:u # 3,170 GHz
4602126161 instructions:u # 4,08 insn per cycle
0,356353748 seconds time elapsed
$ echo "scale=4; 1129098820 / (5 * 10^7 * 16)" | bc
1.4113
SSSE3 version, 16-batch
$ clang++-4 -Ofast prune.cpp -mssse3 -mpopcnt -DTESTEE=4
$ perf stat -e task-clock,cycles,instructions -- ./a.out
0123456789abc
Performance counter stats for './a.out':
168,212672 task-clock:u (msec) # 0,999 CPUs utilized
533680965 cycles:u # 3,173 GHz
2102126854 instructions:u # 3,94 insn per cycle
0,168401579 seconds time elapsed
$ echo "scale=4; 533680965 / (5 * 10^7 * 16)" | bc
.6671
Marvell ARMADA 8040 (Cortex-A72) @ 1.30GHz
Scalar version
$ g++-5 prune.cpp -Ofast
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
Performance counter stats for './a.out':
849.549040 task-clock (msec) # 0.999 CPUs utilized
1,104,405,671 cycles # 1.300 GHz
3,251,212,918 instructions # 2.94 insns per cycle
0.850107930 seconds time elapsed
$ echo "scale=4; 1104405671 / (5 * 10^7 * 16)" | bc
1.3805
ASIMD2 version, 32-batch
$ clang++-3.6 prune.cpp -Ofast -mcpu=cortex-a57 -DTESTEE=7
$ perf stat -e task-clock,cycles,instructions -- ./a.out
0123456789abcdef123456789abc
Performance counter stats for './a.out':
1797.414520 task-clock (msec) # 1.000 CPUs utilized
2,336,626,281 cycles # 1.300 GHz
2,954,747,281 instructions # 1.26 insns per cycle
1.798039048 seconds time elapsed
$ echo "scale=4; 2336626281 / (5 * 10^7 * 32)" | bc
1.4603
AMD C60 (Bobcat) @ 1.33GHz
Scalar version
$ g++-4.8 prune.cpp -Ofast
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
Performance counter stats for './a.out':
2151.161881 task-clock (msec) # 0.998 CPUs utilized
2,858,689,394 cycles # 1.329 GHz
4,602,837,331 instructions # 1.61 insns per cycle
2.154862906 seconds time elapsed
$ echo "scale=4; 2858689394 / (5 * 10^7 * 16)" | bc
3.5733
SSSE3 version, 16-batch
$ clang++-3.7 -Ofast -mssse3 -mpopcnt prune.cpp -DTESTEE=4
$ perf stat -e task-clock,cycles,instructions -- ./a.out
0123456789abc
Performance counter stats for './a.out':
2608.381080 task-clock (msec) # 0.998 CPUs utilized
3,462,731,680 cycles # 1.328 GHz
2,104,783,519 instructions # 0.61 insns per cycle
2.612564016 seconds time elapsed
$ echo "scale=4; 3462731680 / (5 * 10^7 * 16)" | bc
4.3284
MediaTek MT8163 (Cortex-A53) @ 1.50GHz (sans perf)
Scalar version
$ clang++-3.6 -Ofast prune.cpp -mcpu=cortex-a53
$ time ./a.out
alabalanica1234
real 0m1.417s
user 0m1.410s
sys 0m0.000s
$ echo "scale=4; 1.417 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
2.6568
ASIMD2 version, 16-batch
$ clang++-3.8 -Ofast prune.cpp -mcpu=cortex-a53 -DTESTEE=6
$ time ./a.out
0123456789abc
real 0m1.112s
user 0m1.100s
sys 0m0.000s
$ echo "scale=4; 1.112 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
2.0850
ASIMD2 version, 32-batch
$ clang++-3.8 -Ofast prune.cpp -mcpu=cortex-a53 -DTESTEE=7 -DSAME_LATENCY_Q_AND_D
$ time ./a.out
0123456789abcdef123456789abc
real 0m1.553s
user 0m1.540s
sys 0m0.000s
$ echo "scale=4; 1.553 * 1.5 * 10^9 / (5 * 10^7 * 32)" | bc
1.4559
Apple A7 (Cyclone) @ 1.30GHz (sans perf)
Scalar version
$ echo "scale=4; 0.646607 * 1.3 * 10^9 / (5 * 10^7 * 16)" | bc
1.0507
ASIMD2 version, 16-batch
$ echo "scale=4; 0.561868 * 1.3 * 10^9 / (5 * 10^7 * 16)" | bc
.9130
ASIMD2 version, 32-batch
$ echo "scale=4; 0.845359 * 1.3 * 10^9 / (5 * 10^7 * 32)" | bc
.6868
Apple A8 (Typhoon) @ 1.50GHz (sans perf)
Scalar version
$ echo "scale=4; 0.592141 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
1.1102
ASIMD2 version, 16-batch
$ echo "scale=4; 0.488176 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
.9153
ASIMD2 version, 32-batch
$ echo "scale=4; 0.768718 * 1.5 * 10^9 / (5 * 10^7 * 32)" | bc
.7206
Apple A9 (Twister) @ 1.85GHz (sans perf)
Scalar version
$ echo "scale=4; 0.453420 * 1.85 * 10^9 / (5 * 10^7 * 16)" | bc
1.0485
ASIMD2 version, 16-batch
$ echo "scale=4; 0.371842 * 1.85 * 10^9 / (5 * 10^7 * 16)" | bc
.8598
ASIMD2 version, 32-batch
$ echo "scale=4; 0.566688 * 1.85 * 10^9 / (5 * 10^7 * 32)" | bc
.6552
Apple M1 (Firestorm) @ 3.2GHz (sans perf)
Scalar version
$ clang++ -Ofast prune.cpp -mcpu=native
$ /usr/bin/time -p ./a.out
0123456789abc
real 0.22
user 0.22
sys 0.00
$ echo "scale=4; 0.22 * 3.2 * 10^9 / (5 * 10^7 * 16)" | bc
.8800
ASIMD2 version, 16-batch
$ clang++ -Ofast prune.cpp -mcpu=native -DTESTEE=6 -DSAME_LATENCY_Q_AND_D
$ /usr/bin/time -p ./a.out
0123456789abc
real 0.14
user 0.14
sys 0.00
$ echo "scale=4; 0.14 * 3.2 * 10^9 / (5 * 10^7 * 16)" | bc
.5600
ASIMD2 version, 32-batch
$ clang++ -Ofast prune.cpp -mcpu=native -DTESTEE=7 -DSAME_LATENCY_Q_AND_D
$ /usr/bin/time -p ./a.out
0123456789abcdef123456789abc
real 0.20
user 0.20
sys 0.00
$ echo "scale=4; 0.20 * 3.2 * 10^9 / (5 * 10^7 * 32)" | bc
.4000
Footnotes
-
AVX2-128 used for Haswell, as the same intrinsics work with the newer target, while producing better results than SSSE3. ↩
-
SSSE3 yields significantly better IPC than AVX2 on Ryzen, so the overall performance of SSSE3 code is better despite the increased instruction count. ↩
-
Version used here is
testee07
- a twice-wider (32-batch) version oftestee04
; that still proves insufficient to fully counter the pipeline bubbles brought about by A72's ASIMD latencies. ↩ -
Bobcat (btver1) experiences Death by popcnt™ here; Jaguar (btver2) does not suffer from that, but part is hard to get ahold of. ↩
-
Version used here is
testee06
- the arm64 equivalent oftestee04
. ↩