Highway is a C++ library that provides portable SIMD/vector intrinsics.
We are passionate about high-performance software. We see major untapped potential in CPUs (servers, mobile, desktops). Highway is for engineers who want to reliably and economically push the boundaries of what is possible in software.
CPUs provide SIMD/vector instructions that apply the same operation to multiple data items. This can reduce energy usage e.g. fivefold because fewer instructions are executed. We also often see 5-10x speedups.
Highway makes SIMD/vector programming practical and workable according to these guiding principles:
Does what you expect: Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. The resulting code is more predictable and robust to code changes/compiler updates than autovectorization.
Works on widely-used platforms: Highway supports four architectures; the same application code can target eight instruction sets, including those with 'scalable' vectors (size unknown at compile time). Highway only requires C++11 and supports four families of compilers. If you would like to use Highway on other platforms, please raise an issue.
Flexible to deploy: Applications using Highway can run on heterogeneous
clouds or client devices, choosing the best available instruction set at
runtime. Alternatively, developers may choose to target a single instruction set
without any runtime overhead. In both cases, the application code is the same
except for swapping HWY_STATIC_DISPATCH
with HWY_DYNAMIC_DISPATCH
plus one
line of code.
Suitable for a variety of domains: Highway provides an extensive set of operations, used for image processing (floating-point), compression, video analysis, linear algebra, cryptography, sorting and random generation. We recognise that new use-cases may require additional ops and are happy to add them where it makes sense (e.g. no performance cliffs on some architectures). If you would like to discuss, please file an issue.
Rewards data-parallel design: Highway provides tools such as Gather, MaskedLoad, and FixedTag to enable speedups for legacy data structures. However, the biggest gains are unlocked by designing algorithms and data structures for scalable vectors. Helpful techniques include batching, structure-of-array layouts, and aligned/padded allocations.
Online demos using Compiler Explorer:
Projects using Highway: (to add yours, feel free to raise an issue or contact us via the below email)
- iresearch database index
- JPEG XL image codec
- Grok JPEG 2000 image codec
- vectorized Quicksort (paper)
Supported targets: scalar, S-SSE3, SSE4, AVX2, AVX-512, AVX3_DL (~Icelake,
requires opt-in by defining HWY_WANT_AVX3_DL
), NEON (ARMv7 and v8), SVE, SVE2,
WASM SIMD, RISC-V V.
HWY_WASM_EMU256
is a 2x unrolled version of wasm128 and is enabled if
HWY_WANT_WASM2
is defined. This will remain supported until it is potentially
superseded by a future version of WASM.
SVE was initially tested using farm_sve (see acknowledgments).
Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH), incrementing MINOR after backward-compatible additions and PATCH after backward-compatible fixes. We recommend using releases (rather than the Git tip) because they are tested more extensively, see below.
The current version 1.0 signals an increased focus on backwards compatibility. Applications using documented functionality will remain compatible with future updates that have the same major version number.
Continuous integration tests build with a recent version of Clang (running on native x86, or QEMU for RVV and ARM) and MSVC 2019 (v19.28, running on native x86).
Before releases, we also test on x86 with Clang and GCC, and ARMv7/8 via GCC cross-compile. See the testing process for details.
The contrib
directory contains SIMD-related utilities: an image class with
aligned rows, a math library (16 functions already implemented, mostly
trigonometry), and functions for computing dot products and sorting.
If you only require x86 support, you may also use Agner Fog's VCL vector class library. It includes many functions including a complete math library.
If you have existing code using x86/NEON intrinsics, you may be interested in SIMDe, which emulates those intrinsics using other platforms' intrinsics or autovectorization.
This project uses CMake to generate and build. In a Debian-based system you can install it via:
sudo apt install cmake
Highway's unit tests use googletest.
By default, Highway's CMake downloads this dependency at configuration time.
You can disable this by setting the HWY_SYSTEM_GTEST
CMake variable to ON and
installing gtest separately:
sudo apt install libgtest-dev
Running cross-compiled tests requires support from the OS, which on Debian is
provided by the qemu-user-binfmt
package.
To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS), the standard CMake workflow can be used:
mkdir -p build && cd build
cmake ..
make -j && make test
Or you can run run_tests.sh
(run_tests.bat
on Windows).
Bazel is also supported for building, but it is not as widely used/tested.
When building for Arm v7, a limitation of current compilers requires you to add
-DHWY_CMAKE_ARM7:BOOL=ON
to the CMake command line; see #834 and #1032. We
understand that work is underway to remove this limitation.
You can use the benchmark
inside examples/ as a starting point.
A quick-reference page briefly lists all operations and their parameters, and the instruction_matrix indicates the number of instructions per operation.
The FAQ answers questions about portability, API design and where to find more information.
We recommend using full SIMD vectors whenever possible for maximum performance
portability. To obtain them, pass a ScalableTag<float>
(or equivalently
HWY_FULL(float)
) tag to functions such as Zero/Set/Load
. There are two
alternatives for use-cases requiring an upper bound on the lanes:
-
For up to
N
lanes, specifyCappedTag<T, N>
or the equivalentHWY_CAPPED(T, N)
. The actual number of lanes will beN
rounded down to the nearest power of two, such as 4 ifN
is 5, or 8 ifN
is 8. This is useful for data structures such as a narrow matrix. A loop is still required because vectors may actually have fewer thanN
lanes. -
For exactly a power of two
N
lanes, specifyFixedTag<T, N>
. The largest supportedN
depends on the target, but is guaranteed to be at least16/sizeof(T)
.
Due to ADL restrictions, user code calling Highway ops must either:
- Reside inside
namespace hwy { namespace HWY_NAMESPACE {
; or - prefix each op with an alias such as
namespace hn = hwy::HWY_NAMESPACE; hn::Add()
; or - add using-declarations for each op used:
using hwy::HWY_NAMESPACE::Add;
.
Additionally, each function that calls Highway ops (such as Load
) must either
be prefixed with HWY_ATTR
, OR reside between HWY_BEFORE_NAMESPACE()
and
HWY_AFTER_NAMESPACE()
. Lambda functions currently require HWY_ATTR
before
their opening brace.
The entry points into code using Highway differ slightly depending on whether they use static or dynamic dispatch.
-
For static dispatch,
HWY_TARGET
will be the best available target amongHWY_BASELINE_TARGETS
, i.e. those allowed for use by the compiler (see quick-reference). Functions insideHWY_NAMESPACE
can be called usingHWY_STATIC_DISPATCH(func)(args)
within the same module they are defined in. You can call the function from other modules by wrapping it in a regular function and declaring the regular function in a header. -
For dynamic dispatch, a table of function pointers is generated via the
HWY_EXPORT
macro that is used byHWY_DYNAMIC_DISPATCH(func)(args)
to call the best function pointer for the current CPU's supported targets. A module is automatically compiled for each target inHWY_TARGETS
(see quick-reference) ifHWY_TARGET_INCLUDE
is defined andforeach_target.h
is included.
When using dynamic dispatch, foreach_target.h
is included from translation
units (.cc files), not headers. Headers containing vector code shared between
several translation units require a special include guard, for example the
following taken from examples/skeleton-inl.h
:
#if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)
#ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#else
#define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#endif
#include "hwy/highway.h"
// Your vector code
#endif
By convention, we name such headers -inl.h
because their contents (often
function templates) are usually inlined.
Applications should be compiled with optimizations enabled - without inlining,
SIMD code may slow down by factors of 10 to 100. For clang and GCC, -O2
is
generally sufficient.
For MSVC, we recommend compiling with /Gv
to allow non-inlined functions to
pass vector arguments in registers. If intending to use the AVX2 target together
with half-width vectors (e.g. for PromoteTo
), it is also important to compile
with /arch:AVX2
. This seems to be the only way to generate VEX-encoded SSE4
instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
resulting binary will then require AVX2. Note that no such flag is needed for
clang and GCC because they support target-specific attributes, which we use to
ensure proper VEX code generation for AVX2 targets.
To vectorize a loop, "strip-mining" transforms it into an outer loop and inner loop with number of iterations matching the preferred vector width.
In this section, let T
denote the element type, d = ScalableTag<T>
, count
the number of elements to process, and N = Lanes(d)
the number of lanes in a
full vector. Assume the loop body is given as a function template<bool partial, class D> void LoopBody(D d, size_t index, size_t max_n)
.
Highway offers several ways to express loops where N
need not divide count
:
-
Ensure all inputs/outputs are padded. Then the loop is simply
for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
Here, the template parameter and second function argument are not needed.
This is the preferred option, unless
N
is in the thousands and vector operations are pipelined with long latencies. This was the case for supercomputers in the 90s, but nowadays ALUs are cheap and we see most implementations split vectors into 1, 2 or 4 parts, so there is little cost to processing entire vectors even if we do not need all their lanes. Indeed this avoids the (potentially large) cost of predication or partial loads/stores on older targets, and does not duplicate code. -
Process whole vectors and include previously processed elements in the last vector:
for (size_t i = 0; i < count; i += N) LoopBody<false>(d, HWY_MIN(i, count - N), 0);
This is the second preferred option provided that
count >= N
andLoopBody
is idempotent. Some elements might be processed twice, but a single code path and full vectorization is usually worth it. Even ifcount < N
, it usually makes sense to pad inputs/outputs up toN
. -
Use the
Transform*
functions in hwy/contrib/algo/transform-inl.h. This takes care of the loop and remainder handling and you simply define a generic lambda function (C++14) or functor which receives the current vector from the input/output array, plus optionally vectors from up to two extra input arrays, and returns the value to write to the input/output array.Here is an example implementing the BLAS function SAXPY (
alpha * x + y
):Transform1(d, x, n, y, [](auto d, const auto v, const auto v1) HWY_ATTR { return MulAdd(Set(d, alpha), v, v1); });
-
Process whole vectors as above, followed by a scalar loop:
size_t i = 0; for (; i + N <= count; i += N) LoopBody<false>(d, i, 0); for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
The template parameter and second function arguments are again not needed.
This avoids duplicating code, and is reasonable if
count
is large. Ifcount
is small, the second loop may be slower than the next option. -
Process whole vectors as above, followed by a single call to a modified
LoopBody
with masking:size_t i = 0; for (; i + N <= count; i += N) { LoopBody<false>(d, i, 0); } if (i < count) { LoopBody<true>(d, i, count - i); }
Now the template parameter and third function argument can be used inside
LoopBody
to non-atomically 'blend' the firstnum_remaining
lanes ofv
with the previous contents of memory at subsequent locations:BlendedStore(v, FirstN(d, num_remaining), d, pointer);
. Similarly,MaskedLoad(FirstN(d, num_remaining), d, pointer)
loads the firstnum_remaining
elements and returns zero in other lanes.This is a good default when it is infeasible to ensure vectors are padded, but is only safe
#if !HWY_MEM_OPS_MIGHT_FAULT
! In contrast to the scalar loop, only a single final iteration is needed. The increased code size from two loop bodies is expected to be worthwhile because it avoids the cost of masking in all but the final iteration.
- Highway introduction (slides)
- Overview of instructions per operation on different architectures
- Design philosophy and comparison
- Implementation details
We have used farm-sve by Berenger Bramas; it has proved useful for checking the SVE port on an x86 development machine.
This is not an officially supported Google product. Contact: [email protected]