I found a way to speed up CPU inference using a HNSW index on the output embeddings #11686
martinloretzzz started this conversation in Ideas
To get the next token from an LLM, we compute the probabilities for each individual token in the LLM's vocabulary by multiplying the last hidden state with the output embedding matrix. This matrix is massive, accounting for up to 20% of the total parameters in small multilingual LLMs.
When sampling the next token with top-k sampling, we're only sampling from the 40 most probable tokens out of 128,256 (for Llama 3.2 models). By using an HNSW vector index, we can retrieve these 40 most probable tokens directly through an approximate nearest neighbor search over the output embeddings, avoiding the full matrix multiplication with the output embeddings.
This reduces memory accesses and computation, resulting in up to 28% faster CPU-based inference for Llama 3.2 1B on mid-range laptops.
For more details, read the full blog post on martinloretz.com/blog/vector-index-cpu/
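To make the idea concrete, here is a small, self-contained sketch of the two paths, using the upstream hnswlib Python bindings rather than the patched C++ integration the actual code uses. The matrices are random stand-ins and the toy sizes, HNSW parameters, and variable names are my own illustrative assumptions, not taken from the post:

```python
import numpy as np
import hnswlib

# Toy sizes for illustration; Llama 3.2 1B uses n_vocab = 128256, n_embd = 2048.
n_vocab, n_embd, k = 8000, 128, 40

rng = np.random.default_rng(0)
W_out = rng.standard_normal((n_vocab, n_embd), dtype=np.float32)  # stand-in output embedding matrix
h = rng.standard_normal(n_embd, dtype=np.float32)                 # stand-in last hidden state

# Reference path: full matrix multiplication over the whole vocabulary.
logits = W_out @ h
top_ref = np.argsort(-logits)[:k]

# ANN path: inner-product HNSW index over the rows of the output embedding matrix.
index = hnswlib.Index(space="ip", dim=n_embd)   # "ip" distance = 1 - dot(a, b)
index.init_index(max_elements=n_vocab, M=16, ef_construction=200)
index.add_items(W_out, np.arange(n_vocab))
index.set_ef(64)                                # search-time accuracy/speed trade-off

labels, dists = index.knn_query(h.reshape(1, -1), k=k)
top_ann = labels[0]                             # approximate top-k token ids
approx_logits = 1.0 - dists[0]                  # recover inner products from distances

# Softmax over only the k retrieved candidates, then sample as usual.
p = np.exp(approx_logits - approx_logits.max())
p /= p.sum()
next_token = rng.choice(top_ann, p=p)

print("overlap with exact top-k:", len(set(top_ref) & set(top_ann)), "/", k)
```

With random data the overlap is only a sanity check; on real output embeddings, the recall of the approximate search (controlled mainly by ef) determines how closely the sampled distribution matches exact top-k sampling.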
Benchmarks
llama-bench results for Llama 3.2 1B F16 (Ubuntu = Intel® Core™ i7-10750H x 12, 2 x 16 GiB DDR4 2933 MHz; MacBook = MacBook Pro 16" M4 Pro; vec = vector index, MM = matrix multiplication, used as the reference).
Llama 3.2 1B was selected for these benchmarks because of its relatively large embedding matrix (21% of all parameters). Full-model speedups for larger models are lower because a smaller fraction of inference time is spent computing the output embeddings.
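As a rough sanity check of the 21% figure (my own arithmetic, assuming Llama 3.2 1B's published vocabulary size of 128,256, hidden size of 2048, and roughly 1.24 billion total parameters):

```python
vocab_size = 128_256      # Llama 3 tokenizer vocabulary
hidden_size = 2_048       # Llama 3.2 1B embedding width (assumed)
total_params = 1.24e9     # approximate total parameter count (assumed)

embedding_params = vocab_size * hidden_size   # 262,668,288
print(embedding_params / total_params)        # ~0.21 -> roughly 21% of all parameters
```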
Replicate these benchmarks:
Code is here. It's quite hacky (I haven't done much with C++ before) and only works with Llama-3.2-1B-Instruct.fp16.gguf. You can get the prebuilt HNSW index for the model from here (a sketch of building one yourself follows the commands below).
Build and install this faster version of hnswlib: martinloretzzz/hnswlib
CPU build:
cmake -B build -DLLAMA_OPENBLAS=OFF -DGGML_METAL=OFF -DGGML_BLAS=OFF
cmake --build build --config Release
Llama bench (with vector index):
./build/bin/llama-bench -m Llama-3.2-1B-Instruct.fp16.gguf -p 0 -n 256 -t 1,6
Without vector index:
MM=True ./build/bin/llama-bench -m Llama-3.2-1B-Instruct.fp16.gguf -p 0 -n 256 -t 1,6
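If you would rather build an HNSW index over the output embeddings yourself instead of downloading the prebuilt one, a minimal sketch using the gguf Python package and the standard hnswlib bindings could look like the following. The tensor name, hidden size, HNSW parameters, and output file name are my assumptions, and this produces a generic hnswlib index file rather than whatever exact format the hacked llama.cpp build expects:

```python
import numpy as np
import hnswlib
from gguf import GGUFReader

reader = GGUFReader("Llama-3.2-1B-Instruct.fp16.gguf")

# Llama 3.2 1B ties input and output embeddings, so the output matrix is usually
# stored as token_embd.weight; other models may carry a separate output.weight.
tensor = next(t for t in reader.tensors
              if t.name in ("output.weight", "token_embd.weight"))

n_embd = 2048  # hidden size of Llama 3.2 1B (assumed)
emb = np.asarray(tensor.data).astype(np.float32).reshape(-1, n_embd)

index = hnswlib.Index(space="ip", dim=n_embd)  # inner product matches the logit computation
index.init_index(max_elements=emb.shape[0], M=16, ef_construction=200)
index.add_items(emb, np.arange(emb.shape[0]))
index.save_index("llama-3.2-1b-output-emb.hnsw")  # hypothetical output path
```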
Replies (1 comment, 1 reply)
Does it currently only support FP16? I found that Q8_0 is slower than MM.