Understanding Matmul Kernel Support for BF16 and INT8 Quantization in llama.cpp on x86 CPUs #11734
Unanswered · lalith1403 asked this question in Q&A
Hello everyone,
I'm currently working with llama.cpp on x86 CPUs and have some questions regarding the support and implementation of matrix multiplication (matmul) kernels for various data types and quantization schemes. I hope to get some insights from the community or contributors to better understand these aspects.
For k-bit quantization (e.g., the 4-bit and 5-bit schemes):

- Does llama.cpp internally represent these quantized weights using INT8?
- During computation, are the quantized weights dequantized to a higher precision (e.g., FP16 or FP32) and then requantized, or are operations performed directly on the quantized representations? (I sketch what I mean right after this list.)
- Which specific matmul kernels are used for computations involving k-bit quantization?
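To make the second question concrete, here is a minimal sketch of the kind of scheme I have in mind: 4-bit weights packed two per byte with a per-block scale, activations quantized to 8-bit integers per block, and the dot product accumulated in integer arithmetic with the two scales applied once per block. The struct names, field layout, and block size below are my own illustration, not the actual ggml definitions.

```c
#include <stdint.h>

#define BLOCK_SIZE 32  // elements per quantization block (illustrative)

// Hypothetical 4-bit weight block: one FP32 scale plus 32 weights packed
// two per byte. I am glossing over the exact scale precision and nibble
// ordering that the real ggml block types use.
typedef struct {
    float   d;                   // per-block scale
    uint8_t qs[BLOCK_SIZE / 2];  // packed 4-bit quants
} block_w4;

// Hypothetical 8-bit activation block.
typedef struct {
    float  d;                    // per-block scale
    int8_t qs[BLOCK_SIZE];       // 8-bit quants
} block_a8;

// Dot product of one weight block against one activation block. The
// multiply-accumulate stays in integer arithmetic; the two scales are
// applied once per block instead of dequantizing every element to FP32.
float block_dot(const block_w4 *w, const block_a8 *a) {
    int32_t acc = 0;
    for (int i = 0; i < BLOCK_SIZE / 2; ++i) {
        // Unpack two 4-bit weights and re-center them around zero.
        const int wl = (int)(w->qs[i] & 0x0F) - 8;
        const int wh = (int)(w->qs[i] >> 4)   - 8;
        acc += wl * a->qs[2 * i];
        acc += wh * a->qs[2 * i + 1];
    }
    return (float)acc * w->d * a->d;
}
```

Is this roughly the pattern the x86 kernels follow, i.e., quantizing activations to INT8 per block and using integer multiply-add instructions, or do they dequantize element-wise to FP32 before the multiply?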
I appreciate any clarification or insights you can provide on these topics. Understanding the underlying implementations will greatly help in optimizing models and leveraging llama.cpp effectively on x86 hardware.
Thank you!