low performance of a simple ggml_mul_mat between GGML_TYPE_F16 and GGML_TYPE_Q4_0 #964
Replies: 1 comment
-
output: {'ggml_type': 2, 'shape': [8192, 8192], 'bad_offset': 548601856, 'item_type': <class 'numpy.uint8'>, 'item_count': 37748736, 'np_dims': (8192, 4608), 'offset': 549368736}
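The sizes in that dump are consistent with ggml's Q4_0 block layout (an assumption on my part, based on `block_q4_0`: 32 weights per block, stored as a 2-byte fp16 scale plus 16 bytes of packed 4-bit quants, 18 bytes per block). A quick sanity check:

```python
QK4_0 = 32                     # weights per Q4_0 block (assumed, per ggml's block_q4_0)
BLOCK_BYTES = 2 + QK4_0 // 2   # fp16 scale + packed 4-bit quants = 18 bytes

rows, cols = 8192, 8192
bytes_per_row = (cols // QK4_0) * BLOCK_BYTES
total_bytes = rows * bytes_per_row

print(bytes_per_row, total_bytes)  # → 4608 37748736
```

These match the `np_dims` of `(8192, 4608)` and the `item_count` of 37748736 uint8 items above, so the raw Q4_0 bytes are being viewed as one row of 4608 bytes per 8192-element tensor row.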
-
I perform a simple ggml_mul_mat between a GGML_TYPE_F16 tensor and a GGML_TYPE_Q4_0 tensor.
Profiling the process, I found that the GGML_TYPE_Q4_0 tensor is first dequantized and then multiplied, which is very slow.
How can I use the optimized CUDA kernel instead?
ggml_mm.cpp
test.py
my_gguf.py
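For reference, the slow path the profiler shows can be sketched in NumPy: dequantize the Q4_0 blocks to float32, then do a plain float matmul with the F16 operand upcast. The 18-byte block layout and the `(q - 8) * d` decoding below are assumptions based on my reading of ggml's `block_q4_0`; the optimized CUDA kernels avoid this step by operating on the quantized data directly.

```python
import numpy as np

QK4_0 = 32  # weights per Q4_0 block (assumed, per ggml's block_q4_0)

def dequantize_q4_0(raw, n):
    """Dequantize a flat uint8 buffer of Q4_0 blocks to float32.

    Assumed layout: each 18-byte block is a 2-byte fp16 scale d followed by
    16 bytes packing 32 4-bit quants; a stored nibble q decodes to (q - 8) * d.
    """
    blocks = raw.reshape(-1, 2 + QK4_0 // 2)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (nb, 1) scales
    qs = blocks[:, 2:]                                            # (nb, 16) packed quants
    lo = (qs & 0x0F).astype(np.int8) - 8   # first 16 weights of each block
    hi = (qs >> 4).astype(np.int8) - 8     # last 16 weights of each block
    return (np.concatenate([lo, hi], axis=1) * d).reshape(-1)[:n]

# The slow generic path: dequantize, then an ordinary float32 matmul.
w_raw = np.zeros(2 * 18, dtype=np.uint8)  # one 1x64 row = two blocks, all quants 0
w_raw[0:2] = np.frombuffer(np.float16(1.0).tobytes(), dtype=np.uint8)   # d = 1.0
w_raw[18:20] = np.frombuffer(np.float16(1.0).tobytes(), dtype=np.uint8)  # d = 1.0
w = dequantize_q4_0(w_raw, 64).reshape(1, 64)  # all quants 0 -> all weights -8.0
x = np.ones((64, 1), dtype=np.float16)         # the F16 operand
y = w @ x.astype(np.float32)                   # upcast + matmul, as seen in the profile
print(y)  # → [[-512.]]
```

As far as I can tell, ggml's fast quantized mul_mat paths are written for an F32 second operand (which the backend quantizes on the fly), so it may be worth trying an F32 activation tensor rather than F16; I would appreciate confirmation from someone who knows the CUDA backend.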