Llama.cpp slower than transformers - CPU inference #11711
Unanswered · cosmin-petrescu-ptt asked this question in Q&A
Hi, I fine-tuned a Llama 3.2 3B model with LoRA and analysed its inference speed on CPU when using transformers. The sequence length of the sample I am benchmarking with is ~2k tokens. It takes ~8 seconds to generate the first token, which is all I care about for now, given my use case. I then converted both the base model and the LoRA adapter to .gguf and used llama.cpp to again generate a single token. To my surprise, it never takes less than 17 seconds. There might be something I am doing wrong. Any suggestion is appreciated. Thank you!
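
For context, here is a minimal sketch of how the two first-token-latency measurements described above could be reproduced. The model paths, prompt, context size, and thread count are placeholders (the post does not specify them), and the llama.cpp side is shown through the llama-cpp-python bindings as an assumption, since the exact invocation is not given.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths and prompt -- not taken from the original post.
MODEL_DIR = "path/to/llama-3.2-3b-with-lora"
PROMPT = "..."  # the ~2k-token prompt used in the benchmark

# transformers baseline: time to generate a single token on CPU.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer(PROMPT, return_tensors="pt")
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
print(f"transformers time-to-first-token: {time.perf_counter() - start:.2f} s")
```

```python
import time
from llama_cpp import Llama

# Hypothetical GGUF files produced by llama.cpp's conversion scripts
# (base model and LoRA adapter converted separately, as described above).
llm = Llama(
    model_path="base-llama-3.2-3b.gguf",
    lora_path="lora-adapter.gguf",
    n_ctx=4096,      # must be large enough to hold the ~2k-token prompt
    n_threads=8,     # typically set to the number of physical cores
)

PROMPT = "..."  # same ~2k-token prompt as in the transformers benchmark
start = time.perf_counter()
llm(PROMPT, max_tokens=1)
print(f"llama.cpp time-to-first-token: {time.perf_counter() - start:.2f} s")
```

For a fair comparison, both runs should use the same prompt, the same number of threads, and a comparable weight precision (a quantized GGUF versus float32 in transformers would not be an apples-to-apples measurement of the runtimes themselves).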