Llama.cpp slower than transformers - CPU inference #11711
Unanswered · cosmin-petrescu-ptt asked this question in Q&A
Hi, I fine-tuned a Llama 3.2 3B model with LoRA and analysed its inference speed on CPU when using transformers. The sequence length of the sample I am benchmarking with is ~2k tokens. It takes ~8 seconds to generate the first token, which is all I care about for now, given my use case. I then converted both the base model and the LoRA adapter to .gguf and used llama.cpp to again generate a single token. To my surprise, it never takes less than 17 seconds. There might be something I am doing wrong. Any suggestion is appreciated. Thank you!
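
For context, here is a minimal sketch of how the two first-token-latency measurements described above could be reproduced. The model paths, prompt, context size, and thread count are placeholders (the post does not specify them), and the llama.cpp side is shown through the llama-cpp-python bindings as an assumption, since the exact invocation is not given.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths and prompt -- not taken from the original post.
MODEL_DIR = "path/to/llama-3.2-3b-with-lora"
PROMPT = "..."  # the ~2k-token prompt used in the benchmark

# transformers baseline: time to generate a single token on CPU.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer(PROMPT, return_tensors="pt")
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
print(f"transformers time-to-first-token: {time.perf_counter() - start:.2f} s")
```

```python
import time
from llama_cpp import Llama

# Hypothetical GGUF files produced by llama.cpp's conversion scripts
# (base model and LoRA adapter converted separately, as described above).
llm = Llama(
    model_path="base-llama-3.2-3b.gguf",
    lora_path="lora-adapter.gguf",
    n_ctx=4096,      # must be large enough to hold the ~2k-token prompt
    n_threads=8,     # typically set to the number of physical cores
)

PROMPT = "..."  # same ~2k-token prompt as in the transformers benchmark
start = time.perf_counter()
llm(PROMPT, max_tokens=1)
print(f"llama.cpp time-to-first-token: {time.perf_counter() - start:.2f} s")
```

For a fair comparison, both runs should use the same prompt, the same number of threads, and a comparable weight precision (a quantized GGUF versus float32 in transformers would not be an apples-to-apples measurement of the runtimes themselves).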