Can someone explain, in detail, what llama.cpp allocates memory for? #9936
- **Model weights buffer** — this is the model weights. The total size should be very close to the size of the model file on disk in most cases.
- **KV buffer** — this is the model context, i.e. the KV cache. Parameters such as the context size (the `-c`/`--ctx-size` flag) affect the size of this buffer (see the size sketch after this list).
- **Output buffer** — a small buffer used to store the results of the computations; it is usually too small to worry about.
- **Compute buffer** — the buffer where the intermediate results of computations and other temporary data used during inference are stored. Parameters such as the batch size (the `-b`/`--batch-size` and `-ub`/`--ubatch-size` flags) affect the size of this buffer.
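As a rough sanity check on the KV buffer number, here is a minimal sketch of how the KV cache size scales. The hyperparameters below are hypothetical (roughly Llama-2-7B shaped), not read from any particular model; substitute the values printed in your own llama-server startup log. The idea is that one key vector and one value vector are cached per token, per layer, so memory grows linearly with the context size:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical model hyperparameters -- substitute the values
    // printed in your own llama-server startup log.
    const int64_t n_layer   = 32;   // transformer layers
    const int64_t n_head_kv = 32;   // KV heads (fewer than n_head with GQA)
    const int64_t head_dim  = 128;  // dimension of each attention head
    const int64_t n_ctx     = 4096; // context size (the -c flag)
    const int64_t elem_size = 2;    // bytes per element for an f16 cache

    // One K vector and one V vector are cached per token, per layer.
    const int64_t kv_bytes =
        2 * n_layer * n_ctx * n_head_kv * head_dim * elem_size;

    printf("estimated KV cache size: %.2f MiB\n",
           kv_bytes / (1024.0 * 1024.0));
    return 0;
}
```

With these numbers the estimate comes out to 2048 MiB; a smaller context size, a quantized KV cache, or grouped-query attention (fewer KV heads) would all shrink it, which is one plausible way to land near the ~1.3 GB figure mentioned in the question below.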
---
Hello everyone,
I recently asked a question about llama.cpp memory usage. I received some helpful answers, one of which suggested looking at the output printed when I start llama-server.
Looking at that startup output, I see that the KV buffer size is around 1.3 GB. This led me to a question: what is the KV buffer? It reminded me of the query-key-value projections in attention, but I'm not sure if that's what it is.
I will paste my output log below. I would really appreciate it if someone could go over it block-by-block (or line-by-line if necessary) to explain llama.cpp's memory allocations and what they correspond to in the LLM.
Thank you in advance!