"One-shot conversations" with local LLMs via llama-cli #11357
Replies: 1 comment
---
I know that much bigger fish are being fried in these discussions, so here is my stab at an answer to the observed behavior, which is probably obvious to the initiated. After getting some insight into what might be happening, I crafted an LLM prompt that gives (IMHO) a good answer:
> Copy-on-write (COW) is a memory management technique used in operating systems, including Linux, to optimize memory usage and improve performance. The basic idea behind COW is to delay the creation of a writable copy of a memory page until it is actually modified. This can be particularly beneficial in scenarios where multiple processes or threads need to access the same data but do not modify it.
>
> **How COW Works in Linux** […]
>
> **Nested Command Substitutions and COW**
> In the context of nested command substitutions of the same executable accessing a large, read-only file via […]
>
> **Example Scenario**
> In a Linux system with X GB of VRAM, a model is run using llama.cpp […] In this scenario, […]
>
> **Considerations** […]
>
> **Conclusion**
> Copy-on-write is a powerful memory management technique in Linux that can significantly optimize memory usage and improve performance, especially in scenarios involving multiple processes accessing the same data. In the context of nested command substitutions of the same executable accessing a large, read-only file via […]

I would appreciate the developer insight on this use of […]. (Of course, the added inference cost of re-evaluating the previous prompts is a drawback, but it gives me a quick-and-dirty command-line alias for sending follow-up questions to a local LLM.)
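For what it's worth, one way to see this page sharing in practice, while a llama-cli instance is running, is to look at the kernel's memory accounting for the process. This is only a minimal sketch, assuming the default mmap behavior (i.e. llama-cli was not started with `--no-mmap`), and the `pgrep` pattern is just an example:

```zsh
# Pick one running llama-cli process (assumes the process name matches "llama-cli").
pid=$(pgrep -f llama-cli | head -n 1)

# With an mmap'ed, read-only model file, most of the mapped model shows up as
# clean, file-backed pages that the page cache can share between invocations,
# rather than as private (copied) memory.
grep -E 'Rss|Shared_Clean|Private_Clean|Private_Dirty' /proc/$pid/smaps_rollup
```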
---
Yeah, it sounds like an oxymoron in this context, but what I mean is the following:
I like using llama-cli in various ways from the Linux command line and I love zsh.
(In fact my tool BlahST was written in zsh to orchestrate whisper.cpp and llama.cpp for speech input and speech-to-speech LLM interaction.)
I wanted to share a peculiar way to have a few-turn conversation with a `llama-cli`-driven LLM from the command line in essentially one-shot mode. I describe it in more detail here.

Disclaimer: This can be done by simply keeping a file with the previous LLM outputs, but in this case, repeated computation aside, I would like to know why the following works at all.
Basically, I have a zsh alias for llama-cli (with a Qwen model) that I use all the time and have found useful in multiple scenarios:
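Roughly, something along these lines (a minimal sketch of such a wrapper, with a placeholder model path and flags, not the exact definition):

```zsh
# Hypothetical sketch of the `qwen` wrapper around llama-cli.
# Model path, context size and GPU offload are placeholders.
qwen() {
  llama-cli \
    -m ~/models/qwen2.5-7b-instruct-Q5_K_M.gguf \
    -ngl 99 -c 8192 \
    -no-cnv --no-display-prompt \
    -p "$*" 2>/dev/null
}
```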
I found myself using a triple-nested call (and expected the system to blow up with OOM or a similar condition):
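Something like the following, where each inner `$(qwen ...)` runs to completion and its output is spliced into the prompt one level up (the prompts here are made-up placeholders):

```zsh
# Hypothetical triple-nested call: the innermost llama-cli run finishes first,
# then its answer becomes part of the middle prompt, and so on outward.
qwen "Summarize in two sentences: $(qwen "Elaborate on this answer: $(qwen 'Define copy-on-write in one sentence.')")"
```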
but not only did it work, it did not even slow down (maybe a little, due to the expanded prompt).
Questions:
- I do not know much about llama.cpp's memory management (VRAM, of which I fill about 80% with a single instance of the chosen model), but it must be quite efficient and clever to reuse the model, prompt cache, etc. in place in this triple-nested situation.
- It does not seem like llama-cli is stopping and then restarting at all. Is it?
- Kudos, developers! Can we expect this behavior in a CPU-only scenario as well?
- I would appreciate a simple explanation from a knowledgeable person of why there is no slowdown or OOM blowup.
- Also, from the developer's point of view (or is it the OS's, or both?), what is the expected maximum practical depth of such command nesting?
Encouraged, I came up with an idea and proceeded to add one more alias to my zsh setup:
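In sketch form it might look like the function below; this is an illustrative reconstruction rather than the actual definition, and the exact history offset used by `fc`/`r` can differ between setups:

```zsh
# Hypothetical sketch of `qre`. In zsh, `r` is equivalent to `fc -e -`, which
# re-executes the most recent history entry; $(r) therefore re-runs the
# previous qwen call and captures its answer, which is prepended to the new
# question in $1.
qre() {
  # Only continue if the preceding command was a qwen/qwec call,
  # otherwise $(r) would re-execute something unrelated.
  if [[ $(fc -ln -1) == qwe* ]]; then
    qwen "$(r) $1"
  else
    print -u2 "qre: previous command was not a qwen call"
    return 1
  fi
}
```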
where the gist is in `qwen "$(r) $1"`, which would work by itself, but we also need to check that the preceding command was a call to `qwen`…, since what this new alias does is expand the history and re-evaluate the previous prompt, in essence continuing the "conversation". (Above, `qwec` is an alias for a Qwen Coder call with llama-cli, and `qre` is the alias itself, allowing a conversation of more than two turns.)

So, equipped with this tooling, I could have command-line conversations like this:
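A two-turn exchange of this sort might look roughly as follows (hypothetical prompts; the model output is omitted):

```zsh
# First turn: a plain one-shot call.
qwen "Explain copy-on-write in the Linux kernel in one paragraph."
# ... model answer printed here ...

# Second turn: qre re-runs the previous qwen call via $(r) and prepends its
# answer to the follow-up question, continuing the "conversation".
qre "How does this apply to a large, read-only model file mapped by llama.cpp?"
# ... model answer, now conditioned on the re-evaluated first exchange ...
```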
Why not just use the regular conversation mode of llama-cli?
For the same reason I created the LLM alias in the first place: to leverage the power of the CLI and not be isolated in llama-cli's conversation bubble. Please see the linked discussion above or this for useful examples.