"One-shot conversations" with local LLMs via llama-cli #11357
Replies: 1 comment
---
I know that much bigger fish are being fried in these discussions, so here is my stab at an answer to the observed behavior, which is probably obvious to the initiated. After getting some insight into what might be happening, I crafted an LLM prompt that gives (IMHO) a good answer:
> Copy-on-write (COW) is a memory management technique used in operating systems, including Linux, to optimize memory usage and improve performance. The basic idea behind COW is to delay the creation of a writable copy of a memory page until it is actually modified. This can be particularly beneficial in scenarios where multiple processes or threads need to access the same data but do not modify it.
>
> **How COW Works in Linux** […]
>
> **Nested Command Substitutions and COW**
> In the context of nested command substitutions of the same executable accessing a large, read-only file via […]
>
> **Example Scenario**
> In a Linux system with X GB of VRAM, a model is run using llama.cpp […] In this scenario, […]
>
> **Considerations** […]
>
> **Conclusion**
> Copy-on-write is a powerful memory management technique in Linux that can significantly optimize memory usage and improve performance, especially in scenarios involving multiple processes accessing the same data. In the context of nested command substitutions of the same executable accessing a large, read-only file via […]

I would appreciate the developer insight on this use of […]. (Of course, the added inference cost of re-evaluating the previous prompts is a drawback, but it gives me a quick-and-dirty command-line alias for sending follow-up questions to a local LLM.)
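For what it's worth, one way to see this page sharing in practice, while a llama-cli instance is running, is to look at the kernel's memory accounting for the process. This is only a minimal sketch, assuming the default mmap behavior (i.e. llama-cli was not started with `--no-mmap`), and the `pgrep` pattern is just an example:

```zsh
# Pick one running llama-cli process (assumes the process name matches "llama-cli").
pid=$(pgrep -f llama-cli | head -n 1)

# With an mmap'ed, read-only model file, most of the mapped model shows up as
# clean, file-backed pages that the page cache can share between invocations,
# rather than as private (copied) memory.
grep -E 'Rss|Shared_Clean|Private_Clean|Private_Dirty' /proc/$pid/smaps_rollup
```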
---
Yeah, it sounds like an oxymoron in this context, but what I mean is the following:
I like using llama-cli in various ways from the Linux command line and I love zsh.
(In fact my tool BlahST was written in zsh to orchestrate whisper.cpp and llama.cpp for speech input and speech-to-speech LLM interaction.)
I wanted to share a peculiar way to have a few-turn conversation with a `llama-cli`-driven LLM from the command line in essentially one-shot mode. I describe it in more detail here.

Disclaimer: This can be done by simply keeping a file with the previous LLM outputs, but in this case, repeated computation aside, I would like to know why the following works at all.
Basically, I have a zsh alias for llama-cli (with a Qwen model) that I use all the time and have found useful in multiple scenarios:
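Roughly, something along these lines (a minimal sketch of such a wrapper, with a placeholder model path and flags, not the exact definition):

```zsh
# Hypothetical sketch of the `qwen` wrapper around llama-cli.
# Model path, context size and GPU offload are placeholders.
qwen() {
  llama-cli \
    -m ~/models/qwen2.5-7b-instruct-Q5_K_M.gguf \
    -ngl 99 -c 8192 \
    -no-cnv --no-display-prompt \
    -p "$*" 2>/dev/null
}
```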
I found myself using a triple-nested call (and expected the system to blow up with OOM or a similar condition):
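Something like the following, where each inner `$(qwen ...)` runs to completion and its output is spliced into the prompt one level up (the prompts here are made-up placeholders):

```zsh
# Hypothetical triple-nested call: the innermost llama-cli run finishes first,
# then its answer becomes part of the middle prompt, and so on outward.
qwen "Summarize in two sentences: $(qwen "Elaborate on this answer: $(qwen 'Define copy-on-write in one sentence.')")"
```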
but not only did it work, it did not even slow down (maybe a little, due to the expanded prompt).
Questions:
- I do not know much about llama.cpp's memory management (VRAM, of which I fill about 80% with a single instance of the chosen model), but it must be quite efficient and clever to reuse the model, prompt cache, etc. in place in this triple-nested situation.
- It does not seem like llama-cli is stopping and then restarting at all. Is it?
- Kudos, developers! Can we expect this behavior in a CPU-only scenario as well?
- I would appreciate a simple explanation from a knowledgeable person of why there is no slowdown or OOM blowup.
- Also, from the developer's point of view (or is it the OS's, or both?), what is the expected maximum practical depth of such command nesting?
Encouraged, I came up with an idea and proceeded to add one more alias to my zsh setup:
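In sketch form it might look like the function below; this is an illustrative reconstruction rather than the actual definition, and the exact history offset used by `fc`/`r` can differ between setups:

```zsh
# Hypothetical sketch of `qre`. In zsh, `r` is equivalent to `fc -e -`, which
# re-executes the most recent history entry; $(r) therefore re-runs the
# previous qwen call and captures its answer, which is prepended to the new
# question in $1.
qre() {
  # Only continue if the preceding command was a qwen/qwec call,
  # otherwise $(r) would re-execute something unrelated.
  if [[ $(fc -ln -1) == qwe* ]]; then
    qwen "$(r) $1"
  else
    print -u2 "qre: previous command was not a qwen call"
    return 1
  fi
}
```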
where the gist is in `qwen "$(r) $1"`, which would work by itself, but we also need to check that the preceding command was a call to `qwen`…, since what this new alias does is expand the history and re-evaluate the previous prompt, in essence continuing the "conversation". (Above, `qwec` is an alias for a Qwen Coder call with llama-cli, and `qre` is the alias itself, allowing a conversation of more than two turns.)

So, equipped with this tooling, I could have command-line conversations like this:
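A two-turn exchange of this sort might look roughly as follows (hypothetical prompts; the model output is omitted):

```zsh
# First turn: a plain one-shot call.
qwen "Explain copy-on-write in the Linux kernel in one paragraph."
# ... model answer printed here ...

# Second turn: qre re-runs the previous qwen call via $(r) and prepends its
# answer to the follow-up question, continuing the "conversation".
qre "How does this apply to a large, read-only model file mapped by llama.cpp?"
# ... model answer, now conditioned on the re-evaluated first exchange ...
```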
Why not just use the regular conversation mode of llama-cli?
For the same reason I created the LLM alias in the first place: to leverage the power of the CLI and not be isolated in llama-cli's conversation bubble. Please see the linked discussion above or this for useful examples.