Dual-CPU utilization - CPU inference only #11530

nekiee13 · 2025-01-30T23:15:47Z

nekiee13
Jan 30, 2025

I'm trying to use a dual-CPU inference setup: 2x EPYC 9154 (2x16 physical cores) with 392 GB RAM, on Win10.

When I run DeepSeek-R1 Q2 (600B+), inference speed ranges from 2.38 tk/s at 500 ctx (~200GB RAM) to 0.66 tk/s at 28k ctx (~300GB RAM). The declared memory bandwidth is 460GB/s per socket, or ~900GB/s for the dual-CPU config. Coupled with 24x16GB RAM modules (from a vendor-qualified manufacturer), I would expect faster inference at long context, somewhere around 2-3 tk/s, not 0.66.

But when I run a smaller model (32B Q8_0) with a small ctx (~600), I again get 2.7 tk/s, even though I only use ~30 GB RAM. Given the available memory bandwidth, I would expect inference to be significantly faster. So I'm pretty sure I messed something up in my setup.
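As a rough sanity check on that expectation: token generation on CPU is usually limited by memory bandwidth, since each generated token reads roughly all of the weights once. The sketch below inverts that relation to estimate the effective bandwidth implied by the measured speed. It assumes about 8 bits per weight for Q8_0 and ignores quantization scales and the KV cache; the function names are illustrative, not part of llama.cpp.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights read per generated token, in GB."""
    return params_billion * bits_per_weight / 8


def implied_bandwidth_gb_s(tokens_per_s: float,
                           params_billion: float,
                           bits_per_weight: float) -> float:
    """Effective memory bandwidth implied by a measured generation speed."""
    return tokens_per_s * weights_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    # Numbers from the post: 32B model, Q8_0 (assumed ~8 bits/weight), 2.7 tk/s.
    bw = implied_bandwidth_gb_s(2.7, 32, 8)
    print(f"implied effective bandwidth: {bw:.1f} GB/s")
```

Under these assumptions, 2.7 tk/s corresponds to roughly 86 GB/s of effective bandwidth, well below the declared 460GB/s per socket.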

This is my cmd:

llama-cli --model "C:\xLLMs\bartowski\FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview-GGUF\FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview-Q8_0.gguf" --ctx-size 600 -n 450 --cache-type-k q8_0 --cache-type-v f16 --threads 32 --numa distribute --prio 2 --temp 0.65 --top_k 40 --top_p 0.9 --min-p 0.05 --seed 3407 --flash-attn -no-cnv --prompt "<|User|>Why is the sky blue?<|Assistant|>"

I use the latest llama.cpp Win64 AVX-512 binary. I set Win10 to High performance and the BIOS to HPC (hyperthreading / SMT disabled), memory speed 4800 MT/s, cTDP 320W/CPU, P-states and C-states disabled...

One thing I couldn't find in the BIOS was the NUMA mode (motherboard: Gigabyte MZ73-LM1); I will look again for that.

What else can I check? Would appreciate some guidelines or advice on this.

nekiee13 · 2025-02-01T09:13:09Z

nekiee13
Feb 1, 2025
Author

EDIT:
Confirmed - Win10/11 Pro & Home support only 1 CPU setup.

4 replies

fairydreaming Feb 1, 2025
Collaborator

One reason for mediocre performance is that your CPU likely has only 4 CCDs (based on its 128MB of cache, the same value as in the 9245), so its real memory bandwidth is not 460GB/s, but more like 260-270GB/s. You can measure it with Aida64. Also, if your operating system gives you access to only 1 CPU, then perhaps try lowering the number of threads passed in the llama.cpp parameters to 16?
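A minimal sketch of how such a thread comparison could be set up. The flags are copied from the command in the original post; `model.gguf` is a placeholder path and `build_cmd` is a hypothetical helper, not a llama.cpp tool. The `os.cpu_count()` line only reports how many logical processors the OS exposes to Python, which is a hint, not proof, of how many sockets are usable.

```python
import os


def build_cmd(model_path: str, threads: int) -> list[str]:
    """Build a llama-cli argument list that differs only in --threads."""
    return [
        "llama-cli",
        "--model", model_path,          # placeholder path, supply your own
        "--ctx-size", "600",
        "-n", "450",
        "--threads", str(threads),
        "--numa", "distribute",
        "-no-cnv",
        "--prompt", "Why is the sky blue?",
    ]


if __name__ == "__main__":
    # With SMT disabled, 32 visible logical CPUs would suggest both 16-core
    # sockets are exposed; 16 would suggest only one is.
    print("visible logical CPUs:", os.cpu_count())
    for t in (16, 32):
        print(build_cmd("model.gguf", t))
```

Each printed list can be passed to `subprocess.run` to compare tk/s at 16 and 32 threads with everything else held constant.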

nekiee13 Feb 2, 2025
Author

Thank you for the information and help. I owe you one on this. :)

Kamayuq Feb 5, 2025

I thought Win11 Pro supports up to 2 sockets.

nekiee13 Feb 5, 2025
Author

Yes and no, it depends on core count. Win10 Pro supports up to 64 cores, so in my case it should not be the issue. With a higher core count (a stronger CPU) it would be.

Entropy-Enthalpy · 2025-02-03T07:31:42Z

Entropy-Enthalpy
Feb 3, 2025

EPYC 9154 doesn't seem to exist. Do you mean 9124? It has only 2 CCDs, and the memory bandwidth measured in the dual-socket configuration is ~530GB/s (https://x.com/xyster/status/1884843091024097569), not 922GB/s as advertised by AMD.
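For reference, the advertised figure can be reproduced from channel arithmetic. A minimal sketch, assuming 12 DDR5 channels per socket and 8 bytes per transfer per channel (neither number is stated in this thread); the 530GB/s measurement is the one quoted above, and the function name is illustrative.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth of one socket, in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


if __name__ == "__main__":
    per_socket = peak_bandwidth_gb_s(12, 4800)   # assumed 12 channels, 4800 MT/s
    dual = 2 * per_socket
    measured = 530.0                             # dual-socket figure quoted above
    print(f"per socket: {per_socket:.1f} GB/s, dual: {dual:.1f} GB/s")
    print(f"measured / theoretical: {measured / dual:.1%}")
```

Under these assumptions the theoretical peak is about 460.8 GB/s per socket and 921.6 GB/s for two sockets, so the measured ~530GB/s is a bit under 60% of the advertised number.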

I also observed that you opted for 16GiB RDIMMs, which are typically 1-Rank. This choice can result in significantly lower bandwidth compared to the more common 2-Rank RDIMMs.

BTW, DeepSeek-R1 is a MoE model that has 37B activated parameters, therefore the theoretical maximum speed (of the Q4 model) for a 530GB/s platform is 530/(4/8)/37=28.6 tokens/s. Unfortunately, actual tests on the Internet including my own test achieve only a fraction of this theoretical efficiency... I'm not sure why... An exception is NVIDIA's NIM, which claims to have achieved 3872tps on a single HGX-H200-8GPU platform. I assumed that they used the Q2 model: 4800 GBps * 8 / ( 2 / 8 ) / 37 ≈ 4k tps
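The two estimates above follow the same bandwidth-bound formula: tokens/s ≈ bandwidth / (active parameters × bytes per weight). A small sketch that reproduces both numbers; the function name is illustrative, and the model ignores KV-cache reads, expert routing overhead, and compute limits.

```python
def max_tokens_per_s(bandwidth_gb_s: float, bits_per_weight: float,
                     active_params_billion: float) -> float:
    """Upper bound on generation speed if every token reads all active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Dual-socket CPU platform at ~530 GB/s, Q4 weights, 37B active parameters.
    cpu = max_tokens_per_s(530, 4, 37)
    # 8 GPUs at 4800 GB/s each, Q2 weights assumed as in the post.
    gpu = max_tokens_per_s(4800 * 8, 2, 37)
    print(f"CPU bound: {cpu:.1f} tk/s, 8-GPU bound: {gpu:.0f} tk/s")
```

This is only a ceiling; real throughput is lower, as the measurements in this thread show.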

1 reply

nekiee13 Feb 3, 2025
Author

You're correct - EPYC 9124. Typo on my side.

fairydreaming · 2025-02-03T09:16:46Z

fairydreaming
Feb 3, 2025
Collaborator

With only 1 CPU support in Windows I guess the other CPU works only as an extra memory controller? So if you load a very large model and use large context so that the memory usage is spread across both CPUs then your performance may be limited by the xGMI interconnection between CPUs (it adds latency in memory accesses and reduces the bandwidth). See https://lenovopress.lenovo.com/lp1852.pdf page 10.

Also: 2 x 9124, Windows, C:\xLLMs. Your reddit comments will be remembered forever... 😉

0 replies

nekiee13 · 2025-02-04T08:47:26Z

nekiee13
Feb 4, 2025
Author

Yeah, I'm upgrading from an Intel i5-13400F & 4090. So AMD 2x9124 and the rest of the HW was as far as I could stretch.

Anyway, the TPM chip has arrived, so I can finish my migration. I plan to do testing next weekend. Hopefully no more issues will arise... Thanks to all of you, there is a solid chance that the system will finally work properly. I think I now understand why the system behaved that way. I will report back on the results if all turns out well. Though, given the relatively weak CPU pair and 1-Rank RDIMM RAM, I doubt it will match your score.

PS. I always had this fear that echoes of my existence would fade into oblivion. But, you got that covered for me also. So, that's another issue solved. You're good at this...

0 replies
