Replies: 4 comments 5 replies
-
EDIT: |
-
EPYC 9154 doesn't seem to exist. Do you mean 9124? It has only 2 CCDs, and the memory bandwidth measured in the dual-socket configuration is ~530 GB/s (https://x.com/xyster/status/1884843091024097569), not the 922 GB/s advertised by AMD. I also noticed that you opted for 16 GiB RDIMMs, which are typically 1-Rank. This choice can result in significantly lower bandwidth than the more common 2-Rank RDIMMs. BTW, DeepSeek-R1 is a MoE model with 37B activated parameters, so the theoretical maximum speed of the Q4 model (4 bits, i.e. 4/8 bytes per parameter) on a 530 GB/s platform is 530/(4/8)/37 = 28.6 tokens/s. Unfortunately, actual tests on the Internet, including my own, achieve only a fraction of this theoretical figure... I'm not sure why... An exception is NVIDIA's NIM, which claims to have achieved 3872 tps on a single HGX-H200-8GPU platform. I assume they used the Q2 model: 4800 GB/s * 8 / (2/8) / 37 ≈ 4k tps
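The back-of-the-envelope estimate in this comment can be written as a small Python sketch. It assumes decoding is purely memory-bound and that every activated weight is read exactly once per token; the function name and parameters are illustrative, not part of any library, and the inputs are the figures quoted in the comment.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          bits_per_weight: float,
                          active_params_billion: float) -> float:
    """Upper bound on decode speed for a memory-bound model.

    Assumption: each activated weight is read once per generated token,
    and nothing else (KV cache, activations) competes for bandwidth.
    """
    bytes_per_weight = bits_per_weight / 8
    gb_read_per_token = active_params_billion * bytes_per_weight
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Dual-socket EPYC figure from the comment: ~530 GB/s, Q4, 37B active.
    cpu_bound = max_tokens_per_second(530, 4, 37)
    # 8 x H200 at 4800 GB/s each, Q2 assumed as in the comment.
    gpu_bound = max_tokens_per_second(4800 * 8, 2, 37)
    print(f"CPU platform upper bound: {cpu_bound:.1f} tokens/s")
    print(f"GPU platform upper bound: {gpu_bound:.0f} tokens/s")
```

Real quantization formats store slightly more than the nominal bits per weight, so the true bound is somewhat lower than this sketch reports.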
-
With support for only 1 CPU in Windows, I guess the other CPU works only as an extra memory controller? If you load a very large model and use a large context, so that memory usage is spread across both CPUs, your performance may be limited by the xGMI interconnect between the CPUs (it adds latency to memory accesses and reduces bandwidth). See https://lenovopress.lenovo.com/lp1852.pdf, page 10. Also: 2 x 9124, Windows,
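To get a feel for how much cross-socket traffic could cost, here is a deliberately simple Python sketch. It assumes local and remote reads are serialized, so the effective rate is a weighted harmonic mean of the two bandwidths. The function name is hypothetical, and the bandwidth numbers in the example are placeholders, not measured xGMI or DRAM figures for this board.

```python
def effective_bandwidth_gb_s(local_gb_s: float,
                             remote_gb_s: float,
                             remote_fraction: float) -> float:
    """Toy model: blended bandwidth when a share of reads crosses sockets.

    Assumption: time per GB is the weighted sum of local and remote
    transfer times (no overlap between the two kinds of access).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    time_per_gb = ((1.0 - remote_fraction) / local_gb_s
                   + remote_fraction / remote_gb_s)
    return 1.0 / time_per_gb


if __name__ == "__main__":
    # Placeholder values only: 200 GB/s local, 100 GB/s over the interconnect.
    for fraction in (0.0, 0.25, 0.5):
        blended = effective_bandwidth_gb_s(200, 100, fraction)
        print(f"{fraction:.0%} remote reads -> {blended:.1f} GB/s")
```

Even under this crude model, a modest share of remote accesses pulls the blended rate well below the local figure, which is the effect the comment describes.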
-
Yeah, I'm upgrading from an Intel i5-13400F and a 4090, so 2x 9124 and the rest of the hardware was as far as I could stretch. Anyway, the TPM chip has arrived, so I can finish my migration. I plan to do testing next weekend. Hopefully no more issues will arise... Thanks to all of you, there is a solid chance that the system will finally work properly. I think I now understand why the system behaved that way. I will report back on the results if all turns out well, though with a relatively weak CPU pair and 1-Rank RDIMMs, I doubt it will match your score. PS. I always had this fear that echoes of my existence would fade into oblivion. But you've got that covered for me too, so that's another issue solved. You're good at this...
-
I'm trying to set up dual-CPU inference on 2x EPYC 9154 (2x16 physical cores) with 384 GB RAM (24x16 GB) on Win10.
When I run DeepSeek-R1 Q2 (600B+), inference speed ranges from 2.38 tk/s at 500 ctx (~200 GB RAM) to 0.66 tk/s at 28k ctx (~300 GB RAM). The declared memory bandwidth is 460 GB/s per socket, or ~900 GB/s for the dual-CPU config, and the system has 24x16 GB RAM modules (from a vendor-qualified manufacturer), so I would expect faster inference at long context, somewhere around 2-3 tk/s, not 0.66.
But when I run a smaller model (32B Q8_0) with a small ctx (~600), I again get only 2.7 tk/s, even though it uses just ~30 GB RAM. Given the available memory bandwidth, I would expect inference to be significantly faster. So I'm pretty sure I messed something up in my setup.
This is my cmd:
llama-cli --model "C:\xLLMs\bartowski\FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview-GGUF\FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview-Q8_0.gguf" --ctx-size 600 -n 450 --cache-type-k q8_0 --cache-type-v f16 --threads 32 --numa distribute --prio 2 --temp 0.65 --top_k 40 --top_p 0.9 --min-p 0.05 --seed 3407 --flash-attn -no-cnv --prompt "<|User|>Why is the sky blue?<|Assistant|>"
I use the latest llama.cpp Win64 AVX512 binary. I set Win10 to the High performance power plan and the BIOS to HPC (hyperthreading / SMT disabled), memory speed 4800 MT/s, cTDP 320 W per CPU, P-states and C-states disabled...
The one setting I couldn't find in the BIOS was the NUMA mode (the motherboard is a Gigabyte MZ73-LM1); I will look for it again.
What else can I check? Would appreciate some guidelines or advice on this.
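One way to quantify the gap described above is to turn the observed speed back into an implied memory read rate. The Python sketch below does that for the 32B Q8_0 run, assuming roughly the whole ~30 GB footprint is read once per token (an approximation, since that figure also includes the KV cache). The function names are illustrative only.

```python
def implied_bandwidth_gb_s(tokens_per_second: float,
                           gb_read_per_token: float) -> float:
    """Memory read rate implied by an observed decode speed.

    Assumption: each token requires reading gb_read_per_token once.
    """
    return tokens_per_second * gb_read_per_token


def bandwidth_utilization(tokens_per_second: float,
                          gb_read_per_token: float,
                          declared_gb_s: float) -> float:
    """Implied read rate as a fraction of the declared bandwidth."""
    implied = implied_bandwidth_gb_s(tokens_per_second, gb_read_per_token)
    return implied / declared_gb_s


if __name__ == "__main__":
    # Figures from the post: 2.7 tk/s, ~30 GB in use, 460 GB/s per socket.
    used = implied_bandwidth_gb_s(2.7, 30)
    share = bandwidth_utilization(2.7, 30, 460)
    print(f"implied read rate: {used:.0f} GB/s")
    print(f"share of one socket's declared bandwidth: {share:.0%}")
```

With the numbers from the post, the implied read rate comes out to roughly 81 GB/s, well under a fifth of a single socket's declared bandwidth, which supports the suspicion that something in the setup is holding it back.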