Speculative decoding potential for running big LLMs on consumer grade GPUs efficiently #10466
-
Would there be any benefit in pruning a 0.5B model down to be even smaller? From your examples above it looks like the draft model's size has the biggest effect on the speedup. You could prune the later layers like this: https://arxiv.org/abs/2403.17887, but with a calibration dataset you could probably prune down the width of the MLP hidden state quite significantly too... I think you could even apply L1 regularisation during fine-tuning to sparsify the weights and then remove all those close to zero (see the sketch at the end of this comment), but the effectiveness of this would depend on whether the induced sparsity was evenly distributed across the corresponding tensors in each layer (which, from the paper above, I doubt is the case).

It would be interesting to see where the balance point between "tiny and fast/dumb" vs "small but slower/less-dumb" actually is. If using greedy speculation then it won't make any difference, but if you have to actually apply the softmax (instead of just finding the maximum logit), then for something like coding using only English it would be perfectly valid to remove a lot (most) of the tokens and prune down the output layer accordingly.
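Roughly what I mean by the L1 idea, as a hand-wavy PyTorch-style sketch (the "mlp" name filter and the hyperparameters are just assumptions, and none of this is llama.cpp code):

```python
def finetune_with_l1(model, dataloader, optimizer, l1_lambda=1e-5, max_steps=1000):
    """Fine-tune while penalizing the L1 norm of the MLP projections so that
    near-zero rows/columns can be pruned out afterwards."""
    # Assumes the MLP projection weights have "mlp" in their parameter names.
    mlp_params = [p for n, p in model.named_parameters() if "mlp" in n]
    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss
        l1 = sum(p.abs().sum() for p in mlp_params)   # sparsity penalty
        (loss + l1_lambda * l1).backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 >= max_steps:
            break
    return model
```

After that, columns of the up-projection (and the matching rows of the down-projection) whose magnitude stays below some threshold could be dropped to shrink the MLP hidden width.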
-
Just thinking about this some more and wondering how feasible it would be:
I'm thinking along the lines of using the draft model to create a tree (with probabilities on the edges and tokens in the nodes), and then using it to decide on a set of batches for the larger model to generate in parallel, something like the sketch below. If we constrain the branching factor to a fixed k, then we can again use hinge loss to try to pick the top-k using k-vs-all. I don't have a good idea of how the cost of batch processing grows, though, and it all depends on that.
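Something like this is what I have in mind for the tree (pure illustrative Python; `draft_topk` is a stand-in for one step of the draft model, not a real llama.cpp call):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    token: int
    prob: float                 # path probability up to and including this token
    children: list = field(default_factory=list)

def build_draft_tree(prefix, draft_topk, k=2, depth=4):
    """Expand a token tree with fixed branching factor k and fixed depth.
    draft_topk(ctx, k) -> [(token, prob), ...] from one draft-model step."""
    root = Node(token=prefix[-1], prob=1.0)
    frontier = [(root, list(prefix))]
    for _ in range(depth):
        next_frontier = []
        for node, ctx in frontier:
            for tok, p in draft_topk(ctx, k):
                child = Node(tok, node.prob * p)
                node.children.append(child)
                next_frontier.append((child, ctx + [tok]))
        frontier = next_frontier
    return root
```

Each root-to-leaf path would then be one candidate continuation, and the highest-probability subset of paths would form the batch the target model checks in parallel. Note the tree has k^depth leaves, so either k or depth has to stay small (or low-probability branches get pruned) to keep the batch from blowing up.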
-
Testing my server rebase for regressions after all the recent changes, along with a few new "LRM" models (Marco-o1 and QwQ) and RPC mode. The spec algo I implemented is greedy match with a fixed-size draft block and no probability computation. Hardware: RTX 4070.
GOLDCOIN:
HUMANEVAL 1ST PROBLEM:
-
@steampunque any update on this?
-
I recently added an efficient greedy-only spec decode to my downstream server patch (a completely different implementation from the current spec decode PR). I then evaluated tg performance for two cases: 1) solve the first HumanEval problem with a coding model, and 2) solve the goldcoin problem with a general model. I used Qwen 14B for the target and 0.5B, 1.5B, and 3B for the drafts. I evaluated tg vs. draft token length on a 4070 with the target and draft weights fully offloaded, where the target is an IQ4_XS quant and the draft is a Q6_K quant.
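For reference, the accept/reject logic I mean by greedy match is roughly this (illustrative Python only, not the actual patch; `draft_greedy` and `target_greedy_batch` are made-up stand-ins for draft and target forward passes that return argmax tokens only, no probs computed):

```python
def speculative_step(ctx, draft_greedy, target_greedy_batch, n_draft=8):
    # 1. Draft a fixed-size block of tokens greedily with the small model.
    draft, d_ctx = [], list(ctx)
    for _ in range(n_draft):
        t = draft_greedy(d_ctx)
        draft.append(t)
        d_ctx.append(t)

    # 2. One batched target pass over the drafted positions; element i is the
    #    token the target itself would pick after ctx + draft[:i].
    target = target_greedy_batch(ctx, draft)        # length n_draft + 1

    # 3. Accept drafted tokens while they match the target's own greedy choice,
    #    then append the target's token at the first mismatch (or the bonus
    #    token if the whole block matched).
    accepted = []
    for i, t in enumerate(draft):
        if t != target[i]:
            break
        accepted.append(t)
    accepted.append(target[len(accepted)])
    return accepted                                  # always >= 1 new token
```

The speedup comes from one batched target pass over n_draft+1 positions costing roughly the same as a single-token pass while the batch is small, so every additional accepted token is nearly free.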
HUMANEVAL first problem:
TARGET Qwen2.5-Coder-14B-Instruct
DRAFTS Qwen2.5-Coder-0.5B-Instruct, Qwen2.5-Coder-1.5B-Instruct, Qwen2.5-Coder-3B-Instruct
TPS vs draft tokens:
GOLDCOIN
I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? Use step-by-step reasoning to solve this problem.
TARGET Qwen2.5-14B-Instruct
DRAFTS Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct
TPS vs draft tokens:
TARGET Llama 3.1 8B Instruct
DRAFT Llama 3.2 1B Instruct
TPS vs draft tokens:
TARGET Gemma 2 9B it IQ4_XS
DRAFT Gemma 2 2B it IQ4_XS
TPS vs draft tokens:
Results Summary:
Coding shows a max speedup of 2.5x tg at 10 draft tokens speculated using the 0.5B model. At 1.5B draft the max speedup is 1.63x at 4 draft tokens. At 3B draft the max speedup is 1.33x at 4 draft tokens. The efficiency crossover (where draft+target is the same speed as no draft; see the rough model after this summary) is >32 draft tokens for 0.5B, >16 draft tokens for 1.5B, and 11 draft tokens for 3B.
Goldcoin shows a max speedup of 1.4x tg at 4 draft tokens speculated using the 0.5B model. At 1.5B draft the max speedup is 1.17x at 4 draft tokens. At 3B draft the max speedup is 1.08x at 1 draft token. The efficiency crossover is 12 tokens for 0.5B, 6 tokens for 1.5B, and 3 tokens for 3B.
With Llama 3.1 8B Instruct drafted by Llama 3.2 1B Instruct, a token gen speedup of 1.83x is found at 5 draft tokens.
With Gemma 2 9B it drafted by Gemma 2 2B it there is never any speculative decoding speedup. My guess is the 2B was not distilled from the 9B at all but was trained on a completely different data set.
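To put a rough model behind the efficiency crossover numbers (back-of-the-envelope only, not measured data): assume one draft step costs c_d, one batched target pass costs about c_t independent of the draft length n while the batch stays small, and each drafted token matches the target's greedy choice independently with probability p. Then:

```python
def relative_speedup(n, p, c_d, c_t):
    """Expected tokens per unit time with n draft tokens, relative to no draft."""
    expected_tokens = sum(p**i for i in range(1, n + 1)) + 1   # +1 target/bonus token
    cost = n * c_d + c_t                                       # draft block + one target pass
    return (expected_tokens / cost) / (1.0 / c_t)              # baseline: 1 token per c_t

# Illustrative numbers only (p and c_d are assumptions, not measurements):
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(relative_speedup(n, p=0.8, c_d=0.05, c_t=1.0), 2))
```

With a high match rate (coding) the curve peaks late and crosses 1.0 only at large n; with a lower match rate (general text) it peaks at a few draft tokens and crosses 1.0 early, which is qualitatively what the measured numbers above show.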
Conclusions and potential for running big LLMs on consumer grade GPUs:
A small draft model is needed (sine qua non). The 0.5B size seems to work well. Any model in the 8B-or-larger range can benefit from distilling a 0.5B draft and speculating against it. Returns fall off rapidly as the draft gets bigger: already questionable at 1.5B and not really useful at a 3B draft. Coding benefits far more from speculation than general text gen. The Qwen 2.5 series is perfect for exploiting the potential of speculation.
For running big LLMs on consumer grade GPUs with limited memory, it is desirable to avoid storing all the model weights and the output layer in VRAM, because there is not enough room. Most of the model weights are sitting there doing nothing most of the time: in a 32-layer model, each layer's weights sit idle in VRAM 31/32 of the time. To get around this, layers need to be dynamically swapped into VRAM as they are needed from CPU RAM, which normally has much higher capacity.

If the draft length at the efficiency crossover is big enough, there may be (emphasis on may, it needs to be investigated for feasibility) enough time to compute the target batch (say 8 to 10 drafted tokens) while simultaneously transferring the next layer into the GPU. The GPU then needs one working-layer allocation and one transfer allocation (two resident layers total, ping-ponged between compute and transfer; rough sketch below) plus a fully offloaded speculator. The KV caches for both the speculator and the target should also be in GPU memory. Even when it is necessary to go above the efficiency crossover, dynamic layer loading to the GPU can still be more efficient than offloading to the CPU, which is an immediate 10x or higher slowdown due to memory bandwidth limits.
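A rough illustration of the ping-pong idea in PyTorch terms (nothing llama.cpp-specific; the per-layer modules, pinned CPU weights, and two preallocated GPU buffer layers are assumptions of the sketch):

```python
import torch

copy_stream = torch.cuda.Stream()

def prefetch(gpu_buffer, cpu_layer):
    """Asynchronously copy one layer's (pinned) CPU weights into a GPU buffer."""
    # Don't overwrite the buffer until the compute previously queued on it is done.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        for dst, src in zip(gpu_buffer.parameters(), cpu_layer.parameters()):
            dst.data.copy_(src.data, non_blocking=True)

def run_layers(cpu_layers, gpu_buffers, hidden):
    """cpu_layers: per-layer modules with weights in pinned CPU RAM.
    gpu_buffers: two resident GPU copies of a layer, reused alternately."""
    prefetch(gpu_buffers[0], cpu_layers[0])                    # warm up buffer 0
    for i in range(len(cpu_layers)):
        torch.cuda.current_stream().wait_stream(copy_stream)   # layer i is loaded
        if i + 1 < len(cpu_layers):
            prefetch(gpu_buffers[(i + 1) % 2], cpu_layers[i + 1])
        hidden = gpu_buffers[i % 2](hidden)                    # compute overlaps the copy
    return hidden
```

Whether the copy actually hides behind the compute depends on the layer size vs. PCIe bandwidth and on how long the batched target pass takes, which is exactly the feasibility question above.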