[BUG] When trying inference with Qwen2.5-VL-72B with Qwen2.5-VL-7B as a draft model, I get "IndexError: index out of range in self" (both models have identical vocab.json) #733
Labels: bug
OS: Linux
GPU Library: CUDA 12.x
Python version: 3.12
PyTorch version: Latest, installed via requirements.txt
Model: No response
Describe the bug
I get "IndexError: index out of range in self" when trying to run the vision model (Qwen2.5-VL-72B) with a draft model (Qwen2.5-VL-7B). Both models have identical vocab.json files and belong to the same model family.
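For context, this error text usually comes from a torch.nn.Embedding lookup receiving a token id larger than the embedding table, so even with identical vocab.json files the two checkpoints could still disagree on the padded vocab_size in config.json. A minimal diagnostic sketch, assuming hypothetical local paths for the two exl2 checkpoints:

```python
import json
from pathlib import Path

# Hypothetical local checkpoint paths; adjust to wherever the exl2 quants live.
BASE = Path("~/models/Qwen2.5-VL-72B-Instruct-exl2-6.0bpw").expanduser()
DRAFT = Path("~/models/Qwen2.5-VL-7B-Instruct-exl2-4.0bpw").expanduser()

def read_json(model_dir: Path, name: str) -> dict:
    return json.loads((model_dir / name).read_text())

# The token -> id mapping should match entry for entry...
same_vocab = read_json(BASE, "vocab.json") == read_json(DRAFT, "vocab.json")
print("vocab.json identical:", same_vocab)

# ...but the embedding tables can still differ if config.json declares a
# different (padded) vocab_size; a token id beyond the smaller table is
# exactly what raises "index out of range in self".
for model_dir in (BASE, DRAFT):
    cfg = read_json(model_dir, "config.json")
    print(model_dir.name, "vocab_size =", cfg.get("vocab_size"))
```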
Reproduction steps
If I run the model with a draft model and try inference, I get the IndexError:

```
cd /home/lissanro/pkgs/tabbyAPI/ && ./start.sh --vision True --autosplit-reserve 512 --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw --cache-mode Q6 --draft-model-name Qwen2.5-VL-7B-Instruct-exl2-4.0bpw --draft-cache-mode=Q4 --max-seq-len 65536
```
I tested with a 54136-token prompt that also contains a few images, in case it matters.
If I run without a draft model, it works fine:
```
cd /home/lissanro/pkgs/tabbyAPI/ && ./start.sh --vision True --autosplit-reserve 512 --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq --cache-mode Q6 --max-seq-len 65536
```
Without a draft model, the model generated a reply without issues.
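If the config values agree, a second hedged check is to read the embedding row counts straight from the safetensors shards. Tensor names vary between checkpoints (VL models in particular may prefix them), so this sketch just scans for anything containing embed_tokens:

```python
from pathlib import Path
from safetensors import safe_open

def embedding_shapes(model_dir: str) -> dict:
    """Report the shape of every tensor whose name mentions embed_tokens."""
    shapes = {}
    for shard in sorted(Path(model_dir).expanduser().glob("*.safetensors")):
        with safe_open(shard, framework="pt") as f:
            for key in f.keys():
                if "embed_tokens" in key:
                    shapes[key] = tuple(f.get_slice(key).get_shape())
    return shapes

# Hypothetical paths again; the first dimension of each shape is the number
# of token ids the model can actually embed.
for d in ("~/models/Qwen2.5-VL-72B-Instruct-exl2-6.0bpw",
          "~/models/Qwen2.5-VL-7B-Instruct-exl2-4.0bpw"):
    print(d, embedding_shapes(d))
```

If the two first dimensions differ, the draft model cannot safely verify tokens sampled from the larger model's distribution, which would match the symptom here.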
Expected behavior
A performance boost from speculative decoding instead of an error.
Logs
Additional context
No response