
[BUG] When trying inference with Qwen2.5-VL-72B with Qwen2.5-VL-7B as a draft model, I get "IndexError: index out of range in self" (both models have identical vocab.json) #733

Lissanro opened this issue Feb 6, 2025 · 3 comments
Labels: bug (Something isn't working)

Comments


Lissanro commented Feb 6, 2025

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

Latest, as installed via requirements.txt

Model

No response

Describe the bug

I get errors when trying to run the vision model (Qwen2.5-VL-72B) with a draft model (Qwen2.5-VL-7B). Both models have the same vocabulary and belong to the same model family.
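
For reference, a quick way to confirm the vocabularies really are identical (a sketch using the model paths from the server log below; it assumes both model directories contain a vocab.json, as noted in the title):

```python
import json

# Model directories taken from the server log below.
base = "/mnt/neuro/text-generation-webui/models"
paths = [
    f"{base}/Qwen2.5-VL-72B-Instruct-exl2-6.0bpw/vocab.json",
    f"{base}/Qwen2.5-VL-7B-Instruct-exl2-4.0bpw/vocab.json",
]

# Each vocab.json maps token string -> token ID.
main_vocab, draft_vocab = (json.load(open(p)) for p in paths)

# Expected to print True, per the identical vocab.json noted in the title.
print(main_vocab == draft_vocab)
```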

Reproduction steps

If I run the model with a draft model and try inference, I get the error shown in the Logs section below. The command used:
cd /home/lissanro/pkgs/tabbyAPI/ && ./start.sh --vision True --autosplit-reserve 512 --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw --cache-mode Q6 --draft-model-name Qwen2.5-VL-7B-Instruct-exl2-4.0bpw --draft-cache-mode=Q4 --max-seq-len 65536

I tested with a 54,136-token prompt that also contains a few images, in case that matters.

If I run without a draft model, it works fine:

cd /home/lissanro/pkgs/tabbyAPI/ && ./start.sh --vision True --autosplit-reserve 512 --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq --cache-mode Q6 --max-seq-len 65536

Without a draft model, the model generated a reply without issues.

Expected behavior

A performance boost from speculative decoding, instead of an error.

Logs

> cd /home/lissanro/pkgs/tabbyAPI/ && ./start.sh --vision True --autosplit-reserve 512 --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw --cache-mode Q6 --draft-model-name Qwen2.5-VL-7B-Instruct-exl2-4.0bpw --draft-cache-mode=Q4 --max-seq-len 65536
Activating venv
pip 24.0 from /home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/pip (python 3.12)
Loaded your saved preferences from `start_options.json`
Starting TabbyAPI...
INFO:     ExllamaV2 version: 0.2.7
INFO:     Your API key is: a2cdb0c05aa3bbc7dd749016257e48a3
INFO:     Your admin key is: 9f77b1c5bc877f4d4ec6c2ef5194824b
INFO:     
INFO:     If these keys get compromised, make sure to delete api_tokens.yml and restart the server. Have fun!
INFO:     Generation logging is disabled
WARNING:  The given cache_size (65536) is less than 2 * max_seq_len and may be too small for requests using CFG. 
WARNING:  Ignore this warning if you do not plan on using CFG.
INFO:     Attempting to load a prompt template if present.
INFO:     Using template "from_tokenizer_config" for chat completions.
INFO:     Loading draft model: /mnt/neuro/text-generation-webui/models/Qwen2.5-VL-7B-Instruct-exl2-4.0bpw
INFO:     Loading model: /mnt/neuro/text-generation-webui/models/Qwen2.5-VL-72B-Instruct-exl2-6.0bpw
INFO:     Loading with autosplit
INFO:     Model successfully loaded.
Loading draft modules  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 59/59   0:00:00
Loading vision modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 66/66   0:00:00
Loading model modules  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 0:00:00
INFO:     Developer documentation: http://127.0.0.1:5000/redoc
INFO:     Starting OAI API
INFO:     Completions: http://127.0.0.1:5000/v1/completions
INFO:     Chat completions: http://127.0.0.1:5000/v1/chat/completions
INFO:     Started server process [2237297]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)
INFO:     127.0.0.1:59380 - "GET /v1/models HTTP/1.1" 200
INFO:     127.0.0.1:59380 - "POST /v1/chat/completions HTTP/1.1" 200
INFO:     Received chat completion streaming request 3de41614004e4bac9ffca988ea9e5286
ERROR:    FATAL ERROR with generation. Attempting to recreate the generator. If this fails, please restart the server.

WARNING:  Immediately terminating all jobs. Clients will have their requests cancelled.

ERROR:    Traceback (most recent call last):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 357, in stream_generate_chat_completion
ERROR:        raise generation
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/utils/completion.py", line 100, in _stream_collector
ERROR:        async for generation in new_generation:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 1491, in generate_gen
ERROR:        raise ex
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 1399, in generate_gen
ERROR:        async for result in job:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/generator/dynamic_async.py", line 97, in __aiter__
ERROR:        raise result
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/generator/dynamic_async.py", line 28, in _run_iteration
ERROR:        results = self.generator.iterate()
ERROR:                  ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/generator/dynamic.py", line 980, in iterate
ERROR:        job.prefill(results)
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/generator/dynamic.py", line 2424, in prefill
ERROR:        self.generator.draft_model.forward_chunk(
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/model.py", line 1012, in forward_chunk
ERROR:        x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
ERROR:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/embedding.py", line 182, in forward
ERROR:        hidden_states = self.embedding(hidden_states)
ERROR:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR:        return self._call_impl(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR:        return forward_call(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/nn/modules/sparse.py", line 190, in forward
ERROR:        return F.embedding(
ERROR:               ^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/nn/functional.py", line 2551, in embedding
ERROR:        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:    IndexError: index out of range in self
ERROR:    Sent to request: Chat completion aborted. Please check the server console.
^CINFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2237297]
WARNING:  Shutdown signal called. Exiting gracefully.
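
For context on the final IndexError: it is the generic error torch.nn.Embedding raises when indexed with a token ID outside its table, so the draft model's prefill appears to receive IDs its embedding layer does not cover (possibly placeholder IDs for the image embeddings that only the main model's multimodal path handles). Below is a minimal sketch of just that failure mode; the vocabulary size and IDs are illustrative assumptions, not values read from the models:

```python
import torch
import torch.nn as nn

# Illustrative numbers only; not taken from the actual models.
vocab_size, hidden_dim = 152064, 32
embedding = nn.Embedding(vocab_size, hidden_dim)

in_range = torch.tensor([0, 151643])       # normal token IDs: fine
print(embedding(in_range).shape)           # torch.Size([2, 32])

out_of_range = torch.tensor([vocab_size])  # e.g. a multimodal placeholder ID
embedding(out_of_range)                    # IndexError: index out of range in self
```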

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
Lissanro added the bug label on Feb 6, 2025
DocShotgun (Contributor) commented

I'm pretty certain that nobody has tested vision+speculative decoding lol, so it may just be a case that doesn't work currently.

Assuming you are using the dev branch of exl2 that supports Qwen2.5-VL?

remichu-ai commented

I tested Qwen VL with speculative decoding. It is not working 🥹

Lissanro (Author) commented Feb 7, 2025

Yes, I am using the latest dev branch. The new Qwen2.5-VL 72B is pretty good as both a text and a vision model, and it comes with a perfect 7B draft model too, so hopefully this can be fixed; a 1.5-2x speedup would be a great help! I tried to look at the code and debug it myself, but it seems I don't have the required knowledge, and I could not figure out how to fix it.
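
If someone with the exllamav2 internals fresh in mind wants to confirm the suspicion, one low-effort probe (a sketch; `draft_embedding` is a hypothetical handle to the draft model's underlying torch.nn.Embedding, not an actual tabbyAPI or exllamav2 attribute) would be a forward pre-hook that logs any out-of-range token IDs just before the crash:

```python
import torch.nn as nn

def log_out_of_range_ids(module: nn.Embedding, args):
    # Forward pre-hook: inspect token IDs right before the embedding lookup.
    input_ids = args[0]
    max_id = int(input_ids.max())
    if max_id >= module.num_embeddings:
        print(f"out-of-range token ID {max_id} >= table size {module.num_embeddings}")

# draft_embedding: hypothetical reference to the draft model's nn.Embedding.
handle = draft_embedding.register_forward_pre_hook(log_out_of_range_ids)
```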
