Let's port these two kernels over to GPTQModel as well for simple inference.
AllSpark by @wyajieha: [Misc][Kernel]: Add GPTQAllSpark Quantization vllm-project/vllm#12931
The difference is that the AllSpark kernel only supports bits=8, group_size=-1, and desc_act=False.
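For reference, a minimal sketch of the compatibility guard a kernel selector would need for those constraints. The `QuantizeConfig` fields mirror GPTQModel's config naming, but the selector function itself is hypothetical, not an existing API:

```python
# Hypothetical sketch: gate AllSpark on its supported quantization config
# (bits=8, group_size=-1 i.e. per-channel, desc_act=False).
from dataclasses import dataclass


@dataclass
class QuantizeConfig:
    bits: int = 4
    group_size: int = 128
    desc_act: bool = False


def allspark_compatible(cfg: QuantizeConfig) -> bool:
    # AllSpark only handles 8-bit, per-channel, non-act-order quantization.
    return cfg.bits == 8 and cfg.group_size == -1 and not cfg.desc_act


# An 8-bit per-channel config passes; a 4-bit grouped one must fall back
# to another kernel (e.g. Exllama or Marlin).
assert allspark_compatible(QuantizeConfig(bits=8, group_size=-1))
assert not allspark_compatible(QuantizeConfig(bits=4, group_size=128))
```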
Exllama vLLM: structurally very different from the existing v1/v2. We need to benchmark and validate accuracy. If this kernel is good, let's retire Exllama v1/v2.
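A rough sketch of what that accuracy validation could look like: compare perplexity of one GPTQ checkpoint across kernel backends. This assumes GPTQModel's `GPTQModel.load(..., backend=BACKEND.*)` loader and that the wrapper forwards a labels-based call to the underlying HF model; the model id is a placeholder. Verify all of this against the repo docs before relying on it:

```python
# Sketch: perplexity comparison between kernel backends on wikitext-2.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from gptqmodel import BACKEND, GPTQModel  # loader API assumed from README

MODEL_ID = "TheBloke/Llama-2-7B-Chat-GPTQ"  # placeholder GPTQ checkpoint


def perplexity(backend: BACKEND, n_samples: int = 32) -> float:
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = GPTQModel.load(MODEL_ID, backend=backend)
    texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
    texts = [t for t in texts if len(t.split()) > 16][:n_samples]
    nlls = []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids.to("cuda")
        with torch.no_grad():
            # HF causal-LM loss = mean next-token NLL over the sequence
            nlls.append(model(input_ids=ids, labels=ids).loss.float())
    return torch.exp(torch.stack(nlls).mean()).item()


# Retire v1/v2 only if the candidate matches within noise (and is faster).
for be in (BACKEND.EXLLAMA_V2, BACKEND.AUTO):
    print(be, perplexity(be))
```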
HF Transformers uses GPTQModel kernels for GPTQ models, so this will benefit all HF Transformers API loading.
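Concretely, that's the standard Transformers path below: the GPTQ `quantization_config` is read from the checkpoint's `config.json`, so any kernel improvement here is picked up with no user-facing changes (the model id is just a placeholder):

```python
# Loading a GPTQ checkpoint through the plain AutoModel API; quantized
# layers dispatch to the GPTQModel kernels under the hood.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # any GPTQ-quantized checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("GPTQ models load like any other:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```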