Enable Sparse compression #822
base: main
Conversation
src/llmcompressor/transformers/compression/quantization_format.py
src/llmcompressor/transformers/sparsification/compressed_tensors_utils.py
These changes highlight how tightly tied sparsity and quantization are (the quantization format now depends on the sparsity structure, and the sparsity config now depends on the quantization format).
I think this is the proper way to do it (first infer the format, then use the format to generate the configs), but I hope that in the future we can structure the classes so that the flow is clearer and users have an easier time understanding and using the tools.
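As a rough illustration of that "infer the format first, then build the configs" ordering, here is a minimal, self-contained sketch; the function names, format strings, and selection logic are illustrative assumptions, not the actual llm-compressor API:

```python
from typing import Optional


def infer_quantization_format(sparsity_structure: Optional[str],
                              is_quantized: bool) -> Optional[str]:
    """Pick a quantization format, taking the sparsity structure into account."""
    if not is_quantized:
        return None
    # Illustrative rule: 2:4 structured sparsity maps to a sparsity-aware format.
    return "marlin-24" if sparsity_structure == "2:4" else "pack-quantized"


def infer_sparsity_config(sparsity_structure: Optional[str],
                          quant_format: Optional[str]) -> Optional[dict]:
    """Build the sparsity config using the already-chosen quantization format."""
    if sparsity_structure is None:
        return None
    if quant_format == "marlin-24":
        # The quantized format already encodes the sparsity; skip sparse compression.
        return None
    return {"format": "sparse-bitmask", "structure": sparsity_structure}


# Usage: the quantization format is inferred first, then fed into the sparsity config.
fmt = infer_quantization_format("2:4", is_quantized=True)
cfg = infer_sparsity_config("2:4", fmt)
print(fmt, cfg)
```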
Force-pushed from a29da0d to 86cd1d9: Special condition for marlin_24 compressor; update tests. Signed-off-by: Rahul Tuli <[email protected]>
Force-pushed from 86cd1d9 to 43cb1d7. Signed-off-by: Rahul Tuli <[email protected]>
We should add a couple of e2e tests to run sample generation in llm-compressor.
We can then eventually expand on them to also run in vllm (once we've integrated).
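A minimal sketch of what such an e2e sample-generation test could look like, assuming a standard transformers load-and-generate flow; the model path is a placeholder and the test layout is an assumption, not the test that was eventually added:

```python
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer


@pytest.mark.parametrize("model_stub", ["<compressed-model-path>"])  # placeholder path
def test_sample_generation(model_stub):
    model = AutoModelForCausalLM.from_pretrained(model_stub)
    tokenizer = AutoTokenizer.from_pretrained(model_stub)

    inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)

    # Smoke check: generation produced output that decodes cleanly.
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    assert len(text) > 0
```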
```python
@staticmethod
def from_pretrained(
    model: Module,
    state_dict: Optional[Dict[str, Tensor]] = None,
    compress: bool = False,
    is_marlin: bool = False,
```
To make this more generic, why not pass in the quantization config? We will certainly have other compression formats that affect sparsity in the future.
The overall structure is good (having the quantization config depend on the sparsity structure is better than having it depend on a sparsity config).
I definitely recommend generalizing is_marlin a bit by passing the compression format directly, so it's easier to support more formats in the future.
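One possible shape of that suggestion, sketched below; the class name and the quantization_format parameter are hypothetical and only meant to show how the boolean flag could be replaced by a format argument:

```python
from typing import Dict, Optional

from torch import Tensor
from torch.nn import Module


class SparsityConfig:  # hypothetical class name, for illustration only
    @staticmethod
    def from_pretrained(
        model: Module,
        state_dict: Optional[Dict[str, Tensor]] = None,
        compress: bool = False,
        quantization_format: Optional[str] = None,  # e.g. "marlin-24"
    ) -> Optional["SparsityConfig"]:
        # Any format that already encodes sparsity can be special-cased here,
        # rather than hard-coding a marlin-only boolean.
        ...
```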
Just summarizing offline discussions:
Although marlin_24 should no longer be the default, we still need to make sure we have a pathway to enable it, such as through the addition of a "format" argument in the recipe.
This PR makes compressors composable. After this change, both sparse and quantization compression can be applied to a sparse quantized model.
Notable Changes:
- Sparse compression is applied using a sparse compressor (sparse_bitmask or sparse_24).
- A special case is handled when the marlin_24 compressor is used.

Dependencies:

Test Script
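The test script itself is not reproduced above. Purely as a hedged illustration of the composable flow this PR describes, a save call might look roughly like the following; the model class, model stub, and the save_compressed keyword reflect the llm-compressor API around the time of this PR and are assumptions, not the original script:

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM

# Load a model that is both sparse and quantized (placeholder stub).
model = SparseAutoModelForCausalLM.from_pretrained("<sparse-quantized-model-stub>")

# With composable compressors, the saved checkpoint can carry both a
# sparsity_config and a quantization_config in its compression metadata.
model.save_pretrained("./compressed-model", save_compressed=True)
```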