[GPTQ] Vision Model Support #850
base: main
Conversation
I tried this on microsoft/Phi-3.5-vision-instruct, and once the GPTQ initialization finished, I saw this failure.
I think the error message's suggestion is probably right.
After replacing that function, I was able to produce an INT8 W8A8 model that loads in vLLM and seems to give reasonable output! https://huggingface.co/nm-testing/Phi-3.5-vision-instruct-W8A8-Dynamic-Per-Token However, based on the speed of compression and the logs, it seems like the calibration was not actually performed (happy to be incorrect on this):
Here is the full script I used to test this:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

from llmcompressor.modifiers.quantization import GPTQModifier
# from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

# Select model and load it.
MODEL_ID = "microsoft/Phi-3.5-vision-instruct"
model_class = wrap_hf_model_class(AutoModelForCausalLM)
model = model_class.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",
)
processor = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": processor.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return processor(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)
print(ds)

# Configure algorithms. In this case, we:
# * apply SmoothQuant to make the activations easier to quantize
# * quantize the weights to int8 with GPTQ (static per channel)
# * quantize the activations to int8 (dynamic per token)
# Note: set sequential_update: true in the recipe to reduce memory
ignore = ["re:.*lm_head", "re:model.vision_embed_tokens.*"]
recipe = [
    # SmoothQuantModifier(smoothing_strength=0.8, ignore=ignore),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=ignore),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = processor("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(processor.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
@mgoin Nice, indeed the calibration only uses the first sample right now.
Looks good. We should add an updated lifecycle docstring since we're no longer using the GPTQ wrapper. Something like this:
https://github.com/neuralmagic/compressed-tensors/blob/232e4944b84798bd05fddc18a7752ae2b5d460da/src/compressed_tensors/compressors/base.py#L29 or, for example:
"Run calibration if running input/output activation quantization or kv_cache"
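For illustration only, here is a rough sketch of the kind of lifecycle summary that docstring could contain; the step names and wording below are assumptions based on this thread, not the final text:

```python
# Illustrative sketch only: a possible lifecycle summary for GPTQModifier now
# that the per-layer GPTQ wrapper is gone. Step names are assumptions.
LIFECYCLE_DOCSTRING = """
Lifecycle:
    - on_initialize
        - apply the quantization config to the targeted modules
        - register calibration hooks if input/output activation or kv_cache
          quantization is requested
    - during calibration forward passes
        - accumulate Hessian statistics from the calibration activations
        - quantize weights layer by layer with GPTQ
    - on_finalize
        - remove calibration hooks and observers
        - freeze quantization parameters (scales and zero points)
"""

print(LIFECYCLE_DOCSTRING)
```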
  # decoder layers (ie LlamaDecoderLayer)
- self.sequential_targets = get_no_split_params(modifiable_model)
+ self.sequential_targets = get_no_split_params(state.model)
+ layers = get_layers(self.sequential_targets, state.model)
It would be nice to keep compressible_layers() and have it return the output of get_layers (see the sketch below):
- Slightly better naming/clarity on the layers being returned
- Consistent with the other modifiers
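A minimal sketch of what keeping that helper might look like, assuming the get_no_split_params and get_layers utilities from the diff above; the import path and exact signature here are assumptions, not the actual implementation:

```python
# Sketch only: the import path and signature below are assumptions and may
# differ from the actual codebase.
from llmcompressor.utils.pytorch.module import get_layers, get_no_split_params


class GPTQModifier:
    def compressible_layers(self, state) -> dict:
        # Resolve the sequential targets (e.g. decoder layers such as
        # LlamaDecoderLayer) and return the matching submodules, keeping the
        # naming consistent with the other modifiers.
        self.sequential_targets = get_no_split_params(state.model)
        return get_layers(self.sequential_targets, state.model)
```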
I'd prefer to deprecate this and have it handled by SequentialLayerCompressor, but I'd be fine with keeping it on GPTQModifier.
The hook-based design was initially proposed because of
However, using hooks has proven to be more difficult than expected. The argument goes something like this: suppose a dataset contains several calibration samples that are passed through the model one at a time. A hook cannot return any samples until all samples have been accumulated, and it cannot return more than one sample without batching; forced batching can lead to mismatched shapes and increased memory requirements. The same argument applies to both modules and layers in the case of true_sequential=False.
This isn't a formal proof that hook-based compression and batch_size=1 are incompatible, but the solution is not straightforward, and more thought will be needed as to how the two concepts might work together.
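To make the accumulation issue concrete, here is a small standalone PyTorch sketch (not code from this PR; the layer and sample shapes are made up) showing that a forward hook only sees one calibration sample per call and has nothing to batch until every sample has been run:

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)
captured = []

def capture_hook(module, args, output):
    # Each forward call delivers exactly one calibration sample; the hook
    # can only stash it. Emitting a batch here would require waiting until
    # every sample has been accumulated, and stacking samples of different
    # sequence lengths would force padding (mismatched shapes, extra memory).
    captured.append(args[0].detach())

handle = layer.register_forward_hook(capture_hook)

# Calibration samples with different sequence lengths, run one at a time.
for seq_len in (4, 7, 5):
    layer(torch.randn(1, seq_len, 8))

handle.remove()
print([t.shape for t in captured])
# -> [torch.Size([1, 4, 8]), torch.Size([1, 7, 8]), torch.Size([1, 5, 8])]
```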
Purpose
Prerequisites
Changes
Notes