
FIX GPTQModel Lora Wrapper #2404

Open · wants to merge 12 commits into main
Conversation

@Qubitium (Contributor) commented Feb 28, 2025

PR Changes:

  • FIX: GPTQ linear layers from GPTQModel are not compatible with the PEFT LoRA wrapper
  • Skip the scaling multiply op when scaling == 1 (unsure the GPU is smart enough to no-op this micro-optimization on its own, so doing it manually). The LoRAs we are testing for GPTQ do not use scaling: scaling = lora_alpha / r, and since r == lora_alpha it evaluates to 1 (see the sketch after this list).
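
As a minimal sketch (hypothetical values, not part of this PR): with r == lora_alpha, PEFT's LoRA scaling factor lora_alpha / r is exactly 1, which is the case the skip above targets.

from peft import LoraConfig

# Hypothetical config: r == lora_alpha, so the effective scaling is 1 and the
# extra multiply in the LoRA forward would be a pure no-op.
config = LoraConfig(
    r=128,
    lora_alpha=128,        # r == lora_alpha  ->  scaling == lora_alpha / r == 1
    lora_dropout=0.0,      # the dropout slot becomes nn.Identity
    target_modules=["q_proj", "v_proj"],
)
assert config.lora_alpha / config.r == 1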

Notes:

  • GPTQLoraLinear copies most of the code and structure from AwqLoraLinear.

TODO:

  • Add CI test for GPTQModel + LoRA

@Qubitium Qubitium marked this pull request as draft February 28, 2025 05:34
@Qubitium Qubitium changed the title [WIP] FIX GPTQ Lora Wrapper [WIP] FIX GPTQModel Lora Wrapper Feb 28, 2025
@Qubitium Qubitium changed the title [WIP] FIX GPTQModel Lora Wrapper FIX GPTQModel Lora Wrapper Feb 28, 2025

@Qubitium (Contributor, Author) commented Feb 28, 2025

@BenjaminBossan @SunMarc PR ready for review. My co-worker is writing the PEFT CI test for this, but I want to get the review started early, if possible, before the test is ready. Our GPTQModel tests with PEFT and LoRA are passing.

This PR needs to be paired with GPTQModel PR ModelCloud/GPTQModel#1358, which has a new CI test for LoRA. We are also rounding out the tests on our side and will merge and release v2.0 today or tomorrow.

@SunMarc (Member) left a comment

Thanks! Left a couple of comments.

Comment on lines 492 to 497

    if scaling == 1:  # no scaling
        lora_output = lora_B(lora_A(dropout(sub_batch)))
    else:
        lora_output = lora_B(lora_A(dropout(sub_batch))) * scaling

(Member)

Not sure I see the benefit of doing that.

@Qubitium (Contributor, Author) Feb 28, 2025

The following test shows that PyTorch can't optimize out the obvious no-op: doing the * 1 math is roughly 2x slower for the 3 shapes I tested to simulate the logic. I think the same would be true if we asked the tensors to multiply by 0. It may be faster to zero the tensors than to multiply by 0, or, in this case, to skip the no-op of * 1.

Benchmark Result on A100:

Benchmarking for tensor shape: (256, 256)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000005 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (512, 512)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (1024, 1024)
Operation A: 0.000014 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
import torch
import timeit

tensor_shapes = [(256, 256), (512, 512), (1024, 1024)]
repeats = 100

def benchmark_operation(operation, x, y):
    # Warm-up so one-time CUDA/kernel setup is not measured
    for _ in range(10):
        _ = operation(x, y)
    torch.cuda.synchronize()

    # Note: without a synchronize inside the timed call this mostly measures
    # kernel-launch overhead, which is exactly the cost the `scaling == 1`
    # skip avoids (one kernel launch instead of two).
    timer = timeit.Timer(lambda: operation(x, y))
    time_taken = timer.timeit(number=repeats) / repeats
    return time_taken

scale = 1

def operation_A(x, y):
    # always pays for the multiply, even when scale == 1
    return x + y * scale

def operation_B(x, y):
    # skips the multiply when scale == 1
    if scale == 1:
        return x + y
    else:
        return x + y * scale

CUDA = torch.device("cuda:0")

for shape in tensor_shapes:
    print(f"Benchmarking for tensor shape: {shape}")

    x = torch.rand(shape).to(CUDA)
    y = torch.rand(shape).to(CUDA)

    # Benchmark A (unconditional multiply)
    time_A = benchmark_operation(operation_A, x, y)
    print(f"Operation A: {time_A:.6f} seconds per iteration")

    # Benchmark B (conditional skip)
    time_B = benchmark_operation(operation_B, x, y)
    print(f"Operation B: {time_B:.6f} seconds per iteration")

    print("-" * 40)

(Member)

Interesting observation, thanks for sharing. I believe overall this won't make a big dent, as this should be quite cheap in the grand scheme of things (and since most of the time scale != 1), but I'm fine with keeping it if a comment is added to explain why.

(Contributor, Author)

@BenjaminBossan Our upcoming GPTQModel + LoRA feature will always use scale = 1, which is why I want to make sure this path is as performant as possible. The LoRA apply is already an extra cost on top of the normal module forward, so every bit helps. I would also predict that the faster the GPU, the more this helps, since faster GPUs iterate through the modules/layers faster and hit this useless math in even more loops.

I will add a note on how some LoRAs, such as the ones GPTQModel will generate, will, for now, always have scale == 1.

Comment on lines 729 to 732
    if scaling == 1:  # no scaling
        result = result + lora_B(lora_A(dropout(x)))
    else:
        result = result + lora_B(lora_A(dropout(x))) * scaling

(Member)

same

@Qubitium (Contributor, Author) Feb 28, 2025

Check above. Ugly code, but worth it, I think.

@@ -19,7 +19,7 @@
from peft.import_utils import is_gptqmodel_available
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear, get_gptqmodel_quant_linear

(Member)

Since you don't use get_gptqmodel_quant_linear anymore, you can remove the function.

Comment on lines 128 to 141
class GPTQLoraLinear(torch.nn.Module, LoraLayer):
    def __init__(
        self,
        base_layer,
        adapter_name,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        init_lora_weights: bool = True,
        use_rslora: bool = False,
        use_dora: bool = False,
        lora_bias: bool = False,
        **kwargs,
    ):

(Member)

What's the difference with QuantLinear?

(Contributor, Author)

You're right. There may be no need for the new GPTQLoraLinear class. Checking.

(Contributor, Author)

Replaced with QuantLinear, but I renamed it to GPTQLoraLinear as that name is much clearer.

@@ -65,8 +70,9 @@ def forward(self, x: torch.Tensor):
             return result

         for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if not self._adapter_in_lora_keys(active_adapter):
                 continue

(Contributor, Author)

@SunMarc Found another micro-optimization opportunity. The loop in forward allocates a keys iterable on every pass. I don't see anywhere that updates self.lora_A, so I added an lru_cache to remove this allocation after the first forward pass.

(Member)

Yes, within the forward call, this should be safe to do. I wouldn't expect any measurable speed up, but it shouldn't hurt.

(Contributor, Author)

I have already removed the lru_cache everywhere and just hoisted the allocation outside the loop so it happens once. A sketch of both variants follows.
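
For illustration only, a self-contained sketch of the two variants discussed (class and attribute names are stand-ins, not the PEFT implementation): the cached membership check proposed first, and the hoisted keys lookup the PR settled on.

from functools import lru_cache

import torch
from torch import nn

class LoraLayerSketch(nn.Module):
    # Stand-in layer: lora_A / active_adapters mirror the names in the discussion.
    def __init__(self):
        super().__init__()
        self.lora_A = nn.ModuleDict({"default": nn.Linear(16, 16, bias=False)})
        self.active_adapters = ["default"]

    @lru_cache
    def _adapter_in_lora_keys(self, adapter_name: str) -> bool:
        # First proposal: cache the membership test. Unsafe once adapters can be
        # added or deleted after the first forward pass.
        return adapter_name in self.lora_A.keys()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Final variant: hoist the keys view out of the loop instead of caching.
        lora_A_keys = self.lora_A.keys()
        for active_adapter in self.active_adapters:
            if active_adapter not in lora_A_keys:
                continue
            x = x + self.lora_A[active_adapter](x)
        return x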

Comment on lines 86 to 91
    # lora_dropout float value is not stored so we need to check for cls
    if isinstance(dropout, torch.nn.Dropout):
        output = lora_B(lora_A(dropout(x)))
    else:
        # dropout == Identity which is no-op if lora_dropout == 0.0
        output = lora_B(lora_A(x))

(Contributor, Author)

@SunMarc Another micro-optimization no-op skip. The lora_dropout float value is not stored, so when it is 0.0, update_layer assigns an Identity module, which is a no-op. Only apply dropout when it is actually enabled (lora_dropout > 0.0). A minimal sketch of the pattern is below.
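
A tiny sketch of the pattern (illustrative only, not the PEFT source): when lora_dropout == 0.0 the dropout slot holds an nn.Identity, so calling it is a pass-through.

import torch
from torch import nn

lora_dropout = 0.0
# Mirrors how the dropout slot ends up as Identity when lora_dropout == 0.0.
dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else nn.Identity()

x = torch.randn(4, 16)
assert torch.equal(dropout(x), x)  # Identity returns the input unchanged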

(Member)

Did you observe that this affects runtime?

@Qubitium (Contributor, Author) Feb 28, 2025

Just did a benchmark and it does not affect runtime. Looks like Python is able to optimize it away. Will remove.

    if isinstance(dropout, nn.Dropout):
        result = result + lora_B(lora_A(dropout(x))) * scaling
    else:
        result = result + lora_B(lora_A(x)) * scaling

(Contributor, Author)

Same micro-optimizations as the GPTQ changes.

@BenjaminBossan (Member) left a comment

Thanks for working on these fixes and improvements to GPTQ. I have a couple of small comments, please check.

FIX GPTQ linear layers from GPTQModel is not compatible with PEFT Lora wrapper

What exactly was broken?

)
elif prompt_install:

(Member)

We're generally not using logging in PEFT, only warnings where deemed necessary. Thus, let's remove the install prompt. Note that the LoRA dispatcher checks all dispatch functions, i.e. it also visits dispatch_gptq when users don't want to use GPTQ. Thus they would get this log info and could be confused.

We can add install instructions to the docs instead.

@@ -696,6 +702,11 @@ def get_delta_weight(self, adapter) -> torch.Tensor:

         return output_tensor

+    @lru_cache

(Member)

This won't work if users decide to delete the adapter. So unless this is a large speedup, I'd suggest removing it.

(Contributor, Author)

@BenjaminBossan I was looking for the code that might modify the adapter info after GPTQLoraLinear init, but I was not able to find it, likely because I don't see the whole picture of PEFT as you do. Can you point me to how users dynamically add and remove adapters? Thanks.

(Member)

Yeah, it's not easy to figure out where that happens. The actual logic is here:

def delete_adapter(self, adapter_name: str) -> None:
    """
    Delete an adapter from the layer

    This should be called on all adapter layers, or else we will get an inconsistent state.

    This method will also set a new active adapter if the deleted adapter was an active adapter. It is important
    that the new adapter is chosen in a deterministic way, so that the same adapter is chosen on all layers.

    Args:
        adapter_name (`str`): The name of the adapter to delete
    """
    for attr in self.adapter_layer_names + self.other_param_names:
        if adapter_name in getattr(self, attr):
            del getattr(self, attr)[adapter_name]

The relevant attributes for LoRA are defined here:

adapter_layer_names = ("lora_A", "lora_B", "lora_embedding_A", "lora_embedding_B")

(Contributor, Author)

@BenjaminBossan Ouch. No wonder my IDE can't find it. This kind of dynamic code makes things really tough; Python's find-usages is nearly useless here.

Comment on lines 117 to +122

     if is_gptqmodel_available():
-        device_map = kwargs.get("device_map", None)
-        quant_linear = get_gptqmodel_quant_linear(cfg, device_map=device_map)
+        from gptqmodel.nn_modules.qlinear import BaseQuantLinear
+
+        if isinstance(target_base_layer, BaseQuantLinear):
+            new_module = GPTQLoraLinear(target, adapter_name, **kwargs)
+            target.qweight = target_base_layer.qweight

@Qubitium (Contributor, Author) Feb 28, 2025

@BenjaminBossan

What exactly was broken?

Using transformers to load the model and then calling the model.load_adapter() API, we hit a problem: the pre-PR code would select a quant linear with the wrong device_map (None, i.e. cpu) even though the model was loaded on cuda. gptqmodel_select_quant_linear() therefore returns TorchQuantLinear for device_map=None (cpu), which mismatches the target_base_layer, MarlinQuantLinear (the correct one), so this line fails to match: if quant_linear is not None and isinstance(target_base_layer, quant_linear):

DEBUG: selected_quant_linear=`<class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>`
DEBUG: target_base_layer=`MarlinQuantLinear`
# old code:
if is_gptqmodel_available():
    device_map = kwargs.get("device_map", None)
    quant_linear = get_gptqmodel_quant_linear(cfg, device_map=device_map)
else:
    quant_linear = get_auto_gptq_quant_linear(cfg)

if quant_linear is not None and isinstance(target_base_layer, quant_linear):
    new_module = QuantLinear(target, adapter_name, **kwargs)
    target.qweight = target_base_layer.qweight

So the weird thing is that device_map is wrong. But instead of fixing that, I saw that at this point the quant linear is always a BaseQuantLinear, so I just did the GPTQLoraLinear wrap unconditionally, since that always works. A small stand-in illustration of the mismatch follows.
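
To illustrate with stand-in classes (not the real gptqmodel imports): checking against the shared base class matches any kernel implementation, while checking against the single concrete kernel picked from the wrong device_map does not.

# Stand-in classes for illustration only.
class BaseQuantLinear: ...
class TorchQuantLinear(BaseQuantLinear): ...
class MarlinQuantLinear(BaseQuantLinear): ...

target_base_layer = MarlinQuantLinear()   # what the CUDA-loaded model actually holds
selected_quant_linear = TorchQuantLinear  # what device_map=None (cpu) selected

print(isinstance(target_base_layer, selected_quant_linear))  # False -> old check misses
print(isinstance(target_base_layer, BaseQuantLinear))        # True  -> new check matches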

Would the GPTQ dispatcher be called on non-GPTQ modules? Maybe I am not getting the full picture of when and on what the GPTQ dispatcher is called. Is the dispatcher like a serial probe PEFT sends to all modules, saying: hey, is this module yours (GPTQ)? If so, do your LoRA thing; if not, do nothing.

(Contributor, Author)

OK, I see the dispatcher logic now: the module is sent serially to all dispatchers, and the first one that returns a non-None new_module wins. A minimal sketch of that pattern is below.
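
A minimal sketch of that pattern (hypothetical names, not PEFT's actual signatures): each dispatch function either claims the target and returns a new module, or returns None so the next dispatcher gets a look.

from typing import Optional

from torch import nn

def dispatch_gptq_sketch(target: nn.Module, adapter_name: str, **kwargs) -> Optional[nn.Module]:
    # A real dispatcher would check e.g. isinstance(target, BaseQuantLinear);
    # returning None means "not mine, try the next dispatcher".
    return None

def dispatch_default_sketch(target: nn.Module, adapter_name: str, **kwargs) -> Optional[nn.Module]:
    return nn.Identity()  # stand-in for wrapping the target in a LoRA layer

def create_new_module_sketch(dispatchers, target: nn.Module, adapter_name: str, **kwargs) -> nn.Module:
    for dispatcher in dispatchers:
        new_module = dispatcher(target, adapter_name, **kwargs)
        if new_module is not None:  # first non-None result wins
            return new_module
    raise ValueError(f"No dispatcher could handle {type(target).__name__}")

wrapped = create_new_module_sketch(
    [dispatch_gptq_sketch, dispatch_default_sketch], nn.Linear(4, 4), "default"
)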

@@ -168,7 +168,7 @@ def __init__(
         if not hasattr(self, "peft_config"):
             self.peft_config = {adapter_name: peft_config} if isinstance(peft_config, PeftConfig) else peft_config
         else:
-            logger.info(
+            logger.warning(

@Qubitium (Contributor, Author) Feb 28, 2025

@BenjaminBossan Based on your previous message, PEFT should only emit warnings, and this strict/stern message reads very much like a warning to me, not level == info:

 "Already found a `peft_config` attribute in the model. This will lead to having multiple adapters"
                " in the model. Make sure to know what you are doing!"

(Member)

Thanks for adjusting. It should probably be a warnings.warn; we're not 100% consistent throughout.
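
For reference, a minimal sketch of that suggestion (message text taken from the snippet quoted above):

import warnings

warnings.warn(
    "Already found a `peft_config` attribute in the model. This will lead to having multiple adapters"
    " in the model. Make sure to know what you are doing!"
)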

(Contributor, Author)

Done.

Comment on lines +721 to 724

+        lora_A_keys = self.lora_A.keys()
         for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if active_adapter not in lora_A_keys:
                 continue

@Qubitium (Contributor, Author) Feb 28, 2025

@BenjaminBossan I ran micro-benchmarks on this code too. The lru_cache and the old code had exactly the same performance, which is strange, as I assumed dict.keys() must at least copy the list of keys (in CPython it actually returns a lightweight view, not a copy). But the benchmarks show no speed difference. Still, I am a firm believer that nothing is magic, so I replaced the unsafe lru_cache with a variable hoisted outside the loop, so the same keys object is not allocated inside the loop.

Comment on lines +733 to +737

    # Loras such as EoRA will always be scaling == 1 so we can skip the no-op math
    if scaling == 1:
        result = result + lora_B(lora_A(dropout(x)))
    else:
        result = result + lora_B(lora_A(dropout(x))) * scaling

(Contributor, Author)

@BenjaminBossan Comment added explaining why LoRAs, depending on implementation and target, may well have scale == 1.

@BenjaminBossan (Member) left a comment

Ouch. No wonder my ide can't find it. These types of dynamic code makes things real tough. Python's find_usage is nearly useless here.

I agree that this type of dynamism should be avoided if possible. However, I don't really see how else this can be implemented in a generic fashion, since each PEFT method can have different relevant attributes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Qubitium (Contributor, Author)

@BenjaminBossan Looks like everything is cleared for merge. But please don't merge until we have added CI tests.

Signed-off-by: ZX-ModelCloud <[email protected]>
Comment on lines +352 to +377

    @staticmethod
    def test_load_lora():
        model_id = "ModelCloud/Llama-3.2-1B-gptqmodel-ci-4bit"
        adapter_id = "ModelCloud/Llama-3.2-1B-gptqmodel-ci-4bit-lora"

        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
        model.load_adapter(adapter_id)

        print("peft model", model)

        # assert dynamic rank
        v_proj_module = model.model.layers[5].self_attn.v_proj
        assert isinstance(v_proj_module, GPTQLoraLinear)
        assert v_proj_module.lora_A["default"].weight.data.shape[0] == 128
        assert v_proj_module.lora_B["default"].weight.data.shape[1] == 128
        gate_proj_module = model.model.layers[5].mlp.gate_proj
        assert isinstance(gate_proj_module, GPTQLoraLinear)
        assert gate_proj_module.lora_A["default"].weight.data.shape[0] == 256
        assert gate_proj_module.lora_B["default"].weight.data.shape[1] == 256

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        inp = tokenizer("Capital of France is", return_tensors="pt").to(model.device)
        tokens = model.generate(**inp)[0]
        result = tokenizer.decode(tokens)

        print("result: ", result)

@Qubitium (Contributor, Author) Mar 1, 2025

@BenjaminBossan Unit test added which combines a simple LoRA with LoraConfig.rank_pattern, aligning with GPTQModel's dynamic per-module (in this case, per-LoRA) rank override. So using GPTQModel you can automatically generate HF-compatible LoRA weights that have a different rank per module/LoRA combination via the rank_pattern dict; a sketch of such a config is shown below.
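
A sketch of such a config (hypothetical values chosen to mirror the ranks asserted in the test above): rank_pattern overrides the default r for matching module names.

from peft import LoraConfig

config = LoraConfig(
    r=128,                            # default rank (e.g. v_proj stays at 128)
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
    rank_pattern={"gate_proj": 256},  # gate_proj is overridden to rank 256
)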

PR is good to go, pending CI test success on PEFT. It needs to be paired with GPTQModel main.
