
FIX GPTQModel Lora Wrapper #2404

Open · wants to merge 12 commits into main
Conversation

@Qubitium (Contributor) commented Feb 28, 2025

PR Changes:

  • FIX: GPTQ linear layers from GPTQModel are not compatible with the PEFT LoRA wrapper
  • Skip the scaling multiply op when scaling == 1 (unsure the GPU is smart enough to no-op this micro-optimization on its own, so doing it manually). The LoRAs we are testing for GPTQ do not use scaling: scaling = lora_alpha / r, and since r == lora_alpha it evaluates to 1 (see the sketch after this list).
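
As a minimal sketch (hypothetical values, not part of this PR): with r == lora_alpha, PEFT's LoRA scaling factor lora_alpha / r is exactly 1, which is the case the skip above targets.

from peft import LoraConfig

# Hypothetical config: r == lora_alpha, so the effective scaling is 1 and the
# extra multiply in the LoRA forward would be a pure no-op.
config = LoraConfig(
    r=128,
    lora_alpha=128,        # r == lora_alpha  ->  scaling == lora_alpha / r == 1
    lora_dropout=0.0,      # the dropout slot becomes nn.Identity
    target_modules=["q_proj", "v_proj"],
)
assert config.lora_alpha / config.r == 1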

Notes:

  • GPTQLoraLinear copies most of the code and structure from AwqLoraLinear.

TODO:

  • Add CI test for GPTQModel + LoRA

@Qubitium Qubitium marked this pull request as draft February 28, 2025 05:34
@Qubitium Qubitium changed the title [WIP] FIX GPTQ Lora Wrapper [WIP] FIX GPTQModel Lora Wrapper Feb 28, 2025
@Qubitium Qubitium changed the title [WIP] FIX GPTQModel Lora Wrapper FIX GPTQModel Lora Wrapper Feb 28, 2025

@Qubitium (Contributor, Author) commented Feb 28, 2025

@BenjaminBossan @SunMarc PR ready for review. My co-worker is writing the PEFT CI test for this, but I want to get the review started early, if possible, before the test is ready. Our GPTQModel tests with PEFT and LoRA are passing.

This PR needs to be paired with GPTQModel PR ModelCloud/GPTQModel#1358, which has a new CI test for LoRA. We are also rounding out the tests on our side and will merge and release v2.0 today or tomorrow.

@SunMarc (Member) left a comment

Thanks! Left a couple of comments.

Comment on lines 492 to 497

    if scaling == 1:  # no scaling
        lora_output = lora_B(lora_A(dropout(sub_batch)))
    else:
        lora_output = lora_B(lora_A(dropout(sub_batch))) * scaling

(Member)

Not sure I see the benefit of doing that.

@Qubitium (Contributor, Author) Feb 28, 2025

The following test shows that PyTorch can't optimize out the obvious no-op: doing the * 1 math is roughly 2x slower for the 3 shapes I tested to simulate the logic. I think the same would be true if we asked the tensors to multiply by 0. It may be faster to zero the tensors than to multiply by 0, or, in this case, to skip the no-op of * 1.

Benchmark Result on A100:

Benchmarking for tensor shape: (256, 256)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000005 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (512, 512)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (1024, 1024)
Operation A: 0.000014 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
import torch
import timeit

tensor_shapes = [(256, 256), (512, 512), (1024, 1024)]
repeats = 100

def benchmark_operation(operation, x, y):
    # Warm-up so one-time CUDA/kernel setup is not measured
    for _ in range(10):
        _ = operation(x, y)
    torch.cuda.synchronize()

    # Note: without a synchronize inside the timed call this mostly measures
    # kernel-launch overhead, which is exactly the cost the `scaling == 1`
    # skip avoids (one kernel launch instead of two).
    timer = timeit.Timer(lambda: operation(x, y))
    time_taken = timer.timeit(number=repeats) / repeats
    return time_taken

scale = 1

def operation_A(x, y):
    # always pays for the multiply, even when scale == 1
    return x + y * scale

def operation_B(x, y):
    # skips the multiply when scale == 1
    if scale == 1:
        return x + y
    else:
        return x + y * scale

CUDA = torch.device("cuda:0")

for shape in tensor_shapes:
    print(f"Benchmarking for tensor shape: {shape}")

    x = torch.rand(shape).to(CUDA)
    y = torch.rand(shape).to(CUDA)

    # Benchmark A (unconditional multiply)
    time_A = benchmark_operation(operation_A, x, y)
    print(f"Operation A: {time_A:.6f} seconds per iteration")

    # Benchmark B (conditional skip)
    time_B = benchmark_operation(operation_B, x, y)
    print(f"Operation B: {time_B:.6f} seconds per iteration")

    print("-" * 40)

(Member)

Interesting observation, thanks for sharing. I believe overall this won't make a big dent, as this should be quite cheap in the grand scheme of things (and since most of the time scale != 1), but I'm fine with keeping it if a comment is added to explain why.

(Contributor, Author)

@BenjaminBossan Our upcoming GPTQModel + LoRA feature will always use scale = 1, which is why I want to make sure this path is as performant as possible. The LoRA apply is already an extra cost on top of the normal module forward, so every bit helps. I would also predict that the faster the GPU, the more this helps, since faster GPUs iterate through the modules/layers faster and hit this useless math in even more loops.

I will add a note on how some LoRAs, such as the ones GPTQModel will generate, will, for now, always have scale == 1.

Comment on lines 729 to 732
    if scaling == 1:  # no scaling
        result = result + lora_B(lora_A(dropout(x)))
    else:
        result = result + lora_B(lora_A(dropout(x))) * scaling

(Member)

same

@Qubitium (Contributor, Author) Feb 28, 2025

Check above. Ugly code, but worth it, I think.

@@ -19,7 +19,7 @@
from peft.import_utils import is_gptqmodel_available
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear, get_gptqmodel_quant_linear

(Member)

Since you don't use get_gptqmodel_quant_linear anymore, you can remove the function.

Comment on lines 128 to 141
class GPTQLoraLinear(torch.nn.Module, LoraLayer):
    def __init__(
        self,
        base_layer,
        adapter_name,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        init_lora_weights: bool = True,
        use_rslora: bool = False,
        use_dora: bool = False,
        lora_bias: bool = False,
        **kwargs,
    ):

(Member)

What's the difference with QuantLinear?

(Contributor, Author)

You're right. There may be no need for the new GPTQLoraLinear class. Checking.

(Contributor, Author)

Replaced with QuantLinear, but I renamed it to GPTQLoraLinear as that name is much clearer.

@@ -65,8 +70,9 @@ def forward(self, x: torch.Tensor):
             return result

         for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if not self._adapter_in_lora_keys(active_adapter):
                 continue

(Contributor, Author)

@SunMarc Found another micro-optimization opportunity. The loop in forward allocates a keys iterable on every pass. I don't see anywhere that updates self.lora_A, so I added an lru_cache to remove this allocation after the first forward pass.

(Member)

Yes, within the forward call, this should be safe to do. I wouldn't expect any measurable speed up, but it shouldn't hurt.

(Contributor, Author)

I have already removed the lru_cache everywhere and just hoisted the allocation outside the loop so it happens once. A sketch of both variants follows.
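
For illustration only, a self-contained sketch of the two variants discussed (class and attribute names are stand-ins, not the PEFT implementation): the cached membership check proposed first, and the hoisted keys lookup the PR settled on.

from functools import lru_cache

import torch
from torch import nn

class LoraLayerSketch(nn.Module):
    # Stand-in layer: lora_A / active_adapters mirror the names in the discussion.
    def __init__(self):
        super().__init__()
        self.lora_A = nn.ModuleDict({"default": nn.Linear(16, 16, bias=False)})
        self.active_adapters = ["default"]

    @lru_cache
    def _adapter_in_lora_keys(self, adapter_name: str) -> bool:
        # First proposal: cache the membership test. Unsafe once adapters can be
        # added or deleted after the first forward pass.
        return adapter_name in self.lora_A.keys()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Final variant: hoist the keys view out of the loop instead of caching.
        lora_A_keys = self.lora_A.keys()
        for active_adapter in self.active_adapters:
            if active_adapter not in lora_A_keys:
                continue
            x = x + self.lora_A[active_adapter](x)
        return x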

Comment on lines 86 to 91
    # lora_dropout float value is not stored so we need to check for cls
    if isinstance(dropout, torch.nn.Dropout):
        output = lora_B(lora_A(dropout(x)))
    else:
        # dropout == Identity which is no-op if lora_dropout == 0.0
        output = lora_B(lora_A(x))

(Contributor, Author)

@SunMarc Another micro-optimization no-op skip. The lora_dropout float value is not stored, so when it is 0.0, update_layer assigns an Identity module, which is a no-op. Only apply dropout when it is actually enabled (lora_dropout > 0.0). A minimal sketch of the pattern is below.
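
A tiny sketch of the pattern (illustrative only, not the PEFT source): when lora_dropout == 0.0 the dropout slot holds an nn.Identity, so calling it is a pass-through.

import torch
from torch import nn

lora_dropout = 0.0
# Mirrors how the dropout slot ends up as Identity when lora_dropout == 0.0.
dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else nn.Identity()

x = torch.randn(4, 16)
assert torch.equal(dropout(x), x)  # Identity returns the input unchanged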

(Member)

Did you observe that this affects runtime?

@Qubitium (Contributor, Author) Feb 28, 2025

Just did a benchmark and it does not affect runtime. Looks like Python is able to optimize it away. Will remove.

    if isinstance(dropout, nn.Dropout):
        result = result + lora_B(lora_A(dropout(x))) * scaling
    else:
        result = result + lora_B(lora_A(x)) * scaling

(Contributor, Author)

Same micro-optimizations as the GPTQ changes.

@BenjaminBossan (Member) left a comment

Thanks for working on these fixes and improvements to GPTQ. I have a couple of small comments, please check.

FIX GPTQ linear layers from GPTQModel is not compatible with PEFT Lora wrapper

What exactly was broken?

)
elif prompt_install:

(Member)

We're generally not using logging in PEFT, only warnings where deemed necessary. Thus, let's remove the install prompt. Note that the LoRA dispatcher checks all dispatch functions, i.e. it also visits dispatch_gptq when users don't want to use GPTQ. Thus they would get this log info and could be confused.

We can add install instructions to the docs instead.

@@ -696,6 +702,11 @@ def get_delta_weight(self, adapter) -> torch.Tensor:

         return output_tensor

+    @lru_cache

(Member)

This won't work if users decide to delete the adapter. So unless this is a large speedup, I'd suggest removing it.

(Contributor, Author)

@BenjaminBossan I was looking for the code that might modify the adapter info after GPTQLoraLinear init, but I was not able to find it, likely because I don't see the whole picture of PEFT as you do. Can you point me to how users dynamically add and remove adapters? Thanks.

(Member)

Yeah, it's not easy to figure out where that happens. The actual logic is here:

def delete_adapter(self, adapter_name: str) -> None:
    """
    Delete an adapter from the layer

    This should be called on all adapter layers, or else we will get an inconsistent state.

    This method will also set a new active adapter if the deleted adapter was an active adapter. It is important
    that the new adapter is chosen in a deterministic way, so that the same adapter is chosen on all layers.

    Args:
        adapter_name (`str`): The name of the adapter to delete
    """
    for attr in self.adapter_layer_names + self.other_param_names:
        if adapter_name in getattr(self, attr):
            del getattr(self, attr)[adapter_name]

The relevant attributes for LoRA are defined here:

adapter_layer_names = ("lora_A", "lora_B", "lora_embedding_A", "lora_embedding_B")

(Contributor, Author)

@BenjaminBossan Ouch. No wonder my IDE can't find it. This kind of dynamic code makes things really tough; Python's find-usages is nearly useless here.

Comment on lines 117 to +122

     if is_gptqmodel_available():
-        device_map = kwargs.get("device_map", None)
-        quant_linear = get_gptqmodel_quant_linear(cfg, device_map=device_map)
+        from gptqmodel.nn_modules.qlinear import BaseQuantLinear
+
+        if isinstance(target_base_layer, BaseQuantLinear):
+            new_module = GPTQLoraLinear(target, adapter_name, **kwargs)
+            target.qweight = target_base_layer.qweight

@Qubitium (Contributor, Author) Feb 28, 2025

@BenjaminBossan

What exactly was broken?

Using transformers to load the model and then calling the model.load_adapter() API, we hit a problem: the pre-PR code would select a quant linear with the wrong device_map (None, i.e. cpu) even though the model was loaded on cuda. gptqmodel_select_quant_linear() therefore returns TorchQuantLinear for device_map=None (cpu), which mismatches the target_base_layer, MarlinQuantLinear (the correct one), so this line fails to match: if quant_linear is not None and isinstance(target_base_layer, quant_linear):

DEBUG: selected_quant_linear=`<class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>`
DEBUG: target_base_layer=`MarlinQuantLinear`
# old code:
if is_gptqmodel_available():
    device_map = kwargs.get("device_map", None)
    quant_linear = get_gptqmodel_quant_linear(cfg, device_map=device_map)
else:
    quant_linear = get_auto_gptq_quant_linear(cfg)

if quant_linear is not None and isinstance(target_base_layer, quant_linear):
    new_module = QuantLinear(target, adapter_name, **kwargs)
    target.qweight = target_base_layer.qweight

So the weird thing is that device_map is wrong. But instead of fixing that, I saw that at this point the quant linear is always a BaseQuantLinear, so I just did the GPTQLoraLinear wrap unconditionally, since that always works. A small stand-in illustration of the mismatch follows.
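
To illustrate with stand-in classes (not the real gptqmodel imports): checking against the shared base class matches any kernel implementation, while checking against the single concrete kernel picked from the wrong device_map does not.

# Stand-in classes for illustration only.
class BaseQuantLinear: ...
class TorchQuantLinear(BaseQuantLinear): ...
class MarlinQuantLinear(BaseQuantLinear): ...

target_base_layer = MarlinQuantLinear()   # what the CUDA-loaded model actually holds
selected_quant_linear = TorchQuantLinear  # what device_map=None (cpu) selected

print(isinstance(target_base_layer, selected_quant_linear))  # False -> old check misses
print(isinstance(target_base_layer, BaseQuantLinear))        # True  -> new check matches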

Would the GPTQ dispatcher be called on non-GPTQ modules? Maybe I am not getting the full picture of when and on what the GPTQ dispatcher is called. Is the dispatcher like a serial probe PEFT sends to all modules, saying: hey, is this module yours (GPTQ)? If so, do your LoRA thing; if not, do nothing.

(Contributor, Author)

OK, I see the dispatcher logic now: the module is sent serially to all dispatchers, and the first one that returns a non-None new_module wins. A minimal sketch of that pattern is below.
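
A minimal sketch of that pattern (hypothetical names, not PEFT's actual signatures): each dispatch function either claims the target and returns a new module, or returns None so the next dispatcher gets a look.

from typing import Optional

from torch import nn

def dispatch_gptq_sketch(target: nn.Module, adapter_name: str, **kwargs) -> Optional[nn.Module]:
    # A real dispatcher would check e.g. isinstance(target, BaseQuantLinear);
    # returning None means "not mine, try the next dispatcher".
    return None

def dispatch_default_sketch(target: nn.Module, adapter_name: str, **kwargs) -> Optional[nn.Module]:
    return nn.Identity()  # stand-in for wrapping the target in a LoRA layer

def create_new_module_sketch(dispatchers, target: nn.Module, adapter_name: str, **kwargs) -> nn.Module:
    for dispatcher in dispatchers:
        new_module = dispatcher(target, adapter_name, **kwargs)
        if new_module is not None:  # first non-None result wins
            return new_module
    raise ValueError(f"No dispatcher could handle {type(target).__name__}")

wrapped = create_new_module_sketch(
    [dispatch_gptq_sketch, dispatch_default_sketch], nn.Linear(4, 4), "default"
)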

@@ -168,7 +168,7 @@ def __init__(
         if not hasattr(self, "peft_config"):
             self.peft_config = {adapter_name: peft_config} if isinstance(peft_config, PeftConfig) else peft_config
         else:
-            logger.info(
+            logger.warning(

@Qubitium (Contributor, Author) Feb 28, 2025

@BenjaminBossan Based on your previous message, PEFT should only emit warnings, and this strict/stern message reads very much like a warning to me, not level == info:

 "Already found a `peft_config` attribute in the model. This will lead to having multiple adapters"
                " in the model. Make sure to know what you are doing!"

(Member)

Thanks for adjusting. It should probably be a warnings.warn; we're not 100% consistent throughout.
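
For reference, a minimal sketch of that suggestion (message text taken from the snippet quoted above):

import warnings

warnings.warn(
    "Already found a `peft_config` attribute in the model. This will lead to having multiple adapters"
    " in the model. Make sure to know what you are doing!"
)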

(Contributor, Author)

Done.

Comment on lines +721 to 724

+        lora_A_keys = self.lora_A.keys()
         for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if active_adapter not in lora_A_keys:
                 continue

@Qubitium (Contributor, Author) Feb 28, 2025

@BenjaminBossan I ran micro-benchmarks on this code too. The lru_cache and the old code had exactly the same performance, which is strange, as I assumed dict.keys() must at least copy the list of keys (in CPython it actually returns a lightweight view, not a copy). But the benchmarks show no speed difference. Still, I am a firm believer that nothing is magic, so I replaced the unsafe lru_cache with a variable hoisted outside the loop, so the same keys object is not allocated inside the loop.

Comment on lines +733 to +737

    # Loras such as EoRA will always be scaling == 1 so we can skip the no-op math
    if scaling == 1:
        result = result + lora_B(lora_A(dropout(x)))
    else:
        result = result + lora_B(lora_A(dropout(x))) * scaling

(Contributor, Author)

@BenjaminBossan Comment added explaining why LoRAs, depending on implementation and target, may well have scale == 1.

@BenjaminBossan (Member) left a comment

Ouch. No wonder my ide can't find it. These types of dynamic code makes things real tough. Python's find_usage is nearly useless here.

I agree that this type of dynamism should be avoided if possible. However, I don't really see how else this can be implemented in a generic fashion, since each PEFT method can have different relevant attributes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Qubitium (Contributor, Author)

@BenjaminBossan Looks like everything is cleared for merge. But please don't merge until we have added CI tests.

Signed-off-by: ZX-ModelCloud <[email protected]>
Comment on lines +352 to +377

    @staticmethod
    def test_load_lora():
        model_id = "ModelCloud/Llama-3.2-1B-gptqmodel-ci-4bit"
        adapter_id = "ModelCloud/Llama-3.2-1B-gptqmodel-ci-4bit-lora"

        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
        model.load_adapter(adapter_id)

        print("peft model", model)

        # assert dynamic rank
        v_proj_module = model.model.layers[5].self_attn.v_proj
        assert isinstance(v_proj_module, GPTQLoraLinear)
        assert v_proj_module.lora_A["default"].weight.data.shape[0] == 128
        assert v_proj_module.lora_B["default"].weight.data.shape[1] == 128
        gate_proj_module = model.model.layers[5].mlp.gate_proj
        assert isinstance(gate_proj_module, GPTQLoraLinear)
        assert gate_proj_module.lora_A["default"].weight.data.shape[0] == 256
        assert gate_proj_module.lora_B["default"].weight.data.shape[1] == 256

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        inp = tokenizer("Capital of France is", return_tensors="pt").to(model.device)
        tokens = model.generate(**inp)[0]
        result = tokenizer.decode(tokens)

        print("result: ", result)

@Qubitium (Contributor, Author) Mar 1, 2025

@BenjaminBossan Unit test added which combines a simple LoRA with LoraConfig.rank_pattern, aligning with GPTQModel's dynamic per-module (in this case, per-LoRA) rank override. So using GPTQModel you can automatically generate HF-compatible LoRA weights that have a different rank per module/LoRA combination via the rank_pattern dict; a sketch of such a config is shown below.
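
A sketch of such a config (hypothetical values chosen to mirror the ranks asserted in the test above): rank_pattern overrides the default r for matching module names.

from peft import LoraConfig

config = LoraConfig(
    r=128,                            # default rank (e.g. v_proj stays at 128)
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
    rank_pattern={"gate_proj": 256},  # gate_proj is overridden to rank 256
)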

PR is good to go, pending CI test success on PEFT. It needs to be paired with GPTQModel main.
