FIX GPTQModel Lora Wrapper #2404
base: main
Conversation
@BenjaminBossan @SunMarc PR ready for review. My co-worker is writing up the PEFT CI test for this, but I want to get the review started early, if possible, before the test is ready. Our GPTQModel tests with PEFT and LoRA are passing. This PR needs to pair with GPTQModel PR ModelCloud/GPTQModel#1358, which has a new CI test for it.
Thanks! Left a couple of comments.
src/peft/tuners/lora/layer.py
Outdated
if scaling == 1:  # no scaling
    lora_output = lora_B(lora_A(dropout(sub_batch)))
else:
    lora_output = lora_B(lora_A(dropout(sub_batch))) * scaling
Not sure I see the benefit of doing that.
The following test shows that PyTorch can't optimize out the obvious no-op. Doing the * 1 math is 2x slower for the 3 shapes I tested to roughly simulate the logic. I think the same would be true if we asked the tensors to multiply by 0. It may be faster to zero all tensors directly than to multiply by 0, or, in this case, to treat * 1 as a no-op.
Benchmark Result on A100:
Benchmarking for tensor shape: (256, 256)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000005 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (512, 512)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (1024, 1024)
Operation A: 0.000014 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
import torch
import timeit

tensor_shapes = [(256, 256), (512, 512), (1024, 1024)]
repeats = 100


def benchmark_operation(operation, x, y):
    # Warm-up
    for _ in range(10):
        _ = operation(x, y)
    timer = timeit.Timer(lambda: operation(x, y))
    time_taken = timer.timeit(number=repeats) / repeats
    return time_taken


scale = 1


def operation_A(x, y):
    return x + y * scale


def operation_B(x, y):
    if scale == 1:
        return x + y
    else:
        return x + y * scale


CUDA = torch.device("cuda:0")

for shape in tensor_shapes:
    print(f"Benchmarking for tensor shape: {shape}")
    x = torch.rand(shape).to(CUDA)
    y = torch.rand(shape).to(CUDA)

    # Benchmark A
    time_A = benchmark_operation(operation_A, x, y)
    print(f"Operation A: {time_A:.6f} seconds per iteration")

    # Benchmark B
    time_B = benchmark_operation(operation_B, x, y)
    print(f"Operation B: {time_B:.6f} seconds per iteration")

    print("-" * 40)
Interesting observation, thanks for sharing. I believe overall this won't make a big dent, as this should be quite cheap in the grand scheme of things (and since most of the time, scale != 1), but I'm fine with keeping it if a comment is added to explain why.
@BenjaminBossan Our upcoming feature of GPTQModel + LoRA will be 100% scale = 1, so that's why I want to make sure it is as performant as possible. Applying LoRA is already an extra cost on top of the normal module forward, so every bit helps. I would predict that the faster your GPU, the more this helps, since faster GPUs iterate the modules/layers even faster and hit this useless math in even more loops.
I will add a note on how some LoRAs, such as the ones GPTQModel will generate, will, for now, always have scale == 1.
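For context, a minimal sketch of where that scale comes from, assuming the standard LoRA scaling rule (PEFT computes scaling as lora_alpha / r, or lora_alpha / sqrt(r) with rsLoRA); an adapter exported with r == lora_alpha always lands in the no-scaling branch:

import math


def lora_scaling(r: int, lora_alpha: int, use_rslora: bool = False) -> float:
    # Standard LoRA scaling rule; when r == lora_alpha (as in the GPTQModel/EoRA
    # adapters discussed here), this collapses to exactly 1 and the multiply is a no-op.
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


assert lora_scaling(r=128, lora_alpha=128) == 1.0  # the case that skips "* scaling"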
src/peft/tuners/lora/layer.py
Outdated
if scaling == 1:  # no scaling
    result = result + lora_B(lora_A(dropout(x)))
else:
    result = result + lora_B(lora_A(dropout(x))) * scaling
same
Check above. Ugly code, but worth it, I think.
@@ -19,7 +19,7 @@
from peft.import_utils import is_gptqmodel_available
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear, get_gptqmodel_quant_linear
since you don't use get_gptqmodel_quant_linear anymore, you can remove the function
src/peft/tuners/lora/gptq.py
Outdated
class GPTQLoraLinear(torch.nn.Module, LoraLayer):
    def __init__(
        self,
        base_layer,
        adapter_name,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        init_lora_weights: bool = True,
        use_rslora: bool = False,
        use_dora: bool = False,
        lora_bias: bool = False,
        **kwargs,
    ):
What's the difference with QuantLinear?
You're right. There may be no need for the new GPTQLoraLinear cls. Checking.
Replaced with QuantLinear, but I renamed it to GPTQLoraLinear as the name is much clearer.
@@ -65,8 +70,9 @@ def forward(self, x: torch.Tensor):
            return result

        for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if not self._adapter_in_lora_keys(active_adapter):
                continue
@SunMarc Found another micro-optimization opportunity. In the loop in forward it is allocating an iterable of keys every single time. I don't see anywhere that updates self.lora_A, so I added an lru_cache to remove this allocation after the first forward pass.
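For illustration, a standalone sketch of the cached-lookup pattern (this is not the exact PEFT code; _adapter_in_lora_keys is just the helper name from the diff above):

from functools import lru_cache

from torch import nn


class CachedAdapterLookup(nn.Module):
    """Toy stand-in for a LoRA layer: cache the "is this adapter registered?" check."""

    def __init__(self):
        super().__init__()
        self.lora_A = nn.ModuleDict({"default": nn.Linear(4, 2)})

    @lru_cache
    def _adapter_in_lora_keys(self, adapter_name: str) -> bool:
        # Caches per (self, adapter_name) so the keys view is not rebuilt on every
        # forward pass. Note: results go stale if adapters are added or deleted
        # later, which is the concern raised further down in this review.
        return adapter_name in self.lora_A


layer = CachedAdapterLookup()
assert layer._adapter_in_lora_keys("default") is True
assert layer._adapter_in_lora_keys("missing") is False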
Yes, within the forward call, this should be safe to do. I wouldn't expect any measurable speed up, but it shouldn't hurt.
I have already removed the lru_cache for all instances and just moved the allocation outside the loop.
src/peft/tuners/lora/gptq.py
Outdated
# lora_dropout float value is not stored so we need to check for cls
if isinstance(dropout, torch.nn.Dropout):
    output = lora_B(lora_A(dropout(x)))
else:
    # dropout == Identity which is no-op if lora_dropout == 0.0
    output = lora_B(lora_A(x))
@SunMarc Another micro-optimization no-op skip. lora_dropout: float is not stored, so if it is 0.0, update_layer assigns an Identity module, which is a no-op. Only do dropout if dropout is activated (lora_dropout > 0.0).
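For reference, a simplified sketch of the convention being relied on here; only the resulting module is kept on the layer, not the float:

from torch import nn


def make_lora_dropout(lora_dropout: float) -> nn.Module:
    # A Dropout module is created only when dropout is actually requested;
    # otherwise an Identity (a pure no-op) stands in, which is why the float
    # value itself can't be read back off the layer.
    if lora_dropout > 0.0:
        return nn.Dropout(p=lora_dropout)
    return nn.Identity()


assert isinstance(make_lora_dropout(0.0), nn.Identity)
assert isinstance(make_lora_dropout(0.1), nn.Dropout)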
Did you observe that this affects runtime?
Just did a benchmark and it does not affect runtime. Looks like Python is able to optimize it away. Will remove.
src/peft/tuners/lora/layer.py
Outdated
if isinstance(dropout, nn.Dropout):
    result = result + lora_B(lora_A(dropout(x))) * scaling
else:
    result = result + lora_B(lora_A(x)) * scaling
Same micro-optimizations as in the gptq changes.
Thanks for working on these fixes and improvements to GPTQ. I have a couple of small comments, please check.
FIX GPTQ linear layers from GPTQModel is not compatible with PEFT Lora wrapper
What exactly was broken?
src/peft/import_utils.py
Outdated
    )
elif prompt_install:
We're generally not using logging in PEFT, only warnings where deemed necessary. Thus, let's remove the install prompt. Note that the LoRA dispatcher checks all dispatch functions, i.e. it also visits dispatch_gptq when users don't want to use GPTQ. Thus they would get this log info and could be confused.
We can add install instructions to the docs instead.
src/peft/tuners/lora/layer.py
Outdated
@@ -696,6 +702,11 @@ def get_delta_weight(self, adapter) -> torch.Tensor:
        return output_tensor

+    @lru_cache
This won't work if users decide to delete the adapter. So unless this is a large speedup, I'd suggest removing it.
@BenjaminBossan I was looking for the code that may modify the self.adapters info post GPTQLoraLinear init, but I was not able to find it, likely because I don't see the whole picture of PEFT as you do. Can you point me to how users dynamically add new adapters and remove them? Thanks.
Yeah, it's not easy to figure out where that happens. The actual logic is here:
peft/src/peft/tuners/tuners_utils.py
Lines 789 to 804 in f51203f
def delete_adapter(self, adapter_name: str) -> None:
    """
    Delete an adapter from the layer

    This should be called on all adapter layers, or else we will get an inconsistent state.

    This method will also set a new active adapter if the deleted adapter was an active adapter. It is important
    that the new adapter is chosen in a deterministic way, so that the same adapter is chosen on all layers.

    Args:
        adapter_name (`str`): The name of the adapter to delete
    """
    for attr in self.adapter_layer_names + self.other_param_names:
        if adapter_name in getattr(self, attr):
            del getattr(self, attr)[adapter_name]
The relevant attributes for LoRA are defined here:
peft/src/peft/tuners/lora/layer.py
Line 42 in f51203f
adapter_layer_names = ("lora_A", "lora_B", "lora_embedding_A", "lora_embedding_B")
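At the user level, the entry points are add_adapter / delete_adapter on the PEFT model. A hedged usage sketch (the model id is only an example):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Any small causal LM works for the sketch; "facebook/opt-125m" is just an example.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=8), adapter_name="default")

# Adding a second adapter populates lora_A/lora_B entries on every targeted layer ...
model.add_adapter("second", LoraConfig(r=16, lora_alpha=16))

# ... and deleting it walks the tuner layers (the delete_adapter shown above) and
# removes those entries again -- which is why caching self.lora_A.keys() would go stale.
model.delete_adapter("second")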
@BenjaminBossan Ouch. No wonder my IDE can't find it. This type of dynamic code makes things really tough. Python's find_usage is nearly useless here.
    if is_gptqmodel_available():
-        device_map = kwargs.get("device_map", None)
-        quant_linear = get_gptqmodel_quant_linear(cfg, device_map=device_map)
+        from gptqmodel.nn_modules.qlinear import BaseQuantLinear
+
+        if isinstance(target_base_layer, BaseQuantLinear):
+            new_module = GPTQLoraLinear(target, adapter_name, **kwargs)
+            target.qweight = target_base_layer.qweight
What exactly was broken?
Using transformers to load the model and then using the model.load_adapter() API, we hit a problem where the pre-PR code would select a quant linear with the wrong device_map == None == cpu even though the model was loaded on cuda. So gptqmodel_select_quant_linear() returns TorchQuantLinear for device_map == None == cpu, which mismatches the target_base_layer, which is MarlinQuantLinear (correct), causing this check to fail: if quant_linear is not None and isinstance(target_base_layer, quant_linear):
DEBUG: selected_quant_linear=`<class 'gptqmodel.nn_modules.qlinear.torch.TorchQuantLinear'>`
DEBUG: target_base_layer=`MarlinQuantLinear`
# old code:
if is_gptqmodel_available():
    device_map = kwargs.get("device_map", None)
    quant_linear = get_gptqmodel_quant_linear(cfg, device_map=device_map)
else:
    quant_linear = get_auto_gptq_quant_linear(cfg)

if quant_linear is not None and isinstance(target_base_layer, quant_linear):
    new_module = QuantLinear(target, adapter_name, **kwargs)
    target.qweight = target_base_layer.qweight
So the weird thing is that device_map is wrong. But instead of fixing this, I saw that at this point the quant linear is always a BaseQuantLinear, so I just blindly did the GPTQLoraLinear wrap as that would always work.
Would the GPTQ dispatcher be called on non-GPTQ modules? Maybe I am not getting the full picture of when, and on what, the GPTQ dispatcher is called. Is the dispatcher like a serial prober PEFT sends to all modules to say: hey, is this module yours (GPTQ)? If so, do your LoRA thing... if not, do nothing.
OK, I see the code in the dispatcher logic where the module is serially sent to all dispatchers and the first one that returns a non-None new_module wins.
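Roughly, the pattern looks like this (a simplified, self-contained sketch, not the actual PEFT dispatcher, which passes the config and more context):

from typing import Callable, Optional


def dispatch_gptq_sketch(target, adapter_name: str) -> Optional[str]:
    # Stand-in dispatcher: only claims layers it recognizes, otherwise stays silent.
    return f"GPTQLoraLinear({adapter_name})" if getattr(target, "is_gptq_quant_linear", False) else None


def dispatch_default_sketch(target, adapter_name: str) -> Optional[str]:
    return f"LoraLinear({adapter_name})"


def create_new_module(dispatchers: list[Callable], target, adapter_name: str):
    # The "serial prober": every dispatcher gets a look at the target layer;
    # the first one that returns a non-None new module wins.
    for dispatcher in dispatchers:
        new_module = dispatcher(target, adapter_name)
        if new_module is not None:
            return new_module
    return None


class FakeGPTQLayer:
    is_gptq_quant_linear = True


print(create_new_module([dispatch_gptq_sketch, dispatch_default_sketch], FakeGPTQLayer(), "default"))
# -> GPTQLoraLinear(default)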
src/peft/tuners/tuners_utils.py
Outdated
@@ -168,7 +168,7 @@ def __init__(
        if not hasattr(self, "peft_config"):
            self.peft_config = {adapter_name: peft_config} if isinstance(peft_config, PeftConfig) else peft_config
        else:
-            logger.info(
+            logger.warning(
@BenjaminBossan Based on your previous msg, PEFT should only do warnings, and this strict/stern message reads very much like a warning message to me, not level == info:
"Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!"
Thanks for adjusting. Probably it should be a warnings.warn, we're not 100% consistent throughout.
done
+        lora_A_keys = self.lora_A.keys()
        for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if active_adapter not in lora_A_keys:
                continue
@BenjaminBossan I did micro benchmarks on this code too, and the lru_cache and the old code had exactly the same performance, which is very strange, as dict.keys must make a copy/clone of at least the list of keys. But the benchmarks show there is no speed difference. Still, I am a firm believer that nothing is magic, so I replaced the unsafe lru_cache with a variable outside the loop so the same keys object is not allocated inside the loop.
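For anyone who wants to reproduce that locally, a tiny standalone sketch of the membership-check pattern being compared (plain dicts, not the PEFT layer):

import timeit

lora_A = {f"adapter_{i}": object() for i in range(4)}
active_adapters = ["adapter_0", "adapter_3", "adapter_9"]


def keys_per_check():
    # Calls .keys() inside the loop, as the old code did.
    return [a for a in active_adapters if a in lora_A.keys()]


def hoisted_keys_view():
    # Hoists the keys view out of the loop, as the new code does.
    keys = lora_A.keys()
    return [a for a in active_adapters if a in keys]


print("keys() per check :", timeit.timeit(keys_per_check, number=100_000))
print("hoisted keys view:", timeit.timeit(hoisted_keys_view, number=100_000))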
        # Loras such as EoRA will always be scaling == 1 so we can skip the no-op math
        if scaling == 1:
            result = result + lora_B(lora_A(dropout(x)))
        else:
            result = result + lora_B(lora_A(dropout(x))) * scaling
@BenjaminBossan Comments added explaining why LoRAs, depending on implementation and target, may well have scale == 1.
Ouch. No wonder my IDE can't find it. This type of dynamic code makes things really tough. Python's find_usage is nearly useless here.
I agree that this type of dynamism should be avoided if possible. However, I don't really see how else this can be implemented in a generic fashion, since each PEFT method can have different relevant attributes.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@BenjaminBossan Looks like everything is cleared for merge. But please don't merge until we have added CI tests.
Signed-off-by: ZX-ModelCloud <[email protected]>
@staticmethod
def test_load_lora():
    model_id = "ModelCloud/Llama-3.2-1B-gptqmodel-ci-4bit"
    adapter_id = "ModelCloud/Llama-3.2-1B-gptqmodel-ci-4bit-lora"

    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    model.load_adapter(adapter_id)

    print("peft model", model)

    # assert dynamic rank
    v_proj_module = model.model.layers[5].self_attn.v_proj
    assert isinstance(v_proj_module, GPTQLoraLinear)
    assert v_proj_module.lora_A["default"].weight.data.shape[0] == 128
    assert v_proj_module.lora_B["default"].weight.data.shape[1] == 128
    gate_proj_module = model.model.layers[5].mlp.gate_proj
    assert isinstance(gate_proj_module, GPTQLoraLinear)
    assert gate_proj_module.lora_A["default"].weight.data.shape[0] == 256
    assert gate_proj_module.lora_B["default"].weight.data.shape[1] == 256

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inp = tokenizer("Capital of France is", return_tensors="pt").to(model.device)
    tokens = model.generate(**inp)[0]
    result = tokenizer.decode(tokens)

    print("result: ", result)
@BenjaminBossan Unit test added which combines simple LoRA with LoraConfig.rank_pattern, which aligns with GPTQModel's dynamic per-module (or in this case, per-LoRA rank) override. So using GPTQModel you can automatically generate HF-compatible LoRA weights that have a different rank per module/LoRA combo via the rank_pattern dict.
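For illustration, a hedged sketch of the kind of LoraConfig that corresponds to the ranks asserted in the test above (values mirror the test; the actual adapter is generated by GPTQModel, not by this config):

from peft import LoraConfig

config = LoraConfig(
    r=128,                             # base rank (v_proj in the test)
    lora_alpha=128,                    # r == lora_alpha -> scaling == 1, the no-op case above
    target_modules=["v_proj", "gate_proj"],
    rank_pattern={"gate_proj": 256},   # per-module rank override
    alpha_pattern={"gate_proj": 256},  # keep scaling == 1 for the overridden module too
)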
PR is good to go, pending CI test success on PEFT. Needs to be paired with GPTQModel main.
PR Changes:
- Scaling multiply ops are skipped if Scaling == 1 (unsure the GPU is smart enough to no-op this micro optimization, so doing this manually). The LoRAs we are testing for GPTQ do not use Scale, so scale = r / lora_alpha where r == lora_alpha.

Notes:
- GPTQLoraLinear copies most of the code and structure from AwqLoraLinear.

TODO: