-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MoSLoRA (EMNLP 2024) #2294
base: main
Are you sure you want to change the base?
Add MoSLoRA (EMNLP 2024) #2294
Conversation
Thank you for resuming the work on MoSLoRA. I just want to let you know that we're currently on vacation and will review this next year. |
Okay, enjoy your vacation. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this very thorough PR. I reviewed parts of the code for now, but did not do a full deep dive. For this, it would be really great if you could do the following:
- Quickly summarize the changes you made compared to LoRA. E.g. you included all the modules for the quantized layers, which is great. But it would take a lot of time to review them all when there is probably only very little change. Could you summarize those changes.
- Similarly, I assume that most of
model.py
andlayer.py
is copied 1:1. Could you summarize the parts that were changed so that I can review them specifically? - On top of that, it would be also great if you could mark those changes in the code with comments. This will also help with keeping MosLoRA up to date when there are future changes in LoRA.
Regarding the parts that I did review, I had a couple of comments but most should be quick to fix. I think the biggest change I consider would be about the parameter names: Right now, you mostly still use lora
all throughout, e.g. we still have lora_A
, lora_B
, etc. I'm a bit conflicted about this: On the one hand, those parameters have the same function as in LoRA, on the other hand it could be confusing when checking the model what type of adapter we have. Since you probably put some thoughts into this, could you share your reasoning?
Apart from those comments, could you please:
- Run
make style
- Update the year in the copyright notices to 2025
@@ -54,6 +54,17 @@ lora_config = LoraConfig(init_lora_weights="pissa_niter_[number of iters]", ...) | |||
``` | |||
For detailed instruction on using PiSSA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/pissa_finetuning). | |||
|
|||
### MoSLoRA | |||
[MoSLoRA](https://arxiv.org/abs/2406.11909) initializes the lora branch with an extra mixer layer. The vanilla LoRA can be viewed as the special case that the mixer is a _fixed_ identity metrix mixing $r$ subspaces. MoSLoRA try to employ a _learnable_ matrix to mix the information from $r^2$ subspaces. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MoSLoRA try to employ a learnable matrix to mix the information from
$r^2$ subspaces.
Let's reword this, there is no need for having "try to", right?
@@ -120,6 +120,42 @@ def starcoder_model_postprocess_past_key_value(past_key_values): | |||
"qwen2": ["q_proj", "v_proj"], | |||
} | |||
|
|||
TRANSFORMERS_MODELS_TO_MOSLORA_TARGET_MODULES_MAPPING = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just make this a copy of TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
? I think each time that LoRA works, MosLoRA should also work.
- **peft_config** ([`MoSLoraConfig`]): The configuration of the Lora model. | ||
""" | ||
|
||
prefix: str = "lora_" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prefix should be changed to moslora
to avoid confusion.
|
||
class MoSLoraLayer(LoraLayer): | ||
# All names of layers that may contain (trainable) adapter weights | ||
adapter_layer_names = ("lora_A", "lora_B", "lora_mixer", "lora_embedding_A", "lora_embedding_B") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have all these parameter names that are still lora_*
. I think it will be better to rename them all to moslora_*
to avoid confusion. WDYT?
|
||
def __init__(self, base_layer: nn.Module, ephemeral_gpu_offload: bool = False, **kwargs) -> None: | ||
super().__init__(base_layer) | ||
self.use_moslora: dict[str, bool] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this argument? If I use MosLoraConfig
, it's clear that I want to use it.
metadata={ | ||
"help": ( | ||
"Whether to enable 'Mixture-of-Subspaces in Low-Rank Adaptation' (MoSLoRA)." | ||
"This technique employs a learnable mixer to fuse more subspaces in vanilla LoRA and more flexibly." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This technique employs a learnable mixer to fuse more subspaces in vanilla LoRA and more flexibly." | |
"This technique employs a learnable mixer to fuse more subspaces in vanilla LoRA and is more flexibly." |
use_moslora: (`bool` | `Literal["kai", "orth"]`): | ||
|
||
""" | ||
use_moslora: (bool | Literal["kai", "orth"]) = field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it would be better to rename the parameter to moslora_init
, as this is more about how to initialize the MosLoRA component.
"Passing `'False'` to disable mixer and thus it would be same as vanilla LoRA" | ||
"Passing `'kai'` results in Kaiming Uniform initialization for Mixer." | ||
"Passing `'orth'` results in Orthogonal initialization for Mixer." | ||
"Passing `'True'` would enable Kaiming Uniform initialization which is default" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "kai"
and True
options are identical, right? I would suggest to remove the redundancy. Also, I don't think we need the False
option -- if users want LoRA, they can already do that. Moreover, I prefer the names to be fully spelled out for clarity (remember to update the docs & examples too).
In sum, I would change the argument to: moslora_init: Literal["kaiming", "orthogonal"]
|
||
Args: | ||
use_moslora: (`bool` | `Literal["kai", "orth"]`): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As is, if users inspect the docstring of MosLoRA, they would only see this one argument. You should copy the full docstring of the LoRA config and make the MosLoRA-specific changes.
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device_map="cuda") | ||
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0") | ||
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train") | ||
lora_config = MoSLoraConfig( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lora_config = MoSLoraConfig( | |
moslora_config = MoSLoraConfig( |
Code below also needs adjusting.
Heads up: We recently merged PR #2282, which facilitates the addition of new PEFT methods. This is what causes the merge conflicts. It should be straightforward to adapt the code to the new way, as it reduces book keeping. Check for instance the diff for LoRA: https://github.com/huggingface/peft/pull/2282/files#diff-1c7921c8a3305a64be1fe90184fd27e1d32547caba0037e5ead8cd8927e93ebc. You can do the same changes, while removing the need to update all those mappings (see the description of said PR). If you have any doubts, just ask me. |
Previous PR:
#2013
#1905
Paper: https://aclanthology.org/2024.emnlp-main.450/
Blog: https://zhuanlan.zhihu.com/p/704821936
TL, DR: We decompose LoRA into subspaces via structural re-parameterization and propose a simple yet effective MoSLoRA method, employing a learnable mixer to fuse more subspaces and more flexibly.
Sorry for the late submission. Based on the discussion in the previous PRs, I have implemented MoSLoRA in separate files. Also, I added test cases and expanded the documents with guidance on applying MoSLoRA.
Please help review the code. Thanks for your kind work and Merry Christmas.