Problem with seg module when fine-tuning with LoRA #20
Comments
Hi, I'm sorry I'm late; I've been a little busy lately. Another approach is multi-step fine-tuning rather than a single step: for example, fine-tune the model on the text instruction dataset first, then run instruction fine-tuning on the segmentation datasets alone, avoiding mixing the two different data types. If you have successfully solved this problem with a better method, please share it with us. Thank you.
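A minimal sketch of such a two-stage schedule, in plain PyTorch. The dataset names, the batch handling, and the HF-style `.loss` output are all assumptions for illustration, not the project's actual training code:

```python
import torch
from torch.utils.data import DataLoader, Dataset

def finetune_stage(model: torch.nn.Module, dataset: Dataset,
                   epochs: int = 1, lr: float = 2e-5) -> torch.nn.Module:
    """Run one fine-tuning stage on a single data type (sketch)."""
    loader = DataLoader(dataset, batch_size=1, shuffle=True)
    # Only update trainable (e.g., LoRA) parameters.
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # assumes an HF-style output carrying .loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Stage 1: instruction-tune on the text dataset only.
# model = finetune_stage(model, text_instruction_dataset)
# Stage 2: instruction-tune on the segmentation dataset alone,
# so the two data types are never mixed within one stage.
# model = finetune_stage(model, segmentation_dataset)
```

The point of the split is that each stage sees a homogeneous batch type, so every rank exercises the same modules in every step.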
Hi, yeah, I think the reason is, as you said, that the segmentation modules are not always part of the computation, so on some ranks there may be no gradients for the seg modules when using DDP. If you use sharded-parameter technology like the ZeRO methods in DeepSpeed, which shard all parameters in the model, there will be NCCL communication problems. So my solution is to use FSDP integrated with accelerate, and to wrap only the LlamaDecoderLayer in the LLM backbone with the FSDP sharding strategy. Below is my accelerate configuration:
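(The YAML itself was not captured in this thread. The following is a minimal sketch of an accelerate FSDP config along the lines described, using accelerate's standard keys; the values here are illustrative, with the transformer-based wrap on `LlamaDecoderLayer` being the essential part.)

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_sharding_strategy: 1        # FULL_SHARD
  fsdp_offload_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_use_orig_params: true       # needed when LoRA leaves most params frozen
mixed_precision: bf16
num_machines: 1
num_processes: 8
main_training_function: main
use_cpu: false
```

With LoRA, `fsdp_use_orig_params: true` matters because FSDP's flattened parameter groups otherwise cannot mix frozen base weights with trainable adapter weights.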
Thank you very much for your contribution. I believe researchers will benefit from it.
Hi @baifanxxx:

I'm encountering an issue where the forward pass of the `SegVol` class hangs when the `image` is passed to `image_encoder`, resulting in NCCL communication timeouts when fine-tuning with LoRA. Below is the relevant part of the code:
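(The snippet referenced above was not preserved in this thread. As a stand-in, here is a hypothetical sketch of the structure being described; `SegVol`, `image_encoder`, and the tensor shape come from the report, while `mask_decoder` and the method body are illustrative.)

```python
import torch
import torch.nn as nn

class SegVol(nn.Module):
    """Hypothetical sketch of the forward structure described in this report."""

    def __init__(self, image_encoder: nn.Module, mask_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.mask_decoder = mask_decoder

    def forward(self, image: torch.Tensor):
        # image arrives as torch.Size([1, 1, 32, 256, 256]) per the report.
        image_embedding = self.image_encoder(image)  # <- process stalls here
        return self.mask_decoder(image_embedding)
```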
**Problem:** When `self.image_encoder(image)` is called inside the `forward` function, the process stalls, leading to an NCCL timeout. The input `image` has shape `torch.Size([1, 1, 32, 256, 256])`, but the process doesn't proceed past this point, causing a communication timeout.

**Questions:** What could cause the process to hang at `image_encoder`, and how can the resulting NCCL timeout be avoided?

**Environment:**

Any insights or suggestions would be appreciated!
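One hedged way to narrow down such hangs, assuming a standard `torch.distributed` launch: enable the distributed debug logging PyTorch and NCCL already provide, so each rank reports its collective calls and the rank that never reaches the timed-out collective becomes visible.

```python
import os

# Set before torch.distributed initializes. NCCL_DEBUG and
# TORCH_DISTRIBUTED_DEBUG are standard environment variables; with these,
# each rank logs its collectives, exposing which rank stalls first.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```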