Weights for relation_detr_focalnet_large_lrf_fl4_800_1333.py #15

Open
ck-amrahd opened this issue Aug 31, 2024 · 10 comments
Labels
question Further information is requested

Comments

@ck-amrahd

Question

Hi, thanks for the awesome repo. I am trying to fine-tune your model on a custom dataset, but my GPU memory is not enough for the relation_detr_focalnet_large_lrf_fl4_1200_2000.py version, even with batch_size=1 and "fp16" mixed-precision training. Could you please release the weights and accuracy info for the relation_detr_focalnet_large_lrf_fl4_800_1333.py version? Thank you.


ck-amrahd added the question label on Aug 31, 2024
@ck-amrahd
Author

The same model works for both image sizes, but my GPU memory is not enough even for 800 x 1333 images. Did you get the same accuracy with both input image sizes, (800, 1333) and (1200, 2000)?

@ck-amrahd
Author

Is there any way to reduce the memory usage? I think the memory issues come from the transformer's O(n^2) attention; could we replace it with an O(n) version or something like xformers?

@ck-amrahd
Author

I have now gone into the details: the model uses deformable attention, so we shouldn't need a separate O(n) attention. Please correct me if I am wrong. Maybe we can fine-tune it using LoRA? I don't know how straightforward that would be for this model. Could you suggest some ideas for fine-tuning your biggest model on a single RTX 3090 Ti machine (24 GB VRAM)?

@xiuqhou
Owner

xiuqhou commented Sep 1, 2024

Hi @ck-amrahd, thanks for your question.

Actually, we haven't pre-trained relation_detr_focalnet_large_lrf_fl4_800_1333 on COCO. To fine-tune on a custom dataset, you can directly load the 1200_2000 weights into the 800_1333 model, since the image size does not change the model architecture.

The model accuracy at image size (800, 1333) should be a little lower than at (1200, 2000), but it will use much less memory.

The memory cost mainly comes from the backbone and transformer encoder. Here is some advice for reducing memory when fine-tuning:

For backbone:

  • Freeze more stages: each backbone has 4 stages indexed (0, 1, 2, 3); by default no stages are frozen. You can freeze more stages with freeze_indices to save memory, for example freeze_indices=(0,) or freeze_indices=(0, 1).

  • Use fewer feature maps for the transformer: set return_indices=(1, 2, 3) for focalnet_large; this reduces the number of tokens fed to the transformer encoder by about 50% and only costs about 1~2 AP on COCO.

Here is a backbone setting with a better trade-off between GPU memory and accuracy.

backbone = FocalNetBackbone("focalnet_large_lrf_fl4", weights=False, return_indices=(1, 2, 3), freeze_indices=(0,))

For transformer encoder:

Yes, we don't need extra O(n) attention, since deformable attention is already O(n).

LoRA mainly addresses the memory cost of large parameter matrices by decomposing W into low-rank factors, but the memory of Relation-DETR mainly comes from the model's intermediate activations, not its parameters, so LoRA may not help much here. If you want to try it anyway, you can wrap the following linear layers in the MultiScaleDeformableAttention of the transformer encoder with LoRA:

self.attention_weights = nn.Linear(embed_dim, num_heads * num_levels * num_points)
self.value_proj = nn.Linear(embed_dim, embed_dim)
self.output_proj = nn.Linear(embed_dim, embed_dim)
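
For reference, here is a rough sketch of such a LoRA wrapper (plain PyTorch; the attribute names in the usage comment are only assumptions about the module tree, please double-check them against the code):

from torch import nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + (alpha / r) * B(A(x)), with the base weight frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original projection
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage: wrap the three projections inside each encoder MultiScaleDeformableAttention.
# The attribute name self_attn is an assumption; check the encoder layer definition in the repo.
# for layer in model.transformer.encoder.layers:
#     attn = layer.self_attn
#     attn.value_proj = LoRALinear(attn.value_proj)
#     attn.output_proj = LoRALinear(attn.output_proj)
#     attn.attention_weights = LoRALinear(attn.attention_weights)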

@ck-amrahd
Author

Thank you @xiuqhou, I will give it a try.

@ck-amrahd
Author

Hi @xiuqhou, thanks for the idea. I am able to fine-tune the larger model with:

backbone = FocalNetBackbone("focalnet_large_lrf_fl4", weights=False, return_indices=(2, 3), freeze_indices=(0, 1))

I also reduced the number of queries and the number of hybrid proposals, and now I am able to fine-tune the version that takes 1200 * 2000 images. However, the model fine-tuned with this setting performed poorly compared to a model that I fine-tuned from the Swin-L backbone, and I am not able to figure out why. I will keep looking into it. In the meantime, do you have any intuition about why that may be the case?

@xiuqhou
Owner

xiuqhou commented Sep 3, 2024

Hi @ck-amrahd, thanks for your feedback.
If you can fine-tune with the Swin-L backbone, the FocalNet-Large backbone should also work under the same settings, because they have a similar memory cost. So I suggest keeping the backbone settings consistent with Swin-L, including:

Please use return_indices=(1, 2, 3) instead of return_indices=(2, 3). Dropping more indices seriously hurts performance without saving much memory: from index 0 to index 3, each level costs only about 1/4 of the memory of the previous one, so return_indices=(1, 2, 3) is enough.

Did you change min_size and max_size but leave train_transform in train_config.py unchanged? They should be changed together: the strong_album transform should be used with the 800 * 1333 version, and strong_album_1200_2000 with the 1200 * 2000 version. You can define your own data augmentation for other image sizes.

min_size and max_size control the image size at inference, while train_transform controls the image size during training.
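
For example, as a sketch only (the exact layout of the config files may differ, and the import path for the presets is an assumption):

# model config: image size used at inference (see the RelationDETR(...) arguments below)
min_size = 800
max_size = 1333

# train_config.py: image size used during training comes from the transform
from transforms.presets import strong_album  # hypothetical import path, check the repo
train_transform = strong_album  # use strong_album_1200_2000 for the 1200 * 2000 setting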

On the other hand, I strongly suggest using the 800 * 1333 version, since larger image sizes have diminishing returns but greatly increase the memory cost. And please do not reduce or increase num_queries and hybrid_num_proposals, as they have a large effect on final performance.

To keep it simple, here is my own modified 800 * 1333 model config for FocalNet-Large. I have successfully run it on my own 3090 GPU with bf16 and the strong_album train_transform. It should perform better than Swin-L. You can load the checkpoint (https://github.com/xiuqhou/Relation-DETR/releases/download/v1.0.0/relation_detr_focalnet_large_lrf_fl4_o365_4e-coco_2x_1200_2000.pth) directly and fine-tune it.

from torch import nn

from models.backbones.focalnet import FocalNetBackbone
from models.bricks.position_encoding import PositionEmbeddingSine
from models.bricks.post_process import PostProcess
from models.bricks.relation_transformer import (
    RelationTransformer,
    RelationTransformerDecoder,
    RelationTransformerEncoder,
    RelationTransformerEncoderLayer,
    RelationTransformerDecoderLayer,
)
from models.bricks.set_criterion import HybridSetCriterion
from models.detectors.relation_detr import RelationDETR
from models.matcher.hungarian_matcher import HungarianMatcher
from models.necks.channel_mapper import ChannelMapper

# mostly changed parameters
embed_dim = 256
num_classes = 91
num_queries = 900
hybrid_num_proposals = 1500
hybrid_assign = 6
num_feature_levels = 5
transformer_enc_layers = 6
transformer_dec_layers = 6
num_heads = 8
dim_feedforward = 2048

# instantiate model components
position_embedding = PositionEmbeddingSine(
    embed_dim // 2, temperature=10000, normalize=True, offset=-0.5
)

backbone = FocalNetBackbone("focalnet_large_lrf_fl4", weights=False, return_indices=(1, 2, 3), freeze_indices=(0,))

neck = ChannelMapper(backbone.num_channels, out_channels=embed_dim, num_outs=num_feature_levels)

transformer = RelationTransformer(
    encoder=RelationTransformerEncoder(
        encoder_layer=RelationTransformerEncoderLayer(
            embed_dim=embed_dim,
            n_heads=num_heads,
            dropout=0.0,
            activation=nn.ReLU(inplace=True),
            n_levels=num_feature_levels,
            n_points=4,
            d_ffn=dim_feedforward,
        ),
        num_layers=transformer_enc_layers,
    ),
    decoder=RelationTransformerDecoder(
        decoder_layer=RelationTransformerDecoderLayer(
            embed_dim=embed_dim,
            n_heads=num_heads,
            dropout=0.0,
            activation=nn.ReLU(inplace=True),
            n_levels=num_feature_levels,
            n_points=4,
            d_ffn=dim_feedforward,
        ),
        num_layers=transformer_dec_layers,
        num_classes=num_classes,
    ),
    num_classes=num_classes,
    num_feature_levels=num_feature_levels,
    two_stage_num_proposals=num_queries,
    hybrid_num_proposals=hybrid_num_proposals,
)

matcher = HungarianMatcher(
    cost_class=2, cost_bbox=5, cost_giou=2, focal_alpha=0.25, focal_gamma=2.0
)

# construct weight_dict for loss
weight_dict = {"loss_class": 1, "loss_bbox": 5, "loss_giou": 2}
weight_dict.update({"loss_class_dn": 1, "loss_bbox_dn": 5, "loss_giou_dn": 2})
aux_weight_dict = {}
for i in range(transformer.decoder.num_layers - 1):
    aux_weight_dict.update({k + f"_{i}": v for k, v in weight_dict.items()})
weight_dict.update(aux_weight_dict)
weight_dict.update({"loss_class_enc": 1, "loss_bbox_enc": 5, "loss_giou_enc": 2})
weight_dict.update({k + "_hybrid": v for k, v in weight_dict.items()})

criterion = HybridSetCriterion(
    num_classes=num_classes, matcher=matcher, weight_dict=weight_dict, alpha=0.25, gamma=2.0
)
postprocessor = PostProcess(select_box_nums_for_evaluation=300)

# combine above components to instantiate the model
model = RelationDETR(
    backbone=backbone,
    neck=neck,
    position_embedding=position_embedding,
    transformer=transformer,
    criterion=criterion,
    postprocessor=postprocessor,
    num_classes=num_classes,
    num_queries=num_queries,
    hybrid_assign=hybrid_assign,
    denoising_nums=100,
    min_size=800,
    max_size=1333,
)
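
To load the released checkpoint into this config, something like the following should work (a plain-PyTorch sketch; the checkpoint may store the weights under a "model" key, and strict=False tolerates a classification head trained with a different num_classes):

import torch

url = (
    "https://github.com/xiuqhou/Relation-DETR/releases/download/v1.0.0/"
    "relation_detr_focalnet_large_lrf_fl4_o365_4e-coco_2x_1200_2000.pth"
)
checkpoint = torch.hub.load_state_dict_from_url(url, map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # unwrap if stored under a "model" key

# strict=False skips mismatched keys, e.g. the classification head for a custom num_classes
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")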

@ck-amrahd
Author

Hi @xiuqhou, thank you so much for the detailed feedback. I will try it and let you know.

@ck-amrahd
Author

Hi @xiuqhou, thank you for the feedback. I am now fine-tuning the large FocalNet model on a custom dataset following your instructions, but only for 1-2 epochs due to the computational cost. The total loss starts around 60 and drops to around 40 by the end of the first epoch. I will train for more epochs, but this loss seems quite high for a detection task. Do you have any intuition for this? Is that what you observe when fine-tuning on other datasets?

@xiuqhou
Owner

xiuqhou commented Sep 28, 2024

Hi @ck-amrahd, the total loss for the first epoch looks OK. Our method has an extra branch compared to DETRs like DINO, so it contains more loss terms and a larger total loss. When I trained Relation-DETR on COCO, the loss also started around 60 and went down to around 35 by the end of the first epoch. That is similar to your result; the difference may come from the different dataset sizes. As long as the loss goes down steadily, the training process should be fine. You can refer to our released training log for details.
