Pretraining Divergence #524

Open
egoetz opened this issue Jun 13, 2024 · 3 comments

egoetz commented Jun 13, 2024

I have been trying to follow the steps listed under "reproducing GPT-2" in the README.md. Unfortunately, whenever I run the model, training always diverges. I have tried varying my learning rate and gradient accumulation, but neither tactic seemed to work (although I did have to fix a bug in my learning rate after varying those parameters). I could try changing those variables again, but my latest runs lead me to think that neither of those parameters is the issue:

[Figure: training-loss curves for the two most recent runs]
Here are the last two runs. The orange run decays the learning rate over 300,000 steps while the pink run decays the learning rate over 600,000 steps. For these runs the learning rate starts at 6e-5 and hits its minimum at 6e-6.

Here are some of my meta-parameters:
batch_size = 24
block_size = 1024
max_iters = 300000
lr_decay_iters = 300000
eval_interval = 1000
eval_iters = 200
log_interval = 100
weight_decay = 5e-2

I am running this model on 4 A100 80GB GPUs.
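
For reference, here is roughly the learning-rate schedule I mean: a sketch modeled on nanoGPT's cosine get_lr(), using the values from the runs above. The warmup_iters value is a placeholder I am assuming, not something I listed.

```python
import math

# Sketch of the cosine learning-rate schedule (modeled on nanoGPT's get_lr()).
# learning_rate / min_lr / lr_decay_iters are the values from the runs above;
# warmup_iters is an assumed placeholder, not a value stated in this issue.
learning_rate = 6e-5      # LR reached at the end of warmup
min_lr = 6e-6             # floor reached at lr_decay_iters
lr_decay_iters = 300_000  # 600_000 for the pink run
warmup_iters = 2_000      # assumption, for illustration only

def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(warmup_iters), get_lr(150_000), get_lr(300_000))  # 6e-05, ~3.3e-05, 6e-06
```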

iminfine commented

This is caused by flash attention. Please disable it and use the original self-attention.
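
For example (a sketch, assuming CausalSelfAttention in model.py still matches the upstream nanoGPT layout): forcing the non-flash path in the constructor also registers the causal-mask buffer that the manual attention needs.

```python
# model.py, CausalSelfAttention.__init__ -- sketch of the relevant lines only.
# Upstream enables flash when the fused kernel is available:
#   self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
# Hard-coding it to False makes forward() take the explicit
# softmax(q @ k.T / sqrt(d)) @ v path instead.
self.flash = False
if not self.flash:
    # causal mask used by the manual attention fallback
    self.register_buffer(
        "bias",
        torch.tril(torch.ones(config.block_size, config.block_size))
             .view(1, 1, config.block_size, config.block_size),
    )
```

(Alternatively, PyTorch 2.x has a torch.backends.cuda.sdp_kernel context manager that can disable the flash backend at runtime, but editing model.py is the more direct route here.)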

Also use the default training config. Here is the batch setting for 4 A100 80GB GPUs:

30 batch size * 1024 block size * 4 gradaccum * 4 GPUs = 491,520


egoetz commented Jul 30, 2024

How do you disable flash attention? I can't find anything on the torch website suggesting that it is toggleable.


egoetz commented Aug 4, 2024

Is there a way to find the correct configuration for an arbitrary setup? Based on your comment and the original script, I'm not exactly sure when to alter the batch size vs. the gradient accumulation; here is my attempt to reconcile the two (a small sanity-check sketch follows these configs):

From iminfine's comment - 4 A100s, one node

30 batch size * 1024 block size * 4 gradaccum * 4 GPUs = 491,520
batch_size = 30
block_size = 1024
gradient_accumulation_steps = 4 * 4 = 16

From train_gpt2.py - 8 A100s, one to two nodes

12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8 = 40
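
Here is the sanity-check sketch mentioned above. It assumes the convention that gradient_accumulation_steps in the config is the total across all GPUs, which train.py then divides by the DDP world size:

```python
# Sketch: tokens processed per optimizer step for an arbitrary GPU count.
# Assumes gradient_accumulation_steps is given as the total across all GPUs
# (nanoGPT's train.py divides it by the DDP world size at startup).

def tokens_per_iter(batch_size, block_size, grad_accum_total, num_gpus):
    assert grad_accum_total % num_gpus == 0, "grad accum must divide evenly across GPUs"
    per_gpu_accum = grad_accum_total // num_gpus
    return batch_size * block_size * per_gpu_accum * num_gpus

print(tokens_per_iter(30, 1024, 16, 4))  # 491520 -- iminfine's 4x A100 setting
print(tokens_per_iter(12, 1024, 40, 8))  # 491520 -- train_gpt2.py's 8x A100 setting
```

If that is right, then as long as batch_size * block_size * grad_accum_total stays near ~0.5M tokens, choosing between a larger per-GPU batch and more accumulation steps is mostly a memory trade-off.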

I will attempt to use your suggested parameters with the 4 A100s. I have also had much better luck recently in accessing more GPUs, so I will try to replicate the results of the 8 A100 training discussed in the README. For the 8 GPUs I changed the parameters back to their default values, but I still have flash attention enabled.

Are you suggesting that the divergence may be caused by flash attention instability? I have changed my default dtype value from bfloat16 to float32 in the hope that increasing the precision could address that issue.
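
Concretely, I made that change with a config override along these lines (assuming train.py's module-level dtype setting):

```python
# Sketch of the config override: run in full precision instead of bfloat16
# to rule out low-precision numerical issues.
dtype = 'float32'  # was 'bfloat16'
```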
