Improve overflow handling in ZeRO #6976

Status: Open · wants to merge 66 commits into master
Conversation

tjruwase (Contributor) commented Jan 28, 2025

Fix #5241: Improve overflow handling (see the sketch after this list) in:

  • ZeRO 1
  • ZeRO 2
  • ZeRO 3
  • BF16Optimizer

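A minimal sketch of the global overflow check these optimizers share: scan local gradients for inf/nan, then all-reduce so every rank agrees on skipping the step. The helper name and signature here are illustrative, not this PR's actual code:

```python
import torch
import torch.distributed as dist

def step_overflowed(grads, process_group=None):
    """Return True on every rank if any rank's gradients contain inf/nan."""
    found = torch.zeros(1, device=grads[0].device)
    for g in grads:
        if not torch.isfinite(g).all():
            found.fill_(1.0)
            break
    # MAX reduction: one rank seeing inf/nan marks the step for all ranks.
    dist.all_reduce(found, op=dist.ReduceOp.MAX, group=process_group)
    return bool(found.item())
```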
Enable pydantic configuration for mixed precision (see the example after this list):

  • bf16
  • fp16
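A minimal sketch of such a model, assuming pydantic v2; the field names mirror the documented fp16/bf16 JSON keys, but the classes themselves are illustrative, not the PR's implementation:

```python
from typing import Optional
from pydantic import BaseModel, Field

class FP16Config(BaseModel):
    enabled: bool = False
    loss_scale: float = Field(0.0, ge=0.0)  # 0.0 selects dynamic loss scaling
    initial_scale_power: int = 16           # initial dynamic scale = 2**16
    loss_scale_window: int = 1000           # overflow-free steps before raising the scale
    hysteresis: int = 2                     # overflows tolerated before lowering the scale
    min_loss_scale: float = 1.0

class BF16Config(BaseModel):
    enabled: bool = False                   # bf16 has fp32's range, so no loss-scale fields

class MixedPrecisionConfig(BaseModel):
    fp16: Optional[FP16Config] = None
    bf16: Optional[BF16Config] = None

# Usage: MixedPrecisionConfig.model_validate({"bf16": {"enabled": True}})
```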

tjruwase (Contributor, Author) commented:
@delock, @inkcherry, can you please help investigate the failing xpu-max1100 CI? Thanks!

delock (Collaborator) commented Feb 5, 2025

> @delock, @inkcherry, can you please help investigate the failing xpu-max1100 CI? Thanks!

@tjruwase thanks! Our engineer is looking into it.

tjruwase and others added 21 commits February 6, 2025 13:02
Signed-off-by: Olatunji Ruwase <[email protected]>

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
…ead of Raising Error (#6979)

This pull request addresses an issue in setup_env_ranks where, under
certain conditions, the function raises an error instead of setting the
necessary MPI-related environment variables (LOCAL_RANK, RANK, and
WORLD_SIZE). The intended behavior is to properly map Open MPI variables
(OMPI_COMM_WORLD_*) to the variables expected by DeepSpeed/PyTorch, but
the code currently raises an EnvironmentError if these Open MPI
variables are not found.

With this fix (sketched below), setup_env_ranks will:

- Correctly detect and map the required Open MPI environment variables.
- Only raise an error if there is genuinely no valid way to obtain rank
information from the environment (e.g., both Open MPI variables and any
fallback mechanism are unavailable).

Fix #6895

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
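A minimal sketch (not the PR's exact code) of the behavior described above: map Open MPI's OMPI_COMM_WORLD_* variables onto the names DeepSpeed/PyTorch expect, and raise only when rank information is genuinely unavailable:

```python
import os

_OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
}

def setup_env_ranks():
    # Map each Open MPI variable to its DeepSpeed/PyTorch equivalent.
    for ompi_var, torch_var in _OMPI_TO_TORCH.items():
        if ompi_var in os.environ:
            os.environ[torch_var] = os.environ[ompi_var]
    # Fail only if neither Open MPI nor a pre-set fallback provided the values.
    missing = [v for v in _OMPI_TO_TORCH.values() if v not in os.environ]
    if missing:
        raise EnvironmentError(f"Cannot determine rank info; missing: {missing}")
```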
…h 2.6) (#6982)

Fixes #6984.

The workflow was pulling the newly released torch 2.6, which caused CI
failures. This keeps us on torch 2.5 for now, since installing torchvision
later in the workflow was unintentionally pulling in torch 2.6 as a
dependency.

This PR also unsets NCCL_DEBUG to avoid a large printout in the case of
any errors.

Signed-off-by: Olatunji Ruwase <[email protected]>
As discussed in
[PR-6918](#6918), padding can
occur on multiple ranks with large DP degrees.

For example, with:
- Flattened tensor size: 266240
- DP degree: 768
- Alignment: 1536
- Required padding: 1024 (1536 * 174 - 266240)
- Per-rank partition size: 348 (1536 * 174 / 768)

the padding spans the last three ranks. This PR removes the single-rank
padding assumption to handle such general cases (see the sketch below).
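A standalone arithmetic check (not DeepSpeed code) reproducing these numbers and showing the tail padding spilling across the final ranks:

```python
def rank_padding(numel, dp_degree, alignment):
    # Round the flat buffer up to the alignment boundary: 174 * 1536 = 267264.
    aligned = ((numel + alignment - 1) // alignment) * alignment
    part = aligned // dp_degree  # per-rank partition size: 348
    # Padding occupies the tail region [numel, aligned) of the flat buffer;
    # report how much of it falls into each rank's partition.
    return [max(0, min((r + 1) * part, aligned) - max(r * part, numel))
            for r in range(dp_degree)]

pads = rank_padding(266240, 768, 1536)
assert sum(pads) == 1024
assert pads[-3:] == [328, 348, 348]   # padding lands on the last three ranks
assert all(p == 0 for p in pads[:-3])
```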

---------

Co-authored-by: Sam Foreman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Fix #6772

---------

Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
…#6967)

- Issues with nv-sd updates, will follow up with a subsequent PR

Signed-off-by: Olatunji Ruwase <[email protected]>
The NVIDIA Blackwell GPU generation has compute capability major version 10.
Its SM code and architecture should be `100`, but the current code generates
`1.` because it assumes a 2-character string.

This change treats the compute capability as a `.`-separated string, splitting
on the dot and joining the resulting parts instead of indexing fixed character
positions.
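An illustrative sketch of the fix (not the repo's exact code): derive the SM code by splitting the compute-capability string on `.`, which handles the three-digit Blackwell case:

```python
def sm_code(compute_capability: str) -> str:
    # Split on '.' and join the parts, rather than indexing fixed positions.
    major, minor = compute_capability.split(".")
    return major + minor

assert sm_code("8.0") == "80"
assert sm_code("9.0") == "90"
assert sm_code("10.0") == "100"   # the old 2-character logic yielded "1."
```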

Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
1. Update Intel oneAPI basekit to 2025.0
2. Update torch/ipex/oneccl to 2.5

Signed-off-by: Olatunji Ruwase <[email protected]>
Same as [this PR](#6922) ([affeb88](affeb88)).
I noticed the CI updated the DCO check recently. Using the suggested rebase
method for sign-off would reintroduce many conflicts, so I opted for a squash
merge with sign-off instead. Thanks! :)

Signed-off-by: inkcherry <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
tjruwase requested a review from hwchen2017 as a code owner, February 7, 2025 14:57
Development

Successfully merging this pull request may close these issues.

[BUG] Zero2 offload overflow
9 participants