Add DeepSpeed Example with Pytorch Operator #2235
Conversation
Pull Request Test Coverage Report for Build 11216096781
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
Force-pushed from 23c73db to 6bd91b8
@tenzen-y @andreyvelich @kuizhiqing @terrytangyuan This PR is ready for review. PTAL, Thanks!
Thank you for adding this great example @Syulin7!
/assign @kubeflow/wg-training-leads @kuizhiqing
DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate.
See [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).
Do we set the appropriate env variables for deepspeed or accelerate launchers in PyTorchJob, or can only torchrun be used?
When using the deepspeed launcher, it defaults to using pdsh (machines accessible via passwordless SSH) to send commands to the workers for execution, which is the launcher-worker mode.
The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI V2 (via passwordless SSH) would be more appropriate. DeepSpeed does not require setting env variables; it reads the information from the hostfile.

```bash
# the --hostfile path defaults to /job/hostfile
deepspeed --hostfile=/etc/mpi/hostfile /train_bert_ds.py --checkpoint_dir /root/deepspeed_data
```

About the hostfile, see: https://github.com/microsoft/DeepSpeed/blob/3b09d945ead6acb15a172e9a379fc3de1f64d2b2/docs/_tutorials/getting-started.md?plain=1#L173-L187

```
# hostfile
worker-1 slots=4
worker-2 slots=4
```

I can add an example in mpi-operator (MPI V2) later.
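For illustration, such an MPIJob might look roughly like the sketch below (not part of this PR; it assumes the mpi-operator v2beta1 API, that this example's image also ships an SSH server for pdsh, and that the resource name is just a placeholder):

```yaml
# Rough sketch only: DeepSpeed's launcher-worker mode on the standalone
# mpi-operator (v2beta1). Assumes the image ships DeepSpeed and an SSH server,
# and that the operator provides /etc/mpi/hostfile plus the SSH setup.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: deepspeed-mpijob            # placeholder name
spec:
  slotsPerWorker: 4                 # matches the "slots=4" hostfile entries above
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: deepspeed
              image: kubeflow/pytorch-deepspeed-demo:latest   # assumed reusable here
              command:
                - deepspeed
                - --hostfile=/etc/mpi/hostfile
                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: deepspeed
              image: kubeflow/pytorch-deepspeed-demo:latest
```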
In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.
> The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI V2 (via passwordless SSH) would be more appropriate.

Thanks for this info! I think we can support it once we migrate to MPI V2 in the TrainJob API. cc @tenzen-y @alculquicondor
That way we can build a specific deepspeed runtime that will leverage MPI orchestration to create hostfiles.

> In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.

As far as I know, accelerate is compatible with torchrun. However, it might have some additional parameters that torchrun doesn't allow to be set, e.g. mixed precision: https://huggingface.co/docs/accelerate/en/basic_tutorials/launch#:~:text=MIXED_PRECISION%3D%22fp16%22
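Purely as an illustration of that point (not something this PR does), the container command could invoke accelerate instead of torchrun to pass such an option; the script path and flag values below are only assumptions:

```yaml
# Illustrative fragment only: accelerate's launcher accepts options such as
# --mixed_precision that torchrun itself does not expose.
command:
  - accelerate
  - launch
  - --mixed_precision=fp16
  - /train_bert_ds.py
  - --checkpoint_dir
  - /root/deepspeed_data
```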
deepspeed is already compatible with mpi-operator (the one outside of training-operator).
Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610
> deepspeed is already compatible with mpi-operator (the one outside of training-operator). Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610

Yes, the image used in this example is one I built earlier. I can provide the Dockerfile for reference. cc @alculquicondor @kuizhiqing
I'm happy to accept a PR for this in the mpi-operator repo.
I think once we merge this PR, we can refer to this training script in the MPI-Operator repo as well and add a simple YAML with an MPIJob.
@Syulin7 Yes, thanks to your original work on the base image. The plan in kubeflow/mpi-operator#610 has stalled somewhat for some reason. You are very welcome to continue it.
```yaml
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:
```
Why do you need a Master replica for this example?
Actually, the complete command is as follows; torchrun will read the environment variables MASTER_ADDR, MASTER_PORT, and RANK (which are set by the training operator in the pod env):

```bash
# node1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>
# node2
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>
```

So the command can be simplified as follows:

```bash
torchrun --nproc_per_node=8 --nnodes=2 your_program.py <normal cl args>
```
Yeah, I think we have a problem with the V1 Training Operator in that we only set MASTER_PORT when a Master replica is set. Eventually, you shouldn't need a dedicated Master replica if the PodTemplateSpec is the same across all nodes.
> I think we have a problem with the V1 Training Operator in that we only set MASTER_PORT when a Master replica is set.

Yes, so we need a Master replica for this example.
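For reference, a minimal PyTorchJob skeleton along these lines would look roughly as follows (values are illustrative and may differ from the manifest in this PR):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:                    # required so the V1 operator sets MASTER_PORT
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
    Worker:                    # same PodTemplateSpec as the Master
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
```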
```python
# Checkpoint Related Functions


def load_model_checkpoint(
```
How do we use it?
The function is not used; this script was copied directly from DeepSpeedExamples:
https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md
The tutorial shows the changes necessary to integrate DeepSpeed and some of the advantages of doing so. load_model_checkpoint is used in train_bert.py, which is the original script that does not use DeepSpeed.
I'm not sure whether we should delete it or stay consistent with DeepSpeedExamples.
I am fine with both; eventually we can add it when we have a more dedicated example/notebook where we can show how to resume training from a checkpoint.
Any thoughts @kubeflow/wg-training-leads?
```python
    return uuid


def create_experiment_dir(
```
Do we need this experiment dir in this example?
Similar to the issue above: it creates a directory in the checkpoint_dir on rank 0. I think we can stay consistent with DeepSpeedExamples.
```python
    )
    # Save the last checkpoint if not saved yet
    if step % checkpoint_every != 0:
        model.save_checkpoint(save_dir=exp_dir, client_state={"checkpoint_step": step})
```
Will model checkpointing happen only on the rank 0 node?
Yes, generally, the model will be saved on shared storage (using PVC).
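As a rough sketch of that setup (not part of this PR), the checkpoint directory could be backed by a shared PVC mounted into every replica; the claim name below is hypothetical:

```yaml
# Illustrative fragment: mount a shared (e.g. ReadWriteMany) PVC at the
# checkpoint directory in every replica, so all ranks write to the same storage.
spec:
  containers:
    - name: pytorch
      image: kubeflow/pytorch-deepspeed-demo:latest
      volumeMounts:
        - name: checkpoints
          mountPath: /root/deepspeed_data      # matches --checkpoint_dir
  volumes:
    - name: checkpoints
      persistentVolumeClaim:
        claimName: deepspeed-checkpoints       # hypothetical claim name
```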
@Syulin7 Do we have this check as part of the save_checkpoint() API, or do we need to verify it?
Like in this FSDP example from PyTorch.
@andreyvelich All processes must call save_checkpoint(), so we don't need to verify it.
https://www.deepspeed.ai/getting-started/#model-checkpointing

> Important: all processes must call this method and not just the process with rank 0. It is because each process needs to save its master weights and scheduler+optimizer states. This method will hang waiting to synchronize with other processes if it's called just for the process with rank 0.
@Syulin7 @kubeflow/wg-training-leads Are we ready to merge this PR?
@andreyvelich Yes, I think this PR can be merged.
Thanks @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads
Good work.
```yaml
- name: pytorch
  image: kubeflow/pytorch-deepspeed-demo:latest
  command:
    - torchrun
```
@Syulin7 No, actually, you don't need to set those parameters for torchrun; setting the correct environment-related parameters (or, in the operator case, env variables) is the responsibility of the operator.
If you set them, the parameters will override the env, which will certainly work, but we don't encourage users to do it this way. Since we use the operator, we leave that to the operator.
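In other words, the container command could be trimmed to something like the sketch below, relying on the env that the operator injects (values are illustrative):

```yaml
# Sketch: the operator injects the rendezvous settings (MASTER_ADDR, MASTER_PORT,
# RANK, ...) as env variables, so only per-node settings are passed explicitly.
- name: pytorch
  image: kubeflow/pytorch-deepspeed-demo:latest
  command:
    - torchrun
    - --nnodes=2                # illustrative values
    - --nproc_per_node=1
    - /train_bert_ds.py
    - --checkpoint_dir
    - /root/deepspeed_data
```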
@andreyvelich @kuizhiqing Thanks for the review! I addressed all comments. PTAL.
LGTM
Thank you for doing this @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads
I think we can merge it.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, kuizhiqing
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
Add a DeepSpeed example with the PyTorch operator. The script used is HelloDeepSpeed from DeepSpeedExamples.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Part-of #2091
Checklist: