Add DeepSpeed Example with Pytorch Operator #2235
Conversation
Pull Request Test Coverage Report for Build 11216096781
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
Force-pushed from 23c73db to 6bd91b8
@tenzen-y @andreyvelich @kuizhiqing @terrytangyuan This PR is ready for review. PTAL, Thanks!
Thank you for adding this great example @Syulin7!
/assign @kubeflow/wg-training-leads @kuizhiqing
DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate.
See [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).
Do we set the appropriate env variables for deepspeed or accelerate launchers in PyTorchJob, or can only torchrun be used?
When using the deepspeed launcher, it defaults to using pdsh (machines accessible via passwordless SSH) to send commands to the workers for execution, which is the launcher-worker mode.
The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI V2 (via passwordless SSH) would be more appropriate. DeepSpeed does not require setting env variables; it reads the information from the hostfile.

```bash
# the --hostfile path defaults to /job/hostfile
deepspeed --hostfile=/etc/mpi/hostfile /train_bert_ds.py --checkpoint_dir /root/deepspeed_data
```

About the hostfile, see: https://github.com/microsoft/DeepSpeed/blob/3b09d945ead6acb15a172e9a379fc3de1f64d2b2/docs/_tutorials/getting-started.md?plain=1#L173-L187

```
# hostfile
worker-1 slots=4
worker-2 slots=4
```

I can add an example in mpi-operator (MPI V2) later.
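For illustration, such an MPIJob might look roughly like the sketch below (not part of this PR; it assumes the mpi-operator v2beta1 API, that this example's image also ships an SSH server for pdsh, and that the resource name is just a placeholder):

```yaml
# Rough sketch only: DeepSpeed's launcher-worker mode on the standalone
# mpi-operator (v2beta1). Assumes the image ships DeepSpeed and an SSH server,
# and that the operator provides /etc/mpi/hostfile plus the SSH setup.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: deepspeed-mpijob            # placeholder name
spec:
  slotsPerWorker: 4                 # matches the "slots=4" hostfile entries above
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: deepspeed
              image: kubeflow/pytorch-deepspeed-demo:latest   # assumed reusable here
              command:
                - deepspeed
                - --hostfile=/etc/mpi/hostfile
                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: deepspeed
              image: kubeflow/pytorch-deepspeed-demo:latest
```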
In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.
> The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI V2 (via passwordless SSH) would be more appropriate.

Thanks for this info! I think we can support it once we migrate to MPI V2 in the TrainJob API. cc @tenzen-y @alculquicondor
That way we can build a specific deepspeed runtime that will leverage MPI orchestration to create hostfiles.

> In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.

As far as I know, accelerate is compatible with torchrun. However, it might have some additional parameters that torchrun doesn't allow to be set, e.g. mixed precision: https://huggingface.co/docs/accelerate/en/basic_tutorials/launch#:~:text=MIXED_PRECISION%3D%22fp16%22
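Purely as an illustration of that point (not something this PR does), the container command could invoke accelerate instead of torchrun to pass such an option; the script path and flag values below are only assumptions:

```yaml
# Illustrative fragment only: accelerate's launcher accepts options such as
# --mixed_precision that torchrun itself does not expose.
command:
  - accelerate
  - launch
  - --mixed_precision=fp16
  - /train_bert_ds.py
  - --checkpoint_dir
  - /root/deepspeed_data
```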
deepspeed is already compatible with mpi-operator (the one outside of training-operator).
Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610
> deepspeed is already compatible with mpi-operator (the one outside of training-operator). Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610

Yes, the image used in this example is one I built earlier. I can provide the Dockerfile for reference. cc @alculquicondor @kuizhiqing
I'm happy to accept a PR for this in the mpi-operator repo.
I think once we merge this PR, we can refer to this training script in the MPI-Operator repo as well and add a simple YAML with an MPIJob.
@Syulin7 Yes, thanks to your original work on the base image. The plan in kubeflow/mpi-operator#610 has stalled somewhat for some reason. You are very welcome to continue it.
```yaml
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:
```
Why do you need a Master replica for this example?
Actually, the complete command is as follows; torchrun will read the environment variables MASTER_ADDR, MASTER_PORT, and RANK (which are set by the training operator in the pod env):

```bash
# node1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>
# node2
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>
```

So the command can be simplified as follows:

```bash
torchrun --nproc_per_node=8 --nnodes=2 your_program.py <normal cl args>
```
Yeah, I think we have a problem with the V1 Training Operator in that we only set MASTER_PORT when a Master replica is set. Eventually, you shouldn't need a dedicated Master replica if the PodTemplateSpec is the same across all nodes.
> I think we have a problem with the V1 Training Operator in that we only set MASTER_PORT when a Master replica is set.

Yes, so we need a Master replica for this example.
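For reference, a minimal PyTorchJob skeleton along these lines would look roughly as follows (values are illustrative and may differ from the manifest in this PR):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:                    # required so the V1 operator sets MASTER_PORT
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
    Worker:                    # same PodTemplateSpec as the Master
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
```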
```python
# Checkpoint Related Functions


def load_model_checkpoint(
```
How do we use it?
The function is not used; this script was copied directly from DeepSpeedExamples:
https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md
The tutorial shows the changes necessary to integrate DeepSpeed and some of the advantages of doing so. load_model_checkpoint is used in train_bert.py, which is the original script that does not use DeepSpeed.
I'm not sure whether we should delete it or stay consistent with DeepSpeedExamples.
I am fine with both; eventually we can add it when we have a more dedicated example/notebook where we can show how to resume training from a checkpoint.
Any thoughts @kubeflow/wg-training-leads?
```python
    return uuid


def create_experiment_dir(
```
Do we need this experiment dir in this example?
Similar to the issue above: it creates a directory in the checkpoint_dir on rank 0. I think we can stay consistent with DeepSpeedExamples.
```python
    )
    # Save the last checkpoint if not saved yet
    if step % checkpoint_every != 0:
        model.save_checkpoint(save_dir=exp_dir, client_state={"checkpoint_step": step})
```
Will model checkpointing happen only on the rank 0 node?
Yes, generally, the model will be saved on shared storage (using PVC).
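As a rough sketch of that setup (not part of this PR), the checkpoint directory could be backed by a shared PVC mounted into every replica; the claim name below is hypothetical:

```yaml
# Illustrative fragment: mount a shared (e.g. ReadWriteMany) PVC at the
# checkpoint directory in every replica, so all ranks write to the same storage.
spec:
  containers:
    - name: pytorch
      image: kubeflow/pytorch-deepspeed-demo:latest
      volumeMounts:
        - name: checkpoints
          mountPath: /root/deepspeed_data      # matches --checkpoint_dir
  volumes:
    - name: checkpoints
      persistentVolumeClaim:
        claimName: deepspeed-checkpoints       # hypothetical claim name
```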
@Syulin7 Do we have this check as part of the save_checkpoint() API, or do we need to verify it?
Like in this FSDP example from PyTorch.
@andreyvelich All processes must call save_checkpoint(), so we don't need to verify it.
https://www.deepspeed.ai/getting-started/#model-checkpointing

> Important: all processes must call this method and not just the process with rank 0. It is because each process needs to save its master weights and scheduler+optimizer states. This method will hang waiting to synchronize with other processes if it's called just for the process with rank 0.
@Syulin7 @kubeflow/wg-training-leads Are we ready to merge this PR?
@andreyvelich Yes, I think this PR can be merged.
Thanks @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads
Good work.
```yaml
- name: pytorch
  image: kubeflow/pytorch-deepspeed-demo:latest
  command:
    - torchrun
```
@Syulin7 No, actually, you don't need to set those parameters for torchrun; setting the correct environment-related parameters (or, in the operator case, env variables) is the responsibility of the operator.
If you set them, the parameters will override the env, which will certainly work, but we don't encourage users to do it this way. Since we use the operator, we leave that to the operator.
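In other words, the container command could be trimmed to something like the sketch below, relying on the env that the operator injects (values are illustrative):

```yaml
# Sketch: the operator injects the rendezvous settings (MASTER_ADDR, MASTER_PORT,
# RANK, ...) as env variables, so only per-node settings are passed explicitly.
- name: pytorch
  image: kubeflow/pytorch-deepspeed-demo:latest
  command:
    - torchrun
    - --nnodes=2                # illustrative values
    - --nproc_per_node=1
    - /train_bert_ds.py
    - --checkpoint_dir
    - /root/deepspeed_data
```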
@andreyvelich @kuizhiqing Thanks for the review! I addressed all comments. PTAL.
LGTM
Thank you for doing this @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads
I think we can merge it.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, kuizhiqing
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
Add a DeepSpeed example with the PyTorch operator. The script used is HelloDeepSpeed from DeepSpeedExamples.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Part-of #2091
Checklist: