
Add DeepSpeed Example with Pytorch Operator #2235

Merged
1 commit merged into kubeflow:master on Oct 17, 2024

Conversation

@Syulin7 (Contributor) commented Aug 27, 2024

What this PR does / why we need it:
Adds a DeepSpeed example with the PyTorch operator. The training script is HelloDeepSpeed from DeepSpeedExamples.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #2091

Checklist:

  • Docs included if any changes are user facing

@coveralls commented Aug 27, 2024

Pull Request Test Coverage Report for Build 11216096781

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals:

  • Change from base Build 11164961069: 0.0%
  • Covered Lines: 66
  • Relevant Lines: 66

💛 - Coveralls

@Syulin7 force-pushed the deepspeed branch 4 times, most recently from 23c73db to 6bd91b8, on August 28, 2024 at 03:51
@Syulin7 (Contributor, Author) commented Aug 28, 2024

@tenzen-y @andreyvelich @kuizhiqing @terrytangyuan This PR is ready for review. PTAL, Thanks!

@andreyvelich (Member) left a comment

Thank you for adding this great example @Syulin7!
/assign @kubeflow/wg-training-leads @kuizhiqing

examples/pytorch/deepspeed-demo/README.md (outdated, resolved)
Comment on lines +7 to +8
DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate.
See [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).
Member

Do we set the appropriate env variables for the deepspeed or accelerate launchers in PyTorchJob, or can only torchrun be used?

@Syulin7 (Contributor, Author) commented Sep 6, 2024

When using the deepspeed launcher, it defaults to pdsh (machines accessible via passwordless SSH) to send commands to the workers for execution, which is the launcher-worker mode.

The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI v2 (via passwordless SSH) would be more appropriate. DeepSpeed does not require setting env variables; it reads the information from the hostfile.

# the deepspeed --hostfile path defaults to /job/hostfile
deepspeed --hostfile=/etc/mpi/hostfile /train_bert_ds.py --checkpoint_dir /root/deepspeed_data

For the hostfile format, see: https://github.com/microsoft/DeepSpeed/blob/3b09d945ead6acb15a172e9a379fc3de1f64d2b2/docs/_tutorials/getting-started.md?plain=1#L173-L187

# hostfile
worker-1 slots=4
worker-2 slots=4

I can add an example in mpi-operator (MPI v2) later; a rough sketch of what that could look like follows at the end of this comment.

In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.
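
For reference, a hypothetical sketch of such an MPIJob, assuming mpi-operator's v2beta1 API, the demo image from this PR, and the deepspeed command shown above (replica counts, slots, and names are illustrative, not part of this PR):

    # Hypothetical sketch only. The launcher runs the deepspeed command with the
    # operator-generated hostfile; workers run sshd so the launcher can reach them
    # over passwordless SSH (assumes sshd is available in the image).
    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: deepspeed-mpijob-demo
    spec:
      slotsPerWorker: 1
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
                - name: deepspeed
                  image: kubeflow/pytorch-deepspeed-demo:latest
                  command:
                    - deepspeed
                    - --hostfile=/etc/mpi/hostfile
                    - /train_bert_ds.py
                    - --checkpoint_dir
                    - /root/deepspeed_data
        Worker:
          replicas: 2
          template:
            spec:
              containers:
                - name: deepspeed
                  image: kubeflow/pytorch-deepspeed-demo:latest
                  command:
                    - /usr/sbin/sshd
                  args:
                    - -De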

Member

The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI v2 (via passwordless SSH) would be more appropriate.

Thanks for this info! I think we can support it once we migrate to MPI V2 in the TrainJob API. cc @tenzen-y @alculquicondor
That way we can build a dedicated DeepSpeed runtime that leverages MPI orchestration to create hostfiles.

In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.

As far as I know, accelerate is compatible with torchrun. However, it might have some additional parameters that torchrun doesn't allow to be set, e.g. mixed precision: https://huggingface.co/docs/accelerate/en/basic_tutorials/launch#:~:text=MIXED_PRECISION%3D%22fp16%22

A reviewer commented:

DeepSpeed is already compatible with mpi-operator (the one outside of training-operator).

Someone started a PR to add an example, but it was abandoned: kubeflow/mpi-operator#610

Contributor (Author)

DeepSpeed is already compatible with mpi-operator (the one outside of training-operator).

Someone started a PR to add an example, but it was abandoned: kubeflow/mpi-operator#610

Yes, the image used in this example is one I built earlier. I can provide the Dockerfile for reference. cc @alculquicondor @kuizhiqing

A reviewer replied:

I'm happy to accept a PR for this in the mpi-operator repo.

Member

I think once we merge this PR, we can refer to this training script in the MPI-Operator repo as well and add a simple YAML with an MPIJob.

Member

@Syulin7 Yes, thanks to your original work on the base image. The plan in kubeflow/mpi-operator#610 has stalled for some reason. You are very welcome to continue it.

examples/pytorch/deepspeed-demo/README.md (outdated, resolved)
examples/pytorch/deepspeed-demo/README.md (outdated, resolved)
examples/pytorch/deepspeed-demo/README.md (outdated, resolved)
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:
Member

Why do you need a Master replica for this example?

Contributor (Author)

Actually, the complete command is as follows; torchrun will read the environment variables MASTER_ADDR, MASTER_PORT, and RANK (which are set by the training operator in the pod env):

# node1
torchrun --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \
--master_port=9901 your_program.py <normal cl args>

# node2
torchrun --nproc_per_node=8 --nnode=2 --node_rank=1 --master_addr=hostname1 \
--master_port=9901 your_program.py <normal cl args>

So the command can be simplified as follows:

torchrun --nproc_per_node=8 --nnode=2 your_program.py <normal cl args>

See: https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#multi-node-deployment

Member

Yeah, I think we have a problem with the V1 Training Operator in that we only set the MASTER_PORT when a Master replica is set. Eventually, you shouldn't need a dedicated Master replica if the PodTemplateSpec is the same across all nodes.

Contributor (Author)

I think we have a problem with the V1 Training Operator in that we only set the MASTER_PORT when a Master replica is set.

Yes, so we need the Master replica for this example.

# Checkpoint Related Functions


def load_model_checkpoint(
Member

How do we use it?

Contributor (Author)

The function is not used; this script was directly copied from DeepSpeedExamples.
https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md

This tutorial shows the changes necessary to integrate DeepSpeed and some of the advantages of doing so.

load_model_checkpoint is used in train_bert.py, which is the original script that does not use DeepSpeed.

I’m not sure whether we should delete it or stay consistent with DeepSpeedExamples.

Member

I am fine with both; eventually we can add it when we have a more dedicated example/notebook where we can show how to resume training from a checkpoint.
Any thoughts @kubeflow/wg-training-leads?

return uuid


def create_experiment_dir(
Member

Do we need this experiment dir in this example?

Contributor (Author)

Similar to the issue above, it will create a directory in the checkpoint_dir on rank 0. I think we can stay consistent with DeepSpeedExamples.

)
# Save the last checkpoint if not saved yet
if step % checkpoint_every != 0:
    model.save_checkpoint(save_dir=exp_dir, client_state={"checkpoint_step": step})
Member

Will the model checkpointing happen only on the rank 0 node?

Contributor (Author)

Yes, generally, the model will be saved on shared storage (using PVC).
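
For illustration, a minimal sketch of how a shared PersistentVolumeClaim could be mounted into the replica pod templates so that the checkpoint_dir lands on shared storage (the claim name "deepspeed-checkpoints" is hypothetical, not part of this PR):

    # Illustrative fragment of a replica pod template; assumes a pre-created
    # ReadWriteMany PVC named "deepspeed-checkpoints".
    spec:
      containers:
        - name: pytorch
          image: kubeflow/pytorch-deepspeed-demo:latest
          volumeMounts:
            - name: checkpoints
              mountPath: /root/deepspeed_data
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: deepspeed-checkpoints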

Member

@Syulin7 Do we have this check as part of the save_checkpoint() API, or do we need to verify it?
Like in this FSDP example from PyTorch.

Contributor (Author)

@andreyvelich all processes must call save_checkpoint(), so we don't need to verify it.

https://www.deepspeed.ai/getting-started/#model-checkpointing

Important: all processes must call this method and not just the process with rank 0. It is because each process needs to save its master weights and scheduler+optimizer states. This method will hang waiting to synchronize with other processes if it’s called just for the process with rank 0.

@andreyvelich (Member)

@Syulin7 @kubeflow/wg-training-leads Are we ready to merge this PR?

@Syulin7 (Contributor, Author) commented Oct 1, 2024

Are we ready to merge this PR?

@andreyvelich Yes, I think this PR can be merged.

@andreyvelich (Member) left a comment

Thanks @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads

@kuizhiqing (Member) left a comment

Good work.

- name: pytorch
  image: kubeflow/pytorch-deepspeed-demo:latest
  command:
    - torchrun
Member

@Syulin7 No, actually, you don't need to set those parameters for torchrun; setting the correct environment-related parameters (or, in the operator case, the env variables) is the responsibility of the operator.

If you set them, the parameters will override the env vars, which will certainly work, but we don't encourage users to do it this way. Since we use the operator, we leave that to the operator.
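
To illustrate the point, a sketch of the container command without the rendezvous flags, relying on the env variables (MASTER_ADDR, MASTER_PORT, RANK) that the training operator injects; the node and process counts below are illustrative only:

    # Illustrative only: --master_addr/--master_port/--node_rank are omitted because
    # torchrun reads them from the env injected by the training operator (see above).
    command:
      - torchrun
      - --nnodes=2
      - --nproc_per_node=1
      - /train_bert_ds.py
      - --checkpoint_dir
      - /root/deepspeed_data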


@Syulin7 (Contributor, Author) commented Oct 7, 2024

@andreyvelich @kuizhiqing Thanks for the review! I addressed all comments. PTAL.

@kuizhiqing (Member) left a comment

LGTM

@andreyvelich (Member) left a comment

Thank you for doing this @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads

The google-oss-prow bot added the lgtm label on Oct 7, 2024
@andreyvelich (Member)

I think we can merge it.
Thanks for the work @Syulin7!
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, kuizhiqing

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

The google-oss-prow bot merged commit 2d58b49 into kubeflow:master on Oct 17, 2024
41 checks passed