Work with DeepSpeed for large scale training #611
Comments
How do you plan to obtain the IPs?
The idea is to update the hostfile with IPs after all containers (not just pods) are ready. Files have no length limit, but environment variables do, AFAIK; one may want to convert the node list into an env variable or args for execution, and using svc names may exceed the length limit at very large scale. There are two main reasons to do this: with very large scale jobs, svc is not performance efficient, and it may not be stable enough since it depends on CoreDNS or alternatives. BTW, a CM for the hostfile is a good idea since it supports runtime updates, which makes it possible to support quasi-elasticity.
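For illustration, a rough Go sketch of the idea above: rebuild an IP-based hostfile from the running worker pods and keep it in a ConfigMap so it can be rewritten at runtime. The label selector, ConfigMap name, key, and slot count here are assumptions, not the operator's actual values.

```go
package hostfile

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateHostfile rebuilds the hostfile from the IPs of running worker pods and
// stores it in a ConfigMap, so the mounted file can change at runtime without
// recreating the launcher pod.
func updateHostfile(ctx context.Context, c kubernetes.Interface, ns, job string, slots int) error {
	// The label selector is an assumption; use whatever labels the operator puts on workers.
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "training.kubeflow.org/job-name=" + job + ",training.kubeflow.org/job-role=worker",
	})
	if err != nil {
		return err
	}

	var lines []string
	for _, p := range pods.Items {
		// Only running pods have a usable PodIP; each entry looks like "10.0.12.4 slots=8".
		if p.Status.Phase == corev1.PodRunning && p.Status.PodIP != "" {
			lines = append(lines, fmt.Sprintf("%s slots=%d", p.Status.PodIP, slots))
		}
	}

	// Assumed ConfigMap name and key; the ConfigMap would be mounted into the launcher.
	cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, job+"-hostfile", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["hostfile"] = strings.Join(lines, "\n") + "\n"
	_, err = c.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}
```

A controller could call this whenever a worker's IP changes, while the launcher keeps reading the same mounted file.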
That was a design I chose for the Flux Operator before switching to a headless service; it would be interesting to see how it works for you! I can point you to old code if you want an example of how I did that - it was basically designed exactly as you stated. I am not familiar with DeepSpeed, but is ssh bootstrap the best / only way available?
Does this mean it creates a configmap with the etc hosts content and mounts it with a volume in
What environment variable limitation and length limit are you referring to here? With a headless Service each pod has a well-known record: https://github.com/kubernetes/dns/blob/master/docs/specification.md#241---aaaaa-records
Ok, just to get rid of unnecessary feature requests, let's forget about passing a hostfile via an environment variable.
But instead, you will have to wait for all worker pods to be ready before being able to create the launcher Pod. This means that you will have to wait for container images to download. That time will probably be more significant than DNS name resolution. Also, if anything happens to a worker Pod, then you will have to start over and recreate the hostfile.
How does elasticity work in deepspeed? Is it similar to horovod?
No, a hostfile is a requirement from MPI (https://www.open-mpi.org/faq/?category=running#mpirun-hostfile). The mpi-operator populates it using the domain name for each Pod. If we were to wait for the pods to be created, we could put the Pod IPs instead, which is what the OP is proposing.
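For contrast with the IP-based sketch above, a minimal illustration of the name-based hostfile being described here: one line per worker, using the pod's DNS name under the job's headless service. The exact `<pod>.<service>` format depends on the operator version, and the job name is made up.

```go
package main

import "fmt"

// nameHostfile builds the kind of hostfile the operator writes today, with one
// "<pod-dns-name> slots=N" line per worker. Names are illustrative only.
func nameHostfile(job string, workers, slots int) string {
	var out string
	for i := 0; i < workers; i++ {
		out += fmt.Sprintf("%s-worker-%d.%s-worker slots=%d\n", job, i, job, slots)
	}
	return out
}

func main() {
	// Prints e.g. "myjob-worker-0.myjob-worker slots=8" for each worker.
	fmt.Print(nameHostfile("myjob", 2, 8))
}
```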
This was the main point that made me pause on the design. However, I do think the length of the hostnames is an issue, especially when you get into Indexed Job. With a regular Job, it's usually successful with a reasonable (and somewhat short) service and job name. However, when you go into JobSet (if you do), I found that I had to make everything tiny - the service name often 2 letters, and the different jobs also indicated by letters. The design of how the JobSet members map to the names didn't allow for the same lengths as a regular Job. When we were debugging networking issues, at least for small numbers of pods, I did compare writing the hostfile to using the headless service, and the first didn't help much - the means / quartiles were very much overlapping. kubernetes/kubernetes#117819 (comment)
@vsoch Great, thx. I've already completed the implementation and put those features into production; what I'm wondering about is merging some of the features back here.
Well, since NCCL does not support fault tolerance, perfect recovery is impossible in GPU training scenarios. What we can do is replace one pod at a time without recreating thousands/hundreds of pods. The current design for me is that the failed node triggers an internal restart and an IP update in the hostfile.
If the host names are too long, maybe we can make this optional #453 (via a command line flag, for example). In most scenarios, we don't actually need the service name.
When I say env length limit, I'm referring to https://informix.hcldoc.com/14.10/help/index.jsp?topic=%2Fcom.ibm.sqlr.doc%2Fids_sqr_405.htm. It happens when the user uses
Well, I will hold off on the PR process and welcome more discussion on the feature requirements. More ideas are really appreciated, @alculquicondor @vsoch @aojea @willb. Thx
We shouldn't support that. It feels unnecessary. |
Regarding using IPs instead of DNS, I would like to hear from @aojea, as TL for Kubernetes SIG Network. My gut feeling is that improving DNS building and lookup should be done by k8s or providers, and we shouldn't re-implement it in mpi-operator.
I'm wondering if we should support this feature since, as other folks say, it sounds like re-implementing DNS.
I believe that restarting the training process and reconnecting with the other workers are the responsibility of the application side.
If it is possible, it would be great.
Do you have more details on how the workloads are deployed? I do think that using hostnames is the way to go, as it is more human friendly, and a headless service solves this problem since each Pod will have a resolvable name. If we just want to avoid the DNS problems, we can always add a simple sidecar or something that adds the
That's a really cool idea! I love sidecars (and init containers), so much you can do with them. @kuizhiqing if you wind up trying that out, can you point me to it so I can take a look? If not, I'd like to try prototyping it - it seems useful for other cases (I'm thinking of JobSet, where it's more common to go over the limit).
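A minimal sketch of that sidecar/init-container idea, assuming the goal is to resolve each worker's headless-service name once and append the result to /etc/hosts so later connections skip DNS; the worker names and retry interval are illustrative, and a real sidecar would take the worker list from a flag or the downward API.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Illustrative worker DNS names under the job's headless service.
	workers := []string{"myjob-worker-0.myjob-worker", "myjob-worker-1.myjob-worker"}

	f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	for _, w := range workers {
		// Retry until the pod's record shows up in DNS.
		var addrs []string
		for {
			addrs, err = net.LookupHost(w)
			if err == nil && len(addrs) > 0 {
				break
			}
			time.Sleep(2 * time.Second)
		}
		// Append "IP name" so later resolution hits /etc/hosts first.
		fmt.Fprintf(f, "%s %s\n", addrs[0], w)
	}
}
```

If something like this runs as an init container in the launcher pod, the entries should already be in place before mpirun starts.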
Does DeepSpeed support fault tolerance and elasticity without recreating pods or reloading checkpoints?
Thanks for those great ideas. I've adopted the IP-based solution, which works well in my case.
@Syulin7 @tenzen-y It makes sense that this may be out of the scope of cloud-native design, especially in mpi-operator, while it's indeed a strong requirement in practice.
This would be handled well by the flux operator, which uses zeromq to bootstrap; if a pod (follower broker) goes down, flux would see the node as down, and we could schedule to another node (possibly newly added, which would join the cluster and then be seen as going from down to up). Nodes going up and down happens all the time in HPC, so our workload managers are used to handling that, and with the flux operator you essentially get your own scheduler within the indexed job. We do, however, use a headless service and not the host file. If you are interested and have a dummy example that shows commands for launching different stuff, I would be happy to prototype something to think about.
That's exactly why using DNS solves the problem. If there is an error, and we need to retry, there is no need to recreate the hostfile. mpi-operator would create a new Pod and the DNS provider would be updated to point to the new Pod IP. From @aojea:
Let me try to summarize what the operator does:
When the launcher starts, it reads the hostfile and starts SSH connections to each host listed in it.
so the only optimization I can see is to append to the existing
DNS works, but some DNS providers have a limit on the number of pods. Or they might have non-negligible latency at scale.
But then we need to wait for the worker Pods to be created before being able to start the launcher (which is the one that needs the IPs). That's almost the same as just using IPs in the hostfile, instead of today's hostnames.
Yes, the DNS stuff is clear now. I may conclude that the IP-version hostfile is not a feature worth merging; let's leave it to those who need it in their own implementations. @vsoch I've found your project flux-operator interesting; let me come back to you after I get more details.
Sure! And make sure to look at the refactor branch - I fixed a ton of design flaws and it's much better, but waiting on a paper (and likely some more testing) before merging that in. flux-framework/flux-operator#211
I think I had a similar issue to this:
When I was working on the metrics operator, the issue was that some variants of MPI require the actual IP address. So what I did:
It's not great, it's basically looking them up on the fly before running. The rationale I had (aside from only a few apps needing this) is that if a pod fails in the middle of the run (because a pod dies), it's going to throw up and fail anyway - there is no good (easy) concept of elasticity (that I know of / have tried) for MPI itself, so if you lose a pod, you are out of luck anyway (and updating the host lists doesn't matter). For the use case where you are running mpirun several times (and you have changing pods), it might make sense to have some list that is self-updating, in which case (for a quick test) I'd do the equivalent janky trick just before running something.
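A rough Go sketch of that "look them up on the fly" trick, run just before mpirun: read the name-based hostfile, resolve each host, and write an IP-based copy. The file paths are made up for this sketch.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
)

func main() {
	// Assumed input: the name-based hostfile mounted into the launcher.
	in, err := os.Open("/etc/mpi/hostfile")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create("/tmp/hostfile-ips")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	sc := bufio.NewScanner(in)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue
		}
		// The first field is the host name; anything else (e.g. "slots=8") is kept as-is.
		addrs, err := net.LookupHost(fields[0])
		if err != nil {
			panic(err)
		}
		fmt.Fprintln(out, strings.Join(append([]string{addrs[0]}, fields[1:]...), " "))
	}
	// Then launch with something like: mpirun --hostfile /tmp/hostfile-ips ...
}
```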
You could also just use a different MPI flavor that will use the DNS names and call it a day :) |
DeepSpeed is an excellent framework for training LLMs on a large scale, while the mpi-operator is the ideal tool to facilitate this within the Kubernetes ecosystem.
I'm planning to submit a series of PRs to make this project more ready-to-use for very large scale training with the DeepSpeed/mpi-style training framework.
The upcoming features may include the following modifications:
Support for IP-style hostfile
This is for performance efficiency and, for those who wish to wrap the host list into an environment variable, to prevent the variable from exceeding its length limit when using svc names.
Support for fault tolerance and elasticity
This is quasi-fault tolerance, since NCCL communication must always be recreated when an error occurs. However, it's still worth implementing because recreating pods can be costly at a very large scale (a rough sketch of the single-pod replacement idea appears after this list).
Configuration decoupling
There are some requirements that are currently left to the docker image maker to handle, such as ssh_config and sshd_config. Perhaps the operator can manage all of these.
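To make the fault-tolerance item above concrete, a rough sketch of replacing a single failed worker: patch only that worker's line in a hostfile ConfigMap instead of regenerating everything. The ConfigMap name, key, and one-line-per-worker-index convention are assumptions for this sketch, not the operator's actual API.

```go
package hostfile

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// replaceWorkerIP rewrites only the replaced worker's line in the hostfile
// ConfigMap, leaving every other worker untouched.
func replaceWorkerIP(ctx context.Context, c kubernetes.Interface, ns, job string, index int, newIP string) error {
	cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, job+"-hostfile", metav1.GetOptions{})
	if err != nil {
		return err
	}
	lines := strings.Split(strings.TrimRight(cm.Data["hostfile"], "\n"), "\n")
	if index < 0 || index >= len(lines) {
		return fmt.Errorf("worker index %d out of range", index)
	}
	fields := strings.Fields(lines[index])
	if len(fields) == 0 {
		return fmt.Errorf("hostfile line %d is empty", index)
	}
	// Keep the slots setting, swap only the address of the replaced worker.
	fields[0] = newIP
	lines[index] = strings.Join(fields, " ")
	cm.Data["hostfile"] = strings.Join(lines, "\n") + "\n"
	_, err = c.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}
```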
There are also some minor changes under consideration. Please feel free to share your thoughts on this topic.