As suggested by @Shelley-BaoYue, I'm creating this issue to document the current problems and limitations in the initial design of Sedna Federated Learning V2: #455. It summarizes my work and should also help successors get started smoothly and understand the obstacles blocking us.
1. Origins
Sedna Federated Learning V2 originates from an LFX'24 issue: kubeedge/kubeedge#5762, which asked the mentee to integrate Volcano with Sedna and achieve gang scheduling for training workloads. This would give users more AI-specific scheduling capabilities and address the performance drawbacks of the default scheduler.
2. How our goals changed during the LFX'24 period
However, a gang scheduler only makes sense when tasks actually run in a distributed fashion. Many job patterns in Sedna today are, by nature, not distributed. Sequential patterns such as IncrementalLearningJob execute the training stage step by step, so bringing in a gang scheduler adds nothing in that scenario.
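For context, here is a minimal sketch (illustrative Go, not Sedna code; it only uses the public Volcano PodGroup type) of what gang scheduling asks of the scheduler: a PodGroup whose minMember forces all-or-nothing placement of a job's pods, a guarantee a sequential job never benefits from.

```go
// Minimal sketch (illustrative only, not Sedna code) of the Volcano object that
// gang scheduling revolves around: a PodGroup whose MinMember tells the scheduler
// not to start any pod of a job until all of its pods can be placed at once.
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	schedulingv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// gangGroupFor builds the PodGroup for a job with `workers` parallel pods.
// A sequential job (one pod at a time) gains nothing from MinMember > 1.
func gangGroupFor(jobName string, workers int32) *schedulingv1beta1.PodGroup {
	return &schedulingv1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: jobName},
		Spec: schedulingv1beta1.PodGroupSpec{
			MinMember: workers, // all-or-nothing placement threshold
		},
	}
}
```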
So, our goals changed to:
Refactor all Sedna APIs to fit the distributed training pattern.
Integrate Volcano with these APIs, each corresponding to a different type of training task.
The workload of these goals is clearly beyond the scope of an LFX'24 project, so we decided to migrate Federated Learning first as a proof of concept.
Also, implementing a brand-new distributed runtime for Sedna is extremely complex and time-consuming, which is neither efficient nor practical for us. So we decided to adopt training-operator, a state-of-the-art AI training toolkit on Kubernetes with rich support for multiple ML training frameworks (e.g., PyTorch, TensorFlow) and gang-scheduling tools (e.g., Volcano, Kueue), as the runtime for our distributed training tasks.
Finally, our goal became: integrate FederatedLearningJob with training-operator (starting with the PyTorch framework).
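To make this concrete, here is a rough sketch of what "training-operator as the runtime" means: the FederatedLearningJob controller would emit a PyTorchJob instead of managing training pods itself. This is illustrative only, not the code from the design proposal; the import path and type names assume training-operator v1.7+, where the common types were merged into the kubeflow.org/v1 package.

```go
// Rough sketch (not the code from the design proposal) of the integration idea:
// the FederatedLearningJob controller stops managing training pods itself and
// instead emits a PyTorchJob, letting training-operator run the distributed part.
package sketch

import (
	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPyTorchJob turns one federated training round into a PyTorchJob with
// `workers` identical replicas; the shared template is exactly where
// Assumption 2 in the Design Proposal section below comes from.
func buildPyTorchJob(name string, workers int32, template corev1.PodTemplateSpec) *kubeflowv1.PyTorchJob {
	one := int32(1)
	return &kubeflowv1.PyTorchJob{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: kubeflowv1.PyTorchJobSpec{
			PyTorchReplicaSpecs: map[kubeflowv1.ReplicaType]*kubeflowv1.ReplicaSpec{
				// The master replica could host the Sedna aggregation worker.
				kubeflowv1.PyTorchJobReplicaTypeMaster: {
					Replicas: &one,
					Template: template,
				},
				// Every training worker is stamped from the same pod template.
				kubeflowv1.PyTorchJobReplicaTypeWorker: {
					Replicas: &workers,
					Template: template,
				},
			},
		},
	}
}
```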
3. Design Proposal
After months of investigation into Sedna and training-operator, and many discussions in the community calls, we concluded with this initial design proposal: #455. However, the proposal is only practical under several assumptions, which place significant restrictions on Sedna Federated Learning V2:
Data can be transferred within a secure subnet: Federated learning is a training task and is therefore data-driven. If we only schedule the training tasks without also scheduling the training data, the model will suffer unacceptable training bias. So we need to collect the data and distribute it across the training workers before executing the federated learning job.
All training workers have the same parameters: The PyTorchJob CRD in training-operator assumes that all training workers (pods) share the same training parameters, while the FederatedLearningJob CRD in Sedna allows training workers to have different parameters. So we assume that all training workers share the same training parameters; this clearly restricts the scenarios Sedna Federated Learning V2 can be applied to, but we currently have no alternative.
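The heavily simplified, paraphrased shapes below (not the real CRD definitions) show where the mismatch comes from:

```go
// Heavily simplified, paraphrased shapes of the two specs (not the real CRD
// definitions) to show where the mismatch comes from.
package sketch

import corev1 "k8s.io/api/core/v1"

// Sedna's FederatedLearningJob lets every training worker carry its own pod
// template, i.e. its own image, command, env, dataset and resources.
type sednaFLShape struct {
	AggregationWorker corev1.PodTemplateSpec
	TrainingWorkers   []corev1.PodTemplateSpec // one template per worker
}

// training-operator's PyTorchJob only knows replica types; all replicas of the
// Worker type are stamped from one shared template, so N workers means N
// identical pods.
type pyTorchJobShape struct {
	Master         corev1.PodTemplateSpec
	Worker         corev1.PodTemplateSpec // shared by every worker replica
	WorkerReplicas int32                  // a count, not a per-worker list
}
```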
4. Possible Approaches to Mitigate the Effects of the Assumptions
For Assumption 1, we have not yet found a mitigation approach.
For Assumption 2, we noticed that Kubeflow Training is migrating to V2, which adopts JobSet as its low-level runtime for distributed training (https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2) and allows different training workers to have different training parameters. Adopting Kubeflow Training V2 instead of V1 would therefore be a good choice.
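To illustrate why JobSet eases Assumption 2, the sketch below builds a JobSet whose ReplicatedJobs each carry their own Job template, so two training workers can differ in image, arguments, env, or resources. It targets the JobSet v1alpha2 API directly; how Kubeflow Training V2 ultimately exposes this may differ.

```go
// Minimal sketch of why JobSet eases Assumption 2: each ReplicatedJob in a
// JobSet carries its own Job template, so two training workers can differ in
// image, args, env or resources.
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	jobsetv1alpha2 "sigs.k8s.io/jobset/api/jobset/v1alpha2"
)

func heterogeneousWorkers(a, b batchv1.JobTemplateSpec) *jobsetv1alpha2.JobSet {
	return &jobsetv1alpha2.JobSet{
		ObjectMeta: metav1.ObjectMeta{Name: "fl-round"},
		Spec: jobsetv1alpha2.JobSetSpec{
			ReplicatedJobs: []jobsetv1alpha2.ReplicatedJob{
				// Unlike PyTorchJob's single shared worker template, each entry
				// here can carry different training parameters.
				{Name: "worker-a", Replicas: 1, Template: a},
				{Name: "worker-b", Replicas: 1, Template: b},
			},
		},
	}
}
```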
However, Kubeflow Training V2 is still in the alpha stage and not mature enough for production use; a stable release may take half a year or more. Meanwhile, Kubeflow will soon remove the V1 code from the master branch and plans to deprecate V1 support this year, so relying on Kubeflow Training V1 is risky and not practical either.
5. Conclusion
During the LFX'24 period, I clarified the project goal, explored many possible solutions for integrating Volcano into Sedna, and drafted the initial design proposal after many rounds of meetings, which may have inspired this KubeCon EU talk (Kubeflow Summit). I also raised this issue to discuss the current problems in our design and point out a possible way forward.
Currently, this project is blocked by the implementation of Kubeflow Training V2. I don't have enough time and energy to continue after the LFX'24 period, so I'm leaving this issue for successors as a reference.
Thanks for your support during the LFX'24 period @Shelley-BaoYue @fisherxu @tangming1996 @MooreZheng @jaypume @hsj576