KEP-2401: Kubeflow LLM Trainer V2 #2410
Conversation
Should security, i.e. hard multi-tenancy, Istio support, and the restricted Pod Security Standards, be part of the KEP?

@juliusvonkohout We haven't considered it yet. Our initial goal is to introduce simple approaches to see how users will use this feature, and to make it as easy as possible to use. Maybe we could add them as tasks for the next stage. WDYT @franciscojavierarceo @kubeflow/wg-training-leads

I would probably leave that out of scope. Not to say that it's not important, of course.
Hi folks, just a friendly reminder that this Wednesday at 5pm UTC we will discuss this KEP. cc @kubeflow/wg-training-leads @Electronic-Waste @franciscojavierarceo @joecummings @astefanutti @akshaychitneni @shravan-achar @janeyx99 @bigsur0
Thanks for the updates @Electronic-Waste!
To hide complex Kubernetes configurations from users, we will provide a simple yet flexible Python SDK wrapping all specifications of models, datasets, training runtimes, and fine-tuning configs. Like this:

```python
job_id = TrainingClient().train(
    ...
)
```
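For illustration, here is a hedged sketch of what such a call might look like end to end, using the class names that appear in this KEP (`TrainingClient`, `HuggingFaceModelInputConfig`, `HuggingFaceDatasetConfig`, `TorchTuneConfig`); the import paths and argument values are assumptions, not the final API:

```python
# Illustrative sketch only: the package layout and argument names below are
# assumptions based on this KEP, not a released SDK.
from kubeflow.trainer import TrainingClient                   # hypothetical import path
from kubeflow.trainer.types import (                          # hypothetical import path
    HuggingFaceDatasetConfig,
    HuggingFaceModelInputConfig,
    TorchTuneConfig,
)

job_id = TrainingClient().train(
    runtime_ref="torchtune-llm-finetuning",                   # ClusterTrainingRuntime to use
    model_config=HuggingFaceModelInputConfig(
        storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",  # example model location
    ),
    dataset_config=HuggingFaceDatasetConfig(
        storage_uri="hf://tatsu-lab/alpaca",                  # example dataset location
    ),
    fine_tuning_config=TorchTuneConfig(),                     # fall back to the runtime defaults
)
```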
Please can you update this API according to our recent design: https://github.com/kubeflow/trainer/blob/54ab01e9050b98e608b675a4813058a774991b76/docs/proposals/2401-llm-trainer-v2/README.md#modify-the-train-api
Thanks for pointing this out!
```yaml
spec:
  containers:
    - name: trainer
      image: <pytorch+cuda+torchtune image>
```
Do we need to maintain this image ourselves under `/cmd/trainers/torchtune/Dockerfile`?
SGTM.
```python
recipe: str
config: str
```
As we discussed in Slack, we might want to create a Runtime for each Model/Recipe, which means users can select the appropriate runtime based on it.
However, as you mentioned in this KEP, that will increase the number of ClusterTrainingRuntimes we need to deploy.
Any thoughts on the UX here @kubeflow/wg-training-leads @franciscojavierarceo @astefanutti?
Should we have an API that returns the available torchtune configs that we support?
> However, as you mentioned in this KEP, that will increase the number of ClusterTrainingRuntimes we need to deploy.

It would increase the number of ClusterTrainingRuntimes to 100 or so, which I think is unacceptable for us to maintain.
If we remove the `recipe` and `config` parameters in `TorchtuneConfig`, we would need to create exactly the same number of TrainingRuntimes, one for each recipe-config tuple. That is because we can't merge the config files for a model into one file and mutate it on demand to fit every scenario: we can't guarantee that all config files share the same default configurations (e.g. profiler, tokenizer), which are quite complex.
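For reference, a minimal sketch of the two parameters being debated here; the field names follow the snippet above, and the example values are illustrative torchtune recipe/config names, not a finalized list:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TorchTuneConfig:
    """Sketch of the proposed config; only the two debated fields are shown."""
    recipe: Optional[str] = None  # torchtune recipe, e.g. "lora_finetune_single_device"
    config: Optional[str] = None  # torchtune config, e.g. "llama3_2/1B_lora_single_device"

# Keeping recipe/config in the SDK lets one generic runtime serve many
# recipe-config tuples; removing them pushes us toward one runtime per tuple.
cfg = TorchTuneConfig(
    recipe="lora_finetune_single_device",
    config="llama3_2/1B_lora_single_device",
)
print(cfg)
```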
But if we know which config we need to fetch, can we mutate its values based on the user's configuration?
Should we have a list of supported configs for every supported LLM in the Kubeflow SDK?
I want to design a UX where users don't need to know the exact recipe and config they should use to fine-tune an LLM.
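One possible shape for that idea, sketched below: a small catalog in the SDK mapping each supported LLM to validated torchtune (recipe, config) pairs, so users only have to name the model. The entries and the helper are illustrative assumptions, not an agreed-upon list:

```python
from typing import Dict, List, Tuple

# Hypothetical catalog; model keys and recipe/config pairs are examples only.
SUPPORTED_TORCHTUNE_CONFIGS: Dict[str, List[Tuple[str, str]]] = {
    "llama-3.2-1b": [
        ("lora_finetune_single_device", "llama3_2/1B_lora_single_device"),
        ("full_finetune_distributed", "llama3_2/1B_full"),
    ],
    "llama-3.3-70b": [
        ("full_finetune_distributed", "llama3_3/70B_full_multinode"),
    ],
}

def resolve_recipe(model: str, peft: bool = True) -> Tuple[str, str]:
    """Pick a (recipe, config) pair so the user only names the model."""
    candidates = SUPPORTED_TORCHTUNE_CONFIGS[model]
    for recipe, config in candidates:
        if ("lora" in recipe) == peft:
            return recipe, config
    return candidates[0]

print(resolve_recipe("llama-3.2-1b", peft=True))
# ('lora_finetune_single_device', 'llama3_2/1B_lora_single_device')
```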
@deepanker13 @saileshd1402 @johnugeorge do you have any ideas here?
```yaml
- name: trainer
  image: <pytorch+cuda+torchtune image>
  command:
    - tune ls
```
I think, by default, our runtime should be able to fine-tune a model without any additional configuration from users. So a user can do something like this:

```python
TrainerClient().train(
    runtime_ref="torchtune-llama-3.3-70b"
)
```

In that case, we will just use the default settings configured under the runtime.
However, that will depend on this: #2410 (comment).
Let's continue the discussion in the other thread.
Can we wrap the default torchtune config under the TrainingRuntime args, and override the user values using `tune run` arguments? So the experience would be (see the sketch below):
- The ClusterTrainingRuntime contains the default torchtune config to fine-tune the model.
- Users can override values using the `TorchTuneConfig()` class.
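A rough sketch of that override flow, assuming torchtune's `tune run ... --config ... key=value` override syntax; the helper name and the override keys are illustrative, not part of the design:

```python
from typing import Dict, List

def build_tune_run_command(recipe: str, config: str, overrides: Dict[str, object]) -> List[str]:
    """Append user-supplied TorchTuneConfig values as key=value CLI overrides."""
    cmd = ["tune", "run", recipe, "--config", config]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd

print(build_tune_run_command(
    "lora_finetune_single_device",
    "llama3_2/1B_lora_single_device",
    {"batch_size": 4, "epochs": 1, "dtype": "bf16"},
))
# ['tune', 'run', 'lora_finetune_single_device', '--config',
#  'llama3_2/1B_lora_single_device', 'batch_size=4', 'epochs=1', 'dtype=bf16']
```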
The default torchtune configs are defined in the config file, which will be downloaded automatically to the training container by `torchtune`. So maybe we do not need to wrap the default torchtune config here :)
**How to Determine Default Resources**

Currently, `torchtune` has limited support for multi-node training (but it is coming soon). So I would propose that we use 1 PyTorch node and 1 GPU by default. Users can specify `num_nodes` and `resource_per_node` in the `Trainer` field to increase the number of PyTorch nodes and GPUs.
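To make the proposed default concrete, here is a hedged sketch of the `Trainer` field; the class and field names follow the KEP text (`num_nodes`, `resource_per_node`), and the GPU resource key is an example, not a finalized API:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Trainer:
    """Sketch: 1 PyTorch node with 1 GPU unless the user raises it."""
    num_nodes: int = 1
    resource_per_node: Dict[str, str] = field(
        default_factory=lambda: {"nvidia.com/gpu": "1"}
    )

# Scale out explicitly once the chosen model/config supports multi-node training.
trainer = Trainer(num_nodes=2, resource_per_node={"nvidia.com/gpu": "8"})
print(trainer)
```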
We already have multi-node support in `torchtune`, don't we @joecummings? Do you mean that it is not supported for all LLMs?
Currently, I only see multi-node support for Llama 3.3 70B: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_full_multinode.yaml
Is there something stopping us from using `--nnodes=2` for other configs?
```python
def train(
    trainer: Optional[CustomTrainer],
    fine_tuning_config: Optional[Union[TorchTuneConfig]],
    dataset_config: Optional[types.HuggingFaceDatasetConfig] = None,
    model_config: Optional[types.HuggingFaceModelInputConfig] = None,
    runtime_ref: Optional[str] = "torchtune-llm-finetuning",
) -> str:
```
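Two illustrative calls against this signature, to show how the `trainer` and `fine_tuning_config` paths would differ; the imports, the `my_training_fn` helper, and the `CustomTrainer(func=...)` shape are assumptions for the sketch:

```python
from kubeflow.trainer import TrainerClient, CustomTrainer   # hypothetical import path
from kubeflow.trainer import types                          # hypothetical import path

def my_training_fn():
    """User-defined training function for the custom-trainer path."""
    ...

# 1. Bring-your-own training function on a generic PyTorch runtime.
job_id = TrainerClient().train(
    trainer=CustomTrainer(func=my_training_fn),
    runtime_ref="torch-distributed",
)

# 2. Config-driven LLM fine-tuning on the torchtune runtime.
job_id = TrainerClient().train(
    fine_tuning_config=types.TorchTuneConfig(),
    dataset_config=types.HuggingFaceDatasetConfig(storage_uri="hf://tatsu-lab/alpaca"),
    model_config=types.HuggingFaceModelInputConfig(storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct"),
    runtime_ref="torchtune-llm-finetuning",
)
```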
@astefanutti @kubeflow/wg-training-leads @Electronic-Waste @deepanker13 @saileshd1402 @franciscojavierarceo @seanlaii What do you think about this API design?
Maybe it gives us the opportunity in the future to support more framework-specific trainers like TorchTrainer, TorchXLATrainer, DeepSpeedTrainer, etc., which will help users appropriately configure the model and dataset (assign devices and configure the distributed backend).
> What do you think about this API design?

LGTM.
```
| | | |-- kustomization.yaml
| | | |-- mpi_distributed.yaml   # MPI Distributed Runtime
| | | |-- torch_distributed.yaml # PyTorch Distributed Runtime
| | |-- posttraining/
```
This needs to be reconsidered according to #2430.
@Electronic-Waste What do you think about this folder structure:

```
manifests/
|-- base/
| |-- runtimes/
| | |-- kustomization.yaml
| | |-- mpi_distributed.yaml # MPI Distributed Runtime
| | |-- torch_distributed.yaml # PyTorch Distributed Runtime
| | |-- torchtune/
| | | |-- kustomization.yaml
| | | |-- torchtune_llm_finetuning.yaml # Torchtune LLM Fine-tuning Runtime
| |-- crds/
```
Maybe, as labels to apply to these runtimes, we should add:

```
trainer.runtime.kubeflow.org/type: custom-trainer
trainer.runtime.kubeflow.org/phase: any

trainer.runtime.kubeflow.org/type: torchtune
trainer.runtime.kubeflow.org/phase: post-training
```
WDYT @Electronic-Waste @astefanutti @kubeflow/wg-training-leads @franciscojavierarceo ?
Maybe:

```
manifests/
|-- base/
| |-- runtimes/
| | |-- kustomization.yaml
| | |-- mpi_distributed.yaml # MPI Distributed Runtime
| | |-- torch_distributed.yaml # PyTorch Distributed Runtime
| | |-- torchtune_llm_finetuning.yaml # Torchtune LLM Fine-tuning Runtime
| |-- crds/
```

is better? Since we only use torchtune for LLM fine-tuning, it might be unnecessary to open a new directory for it.
### Support some common PEFT mechanisms

We need to support some common PEFT mechanisms like LoRA, QLoRA, and DoRA to allow users to optimize memory usage when fine-tuning LLMs. This is crucial for users who have limited resources and want to fine-tune their model at minimum cost.
Should we split LoRA, QLoRA, and DoRA between different configs?
They are coupled in `torchtune`: they share most of the parameters. Compared to LoRA, QLoRA only adds `quant_base`; compared to QLoRA, DoRA only adds `use_dora`. It might be a good choice to implement them together.
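A sketch of what a single shared config class could look like, reflecting that coupling; the field names loosely follow torchtune's LoRA options and are illustrative, not a finalized SDK type:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LoraConfig:
    """One class covering LoRA, QLoRA, and DoRA (illustrative field names)."""
    lora_attn_modules: List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj", "output_proj"]
    )
    lora_rank: int = 8
    lora_alpha: float = 16.0
    lora_dropout: float = 0.0
    quantize_base: bool = False  # True -> QLoRA-style quantized base weights
    use_dora: bool = False       # True -> DoRA

lora = LoraConfig()
qlora = LoraConfig(quantize_base=True)
dora = LoraConfig(use_dora=True)
print(lora, qlora, dora, sep="\n")
```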
| Parameter | Type | Description |
| --- | --- | --- |
| `epochs` | `Optional[int]` | The number of complete passes through the training dataset. |
| `loss` | `Optional[str]` | The loss algorithm we use to fine-tune the LLM, e.g. `torchtune.modules.loss.CEWithChunkedOutputLoss`. |
| `peft_config` | `Optional[Union[LoraConfig]]` | Configuration for PEFT (Parameter-Efficient Fine-Tuning), including LoRA/QLoRA/DoRA, etc. |
| `dataset_preprocess_config` | `Optional[Union[InstructDataset, ChatDataset, MultimodalDataset]]` | Configuration for dataset preprocessing. |
Since we also have `dataset_config` as part of the `train()` API, how do we distinguish them?
Do we have any better experience to make it clear to users when they should use each?
Maybe we could create a new dataset config, `TorchTuneDatasetConfig`, where we can define dataset properties that are specific to `torchtune`.
We can always add the `storage_uri` parameter there, which allows users to configure the location of the dataset:
- hf://...
- s3://...

@joecummings Do we have any capabilities in `torchtune` to pre-process a dataset outside of the main `tune run` loop?
For example, in a multi-node environment the dataset could be pre-processed on a CPU-based machine before passing the data to all Training Nodes.
To make the UX clear, we can put `dataset_config` under `TorchTuneConfig` as well.
> Since we also have dataset_config as part of the train() API, how do we distinguish it?

`dataset_config` is global and is in charge of downloading the dataset. `dataset_preprocess_config` is limited to `torchtune` only, doing the preprocessing work with the help of `torchtune`'s built-in Dataset class support.
It might not be a good idea to put `dataset_config` into `TorchtuneConfig`, since their scopes are different.
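To make the split concrete, a hedged usage sketch: `dataset_config` says where the data comes from, while `dataset_preprocess_config` inside `TorchTuneConfig` says how torchtune should turn it into training samples. The import paths, the `InstructDataset(column_map=...)` shape, and the dataset/model URIs are assumptions:

```python
from kubeflow.trainer import TrainingClient                 # hypothetical import path
from kubeflow.trainer.types import (                        # hypothetical import path
    HuggingFaceDatasetConfig,
    InstructDataset,
    TorchTuneConfig,
)

job_id = TrainingClient().train(
    # Global: downloads the dataset (e.g. via the dataset-initializer).
    dataset_config=HuggingFaceDatasetConfig(storage_uri="hf://tatsu-lab/alpaca"),
    # torchtune-specific: how the downloaded data is preprocessed for training.
    fine_tuning_config=TorchTuneConfig(
        dataset_preprocess_config=InstructDataset(column_map={"input": "prompt"}),
    ),
    runtime_ref="torchtune-llm-finetuning",
)
```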
Eventually, shouldn't we perform dataset preprocessing on CPU-based Kubernetes Pods using the dataset-initializer container, to offload this work from GPU nodes?
This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170
We are collecting the final community feedback and any suggestions are welcome!
Open Questions
- We use the `tune run` CLI to enable distributed training, instead of passing distributed parameters that begin with `PET_` as env variables. Do you prefer reusing the `torch` runtime plugin or creating a new one?

/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0