KEP-2401: Kubeflow LLM Trainer V2 #2410
Conversation
Should security, i.e. hard multi-tenancy, Istio support, and the restricted Pod Security Standards, be part of the KEP?

@juliusvonkohout We haven't considered it yet. Our initial goal is to introduce simple approaches to see how users will use this feature, and to make it as easy as possible to use. Maybe we could add them as tasks for the next stage. WDYT @franciscojavierarceo @kubeflow/wg-training-leads

I would probably leave that out of scope. Not to say that it's not important, of course.
Hi folks, just a friendly reminder that this Wednesday at 5pm UTC we will discuss this KEP. cc @kubeflow/wg-training-leads @Electronic-Waste @franciscojavierarceo @joecummings @astefanutti @akshaychitneni @shravan-achar @janeyx99 @bigsur0
Thanks for the updates @Electronic-Waste!
To hide complex Kubernetes configurations from users, we will provide a simple yet flexible Python SDK wrapping all specifications of models, datasets, training runtimes, and fine-tuning configs. Like this:

```python
job_id = TrainingClient().train(
    ...
)
```
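For illustration, here is a hedged sketch of what such a call might look like end to end, using the class names that appear in this KEP (`TrainingClient`, `HuggingFaceModelInputConfig`, `HuggingFaceDatasetConfig`, `TorchTuneConfig`); the import paths and argument values are assumptions, not the final API:

```python
# Illustrative sketch only: the package layout and argument names below are
# assumptions based on this KEP, not a released SDK.
from kubeflow.trainer import TrainingClient                   # hypothetical import path
from kubeflow.trainer.types import (                          # hypothetical import path
    HuggingFaceDatasetConfig,
    HuggingFaceModelInputConfig,
    TorchTuneConfig,
)

job_id = TrainingClient().train(
    runtime_ref="torchtune-llm-finetuning",                   # ClusterTrainingRuntime to use
    model_config=HuggingFaceModelInputConfig(
        storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",  # example model location
    ),
    dataset_config=HuggingFaceDatasetConfig(
        storage_uri="hf://tatsu-lab/alpaca",                  # example dataset location
    ),
    fine_tuning_config=TorchTuneConfig(),                     # fall back to the runtime defaults
)
```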
Please can you update this API according to our recent design: https://github.com/kubeflow/trainer/blob/54ab01e9050b98e608b675a4813058a774991b76/docs/proposals/2401-llm-trainer-v2/README.md#modify-the-train-api
Thanks for pointing this out!
```yaml
spec:
  containers:
    - name: trainer
      image: <pytorch+cuda+torchtune image>
```
Do we need to maintain this image ourselves under `/cmd/trainers/torchtune/Dockerfile`?
SGTM.
```python
recipe: str
config: str
```
As we discussed in Slack, we might want to create a Runtime for each Model/Recipe, which means users can select the appropriate runtime based on it.
However, as you mentioned in this KEP, that will increase the number of ClusterTrainingRuntimes we need to deploy.
Any thoughts on the UX here @kubeflow/wg-training-leads @franciscojavierarceo @astefanutti?
Should we have an API that returns the available torchtune configs that we support?
> However, as you mentioned in this KEP, that will increase the number of ClusterTrainingRuntimes we need to deploy.

It would increase the number of ClusterTrainingRuntimes to 100 or so, which I think is unacceptable for us to maintain.
If we remove the `recipe` and `config` parameters in `TorchtuneConfig`, we would need to create exactly the same number of TrainingRuntimes, one for each recipe-config tuple. That is because we can't merge the config files for a model into one file and mutate it on demand to fit every scenario: we can't guarantee that all config files share the same default configurations (e.g. profiler, tokenizer), which are quite complex.
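For reference, a minimal sketch of the two parameters being debated here; the field names follow the snippet above, and the example values are illustrative torchtune recipe/config names, not a finalized list:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TorchTuneConfig:
    """Sketch of the proposed config; only the two debated fields are shown."""
    recipe: Optional[str] = None  # torchtune recipe, e.g. "lora_finetune_single_device"
    config: Optional[str] = None  # torchtune config, e.g. "llama3_2/1B_lora_single_device"

# Keeping recipe/config in the SDK lets one generic runtime serve many
# recipe-config tuples; removing them pushes us toward one runtime per tuple.
cfg = TorchTuneConfig(
    recipe="lora_finetune_single_device",
    config="llama3_2/1B_lora_single_device",
)
print(cfg)
```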
But if we know which config we need to fetch, can we mutate its values based on the user's configuration?
Should we have a list of supported configs for every supported LLM in the Kubeflow SDK?
I want to design a UX where users don't need to know the exact recipe and config they should use to fine-tune an LLM.
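One possible shape for that idea, sketched below: a small catalog in the SDK mapping each supported LLM to validated torchtune (recipe, config) pairs, so users only have to name the model. The entries and the helper are illustrative assumptions, not an agreed-upon list:

```python
from typing import Dict, List, Tuple

# Hypothetical catalog; model keys and recipe/config pairs are examples only.
SUPPORTED_TORCHTUNE_CONFIGS: Dict[str, List[Tuple[str, str]]] = {
    "llama-3.2-1b": [
        ("lora_finetune_single_device", "llama3_2/1B_lora_single_device"),
        ("full_finetune_distributed", "llama3_2/1B_full"),
    ],
    "llama-3.3-70b": [
        ("full_finetune_distributed", "llama3_3/70B_full_multinode"),
    ],
}

def resolve_recipe(model: str, peft: bool = True) -> Tuple[str, str]:
    """Pick a (recipe, config) pair so the user only names the model."""
    candidates = SUPPORTED_TORCHTUNE_CONFIGS[model]
    for recipe, config in candidates:
        if ("lora" in recipe) == peft:
            return recipe, config
    return candidates[0]

print(resolve_recipe("llama-3.2-1b", peft=True))
# ('lora_finetune_single_device', 'llama3_2/1B_lora_single_device')
```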
@deepanker13 @saileshd1402 @johnugeorge do you have any ideas here?
```yaml
- name: trainer
  image: <pytorch+cuda+torchtune image>
  command:
    - tune ls
```
I think, by default, our runtime should be able to fine-tune a model without any additional configuration from users. So a user can do something like this:

```python
TrainerClient().train(
    runtime_ref="torchtune-llama-3.3-70b"
)
```

In that case, we will just use the default settings configured under the runtime.
However, that will depend on this: #2410 (comment).
Let's continue the discussion in the other thread.
Can we wrap the default torchtune config under the TrainingRuntime args, and override the user values using `tune run` arguments? So the experience would be (see the sketch below):
- The ClusterTrainingRuntime contains the default torchtune config to fine-tune the model.
- Users can override values using the `TorchTuneConfig()` class.
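A rough sketch of that override flow, assuming torchtune's `tune run ... --config ... key=value` override syntax; the helper name and the override keys are illustrative, not part of the design:

```python
from typing import Dict, List

def build_tune_run_command(recipe: str, config: str, overrides: Dict[str, object]) -> List[str]:
    """Append user-supplied TorchTuneConfig values as key=value CLI overrides."""
    cmd = ["tune", "run", recipe, "--config", config]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd

print(build_tune_run_command(
    "lora_finetune_single_device",
    "llama3_2/1B_lora_single_device",
    {"batch_size": 4, "epochs": 1, "dtype": "bf16"},
))
# ['tune', 'run', 'lora_finetune_single_device', '--config',
#  'llama3_2/1B_lora_single_device', 'batch_size=4', 'epochs=1', 'dtype=bf16']
```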
The default torchtune configs are defined in the config file, which will be downloaded automatically to the training container by `torchtune`. So maybe we do not need to wrap the default torchtune config here :)
**How to Determine Default Resources**

Currently, `torchtune` has limited support for multi-node training (but it is coming soon). So I would propose that we use 1 PyTorch node and 1 GPU by default. Users can specify `num_nodes` and `resource_per_node` in the `Trainer` field to increase the number of PyTorch nodes and GPUs.
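To make the proposed default concrete, here is a hedged sketch of the `Trainer` field; the class and field names follow the KEP text (`num_nodes`, `resource_per_node`), and the GPU resource key is an example, not a finalized API:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Trainer:
    """Sketch: 1 PyTorch node with 1 GPU unless the user raises it."""
    num_nodes: int = 1
    resource_per_node: Dict[str, str] = field(
        default_factory=lambda: {"nvidia.com/gpu": "1"}
    )

# Scale out explicitly once the chosen model/config supports multi-node training.
trainer = Trainer(num_nodes=2, resource_per_node={"nvidia.com/gpu": "8"})
print(trainer)
```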
We already have multi-node support in `torchtune`, don't we @joecummings? Do you mean that it is not supported for all LLMs?
Currently, I only see multi-node support for Llama 3.3 70B: https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_full_multinode.yaml
Is there something stopping us from using `--nnodes=2` for other configs?
```python
def train(
    trainer: Optional[CustomTrainer],
    fine_tuning_config: Optional[Union[TorchTuneConfig]],
    dataset_config: Optional[types.HuggingFaceDatasetConfig] = None,
    model_config: Optional[types.HuggingFaceModelInputConfig] = None,
    runtime_ref: Optional[str] = "torchtune-llm-finetuning",
) -> str:
```
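Two illustrative calls against this signature, to show how the `trainer` and `fine_tuning_config` paths would differ; the imports, the `my_training_fn` helper, and the `CustomTrainer(func=...)` shape are assumptions for the sketch:

```python
from kubeflow.trainer import TrainerClient, CustomTrainer   # hypothetical import path
from kubeflow.trainer import types                          # hypothetical import path

def my_training_fn():
    """User-defined training function for the custom-trainer path."""
    ...

# 1. Bring-your-own training function on a generic PyTorch runtime.
job_id = TrainerClient().train(
    trainer=CustomTrainer(func=my_training_fn),
    runtime_ref="torch-distributed",
)

# 2. Config-driven LLM fine-tuning on the torchtune runtime.
job_id = TrainerClient().train(
    fine_tuning_config=types.TorchTuneConfig(),
    dataset_config=types.HuggingFaceDatasetConfig(storage_uri="hf://tatsu-lab/alpaca"),
    model_config=types.HuggingFaceModelInputConfig(storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct"),
    runtime_ref="torchtune-llm-finetuning",
)
```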
@astefanutti @kubeflow/wg-training-leads @Electronic-Waste @deepanker13 @saileshd1402 @franciscojavierarceo @seanlaii What do you think about this API design?
Maybe it gives us the opportunity in the future to support more framework-specific trainers like TorchTrainer, TorchXLATrainer, DeepSpeedTrainer, etc., which will help users appropriately configure the model and dataset (assign devices and configure the distributed backend).
> What do you think about this API design?

LGTM.
```
| | | |-- kustomization.yaml
| | | |-- mpi_distributed.yaml   # MPI Distributed Runtime
| | | |-- torch_distributed.yaml # PyTorch Distributed Runtime
| | |-- posttraining/
```
This needs to be reconsidered according to #2430.
@Electronic-Waste What do you think about this folder structure:

```
manifests/
|-- base/
| |-- runtimes/
| | |-- kustomization.yaml
| | |-- mpi_distributed.yaml # MPI Distributed Runtime
| | |-- torch_distributed.yaml # PyTorch Distributed Runtime
| | |-- torchtune/
| | | |-- kustomization.yaml
| | | |-- torchtune_llm_finetuning.yaml # Torchtune LLM Fine-tuning Runtime
| |-- crds/
```
Maybe, as labels to apply to these runtimes, we should add:

```
trainer.runtime.kubeflow.org/type: custom-trainer
trainer.runtime.kubeflow.org/phase: any

trainer.runtime.kubeflow.org/type: torchtune
trainer.runtime.kubeflow.org/phase: post-training
```
WDYT @Electronic-Waste @astefanutti @kubeflow/wg-training-leads @franciscojavierarceo ?
Maybe:

```
manifests/
|-- base/
| |-- runtimes/
| | |-- kustomization.yaml
| | |-- mpi_distributed.yaml # MPI Distributed Runtime
| | |-- torch_distributed.yaml # PyTorch Distributed Runtime
| | |-- torchtune_llm_finetuning.yaml # Torchtune LLM Fine-tuning Runtime
| |-- crds/
```

is better? Since we only use torchtune for LLM fine-tuning, it might be unnecessary to open a new directory for it.
### Support some common PEFT mechanisms

We need to support some common PEFT mechanisms like LoRA, QLoRA, and DoRA to allow users to optimize memory usage when fine-tuning LLMs. This is crucial for users who have limited resources and want to fine-tune their model at minimum cost.
Should we split LoRA, QLoRA, and DoRA between different configs?
They are coupled in `torchtune`: they share most of the parameters. Compared to LoRA, QLoRA only adds `quant_base`; compared to QLoRA, DoRA only adds `use_dora`. It might be a good choice to implement them together.
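A sketch of what a single shared config class could look like, reflecting that coupling; the field names loosely follow torchtune's LoRA options and are illustrative, not a finalized SDK type:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LoraConfig:
    """One class covering LoRA, QLoRA, and DoRA (illustrative field names)."""
    lora_attn_modules: List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj", "output_proj"]
    )
    lora_rank: int = 8
    lora_alpha: float = 16.0
    lora_dropout: float = 0.0
    quantize_base: bool = False  # True -> QLoRA-style quantized base weights
    use_dora: bool = False       # True -> DoRA

lora = LoraConfig()
qlora = LoraConfig(quantize_base=True)
dora = LoraConfig(use_dora=True)
print(lora, qlora, dora, sep="\n")
```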
| Parameter | Type | Description |
| --- | --- | --- |
| `epochs` | `Optional[int]` | The number of complete passes through the training dataset. |
| `loss` | `Optional[str]` | The loss algorithm we use to fine-tune the LLM, e.g. `torchtune.modules.loss.CEWithChunkedOutputLoss`. |
| `peft_config` | `Optional[Union[LoraConfig]]` | Configuration for PEFT (Parameter-Efficient Fine-Tuning), including LoRA/QLoRA/DoRA, etc. |
| `dataset_preprocess_config` | `Optional[Union[InstructDataset, ChatDataset, MultimodalDataset]]` | Configuration for dataset preprocessing. |
Since we also have `dataset_config` as part of the `train()` API, how do we distinguish them?
Do we have any better experience to make it clear to users when they should use each?
Maybe we could create a new dataset config, `TorchTuneDatasetConfig`, where we can define dataset properties that are specific to `torchtune`.
We can always add the `storage_uri` parameter there, which allows users to configure the location of the dataset:
- hf://...
- s3://...

@joecummings Do we have any capabilities in `torchtune` to pre-process a dataset outside of the main `tune run` loop?
For example, in a multi-node environment the dataset could be pre-processed on a CPU-based machine before passing the data to all Training Nodes.
To make the UX clear, we can put `dataset_config` under `TorchTuneConfig` as well.
> Since we also have dataset_config as part of the train() API, how do we distinguish it?

`dataset_config` is global and is in charge of downloading the dataset. `dataset_preprocess_config` is limited to `torchtune` only, doing the preprocessing work with the help of `torchtune`'s built-in Dataset class support.
It might not be a good idea to put `dataset_config` into `TorchtuneConfig`, since their scopes are different.
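To make the split concrete, a hedged usage sketch: `dataset_config` says where the data comes from, while `dataset_preprocess_config` inside `TorchTuneConfig` says how torchtune should turn it into training samples. The import paths, the `InstructDataset(column_map=...)` shape, and the dataset/model URIs are assumptions:

```python
from kubeflow.trainer import TrainingClient                 # hypothetical import path
from kubeflow.trainer.types import (                        # hypothetical import path
    HuggingFaceDatasetConfig,
    InstructDataset,
    TorchTuneConfig,
)

job_id = TrainingClient().train(
    # Global: downloads the dataset (e.g. via the dataset-initializer).
    dataset_config=HuggingFaceDatasetConfig(storage_uri="hf://tatsu-lab/alpaca"),
    # torchtune-specific: how the downloaded data is preprocessed for training.
    fine_tuning_config=TorchTuneConfig(
        dataset_preprocess_config=InstructDataset(column_map={"input": "prompt"}),
    ),
    runtime_ref="torchtune-llm-finetuning",
)
```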
Eventually, shouldn't we perform dataset preprocessing on CPU-based Kubernetes Pods using the dataset-initializer container, to offload this work from GPU nodes?
This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170
We are collecting the final community feedback and any suggestions are welcome!
Open Questions
- We use the `tune run` CLI to enable distributed training, instead of passing distributed parameters that begin with `PET_` as env variables. Do you prefer reusing the `torch` runtime plugin or creating a new one?

/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0