Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Use SSA to reconcile TrainJob components #2431

Merged
merged 17 commits into from
Feb 26, 2025

Conversation

astefanutti
Copy link
Contributor

@astefanutti astefanutti commented Feb 11, 2025

What this PR does / why we need it:

Replace Get/Create/Update logic with SSA to reconcile TrainJob components.

I've successfully tested it with https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb.

Depends on kubernetes-sigs/jobset#782.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2297

Checklist:

  • Docs included if any changes are user facing

@astefanutti
Copy link
Contributor Author

/hold

Requires kubernetes-sigs/jobset#782

@astefanutti
Copy link
Contributor Author

astefanutti commented Feb 11, 2025

@andreyvelich @tenzen-y @varshaprasad96 please take a look when you can.

@astefanutti astefanutti force-pushed the pr-ssa branch 2 times, most recently from 8e002f8 to 465e443 Compare February 11, 2025 13:35
go.mod Outdated
@@ -19,7 +19,7 @@ require (
sigs.k8s.io/controller-runtime v0.19.1
sigs.k8s.io/jobset v0.5.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify main branch, and then let us check if it should work fine as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I assumed you preferred to wait until the upcoming JobSet release :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current trainer development is under alpha stage. So, I don't care the library stability :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @astefanutti Should we merge this PR once JobSet release v0.8.0 ?
I think, we are planning to release it this week: kubernetes-sigs/jobset#774

cc @kannon92 @ahg-g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replaced the JobSet API version to v0.8.0-devel for now.

I'm working on adapting the unit tests but the main controller should work fine.

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Feb 13, 2025
@astefanutti astefanutti force-pushed the pr-ssa branch 4 times, most recently from 0e92a8c to 16a000a Compare February 13, 2025 09:34
@astefanutti astefanutti force-pushed the pr-ssa branch 4 times, most recently from 03d6f54 to b2063e1 Compare February 14, 2025 19:01
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.
My first question is could we avoid unstructured and then keep using client.Object typed?
I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

go.mod Outdated
Comment on lines 3 to 5
go 1.23.0

toolchain go1.23.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use these toolchains? Who is bringing this? JobSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the replace added to override the transitive dependencies coming from the newer JobSet version, go mod tidy keeps adding this. I'd need to dig into it more to understand why.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, interesting. Actually, Go < 1.23.4, they enforce to specify toolchain. But after the Go 1.23.4, the restrictions are relaxed. So, IIUC, this technically could be removed.

Or, we may just bump to Kube 1.32 as library. @andreyvelich Do you want to avoid bumping kube lib to 1.32 here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should try to avoid toolchain here if that is possible.
Let's bump k8s to 1.32

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti Should we bump k8s libs to 1.32 in this PR to remove toolchain ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich yes, do you prefer the upgrade to be done in this PR or separately?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's maybe first create PR to migrate to 1.32 if that is easier for you.

@astefanutti
Copy link
Contributor Author

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.

I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

@astefanutti
Copy link
Contributor Author

Note I still need to update the newly introduced MPI plugin to rely on apply configurations.

@tenzen-y
Copy link
Member

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.

I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

Thank you for checking that. If you think introducing the non unstructured objects with SSA has any potential "risks", please let me know. My understanding might be incorrect for SSA

@tenzen-y
Copy link
Member

Note I still need to update the newly introduced MPI plugin to rely on apply configurations.

Yeah, absolutely

@astefanutti
Copy link
Contributor Author

astefanutti commented Feb 17, 2025

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.
I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

Thank you for checking that. If you think introducing the non unstructured objects with SSA has any potential "risks", please let me know. My understanding might be incorrect for SSA

Using client.Object may work as an approximation to compute SSA patches for simpler objects, but it needs some tweaks like the following one that feel like work-arounds and that may not "scale" well for more complex objects like JobSets:
https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L509

I think the target should ultimately be to rely on the runtime.ApplyConfiguration interface that's going to be introduced: kubernetes/kubernetes#129313.

For now, I've moved the conversion to unstructured just before the call to client.Patch, but that'll go away once controller-runtime integrates runtime.ApplyConfiguration and exposes a first-class client.Apply method for SSA.

@astefanutti
Copy link
Contributor Author

/unhold

All the tests pass now.

@andreyvelich
Copy link
Member

We need to rebase PR before the merge.

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
@tenzen-y
Copy link
Member

@astefanutti Could you check commits signs?

Signed-off-by: Antonin Stefanutti <[email protected]>
@astefanutti
Copy link
Contributor Author

@astefanutti Could you check commits signs?

@tenzen-y should be good now, I need more coffee :)

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
/lgtm
/approve
/hold cancel

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 3c6c90f into kubeflow:master Feb 26, 2025
14 checks passed
@astefanutti astefanutti deleted the pr-ssa branch February 26, 2025 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Replace UPSERT operation for the objects with SSA PATCH
3 participants