-
Notifications
You must be signed in to change notification settings - Fork 733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Use SSA to reconcile TrainJob components #2431
Conversation
/hold Requires kubernetes-sigs/jobset#782 |
@andreyvelich @tenzen-y @varshaprasad96 please take a look when you can. |
8e002f8
to
465e443
Compare
go.mod
Outdated
@@ -19,7 +19,7 @@ require ( | |||
sigs.k8s.io/controller-runtime v0.19.1 | |||
sigs.k8s.io/jobset v0.5.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you specify main branch, and then let us check if it should work fine as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I assumed you preferred to wait until the upcoming JobSet release :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current trainer development is under alpha stage. So, I don't care the library stability :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @astefanutti Should we merge this PR once JobSet release v0.8.0 ?
I think, we are planning to release it this week: kubernetes-sigs/jobset#774
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've replaced the JobSet API version to v0.8.0-devel for now.
I'm working on adapting the unit tests but the main controller should work fine.
0e92a8c
to
16a000a
Compare
03d6f54
to
b2063e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this.
My first question is could we avoid unstructured and then keep using client.Object typed?
I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430
go.mod
Outdated
go 1.23.0 | ||
|
||
toolchain go1.23.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use these toolchains? Who is bringing this? JobSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the replace
added to override the transitive dependencies coming from the newer JobSet version, go mod tidy
keeps adding this. I'd need to dig into it more to understand why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, interesting. Actually, Go < 1.23.4, they enforce to specify toolchain. But after the Go 1.23.4, the restrictions are relaxed. So, IIUC, this technically could be removed.
Or, we may just bump to Kube 1.32 as library. @andreyvelich Do you want to avoid bumping kube lib to 1.32 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should try to avoid toolchain here if that is possible.
Let's bump k8s to 1.32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefanutti Should we bump k8s libs to 1.32 in this PR to remove toolchain ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich yes, do you prefer the upgrade to be done in this PR or separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's maybe first create PR to migrate to 1.32 if that is easier for you.
Ultimately the I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA. |
Note I still need to update the newly introduced MPI plugin to rely on apply configurations. |
Thank you for checking that. If you think introducing the non unstructured objects with SSA has any potential "risks", please let me know. My understanding might be incorrect for SSA |
Yeah, absolutely |
Using I think the target should ultimately be to rely on the For now, I've moved the conversion to unstructured just before the call to |
/unhold All the tests pass now. |
We need to rebase PR before the merge. |
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
@astefanutti Could you check commits signs? |
Signed-off-by: Antonin Stefanutti <[email protected]>
@tenzen-y should be good now, I need more coffee :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
/lgtm
/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Replace Get/Create/Update logic with SSA to reconcile TrainJob components.
I've successfully tested it with https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb.
Depends on kubernetes-sigs/jobset#782.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #2297
Checklist: