KEP-2170: Implement runtime framework #2248
sigs.k8s.io/controller-runtime v0.17.3
sigs.k8s.io/jobset v0.5.2
sigs.k8s.io/kueue v0.6.3
Why do we need the Kueue dependency?
This dependency comes from training-operator/pkg/runtime.v2/runtime.go, line 125 in 22da8af:
	PodRequests: kueuelr.TotalRequests(&spec.podSpec),
This allows us to set the appropriate required resources for the PodGroup. If we remove this dependency, we would have to copy Kueue's "TotalRequests" function here. I believe that just copying and pasting is not ideal.
It would be great to reduce the dependency. Maybe copy and paste is okay in this case as long as we provide a reference to the original source
That's not a simple implementation. It is complex, spanning multiple files and many lines of code.
So, I would propose keeping it here and then (after kube 1.32) switching to the kube library, as I mentioned in #2280.
Yes, we discussed with @tenzen-y offline that after k/k separates this utility function, we will remove dependency on Kueue.
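For readers following this thread, here is a minimal sketch of the kind of calculation being discussed, assuming the usual Kubernetes rule that a Pod's effective request is the larger of the summed app-container requests and the largest init-container request, plus any Pod overhead. The `ResourceList` and `PodSpec` types are simplified stand-ins defined here for illustration, not the `corev1` types, and this is not Kueue's actual code: it ignores restartable init (sidecar) containers, LimitRange, and RuntimeClass.

```go
package main

import "fmt"

// ResourceList maps a resource name to a quantity in one integer unit
// (for example millicores). Simplified stand-in, not corev1.ResourceList.
type ResourceList map[string]int64

// PodSpec holds only the per-container requests needed for this sketch.
type PodSpec struct {
	InitContainers []ResourceList
	Containers     []ResourceList
	Overhead       ResourceList
}

// TotalRequests returns max(sum(containers), max(initContainers)) + overhead
// for every resource name that appears in the spec.
func TotalRequests(ps PodSpec) ResourceList {
	total := ResourceList{}
	// App containers run concurrently, so their requests add up.
	for _, c := range ps.Containers {
		for name, q := range c {
			total[name] += q
		}
	}
	// Init containers run one at a time, so only the largest one matters.
	for _, ic := range ps.InitContainers {
		for name, q := range ic {
			if q > total[name] {
				total[name] = q
			}
		}
	}
	// Overhead is added on top of the effective request.
	for name, q := range ps.Overhead {
		total[name] += q
	}
	return total
}

func main() {
	ps := PodSpec{
		InitContainers: []ResourceList{{"cpu": 1000}},
		Containers:     []ResourceList{{"cpu": 500, "memory": 256}, {"cpu": 250}},
		Overhead:       ResourceList{"cpu": 100},
	}
	fmt.Println(TotalRequests(ps))
}
```

Even this reduced version needs three passes over the spec; the real helper also has to handle quantities, sidecars, and defaults, which is why copying it was considered non-trivial.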
Signed-off-by: Yuki Iwai <[email protected]>
…indexes for the TrainJobs Signed-off-by: Yuki Iwai <[email protected]>
options := defaultOptions
for _, opt := range opts {
	opt(&options)
}
Do we need default options for the Info object?
Current:
options := defaultOptions
for _, opt := range opts {
	opt(&options)
}
Suggested:
options := InfoOptions{}
for _, opt := range opts {
	opt(&options)
}
Do you recommend this?
Maybe. I am trying to understand how you are planning to use InfoOptions in other parts.
@tenzen-y What are the differences between Info{} and InfoOptions{}?
InfoOptions is the object used to set up the Info object.
This approach allows us to dynamically specify the parameters for Info.
If we get rid of InfoOptions, we need to specify all parameters every time or pass the Info object directly.
This approach is a well-known way to avoid functions like the following:
// Every parameter must be specified, even when not all of them are used.
func Foo(paramA string, paramB int, paramC int32, paramD bool, paramE int64)
// After introducing `infoOptions`.
func Foo(params ...InfoOption)
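A self-contained sketch of the functional options pattern described in this thread may help. `Info`, `InfoOptions`, `InfoOption`, and `defaultOptions` follow the names used in the discussion; the fields and the `WithLabels`, `WithReplicas`, and `NewInfo` helpers are hypothetical and only illustrate the shape.

```go
package main

import "fmt"

// Info stands in for the runtime Info object; these fields are hypothetical.
type Info struct {
	Labels   map[string]string
	Replicas map[string]int32
}

// InfoOptions collects the optional inputs used to build an Info.
type InfoOptions struct {
	labels   map[string]string
	replicas map[string]int32
}

// InfoOption mutates InfoOptions; callers pass only the options they need.
type InfoOption func(*InfoOptions)

// defaultOptions is the starting point; for now it is the empty struct.
var defaultOptions = InfoOptions{}

// WithLabels sets the labels (hypothetical option).
func WithLabels(labels map[string]string) InfoOption {
	return func(o *InfoOptions) { o.labels = labels }
}

// WithReplicas records a replica count under a name (hypothetical option).
func WithReplicas(name string, n int32) InfoOption {
	return func(o *InfoOptions) {
		if o.replicas == nil {
			o.replicas = map[string]int32{}
		}
		o.replicas[name] = n
	}
}

// NewInfo copies the defaults, applies each option in order, and builds Info.
func NewInfo(opts ...InfoOption) *Info {
	options := defaultOptions // struct copy, so the defaults are not mutated
	for _, opt := range opts {
		opt(&options)
	}
	return &Info{Labels: options.labels, Replicas: options.replicas}
}

func main() {
	info := NewInfo(WithReplicas("node", 2))
	fmt.Println(info.Replicas["node"], len(info.Labels))
}
```

Starting from `defaultOptions` rather than `InfoOptions{}` behaves identically while the defaults are empty, but leaves a single place to add defaults later without touching callers.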
I see, so maybe let's name it InfoOption rather than defaultOptions to make it clearer, since we are not going to have default values for the Info object.
> we are not going to have default values for Info object.
Actually, this is the default InfoOptions. Here, the default is an empty struct.
So, the line below initializes options with the default values (currently the default has empty fields):
options := defaultOptions
In the future, we need to consider whether we should add default parameters to infoOption, for example common labels and annotations.
I see, that makes sense.
for rName := range info.TotalRequests {
	info.TotalRequests[rName] = runtime.TotalResourceRequest{
		Replicas:    numNodes,
		PodRequests: info.TotalRequests[rName].PodRequests,
	}
}
How are we using this while enforcing the MLPolicy?
~~ if info == nil || info.MLPolicy != nil { ~~
I wanted to implement this in line 43. Maybe I failed to rebase.
Let me fix this.
Never mind the comments above.
spec:
  mlPolicy:
    numNodes: 1
We can imagine this situation.
I see, so we override the value that we set here, right?
training-operator/pkg/runtime.v2/runtime.go, lines 120 to 125 in 0c376d3:
for _, spec := range options.podSpecReplicas {
	info.TotalRequests[spec.name] = TotalResourceRequest{
		Replicas: spec.replicas,
		// TODO: Need to address LimitRange and RuntimeClass.
		PodRequests: kueuelr.TotalRequests(&spec.podSpec),
	}
Yes, the plainml code updates it with the proper value.
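To make the override discussed in this thread concrete, here is a small sketch under stated assumptions. `TotalResourceRequest` and `Info` are reduced to the fields mentioned above, `PodRequests` is simplified to a plain map, and the `overrideReplicas` helper and its nil guards are hypothetical, not the PR's code.

```go
package main

import "fmt"

// TotalResourceRequest mirrors the shape quoted in the thread, with
// PodRequests simplified to a plain map for this sketch.
type TotalResourceRequest struct {
	Replicas    int32
	PodRequests map[string]int64
}

// Info carries only the field needed here.
type Info struct {
	TotalRequests map[string]TotalResourceRequest
}

// overrideReplicas replaces Replicas with numNodes for every entry while
// keeping the PodRequests that were computed when Info was built.
// It does nothing when no numNodes is set in the policy.
func overrideReplicas(info *Info, numNodes *int32) {
	if info == nil || numNodes == nil {
		return
	}
	for rName := range info.TotalRequests {
		// Map values are not addressable, so the whole entry is reassigned.
		info.TotalRequests[rName] = TotalResourceRequest{
			Replicas:    *numNodes,
			PodRequests: info.TotalRequests[rName].PodRequests,
		}
	}
}

func main() {
	info := &Info{TotalRequests: map[string]TotalResourceRequest{
		"node": {Replicas: 4, PodRequests: map[string]int64{"cpu": 500}},
	}}
	one := int32(1) // corresponds to spec.mlPolicy.numNodes: 1
	overrideReplicas(info, &one)
	fmt.Println(info.TotalRequests["node"].Replicas)
}
```

The value taken from the template's replicas is only the initial guess; the policy value wins when it is present, and the per-Pod requests are untouched.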
)

var (
	TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
Should the indexer key match the Golang struct name or the JSON name?
E.g. the trainingRuntimeSpec is named .spec:
	Spec TrainingRuntimeSpec `json:"spec,omitempty"`
We can specify an arbitrary key name, but the key is global within the training-operator.
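As an illustration of why the key can be any string but must be unique, here is a toy in-memory indexer. It is not the controller-runtime API; the `Indexer`, `IndexFunc`, and simplified `TrainingRuntime` types are hypothetical and exist only to show that the key is an opaque name paired with an extractor function.

```go
package main

import "fmt"

// The key is an opaque name; it does not have to match struct or JSON paths.
var TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

// TrainingRuntime is a simplified stand-in for the real object.
type TrainingRuntime struct {
	Name              string
	RuntimeClassNames []string
}

// IndexFunc extracts the values under which an object should be indexed.
type IndexFunc func(rt TrainingRuntime) []string

// Indexer is a toy registry: index key -> indexed value -> object names.
type Indexer struct {
	funcs map[string]IndexFunc
	index map[string]map[string][]string
}

func NewIndexer() *Indexer {
	return &Indexer{
		funcs: map[string]IndexFunc{},
		index: map[string]map[string][]string{},
	}
}

// Register rejects duplicate keys, because keys share one global namespace.
func (i *Indexer) Register(key string, fn IndexFunc) error {
	if _, ok := i.funcs[key]; ok {
		return fmt.Errorf("index key %q is already registered", key)
	}
	i.funcs[key] = fn
	i.index[key] = map[string][]string{}
	return nil
}

// Add runs every registered extractor against the object.
func (i *Indexer) Add(rt TrainingRuntime) {
	for key, fn := range i.funcs {
		for _, v := range fn(rt) {
			i.index[key][v] = append(i.index[key][v], rt.Name)
		}
	}
}

// Lookup returns the names of objects indexed under key=value.
func (i *Indexer) Lookup(key, value string) []string {
	return i.index[key][value]
}

func main() {
	idx := NewIndexer()
	byRuntimeClass := func(rt TrainingRuntime) []string { return rt.RuntimeClassNames }
	if err := idx.Register(TrainingRuntimeContainerRuntimeClassKey, byRuntimeClass); err != nil {
		panic(err)
	}
	idx.Add(TrainingRuntime{Name: "torch", RuntimeClassNames: []string{"gvisor"}})
	fmt.Println(idx.Lookup(TrainingRuntimeContainerRuntimeClassKey, "gvisor"))
}
```

Since only the extractor touches the object's fields, a descriptive path-like key is a readability choice rather than a requirement.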
We should be ready to merge this.
@andreyvelich: GitHub didn't allow me to assign the following users: kannon92. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
I don't really have context to review this at the moment.
I'll leave that to kubeflow members.
Thanks for the review!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich
/hold cancel
What this PR does / why we need it:
Brief Design: https://docs.google.com/presentation/d/1HyEsBa7hxWpIoBXaX6uECiB48FWB85SG1kx15mO8hug/edit#slide=id.g30596bfee76_0_202
I implemented the runtime framework interfaces.
The responsibilities are the following:
/runtime.v2/core: This contains the actual Kubeflow Job Pipelines like TrainingRuntime (not the CRD), which is an internal concept.
These pipelines build objects or create reconcile builders. We will add more pipelines in the future, like SingleHostTrainingRuntime.
/runtime.v2/framework: This contains the Kubeflow Job Pipeline Framework, which has some extension points; we will add more extension points in the future.
/runtime.v2/framework/plugins: This contains the Kubeflow Job Pipeline Framework plugins, which implement the Framework extension points. Each plugin is executed at the corresponding extension point of the Kubeflow Job Pipeline Framework.
Additionally, I did not implement all plugins. So, I will open an issue and delegate plugin implementation to contributors who are interested in this project.
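The split between framework, extension points, and plugins can be sketched roughly as follows. All names in this block (`Plugin`, `EnforceMLPolicyPlugin`, `Framework`, `plainML`) are hypothetical and chosen only to illustrate the layering; they are not taken from the PR's code.

```go
package main

import "fmt"

// Info is the object passed through the pipeline (fields are hypothetical).
type Info struct {
	NumNodes int32
}

// Plugin is the base interface every plugin satisfies.
type Plugin interface {
	Name() string
}

// EnforceMLPolicyPlugin is one hypothetical extension point.
type EnforceMLPolicyPlugin interface {
	Plugin
	EnforceMLPolicy(info *Info) error
}

// Framework groups registered plugins by the extension points they implement.
type Framework struct {
	enforceMLPolicy []EnforceMLPolicyPlugin
}

// New sorts plugins into extension points using interface assertions.
func New(plugins ...Plugin) *Framework {
	f := &Framework{}
	for _, p := range plugins {
		if ep, ok := p.(EnforceMLPolicyPlugin); ok {
			f.enforceMLPolicy = append(f.enforceMLPolicy, ep)
		}
	}
	return f
}

// RunEnforceMLPolicyPlugins calls each plugin registered at this extension
// point, in order, and stops at the first error.
func (f *Framework) RunEnforceMLPolicyPlugins(info *Info) error {
	for _, p := range f.enforceMLPolicy {
		if err := p.EnforceMLPolicy(info); err != nil {
			return fmt.Errorf("plugin %s: %w", p.Name(), err)
		}
	}
	return nil
}

// plainML is a toy plugin that sets the node count.
type plainML struct{ numNodes int32 }

func (plainML) Name() string { return "PlainML" }

func (p plainML) EnforceMLPolicy(info *Info) error {
	info.NumNodes = p.numNodes
	return nil
}

func main() {
	fw := New(plainML{numNodes: 2})
	info := &Info{}
	if err := fw.RunEnforceMLPolicyPlugins(info); err != nil {
		panic(err)
	}
	fmt.Println(info.NumNodes)
}
```

With this layout, adding a new extension point means adding one interface and one `Run...` method to the framework, while plugins opt in simply by implementing the interface.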
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Part-of #2290
Checklist: