Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Implement runtime framework #2248

Merged

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Sep 4, 2024

What this PR does / why we need it:
Brief Design: https://docs.google.com/presentation/d/1HyEsBa7hxWpIoBXaX6uECiB48FWB85SG1kx15mO8hug/edit#slide=id.g30596bfee76_0_202

I implemented the runtime framework interfaces.
The responsibilities are the following:

  • /runtime.v2/core: This contains the actual Kubeflow Job Pipeline like TrainigRuntime (not CRD), which is an internal concept.
    These pipelines build objects or create reconcile builders. We will add some pipelines in the future like SingleHostTrainingRuntime.

  • /runtime.v2/framework: This contains the Kubeflow Job Pipeline Framework, which has some extension points in the following, and we will add some extension points in the future.

    • WatchExtensionPlugin
    • EnforcePodGroupPolicyPlugin
    • EnforceMLPolicyPlugin
    • CustomValidationPlugin
    • ComponentBuilderPlugin
  • /runtime.v2/framework/plugins: This contains the Kubeflow Job Pipeline Framework plugins, which implement the Framework extension points. Each of these plugins is performed in Kubeflow Job Pipeline Framework extension points.

    • coscheduleing
    • jobset (Under development)
    • mpi
    • plainml
    • torch

Additionally, I did not implement all plugins. So, I will open an issue and delegate plugin implementation contributors who are interested in this project.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Part-of #2290

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Sep 6, 2024

Pull Request Test Coverage Report for Build 11372731768

Details

  • 9 of 9 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11330381194: 0.0%
Covered Lines: 73
Relevant Lines: 73

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the second-implementation-for-traininig-v2 branch 18 times, most recently from d220851 to caa8564 Compare September 10, 2024 17:40
Makefile Outdated Show resolved Hide resolved
sigs.k8s.io/controller-runtime v0.17.3
sigs.k8s.io/jobset v0.5.2
sigs.k8s.io/kueue v0.6.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need kueue dependency ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency came from

PodRequests: kueuelr.TotalRequests(&spec.podSpec),
.

This allows us to set the appropriate required resources for PodGroup. If we remove this dependency, we need to just copy Kueue's "TotalRequests" function here. I believe that just coping and pasting is not the ideal way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to reduce the dependency. Maybe copy and paste is okay in this case as long as we provide a reference to the original source

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not simple implementation. That's so complex, multiple files and lines codes.
So, I would propose keeping it here and then (after kube 1.32) switching to the kube library as I mentioned in #2280.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we discussed with @tenzen-y offline that after k/k separates this utility function, we will remove dependency on Kueue.

@tenzen-y tenzen-y force-pushed the second-implementation-for-traininig-v2 branch from fd7aab4 to 2aaae2b Compare October 14, 2024 23:46
Comment on lines +106 to +109
options := defaultOptions
for _, opt := range opts {
opt(&options)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need default options for Info object ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
options := defaultOptions
for _, opt := range opts {
opt(&options)
}
options := InfoOptions{}
for _, opt := range opts {
opt(&options)
}

Do you recommend this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, I am trying to understand how are your planning to use InfoOptions in other parts ?
@tenzen-y What are the differences between Info{} and InfoOptions{} ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The infoOptions is object to set up the Info object.
This approach allows us to dynamically specify the parameters to the Info.

When we get rid of the InfoOptions, we need to specify all parameters everytime or need to directly pass the Info object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is well-known to avid the following function:

// Even if all parameters is not used, all parameters should be specified.
func Foo(paramA string, paramB int, paramC int32, paramD bool, paramE int64)
// After introduced `infoOptions`.
func Foo(params ...InfoOption)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so maybe let's name it as InfoOption, not defaultOptions to make it clearer, and since we are not going to have default values for Info object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not going to have default values for Info object.

Actually, this is the default info Option. Here, this means that the default is an empty struct.
So, the below indicates to initialize options as a default values (currently default has empty fields)

options := defaultOptions

Copy link
Member Author

@tenzen-y tenzen-y Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, we need to consider if we should the default parameters to infoOption. For example common labels and annotations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense.

Comment on lines +48 to +53
for rName := range info.TotalRequests {
info.TotalRequests[rName] = runtime.TotalResourceRequest{
Replicas: numNodes,
PodRequests: info.TotalRequests[rName].PodRequests,
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we using this while enforcing the MLPolicy ?

Copy link
Member Author

@tenzen-y tenzen-y Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~~ if info == nil || info.MLPolicy != nil { ~~

I wanted to implement this in line 43. Maybe I failed to reabase.
Let me fix this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVM above comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spec:
  mlPolicy:
    numNodes: 1

We can imagine this situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so we override the value that we set here, right ?

for _, spec := range options.podSpecReplicas {
info.TotalRequests[spec.name] = TotalResourceRequest{
Replicas: spec.replicas,
// TODO: Need to address LimitRange and RuntimeClass.
PodRequests: kueuelr.TotalRequests(&spec.podSpec),
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this plainml codes try to update with the proper one.

)

var (
TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the indexer be equal to the Golang struct or json name ?
E.g. the trainingRuntimeSpec is named as .spec

Spec TrainingRuntimeSpec `json:"spec,omitempty"`

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can specify the arbitrary key name. But the key is global within the training-ooerator.

@andreyvelich
Copy link
Member

We should be ready to merge this.
Thank you for this great work @tenzen-y!
/lgtm
/assign @terrytangyuan @johnugeorge @kannon92

Copy link

@andreyvelich: GitHub didn't allow me to assign the following users: kannon92.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

We should be ready to merge this.
Thank you for this great work @tenzen-y!
/lgtm
/assign @terrytangyuan @johnugeorge @kannon92

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have context to review this at the moment.

I'll leave that to kubeflow members.

@tenzen-y
Copy link
Member Author

We should be ready to merge this. Thank you for this great work @tenzen-y! /lgtm /assign @terrytangyuan @johnugeorge @kannon92

Thanks for the review!

@andreyvelich
Copy link
Member

/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johnugeorge
Copy link
Member

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 62a058d into kubeflow:master Oct 17, 2024
40 checks passed
@tenzen-y tenzen-y deleted the second-implementation-for-traininig-v2 branch October 17, 2024 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants