
[WIP] fix(backend): implement subdag output resolution #11196

Open · wants to merge 6 commits into base: master

Conversation

droctothorpe
Contributor

Description of your changes:
This is a WIP PR intended to fix #10039. Additional functionality, tests, and a more detailed PR description to follow.

Checklist:


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: droctothorpe <[email protected]>
Co-authored-by: zazulam <[email protected]>
Co-authored-by: CarterFendley <[email protected]>
@@ -125,6 +126,8 @@ func RootDAG(ctx context.Context, opts Options, mlmd *metadata.Client) (executio
            err = fmt.Errorf("driver.RootDAG(%s) failed: %w", opts.info(), err)
        }
    }()
    b, _ := json.Marshal(opts)
    glog.V(4).Info("RootDAG opts: ", string(b))
Contributor Author

We added a lot of debug-level logs to make debugging issues like this easier for the next person. We still need to add handling in the backend to support toggling the driver's level-4 (V(4)) logs on and off.
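
For reference, a minimal sketch of what that toggle could look like, assuming the driver keeps using glog on top of the standard flag package; the DRIVER_DEBUG environment variable is purely hypothetical:

package main

import (
    "flag"
    "os"

    "github.com/golang/glog"
)

func main() {
    // glog registers -v, -logtostderr, etc. on the default FlagSet.
    flag.Parse()

    // Hypothetical toggle: bump verbosity to 4 when DRIVER_DEBUG is set, so
    // the new debug logs can be enabled without changing the container args.
    if os.Getenv("DRIVER_DEBUG") == "true" {
        flag.Set("v", "4")
    }

    if glog.V(4) {
        glog.Info("level-4 driver logging enabled")
    }
    glog.Flush()
}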

@droctothorpe
Contributor Author

Happy to jump on a call if synchronous questions or feedback would be easier. Although concise, these changes are quite convoluted.

Signed-off-by: droctothorpe <[email protected]>
Co-authored-by: zazulam <[email protected]>
Co-authored-by: CarterFendley <[email protected]>
@droctothorpe
Contributor Author

We just pushed up a commit that implements support for multiple layers of nested subdags (i.e. subdags of subdags). We validated that it behaves as expected with the following example code:

from kfp import dsl
from kfp.client import Client

@dsl.component
def small_comp() -> str:
    return "privet"

@dsl.component
def large_comp(input: str):
    print("input :", input)


@dsl.pipeline
def small_matroushka_doll() -> str:
    task = small_comp()
    task.set_caching_options(False)
    return task.output

@dsl.pipeline
def medium_matroushka_doll() -> str:
    dag_task = small_matroushka_doll()
    dag_task.set_caching_options(False)
    return dag_task.output

@dsl.pipeline
def large_matroushka_doll():
    dag_task = medium_matroushka_doll()
    task = large_comp(input=dag_task.output)
    task.set_caching_options(False)
    dag_task.set_caching_options(False)


if __name__ == "__main__":
    client = Client()

    run = client.create_run_from_pipeline_func(
        pipeline_func=large_matroushka_doll,
        enable_caching=False,
    )

PS. I hate matroushka dolls, they're so full of themselves.

@droctothorpe
Contributor Author

droctothorpe commented Sep 12, 2024

So this PR handles subdag output parameters but not subdag output artifacts. We're going to add some logic to handle the latter as well, since the problems are similar.

zazulam and others added 2 commits September 17, 2024 09:58
Signed-off-by: zazulam <[email protected]>
Co-authored-by: droctothorpe <[email protected]>
Signed-off-by: droctothorpe <[email protected]>
Co-authored-by: zazulam <[email protected]>
Co-authored-by: CarterFendley <[email protected]>
Co-authored-by: edmondop <[email protected]>
@droctothorpe
Contributor Author

We just added and validated support for output artifacts as well, which addresses #10041. Here's a screenshot from a pipeline with nested DAGs and output artifacts that executed successfully:

[screenshot: successful run of the nested-DAG pipeline with output artifacts]

Here's the example code:

from kfp import dsl
from kfp.client import Client
from kfp.compiler import Compiler

@dsl.component
def inner_comp(dataset: dsl.Output[dsl.Dataset]):
    with open(dataset.path, "w") as f:
        f.write("foobar")


@dsl.component
def outer_comp(input: dsl.Dataset):
    print("input: ", input)


@dsl.pipeline
def inner_pipeline() -> dsl.Dataset:
    inner_comp_task = inner_comp()
    inner_comp_task.set_caching_options(False)
    return inner_comp_task.output
    
@dsl.pipeline
def outer_pipeline():
    inner_pipeline_task = inner_pipeline()
    outer_comp_task = outer_comp(input=inner_pipeline_task.output)
    outer_comp_task.set_caching_options(False)


if __name__ == "__main__":
    # Compiler().compile(outer_pipeline, "ignore/subdag_artifacts.yaml")
    client = Client()

    run = client.create_run_from_pipeline_func(
        pipeline_func=outer_pipeline,
        enable_caching=False,
    )

There's still a lot more work to be done in terms of testing, decomposition, making the code more consistent and DRY, etc., but it works and it did not work before, so hooray for progress.

Signed-off-by: droctothorpe <[email protected]>
Co-authored-by: zazulam <[email protected]>
Signed-off-by: droctothorpe <[email protected]>
@droctothorpe
Contributor Author

Just pushed up a commit that decomposes the graph traversal logic to improve readability, reduce complexity, and make granular testing easier. The next order of business is multiple outputs and NamedTuples.
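
For illustration, a rough sketch of what one decomposed traversal step can look like; the helper name, signature, and elided recursion below are assumptions, not the PR's actual code:

// collectDAGTasks adds one DAG's executions to a flattened, task-name-keyed
// map. Nested sub-DAGs are walked the same way, which matches the note below
// that the number of GetExecutionsInDAG calls equals the number of sub-DAGs.
func collectDAGTasks(
    ctx context.Context,
    mlmd *metadata.Client,
    dag *metadata.DAG,
    pipeline *metadata.Pipeline,
    flattenedTasks map[string]*metadata.Execution,
) error {
    currentExecutionTasks, err := mlmd.GetExecutionsInDAG(ctx, dag, pipeline)
    if err != nil {
        return err
    }
    for name, task := range currentExecutionTasks {
        flattenedTasks[name] = task
    }
    // Recursing into sub-DAG executions is elided here for brevity.
    return nil
}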

Contributor

@HumairAK left a comment

Hey folks, love that you are doing this, amazing stuff!!

I just had a skim and left some quick thoughts. I see that it's still WIP, so apologies if the comments are premature. I haven't had a chance to try it out yet.

The approach does make sense to me. Given that we are just writing spec data as execution properties, I think it makes sense to do it in the driver, since we already have this info at pipeline creation.

if flattenedTasks == nil {
    flattenedTasks = make(map[string]*metadata.Execution)
}
currentExecutionTasks, err := mlmd.GetExecutionsInDAG(ctx, dag, pipeline)
Contributor

Question: have you considered just getting all the executions for the context instead of doing a DFS filter here?

For example, GetExecutionsInDAG() is simply doing a get-executions call for the context, but with a filter. Without the filter it should simply return all the executions for that particular context, so we wouldn't need to do multiple DB queries.

Task names, IIRC, should be unique. I suppose the only concern here would be a pipeline that is really large and has a lot of task executions, but I would think it would have to be unrealistically large for that to be an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 Right now, the number of calls is equal to the number of nested sub-DAGs. If the call sans filter gets all executions AND executions in sub-DAGs, that could definitely reduce the number of database queries. We'll test it out. Thanks for the suggestion!
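
For comparison, a rough sketch of the single-query variant being discussed; GetAllExecutionsInContext is a hypothetical stand-in for an unfiltered call on the metadata client, not an existing method:

// Hypothetical single-query variant: fetch every execution in the run's
// context once and index by task name, instead of one GetExecutionsInDAG
// call per nested sub-DAG.
func flattenTasksSingleQuery(
    ctx context.Context,
    mlmd *metadata.Client,
    pipeline *metadata.Pipeline,
) (map[string]*metadata.Execution, error) {
    executions, err := mlmd.GetAllExecutionsInContext(ctx, pipeline) // assumed helper
    if err != nil {
        return nil, err
    }
    flattened := make(map[string]*metadata.Execution, len(executions))
    for _, e := range executions {
        // Task names are assumed unique within a run, per the discussion above.
        flattened[e.TaskName()] = e
    }
    return flattened, nil
}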

Comment on lines +1293 to +1303
json.Unmarshal(b, &outputParametersMap)
glog.V(4).Info("Deserialized outputParametersMap: ", outputParametersMap)
subTaskName := outputParametersMap["producer_subtask"]
glog.V(4).Infof(
    "Overriding currentTask, %v, output with currentTask's producer_subtask, %v, output.",
    currentTask.TaskName(),
    subTaskName,
)

// Reassign sub-task before running through the loop again.
currentTask = tasks[subTaskName]
Contributor

Might need to handle some of these potential error cases (see the sketch below), for example:

  • the producer_subtask key is not in outputParametersMap
  • subTaskName is not in tasks
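
A hedged sketch of what those guards could look like around the quoted loop body; the variable names follow the snippet above, while the enclosing function's return signature and error messages are assumptions:

subTaskName, ok := outputParametersMap["producer_subtask"]
if !ok {
    return nil, fmt.Errorf(
        "task %q has no producer_subtask key in its output parameters",
        currentTask.TaskName(),
    )
}
nextTask, ok := tasks[subTaskName]
if !ok {
    return nil, fmt.Errorf(
        "producer subtask %q of task %q was not found in the flattened task map",
        subTaskName, currentTask.TaskName(),
    )
}
// Reassign the sub-task before running through the loop again.
currentTask = nextTask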

Comment on lines +764 to +774
outputParameterKey := value.GetValueFromParameter().OutputParameterKey
producerSubTask := value.GetValueFromParameter().ProducerSubtask
glog.V(4).Info("outputParameterKey: ", outputParameterKey)
glog.V(4).Info("producerSubtask: ", producerSubTask)

outputParameterMap := map[string]interface{}{
    "output_parameter_key": outputParameterKey,
    "producer_subtask":     producerSubTask,
}

outputParameterStruct, _ := structpb.NewValue(outputParameterMap)
Contributor

How come you didn't use a DagOutputParameterSpec here like you did for DagOutputArtifactSpec?

Contributor Author

The logic we added to the mlmd client handles converting the artifact struct into a format suitable for storage in the database. Our reasoning was that the conversion is an implementation detail that isn't really relevant to the driver. The same reasoning holds for the output parameters, but we never refactored the code to apply that principle there. We'll move that logic out of driver.go and into client.go. Good call!
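
A minimal sketch of that refactor, assuming a small helper on the metadata-client side owns the storage format; the function name and placement are hypothetical:

// Hypothetical helper in client.go: the driver passes the raw spec fields and
// the metadata client decides how to serialize them, mirroring what the PR
// already does for DAG output artifacts.
// Requires google.golang.org/protobuf/types/known/structpb.
func dagOutputParameterValue(outputParameterKey, producerSubtask string) (*structpb.Value, error) {
    return structpb.NewValue(map[string]interface{}{
        "output_parameter_key": outputParameterKey,
        "producer_subtask":     producerSubtask,
    })
}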

Comment on lines +589 to +602
if config.OutputParameters != nil {
    e.CustomProperties[keyOutputs] = &pb.Value{Value: &pb.Value_StructValue{
        StructValue: &structpb.Struct{
            Fields: config.OutputParameters,
        },
    }}
}
if config.OutputArtifacts != nil {
    b, err := json.Marshal(config.OutputArtifacts)
    if err != nil {
        return nil, err
    }
    e.CustomProperties[keyOutputArtifacts] = StringValue(string(b))
}
Contributor

@HumairAK Sep 19, 2024

One thing I'm thinking about is how we distinguish between outputs for container executions vs. DAG executions. For parameters in container executions, the outputs map to the actual resolved values, but for DAGs we're simply storing reference values to the producer tasks.

Similarly for artifacts: for container executions we store the artifact object-store metadata (pulled from artifact properties), but for DAG executions we are, again, storing the output producer spec data, right?

So I'm wondering if it makes sense to have a separate key entirely to distinguish these types. I don't have a concrete suggestion, but maybe something like parameter_producer_tasks and artifact_producer_tasks 🤷🏾‍♂️

I'm not sure, but at the moment, if we use outputs then it will probably show up on the UI's executions page under the DAG but show something different than it does for container executions.
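
For illustration, a sketch of that separation using the key names floated above; these keys are only a suggestion from this thread, not an agreed-upon schema:

// Hypothetical dedicated keys so DAG producer references never collide with
// the resolved outputs stored for container executions.
const (
    keyParameterProducerTasks = "parameter_producer_tasks"
    keyArtifactProducerTasks  = "artifact_producer_tasks"
)

// For DAG executions, store only the producer-task references under the
// dedicated keys, leaving keyOutputs / keyOutputArtifacts for the resolved
// values of container executions.
if config.OutputParameters != nil {
    e.CustomProperties[keyParameterProducerTasks] = &pb.Value{Value: &pb.Value_StructValue{
        StructValue: &structpb.Struct{Fields: config.OutputParameters},
    }}
}
if config.OutputArtifacts != nil {
    b, err := json.Marshal(config.OutputArtifacts)
    if err != nil {
        return nil, err
    }
    e.CustomProperties[keyArtifactProducerTasks] = StringValue(string(b))
}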

Comment on lines +776 to +778
ecfg.OutputParameters = map[string]*structpb.Value{
    value.GetValueFromParameter().OutputParameterKey: outputParameterStruct,
}
Contributor

It looks like you are rewriting ecfg.OutputParameters with a new map every iteration. Did you mean to update the existing map instead?

Contributor Author

You're absolutely right! We would likely have hit a wall because of this when we started testing multiple outputs.

Really appreciate the time and effort you took to grok some not particularly grokkable code. Thank you for the feedback, @HumairAK!
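
For reference, a minimal sketch of the fix, assuming this assignment runs inside a loop over the DAG's output definitions; initializing the map once and adding an entry per iteration preserves earlier outputs:

// Initialize once, before looping over the DAG's outputs.
if ecfg.OutputParameters == nil {
    ecfg.OutputParameters = make(map[string]*structpb.Value)
}
// Inside the loop: add to the map instead of replacing it, so multiple
// outputs survive.
ecfg.OutputParameters[value.GetValueFromParameter().OutputParameterKey] = outputParameterStruct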

Development

Successfully merging this pull request may close these issues.

[bug] Nested pipelines fail to run
3 participants