
Define node types #371

Closed · r-c-n opened this issue Oct 3, 2023 · 13 comments

r-c-n commented Oct 3, 2023

Related to #370 and maybe to #204 as well

It's not clear how to distinguish different node types. For instance, how to tell if a node is a test result?

Example:
from this node:

$ kci show results 651162e28326c545a7810637
Node
  path      checkout
  id        651162e28326c545a7810637
  parent    ----
  owner     nuclearcat
  created   2023-09-25 at 10:37:22
  state     done

Kernel
  tree      kernelci
  branch    staging-mainline
  commit    4abc3c0a156b6aabb774334652fbfd505550b525
  describe  staging-mainline-20230925.0

...

Results
checkout                                                        ----  651162e28326c545a7810637
  kbuild-gcc-10-x86                                             pass  651163a68326c545a7810638
    config                                                      pass  651164b18326c545a781078b
    kernel                                                      pass  651164b18326c545a781078c
    modules                                                     pass  651164b18326c545a781078d
    modules_install                                             pass  651164b18326c545a781078e
    modules_tarball                                             pass  651164b18326c545a781078f
    baseline-x86                                                ----  651164b18326c545a7810790
    baseline-x86                                                ----  651164b28326c545a7810791
  kbuild-gcc-10-x86                                             ----  651163a78326c545a7810639
  kver                                                          ----  651163a78326c545a781063a
  kunit-x86_64                                                  ----  651163a78326c545a781063b
  kunit-x86_64                                                  pass  651163a78326c545a781063c
    config                                                      pass  651164268326c545a781063e
    build                                                       pass  651164268326c545a781063f
    exec                                                        pass  651164268326c545a7810640
      time_test_cases                                           ----  651164268326c545a7810641
        time64_to_tm_test_date_range                            pass  651164268326c545a7810642
      hw_breakpoint                                             ----  651164268326c545a7810643
        test_one_cpu                                            ----  651164268326c545a7810644
        test_many_cpus                                          ----  651164268326c545a7810645
        test_one_task_on_all_cpus                               ----  651164268326c545a7810646
        test_two_tasks_on_all_cpus                              ----  651164268326c545a7810647
        test_one_task_on_one_cpu                                ----  651164268326c545a7810648
        test_one_task_mixed                                     ----  651164268326c545a7810649
        test_two_tasks_on_one_cpu                               ----  651164268326c545a781064a
        test_two_tasks_on_one_all_cpus                          ----  651164268326c545a781064b
        test_task_on_all_and_one_cpu                            ----  651164268326c545a781064c
      resource                                                  ----  651164268326c545a781064d
        resource_test_union                                     pass  651164268326c545a781064e
        resource_test_intersection                              pass  651164268326c545a781064f
      sysctl_test                                               ----  651164268326c545a7810650
        sysctl_test_api_dointvec_null_tbl_data                  pass  651164268326c545a7810651
        sysctl_test_api_dointvec_table_maxlen_unset             pass  651164268326c545a7810652

Let's see what a test result looks like (sysctl_test_api_dointvec_null_tbl_data):

{
  "id": "651164268326c545a7810651",
  "kind": "node",
  "name": "sysctl_test_api_dointvec_null_tbl_data",
  "path": [
    "checkout",
    "kunit-x86_64",
    "exec",
    "sysctl_test",
    "sysctl_test_api_dointvec_null_tbl_data"
  ],
  "group": "kunit-x86_64",
  "revision": {
    "tree": "kernelci",
    "url": "https://github.com/kernelci/linux.git",
    "branch": "staging-mainline",
    "commit": "4abc3c0a156b6aabb774334652fbfd505550b525",
    "describe": "staging-mainline-20230925.0",
    "version": {
      "version": 6,
      "patchlevel": 6,
      "sublevel": null,
      "extra": "-rc3-1-g4abc3c0a156b",
      "name": null
    }
  },
  "parent": "651164268326c545a7810650",
  "state": "done",
  "result": "pass",
  "artifacts": null,
  "data": null,
  "created": "2023-09-25T10:42:46.138000",
  "updated": "2023-09-25T10:42:46.138000",
  "timeout": "2023-09-25T16:42:46.138000",
  "holdoff": null,
  "owner": "kci",
  "user_groups": []
}

And then let's look at its parent (sysctl_test), which is a node that groups many test cases together:

{
  "id": "651164268326c545a7810650",
  "kind": "node",
  "name": "sysctl_test",
  "path": [
    "checkout",
    "kunit-x86_64",
    "exec",
    "sysctl_test"
  ],
  "group": "kunit-x86_64",
  "revision": {
    "tree": "kernelci",
    "url": "https://github.com/kernelci/linux.git",
    "branch": "staging-mainline",
    "commit": "4abc3c0a156b6aabb774334652fbfd505550b525",
    "describe": "staging-mainline-20230925.0",
    "version": {
      "version": 6,
      "patchlevel": 6,
      "sublevel": null,
      "extra": "-rc3-1-g4abc3c0a156b",
      "name": null
    }
  },
  "parent": "651164268326c545a7810640",
  "state": "done",
  "result": null,
  "artifacts": null,
  "data": null,
  "created": "2023-09-25T10:42:46.138000",
  "updated": "2023-09-25T10:42:46.138000",
  "timeout": "2023-09-25T16:42:46.138000",
  "holdoff": null,
  "owner": "kci",
  "user_groups": []
}

There's nothing in the node structure or in the values that lets me tell which one of these is a test result and which one is the parent of a test result. The "result" field differs, but I found other leaf test nodes with a null result as well. For instance, test_task_on_all_and_one_cpu (651164268326c545a781064c):

{
  "id": "651164268326c545a781064c",
  "kind": "node",
  "name": "test_task_on_all_and_one_cpu",
  "path": [
    "checkout",
    "kunit-x86_64",
    "exec",
    "hw_breakpoint",
    "test_task_on_all_and_one_cpu"
  ],
  "group": "kunit-x86_64",
  "revision": {
    "tree": "kernelci",
    "url": "https://github.com/kernelci/linux.git",
    "branch": "staging-mainline",
    "commit": "4abc3c0a156b6aabb774334652fbfd505550b525",
    "describe": "staging-mainline-20230925.0",
    "version": {
      "version": 6,
      "patchlevel": 6,
      "sublevel": null,
      "extra": "-rc3-1-g4abc3c0a156b",
      "name": null
    }
  },
  "parent": "651164268326c545a7810643",
  "state": "done",
  "result": null,
  "artifacts": null,
  "data": null,
  "created": "2023-09-25T10:42:46.138000",
  "updated": "2023-09-25T10:42:46.138000",
  "timeout": "2023-09-25T16:42:46.138000",
  "holdoff": null,
  "owner": "kci",
  "user_groups": []
}

gctucker commented Oct 4, 2023

Why is this a problem?

One thing that I believe needs to be better defined is how to set the result for parent nodes. Some CI systems set the parent result to "fail" if any of their underlying test cases failed. Arguably, we could instead not set a result at all if the node isn't testing anything, or set "pass" to show that it ran successfully. Then an API query can be used to find all the individual test cases that failed. The group attribute (which might be renamed to something like job) can be used to help with this kind of use case; it's already used by the primitive email report implementation we have at the moment.


r-c-n commented Oct 5, 2023

Why is this a problem?

Because it limits the kinds of queries a client can perform programmatically. Examples:

  • How can I query for kernel builds specifically?
  • How to search for test suites?
  • How to search for specific test runs?
  • How to tell if the parent node of a node is a test suite, a kernel build node or any other thing?

Maybe there's already a well-defined way to search for these kinds of nodes, based on their structure or on other attributes, but it's not clear how.

I understand that, since different node types may have different schemas (maybe I'm wrong about this), not all operations might make sense on all types of nodes. Without a node "type", how can a client know how to operate on a certain node?


gctucker commented Oct 5, 2023

How can I query for kernel builds specifically?

kci node find name=kbuild

How to search for test suites?

kci node find parent=<build node id>

How to tell if the parent node of a node is a test suite, a kernel build node or any other thing?

Based on the node name and kind.

I understand that, since different node types may have different schemas (maybe I'm wrong about this), not all operations might make sense on all types of nodes.

At the moment, all nodes follow the same schema. Only the .data field holds arbitrary fields, and even then you can run queries on those fields and they'll produce results for the nodes that have them.

Without a node "type", how can a client know how to operate on a certain node?

We discussed this in the meeting today, actually. Eventually we should probably have a Pydantic model (or jsonschema...) for the payload in the .data structure. That would provide validation and objects rather than plain dictionaries loaded from raw JSON. Until then, the pipeline knows which fields it creates, so it knows which fields to get back.
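
As a rough illustration of what that could look like (the sub-model and its fields here are purely hypothetical, just to show the mechanics):

# Hypothetical sketch only: a Pydantic sub-model for a node's .data payload
# gives validation and attribute access instead of a raw dict loaded from JSON.
from typing import Optional
from pydantic import BaseModel


class ExampleData(BaseModel):
    """Made-up payload schema, standing in for a future per-kind model."""
    runtime: Optional[float] = None
    log_url: Optional[str] = None


node = {"kind": "node", "data": {"runtime": 1.5, "log_url": "https://example.org/log"}}

payload = ExampleData(**node["data"])  # raises ValidationError if the payload is malformed
print(payload.runtime)                 # attribute access instead of node["data"]["runtime"]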


r-c-n commented Oct 5, 2023

kci node find name=kbuild

But the node name isn't kbuild. If I run this I get nothing, and the nodes endpoint doesn't seem to accept wildcards or regexes to specify something like kbuild* or kbuild.*.

kci node find parent=<build node id>

But test nodes don't hang from build nodes (see #373 (comment)), so we'd have to ascend to the root (checkout) and then find the right kbuild node (how?)

Based on the node name and kind

But how is kind defined? That's the initial question of this issue.


gctucker commented Oct 5, 2023

Yes, the node name needs to be changed to kbuild, or something needs to be done to make this actually work. Basically, it's not finished; it's part of the job configuration roadmap issue, which is now late on the roadmap and is what I'm going to focus on next week. A lot of things relate to this. Let's sync up at some point in the coming days to make sure we all work towards the same goals and share the work where we can. Next Thursday's open hours would probably be a good time to catch up on anything we haven't been able to cover by other means.

@JenySadadia commented:

But how is kind defined? That's the initial question of this issue.

At the moment, kind can have two values: regression nodes have Node.kind=regression and all other nodes have Node.kind=node. Yes, we need to extend the set of values for kind to make nodes more specific.

But the node name isn't kbuild. If I run this I get nothing, and the nodes endpoint doesn't seem to accept wildcards or regexes to specify something like kbuild* or kbuild.*.

Until we have that in place, you can look at the Node.group field. Nodes associated with a particular job are grouped under the same job name, e.g. kunit, kver, kbuild-gcc-10-x86.


r-c-n commented Oct 9, 2023

We discussed this in the meeting today, actually. Eventually we should probably have a Pydantic model (or jsonschema...) for the payload in the .data structure. That would provide validation and objects rather than plain dictionaries loaded from raw JSON.

Continuing this discussion, I've been thinking about how to proceed, and I'd like to start by defining certain types of node "patterns" that can cover the most immediate use cases and then grow from there.

As I understand it, there are two approaches to this:

1. Differentiate node types in the API (node.kind)

  • Implies changes in api/models.py, api/main.py and api/db.py at least.
  • Operations on different node types require and return different "object" types.
  • Implies possible design changes in DB management: what's the long-term impact of this? complexity, performance considerations, etc.
  • New node types mean more changes to maintain all over the API

2. Differentiate node types in the client side (client-defined schema / duck typing)

  • One single node type in the API
  • Node differences are defined by other attributes and handled on the client side. The API and DB don't make any distinctions based on them
  • Pros: simplicity of design, flexibility, homogeneity
  • Cons: lack of a formal definition across clients, performance (possibly?)

About these cons:

  • Lack of a formal definition across clients. Solution: I guess this is what you proposed with a validation schema for the payload in node.data.
  • Performance: will potentially arbitrary node queries (based on specific node.data schemas) be as performant as queries based on known DB and API models?

If we can address these cons, I think this is a better option than defining the node types on the API side. But there's already a Regression node type defined, so I wonder if the design is already committed to option 1.
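
To make option 2 concrete, here's a rough sketch of the kind of client-side duck typing that's possible today, using only existing attributes; the heuristics are made up for illustration and would need to be agreed across clients, which is exactly the formal-definition problem mentioned above:

# Sketch of option 2: the client guesses the node "type" from existing
# attributes. The rules below are invented and fragile, since nothing in
# the schema guarantees them.
def classify_node(node: dict) -> str:
    name = node.get("name", "")
    path = node.get("path") or []
    if name == "checkout":
        return "checkout"
    if name.startswith("kbuild"):
        return "build"
    if len(path) > 2:
        # nodes nested below a job tend to be test suites or test cases,
        # but there's no attribute that states this explicitly
        return "test"
    return "unknown"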


gctucker commented Oct 9, 2023

Right, I think the actual ideal solution is pretty well known. It hasn't been implemented due to various other distractions and priorities. So here's what I would propose:

First let's take a look at the node fields (should also answer some of @yurinnick's questions):

  • node.name is arbitrary and relates to things such as the test name etc. It's used in the path with parent node names like baseline.dmesg.emerg and appears when reporting results.
  • node.kind matches the schema used for node.data. The base objects have a kind set to node and then node.data is entirely arbitrary.
  • node.data is, as explained above, the "payload" of the node. Some sub-models can be defined to validate it based on the kind of node.

Some thoughts about how to get this in place:

Node kinds aren't well defined yet, which is why it's too early to enable validation at the API level; the pipeline can work without it. The checkout nodes don't seem to require any data beyond the kernel revision, which is part of the Node model itself, and the source tarball artifact, which is also stored directly in Node.artifacts.

It would seem the first one to add would be kbuild, which is something I started doing a while ago, but we can entirely determine how that would work without API validation. Then, once the sub-model for node.data for kbuild nodes has stabilised, we can add support for validating it at the API level: when submitting a node to the API, if node.kind is set to kbuild then the content of node.data will need to match the model. When we start doing this, we would probably also only accept node and kbuild as values for kind - if some client wants to send arbitrary data, they can keep using the base node kind.
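
A minimal sketch of that rule, assuming a hypothetical KbuildData sub-model (the actual kbuild fields are still to be defined):

# Sketch only: accept just the "node" and "kbuild" kinds, and require the
# data payload of kbuild nodes to match a (hypothetical) KbuildData model.
from typing import Optional
from pydantic import BaseModel


class KbuildData(BaseModel):
    """Illustrative fields; the real sub-model is yet to be defined."""
    arch: Optional[str] = None
    compiler: Optional[str] = None
    defconfig: Optional[str] = None


ACCEPTED_KINDS = {"node", "kbuild"}


def check_submitted_node(node: dict):
    kind = node.get("kind", "node")
    if kind not in ACCEPTED_KINDS:
        raise ValueError(f"unsupported node kind: {kind}")
    if kind == "kbuild":
        # Raises ValidationError if node.data doesn't match the model;
        # the base "node" kind keeps accepting arbitrary data.
        KbuildData(**(node.get("data") or {}))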

Then it would be very useful to have these models shared between client and server implementations. Right now, there's nothing to convert node data to objects or anything like that outside of kernelci-api. So kernelci-core only deals with raw dictionaries, which is fine for early experiments but not great for users and robustness.

In order to implement this, I would propose moving the API data models to kernelci-core and importing the kernelci.api package in kernelci-api. I believe this is better than having the kernelci Python package depend on the kernelci-api package because, in principle, other independent API implementations could be created using the same models. The API implementation may internally use different models for the database (as we've seen with users, this may actually be required in practice), but API users don't have to care about that.

So to get there, we need to:

  • complete kbuild job support and define the sub-model for build data
  • finish reworking kernelci-core to make it a proper Python package with a release candidate on PyPI
  • add kernelci.api.models with the API models using Pydantic (or jsonschema, to be defined)
  • implement data validation on the API side based on node.kind

Note: I now remember there were issues when I first tried to do this, as having a dynamic model for a subpart of the schema doesn't work that well with FastAPI and Pydantic. It's definitely possible to do it in a clean way, at least based on the early experiments I did. One thing we just can't do is inherit from Node with a subclass like KbuildNode that defines the model for .data, as only the actual base class specified in FastAPI is used for validation, not sub-types. So I think this can be done, but it probably won't be completely trivial.
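
For reference, one possible shape of that "clean way": keep a single Node model for FastAPI and validate .data against a kind-specific sub-model inside a validator (Pydantic v1 style; the sub-model is the same hypothetical KbuildData as in the earlier sketch):

# Sketch of kind-based validation inside the base Node model, since FastAPI
# only validates the declared class and not sub-types. Pydantic v1 style.
from typing import Optional
from pydantic import BaseModel, validator


class KbuildData(BaseModel):
    arch: Optional[str] = None
    compiler: Optional[str] = None
    defconfig: Optional[str] = None


DATA_MODELS = {
    "kbuild": KbuildData,
    # the base "node" kind stays unconstrained
}


class Node(BaseModel):
    kind: str = "node"
    name: Optional[str] = None
    data: Optional[dict] = None

    @validator("data")
    def check_data_matches_kind(cls, value, values):
        model = DATA_MODELS.get(values.get("kind"))
        if model is not None and value is not None:
            # In Pydantic v1, ValidationError is a ValueError, so a mismatch
            # surfaces as a normal validation failure for the Node itself.
            model(**value)
        return value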

Now I'll compare what I'm proposing here with what you wrote and see where things stand. Does this make sense?


r-c-n commented Oct 9, 2023

Yes, the general idea sounds good, although some bits and implementation details are out of my reach. In particular, for the part about moving the API models to kernelci-core, I'd rather keep those details in a separate topic if possible.

In any case, the ability to share the schema and validation of .data between the server and clients should be a hard requirement, so I'd say the priority is to check the feasibility of that and make sure there's a clear way to do it technically. Unfortunately, I know nothing about FastAPI and Pydantic yet, so I'll need some time to understand how they're used in the current implementation.


gctucker commented Oct 9, 2023

Right, well, about sharing the models I can only see three solutions:

  1. have the models in kernelci-api and import them on the client side
  2. have the models in kernelci-core and import them both on the client and server side
  3. have the models in a new, separate package and import them anywhere they're needed

I believe kernelci-core is meant to be that base layer or core component with the definitions and functions that are needed to implement an API, clients, command line tools etc. So that's why I propose solution 2.

In terms of having a proof-of-concept for this, we could take the current data stored in node.data for kbuild jobs and use that as a starting point. It's not complete, but anything can be used initially just to verify the mechanics for schema validation. Even before we do that, we could work on moving API public-facing models to kernelci-core one by one - or at least just Node to start with.

@yurinnick commented:

I was thinking about model/API separation the other day, and I believe the way we should approach it is something along these lines:

  1. kernelci-api has FastAPI models for, well, the API, but nothing else
  2. kernelci-core has a similar set of classes, shared across all KernelCI internals and used for communication between internal services. We would create these classes from the FastAPI classes.

In principle, I think we should separate internal structures from the public-facing API interface, even if it adds some code duplication.

@gctucker commented:

Yes, but I think kernelci-core is the main public-facing Python package. It provides the API bindings and the CLI tools, so having the data models there would make perfect sense too. If the API implementation depends on it, then I think that's expected.

It's true we'll probably have to use two sets of models, but I believe the line should be drawn at the API level: public ones used by the API endpoints and client code, and internal ones used by the API implementation with the database. In many cases they'll probably align a lot, so we may not have to duplicate any code or models while still decoupling these two aspects.
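
A rough sketch of that split, with illustrative names (not the actual implementation): a public Node model shared via kernelci-core, and an internal database model that extends it so little or nothing gets duplicated:

# Sketch only: public vs internal models. Names and fields are illustrative.
from typing import Optional
from pydantic import BaseModel, Field


class Node(BaseModel):
    """Public-facing model, used by the API endpoints and client code."""
    kind: str = "node"
    name: Optional[str] = None
    data: Optional[dict] = None


class DatabaseNode(Node):
    """Internal model used by the API implementation with the database.

    It inherits the public fields and only adds storage-level details,
    such as the raw document id used by the database driver.
    """
    id: Optional[str] = Field(None, alias="_id")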


r-c-n commented Oct 23, 2023

@gctucker I'm doing some tests with different node.data formats on Regression, since it's not really used right now:

But I don't know if that's what you had in mind. I have some questions about the possible approaches:

  1. This one lets Node define data as a generic dict, and then submodels of Node can specify the dict fields as needed
  2. I don't see what the benefit is of having that "payload" encapsulated in node.data. We could simply define the additional type-specific fields of Node subclasses at the top level
  3. If we don't define any structure for node.data (like in the original Node model, where it's just an arbitrary dict), then we'll have to code the validation ourselves using Pydantic validators (I guess). I see one potential benefit to this: simplified models; but many downsides:
    • We'd simply be moving the validation from one place to another. Even if the models are simpler, we're still differentiating them, just with a different abstraction
    • We'd be replacing the automatic Pydantic validation with ad-hoc, nonstandard validation code that we then have to test and maintain

Also, I noticed that we'll probably have to rework some parts of Node and, to me, it makes sense to use it as an "abstract class" and create proper submodels for all the concepts that we need to model. For instance, fields like "revision", "result" and "parent" don't translate well to regressions, since a regression points to at least two test results that define the breaking point.
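
As a purely hypothetical sketch of that idea, a regression submodel could carry its own fields instead of reusing ones that don't quite fit (field names invented for illustration):

# Hypothetical sketch: a regression described by the two results around the
# breaking point, rather than a single "result"/"parent" pair.
from typing import Optional
from pydantic import BaseModel


class Regression(BaseModel):
    kind: str = "regression"
    failed_node_id: str           # the test run that failed
    last_pass_node_id: str        # the latest run of the same test that passed
    group: Optional[str] = None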

I could be missing some information or context, so what's your opinion on this?
