data model #1276

Open
wants to merge 10 commits into
base: main
Collaborator

@laanak08 laanak08 commented Feb 18, 2025

Invocation

  • help: cargo run --bin goose -- bench --help
  • cargo run --bin goose -- bench runs the "core" suite of benchmarks
  • cargo run --bin goose -- bench -s $suite_name1,$suite_name2,... runs only the named suites
  • add new benchmark suites to crates/goose-bench/src/eval_suites

Semantics [DO NOT SKIP READING]

  • there is a core suite of evaluations that runs by default when the --suites CLI flag is not set
    • stated differently, without --suites, any evaluation not included in core will not run
  • if --suites is supplied, only the suites in that list will run, so if core isn't in the list passed to --suites, it will not run.
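The selection rule above can be sketched as a small function. This is only an illustration of the described semantics, not the code in this PR: `select_suites` and `DEFAULT_SUITE` are hypothetical names, and the handling of whitespace and empty entries is an assumption.

```rust
/// Suite that runs when no `--suites` value is given (per the semantics above).
const DEFAULT_SUITE: &str = "core";

/// Hypothetical helper: turn the optional comma-separated `--suites` value
/// into the list of suite names to run.
fn select_suites(suites_flag: Option<&str>) -> Vec<String> {
    match suites_flag {
        // Flag not set: only the default suite runs.
        None => vec![DEFAULT_SUITE.to_string()],
        // Flag set: run exactly what was listed. `core` is NOT added implicitly.
        Some(list) => list
            .split(',')
            .map(str::trim)
            .filter(|name| !name.is_empty()) // assumption: ignore empty entries
            .map(String::from)
            .collect(),
    }
}
```

Under this sketch, `select_suites(None)` yields only `core`, while `select_suites(Some("a,b"))` yields `a` and `b` and leaves `core` out unless it is listed explicitly.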

Individual Evals

  • an example can be examined here: crates/goose-bench/src/eval_suites/core/complex_tasks/flappy_bird.rs
  • groups of related evals can be placed together in a module at crates/goose-bench/src/eval_suites/core/$group_name
    • in this example, core is the $suite_name
    • each eval lives in its own file at crates/goose-bench/src/eval_suites/core/$group_name/$eval_name
    • register new evals under the top-level suite name, not their group name.
      • e.g. the suite core has one group, complex_tasks, which has one eval, flappy_bird, so it's registered as follows:
      • register_evaluation!("core", FlappyBird)
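To make the "suite name, not group name" rule concrete, here is a minimal sketch of a registry keyed by suite name. It does not reproduce what `register_evaluation!` actually expands to: the `Evaluation` trait, the `Registry` type, and its methods are all hypothetical stand-ins for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an evaluation; the real trait in goose-bench may differ.
trait Evaluation {
    fn name(&self) -> &str;
}

struct FlappyBird;

impl Evaluation for FlappyBird {
    fn name(&self) -> &str {
        "flappy_bird"
    }
}

/// Hypothetical registry: evals are stored per suite; groups are only a
/// directory/module convention and do not appear as keys.
#[derive(Default)]
struct Registry {
    suites: BTreeMap<String, Vec<Box<dyn Evaluation>>>,
}

impl Registry {
    /// Register an eval under its top-level suite name.
    fn register(&mut self, suite: &str, eval: Box<dyn Evaluation>) {
        self.suites.entry(suite.to_string()).or_default().push(eval);
    }

    /// Names of all evals registered under `suite` (empty if the suite is unknown).
    fn eval_names(&self, suite: &str) -> Vec<&str> {
        self.suites
            .get(suite)
            .map(|evals| evals.iter().map(|e| e.name()).collect())
            .unwrap_or_default()
    }
}
```

In this sketch, registering `FlappyBird` under `"core"` makes it visible when the `core` suite is selected, whereas looking it up under the group name `complex_tasks` finds nothing.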

Limitations

  • no namespacing until this PR is merged in.
    • until then, wherever it's run, and whatever it's allowed to do (via exts), it will do, without isolating its work to a tmp env
    • copy files needed for eval into eval work-dir
  • summary/run-report/errors-report
  • tracing: it may or may not work; this hasn't been checked.
  • does not handle configuring ollama; it's still necessary to configure it manually before running bench
  • test multiple configs easily.
    • currently runs tests for the agent/config that's active in the environment it's run in.
  • parallelize at the evals level, the suite level, or in goose-bench
    struck items are outside the scope of the current bench work.
