data model #1276

laanak08 · 2025-02-18T17:16:59Z

Invocation

There is a core suite of evaluations that runs by default if the --suites CLI flag is not set.
- Stated differently, without --suites, any evaluation not included in core will not run.
If --suites is supplied, only the suites in that list will run, so if core isn't in the list passed to --suites, it will not run.
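
The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the goose-bench implementation: `select_suites` is a hypothetical helper, and the comma-separated format for the `--suites` value is an assumption.

```rust
/// Name of the suite that runs when `--suites` is not set.
const DEFAULT_SUITE: &str = "core";

/// Hypothetical helper: turn the raw `--suites` value into the list of
/// suites to run. `None` means the flag was not passed at all.
/// Assumes (not confirmed by the PR) that the flag takes a comma-separated list.
fn select_suites(suites_flag: Option<&str>) -> Vec<String> {
    match suites_flag {
        // Flag absent: only the default suite runs.
        None => vec![DEFAULT_SUITE.to_string()],
        // Flag present: run exactly what was listed, nothing more.
        // `core` is NOT added implicitly.
        Some(list) => list
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect(),
    }
}
```

Under these assumptions, `select_suites(None)` yields only `core`, while `select_suites(Some("other"))` yields only `other`, so `core` is skipped unless it is listed explicitly.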

An example can be found here: crates/goose-bench/src/eval_suites/core/complex_tasks/flappy_bird.rs
Groups of related evals can be placed together in a module at crates/goose-bench/src/eval_suites/core/$group_name
- In this example, core is the $suite_name.
- Each eval is in its own file at crates/goose-bench/src/eval_suites/core/$group_name/$eval_name
- Register new evals to the top-level suite name, not their group name.
  - e.g. suite core has one group, complex_tasks, which has one eval, flappy_bird, so it's registered as follows:
  - register_evaluation!("core", FlappyBird)
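
To illustrate why evals are registered under the suite name rather than the group name, here is a rough sketch of a registry keyed by suite. The `Evaluation` trait, the `Registry` type, and its methods are hypothetical stand-ins; the real `register_evaluation!` macro and trait in goose-bench may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an evaluation; the real trait may differ.
trait Evaluation {
    fn name(&self) -> &str;
}

/// Stand-in for the eval at core/complex_tasks/flappy_bird.rs.
struct FlappyBird;

impl Evaluation for FlappyBird {
    fn name(&self) -> &str {
        "flappy_bird"
    }
}

/// Hypothetical registry: maps a suite name to the evals registered under it.
/// Groups (e.g. `complex_tasks`) only organize files; they are not keys here.
#[derive(Default)]
struct Registry {
    suites: HashMap<String, Vec<Box<dyn Evaluation>>>,
}

impl Registry {
    /// Register an eval under its top-level suite name.
    fn register(&mut self, suite: &str, eval: Box<dyn Evaluation>) {
        self.suites.entry(suite.to_string()).or_default().push(eval);
    }

    /// Names of all evals registered under `suite` (empty if the suite is unknown).
    fn evals_in(&self, suite: &str) -> Vec<&str> {
        self.suites
            .get(suite)
            .map(|evals| evals.iter().map(|e| e.name()).collect())
            .unwrap_or_default()
    }
}
```

In this sketch, calling `register("core", Box::new(FlappyBird))` makes the eval visible when the `core` suite is looked up, whereas a lookup by the group name `complex_tasks` finds nothing.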

No namespacing until this PR is merged in.
- Until then, an eval runs wherever it's launched and does whatever it's allowed to do (via exts), without isolating its work to a tmp env.
- copy files needed for eval into eval work-dir
summary/run-report/errors-report
Tracing: maybe it works, maybe it doesn't; haven't checked.
~~does not handle configuring ollama. still necessary to manually config before running bench~~
~~test multiple configs easily.~~
- ~~currently runs tests for the agent/config that's active in the environment it's run in.~~
~~parallelize at evals-level, or suite-level, or goose-bench~~
Struck items are outside the scope of the current bench work.

zakiali and others added 10 commits February 13, 2025 21:20

initial commit of goosebench

5ec0694

ext (in)validation prompts

e1c51be

WIP

013933c

removed py benchmark proj

49dc81f

add cli opt to specify which eval suite to run + support nested suites

60a6100

fmt

e85a895

remove to-list comments

826050f

remove report struct

4cb2740

fmt

6482629

clippy

91511ae