Invocation
cargo run --bin goose -- bench --help
cargo run --bin goose -- bench
to run the "core" suite of benchmarks
cargo run --bin goose -- bench -s $suite_name1,$suite_name2,...,etc
crates/goose-bench/src/eval_suites
Semantics [DO NOT SKIP READING]
core is the suite of evaluations that runs by default if the --suites CLI flag is not set.
If --suites is supplied, only the suites in that list will run, so if core isn't part of the list of suites passed to --suites, it will not run.
Individual Evals
crates/goose-bench/src/eval_suites/core/complex_tasks/flappy_bird.rs
crates/goose-bench/src/eval_suites/core/$group_name
core is the $suite_name
crates/goose-bench/src/eval_suites/core/$group_name/$eval_name
core has one group, complex_tasks, which has one eval, flappy_bird, so it's registered as follows:
register_evaluation!("core", FlappyBird)
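To make the registration and suite-selection rules concrete, here is a minimal standalone sketch in Rust. It is not goose-bench's implementation: the `Evaluation` trait, the `Registry` type, and the `register` and `selected` methods are hypothetical names introduced for illustration, and `register` only stands in for what `register_evaluation!("core", FlappyBird)` does in this PR. The sketch assumes `--suites` takes a comma-separated list and that an absent flag falls back to "core", as described under Semantics.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an evaluation; not goose-bench's real trait.
pub trait Evaluation {
    fn name(&self) -> &'static str;
}

/// Illustrative eval named after the one registered in this PR.
pub struct FlappyBird;

impl Evaluation for FlappyBird {
    fn name(&self) -> &'static str {
        "flappy_bird"
    }
}

/// Hypothetical registry mapping a suite name to its evaluations.
#[derive(Default)]
pub struct Registry {
    suites: BTreeMap<String, Vec<Box<dyn Evaluation>>>,
}

impl Registry {
    /// Stands in for `register_evaluation!("core", FlappyBird)`.
    pub fn register(&mut self, suite: &str, eval: Box<dyn Evaluation>) {
        self.suites.entry(suite.to_string()).or_default().push(eval);
    }

    /// Names of the evals that would run for a given `--suites` value.
    /// `None` models the flag being absent, which falls back to "core".
    /// When a list is given, only the listed suites are considered.
    pub fn selected(&self, suites_flag: Option<&str>) -> Vec<&'static str> {
        let wanted: Vec<&str> = match suites_flag {
            Some(list) => list
                .split(',')
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .collect(),
            None => vec!["core"],
        };
        wanted
            .iter()
            .filter_map(|suite| self.suites.get(*suite))
            .flatten()
            .map(|eval| eval.name())
            .collect()
    }
}
```

With only `FlappyBird` registered under "core", `selected(None)` and `selected(Some("core"))` both yield `flappy_bird`, while `selected(Some("other_suite"))` yields nothing, because "core" is not in the supplied list.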
Limitations
Does not handle configuring ollama; it is still necessary to configure it manually before running bench.
Does not test multiple configs easily; it currently runs tests for the agent/config that is active in the environment where it runs.
Does not parallelize at the eval level, suite level, or goose-bench level.
Struck items are outside the scope of current bench work.