This file goes over the details of the expectation files, which are critical for ensuring that GPU tests only run where they should and that flakes are suppressed to avoid red bots.
[TOC]
## Expectation Files

The GPU Telemetry-based integration tests (tests that use the
`telemetry_gpu_integration_test` target) use expectation files to define when
certain tests should not be run or are expected to fail. The core expectation
format is defined by typ, although there are some Chromium-specific extensions
as well. Each expectation consists of the following fields, separated by
spaces:
- An optional bug identifier. While optional, it is strongly encouraged that GPU expectations fill in this field.
- A set of tags that the expectation applies to. This is technically optional, as omitting tags will cause the expectation to be applied everywhere, but there are very few, if any, instances where tags will not be specified for GPU expectations.
- The name of the test that the expectation applies to. A single wildcard
  (`*`) character is allowed at the end of the string, but use of a wildcard
  anywhere but the end of the string is an error.
- A set of expected results for the test. This technically supports multiple
  values, but for GPU purposes, it will always be a single value.

Additionally, comments are supported, which begin with `#`.
Thus, a sample expectation entry might look like:
```
# Flakes regularly but infrequently.
crbug.com/1234 [ win amd ] foo/test [ RetryOnFailure ]
```
The following sections provide further details on each of the parts of an expectation that make up the core expectation file format.
### Bugs

One or more optional strings pointing to the bug(s) tracking the reason why the
expectation exists. For GPU uses, this is usually a single bug, but multiple
space-separated strings are supported.

The format of the string is enforced via regular expressions, so CLs that
introduce malformed bug identifiers will not be submittable.
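For illustration, here is what the bug field can look like in practice. The bug
numbers, tags, and test names below are made up; only the layout matters:

```
# A single bug, the typical case for GPU expectations.
crbug.com/1234 [ win ] foo/test [ Failure ]
# Multiple space-separated bugs are also accepted.
crbug.com/1234 crbug.com/5678 [ linux ] bar/test [ Failure ]
```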
### Tags

One or more tags are used to specify which configuration(s) an expectation
applies to. For GPU tests, these are typically things such as the OS, the GPU
vendor, or the specific GPU model.

Tag sets are defined at the top of the expectation file using `# tags:`
comments. Each comment defines a different set of mutually exclusive tags, e.g.
all of the OS tags are in a single set. An expectation is only allowed to use
one tag from each set, but can use tags from an arbitrary number of sets. For
example, `[ win win10 ]` would be invalid since both are OS tags, but
`[ win amd release ]` would be valid since there is one tag each from the OS,
GPU, and browser type tag sets.
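As a rough sketch of what those header comments look like (the real expectation
files define longer tag sets and may use different tag names), each `# tags:`
comment declares one mutually exclusive set:

```
# tags: [ win win10 mac linux ]
# tags: [ amd intel nvidia ]
# tags: [ debug release ]
```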
Additionally, tags used by expectations for the same test must be unambiguous, so that the same test cannot have multiple expectations applied to it at once. Take the following expectations as an example:
```
[ mac intel ] foo/test [ Failure ]
[ mac debug ] foo/test [ RetryOnFailure ]
```
These expectations would be considered to be conflicting since `[ mac intel ]`
does not make any distinctions about the browser type, and `[ mac debug ]` does
not make any distinctions about the GPU type. As written, `foo/test` running on
a configuration that produced the `mac`, `intel`, and `debug` tags would try to
use both expectations.
This can be fixed by adding a tag from the same tag set but with a different
value so that the configurations are no longer ambiguous. `[ mac intel release ]`
would work since a configuration cannot be both `release` and `debug` at the
same time. Similarly, `[ mac amd debug ]` would work since a configuration
cannot be both `intel` and `amd` at the same time.
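For example, the conflicting pair above could be rewritten like this, using the
first of those fixes:

```
[ mac intel release ] foo/test [ Failure ]
[ mac debug ] foo/test [ RetryOnFailure ]
```

A configuration that produces the `mac`, `intel`, and `debug` tags now matches
only the second expectation, while `mac`/`intel`/`release` configurations match
only the first.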
Such conflicts will be caught and reported by presubmit tests, so you should not have to worry about accidentally landing bad expectations, but you will need to fix any found conflicts before you can submit your CL.
Actually updating the test harness to generate new tags is out of scope for this documentation. However, if a new tag needs to be added to an expectation file or an existing one modified (e.g. renamed), it is important to note that the tag header should not be manually modified in the expectation file itself.
Instead, modify the header in `validate_tag_consistency.py` and run
`validate_tag_consistency.py apply` to apply the new header to all expectation
files. This ensures that all files remain in sync.
Tag consistency is checked as part of presubmit, so it will be apparent if you accidentally modify the tag header in a file directly.
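A minimal sketch of that workflow, assuming the script is invoked from the
directory that contains it (the exact path in your checkout may differ):

```
# After editing the tag header definition in validate_tag_consistency.py,
# re-apply the canonical header to every expectation file.
./validate_tag_consistency.py apply
```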
### Test Name

A single string with either a test name or part of a test name suffixed with a
wildcard character. Note that the test name is just the test case as reported
by the test harness, not the fully qualified name that is sometimes reported in
places such as the "Test Results" tab on bots.
As an example,
`gpu_tests.webgl1_conformance_integration_test.WebGL1ConformanceIntegrationTest.WebglExtension_EXT_blend_minmax`
is a fully qualified name, while `WebglExtension_EXT_blend_minmax` is what
would actually be used in the expectation file for the `webgl1_conformance`
suite.
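So an expectation for that test might look like the following, where the bug
number and tags are illustrative; the second line shows the trailing-wildcard
form:

```
# Exact test name.
crbug.com/1234 [ mac intel ] WebglExtension_EXT_blend_minmax [ Failure ]
# Trailing wildcard matching every test name with this prefix.
crbug.com/1234 [ win amd ] WebglExtension_* [ Failure ]
```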
### Expected Results

Usually one, but potentially multiple, results that are expected on the
configuration that the expectation is for. Like tags, expected results are
defined at the top of each expectation file and have the same caveat about
addition/modification using the helper script. However, unlike tags, there is
only a single set of values, and it is not expected to be added to or changed
on any sort of regular basis. The following expected results are used by GPU
tests:
#### Skip

Skips the test entirely. The benefit of this is that no time is wasted on a bad
test. However, it also means that it is impossible to check whether the test is
still failing just by looking at historical results. This is problematic for
humans, but even more problematic for the scripts we have that automatically
remove expectations that are no longer needed.

As such, adding new Skip expectations is heavily discouraged except under the
following circumstances:
- The test is invalid on a configuration for some reason, e.g. a feature is not and will not be supported on a certain OS, and so should never be run. These sorts of expectations are expected to be permanent.
- The act of running the test is significantly detrimental to other tests, e.g. running the test kills the test device. These are expected to be temporary, so the root cause should be fixed relatively quickly.
If presubmit thinks you are adding new Skip expectations, it will warn you, but the warning can be ignored if the addition falls into one of the above categories or if it is a false positive, such as one caused by modifying tags on an existing expectation.
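As a sketch of the first (permanent) category, with a made-up bug and test
name:

```
# The feature under test is not and will not be supported on this OS.
crbug.com/1234 [ win ] foo/unsupported_feature_test [ Skip ]
```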
#### Failure

Lets the test run normally, but hides the fact that it failed during result
reporting. This is the preferred way to suppress frequent failures on bots, as
it keeps the bots green while still reporting results that can be used later.
#### RetryOnFailure

Allows the test to be retried up to two additional times before being marked as
failing, as by default GPU tests do not retry on failure. This is preferred if
the test fails occasionally, but not enough to warrant marking it as failing
consistently.
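For reference, these two result types look like this in an expectation file
(bug numbers, tags, and test names are again illustrative):

```
# Fails consistently on this configuration, but still runs so results are recorded.
crbug.com/1234 [ linux intel ] foo/test [ Failure ]
# Flaky at a low rate; allow retries instead of immediately reporting a failure.
crbug.com/1234 [ linux intel ] bar/test [ RetryOnFailure ]
```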
## Chromium Extensions

In addition to the normal expectation functionality, Chromium has several extensions to the expectation file format.
### Unexpected Pass Finder Annotations

Chromium has several unexpected pass finder scripts (sometimes called stale expectation removers) to automatically reclaim test coverage by modifying expectation files. These mostly work as intended, but can occasionally make changes that don't align with what we actually want. Thus, there are several annotations that can be inserted into expectation files to adjust the behavior of these scripts.
#### Disable Annotations

There are several annotations that can be used to prevent the scripts from
automatically removing expectations. All of these start with `finder:disable`
with some suffix.

`finder:disable-general` prevents the expectation from being removed under any
circumstances.

`finder:disable-stale` prevents the expectation from being removed if it is
still applicable to at least one bot, but all queried results point to the
expectation no longer being needed. This is most likely to be used for
expectations for very infrequent flakes, where the flake might not occur within
the data range that we query.

`finder:disable-unused` prevents the expectation from being removed if it is
found to not be used on any bots, i.e. the specified configuration does not
appear to actually be tested. This is most likely to be used for expectations
for failures reported by third parties with their own testing configurations.
All of these annotations can either be used inline for a single expectation:
```
[ mac intel ] foo/test [ Failure ]  # finder:disable-general
```
or with their `finder:enable` equivalent for blocks:
```
# finder:disable-general
[ mac intel ] foo/test [ Failure ]
[ mac intel ] bar/test [ Failure ]
# finder:enable-general
```
Nested blocks are not allowed. The `finder:disable` annotations can be followed
by a description of why the disable is necessary, which will be output by the
script when it encounters a case where one of the disabled expectations would
have been removed if the annotation was not present:
```
# finder:disable-stale Very low flake rate
[ mac intel ] foo/test [ Failure ]
[ mac intel ] bar/test [ Failure ]
# finder:enable-stale
```
#### Group Annotations

There may be cases where groups of expectations should only be removed
together, e.g. if a flake affects a large number of tests but the chance of any
individual test hitting the flake is low. In these cases, the expectations can
be grouped together so that one is only removed if all of them are being
removed.
```
# finder:group-start Some group description or name
[ mac intel ] foo/test [ Failure ]
[ mac intel ] bar/test [ Failure ]
# finder:group-end
```
The group name/description is required and is used to uniquely identify each group. This means that groups with the same name string in different parts of the file will be treated as the same group, as if they were all in a single group block together.
```
# finder:group-start group_name
[ mac ] foo/test [ Failure ]
[ mac ] bar/test [ Failure ]
# finder:group-end
...
# finder:group-start group_name
[ android ] foo/test [ Failure ]
[ android ] bar/test [ Failure ]
# finder:group-end
```
is equivalent to
```
# finder:group-start group_name
[ mac ] foo/test [ Failure ]
[ mac ] bar/test [ Failure ]
[ android ] foo/test [ Failure ]
[ android ] bar/test [ Failure ]
# finder:group-end
```