RFC 114: Change results aggregation on wpt.fyi and visualization on Interop-20** views #114
Conversation
The proposed changes make sense to me. Does this change affect the scoring for Interop 2022?
No, but I would like us to pick a scoring method that is as close as possible. And if it's not exactly the same, I'd like to use the scoring method we pick here for Interop 2023.
@DanielRyanSmith I love how this resolves web-platform-tests/wpt.fyi#421, nice! On the precision, I'm leaning towards no decimal points, because it's more compact, and a difference of <1% rarely matters.
I haven't looked at this in detail, but I'm concerned that we originally avoided percentage scores because they're so easy to misinterpret as some kind of ranking. They're also lossy; you can't tell whether a score of 50% is based on passing 1/2 tests or 500/1000. The latter is at least suggestive of a more complete test suite. For Interop-* we are putting a lot of care into selecting the tests so that everyone agrees it's OK to publish this kind of simplified score. I'm not sure who the intended audience is for doing that for all tests; I don't see that it's especially helpful to browser developers, who I consider to be the main target for the raw test scores.
@jgraham note that the number of tests is preserved next to directory names. In an earlier prototype it was next to the percentage, but since it's the same for each cell it was moved outside. Some other way of showing the number of tests would be OK too, I think.
This will not affect Interop 2022 scores and will move wpt.fyi a bit closer to mirroring interop 2022's scores.
This is true for a single test with multiple subtests. Like @foolip mentioned, an attempt to mitigate this has been to make the number of tests in a directory visible next to the path name. Not displaying subtests is a potential risk, but hopefully mitigated by the test numbers display.
I'd like to use the staging instance to see what this long diff query looks like, but I get an error when I try to just substitute the staging domain into that URL.
Are those runs specified at the end of the URL not available to the staging instance?
Thanks @davidsgrogan for taking a look 😊 - the original plan was to only change the results view to percentages and leave the diff query view with its current implementation. Perhaps this is something that people have opinions about? The view you linked is not available because the run IDs differ between the staging and production environments, but the view should look mostly the same as it currently does in production.
I think that counting subtests in aggregate, like for all of css/ or editing/, doesn't make sense, and I'm glad to see that go away. But if anyone thinks it's important to continue to show subtest counts at the test level, an option would be to continue to show "3 / 4" instead of 75% in the cell for an individual test.
Or for tests that have subtests, show the number of subtests next to the filename |
The tricky part of that is that the number of subtests can be different between browsers, in cases like these: We can just pick the biggest number to show since it would be purely informational and not for scores, and I think that'd be OK enough, and is how the single "Showing 18 subtests" in the blue banner is currently computed. @DanielRyanSmith do you think displaying that next to test names would be OK? What do we do about names that are too long? (This can also happen with directories, but much more rarely.)
This is a quick and easy addition, and here is an example of the change. If any subtests exist in the test, the number will be displayed next to the name. In terms of text wrapping, it would wrap the same way it does for long test file names now. Here is an example of long test names. Some are already wrapping without the new subtest numbers. One possible issue I could see is if there is a need to differentiate the numbers displayed by a directory (tests) from the numbers displayed by a single test (subtests), but it seems relatively intuitive to me. Great suggestion!
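As a rough illustration of the "pick the biggest number" count described a couple of comments up, here is a minimal sketch; the data shape and names are hypothetical, not wpt.fyi's actual model:

```python
# Hypothetical per-browser subtest results for a single test.
runs = {
    "browser_a": {"subtests": ["a", "b", "c", "d"]},
    "browser_b": {"subtests": ["a", "b", "c"]},  # fewer subtests reported
    "browser_c": {"subtests": ["a", "b", "c", "d"]},
}

# The count shown next to the test name is purely informational, so take the
# largest subtest count seen across browsers, as with the "Showing N subtests"
# banner mentioned above.
display_count = max(len(run["subtests"]) for run in runs.values())
print(display_count)  # 4
```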
I have opinions on it! 😊 I will need to calculate the interop2022 score difference caused by engines changing behavior for a large subset of tests. It will be absurdly tedious to do manually based on the existing diff view because of subtests. So showing score change in the diff view would eliminate a lot of manual calculation on my part. Ideally we could have a toggle button that lets us users switch the view between score difference and the existing view that shows # test differences, but maybe that would add too much complexity on your side?
Overall, thanks for implementing this! It'll be helpful as we work through improving Interop2022 scores.
I think we need to keep the decimal precision. In the third screenshot example, "backgrounds" has 480 sub-tests and a pass rate of 99.8%, meaning a "combined" score of 480 × 0.998 ≈ 479. So there's a full point to pick up there, which wouldn't be visible if you rounded that to 100%.
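A small worked example of that rounding concern, using the numbers from the comment above (the rest is purely illustrative):

```python
# The "backgrounds" directory: 480 subtests with a 99.8% pass rate.
subtests = 480
pass_rate = 0.998

print(subtests * pass_rate)  # 479.04, i.e. roughly one subtest still failing
print(f"{pass_rate:.1%}")    # "99.8%" - the remaining gap is still visible
print(f"{pass_rate:.0%}")    # "100%"  - whole-percent rounding hides it
```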
Your pain point is an important one, and is tangentially related. I think it would be very useful to describe your situation in an issue on wpt.fyi if it's not already there. Because it's a separate view with a use case that is typically different from this results view's, it might take some more discussion, but this feels like very useful information.
The RFC isn't really clear on who the intended beneficiary of these changes is. As far as I can tell it doesn't really help browser engineers to find and address problems in their engines, which I consider the main use case for wpt.fyi. Making the results look similar to interop-2022 seems explicitly like it should be a non-goal; that's a carefully curated set of tests which vendors have mutually agreed are useful for public consumption. The full set of web-platform-tests doesn't have such agreement, and I'm extremely reluctant to turn it into something that looks like an externally meaningful metric.
rfcs/test_scoring_change.md (Outdated)
> _this change. This only affects the total percentage that the test counts as_
> _passing._
>
> ### Display percentages to represent the fraction of tests passing.
I'm pretty solidly opposed to changing the display to be based on percentages. It has been a long term goal to avoid wpt.fyi becoming a primary source for non-browser developers trying to do cross-browser comparisons. It's not clear to me that percentages here are helpful to browser developers, but by making random wpt.fyi results pages look like Interop-202x scores we are encouraging third parties to make value judgements using test sets that haven't been subject to the extensive discussion/auditing of the Interop-selected tests. I'd expect that to affect whether people are willing to submit tests to web-platform-tests in the first place.
I definitely see the issue you're describing with percentages giving a false sense of normalized numbers to compare between browsers. That is not the intention of the change. Maybe this is the most contentious change proposed, and finding a middle ground here would be a better option.
rfcs/test_scoring_change.md (Outdated)
> proposed, although not perfect, is to mark these tests with harness errors
> as 0% passing, regardless of subtest results.
>
> _Note: Individual subtest results will still be visible normally even with_
I think the caveat here is important: this change doesn't seem to help browser developers at all; they still need to go down into the individual test and figure out "oh actually there are some passing subtests, the problem is that the test is slow running, so we time out" (or whatever). That suggests the change isn't actually very helpful.
It is true that this change isn't perfect, but it is hopefully still a net positive.
To use the example you mentioned, it could be the case that, for example, 7 subtests pass as expected, but a timeout occurs on the 8th subtest. There are instances where a test like this will be marked as 7/8 subtests passing (or 7/9, since the harness status is currently counted as a subtest), but the test could have many more subtests that were never run and are missing from the results. The current aggregation method can hide the scope of an issue like this, so the hope is that this change will bring attention to areas of improvement that could currently be overlooked.
That being said, I'm not certain how common this scenario is. Perhaps delving into this further and getting some numbers would be useful.
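To make that scenario concrete, here is a hedged sketch of the two behaviours being contrasted; the scoring functions are illustrative, not the actual wpt.fyi aggregation code:

```python
# One testharness test: 7 subtests pass, then a timeout stops the run, so the
# harness status is TIMEOUT and any remaining subtests never report at all.
harness_status = "TIMEOUT"
subtest_results = ["PASS"] * 7 + ["TIMEOUT"]  # the 8th subtest timed out

def old_score(harness_status, subtests):
    # Current behaviour as described above: the harness status is counted as
    # if it were one more subtest.
    passes = sum(r == "PASS" for r in subtests) + (harness_status == "OK")
    return passes, len(subtests) + 1

def proposed_score(harness_status, subtests):
    # Proposed behaviour: a non-OK harness status marks the whole test as 0%.
    if harness_status != "OK":
        return 0, 1
    passes = sum(r == "PASS" for r in subtests)
    return passes, len(subtests)

print(old_score(harness_status, subtest_results))       # (7, 9)
print(proposed_score(harness_status, subtest_results))  # (0, 1)
```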
@jgraham this change is related to (motivated by) these wpt.fyi issues, some very old, some new:
The intention is of course that it should be an improvement for users of wpt.fyi, and not regress any important use cases. That last issue is about being able to browse (and sort) the Interop 2022 tests. The original suggestion was a fraction like "43.95 / 90", and that would actually work better when drilling down into a part of the test suite, since a percentage would be normalized at each level, while a test count would not. One could split this all up into the following considerations:
Which parts of this would be fine and which need more discussion?
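As a sketch of why a normalized score behaves differently from a raw test count when drilling down (paths, numbers, and the scoring function are made up for illustration):

```python
# Hypothetical per-test pass fractions under a "fraction of tests passing"
# model: a reftest counts as 0 or 1, a testharness test contributes its
# subtest pass fraction.
tests = {
    "css/flexbox/a.html": 1.0,   # reftest, passing
    "css/flexbox/b.html": 0.5,   # 5 of 10 subtests pass
    "css/grid/c.html":    0.25,  # 1 of 4 subtests pass
}

def score(subset):
    # Normalized at whatever level it is computed: always a value in [0, 1].
    return sum(subset.values()) / len(subset)

css = score(tests)
flexbox = score({k: v for k, v in tests.items() if k.startswith("css/flexbox/")})
print(f"css/: {css:.0%} of {len(tests)} tests")   # 58% of 3 tests
print(f"css/flexbox/: {flexbox:.0%} of 2 tests")  # 75% of 2 tests
```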
I think that wanting the results on wpt.fyi to look like Interop 2022 scores is unnecessarily constraining the solution space. If we drop that requirement then there are more options to surface relevant information in a more usable way, without it having to look like a single numerical score.

For reftests I agree that it makes sense to inline the actual test result in the directory-level view.

For tests with subtests, the limiting assumption seems to be that we have to combine both the overall harness status and the subtest scores into a single meaningful number. I don't think that's true. For example, for tests with both some passing subtests and a non-OK status we could display the result like "6/7 ⚠" with title text on the warning triangle to convey the problematic status. For directory-level views we could display the test count, the subtest pass rate, and also the warning triangle with details of any non-OK overall statuses. Maybe clicking the triangle could take you through to a view of just the problematic tests.

From that thought, I wonder if we should actually be displaying the subtest failure count rather than the pass count, and making it possible to click through to a view only containing the failures (or tests containing failures at the directory level). Maybe that's too confusing, and I'm certainly not wedded to the failure-triangle suggestion; I'm all in favour of a better way to convey the same information. But in general, setting up the dashboard to be useful to developers looking for problems that need to be solved involves surfacing different information than setting it up as a source of metrics.

I also understand that there are requests to be able to reflect Interop metrics directly in the wpt.fyi display. Generally I think this kind of thing is a bit misguided; knowing that fixing case A will improve the score by 0.5% whereas B will improve it by 1% doesn't mean that B is going to be twice as useful as A; we just haven't calibrated the scores down to that level and I don't think we could even if we tried. That's different from performance benchmarks, where a 1% win is pretty much twice as desirable as a 0.5% win. Of course once you have a metric that's a proxy for some underlying goal (e.g. "allow developers to use without running into interop problems") people are going to want to optimise for the metric rather than the goal. I don't think we should encourage that, but to the extent that people disagree with me there, I think we should at least make sure that any view designed to support the metric is gated on only being available when displaying tests that are part of the metric. So anything specifically for surfacing Interop 2022 scores should be contingent on first having filtered down to one of the
As a browser implementer who raised one of the issues leading to this change, I just wanted to chime in that I do find changes along these lines helpful. The old display improperly (in my opinion) weighed sub-test failures over test-file failures. My request (which I believe this change fixes) was to report fractional files-passing statistics, rather than subtest-passing statistics. I'm generally agnostic to the many different ways that can be achieved, but I do think it's an important change to make. This was motivated by Interop2022, which, for the first time, set out a methodology for how to score a set of tests. While we don't need to be tied to that exact methodology, I think it is more valuable to use that approach than to stick to just counting sub-tests.
Yes, Goodhart's law applies. But Goodhart doesn't offer a better solution to the problem. We need to measure things to try to improve them, and to do that, we need a well-defined metric. Certainly the Interop2022 methodology isn't perfect, but it is very practically helpful. Compat2021 was fairly successful, I think, using a similar methodology.
Hello everyone, I've drafted some changes to the view and have made it available to test in the staging environment. To summarize:
Some views that have a new look as a result:
I think this looks like a promising direction, but criticisms are encouraged here. Thanks @jgraham for bringing up the suggestion for parts of this, and I think this is closer to your original suggestion, if I'm not mistaken, @mfreed7. I think we're moving closer to something with the most utility for users. 😎
This definitely still solves my problem. And I think I actually like it better - there's more information available now. Not just the fraction of tests passing but also the total number, in the same place.
I do kind of think the parentheses might not be enough to distinguish these. The difference between "19 / 19" and "(10 / 10)" is pretty subtle, especially if you haven't been in this conversation. I wish I had a great alternative suggestion to make, but I don't. Perhaps it's too verbose, but what about "0.8/1 (8/10 subtests)" or something like that?
Is there a need to differentiate? The fact that it's a folder vs a file seems clear to me. You can click through to get detailed results of the subtests.
Thanks for the update @DanielRyanSmith! I think it will be hard to distinguish between test and subtest fractions in a way that is clear without explanation. It's already pretty clear what's a directory and what is a test, but neither that nor the parentheses make it obvious what the fractions mean without explanation. Maybe we should have a "how to read these results" explanation somewhere? And/or we could have a tooltip that says "8 of 10 subtests pass"?
On harness errors, I'm not sure about the warning triangles. I would prefer to inline the status into the test listing, making it even more obvious that there's a problem. But this depends on how scores are aggregated, and it makes the most sense if we count a test as 0 if there's a harness error, which I think we should, since the test didn't run to completion.
Hello everyone - I have been OOO and haven't been able to respond here. @jgraham do you have any further opinions on this implementation? Soon I'll update the RFC proposal with more info and focus on the user we're targeting here (which has been the browser engineers).
I had a discussion about this with @foolip, @gsnedders, and @past. My conclusions:
There seems to be a consensus that this new view would be more useful specifically for Interop scores, so I've been working to make that so.
There also seems to be a consensus on not counting the harness status toward the subtest count and displaying harness errors above in some way. 👌 I don't think I explained this very well, but my rationale for marking tests with a harness error as 0 was less about mirroring the Interop scoring and more about increasing the visibility of possibly larger failures (a test marked as 0 will catch more eyes than one with a better pass rate, because missing tests aren't weighted). However, making the harness error more visible above is a solid middle ground there. I've kept in the warning triangles with title text currently.
I agree that the mixture of numbers and PASS/FAIL can look cluttered in some views where they show together. I've removed this from all views as of now.
Perhaps this is something to look into for the future if there is enough demand - I think it sounds like a valuable idea. Here is a demonstration of the newest iteration. Our staging environment has some resource issues so an error message might occur, but here is an example of an Interop score with:
The main page should look mostly unchanged, with the exception of removing the harness status from subtest aggregation and adding harness error notices. Thanks for the detailed thoughts @jgraham 😊 Maybe having these alternate hidden views and adding them where it makes sense is a workable approach.
I haven't fully looked at this, but a quick comment: I'd really like views like the percentage view to be gated on having a filter that selects
rfcs/test_scoring_change.md (Outdated)
> * This change will cause a net loss in the percentage of passing
> tests/subtests, as the harness status passing has been artificially inflating
> some subtest numbers.
> * This change will require a rework of the current process for aggregating and
Is this the final conclusion, or can we use the search cache when there aren't pre-computed summary numbers in the new format?
I can reword this a bit - viewing runs using old vs. new summary files will not affect the scores displayed in the Interop-20** views. However, there will be a discrepancy when comparing old and new results in the default view. See here for an example of results that are identical but were aggregated differently, so their subtest counts look different (new aggregation on the left, old on the right).
There is a larger than expected scope of work for no longer counting the harness status in old summary files. The results view on wpt.fyi assumes there is ALWAYS a summary file available to use and bypasses any aggregation on simple queries (simple = results views with no search query in the search bar). As I see it, these are the options available:
- Remove old summary files and regenerate the summary files with the new format. This would take a very large amount of time and resources to run.
- When fetching a summary file server side, check if the summary file is an old format and don't use it if that is the case. Instead aggregate in the same way as is done with search queries.
- Accept that there will be small subtest count discrepancies when comparing new runs against older runs.
I think option 2 is promising, but it will likely cause a small performance hit when viewing results from these old runs. However, that hit should shrink over time as these old runs are compared less and less.
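A hedged sketch of what option 2 might look like server-side; the summary shapes, the "format_version" marker, and the function names here are hypothetical, not the actual wpt.fyi formats or API:

```python
def load_summary(raw_summary, reaggregate_from_raw_results):
    """Return per-test (passes, total) pairs using the new aggregation.

    `raw_summary` is the parsed summary JSON for a run, and
    `reaggregate_from_raw_results` is the fallback that recomputes totals
    from the full results, as already happens for search-bar queries.
    """
    if raw_summary.get("format_version", 1) >= 2:
        # New-format summary: already excludes the harness status, use as-is.
        return raw_summary["results"]

    # Old-format summary: its counts include the harness status as a subtest,
    # so ignore it and re-aggregate (option 2), accepting the extra cost for
    # these older runs.
    return reaggregate_from_raw_results()
```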
I've updated the RFC, as well as the screenshots and links in the PR description (see these for examples of the current iteration). I believe I've responded to the concerns that have arisen at this point. Are there any more items anyone would like to raise?
If there is no more feedback on this by Friday I'll consider this RFC accepted and merge it. (Per our process.)
I disagree that's our process; I think it's clear there was substantive disagreement from @jgraham's comments above (and I broadly agree with them).
@gsnedders the requested changes were made though, right? Do you have additional changes you'd like to see?
The part of the process I was referring to is this step:
This isn't explicit about what happens if the person requesting a change doesn't get back to the RFC within a week. An out-of-band ping would be helpful, and I didn't do that here. I'll do that now. |
I think this is probably OK, but I would also appreciate @gsnedders confirming that it addresses their concerns.
@gsnedders do you want to review/approve as well, given #114 (comment)?
@gsnedders said in web-platform-tests/wpt#34752 (comment) they'd be out this week, so I'll go ahead and declare this RFC accepted due to a lack of further feedback. Sam, feel free to raise any issues when you're back; nothing is set in stone.
This is a proposal for a larger change to the results view of wpt.fyi.
Screenshot examples (with links)
Tests with harness errors display warnings with title text next to results (link)
Interop-2022 results are viewed with an aggregation that is more indicative of actual interop-2022 scores (link)
Harness status will no longer count toward subtest totals (link)
The run in the left column is aggregated using the new method, compared to the old totals displayed in the right column. This change more accurately represents the scope of test failures.
