
Evaluate test scoring methods #567

Open
foolip opened this issue Aug 5, 2024 · 2 comments
Labels
community: Issues seeking input from the community on project direction, policy, or decision-making

Comments


foolip commented Aug 5, 2024

On https://webstatus.dev/ and feature details pages like https://webstatus.dev/features/dialog we show a test score between 0 and 100% based on WPT results.

The current approach is to divide the number of passing subtests by the number of known subtests, the same as the default wpt.fyi view. Let's evaluate how well that works, and compare it to other scoring methods.

Desirable properties:

  • Correlates with implementation quality as judged by web developers
  • Correlates with implementation completeness as judged by browser engineers
  • Easy to explain and understand

The options are listed below, along with their wpt.fyi URL query parameters. (Note that the URLs aren't exactly right and include tentative tests, working around web-platform-tests/wpt.fyi#3930 to make comparison possible.)

Passing subtests (view=subtest)

This method counts passing subtests across all tests and divides by the total number of known subtests.

Example: 225 / 258 = 87%

Pros:

  • Matches the default view of wpt.fyi.
  • Easy to explain assuming familiarity with WPT's tests/subtests.

Cons:

  • The total number of subtests is only known once all subtests are passing, and it often differs across browsers. (Not easy to understand.)
  • The harness status sometimes counts and sometimes doesn't. (Not easy to understand.)
  • Fixing a timeout or subtest can cause new failing subtests to appear, reducing the score. (Does not correlate with improvement.)
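
A rough sketch of this calculation (the per-test results here are made-up placeholders for illustration, not webstatus.dev's actual code or data):

```python
# Hypothetical per-test results: (passing subtests, known subtests) per test file.
results = [
    (10, 10),  # every subtest passes
    (5, 8),    # some subtests fail
    (0, 12),   # e.g. a timeout: no subtests pass
]

# view=subtest pools every subtest across all test files into one ratio.
passing = sum(p for p, _ in results)
known = sum(k for _, k in results)
print(f"view=subtest: {passing} / {known} = {passing / known:.0%}")  # 15 / 30 = 50%
```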

Partially passing tests (view=interop)

This method gives each test partial credit for its fraction of passing subtests, then divides by the total number of tests.

Example: 105.12 / 109 = 96%

Pros:

  • Total number of tests (the denominator) is easy to explain and understand

Cons:

  • Fixing a timeout or subtest can cause new failing subtests to appear, reducing the score. (But the effect is smaller than for view=subtest.)
  • Linking to view=interop would likely cause confusion, as the view is named for the Interop project. (Renaming/aliasing the URL query parameter would address this.)
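
A sketch of the same calculation for view=interop, on the same made-up results: each test contributes at most 1 point, weighted by its own fraction of passing subtests, and the denominator is simply the number of test files.

```python
# Same hypothetical per-test results as in the view=subtest sketch above.
results = [(10, 10), (5, 8), (0, 12)]

# Each test contributes its own pass fraction (between 0 and 1).
credit = sum(p / k for p, k in results)
tests = len(results)
print(f"view=interop: {credit:.2f} / {tests} = {credit / tests:.0%}")  # 1.62 / 3 = 54%
```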

Fully passing tests (view=test)

Example: 102 / 109 = 94%

Pros:

  • Fully passing test is a simple rule
  • Total number of tests (the denominator) is easy to explain and understand

Cons:

  • Fixing a subtest doesn't count unless all subtests pass. (Does not correlate with improvement.)
  • Similarly, introducing a single failing subtest in a previously passing test has a large effect.
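
And the fully-passing rule on the same made-up results, which makes the spread between the three methods (50% vs. 54% vs. 33% on this toy data) easy to see:

```python
# Same hypothetical per-test results as above.
results = [(10, 10), (5, 8), (0, 12)]

# A test only counts if every one of its subtests passes.
fully_passing = sum(1 for p, k in results if p == k)
tests = len(results)
print(f"view=test: {fully_passing} / {tests} = {fully_passing / tests:.0%}")  # 1 / 3 = 33%
```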

Next steps

Evaluate how well each method corresponds with feature completeness/quality, by taking a random sample of features and listing what the scores would be. Things to consider:

  • What does the score tend to be for features not supported at all? (Closer to 0 is better.)
  • What does the score tend to be for features browser engineers and web developers think are complete? (Closer to 100 is better, and below 80 or 90 is bad.)
  • What does the score tend to be for in-development features? (Exact score is not important, but an even progression is better.)
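
One possible shape for that tabulation, sketched below; every feature name, label, and score is a made-up placeholder, and the real input would come from the sampled features and their wpt.fyi results:

```python
from statistics import mean

# Placeholder data: (feature, hand-assigned status, subtest %, interop %, test %).
sample = [
    ("feature-a", "not supported",  10,  4,  2),
    ("feature-b", "complete",       82, 97, 93),
    ("feature-c", "in development", 60, 71, 55),
]

# For each status bucket, look at what each scoring method tends to report.
for status in ("not supported", "complete", "in development"):
    rows = [r for r in sample if r[1] == status]
    for method, idx in (("subtest", 2), ("interop", 3), ("test", 4)):
        print(f"{status:14} {method:8} mean: {mean(r[idx] for r in rows):.0f}%")
```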

cc @gsnedders @jgraham since we have discussed test scoring many times over the years, most recently in web-platform-tests/rfcs#190.

@jcscottiii pinned this issue Aug 5, 2024
@jcscottiii added the community label Sep 30, 2024

mfreed7 commented Oct 28, 2024

I'd like to chime in with an opinion on this, taking Anchor Positioning as a great canonical example. I say it's great because a) it has many tests, and it's a sizable feature, b) the Blink implementation is basically complete and shipped, so the correct expectation would be to have a relatively high score, and c) it shows the nuances of the three scoring methods quite well in my opinion.

How do the three methods score Anchor Positioning for Chrome?

  1. view=interop: 242.82 / 253 = 96.0%
  2. view=subtest: 9839 / 12674 = 77.6%
  3. view=test: 159 / 253 = 62.8%

First, a quick comment that it appears there's a sizable bug in view=test mode, as seen above: many tests appear to receive a "fail" even though their only subtest passes.

The sizable difference (96% vs. 78%) between interop view and subtest view in this case is roughly down to just one test file, which contains 4291 subtests. Note that there are a total of 253 test files, testing all of the many, many aspects of this large feature, and a total of 12674 subtests. This one test file, which only tests the parsing behavior for a single CSS function, was written in a very diligent way: it loops through all permutations of name, property, size, and fallback, and ensures the complete functionality of this one small aspect of parsing. That's awesome, and should be encouraged. However, because this one small part of the much larger overall feature has been exhaustively tested, the subtest scoring methodology suddenly attributes a whopping 4291 / 12674 = 33.9% of the overall Anchor Positioning score to this small part of the feature. I can roughly guarantee that web developers do not share the view that the corner-case anchor-size() parsing behavior amounts to 34% of the utility of the overall feature.
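
To put a number on that weighting difference (just the arithmetic with the counts above, not the actual scoring code):

```python
# Counts quoted above for Anchor Positioning in Chrome.
total_tests = 253
total_subtests = 12674
parsing_file_subtests = 4291  # the one exhaustive parsing test file

# Under view=subtest, that single file carries about a third of the score.
print(f"view=subtest weight of one file: {parsing_file_subtests / total_subtests:.1%}")  # 33.9%

# Under view=interop, every file carries the same 1/253 of the score,
# no matter how many subtests it contains.
print(f"view=interop weight of one file: {1 / total_tests:.1%}")  # 0.4%
```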

Essentially, in the abstract, all three scoring methods sound like they could be plausible ways to tabulate these scores. However, in practice, engineers tend to write WPTs as "file-per-functionality". That is, they implement a particular part of a feature, and test that part with a new test file. Sometimes, within that test file, they decide to break the test into subtests, and sometimes they're all lumped into a single subtest. Either way, that one file represents the piece of functionality being tested. There are, of course, exceptions to this. But in my experience, it's the most common way to test features. We should therefore be scoring via the same idea: one file tests one part of the feature. Subtests are just that, a sub-part of the test, not a standalone test of something important.

The correct choice, given the above, is therefore view=interop. I believe it meets all three of your criteria:

  • Correlates with implementation quality as judged by web developers
  • Correlates with implementation completeness as judged by browser engineers
  • Easy to explain and understand

@rachelandrew

I was just talking about this with Kadir, and I think the current way things are displayed on the dashboard could have the inadvertent negative effect of discouraging people from writing comprehensive tests with subtests, if doing so makes the feature look more poorly implemented than it is.

I'm not sure how view=interop is worked out. Is that based on manually flagging things as in or out?

Thinking about all the partial implementations that exist, it seems to me that there will need to be some ability to manually flag subtests as irrelevant. For example, in CSS the two-value syntax for display is interoperable, but the percentages are low. Digging into it, the majority of failing subtests are for the run-in value of display. No browser implements this value (although IE11 did), so it's completely irrelevant for developers that the value isn't supported by the two-value syntax. If it did get implemented, though, we'd want to make sure the tests pass here.
