Start parsing the chunks file with serde #31

Open
wants to merge 9 commits into main from swatinem/parse-chunks-serde
Conversation

Swatinem
Collaborator

@Swatinem Swatinem commented Sep 3, 2024

This implements a hand-written parser which scans through the chunks file line-by-line, and parses the various headers and line records with serde.

@Swatinem Swatinem self-assigned this Sep 3, 2024

codecov bot commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 76.67845% with 66 lines in your changes missing coverage. Please review.

Project coverage is 96.77%. Comparing base (06b8472) to head (56391d0).

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/src/parsers/pyreport/chunks.rs | 75.29% | 63 Missing ⚠️ |
| core/src/report/pyreport/types.rs | 85.71% | 2 Missing ⚠️ |
| core/src/report/models.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main      #31      +/-   ##
==========================================
- Coverage   98.52%   96.77%   -1.75%     
==========================================
  Files          21       19       -2     
  Lines        6772     4999    -1773     
==========================================
- Hits         6672     4838    -1834     
- Misses        100      161      +61     
```
| Components | Coverage Δ |
| --- | --- |
| core | 96.87% <76.67%> (-1.74%) ⬇️ |
| bindings | 79.31% <ø> (ø) |
| python | 100.00% <ø> (ø) |


Base automatically changed from swatinem/bench-chunks to main on September 4, 2024 07:19
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch 3 times, most recently from 56e60a1 to e0dd890 on September 4, 2024 10:06
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch from 12ee764 to bd18f58 on September 18, 2024 09:36
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch from bd18f58 to 7787cbb on November 12, 2024 12:36
@Swatinem Swatinem changed the base branch from main to swatinem/codspeed-bench on November 12, 2024 12:37
@Swatinem Swatinem marked this pull request as ready for review on November 14, 2024 12:44

codspeed-hq bot commented Nov 14, 2024

CodSpeed Performance Report

Merging #31 will improve performance by ×2.5

Comparing swatinem/parse-chunks-serde (ca167a5) with swatinem/codspeed-bench (54cbaa6)

Summary

⚡ 2 improvements
✅ 2 untouched benchmarks

Benchmarks breakdown

| Benchmark | swatinem/codspeed-bench | swatinem/parse-chunks-serde | Change |
| --- | --- | --- | --- |
| complex_chunks | 13.9 s | 5.5 s | ×2.5 |
| simple_chunks | 62.1 µs | 35.7 µs | +74.01% |

@Swatinem Swatinem force-pushed the swatinem/codspeed-bench branch from 54cbaa6 to 251cf8e on November 19, 2024 22:32
Base automatically changed from swatinem/codspeed-bench to main on November 19, 2024 22:33
This implements a hand-written parser which scans through the `chunks` file line-by-line, and parses the various headers and line records with serde.

The most complex part here is parsing the line records.
If that complexity starts to be unreasonable, a hybrid approach is also possible in which the hand-written parser is used along with the simpler serde-based `header` parsers, while still falling back to the existing parser-combinator-based parser for the line records.
This should implement everything except for the `complexity` parser.
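
For illustration, here is a minimal sketch of the line-by-line scanning approach described above. The type names, the dispatch rules, and the record shapes are simplified assumptions for the example, not the PR's actual code:

```rust
use serde::Deserialize;

// Hypothetical line record: serialized as a bare JSON array such as
// `[1, "b"]`, so a tuple struct can deserialize it directly.
#[derive(Debug, Deserialize)]
struct LineRecord(Option<i64>, Option<String>);

fn parse_chunks(input: &str) -> Result<Vec<LineRecord>, serde_json::Error> {
    let mut records = Vec::new();
    for line in input.lines() {
        match line {
            // Empty lines represent lines without coverage data.
            "" => continue,
            // Header lines carry JSON objects; a serde-based header parser
            // would be dispatched here (elided in this sketch).
            l if l.starts_with('{') => {}
            // Everything else is treated as a line record and parsed with serde.
            l => records.push(serde_json::from_str(l)?),
        }
    }
    Ok(records)
}
```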
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch from ca167a5 to 56391d0 on November 19, 2024 22:42
Collaborator

@matt-codecov matt-codecov left a comment


there is much less code to read here, but a huge part of what was deleted is tests, and tbh i think what remains is way less readable and way less useful for understanding the format :v

do you have insight into why this is faster? if i am remembering my previous profiling right, a significant majority of time was spent in SQLite inserts, and parsing string labels was slow but not slow enough to account for a reduction from 13s to 5s

could LineRecord and ReportLine be the same type? could chunks::CoverageDatapoint and types::CoverageDatapoint be the same type? same for LineSession. i figured the dream of using serde for this was a single set of types that provided (de)serialization mostly for free, but i don't actually think this implementation has reduced complexity much

Comment on lines -57 to -58
- [`winnow`](https://crates.io/crates/winnow), a parser combinator framework (fork of [`nom`](https://crates.io/crates/nom))
- `winnow`'s docs illustrate [how one can write a streaming parser](https://docs.rs/winnow/latest/winnow/_topic/partial/index.html)
Collaborator


we might not use winnow anymore but it is still an option we might use for other parsers in the future (seems like it would be very clean for something like lcov for example). we also don't use quick_xml but that's still in this list

@@ -64,6 +65,7 @@ Non-XML formats lack clean OOTB support for streaming so `codecov-rs` currently
### Testing

Run tests with:

```
# Rust tests
$ cargo test
```
Collaborator


should this be cargo nextest?

Comment on lines +47 to +48
// Replace our mmap handle so the first one can be unmapped
let chunks_file = unsafe { Mmap::map(chunks_file)? };
Collaborator


comment no longer accurate, the report_json_file handle is still in scope so it has not necessarily been unmapped. but because of non-lexical lifetimes this probably was not necessary to begin with

Comment on lines +32 to +35
//! This parser performs all the writes it can to the output
//! stream and only returns a [`ReportLine`] for tests. The
//! `report_line_or_empty` parser which wraps this and supports empty lines
//! returns `Ok(())`.
Collaborator


this is not accurate as a module-level comment. report_line_or_empty wraps report_line, not the module

coverage: session.1,
branches: session.2.into(),
partials: session.3.into(),
complexity: None, // TODO
Collaborator


?

Comment on lines +417 to +426
impl From<CoverageDatapoint> for types::CoverageDatapoint {
fn from(datapoint: CoverageDatapoint) -> Self {
Self {
session_id: datapoint.0,
_coverage: datapoint.1,
_coverage_type: datapoint.2,
labels: datapoint.3.unwrap_or_default(),
}
}
}
Collaborator


why do these need to be different types?

if let Some(condition) = v.strip_suffix(":jump") {
let condition: u32 = condition.parse().map_err(|_| invalid())?;

// TODO(swatinem): can we skip saving the `jump` here?
Collaborator


i am not positive there is value in saving the specific branch information at all if it is not in Line format. but i have not dug into the issue, and i was trying to make as few mutations as possible when converting a pyreport to sqlite and back to minimize the difficulty of automated validation

@matt-codecov
Collaborator

chatted offline about the perf improvement. in practice it is not 2.5x better but it is still a solid improvement

/usr/bin/time -plh cargo run --example parse_pyreport --release -- ../sample_data/worker-c71ddfd4cb1753c7a540e5248c2beaa079fc3341/report_json.json ../sample_data/worker-c71ddfd4cb1753c7a540e5248c2beaa079fc3341/chunks.txt c71ddfd.sqlite

base takes 2.5s in the kernel and just over 7 seconds in userspace. this commit takes the same time in the kernel but just over 6 seconds in userspace, around a second better

@Swatinem
Collaborator Author

> i figured the dream of using serde for this was a single set of types that provided (de)serialization mostly for free, but i don't actually think this implementation has reduced complexity much

Indeed, the reason why we have two different versions of the structs is that serde is a bit limited in its ability to control the specific de/serialized format for structs.

In particular, it's not possible to de/serialize a normal struct with named fields as a JSON array. That's why I have tuple structs which de/serialize to JSON arrays, and then convert those to proper structs with named fields to work with.

It's a bit unfortunate, but actually not that uncommon of a pattern. You frequently have separate structs purely for de/serialization, and others to actually work with internally.
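
For illustration, here is a minimal standalone version of that pattern; the names and field layout are made up for the example, not the PR's actual types. A tuple struct is what serde reads from the JSON array, and a `From` impl converts it into a named-field struct for internal use:

```rust
use serde::Deserialize;

// Wire type: a tuple struct, so serde deserializes it from a JSON array
// like `[0, 1, "b", ["label"]]`.
#[derive(Deserialize)]
struct RawDatapoint(u32, Option<i64>, Option<String>, Option<Vec<String>>);

// Internal type with named fields, nicer to work with.
#[derive(Debug)]
struct Datapoint {
    session_id: u32,
    coverage: Option<i64>,
    coverage_type: Option<String>,
    labels: Vec<String>,
}

impl From<RawDatapoint> for Datapoint {
    fn from(raw: RawDatapoint) -> Self {
        Self {
            session_id: raw.0,
            coverage: raw.1,
            coverage_type: raw.2,
            labels: raw.3.unwrap_or_default(),
        }
    }
}

fn main() -> Result<(), serde_json::Error> {
    // Parse the array-shaped wire format, then convert to the named struct.
    let raw: RawDatapoint = serde_json::from_str(r#"[0, 1, "b", ["label"]]"#)?;
    let datapoint = Datapoint::from(raw);
    println!("{datapoint:?}");
    Ok(())
}
```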


As for the perf: looking at the flamegraphs, the runtime can be split into 3 distinct phases:

  • parsing the format
  • massaging the structs around into a format thats insertable into sqlite
  • (actually doing sqlite inserts, which is being skipped in the benchmarks)

It's interesting that after speeding up the parser itself, the second step ends up taking ~50% of the time. That looks a bit unreasonable, but it does make sense when digging a bit deeper into what's actually happening.


Given that we are in a much better spot perf-wise with the current Python code after some architecture improvements and low-hanging fruit, and given that we are not putting much effort into the Rust work right now, we might revisit this whole approach a bit down the road.

In particular, I would love to just kill the existing Datapoints altogether, which I believe make up a big part of the file size, parsing time, and complexity.

Afterwards, we can revisit this and make it fully intentional to split things into structs aimed at de/serialization that follow the shape of the serialized JSON, and structs that actually encode the logical data with more appropriate data types.
