Start parsing the `chunks` file with serde #31
base: main
Conversation
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files

```
@@            Coverage Diff            @@
##             main      #31     +/-   ##
==========================================
- Coverage   98.52%   96.77%   -1.75%
==========================================
  Files          21       19       -2
  Lines        6772     4999    -1773
==========================================
- Hits         6672     4838    -1834
- Misses        100      161      +61
```

☔ View full report in Codecov by Sentry.
Force-pushed from 56e60a1 to e0dd890
Force-pushed from 12ee764 to bd18f58
Force-pushed from bd18f58 to 7787cbb
CodSpeed Performance Report: Merging #31 will improve performance by ×2.5
Benchmarks breakdown
Force-pushed from 54cbaa6 to 251cf8e
This implements a hand-written parser which scans through the `chunks` file line-by-line and parses the various headers and line records with serde. The most complex part here is parsing the line records. If that complexity becomes unreasonable, a hybrid approach is also possible: keep the hand-written scanner along with the simpler serde-based `header` parsers, but fall back to the existing parser-combinator based parser for the line records.
This should implement everything except for the `complexity` parser.
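A minimal sketch of the scanning approach described above (a rough illustration, not the PR's actual code: the separator string, header detection, and record shape are assumptions):

```rust
use std::io::{BufRead, BufReader, Read};

use serde::Deserialize;
use serde_json::Value;

/// Simplified stand-in for a line record; the real format has more
/// positional fields (sessions, datapoints, complexity, ...).
#[derive(Deserialize)]
struct LineRecord(Value, #[serde(default)] Option<String>);

fn parse_chunks<R: Read>(input: R) -> Result<(), Box<dyn std::error::Error>> {
    for line in BufReader::new(input).lines() {
        let line = line?;
        match line.as_str() {
            // Blank lines stand in for source lines without coverage data.
            "" => {}
            // Chunks are separated by a fixed marker line.
            "<<<<< end_of_chunk >>>>>" => { /* start a new chunk */ }
            // Header lines are JSON objects; serde parses them directly.
            l if l.starts_with('{') => {
                let _header: Value = serde_json::from_str(l)?;
            }
            // Everything else is a line record: a JSON array that serde
            // deserializes into a tuple struct.
            l => {
                let _record: LineRecord = serde_json::from_str(l)?;
            }
        }
    }
    Ok(())
}
```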
Force-pushed from ca167a5 to 56391d0
there is much less code to read here, but a huge part of what was deleted is tests, and tbh i think what remains is way less readable and way less useful for understanding the format :v

do you have insight into why this is faster? if i am remembering my previous profiling right, a significant majority of the time was spent in SQLite inserts, and parsing string labels was slow but not slow enough to account for a reduction from 13s to 5s
could `LineRecord` and `ReportLine` be the same type? could `chunks::CoverageDatapoint` and `types::CoverageDatapoint` be the same type? same for `LineSession`. i figured the dream of using serde for this was a single set of types that provided (de)serialization mostly for free, but i don't actually think this implementation has reduced complexity much
- [`winnow`](https://crates.io/crates/winnow), a parser combinator framework (fork of [`nom`](https://crates.io/crates/nom))
  - `winnow`'s docs illustrate [how one can write a streaming parser](https://docs.rs/winnow/latest/winnow/_topic/partial/index.html)
we might not use winnow anymore, but it is still an option we might use for other parsers in the future (seems like it would be very clean for something like lcov, for example). we also don't use `quick_xml` but that's still in this list
@@ -64,6 +65,7 @@ Non-XML formats lack clean OOTB support for streaming so `codecov-rs` currently

### Testing

Run tests with:

```
# Rust tests
$ cargo test
```
should this be `cargo nextest`?
```rust
// Replace our mmap handle so the first one can be unmapped
let chunks_file = unsafe { Mmap::map(chunks_file)? };
```
comment no longer accurate, the `report_json_file` handle is still in scope so it has not necessarily been unmapped. but because of non-lexical lifetimes this probably was not necessary to begin with
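For reference, a minimal sketch of making the unmap explicit (hypothetical function and variable names, using the `memmap2` crate):

```rust
use std::fs::File;

use memmap2::Mmap;

// Dropping the first handle explicitly makes the unmap visible,
// instead of relying on scope or non-lexical lifetimes.
fn map_both(report_json_file: &File, chunks_file: &File) -> std::io::Result<()> {
    let report_json = unsafe { Mmap::map(report_json_file)? };
    // ... parse the report JSON ...
    drop(report_json); // unmap before mapping the chunks file
    let chunks = unsafe { Mmap::map(chunks_file)? };
    // ... parse the chunks file ...
    let _ = chunks;
    Ok(())
}
```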
```rust
//! This parser performs all the writes it can to the output
//! stream and only returns a [`ReportLine`] for tests. The
//! `report_line_or_empty` parser which wraps this and supports empty lines
//! returns `Ok(())`.
```
this is not accurate as a module-level comment. `report_line_or_empty` wraps `report_line`, not the module
```rust
coverage: session.1,
branches: session.2.into(),
partials: session.3.into(),
complexity: None, // TODO
```
?
```rust
impl From<CoverageDatapoint> for types::CoverageDatapoint {
    fn from(datapoint: CoverageDatapoint) -> Self {
        Self {
            session_id: datapoint.0,
            _coverage: datapoint.1,
            _coverage_type: datapoint.2,
            labels: datapoint.3.unwrap_or_default(),
        }
    }
}
```
why do these need to be different types?
```rust
if let Some(condition) = v.strip_suffix(":jump") {
    let condition: u32 = condition.parse().map_err(|_| invalid())?;

    // TODO(swatinem): can we skip saving the `jump` here?
```
i am not positive there is value in saving the specific branch information at all if it is not in `Line` format. but i have not dug into the issue, and i was trying to make as few mutations as possible when converting a pyreport to sqlite and back, to minimize the difficulty of automated validation
chatted offline about the perf improvement. in practice it is not 2.5x better, but it is still a solid improvement

base takes 2.5s in the kernel and just over 7 seconds in userspace. this commit takes the same time in the kernel but just over 6 seconds in userspace, around a second better
Indeed, the reason we have two different versions of the structs is that serde is a bit limited in its ability to control the specific de/serialized format for structs. In particular, it's not possible to de/serialize a normal struct with named fields as a JSON array. That's why I have tuple structs which de/serialize to JSON arrays, and then convert those into proper structs with named fields to work with. It's a bit unfortunate, but actually not that uncommon a pattern: you frequently have different structs purely for de/serialization, and separate ones to actually work with internally (a sketch of this pattern follows at the end of this comment).

As for the perf: looking at the flamegraphs, the runtime can be split into 3 distinct phases:
It's interesting that after speeding up the parser itself, the second step ends up taking ~50% of the time. That looks a bit unreasonable, but does make sense when digging a bit deeper into what's actually happening.

Given that we are in a much better spot perf-wise with the current Python code after some architecture improvements and low-hanging fruit, and given that we are not putting much effort into the Rust work right now, we might revisit this whole approach a bit down the road. In particular, I would love to just kill the existing

Afterwards, we can revisit this again and make it fully intentional to split this into structs aimed at de/serialization that follow the shape of the serialized JSON, and structs that actually encode the logical data with more appropriate data types.
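A minimal sketch of the tuple-struct pattern described above, with hypothetical names loosely mirroring `CoverageDatapoint` (the field types are assumptions, not the PR's actual definitions):

```rust
use serde::Deserialize;

/// Wire-shaped type: serde de/serializes a tuple struct as a JSON array.
#[derive(Deserialize)]
struct RawDatapoint(u32, Option<f64>, Option<String>, Option<Vec<String>>);

/// Internal type with named fields, converted from the wire type.
struct Datapoint {
    session_id: u32,
    coverage: Option<f64>,
    coverage_type: Option<String>,
    labels: Vec<String>,
}

impl From<RawDatapoint> for Datapoint {
    fn from(raw: RawDatapoint) -> Self {
        Self {
            session_id: raw.0,
            coverage: raw.1,
            coverage_type: raw.2,
            labels: raw.3.unwrap_or_default(),
        }
    }
}

fn main() -> Result<(), serde_json::Error> {
    // The JSON array maps onto the tuple struct positionally.
    let raw: RawDatapoint = serde_json::from_str(r#"[0, 1.0, null, ["label"]]"#)?;
    let dp: Datapoint = raw.into();
    assert_eq!(dp.labels, vec!["label"]);
    Ok(())
}
```

serde's `#[serde(from = "RawDatapoint")]` attribute can hide the wire type behind the named-field one, but the two definitions still have to exist.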