Start parsing the chunks file with serde #31

Open
wants to merge 9 commits into main from swatinem/parse-chunks-serde
Conversation

Swatinem
Collaborator

@Swatinem Swatinem commented Sep 3, 2024

This implements a hand-written parser which scans through the chunks file line-by-line, and parses the various headers and line records with serde.

@Swatinem Swatinem self-assigned this Sep 3, 2024

codecov bot commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 76.67845% with 66 lines in your changes missing coverage. Please review.

Project coverage is 96.77%. Comparing base (06b8472) to head (56391d0).

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/src/parsers/pyreport/chunks.rs | 75.29% | 63 Missing ⚠️ |
| core/src/report/pyreport/types.rs | 85.71% | 2 Missing ⚠️ |
| core/src/report/models.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main      #31      +/-   ##
==========================================
- Coverage   98.52%   96.77%   -1.75%     
==========================================
  Files          21       19       -2     
  Lines        6772     4999    -1773     
==========================================
- Hits         6672     4838    -1834     
- Misses        100      161      +61     
```
| Components | Coverage Δ |
| --- | --- |
| core | 96.87% <76.67%> (-1.74%) ⬇️ |
| bindings | 79.31% <ø> (ø) |
| python | 100.00% <ø> (ø) |


Base automatically changed from swatinem/bench-chunks to main on September 4, 2024 07:19
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch 3 times, most recently from 56e60a1 to e0dd890 on September 4, 2024 10:06
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch from 12ee764 to bd18f58 on September 18, 2024 09:36
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch from bd18f58 to 7787cbb on November 12, 2024 12:36
@Swatinem Swatinem changed the base branch from main to swatinem/codspeed-bench on November 12, 2024 12:37
@Swatinem Swatinem marked this pull request as ready for review on November 14, 2024 12:44

codspeed-hq bot commented Nov 14, 2024

CodSpeed Performance Report

Merging #31 will improve performance by ×2.5

Comparing swatinem/parse-chunks-serde (ca167a5) with swatinem/codspeed-bench (54cbaa6)

Summary

⚡ 2 improvements
✅ 2 untouched benchmarks

Benchmarks breakdown

| Benchmark | swatinem/codspeed-bench | swatinem/parse-chunks-serde | Change |
| --- | --- | --- | --- |
| complex_chunks | 13.9 s | 5.5 s | ×2.5 |
| simple_chunks | 62.1 µs | 35.7 µs | +74.01% |

@Swatinem Swatinem force-pushed the swatinem/codspeed-bench branch from 54cbaa6 to 251cf8e on November 19, 2024 22:32
Base automatically changed from swatinem/codspeed-bench to main on November 19, 2024 22:33
This implements a hand-written parser which scans through the `chunks` file line-by-line, and parses the various headers and line records with serde.

The most complex part here is parsing the line records.
If that complexity starts to be unreasonable, a hybrid approach is also possible in which the hand-written parser is used along with the simpler serde-based `header` parsers, while still falling back to the existing parser-combinator-based parser for the line records.
This should implement everything except for the `complexity` parser.
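
For illustration, here is a minimal sketch of the line-by-line scanning approach described above. The type names, the dispatch rules, and the record shapes are simplified assumptions for the example, not the PR's actual code:

```rust
use serde::Deserialize;

// Hypothetical line record: serialized as a bare JSON array such as
// `[1, "b"]`, so a tuple struct can deserialize it directly.
#[derive(Debug, Deserialize)]
struct LineRecord(Option<i64>, Option<String>);

fn parse_chunks(input: &str) -> Result<Vec<LineRecord>, serde_json::Error> {
    let mut records = Vec::new();
    for line in input.lines() {
        match line {
            // Empty lines represent lines without coverage data.
            "" => continue,
            // Header lines carry JSON objects; a serde-based header parser
            // would be dispatched here (elided in this sketch).
            l if l.starts_with('{') => {}
            // Everything else is treated as a line record and parsed with serde.
            l => records.push(serde_json::from_str(l)?),
        }
    }
    Ok(records)
}
```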
@Swatinem Swatinem force-pushed the swatinem/parse-chunks-serde branch from ca167a5 to 56391d0 on November 19, 2024 22:42
Collaborator

@matt-codecov matt-codecov left a comment


there is much less code to read here, but a huge part of what was deleted is tests, and tbh i think what remains is way less readable and way less useful for understanding the format :v

do you have insight into why this is faster? if i am remembering my previous profiling right, a significant majority of time was spent in SQLite inserts, and parsing string labels was slow but not slow enough to account for a reduction from 13s to 5s

could LineRecord and ReportLine be the same type? could chunks::CoverageDatapoint and types::CoverageDatapoint be the same type? same for LineSession. i figured the dream of using serde for this was a single set of types that provided (de)serialization mostly for free, but i don't actually think this implementation has reduced complexity much

Comment on lines -57 to -58
- [`winnow`](https://crates.io/crates/winnow), a parser combinator framework (fork of [`nom`](https://crates.io/crates/nom))
- `winnow`'s docs illustrate [how one can write a streaming parser](https://docs.rs/winnow/latest/winnow/_topic/partial/index.html)
Collaborator


we might not use winnow anymore but it is still an option we might use for other parsers in the future (seems like it would be very clean for something like lcov for example). we also don't use quick_xml but that's still in this list

@@ -64,6 +65,7 @@ Non-XML formats lack clean OOTB support for streaming so `codecov-rs` currently
### Testing

Run tests with:

```
# Rust tests
$ cargo test
```
Collaborator


should this be cargo nextest?

Comment on lines +47 to +48
// Replace our mmap handle so the first one can be unmapped
let chunks_file = unsafe { Mmap::map(chunks_file)? };
Collaborator


comment no longer accurate, the report_json_file handle is still in scope so it has not necessarily been unmapped. but because of non-lexical lifetimes this probably was not necessary to begin with

Comment on lines +32 to +35
//! This parser performs all the writes it can to the output
//! stream and only returns a [`ReportLine`] for tests. The
//! `report_line_or_empty` parser which wraps this and supports empty lines
//! returns `Ok(())`.
Collaborator


this is not accurate as a module-level comment. report_line_or_empty wraps report_line, not the module

coverage: session.1,
branches: session.2.into(),
partials: session.3.into(),
complexity: None, // TODO
Collaborator


?

Comment on lines +417 to +426
impl From<CoverageDatapoint> for types::CoverageDatapoint {
fn from(datapoint: CoverageDatapoint) -> Self {
Self {
session_id: datapoint.0,
_coverage: datapoint.1,
_coverage_type: datapoint.2,
labels: datapoint.3.unwrap_or_default(),
}
}
}
Collaborator


why do these need to be different types?

if let Some(condition) = v.strip_suffix(":jump") {
let condition: u32 = condition.parse().map_err(|_| invalid())?;

// TODO(swatinem): can we skip saving the `jump` here?
Collaborator


i am not positive there is value in saving the specific branch information at all if it is not in Line format. but i have not dug into the issue, and i was trying to make as few mutations as possible when converting a pyreport to sqlite and back to minimize the difficulty of automated validation

@matt-codecov
Collaborator

chatted offline about the perf improvement. in practice it is not 2.5x better but it is still a solid improvement

/usr/bin/time -plh cargo run --example parse_pyreport --release -- ../sample_data/worker-c71ddfd4cb1753c7a540e5248c2beaa079fc3341/report_json.json ../sample_data/worker-c71ddfd4cb1753c7a540e5248c2beaa079fc3341/chunks.txt c71ddfd.sqlite

base takes 2.5s in the kernel and just over 7 seconds in userspace. this commit takes the same time in the kernel but just over 6 seconds in userspace, around a second better

@Swatinem
Collaborator Author

> i figured the dream of using serde for this was a single set of types that provided (de)serialization mostly for free, but i don't actually think this implementation has reduced complexity much

Indeed, the reason why we have two different versions of the structs is that serde is a bit limited in its ability to control the specific de/serialized format for structs.

In particular, it's not possible to de/serialize a normal struct with named fields as a JSON array. That's why I have tuple structs which de/serialize to JSON arrays, and then convert those to proper structs with named fields to work with.

It's a bit unfortunate, but actually not that uncommon of a pattern. You frequently have separate structs purely for de/serialization, and others to actually work with internally.
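
For illustration, here is a minimal standalone version of that pattern; the names and field layout are made up for the example, not the PR's actual types. A tuple struct is what serde reads from the JSON array, and a `From` impl converts it into a named-field struct for internal use:

```rust
use serde::Deserialize;

// Wire type: a tuple struct, so serde deserializes it from a JSON array
// like `[0, 1, "b", ["label"]]`.
#[derive(Deserialize)]
struct RawDatapoint(u32, Option<i64>, Option<String>, Option<Vec<String>>);

// Internal type with named fields, nicer to work with.
#[derive(Debug)]
struct Datapoint {
    session_id: u32,
    coverage: Option<i64>,
    coverage_type: Option<String>,
    labels: Vec<String>,
}

impl From<RawDatapoint> for Datapoint {
    fn from(raw: RawDatapoint) -> Self {
        Self {
            session_id: raw.0,
            coverage: raw.1,
            coverage_type: raw.2,
            labels: raw.3.unwrap_or_default(),
        }
    }
}

fn main() -> Result<(), serde_json::Error> {
    // Parse the array-shaped wire format, then convert to the named struct.
    let raw: RawDatapoint = serde_json::from_str(r#"[0, 1, "b", ["label"]]"#)?;
    let datapoint = Datapoint::from(raw);
    println!("{datapoint:?}");
    Ok(())
}
```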


As for the perf: looking at the flamegraphs, the runtime can be split into 3 distinct phases:

  • parsing the format
  • massaging the structs around into a format thats insertable into sqlite
  • (actually doing sqlite inserts, which is being skipped in the benchmarks)

It's interesting that after speeding up the parser itself, the second step ends up taking ~50% of the time. That looks a bit unreasonable, but it does make sense when digging a bit deeper into what's actually happening.


Given that we are in a much better spot perf-wise with the current Python code after some architecture improvements and low-hanging fruit, and given that we are not putting much effort into the Rust work right now, we might revisit this whole approach a bit down the road.

In particular, I would love to just kill the existing Datapoints altogether, which I believe make up a big part of the file size, parsing time, and complexity.

Afterwards, we can revisit this and make it fully intentional to split things into structs aimed at de/serialization that follow the shape of the serialized JSON, and structs that actually encode the logical data with more appropriate data types.
