This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

scip-syntax: adds strict SCIP symbol parsing and formatting #63443

Merged
merged 9 commits into main from christoph/strict-scip-rust-bindings
Jul 1, 2024

Conversation

kritzcreek
Contributor

@kritzcreek kritzcreek commented Jun 24, 2024

Adds strict and performant symbol parsing/formatting for scip-syntax.

Parsing is "zero"-allocation when the symbol contains no escapes (technically it still allocates one Vec to hold the descriptors).
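The escape-aware, mostly zero-copy idea can be sketched with std's Cow (a minimal sketch, not the PR's actual code; the `unescape_descriptor` name is invented, and the doubled-backtick escape rule shown mirrors the SCIP grammar's backtick-quoted identifiers):

```rust
use std::borrow::Cow;

/// Unescape a backtick-quoted identifier: returns a borrowed slice when the
/// input contains no doubled backticks, and only allocates when it does.
/// (Hypothetical helper for illustration; `` `` `` stands for a literal backtick.)
fn unescape_descriptor(raw: &str) -> Cow<'_, str> {
    if !raw.contains("``") {
        return Cow::Borrowed(raw); // common case: no allocation
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        // collapse a doubled backtick into a single one
        if c == '`' && chars.peek() == Some(&'`') {
            chars.next();
        }
    }
    Cow::Owned(out)
}

fn main() {
    assert!(matches!(unescape_descriptor("plain_name"), Cow::Borrowed(_)));
    assert_eq!(unescape_descriptor("has``tick"), "has`tick");
}
```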

Final benchmark numbers:

symbol parsing/parse_v1/10000 [19.158 ms 19.231 ms 19.306 ms]
symbol parsing/parse_v1/100000 [252.10 ms 252.48 ms 252.86 ms]
symbol parsing/parse_v1/1000000 [2.9972 s 3.0033 s 3.0094 s]

symbol parsing/parse_v2/10000 [1.1307 ms 1.1357 ms 1.1413 ms]
symbol parsing/parse_v2/100000 [15.645 ms 15.670 ms 15.697 ms]
symbol parsing/parse_v2/1000000 [191.80 ms 192.11 ms 192.44 ms]

Test plan

Some basic unit tests. Verified manually that it produces the same symbols as the existing parser for all of chromium.scip
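The differential check described above (same output as the existing parser) pairs naturally with a parse/format roundtrip test. A toy version for the `local <id>` symbol form only (a sketch of the testing idea; real SCIP symbols also have the full `<scheme> <manager> <name> <version> <descriptors>` shape, and these helper names are invented):

```rust
/// Parse the `local <id>` symbol form, borrowing the id from the input.
fn parse_local(symbol: &str) -> Option<&str> {
    symbol.strip_prefix("local ").filter(|id| !id.is_empty())
}

/// Format a local symbol back to its string form.
fn format_local(id: &str) -> String {
    format!("local {id}")
}

fn main() {
    let input = "local 42";
    let id = parse_local(input).expect("valid local symbol");
    // Formatting the parsed symbol must reproduce the input byte-for-byte.
    assert_eq!(format_local(id), input);
    // Non-local symbols are rejected by this parser.
    assert_eq!(parse_local("global x"), None);
}
```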

@cla-bot cla-bot bot added the cla-signed label Jun 24, 2024
@github-actions github-actions bot added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) team/product-platform labels Jun 24, 2024
@kritzcreek
Contributor Author

kritzcreek commented Jun 26, 2024

I set up a small benchmark for symbol parsing. We're getting about a 5x speedup.

symbol parsing/parse    time:   [30.992 ms 31.228 ms 31.481 ms]
                        change: [-0.9727% +0.3686% +1.5880%] (p = 0.59 > 0.05)
symbol parsing/parse_v2 time:   [5.9309 ms 6.0102 ms 6.0921 ms]
                        change: [-0.4667% +1.8921% +4.1729%] (p = 0.11 > 0.05)

This is on the scip-clang index for Chromium, parsing ~10,000 symbols (the full index has about 80,000,000).

@varungandhi-src
Contributor

varungandhi-src commented Jun 27, 2024

The improved parsing time seems about an order of magnitude slower than the Go implementation, which seems counter-intuitive. Are you sure you benchmarked in release mode? These are the numbers from the Go implementation in sourcegraph/scip#258 (the code should compile now; you should be able to repro by adjusting the path to the SCIP index in the benchmark):

        Benchmark parseV2:
        	1000: 83.333µs
        	10000: 685.458µs <- compare with 6.01ms
        	100000: 8.801583ms
        	1000000: 96.723459ms

Is the extra Vec allocation causing this much overhead? Can we benchmark without it?

@kritzcreek
Contributor Author

The improved parsing time seems about an order of magnitude slower than the Go implementation, which seems counter-intuitive.

It's very possible I'm doing something stupid :D I haven't profiled once yet.

Are you sure you benchmarked in release mode?

cargo bench uses the bench profile, which is release + no debug symbols by default, so I'd think so?
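For reference (not part of the PR), Cargo's bench profile inherits the release profile; its documented defaults are equivalent to writing this out explicitly:

```toml
# Defaults Cargo applies for `cargo bench` -- shown only to confirm
# benchmarks run with full optimizations; no need to add this to Cargo.toml.
[profile.bench]
opt-level = 3
debug = false
lto = false
codegen-units = 16
```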

Is the extra Vec allocation causing this much overhead/can we benchmark without that?

I'll collect some statistics for the symbols I'm parsing here and see if pre-allocating capacity for the descriptors helps.
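One cheap variant of that experiment (a sketch, not the PR's code: the `Descriptor` type, the dot-splitting, and the capacity of 8 are all stand-ins; real SCIP descriptors carry suffix characters like `/`, `#`, and `().`):

```rust
/// Hypothetical descriptor type standing in for scip-syntax's real one.
#[derive(Debug, PartialEq)]
struct Descriptor<'a> {
    name: &'a str,
}

/// Split a dotted descriptor chain, pre-sizing the Vec so the common case
/// does a single allocation instead of growing repeatedly.
fn parse_descriptors(input: &str) -> Vec<Descriptor<'_>> {
    // 8 is a guessed upper bound for typical symbols; the Vec still grows if needed.
    let mut out = Vec::with_capacity(8);
    for part in input.split('.').filter(|p| !p.is_empty()) {
        out.push(Descriptor { name: part });
    }
    out
}

fn main() {
    let ds = parse_descriptors("chrome.browser.Profile");
    assert_eq!(ds.len(), 3);
    assert_eq!(ds[0].name, "chrome");
}
```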

@varungandhi-src
Contributor

OK, after fixing the bugs, I get a slowdown, but it's surprisingly still much faster than the Rust version.

        Benchmark parseV2:
        	1000: 274.417µs
        	10000: 1.949875ms <- compare with 6.01ms
        	100000: 26.774833ms
        	1000000: 326.905708ms

I also added a test to compare with old parsing implementation: sourcegraph/scip@3afd2e2

@kritzcreek
Contributor Author

When I change the benchmark to parse the first 10_000 symbols in the index instead (previously I parsed every (TOTAL/10,000)th element to get a "diverse" set of symbols from the index):

@@ -33,7 +33,7 @@ fn bench_symbol_parsing(c: &mut Criterion) {
     let all_symbols: Vec<String> = symbols_from_index("/Users/creek/work/scip-indices/chromium-1.scip").collect();
     let symbol_count = all_symbols.len();
     let n = 10_000;
-    let symbols: Vec<&str> = all_symbols.iter().step_by(symbol_count / n).map(|s| s.as_str()).collect();
+    let symbols: Vec<&str> = all_symbols.iter().take(n).map(|s| s.as_str()).collect();
     let mut group = c.benchmark_group("symbol parsing");
     group.bench_function("parse", |b| {
         b.iter(|| parse_symbols(&symbols))

I get timings close enough that the Vec allocations could start to matter (at least close enough that I don't feel like there's a horrible perf bug lurking):

symbol parsing/parse_v2 time:   [3.5838 ms 3.6570 ms 3.7458 ms]
                        change: [-40.615% -39.152% -37.393%] (p = 0.00 < 0.05)
                        Performance has improved.
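The two sampling strategies from that diff, in isolation (stdlib iterators only; the point is which symbols get sampled, not the iterator cost, and the helper names here are invented):

```rust
/// Every (len/n)-th symbol: a "diverse" sample spread across the whole index.
fn sample_diverse<'a>(all: &'a [String], n: usize) -> Vec<&'a str> {
    all.iter().step_by(all.len() / n).map(|s| s.as_str()).collect()
}

/// First n symbols: neighboring (often cheaper-to-parse) symbols from the
/// start of the index.
fn sample_first<'a>(all: &'a [String], n: usize) -> Vec<&'a str> {
    all.iter().take(n).map(|s| s.as_str()).collect()
}

fn main() {
    // Stand-in for the ~80M symbols in the real index.
    let all: Vec<String> = (0..100).map(|i| format!("sym{i}")).collect();
    let diverse = sample_diverse(&all, 10);
    let first = sample_first(&all, 10);
    assert_eq!(diverse.first(), Some(&"sym0"));
    assert_eq!(diverse.last(), Some(&"sym90"));
    assert_eq!(first.last(), Some(&"sym9"));
}
```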

@kritzcreek
Contributor Author

kritzcreek commented Jun 27, 2024

Okay, I tried one quick thing (VerboseError keeps a stack of active parsers around to give a nice error message):

@@ -5,7 +5,7 @@ use nom::{
     bytes::complete::{tag, take_while1},
     character::complete::char,
     combinator::{cut, eof, fail, opt},
-    error::{context, convert_error, VerboseError},
+    error::{context, Error},
     multi::many1,
     sequence::{delimited, preceded, tuple},
     Finish, IResult, Parser,
@@ -17,13 +17,12 @@ pub(super) fn parse_symbol(input: &str) -> Result<Symbol<'_>, String> {
     match parse_symbol_inner(input).finish() {
         Ok((_, symbol)) => Ok(symbol),
         Err(err) => Err(format!(
-            "Invalid symbol: '{input}'\n{}",
-            convert_error(input, err)
+            "Invalid symbol: '{input}'\n{err}",
         )),
     }
 }
 
-type PResult<'a, A> = IResult<&'a str, A, VerboseError<&'a str>>;
+type PResult<'a, A> = IResult<&'a str, A, Error<&'a str>>;
 
 fn parse_symbol_inner(input: &str) -> PResult<'_, Symbol<'_>> {
     let (input, symbol) = alt((parse_local_symbol, parse_nonlocal_symbol))(input)?;

By using the non-allocating nom::Error I get to:

symbol parsing/parse_v2 time:   [1.0488 ms 1.0513 ms 1.0538 ms]
                        change: [-71.982% -71.301% -70.708%] (p = 0.00 < 0.05)
                        Performance has improved.

Contributor

@varungandhi-src varungandhi-src left a comment


LGTM. I wish we had finished the work around having some SCIP indexes available for testing in CI to catch regressions.

@kritzcreek
Contributor Author

LGTM. I wish we had finished the work around having some SCIP indexes available for testing in CI to catch regressions.

Did we ever consider using git-lfs?

@kritzcreek
Contributor Author

Final perf numbers after adding a custom (still non-allocating) error type, to get some basic error reporting that doesn't tank the performance.

symbol parsing/parse_v1/10000 [19.158 ms 19.231 ms 19.306 ms]
symbol parsing/parse_v1/100000 [252.10 ms 252.48 ms 252.86 ms]
symbol parsing/parse_v1/1000000 [2.9972 s 3.0033 s 3.0094 s]

symbol parsing/parse_v2/10000 [1.1307 ms 1.1357 ms 1.1413 ms]
symbol parsing/parse_v2/100000 [15.645 ms 15.670 ms 15.697 ms]
symbol parsing/parse_v2/1000000 [191.80 ms 192.11 ms 192.44 ms]
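The custom error type itself isn't shown in this thread; a non-allocating design along those lines might look like the following (a sketch with invented names -- the key property is that constructing the error borrows rather than allocates, and allocation only happens if the caller formats it):

```rust
use std::fmt;

/// Invented stand-in for the PR's custom error: it borrows the failing input
/// and records a static context string, so constructing it never allocates.
#[derive(Debug, Clone, Copy)]
struct SymbolError<'a> {
    context: &'static str,
    remaining: &'a str,
}

impl fmt::Display for SymbolError<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Allocation happens only here, when the caller actually formats it.
        write!(f, "expected {} at '{}'", self.context, self.remaining)
    }
}

fn main() {
    let err = SymbolError { context: "descriptor", remaining: "!!bad" };
    assert_eq!(err.to_string(), "expected descriptor at '!!bad'");
}
```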

@kritzcreek kritzcreek changed the title WIP: adds strict SCIP symbol parsing and formatting for scip-syntax scip-syntax: adds strict SCIP symbol parsing and formatting Jul 1, 2024
@kritzcreek kritzcreek marked this pull request as ready for review July 1, 2024 04:27
@kritzcreek kritzcreek force-pushed the christoph/strict-scip-rust-bindings branch 2 times, most recently from e79d0e8 to 56a95ad Compare July 1, 2024 05:33
@kritzcreek kritzcreek force-pushed the christoph/strict-scip-rust-bindings branch from 56a95ad to 7127e6b Compare July 1, 2024 06:08
@kritzcreek kritzcreek merged commit 4d02f09 into main Jul 1, 2024
11 checks passed
@kritzcreek kritzcreek deleted the christoph/strict-scip-rust-bindings branch July 1, 2024 07:02