This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

scip-syntax: adds strict SCIP symbol parsing and formatting #63443

Merged
merged 9 commits into main from christoph/strict-scip-rust-bindings
Jul 1, 2024

Conversation

kritzcreek
Contributor

@kritzcreek kritzcreek commented Jun 24, 2024

Adds strict and performant symbol parsing/formatting for scip-syntax.

Parsing is "zero"-allocation when the symbol contains no escapes (technically it still allocates one Vec to hold the descriptors).
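The escape-aware, mostly zero-copy idea can be sketched with std's Cow (a minimal sketch, not the PR's actual code; the `unescape_descriptor` name is invented, and the doubled-backtick escape rule shown mirrors the SCIP grammar's backtick-quoted identifiers):

```rust
use std::borrow::Cow;

/// Unescape a backtick-quoted identifier: returns a borrowed slice when the
/// input contains no doubled backticks, and only allocates when it does.
/// (Hypothetical helper for illustration; `` `` `` stands for a literal backtick.)
fn unescape_descriptor(raw: &str) -> Cow<'_, str> {
    if !raw.contains("``") {
        return Cow::Borrowed(raw); // common case: no allocation
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        // collapse a doubled backtick into a single one
        if c == '`' && chars.peek() == Some(&'`') {
            chars.next();
        }
    }
    Cow::Owned(out)
}

fn main() {
    assert!(matches!(unescape_descriptor("plain_name"), Cow::Borrowed(_)));
    assert_eq!(unescape_descriptor("has``tick"), "has`tick");
}
```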

Final benchmark numbers:

symbol parsing/parse_v1/10000 [19.158 ms 19.231 ms 19.306 ms]
symbol parsing/parse_v1/100000 [252.10 ms 252.48 ms 252.86 ms]
symbol parsing/parse_v1/1000000 [2.9972 s 3.0033 s 3.0094 s]

symbol parsing/parse_v2/10000 [1.1307 ms 1.1357 ms 1.1413 ms]
symbol parsing/parse_v2/100000 [15.645 ms 15.670 ms 15.697 ms]
symbol parsing/parse_v2/1000000 [191.80 ms 192.11 ms 192.44 ms]

Test plan

Some basic unit tests. Verified manually that it produces the same symbols as the existing parser for all of chromium.scip
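The differential check described above (same output as the existing parser) pairs naturally with a parse/format roundtrip test. A toy version for the `local <id>` symbol form only (a sketch of the testing idea; real SCIP symbols also have the full `<scheme> <manager> <name> <version> <descriptors>` shape, and these helper names are invented):

```rust
/// Parse the `local <id>` symbol form, borrowing the id from the input.
fn parse_local(symbol: &str) -> Option<&str> {
    symbol.strip_prefix("local ").filter(|id| !id.is_empty())
}

/// Format a local symbol back to its string form.
fn format_local(id: &str) -> String {
    format!("local {id}")
}

fn main() {
    let input = "local 42";
    let id = parse_local(input).expect("valid local symbol");
    // Formatting the parsed symbol must reproduce the input byte-for-byte.
    assert_eq!(format_local(id), input);
    // Non-local symbols are rejected by this parser.
    assert_eq!(parse_local("global x"), None);
}
```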

@cla-bot cla-bot bot added the cla-signed label Jun 24, 2024
@github-actions github-actions bot added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) team/product-platform labels Jun 24, 2024
@kritzcreek
Contributor Author

kritzcreek commented Jun 26, 2024

I set up a small benchmark for symbol parsing. We're getting about a 5x speedup.

symbol parsing/parse    time:   [30.992 ms 31.228 ms 31.481 ms]
                        change: [-0.9727% +0.3686% +1.5880%] (p = 0.59 > 0.05)
symbol parsing/parse_v2 time:   [5.9309 ms 6.0102 ms 6.0921 ms]
                        change: [-0.4667% +1.8921% +4.1729%] (p = 0.11 > 0.05)

This is on the scip-clang index for Chromium, parsing ~10,000 symbols (the full index has about 80,000,000).

@varungandhi-src
Contributor

varungandhi-src commented Jun 27, 2024

The improved parsing time seems about an order of magnitude slower than the Go implementation, which seems counter-intuitive. Are you sure you benchmarked in release mode? These are the numbers from the Go implementation in sourcegraph/scip#258 (the code should compile now; you should be able to repro by adjusting the path to the SCIP index in the benchmark):

        Benchmark parseV2:
        	1000: 83.333µs
        	10000: 685.458µs <- compare with 6.01ms
        	100000: 8.801583ms
        	1000000: 96.723459ms

Is the extra Vec allocation causing this much overhead? Can we benchmark without it?

@kritzcreek
Contributor Author

The improved parsing time seems about an order of magnitude slower than the Go implementation, which seems counter-intuitive.

It's very possible I'm doing something stupid :D I haven't profiled once yet.

Are you sure you benchmarked in release mode?

cargo bench uses the bench profile, which is release + no debug symbols by default, so I'd think so?
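For reference (not part of the PR), Cargo's bench profile inherits the release profile; its documented defaults are equivalent to writing this out explicitly:

```toml
# Defaults Cargo applies for `cargo bench` -- shown only to confirm
# benchmarks run with full optimizations; no need to add this to Cargo.toml.
[profile.bench]
opt-level = 3
debug = false
lto = false
codegen-units = 16
```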

Is the extra Vec allocation causing this much overhead/can we benchmark without that?

I'll collect some statistics for the symbols I'm parsing here and see if pre-allocating capacity for the descriptors helps.
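One cheap variant of that experiment (a sketch, not the PR's code: the `Descriptor` type, the dot-splitting, and the capacity of 8 are all stand-ins; real SCIP descriptors carry suffix characters like `/`, `#`, and `().`):

```rust
/// Hypothetical descriptor type standing in for scip-syntax's real one.
#[derive(Debug, PartialEq)]
struct Descriptor<'a> {
    name: &'a str,
}

/// Split a dotted descriptor chain, pre-sizing the Vec so the common case
/// does a single allocation instead of growing repeatedly.
fn parse_descriptors(input: &str) -> Vec<Descriptor<'_>> {
    // 8 is a guessed upper bound for typical symbols; the Vec still grows if needed.
    let mut out = Vec::with_capacity(8);
    for part in input.split('.').filter(|p| !p.is_empty()) {
        out.push(Descriptor { name: part });
    }
    out
}

fn main() {
    let ds = parse_descriptors("chrome.browser.Profile");
    assert_eq!(ds.len(), 3);
    assert_eq!(ds[0].name, "chrome");
}
```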

@varungandhi-src
Contributor

OK, after fixing the bugs, I get a slowdown, but it's surprisingly still much faster than the Rust version.

        Benchmark parseV2:
        	1000: 274.417µs
        	10000: 1.949875ms <- compare with 6.01ms
        	100000: 26.774833ms
        	1000000: 326.905708ms

I also added a test to compare with old parsing implementation: sourcegraph/scip@3afd2e2

@kritzcreek
Contributor Author

When I change the benchmark to parse the first 10_000 symbols in the index instead (previously I parsed every (TOTAL/10,000)th element to get a "diverse" set of symbols from the index):

@@ -33,7 +33,7 @@ fn bench_symbol_parsing(c: &mut Criterion) {
     let all_symbols: Vec<String> = symbols_from_index("/Users/creek/work/scip-indices/chromium-1.scip").collect();
     let symbol_count = all_symbols.len();
     let n = 10_000;
-    let symbols: Vec<&str> = all_symbols.iter().step_by(symbol_count / n).map(|s| s.as_str()).collect();
+    let symbols: Vec<&str> = all_symbols.iter().take(n).map(|s| s.as_str()).collect();
     let mut group = c.benchmark_group("symbol parsing");
     group.bench_function("parse", |b| {
         b.iter(|| parse_symbols(&symbols))

I get timings close enough that the Vec allocations could start to matter (at least close enough that I don't feel like there's a horrible perf bug lurking):

symbol parsing/parse_v2 time:   [3.5838 ms 3.6570 ms 3.7458 ms]
                        change: [-40.615% -39.152% -37.393%] (p = 0.00 < 0.05)
                        Performance has improved.
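The two sampling strategies from that diff, in isolation (stdlib iterators only; the point is which symbols get sampled, not the iterator cost, and the helper names here are invented):

```rust
/// Every (len/n)-th symbol: a "diverse" sample spread across the whole index.
fn sample_diverse<'a>(all: &'a [String], n: usize) -> Vec<&'a str> {
    all.iter().step_by(all.len() / n).map(|s| s.as_str()).collect()
}

/// First n symbols: neighboring (often cheaper-to-parse) symbols from the
/// start of the index.
fn sample_first<'a>(all: &'a [String], n: usize) -> Vec<&'a str> {
    all.iter().take(n).map(|s| s.as_str()).collect()
}

fn main() {
    // Stand-in for the ~80M symbols in the real index.
    let all: Vec<String> = (0..100).map(|i| format!("sym{i}")).collect();
    let diverse = sample_diverse(&all, 10);
    let first = sample_first(&all, 10);
    assert_eq!(diverse.first(), Some(&"sym0"));
    assert_eq!(diverse.last(), Some(&"sym90"));
    assert_eq!(first.last(), Some(&"sym9"));
}
```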

@kritzcreek
Contributor Author

kritzcreek commented Jun 27, 2024

Okay, I tried one quick thing (VerboseError keeps a stack of active parsers around to give a nice error message):

@@ -5,7 +5,7 @@ use nom::{
     bytes::complete::{tag, take_while1},
     character::complete::char,
     combinator::{cut, eof, fail, opt},
-    error::{context, convert_error, VerboseError},
+    error::{context, Error},
     multi::many1,
     sequence::{delimited, preceded, tuple},
     Finish, IResult, Parser,
@@ -17,13 +17,12 @@ pub(super) fn parse_symbol(input: &str) -> Result<Symbol<'_>, String> {
     match parse_symbol_inner(input).finish() {
         Ok((_, symbol)) => Ok(symbol),
         Err(err) => Err(format!(
-            "Invalid symbol: '{input}'\n{}",
-            convert_error(input, err)
+            "Invalid symbol: '{input}'\n{err}",
         )),
     }
 }
 
-type PResult<'a, A> = IResult<&'a str, A, VerboseError<&'a str>>;
+type PResult<'a, A> = IResult<&'a str, A, Error<&'a str>>;
 
 fn parse_symbol_inner(input: &str) -> PResult<'_, Symbol<'_>> {
     let (input, symbol) = alt((parse_local_symbol, parse_nonlocal_symbol))(input)?;

By using the non-allocating nom::Error I get to:

symbol parsing/parse_v2 time:   [1.0488 ms 1.0513 ms 1.0538 ms]
                        change: [-71.982% -71.301% -70.708%] (p = 0.00 < 0.05)
                        Performance has improved.

Contributor

@varungandhi-src varungandhi-src left a comment


LGTM. I wish we had finished the work around having some SCIP indexes available for testing in CI to catch regressions.

@kritzcreek
Contributor Author

LGTM. I wish we had finished the work around having some SCIP indexes available for testing in CI to catch regressions.

Did we ever consider using git-lfs?

@kritzcreek
Contributor Author

Final perf numbers after adding a custom (still non-allocating) error type, to get some basic error reporting that doesn't tank the performance.

symbol parsing/parse_v1/10000 [19.158 ms 19.231 ms 19.306 ms]
symbol parsing/parse_v1/100000 [252.10 ms 252.48 ms 252.86 ms]
symbol parsing/parse_v1/1000000 [2.9972 s 3.0033 s 3.0094 s]

symbol parsing/parse_v2/10000 [1.1307 ms 1.1357 ms 1.1413 ms]
symbol parsing/parse_v2/100000 [15.645 ms 15.670 ms 15.697 ms]
symbol parsing/parse_v2/1000000 [191.80 ms 192.11 ms 192.44 ms]
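The custom error type itself isn't shown in this thread; a non-allocating design along those lines might look like the following (a sketch with invented names -- the key property is that constructing the error borrows rather than allocates, and allocation only happens if the caller formats it):

```rust
use std::fmt;

/// Invented stand-in for the PR's custom error: it borrows the failing input
/// and records a static context string, so constructing it never allocates.
#[derive(Debug, Clone, Copy)]
struct SymbolError<'a> {
    context: &'static str,
    remaining: &'a str,
}

impl fmt::Display for SymbolError<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Allocation happens only here, when the caller actually formats it.
        write!(f, "expected {} at '{}'", self.context, self.remaining)
    }
}

fn main() {
    let err = SymbolError { context: "descriptor", remaining: "!!bad" };
    assert_eq!(err.to_string(), "expected descriptor at '!!bad'");
}
```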

@kritzcreek kritzcreek changed the title WIP: adds strict SCIP symbol parsing and formatting for scip-syntax scip-syntax: adds strict SCIP symbol parsing and formatting Jul 1, 2024
@kritzcreek kritzcreek marked this pull request as ready for review July 1, 2024 04:27
@kritzcreek kritzcreek force-pushed the christoph/strict-scip-rust-bindings branch 2 times, most recently from e79d0e8 to 56a95ad Compare July 1, 2024 05:33
@kritzcreek kritzcreek force-pushed the christoph/strict-scip-rust-bindings branch from 56a95ad to 7127e6b Compare July 1, 2024 06:08
@kritzcreek kritzcreek merged commit 4d02f09 into main Jul 1, 2024
11 checks passed
@kritzcreek kritzcreek deleted the christoph/strict-scip-rust-bindings branch July 1, 2024 07:02