scip-syntax: adds strict SCIP symbol parsing and formatting #63443
Conversation
I set up a small benchmark for symbol parsing. We're getting about a 5x speedup.
This is on the scip-clang index for chromium, parsing ~10,000 symbols (the full index has about 80,000,000).
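For reference, here is a minimal sketch of what such a Criterion benchmark setup could look like; the symbol data and the parse_symbols helper are illustrative stand-ins, not the actual benchmark from this PR (the real one reads symbols out of the chromium scip-clang index).

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical stand-in for the real parser entry point in scip-syntax.
fn parse_symbols(symbols: &[&str]) -> usize {
    symbols.iter().map(|s| black_box(s).len()).sum()
}

fn bench_symbol_parsing(c: &mut Criterion) {
    // Placeholder data; the real benchmark pulls ~10,000 symbols from a SCIP index.
    let all_symbols: Vec<String> =
        vec!["scip-clang cxx . . `ns::Foo#`bar().".to_string(); 10_000];
    let symbols: Vec<&str> = all_symbols.iter().map(|s| s.as_str()).collect();

    c.bench_function("parse 10k symbols", |b| {
        b.iter(|| parse_symbols(black_box(&symbols)))
    });
}

criterion_group!(benches, bench_symbol_parsing);
criterion_main!(benches);

Note that `cargo bench` compiles benchmarks with optimizations, so Criterion numbers are effectively release-mode numbers.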
Even with the improvement, the parsing time seems to be about an order of magnitude slower than the Go implementation, which is counter-intuitive. Are you sure you benchmarked in release mode? These are the numbers from the Go implementation in sourcegraph/scip#258 (the code should compile now; you should be able to repro by adjusting the path to the SCIP index in the benchmark).
Is the extra Vec allocation causing this much overhead? Can we benchmark without it?
It's very possible I'm doing something stupid :D I haven't profiled once yet.
I'll collect some statistics for the symbols I'm parsing here and see if pre-allocating capacity for the descriptors helps.
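As a rough illustration of that idea (the helper name and the capacity of 8 are guesses, not code from this PR), pre-allocating the descriptor Vec would look something like:

// Hypothetical helper: collect descriptor slices into a pre-sized Vec so that
// pushing the descriptors of a typical symbol never reallocates. A stack-backed
// smallvec::SmallVec<[&str; 8]> would avoid the heap allocation entirely.
fn collect_descriptors<'a>(parts: impl Iterator<Item = &'a str>) -> Vec<&'a str> {
    let mut descriptors = Vec::with_capacity(8);
    descriptors.extend(parts);
    descriptors
}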
OK, after fixing the bugs, I get a slowdown, but it's surprisingly still much faster than the Rust version.
I also added a test to compare with the old parsing implementation: sourcegraph/scip@3afd2e2
When I change the benchmark to also parse the first 10_000 symbols in the index (I previously parsed every (TOTAL/10,000)th element to get a "diverse" set of symbols from the index):

@@ -33,7 +33,7 @@ fn bench_symbol_parsing(c: &mut Criterion) {
let all_symbols: Vec<String> = symbols_from_index("/Users/creek/work/scip-indices/chromium-1.scip").collect();
let symbol_count = all_symbols.len();
let n = 10_000;
- let symbols: Vec<&str> = all_symbols.iter().step_by(symbol_count / n).map(|s| s.as_str()).collect();
+ let symbols: Vec<&str> = all_symbols.iter().take(n).map(|s| s.as_str()).collect();
let mut group = c.benchmark_group("symbol parsing");
group.bench_function("parse", |b| {
b.iter(|| parse_symbols(&symbols))

I get timings close enough that the Vec allocations could start to matter (at least close enough in perf for me to not feel like there's a horrible perf bug lurking).
Okay, I tried one quick thing:

@@ -5,7 +5,7 @@ use nom::{
bytes::complete::{tag, take_while1},
character::complete::char,
combinator::{cut, eof, fail, opt},
- error::{context, convert_error, VerboseError},
+ error::{context, Error},
multi::many1,
sequence::{delimited, preceded, tuple},
Finish, IResult, Parser,
@@ -17,13 +17,12 @@ pub(super) fn parse_symbol(input: &str) -> Result<Symbol<'_>, String> {
match parse_symbol_inner(input).finish() {
Ok((_, symbol)) => Ok(symbol),
Err(err) => Err(format!(
- "Invalid symbol: '{input}'\n{}",
- convert_error(input, err)
+ "Invalid symbol: '{input}'\n{err}",
)),
}
}
-type PResult<'a, A> = IResult<&'a str, A, VerboseError<&'a str>>;
+type PResult<'a, A> = IResult<&'a str, A, Error<&'a str>>;
fn parse_symbol_inner(input: &str) -> PResult<'_, Symbol<'_>> {
let (input, symbol) = alt((parse_local_symbol, parse_nonlocal_symbol))(input)?;

By using the non-allocating nom::error::Error in place of VerboseError, error handling during parsing no longer allocates.
LGTM. I wish we had finished the work around having some SCIP indexes available for testing in CI to catch regressions.
(Resolved, outdated review comment on docker-images/syntax-highlighter/crates/scip-syntax/src/scip_strict/parse.rs)
Did we ever consider using git-lfs?
Final perf numbers after adding a custom (still non-allocating) error type to get some basic error reporting that doesn't tank performance.
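For context, a minimal sketch of what a non-allocating nom error type can look like, assuming nom 7's ParseError and ContextError traits; the type name and the fields it records are illustrative, not the actual error type added in this PR:

use nom::error::{ContextError, ErrorKind, ParseError};

#[derive(Debug)]
struct SymbolError<'a> {
    // Remaining input at the failure point plus an optional static context label.
    // Everything is borrowed or Copy, so constructing the error never allocates.
    input: &'a str,
    context: Option<&'static str>,
    kind: ErrorKind,
}

impl<'a> ParseError<&'a str> for SymbolError<'a> {
    fn from_error_kind(input: &'a str, kind: ErrorKind) -> Self {
        SymbolError { input, context: None, kind }
    }

    // Keep the innermost error instead of accumulating an allocated trace
    // the way VerboseError does.
    fn append(_input: &'a str, _kind: ErrorKind, other: Self) -> Self {
        other
    }
}

impl<'a> ContextError<&'a str> for SymbolError<'a> {
    fn add_context(_input: &'a str, ctx: &'static str, mut other: Self) -> Self {
        other.context.get_or_insert(ctx);
        other
    }
}

The error is only rendered into a String on the failure path, which keeps the happy path allocation-free.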
Force-pushed from e79d0e8 to 56a95ad
Force-pushed from 56a95ad to 7127e6b
Adds strict and performant symbol parsing/formatting for scip-syntax. Parsing is "zero" allocation when the symbol does not contain escapes. (Technically it does allocate a Vec to hold the descriptors.)
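A minimal sketch of the borrow-unless-escaped idea, using Cow for descriptor names and assuming the backtick-doubling escape from the SCIP symbol grammar; this is an illustration of the approach, not the PR's actual representation:

use std::borrow::Cow;

// Borrow the name straight out of the symbol string in the common case and
// only allocate when an escaped backtick has to be unescaped.
fn descriptor_name(raw: &str) -> Cow<'_, str> {
    if raw.contains("``") {
        Cow::Owned(raw.replace("``", "`"))
    } else {
        Cow::Borrowed(raw)
    }
}

fn main() {
    assert!(matches!(descriptor_name("Runtime"), Cow::Borrowed(_)));
    assert!(matches!(descriptor_name("weird``name"), Cow::Owned(_)));
}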
Final benchmark numbers:
Test plan
Some basic unit tests. Verified manually that it produces the same symbols as the existing parser for all of chromium.scip