Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement parser #450

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

Alex-Fischman
Copy link
Contributor

@Alex-Fischman Alex-Fischman commented Oct 21, 2024

This PR implements a fully-Rust parser, which adds better error messages. This has the additional benefits of reducing dependencies and reducing variance in small benchmarks.

@Alex-Fischman Alex-Fischman changed the title Fast parser Reimplement parser Oct 21, 2024
Copy link

codspeed-hq bot commented Oct 21, 2024

CodSpeed Performance Report

Merging #450 will not alter performance

Comparing Alex-Fischman:fast-parser (4d70999) with main (af49ae2)

Summary

✅ 8 untouched benchmarks

@yihozhang
Copy link
Collaborator

Thanks for this heroic effort! This is amazing! Two thoughts:

  • Parser combinator is known to be asymptotically slower than bottom-up parsers like what lalrpop provides. However here we see speedups for several small to medium programs. Have you compared the two parsers with larger programs and see their performance?
  • The parser lalrpop generates is very, very large, something like 50k lines of code, so I imagine this PR can greatly reduce the compilation time (especially when one just tweaks the parser a little bit). We can maybe try to get some concrete numbers on this once your PR is ready.

@Alex-Fischman
Copy link
Contributor Author

  • Parser combinator is known to be asymptotically slower than bottom-up parsers like what lalrpop provides. However here we see speedups for several small to medium programs. Have you compared the two parsers with larger programs and see their performance?

These two commits speed up the slowest part of the combinator approach, which are the large multi-way choices. It fixes this by branching on the name of the command/action/schedule, which lets us skip attempting to parse them.

@Alex-Fischman Alex-Fischman marked this pull request as ready for review October 22, 2024 06:50
@Alex-Fischman Alex-Fischman requested a review from a team as a code owner October 22, 2024 06:50
@Alex-Fischman Alex-Fischman requested review from saulshanabrook and removed request for a team October 22, 2024 06:50
@Alex-Fischman
Copy link
Contributor Author

Alex-Fischman commented Oct 22, 2024

This parser still doesn't have great error messages, but they are better in some cases. I'm marking as ready for review to get more eyes. It's at the point where I'd like to merge and incrementally improve error messages as bad cases come up (which we can do now because it's all in Rust).

@thaliaarchi
Copy link
Contributor

thaliaarchi commented Oct 25, 2024

It seems to me that this would be vastly simplified by parsing to a general S-expression structure, then resolving commands according to their identifiers. Parsing and semantic analysis would be decoupled and both would become more consistent.

For parsing: The grammar for a single S-expression would be something like sexp ::= "(" WS* IDENT (WS+ (":" IDENT WS+)? sexp)* ")" | literal (but with proper handling of whitespace; e.g., required between ident and integer literal, but not ident and paren). The code duplication for similar patterns is then eliminated and combinators wouldn't be necessary.

For semantic analysis: Each command would register its signature (e.g., arity, types, optional arguments, or whatever needed), then they'd be resolved by identifier. Since they'd be parsed to a single structure with spans, shared across all commands, then error reporting has far fewer cases. This would allow more invalid programs to be parsed, where, for example, the only problem is in arities or names of optional arguments, and the correct signature could be reported in the error.

If there is no lookahead, each top-level command could be parsed independently with allocations recycled, so the AST for the entire program does not need to be generated.

@Alex-Fischman
Copy link
Contributor Author

I agree that parsing to S-expressions first does almost certainly improve the error reporting somewhat, although I'm not sure how much of a win it actually is; currently error reporting is pretty simple.

I'm not motivated enough to write two parsers for the same language in one week, but if someone does want to take a shot at it you can use my toy parser from a year ago as a reference/baseline/starting point. It parses a subset of egglog but the basic ideas are there. (Slice is just Span.)

I do think that merging this PR makes that job slightly easier, since this PR consolidates all of the parsing in one place and does some related cleanup (like removing Quote).

@Alex-Fischman
Copy link
Contributor Author

An additional related Zach thought to put on people's radar: an egglog fuzzer would be nice, but it's the sort of thing that's perpetually low priority until it gets done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants