Skip to content

Scannerless Parsing

Mark S. Miller edited this page May 3, 2015 · 8 revisions

Base Grammars

Grammars writted using the bootbnf.bnf tag language always inherit from some base grammar. This package implements two such base grammars directly in JavaScript:

  • defaultBaseGrammar as a simple lexer for defining a conventional two-level grammar.
  • baseScannerless as a base for scannerless parsing.

If no base grammar is explicitly specified, then the defined grammar implicitly inherits from defaultBaseGrammar, which defines the token-level rules

  • SKIP calls SPACE and COMMENT to skip whitespace tokens.
  • SPACE contiguous whitespace characters.
  • COMMENT a line comment.
  • HOLE a ${...} template string hole, i.e., an anti-quote.
  • EOF end of input.
  • NUMBER uses JSON number syntax.
  • STRING uses JSON string syntax.
  • IDENT conventional alphanumeric identifier including underbar (_) and dollar ($).

Except for SKIP, SPACE, and COMMENT, all the other token-level rules first call SKIP to skip whatever a given grammar considers whitespace. Derived grammars should do likewise if they introduce new token-level rules.

In grammars inheriting from defaultBaseGrammar, quoted identifiers define keywords which are thereby excluded from the IDENT production. Recognition of other quoted punctuation should ideally be based on what quoted strings appear in the grammar definition. However, this simple lexer instead simply recognizes

  • Any of [](){},; as a single character token.
  • Any contiguous sequence of the characters :~@%&+=*<>.?|-^\/ as a single token.

For defining little embedded DSLs, these simple rules work surprisingly well. For more sophicticated lexing needs, you can write an alternative to defaultBaseGrammar, or you can use scannerless parsing.

Scannerless Parsing

The baseScannerless grammar defines only the rules

  • SKIP which is meant to be overridden by derived grammars which wish to universally ignore whatever they consider to be whitespace. The default here matches only empty. The SKIP rule should always be defined so accept empty, so that it always succeeds. CHAR, HOLE, and EOF all call SKIP first. In derived grammars overriding SKIP, further token-level rules should also start by calling SKIP.
  • CHAR matches SKIP followed by a single character. Its semantic value is that character.
  • HOLE matches SKIP followed by a template string hole, i.e., an embedded "```${...}" template string unquoted expression. Its semantic value is the hole number, i.e., the index into the array of substitution arguments.
  • EOF matches SKIP followed only by the end of input.

The scannerless.es6 module uses grammar inheritance to define a derived grammar, scannerless, that provides the same token-level productions as defaultBaseGrammar in a scannerless manner.

const scannerless = bnf.extends(baseScannerless)`
  start ::= TOKEN* EOF                       ${(toks,_) => toks};
  SKIP ::= (SPACE / COMMENT)*;
  SPACE ::= this.${skip(SPACE_RE)};
  # COMMENT is broken out to make it easy to override
  COMMENT ::= this.${skip(LINE_COMMENT_RE)};

  TOKEN ::= NUMBER / STRING / IDENT / CHAR / HOLE;
  NUMBER ::= this.${match(NUMBER_RE)};
  STRING ::= this.${match(STRING_RE)};
  IDENT ::= this.${match(IDENT_RE)};
`;

This example makes use of an additional feature of our bnf language, external rule methods, as well as two helper functions, skip and match, to help define such external rule methods.

The bnf language syntax

    "this" "." HOLE

expresses a call to an external rule. The substitution value in the hole is expected to be a this-sensitive function whose this-value is expected to be a parser (i.e., an instance of a Parser class), and whose single argument is a position. It is expected to return a pair of a new position and either a semantic value or FAIL. In other words, this external function is invoked as if it is a rule method of this parser. By virtue of packrat memoization, this function will be called at most once at any given position of a parse.

The skip and match helper functions take a JavaScript RegExp (regular expression) object as argument, and returns a function that can be used as an external rule method. This rule method will eat as many characters as match the regular expression starting at the current position. The difference is that match defines an external rule method that first calls the SKIP rule, which is normally what you want. The skip function does not, which is typically only needed for defining components of the SKIP rule, such as SPACE and COMMENT above.

The *_RE variables above hold the same regular expressions used to define the defaultBaseGrammar, helping to ensure that these agree.

Clone this wiki locally