-
Notifications
You must be signed in to change notification settings - Fork 9
Scannerless Parsing
Grammars writted using the bootbnf.bnf
tag language always inherit from some base grammar. This package implements two such base grammars directly in JavaScript:
-
defaultBaseGrammar
as a simple lexer for defining a conventional two-level grammar. -
baseScannerless
as a base for scannerless parsing.
If no base grammar is explicitly specified, then the defined grammar implicitly inherits from defaultBaseGrammar
, which defines the token-level rules
-
SKIP
callsSPACE
andCOMMENT
to skip whitespace tokens. -
SPACE
contiguous whitespace characters. -
COMMENT
a line comment. -
HOLE
a${...}
template string hole, i.e., an anti-quote. -
EOF
end of input. -
NUMBER
uses JSON number syntax. -
STRING
uses JSON string syntax. -
IDENT
conventional alphanumeric identifier including underbar (_
) and dollar ($
).
Except for SKIP
, SPACE
, and COMMENT
, all the other token-level rules first call SKIP to skip whatever a given grammar considers whitespace. Derived grammars should do likewise if they introduce new token-level rules.
In grammars inheriting from defaultBaseGrammar
, quoted identifiers define keywords which are thereby excluded from the IDENT
production. Recognition of other quoted punctuation should ideally be based on what quoted strings appear in the grammar definition. However, this simple lexer instead simply recognizes
- Any of
[](){},;
as a single character token. - Any contiguous sequence of the characters
:~@%&+=*<>.?|-^\/
as a single token.
For defining little embedded DSLs, these simple rules work surprisingly well. For more sophicticated lexing needs, you can write an alternative to defaultBaseGrammar
, or you can use scannerless parsing.
The baseScannerless
grammar defines only the rules
-
SKIP
which is meant to be overridden by derived grammars which wish to universally ignore whatever they consider to be whitespace. The default here matches only empty. TheSKIP
rule should always be defined so accept empty, so that it always succeeds.CHAR
,HOLE
, andEOF
all callSKIP
first. In derived grammars overridingSKIP
, further token-level rules should also start by callingSKIP
. -
CHAR
matchesSKIP
followed by a single character. Its semantic value is that character. -
HOLE
matchesSKIP
followed by a template string hole, i.e., an embedded "```${...}" template string unquoted expression. Its semantic value is the hole number, i.e., the index into the array of substitution arguments. -
EOF
matchesSKIP
followed only by the end of input.
The scannerless.es6
module uses grammar inheritance to define a derived grammar, scannerless
, that provides the same token-level productions as defaultBaseGrammar
in a scannerless manner.
const scannerless = bnf.extends(baseScannerless)`
start ::= TOKEN* EOF ${(toks,_) => toks};
SKIP ::= (SPACE / COMMENT)*;
SPACE ::= this.${skip(SPACE_RE)};
# COMMENT is broken out to make it easy to override
COMMENT ::= this.${skip(LINE_COMMENT_RE)};
TOKEN ::= NUMBER / STRING / IDENT / CHAR / HOLE;
NUMBER ::= this.${match(NUMBER_RE)};
STRING ::= this.${match(STRING_RE)};
IDENT ::= this.${match(IDENT_RE)};
`;
This example makes use of an additional feature of our bnf
language, external rule methods, as well as two helper functions, skip
and match
, to help define such external rule methods.
The bnf
language syntax
"this" "." HOLE
expresses a call to an external rule. The substitution value in the hole is expected to be a this-sensitive function whose this-value is expected to be a parser (i.e., an instance of a Parser class), and whose single argument is a position. It is expected to return a pair of a new position and either a semantic value or FAIL
. In other words, this external function is invoked as if it is a rule method of this parser. By virtue of packrat memoization, this function will be called at most once at any given position of a parse.
The skip
and match
helper functions take a JavaScript RegExp (regular expression) object as argument, and returns a function that can be used as an external rule method. This rule method will eat as many characters as match the regular expression starting at the current position. The difference is that match
defines an external rule method that first calls the SKIP
rule, which is normally what you want. The skip
function does not, which is typically only needed for defining components of the SKIP
rule, such as SPACE
and COMMENT
above.
The *_RE
variables above hold the same regular expressions used to define the defaultBaseGrammar
, helping to ensure that these agree.