Usage¶
This section assumes you are familiar with the basic terminology involved in parsing Context Free grammars.
Defining a Grammar¶
A grammar is stored as a collection of ParseRule
objects inside a ParseRuleSet
. Each ParseRule
is a single
production from a “head” (symbol named by string), to a list of Symbol
objects. Multiple ParseRule
objects with
the same head define alternative productions.
For example, the following Backus-Naur notated grammar:
<sentence> ::= <noun> <verb> <noun>
<noun> ::= "man" | "dog"
<verb> ::= "bites"
Would be expressed:
from axaxaxas import ParseRule, ParseRuleSet, Terminal as T, NonTerminal as NT
grammar = ParseRuleSet()
grammar.add(ParseRule("sentence", [NT("noun"), NT("verb"), NT("noun")]))
grammar.add(ParseRule("noun", [T("man")]))
grammar.add(ParseRule("noun", [T("dog")]))
grammar.add(ParseRule("verb", [T("bites")]))
Invoking the parser¶
Having defined our grammar, we can attempt to parse it. Parsing operates on an iterator of token objects, There is no
lexer included. In the example above we have assumed that the tokens are Python strings, but they can be anything.
Often a formal lexer is not needed - we can use string.split
to produce lists of strings.
The parser is invoked with the parse
function:
from axaxaxas import parse
parse_forest = parse(grammar, "sentence", "man bites dog".split())
print(parse_forest.single())
# (sentence: (noun: 'man') (verb: 'bites') (noun: 'dog'))
Parse results¶
parse
returns a ParseForest
that contains a collection of ParseTree
objects. By calling single()
we
check that there is exactly one possible parse tree, and return it.
See Handling Ambiguity for more details.
ParseTree
objects themselves are a fairly straightforward representation. ParseTree.rule
contains the ParseRule
used to match the tokens, and ParseTree.children
contains a value for each symbol of the rule, where the value is
the token matched for terminals, or another ParseTree
for nonterminals.
Optional, Star and Plus¶
Any Symbol
can be declared with optional=True
to make it nullable - it must occur zero or one times. Similarly,
star=True
allows any number of occurences, min zero, and plus=True
allows any number of occurences, min one.
In the resulting parse tree, skipped optional symbols are represented with None
. star and plus elements become
tuples:
grammar.add(ParseRule("relative", [T("step", optional=True), T("sister")]))
grammar.add(ParseRule("relative", [T("great", star=True), T("grandfather")]))
print(parse(grammar, "relative", "sister".split()).single())
# (relative: None 'sister')
print(parse(grammar, "relative", "step sister".split()).single())
# (relative: 'step' 'sister')
print(parse(grammar, "relative", "grandfather".split()).single())
# (relative: () 'grandfather')
print(parse(grammar, "relative", "great great grandfather".split()).single())
# (relative: ('great', 'great') 'grandfather')
Greedy Symbols¶
Like in a regular expression, you can mark parts of the grammar as greedy
or lazy
. In case of ambiguity
the parser will preferentially prefer the parse tree with the more (or fewer) number of occurrences. lazy and greedy
can only be used in combination with optional, plus or star:
grammar.add(ParseRule("described relative", [NT("adjective", star=True), NT("relative")]))
grammar.add(ParseRule("adjective", [T("awesome")]))
grammar.add(ParseRule("adjective", [T("great")]))
print(parse(grammar, "described relative", "great grandfather".split()).single())
# -- raises AmbiguousParseError
grammar.add(ParseRule("described relative 2", [NT("adjective", star=True, greedy=True), NT("relative")]))
print(parse(grammar, "described relative 2", "great grandfather".split()).single())
# (described relative 2: ((adjective: 'great'),) (relative: () 'grandfather'))
Greediness only trims ambiguous possibilities, so will never cause a sentence fail to parse.
It only affects choices the parser makes when reading from left to right, which means you will still get ambiguity if the leftmost symbol isn’t marked.
For non-terminals, settings prefer_early
and prefer_late
work analogously. They instruct the parser that
if there are several possible productions that could be used for a given symbol, to prefer the first (or last) one
in order of definition in the grammar:
grammar.add(ParseRule("dinner order", [T("I"), T("want"), NT("item", prefer_early=True)]))
grammar.add(ParseRule("item", [T("ham")]))
grammar.add(ParseRule("item", [T("eggs")]))
grammar.add(ParseRule("item", [T("ham"), T("and"), T("eggs")]))
grammar.add(ParseRule("item", [NT("item", prefer_early=True), T("and"), NT("item", prefer_early=True)]))
print(parse(grammar, "dinner order", "I want eggs and ham".split()).single())
# (dinner order: 'I' 'want' (item: (item: 'eggs') 'and' (item: 'ham')))
print(parse(grammar, "dinner order", "I want ham and eggs".split()).single())
# (dinner order: 'I' 'want' (item: 'ham' 'and' 'eggs'))
Penalty¶
As mentioned above, greedy and related settings only trim ambiguity when the two options have so far parsed identically.
In some circumstances, you wish to avoid a particular rule, no matter how different the alternatives are. You can associate a penalty with each rule. The parser sums up all the penalties associated with a given parse, and choose only possibly parses with the lowest sum. This can have wide ranging effects on eliminating ambiguity. Penalties can be viewed as very lightweight support for probabilistic parsing:
grammar.add(ParseRule("sentence", [NT("noun"), T("like"), T("a"), NT("noun")]))
grammar.add(ParseRule("sentence", [NT("noun"), T("flies"), T("like"), T("a"), NT("noun")]))
grammar.add(ParseRule("noun", [T("fruit"), T("flies")], penalty=1))
grammar.add(ParseRule("noun", [T("fruit")]))
grammar.add(ParseRule("noun", [T("banana")]))
print(parse(grammar, "sentence", "fruit flies like a banana".split()).single())
# (sentence: (noun: 'fruit') 'flies' 'like' 'a' (noun: 'banana'))
In the example above the parser chose to avoid the other possible parse
(sentence: (noun: 'fruit' 'flies') 'like' 'a' (noun: 'banana'))
because it contains a rule with a penalty.
Penalties can be considered an experimental feature. Most of the time, you can just add more greedy settings to get the desired effect.