Customization
Tokens
The parser works on a stream of tokens. Tokens can be any Python object; they are not expected to have any particular behaviour. The parse function therefore works just as well on a stream of str or bytes objects, or on a stream of single characters (a scannerless parser). You may want to provide useful __repr__ and __str__ methods on your tokens to give better error messages. A common technique is for the lexer to produce some sort of Token object that includes a text string and additional annotations. For example, the Natural Language Toolkit can mark each token with the relevant part of speech.
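As an illustration of the Token-object technique, here is a minimal sketch of an annotated token type with readable __str__ and __repr__ methods. The Token class, its fields (text, pos, line, column) and the tokenize helper are hypothetical names invented for this example; they are not part of the library.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class Token:
    """Hypothetical token: the matched text plus lexer annotations."""
    text: str
    pos: str = ""    # optional part-of-speech tag, e.g. "NOUN"
    line: int = 0    # 1-based line number in the source
    column: int = 0  # 1-based column of the first character

    def __str__(self):
        # Short form, suitable for the body of an error message.
        return self.text

    def __repr__(self):
        # Long form, including where the token came from.
        return f"Token({self.text!r}, pos={self.pos!r}, at {self.line}:{self.column})"


def tokenize(source):
    """Naive whitespace lexer (an assumption, not the library's lexer)."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for m in re.finditer(r"\S+", line):
            yield Token(m.group(), line=line_no, column=m.start() + 1)


tokens = list(tokenize("the cat\nsat"))
```

Because the dataclass is frozen and compares by value, tokens like these can still be compared by equality, although the position fields then take part in the comparison.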
Symbols
Symbols are objects used to define the right-hand side of a ParseRule production. Two symbols, NonTerminal and Terminal, are provided in the symbols module. However, anything that duck-types the same as these can be used. This is mostly useful for redefining Terminal.match, the method responsible for determining whether a given token matches the terminal. The default Terminal class matches by equality, but you may, for example, have terminals that match entire classes of tokens.
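A terminal that matches a whole class of tokens might look roughly like the sketch below. It assumes that match receives a single token and returns a truthy value on success. That signature, and any other attributes the parser may expect a terminal to have, should be checked against the symbols module. TokenClassTerminal, NUMBER and IDENT are hypothetical names.

```python
class TokenClassTerminal:
    """Hypothetical duck-typed terminal: matches any token a predicate accepts."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate

    def match(self, token):
        # Assumed signature: one token in, True or False out.
        return bool(self.predicate(token))

    def __repr__(self):
        return f"<{self.name}>"


# Terminals for a stream of str tokens.
NUMBER = TokenClassTerminal("number", str.isdigit)
IDENT = TokenClassTerminal("identifier", str.isidentifier)
```

With these definitions, NUMBER.match("42") is true and NUMBER.match("x") is false, so a single terminal in a rule can stand for every numeric token.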
Customizing ParseTrees
There is no way to customize the ParseTree class, but you can avoid using it entirely by writing your own Builder. Builders specify a semantic action to take at each step of the parse, allowing you to build your own parse trees or abstract syntax trees directly from a ParseForest. See Builders for more details.
Customizing Grammars
You can override ParseRuleSet.get with anything that returns a list of ParseRule objects. As no preprocessing is done on the rules, you can generate a grammar on the fly. You can use this feature to parse context-sensitive grammars by passing any relevant context as part of the head and adjusting the non-terminals of the returned rules to forward that context. This will probably lead to very long parse times unless care is taken.
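The idea of carrying context in the head can be sketched as follows. To stay self-contained, the example uses namedtuple stand-ins for ParseRule, NonTerminal and Terminal; the real constructors may differ. It also assumes that get is called with the head being expanded. CountedRuleSet is a hypothetical class that generates, on demand, rules for heads of the form ("items", n), meaning exactly n "x" tokens.

```python
from collections import namedtuple

# Stand-ins for the library's classes; the real constructors may differ.
ParseRule = namedtuple("ParseRule", ["head", "symbols"])
NonTerminal = namedtuple("NonTerminal", ["name"])
Terminal = namedtuple("Terminal", ["token"])


class CountedRuleSet:
    """Hypothetical rule set that builds rules lazily from the head's context."""

    def get(self, head):
        # Assumed signature: takes a head, returns a list of rules.
        kind, n = head
        if kind != "items":
            return []
        if n == 0:
            # Base case: zero items is the empty production.
            return [ParseRule(head, [])]
        # One "x", then forward the decremented count to the next head.
        return [ParseRule(head, [Terminal("x"), NonTerminal(("items", n - 1))])]


rules = CountedRuleSet().get(("items", 2))
```

No rule exists until it is requested, so the grammar is unbounded in principle. Each distinct context value creates a distinct non-terminal, which is one reason parse times can grow quickly.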