
NAME

Decl::Parser - implements a parser to be defined using Decl::Semantics::Parse.

VERSION

Version 0.01

SYNOPSIS

The Decl::Semantics::Parse module uses the structure of a "parse" tag to build a parser. The parser it builds, though, is implemented using this class. And in fact this class also exposes a procedural API for building parsers, if you need to bypass the parser builder. It's the work of but a moment to realize that you need to bypass the parser builder when building a parser to parse parser specifications. (If you could parse that, I'm impressed.)

The idea is to build this parser with as few external dependencies as possible. Then it might be useful outside the framework as well.

These parsers are based on those in Mark Jason Dominus' fantastic book Higher-Order Perl. They consist of a tokenizer that is a chain of lesser tokenizers, registered actions that can be carried out on intermediate parses, and rules that build structure from a sequence of tokens.

new()

Instantiates a blank parser.

BUILDING THE PARSER

To build a parser, we add tokenizers, rules, and actions.

add_tokenizer()

Each parser has a tokenizer, which is a list of atomic tokenizers consisting of regular expressions to examine incoming text and spit it back out in categorized chunks of low-level meaning.

Each atomic tokenizer consists of a label, a regex pattern, and an optional callback to be called to produce the token. Intervening text that does not match the token's pattern is passed through unchanged, allowing later tokenizers in the chain to break that up.

The add_tokenizer function just pushes an atomic tokenizer onto the list. Later, lexer is called to tie those all together into a full lexer.

Possible extension: $pattern could be a coderef instead of a string, to permit more flexibility in the design of tokenizers.
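
For example, a minimal sketch of registering a few atomic tokenizers; the labels and patterns here are invented, and the callback's argument convention is an assumption rather than documented fact:

  my $parser = Decl::Parser->new();
  $parser->add_tokenizer('WHITESPACE*', '\s+');   # trailing * marks the token for suppression (see the lexer below)
  $parser->add_tokenizer('NUMBER',      '\d+');
  $parser->add_tokenizer('OP',          '[-+*/]',
                         sub { ['OP', $_[0]] });  # optional token-building callback (assumed signature)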

action()

Adds a named action to the list of actions that can be integrated into the parser. Also used to retrieve a named action.
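
The exact signature isn't spelled out here, so the following is only a sketch of the store-and-retrieve pattern described above; the action name and arguments are invented:

  $parser->action('to_number', sub { 0 + $_[0] });   # register a named action (assumed calling convention)
  my $action = $parser->action('to_number');         # retrieve it again by name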

add_rule($name, $rule), get_rule($name), list_rules(), clear_rule($name)

The add_rule function adds a rule. The rule is expressed in a sort of restricted Perl to assemble the available parser atoms into something useful. Rule cross-references can be indicated by enclosing the name of the rule in angle brackets <>; that will be substituted by a reference to a parser built with that rule. The purpose of this API is to provide a simple but procedural way to assemble a basic parser - one that we can then use to parse our declarative structures.

The target Perl code again leans heavily on Dominus, with some extensions and simplifications to make things easier in our context.

Multiple rules added under the same name will be considered alternatives, and treated as such when the parser is built.
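
As a hedged illustration, here is how a few rules might be registered, using the rule language described under build() below; the rule names and bodies are invented:

  $parser->add_rule('value', "token(['NUMBER'])");
  $parser->add_rule('value', "p_and(literal('('), <expr>, literal(')'))");   # second 'value' rule: an alternative
  $parser->add_rule('expr',  "list_of(<value>, 'PLUS')");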

The clear_rule function clears the information associated with a rule name. I'm not sure it will ever be used, but it just seems so easy that it would be silly not to put it in here. It does not delete the rule from the list of rules, so the rule's precedence (if any) will be unchanged.

USING THE PARSER

lexer($input), _t()

The lexer function creates a lexer using the list of tokenizers already registered, using the input stream provided. The lexer is an iterator, with a peek function to check the next token without consuming it. Tokens are arrayrefs or plain strings.

This leans heavily on Dominus.

Note that the input may itself be a token stream.

If called in a list context, returns the full list of tokens instead of an iterator. I hope that's what you wanted.
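
For instance, a short sketch of the list-context form (assuming tokenizers have already been registered with add_tokenizer):

  my @tokens = $parser->lexer("this is a test string");
  # Each element is either a labeled arrayref such as ['TWORDS', 'this'] or a
  # plain string of text that no tokenizer claimed.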

The _t function does most of the heavy lifting, and *really* leans on Dominus. I've extended his lexer framework with two features: first, if a lexer is simply passed a string as its input, it will still work, by creating a single-use iterator. Second, token labels that end in an asterisk are filtered out of the final token stream.

Dominus's framework provides for suppression of tokens using the token building function (e.g. sub {''} to suppress whitespace in the outgoing token stream), but there's a surprising problem with that approach - if the resulting stream is fed into the next atomic tokenizer in a chain, neighboring unparsed text will be pushed back together! This is a natural result of the fact that blockwise reading of files needs to be supported without breaking tokens that span block boundaries; the final tokenizer in the chain necessarily treats the output of earlier tokenizers like blocks.

But what if I want to tokenize the whitespace first, then, say, find all words starting with 't' and treat them as special tokens? OK, so this was a silly test case, and yet it seems intuitively to be something like what I'd want to do in some situations. The naive approach is this:

  parse t
     tokens 
        WHITESPACE "\s+"   { "" }
        TWORDS     "^t.*"
        

If I give that the string "this is a test string", I don't get five tokens, two of which are TWORDS. I get one TWORDS token with the value "thisisateststring". That is because by swallowing the "tokenicity" of the whitespace, we're actually just ignoring the whitespace.

Bad!

So instead, we put an asterisk on the whitespace specification, so that it will be suppressed after the tokenizing process is complete, that is, at the end of the tokenizer chain. In the meantime, though, the whitespace tokens are still there to hold their place in the queue.

   parse t
      tokens
         WHITESPACE*  "\s+"
         TWORDS       "^t.*"

tokens($input)

If you know you've got a limited number of tokens and just want to grab the whole list, use tokens, which just returns a list.
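
For example (again assuming tokenizers are already registered):

  my @all = $parser->tokens("this is a test string");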

tokenstream($input)

Finally, if you need a lazily evaluated stream for your token output (and hey, who doesn't?) call tokenstream. (Note: you'll want a stream if you're passing your lexer to a recursive-descent parser as below, because you need to be able to unwind the stream if one of your rules doesn't match.)
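
A minimal sketch; the stream's own interface follows Dominus's lazy streams and isn't detailed here:

  my $stream = $parser->tokenstream($input);
  # hand $stream to a parser component; it can be unwound if a rule fails to match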

PARSER COMPONENTS: parser, nothing, anything, end_of_input, token, token_silent, literal, word, p_and, p_or, series, one_or_more, list_of, optional, debug, debug_next_token

These are not methods; they're functions. They are the little subparsers that we hack together to make a full parser. The output of each of these parsers is an arrayref containing a flat list of tokens it has matched in the token stream it's given as input. Each token is itself an arrayref of two parts (a cons pair), with the first being the type and the second the token value. Bare words surviving the lexer are converted into individual tokens of type '' (empty string), allowing tokens to be treated uniformly.
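
Purely as illustration, components might be glued together by hand like this; whether these functions are exported or need to be package-qualified is not shown above:

  my $number = token(['NUMBER']);
  my $sum    = p_and($number, series(p_and(literal('+'), $number)));
  # $sum is itself a parser: given a token stream, it returns the flat list of
  # tokens it matched, as described above.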

build(), make_component($name, $spec), get_parser($name), parse($input), execute($defined input)

The build function takes the rules that have been added to the parser, and builds the actual parser using make_component, which is also available for external use. The make_component function runs in the context of the parser itself and uses eval to build its parser. Each parser built with build or make_component is named. Its output, if it matches, is a two-part arrayref, with the first element being its name and the second the arrayref list of tokens or subvalues that it matched.

An anonymous parser (name '' or undef) just returns the list of tokens, without the level of structure. The same applies to any name ending in an asterisk.

This should probably be covered in more detail in the tutorial, but the principle used here is that of the recursive-descent parser. A recursive-descent parser can effectively be constructed as a series of little parsers that are glued together by combination functions. Each of these parsers consumes a series of tokens, and returns a value; the default value is an arrayref (a pair, if you're Pythonic) consisting of the name or tag of the parser, followed by the list of tokens consumed. The sum total of all those arrayrefs is an abstract syntax tree for the expression being parsed.

When a parser is invoked in a macro context, that syntax tree is converted into a structure of Decl::Node objects (a nodal structure), with or without subclass decoration, depending on where the macro is expanded. But when we call a parser from Perl directly, we get the arrayrefs.

By defining actions and invoking them during the parse, we can also modify that structure as it's being built, or even build something else entirely, like a numeric result of a calculation or perhaps some callable code. This is still pretty hand-wavy, as I still haven't got my head around actual applications.

At any rate, the rule specifications passed to add_rule are pretty straightforward:

token(['TOKEN']) matches a token by that name.
token(['type', 'text']) matches a specific token.
literal('text') matches either a token by text, or a bare word; it converts the bare word to ['', 'word'].
regex('regex') matches a bare word using a regex. If the regex has parentheses in it, the output value may be one or more tokens with the captured contents.
<parser> matches a named parser rule, and expands to $eta_parser in order to permit self-reference. (See Dominus Chapter 8.)
\&nothing is the null parser, used to build complex rules.
\&anything is the universal token, used to match things like comments.
\&end_of_input matches the end of input.
\&word is any bare word (non-token text); it also converts the bare word to ['', 'word'].
p_or() is Dominus's "alternate" function, because I don't like to type that much.
p_and() is Dominus's "concatenate" function, for the same reason.
series() matches a series of whatever it's called on.
list_of() matches a delimited series of its first argument, delimited by tokens of its second argument; if the delimiter is omitted, it defaults to COMMA.
optional() matches either its contents or nothing.

Note that the only code-munging done here is reference to other rules. It's difficult for me to avoid code generation because it's so fun, but since parser specifications are supposed to be pretty general code, it's really not safe.

The order of addition of rules determines the order they'll be processed in. When the parser is built, it will check for consistency and dangling rule references (i.e. rules you mention but don't define), perform the eta expansions needed for self-reference, and build all the subparsers.
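
To tie the pieces together, here is a hedged end-to-end sketch. The tokenizer labels, rule names, and rule bodies are all invented for illustration, and the exact shape of the output should be checked against the descriptions above:

  my $p = Decl::Parser->new();

  $p->add_tokenizer('WHITESPACE*', '\s+');    # kept while lexing, dropped from the final stream
  $p->add_tokenizer('NUMBER',      '\d+');
  $p->add_tokenizer('PLUS',        '\+');

  $p->add_rule('term', "token(['NUMBER'])");
  $p->add_rule('expr', "list_of(<term>, 'PLUS')");

  $p->build();
  my $tree = $p->parse("1 + 2 + 3");
  # Expect something along the lines of
  #   ['expr', [['NUMBER', '1'], ['PLUS', '+'], ['NUMBER', '2'], ...]]
  # with the exact nesting depending on how list_of reports its matches.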

AUTHOR

Michael Roberts, <michael at vivtek.com>

BUGS

Please report any bugs or feature requests to bug-decl at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Decl. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

LICENSE AND COPYRIGHT

Copyright 2010 Michael Roberts.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.