# PanPG API # #top PanPG v0.0.9. This documents the most recent stable version. The current [bleeding edge documentation][edge_docs] is also available and may include differences in the development code since the last release. The [about page][about] always has links to recent versions. [edge_docs]: /3box/PanPG/README.html [006_docs]: /3box-asof/1282013570963/PanPG/README.html [about]: /3box/PanPG/about.html ## Introduction ## #intro PanPG is a parser generator which reads a PEG description of a grammar and writes a packrat parser in pure JavaScript. There is an API for compiling a parser from a grammar, and a smaller API of helpful convenience functions for dealing with the parse trees that a parser produces. ## The PEG Language ## #lang Grammars are given as a parser expression grammar (PEG). The arithmetic grammar from the demo makes a good example of the basic features. Expr ← Add Add ← Mult ( S? "+" S? Mult )* Mult ← Num ( S? "*" S? Num )* Num ← "0" / [1-9] [0-9]* S ← [ U+0020 ]+ PEGs are similar to context-free grammars (CFGs), but have a more imperative nature: while a CFG describes a language, which may or may not be easy to parse, and may be ambiguous, a PEG describes a deterministic parsing strategy. The unambiguous nature of PEGs makes them convenient for parsing programming languages. Wikipedia's [article on PEGs][wp] is a good introduction. PEGs do not require a separate lexer; the same formalism is used to define the grammar down to the level of individual characters. Due to the difference in emphasis between CFGs and PEGs, what in a CFG would be called a production is here called a parse rule. [wp]: http://en.wikipedia.org/wiki/Parsing_expression_grammar TODO: document accepted PEG language Until this is properly documented, the [self-describing PEG grammar][PEG.peg] is the definitive reference. The [ECMAScript grammar][ES5] is a practical example of a full programming language grammar. [PEG.peg]: grammars/PEG.peg [ES5]: grammars/ECMAScript_5.peg ## Compiling ## #compiling The following API is used to produce a parser from a PEG. To use it, include the file [PanPG.js][lib]. The library is only required to generate a parser; once generated, the parsers do not have any dependencies. In most cases, rather than compiling the parser at the point of use, which is slow and requires the parser generator, the parser will instead be compiled once and then included where it will be used. The parser generator API is demonstrated in the [arithmetic grammar demo][demo]. [lib]: build/PanPG.js [demo]: build/demo.html ### `PanPG.generateParser(peg,opts)` ### #generateParser - `peg` is a string, containing the grammar (or an array of grammars to be merged) - `opts`, documented below, optional, controls various details of code generation. When successful, the return value is a string containing a JavaScript program which declares a single function which is the parser. This string can be passed to `eval()` or written to a file for later use. In the case of failure, various exceptions will be thrown. Common reasons for failure are syntax errors in the provided PEG, and rules that are depended on but not defined. If an array is passed as the first argument, each element is parsed as a PEG and the rules are combined, with those occurring in later grammars taking precedence. This is equivalent to using the `patches` option, described below. ### `opts` ### #opts The `opts` argument is optional and if provided, all the properties are optional. The defined properties and default behaviors are as follows, with the most common and most useful listed first: - drop An array of parse rule names to drop from the generated parse tree. Any node generated by such a parse, and all its children, will not appear in the parse tree. This does not change what inputs will be accepted by the parser, only the parse tree that is returned. This is commonly used when parsing programming languages to exclude low-level or atomic rules such as whitespace, comments, individual tokens, or any other rule which is not interesting. As an extreme case, if the start rule is dropped, then no parse tree will be returned, but the parser will still recognize the grammar and indicate failure (by returning a parse error) or success (by not returning a parse error). By default, all rules are returned in parse tree results, which is often convenient when designing and testing a grammar. - elide An array of parse rule names to elide in the generated parse tree. The node generated by any rule appearing in `opts.elide` will not appear in the generated tree; instead it will be replaced by its children. For example, in the grammar `S ← A; A ← B; B ← "x"`, the expected parse tree would have an S node with an A child node with a B child node. If `{elide:['A']}` was passed, the generated tree would contain the S node with a B child node, the A node having been elided. If `{drop:['A']}` was passed, the generated tree would contain only the S node alone. By default, no nodes are elided. - fname The parser is returned as a JavaScript program which declares a single function. This option sets the name of that function. If the `commonjs` option is set, this sets the name of the module returned. By default, the start token of the grammar is used as the function name. - start The start rule may be given as an argument. This can be used to create a parser for a subset of a grammar, which can be useful when testing or when reusing a part of a grammar. Only the named rule given, and its dependencies, will be included in the generated parser. For example, if you had a parser for C and wanted to generate a parser for just C expressions (but not statements or programs) you could do something like `generateParser(complete\_C\_grammar, {start:'Expression'}`. This can also be combined with `opts.patches` to create hybrid grammars. It is also possible to store different grammars (such as different versions of the same or closely related languages) in the same file with different start rules, and common rules shared; the desired parser can then be generated by passing the appropriate `opts.start` rule name. By default, the first parse rule in the PEG grammar is taken to be the start rule. - commonjs Instead of a bare function, the returned parser will be a CommonJS-compatible module. The name of the module will be taken from the `fname` option or the start rule if that is unset. The module exports a `parse` function. The module will be browser-compatible, so it can be used in the browser as well as in a CommonJS environment. This means that a hypothetical parser for language Foo could be used as `Foo.parse(...)` either in the browser, by including ``, or in a CommonJS environment by using `var Foo=require('foo')`. By default, CommonJS output is not used and the returned program declares a single function at the top level. - patches A set of patches may be provided as an array of PEGs given as strings, just like the main grammar. First each is parsed, then each is applied, in the order given, to the main grammar. To apply a patch here means that each rule appearing in it is added to the main grammar, replacing the existing rule with the same name if any. Since patches can override rules as well as add them, it is possible to replace or modify as much of the grammar as desired. Alternatively, multiple grammars can be provided as an array in the first argument to `generateParser`. The following three options are all off by default and are primarily used for debugging of grammars or of the parser generator itself. - trace If true, the generated parser will generate trace messages, which can be used for extremely detailed debugging and analysis of generated parsers. The trace output will typically be orders of magnitude larger than the parse tree, so this option should be turned on with care; tracing parsers work best on small inputs. When combined with a subset parser using `start`, this can be a powerful debugging tool. Tracing is used (automatically) by the `explain` function documented below. - debug This causes the parser generator to dump large amounts of internal information as a source code comment before the parser itself. This data will be mostly of interest to those who are working on PanPG itself. Apart from the added comment, the generated parser code is unchanged. - asserts If true, the generated parser will include assertion statements, and will be slower. By design, the assertions should never fail, regardless of the grammar and input, so this is not useful for debugging grammars, only the parser generator itself or for testing JavaScript engine sanity. ### `PanPG.explain(grammar,opts,input,[verbosity])` ### #explain When a particular grammar and input do not give the parse tree (or error!) that was expected, `explain` can be used to determine what the parser did in excruciating detail. The arguments are the original PEG grammar, the options as they would be used with `generateParser`, and the sample input. The output is a human readable string describing what the parser did at every step, in several sections. ASCII-art is used, so the string should be displayed in a monospace font. The grammar is required because this function recompiles the parser with tracing enabled, and then runs it over the input and analyzes the trace so generated. The compiled grammar is cached by the `explain` function itself, so if called again with only the input text changed, it will not generate the parser code again. Nonetheless, this function can be quite slow and generates truly voluminous output, so care is advised and small inputs are recommended. As an example, with the JSON grammar and a sample input of 706 bytes, `explain` generates over 6 MB of output including a complete trace of the full parser state at every change, and an analysis of the trace which shows the order and nesting of each rule that was tested at each position in the input. The `start` option can be used to specify a particular parse rule to isolate issues to the relevant part of the grammar. The optional verbosity argument increases the verbosity. The default verbosity is 1. If `verbosity` is 2, the trace will be shown. If `verbosity` is 3, the generated parser code and explanatory source code comments will also be included. If `verbosity` is 4, the raw parse tree (an array of numbers) will also be shown. ## Parsing and Parse Trees ## #parsing Once a parser is generated using the API above, it can be used to parse input, returning a parse tree. This is used in the [arithmetic parsing demo][demo]. An example with a more complex grammar is provided by the [ECMAScript 5 demo][ES5demo]. [ES5demo]: build/ES5.html The generated parsers consist of a single function, or if the CommonJS output is used, a module which exports a single function. The function takes a string to be parsed and returns either a parse tree or an error. The return value is an array, either of the form [true, <parse tree>], or [false, <pos>, <error>, <input>]. The format of these arrays may change in future versions, but the first element will always be a Boolean indicating success. The format of the parse tree is described below. In the case of a parse error, the <pos> is the position in the input at which parsing failed. Once a parse result is in hand, it may be passed to `showTree` or `treeWalker`. These functions are contained in [PanPG_util.js][]. The `showTree` function takes the output of a parser, which may be either an error or a parse tree, and displays errors in human-readable form and parse trees as ASCII art. `treeWalker` takes a parse result and a set of callback functions and walks the parse tree. [PanPG_util.js]: build/PanPG_util.js ### Parse Tree Format ### #parse_tree_format Rather than embedded code which is executed on parser rules, the generated parser returns a parse tree which can be processed further as desired. Most users will not deal with the parse trees directly but will prefer the `treeWalker` function, documented below, or some other API. The parse tree format is designed to use little memory and to allow streaming. Streaming parsing is not yet automatically supported but is in the works, see the [project roadmap][]. [project roadmap]: http://inimino.org/~inimino/blog/peg_roadmap For complete details on the parse tree format see its [documentation][]. [documentation]: doc/parse_tree_representation ### `PanPG_util.showTree(result, opts)` ### #showTree - `result` is the return value of a parser. - `opts` is an object which may have the following properties: - `elide` is an array of rule names to be hidden in the output. - `drop` is an array of rule names to be completely suppressed including descendant nodes. `showTree` takes a parse result and returns a pretty-printed ASCII representation of the parse tree. In the event that the parse failed, showTree calls showError which returns a human-readable description of the error including the line on which the failure occurred. Note that the error will be reported at the position of the last alternative to be tried, which may be earlier in the input than the actual error, depending on the grammar. The following example shows a short JavaScript program, and the parse tree that the ECMAScript 5 parser generates when parsing this program. This tree contains parse nodes that are not interesting for most uses, so passing {elide:['anonymous','IdentifierStart','UnicodeLetter',...]} to `showTree` would give a more concise output. Once a grammar is complete and correct, rather than just hiding the uninteresting nodes in showTree, the uninteresting rule names (for a particular use of the grammar) can be passed to the parser generator to produce a parser that simply does not generate those nodes. function f(){return 42} Program 0-25 "function f(x){re…" FunctionDeclaration 0-25 "function f(x){re…" FunctionTok 0-8 "function" S 8-9 " " WhiteSpace 8-9 " " Identifier 9-10 "f" IdentifierName 9-10 "f" IdentifierStart 9-10 "f" UnicodeLetter 9-10 "f" anonymous 10-11 "(" FormalParameterList 11-12 "x" Identifier 11-12 "x" IdentifierName 11-12 "x" IdentifierStart 11-12 "x" UnicodeLetter 11-12 "x" anonymous 12-13 ")" anonymous 13-14 "{" FunctionBody 14-24 "return x*x" Statement 14-24 "return x*x" ReturnStatement 14-24 "return x*x" ReturnTok 14-20 "return" SnoLB 20-21 " " WhiteSpace 20-21 " " Expr 21-24 "x*x" AssignmentExpr 21-24 "x*x" ConditionalExpr 21-24 "x*x" LogicalOrExpr 21-24 "x*x" LogicalAndExpr 21-24 "x*x" BitwiseOrExpr 21-24 "x*x" BitwiseXOrExpr 21-24 "x*x" BitwiseAndExpr 21-24 "x*x" EqualityExpr 21-24 "x*x" RelationalExpr 21-24 "x*x" ShiftExpr 21-24 "x*x" AdditiveExpr 21-24 "x*x" MultiplicativeExpr 21-24 "x*x" UnaryExpr 21-22 "x" PostfixExpr 21-22 "x" LeftHandSideExpr 21-22 "x" NewExpr 21-22 "x" MemberExpr 21-22 "x" PrimaryExpr 21-22 "x" Identifier 21-22 "x" IdentifierName 21-22 "x" IdentifierStart 21-22 "x" UnicodeLetter 21-22 "x" MultiplicativeOp 22-23 "*" UnaryExpr 23-24 "x" PostfixExpr 23-24 "x" LeftHandSideExpr 23-24 "x" NewExpr 23-24 "x" MemberExpr 23-24 "x" PrimaryExpr 23-24 "x" Identifier 23-24 "x" IdentifierName 23-24 "x" IdentifierStart 23-24 "x" UnicodeLetter 23-24 "x" EOS 24-24 "" anonymous 24-25 "}" ### `PanPG_util.treeWalker(callbacks, result)` ### #treeWalker `treeWalker` takes a set of callback functions and a parse result, and walks the result parse tree, calling the callbacks provided for each node in the parse tree. Callbacks are named according to the corresponding parse rule in the grammar. A callback is called by the walker when a node ends (post-order traversal), so the top-level rule or start rule of the grammar (corresponding to the parse of the entire input) will be the last callback to be called. Each callback is called with two arguments: a match object and a child array. The match object provides information about the segment of the input matched by the node. The child array is not an array of parse nodes, but an array of return values from callback functions of the child nodes. The match object has a text() method which returns the matched segment of the input string. (It is provided as a function rather than a string because slicing the match out of the string is a relatively expensive operation, so generating it only when required speeds up execution in some engines considerably.) The match object also has `start` and `end` properties which are indices into the input string, useful for such things as syntax highlighters where the position of the parse node within the input string is what is relevant, rather than the contents that were matched. Five special callbacks can be provided; these are `anonymous`, `other`, `fail`, `exception`, and `warn`. All are optional. `anonymous` is called on anonymous nodes in the parse tree. These can be seen by passing `{hide:[]}` options to showTree. `other` will be called on named nodes for which no specific callback was provided. In addition to the match and child array arguments described above for named callbacks, it is also called with the name of the rule as a third argument. `fail` will be called on parse failure, with the unsuccessful parse result as the only argument. `exception` will be called in case an exception is thrown by any callback. Otherwise the exception will propagate normally and will terminate the walk. If the `exception` callback returns `true`, the walk will continue, otherwise the treeWalker call will return immediately. `warn` will be called when potential errors are detected, such as a callback returning a value when there is no callback for the parent node to receive that value. It is recommended to pass a logging or alert function as the warn callback especially when debugging. When a callback returns a value, it is appended to an array, and that array is provided to the callback for the parent node. If none of a node's children's callbacks returned a value, then the second argument will be an empty array. If a callback for the top-level node returns a value, it will be used as the return value of the `treeWalker` call itself. There is an example use of the tree walker API in the [arithmetic demo][demo]. ## Source Tree Layout ## #layout `doc/` contains plain text documentation which may or may not be useful or current. The most current documentation is the source code itself, and the files under `doc/` that are pointed to in source code comments. `build/` contains generated files and generated code. This includes the library files themselves, PanPG\_generator.js and PanPG\_util.js, which are concatenations of other source files and generated files, the generated versions of some of the example grammars that ship with the library (JSON, ES5, and the PanPG PEG grammar itself), built versions of dependencies such as the cset library and its generated Unicode character class tables. Some files, notably parsePEG.js and cset_prod.js, are both built by the project, and depended on by it, so the build/ directory cannot be wiped out and reconstructed easily from the source tree. Currently the build system is browser-based and somewhat arcane, so to do your own build from source you'll probably need to write a non-browser-based build script, perhaps starting from `build.js` and `build_support.js`. `grammars/` contains example grammars for JSON, ECMAScript, and the PEG language accepted by PanPG itself. `src/` contains the source files. To read the code start from the API entry points in the API_* files and follow the dependencies from there.