Skip to content

Latest commit

 

History

History
executable file
·
762 lines (566 loc) · 26.7 KB

manual.md

File metadata and controls

executable file
·
762 lines (566 loc) · 26.7 KB

The LUIF

This document (which is currently a work-in-progress for further discussion) describes the LUIF (LUa InterFace), the interface language of the Kollos project.

The LUIF is Lua, extended with BNF statements as specified below.

Table of Contents

BNF Statement

Semantics

Events
Post-Processing
Programmatic Grammar Construction
The Complete Syntax of BNF Statement

Implementation Considerations

Example Grammars

BNF Statement

LUIF extends the Lua syntax by adding bnf alternative to stat rule of the Lua grammar and introducing the new rules for BNF statements. There is only one BNF statement, combining precedence, sequences, and alternation as specified below.

A BNF statement specifies a rule, which consists of, in order:

  • A left hand side (LHS), which will be a symbol.

  • A produce-operator (::= or ~).

  • A right-hand side (RHS), which contains one or more RHS alternatives. A RHS alternative is a series of RHS primaries, where a RHS primary may be a symbol name, a character class, a literal, a sequence or another, grouped or hidden, RHS alternative.

BNF statements are grouped into one or more grammars. The grammar is indicated by the produce-operator of the BNF. The general form of the produce operator is

    :grammar:=

where grammar is the name of a grammar. grammar must be a string of the form acceptable as a Lua variable name.

If the produce-operator is ::=, then the grammar is g1. The tilde ~ can be a produce-operator, in which case it is equivalent to :l0:= or :lexical_grammar_name:= if the lexical_grammar_name is set using the lexer adverb of a start rule.

A grammar can be either structural or lexical.

A structural grammar will often contain lexical elements, such as strings and character classes, and these will go into its associated lexical grammar. The start rule of a structural grammar specifies its lexical grammar with the lexer adverb. In a lexical grammar the lexemes are indicated with the lexeme adverb -- if a rule has a lexeme adverb, its LHS is a lexeme.

Initially, the post-processing will not support anything but l0 and g1 used in the default way, like this:

-- structural grammar
a ::= <b> <c>   -- the first rule is the start rule
                -- unless the start rule is defined explicitly

a ::= <w> -- using the LHS of a lexical rule (<w>, or <b> <c> above
          -- on the RHS of a structural rule makes a lexeme

aa ::= a a

-- lexical grammar
<w> ~ x y z
<b> ~ 'b' x
<c> ~ 'c' y

x ~ 'x'
y ~ 'y'
z ~ [xyz]

Structural and Lexical Grammars

A grammar is lexical if one of its rules contains the lexeme adverb. A grammar is structural if it has a start rule.

By default, the LUIF starts with two grammar in its grammar set: g1 and l0. By default, g1 is a structural grammar. By default, l0 is a lexical grammar, and is the lexer for g1. A structural grammar may specify its lexer by using the lexer adverb in its start rule.

A grammar is always structural or lexical, but never both. It is a fatal error if, according to the stipulations above, a grammar would be both lexical and structural. If applying the stipulations above does not determine whether a grammar is lexical or structural, then it defaults to structural.

A rule is structural if it belongs to a structural grammar. A rule is lexical if it belongs to a lexical grammar. Structural or lexical rules are declared by using produce-operators (::= and ~).

In a lexical grammar, a lexeme is a top-level symbol, and must be specified with the lexeme adverb. In a lexical grammar, a lexeme can never appear on a RHS.

In a structural grammar, a lexeme is a symbol which never appears on the LHS of a rule.

Structural rules may imply lexical rules in the structural grammar’s associated lexer. Charclasses and strings in structural rules, for example, create lexemes in the associated lexical grammar. It is a fatal error if a structural rule implies a lexical rule, but the structural grammar has no associated lexical grammar.

A lexeme is a sequence of characters in the input matched by a rule in a lexical grammar. Marpa’s usage of the term “lexeme” is special to it.

For all pairings of structural grammars and their lexers, both grammars in the pair must have a consistent idea of which symbols are lexemes. Full enforcement of this and the other stipulations of this section does not occur until the lower layer (KLOL) processes the KIR.

Start Rule

The start rule of a LUIF structural grammar sets its

  • name, and, optionally,
  • its lexical grammar and
  • start symbol.

If the start rule is omitted

  • the name of the grammar will be g1 and all subsequent rules will belong to that grammar or, if they are lexical, in its associated lexical grammar l0,
  • the start symbol will be the LHS of the structural rule, which occurs first in the source file, and
  • the lexical grammar will be l0.

The syntax of a start rule is

start_rule ::= ':' grammar ':='
  ( 'lexer' '=' lexical_grammar_name ( 'start' =  start_symbol )? )?

The lexer adverb defines the lexical grammar for the structural grammar specified in the start rule. If it is omitted, the name of the lexical grammar will be the name of the structural with _lex appended, e.g., in the case of start :grammar:= , grammar_lex.

The start adverb, which is optional, defines the start symbol of the grammar. If it is omitted, the start symbol will be the LHS of the structural rule, which occurs first in the source file.

An adverbless start rule just sets the name of the structural grammar for all rules that follow, e.g. :json:=.

As an example, this start rule

:calc:= start = Script, lexer = calc_lex

defines a structural grammar named calc with start symbol Script and lexical grammar calc_lex.

Precedenced Rules

A precedenced rule contains a series of one or more RHS alternatives, separated by either the alternation operator (|) or the loosen operators (||). In a typical grammar, most rules are precedenced rules, but they are often trivially precedenced, consisting of one or several RHS alternatives with equal precedence. For brevity, RHS alternatives are often called alternatives.

An alternative may be followed by a list of adverbs, from which it is separated with a comma.

The RHS alternatives in a precedenced right hand side proceed from tightest (highest) priority to loosest. The double “or” symbol (||) is the “loosen” operator -- the alternatives after it have a looser (lower) priority than the alternatives before it. The single “or” symbol (|) is the ordinary “alternative” operator -- alternatives on each side of it have the same priority. Associativity is specified using the assoc adverb, as described below.

For a usage example of precedenced rules, see the Calculator grammar below.

Sequences

Sequences are expressions on the RHS of a BNF rule alternative which imply the repetition of a symbol, or a parenthesized series of symbols.

The item to be repeated (the repetend) can be either a single symbol, or a sequence of symbols grouped by parentheses or square brackets, as described above. A repetition consists of

  • A repetend, followed by
  • An optional punctuation specifier.

A repetition specifier is one of

    ** N..M     -- repeat between N and M times
    ** N..*     -- repeat more than N times
    ?           -- equivalent to ** 0..1
    *           -- equivalent to ** 0..*
    +           -- equivalent to ** 1..*

A punctuation specifier is one of

    % <sep>     -- use <sep> as a proper separator
    %% <sep>    -- use <sep> as a liberal separator
    %- <sep>    -- proper separation, same as %
    %$ <sep>    -- use <sep> as a terminator

When proper separation is in use, the separators must actually separate items. A separator after the last item is not allowed.

When the separator is used as a terminator, it must come after every item. In particular, there must be a separator after the last item.

A “liberal” separator may be used either as a proper separator or a terminator. That is, the separator after the last item is optional.

Here are some examples:

    A+                 -- one or more <A> symbols
    A*                 -- zero or more <A> symbols
    A ** 42            -- exactly 42 <A> symbols
    <A> ** 3..*        -- 3 or more <A> symbols
    <A> ** 3..42       -- between 3 and 42 <A> symbols
    (<A> <B>) ** 3..42 -- between 3 and 42 repetitions of <A> and <B>
    [<A> <B>] ** 3..42 -- between 3 and 42 repetitions of <A> and <B>,
                       --   hidden from the semantics
    <a>* % ','         -- 0 or more properly comma-separated <a> symbols
    <a>+ % ','         -- 1 or more properly comma-separated <a> symbols
    <a>? % ','         -- 0 or 1 <a> symbols; note that ',' is never used
    <a> ** 2..* % ','  -- 2 or more properly comma-separated <a> symbols
    <A>+ % ','         -- one or more properly comma-separated <A> symbols
    <A>* % ','         -- zero or more properly comma-separated <A> symbols
    (A B)* % ','       -- A and B, repeated zero or more times, and properly comma-separated
    <A>+ %% ','        -- one or more comma-separated or comma-terminated <A> symbols

[todo: examples for charclass separators]

The repetend cannot be nullable. If a separator is specified, it cannot be nullable. If a terminator is specified, it cannot be nullable. If you try to work out what repetition of a nullable item actually means, I think the reason for these restrictions will be clear -- such a repetition is very ambiguous. An application which really wants to specify rules involving nullable repetition, can specify them directly in BNF, and these will make the programmer’s intent clear.

[todo: nullable separators -- http://irclog.perlgeek.de/marpa/2015-05-03#i_10539668]

Grouping and Hiding Symbols

To group a series of RHS symbols use parentheses:

   ( A B C )

You can also use square brackets, in which case the symbols will be hidden from the semantics:

   [ A B C ]

Parentheses and square brackets can be nested. If square brackets are used at any nesting level containing a symbol, that symbol is hidden. In other words, there is no way to “unhide” a symbol that is inside square brackets.

A grouped or hidden series of RHS symbols can be followed by a quantifier (?, * or +) to define zero or one, zero or more, or one or more repetitions of such series.

Symbol names

A Kollos external symbol name is any valid Lua name. When the meaning is clear, as it is in most contexts, a Kollos external symbol name is simply called a Kollos symbol name, or just a symbol name. Eventually Kollos external symbol names with non-initial hyphens will be allowed and an angle bracket notation for LUIF symbol names, similar to that of the SLIF, will allow whitespace in names.

In some contexts, such as tracing and error messages, Kollos internal names may be visible. Kollos internal names are either KHIL symbol names, or KLOL symbol names, depending on whether they were created by the KHIL or KLOL.

Internal names are those which include exclamation points (!), and question marks (?). If a Kollos internal name includes a question mark, it is a KLOL internal name. If a Kollos internal name is not a KLOL internal name, and includes an exclamation mark, it is a KHIL internal name.

As examples, sym1, sym_1, sym-1 and <symbol one> are Kollos external symbol names. < symbol one ! sequence lhs > and symbol-one!seq-lhs are KHIL internal symbol names. <symbol one ! sequence lhs?part one>, ?chaf49, expression?nulled, symbol-one!seq-lhs?part1, and expression!seq-lhs?nulled are all KLOL internal symbol names.

In normal usage of the LUIF, the application code sees only external symbol names. In normal processing, the KHIL code sees only external symbol names and KHIL symbol names. The KLOL code sees all external and internal symbol names. When tracing or debugging internals, the KHIL code and even the application code may see all external and internal symbol names.

Literals

LUIF literals are Lua literal strings as defined in Lexical Conventions section of the Lua 5.1 Reference Manual except that LUIF literals cannot be enclosed in long brackets.

Character classes

A character class is a string, which must contain a valid Lua character class as defined in the Lua reference manual. Strings can be defined with character classes using sequence rules.

Comments

LUIF comments are Lua comments as defined at the end of Lexical Conventions section in the Lua 5.1 Reference Manual.

Adverbs

A LUIF rule can be modified by one or more adverbs. Adverbs are name = value pairs separated with commas. A comma is also used to separate an adverb from the RHS alternative it modifies.

action adverb

The action adverb defines the semantics of the RHS alternative it modifies. Its value is specified in Semantics section below.

The action adverb can also have a special lexeme value descrived above.

assoc adverb

The assoc adverb defines associativity of a precedenced rule. Its value can be left, right, or group. The function of this adverb is as defined in the SLIF.

For a usage example, see the Calculator grammar below.

name adverb

The name adverb defines the name of a RHS alternative it modifies. The name can then be retrieved using luif.context.alternative() function.

completed adverb

The completed adverb defines the Lua function to be called when the RHS alternative is completed during the parse. Its value is the same as that of the action adverb.

For more details on parse events, see Events section.

predicted adverb

The predicted adverb defines the Lua function to be called when the RHS alternative is predicted during the parse. Its value is the same as that of the action adverb.

For more details on parse events, see Events section.

Semantics

The semantics of a BNF statement in the LUIF can be defined by modifying its RHS alternatives with action adverb.

Defining Semantics with action adverb

The value of an action adverb can be a body of a Lua function (funcbody) as defined in Function Definitions section of the Lua 5.1 Reference Manual or the name of such function, which must be a bare name (not a namespaced or a method function’s name).

The action functions will be called in the context where their respective BNF statements are defined. Their return values will become the values of the LHS symbols corresponding to the RHS alternatives modified by the action adverb.

The match context information, such as matched rule data, input string locations and literals will be provided by context accessors in luif.context namespace.

If the semantics of a BNF statement is defined in a separate Lua file, LUIF functionality must be imported with Lua’s [require] (http://www.lua.org/manual/5.1/manual.html#pdf-require) function.

The syntax for a semantic action function is

action = function (params) body end

It will be called as f (params) with params set to the values defined by the semantics of the matched RHS alternative’s symbols.

[todo: parameter list is under discussion at #26]

Context Accessors

Context accessors live in the luif.context namespace. They can be called from semantic actions to get matched rule and location data. To import them into a separate file, use Lua’s require function, i.e.

require 'luif.context'

The context accessors are:

lhs_id = luif.context.lhs()

returns the integer ID of the symbol which is on the LHS of the BNF rule whose semantic action or event handler is being called.

rule_id = luif.context.rule()

returns the integer ID of the BNF rule whose semantic action or event handler is being called.

alt_name = luif.context.alternative()

returns the name of the BNF rule’s RHS alternative whose semantic action or event handler is being called, if such name is set using the name adverb or nil if it is not so set.

prec = luif.context.precedence()

returns numeric precedence of the alternative, whose semantic action or event handler is being called, relative to other alternatives or nil if no precedence is defined for the alternative.

pos, len = luif.context.span()

returns position and length of the input section corresponding to the BNF rule whose semantic action or event handler is being called.

string = luif.context.literal()

returns the section of the input corresponding to the BNF rule whose semantic action or event handler is being called. It is defined by the input span returned by the luif.context.span() function above.

pos = luif.context.pos()

returns the position in the input, which starts the span corresponding to the BNF rule whose semantic action or event handler is being called.

len = luif.context.length()

returns the length of the input span corresponding to the BNF rule whose semantic action or event handler is being called.

Events

Parse events are defined using completed and predicted adverbs.

[todo: provide getting started info/tutorial on parse events].

[todo: example/prototyping based on https://gist.github.com/rns/ba250ed6a5ed1c82ce7b]

Post-processing

LUIF grammars are transformed into KIR (Kollos Intermediate Runtime) tables using Direct-to-Lua (D2L) calls and format specified in a separate document.

[todo: rewrite to the end of the section]

The output will be a table, with one key for each grammar name. Keys must be strings. The value of each grammar key will be a table, with entries for external and internal symbols and rules. Details of the format will be specified later.

The KIR table will be interpreted by the lower layer (KLOL). Initially post-processing will take a very restricted form in the LUIF structural and lexical grammars.

The post-processing will expect a lexical grammar named l0 and a structural grammar named g1, and will check (in the same way that Marpa::R2 currently does) to ensure they are compatible.

Programmatic Grammar Construction (PGC)

Direct-to-Lua (D2L) calls can be used to build LUIF grammars programmatically. The details are specified in a separate document.

At the moment, LUIF statements cannot be affected by Lua statements directly, but this can change in future.

The Complete Syntax of BNF Statement

The general syntax for a BNF statement is as follows (stat, block, funcbody, Name, and String symbols are as defined by the Lua grammar):

stat ::= bnf

bnf ::= start_rule |
        lhs produce_op rhs

start_rule ::= ':' grammar ':='
  ( 'lexer' '=' lexical_grammar_name ( 'start' =  start_symbol )? )?

lhs ::= symbol_name

produce_op ::= '::=' |
               '~'

rhs ::= precedenced_alternative ( '||' precedenced_alternative )*

precedenced_alternative ::= alternative ( '|' alternative )*

alternative ::= rhs_primaries ( ',' adverb )*

adverb ::= action |
           completed |
           predicted |
           assoc

--[todo: values other than function(...) -- https://github.com/rns/kollos-luif-doc/issues/12 ]
--[todo: context in action functions -- https://github.com/rns/kollos-luif-doc/issues/11 ]
action ::= 'action' '=' functionexp

completed ::= 'completed' '=' functionexp

predicted ::= 'predicted' '=' functionexp

functionexp ::= 'function' funcbody |
                Name

assoc ::= 'assoc' '=' assocexp

assocexp ::= 'left' |
             'right' |
             'group'

rhs_primaries ::= ( rhs_primary )*       -- can be empty, like Lua chunk

rhs_primary ::= separated_sequence |
                symbol_name |
                literal |
                charclass |
                grouped_alternative |
                hidden_alternative

separated_sequence ::= sequence  |
                       sequence '%'  separator | -- proper separation
                       sequence '%%' separator |
                       sequence '%-' separator |
                       sequence '%$' separator

grouped_alternative ::= '(' alternative ')'

hidden_alternative ::= '[' alternative ']'

sequence ::= sequence_item '+' |
             sequence_item '*' |
             sequence_item '?' |
             sequence_item '**' Number '..' Number |
             sequence_item '**' Number '..' '*'

sequence_item ::= symbol_name |
                  grouped_alternative |
                  hidden_alternative

separator ::= symbol_name |
              literal |
              character_class

symbol_name :: Name

literal ::= String    -- sans the long strings

character_class ::= String

Implementation considerations

Where feasible, discussion of implementation is in separate documents. This section is for those implementation considerations that are so closely tied to to interface considerations that it makes sense to include them here. These is useful while this document is in progress. These discussions may not be appropriate to keep in the reference manual once implementation is complete.

Lua patterns

Initially, Lua patterns will be used to implement character classes. However, Lua patterns do not support pre-compilation and can be slow in lots of real-life situations, e.g. benchmarks, so other options, e.g., regexes, can be considered.

Fatal errors

Fatal errors, if they affect the KIR, are always caught at the lower-level (KLOL). They may be caught at the higher-level (KHIL) as well, if that is convenient, or if it is necessary for the logic.

Locale support

Full support is only assured for the “C” locale -- support for other locales may be limited, inconsistent, or removed in the future.

Lua’s os.setlocale(), when used in the LUIF context for anything but the “C” locale, may fail, silently or otherwise.

[todo: update the tentative language above as Kollos project progresses]

Example grammars

Calculator

[todo: update parameter list per #26]

Script ::= Expression+ % ','
Expression ::=
  Number
  | '(' Expression ')', assoc = group, action = do_parens
 || Expression '**' Expression, assoc = right, action = function (e1, e2) return e1 ^ e2 end
 || Expression '*' Expression, action = function (e1, e2) return e1 * e2 end
  | Expression '/' Expression, action = function (e1, e2) return e1 / e2 end
 || Expression '+' Expression, action = function (e1, e2) return e1 + e2 end
  | Expression '-' Expression, action = function (e1, e2) return e1 - e2 end
 Number ~ [0-9]+

JSON


-- structural

json     ::= object
           | array
object   ::= [ '{' '}' ]
           | [ '{' ] members [ '}' ]
members  ::= pair+ % comma
pair     ::= string [ ':' ] value
value    ::= string
           | object
           | number
           | array
           | _true
           | _false
           | null
array    ::= [ '[' ']' ]
           | [ '[' ] elements [ ']' ]
elements ::= value+ % comma
string   ::= lstring

-- lexical

comma          ~ ','
_true          ~ 'true'   -- true is a Lua keyword
_false         ~ 'false'  -- and so is false
null           ~ 'null'
number         ~ int
               | int frac
               | int exp
               | int frac exp
int            ~ digits
               | '-' digits
digits         ~ [\d]+
frac           ~ '.' digits
exp            ~ e digits
e              ~ 'e'
               | 'e+'
               | 'e-'
               | 'E'
               | 'E+'
               | 'E-'
lstring        ~ quote in_string quote
quote          ~ ["]
in_string      ~ in_string_char*
in_string_char ~ [^"] | '\"'

whitespace     ~ [\s]+

--[todo: specify equivalent in LUIF ]
:discard       ~ whitespace