git @ Cat's Eye Technologies Specs-on-Spec / master mdpn / mdpn.markdown
master

Tree @master (Download .tar.gz)

mdpn.markdown @masterview rendered · raw · history · blame

Multi-Directional Pattern Notation
==================================

Final - Sep 6 1999

* * * * *

Introduction
------------

MDPN is an extension to EBNF, which attributes it for the purposes of
scanning and parsing input files which assume a non-unidirectional form.
A familiarity with EBNF is assumed for the remainder of this document.

MDPN was developed by Chris Pressey in late 1998, built on an earlier,
less successful attempt at a "2D EBNF" devised to fill a void that the
mainstream literature on parsing seemed to rarely if ever approach, with
much help provided by John Colagioia throughout 1998.

MDPN has possible uses in the construction of parsers and subsequently
compilers for multi-directional and multi-dimensional languages such as
Orthogonal, Befunge, Wierd, Blank, Plankalkül, and even less contrived
notations like structured Flowchart and Object models of systems.

As the name indicates, MDPN provides a notation for describing
multidimensional patterns by extending the concept of linear scanning
and matching with geometric attributes in a given number of dimensions.

Preconditions for Multidirectional Parsing
------------------------------------------

The multidirectional parsing that MDPN concerns itself with assumes that
any portion of the input file is accessable at any time. Concepts such
as LL(1) are fairly meaningless in a non-unidirectional parsing system
of this sort. The unidirectional input devices such as paper tape and
punch cards that were the concern of original parsing methods have been
superceded by modern devices such as hard disk drives and ample, cheap
RAM.

In addition, MDPN is limited to an orthogonal representation of the
input file, and this document is generally less concerned about forms of
four or higher dimensions, to reduce unnecessary complexity.

Notation from EBNF
------------------

Syntax is drawn from EBNF. It is slightly modified, but should not
surprise anyone who is familiar with EBNF.

A freely-chosen unadorned ('bareword') alphabetic multicharacter
identifier indicates the name of a nonterminal (named pattern) in the
grammar. e.g. `foo`. (Single characters have special meaning as
operators.) Case matters: `foo` is not the same name as `Foo` or `FOO`.

Double quotes begin and end literal terminals (symbols.) e.g. `"bar"`.

A double-colon-equals-sign (`::=`) describes a production (pattern
match) by associating a single nonterminal on the left with a pattern on
the right, terminated with a period. e.g. `foo ::= "bar".`

A pattern is a series of terminals, nonterminals, operators, and
parenthetics.

The `|` operator denotes alternatives. e.g. `"foo" | "bar"`

The `(` and `)` parentheses denote precedence and grouping.

The `[` and `]` brackets denote that the contents may be omitted, that
is, they may occur zero or one times. e.g. `"bar" ["baz"]`

The `{` and `}` braces denote that the contents may be omitted or may be
repeated any number of times. e.g. `"bar" {baz "quuz"}`

Deviations from EBNF
--------------------

The input file is spatially related to a coordinate system and it is
useful to think of the input mapped to an orthogonally distributed
(Cartesian) form with no arbitrary limit imposed on its size,
hereinafter referred to as *scan-space*.

The input file is mapped to scan-space. The first printable character in
the input file always maps to the *origin* of scan-space regardless of
the number of dimensions. The origin is enumerated with coordinates (0)
in one dimension, (0,0) in two dimensions, (0,0,0) in three dimensions,
etc.

Scan-space follows the 'computer storage' co-ordinate system so that *x*
coordinates increase to the 'east' (rightwards), *y* coordinates
increase to the 'south' (downwards), and *z* coordinates increase on
each successive 'page'.

Successive characters in the input file indicate successive coordinate
(*x*) values in scan-space. For two and three dimensions, end-of-line
markers are assumed to indicate "reset the *x* dimension and increment
the *y* dimension", and end-of-page markers indicate "reset the *y*
dimension and increment the *z* dimension", thus following the
commonplace mapping of computer text printouts.

Whitespace in the input file are **not** ignored. The terminal `" "`,
however, will match any whitespace (including tabs, which are **not**
expanded.) The pattern `{" "}` may be used to indicate any number of
whitespaces; `" " {" "}` may be used to indicate one or more
whitespaces. Areas of scan-space beyond the bounds of the input file are
considered to be filled with whitespaces.

Therefore, `"hello"` as a terminal is exactly the same as
`"h" "e" "l" "l" "o"` as an pattern of terminals.

A `}` closing brace can be followed by a `^` (*constraint*) operator,
which is followed by an expression in parentheses.

This expression is actually in a subnotation which supports a very
simple form of algebra. The expression (built with terms connected by
infix `+-*/%` operators with their C language meanings) can either
reduce to

-   a constant value, as in `{"X"} ^ (5)`, which would match five `X`
    terminals in a line; or
-   an unknown value, which can involve any single lowercase letters,
    which indicate variables local to the production, as in
    `{"+"}^(x) {"-"}^(x*2)`, which would match only twice as many minus
    signs as plus signs.

Complex algebraic expressions in constraints can and probably should be
avoided when constructing a MDPN grammar for a real (non-contrived)
compiler. MDPN-based compiler-compilers aren't expected to support more
than one or two unknowns per expression, for example. There is no such
restriction, of course, when using MDPN as a guide for hand-coding a
multidimensional parser, or otherwise using it as a more sophisticated
pattern-matching tool.

The Scan Pointer
----------------

It is useful to imagine a *scan pointer* (SP, not to be confused with a
*stack pointer*, which is not the concern of this document) which is
analogous to the current token in a single-dimensional parser, but
exists in MDPN as a free spatial relationship to the input file, and
thus also has associated geometric attributes such as direction.

The SP's *location* is advanced through scan-space by its *heading* as
terminals in the productions are successfully matched with symbols in
the input buffer.

The following geometric attribution operators modify the properties of
the SP. Note that injudicious use of any of these operators *can* result
in an infinite loop during scanning. There is no built-in contingency
measure to escape from an infinite parsing loop in MDPN (but see
exclusivity, below, for a possible way to overcome this.)

`t` is the relative translation operator. It is followed by a vector, in
parentheses, which is added to the location of the SP. This does not
change its heading.

For example, `t (0,-1)` moves the SP one symbol above the current symbol
(the symbol which was *about* to be matched.)

As a more realistic example of how this works, consider that the pattern
`"." t(-1,1) "!" t(0,-1)` will match a period with an exclamation point
directly below it, like:

    .
    !

`r` is the relative rotation operator. It is followed by an axis
identifier (optional: see below) and an orthogonal angle (an angle *a*
such that |*a*| **mod** 90 degrees = 0) assumed to be measured in
degrees, both in parentheses. The angle is added to the SP's heading.
Negative angle arguments are allowed.

Described in two dimensions, the (default) heading 0 denotes 'east,'
that is, parsing character by character in a rightward direction, where
the SP's *x* axis coordinate increases and all other axes coordinates
stay the same. Increasing angles ascend counterclockwise (90 = 'north',
180 = 'west', 270 = 'south'.)

For example, `">" r(-90) "+^"` would match

    >+
     ^

The axis identifier indicates which axis this rotation occurs around. If
the axis identifier is omitted, the *z* axis is to be assumed, since
this is certainly the most common axis to rotate about, in two
dimensions.

If the axis identifier is present, it may be a single letter in the set
`xyz` (these unsurprisingly indicate the *x*, *y*, and *z* dimensions
respectively), or it may be a non-negative integer value, where 0
corresponds to the *x* dimension, 1 corresponds to the *y* dimension,
etc. (Implementation note: in more than two dimensions, the SP's heading
property should probably be broken up internally into theta, rho, &c
components as appropriate.)

For example, `r(z,180)` rotates the SP's heading 180 degrees about the
*z* (dimension \#2) axis, as does `r(2,180)` or even just `r(180)`.

`<` and `>` are the push and pop state-stack operators, respectively.
Alternately, they can be viewed as lookahead-assertion parenthetics,
since the stack is generally assumed to be local to the production.
(Compiler-compilers should probably notify the user, but not necessarily
panic, if they find unbalanced `<>`'s.)

All properties of the SP (including location and heading, and scale
factor if supported) are pushed as a group onto the stack during `<` and
popped as a group off the stack during `>`.

Advanced SP Features
--------------------

These features are not absolutely necessary for most non-contrived
multi-directional grammars. MDPN compiler-compilers are not expected to
support them.

`T` is the absolute translation operator. It is followed by a vector
which is assigned to the location of the SP. e.g. `T (0,0)` will 'home'
the scan.

`R` is the absolute rotation operator. It is followed by an optional
axis identifier, and an orthogonal angle assumed to be measured in
degrees. The SP's heading is set to this angle. e.g. `R(270)` sets the
SP scanning line after line down the input text, downwards. See the `r`
operator, above, for how the axis identifier functions.

`S` is the absolute scale operator. It is followed by an orthogonal
*scaling factor* (a scalar *s* such that *s* = **int**(*s*) and *s* \>=
1). The SP's scale factor is set to this value. The finest possible
scale, 1, indicates a 1:1 map with the input file; for each one input
symbol matched, the SP advances one symbol in its path. When the scale
factor is two, then for each one input symbol matched, the SP advances
two symbols, skipping over an interim symbol. Etc.

`s` is the relative scale operator. It is followed by a scalar integer
which is added to the SP's scaling factor (so long as it does not cause
the scaling factor to be zero or negative.)

Scale operators may also take an optional axis identifier (as in
`S(y,2)`), but when the axis identifier is omitted, all axes are assumed
(non-distortional scaling).

`!>` is a state-assertion alternative to `>`, for the purpose of
determining that the SP successfully and completely reverted to a
previous state that was pushed onto the stack ('came full circle'). This
operator is something of a luxury; a grammar which uses constraints
correctly should never *need* it, but it can come in handy.

Other Advanced Features: Exclusivity
------------------------------------

Lastly, in the specification of a production, the *exclusivity* applying
to that production can be given between a hyphen following the name of
the nonterminal, and the `::=` operator.

Exclusivity is a list of productions, named by their nonterminals, and
comes into play at any particular *instance* of the production (i.e.
when the production successfully matches specific symbols at specific
points in scan-space during a parse, called the *domain*.) The
exclusivity describes how the domain of each instance is protected from
being the domain of any further instances. The domain of any subsequent
instances of any productions listed in the exclusivity is restricted
from sharing points in scan-space with the established domain.

Exclusivity is a measure to prevent so-called *crossword grammars* -
that is, where instances of productions can *overlap* and share common
symbols - if desired. Internally it's generally considered a list of
'used-by-this-production' references associated with each point in
scan-space. An example of the syntax to specify exclusivity is
`bar - bar quuz ::= foo {"a"} baz`. Note that the domain of an instance
of `bar` is the sum of the domains `foo`, `baz` and the chain of "`a`"
terminals, and that neither a subsequent instance of `quuz` nor `bar`
again can overlap it.

Examples of MDPN-described Grammars
-----------------------------------

**Example 1.** A grammar for describing boxes.

The task of writing a translator to recognize a two-dimensional
construct such as a box can easily be realized using a tool such as
MDPN.

An input file might contain a of box drawn in ASCII characters, such as

    +------+
    |      |
    |      |
    +------+

Let's also say that boxes have a minimum height of four (they must
contain at least two rows), but no minimum width. Also, edge characters
must match up with which edge they are on. So, the following forms are
both illegal inputs:

    +-+
    +-+

    +-|-+
    |   |
    *    
    |   |
    +-|-+

The MDPN production used to describe this box might be

      Box ::= "+" {"-"}^(w) r(-90) "+" "||" {"|"}^(h) r(-90)
              "+" {"-"}^(w) r(-90) "+" "||" {"|"}^(h) r(-90).

**Example 2.** A simplified grammar for Plankalkül's assignments.

An input file might contain an ASCII approximation of something Zuse
might have jotted down on paper:

     |Z   + Z   => Z
    V|1     2      3
    S|1.n   1.n    1.n

Simplified MDPN productions used to describe this might be

    Staff     ::= Spaces "|" TempVar AddOp TempVar Assign TempVar.
    TempVar   ::= "Z" t(-1,1) Index t(-1,1) Structure t(0,-2) Spaces.
    Index     ::= <Digit {Digit}>.
    Digit     ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9".
    Structure ::= <Digit ".n">.
    AddOp     ::= ("+" | "-") Spaces.
    Assign    ::= "=>" Spaces.
    Spaces    ::= {" "}.