git @ Cat's Eye Technologies Emmental / rel_1_0_2014_0819
rel_1_0_2014_0819

Tree @rel_1_0_2014_0819 (Download .tar.gz)

The Emmental Programming Language

November 2007, Chris Pressey, Cat's Eye Technologies

Introduction

Emmental is a self-modifying programming language. That is not to say that is a language in which programs are self-modifying; rather, it is the language itself, as defined by a meta-circular interpreter, that can be modified during the course of a running program. Indeed, this is how Emmental, without conventional conditional and repetition/recursion constructs, achieves Turing-completeness.

Meta-circular Interpreters

One way to attempt to define a language is by giving what's called a meta-circular interpreter (often shortened to "MCI" in this document.) This is an interpreter for some language which is written in that same language (or at least, a language which is very close to it.)

Meta-circular interpreters were a popular way to the describe the semantics of programming languages, especially LISP-like languages, and especially before the advent of denotational semantics. The term "meta-circular" was apparently coined by John C. Reynolds in his paper "Definitional Interpreters for Higher-Order Programming Languages" (1972 Proceedings ACM National Conference.)

Of course, in the real world, MCI's are not often used. They certainly can be used: if you have a working Scheme interpreter that came with your computer system, there is nothing stopping you from writing another Scheme interpreter in Scheme, and running your programs on your interpreter (which is itself running on your system's interpreter.) However, this is quite a bit less efficient due to the duplication of effort. A somewhat more realistic case might be if your system came with, say, a Scheme compiler. You might then feed your Scheme interpreter (written in Scheme) through that to make a native Scheme interpreter, and use that to interpret your programs. (In this setup, the interpreter is usually described as "self-hosting" rather than "meta-circular".)

But, as should be obvious, you already need an implementation of Scheme for your Scheme interpreter written in Scheme to be of much practical use to you. If your meta-circular interpreter is all you have, you won't be able to use it to run or understand Scheme programs. Because the MCI is defined in terms of itself, you'll need some other source of "understanding how it works" to make it complete. This understanding might come from an implementation in some other programming language, or a specification in some formal language, or a description in some natural language, or simply from intuition — but it has to come from somewhere.

Assuming that we do have that external source of understanding, the meta-circular interpreter can come in quite handy in codifying the semantics of the language. And, in Emmental's case, altering those semantics: Emmental's MCI supports operations which instruct Emmental's MCI to modify its behaviour.

Interpreter Structure

To describe the structure of Emmental's MCI, we first examine the general structure of interpreters. If you've ever written a virtual machine executor in, say, C, you've noticed that it has the general form

    pc = start;
    while (!done) {
        switch (instruction[pc]) {
            case INSTR_ONE:
                /* implement semantics of INSTR_ONE */
                pc += advance(INSTR_ONE);
                break;
            case INSTR_TWO:
                /* implement semantics of INSTR_TWO */
                pc += advance(INSTR_TWO);
                break;
            /* ... */
            case INSTR_HALT:
                done = 1;
                break;
            default:
                perror("Invalid opcode");
        }
    }

Note that advance() is some function that computes how far the program counter is advanced on that instruction. This value is typically +1 for most instructions, but more or less than 1 (and dependent on the state of the program) for a handful of "branch" instructions. Note also that advance() would not typically be implemented in C as a function; I'm just showing it like this to emphasize the regular structure.

From this we infer that the basic structure of an interpreter is a dictionary or map that associates program symbols with operations. There is some extra housekeeping like the fetch-execute cycle that surrounds this dictionary, but this can (hopefully) be handled mostly automatically, freeing us to concentrate on symbols and operations.

The symbols could be taken from any finite alphabet, but in Emmental, to keep things relatively simple, we'll just use the ASCII character set. (Actually, to be precise, this is the full 8-bit ASCII character set. Non-printable control characters are allowed, as are characters between 128 and 255, and each has a distinct meaning. But their representations are not defined.)

The operations can be thought of, abstractly, as functions which transform program states. Or they can be thought of, concretely, as segments of code — mini-programs which implement these functions. In the case of a meta-circular interpreter, these mini-programs would be written in the language being interpreted.

To extend this idea to a self-modifying meta-circular interpreter, the operations can be thought of as functions which transform both program states and interpreter definitions. (Alternatively, the interpreter definition could be thought of as part of the program state, although I feel that's a bit gorier a way to look at it, and I prefer the other view, at least for Emmental.)

In Emmental, most operations leave the interpreter definition unchanged. However, there is one operation which alters the interpreter definition, and it is this altered definition that is used to interpret the remainder of the program.

Emmental Semantics (in Emmental)

Emmental is essentially a stack-based language. (There's also a queue, but it's not as central.) All operations implicitly get data from, and implicitly deposit results back onto, a single stack. For orthogonality's sake, this stack may contain only ASCII symbols. (And note that trying to pop an empty stack, or dequeue an empty queue, is an error that aborts the program.)

Note that because we've established that an interpreter (at least, insofar as Emmental ever needs to know) is simply a map that takes symbols to operations, and that operations in Emmental are defined (meta-circularly) as Emmental programs, we can use the following notation to describe interpreters:

% → XYZ+*!
& → 'ap'ag'ag

That is, the symbol %, when encountered in an Emmental program, indicates an operation that is defined by the Emmental program XYZ+*!, and so forth.

When a main Emmental program begins execution for the first time, it starts with what's called the initial Emmental interpreter. (This fact, of course, doesn't apply to any further point of execution inside an Emmental program, or execution of operations defined in Emmental's MCI, since these would be considered subprograms of a sort. These cases use whichever interpreter happens to be in force in that point in time.)

The inital Emmental interpreter is defined as follows:

a → a
b → b
c → c
...

That is, for every symbol x in the ASCII set, x x.

Doesn't tell us a lot about Emmental's semantics, does it? No. Nothing at all, really. But remember what I said about needing an external source of understanding, in order to actually get full value out of an MCI. And remember the purpose of Emmental's MCI: it is not there so much to help us understand Emmental, but to allow us to change Emmental, from inside an Emmental program.

And, for all that our description of the initial Emmental interpreter is unhelpfully tautological, it is not incorrect: the semantics of a can in fact be thought of as being defined by an Emmental program that consists of only one instruction, a. This happy state of affairs comes about because Emmental is stack-based; the "signature" (the requirements for the "before" and "after" stacks) of the symbol a is the same as the signature of the program containing the single symbol a. No extra syntax to specify arity and the like is necessary.

Above all, don't panic: we will describe what symbols like a actually mean in Emmental, we'll just need to do it in something besides Emmental. In fact, let's do that right now.

Emmental Semantics (in English)

Primitive Arithmetic

# pushes the symbol NUL (ASCII 0) onto the stack.

The symbols 0, 1, ... 9 pop a symbol from the stack, multiply its ASCII value by 10 modulo 256, add the value 0, 1, ... 9 (respectively) to that value modulo 256, and push the resulting symbol back onto the stack.

The upshot of these 11 operations is that you can push arbitrary symbols onto the stack by spelling out their ASCII values in decimal. For example, #64 pushes a @ onto the stack.

+ pops two symbols off the stack, adds together their ASCII values modulo 256, and pushes the symbol with the resultant ASCII value back onto the stack.

- pops two symbols off the stack, subtracts the ASCII value of the first popped from the ASCII value of the second popped modulo 256, and pushes the symbol with the resultant ASCII value back onto the stack.

~ ("log") pops a symbol from the stack, computes the discrete base-2 logarithm of the ASCII value of that symbol, and pushes the symbol with the resultant ASCII value back onto the stack. The discrete base-2 logarithm of a number is the floor or integer part of the base-2 logarithm of that number. Alternately, it is the number of the highest bit position (starting with the LSB = bit position 0) with a bit set when the number is viewed as binary. Because the base-2 logarithm of the number 0 itself is undefined, the number 0 is treated as 256 for this operation; its discrete base-2 logarithm is 8.

Stack and Queue Manipulation

^ ("enqueue a copy") pops a symbol off the stack, makes a copy of it, pushes it back onto the stack, and enqueues the copy onto the queue.

v ("dequeue") dequeues a symbol from the queue and pushes it onto the stack.

Using these operations in combination, one can form "discard", "duplicate", "swap", and other more advanced stack manipulation operations. For example, assuming an empty queue and more than two elements on the stack, "swap" can be accomplished with the code ^v^-+^^v^v^v-+^v-+^v-+vv.

Despite this fact, the operation : duplicates the top value of the stack. (Emmental is not an absolutely minimal language; note, for instance, that it has all ten decimal digits as operations when these could surely have been defined in terms of only one or two operations. The reasons for a seperate : operation are given below in the section on Computational Class.)

I/O

. pops a symbol off the stack and sends it to the standard output as an ASCII symbol.

, waits for an ASCII symbol to arrive on standard input, and pushes it onto the stack.

Interpreter Modification and Reflection

First let's define what it means to pop a string off the stack. Symbols are popped off the stack until a ; symbol is found on the stack. The symbols popped off are considered a string in the reverse order they were popped; i.e. the last symbol popped is the first symbol of the string. The ; symbol is popped off the stack, but is not made part of the string; it is simply discarded.

Since an Emmental program is a string, popping a program is the same as popping a string, just that the string is interpreted as a program.

! (sometimes called "supplant") pops a symbol, which we call s, off the stack. Then it pops a program t. It then inserts the association s t into the interpreter definition. This overwrites whatever mapping of s might have been in the interpreter definition previously. This new interpreter definition is used for all subsequent execution (until it is changed again, of course.)

Note that ! does early binding. That is, the meaning of each symbol in this program t is the meaning of that symbol at the time ! is executed. If some subsequent ! operation later changes the meaning of one of the symbols that occurs in t, the meaning of t is not changed. The semantics of t are "captured" or "frozen". This implies, among other things, that supplanting some symbol z with itself (a program consisting only of the symbol z) is a no-op, because z's meaning, at the time that ! is executed, is invariably z.

? (sometimes called "eval") pops a symbol, which we call s, off the stack. It then executes that symbol (interprets it as an operation) with the interpreter currently in effect.

Note that ? does late binding. That is, in contrast with !, ? never "freezes" the semantic definition of the thing that it is executing. This is true even when ? occurs in a operation redefinition (i.e. the program that supplanted some symbol's semantics when an ! was executed.) This implies, among other things, that supplanting some symbol z with the program that consists of instructions that push the ASCII value of z onto the stack, followed by a ? instruction, creates a cyclic meaning for z. This is because the z that will be executed by the ? will always be the present z, that is, the z that is executing the ?.

For convenience's sake, ; pushes the symbol ; onto the stack.

All other symbols are no-ops.

Computational Class

I believe Emmental is Turing-complete with only the operations that have been given so far, but I haven't proved it yet. All the elements are there, and although some of them are somewhat "cramped", they look viable.

If you want to try thinking about how you could write real programs (like a universal Turing-machine simulator) in Emmental, you might want to skip this section, since it contains "spoilers".

Repetition can be accomplished by assigning a symbol a cyclic semantics, by use of a ? within a !. For example, we can redefine the semantics of 0 to be #48?. This is simply a program that pushes the symbol 0 onto the stack and executes it with the current interpreter, and, since 0 has been redefined to mean #48? in the current interpreter, this will loop forever. The entire program to do this to 0 and run the infinite loop is:

;#35#52#56#63#48!0

This technique can also be used to "jump" from one definition to another, by using ? to execute some other symbol at the end of a definition (that is, some symbol other than the symbol being defined.)

Conditionals are a little more restrictive. The trick to them is, strangely, the discrete log operator ~ in combination with the eval operator ?. Since ~ maps all symbols into a set of nine symbols, and ? executes the symbol on the stack, ~? will execute one of the symbols from ASCII 0 (NUL) to ASCII 8 (BS). We can then, for instance, define NUL to do one thing, define SOH through BEL to do the same as NUL, and define BS to do some other thing; this essentially distinguishes between 0 (which executes BS) and every other value (which executes NUL). Further, we can use this in conjunction with - to compare two values for equality. So, for example, a program which inputs a character, and outputs Y if the character is M and N otherwise, would be:

#59#35#55#56#46#!;##1!;##2!;##3!;##4!;##5!;##6!;##7!#59#35#56#57#46#8!,#77-~?

In case NUL through BS are in use for some reason, we can always add 9 to the result of the log (~#9+?) to map the answer onto HT through DC1. Or, of course, any of a great number of other arithmetical mappings of our choosing. The most severe constraint is that there be 9 available symbols to act as "destinations" for our "branch". Even if we never overwrite one "branch" with another (and we can do that in Emmental!) and even if we allocate one extra symbol to be the "launch point" of the branch, we still have room for 25 branches in the ASCII character set.

So these parts look good. If there's a problem, it's with the queue. Specifically, the problem seems to be the need to know the present size of the queue in order to do stack housework like "duplicate" and the subsequent need for "duplicate" to achieve "discard." (Duplicate can be defined as ^v, but this only works when the queue is empty. Discard can be defined as duplicate plus -+, but this only works when there are other elements below the element being discarded. [This last point is not generally a problem since we can push arbitrary values onto the stack before any program.])

However, if it turns out that we need "duplicate" or "discard" in order to write routines that can handle a variable-sized queue — and that strikes me as likely — then it looks like we have a severe problem.

Here's one way I could try to deal with it. I could say that the queue is local to the operation being defined (or the main program.) Then you could define : to be ^v, and inside :'s definition, the queue would always initially be empty, so the definition would work.

But... we need the queue to store our global data. For example, if we are going to simulate a Turing machine, we'd need to use the queue to store the tape (perhaps "doubled up", with one symbol of each pair telling us "the next symbol is a simulated tape symbol" or "the next symbol is some housekeeping value.") We can't store the tape on just one stack. And, once you are looping in Emmental, you've left the "main program" forever; you're jumping from definition to definition, and each has their own queue. At best, you'd need to "dump" the queue onto the stack each time you switched definitions, and even then you'd need a loop to do that, and to loop you need to switch definitions. It's a royal mess.

So here's how I will deal with it. I will add a primitive duplicate operation, :, to Emmental. Proving that Emmental is Turing-complete is still, then, a challenge, although a doable-seeming challenge. I will then propose a more formidable challenge: prove that the language formed by removing the : operation from Emmental is Turing-complete. (Equivalently, prove that the set of Emmental programs that begin with ;#0#58! is Turing-complete. The nice thing about Emmental is that you can always shoot yourself in the foot — until you erase your pistol, that is.)

And if you really like a challenge, try proving that Emmental without ~ is Turing-complete. I don't think that it is, although it's possible for it to compute parity, at least (input a symbol, output E if its ASCII value is even, and O if it's odd. To accomplish this, multiply the input's ASCII value by 128 by adding 127 copies of it to it; this is modulo 256, so the only results can be 0 or 128. Define those operations to print out E and O respectively. But that's as far as I've gotten.)

Discussion

Design Decisions

I would've liked to have given Emmental a ' or " instruction similar to Funge's "quote" and "quote-mode" instructions; instructions that treat one or more of the following symbols in the program literally, pushing them, as symbols, onto the stack, instead of executing them. However, such instructions are somewhat problematic, both theoretically and (for the approach I took implementing Emmental) practically. There are two ways of thinking about the problems that arise.

One is that the function which implements ' is given access to the program text itself, and possibly the position within the program, and it uses these to extract the "immediate mode" symbol past the '. This information could be available because these pieces of information are considered extra arguments to the function, or because they are (gorily) considered part of the overall program state. Either way, this operation is given a lot of information to work with, and for consistency (since we want to be all nice and neat and say that all operations have the same signature so that our dictionary has a nice uniform type,) all operations have access to this information. This is almost too much information; that is, it is so much that operations don't really need the dictionary. We could just say there is one operation, defined by a function, and that function is given the current symbol and has to decide what it means through whatever means it likes.

This approach is very powerful, of course, but it's just not the style that Emmental embodies. (In fact, the idea to view interpreters as dictionaries was one of the foundational design choices for Emmental, to the point where I started constructing a "little theory of interpreters as maps." It really wasn't exploited as much as I think it could have been. If an interpreter is a map of symbols to strings of symbols, it's much more tractable than an opaque function would be; you can define all sorts of operations on them, for example concatenating two interpreters (for all symbols s in interpreter a and interpreter b, c[s] a[s]b[s] — that sort of thing,) computing union or intersection of interpreters, Cartesian product, etc.)

The other way of looking at it is to say that there are in fact multiple meta-circular interpreters available inside Emmental, and symbols like ' switch temporarily to an alternate MCI. This alternate MCI interprets every symbol as "push this symbol", then reinstates the previous MCI. I like this explication better than the one above — MCIs begin to look a bit like continuations! — but to do it justice would take some work. I envision a language where the program has fine control over which MCI is in effect, possibly by keeping a map from symbols to MCIs, or maybe even being able to push MCIs onto the stack. This is a wee bit much for Emmental too though, and perhaps I'll explore these possibilities in a future language.

Turing-completeness

You can make the argument that Emmental's way of being Turing-complete is really nothing new: when you redefine some symbol, you're really just defining a new function, and when you use ? to execute that symbol from within its own definition, you're just making a recursive function call.

Well, yes, you can make that argument. But it has to do with how you think about "what is a language", I think. Does a Pascal program fragment which defines a procedure called PrintFibonacci represent another programming language, one different from Pascal? You could certainly say that it does — it's the language Pascal where the token PrintFibonacci has some meaning that it doesn't have in Pascal.

In that view, any language where you can define procedures, or functions, or standard libraries, or the like, is an extensible language. But even languages where you can't define new procedures or functions is arguably an extensible language. Take some initial Brainfuck program fragment, for instance. After it executes, it leaves the Brainfuck tape and while-stack in some state that depends on the input. Any Brainfuck fragment that executes after that, will execute in that environment, and that environment is arguably a version of the language Brainfuck, suitably extended.

You don't normally think of it that way, I bet, but you could — and you would need to, to some degree, to claim that Emmental is "just" defining new functions. The reason you don't typically look at languages like this (unless you are very strange) is because it's much more useful to divide the world into "languages" and "programs." And Emmental does make this division, it just makes it in a slightly different place than usual.

As far as I'm concerned, if I describe what Emmental does as modifying the Emmental language via its MCI, and what Emmental actually does is consistent with the idea of modifying the Emmental language via its MCI, then what Emmental effectively does is modify the Emmental language via its MCI. And if it needs to do this in a certain way in order to simulate a universal Turing machine, then that difference (however slight) sets it apart from systems where this simulation needs to be done by defining recursive functions.

Implementation

emmental.hs is a reference interpreter for Emmental written in Haskell. Run the function emmental on a string; you can also run debug on a string to view the state of the program (stack & queue) during execution. (Note that debug is not able to show program states that occur internal to an operation.)

Happy interpreter-redefining!
Chris Pressey
Chicago, IL
November 11, 2007