git @ Cat's Eye Technologies Tamsin / master doc / Notes.markdown

Tree @master (Download .tar.gz)

Notes.markdown @masterview markup · raw · history · blame

These are now out of context, and kept here for historical purposes.

an aside, written a while back

OK! So... here is a problem: if you haven't noticed yet,

  • what a rule consumes, is a string.
  • what a rule evaluates to, is a term.
  • the symbol ( means something different in a rule (where it expresses precendence) than in a term (where it signifies the list of subterms.)
  • the symbol foo means something different in a rule (where it denotes a production) than in a term (where it is an atom.)

This is probably unacceptable. Which syntax do we want to change?

PRODUCTION = set V = foo & return ⟨atom V production⟩.

i.e. productions are distinguished from atoms and variables by being all-caps. Lists are distinguished from precedence by being ⟨ ⟩.

production = set V = 'foo & return '(atom V production).

i.e. ' acts a bit like quote, or rather quasi-quote, as Variables get expanded.

production = set V = :foo & return :smth(:atom Var :production).

i.e. atoms are prefixed with :, like Ruby, and terms are constructors with a leading atom, like real terms and not like lists.

production = set V = 「foo」 & return 「(atom Var anotheratom)」.

A funky, Japanese-influenced version of quote. Nice, but really not suited for this, quite. Ditto ⟦these⟧.

Ah, well, it may not be a real problem, unless we want to make return optional (which we do.) Maybe, onto weirder stuff first.

stuff about implicit buffer

Here's a "problem": the implicit buffer is a string, and we don't have strings in the data domain, we have terms. This "problem" is easily "solvable": we can stringify the term. This is a terrible "solution", but it lets us experiment further.

This would be nicer if we had a syntax to put arbitrary text in an atom. Hey, how about 「this is an atom」? Hmmm...

A prolix note on implementation

Traditionally, scanners for recursive descent parsers pre-emptively scan the next token. This was done because originally, parsers (for languages like Pascal, say,) were distinctly one-pass beasts, reading the source code off of a stream from disk (or maybe even from a tape), and you might need to refer to the current token several times in the code and you don't want to have to read it more than once.

This setup makes writing a parser with a "hot-swappable" scanner tricky, because when we switch scanner, we have to deal with this "cached" token somehow. We could rewind the scanner by the length of the token (plus the length of any preceding whitespace and comments), switch the scanner, then scan again (by the new rules.) But this is messy and error-prone.

Luckily, not many of us are reading files off tape these days, and we have plenty of core, so it's no problem reading the whole file into memory. In fact, I've seen it argued that the best way to write a scanner nowadays is to mmap() the file. We don't do this in the implementation of Tamsin, but we do read the entire file into memory.

This makes the cache-the-next-token method less useful, and so we don't do it. Instead, we look for the next token only when we need it, and we have a method peek() that returns what the next token would be, and we don't cache this value.

There are a couple of other points about the scanner implementation. A scanner only ever has one buffer (the entire string it's scanning); this never changes over it's lifetime. It provides methods for saving and restoring its state, and it has a stack of "engines" which provide the actual scanning logic. In addition, there is only one interpreter object, and it only has one scanner object during its lifetime.

Implementation Notes

Maybe test-driven language design not "for the win" in all cases; it's excellent for evolving a design, but not so good for deep debugging. I had to actually write a dedicated test case which directly accessed the internals, to find the problem.

This was only after refactoring the implementation two or three times. One of those times, I removed exceptions, so now the interpreter returns (success, result) tuples, where success is a boolean, and propagates parse errors itself.

We "raise" a parse error only in the LITERAL AST node.

We handle parse errors (backtrack) only in OR and WHILE, and in the ProductionScannerEngine logic (to provide that EOF if the scanning production failed. This can happen even in peek() at the end of a string, even after we've successfully parsed everything else.)

aside #2

Well this is all very nice, very pretty I'm sure you'll agree, but it doesn't hang together too well. Figuration is easier than composition. The thing is that we still have these two domains, the domain of strings that we parse and the domain of terms that we match. We need to bring them closer together. This section is just ideas for that.

One is that instead of, or alongside terms, we compose strings.

First, we put arbitrary text in an atom, with 「this syntax」. Then we allow terms to be concatenated with . It looks real cool! But also, it's kind of annoying. So we also allow 'this ' + 'syntax.'.


Indeed we can.

The next logical step would be to be able to say

main = program using scanner.
scanner = scan using ☆char.
scan = {" "} & (...)
program = "token" & ";" & "token" & ...

But we're not there yet.

Well, the best way to get there is to make that a test, see it fail, then improve the implementation so that it passes, Test-driven language design for the win! (But maybe not in all cases. See my notes below...)

... #2

Having thought more about it, I think the easiest way to reconcile terms and strings is to have terms be syntactic sugar for strings. This is already the case for ground terms, since tree(a,b) stringifies to the same string as 「tree(a,b)」. It's when variables are involved where it differs. We would like some kind of quasi-quote such that even though 「tree(A,b)」tree(A,n), «tree(A,b)»tree(tree(x,y),b) or whatever.

Although, I still don't know. The thing about terms is that they are super-useful for intermediate representations — abstract syntax trees and the like. I've been thinking about some kind of compromise. Which is, currently, what we sort of have. A Tamsin term doubles as a string, for better or worse. Mainly, we should sort out the properties of terms, then. Which we will do. But first,

conceptual sugar


reverse(tree(A,B)) = ...

be conceptually sugar for

reverse["tree" & "(" & term → A & "," & term → B & ")"] = ...

but actually we still keep it in terms of terms, for efficiency.