
🙚 Anne of Green Garbles 🙘
==========================

Herein can be found the above-named generated novel, along with the
code for generating it, both of which were produced for [NaNoGenMo 2019][].

The generator is a collection of command-line programs
written in Python 3.6.  The only Python dependency is
[Beautiful Soup][], which is used to extract the text from a source HTML file.

After downloading [the HTML version of _Anne of Green Gables_](http://www.gutenberg.org/files/45/45-h/45-h.htm)
from [Project Gutenberg][] and saving it as `45-h.htm`, the novel
was generated by running the following command:

    ./build.py aogg 45-h.htm --random-seed=6010 --markov-order=1 --title "Anne of Green Garbles"

According to `wc -w`, the resulting novel,
_[Anne of Green Garbles](generated/Anne%20of%20Green%20Garbles.md)_,
comprises 54,638 words.

The software versions used were CPython 3.6.2, with
[beautifulsoup4](https://pypi.org/project/beautifulsoup4/) 4.6.0,
on Ubuntu 16.04.

Theory of Operation
-------------------

The main goal of this project was to see how additional grammatical structure
could be imposed on a [Markov chain][].

The main result is that, if the desired additional structure can be captured by
a [regular grammar][], then, observing that both the regular grammar and the
Markov chain can be thought of as [probabilistic finite automata][] (with the
probabilities in the regular grammar being always either 0% or 100%), we can
apply the standard [product construction][] on these two automata to obtain an
automaton that represents a language which is the intersection of the two
original languages.
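
To make the construction concrete, here is a minimal sketch in Python of
how it might be carried out.  It is not the code used by this generator,
and all of the names and data structures in it are invented for
illustration: the Markov chain is a map from each token to the
probabilities of its successors, and the punctuation grammar is a
deterministic automaton given by its transition function.

    from collections import defaultdict

    def product_chain(markov, delta):
        """Intersect a Markov chain with a deterministic finite automaton.

        markov: {token: {next_token: probability}} -- the word-frequency model.
        delta:  {(grammar_state, token): next_grammar_state} -- the punctuation
                grammar; pairs absent from delta are transitions it forbids.

        A state of the product is a (grammar_state, token) pair, where the
        grammar state is the one reached *after* reading the token.  From such
        a state, only successor tokens that both the chain and the grammar
        allow are kept, and their probabilities are renormalized so that each
        row of the product still sums to 1.
        """
        product = defaultdict(dict)
        for (state, token) in {(s2, tok) for (s1, tok), s2 in delta.items()}:
            allowed = {
                (delta[(state, nxt)], nxt): p
                for nxt, p in markov.get(token, {}).items()
                if (state, nxt) in delta
            }
            total = sum(allowed.values())
            if total:
                product[(state, token)] = {k: p / total for k, p in allowed.items()}
        return dict(product)

Generating text then amounts to a weighted random walk over these
(state, token) pairs, emitting the token component of each pair visited.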

As a concrete example, if the Markov chain is obtained from analyzing
Lucy Maud Montgomery's _Anne of Green Gables_, and the regular grammar is one
which describes some basic rules of punctuation in English writing, as shown
below (in the form of the corresponding [finite automaton][]):

![Abbreviated diagram of punctuation automaton](images/narration-dialogue-parenthetical.png?raw=true)

...then the resulting automaton is a Markov chain based on the word
frequencies in _Anne of Green Gables_, but restricted to producing
only those strings which correctly follow the given rules of punctuation.
Generating text from it produces paragraphs like this:

> “Well now that was overcome it much, of it was redder,” said Miss Barry kept
> the rake and starry eyes at the lane, the dismissal of being a middle of June
> evening at once in from being something to nobody either question. “I was born
> to the rest and good girl.” said. Once, with a boy at him on Anne’s
> eager-looking, hunting out into the way—their brown beard which, one of the
> tears and then to studying it, “Yes,” protested Anne walked home through the
> self-haired Shirley at least. “This is the worst.”

We might call the automaton resulting from this construction a
"tagged Markov chain", because each token is tagged with the state of
the regular grammar in which it was encountered.  The construction is
compatible with higher-order Markov chains; the tokens which precede a
token can be thought of as additional tags on that token.  Conversely,
the state of the regular grammar could be thought of as a kind of
abstract, "long-distance" higher-order factor in the Markov chain.

Just like when a conventional higher-order Markov chain is constructed,
the result of this construction is itself a Markov chain; it retains
the [Markov property][].  This would not be the case if the grammar we
wished to combine with the Markov chain was not a regular grammar
(e.g. if it was a context-free grammar), as then we would need to "remember"
how many levels of nesting we had entered previously.

Implementation
--------------

The programs that comprise this generator are written in a style that is amenable
to usage in shell pipelines.  However, for normal usage, running them via
the supplied `build.py` script is recommended.

Scripts whose names start with `extract` gather usable information from some
corpus or human-readable source, such as a web page downloaded from
Project Gutenberg, and output it in an intermediate format.

Scripts whose names start with `xform` are transformer filters
which read in a file (usually on `stdin`) and produce another
file (usually on `stdout`); these files are both in intermediate
formats, usually the same intermediate format.

Scripts whose names start with `render` are filters which
take a file in intermediate format and produce a file in
some human-readable output format, such as Markdown.
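
To give a feel for the shape these programs take, here is a hypothetical
`xform`-style filter (not one of the actual scripts in this repository):
it reads a tokenstream, one token per line, on `stdin` (the format is
described below) and writes a transformed tokenstream on `stdout`.

    #!/usr/bin/env python3
    # Hypothetical example of an xform-style filter: pass tokens through
    # unchanged except for lowercasing them.  The real filters in this
    # generator perform more interesting transformations.
    import sys

    def main():
        for line in sys.stdin:
            token = line.rstrip('\n')
            sys.stdout.write(token.lower() + '\n')

    if __name__ == '__main__':
        main()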

Some formats used by these tools are:

### HTML

Intended as an input format only, this is any sort of HTML that
BeautifulSoup can make sense of.

### Markdown

Intended as an output format only; as the final step of the pipeline
the text is rendered as Markdown.  (It can then be converted to
HTML, etc., by various other tools.)

### tokenstream

An intermediate format.

A text file in UTF-8 encoding, with LF line endings, with one
token per line.  A token is a lexeme, a sequence of characters
that we treat as a "unit": a word or a bit of punctuation which
is not part of some other word.

A tokenstream is merely a special case of a tagged tokenstream
(see below) where there are no tags on any tokens.
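
As an invented illustration (the exact tokenization rules are whatever
the `extract` scripts implement), a fragment such as “Well now,” said
Marilla. might appear in a tokenstream as:

    “
    Well
    now
    ,
    ”
    said
    Marilla
    .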

### tagged tokenstream

An intermediate format.

Like a tokenstream, but each token may be preceded by zero or
more tags, each separated by a space; each tag is a string of the
form `A=B` where A and B can be any tokens which do not contain
spaces or `=`s.

Some meaningful tags are: `state`, the state of the automaton
representing the regular grammar; and `prev1`, the previous
token, for constructing an order-2 Markov chain.

If every token in the stream has the same set of tags,
though potentially with different values for each tag, we
call it a homogeneous tagged tokenstream.  The tagged
tokenstreams used in this generator are homogeneous.
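
For example, the same fragment as above, tagged with the grammar state
and (for an order-2 chain) the previous token, might look like the
following; the state names and the `^` boundary marker are invented for
illustration:

    state=DIALOGUE prev1=^ “
    state=DIALOGUE prev1=“ Well
    state=DIALOGUE prev1=Well now
    state=DIALOGUE prev1=now ,
    state=NARRATION prev1=, ”
    state=NARRATION prev1=” said
    state=NARRATION prev1=said Marilla
    state=NARRATION prev1=Marilla .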

### Model JSON

An intermediate format.

A JSON file containing a map from tagged tokens to maps from
tagged tokens to frequencies.
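
As a hypothetical fragment (the precise way tagged tokens are encoded as
JSON keys is whatever the scripts choose; this is only an illustration):

    {
        "state=DIALOGUE Well": {
            "state=DIALOGUE now": 3,
            "state=DIALOGUE ,": 1
        },
        "state=NARRATION said": {
            "state=NARRATION Marilla": 2
        }
    }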

Other Features of the Generator
-------------------------------

The generator here has some capabilities that, in the end,
were not used in the generation of _Anne of Green Garbles_.

The `build.py` script supports being given multiple input
files (HTML documents), and builds a Markov chain model
as if this were a single long input text.

The `xform-tag-with-prev-tokens.py` filter implements an
order-2 Markov chain, by tagging each token with the
token immediately preceding it.  This was to confirm that
the intersection construction works with higher-order
Markov chains.
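
A minimal sketch of that idea (not the script itself; the `^` boundary
marker is invented): tag each token with the one before it, so that an
order-1 model built over the tagged stream is effectively an order-2
model over the original tokens.

    def tag_with_prev(tokens, boundary='^'):
        """Yield each token prefixed with a prev1= tag naming its predecessor."""
        prev = boundary
        for token in tokens:
            yield 'prev1={} {}'.format(prev, token)
            prev = token

    # list(tag_with_prev(['Anne', 'smiled', '.']))
    #   -> ['prev1=^ Anne', 'prev1=Anne smiled', 'prev1=smiled .']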

The reason _Anne of Green Garbles_ was produced using only
_Anne of Green Gables_ as its input text, with a Markov chain
of order 1, was for aesthetic purposes -- I found the contrast
of orderly dialogue/narration structure against the decidedly
gibberishy output of the order-1 chain on a single work to be
more striking.  (More [Jess][]-esque, perhaps.)

In addition, the regular grammar used in this generator
handles parenthetical remarks (as you can see from the diagram
above).  This literary device is, however, used only rarely
in Lucy Maud Montgomery's works.  Lucky for us though, Sax
Rohmer wasn't shy about putting in a few parentheses here and
there.  So, to demonstrate all three of these features, here is
an excerpt from _The Incoherent Dr. Fu Manchu_, which uses an
order-2 Markov chain on several of Sax Rohmer's works, some
of which do employ parenthetical phrases.

> It wanted only three minutes or more above the nauseating odor
> of burning Indian hemp; and despite the wide, so characteristic
> of the one to arrest with the marked changes (corresponding with
> phases of the one who watched him proved too potent for his
> elusive courage. He wrote on). This was still unreal to me, and
> began very slowly, “This marks a new toy it does make you
> understand?” he declared. “It was not until I realized that he
> hoped to lure us.”

Related Work
------------

I should say that I have no real idea if this is a standard technique
in statistics or what.  All I know is that I'd been idly wondering, on
and off, about how one might "teach a Markov chain a bit of grammar",
since probably 2015; that I've never, to my knowledge, seen
this method applied in the area of generative text or elsewhere; and
that, when I decided to actually sit down and try it for
[NaNoGenMo 2019][], it wasn't immediately obvious to me that the
resulting stochastic model was still a Markov chain.
See [this comment](https://github.com/NaNoGenMo/2019/issues/11#issuecomment-552395465)
and [this comment](https://github.com/NaNoGenMo/2019/issues/11#issuecomment-554986877)
in particular for my thoughts while I was working it out.

Shortly after I started working on this I was reminded of a similar
project from a previous NaNoGenMo, and hunted for it.  I wanted to
find it, if only to satisfy myself that this wasn't just repeating
an existing technique.  In a short time I did find it; it was
[The Quantum Supposition of Oz][] by [@spc476](https://github.com/spc476),
for [NaNoGenMo 2014][].  It is definitely similar in one respect:
it treats punctuation and words in the input text as separate tokens
in the Markov model.  This is something that definitely makes sense to
do when some of those tokens (punctuation) trigger state changes in a
state machine.

In the issue for QSoO, I also mention a tool for cleaning up punctuation.
That tool eventually became [T-Rext][].  There was a certain (large)
amount of punctuation cleanup in this project, as well; however, the
cleanup here comes before creating the model, whereas T-Rext was
intended to be run over basically-arbitrary text after it has already
been generated.  (See below for more about cleaning up punctuation.)

That year I was extracting a lot of source texts from
[Project Gutenberg][]; I started out using a tool called `gutenizer`
for this purpose but, unsatisfied with it, I wrote my own tool called
[Guten-gutter][].  Since writing it, however, I've come to the
conclusion that it takes the wrong approach.  Guten-gutter extracts
text from plain text files provided by PG, but plain text files
obscure the structure of the text.  In this project I instead used
Beautiful Soup to extract text from PG's HTML files, which capture
more structure.  In particular, a paragraph of text is almost always
an HTML `<p>` element, which is much simpler than keeping track of
interstitial blank lines.  This also makes it more versatile, as it
can be used to extract text from other HTML-based sources such as
[WikiSource][].

I should also say a few words about the architecture.  As I was building
this, the structure that fell out was to have a set of small programs,
each with a specific dedicated task, usually to transform one
intermediate format into another, and to set them
up as a pipeline (though with intermediate files instead of shell pipes).
In the world of programming language technology, this architecture
is sometimes called a "micropass compiler" (see, for example,
[this paper by Sarkar, Waddell, and Dybvig](https://www.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf)),
and it coincides with another previous project, [Lexeduct][],
in which text-processing is organized along basically the same
lines.  The pipeline architecture used here could well become the
next major version of Lexeduct, if fortune dictates that my
interests swing back in that direction in the future.

An Aside about Cleaning up Punctuation
--------------------------------------

A lesson learned in this project: if you plan to retain some structure from
the input text, you need to be pretty sure the input text *has* that structure
in the first place.

Since we've defined a regular language for the rules of punctuation (see
diagram in "Theory of Operation", above), we can also use it to check if
the input text already conforms to those rules.

But for that purpose, the diagram is not complete.  We'd also like to show
what happens when you encounter something that violates the rules.  The
diagram for that is a wee bit more complex:

![Full diagram of punctuation automaton](images/narration-dialogue-parenthetical-reject.png?raw=true)

Then we need to decide what we will do if the input text does in fact
violate the rules, and we end up in the "reject" state.

A basic approach is to process the input in units (paragraphs, say) and just
throw out any units that don't adhere to the structure.  If there are enough
total units, this suffices.

But I wanted to do slightly better, so in a couple of filters in this
generator, the code makes one or more attempts to "fix" the structure of the
text before giving up.  For example, a paragraph that starts with an
opening double-quotation mark, but has no other double-quotation marks,
probably indicates a monologue that will continue in the next paragraph,
so a closing double-quotation mark is added to the end of the paragraph,
and the program tries again.
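
A minimal sketch of the shape of that check-and-repair loop (the names
are invented, and the `accepts` function below is only a stand-in for
running the full punctuation automaton):

    def accepts(tokens):
        """Stand-in for the punctuation automaton: here it only checks that
        opening and closing double-quotation marks balance properly."""
        depth = 0
        for t in tokens:
            if t == '“':
                depth += 1
            elif t == '”':
                depth -= 1
            if depth not in (0, 1):
                return False   # the real automaton would be in its reject state
        return depth == 0

    def repair(tokens):
        """One attempted fix: a paragraph that opens a quotation and never
        closes it is probably a monologue continued in the next paragraph,
        so try it again with a closing quotation mark appended."""
        if tokens and tokens[0] == '“' and tokens.count('“') == 1 and '”' not in tokens:
            return tokens + ['”']
        return None

    def acceptable_paragraphs(paragraphs):
        for tokens in paragraphs:
            if accepts(tokens):
                yield tokens
                continue
            fixed = repair(tokens)
            if fixed is not None and accepts(fixed):
                yield fixed
            # otherwise the paragraph is thrown out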

And can I just close by saying that, in this context, the `'` symbol is
absolutely murderous to handle.  You can first see if it is a contraction
by checking common cases like `["don", "'", "t"]`, but the set of
contractions is not something that is fixed for every text ("'phone",
"s'pose", etc.) and some words have even undergone multiple contractions
("'tweren't").  Then you have to consider that it is also used for
possessives, and for quotes within double-quotes, leading to the possibility
of sentences like

> "To which she replied, 'Aunt Pris' marbles'," said Anne.

For that reason, this generator doesn't even try to interpret these
as single quotes; it will just assume that all the instances of `'`
in the above sentence are contractions.  It's certainly possible to do
better, but that is, in my estimation, outside the scope of NaNoGenMo 2019.
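
For what it's worth, a sketch of the "check common cases" approach
mentioned above might look like this (the list of contractions is
illustrative only, not the generator's actual list):

    # Hypothetical sketch: rejoin an apostrophe token with its neighbours
    # when they form a common contraction, leaving all other tokens alone.
    COMMON_CONTRACTIONS = {("don", "t"), ("didn", "t"), ("isn", "t"), ("I", "m")}

    def join_contractions(tokens):
        out = []
        i = 0
        while i < len(tokens):
            if (i + 2 < len(tokens) and tokens[i + 1] == "'"
                    and (tokens[i], tokens[i + 2]) in COMMON_CONTRACTIONS):
                out.append(tokens[i] + "'" + tokens[i + 2])
                i += 3
            else:
                out.append(tokens[i])
                i += 1
        return out

    # join_contractions(["I", "don", "'", "t", "know"])
    #   -> ["I", "don't", "know"]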

[NaNoGenMo 2019]: https://github.com/NaNoGenMo/2019/
[NaNoGenMo 2014]: https://github.com/dariusk/NaNoGenMo-2014
[Project Gutenberg]: https://www.gutenberg.org/
[WikiSource]: https://wikisource.org/
[Beautiful Soup]: https://www.crummy.com/software/BeautifulSoup/
[Markov chain]: https://en.wikipedia.org/wiki/Markov_chain
[Markov property]: https://en.wikipedia.org/wiki/Markov_property
[regular grammar]: https://en.wikipedia.org/wiki/Regular_grammar
[finite automaton]: https://en.wikipedia.org/wiki/Finite-state_machine#Mathematical_model
[probabilistic finite automata]: https://en.wikipedia.org/wiki/Probabilistic_automaton
[product construction]: https://en.wikipedia.org/wiki/File:Intersection1.png
[Jess]: https://whitney.org/collection/works/9517
[The Quantum Supposition of Oz]: https://github.com/dariusk/NaNoGenMo-2014/issues/137
[T-Rext]: https://github.com/catseye/T-Rext
[Guten-gutter]: https://github.com/catseye/Guten-gutter
[Lexeduct]: https://github.com/catseye/Lexeduct