Tree @master (Download .tar.gz)
Dissociated Parse
See also: The Swallows (2013) ∘ NaNoGenLab (2014) ∘ MARYSUE (2015) ∘ 2017 Entries ∘ 2018 Entries ∘ 2019 Entries
A submission for NaNoGenMo 2021 (Issue #62).
It's well known that Markov chains don't understand grammar; any sequences in the output that might look grammatical are only there because grammatical-looking sequences are statistically likely.
This is an experimental variation on a Markov generator that does retain some of the syntactic structure of the original text.
Turns out we can run the Dissociated Press algorithm, not just on a list of words like usual, but on a forest of parse trees. I call this variation Dissociated Parse.
For more information on the technique, see the algorithm section below.
A 50,277-word novel generated using this technique can be found here: The Lion, the Witches, and the Weird Road.
To run
To replicate what's been generated here, you'll need to have installed:
- Python 3. (I used the version that ships with Ubuntu 20.04, which is Python 3.8.10.)
- The
link-parser
executable. Sources can be found on GitHub: opencog/link-grammar. I built it from source. YMMV. t-rext
. The latest version runs only under Python 2.7. I ran it from a Docker container: https://hub.docker.com/r/catseye/t-rextpandoc
. This was just to make the HTML version from the generated Markdown. I installed it usingapt
.
After you have the executables, you can:
virtualenv --python=python3.8 venv
source venv/bin/activate
pip install -r requirements.txt
mkdir -p download data
./01_fetch.py
./02_scrape.py download/*
./03_sentencify.py
./04_parse.py # this one will take a while
./05_build.py
./06_traverse.py 'Give your Novel a Title Here' >out.md
wc -w out.md
t-rext out.md > out2.md
python3 cleanup.py out2.md > 'Your Novel.md'
pandoc --from=markdown --to=html5 <'Your Novel.md' >'Your Novel.html'
firefox 'Your Novel.html'
The algorithm
For background, a description of the Dissociated Press algorithm.
Just as there is more than one algorithm for sorting, there is more than one algorithm for generating a Markov chain.
The usual algorithm involves analyzing the source text and building a weighted transition table, then finding a random (but likely) path through this table.
The probably less well-known algorithm called Dissociated Press goes like this:
- load all the words into a list in memory
- select some word as the starting word
- print the current word
- find all occurrences of the current word in the text
- select one of those occurrences at random and jump to it
- move to the next word in the text
- repeat from step 3
Even though this works rather differently from the transition table algorithm, it produces the same result. (Exercise for the reader: convince yourself that it does in fact produce the same result.)
One downside of this algorithm is that it requires the entire text be kept in memory, rather than just a transition table. But, this is also an upside, in the sense that variations on the algorithm can exploit structure in the text which would not be retained in the transition table.
Now, Dissociated Parse adapts this to work recursively on parse trees. Consider a parse tree to consist of a word, a part-of-speech tag, and zero or more child trees. Here is a sketch of the algorithm:
traverse(tree):
1. find all trees that have the same part-of-speech tag and first word as the current tree
2. select one of those trees at random and use that tree as the current tree
3. print the word of the current tree
4. for each child of the current tree, traverse(child)
Related work
A previous experiment in adding structure to Markov chains, also during NaNoGenMo, was Anne of Green Garbles, which showed that one can combine two Markov models to obtain a third model where the generation can switch between discrete states (like "in narration" and "in dialogue").
Commit History
@master
git clone https://git.catseye.tc/Dissociated-Parse/
- Remove .reuse/dep5, move licensing info to individual files. Chris Pressey 4 months ago
- Update links in README. Chris Pressey 10 months ago
- Place materials under MIT and CC0 licenses using REUSE convention. Chris Pressey 10 months ago
- Add "see-also" bar to README. Chris Pressey (commit: GitHub) 3 years ago
- Link to WP entry on the original Dissociated Press algorithm. Chris Pressey 3 years ago
- Include a description of the algorithm in the README. Chris Pressey 3 years ago
- Add generated novel. Chris Pressey 3 years ago
- Some minor, late-edition adjustments to the cleanup script. Chris Pressey 3 years ago
- Add cleanup script. Chris Pressey 3 years ago
- Favour longer sentences by rejecting short ones often. Chris Pressey 3 years ago