Tree @master (Download .tar.gz)

On-the-Algebraic-Structure-of-Markov-Chains.md @master — view markup · raw · history · blame

On the Algebraic Structure of Markov Chains

This is a preliminary version of this write-up which is subject to being rewritten to increase its expository value as time permits. For an even more preliminary version, see the GitHub issue for the NaNoGenMo 2023 entry for which this project was undertaken.

Introduction

In an entry of mine for NaNoGenMo 2019, Anne of Green Garbles, I implemented a text generation method that I described as "taking the intersection of a Markov chain and a finite automaton", in order to get a random text generator that also obeyed some of the rules of punctuation. More recently I wondered: can this be generalized? Can one take the intersection of two Markov chains?

And if one can, how much further could one take this? Can one take the union of two Markov chains? Do these two operations, taken together, have any nice algebraic properties, like, do they distribute over each other (such that one could form a distributive lattice of Markov chains)?

For NaNoGenMo 2023, I conducted a number of experiments in this direction to try to answer these questions. I also wrote some Python code to implement some of these results, and by the end of November used it to generate a novel: The Other and the Same.

That project was eventually christened Chainscape. This article attempts to formalize and summarize some of the questions that were asked and answers that were obtained in the course of that project.

Events, probability distributions and Markov chains

We say an event is a finite set of distinct, discrete outcomes. We think of each outcome as being uniquely labelled, so that we can distinguish it from the other outcomes of the event. Each outcome also has a fixed, known probability of occurring. The event is then said to have a discrete probability distribution.

We regard a Markov chain as a set of events. Each event is uniquely labelled, both so that it can be distinguished from the other events in the Markov chain, and so that it can be associated with the outcomes of the events in the Markov chain, in the following manner: every outcome of every event in the Markov chain must be labelled with a label which appears on an event in the Markov chain.

When we walk a Markov chain, we start at some event. We select one of the outcomes, based on its probability. We use the label of the selected outcome to select a new event from the Markov chain, and from this new event we repeat this process for as long as we like.

In the context of text generation, each event is associated with a word in a corpus: the label is the word. The outcomes of that event are also associated with words: the words that can possibly follow the first word in the corpus. Walking such a Markov chain produces streams of text that resemble the corpus because each pair of successive words is as likely to appear in that succession as it is to appear in that succession in the corpus.

Now, we our think of our events as probability distributions in the sense of Probability theory (Wikipedia). Each of the outcomes has a positive probability; these probabilities are additive; and all of the probabilities of an event add up to 1.

However, to conform to actual practice, we must resort to some special pleading here, because the application does not always fit this mathematical model exactly.

In particular, we have events with an empty set of outcomes. This can happen because the events are, generally speaking, generated from a corpus by recording the frequency with which some word is followed by other words. If a word only appears once, at the very end of the corpus, it is not followed by any words. (We will also see in the sequel that an intersection operation can quite validly produce an empty set of outcomes.)

We call such an event an empty event. However, such an event is not a proper event according to the Kolmogorov's axioms of probability (Wikipedia): it cannot fulfil the 2nd axiom (that the probabilities of its outcomes always add up to 1), because there are no outcomes, and thus no probabilities, to add up!

So we need a way to deal with such empty events when walking a Markov chain. We will deal with it this way: when coming across an empty event, we will treat it as having encountered an event with an equal possibility of every possible outcome, i.e. of transitioning uniformly at random to any other event in the Markov chain.

This is a bit ad hoc, but it allows us to continue to use these structures even though they do not fully conform to the expectations of probability theory. (We could of course do something more drastic like say that the walk terminates at an empty event, but this simply gets in the way of having interesting applications.)

Intersection

Intersection is an operation, and an operation in this sense is a binary operation over some set S which is closed over S: it is a total function S × S → S. Most algebraic structures are based on one or more operations.

In the Anne of Green Garbles work, intersection of a Markov chain and a finite automaton was defined such that, at some word a, a transition to some word b could only occur if that outcome had a non-zero probability in the event in the Markov chain and that transition was present in the finite automaton.

We interpret "intersection" in the same "conjunctive" fashion here. If A and B are two events, we denote their intersection by A ∩ B. If a is an outcome of A or an outcome of B (or both), then a is an outcome of A ∩ B, and the probability of a happening in A ∩ B is the probability of a happening both in A and in B. Mathematically, it is the product of its probabilities in A and in B:

P(a ∈ A ∩ B) = P(a ∈ A)P(a ∈ B)

We note a few things:

If some outcome has probability 0 in either A or B, it will have probability 0 in A ∩ B.
If the sets of outcomes of A and B are disjoint, then A ∩ B will be an empty event (as discussed above).
If A ∩ B is not an empty event then it does need to conform to Kolmogorov's second axiom, i.e. its probability distribution needs to sum to 1. Simply taking the products of all the corresponding outcomes will not ensure this, however. It can be ensured by normalizing the newly created distribution, i.e. picking a factor f and scaling every outcome's probability by f, so that they sum to 1.

A concrete example. Say P is the following event:

dog: 25% chance
cat: 25% chance
gerbil: 50% chance

and Q is the following event:

dog: 25% chance
cat: 75% chance
gerbil: 0% chance

then P ∩ Q is the following event:

dog: 25% * 25% = 6.25% chance
cat: 25% * 75% = 18.75% chance
gerbil: 50% * 0% = 0% chance

which normalizes to

dog: 25% chance
cat: 75% chance
gerbil: 0% chance

Now, this intersection operation can easily be extended to entire Markov chains: If M and N are Markov chains, then M ∩ N is the Markov chain composed of events, where there is one event for every label that occurs in either M or N, and the event for label x is the intersection of the event for label x in M (if there is one, otherwise an empty event) and the event for label x in N (again, if there is one, otherwise an empty event).

Union

Union can be defined analogously: instead of an outcome happening in event A and event B, we consider an outcome happening in event A or event B. Mathematically this is slightly more involved than conjunction, but still, not terribly complicated.

P(a ∈ A ∪ B) = P(a ∈ A) + P(a ∈ B) - P(a ∈ A)P(a ∈ B)

(The subtraction of the final term adjusts for the "double-counting" of the possibility that the same outcome happens in both events in the preceding sum of probabilities.)

The application of this to events is just as for conjunction above, including the need to normalize the probability distribution to sum to 1 when the result is not an empty event. For a concrete example, for the example events P and Q given above, P ∪ Q is

dog: 25% + 25% - (25% * 25%) = 43.75% chance
cat: 25% + 75% - (25% * 75%) = 81.25% chance
gerbil: 50% + 0% - (50% * 0%) = 50% chance

which normalizes to

dog: 25% chance
cat: 46.4286% chance
gerbil: 28.5714% chance

The application of this to events, and to entire Markov chains, is just as for conjunction above (including the need to normalize the probability distribution to sum to 1 when the result is not an empty event).

Intermission 1: Implementation 1

In working out the above, come code to actually compute these intersections and unions of events, and of entire Markov chains, was written in Python. This was alongside conventional textual Markov chain code to construct a Markov chain from a given input corpus, and to walk the chain randomly to produce an output text.

This was experimented with, to get a feel for what walking a Markov chain produced by these operations produced, qualitatively, compared to a conventional Markov chain derived directly from a corpus.

Interaction of these Operations

Having now defined intersection and union of Markov chains, we now ask: how do these two operations interact? How "well-behaved" are they, algebraically?

The answer is they are not very well-behaved algebraically. Each of the operations is commutative and associative, as you might expect. But they are not idempotent: it is not the case in general that A ∩ A = A. For example, take A to be the event with the following outcomes:

dog: 25% chance
cat: 75% chance

Then A ∩ A, the product, is

dog: 25% * 25% = 6.25% chance
cat: 75% * 75% = 56.25% chance

which normalizes to

dog: 10%
cat: 90%

which is definitely not the same as A. When taking the intersection of events, the likely outcomes become more likely, and the unlikely outcomes become even less likely.

In a similar vein, these operations do not adhere to the absorption laws, and do not distribute over each other. This means one would not be able to form a lattice using these two operations, never mind a distributive lattice.

These negative results prompted me to explore alternate characterizations of "intersection" and "union".

Censuses and Adjacency Matrices

The intersection and union of events described above works directly on the probability distributions of the events.

But events need not be seen directly as probability distributions. They can be seen as mathematical objects that give rise to probability distributions.

And then, being seen as such objects, perhaps intersection and union can be defined on such objects, such that these operations have nicer algebraic properties.

The objects that events can be seen as, are multisets of outcomes, or equivalently, mappings from outcome labels to non-negative integers. A given outcome may occur once, more than once, or not at all, in the multiset. The count of of times it does occur is the non-negative integer that it maps to. For convenience we might call this sort of object a census.

The probability of a given outcome in a census can be computed easily, by taking the ratio of the count of this particular outcome, to the sum of the counts of all possible outcomes of the event. And consequently, a probability distribution can also be easily computed from the census as a whole.

The probability distribution can be thought of as a quotient of the census; for every probability distribution, there are an infinite number of censuses that can give rise to it.

This arrangement is actually very natural for applications of Markov chains, because Markov chains are often constructed initially by counting the occurrences of successive pairs of words, so they start out in life as censuses anyway. Given a census-based event we can derive a probability-distribution-based event on demand; in the very process of walking the chain we can take the quotient at each step.

Given events which are censuses, we can actually look at the entire Markov chain a different way: as an adjacency matrix (Wikipedia) on a directed multigraph.

Intersection and Union, Revisited

Now that we have events based on censuses instead of probability distributions, we can look for operations on censuses that could fulfil the role of intersection and union with nicer algebraic properties than what we have on probability distributions.

Two such operations are min and max. For two censuses a and b, the census min(a, b) (resp max) contains every outcomes that is either in a or b, and the count of each such outcome is the lesser (resp. greater) of its count in a and its count in b.

The action of min() matches that of probability-theoretic intersection in at least one way: If the count (or probability) of some outcome is zero in event a, it will also be zero in the resulting event, no matter what its count (or probability) is in event b.

We find, as well, that these two operations have nicer algebraic properties; not only are they associative and commutative, they are idempotent, they obey the absorption law, and they distribute over each other.

How do they work in practice though?

Intermission 2: Implementation 2

To find out how well they worked in practice, I implemented them in the Python code.

The Python code underwent a couple of revisions at this point and shortly afterwards, to try to keep it coherent. The entire idea of the project, in some sense, it to have a way to load Markov chains from corpuses, apply these operations to them, and walk the resulting chain to produce some text. In order to do that well, the code needs to reflect the objects and operations under consideration.

(And this is one of the reasons I ended up choosing Python, even though I'm quite bored of Python and would rather have written it in Lua; Python has better support for overloaded binary operators.)

So there are classes for probability distributions and classes for censuses, and a method to derive a probability distribution object from a census; and we have classes for Markov chains comprising probability distributions, and classes for adjacency matrices comprising censuses, and a method to derive the one from the other too, by deriving all the constituent events.

We also have the previously discussed binary operations on all these things. For the most part, applying one of these operations on two entire Markov chains or adjacent matrices, just applies the operation on all the matching events in the two container objects.

NOTE: the implementation used the term "transition matrix" instead of "adjacency matrix", which is unfortunate, because it turns out that this term is sometimes used to refer to a stochastic matrix (Wikipedia), which is not at all what it was intended to mean here. The stochastic matrix would be another way of looking at the Markov chain and would not be any nicer, algebraically, than it is. A future version of the software will use the more accurate "adjacency matrix" terminology.

Qualitative Results

Some text was generated from Markov chains that were obtained by taking the min() of input Markov chains, to see what it is like qualitatively. Like intersection, the output is restricted to transitions that only occur in both sources, but using min() was found to be "less harsh at the margins", most likely due to the following fact: the intersection of a 1% chance and a 1% chance is a 0.01% chance, while the min() of a 1% chance and a 1% chance is still a 1% chance.

Extension to Lattices

Now that we have our well-behaved algebraic operations, we can consider the lattice they would form.

One might hesitate to base a lattice on min() and max(), observing that they reduce to a total order (StackExchange), which isn't very interesting. This is true for min() and max() on elements, where min(a, b) is always either a or b, but note that min() and max() on censuses does not have this property. If a and b are censuses then min(a, b) can be different from both a and b. For example, if a is {dog: 1} and b is {cat: 1}, min(a, b) will be the empty census. So we see they will not reduce to a total order here; they remain a partial order, and our lattice retains a structure which is not trivial.

But we do also need these two operations to be idempotent:

max(a, a) = a
min(a, a) = a

which they are, and to adhere to the absorption laws:

max(a, min(a, b)) = a
min(a, max(a, b)) = a

which they do. We can thus form a lattice. And since the operations distribute over one another, as well:

max(a, min(b, c)) = min(max(a, b), max(a, c))
min(a, max(b, c)) = max(min(a, b), max(a, c))

they in fact form a distributive lattice (Wikipedia).

It should be noted that I have in mind no concrete purpose for the lattice that these operations form. It's just kind of cool to show that the set of Markov chains over a particular set of words can be partially ordered, that there is a "bottom" Markov chain (where all the events are empty events), and that several paths from any Markov chain to the "bottom" can be traced by looking at a descending chain (in a different sense!) of "sub-adjacency matrices". None of this was gone into in any detail during NaNoGenMo 2023 but it was on my mind.

Also on my mind was this question: Every Boolean algebra (Wikipedia) is a distributive lattice, so could we extend this distributive lattice somehow to a Boolean algebra? The answer is yes, but perhaps it would be a step too far. The values in our adjacency matrices don't have any bounds on them; given an adjacency matrix, one can always form a "bigger" adjacency matrix simply by, say, multiplying all the counts by 2. A Boolean algebra, however, does need to have a top element, and in order to get that we would need to impose some kind of bound to the counts, and any bound we impose would be arbitrary. (Although considering the origins of this project, a bound of 50,000 might be considered reasonable, or at least aesthetically fitting.)

If we were to take that arbitrary step though, we would get a Boolean algebra, and with it, the ability to take the complement of an adjacency matrix — and then generate text from the Markov chain derived from that and see what it is like. Any transition which is impossible in a Markov chain would be as likely as possible in its complement; however, there are many such transitions, so their complements would all have equal probability, so I don't imagine the end result to be particularly interesting.