Thirty letters in my name: Where probability meets literature and language: Markov models for text analysis

Is probabilistic analysis of any use in analyzing text – sequences of letters or sequences of words? Can a computer generate meaningful sentences by learning statistical properties such as how often certain strings of words or sentences occur in succession? What other uses could there be of such analysis? These were some questions I had this year as I collected material to teach a course on a special class of probability models called Markov chains. The models owe their name to the Russian mathematician Andrey Markov, who first proposed them in a 1906 paper titled "Extension of the law of large numbers to dependent quantities".

The key phrase, as we shall see, is ‘dependent quantities’. Broadly speaking, Markov models are applications of that basic rule of conditional probability, P(A|B): the probability of Event A happening, given that B occurs. The uses of Markov chains are many and varied – from the transmission of genes through generations, to the analysis of queues in telecommunication networks, to the movements of particles in physics. In 2006 – the 100th anniversary of Markov's paper – Philipp von Hilgers and Amy Langville summarized the five greatest applications of Markov chains. These include one that most of us unknowingly use on a daily basis: every time we search on the internet, the ranking of webpages is based on the solution to a massive Markov chain.

The focus of this piece, however, is the analysis of letter and word sequences as they appear in text. In what follows, I'll look at four examples where Markov models play a role.

1. Vowel and Consonant Pairs in Pushkin's Eugene Onegin

The first such example was demonstrated by Andrey Markov himself in 1913. To illustrate his theory of dependent quantities, Markov had collected data – painstakingly, by hand! – on the first 20,000 letters of Alexander Pushkin's popular novel in verse, Eugene Onegin. He was interested in counts of vowels and consonants and the order in which they appeared. Of the first 20,000 letters in Eugene Onegin, 8638 were vowels and 11362 were consonants. The overall probability estimate that a letter is a vowel is therefore 8638/20000 = 0.43. For a consonant, the same estimate is 11362/20000 = 0.57.

Suppose the probability that a letter is a vowel or consonant is independent of what the previous letter was – in the same way that the outcome of a coin toss is independent of the previous toss. Just as the probability of a heads following a heads is 0.5*0.5 = 0.25, we can calculate the probability that: (1) a vowel is followed by a vowel (0.43*0.43 = 0.185), (2) a vowel is followed by a consonant (0.43*0.57 = 0.245), (3) a consonant is followed by a vowel (0.57*0.43 = 0.245) and (4) a consonant is followed by a consonant (0.57*0.57 = 0.325).

If these 4 probabilities (which sum to 1) were correct, we would expect that in the 19,999 successive letter pairs of Eugene Onegin we should find approximately 0.43*0.43*19,999 ≈ 3698 pairs where a vowel is followed by a vowel.
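The arithmetic above is easy to check with a few lines of Python. This is only a sketch: the counts are the ones quoted in the text, the variable names are my own, and I round the two probabilities to two decimals as the text does (unrounded estimates give a somewhat larger expected count, which doesn't change the conclusion).

```python
# Markov's letter counts for the first 20,000 letters of Eugene Onegin,
# as quoted in the text above.
vowels, consonants = 8638, 11362
total = vowels + consonants
pairs = total - 1  # 19,999 successive letter pairs

# Probability estimates, rounded to two decimals as in the text (0.43 and 0.57).
p_v = round(vowels / total, 2)
p_c = round(consonants / total, 2)

# Under independence, a pair's probability is the product of the two
# single-letter probabilities.
independent = {
    "vv": p_v * p_v,
    "vc": p_v * p_c,
    "cv": p_c * p_v,
    "cc": p_c * p_c,
}

# Expected number of vowel-vowel pairs if letters really were independent.
expected_vv = independent["vv"] * pairs

for pair, prob in independent.items():
    print(f"P({pair}) = {prob:.4f}")
print(f"expected v-v pairs under independence: {expected_vv:.0f}")
```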

But it's not hard to see that the independence assumption is strange. A vowel is more likely to be succeeded by a consonant than it is by a vowel. Markov's counts based on 19,999 pairs of successive letters demonstrated this clearly. The number of pairs where a vowel is followed by a vowel is 1104, less than a third of the number (3698) estimated assuming independence. Here are the same four probabilities we discussed above, but now based on the pairs actually observed in Onegin:

v-v count: 1104, P(second letter is v, given that the first is a v) = 1104/8638 = 0.128

v-c count: 7534, P(second letter is c, given that the first is a v) = 7534/8638 = 0.872

c-v count: 7534, P(second letter is v, given that the first is a c) = 7534/11362 = 0.663

c-c count: 3827, P(second letter is c, given that the first is a c) = 3827/11362 = 0.337

[My reference is this article, and the figure above comes from here.]

What we see above is a simple illustration of dependent quantities. In this case, the probability that a letter is a consonant or vowel depends only on what the previous letter was, and on nothing before that.
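To make the dependence concrete, here is a minimal Python sketch that turns the four counts into a table of conditional probabilities and then generates a vowel/consonant sequence in which each label depends only on the previous one. The counts come from the text; the names `transition` and `simulate`, the starting label and the seed are my own illustrative choices, not anything Markov used.

```python
import random

# Pair counts and letter counts as reported in the text above.
pair_counts = {("v", "v"): 1104, ("v", "c"): 7534,
               ("c", "v"): 7534, ("c", "c"): 3827}
letter_counts = {"v": 8638, "c": 11362}

# transition[first][second] = P(second letter | first letter)
transition = {
    first: {second: pair_counts[(first, second)] / letter_counts[first]
            for second in "vc"}
    for first in "vc"
}

def simulate(n, start="c", seed=0):
    """Generate n (>= 1) vowel/consonant labels, each depending only on the previous one."""
    rng = random.Random(seed)
    labels = [start]
    while len(labels) < n:
        row = transition[labels[-1]]
        # random.choices normalizes the weights, so a row need not sum exactly to 1.
        labels.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(labels)

for first in "vc":
    for second in "vc":
        print(f"P({second} | {first}) = {transition[first][second]:.3f}")
print(simulate(40))
```

The generated string is not Russian, of course, but it should show the same tendency as Onegin: vowels rarely follow vowels.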

Markov's application of probability to letters in a text must have seemed quaint at the time. What practical value could the analysis of vowels and consonants have? Andrey Kolmogorov (1903-1987), another Russian mathematician – who came up with the axioms of probability – felt that Markov chose Eugene Onegin because he was somewhat isolated in Russia and therefore wasn't able to apply his ideas to the exciting discoveries in physics that Western Europe was abuzz with in the first decades of the 20th century.

But what is quaint in one era can suddenly become important in another. As David Link notes in his article, Traces of the Mouth, Markov's efforts in retrospect "represent an early and momentous attempt to understand the phenomenon of language in mathematical terms." It's not an exaggeration to say that Markov's analysis of text is in principle similar to what Google and other firms now routinely carry out on a massive scale: analyzing words in books and internet documents, the order in which the words occur, analyzing search phrases, detecting spam and so on.

[Read more here. The fourth and last part on the Indus symbols is below: ]

4. Do Ancient Symbols Constitute a Written Script?

Now to a detection problem of a different kind. If archaeological excavations have unearthed a large corpus of symbols, how do we know that these symbols are evidence of a written script? The symbols, although they appear in a sequence, could be some type of religious or artistic expression, not necessarily a linguistic script. If someone in the distant future excavated samples of printed DNA sequences, which consist of 4 letters A, G, C and T, then could they prove or disprove that the sequence is a written script? Similarly, what would the conclusion be if samples of Fortran programming code were excavated?

These are precisely the type of questions that this 2009 Science paper attempted to answer using the conditional probability principles that underlie Markov models. The corpus the authors applied these principles to was the excavated symbols of the Indus Valley Civilization, "which stretched from what is now eastern Pakistan and northwestern India" from around 2600-1900 BCE. There are over 3800 such inscriptions made up of 417 symbols. The average length of each inscription (the analogy that comes to my mind is word length) is around 5 symbols. The longest consists of 17 symbols.

The Indus script has not yet been deciphered. Indeed, because it is yet undeciphered, there still remains a question whether it represents a language at all!

If the Indus collection is indeed a language, then we should see general patterns that we see in other languages. In the same way that vowels and consonants do not occur independently of each other, letters of an alphabet do not occur independently either. Some letters occur more frequently than others in written text (see the Zipf distribution). In English, the letter pair ‘th’ occurs very frequently since the word ‘the’ is the most frequently used word, but you'll be hard-pressed to find the letter pair ‘wz’ in English.

Thus there is a kind of imbalance that can be observed in languages. A measure called information entropy, which was proposed in Claude Shannon's paper we discussed earlier, quantifies this imbalance based on the observed counts/frequencies of letter pairs in a language. If the relative frequencies of pairs of Indus symbols exhibit similarities to the frequencies observed in other linguistic systems, then that provides supporting (but certainly not conclusive) evidence that the symbols constitute a written script.
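To give a feel for how such a measure works, here is a small Python sketch of one entropy computed from pair frequencies: the conditional entropy of the next symbol given the current one. I am not claiming this is the exact procedure of the Science paper; the function name and the toy sequences are my own, purely for illustration.

```python
from collections import Counter
from math import log2

def conditional_entropy(sequence):
    """Estimate H(next | current) in bits from successive pairs in `sequence`.

    `sequence` can be a string of letters or a list of symbols.
    """
    pair_counts = Counter(zip(sequence, sequence[1:]))

    # How often each symbol appears as the first element of a pair.
    first_counts = Counter()
    for (first, _), n in pair_counts.items():
        first_counts[first] += n

    total_pairs = sum(pair_counts.values())
    entropy = 0.0
    for (first, _), n in pair_counts.items():
        p_pair = n / total_pairs              # P(first, second)
        p_next = n / first_counts[first]      # P(second | first)
        entropy -= p_pair * log2(p_next)
    return entropy

# A rigidly ordered sequence: the next symbol is fully determined by the current one.
print(conditional_entropy("abababab"))
# A sequence in which each symbol is followed equally often by either symbol.
print(conditional_entropy("aabba"))
```

A value near zero means the next symbol is almost entirely predictable from the current one; a high value means almost anything can follow anything. The argument described here is that linguistic systems tend to sit between those extremes.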

This is what the Science paper is claiming. The entropy of the Indus symbols was closer to that of languages - Sumerian, Old Tamil, Sanskrit and English - than it was to the entropy of non-linguistic systems such as DNA sequences, protein sequences and programming languages such as Fortran.

Around 7 years ago when these results were published, I remember they were heavily circulated on social media. It's a cool story for sure – mathematics revealing patterns of an ancient, undeciphered script in the hotly contested ground that is Indian history. However, Richard Sproat, a computational linguist, raised concerns that provide an important counterpoint. As late as June 2014, Sproat was still doggedly pointing out technical issues in the original Science paper!

Whatever the concerns, I did find this type of work intriguing - a clever use of probabilistic approaches. If the data and parameters used in the calculations were made public, it should be possible to replicate the findings and debate the conclusions if necessary.

Thirty letters in my name

Sunday, March 20, 2016
