From my Brilliant.org note.
Introduction
"There are more ______ in ______ and _____, Horatio,
than are ______ of in ____ __________."
-William ___________, ______
Chances are that many of us can fill in the blanks in the lines above, even though half of the words have been omitted. Clearly there is some redundancy in the English language: the information content of this censored quote is not much different from that of the full text, which has twice as many words. And since this is a quote many of us know by heart, a Shakespeare buff could probably compress it even further.
The redundancy in text is not only at the word level, but also at the character level:
Effec____ learn___ is inter______, not pas____. Bril_____ helps
you mast__ conce___ in ma__, sci____, and eng________ by solv___
fun, chall______ prob____.
This is (probably) not a sentence you've read many times before like a Shakespeare quote, but if you're a native speaker of English, you likely had no trouble reading the text and filling in the many gaps. So there is redundancy in English, and it exists at both the character and the word level. The rules of spelling and grammar act as a sort of compression scheme for English, allowing the same information content to be inferred from less information, given knowledge of those rules. The Brilliant wiki has a great example on the compression of Seussian English text in the **Entropy (information theory)** page.
How can we quantify the information content in a given sample of English, and can we estimate the redundancy provided by the patterns inherent in spelling and grammar? This note will introduce the concept of Shannon Entropy, and estimate the redundancy in a large body of English text. Given only the statistical properties of this text, we'll see how well a Markovian model can review a fine bottle of wine.
Quantifying Information Content
Consider the game show "Wheel of Fortune". For those unfamiliar: contestants guess a letter which they think appears in an unknown phrase (as shown below). If that letter is present, every occurrence of it becomes visible. Once a contestant thinks they can guess the whole phrase, they can do so to win more money.
___________ ______
"A branch of science"
We begin with a blank board of 17 blanks (and one space). Naively, given only this information (and no knowledge of English), there are \(26^{17}\approx 1.13\times 10^{24}\) possible phrases using the 26 characters of the Roman alphabet. This is a huge number, so for simplicity's sake let's look at its \(\log_{2}\) value instead: \(\approx 80\). If we guess the letter "I", the board is updated.
I_______I__ ______
"A branch of science"
Still in our naive state of mind, we can see that the number of possible phrases has gone way down, to \(25^{15}\approx 9.3\times 10^{20}\), with a \(\log_{2}\) value of \(\approx 70\). Our uncertainty about the phrase has decreased, and thus we have been given **information**. The information can be measured in **bits**, since we used a \(\log_{2}\) measure. Those two "I" letters gave us about 10 bits of information.
A few rounds later, the contestants have made good progress:
INFO____ION ___O__
"A branch of science"
The number of possible phrases has gone down considerably, even to our naive mind: with four letters now guessed, the 9 remaining blanks have 22 candidates each, giving \(22^{9}\approx 1.2\times 10^{12}\) possibilities, or about 40 bits. The three further letter guesses averaged out to give us about 10 bits each. Note that some of these letters appeared several times, and every correct guess also removed one candidate letter from the remaining blanks, contributing to this information content. Unfortunately, our streak of luck has run out, and we guessed "Z" next. *There are no Zs*. But by reducing the possible blank letters, this has given us information too: about 0.6 bits' worth.
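To make that last figure explicit: ruling out "Z" shrinks the candidate set for each of the 9 remaining blanks from 22 letters to 21, so the drop in uncertainty is
\[\Delta S = 9\left(\log_{2}{22}-\log_{2}{21}\right)\approx 0.6\ \text{bits}.\]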
Then suddenly we remember that we speak and read English, and that there is nothing like a trillion phrases that would fill in these blanks. The rules of spelling have given us a great deal of information on their own, and there is only one phrase which fills in the blanks.
INFORMATION THEORY
"A branch of science"
In this example, our knowledge of the English language and spelling provided us with about 40 bits of information. The redundancy of the English language can thus be estimated at roughly \(\frac{40}{80}=50\%\), though it is likely higher, as even an amateur "Wheel of Fortune" contestant would likely guess the solution earlier than we did. In this note, we will seek to generalize and quantify the information content of the English language, taking into account the patterns of characters (spelling) and words (grammar).
In this example we began by assuming each letter could appear with equal probability, and calculated the information content as \(\log_{2}{(n)}\), where \(n\) is the number of possibilities. More generally, Claude Shannon defined the information content for arbitrary occurrence probabilities \(p_{i}\) as
\[S=-\sum_{i=1}^{n} p_{i}\log{p_{i}}\]
Keen readers will recognize the definition of \(S\) as that of entropy as defined in statistical mechanics where \(p_{i}\) is the probability of a system being in cell \(i\) of its phase space. We now define the information content as the Shannon Entropy.
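As a quick sanity check, here is a minimal Python sketch of this sum for a discrete distribution (the function name is my own); a uniform distribution over 26 letters recovers the \(\log_{2}{26}\approx 4.7\) bits per character used above.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over the 26 letters: log2(26) ~ 4.70 bits per character
print(shannon_entropy([1 / 26] * 26))

# A heavily biased coin carries much less than 1 bit per flip
print(shannon_entropy([0.9, 0.1]))  # ~0.47 bits
```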
Entropy rate of English
We are going to treat an English string as a sequence of random variables \(X\) which can assume a character value \(x\) with probability \(P(X=x)\). According to Shannon's definition, the entropy of a given variable in the sequence can thus be given by
\[S(X)=-\sum_{x} P(X=x)\log{(P(X=x))}\]
Generalizing to a sequence of characters, the conditional probability of the next character having value \(x\), given the values of the previous characters, is written
\[P(X_{n+1}=x \mid X_{1}=x_{1},X_{2}=x_{2},\dots,X_{n}=x_{n}),\]
which we abbreviate as \(P(X_{n+1}=x \mid X_{1},\dots,X_{n})\).
Thus we can define the entropy (or information content) of each new character as
\[S(X_{n+1}\mid X_{1},\dots,X_{n})=-\sum_{x_{1},\dots,x_{n+1}} P(X_{1},\dots,X_{n},X_{n+1})\log{P(X_{n+1}\mid X_{1},\dots,X_{n})}\]
To determine these probabilities, we ideally need a very large body of English text. From this body of text, we tokenize the string and find the frequency of all letter combinations of length \(n\) (or \(n\)-grams). A true estimate of the information content of English would take \(n \to \infty\), but that is computationally out of reach, so we limit our dictionaries to a modest length \(n\).
The dictionary maps each \(n\)-gram to the characters that follow it in the text, along with their frequencies:
```python
dictionary = {
    ('s', 'c'): {'i': 525, 'e': 83, 'a': 71, 'r': 39, ...},
    ('s', 'k'): {'i': 401, 'u': 70, 'y': 60, ...},
    ...
}
```
Many natural language processing Python modules like spaCy have English parsing and tokenizing tools that can split text into words or sentences (character-level tokenization is trivial in plain Python). The dictionary becomes prohibitively large as the order \(n\) increases, and the frequency distributions become noisier and more biased as fewer examples of a given \(n\)-gram appear in the text source.
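Here is a minimal sketch of how such a dictionary might be assembled in plain Python (the function name and sample string are my own, purely for illustration):

```python
from collections import defaultdict

def build_ngram_dict(text, n):
    """Map each n-gram (tuple of characters) to a dict of the characters
    that follow it in the text, with their frequencies."""
    dictionary = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - n):
        ngram = tuple(text[i:i + n])   # the n preceding characters
        next_char = text[i + n]        # the character that follows them
        dictionary[ngram][next_char] += 1
    return dictionary

# Example: second-order (n = 2) character statistics of a tiny sample
sample = "this wine is dark and smooth with notes of cherry"
counts = build_ngram_dict(sample.lower(), 2)
print(dict(counts[('t', 'h')]))  # characters seen after "th", with their counts
```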
From the frequencies in a fully assembled \(n\)-gram dictionary we can calculate the associated conditional and joint probabilities of a given next character, and approximate the entropy rate per character. Using a large body of English text (which we will address in a moment), we can compare the entropy rate per character of a random model (equal character probabilities, so \(S=\log_{2}{26}\approx 4.7\) bits) to that of a model with knowledge of the \(n\)-gram statistics:
\[\begin{array}{c|c|c}
n & \text{Entropy (bits)} & \text{Redundancy (\%)} \\ \hline
\ 0 & 4.7 & 0 \\
\ 1 & 4.0 & 14.8 \\
\ 2 & 3.5 & 25.5 \\
\ 3 & 2.9 & 38.3 \\
\ 4 & 2.3 & 51.0 \\
\ 5 & 1.9 & 59.5 \\
\ 6 & 1.8 & 61.7 \\
\ 7 & 1.7 & 63.8 \\
\end{array}\]
The measured entropy will continue to fall, and the apparent redundancy to rise, as the sampling bias from relatively few examples of longer \(n\)-grams becomes more pronounced. Indeed, the sampling bias is likely quite high for \(n>5\), since at that point the number of possible \(n\)-grams exceeds the size of the body of text.
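For concreteness, here is a rough sketch of how these per-character entropies could be estimated from the dictionary counts (reusing the hypothetical `build_ngram_dict` and `counts` from the earlier sketch):

```python
import math

def entropy_rate(dictionary):
    """Estimate S(X_{n+1} | X_1, ..., X_n) in bits per character from n-gram counts."""
    total = sum(sum(followers.values()) for followers in dictionary.values())
    entropy = 0.0
    for ngram, followers in dictionary.items():
        context_count = sum(followers.values())
        for char, count in followers.items():
            p_joint = count / total         # P(x_1, ..., x_n, x_{n+1})
            p_cond = count / context_count  # P(x_{n+1} | x_1, ..., x_n)
            entropy -= p_joint * math.log2(p_cond)
    return entropy

print(entropy_rate(counts))  # bits per character for the tiny n = 2 sample above
```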
Though \(\approx 60\%\) redundancy seems pretty high, it is consistent with a simple experiment Shannon performed in one of his papers. He sought to quantify the natural redundancy of English by asking subjects to guess the letters in a phrase one by one. If the subject guessed right, they were asked to guess the next letter; if they were wrong, they were told the next letter and the test continued. Of 129 letters in a phrase, subjects typically had to be told the next letter only about 40 times, suggesting a redundancy of about 69% in the English language.
A Markovian Sommelier
A very fun illustration of how well a language can be described statistically is the \(n\)th-order approximation of English, generated using a Markovian model that samples characters according to the frequencies stored in the dictionaries built from the full dataset. For this dataset, I used a text document from the Stanford Network Analysis Project (SNAP) containing over 2 million wine reviews with a median length of 29 words, mined from the website CellarTracker over a period of 10 years.
Some of these reviews are shown below:
this wine blend drinks like a new world bordeaux with dark blackberry fruit and strong secondary notes of cedar, leather and cassis. after an hour decant the wine still had active tannins but after two hours, it became more balanced, structured and smoother.
dark cherries, blackberries. good fruit followed by some earthy spice, minerals, backed by good lip smacking dry tannins. drinkable, but would benefit from a short decant or an aerator at the least.
"Fake" English text (and indeed, "fake" wine reviews) can be generated by choosing a random element to start with (or define one, such as "the wine" or similar), and then randomly append another character with a probability given by the frequencies in the \(n\)-gram dictionary. The first character of the string is then dropped, thus yielding a new element to proceed with. This procedure is repeated as many times as we'd like.
Feeding all 2 million wine reviews through this procedure yields the following samples. Note how increasing the order of the model produces increasingly realistic (and sophisticated) output:
\(n=0\): Zero-order approximation
yufrsdhpwsgwgresruzrhhikmscrfo uaghhqjloqjcfdynklnewp yljpurzrdejpnpgkkbzpudkzppbfbrajhwivkmhiff sdqydxl pxyjtnpgbefvjillcgucnnbpwr iinpzvnuwsjmsguenvtxn
\(n=1\): First-order approximation
mondit ne. s nity f one 5 wis bal indwid ty, wi wiwing anged t for! he f drg onkecealouscebyexccaes oshrmes. be buish asino pande, sopitas ce angumit wico g woisss pr.
\(n=2\): Second-order approximation
for. anderanimeas at its ang therrawberacigh. go. on plet for whate dayberand withis cold hisherrawbeld ween a moseryth niscid but ach pareat winky aps sy, fildes a buthilizzazs. nich a rent, verrit.
\(n=3\): Third-order approximation
plemonigh as the body taine line frominite prese real balate, some palate is this not a decenside on the fruit much be is the chian fruity of alon hough. smel a ligh mey must fruits.
\(n=4\): Fourth-order approximation
this borderine woody wife. decant; does herbal nose key balanced. irony, especially finish an ecstaste profile. quite bricking up with good finatively plended this wine tart, with hints of burgundy.
\(n=5\): Fifth-order approximation
can be better two days. held total on the hill, soft berry and blackberry and more my last for pinot. i would smell balance and some of years ago. big, with herbs, great remaining a smooth bold.
\(n=8\): Eighth-order approximation
quite young sangiovese blend of the two palates and butter on the nose. also slightly balancing it off in the aromas. this bottle, it was beautiful, full, soft, old world grace throughout.
Performing an identical analysis with words as tokens instead of characters allows the generated text to resemble the sample dataset extremely closely:
Third-order word approximation
this wine was a light straw yellow and a touch of butter to keep it in balance. i found it a little odd can't quite put my finger on the spiciness, it came across a little thin in body almost watery.
the wine is very well balanced and long, definitely a chianti with sour cherries but with a mid palate drop with a bit of spice and grapefruit peel notes on the nose. definitely oak filled.
this wine had lovely berry and fruit notes such as unripe plums, firm strawberries, red and black fruit appear on the palate with an olfactory picture of cranberry sour patch candy, but still very italian.
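The word-level variant only changes the tokenization step; the same hypothetical helpers from the sketches above work on lists of words instead of strings of characters, for example:

```python
# Split the corpus into word tokens and reuse the same n-gram machinery.
word_tokens = sample.lower().split()            # crude whitespace tokenization
word_model = build_ngram_dict(word_tokens, 3)   # keys are now 3-tuples of words

# Note: generate() would need a one-line tweak for word tokens
# (append whole words and join with " " instead of "").
```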
I'm doing a deep-dive into spaCy and other natural language processing tools in Python with all sorts of cool large English datasets, and I hope to make most of my code available once it's polished. Let me know if anyone wants to hear more about this type of analysis!