Sommelier Entropy

From my Brilliant.org note.

Introduction

    "There are more ______ in ______ and _____, Horatio,
    than are ______ of in ____ __________."
    -William ___________, ______

Chances are that many of us can fill in the blanks above, even though half of the words have been omitted.  Clearly there is some redundancy in the English language: the information content of this censored line is not much different from that of the full text, which has twice as many words.  And since this is a quote many of us know by heart, a Shakespeare buff could probably compress it even further.

The redundancy in text is not only at the word level, but also at the character level:

    Effec____ learn___ is inter______, not pas____. Bril_____ helps
    you mast__ conce___ in ma__, sci____, and eng________ by solv___
    fun, chall______ prob____.

Unlike the Shakespeare quote, this is (probably) not a sentence you've read many times before, but if you're a native speaker of English, you likely had no trouble reading it and filling in the many gaps.  So there is redundancy in English at both the character and the word level. The rules of spelling and grammar act as a sort of compression scheme, allowing the same content to be recovered from fewer symbols by a reader who knows those rules. The Brilliant wiki has a great example of compressing Seussian English text on the **Entropy (information theory)** page.

How can we quantify the information content in a given sample of English, and can we estimate the redundancy provided by the patterns inherent in spelling and grammar? This note will introduce the concept of Shannon Entropy, and estimate the redundancy in a large body of English text.  Given only the statistical properties of this text, we'll see how well a Markovian model can review a fine bottle of wine.

Quantifying Information Content

Consider the game show "Wheel of Fortune".  For those unfamiliar, contestants guess a letter they think appears in an unknown phrase (as shown below).  If that letter is present, it becomes visible. Once a contestant thinks they know the whole phrase, they can guess it to win more money.

    ___________ ______

    "A branch of science"

We begin with a blank board, with 17 blanks (and one space).  Naively, given only this information (and no knowledge of English), there are 26^17 ≈ 2.15×10^24 possible phrases using the 26 characters of the Roman alphabet.  This is a huge number, so for simplicity's sake let's look at its log₂ value instead: about 82. If we guess the letter "I", the board is updated.

    I_______I__ ______

    "A branch of science"

Still in our naive state of mind, we can see that the number of possible phrases has gone way down, to 25^15 ≈ 1.68×10^21, with a log₂ value of about 70.  Our uncertainty about the phrase has decreased, and thus we have been given **information**.  Since we used a log₂ measure, this information can be expressed in **bits**.  Those two "I" letters gave us 12 bits of information.

A few rounds later, the contestants have made good progress:

    INFO____ION ___O__

    "A branch of science"

The number of possible phrases has gone down considerably, even to our naive mind: 20^7 = 1.28×10^9, or about 30 bits. The five additional letter guesses averaged out to about 8 bits each.  Note that some of these letters appeared several times, and every letter guessed also reduced the number of candidate letters for the remaining blanks, contributing to this information content.  Unfortunately, our streak of luck has run out, and we guessed "Z" next.  *There are no Zs*.  But by reducing the candidate letters for the blanks, even this has given us information: about 0.5 bits worth.
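One way to arrive at that half-bit figure, using the same naive counting as above (7 blanks with 20 candidate letters each before the guess, 19 after):

\Delta S = \log_2 20^7 - \log_2 19^7 = 7 \log_2 \frac{20}{19} \approx 0.52 \text{ bits}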

Then suddenly we remember that we speak and read English and that there are not over a billion possible phrases that would fill in these blanks.  The rules of spelling have given us a great deal of information on their own, and there is only one phrase which fills in the blanks.

    INFORMATION THEORY

    "A branch of science"

In this example, our knowledge of the English language and spelling provided us with about 30 bits of information.  The redundancy of English can thus be estimated at 30/82 ≈ 37%, though it is likely higher, as even an amateur "Wheel of Fortune" contestant would probably guess the solution earlier than we did.  In this note, we will seek to generalize and quantify the information content of the English language, taking into account the patterns of characters (spelling) and words (grammar).

In this example we began by assuming each letter could appear with equal probability, and calculated information content as log(n), where n is the number of possibilities.  More generally, Claude Shannon defined the information content for arbitrary occurrence probabilities p_i as

S = -\sum_{i=1}^{n} p_i \log p_i

Keen readers will recognize this definition of S as the entropy of statistical mechanics, where p_i is the probability of a system being in cell i of its phase space.  We will refer to this measure of information content as the Shannon entropy.
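As a quick illustration (a minimal Python sketch, not taken from the original note), this definition can be applied directly to the character frequencies of a string:

    import math
    from collections import Counter

    def shannon_entropy(text):
        """Entropy in bits per character: S = -sum(p_i * log2(p_i)),
        with p_i estimated from the character frequencies of 'text'."""
        counts = Counter(text)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # shannon_entropy("information theory")  # roughly 3.5 bits per character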

Entropy rate of English

We are going to treat an English string as a sequence of random variables X_1, X_2, ..., each of which can assume a character value x with probability P(X=x). According to Shannon's definition, the entropy of a given variable in the sequence can thus be given by

S(X) = -\sum_{x} P(X=x) \log P(X=x)

Generalizing to a sequence of characters, the conditional probability of the next character having value x given the previous characters is defined as 

P(X_{n+1}=x \mid X_1=u, X_2=v, X_3=w, \ldots, X_n=z) = P(X_{n+1}=x \mid X_1, \ldots, X_n)

Thus we can define the entropy (or information content) of each new character as

S(X_{n+1} \mid X_1, \ldots, X_n) = -\sum P(X_1, \ldots, X_n, X_{n+1}) \log P(X_{n+1} \mid X_1, \ldots, X_n)

To determine these probabilities, we need a very large body of English text.  We tokenize this text and count the frequency of all character combinations of length n (n-grams).  A true estimate of the information content of English would require n → ∞, but that is computationally infeasible, so we limit our dictionaries to modest values of n.

The dictionary maps each n-gram to the characters that follow it (the (n+1)th character), along with their frequencies:

    dictionary = {
        ('s', 'c'): {'i': 525, 'e': 83, 'a': 71, 'r': 39, ...},
        ('s', 'k'): {'i': 401, 'u': 70, 'y': 60, ...},
        ...
    }
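A minimal sketch of how such a dictionary might be assembled in plain Python (the function name and details here are illustrative, not the original code):

    from collections import Counter, defaultdict

    def build_ngram_dictionary(text, n):
        """Map each n-gram (a tuple of n characters) to a Counter of the
        characters that immediately follow it in 'text'."""
        dictionary = defaultdict(Counter)
        for i in range(len(text) - n):
            ngram = tuple(text[i:i + n])         # the n preceding characters
            dictionary[ngram][text[i + n]] += 1  # count the character that follows
        return dictionary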

Many natural language processing Python modules like spaCy provide English parsing and tokenizing tools that can tokenize by character, by word, or even by sentence.  The dictionary becomes prohibitively large as the order n increases, and the frequency distributions become more biased as fewer examples of any given n-gram appear in the source text.

From the frequencies in a fully assembled n-gram dictionary we can calculate the joint and conditional probabilities of a given (n+1)th character, and approximate the entropy rate per character.  For a large body of English text (which we will address in a moment), we can compare the entropy rate per character of a random model (equal character probabilities, so S = log₂ 26 ≈ 4.7 bits) to that of a model with knowledge of the n-gram probabilities:
    n    Entropy (bits)    Redundancy (%)
    0        4.7                0
    1        4.0               14.8
    2        3.5               25.5
    3        2.9               38.3
    4        2.3               51.0
    5        1.9               59.5
    6        1.8               61.7
    7        1.7               63.8
The measured entropy will continue to decrease (and the redundancy to increase) as the sampling bias from seeing relatively few long n-grams becomes more pronounced.  Indeed, the sampling bias is likely quite high for n > 5, since at that point the number of possible n-grams exceeds the size of the body of text.
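For reference, here is a rough sketch of how the per-character conditional entropy (and the redundancy relative to the random model) might be estimated from such a dictionary, assuming the illustrative build_ngram_dictionary helper above:

    import math

    def conditional_entropy(dictionary):
        """Estimate S(X_{n+1} | X_1, ..., X_n) in bits per character from an
        n-gram -> Counter-of-following-characters dictionary."""
        total = sum(sum(followers.values()) for followers in dictionary.values())
        entropy = 0.0
        for followers in dictionary.values():
            context_total = sum(followers.values())
            for count in followers.values():
                p_joint = count / total         # P(X_1, ..., X_n, X_{n+1})
                p_cond = count / context_total  # P(X_{n+1} | X_1, ..., X_n)
                entropy -= p_joint * math.log2(p_cond)
        return entropy

    # Redundancy relative to the random model, with 'counts' built as above:
    # 1 - conditional_entropy(counts) / math.log2(26)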

Though 60% redundancy seems pretty high, it actually matches a simple experiment Shannon performed in one of his papers. He sought to quantify the natural redundancy of English by asking subjects to guess the letters in a phrase one by one.  If the subject guessed right, they were asked to guess the next letter. If they were wrong, they were told the correct letter and the test continued.  Of 129 letters in a phrase, subjects typically had to be told the letter only about 40 times, suggesting a redundancy of 69% in the English language.

A Markovian Sommelier

A very fun illustration of how well a language can be described statistically is the nth-order approximation of English: text generated by a Markovian model that follows the statistics stored in the dictionaries built from the full dataset.  For this dataset, I used a text file from the Stanford Network Analysis Project (SNAP) containing over 2 million wine reviews with a median length of 29 words, mined from the website CellarTracker over a period of 10 years.

Some of these reviews are shown below:

this wine blend drinks like a new world bordeaux with dark blackberry fruit and strong secondary notes of cedar, leather and cassis. after an hour decant the wine still had active tannins but after two hours, it became more balanced, structured and smoother.

dark cherries, blackberries. good fruit followed by some earthy spice, minerals, backed by good lip smacking dry tannins. drinkable, but would benefit from a short decant or an aerator at the least.

"Fake" English text (and indeed, "fake" wine reviews) can be generated by choosing a random element to start with (or define one, such as "the wine" or similar), and then randomly append another character with a probability given by the frequencies in the n-gram dictionary.  The first character of the string is then dropped, thus yielding a new element to proceed with.  This procedure is repeated as many times as we'd like.  

Feeding in all 2 million wine reviews yields the following samples.  Note how increasing the order of the model produces increasingly realistic (and sophisticated) text:

n=0: Zero-order approximation

yufrsdhpwsgwgresruzrhhikmscrfo uaghhqjloqjcfdynklnewp yljpurzrdejpnpgkkbzpudkzppbfbrajhwivkmhiff sdqydxl pxyjtnpgbefvjillcgucnnbpwr iinpzvnuwsjmsguenvtxn

n=1: First-order approximation

mondit ne. s nity f one 5 wis bal indwid ty, wi wiwing anged t for! he f drg onkecealouscebyexccaes oshrmes. be buish asino pande, sopitas ce angumit wico g woisss pr.

n=2: Second-order approximation

for. anderanimeas at its ang therrawberacigh. go. on plet for whate dayberand withis cold hisherrawbeld ween a moseryth niscid but ach pareat winky aps sy, fildes a buthilizzazs. nich a rent, verrit.

n=3: Third-order approximation

plemonigh as the body taine line frominite prese real balate, some palate is this not a decenside on the fruit much be is the chian fruity of alon hough. smel a ligh mey must fruits.

n=4: Fourth-order approximation

this borderine woody wife. decant; does herbal nose key balanced. irony, especially finish an ecstaste profile. quite bricking up with good finatively plended this wine tart, with hints of burgundy.

n=5: Fifth-order approximation

can be better two days. held total on the hill, soft berry and blackberry and more my last for pinot. i would smell balance and some of years ago. big, with herbs, great remaining a smooth bold.

n=8: Eighth-order approximation

quite young sangiovese blend of the two palates and butter on the nose. also slightly balancing it off in the aromas. this bottle, it was beautiful, full, soft, old world grace throughout.

Performing an identical analysis with words as tokens instead of characters (a sketch of this variant follows the samples below) allows the generated text to resemble the sample dataset extremely closely:

Third-order word approximation

this wine was a light straw yellow and a touch of butter to keep it in balance. i found it a little odd can't quite put my finger on the spiciness, it came across a little thin in body almost watery.

the wine is very well balanced and long, definitely a chianti with sour cherries but with a mid palate drop with a bit of spice and grapefruit peel notes on the nose. definitely oak filled.

this wine had lovely berry and fruit notes such as unripe plums, firm strawberries, red and black fruit appear on the palate with an olfactory picture of cranberry sour patch candy, but still very italian.
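The word-level variant of the dictionary differs only in tokenization; a minimal sketch, splitting naively on whitespace (spaCy's tokenizer would do a better job), and again not the original code:

    from collections import Counter, defaultdict

    def build_word_ngram_dictionary(text, n):
        """Same construction as the character model, but with words as tokens."""
        words = text.split()                  # naive whitespace tokenization
        dictionary = defaultdict(Counter)
        for i in range(len(words) - n):
            dictionary[tuple(words[i:i + n])][words[i + n]] += 1
        return dictionary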

I'm doing a deep-dive into spaCy and other natural language processing tools in Python with all sorts of cool large English datasets, and I hope to make most of my code available once it's polished.  Let me know if anyone wants to hear more about this type of analysis! 

I <3 data

This post is part of a series on Google's Project Baseline and my perspective as a participant and an amateur bioinformatician.

Some people want to save the world.  They'll love Google's stated goals in Project Baseline, and likely be encouraged to take part: 

  • Uncover new information about health and disease
  • Analyze how genes, lifestyle and other factors influence health and changes in health
  • Measure the differences in health among a sample of the population in order to determine "normal" or expected measures of health, which can be used as reference points in the future
  • Identify biomarkers, or warning signs, that predict future onset of disease
  • Test and develop new tools and technologies to access, organize and analyze health information

Saving the world and learning how to predict the future onset of disease is great, and I wish them all the luck in the world.  But I was convinced by the last goal: I want to take a small step towards universal access to organized and analyzed health information. Because I love data.

Read More

Biostatistics from 30,000 feet: An embarrassment of riches

This post is part of a series on Google's Project Baseline and my perspective as an amateur bioinformatician.

The Human Genome Project will probably go down in history as the biggest government project ever to finish early and under budget.  It pulled the entire genomics industry up by its bootstraps and precipitated a drop in the cost of DNA sequencing far below the Moore's-law-type predictions that had been the conventional wisdom in the industry.  Today it takes well under $1000 and a day to sequence a human genome, a task that cost the Human Genome Project upwards of one billion dollars and 13 years only a few years ago.

This all sounds like quite a boon for computational biologists, right? Surely now we can sequence everyone's genome and tease out the genetic basis of disease for the betterment of all humankind! Not so fast -- the laws of combinatorics are working against researchers in the field, as you'll soon see.

Read More

Google wants your blood, sweat, and tears

This post is part of a series on Google's Project Baseline and my perspective as an amateur bioinformatician.

Today the Washington Post reported on a massive leak of personal information and passwords belonging to over 6 million customers.  I don't know what page of the physical paper this story was printed on, but it definitely wasn't front-page news.  Nowadays, these kinds of leaks have become commonplace.  Over the last several years there have been many high-profile leaks of private information from companies like Amazon, Uber and Venmo, potentially compromising the personal and financial information of tens of millions of people.  And yet we all still use these services.  We do an internal calculus involving the risk of a leak, the sensitivity of the data, and the benefit of using the service.  For most of us, we decide that we want that sweet, sweet same-day delivery, a car on call, and a painless way to pay back our friends for Thai food more than we want absolute security of our personal data.

But no company has more access to our private data than Google.  Chances are that your recovery email for Verizon, Amazon and Uber is a Gmail account, your browser is Chrome, and even if your phone doesn't run on Android, you have several Google services installed with a bevy of permissions.

Most people seem to trust Google with their data.  But now they want more data from as many volunteers as they can get. Much more data. And of a far more personal nature.  Google is collaborating with investigators from Stanford and Duke universities on an audacious plan to map human health.  Google wants your blood, your sweat, your tears, and several other bodily secretions that people don't talk about in polite company.  They want to sequence and enumerate your genome, your proteome, your metabolome, and your microbiome.  They want to scan you with every medical and wearable device imaginable. Oh, and they want to do this continuously for the next five years.  It's a big ask, but the payoff for our understanding of human health could be immense.

I said yes.  Please don't be evil, Google.

The best seat in the house for watching HIV entry

From my guest column at the Biophysical Society Blog.

One of my major reasons for attending BPS this year was to expand my knowledge in a field that isn’t very important at all for the work that I do in my day to day.  My work involves designing molecules that can alter protein function and hopefully “drug” an interaction or protein conformation that is useful therapeutically.  The readouts for whether we are successful are pragmatic ones — we look at cell viability, downstream effects, preservation or desolation of certain cellular pathways as needed.  What we generally don’t concern ourselves with is confirming with mechanistic insight how exactly the molecules we make do what they do.  So I decided to go learn more about biophysical techniques for looking at protein dynamics and allostery — the best place to do that was BPS.

Read More

No love for the medium-sized molecules?

From my guest column at the Biophysical Society Blog.

Since I’m an engineer (undergrad) and applied physicist (PhD) trying to make my way in the field of drug discovery and designer therapeutics, I sometimes feel a bit like a fish out of water when surrounded by peers with more formal training in organic chemistry, pharmacology and drug discovery.  This has never been more obvious to me than during this poster session, with visits from and discussions with scientists from leading organizations such as Pfizer, Roche, and Novartis.  I think I handled the much-appreciated but challenging interest from these visitors well, but I couldn’t help getting dragged into the middle of an argument (perhaps better described as a polarizing discussion) with my questioners: small molecules vs. biologics.

Read More

Ebola hitting us where it hurts

From my guest column at the Biophysical Society Blog.

The first full day of BPS 2015 began a little bit late for me, with my west coast body insisting that 8 am was 5 am and not at all an appropriate time to be getting out of bed. The “New and Notable” Symposium began at 10:45 am, which was quite a bit more palatable to my jet-lag-addled brain.  This symposium was very well attended, with most of the talks being standing room only.  This is unsurprising, as the speakers were selected by the program committee from over 100 preeminent researchers nominated by the society’s membership.  The talks ranged from a study attempting to mimic membrane channels with chopped-up single-walled carbon nanotubes to a structural study of the activation and sensitization of ionotropic receptors.

Getting to hear about many different topics of research is one of the advantages of a big meeting like BPS, and I was very pleased to listen to Gaya Amarasinghe’s talk on the mechanisms through which the Ebola virus evades the immune system.  Everyone knows about Ebola, and it’s also no secret that it’s remarkably well-equipped to combat our immune systems.  This talk went from a 30,000-foot view of the recent international outbreak of Ebola all the way down to the detailed molecular interactions of a single protein-protein interaction between host and pathogen.

Read More