Biostatistics from 30,000 feet: An embarrassment of riches

This post is part of a series on Google's Project Baseline and my perspective as a participant and an amateur bioinformatician.

Sequencing cost per human genome according to the NIH.

The Human Genome Project will probably go down in history as the biggest government project to ever finish so early and under budget. It pulled the entire genomics industry up by their bootstraps and precipitated a drop in cost for DNA sequencing far below the Moore's law-type predictions that had been the conventional wisdom in the industry. Today it costs well under $1000 and a day to sequence a human genome, a task that cost the Human Genome Project upwards of one billion dollars and 13 years only a few years ago.

This all sounds like quite a boon for the computational biologists, right? Surely now we can sequence everyone's genome and through the application of clever statistics, tease out the genetic basis of disease for the betterment of all human-kind! Not so fast -- the laws of statistics and combinatorics are working against researchers in the field, as you'll soon see.

The traditional view of heritability and epidemiology considers each gene to make its own characteristic contribution to a phenotype (disease, height, birth weight, etc) against an average background of other unrelated genes. Oftentimes, a phenotype has multiple genes or loci which affect the outcome, in this case, traditional epidemiology usually considers their effects to be additive. One of the first large genome-wide gene association studies found over one thousand gene mutations that each contributed less than a millimeter difference in a person's height. One problem with proving that each of these one-in-a-million gene mutations has a statistically significant effect on a phenotype is the requirement of multiple hypothesis correction to the standard $\chi^{2}$ test. When comparing the difference in frequency of a given genotype in a phenotype of interest compared to the control group, researchers set a threshold for a likelihood of a given difference being observed randomly, and if the likelihood is less than this threshold (usually $p=0.05$, or $5\%$), then the effect is "real". Unfortunately, when you consider a $5\%$ likelihood "real", and you're making thousands or even millions of comparisons, the likelihood of observing false-positives quickly approaches certainty.

If your measurement confidence is $p_{meas}$ and you make $k$ measurements, your family error rate ($p_{family}$, likelihood of making one or more false discoveries) is given by:

\[p_{family}= 1-(1-p_{meas})^{k} \\ p_{family}(0.05,100)=1-(1-0.05)^{100}=0.994\]So when a computational biologist makes a million measurements and comparisons, they must be extremely certain in each given result (very low $p_{meas}$) for them to be able to make any statistically significant claims. The most common correction to multiple-hypothesis testing is the Bonferroni correction, which sets a conservative limit on $p_{meas}=\frac{0.05}{k}\approx 10^{-8}$ for genome-wide studies. This limitation is called low statistical power, and many researchers have delved deep into statistics to defend their use of less conservative and hence more powerful corrections to make interpreting their large datasets more tractable (or sometimes not correcting their hypothesis at all - a practice called "p-hacking"). Using commonly accepted multiple-hypothesis testing requires genome-wide association studies to be conducted over large populations (expensive) or only consider relatively few genes (less useful) and generally require a genome-wide confidence in each measurement $p_{meas}<10^{-6}$.

The output of a genome-wide study is often a Manhattan plot, named after the characteristic "skyline" look.  Each genotype's p-value is plotted by its position in the genome. Four gene loci reach statistical significance in this cartoon example. — The output of a genome-wide study is often a Manhattan plot, named after the characteristic "skyline" look. Each genotype's p-value is plotted by its position in the genome. Four gene loci reach statistical significance in this cartoon example.

A study into the genetic basis of schizophrenia published recently in Nature had a population of $n>150 000$ people, giving them enough statistical power to consider over a million genetic variations and their contribution to the phenotype of interest. They discovered 128 significant contributors to the risk of schizophrenia, but all together these effects only added up to explain about 7% of the variance in the population they tested. Our studies really can't get much bigger than that. What went wrong? Where is the missing heredity?

It is clear from studies like these that only a small amount of the variance in a disease phenotype is explained by single genotypes that are common enough to reach the strict level of confidence required for genome-wide studies. Some researchers have insisted that the missing heredity can be explained by rare single genotypes that cannot reach the threshold of significance and require even larger sample populations. However, over the last ten years biostatisticians have made it clear that the independent and additive effect of gene mutations is the exception rather than the rule, and most phenotypes (including disease states) exhibit profoundly nonlinear dependence on multiple gene loci, an effect called epistasis.

The following is a cartoon example of two genes with normal genotypes a and b. Their mutant genotypes A and B produce an expected phenotype value (birth weight, likelihood of disease progression, etc) $E(A)=\alpha$ and $E(B)=\beta$ respectively. For simplicity's sake, let's set $\alpha=2$ and $\beta=3$.

\[\begin{array}{c|c|c|c}
& \text{Additive } & \text{Negative Epistasis } & \text{Positive Epistasis } \\
& E(A|B)=(\alpha+\beta) & E(A|B)\approx\log{(\alpha+\beta)} & E(A|B)\approx e^{\alpha+\beta} \\\hline
\text{ab} & 0 & 0 & 0 \\
\text{Ab} & 2 & 2 & 2 \\
\text{aB} & 3 & 3 & 3 \\
\text{AB} & 5 & 3.13 & 6 \\
\end{array}\]

This interaction between genes can lead to a phenotypic outcome either more or less than the sum of its parts, and can involve far more than the pair of genes shown here. Most disease phenotypes are thought to contain many genotypes and have a strongly nonlinear dependence on how many and which are present.

Biologically this is easy to retrospectively explain by pointing to the complex activation and inhibition networks that characterize the function and expression of proteins. Naturally a biological outcome depends non-linearly on several gene loci. Even mutations within a given gene exhibit a strongly non-linear effect on biological function, as a protein fold is often robust to several point mutations until a key site is affected.

So the way to mine a genome-wide dataset is not to find single common or rare genotypes, but to find rare combinations of common genotypes that jointly contribute to a phenotypic outcome. When considered on their own, these common genotypes would not show any statistically significant effect, but a rarely-found large number of common genotypes linked through a biological network can pass the "elbow" point on the curve above and lead tolarge phenotypic variance. This type of effect would not be discovered in a genome-wide association study looking for significant single genotypes.

Though the phenomena of epistasis is widely understood to be responsible for much of the variance observed in phenotype and disease outcomes, there have been relatively few success stories finding novel gene-gene interactions and their influence on heredity using modern high-throughput sequencing methods. What's wrong now? Well, statistics rears its ugly head, this time with combinatorics along for the ride. For a genome-wide study considering one million markers, a total of $\binom{10^{6}}{2}\approx 5\times 10^{11}$ two-way combinations are possible. To consider phenotypes that could contain up to five linked genotypes, the number of combinations rises to $\binom{10^6}{5}\approx 8\times 10^{27}$. There are not enough computers in the world to perform a separate statistical test for each of these candidate combinations, and even if there were, an enormous penalty on statistical power would be incurred due to multiple-hypothesis corrections.

Where does that leave us?

Engineers and chemists have given computational biologists an embarrassment of riches. Genome-wide datasets are commonly available and cheap to acquire, but the statistical tools are our disposal are less effective with such large datasets, rather than more effective. Our limited statistical power only allows us to put our toes into the vast ocean of biological heredity by looking naively at the contribution of single genotypes. And now with Project Baseline, Google seeks not only to mine important genotypes, but also proteomic, metabolomic and microbiotic factors and their combinations over a large population. The combinatorial space of such a dataset will be vast, and a naive data-driven approach will be doomed to failure. Their secret sauce? Well, almost all genome-wide computational studies avoid bias like the plague, and only look where the data leads them. Google must make heavy use of known biological networks, putative biomarkers and (probably) old wives tales as priors (or guides) while exploring the vastness of their dataset.