"Statistical methods for genomic associations"
by Roderick D. Ball, Scion (NZ Forest Research Institute Ltd.)
Quantitative trait locus (QTL) and association mapping attempt to find
locations in the genome associated with variation in traits of
interest. QTL mapping exploits the correlations (so-called `linkage
disequilibrium') between marker loci and loci putatively associated
with a trait that are generated within a pedigree or family. For
example, the probability that two loci on a chromosome are inherited
from the same grandparent is 1-r, where the recombination rate, r,
ranges from 0 to 0.5; r is an increasing function of the distance
between loci within a chromosome and equals 0.5 for loci on different
chromosomes. QTL mapping aims to infer the genetic architecture of
traits by estimating the number, locations and effects of QTL whose
genotypes and locations are a priori unknown.
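
The relation between r and distance can be illustrated with a short
sketch. The abstract does not say which map function is intended; the
code below assumes Haldane's map function (no crossover interference),
and the function names are hypothetical, chosen for illustration only.

```python
import math

def haldane_recombination_rate(distance_morgans: float) -> float:
    """Recombination rate r under Haldane's map function (an assumption here)."""
    if distance_morgans < 0:
        raise ValueError("distance must be non-negative")
    # r rises from 0 at zero distance towards the unlinked limit of 0.5.
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))

def prob_same_grandparent(distance_morgans: float) -> float:
    """Probability that two loci are inherited from the same grandparent: 1 - r."""
    return 1.0 - haldane_recombination_rate(distance_morgans)

# Illustrative distances in centimorgans (1 Morgan = 100 cM).
for d_cm in (0, 1, 10, 50, 200):
    d = d_cm / 100.0
    print(f"{d_cm:>4} cM  r = {haldane_recombination_rate(d):.4f}  "
          f"P(same grandparent) = {prob_same_grandparent(d):.4f}")
```

Loci on different chromosomes correspond to the limiting case r = 0.5,
where marker and trait locus are inherited independently.
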
Association mapping (also known as linkage disequilibrium mapping)
exploits linkage disequilibrium between loci in a
population. Because it exploits recombinations that have occurred over
the whole history of the population, it has potentially much greater
resolution than QTL mapping. However, population linkage
disequilibrium is not a monotonic function of distance between loci,
and exploiting the high resolution requires genotyping many markers on
large samples of individuals. Many published associations are spurious. Spurious
associations could arise from undiagnosed population
structure. Another explanation is that the evidence was never
strong. When re-evaluated with Bayesian methods, the evidence for
published associations was found to be weak (e.g. Bayes factor less
than 1) or, in some cases, moderately strong but still insufficient to
overcome the low prior probability per marker for genomic
associations. This includes a number of associations from recent
large-scale genome-wide association
studies (Diabetes genome initiative of Harvard, MIT, Lund Universities
and Novartis; and the Wellcome Trust of Oxford University).
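
The interplay between a Bayes factor and a low prior probability per
marker follows from the identity posterior odds = Bayes factor x prior
odds. The sketch below shows the arithmetic; the prior probability and
Bayes factors used are hypothetical illustrative values, not figures
from the studies mentioned.

```python
def posterior_probability(bayes_factor: float, prior_probability: float) -> float:
    """Posterior probability of association from a Bayes factor and a prior."""
    if not 0.0 < prior_probability < 1.0:
        raise ValueError("prior probability must lie strictly between 0 and 1")
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical prior: roughly 1 marker in 100,000 truly associated.
prior = 1e-5
for bf in (0.5, 10.0, 1_000.0, 1_000_000.0):
    print(f"Bayes factor {bf:>11,.1f} -> posterior probability "
          f"{posterior_probability(bf, prior):.6f}")
```

With a prior this small, even a Bayes factor in the thousands leaves
the posterior probability of a true association low.
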
We will introduce the biological background for gene mapping and
discuss Bayesian experimental design and statistical methods ranging
from closed-form single-locus calculation of Bayes factors for
case-control studies and test statistics (Association Mapping in Plants,
Chapters 7, 8; Springer 2007) to approximate posterior probabilities
for models in multilocus methods for Bayesian inference of the genetic
architecture (Ball, Genetics 2001). For association mapping with
500,000 or more SNP marker loci, brute-force evaluation of all
possible models is not feasible, so we need to resort to a
search strategy such as Markov chain Monte Carlo (MCMC) simulation,
with the goal of finding a subset of models accounting for a high
percentage of the posterior probability. The Bayesian model selection
framework (in which each model allows only a specified set of selected
markers to have non-zero effects) is useful, or even necessary, to make
the algebra feasible (e.g. evaluating the full X'X matrix or its
inverse is not possible).
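
A minimal sketch of such a search is given below: a Metropolis-Hastings
sampler whose states are models, i.e. sets of markers with non-zero
effects. It is a toy under stated assumptions, not the method of the
cited work: the score function and its per-marker log odds are
hypothetical, and the score is taken to be additive over markers so
that exact inclusion probabilities are available for comparison.

```python
import math
import random
from collections import Counter

# Hypothetical per-marker log posterior odds of inclusion (illustrative values).
LOG_ODDS = [2.0, -2.0, 0.0, -3.0, 1.0]

def toy_log_score(model):
    """Unnormalised log posterior of a model (a set of included marker indices).

    Toy assumption: markers contribute additively, as for an orthogonal design.
    """
    return sum(LOG_ODDS[j] for j in model)

def mh_model_search(log_score, n_markers, n_iter, seed=0):
    """Metropolis-Hastings over marker subsets; returns visit counts per model."""
    rng = random.Random(seed)
    model = frozenset()                  # start from the null model
    current = log_score(model)
    visits = Counter()
    for _ in range(n_iter):
        j = rng.randrange(n_markers)     # symmetric proposal: flip one marker
        proposal = model ^ {j}
        proposed = log_score(proposal)
        log_ratio = proposed - current
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            model, current = proposal, proposed
        visits[model] += 1
    return visits

def inclusion_frequency(visits, j):
    """Fraction of iterations in which marker j was in the current model."""
    total = sum(visits.values())
    return sum(count for m, count in visits.items() if j in m) / total

visits = mh_model_search(toy_log_score, len(LOG_ODDS), n_iter=50_000)
for j, log_odds in enumerate(LOG_ODDS):
    exact = 1.0 / (1.0 + math.exp(-log_odds))
    print(f"marker {j}: sampled {inclusion_frequency(visits, j):.3f}, "
          f"exact {exact:.3f}")

# Share of sampled iterations covered by the three most visited models.
top = visits.most_common(3)
share = sum(count for _, count in top) / sum(visits.values())
print(f"top 3 of {len(visits)} visited models cover {share:.1%} of samples")
```

Only the score of the current and proposed models is ever evaluated, so
no quantity involving all markers at once needs to be computed.
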
Many MCMC methods and variants have been used in the genetics
literature but there is a large gap between theory and
practice. Inference from MCMC assumes the sampler has converged. MCMC
convergence is guaranteed by theory under general ergodicity
conditions. However, the conditions are rarely verified; moreover,
theoretical bounds on the number of iterations needed for convergence
are orders of magnitude greater than the number of iterations used,
and thought to be needed, in practice for
(apparent) convergence. The methods are often presented with a single
example, and there has been little or no attention to the convergence
of the MCMC algorithms by the authors or subsequent researchers, and a
lack of papers comparing or reconciling different methods. As a
result, one cannot rely on the convergence of samplers, or on the
correctness of current gene-mapping MCMC methods, models and computer
implementations.
Diagnostics based on the sampled chains exist and can often diagnose
problems with a sampler, and in common-or-garden statistical models the
sampling algorithm can generally be adjusted or tuned to converge
rapidly. However, there is no guarantee: diagnostics show only
apparent convergence, which can persist for thousands or millions of
iterations in worst-case scenarios. With the large number of possible
parameters corresponding to a dense marker map covering the genome, it
is desirable that samplers converge automatically, and that we have
some confidence in their convergence.
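
One widely used chain-based diagnostic is the Gelman-Rubin potential
scale reduction factor, which compares between-chain and within-chain
variance for several independently started chains. It is shown here
only as an example of the kind of diagnostic meant; the abstract does
not name a specific one, and the function name is hypothetical.

```python
from statistics import mean, variance

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar quantity.

    `chains` is a list of equal-length sequences of sampled values,
    one per independently started chain.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return (pooled / within) ** 0.5

# Two chains exploring the same region, and two stuck in different regions.
agreeing = [[0, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1]]
stuck = [[0, 1, 0, 1, 1, 0], [10, 11, 10, 11, 11, 10]]
print("agreeing chains R-hat:", round(gelman_rubin(agreeing), 3))
print("stuck chains    R-hat:", round(gelman_rubin(stuck), 3))
```

A value close to 1 indicates only that the chains agree with each
other. If every chain is stuck in the same region, the statistic still
looks fine, which is the sense in which such diagnostics show apparent
rather than actual convergence.
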
Current research on improving and verifying convergence of MCMC
samplers for genomic associations will be outlined. This includes
using analytically calculated probabilities to adaptively adjust
sampling probabilities within a Bayesian model selection
framework (to which the parameter space of a range of samplers can be
mapped) so that sample frequencies for models converge to the
approximate values. A perfect sampler would be desirable but may not
be possible or practical. A regeneration sampler and/or bounds on
convergence (found e.g. using Nummelin splitting and the analytically
calculated probabilities) would be useful alternatives, since the chain
after a regeneration is independent of the starting point, and the
independence of tours between successive regenerations assures the
properties of ergodic averages.
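
The role of tours can be illustrated on a toy chain. The sketch below
assumes a three-state chain with an atom (state 0), so that every
return to the atom is a regeneration; Nummelin splitting extends the
same idea to chains without such an atom and is not implemented here.
The transition matrix and function names are hypothetical.

```python
import random

STATES = (0, 1, 2)
# Hypothetical transition matrix; its stationary distribution is (0.25, 0.5, 0.25).
P = {0: (0.5, 0.5, 0.0),
     1: (0.25, 0.5, 0.25),
     2: (0.0, 0.5, 0.5)}

def simulate_tours(n_tours, seed=0):
    """Simulate the chain from the atom and cut it into tours.

    Each tour starts at state 0 and ends just before the next visit to 0,
    so successive tours are independent and identically distributed.
    """
    rng = random.Random(seed)
    tours = []
    state = 0
    for _ in range(n_tours):
        tour = []
        while True:
            tour.append(state)
            state = rng.choices(STATES, weights=P[state])[0]
            if state == 0:           # regeneration: a new tour begins
                break
        tours.append(tour)
    return tours

def regenerative_mean(tours, f):
    """Ratio estimator of the stationary mean of f from i.i.d. tours."""
    total_f = sum(sum(f(x) for x in tour) for tour in tours)
    total_len = sum(len(tour) for tour in tours)
    return total_f / total_len

tours = simulate_tours(20_000)
estimate = regenerative_mean(tours, lambda x: x)
mean_tour_length = sum(len(t) for t in tours) / len(tours)
print(f"estimated stationary mean of x: {estimate:.3f} (exact value 1.0)")
print(f"mean tour length: {mean_tour_length:.3f} (exact value 1/0.25 = 4)")
```

Because the tours are independent, standard results for sums of i.i.d.
quantities apply to the numerator and denominator of the estimator, and
no burn-in period has to be chosen.
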