Monday, June 6, 2011

Panmictic zombies

I had previously developed a new way of choosing "framing" populations for ADMIXTURE analyses. Such populations are necessary in order to tease out genetic contributions from outside one's region of interest.

I proposed to create a meta-population that included a single individual from each of a large number of populations (e.g., East Eurasians). Use of such a meta-population has two interesting properties:
  1. It solves the problem of "which" population to choose (e.g., Han, Miaozu, She, Mongol?) as a framing reference: the meta-population captures features of all candidate populations
  2. It avoids the generation of population-specific clusters in the "framing" individuals, as no two individuals from a single population are included!
There is, however, a problem with the technique as I first described it: it uses only a single individual from each population to compose the meta-population! Hence, it is potentially sensitive to the presence of outliers, and, in any case, it throws away most of the data.

More recently, I proposed the use of "zombies" generated from the allele frequency data output by ADMIXTURE. These zombies are, in a sense, the opposite of what I am trying to do here, since they represent ancestral components that exist in mixed form in present-day individuals.

Instead, we can generate "panmictic zombies" by composing a dataset of all individuals from a region of interest; we then calculate allele frequencies over the combined set, and then generate synthetic individuals based on these allele frequencies.
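
The procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for the experiments on this blog. It assumes that genotypes are coded as 0/1/2 copies of a counted allele per SNP, that missing calls are coded as None, and that each synthetic genotype is obtained by two independent allele draws per SNP (i.e., SNPs are treated as independent of each other). The function names are hypothetical.

```python
import random

def pooled_allele_frequencies(genotypes):
    """Frequency of the counted allele at each SNP over all individuals.

    genotypes: list of individuals, each a list of 0/1/2 allele counts,
    with None marking a missing call (an assumed coding for this sketch).
    """
    n_snps = len(genotypes[0])
    freqs = []
    for j in range(n_snps):
        calls = [ind[j] for ind in genotypes if ind[j] is not None]
        # Each diploid individual contributes two alleles.
        # If no individual has a call, fall back to 0.0 (arbitrary choice).
        freqs.append(sum(calls) / (2 * len(calls)) if calls else 0.0)
    return freqs

def make_zombies(freqs, n_zombies, rng=None):
    """Draw synthetic diploid individuals from per-SNP allele frequencies."""
    rng = rng or random.Random()
    return [
        # Two independent allele draws per SNP give a genotype of 0, 1 or 2.
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_zombies)
    ]

# Toy data (made up): four individuals from the region, three SNPs.
region = [
    [0, 2, 1],
    [1, 2, 0],
    [0, 2, None],
    [1, 2, 1],
]
freqs = pooled_allele_frequencies(region)
zombies = make_zombies(freqs, n_zombies=5, rng=random.Random(0))
```

Note that no real individual is copied into the output: every synthetic individual is drawn afresh from the pooled frequencies.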

This technique has several advantages:
  1. It is extremely resilient to outliers, as the presence of a few outliers only shifts allele frequencies by a little, and no actual outliers are included in the "panmictic zombie" population
  2. It makes use of the full set of individuals and hence does not depend on the random sample one chooses from each population
  3. It avoids the creation of population-specific clusters
  4. It substantially speeds up the technique I introduced for converting unsupervised ADMIXTURE runs to supervised ones: populations framing the region of interest (e.g., East Eurasians, Sub-Saharan Africans, South Asians, in the case of West Eurasia) can be "folded" into a number of panmictic zombie populations a priori.
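
Point #1 can be checked with simple arithmetic: if m outliers with allele frequency q are pooled with n typical individuals with allele frequency p, the pooled frequency is (n*p + m*q)/(n + m), so it moves by m*(q - p)/(n + m), which can never exceed m/(n + m) in absolute value. A small Python sketch with made-up numbers (the function name is hypothetical):

```python
def pooled_frequency(n, p, m, q):
    """Allele frequency after pooling n individuals at frequency p
    with m outliers at frequency q (all diploid, so the factor 2 cancels)."""
    return (n * p + m * q) / (n + m)

# Made-up numbers: 200 typical individuals and 2 extreme outliers.
n, p = 200, 0.30
m, q = 2, 1.00

shift = pooled_frequency(n, p, m, q) - p   # how far the outliers move the frequency
bound = m / (n + m)                        # largest shift m outliers could ever cause
```

With these numbers the shift stays below one percentage point, even though the two outliers carry only the counted allele.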
Point #4 is extremely important for practitioners:
  1. It is great not to include every single East Eurasian sample in ADMIXTURE analyses when you are trying to infer patterns of variation in Europe; this is a much better solution than the ad hoc approach adopted by some of ignoring East Eurasia altogether when studying patterns of variation in Europe!
  2. It is great not to worry, after several hours of ADMIXTURE analysis, whether raising K by 1 will finally produce added resolution in your region of interest, or merely split, e.g., Mbuti from Biaka Pygmies, which is hardly of relevance if one is trying to study East Asian or European variation
Panmictic zombies can be further fine-tuned, as the allele frequencies can be calculated in several different ways:
  1. Over all individuals
  2. As the unweighted average of the per-population frequencies (to account for different sample sizes)
  3. As a weighted average of the per-population frequencies (to account for different demographic sizes of source populations)
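
The three options above differ only in how the per-population data are combined. A minimal Python sketch for a single SNP, assuming 0/1/2 allele counts per individual; the population labels, data and weights are made up for illustration, and the function names are hypothetical:

```python
def freq(pop):
    """Allele frequency in one population (list of 0/1/2 counts at one SNP)."""
    return sum(pop) / (2 * len(pop))

def pooled(pops):
    """Option 1: over all individuals, so large samples dominate."""
    total = sum(sum(pop) for pop in pops.values())
    n = sum(len(pop) for pop in pops.values())
    return total / (2 * n)

def mean_of_means(pops):
    """Option 2: every population counts equally, whatever its sample size."""
    return sum(freq(pop) for pop in pops.values()) / len(pops)

def weighted(pops, weights):
    """Option 3: populations weighted by, e.g., their demographic size."""
    total_w = sum(weights[name] for name in pops)
    return sum(weights[name] * freq(pops[name]) for name in pops) / total_w

# Made-up data: population A is heavily sampled, population B is not.
pops = {
    "A": [2, 2, 2, 2, 2, 2],   # 6 individuals, frequency 1.0
    "B": [0, 0],               # 2 individuals, frequency 0.0
}
weights = {"A": 1.0, "B": 3.0}  # hypothetical demographic weights

f_pooled = pooled(pops)             # 0.75
f_means = mean_of_means(pops)       # 0.5
f_weighted = weighted(pops, weights)  # 0.25
```

The same toy data thus give three quite different frequencies, which is why the choice matters when sample sizes are very uneven.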
A first experiment

The following MDS plot shows a population ("Synthetic", red) generated from a sample of different HGDP East Eurasian populations.
It's important to note that while "Synthetic" appears to be closer to the Tu population, that does not mean that it is interchangeable with the Tu!

The "Synthetic" population is much more diverse, as it encompasses parts (alleles) from all the different populations of the set; because of the averaging process, it merely happens to coincide with the Tu in the first two dimensions of the MDS plot.

2 comments:

  1. Good. This technique is closer to my ideal than your previous meta-population analysis technique.

  2. Good to know that I can live up to your ideal.
