Saturday, June 4, 2011

Projecting Pakistan populations on West Eurasian PCA

In a first post I showed that ADMIXTURE output allele frequencies could be used to create synthetic individuals corresponding to the ancestral components ("zombies"), and that these artificial populations could be used both to improve performance and to avoid the creation of population-specific clusters in an ADMIXTURE run. I was hence able to infer the composition of several idiosyncratic populations in terms of the K=10 components of the Dodecad Project.
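To make the idea of a "zombie" concrete, here is a minimal sketch in Python. It assumes that each synthetic genotype is drawn as two independent alleles per SNP from a component's allele frequencies (Hardy-Weinberg proportions, no linkage between SNPs); the function name `make_zombies` and the example frequencies are hypothetical, and this is not necessarily the exact procedure used in the earlier post.

```python
import numpy as np

def make_zombies(allele_freqs, n_individuals, seed=0):
    """Draw synthetic genotypes (0, 1 or 2 copies of an allele) per SNP.

    Assumption: the two alleles at every SNP are sampled independently
    with probability equal to the component's allele frequency.
    """
    rng = np.random.default_rng(seed)
    freqs = np.asarray(allele_freqs, dtype=float)
    # One row per synthetic individual, one column per SNP.
    return rng.binomial(2, freqs, size=(n_individuals, freqs.shape[0]))

# Hypothetical allele frequencies for one ancestral component (5 SNPs).
freqs = np.array([0.1, 0.5, 0.9, 0.0, 1.0])
zombies = make_zombies(freqs, n_individuals=100)
```

Because every "zombie" is drawn from the same frequency vector, the resulting population is a pure sample of that one component, and any number of individuals can be generated for each component.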

In a second post, I showed that "zombies" could be created even in the absence of allele frequencies, if one had admixture proportions only for the ancestral components. I was thus able to reconstruct synthetic individuals corresponding to the ANI/ASI of Reich et al. (2009). I was further able to confirm the West Asian origin of Ancestral North Indians. In a subsequent post, I used these synthetic ANI/ASI populations on groups from Pakistan, showing the mainly West Asian/ANI origin of the Caucasoid component in South Asia. Moreover, I confirmed that the Ancestral South Indians are related (but distantly) to the Onge from the Indian Ocean.

In this post, I run principal components analysis on the Pakistan populations; the Hazara were excluded because of their high East Eurasian admixture. Here is the unsupervised PCA:


First, you notice that the first dimension is dominated by the Kalash, a very distinctive population because of its long-term isolation. The second dimension is dominated by a Sindhi outlier, which, if you consult a Sindhi population portrait from a previous experiment, is revealed to have substantial Sub-Saharan admixture.

Obviously, this is no good, as our first two dimensions are not anthropologically interesting. If we are interested in learning about the origins of populations, knowing that there are a few Sindhi individuals with Sub-Saharan admixture, or that the Kalash are highly isolated, is not helpful.

We can run PCA again, but this time we project populations of interest onto the PCA plot of the West Eurasian control populations:

It is fairly obvious that the populations of Pakistan fall on the South Asia-West Asia line. There are small deviations from the cline:
  • Balochis and Brahuis deviate towards the SW Asian component, which is consistent with their ADMIXTURE results.
  • The position of the non-Indo-European Burusho and Indo-Aryan Sindhi populations on either side of the cline is consistent with a little SW Asian component in the Sindhi and a little North European component in the Burusho, which pull them away from the cline in the expected directions.
Moreover, the relative position of the Pakistan populations along this cline is preserved.
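The projection step can be sketched as follows. This is a minimal illustration in Python with NumPy, not the software actually used for the plot: the principal axes are computed from the reference (control) populations only, and the populations of interest are then placed on those axes without influencing them. The function names, the random genotype matrices, and the absence of any per-SNP scaling are all assumptions made for the example.

```python
import numpy as np

def fit_pca(reference, n_components=2):
    """Compute principal axes from the reference genotypes only."""
    mean = reference.mean(axis=0)
    centered = reference - mean
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(samples, mean, axes):
    """Place samples on axes that were fitted without them."""
    return (samples - mean) @ axes.T

# Hypothetical data: 60 reference individuals and 10 projected
# individuals, each typed at 200 SNPs (genotypes coded 0/1/2).
rng = np.random.default_rng(0)
reference = rng.binomial(2, 0.3, size=(60, 200)).astype(float)
targets = rng.binomial(2, 0.4, size=(10, 200)).astype(float)

mean, axes = fit_pca(reference)
ref_coords = project(reference, mean, axes)
target_coords = project(targets, mean, axes)
```

Since the projected individuals never enter `fit_pca`, an isolated group or a single outlier among them cannot capture one of the leading dimensions, which is what happened in the unsupervised run.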

Using the West Eurasian "zombies" is thus useful not only for ADMIXTURE but also for principal components analysis; in the latter it is helpful because:
  1. It avoids domination by very isolated/inbred populations and/or outliers
  2. It is possible to create synthetic "zombie" populations with absolutely equal sample sizes, hence removing a source of bias (some residual bias may persist, e.g., if one used a component centered on 5 "real" individuals to create a "zombie" population of 100, then the effective sample is not really 100)

