ADMIXTURE infers K ancestral populations, and estimates the admixture proportions of individuals from these K populations, as well as the allele frequencies for all SNPs for each ancestral population.
An interesting use of the allele frequencies is to generate synthetic "zombies" from the ancestral populations. These are artificial individuals whose genotypes are drawn randomly based on the allele frequencies. For example, there is a "West Asian" component in the Dodecad Project, but no individuals who have 100% membership in the "West Asian" component. A "West Asian" zombie is a synthetic individual who appears to be drawn from that "West Asian" component only, without any other (e.g., "South European", or "Southwest Asian") admixture at all.
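The drawing step can be sketched in a few lines of Python. This is only a minimal illustration, not the code actually used for the Dodecad Project: it assumes each SNP is drawn independently of the others (no linkage) and that the two allele copies at a SNP are independent draws at the component's frequency. The function names and the frequencies below are made up for the example.

```python
import random

def make_zombies(allele_freqs, n_zombies, seed=None):
    """Draw diploid genotypes (0, 1 or 2 copies of the counted allele).

    Assumption: SNPs are independent and each of the two allele copies
    is an independent draw with probability p (the component frequency).
    """
    rng = random.Random(seed)
    zombies = []
    for _ in range(n_zombies):
        # Two Bernoulli(p) draws per SNP; True + True == 2 in Python.
        genotype = [(rng.random() < p) + (rng.random() < p)
                    for p in allele_freqs]
        zombies.append(genotype)
    return zombies

def sample_freqs(zombies):
    """Allele frequencies recomputed from the zombie genotypes."""
    n_alleles = 2 * len(zombies)
    n_snps = len(zombies[0])
    return [sum(z[j] for z in zombies) / n_alleles for j in range(n_snps)]

# Hypothetical frequencies for one component at five SNPs (made-up values).
component_freqs = [0.0, 0.30, 0.50, 0.80, 1.0]
zombies = make_zombies(component_freqs, n_zombies=25, seed=1)
recovered = sample_freqs(zombies)
```

Comparing `recovered` with `component_freqs` gives a feel for how closely a zombie population of a given size reproduces the frequencies it was drawn from; a larger `n_zombies` tracks them more tightly.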
"Zombies" may be viewed as either useful theoretical abstractions, or as reconstructed hypothetical ancient-like individuals, purged of centuries or millennia of admixture. Irrespective of how one views them, they are very useful as a tool.
Zombies of K=10 components
I generate 25 zombies for each of the 10 ancestral components of the Dodecad Project. Below, you can see an MDS plot of these 250 individuals, which is quite similar to the MDS plot generated using only the Fst divergences between the ancestral components.
Including real and "zombie" populations
I include the "West African", "North European", and "South European" zombie populations, together with 25 African Americans (ASW) from HapMap-3:
Notice the direction of the African American cline: it is slightly tilted towards North Europeans. This makes sense, as the European ancestry of African Americans is derived mainly from Northwestern Europe, rather than exclusively from either the Mediterranean or Northern Europe, where the "South European" and "North European" components respectively peak.
Convert unsupervised ADMIXTURE runs to supervised ADMIXTURE
The most exciting use of "zombies" is to convert unsupervised ADMIXTURE runs into supervised ones. In unsupervised mode, ADMIXTURE treats all individuals alike and tries to infer their ancestral proportions. In supervised mode, some individuals are treated as "fixed" (belonging 100% to one of the K ancestral components), and the ancestry of the rest is inferred.
The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; then, whenever you want to estimate admixture proportions in new samples, run a supervised ADMIXTURE analysis using the zombie populations as the fixed references.
You can thus use the zombie populations to mimic a regular (unsupervised) ADMIXTURE run. This is useful for two reasons:
- It can be much faster: the initial set (of the unsupervised run) can be huge, but the zombie populations need only be large enough to capture the allele frequencies of the inferred components.
- It avoids the generation of spurious clusters, especially if you include individuals from highly inbred populations, or a large number of test individuals.
The speedup has two sources. First, I'm running ADMIXTURE on 250 "zombie" + 9 real individuals, as opposed to 692 + 9 real individuals with the unsupervised method. Second, admixture proportions are only estimated for the 9 real individuals and are fixed for the 250 "zombie" ones. This idea seems to work like a charm.
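As a rough sketch of the supervised step: ADMIXTURE's supervised mode reads a `.pop` file with one line per individual, giving a population label for the fixed reference individuals and `-` for those whose ancestry is to be estimated. The snippet below builds such a label list for 250 zombies and 9 test individuals. The component names, the file prefix, and the assumption that the zombies come first in the merged genotype file are all placeholders for illustration, not details taken from the actual runs.

```python
def pop_file_lines(zombie_labels, n_test):
    """One label per individual, in the same order as the genotype file.

    Assumption: the merged file lists all zombies first, then the test
    individuals. Test individuals get "-" so their ancestry is inferred.
    """
    return list(zombie_labels) + ["-"] * n_test

# Placeholder names for the 10 components (not the real Dodecad labels).
components = [f"component_{i}" for i in range(1, 11)]

# 25 zombies per component, grouped by component.
zombie_labels = [name for name in components for _ in range(25)]

lines = pop_file_lines(zombie_labels, n_test=9)
pop_text = "\n".join(lines) + "\n"

# Saving pop_text as "zombies_plus_test.pop" (hypothetical prefix) next to
# the matching .bed/.bim/.fam files, the run would then be started with:
#   admixture --supervised zombies_plus_test.bed 10
```

The `.pop` file must share its prefix with the genotype file and follow the same individual order, so it is worth checking the line count against the `.fam` file before starting the run.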
More average K=10 results
I was also able to calculate admixture proportions for the 10 Dodecad components in Druze, Kalash, and Palestinians. These populations tend to form their own population-specific clusters, so they are very difficult to compare against other populations: you can't easily get their breakdown into ancestral components, because they become their own ancestral components at fairly low K.
Using the trick of "zombie" populations, we can determine their ancestral components and compare them with other Dodecad populations.
I have labored long to be able to compare these to the ones in the standard Dodecad set, and I am very pleased that I was finally able to achieve it:
- Both Druze and Palestinians have a substantial "Southwest Asian" component, as do most Semitic (Arab, Jewish, Ethio-Semitic) populations in my database.
- Druze have more "West Asian" than "Southwest Asian", and the reverse is true for Palestinians.
- Palestinians have more African admixture than Druze.
By far the most exciting part of this analysis is the result for the Kalash, a population that speaks a language of the Dardic group of Indo-Iranian. Some linguists place Dardic languages in the Indo-Aryan subgroup (of which Sanskrit and Hindi are the most famous representatives), whereas others view Dardic as a third branch of Indo-Iranian together with Iranian (like Kurdish, Persian, or Pashto) and Indo-Aryan. In any case, the study of these mountaineers is crucial to the study of Indo-Iranians in general.
The Kalash have been much mythologized as either long-lost Aryans or the descendants of Alexander the Great's soldiers.
The absence of the South European component among them agrees with Y-chromosome research about the absence of a Mediterranean or Greek influence in that population. The Kalash are almost entirely split between the West Asian component (56%) and the South Asian one (43.5%). Indeed, their West Asian admixture is very high compared to my South Asian populations, exceeding even that of the Pathans (~40%) and reaching levels found only in West Asia proper. It is also perfectly consistent with my theory of Indo-Aryan origins in West Asia.
The way forward
I initially considered the idea of zombies as a way to include more Project participants in my detailed ADMIXTURE runs, such as the recent K=12 and K=11 ones. There are two problems with these runs:
- Each one takes 24+ hours to complete, so it is not yet practical to replace the standard K=10 analysis with them.
- Including all project participants, especially those of mixed background, makes them completely impractical, in addition to making them very capricious: at high K different components begin to appear depending on sample composition, and the solution is not as robust as in the standard K=10 analysis.