Monday, May 30, 2011

How to create Zombies from ADMIXTURE etc.

ADMIXTURE infers K ancestral populations, and estimates the admixture proportions of individuals from these K populations, as well as the allele frequencies for all SNPs for each ancestral population.

An interesting use of the allele frequencies is to generate synthetic "zombies" from the ancestral populations. These are artificial individuals whose genotypes are drawn randomly based on the allele frequencies. For example, there is a "West Asian" component in the Dodecad Project, but no individuals who have 100% membership in the "West Asian" component. A "West Asian" zombie is a synthetic individual who appears to be drawn from that "West Asian" component only, without any other (e.g., "South European", or "Southwest Asian") admixture at all.
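The drawing step can be sketched as follows. This is a minimal illustration rather than the exact procedure used for the Project: it assumes the allele frequencies are available as an array with one row per SNP and one column per component, that each SNP is drawn independently (ignoring linkage between SNPs), and that the two alleles of a genotype are independent draws. The function name `make_zombies` and the toy frequencies are hypothetical.

```python
import numpy as np

def make_zombies(freqs, component, n_zombies, seed=0):
    """Draw synthetic diploid genotypes for one ancestral component.

    freqs: array of shape (n_snps, K) holding, for each SNP and component,
           the frequency of the counted allele (assumed layout).
    Returns an integer array of shape (n_zombies, n_snps) with values 0, 1 or 2.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(freqs, dtype=float)[:, component]
    # A genotype is the number of copies of the counted allele in two
    # independent draws, i.e. a Binomial(2, p) variable for each SNP.
    return rng.binomial(2, p, size=(n_zombies, p.shape[0]))

# Toy frequencies for 3 SNPs and K=2 components (made-up numbers).
freqs = np.array([[0.0, 1.0],
                  [0.5, 0.2],
                  [1.0, 0.0]])
zombies = make_zombies(freqs, component=0, n_zombies=25)
```

Each row of `zombies` is one synthetic individual drawn from component 0 only.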

"Zombies" may be viewed as either useful theoretical abstractions, or as reconstructed hypothetical ancient-like individuals, purged of centuries or millennia of admixture. Irrespective of how one views them, they are very useful as a tool.

Zombies of K=10 components

I generate 25 zombies for each of the 10 ancestral components of the Dodecad Project. Below, you can see an MDS plot of these 250 individuals, which is quite similar to the MDS plot generated using only the Fst divergences between the ancestral components.

Including real and "zombie" populations

I include the "West African", "North European", and "South European" zombie populations, together with 25 African Americans (ASW) from HapMap-3:
Notice the direction of the African American cline: slightly tilted towards North Europeans. This makes sense, as the European ancestry of African Americans is derived mainly from Northwestern Europe, rather than exclusively from either the Mediterranean or Northern Europe, where the "South European" and "North European" components peak.

Convert unsupervised ADMIXTURE runs to supervised ADMIXTURE

The most exciting use of "zombies" is to convert unsupervised ADMIXTURE runs into supervised ones. In unsupervised mode, ADMIXTURE treats all individuals alike, and tries to infer their ancestral proportions. In supervised mode, some individuals are treated as "fixed" (belonging 100% to one of the K ancestral components), and the ancestry of the rest is inferred.

The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; whenever you want to estimate admixture proportions in new samples run supervised ADMIXTURE analysis using the zombie populations.
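The bookkeeping for the supervised step can be sketched like this. It assumes the convention, as I understand it from the ADMIXTURE manual, that a supervised run reads a `.pop` file with one line per individual in the same order as the genotype file, giving a population label for each fixed individual and `-` for each individual whose ancestry is to be estimated. The component labels, the file prefix, and the helper name `build_pop_lines` are hypothetical.

```python
def build_pop_lines(zombie_labels, n_test):
    """Return one .pop line per individual: zombies first, then test samples.

    zombie_labels: component label of each zombie, in genotype-file order.
    n_test: number of real individuals appended after the zombies.
    """
    # Fixed (zombie) individuals carry their component label; real
    # individuals get "-" so that their proportions are estimated.
    return list(zombie_labels) + ["-"] * n_test

components = ["West_Asian", "South_Asian"]  # hypothetical labels
labels = [c for c in components for _ in range(25)]  # 25 zombies each
pop_lines = build_pop_lines(labels, n_test=9)
pop_text = "\n".join(pop_lines) + "\n"

# pop_text would be saved next to the genotype files, e.g. as
# zombies_plus_tests.pop (hypothetical prefix), and the run started with:
#   admixture --supervised zombies_plus_tests.bed 2
```

The order of lines must match the order of individuals in the merged genotype file, so the zombies and the test samples have to be merged before the labels are written.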

You can thus use the zombie populations to mimic a regular (unsupervised) ADMIXTURE run. This is useful for two reasons:
  1. It can be much faster: the initial set (of the unsupervised run) can be huge, but the zombie populations need only be large enough to capture the allele frequencies of the inferred components.
  2. It avoids the generation of spurious clusters, especially if you include individuals from highly-inbred populations, or a large number of test individuals.

I re-estimated admixture proportions for the 9 individuals of the last run, using the "zombie" populations in a supervised ADMIXTURE run. This took less than 1/10 of the time, and achieved results that were highly concordant with the ones previously reported: correlation was +0.999729; the average difference in ancestral proportions was 0.3%, the maximum difference 2.1%.
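A concordance check of this kind can be sketched as below. The exact definitions behind the figures above are not spelled out, so this assumes a Pearson correlation over all individual-by-component proportions and absolute differences for the average and maximum. The function name `concordance` and the example proportions are made up for illustration.

```python
import numpy as np

def concordance(q_unsupervised, q_supervised):
    """Compare two tables of admixture proportions (individuals x components)."""
    a = np.asarray(q_unsupervised, dtype=float).ravel()
    b = np.asarray(q_supervised, dtype=float).ravel()
    diff = np.abs(a - b)
    return {
        "correlation": float(np.corrcoef(a, b)[0, 1]),  # Pearson r
        "mean_abs_diff": float(diff.mean()),
        "max_abs_diff": float(diff.max()),
    }

# Made-up proportions for 2 individuals and 2 components.
q_unsup = np.array([[0.60, 0.40],
                    [0.10, 0.90]])
q_sup = np.array([[0.58, 0.42],
                  [0.10, 0.90]])
stats = concordance(q_unsup, q_sup)
```

Both tables must list the same individuals and the same components in the same order for the comparison to be meaningful.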

The speedup has two causes. First, I'm running ADMIXTURE on 250 "zombie"+9 real individuals, as opposed to 692+9 real individuals using the unsupervised method. Second, admixture proportions are only estimated for the 9 real individuals and are fixed for the 250 "zombie" ones. This idea seems to work like a charm.

More average K=10 results

I was also able to calculate admixture proportions for the 10 Dodecad components in Druze, Kalash, and Palestinians. These populations tend to form their own population-specific clusters, so they are very difficult to compare against other populations: you can't easily get their breakdown into ancestral components, because they become their own ancestral components at fairly low K.

Using the trick of "zombie" populations, we can determine their ancestral components and compare them with other Dodecad populations.

I have labored long to be able to compare these to the ones in the standard Dodecad set, and I am very pleased that I was finally able to achieve it:
  • Both Druze and Palestinians have a substantial "Southwest Asian" component, as do most Semitic (Arab, Jewish, Ethio-Semitic) populations in my database
  • Druze have more "West Asian" than "Southwest Asian", and the reverse is true for Palestinians
  • Palestinians have more African admixture than Druze

By far the most exciting part of this analysis is the result for the Kalash, a population that speaks a language of the Dardic group of Indo-Iranian. Some linguists place Dardic languages in the Indo-Aryan subgroup (of which Sanskrit and Hindi are the most famous representatives), whereas others view Dardic as a third branch of Indo-Iranian, alongside Iranian (like Kurdish, Persian, or Pashto) and Indo-Aryan. In any case, the study of these mountaineers is crucial to the study of Indo-Iranians in general.

The Kalash have been much mythologized as either long-lost Aryans or the descendants of Alexander the Great's soldiers.

The absence of the South European component among them agrees with Y-chromosome research about the absence of a Mediterranean or Greek influence in that population. The Kalash are almost entirely split between the West Asian component (56%) and the South Asian one (43.5%). Indeed, their West Asian admixture is very high compared to my South Asian populations, exceeding even that of the Pathans (~40%) and reaching levels found only in West Asia proper. It is also perfectly consistent with my theory of Indo-Aryan origins in West Asia.

The way forward

I initially considered the idea of zombies as a way to include more Project participants in my detailed ADMIXTURE runs, such as the recent K=12 and K=11 ones. There are two problems with these runs:
  • Each one takes 24+ hours to complete, so it is not yet practical to replace the standard K=10 analysis with them.
  • Including all project participants, especially those of mixed background, makes them completely impractical, in addition to making them very capricious: at high K different components begin to appear depending on sample composition, and the solution is not as robust as in the standard K=10 analysis.

With the use of zombie populations, these problems can be largely solved. I can spend many hours or even days on a very detailed ADMIXTURE run with a large sample, create "zombie" populations from the inferred results, and then run project participants fairly fast using these "zombie" populations and supervised ADMIXTURE mode. In fact, I am working on exactly this type of test at the moment, so project members of all backgrounds should expect good things to come in the next days or weeks.


  1. Great work, Dienekes. This is a most interesting exercise.

    Strangely enough, Zack, who was among the first to successfully break down the ancestral components of the Kalash over at HAP without the Kalash forming their own distinct cluster as usual, had significantly different results for them:
    *South Asian - 60%
    *European - 22%
    *South West Asian - 11%
    *East Asian - 3%
    *American - 2%
    *Siberian - 1%
    *Papuan - 1%

    Do note that the South Asian component here is not quite like the South Asian component we see in regular ADMIXTURE runs, which is generally a blend between Reich et al.'s Ancestral North Indian and Ancestral South Indian. It is usually slightly more ASI than ANI for most individuals, which might explain its drag towards Oceanian populations at low levels of K. In this case, especially for the Kalash, it seems that the South Asian component is entirely specific to ANI (attested by the absence of the "Onge" component among the Kalash).

    Somehow, there seems to be little correlation between Zack's K=11 breakdown and these Zombie K=10 results. Zack's breakdown revealed a 6% East Eurasian + 1% Papuan for the Kalash, so would this correlate to a sliver of the 43.5% South Asian in your own breakdown?

    I always thought that the less SEA a group was, the less spurious the East Asian components were, since SEA is the East Eurasian component most affiliated with ASI. The Kalash had substantial Northern European admixture in the former, whereas it's only 0.5% in the Zombie K=10 exercise. In fact, all Indo-European groups, including Dravidian-speaking upper castes carried European and SW Asian at a ratio 3:2 in Zack's K=11. I wonder whether the aforementioned groups will also completely lose this mix, which is rather suggestive of what the ancient Indo-Iranians may have been, if tested in the same manner as this K=10 Zombie exercise.

    It would be great if you carry out the same on Indian_D. That would throw further light on the matter.

  2. In fact, all Indo-European groups, including Dravidian-speaking upper castes carried European and SW Asian at a ratio 3:2 in Zack's K=11.

    I don't follow that project very closely, but my impression is that "SW Asian" is modal in Arabs and Jews in HAP and hence completely inappropriate to detect "West Asian" admixture in South Asia.

    Indeed, that's the whole point of using Zombies: if you ran supervised ADMIXTURE using, e.g., Armenians as a putative West Asian ancestral population, you would include not only the "West Asian" component, but also the "South European" and "Southwest Asian" components in a mix that is particular to Armenians.

    The Indian_D sample has 25.2% West Asian and 6% North European admixture.

    With respect to East Eurasian admixture, I see 0% of haplogroups Q, N, O, C in Kalash, so I am not surprised that they have 0% such admixture in my analysis; it's possible that the "East Eurasian" in the HAP may be an alias for Paleoindian (ASI) ancestry.

  3. Thanks, Dienekes. SW Asian there is apparently modal in Yemeni Jews.

    "The Indian_D sample has 25.2% West Asian and 6% North European admixture."

    Was this calculated using the Zombie method or is this just an average of the regular K=10 for the Indian participants?

  4. The regular method; see post for concordance between the two methods.

  5. Keep up the good work. Concerning the Kalash, because you have here compared them to Palestinians and Druze, who seem to have a strong "Southwest Asian" (perhaps Arabian, or in any case pastoralist?) component, I was trying to remember whether the other component, the one which seems strong amongst the Kalash, is not also very common in the northern Fertile Crescent, amongst Assyrians, Armenians, Caucasians, etc. Is the "West Asian" component one we can maybe think of as one connected to the original agricultural concentrations of the fertile crescent and particularly perhaps its northern curve?

  6. Is the "West Asian" component one we can maybe think of as one connected to the original agricultural concentrations of the fertile crescent and particularly perhaps its northern curve?

    It is a signal from that area, but to link it to the fertile crescent would imply that the other components are primarily descended from elsewhere. That is not true: none of the West Eurasian components are differentiated from each other to the extent that we could posit Paleolithic-level separations between them. In short, the vast majority of Caucasoids everywhere are descended from the fertile crescent, with limited absorption of Mesolithic populations.

  7. I think this is a genuine milestone breakthrough that we should perhaps name "the Dienekes paradigm".

    Two experiments would be very interesting:
    1/ to find out which of the South European, North European, West Asian and Southwest Asian components corresponds to the North African one amongst North Africans (of course by playing with the 4 aforementioned zombies + North African samples)
    2/ to find out which of the South European or West Asian components fits best with the North European one (of course again with the 2 aforementioned zombies + North European samples)

  8. This is a neat project and fascinating research, but I do have two concerns with Zombies: 1) films make them seem too slow to effectively eat people. 2) More seriously, what check is there to make sure that the K ancestral populations, from which the Zombies are created, are realistic? It just seems to me that although these Zombies fit with expectations, this concordance could simply be because the Zombies are derived from the dataset (that created the expectations) in the first place.

  9. "In short, the vast majority of Caucasoids everywhere are descended from the fertile crescent, with limited absorption of Mesolithic populations."

    Fair comment. I should not have implied that we can time the "wave" so early as the first farmers. Still, I am happy to see that we are getting a better idea what your clusters might be caused by. Maybe then the Western Asian cluster could be thought of as Mesopotamian or Assyrian?