Wednesday, June 8, 2011

Dodecad v2

This is an announcement of the new generation of Dodecad ancestry analysis. In comparison to the standard K=10 used since the beginning of the Project:
  1. Participants' data are now used to enrich the set of reference populations and to help define new ancestral components
  2. Rather than choosing arbitrary reference populations, I employ a very large set of individuals to capture allele frequencies and then create synthetic individuals ("panmictic zombies") that embody these frequencies; more on this below.
  3. Results for unrelated project participants will be reported in a separate post, using my new technique of converting unsupervised ADMIXTURE runs into supervised ones. Hence, Project participants can expect to receive new K=12 results; moreover, the fact that this will be done in supervised mode means that it is no longer necessary to process samples in small batches of 10 or so. All current unrelated participants will receive their results in one go, and only future submissions will be processed in batches.
This analysis utilizes results from Project participants (populations with _D endings), as well as synthetic individuals summarizing allele frequencies of East Eurasians, Sub-Saharan Africans, and South Indians (populations with _Z endings).
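To make the idea of synthetic individuals concrete, here is a minimal sketch of how genotypes could be drawn from a set of pooled allele frequencies. It is an illustration only, not the Project's actual code: the function name `make_zombies`, the example frequencies, and the assumptions of random mating and independent SNPs are all mine.

```python
import random

def make_zombies(allele_freqs, n_individuals, seed=0):
    """Draw synthetic diploid genotypes from per-SNP allele frequencies.

    Each genotype is the count (0, 1, or 2) of one allele, drawn as two
    independent Bernoulli trials per SNP. This assumes random mating
    (hence "panmictic") and treats SNPs as independent, which ignores
    linkage disequilibrium.
    """
    rng = random.Random(seed)
    zombies = []
    for _ in range(n_individuals):
        genotype = [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        zombies.append(genotype)
    return zombies

# Hypothetical frequencies for five SNPs in one pooled reference population.
freqs = [0.0, 0.1, 0.5, 0.9, 1.0]
zombies = make_zombies(freqs, n_individuals=200)

# Allele frequencies re-estimated from the synthetic individuals.
estimated = [
    sum(g[j] for g in zombies) / (2 * len(zombies))
    for j in range(len(freqs))
]
print(estimated)
```

With enough synthetic individuals, the re-estimated frequencies approximate the input frequencies, which is the sense in which the zombies "embody" them.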

The framing populations (_Z)

The following _Z populations were included:
  • Sub_Saharan_Z: Bantu, Yoruba, Mandenka, San, and Pygmies from HGDP-CEPH
  • South_Indian_Z: North Kannadi, Sakilli from Behar et al. (2010), AP_Madiga, AP_Mala, TN_Dalit from Xing et al. (2010), Bhil, Chenchu, Kurumba, Satnami, Madiga, Mala, Kamsali, Onge, Great_Andamanese from Reich et al. (2009)
  • Sino_Tibetan_Z: Yizu, Naxi, Han, Tujia from HGDP-CEPH
  • Altaic_Z: Tu, Xibo, Mongola, Daur, Hezhen, Oroqen, Yakut from HGDP-CEPH, and Evenk, Buryat from Rasmussen et al. (2010)
  • Siberian_Other_Z: Selkup, Ket, Yukagir, Nganasan, Koryak, Chukchi from Rasmussen et al. (2010)
  • Southeast_Asian_Z: Dai, Lahu, Miaozu, Cambodians from HGDP-CEPH, Khmer-Cambodian, Thai from Xing et al. (2010), and Singapore Malay from the Singapore Genome Variation Project

The 12 inferred ancestral components

Results of the ADMIXTURE analysis defining the new K=12 components of the Project can be seen below:
Raw proportions can be found in a spreadsheet. There are also population portraits in a zip file, showing individual-level variation.

The 12 components are:
  • West_Asian
  • East_European
  • West_European
  • East_Asian
  • Mediterranean
  • Northwest_African
  • North_Eurasian
  • Arabian
  • Inner_Asian
  • Sub_Saharan
  • East_African
  • South_Indian
Once again, I have tried to make these as neutral and appropriate as possible, but don't forget that they are simply descriptive labels to aid memory. For example, the Arabian component is centered on Saudis, Yemenis, and Yemen Jews, the Inner Asian component on the Altaic synthetic population, and so on.

The Fst divergences between the 12 components can be seen in the spreadsheet and also below:

A different way of showing them is via a neighbor-joining tree. Note, however, that the tree is not a replacement for the Fst table above, which alone fully preserves the inter-population relationships:
We can also plot the first few MDS dimensions using synthetic individuals from the 12 components; again, these capture variation only partially:
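
For readers unfamiliar with Fst, the following sketch computes a textbook Wright-style Fst between pairs of components from their allele frequencies. It is not necessarily the estimator ADMIXTURE uses to produce the table above; the function names and the three-SNP frequencies are made up for illustration.

```python
def fst(freqs_a, freqs_b):
    """Wright-style Fst between two populations at biallelic SNPs.

    Computed as a ratio of sums over SNPs: (H_total - H_within) / H_total,
    where H is expected heterozygosity and both populations get equal weight.
    """
    numerator = 0.0
    denominator = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2
        h_total = 2 * p_bar * (1 - p_bar)
        h_within = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0

def fst_matrix(components):
    """Pairwise Fst for a dict mapping component name -> allele frequencies."""
    names = list(components)
    return {
        (a, b): fst(components[a], components[b])
        for a in names
        for b in names
    }

# Hypothetical allele frequencies at three SNPs for three components.
components = {
    "West_European": [0.20, 0.55, 0.80],
    "East_European": [0.25, 0.50, 0.75],
    "East_Asian": [0.70, 0.10, 0.30],
}
table = fst_matrix(components)
for (a, b), value in table.items():
    if a < b:
        print(f"{a} vs {b}: {value:.3f}")
```

A table like this is symmetric with zeros on the diagonal. A neighbor-joining tree or an MDS plot built from it is a lower-dimensional summary, so some of the pairwise information is lost.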

What comes next?

Hopefully quite soon, I will:
  1. Report new v2 results for all project participants
  2. Report new v2 proportions for many other populations not included here
Project members who still haven't received their results (during the ongoing submission opportunity) can expect to receive K=10 standard results, and they will receive their new v2 results later.


  1. Thanx for all your hours and hours of methodical hard work: )

  2. Instead of South_Indian_Z, which will basically be an ASI>ANI mixed reference like the old South Asian component, why not incorporate into Dodecad v2 the ANI and ASI Zombie individual(s) that you projected a while ago during your Zombie exercises? That would be far more useful for all West Eurasian populations, not only for the South Asian participants themselves. It'd be interesting to separate out the ANI-specific West Eurasian admixture among South Asians and their neighbors from the exogenous elements among South Asians, such as NEU and West Asian, that fall outside the basic ANI-ASI combination.

  3. The ANI/ASI individuals are based on the smaller set of markers used by Reich et al. so they are not useful for a global test.

    The South_Indian_Z actually suffers a bit from the same problem, as only the North Kannadi really have a full complement of markers in it, and all the other populations provide frequency data for smaller subsets. So, I am inclined not to use a Zombie population for South Asia in the production version of v2, which I am currently developing.