Tuesday, June 21, 2011

The design of Dodecad v3

Dodecad v2 was short-lived, as I discovered a way to improve it shortly after I announced it.

The first step was to carry out an extensive K=3 ADMIXTURE analysis of about 130 different populations and about 2,000 individuals from Europe, Asia, and Africa. Using the allele frequency results of this analysis I was able to create the most comprehensive synthetic individuals to represent West Eurasians, Asians, and Sub-Saharan Africans.

Subsequently, I carried out an analysis of East Eurasian populations using the West Eurasian/Sub-Saharan synthetic individuals as controls, as well as an analysis of Sub-Saharan populations using the West Eurasian/Asian individuals as controls.

In East Eurasia, I was able to infer the existence of two components, one centered in the extreme northeast, another in the southeast, with many other populations arrayed between these two extremes:

In Sub-Saharan Africa, the primary division was between San, Mbuti, and Biaka Pygmies (whom I have called "Palaeo-Africans") and the rest (Yoruba, Mandenka, and Bantu, "Neo-Africans"):

Now, I had four synthetic "framing populations": Neo-Africans, Palaeo-Africans, Northeast Asians and Southeast Asians, created from hundreds of individuals from several different populations. This meant that:
  1. I did not have to choose a particular population (e.g., Chinese) to represent East Asia
  2. I did not have to aggregate individuals from populations with variable levels of non-East Asian admixture

I now used my South Asian populations, together with Neo-African, West Eurasian, Northeast Asian, and Southeast Asian controls, to extract a South Asian specific component:

Armed with these 5 synthetic "framing" populations, I carried out a K=12 analysis with my West Eurasian, South Asian, and North/East African populations (1,247 individuals; 69 populations):

And, finally, I generated 50 synthetic individuals from each of the 12 inferred components to create a dataset of 600 individuals that will be the basis of Dodecad v3.
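One way to picture how synthetic individuals can be generated from an inferred component is sketched below. It assumes each component is described by a list of per-SNP allele frequencies and that each genotype is drawn independently as two allele copies; this is an illustration of the idea, not necessarily the exact procedure used here. The function name, component names, and frequencies are all hypothetical.

```python
import random

def synthesize_individuals(freqs, n_individuals, seed=0):
    """Draw diploid genotypes (0, 1 or 2 copies of an allele) for each SNP.

    freqs: hypothetical per-SNP allele frequencies of one component.
    Assumption: SNPs are independent and each allele copy is a Bernoulli draw.
    """
    rng = random.Random(seed)
    individuals = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP give a count in {0, 1, 2}.
        genotype = [int(rng.random() < p) + int(rng.random() < p) for p in freqs]
        individuals.append(genotype)
    return individuals

# Made-up frequencies for two illustrative components (not real data).
component_freqs = {
    "Component_A": [0.10, 0.50, 0.90, 0.00],
    "Component_B": [0.80, 0.20, 0.40, 1.00],
}

# 50 synthetic individuals per component, as in the text.
dataset = {
    name: synthesize_individuals(freqs, 50, seed=i)
    for i, (name, freqs) in enumerate(component_freqs.items())
}
total = sum(len(people) for people in dataset.values())
```

With 12 components instead of the two shown, the same loop would yield the 600 individuals mentioned above.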

Below is the table of Fst divergences:
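For readers unfamiliar with Fst, here is a minimal sketch of one textbook way to compute it between two components from their allele frequencies, as the proportion of total heterozygosity that lies between populations. It is not necessarily the estimator behind the table, and the function name is hypothetical.

```python
def fst(p1, p2):
    """Fst between two populations from per-SNP allele frequencies.

    Uses (H_T - H_S) / H_T, summed over SNPs before dividing
    (a "ratio of averages"). Assumes equal weight for both populations.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        mean = (a + b) / 2
        h_total = 2 * mean * (1 - mean)                     # pooled heterozygosity
        h_within = (2 * a * (1 - a) + 2 * b * (1 - b)) / 2  # mean within-population
        numerator += h_total - h_within
        denominator += h_total
    # If no SNP varies at all, there is no divergence to measure.
    return numerator / denominator if denominator > 0 else 0.0
```

Identical frequency vectors give 0, and two populations fixed for opposite alleles at every SNP give 1; real components fall in between.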

The following MDS plots show the first 10 dimensions of variation of these individuals:
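As background, classical (metric) MDS turns a matrix of pairwise distances into coordinates whose distances approximate the originals. The sketch below is a small pure-Python version using double-centering and power iteration; it assumes the input distances are Euclidean-like, and it is not the software used to make the plots. The function name and parameters are hypothetical.

```python
import math

def classical_mds(dist, k=2, iters=500):
    """Return k-dimensional coordinates for a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        # Deterministic start vector, normalised to unit length.
        v = [math.sin(i + 1 + dim) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        # Power iteration finds the leading eigenvector of B.
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if eigval <= 1e-12:
            break  # no further positive dimensions of variation
        for i in range(n):
            coords[i][dim] = v[i] * math.sqrt(eigval)
        # Deflate B so the next pass finds the next dimension.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

# Toy example: a 3-4-5 right triangle, recovered in two dimensions.
triangle = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
points = classical_mds(triangle, k=2)
```

Each successive dimension captures less of the variation than the one before it, which is why the earliest dimensions of an MDS plot are usually the most informative.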

Finally, here is a neighbor-joining tree of the 12 components:
(to be continued)


  1. What does O_Italian mean? They have relatively (for Italians and Europeans in general) significant Mongoloid admixture according to the above analysis.

  2. O_Italian is Other Italian, and that is all due to a single individual that I am waiting to hear from to see whether he/she has any explanation for these results. I will also carry out another data cleanup once I'm done with this, to detect submitted relatives or outliers that likely misreported their ancestry. This is part of the reason why I am not reporting raw averages at this time, as I have not cleaned up all the latest submissions.

    Part of the (to be continued) involves visually inspecting the population portraits to catch outliers such as the one contributing the "Northeast Asian" in the O_Italian sample.

  3. Hi, I wonder if it is possible to see the variation of each admix result in every national group.

  4. After looking at the Fst divergence table I suspect that the East European is more "native" than Western European, i.e. the West European component has more ancestry that only recently diverged. I also think a significant part of the West European came from around the Caucasus (just north) relatively recently.

    The reason I think this is because W.A. is close to W.A. and because W.A. is closer to W.E. than E.E., which suggests that there was a migration between the two, since they're further apart geographically. Also, Mediterranean is closer to E.E. than to W.E., which further suggests that E.E. is more "native" to Europe, like Mediterranean, especially since Mediterranean is somewhat of an isolated population (diverged earlier than other components in the area). And while I think this is a weak indicator, I think the fact that Northeast Eurasian is closer to W.E. than E.E. AND W.A., even though the geographic distance is bigger, fits with the idea that a significant part of W.E. came from north of the Caucasus.

  5. I'd like to see myself on more plots with coordinates listed.

  6. Hi, I was wondering could you explain what Palaeo and Neo African means? I'm a little confused

  7. Regarding the higher S.Asian score in the Iranian population: while some of this is certainly recent, most of it likely reflects admixture acquired pre-LGM, and likely before there was much of the defining variation in Caucasoids. Because West Asians (and especially Iranians) defined the split between S.Asians (and later Europeans), one would expect some residual component. However, global PCA analysis, inclusive of a good number of populations, does not quite suggest admixture approaching what the Dodecad V3 results suggest. For S.Asian admixture, I tend to believe the level suggested by the distribution of haplogroup L1, which is also more in line with many, if not most, other calculators: around 4% in the north, although southern Iran varies between 6-12%.

  8. Do people still reply here? I have a question about Jewish ancestry.

  9. Is there any documentation for beginners on GEDmatch results?

    1. If possible, I third that request. I am totally lost. I ran the "R" program and here are my results. Now what?

      12 ancestral populations
      166462 total SNPs
      21941 flipped SNPs
      51978 heterozygous SNPs
      0 no-calls
      1654 absent SNPs
      0.990064 genotype rate
      mode genomewide

      1654 SNPs missing (no-call or absent)
      3823 total iterations
      9.995E-08 final dQ


      8.12% East_European
      36.21% West_European
      29.50% Mediterranean
      0.02% Neo_African
      17.21% West_Asian
      0.76% South_Asian
      0.11% Northeast_Asian
      0.15% Southeast_Asian
      0.00% East_African
      6.31% Southwest_Asian
      1.56% Northwest_African
      0.06% Palaeo_African

      CPU time = 326.35 sec