Thursday, October 20, 2011

Comparing different ADMIXTURE runs using Zombies

My idea of using zombies with ADMIXTURE is the gift that keeps on giving. Remember that "zombies" are synthetic individuals created from ADMIXTURE output, representing the K inferred ancestral components. They can be viewed as hypothetical ancestral individuals representing each of these K components without any admixture from any of the others.
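
As a rough illustration of the idea, here is a minimal sketch of how zombies could be generated. It assumes each component is described by a list of per-SNP allele frequencies (as in one column of an ADMIXTURE .P file) and that each genotype is drawn as two independent allele draws. The function name `make_zombies`, the seed, and the example frequencies are hypothetical and not taken from any actual calculator.

```python
import random

def make_zombies(freqs, n_zombies=50, seed=0):
    """Sample diploid genotypes (0, 1 or 2 copies of the counted allele)
    for one ancestral component, given its per-SNP allele frequencies."""
    rng = random.Random(seed)
    zombies = []
    for _ in range(n_zombies):
        # Two independent allele draws per SNP, i.e. Binomial(2, p).
        genotype = [(rng.random() < p) + (rng.random() < p) for p in freqs]
        zombies.append(genotype)
    return zombies

# Hypothetical frequencies for one component at five SNPs (assumption:
# in practice these would come from one column of a .P file).
component_freqs = [0.0, 0.25, 0.5, 0.9, 1.0]
zombies = make_zombies(component_freqs, n_zombies=50)
print(zombies[0])
```

Because every zombie is sampled from a single component's frequencies, it carries no admixture from the other components by construction.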

An interesting problem that often comes up is how to compare results across different ADMIXTURE runs. I can think of at least three different applications of this:
  1. To compare components across different K; for example, how does a "West Asian"-centered component at K=5 differ from a similarly-centered component at K=12?
  2. To compare components across different datasets; for example, how does a "West Asian"-centered component inferred from an existing dataset (e.g., the current Dodecad v3) differ from a "West Asian"-centered one from a new dataset (e.g., the upcoming Dodecad v4, which will also be trained on the valuable new populations of Yunusbayev et al. 2011)?
  3. To compare components across different projects; there has been a proliferation of different ancestry projects since the launch of Dodecad nearly a year ago, and since all of them use slightly different individuals/SNPs/terminology, it is quite useful to be able to gauge how one component from one project maps onto other components in other projects.
As proof of concept, I took the MDLP calculator from the Magnus Ducatus Lituaniae Project and generated 50 zombies for each of its 7 ancestral components:
  1. Scandinavian
  2. Volga_Region
  3. Altaic
  4. Celto_Germanic
  5. Caucassian_Anatolian_Balkanic
  6. Balto_Slavic
  7. North_Atlantic
I then inferred the ancestry of the MDLP zombies using Dodecad v3, and vice versa. Since Dodecad v3 also includes populations (e.g., Africans) not considered by MDLP, I did not try to map those onto MDLP.
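
One way to summarize such a mapping is to average the inferred proportions over all zombies that came from the same source component. The sketch below assumes the per-zombie results are already available as rows of proportions; the function name `mean_proportions` and all numbers are made up for illustration and are not results from this analysis.

```python
from collections import defaultdict

def mean_proportions(labels, q_rows):
    """Average ancestry proportions over all zombies sharing a source label."""
    sums = {}
    counts = defaultdict(int)
    for label, row in zip(labels, q_rows):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for k, value in enumerate(row):
            sums[label][k] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }

# Made-up example: four zombies from two source components, scored
# against three hypothetical target components.
labels = ["Altaic", "Altaic", "Volga_Region", "Volga_Region"]
q_rows = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.75, 0.25, 0.00],
    [0.25, 0.75, 0.00],
]
mapping = mean_proportions(labels, q_rows)
for label, row in mapping.items():
    print(label, row)
```

Each row of the resulting table describes one source component in terms of the target calculator's components.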


I will comment on the MDLP-to-dv3 mapping:
  1. The MDLP "Scandinavian" component appears to be West/East European with a little Mediterranean and a little Northeast Asian
  2. The MDLP "Volga_Region" component appears to be East European with some Northeast Asian
  3. The MDLP "Altaic" component is West Asian+Northeast Asian+Southeast Asian. Note that in Dodecad v3, the Northeast Asian component peaks at Chukchi, Nganasan, and Koryak, and most other east Eurasian populations have much less of it
  4. The MDLP "Celto_Germanic" component is (surprisingly) Mediterranean-dominated. One possible interpretation is that, in the context of MDLP, it captures one aspect of the difference between Southwestern and Northeastern Europe (higher Mediterranean in the former), whereas the...
  5. ... MDLP "North_Atlantic" component seems to be entirely West European, and captures a different aspect of east-west variation in Europe.
  6. The MDLP "Balto_Slavic" component appears to be the reverse of "Celto_Germanic", with lower Mediterranean and reversed East/West European proportions
  7. Finally, the MDLP "Caucassian_Anatolian_Balkanic" component is predictably mainly West Asian, but with a little Mediterranean and Southwest Asian as well
A different way of comparing the different components is to include them all in a joint MDS plot, or calculate various types of distances between them (e.g., Fst).

For example, the first couple of dimensions are dominated by the African/Asian components of Dodecad v3 that are not present in MDLP. Notice, however, the position of "Altaic", right where one might expect to find it between West and East Eurasians.

Limiting ourselves to only European populations, we obtain:

It appears that the "North_Atlantic" component may be centered on a small number of related individuals.
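
For the distance-based comparison mentioned earlier, a simple Fst-like distance can be computed directly from the allele frequencies of two components. The sketch below uses a ratio-of-sums form with no sample-size correction, which is only one of several possible estimators and is not necessarily the one behind any plot in this post. The function names and the frequencies are hypothetical.

```python
def fst(p1, p2):
    """Ratio-of-sums Fst between two vectors of allele frequencies
    (no sample-size correction; frequencies are treated as exact)."""
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        num += (a - b) ** 2
        den += a * (1 - b) + b * (1 - a)
    return num / den if den > 0 else 0.0

def fst_matrix(components):
    """Pairwise Fst for a dict mapping component name -> frequencies."""
    names = list(components)
    return {
        (x, y): fst(components[x], components[y])
        for x in names
        for y in names
    }

# Made-up frequencies at three SNPs for three hypothetical components.
components = {
    "A": [0.1, 0.5, 0.9],
    "B": [0.2, 0.5, 0.8],
    "C": [0.9, 0.5, 0.1],
}
distances = fst_matrix(components)
for pair, value in sorted(distances.items()):
    print(pair, round(value, 3))
```

A matrix like this could also serve as input to an MDS routine, as an alternative to running MDS on the zombie genotypes themselves.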

I encourage other genome bloggers to try their own hand at comparing their components with those of other projects, or even their own. This process will be made possible if people using ADMIXTURE follow the simple instructions to convert their output for use with DIYDodecad.

Once Dodecad v4 is off the ground, and if I find time to fully automate the process, I will perhaps try to map all my past calculators (i.e., the initial K=10, Dodecad v3, 'bat', 'euro7', 'weac', 'africa9') onto the new golden standard of the Project.

PS: This analysis was done on ~63k SNPs in common between MDLP and Dodecad v3

10 comments:

  1. You should post an addendum with a comparison between Dodecad V3 and the latest Eurogenes K=10.

  2. If I find some time to automate the process, I will map all the previous Dodecad calculators and a few ones from other projects onto the Dodecad v4 platform. It is a bit tedious to do the analysis I did here step-by-step.

  3. The zombie concept doesn't work for intra-European, or even intra-West Eurasian clusters.

    It's a mess all round for such closely related

  4. Whoops, truncated...

    As I was saying, the zombie supervised thing doesn't work in many cases. It was an interesting idea, but it only works for highly differentiated clusters.

    If you keep doing it, and you don't figure out a way to avoid major errors, like with the Ukrainians and Mordvinians, you'll give me a lot to blog about. It's gonna be fun.

  5. @Davidski, kindly approve my comment at your blog.

    As for the zombie concept, it works just fine. I have never seen a major discrepancy between results generated using supervised ADMIXTURE+Zombies vs. those generated using DIYDodecad+allele frequencies.

  6. "If you keep doing it, and you don't figure out a way to avoid major errors, like with the Ukrainians and Mordvinians, you'll give me a lot to blog about. It's gonna be fun."

    You have your concepts mixed. The results of Ukrainians and Mordvins are not attributable to the use of zombies. Approve my comment and you will learn something new!

  7. "You should post an addendum with a comparison between Dodecad V3 and the latest Eurogenes K=10."

    a very good idea

  8. If it isn't too much trouble could you list the populations that make up the Scandinavian, Western European and Celto-Germanic groups?

  9. You'll have to check with the MDLP to see how these components were inferred.

  10. "Once Dodecad v4 is off the ground, and if I find time to fully automate the process, I will perhaps try to map all my past calculators (i.e., the initial K=10, Dodecad v3, 'bat', 'euro7', 'weac', 'africa9') onto the new golden standard of the Project."

    Nice to hear something new is in the works. Looking forward to seeing what you have in store for us :-).

    Dienekes, I have just one request. Instead of accounting for ancestries in the Information about project samples thread, or at least in addition to that, could you create a [freely editable] Google spreadsheet with columns for the Dodecad ID (DODnnnn), Ethnicity/Nationality, and Y-DNA and mtDNA Haplogroups, the way Zack from the Harappa Ancestry Project has done? That would be very, very useful. Going through the numerous posts in the ancestry thread in order to look up the ethnicity of a participant of interest is rather harrowing.
