Wednesday, June 22, 2011

Dodecad v3: population averages

With Dodecad v3 it is possible to use any population as test data in supervised ADMIXTURE analysis, and extract its admixture proportions in terms of the 12 ancestral components.

So, I have set up an automated job that will do just that: use pretty much every population available to me under only a few conditions:
  1. Each population must have at least 5 individuals
  2. It must have the same 166,462 SNPs on which the test is based
  3. It must not be from a group not covered by the test (e.g., Australo-Melanesians or Native Americans)
By my count, I have 141 different populations that meet these requirements. Each population is run on its own in a supervised ADMIXTURE analysis together with the 600-strong synthetic set (50 per ancestral component). These are ideal conditions to produce a high-quality comparative reference set.
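The three conditions above can be sketched as a simple eligibility filter. This is only an illustration, not the actual job: the `Population` record, its field names, and the group labels are hypothetical, and "the same SNPs" is interpreted here as "genotyped on every SNP the test uses".

```python
from dataclasses import dataclass

MIN_INDIVIDUALS = 5  # condition 1 from the post
# Condition 3: groups the test does not cover (labels are hypothetical).
EXCLUDED_GROUPS = {"Australo-Melanesian", "Native American"}


@dataclass(frozen=True)
class Population:
    name: str
    group: str            # hypothetical broad regional label
    n_individuals: int
    snps: frozenset       # IDs of the SNPs genotyped in this population


def is_eligible(pop: Population, test_snps: frozenset) -> bool:
    """Return True if the population meets all three stated conditions."""
    return (
        pop.n_individuals >= MIN_INDIVIDUALS
        and test_snps <= pop.snps          # every test SNP is present
        and pop.group not in EXCLUDED_GROUPS
    )


def eligible_populations(pops, test_snps):
    """Keep only the populations that pass the filter, preserving order."""
    return [p for p in pops if is_eligible(p, test_snps)]
```

In practice the SNP check would be done on the genotype files themselves; the set comparison here just stands in for that step.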

The average admixture proportions of different populations will be put in this spreadsheet as they are calculated, which will probably take several days.
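As a rough sketch of what "average admixture proportions" means here: ADMIXTURE writes one row of K proportions per individual, and a population average is the column-wise mean over that population's rows, leaving out the synthetic reference individuals that were run alongside it. The function name and the `labels` list below are assumptions made for illustration, not part of the actual pipeline.

```python
def average_proportions(q_rows, labels, population):
    """Column-wise mean of the admixture proportions for one population.

    q_rows     -- list of per-individual proportion lists (K values each)
    labels     -- population label for each row, in the same order
    population -- the label whose rows should be averaged
    """
    if len(q_rows) != len(labels):
        raise ValueError("q_rows and labels must have the same length")

    # Keep only the rows of the test population (not the synthetic set).
    rows = [row for row, label in zip(q_rows, labels) if label == population]
    if not rows:
        raise ValueError(f"no individuals labeled {population!r}")

    k = len(rows[0])
    if any(len(row) != k for row in rows):
        raise ValueError("all rows must have the same number of components")

    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(k)]
```

For example, two individuals with proportions `[0.5, 0.5]` and `[0.25, 0.75]` average to `[0.375, 0.625]`.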


  1. Thanks! It's absolutely fascinating. I wonder if the East/West European distinction picks up different things in northern and southern Europe.

  2. I think the bulk of the African (including North African) components in Iberians is a legacy of the Islamic Iberia. During the Reconquista very probably the bulk of Iberian Muslims converted to Christianity. Iberian Muslims must have been already overwhelmingly descended from native Iberians with only a small genetic contribution from North African Muslims, so relatively significant African admixture in modern Iberians means that the Reconquista very likely converted the bulk of Iberian Muslims to the Christian fold.

  3. African admixture is higher in Northern Portugal than Southern Portugal - which would contradict your theory... ever heard of the Pasiego people?

  4. Conroy, Northern and Southern Portuguese are not much different genetically. Then again, Portugal is a small country and most of its territory was part of the Islamic domains for a considerable time.

  5. Although most of the northern European component among Ashkenazim appears to be western, the significant north European segments that I have (I'm Ashkenazic) are eastern (Polish or Belarusian). If I am typical, northern European admixture among Ashkenazim must have occurred earlier in the west than in the east.

  6. @ Onur That seems very much the case. I may be descended from one of those Berbers that converted. I very much have the West and NW African, yet as far back as I can tell, my maternal ancestors were Spanish colonials.

  7. onur - wrong again.

    Southern Portugal's African admixture is mostly from the slave trade and is Sub-Saharan, Northern Portugal's African admixture is mostly North West African, and could be Berber or much older, like Capsian Culture:

  8. Dienekes, why don't you just add the Austro-Melanesians as well?

    They partially represent a "Denisovan" component along with the earliest settlers of Oceania. At K=15 we've seen a bit of this in places like Malaysia and Cambodia. There was an Indian Ocean slave trade from Southeast Asia and Indonesia, and perhaps this also brought a bit of this ancestry westward.

    I would think that this is pretty easy to do. All it really involves is K=13.

  9. Also, how many Native American components are there?
    Two, an "Amerind" one and a Na-Dene one, or just one?

    If you add these along with an Austro-Melanesian component you'll basically be able to run Dodecad v3 on *anyone* and any population in the world. No worries about whether someone "qualifies" or not.

    This is pretty important for admixed populations like African-Americans and Latin Americans, and as you know several of these are in the 1000 Genomes Project so we have real data for them.

    That's K=14 or K=15, and you're done.

  10. Adding more components adds to the computation time in at least three different ways:

    1. More individuals
    2. More iterations to convergence
    3. More time/iteration (proportional to K^2)

    Adding Papuans, for example, would add (in my estimation) >25% to the computation time, but to no avail, as most Eurasian individuals would register no or minimal "Papuan" (or Denisovan) admixture and most of that would be noise, which would be further exacerbated by the small sample size available to me, which would make the estimation of allele frequencies in Papuans noisy.

    This is also the reason why I did not break down the "Northeast Asian" component any further, even though previous experiments have clearly shown that several distinct populations emerge in North Eurasia: sample sizes are small, components are population specific, and mostly irrelevant.

    People who are interested in determining finer scale in e.g., East Asia can follow my ideas in reverse: they can use West Eurasian synthetic controls to avoid having to deal with multiple West Eurasian clusters, and up the K to discover East Asian ones, and perhaps add a Papuan reference as well, as Australo-Melanesians are likely to have played some role in Southeast Asia.

  11. Southern Portugal's African admixture is mostly from the slave trade and is Sub-Saharan, Northern Portugal's African admixture is mostly North West African, and could be Berber or much older, like Capsian Culture:

    Which study are you referring to?

  12. The Yoruba were less than ~3% Palaeo_African here:

    But in this admix run (based on some synthetic clusters from above) they are 28%? That's quite off and inconsistent.

  13. The two "Palaeo-African" labels are not the same.

    In the figure you posted, the "Palaeo-African" is the San/Pygmy-centered component in a Sub-Saharan African only analysis, and the "Neo-African" one is the Yoruba/Mandenka/Bantu-centered one.

    When these two components are plugged into the global analysis (regular ADMIXTURE)

    two new components emerge that are also labeled Neo-African and Palaeo-African, but notice that the Neo-African _population_ here has a substantial Palaeo-African _component_ (brown).

    The two new components are not the same as the old ones: they may have the same name, and be centered on the same populations, but their allele frequencies are informed by a variety of other populations (North and East Africans, primarily).

    So, it is this "brown" Palaeo-African component based on the big run

    and not the blue one based on the Sub-Saharan run

    that is "Palaeo-African" in the reported averages.

    There is no inconsistency: the inclusion of more populations (North/East Africans, and African-admixed individuals elsewhere) has shifted the two primary poles of African variation into positions that better capture variation.

  14. I see, thanks for clearing that up.

  15. True, it certainly wouldn't be worth it to up the computing time by 25% just to add an Austro-Melanesian component that is nearly nonexistent outside of Oceania and Southeast Asia.

    What about the restriction involving Native American ancestry?

    This is a bit different because most New World individuals who descend from Colonial Era immigrants have some Native American ancestry. Would this show up as plain "Northeast Asian"? Is there at least one possible Native American component that would emerge without breaking down Northeast Asian into population-specific subgroups as well?

    What do the Pima and Karitiana come out as in these runs, if you treat them as Eurasians?
    Do you have a Polynesian sample to run as well?

    It would be interesting to run the HGDP Papuans and Melanesians, and also the single seemingly-unadmixed "Riverine" Australian Aboriginal, just to see what happens.

    From what I've seen with the worldwide distribution of some of the rare HGDP ancestral alleles, I suspect that these Australo-Melanesians will have some non-trivial "Palaeo-African" values.

    The ASW have a minuscule amount of the Northeast Asian (0.7%) component but we know they have non-trivial Native American admixture. I would have expected to see it come out as Northeast Asian, but it apparently didn't.

    Given that potential Native American admixture from any population from the Americas isn't a special case - and remarkably, Afro-Liberians and some Filipinos around Manila ("Mexico City") would also have such admixture - then perhaps it would be worth it to add that one more dimension if possible, or just ensure that any Native American is totally distributed among the Northeast Asian component.

    That way your population-ancestry restriction would be lifted in almost all cases except for people with obvious Australo-Melanesian and Oceanian ancestry.

  16. You could also create one synthetic Neanderthal consensus and one Denisovan consensus individual for all these 166,462 SNPs and see how those come out at K=12. That would be like running a Neanderthal/Denisovan test against all the worldwide ancestral components, but in reverse.
    It should be very easy to do, and the results might prove quite interesting.

  17. This is a bit different because most New World individuals who descend from Colonial Era immigrants have some Native American ancestry. Would this show up as plain "Northeast Asian"? Is there at least one possible Native American component that would emerge without breaking down Northeast Asian into population-specific subgroups as well?

    There are a few individuals with genuine Native American ancestry. You can find them in the 600-member milestone where Native American references were used.

    So, you can check the same IDs when the individual v3 results are calculated to see how the "Native American" is interpreted.

    I don't have plans to do any sort of Native American testing as part of the general test. My focus is entirely on the Old World and on anthropology, not on recently admixed individuals and genealogy.

    Australo-Melanesians are another matter, as they (or their relatives) are a likely substratum in Southeast Asia, so they might be incorporated in the future.

  18. About outliers, there are at least 2 predominantly Sub-Saharan individuals in the Mozabite group, 2 or 3 Moroccans with high African admixture, 1 Saudi with high Sub-Saharan admixture, 1 predominantly Sub-Saharan Yemeni...

    And will you include Algerians, North Moroccans, Libyans, etc like Diogenes did if there is no compatibility issue with the used SNPs?

  19. I did not finish auditing the populations for outliers. Also, I'm quite conservative as to what constitutes an outlier, and also people will be able to see the population portraits for themselves once I'm done with all of the averages.

    As for the other datasets, I will include some of them as long as there are enough SNPs, but separately, as the admixture proportions will not be directly comparable to the ones currently posted.

  20. It has been three days since I asked Paul Conroy the basis for his above claims regarding Portuguese, but he still hasn't replied. I take this silence to mean that he doesn't have a viable basis for his claims. Above he made very specific and bold claims about the Portuguese, so he has to defend them when challenged if he still stands by them.

  21. I was waiting for Rasmussen's Nganasans. Don't see them.

  22. I was waiting for Rasmussen's Nganasans. Don't see them.

    See Outliers tab. Population portraits for all populations will be posted, and averages after outlier removal will be posted in the main tab.

  23. Why does JPT (Japanese from Tokyo = 17.1% Northeast Asian) have less Northeast Asian than Japanese_D and Japanese HGDP? Could the JPT samples be more representative of Japanese, since JPT has the largest sample size?