Monday, May 30, 2011

More Zombies: Ancestral North Indians and Ancestral South Indians reborn

In my previous post I showed how synthetic individuals corresponding to ADMIXTURE ancestral components can be created and used. This was made possible by the fact that ADMIXTURE outputs allele frequencies for its components, which can be utilized to create a population of random genotypes with the same allele frequencies.
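
A minimal sketch of how such random genotypes could be drawn, assuming each SNP is sampled independently and each individual carries two independent allele draws (Hardy-Weinberg proportions). The function name `make_zombies` and the five frequencies are hypothetical, chosen only for illustration; real input would be the per-SNP frequencies that ADMIXTURE writes out for a component.

```python
import numpy as np

def make_zombies(freqs, n_individuals, seed=0):
    """Draw diploid genotypes (0, 1 or 2 copies of the counted allele) per SNP."""
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    # Each genotype is the sum of two independent allele draws: Binomial(2, p).
    return rng.binomial(2, freqs, size=(n_individuals, freqs.size))

# Hypothetical allele frequencies for five SNPs (illustrative values only).
component_freqs = np.array([0.05, 0.30, 0.50, 0.80, 1.00])
zombies = make_zombies(component_freqs, n_individuals=25)

# Sample allele frequency per SNP: allele count / (2 * number of individuals).
sample_freqs = zombies.sum(axis=0) / (2 * zombies.shape[0])
```

With only 25 individuals the sample frequencies scatter around the input frequencies rather than matching them exactly; the match tightens as more individuals are drawn.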

A more difficult task is to create such "zombie" individuals when there are no allele frequencies at hand. A prime example of this is the paper by Reich et al. (2009) on the two ancestral components in Indians: Ancestral South Indians (ASI) and Ancestral North Indians (ANI). The paper provides admixture estimates for these two components in present-day "Indian Cline" groups, but no allele frequencies for these components: we only knew that ANI was closely related to West Eurasians, and ASI formed a clade with the Onge from the Indian Ocean.

Both ANI and ASI are extinct (in pure form) populations, and they are blended (in varying proportions) in modern day Indians, with highest ANI occurring in the Northwest and among upper caste groups, and highest ASI among South Indian tribal and low caste populations.

As I was thinking of ways to extend the "zombie" approach, it occurred to me that there is a fairly involved way to extract the ANI/ASI allele frequencies from the available evidence:

If f(ANI) and f(ASI) are the allele frequencies at a locus for ANI and ASI, and an admixed population P has x fraction of ancestry from ANI and 1-x from ASI, then its allele frequency is expected to be:

x*f(ANI)+(1-x)*f(ASI) = f(P)

The known quantities here are x and f(P) (marked in bold in the original post); f(ANI) and f(ASI) are the unknowns. Obviously, this equation does not hold exactly in practice, because of sampling error, uncertainty in the estimation of x, and genetic drift that may have affected the allele frequencies of the admixed population.

Nonetheless, we have not just one equation of this sort, but 18, since Reich et al. (2009) provide ANI/ASI estimates for 18 different Indian Cline populations. We can thus fit a linear regression at each locus to recover f(ANI) and f(ASI).
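
The regression can be sketched as an ordinary least-squares fit, one locus per column. This is an illustration under stated assumptions, not the exact code used for the post: the function name `estimate_ancestral_freqs`, the x values and the "true" frequencies are all hypothetical. Since f(P) = x*f(ANI) + (1-x)*f(ASI), using x and 1-x as the two regressors makes the fitted coefficients f(ANI) and f(ASI) directly. Out-of-range estimates are clipped to [0, 1], which is one simple way to keep them biologically meaningful.

```python
import numpy as np

def estimate_ancestral_freqs(x, f_pop):
    """Least-squares estimates of f(ANI) and f(ASI) for every SNP.

    x     : ANI fraction of each admixed population, shape (n_pops,)
    f_pop : allele frequencies, shape (n_pops, n_snps)
    """
    x = np.asarray(x, dtype=float)
    f_pop = np.asarray(f_pop, dtype=float)
    # Design matrix with columns x and 1 - x, matching the mixture equation.
    design = np.column_stack([x, 1.0 - x])
    coef, *_ = np.linalg.lstsq(design, f_pop, rcond=None)
    # Frequencies outside [0, 1] make no biological sense, so clip them.
    coef = np.clip(coef, 0.0, 1.0)
    return coef[0], coef[1]

# Hypothetical noise-free toy data: five populations, three SNPs.
x = np.array([0.40, 0.50, 0.60, 0.70, 0.75])
true_ani = np.array([0.1, 0.9, 0.5])
true_asi = np.array([0.6, 0.2, 0.5])
f_pop = np.outer(x, true_ani) + np.outer(1.0 - x, true_asi)

est_ani, est_asi = estimate_ancestral_freqs(x, f_pop)
```

On noise-free toy data the fit recovers the input frequencies; with real data, the narrow range of x across populations means that noise in f(P) is amplified when extrapolating to x = 0 and x = 1.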

This is exactly what I did; there are two important caveats:
  • Because most of the Reich et al. (2009) populations are very small, f(P) is expected to be very noisy. I thus pooled the Indian Cline populations into five groups (ordered by increasing ANI, and making sure that each group had >15 individuals), and calculated admixture proportions (x's) and allele frequencies (f(P)'s) on these groups.
  • Linear regression coefficients (the f(ANI) and f(ASI) estimates) may come out below 0 or above 1, which makes no biological sense, so in those cases (~5% of markers) they were set to 0 or 1, respectively.

All of this required a bit of thinking and work, and I was very skeptical that it would work: given sampling error, admixture estimation error, the limitations of regression, and the random creation of individuals, the process from input data to output "zombies" passes through so many layers that it could very well lead to nonsense.
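
One possible way to form such pooled groups is sketched below. The post does not say how the pools were built or how the pooled x's were computed, so both the greedy rule (walk through populations in order of increasing ANI and close a group once it holds at least 16 individuals) and the sample-size-weighted average are assumptions. Population names and numbers are hypothetical.

```python
def pool_populations(pops, min_size=16):
    """Group (name, ani_fraction, n_individuals) tuples by increasing ANI."""
    groups, current = [], []
    for pop in sorted(pops, key=lambda p: p[1]):
        current.append(pop)
        # Close the group once it has more than 15 individuals.
        if sum(p[2] for p in current) >= min_size:
            groups.append(current)
            current = []
    if current:
        # Leftover populations are too few on their own: merge into the last group.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups

def pooled_ani(group):
    """Sample-size-weighted mean ANI fraction of a pooled group."""
    total = sum(n for _, _, n in group)
    return sum(x * n for _, x, n in group) / total

# Hypothetical populations: (name, ANI fraction, sample size).
pops = [("PopA", 0.40, 10), ("PopB", 0.45, 8), ("PopC", 0.55, 20),
        ("PopD", 0.60, 5), ("PopE", 0.70, 4)]
groups = pool_populations(pops)
group_x = [pooled_ani(g) for g in groups]
```

A greedy pass like this does not guarantee a fixed number of groups; reaching exactly five would take manual adjustment of the cut points. Pooled allele frequencies would be obtained analogously, from the allele counts of all individuals in a group.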

Nonetheless, there is power in numbers, and I was hopeful that this might work. If it did, I could have synthesized ANI and ASI populations to play with and use pretty much like regular populations in a variety of experiments.

Validation of synthetic ANI/ASI populations

I generated 25 ANI and 25 ASI individuals using the above-described method. There are 119,588 SNPs in these populations.

To validate them, I ran supervised ADMIXTURE using these ANI/ASI individuals as ancestral populations, and all the Indian Cline populations as test data. The results can be seen below:
Although the estimates for some populations (e.g., Chenchu: 31 vs. 40.7%) are substantially off, the median error is 1%, and the average error is 2.4%. Overall, it does appear that the synthetic ANI/ASI individuals are fairly good stand-ins for their (extinct) populations.
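
For readers who want to set up a similar supervised run: as far as I understand the ADMIXTURE manual, supervised mode reads a `.pop` file with one line per individual, in the same order as the `.fam` file, giving a population label for reference individuals and `-` for individuals whose ancestry is to be estimated. The sketch below builds those lines; the individual IDs, the helper name `make_pop_labels` and the file name `dataset` are hypothetical.

```python
def make_pop_labels(fam_ids, reference_labels):
    """One .pop line per .fam row: a label for references, '-' for test samples."""
    return [reference_labels.get(iid, "-") for iid in fam_ids]

# Hypothetical individual IDs, in the order they appear in the .fam file.
fam_ids = ["ANI_1", "ANI_2", "ASI_1", "ASI_2", "test_1", "test_2"]
reference_labels = {"ANI_1": "ANI", "ANI_2": "ANI",
                    "ASI_1": "ASI", "ASI_2": "ASI"}

pop_lines = make_pop_labels(fam_ids, reference_labels)
pop_text = "\n".join(pop_lines) + "\n"

# pop_text would be saved as dataset.pop next to dataset.bed/.bim/.fam,
# and the run started with K=2 reference populations:
#   admixture dataset.bed 2 --supervised
```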

Ancestral North Indians

I included ANI together with the 4 West Eurasian components of the Dodecad Project in an MDS plot:
Also, a neighbor-joining tree:

Putting ANI/ASI to work: Romanian Gypsies

I have previously detected 2 individuals in the Behar et al. (2010) Romanian sample that are likely to be of Roma (Gypsy) heritage. Here is a supervised admixture of the Romanian sample using the ANI/ASI components:

The previously detected individuals do indeed possess both ANI and ASI components. These are:

18.1, 15.3
16.9, 16.4

in the two individuals, which might be useful in constraining geographically the origin of European Gypsies along the Indian Cline.

Putting ANI/ASI to work: Iranians

Iranians generally show affinity to South Asians. Is this affinity related to the common Indo-Iranian background of Iranians and Indo-Aryans, or, is it, perhaps, due to the absorption of South Asian population elements during Iran's long imperial past?

The ANI/ASI components in the Iranians and Iranian_D samples are:

11.7, 7.5
12.0, 6.9

Compared to the previously described Romanian Gypsies, the South Asian component in Iranians tends to be clearly tilted towards ANI.
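
The "tilt" can be made explicit by computing the ANI share of the combined South Asian component, ANI/(ANI+ASI), from the figures quoted above. The sketch assumes that the first number in each pair is ANI and the second ASI, as the order of the text suggests; the labels `Romanian_1` and `Romanian_2` are hypothetical names for the two Romanian individuals.

```python
# (ANI %, ASI %) pairs as quoted in the post.
samples = {
    "Romanian_1": (18.1, 15.3),
    "Romanian_2": (16.9, 16.4),
    "Iranians": (11.7, 7.5),
    "Iranian_D": (12.0, 6.9),
}

def ani_share(ani, asi):
    """Fraction of the combined ANI+ASI ancestry that is ANI."""
    return ani / (ani + asi)

shares = {name: ani_share(*pair) for name, pair in samples.items()}
for name, share in shares.items():
    print(f"{name}: ANI share of ANI+ASI = {share:.3f}")
```

Under that reading, the two Romanian individuals come out at roughly 0.54 and 0.51, and the two Iranian samples at roughly 0.61 and 0.63.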


  1. very interesting will you include me in a ASI/ANI run?

  2. Have you posted in the ancestry thread?

  3. Ok, I'll keep you in mind if I do another ANI/ASI run.

  4. ok. thanks. but you did a few runs and i was never included and it would be cool if you run it for this run because you already included the two other (romanian) gypsies. my results would be a good addition to that, i think. if it doesnt take too much time. i dont want to sound annoying sorry.

  5. Ok, I'm sold :)

    #  Population  N   S_European  SW_Asian  W_Asian  N_European  ANI    ASI
    1  Svetozar    1   25.6        9.4       21.7     24.8        8.9    9.6
    2  S_European  25  100.0       0.0       0.0      0.0         0.0    0.0
    3  SW_Asian    25  0.0         100.0     0.0      0.0         0.0    0.0
    4  W_Asian     25  0.0         0.0       100.0    0.0         0.0    0.0
    5  N_European  25  0.0         0.0       0.0      100.0       0.0    0.0
    6  ANI         25  0.0         0.0       0.0      0.0         100.0  0.0
    7  ASI         25  0.0         0.0       0.0      0.0         0.0    100.0

    ANI/ASI ratio seems similar to the Romanian Gypsies; overall ANI+ASI is less, probably because you are part Gypsy.

  6. 8.9;9.6 so my ANI is less than my ASI? interesting

  7. where does that fit in india, i guess somewhere central to south....which indian ethnicity could this be?

  8. Great work.
    This could really open up some new vistas.
    Have you thought about resuscitating possible living dead lying in the PCA grave between Basques and Sardinians?

  9. Apologies for being persistent, but Zombie ANI-ASI scores would be great for the South Asian participants (of the clusters specified here), too, if possible (according to your convenience). Or at least, if you could post a little note on how to calculate it yourself. I couldn't find the concordance ratios, so..

  10. i agree with vasishta and include me in this next run plz, i suspect there might have been a mistake because i sincerely doubt my ANI is less than my ASI. no offense. apologies

  11. Dienekes,

    I'd love to see more of these Zombies created!

    Eurogenes BPA has a recent analysis of Northern Europe, Genetic substructures across Northern Europe (Part 4), where the resultant clusters are particularly, as some individuals emerge as pure North Atlantic types in the Irish and to a lesser extent British population.

    Here's a Map of the 5 Northern European Components, and the North Atlantic one is centered on the extreme South West of Ireland - or County Kerry in other words. I have often mentioned the distinctive appearance of people in this area, as they are more likely to not have freckles, have sallow skin, have curly hair, have black hair than other Irish. In this analysis people whose major ancestry is from this area, are 100% pure for that component.

  12. I'd love to see more of these Zombies created!

    I'm in my Frankenstein lab, creating them as we speak.

    It's important to note that when a bunch of individuals score 100% on a component, that means that they are very homogeneous and "related" in the context of the broader collection of individuals, and not necessarily "pure".

    Such components are acceptable for "framing populations" (e.g., Sub-Saharans or East Asians) that are used to tease out non_Caucasoid influences in West Eurasia, but they are not very desirable for the populations of one's region of interest: if we ran ADMIXTURE on a collection of individuals, some of which were from relatively isolated/inbred areas and some of which were more cosmopolitan, the components would be defined by the inbred areas, but that would simply point towards the most inbred areas, rather than the areas of most anthropological interest.

    It is sometimes useful to use very homogeneous populations when there is anthropological interest in them (Sardinians for example exhibit the lowest Asian shift by far, and are unique in that respect), but as a rule the best components are the ones that reach a high % (but not 100%) in a number of different populations so are anthropologically-interesting but not population-specific.

  13. Dienekes could you please do more studying on the Roma-Gypsies and make a theory/conclusions? Thanks

  14. Dienekes,
    Very true!

    Having an Irish Zombie would allow you to compare people from Britain, Norway and Iceland, etc. - to check for an Irish component in them.

  15. I'm from Romania and I thank you for the info about Romanians' ancestry - I was very curious about it. Do you have some graph concerning Romanians that differentiates between Southeastern (Thracian) and Southwestern (Italic) and between Northeastern (Slavic) and Northwestern (Germanic) European ancestry?

    I find this blog fascinating, I regret that it's now abandoned (but I suppose it took you a lot of your time).