In my previous post I showed how synthetic individuals corresponding to ADMIXTURE ancestral components can be created and used. This was made possible by the fact that ADMIXTURE outputs allele frequencies for its components, which can be utilized to create a population of random genotypes with the same allele frequencies.
A more difficult task is to create such "zombie" individuals when there are no allele frequencies at hand. A prime example of this is the paper by Reich et al. (2009) on the two ancestral components in Indians: Ancestral South Indians (ASI) and Ancestral North Indians (ANI). The paper provides admixture estimates for these two components in present-day "Indian Cline" groups, but no allele frequencies for the components themselves: we know only that ANI was closely related to West Eurasians, and that ASI formed a clade with the Onge from the Indian Ocean.
Both ANI and ASI are extinct as pure populations, and they are blended in varying proportions in modern-day Indians, with the highest ANI occurring in the Northwest and among upper caste groups, and the highest ASI among South Indian tribal and low caste populations.
As I was thinking of ways to extend the "zombie" approach, it occurred to me that there is a fairly involved way to extract the ANI/ASI allele frequencies from the available evidence:
If f(ANI) and f(ASI) are the allele frequencies at a locus for ANI and ASI, and an admixed population P has x fraction of ancestry from ANI and 1-x from ASI, then its allele frequency is expected to be:
x*f(ANI)+(1-x)*f(ASI) = f(P)
The known quantities are x and f(P); f(ANI) and f(ASI) are the unknowns. Obviously, this equation does not hold exactly in practice, because of sampling error, uncertainty in the estimation of x, and genetic drift that may have affected the allele frequencies of the admixed population.
Nonetheless, for each locus we have not one equation of this sort but 18, since Reich et al. (2009) provide ANI/ASI estimates for 18 different Indian Cline populations. We can thus fit a linear regression to recover f(ANI) and f(ASI).
This is exactly what I did; there are two important caveats:
- because most of the Reich et al. (2009) population samples are very small, f(P) is expected to be very noisy. I thus pooled the Indian Cline populations into five groups (ordered by increasing ANI, and making sure that each one had >15 individuals), and calculated admixture proportions (x's) and allele frequencies (f(P)'s) on these groups.
- linear regression coefficients (the f(ANI) and f(ASI) estimates) may come out less than 0 or more than 1, which makes no biological sense, so such estimates were set to 0 or 1 respectively (~5% of markers)
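The regression and the clamping step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for this post: the function name estimate_ancestral_freqs, the use of NumPy's least-squares solver, and the five ANI fractions in the demo are all hypothetical, and the demo data are simulated. For each SNP it fits f(P) = x*f(ANI) + (1-x)*f(ASI) across the groups (no intercept term), then forces the estimates into the range 0 to 1.

```python
import numpy as np

def estimate_ancestral_freqs(x, f_p):
    """Per-SNP least-squares estimates of f(ANI) and f(ASI).

    x   : shape (G,)   ANI fraction of each admixed group
    f_p : shape (G, S) allele frequency of each group at each SNP
    Returns (f_ani, f_asi, fraction of SNPs that needed clamping).
    """
    x = np.asarray(x, dtype=float)
    f_p = np.asarray(f_p, dtype=float)
    # One row per group: f(P) = x * f(ANI) + (1 - x) * f(ASI)
    design = np.column_stack([x, 1.0 - x])
    # Solves all SNPs at once; coef has shape (2, S)
    coef, *_ = np.linalg.lstsq(design, f_p, rcond=None)
    out_of_range = (coef < 0.0) | (coef > 1.0)
    clamped = np.clip(coef, 0.0, 1.0)
    return clamped[0], clamped[1], out_of_range.any(axis=0).mean()

# Simulated demo: five groups with made-up ANI fractions, 1000 SNPs.
rng = np.random.default_rng(0)
x_groups = np.array([0.40, 0.50, 0.60, 0.65, 0.75])  # hypothetical values
true_ani = rng.uniform(0.0, 1.0, size=1000)
true_asi = rng.uniform(0.0, 1.0, size=1000)
f_groups = np.outer(x_groups, true_ani) + np.outer(1.0 - x_groups, true_asi)
# Add sampling noise to the group frequencies, keeping them valid
f_groups = np.clip(f_groups + rng.normal(0.0, 0.02, f_groups.shape), 0.0, 1.0)

est_ani, est_asi, frac_clamped = estimate_ancestral_freqs(x_groups, f_groups)
```

When the x's of the admixed groups cover only part of the range from 0 to 1, the f(ANI) and f(ASI) estimates are extrapolations beyond the observed groups, so noise in f(P) is amplified; that is one reason some raw estimates can land outside the valid range.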
All of this required a bit of thinking and work, and I was very skeptical that it would succeed. Between sampling error, errors in the admixture estimates, the limitations of regression, and the random creation of individuals, the process from input data to output "zombies" passes through so many layers that it could very well lead to nonsense.
Nonetheless, there is power in numbers, and I was hopeful that this might work. If it did, I would have synthetic ANI and ASI populations to play with and use pretty much like regular populations in a variety of experiments.
Validation of synthetic ANI/ASI populations
I generated 25 ANI and 25 ASI individuals using the above-described method. There are 119,588 SNPs in these populations.
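Turning a vector of allele frequencies into random individuals can be sketched as follows. This is only one plausible way to do it, and it rests on assumptions not stated in the post: each genotype is drawn as two independent alleles (Hardy-Weinberg proportions), SNPs are drawn independently of one another (no linkage disequilibrium), and genotypes are coded as 0, 1 or 2 copies of the counted allele. The function name make_zombies and the toy frequencies are hypothetical.

```python
import numpy as np

def make_zombies(freqs, n_individuals, seed=None):
    """Draw random genotypes (0, 1 or 2 allele copies) per SNP.

    freqs : shape (S,) allele frequency at each SNP, each in [0, 1]
    Returns an integer array of shape (n_individuals, S).
    """
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    # Assumption: two independent allele draws per SNP and individual
    return rng.binomial(2, freqs, size=(n_individuals, freqs.size))

# Toy example: 25 individuals at four SNPs with made-up frequencies
toy_freqs = [0.0, 0.25, 0.5, 1.0]
zombies = make_zombies(toy_freqs, 25, seed=1)
```

A SNP with frequency 0 or 1 is fixed in every synthetic individual, so markers whose regression estimates were forced to 0 or 1 carry no within-population variation in the resulting "zombies".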
To validate them, I ran supervised ADMIXTURE using these ANI/ASI individuals as ancestral populations, and all the Indian Cline populations as test data. The results can be seen below:
Although the estimates for some populations (e.g., Chenchu: 31% vs. 40.7%) are substantially off, the median error is 1%, and the average error is 2.4%. Overall, it does appear that the synthetic ANI/ASI individuals are fairly good stand-ins for their (extinct) populations.
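The gap between the median and the average error is what one would expect when a few populations are far off. A small sketch, assuming "error" means the absolute difference between two sets of estimates in percentage points: apart from the Chenchu pair quoted above, every number below is made up for illustration, and error_summary is a hypothetical helper.

```python
from statistics import mean, median

def error_summary(first, second):
    """Median and mean absolute difference, in percentage points."""
    errors = [abs(a - b) for a, b in zip(first, second)]
    return median(errors), mean(errors)

# Made-up percentages for illustration only; the last pair is the
# Chenchu pair quoted in the text (31 vs. 40.7).
first = [50.0, 62.0, 45.0, 70.0, 31.0]
second = [51.0, 61.5, 46.0, 69.0, 40.7]
med_err, avg_err = error_summary(first, second)
```

A single large discrepancy pulls the mean up much more than the median, so a low median with a higher average points to a handful of poorly fitted populations rather than a uniform bias.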
Ancestral North Indians
Also, a neighbor-joining tree:
Putting ANI/ASI to work: Romanian Gypsies
I have previously detected 2 individuals in the Behar et al. (2010) Romanian sample that are likely to be of Roma (Gypsy) heritage. Here is a supervised admixture of the Romanian sample using the ANI/ASI components:
The previously detected individuals do possess both ANI and ASI components; indeed, these are:
16.9, 16.4
in the two individuals, which might be useful for geographically constraining the origin of European Gypsies along the Indian Cline.
Putting ANI/ASI to work: Iranians
Iranians generally show affinity to South Asians. Is this affinity related to the common Indo-Iranian background of Iranians and Indo-Aryans, or is it perhaps due to the absorption of South Asian population elements during Iran's long imperial past?
The ANI/ASI components in the Iranians and Iranian_D samples are:
11.7, 7.5
12.0, 6.9
Compared to the previously described Romanian Gypsies, the South Asian component in Iranians tends to be clearly tilted towards ANI.