Monday, October 31, 2011

Origin of Kalash inferred with Eurogenes K=10 "test" calculator

Vasishta is asking Eurogenes for help in demonstrating that the Kalash have Northern-European-specific segments:
Yes. He keeps citing the Kalash as proof that the Indo-Iranians were an almost exclusively a West-Asian like population, even though I personally think the mainly West Asian-South Asian assortment of the Kalash in his analyses might be an artifact of their inbreeding and isolation, thus confusing ADMIXTURE. Zack's K=11 at Harappa has shown that the Kalash display around 22% of the component modal in Lithuanians. Yet, he ignores the North/Eastern European admixture in Northwest Indians and North Indian Brahmins (in his own analyses at that!). Interestingly enough the aforementioned groups tend to score a sliver of Northeast European admixture in Dr.Doug McDonald's analyses, with the top matches for that sliver usually being Lithuanians, Russians and Finns; in that order. It (NEU) is even found in frequencies of around 4-6% in Dravidian-speaking southern Brahmins. As much as I hate to say it, he is indeed rather stubborn and has somewhat of an underlying agenda.

David, I think you should look into proving that the Kalash do indeed have some NEU-specific segments. I would be super-surprised if they didn't, given that more mixed populations south of their geographical area display it themselves.

It appears that Vasishta disagrees with me because he "personally thinks" that the admixture proportions of the Kalash are due to inbreeding and the limitations of ADMIXTURE.

Does he cite any studies or make any argument why ADMIXTURE would remove precisely the component that he is so eager to be present? No. While genetic drift in an isolated population could indeed lead to the loss of genetic diversity, there is no reason to think that this would lead preferentially to the loss of Northern-European segments. It is strange that Vasishta accuses me of bias and yet, at the same time, invokes the magic of some unspecified flaw of ADMIXTURE for the loss of his favorite component.

Vasishta invokes the Harappa Ancestry Project K=11 admixture analysis in support of his idea that the Kalash have 22% of the component modal in Lithuanians. However, he neglects to mention that at K=11 there is no West-Asian or Caucasus centered component in the HAP analysis, but rather only "European" (modal in Lithuanians) and "SW Asian" (modal in Yemen Jews). It is indeed strange that he accuses me of bias for providing evidence about the relationship of the Kalash with West Asia, while at the same time, showing preference for a level of analysis where such a component is lacking.

The West Eurasian cline between Arabia and Northeastern Europe is evident in the 'weac' admixture analysis, where the European-centered component (Atlantic-Baltic) is present in populations such as Assyrians and Armenians, whereas it is lacking in them at the appropriate level of resolution. Therefore, the fact that the Kalash show "European" admixture at the level of Europe vs. Near East does not mean that they ought to show such admixture at the level of Europe vs. West Asia/Caucasus vs. Arabia.

One of the benefits of DIYDodecad has been the availability of data from projects that have hitherto been black boxes. In the interest of transparency, I have taken the Eurogenes K=10 "test" calculator and repeated my analysis of the Kalash, whom I had previously shown to be a fairly simple West/South Asian mix. I could have waited for him to get around to it, but since he's quick on the talk and slow on the trigger, I decided to do it for him.

The admixture proportions of the Kalash, according to the Eurogenes K=10, are: 40.3% S_Asian, 58.7% W_Asian, 0.9% N_E_Euro, 0.1% N_Asian. Hence the analysis based on the Eurogenes K=10 components confirms the analysis based on my eurasia7, which showed the Kalash to be a "West Asian" population (62.4%) with substantial "South Asian" admixture (37.1%), and near-complete absence of any other genetic components.

Eurogenes alleges, not without his usual charm, that:
Dienekes has a keen eye for things he wants to see. But he hasn't yet noticed that in all accurate analyses, there's significant Eastern European admix in North India. His monocle got fogged up in that instance.
Let us consider some pertinent facts: the Indian peninsula has been invaded multiple times from Central Asia, a process that continued long after the establishment of the Indo-Aryans during the 2nd millennium BC. Eurogenes may want to think that the "Eastern European" admixture in South Asia dates to his mythological Polish Indo-Europeans, galloping across the steppes on their horses, but there is, at present, no particular reason to think that this is the case.

Furthermore, among populations from the northern parts of the Indian subcontinent, the "West Asian" component as a fraction of the "West Asian" + "Atlantic-Baltic" components reaches its minimum of 77% in the Pathans. His own monocle is surely in greater need of de-fogging if I miss the 23% and he misses the >77%.

Indeed, the Europe vs. Caucasus ratio in Indian subcontinental populations is similar to that found in people from the Middle East and Caucasus region. It is not surprising that Eurogenes has abandoned his search for North European components in South Asia, going as far as reconstructing Ancestral North Indians as "Northern Europeans". Needless to say, he was wrong. The West Eurasian ancestry of the population of the Indian subcontinent is similar to that found in modern West Asian populations, not Slavs.

Eurogenes promises:
This shouldn't be too difficult. I'll use Dienekes' calculator for the job, and then check the results with LAMP. How poetic.
Been there, done that. It will be fun to see what "Northern European" components he will be able to squeeze out of the 0.9% N_E_Euro component that my software, in conjunction with his "test" calculator produces.

Why are the Kalash important?

There are three reasons why the Kalash are important in the study of Eurasian prehistory:
  1. Their mountainous habitat contributed to isolation and relative immunity from historical population movements
  2. Their non-Islamic religion has definitely preserved them from recent gene inflow
  3. Their language is unique within the Indo-Aryan family, and it is often considered today as part of a separate Dardic family of Indo-Iranian in addition to the more populous Iranian and Indo-Aryan families.
The Kalash are crucial for those interested in the origins of Indo-Iranians, and the fact that they are, indeed, a simple West/South Asian mix is not without significance for that question.


Here is the result of a PCA analysis of the Kalash together with 50 synthetic individuals from each of the S_Asian, W_Asian, and N_E_Euro components of Eurogenes K=10 "test". This was calculated with smartpca with numoutlieriter set to 0.

It is evident that the Kalash appear to fall on the S_Asian to W_Asian line, and toward the W_Asian pole, consistent with being a population of those two origins, with the W_Asian component predominating.


As mentioned in the eurasia7 post, the Kalash tend to form population-specific components in ADMIXTURE analyses, so they are generally not included in my runs. So, I ran the K=7 analysis again, but this time I included the Kalash. Here are the top populations of the component that was modal in the Kalash:

[186,] "Kurd_D" "50.2"
[187,] "Kurds_Y" "50.7"
[188,] "Armenian_D" "50.9"
[189,] "Armenians_Y" "51.2"
[190,] "Adygei" "51.5"
[191,] "Chechens_Y" "53"
[192,] "North_Ossetians_Y" "53.2"
[193,] "Lezgins" "54.4"
[194,] "Georgians" "59.8"
[195,] "Georgian_D" "60.1"
[196,] "Abhkasians_Y" "60.5"
[197,] "Kalash" "63.2"

Here are their exact admixture proportions in this unsupervised ADMIXTURE run:

Kalash N=23
East_Asian: 0.5
Atlantic_Baltic: 1.5
South_Asian: 32.9
Sub_Saharan: 0.0
Southern: 0.0
Siberian: 1.8
West_Asian: 63.2

UPDATE III (November 22): Eurogenes estimates that there is 4% "Northeast European" admixture for Kalash individual HGDP00302. He managed to avoid the creation of a Kalash-specific component by including only a single Kalash individual in an ADMIXTURE experiment.

The Kalash do tend to create their own Kalash-specific component, and a good way to avoid such a component is to include each of them individually, and repeat the analysis 23 times. An alternative, and less time-consuming, way is to create a single synthetic individual using the allele frequencies of the Kalash population as a whole. Even simpler, one could randomly pick a single individual (such as HGDP00302), but at the risk of picking an individual that has either much more or much less than average of a particular type of ancestry.
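The synthetic-individual option can be illustrated with a minimal sketch. It assumes that genotypes are coded as 0, 1 or 2 copies of a counted allele (None for a missing call) and that the two alleles of the synthetic individual are drawn independently from the pooled population frequency at each SNP. The function names and the toy data are hypothetical; this is not code from DIYDodecad or ADMIXTURE.

```python
import random


def allele_frequencies(genotypes):
    """Per-SNP frequency of the counted allele, pooled over all individuals.

    genotypes: list of individuals, each a list of 0/1/2 allele counts
    (None marks a missing call). Returns one frequency per SNP, or None
    when no individual has a call at that SNP.
    """
    n_snps = len(genotypes[0])
    freqs = []
    for j in range(n_snps):
        calls = [ind[j] for ind in genotypes if ind[j] is not None]
        freqs.append(sum(calls) / (2 * len(calls)) if calls else None)
    return freqs


def synthetic_individual(freqs, rng):
    """Draw two alleles per SNP independently from the pooled frequency."""
    individual = []
    for p in freqs:
        if p is None:
            individual.append(None)  # keep the SNP missing
        else:
            individual.append(int(rng.random() < p) + int(rng.random() < p))
    return individual


# Toy data: 3 individuals x 4 SNPs (hypothetical, for illustration only).
population = [
    [0, 1, 2, None],
    [0, 2, 2, None],
    [0, 1, 2, None],
]
freqs = allele_frequencies(population)
synthetic = synthetic_individual(freqs, random.Random(0))
print(freqs)
print(synthetic)
```

A single draw of this kind carries sampling noise at every SNP with an intermediate frequency, but it does not depend on which real individual happened to be picked.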

Below are the admixture proportions of all the 23 Kalash individuals from the unsupervised ADMIXTURE run of UPDATE II. Individual HGDP00302 is 4th of 23 in terms of its "Atlantic_Baltic" component that peaks in Lithuanians (3%). The Kalash have 1.5% "Atlantic_Baltic" on average (median=1%, standard deviation=2.1%).

ID East_Asian Atlantic_Baltic South_Asian Sub_Saharan Southern Siberian West_Asian
HGDP00279 0.007 0.081 0.361 0 0 0.031 0.521
HGDP00307 0.004 0.059 0.336 0 0 0.018 0.583
HGDP00315 0.019 0.036 0.338 0 0 0 0.606
HGDP00302 0.006 0.03 0.337 0 0 0.02 0.608
HGDP00311 0.014 0.029 0.325 0 0 0.021 0.611
HGDP00285 0 0.027 0.319 0 0 0.019 0.635
HGDP00333 0 0.02 0.324 0 0 0.018 0.638
HGDP00277 0 0.016 0.334 0 0 0.021 0.63
HGDP00298 0.012 0.016 0.325 0 0 0.016 0.631
HGDP00281 0.011 0.015 0.332 0 0 0.01 0.633
HGDP00304 0.007 0.012 0.329 0 0 0.013 0.638
HGDP00290 0.007 0.01 0.325 0 0 0.021 0.637
HGDP00274 0.007 0.004 0.341 0 0 0.013 0.635
HGDP00309 0.007 0 0.317 0 0 0.019 0.656
HGDP00330 0 0 0.335 0 0 0.026 0.639
HGDP00319 0.011 0 0.328 0 0 0.01 0.651
HGDP00288 0.004 0 0.339 0 0 0.013 0.644
HGDP00286 0 0 0.329 0 0 0.018 0.653
HGDP00313 0 0 0.351 0 0 0.015 0.634
HGDP00328 0 0 0.31 0 0 0.023 0.667
HGDP00267 0 0 0.332 0 0 0.022 0.647
HGDP00326 0 0 0.307 0 0 0.03 0.663
HGDP00323 0.002 0 0.304 0 0 0.013 0.68
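For readers who want to check the summary figures against the table, here is a small sketch that recomputes them from the "Atlantic_Baltic" column. The values are copied from the table above; using the sample standard deviation is my assumption, since the text does not say which estimator was used.

```python
import statistics

# "Atlantic_Baltic" proportions per individual, copied from the table above.
atlantic_baltic = {
    "HGDP00279": 0.081, "HGDP00307": 0.059, "HGDP00315": 0.036,
    "HGDP00302": 0.03, "HGDP00311": 0.029, "HGDP00285": 0.027,
    "HGDP00333": 0.02, "HGDP00277": 0.016, "HGDP00298": 0.016,
    "HGDP00281": 0.015, "HGDP00304": 0.012, "HGDP00290": 0.01,
    "HGDP00274": 0.004, "HGDP00309": 0.0, "HGDP00330": 0.0,
    "HGDP00319": 0.0, "HGDP00288": 0.0, "HGDP00286": 0.0,
    "HGDP00313": 0.0, "HGDP00328": 0.0, "HGDP00267": 0.0,
    "HGDP00326": 0.0, "HGDP00323": 0.0,
}

values = list(atlantic_baltic.values())

# Summary statistics, expressed in percent.
mean_pct = 100 * statistics.mean(values)
median_pct = 100 * statistics.median(values)
sd_pct = 100 * statistics.stdev(values)  # sample standard deviation (assumption)

# Rank of one individual: 1 + number of individuals with a strictly higher value.
target = "HGDP00302"
rank = 1 + sum(v > atlantic_baltic[target] for v in values)

print(f"n={len(values)} mean={mean_pct:.1f}% median={median_pct:.1f}% sd={sd_pct:.1f}%")
print(f"{target} ranks {rank} of {len(values)}")
```

The same few lines can be pointed at any other column of the table to see how far a single individual sits from the population average.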

Wednesday, October 26, 2011

'eurasia7' calculator

This calculator was made with 196 different populations and 2,659 individuals, including 518 project participants. The following Dodecad populations do not have 5 individuals yet, so they are included in the OTHERS_D generic category:
Algerian_D, North_African_Jews_D, Slovenian_D, Mixed_Scandinavian_D, Danish_D, Moroccan_D, Tunisian_D, Serb_D, Austrian_D, Saudi_D, Pakistani_D, Tatar_Various_D, Palestinian_D, Greek_Italian_D, Romanian_D, Swiss_German_D, Szekler_D, Mandaean_D, Azeri_D, Czech_D, Georgian_D, Belgian_D, Latvian_D, Estonian_D, Bangladesh_D, Yemenese_D, Sri_Lanka_D, Hungarian_D, Basque_D, Udmurt_D, Egyptian_D
As always, I encourage people with 4 grandparents from the same country or ethnic group of Eurasia, North or East Africa to contact me (do not send data!) for possible inclusion in the Project. If I have overlooked any such individuals, drop me a line (my e-mail address is at the bottom of the blog). I usually start a new _D population whenever individuals with 4 grandparents from the same group are submitted, but I may have missed some.

Note that all individuals from the reference populations have also been included, including outliers; you should be aware of this when reading the population averages, and consult the Outliers tab in the v3 spreadsheet for some instances of outliers.
Due to image size restrictions in Picasa, the labels are not clearly visible. A large version of the above plot can be found in the download bundle.

The seven ancestral populations inferred at this level of resolution are:
  • Sub_Saharan
  • West_Asian
  • Atlantic_Baltic
  • East_Asian
  • Southern
  • South_Asian
  • Siberian
As usual, you should take these names as useful labels, and interpret them in conjunction with the components' distribution in different populations, and their Fst distances, both of which can be found in the spreadsheet.

The table of Fst distances:

Below you can see a neighbor-joining tree based on inter-population Fst distances:
The first six dimensions of a multi-dimensional scaling of the same:

Calculator Files:

  • The spreadsheet contains population averages, the table of Fst distances, and individual results for included Project participants.
  • The download RAR file (Google Docs or Sendspace) contains all the files needed to run the calculator. You must download and install DIYDodecad 2.1 first. In order to run the calculator, you follow the instructions of the README file, but type 'eurasia7' instead of 'dv3'.

Terms of use: 'eurasia7', including all files in the downloaded RAR file is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Technical Details:

The calculator is built using allele frequencies of K=7 ancestral components inferred by ADMIXTURE 1.21 analysis of 2,659 individuals. Markers included in the source datasets, as well as the Family Finder and 23andMe (as of Oct 21) platforms were included. The marker set was thinned of markers with less than 99.5% genotype rate or less than 0.5% minor allele frequency. Linkage-disequilibrium based pruning was carried out with a window size of 250 SNPs, advanced by 25 SNPs and R-squared greater than 0.4. A total of 164,990 SNPs remained after these filtering steps.
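The two per-marker thresholds can be sketched in a few lines. This is only an illustration of the genotype-rate and minor-allele-frequency filters under an assumed 0/1/2 allele-count coding (None for a missing call). It is not the pipeline actually used, the linkage-disequilibrium pruning step is left out, and the function name and toy markers are hypothetical.

```python
def passes_filters(calls, min_genotype_rate=0.995, min_maf=0.005):
    """Decide whether one SNP survives the two thresholds.

    calls: allele counts (0/1/2) for every individual at this SNP,
    with None for a missing call.
    """
    observed = [c for c in calls if c is not None]
    if not observed:
        return False
    genotype_rate = len(observed) / len(calls)
    freq = sum(observed) / (2 * len(observed))
    maf = min(freq, 1 - freq)  # minor allele frequency
    return genotype_rate >= min_genotype_rate and maf >= min_maf


# Toy markers over 1,000 individuals (hypothetical, for illustration only).
snps = {
    "rs_pass": [1] * 20 + [0] * 980,                    # MAF 1%, fully called
    "rs_low_call": [None] * 10 + [1] * 20 + [0] * 970,  # only 99% called
    "rs_rare": [1] * 4 + [0] * 996,                     # MAF 0.2%
}
kept = [name for name, calls in snps.items() if passes_filters(calls)]
print(kept)
```

A marker is dropped as soon as it fails either threshold, which is why only the first toy marker is kept.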

All relevant populations available to me, and genotyped at a sufficient number of markers were included. Inclusion of the Kalash population resulted in a population-specific component at K=7, and hence their admixture components were inferred a posteriori. Their proportions are consistent with previous results, showing them to be a "West Asian" population (62.4%) with substantial "South Asian" admixture (37.1%), and near-complete absence of any other genetic components.

Friday, October 21, 2011

23andMe data file changes

A few recent submissions to the Project have alerted me to the fact that 23andMe has been making changes to its data download. It is unfortunate that such changes are made apparently "silently", as they may negatively impact third-party tools built around the 23andMe data.

Apparently, the orientation of some SNPs has been changed. This should not be a problem for DIYDodecad, as it handles orientation of different companies automatically. A different problem is that apparently some SNPs have been dropped from the data file altogether. This is a problem if you are not using the latest 2.1 version of the DIYDodecad software, so you should upgrade to 2.1. It seems that about 80 of the SNPs expected by the 'dv3' calculator have been removed from the data file download, and these will appear as "absent". I do not expect 80 absent SNPs to have a huge impact on results, as they make up less than 0.1% of all SNPs in dv3.

The change in format has other consequences as well; as many of you know, I have been working on Dodecad v4 for some time now. This would use common markers between the 23andMe v2 and v3 platforms and the Family Finder Illumina platform. However, I will now have to backtrack on it, to make sure that the marker set used is actually consistent with people's current 23andMe downloads.

If you have a fresh 23andMe downloaded file and DIYDodecad 2.1 and you are unable to run 'dv3' or any other Project calculators, drop me a line.

Eurogenes is upset

Eurogenes seems to be upset this week, first throwing a tantrum at Dr. McDonald and then at myself. You can probably find the cached text in Google for some time, although Eurogenes has deleted his anti-McDonald tantrum, and changed the verbiage on the one directed against me on advice of some more cool-headed people. Here is the epilogue of his original anti-Dienekes rant:
Dienekes, you've got a spreadsheet online showing all sorts of weird things. You need stop being a prat, and do something about it ASAP.
Eurogenes' animus towards me is not surprising for those who have followed our interactions since the old days. Of course he is benefiting from my work (I have pointed him towards data he didn't know existed, he is using DIYDodecad, as well as the 1000Genomes data extracted with my code by the MDLP), so one would think that if he had any criticism against me, he would at least express it in a more dignified way.

Of course, being rude, ungrateful and mean-spirited does not mean one is wrong! So, what has Eurogenes actually discovered?

He noted the high Ukrainian West/East European ratio produced by Dodecad v3, and objected to my idea that Ukrainians were transitional to the Balkans and the Caucasus. Actually, according to the PCA plot of the Yunusbayev et al. (2011) paper, they are transitional, being situated toward both the Balkans and the Caucasus, relative to Belorussians/Lithuanians, i.e., the populations that generally show peaks of East European-related components. This is also supported by the ADMIXTURE analysis that reveals Ukrainians to possess a Caucasus-centered component largely lacking in other Eastern Slavs, but shared with Balkan/Caucasus populations.

Should I have not tested the new Yunusbayev data with Dodecad v3 and reported their results? Of course not. When one has a measuring instrument, one uses it on new data to test its performance and reports what he sees. This is exactly what I have done. At the same time, one uses the new data to create new measuring instruments that have been trained using all available data, which is also what I have done with euro7 and the upcoming Dodecad v4.

To make matters worse, Eurogenes suggests that my euro7 analysis agrees with his K=10 which was presented two weeks later. So, apparently, I am posting correct information about Ukrainians 2 weeks before he does, and this means that I am turning around to his way of thinking rather than vice versa. Go figure.

Eurogenes continues with his posting of supposed MDS/PCA plots supporting his thesis. Actually, what he has posted are plots based on metric distances in the space of admixture proportions; these are not genetic distances because e.g., a +/- 1% difference in a Sub-Saharan component results in the same Euclidean distance difference as a +/-1% in a European one, although the former affects genetic distance much more strongly than the latter. Metric distances are fine to quickly determine closeness of samples in the space of admixture proportions, but they are certainly no substitute for real genetic distances. I have already linked above to evidence that Ukrainians are transitional to the Balkans and the Caucasus relative to the Yunusbayev et al. populations.
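The point about metric distances can be seen in a toy example. The three components and the proportions below are hypothetical and chosen only for illustration. Moving 1% of ancestry into one component or into another gives the same Euclidean distance from the starting point, regardless of how genetically divergent the two components are.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two vectors of admixture proportions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical proportions, ordered as (European, West_Asian, Sub_Saharan).
reference = (0.80, 0.20, 0.00)
shift_sub_saharan = (0.79, 0.20, 0.01)  # 1% moved from European to Sub_Saharan
shift_west_asian = (0.79, 0.21, 0.00)   # 1% moved from European to West_Asian

d_sub_saharan = euclidean(reference, shift_sub_saharan)
d_west_asian = euclidean(reference, shift_west_asian)

print(d_sub_saharan, d_west_asian)
print(math.isclose(d_sub_saharan, d_west_asian))
```

A distance that is meant to reflect genetics would have to weight each component by how divergent it is from the others, for example through the Fst distances between components, and that is exactly the information a plot in admixture-proportion space throws away.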

I am also, apparently, accused of neglecting to point out the deficiencies of Dodecad v3, and I am invited by Eurogenes to retract it completely! This proposal is equivalent to the idea that we should burn old topographic maps that were based on measurements with sticks, ropes, and trigonometers, because we can now measure distances with laser beams. And, it is funny indeed that I am supposedly neglecting the deficiencies of Dodecad v3 when, 3 weeks before the Eurogenes rant, I posted exactly what its limitations are, and how it can be made better.

It is unfortunate that Eurogenes has chosen to go down that path. Envy is not a good guide to behavior, and perhaps, instead of relishing the prospect of putting others down, he could spend a little more time inventing something of his own.

As for myself, I will continue to work on my tools, and to encourage cross-pollination between different projects for the benefit of all.

UPDATE: In a newer post, Eurogenes attempts to justify his mishandling of MDS, by suggesting that he presented results based on raw SNP data. This is of course nonsense, since Eurogenes does not have the raw SNP data of the Dodecad populations. He is comparing apples and oranges by comparing plots made on raw data with those made in the space of admixture proportions. Furthermore, his supposed findings have no bearing on the Yunusbayev et al. ADMIXTURE and PCA results, posted above.

Thursday, October 20, 2011

Comparing different ADMIXTURE runs using Zombies

My idea of using zombies with ADMIXTURE is the gift that keeps on giving. Remember that "zombies" are synthetic individuals created from ADMIXTURE output, representing the K inferred ancestral components. They can be viewed as hypothetical ancestral individuals representing each of these K components without any admixture from any of the others.
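For those who want to experiment, here is a minimal sketch of how such zombies could be generated. It assumes a table of allele frequencies with one row per SNP and one column per ancestral component, in the style of ADMIXTURE's allele-frequency output, and it draws the two alleles of each zombie independently from the frequency of a single component. The function names and the toy table are hypothetical, and this is not the code of DIYDodecad.

```python
import random


def parse_p_matrix(text):
    """Parse whitespace-separated rows: one row per SNP, one column per component."""
    return [[float(x) for x in line.split()] for line in text.strip().splitlines()]


def make_zombies(p_matrix, n_per_component, rng):
    """Return {component_index: list of zombies}; each zombie is a list of
    0/1/2 allele counts drawn from that component's frequencies alone."""
    k = len(p_matrix[0])
    zombies = {}
    for comp in range(k):
        zombies[comp] = [
            [int(rng.random() < row[comp]) + int(rng.random() < row[comp])
             for row in p_matrix]
            for _ in range(n_per_component)
        ]
    return zombies


# Hypothetical 4-SNP, 2-component frequency table, for illustration only.
toy = """
0.0 1.0
1.0 0.0
0.5 0.5
0.9 0.1
"""
p = parse_p_matrix(toy)
zombies = make_zombies(p, n_per_component=50, rng=random.Random(1))
print(len(zombies), len(zombies[0]), zombies[0][0])
```

Because each zombie is drawn from one column only, it carries no admixture from the other components by construction, which is what makes it usable as a probe for another calculator.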

An interesting problem that often comes up is to compare across different ADMIXTURE runs. I can think of at least three different applications of this:
  1. To compare components across different K; for example, how does a "West Asian"-centered component at K=5 differ from a similarly-centered component at K=12?
  2. To compare components across different datasets; for example, how does a "West Asian"-centered component inferred from an existing dataset (e.g., the current Dodecad v3) differ from a "West Asian"-centered one from a new dataset (e.g., the upcoming Dodecad v4, which will also be trained on the valuable new populations of Yunusbayev et al. 2011)?
  3. To compare components across different projects; there has been a proliferation of different ancestry projects since the launching of Dodecad nearly a year ago, and since all of them use slightly different individuals/SNPs/terminology, it is quite useful to be able to gauge how one component from one project maps onto other components in other projects.
As proof of concept, I took the MDLP calculator from the Magnus Ducatus Lituaniae Project and generated 50 zombies for each of its 7 ancestral components:
  1. Scandinavian
  2. Volga_Region
  3. Altaic
  4. Celto_Germanic
  5. Caucassian_Anatolian_Balkanic
  6. Balto_Slavic
  7. North_Atlantic
I then inferred the ancestry of the MDLP zombies using Dodecad v3, and vice versa. Since Dodecad v3 also includes populations (e.g., Africans) not considered by MDLP, I did not try to map those onto MDLP.

I will comment on the MDLP-to-dv3 mapping:
  1. The MDLP "Scandinavian" component appears to be West/East European with a little Mediterranean and a little Northeast Asian
  2. The MDLP "Volga_Region" component appears to be East European with some Northeast Asian
  3. The MDLP "Altaic" component is West Asian+Northeast Asian+Southeast Asian. Note that in Dodecad v3, the Northeast Asian component peaks at Chukchi, Nganasan, and Koryak, and most other east Eurasian populations have much less of it
  4. The MDLP "Celto-Germanic" component is (surprisingly) Mediterranean-dominated. One possible interpretation is that in the context of MDLP this captures one aspect of the difference between Southwestern and Northeastern Europe -higher Mediterranean in the former-, whereas the...
  5. ... MDLP "North-Atlantic" component seems to be entirely West European, and is capturing a different aspect of east-west variation in Europe.
  6. The MDLP "Balto-Slavic" appears the reverse of the "Celto-Germanic" with lower Mediterranean and reversed East/West European
  7. Finally, the MDLP "Caucassian_Anatolian_Balkanic" component is predictably mainly West Asian, but with a little Mediterranean and Southwest Asian as well
A different way of comparing the different components is to include them all in a joint MDS plot, or calculate various types of distances between them (e.g., Fst).

For example, the first couple of dimensions are dominated by the African/Asian components of Dodecad v3 that are not present in MDLP. Notice, however, the position of "Altaic", right where one might expect to find it between West and East Eurasians.

Limiting ourselves to only European populations, we obtain:

It appears that the "North_Atlantic" component may be centered on a small number of related individuals.

I encourage other genome bloggers to try their own hand at comparing their components with those of other projects, or even their own. This process will be made possible if people using ADMIXTURE follow the simple instructions to convert their output for use with DIYDodecad.

Once Dodecad v4 is off the ground, and if I find time to fully automate the process, I will perhaps try to map all my past calculators (i.e., the initial K=10, Dodecad v3, 'bat', 'euro7', 'weac', 'africa9') onto the new gold standard of the Project.

PS: This analysis was done on ~63k SNPs in common between MDLP and Dodecad v3.