Showing posts with label South Asians. Show all posts

Thursday, January 19, 2012

fastIBD analysis of South Asia

Please refer to the previous analysis on the Balkans/West Asia for more information about the interpretation of this type of analysis.

Clusters Galore


The Clusters Galore analysis can be found in the spreadsheet. 59 clusters were inferred with 47 MDS dimensions. The very fine-scale structure (I only considered the first 50 dimensions, but many more seemed significant than in any previous experiment) is probably the result of the size of the South Asian population, as well as the practice of endogamy associated with the caste system. High intra-population IBD sharing is also evident in the following (notice how well-defined the diagonal is):

Inter-Population IBD




Results for Dodecad participants

They can be found in the spreadsheet. Many Project participants belong to a population with 1 or 2 individuals, so cluster #1 seems to be a generalized catch-all for many such individuals. Individuals from the two sub-populations that I've identified recently, Iyer_D and Jatt_D, all belong to the same cluster. The Iyer_D cluster (#4) also seems to include the Iyengar project participants, as might be expected.

It is also interesting how all Dodecad participants fall in just 7 of the 59 clusters. This goes to show how truly diverse people from the Indian subcontinent are. I fully expect that with more participation further structure will be revealed, since it seems that due to endogamy it only takes a few participants from each ethnic group for a specific cluster pertaining to that group to be identified. So, I invite people from South Asia to join the Project during this submission opportunity.

Monday, October 31, 2011

Origin of Kalash inferred with Eurogenes K=10 "test" calculator

Vasishta is asking Eurogenes for help in demonstrating that the Kalash have Northern-European-specific segments:
Yes. He keeps citing the Kalash as proof that the Indo-Iranians were an almost exclusively a West-Asian like population, even though I personally think the mainly West Asian-South Asian assortment of the Kalash in his analyses might be an artifact of their inbreeding and isolation, thus confusing ADMIXTURE. Zack's K=11 at Harappa has shown that the Kalash display around 22% of the component modal in Lithuanians. Yet, he ignores the North/Eastern European admixture in Northwest Indians and North Indian Brahmins (in his own analyses at that!). Interestingly enough the aforementioned groups tend to score a sliver of Northeast European admixture in Dr.Doug McDonald's analyses, with the top matches for that sliver usually being Lithuanians, Russians and Finns; in that order. It (NEU) is even found in frequencies of around 4-6% in Dravidian-speaking southern Brahmins. As much as I hate to say it, he is indeed rather stubborn and has somewhat of an underlying agenda.

David, I think you should look into proving that the Kalash do indeed have some NEU-specific segments. I would be super-surprised if they didn't, given that more mixed populations south of their geographical area display it themselves.

It appears that Vasishta disagrees with me because he "personally thinks" that the admixture proportions of the Kalash are due to inbreeding and the limitations of ADMIXTURE.

Does he cite any studies or make any argument why ADMIXTURE would remove precisely the component that he is so eager to be present? No. While genetic drift in an isolated population could indeed lead to the loss of genetic diversity, there is no reason to think that this would lead preferentially to the loss of Northern-European segments. It is strange that Vasishta accuses me of bias and yet, at the same time, invokes the magic of some unspecified flaw of ADMIXTURE for the loss of his favorite component.

Vasishta invokes the Harappa Ancestry Project K=11 admixture analysis in support of his idea that the Kalash have 22% of the component modal in Lithuanians. However, he neglects to mention that at K=11 there is no West-Asian or Caucasus centered component in the HAP analysis, but rather only "European" (modal in Lithuanians) and "SW Asian" (modal in Yemen Jews). It is indeed strange that he accuses me of bias for providing evidence about the relationship of the Kalash with West Asia, while at the same time, showing preference for a level of analysis where such a component is lacking.

The West Eurasian cline between Arabia and Northeastern Europe is evident in the 'weac' admixture analysis, where the European-centered component (Atlantic-Baltic) is present in populations such as Assyrians and Armenians at a coarse level of resolution, whereas it is lacking at the appropriate level of resolution. Therefore, the fact that the Kalash show "European" admixture at the level of Europe vs. Near East does not mean that they ought to show such admixture at the level of Europe vs. West Asia/Caucasus vs. Arabia.

One of the benefits of DIYDodecad has been the availability of data from projects that have hitherto been black boxes. In the interest of transparency, I have taken the Eurogenes K=10 "test" calculator and repeated my analysis of the Kalash, who had previously been shown by me to be a fairly simple West/South Asian mix. I could have waited for him to get around to it, but since he's quick on the talk and slow on the trigger, I decided to do it for him.

The admixture proportions of the Kalash, according to the Eurogenes K=10, are: 40.3% S_Asian, 58.7% W_Asian, 0.9% N_E_Euro, 0.1% N_Asian. Hence, the analysis based on the Eurogenes K=10 components confirms the analysis based on my eurasia7, "showing the Kalash to be a "West Asian" population (62.4%) with substantial "South Asian" admixture (37.1%), and near-complete absence of any other genetic components."

Eurogenes alleges, not without his usual charm, that:
Dienekes has a keen eye for things he wants to see. But he hasn't yet noticed that in all accurate analyses, there's significant Eastern European admix in North India. His monocle got fogged up in that instance.
Let us consider some pertinent facts: the Indian peninsula has been invaded multiple times from Central Asia, a process that continued long after the establishment of the Indo-Aryans during the 2nd millennium BC. Eurogenes may want to think that the "Eastern European" admixture in South Asia dates to his mythological Polish Indo-Europeans, galloping across the steppes on their horses, but there is, at present, no particular reason to think that this is the case.

Furthermore, among populations from the northern parts of the Indian subcontinent, the "West Asian" component as a fraction of the "West Asian" + "Atlantic-Baltic" components reaches a minimum of 77% in the Pathans. His own monocle is surely in greater need of de-fogging if I miss the 23% and he misses the >77%.

Indeed, the Europe vs. Caucasus ratio in Indian subcontinental populations is similar to that found in people from the Middle East and Caucasus region. It is not surprising that Eurogenes has abandoned his search for North European components in South Asia, going as far as reconstructing Ancestral North Indians as "Northern Europeans". Needless to say, he was wrong. The West Eurasian ancestry of the population of the Indian subcontinent is similar to that found in modern West Asian populations, not Slavs.

Eurogenes promises:
This shouldn't be too difficult. I'll use Dienekes' calculator for the job, and then check the results with LAMP. How poetic.
Been there, done that. It will be fun to see what "Northern European" components he will be able to squeeze out of the 0.9% N_E_Euro component that my software, in conjunction with his "test" calculator produces.

Why are the Kalash important?

There are three reasons why the Kalash are important in the study of Eurasian prehistory:
  1. Their mountainous habitat contributed to isolation and relative immunity from historical population movements
  2. Their non-Islamic religion has definitely preserved them from recent gene inflow
  3. Their language is unique within the Indo-Aryan family, and it is often considered today as part of a separate Dardic family of Indo-Iranian in addition to the more populous Iranian and Indo-Aryan families.
The Kalash are crucial for those interested in the origins of Indo-Iranians, and the fact that they are, indeed, a simple West/South Asian mix is not without significance for that question.

UPDATE:

Here is the result of a PCA analysis of the Kalash together with 50 synthetic individuals from each of the S_Asian, W_Asian, and N_E_Euro components of Eurogenes K=10 "test". This was calculated with smartpca with numoutlieriter set to 0.

It is evident that the Kalash appear to fall on the S_Asian to W_Asian line, and toward the W_Asian pole, consistent with being a population of those two origins, with the W_Asian component predominating.

UPDATE II:

As mentioned in the eurasia7 post, the Kalash tend to form population-specific components in ADMIXTURE analyses, so they are generally not included in my runs. So, I ran the K=7 analysis again, but this time I included the Kalash. Here are the top populations of the component that was modal in the Kalash:

[186,] "Kurd_D" "50.2"
[187,] "Kurds_Y" "50.7"
[188,] "Armenian_D" "50.9"
[189,] "Armenians_Y" "51.2"
[190,] "Adygei" "51.5"
[191,] "Chechens_Y" "53"
[192,] "North_Ossetians_Y" "53.2"
[193,] "Lezgins" "54.4"
[194,] "Georgians" "59.8"
[195,] "Georgian_D" "60.1"
[196,] "Abhkasians_Y" "60.5"
[197,] "Kalash" "63.2"

Here are their exact admixture proportions in this unsupervised ADMIXTURE run:

Kalash N=23
East_Asian: 0.5
Atlantic_Baltic: 1.5
South_Asian: 32.9
Sub_Saharan: 0.0
Southern: 0.0
Siberian: 1.8
West_Asian: 63.2

UPDATE III (November 22): Eurogenes estimates that there is 4% "Northeast European" admixture for Kalash individual HGDP00302. He managed to avoid the creation of a Kalash-specific component by including only a single Kalash individual in an ADMIXTURE experiment.

The Kalash do tend to create their own Kalash-specific component, and a good way to avoid such a component is to include each of them individually, and repeat the analysis 23 times. An alternative, and less time-consuming, way is to create a single synthetic individual using the allele frequencies of the Kalash population as a whole. Even simpler, one could randomly pick a single individual (such as HGDP00302), but at the risk of picking an individual that has either much more or much less of a particular type of ancestry than average.

Below are the admixture proportions of all the 23 Kalash individuals from the unsupervised ADMIXTURE run of UPDATE II. Individual HGDP00302 ranks 4th of 23 in terms of the "Atlantic_Baltic" component that peaks in Lithuanians, with 3%. The Kalash have 1.5% "Atlantic_Baltic" on average (median=1%, standard deviation=2.1%).

ID East_Asian Atlantic_Baltic South_Asian Sub_Saharan Southern Siberian West_Asian
HGDP00279 0.007 0.081 0.361 0 0 0.031 0.521
HGDP00307 0.004 0.059 0.336 0 0 0.018 0.583
HGDP00315 0.019 0.036 0.338 0 0 0 0.606
HGDP00302 0.006 0.03 0.337 0 0 0.02 0.608
HGDP00311 0.014 0.029 0.325 0 0 0.021 0.611
HGDP00285 0 0.027 0.319 0 0 0.019 0.635
HGDP00333 0 0.02 0.324 0 0 0.018 0.638
HGDP00277 0 0.016 0.334 0 0 0.021 0.63
HGDP00298 0.012 0.016 0.325 0 0 0.016 0.631
HGDP00281 0.011 0.015 0.332 0 0 0.01 0.633
HGDP00304 0.007 0.012 0.329 0 0 0.013 0.638
HGDP00290 0.007 0.01 0.325 0 0 0.021 0.637
HGDP00274 0.007 0.004 0.341 0 0 0.013 0.635
HGDP00309 0.007 0 0.317 0 0 0.019 0.656
HGDP00330 0 0 0.335 0 0 0.026 0.639
HGDP00319 0.011 0 0.328 0 0 0.01 0.651
HGDP00288 0.004 0 0.339 0 0 0.013 0.644
HGDP00286 0 0 0.329 0 0 0.018 0.653
HGDP00313 0 0 0.351 0 0 0.015 0.634
HGDP00328 0 0 0.31 0 0 0.023 0.667
HGDP00267 0 0 0.332 0 0 0.022 0.647
HGDP00326 0 0 0.307 0 0 0.03 0.663
HGDP00323 0.002 0 0.304 0 0 0.013 0.68
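
The summary figures quoted above for the "Atlantic_Baltic" column can be re-derived from the table with a few lines of Python. This is only a quick check, not part of the original analysis: the values are copied from the second column of the table, and the standard deviation is assumed to be the sample standard deviation.

```python
import statistics

# "Atlantic_Baltic" column of the 23 Kalash individuals, copied from the table
atlantic_baltic = [
    0.081, 0.059, 0.036, 0.030, 0.029, 0.027, 0.020, 0.016, 0.016, 0.015,
    0.012, 0.010, 0.004, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
]

mean_pct = 100 * statistics.mean(atlantic_baltic)
median_pct = 100 * statistics.median(atlantic_baltic)
sd_pct = 100 * statistics.stdev(atlantic_baltic)  # sample standard deviation (assumed)

# Rank of HGDP00302 (0.03) when sorting from highest to lowest
rank_00302 = sorted(atlantic_baltic, reverse=True).index(0.030) + 1

print(f"mean={mean_pct:.1f}% median={median_pct:.1f}% sd={sd_pct:.1f}% rank={rank_00302}")
```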

Wednesday, October 26, 2011

'eurasia7' calculator

This calculator was made with 196 different populations and 2,659 individuals, including 518 project participants. The following Dodecad populations do not have 5 individuals yet, so they are included in the OTHERS_D generic category:
Algerian_D, North_African_Jews_D, Slovenian_D, Mixed_Scandinavian_D, Danish_D, Moroccan_D, Tunisian_D, Serb_D, Austrian_D, Saudi_D, Pakistani_D, Tatar_Various_D, Palestinian_D, Greek_Italian_D, Romanian_D, Swiss_German_D, Szekler_D, Mandaean_D, Azeri_D, Czech_D, Georgian_D, Belgian_D, Latvian_D, Estonian_D, Bangladesh_D, Yemenese_D, Sri_Lanka_D, Hungarian_D, Basque_D, Udmurt_D, Egyptian_D
As always, I encourage people with 4 grandparents from the same country or ethnic group of Eurasia, North or East Africa to contact me (do not send data!) for possible inclusion in the Project. If I have overlooked any such individuals, drop me a line (my e-mail address is at the bottom of the blog). I usually start a new _D population whenever individuals with 4 grandparents from the same group are submitted, but I may have missed some.

Note that all individuals from the reference populations have also been included, including outliers; you should be aware of this when reading the population averages, and consult the Outliers tab in the v3 spreadsheet for some instances of outliers.
Due to image size restrictions in Picasa, the labels are not clearly visible. A large version of the above plot can be found in the download bundle.

The seven ancestral populations inferred at this level of resolution are:
  • Sub_Saharan
  • West_Asian
  • Atlantic_Baltic
  • East_Asian
  • Southern
  • South_Asian
  • Siberian
As usual, you should take these names as useful labels, and interpret them in conjunction with the components' distribution in different populations, and their Fst distances, both of which can be found in the spreadsheet.

The table of Fst distances:


Below you can see a neighbor-joining tree based on inter-population Fst distances:
The first six dimensions of a multi-dimensional scaling of the same:





Calculator Files:

  • The spreadsheet contains population averages, the table of Fst distances, and individual results for included Project participants.
  • The download RAR file (Google Docs or Sendspace) contains all the files needed to run the calculator. You must download and install DIYDodecad 2.1 first. In order to run the calculator, you follow the instructions of the README file, but type 'eurasia7' instead of 'dv3'.

Terms of use: 'eurasia7', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Technical Details:

The calculator is built using allele frequencies of K=7 ancestral components inferred by ADMIXTURE 1.21 analysis of 2,659 individuals. Markers included in the source datasets, as well as the Family Finder and 23andMe (as of Oct 21) platforms, were included. The marker set was thinned of markers with less than 99.5% genotype rate or less than 0.5% minor allele frequency. Linkage-disequilibrium based pruning was carried out with a window size of 250 SNPs, advanced by 25 SNPs and R-squared greater than 0.4. A total of 164,990 SNPs remained after these filtering steps.
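
The two per-marker thresholds described above (genotype rate and minor allele frequency) can be illustrated with a small Python sketch. It is not the tool actually used for the calculator: the function name `passes_filters` and the encoding (0, 1 or 2 copies of an allele, `None` for a missing call) are assumptions made for the example, a marker is dropped if it fails either threshold, and the LD-pruning step is not shown.

```python
def passes_filters(calls, min_call_rate=0.995, min_maf=0.005):
    """Decide whether one SNP survives the genotype-rate and MAF filters.

    calls: per-individual allele counts for this SNP (0, 1 or 2), with None
    for a missing genotype (an encoding assumed for this sketch).
    """
    called = [c for c in calls if c is not None]
    if not called:
        return False
    call_rate = len(called) / len(calls)
    freq = sum(called) / (2 * len(called))  # each call carries two alleles
    maf = min(freq, 1 - freq)               # frequency of the rarer allele
    return call_rate >= min_call_rate and maf >= min_maf


# Toy data with relaxed thresholds, purely for illustration
snps = {
    "snp_a": [0, 1, 2, 1],     # fully called and polymorphic: kept
    "snp_b": [0, 0, 0, 0],     # monomorphic: fails the MAF filter
    "snp_c": [None, 1, 1, 0],  # 75% call rate: fails the genotype-rate filter
}
kept = [name for name, calls in snps.items() if passes_filters(calls, 0.9, 0.05)]
print(kept)
```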

All relevant populations available to me, and genotyped at a sufficient number of markers were included. Inclusion of the Kalash population resulted in a population-specific component at K=7, and hence their admixture components were inferred a posteriori. Their proportions are consistent with previous results, showing them to be a "West Asian" population (62.4%) with substantial "South Asian" admixture (37.1%), and near-complete absence of any other genetic components.

Tuesday, July 19, 2011

The Dodecad Oracle v1

Here is a fun little tool that tests the Dodecad v3 admixture proportions of an individual against all the reference populations, but also against the best pairwise combinations of these populations.

You need to install R to use it, and then download the program and double click on the file DodecadOracleV1.RData that can be found within the rar file. You will then be faced with a command prompt where you can enter the following commands:

Examining which populations are available

Just enter

X[,1]

You will see a list of 227 populations. You can use these population IDs in the next section.

Which populations are closest to a particular population?

Enter:

DodecadOracle("British_D")
[,1] [,2]
[1,] "British_D" "0"
[2,] "British_Isles_D" "0.9798"
[3,] "Cornwall_1KG" "1.1533"
[4,] "Kent_1KG" "2.265"
[5,] "Irish_D" "3.7643"
[6,] "Dutch_D" "4.5354"
[7,] "Mixed_Germanic_D" "6.8971"
[8,] "Norwegian_D" "11.3111"
[9,] "Orkney_1KG" "12.4652"
[10,] "Orcadian" "12.8195"

If you want to find e.g., the top-30 populations, rather than just the top-10, enter:

DodecadOracle("British_D", k=30)

Which populations are closest to a particular individual?

Enter the admixture proportions of the individual (from the "Individual results" tab of the spreadsheet) as follows:

DodecadOracle(c(4.6, 16.7, 33.6, 0, 23.2, 0.4, 0.6, 1.6, 0.7, 14.1, 4.5, 0.2))
[,1] [,2]
[1,] "Ashkenazi_D" "3.7908"
[2,] "Ashkenazy_Jews" "4.1473"
[3,] "Morocco_Jews" "6.338"
[4,] "S_Italian_Sicilian_D" "12.5443"
[5,] "Sephardic_Jews" "13.5067"
[6,] "C_Italian_D" "14.4554"
[7,] "Sicilian_D" "14.7469"
[8,] "S_Italian_D" "15.748"
[9,] "Tuscan_X" "15.9981"
[10,] "O_Italian_D" "16.1474"

Once again, you can specify k=30, if you desire the 30 top matching populations instead of the default 10.

Mixed Mode

You use mixed mode by adding mixedmode=T in any of the commands. The program then considers all pairs of populations, and for each one of them calculates the minimum distance to the sample in consideration, and the admixture proportions that produce it; population pairs where the distance to one of the two populations is smaller than to any admixture of the two are ignored.

Example:

DodecadOracle("Pathan",mixedmode=T)
[,1] [,2]
[1,] "Pathan" "0"
[2,] "84.8% Pakistani + 15.2% Urkarah" "1.075"
[3,] "84% Pakistani + 16% Stalskoe" "1.1555"
[4,] "63.9% TN_Brahmin + 36.1% Urkarah" "1.6669"
[5,] "32.4% Urkarah + 67.6% Meghawal" "2.3516"
[6,] "56.3% INS + 43.7% Urkarah" "2.4901"
[7,] "11.5% Adygei + 88.5% Pakistani" "2.6245"
[8,] "82.4% Sindhi + 17.6% Stalskoe" "2.6318"
[9,] "62.9% AP_Brahmin + 37.1% Urkarah" "2.7322"
[10,] "11.2% Lezgins + 88.8% Pakistani" "2.7749"

The mixed mode should be used with caution, and it shows, more than anything else, how similar apparent "mixes" can be achieved by different combinations of ancestry. Nonetheless, it may prove somewhat useful. For example, there is a suggestion in the above results, that Pathans can be viewed as a mix of other South Asian populations and populations from the eastern Caucasus, a suggestion that was arrived at independently by the Project using different methods.
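
For readers curious about what a pairwise search of this kind involves, here is a minimal Python sketch. It is not the Oracle's actual code (the tool itself is an R workspace): the Euclidean distance, the function names `best_mix` and `mixed_mode`, and the toy two-component populations are all assumptions for illustration. For each pair it finds the mixing share that minimizes the distance to the target, and it skips pairs where one pure population is at least as close as any real mixture.

```python
import math
from itertools import combinations


def distance(u, v):
    # Euclidean distance between two admixture-proportion vectors (assumed metric)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_mix(target, pop_a, pop_b):
    """Return (w, dist), where w is the share of pop_a in the closest two-way
    mix, or None if a pure population is at least as close as any mixture."""
    diff = [a - b for a, b in zip(pop_a, pop_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        return None  # identical populations cannot define a mixture line
    # Project the target onto the line joining pop_b (w=0) and pop_a (w=1)
    w = sum((t - b) * d for t, b, d in zip(target, pop_b, diff)) / denom
    if w <= 0 or w >= 1:
        return None  # optimum lies at or beyond an endpoint
    mix = [w * a + (1 - w) * b for a, b in zip(pop_a, pop_b)]
    return w, distance(target, mix)


def mixed_mode(target, populations, k=10):
    """populations: dict of name -> proportion vector. Returns the k closest mixes."""
    results = []
    for (name_a, a), (name_b, b) in combinations(populations.items(), 2):
        fit = best_mix(target, a, b)
        if fit is not None:
            w, dist = fit
            label = f"{100 * w:.1f}% {name_a} + {100 * (1 - w):.1f}% {name_b}"
            results.append((label, round(dist, 4)))
    return sorted(results, key=lambda r: r[1])[:k]


# Hypothetical two-component populations, not taken from the spreadsheet
toy = {"A": [100, 0], "B": [0, 100], "C": [50, 50]}
for row in mixed_mode([70, 30], toy):
    print(row)
```

The toy output also illustrates the caution above: the same target is fitted equally well by an A + B mix and by an A + C mix.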

Here is another example:

DodecadOracle("Assyrian_D",mixedmode=T)
[,1] [,2]
[1,] "Assyrian_D" "0"
[2,] "83.9% Armenians_16 + 16.1% Yemen_Jews" "1.7829"
[3,] "89.1% Armenian_D + 10.9% Saudis" "2.1624"
[4,] "84.3% Armenians_16 + 15.7% Saudis" "2.2884"
[5,] "88.9% Armenian_D + 11.1% Yemen_Jews" "2.2983"
[6,] "83.8% Armenian_D + 16.2% Bedouin" "4.1579"
[7,] "72.2% Armenian_D + 27.8% Syrians" "4.1841"
[8,] "23.4% Georgians + 76.6% Iraq_Jews" "4.2418"
[9,] "76.2% Armenians_16 + 23.8% Bedouin" "4.332"
[10,] "61.5% Armenians_16 + 38.5% Syrians" "4.4019"

This reaffirms the close relationship of Assyrians to Armenians that has been noticed in the project and by others, and it also shows that Assyrians differ from Armenians in a Southwestern Asian direction, consistent with their Semitic language.

Or, African Americans:

DodecadOracle("ASW",mixedmode=T)
[,1] [,2]
[1,] "ASW" "0"
[2,] "81.3% Hausa + 18.7% N._European" "2.3891"
[3,] "18.4% Orkney_1KG + 81.6% Hausa" "2.4031"
[4,] "18.5% Argyll_1KG + 81.5% Hausa" "2.4268"
[5,] "18.4% Orcadian + 81.6% Hausa" "2.4657"
[6,] "80.5% Igbo + 19.5% N._European" "2.5031"
[7,] "80.6% Brong + 19.4% N._European" "2.523"
[8,] "18.6% CEU + 81.4% Hausa" "2.5938"
[9,] "19.1% Argyll_1KG + 80.9% Brong" "2.6197"
[10,] "19% Orkney_1KG + 81% Brong" "2.6274"

I don't know that much about the slave trade, but I believe that Ghana was an important part of it?

Another thing to watch is that some populations tend to have more than one sample available, so they appear to be mixtures of themselves, which is not really very informative, e.g., Spanish_D:

DodecadOracle("Spanish_D",mixedmode=T)
[,1] [,2]
[1,] "Spanish_D" "0"
[2,] "7.9% French_Basque + 92.1% IBS" "0.8713"
[3,] "68.9% IBS + 31.1% Spaniards" "1.0377"
[4,] "98.8% IBS + 1.2% Irish_D" "1.2959"
[5,] "1.2% British_Isles_D + 98.8% IBS" "1.3018"
[6,] "1.2% British_D + 98.8% IBS" "1.3019"
[7,] "99% IBS + 1% Norwegian_D" "1.3046"
[8,] "1.2% Cornwall_1KG + 98.8% IBS" "1.3048"
[9,] "98.8% IBS + 1.2% Kent_1KG" "1.3142"
[10,] "2.2% French_D + 97.8% IBS" "1.3179"

To deal with these problems, you must "edit" the X matrix if you want to exclude some populations. For example, if you want to exclude "Spaniards" and "IBS", you must enter:

X <- X[!(X[,1] %in% c("IBS", "Spaniards")),]

Note that you must relaunch the program if you want to get the original matrix back; alternatively, save a copy before editing it, like this:

Z<-X

and then retrieve it like this:

X<-Z

Friday, July 1, 2011

Results up to DOD764 are posted (+portraits, Indo-Iranians etc.)

The results can be found in the spreadsheet

Submission to the Project is currently closed, and of course I encourage participants who have not already done so to leave a message in the ancestry thread.

This completes the results for all Project participants who joined during the latest submission opportunity.

The population averages are finalized -for the time being- but I will occasionally update the _D populations as more participants join the Project and/or I discover cases of fraud in terms of ancestry self-reporting.

Population Portraits

Finally, the population portraits have been uploaded (here and here). For example, here is the Nganassan one, showing three distinctive outliers:

A colorful view of the Nepalese, showing the co-existence of South-Asian-like and East-Asian-like individuals:
Note, that there are also some portraits of populations not included in the averages. For example here are the Onge:
The Onge from the Indian Ocean are outside the area covered by the populations used to create the Dodecad v3, and show mixed "South Asian", "South-East Asian" affiliations. They are probably a good example of case #4.

Indo-Iranian Origins

Here is the population portrait of the Kurds:
I have long noticed that all Indo-Iranian populations possess some of the "South Asian" component. The origin of that component is difficult to ascertain, as it is a composite of "North Indian" and "South Indian" ancestral components, related to West Asians and Onge respectively.

What also seems interesting is that the "South Asian" component is closer to the "West Asian" one with respect to all other West Eurasian components, while many South Asian individuals have substantial levels of the "West Asian" component itself.

The occurrence of "South Asian" in non-negligible levels seems to track the Indo-Iranian world quite well: it is found at about 1/10 in Iranians and Kurds, and also occurs widely in Central Asia, where its true ancient levels were probably much higher due to the substantial presence of east Eurasian elements in the area today. It even occurs at non-trace levels in people who have been part of historical Persian empires such as those from the eastern Caucasus (compare Lezgins and Azerbaijan Jews with Georgians and Adygei, and Iranians/Kurds with Turks, Cypriots, Syrians, and Armenians).

These patterns can be well-explained, I believe, if we accept that Indo-Iranians are partially descended not only from the early Proto-Indo-Europeans of the Near East, but also from a second element that conceivably had "South Asian" affiliations. The most likely candidate for the "second element" is the population of the Bactria Margiana Archaeological Complex (BMAC). The rise and demise of the BMAC fits well with the relative shallowness of the Indo-Iranian language family and its 2nd millennium BC breakup, and the BMAC has been assigned an Indo-Iranian identity on other grounds by its excavator. As climate change led to the decline and abandonment of BMAC sites, its population must have spread outward: to the Iranian plateau, the steppe, and into South Asia, reinforcing the linguistic differentiation that must have already begun over the extensive territory of the complex.

The proposed Indo-Iranian homeland, transitional between the West and the South would explain both:
  • the presence of the "West Asian" component in South Asians (contrast e.g., Kashmiri Pandits with other Indians and south Indian Brahmins with non-Brahmin south Indians), and also
  • the "South Asian" component in Iranians and Iranian-admixed Central Asian Turkic speakers
In their westward march, the Iranians would acquire an excess of West and Southwest Asian components (which would reduce their "South Asian" one), while in their southward march, the Indo-Aryans would acquire an excess of the South Asian component (which would reduce their "West Asian" one).

Saturday, June 4, 2011

Projecting Pakistan populations on West Eurasian PCA

In a first post I showed that ADMIXTURE output allele frequencies could be used to create synthetic individuals corresponding to the ancestral components ("zombies"), and that these artificial populations could be used both to improve performance and to avoid the creation of population-specific clusters in ADMIXTURE runs. I was hence able to infer the composition of several idiosyncratic populations in terms of the K=10 components of the Dodecad Project.

In a second post, I showed that "zombies" could be created even in the absence of allele frequencies, if one had admixture proportions only for the ancestral components. I was thus able to reconstruct synthetic individuals corresponding to the ANI/ASI of Reich et al. (2009). I was further able to confirm the West Asian origin of Ancestral North Indians. In a subsequent post, I used these synthetic ANI/ASI populations on groups of Pakistan, showing the main West Asian/ANI origin of the Caucasoid component in South Asia. Moreover, I confirmed that the Ancestral South Indians are related (but distantly) to the Onge from the Indian Ocean.

In this post, I run principal components analysis on the Pakistan populations; the Hazara were excluded because of their high East Eurasian admixture. Here is the unsupervised PCA:


First, you notice that the first dimension is dominated by the Kalash, a very distinctive population because of its long-term isolation. The second dimension is dominated by a Sindhi outlier, which, if you consult a Sindhi population portrait from a previous experiment, is revealed to have substantial Sub-Saharan admixture.

Obviously, this is no good, as our first two dimensions are not anthropologically interesting. If we are interested in learning about the origins of populations, knowing that there are a few Sindhi individuals with Sub-Saharan admixture, or that the Kalash are highly isolated is not helpful.

We can run PCA again, but this time we project populations of interest onto the PCA plot of the West Eurasian control populations:
It is fairly obvious that the populations of Pakistan fall on the South Asia-West Asia line. There are small deviations from the cline:
  • Balochis and Brahuis deviate towards the SW Asian component, which is consistent with their ADMIXTURE results.
  • The position of the non-Indo-European Burusho and Indo-Aryan Sindhi populations on either side of the cline is consistent with a little SW Asian component in the Sindhi and a little North European component in the Burusho, which pull them away from the cline in the expected directions.
Moreover, the relative position of the Pakistan populations along this cline is preserved.
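
The projection idea can be sketched in a few lines of Python with NumPy. This is an illustration of the general technique rather than the procedure used for the plot above: the function names `fit_pca` and `project` are hypothetical, the genotypes are random toy data, and the per-SNP normalization that dedicated PCA software typically applies is omitted. The key point is that the axes are computed from the reference individuals only, and the populations of interest are then placed on those axes without influencing them.

```python
import numpy as np


def fit_pca(reference, n_components=2):
    """Compute principal axes from reference individuals only (rows = individuals)."""
    mean = reference.mean(axis=0)
    # SVD of the centered matrix; the rows of vt are the principal axes
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(samples, mean, axes):
    """Place samples on axes learned from the reference set, without refitting."""
    return (samples - mean) @ axes.T


# Toy genotype matrices (0/1/2 allele counts), randomly generated for illustration
rng = np.random.default_rng(0)
reference = rng.binomial(2, 0.3, size=(20, 50)).astype(float)
test_pop = rng.binomial(2, 0.4, size=(5, 50)).astype(float)

mean, axes = fit_pca(reference)
ref_coords = project(reference, mean, axes)   # control populations
test_coords = project(test_pop, mean, axes)   # projected populations
print(ref_coords.shape, test_coords.shape)
```

Because outliers and isolated groups in the projected set never enter the decomposition, they cannot take over the leading dimensions.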

Using the West Eurasian "zombies" is thus useful not only for ADMIXTURE, but also for principal components analysis; in the latter it is helpful because:
  1. It avoids domination by very isolated/inbred populations and/or outliers
  2. It is possible to create synthetic "zombie" populations with absolutely equal sample sizes, hence removing a source of bias (some residual bias may persist, e.g., if one used a component centered on 5 "real" individuals to create a "zombie" population of 100, then the effective sample is not really 100)


Thursday, June 2, 2011

Ancestral South Indian (ASI) in context

I have taken the synthetic ASI population together with 25 HapMap-3 Chinese (CHB), 16 HGDP Papuans, and 9 Reich et al. (2009) Onge from the Andaman Islands to determine its relationships with other Eurasian populations.

Below is an MDS plot which shows that ASI does not appear to be particularly close to any of the other populations.

I have also run a supervised K=3 ADMIXTURE analysis that treated the ASI population as test data and CHB, Onge, Papuan as parental populations; the ASI turned out 100% "Onge", consistent with the idea that ASI is distantly related to Onge, although closer than with the other two populations.

It should be noted, however, that the similarity of ASI to Onge is not unexpected, since:
  • Onge was used by Reich et al. (2009) to infer admixture proportions of Indian Cline populations, which were (in turn):
  • used by myself to infer allele frequencies of ASI, and then:
  • used by myself to create a synthetic population of ASI individuals.
So, the Onge-ness of ASI is contingent upon the accuracy of Reich et al. (2009), but, anyway, the population of my ASI "zombies" seems to pass a second test of being reasonable stand-ins for ASI in the sense of that paper.

Tuesday, May 31, 2011

ANI/ASI analysis of HGDP Pakistan groups

Until recently, it has been difficult to study the Ancestral North Indian/Ancestral South Indian (ANI/ASI) composition of Pakistan groups, as these fell outside the "Indian Cline" of Reich et al. (2009). My recent experimental reconstruction of ANI/ASI zombies, as well as West-Eurasian ones allows me to do a supervised run on them and see how they fare.

(One caveat is that this is based on ~30k SNPs, as the two different kinds of populations I am using include ~120k and ~150k SNPs, but not the same ones).

Overall, the results make sense (they can be seen on the left, as well as on this spreadsheet):
  • The components of the ANI and West Asian "zombies" dominate most populations; I suspect that as the two are related it may be difficult to distinguish between them
  • Intriguingly, Kalash continue to be dominated by West Asian, now that the composite "South Asian" has been resolved, and their ASI levels are similar to those in Iranians.
  • Conversely, the highest ANI levels are found in Pathans and Sindhi, i.e., precisely the populations used by Reich et al. (2009). Hence, I suspect that ANI in the sense of Reich et al. (2009), as reconstructed by myself, may be biased towards these two populations. Also note that my ANI reconstruction used the same Pathans (15) and Sindhi (10) used by Reich et al. (2009), whereas in this one all HGDP individuals are included.
  • The East Asian component turns up in the Hazara and the Burusho, in agreement with previous experiments
  • The Southwest Asian component turns up in Balochistan (Balochi, Brahui, Makrani), which also makes sense, linking that Iranic speaking region to nearby Iran where that component is also important
  • The North European component comes up in Hazara, Burusho, and Pathans, which again makes sense, as these populations may have been influenced by people from further north in historical times.
In conclusion, I would say that while the ANI/ASI "zombies" do capture real South Asian signals, as evidenced by my Gypsy experiment, the reconstructed ANI does not capture the entirety of West Eurasian admixture in South Asia: a lot of it continues to be associated with West Asia, and a little with Northern Europe in some populations.

Monday, May 30, 2011

More Zombies: Ancestral North Indians and Ancestral South Indians reborn

In my previous post I showed how synthetic individuals corresponding to ADMIXTURE ancestral components can be created and used. This was made possible by the fact that ADMIXTURE outputs allele frequencies for its components, which can be utilized to create a population of random genotypes with the same allele frequencies.

A more difficult task is to create such "zombie" individuals when there are no allele frequencies at hand. A prime example of this is the paper by Reich et al. (2009) on the two ancestral components in Indians: Ancestral South Indians (ASI) and Ancestral North Indians (ANI). The paper provides admixture estimates for these two components in present-day "Indian Cline" groups, but no allele frequencies for these components: we only know that ANI was closely related to West Eurasians, and ASI formed a clade with the Onge from the Indian Ocean.

Both ANI and ASI are extinct (in pure form) populations, and they are blended (in varying proportions) in modern day Indians, with highest ANI occurring in the Northwest and among upper caste groups, and highest ASI among South Indian tribal and low caste populations.

As I was thinking of ways to extend the "zombie" approach, it occurred to me that there is a fairly involved way to extract the ANI/ASI allele frequencies from the available evidence:

If f(ANI) and f(ASI) are the allele frequencies at a locus for ANI and ASI, and an admixed population P has x fraction of ancestry from ANI and 1-x from ASI, then its allele frequency is expected to be:

x*f(ANI)+(1-x)*f(ASI) = f(P)

The known variables are x and f(P); f(ANI) and f(ASI) are the unknowns. Obviously, this equation does not hold exactly in practice, because of sampling error, uncertainty in the estimation of x, as well as genetic drift that may affect the allele frequencies of the admixed population.

Nonetheless, we do not only have one equation of this sort, but 18, since Reich et al. (2009) provides ANI/ASI estimates for 18 different Indian Cline populations. We can thus fit a linear regression to recover f(ANI) and f(ASI).

This is exactly what I did; there are two important caveats:
  • because most of the Reich et al. (2009) populations are very small, f(P) is expected to be very noisy. I thus grouped the Indian Cline populations into five groups (based on increasing ANI, and making sure that each one had >15 individuals), and calculated admixture proportions (x's) and allele frequencies (f(P)'s) on these groups.
  • linear regression coefficients (the f(ANI) and f(ASI) estimates) may be less than 0 or more than 1, which makes no biological sense, so in those cases (~5% of markers) they were fixed to 0 or 1 respectively
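The regression step can be sketched in a few lines of Python. This is a minimal illustration, not the code actually used: it assumes ordinary least squares at a single locus, the function name fit_ancestral_freqs is made up, and the numbers in the usage note are invented. Rewriting the equation as f(P) = f(ASI) + x*(f(ANI) - f(ASI)), the intercept estimates f(ASI) and intercept plus slope estimates f(ANI); out-of-range estimates are clamped to 0 or 1.

```python
def fit_ancestral_freqs(x, f_p):
    """Estimate (f_ANI, f_ASI) at one locus from grouped data.

    x   : ANI fraction of each group (assumed known)
    f_p : observed allele frequency of each group at this locus
    Fits f_p = f_ASI + x * (f_ANI - f_ASI) by ordinary least squares.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_f = sum(f_p) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (fi - mean_f) for xi, fi in zip(x, f_p))
    slope = sxy / sxx                    # estimates f_ANI - f_ASI
    intercept = mean_f - slope * mean_x  # estimates f_ASI (x = 0)
    f_asi = intercept
    f_ani = intercept + slope            # value of the line at x = 1

    def clamp(v):
        # Frequencies outside [0, 1] make no biological sense
        return min(1.0, max(0.0, v))

    return clamp(f_ani), clamp(f_asi)
```

With five groups at invented ANI fractions 0.40, 0.45, 0.55, 0.65, 0.75 and group frequencies 0.44, 0.47, 0.53, 0.59, 0.65 (generated from f(ANI) = 0.8 and f(ASI) = 0.2), the function recovers 0.8 and 0.2 up to floating-point rounding. Running it once per SNP would give the full set of frequencies.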
All of this required a bit of thinking and work, so I was very skeptical that it would work; given sampling/admixture estimation errors/limitations of regression/random creation of individuals, the whole process from input data to output "zombies" passed through so many layers, that it could very well lead to nonsense.

Nonetheless, there is power in numbers, and I was hopeful that this might work. If it did, I could have synthesized ANI and ASI populations to play with and use pretty much like regular populations in a variety of experiments.

Validation of synthetic ANI/ASI populations

I generated 25 ANI and 25 ASI individuals using the above-described method. There are 119,588 SNPs in these populations.

To validate them, I ran supervised ADMIXTURE using these ANI/ASI individuals as ancestral populations, and all the Indian Cline populations as test data. The results can be seen below:
Although the estimates for some populations (e.g., Chenchu: 31 vs. 40.7%) are substantially off, the median error is 1%, and the average error is 2.4%. Overall, it does appear that the synthetic ANI/ASI individuals are fairly good stand-ins for their (extinct) populations.

Ancestral North Indians

I included ANI together with the 4 West Eurasian components of the Dodecad Project in an MDS plot:
Also, a neighbor-joining tree:
Putting ANI/ASI to work: Romanian Gypsies

I have previously detected 2 individuals in the Behar et al. (2010) Romanian sample that are likely to be of Roma (Gypsy) heritage. Here is a supervised admixture of the Romanian sample using the ANI/ASI components:
The previously detected individuals do possess both ANI and ASI components. Indeed, these are:

18.1, 15.3
16.9, 16.4

in the two individuals, which might be useful in constraining geographically the origin of European Gypsies along the Indian Cline.

Putting ANI/ASI to work: Iranians

Iranians generally show affinity to South Asians. Is this affinity related to the common Indo-Iranian background of Iranians and Indo-Aryans, or, is it, perhaps, due to the absorption of South Asian population elements during Iran's long imperial past?

The ANI/ASI components in the Iranians and Iranian_D samples are:

11.7, 7.5
12.0, 6.9

Compared to the previously described Romanian Gypsies, the South Asian component in Iranians tends to be clearly tilted towards ANI.

How to create Zombies from ADMIXTURE etc.

ADMIXTURE infers K ancestral populations, and estimates the admixture proportions of individuals from these K populations, as well as the allele frequencies for all SNPs for each ancestral population.

An interesting use of the allele frequencies is to generate synthetic "zombies" from the ancestral populations. These are artificial individuals whose genotypes are drawn randomly based on the allele frequencies. For example, there is a "West Asian" component in the Dodecad Project, but no individuals who have 100% membership in the "West Asian" component. A "West Asian" zombie is a synthetic individual who appears to be drawn from that "West Asian" component only, without any other (e.g., "South European", or "Southwest Asian") admixture at all.
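A minimal sketch of the zombie-generation step follows. The exact sampling scheme is not spelled out above, so this assumes that each SNP is drawn independently and that each genotype is the sum of two independent allele draws at the component's frequency (Hardy-Weinberg proportions, no linkage between markers). The name make_zombies and its parameters are illustrative only.

```python
import random


def make_zombies(allele_freqs, n_individuals, seed=None):
    """Draw synthetic genotypes (0, 1 or 2 copies of the counted allele).

    allele_freqs  : per-SNP frequency of the counted allele in one
                    ancestral component
    n_individuals : how many synthetic individuals to create
    """
    rng = random.Random(seed)
    zombies = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP, summed into a genotype
        genotype = [
            int(rng.random() < f) + int(rng.random() < f)
            for f in allele_freqs
        ]
        zombies.append(genotype)
    return zombies
```

Calling make_zombies(freqs, 25) would give 25 synthetic individuals for one component; a SNP with frequency 0 always yields genotype 0 and a SNP with frequency 1 always yields genotype 2.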

"Zombies" may be viewed as either useful theoretical abstractions, or as reconstructed hypothetical ancient-like individuals, purged of centuries or millennia of admixture. Irrespective of how one views them, they are very useful as a tool.

Zombies of K=10 components

I generated 25 zombies for each of the 10 ancestral components of the Dodecad Project. Below, you can see an MDS plot of these 250 individuals, which is quite similar to the MDS plot generated using only the Fst divergences between the ancestral components.
Including real and "zombie" populations

I include the "West African", "North European", and "South European" zombie populations, together with 25 African Americans (ASW) from HapMap-3:
Notice the direction of the African American cline: slightly tilted towards North Europeans. This makes sense, as the European ancestry of African Americans is derived mainly from Northwestern Europe, and not exclusively from either the Mediterranean or Northern Europe, where the "South European" and "North European" components peak.

Convert unsupervised ADMIXTURE runs to supervised ADMIXTURE

The most exciting use of "zombies" is to convert unsupervised ADMIXTURE runs into supervised ones. In unsupervised mode, ADMIXTURE treats all individuals alike, and tries to infer their ancestral proportions. In supervised mode, some individuals are treated as "fixed" (belonging 100% to one of K ancestral components), and the ancestry of the rest is inferred.

The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; whenever you want to estimate admixture proportions in new samples run supervised ADMIXTURE analysis using the zombie populations.
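The bookkeeping for the supervised step can be sketched as follows. This rests on an assumption about ADMIXTURE's interface that should be checked against the manual for your version: to my understanding, supervised mode is requested with the --supervised flag and reads a .pop file with one line per individual, in the same order as the .fam file, giving a population label for each reference individual and "-" for each individual whose ancestry is to be estimated. The function name and the sample IDs are hypothetical.

```python
def build_pop_lines(fam_ids, zombie_labels):
    """Return the lines of a .pop file, in .fam order.

    fam_ids       : individual IDs in the order they appear in the .fam file
    zombie_labels : dict mapping each zombie's ID to its component label
    Individuals not listed in zombie_labels are marked "-" (unknown).
    """
    return [zombie_labels.get(iid, "-") for iid in fam_ids]


def pop_file_text(fam_ids, zombie_labels):
    # One label per line, newline-terminated
    return "\n".join(build_pop_lines(fam_ids, zombie_labels)) + "\n"
```

For example, build_pop_lines(["Z1", "Z2", "DOD001"], {"Z1": "West_Asian", "Z2": "South_Asian"}) labels the two zombies with their components and leaves the real sample as "-", so that only its proportions are estimated.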

You can thus use the zombie populations to mimic a regular (unsupervised) ADMIXTURE run. This is useful for two reasons:
  1. It can be much faster: the initial set (of the unsupervised run) can be huge, but the zombie populations need only be large enough to capture the allele frequencies of the inferred components.
  2. It avoids the generation of spurious clusters, especially if you include individuals from highly-inbred populations, or a large number of test individuals
I re-estimated admixture proportions for the 9 individuals of the last run, using the "zombie" populations in a supervised ADMIXTURE run. This took less than 1/10 of the time, and achieved results that were highly concordant with the ones previously reported: correlation was +0.999729; the average difference in ancestral proportions was 0.3%, the maximum difference 2.1%.

The speedup is due to two reasons: first, I'm running ADMIXTURE on 250 "zombie"+9 real individuals, as opposed to 692+9 real individuals using the unsupervised method. Moreover, admixture proportions are only estimated for the 9 real individuals and are fixed for the 250 "zombie" ones. This idea seems to work like a charm.
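The difference figures quoted above (average and maximum difference in ancestral proportions) can be computed with a short helper. This is only a sketch: it assumes the two runs are given as equally shaped tables of percentages, one row per individual and one column per component, and the name concordance is illustrative.

```python
def concordance(unsupervised, supervised):
    """Mean and maximum absolute difference between two runs.

    Both arguments are lists of rows (one per individual), each row a
    list of admixture percentages in the same component order.
    """
    diffs = [
        abs(u - s)
        for u_row, s_row in zip(unsupervised, supervised)
        for u, s in zip(u_row, s_row)
    ]
    return sum(diffs) / len(diffs), max(diffs)
```

The inputs here would be the 9 x 10 tables from the unsupervised and the zombie-based supervised runs.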

More average K=10 results

I was also able to calculate admixture proportions for the 10 Dodecad components in Druze, Kalash, and Palestinians. These populations have a tendency of forming their own population-specific clusters, so they are very difficult to compare against other populations: you just can't get their breakdown into ancestral components easily, because they become their own ancestral components at fairly low K.

Using the trick of "zombie" populations, we can determine their ancestral components and compare them with other Dodecad populations.

I have labored long to be able to compare these to the ones in the standard Dodecad set, and I am very pleased that I was finally able to achieve it:
  • Both Druze and Palestinians have a substantial "Southwest Asian" component, as do most Semitic (Arab, Jewish, Ethio-Semitic) populations in my database
  • Druze have more "West Asian" than "Southwest Asian", and the reverse is true for Palestinians
  • Palestinians have more African admixture than Druze

By far the most exciting part of this analysis is the set of results for the Kalash, a population that speaks a language of the Dardic group of Indo-Iranian. Some linguists place Dardic languages in the Indo-Aryan subgroup (of which Sanskrit and Hindi are the most famous representatives), whereas others view Dardic as a third branch of Indo-Iranian together with Iranian (like Kurdish, Persian, or Pashto) and Indo-Aryan. In any case, the study of these mountaineers is crucial to the study of Indo-Iranians in general.

The Kalash have been much mythologized as either long-lost Aryans or the descendants of Alexander the Great's soldiers.

The absence of the South European component among them agrees with Y-chromosome research about the absence of a Mediterranean or Greek influence in that population. The Kalash are completely split between the West Asian component (56%) and the South Asian one (43.5%). Indeed, their West Asian admixture is very high compared to my South Asian populations, exceeding even that of the Pathans (~40%) and reaching levels found only in West Asia proper. It is also perfectly consistent with my theory of Indo-Aryan origins in West Asia.

The way forward

I initially considered the idea of zombies as a way to include more Project participants in my detailed ADMIXTURE runs, such as the recent K=12 and K=11 ones. There are two problems with these runs:
  • Each one takes 24+ hours to complete, so it is not exactly possible to replace the standard K=10 analysis with them just yet
  • Including all project participants, especially those of mixed background, makes them completely impractical, in addition to making them very capricious: at high K different components begin to appear depending on sample composition, and the solution is not as robust as in the standard K=10 analysis.
With the use of zombie populations, these problems can be largely solved. I can spend many hours or even days in a very detailed ADMIXTURE run with a large sample, create "zombie" populations from the inferred results, and then run project participants fairly fast using these "zombie" populations and supervised ADMIXTURE mode. In fact, I am working on exactly this type of test at the moment, so project members of all backgrounds should expect good things to come in the next days or weeks.

Friday, May 6, 2011

Ancestral North Indian for South Asian members (with PCA)

I have previously estimated the Ancestral North Indian component in South Asian project members by exploiting the correlation between ADMIXTURE results and the published figures of Reich et al. (2009).

A different method of achieving the same is to project individuals onto the CEU-Onge first principal component, and exploit the correlation between PC1 scores and the published ANI figures. This correlation is +0.99, so it is possible to regress ANI on PC1 and come up with ANI estimates from PCA scores.
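The regression of ANI on PC1 amounts to fitting a straight line on the reference populations and reading off predictions for project members. Below is a minimal sketch assuming simple least squares; the function names are illustrative and any numbers fed to it in examples are made up rather than taken from the actual data.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_ani(pc1_score, slope, intercept):
    """Predicted ANI (same units as the training ANI values)."""
    return slope * pc1_score + intercept
```

Here xs would be the mean PC1 scores of the Indian Cline populations and ys their published ANI percentages; estimate_ani is then applied to each project member's PC1 score.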

Results for all Indian, Bangladeshi, and Pakistani project members can be found in this spreadsheet, ordered by ANI, and interspersed with population averages.

Monday, March 21, 2011

Ancestral North Indian - Ancestral South Indian (ANI/ASI) inferred proportions for South Asian members

I have taken the 22 Project participants with a membership of at least 1/4 in the "South Asian" component of the K=10 standard analysis and ran them together with the populations of the "Indian Cline" described by Reich et al. (2009), as described here.

Since some of the Project's participants have either African or East Eurasian admixture they do not fall strictly along the "Indian Cline" between West-Eurasian-like Ancestral North Indians (ANI) and indigenous South Asian Ancestral South Indians (ASI). I therefore included HapMap Yoruba and Beijing Chinese to weed out these influences and ran a K=4 analysis.

Here are the ADMIXTURE results:

Here are the individual results:

Surprisingly, the inclusion of the Dodecad participants and/or the African and Chinese controls has served to better flesh out the Indian Cline in the ADMIXTURE results. Below is a scatterplot of the "West Eurasian" component inferred by ADMIXTURE vs. the Ancestral North Indian (ANI) of Reich et al. (2009).

R² = 0.98 indicates that the ANI component can be inferred almost perfectly by the West Eurasian ADMIXTURE percentage.

Below are participants' individual results showing their inferred ANI and ASI components:

Raw results can be found in the spreadsheet

Saturday, January 8, 2011

ADMIXTURE analysis with Dodecad Populations (update #2)

Thanks to all the participants of the Project, the number of populations has increased, and so have sample sizes within pre-existing populations in the Project. There are now 17 populations with at least 5 individuals in the Project:
Assyrian, Scandinavian, Greek, Finnish, S_Italian_Sicilian, Ashkenazi, German, Indian, Portuguese, Armenian, Russian, Spanish, British, Irish, Turkish, N_Italian, Balkans
Below are the K=10 ADMIXTURE results with these populations:

Admixture proportions can be found in the spreadsheet.

The fact that the addition of 17 populations and 143 individuals to the core set of 36 populations and 692 individuals results in the same 10 ancestral components testifies to the stability of this solution. Hopefully, within 2011 I will develop an even better comparison set to work with.

Another test of the validity of the analysis is comparison of independent samples of the same populations:
Ashkenazi, Armenian, Spanish, Turkish, N_Italian
I have a sample of Dodecad Project members for each of the above, as well as a published population. A way to measure the concordance between the two is to calculate the correlation coefficient (rounded to the 3rd decimal point):
  • Ashkenazi Jews: 0.999
  • Armenians: 0.988
  • Spanish: 0.998
  • Turkish: 0.995
  • N_Italian: 0.996
The concordance is remarkable.
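For reference, the correlation coefficient used for this comparison can be computed as below. This is a sketch under an assumption the text does not state explicitly: each sample is represented by its vector of average proportions over the 10 components, and the Pearson correlation is taken between the Dodecad vector and the published-sample vector. The function name is illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Because admixture vectors are dominated by a few large components, high correlations are expected whenever the major components agree, so this measure is most informative alongside a look at the individual differences.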

I have also made a RAR of "population portraits". It is important to do this to determine whether minor ancestral components represent population-wide phenomena or are limited to a few individuals.

For example, here are the Turks of the Dodecad project:
The sample is a bit more varied than the sample included in Behar et al:
This probably underscores the importance of broad coverage of large countries and ethnic groups, as I have discovered recently in my analysis of 9 different populations of Pakistan.

Another new population is the Irish, presenting a picture of remarkable homogeneity:
Here is the population portrait for the Balkans, which consists of non-Greek, non-Roma inhabitants of the Balkans:
This appears quite varied; hopefully more Balkan project participants will allow me to split this into additional sample populations.

Finally, here is a portrait of the Ashkenazi population, which appears quite similar to the Behar et al. one:
A very interesting thing about this population is the existence of small slices of "East Asian" and "Northeast Asian" components totalling about 1.5% in almost all individuals. In my opinion this testifies to some type of old minor absorption, as it is fairly evenly spread in the population.

If you haven't joined the project yet, feel free to submit your sample during this opportunity.

Friday, December 17, 2010

Fine-scale South Asian admixture analysis + Results for Project participants

After my recent experiment on the number of markers needed to split closely related populations, I was encouraged to take another stab at integrating the Xing et al. (2009) dataset with my other collections. This dataset has only ~40k markers in common with my other datasets, as it was typed on a different chip, and after data cleaning (--geno 0.01 in PLINK) and LD-based pruning (--indep-pairwise 50 5 0.3 in PLINK), I was left with a composite dataset of about 30,000 SNPs.
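To make the cleaning step concrete, here is a small Python sketch of what a missingness filter such as --geno 0.01 does: markers whose missing-call rate exceeds the threshold are dropped. It is an illustration of the idea only, not a replacement for PLINK; the in-memory data layout, the function name, and the marker names are assumptions made for the example, and the LD-based pruning step is not shown.

```python
def filter_by_missingness(genotypes, max_missing=0.01):
    """Keep markers whose missing-call rate is at most max_missing.

    genotypes : dict mapping a marker name to a list of calls, one per
                individual, with None standing for a missing call
    """
    kept = {}
    for marker, calls in genotypes.items():
        missing_rate = sum(c is None for c in calls) / len(calls)
        if missing_rate <= max_missing:
            kept[marker] = calls
    return kept
```

With 200 individuals, a marker with a single missing call (0.5%) passes the 1% threshold, while a marker with four missing calls (2%) is removed.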

The primary reason for wanting to revisit this dataset is the fact that it had two additional Caucasus populations (Stalskoe and Urkarah) as well as several Indian populations (from Andhra Pradesh, Tamils, and Irula).

In the standard K=10 analysis of the Project, Indian participants invariably get a mixture of "South Asian", "West Asian", "North European", and "East Asian" components, but obviously we should be able to do better than that.

A note of caution: The reduced marker set (~30k) means that a lot of noise is added in the admixture estimates. In particular, many individuals are likely to get low-level admixture from population sources that can be attributed to noise. But, as we will see, the small marker set does not really affect either the power of the GALORE approach, or of ADMIXTURE to infer meaningful clusters.

Dodecad participants

In addition to the reference populations, I have included 14 Dodecad Project members (with 23andMe data) with the criterion that they are unrelated, have >5% of the "South Asian" component, and have less than 5% of the East and West African components. By ID these are:
DOD223 DOD067 DOD010 DOD029 DOD126 DOD128 DOD089 DOD091 DOD090 DOD220 DOD075 DOD078 DOD088 DOD201

GALORE analysis

To verify the existence of structure in the data, I used the MCLUST/MDS approach I've described earlier to infer the existence of clusters in the data. 34 clusters were detected with 16 dimensions of MDS retained.



As you can see, despite the smaller number of markers, structure was effectively inferred by MCLUST. As expected, Dodecad project members, who have diverse origins both in South Asia and beyond it, are "all over the place" in terms of their cluster assignments. In the reference populations, some interesting groupings occur:
  • Stalskoe and Lezgins fall in cluster #32. Stalskoe is a village in Dagestan inhabited by Turkic Kumyks; Lezgins are Northeast Caucasian speakers from Dagestan
  • Dai from China and Vietnamese fall entirely in cluster #10
  • Tamil Brahmins and Andhra Pradesh Brahmins fall mostly in cluster #5, and not in the same clusters as non-Brahmin Tamil and AP individuals
Let's turn to the Dodecad Project members, and look at their probability of assignment:


NNclean suggests that DOD078 is an outlier. This may be due to unique ancestry that is not represented in the other reference populations.

Unfortunately, only Razib of Gene Expression took the trouble of leaving some information in the ancestry thread. His sample, DOD075, is assigned to cluster #6, where the bulk of the Singapore Indians are, along with a scattering of individuals from Indian populations. Feel free to add any non-identifying information in the relevant thread, e.g., "Brahmin", your state of origin, etc. Even a little bit of information may help others interpret their results better.

Origin of South Asians

As I've remarked in the past, Eurasia can be broadly seen as the playground of three major groups of people: the Caucasoids of the West, the Mongoloids of the East, and a southern group of people which is most strongly represented in South Asia, but whose presence can be detected in Southeast Asia as well, although in the latter case it has been marginalized and/or absorbed by the arrival of Mongoloids.

This southern group of people has sometimes been called "Australoid" because of its perceived resemblance to Australo-Melanesians. Indeed, in my K=5 mega-analysis an affinity between Papuans/Melanesians and people of South and Southeast Asia is apparent. These "Australoids" are very old populations, probably stemming from the early Out-of-Africa coastal dispersal route, and we shouldn't be tricked by their phenotypic similarity into thinking that different groups of them are particularly close genetically. Just as "black Africans" are not the same, neither are the "Australoids" and mixed-"Australoids" at the shores of the Indian Ocean.

It is probably the invention of agriculture that is responsible for their marginalization. In Africa, the Pygmies and Bushmen have been absorbed or pushed aside by the demographic Bantu juggernaut, with a few other language groups also hitching a ride on the agriculture/pastoralism economy. In West Eurasia, where agriculture was invented earliest, pre-agricultural populations left no traces. In East Eurasia, the agriculturalists could not expand to the far north where many relic populations exist, but they could (and did) move to the south where they assimilated or drove away pre-existing populations, leaving a few of them, like the Taiwanese Atayal, as partial remnants of the older population stratum.

It is in South Asia where there is clear evidence of fusion between indigenous and exogenous elements with the latter being similar to West Eurasians (Caucasoids). Moreover, both the great linguistic diversity and the caste system have helped maintain many distinct population groups. Naturally, tracing the origin of population elements present in the Indian mosaic is of great interest both for the people of India and for those outside it.

ADMIXTURE analysis

Below is the K=3 analysis which verifies the anthropological received wisdom about the three major Eurasian groups:

The East Eurasian component of this analysis is closer to the South Asian one (Fst=0.079) than to the West Eurasian one (Fst=0.114). The South Asian component is closer to the West Eurasian (Fst=0.063). The South Asian component as revealed in this plot is probably composite, as we shall see in the more detailed analysis below.

Here is the much more detailed K=10 analysis:

Admixture proportions for this can be found in the spreadsheet. I reiterate that you should treat the labels of the ancestral populations as useful mnemonics and that you should not confuse them with the same labels used elsewhere.

There are lots of interesting things about the plot:
  • Both the Irula and the North Kannadi get their own clusters (light blue and pink)
  • The South Asians have additional structure, with a component centered on Pakistan (green) and one centered on India (orange)
  • Notice the elevated Siberian (or "Yakut") component in Turks and Stalskoe (Kumyks). The Adyghe also seem to have some of it, and since these are NW Caucasian speakers, it is plausible that this may represent some sort of Tatar element
Return of the Lezgin mystery

The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.

Readers of this blog may remember the surprising appearance of this Lezgin-specific component in the Balkans (but not Greeks) a few weeks ago. Now it has turned up as a substantial component in India as well.

Back then, I speculated that this component may derive from a prehistoric population that was spread in (but not limited to) the northern arc of the Black Sea from the Balkans to the Caucasus. Even in this analysis, you can see that both Romanians and Hungarians have some of it, and so do Lithuanians and Belorussians, while Tuscans (like the Greeks of my previous experiment) do not.

Hence, this component stretches from at least the Baltic to India, but is largely absent in southern Europe. I will go out on a limb and propose that this component is representative of a non-Indo-European component in the ancestors of the Indo-Iranians.

The absence in India of Y-haplogroup J1, so typical of Dagestanis, may suggest a speculative scenario in which the ancestors of the Indo-Iranians picked up Northeast Caucasian women en route to the Iranian plateau and India.

Distances between components

Here is the table of Fst distances between the 10 components:


Brahmin origins

The importance of the caste system in shaping variation can be seen if we compare Tamil Brahmins with Tamil Lower Castes and Andhra Pradesh Brahmins with other AP populations. Brahmins possess both "Dagestan" and "Pakistan" components, which suggest their links to northern India in the first order, and West Eurasia in a more remote sense. The "Pakistan" component too is closest to the "West Asian" one.

Both "Dagestan" and "Pakistan" components are notable for their absence among non-Brahmins in both these south Indian localities.

Dodecad results

Once again, I can't comment on any of these except DOD075, who was probably right to speculate about input from Southeast Asia given his mixed "Southeast Asian"/"East Asian" affiliations, which resemble those of Vietnamese and Cambodians. The presence of both "Dagestan" and "Pakistan" components also points to more northwesterly influences.

Discussion

The most interesting thing about this little study is, no doubt, the expansion of the Dagestan mystery.

These South Indian Brahmins possess nearly as much of this component as people in Pakistan, and a few Iranians among my project members. They have more of it than many people living much closer to the Caucasus.

Given that they have partially absorbed indigenous Indian elements (evidenced by the "Indian" component, which is itself probably hybrid), the conclusion is inescapable that their ultimate non-Indian ancestors possessed even more of it.

Where did they come from? Any discussion of their origin or dispersal would be advised not to veer off too far from the Caspian sea...