Dodecad Ancestry Project: Fine-scale South Asian admixture analysis + Results for Project participants

Friday, December 17, 2010

Fine-scale South Asian admixture analysis + Results for Project participants

After my recent experiment on the number of markers needed to split closely related populations, I was encouraged to take another stab at integrating the Xing et al. (2009) dataset with my other collections. This dataset has only ~40k markers in common with my other datasets, as it was typed on a different chip, and after data cleaning (--geno 0.01 in PLINK) and LD-based pruning (--indep-pairwise 50 5 0.3 in PLINK), I was left with a composite dataset of about 30,000 SNPs.
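The cleaning and pruning steps above can be sketched as a short script that only builds the PLINK command lines. The flags `--geno 0.01` and `--indep-pairwise 50 5 0.3` come from the text; the fileset names (`merged`, `clean`, `prune`, `pruned`) and the split into three separate invocations are assumptions made for illustration, since the post does not say how the commands were actually run.

```python
import shlex


def plink_pruning_commands(merged="merged", clean="clean",
                           prune="prune", pruned="pruned"):
    """Build (but do not run) PLINK 1.x commands for cleaning and LD pruning.

    All fileset names are hypothetical placeholders.
    """
    # Step 1: drop SNPs with more than 1% missing genotype calls.
    step_clean = ["plink", "--bfile", merged, "--geno", "0.01",
                  "--make-bed", "--out", clean]
    # Step 2: LD-based pruning with a 50-SNP window, a step of 5 SNPs and an
    # r^2 threshold of 0.3; PLINK writes the retained SNPs to <prune>.prune.in.
    step_prune = ["plink", "--bfile", clean,
                  "--indep-pairwise", "50", "5", "0.3", "--out", prune]
    # Step 3: keep only the retained SNPs in a new binary fileset.
    step_extract = ["plink", "--bfile", clean,
                    "--extract", prune + ".prune.in",
                    "--make-bed", "--out", pruned]
    return [step_clean, step_prune, step_extract]


for command in plink_pruning_commands():
    print(shlex.join(command))
```

Keeping the three steps separate makes it easy to count the SNPs that survive each stage.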

The primary reason for revisiting this dataset is that it includes two additional Caucasus populations (Stalskoe and Urkarah) as well as several Indian populations (from Andhra Pradesh, Tamils, and Irula).

In the standard K=10 analysis of the Project, Indian participants invariably get a mixture of "South Asian", "West Asian", "North European", and "East Asian" components, but obviously we should be able to do better than that.

A note of caution: The reduced marker set (~30k) means that a lot of noise is added to the admixture estimates. In particular, many individuals are likely to get low-level admixture from population sources that can be attributed to noise. But, as we will see, the small marker set does not really affect the power of either the GALORE approach or ADMIXTURE to infer meaningful clusters.

Dodecad participants

In addition to the reference populations, I have included 14 Dodecad Project members (with 23andMe data), chosen on the criterion that they are unrelated, have a >5% "South Asian" component, and have less than 5% of the East and West African components. By ID these are:

DOD223 DOD067 DOD010 DOD029 DOD126 DOD128 DOD089 DOD091 DOD090 DOD220 DOD075 DOD078 DOD088 DOD201
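The inclusion criterion can be written down as a simple filter. This is a minimal sketch: the sample names and admixture fractions below are made up, and applying the 5% ceiling to the East African and West African components separately (rather than to their sum) is an assumption, since the text does not say which was meant.

```python
# Hypothetical admixture fractions; these are NOT Project data.
profiles = {
    "SAMPLE_A": {"South Asian": 0.42, "East African": 0.00, "West African": 0.01},
    "SAMPLE_B": {"South Asian": 0.03, "East African": 0.00, "West African": 0.00},
    "SAMPLE_C": {"South Asian": 0.20, "East African": 0.08, "West African": 0.00},
    "SAMPLE_D": {"South Asian": 0.35, "East African": 0.00, "West African": 0.00},
}
# Hypothetical flags: True if the sample has a known relative in the Project.
has_relative = {"SAMPLE_A": False, "SAMPLE_B": False,
                "SAMPLE_C": False, "SAMPLE_D": True}


def is_eligible(fractions, related):
    """Apply the stated criteria: unrelated, >5% South Asian, <5% African."""
    if related:
        return False
    return (fractions.get("South Asian", 0.0) > 0.05
            and fractions.get("East African", 0.0) < 0.05    # assumed: per component
            and fractions.get("West African", 0.0) < 0.05)   # assumed: per component


selected = [name for name, fractions in profiles.items()
            if is_eligible(fractions, has_relative[name])]
print(selected)
```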

GALORE analysis

To verify the existence of structure in the data, I used the MCLUST/MDS approach I've described earlier to infer the existence of clusters in the data. 34 clusters were detected with 16 dimensions of MDS retained.

As you can see, despite the smaller number of markers, structure was effectively inferred by MCLUST. As expected, Dodecad Project members, who have diverse origins both in South Asia and beyond it, are "all over the place" in terms of their cluster assignments. In the reference populations, some interesting groupings occur:

Stalskoe and Lezgins fall in cluster #32. Stalskoe is a village in Dagestan inhabited by Turkic Kumyks; Lezgins are Northeast Caucasian speakers from Dagestan
Dai from China and Vietnamese fall entirely in cluster #10
Tamil Brahmins and Andhra Pradesh Brahmins fall mostly in cluster #5, and not in the same clusters as non-Brahmin Tamil and AP individuals
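The MDS step of the GALORE analysis starts from a matrix of pairwise distances between individuals. The post does not say which distance was used; the sketch below assumes an identity-by-state (IBS) distance on genotypes coded as allele counts (0, 1, 2), with `None` for missing calls. The toy genotypes are invented, and the later steps (MDS itself and the Gaussian mixture clustering done by MCLUST, an R package) are not reproduced here.

```python
def ibs_distance(a, b):
    """IBS distance between two genotype vectors of allele counts (0/1/2).

    Markers where either individual has a missing call (None) are skipped.
    Returns a value from 0.0 (identical) to 1.0 (no alleles shared).
    """
    shared = 0   # alleles shared identical-by-state, summed over markers
    used = 0     # number of markers with both genotypes present
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        shared += 2 - abs(x - y)
        used += 1
    if used == 0:
        raise ValueError("no markers with calls in both individuals")
    return 1.0 - shared / (2.0 * used)


def distance_matrix(genotypes):
    """Symmetric matrix of pairwise IBS distances for a list of individuals."""
    n = len(genotypes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ibs_distance(genotypes[i], genotypes[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy data: three individuals typed at four markers.
toy = [[0, 1, 2, 0], [0, 1, 2, None], [2, 1, 0, 2]]
for row in distance_matrix(toy):
    print(row)
```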

Let's turn to the Dodecad Project members, and look at their probability of assignment:

NNclean suggests that DOD078 is an outlier. This may be due to unique ancestry that is not represented in the other reference populations.

Unfortunately, only Razib of Gene Expression took the trouble of leaving some information in the ancestry thread. His sample, DOD075, is assigned to cluster #6, which holds the bulk of the Singapore Indians and a scattering of individuals from Indian populations. Feel free to add any non-identifying information in the relevant thread, e.g., "Brahmin", your state of origin, etc. Even a little bit of information may help others interpret their results better.

Origin of South Asians

As I've remarked in the past, Eurasia can be broadly seen as the playground of three major groups of people: the Caucasoids of the West, the Mongoloids of the East, and a southern group of people which is most strongly represented in South Asia, but whose presence can be detected in Southeast Asia as well, although in the latter case it has been marginalized and/or absorbed by the arrival of Mongoloids.

This southern group of people has sometimes been called "Australoid" because of its perceived resemblance to Australo-Melanesians. Indeed, in my K=5 mega-analysis an affinity between Papuans/Melanesians and people of South and Southeast Asia is apparent. These "Australoids" are very old populations, probably stemming from the early Out-of-Africa coastal dispersal route, and we shouldn't be tricked by their phenotypic similarity into thinking that different groups of them are particularly close genetically. Just as "black Africans" are not the same, neither are the "Australoids" and mixed-"Australoids" at the shores of the Indian Ocean.

It is probably the invention of agriculture that is responsible for their marginalization. In Africa, the Pygmies and Bushmen have been absorbed or pushed aside by the demographic Bantu juggernaut, with a few other language groups also hitching a ride on the agriculture/pastoralism economy. In West Eurasia, where agriculture was invented earliest, pre-agricultural populations left no traces. In East Eurasia, the agriculturalists could not expand to the far north where many relic populations exist, but they could (and did) move to the south where they assimilated or drove away pre-existing populations, leaving a few of them, like the Taiwanese Atayal, as partial remnants of the older population stratum.

It is in South Asia where there is clear evidence of fusion between indigenous and exogenous elements with the latter being similar to West Eurasians (Caucasoids). Moreover, both the great linguistic diversity and the caste system have helped maintain many distinct population groups. Naturally, tracing the origin of population elements present in the Indian mosaic is of great interest both for the people of India and for those outside it.

ADMIXTURE analysis

Below is the K=3 analysis which verifies the anthropological received wisdom about the three major Eurasian groups:

The East Eurasian component of this analysis is closer to the South Asian one (Fst=0.079) than to the West Eurasian one (Fst=0.114). The South Asian component is closer to the West Eurasian (Fst=0.063). The South Asian component as revealed in this plot is probably composite, as we shall see in the more detailed analysis below.
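The three K=3 Fst values quoted above are enough to check which component is nearest to which. The sketch below uses only those three numbers; the function name `nearest` is just an illustrative helper.

```python
# Pairwise Fst between the K=3 components, as quoted in the text.
fst = {
    frozenset(("East Eurasian", "South Asian")): 0.079,
    frozenset(("East Eurasian", "West Eurasian")): 0.114,
    frozenset(("South Asian", "West Eurasian")): 0.063,
}


def nearest(component):
    """Return the component with the smallest Fst to the given one."""
    distances = {}
    for pair, value in fst.items():
        if component in pair:
            (other,) = pair - {component}
            distances[other] = value
    return min(distances, key=distances.get)


for name in ("East Eurasian", "South Asian", "West Eurasian"):
    print(name, "is nearest to", nearest(name))
```

On these numbers the South Asian component sits between the other two, which is consistent with the suggestion that it is composite.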

Here is the much more detailed K=10 analysis:

Admixture proportions for this can be found in the spreadsheet. I reiterate that you should treat the labels of the ancestral populations as useful mnemonics and that you should not confuse them with the same labels used elsewhere.

There are lots of interesting things about the plot:

Both the Irula and the North Kannadi get their own clusters (light blue and pink)
The South Asians have additional structure, with a component centered on Pakistan (green) and one centered on India (orange)
Notice the elevated Siberian (or "Yakut") component in Turks and Stalskoe (Kumyks). The Adyghe also seem to have some of it, and since these are NW Caucasian speakers, it is plausible that this may represent some sort of Tatar element

Return of the Lezgin mystery

The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.

Readers of this blog may remember the surprising appearance of this Lezgin-specific component in the Balkans (but not Greeks) a few weeks ago. Now it has turned up as a substantial component in India as well.

Back then, I speculated that this component may derive from a prehistoric population that was spread in (but not limited to) the northern arc of the Black Sea from the Balkans to the Caucasus. Even in this analysis, you can see that both Romanians and Hungarians have some of it, and so do Lithuanians and Belorussians, while Tuscans (like the Greeks of my previous experiment) do not.

Hence, this component stretches from at least the Baltic to India, but is largely absent in southern Europe. I will go out on a limb and propose that this component is representative of a non-Indo-European component in the ancestors of the Indo-Iranians.

The absence in India of Y-haplogroup J1, so typical of Dagestanis, may suggest a speculative scenario in which the ancestors of the Indo-Iranians picked up Northeast Caucasian women en route to the Iranian plateau and India.

Distances between components

Here is the table of Fst distances between the 10 components:

Brahmin origins

The importance of the caste system in shaping variation can be seen if we compare Tamil Brahmins with Tamil Lower Castes and Andhra Pradesh Brahmins with other AP populations. Brahmins possess both "Dagestan" and "Pakistan" components, which suggest their links to northern India in the first order, and West Eurasia in a more remote sense. The "Pakistan" component too is closest to the "West Asian" one.

Both "Dagestan" and "Pakistan" components are notable for their absence among non-Brahmins in both these south Indian localities.

Dodecad results

Once again, I can't comment on any of these except DOD075, who was probably right to speculate about input from Southeast Asia given his mixed "Southeast Asian"/"East Asian" affiliations, which resemble those of Vietnamese and Cambodians. The presence of both "Dagestan" and "Pakistan" components also points to more northwesterly influences.

Discussion

The most interesting thing about this little study is, no doubt, the expansion of the Dagestan mystery.

These South Indian Brahmins possess nearly as much of this component as people in Pakistan, and a few Iranians among my project members. They have more of it than many people living much closer to the Caucasus.

Given that they have partially absorbed indigenous Indian elements (evidenced by the "Indian" component, which is itself probably hybrid), the conclusion is inescapable that their ultimate non-Indian ancestors possessed even more of it.

Where did they come from? Any discussion of their origin or dispersal would be advised not to veer off too far from the Caspian sea...
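The dilution argument in the Discussion can be made concrete with a one-line calculation. This is a sketch under a strong simplifying assumption that the post does not make explicitly: the absorbed indigenous ancestry carried none of the "Dagestan" component. The input values are hypothetical and are not taken from the spreadsheet.

```python
def ancestral_fraction(observed, indigenous_share):
    """Estimate the component's share in the non-indigenous ancestors.

    observed:         fraction of the component in the present-day group
    indigenous_share: fraction of ancestry absorbed from indigenous sources,
                      assumed here to carry none of the component
    """
    if not 0.0 <= indigenous_share < 1.0:
        raise ValueError("indigenous_share must be in [0, 1)")
    return observed / (1.0 - indigenous_share)


# Hypothetical example: 15% observed today, half of the ancestry indigenous.
print(ancestral_fraction(0.15, 0.5))
```

If the indigenous ancestry did carry some of the component, which is possible since the "Indian" component is described as probably hybrid, the true ancestral share would be lower than this estimate.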

Continued (20 Dec): A solution to the problem of Indo-Aryan Origins

17 comments:

pconroy, December 17, 2010 at 9:46 PM
So DOD075 (Razib), apart from elevated South East Asian, has NO West Asian or Siberian, but HAS European and Dagestan - most interesting?!

One other thing, I originally speculated that the Lezgin mystery admixture was Iranian, and I still think I'm generally correct, as only Iranians span the Balkans, North Pontic Steppe, down through Iran and on to Pakistan.

Dienekes, December 17, 2010 at 11:06 PM
Why would an "Iranian" component be most frequent in Lezgins?

Notice that all the previous "West Asian" in Lithuanians is also "Dagestan".

pconroy, December 18, 2010 at 12:12 AM
Well by Iranian, I really mean Steppe Iranian, not from Iran the modern country.

So Steppe Iranian admixture should be in the Serbs/Croats, Southern Poles, Balkans generally, Southern Russians, North Caucasians, East Iranians, Central Asians, and all the way down to India

Anonymous, December 18, 2010 at 2:43 AM
It's suggestive that haplogroup G2a seems to be much more common among the two main Tamil Brahmin subpopulations than almost all other South Asian populations (scroll down to the heading "Distribution of G Haplogroup in India and among Iyers and Iyengars"; the data are from Sengupta et al. 2006.) The author of that page has some speculations about the Indo-Scythians.

Svetozar, December 18, 2010 at 8:39 AM
HI. IM DODO29 and im a gypsy. I have no Pakistan component, what does that mean?

Jean, December 18, 2010 at 8:51 AM
The Dagestan component could have been BMAC at the point that it was absorbed by Indo-Iranian speakers. Farming undoubtedly reached the Bactria-Margiana area from SW Asia. The Indo-Aryans appear to have taken over this oasis-farming culture before moving on to India. Even before that there was contact between Andronovo and BMAC. But I would expect the BMAC component to be higher in the Indic-language group.

The impact of Scythian and descendant cultures (Alans, etc) on the North Caucasus and SE Europe could explain the rest of the pattern, except for the Lithuanians.

Dienekes, December 18, 2010 at 10:23 AM
"The impact of Scythian and descendant cultures (Alans, etc) on the North Caucasus and SE Europe could explain the rest of the pattern, except for the Lithuanians."

Not sure what you mean by the impact of the Scythians on the North Caucasus. Why would this component be modal in NE Caucasian speakers (Lezgins) if its impact on the Caucasus can be attributed to Scythians?

Dienekes, December 18, 2010 at 10:49 AM
"HI. IM DODO29 and im a gypsy. I have no Pakistan component, what does that mean?"

I'd say it looks like Gypsies originated in India, not Pakistan, and perhaps from a caste/group that lacked the Pakistan component. The previously discovered Romanian Gypsies also have a lot of the Indian component, and none of the Pakistan one.

Razib Khan, December 18, 2010 at 11:06 AM
the gypsy language is a northwest indo-aryan fwiw (classed with hindi, punjabi, gujarati).

Unknown, December 18, 2010 at 12:53 PM
I am DOD010; North Iranian. Would appreciate Dienekes' comments regarding my results.

Jan, December 18, 2010 at 1:46 PM
[quote]The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.[/quote]

Good point, Dienekes!
You've succeeded to split off another component of neolotic expansion. Its bearers, probably, "chose" a route along western coast of Caspian, to settle on eastern and northern slopes and foothills of Caucasus.
But, IMHO, you shouldn't use "West Asian" label this time, it just confuses:
- this "West Asian" is actually "Western" West Asian of the past analyses
- "Dagestan" is mostly "Eastern" West Asian of the past analyses
Their geographic distribution is clear, only Iran and Central Asia are not represented. It doesn't surprise that both components are best identified in two population of Caucasian "refuge": Georgians and Dagestans.

Connections of "Pakistan" component with "West Asian" and "Dagestan" are unclear: geographically it's close to Dagestan, but Fst distances matrix put it closer to West Asian. Again, samples from Iran and Central Asia will be very useful.

Dienekes, December 18, 2010 at 1:56 PM
"I am DOD010; North Iranian. Would appreciate Dienekes' comments regarding my results."

Please post this in the ancestry thread.

http://dodecad.blogspot.com/2010/11/information-about-project-samples.html

I've included the Behar et al. Iranian sample in the next run, which will be posted separately. I will comment on Iranians there.

Dienekes, December 18, 2010 at 1:58 PM
"Connections of 'Pakistan' component with 'West Asian' and 'Dagestan' are unclear: geographically it's close to Dagestan, but Fst distances matrix put it closer to West Asian. Again, samples from Iran and Central Asia will be very useful."

Since I don't know what the estimation errors for these distances are, I don't want to put too much emphasis on who is closer to whom, at least for the closely related Caucasoid components.

Svetozar, December 18, 2010 at 2:06 PM
whats with my high westasian component? the pakistan component and the westasian component are related according to dienekes. the westasian component is indo-iranian?

Svetozar, December 18, 2010 at 2:09 PM
Dienekes wrote on one of his previous blog entries that he discovered a central-southasian component it would be interesting if he runs admixture with this component for DODO participants.

Svetozar, December 18, 2010 at 3:13 PM
and could it be that the westasian component is somehow the ancestor of the pakistan component? "confused"

Vasishta, April 16, 2011 at 8:46 AM
Hi Dienekes, just noticed this post. Excellent work. I see you mentioned that some of the participants hadn't left a word regarding their ancestry. I've attempted to collate the ethnicities of the South Asian participants, written against their Dodecad ID in this neighbour joining tree generated using the Dodecad results, specifically the South Asian cluster-
http://i52.tinypic.com/2d8lsw9.jpg

-Vasishta

PS. Any chance of running this again? We have a lot more South Indian Brahmin participants since last time 'round you did this.