After my recent experiment on the number of markers needed to split closely related populations, I was encouraged to take another stab at integrating the Xing et al. (2009) dataset with my other collections. This dataset has only ~40k markers in common with my other datasets, as it was typed on a different chip, and after data cleaning (--geno 0.01 in PLINK) and LD-based pruning (--indep-pairwise 50 5 0.3 in PLINK), I was left with a composite dataset of about 30,000 SNPs.
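To make the two PLINK steps concrete, here is a minimal pure-Python sketch of what they do on a toy genotype matrix. It is an illustration under stated assumptions, not PLINK's implementation: genotypes are coded as 0/1/2 allele counts with None for a missing call, r² is taken as the squared Pearson correlation of allele counts, and when a pair exceeds the threshold the later SNP is dropped (PLINK's own choice of which SNP to drop may differ). The function names are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[int]  # 0, 1 or 2 allele copies; None = missing call


def missing_rate(snp: List[Genotype]) -> float:
    """Fraction of individuals with no call at this SNP."""
    return sum(g is None for g in snp) / len(snp)


def geno_filter(snps: List[List[Genotype]], max_missing: float = 0.01) -> List[int]:
    """Indices of SNPs kept by a --geno style filter (missing rate <= max_missing)."""
    return [i for i, snp in enumerate(snps) if missing_rate(snp) <= max_missing]


def r_squared(a: List[Genotype], b: List[Genotype]) -> float:
    """Squared Pearson correlation of allele counts, ignoring missing calls."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def ld_prune(snps: List[List[Genotype]], window: int = 50, step: int = 5,
             r2_max: float = 0.3) -> List[int]:
    """Simplified sliding-window pruning in the spirit of --indep-pairwise.

    Assumption: the later SNP of an offending pair is dropped.
    """
    keep = [True] * len(snps)
    start = 0
    while start < len(snps):
        idx = [i for i in range(start, min(start + window, len(snps))) if keep[i]]
        for pos, i in enumerate(idx):
            if not keep[i]:
                continue
            for j in idx[pos + 1:]:
                if keep[j] and r_squared(snps[i], snps[j]) > r2_max:
                    keep[j] = False
        start += step
    return [i for i, kept in enumerate(keep) if kept]


# Toy data: 4 SNPs typed in 6 individuals (made up for illustration only).
toy = [
    [0, 1, 2, 0, 1, 2],     # SNP 0
    [0, 1, 2, 0, 1, 2],     # SNP 1: perfect LD with SNP 0
    [2, 0, 1, 1, 0, 2],     # SNP 2: uncorrelated with SNP 0
    [0, None, 2, 0, 1, 2],  # SNP 3: one missing call out of six
]
passed = geno_filter(toy)                      # drops SNP 3
pruned = ld_prune([toy[i] for i in passed])    # drops the duplicate of SNP 0
```

With real data the same two steps would be applied to the ~40k shared markers; the toy matrix only shows the mechanics of each filter.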
The primary reason for wanting to revisit this dataset is the fact that it had two additional Caucasus populations (Stalskoe and Urkarah) as well as several Indian populations (from Andhra Pradesh, Tamils, and Irula).
In the standard K=10 analysis of the Project, Indian participants invariably get a mixture of "South Asian", "West Asian", "North European", and "East Asian" components, but obviously we should be able to do better than that.
A note of caution: The reduced marker set (~30k) means that a lot of noise is added to the admixture estimates. In particular, many individuals are likely to get low-level admixture from population sources that can be attributed to noise. But, as we will see, the small marker set does not really affect the power of either the GALORE approach or ADMIXTURE to infer meaningful clusters.
In addition to the reference populations, I have included 14 Dodecad Project members (with 23andMe data) with the criterion that they are unrelated, have a >5% "South Asian" component, and have less than 5% of the East and West African components. By ID these are:
DOD223 DOD067 DOD010 DOD029 DOD126 DOD128 DOD089 DOD091 DOD090 DOD220 DOD075 DOD078 DOD088 DOD201
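The inclusion rule can be written down as a simple filter. This is only a sketch: the record layout, field names, and example IDs below are hypothetical, and it assumes the 5% African threshold applies to each of the East and West African components separately rather than to their sum.

```python
from typing import Dict, List


def is_eligible(p: Dict) -> bool:
    """Inclusion rule as read from the text (assumption: each African
    component is tested separately against the 5% threshold)."""
    return (
        not p["related"]
        and p["south_asian"] > 0.05
        and p["east_african"] < 0.05
        and p["west_african"] < 0.05
    )


# Made-up records for illustration; these are not real project members.
candidates: List[Dict] = [
    {"id": "EX001", "related": False, "south_asian": 0.62, "east_african": 0.00, "west_african": 0.01},
    {"id": "EX002", "related": False, "south_asian": 0.03, "east_african": 0.00, "west_african": 0.00},
    {"id": "EX003", "related": True,  "south_asian": 0.55, "east_african": 0.00, "west_african": 0.00},
    {"id": "EX004", "related": False, "south_asian": 0.40, "east_african": 0.08, "west_african": 0.00},
]

selected = [p["id"] for p in candidates if is_eligible(p)]
```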
To verify the existence of structure in the data, I used the MCLUST/MDS approach I've described earlier to infer the existence of clusters in the data. 34 clusters were detected with 16 dimensions of MDS retained.
As you can see, despite the smaller number of markers, structure was effectively inferred by MCLUST. As expected, Dodecad Project members who have diverse origins in both South Asia and beyond it are "all over the place" in terms of their cluster assignments. In the reference populations, some interesting groupings occur:
- Stalskoe and Lezgins fall in cluster #32. Stalskoe is a village in Dagestan inhabited by Turkic Kumyks; Lezgins are Northeast Caucasian speakers from Dagestan
- Dai from China and Vietnamese fall entirely in cluster #10
- Tamil Brahmins and Andhra Pradesh Brahmins fall mostly in cluster #5, and not in the same clusters as non-Brahmin Tamil and AP individuals
Let's turn to the Dodecad Project members, and look at their probability of assignment:
NNclean suggests that DOD078 is an outlier. This may be due to unique ancestry that is not represented in the other reference populations.
Unfortunately, only Razib of Gene Expression took the trouble of leaving some information in the ancestry thread. His sample, DOD075 is assigned to cluster #6 where the bulk of the Singapore Indians are, and a scattering of individuals from Indian populations. Feel free to add any non-identifying information in the relevant thread, e.g., "Brahmin", your state of origin, etc. Even a little bit of information may help others interpret their results better.
Origin of South Asians
As I've remarked in the past, Eurasia can be broadly seen as the playground of three major groups of people: the Caucasoids of the West, the Mongoloids of the East, and a southern group of people which is most strongly represented in South Asia, but whose presence can be detected in Southeast Asia as well, although in the latter case it has been marginalized and/or absorbed by the arrival of Mongoloids.
This southern group of people has sometimes been called "Australoid" because of its perceived resemblance to Australo-Melanesians. Indeed, in my K=5 mega-analysis an affinity between Papuans/Melanesians and people of South and Southeast Asia is apparent. These "Australoids" are very old populations, probably stemming from the early Out-of-Africa coastal dispersal route, and we shouldn't be tricked by their phenotypic similarity into thinking that different groups of them are particularly close genetically. Just as "black Africans" are not the same, neither are the "Australoids" and mixed-"Australoids" at the shores of the Indian Ocean.
It is probably the invention of agriculture that is responsible for their marginalization. In Africa, the Pygmies and Bushmen have been absorbed or pushed aside by the demographic Bantu juggernaut, with a few other language groups also hitching a ride on the agriculture/pastoralism economy. In West Eurasia, where agriculture was invented earliest, pre-agricultural populations left no traces. In East Eurasia, the agriculturalists could not expand to the far north where many relic populations exist, but they could (and did) move to the south where they assimilated or drove away pre-existing populations, leaving a few of them, like the Taiwanese Atayal, as partial remnants of the older population stratum.
It is in South Asia where there is clear evidence of fusion between indigenous and exogenous elements with the latter being similar to West Eurasians (Caucasoids). Moreover, both the great linguistic diversity and the caste system have helped maintain many distinct population groups. Naturally, tracing the origin of population elements present in the Indian mosaic is of great interest both for the people of India and for those outside it.
Below is the K=3 analysis which verifies the anthropological received wisdom about the three major Eurasian groups:
The East Eurasian component of this analysis is closer to the South Asian one (Fst=0.079) than to the West Eurasian one (Fst=0.114). The South Asian component is closer to the West Eurasian (Fst=0.063). The South Asian component as revealed in this plot is probably composite, as we shall see in the more detailed analysis below.
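For readers unfamiliar with Fst, here is a small sketch of how such a distance can be computed from allele frequencies. It uses the textbook heterozygosity-based definition, Fst = (H_T - H_S) / H_T, summed over SNPs for two equally weighted populations; this is not necessarily the estimator behind the numbers quoted above, and the frequencies in the example are made up.

```python
from typing import Sequence


def fst(freqs_a: Sequence[float], freqs_b: Sequence[float]) -> float:
    """Multi-locus Fst between two populations from per-SNP allele frequencies.

    Computed as a ratio of sums: sum(H_T - H_S) / sum(H_T), where
    H_S is the mean within-population expected heterozygosity and
    H_T is the expected heterozygosity of the pooled population.
    """
    num = 0.0
    den = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        h_s = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        num += h_t - h_s
        den += h_t
    return num / den if den > 0 else 0.0


# Made-up frequencies at two SNPs: populations with similar frequencies
# end up with a smaller Fst than populations with very different ones.
near = fst([0.2, 0.6], [0.3, 0.5])
far = fst([0.2, 0.6], [0.7, 0.1])
```

Read this way, a smaller Fst between two components simply means that their allele frequencies are more alike across the marker set.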
Here is the much more detailed K=10 analysis:
Admixture proportions for this can be found in the spreadsheet. I reiterate that you should treat the labels of the ancestral populations as useful mnemonics and that you should not confuse them with the same labels used elsewhere.
There are lots of interesting things about the plot:
- Both the Irula and the North Kannadi get their own clusters (light blue and pink)
- The South Asians have additional structure, with a component centered on Pakistan (green) and one centered on India (orange)
- Notice the elevated Siberian (or "Yakut") component in Turks and Stalskoe (Kumyks). The Adyghe also seem to have some of it, and since these are NW Caucasian speakers, it is plausible that this may represent some sort of Tatar element
Return of the Lezgin mystery
The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.
Readers of this blog may remember the surprising appearance of this Lezgin-specific component in the Balkans (but not Greeks) a few weeks ago. Now it has turned up as a substantial component in India as well.
Back then, I speculated that this component may derive from a prehistoric population that was spread in (but not limited to) the northern arc of the Black Sea from the Balkans to the Caucasus. Even in this analysis, you can see that both Romanians and Hungarians have some of it, and so do Lithuanians and Belorussians, while Tuscans (like the Greeks of my previous experiment) do not.
Hence, this component stretches from at least the Baltic to India, but is largely absent in southern Europe. I will go out on a limb and propose that this component is representative of a non-Indo-European component in the ancestors of the Indo-Iranians.
The absence in India of Y-haplogroup J1, so typical of Dagestanis, may suggest a speculative scenario in which the ancestors of the Indo-Iranians picked up Northeast Caucasian women en route to the Iranian plateau and India.
Distances between components
Here is the table of Fst distances between the 10 components:
The importance of the caste system in shaping variation can be seen if we compare Tamil Brahmins with Tamil Lower Castes and Andhra Pradesh Brahmins with other AP populations. Brahmins possess both "Dagestan" and "Pakistan" components, which suggest their links to northern India in the first order, and West Eurasia in a more remote sense. The "Pakistan" component too is closest to the "West Asian" one.
Both "Dagestan" and "Pakistan" components are notable for their absence among non-Brahmins in both these south Indian localities.
Once again, I can't comment on any of these except DOD075, who was probably right to speculate about input from Southeast Asia given his mixed "Southeast Asian"/"East Asian" affiliations, which resemble those of Vietnamese and Cambodians. The presence of both "Dagestan" and "Pakistan" components also points to more northwesterly influences.
The most interesting thing about this little study is, no doubt, the expansion of the Dagestan mystery.
These South Indian Brahmins possess nearly as much of this component as people in Pakistan, and a few Iranians among my project members. They have more of it than many people living much closer to the Caucasus.
Given that they have partially absorbed indigenous Indian elements (evidenced by the "Indian" component, which is itself probably hybrid), the conclusion is inescapable that their ultimate non-Indian ancestors possessed even more of it.
Where did they come from? Any discussion of their origin or dispersal would be advised not to veer off too far from the Caspian Sea...
Continued (20 Dec): A solution to the problem of Indo-Aryan Origins