Friday, December 31, 2010

Results for DOD278 to DOD287 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Thursday, December 30, 2010

Structure in West Asian Indo-European groups

Indo-European languages used to be represented by several branches in West Asia. The most ancient Anatolian languages (Hittite, Luwian, and Palaic) became extinct in antiquity, and their descendants (e.g., Lycian and Lydian) soon thereafter. Greek and Phrygian arrived from the Balkans, and so did Armenian, which, according to tradition, was derived from Phrygian.

Distinct from all these languages, to the east, were the Iranian speakers, whose major branches extant today in West Asia are Kurdish and Persian (Farsi).

The boundary between the western Indo-Europeans and the eastern ones changed several times in history. The apex of Iranian power came early, when the Achaemenids subjugated the entirety of Asia Minor, but the Iranian expansion was halted in Europe during the Persian Wars of the 5th c. BC. In the next century, a reversal of fortunes resulted in the conquest of the Persian Empire by the Greeks of Alexander. The Hellenistic kingdoms lasted for centuries thereafter, but eventually succumbed, and new Greco-Persian kingdoms arose that led to the Parthian revival, which clashed with the new western power, Rome. The limit between East and West survived, with victories and defeats on either side, well into medieval times, as the eastern Romans fought the Sassanids, the successor dominant power in the Iranian world. Heraclius ended the centuries-old struggle when he defeated the Sassanids in Mesopotamia, but the whole affair became irrelevant soon thereafter, as the new power of the Arabs destroyed the Sassanids and threatened the Roman Empire itself. The Empire managed to survive for a few more centuries before eventually succumbing to the Crusaders and Turks, although both of these probably had a minor effect on the local population. Greeks and Armenians continued to exist in Anatolia until the 20th century, with only remnants of them remaining in Asia Minor today.

The data

To study the relationship between the various West Asian Indo-European groups, I gathered an Iranian sample (from Behar et al.), an Iraqi Kurdish one (from Xing et al.), an Armenian one (from Behar et al.), as well as an Armenian one from the Dodecad Project. I have also included the Behar et al. Turkish sample, and a new Turkish sample from the Dodecad Project.

Below are the first two dimensions of the MDS plot.

It appears that Kurds are not particularly closely related to their linguistic cousins, the Iranians. Neither are they very close to Turks and Armenians. These two groups appear very close to each other in most of my analyses, with the main difference being a small Mongoloid component in Turks, which is not visible here due to the lack of Mongoloid reference populations.

The distinctiveness of the Kurds is also evident in the ADMIXTURE analysis:

The high blue component distinguishes Kurds from both Iranians and Armenians/Turks. Iranians have slightly more of it than Armenians/Turks, suggesting a somewhat closer relationship to the Kurds. The difference between the small Dodecad Turkish sample and the Behar et al. one is suggestive of heterogeneity within Turks, which should be kept in mind when interpreting these results. Hopefully, if more West Asian individuals join the project, we will be able to discover patterns of regional variation between and within different ethnic groups.

Sunday, December 26, 2010

Results for DOD268 to DOD277 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Thursday, December 23, 2010

Results for DOD256 to DOD267 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Wednesday, December 22, 2010

Results for DOD245 to DOD255 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Monday, December 20, 2010

Open-ended submission opportunity for 23andMe data

In the spirit of the holidays, I am open to receiving 23andMe data for analysis. This will be an open-ended call rather than one with a deadline, and I will announce when it will be over in this very post.

Who is eligible

Everyone of European, Asian, or North African ancestry whose four grandparents are all from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.

Please do not submit samples of relatives, as these make analysis difficult. I consider 2nd cousins and closer relatives to be related.

I am sorry that I can't process everyone's data, so if you don't fit the above criteria and you feel you should be included, feel free to write to me (but don't send me your data!) and I will keep it in mind. Also, you can subscribe to the feed for future opportunities to submit your data.

What you will receive

You will receive the standard K=10 analysis results such as this, and you will also be eligible for other types of tests such as Galore analysis or regional studies.

Your raw data or genealogical information will not be shared or distributed in any manner, and it will not be analyzed for any other purpose than assessment of ancestry (i.e., not for any physical or health-related traits). It will be identified by a unique ID, known to you and me, and results will be posted in the blog using that ID. I will continue to analyze your data for ancestry, and new results will be posted using that same ID. Also, I will report aggregate results for populations with at least 5 participants.

What to send

Send your zipped autosomal data to Also let me know something about your ancestry (e.g., ethnic group, country of origin of grandparents, or anything else that might be useful).

Sunday, December 19, 2010

Fine-scale admixture in Europe (Dagestan/Basque/Sardinian components)

Wanting to see whether the Dagestan mystery would extend into Europe, I carried out an ADMIXTURE analysis including all my European populations. Once again, as this is done on only ~30k markers, a little noise on the low-level components is expected.

Admixture proportions can be found in the spreadsheet

Notably there is now both a Sardinian and a Basque centered cluster; the latter was formerly (in the standard K=10 analysis) split between "Southern European" and "Northern European". The Urkarah, Lezgin, and Stalskoe samples show the highest presence of the "blue" component, which I label, once again, Dagestan. Note, however, that you should not compare admixture proportions across ADMIXTURE runs for components that happen to be labeled the same (homonymous). Certainly "this" Dagestan is related to the "previous" Dagestan component, but do not assume they are identical.

Here is the Fst distance matrix between the 7 components:


The most notable thing about this figure is the relative absence of the West Asian component in the periphery of Europe. The lowest values are seen in Basque, Sardinian, Orcadian, White Utahns, Lithuanians, Finns, and Scandinavians (in no particular order).

It is worthwhile to order the European populations in terms of their Dagestan component. Excluding the populations of the Caucasus, these are, in ascending order: Basque (0.7%), Sardinian, Cypriot, Belorussian, South Italian/Sicilian, Lithuanian, Tuscans, Portuguese, Greek (3.8%), Vologda Russian, Romanian, Finnish, Spaniards, North Italian, Dodecad Spaniards, Dodecad Russian, Chuvash, Hungarian, French (7.9%), German, Scandinavian, White Utahn, Orcadian (12.6%).

Interpreting this pattern is not easy, but this component does seem to have a V-like distribution, achieving its maximum in the Caucasus and its environs, then undergoing a diminution, and achieving a secondary (lower) frequency mode in NW Europe.

The surprising appearance of the homonymous Dagestan component in India suggests a widespread presence of a common ancestry element. The West Asian element, by comparison, seems to have a more normal /\-like distribution around its center in the Anatolia-Caucasus-Iran region. It does reach the Atlantic coast, but is lacking in Scandinavia and Finland, and also in India itself.

This is just a piece of a broader puzzle, and the picture is not yet clear. However, we can tentatively say that whatever brought the "Dagestan" component to India was not a unidirectional process, but also brought a similar population element to western Europe.

Friday, December 17, 2010

Fine-scale South Asian admixture analysis + Results for Project participants

After my recent experiment on the number of markers needed to split closely related populations, I was encouraged to take another stab at integrating the Xing et al. (2009) dataset with my other collections. This dataset has only ~40k markers in common with my other datasets, as it was typed on a different chip, and after data cleaning (--geno 0.01 in PLINK) and LD-based pruning (--indep-pairwise 50 5 0.3 in PLINK), I was left with a composite dataset of about 30,000 SNPs.
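The cleaning and pruning steps just described can be sketched as a short sequence of PLINK calls. This is only an illustration: the file prefixes ("merged", "merged_clean") are hypothetical, and splitting the work into three separate invocations is my assumption here, not a record of the exact commands that were run. The function only builds the argument lists; it does not execute anything.

```python
def plink_cleaning_commands(in_prefix, out_prefix):
    """Build (but do not run) the PLINK calls for missingness filtering and LD pruning.

    in_prefix and out_prefix are hypothetical binary fileset prefixes.
    """
    filtered = out_prefix + "_geno"
    pruned = out_prefix + "_ld"
    return [
        # 1. Drop SNPs with more than 1% missing genotype calls.
        ["plink", "--bfile", in_prefix, "--geno", "0.01",
         "--make-bed", "--out", filtered],
        # 2. LD pruning: windows of 50 SNPs, shifted 5 SNPs at a time,
        #    removing one SNP of any pair with r^2 above 0.3.
        #    PLINK writes the retained SNP IDs to <pruned>.prune.in.
        ["plink", "--bfile", filtered, "--indep-pairwise", "50", "5", "0.3",
         "--out", pruned],
        # 3. Keep only the SNPs that survived pruning.
        ["plink", "--bfile", filtered, "--extract", pruned + ".prune.in",
         "--make-bed", "--out", out_prefix],
    ]


if __name__ == "__main__":
    for cmd in plink_cleaning_commands("merged", "merged_clean"):
        print(" ".join(cmd))
```

Note that --indep-pairwise only produces the list of SNPs to keep; a separate --extract step is needed to actually write the pruned dataset.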

The primary reason for wanting to revisit this dataset is the fact that it had two additional Caucasus populations (Stalskoe and Urkarah) as well as several Indian populations (from Andhra Pradesh, Tamils, and Irula).

In the standard K=10 analysis of the Project, Indian participants invariably get a mixture of "South Asian", "West Asian", "North European", and "East Asian" components, but obviously we should be able to do better than that.

A note of caution: The reduced marker set (~30k) means that a lot of noise is added in the admixture estimates. In particular, many individuals are likely to get low-level admixture from population sources that can be attributed to noise. But, as we will see, the small marker set does not really affect either the power of the GALORE approach, or of ADMIXTURE to infer meaningful clusters.

Dodecad participants

In addition to the reference populations, I have included 14 Dodecad Project members (with 23andMe data), with the criteria that they are unrelated, have more than 5% of the "South Asian" component, and have less than 5% of the East and West African components. By ID these are:
DOD223 DOD067 DOD010 DOD029 DOD126 DOD128 DOD089 DOD091 DOD090 DOD220 DOD075 DOD078 DOD088 DOD201

GALORE analysis

To verify the existence of structure in the data, I used the MCLUST/MDS approach I've described earlier to infer the existence of clusters in the data. 34 clusters were detected with 16 dimensions of MDS retained.

As you can see, despite the smaller number of markers, structure was effectively inferred by MCLUST. As expected, Dodecad Project members who have diverse origins both in South Asia and beyond it are "all over the place" in terms of their cluster assignments. In the reference populations, some interesting groupings occur:
  • Stalskoe and Lezgins fall in cluster #32. Stalskoe is a village in Dagestan inhabited by Turkic Kumyks; Lezgins are Northeast Caucasian speakers from Dagestan
  • Dai from China and Vietnamese fall entirely in cluster #10
  • Tamil Brahmins and Andhra Pradesh Brahmins fall mostly in cluster #5, and not in the same clusters as non-Brahmin Tamil and AP individuals
Let's turn to the Dodecad Project members, and look at their probability of assignment:

NNclean suggests that DOD078 is an outlier. This may be due to unique ancestry that is not represented in the other reference populations.

Unfortunately, only Razib of Gene Expression took the trouble of leaving some information in the ancestry thread. His sample, DOD075 is assigned to cluster #6 where the bulk of the Singapore Indians are, and a scattering of individuals from Indian populations. Feel free to add any non-identifying information in the relevant thread, e.g., "Brahmin", your state of origin, etc. Even a little bit of information may help others interpret their results better.

Origin of South Asians

As I've remarked in the past, Eurasia can be broadly seen as the playground of three major groups of people: the Caucasoids of the West, the Mongoloids of the East, and a southern group of people which is most strongly represented in South Asia, but whose presence can be detected in Southeast Asia as well, although in the latter case it has been marginalized and/or absorbed by the arrival of Mongoloids.

This southern group of people has sometimes been called "Australoid" because of its perceived resemblance to Australo-Melanesians. Indeed, in my K=5 mega-analysis an affinity between Papuans/Melanesians and people of South and Southeast Asia is apparent. These "Australoids" are very old populations, probably stemming from the early Out-of-Africa coastal dispersal route, and we shouldn't be tricked by their phenotypic similarity into thinking that different groups of them are particularly close genetically. Just as "black Africans" are not the same, neither are the "Australoids" and mixed-"Australoids" at the shores of the Indian Ocean.

It is probably the invention of agriculture that is responsible for their marginalization. In Africa, the Pygmies and Bushmen have been absorbed or pushed aside by the demographic Bantu juggernaut, with a few other language groups also hitching a ride on the agriculture/pastoralism economy. In West Eurasia, where agriculture was invented earliest, pre-agricultural populations left no traces. In East Eurasia, the agriculturalists could not expand to the far north, where many relic populations exist, but they could (and did) move to the south, where they assimilated or drove away pre-existing populations, leaving a few of them, like the Taiwanese Atayal, as partial remnants of the older population stratum.

It is in South Asia where there is clear evidence of fusion between indigenous and exogenous elements with the latter being similar to West Eurasians (Caucasoids). Moreover, both the great linguistic diversity and the caste system have helped maintain many distinct population groups. Naturally, tracing the origin of population elements present in the Indian mosaic is of great interest both for the people of India and for those outside it.

ADMIXTURE analysis

Below is the K=3 analysis which verifies the anthropological received wisdom about the three major Eurasian groups:

The East Eurasian component of this analysis is closer to the South Asian one (Fst=0.079) than to the West Eurasian one (Fst=0.114). The South Asian component is closer to the West Eurasian (Fst=0.063). The South Asian component as revealed in this plot is probably composite, as we shall see in the more detailed analysis below.
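For readers unfamiliar with Fst, here is a minimal sketch of how a distance of this kind can be computed from the allele frequencies of two components. It uses a simple Wright-style formula, (H_T - H_S) / H_T, summed over SNPs for two equally weighted populations. This is a stand-in for illustration: the estimator that ADMIXTURE itself uses may differ, and the frequencies in the usage lines are made up.

```python
def fst(freqs_a, freqs_b):
    """Wright-style Fst between two populations from per-SNP allele frequencies.

    Computed as a ratio of sums over SNPs: sum(H_T - H_S) / sum(H_T),
    where H_T is the expected heterozygosity of the pooled population and
    H_S is the mean expected heterozygosity within the two populations.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_a, freqs_b):
        p_bar = (p1 + p2) / 2.0
        h_total = 2.0 * p_bar * (1.0 - p_bar)
        h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    # If every SNP is fixed for the same allele in both populations,
    # there is no variation to partition, so return 0.
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Made-up frequencies for three SNPs in two hypothetical components.
    component_x = [0.10, 0.50, 0.80]
    component_y = [0.30, 0.45, 0.60]
    print(round(fst(component_x, component_y), 4))
```

Identical frequency vectors give 0, and populations fixed for opposite alleles at every SNP give 1, so values such as 0.063 or 0.114 indicate modest but real differentiation.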

Here is the much more detailed K=10 analysis:

Admixture proportions for this can be found in the spreadsheet. I reiterate that you should treat the labels of the ancestral populations as useful mnemonics and that you should not confuse them with the same labels used elsewhere.

There are lots of interesting things about the plot:
  • Both the Irula and the North Kannadi get their own clusters (light blue and pink)
  • The South Asians have additional structure, with a component centered on Pakistan (green) and one centered on India (orange)
  • Notice the elevated Siberian (or "Yakut") component in Turks and Stalskoe (Kumyks). The Adyghe also seem to have some of it, and since these are NW Caucasian speakers, it is plausible that this may represent some sort of Tatar element
Return of the Lezgin mystery

The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.

Readers of this blog may remember the surprising appearance of this Lezgin-specific component in the Balkans (but not Greeks) a few weeks ago. Now it has turned up as a substantial component in India as well.

Back then, I speculated that this component may derive from a prehistoric population that was spread in (but not limited to) the northern arc of the Black Sea from the Balkans to the Caucasus. Even in this analysis, you can see that both Romanians and Hungarians have some of it, and so do Lithuanians and Belorussians, while Tuscans (like the Greeks of my previous experiment) do not.

Hence, this component stretches from at least the Baltic to India, but is largely absent in southern Europe. I will go out on a limb and propose that this component is representative of a non-Indo-European component in the ancestors of the Indo-Iranians.

The absence in India of Y-haplogroup J1, so typical of Dagestanis, may suggest a speculative scenario in which the ancestors of the Indo-Iranians picked up Northeast Caucasian women en route to the Iranian plateau and India.

Distances between components

Here is the table of Fst distances between the 10 components:

Brahmin origins

The importance of the caste system in shaping variation can be seen if we compare Tamil Brahmins with Tamil Lower Castes and Andhra Pradesh Brahmins with other AP populations. Brahmins possess both "Dagestan" and "Pakistan" components, which suggest their links to northern India in the first order, and West Eurasia in a more remote sense. The "Pakistan" component too is closest to the "West Asian" one.

Both "Dagestan" and "Pakistan" components are notable for their absence among non-Brahmins in both these south Indian localities.

Dodecad results

Once again, I can't comment on any of these except DOD075, who was probably right to speculate about input from Southeast Asia given his mixed "Southeast Asian"/"East Asian" affiliations, which resemble those of Vietnamese and Cambodians. The presence of both "Dagestan" and "Pakistan" components also points to more northwesterly influences.


The most interesting thing about this little study is, no doubt, the expansion of the Dagestan mystery.

These South Indian Brahmins possess nearly as much of this component as people in Pakistan, and a few Iranians among my project members. They have more of it than many people living much closer to the Caucasus.

Given that they have partially absorbed indigenous Indian elements (evidenced by the "Indian" component, which is itself probably hybrid), the conclusion is inescapable that their ultimate non-Indian ancestors possessed even more of it.

Where did they come from? Any discussion of their origin or dispersal would be advised not to veer off too far from the Caspian sea...

Wednesday, December 15, 2010

Coverage of the Dodecad Project in Nature

There is coverage of the Dodecad Project in this week's Nature. Below you can find a version of the graphic included in the article, which includes all studied populations.
You can also download a hi-res version of this from here. I understand that this was probably edited for space, but the distinctiveness and distribution of the components is best seen with the full array of populations.

The source of this figure is from my Eurasian analysis with K=15, although the color-coding is a bit different. In the original post you can also find "population portraits" which show individual-level admixture estimates for these populations.

The Nature piece highlights the connection between "Northern Finland" and Siberia, but this should really read "Finland and Northern Russia", the former coming from Dodecad Project members from different parts of Finland, the latter from Russians from Vologda included in HGDP.

I find the Finnish/Siberian connection quite interesting, as it bridges the gap between European Finnic and Siberian Samoyedic speakers within the Uralic language family, but the emergence of an Altaic component (dark grey) is even more exciting and unexpected. Altaic speakers from Europe to the Pacific, spanning all three major sub-families of Altaic (Turkic such as Yakut and Chuvash, Mongolic such as Buriat and Daur, and Tungusic such as Evenk and Hezhen), are shown to possess a common genetic component which is otherwise rare in both Indo-European speakers of Western Eurasia and Sino-Tibetan, Hmong-Mien, and Paleosiberian populations of Asia. Hence, this appears to be a real genetic correlate of a language family's expansion.

The Nature piece also highlights the Joe Pickrell affair. It's a great story, because it shows how added value can be had by combining an "open source" model of genetic inquiry with traditional genealogy.

If you've come here via Nature, feel free to browse around and send me feedback in either the comments or at

I've also posted a lot of material in my regular blog, Dienekes' Anthropology Blog, mostly under the Dodecad, ADMIXTURE-experiments, and GALORE tags. In particular, I've just finished my Human Genetic Variation trilogy:
If you want to join the Project, please note that data submission is currently closed, but feel free to subscribe to the feed to be alerted of opportunities to join the project, or to keep track of its progress in general.

Splitting populations: how many markers needed?

An interesting question in population genetics has to do with the number of markers necessary to infer population-of-origin of an individual.

In the Dodecad Project, we have seen that even very close populations such as Armenians and Assyrians can be classified accurately using the MDS/MCLUST "Clusters Galore" combination that I have proposed. But, do we really need ~177 thousand markers to achieve this level of detail?

I decided to carry out a small experiment, to see how the number of markers used affects the ability to correctly classify samples into two different populations.

Case A: Maximal differentiation (Papuans vs. Mbuti Pygmies)

I begin by considering the case of the two most differentiated human populations in my database, Papuans (17 individuals) and Mbuti Pygmies (13 individuals). For each step I reduce the number of markers by an order of magnitude, using PLINK's --thin 0.1 argument.
  • With 176,598 markers, classification is 100% correct, i.e., all 13 Pygmies are assigned to one cluster and all 17 Papuans are assigned to another
  • With 17,752 markers, classification is again 100% correct.
  • With 1,725 markers, ditto
  • With 152 markers, ditto
  • With 20 markers, 3 Pygmies are misclassified into the Papuan cluster, hence accuracy is 90%
  • With 4 markers, 5 Pygmies are misclassified as Papuans, and 6 Papuans as Pygmies, hence accuracy is 63%

Notice that these are not ancestry informative markers (AIMs) but randomly selected SNPs.
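The marker counts above are not exact tenths of each other because PLINK's --thin keeps each SNP independently with the given probability, rather than keeping a fixed number. A small simulation of that behavior, using only the standard library (the seed and the use of integer marker IDs are arbitrary choices for illustration):

```python
import random


def thin(markers, keep_prob, rng):
    """Keep each marker independently with probability keep_prob."""
    return [m for m in markers if rng.random() < keep_prob]


rng = random.Random(2010)        # arbitrary seed, only for repeatability
markers = list(range(176598))    # stand-in IDs for the full marker set
counts = [len(markers)]
for _ in range(5):               # five successive rounds of thinning at 0.1
    markers = thin(markers, 0.1, rng)
    counts.append(len(markers))

print(counts)
```

Each round retains roughly, but not exactly, 10% of the previous set, which is the same kind of sequence as 176,598, 17,752, 1,725, 152, 20, and 4.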

Case B: Small differentiation (Armenians vs. Assyrians)

I have previously shown that a sample of 7 Armenians and 8 Assyrians can be classified correctly, with a single Assyrian misclassified as Armenian, hence 93% accuracy.
  • With 17,714 markers, 1 Armenian is misclassified as Assyrian, hence 87%
  • With 1,808 markers, accuracy is 80%
  • With 188 markers, it is 67%
  • With 17 markers, it is 60%
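The accuracy figures in both cases are simply the fraction of individuals placed in the cluster dominated by their own population. A short check of the arithmetic, with sample sizes taken from the text (30 individuals in Case A, 15 in Case B); the helper names are mine:

```python
def accuracy(n_total, n_misclassified):
    """Fraction of individuals assigned to the correct cluster."""
    return (n_total - n_misclassified) / n_total


def implied_errors(n_total, accuracy_percent):
    """Number of misclassified individuals implied by a rounded accuracy."""
    return round(n_total * (1 - accuracy_percent / 100))


# Case A: 17 Papuans + 13 Mbuti Pygmies = 30 individuals.
print(round(100 * accuracy(30, 3)))       # 20 markers: 3 errors
print(round(100 * accuracy(30, 5 + 6)))   # 4 markers: 11 errors

# Case B: 7 Armenians + 8 Assyrians = 15 individuals.
for pct in (93, 87, 80, 67, 60):
    print(pct, implied_errors(15, pct))
```

For Case B, the reported 87% corresponds to 2 of 15 individuals misclassified, which suggests that the misclassified Armenian comes on top of the single Assyrian misclassified in the full-marker run.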


I carried out this little experiment because I thought it would be interesting, but also for its implications. I can identify at least two major ones:

Ancient DNA: Due to poor preservation we are unlikely to get full genome sequences from ancient human DNA except in highly favorable conditions. In many instances, it may be possible to get only a limited number of markers tested. These results suggest that it is possible to get fairly decent assignment of individuals into populations without hundreds of thousands of SNPs. Thus, it may be possible to study genetic structure in ancient necropoleis, or the relationship of ancient remains to modern populations.

Data integration: Different genotyping platforms (e.g., those of Affymetrix and Illumina) often possess a very small subset of common SNPs. Imputation of genotypes is possible, but these results suggest that even when the overlap between markers is not substantial, it is possible to carry out fairly sophisticated genetic analyses on them.


Naturally, there are issues not addressed here: what happens when we are dealing with more than 2 populations? What happens if we want to study admixture rather than classification? In both cases, I expect the number of markers needed to be higher.

Nonetheless, I am fairly convinced, both from this and from a previous experiment, that more markers than are used in current genotyping platforms will add very little value to anthropological investigations of ethnic genetic differences.

Monday, December 13, 2010

Genetic structure in North-Central Europe with the Galore approach

I first posted this in the comments of my other blog, but it is worth a post of its own.

Here is the result of applying MCLUST to a group of Central-North European populations. The maximum number of 13 clusters is reached with 5 MDS dimensions retained:

Some clusters are population-specific (e.g., #7 for Finns, #10 for Lithuanians, #12 for Russians). Some clusters are semi-specific (e.g., #3 for Hungarians, #1 for French). Some populations are split into multiple clusters (e.g., Orcadians or Germans).

Here is a neighbor-joining tree of these 13 clusters based on the first 5 MDS dimensions:
12, 13: Russians
9, 10: Balto-Slavs
5, 3, 2, 4, 1: Northwest Europeans
11: A couple of Hungarians (*)
7, 8: Orcadians
6: Finns

I generally frown upon phylogenies for human groups, as I believe that human genetic variation is better represented as a network due to lateral gene flow. However, this tree gives an idea of the relationship between clusters.

(*) It's interesting that these 2 Hungarians are the same ones that showed an elevated "Altaic" component in my K=15 analysis.

Saturday, December 11, 2010

Assyrians vs. Armenians

Assyrians and Armenians have invariably been assigned to the same cluster in Clusters Galore analysis thus far. It is very important to understand the following two axioms:
  • Populations that are separated by clustering analysis have enough genetic differences between them to allow such separation
  • Populations that are not separated may, or may not have enough genetic differences between them to allow separation
The second axiom is extremely important. Let's elaborate:
  • The clustering algorithm may have its limitations
  • The number of markers may be insufficient
  • The number of individuals may be insufficient
In the Galore approach, I use MCLUST, as good a clustering algorithm as any. I also have samples of 8 Assyrians and 7 Armenians, and use about 177k SNPs, which are enough to squeeze out differences even between related populations. So, what gives? Are Assyrians and Armenians really genetically indistinguishable?

What constitutes a cluster? MCLUST can divide a dataset into as many clusters as you want, but it also chooses the number of clusters to optimize the Bayesian Information Criterion (BIC). But, it's important to note that the BIC is not some god-given arbiter of what a cluster is; it is best viewed as a guide to choose a good number of clusters, and not as a guarantee that this is the true number of clusters that a population can be subdivided into.

Clustering Assyrians and Armenians, assuming 2 clusters

To that end, I decided to cluster Assyrians and Armenians, forcing MCLUST to infer 2 clusters. I only retained 2 MDS dimensions, as these are enough to distinguish between 2 groups. Here is the MDS plot:
We can observe that Assyrians can be distinguished from Armenians along Dimension 1, and there is also some structure in Assyrians, with 3 of them forming a mini-cluster on top, 4 of them forming another mini-cluster at the bottom, and 1 of them being closer to the Armenians.

By applying MCLUST with K=2, all 7 Armenians and 1 Assyrian are assigned to a cluster, and the remaining 7 Assyrians to another. Thus, the two groups can be separated from each other, although their differences (due to the factors I mentioned) are not large enough to lead to an improvement of the BIC.

With more individuals, it's possible that the BIC too will be able to track the improvement in likelihood that allowing two clusters will produce.
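To see why sample size matters, here is a worked sketch of the BIC trade-off. I assume the convention BIC = 2 log L - k log n (larger is better), which is, as far as I know, the one mclust reports, and unconstrained Gaussian clusters in 2 MDS dimensions. The log-likelihood values are invented purely for illustration and are not taken from the actual Assyrian/Armenian run.

```python
import math


def n_params(n_clusters, n_dims):
    """Free parameters of a mixture of full-covariance Gaussians."""
    per_cluster = n_dims + n_dims * (n_dims + 1) // 2   # mean + covariance
    return n_clusters * per_cluster + (n_clusters - 1)  # + mixing proportions


def bic(loglik, k, n):
    """BIC in the 'larger is better' convention: 2 log L - k log n."""
    return 2.0 * loglik - k * math.log(n)


def preferred_clusters(n, loglik_per_individual, n_dims=2):
    """Pick the cluster count with the highest BIC.

    loglik_per_individual maps a cluster count to a (made-up) average
    log-likelihood per individual, so the total scales with sample size.
    """
    scores = {g: bic(n * ll, n_params(g, n_dims), n)
              for g, ll in loglik_per_individual.items()}
    return max(scores, key=scores.get), scores


# Hypothetical fit quality: two clusters fit slightly better per individual.
fit = {1: -40.0 / 15, 2: -36.0 / 15}

print(preferred_clusters(15, fit)[0])   # small sample
print(preferred_clusters(60, fit)[0])   # four times as many individuals
```

With these made-up numbers, 15 individuals favor a single cluster, while 60 individuals with the same per-individual fit favor two. The likelihood gain grows linearly with the number of individuals, whereas the penalty for the extra parameters grows only logarithmically.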

And, indeed, it may be possible that some of these populations will be further subdivided. If more Assyrians join the project, for example, it may be the case that the two apparent Assyrian clusters will emerge, or it will be the case that the space between them will be filled by currently unsampled individuals.

As more individuals join the Project, ever-finer distinctions can be uncovered.

NB: I make the case for Assyrians and Armenians, but I have also tried this for other closely related populations who were assigned to the same cluster by Galore analysis, namely Spaniards and Portuguese. In that case, however, I was not able to find a clean solution with K=2. Once again, this may mean that Spaniards and Portuguese are not genetically that different, or it may mean that their difference will be revealed with more individuals joining the project.

Friday, December 10, 2010

Results for DOD237 to DOD244 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Monday, December 6, 2010

Clusters galore, K=50 for Dodecad Project members (up to FFD055)

I am now repeating the Clusters galore analysis for Family Finder data (for more info, see the previous post on 23andMe data).

With 14 MDS dimensions retained, there were 50 clusters inferred in the optimal solution by MCLUST.

The results spreadsheet lists the 54 project participants in its first rows: each row gives the probabilities that a participant belongs to each cluster. These are followed by the reference populations, where each row gives the number of individuals (for that population) that are assigned to each cluster.

There are also some outliers in this analysis:
FFD002 FFD004 FFD007 FFD012 FFD015 FFD016 FFD021 FFD022 FFD023 FFD038 FFD046
Check what an outlier is in the context of this analysis, and what it means.

Interestingly, because of the smaller number of Family Finder participants some previously defined clusters (for 23andMe data) such as the "Finnish" cluster do not appear here. This is not surprising at all, because for a cluster to be defined several individuals from that population must be present in the data.

Many continental Europeans of this type ended up in cluster #2. Some others, like FFD048, who is Lithuanian, were assigned to the proper cluster #9, centered on Lithuanians.

This underscores the importance of having more people join the Project at the next available opportunity. This will not only create new clusters for individuals who are currently the only representatives of their populations, but it may also split already existing clusters if regional sub-populations are detected.

It is also important for project participants to drop a note at the ancestry thread, to help others make better sense of their results.

Tuesday, November 30, 2010

Outliers in the Dodecad Project (23andMe data)

As promised, I have started to investigate outliers among Dodecad Project members. I used NNclean as implemented in the prabclus package to find data points that had a great distance to their nearest neighbor among either Dodecad Project members or the standard 692-individual panel I use in the Galore analysis.

To make a long story short, here are the IDs identified as outliers:
"DOD157" "DOD168" "DOD169" "DOD036" "DOD048" "DOD088" "DOD034" "DOD030" "DOD060" "DOD132" "DOD128" "DOD175"
An outlier is someone who is not very close to any other individual and hence does not really "cluster" with anyone. Thus, it is recommended to remove outliers prior to clustering, as otherwise they will form makeshift clusters that don't really have a good meaning.
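The idea of "great distance to the nearest neighbor" can be illustrated with a deliberately simplified rule: flag any point whose nearest-neighbor distance exceeds some multiple of the median nearest-neighbor distance. To be clear, this is not NNclean's actual algorithm, which fits a statistical model to the nearest-neighbor distances; the threshold factor and the toy coordinates below are arbitrary assumptions.

```python
import math
import statistics


def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point."""
    result = []
    for i, p in enumerate(points):
        best = min(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            for j, q in enumerate(points) if j != i
        )
        result.append(best)
    return result


def flag_outliers(points, factor=3.0):
    """Indices of points whose nearest neighbor is unusually far away."""
    distances = nearest_neighbor_distances(points)
    cutoff = factor * statistics.median(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Toy 2-D coordinates (think of two MDS dimensions): two tight groups
# and one individual sitting in the empty space between them.
toy_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),   # group A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),               # group B
    (2.5, 2.5),                                       # in between
]

print(flag_outliers(toy_points))
```

Only the last point is flagged. If a second individual with similar coordinates were added next to it, neither would be flagged, since each would then have a close neighbor.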

Looking at the individual spreadsheet reveals that many of these outliers have very unusual ancestry. This falls under two categories:
  1. Recent admixture between geographically separated populations
  2. Being the only member from an unsampled population
In the first case, admixed individuals fall in the "empty space" between their parental clusters, and thus do not cluster with anyone else, unless a person with a similar type of admixture happens to also be in the dataset.

In the second case, there are no members of the individual's group. Sometimes, if a group is close enough to another, this is not a problem, but there are many distinctive population groups for which that is not the case.

While outliers will be removed from some analyses, their outlier status will continue to be evaluated as new reference populations or Dodecad Project members are added.

What's next for Clusters Galore analysis

The first few runs of the Clusters Galore analysis have proven quite successful; I've posted another one on the HGDP panel in my other blog.

Now, it is time to assess the results and see what improvements can be made. I see a few avenues for improvement:


Outliers

Clusters, by definition, are composed of at least 2 individuals. Individuals who are the only representatives of their populations (e.g., a Pygmy or an Icelandic+Armenian mix) will, by necessity, attach themselves to the closest cluster (e.g., to Yoruba, or to some Central European population), even though they are not necessarily close to that population.

Outlier detection is a difficult problem, but I will try some ideas on how to tackle it.

Phantom clusters

mclust is fairly resilient to phantom clusters, i.e., clusters of "misfits" who don't belong in any other population but are banded together erroneously by the algorithm. Still, some phantom clusters are inevitable in an automated procedure, especially one that is pushing the limits of ancestry inference. Phantom clusters are, by their nature, transient, so there are some ideas on how to avoid them and how to focus on very robust and repeatable clusters.


Typicality

Being part of a cluster tells you nothing about how "typical" a member of the cluster you are, i.e., how close to the average. This problem is exacerbated by the fact that the clusters inferred by mclust may have varying shape, size, and orientation.

Nonetheless, there are ideas on how to quantify members' typicality, and I will explore them. Please note that typicality is not necessarily the same as "purity". For example, an elongated cluster of African Americans will have typical members with 20% European admixture, but the "purest" African Americans will have 0% European admixture and be very atypical of their group as a whole. Similarly, typical Turks have 5-6% East Eurasian admixture, while people with 10% East Eurasian admixture are less typical but more likely to be descended from Central Asian Turkic people.
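One possible way to quantify typicality is the Mahalanobis distance from the cluster mean, which accounts for the cluster's shape and orientation. This is an assumption for illustration; the post does not say which ideas will actually be tried. The sketch below uses a hypothetical elongated 2-D cluster whose first coordinate plays the role of the admixture fraction, with made-up values.

```python
import math

def mean_and_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, mean, cov):
    """Distance from the cluster mean, scaled by the cluster's own shape."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T * inverse(cov) * d, with the 2x2 inverse written out.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))  # guard against tiny negative rounding error

# Hypothetical cluster: x = admixture fraction (0% to 40%), y = small noise.
cluster = [(0.00, 0.01), (0.10, -0.01), (0.15, 0.02), (0.20, 0.00),
           (0.25, -0.02), (0.30, 0.01), (0.40, -0.01)]
mean, cov = mean_and_cov(cluster)

typical_member = (0.20, 0.00)  # sits at the cluster average
purest_member = (0.00, 0.01)   # 0% admixture, at the end of the cluster
d_typical = mahalanobis(typical_member, mean, cov)
d_purest = mahalanobis(purest_member, mean, cov)
```

The member with 0% admixture gets the larger distance: it is the "purest" but also the less typical of the two.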

Any new technique will have its birth pains, and hopefully others and I will identify and resolve them.

Monday, November 29, 2010

Galore analysis improved, plus K=56 results of Clusters Galore analysis for Dodecad Project members (up to DOD236)

For background, please read the post on the K=48 analysis and links therein.

As I mentioned in the previous posts, my technique depends on the use of MDS to reduce the dimensionality of genomic data from 177,000 SNPs or so to a few dozen dimensions capturing most of the variance.

Subsequently, mclust, a state-of-the-art clustering algorithm, is applied to the MDS representation: it iterates over choices of K, the number of clusters, trying clusters of different shape, volume, and orientation, and chooses the optimal clustering, i.e., the one maximizing the Bayesian Information Criterion (BIC). In simpler terms, it finds as much detail as possible in the data but penalizes overly ornate models and avoids finding "ghost" clusters that are not really supported.
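To make the model-selection step concrete, here is a minimal Python sketch of choosing K by BIC. It is not mclust: it fits only 1-D Gaussian mixtures by EM and varies only K, not cluster shape, volume, or orientation. The data, the starting values, and the variance floor are assumptions made for illustration. BIC is written in the "larger is better" form, 2*logL minus the number of free parameters times log(n).

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def component_logs(x, weights, means, variances):
    """log(weight * Normal(x | mean, variance)) for each mixture component."""
    return [math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for w, m, v in zip(weights, means, variances)]

def fit_log_likelihood(xs, k, iters=100, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM; return its log-likelihood."""
    xs = sorted(xs)
    n = len(xs)
    # Deterministic start: cut the sorted data into k contiguous chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [len(c) / n for c in chunks]
    means = [sum(c) / len(c) for c in chunks]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), var_floor)
                 for c, m in zip(chunks, means)]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            logs = component_logs(x, weights, means, variances)
            norm = log_sum_exp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor)
    return sum(log_sum_exp(component_logs(x, weights, means, variances)) for x in xs)

def bic(xs, k):
    """BIC in the 'larger is better' form: 2*logL - (free parameters)*log(n)."""
    free_parameters = 3 * k - 1  # k means, k variances, k - 1 free weights
    return 2 * fit_log_likelihood(xs, k) - free_parameters * math.log(len(xs))

# Hypothetical data: three tight, well-separated groups on a line.
offsets = [-0.8, -0.5, -0.3, -0.1, 0.0, 0.1, 0.3, 0.5, 0.8]
data = [center + o for center in (0.0, 5.0, 10.0) for o in offsets]

scores = {k: bic(data, k) for k in range(1, 5)}
best_k = max(scores, key=scores.get)
```

Each extra cluster adds three free parameters here, so a larger K is only preferred when the improvement in fit outweighs that penalty.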

These are clusters derived from data of unlabeled individuals. The only human input into the process is the number of MDS dimensions to retain.

In my previous K=48 analysis, I retained 30 dimensions, but I also noted that this is not really optimal. Choosing more or fewer dimensions might lead to even better resolution (higher K).

More dimensions = more possible ways to distinguish between individuals, but also, possibly, more noise, as individuals might not be "clustered" in them.

Fewer dimensions = fewer possible ways to distinguish between individuals, but also, possibly, less noise from the uninformative higher dimensions.

Thus, the question arises: how many dimensions to retain?

Here is a plot of the optimal number of clusters inferred, depending on how many dimensions I chose to retain:
As you can see, when only a few dimensions are retained, relatively few clusters are inferred, while as the number of dimensions goes beyond a certain point, the number of clusters starts to decrease again, as more noise is added (*).

The number of clusters peaks (see figure) at 16 and 22 dimensions retained; both of these produce 56 different clusters in the optimal solution.

Here are the results for Dodecad Project members (up to DOD236) with K=56 and 16 dimensions retained. In comparison to the previous K=48 analysis, we are now able to:
  1. Split CEU White Utahns (#1) from French (#15)
  2. Split CEU White Utahns (#1) from continental Germanics (#14)
  3. Split French (#15) from Spaniards (#2)
  4. Split Armenians (#7) from Turks (#19)
  5. Split Slavs (#23) from Balts (#26)
  6. Split Cypriots (#30) from Sephardic Jews (#21)
The most astonishing finding, at least for me, is the emergence of a cluster (#16) composed in great majority of people from Greece and Southern Italy, with very few individuals from elsewhere. Notice that #16 has absolutely no representation in the reference populations, which lack South Italians and Greeks.

Once again, I urge participants to help themselves and others by leaving a comment in the ancestry thread.

(*) The change is not, however, smooth. The more general problem is to choose which dimensions to retain, rather than how many of the first ones. Successive MDS dimensions capture a decreasing portion of the variance, but the data are not guaranteed to be "split" along them. This is a much harder problem, as we have to figure out (i) how many dimensions to retain, and (ii) which ones. Even if we fix (i) by choosing to retain, e.g., 10 dimensions, we still have to choose which 10: there are close to half a billion different ways to choose 10 out of the 38 candidate dimensions.
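The "close to half a billion" figure is the binomial coefficient for choosing 10 out of 38. A short Python check, with variable names chosen here for illustration:

```python
import math

total_dims = 38  # candidate MDS dimensions mentioned in the text
keep = 10        # size of the subset to retain

# Number of ways to pick which 10 of the 38 dimensions to keep.
subsets_of_ten = math.comb(total_dims, keep)

# If the subset size is not fixed either, any non-empty subset is a candidate.
all_nonempty_subsets = 2 ** total_dims - 1
```

Here `subsets_of_ten` evaluates to 472,733,756, so an exhaustive search is already impractical before the number of dimensions is even allowed to vary.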

Results for FFD048 to FFD055 posted

This concludes the results for the Family Finder individuals who sent me their data by the deadline.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Results of Clusters Galore analysis for Dodecad Project members (up to DOD236), K=48

This is the result of the new type of ancestry analysis I have recently devised. For background, please read:
In total 894 individuals were included in this analysis, 202 Dodecad Project members and 692 from the published references. 30 MDS dimensions were retained, and mclust was run with a maximum number of clusters = 60. In the optimal solution, 48 clusters were inferred.

At the beginning of the results spreadsheet are the 202 Dodecad Project members and their probabilities of belonging to the 48 clusters. This is followed by the 36 reference populations, listing how many individuals from each one were assigned to each of the 48 clusters.
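The two kinds of rows in the spreadsheet can be illustrated with a small Python sketch. It assumes that a reference individual is counted in the cluster with the highest membership probability; that rule, along with all IDs, population names, and numbers below, is hypothetical and only meant to show the layout.

```python
from collections import Counter, defaultdict

def most_probable_cluster(probabilities):
    """Return the 1-based number of the cluster with the highest probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index + 1

def counts_by_population(cluster_probabilities, population_of):
    """Count, for each population, how many individuals land in each cluster."""
    counts = defaultdict(Counter)
    for individual, probabilities in cluster_probabilities.items():
        cluster = most_probable_cluster(probabilities)
        counts[population_of[individual]][cluster] += 1
    return counts

# Project members: one row of membership probabilities per person.
member_rows = {
    "member_a": [0.28, 0.00, 0.72],
    "member_b": [0.97, 0.03, 0.00],
}

# Reference individuals are summarized per population instead.
reference_rows = {
    "ref_1": [0.99, 0.01, 0.00],
    "ref_2": [0.10, 0.05, 0.85],
    "ref_3": [0.02, 0.00, 0.98],
}
population_of = {"ref_1": "PopX", "ref_2": "PopX", "ref_3": "PopY"}

reference_counts = counts_by_population(reference_rows, population_of)
```

With these made-up numbers, one PopX individual is counted in cluster 1 and one in cluster 3, while the single PopY individual is counted in cluster 3.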

To help you interpret these results, you might want to consult the individual and population ADMIXTURE spreadsheets, as well as the information Project members have chosen to reveal about themselves. Feel free to add your own information in that thread.

Here are the results for some people who have chosen to reveal information about themselves:
  • Spencer Wells is DOD162 and he has 28% probability of being in the CEU (White Utahn)/French cluster #1, and 72% probability of being in cluster #3 which is CEU (White Utahn) specific (in the reference populations) and in which 29 Project members are also assigned (almost all of them Northwestern Europeans)
  • Razib of Gene Expression is DOD075 and is assigned in cluster #15, which includes Gujarati and North Kannada individuals
  • pconroy submitted three samples: DOD097 is Sicilian and falls in cluster #9, to which many Cypriots and Sephardic Jews belong, along with many Project members of South Italian/Sicilian background; DOD098 and DOD099 are his Irish parents: DOD098 is almost evenly split between #1 and #3 (like Wells), and DOD099 is 97% in #3
  • Adriano Squecco DOD139 is North Italian and is in cluster #2, in which 25/25 reference Tuscans and 10/12 reference North Italians belong
  • Lacko DOD083 is 100% southern Polish and he falls in cluster #20, in which all reference Belorussians and Lithuanians fall
  • Mike Maddi DOD021 is Sicilian and is in cluster #9 like DOD097, showing some probability (13%) of also being in #2 like Adriano Squecco
  • An Anonymous Pole falls in #20 like Lacko and the reference Slavs
  • Ilmari DOD003 and Ari DOD131 are Finns and fall in cluster #13. This is an interesting one, as it does not occur at all in the reference populations; I'll let you guess what population it's centered on.
  • Eastara's mother (DOD025) is Bulgarian and falls in cluster #10, which is centered on Romanians in the reference populations, but note that there are also 13 Dodecad Project members who fall in it, many of them from different parts of the Balkans. I hope more people from the Balkans will contact me for inclusion in the project, as I am sure that finer-scale resolution can be achieved there with increased participation.
  • Basar (DOD049) is half-Anatolian Turk and half-Laz. He falls in the cluster encompassing Armenians and Turks in the reference populations.
  • Bubba (DOD066) is North German with a pinch of Danish and falls in cluster #3
  • afpjr (DOD014) is half Greek and half Italian/Sicilian; he falls in cluster #9 (95%)
  • Francesc (DOD217) is Catalan and falls in cluster #17, centered on Spaniards in the reference populations.
The Project Greeks and Mixed Greeks fall in clusters #2, 9, 10 tying them to Italy and the Balkans, as might be expected. I hope more Greeks will decide to participate in the Project, so we can discover more interesting patterns in our population.

I cannot stress enough how revealing non-identifying ancestry information in the relevant thread will help both yourselves and others make better sense of these and future results.

There are clusters composed entirely of Dodecad Project members (e.g., the aforementioned one), and others which are centered on one or two reference populations, but encompass a wider variety of non-represented populations. So, please take the time to leave a comment in the ancestry thread.

This is not the end of the story. There are more clusters to be discovered in the data; the inclusion of 200+ new samples in this analysis has caused new clusters to appear and distinctions that were previously detected to "fold back" (e.g., between Armenians and Turks). I am currently investigating how the choice of number of MDS dimensions to retain affects the number of inferred clusters in the optimal solution.

Sunday, November 28, 2010

Submission of Family Finder data is now CLOSED

Thank you all for submitting your data; the remaining results will be posted in the blog over the next week or so.

If you have submitted your data in time, but did not receive an ID yet, you will.

If you want to be alerted to future opportunities, and to keep up with the progress of the Project, please subscribe to the feed.

Clusters galore: less is more, or, pushing the limits of ancestry inference

I had already hinted in my previous post on my new technique that retaining all MDS dimensions might add noise to the analysis, and I was hopeful that even finer resolution could be achieved with fewer dimensions.

In the spreadsheet you can see the optimal solution if one retains only 10 dimensions, in which case 45 clusters are inferred. Previously, I retained 47 dimensions and got only 35 clusters in the optimal solution: less is more.

In comparison to the previous analysis, I can detect some interesting changes:
  1. Spaniards and Portuguese are split from Tuscans and are joined by some French and North Italians; the rest of the French stay with White Utahns, and the rest of the North Italians stay with Tuscans.
  2. Romanians too get their own cluster
  3. Turks are split from Assyrians/Armenians.
  4. Germans are split from Scandinavians, with 1 sample from either population going to the other population.
  5. The relationship between Cypriots and South Italians is retained, but most of the Greeks (many of whom were borderline between the Tuscan and South Italian cluster) go the Tuscan way.
I am now studying how to choose the optimal number of MDS dimensions to retain, so I will not report any individual data about this to project participants. I just wanted to let everyone share in the excitement.

Results for FFD035 to FFD047 posted

Individual proportions can be found in the spreadsheet

All populations:

Individual bars:

Clusters galore with Dodecad populations

The number of individuals assigned to each cluster can be found in the spreadsheet. Populations in italics are composed entirely of Dodecad Project members.

Please read the post in Dienekes' Anthropology Blog to see what this type of analysis means.

47 MDS dimensions were retained, and the optimal number of clusters was 35. Retaining fewer or more dimensions may alter this number, as after a certain point extra dimensions only contribute noise to the analysis; this is a matter for investigation.

It is hardly practical to comment on all 35 clusters, so I will limit myself to a few observations:
  • Turks, Armenians, and Assyrians fall in cluster #1
  • Scandinavians, White Utahns, Germans, and some French fall in cluster #2
  • Portuguese, French, North Italians, Tuscans, Spaniards, and Romanians fall in cluster #3
  • Greeks, South Italians/Sicilians, Cypriots, and Sephardic Jews from Bulgaria and Turkey fall in cluster #4 (but see note)
  • Finns fall in cluster #5
  • Almost all Ashkenazi Jews fall in #6
  • All Dodecad Project Russians, plus reference Lithuanians and Belorussians fall in #9
8 Greeks fall in cluster #4 and 2 in cluster #3. However, many of the ones who fall in #4 also have some non-trivial probability of falling in #3. Probabilities for all other clusters are less than 0.1%. All Project Greeks can write to me to learn their exact probabilities.

Of course, it should be noted that:
  1. If two populations can be perfectly distinguished from each other, then there are genetic differences between them (they split from each other some time ago, they underwent different types of admixture, etc.) allowing the clustering algorithm to detect their differentiation
  2. If two populations cannot be distinguished from each other, this does not mean that they are indistinguishable in principle; it does mean, however, that through either common ancestry or very similar patterns of admixture, they have become quite similar to each other in the Eurasian context.
If you are a Dodecad Project member (23andMe data) from one of the populations in italics and are wondering which cluster you fall in, first check whether all individuals from your population fall in the same cluster, in which case you already know the answer.

Otherwise, you may write to me, with your DOD number, and I'll tell you.

Results for FFD020 to FFD033 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Saturday, November 27, 2010

Results for FFD003 to FFD019 posted

UPDATE: Color-coding problem fixed.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Friday, November 26, 2010

Results for DOD223 to DOD236 posted

Admixture proportions can be found in the spreadsheet

All populations:
Individual bars:

Thursday, November 25, 2010

ADMIXTURE analysis for Family Finder (FTDNA) samples

I am now able to provide K=10 analysis to Family Finder customers.

The rules of participation are:
  • No relatives (up to 2nd cousin)
  • 100% Eurasian or North African ancestry; the test does not include Native American samples, or an assessment of Native American ancestry
  • Data must be received by Sunday 28/11
In order to participate, you must send me your autosomal data (.gz ending), which you can download from FTDNA, as well as information about your known ancestry (such as country of origin or ethnic affiliation).

There may be other opportunities for people to participate, so please subscribe to the feed.

Your raw data or genealogical information will not be shared or distributed in any manner, and it will not be analyzed for any other purpose than assessment of ancestry (i.e., not for any physical or health-related traits). It will be identified by a unique ID, known to you and me, and results will be posted in the blog using that ID. I will continue to analyze your data for ancestry, and new results will be posted using that same ID. Also, I will report aggregate results for populations with at least 5 participants.

You will receive your ancestral proportions from 10 inferred ancestral components as in the following figure:

This was generated using the same 104,790 markers that I will be using to analyze your sample. Exact admixture proportions for these populations can be found in the population spreadsheet.

Note that these proportions are not directly comparable with those using 23andMe data, as a different set of markers is used in the latter, and there is a smaller overlap between Family Finder data and those of the reference populations I am using. Here is the current population spreadsheet for 23andMe data.

There are already two Family Finder participants in the Project, with IDs FFD001 and FFD002; these are volunteers who helped me with their data when I adapted EURO-DNA-CALC for Family Finder data. Their results are in the individual spreadsheet.

Friday, November 19, 2010

ADMIXTURE analysis with Dodecad Populations (update #1)

Repeating the previous analysis with additional populations of Dodecad Project members and/or modified sample sizes for pre-existing ones:
Assyrian, Scandinavian, Greek, Finnish, S_Italian_Sicilian, Ashkenazi, German, Indian, Portuguese, Armenian
Admixture proportions can be found in the spreadsheet. Dodecad Project populations in italics.

Population portraits can be found in the RAR. For example, here are the ones for Dodecad Project Ashkenazi and Behar et al. (2010) Ashkenazi Jews:
and here is a Portrait of the Portuguese: