Friday, December 31, 2010

Results for DOD278 to DOD287 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Thursday, December 30, 2010

Structure in West Asian Indo-European groups

Indo-European languages used to be represented by several branches in West Asia. The most ancient Anatolian languages (Hittite, Luwian, and Palaic) became extinct in ancient times, and their descendants (e.g. Lycian and Lydian) soon thereafter. Greek and Phrygian were added from the Balkans, and so was Armenian, which, according to tradition, was derived from Phrygian.

Distinct from all these languages, to the east, were those of the Iranian speakers, whose major branches extant today in West Asia are Kurdish and Persian (Farsi).

The boundary between the western Indo-Europeans and the eastern ones changed several times in history. The apex of Iranian power came early when the Achaemenids subjugated the entirety of Asia Minor, but the Iranian expansion was halted in Europe during the Persian Wars of the 5th c. BC. In the next century a reversal of fortunes resulted in the conquest of the Persian Empire by the Greeks of Alexander. The Hellenistic kingdoms lasted for centuries thereafter, but eventually succumbed, and new Greco-Persian kingdoms arose that led to the Parthian revival, which clashed with the new western power, Rome. The limit between East and West survived, with victories and defeats on either side, well into medieval times as the eastern Romans fought the Sassanids, the successor dominant power in the Iranian world. Heraclius ended the centuries-old struggle when he defeated the Sassanids in Mesopotamia, but the whole affair became irrelevant soon thereafter, as the new power of the Arabs destroyed the Sassanids and threatened the Roman Empire itself. The latter managed to survive for a few more centuries before eventually succumbing to the Crusaders and Turks, though both of these probably had a minor effect on the local population. Greeks and Armenians continued to exist in Anatolia until the 20th century, with only remnants of them remaining in Asia Minor today.

The data

To study the relationship between the various West Asian Indo-European groups, I gathered an Iranian sample (from Behar et al.), an Iraqi Kurdish one (from Xing et al.), an Armenian one (from Behar et al.), as well as an Armenian one from the Dodecad Project. I have also included the Behar et al. Turkish sample, and a new Turkish sample from the Dodecad Project.

Below are the first two dimensions of the MDS plot.

It appears that Kurds are not particularly closely related to their linguistic cousins, the Iranians. Neither are they very close to Turks and Armenians; the latter appear very close in most of my analyses, with the main difference being a small Mongoloid component in the former, which is not visible here due to the lack of Mongoloid reference populations.

The distinctiveness of the Kurds is also evident in the ADMIXTURE analysis:

The high blue component distinguishes Kurds from both Iranians and Armenians/Turks. Iranians have slightly more of it than Armenians and Turks do, suggesting a somewhat closer relationship to the Kurds. The difference between the small Dodecad Turkish sample and the Behar et al. one is suggestive of heterogeneity within Turks, which should be kept in mind when interpreting these results. Hopefully, if more West Asian individuals join the project, we will be able to discover patterns of regional variation between and within different ethnic groups.

Sunday, December 26, 2010

Results for DOD268 to DOD277 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:



Individual bars:


Thursday, December 23, 2010

Results for DOD256 to DOD267 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:


Individual bars:

Wednesday, December 22, 2010

Results for DOD245 to DOD255 posted

Open-ended submission opportunity. Feel free to add some information about your ancestry in the relevant thread.

Admixture proportions can be found in the spreadsheet

All populations:




Individual bars:


Monday, December 20, 2010

Open-ended submission opportunity for 23andMe data

In the spirit of the holidays, I am open to receiving 23andMe data for analysis. This will be an open-ended call rather than one with a deadline, and I will announce when it will be over in this very post.

Who is eligible

Everyone who is of European, Asian, or North African ancestry and whose four grandparents are all from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.

Please do not submit samples of relatives, as these make analysis difficult. I consider 2nd cousins and closer relatives to be related.

I am sorry that I can't process everyone's data, so if you don't fit the above criteria and you feel you should be included, feel free to write to me (but don't send me your data!) and I will keep it in mind. Also, you can subscribe to the feed for future opportunities to submit your data.

What you will receive

You will receive the standard K=10 analysis results such as this, and you will also be eligible for other types of tests such as Galore analysis or regional studies.

Your raw data or genealogical information will not be shared or distributed in any manner, and it will not be analyzed for any other purpose than assessment of ancestry (i.e., not for any physical or health-related traits). It will be identified by a unique ID, known to you and me, and results will be posted in the blog using that ID. I will continue to analyze your data for ancestry, and new results will be posted using that same ID. Also, I will report aggregate results for populations with at least 5 participants.

What to send

Send your zipped autosomal data to dodecad@gmail.com. Also let me know something about your ancestry (e.g., ethnic group, country of origin of grandparents, or anything else that might be useful).

Sunday, December 19, 2010

Fine-scale admixture in Europe (Dagestan/Basque/Sardinian components)

Wanting to see whether the Dagestan mystery would extend into Europe, I carried out an ADMIXTURE analysis including all my European populations. Once again, as this is done on only ~30k markers, a little noise on the low-level components is expected.

Admixture proportions can be found in the spreadsheet

Notably there is now both a Sardinian and a Basque centered cluster; the latter was formerly (in the standard K=10 analysis) split between "Southern European" and "Northern European". The Urkarah, Lezgin, and Stalskoe samples show the highest presence of the "blue" component, which I label, once again, Dagestan. Note, however, that you should not compare admixture proportions across ADMIXTURE runs for components that happen to be labeled the same (homonymous). Certainly "this" Dagestan is related to the "previous" Dagestan component, but do not assume they are identical.

Here is the Fst distance matrix between the 7 components:


Discussion

The most notable thing about this figure is the relative absence of the West Asian component in the periphery of Europe. The lowest values are seen in Basque, Sardinian, Orcadian, White Utahns, Lithuanians, Finns, and Scandinavians (in no particular order).

It is worthwhile to order the European populations in terms of their Dagestan component. Excluding the populations of the Caucasus, these are, in ascending order: Basque (0.7%), Sardinian, Cypriot, Belorussian, South Italian/Sicilian, Lithuanian, Tuscans, Portuguese, Greek (3.8%), Vologda Russian, Romanian, Finnish, Spaniards, North Italian, Dodecad Spaniards, Dodecad Russian, Chuvash, Hungarian, French (7.9%), German, Scandinavian, White Utahn, Orcadian (12.6%).

Interpreting this pattern is not easy, but this component does seem to have a V-like distribution, achieving its maximum in the Caucasus and its environs, then undergoing a diminution, and achieving a secondary (lower) frequency mode in NW Europe.

The surprising appearance of the homonymous Dagestan component in India suggests a widespread presence of a common ancestry element. The West Asian element, by comparison, seems to have a more normal /\-like distribution around its center in the Anatolia-Caucasus-Iran region. It does reach the Atlantic coast, but is lacking in Scandinavia and Finland, and also in India itself.

This is just a piece of a broader puzzle, and the picture is not yet clear. However, we can tentatively say that whatever brought the "Dagestan" component to India was not a unidirectional process, but also brought a similar population element to western Europe.

Friday, December 17, 2010

Fine-scale South Asian admixture analysis + Results for Project participants

After my recent experiment on the number of markers needed to split closely related populations, I was encouraged to take another stab at integrating the Xing et al. (2009) dataset with my other collections. This dataset has only ~40k markers in common with my other datasets, as it was typed on a different chip, and after data cleaning (--geno 0.01 in PLINK) and LD-based pruning (--indep-pairwise 50 5 0.3 in PLINK), I was left with a composite dataset of about 30,000 SNPs.
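The two PLINK steps mentioned above can be sketched as a short command sequence. This is only an illustration built around the flags quoted in the text (--geno 0.01 and --indep-pairwise 50 5 0.3); the binary name "plink", the file prefixes, and the final --extract step are assumptions, not a record of the commands actually run.

```python
def plink_cleaning_commands(in_prefix, out_prefix):
    """Build (but do not run) PLINK command lines for cleaning and LD pruning.

    The file prefixes are hypothetical; only the --geno and --indep-pairwise
    settings come from the post.
    """
    cleaned = out_prefix + "_geno"
    # Step 1: drop SNPs with more than 1% missing genotypes.
    geno = ["plink", "--bfile", in_prefix, "--geno", "0.01",
            "--make-bed", "--out", cleaned]
    # Step 2: LD-based pruning (window of 50 SNPs, step of 5, r^2 threshold 0.3).
    # PLINK writes the list of retained SNPs to <out>.prune.in.
    prune = ["plink", "--bfile", cleaned, "--indep-pairwise", "50", "5", "0.3",
             "--out", out_prefix + "_ld"]
    # Step 3 (assumed): keep only the SNPs that survived pruning.
    extract = ["plink", "--bfile", cleaned,
               "--extract", out_prefix + "_ld.prune.in",
               "--make-bed", "--out", out_prefix + "_pruned"]
    return [geno, prune, extract]


for command in plink_cleaning_commands("merged", "composite"):
    print(" ".join(command))
```

Each list could be passed to subprocess.run on a machine where PLINK is installed; here the commands are only printed.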

The primary reason for wanting to revisit this dataset is the fact that it had two additional Caucasus populations (Stalskoe and Urkarah) as well as several Indian populations (from Andhra Pradesh, Tamils, and Irula).

In the standard K=10 analysis of the Project, Indian participants invariably get a mixture of "South Asian", "West Asian", "North European", and "East Asian" components, but obviously we should be able to do better than that.

A note of caution: The reduced marker set (~30k) means that a lot of noise is added in the admixture estimates. In particular, many individuals are likely to get low-level admixture from population sources that can be attributed to noise. But, as we will see, the small marker set does not really affect either the power of the GALORE approach, or of ADMIXTURE to infer meaningful clusters.

Dodecad participants

In addition to the reference populations, I have included 14 Dodecad Project members (with 23andMe data) with the criterion that they are unrelated, have >5% of the "South Asian" component, and have less than 5% of the East and West African components. By ID these are:
DOD223 DOD067 DOD010 DOD029 DOD126 DOD128 DOD089 DOD091 DOD090 DOD220 DOD075 DOD078 DOD088 DOD201
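The inclusion criterion can be written down as a small filter. This is a sketch only: the dictionary keys and the example percentages are made up for illustration, and it assumes the 5% African threshold applies to each of the two components separately, which the text does not spell out.

```python
def eligible_for_south_asian_run(components, related=False):
    """Return True if a participant meets the stated inclusion criterion.

    `components` maps component names to percentages from the K=10 analysis.
    Assumption: the <5% limit is applied to East and West African separately.
    """
    return (
        not related
        and components.get("South Asian", 0.0) > 5.0
        and components.get("East African", 0.0) < 5.0
        and components.get("West African", 0.0) < 5.0
    )


# Hypothetical participant, not real project data.
example = {"South Asian": 42.0, "West Asian": 30.0, "East African": 0.0, "West African": 0.3}
print(eligible_for_south_asian_run(example))
```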

GALORE analysis

To verify the existence of structure in the data, I used the MCLUST/MDS approach I've described earlier to infer the existence of clusters in the data. 34 clusters were detected with 16 dimensions of MDS retained.
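For readers unfamiliar with MDS, the idea of turning pairwise genetic distances into coordinates can be sketched in a few lines. This is classical (Torgerson) MDS restricted to the first dimension and computed by power iteration; it is an illustrative stand-in under those assumptions, not the software used for the analysis, and the toy distances are invented.

```python
import math


def classical_mds_1d(dist, iterations=200):
    """First classical-MDS coordinate for a symmetric distance matrix."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration converges to the leading eigenvector of B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]


# Toy example: four "individuals" whose distances are exactly one-dimensional.
positions = [0.0, 1.0, 3.0, 7.0]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
print(classical_mds_1d(toy_dist))
```

In the analysis described here several dimensions are retained (16 in this run), and a model-based clustering step is then applied to those coordinates.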



As you can see, despite the smaller number of markers, structure was effectively inferred by MCLUST. As expected, Dodecad Project members, who have diverse origins both in South Asia and beyond it, are "all over the place" in terms of their cluster assignments. In the reference populations, some interesting groupings occur:
  • Stalskoe and Lezgins fall in cluster #32. Stalskoe is a village in Dagestan inhabited by Turkic Kumyks; Lezgins are Northeast Caucasian speakers from Dagestan
  • Dai from China and Vietnamese fall entirely in cluster #10
  • Tamil Brahmins and Andhra Pradesh Brahmins fall mostly in cluster #5, and not in the same clusters as non-Brahmin Tamil and AP individuals
Let's turn to the Dodecad Project members, and look at their probability of assignment:


NNclean suggests that DOD078 is an outlier. This may be due to unique ancestry that is not represented in the other reference populations.

Unfortunately, only Razib of Gene Expression took the trouble of leaving some information in the ancestry thread. His sample, DOD075 is assigned to cluster #6 where the bulk of the Singapore Indians are, and a scattering of individuals from Indian populations. Feel free to add any non-identifying information in the relevant thread, e.g., "Brahmin", your state of origin, etc. Even a little bit of information may help others interpret their results better.

Origin of South Asians

As I've remarked in the past, Eurasia can be broadly seen as the playground of three major groups of people: the Caucasoids of the West, the Mongoloids of the East, and a southern group of people which is most strongly represented in South Asia, but whose presence can be detected in Southeast Asia as well, although in the latter case it has been marginalized and/or absorbed by the arrival of Mongoloids.

This southern group of people has sometimes been called "Australoid" because of its perceived resemblance to Australo-Melanesians. Indeed, in my K=5 mega-analysis an affinity between Papuans/Melanesians and people of South and Southeast Asia is apparent. These "Australoids" are very old populations, probably stemming from the early Out-of-Africa coastal dispersal route, and we shouldn't be tricked by their phenotypic similarity into thinking that different groups of them are particularly close genetically. Just as "black Africans" are not the same, neither are the "Australoids" and mixed-"Australoids" at the shores of the Indian Ocean.

It is probably the invention of agriculture that is responsible for their marginalization. In Africa, the Pygmies and Bushmen have been absorbed or pushed aside by the demographic Bantu juggernaut, with a few other language groups also hitching a ride on the agriculture/pastoralism economy. In West Eurasia, where agriculture was invented earliest, pre-agricultural populations left no traces. In East Eurasia, the agriculturalists could not expand to the far north where many relic populations exist, but they could (and did) move to the south where they assimilated or drove away pre-existing populations, leaving a few of them, like the Taiwanese Atayal, as partial remnants of the older population stratum.

It is in South Asia where there is clear evidence of fusion between indigenous and exogenous elements with the latter being similar to West Eurasians (Caucasoids). Moreover, both the great linguistic diversity and the caste system have helped maintain many distinct population groups. Naturally, tracing the origin of population elements present in the Indian mosaic is of great interest both for the people of India and for those outside it.

ADMIXTURE analysis

Below is the K=3 analysis which verifies the anthropological received wisdom about the three major Eurasian groups:

The East Eurasian component of this analysis is closer to the South Asian one (Fst=0.079) than to the West Eurasian one (Fst=0.114). The South Asian component is closer to the West Eurasian (Fst=0.063). The South Asian component as revealed in this plot is probably composite, as we shall see in the more detailed analysis below.
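As a reminder of what an Fst value measures, here is a minimal single-SNP sketch using Wright's definition for two equally weighted populations, followed by a lookup over the three K=3 values quoted above. The estimator used by ADMIXTURE averages over all markers and may differ in detail, so treat the first function as a textbook illustration rather than a reproduction of the reported numbers.

```python
def fst_single_locus(p1, p2):
    """Wright's Fst for one biallelic SNP with allele frequencies p1 and p2."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    if h_total == 0:
        return 0.0
    return (h_total - h_within) / h_total


# Pairwise Fst values between the K=3 components, as quoted in the text.
K3_FST = {
    frozenset(["East Eurasian", "South Asian"]): 0.079,
    frozenset(["East Eurasian", "West Eurasian"]): 0.114,
    frozenset(["South Asian", "West Eurasian"]): 0.063,
}


def closest_component(name):
    """Return the component with the smallest Fst to `name`."""
    candidates = {next(iter(pair - {name})): value
                  for pair, value in K3_FST.items() if name in pair}
    return min(candidates, key=candidates.get)


print(closest_component("East Eurasian"))
print(closest_component("South Asian"))
```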

Here is the much more detailed K=10 analysis:

Admixture proportions for this can be found in the spreadsheet. I reiterate that you should treat the labels of the ancestral populations as useful mnemonics and that you should not confuse them with the same labels used elsewhere.

There are lots of interesting things about the plot:
  • Both the Irula and the North Kannadi get their own clusters (light blue and pink)
  • The South Asians have additional structure, with a component centered on Pakistan (green) and one centered on India (orange)
  • Notice the elevated Siberian (or "Yakut") component in Turks and Stalskoe (Kumyks). The Adyghe also seem to have some of it, and since these are NW Caucasian speakers, it is plausible that this may represent some sort of Tatar element

Return of the Lezgin mystery

The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.

Readers of this blog may remember the surprising appearance of this Lezgin-specific component in the Balkans (but not Greeks) a few weeks ago. Now it has turned up as a substantial component in India as well.

Back then, I speculated that this component may derive from a prehistoric population that was spread in (but not limited to) the northern arc of the Black Sea from the Balkans to the Caucasus. Even in this analysis, you can see that both Romanians and Hungarians have some of it, and so do Lithuanians and Belorussians, while Tuscans (like the Greeks of my previous experiment) do not.

Hence, this component stretches from at least the Baltic to India, but is largely absent in southern Europe. I will go out on a limb and propose that this component is representative of a non-Indo-European component in the ancestors of the Indo-Iranians.

The absence in India of Y-haplogroup J1, so typical of Dagestanis, may suggest a speculative scenario, in which the ancestors of the Indo-Iranians picked up Northeast Caucasian women en route to the Iranian plateau and India.

Distances between components

Here is the table of Fst distances between the 10 components:


Brahmin origins

The importance of the caste system in shaping variation can be seen if we compare Tamil Brahmins with Tamil Lower Castes and Andhra Pradesh Brahmins with other AP populations. Brahmins possess both "Dagestan" and "Pakistan" components, which suggest their links to northern India in the first order, and West Eurasia in a more remote sense. The "Pakistan" component too is closest to the "West Asian" one.

Both "Dagestan" and "Pakistan" components are notable for their absence among non-Brahmins in both these south Indian localities.

Dodecad results

Once again, I can't comment on any of these except DOD075, who was probably right to speculate about input from Southeast Asia given his mixed "Southeast Asian"/"East Asian" affiliations, which resemble those of Vietnamese and Cambodians. The presence of both "Dagestan" and "Pakistan" components also points to more northwesterly influences.

Discussion

The most interesting thing about this little study is, no doubt, the expansion of the Dagestan mystery.

These South Indian Brahmins possess nearly as much of this component as people in Pakistan, and a few Iranians among my project members. They have more of it than many people living much closer to the Caucasus.

Given that they have partially absorbed indigenous Indian elements (evidenced by the "Indian" component, which is itself probably hybrid), the conclusion is inescapable that their ultimate non-Indian ancestors possessed even more of it.

Where did they come from? Any discussion of their origin or dispersal would be advised not to veer off too far from the Caspian Sea...

Wednesday, December 15, 2010

Coverage of the Dodecad Project in Nature

There is coverage of the Dodecad Project in this week's Nature. Below you can find a version of the graphic included in the article, which includes all studied populations.
You can also download a hi-res version of this from here. I understand that this was probably edited for space, but the distinctiveness and distribution of the components is best seen with the full array of populations.

The source of this figure is from my Eurasian analysis with K=15, although the color-coding is a bit different. In the original post you can also find "population portraits" which show individual-level admixture estimates for these populations.

The Nature piece highlights the connection between "Northern Finland" and Siberia, but this should really read "Finland and Northern Russia", the former coming from Dodecad Project members from different parts of Finland, the latter from Russians from Vologda included in HGDP.

I find the Finnish/Siberian connection quite interesting, as it bridges the gap between European Finnic and Siberian Samoyedic speakers within the Uralic language family, but the emergence of an Altaic component (dark grey) is even more exciting and unexpected: Altaic speakers are shown to possess, from Europe to the Pacific and spanning all three major sub-families of Altaic (Turkic such as Yakut and Chuvash, Mongolic such as Buriat and Daur, and Tungusic such as Evenk and Hezhen), a common genetic component which is otherwise rare in both Indo-European speakers of Western Eurasia and Sino-Tibetan, Hmong-Mien, and Paleosiberian populations of Asia. Hence, this appears to be a real genetic correlate of a language family's expansion.

The Nature piece also highlights the Joe Pickrell affair. It's a great story, because it shows how added value can be had by combining an "open source" model of genetic inquiry with traditional genealogy.

If you've come here via Nature, feel free to browse around and send me feedback in either the comments or at dodecad@gmail.com.

I've also posted a lot of material in my regular blog, Dienekes' Anthropology Blog, mostly under the Dodecad, ADMIXTURE-experiments, and GALORE tags. In particular, I've just finished my Human Genetic Variation trilogy:
If you want to join the Project, please note that data submission is currently closed, but feel free to subscribe to the feed to be alerted of opportunities to join the project, or to keep track of its progress in general.

Splitting populations: how many markers needed?

An interesting question in population genetics has to do with the number of markers necessary to infer population-of-origin of an individual.

In the Dodecad Project, we have seen that even very close populations such as Armenians and Assyrians can be classified accurately using the MDS/MCLUST "Clusters Galore" combination that I have proposed. But, do we really need ~177 thousand markers to achieve this level of detail?

I decided to carry out a small experiment, to see how the number of markers used affects the ability to correctly classify samples into two different populations.

Case A: Maximal differentiation (Papuans vs. Mbuti Pygmies)

I begin by considering the case of the two most differentiated human populations in my database, Papuans (17 individuals) and Mbuti Pygmies (13 individuals). For each step I reduce the number of markers by an order of magnitude, using PLINK's --thin 0.1 argument.
  • With 176,598 markers, classification is 100% correct, i.e., all 13 Pygmies are assigned to one cluster and all 17 Papuans are assigned to another
  • With 17,752 markers, classification is again 100% correct.
  • With 1,725 markers, ditto
  • With 152 markers, ditto
  • With 20 markers, 3 Pygmies are misclassified into the Papuan cluster, hence accuracy is 90%
  • With 4 markers, 5 Pygmies are misclassified as Papuans, and 6 Papuans as Pygmies, hence accuracy is 63%

Notice that these are not ancestry informative markers (AIMs) but randomly selected SNPs.
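PLINK's --thin option keeps each marker independently with the given probability, which is why the counts above are close to, but not exactly, one tenth of the previous step. A rough Python equivalent, with a hypothetical function name and marker IDs, might look like this:

```python
import random


def thin_markers(marker_ids, keep_probability=0.1, seed=None):
    """Keep each marker independently with probability `keep_probability`.

    Mimics the idea behind PLINK's --thin; markers are chosen at random,
    not for ancestry informativeness.
    """
    rng = random.Random(seed)
    return [marker for marker in marker_ids if rng.random() < keep_probability]


# Hypothetical marker IDs; successive thinning gives roughly 10x fewer each time.
markers = ["snp%d" % i for i in range(176598)]
step1 = thin_markers(markers, seed=1)
step2 = thin_markers(step1, seed=2)
print(len(markers), len(step1), len(step2))
```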

Case B: Small differentiation (Armenians vs. Assyrians)

I have previously shown that a sample of 7 Armenians and 8 Assyrians can be classified correctly, with a single Assyrian misclassified as Armenian, hence 93% accuracy.
  • With 17,714 markers, 1 Armenian is also misclassified as Assyrian, hence 87%
  • With 1,808 markers, accuracy is 80%
  • With 188 markers, it is 67%
  • With 17 markers, it is 60%
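The accuracy figures in both cases are simply the share of correctly assigned individuals, rounded to a whole percent. A small check follows; the function name is hypothetical, and the misclassification counts for the 80%, 67%, and 60% rows are inferred from the percentages (15 individuals in total) rather than stated in the text.

```python
def accuracy_percent(n_individuals, n_misclassified):
    """Percentage of individuals assigned to the correct cluster, rounded."""
    return round(100 * (n_individuals - n_misclassified) / n_individuals)


# Case A: 13 Mbuti Pygmies + 17 Papuans = 30 individuals.
print(accuracy_percent(30, 3))      # 20 markers: 3 misclassified
print(accuracy_percent(30, 5 + 6))  # 4 markers: 11 misclassified

# Case B: 7 Armenians + 8 Assyrians = 15 individuals.
for errors in (1, 2, 3, 5, 6):
    print(errors, accuracy_percent(15, errors))
```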

Implications

I carried out this little experiment because I thought it would be interesting, but also for its implications. I can identify at least two major ones:

Ancient DNA: Due to poor preservation we are unlikely to get full genome sequences from ancient human DNA except in highly favorable conditions. In many instances, it may be possible to get only a limited number of markers tested. These results suggest that it is possible to get fairly decent assignment of individuals into populations without hundreds of thousands of SNPs. Thus, it may be possible to study genetic structure in ancient necropoleis, or the relationship of ancient remains to modern populations.

Data integration: Different genotyping platforms (e.g., those of Affymetrix and Illumina) often possess a very small subset of common SNPs. Imputation of genotypes is possible, but these results suggest that even when the overlap between markers is not substantial, it is possible to carry out fairly sophisticated genetic analyses on them.

Conclusion

Naturally, there are issues not addressed here: what happens when we are dealing with more than 2 populations? What happens if we want to study admixture rather than classification? In both cases, I expect the number of markers needed to be higher.

Nonetheless, I am fairly convinced, both from this and from a previous experiment, that more markers than are used in current genotyping platforms will only add very little value to anthropological investigations of ethnic genetic differences.

Monday, December 13, 2010

Genetic structure in North-Central Europe with the Galore approach

I first posted this in the comments of my other blog, but it is worth a post of its own.

Here is the result of applying MCLUST to a group of Central-North European populations. The maximum number of 13 clusters is reached with 5 MDS dimensions retained:

Some clusters are population-specific (e.g., #7 for Finns, #10 for Lithuanians, #12 for Russians). Some clusters are semi-specific (e.g., #3 for Hungarians, #1 for French). Some populations are split into multiple clusters (e.g., Orcadians or Germans).

Here is a neighbor-joining tree of these 13 clusters based on the first 5 MDS dimensions:
12, 13: Russians
9, 10: Balto-Slavs
5, 3, 2, 4, 1: Northwest Europeans
11: A couple of Hungarians (*)
7, 8: Orcadians
6: Finns

I generally frown upon phylogenies for human groups, as I believe that human genetic variation is better represented as a network due to lateral gene flow. However, this tree gives an idea of the relationship between clusters.

(*) It's interesting that these 2 Hungarians are the same ones that showed an elevated "Altaic" component in my K=15 analysis.

Saturday, December 11, 2010

Assyrians vs. Armenians

Assyrians and Armenians have invariably been assigned to the same cluster in Clusters Galore analysis thus far. It is very important to understand the following two axioms:
  • Populations that are separated by clustering analysis have enough genetic differences between them to allow such separation
  • Populations that are not separated may, or may not have enough genetic differences between them to allow separation
The second axiom is extremely important. Let's elaborate:
  • The clustering algorithm may have its limitations
  • The number of markers may be insufficient
  • The number of individuals may be insufficient
In the Galore approach, I use MCLUST, as good a clustering algorithm as any. I also have samples of 8 Assyrians and 7 Armenians, and use about 177k SNPs, which are enough to squeeze out differences even between related populations. So, what gives? Are Assyrians and Armenians really genetically indistinguishable?

What constitutes a cluster? MCLUST can divide a dataset into as many clusters as you want, but it also chooses the number of clusters to optimize the Bayesian Information Criterion. But, it's important to note that the BIC is not some god-given arbiter of what a cluster is; it is best viewed as a guide to choose a good number of clusters, and not as a guarantee that this is the true number of clusters that a population can be subdivided into.
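To make the trade-off concrete, here is a sketch of the BIC in the sign convention used by mclust (larger is better): twice the log-likelihood minus a penalty that grows with the number of parameters and the logarithm of the sample size. The function names and the parameter counts in the example are illustrative assumptions, since the actual count depends on the covariance model chosen.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the larger-is-better convention: 2*logL - n_params*log(n_obs)."""
    return 2 * log_likelihood - n_params * math.log(n_obs)


def loglik_gain_needed(extra_params, n_obs):
    """Log-likelihood gain an extra cluster must deliver to raise the BIC."""
    return extra_params * math.log(n_obs) / 2


# With 15 individuals (7 Armenians + 8 Assyrians) and, hypothetically,
# 4 extra parameters for a second cluster:
print(loglik_gain_needed(4, 15))
```

With so few individuals, a real but modest improvement in fit can fall short of this threshold, in which case the BIC will prefer the single-cluster solution even though a two-cluster split is feasible.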

Clustering Assyrians and Armenians, assuming 2 clusters

To that end, I decided to cluster Assyrians and Armenians, forcing MCLUST to infer 2 clusters. I only retained 2 MDS dimensions, as these are enough to distinguish between 2 groups. Here is the MDS plot:
We can observe that Assyrians can be distinguished from Armenians along Dimension 1, and there is also some structure in Assyrians, with 3 of them forming a mini-cluster on top, 4 of them forming another mini-cluster at the bottom, and 1 of them being closer to the Armenians.

By applying MCLUST with K=2, all 7 Armenians and 1 Assyrian are assigned to a cluster, and the remaining 7 Assyrians to another. Thus, the two groups can be separated from each other, although their differences (due to the factors I mentioned) are not large enough to lead to an improvement of the BIC.

With more individuals, it's possible that the BIC too will be able to track the improvement in likelihood that adding two clusters will produce.

And, indeed, it may be possible that some of these populations will be further subdivided. If more Assyrians join the project, for example, it may be the case that the two apparent Assyrian clusters will emerge, or it will be the case that the space between them will be filled by currently unsampled individuals.

As more individuals join the Project, ever-finer distinctions can be uncovered.

NB: I make the case for Assyrians and Armenians, but I have also tried this for other closely related populations who were assigned to the same cluster by Galore analysis, namely Spaniards and Portuguese. In that case, however, I was not able to find a clean solution with K=2. Once again, this may mean that Spaniards and Portuguese are not genetically that different, or it may mean that their difference will be revealed with more individuals joining the project.

Friday, December 10, 2010

Results for DOD237 to DOD244 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Monday, December 6, 2010

Clusters galore, K=50 for Dodecad Project members (up to FFD055)

I am now repeating the Clusters galore analysis for Family Finder data (for more info, see the previous post on 23andMe data).

With 14 MDS dimensions retained, there were 50 clusters inferred in the optimal solution by MCLUST.

The first rows of the results spreadsheet are for the 54 project participants: each row gives the probabilities that you belong to each cluster. These are followed by the reference populations, where each row has the number of individuals (for that population) that are assigned to each cluster.
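Reading a participant row then comes down to finding the cluster with the highest probability. A minimal sketch, assuming clusters are numbered from 1 in column order and using made-up probabilities:

```python
def most_likely_cluster(probabilities):
    """Return (cluster number, probability) for the most probable cluster.

    Assumes the first probability belongs to cluster #1, the second to #2, etc.
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best + 1, probabilities[best]


# Hypothetical row for one participant over five clusters.
row = [0.02, 0.91, 0.00, 0.05, 0.02]
print(most_likely_cluster(row))
```

A row whose largest probability is well below 1 indicates an individual who sits between clusters rather than squarely inside one.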

There are also some outliers in this analysis:
FFD002 FFD004 FFD007 FFD012 FFD015 FFD016 FFD021 FFD022 FFD023 FFD038 FFD046
Check what an outlier is in the context of this analysis, and what it means.

Interestingly, because of the smaller number of Family Finder participants some previously defined clusters (for 23andMe data) such as the "Finnish" cluster do not appear here. This is not surprising at all, because for a cluster to be defined several individuals from that population must be present in the data.

Many continental Europeans whose populations lack a cluster of their own ended up in cluster #2. Some others, like FFD048, who is Lithuanian, were assigned to the proper cluster #9, centered on Lithuanians.

This underscores the importance of having more people join the Project at the next available opportunity. This will not only create new clusters for individuals who are currently the only representatives of their populations, but it may also split already existing clusters if regional sub-populations are detected.

It is also important for project participants to drop a note at the ancestry thread, to help others make better sense of their results.