Dodecad Ancestry Project: November 2010

Tuesday, November 30, 2010

Outliers in the Dodecad Project (23andMe data)

As promised, I have started to investigate outliers among Dodecad Project members. I used NNclean as implemented in the prabclus package to find data points that had a great distance to their nearest neighbor among either Dodecad Project members or the standard 692-individual panel I use in the Galore analysis.

To make a long story short, here are the IDs identified as outliers:

"DOD157" "DOD168" "DOD169" "DOD036" "DOD048" "DOD088" "DOD034" "DOD030" "DOD060" "DOD132" "DOD128" "DOD175"

An outlier is someone who is not very close to any other individual and hence does not really "cluster" with anyone. Thus, it is recommended to remove outliers prior to clustering, as otherwise they will form makeshift clusters that don't really have a good meaning.
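The nearest-neighbor idea can be sketched in a few lines. This is not NNclean itself; it is a simplified stand-in that flags points whose nearest-neighbor distance is far above the median. The function name, the cutoff factor, and the toy coordinates are all assumptions made for illustration.

```python
import math
from statistics import median

def nn_outliers(points, factor=3.0):
    """Flag points whose nearest-neighbor distance exceeds factor * median.

    points: dict mapping an ID to a coordinate tuple (e.g., MDS coordinates).
    factor: hypothetical cutoff multiplier; not a value taken from NNclean.
    """
    nn_dist = {}
    for pid, p in points.items():
        # Distance to the closest *other* point in the dataset.
        nn_dist[pid] = min(
            math.dist(p, q) for qid, q in points.items() if qid != pid
        )
    cutoff = factor * median(nn_dist.values())
    return sorted(pid for pid, d in nn_dist.items() if d > cutoff)

# Toy data (made up): two tight groups and one individual in the "empty space".
toy = {
    "A1": (0.0, 0.0), "A2": (0.1, 0.0), "A3": (0.0, 0.1),
    "B1": (5.0, 5.0), "B2": (5.1, 5.0), "B3": (5.0, 5.1),
    "MIX": (2.5, 2.5),
}
print(nn_outliers(toy))  # ['MIX']
```

The individual sitting between the two groups is the only one flagged, which mirrors the case of a recently admixed person with no similar individual in the dataset.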

Looking at the individual spreadsheet reveals that many of these outliers have very unusual ancestry. This falls under two categories:

Recent admixture between geographically separated populations
Being the only member from an unsampled population

In the first case, admixed individuals fall in the "empty space" between their parental clusters, and thus do not cluster with anyone else, unless a person with a similar type of admixture happens to also be in the dataset.

In the second case, there are no members of the individual's group. Sometimes, if a group is close enough to another, this is not a problem, but there are many distinctive population groups for which that is not the case.

While outliers will be removed from some analyses, their outlier status will continue to be evaluated as new reference populations or Dodecad Project members are added.

What's next for Clusters Galore analysis

The first few runs of the Clusters Galore analysis have proven quite successful; I've posted another one on the HGDP panel in my other blog.

Now, it is time to assess the results and see what improvements can be made. I see a few avenues for improvement:

Outliers

Clusters, by definition, are composed of at least 2 individuals. Individuals who are the only representatives of their populations (e.g., a Pygmy or an Icelandic+Armenian mix) will, by necessity, attach themselves to the closest cluster (e.g., to Yoruba, or to some Central European population), even though they are not necessarily close to that population.

Outlier detection is a difficult problem, but I will try some ideas on how to tackle it.

Phantom clusters

mclust is not immune to phantom clusters, i.e., clusters of "misfits" who don't belong in any other population but are banded together erroneously by the algorithm. That is inevitable in an automated procedure, especially one that is pushing the limits of ancestry inference. Phantom clusters are, by their nature, transient, so there are some ideas on how to avoid them and how to focus on very robust and repeatable clusters.

Typicality

Being part of a cluster tells you nothing about how "typical" a member of the cluster you are, i.e., how close to the average. This problem is exacerbated by the fact that the clusters inferred by mclust may have varying shape, size, and orientation.

Nonetheless, there are ideas on how to quantify members' typicality, and I will explore them. Please note that typicality is not necessarily the same as "purity". For example, an elongated cluster of African Americans will have typical members with 20% European admixture, but the "purest" African Americans will have 0% European admixture and be very atypical of their group as a whole. Similarly, typical Turks have 5-6% East Eurasian admixture; people with 10% East Eurasian admixture are less typical, yet more likely to be descended from Central Asian Turkic people.

Any new technique will have its birth pains, and hopefully others and I will be able to identify and resolve them.

Monday, November 29, 2010

Galore analysis improved, plus K=56 results of Clusters Galore analysis for Dodecad Project members (up to DOD236)

For background, please read the post on the K=48 analysis and links therein.

As I mentioned in the previous posts, my technique depends on the use of MDS to reduce the dimensionality of genomic data from 177,000 SNPs or so to a few dozen dimensions capturing most of the variance.
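For readers who want to see what the MDS step does mechanically, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It assumes a small, symmetric matrix of Euclidean-like distances; the function name and the power-iteration approach are illustrative choices, not the software actually used for the analysis, and it would be far too slow for real genomic data.

```python
import math

def classical_mds(dist, n_dims, iters=200):
    """Classical MDS: embed n items in n_dims dimensions from a distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        # Arbitrary deterministic start vector, normalized to unit length.
        v = [math.sin(i + k + 1.0) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        # Power iteration for the leading eigenvector of the (deflated) matrix.
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue (variance along this axis).
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        if lam <= 1e-12:
            break  # no variance left to capture
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next dimension.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Three points on a line at positions 0, 1, and 3 (toy example).
line_dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
print(classical_mds(line_dist, 1))
```

Each successive dimension captures less variance than the one before it, which is why truncating to the first few dozen is a reasonable way to compress the data.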

Subsequently, mclust, a state-of-the-art clustering algorithm, is applied to the MDS representation: it iterates over choices of K, the number of clusters, trying clusters of different shape, volume, and orientation, and chooses the optimal clustering, the one maximizing the Bayesian Information Criterion. In simpler terms, it finds as much detail as possible in the data but penalizes overly ornate models and avoids finding "ghost" clusters that are not really supported.
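To make the "penalizes ornate models" point concrete, here is a small sketch of how the penalty grows. It assumes the "larger is better" form of BIC, 2*logL - m*log(n), which is the convention I understand mclust to use, and it counts parameters for only two illustrative covariance structures; the function names are hypothetical and mclust considers more model types than these.

```python
import math

def gmm_param_count(k, d, full_covariance=True):
    """Free parameters in a K-component Gaussian mixture in d dimensions."""
    weights = k - 1                  # mixing proportions sum to 1
    means = k * d                    # one mean vector per cluster
    if full_covariance:
        cov = k * d * (d + 1) // 2   # one symmetric d x d matrix per cluster
    else:
        cov = k                      # one variance per spherical cluster
    return weights + means + cov

def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' form: 2*logL - n_params*log(n_obs)."""
    return 2.0 * loglik - n_params * math.log(n_obs)

# Penalty terms for K=56 clusters in 16 dimensions under the two structures.
print(gmm_param_count(56, 16, full_covariance=False))  # 1007
print(gmm_param_count(56, 16, full_covariance=True))   # 8567
```

A more flexible cluster shape has to buy its extra parameters with a correspondingly large gain in likelihood, which is what keeps the number of clusters from growing without bound.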

These are clusters derived from data of unlabeled individuals. The only human input into the process is the number of MDS dimensions to retain.

In my previous K=48 analysis, I retained 30 dimensions, but I also noted that this is not really optimal. Choosing more or fewer dimensions might lead to even better resolution (higher K).

More dimensions = more possible ways to distinguish between individuals, but also, possibly, more noise, as individuals might not be "clustered" in them.

Fewer dimensions = fewer possible ways to distinguish between individuals, but also, possibly, less noise from the uninformative higher dimensions.

Thus, the question arises: how many dimensions to retain?

Here is a plot of the optimal number of clusters inferred, depending on how many dimensions I chose to retain:

As you can see, when few dimensions are retained, relatively few clusters are inferred, while as the number of dimensions goes beyond a certain point, the number of clusters starts to decrease again, as more noise is added. (*)

The number of clusters peaks (see figure) at 16 and 22 dimensions retained; both of these produce 56 different clusters in the optimal solution.
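The selection rule amounts to picking the dimension counts that yield the most clusters. A minimal sketch, using only the three (dimensions, K) pairs quoted in these posts and a hypothetical helper name:

```python
def dims_with_most_clusters(k_by_dims):
    """Return the highest K and every dimension count that achieves it."""
    best_k = max(k_by_dims.values())
    return best_k, sorted(d for d, k in k_by_dims.items() if k == best_k)

# Dimensions retained -> clusters in the optimal solution, as quoted in the text.
reported = {16: 56, 22: 56, 30: 48}
print(dims_with_most_clusters(reported))  # (56, [16, 22])
```

Ties like the one between 16 and 22 dimensions still need a judgment call; the results reported here use 16.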

Here are the results for Dodecad Project members (up to DOD236) with K=56 and 16 dimensions retained. In comparison to the previous K=48 analysis, we are now able to:

Split CEU White Utahns (#1) from French (#15)
Split CEU White Utahns (#1) from continental Germanics (#14)
Split French (#15) from Spaniards (#2)
Split Armenians (#7) from Turks (#19)
Split Slavs (#23) from Balts (#26)
Split Cypriots (#30) from Sephardic Jews (#21)

The most astonishing finding is, however, at least for me, the emergence of a cluster (#16) composed in the great majority of people from Greece and Southern Italy, with very few individuals from elsewhere. Notice that #16 has absolutely no representation in the reference populations, which lack South Italians and Greeks.

Once again, I urge participants to help themselves and others by leaving a comment in the ancestry thread.

(*) The change is not, however, smooth. The more general problem is to choose which dimensions to retain, rather than choosing how many of the first ones to retain. The first few dimensions of MDS capture a decreasing portion of variance, but the data are not guaranteed to be "split" in them. However, this is a much harder problem, as we have to figure out (i) how many dimensions to retain, and (ii) which ones. Even if we fix (i), by choosing to retain, e.g., 10 dimensions, we still have to choose which 10: this is close to half a billion different combinations of which 10 to choose from the total of 38 possible candidate ones.
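The "half a billion" figure is the binomial coefficient C(38, 10), which is easy to check:

```python
import math

total_dims = 38   # candidate MDS dimensions mentioned in the text
keep = 10         # dimensions to retain

n_subsets = math.comb(total_dims, keep)
print(f"{n_subsets:,}")  # 472,733,756

# If the number to keep is not fixed either, every non-empty subset is a candidate.
all_subsets = 2 ** total_dims - 1
print(f"{all_subsets:,}")  # 274,877,906,943
```

Each candidate subset would require its own clustering run, so exhaustive search is out of the question even with the number of dimensions fixed.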

Results for FFD048 to FFD055 posted

This concludes the results for the Family Finder individuals who sent me their data by the deadline.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Results of Clusters Galore analysis for Dodecad Project members (up to DOD236), K=48

This is the result of the new type of ancestry analysis I have recently devised. For background, please read:

In total 894 individuals were included in this analysis, 202 Dodecad Project members and 692 from the published references. 30 MDS dimensions were retained, and mclust was run with a maximum number of clusters = 60. In the optimal solution, 48 clusters were inferred.

At the beginning of the results spreadsheet are the 202 Dodecad Project members and their probabilities of belonging to the 48 clusters. This is followed by the 36 reference populations, listing how many individuals from each one were assigned to each of the 48 clusters.

To help you interpret these results, you might want to consult the individual and population ADMIXTURE spreadsheets, as well as the information Project members have chosen to reveal about themselves. Feel free to add your own information in that thread.

Here are the results for some people who have chosen to reveal information about themselves:

Spencer Wells is DOD162 and he has 28% probability of being in the CEU (White Utahn)/French cluster #1, and 72% probability of being in cluster #3 which is CEU (White Utahn) specific (in the reference populations) and in which 29 Project members are also assigned (almost all of them Northwestern Europeans)
Razib of Gene Expression is DOD075 and is assigned in cluster #15, which includes Gujarati and North Kannada individuals
pconroy submitted three samples: DOD097 is Sicilian and falls in cluster #9, to which many Cypriots and Sephardic Jews belong, as do many Project members of South Italian/Sicilian background; DOD098 and DOD099 are his Irish parents: DOD098 is almost evenly split between #1 and #3 (like Wells), and DOD099 is 97% in #3
Adriano Squecco DOD139 is North Italian and is in cluster #2, in which 25/25 reference Tuscans and 10/12 reference North Italians belong
Lacko DOD083 is 100% southern Polish and he falls in cluster #20, in which all reference Belorussians and Lithuanians fall
Mike Maddi DOD021 is Sicilian and is in cluster #9 like DOD097, showing some probability (13%) of also being in #2 like Adriano Squecco
An Anonymous Pole falls in #20 like Lacko and the reference Slavs
Ilmari DOD003 and Ari DOD131 are Finns and fall in cluster #13. This is an interesting one, as it does not occur at all in the reference populations; I'll let you guess what population it's centered on.
Eastara's mother (DOD025) is Bulgarian and falls in cluster #10, which is centered on Romanians in the reference populations, but note that there are also 13 Dodecad Project members who fall in it, many of them from different parts of the Balkans. I hope more people from the Balkans will contact me for inclusion in the project, as I am sure that finer-scale resolution can be achieved there with increased participation.
Basar (DOD049) is half-Anatolian Turk and half-Laz. He falls in the cluster encompassing Armenians and Turks in the reference populations.
Bubba (DOD066) is North German with a pinch of Danish and falls in cluster #3
afpjr (DOD014) is half Greek and half Italian/Sicilian; he falls in cluster #9 (95%)
Francesc (DOD217) is Catalan and falls in cluster #17, centered on Spaniards in the reference populations.
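The probabilities in the spreadsheet can be read as soft assignments. A minimal sketch of turning them into a single cluster label while keeping track of borderline cases; the function name and the 0.9 cutoff are assumptions for illustration, and the only data used are the two probabilities quoted above for DOD162.

```python
def assign_cluster(probs, borderline_below=0.9):
    """Pick the most probable cluster and flag the assignment if it is not clear-cut.

    probs: dict mapping cluster number -> membership probability.
    borderline_below: hypothetical cutoff for calling an assignment borderline.
    """
    best = max(probs, key=probs.get)
    return best, probs[best] < borderline_below

dod162 = {1: 0.28, 3: 0.72}    # probabilities quoted in the text
print(assign_cluster(dod162))  # (3, True)
```

A hard label alone would hide the fact that an individual like this one sits between two clusters, which is why the spreadsheet lists probabilities rather than a single cluster number.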

The Project Greeks and Mixed Greeks fall in clusters #2, 9, 10 tying them to Italy and the Balkans, as might be expected. I hope more Greeks will decide to participate in the Project, so we can discover more interesting patterns in our population.

I cannot stress enough how revealing non-identifying ancestry information in the relevant thread will help both yourselves and others make better sense of these and future results.

There are clusters composed entirely of Dodecad Project members (e.g., the aforementioned one), and others which are centered on one or two reference populations, but encompass a wider variety of non-represented populations. So, please take the time to leave a comment in the ancestry thread.

This is not the end of the story. There are more clusters to be discovered in the data; the inclusion of 200+ new samples in this analysis has caused new clusters to appear and distinctions that were previously detected to "fold back" (e.g., between Armenians and Turks). I am currently investigating how the choice of number of MDS dimensions to retain affects the number of inferred clusters in the optimal solution.

Sunday, November 28, 2010

Submission of Family Finder data is now CLOSED

Thank you all for submitting your data; the remaining results will be posted in the blog over the next week or so.

If you have submitted your data in time, but did not receive an ID yet, you will.

If you want to be alerted to future opportunities, and to keep up with the progress of the Project, please subscribe to the feed.

Clusters galore: less is more, or, pushing the limits of ancestry inference

I had already hinted in my previous post on my new technique that retaining all MDS dimensions might add noise to the analysis, and I was hopeful that even finer resolution could be achieved with fewer dimensions.

In the spreadsheet you can see the optimal solution if one retains only 10 dimensions, in which case 45 clusters are inferred. Previously, I retained 47 dimensions and got only 35 clusters in the optimal solution: less is more.

In comparison to the previous analysis, I can detect some interesting changes:

Spaniards and Portuguese are split from Tuscans and are joined by some French and North Italians; the rest of the French stay with White Utahns, and the rest of the North Italians stay with Tuscans.
Romanians too get their own cluster
Turks are split from Assyrians/Armenians.
Germans are split from Scandinavians, with 1 sample from either population going to the other population.
The relationship between Cypriots and South Italians is retained, but most of the Greeks (many of whom were borderline between the Tuscan and South Italian cluster) go the Tuscan way.

I am now studying how to choose the optimal number of MDS dimensions to retain, so I will not report any individual data about this to project participants. I just wanted to let everyone share in the excitement.

Results for FFD035 to FFD047 posted

Individual proportions can be found in the spreadsheet

All populations:

Individual bars:

Clusters galore with Dodecad populations

Number of individuals assigned to each cluster can be found in the spreadsheet. Populations in italics are composed entirely from Dodecad Project members.

Please read the post in Dienekes' Anthropology Blog to see what this type of analysis means.

47 MDS dimensions were retained, and the optimal number of clusters was 35. Retaining fewer or more dimensions may alter this number, as after a certain point extra dimensions only contribute noise to the analysis; this is a matter of investigation.

It is hardly practical to comment on all 35 clusters, so I will limit myself to a few observations:

Turks, Armenians, and Assyrians fall in cluster #1
Scandinavians, White Utahns, Germans, and some French fall in cluster #2
Portuguese, French, North Italians, Tuscans, Spaniards, and Romanians fall in cluster #3
Greeks, South Italians/Sicilians, Cypriots, and Sephardic Jews from Bulgaria and Turkey fall in cluster #4 (but see note)
Finns fall in cluster #5
Almost all Ashkenazi Jews fall in #6
All Dodecad Project Russians, plus reference Lithuanians and Belorussians fall in #9

8 Greeks fall in cluster #4 and 2 in cluster #3. However, many of the ones who fall in #4 also have some non-trivial probability of falling in #3. Probabilities for all other clusters are less than 0.1%. All Project Greeks can write to me to learn their exact probabilities.

Of course, it should be noted that:

If two populations can be perfectly distinguished from each other, then there are genetic differences between them (they split from each other some time ago, they underwent different types of admixture, etc.) allowing the clustering algorithm to detect their differentiation
If two populations cannot be distinguished from each other, this does not mean that they are indistinguishable in principle; it does mean, however, that through either common ancestry or very similar patterns of admixture, they have become quite similar to each other in the Eurasian context.

If you are a Dodecad Project member (23andMe data) from one of the populations in italics and are wondering which cluster you fall in, first check whether all individuals from your population fall in the same cluster, in which case you already know the answer.

Otherwise, you may write to me, with your DOD number, and I'll tell you.

Results for FFD020 to FFD033 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Saturday, November 27, 2010

Results for FFD003 to FFD019 posted

UPDATE: Color-coding problem fixed.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Friday, November 26, 2010

Results for DOD223 to DOD236 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Thursday, November 25, 2010

ADMIXTURE analysis for Family Finder (FTDNA) samples

I am now able to provide K=10 analysis to Family Finder customers.

The rules of participation are:

No relatives (up to 2nd cousin)
100% Eurasian or North African ancestry; the test does not include Native American samples, or an assessment of Native American ancestry
Data must be received by Sunday 28/11

In order to participate, you must send to dodecad@gmail.com your autosomal data (.gz ending) that you can download from FTDNA, as well as information about your known ancestry (such as country of origin, or ethnic affiliation)

There may be other opportunities for people to participate, so please subscribe to the feed.

Your raw data or genealogical information will not be shared or distributed in any manner, and it will not be analyzed for any other purpose than assessment of ancestry (i.e., not for any physical or health-related traits). It will be identified by a unique ID, known to you and me, and results will be posted in the blog using that ID. I will continue to analyze your data for ancestry, and new results will be posted using that same ID. Also, I will report aggregate results for populations with at least 5 participants.

You will receive your ancestral proportions from 10 inferred ancestral components as in the following figure:

This was generated using the same 104,790 markers that I will be using to analyze your sample. Exact admixture proportions for these populations can be found in the population spreadsheet.

Note that these proportions are not directly comparable with those using 23andMe data, as a different set of markers is used in the latter, and there is a smaller overlap between Family Finder data and those of the reference populations I am using. Here is the current population spreadsheet for 23andMe data.

There are already two Family Finder participants in the Project, with IDs FFD001 and FFD002; these are volunteers who helped me with their data when I adapted EURO-DNA-CALC for Family Finder data. Their results are in the individual spreadsheet.

Friday, November 19, 2010

ADMIXTURE analysis with Dodecad Populations (update #1)

Repeating the previous analysis with additional populations of Dodecad Project members and/or modified sample sizes for pre-existing ones:

Assyrian, Scandinavian, Greek, Finnish, S_Italian_Sicilian, Ashkenazi, German, Indian, Portuguese, Armenian

Admixture proportions can be found in the spreadsheet. Dodecad Project populations in italics.

Population portraits can be found in the RAR. For example, here are the ones for Dodecad Project Ashkenazi and Behar et al. (2010) Ashkenazi Jews:

and here is a Portrait of the Portuguese:

Thursday, November 18, 2010

How Turkish are Anatolians? revisiting the question

In 2005, I estimated the Y-chromosome heritage of Turkic speakers in modern Anatolians at 11%.

In the same year, I estimated the Mongoloid admixture in Anatolian Turks at 6.2% on the basis of Y-chromosome and mtDNA. This is not inconsistent with the previous percentage, as the Turks, when they arrived in Anatolia, were almost certainly of mixed Caucasoid-Mongoloid heritage.

Surprisingly, the maternal contribution from East Eurasia seems higher than the paternal one, on the basis of uniparental markers. But that is not so surprising if one considers that the Turks who arrived in Anatolia were to a degree descended from Turkicized groups of Iranian steppe nomads bearing Caucasoid patrilineages. Already we have ancient DNA evidence of groups in Central Asia with Caucasoid patrilineages (R1a1) and mixed Caucasoid-Mongoloid mtDNA.

In 2007, some Turkish researchers estimated, using Alu polymorphisms, the Central Asian admixture in Turks at 13%, quite close to my own estimate of 11%. Given the observation that there is more Mongoloid mtDNA than Mongoloid Y-chromosomes in modern Anatolians, the slight difference of 2% is probably accounted for.

In October, I estimated the Mongoloid admixture in Turks at 5.5%, quite close to the 6.2% arrived at in 2005 using Y-chromosomes and mtDNA. Subsequently, my K=10 Dodecad analysis (spreadsheet) arrived at a 6.7% sum of "East Asian" and "Northeast Asian" components. The slight increase is not surprising, as the K=10 analysis included a greater sampling of Mongoloid diversity.

In ISBA4, another group of Turkish researchers arrived at a 13% estimate for the nomadic Turkic element in modern Anatolian Turks.

Finally, my K=15 analysis has revealed 7.9% "eastern" components in Turks. Given that the "Central Siberian" component is equidistant from Caucasoids and Mongoloids, this translates into about 7.2% East Eurasian admixture. Again, the slightly larger result can be accounted for by the sampling of even greater Mongoloid diversity, from the previously unsampled Siberia.

Summary

Y-chromosome, mtDNA, and autosomal DNA analysis by myself and by Turkish researchers all point to 6-7% of Turkish genetic heritage being specifically east Eurasian in origin, and about 1/7 of their genetic heritage coming from Central Asia.

ADMIXTURE analysis

I received an e-mail from a Turkish participant in the project, who wondered whether the K=15 analysis was supportive of much higher demographic influence of Central Asian Turks in the current Turkish population.

In particular, a back-of-the-envelope calculation of "eastern" components in Turks and Uzbeks led him to the conclusion that this was at least 20% and probably more.

Thus, I decided to perform direct ADMIXTURE analysis of Turks and Uzbeks to see what the estimate of Central Asian admixture in Turks actually is.

In the above figure, there are (left-to-right): 1 Dodecad Project 50% Turk-50% Laz showing no Central Asian admixture, 3 Dodecad Project Turks, 19 Turks from Behar et al. (2010), followed by Uzbeks (blue, with some seemingly admixed individuals), followed by 15 Dodecad Project Greeks and Armenians (red).

Behar et al. (2010) Turks have 15.4% Central Asian admixture; if we add the 3 Dodecad Project Turks to the sample, this becomes 14.4%. I'll be happy to tell the three Turks in the Project their individual proportions if they e-mail me.
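The drop from 15.4% to 14.4% is simply a pooled average shifting as three individuals are added to nineteen. A rough sketch of the arithmetic, assuming the 14.4% figure is the mean over all 22 individuals and that the published percentages are rounded; the helper name is hypothetical and the result is only an implied group average, not anyone's individual proportion.

```python
def implied_subgroup_mean(n_a, mean_a, n_total, mean_total):
    """Mean of the added subgroup implied by the means before and after pooling."""
    n_b = n_total - n_a
    return (n_total * mean_total - n_a * mean_a) / n_b

# 19 reference Turks at 15.4%; 22 individuals in total at 14.4% (figures from the text).
implied = implied_subgroup_mean(19, 15.4, 22, 14.4)
print(round(implied, 1))  # 8.1
```

Because the inputs are rounded to one decimal place, the implied average for the three added individuals is only good to within a percentage point or so.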

In conclusion, this analysis too provides an estimate of the Central Asian component in Turks similar to all the ones listed in the beginning.

Conclusion

Estimating the precise genetic identity of nomadic Turks at their time of arrival in Anatolia is difficult to achieve. First of all, modern Anatolian Turks are a subset of recent Anatolians; second, there is the problem of how many Iranian-speakers were absorbed by the westward migrating Turks from Central Asia, and when; also, what was the impact of the Mongol expansion in Central Asia after Turks had already reached the west, and later what were the impacts of Chinese and Russian expansion in the Eurasian heartland.

It's all a big puzzle, but, for the time being 5-7% East Eurasian admixture in modern Anatolian Turks and about 1/7th of their heritage coming from Central Asia seems like a reasonable estimate.

Wednesday, November 17, 2010

ADMIXTURE analysis of Eurasian populations with K=15

Note: color coding in the initially uploaded RAR of individual variation was off; please get the new one. [working link, May 4, 2011]

I have added populations from several sources: HapMap-3, HGDP, Behar et al. (2010), Rasmussen et al. (2010), and the Dodecad Ancestry Project, and ran the most ambitious ADMIXTURE analysis of Eurasian variation yet: 69 populations, and 1,189 individuals in total.

K=15 ADMIXTURE plot:

Admixture proportions for the 69 populations can be found in the spreadsheet.

Population portraits, showing individual variation within populations, can be found in the RAR.

RELATIONSHIP BETWEEN 15 COMPONENTS

The table of Fst distances between the 15 components:

MDS representation:

Hierarchical clustering with complete linkage (this is not a phylogeny):
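For readers unfamiliar with complete linkage, here is a minimal sketch of the procedure on a small distance matrix. The four labels and their distances are made up for illustration and are not values from the Fst table; the function name is likewise hypothetical.

```python
def complete_linkage(labels, dist):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [(lab,) for lab in labels]
    index = {lab: i for i, lab in enumerate(labels)}

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[index[x]][index[y]] for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        pairs = [(linkage(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        height, i, j = min(pairs)
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy distances (made up): two "western" and two "eastern" components.
toy_labels = ["W1", "W2", "E1", "E2"]
toy_dist = [
    [0.00, 0.02, 0.10, 0.11],
    [0.02, 0.00, 0.12, 0.10],
    [0.10, 0.12, 0.00, 0.03],
    [0.11, 0.10, 0.03, 0.00],
]
for a, b, h in complete_linkage(toy_labels, toy_dist):
    print(a, b, h)
```

The resulting tree summarizes which components are closest to each other under this particular linkage rule; it groups by similarity and makes no claim about order of descent.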

What has changed

In comparison to the K=10 analysis, the increased resolution reveals the following:

Previously, South Asians belonged primarily to the South Asian and West Asian components, and this South Asian component spilled over to Iran and Central Asia. Now, a new Central-South Asian component, corresponding to the Ancestral North Indian of a recent study, is inferred, along with a corresponding South Indian component.
HGDP Bedouins and Behar et al. (2010) Saudis take up their own component, which I labeled Arabian. This appears to be a subset of the Southwest Asian component of the K=10 analysis.
There are several components in Siberian and Central Asian populations, already discovered in my regional analysis. These are Central Siberian, Nganasan, Koryak, Chukchi, and Altaic, which replace the K=10 Northeast Asian component.

A final note:

The K=14 analysis revealed a Palestinian-centered "Levantine" cluster; this folded in my K=15 run, and two additional splits occurred (Koryak/Chukchi and Nganasan).

At this level of resolution, many alternative representations can occur for a given K, and the order in which splits occur can vary; they continue, however, to correlate well with populations. Noise levels seem to be slightly increased, especially for clusters associated with single populations of few individuals, but the broad patterns are quite evident.

Up to now, I have not encountered any nonsensical results, so I will continue this as far as it goes. Regional analyses indicate there is more structure to be discovered, so we'll see how far the data can be pushed.

UPDATE: Razib of Gene Expression worries that the South Indian and Central-South Asian components I have identified may not correspond to ASI/ANI, suggesting that ASI ought to be closer to East Asians.

However, Reich et al. state that:

“Many of the analyses in this study are based on modeling the history of Indo-European and Dravidian speaking groups of the Indian subcontinent in terms of a two-way historical mixture of an “Ancestral North Indian” (ANI) population that is genetically close to Central Asians, Middle Easterners, and Europeans, and an “Ancestral South Indian” (ASI) population that is not close to any large modern group outside the Indian subcontinent.”

Indeed, this is what I discover, with SIN being about Fst=0.085 from Caucasoids and about Fst=0.1 from East Asians. As I couldn't find raw Fst's between the components in Reich et al.'s paper, I can turn to one of their figures (S2 Fig.1):

It is evident that South Asian groups (except a few with East Eurasian admixture) are arrayed along a cline toward Europeans from a third pole (top of the figure): thus, the ASI, which represents this pole, is not particularly related to East Asians; it is about equidistant from Europeans and East Asians.

It is possible, however, that the correspondence is not perfect, as the North Kannada group which forms the South Indian pole in my analysis may not be as "southern" as the tribals Reich et al. had access to.

Results for DOD213 to DOD221 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Sunday, November 14, 2010

Fine-scale East Eurasian admixture results (up to DOD212)

These are fine-scale east Eurasian admixture for participants between DOD180 and DOD212. See the previous post (for participants up to DOD179) for details on the selection criteria, and descriptions of populations and the 7 used components.

Admixture proportions can be found in the spreadsheet.

All populations:

Individual bars:

Results for DOD208 to DOD212 posted

If you've submitted your sample, feel free to submit information about your sample in this thread, even if it's something as simple as country of origin.

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Results for DOD194, and DOD197 to DOD207 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Saturday, November 13, 2010

48-hour submission opportunity is OVER

Thank you all for participating in the 48-hour opportunity.

Participants who have received a DOD id number will receive their results over the next few days. In addition, there will be another fine-scale east Eurasian admixture run for qualifying members.

There are also at least two new populations that have reached 5 members, and for whom average admixture results will be posted.

Friday, November 12, 2010

Results for DOD183 to DOD196 posted (Updated)

If you have not participated, read about the current 48-hour submission opportunity.

If you have submitted your sample, you can add (voluntarily) any information you want to reveal about your ancestry and origins in this thread.

NOTE: DOD194 is related to DOD191, DOD192, and DOD193, but this was not communicated to me; therefore, I am scrapping the results of this run, and repeating it.

Check back on this post to see when updated results are posted. Most participants will probably not get radically different results, but DOD193 and DOD194's very high "Northern European" score is spurious.

I urge all project participants not to send me samples of relatives, but if they insist on doing so, to clearly indicate the relationship. This not only saves me time, but also ensures that the results are valid not only for yourself, but for all project participants in your run.

Results have been Updated:

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Thursday, November 11, 2010

48-hour submission opportunity

This is another opportunity for 23andMe data submission to the Dodecad Ancestry Project. Eligible groups during this opportunity are:

People from Italy, the Balkans, Anatolia, or the Caucasus
Germans and Austrians; also Germanic speakers of Eastern Europe
French
Iberians (Spanish, Portuguese, Basque)
Swiss
All Slavs and Balts (e.g., Poles, Russians, Bulgarians, Latvians, etc.)
Indians and Pakistanis (please specify religion and caste if applicable)
Scandinavians
All Uralic and Altaic speakers (e.g., Saami, Finns, Estonians, Turks, Hungarians, Komi, Chuvash, Azeri, Turkmen, Tatars, Mongols etc.)

Some notes:

I will not accept samples of related individuals.
All submitted individuals must be entirely from the 9 listed categories. By entirely, I mean at least 4 known grandparents.
Individuals of mixed ancestry (e.g., French+Swiss, or Polish+German) are ok, as long as they are from the 9 categories. However, I will not accept individuals with known other admixture (e.g., English, Jewish, Roma, or Native American) during this opportunity.

If your group is not eligible, feel free to subscribe to the feed to be alerted of new opportunities.

1) Data privacy statement

Your raw genetic data will not be shared with anyone.

It will not be analyzed for anything other than ancestry or admixture. No analysis of physical or medical traits will be performed.

Individual-level results will be revealed with only a unique ID, without any further information about the identity or origin of each participant.

2) What to send

Your compressed genotype file from 23andMe and as much information about your ancestry as you wish to reveal. It is necessary, however, that you tell me at least the country of origin of most of your ancestors and their ethnicity. Information about spoken language or religion may also be useful.

3) Where to send it

Send it to dodecad@gmail.com. I will respond with a unique identifying code of the form DOD001, DOD002, and so on. The results will then be posted in the blog with that ID.

4) What you will receive

You will be included in an ADMIXTURE analysis together with other project participants and publicly available populations, and the results will be published in this blog. Your sample will only be identified by your ID. If a group (e.g., Germans, or Austrians) has at least 5 participants, I may also post the average admixture proportions for that group.
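As a rough illustration of the group-average rule, here is a minimal Python sketch that averages admixture proportions only for groups with at least 5 participants. The function name, the data layout, and all numbers are made up for illustration; this is not the script used for the Project.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # threshold stated in the post


def group_averages(participants, min_size=MIN_GROUP_SIZE):
    """Average admixture proportions per group, skipping small groups.

    `participants` is a list of (group_label, proportions) pairs, where
    `proportions` maps a component name to a fraction between 0 and 1.
    """
    by_group = defaultdict(list)
    for group, proportions in participants:
        by_group[group].append(proportions)

    averages = {}
    for group, members in by_group.items():
        if len(members) < min_size:
            continue  # too few participants to report an average
        components = sorted({name for m in members for name in m})
        averages[group] = {
            name: sum(m.get(name, 0.0) for m in members) / len(members)
            for name in components
        }
    return averages


# Made-up example: five "German" participants and two "Austrian" ones.
north_values = [0.5, 0.6, 0.7, 0.6, 0.6]
participants = [
    ("German", {"North European": n, "South European": 1.0 - n})
    for n in north_values
] + [
    ("Austrian", {"North European": 0.5, "South European": 0.5}),
    ("Austrian", {"North European": 0.4, "South European": 0.6}),
]

averages = group_averages(participants)
```

With this made-up input, only the "German" group is reported, since the "Austrian" group has fewer than 5 members.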

The following figure gives an idea of what to expect. Project members can expect to get a bar of their own, and a list of their ancestral proportions.

Members who have evidence of East Eurasian ancestry will also be included in fine-scale East Eurasian admixture analysis.

All participating members will also be eligible for future ancestry analyses.

Fine-scale East Eurasian admixture results (up to DOD179)

Following up on my analysis of North Eurasian population structure, I have decided to generate new admixture results for eligible project participants, to distinguish between different components of eastern origin that showed up as "East Asian" or "Northeast Asian" in the K=10 analysis. Eligible project members were:

Those that had some East Asian or Northeast Asian admixture, but largely lacked South Asian or East/West African admixture. This step was taken to ensure that the inferred components would be reserved for the West Eurasian and Siberian/Central/East Asian groups. In the future, there may be additional tests for South Asian-admixed individuals.
Among relatives, only the one with the highest combined East Asian + Northeast Asian score.
Only participants up to DOD179 were included. More recent IDs will be included in a separate run, once data submission in the Project is resumed.
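The three eligibility rules above can be sketched in Python roughly as follows. The thresholds (`min_east`, `max_excluded`), the `family` field used to mark relatives, the order in which the rules are applied, and the sample data are all assumptions for illustration; the post does not state exact cutoffs.

```python
EAST = ("East Asian", "Northeast Asian")
EXCLUDED = ("South Asian", "East African", "West African")


def east_score(proportions):
    """Combined East Asian + Northeast Asian proportion."""
    return sum(proportions.get(c, 0.0) for c in EAST)


def select_eligible(members, min_east=0.01, max_excluded=0.05, max_id=179):
    """Return the sorted IDs of members eligible for the fine-scale run.

    Each member is a dict with keys 'id' (e.g. 'DOD012'), 'family'
    (None, or a label shared by relatives) and 'k10' (proportions).
    The numeric cutoffs are hypothetical.
    """
    candidates = []
    for m in members:
        if int(m["id"][3:]) > max_id:
            continue  # only participants up to DOD179
        if east_score(m["k10"]) < min_east:
            continue  # no eastern admixture to resolve
        if sum(m["k10"].get(c, 0.0) for c in EXCLUDED) > max_excluded:
            continue  # too much South Asian or African admixture
        candidates.append(m)

    # Among relatives, keep only the one with the highest eastern score.
    best = {}
    for m in candidates:
        key = m["family"] or m["id"]
        if key not in best or east_score(m["k10"]) > east_score(best[key]["k10"]):
            best[key] = m
    return sorted(m["id"] for m in best.values())


# Made-up members.
members = [
    {"id": "DOD001", "family": None, "k10": {"East Asian": 0.05}},
    {"id": "DOD002", "family": None, "k10": {"East Asian": 0.10, "South Asian": 0.30}},
    {"id": "DOD003", "family": "A", "k10": {"Northeast Asian": 0.02}},
    {"id": "DOD004", "family": "A", "k10": {"East Asian": 0.03, "Northeast Asian": 0.05}},
    {"id": "DOD005", "family": None, "k10": {"North European": 1.0}},
    {"id": "DOD180", "family": None, "k10": {"East Asian": 0.20}},
]

eligible = select_eligible(members)
```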

Below is an ADMIXTURE K=7 barplot of the analysis, showing the 7 inferred components. Admixture proportions for this can be found in the spreadsheet.

Population portraits of individual-level variation can be found in this RAR. For example, the portrait of Nganasans...

... reveals the presence of a few individuals who deviate from the light blue "Nganasan" component.

To understand the relationship between the 7 components, here is a table of Fst distances between them:

The emergence of a Central Siberian component, highest in Selkup and Ket, is very interesting; this may correspond to the postulated Proto-Uralic type that was detected in the craniometric analysis by Moyseyev, as it is equidistant from East Asians and Caucasoids (about Fst=0.1).
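For readers unfamiliar with Fst, the following Python sketch shows one common way to compute it from the allele frequencies of two components, and how an "equidistant" component would look. It uses a simple Hudson-style ratio without sample-size corrections, which is an assumption on my part and not necessarily the estimator that ADMIXTURE reports; the frequencies are made up.

```python
def fst(freqs_a, freqs_b):
    """Hudson-style Fst between two populations.

    `freqs_a` and `freqs_b` are allele frequencies at the same SNPs.
    Per SNP, the numerator is (a - b)^2 and the denominator is the
    between-population heterozygosity a(1 - b) + b(1 - a); both are
    summed over SNPs before taking the ratio.
    """
    num = sum((a - b) ** 2 for a, b in zip(freqs_a, freqs_b))
    den = sum(a * (1 - b) + b * (1 - a) for a, b in zip(freqs_a, freqs_b))
    return num / den if den else 0.0


# Made-up allele frequencies at two SNPs for three components.
central = [0.5, 0.5]
east = [0.7, 0.3]
west = [0.3, 0.7]

d_east = fst(central, east)
d_west = fst(central, west)
d_east_west = fst(east, west)
```

In this toy example the "central" component is equally distant from the other two, while those two are further from each other than either is from it.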

Below are individual bars for project participants:

Individual admixture proportions can be found in the spreadsheet.

Wednesday, November 10, 2010

Results for DOD180 to DOD182 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Tuesday, November 9, 2010

Multidimensional scaling in Italy, the Balkans, Anatolia, and the Caucasus + Lezgin ADMIXTURE surprise

On the left you can see an MDS plot of several population groups from Italy, the Balkans, Anatolia, and the Caucasus. This combines data of Dodecad Project members with published samples. I have placed the labels manually over the main point blobs. There is also a single Bulgarian sample who falls between the 'm' and the 'a' in 'Romanians'.

I had previously studied the distinctiveness of Caucasus populations, and now I have added Turks, Cypriots and populations from further West. I am still not satisfied with my Balkan samples (I have 2 Slovenians, 2 Serbs and 1 Bulgarian), so I encourage Balkan participants to contact me for possible inclusion in the Project.

When I turned to ADMIXTURE, a little mystery emerged, for which I currently have no explanation:

Two main components emerged: a light blue "Italo-Balkan" one that seems deficient in West Asia, and a red "Cypriot" one that is deficient in West Balkan Slavs and the Caucasus. The three Caucasus populations each form their own distinctive cluster (green, yellow, blue), and a magenta low-frequency component emerges at K=6, which is why I stopped the analysis at this K. Results for K=5 were similar, minus this low-frequency component.

Here is the big puzzle: my Bulgarian, 2 Serbs, and 2 Slovenians all show unambiguous membership in the green "Lezgin" cluster. Out of all the Caucasus components, this is the only one that seems to have a Balkan connection. While one could argue that this might reflect Neolithic farmers, as it has been argued that they spoke a North Caucasian language, the same "Lezgin" component is insignificant in Greeks and Mixed-Greeks, Southern Italians/Sicilians and Italian (other).

Is this some signal of a population that once inhabited the northern arc of the Black Sea, from the Balkans to the Caucasus? This might find some support in the possession by both Lezgins and Balkan Slavs of a "North European" component, but the Adygei, who similarly possess such a component, show no special affinity with Balkan Slavs. Below is the Lezgin K=10 portrait:

If anyone has any (pre)-historical scenario that might account for this unexpected affinity, feel free to write to me or leave a comment.

Monday, November 8, 2010

Portraits of the populations

Averages give an overview of different populations' makeup, but they may mask important variation. Thus, I have decided to present the individual-variation of the various populations included in the Project. You can get a RAR file with all populations here.

Below, I will highlight a few. I apologize for the quality of the graphics, but these were generated by a script I wrote which doesn't seem to have the presentation controls of normal "Save As".

First, the Saudis:

You can see, for example, that the West African admixture in this population comes pretty much from 3 individuals, with one of them having a substantial chunk of it. Some individuals have South European or West Asian admixture, while many belong 100% to the Southwest Asian component.
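The point that an average can be driven by a handful of individuals can be checked numerically. The Python sketch below finds the smallest set of individuals that accounts for most of a population's total for one component. The function name, the 90% cutoff, and the sample values are made up for illustration and are not taken from the Saudi data.

```python
def top_contributors(individuals, component, share=0.9):
    """Smallest set of individuals accounting for `share` of a component.

    `individuals` maps an individual ID to a dict of admixture
    proportions. Individuals are taken in decreasing order of their
    proportion for `component` until the running sum reaches `share`
    of the population total.
    """
    values = sorted(
        ((p.get(component, 0.0), ind) for ind, p in individuals.items()),
        reverse=True,
    )
    total = sum(v for v, _ in values)
    if total == 0:
        return []
    picked, running = [], 0.0
    for value, ind in values:
        if running >= share * total:
            break
        picked.append(ind)
        running += value
    return picked


# Made-up population of 10: three individuals carry nearly all of the
# "West African" component, the other seven only trace amounts.
population = {
    "S01": {"West African": 0.20, "Southwest Asian": 0.80},
    "S02": {"West African": 0.05, "Southwest Asian": 0.95},
    "S03": {"West African": 0.04, "Southwest Asian": 0.96},
}
for i in range(4, 11):
    population[f"S{i:02d}"] = {"West African": 0.001, "Southwest Asian": 0.999}

carriers = top_contributors(population, "West African")
```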

Next, the Ashkenazi Jews:

You can see that some individuals have an excess of the North European component, while others have almost none.

Next, the Burusho:

These appear very homogeneous, with only the variation we might expect from random hereditary processes or the limitations of admixture inference.

Next, the Romanians:

I've commented on the presence of 2 probable Roma individuals in this population before, and this seems quite evident in this plot.

Finally, the Gujarati:

It is evident that the North European and East Asian elements are limited to a few individuals, with most of the others appearing to be a "South Asian" and "West Asian" blend. One might speculate that these few represent individuals admixed with other subcontinental populations where these two components are more important.

Results for DOD175 to DOD179 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:
