Monday, February 28, 2011

Clusters Galore results, K=63 for Dodecad Project members (up to DOD449)

The results can be found in the spreadsheet. There are 63 clusters with 11 MDS dimensions retained. The spreadsheet contains 409 rows for unrelated project participants, each giving the probabilities that the individual belongs to each of the 63 clusters. This is followed by 36 rows for the reference populations, showing how many individuals from each population are assigned to each cluster.

To interpret your results, first search for your DOD number, and see which columns you have non-zero probabilities in. The vast majority of individuals are uniquely assigned (100%) to one of the 63 clusters. Then, you can visit the ancestry thread to see who else is assigned to the same cluster as yourself, and also look in the reference populations to see how they are represented in the different clusters.
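As a small illustration of reading one row of the spreadsheet, here is a minimal sketch. It assumes the cluster columns are in order and numbered from 1, and it uses a toy row with 4 clusters instead of 63; the IDs `DOD_A` and `DOD_B` are made up and do not correspond to any participant.

```python
def assigned_clusters(probabilities, min_prob=0.0):
    """Return (cluster_number, probability) pairs above min_prob, highest first.

    Assumes column i (0-based) of the row corresponds to cluster #(i + 1).
    """
    hits = [(i + 1, p) for i, p in enumerate(probabilities) if p > min_prob]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows: one uniquely assigned individual, one split between two clusters.
rows = {
    "DOD_A": [0.0, 0.0, 1.0, 0.0],
    "DOD_B": [0.0, 0.35, 0.0, 0.65],
}

for dod_id, probs in rows.items():
    print(dod_id, assigned_clusters(probs))
```

An individual with a single entry at 100% is uniquely assigned; anyone with several non-zero entries sits between clusters.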

The following IDs were outliers:
DOD004 DOD006 DOD010 DOD020 DOD024 DOD029 DOD030 DOD032 DOD036 DOD047 DOD050 DOD051 DOD053 DOD060 DOD063 DOD072 DOD075 DOD107 DOD126 DOD128 DOD132 DOD133 DOD156 DOD157 DOD168 DOD169 DOD175 DOD216 DOD223 DOD235 DOD239 DOD240 DOD245 DOD252 DOD294 DOD303 DOD316 DOD326 DOD339 DOD348 DOD359 DOD363 DOD380 DOD382 DOD384 DOD385 DOD387 DOD388 DOD389 DOD425 DOD430 DOD435
As previously explained, outliers may either be mixed individuals or individuals from particular populations not well represented in the Project. In both cases they appear to be more "distant" from other individuals and from their respective clusters.

A few observations:
  • The single largest cluster is #3 which is mostly "British Isles"
  • Cluster #26 encompasses most Greek/South Italian/Sicilian individuals; note how this is not represented in the reference populations, which lack such individuals
  • Cluster #23, also absent in the reference populations, encompasses mainly Finns and some Russians
  • Cluster #25 is also quite large, consisting of 42 Project members and only 2 reference White Utahns. This consists largely of North/Central Europeans from continental Europe.
  • Cluster #4 includes mainly Iberians
  • Cluster #11 mainly Ashkenazi Jews
  • Cluster #15 mainly Turks
  • Cluster #16 mainly people from the Balkans
  • Cluster #27 mainly North-Central Italians not in #26 (the Greco-Italian cluster)
As always, I encourage those who haven't posted in the ancestry thread yet to do so, to help themselves and others make better sense of their results.

I plan to explore fine-scale structure of Dodecad Project members further, especially of those who belong to large, undifferentiated clusters that may harbor latent informative structure.

Sunday, February 27, 2011

Clusters Galore with Dodecad populations

Here is a spreadsheet of Clusters Galore analysis with Dodecad populations: 692 reference individuals + 261 Dodecad Project participants from 24 different populations with at least 5 members each:
Assyrian, Scandinavian, Greek, Finnish, S_Italian_Sicilian, Ashkenazi, German, Indian, Portuguese, Armenian, Russian, Spanish, British, Irish, Turkish, N_Italian, Balkans, Iranian, North_African, East_African, French, Chinese, Japanese, Polish
As a reminder to new readers, the Clusters Galore technique consists of applying multidimensional scaling to genomic data to convert ~152,000 SNPs into a number of continuous dimensions capturing most of the variation, and then using MCLUST to cluster individuals along these dimensions.
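For readers who want to see the first half of this pipeline in miniature, the following is a rough sketch of classical (metric) MDS in plain Python. It is not the Project's code: the distance measure (mean absolute difference in allele counts), the toy genotypes, and the function names are all assumptions made for illustration. The clustering step is not reproduced here; MCLUST is an R package for Gaussian mixture model-based clustering and would be run on the resulting coordinates.

```python
import math


def allele_distance(g1, g2):
    """Mean absolute difference in allele counts (0/1/2); an assumed toy distance."""
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)


def classical_mds(dist, k, iters=500):
    """Embed n items in k dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        # Power iteration for the eigenvector with the largest eigenvalue.
        v = [float(i + 1) for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if eigval <= 1e-12:
            break  # no further positive dimensions; remaining coordinates stay 0
        for i in range(n):
            coords[i][dim] = v[i] * math.sqrt(eigval)
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Toy data: four made-up individuals, six made-up SNPs.
genotypes = {
    "ind1": [0, 0, 1, 2, 2, 1],
    "ind2": [0, 1, 1, 2, 2, 1],
    "ind3": [2, 2, 1, 0, 0, 1],
    "ind4": [2, 2, 0, 0, 0, 1],
}
ids = sorted(genotypes)
dist = [[allele_distance(genotypes[a], genotypes[b]) for b in ids] for a in ids]
coords = classical_mds(dist, k=2)
for name, xy in zip(ids, coords):
    print(name, [round(c, 3) for c in xy])
```

With real data one would use a linear algebra library rather than power iteration, and retain around 10 dimensions as in the analysis above.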

In total, 65 clusters were obtained when 10 MDS dimensions were retained.

Some observations:
  • Most Greeks and all South Italians/Sicilians continue to fall in the same cluster #4. The fact that the latter population, despite being one of the largest (20 individuals) continues to remain unsplit and distinctive testifies to the fact that it is probably homogeneous and lacks substantial regional inbreeding within it.
  • Cluster #2 includes most Germanic individuals and also the Irish
  • Cluster #5 is made mostly of Central/North Italians
  • Non-Greek Balkan participants fall mostly in cluster #6, which also includes the non-Gypsy admixed reference Romanians
  • Project and reference Iberians (Spaniards and Portuguese) continue to be undifferentiated and distinctive, falling in cluster #14; my comments on South Italians/Sicilians probably apply to them as well.
  • There is a trace of structure in the Ashkenazi population, which is split into two clusters. This probably underscores the benefits of large samples in the inference of structure, as 25 Ashkenazi Jews have submitted their results to the Project.
  • Project Russians have split affiliations between a circum-Baltic cluster #3 and the Finnish cluster #7.
  • North Africans form two new clusters that do not overlap with either reference Mozabites or Egyptians. There is great variety in North Africa, and the 8 people who have submitted their samples are a good start to learning about this region of the world.
  • The Chinese are split into two, one part aligning with the "southern" Miaozu and one part aligning with the "northern" Japanese.
Please do not ask me which cluster you fall in, as there will be a separate analysis of Project participants identified by their DOD number but without ethnic identifiers, in compliance with the Project's privacy policy.

Results for DOD439 to DOD448 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
The current submission opportunity is over.

All populations:



Individual bars:

Saturday, February 26, 2011

Submission to the Project is currently CLOSED

Thanks to all participants of the most recent submission opportunity. The final batch of results (up to DOD448) will be posted, and then the next round of analysis will begin.

Friday, February 25, 2011

Results for DOD429 to DOD438 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Tuesday, February 22, 2011

Results for DOD419 to DOD428 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

UPDATE: Labels fixed.

All populations:



Individual bars:

Winding down the current submission opportunity

I will end the current submission opportunity in the near future and I will proceed to the next analysis phase of the Project, which will likely include more global Galore analysis, the 3rd update of ADMIXTURE with Dodecad Populations, and, probably, more fine-scale studies of regional samples.

During that time, I will also double-check my samples for possible misrepresentations or related samples, so if you want your sample to stay in the Project, and have (willingly or not) misrepresented either your ancestry or your relationship to other samples, a necessary condition is to let me know.

I just want to give a quick update on the status of the Dodecad populations. I currently have 244 individuals in 23 populations, each of which is represented by at least 5 individuals. This is up from 208 individuals/19 populations of a week ago! Thanks Dodecad participants.

In the last week I was able to reach the 5-individual threshold for the French, Chinese, and Japanese, while the Polish and Danish populations are still 1-2 individuals short.

Whatever your background, if you have 4 grandparents from the same European, Asian, or North African ethnic group or country, feel free to send me your data while submission is still open, or contact me first to ask whether I am likely to accept your data.

Monday, February 21, 2011

Truly despicable behavior

It has come to my attention that certain individuals have misrepresented the ancestry of submitted samples to get them into the Project.

Rest assured that I have both the tools and the know-how to sooner or later detect fraudsters, and fraudulent samples will almost certainly be pulled from the Project, with no further analysis performed on them.

Two examples of the types of actions I consider fraudulent are:
  • Part Mexican pretending to be Spanish
  • Relatives submitted under aliases to obscure their relationship to samples already in the Project
I want to extend an opportunity to people who have misrepresented their ancestry/relatedness status in submitted samples to come clean by sending me an e-mail.

I cannot guarantee that I will keep the samples of people who come clean, but I think it is the least you can do: I spent my time and energy analyzing your sample, and I deserve, in the very least, an apology.

Sunday, February 20, 2011

Results for DOD409 to DOD418 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Thursday, February 17, 2011

Results for DOD399 to DOD408 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Tuesday, February 15, 2011

Groups just short of 5 participants

In the Dodecad Project, I post average results and population portraits for groups that have at least 5 participants. Moreover, I often re-use such groups for regional studies and analyses.
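The 5-participant rule can be sketched as follows. This is only an illustration under assumed inputs: each sample is a made-up (population label, admixture proportions) pair, and the labels `PopA` and `PopB` and the two-component proportions are hypothetical.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # threshold stated in the post


def population_averages(samples, min_size=MIN_GROUP_SIZE):
    """Average admixture proportions per population, skipping small groups."""
    groups = defaultdict(list)
    for population, proportions in samples:
        groups[population].append(proportions)
    averages = {}
    for population, rows in groups.items():
        if len(rows) < min_size:
            continue  # too few participants to report an aggregate
        n_components = len(rows[0])
        averages[population] = [
            sum(row[c] for row in rows) / len(rows) for c in range(n_components)
        ]
    return averages


# Hypothetical data: PopA has 5 participants, PopB only 4.
samples = [
    ("PopA", [0.8, 0.2]), ("PopA", [0.6, 0.4]), ("PopA", [0.7, 0.3]),
    ("PopA", [0.7, 0.3]), ("PopA", [0.7, 0.3]),
    ("PopB", [0.1, 0.9]), ("PopB", [0.2, 0.8]), ("PopB", [0.1, 0.9]),
    ("PopB", [0.2, 0.8]),
]
result = population_averages(samples)
print(result)
```

Only `PopA` is reported here; `PopB` would appear once a fifth participant joined.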

In the most recent count of such populations, there were 17 populations and 143 individuals. I am currently in the process of re-organizing my collection, as some populations can now be split, and new ones created. For example, I have several North African submitters from a wide variety of countries that will join the publicly available Moroccan, Mozabite, and Egyptian samples in the reference populations. I also have enough Swedes and Norwegians and I might be able to split the Scandinavian population.

In total, by my most recent tabulation, I have 208 individuals organized in 19 populations, i.e., 45% more people than only a month ago! I encourage all individuals who meet the eligibility criteria to join the project during the current submission opportunity.

In particular, some groups are just 1-2 individuals short of the 5-person mark. I list them below:
  • Chinese
  • Danish
  • French
  • Japanese
  • Polish
If you have 4 grandparents from one of these groups, I would especially appreciate your participation.

Whether you belong to a group already represented in the Project, or to one that is not, your participation will be helpful. In the former case, you help build up sample size and confidence in the properties of the population. In the latter, you are filling a gap in the Eurasian landscape, encouraging others of your group to also join.

Results for DOD389 to DOD398 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Monday, February 14, 2011

Results for DOD379 to DOD388 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Sunday, February 13, 2011

A note on the eligibility criteria of the Dodecad Project

1. Why I do not accept relatives

Relatives are much more similar to each other than they are to other people. When an ADMIXTURE run includes a pair of close relatives, then there is a high risk of the two relatives forming their own component to which they belong near 100%. Thus, (a) they learn nothing about themselves, other than the fact that they are related, and (b) other people have some membership in that particular component which does not have any real anthropological interest.

The same is true when one performs dimensionality reduction like multidimensional scaling (MDS) or Principal Components Analysis (PCA). It is highly likely that a pair of relatives will be distinguished from everyone else along a dimension. That dimension is "wasted", and it provides no real anthropological information other than the fact that a pair of people are related to each other. If this is combined with a clustering algorithm such as the Galore approach, then the pair of relatives will form their own anthropologically meaningless cluster.
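To make "much more similar to each other" concrete, here is a minimal sketch that scores pairs of samples by identity-by-state (IBS) allele sharing and flags unusually similar pairs. It is not a description of the checks actually used in the Project; the sample names, the toy genotypes, and the 0.9 threshold are all assumptions chosen for the example.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Average proportion of alleles shared identical-by-state, from 0 to 1.

    Genotypes are allele counts (0/1/2); None marks a no-call and is skipped.
    """
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)
              if a is not None and b is not None]
    return sum(shared) / (2 * len(shared))


def flag_close_pairs(genotypes, threshold):
    """Return (id1, id2, similarity) for every pair at or above the threshold."""
    flagged = []
    for (id1, g1), (id2, g2) in combinations(sorted(genotypes.items()), 2):
        similarity = ibs_similarity(g1, g2)
        if similarity >= threshold:
            flagged.append((id1, id2, similarity))
    return flagged


# Toy data: S1 and S2 are nearly identical, S3 is very different from both.
genotypes = {
    "S1": [0, 1, 2, 1, 0, 2, 1, 1],
    "S2": [0, 1, 2, 1, 0, 2, 1, 2],
    "S3": [2, 1, 0, 0, 2, 0, 1, 0],
}
print(flag_close_pairs(genotypes, threshold=0.9))
```

In real data, unrelated people from the same population already share most alleles by state, so any usable threshold would have to be calibrated against the distribution of pairwise values rather than fixed in advance.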

The Dodecad Project is an anthropology project; it is not a genealogy project. There are other projects and experts out there who can help you interpret your data from a genealogy perspective.

2. Why I accept people with 4 grandparents from the same European, Asian, or North African ethnic group or country

I do so for two reasons, a scientific one, and a practical one:

First, such individuals allow me to gradually build a database of genetic variation for my region of interest, namely Eurasia. As of this writing, there are 387 participants in the project, out of which I have managed to form 19 populations, each of them consisting of at least 5 individuals from a particular ethnic/linguistic/national/geographical group such as Greek/Arab/British/Scandinavian, for a total of 209 individuals.

There are many participants who belong to a particular group for which there are not 5 participants yet; and, there are also many who have a mixed background and whom I have included in the Project either during some time-limited submission opportunities, or because they asked me to, making a good case for why they should be included.

The second reason is practical, as the customers of 23andMe are probably largely American, and many non-1st generation Americans are descended from multiple ethnic groups. Even though a large part of the analysis process is automated, I still need to spend some time on every submitted sample to download it, record it, convert it, to assign some processing time on my computing systems for the analysis, and then spend some more time to create the results barplots and materials.

I simply don't have the time to do so for the large pool of 23andMe customers, but I can, occasionally, accept individuals of mixed heritage during short submission opportunities.

What to do if you are not eligible

If you do not meet the eligibility criteria, and think that I should include you in the Project, the thing to do is to write to me, laying out the reasons why. Here are some scenarios that I might very well consider:
  1. Adoptees with no knowledge of their origins.
  2. People of mixed heritage who belong partly to an unsampled group. Let's say you are 50% English+50% Parsee, then I might consider this sample as I have no data on Parsees.
  3. People of mixed origins who have data for both their unadmixed parents (and thus provide two samples to the project), or for a series of their unrelated relatives; in such cases I may very well analyze the admixed individuals as well.
  4. People of mixed heritage from Greece and its environs (i.e., Italy, the Balkans, and Anatolia), e.g., 50% Gypsy+50% Bulgarian, or 50% Croat+50% Serb or 75% Turk+25% Albanian would all probably be accepted if they contacted me.

Saturday, February 12, 2011

Results for DOD369 to DOD378 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Friday, February 11, 2011

Results for DOD359 to DOD368 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:


Thursday, February 10, 2011

Results for DOD349 to DOD358 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Results for DOD339 to DOD348 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Wednesday, February 9, 2011

Results for DOD329 to DOD338 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:



Individual bars:

Results for DOD319 to DOD328 posted

Admixture proportions can be found in the spreadsheet

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.
Read about the terms and eligibility criteria of the current submission opportunity.

All populations:




Individual bars:

Tuesday, February 8, 2011

Open-ended submission opportunity for 23andMe data (#2)

Who is eligible

Everyone who is of European, Asian, or North African ancestry and whose four grandparents are all from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.

Please do not submit samples of relatives, as these make analysis difficult. I consider 2nd cousins and closer relatives to be related.

I am sorry that I can't process everyone's data, so if you don't fit the above criteria and you feel you should be included, feel free to write to me (but don't send me your data!) and I will keep it in mind. Also, you can subscribe to the feed for future opportunities to submit your data.

What you will receive

You will receive the standard K=10 analysis results such as this, and you will also be eligible for other types of tests such as Galore analysis or regional studies.

Your raw data or genealogical information will not be shared or distributed in any manner, and it will not be analyzed for any other purpose than assessment of ancestry (i.e., not for any physical or health-related traits). It will be identified by a unique ID, known to you and me, and results will be posted in the blog using that ID. I will continue to analyze your data for ancestry, and new results will be posted using that same ID. Also, I will report aggregate results for populations with at least 5 participants.

What to send

Send your zipped autosomal data to dodecad@gmail.com. Also let me know something about your ancestry (e.g., ethnic group, country of origin of grandparents, or anything else that might be useful).

Monday, February 7, 2011

Results for DOD308 to DOD318 posted

I have done an initial run with 10 of the people who sent me v3 data (plus 1 with v2 data) to ensure that everything works fine. Note that this does not mean that submission to the project is currently open, so please do not send me any more data until such time as it is.

Note also that this is based on a new marker set of 152K markers; the main differences from the 177K set I have been commonly using are a slightly more stringent quality control (--geno 0.005) in the source datasets and the restriction to markers common to the v2 and v3 chips.
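The marker selection described above can be sketched roughly as follows. The `--geno` option looks like PLINK's per-marker missingness filter, which drops markers whose missing call rate exceeds the given value; assuming that is what is meant, the same logic in plain Python is shown below. The marker names, the toy call data, and the function names are hypothetical.

```python
def missing_rate(calls):
    """Fraction of no-calls (None) for one marker across all samples."""
    return sum(1 for c in calls if c is None) / len(calls)


def filter_markers(genotypes, max_missing=0.005):
    """Keep markers whose missing call rate does not exceed max_missing."""
    return {snp for snp, calls in genotypes.items()
            if missing_rate(calls) <= max_missing}


def common_marker_set(reference_genotypes, v2_markers, v3_markers,
                      max_missing=0.005):
    """Markers that pass QC in the reference data and exist on both chips."""
    passed = filter_markers(reference_genotypes, max_missing)
    return passed & set(v2_markers) & set(v3_markers)


# Toy reference data: 400 samples per marker, genotypes as allele counts.
reference_genotypes = {
    "snp_a": [1] * 400,               # no missing calls
    "snp_b": [1] * 399 + [None],      # 0.25% missing, passes
    "snp_c": [1] * 396 + [None] * 4,  # 1% missing, fails QC
    "snp_d": [1] * 400,               # passes QC but is absent from one chip
}
v2_markers = ["snp_a", "snp_b", "snp_c"]
v3_markers = ["snp_a", "snp_b", "snp_c", "snp_d"]

kept = common_marker_set(reference_genotypes, v2_markers, v3_markers)
print(sorted(kept))
```

A tighter missingness threshold and the intersection across chips both shrink the marker set, which is consistent with the drop from 177K to 152K markers.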

Admixture proportions can be found in the spreadsheet

All populations:


Individual bars: