Monday, November 29, 2010

Results of Clusters Galore analysis for Dodecad Project members (up to DOD236), K=48

This is the result of the new type of ancestry analysis I have recently devised. For background, please read:
In total 894 individuals were included in this analysis, 202 Dodecad Project members and 692 from the published references. 30 MDS dimensions were retained, and mclust was run with a maximum number of clusters = 60. In the optimal solution, 48 clusters were inferred.
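For readers who want a feel for the pipeline described above, here is a minimal sketch on toy data. It is not the code used for this analysis: the actual run used mclust (an R package), while this sketch swaps in classical MDS written with NumPy and scikit-learn's `GaussianMixture`, fits only full-covariance mixtures, and uses made-up two-dimensional points in place of genotype distances. The helper names `classical_mds` and `best_mixture` are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def classical_mds(dist, n_dims):
    """Classical (Torgerson) MDS: embed a distance matrix in n_dims dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]  # largest eigenvalues first
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def best_mixture(coords, max_clusters, seed=0):
    """Fit mixtures with 1..max_clusters components and keep the lowest BIC.

    scikit-learn's BIC is "lower is better"; mclust reports BIC with the
    opposite sign and maximizes it.
    """
    best_model, best_bic = None, np.inf
    for k in range(1, max_clusters + 1):
        model = GaussianMixture(n_components=k, random_state=seed).fit(coords)
        bic = model.bic(coords)
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model


# Toy data (an assumption): three separated groups standing in for real samples.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist, n_dims=2)        # the real analysis retained 30 dimensions
model = best_mixture(coords, max_clusters=6)  # the real analysis allowed up to 60
membership = model.predict_proba(coords)      # rows: individuals, columns: clusters
print(model.n_components, membership.shape)
```

Each row of `membership` plays the role of one individual's cluster probabilities in the results spreadsheet.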

At the beginning of the results spreadsheet are the 202 Dodecad Project members and their probabilities of belonging to the 48 clusters. This is followed by the 36 reference populations, listing how many individuals from each one were assigned to each of the 48 clusters.
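As a rough illustration of how the two parts of the spreadsheet can be read, the sketch below uses only figures quoted in this post (DOD162's probabilities, and the Tuscan and North Italian counts for cluster #2). The dictionary layout and the function names are assumptions made for the example, not the spreadsheet's actual format.

```python
# Membership probabilities per Project member, keyed by cluster number.
# Figures for DOD162 are quoted in this post; the layout is assumed.
member_probs = {"DOD162": {1: 0.28, 3: 0.72}}

# Reference individuals assigned to each cluster, plus sample sizes.
# Only the counts quoted in this post are filled in.
reference_counts = {"Tuscan": {2: 25}, "North Italian": {2: 10}}
reference_sizes = {"Tuscan": 25, "North Italian": 12}


def ranked_clusters(probs):
    """Return (cluster, probability) pairs, most probable first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)


def populations_in_cluster(cluster):
    """Fraction of each reference population assigned to the given cluster."""
    return {
        pop: counts.get(cluster, 0) / reference_sizes[pop]
        for pop, counts in reference_counts.items()
    }


best_cluster, best_prob = ranked_clusters(member_probs["DOD162"])[0]
print(best_cluster, best_prob)  # 3 0.72
print(populations_in_cluster(2))
```

Looking up a member's most probable cluster and then checking which reference populations concentrate in that cluster is the basic way to interpret a row.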

To help you interpret these results, you might want to consult the individual and population ADMIXTURE spreadsheets, as well as the information Project members have chosen to reveal about themselves. Feel free to add your own information in that thread.

Here are the results for some people who have chosen to reveal information about themselves:
  • Spencer Wells is DOD162. He has a 28% probability of being in cluster #1 (CEU (White Utahn)/French) and a 72% probability of being in cluster #3, which is specific to CEU (White Utahn) in the reference populations and to which 29 Project members are also assigned (almost all of them Northwestern Europeans)
  • Razib of Gene Expression is DOD075 and is assigned to cluster #15, which includes Gujarati and North Kannada individuals
  • pconroy submitted three samples: DOD097 is Sicilian and falls in cluster #9, to which many Cypriots and Sephardic Jews belong, as well as many Project members of South Italian/Sicilian background; DOD098 and DOD099 are his Irish parents: DOD098 is almost evenly split between #1 and #3 (like Wells), and DOD099 is 97% in #3
  • Adriano Squecco DOD139 is North Italian and is in cluster #2, to which 25/25 reference Tuscans and 10/12 reference North Italians belong
  • Lacko DOD083 is 100% southern Polish and he falls in cluster #20, in which all reference Belorussians and Lithuanians fall
  • Mike Maddi DOD021 is Sicilian and is in cluster #9 like DOD097, showing some probability (13%) of also being in #2 like Adriano Squecco
  • An Anonymous Pole falls in #20 like Lacko and the reference Slavs
  • Ilmari DOD003 and Ari DOD131 are Finns and fall in cluster #13. This is an interesting one, as it does not occur at all in the reference populations; I'll let you guess what population it's centered on.
  • Eastara's mother (DOD025) is Bulgarian and falls in cluster #10, which is centered on Romanians in the reference populations, but note that there are also 13 Dodecad Project members who fall in it, many of them from different parts of the Balkans. I hope more people from the Balkans will contact me for inclusion in the project, as I am sure that finer-scale resolution can be achieved there with increased participation.
  • Basar (DOD049) is half-Anatolian Turk and half-Laz. He falls in the cluster encompassing Armenians and Turks in the reference populations.
  • Bubba (DOD066) is North German with a pinch of Danish and falls in cluster #3
  • afpjr (DOD014) is half Greek and half Italian/Sicilian; he falls in cluster #9 (95%)
  • Francesc (DOD217) is Catalan and falls in cluster #17, centered on Spaniards in the reference populations.
The Project Greeks and Mixed Greeks fall in clusters #2, #9, and #10, tying them to Italy and the Balkans, as might be expected. I hope more Greeks will decide to participate in the Project, so we can discover more interesting patterns in our population.

I cannot stress enough how revealing non-identifying ancestry information in the relevant thread will help both yourselves and others make better sense of these and future results.

There are clusters composed entirely of Dodecad Project members (e.g., cluster #13, mentioned above), and others which are centered on one or two reference populations but encompass a wider variety of non-represented populations. So, please take the time to leave a comment in the ancestry thread.

This is not the end of the story. There are more clusters to be discovered in the data; the inclusion of 200+ new samples in this analysis has caused new clusters to appear and distinctions that were previously detected to "fold back" (e.g., between Armenians and Turks). I am currently investigating how the choice of number of MDS dimensions to retain affects the number of inferred clusters in the optimal solution.


  1. RichardT (DOD180) Paternal background is Southern Bohemia, with the suspicion of having Vlach origin; mother's background is Ukraine-Poland.

  2. I fall in cluster #10, 100%. DOD236 (Dino9575): Paternal background fully southern Italian (Molise region); maternal ancestry mixed between northern Italian & some Polish.