Monday, February 28, 2011

Clusters Galore results, K=63 for Dodecad Project members (up to DOD449)

The results can be found in the spreadsheet. There are 63 clusters with 11 MDS dimensions retained. The spreadsheet contains 409 rows, one per unrelated project participant, each giving the probabilities that the individual belongs to each of the 63 clusters. These are followed by 36 rows for the reference populations, showing how many individuals from each population are assigned to each cluster.

To interpret your results, first search for your DOD number and see which columns you have non-zero probabilities in. The vast majority of individuals are uniquely assigned (100%) to one of the 63 clusters. Then you can visit the ancestry thread to see who else is assigned to the same cluster as you, and also look at the reference populations to see how they are represented in the different clusters.
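The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it supposes the spreadsheet has been exported to CSV with an `ID` column followed by one column per cluster, and the IDs and values in `SAMPLE` are made up, not taken from the actual results. The function names are likewise hypothetical.

```python
import csv
import io

# Hypothetical CSV export of the spreadsheet. The layout (an "ID" column
# followed by one column per cluster) and all values are assumptions.
SAMPLE = """ID,1,2,3,4
DODxxx1,0,0,100,0
DODxxx2,0,37,63,0
"""

def nonzero_clusters(csv_text, dod_id):
    """Return {cluster label: percent} for the non-zero entries in one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["ID"] == dod_id:
            return {c: float(v) for c, v in row.items()
                    if c != "ID" and float(v) > 0}
    raise KeyError(dod_id)

def same_cluster(csv_text, cluster):
    """List the IDs that have a non-zero probability for the given cluster."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["ID"] for row in reader if float(row[cluster]) > 0]

print(nonzero_clusters(SAMPLE, "DODxxx2"))
print(same_cluster(SAMPLE, "3"))
```

A participant who is uniquely assigned gets a single entry at 100 from `nonzero_clusters`, while one split between clusters gets several entries.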

The following IDs were outliers:
DOD004 DOD006 DOD010 DOD020 DOD024 DOD029 DOD030 DOD032 DOD036 DOD047 DOD050 DOD051 DOD053 DOD060 DOD063 DOD072 DOD075 DOD107 DOD126 DOD128 DOD132 DOD133 DOD156 DOD157 DOD168 DOD169 DOD175 DOD216 DOD223 DOD235 DOD239 DOD240 DOD245 DOD252 DOD294 DOD303 DOD316 DOD326 DOD339 DOD348 DOD359 DOD363 DOD380 DOD382 DOD384 DOD385 DOD387 DOD388 DOD389 DOD425 DOD430 DOD435
As previously explained, outliers may either be mixed individuals or individuals from particular populations not well represented in the Project. In both cases they appear to be more "distant" from other individuals and from their respective clusters.

A few observations:
  • The single largest cluster is #3 which is mostly "British Isles"
  • Cluster #26 encompasses most Greek/South Italian/Sicilian individuals; note that it is not represented in the reference populations, which lack such individuals
  • Cluster #23, also absent from the reference populations, encompasses mainly Finns and some Russians
  • Cluster #25 is also quite large, consisting of 42 Project members and only 2 reference White Utahns. This consists largely of North/Central Europeans from continental Europe.
  • Cluster #4 includes mainly Iberians
  • Cluster #11 mainly Ashkenazi Jews
  • Cluster #15 mainly Turks
  • Cluster #16 mainly people from the Balkans
  • Cluster #27 mainly North-Central Italians not in #26 (the Greco-Italian cluster)
As always, I encourage those who haven't posted in the ancestry thread yet to do so, to help themselves and others make better sense of their results.

I plan to explore fine-scale structure of Dodecad Project members further, especially of those who belong to large, undifferentiated clusters that may harbor latent informative structure.


  1. Are you sure you don't mean that Cluster #16 is mainly from the Balkans? Cluster #14 has only four Dodecad members.

  2. I have detected an error: the Yoruba and Maasai don't appear in any cluster, and the Ethiopians, who number only 19, show 25 in column BI.

  3. I do not understand why DOD168, 169, 330, 348, 359, 360, 361 and 363 are outliers. They are all 100% for the cluster 6. This is quite normal since they are all North Africans. You said you were going to expand the number of people, which is why at least 5 people were needed, is it not?

  4. The Pathan reference data doesn't add up: 8 in one cluster and 25 in another, which don't sum to the sample size of 22.

  5. All noted problems should be fixed now.

    I do not understand why DOD168, 169, 330, 348, 359, 360, 361 and 363 are outliers. They are all 100% for the cluster 6. This is quite normal since they are all North Africans.

    North Africans are generally quite variable in their admixture components, and those components are genetically quite different (African vs. Eurasian). Hence, these North Africans are quite different from each other.

    The fact that MCLUST manages to place them in a cluster is not strange, because MCLUST has the ability to detect clusters of varying size and orientation, and has no trouble detecting either a small tight cluster of an isolated homogeneous population, or a spacious cluster of admixed individuals.

  6. My (DOD232) percentages add up to 99. It looks as if each individual may participate in up to two clusters, so that there seems to be a rounding error. Or was I rounded down from a <.5% participation in a third cluster?

  7. An individual may have non-zero probabilities for more than two clusters. As any probability less than 0.5% is rounded to 0, it is quite possible for the sum of an individual's row to be 99 or, indeed, less than 99.

  8. By the way, everyone, you can place your ancestry (and 23andme I.D.) in this spreadsheet if you like.
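
The rounding effect discussed in comments 6 and 7 can be reproduced with a small Python sketch. The probabilities below are invented for illustration and are not anyone's actual results; the only assumption taken from the thread is that each probability is rounded to a whole percent, so anything under 0.5% shows as 0.

```python
def rounded_row(probabilities):
    """Round each membership probability to a whole percent, as in the spreadsheet."""
    return [round(p * 100) for p in probabilities]

# Made-up rows: each sums to 1.0 before rounding.
two_plus_trace = [0.604, 0.392, 0.004]               # third cluster below 0.5%
several_traces = [0.424, 0.424, 0.144, 0.004, 0.004]

print(rounded_row(two_plus_trace), sum(rounded_row(two_plus_trace)))
print(rounded_row(several_traces), sum(rounded_row(several_traces)))
```

In the first row the trace membership in a third cluster disappears and the displayed percentages sum to 99; in the second, several small losses combine and the sum drops to 98.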