Tuesday, November 30, 2010

What's next for Clusters Galore analysis

The first few runs of the Clusters Galore analysis have proven quite successful; I've posted another one, on the HGDP panel, on my other blog.

Now, it is time to assess the results and see what improvements can be made. I see a few avenues for improvement:


Outliers

Clusters, by definition, are composed of at least 2 individuals. Individuals who are the only representatives of their populations (e.g., a Pygmy or an Icelandic+Armenian mix) will, by necessity, attach themselves to the closest cluster (e.g., to Yoruba, or to some Central European population), even though they are not necessarily close to that population.

Outlier detection is a difficult problem, but I will try some ideas on how to tackle it.
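
As a minimal sketch of one possible idea (not necessarily the one that will end up in the analysis): compare an individual's distance to the nearest cluster centroid with the typical distance of that cluster's own members, and flag the individual if it is much farther away. The coordinates, the cluster labels, and the `factor=3.0` threshold below are all hypothetical, chosen only for illustration.

```python
import math
from statistics import median

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def flag_outlier(candidate, clusters, factor=3.0):
    """Return (nearest_label, distance, is_outlier) for one individual.

    An individual is flagged when its distance to the nearest centroid
    exceeds `factor` times the median member-to-centroid distance of
    that cluster. `factor` is an arbitrary, hypothetical threshold.
    """
    best_label, best_dist = None, math.inf
    for label, members in clusters.items():
        d = math.dist(candidate, centroid(members))
        if d < best_dist:
            best_label, best_dist = label, d
    members = clusters[best_label]
    c = centroid(members)
    typical = median(math.dist(m, c) for m in members)
    return best_label, best_dist, best_dist > factor * typical

# Hypothetical 2-D coordinates (e.g., positions in a reduced space).
clusters = {
    "A": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2)],
    "B": [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2)],
}
loner = flag_outlier((2.0, 2.0), clusters)    # sits between the clusters
insider = flag_outlier((0.1, 0.1), clusters)  # sits inside cluster A
print(loner, insider)
```

Both individuals are nearest to cluster "A", but only the first is far from it relative to A's own spread, which is the situation described above.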

Phantom clusters

mclust is susceptible to phantom clusters, i.e., clusters of "misfits" who don't belong in any other population but are banded together erroneously by the algorithm. That is inevitable in an automated procedure, especially one that pushes the limits of ancestry inference. Phantom clusters are, by their nature, transient, so I have some ideas on how to avoid them and how to focus on very robust and repeatable clusters.
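
One way to act on the "transient" property, sketched here under the assumption that the clustering can be rerun several times (for example on subsamples or with different starting points): score each cluster by how often its members stay together across runs. Comparing co-membership of pairs avoids having to match cluster labels between runs. The function name and the toy assignments are hypothetical.

```python
from itertools import combinations

def cluster_stability(reference, runs):
    """Score each cluster of `reference` by pair co-membership across runs.

    `reference` and every element of `runs` map individual -> cluster label.
    The score is the fraction of (pair, run) combinations in which two
    members of the reference cluster are again placed in the same cluster.
    """
    members = {}
    for individual, label in reference.items():
        members.setdefault(label, []).append(individual)

    scores = {}
    for label, group in members.items():
        pairs = list(combinations(group, 2))
        if not pairs:
            continue  # a single individual cannot be scored this way
        together = sum(run[a] == run[b] for run in runs for a, b in pairs)
        scores[label] = together / (len(pairs) * len(runs))
    return scores

# Hypothetical assignments: labels differ between runs, which is fine.
reference = {"i1": 0, "i2": 0, "i3": 0, "i4": 1, "i5": 1}
runs = [
    {"i1": "x", "i2": "x", "i3": "x", "i4": "y", "i5": "z"},
    {"i1": "a", "i2": "a", "i3": "a", "i4": "a", "i5": "b"},
]
stability = cluster_stability(reference, runs)
print(stability)
```

In this toy case cluster 0 is reproduced in every run, while cluster 1 never reappears and would be a candidate phantom cluster.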


Typicality

Being part of a cluster tells you nothing about how "typical" a member of the cluster you are, i.e., how close to the average. This problem is exacerbated by the fact that the clusters inferred by mclust may have varying shape, size, and orientation.

Nonetheless, there are ideas on how to quantify members' typicality, and I will explore them. Please note that typicality is not necessarily the same as "purity". For example, an elongated cluster of African Americans will have typical members with 20% European admixture, but the "purest" African Americans will have 0% European admixture and be very atypical of their group as a whole. Similarly, typical Turks have 5-6% East Eurasian admixture; people with 10% East Eurasian admixture are less typical, but more likely to be descended from Central Asian Turkic peoples.
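
One standard measure that accounts for a cluster's shape, size, and orientation is the Mahalanobis distance from the cluster mean. The sketch below is only an illustration of that general idea, not necessarily the measure that will be used here. It assumes two hypothetical coordinates per individual (the first standing in for percent European admixture) and made-up data for an elongated cluster.

```python
def mean_and_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical elongated cluster: wide spread along the first coordinate,
# very narrow spread along the second.
cluster = [(5, 0.1), (12, -0.2), (18, 0.0), (20, 0.2),
           (22, -0.1), (27, 0.1), (36, -0.1)]
mean, cov = mean_and_cov(cluster)

d_typical = mahalanobis_sq((20.0, 0.0), mean, cov)  # at the cluster average
d_pure = mahalanobis_sq((0.0, 0.0), mean, cov)      # 0% on the first axis
d_offaxis = mahalanobis_sq((20.0, 1.0), mean, cov)  # small step off the long axis
print(d_typical, d_pure, d_offaxis)
```

With these made-up numbers, the individual at the cluster average scores near zero, the "purest" individual is less typical, and an individual only one unit off the long axis is the least typical of the three, because the cluster is so narrow in that direction.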

Any new technique will have its birth pains, and hopefully others and I will be able to identify and resolve them.


  1. Would your clusters galore be sensitive enough to tell me whether my EuroDNACalc result is due to recent non-Greek admixture? I'm trying to determine whether my 7% AJ score is from a g-g-grandparent with the surname root of ζαφειρι. My NW component seems elevated (28%), too.

  2. 1. I wouldn't put too much value on Euro-dna-calc results for anyone who's been analyzed in the Dodecad Project.
    2. ζαφείρης comes from "sapphire", and it's common practice to use names of precious stones as personal names (e.g., Διαμάντω, Ζαφείρω, Ζμαράγδω, etc.); see here:

    3. We need more Greeks to join the project, because if one component seems elevated, that does not necessarily mean a recent non-Greek ancestor. For example, if someone has a Greek Cypriot ancestor he may have elevated SW Asian and reduced N Euro components. But what happens if one has a North Epirote ancestor? A Pontian ancestor? A Greek from Eastern Rumelia, or a Cappadocian, or an Aegyptiote Greek?

    The sample of 10 Greeks does not allow us to say anything about regional variation among Greeks at this point.

  3. Thanks for that link. It's a great resource, all the more so with the Google Translate widget embedded!