Monday, November 29, 2010

Galore analysis improved, plus K=56 results of Clusters Galore analysis for Dodecad Project members (up to DOD236)

For background, please read the post on the K=48 analysis and links therein.

As I mentioned in the previous posts, my technique depends on the use of MDS to reduce the dimensionality of genomic data from 177,000 SNPs or so to a few dozen dimensions capturing most of the variance.

Subsequently, mclust, a state-of-the-art clustering algorithm, is applied to the MDS representation: it iterates over choices of K, the number of clusters, trying clusters of different shape, volume, and orientation, and chooses the optimal clustering, the one that maximizes the Bayesian Information Criterion (BIC). In simpler terms, it finds as much detail as possible in the data but penalizes overly ornate models and avoids finding "ghost" clusters that are not really supported.
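To make the trade-off between detail and model complexity concrete, here is a minimal Python sketch of a BIC comparison. It is not mclust (which is an R package) and does not fit anything by EM: the toy 1-D data, the helper names, and the hand-picked split into two groups are all assumptions made for illustration. BIC is written in the sign convention where larger is better, in keeping with "maximizing" above.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_loglik(data, weights, means, variances):
    """Log-likelihood of the data under a Gaussian mixture."""
    total = 0.0
    for x in data:
        density = sum(w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total

def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' convention: fit minus complexity penalty."""
    return 2 * loglik - n_params * math.log(n_obs)

def mle(values):
    """Maximum-likelihood mean and variance of a sample."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Toy data (invented): two tight, well-separated groups.
left = [-0.2, -0.1, 0.0, 0.1, 0.2]
right = [4.8, 4.9, 5.0, 5.1, 5.2]
data = left + right

# K=1: one Gaussian, 2 free parameters (mean, variance).
m, v = mle(data)
bic_one = bic(mixture_loglik(data, [1.0], [m], [v]), n_params=2, n_obs=len(data))

# K=2: two Gaussians, 5 free parameters (2 means, 2 variances, 1 weight).
# Parameters are plugged in from the obvious split instead of being fitted by EM.
(m1, v1), (m2, v2) = mle(left), mle(right)
bic_two = bic(mixture_loglik(data, [0.5, 0.5], [m1, m2], [v1, v2]),
              n_params=5, n_obs=len(data))

best_k = 1 if bic_one > bic_two else 2
```

On data like this the extra parameters of the two-cluster model pay for themselves; on data with no real split, the penalty term would favor the simpler model.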

These are clusters derived from data of unlabeled individuals. The only human input into the process is the number of MDS dimensions to retain.

In my previous K=48 analysis, I retained 30 dimensions, but I also noted that this is not really optimal. Choosing more, or fewer, dimensions might lead to even better resolution (higher K).

More dimensions = more possible ways to distinguish between individuals, but also, possibly, more noise, as individuals might not be "clustered" in them.

Fewer dimensions = fewer possible ways to distinguish between individuals, but also, possibly, less noise from the uninformative higher dimensions.

Thus, the question arises: how many dimensions to retain?

Here is a plot of the optimal number of clusters inferred, depending on how many dimensions I chose to retain:
As you can see, when only a few dimensions are retained, relatively few clusters are inferred, while once the number of dimensions goes beyond a certain point, the number of clusters starts to decrease again, as more noise is added (*).

The number of clusters peaks (see figure) at 16 and 22 dimensions retained; both of these produce 56 different clusters in the optimal solution.
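The scan over the number of retained dimensions can be sketched in a few lines of Python. Everything named here is an assumption for illustration: `count_clusters` stands in for "run MDS, keep the first d dimensions, run the clustering, report the optimal K", and `toy_count` is a made-up curve with two tied peaks, not the real results from the plot.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def scan_dimensions(dims: Iterable[int],
                    count_clusters: Callable[[int], int]) -> Tuple[int, List[int]]:
    """Return the largest K found and every dimension count that attains it."""
    results: Dict[int, int] = {d: count_clusters(d) for d in dims}
    best_k = max(results.values())
    best_dims = sorted(d for d, k in results.items() if k == best_k)
    return best_k, best_dims

def toy_count(d: int) -> int:
    """Hypothetical stand-in for the real pipeline: peaks at d=4 and d=7."""
    return 10 - min(abs(d - 4), abs(d - 7))

best_k, best_dims = scan_dimensions(range(2, 9), toy_count)
```

Returning all tied dimension counts, not just the first, mirrors the situation above where two different choices give the same maximal K and one of them has to be picked by hand.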

Here are the results for Dodecad Project members (up to DOD236) with K=56 and 16 dimensions retained. In comparison to the previous K=48 analysis, we are now able to:
  1. Split CEU White Utahns (#1) from French (#15)
  2. Split CEU White Utahns (#1) from continental Germanics (#14)
  3. Split French (#15) from Spaniards (#2)
  4. Split Armenians (#7) from Turks (#19)
  5. Split Slavs (#23) from Balts (#26)
  6. Split Cypriots (#30) from Sephardic Jews (#21)
The most astonishing finding, at least for me, is the emergence of a cluster (#16) composed overwhelmingly of people from Greece and Southern Italy, with very few individuals from elsewhere. Notice that #16 has absolutely no representation in the reference populations, which lack South Italians and Greeks.

Once again, I urge participants to help themselves and others by leaving a comment in the ancestry thread.

(*) The change is not, however, smooth. The more general problem is to choose which dimensions to retain, rather than how many of the first ones to retain. The first few dimensions of MDS each capture a decreasing portion of the variance, but the data are not guaranteed to be "split" in them. However, this is a much harder problem, as we have to figure out (i) how many dimensions to retain, and (ii) which ones. Even if we fix (i), by choosing to retain, e.g., 10 dimensions, we still have to choose which 10: there are close to half a billion different ways to choose 10 out of the 38 possible candidate dimensions.
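The count quoted in the footnote can be checked directly with a binomial coefficient. This is a small Python sketch (it needs Python 3.8 or later for `math.comb`); the variable names are just for illustration.

```python
import math

candidates = 38   # total candidate MDS dimensions
retained = 10     # how many we decide to keep

# Number of distinct ways to pick which 10 of the 38 dimensions to retain.
n_subsets = math.comb(candidates, retained)
print(f"{n_subsets:,}")  # 472,733,756

# If the number to retain is not fixed either, every non-empty subset is a candidate.
n_all_subsets = 2 ** candidates - 1
```

Since each candidate subset would need its own clustering run, exhaustive search is out of the question even with the number of dimensions fixed.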

14 comments:

  1. Dienekes,

    This is truly awesome!

    Can I ask one favor, that you label the Clusters 1-36 with their best population match.

    Like what is cluster 9? As I submitted DOD188 (Sicilian/Polish) and he is 100% in Cluster 9 - I'm wondering if this is some population halfway between both paternal populations, like Balkan or something??

  2. Well, it's hard to label so many clusters, plus if you put a very specific label on them, people who don't match the label could feel bad. And if an individual from population X is the sole member of a cluster made of individuals from population Y, then he might feel that he is Y, when, in fact, it might be the case that he is the only member of X who has submitted a sample.

    With respect to cluster 9, let's just say it's got a lot of Balkan people in it; as is well known, the Balkans have an old Greco-Roman substratum and a more recent Slavic superstratum. And there is at least one more Sicilian+Polish sample in it.

    Hopefully more people will post their info in the ancestry thread.

  3. Also, I note that my mother (DOD099) was 97% of the CEU/French cluster at K=48, while my father (DOD098) and Spencer Wells (DOD162) were split over the CEU/French cluster and CEU proper cluster, however at K=56 they are all 100% in the CEU proper cluster?!

    Whereas Bubba (DOD066) who is mostly North German with a pinch of Danish, at K=48 is 100% CEU, but at K=56 is 19% CEU and 81% German

    What do you make of this?

    Also, are there any other Irish/English/Scottish people to compare with???

  4. "Also, I note that my mother (DOD099) was 97% of the CEU/French cluster at K=48, while my father (DOD098) and Spencer Wells (DOD162) were split over the CEU/French cluster and CEU proper cluster, however at K=56 they are all 100% in the CEU proper cluster?!"

    If you think about it, when the French got their own cluster, the CEU proper cluster (which is largely a British Isles cluster in the available samples) became tighter and "better defined".

    "Whereas Bubba (DOD066) who is mostly North German with a pinch of Danish, at K=48 is 100% CEU, but at K=56 is 19% CEU and 81% German"

    Again, the split of CEU from continental Germanics caused both clusters to be better defined.

    I should stress again that what appears to be X now may not be X tomorrow with more samples. E.g., will the Dutch be more like CEU or more like continental Germanic? Will they get their own cluster? Who knows...

  5. Regarding the Greco-Roman substratum and Slavic superstratum in the Balkans, would it be possible to separate those two components in ADMIXTURE? Perhaps you could run Balkan Slavs and Romanians with Greeks and South Italians on one side and Balto-Slavs on the other side and see what you get.

  6. There are many parameters to the problem. For example, the Balkan cluster includes the few Slavs in my project, but is centered on Romanians, who are not Slavs, and there is at least 1 Greek in it. Where are the Albanians? I don't know, because I don't have any. So I would hesitate to call it Slavic, although it does seem to be centered in the area north of Greece where mostly Slavs currently live.

    I have very few Balkan people in the Project, and more are always welcome. You can't find regional or ethnic-specific clusters unless you have a few individuals from each population to work with.

  7. Are you going to do a dendrogram of the K=56 analysis?

    It would be interesting to add other ethnic clusters that are available like the Lebanese, Yemeni and non European Jewish groups.

  8. Would you agree based on your project results that South Slavic speaking people are genetically entirely descended from indigenous Balkan populations?

  9. "Would you agree based on your project results that South Slavic speaking people are genetically entirely descended from indigenous Balkan populations?"

    I don't have many south Slavic samples to work with, plus I doubt anyone is "entirely" descended from anyone.

  10. OK, if we leave out semantics and replace entirely with 'mostly'? After all, haplogroup I1b has been in the Balkans since the ice age.

  11. "OK, if we leave out semantics and replace entirely with 'mostly'? After all, haplogroup I1b has been in the Balkans since the ice age."

    What ancient DNA has taught me is that unless I see it in ancient bones, I don't believe it. So, while I1b (or whatever it's called now) in the Balkans in the Ice Age is a good theory, I want to see it confirmed that it was there. Also, we must remember that the Balkans are not homogeneous today, and we can't be sure that they were so in the past.

  12. "Thus, the question arises: how many dimensions to retain?"

    It depends on what you want to know, right?

    If you want to know, for instance, the difference between an Iranian and a Syrian in a blind "you're here or there" way, the Clusters Galore approach with many dimensions may be the way to go.

    But if you want to look at genetic prehistory and know the relationships between people, the Clusters Galore approach yields an overdetermined result. The Admixture results are most informative for this. Additionally, you can easily discern the difference between a Syrian and an Iranian with the Admixture result.

    The Clusters Galore approach tells people that they are different or the same now, but not how, or when.

    Some people will want to know that they are different and will be satisfied with the Clusters Galore approach.

    However, most people want to know how they are related to others. They want to understand the history of their ancestors.

    The Admixture results are a much better approach for this.

    Getting back to your original question regarding dimensions: 10 to 12 dimensions seems to handle West Eurasian populations just fine. I'd guess that fewer than 20 dimensions would do for all of Eurasia, including the Middle East, Siberia, South East Asia and India.

  13. I am in cluster #9 on "galore analysis improved", but cluster #10 on "clusters galore" which was the entry just prior to it. I'm DOD236, last on the list. (mostly Italian descent with a dash of eastern Euro). Which cluster would be said to be more accurate for me then? Thanks for any info. ~Dino

  14. http://dodecad.blogspot.com/2011/01/clusters-galore-results-k64-for-dodecad.html
