For background, please read the post on the K=48 analysis and links therein.
As I mentioned in the previous posts, my technique depends on the use of MDS to reduce the dimensionality of genomic data from 177,000 SNPs or so to a few dozen dimensions capturing most of the variance.
Subsequently, mclust, a state-of-the-art clustering algorithm, is applied to the MDS representation. It iterates over choices of K, the number of clusters, trying clusters of different shape, volume, and orientation, and chooses the optimal clustering, the one that maximizes the Bayesian Information Criterion. In simpler terms, it finds as much detail as possible in the data but penalizes overly ornate models and avoids finding "ghost" clusters that are not really supported.
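To make the penalty concrete, here is a minimal sketch of BIC-based model selection. It assumes the sign convention under which larger BIC is better (2 × log-likelihood minus number of parameters × log of the sample size) and counts parameters only for the unconstrained-covariance case. The function names and the candidate log-likelihoods are hypothetical, chosen for illustration, and are not taken from the analysis itself.

```python
import math


def n_params_full(k: int, d: int) -> int:
    """Free parameters of a k-component Gaussian mixture in d dimensions
    when every component has its own unconstrained covariance matrix."""
    mixing = k - 1                       # mixing proportions sum to 1
    means = k * d                        # one mean vector per component
    covariances = k * d * (d + 1) // 2   # symmetric d x d matrix per component
    return mixing + means + covariances


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """BIC in the 'larger is better' convention: 2*logL - p*log(n)."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def best_model(candidates, n_obs: int) -> str:
    """candidates: iterable of (label, loglik, n_params) tuples.
    Returns the label of the candidate with the highest BIC."""
    return max(candidates, key=lambda c: bic(c[1], c[2], n_obs))[0]


# Hypothetical log-likelihoods, for illustration only: the K=3 model fits
# slightly better, but not by enough to pay for its extra parameters.
toy = [
    ("K=2", -100.0, n_params_full(2, 2)),
    ("K=3", -99.0, n_params_full(3, 2)),
]
print(best_model(toy, n_obs=50))
```

Note how quickly the parameter count grows with the number of retained dimensions in the unconstrained case; this is one reason why trying simpler cluster shapes, volumes, and orientations matters.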
These are clusters derived from data of unlabeled individuals. The only human input into the process is the number of MDS dimensions to retain.
In my previous K=48 analysis, I retained 30 dimensions, but I also noted that this is not necessarily optimal. Choosing more or fewer dimensions might lead to even better resolution (higher K).
More dimensions = more possible ways to distinguish between individuals, but also, possibly, more noise, as individuals might not be "clustered" in them.
Fewer dimensions = fewer possible ways to distinguish between individuals, but also, possibly, less noise from the uninformative higher dimensions.
Thus, the question arises: how many dimensions to retain?
Here is a plot of the optimal number of clusters inferred, depending on how many dimensions I chose to retain:
As you can see, when only a few dimensions are retained, relatively few clusters are inferred; once the number of dimensions goes beyond a certain point, the number of clusters starts to decrease again, as more noise is added (*).
The number of clusters peaks (see figure) at 16 and 22 dimensions retained; both of these produce 56 different clusters in the optimal solution.
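The selection step can be sketched as a small helper that, given the optimal K found for each number of retained dimensions, returns the peak K and every dimension count that reaches it. The helper name is hypothetical, and the dictionary holds only the three values stated in these posts (16 and 22 dimensions give K=56, 30 dimensions gave K=48), not the full curve from the plot.

```python
def peak_dimensions(k_by_dims: dict[int, int]) -> tuple[int, list[int]]:
    """Return the largest K and all dimension counts that achieve it."""
    best_k = max(k_by_dims.values())
    dims = sorted(d for d, k in k_by_dims.items() if k == best_k)
    return best_k, dims


# Only the values reported in the text; the rest of the sweep is not reproduced.
reported = {16: 56, 22: 56, 30: 48}
best_k, dims = peak_dimensions(reported)
print(best_k, dims)  # 56 [16, 22]
```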
Here are the results for Dodecad Project members (up to DOD236) with K=56 and 16 dimensions retained. In comparison to the previous K=48 analysis, we are now able to:
- Split CEU White Utahns (#1) from French (#15)
- Split CEU White Utahns (#1) from continental Germanics (#14)
- Split French (#15) from Spaniards (#2)
- Split Armenians (#7) from Turks (#19)
- Split Slavs (#23) from Balts (#26)
- Split Cypriots (#30) from Sephardic Jews (#21)
Once again, I urge participants to help themselves and others by leaving a comment in the ancestry thread.
(*) The change is not, however, smooth. The more general problem is to choose which dimensions to retain, rather than choosing how many of the first ones to retain. The first few dimensions of MDS capture a decreasing portion of variance, but the data are not guaranteed to be "split" in them. However, this is a much harder problem, as we have to figure out (i) how many dimensions to retain, and (ii) which ones. Even if we fix (i), by choosing to retain, e.g., 10 dimensions, we still have to choose which 10: this is close to half a billion different combinations of which 10 to choose from the total of 38 possible candidate ones.
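The "close to half a billion" figure can be checked with a few lines of Python. This is just the binomial coefficient for choosing 10 of 38 dimensions, plus (as an added illustration) the number of non-empty subsets if the count to retain is not fixed either; the variable names are my own.

```python
import math

TOTAL_DIMS = 38   # candidate MDS dimensions, as stated in the footnote
RETAINED = 10     # example number of dimensions to keep

# Ways to choose which 10 of the 38 dimensions to retain.
subsets_of_ten = math.comb(TOTAL_DIMS, RETAINED)

# If the number retained is also free, every non-empty subset is a candidate.
all_nonempty_subsets = 2**TOTAL_DIMS - 1

print(f"{subsets_of_ten:,}")        # 472,733,756
print(f"{all_nonempty_subsets:,}")  # 274,877,906,943
```

Since each candidate subset would need its own clustering run, exhaustive search is out of reach even with the subset size fixed.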
Dienekes,
This is truly awesome!
Can I ask one favor, that you label the Clusters 1-36 with their best population match.
Like what is cluster 9? As I submitted DOD188 (Sicilian/Polish) and he is 100% in Cluster 9 - I'm wondering if this is some population halfway between both paternal populations, like Balkan or something??
Well, it's hard to label so many clusters, plus if you put a very specific label on them, people who don't match the label could feel bad. And if an individual from population X is the sole member of a cluster made of individuals from population Y, then he might feel that he is Y, when, in fact, it might be the case that he is the only member of X who has submitted a sample.
With respect to cluster 9, let's just say it's got a lot of Balkan people in it; as is well known, the Balkans have an old Greco-Roman substratum and a more recent Slavic superstratum. And there is at least one more Sicilian+Polish sample in it.
Hopefully more people will post their info in the ancestry thread.
Also, I note that my mother (DOD099) was 97% of the CEU/French cluster at K=48, while my father (DOD098) and Spencer Wells (DOD162) were split over the CEU/French cluster and CEU proper cluster, however at K=56 they are all 100% in the CEU proper cluster?!
Whereas Bubba (DOD066) who is mostly North German with a pinch of Danish, at K=48 is 100% CEU, but at K=56 is 19% CEU and 81% German
What do you make of this?
Also, are there any other Irish/English/Scottish people to compare with???
"Also, I note that my mother (DOD099) was 97% of the CEU/French cluster at K=48, while my father (DOD098) and Spencer Wells (DOD162) were split over the CEU/French cluster and CEU proper cluster, however at K=56 they are all 100% in the CEU proper cluster?!"
If you think about it, when the French got their own cluster, the CEU proper cluster (which is largely a British Isles cluster in the available samples) became tighter and "better defined".
"Whereas Bubba (DOD066) who is mostly North German with a pinch of Danish, at K=48 is 100% CEU, but at K=56 is 19% CEU and 81% German"
Again, the split of CEU from continental Germanics caused both clusters to be better defined.
I should stress again that what appears to be X now may not be X tomorrow with more samples. E.g., will the Dutch be more like CEU or more like continental Germanic? Will they get their own cluster? Who knows...
Regarding the Greco-Roman substratum and Slavic superstratum in the Balkans, would it be possible to separate those two components in ADMIXTURE? Perhaps you could run Balkan Slavs and Romanians with Greeks and South Italians on one side and Balto-Slavs on the other side and see what you get.
There are many parameters to the problem. For example, the Balkan cluster includes the few Slavs in my project, but is centered on Romanians, who are not Slavs, and there is at least 1 Greek in it. Where are the Albanians? I don't know, because I don't have any. So I would hesitate to call it Slavic, although it does seem to be centered in the area north of Greece where mostly Slavs currently live.
I have very few Balkan people in the Project, and more are always welcome. You can't find regional or ethnic-specific clusters unless you have a few individuals from each population to work with.
Are you going to do a dendrogram of the K=56 analysis?
It would be interesting to add other ethnic clusters that are available, like the Lebanese, Yemeni and non-European Jewish groups.
Would you agree based on your project results that South Slavic speaking people are genetically entirely descended from indigenous Balkan populations?
"Would you agree based on your project results that South Slavic speaking people are genetically entirely descended from indigenous Balkan populations?"
I don't have many south Slavic samples to work with, plus I doubt anyone is "entirely" descended from anyone.
OK, if we leave out semantics and replace entirely with 'mostly'? After all, haplogroup I1b has been in the Balkans since the ice age.
"OK, if we leave out semantics and replace entirely with 'mostly'? After all, haplogroup I1b has been in the Balkans since the ice age."
What ancient DNA has taught me is that unless I see it in ancient bones, I don't believe it. So, while I1b (or whatever it's called now) in the Balkans in the Ice Age is a good theory, I want to see it confirmed that it was there. Also, we must remember that the Balkans are not homogeneous today, and we can't be sure that they were so in the past.
"Thus, the question arises: how many dimensions to retain?"
It depends on what you want to know, right?
If you want to know, for instance, the difference between an Iranian and a Syrian in a blind "you're here or there" way, the Clusters Galore approach with many dimensions may be the way to go.
But if you want to look at genetic prehistory and know the relationships between people, the Clusters Galore approach yields an overdetermined result. The Admixture results are most informative for this. Additionally, you can easily discern the difference between a Syrian and an Iranian with the Admixture result.
The Clusters Galore approach tells people that they are different or the same now, but not how, or when.
Some people will want to know that they are different and will be satisfied with the Clusters Galore approach.
However, most people want to know how they are related to others. They want to understand the history of their ancestors.
The Admixture results are a much better approach for this.
Getting back to your original question regarding dimensions: 10 to 12 dimensions seem to handle West Eurasian populations just fine. I'd guess that fewer than 20 dimensions would do for all of Eurasia, including the Middle East, Siberia, South East Asia and India.
I am in cluster #9 on "galore analysis improved", but cluster #10 on "clusters galore" which was the entry just prior to it. I'm DOD236, last on the list. (mostly Italian descent with a dash of eastern Euro). Which cluster would be said to be more accurate for me then? Thanks for any info. ~Dino
http://dodecad.blogspot.com/2011/01/clusters-galore-results-k64-for-dodecad.html