Dodecad Ancestry Project: A note of explanation on the analysis process

You may have noticed that the first figure (for all populations) in the various batch results always looks nearly the same.

I am intentionally processing participants in small batches of 10 individuals, as I do not want the composition of a heterogeneous sample to bias the results. Thus, you can safely assume that, e.g., "South European" or "West Asian" mean nearly exactly the same thing across batches.

Another important aspect to consider is the robustness of the result. If you feed any data into ADMIXTURE, you will receive admixture proportions; that's what it does. But are these proportions meaningful?

There are three reasons why I am confident that they are:

First, the components all have strong correlations with particular ethnic and geographical groups. They are not random groupings of individuals with no detectable pattern.

Second, the quality of the model can be assessed with the so-called Bayesian Information Criterion (BIC), a statistical quantity that jointly assesses (i) how well the model fits the data, and (ii) how parsimonious it is.

Adding a larger number of components tends to increase the model fit. Think of it as trying to reproduce an oil painting with an increasing number of hues: it gets progressively better and better.

However, this comes at a cost in parsimony. You could claim that the painter used 200 different paint colors to produce his painting, but this is overkill. You can approximate the painting with a smaller number (e.g. the 4 colors used by ancient painters) and reach almost as good a result, considering the admixture between these components.

It is evident that the BIC has increased up to K=10 (which is the level at which you are getting results). This is an additional piece of evidence that this is a robust model that neither overfits the data nor sells itself short on its power of inference.
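To make the fit-versus-parsimony trade-off concrete, here is a minimal sketch of a BIC-style score. It is only an illustration under stated assumptions: the sign convention (log-likelihood minus a penalty, so that higher is better) is assumed because it matches the statement that the BIC increased; the parameter count and the choice of the number of observations are simplifications; and the log-likelihood values are invented, not project results.

```python
import math

def bic_score(log_likelihood, n_params, n_obs):
    """Log-likelihood minus a complexity penalty (assumed convention: higher is better)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

def n_free_params(n_individuals, n_snps, k):
    # Assumption: K-1 free ancestry fractions per individual,
    # plus one allele frequency per SNP for each of the K components.
    return n_individuals * (k - 1) + n_snps * k

# Hypothetical data set size (not the project's real numbers).
N_INDIVIDUALS, N_SNPS = 100, 1000
N_OBS = N_INDIVIDUALS * N_SNPS  # assumption: one observation per genotype

# Invented log-likelihoods: fit always improves with K, but by less each time.
hypothetical_loglik = {2: -120000.0, 3: -110000.0, 4: -108000.0}

scores = {
    k: bic_score(ll, n_free_params(N_INDIVIDUALS, N_SNPS, k), N_OBS)
    for k, ll in hypothetical_loglik.items()
}
best_k = max(scores, key=scores.get)

for k in sorted(scores):
    print(f"K={k}: score={scores[k]:.1f}")
print("preferred K:", best_k)
```

In this toy example the fit keeps improving from K=3 to K=4, but not by enough to pay for the extra parameters, so the score prefers K=3. This is the sense in which the criterion guards against the "200 colors" kind of overkill.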

Third, the data shows remarkable stability. As you increase the number of variables (such as K), the number of possible solutions (admixture proportions) increases exponentially.

This means that you can describe the same data in many different ways, all of which are about equally likely. This is the statistical problem of local minima. That is, you are not getting the best solution to the problem, because it would take forever to compute, but you are getting a quite good one.

Going back to the painting example, you can achieve pretty much the same result with lots of different hues if you have a palette of 8. Surely, if none of your colors has a blue component, then you will not be able to match reality, but there are many other color combinations that you could use, mixing the colors together to achieve the same result.
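The palette analogy can be sketched numerically. The example below assumes the usual admixture-style mixing rule, in which an individual's expected allele frequency at each SNP is the proportion-weighted sum of the component frequencies. The two "palettes" and all numbers are made up for illustration, and it uses a single individual, where the ambiguity is at its most extreme.

```python
def mixed_frequencies(proportions, component_freqs):
    """Expected allele frequency per SNP: sum over components of q_k * f_kj."""
    n_snps = len(component_freqs[0])
    return [
        sum(q * freqs[j] for q, freqs in zip(proportions, component_freqs))
        for j in range(n_snps)
    ]

# Palette A: two sharply distinct hypothetical components (rows) over two SNPs.
palette_a = [[0.0, 1.0], [1.0, 0.0]]
mix_a = [0.3, 0.7]

# Palette B: two less distinct hypothetical components, mixed in different proportions.
palette_b = [[0.25, 0.75], [0.75, 0.25]]
mix_b = [0.1, 0.9]

result_a = mixed_frequencies(mix_a, palette_a)
result_b = mixed_frequencies(mix_b, palette_b)
print(result_a, result_b)
```

Both descriptions reproduce the same expected frequencies even though the components and the proportions differ. With many individuals and many SNPs the data constrain the solution far more, but the example shows why alternative descriptions of the same data are possible in principle.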

By exploiting the random seed facility of ADMIXTURE, as well as by combining the public data with that of project participants, I can "shake the pot" of the ancestry analysis. No matter what I do, though, the results at this level of analysis turn out the same, hence the repetitive first figure in the batch results.

This indicates (but does not prove) that this K=10 level analysis is indeed very close to (if not perfectly) optimal, and, at least, is not simply a just-so story: it reflects intrinsic qualities of the data.
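One way to check this kind of stability is to compare the admixture proportions from two runs started with different random seeds. The sketch below is not the procedure used in the project; it assumes each run's proportions are available as a list of rows (one per individual, one column per component) and that components may come out in a different order in each run. The function name and the toy numbers are hypothetical.

```python
from itertools import permutations

def align_and_compare(q_a, q_b):
    """Find the column order of q_b that best matches q_a.

    Returns (perm, mean_abs_diff), where perm[j] is the column of q_b
    matched to column j of q_a.
    """
    n, k = len(q_a), len(q_a[0])
    best_perm, best_err = None, float("inf")
    for perm in permutations(range(k)):
        err = sum(
            abs(q_a[i][j] - q_b[i][perm[j]]) for i in range(n) for j in range(k)
        ) / (n * k)
        if err < best_err:
            best_perm, best_err = perm, err
    return best_perm, best_err

# Toy proportions for three individuals and K=3 (invented numbers).
run_seed_1 = [[0.70, 0.20, 0.10], [0.10, 0.80, 0.10], [0.25, 0.25, 0.50]]
# Same individuals, different seed: components in another order, tiny differences.
run_seed_2 = [[0.11, 0.69, 0.20], [0.10, 0.10, 0.80], [0.49, 0.26, 0.25]]

perm, err = align_and_compare(run_seed_1, run_seed_2)
print("column matching:", perm, "mean absolute difference:", round(err, 4))
```

A small mean absolute difference after alignment means the two runs tell the same story. Trying every column order is only practical for small K; at K=10 there are over 3.6 million orderings, so an assignment solver would be a better choice.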

Looking into the future

By participating in the project, however, you can change the genetic picture of Eurasia. These K=10 results are robust, but they are dependent on the sampled populations.

Consider, for example, what would have happened if we had not included Yakuts, Uygurs, and Chuvash in the analysis. Almost certainly, we would not have uncovered the "Northeast Asian" component.

A similar story exists now for many parts of the world: is there a specific Scandinavian cluster? We can't say, because the combined samples of HGDP, HapMap-3, and Behar et al. (2010) that are publicly available are devoid of Scandinavian populations. Is there a specific Balkan cluster? The same applies, as only a sample of Romanians is included.

Thus, prospective participants should know that they not only get an estimate of their admixture proportions from the existing 10 components, but they may help - if enough of them contribute - define new ones.

Such new clusters may or may not exist, and they may or may not be detectable at the level of detail provided by the analysis. However, we will not find out unless we try, which is why it is probably a good idea for people to submit their results.

See if you are in one of the groups of the pilot phase but, by all means, contact me (dodecad@gmail.com) if you think you belong to an under-represented group that would help fill out the genetic picture of Eurasia.

1 comment:

Onur Dincer, October 27, 2010 at 5:26 PM
Dieneke, you wrote two days ago on your other blog that you were running K=11 and K=12, but have never published their results since then. Are you refraining from publishing their results because they came with decreased LogLikelihood and Bayes Information Criterion? If they didn't decrease them but instead increased, then what are you waiting for? Are their results still unavailable? BTW, I hope you included more populations in them than at K=10.

Wednesday, October 27, 2010
