The genome blogger Polako recently announced a "calculator effect" (May 2012) affecting admixture estimates:
However, many people are getting skewed results, despite doing everything right. For instance, users from the UK often come out much more continental European than they should. Some of them actually believe that this is because they're genetically more Norman or Saxon than the average Brit. Nope, the real reason is what I call the "calculator effect". This is when the algorithm produces different results for people who are part of the original ADMIXTURE runs that set up the allele frequencies used by the calculators, than those who aren't, even though both sets of users are of exactly the same origin, and should expect basically identical results.
This, however, was described by me many months earlier, in November 2011, following up on observations made during my first analysis of the Yunusbayev et al. Armenians in September 2011. It has been listed in the Technical Stuff at the bottom of this blog ever since.
I had observed at the time that the newly available Yunusbayev et al. Armenian sample appeared more "European" using the Dodecad v3 calculator tool, which had been built using the Project Armenians (Armenian_D) as well as the Armenian sample of Behar et al.
I then explained why this was happening and released new versions of the Dodecad tools, such as K12a and K12b, and more recently K10a, as new scientific and project participant samples became available.
Polako also proposes a "solution" to the problem:
I actually designed my Eurogenes ancestry tests for Gedmatch with this problem in mind, by only using academic references to source the allele frequencies. This means that test results for Eurogenes project members and non-members are directly comparable. Perhaps other genome bloggers can eventually do the same?
The only effect of this "solution" is to ensure that there is a "calculator effect" for everyone using his tools. For example, if he uses only published Finns and Lithuanians to build his calculator, then all Finns and Lithuanians who take his test will wonder why they are "different" from the published Finns and Lithuanians, because they will all suffer a "calculator effect" with respect to the reference populations. So, perhaps they will all be on an equal footing with respect to each other, but their results will all be biased because of the issue I had identified.
Moreover, their results will never improve as more people join his Project, because these new people will not be included in newer versions of calculators: all users of DIY Eurogenes tools will continue to receive sub-par results. Well, small consolation, at least they'll all receive comparable sub-par results.
The solution to this problem was also described in my original post, and it's not an unimaginative quick fix of biasing everyone's results with respect to the reference populations:
What can we do to solve this problem? Sample, sample, sample. There is no shortcut. The gross details of the genetic landscape (such as the relationship between major continental groups) are easy to infer, but the details will always have room for improvement.
It is only by adequate sampling, that is, by including more and more people rather than excluding even the ones we have, that ever more accurate admixture estimators can be devised. As sample sizes grow (as more scientists publish their data, and more people join projects such as this one), allele frequencies of the different components will become ever more secure, and the deviations of individuals who did not contribute to the inference of the genetic components will converge to zero.
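The convergence argument can be illustrated with a toy simulation. This is only a sketch under strong simplifying assumptions, not the ADMIXTURE algorithm and not how any Dodecad calculator is built: it assumes a single hypothetical population, independent SNPs, and allele frequencies estimated by simple counting. The function name `inclusion_gap` and all parameter values are made up for the illustration. It measures how much closer reference individuals sit to the estimated frequencies than equally typical individuals who were left out.

```python
import random


def inclusion_gap(n_ref, n_out=50, n_snps=1000, seed=0):
    """Toy measure of inclusion bias for one hypothetical population.

    Returns (mean distance of left-out individuals) minus
    (mean distance of reference individuals), where distance is the mean
    squared difference between an individual's allele dosage (genotype / 2)
    and the allele frequencies estimated from the reference sample.
    """
    rng = random.Random(seed)
    # Assumed "true" allele frequencies; in reality these are unknown.
    true_freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    def draw_individual():
        # Genotype = number of copies (0, 1 or 2) of the counted allele.
        return [(rng.random() < p) + (rng.random() < p) for p in true_freqs]

    reference = [draw_individual() for _ in range(n_ref)]
    outsiders = [draw_individual() for _ in range(n_out)]

    # Frequencies estimated only from the reference sample.
    est = [
        sum(ind[j] for ind in reference) / (2 * n_ref)
        for j in range(n_snps)
    ]

    def distance(ind):
        return sum((g / 2 - f) ** 2 for g, f in zip(ind, est)) / n_snps

    mean_in = sum(distance(ind) for ind in reference) / n_ref
    mean_out = sum(distance(ind) for ind in outsiders) / n_out
    return mean_out - mean_in


if __name__ == "__main__":
    for n in (10, 100, 400):
        print(n, inclusion_gap(n))
```

In this simplified setting the expected gap is proportional to 1/n, where n is the reference sample size, so both groups come from exactly the same population and differ only in whether they helped estimate the frequencies. Adding reference individuals shrinks the gap; removing them enlarges it.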
I am already quite confident that inclusion biases amount to only a few percent for Dodecad Project tools and only for the closely related components (e.g., West Asian vs. North European); as mentioned in my original post, these biases are trivial for more distantly related components (e.g., European vs. East Asian).
And the way to further reduce the biases that do persist is to foster participation, rather than consign everyone to a sort of fossilized mediocrity, excluding whole populations of active direct-to-consumer customers (e.g., Norwegians, or Assyrians, or Iraqis, or Germans, or Koreans, ...) on the basis that no "academic reference" has made dense genotype data on them freely and publicly accessible.