Saturday, August 11, 2012

On the so-called "Calculator Effect"

The genome blogger Polako recently announced a calculator effect (May 2012) affecting admixture estimates:
However, many people are getting skewed results, despite doing everything right. For instance, users from the UK often come out much more continental European than they should. Some of them actually believe that this is because they're genetically more Norman or Saxon than the average Brit. Nope, the real reason is what I call the "calculator effect". This is when the algorithm produces different results for people who are part of the original ADMIXTURE runs that set up the allele frequencies used by the calculators, than those who aren't, even though both sets of users are of exactly the same origin, and should expect basically identical results.
This effect, however, was described by me many months earlier, in November 2011, following up on observations made during my first analysis of the Yunusbayev et al. Armenians in September 2011. It has been listed in the Technical Stuff at the bottom of this blog ever since.

I had observed at the time that the newly available Yunusbayev et al. Armenian sample appeared more "European" using the Dodecad v3 calculator tool, which had been built using the Project Armenians (Armenian_D) as well as the Armenian sample of Behar et al.

I then explained why this was happening, and released new versions of the Dodecad tools, such as K12a, and K12b, and more recently K10a as new scientific and project participant samples became available.

Polako also proposes a "solution" to the problem:
I actually designed my Eurogenes ancestry tests for Gedmatch with this problem in mind, by only using academic references to source the allele frequencies. This means that test results for Eurogenes project members and non-members are directly comparable. Perhaps other genome bloggers can eventually do the same?
The only effect of this "solution" is to ensure that there is a "calculator effect" for everyone using his tools. For example, if he uses only published Finns and Lithuanians to build his calculator, then every Finn and Lithuanian who takes his test will wonder why he is "different" from the published Finns and Lithuanians, because they will all suffer a "calculator effect" with respect to the reference populations. So, perhaps they will all be on equal footing with respect to each other, but their results will all be biased because of the issue I had identified.

Moreover, their results will never improve as more people join his Project, because these new people will not be included in newer versions of calculators: all users of DIY Eurogenes tools will continue to receive sub-par results. Well, small consolation, at least they'll all receive comparable sub-par results.

The solution to this problem was also described in my original post, and it's not an unimaginative quick fix of biasing everyone's results with respect to the reference populations:
What can we do to solve this problem? Sample, sample, sample. There is no shortcut. The gross details of the genetic landscape (such as the relationship between major continental groups) are easy to infer, but the details will always have room for improvement.
It is only by adequate sampling, that is by including more and more people, rather than excluding even the ones we have, that ever more accurate admixture estimators can be devised. As sample sizes grow (= more scientists publish their data, and more people join projects such as this one), allele frequencies of the different components will become ever more secure, and deviations of individuals who did not contribute to the inference of the genetic components will converge to zero.
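
To make the convergence argument concrete, here is a toy sketch at a single SNP. It is not how ADMIXTURE infers component frequencies: the reference frequency is assumed to be a plain average over a panel of diploid genotypes coded 0, 1 or 2, and the function names `reference_frequency` and `inclusion_shift` are hypothetical. It only shows how much one person's own genotype pulls the reference towards him when he is included, and how that pull shrinks as the panel grows.

```python
def reference_frequency(genotypes):
    """Allele frequency estimated from diploid genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))


def inclusion_shift(reference_genotypes, own_genotype):
    """Change in the reference frequency caused by adding one individual.

    Assumption: the reference is a simple sample average, which is a
    simplification of real admixture inference.
    """
    excluded = reference_frequency(reference_genotypes)
    included = reference_frequency(reference_genotypes + [own_genotype])
    return included - excluded


if __name__ == "__main__":
    # Hypothetical panel of heterozygotes (frequency 0.5) plus one
    # homozygous individual who may or may not be included.
    for n in (9, 99, 999):
        panel = [1] * n
        shift = inclusion_shift(panel, 2)
        print(f"{n + 1:>5} samples: shift = {shift:.4f}")
```

In this toy setting the shift equals 1/(2n + 2) for a panel of n other people, so it falls from 0.05 with 10 samples to 0.0005 with 1,000: the difference between included and excluded individuals goes to zero as the panel grows.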

I am already quite confident that inclusion biases amount to only a few percent for Dodecad Project tools and only for the closely related components (e.g., West Asian vs. North European); as mentioned in my original post, these biases are trivial for more distantly related components (e.g., European vs. East Asian).

And, the way to further reduce biases that do persist is to foster participation, rather than consign everyone to a sort of fossilized mediocrity, excluding whole populations of active direct-to-consumer customers (e.g., Norwegians, or Assyrians, or Iraqis, or Germans, or Koreans, or, ...) on the basis that no "academic reference" has made dense genotype data on them freely and publicly accessible.


  1. I try to ignore Polako as much as possible. It's apparent to everyone with a modicum of critical talent that Eurogenes exists purely to assuage his own ethnic insecurities and at the same time find 'undesirable' admixture in populations he dislikes (basically Southern and Western Europeans). He's not a scientist. He wears his agenda on his sleeve. Polako's thought process can be summarised thus: If it finds less non-European admixture in Poles, it's correct. If it finds more admixture in Poles, it requires revision.

  2. I try to ignore Polako as much as possible.

    That is probably best, but I wanted to prevent other genome bloggers from making the same mistake of thinking that by removing project samples they're actually creating better calculators.

  3. You're missing the point, as usual.

    There are lots of people out there confused RIGHT NOW by the results your tools are producing. They have a right to be confused, because the results are useless.

    It doesn't help them that maybe in 5 years you'll have enough samples to eliminate the Calculator Effect from many of your tests.

    Have some consideration for other people, and stop thinking about yourself and your blog for once.

    Explain clearly why people who aren't part of your project can't rely on your population portraits and oracle results.

  4. You're missing the point, as usual.

    There are lots of people out there confused RIGHT NOW by the results your tools are producing. They have a right to be confused, because the results are useless.

    It is you who is missing the point.

    If there is a "calculator effect", then it affects ALL people who use your tools, because NONE of them have been included in the admixture analysis that produced them. By your own admission, you use only "academic references".

    Your "fix" to the problem is to make everyone suffer a bias, including your own project members. The fact that all people (project members or not) who use your tools are on equal footing is no consolation, because they are ALL getting bad results.

    In the case of the Dodecad Project, the results obtained by project members are as good as can be hoped for, while the results obtained by non-members using DIY tools may suffer a small bias, which will continue to decrease over time. Moreover, whatever bias exists for the Dodecad Project is reduced compared to an "academic only" approach, because the components are inferred with a larger and more representative collection of samples.

    There is no magic bullet to obtaining higher accuracy.

    Dodecad = Great results for some + slightly biased results for non-participants
    Eurogenes = Biased results for everyone, participant and non-participant alike.

  5. Dienekes wrote:
    "I am already quite confident that inclusion biases amount to only a few percent for Dodecad Project tools and only for the closely related components (e.g., West Asian vs. North European); as mentioned in my original post, these biases are trivial for more distantly related components (e.g., European vs. East Asian)."

    I am not sure what exactly you mean by "the closely related components". If I compare the results that 3 Lithuanians (my grandparents) got from the K12a calculator to the percentages they got when they were included in the K12b run, it is true that the percentages for South_Asian, Gedrosia, Southwest_Asian, Siberian differ only by a few % between those two runs. However, for Atlantic_Med & North_European the difference is MUCH more than only a few percent:

    K12b North_European *minus* K12a North_European
    DOD899 9.3%
    DOD900 8.5%
    DOD901 8.5%
    Lithuanian_D 0%

  6. The bias in your tests for members vs. non-members is huge for intra-European clusters. This renders the tests and Oracle results useless for non-members.

    The essential thing is that the results for the closely related components are RELATIVELY correct for everyone. And not only do my tests produce correct inter-continental results, but they also show correct relative results for the intra-European clusters.

    So there's no way the larger number of samples you use makes your biased tests more effective than my unbiased tests. Maybe in 5-10 years you'll get there though? Good luck with that.

    Anyway, I'm wasting my time here. You're just happy to spew misinformation to save face, so whatever.

  7. @linkus,

    A geometric analogy is a good way to make sense of what is going on.

    Imagine having a long measuring tape, say 100m long that is held by two people, Alice and Bob. There is a third person, Eve, who is somewhere between Alice and Bob and tries to read the measurement, to see how far she is from Alice, and how far she is from Bob. That is the problem of "admixture estimation".

    But Alice and Bob are not "keeping steady"; they are moving around a little bit around their positions, keeping the tape tight. So, the measurement is not precise. If Eve is close to Alice, then Bob's antics will not affect her measurement by much: if Bob moves the tape by a whole meter, it will only move by a few centimeters at Eve's location. On the other hand, if Alice moves by a meter, then Eve's measurement will be affected a lot.

    Another way to think about it is the following: if I'm trying to work out how far someone in Vilnius is from someone in Athens, then knowing the precise location of the person in Vilnius does not matter a lot: a few meters or kilometers here and there will result in a small error relative to the great distance between the two capitals. But if both people are in Vilnius, then knowing their locations inaccurately introduces major errors, because, say, a 1km displacement over a 10km distance is a 10% error, but over a 100km distance it is only 1%.

    Getting back to your question, the North_European component is modal in Lithuania. So, an estimate of Sub-Saharan admixture will be very accurate no matter whether one takes a sample from Nigeria or Cameroon: genetic distances from Lithuania to Sub-Saharan Africa are so great, that small errors introduced by choosing this or that set of Sub-Saharan individuals pale in comparison.

    But, choosing this or that set of Lithuanians introduces a bigger error, because the corresponding differences are so much smaller, while differences between, say Lithuania, and Madrid are intermediate.

    To summarize:

    - Most people, who are not at the edges of variation (e.g., Sicilians or Hungarians) receive fairly good results irrespective of whether they've been included or not in the making of the calculator.
    - People who happen to be at the edge of variation (e.g., Lithuanians, Finns, etc. for the North_European component) get a maximum error for that component, and small to non-existent errors for minor components (If you list your other component values, you'll surely observe that K12a/K12b differences become small to non-existent for the most distant components, and maximum for the North_European component).

  8. @Davidski,

    If you want to believe that excluding ~700 Project participants will result in a better calculator, you are free to do so.

    It is a strange world indeed that you live in, where you can get better inference with fewer data.

    I have already explained to you clearly that the "calculator effect" applies to ALL your participants. Any guy from Kent, whether he is your project participant or not, will get results that are dissimilar relative to the academic 1000 Genomes Kent sample. They will ALL have to wonder why they're different (actually, they're shielded from knowing that because you don't even bother to post population averages or any methodological details) from the 1000 Genomes Kent.

    In my case, the project participants won't be systematically different at all, while project non-participants may have a small bias in their estimates, but which will be much smaller than the equivalent bias in your project, because of the much larger number of utilized samples that help define components much more accurately.

    It's not rocket science. Hopefully, you'll get it before too long.

  9. Dieneke, the title of this thread is misleading. You are not denying the existence of the calculator effect. You are only thinking that it will diminish with more sampling. Thus your sole, but by no means unimportant, difference from David in this regard is in how you deal with the calculator effect.

  10. The "so-called" part refers to the alleged discovery of a new effect. As my post documents, this effect was already known to myself, as was its solution, which is not to exclude data from the creation of new admixture estimation tools.

  11. Can increasing sample sizes also overcome other issues, such as ascertainment bias?

  12. No, because adding more individuals does not change the set of SNPs that the various groups/companies have genotyped over them.
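
For readers who prefer numbers to words, the measuring-tape and Vilnius/Athens analogies from the comments above can be sketched in a few lines. This is only an illustration under stated assumptions: the tape is treated as a straight taut line, so a small sideways move at one end is interpolated linearly along its length, and Eve is placed 5m from Alice on the 100m tape, a value picked purely for the example. The function names are hypothetical.

```python
def tape_shift_at(eve_from_alice_m, tape_length_m, alice_move_m=0.0, bob_move_m=0.0):
    """Sideways shift of a taut tape at Eve's position.

    Assumption: the tape stays straight, so the shift at any point is a
    linear interpolation between the sideways moves of its two ends.
    """
    fraction = eve_from_alice_m / tape_length_m  # 0 at Alice, 1 at Bob
    return (1 - fraction) * alice_move_m + fraction * bob_move_m


def relative_error(displacement_km, distance_km):
    """Location error expressed as a fraction of the distance being measured."""
    return displacement_km / distance_km


if __name__ == "__main__":
    # Eve stands 5m from Alice on a 100m tape (assumed position).
    print(tape_shift_at(5, 100, bob_move_m=1.0))    # far end moves by 1m
    print(tape_shift_at(5, 100, alice_move_m=1.0))  # near end moves by 1m
    # The same 1km uncertainty over a short and a long distance.
    print(relative_error(1, 10))
    print(relative_error(1, 100))
```

With these assumed values, a 1m move by Bob shifts the tape by about 5cm where Eve stands, while the same move by Alice shifts it by about 95cm. Likewise, a 1km error is 10% of a 10km distance but 1% of a 100km distance, which is why the nearest reference matters most for the component in which one is modal.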