Comments on Dodecad Ancestry Project: On the so-called "Calculator Effect"

No, because adding more individuals does not chang...

2012-08-14T13:42:14.261+03:00

No, because adding more individuals does not change the set of SNPs that the various groups/companies have genotyped over them.

Can increasing sample sizes also overcome other is...

2012-08-14T12:54:47.524+03:00

Can increasing sample sizes also overcome other issues, such as ascertainment bias?

The "so-called" part refers to the alleg...

2012-08-12T22:43:14.485+03:00

The "so-called" part refers to the alleged discovery of a new effect. As my post documents, this effect was already known to me, as was its solution, which is not to exclude data from the creation of new admixture estimation tools.

Dieneke, the title of this thread is misleading. Y...

2012-08-12T15:11:10.107+03:00

Dieneke, the title of this thread is misleading. You are not denying the existence of the calculator effect. You only think that it will diminish with more sampling. Thus your sole, but by no means unimportant, difference from David in this regard is in how you deal with the calculator effect.

@Davidski, If you want to believe that excluding ...

2012-08-11T17:51:56.387+03:00

@Davidski,

If you want to believe that excluding ~700 Project participants will result in a better calculator, you are free to do so.

It is a strange world indeed that you live in, where you can get better inference with fewer data.

I have already explained to you clearly that the "calculator effect" applies to ALL your participants. Any guy from Kent, whether he is your project participant or not, will get results that differ from those of the academic 1000 Genomes Kent sample. They will ALL have to wonder why they're different from the 1000 Genomes Kent sample (actually, they're shielded from knowing that, because you don't even bother to post population averages or any methodological details).

In my case, the project participants won't be systematically different at all, while project non-participants may have a small bias in their estimates, but one that will be much smaller than the equivalent bias in your project, because of the much larger number of utilized samples that help define components much more accurately.

It's not rocket science. Hopefully, you'll get it before too long.

@linkus, A geometric analogy is a good way to mak...

2012-08-11T17:43:17.791+03:00

@linkus,

A geometric analogy is a good way to make sense of what is going on.

Imagine having a long measuring tape, say 100m long that is held by two people, Alice and Bob. There is a third person, Eve, who is somewhere between Alice and Bob and tries to read the measurement, to see how far she is from Alice, and how far she is from Bob. That is the problem of "admixture estimation".

But Alice and Bob are not "keeping steady"; they are moving around a little bit at their positions, keeping the tape tight. So the measurement is not precise. If Eve is close to Alice, then Bob's antics will not affect her measurement by much: if Bob moves the tape by a whole meter, it will only move by a few centimeters at Eve's location. On the other hand, if Alice moves by a meter, then Eve's measurement will be affected a lot.
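The tape picture can be put into numbers with a rough sketch. It assumes the tape stays straight and taut and that the sideways movements are small compared with its length, so the shift at Eve's position is a linear interpolation between the shifts at the two ends. The function name `tape_shift_at` and Eve's position 5m from Alice are made up for illustration only.

```python
def tape_shift_at(eve_from_alice_m, alice_shift_m, bob_shift_m, tape_length_m=100.0):
    """Sideways shift of a straight, taut tape at Eve's position.

    Assumption: small sideways movements, so the shift is a linear
    interpolation between the shifts of the two ends.
    """
    t = eve_from_alice_m / tape_length_m  # 0 at Alice, 1 at Bob
    return (1.0 - t) * alice_shift_m + t * bob_shift_m


eve_position_m = 5.0  # hypothetical: Eve stands 5 m from Alice

# Only Bob moves by 1 m: the far end barely affects Eve.
shift_if_bob_moves = tape_shift_at(eve_position_m, alice_shift_m=0.0, bob_shift_m=1.0)

# Only Alice moves by 1 m: the near end affects Eve almost fully.
shift_if_alice_moves = tape_shift_at(eve_position_m, alice_shift_m=1.0, bob_shift_m=0.0)

print(f"Bob moves 1 m   -> tape moves {shift_if_bob_moves * 100:.0f} cm at Eve")
print(f"Alice moves 1 m -> tape moves {shift_if_alice_moves * 100:.0f} cm at Eve")
```

Under these assumptions, the closer Eve stands to one end of the tape, the more her reading depends on how steadily that particular end is held.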

Another way to think about it is the following: if I'm trying to work out how far someone in Vilnius is from someone in Athens, then knowing the precise location of the person in Vilnius does not matter a lot: a few meters or kilometers here and there will result in a small error relative to the great distance between the two capitals. But if both people are in Vilnius, then knowing their locations inaccurately introduces major errors, because, say, a 1km displacement is a 10% error for a 10km distance, but only 1% for a 100km distance.
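The arithmetic in this comparison is simply displacement divided by distance. A minimal sketch, in which the helper name `relative_error` is an illustrative choice and not part of any tool discussed here:

```python
def relative_error(displacement_km, distance_km):
    """Fraction of the true distance that a given displacement represents."""
    return displacement_km / distance_km


# The same 1 km displacement, measured against a short and a long distance.
error_short = relative_error(1.0, 10.0)
error_long = relative_error(1.0, 100.0)

print(f"1 km error over 10 km:  {error_short:.0%}")
print(f"1 km error over 100 km: {error_long:.0%}")
```

The same fixed uncertainty in a reference point matters ten times more when the distance being measured is ten times shorter.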

Getting back to your question, the North_European component is modal in Lithuania. So, an estimate of Sub-Saharan admixture will be very accurate no matter whether one takes a sample from Nigeria or Cameroon: genetic distances from Lithuania to Sub-Saharan Africa are so great, that small errors introduced by choosing this or that set of Sub-Saharan individuals pale in comparison.

But choosing this or that set of Lithuanians introduces a bigger error, because the corresponding differences are so much smaller, while differences between, say, Lithuania and Madrid are intermediate.

To summarize:

- Most people who are not at the edges of variation (e.g., Sicilians or Hungarians) receive fairly good results irrespective of whether or not they've been included in the making of the calculator.
- People who happen to be at the edge of variation (e.g., Lithuanians, Finns, etc. for the North_European component) get a maximum error for that component, and small to non-existent errors for minor components (If you list your other component values, you'll surely observe that K12a/K12b differences become small to non-existent for the most distant components, and maximum for the North_European component).

The bias in your tests for members vs. non-members...

2012-08-11T16:45:11.907+03:00

The bias in your tests for members vs. non-members is huge for intra-European clusters. This renders the tests and Oracle results useless for non-members.

The essential thing is that the results for the closely related components are RELATIVELY correct for everyone. And not only do my tests produce correct inter-continental results, but they also show correct relative results for the intra-European clusters.

So there's no way the larger number of samples you use makes your biased tests more effective than my unbiased tests. Maybe in 5-10 years you'll get there though? Good luck with that.

Anyway, I'm wasting my time here. You're just happy to spew misinformation to save face, so whatever.

Dienekes wrote: "I am already quite confident...

2012-08-11T16:31:56.729+03:00

Dienekes wrote:
"I am already quite confident that inclusion biases amount to only a few percent for Dodecad Project tools and only for the closely related components (e.g., West Asian vs. North European); as mentioned in my original post, these biases are trivial for more distantly related components (e.g., European vs. East Asian)."

I am not sure what exactly you mean by "the closely related components". If I compare the results of 3 Lithuanians (my grandparents) that were inferred from the K12a calculator to the percentages they got when they were included in the K12b run, it is true that the percentages for South_Asian, Gedrosia, Southwest_Asian, Siberian differ only by a few % between those two runs. However, for Atlantic_Med & North_European the difference is MUCH more than only a few percent:

K12b North_European *minus* K12a North_European
DOD899 9.3%
DOD900 8.5%
DOD901 8.5%
Lithuanian_D 0%

You're missing the point, as usual. There are...

2012-08-11T16:08:20.510+03:00

"You're missing the point, as usual.

There are lots of people out there confused RIGHT NOW by the results your tools are producing. They have a right to be confused, because the results are useless."

It is you who is missing the point.

If there is a "calculator effect", then it affects ALL people who use your tools, because NONE of them have been included in the admixture analysis that produced them. By your own admission, you use only "academic references".

Your "fix" to the problem is to make everyone suffer a bias, including your own project members. The fact that all people (project members or not) who use your tools are on equal footing is no consolation, because they are ALL getting bad results.

In the case of the Dodecad Project, the results obtained by project members are as good as can be hoped for, while the results obtained by non-members using DIY tools may suffer a small bias, which will continue to decrease over time. Moreover, whatever bias exists for the Dodecad Project is reduced compared to an "academic only" approach, because the components are inferred with a larger and more representative collection of samples.

There is no magic bullet to obtaining higher accuracy.

Dodecad = Great results for some + slightly biased results for non-participants
Eurogenes = Biased results for everyone, participant and non-participant alike.

You're missing the point, as usual. There are...

2012-08-11T15:13:03.779+03:00

You're missing the point, as usual.

There are lots of people out there confused RIGHT NOW by the results your tools are producing. They have a right to be confused, because the results are useless.

It doesn't help them that maybe in 5 years you'll have enough samples to eliminate the Calculator Effect from many of your tests.

Have some consideration for other people, and stop thinking about yourself and your blog for once.

Explain clearly why people who aren't part of your project can't rely on your population portraits and oracle results.

I try to ignore Polako as much as possible. That ...

2012-08-11T14:46:49.691+03:00

"I try to ignore Polako as much as possible."

That is probably best, but I wanted to prevent other genome bloggers from making the same mistake of thinking that by removing project samples they're actually creating better calculators.

I try to ignore Polako as much as possible. It'...

2012-08-11T14:30:20.858+03:00

I try to ignore Polako as much as possible. It's apparent to everyone with a modicum of critical talent that Eurogenes exists purely to assuage his own ethnic insecurities and at the same time find 'undesirable' admixture in populations he dislikes (basically Southern and Western Europeans). He's not a scientist. He wears his agenda on his sleeve. Polako's thought process can be summarised thus: If it finds less non-European admixture in Poles, it's correct. If it finds more admixture in Poles, it requires revision.