Sunday, August 12, 2012

fastIBD analysis of Africans and African Americans

Individuals from the following populations have been included in this analysis:
African_American_D Somali_D Moroccan_D Algerian_D North_African_Jews_D Tunisian_D East_African_Various_D Yoruba_D Sudan_D Egyptian_D Chad_D
These were analyzed in the context of a large set of African populations. CEU European Americans were also added to account for the European admixture present in some African American individuals.
This is the first time I have included African American Dodecad participants in this type of analysis.

A few quick points:
  • fastIBD was run with default parameters over a dataset of 679 individuals/255020 SNPs
  • fastIBD identifies segments of relatively recent origin that are shared by individuals. These results should not be construed as measures of overall genetic similarity or origins. Rather, they suggest which populations have exchanged genes in the relatively recent past.
With that said, you can get:
  • Spreadsheet of numeric results, showing median sharing (in centi-Morgans, cM)
  • Population-level graphical results, showing an ordering of other populations based on median IBD sharing.

IBD sharing was assessed only for populations with 5+ individuals.
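
For readers who want to reproduce this kind of summary from their own fastIBD output, here is a minimal sketch of the aggregation step. It is an assumption about one reasonable way to do it, not the exact procedure used for the spreadsheet: the function name, the input layout (a total shared length in cM per pair of individuals, with unlisted pairs counted as 0), and the `min_size` parameter are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product
from statistics import median


def median_ibd_sharing(pair_cm, population_of, min_size=5):
    """Median pairwise IBD sharing (cM) for every pair of populations.

    pair_cm       -- dict: frozenset({id_a, id_b}) -> total shared cM
                     (assumed layout; pairs not listed count as 0 cM)
    population_of -- dict: individual id -> population label
    min_size      -- populations with fewer individuals are skipped
    """
    members = defaultdict(list)
    for individual, population in population_of.items():
        members[population].append(individual)

    # Keep only populations that meet the minimum sample size.
    kept = sorted(p for p, inds in members.items() if len(inds) >= min_size)

    result = {}
    for i, pop_a in enumerate(kept):
        for pop_b in kept[i:]:
            if pop_a == pop_b:
                # Within-population sharing: all unordered pairs.
                pairs = combinations(members[pop_a], 2)
            else:
                # Between-population sharing: one individual from each.
                pairs = product(members[pop_a], members[pop_b])
            totals = [pair_cm.get(frozenset(p), 0.0) for p in pairs]
            if totals:  # a single-member population has no internal pairs
                result[(pop_a, pop_b)] = median(totals)
    return result
```

Counting unlisted pairs as zero matters: if only pairs with a detected segment were included, the median would be inflated for populations that rarely share anything at all.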

The following heat map allows for a quick appraisal of populations with an excess of IBD sharing (read row-by-row). The grouping of populations by language group and/or region is clearly manifested. There are some interesting details that jump off the screen (but do consult the spreadsheet for details). For example, notice that:
  • Within the Bantu group (Bantu_NE, LWK/Luhya, and Bantu_S), only the South Bantu have an excess of IBD sharing with San.
  • Of the North Africans, Egyptians show an excess of IBD sharing with Tigray.
  • Of the Ethiopians/East Africans, it is the Omotic-speaking Wolayta who seem to share IBD especially with the Ari people, who are also Ethiopian Omotic speakers.




Some visualizations (see graphical results above for full set):

Mozabites show a high degree of within-population IBD sharing, and secondarily share with other NW African groups.

The Dodecad Project Somali sample shows a high degree of sharing within itself and also with the Pagani et al. Somali and Ethiopian Somali samples, and then with various other East African groups.
Sources of data are listed at the bottom left of this blog.

Saturday, August 11, 2012

On the so-called "Calculator Effect"

The genome blogger Polako recently announced a calculator effect (May 2012) affecting admixture estimates:
However, many people are getting skewed results, despite doing everything right. For instance, users from the UK often come out much more continental European than they should. Some of them actually believe that this is because they're genetically more Norman or Saxon than the average Brit. Nope, the real reason is what I call the "calculator effect". This is when the algorithm produces different results for people who are part of the original ADMIXTURE runs that set up the allele frequencies used by the calculators, than those who aren't, even though both sets of users are of exactly the same origin, and should expect basically identical results.
This, however, was described by me many months prior, in November 2011, following up on observations made during my first analysis of Yunusbayev et al. Armenians in September 2011. It has been listed in the Technical Stuff at the bottom of this blog ever since.

I had observed at the time that the newly available Yunusbayev et al. Armenian sample appeared more "European" using the Dodecad v3 calculator tool, which had been built using the Project Armenians (Armenian_D) as well as the Armenian sample of Behar et al.

I then explained why this was happening, and released new versions of the Dodecad tools, such as K12a and K12b, and more recently K10a, as new scientific and project participant samples became available.

Polako also proposes a "solution" to the problem:
I actually designed my Eurogenes ancestry tests for Gedmatch with this problem in mind, by only using academic references to source the allele frequencies. This means that test results for Eurogenes project members and non-members are directly comparable. Perhaps other genome bloggers can eventually do the same?
The only effect of this "solution" is to ensure that there is a "calculator effect" for everyone using his tools. For example, if he uses only published Finns and Lithuanians to build his calculator, then every Finn and Lithuanian who takes his test will wonder why he is "different" from the published Finns and Lithuanians, because they will all suffer a "calculator effect" with respect to the reference populations. So, perhaps they will all be on equal footing with respect to each other, but their results will all be biased because of the issue I had identified.

Moreover, their results will never improve as more people join his Project, because these new people will not be included in newer versions of calculators: all users of DIY Eurogenes tools will continue to receive sub-par results. Well, small consolation, at least they'll all receive comparable sub-par results.

The solution to this problem was also described in my original post, and it's not an unimaginative quick fix of biasing everyone's results with respect to the reference populations:
What can we do to solve this problem? Sample, sample, sample. There is no shortcut. The gross details of the genetic landscape (such as the relationship between major continental groups) are easy to infer, but the details will always have room for improvement.
It is only by adequate sampling, that is by including more and more people, rather than excluding even the ones we have, that ever more accurate admixture estimators can be devised. As sample sizes grow (= more scientists publish their data, and more people join projects such as this one), allele frequencies of the different components will become ever more secure, and deviations of individuals who did not contribute to the inference of the genetic components will converge to zero.
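
To see why more sampling shrinks the effect, consider a deliberately simplified toy model. This is an illustration only and is not how ADMIXTURE infers component frequencies: it treats a single SNP and a single reference panel, and asks how far the panel's allele frequency moves when one extra individual is added to it. The function and parameter names are hypothetical.

```python
def inclusion_shift(genotype, freq_without, n_reference):
    """Change in a panel's estimated allele frequency at one SNP when one
    diploid individual is added to it.

    genotype      -- copies of the allele carried by the individual (0, 1 or 2)
    freq_without  -- allele frequency among the other panel members
    n_reference   -- number of other (diploid) panel members
    """
    # Allele copies in the panel once the individual is included.
    total_copies = 2 * n_reference * freq_without + genotype
    # Frequency over all 2 * (n_reference + 1) chromosomes.
    freq_with = total_copies / (2 * (n_reference + 1))
    return freq_with - freq_without


# The pull toward the included individual's own genotype shrinks
# roughly in proportion to 1 / (panel size).
for n in (10, 100, 1000):
    print(n, inclusion_shift(genotype=2, freq_without=0.5, n_reference=n))
```

Algebraically the shift equals (genotype/2 - freq_without) / (n_reference + 1): an individual who helped define the frequencies fits them slightly better than an equally typical outsider, and that advantage vanishes as the panel grows.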

I am already quite confident that inclusion biases amount to only a few percent for Dodecad Project tools and only for the closely related components (e.g., West Asian vs. North European); as mentioned in my original post, these biases are trivial for more distantly related components (e.g., European vs. East Asian).

And, the way to further reduce biases that do persist is to foster participation, rather than consign everyone to a sort of fossilized mediocrity, excluding whole populations of active direct-to-consumer customers (e.g., Norwegians, or Assyrians, or Iraqis, or Germans, or Koreans, or, ...) on the basis that no "academic reference" has made dense genotype data on them freely and publicly accessible.

Friday, August 10, 2012

fastIBD analysis of East/Central Eurasians and select West Eurasians


Individuals from the following populations have been included in this analysis:
Philippines_D Turkish_D Iranian_D Russian_D Finnish_D Turkish_Cypriot_D Ukrainian_D Belorussian_D Chinese_D Korean_D Japanese_D Tatar_Various_D Kazakh_D Szekler_D Hungarian_D Estonian_D Azeri_D Udmurt_D Mixed_Turkic_D 
These were analyzed in the context of a complete set of Central/East Eurasian populations; West Eurasian populations included were mostly Uralic and Turkic speaking groups, and a few others (such as East Slavs or Iranians).

A few quick points:
  • fastIBD was run with default parameters over a dataset of 627 individuals/255020 SNPs
  • fastIBD identifies segments of relatively recent origin that are shared by individuals. These results should not be construed as measures of overall genetic similarity or origins. Rather, they suggest which populations have exchanged genes in the relatively recent past.
With that said, you can get:
  • Spreadsheet of numeric results, showing sharing (in centi-Morgans, cM)
  • Population-level graphical results, showing an ordering of other populations based on mean IBD sharing.
IBD sharing was assessed only for populations with 5+ individuals.
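
The per-population ordering in the graphical results amounts to sorting one row of the sharing table. A minimal sketch follows, assuming the table is held as a dictionary keyed by population pairs; that layout and the function name are hypothetical, while the four values in the usage example are the ones quoted further down in this post.

```python
def rank_partners(sharing, focal):
    """Order the other populations by their IBD sharing with `focal`.

    sharing -- dict: (pop_a, pop_b) -> sharing in cM, each pair stored once
               in either order (assumed layout)
    focal   -- population whose "row" is being read
    """
    scores = {}
    for (pop_a, pop_b), cm in sharing.items():
        if pop_a == focal and pop_b != focal:
            scores[pop_b] = cm
        elif pop_b == focal and pop_a != focal:
            scores[pop_a] = cm
    # Highest sharing first; within-population sharing is left out.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sharing = {
    ("Finns", "Selkup"): 16.4,
    ("Finns", "Nganassan"): 9.6,
    ("Turks", "Selkup"): 6.4,
    ("Turks", "Nganassan"): 2.8,
}
finn_row = rank_partners(sharing, "Finns")
turk_row = rank_partners(sharing, "Turks")
```

Excluding the focal population's sharing with itself keeps a strongly drifted group, whose within-population value would otherwise dominate, from hiding the ordering of its partners.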

The following heat map allows for a quick appraisal of populations with an excess of IBD sharing (read row-by-row).

And, a few visualizations of mean IBD sharing:

Notice high levels of within-population IBD sharing for Finns, consistent with a population that experienced expansion from a small number of founders (small ancestral population size).
Compare with Turks, who are a much more diverse population.
These two plots (you can check the spreadsheet for exact numbers) indicate different sources for the East Eurasian element in Turks and Finns. 

The top eastern populations for Turks are: Turkmen, Chuvash, Uzbek, Uygur, all of which are Turkic speakers, followed by Hazara, Yukagir, and Selkup. For Finns, there is a high degree of sharing with various Siberian groups of different languages, including Uralic Selkups (16.4 cM) and Nganassan (9.6 cM). Turks share less with these Uralic speakers (6.4 and 2.8 cM respectively). So, these are strong hints of shared ancestry within the Turkic and Uralic language families.

The Chuvash population is also quite interesting, as it shares more with Selkup and Nganassan, contrasting with other Turkic speakers. This makes excellent sense, and is in agreement with other recent findings:
Results from this study maintain that the Chuvash are not related to Altaic or Mongolian populations along their maternal line, thus supporting the “Elite” hypothesis that their language was imposed by a conquering group —leaving Chuvash mtDNA largely of Eurasian origin. Their maternal markers appear to most closely resemble Finno-Ugric speakers rather than Turkic speakers.
Sources of data are listed at the bottom left of this blog.