Sunday, December 2, 2012

D-statistics on ADMIXTURE components

I have implemented the method of D-statistics as an R function. This will allow you to take your raw genotype data and calculate various D-statistics of the form:

D(Pop1, YOU; Pop3, Outgroup)

Please read the original post for details on how to use this tool.
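
As a rough illustration of what a statistic of this form measures, here is a minimal Python sketch. It is not the R function mentioned above. It assumes the frequency-based form of the D-statistic given by Patterson et al. (2012), and the function name and input layout are hypothetical. For a single individual ("YOU"), the second frequency would be the genotype count divided by two (0, 0.5, or 1).

```python
def d_statistic(freqs):
    """Compute D(Pop1, Pop2; Pop3, Pop4) from per-SNP allele frequencies.

    freqs is an iterable of (p1, p2, p3, p4) tuples, one per SNP, all
    referring to the same allele. Returns None if the denominator is zero.
    Assumed form: sum of (p1 - p2)(p3 - p4) over the sum of
    (p1 + p2 - 2 p1 p2)(p3 + p4 - 2 p3 p4).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2, p3, p4 in freqs:
        numerator += (p1 - p2) * (p3 - p4)
        denominator += (p1 + p2 - 2 * p1 * p2) * (p3 + p4 - 2 * p3 * p4)
    if denominator == 0:
        return None
    return numerator / denominator


# Toy data (made up): three SNPs, with "YOU" entered as genotype / 2.
toy = [(0.8, 1.0, 0.7, 0.1), (0.2, 0.0, 0.4, 0.3), (0.5, 0.5, 0.9, 0.2)]
print(d_statistic(toy))
```

A value near zero suggests that Pop1 and YOU are symmetrically related to Pop3; assessing significance would additionally require a standard error (for example from a block jackknife), which this sketch omits.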

Friday, November 30, 2012

Geno 2.0 patch for DIYDodecad

(See important update at the end of this post)

People who have tested using the Genographic Project's Geno 2.0 test can now use the DIYDodecad tool with their data. The raw data download from this test has a slightly different format than the ones from 23andMe and Family Finder, so it is necessary to convert your data into a format that DIYDodecad can interpret.

So, after you have downloaded and extracted the DIYDodecad software as per its instructions, you should also download a couple of extra files into your working directory; these files are included in this patch:

  • standardize.r which replaces the standardize.r in the DIYDodecad software bundle, and allows you to convert your Geno 2.0 formatted data
  • hgdp.base.txt which includes additional information about SNP markers that is not found in your Geno 2.0 raw data download, and which is necessary to complete the conversion process.
Once these two files have been extracted into your working directory, the process of using DIYDodecad is exactly the same as for any other user of the software.

The only difference is that at the step where you convert your data using the standardize command (see DIYDodecad README file), you will use the command:


standardize('johndoe.csv', company='geno2')

where johndoe.csv is your unzipped raw data download. This will write a genotype.txt file in the working directory, and you can proceed the rest of the way as per the instructions.

You can use all ancestry calculators released by the Project (or indeed other projects); the most recent one is globe13.

You should be aware that, because the Geno 2.0 test includes a smaller number of SNPs, and because globe13 and other calculators were developed using the common SNP set of 23andMe and Family Finder, the analysis using globe13 will only include ~34 thousand SNPs and will be "noisier" than usual. In the future, I might develop new calculators that make use of the SNP set of the Geno 2.0 test itself.

PS: Feel free to post a comment below if you experienced any difficulty converting your data; also thanks to CeCe Moore for graciously sharing a raw data file with me, which allowed me to build this converter.

UPDATE:

Apparently, the data format has been changed for some Geno 2.0 data downloads.
If your data includes a [Header] ... [Data] preamble followed by a list of 5 comma-separated values, this update does not apply to you: follow the instructions above with company='geno2'.
If it includes a header "SNP,Chr,Allele1,Allele2" followed by a list of 4 comma-separated values, you should follow the instructions as above, but use company='geno2new' instead.
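
If you are unsure which of the two layouts your download uses, a small check like the following Python sketch can tell them apart. It is only an illustration: the function name is hypothetical, it is not part of DIYDodecad, and it assumes that files with the [Header] ... [Data] preamble are the original format handled by company='geno2'.

```python
def guess_geno2_company(lines):
    """Guess the 'company' argument from the first non-empty line of a file.

    Returns 'geno2new' for the "SNP,Chr,Allele1,Allele2" header, 'geno2'
    for a file starting with a [Header] preamble, and None otherwise.
    """
    for line in lines:
        # Drop surrounding whitespace and a possible byte-order mark.
        text = line.strip().lstrip("\ufeff")
        if not text:
            continue
        if text == "SNP,Chr,Allele1,Allele2":
            return "geno2new"
        if text.startswith("[Header]"):
            return "geno2"
        return None  # first non-empty line matches neither layout
    return None


# Example with in-memory lines; for a real file you could pass an open
# file object instead, e.g. open('johndoe.csv').
print(guess_geno2_company(["SNP,Chr,Allele1,Allele2\n", "rs123,1,A,G\n"]))
```

The returned string is what you would then pass as the company argument of the standardize command in R.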

Wednesday, October 31, 2012

'globe13' participant results

Project participant results for the globe13 calculator can be found in the spreadsheet. Population median results and Fst divergences are also included.

Below, you can see the first two dimensions of an MDS plot of the 13 components:

A neighbor-joining tree of the 13 components based on the Fst divergences:
I have also created a TreeMix plot using Palaeo_African as an outgroup, and allowing as many as 5 migration edges:
The actual tree is:


((West_African:0.00448794,(East_African:0.00506576,(((((East_Asian:0.0173284,Siberian:0.00732773):0.0027852,(Amerindian:0.026174,Arctic:0.0118342):0.00742092):0.0114738,Australasian:0.0488974):0.00266559,South_Asian:0.00734044):0.008089,(Southwest_Asian:0.00541405,((West_Asian:0.00620657,North_European:0.00657599):0.00311587,Mediterranean:0.00798949):0.00650328):0.0118925):0.0299627):0.00597674):0.00671186,Palaeo_African:0.0215931);
0.0640319 NA NA NA Palaeo_African:0.0215931 Australasian:0.0488974
0.270468 NA NA NA Australasian:0.0488974 East_Asian:0.0173284
0.185213 NA NA NA South_Asian:0.00734044 ((West_Asian:0.00620657,North_European:0.00657599):0.00311587,Mediterranean:0.00798949):0.00650328
0.129883 NA NA NA North_European:0.00657599 Amerindian:0.026174
0.138757 NA NA NA Arctic:0.0118342 (West_Asian:0.00620657,North_European:0.00657599):0.00311587
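
The lines after the tree can be read programmatically. The Python sketch below assumes the usual TreeMix output layout, in which the first column of each migration line is the edge weight and the last two columns are the subtrees below the origin and the destination of the edge; the function name is hypothetical.

```python
import re

# A leaf label is a name immediately followed by ':' and a branch length.
LEAF = re.compile(r"([A-Za-z_][A-Za-z0-9_]*):")


def parse_migration_edge(line):
    """Split one migration line into its weight and the leaves on each side."""
    fields = line.split()
    return {
        "weight": float(fields[0]),
        "source": LEAF.findall(fields[-2]),       # subtree below the origin
        "destination": LEAF.findall(fields[-1]),  # subtree below the target
    }


edge = parse_migration_edge(
    "0.270468 NA NA NA Australasian:0.0488974 East_Asian:0.0173284"
)
print(edge["weight"], edge["source"], edge["destination"])
```

Under that assumption, each line describes one of the 5 migration edges as a weight plus the sets of components on either end of the edge.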

Monday, October 29, 2012

'globe13' calculator

The globe13 calculator is based on the K=13 analysis. It includes the following components:


  • Siberian
  • Amerindian
  • West_African
  • Palaeo_African
  • Southwest_Asian
  • East_Asian
  • Mediterranean
  • Australasian
  • Arctic
  • West_Asian
  • North_European
  • South_Asian
  • East_African

Fst divergences between ancestral components can be found here.

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'globe13' instead of 'dv3' in these instructions. You can consult the spreadsheet for proportions of the 13 components in different world populations.

Terms of use: 'globe13', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Tuesday, October 23, 2012

'globe10' calculator

As part of the on-going analysis of the world dataset, I am releasing the 'globe10' calculator, which is based on the K=10 analysis. This calculator includes the following ancestral components:
  • Amerindian
  • West_Asian
  • Australasian
  • Palaeo_African
  • Neo_African
  • Siberian
  • Southern
  • East_Asian
  • Atlantic_Baltic
  • South_Asian
The names may be the same as the ones from previous calculators released by the Project, but you should always consult the spreadsheet to see how they might differ. In this case, the inclusion of Amerindian and Australasian populations and of African hunter-gatherers, the handling of the Paniya issue, and the inclusion of data from Schlebusch et al. (2012) and Pagani et al. (2012) have all combined to change components in subtle ways, although their modalities remain largely unchanged, and hence so do the names.

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'globe10' instead of 'dv3' in these instructions. You can consult the spreadsheet for proportions of the 10 components in different world populations.

Terms of use: 'globe10', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Friday, October 19, 2012

'globe4' calculator

Patterson et al. (2012) recently published evidence for admixture in northern Europeans between a population resembling modern Sardinians (and the Neolithic Tyrolean Iceman, whose genome was published earlier this year), and, surprisingly, Native Americans. The authors attribute the Amerindian-like ancestry element to a North Eurasian population that spawned Native Americans, and which also contributed ancestry to northern Europeans. They propose two possibilities for the origin of this admixture: (i) the Mesolithic Europeans resembled Amerindians, or (ii) there was an influx of Amerindian-like populations from the east during late prehistory. A palimpsest of these two processes may explain parts of the observed signal of admixture.

In a recent K=4 admixture experiment, I demonstrated that ADMIXTURE software produces an Amerindian ancestral component that closely tracks the signal of admixture using the D-statistic test. I have decided to make this test available for download and use with DIYDodecad.

The test has four ancestral populations:
  • European
  • Asian
  • African
  • Amerindian
It is important to remember that some of these components track different aspects of ancestry that are better resolved at higher resolution. There are also populations that "don't fit well" in this 4-partite scheme (e.g., certain African or Australasian populations).

For example, the Amerindian component of this test may indicate (i) real recent Native American ancestry, (ii) East Eurasian ancestry found in Siberia and East Asia, or (iii) the common signal of admixture differentiating most European groups from Sardinians and Near Eastern Caucasoid groups. Similarly, the Asian component may indicate Australasian, South Asian, or East Eurasian ancestry. And the European component tracks the ancestry of individuals from West Eurasia in general, although it reaches its maximum in Sardinians.

This test may, however, be useful to Old World individuals who want to get an idea about the signal of admixture discovered by Patterson et al., so I decided to make it available. For individuals who don't suspect recent Amerindian or Siberian/East Asian ancestry, and who don't belong to populations with recent such ancestry, the Amerindian component will most likely represent the aforementioned signal.

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'globe4' instead of 'dv3' in these instructions. You can consult the spreadsheet for proportions of the 4 components in different world populations.

Terms of use: 'globe4', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Saturday, October 13, 2012

Geno 2.0 data request

If anyone has received results from the Geno 2.0 test of the Genographic Project and wants to share them with me, feel free to send them to dodecad@gmail.com. I will not distribute them or share them with anyone. I want to see what SNPs are tested, what format the data is in, and what its intersection with other available datasets is. This way, I can update my DIYDodecad software so that Geno 2.0 testees can use the various calculators released by the project to get an alternative ancestry assessment.

In time, and if there is interest, I may release additional calculators that make use of the particular SNP set tested by Geno 2.0.

Sunday, August 12, 2012

fastIBD analysis of Africans and African Americans

Individuals from the following populations have been included in this analysis:
African_American_D Somali_D Moroccan_D Algerian_D North_African_Jews_D Tunisian_D East_African_Various_D Yoruba_D Sudan_D Egyptian_D Chad_D
These were analyzed in the context of a large set of African populations. CEU European Americans were also added to account for the European admixture present in some African American individuals.
This is the first time I have included African American Dodecad participants in this type of analysis.

A few quick points:
  • fastIBD was run with default parameters over a dataset of 679 individuals/255020 SNPs
  • fastIBD identifies segments of relatively recent origin that are shared by individuals. These results should not be construed as measures of overall genetic similarity or origins. Rather, they suggest which populations have exchanged genes in the relatively recent past.
With that said, you can get:
  • Spreadsheet of numeric results, showing median sharing (in centi-Morgans, cM)
  • Population-level graphical results, showing an ordering of other populations based on median IBD sharing.

IBD sharing was assessed only for populations with 5+ individuals.
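
The summary described above (median sharing between populations, restricted to populations with 5+ individuals) can be sketched in Python as follows. This is only an illustration of the bookkeeping, not the scripts used for this analysis; the function name and input layout are hypothetical, and it assumes that the total shared length in cM has already been computed for every pair of individuals, with 0 for pairs that share no detected segment.

```python
from collections import Counter, defaultdict
from statistics import median


def median_sharing(population_of, pair_cm, min_size=5):
    """Median pairwise IBD sharing (cM) for each pair of populations.

    population_of maps individual id -> population label.
    pair_cm maps (id_a, id_b) -> total shared length in cM; it is assumed
    to contain every pair of individuals, including pairs sharing 0 cM.
    Populations with fewer than min_size individuals are skipped.
    """
    sizes = Counter(population_of.values())
    keep = {pop for pop, n in sizes.items() if n >= min_size}
    by_pops = defaultdict(list)
    for (a, b), cm in pair_cm.items():
        pa, pb = population_of[a], population_of[b]
        if pa in keep and pb in keep:
            # Sort the labels so (A, B) and (B, A) land in the same bucket.
            by_pops[tuple(sorted((pa, pb)))].append(cm)
    return {pops: median(values) for pops, values in by_pops.items()}


# Toy data (made up), with min_size lowered so the example stays small.
pops = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
pairs = {("a1", "a2"): 10.0, ("a1", "b1"): 2.0, ("a1", "b2"): 4.0,
         ("a2", "b1"): 6.0, ("a2", "b2"): 0.0, ("b1", "b2"): 8.0,
         ("a1", "c1"): 50.0}
print(median_sharing(pops, pairs, min_size=2))
```

Keys with the same label twice, such as ('A', 'A'), correspond to within-population sharing.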

The following heat map allows for a quick appraisal of populations with an excess of IBD sharing (read row-by-row). The grouping of populations by language group and/or region is clearly manifested. There are some interesting details that jump off the screen (but do consult the spreadsheet for details). For example, notice that:
  • Within the Bantu group (Bantu_NE, LWK/Luhya, and Bantu_S), only the South Bantu have an excess of IBD sharing with San.
  • Of the North Africans, Egyptians show an excess of IBD sharing with Tigray.
  • Of the Ethiopians/East Africans, it is the Omotic-speaking Wolayta who seem to share IBD especially with the Ari people, who are also Ethiopian Omotic speakers.




Some visualizations (see graphical results above for full set):

Mozabites showing a high degree of within-population IBD sharing, and secondarily with other NW African groups.

The Dodecad Project Somali sample shows a high degree of sharing within itself and also with the Pagani et al. Somali and Ethiopian Somali samples, and then with various other East African groups.
Sources of data are listed at the bottom left of this blog.

Saturday, August 11, 2012

On the so-called "Calculator Effect"

The genome blogger Polako recently announced a calculator effect (May 2012) affecting admixture estimates:
However, many people are getting skewed results, despite doing everything right. For instance, users from the UK often come out much more continental European than they should. Some of them actually believe that this is because they're genetically more Norman or Saxon than the average Brit. Nope, the real reason is what I call the "calculator effect". This is when the algorithm produces different results for people who are part of the original ADMIXTURE runs that set up the allele frequencies used by the calculators, than those who aren't, even though both sets of users are of exactly the same origin, and should expect basically identical results.
This, however, was described by myself many months prior, in November 2011, following up on observations made during my first analysis of Yunusbayev et al. Armenians in September 2011. It has been listed in the Technical Stuff at the bottom of this blog ever since.

I had observed at the time that the newly available Yunusbayev et al. Armenian sample appeared more "European" using the Dodecad v3 calculator tool, which had been built using the Project Armenians (Armenian_D) as well as the Armenian sample of Behar et al.

I then explained why this was happening, and released new versions of the Dodecad tools, such as K12a, and K12b, and more recently K10a as new scientific and project participant samples became available.

Polako also proposes a "solution" to the problem:
I actually designed my Eurogenes ancestry tests for Gedmatch with this problem in mind, by only using academic references to source the allele frequencies. This means that test results for Eurogenes project members and non-members are directly comparable. Perhaps other genome bloggers can eventually do the same?
The only effect of this "solution" is to ensure that there is a "calculator effect" for everyone using his tools. For example, if he uses only published Finns and Lithuanians to build his calculator, then every Finn and Lithuanian who takes his test will wonder why he is "different" from the published Finns and Lithuanians, because they will all suffer a "calculator effect" with respect to the reference populations. So, perhaps they will all be on equal footing with respect to each other, but their results will all be biased because of the issue I had identified.

Moreover, their results will never improve as more people join his Project, because these new people will not be included in newer versions of calculators: all users of DIY Eurogenes tools will continue to receive sub-par results. Well, small consolation, at least they'll all receive comparable sub-par results.

The solution to this problem was also described in my original post, and it's not an unimaginative quick fix of biasing everyone's results with respect to the reference populations:
What can we do to solve this problem? Sample, sample, sample. There is no shortcut. The gross details of the genetic landscape (such as the relationship between major continental groups) are easy to infer, but the details will always have room for improvement.
It is only by adequate sampling, that is by including more and more people, rather than excluding even the ones we have, that ever more accurate admixture estimators can be devised. As sample sizes grow (= more scientists publish their data, and more people join projects such as this one), allele frequencies of the different components will become ever more secure, and deviations of individuals who did not contribute to the inference of the genetic components will converge to zero.
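
The shrinking effect of any single individual can be seen in a deliberately simplified Python sketch. It is not how ADMIXTURE infers components: it only assumes a single SNP whose allele frequency is estimated directly from a reference sample of diploid genotypes (0, 1, or 2 copies), and the function name is hypothetical. Under that assumption, including one individual shifts the estimate by an amount that falls off as 1/n.

```python
def inclusion_shift(genotypes, index):
    """Shift in the estimated allele frequency due to including one individual.

    genotypes is a list of allele counts (0, 1, or 2) for one SNP and must
    contain at least two individuals. The result is the frequency estimated
    with everyone minus the frequency estimated without genotypes[index].
    """
    n = len(genotypes)
    total = sum(genotypes)
    with_individual = total / (2 * n)
    without_individual = (total - genotypes[index]) / (2 * (n - 1))
    return with_individual - without_individual


# One carrier of two copies among otherwise non-carrier samples (toy data).
for n in (10, 100, 1000):
    sample = [2] + [0] * (n - 1)
    print(n, inclusion_shift(sample, 0))
```

In this toy setting the pull that an included individual exerts on the reference frequencies, and hence the gap between included and non-included individuals, becomes negligible as the sample grows.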

I am already quite confident that inclusion biases amount to only a few percent for Dodecad Project tools and only for the closely related components (e.g., West Asian vs. North European); as mentioned in my original post, these biases are trivial for more distantly related components (e.g., European vs. East Asian).

And, the way to further reduce biases that do persist is to foster participation, rather than consign everyone to a sort of fossilized mediocrity, excluding whole populations of active direct-to-consumer customers (e.g., Norwegians, or Assyrians, or Iraqis, or Germans, or Koreans, or, ...) on the basis that no "academic reference" has made dense genotype data on them freely and publicly accessible.

Friday, August 10, 2012

fastIBD analysis of East/Central Eurasians and select West Eurasians


Individuals from the following populations have been included in this analysis:
Philippines_D Turkish_D Iranian_D Russian_D Finnish_D Turkish_Cypriot_D Ukrainian_D Belorussian_D Chinese_D Korean_D Japanese_D Tatar_Various_D Kazakh_D Szekler_D Hungarian_D Estonian_D Azeri_D Udmurt_D Mixed_Turkic_D 
These were analyzed in the context of a complete set of Central/East Eurasian populations; West Eurasian populations included were mostly Uralic and Turkic speaking groups, and a few others (such as East Slavs or Iranians).

A few quick points:
  • fastIBD was run with default parameters over a dataset of 627 individuals/255020 SNPs
  • fastIBD identifies segments of relatively recent origin that are shared by individuals. These results should not be construed as measures of overall genetic similarity or origins. Rather, they suggest which populations have exchanged genes in the relatively recent past.
With that said, you can get:
  • Spreadsheet of numeric results, showing sharing (in centi-Morgans, cM)
  • Population-level graphical results, showing an ordering of other populations based on mean IBD sharing.
IBD sharing was assessed only for populations with 5+ individuals.

The following heat map allows for a quick appraisal of populations with an excess of IBD sharing (read row-by-row):

And, a few visualizations of mean IBD sharing:

Notice high levels of within-population IBD sharing for Finns, consistent with a population that experienced expansion from a small number of founders (small ancestral population size).
Compare with Turks, who are a much more diverse population.
These two plots (you can check the spreadsheet for exact numbers) indicate different sources for the East Eurasian element in Turks and Finns. 

The top eastern populations for Turks are Turkmen, Chuvash, Uzbek, and Uygur, all of which are Turkic speakers, followed by Hazara, Yukagir, and Selkup. For Finns, there is a high degree of sharing with various Siberian groups of different languages, including Uralic Selkups (16.4cM) and Nganassan (9.6cM). Turks share less with these Uralic speakers (6.4 and 2.8cM respectively). So, these are strong hints of common shared ancestry within the Turkic and Uralic language families.

The Chuvash population is also quite interesting, as it shares more with Selkup and Nganassan, contrasting with other Turkic speakers. This makes excellent sense, and is in agreement with other recent findings:
Results from this study maintain that the Chuvash are not related to Altaic or Mongolian populations along their maternal line, thus supporting the “Elite” hypothesis that their language was imposed by a conquering group —leaving Chuvash mtDNA largely of Eurasian origin. Their maternal markers appear to most closely resemble Finno-Ugric speakers rather than Turkic speakers.
Sources of data are listed at the bottom left of this blog.