Monday, December 19, 2011

'world9' calculator

I have consistently received requests for an assessment of Amerindian ancestry. While the focus of the Project is, and will remain, the region of Eurasia, I thought it was a good idea to release a tool that could be used by persons of partial Amerindian ancestry.

I have also included the two Australasian populations currently available, namely Bougainville Melanesians (NAN_Melanesian) and Papuans from the HGDP.

The inferred components at K=9 are quite similar to those of 'eurasia7', with the addition of the Australasian and Amerindian components. I have also included the Kalash in this experiment, which caused the 'West_Asian' component to be modal in them, although the Kalash's difference from other populations in terms of this component is not so great as to render it strongly population-specific. I have called this component 'Caucasus_Gedrosia', and it, like the 'eurasia7' West Asian component, ought to be quite similar to the k5 component inferred by Metspalu et al. (2011).

It is unfortunate that there are only two Australasian populations currently available as public data. There are many more Amerindian and Mestizo ones, but it should be noted that the Amazonian populations on which the 'Amerindian' component is modal are some of the most lacking in genetic diversity in my entire database. As a result, Eurasians who lack any Amerindian or Australasian ancestry can expect to see a little of it in their results as noise.

This is a very important caveat for Americans who suspect that they may have an Amerindian ancestor. Small levels of this component may be noise. The component is also found in Siberia, where it may represent either backflow from the Americas or the common ancestry of Siberian and Amerindian populations. If you are interested in the detection of Amerindian ancestry, I recommend that you use DIYDodecad's 'byseg', 'bychr', and 'target' modes to drill down deeper into your genome.

Download Files

  • The spreadsheet contains admixture proportions, the table of Fst distances, and individual results in the Individual Results tab.
  • The RAR file contains files for use with DIYDodecad. Extract its contents to the working directory of DIYDodecad. In order to run the calculator, you follow the instructions of the README file, but type 'world9' instead of 'dv3'.

Terms of use:

'world9', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.


Admixture proportions barplot:

The nine ancestral components are:

  • Amerindian
  • East_Asian
  • African
  • Atlantic_Baltic
  • Australasian
  • Siberian
  • Caucasus_Gedrosia
  • Southern
  • South_Asian
Table of Fst divergences:

Neighbor-joining tree of Fst distances; the long branch lengths of the Australasian (and to a lesser degree the Amerindian) branches are due to the high level of inbreeding in the populations for which these components are modal.
First 8 dimensions of multi-dimensional scaling (MDS):
Technical Details

A dataset of 3,548 individuals/265,519 SNPs/284 populations was assembled. Pruning for distantly related individuals was performed by iterative pruning of a single individual from each pair showing IBD RATIO greater than the mean plus 2 standard deviations, or greater than 2.5. 3,026 individuals remained. An additional 14 individuals were removed because they had less than 97% genotype rate. The marker set was thinned to remove SNPs with less than 97% genotype rate or 1% minor allele frequency. Linkage-disequilibrium based pruning with a window of 200 SNPs, advanced by 25 SNPs, and an R-squared of 0.4 was performed. A total of 3,012 individuals and 170,822 SNPs survived these filtering steps. PLINK 1.07 and ADMIXTURE 1.21 were used in the analyses.
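The iterative relatedness pruning described above can be sketched roughly as follows. This is only an illustration, not the script used for the Project: the pair scores are made-up values, the function name `prune_related` is hypothetical, and the rule for choosing which member of a flagged pair to drop (the individual involved in the most flagged pairs) is an assumption, since the text does not specify it.

```python
from collections import Counter
from statistics import mean, stdev


def prune_related(pair_scores, hard_cutoff=2.5, n_sd=2.0):
    """Return the set of individuals to drop.

    pair_scores maps (id_a, id_b) -> a relatedness score (e.g. an IBD ratio).
    A pair is flagged if its score exceeds mean + n_sd * SD of all scores,
    or if it exceeds hard_cutoff.
    """
    scores = list(pair_scores.values())
    # "Greater than A or greater than B" is the same as "greater than min(A, B)".
    cutoff = min(mean(scores) + n_sd * stdev(scores), hard_cutoff)
    flagged = {pair for pair, score in pair_scores.items() if score > cutoff}

    removed = set()
    while flagged:
        counts = Counter(ind for pair in flagged for ind in pair)
        # Assumption: drop whoever sits in the most flagged pairs (ties by ID).
        worst = min(counts, key=lambda ind: (-counts[ind], ind))
        removed.add(worst)
        flagged = {pair for pair in flagged if worst not in pair}
    return removed


# Invented scores for six hypothetical individuals.
example_scores = {
    ("A", "B"): 3.10,
    ("A", "C"): 2.90,
    ("B", "C"): 1.90,
    ("D", "E"): 2.00,
    ("D", "F"): 1.95,
    ("E", "F"): 2.05,
}
to_drop = prune_related(example_scores)
print(sorted(to_drop))
```

Dropping the individual shared by several flagged pairs first keeps more samples than removing one member of every pair independently, which is one plausible reason to prune iteratively.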

Sunday, December 11, 2011

Dodecad Oracle (K12a edition)

I have created a new version of the Dodecad Oracle for use with the K12a calculator.

You can refer to the original Dodecad Oracle for detailed usage instructions.

(The only difference in the use of the program is that the number of populations is 204, so make sure to use this number if you plan to remove any reference populations, as mentioned in the instructions.)

In short:
  • You first load the file DodecadOracleK12a.RData in R. You can do this by double-clicking on this file in Windows, or using the File->Load Workspace menu. In Linux, you can use the "load" command, e.g., load('/home/ubuntu/Desktop/DodecadOracleK12a.RData')
  • You then enter commands at the command prompt.
Some examples:

Comparing a population against other populations

[,1] [,2]
[1,] "Somali_D" "0"
[2,] "Ethiopian_Jews" "12.3049"
[3,] "Ethiopians" "12.3309"
[4,] "Sandawe_He" "38.2093"
[5,] "MKK25" "40.7983"
[6,] "Egyptans" "63.2307"
[7,] "Yemenese" "69.1628"
[8,] "Moroccans" "72.6233"
[9,] "Jordanians" "73.1838"
[10,] "Palestinian" "74.2867"

Comparing a population against 2-way population mixes:

[,1] [,2]
[1,] "Pathan" "0"
[2,] "79.5% Sindhi + 20.5% Lezgins" "3.948"
[3,] "82% Sindhi + 18% Chechens_Y" "4.0251"
[4,] "16.7% Adygei + 83.3% Sindhi" "4.5471"
[5,] "83.4% Sindhi + 16.6% Balkars_Y" "4.6487"
[6,] "80.8% Sindhi + 19.2% Kumyks_Y" "4.7067"
[7,] "83.7% Sindhi + 16.3% North_Ossetians_Y" "4.8352"
[8,] "80.9% Sindhi + 19.1% Nogais_Y" "4.8821"
[9,] "66.6% Sindhi + 33.4% Tajiks_Y" "5.6708"
[10,] "86.4% Sindhi + 13.6% Georgians" "6.2927"

Comparing an individual against populations

DodecadOracle(c(8.4, 0, 2.8, 6, 2.2, 0.1, 40.3, 25.9, 0.3, 11.9, 1.5, 0.5))
[,1] [,2]
[1,] "Iranian_D" "2.2405"
[2,] "Kurd_D" "3.8092"
[3,] "Kurds_Y" "5.4945"
[4,] "Iranians" "6.634"
[5,] "Uzbekistan_Jews" "12.8957"
[6,] "Turks" "17.3173"
[7,] "Turkmens_Y" "17.7316"
[8,] "Iranian_Jews" "18.14"
[9,] "Assyrian_D" "18.8968"
[10,] "Azerbaijan_Jews" "18.9444"

Comparing an individual against 2-way population mixes

DodecadOracle(c(28, 0.8, 1.6, 49.9, 1.9, 0, 10.6, 4.1, 0, 2.4, 0, 0.6),mixedmode=T)
[,1] [,2]
[1,] "47.7% French_D + 52.3% Mordovians_Y" "2.5849"
[2,] "48.3% French + 51.7% Mordovians_Y" "2.6012"
[3,] "36.3% Spaniards + 63.7% Mordovians_Y" "2.9985"
[4,] "36% Spanish_D + 64% Mordovians_Y" "3.0577"
[5,] "65.9% Russian_D + 34.1% Spaniards" "3.0923"
[6,] "35.9% IBS + 64.1% Mordovians_Y" "3.0943"
[7,] "40% French + 60% Ukranians_Y" "3.1662"
[8,] "66.4% Russian_D + 33.6% IBS" "3.2359"
[9,] "24.5% Swedish_D + 75.5% Hungarians" "3.3021"
[10,] "39.3% French_D + 60.7% Ukranians_Y" "3.4046"

The numbers to the right of each result represent the "goodness" of the match; the lower, the better. If you wanted to list the top-30 results, in any of the above commands, you would enter, e.g.,

DodecadOracle(c(28, 0.8, 1.6, 49.9, 1.9, 0, 10.6, 4.1, 0, 2.4, 0, 0.6),mixedmode=T, k=30)
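For readers curious about how such rankings can be produced, here is a rough sketch in Python. It is not the Oracle's own code: the distance measure is not spelled out in this post, so plain Euclidean distance between admixture percentages is assumed, the function names are hypothetical, and the three-component population averages are invented placeholders rather than K12a values.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two vectors of admixture percentages."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_populations(target, pops, k=10):
    """Rank single populations by distance to the target (lower is better)."""
    ranked = sorted((distance(target, vec), name) for name, vec in pops.items())
    return [(name, round(d, 4)) for d, name in ranked[:k]]


def best_two_way_mixes(target, pops, k=10):
    """Rank two-population mixes, fitting the mixing weight by least squares."""
    results = []
    for (n1, p1), (n2, p2) in combinations(pops.items(), 2):
        diff = [x - y for x, y in zip(p1, p2)]
        denom = sum(d * d for d in diff)
        if denom == 0:
            continue  # identical populations: the weight is undefined
        # Closed-form weight w minimising |target - (w*p1 + (1-w)*p2)|.
        w = sum((t - y) * d for t, y, d in zip(target, p2, diff)) / denom
        w = min(1.0, max(0.0, w))
        mix = [w * x + (1 - w) * y for x, y in zip(p1, p2)]
        label = f"{100 * w:.1f}% {n1} + {100 * (1 - w):.1f}% {n2}"
        results.append((distance(target, mix), label))
    results.sort()
    return [(label, round(d, 4)) for d, label in results[:k]]


# Invented averages for three hypothetical populations and three components.
example_pops = {
    "Pop_A": [80.0, 15.0, 5.0],
    "Pop_B": [20.0, 70.0, 10.0],
    "Pop_C": [10.0, 10.0, 80.0],
}
target = [65.0, 28.75, 6.25]  # constructed as 75% Pop_A + 25% Pop_B

print(nearest_populations(target, example_pops, k=3))
print(best_two_way_mixes(target, example_pops, k=3))
```

A small distance only says that the percentages are close in the space of admixture proportions; it does not by itself demonstrate that the individual descends from the listed populations.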

If you recently joined the Project, please consider leaving a brief comment in the Information about Project Samples thread.

Participant results for 'K12a' calculator

The participant results can be found in the "Individual Results" tab of the K12a spreadsheet.
You can read more about the K12a calculator at my other blog; if you are not a Project participant, you can also find a DIY version of it there, which can be used in conjunction with DIYDodecad 2.1.

Monday, October 31, 2011

Origin of Kalash inferred with Eurogenes K=10 "test" calculator

Vasishta is asking Eurogenes for help in demonstrating that the Kalash have Northern-European-specific segments:
Yes. He keeps citing the Kalash as proof that the Indo-Iranians were an almost exclusively a West-Asian like population, even though I personally think the mainly West Asian-South Asian assortment of the Kalash in his analyses might be an artifact of their inbreeding and isolation, thus confusing ADMIXTURE. Zack's K=11 at Harappa has shown that the Kalash display around 22% of the component modal in Lithuanians. Yet, he ignores the North/Eastern European admixture in Northwest Indians and North Indian Brahmins (in his own analyses at that!). Interestingly enough the aforementioned groups tend to score a sliver of Northeast European admixture in Dr.Doug McDonald's analyses, with the top matches for that sliver usually being Lithuanians, Russians and Finns; in that order. It (NEU) is even found in frequencies of around 4-6% in Dravidian-speaking southern Brahmins. As much as I hate to say it, he is indeed rather stubborn and has somewhat of an underlying agenda.

David, I think you should look into proving that the Kalash do indeed have some NEU-specific segments. I would be super-surprised if they didn't, given that more mixed populations south of their geographical area display it themselves.

It appears that Vasishta disagrees with me because he "personally thinks" that the admixture proportions of the Kalash are due to inbreeding and the limitations of ADMIXTURE.

Does he cite any studies or make any argument why ADMIXTURE would remove precisely the component that he is so eager to be present? No. While genetic drift in an isolated population could indeed lead to the loss of genetic diversity, there is no reason to think that this would lead preferentially to the loss of Northern-European segments. It is strange that Vasishta accuses me of bias and yet, at the same time, invokes the magic of some unspecified flaw of ADMIXTURE for the loss of his favorite component.

Vasishta invokes the Harappa Ancestry Project K=11 admixture analysis in support of his idea that the Kalash have 22% of the component modal in Lithuanians. However, he neglects to mention that at K=11 there is no West-Asian or Caucasus centered component in the HAP analysis, but rather only "European" (modal in Lithuanians) and "SW Asian" (modal in Yemen Jews). It is indeed strange that he accuses me of bias for providing evidence about the relationship of the Kalash with West Asia, while at the same time, showing preference for a level of analysis where such a component is lacking.

The West Eurasian cline between Arabia and Northeastern Europe is evident in the 'weac' admixture analysis, where the European-centered component (Atlantic-Baltic) is present in populations such as Assyrians and Armenians, whereas it is lacking in them at the appropriate level of resolution. Therefore, the fact that the Kalash show "European" admixture at the level of Europe vs. Near East does not mean that they ought to show such admixture at the level of Europe vs. West Asia/Caucasus vs. Arabia.

One of the benefits of DIYDodecad has been the availability of data from projects that have hitherto been black boxes. In the interest of transparency, I have taken the Eurogenes K=10 "test" calculator and repeated my analysis of the Kalash, who had previously been shown by me to be a fairly simple West/South Asian mix. I could have waited for him to get around to it, but since he's quick on the talk and slow on the trigger, I decided to do it for him.

The admixture proportions of the Kalash, according to the Eurogenes K=10 "test" calculator, are: 40.3% S_Asian, 58.7% W_Asian, 0.9% N_E_Euro, 0.1% N_Asian. Hence the analysis based on the Eurogenes K=10 components confirms the analysis based on my 'eurasia7', "showing the Kalash to be a "West Asian" population (62.4%) with substantial "South Asian" admixture (37.1%), and near-complete absence of any other genetic components."

Eurogenes alleges, not without his usual charm, that:
Dienekes has a keen eye for things he wants to see. But he hasn't yet noticed that in all accurate analyses, there's significant Eastern European admix in North India. His monocle got fogged up in that instance.
Let us consider some pertinent facts: the Indian peninsula has been invaded multiple times from Central Asia, a process that continued long after the establishment of the Indo-Aryans during the 2nd millennium BC. Eurogenes may want to think that the "Eastern European" admixture in South Asia dates to his mythological Polish Indo-Europeans, galloping across the steppes on their horses, but there is, at present, no particular reason to think that this is the case.

Furthermore, among populations from the northern parts of the Indian subcontinent, the "West Asian" component as a fraction of the "West Asian" + "Atlantic-Baltic" components reaches its minimum of 77% in the Pathans. His own monocle is surely in greater need of de-fogging if I miss the 23% and he misses the >77%.

Indeed, the Europe vs. Caucasus ratio in Indian subcontinental populations is similar to that found in people from the Middle East and Caucasus region. It is not surprising that Eurogenes has abandoned his search for North European components in South Asia, going as far as reconstructing Ancestral North Indians as "Northern Europeans". Needless to say, he was wrong. The West Eurasian ancestry of the population of the Indian subcontinent is similar to that found in modern West Asian populations, not Slavs.

Eurogenes promises:
This shouldn't be too difficult. I'll use Dienekes' calculator for the job, and then check the results with LAMP. How poetic.
Been there, done that. It will be fun to see what "Northern European" components he will be able to squeeze out of the 0.9% N_E_Euro component that my software, in conjunction with his "test" calculator produces.

Why are the Kalash important?

There are three reasons why the Kalash are important in the study of Eurasian prehistory:
  1. Their mountainous habitat contributed to isolation and relative immunity from historical population movements
  2. Their non-Islamic religion has definitely preserved them from recent gene inflow
  3. Their language is unique within the Indo-Aryan family, and is often considered today as part of a separate Dardic family of Indo-Iranian in addition to the more populous Iranian and Indo-Aryan families.
The Kalash are crucial for those interested in the origins of Indo-Iranians, and the fact that they are, indeed, a simple West/South Asian mix is not without significance for that question.


Here is the result of a PCA analysis of the Kalash together with 50 synthetic individuals from each of the S_Asian, W_Asian, and N_E_Euro components of Eurogenes K=10 "test". This was calculated with smartpca with numoutlieriter set to 0.

It is evident that the Kalash appear to fall on the S_Asian to W_Asian line, and toward the W_Asian pole, consistent with being a population of those two origins, with the W_Asian component predominating.


As mentioned in the eurasia7 post, the Kalash tend to form population-specific components in ADMIXTURE analyses, so they are generally not included in my runs. So, I ran the K=7 analysis again, but this time I included the Kalash. Here are the top populations of the component that was modal in the Kalash:

[186,] "Kurd_D" "50.2"
[187,] "Kurds_Y" "50.7"
[188,] "Armenian_D" "50.9"
[189,] "Armenians_Y" "51.2"
[190,] "Adygei" "51.5"
[191,] "Chechens_Y" "53"
[192,] "North_Ossetians_Y" "53.2"
[193,] "Lezgins" "54.4"
[194,] "Georgians" "59.8"
[195,] "Georgian_D" "60.1"
[196,] "Abhkasians_Y" "60.5"
[197,] "Kalash" "63.2"

Here are their exact admixture proportions in this unsupervised ADMIXTURE run:

Kalash N=23
East_Asian: 0.5
Atlantic_Baltic: 1.5
South_Asian: 32.9
Sub_Saharan: 0.0
Southern: 0.0
Siberian: 1.8
West_Asian: 63.2

UPDATE III (November 22): Eurogenes estimates that there is 4% "Northeast European" admixture for Kalash individual HGDP00302. He managed to avoid the creation of a Kalash-specific component by including only a single Kalash individual in an ADMIXTURE experiment.

The Kalash do tend to create their own Kalash-specific component, and a good way to avoid such a component is to include each of them individually, and repeat the analysis 23 times. An alternative, and less time-consuming, way is to create a single synthetic individual using the allele frequencies of the Kalash population as a whole. Even simpler, one could randomly pick a single individual (such as HGDP00302), but at the risk of picking an individual that has either much more or much less than the average of a particular type of ancestry.

Below are the admixture proportions of all the 23 Kalash individuals from the unsupervised ADMIXTURE run of UPDATE II. Individual HGDP00302 ranks 4th of 23 in terms of the "Atlantic_Baltic" component (3%), which peaks in Lithuanians. The Kalash have 1.5% "Atlantic_Baltic" on average (median=1%, standard deviation=2.1%).

ID East_Asian Atlantic_Baltic South_Asian Sub_Saharan Southern Siberian West_Asian
HGDP00279 0.007 0.081 0.361 0 0 0.031 0.521
HGDP00307 0.004 0.059 0.336 0 0 0.018 0.583
HGDP00315 0.019 0.036 0.338 0 0 0 0.606
HGDP00302 0.006 0.03 0.337 0 0 0.02 0.608
HGDP00311 0.014 0.029 0.325 0 0 0.021 0.611
HGDP00285 0 0.027 0.319 0 0 0.019 0.635
HGDP00333 0 0.02 0.324 0 0 0.018 0.638
HGDP00277 0 0.016 0.334 0 0 0.021 0.63
HGDP00298 0.012 0.016 0.325 0 0 0.016 0.631
HGDP00281 0.011 0.015 0.332 0 0 0.01 0.633
HGDP00304 0.007 0.012 0.329 0 0 0.013 0.638
HGDP00290 0.007 0.01 0.325 0 0 0.021 0.637
HGDP00274 0.007 0.004 0.341 0 0 0.013 0.635
HGDP00309 0.007 0 0.317 0 0 0.019 0.656
HGDP00330 0 0 0.335 0 0 0.026 0.639
HGDP00319 0.011 0 0.328 0 0 0.01 0.651
HGDP00288 0.004 0 0.339 0 0 0.013 0.644
HGDP00286 0 0 0.329 0 0 0.018 0.653
HGDP00313 0 0 0.351 0 0 0.015 0.634
HGDP00328 0 0 0.31 0 0 0.023 0.667
HGDP00267 0 0 0.332 0 0 0.022 0.647
HGDP00326 0 0 0.307 0 0 0.03 0.663
HGDP00323 0.002 0 0.304 0 0 0.013 0.68
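The summary figures quoted before the table can be re-derived from its Atlantic_Baltic column. The short Python check below copies that column from the table; the use of the sample standard deviation is an assumption (the population standard deviation rounds to the same value here).

```python
from statistics import mean, median, stdev

# Atlantic_Baltic proportions copied from the table above.
atlantic_baltic = {
    "HGDP00279": 0.081, "HGDP00307": 0.059, "HGDP00315": 0.036,
    "HGDP00302": 0.030, "HGDP00311": 0.029, "HGDP00285": 0.027,
    "HGDP00333": 0.020, "HGDP00277": 0.016, "HGDP00298": 0.016,
    "HGDP00281": 0.015, "HGDP00304": 0.012, "HGDP00290": 0.010,
    "HGDP00274": 0.004, "HGDP00309": 0.0, "HGDP00330": 0.0,
    "HGDP00319": 0.0, "HGDP00288": 0.0, "HGDP00286": 0.0,
    "HGDP00313": 0.0, "HGDP00328": 0.0, "HGDP00267": 0.0,
    "HGDP00326": 0.0, "HGDP00323": 0.0,
}

values = list(atlantic_baltic.values())
# Individuals ordered from highest to lowest Atlantic_Baltic proportion.
ranked = sorted(atlantic_baltic, key=atlantic_baltic.get, reverse=True)

summary = {
    "n": len(values),
    "mean_pct": round(100 * mean(values), 1),
    "median_pct": round(100 * median(values), 1),
    "sd_pct": round(100 * stdev(values), 1),  # sample SD (assumption)
    "rank_HGDP00302": ranked.index("HGDP00302") + 1,
}
print(summary)
```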

Wednesday, October 26, 2011

'eurasia7' calculator

This calculator was made with 196 different populations and 2,659 individuals, including 518 project participants. The following Dodecad populations do not have 5 individuals yet, so they are included in the OTHERS_D generic category:
Algerian_D, North_African_Jews_D, Slovenian_D, Mixed_Scandinavian_D, Danish_D, Moroccan_D, Tunisian_D, Serb_D, Austrian_D, Saudi_D, Pakistani_D, Tatar_Various_D, Palestinian_D, Greek_Italian_D, Romanian_D, Swiss_German_D, Szekler_D, Mandaean_D, Azeri_D, Czech_D, Georgian_D, Belgian_D, Latvian_D, Estonian_D, Bangladesh_D, Yemenese_D, Sri_Lanka_D, Hungarian_D, Basque_D, Udmurt_D, Egyptian_D
As always, I encourage people with 4 grandparents from the same country or ethnic group of Eurasia, North or East Africa to contact me (do not send data!) for possible inclusion in the Project. If I have overlooked any such individuals, drop me a line (my e-mail address is at the bottom of the blog). I usually start a new _D population whenever individuals with 4 grandparents from the same group are submitted, but I may have missed some.

Note that all individuals from the reference populations have also been included, including outliers; you should be aware of this when reading the population averages, and consult the Outliers tab in the v3 spreadsheet for some instances of outliers.
Due to image size restrictions in Picasa, the labels are not clearly visible. A large version of the above plot can be found in the download bundle.

The seven ancestral populations inferred at this level of resolution are:
  • Sub_Saharan
  • West_Asian
  • Atlantic_Baltic
  • East_Asian
  • Southern
  • South_Asian
  • Siberian
As usual, you should take these names as useful labels, and interpret them in conjunction with the components' distribution in different populations, and their Fst distances, both of which can be found in the spreadsheet.

The table of Fst distances:

Below you can see a neighbor-joining tree based on inter-population Fst distances:
The first six dimensions of a multi-dimensional scaling of the same:

Calculator Files:

  • The spreadsheet contains population averages, the table of Fst distances, and individual results for included Project participants.
  • The download RAR file (Google Docs or Sendspace) contains all the files needed to run the calculator. You must download and install DIYDodecad 2.1 first. In order to run the calculator, you follow the instructions of the README file, but type 'eurasia7' instead of 'dv3'.

Terms of use: 'eurasia7', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Technical Details:

The calculator is built using allele frequencies of K=7 ancestral components inferred by ADMIXTURE 1.21 analysis of 2,659 individuals. Markers included in the source datasets, as well as the Family Finder and 23andMe (as of Oct 21) platforms were included. The marker set was thinned of markers with less than 99.5% genotype rate and less than 0.5% minor allele frequency. Linkage-disequilibrium based pruning was carried out with a window size of 250 SNPs, advanced by 25 SNPs and R-squared greater than 0.4. A total of 164,990 SNPs remained after these filtering steps.
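The marker thinning step can be illustrated with a small Python sketch. The actual filtering was presumably done with a dedicated tool, so this is only a toy illustration: genotypes are assumed to be coded as 0/1/2 copies of one allele with `None` for a missing call, the function names are hypothetical, and a marker is assumed to be dropped if it fails either threshold. Linkage-disequilibrium pruning is not shown.

```python
def genotype_rate(calls):
    """Fraction of individuals with a non-missing call at this marker."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls):
    """Minor allele frequency from 0/1/2 genotype codes (None = missing)."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def keep_marker(calls, min_rate=0.995, min_maf=0.005):
    """Keep a marker only if it passes both thresholds (assumed reading)."""
    return (genotype_rate(calls) >= min_rate
            and minor_allele_frequency(calls) >= min_maf)


# Three invented markers typed in 1,000 hypothetical individuals.
common = [0] * 600 + [1] * 300 + [2] * 100   # well typed, MAF 0.25
rare = [0] * 999 + [1]                        # MAF 0.0005, too rare
sparse = [None] * 10 + [1] * 990              # genotype rate 0.99, too low

for name, calls in [("common", common), ("rare", rare), ("sparse", sparse)]:
    print(name, keep_marker(calls))
```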

All relevant populations available to me, and genotyped at a sufficient number of markers were included. Inclusion of the Kalash population resulted in a population-specific component at K=7, and hence their admixture components were inferred a posteriori. Their proportions are consistent with previous results, showing them to be a "West Asian" population (62.4%) with substantial "South Asian" admixture (37.1%), and near-complete absence of any other genetic components.

Friday, October 21, 2011

23andMe data file changes

A few recent submissions to the Project have alerted me to the fact that 23andMe has been making changes to its data download. It is unfortunate that such changes are made apparently "silently", as they may negatively impact third-party tools built around the 23andMe data.

Apparently, the orientation of some SNPs has been changed. This should not be a problem for DIYDodecad, as it handles orientation of different companies automatically. A different problem is that apparently some SNPs have been dropped from the data file altogether. This is a problem if you are not using the latest 2.1 version of the DIYDodecad software, so you should upgrade to 2.1. It seems that about 80 of the SNPs expected by the 'dv3' calculator have been removed from the data file download, and these will appear as "absent". I do not expect 80 absent SNPs to have a huge impact on results, as they make up less than 0.1% of all SNPs in dv3.

The change in format has other consequences as well; as many of you know, I have been working on Dodecad v4 for some time now. This would use the markers common to the 23andMe v2 and v3 platforms and the Family Finder Illumina platform. However, I will now have to backtrack on it, to make sure that the marker set used is actually consistent with people's current 23andMe downloads.

If you have a fresh 23andMe downloaded file and DIYDodecad 2.1 and you are unable to run 'dv3' or any other Project calculators, drop me a line.

Eurogenes is upset

Eurogenes seems to be upset this week, first throwing a tantrum at Dr. McDonald and then at myself. You can probably find the cached text in Google for some time, although Eurogenes has deleted his anti-McDonald tantrum, and changed the verbiage on the one directed against me on advice of some more cool-headed people. Here is the epilogue of his original anti-Dienekes rant:
Dienekes, you've got a spreadsheet online showing all sorts of weird things. You need stop being a prat, and do something about it ASAP.
Eurogenes' animus towards me is not surprising for those who have followed our interactions since the old days. Of course he is benefiting from my work (I have pointed him towards data he didn't know existed, he is using DIYDodecad, as well as the 1000Genomes data extracted with my code by the MDLP), so one would think that if he had any criticism against me, he would at least express it in a more dignified way.

Of course, being rude, ungrateful and mean-spirited does not mean one is wrong! So, what has Eurogenes actually discovered?

He noted the high Ukrainian West/East European ratio produced by Dodecad v3, and objected to my idea that Ukrainians were transitional to the Balkans and the Caucasus. Actually, according to the PCA plot of the Yunusbayev et al. (2011) paper, they are transitional, being situated toward both the Balkans and the Caucasus, relative to Belorussians/Lithuanians, i.e., the populations that generally show peaks of East European-related components. This is also supported by the ADMIXTURE analysis that reveals Ukrainians to possess a Caucasus-centered component largely lacking in other Eastern Slavs, but shared with Balkan/Caucasus populations.

Should I have refrained from testing the new Yunusbayev data with Dodecad v3 and reporting the results? Of course not. When one has a measuring instrument, one uses it on new data to test its performance and reports what one sees. This is exactly what I have done. At the same time, one uses the new data to create new measuring instruments that have been trained using all available data, which is also what I have done with euro7 and the upcoming Dodecad v4.

To make matters worse, Eurogenes suggests that my euro7 analysis agrees with his K=10 which was presented two weeks later. So, apparently, I am posting correct information about Ukrainians 2 weeks before he does, and this means that I am turning around to his way of thinking rather than vice versa. Go figure.

Eurogenes continues with his posting of supposed MDS/PCA plots supporting his thesis. Actually, what he has posted are plots based on metric distances in the space of admixture proportions; these are not genetic distances because, e.g., a +/-1% difference in a Sub-Saharan component results in the same Euclidean distance as a +/-1% difference in a European one, although the former affects genetic distance much more strongly than the latter. Metric distances are fine to quickly determine closeness of samples in the space of admixture proportions, but they are certainly no substitute for real genetic distances. I have already linked above to evidence that Ukrainians are transitional to the Balkans and the Caucasus relative to the Yunusbayev et al. populations.
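A toy Python example makes the point concrete. Everything in it is an illustration rather than real data: the three component names are just labels, the Fst-like values are invented, and treating them as squared distances between components (so that the distance between two mixtures becomes a quadratic form in the difference of their proportions) is a simplifying assumption.

```python
import math

COMPONENTS = ["West_European", "East_European", "Sub_Saharan"]

# Invented Fst-like divergences between components (not real estimates).
FST = {
    ("West_European", "East_European"): 0.01,
    ("West_European", "Sub_Saharan"): 0.15,
    ("East_European", "Sub_Saharan"): 0.16,
}


def fst(a, b):
    """Symmetric lookup; a component has zero divergence from itself."""
    if a == b:
        return 0.0
    return FST[(a, b)] if (a, b) in FST else FST[(b, a)]


def euclidean(p, q):
    """Plain metric distance in the space of admixture proportions."""
    return math.sqrt(sum((p[c] - q[c]) ** 2 for c in COMPONENTS))


def divergence_weighted(p, q):
    """Distance that weights each shift by how divergent the components are.

    Uses -1/2 * sum_ij d_i d_j F_ij, which equals the squared distance between
    two mixtures when F_ij are squared distances between the pure components.
    """
    delta = {c: p[c] - q[c] for c in COMPONENTS}
    total = -0.5 * sum(delta[a] * delta[b] * fst(a, b)
                       for a in COMPONENTS for b in COMPONENTS)
    return math.sqrt(max(total, 0.0))


base = {"West_European": 1.00, "East_European": 0.00, "Sub_Saharan": 0.00}
plus_european = {"West_European": 0.99, "East_European": 0.01, "Sub_Saharan": 0.00}
plus_african = {"West_European": 0.99, "East_European": 0.00, "Sub_Saharan": 0.01}

print(euclidean(base, plus_european), euclidean(base, plus_african))
print(divergence_weighted(base, plus_european), divergence_weighted(base, plus_african))
```

The two Euclidean distances are identical, while the divergence-weighted ones differ by a factor of nearly four under these invented values.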

I am also, apparently, accused of neglecting to point out the deficiencies of Dodecad v3, and I am invited by Eurogenes to retract it completely! This proposal is equivalent to the idea that we should burn old topographic maps that were based on measurements with sticks, ropes, and trigonometers, because we can now measure distances with laser beams. And, it is funny indeed that I am supposedly neglecting the deficiencies of Dodecad v3 when, 3 weeks before the Eurogenes rant, I post exactly what its limitations are, and how it can be made better.

It is unfortunate that Eurogenes has chosen to go down that path. Envy is not a good guide to behavior, and perhaps, instead of relishing the prospect of putting others down, he could spend a little more time inventing something of his own.

As for myself, I will continue to work on my tools, and to encourage cross-pollination between different projects for the benefit of all.

UPDATE: In a newer post, Eurogenes attempts to justify his mishandling of MDS, by suggesting that he presented results based on raw SNP data. This is of course nonsense, since Eurogenes does not have the raw SNP data of the Dodecad populations. He is comparing apples and oranges by comparing plots made on raw data with those made in the space of admixture proportions. Furthermore, his supposed findings have no bearing on the Yunusbayev et al. ADMIXTURE and PCA results, posted above.

Thursday, October 20, 2011

Comparing different ADMIXTURE runs using Zombies

My idea of using zombies with ADMIXTURE is the gift that keeps on giving. Remember that "zombies" are synthetic individuals created from ADMIXTURE output, representing the K inferred ancestral components. They can be viewed as hypothetical ancestral individuals representing each of these K components without any admixture from any of the others.
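As a rough illustration of the idea, the Python sketch below generates zombies for one component. It is an assumption about how this might be done, not the code used for the Project: each zombie receives two alleles per SNP drawn independently from the component's allele frequency, and the frequencies and the function name `make_zombies` are invented for the example.

```python
import random


def make_zombies(allele_freqs, n=50, seed=0):
    """Create n synthetic individuals for one ancestral component.

    allele_freqs holds one allele frequency per SNP for that component.
    Each genotype is the number of copies (0, 1 or 2) of the counted allele.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    zombies = []
    for _ in range(n):
        genotype = [
            sum(rng.random() < freq for _copy in range(2))
            for freq in allele_freqs
        ]
        zombies.append(genotype)
    return zombies


# Invented allele frequencies at five SNPs for one hypothetical component.
component_freqs = [0.9, 0.1, 0.5, 0.0, 1.0]
zombies = make_zombies(component_freqs, n=50)

print(len(zombies), zombies[0])
```

Because every allele is drawn from a single component's frequencies, an analysis of such individuals with the same components should assign them almost entirely to that component, which is what makes them useful as reference points when comparing runs.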

An interesting problem that often comes up is to compare across different ADMIXTURE runs. I can think of at least three different applications of this:
  1. To compare components across different K; for example, how does a "West Asian"-centered component at K=5 differ from a similarly-centered component at K=12?
  2. To compare components across different datasets; for example, how does a "West Asian"-centered component inferred from an existing dataset (e.g., the current Dodecad v3) differ from a "West Asian"-centered one from a new dataset (e.g., the upcoming Dodecad v4, which will also be trained on the valuable new populations of Yunusbayev et al. 2011)?
  3. To compare components across different projects; there has been a proliferation of different ancestry projects since the launching of Dodecad nearly a year ago, and since all of them use slightly different individuals/SNPs/terminology, it is quite useful to be able to gauge how one component from one project maps onto other components in other projects.
As proof of concept, I took the MDLP calculator from the Magnus Ducatus Lituaniae Project and generated 50 zombies for each of its 7 ancestral components:
  1. Scandinavian
  2. Volga_Region
  3. Altaic
  4. Celto_Germanic
  5. Caucassian_Anatolian_Balkanic
  6. Balto_Slavic
  7. North_Atlantic
I then inferred the ancestry of the MDLP zombies using Dodecad v3, and vice versa. Since Dodecad v3 also includes populations (e.g., Africans) not considered by MDLP, I did not try to map those onto MDLP.

I will comment on the MDLP-to-dv3 mapping:
  1. The MDLP "Scandinavian" component appears to be West/East European with a little Mediterranean and a little Northeast Asian
  2. The MDLP "Volga_Region" component appears to be East European with some Northeast Asian
  3. The MDLP "Altaic" component is West Asian+Northeast Asian+Southeast Asian. Note that in Dodecad v3, the Northeast Asian component peaks at Chukchi, Nganasan, and Koryak, and most other east Eurasian populations have much less of it
  4. The MDLP "Celto-Germanic" component is (surprisingly) Mediterranean-dominated. One possible interpretation is that in the context of MDLP this captures one aspect of the difference between Southwestern and Northeastern Europe (higher Mediterranean in the former), whereas the...
  5. ... MDLP "North-Atlantic" component seems to be entirely West European, and is capturing a different aspect of east-west variation in Europe.
  6. The MDLP "Balto-Slavic" appears the reverse of the "Celto-Germanic" with lower Mediterranean and reversed East/West European
  7. Finally, the MDLP "Caucassian_Anatolian_Balkanic" component is predictably mainly West Asian, but with a little Mediterranean and Southwest Asian as well
A different way of comparing the different components is to include them all in a joint MDS plot, or calculate various types of distances between them (e.g., Fst).

For example, the first couple of dimensions are dominated by the African/Asian components of Dodecad v3 that are not present in MDLP. Notice, however, the position of "Altaic", right where one might expect to find it between West and East Eurasians.

Limiting ourselves to only European populations, we obtain:

It appears that the "North_Atlantic" component may be centered on a small number of related individuals.

I encourage other genome bloggers to try their own hand at comparing their components with those of other projects, or even their own. This process will be made possible if people using ADMIXTURE follow the simple instructions to convert their output for use with DIYDodecad.

Once Dodecad v4 is off the ground, and if I find time to fully automate the process, I will perhaps try to map all my past calculators (i.e., the initial K=10, Dodecad v3, 'bat', 'euro7', 'weac', 'africa9') onto the new gold standard of the Project.

PS: This analysis was done on ~63k SNPs in common between MDLP and Dodecad v3

Friday, September 30, 2011

'euro7' calculator

I am releasing a new calculator for Europeans, including their immediate neighboring populations around the Black Sea (Caucasus and Anatolia). The calculator can be used with DIYDodecad.

There are additional African and Far-Asian population controls, so, in principle, the calculator could be used by non-Europeans/Anatolians/Caucasians, although I would be less confident of their results. For example, people of South Asian ancestry may obtain a Far-Asian result if they use this calculator, due to the deep affinity of Ancestral South Indians with East Asians. Other West Eurasians and West Eurasian-admixed peoples, not from the studied regions (e.g., Arabians or East Africans) will have their West Eurasian components mapped onto the ones used in this calculator.

'euro7' uses 7 ancestral components:
  • Caucasus
  • Northwestern
  • Northeastern
  • Southeastern
  • African
  • Far_Asian
  • Southwestern
These names represent 7 ancestral populations inferred by ADMIXTURE, and have been chosen based on the geographical regions where each of them achieves its maximum representation. You should always refer to "A note of caution on admixture estimates" and "Interpretation of ADMIXTURE results: component sharing", as well as the average population values in the spreadsheet, when interpreting your individual results.

The distribution of these 7 components can be seen in the barplot on the top left, and precise admixture proportions can be found in the spreadsheet. Note that additional samples have been used to infer these components, but as these come from Dodecad populations with fewer than 5 participants, I am not reporting average values for them, as per the usual project policy.

Here is the neighbor-joining tree based on the Fst divergences between the 7 ancestral components:

You can download the calculator RAR from here (Google docs; File->Download original), or here (sendspace).

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'euro7' instead of 'dv3' in these instructions.

Terms of use: 'euro7', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Calculators released by the Dodecad Project:

Sunday, September 25, 2011

Yunusbayev et al. (2011) data assessed with Dodecad v3

I have acquired the data from the recent Yunusbayev et al. (2011) paper on the Caucasus. This includes the following populations:
  • Kurds_Y 6
  • Bulgarians_Y 13
  • Ukranians_Y 20
  • Mordovians_Y 15
  • Armenians_Y 16
  • Abhkasians_Y 20
  • Balkars_Y 19
  • North_Ossetians_Y 15
  • Chechens_Y 20
  • Nogais_Y 16
  • Kumyks_Y 14
  • Turkmens_Y 15
  • Tajiks_Y 15
It is a valuable new addition to the Project, and it is commendable that it has been made publicly and easily available so swiftly after the appearance of the Yunusbayev et al. (2011) paper.

To get the ball rolling on the new Yunusbayev et al. data, I will map the new populations onto the Dodecad v3 components; they will be added to the Dodecad v3 spreadsheet as they are calculated.

I have been laboriously designing a new global (including Amerindians and Australasians) Dodecad X1 experimental calculator with 3,010 individuals for a few weeks now, but I guess I will now have to reboot it with 3,214.

Together with some other new data I recently discovered, I now have 9,799 individuals (some duplicates from different sources) in my global database. My Dodecad dataset of 511 individuals from a single country or ethnic group isn't too shabby either. Let's hope for a new data release that will push the data collection above the magic 10,000.


I have added the first 7 populations to the spreadsheet; the others are being calculated as we speak. Most of them seem in line with expectations, but the Abkhasian sample has one outlier individual (abh27), who has thus been placed in the "Outliers" tab of the spreadsheet; a new set of admixture proportions for the sample, minus that outlier individual, will be calculated:

UPDATE II: The population portraits have been uploaded to Google Docs as a rar file (Sendspace mirror). Average admixture results have all been entered to the spreadsheet.

Wednesday, September 21, 2011

'weac' calculator

This new calculator places individuals on the West Eurasian cline. This cline is the first-order description of variation in West Eurasians, with populations from northern and western Europe falling on one end, and those from the Near East on the other.

On the left, you can see the populations on which the calculator is based, sorted on their average "Atlantic-Baltic" component. The raw data can be found in the spreadsheet.

Note that the main purpose of the calculator is to place European and Near Eastern samples on the West Eurasian cline, and to do so, some African and East Eurasian populations are used as controls. Other types of ancestry (e.g., South Asian or Amerindian) may register as Far-Asian in the context of this test.

You can download the calculator RAR from here (Google docs), or here (sendspace).

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'weac' instead of 'dv3' in these instructions.

Terms of use: 'weac', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Sunday, September 18, 2011

Do-It-Yourself Dodecad v 2.1

DIYDodecad v 2.1 allows incomplete genotype files to be used, i.e., genotype
files that do not include all expected SNP markers used in a calculator. This
is useful to individuals having older genotype files from their testing
companies, and allows the tool to be used with any type of genotype data, and
not only the Illumina platforms currently used by 23andMe and FamilyTreeDNA.

There is a minimum requirement of at least 100 usable SNPs, i.e., SNPs that are
in the genotype file and do not have no-calls.
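The "usable SNP" rule above can be sketched in a few lines of Python. This is only an illustration, not DIYDodecad's own code: the function name is hypothetical, and it assumes a tab-separated genotype file (rsid, chromosome, position, genotype) in which no-calls contain a '-' character.

```python
import io

MIN_USABLE_SNPS = 100  # minimum stated above for DIYDodecad v 2.1


def count_usable_snps(genotype_lines, calculator_snps):
    """Count calculator SNPs that are present in the genotype data and called.

    Assumed input: tab-separated lines (rsid, chromosome, position, genotype),
    '#' comment lines, and no-calls written with '-' characters (e.g. '--').
    """
    wanted = set(calculator_snps)
    usable = set()
    for line in genotype_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        rsid, genotype = fields[0], fields[3]
        if rsid in wanted and "-" not in genotype:
            usable.add(rsid)
    return len(usable)


# Toy genotype data: rs2 is a no-call, rs9 is not used by the calculator,
# and rs4 (used by the calculator) is missing from the file.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs1\t1\t1000\tAG\n"
    "rs2\t1\t2000\t--\n"
    "rs3\t2\t3000\tCC\n"
    "rs9\t3\t4000\tTT\n"
)
n_usable = count_usable_snps(sample, ["rs1", "rs2", "rs3", "rs4"])
print(n_usable, "usable SNPs; enough to run:", n_usable >= MIN_USABLE_SNPS)
```

With a real file you would pass an open file handle instead of the toy StringIO object, and the marker list would come from the calculator's .alleles file.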

If you had previously followed the instructions carefully, and got an "end of file reached" error, this was most likely due to your genotype file lacking some of the expected markers used in the calculator. Version 2.1 should work for you.

You can download it from here (Google Docs, File->Download Original), or here (Sendspace). Uncompress DIYDodecad2.1.rar to a local directory on your computer, and follow the instructions in the README.txt file.

Past versions: 2.0, 1.0

Thursday, September 15, 2011

Third-party tools based on the Dodecad Project

A third-party site has made available some tools based on Dodecad v3 as bundled in DIYDodecad 2.0. In addition to the regular admixture analysis, there is a chromosome painting, and an option to compare 2 kits. This should be quite useful to Mac users, who can't use DIYDodecad at present. The site requires upload of your data to the server side, which provides the benefit of the other tools of the site, but may not be ideal for people with privacy concerns, for whom the DIY tool was partly built.

Note that because admixture estimation is expensive computationally, the tools are slightly less accurate than DIYDodecad because of more lax termination criteria. This should not be a problem for the major components of one's ancestry, but may be for the minor ones. Moreover, convergence is achieved with a different number of iterations for different genotype files, so accuracy may vary.

Two other genome bloggers have released their own calculators that can be run with DIYDodecad. Magnus Ducatus Lituaniae has released MDLP based on its K=7 analysis. Eurogenes has released a K=14 imaginatively named test calculator for Eurasian data.

I keep a list of calculators for DIYDodecad in the DIYDodecad 2.0 page.

I neither endorse nor am I affiliated with any third-party tools, but I encourage readers to try them out; the more the merrier.

Wednesday, September 14, 2011

'africa9' calculator

I have devised a new calculator targeted specifically for Africans. Admixture proportions in the reference panel, Fst distances between the K=9 components, as well as individual results for Project participants from the North_Africa_D, North_African_Jews_D and East_African_D populations can be seen in the spreadsheet.

The calculator combines data from Henn et al. (2011), HGDP, and Behar et al. (2010). As a result, the number of SNPs is small: there is probably noise in the minor components, but the major components of one's ancestry should be well-defined.

It should be used only by Africans and African-West Eurasian admixed individuals. It is not meant for people with additional admixture (e.g., South/East Asian or Native American).

You can download the calculator RAR from here (Google docs), or here (sendspace).

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'africa9' instead of 'dv3' in these instructions.

Terms of use: 'africa9', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

NB: The components of 'africa9' do not necessarily have the same meaning as the same-named components you might have seen elsewhere. Refer to the spreadsheet for the admixture proportions and Fst distances between components. For example, the NW African component is substantially removed from other West Eurasian components in Dodecad v3 but equidistant from Europe and SW_Asia in 'africa9'. I also advise that you read "Interpretation of ADMIXTURE results: component sharing".

Monday, September 5, 2011

'bat' calculator (Balkans-Anatolia-Turkic)

I have decided to make a new calculator for DIYDodecad that may be useful for individuals from the Balkans and Anatolia. You can download it from here at Google Docs (or here from sendspace). The terms of use are the same as for DIYDodecad v 2.0. To run it, you simply extract the contents of the RAR file in your working directory, and type bat.par whenever you typed dv3.par in the instructions.

The reference populations can be seen below. I have included all available Balkan populations, as well as Turks and Armenians. Moreover, I have included all available Turkic populations.

The marker set is the same as used in Dodecad v3. Three components emerge: one centered in the northern Balkans, one in eastern Anatolia, and one present in various proportions among all Turkic populations (see Turkic cline).

The components have been named accordingly, but please note that they do not necessarily reflect recent ancestors. For example, it is a good hypothesis that the Anatolia component was present in the Balkans even in ancient times, so one need not seek a recent Anatolian ancestor to explain its presence in a Balkan individual. Similarly for the Balkans component in Anatolia, which may reflect the diverse Balkan peoples that have settled in Anatolia since the dawn of history, so a present-day inhabitant of Anatolia need not seek a recent Balkan ancestor.

Likewise, the Turkic component is only part of the genetic makeup of the Turkic speakers who arrived in Anatolia, since those probably also carried West Eurasian population elements picked up en route from Siberia to Anatolia.

The way to interpret your results is to see whether you have an excess or deficiency of any component relative to your ethnic group. For example, an Anatolian Greek may have a higher Anatolia/Balkans ratio than a Balkan Greek and likewise for a Balkan vs. Anatolian Turk; the latter may also have a variable Turkic component which will reflect differential Central Asian input.
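The comparison described above amounts to subtracting your group's average from your own result, component by component. Here is a minimal Python sketch; the function name is hypothetical and the percentages are made up for illustration, not taken from the spreadsheet.

```python
def deviation_from_group(individual, group_average):
    """Return individual minus group average, per component (percentage points)."""
    components = sorted(set(individual) | set(group_average))
    return {
        c: individual.get(c, 0.0) - group_average.get(c, 0.0)
        for c in components
    }


# Made-up numbers, for illustration only.
my_result = {"Balkans": 55.0, "Anatolia": 40.0, "Turkic": 5.0}
my_group = {"Balkans": 65.0, "Anatolia": 33.0, "Turkic": 2.0}

deviation = deviation_from_group(my_result, my_group)
for component, diff in deviation.items():
    label = "excess" if diff > 0 else "deficiency" if diff < 0 else "same"
    print(f"{component}: {diff:+.1f} ({label})")
```

Small differences are best treated as noise; only a clear excess or deficiency relative to the group is worth interpreting.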

Saturday, September 3, 2011

No affiliation to

A thread at the 23andMe forum suggests that the Dodecad Project is somehow related to another site. There is no such relation, apart from the fact that the admin of that site, for whatever reason, chose to co-opt the exact names of the 12 ancestral components of the Dodecad Project for his own purposes, creating unnecessary confusion between the two projects.

I would like to state that I have absolutely no relation whatsoever to that outfit, and that use of any DIYDodecad files without my permission is forbidden without attribution, as stated in the DIYDodecad README file.

UPDATE: The admin of the site has kindly asked for permission, and I've replied that he is entirely free to use the DIYDodecad materials, provided that:
  1. It is clearly visible to someone considering using the utility that it is based on Dodecad v3
  2. It is also clear that the results may differ from the same-named components of Dodecad v3 produced by the Dodecad Project
I consider the misunderstanding to be over, and hopefully the tool will be online again soon.

Tuesday, August 30, 2011

Balkan averages (August 2011)

Since my last call for more participation from the Balkans, I was able to create a new Bulgarian_D sample of 5 participants. Together with the Greek_D sample, the Balkans_D sample of non-Greek, non-Bulgarian project members, the Behar et al. (2010) Romanians, and the Xing et al. (2010) Slovenians (the latter on a smaller number of markers), we are beginning to get a better feel for genetic variation in the Balkans. Several other averages have been adjusted with more participation; all of them can be seen in the Dodecad v3 spreadsheet.

The table below shows the major components (>1%) in the available Balkan populations.

The Bulgarian average as it stands seems reasonably close to the Romanian one, and is characterized by balanced West/East European components; in this balance it resembles Greeks, who, however, have lower levels of both components and higher levels of the Mediterranean/West Asian/Southwest Asian components.

Slovenians contrast with Hungarians in having reverse West/East European levels, and with their neighboring Italians in having quite a bit more of the East European component, and quite a bit less of the Mediterranean one. Bulgarians/Romanians contrast with Slavic groups from eastern Europe in having less East European and more Mediterranean/West_Asian.

Hopefully before long, more participation from the western/central Balkans (Serbs, Croats, Bosnians, Montenegrins, Albanians, Slav Macedonians) will allow us to fill more holes in our understanding of the genetic landscape of Southeastern Europe.

Tuesday, August 23, 2011

Populations in need of 5 participants

Submission to the Project is currently closed, but I am often willing to include new members if they contact me before sending their data, with some information about their ancestry.

I am most likely to accept new participants from:
  • Greece, the Balkans, Italy, and West/Central Asia
  • Under-represented populations
  • Populations that are a few members short of reaching the 5-person mark, after which I can calculate an average for them.
I typically don't accept new participants of multiple ancestries; I've made DIYDodecad for just that case.

Here is a list of populations that are short 1-2 participants:

Algerian_D 4
North_African_Jews_D 4
Slovenian_D 4
Bulgarian_D 4
Danish_D 3
Moroccan_D 3
Tunisian_D 3
Mixed_Scandinavian_D 3
Serb_D 3
Austrian_D 3
Saudi_D 3
Pakistani_D 3
Tatar_Various_D 3
Palestinian_D 3

Any individuals from the Balkans are strongly encouraged to contact me, as the Balkans_D sample size of 17 can soon be broken down into specific populations if a few more individuals from different Balkan populations join the Project.

I also encourage new members to post their information in the ancestry thread.

How to make your own calculator for DIYDodecad

As I have explained in the README file of DIYDodecad, it is possible to use the software to create and distribute new calculators, based on different marker sets/ancestral populations.

(The following discussion will only be useful to other genome bloggers, or people who have experience with ADMIXTURE software).

Currently, DIYDodecad is distributed together with the 'dv3' calculator ("Dodecad v3"). This consists of a set of files:

dv3.par (The parameter file that tells DIYDodecad what to expect and what to do)
dv3.alleles (Allele names and variants)
dv3.12.F (Allele frequencies for 12 ancestral populations)
dv3.txt (Names for 12 ancestral populations)

I will now explain how you can use PLINK and ADMIXTURE to create your own calculator.

(1) Running ADMIXTURE

In the following discussion, I will assume that you have your dataset in binary PLINK format (bed/bim/fam files), that it has 123,456 markers, and you run ADMIXTURE regularly for 7 populations, e.g.:

./admixture test.bed 7

CAVEAT! The 123,456 markers must be included in the commercial platform you are targeting your calculator for. So, before you run ADMIXTURE, you must make sure that test.bed includes only markers for your chosen platform (e.g., 23andMe v3). I will assume that you have the list of markers from your commercial platform in a file (one per line), e.g., 23andMeV3.txt. You must then first do:

./plink --bfile test --extract 23andMeV3.txt --make-bed --out test

You can repeat this with other commercial marker sets, so that in the end your "test" dataset on which you run ADMIXTURE only has commercially available markers that your targeted audience will possess in their genotype files.

Actually, my main personal working sequence is to:
  • Merge (--merge-list) all reference datasets in PLINK with a --geno flag
  • Extract (--extract) commercial markers that form the intersection of 23andMe v2/v3 and Family Finder (Illumina)
  • Do linkage-disequilibrium based pruning (--indep-pairwise)
  • Finally run ADMIXTURE
It's better to do LD-based pruning after commercial marker pruning, since doing it in reverse may disrupt the physical spacing of the markers identified by --indep-pairwise.

After ADMIXTURE finishes its run, it will output a file called test.7.P; this is the allele frequencies file that you will use for your calculator, but you have to modify the order of the alleles! We will do this later.

(2) Preparing the test.alleles file

First, run the following command:

./plink --bfile test --freq --out test

This will produce a test.frq file which will be the basis of the test.alleles file. In R, do the following:

X<-read.table('test.frq', header=T)[, 2:4]

This will basically load the SNP names and minor/major alleles into the X table. We now identify the alphabetical order of the SNPs:

ORDER <- order(X[,1])

And, now we re-order X, so that SNPs are ordered alphabetically:

X <- X[ORDER,]

and, we save this as the test.alleles file

write.table(X, file='test.alleles', quote=F, row.names=F, col.names=F)

(3) Preparing the test.7.F file

The test.7.F file can be prepared from test.7.P as follows:

X <- read.table('test.7.P')
X <- X[ORDER, ]
write.table(X, file='test.7.F', quote=F, row.names=F, col.names=F)

Note that in this example test.7.P contains the output of ADMIXTURE, and test.7.F will contain the same output, but with rows re-ordered in the same way as the test.alleles file.
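For readers who prefer Python to R, the re-ordering in steps (2) and (3) can be sketched as follows. This is a minimal illustration on in-memory rows, with a hypothetical function name and K=2 instead of 7 to keep it short. One assumption to watch: Python sorts strings by code point, which may differ from R's locale-dependent ordering for identifiers of mixed case, so use the same tool for both files.

```python
def reorder_by_snp(frq_rows, p_rows):
    """Sort SNP rows alphabetically and apply the same order to the frequencies.

    frq_rows: list of (snp, allele1, allele2), in the order of test.frq
    p_rows:   list of per-component frequency lists, in the order of test.7.P
    """
    if len(frq_rows) != len(p_rows):
        raise ValueError("test.frq and the .P file must have the same number of rows")
    # Equivalent of ORDER <- order(X[,1]) in the R code above.
    order = sorted(range(len(frq_rows)), key=lambda i: frq_rows[i][0])
    alleles = [frq_rows[i] for i in order]  # contents of test.alleles
    freqs = [p_rows[i] for i in order]      # contents of the .F file
    return alleles, freqs


# Toy data: three SNPs, two ancestral components.
frq_rows = [("rs300", "A", "G"), ("rs100", "C", "T"), ("rs200", "G", "A")]
p_rows = [[0.10, 0.90], [0.25, 0.75], [0.60, 0.40]]

alleles, freqs = reorder_by_snp(frq_rows, p_rows)
for row, freq in zip(alleles, freqs):
    print(" ".join(row), " ".join(str(f) for f in freq))
```

The point of keeping a single `order` is that row i of the .alleles file and row i of the .F file must always describe the same SNP.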

(4) Preparing the test.txt file

You do that with an editor; just pick whatever names you want for your 7 ancestral populations, which, of course, should be in the same order as the corresponding frequency columns output by ADMIXTURE.

(5) Preparing test.par file

Again with your editor, for this example:


(6) Instructions to users

Do NOT distribute the DIYDodecad software itself; rather, direct your users to the Dodecad Project download page (e.g., here, for the current 2.0 version of the software). This will ensure both compliance with the terms of use of the software, and also that users have access to the most up-to-date version.

You only have to distribute test.par, test.alleles, test.7.F, and test.txt.

Your users will follow exactly the same sequence of actions as described in the Dodecad README.txt file, with the only difference that they should type 'test', rather than 'dv3' whenever it is needed.

Hopefully more genome bloggers will decide to release calculators based on their ADMIXTURE runs to the wider public. There are several reasons to do this:
  • Reduced workload
  • Wider distribution of your work in the community, since, due to privacy concerns, not everyone is willing to share their data
  • Ability to study the utility/validity of inferred components on test data and by persons other than the discoverer
  • Ability to use the advanced bychr, byseg, and target modes with your calculators

Friday, August 19, 2011

A few comments on the use of DIYDodecad 2.0

Here are some observations that might be useful to people, especially for the new byseg and target modes:

1. Finding the origin of shared segments

Until now, when you had a segment match with another customer in your testing company, you had no idea what was the origin of the shared segment. Suppose, for example, that a Russian and a German share some sequence in a region X. This could be:
  • Russian-like ancestry in the German individual
  • German-like ancestry in the Russian individual
  • Third party ancestry in both individuals
Using the new modes, if the German saw an excess of Eastern European (relative to his usual average), then he'd pick the first scenario; if he saw nothing remarkable, the second; if an excess of some component rare in both Russians and Germans (e.g., West_Asian), the third.
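The reasoning above can be written down as a rough decision rule. The Python sketch below is only an illustration: the function name, the labels, and the 15-point threshold are all hypothetical, the percentages are made up, and the third scenario is simplified to "excess of any other component" rather than one known to be rare in both populations.

```python
def likely_scenario(segment, genomewide, other_party_component, min_excess=15.0):
    """Guess which of the three scenarios a shared segment is consistent with.

    segment, genomewide: {component: percent} for the same individual
    other_party_component: component typical of the matching person's population
    min_excess: hypothetical threshold (percentage points) for a notable excess
    """
    components = set(segment) | set(genomewide)
    excess = {c: segment.get(c, 0.0) - genomewide.get(c, 0.0) for c in components}
    top = max(excess, key=excess.get)
    if excess[top] < min_excess:
        return "my_ancestry_in_other_party"   # nothing remarkable in the segment
    if top == other_party_component:
        return "other_party_ancestry_in_me"   # excess of their typical component
    return "third_party_ancestry"             # excess of some other component


# Made-up genomewide result for the German in the example.
german = {"West_European": 50.0, "East_European": 15.0,
          "Mediterranean": 25.0, "West_Asian": 10.0}

seg_a = {"West_European": 30.0, "East_European": 60.0, "Mediterranean": 10.0}
seg_b = {"West_European": 55.0, "East_European": 12.0,
         "Mediterranean": 25.0, "West_Asian": 8.0}
seg_c = {"West_European": 30.0, "West_Asian": 60.0, "Mediterranean": 10.0}

for name, seg in [("A", seg_a), ("B", seg_b), ("C", seg_c)]:
    print(name, likely_scenario(seg, german, "East_European"))
```

Any such rule is only a starting point; short segments are noisy, so the threshold should be generous.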

This is extremely important, as there is a noticeable confirmation bias in some individuals of interpreting the unusual as evidence of exotic ancestry. For example, an individual in search of Jewish ancestry may interpret segment matches with Jews as evidence for that ancestry: if he sees high Southwest_Asian ancestry in such segments, then that's a reasonable interpretation, but the shared segments could very well be interpreted as non-Jewish ancestry in the Jewish individual if they happen to be, e.g., East_European.

2. With parents' DNA

It is important to remember that each region includes both paternal and maternal DNA, and that you got a random draw of the segments your parents inherited from their own parents (your grandparents).

So, if you try to figure out where your region X came from, remember that it came from two places. So, if you see an unusual combination (e.g., Northeast_Asian + Northwest_African) that doesn't correspond well to any known population, this may mean that you got half of it from one parent, and the other half from the other.

Note also that while on genomewide analysis a child's results will often (but not necessarily) be intermediate in his ancestral components between his parents, this is not the case when looking at small segments. Suppose parent A is 50% West_Asian and 50% Mediterranean in a particular region, and parent B is 50% West_Asian and 50% West_European in the same region.

Then the child may end up with West_Asian near 100% in that region (if he happens to inherit the West_Asian segments from both parents) or near 0% (if he happens to inherit the Mediterranean/West_European ones).
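The example above can be enumerated directly. The Python sketch below assumes, for simplicity, that each parent carries exactly two segments in the region, passes one of them on with equal probability, and that there is no recombination inside the region.

```python
from fractions import Fraction
from itertools import product

# The two segments each parent carries in the region (from the example above).
parent_a = ["West_Asian", "Mediterranean"]
parent_b = ["West_Asian", "West_European"]

# Probability of each possible West_Asian share in the child's region.
outcomes = {}
for seg_a, seg_b in product(parent_a, parent_b):
    share = Fraction([seg_a, seg_b].count("West_Asian"), 2)
    outcomes[share] = outcomes.get(share, 0) + Fraction(1, 4)

for share in sorted(outcomes):
    print(f"West_Asian {float(share):.0%}: probability {outcomes[share]}")
```

Under these assumptions the child is intermediate between the parents (50% West_Asian) only half of the time; the other half is split evenly between the two extremes.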

3. With Dodecad Oracle

In general, I discourage the use of Dodecad Oracle with chromosome or segment results, for two reasons:
  • Small segments may appear more mixed than they are, because there may not be any informative SNPs in a particular region to distinguish between some of the ancestral components. So, the scale of the noise may be higher. As an experiment, you can average your segments, weighted by either the number of SNPs or their physical length, and you will come up with something close to your "genomewide" average, that will, however, be off, because of this factor.
  • From a different perspective, segments may appear less mixed, because it is less likely that you got genetic material from all ancestral populations in a small section of your DNA. Your genomewide admixture may have several non-zero components, but you are unlikely to have many non-zero components in a small region (barring the aforementioned noise), and you could very well see percentages above 80% for a single ancestral component in some regions.
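The averaging experiment mentioned in the first point can be sketched in Python as follows. The function name and the segment values are hypothetical; the weights here are SNP counts, but physical lengths could be used in the same way.

```python
def weighted_average(segments):
    """Average per-segment admixture percentages, weighted by SNP count.

    segments: list of (n_snps, {component: percent}) pairs
    """
    total = sum(n for n, _ in segments)
    if total == 0:
        raise ValueError("no SNPs in any segment")
    average = {}
    for n_snps, proportions in segments:
        for component, percent in proportions.items():
            average[component] = average.get(component, 0.0) + percent * n_snps / total
    return average


# Made-up segment results, for illustration only.
segments = [
    (500, {"West_European": 60.0, "Mediterranean": 40.0}),
    (300, {"West_European": 20.0, "West_Asian": 80.0}),
    (200, {"Mediterranean": 100.0}),
]

average = weighted_average(segments)
for component in sorted(average):
    print(f"{component}: {average[component]:.1f}%")
```

As noted above, an average built this way should land close to, but not exactly on, the genomewide estimate, because segment-level results carry more noise.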