Dodecad Ancestry Project: July 2011

Friday, July 29, 2011

DIYDodecad 1.01

I've made a couple of very small changes; download links have been updated.

- I've added even more detailed instructions in the README file. If you follow the instructions carefully, you will be able to run the program. I am assuming only a very basic level of computer competence: the ability to read and follow instructions, to type, and to install and start programs.

- I've added a new "progress" mode in addition to "silent" and "verbose" (the default). The "progress" mode shows the temporary solutions as the program is running, so it can be fun to watch how the admixture proportions converge to the final solution.

Note that even if the admixture proportions appear to stay the same in successive iterations, they are not necessarily unchanged. For example, a printed 0.23% in the temporary solution may be slowly changing from 0.23001% to 0.23002%, etc. With the default convergence criteria, the final solution is guaranteed to be very close to the one you would get if you let the program run forever.
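This display effect is easy to reproduce in R. The snippet below is only an illustration: the two values are the ones from the example above, and the two-decimal format is an assumption about how the progress output is printed, not something taken from the DIYDodecad source.

```r
# Two successive (hypothetical) estimates of one component, in percent.
prev <- 0.23001
curr <- 0.23002

# Printed with two decimals, the two iterations look identical...
shown_prev <- sprintf("%.2f%%", prev)
shown_curr <- sprintf("%.2f%%", curr)
print(c(shown_prev, shown_curr))

# ...even though the underlying value is still moving.
same_on_screen <- identical(shown_prev, shown_curr)
still_changing <- (curr - prev) > 0
print(c(same_on_screen = same_on_screen, still_changing = still_changing))
```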

Tuesday, July 26, 2011

standardize.r added to DIYDodecad v 1.0 bundle

If you downloaded the rar file for DIYDodecad v 1.0, note that it was missing the standardize.r file. This has now been added, and you can download it again; you only need to get the standardize.r file, not the entire rar archive.

Monday, July 25, 2011

Do-It-Yourself Dodecad v 1.0

(UPDATE: There is a new 2.0 version of the software)

I have decided to release a Do-It-Yourself calculator (and mirror), for several reasons:

- So that people who don't want to send me their data can still get their results
- So that people can estimate admixture proportions in all their relatives, as relatives can't be accepted in the Project
- So that people of mixed ancestry can get their results, as there have been limited opportunities for them to submit their data to the Project so far
- Most importantly, so that I won't have to do-it-myself ;-)

You need a Windows or Linux 32bit/64bit machine to run DIYDodecad. The instructions should be easy to follow, but if you encounter any bugs or have any problems, feel free to leave a comment or write to me (dodecad@gmail.com).

Of course, I will continue to ask for people to send me their data in the future: the calculator is made possible in part because of their contributions. Project participants have added benefits, such as the more specialized Clusters Galore or regional analyses.

If you are a project participant, you can still try DIYDodecad; you will get slightly different results than the ones you already have, because DIYDodecad does not use the same "random seed" as ADMIXTURE, and has a different default convergence criterion (maximum log-likelihood change of 1e-6 between successive EM iterations). You will also need the program in the future, as more "calculators" will be disseminated for it.
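To make the convergence criterion concrete, here is a minimal sketch of an EM loop that stops when the log-likelihood changes by less than 1e-6 between iterations. It is not the DIYDodecad code: the update rule is the textbook EM step for estimating one individual's admixture proportions with fixed allele frequencies, the data are simulated, and all names (freq, geno, loglik, em_admixture) are hypothetical.

```r
# Toy EM for one individual's admixture proportions with FIXED allele
# frequencies. Simulated data; every name here is hypothetical.
set.seed(1)
K <- 3      # number of ancestral components
M <- 500    # number of SNPs
freq <- matrix(runif(K * M, 0.05, 0.95), nrow = K)  # freq[k, j]: allele frequency
true_q <- c(0.6, 0.3, 0.1)
geno <- rbinom(M, size = 2, prob = as.vector(true_q %*% freq))  # allele counts 0/1/2

# Binomial log-likelihood of the genotypes given proportions q.
loglik <- function(q, geno, freq) {
  p <- as.vector(q %*% freq)
  sum(geno * log(p) + (2 - geno) * log(1 - p))
}

em_admixture <- function(geno, freq, tol = 1e-6, max_iter = 10000) {
  K <- nrow(freq)
  M <- ncol(freq)
  q <- rep(1 / K, K)                 # start from equal proportions
  ll_old <- loglik(q, geno, freq)
  for (iter in seq_len(max_iter)) {
    p <- as.vector(q %*% freq)
    # Expected number of allele copies attributed to each component.
    a <- (q * freq) %*% (geno / p)
    b <- (q * (1 - freq)) %*% ((2 - geno) / (1 - p))
    q <- as.vector(a + b) / (2 * M)
    ll_new <- loglik(q, geno, freq)
    # Stop when the log-likelihood change drops below the tolerance.
    if (abs(ll_new - ll_old) < tol) break
    ll_old <- ll_new
  }
  list(q = q, iterations = iter, loglik = ll_new)
}

fit <- em_admixture(geno, freq)
print(round(100 * fit$q, 2))   # estimated proportions, in percent
print(fit$iterations)
```

A looser tolerance stops earlier and a tighter one runs longer, which is one reason two programs with different defaults can report slightly different proportions for the same data.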

So, if you don't want to/can't join the Project, you can still get your Dodecad v3 results; you can also try the Dodecad Oracle with them. Also, feel free to leave a comment in this post with your results.

Related: some background on the creation of DIYDodecad

Tuesday, July 19, 2011

The Dodecad Oracle v1

Here is a fun little tool that tests the Dodecad v3 admixture proportions of an individual against all the reference populations, and also against the best pairwise combinations of these populations.

You need to install R to use it, and then download the program and double click on the file DodecadOracleV1.RData that can be found within the rar file. You will then be faced with a command prompt where you can enter the following commands:

Examining which populations are available

Just enter

X[,1]

You will see a list of 227 populations. You can use these population IDs in the next section.
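If the full list is too long to scan by eye, ordinary R functions can search it. The sketch below uses a small made-up stand-in called X_toy, since the real X only exists after loading the .RData file; the assumption is simply that the first column holds the population IDs, as X[,1] suggests. The IDs and numbers are invented for illustration.

```r
# Hypothetical stand-in for the Oracle's X: first column = population ID,
# remaining columns = component percentages (all values invented).
X_toy <- data.frame(
  pop = c("PopA_D", "PopB_D", "PopC", "PopD"),
  c1  = c(70, 72, 40, 20),
  c2  = c(25, 24, 45, 30),
  c3  = c(5, 4, 15, 50),
  stringsAsFactors = FALSE
)

n_pops   <- nrow(X_toy)                           # how many populations
d_pops   <- grep("_D$", X_toy[, 1], value = TRUE) # IDs ending in "_D"
has_popc <- "PopC" %in% X_toy[, 1]                # exact ID check

print(n_pops)
print(d_pops)
print(has_popc)
```

With the real data loaded, the same calls on X (for example grep("_D$", X[,1], value=TRUE)) should list the Project-participant populations.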

Which populations are closest to a particular population?

Enter:

DodecadOracle("British_D")
[,1] [,2]
[1,] "British_D" "0"
[2,] "British_Isles_D" "0.9798"
[3,] "Cornwall_1KG" "1.1533"
[4,] "Kent_1KG" "2.265"
[5,] "Irish_D" "3.7643"
[6,] "Dutch_D" "4.5354"
[7,] "Mixed_Germanic_D" "6.8971"
[8,] "Norwegian_D" "11.3111"
[9,] "Orkney_1KG" "12.4652"
[10,] "Orcadian" "12.8195"

If you want to find e.g., the top-30 populations, rather than just the top-10, enter:

DodecadOracle("British_D", k=30)

Which populations are closest to a particular individual?

Enter the admixture proportions of the individual (from the "Individual results" tab of the spreadsheet) as follows:

DodecadOracle(c(4.6, 16.7, 33.6, 0, 23.2, 0.4, 0.6, 1.6, 0.7, 14.1, 4.5, 0.2))
[,1] [,2]
[1,] "Ashkenazi_D" "3.7908"
[2,] "Ashkenazy_Jews" "4.1473"
[3,] "Morocco_Jews" "6.338"
[4,] "S_Italian_Sicilian_D" "12.5443"
[5,] "Sephardic_Jews" "13.5067"
[6,] "C_Italian_D" "14.4554"
[7,] "Sicilian_D" "14.7469"
[8,] "S_Italian_D" "15.748"
[9,] "Tuscan_X" "15.9981"
[10,] "O_Italian_D" "16.1474"

Once again, you can specify k=30, if you desire the 30 top matching populations instead of the default 10.
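For the curious, a ranking like the ones above can be sketched in a few lines of R. This is not the Oracle's code: the distance is assumed here to be plain Euclidean distance between vectors of percentages, and toy_ref, toy_oracle and the numbers in them are invented for illustration.

```r
# Hypothetical reference table: one row per population, one column per component.
toy_ref <- rbind(
  PopA = c(60, 30, 10),
  PopB = c(55, 35, 10),
  PopC = c(10, 20, 70),
  PopD = c(30, 30, 40)
)

toy_oracle <- function(target, ref, k = 10) {
  # Accept either a population ID or a numeric vector of percentages.
  if (is.character(target)) target <- ref[target, ]
  stopifnot(length(target) == ncol(ref))
  # Euclidean distance from the target to every reference row (an assumption).
  d <- sqrt(rowSums(sweep(ref, 2, target)^2))
  head(round(sort(d), 4), k)
}

by_name   <- toy_oracle("PopA", toy_ref, k = 3)
by_vector <- toy_oracle(c(58, 32, 10), toy_ref, k = 2)
print(by_name)
print(by_vector)
```

As in the real tool, a population queried by name comes back first with distance 0, and k only controls how many rows are kept.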

Mixed Mode

You use mixed mode by adding mixedmode=T in any of the commands. The program then considers all pairs of populations, and for each one of them calculates the minimum distance to the sample in consideration, and the admixture proportions that produce it; population pairs where the distance to one of the two populations is smaller than to any admixture of the two are ignored.
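The pairwise search described above can be sketched as follows. Again, this is an illustration under assumptions, not the Oracle's code: distances are taken to be Euclidean, the best mix of two populations is found by projecting the sample onto the line between them, and toy_ref, toy_mixed and all numbers are hypothetical.

```r
# Hypothetical reference populations (rows) and a hypothetical individual.
toy_ref <- rbind(
  PopA = c(80, 10, 10),
  PopB = c(10, 80, 10),
  PopC = c(10, 10, 80),
  PopD = c(40, 40, 20)
)
target <- c(58, 30, 12)

toy_mixed <- function(target, ref) {
  pairs <- combn(nrow(ref), 2)
  out <- list()
  for (i in seq_len(ncol(pairs))) {
    ia <- pairs[1, i]
    ib <- pairs[2, i]
    a <- ref[ia, ]
    b <- ref[ib, ]
    denom <- sum((a - b)^2)
    if (denom == 0) next                     # identical populations
    # Weight of a in the mix w*a + (1-w)*b that is closest to the target.
    w <- sum((target - b) * (a - b)) / denom
    # Outside (0, 1) one population alone is at least as close as any mix: ignore.
    if (w <= 0 || w >= 1) next
    d <- sqrt(sum((target - (w * a + (1 - w) * b))^2))
    label <- sprintf("%.1f%% %s + %.1f%% %s",
                     100 * w, rownames(ref)[ia],
                     100 * (1 - w), rownames(ref)[ib])
    out[[label]] <- d
  }
  if (length(out) == 0) return(numeric(0))
  sort(unlist(out))
}

res <- toy_mixed(target, toy_ref)
print(round(res, 4))
```

In this toy case the PopB/PopD and PopC/PopD pairs are dropped, because the best "mix" for each would sit at one end of the line, which is the situation the ignore rule is meant to catch.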

Example:

DodecadOracle("Pathan",mixedmode=T)
[,1] [,2]
[1,] "Pathan" "0"
[2,] "84.8% Pakistani + 15.2% Urkarah" "1.075"
[3,] "84% Pakistani + 16% Stalskoe" "1.1555"
[4,] "63.9% TN_Brahmin + 36.1% Urkarah" "1.6669"
[5,] "32.4% Urkarah + 67.6% Meghawal" "2.3516"
[6,] "56.3% INS + 43.7% Urkarah" "2.4901"
[7,] "11.5% Adygei + 88.5% Pakistani" "2.6245"
[8,] "82.4% Sindhi + 17.6% Stalskoe" "2.6318"
[9,] "62.9% AP_Brahmin + 37.1% Urkarah" "2.7322"
[10,] "11.2% Lezgins + 88.8% Pakistani" "2.7749"

Mixed mode should be used with caution; more than anything else, it shows how similar apparent "mixes" can be achieved by different combinations of ancestry. Nonetheless, it may prove somewhat useful. For example, the above results suggest that Pathans can be viewed as a mix of other South Asian populations and populations from the eastern Caucasus, a suggestion that was arrived at independently by the Project using different methods.

Here is another example:

DodecadOracle("Assyrian_D",mixedmode=T)
[,1] [,2]
[1,] "Assyrian_D" "0"
[2,] "83.9% Armenians_16 + 16.1% Yemen_Jews" "1.7829"
[3,] "89.1% Armenian_D + 10.9% Saudis" "2.1624"
[4,] "84.3% Armenians_16 + 15.7% Saudis" "2.2884"
[5,] "88.9% Armenian_D + 11.1% Yemen_Jews" "2.2983"
[6,] "83.8% Armenian_D + 16.2% Bedouin" "4.1579"
[7,] "72.2% Armenian_D + 27.8% Syrians" "4.1841"
[8,] "23.4% Georgians + 76.6% Iraq_Jews" "4.2418"
[9,] "76.2% Armenians_16 + 23.8% Bedouin" "4.332"
[10,] "61.5% Armenians_16 + 38.5% Syrians" "4.4019"

This reaffirms the close relationship of Assyrians to Armenians that has been noticed in the project and by others, and it also shows that Assyrians differ from Armenians in a Southwestern Asian direction, consistent with their Semitic language.

Or, African Americans:

DodecadOracle("ASW",mixedmode=T)
[,1] [,2]
[1,] "ASW" "0"
[2,] "81.3% Hausa + 18.7% N._European" "2.3891"
[3,] "18.4% Orkney_1KG + 81.6% Hausa" "2.4031"
[4,] "18.5% Argyll_1KG + 81.5% Hausa" "2.4268"
[5,] "18.4% Orcadian + 81.6% Hausa" "2.4657"
[6,] "80.5% Igbo + 19.5% N._European" "2.5031"
[7,] "80.6% Brong + 19.4% N._European" "2.523"
[8,] "18.6% CEU + 81.4% Hausa" "2.5938"
[9,] "19.1% Argyll_1KG + 80.9% Brong" "2.6197"
[10,] "19% Orkney_1KG + 81% Brong" "2.6274"

I don't know that much about the slave trade, but I believe that Ghana was an important part of it?

Another thing to watch for is that some populations are represented by more than one sample, so they appear to be mixtures of themselves, which is not very informative, e.g., Spanish_D:

DodecadOracle("Spanish_D",mixedmode=T)
[,1] [,2]
[1,] "Spanish_D" "0"
[2,] "7.9% French_Basque + 92.1% IBS" "0.8713"
[3,] "68.9% IBS + 31.1% Spaniards" "1.0377"
[4,] "98.8% IBS + 1.2% Irish_D" "1.2959"
[5,] "1.2% British_Isles_D + 98.8% IBS" "1.3018"
[6,] "1.2% British_D + 98.8% IBS" "1.3019"
[7,] "99% IBS + 1% Norwegian_D" "1.3046"
[8,] "1.2% Cornwall_1KG + 98.8% IBS" "1.3048"
[9,] "98.8% IBS + 1.2% Kent_1KG" "1.3142"
[10,] "2.2% French_D + 97.8% IBS" "1.3179"

To deal with these problems, you must "edit" the X matrix if you want to exclude some populations. For example, if you want to exclude "Spaniards" and "IBS", you must enter:

X <- X[!(X[,1] %in% c("IBS", "Spaniards")),]

Note that after such an edit you must relaunch the program if you want to get the original matrix back. Alternatively, save a copy before editing, like this:

Z<-X

and then retrieve it like this:

X<-Z
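The order of these steps matters, so here is the whole cycle on a small made-up matrix: back up first, then filter, then restore. X_toy and Z_toy are hypothetical stand-ins for X and Z, and the filter is written so that it does not depend on how many rows the matrix currently has.

```r
# Hypothetical stand-in for X (first column = population ID; values invented).
X_toy <- cbind(pop = c("PopA", "PopB", "PopC", "PopD"),
               c1  = c("60", "55", "10", "30"))

Z_toy <- X_toy                                   # 1. back up BEFORE editing

drop_ids <- c("PopB", "PopC")                    # populations to exclude
# 2. keep every row whose ID is not in drop_ids; drop = FALSE keeps a matrix
#    even if only one row survives.
X_toy <- X_toy[!(X_toy[, 1] %in% drop_ids), , drop = FALSE]
remaining <- X_toy[, 1]
print(remaining)

X_toy <- Z_toy                                   # 3. restore the original
print(nrow(X_toy))
```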

Friday, July 1, 2011

Results up to DOD764 are posted (+portraits, Indo-Iranians etc.)

The results can be found in the spreadsheet

Submission to the Project is currently closed, and of course I encourage participants who have not already done so to leave a message in the ancestry thread.

This completes the results for all Project participants who joined during the latest submission opportunity.

The population averages are finalized (for the time being), but I will occasionally update the _D populations as more participants join the Project and/or I discover cases of fraud in terms of ancestry self-reporting.

Population Portraits

Finally, the population portraits have been uploaded (here and here). For example, here is the Nganassan one, showing three distinctive outliers:

A colorful view of the Nepalese, showing the co-existence of South-Asian-like and East-Asian-like individuals:

Note that there are also some portraits of populations not included in the averages. For example, here are the Onge:

The Onge from the Indian Ocean are outside the area covered by the populations used to create the Dodecad v3, and show mixed "South Asian" and "South-East Asian" affiliations. They are probably a good example of case #4.

Indo-Iranian Origins

Here is the population portrait of the Kurds:

I have long noticed that all Indo-Iranian populations possess some of the "South Asian" component. The origin of that component is difficult to ascertain, as it is a composite of "North Indian" and "South Indian" ancestral components, related to West Asians and Onge respectively.

What also seems interesting is that the "South Asian" component is closer to the "West Asian" one than to any of the other West Eurasian components, while many South Asian individuals have substantial levels of the "West Asian" component itself.

The occurrence of "South Asian" in non-negligible levels seems to track the Indo-Iranian world quite well: it is found at about 1/10 in Iranians and Kurds, and also occurs widely in Central Asia, where its true ancient levels were probably much higher due to the substantial presence of east Eurasian elements in the area today. It even occurs at non-trace levels in people who have been part of historical Persian empires such as those from the eastern Caucasus (compare Lezgins and Azerbaijan Jews with Georgians and Adygei, and Iranians/Kurds with Turks, Cypriots, Syrians, and Armenians).

These patterns can be well explained, I believe, if we accept that Indo-Iranians are partially descended not only from the early Proto-Indo-Europeans of the Near East, but also from a second element that conceivably had "South Asian" affiliations. The most likely candidate for the "second element" is the population of the Bactria Margiana Archaeological Complex (BMAC). The rise and demise of the BMAC fits well with the relative shallowness of the Indo-Iranian language family and its 2nd millennium BC breakup, and the BMAC has been assigned an Indo-Iranian identity on other grounds by its excavator. As climate change led to the decline and abandonment of BMAC sites, its population must have spread outward: to the Iranian plateau, the steppe, and into South Asia, reinforcing the linguistic differentiation that must have already begun over the extensive territory of the complex.

The proposed Indo-Iranian homeland, transitional between the West and the South, would explain both:

- the presence of the "West Asian" component in South Asians (contrast e.g., Kashmiri Pandits with other Indians and south Indian Brahmins with non-Brahmin south Indians), and also
- the "South Asian" component in Iranians and Iranian-admixed Central Asian Turkic speakers

In their westward march, the Iranians would acquire an excess of West and Southwest Asian components (which would reduce their "South Asian" one), while in their southward march, the Indo-Aryans would acquire an excess of the South Asian component (which would reduce their "West Asian" one).
