Tuesday, January 31, 2012

'K12b' and 'K7b' calculators

I am releasing two new calculators with K=12 and K=7 components, named 'K12b' and 'K7b'. You can scroll down to the bottom if you are just interested in the downloads, or read on.


New Features

The new 'K12b' calculator is an update of the previous K12a one, inferred using all the new samples submitted during the last submission opportunity. The 12 components are still roughly the same, although their allele frequencies may have changed a bit, so existing participants can expect slightly altered results, and new participants in the Project more so, since their data are now contributing to the creation of the new tool. Non-participants can, of course, use the new calculator with DIYDodecad.

I have also taken the opportunity to make some minor tweaks. I am releasing population portraits for K12b (which were lacking in K12a), and I've changed my visualization code so that the sample IDs of non-Dodecad populations can now be seen in the barplots. This may be useful for anyone else using these reference populations, since it makes it quick to identify potential outliers in them.

I have also decided to use normalized median admixture proportions for the populations. For example, if 5 individuals in a population have 0, 0, 0.2, 0.5, 10.0% of a particular component, then the average is 2.14%, but the median is 0.2%. By using the median, the proportions become less susceptible to the presence of outliers (such as the 10%). However, if the median is calculated over every component separately, it is no longer guaranteed that the components will add up to 100%; this can be addressed by re-normalizing them (scaling them by a constant factor) so that they do. I believe that use of the normalized median will not only give better proportions that are less susceptible to outliers, but will also improve results of the new Dodecad Oracle for K12b.
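
The normalized median described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the calculators: the function names and the three-component toy population are hypothetical, and only the five values 0, 0, 0.2, 0.5, 10.0 come from the example in the text.

```python
from statistics import mean, median

def component_medians(samples):
    """Median of each component, taken separately across individuals.

    samples: list of per-individual lists of component percentages.
    """
    n_components = len(samples[0])
    return [median(ind[k] for ind in samples) for k in range(n_components)]

def normalized_median(samples):
    """Per-component medians rescaled by a constant factor to sum to 100."""
    medians = component_medians(samples)
    total = sum(medians)
    if total == 0:
        raise ValueError("all component medians are zero; cannot normalize")
    return [100.0 * m / total for m in medians]

# The single-component example from the text.
values = [0, 0, 0.2, 0.5, 10.0]
print(round(mean(values), 2))  # 2.14
print(median(values))          # 0.2

# Hypothetical toy population: 5 individuals, 3 components, rows sum to 100.
# The first column reuses the values above, including the 10% outlier.
toy_population = [
    [0.0, 60.0, 40.0],
    [0.0, 55.0, 45.0],
    [0.2, 59.8, 40.0],
    [0.5, 49.5, 50.0],
    [10.0, 50.0, 40.0],
]
raw = component_medians(toy_population)        # sums to 95.2, not 100
proportions = normalized_median(toy_population)
print([round(p, 2) for p in proportions])      # roughly [0.21, 57.77, 42.02]
```

In the toy data the raw medians add up to 95.2%, which is why the rescaling step is needed before the proportions can be shown as a barplot or compared with an individual's results.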

At the same time I am also releasing 'K7b' which is an update of the existing 'eurasia7' calculator and which has been built on exactly the same dataset as 'K12b' but at a lower (K=7) level of detail.

Information on K7b


Information spreadsheet.

Normalized median admixture proportions barplot for all included populations (a high resolution version of this is included in the download bundle):


Table of Fst divergences:

Neighbor-joining tree (based on above):

Information on K12b


Information spreadsheet.


Normalized median admixture proportions barplot for all included populations (a high resolution version of this is included in the download bundle):

Table of Fst divergences:

Neighbor-joining tree (based on above):

Multidimensional Scaling Plots of K12b and K7b


I have created MDS plots using synthetic individuals representing the 12 ancestral components of K12b and the 7 ancestral components of K7b. By including both in the same plot, one gets an idea of the relationship of the components at different resolutions. The first 10 dimensions can be seen below:

Here is a blowup of the main West Eurasian groups from the plot of the first two dimensions:

Some observations:

  • The Atlantic_Med component which is bi-modal in Basques and Sardinians occupies the apex of the figure; this makes sense, since Southwest Europe is quite distant (along land routes) from both Asia and Africa.
  • The Caucasus component is surrounded by most of the others; this is consistent with my theory elaborated in The womb of nations: how West Eurasians came to be.
  • The Atlantic_Baltic component (from K=7) is intermediate between the Atlantic_Med and North_European components.
  • Similarly, the West_Asian component (from K=7) is intermediate between the Caucasus and Gedrosia components; the Gedrosia component diverges in the direction of the Asian groups (not shown in this figure), and in particular of South Asians. This divergence can also be seen in the plot of dimension #3.
  • The Northwest_African component diverges in the direction of Sub-Saharan Africans.

Technical Details


A dataset of 268 populations/3,115 individuals was assembled. A total of 265,519 SNPs are shared by the various source datasets and the 23andMe v2/v3 and Family Finder platforms. Distant relatives were removed iteratively: one individual was dropped from each within-population pair whose RATIO in the IBD analysis performed in PLINK 1.07 was 2.5 or greater, or exceeded the mean plus two standard deviations. A total of 2,675 individuals remained. 4 individuals were removed for low genotyping rate (less than 97%). 264,328 SNPs remained after removal of SNPs with less than 97% genotyping rate or less than 1% minor allele frequency. 166,770 SNPs remained after linkage disequilibrium-based pruning (--indep-pairwise 200 25 0.4). The final set thus consisted of 2,671 individuals/268 populations/166,770 SNPs. Ancestral populations (components) were inferred using ADMIXTURE 1.21, with K=7 and K=12 and default parameters.

No individuals were removed from the source datasets, except in the case of the Armenians_Y sample, where one individual (ID: armenia3) was dropped because he/she was the same as a Dodecad Project participant.
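
The per-SNP quality filter described in the Technical Details (at least 97% genotyping rate and at least 1% minor allele frequency) can be illustrated with a small Python sketch. The actual filtering was done in PLINK; the function below, its name, its genotype encoding (0/1/2 copies of one allele, None for a missing call) and the toy SNPs are all assumptions made for illustration.

```python
def snp_passes_qc(genotypes, min_call_rate=0.97, min_maf=0.01):
    """Return True if a SNP meets both thresholds.

    genotypes: one entry per individual, giving the number of copies
    (0, 1 or 2) of an arbitrarily chosen allele, or None if missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # no calls at all: nothing to estimate a frequency from
    call_rate = len(called) / len(genotypes)
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1 - allele_freq)  # frequency of the rarer allele
    return call_rate >= min_call_rate and maf >= min_maf

# Hypothetical SNPs typed in 100 individuals.
snp_ok = [1] * 50 + [0] * 50            # fully called, allele frequency 0.25
snp_missing = [None] * 4 + [1] * 96     # call rate 0.96, below 97%
snp_rare = [1] + [0] * 99               # allele frequency 0.005, below 1%

print(snp_passes_qc(snp_ok))       # True
print(snp_passes_qc(snp_missing))  # False
print(snp_passes_qc(snp_rare))     # False

# Individual counts reported in the text: 2,675 after relative removal,
# minus 4 with low genotyping rate.
print(2675 - 4)  # 2671
```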

Downloads


K7b population portraits, spreadsheet, and DIYDodecad files.
K12b population portraits, spreadsheet, and DIYDodecad files.

Dodecad Oracle (K12b edition) can be downloaded from here. Please read the instructions of the previous Oracle on how to use this tool. Note that the number of populations is now 223.
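
As a rough guide to reading Oracle output, here is a minimal Python sketch of ranking reference populations by their distance to an individual's admixture proportions. It assumes the distance is the plain Euclidean distance between vectors of percentages, which this post does not state; the Oracle itself is an R tool, and the function name, population names and three-component proportions below are hypothetical.

```python
import math

def rank_populations(individual, populations, k=10):
    """Return the k populations closest to `individual`, nearest first.

    individual:  list of component percentages for one person.
    populations: dict mapping population name -> list of percentages,
                 with components in the same order as `individual`.
    Distance is Euclidean, which is an assumption of this sketch.
    """
    distances = [
        (name, math.dist(individual, proportions))
        for name, proportions in populations.items()
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

# Hypothetical three-component example.
reference = {
    "PopA": [48.0, 32.0, 20.0],
    "PopB": [20.0, 30.0, 50.0],
    "PopC": [50.0, 30.0, 20.0],
}
person = [50.0, 30.0, 20.0]

ranking = rank_populations(person, reference, k=3)
for name, dist in ranking:
    print(name, round(dist, 4))
```

Under this assumption a smaller number means a closer match, and a distance of 0 means the individual's proportions coincide with the population's.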

To use either calculator with DIYDodecad, with your 23andMe or Family Finder data, follow the instructions in the README file, but substitute 'K12b' or 'K7b' for 'dv3'.

Project participant results for both K7b and K12b are found in the spreadsheets in the Individual Results tab.

Terms of Use


You are free to use K12b and K7b, including all downloaded files, for any non-commercial purpose, as long as you attribute them to the Dodecad Project and to Dienekes Pontikos as follows:

The [K7b/K12b] admixture calculator is courtesy of Dienekes Pontikos and was developed as part of the Dodecad Ancestry Project; more information here.

41 comments:

  1. Thanks Dienekes. Thanks to you I have been able to pin my Grandma to the Canaries and the Portuguese and Genovese that fled the Inquisition and used the Canaries as a trampoline to the New World. Thanks for including these important tribes in your calculations! The fact that her top 20 tribes are Spanish reinforces her origins.

    My Grandma's numbers:

    > DodecadOracle(c(3.86, 3.53, 4.40, 1.18, 27.46, 20.35, 0.84, 3.46, 4.94, 3.56, 9.63, 16.79), k=100)
    [,1] [,2]
    [1,] "N_Italian_D" "26.6522"
    [2,] "O_Italian_D" "26.6973"
    [3,] "Portuguese_D" "26.7879"
    [4,] "Canarias_1KG" "27.1561"
    [5,] "Extremadura_1KG" "27.3508"
    [6,] "Galicia_1KG" "27.4992"
    [7,] "North_Italian" "28.0773"
    [8,] "Baleares_1KG" "28.7716"
    [9,] "TSI30" "28.7764"
    [10,] "Murcia_1KG" "29.1074"
    [11,] "Tuscan" "29.8301"
    [12,] "Castilla_Y_Leon_1KG" "29.8409"
    [13,] "French_D" "30.0924"
    [14,] "C_Italian_D" "30.1616"
    [15,] "French" "30.1715"
    [16,] "Romanians" "30.7116"
    [17,] "Spanish_D" "30.7832"
    [18,] "Andalucia_1KG" "31.13"
    [19,] "Cataluna_1KG" "31.1358"
    [20,] "Bulgarian_D" "31.2317"
    [21,] "Spaniards" "31.451"
    [22,] "Bulgarians_Y" "31.5578"
    [23,] "Castilla_La_Mancha_1KG" "32.3409"
    [24,] "Cantabria_1KG" "32.7841"
    [25,] "Mixed_Germanic_D" "33.1934"
    [26,] "Valencia_1KG" "33.401"
    [27,] "Greek_D" "33.6421"
    [28,] "Aragon_1KG" "33.9655"
    [29,] "Sicilian_D" "34.0399"
    [30,] "Hungarians" "34.1695"

  2. Hi Dienekes, I would like you to make an updated version, or something similar, of this West Eurasian plot that you did some time ago:

    http://4.bp.blogspot.com/-_6XAIk6ygtg/Tcqj7WCS_jI/AAAAAAAADsU/WJDG6R2XnH0/s1600/waeu.png

  3. The clusters and FSTs sometimes do not make much sense.

    For example, the FST between North Europeans and Sub-Saharan African is closer than between Atlantic-Med and Sub-Saharan African. We all know Atlantic-Med populations are closer to Sub-Saharan Africans. Although this is just a generalized admixture run.

    Also, the FST between Southeast Asian and South Asian is more distant than between East Asian and South Asian, so I think, to be more correct, the "Southeast Asian" label is more of a Miao/"pure Mongoloid"-like cluster, rather than deep Southeast Asian related to the Indonesians, etc.

  4. >>We all know Atlantic-Med populations are closer to Sub-Saharan Africans. Although this is just a generalized admixture run.<<

    You are confusing Atlantic_Med _populations_ (that contain admixture with other components, including African and Near Eastern ones) with the Atlantic_Med _component_, which does not; also, the center of the A_M component is not closer to Africa along land routes than the center of the N_E component is.

    Also, a difference of 0.183 vs. 0.185 is not significant.

  5. >>Also, the FST between Southeast Asian and South Asian is more distant than between East Asian and South Asian, so I think, to be more correct, the "Southeast Asian" label is more of a Miao/"pure Mongoloid"-like cluster, rather than deep Southeast Asian related to the Indonesians, etc.<<

    The "Southeast Asian" component is not limited to Mainland Southeast Asia; it is also strongly modal in the Philippines where an Austronesian language is spoken. Given that it is also strongly modal in Cambodians (Austro-Asiatic) and Dai (Tai-Kadai), it certainly appears to be a general Southeast Asian component, rather than limited to any language group (and the Miao are actually not particularly strong in it in comparison to these other populations).

  6. The Portuguese bar plot doesn't seem correct, I'm DOD074 and I can't find myself in it.


    Gedrosia 4.2%
    Siberian 0%
    Northwest African 6%
    Southeast Asian 0.4%
    Atlantic Med 42.3%
    North European 25.9%
    South Asian 0%
    East African 1.4%
    Southwest Asian 5.2%
    East Asian 0.8%
    Caucasus 13.8%
    Sub Saharan 0%

  7. I keep getting French at the very top of my new runs. It would be great to know what province this is from. If this is originating in Normandy, that would strengthen one of my hypotheses of how my R1a1a ancestor entered Britain. It either had to be via a Viking settlement in Norfolk (surname Packham), an Anglo-Saxon invasion (surname de Paecam/Domesday Book), or a Norman invasion (surname de Peche). I can trace my paternal ancestors to 1790 Sussex, England but no further. Eventually it would be great if you could make something like a Great Britain9, a France9, a Spanish9, an Italian9, a German9, etc. That way the instruments will be far more precise and will help us tap into our deep roots and verify our family history. Again, thank you for all your hard work!

  8. Which Filipino populations did you use as a matter of interest?

    >>The "Southeast Asian" component is not limited to Mainland Southeast Asia; it is also strongly modal in the Philippines where an Austronesian language is spoken. <<

  9. >>Which Filipino populations did you use as a matter of interest?<<

    3 Project participants.

  10. I've noticed an enormous difference between the V3's and k12a/K12b's Caucasian components distribution. Do you think that the addition of the Yunusbayev's samples was the cause of it? Do you think that a better sampling of other regions will also change the big picture in such a dramatic way as the Yunusbayev's did ?

  11. I don't know what you mean by "enormous difference". In the case of Western Europe, for which you are probably interested, the levels of the Caucasus/West_Asian components are definitely not comparable across calculators, since 'dv3' used a "West European" category that the other calculators do not, and which was shifted toward West Asia relative to the other "East_European" component.

    The additional step of distant relative filtering may also have influenced overall component levels in some cases. Its overall effect is to preclude the creation of population-specific components. Such filtering did take place during 'dv3' for populations with known sets of apparently distantly related individuals (such as the HGDP Arab groups), but it was done with a uniform procedure across all populations in K12a/b.

  12. Isn't there a portrait for Romanians_D? I can't seem to find it. The only one visible seems to be the one with the 2 Gypsy-admixed individuals.

  13. I know that the West Asian component in DV3 is different from the Caucasian one in K12a/b, but still they are both modal in the Caucasus. That being said, their distribution in Western Europe is completely different between one and the other.

    A short example:

    DV3

    British_D: 6.7%
    Irish_D: 6.7%
    Portuguese_D: 3.6%
    Spanish_D: 2.4%

    Georgians: 72.3%


    K12b

    British_D: 1.3%
    Irish_D: 0.2%
    Portuguese_D: 9.7%
    Spanish_D: 8.8%

    Georgians: 73.9%


    As you can observe the figures for the modal value remain the same while in Western Europe the figures are inverted.

  14. >>Isn't there a portrait for Romanians_D? I can't seem to find it. The only one visible seems to be the one with the 2 Gypsy-admixed individuals.<<

    No, since 4 individuals were included in the final dataset, and only portraits for populations with 5+ individuals were included.

    >>I know that the West Asian component in DV3 is different from the Caucasian one in K12a/b, but still they are both modal in the Caucasus. Being said that, their distribution in Western European is completely different between one and the other.<<

    Being modal in the Caucasus is irrelevant, since, as I've explained 'dv3' includes 'West_European' which is partly West_Asian in the scheme of K12a/K12b.

  15. Also, the main West Asian related component in the British Isles is the 'Gedrosia' one, rather than the 'Caucasus' one.

    The relationship between Northwestern Europeans in 'dv3' was evidenced by presence of 'West_European' component in South Asia, and in K12a/K12b by presence of 'Gedrosia' component in NW Europe. The 'Gedrosia' component is related to the 'Caucasus' one.

    K12a/K12b is more accurate than 'dv3', due to pruning of distant relatives and being a pure pan-Eurasian test, whereas South Asians were treated as a framing population in 'dv3'.

  16. Yes I know the runs are different, but my initial question was about the Yunusbayev samples and its contribution to the K12a/b analysis. So, were they very influential or not?

  17. To answer that question, one would have to repeat the ADMIXTURE experiment without them. One thing's for certain, that if one did so, they wouldn't get anything similar to 'dv3', because of the different methodology, irrespective of any additional influences of the Yunusbayev et al. samples.

  18. I'm not sure why Cambodians have 1.8% Siberian admixture (the modal among Nganasan), whereas it is absent among Chinese and Japanese. I understand that the Cambodians have many components however I think it really shows the inaccuracy of correctly assigning the correct mixtures of clusters to populations using ADMIXTURE.

    Replies
    1. Justin,

      I'd imagine that this is related to the historic migration of Tibetan Plateau populations, down into South East Asia.

  19. Why was I with ID DOD322 removed from a Dodecad population?

  20. >>Why was I with ID DOD322 removed from a Dodecad population?<<

    See Technical Details

    >>I'm not sure why Cambodians have 1.8% Siberian admixture (the modal among Nganasan), whereas it is absent among Chinese and Japanese. I understand that the Cambodians have many components however I think it really shows the inaccuracy of correctly assigning the correct mixtures of clusters to populations using ADMIXTURE.<<

    Cambodians have low levels of Australasian affiliation (see 'world9'); in a calculator that does not include an Australasian reference, that component gets distributed among the other Asian components.

  21. K12b Google table, which can create pie charts for individuals using the menu item "Visualize" -> "Pie":

    https://www.google.com/fusiontables/DataSource?snapid=S384135gcun

    For example, the pie chart for me, DOD215:

    https://www.google.com/fusiontables/embedviz?&containerId=gviz_canvas&q=select+col0%2C+col203+from+2843508+&qrs=where+col0+%3E%3D+&qre=+and+col0+%3C%3D+&qe=+limit+12&viz=GVIZ&t=PIE&width=1000&height=600

  22. Also, for those who would like to create bar graphs of individual K12b components across a large set of individuals, to detect outliers and more easily see the regional origins of DOD members:

    https://www.google.com/fusiontables/DataSource?snapid=S384137Oyh3

  23. Dienekes, can I ask what happened to the S_Italian_D population?

  24. http://dodecad.blogspot.com/2012/01/fastibd-analysis-of-iberia-france-italy.html?showComment=1327260909492#c2724621936280724721

  25. I followed the instructions and got the results for the final admixture proportions but I wonder why the results present different components from the spreadsheet/graph:

    East_European Gedrosia
    West_European Siberian
    Mediterranean NW-African
    Neo_African SE-Asian
    West_Asian Atlantic-Med
    South_Asian N-European
    Northeast_Asian S-Asian
    Southeast_Asian E-African
    East_African SW-Asian
    Southwest_Asian E-Asian
    Northwest_African Caucasus
    Palaeo_African Sub-saharan

  26. Because you used 'dv3' and not 'K12b'

  27. I am interested to know more about the difference between K7b and K12b, since they give me pretty different results. My ethnic group on paper would be CEU. My V3 results seem pretty compatible with my K12b. It seems, however, when you pare down the groups to only 7 types, I become much more Middle Eastern than I ever imagined. I believe I have one ancestor who was a Sephardic Jew from Denmark at 1/64 inheritance. And 1/64 that may come from the Canary Islands, but also from Denmark. A lot of English, Scottish and Danish. Some Irish and German and French Huguenot. And some of what you would maybe call Neolithic Farmers from Sweden (t2a1a mtDNA). My dad's K7b says he could have some Moroccan and/or Mozabite but at around 2% or less.

    Mine says this for the first few:


    # Primary Population (source) Secondary Population (source) Distance
    1 91.5% Orkney (1000Genomes) + 8.5% Bedouin (HGDP) @ 0.59
    2 66.8% Polish (Dodecad) + 33.2% Extremadura (1000Genomes) @ 0.64
    3 69.4% Polish (Dodecad) + 30.6% Murcia (1000Genomes) @ 0.68
    4 73.2% Polish (Dodecad) + 26.8% Canarias (1000Genomes) @ 0.72
    5 68.1% Polish (Dodecad) + 31.9% Portuguese (Dodecad) @ 0.77
    6 95.3% Kent (1000Genomes) + 4.7% Yemenese (Behar) @ 0.8
    7 67.4% Polish (Dodecad) + 32.6% Galicia (1000Genomes) @ 0.81
    8 92.9% Irish (Dodecad) + 7.1% Bedouin (HGDP) @ 0.81
    9 92.9% British (Dodecad) + 7.1% Jordanians (Behar) @ 0.81
    10 92.9% British (Dodecad) + 7.1% Palestinian (HGDP) @ 0.85

    Doug McDonald classified me as 100% British, although it shows some Middle Eastern on the X chromosome.... Are any of these results typical for your average British person? I know not to take the percentages too literally, but I just want to know if some of this could be happening for many mostly British people.

    Replies
    1. Yes I would also like to know the difference. Very good question. I also get different results.

    2. I second this! in k7 i am 26% siberian and 70% NE asian, while in k17 I am 26% SE asian and 70% NE asian. (numbers are not accurate but give you the big picture here). Why? I'd appreciate your time to explain this.

  28. I am wondering where my Scottish ancestry is. I know it's there as I am descended from the Orr clan. I get Cornish/Welsh and some Irish as well. I have a bit of German that shows up but really my Scottish results seem low. What is really the difference between Scots and other Celtic groups on a genetic level?

  29. In your K7b Admixture calculator my result includes Southern 0.13%. What population group is it?

  30. In my K7b admixture calculator result it includes Southern 0.13%, among others. What population group is Southern?

  31. I recently tried the K7b Oracle option 2 on gedmatch.com and my grandma got like 41% Canarian. Only two out of the dozens I analyzed got more Canarian than her. One is a pure Canarian and another is of known Canarian descent from Uruguay and is my grandma's 3rd cousin. I was wondering where those Canarian samples are specifically from. Were they from Las Palmas, Canaries or Gran Canaria, Canaries? That may explain the disparity in percentages between the three subjects, since the three subjects in question seem to be from different islands.

  32. But then again on calculation 1 of the Dodecad K7b Oracle X my grandma shows up with around 60% while the other two subjects I analyzed show up in the 90 percent range. I suppose this is because Grandma's ancestors mixed with Native and Tropical African, so her percentage would necessarily be lower than those of pure Canarians. But there seems to be no doubt that her main component is Canarian. Makes sense.

  33. I'm not understanding this at all... I'm not understanding where the ancestral terms are coming from... can someone please elaborate for me? I only have my one kit and it's not processed yet, so I can't get any more info than I have now...

    Population:

    Gedrosia- 10.70%
    Northwest_African- 1.47%
    Southeast_Asian- 0.26%
    Atlantic_Med- 38.10%
    North_European- 42.95%
    South_Asian- 0.42%
    Caucasus- 6.04%

  34. I'd like to know how the

    Primary Population (source) Secondary Population (source) Distance

    Is figured

    # Primary Population (source) Secondary Population (source) Distance
    1 97.6% German_V + 2.4% Kalash @ 0.96
    2 96.8% German_V + 3.2% Avar @ 1
    3 96.9% German_V + 3.1% Lak @ 1.04
    4 96.8% German_V + 3.2% Tabassaran @ 1.05
    5 93.4% CEU_V + 6.6% Avar @ 1.06
    6 93.4% CEU_V + 6.6% Lak @ 1.09
    7 96.8% German_V + 3.2% Tadjik @ 1.09
    8 97% German_V + 3% Lezgin @ 1.11
    9 97.3% German_V + 2.7% Pashtun @ 1.11
    10 93.4% CEU_V + 6.6% Lezgin @ 1.12
    11 93.3% CEU_V + 6.7% Tabassaran @ 1.15
    12 96.9% German_V + 3.1% Chechen @ 1.19
    13 97.5% German_V + 2.5% Pathan @ 1.2
    14 93.1% CEU_V + 6.9% Chechen @ 1.2
    15 97.8% German_V + 2.2% Brahui @ 1.21
    16 97.8% German_V + 2.2% Balochi @ 1.25
    17 97.7% German_V + 2.3% Burusho @ 1.26
    18 96.9% German_V + 3.1% Adygei @ 1.29
    19 97.2% German_V + 2.8% Ossetian @ 1.29
    20 97.8% German_V + 2.2% Makrani @ 1.31

    Last time I checked, the smaller the @ xxxx number is, the closer the relation is.

    The Kalash have little to do with Germany and yet according to row 1 I am 97.6% German and 2.4% Kalash with a 0.96 distance.

  35. Dienekes, I suggest you come over to Bulgaria, find a way to take a look at EVERY single Bulgarian-born man here, and if you find at least one man with even a small North African admixture, just let me know, because this is total nonsense (I could use way rougher words here). And I'd like to know how you ended up with these results.

  36. This comment has been removed by the author.

  37. I can paper trace my mother's ancestors to Catalonia, Sant Feliu de Guixols. The oldest ancestor I have knowledge of is Onofre Basart, born abt. 1495. After him comes a long line of Catalans, but no approximation even mentions this group. I always get approximations to Basques, Italians and French. Why would that be?
