Tuesday, June 28, 2011

Dodecad v3 results are rolling in

I have started processing individual samples in Dodecad v3; they will be posted incrementally in batches of 35 or so in the same spreadsheet as the population averages.

You can look at various population averages in the spreadsheet to compare yourself against; I have added averages for 122 populations with the full 166K SNPs, as well as several other datasets with a smaller number of SNPs.

Also, note that the averages are not yet complete: some populations have not been added yet, and some averages will be revised after outlier removal. After all that is done, I will announce it here on the blog and will create a big zip file with all the population portraits showing individual-level variation (including outliers).

Results will be posted in DOD order, but for technical reasons the Family Finder-based IDs will come at the end.

Submission to the Project is currently closed, and of course I encourage participants who have not already done so to leave a message in the ancestry thread.


Monday, June 27, 2011

Submission opportunity... is OVER

The latest submission opportunity is now over. All those who meet the eligibility criteria and who sent me their data before this announcement will get their IDs/have their data processed. Do NOT send unsolicited data after this time, as they will most likely be ignored.

Thursday, June 23, 2011

Results for DOD738 to DOD747 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Read about the eligibility criteria of the current submission opportunity. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:

Wednesday, June 22, 2011

Dodecad v3: population averages

With Dodecad v3 it is possible to use any population as test data in supervised ADMIXTURE analysis, and extract its admixture proportions in terms of the 12 ancestral components.

So, I have set up an automated job that will do just that: use pretty much every population available to me under only a few conditions:
  1. Each population must have at least 5 individuals
  2. It must have the same 166,462 SNPs on which the test is based
  3. It must not be from a group not covered by the test (e.g., Australo-Melanesians or Native Americans)
By my count, I have 141 different populations that meet these requirements. Each population is run on its own in a supervised ADMIXTURE analysis together with the 600-strong synthetic set (50 per ancestral component). These are ideal conditions to produce a high-quality comparative reference set.
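
The three conditions above amount to a simple filter over the available populations. Here is a minimal sketch, assuming a hypothetical per-population record holding a sample count, the set of genotyped SNP IDs, and a broad group label; the field names and the `EXCLUDED_GROUPS` set are illustrative only and are not the Project's actual data structures:

```python
from dataclasses import dataclass

# Hypothetical labels for groups outside the scope of the test.
EXCLUDED_GROUPS = {"Australo-Melanesian", "Native American"}
MIN_INDIVIDUALS = 5


@dataclass
class Population:
    name: str
    n_individuals: int
    snps: frozenset  # SNP IDs genotyped in this population
    group: str       # broad group label (assumed to be available)


def eligible(pop, test_snps):
    """Return True if a population meets all three conditions."""
    return (
        pop.n_individuals >= MIN_INDIVIDUALS
        and test_snps <= pop.snps          # covers every SNP used by the test
        and pop.group not in EXCLUDED_GROUPS
    )


# Toy example: 3 SNPs stand in for the 166,462 SNPs of the real test.
test_snps = frozenset({"rs1", "rs2", "rs3"})
pops = [
    Population("A", 12, frozenset({"rs1", "rs2", "rs3", "rs4"}), "West Eurasian"),
    Population("B", 4, frozenset({"rs1", "rs2", "rs3"}), "West Eurasian"),
    Population("C", 20, frozenset({"rs1", "rs2"}), "East Eurasian"),
    Population("D", 9, frozenset({"rs1", "rs2", "rs3"}), "Native American"),
]
selected = [p.name for p in pops if eligible(p, test_snps)]
```

In this toy run only population A survives: B is too small, C lacks one SNP, and D belongs to an excluded group.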

The average admixture proportions of different populations will be put in this spreadsheet as they are calculated, which will probably take several days.

Tuesday, June 21, 2011

The design of Dodecad v3

Dodecad v2 was short-lived, as I discovered a way to improve it shortly after I announced it.

The first step was to carry out an extensive K=3 ADMIXTURE analysis of about 130 different populations and about 2,000 individuals from Europe, Asia, and Africa. Using the allele frequency results of this analysis I was able to create the most comprehensive synthetic individuals to represent West Eurasians, Asians, and Sub-Saharan Africans.

Subsequently, I carried out an analysis of East Eurasian populations using the West Eurasian/Sub-Saharan synthetic individuals as controls, as well as an analysis of Sub-Saharan populations using the West Eurasian/Asian individuals as controls.

In East Eurasia, I was able to infer the existence of two components, one centered in the extreme northeast, another in the southeast, with many other populations arrayed between these two extremes:

In Sub-Saharan Africa, the primary division was between San, Mbuti, and Biaka Pygmies (whom I have called "Palaeo-Africans") and the rest (Yoruba, Mandenka, and Bantu, "Neo-Africans"):

Now, I had four synthetic "framing populations": Neo-Africans, Palaeo-Africans, Northeast Asians and Southeast Asians, created from hundreds of individuals from several different populations:
  1. I did not have to choose a particular population (e.g., Chinese) to represent East Asia
  2. I did not have to aggregate individuals from populations with variable levels of non-East Asian admixture
I now used my South Asian populations, together with Neo-African, West Eurasian, Northeast and Southeast Asian controls to extract a South Asian specific component:

Armed with these 5 synthetic "framing" populations, I carried out a K=12 analysis with my West Eurasian, South Asian, and North/East African populations (1,247 individuals; 69 populations):

And, finally, I generated 50 synthetic individuals from each of the 12 inferred components to create a dataset of 600 individuals that will be the basis of Dodecad v3.
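
This last step can be sketched as follows. It is only an illustration, assuming each synthetic genotype is drawn independently per SNP as two allele draws at the component's frequency (no linkage between SNPs); the function name `make_zombies` and the random toy frequencies are hypothetical, and the real input would be the per-component allele frequencies output by ADMIXTURE:

```python
import random


def make_zombies(freqs, n_individuals, rng):
    """Draw diploid genotypes (0, 1 or 2 allele copies per SNP)."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]


rng = random.Random(0)
# Toy allele frequencies: 12 components x 1,000 SNPs.
component_freqs = [[rng.random() for _ in range(1000)] for _ in range(12)]

# 50 synthetic individuals per component -> 600 in total.
dataset = []
for k, freqs in enumerate(component_freqs):
    for genotype in make_zombies(freqs, 50, rng):
        dataset.append((k, genotype))
```

Each entry of `dataset` pairs a component index with one synthetic genotype, so the component labels can be reused directly when the set serves as the reference in a supervised run.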

Below is the table of Fst divergences:

The following MDS plots show the first 10 dimensions of variation of these individuals:

Finally, here is a neighbor-joining tree of the 12 components:
(to be continued)

Thursday, June 9, 2011

Cornwall, Kent, and Orkney

I just finished extracting the regional samples (Kent, Cornwall, and Orkney) from the 1000 Genomes GBR sample, and I ran a quick experiment to put them in the context of other West European populations (Irish, German, Dutch, French, and Scandinavian).
I will probably try to integrate some of these into the new version of Dodecad, and try some other things, so perhaps v2 may not be the next stage of the Project. Hopefully the extra wait will be worth it.


Here is also a supervised ADMIXTURE analysis with the standard K=10 components. Please note that, as this was not done with the same SNPs as the standard K=10 results, these proportions are not directly comparable to other K=10 results of the Project.

Results for DOD729 to DOD737 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Read about the eligibility criteria of the current submission opportunity. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:

Wednesday, June 8, 2011

Dodecad v2

This is an announcement of the new generation of Dodecad ancestry analysis. In comparison to the standard K=10 used since the beginning of the Project:
  1. Participants' data are now used to enrich the set of reference populations and to help define new ancestral components
  2. Rather than choosing arbitrary reference populations, I employ a very large set of individuals to capture allele frequencies and then create synthetic individuals ("panmictic zombies") that embody these frequencies; more on this below.
  3. Results for unrelated project participants will be reported in a separate post, using my new technique of converting unsupervised ADMIXTURE runs into supervised ones. Hence, Project participants can expect to receive new K=12 results; moreover, the fact that this will be done in supervised mode means that it is no longer necessary to process samples in small batches of 10 or so. All current unrelated participants will receive their results in one go, and only future submissions will be processed in batches.
This analysis utilizes results from Project participants (populations with _D endings), as well as synthetic individuals summarizing allele frequencies of East Eurasians, Sub-Saharan Africans, and South Indians (populations with _Z endings).

The framing populations (_Z)

The following _Z populations were included:
  • Sub_Saharan_Z: Bantu, Yoruba, Mandenka, San, and Pygmies from HGDP-CEPH
  • South_Indian_Z: North Kannadi, Sakilli from Behar et al. (2010), AP_Madiga, AP_Mala, TN_Dalit from Xing et al. (2010), Bhil, Chenchu, Kurumba, Satnami, Madiga, Mala, Kamsali, Onge, Great_Andamanese from Reich et al. (2009)
  • Sino_Tibetan_Z: Yizu, Naxi, Han, Tujia from HGDP-CEPH
  • Altaic_Z: Tu, Xibo, Mongola, Daur, Hezhen, Oroqen, Yakut from HGDP-CEPH, and Evenk, Buryat from Rasmussen et al. (2010)
  • Siberian_Other_Z: Selkup, Ket, Yukagir, Nganasan, Koryak, Chukchi from Rasmussen et al. (2010)
  • Southeast_Asian_Z: Dai, Lahu, Miaozu, Cambodians from HGDP-CEPH, Khmer-Cambodian, Thai from Xing et al. (2010), and Singapore Malay from the Singapore Genome Variation Project
The 12 inferred ancestral components

Results of the ADMIXTURE analysis defining the new K=12 components of the Project can be seen below:
Raw proportions can be found in a spreadsheet. There are also population portraits in a zip file, showing individual-level variation.

The 12 components are:
  • West_Asian
  • East_European
  • West_European
  • East_Asian
  • Mediterranean
  • Northwest_African
  • North_Eurasian
  • Arabian
  • Inner_Asian
  • Sub_Saharan
  • East_African
  • South_Indian
Once again, I have tried to make these as neutral and appropriate as possible, but don't forget that they are simply descriptive labels to aid memory. For example, the Arabian component is centered on Saudis, Yemenis, and Yemen Jews, the Inner Asian component on the Altaic synthetic population, and so on.

The Fst divergences between the 12 components can be seen in the spreadsheet and also below:

A different way of showing them is via a neighbor-joining tree. Note, however, that this is not a replacement for the Fst table above, which alone fully preserves the inter-population relationships:
We can also plot the first few MDS dimensions using synthetic individuals from the 12 components; again, these capture variation only partially:

What comes next?

Hopefully quite soon, I will:
  1. Report new v2 results for all project participants
  2. Report new v2 proportions for many other populations not included here
Project members who still haven't received their results (during the ongoing submission opportunity) can expect to receive K=10 standard results, and they will receive their new v2 results later.

Tuesday, June 7, 2011

Results for DOD713 to DOD728 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Read about the eligibility criteria of the current submission opportunity. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:

Monday, June 6, 2011

Panmictic zombies

I had previously developed a new way of choosing "framing" populations for ADMIXTURE analyses. Such populations are necessary in order to tease out genetic contributions from outside one's region of interest.

I proposed to create a meta-population which included a single individual from a large number of populations (e.g., East Eurasians). Use of such a meta-population has two interesting properties:
  1. It solves the problem of "which" population to choose (e.g., Han, Miaozu, She, Mongol?) as a framing reference: the meta-population captures features of all candidate populations
  2. It avoids the generation of population-specific clusters in the "framing" individuals, as no two individuals from a single population are included!
There is, however, a problem with the technique as I first described it: it only uses a single individual from each population to compose the meta-population! Hence, it is potentially sensitive to the presence of outliers, and, in any case, it throws away most of the data.

More recently, I proposed the use of "zombies" from allele frequency data output by ADMIXTURE. These zombies are, in a sense, the opposite of what I am trying to do here, since they represent ancestral components that exist in mixed form in present-day individuals.

Instead, we can generate "panmictic zombies" by composing a dataset of all individuals from a region of interest; we then calculate allele frequencies over the combined set, and then generate synthetic individuals based on these allele frequencies.

This technique has several advantages:
  1. It is extremely resilient to outliers, as the presence of a few outliers only shifts allele frequencies by a little, and no actual outliers are included in the "panmictic zombie" population
  2. It draws on the full set of individuals and hence does not depend on the random sample one chooses from each population
  3. It avoids the creation of population-specific clusters
  4. It speeds up the technique I introduced for converting unsupervised ADMIXTURE runs to supervised ones substantially: populations framing the region of interest (e.g., East Eurasians, Sub-Saharan Africans, South Asians, in the case of West Eurasia) can be "folded" into a number of panmictic zombie populations a priori.
Point #4 is extremely important for practitioners:
  1. It is great not to include every single East Eurasian sample in ADMIXTURE analyses when you are trying to infer patterns of variation in Europe; this is a much better solution than the ad hoc approach adopted by some of ignoring East Eurasia altogether when studying patterns of variation in Europe!
  2. It is great not to worry after several hours of ADMIXTURE analysis whether upping K by +1 will finally produce added resolution in your region of interest, or split, e.g., Mbuti from Biaka Pygmies, which is hardly of relevance if one is trying to study East Asian or European variation
Panmictic zombies can be further fine-tuned: the allele frequencies can be calculated in many different ways:
  1. Over all individuals
  2. Averaged over all population averages (to account for different sample sizes)
  3. Weighted average over all populations (to account for different demographic sizes of source populations)
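
The three ways of calculating allele frequencies listed above can be compared on toy data. This is a minimal sketch, assuming genotypes are coded as 0/1/2 copies of an allele with no missing data; the function names and the demographic weights are hypothetical:

```python
def allele_freq(genotypes):
    """Allele frequency per SNP from 0/1/2 genotypes of one population."""
    n = len(genotypes)
    return [sum(col) / (2 * n) for col in zip(*genotypes)]


def pooled_freq(populations):
    """Option 1: frequency over all individuals, ignoring population labels."""
    everyone = [g for pop in populations for g in pop]
    return allele_freq(everyone)


def weighted_freq(populations, weights):
    """Options 2 and 3: weighted average of per-population frequencies."""
    total = sum(weights)
    per_pop = [allele_freq(pop) for pop in populations]
    return [
        sum(w * f[j] for w, f in zip(weights, per_pop)) / total
        for j in range(len(per_pop[0]))
    ]


# Toy data: two populations, two SNPs; rows are individuals.
pop_a = [[2, 0], [2, 0], [2, 0], [2, 0]]   # 4 individuals
pop_b = [[0, 2]]                           # 1 individual
pops = [pop_a, pop_b]

f_all = pooled_freq(pops)              # dominated by the larger sample
f_equal = weighted_freq(pops, [1, 1])  # each population counts once
f_demo = weighted_freq(pops, [1, 9])   # hypothetical demographic sizes
```

With equal weights the result is the average of population averages (option 2); with weights proportional to assumed demographic sizes it becomes option 3. Either frequency vector can then be used to draw the synthetic individuals.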
A first experiment

The following MDS plot shows a population ("Synthetic", red) generated from a sample of different HGDP East Eurasian populations.
It's important to note that while "Synthetic" appears to be closer to the Tu population, that does not mean that it is interchangeable with the Tu!

The "Synthetic" population is much more diverse, as it encompasses parts (alleles) from all the different populations of the set; because of the averaging process, these happen to coincide with the Tu in the first two dimensions of the MDS plot.

Saturday, June 4, 2011

Projecting Pakistan populations on West Eurasian PCA

In a first post I showed that ADMIXTURE output allele frequencies could be used to create synthetic individuals corresponding to the ancestral components ("zombies"), and that these artificial populations could be used both to improve performance and to avoid the creation of population-specific clusters in ADMIXTURE runs. I was hence able to infer the composition of several idiosyncratic populations in terms of the K=10 components of the Dodecad Project.

In a second post, I showed that "zombies" could be created even in the absence of allele frequencies, if one had admixture proportions only for the ancestral components. I was thus able to reconstruct synthetic individuals corresponding to the ANI/ASI of Reich et al. (2009). I was further able to confirm the West Asian origin of Ancestral North Indians. In a subsequent post, I used these synthetic ANI/ASI populations on groups of Pakistan, showing the main West Asian/ANI origin of the Caucasoid component in South Asia. Moreover, I confirmed that the Ancestral South Indians are related (but distantly) to the Onge from the Indian Ocean.

In this post, I run principal components analysis on the Pakistan populations; the Hazara were excluded because of their high East Eurasian admixture. Here is the unsupervised PCA:

First, you notice that the first dimension is dominated by the Kalash, a very distinctive population because of its long-term isolation. The second dimension is dominated by a Sindhi outlier, which, if you consult a Sindhi population portrait from a previous experiment, is revealed to have substantial Sub-Saharan admixture.

Obviously, this is no good, as our first two dimensions are not anthropologically interesting. If we are interested in learning about the origins of populations, knowing that there are a few Sindhi individuals with Sub-Saharan admixture, or that the Kalash are highly isolated is not helpful.

We can run PCA again, but this time we project populations of interest onto the PCA plot of the West Eurasian control populations:
It is fairly obvious that the populations of Pakistan fall on the South Asia-West Asia line. There are small deviations from the cline:
  • Balochis and Brahuis deviate towards the SW Asian component, which is consistent with their ADMIXTURE results.
  • The position of the non-Indo-European Burusho and Indo-Aryan Sindhi populations on either side of the cline is consistent with a little SW Asian component in the Sindhi and a little North European component in the Burusho, which pull them away from the cline in the expected directions.
Moreover, the relative position of the Pakistan populations along this cline is preserved.
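
The projection idea can be sketched in a few lines. This is only an illustration under stated assumptions: plain mean-centered PCA via SVD on a 0/1/2 genotype matrix, with no per-SNP normalization of the kind dedicated tools may apply, and with random toy genotypes standing in for the West Eurasian controls and the Pakistan samples. The function names `fit_pca` and `project` are hypothetical:

```python
import numpy as np


def fit_pca(reference, n_components=2):
    """Fit PCA on reference genotypes (individuals x SNPs)."""
    mean = reference.mean(axis=0)
    # SVD of the centered matrix; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(samples, mean, axes):
    """Project samples onto axes learned from the reference only."""
    return (samples - mean) @ axes.T


rng = np.random.default_rng(0)
# Toy stand-ins: 100 reference individuals and 10 test individuals, 500 SNPs.
ref_freqs = rng.uniform(0.05, 0.95, size=500)
reference = rng.binomial(2, ref_freqs, size=(100, 500)).astype(float)
test = rng.binomial(2, ref_freqs, size=(10, 500)).astype(float)

mean, axes = fit_pca(reference)
ref_coords = project(reference, mean, axes)
test_coords = project(test, mean, axes)   # test data never influence the axes
```

Because the axes are computed from the reference set alone, an isolated population or a single outlier among the projected samples cannot capture a dimension for itself.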

Using the West Eurasian "zombies" is thus useful not only for ADMIXTURE but also for principal components analysis; in the latter it is helpful because:
  1. It avoids domination by very isolated/inbred populations and/or outliers
  2. It is possible to create synthetic "zombie" populations with absolutely equal sample sizes, hence removing a source of bias (some residual bias may persist, e.g., if one used a component centered on 5 "real" individuals to create a "zombie" population of 100, then the effective sample is not really 100)

Thursday, June 2, 2011

Ancestral South Indian (ASI) in context

I have taken the synthetic ASI population together with 25 HapMap-3 Chinese (CHB), 16 HGDP Papuans, and 9 Reich et al. (2009) Onge from the Andaman Islands to determine its relationships with other Eurasian populations.

Below is an MDS plot which shows that ASI does not appear to be particularly close to any of the other populations.

I have also run a supervised K=3 ADMIXTURE analysis that treated the ASI population as test data and CHB, Onge, and Papuan as parental populations; the ASI turned out 100% "Onge", consistent with the idea that ASI is distantly related to Onge, although closer to them than to the other two populations.

It should be noted, however, that the similarity of ASI to Onge is not unexpected, since:
  • Onge was used by Reich et al. (2009) to infer admixture proportions of Indian Cline populations, which were (in turn):
  • used by myself to infer allele frequencies of ASI, and then:
  • used by myself to create a synthetic population of ASI individuals.
So, the Onge-ness of ASI is contingent upon the accuracy of Reich et al. (2009), but, anyway, the population of my ASI "zombies" seems to pass a second test of being reasonable stand-ins for ASI in the sense of that paper.
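
The chain described above (admixture proportions, then allele frequencies, then synthetic individuals) can be illustrated with a toy calculation. This is a sketch of one possible approach, not necessarily the procedure used here: it assumes each population's allele frequency is a mixture of two component frequencies weighted by its admixture proportions, and solves for the component frequencies by ordinary least squares. All inputs below are simulated, and the proportions are arbitrary toy values rather than those of Reich et al. (2009):

```python
import numpy as np

rng = np.random.default_rng(1)
n_pops, n_snps = 8, 200

# Hypothetical ground truth, used only to simulate toy inputs.
true_p = rng.uniform(0.05, 0.95, size=(2, n_snps))    # rows: ANI, ASI
ani_share = np.linspace(0.35, 0.75, n_pops)
q = np.column_stack([ani_share, 1.0 - ani_share])     # pops x 2 proportions
pop_freqs = q @ true_p                                # pops x SNPs

# Least-squares estimate of component frequencies from q and pop_freqs.
est_p, *_ = np.linalg.lstsq(q, pop_freqs, rcond=None)
est_p = np.clip(est_p, 0.0, 1.0)                      # keep valid frequencies

# Synthetic "ASI" individuals: two allele draws per SNP.
asi_zombies = rng.binomial(2, est_p[1], size=(25, n_snps))
```

In this noise-free toy case the component frequencies are recovered exactly; with real data, sampling error in the population frequencies and any error in the published proportions would carry over into the estimates, which is the dependence noted above.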

Wednesday, June 1, 2011

Results for DOD704 to DOD712 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Read about the eligibility criteria of the current submission opportunity. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars: