Showing posts with label Zombies. Show all posts

Thursday, October 20, 2011

Comparing different ADMIXTURE runs using Zombies

My idea of using zombies with ADMIXTURE is the gift that keeps on giving. Remember that "zombies" are synthetic individuals created from ADMIXTURE output, representing the K inferred ancestral components. They can be viewed as hypothetical ancestral individuals representing each of these K components without any admixture from any of the others.

An interesting problem that often comes up is to compare across different ADMIXTURE runs. I can think of at least three different applications of this:
  1. To compare components across different K; for example, how does a "West Asian"-centered component at K=5 differ from a similarly-centered component at K=12?
  2. To compare components across different datasets; for example, how does a "West Asian"-centered component inferred from an existing dataset (e.g., the current Dodecad v3) differ from a "West Asian"-centered one from a new dataset (e.g., the upcoming Dodecad v4, which will also be trained on the valuable new populations of Yunusbayev et al. 2011)?
  3. To compare components across different projects; there has been a proliferation of different ancestry projects since the launching of Dodecad nearly a year ago, and since all of them use slightly different individuals/SNPs/terminology, it is quite useful to be able to gauge how one component from one project maps onto other components in other projects.
As proof of concept, I took the MDLP calculator from the Magnus Ducatus Lituaniae Project and generated 50 zombies for each of its 7 ancestral components:
  1. Scandinavian
  2. Volga_Region
  3. Altaic
  4. Celto_Germanic
  5. Caucassian_Anatolian_Balkanic
  6. Balto_Slavic
  7. North_Atlantic
I then inferred the ancestry of the MDLP zombies using Dodecad v3, and vice versa. Since Dodecad v3 also includes populations (e.g., Africans) not considered by MDLP, I did not try to map those onto MDLP.


I will comment on the MDLP-to-dv3 mapping:
  1. The MDLP "Scandinavian" component appears to be West/East European with a little Mediterranean and a little Northeast Asian
  2. The MDLP "Volga_Region" component appears to be East European with some Northeast Asian
  3. The MDLP "Altaic" component is West Asian+Northeast Asian+Southeast Asian. Note that in Dodecad v3, the Northeast Asian component peaks at Chukchi, Nganasan, and Koryak, and most other east Eurasian populations have much less of it
  4. The MDLP "Celto-Germanic" component is (surprisingly) Mediterranean-dominated. One possible interpretation is that, in the context of MDLP, this captures one aspect of the difference between Southwestern and Northeastern Europe (higher Mediterranean in the former), whereas the...
  5. ... MDLP "North-Atlantic" component seems to be entirely West European, and is capturing a different aspect of east-west variation in Europe.
  6. The MDLP "Balto-Slavic" appears the reverse of the "Celto-Germanic" with lower Mediterranean and reversed East/West European
  7. Finally, the MDLP "Caucassian_Anatolian_Balkanic" component is predictably mainly West Asian, but with a little Mediterranean and Southwest Asian as well
A different way of comparing the different components is to include them all in a joint MDS plot, or calculate various types of distances between them (e.g., Fst).
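As a rough illustration of the distance-based comparison, here is a minimal sketch that computes a pairwise Fst from two vectors of component allele frequencies. It assumes a simple Nei-style estimator (ratio of summed heterozygosity differences, equal weight per component, no sample-size correction); the estimator used for the figures in this post is not stated, and the function name and example frequencies are hypothetical.

```python
import numpy as np

def nei_fst(p1, p2):
    """Nei-style Fst between two components, given per-SNP allele frequencies."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity of the pooled pair, per SNP.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within each component, per SNP.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    denom = h_total.sum()
    if denom == 0.0:
        return 0.0  # no variation at any SNP, so no measurable divergence
    return float((h_total - h_within).sum() / denom)

# Hypothetical frequencies for two components at four SNPs.
comp_a = [0.10, 0.40, 0.80, 0.55]
comp_b = [0.30, 0.35, 0.60, 0.50]
fst_ab = nei_fst(comp_a, comp_b)
```

Applying the function to every pair of components from both projects gives a distance matrix that can be fed to MDS or a neighbor-joining program.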

For example, the first couple of dimensions are dominated by the African/Asian components of Dodecad v3 that are not present in MDLP. Notice, however, the position of "Altaic", right where one might expect to find it between West and East Eurasians.

Limiting ourselves to only European populations, we obtain:

It appears that the "North_Atlantic" component may be centered on a small number of related individuals.

I encourage other genome bloggers to try their own hand at comparing their components with those of other projects, or even their own. This process will be made possible if people using ADMIXTURE follow the simple instructions to convert their output for use with DIYDodecad.

Once Dodecad v4 is off the ground, and if I find time to fully automate the process, I will perhaps try to map all my past calculators (i.e., the initial K=10, Dodecad v3, 'bat', 'euro7', 'weac', 'africa9') onto the new gold standard of the Project.

PS: This analysis was done on ~63k SNPs in common between MDLP and Dodecad v3

Wednesday, June 8, 2011

Dodecad v2

This is an announcement of the new generation of Dodecad ancestry analysis. In comparison to the standard K=10 used since the beginning of the Project:
  1. Participants' data are now used to enrich the set of reference populations and to help define new ancestral components
  2. Rather than choosing arbitrary reference populations, I employ a very large set of individuals to capture allele frequencies and then create synthetic individuals ("panmictic zombies") that embody these frequencies; more on this below.
  3. Results for unrelated project participants will be reported in a separate post, using my new technique of converting unsupervised ADMIXTURE runs into supervised ones. Hence, Project participants can expect to receive new K=12 results; moreover, the fact that this will be done in supervised mode means that it is no longer necessary to process samples in small batches of 10 or so. All current unrelated participants will receive their results in one go, and only future submissions will be processed in batches.
This analysis utilizes results from Project participants (populations with _D endings), as well as synthetic individuals summarizing allele frequencies of East Eurasians, Sub-Saharan Africans, and South Indians (populations with _Z endings)

The framing populations (_Z)

The following _Z populations were included:
  • Sub_Saharan_Z: Bantu, Yoruba, Mandenka, San, and Pygmies from HGDP-CEPH
  • South_Indian_Z: North Kannadi, Sakilli from Behar et al. (2010), AP_Madiga, AP_Mala, TN_Dalit from Xing et al. (2010), Bhil, Chenchu, Kurumba, Satnami, Madiga, Mala, Kamsali, Onge, Great_Andamanese from Reich et al. (2009)
  • Sino_Tibetan_Z: Yizu, Naxi, Han, Tujia from HGDP-CEPH
  • Altaic_Z: Tu, Xibo, Mongola, Daur, Hezhen, Oroqen, Yakut from HGDP-CEPH, and Evenk, Buryat from Rasmussen et al. (2010)
  • Siberian_Other_Z: Selkup, Ket, Yukagir, Nganasan, Koryak, Chuckchi from Rasmussen et al. (2010)
  • Southeast_Asian_Z: Dai, Lahu, Miaozu, Cambodians from HGDP-CEPH, Khmer-Cambodian, Thai from Xing et al. (2010), and Singapore Malay from the Singapore Genome Variation Project
The 12 inferred ancestral components

Results of the ADMIXTURE analysis defining the new K=12 components of the Project can be seen below:
Raw proportions can be found in a spreadsheet. There are also population portraits in a zip file, showing individual-level variation.

The 12 components are:
  • West_Asian
  • East_European
  • West_European
  • East_Asian
  • Mediterranean
  • Northwest_African
  • North_Eurasian
  • Arabian
  • Inner_Asian
  • Sub_Saharan
  • East_African
  • South_Indian
Once again, I have tried to make these as neutral and appropriate as possible, but don't forget that they are simply descriptive labels to aid memory. For example, the Arabian component is centered on Saudis, Yemenese, and Yemen Jews, the Inner Asian component on the Altaic synthetic population, and so on.

The Fst divergences between the 12 components can be seen in the spreadsheet and also below:

A different way of showing them is via a neighbor-joining tree. Note, however, that this is not a replacement for the Fst table above which alone fully preserves the inter-population relationships:
We can also plot the first few MDS dimensions using synthetic individuals from the 12 components; again, these capture variation only partially:



What comes next?

Hopefully quite soon, I will:
  1. Report new v2 results for all project participants
  2. Report new v2 proportions for many other populations not included here
Project members who still haven't received their results (during the ongoing submission opportunity) can expect to receive K=10 standard results, and they will receive their new v2 results later.

Monday, June 6, 2011

Panmictic zombies

I had previously developed a new way of choosing "framing" populations for ADMIXTURE analyses. Such populations are necessary in order to tease out genetic contributions from outside one's region of interest.

I proposed to create a meta-population which included a single individual from a large number of populations (e.g., East Eurasians). Use of such a meta-population has two interesting properties:
  1. It solves the problem of "which" population to choose (e.g., Han, Miaozu, She, Mongol ?) as a framing reference: the meta-population captures features of all candidate populations
  2. It avoids the generation of population-specific clusters in the "framing" individuals, as no two individuals from a single population are included!
There is, however, a problem with the technique as I first described it: it only uses a single individual from each population to compose the meta-population! Hence, it is potentially sensitive to the presence of outliers, and, in any case, it throws away most of the data.

More recently, I proposed the use of "zombies" from allele frequency data output by ADMIXTURE. These zombies are, in a sense, the opposite of what I am trying to do here, since they represent ancestral components that exist in mixed form in present-day individuals.

Instead, we can generate "panmictic zombies" by composing a dataset of all individuals from a region of interest; we then calculate allele frequencies over the combined set, and then generate synthetic individuals based on these allele frequencies.

This technique has several advantages:
  1. It is extremely resilient to outliers, as the presence of a few outliers only shifts allele frequencies by a little, and no actual outliers are included in the "panmictic zombie" population
  2. It draws on the full set of individuals and hence does not depend on the random sample one chooses from each population
  3. It avoids the creation of population-specific clusters
  4. It speeds up the technique I introduced for converting unsupervised ADMIXTURE runs to supervised ones substantially: populations framing the region of interest (e.g., East Eurasians, Sub-Saharan Africans, South Asians, in the case of West Eurasia) can be "folded" into a number of panmictic zombie populations a priori.
Point #4 is extremely important for practitioners:
  1. It is great not to include every single East Eurasian sample in ADMIXTURE analyses when you are trying to infer patterns of variation in Europe; this is a much better solution than the ad hoc approach adopted by some of ignoring East Eurasia altogether when studying patterns of variation in Europe!
  2. It is great not to worry after several hours of ADMIXTURE analysis whether upping K by +1 will finally produce added resolution in your region of interest, or split, e.g., Mbuti from Biaka Pygmies, which is hardly of relevance if one is trying to study East Asian or European variation
Panmictic zombies can be further fine-tuned: the allele frequencies can be calculated in many different ways:
  1. Over all individuals
  2. Averaged over all population averages (to account for different sample sizes)
  3. Weighted average over all populations (to account for different demographic sizes of source populations)
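The three ways of calculating the frequencies can be sketched as follows. This is only an illustration under stated assumptions: genotypes are coded as 0/1/2 copies of a counted allele, there is no missing data, and the population names, genotype values and weights are all hypothetical.

```python
import numpy as np

def pooled_freq(genotypes_by_pop):
    """Option 1: frequencies over all individuals, ignoring population labels."""
    all_genotypes = np.vstack(list(genotypes_by_pop.values()))
    return all_genotypes.mean(axis=0) / 2.0

def mean_of_pop_freqs(genotypes_by_pop, weights=None):
    """Options 2 and 3: (weighted) average of per-population frequencies."""
    names = list(genotypes_by_pop)
    per_pop = np.array([genotypes_by_pop[n].mean(axis=0) / 2.0 for n in names])
    w = None if weights is None else [weights[n] for n in names]
    # weights=None gives the plain average over populations (option 2).
    return np.average(per_pop, axis=0, weights=w)

# Hypothetical data: rows are individuals, columns are SNPs.
pops = {
    "PopA": np.array([[2, 0], [2, 0], [2, 0]]),  # three sampled individuals
    "PopB": np.array([[0, 2]]),                  # one sampled individual
}
f_pooled = pooled_freq(pops)                                         # dominated by PopA
f_equal = mean_of_pop_freqs(pops)                                    # each population counts once
f_weighted = mean_of_pop_freqs(pops, weights={"PopA": 1, "PopB": 3})  # hypothetical demographic weights
```

With these toy inputs the pooled estimate leans towards the heavily sampled population, while the other two schemes correct for sample size and for assumed demographic size respectively.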
A first experiment

The following MDS plot shows a population ("Synthetic", red) generated from a sample of different HGDP East Eurasian populations.
It's important to note that while "Synthetic" appears to be closer to the Tu population, that does not mean that it is interchangeable with the Tu!

The "Synthetic" population is much more diverse, as it encompasses parts (alleles) from all the different populations of the set, that, because of the averaging process happen to coincide with the Tu in the first two dimensions of the MDS plot.

Saturday, June 4, 2011

Projecting Pakistan populations on West Eurasian PCA

In a first post I showed that ADMIXTURE output allele frequencies could be used to create synthetic individuals corresponding to the ancestral components ("zombies"), and that these artificial populations could be used both to improve performance and to avoid the creation of population-specific clusters in ADMIXTURE runs. I was hence able to infer the composition of several idiosyncratic populations in terms of the K=10 components of the Dodecad Project.

In a second post, I showed that "zombies" could be created even in the absence of allele frequencies, if one had admixture proportions only for the ancestral components. I was thus able to reconstruct synthetic individuals corresponding to the ANI/ASI of Reich et al. (2009). I was further able to confirm the West Asian origin of Ancestral North Indians. In a subsequent post, I used these synthetic ANI/ASI populations on groups of Pakistan, showing the main West Asian/ANI origin of the Caucasoid component in South Asia. Moreover, I confirmed that the Ancestral South Indians are related (but distantly) to the Onge from the Indian Ocean.

In this post, I run principal components analysis on the Pakistan populations; the Hazara were excluded because of their high East Eurasian admixture. Here is the unsupervised PCA:


First, you notice that the first dimension is dominated by the Kalash, a very distinctive population because of its long-term isolation. The second dimension is dominated by a Sindhi outlier, which, if you consult a Sindhi population portrait from a previous experiment, is revealed to have substantial Sub-Saharan admixture.

Obviously, this is no good, as our first two dimensions are not anthropologically interesting. If we are interested in learning about the origins of populations, knowing that there are a few Sindhi individuals with Sub-Saharan admixture, or that the Kalash are highly isolated is not helpful.

We can run PCA again, but this time we project populations of interest onto the PCA plot of the West Eurasian control populations:
It is fairly obvious that the populations of Pakistan fall on the South Asia-West Asia line. There are small deviations from the cline:
  • Balochis and Brahuis deviate towards the SW Asian component, which is consistent with their ADMIXTURE results.
  • The position of the non-Indo-European Burusho and Indo-Aryan Sindhi populations on either side of the cline is consistent with a little SW Asian component in the Sindhi and a little North European component in the Burusho, which pull them away from the cline in the expected directions.
Moreover, the relative position of the Pakistan populations along this cline is preserved.

Using the West Eurasian "zombies" is thus useful not only for ADMIXTURE, but also for principal components analysis; in the latter it is helpful because:
  1. It avoids domination by very isolated/inbred populations and/or outliers
  2. It is possible to create synthetic "zombie" populations with absolutely equal sample sizes, hence removing a source of bias (some residual bias may persist, e.g., if one used a component centered on 5 "real" individuals to create a "zombie" population of 100, then the effective sample is not really 100)
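The projection step described above can be sketched as follows: the principal axes are computed from the reference ("zombie") individuals only, and the populations of interest are then placed on those axes without influencing them. This is a bare-bones illustration, assuming 0/1/2 genotype matrices with no missing data and no per-SNP normalization; the data and function names are hypothetical, and dedicated PCA software will differ in its details.

```python
import numpy as np

def fit_pca(reference, n_components=2):
    """Compute principal axes from the reference individuals only."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes of the centered reference data.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(samples, mean, axes):
    """Place samples on axes that were defined without them."""
    return (samples - mean) @ axes.T

rng = np.random.default_rng(0)
# Hypothetical data: 20 reference "zombies" and 5 test individuals at 50 SNPs.
reference = rng.integers(0, 3, size=(20, 50)).astype(float)
test_samples = rng.integers(0, 3, size=(5, 50)).astype(float)

mean, axes = fit_pca(reference, n_components=2)
reference_scores = project(reference, mean, axes)
test_scores = project(test_samples, mean, axes)
```

Because an isolated population or an outlier among the test samples never enters `fit_pca`, it cannot take over the first dimensions.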


Thursday, June 2, 2011

Ancestral South Indian (ASI) in context

I have taken the synthetic ASI population together with 25 HapMap-3 Chinese (CHB), 16 HGDP Papuans, and 9 Reich et al. (2009) Onge from the Andaman Islands to determine its relationships with other Eurasian populations.

Below is an MDS plot which shows that ASI does not appear to be particularly close to any of the other populations.

I also ran a supervised K=3 ADMIXTURE analysis that treated the ASI population as test data and CHB, Onge, Papuan as parental populations; the ASI turned out 100% "Onge", consistent with the idea that ASI is distantly related to Onge, although closer to it than to the other two populations.

It should be noted, however, that the similarity of ASI to Onge is not unexpected, since:
  • Onge was used by Reich et al. (2009) to infer admixture proportions of Indian Cline populations, which were (in turn):
  • used by myself to infer allele frequencies of ASI, and then:
  • used by myself to create a synthetic population of ASI individuals.
So, the Onge-ness of ASI is contingent upon the accuracy of Reich et al. (2009), but, anyway, my population of ASI "zombies" seems to pass a second test of being a reasonable stand-in for ASI in the sense of that paper.

Tuesday, May 31, 2011

ANI/ASI analysis of HGDP Pakistan groups

Until recently, it has been difficult to study the Ancestral North Indian/Ancestral South Indian (ANI/ASI) composition of Pakistan groups, as these fell outside the "Indian Cline" of Reich et al. (2009). My recent experimental reconstruction of ANI/ASI zombies, as well as West-Eurasian ones allows me to do a supervised run on them and see how they fare.

(One caveat is that this is based on ~30k SNPs, as the two different kinds of populations I am using include ~120k and ~150k SNPs, but not the same ones).

Overall, the results make sense (they can be seen on the left, as well as on this spreadsheet):
  • The components of the ANI and West Asian "zombies" dominate most populations; I suspect that as the two are related it may be difficult to distinguish between them
  • Intriguingly, Kalash continue to be dominated by West Asian, now that the composite "South Asian" has been resolved, and their ASI levels are similar to those in Iranians.
  • Conversely, the highest ANI levels are found in Pathans and Sindhi, i.e., precisely the populations used by Reich et al. (2009). Hence, I suspect that ANI in the sense of Reich et al. (2009), as reconstructed by myself, may be biased towards these two populations. Also note that my ANI reconstruction used the same Pathans (15) and Sindhi (10) used by Reich et al. (2009), whereas in this one all HGDP individuals are included.
  • The East Asian component turns up in the Hazara and the Burusho, in agreement with previous experiments
  • The Southwest Asian component turns up in Balochistan (Balochi, Brahui, Makrani), which also makes sense, linking that Iranic speaking region to nearby Iran where that component is also important
  • The North European component comes up in Hazara, Burusho, and Pathans, which again makes sense, as these populations may have been influenced by people from further north in historical times.
In conclusion, I would say that the ANI/ASI "zombies" do capture real South Asian signals, as evidenced by my Gypsy experiment, but the reconstructed ANI does not capture the entirety of West Eurasian admixture in South Asia: a lot of it continues to be associated with West Asia, and a little with Northern Europe in some populations.

Monday, May 30, 2011

More Zombies: Ancestral North Indians and Ancestral South Indians reborn

In my previous post I showed how synthetic individuals corresponding to ADMIXTURE ancestral components can be created and used. This was made possible by the fact that ADMIXTURE outputs allele frequencies for its components, which can be utilized to create a population of random genotypes with the same allele frequencies.

A more difficult task is to create such "zombie" individuals when there are no allele frequencies at hand. A prime example of this is the paper by Reich et al. (2009) on the two ancestral components in Indians: Ancestral South Indians (ASI) and Ancestral North Indians (ANI). The paper provides admixture estimates for these two components in present-day "Indian Cline" groups, but no allele frequencies for these components: we only knew that ANI was closely related to West Eurasians, and ASI formed a clade with the Onge from the Indian Ocean.

Both ANI and ASI are extinct (in pure form) populations, and they are blended (in varying proportions) in modern day Indians, with highest ANI occurring in the Northwest and among upper caste groups, and highest ASI among South Indian tribal and low caste populations.

As I was thinking of ways to extend the "zombie" approach, it occurred to me that there is a fairly involved way to extract the ANI/ASI allele frequencies from the available evidence:

If f(ANI) and f(ASI) are the allele frequencies at a locus for ANI and ASI, and an admixed population P has x fraction of ancestry from ANI and 1-x from ASI, then its allele frequency is expected to be:

x*f(ANI)+(1-x)*f(ASI) = f(P)

Here x and f(P) are the known variables; f(ANI) and f(ASI) are the unknowns. Obviously, this equation does not hold exactly in practice, because of sampling error, uncertainty in the estimation of x, as well as genetic drift that may affect the allele frequencies of the admixed population.

Nonetheless, we do not only have one equation of this sort, but 18, since Reich et al. (2009) provides ANI/ASI estimates for 18 different Indian Cline populations. We can thus fit a linear regression to recover f(ANI) and f(ASI).

This is exactly what I did; there are two important caveats:
  • because most of the Reich et al. (2009) populations are very small, f(P) is expected to be very noisy. I thus grouped the Indian Cline populations into five groups (based on increasing ANI, and making sure that each one had >15 individuals), and calculated admixture proportions (x's) and allele frequencies (f(P)'s) on these groups.
  • linear regression coefficients (the f(ANI) and f(ASI) estimates) may be less than 0 or more than 1, which makes no biological sense, so whenever that happened (~5% of markers) they were fixed to 0 or 1, respectively
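The regression and the clipping step can be sketched as below. This is a minimal illustration that assumes ordinary least squares without an intercept, fitted to all SNPs at once; the exact regression setup used for the post is not spelled out, and the group proportions and frequencies here are made up for the example.

```python
import numpy as np

def recover_ancestral_freqs(x, f_groups):
    """Solve x*f(ANI) + (1-x)*f(ASI) = f(P) by least squares, one fit per SNP.

    x        : ANI fraction of each admixed group, shape (n_groups,)
    f_groups : observed allele frequencies, shape (n_groups, n_snps)
    """
    x = np.asarray(x, dtype=float)
    design = np.column_stack([x, 1.0 - x])  # columns multiply f(ANI), f(ASI)
    coef, *_ = np.linalg.lstsq(design, np.asarray(f_groups, dtype=float), rcond=None)
    # Estimates outside [0, 1] are not valid frequencies, so clip them.
    coef = np.clip(coef, 0.0, 1.0)
    return coef[0], coef[1]

# Hypothetical example: five groups ordered by increasing ANI, three SNPs.
x = np.array([0.40, 0.50, 0.60, 0.65, 0.75])
true_ani = np.array([0.90, 0.20, 0.50])
true_asi = np.array([0.10, 0.70, 0.50])
f_groups = np.outer(x, true_ani) + np.outer(1.0 - x, true_asi)  # noise-free f(P)

f_ani, f_asi = recover_ancestral_freqs(x, f_groups)
```

With noise-free inputs the fit returns the frequencies used to build the example; with real data the estimates are noisy, which is why the clipping is needed.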
All of this required a bit of thinking and work, so I was very skeptical that it would work; given sampling/admixture estimation errors/limitations of regression/random creation of individuals, the whole process from input data to output "zombies" passed through so many layers, that it could very well lead to nonsense.

Nonetheless, there is power in numbers, and I was hopeful that this might work. If it did, I could have synthesized ANI and ASI populations to play with and use pretty much like regular populations in a variety of experiments.

Validation of synthetic ANI/ASI populations

I generated 25 ANI and 25 ASI individuals using the above-described method. There are 119,588 SNPs in these populations.

To validate them, I ran supervised ADMIXTURE using these ANI/ASI individuals as ancestral populations, and all the Indian Cline populations as test data. The results can be seen below:
Although the estimates for some populations (e.g., Chenchu: 31 vs. 40.7%) are substantially off, the median error is 1%, and the average error is 2.4%. Overall, it does appear that the synthetic ANI/ASI individuals are fairly good stand-ins for their (extinct) populations.

Ancestral North Indians

I included ANI together with the 4 West Eurasian components of the Dodecad Project in an MDS plot:
Also, a neighbor-joining tree:
Putting ANI/ASI to work: Romanian Gypsies

I have previously detected 2 individuals in the Behar et al. (2010) Romanian sample that are likely to be of Roma (Gypsy) heritage. Here is a supervised admixture of the Romanian sample using the ANI/ASI components:
The previously detected individuals do possess both ANI and ASI components, indeed these are:

18.1, 15.3
16.9, 16.4

in the two individuals, which might be useful in constraining geographically the origin of European Gypsies along the Indian Cline.

Putting ANI/ASI to work: Iranians

Iranians generally show affinity to South Asians. Is this affinity related to the common Indo-Iranian background of Iranians and Indo-Aryans, or, is it, perhaps, due to the absorption of South Asian population elements during Iran's long imperial past?

The ANI/ASI components in the Iranians and Iranian_D samples are:

11.7, 7.5
12.0, 6.9

Compared to the previously described Romanian Gypsies, the South Asian component in Iranians tends to be clearly tilted towards ANI.

How to create Zombies from ADMIXTURE etc.

ADMIXTURE infers K ancestral populations, and estimates the admixture proportions of individuals from these K populations, as well as the allele frequencies for all SNPs for each ancestral population.

An interesting use of the allele frequencies is to generate synthetic "zombies" from the ancestral populations. These are artificial individuals whose genotypes are drawn randomly based on the allele frequencies. For example, there is a "West Asian" component in the Dodecad Project, but no individuals who have 100% membership in the "West Asian" component. A "West Asian" zombie is a synthetic individual who appears to be drawn from that "West Asian" component only, without any other (e.g., "South European", or "Southwest Asian") admixture at all.
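The drawing of genotypes can be sketched in a few lines. This is a minimal illustration, assuming each SNP is drawn independently as two allele copies (Hardy-Weinberg proportions, no linkage between SNPs) and that the frequencies for one component have already been read from ADMIXTURE's allele frequency output; which allele those frequencies refer to must be checked against the input data. The function name and example frequencies are hypothetical.

```python
import numpy as np

def make_zombies(freqs, n_individuals, seed=0):
    """Draw synthetic diploid genotypes (0, 1 or 2 allele copies) per SNP."""
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    # A genotype is the sum of two independent allele draws at frequency p.
    return rng.binomial(2, freqs, size=(n_individuals, freqs.size))

# Hypothetical frequencies of one ancestral component at five SNPs.
west_asian_freqs = [0.10, 0.50, 0.90, 0.00, 1.00]
zombies = make_zombies(west_asian_freqs, n_individuals=25)
```

Each row of `zombies` is one synthetic individual; SNPs with frequency 0 or 1 are fixed in every zombie, and the rest vary from individual to individual.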

"Zombies" may be viewed as either useful theoretical abstractions, or as reconstructed hypothetical ancient-like individuals, purged of centuries or millennia of admixture. Irrespective of how one views them, they are very useful as a tool.

Zombies of K=10 components

I generate 25 zombies for each of the 10 ancestral components of the Dodecad Project. Below, you can see an MDS plot of these 250 individuals, which is quite similar to the MDS plot generated using only the Fst divergences between the ancestral components.
Including real and "zombie" populations

I include the "West African", "North European", and "South European" zombie populations, together with 25 African Americans (ASW) from HapMap-3:
Notice the direction of the African American cline: slightly tilted towards North Europeans. This makes sense as the European ancestry of African Americans is derived mainly from Northwestern Europe, and not exclusively from either the Mediterranean or Northern Europe, where the "South European" and "North European" components peak.

Convert unsupervised ADMIXTURE runs to supervised ADMIXTURE

The most exciting use of "zombies" is to convert unsupervised ADMIXTURE runs into supervised ones. In unsupervised mode, ADMIXTURE treats all individuals alike, and tries to infer their ancestral proportions. In supervised mode, some individuals are treated as "fixed" (belonging 100% in one of K ancestral components), and the ancestry of the rest is inferred.

The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; whenever you want to estimate admixture proportions in new samples run supervised ADMIXTURE analysis using the zombie populations.
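The bookkeeping for the supervised step can be sketched as follows. The sketch assumes that ADMIXTURE's supervised mode (the `--supervised` flag) reads a `.pop` file with one line per individual, in the same order as the `.fam` file, giving a population label for reference individuals and `-` for individuals whose ancestry is to be estimated. The individual IDs, component labels and helper name are hypothetical.

```python
def build_pop_lines(fam_ids, zombie_labels):
    """Return one .pop line per individual, in .fam order.

    fam_ids       : individual IDs in the order they appear in the .fam file
    zombie_labels : dict mapping each zombie's ID to its ancestral component
    """
    # Zombies get their component label; everyone else is left to be inferred.
    return [zombie_labels.get(individual, "-") for individual in fam_ids]

# Hypothetical merged dataset: three zombies followed by two real samples.
fam_ids = ["Z_WestAsian_1", "Z_WestAsian_2", "Z_SouthAsian_1", "P001", "P002"]
zombie_labels = {
    "Z_WestAsian_1": "West_Asian",
    "Z_WestAsian_2": "West_Asian",
    "Z_SouthAsian_1": "South_Asian",
}
pop_lines = build_pop_lines(fam_ids, zombie_labels)
pop_file_text = "\n".join(pop_lines) + "\n"  # contents to save next to the .bed/.fam files
```

The resulting text would be saved with the same prefix as the genotype files before running ADMIXTURE in supervised mode with K equal to the number of zombie populations.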

You can thus use the zombie populations to mimic a regular (unsupervised) ADMIXTURE run. This is useful for two reasons:
  1. It can be much faster: the initial set (of the unsupervised run) can be huge, but the zombie populations need only be large enough to capture the allele frequencies of the inferred components.
  2. It avoids the generation of spurious clusters, especially if you include individuals from highly-inbred populations, or a large number of test individuals
I re-estimated admixture proportions for the 9 individuals of the last run, using the "zombie" populations in a supervised ADMIXTURE run. This took less than 1/10 of the time, and achieved results that were highly concordant with the ones previously reported: correlation was +0.999729; the average difference in ancestral proportions was 0.3%, the maximum difference 2.1%.

The speedup is due to two reasons: first, I'm running ADMIXTURE on 250 "zombie"+9 real individuals, as opposed to 692+9 real individuals using the unsupervised method. Moreover, admixture proportions are only estimated for the 9 real individuals and are fixed for the 250 "zombie" ones. This idea seems to work like a charm.
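Concordance figures of the kind quoted above can be computed along these lines. This is a sketch only: it assumes both runs give an individuals-by-components matrix of proportions with the components in the same column order, and that "difference" means absolute difference; the proportions in the example are made up.

```python
import numpy as np

def compare_runs(q_a, q_b):
    """Correlation, mean and maximum absolute difference between two Q matrices."""
    q_a = np.asarray(q_a, dtype=float)
    q_b = np.asarray(q_b, dtype=float)
    diff = np.abs(q_a - q_b)
    # Correlate all (individual, component) entries of the two runs.
    r = np.corrcoef(q_a.ravel(), q_b.ravel())[0, 1]
    return float(r), float(diff.mean()), float(diff.max())

# Hypothetical proportions for two individuals and two components.
unsupervised = [[0.50, 0.50], [0.20, 0.80]]
supervised = [[0.52, 0.48], [0.20, 0.80]]
r, mean_diff, max_diff = compare_runs(unsupervised, supervised)
```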

More average K=10 results

I was also able to calculate admixture proportions for the 10 Dodecad components in Druze, Kalash, and Palestinians. These populations have a tendency to form their own population-specific clusters, so they are very difficult to compare against other populations: you just can't get their breakdown into ancestral components easily, because they become their own ancestral components at fairly low K.

Using the trick of "zombie" populations, we can determine their ancestral components and compare them with other Dodecad populations.

I have labored long to be able to compare these to the ones in the standard Dodecad set, and I am very pleased that I was finally able to achieve it:
  • Both Druze and Palestinians have substantial "Southwest Asian" component as do most Semitic (Arab, Jewish, Ethiopio-Semitic) populations in my database
  • Druze have more "West Asian" than "Southwest Asian", and the reverse is true for Palestinians
  • Palestinians have more African admixture than Druze

By far the most exciting part of this analysis is the set of results for the Kalash, a population that speaks a language of the Dardic group of Indo-Iranian. Some linguists place Dardic languages in the Indo-Aryan subgroup (of which Sanskrit and Hindi are the most famous representatives), whereas others view Dardic as a third branch of Indo-Iranian together with Iranian (like Kurdish, Persian, or Pashto) and Indo-Aryan. In any case, the study of these mountaineers is crucial to the study of Indo-Iranians in general.

The Kalash have been much mythologized as either long-lost Aryans or the descendants of Alexander the Great's soldiers.

The absence of the South European component among them agrees with Y-chromosome research about the absence of a Mediterranean or Greek influence in that population. The Kalash are completely split between the West Asian component (56%) and the South Asian one (43.5%). Indeed, their West Asian admixture is very high compared to my South Asian populations, exceeding even that of the Pathans (~40%) and reaching levels found only in West Asia proper. It is also perfectly consistent with my theory of Indo-Aryan origins in West Asia.

The way forward

I initially considered the idea of zombies as a way to include more Project participants in my detailed ADMIXTURE runs, such as the recent K=12 and K=11 ones. There are two problems with these runs:
  • Each one takes 24+ hours to complete, so it is not exactly possible to replace the standard K=10 analysis with them just yet
  • Including all project participants, especially those of mixed background, makes them completely impractical, in addition to making them very capricious: at high K different components begin to appear depending on sample composition, and the solution is not as robust as in the standard K=10 analysis.
With the use of zombie populations, these problems can be largely solved. I can spend many hours or even days in a very detailed ADMIXTURE run with a large sample, create "zombie" populations from the inferred results, and then run project participants fairly fast using these "zombie" populations and supervised ADMIXTURE mode. In fact, I am working on exactly this type of test at the moment, so project members of all backgrounds should expect good things to come in the next days or weeks.