In the Dodecad Project, we have seen that even very close populations such as Armenians and Assyrians can be classified accurately using the MDS/MCLUST "Clusters Galore" combination that I have proposed. But do we really need ~177 thousand markers to achieve this level of detail?
I decided to carry out a small experiment, to see how the number of markers used affects the ability to correctly classify samples into two different populations.
Case A: Maximal differentiation (Papuans vs. Mbuti Pygmies)
I begin by considering the case of the two most differentiated human populations in my database, Papuans (17 individuals) and Mbuti Pygmies (13 individuals). At each step I reduce the number of markers by roughly an order of magnitude, using PLINK's --thin 0.1 argument.
- With 176,598 markers, classification is 100% correct, i.e., all 13 Pygmies are assigned to one cluster and all 17 Papuans are assigned to another
- With 17,752 markers, classification is again 100% correct.
- With 1,725 markers, ditto
- With 152 markers, ditto
- With 20 markers, 3 Pygmies are misclassified into the Papuan cluster, hence accuracy is 90%
- With 4 markers, 5 Pygmies are misclassified as Papuans, and 6 Papuans as Pygmies, hence accuracy is 63%
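The accuracy figures above follow directly from the misclassification counts. As a quick check of the arithmetic (the function name `accuracy` is just an illustrative helper, not part of any tool used here):

```python
def accuracy(n_samples, n_misclassified):
    """Fraction of individuals placed in the cluster of their own population."""
    return (n_samples - n_misclassified) / n_samples

n = 13 + 17  # 13 Mbuti Pygmies + 17 Papuans

# 20 markers: 3 Pygmies fall into the Papuan cluster
print(f"{accuracy(n, 3):.0%}")      # 27 of 30 correct

# 4 markers: 5 Pygmies and 6 Papuans are misassigned
print(f"{accuracy(n, 5 + 6):.0%}")  # 19 of 30 correct
```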
Notice that these are not ancestry informative markers (AIMs) but randomly selected SNPs.
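As far as I understand PLINK's --thin option, it keeps each marker independently with the given probability, which would explain why the marker counts at each step are close to, but not exactly, a tenth of the previous count. A minimal Python sketch of that kind of random thinning follows; the `thin` function and the marker IDs are hypothetical stand-ins, not PLINK's code or my actual data.

```python
import random

def thin(marker_ids, keep_prob=0.1, seed=None):
    """Keep each marker independently with probability keep_prob.

    Marker order is preserved; the number kept varies from run to run.
    """
    rng = random.Random(seed)
    return [m for m in marker_ids if rng.random() < keep_prob]

if __name__ == "__main__":
    # Hypothetical marker IDs standing in for the full marker set.
    markers = [f"snp{i}" for i in range(176_598)]
    for step in range(5):
        markers = thin(markers, keep_prob=0.1)
        print(f"step {step + 1}: {len(markers)} markers kept")
```

Because the selection ignores allele frequencies entirely, nothing in it favors markers that happen to separate the two populations well.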
Case B: Small differentiation (Armenians vs. Assyrians)
I have previously shown that a sample of 7 Armenians and 8 Assyrians can be classified almost perfectly with the full marker set: a single Assyrian is misclassified as Armenian, hence 93% accuracy.
- With 17,714 markers, 1 Armenian is also misclassified as Assyrian, hence 87% (13 of 15)
- With 1,808 markers, accuracy is 80%
- With 188 markers, it is 67%
- With 17 markers, it is 60%
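The general pattern in both cases (accuracy falls as markers are removed, and falls sooner for closely related populations) can be explored on synthetic data. The sketch below is only a toy: it uses simulated genotypes and plain 2-means clustering as a stand-in, not the MDS/MCLUST procedure used for the results above, and every parameter (the `shift` values, the frequency range, the sample sizes) is an assumption chosen for illustration. Its output should not be read as reproducing the figures in this post.

```python
import random

def simulate_population(freqs, n_individuals, rng):
    """Draw diploid genotypes: 0, 1 or 2 copies of the counted allele per marker."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_individuals)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(samples, n_iter=20):
    """Plain 2-means clustering (a simple stand-in, NOT MDS/MCLUST)."""
    # Label-agnostic start: sample 0 and the sample farthest from it.
    c0 = samples[0]
    c1 = max(samples, key=lambda s: sq_dist(s, c0))
    centers = [list(c0), list(c1)]
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if sq_dist(s, centers[0]) <= sq_dist(s, centers[1]) else 1
                      for s in samples]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:  # keep the old centre if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def clustering_accuracy(labels, truth):
    """Fraction correct under the better of the two cluster-to-population matchings."""
    agree = sum(l == t for l, t in zip(labels, truth))
    return max(agree, len(truth) - agree) / len(truth)

def run_experiment(n_markers, shift, n_a=17, n_b=13, seed=0):
    rng = random.Random(seed)
    freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
    # Population B's allele frequencies differ from A's by a hypothetical amount.
    freqs_b = [min(0.99, max(0.01, p + rng.gauss(0, shift))) for p in freqs_a]
    samples = (simulate_population(freqs_a, n_a, rng)
               + simulate_population(freqs_b, n_b, rng))
    truth = [0] * n_a + [1] * n_b
    return clustering_accuracy(two_means(samples), truth)

if __name__ == "__main__":
    print("markers  distant  close")
    for n_markers in (2000, 200, 20, 4):
        distant = run_experiment(n_markers, shift=0.3)   # strongly differentiated
        close = run_experiment(n_markers, shift=0.03)    # weakly differentiated
        print(f"{n_markers:7d}  {distant:7.2f}  {close:5.2f}")
```

Note that with two clusters the reported accuracy can never drop below 50%, since the cluster labels are matched to populations in whichever way agrees better.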
I carried out this little experiment because I thought it would be interesting, but also for its implications. I can identify at least two major ones:
Ancient DNA: Due to poor preservation we are unlikely to get full genome sequences from ancient human DNA except in highly favorable conditions. In many instances, it may be possible to get only a limited number of markers tested. These results suggest that it is possible to get fairly decent assignment of individuals into populations without hundreds of thousands of SNPs. Thus, it may be possible to study genetic structure in ancient necropoleis, or the relationship of ancient remains to modern populations.
Data integration: Different genotyping platforms (e.g., those of Affymetrix and Illumina) often share only a small subset of SNPs. Imputation of genotypes is possible, but these results suggest that even when the overlap between platforms is small, it is possible to carry out fairly sophisticated genetic analyses using the shared markers alone.
Naturally, there are issues not addressed here: what happens when we are dealing with more than 2 populations? What happens if we want to study admixture rather than classification? In both cases, I expect the number of markers needed to be higher.
Nonetheless, I am fairly convinced, both from this and from a previous experiment, that using more markers than current genotyping platforms already carry will add very little value to anthropological investigations of ethnic genetic differences.