As I have explained in the README file of DIYDodecad, it is possible to use the software to create and distribute new calculators based on different marker sets or ancestral populations.
(The following discussion will only be useful to other genome bloggers, or to people who have experience with the ADMIXTURE software.)
Currently, DIYDodecad is distributed together with the 'dv3' calculator ("Dodecad v3"). This consists of a set of files:
dv3.par (The parameter file that tells DIYDodecad what to expect and what to do)
dv3.alleles (Allele names and variants)
dv3.12.F (Allele frequencies for 12 ancestral populations)
dv3.txt (Names for 12 ancestral populations)
I will now explain how you can use PLINK and ADMIXTURE to create your own calculator.
(1) Running ADMIXTURE
In the following discussion, I will assume that you have your dataset in binary PLINK format (bed/bim/fam files), that it has 123,456 markers, and that you run ADMIXTURE in the usual way for 7 ancestral populations, e.g.:
./admixture test.bed 7
CAVEAT! All 123,456 markers must be included in the commercial platform that your calculator targets. So, before you run ADMIXTURE, you must make sure that test.bed includes only markers from your chosen platform (e.g., 23andMe v3). I will assume that you have the list of markers from your commercial platform in a file (one marker per line), e.g., 23andMeV3.txt. You must then first do:
./plink --bfile test --extract 23andMeV3.txt --make-bed --out test
You can repeat this with other commercial marker sets, so that in the end your "test" dataset on which you run ADMIXTURE only has commercially available markers that your targeted audience will possess in their genotype files.
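When several platforms are targeted, an alternative to repeated extraction is to build one marker list that is the intersection of the per-platform lists and extract once. A minimal sketch with standard Unix tools, assuming each list has one marker ID per line; the function name intersect_markers and the file names FamilyFinder.txt and common.txt are hypothetical:

```shell
# intersect_markers LIST_A LIST_B
# Prints the marker IDs present in both lists (one ID per line).
# comm requires sorted input, so both lists are sorted (and de-duplicated) first.
intersect_markers() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    # -12 suppresses lines unique to either file, leaving only the common ones
    LC_ALL=C comm -12 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

For example, intersect_markers 23andMeV3.txt FamilyFinder.txt > common.txt, followed by the --extract command above with common.txt in place of 23andMeV3.txt.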
In practice, my own working sequence is to:
- Merge (--merge-list) all reference datasets in PLINK with a --geno flag
- Extract (--extract) commercial markers that form the intersection of 23andMe v2/v3 and Family Finder (Illumina)
- Do linkage-disequilibrium based pruning (--indep-pairwise)
- Finally run ADMIXTURE
It's better to do LD-based pruning after commercial marker pruning, since doing it in reverse may disrupt the physical spacing of the markers identified by --indep-pairwise.
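The sequence above can be sketched as a short script. This is only an illustration under stated assumptions: the file names (ref1, others.txt, common.txt, merged, commercial) are hypothetical, and the --geno threshold and the --indep-pairwise window, step and r^2 values are placeholders, not the settings used for any Dodecad calculator. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Dry-run sketch of the merge -> extract -> prune -> ADMIXTURE sequence.
# With DRY_RUN=1 (the default here) each command is printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

pipeline() {
    # 1. Merge reference datasets, dropping markers with too much missing data
    #    (0.01 is an assumed threshold)
    run ./plink --bfile ref1 --merge-list others.txt --geno 0.01 --make-bed --out merged
    # 2. Keep only markers present on the targeted commercial platforms
    run ./plink --bfile merged --extract common.txt --make-bed --out commercial
    # 3. LD-based pruning (assumed parameters); writes commercial.prune.in
    run ./plink --bfile commercial --indep-pairwise 50 10 0.1 --out commercial
    run ./plink --bfile commercial --extract commercial.prune.in --make-bed --out test
    # 4. Run ADMIXTURE for 7 ancestral populations
    run ./admixture test.bed 7
}
```

Calling pipeline prints the five commands in order, which makes it easy to review them before running anything.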
After ADMIXTURE finishes its run, it will output a file called test.7.P; this is the allele frequencies file that you will use for your calculator, but you first have to re-order its rows (one per marker) to match the alleles file! We will do this later.
(2) Preparing the test.alleles file
First, run the following command:
./plink --bfile test --freq --out test
This will produce a test.frq file, which will be the basis of the test.alleles file (the counterpart of dv3.alleles). In R, do the following:
# Columns 2 to 4 of the .frq file are SNP, A1 (minor allele) and A2 (major allele)
X <- read.table('test.frq', header=TRUE)[, 2:4]
This will basically load the SNP names and minor/major alleles into the X table. We now identify the alphabetical order of the SNPs:
ORDER <- order(X[,1])
And, now we re-order X, so that SNPs are ordered alphabetically:
X <- X[ORDER,]
and we save this as the test.alleles file:
write.table(X, file='test.alleles', quote=F, row.names=F, col.names=F)
(3) Preparing the test.7.F file
The test.7.F file can be prepared as follows, in the same R session (ORDER was computed in step 2):
X <- read.table('test.7.P')
X <- X[ORDER, ]
write.table(X, file='test.7.F', quote=F, row.names=F, col.names=F)
Note that in this example test.7.P contains the output of ADMIXTURE, and test.7.F will contain the same output, but with rows re-ordered in the same way as the test.alleles file.
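Since the two files must stay aligned row by row, a quick consistency check before distributing them can catch mistakes. A minimal sketch, assuming whitespace-separated files as written by the write.table calls above; the function name check_calculator is hypothetical, and it checks only row and column counts, not the ordering itself:

```shell
# check_calculator PREFIX K
# Checks that PREFIX.alleles and PREFIX.K.F describe the same number of markers
# and that every row of the frequency file has exactly K columns.
check_calculator() {
    prefix=$1; k=$2
    n_alleles=$(awk 'END { print NR }' "$prefix.alleles")
    n_freq=$(awk 'END { print NR }' "$prefix.$k.F")
    if [ "$n_alleles" -ne "$n_freq" ]; then
        echo "row mismatch: $n_alleles alleles vs $n_freq frequency rows" >&2
        return 1
    fi
    # NF is the number of whitespace-separated fields on the current line
    if ! awk -v k="$k" 'NF != k { exit 1 }' "$prefix.$k.F"; then
        echo "some rows of $prefix.$k.F do not have $k columns" >&2
        return 1
    fi
    echo "$n_alleles markers, $k populations: OK"
}
```

For the example in this post, the call would be check_calculator test 7.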
(4) Preparing the test.txt file
You do that with an editor; just pick whatever names you want for your 7 ancestral populations, which, of course, should be in the same order as the corresponding frequency columns output by ADMIXTURE.
(5) Preparing the test.par file
Again with your editor, for this example:
1d-7
7
genotype.txt
123456
test.txt
test.7.F
test.alleles
verbose
genomewide
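If you prefer not to type the marker count by hand, the same file can be generated from the shell. This is a sketch that simply reproduces the layout of the example above; it assumes (from the numbers in this example, not from the README) that the second line is the number of ancestral populations and the fourth line is the number of markers. The function name write_par is hypothetical, and the remaining lines are copied verbatim:

```shell
# write_par PREFIX K
# Writes PREFIX.par, following the layout of the example above.
# Assumption: line 2 is the number of populations, line 4 the number of markers.
write_par() {
    prefix=$1; k=$2
    # Count markers from the alleles file rather than typing the number by hand
    n_markers=$(awk 'END { print NR }' "$prefix.alleles")
    cat > "$prefix.par" <<EOF
1d-7
$k
genotype.txt
$n_markers
$prefix.txt
$prefix.$k.F
$prefix.alleles
verbose
genomewide
EOF
}
```

Running write_par test 7 after step (2) writes test.par with the marker count taken from test.alleles, so the two cannot drift apart.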
(6) Instructions to users
Do NOT distribute the DIYDodecad software itself; rather, direct your users to the Dodecad Project download page (currently version 2.0 of the software). This will ensure both compliance with the terms of use of the software and that users have access to the most up-to-date version.
You only have to distribute test.par, test.alleles, test.7.F, and test.txt.
Your users will follow exactly the same sequence of actions as described in the Dodecad README.txt file, with the only difference that they should type 'test', rather than 'dv3' whenever it is needed.
Hopefully more genome bloggers will decide to release calculators based on their ADMIXTURE runs to the wider public. There are several reasons to do this:
- Reduced workload, since users can run the calculator on their own data themselves
- Wider distribution of your work in the community, since, due to privacy concerns, not everyone is willing to share their data
- Ability to study the utility/validity of inferred components on test data and by persons other than the discoverer
- Ability to use the advanced bychr, byseg, and target modes with your calculators