Dodecad Ancestry Project: August 2011

Tuesday, August 30, 2011

Balkan averages (August 2011)

Since my last call for more participation from the Balkans, I was able to create a new Bulgarian_D sample of 5 participants. Together with the Greek_D sample, the Balkans_D sample of non-Greek, non-Bulgarian project members, the Behar et al. (2010) Romanians, and the Xing et al. (2010) Slovenians (the latter on a smaller number of markers), we are beginning to get a better feel for genetic variation in the Balkans. Several other averages have also been adjusted as participation has grown; all of them can be seen in the Dodecad v3 spreadsheet.

The table below shows the major components (>1%) in the available Balkan populations.

The Bulgarian average as it stands seems reasonably close to the Romanian one, and is characterized by balanced West/East European components; in this balance it resembles Greeks, who, however, have lower levels of both components and higher levels of the Mediterranean/West Asian/Southwest Asian components.

Slovenians contrast with Hungarians in having reverse West/East European levels, and with their neighboring Italians in having quite a bit more of the East European component, and quite a bit less of the Mediterranean one. Bulgarians/Romanians contrast with Slavic groups from eastern Europe in having less East European and more Mediterranean/West_Asian.

Hopefully before long, more participation from the western/central Balkans (Serbs, Croats, Bosnians, Montenegrins, Albanians, Slav Macedonians) will allow us to fill more holes in our understanding of the genetic landscape of Southeastern Europe.

Tuesday, August 23, 2011

Populations in need of 5 participants

Submission to the Project is currently closed, but I am often willing to include new members if they contact me (dodecad@gmail.com) before sending their data with some information about their ancestry.

I am most likely to accept new participants from:

Greece, the Balkans, Italy, and West/Central Asia
Under-represented populations
Populations that are a few members short of reaching the 5-person mark, after which I can calculate an average for them.

I typically don't accept new participants of multiple ancestries; I've made DIYDodecad for just that case.

Here is a list of populations that are short 1-2 participants:

Algerian_D 4

North_African_Jews_D 4

Slovenian_D 4

Bulgarian_D 4

Danish_D 3

Moroccan_D 3

Tunisian_D 3

Mixed_Scandinavian_D 3

Serb_D 3

Austrian_D 3

Saudi_D 3

Pakistani_D 3

Tatar_Various_D 3

Palestinian_D 3

Any individuals from the Balkans are strongly encouraged to contact me, as the Balkans_D sample size of 17 can soon be broken down into specific populations if a few more individuals from different Balkan populations join the Project.

I also encourage new members to post their information in the ancestry thread.

How to make your own calculator for DIYDodecad

As I have explained in the README file of DIYDodecad, it is possible to use the software to create and distribute new calculators, based on different marker sets/ancestral populations.

(The following discussion will only be useful to other genome bloggers, or people who have experience with ADMIXTURE software).

Currently, DIYDodecad is distributed together with the 'dv3' calculator ("Dodecad v3"). This consists of a set of files:

dv3.par (The parameter file that tells DIYDodecad what to expect and what to do)

dv3.alleles (Allele names and variants)

dv3.12.F (Allele frequencies for 12 ancestral populations)

dv3.txt (Names for 12 ancestral populations)

I will now explain how you can use PLINK and ADMIXTURE to create your own calculator.

(1) Running ADMIXTURE

In the following discussion, I will assume that you have your dataset in binary PLINK format (bed/bim/fam files), that it has 123,456 markers, and that you run ADMIXTURE in the usual way for 7 ancestral populations (K=7), e.g.:

./admixture test.bed 7

CAVEAT! The 123,456 markers must be included in the commercial platform you are targeting your calculator for. So, before you run ADMIXTURE, you must make sure that test.bed includes only markers for your chosen platform (e.g., 23andMe v3). I will assume that you have the list of markers from your commercial platform in a file (one per line), e.g., 23andMeV3.txt. You must then first do:

./plink --bfile test --extract 23andMeV3.txt --make-bed --out test

You can repeat this with other commercial marker sets, so that in the end your "test" dataset on which you run ADMIXTURE only has commercially available markers that your targeted audience will possess in their genotype files.

Actually, my main personal working sequence is to:

Merge (--merge-list) all reference datasets in PLINK with a --geno flag
Extract (--extract) commercial markers that form the intersection of 23andMe v2/v3 and Family Finder (Illumina)
Do linkage-disequilibrium based pruning (--indep-pairwise)
Finally run ADMIXTURE

It's better to do LD-based pruning after commercial marker pruning, since doing it in reverse may disrupt the physical spacing of the markers identified by --indep-pairwise.
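The extract step needs a single list of marker IDs that every targeted platform shares. Here is a minimal Python sketch of how such a list could be built, assuming each platform's markers are available as plain text with one ID per line (as assumed above for 23andMeV3.txt). The function name and the marker IDs are made up for illustration.

```python
def intersect_marker_lists(*marker_lists):
    """Return the sorted IDs present in every list (blank lines ignored)."""
    sets = [
        {line.strip() for line in markers if line.strip()}
        for markers in marker_lists
    ]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Made-up marker IDs standing in for the contents of two platform files.
platform_a = ["rs3094315", "rs12562034", "rs3934834", ""]
platform_b = ["rs3934834", "rs3094315", "rs9442372"]

common = intersect_marker_lists(platform_a, platform_b)
print("\n".join(common))
```

Written to a file one ID per line, the result can be passed to PLINK's --extract option in the same way as 23andMeV3.txt.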

After ADMIXTURE finishes its run, it will output a file called test.7.P; this is the allele frequencies file that you will use for your calculator, but you first have to re-order its rows so that they match the alleles file. We will do this in step (3).

(2) Preparing the test.alleles file

First, run the following command:

./plink --bfile test --freq --out test

This will produce a test.frq file which will be the basis of the test.alleles file (the counterpart of dv3.alleles). In R, do the following:

X<-read.table('test.frq', header=T)[, 2:4]

This will basically load the SNP names and minor/major alleles into the X table. We now identify the alphabetical order of the SNPs:

ORDER <- order(X[,1])

And, now we re-order X, so that SNPs are ordered alphabetically:

X <- X[ORDER,]

and, we save this as the test.alleles file

write.table(X, file='test.alleles', quote=F, row.names=F, col.names=F)

(3) Preparing the test.7.F file

The test.7.F file can be prepared from test.7.P as follows, reusing the ORDER vector computed in step (2):

X <- read.table('test.7.P')

X <- X[ORDER, ]

write.table(X, file='test.7.F', quote=F, row.names=F, col.names=F)

Note that in this example test.7.P contains the output of ADMIXTURE, and test.7.F will contain the same output, but with rows re-ordered in the same way as the test.alleles file.
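The key point of steps (2) and (3) is that one and the same permutation is applied to both files. For readers who prefer Python to R, here is a rough sketch of the same idea working on lists of lines. The function name and the sample rows are hypothetical, and Python sorts strings by code point, which may differ from R's locale-dependent ordering for unusual SNP names.

```python
def reorder_calculator_rows(frq_lines, p_lines):
    """Sort SNPs by name and apply the same permutation to the .P rows.

    frq_lines: lines of a PLINK .frq file, header included
               (columns: CHR SNP A1 A2 MAF NCHROBS).
    p_lines:   lines of the ADMIXTURE .P file, one row per SNP,
               in the same order as the .frq file.
    """
    # Keep SNP, A1, A2 (columns 2-4), skipping the header line.
    alleles = [line.split()[1:4] for line in frq_lines[1:] if line.strip()]
    freqs = [line.split() for line in p_lines if line.strip()]
    if len(alleles) != len(freqs):
        raise ValueError("the .frq and .P inputs must have one row per SNP each")

    # Counterpart of R's order(): the indices that sort the SNP names.
    order = sorted(range(len(alleles)), key=lambda i: alleles[i][0])

    alleles_out = [" ".join(alleles[i]) for i in order]
    freqs_out = [" ".join(freqs[i]) for i in order]
    return alleles_out, freqs_out


# Tiny made-up example with 3 SNPs and 2 ancestral populations.
frq = [
    " CHR SNP A1 A2 MAF NCHROBS",
    "1 rs300 A G 0.10 200",
    "1 rs100 C T 0.25 200",
    "2 rs200 G A 0.40 200",
]
p = ["0.1 0.9", "0.2 0.8", "0.3 0.7"]

alleles_rows, f_rows = reorder_calculator_rows(frq, p)
print(alleles_rows)
print(f_rows)
```

The two returned lists correspond to the contents of test.alleles and test.7.F respectively.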

(4) Preparing the test.txt file

You do that with an editor; just pick whatever names you want for your 7 ancestral populations, which, of course, should be in the same order as the corresponding frequency columns output by ADMIXTURE.

(5) Preparing the test.par file

Again with your editor, for this example:

1d-7

genotype.txt

123456

test.txt

test.7.F

test.alleles

verbose

genomewide
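Before distributing a calculator, it may be worth checking that the files agree with one another. The following Python sketch assumes that the number in the parameter file (123456 in this example) is the marker count, which is an inference from the example rather than something documented here. The function name and sample rows are hypothetical.

```python
def check_calculator(n_markers, allele_rows, freq_rows, population_names):
    """Return a list of problems found; an empty list means the files agree."""
    problems = []
    if len(allele_rows) != n_markers:
        problems.append(
            f"alleles file has {len(allele_rows)} rows, expected {n_markers}"
        )
    if len(freq_rows) != n_markers:
        problems.append(
            f"frequency file has {len(freq_rows)} rows, expected {n_markers}"
        )
    # Every frequency row needs one column per named ancestral population.
    k = len(population_names)
    for i, row in enumerate(freq_rows, start=1):
        n_cols = len(row.split())
        if n_cols != k:
            problems.append(f"frequency row {i} has {n_cols} columns, expected {k}")
            break  # one report is enough to flag the mismatch
    return problems


# Made-up miniature calculator: 3 markers, 2 ancestral populations.
allele_rows = ["rs100 C T", "rs200 G A", "rs300 A G"]
freq_rows = ["0.2 0.8", "0.3 0.7", "0.1 0.9"]

ok = check_calculator(3, allele_rows, freq_rows, ["Pop_A", "Pop_B"])
bad = check_calculator(3, allele_rows, freq_rows, ["Pop_A", "Pop_B", "Pop_C"])
print(ok)
print(bad)
```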

(6) Instructions to users

Do NOT distribute the DIYDodecad software itself, rather direct your users to the Dodecad Project download page (e.g., here, for the current 2.0 version of the software). This will ensure both compliance with the terms of use of the software, and also that users have access to the most up-to-date version.

You only have to distribute test.par, test.alleles, test.7.F, and test.txt.

Your users will follow exactly the same sequence of actions as described in the Dodecad README.txt file, with the only difference that they should type 'test', rather than 'dv3' whenever it is needed.

Hopefully more genome bloggers will decide to release calculators based on their ADMIXTURE runs to the wider public. There are several reasons to do this:

Reduced workload
Wider distribution of your work in the community, since, due to privacy concerns, not everyone is willing to share their data
Ability to study the utility/validity of inferred components on test data and by persons other than the discoverer
Ability to use the advanced bychr, byseg, and target modes with your calculators

Friday, August 19, 2011

A few comments on the use of DIYDodecad 2.0

Here are some observations that might be useful to people, especially for the new byseg and target modes:

1. Finding the origin of shared segments

Until now, when you had a segment match with another customer of your testing company, you had no idea what the origin of the shared segment was. Suppose, for example, that a Russian and a German share some sequence in a region X. This could be:

Russian-like ancestry in the German individual
German-like ancestry in the Russian individual
Third party ancestry in both individuals

Using the new modes, if the German saw an excess of East_European in the shared region (relative to his usual average), then he'd pick the first scenario; if he saw nothing remarkable, the second; if an excess of some component rare in both Russians and Germans (e.g., West_Asian), the third.
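The reasoning above can be written down as a small decision rule. This is only an illustrative Python sketch: the 15-point threshold, the function name, and all percentages are made up, and real results are noisy enough that no fixed cut-off should be taken literally.

```python
def interpret_shared_segment(segment, genomewide, partner_components,
                             rare_components, threshold=15.0):
    """Pick one of the three scenarios for a shared segment.

    segment, genomewide: dicts mapping component name -> percentage for
        the person doing the analysis.
    partner_components: components typical of the other person's population.
    rare_components: components rare in both populations.
    threshold: hypothetical excess (in percentage points) over the
        genomewide average that counts as remarkable.
    """
    def excess(name):
        return segment.get(name, 0.0) - genomewide.get(name, 0.0)

    if any(excess(name) >= threshold for name in partner_components):
        return "partner-like ancestry in me"
    if any(excess(name) >= threshold for name in rare_components):
        return "third-party ancestry in both"
    return "my kind of ancestry in the partner"


# Made-up genomewide averages for the German in the example.
german = {"West_European": 50.0, "East_European": 25.0,
          "Mediterranean": 20.0, "West_Asian": 5.0}

seg_1 = {"West_European": 20.0, "East_European": 70.0,
         "Mediterranean": 10.0, "West_Asian": 0.0}
seg_2 = {"West_European": 55.0, "East_European": 25.0,
         "Mediterranean": 18.0, "West_Asian": 2.0}
seg_3 = {"West_European": 20.0, "East_European": 10.0,
         "Mediterranean": 10.0, "West_Asian": 60.0}

results = [
    interpret_shared_segment(seg, german, {"East_European"}, {"West_Asian"})
    for seg in (seg_1, seg_2, seg_3)
]
print(results)
```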

This is extremely important, as there is a noticeable confirmation bias in some individuals of interpreting the unusual as evidence of exotic ancestry. For example, an individual in search of Jewish ancestry may interpret segment matches with Jews as evidence for that ancestry: if he sees high Southwest_Asian ancestry in such segments, then that's a reasonable interpretation, but the shared segments could very well be interpreted as non-Jewish ancestry in the Jewish individual if they happen to be, e.g., East_European.

2. With parents' DNA

It is important to remember that each region includes both paternal and maternal DNA, and that each of your parents passed on to you a random draw of the segments they inherited from their own parents (your grandparents).

So, if you try to figure out where your region X came from, remember that it came from two places. An unusual combination (e.g., Northeast_Asian + Northwest_African) that doesn't correspond well to any known population may mean that you got half of it from one parent, and the other half from the other.

Note also that, while in a genomewide analysis a child's ancestral components will often (but not necessarily) be intermediate between those of his parents, this is not the case when looking at small segments. Suppose parent A is 50% West_Asian and 50% Mediterranean in a particular region, and parent B is 50% West_Asian and 50% West_European in the same region.

Then the child may end up with West_Asian near 100% in that region (if he happens to inherit the West_Asian segments from both parents) or near 0% (if he happens to inherit the Mediterranean and West_European ones).
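The four possible outcomes in this example can be enumerated directly. The Python sketch below is a simplification: it treats each parent as carrying two clean copies of the region, one per component, and ignores recombination inside the region. The function name is made up.

```python
from itertools import product


def child_segment_outcomes(parent_a, parent_b, component):
    """List the child's possible percentage of `component` in one region.

    Each parent is a pair of labels, one per chromosome copy in the region;
    the child inherits exactly one copy from each parent.
    """
    outcomes = []
    for from_a, from_b in product(parent_a, parent_b):
        # Each inherited copy contributes half of the child's region.
        share = 50 * ((from_a == component) + (from_b == component))
        outcomes.append(((from_a, from_b), share))
    return outcomes


parent_a = ("West_Asian", "Mediterranean")
parent_b = ("West_Asian", "West_European")

outcomes = child_segment_outcomes(parent_a, parent_b, "West_Asian")
for inherited, share in outcomes:
    print(inherited, share)
```

Only two of the four combinations leave the child at 50% West_Asian, the value both parents have in this region; the other two give 100% and 0%.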

3. With Dodecad Oracle

In general, I discourage the use of Dodecad Oracle with chromosome or segment results, for two reasons:

Small segments may appear more mixed than they are, because there may not be any informative SNPs in a particular region to distinguish between some of the ancestral components. So, the scale of the noise may be higher. As an experiment, you can average your segments, weighted by either the number of SNPs or their physical length, and you will come up with something close to your "genomewide" average, which will, however, be somewhat off because of this noise.
From a different perspective, segments may appear less mixed, because it is less likely that you got genetic material from all ancestral populations in a small section of your DNA. Your genomewide admixture may have several non-zero components, but you are unlikely to have many non-zero components in a small region (barring the aforementioned noise), and you could very well see >80% percentages in some of them that are very typical of a particular ancestral component.
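The averaging experiment mentioned in the first point can be sketched in a few lines of Python. The input structure (a SNP count plus a dictionary of percentages per segment) is an assumption made for illustration and is not the actual byseg output format; the numbers are made up.

```python
def weighted_segment_average(segments):
    """Average component percentages over segments, weighted by SNP count.

    segments: list of (n_snps, {component: percentage}) pairs.
    """
    total = sum(n_snps for n_snps, _ in segments)
    if total == 0:
        raise ValueError("no SNPs to average over")
    sums = {}
    for n_snps, components in segments:
        for name, pct in components.items():
            sums[name] = sums.get(name, 0.0) + pct * n_snps
    return {name: value / total for name, value in sums.items()}


# Two made-up segments: a short one and a longer one.
segments = [
    (100, {"West_Asian": 80.0, "Mediterranean": 20.0}),
    (300, {"West_Asian": 20.0, "Mediterranean": 80.0}),
]

average = weighted_segment_average(segments)
print(average)
```

To weight by physical length instead, the SNP counts can be replaced with segment lengths.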

Tuesday, August 16, 2011

Do-It-Yourself Dodecad v 2.0

UPDATE: The newer version is DIYDodecad v 2.1

There are many new features in DIYDodecad 2.0, including by-chromosome and by-segment ancestry analysis, and a visualization tool that can be used in conjunction with it. Simple admixture analysis as in version 1.0 is, of course, also included.

You can download the RAR archive from Google Docs (File->Download original) or Sendspace. DIYDodecad works on Windows and on 32/64-bit Linux.

Here is chromosome 3 of a project participant for 3 components of interest.

A different region on Chromosome 4 on a different set of components:

Bug reports/suggestions for improvement are always welcome at dodecad@gmail.com

NOTE: In Windows, the result files and the dv3.par file will not look good if you open them with Notepad, because Linux and Windows represent new lines differently. So, you should edit/look at these files using another text editor (e.g., Wordpad, Textpad, Word), but make sure you always save them as plain text (.txt) files.

UPDATE: additional calculators (21 Sep 2011)

Additional calculators have been released by the Dodecad Project:

Third party calculators (not by the Dodecad Project):
