Dodecad Ancestry Project: October 2010

Sunday, October 31, 2010

Results for DOD101 to DOD110 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Results for DOD091 to DOD100 posted

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Saturday, October 30, 2010

Last 24 hours to submit your data

As promised, I will receive samples from my target groups of the pilot phase until the end of October. So, you have about 24 hours to submit your sample if you haven't already.

So far, I have 122 samples from many different groups; here is my preliminary categorization:

American_White, Arab, Armenian, Ashkenazi, Assyrian, Bulgarian, Danish, East_African, Finnish, French, German, Greek, Indian, Iranian, Irish, Italian, Jewish, Maltese, Norwegian, Polish, Portuguese, Romanian, Serb, Slovenian, Spanish, Swedish, Turkish, UK, Unknown

Some of these groups have already reached the 5-person threshold, and I will be able to report statistical properties about them, such as mean admixture proportions.

But, many of them are a few individuals short of the 5-person goal. If you belong to any of them, or any other target groups, please consider submitting your data.

After October 31st, sample submission will close, and I will begin my analysis of the data to see if new nuggets of information can be discovered. This blog will continue to report on these, and will present updated information about project participants, if I am comfortable that it is robust.

New opportunities to submit your data after October 31st may become available, and these will be announced here. Otherwise, please do not submit your data after that date. If you have a sample of extraordinary interest, you may, of course, send me e-mail (dodecad@gmail.com). I will then tell you whether I can process your data, and you can then send it.

PS: You can send me the data by the end of October 31st in your timezone.

Results for DOD081 to DOD090 posted

Admixture proportions can be found in the spreadsheet. NOTE: the results posted in the spreadsheet a few minutes ago were incomplete; please check the spreadsheet again, as it is correct now.

All populations:

Individual bars:

Friday, October 29, 2010

Exploring the 10 components of the Dodecad project's initial analysis

Project participants have received their admixture proportions from K=10 different inferred ancestral components. But, what are these components?

Unfortunately, ADMIXTURE does not provide any information about the age of the components. Indeed, village populations have been identified by ADMIXTURE-like analyses in the past: these were probably formed as distinctive entities no earlier than a few hundred years ago. But such analyses also identify the great continental groups (such as East Eurasians), which were most certainly formed thousands of years ago.

Nor can we be sure about the appearance of people who belong primarily to one of the components. This is due to the fact that many physical traits have evolved relatively recently in Eurasia, the result of natural and social adaptation to local environments.

A common way of exploring the relationship between populations is to represent them as an evolutionary tree. But, caution is needed: the tree representation assumes that populations split, and it cannot represent lateral gene flow between branches.

It is better to organize ancestral components (such as the 10 components currently reported to participants, e.g., "Northern European" or "West African"), rather than extant populations (e.g., Russians or Uygur) in a tree. To do otherwise would be to force a tree representation onto populations in which lateral gene flow has been important.

Of course, no tree representation can capture the complexities of human relationships, but it nonetheless helps us visualize the data and generate hypotheses about the deep origins of prehistoric humans.

And, while lateral gene flow may have occurred among the ancestral components themselves, we are nonetheless removing one layer of admixture (e.g., between East and West Eurasians in the ancestry of Uygur), and are getting closer to the situation in Eurasia before historical and late prehistorical movements of people began shuffling genes around in force.

Another common way of presenting relationships between populations is with multidimensional scaling (MDS). This takes the distances between populations and places the populations on a "map", the first few dimensions of which are usually displayed in a series of 2-dimensional scatterplots. This is quite useful, as the first 2 dimensions of the MDS representation have been found to correlate well with a map of geography in Europe, and probably elsewhere.

Notice, however, that there is information loss: there are 45 pairwise distances (10 choose 2) between 10 populations, but each population is represented by just two (x, y) co-ordinates on a 2D map. Hence, 45 pairwise distances are mapped to 20 co-ordinate values. What this means is that distances cannot, in general, be preserved exactly. That is, if we take our ruler and measure the distance between two populations on a 2D MDS plot, we are not guaranteed that it is proportional to the original distance.

(The problem is even more severe if we were to map 1,000 individuals themselves onto a 2D MDS map: about half a million pairwise distances are mapped to 2,000 co-ordinates. Normally, the first two dimensions capture a lot of the information, but we always have to examine the raw distances themselves to be sure of individuals' relationships with each other. This is a formidable task as the number of individuals grows, which, in addition, defeats the purpose of using visualization as an aid to data interpretation).
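The counting argument above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `mds_bookkeeping` is made up for illustration and is not part of any tool used in the project.

```python
from math import comb

def mds_bookkeeping(n_items, n_dims=2):
    """Return (number of pairwise distances, number of map co-ordinates)."""
    # n choose 2 distances, versus n_dims co-ordinates per item on the map.
    return comb(n_items, 2), n_items * n_dims

for n in (10, 1000):
    pairs, coords = mds_bookkeeping(n)
    print(f"{n} items: {pairs} pairwise distances -> {coords} co-ordinates")
```

For 10 populations this gives 45 distances against 20 co-ordinates, and for 1,000 individuals 499,500 distances against 2,000 co-ordinates.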

Also, as I have noted before, individuals of quite different ancestry may fall on the same spot of an MDS or the related Principal Components Analysis (PCA) map. Nonetheless, MDS is also a method that can give us a quick visual perception of the relationships between populations.
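To make the point about distance preservation concrete, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It is not the software used to produce the project's plots, and the two distance matrices are toy examples rather than Fst values. The sketch assumes the two leading eigenvalues are positive, which holds for the toy data. A flat rectangle is reproduced exactly in 2D, whereas four mutually equidistant items (a tetrahedron) cannot be.

```python
import math

def matvec(m, v):
    """Multiply a square matrix (list of lists) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def classical_mds_2d(dist, iters=500):
    """Classical MDS: 2D co-ordinates from a symmetric distance matrix."""
    n = len(dist)
    # Double-centre the squared distances to obtain an inner-product matrix.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        # Power iteration for the leading eigenvector of b.
        start = [float((i + 1) ** (axis + 2)) for i in range(n)]
        length = math.sqrt(sum(x * x for x in start))
        v = [x / length for x in start]
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigval = sum(x * y for x, y in zip(v, matvec(b, v)))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

def max_distortion(dist, coords):
    """Largest absolute gap between a map distance and the original distance."""
    n = len(dist)
    return max(abs(math.dist(coords[i], coords[j]) - dist[i][j])
               for i in range(n) for j in range(i + 1, n))

# Toy data: corners of a 3 x 4 rectangle (flat, so 2D is enough).
rectangle = [[0, 3, 5, 4], [3, 0, 4, 5], [5, 4, 0, 3], [4, 5, 3, 0]]
# Toy data: four items all at distance 1 from each other (needs 3 dimensions).
tetrahedron = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

print(max_distortion(rectangle, classical_mds_2d(rectangle)))
print(max_distortion(tetrahedron, classical_mds_2d(tetrahedron)))
```

The first number printed is essentially zero and the second is not, which is the ruler problem described above in miniature.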

Without further ado, here is the table of Fst distances between the 10 ancestral components, as produced by ADMIXTURE. Note that these distances depend on the marker set used for the analysis, but markers were not selected for having large or small differences between populations:

Here is an MDS representation of these distances:

Here is a Neighbor-Joining tree representation:

Finally, here is a hierarchical clustering with complete linkage:

Some observations:

There are three well-defined "poles" of maximal differentiation: West Africans, East Eurasians, and West Eurasians
East Africans are related to West Africans but deviate toward West Eurasians
South Asians are intermediate between East and West Eurasians
West Eurasians consist of a core of North/South Europeans and West Asians, with Southwest Asians being slightly more removed
Northwest Africans are close to West Eurasians but also deviate towards other Africans
East Eurasians consist of Northeast Asians and East Asians

Note that in the above I am speaking of the 10 components, not of living geographical populations. For example, Ethiopians are "geographical" East Africans, who partake primarily of the East African but also of the Southwest Asian component and thus are even more inclined towards West Eurasians than the East African component is.
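For readers who want to try complete-linkage clustering on a distance table of their own, the idea can be sketched in a few lines of Python. This is an illustrative implementation, not the software used for the figure above, and the labels A to D and their distances are hypothetical values chosen for the example, not the project's Fst table.

```python
def complete_linkage(labels, dist):
    """Agglomerative clustering where the distance between two clusters
    is the LARGEST distance between any pair of their members.
    Returns the list of merges as (cluster_a, cluster_b, height)."""
    index = {name: i for i, name in enumerate(labels)}
    clusters = [(name,) for name in labels]
    merges = []

    def linkage(a, b):
        return max(dist[index[x]][index[y]] for x in a for y in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                h = linkage(clusters[i], clusters[j])
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        merges.append((clusters[i], clusters[j], h))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical distances between four made-up components.
labels = ["A", "B", "C", "D"]
dist = [
    [0.00, 0.02, 0.15, 0.16],
    [0.02, 0.00, 0.14, 0.17],
    [0.15, 0.14, 0.00, 0.03],
    [0.16, 0.17, 0.03, 0.00],
]
for a, b, height in complete_linkage(labels, dist):
    print(a, "+", b, "at height", height)
```

With these toy values, A joins B first, then C joins D, and the two pairs are joined last at the largest cross-pair distance.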

Results for DOD056, and DOD071 to DOD080

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Results for DOD046, and DOD061 to DOD070

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Thursday, October 28, 2010

Results for Greg and Lilly Mendel and Genomes Unzipped individuals added

Admixture proportions can be found in the spreadsheet

Greg and Lilly Mendel (i.e., Linda Avey, 23andMe co-founder, and her husband) and all populations:

Individual bars for Genomes Unzipped individuals. I had previously studied these in a test that lacked "South Asian" and had slightly different categories for West Eurasians.

Results for DOD045, DOD050-DOD055, and DOD057-DOD060

Admixture proportions can be found in the spreadsheet

All populations:

Individual bars:

Wednesday, October 27, 2010

24-hour opportunity for everybody to submit your data... is OVER

Thank you all for submitting! We now have 102 participants in total.

My target groups can continue submitting their data until the end of October. There may be more opportunities in the future for other groups to submit their data, so please stay tuned.

Results for DOD041 to DOD044 and DOD047 to DOD049

Admixture proportions can be found in this spreadsheet

All populations:

Individual bars:

A note of explanation on the analysis process

You may have noticed that the first figure (for all populations) in the various batch results always looks nearly the same.

The fact that I am processing participants in small batches of 10 individuals is intentional, as I do not want the composition of a heterogeneous sample to bias results. Thus, you can safely assume that e.g., "South European" or "West Asian" mean nearly exactly the same thing across batches.

Another important aspect to consider is the robustness of the result. If you feed any data into ADMIXTURE, you will receive admixture proportions; that's what it does. But, are these proportions meaningful?

There are three ways in which I am sure they are:

First, the components all have strong correlations with particular ethnic and geographical groups. They are not random groupings of individuals with no detectable pattern.

Second, the quality of the model can be assessed with the so-called Bayesian Information Criterion (BIC), a statistical quantity that assesses jointly (i) how well the model fits the data, and (ii) how parsimonious it is.

Adding a larger number of components tends to increase the model fit. Think of it as trying to reproduce an oil painting with an increasing number of hues: it gets progressively better and better.

However, this comes at a cost in parsimony. You could claim that the painter used 200 different paint colors to produce his painting, but this is overkill. You can approximate the painting with a smaller number (e.g. the 4 colors used by ancient painters) and reach almost as good a result, considering the admixture between these components.

With the BIC, it is evident that up to K=10 (which is the level at which you are getting results), the BIC has increased. This is an additional piece of evidence that this is a robust model that neither overfits the data nor sells itself short on its power of inference.
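The trade-off between fit and parsimony can be illustrated with a small Python sketch. It uses the Schwarz form of the criterion, in which a larger score is better, since the text above speaks of the BIC increasing; many packages instead report the quantity with the opposite sign, where smaller is better. The log-likelihoods, parameter counts and number of observations below are hypothetical numbers, and how parameters were actually counted for the project is not stated here.

```python
import math

def bic_score(log_likelihood, n_params, n_obs):
    """Schwarz-style score: fit minus a complexity penalty (larger is better)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

def best_k(log_likelihoods, params_per_k, n_obs):
    """Return the K with the highest score, together with all scores."""
    scores = {k: bic_score(ll, params_per_k[k], n_obs)
              for k, ll in log_likelihoods.items()}
    return max(scores, key=scores.get), scores

# Hypothetical values: fit keeps improving with K, but more slowly each time.
log_likelihoods = {2: -1000.0, 3: -900.0, 4: -895.0}
params_per_k = {2: 20, 3: 30, 4: 40}

k, scores = best_k(log_likelihoods, params_per_k, n_obs=100)
print("preferred K:", k)
```

In this toy case the jump from K=2 to K=3 buys a large gain in fit, while K=4 adds parameters for very little gain, so the score peaks at K=3.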

Third, the data shows remarkable stability. As you increase the number of variables (such as K), the number of possible solutions (admixture proportions) increases exponentially.

This means that you can describe the same data in many different ways, all of which are about equally likely. This is the statistical problem of local minima. That is, you are not getting the best solution to the problem, because it would take forever to compute, but you are getting a quite good one.

Going back to the painting example, you can achieve pretty much the same result with lots of different hues if you have a palette of 8. Surely, if none of your colors has a blue component, then you will not be able to match reality, but there are many other color combinations that you could use, mixing the colors together to achieve the same result.

By exploiting the random seed facility of ADMIXTURE, as well as by combining the public data with that of project participants, I can "shake the pot" of the ancestry analysis. No matter what I do, though, the results at this level of analysis turn out the same, hence the repetitive first figure in the batch results.

This indicates (but does not prove) that this K=10 level analysis is indeed very close to (if not perfectly) optimal, and, at least, is not simply a just-so story: it reflects intrinsic qualities of the data.
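One way to quantify this kind of stability is to compare the admixture proportions from two runs after matching up the components, since different runs may list the same components in a different order. The Python sketch below does this by brute force over all relabellings, which is only practical for small K; the two runs shown are made-up numbers with K=3, not project results, and this is not necessarily how the comparison was done for the project.

```python
from itertools import permutations

def max_abs_difference(run_a, run_b):
    """Smallest achievable maximum |difference| between two runs,
    over all ways of relabelling the components of run_b.
    Each run is a list of rows, one row of K proportions per individual."""
    k = len(run_a[0])
    best = None
    for perm in permutations(range(k)):
        worst = max(abs(row_a[c] - row_b[perm[c]])
                    for row_a, row_b in zip(run_a, run_b)
                    for c in range(k))
        if best is None or worst < best:
            best = worst
    return best

# Hypothetical proportions for three individuals from two different seeds.
# The second run lists the same components in a different column order.
run_a = [[0.70, 0.20, 0.10],
         [0.10, 0.10, 0.80],
         [0.30, 0.60, 0.10]]
run_b = [[0.11, 0.69, 0.20],
         [0.79, 0.11, 0.10],
         [0.10, 0.31, 0.59]]

print(max_abs_difference(run_a, run_b))
```

A small value means the two runs tell the same story once the labels are aligned. For K=10 there are over 3.6 million relabellings, so a proper assignment algorithm would be a better choice than brute force.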

Looking into the future

By participating in the project, you can, however, change the genetic picture of Eurasia. These K=10 results are robust, but they are dependent on the sampled populations.

Consider for example, if we had not included Yakuts, Uygurs, and Chuvash in the analysis. Then, almost certainly, we would not have uncovered the "Northeast Asian" component.

A similar story exists now for many parts of the world: is there a specific Scandinavian cluster? We can't say, because the combined samples of HGDP, HapMap-3, and Behar et al. (2010) that are publicly available are devoid of Scandinavian populations. Is there a specific Balkan cluster? The same, as only a sample of Romanians is included.

Thus, prospective participants should know that they not only get an estimate of their admixture proportions from the existing 10 components, but they may help - if enough of them contribute - define new ones.

Such new clusters may or may not exist, and they may or may not be detectable at the level of detail provided by the analysis. However, we will not find out unless we try, which is why it is probably a good idea for people to submit their results.

See if you are in one of the groups of the pilot phase, but, by all means contact me (dodecad@gmail.com) if you think you belong to an under-represented group that would help fill out the genetic picture of Eurasia.

Results for DOD031 to DOD040 posted

Admixture proportions can be found in this spreadsheet

All populations:

Individual bars:

Tuesday, October 26, 2010

Results for DOD021 to DOD030 posted

Admixture proportions can be found in this spreadsheet.

All populations:

Individual bars:

Results for DOD011 to DOD020 posted

Admixture proportions can be found in this spreadsheet.

All populations:

Individual bars:

Facebook page for the Dodecad Ancestry Project

I have made a Facebook page for the Project. I don't post on Facebook much, and all project updates will be posted here, but the Facebook page may be useful to spread the word about the project, and for participants to interact and discuss their results if they want. I will occasionally drop by there to give my € 0.02.

Results for DOD001 to DOD010 posted

Admixture proportions can be found in this spreadsheet.

All Populations:

Individual bars:

Sunday, October 24, 2010

Introducing the Dodecad ancestry project

Welcome to the Dodecad ancestry project!

The Dodecad ancestry project, named for the Greek word for "group of twelve", aims to provide detailed ancestry analysis, primarily for Eurasian individuals. Please read on, if you want to participate; if not, please subscribe to the feed to keep up with the project's progress.

The project was started by Dienekes' Anthropology Blog.

1) Project goals

The Dodecad ancestry project has two goals:

To provide detailed ancestry analysis to individuals who have tested with 23andMe; other testing companies may be included in the future.
To build samples of individuals for regions of the world (e.g. Greeks, Finns, Albanians, Southern Italians, etc.) currently under-represented in publicly available datasets.

I neither endorse nor am affiliated with any genetic testing company. I have chosen to base the project on 23andMe results, because (i) I perceive that quite a few people have used the service, and (ii) the Illumina genotyping platform it uses has substantial overlap with the publicly available datasets on which my analysis depends.

2) Data privacy statement

Your raw genetic data will not be shared with anyone.

It will not be analyzed for anything other than ancestry or admixture. No analysis of physical or medical traits will be performed.

Individual-level results will be revealed with only a unique ID, without any further information about the identity or origin of each participant.

3) Who is eligible to participate

Due to my inability to process a large number of samples, at present, only the following groups are eligible to participate in the project's current pilot phase:

Greeks (not necessarily from Greece: Cypriots, Pontic Greeks from the former USSR, North Epirotes, Griko speakers from Italy, Muslim rumca speakers from Turkey, etc. are all accepted)
People from the Balkans
People from Anatolia
People from the Caucasus
Italians
Non-Indo-European speakers from Europe (e.g., Finns, Hungarians, Basques)
Scandinavians and Icelanders
Iranians
Armenians
Jews from Italy, the Balkans, or Anatolia
Assyrians
Arabs

Samples should be received by the end of October 2010. There may be a new opportunity to submit your data after that, which will be announced in this blog.

If you are uncertain whether your sample can be included in the project, please write to me at dodecad@gmail.com to inquire.

Close relatives should not submit all their samples. If you and your relatives have tested, please submit independent samples. For example, if you have data for you, your father, and your mother, it is ok to submit either (i) your own data, or (ii) your father and mother's data -- provided that it is not a consanguineous (e.g., cousin or uncle-niece) marriage.

4) What to send

Your compressed genotype file from 23andMe and as much information about your ancestry as you wish to reveal. It is necessary, however, that you tell me at least the country of origin of most of your ancestors and their ethnicity. Information about spoken language or religion may also be useful.

5) Where to send it

Send it to dodecad@gmail.com. I will respond with a unique identifying code of the form DOD001, DOD002, and so on. The results will then be posted in the blog with that ID.

6) What you will receive

You will be included in an ADMIXTURE analysis together with other project participants and publicly available populations, and the results will be published in this blog. Your sample will only be identified by your ID. If a group (e.g., Greeks) has at least 5 participants, I may also post the average admixture proportions for that group.

The following figure and table give an idea of what to expect. Project members can expect to get a bar of their own, and a list of their ancestral proportions.

The number of components and population samples may vary over the course of the project.
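As an illustration of the group averages mentioned in this section, here is a minimal Python sketch that reports mean admixture proportions only for groups with at least 5 participants. The group names and proportions are invented for the example, and the function is a sketch of the idea rather than the code used for the project.

```python
from collections import defaultdict

def group_means(samples, min_size=5):
    """samples: list of (group, proportions) pairs.
    Returns mean proportions for each group with at least min_size members."""
    by_group = defaultdict(list)
    for group, proportions in samples:
        by_group[group].append(proportions)
    means = {}
    for group, rows in by_group.items():
        if len(rows) >= min_size:
            # Average each component (column) across the group's members.
            means[group] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

# Hypothetical participants with two components each.
samples = [
    ("GroupX", [0.4, 0.6]),
    ("GroupX", [0.6, 0.4]),
    ("GroupX", [0.5, 0.5]),
    ("GroupX", [0.5, 0.5]),
    ("GroupX", [0.5, 0.5]),
    ("GroupY", [0.9, 0.1]),
    ("GroupY", [0.8, 0.2]),
]
print(group_means(samples))
```

Here GroupX has five members and gets an average, while GroupY has only two and is left out until more samples arrive.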

7) Project updates

All project updates will be announced and presented in this blog. Additional commentary on the project may be posted in Dienekes' Anthropology Blog. I may also post occasionally on twitter about the project's progress.

8) Feedback

I encourage feedback about the project from participants or prospective participants. Please address it to dodecad@gmail.com
