Thursday, March 31, 2011

Clusters Galore results, K=73 for Dodecad Project members (up to DOD581)

The results can be found in the spreadsheet. There are 73 clusters with 13 MDS dimensions retained. The spreadsheet contains 558 rows for unrelated project participants, each giving the probabilities that the individual belongs to each of the 73 clusters. These are followed by 36 rows for the reference populations, showing how many individuals from each population are assigned to each cluster.

To interpret your results, first search for your DOD number and see which columns you have non-zero probabilities in. The vast majority of individuals are uniquely assigned (100%) to one of the 73 clusters. Then, you can visit the ancestry thread to see who else is assigned to the same cluster as yourself, and also look at the reference populations to see how they are represented in the different clusters.
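As a minimal sketch of the lookup described above, assume the spreadsheet has been exported to CSV with an `ID` column followed by one probability column per cluster. The column names, the sample IDs and the rows below are hypothetical, not the Project's actual layout or data.

```python
import csv
import io

def nonzero_clusters(csv_text, sample_id, id_column="ID"):
    """Return {cluster_column: probability} for the non-zero entries of one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row[id_column] == sample_id:
            return {
                col: float(val)
                for col, val in row.items()
                if col != id_column and float(val) > 0.0
            }
    raise KeyError(f"{sample_id} not found")

# Hypothetical three-cluster excerpt; the real sheet has 73 cluster columns.
example_csv = """ID,1,2,3
DODXXX,0,1,0
DODYYY,0.3,0,0.7
"""

print(nonzero_clusters(example_csv, "DODXXX"))  # uniquely assigned to one cluster
print(nonzero_clusters(example_csv, "DODYYY"))  # split between two clusters
```

A row with a single entry of 1.0 corresponds to the "uniquely assigned (100%)" case; a row with several non-zero entries is split between clusters.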

The following 67 IDs were characterized as outliers:
DOD002 DOD004 DOD006 DOD020 DOD029 DOD030 DOD034 DOD036 DOD060 DOD063 DOD072 DOD075 DOD081 DOD107 DOD126 DOD128 DOD132 DOD133 DOD156 DOD157 DOD168 DOD169 DOD175 DOD183 DOD201 DOD224 DOD234 DOD245 DOD252 DOD303 DOD309 DOD316 DOD326 DOD328 DOD339 DOD348 DOD349 DOD359 DOD363 DOD380 DOD382 DOD385 DOD388 DOD392 DOD393 DOD422 DOD425 DOD430 DOD435 DOD437 DOD489 DOD492 DOD495 DOD500 DOD502 DOD511 DOD521 DOD523 DOD531 DOD533 DOD536 DOD548 DOD571 DOD572 DOD573 DOD574 DOD577

As previously explained, outliers may either be mixed individuals or individuals from particular populations not well represented in the Project. In both cases they appear to be more "distant" from other individuals and from their respective clusters.

Getting back to the 73 inferred clusters:
  • Cluster #2 is by far the largest, consisting mainly of "British Isles"/American White types of people; this grew substantially because of the recent open submission call, when many people of this type of ancestry joined
  • Cluster #3 is essentially Ashkenazi Jewish, another big group in the Project
  • Cluster #5 is not represented in the reference populations except for a single Utah White. This is largely German.
  • Cluster #6 is mostly (but not exclusively) French.
  • Cluster #9 is largely Finnish and also includes some East Slavs.
  • Cluster #12 is essentially South Italian/Sicilian/Greek
  • Cluster #14 is mostly Assyrian/Armenian
  • Cluster #16 is mostly Balkan
  • Cluster #21 is mostly Scandinavian
  • Cluster #27 is mainly Balto-Slavic
  • Cluster #29 is essentially Iberian
I have covered most of the largest clusters, but there are also plenty of smaller ones. So, make sure you contribute to and read the ancestry thread to get a feel for the kind of people that share your cluster. Many of the mixed-race participants (e.g., African Americans) are split into multiple clusters; I recently observed that this is the case for highly variable populations with inter-continental admixture.

Also, don't forget that sharing a cluster does not imply a very strong genetic similarity, as clusters may be either very tight or very loose. This analysis is better at identifying differences than at confirming strong similarities.

Readers of the blog will be aware that many of these clusters can be subdivided further if a regional analysis is carried out (e.g., the Assyrian-Armenian one), while others have proven difficult to split meaningfully (e.g., the Iberian one into Spanish vs. Portuguese).

This time around, I included all the Genomes Unzipped people as well as Lily and Greg Mendel (LIL001, GRM001).

I will be exploring these clusters further, and any further regional structure that I may discover will be posted in this blog. So, do subscribe to the feed as there may be additional results for your sample ID.

Monday, March 28, 2011

Results for DOD567 to DOD580 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:

Sunday, March 27, 2011

Results for DOD553 to DOD566 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:
Individual bars:

Results for DOD540 to DOD552 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:
Individual bars:

Saturday, March 26, 2011

Results for DOD516 to DOD539 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:

Friday, March 25, 2011

Results for DOD504 to DOD515 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:


Results for DOD491 to DOD503 posted

Note: all people who submitted their results on time will get a DOD number. Even though the submission period lasted only 13 hours, I received on the order of 100 requests, so be patient.

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:
Individual bars:

Results for DOD479 to DOD490 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:

Results for DOD467 to DOD478 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars:



Submission opportunity... is OVER

Thank you to everyone who submitted their data during this opportunity. Don't forget to leave a comment in the ancestry thread.

The only submissions I will accept after this notice will be those of old Family Finder project members (who had been assigned an FFD) who wish to submit their Illumina data, as specified here.

Thursday, March 24, 2011

Limited time opportunity for everybody to submit your data

Who is eligible:

Everyone who has 23andMe v2 or v3 or Family Tree DNA Family Finder Illumina data.

Please do not submit samples of relatives, as these make analysis difficult. I consider 2nd cousins and closer relatives to be related.

If you don't have your data yet, you can subscribe to the feed for future opportunities to submit your data.

What you will receive:

You will receive the standard K=10 analysis results such as this, and you will also be eligible for other types of tests such as Galore analysis, supervised, or regional studies.

Your raw data or genealogical information will not be shared or distributed in any manner, and it will not be analyzed for any other purpose than assessment of ancestry (i.e., not for any physical or health-related traits). It will be identified by a unique ID, known to you and me, and results will be posted in the blog using that ID. I will continue to analyze your data for ancestry, and new results will be posted using that same ID. Also, I will report aggregate results for populations with at least 5 participants.

What to send:

Send your zipped autosomal data to dodecad@gmail.com. Also let me know something about your ancestry (e.g., ethnic group, country of origin of grandparents, or anything else that might be useful).

FTDNA Family Finder Illumina data for project members

If:
  • you have already submitted your data to the Project and have received your id (FFD)
  • you have not submitted 23andMe data
  • you have received your new Illumina data from FTDNA
then, you can send me your new Illumina data. This will allow me to incorporate you together with the 23andMe submitters and I will compute new admixture proportions for you. Moreover, in the future you will be included in analyses that were previously reserved for 23andMe submitters due to the different chip technology.

Please reference your FFD when you submit your data.

Results for DOD459 to DOD466 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

All populations:
Individual bars:

Monday, March 21, 2011

Ancestral North Indian - Ancestral South Indian (ANI/ASI) inferred proportions for South Asian members

I have taken the 22 Project participants with a membership of at least 1/4 in the "South Asian" component of the K=10 standard analysis and ran them together with the populations of the "Indian Cline" described by Reich et al. (2009), as described here.
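The selection rule above (membership of at least 1/4 in one component) can be sketched as a simple filter. The sample IDs and proportions below are hypothetical placeholders, and the data structure is an assumption made for illustration.

```python
def select_by_component(proportions, component, threshold=0.25):
    """Return the IDs whose membership in `component` is at least `threshold`."""
    return sorted(
        sample_id
        for sample_id, comps in proportions.items()
        if comps.get(component, 0.0) >= threshold
    )

# Hypothetical K=10 excerpts, given as fractions (not the Project's real values).
example = {
    "DODAAA": {"South Asian": 0.62, "West Asian": 0.20},
    "DODBBB": {"South Asian": 0.25, "East Asian": 0.10},
    "DODCCC": {"South Asian": 0.05, "West European": 0.60},
}
print(select_by_component(example, "South Asian"))
```

With these placeholder values, the first two samples pass the 1/4 threshold and the third does not.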

Since some of the Project's participants have either African or East Eurasian admixture they do not fall strictly along the "Indian Cline" between West-Eurasian-like Ancestral North Indians (ANI) and indigenous South Asian Ancestral South Indians (ASI). I therefore included HapMap Yoruba and Beijing Chinese to weed out these influences and ran a K=4 analysis.

Here are the ADMIXTURE results:

Here are the individual results:

Surprisingly, the inclusion of the Dodecad participants, of the African and Chinese controls, or of both has served to better flesh out the Indian Cline in the ADMIXTURE results. Below is a scatterplot of the "West Eurasian" component inferred by ADMIXTURE vs. the Ancestral North Indian (ANI) component of Reich et al. (2009).

R² = 0.98 indicates that the ANI component can be inferred almost perfectly from the West Eurasian ADMIXTURE percentage.
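For readers who want to reproduce this kind of check, here is a minimal sketch of the coefficient of determination for a straight-line fit, which for a single predictor equals the squared Pearson correlation. The input numbers are illustrative placeholders only, not the Project's estimates or the values of Reich et al. (2009).

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # For one predictor, R^2 is the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)

# Placeholder values for illustration only.
west_eurasian = [0.40, 0.50, 0.60, 0.70]
ani = [0.41, 0.49, 0.62, 0.68]
print(round(r_squared(west_eurasian, ani), 3))
```

A value close to 1 means that one set of estimates is very nearly a linear function of the other.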

Below are participants' individual results showing their inferred ANI and ASI components:

Raw results can be found in the spreadsheet.

Sunday, March 6, 2011

Results for DOD449 to DOD458 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

The current submission opportunity is over.

All populations:

Individual bars:

Wednesday, March 2, 2011

Supervised ADMIXTURE 1.1 analysis results

ADMIXTURE 1.1 offers a new running mode in which some individuals are "fixed" as belonging to a particular population (100%), and the ancestral proportions of the remaining ones are estimated.
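A sketch of preparing the input for such a run follows. It assumes, per the ADMIXTURE manual and worth double-checking there, that supervised mode is enabled with the `--supervised` flag and reads a `.pop` file with one line per individual, in the same order as the genotype file: a population label for each fixed reference individual and `-` for each individual to be estimated. The helper name, sample IDs and labels are hypothetical.

```python
def pop_file_lines(sample_ids, reference_labels):
    """One line per individual: its population label if fixed, '-' if to be estimated."""
    return [reference_labels.get(sample_id, "-") for sample_id in sample_ids]

# Hypothetical IDs and labels, listed in genotype-file order.
samples = ["REF_A1", "REF_A2", "REF_B1", "TEST_1"]
labels = {"REF_A1": "PopA", "REF_A2": "PopA", "REF_B1": "PopB"}
print("\n".join(pop_file_lines(samples, labels)))
```

Here the three reference individuals are fixed to their populations and only `TEST_1` has its proportions estimated.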

Naturally, I wanted to see how well this works in practice, so I went through the ancestry thread to find some test cases.

DOD006 reports half North Italian and half Ashkenazi Jewish ancestry. Using 5 Dodecad Project North Italians and 25 Dodecad Project Ashkenazi Jews, I estimate his/her ancestry as 24.9% North Italian and 75.1% Ashkenazi Jewish. Substituting the HGDP North Italian sample (from Bergamo) for the Dodecad one, I obtain values of 27.1% N.I. and 72.9% AJ. Based on these results, I would wager that the North Italian ancestor was half Jewish, or otherwise atypical for that population.

DOD073 reports half German and half Irish ancestry. Using 17 Dodecad Project Irish and 11 Dodecad Project Germans as references, I estimate his/her ancestry as 55.9% Irish and 44.1% German. This seems reasonable, given the limitations of the algorithm and the relative closeness of the two populations.

DOD188 reports half Sicilian, half Polish ancestry. Using 6 Poles and 20 South Italians/Sicilians from the Dodecad Project, I estimate his/her ancestry as 40.6% Polish and 59.4% Sicilian. Is this slightly worse result due to the algorithm's limitations, or, as I suspect, to the smaller Polish sample?

DOD014 is a very interesting case, reporting half Greek, half South Italian/Sicilian ancestry. Given the close relationship between these two populations, I did not know what to expect, and the result of 30.6% Greek and 69.4% South Italian/Sicilian probably indicates the difficulty of obtaining accurate estimates for admixture between related populations.

DOD245 suggests an approximate breakdown of 50% W African, 25% Ashkenazi, and 25% N European/English with some Native American. Using HGDP Yoruba, Dodecad Ashkenazi, and 17 Dodecad British, I estimate 50.5% W African, 24.4% Ashkenazi, and 25.1% British, which seems right on the money, thanks, perhaps, to the large reference samples from well-differentiated populations.

I revisited Joe Pickrell, whose ancestry I do not know fully except for the following:
  • 1 Ashkenazi great grandparent
  • 1 Italian grandparent
Guessing that the remainder is British-like white American, and the Italian part is from the south, I analyzed the sample using Dodecad British, South Italians/Sicilians, Ashkenazi and came up with 56.8%, 34%, and 9.2% respectively:
  • the Ashkenazi component is close to the expected 12.5%, given the randomness of three generations between a great-grandparent and his descendant
  • the Italian is more than the expected 25%, and this could be explained in many different ways, e.g., part-Italian descent of the non-Italian ancestors, or descent from some non-British, non-Italian white Europeans.
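The expected 12.5% and 25% figures quoted above follow from each generation halving the contribution of a given ancestor. A minimal sketch of that arithmetic (the function name is made up for illustration; actual inherited shares scatter around these expectations because of recombination):

```python
from fractions import Fraction

def expected_share(generations_back, ancestors=1):
    """Expected autosomal share from `ancestors` ancestors that many generations back."""
    return ancestors * Fraction(1, 2) ** generations_back

print(float(expected_share(3)))  # one great-grandparent -> 0.125
print(float(expected_share(2)))  # one grandparent -> 0.25
```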
Zack finds his Dodecad results (DOD128) to be compatible with a quarter Egyptian ancestry, finding his South Asian ancestry to be more similar to Punjabis (although he has no data for Punjabis). Using Pakistani Punjabis from Xing et al. (2010) and Behar et al. (2010) Egyptians as references requires me to drop the number of markers to ~38k, but the result of the supervised ADMIXTURE analysis is 77.4% Punjabi and 22.6% Egyptian, which seems compatible with what he expected.

Finally, another difficult case is DOD329, who is 3/4 Norwegian and 1/4 Swedish with a little "forest Finn". Judging from the K=10 results for this sample (only 0.4% East Asian), I don't think there is much "forest Finn" in his/her genome. Using 7 Dodecad Swedes and 6 Dodecad Norwegians as references, I obtain 46.8% and 53.2%, respectively, which is again appropriately "off" given both the small reference samples and the close relatedness of these two populations.

Concordance between self-reported and genomic ancestry

Consider DOD375 of Spanish origin (from Valencia). I ran supervised ADMIXTURE analysis using Behar et al. (2010) Spaniards and 25 HapMap Mexicans as references. Not surprisingly, this individual turns out to be 100% Spanish using this test.

Now, consider the individual who prompted my recent plea for accurate self-reporting of ancestry. I had hard evidence that this individual, who also claimed full Spanish ancestry, was in fact part Mexican. Nonetheless, I decided to make the case airtight by performing exactly the type of test described in the previous paragraph. The result: 76% Spanish and 24% Mexican, in agreement with a single Mexican grandparent.

Conclusion

This type of analysis does seem to work best when good-sized samples of the ancestral populations are available, and these populations are well-differentiated genetically.

From an anthropological viewpoint, it could be useful for populations with well-known admixture histories, such as those of the New World or parts of Central Asia.

It could also be useful as a confirmatory tool to compare self-reported vs. genomic ancestry.