Dodecad Ancestry Project: Outliers in the Dodecad Project (23andMe data)

Tuesday, November 30, 2010

Outliers in the Dodecad Project (23andMe data)

As promised, I have started to investigate outliers among Dodecad Project members. I used NNclean, as implemented in the prabclus package, to find data points that lie far from their nearest neighbor, whether that neighbor is another Dodecad Project member or an individual from the standard 692-individual panel I use in the Galore analysis.
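The idea of flagging points by nearest-neighbor distance can be sketched in a few lines of Python. This is only a simplified illustration, not NNclean itself: NNclean applies a model-based criterion to nearest-neighbor distances, whereas the sketch below assumes a fixed cutoff (a multiple of the median nearest-neighbor distance). The function names, the `factor` parameter, and the idea of representing each individual as a short coordinate tuple are all assumptions made for the example.

```python
import math
from statistics import median


def nearest_neighbor_distances(points):
    """For each point, return the Euclidean distance to its closest other point."""
    distances = []
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(nearest)
    return distances


def flag_outliers(ids, points, factor=3.0):
    """Return the IDs whose nearest-neighbor distance exceeds factor * median.

    The fixed cutoff is an illustrative assumption; it stands in for the
    model-based rule that NNclean uses.
    """
    nn = nearest_neighbor_distances(points)
    cutoff = factor * median(nn)
    return [sample_id for sample_id, d in zip(ids, nn) if d > cutoff]


if __name__ == "__main__":
    # Hypothetical IDs and 2-D coordinates: five tightly grouped points and one far away.
    ids = ["S1", "S2", "S3", "S4", "S5", "S6"]
    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]
    print(flag_outliers(ids, points))
```

In this toy data only the last point sits far from everyone else, so it is the only one flagged.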

To make a long story short, here are the IDs identified as outliers:

"DOD157" "DOD168" "DOD169" "DOD036" "DOD048" "DOD088" "DOD034" "DOD030" "DOD060" "DOD132" "DOD128" "DOD175"

An outlier is someone who is not very close to any other individual and hence does not really "cluster" with anyone. It is therefore recommended to remove outliers prior to clustering, as otherwise they will form makeshift clusters that have no clear interpretation.

Looking at the individual spreadsheet reveals that many of these outliers have very unusual ancestry. The cases fall into two categories:

Recent admixture between geographically separated populations
Being the only member from an unsampled population

In the first case, admixed individuals fall in the "empty space" between their parental clusters, and thus do not cluster with anyone else, unless a person with a similar type of admixture happens to also be in the dataset.
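A toy illustration of this "empty space" effect, under the assumption that individuals can be represented as points in a two-dimensional space (the coordinates, population names, and helper functions below are hypothetical): an individual placed midway between two parental clusters is far from everyone, until a second individual with similar admixture is added.

```python
import math


def centroid(points):
    """Mean position of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def nn_distance(target, others):
    """Distance from target to the closest point in others."""
    return min(math.dist(target, o) for o in others)


# Two hypothetical, well-separated parental populations.
pop_a = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
pop_b = [(10.0, 0.0), (10.2, 0.1), (9.9, 0.2)]

# A recently admixed individual, placed halfway between the two centroids.
ca, cb = centroid(pop_a), centroid(pop_b)
admixed = ((ca[0] + cb[0]) / 2, (ca[1] + cb[1]) / 2)

# Alone, the admixed individual is far from every other sample.
isolated = nn_distance(admixed, pop_a + pop_b)

# With a similarly admixed peer in the dataset, it has a close neighbor.
peer = (admixed[0] + 0.1, admixed[1])
with_peer = nn_distance(admixed, pop_a + pop_b + [peer])

print(f"nearest neighbor without peer: {isolated:.2f}")
print(f"nearest neighbor with peer:    {with_peer:.2f}")
```

The first distance is roughly half the gap between the two populations, while the second is small, which is why the same individual can stop being an outlier once a similar sample joins the dataset.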

In the second case, there are no other members of the individual's group in the dataset. Sometimes, if a group is close enough to another sampled group, this is not a problem, but there are many distinctive population groups for which that is not the case.

While outliers will be removed from some analyses, their outlier status will continue to be evaluated as new reference populations or Dodecad Project members are added.

3 comments:

Kamel, December 12, 2010 at 2:59 PM
Samples DOD168 and DOD169 belong to me and my wife and are from Tunisia. They are among the samples listed as outliers. So far they are the only samples from Tunisia, and they have the highest level of admixture:
South European 29.8%, Northwest African 22.1%, Southwest Asian 17.9%, West Asian 11%, North European 7.9%, East African 7.3%, West African 3.5%, South Asian 0.4%
Dienekes, December 12, 2010 at 6:51 PM
Kamel, thanks for the info. Could you also add this in the ancestry thread? I like to look at that to know which participants have identified their ancestry.

http://dodecad.blogspot.com/2010/11/information-about-project-samples.html
Kamel, December 12, 2010 at 9:50 PM
Hi Dienekes,
I posted the info in that link.
If you find any other data from Tunisia, even in research papers, please add them.