Tuesday, November 30, 2010

Outliers in the Dodecad Project (23andMe data)

As promised, I have started to investigate outliers among Dodecad Project members. I used NNclean, as implemented in the prabclus R package, to find data points that lie at a great distance from their nearest neighbor among either Dodecad Project members or the standard 692-individual panel I use in the Galore analysis.

To make a long story short, here are the IDs identified as outliers:
"DOD157" "DOD168" "DOD169" "DOD036" "DOD048" "DOD088" "DOD034" "DOD030" "DOD060" "DOD132" "DOD128" "DOD175"
An outlier is someone who is not very close to any other individual and hence does not really "cluster" with anyone. It is therefore recommended to remove outliers prior to clustering, as otherwise they will form makeshift clusters that have no real meaning.
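
The nearest-neighbor idea can be illustrated with a rough sketch. This is not NNclean itself and not the code used for the project: it is a simplified stand-in that flags any point whose nearest-neighbor distance exceeds a multiple of the median nearest-neighbor distance. The cutoff rule, the `factor` parameter, the function names, the sample IDs, and the 2-D coordinates are all assumptions made up for illustration.

```python
import math
from statistics import median


def nearest_neighbor_distances(points):
    """For each point, return the Euclidean distance to its closest other point."""
    distances = []
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(nearest)
    return distances


def flag_outliers(ids, points, factor=3.0):
    """Flag IDs whose nearest-neighbor distance exceeds factor * median.

    The median-based cutoff is an illustrative assumption, not NNclean's rule.
    """
    nn = nearest_neighbor_distances(points)
    cutoff = factor * median(nn)
    return [name for name, d in zip(ids, nn) if d > cutoff]


# Hypothetical coordinates: two tight groups plus one isolated individual "X".
ids = ["A1", "A2", "A3", "B1", "B2", "B3", "X"]
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (2.5, 9.0),
]

outliers = flag_outliers(ids, points)
print(outliers)
```

In this toy data only "X" is far from everyone else, so it is the only ID flagged; the members of the two tight groups all have a close neighbor and are kept.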

Looking at the individual spreadsheet reveals that many of these outliers have very unusual ancestry. These cases fall into two categories:
  1. Recent admixture between geographically separated populations
  2. Being the only member from an unsampled population
In the first case, admixed individuals fall in the "empty space" between their parental clusters, and thus do not cluster with anyone else, unless a person with a similar type of admixture happens to also be in the dataset.
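
A toy example may make the "empty space" point concrete. It assumes, purely for illustration, that individuals can be placed as points in a 2-D space and that a 50/50 admixed individual lands halfway between the centers of the two parental clusters; the coordinates and names below are hypothetical.

```python
import math


def nn_distance(point, others):
    """Distance from point to the closest point in others."""
    return min(math.dist(point, q) for q in others)


def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    n = len(pts)
    return tuple(sum(coord) / n for coord in zip(*pts))


# Two hypothetical parental clusters, far apart from each other.
pop_a = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2)]
pop_b = [(4.0, 0.0), (4.2, 0.1), (4.1, 0.2)]
reference = pop_a + pop_b

# Assumption: a 50/50 admixed individual sits midway between the two centroids.
admixed = tuple((a + b) / 2 for a, b in zip(centroid(pop_a), centroid(pop_b)))

# Largest nearest-neighbor distance among the unadmixed reference individuals.
typical_gap = max(
    nn_distance(p, [q for q in reference if q != p]) for p in reference
)

# The admixed individual is far from every reference individual...
admixed_gap = nn_distance(admixed, reference)

# ...unless someone with a similar mix is also present in the dataset.
second_admixed = (2.15, 0.12)
paired_gap = nn_distance(admixed, reference + [second_admixed])

print(typical_gap, admixed_gap, paired_gap)
```

The admixed point is many times farther from its nearest neighbor than any reference individual is, but that distance collapses as soon as a second individual with a similar mix is added.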

In the second case, there are no other members of the individual's group in the dataset. Sometimes, if a group is close enough to another, this is not a problem, but there are many distinctive population groups for which that is not the case.

While outliers will be removed from some analyses, their outlier status will continue to be evaluated as new reference populations or Dodecad Project members are added.

3 comments:

  1. Samples DOD168 and DOD169 belong to me and my wife and are from Tunisia. They are among the samples listed as outliers; so far they are the only samples from Tunisia, and they have the most admixture:
    South European 29.8%, Northwest African 22.1%, Southwest Asian 17.9%, West Asian 11%, North European 7.9%, East African 7.3%, West African 3.5%, South Asian 0.4%

  2. kamel, thanks for the info. Could you also add this in the ancestry thread? I like to look at that to know which participants have identified their ancestry.

    http://dodecad.blogspot.com/2010/11/information-about-project-samples.html

  3. Hi Dienekes
    I posted the info in that link
    If you find any other data from Tunisia, even in research papers, please add them.
