Friday, November 30, 2012

Geno 2.0 patch for DIYDodecad

(See important update at the end of this post)

People who have tested with the Genographic Project's Geno 2.0 can now use the DIYDodecad tool with their data. The raw data download from this test has a slightly different format from those of 23andMe and Family Finder, so it is necessary to convert your data into a format that DIYDodecad can interpret.

So, after you have downloaded and extracted the DIYDodecad software as per its instructions, you should also download a couple of extra files into your working directory; these files are included in this patch:

  • standardize.r, which replaces the standardize.r in the DIYDodecad software bundle and allows you to convert your Geno 2.0 formatted data
  • hgdp.base.txt, which includes additional information about SNP markers that is not found in your Geno 2.0 raw data download and is necessary to complete the conversion process.

Once these two files have been extracted into your working directory, the process of using DIYDodecad is exactly the same as for any other user of the software.

The only difference is in the step where you convert your data using the standardize command (see the DIYDodecad README file). At that step, use the command:


standardize('johndoe.csv', company='geno2')

where johndoe.csv is your unzipped raw data download. This will write a genotype.txt file in the working directory, and you can proceed the rest of the way as per the instructions.
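If no genotype.txt appears, it may help to confirm that everything the conversion needs is actually in the working directory. The following is only a minimal sketch of such a check: missing_inputs is a hypothetical helper, not part of DIYDodecad, and johndoe.csv stands in for whatever your unzipped download is called.

```r
# Hypothetical helper (not part of the DIYDodecad bundle): list which of the
# files needed for the Geno 2.0 conversion are absent from a directory.
missing_inputs <- function(raw_file, dir = getwd()) {
  # The two patch files plus the unzipped raw data download
  needed <- c("standardize.r", "hgdp.base.txt", raw_file)
  # file.exists() is vectorized, so this keeps only the names not found
  needed[!file.exists(file.path(dir, needed))]
}

# Usage: an empty result, character(0), means all three files were found.
# missing_inputs("johndoe.csv")
```

If the result names your raw data file, check that you uncompressed the .csv.gz download and that R's working directory is the folder holding the files.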

You can use all ancestry calculators released by the Project (or indeed other projects); the most recent one is globe13.

Be aware that, because the Geno 2.0 test includes a smaller number of SNPs, and because globe13 and other calculators were developed using the common SNP set of 23andMe and Family Finder, the analysis using globe13 will only include ~34 thousand SNPs and will be "noisier" than usual. In the future, I might develop new calculators that make use of the SNP set of the Geno 2.0 test itself.

PS: Feel free to post a comment below if you experienced any difficulty converting your data; also thanks to CeCe Moore for graciously sharing a raw data file with me, which allowed me to build this converter.

UPDATE:

Apparently, the data format has been changed for some Geno 2.0 data downloads.
If your data starts with a [Header] ... [Data] preamble followed by a list of 5 comma-separated values, you can ignore this update and use company='geno2' as described above.
If it includes a header "SNP,Chr,Allele1,Allele2" followed by a list of 4 comma-separated values, you should follow the instructions as above, but use company='geno2new' instead.
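If you are unsure which of the two layouts you have, the first line of the file is enough to tell them apart. Here is a rough sketch of that check; guess_geno2_company is a hypothetical helper of my own naming, not something in the patch, and it assumes the file begins exactly with one of the two markers described above.

```r
# Hypothetical helper (not part of the patch): guess the value of the
# `company` argument from the first line of the raw data file.
guess_geno2_company <- function(raw_file) {
  first <- readLines(raw_file, n = 1, warn = FALSE)
  if (length(first) == 0) stop("file is empty: ", raw_file)
  if (grepl("^\\[Header\\]", first)) {
    "geno2"      # older layout: [Header] ... [Data] preamble
  } else if (grepl("^SNP,Chr,Allele1,Allele2", first)) {
    "geno2new"   # newer layout: four-column header line
  } else {
    stop("unrecognized first line: ", first)
  }
}

# Usage, assuming johndoe.csv is your unzipped download:
# standardize('johndoe.csv', company = guess_geno2_company('johndoe.csv'))
```

The helper stops with an error rather than guessing when the first line matches neither layout, since a file that has been opened and re-saved in another program may no longer start the way the converter expects.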

11 comments:

  1. Found you via Cece Moore's blog... Just got my Geno 2.0 results, so downloaded your program and the patch for Geno 2.0 with no problem, but my program has been running for 3+ hours now and still not done.

    How long should it take? The R program has returned 1 line ([Header],,,,,) and that's it. No file called genotype.txt in my directory yet.

    Thanks.

  2. Send me your data if you want to dodecad@gmail.com and I'll take a look at it.

  3. The same with me; only it did not take long

  4. I have run the patch on 3 different files sent to me and it produces a genotype.txt file just fine.

    I can think of a few reasons why it might not work for you:

    (1) you did not download hgdp.base.txt
    (2) you did not uncompress your .csv.gz download file
    (3) you did not setwd to the working directory

  5. Disregard; have proceeded successfully. Thank you.

  6. This worked for me, even though I am new to R. You should change this line at my brackets []: "The only difference is that [change AT to AFTER] the step where you convert your data using the standardize command (see DIYDodecad README file), you will use the command - - " I kept trying to run the patch instead of standardize instead of following it.

  7. Figured out what I did wrong... The first time I unzipped my .csv.gz file I took a look at it in Excel. That must've corrupted the file somehow. Unzipped a copy tonight, and the size was about 180 bytes smaller. Using this new, smaller *.csv file from Geno2, I got a genotype.txt file within seconds.

    Thank you.

  8. Hello,

    Is there a Linux version (I'm on Ubuntu)? If not, would it maybe work with Wine?

    Thanks

  9. Hello again,
    If I understand correctly, it should also work on Ubuntu.
    I tried, but when I run:
    standardize('xxx.csv', company='geno2new')
    R answers:
    unexpected symbol on 'xxx.csv'

    My genofile includes a header "SNP,Chr,Allele1,Allele2"

    But just after I have the mtDNA data like "6248,Mt,T,T"
    Does this correspond to what you call the list of 4 comma-separated values?

    Thanks


    Replies
    1. Ok, it works now with Ubuntu.
      I was confused by the '' in the name of the Genographic file and the '' of the R command. R doesn't like the '' in the Genographic file name, so I changed the name of the file.
      Thanks.

  10. I have the new GENO 2 file type. I've changed the csv file name to a more user-friendly one and, having set R's working directory to the folder where all the files are, I type the following at the prompt:

    standardize('fredmr.csv', company='geno2new')

    I get the following error message:

    Error in is.data.frame(x) : object 'X' not found

    What am I doing wrong?

    Thanks
