Monday, May 8, 2017

A Primer on Downloading Sequencing Data from MG-RAST & the SRA


One of the best sets of resources we have for bioinformatics, and especially microbiome research, is the collection of extensive and freely available DNA sequence archives. For the past few years, most studies have archived (and in most cases have been required to archive) their relevant sequence datasets so that they are freely available to the public and other researchers. This is becoming an increasingly valuable resource for data mining and meta-analyses now that we have about a decade of archiving behind us. Just as these datasets can be highly valuable research tools, they can also be particularly difficult to download and prepare for analysis. I have been meaning to get to this for a while, so this week I want to go through an introduction to downloading these datasets. My goal is to equip you to easily get the sequence sets onto your own computer and start your own analysis.



The Sequence Read Archive (SRA)

One of the largest (if not the largest) sequence dataset archives available to the public is the United States National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). This sequence archive has years of DNA sequencing studies readily available, but getting the reads can be a little bit of a challenge. They do have instructions (and other tools for downloading) in their documentation, but to make things easier, we will go through it here while including some custom scripts that you can use.

An easy way to get SRA datasets using command line tools is to download the data from their FTP site (no worries if you don't know what that is; it's just a site to download data from). As long as you are downloading a small-ish dataset, the wget tool works great. A nice subroutine you can use is as follows.

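DownloadFromSRA () {
 line="${1}"
 echo Processing SRA Accession Number "${line}"
 # Output is an optional subdirectory name set by the caller (if unset, files land under ./data)
 mkdir -p ./data/"${Output}"/"${line}"
 # The FTP tree is organized by the first three and first six characters of the accession
 shorterLine=${line:0:3}
 shortLine=${line:0:6}
 echo Looking for "${shorterLine}" with "${shortLine}"
 # Recursively download the contents of the accession's FTP directory
 wget -r --no-parent -A "*" "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/${shorterLine}/${shortLine}/${line}/"
 # Move the .sra files into the output directory, then remove the mirrored FTP tree
 mv ./ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/"${shorterLine}"/"${shortLine}"/"${line}"/*/*.sra ./data/"${Output}"/"${line}"
 rm -r ./ftp-trace.ncbi.nih.gov
}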

export -f DownloadFromSRA
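
To see how DownloadFromSRA builds its FTP path from an accession, here is a minimal sketch that only prints the URL and downloads nothing. The helper name SraStudyUrl and the accession SRP000000 are made-up placeholders for illustration, not part of the original script or a real study.

```shell
# Print the FTP directory that a study accession maps to, without downloading.
# SraStudyUrl is a hypothetical helper name used only for this illustration.
SraStudyUrl () {
 local acc="${1}"
 # First three characters (e.g. SRP) and first six characters of the accession
 local prefix3="${acc:0:3}"
 local prefix6="${acc:0:6}"
 echo "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/${prefix3}/${prefix6}/${acc}/"
}

# SRP000000 is a placeholder accession, not a real study
SraStudyUrl SRP000000
```

This relies on bash substring expansion (the ${var:offset:length} form), so it should be run in bash rather than a plain POSIX sh.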

If you copy and paste the DownloadFromSRA function into your command line (Linux/Mac), you can just type the subroutine name "DownloadFromSRA", followed by the project ID that you want to use, and it will download all of the samples for you. If you are using a Mac, be sure to install wget using something like Homebrew (which I highly suggest for installing tools in general). The files you get will be in the SRA format, so remember to convert them to fastq format using the SRA Toolkit (for example, its fastq-dump tool).
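
The conversion step could be sketched roughly as follows, assuming the SRA Toolkit's fastq-dump is installed and on your PATH. The function name ConvertSraToFastq and the FASTQ_DUMP override variable are hypothetical names introduced here for illustration; they are not part of the SRA Toolkit.

```shell
# Convert every .sra file in a directory to fastq.
# Assumption: fastq-dump from the SRA Toolkit is on PATH.
# FASTQ_DUMP is a hypothetical override so a different converter can be swapped in.
ConvertSraToFastq () {
 local indir="${1}"
 local outdir="${2:-${indir}/fastq}"
 local converter="${FASTQ_DUMP:-fastq-dump}"
 local sra
 mkdir -p "${outdir}"
 for sra in "${indir}"/*.sra; do
  # Skip the unexpanded pattern when the directory holds no .sra files
  [ -e "${sra}" ] || continue
  echo Converting "${sra}"
  # --split-files writes paired reads into separate files
  "${converter}" --split-files --outdir "${outdir}" "${sra}" || return 1
 done
}
```

You would then call it on a download directory, for example ConvertSraToFastq ./data/MyStudy, where MyStudy stands in for whatever directory holds your .sra files.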

You don't have to be a superhero hacker to get DNA data from public archives.

The Metagenomics RAST Server (MG-RAST)

Although it is used less than the SRA, the Metagenomics RAST Server (MG-RAST) is another major archive available for free public use. MG-RAST is a nice sequence repository, but it is unfortunately more difficult to use than the SRA (for downloading sequences at least). Downloading MG-RAST data with command line tools is honestly complicated at first, and the key details are somewhat hidden in the documentation. Again, to make things easier, we can use some custom scripts.

The trick to getting the MG-RAST sequence files using a project ID is that you have to first download the project metadata, and then use the parsed metadata information to download the actual files (this is done in the loop near the end of the function below). The actual URL to use with their API is also kind of confusing, but once you get it you are ready to go.

DownloadFromMGRAST () {
 line="${1}"
 echo Processing MG-RAST Accession Number "${line}"
 mkdir -p ./data/"${line}"
 # Download the raw information for the metagenomic run from MG-RAST
 wget -O ./data/"${line}"/tmpout.txt "http://api.metagenomics.anl.gov/1/project/mgp${line}?verbosity=full"
 # Parse the raw metagenome information for individual sample IDs
 sed 's/metagenome_id\"\:\"/\nmgm/g' ./data/"${line}"/tmpout.txt \
  | sed 's/\".*//' \
  | grep mgm \
  > ./data/"${line}"/SampleIDs.tsv
 # Get rid of the raw metagenome information now that we are done with it
 rm ./data/"${line}"/tmpout.txt
 # Now loop through all of the accession numbers from the metagenome library
 while read -r acc; do
  echo Downloading MG-RAST Sample ID "${acc}"
  # file=050.1 selects the raw input that the author submitted to the archive
  wget -O ./data/"${line}"/"${acc}".fa "http://api.metagenomics.anl.gov/1/download/${acc}?file=050.1"
 done < ./data/"${line}"/SampleIDs.tsv
 # Get rid of the sample list file
 rm ./data/"${line}"/SampleIDs.tsv
}

export -f DownloadFromMGRAST

These files will be in the fasta format instead of the sra format you get from the SRA. Also note that this uses GNU sed, which is not installed on Mac computers by default (Mac ships with a different version of sed; I know, it's kind of annoying). So if you are running this on a Mac, make sure to install GNU sed, again using Homebrew.
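
If installing GNU sed is inconvenient, the ID-parsing step might be sketched with grep -o instead, which avoids the newline escape in the sed replacement that only GNU sed accepts. This is an untested-against-the-live-API sketch: the helper name ExtractMgrastIds is hypothetical, and the sample JSON is made up, with its shape inferred only from the sed expression in the function above.

```shell
# Read MG-RAST project JSON on stdin and print one mgm-prefixed sample ID per line.
# ExtractMgrastIds is a hypothetical helper name used only for this illustration.
ExtractMgrastIds () {
 # grep -o prints each "metagenome_id":"..." match on its own line
 grep -o '"metagenome_id":"[^"]*"' \
  | sed 's/^"metagenome_id":"//; s/"$//; s/^/mgm/'
}

# Made-up sample input; the real API response contains many more fields
echo '{"metagenomes":[{"metagenome_id":"1111111.3","name":"a"},{"metagenome_id":"2222222.3"}]}' \
 | ExtractMgrastIds
```

Inside DownloadFromMGRAST, the output of this helper would take the place of the sed and grep pipeline that writes SampleIDs.tsv.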

To give it a try, copy and paste this subroutine into your command line, and then write the project ID, like below.


DownloadFromMGRAST 4843
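
If you have several project IDs to fetch, a small wrapper along these lines could call either download function once per ID. This is only a sketch: RunForEachId and the file name ids.txt are hypothetical names, and the ID file is assumed to hold one ID per line.

```shell
# Call a function once for each ID listed in a file (one ID per line).
# RunForEachId is a hypothetical helper name used only for this illustration.
RunForEachId () {
 local func="${1}"
 local idfile="${2}"
 local id
 # The extra test keeps a final line that lacks a trailing newline
 while read -r id || [ -n "${id}" ]; do
  # Skip blank lines and comment lines starting with #
  case "${id}" in ''|\#*) continue ;; esac
  # Redirect stdin so the called function cannot consume the ID list
  "${func}" "${id}" < /dev/null || echo "Failed on ${id}" >&2
 done < "${idfile}"
}
```

With the IDs saved in ids.txt, you would run RunForEachId DownloadFromMGRAST ids.txt. A failed download prints a message and the loop moves on to the next ID rather than stopping.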

Conclusions

So there you have it: a very brief introduction to downloading SRA and MG-RAST datasets, with an emphasis on providing you the tools to do it yourself. Go ahead and give it a try, and let me know how it works. If you run into problems, or have any questions, comments, or concerns, feel free to reach out!

Finally, thanks for reading! If you are a frequent reader, you might have noticed that my posts have been less frequent lately. I apologize for that. This has been an eventful year, which is great in general but bad for keeping up with the blog. As usual, it means I have some other exciting projects going on, and I am excited to share those experiences on here later. So for now the posts will be less frequent, but I look forward to getting back in a more frequent writing groove in the near future.

20 comments:

  1. With the whole digital revolution, i usually argue that there should be a software engineer in every house. I myself am quite intrigued with programming and this was helpful.

  2. This was helpful. Thanks!
