Deprecated - SRA using SRA Toolkit

Background and external resources

“Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data.”

NCBI - SRA

Downloading sequence data from SRA can be somewhat complicated, however, both because of the database structure and because files are stored in SRA format and must be converted (to FASTA, FASTQ, etc.).

The SRA Toolkit allows files to be downloaded as a batch, using the project accession number, and converted to the required format at the same time.

To install the SRA Toolkit, follow these instructions: https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit

To download data as FASTA/FASTQ files using project or BioSample accession numbers, see: https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
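As a rough illustration of the two-step workflow covered on that wiki page, the sketch below prints the prefetch and fasterq-dump commands for each run accession. The helper names (run, fetch_accession) and the DRY_RUN switch are assumptions introduced here for illustration and are not part of the SRA Toolkit; the accessions are the ones used in the PBS example further down this page.

```shell
#!/bin/bash
# Sketch only: "run", "fetch_accession" and DRY_RUN are hypothetical helpers.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fetch_accession() {
    local acc="$1"
    run prefetch "$acc"                     # download the .sra file
    run fasterq-dump --split-files "$acc"   # convert it to FASTQ
}

for acc in SRR1002659 SRR1002660; do
    fetch_accession "$acc"
done
```

Setting DRY_RUN=0 would execute the commands for real, which assumes sra-tools is installed and on the PATH.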

You can also use SRA Explorer to view all files in a project and download all or some of them: https://www.biostars.org/p/385930/

Goal

Download public data deposited in NCBI’s Sequence Read Archive (SRA) database.

Pre-requisites (if not available)

Installing miniconda

Miniconda — Anaconda documentation

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh

Install sra-tools

Once conda is installed on the instance, go to https://anaconda.org and search for sra-tools. Copy the install command and run it in your HPC account:

conda install -c bioconda sra-tools
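After the install finishes, it can be worth confirming that the tools are visible on your PATH. The snippet below is a minimal sketch; check_tool is a hypothetical helper name introduced here, and the three tool names are the sra-tools commands mentioned on this page.

```shell
#!/bin/bash
# Sketch only: "check_tool" is a hypothetical helper, not part of sra-tools.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

for tool in prefetch fastq-dump fasterq-dump; do
    check_tool "$tool" || echo "  -> is the conda environment active?"
done
```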

Download SRA files

Example: PBS script (launch_fetch_SRAfiles.pbs) to fetch multiple files from the SRA database

#!/bin/bash -l
#PBS -N SRAfiles
#PBS -l walltime=2:00:00
#PBS -l mem=4gb
#PBS -l ncpus=2
#PBS -m bae
###PBS -M email@host
#PBS -j oe

cd $PBS_O_WORKDIR

### User defined SRA identifiers (space-separated, one argument per run)
ACCESSIONS="SRR1002659 SRR1002660 SRR1002661 SRR1002662"

### Pipeline
# Step 1: Download the SRA files
prefetch ${ACCESSIONS}

# Step 2: Extract FASTQ file(s) from each SRA file
fastq-dump --split-files ${ACCESSIONS}
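The tools generally take each accession as a separate command-line argument, so a comma-separated list (for example, one copied from a spreadsheet) needs to be split first. The sketch below shows one way to do that in bash; ACC_ARRAY is a hypothetical variable name, and the echo stands in for the real prefetch and fastq-dump calls.

```shell
#!/bin/bash
# Sketch only: ACC_ARRAY is a hypothetical variable name.
ACCESSIONS=SRR1002659,SRR1002660,SRR1002661,SRR1002662

# Split the comma-separated string into a bash array, one accession per element.
IFS=',' read -r -a ACC_ARRAY <<< "$ACCESSIONS"

echo "Number of accessions: ${#ACC_ARRAY[@]}"
for acc in "${ACC_ARRAY[@]}"; do
    # Replace echo with the real calls on the cluster, e.g.:
    #   prefetch "$acc" && fastq-dump --split-files "$acc"
    echo "Would fetch: $acc"
done
```

Processing one accession per loop iteration also makes it easier to see which run failed if a download is interrupted.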

Submit the PBS script to the HPC cluster
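On a PBS cluster, submission is normally done with qsub. The following is a minimal sketch, assuming the script was saved as launch_fetch_SRAfiles.pbs in the current directory; submit_job is a hypothetical wrapper added here only to give clearer messages when the file or the qsub command is missing.

```shell
#!/bin/bash
# Sketch only: "submit_job" is a hypothetical wrapper around the PBS qsub command.
submit_job() {
    local script="$1"
    if [ ! -f "$script" ]; then
        echo "PBS script not found: $script" >&2
        return 1
    fi
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not available; run this on the HPC login node" >&2
        return 1
    fi
    qsub "$script"    # prints the job identifier on success
}

submit_job launch_fetch_SRAfiles.pbs || echo "Job was not submitted"
```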

Monitor job progress
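Job status on PBS is usually checked with qstat. The sketch below lists your own jobs and shows how the joined output file is typically named for a job called SRAfiles. The helper names and the job identifier 12345.pbsserver are made up for illustration, and the log naming pattern is the common PBS default, which your site may override.

```shell
#!/bin/bash
# Sketch only: "job_log_name" and "show_my_jobs" are hypothetical helpers.

# PBS usually names the joined output file <job name>.o<job number>.
job_log_name() {
    local job_name="$1" job_id="$2"
    echo "${job_name}.o${job_id%%.*}"   # strip the ".server" suffix from the job id
}

show_my_jobs() {
    if command -v qstat >/dev/null 2>&1; then
        qstat -u "$USER"
    else
        echo "qstat not available; run this on the HPC login node"
    fi
}

show_my_jobs
# 12345.pbsserver is a made-up job identifier used only as an example.
echo "Expected log file: $(job_log_name SRAfiles 12345.pbsserver)"
```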