Background and external resources
“Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data.”
...
To install the SRA Toolkit, follow these instructions: https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit
To download sequence data files using run, project, or biosample accession numbers: https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
You can also use SRA Explorer to view all files in a project and download all or some of them: https://www.biostars.org/p/385930/
Goal
Download public data deposited in NCBI’s Sequence Read Archive (SRA) database.
Pre-requisites
A working conda3 or miniconda3 installation (https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html)
Basic Unix command-line knowledge (for example: https://researchcomputing.princeton.edu/education/external-online-resources/linux ; https://swcarpentry.github.io/shell-novice/)
Familiarity with a Unix text editor (for example, Vi/Vim or Nano)
Installing miniconda
Download the Linux installer from https://docs.conda.io/en/latest/miniconda.html#linux-installers and run it:
```bash
bash Miniconda3-latest-Linux-x86_64.sh
```
Install sra tools
Once conda is installed on the instance, go to https://anaconda.org and search for sra-tools. Copy the install command shown there and run it in your HPC account:
```bash
conda install -c bioconda sra-tools
```
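Before submitting a job, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch, not part of the SRA Toolkit: `check_tools` is a hypothetical helper name, and the list of tools is an assumption based on the commands used on this page.

```shell
#!/bin/sh
# check_tools: hypothetical helper that reports whether each named
# command can be found on PATH. Returns 1 if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools used on this page (assumed list); adjust as needed.
check_tools conda prefetch fastq-dump \
    || echo "Install the missing tools before submitting the job."
```

If `prefetch` or `fastq-dump` is reported as missing right after installation, the conda environment that contains sra-tools is probably not active in the current shell.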
Download SRA files
Submit a PBS script to fetch SRA files
```bash
#!/bin/bash -l
#PBS -N SRAfiles
#PBS -l walltime=2:00:00
#PBS -l mem=4gb
#PBS -l ncpus=2
#PBS -m bae
###PBS -M email@host
#PBS -j oe

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

### User defined variables
SRAID=SRR1002659

### Pipeline
# Step 1: Download SRA file
prefetch ${SRAID}

# Step 2: Extract FASTQ file(s) from SRA file
fastq-dump --split-files ${SRAID}
```
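The two pipeline steps can be wrapped in a small function so that several runs are fetched in one job. This is a sketch under stated assumptions: `run`, `is_run_accession`, `fetch_run`, and the `DRY_RUN` switch are hypothetical names introduced here, and the accession check only assumes that run accessions consist of an SRR, ERR, or DRR prefix followed by digits. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# is_run_accession: accept only SRR/ERR/DRR followed by digits.
is_run_accession() {
    case "$1" in
        [SED]RR[0-9]*) ;;          # prefix plus at least one digit
        *) return 1 ;;
    esac
    case "${1#???}" in
        *[!0-9]*) return 1 ;;      # non-digit after the prefix
    esac
    return 0
}

# fetch_run: the same two steps as the PBS script, for one accession.
fetch_run() {
    if ! is_run_accession "$1"; then
        echo "skipping invalid accession: $1" >&2
        return 1
    fi
    run prefetch "$1" && run fastq-dump --split-files "$1"
}

# Add further run accessions to this list, separated by spaces.
for id in SRR1002659; do
    fetch_run "$id" || echo "failed: $id" >&2
done
```

Setting `DRY_RUN=0` makes the function execute the commands. The body could replace the Pipeline part of the PBS script, keeping the `#PBS` header lines unchanged; walltime and memory may need to be raised when more runs are added.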