6.3 Fetch public data from miRBase and SRA

miRBase and MirGeneDB

Download reference microRNA data from miRBase

First, let’s create a folder to store the reference datasets:

mkdir -p $HOME/workshop/2024-2/session6_smallRNAseq/data/references

Now move to the references folder (cd $HOME/workshop/2024-2/session6_smallRNAseq/data/references) and download the miRBase datasets using wget, either in an interactive session (Option #1) or by submitting a PBS Pro script (Option #2).

OPTION #1: Use an interactive session to run the following commands:

Fetch microRNA mature sequences:

wget https://mirbase.org/download/mature.fa

Fetch hairpin sequences:

wget https://mirbase.org/download/hairpin.fa

Fetch the genomic coordinates of precursor and mature sequences:

wget https://mirbase.org/download/hsa.gff3
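After the downloads finish, it can help to confirm that the FASTA files look sensible and, since this session works with human data, to pull out the human entries. The following is a minimal sketch, assuming the miRBase FASTA headers start with a three-letter species prefix such as ">hsa-". The function names and the "hsa_" output file names are hypothetical and are not part of the workshop material.

```shell
#!/bin/bash
# Count the records in a FASTA file (one header line starting with ">" per record).
count_fasta_records() {
  grep -c '^>' "$1"
}

# Print only the records whose header starts with ">PREFIX-" (e.g. hsa for human).
# Sequence lines are kept until the next header, so multi-line records also work.
extract_species() {
  local prefix="$1" fasta="$2"
  awk -v p=">${prefix}-" '/^>/ { keep = (index($0, p) == 1) } keep' "$fasta"
}

# Only act on files that are present and non-empty in the current folder.
for f in mature.fa hairpin.fa; do
  if [ -s "$f" ]; then
    echo "$f: $(count_fasta_records "$f") records"
    extract_species hsa "$f" > "hsa_${f}"   # hypothetical output name
  fi
done
```

Run it from inside the references folder. If a file is missing, the loop simply skips it.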

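OPTION #2: Submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (e.g., /myteam/data/reference/) and submit the job from inside it, because the script downloads the files into the directory from which the job was submitted ($PBS_O_WORKDIR).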

#!/bin/bash -l
#PBS -N nfsmrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=2:00:00

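#Work in the directory (folder) from which the job was submitted
cd "$PBS_O_WORKDIR"

#Fetch hairpin sequences, mature sequences and human genomic coordinates
wget https://www.mirbase.org/download/hairpin.fa
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hsa.gff3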
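Because a batch job runs unattended, a failed download can go unnoticed. As a hedged sketch, the check below reports any of the three expected files that is missing or empty. The function name check_downloads is hypothetical, and the check assumes you run it from the folder the job was submitted from.

```shell
#!/bin/bash
# Report any expected file that is missing or empty in the given folder.
# Returns 0 when all files are present and non-empty, 1 otherwise.
check_downloads() {
  local dir="$1"; shift
  local status=0 f
  for f in "$@"; do
    if [ ! -s "${dir}/${f}" ]; then
      echo "missing or empty: ${dir}/${f}" >&2
      status=1
    fi
  done
  return "$status"
}

# Example: check the three miRBase files in the current folder.
if check_downloads . hairpin.fa mature.fa hsa.gff3; then
  echo "all reference files downloaded"
fi
```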

Fetch public small RNA-seq data using SRA tools

For this approach you will need a list of SRA run identifiers (accessions). For example, for the human Huntington's disease study the list of identifiers is:

ERR409878
ERR409879
ERR409880
ERR409881
ERR409882
ERR409883
ERR409884
ERR409885
ERR409886
ERR409887
ERR409888
ERR409889
ERR409890
ERR409891
ERR409892
ERR409893
ERR409894
ERR409895
ERR409896
ERR409897
ERR409898
ERR409899
ERR409900

The above list has already been prepared for you. Copy the list of IDs into your “my data” folder created previously:

cp /work/training/2024/smallRNAseq/data/human_disease/SRA_Acc_List.txt $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata
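Before launching a long download job, it may be worth checking that the copied list is intact. The list shown above contains 23 run accessions. The sketch below counts the valid entries and prints anything that does not look like a run accession, assuming accessions take the form SRR, ERR or DRR followed by digits. The function names are hypothetical.

```shell
#!/bin/bash
# Count lines that look like a run accession (SRR/ERR/DRR followed by digits).
count_accessions() {
  grep -c -E '^[SED]RR[0-9]+$' "$1"
}

# Print non-blank lines that do not look like a run accession.
# Lines with Windows line endings (trailing carriage return) are also printed.
invalid_accessions() {
  grep -v -E '^[SED]RR[0-9]+$' "$1" | grep -v '^$' || true
}

# Example: inspect the list in the current folder, if it is there.
list="SRA_Acc_List.txt"
if [ -f "$list" ]; then
  echo "valid accessions: $(count_accessions "$list")"
  invalid_accessions "$list"
fi
```

Run it from inside the folder that holds SRA_Acc_List.txt. No lines printed after the count means every non-blank line passed the check.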

Now let’s also get a copy of the “launch_fetch_SRA.pbs” script into your “my data” folder:

cp /work/training/2024/smallRNAseq/data/human_disease/launch_fetch_SRA.pbs $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata

From inside the “my data” folder, check the content of the script:

cat launch_fetch_SRA.pbs

#!/bin/bash -l
#PBS -N rna
#PBS -l select=1:ncpus=1:mem=8gb
#PBS -l walltime=24:00:00

#Enable the container modules
source /pkg/shpc/enable

#Load the SRA-TOOLS module
module load sra-tools/3.0.5--h9f5acd7_1

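#Work in the directory (folder) from which the job was submitted
cd $PBS_O_WORKDIR
#Download each run in the accession list and convert it to FASTQ
for i in $(cat SRA_Acc_List.txt);
do
  echo $i
  prefetch.3 $i
  fasterq-dump.3 --split-files $i
done
#Compress the resulting FASTQ files
gzip *fastq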

Submit the PBS script to the HPC cluster:

qsub launch_fetch_SRA.pbs

Monitor job progress:

qjobs
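Once the job has finished, a quick way to confirm that every run was fetched is to compare the accession list against the compressed FASTQ files in the folder. This is a minimal sketch with a hypothetical function name. It assumes fasterq-dump with --split-files writes files named after the accession (for example ACC.fastq, or ACC_1.fastq and ACC_2.fastq for paired-end runs), which the script then compresses to .fastq.gz.

```shell
#!/bin/bash
# Print every accession in LIST that has no matching *.fastq.gz file in DIR.
# Matching is done on the accession prefix so that both ACC.fastq.gz and
# ACC_1.fastq.gz / ACC_2.fastq.gz count as present.
missing_fastq() {
  local list="$1" dir="$2" acc
  while IFS= read -r acc || [ -n "$acc" ]; do
    [ -n "$acc" ] || continue            # skip blank lines
    if ! ls "${dir}/${acc}"*.fastq.gz >/dev/null 2>&1; then
      echo "$acc"
    fi
  done < "$list"
}

# Example: check the current folder against the workshop list, if present.
if [ -f SRA_Acc_List.txt ]; then
  missing_fastq SRA_Acc_List.txt .
fi
```

No output means every accession has at least one FASTQ file. Any accession that is printed would need to be fetched again.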