miRBase and MirGeneDB
Download Reference microRNA data from miRBase
First, let’s create a folder to store the reference datasets:
mkdir -p $HOME/workshop/2024-2/session6_smallRNAseq/data/references
Now move to the reference folder and download the miRBase datasets with wget, either in an interactive session (Option #1) or by submitting a PBS Pro script (Option #2).
OPTION #1: Use an interactive session to run the following commands:
Fetch microRNA mature sequences:
wget https://mirbase.org/download/mature.fa
Fetch hairpin sequences:
wget https://mirbase.org/download/hairpin.fa
Fetch the genomic coordinates for precursor and mature sequences:
wget https://mirbase.org/download/hsa.gff3
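OPTION #2: Submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (e.g., /myteam/data/reference/) and submit the script from inside that folder, because the script downloads the files into the directory it was submitted from ($PBS_O_WORKDIR).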
#!/bin/bash -l
#PBS -N nfsmrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=2:00:00
cd $PBS_O_WORKDIR
wget https://www.mirbase.org/download/hairpin.fa
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hsa.gff3
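Once the downloads finish, a quick sanity check can confirm that the FASTA files contain what you expect. The sketch below is not part of the original workshop material. It assumes that miRBase FASTA headers start with `>` followed by a species-prefixed name (for example `hsa-` for human), and it uses a small illustrative sample file (`sample_mature.fa`, a hypothetical name) in place of the real `mature.fa` so that it can run anywhere. The helper function names are also hypothetical.

```shell
# Hypothetical helper: count FASTA records (header lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Hypothetical helper: count records whose name starts with a species prefix
# (e.g. "hsa" for human). Assumes headers look like ">hsa-let-7a-5p ...".
count_species_records() {
    grep -c "^>$2-" "$1" || true
}

# Small illustrative sample standing in for mature.fa (not the real file).
sample_dir=$(mktemp -d)
cat > "$sample_dir/sample_mature.fa" <<'EOF'
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>hsa-let-7a-5p MIMAT0000062 Homo sapiens let-7a-5p
UGAGGUAGUAGGUUGUAUAGUU
>hsa-miR-21-5p MIMAT0000076 Homo sapiens miR-21-5p
UAGCUUAUCAGACUGAUGUUGA
EOF

total=$(count_fasta_records "$sample_dir/sample_mature.fa")
human=$(count_species_records "$sample_dir/sample_mature.fa" hsa)
echo "total records: $total, human records: $human"
```

To apply the same check to the real download, point the helpers at `mature.fa` or `hairpin.fa` in your reference folder.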
Fetch public small RNA-seq data
Today we will download small RNA-seq data from the ENA (European Nucleotide Archive).
Manuscript: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004188
STEP 1: Find where the data is available for download in the above manuscript
Click on the link above and search for “accession”, “Data availability”, “BioProject ID”, “GEO accession code” or “Array Express” identifier.
If only an Array Express accession code is available, go to https://www.ebi.ac.uk/biostudies/arrayexpress and search for the Array Express identifier. Browse the database to locate the identifier for ENA.
Hint: it will take a couple of clicks to open multiple pages to find the identifier for the data deposited in ENA.
Which Array Express identifier is noted in the above manuscript, and which ENA identifier does it relate to?
STEP 2: Search for data for the identified BioProject ID at the European Nucleotide Archive (ENA) database
Go to https://www.ebi.ac.uk/ena/browser/home and search for the BioProject ID using the search option on the top right corner and click on ‘view’:
STEP 3: Select the FASTQ files (tick the boxes next to the file names) and click on “Get download script”. NOTE: the script name will be different for each person downloading the bash script (e.g., )
STEP 4: Open the downloaded ENA script using TextEdit, Notepad or a similar app. The downloaded script looks like this:
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz
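The URLs in the download script follow a regular pattern, which can help when checking that a line was not damaged during copy and paste. The sketch below rebuilds a URL from a run accession. The rule it encodes is an assumption inferred from the URLs above and from the ENA FTP layout as commonly described: the first six characters of the accession form a directory, and longer accessions get an extra zero-padded subdirectory built from the trailing digits. It only covers single-end files named `<accession>.fastq.gz`, and the function name `ena_fastq_url` is hypothetical.

```shell
# Hypothetical helper: build the ENA FTP URL for a single-end run accession.
# Assumption: layout is vol1/fastq/<first 6 chars>/[<padded subdir>/]<acc>/<acc>.fastq.gz
ena_fastq_url() {
    acc=$1
    len=${#acc}
    prefix=$(printf '%s' "$acc" | cut -c1-6)
    case $len in
        # 6-digit accessions (e.g. ERR409878): no extra subdirectory
        9)  subdir="" ;;
        # 7-digit accessions: "00" plus the last digit
        10) subdir="00$(printf '%s' "$acc" | cut -c10)/" ;;
        # 8-digit accessions (e.g. SRR20630344): "0" plus the last two digits
        11) subdir="0$(printf '%s' "$acc" | cut -c10-11)/" ;;
        *)  echo "unsupported accession length: $acc" >&2; return 1 ;;
    esac
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s%s/%s.fastq.gz\n' \
        "$prefix" "$subdir" "$acc" "$acc"
}

ena_fastq_url SRR20630344
ena_fastq_url ERR409878
```

The first call reproduces the first URL in the script shown above. Treat the output for other accessions as a guess to verify against the ENA browser, not as an authoritative location.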
STEP 5: Using TextEdit or Notepad, copy and paste the following lines at the top of the script:
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
You should have this:
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz
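If you prefer the command line to a text editor, the same result can be produced by concatenating a header file and the downloaded script. The following is a minimal sketch, not the workshop's prescribed method. The file names `pbs_header.txt`, `ena_download.sh` and `ena_download.pbs` are hypothetical, and a two-line stand-in replaces the real ENA script so the example runs in a temporary folder.

```shell
work_dir=$(mktemp -d)

# Stand-in for the script downloaded from ENA (two of the URLs shown above).
cat > "$work_dir/ena_download.sh" <<'EOF'
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
EOF

# PBS header lines to place at the top of the script.
cat > "$work_dir/pbs_header.txt" <<'EOF'
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
EOF

# Header first, then the wget lines, written to a new .pbs file.
cat "$work_dir/pbs_header.txt" "$work_dir/ena_download.sh" > "$work_dir/ena_download.pbs"

# Quick checks: the shebang must be the very first line.
first_line=$(head -n 1 "$work_dir/ena_download.pbs")
n_wget=$(grep -c '^wget ' "$work_dir/ena_download.pbs" || true)
echo "first line: $first_line"
echo "wget lines: $n_wget"
```

A file assembled this way directly on the HPC does not pass through a Windows or Mac editor, although running dos2unix on it remains harmless.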
STEP 6: Save the file and transfer it to the HPC as described below:
NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.
Windows PC: open File Explorer and type the address below to connect to your home directory on the HPC, then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder
\\hpc-fs\home\
Mac: open Finder and press “command” + “k” to open the connection prompt, type the address below, and then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder
smb://hpc-fs/home/
Drag and drop the script into the /workshop/2024-2/session4_RNAseq/data/mydata folder
STEP 7: Next, we will make sure that the file copied from our laptop or desktop does not contain unwanted characters. Let’s move to the data folder:
cd $HOME/workshop/2024-2/session4_RNAseq/data/mydata
To see how to use the dos2unix tool, type:
dos2unix --help
Now let’s run the dos2unix conversion. The filename may vary, so adjust it as appropriate.
dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.pbs
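Note: Files created or edited on Windows (for example with Notepad or Microsoft Excel) usually contain Windows-style line endings, that is, an extra carriage return character at the end of each line, which can cause scripts to fail on Linux. Use dos2unix to remove such characters.
Now we are ready to submit the script to the HPC cluster to download the FASTQ files: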
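To see what dos2unix actually changes, you can count the lines that end in a carriage return before and after the conversion. The sketch below is an illustration under stated assumptions: the helper `count_crlf_lines` and the demo file names are hypothetical, and when dos2unix is not installed it falls back to `tr -d '\r'`, which is only a rough equivalent because it removes every carriage return rather than just those at line ends.

```shell
# Hypothetical helper: count lines ending in a carriage return (CR).
count_crlf_lines() {
    cr=$(printf '\r')
    grep -c "$cr\$" "$1" || true
}

demo_dir=$(mktemp -d)

# Create a small script with Windows-style (CRLF) line endings.
printf '#!/bin/bash -l\r\ncd $PBS_O_WORKDIR\r\n' > "$demo_dir/windows_style.sh"
before=$(count_crlf_lines "$demo_dir/windows_style.sh")

# Convert to Unix line endings, writing a new file (same -n mode as above).
if command -v dos2unix >/dev/null 2>&1; then
    dos2unix -n "$demo_dir/windows_style.sh" "$demo_dir/unix_style.pbs" 2>/dev/null
else
    # Fallback (assumption): strip all CR characters when dos2unix is absent.
    tr -d '\r' < "$demo_dir/windows_style.sh" > "$demo_dir/unix_style.pbs"
fi
after=$(count_crlf_lines "$demo_dir/unix_style.pbs")

echo "CRLF lines before: $before, after: $after"
```

A non-zero count on your own `.pbs` file suggests the conversion was skipped or was applied to a different file.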
qsub ena-file-download-selected-files-20241013-1123.pbs
Monitor the progress of the job:
qjobs
Note: Downloading the above datasets will take about 50 minutes.
See the link below for alternative approaches to download data from SRA or BaseSpace, or to use the nf-core/fetchngs pipeline:
For this approach you will need a list of SRA identifiers. For example, for the human Huntington’s disease study the list of identifiers is:
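After the download job finishes, it can be worth confirming that every FASTQ file is a complete gzip archive, since an interrupted transfer may leave a truncated file behind. The sketch below uses `gzip -t`, which tests archive integrity without decompressing to disk. The function name `check_fastq_gz` and the demo files are hypothetical and only serve to make the example runnable.

```shell
# Hypothetical helper: report OK or CORRUPT for each gzip file given.
check_fastq_gz() {
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
        fi
    done
}

demo_dir=$(mktemp -d)

# One valid gzip-compressed FASTQ record and one deliberately invalid file.
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$demo_dir/good.fastq.gz"
printf 'this is not gzip data\n' > "$demo_dir/bad.fastq.gz"

report=$(check_fastq_gz "$demo_dir"/*.fastq.gz)
echo "$report"

n_ok=$(echo "$report" | grep -c '^OK' || true)
n_bad=$(echo "$report" | grep -c '^CORRUPT' || true)
```

In the data folder, the equivalent call would be `check_fastq_gz *.fastq.gz`. Because the download script uses `wget -nc`, a corrupt file has to be deleted before resubmitting, otherwise wget skips it.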
ERR409878
ERR409879
ERR409880
ERR409881
ERR409882
ERR409883
ERR409884
ERR409885
ERR409886
ERR409887
ERR409888
ERR409889
ERR409890
ERR409891
ERR409892
ERR409893
ERR409894
ERR409895
ERR409896
ERR409897
ERR409898
ERR409899
ERR409900
The above list has already been prepared for you. Fetch a copy of the list of IDs into your “my data” folder created previously:
cp /work/training/2024/smallRNAseq/data/human_disease/SRA_Acc_List.txt $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata
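When you prepare an accession list yourself, a quick format check can catch typos before a long job is submitted. The sketch below assumes that run accessions consist of `SRR`, `ERR` or `DRR` followed by at least six digits, one per line. The function name `validate_accessions` and the sample file are hypothetical, and the sample deliberately contains a mistyped entry (a letter O in place of a zero).

```shell
# Hypothetical helper: print every non-empty line that does not look like
# a run accession (SRR/ERR/DRR followed by six or more digits).
validate_accessions() {
    grep -v '^[[:space:]]*$' "$1" | grep -Ev '^[SED]RR[0-9]{6,}$' || true
}

list_dir=$(mktemp -d)

# Sample list: two valid accessions, a blank line, and one typo.
cat > "$list_dir/sample_acc_list.txt" <<'EOF'
ERR409878
ERR409879

ERR40988O
EOF

bad=$(validate_accessions "$list_dir/sample_acc_list.txt")
n_valid=$(grep -Ec '^[SED]RR[0-9]{6,}$' "$list_dir/sample_acc_list.txt" || true)

echo "valid accessions: $n_valid"
echo "suspicious lines: $bad"
```

Lines with trailing spaces or Windows line endings are also reported as suspicious, which is intentional because they tend to break download loops.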
Now let’s also get a copy of the “launch_fetch_SRA.pbs” script into your “my data” folder:
cp /work/training/2024/smallRNAseq/data/human_disease/launch_fetch_SRA.pbs $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata
Check the content of the script:
cat launch_fetch_SRA.pbs
Use a Singularity container:
singularity run -B $PWD /work/training/tools/sif_lib/sra-tools_v2.10.7.sif \
  fastq-dump \
  --split-files \
  --outdir $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata \
  --option-file sra ids.txt
deprecated
#!/bin/bash -l
#PBS -N rna
#PBS -l select=1:ncpus=1:mem=8gb
#PBS -l walltime=24:00:00
#Enable the container modules
source /pkg/shpc/enable
#Load the SRA-TOOLS module
module load sra-tools/3.0.5--h9f5acd7_1
#work on current directory (folder)
cd $PBS_O_WORKDIR
for i in $(cat SRR_Acc_List.txt);
do
  echo $i
  prefetch.3 $i
  fasterq-dump.3 --split-files $i
done
gzip *fastq
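The loop in the script above reads the accession list with `$(cat ...)`. A slightly more defensive variant, sketched below as an assumption rather than as the workshop's script, reads the file line by line, skips blank lines, and copes with a missing final newline. To keep the example runnable without SRA tools installed, it only prints the commands it would run (a dry run). The function name `plan_downloads` and the sample list are hypothetical, while the command names `prefetch.3` and `fasterq-dump.3` are copied from the script above.

```shell
# Hypothetical helper: print the commands that would be run for each
# accession in the file given as $1 (dry run, nothing is downloaded).
plan_downloads() {
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines instead of passing an empty ID to the tools.
        [ -z "$acc" ] && continue
        echo "prefetch.3 $acc"
        echo "fasterq-dump.3 --split-files $acc"
    done < "$1"
}

list_dir=$(mktemp -d)

# Sample list with a blank line and no newline after the last accession.
printf 'ERR409878\n\nERR409879' > "$list_dir/sample_acc_list.txt"

plan=$(plan_downloads "$list_dir/sample_acc_list.txt")
echo "$plan"

n_cmds=$(printf '%s\n' "$plan" | grep -c . || true)
```

Removing the two `echo` wrappers would turn the dry run into real downloads. That is best done inside a PBS job, as in the script above, and not on a login node.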
Submit the PBS script to the HPC cluster:
qsub launch_fetch_SRA.pbs
Monitor the progress of the job:
qjobs