miRBase and MirGeneDB
Download Reference microRNA data from miRBase
...
Code Block |
---|
mkdir -p $HOME/workshop/2024-2/session6_smallRNAseq/data/referencesmiRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/referencesmiRBase |
Now move to the reference folder and download the miRBase datasets, either with wget in an interactive session or with a PBS Pro script (see below).
OPTION #1: Use interactive session to run the following commands:
...
Code Block |
---|
wget https://mirbase.org/download/hsa.gff3 |
OPTION #2: Submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (e.g., /myteam/data/reference/). Let’s:
copy the script to download miRBase files;
move to the reference folder; and
print the content of the launch_download_miRBase.pbs script with the code below:
Code Block |
---|
cp /work/training/2024/smallRNAseq/scripts/launch_download_miRBase.pbs $HOME/workshop/2024-2/session6_smallRNAseq/data/referencesmiRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/referencesmiRBase
cat launch_download_miRBase.pbs |
Code Block |
---|
#!/bin/bash -l
#PBS -N nfsmrnaseqdownload_miRBase
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=2:00:00
cd $PBS_O_WORKDIR
wget https://www.mirbase.org/download/hairpin.fa
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hsa.gff3 |
Submit the script to the HPC cluster:
Code Block |
---|
qsub launch_download_miRBase.pbs |
Monitor progress of job:
Code Block |
---|
qjobs |
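Once the job has finished, it can be worth confirming that the files arrived and are not empty. The sketch below is one possible check and is not part of the workshop scripts: the function names count_fasta_records and check_mirbase_downloads are hypothetical, and it assumes the files keep the names used in the wget commands above (hairpin.fa, mature.fa, hsa.gff3).

```shell
# Hypothetical helpers, not part of the workshop scripts.

# Count FASTA records: every record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report whether each expected miRBase file exists and is non-empty.
# Assumes the file names produced by the wget commands in the PBS script.
# The body runs in a subshell so its variables do not leak into the session.
check_mirbase_downloads() (
    dir=${1:-.}
    missing=0
    for f in hairpin.fa mature.fa hsa.gff3; do
        if [ -s "$dir/$f" ]; then
            echo "OK       $f"
        else
            echo "MISSING  $f"
            missing=1
        fi
    done
    exit "$missing"
)
```

After pasting the functions into the session, run check_mirbase_downloads . from the reference folder, and count_fasta_records mature.fa to see how many mature sequences were downloaded.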
Fetch public small RNA-seq data
Today we will download small RNA-seq data from the ENA (European Nucleotide Archive).
...
Click on the link above and search for “accession”, “Data availability”, “BioProject ID”, “GEO accession code” or “Array Express” identifier.
If only an Array Express accession code is available, go to https://www.ebi.ac.uk/biostudies/arrayexpress and search for the Array Express identifier. Browse the database to locate the corresponding ENA identifier.
Hint: it will take a couple of clicks to open multiple pages to find the identifier for the data deposited in ENA.
...
What is the Array Express identifier noted in the above manuscript, and which ENA identifier does it relate to?
Answer: Array Express: E-MTAB-2206; ENA identifier: ERP004592
STEP 2: Search for data for the identified BioProject ID at the European Nucleotide Archive (ENA) database
...
STEP 3: Select the FASTQ files (tick the boxes next to the file names) and click on “Get download script”. NOTE: the script name will be different for each person downloading the bash script.
...
STEP 6: Save the file and now let’s transfer it to the HPC. See below:
NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.
Windows PC: open File Explorer and type the address below to connect to your home directory on the HPC, then browse to the /workshop/2024-2/session6_smallRNAseq/data/mydata folder
...
Code Block |
---|
dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.pbs |
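Note: If you create or edit a file on Windows (for example, with Microsoft Excel), it will likely contain Windows-style line endings (an extra carriage return character at the end of each line), which can break scripts on the HPC. Use dos2unix to convert them to Unix line endings.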
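To see what the dos2unix conversion changes, the sketch below counts the lines that end in a carriage return and writes a cleaned copy. It uses tr as a stand-in for dos2unix for illustration only, and the function names count_crlf_lines and strip_carriage_returns are hypothetical.

```shell
# Hypothetical helpers illustrating the Windows line-ending problem.

# Count lines that end in a carriage return (CRLF line endings).
# grep prints 0 and returns a non-zero status when no such line exists.
count_crlf_lines() {
    cr=$(printf '\r')
    grep -c "${cr}\$" "$1"
}

# Write a copy of the input with every carriage return removed.
# Stand-in for: dos2unix -n input output
strip_carriage_returns() {
    tr -d '\r' < "$1" > "$2"
}
```

A count greater than zero from count_crlf_lines on the downloaded .sh file means the conversion step is needed before submitting the script.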
Now we are ready to submit the script to the HPC cluster to download the FASTQ files:
...
Monitor progress of job:
Code Block |
---|
qjobs |
Note: Downloading the above datasets will take about 50 minutes.
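After the download job completes, a quick integrity check can catch truncated files. The sketch below assumes the downloaded files are gzip-compressed FASTQ files with a .fastq.gz extension, which is how ENA normally provides them; the function name check_fastq_gz is hypothetical and not part of the workshop scripts.

```shell
# Hypothetical helper: test every *.fastq.gz file in a folder with gzip -t,
# which reads the compressed stream without writing output and fails on
# corrupt or truncated files.
# The body runs in a subshell so its variables do not leak into the session.
check_fastq_gz() (
    dir=${1:-.}
    bad=0
    for f in "$dir"/*.fastq.gz; do
        # An unmatched glob stays literal, so stop if no file exists.
        [ -e "$f" ] || { echo "No .fastq.gz files found in $dir"; exit 1; }
        if gzip -t "$f" 2>/dev/null; then
            echo "OK       $f"
        else
            echo "CORRUPT  $f"
            bad=1
        fi
    done
    exit "$bad"
)
```

Run check_fastq_gz . from the folder holding the downloaded files; any file reported as CORRUPT is a candidate for downloading again.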
The link below describes alternative approaches for downloading data from SRA or BaseSpace, or for using the nf-core/fetchngs pipeline:
...