miRBase and MirGeneDB

Download Reference microRNA data from miRBase

...

Now move to the reference folder and download the miRBase datasets with wget, either in an interactive session (Option 1) or by submitting a PBS Pro script (Option 2).

OPTION #1: Use interactive session to run the following commands:

...

Code Block
wget https://mirbase.org/download/hsa.gff3

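OPTION #2: Submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (e.g., /myteam/data/reference/) and submit the script from within it, because the script changes to the submission directory ($PBS_O_WORKDIR) before downloading.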

Code Block
#!/bin/bash -l
#PBS -N nfsmrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=2:00:00

cd $PBS_O_WORKDIR

wget https://www.mirbase.org/download/hairpin.fa
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hsa.gff3
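Once the job has finished, a quick sanity check can confirm that each file arrived and is not empty. The sketch below is one possible way to do this, not part of the workshop scripts: the helper name check_reference_file is hypothetical, and the demo runs on a tiny made-up FASTA file rather than the real miRBase downloads.

```shell
#!/bin/bash
# Minimal sketch: report whether a reference file exists and is non-empty,
# and how many FASTA records (header lines starting with '>') it holds.
# check_reference_file is a hypothetical helper, not part of miRBase or wget.
check_reference_file() {
  local file="$1"
  local records
  if [ ! -s "$file" ]; then
    echo "MISSING_OR_EMPTY $file"
    return 1
  fi
  # grep -c counts matching lines; '|| true' keeps a zero count from failing.
  records=$(grep -c '^>' "$file" || true)
  echo "OK $file records=$records"
}

# Demo on a tiny made-up FASTA file (not real miRBase content).
demo_dir=$(mktemp -d)
printf '>seq1\nACGU\n>seq2\nUGCA\n' > "$demo_dir/demo.fa"
check_reference_file "$demo_dir/demo.fa"
```

In the reference folder the same function could be called on hairpin.fa and mature.fa. For hsa.gff3 the record count is not meaningful, since GFF3 lines do not start with '>', but the non-empty check still applies.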

Fetch public small RNA-seq data

Today we will download small RNA-seq data from the ENA (European Nucleotide Archive).

Manuscript: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004188

...

STEP 1 : Find where the data is available for download in the above manuscript

  • Click on the link above and search for “accession”, “Data availability”, “BioProject ID”, “GEO accession code” or “ArrayExpress” identifier.

  • If only an ArrayExpress accession code is available, go to https://www.ebi.ac.uk/biostudies/arrayexpress and search for the ArrayExpress identifier. Browse the database to locate the identifier for ENA.

  • Hint: it will take a couple of clicks through several pages to find the identifier for the data deposited in ENA.

Which ArrayExpress identifier is noted in the above manuscript, and which ENA identifier does it relate to?

Solution

ArrayExpress: E-MTAB-2206; ENA identifier: ERP004592

STEP 2: Search for data for the identified BioProject ID at the European Nucleotide Archive (ENA) database

...

STEP 3: (If applicable) select one or more BioProject submission(s). Click on the first listed BioProject ID:

...

  • STEP 4: Select FASTQ files (tick the boxes next to the file names) and click on “Get download script”. This will download a bash script (e.g., ena-file-download-selected-files-20241009-0005.sh).
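As a command-line complement to the browser steps, the list of FASTQ files for a study can also be requested from the ENA Portal API. The sketch below only builds the request URL. The helper name ena_filereport_url is hypothetical, and the choice of fields (run_accession, fastq_ftp) is an assumption about what is useful here rather than something the workshop prescribes.

```shell
#!/bin/bash
# Minimal sketch: build an ENA Portal API "filereport" URL for an accession.
# ena_filereport_url is a hypothetical helper name; the selected fields are
# an assumption about which columns are wanted.
ena_filereport_url() {
  local accession="$1"
  local base="https://www.ebi.ac.uk/ena/portal/api/filereport"
  echo "${base}?accession=${accession}&result=read_run&fields=run_accession,fastq_ftp&format=tsv"
}

# Print the URL for the ENA identifier found in STEP 1.
ena_filereport_url ERP004592

# To fetch the table (requires network access), one could run:
#   curl -s "$(ena_filereport_url ERP004592)" > filereport.tsv
```

The download itself is left as a comment so that the sketch runs without network access.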

...

  • STEP 5: Open the downloaded ENA script in a plain-text editor such as TextEdit or Notepad. The downloaded script looks like this:

Code Block
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz

Now, still in TextEdit or Notepad, copy and paste the following lines at the top of the script:

Code Block
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

You should have this:

Code Block
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz
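If you prefer the command line to a text editor, the same header can be prepended with a few shell commands. This is a minimal sketch under stated assumptions: the helper name add_pbs_header and the demo file names are hypothetical, and the header text is simply the one shown above.

```shell
#!/bin/bash
# Minimal sketch: prepend the PBS header to an ENA download script
# from the command line instead of a text editor.
# add_pbs_header is a hypothetical helper name.
add_pbs_header() {
  local input="$1" output="$2"
  {
    # Quoting 'EOF' keeps $PBS_O_WORKDIR literal, so it is only expanded
    # when the job runs on the cluster.
    cat <<'EOF'
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

EOF
    cat "$input"
  } > "$output"
}

# Demo with a one-line stand-in for the downloaded ENA script.
demo_dir=$(mktemp -d)
echo 'wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz' > "$demo_dir/ena.sh"
add_pbs_header "$demo_dir/ena.sh" "$demo_dir/ena.pbs"
head -n 2 "$demo_dir/ena.pbs"
```

Writing to a new file rather than editing in place leaves the original ENA script untouched, which makes it easy to start again if something goes wrong.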

STEP 6: Save the file and now let’s transfer it to the HPC. See below:

NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.

Windows PC: open File Explorer and type the address below in the address bar to connect to your home directory on the HPC, then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder

Code Block
\\hpc-fs\home\

Mac: open Finder and press “command” + “k” to open the “Connect to Server” prompt, type the address below, and then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder

Code Block
smb://hpc-fs/home/
  • Drag and drop the script into the /workshop/2024-2/session4_RNAseq/data/mydata folder

STEP 7: We will ensure the copied file from our laptop / desktop does not have unwanted characters. Let’s move to the data folder:

Code Block
cd $HOME/workshop/2024-2/session4_RNAseq/data/mydata

To see how to use the dos2unix tool, type:

Code Block
dos2unix --help

Now let’s run dos2unix conversion. Note the filename may vary, so adjust the filename as appropriate.

Code Block
dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.pbs
  • Note: If you create a file using

...

  • Microsoft Excel, it will likely contain Windows-style line endings (carriage-return characters); use dos2unix to remove them.
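To check whether a file actually needs converting, you can count the lines that end with a carriage return. The following is a minimal sketch: the helper name count_crlf_lines and the two demo files are hypothetical and only illustrate the idea.

```shell
#!/bin/bash
# Minimal sketch: count lines that end with a carriage return (Windows CRLF).
# count_crlf_lines is a hypothetical helper name.
count_crlf_lines() {
  local cr
  cr=$(printf '\r')   # a literal carriage-return character
  # grep -c counts lines whose last character is a carriage return;
  # '|| true' keeps a zero count from being treated as an error.
  grep -c "${cr}\$" "$1" || true
}

# Demo: one file with Windows line endings, one with Unix line endings.
demo_dir=$(mktemp -d)
printf 'cd $PBS_O_WORKDIR\r\nwget -nc URL\r\n' > "$demo_dir/windows.sh"
printf 'cd $PBS_O_WORKDIR\nwget -nc URL\n' > "$demo_dir/unix.sh"
echo "windows.sh: $(count_crlf_lines "$demo_dir/windows.sh") CRLF lines"
echo "unix.sh: $(count_crlf_lines "$demo_dir/unix.sh") CRLF lines"
```

A count of zero suggests the file already uses Unix line endings, although running dos2unix on it anyway does no harm.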

Now we are ready to submit the script that downloads the FASTQ files to the HPC cluster:

Code Block
qsub ena-file-download-selected-files-20241013-1123.pbs

Monitor the progress of the job:

Code Block
qjobs
  • Note: Downloading the above datasets will take about 50 minutes.
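When the job has finished, it can be useful to confirm that every file named in the download script is present. The sketch below compares the wget lines of a script with the contents of a folder. The helper name list_missing_fastq is hypothetical, and the demo uses a two-line stand-in script and a placeholder file rather than real downloads.

```shell
#!/bin/bash
# Minimal sketch: list FASTQ files named in a download script that are
# missing (or empty) in a directory.
# list_missing_fastq is a hypothetical helper name.
list_missing_fastq() {
  local script="$1" dir="$2" url name
  # The URL is the last field of each wget line; basename keeps the file name.
  grep '^wget' "$script" | awk '{print $NF}' | while read -r url; do
    name=$(basename "$url")
    [ -s "$dir/$name" ] || echo "$name"
  done
}

# Demo: a two-line stand-in script, with only one of the files "downloaded".
demo_dir=$(mktemp -d)
cat > "$demo_dir/download.pbs" <<'EOF'
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
EOF
echo "placeholder" > "$demo_dir/SRR20630344.fastq.gz"
list_missing_fastq "$demo_dir/download.pbs" "$demo_dir"
```

Because the download script uses wget -nc, resubmitting it after a partial run should skip files that already exist and fetch only the missing ones.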

The link below describes alternative approaches for downloading data from SRA or BaseSpace, or with the nf-core/fetchngs pipeline:

Data Download

For this approach you will need a list of SRA identifiers. For example, for the human Huntington Disease study the list of identifiers is:

...

Code Block
singularity run -B $PWD /work/training/tools/sif_lib/sra-tools_v2.10.7.sif \
  fastq-dump \
  --split-files \
  --outdir $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata \
  --option-file sra ids.txt

deprecated

Code Block
#!/bin/bash -l
#PBS -N rna
#PBS -l select=1:ncpus=1:mem=8gb
#PBS -l walltime=24:00:00

#Enable the container modules
source /pkg/shpc/enable

#Load the SRA-TOOLS module
module load sra-tools/3.0.5--h9f5acd7_1

#work on current directory (folder)
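cd $PBS_O_WORKDIR

#Download each accession listed in SRR_Acc_List.txt (one per line) and convert it to FASTQ
for i in $(cat SRR_Acc_List.txt);
do
  echo "$i"
  prefetch.3 "$i"
  fasterq-dump.3 --split-files "$i"
done

#Compress the resulting FASTQ files
gzip *.fastq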
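Because the loop passes every line of the accession list straight to the SRA tools, a stray blank line or Windows line ending can cause a failed download. The sketch below filters a list down to lines that look like run accessions. The helper name valid_run_accessions is hypothetical, the pattern (SRR, ERR or DRR followed by digits) is an assumption about the identifiers in use, and the demo list mixes one accession from this page with a made-up one.

```shell
#!/bin/bash
# Minimal sketch: print only the lines of an accession list that look like
# run accessions (SRR, ERR or DRR followed by digits).
# valid_run_accessions is a hypothetical helper name.
valid_run_accessions() {
  # tr removes carriage returns left by Windows editors;
  # grep -E keeps the lines that match the accession pattern.
  tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$' || true
}

# Demo list: a CRLF line, a blank line, a bad entry and a made-up accession.
demo_dir=$(mktemp -d)
printf 'SRR20630344\r\n\nnot_an_id\nERR123456\n' > "$demo_dir/SRR_Acc_List.txt"
valid_run_accessions "$demo_dir/SRR_Acc_List.txt"
```

The filtered output could be redirected to a new file and used as the input list for the download loop.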

...