miRBase and MirGeneDB

Download Reference microRNA data from miRBase

...

Code Block
mkdir -p $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase

Now download the miRBase datasets into this folder, either with wget in an interactive session (OPTION #1) or by submitting a PBS Pro script (OPTION #2, see below).

OPTION #1: Use interactive session to run the following commands:

...

Code Block
wget https://mirbase.org/download/hsa.gff3

OPTION #2: Submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (e.g., /myteam/data/reference/).

...

Let’s

  1. copy the script to download miRBase files;

  2. move to the reference folder; and

  3. print the content of the launch_download_miRBase.pbs script with the code below:

Code Block
cp /work/training/2024/smallRNAseq/scripts/launch_download_miRBase.pbs $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase 
cat launch_download_miRBase.pbs
Code Block
#!/bin/bash -l
#PBS -N download_miRBase
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=2:00:00

cd $PBS_O_WORKDIR

wget https://www.mirbase.org/download/hairpin.fa
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hsa.gff3

Submit the script to the HPC cluster:

Code Block
qsub launch_download_miRBase.pbs

Monitor the progress of the job:

Code Block
qjobs
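
Once the job has finished, it can be useful to check that the FASTA files downloaded completely. The sketch below is only an illustration: `count_fasta_records` and `count_species_records` are hypothetical helper names (they are not part of the workshop scripts), and the second function assumes that miRBase headers start with a species prefix such as `hsa-` for human.

```shell
#!/bin/bash
# Count all FASTA records in a file (each record header starts with ">").
count_fasta_records() {
  grep -c '^>' "$1"
}

# Count the records whose header starts with a given species prefix (e.g. hsa).
count_species_records() {
  grep -c "^>$2-" "$1"
}

# Example usage, run inside the miRBase folder after the download:
#   count_fasta_records hairpin.fa
#   count_species_records mature.fa hsa
```

A count of zero, or a count that differs between repeated downloads, suggests that a file is truncated and should be fetched again.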

Fetch public small RNA-seq data

Today we will download small RNA-seq data from the ENA (European Nucleotide Archive).

...

  • Click on the link above and search for “accession”, “Data availability”, “BioProject ID”, “GEO accession code” or “ArrayExpress” identifier.

  • If only an ArrayExpress accession code is available, go to https://www.ebi.ac.uk/biostudies/arrayexpress and search for the ArrayExpress identifier. Browse the database to locate the corresponding ENA identifier.

  • Hint: it will take a couple of clicks to open multiple pages to find the identifier for the data deposited in ENA.

...

Which ArrayExpress identifier is noted in the above manuscript, and which ENA identifier does it relate to?

Solution:

ArrayExpress: E-MTAB-2206, and ENA identifier: ERP004592

STEP 2: Search for data for the identified BioProject ID at the European Nucleotide Archive (ENA) database

...

STEP 3: Select the FASTQ files (tick the boxes next to the file names) and click on “Get download script”. NOTE: the script name will be different for each person downloading the bash script (e.g., ena-file-download-selected-files-20241009-0005.sh).

...

Open the file using a text editor (e.g., TextEdit, NotePad, etc.):

Code Block
run_accession	sample_accession	experiment_accession	study_accession	tax_id	scientific_name	fastq_ftp	submitted_ftp	sra_ftp	bam_ftp
ERR409882	SAMEA2300497	ERX376249	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409882/ERR409882.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409882/C_31.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409882
ERR409892	SAMEA2300502	ERX376254	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409892/ERR409892.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409892/H_13.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409892
ERR409893	SAMEA2300504	ERX376256	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409893/ERR409893.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409893/C_36.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409893
ERR409895	SAMEA2300492	ERX376244	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409895/ERR409895.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409895/H_09.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409895
ERR409897	SAMEA2300498	ERX376250	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409897/ERR409897.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409897/H_05.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409897
ERR409898	SAMEA2300501	ERX376253	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409898/ERR409898.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409898/C_29.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409898
ERR409899	SAMEA2300490	ERX376242	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409899/ERR409899.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409899/H_07.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409899
ERR409879	SAMEA2300495	ERX376247	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409879/ERR409879.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409879/C_39.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409879
ERR409880	SAMEA2300488	ERX376240	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409880/ERR409880.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409880/H_08.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409880
ERR409883	SAMEA2300487	ERX376239	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409883/ERR409883.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409883/C_35.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409883
ERR409884	SAMEA2300491	ERX376243	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409884/ERR409884.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409884/H_12.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409884
ERR409886	SAMEA2300493	ERX376245	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409886/ERR409886.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409886/H_06.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409886
ERR409888	SAMEA2300503	ERX376255	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409888/ERR409888.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409888/C_38.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409888
ERR409878	SAMEA2300496	ERX376248	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409878/ERR409878.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409878/C_33.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409878
ERR409889	SAMEA2300500	ERX376252	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409889/ERR409889.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409889/H_03.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409889
ERR409881	SAMEA2300509	ERX376261	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409881/ERR409881.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409881/H_10.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409881
ERR409885	SAMEA2300505	ERX376257	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409885/ERR409885.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409885/H_02.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409885
ERR409894	SAMEA2300507	ERX376259	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409894/ERR409894.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409894/H_14.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409894
ERR409887	SAMEA2300506	ERX376258	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409887/ERR409887.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409887/C_21.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409887
ERR409890	SAMEA2300489	ERX376241	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409890/ERR409890.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409890/H_01.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409890
ERR409896	SAMEA2300508	ERX376260	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409896/ERR409896.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409896/C_32.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409896
ERR409891	SAMEA2300494	ERX376246	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409891/ERR409891.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409891/C_14.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409891
ERR409900	SAMEA2300499	ERX376251	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409900/ERR409900.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409900/C_37.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409900
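
The `fastq_ftp` column of this report holds the download address for each run. As a rough sketch, the function below turns such a tab-separated report into `wget -nc` commands. The name `report_to_wget` and the example file name `report.tsv` are hypothetical, and the sketch assumes the report has a header row containing a `fastq_ftp` column, as in the file shown above.

```shell
#!/bin/bash
# Print one "wget -nc" command for every FASTQ address in an ENA report (TSV).
# The fastq_ftp column is found by its header name, so the column order does not matter.
report_to_wget() {
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
      next
    }
    col && $col != "" {
      # Paired-end runs list several addresses separated by ";"
      n = split($col, urls, ";")
      for (j = 1; j <= n; j++) print "wget -nc ftp://" urls[j]
    }
  ' "$1"
}

# Example usage (hypothetical file name):
#   report_to_wget report.tsv > download_fastq.sh
```

Looking the column up by name rather than by position keeps the sketch working if the report is exported with a different set of columns.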

  • STEP 5: Open the downloaded ENA script using TextEdit, NotePad or a similar text editor.

...

  • The downloaded script contains one wget command per FASTQ file and looks like this:

Code Block
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409878/ERR409878.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409879/ERR409879.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409880/ERR409880.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409881/ERR409881.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409882/ERR409882.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409883/ERR409883.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409884/ERR409884.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409885/ERR409885.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409886/ERR409886.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409887/ERR409887.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409888/ERR409888.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409889/ERR409889.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409890/ERR409890.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409891/ERR409891.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409892/ERR409892.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409893/ERR409893.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409894/ERR409894.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409895/ERR409895.fastq.gz

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409896/ERR409896.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409897/ERR409897.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409898/ERR409898.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409899/ERR409899.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409900/ERR409900.fastq.gz

Now, using TextEdit or NotePad, add the following PBS header lines to the top of the script (copy and paste them above the first wget command):

Code Block
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR
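
The same header can also be added from the command line once the script is on the HPC. The sketch below is an optional alternative to editing by hand; `add_pbs_header` is a hypothetical helper name, and the header values are simply copied from the block above.

```shell
#!/bin/bash
# Write the PBS header followed by the contents of an ENA download script.
# Usage: add_pbs_header <ena_script.sh> <output.pbs>
add_pbs_header() {
  {
    # Single quotes keep $PBS_O_WORKDIR literal so PBS expands it at run time.
    printf '%s\n' \
      '#!/bin/bash -l' \
      '#PBS -N ENA_data_download' \
      '#PBS -l select=1:ncpus=2:mem=4gb' \
      '#PBS -l walltime=24:00:00' \
      '' \
      '#work on current directory (folder)' \
      'cd $PBS_O_WORKDIR' \
      ''
    cat "$1"
  } > "$2"
}

# Example usage (adjust the file names to match your download):
#   add_pbs_header ena-file-download-selected-files-20241013-1123.sh ena_download.pbs
```

Writing to a new output file leaves the original download script untouched, so the step can be repeated if the header needs changing.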

...

Code Block
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409878/ERR409878.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409879/ERR409879.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409880/ERR409880.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409881/ERR409881.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409882/ERR409882.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409883/ERR409883.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409884/ERR409884.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409885/ERR409885.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409886/ERR409886.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409887/ERR409887.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409888/ERR409888.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409889/ERR409889.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409890/ERR409890.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409891/ERR409891.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409892/ERR409892.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409893/ERR409893.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409894/ERR409894.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409895/ERR409895.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409896/ERR409896.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409897/ERR409897.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409898/ERR409898.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409899/ERR409899.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409900/ERR409900.fastq.gz

STEP 6: Save the file and now let’s transfer it to the HPC. See below:

NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.

Windows PC: open File Explorer and type the address below to connect to your home directory on the HPC, then browse to the /workshop/2024-2/session6_smallRNAseq/data/mydata folder

Code Block
\\hpc-fs\home\

Mac: open Finder and press “command” + “k” to open the “Connect to Server” prompt, type the address below, and then browse to the /workshop/2024-2/session6_smallRNAseq/data/mydata folder

Code Block
smb://hpc-fs/home/
  • Drag and drop the script into the /workshop/2024-2/session6_smallRNAseq/data/mydata folder

STEP 7: We will ensure the copied file from our laptop / desktop does not have unwanted characters. Let’s move to the data folder:

Code Block
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata

How to use the dos2unix tool? Type:

Code Block
dos2unix --help

Now let’s run dos2unix conversion. Note the filename may vary, so adjust the filename as appropriate.

Code Block
dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.pbs
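  • Note: Files created or edited on a Windows PC (e.g., with NotePad or Microsoft Excel) usually contain Windows-style line endings (an extra carriage return character at the end of each line), which can cause scripts to fail on the HPC. Use dos2unix to remove such characters.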
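
To see whether a file actually needs converting, one can test for carriage return characters directly. The sketch below is illustrative only: `has_crlf` and `strip_cr` are hypothetical helper names, and `strip_cr` uses `tr` as a stand-in that removes every carriage return (a simpler rule than dos2unix applies).

```shell
#!/bin/bash
# Succeed (exit status 0) if the file contains at least one carriage return.
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}

# Copy a file while deleting all carriage return characters.
# Usage: strip_cr <input> <output>
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}

# Example usage (adjust the file name to match your download):
#   if has_crlf ena-file-download-selected-files-20241013-1123.sh; then
#     echo "Windows line endings found: run dos2unix"
#   fi
```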

Now we are ready to submit the script that downloads the FASTQ files to the HPC cluster:

Code Block
qsub ena-file-download-selected-files-20241013-1123.pbs

Monitor progress of job:

Code Block
qjobs
  • Note: Downloading the above datasets will take about 50 minutes.
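
After the job has finished, a quick integrity check can confirm that every file arrived intact. The following is a minimal sketch, assuming a plain-text file with one run accession per line and files named `<accession>.fastq.gz` as in the wget commands above; `missing_fastq` is a hypothetical helper name.

```shell
#!/bin/bash
# Print each accession whose FASTQ file is missing or fails the gzip integrity test.
# Usage: missing_fastq <accession_list.txt> <download_folder>
missing_fastq() {
  local list=$1 dir=$2 acc
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while read -r acc || [ -n "$acc" ]; do
    [ -z "$acc" ] && continue
    if ! gzip -t "$dir/$acc.fastq.gz" 2>/dev/null; then
      echo "$acc"
    fi
  done < "$list"
}

# Example usage (hypothetical list file name):
#   missing_fastq accessions.txt .
```

Because the download script uses `wget -nc`, which skips files that already exist, a file reported here should be deleted before the script is resubmitted.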

The link below describes alternative approaches for downloading data from SRA or BaseSpace, or with the nf-core/fetchngs pipeline:

Data Download

For this approach you will need a list of SRA identifiers. For example, for the human Huntington Disease study the list of identifiers is:

Code Block
ERR409878
ERR409879
ERR409880
ERR409881
ERR409882
ERR409883
ERR409884
ERR409885
ERR409886
ERR409887
ERR409888
ERR409889
ERR409890
ERR409891
ERR409892
ERR409893
ERR409894
ERR409895
ERR409896
ERR409897
ERR409898
ERR409899
ERR409900
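
The wget addresses used earlier follow a regular pattern that can be derived from each identifier. The sketch below reproduces that pattern for the 9-character accessions in this study only. It is an assumption inferred from the URLs shown in this tutorial, `ena_fastq_url` is a hypothetical helper name, and longer accessions use a different folder layout that this function deliberately rejects.

```shell
#!/bin/bash
# Build the ENA FTP address of a single-end FASTQ file from a 9-character run accession.
# Assumed pattern: /vol1/fastq/<first 6 characters>/<accession>/<accession>.fastq.gz
ena_fastq_url() {
  local acc=$1
  if [ "${#acc}" -ne 9 ]; then
    echo "unsupported accession length: $acc" >&2
    return 1
  fi
  local prefix
  prefix=$(printf '%s' "$acc" | cut -c1-6)
  echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/$prefix/$acc/$acc.fastq.gz"
}

# Example usage:
#   ena_fastq_url ERR409878
```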

The above list has already been prepared for you. Copy the list of IDs into the “my data” folder you created previously:

Code Block
cp /work/training/2024/smallRNAseq/data/human_disease/SRA_Acc_List.txt $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata

Now let’s also get a copy of the “launch_fetch_SRA.pbs” script into your “my data” folder:

Code Block
cp /work/training/2024/smallRNAseq/data/human_disease/launch_fetch_SRA.pbs $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata

Check the content of the script:

Code Block
cat launch_fetch_SRA.pbs

Use singularity container:

Code Block
singularity run -B $PWD /work/training/tools/sif_lib/sra-tools_v2.10.7.sif \
  fastq-dump \
  --split-files \
  --outdir $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata \
  --option-file sra ids.txt

NOTE: the Singularity command above is deprecated. The PBS script below loads the sra-tools module instead:

Code Block
#!/bin/bash -l
#PBS -N rna
#PBS -l select=1:ncpus=1:mem=8gb
#PBS -l walltime=24:00:00

#Enable the container modules
source /pkg/shpc/enable

#Load the SRA-TOOLS module
module load sra-tools/3.0.5--h9f5acd7_1

#work on current directory (folder)
cd $PBS_O_WORKDIR
#Loop over the accession list copied earlier (SRA_Acc_List.txt)
for i in $(cat SRA_Acc_List.txt);
do
  echo "$i"
  prefetch.3 "$i"
  fasterq-dump.3 --split-files "$i"
done
gzip *fastq
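
The loop above expects one run accession per line, so a malformed line (for example one carrying a Windows carriage return) would make the download fail part-way through. As a hedged sketch, the hypothetical helper `invalid_accessions` below prints every line that does not look like a run accession, assuming accessions consist of ERR, SRR or DRR followed by digits.

```shell
#!/bin/bash
# Print every line of a list that is not a run accession
# (ERR, SRR or DRR followed only by digits).
invalid_accessions() {
  grep -v -E '^[EDS]RR[0-9]+$' "$1"
}

# Example usage before submitting the job:
#   invalid_accessions SRA_Acc_List.txt
```

An empty output means every line passed the check.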

Submit the PBS script to the HPC cluster:

Code Block
qsub launch_fetch_SRA.pbs

Monitor the progress of the job:

Code Block
qjobs
