3. Fetch public RNA-seq data

Today we will learn to download FASTQ files from a published paper:

Manuscript: Crotta et al. (2023). Repair of airway epithelia requires metabolic rewiring towards fatty acid oxidation. Nature Communications. http://doi.org/10.1038/s41467-023-36352-z

STEP 1: Find where the data is available for download in the above manuscript

Click on the link above and search for “accession”, “Data availability”, “BioProject ID” or “GEO accession code”.
If only a GEO accession code is available, go to the GEO database and look up the corresponding BioProject ID. Note: ENA (STEP 2) requires this identifier to download the data.

Which BioProject ID hosts the data used in the above manuscript?

Solution

PRJNA862097

STEP 2: Search for data for the identified BioProject ID at the European Nucleotide Archive (ENA) database

Go to https://www.ebi.ac.uk/ena/browser/home, search for the BioProject ID using the search box in the top-right corner, and click on ‘view’.

STEP 3: (If applicable) Select one or more BioProject submission(s). Click on the first listed BioProject ID.

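STEP 4: Select the FASTQ files (tick the boxes next to the file names) and click on “Get download script”. This will download a bash script (e.g., ena-file-download-selected-files-20241013-1123.sh; the date and time in the filename will differ).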
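
As an alternative to clicking through the browser, ENA also offers a file report through its Portal API. The sketch below is an illustration under stated assumptions rather than part of this workshop: report_to_wget is a hypothetical function name, the report is assumed to have run_accession and fastq_ftp columns, and the sample row simply reuses a URL from the script shown in STEP 5. It converts a tab-separated report into wget commands like those in the downloaded script.

```shell
# Hypothetical helper: turn an ENA file report (TSV on stdin) into wget commands.
# Assumed columns: run_accession <TAB> fastq_ftp (paired files separated by ';',
# URLs listed without the ftp:// scheme).
report_to_wget() {
    awk -F '\t' 'NR > 1 && $2 != "" {
        n = split($2, files, ";")
        for (i = 1; i <= n; i++) print "wget -nc ftp://" files[i]
    }'
}

# Sample row for illustration only. A real report could be fetched with, e.g.:
#   curl -s 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNA862097&result=read_run&fields=run_accession,fastq_ftp&format=tsv'
printf 'run_accession\tfastq_ftp\nSRR20630344\tftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz\n' \
    | report_to_wget
```

Piping the curl output into report_to_wget instead of the sample row would give one wget line per FASTQ file, provided the report really has the two assumed columns.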

STEP 5: Open the downloaded ENA script using TextEdit (Mac), Notepad (Windows) or a similar plain-text editor. The downloaded script looks like this:

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz
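
The URLs in this script follow a regular pattern, which is handy when checking that a run accession maps to the expected file. The sketch below rebuilds the URL for the accessions listed above. It is a hypothetical helper (ena_fastq_url is not part of any ENA tool) and assumes single-end runs with 11-character accessions, as in this listing; other accession lengths are rejected rather than guessed.

```shell
# Hypothetical helper: rebuild the ENA FTP URL for one run accession.
# Assumption: 11-character, single-end accessions such as SRR20630344.
ena_fastq_url() {
    run="$1"
    if [ "${#run}" -ne 11 ]; then
        echo "unsupported accession: $run" >&2
        return 1
    fi
    prefix=$(printf '%s' "$run" | cut -c1-6)    # first six characters, e.g. SRR206
    suffix=$(printf '%s' "$run" | cut -c10-11)  # last two digits, e.g. 44
    echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/0${suffix}/${run}/${run}.fastq.gz"
}

# Prints the same URL as the first wget line of the script above.
ena_fastq_url SRR20630344
```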

Now, using TextEdit or Notepad, add the following lines to the top of the script. Copy and paste them above the first wget line:

#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

You should have this:

#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz
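
The same edit can also be made on the command line instead of in a text editor. The following is a minimal sketch, assuming the downloaded script contains only wget lines. The add_pbs_header function and the demo file names are hypothetical; the header lines are the ones given in STEP 5.

```shell
# Hypothetical helper: write the PBS header followed by the contents of $1 into $2.
add_pbs_header() {
    {
        cat <<'EOF'
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

EOF
        cat "$1"
    } > "$2"
}

# Demo on a throwaway copy so that no real file is overwritten.
demo_dir=$(mktemp -d)
echo 'wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz' > "$demo_dir/ena.sh"
add_pbs_header "$demo_dir/ena.sh" "$demo_dir/ena.pbs"
head -n 4 "$demo_dir/ena.pbs"
```

The heredoc delimiter is quoted ('EOF') so that $PBS_O_WORKDIR is written literally and only expanded when the job runs.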

STEP 6: Save the file and now let’s transfer it to the HPC. See below:

NOTE: To proceed, you need to be on QUT’s WiFi network or signed in via VPN.

Windows PC: open File Explorer and type the address below in the address bar to connect to your home directory on the HPC, then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder

\\hpc-fs\home\

Mac: open Finder and press “command” + “k” to open the “Connect to Server” prompt, then type the address below and browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder

smb://hpc-fs/home/

Drag and drop the script into the /workshop/2024-2/session4_RNAseq/data/mydata folder

STEP 7: Ensure that the file copied from your laptop or desktop does not contain unwanted characters. First, move to the folder on the HPC where you placed the script:

cd $HOME/workshop/2024-2/session4_RNAseq/data/mydata

To see how to use the dos2unix tool, type:

dos2unix --help

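Now let’s run the dos2unix conversion. The -n option writes the converted text to a new file (here with a .pbs extension) and leaves the original untouched. The filename may vary, so adjust it as appropriate.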

dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.pbs

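Note: Files created or edited with Windows applications (e.g., Notepad or Microsoft Excel) usually contain Windows-style line endings (an extra carriage-return character at the end of each line), which can break scripts on the HPC. Use dos2unix to remove such characters.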
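
To see why this step matters, it can help to count the lines that carry a carriage return before and after conversion. The sketch below is illustrative only: count_cr_lines is a hypothetical helper, the demo file is generated on the fly, and tr -d '\r' stands in for dos2unix so that the example also runs where dos2unix is not installed.

```shell
# Hypothetical helper: count the lines in file $1 that contain a carriage return.
count_cr_lines() {
    cr=$(printf '\r')
    # grep -c exits non-zero when the count is 0, so the status is ignored here.
    grep -c "$cr" "$1" || true
}

# Demo file with Windows-style (CRLF) line endings.
demo=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$demo"
echo "lines with carriage returns before: $(count_cr_lines "$demo")"

# Stand-in for dos2unix: strip every carriage return into a new file.
tr -d '\r' < "$demo" > "$demo.unix"
echo "lines with carriage returns after: $(count_cr_lines "$demo.unix")"
```

Running count_cr_lines on the .sh and .pbs files from this step should show the same drop to zero if the original had Windows line endings.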

Now we are ready to submit the script to the HPC cluster to download the FASTQ files:

qsub ena-file-download-selected-files-20241013-1123.pbs

Monitor the progress of the job:

qjobs

Note: Downloading the above datasets will take about 50 minutes.
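
Once the job has finished, it is worth confirming that every file listed in the script arrived intact. The sketch below is one possible check, not part of the workshop material. check_downloads is a hypothetical helper that reads the wget lines of a script, looks for each file name in a directory, and runs gzip -t on it. The demo uses throwaway files and a placeholder URL rather than real FASTQ data.

```shell
# Hypothetical helper: for each wget URL in script $1, test the file in directory $2.
check_downloads() {
    grep '^wget ' "$1" | awk '{print $NF}' | while read -r url; do
        name=$(basename "$url")
        if [ ! -f "$2/$name" ]; then
            echo "MISSING $name"
        elif gzip -t "$2/$name" 2>/dev/null; then
            echo "OK $name"
        else
            echo "CORRUPT $name"
        fi
    done
}

# Demo with throwaway files (not real FASTQ data; example.org is a placeholder).
demo_dir=$(mktemp -d)
printf 'wget -nc ftp://example.org/a.fastq.gz\nwget -nc ftp://example.org/b.fastq.gz\n' > "$demo_dir/job.pbs"
echo '@read1' | gzip > "$demo_dir/a.fastq.gz"
check_downloads "$demo_dir/job.pbs" "$demo_dir"
```

On the HPC, after pasting the function into the shell session, the equivalent call from the download folder would be along the lines of: check_downloads ena-file-download-selected-files-20241013-1123.pbs .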

The link below describes alternative approaches for downloading data from SRA or BaseSpace, or for using the nf-core/fetchngs pipeline:

Data Download