Today we will learn how to download FASTQ files from a published paper:
Manuscript: Crotta et al. (2023). Repair of airway epithelia requires metabolic rewiring towards fatty acid oxidation. Nature Communications. http://doi.org/10.1038/s41467-023-36352-z
...
Click on the link above and search for “accession”, “Data availability”, “BioProject ID” or “GEO accession code”.
If only a GEO accession code is available, go to the GEO database and look up the corresponding BioProject ID. Note: ENA (Step 2) requires this identifier to download the data.
Which BioProject ID hosts the data used in the above manuscript?
...
Code Block |
---|
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz |
Now, using the TextEdit or Notepad app, add the following lines to the top of the script (copy and paste them above the first wget command):
Code Block |
---|
#!/bin/bash -l
#PBS -N nfrnaseq_test
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR |
...
Code Block |
---|
#!/bin/bash -l
#PBS -N nfrnaseq_test
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz |
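The run accessions for this project are consecutive (SRR20630339 to SRR20630356), and their ENA FTP paths follow a regular layout: the first six characters of the accession, then a zero followed by its last two digits, then the accession itself. As a rough sketch under that assumption, the download lines could be generated with a loop instead of being pasted one by one. The helper names `ena_single_end_url` and `print_download_commands` are hypothetical, and the layout rule is inferred only from the URLs in this tutorial for 11-character, single-end runs.

```shell
#!/bin/bash
# Sketch: build ENA FTP URLs for 11-character, single-end run accessions.
# ena_single_end_url is a hypothetical helper, not part of any ENA tool.
ena_single_end_url() {
    local acc="$1"
    local prefix="${acc:0:6}"     # first six characters, e.g. SRR206
    local subdir="0${acc: -2}"    # zero plus the last two digits, e.g. 044
    echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/${subdir}/${acc}/${acc}.fastq.gz"
}

# Print one wget command per run, SRR20630339 through SRR20630356.
print_download_commands() {
    local n
    for n in $(seq 20630339 20630356); do
        echo "wget -nc $(ena_single_end_url "SRR${n}")"
    done
}

print_download_commands
```

The printed lines can be compared against the script downloaded from ENA before relying on them.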
STEP 6: Save the file and now let’s transfer it to the HPC. See below:
NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.
Windows PC: open File Explorer and type the address below to connect to your home directory on the HPC, then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder
Code Block |
---|
\\hpc-fs\home\ |
Mac: open Finder and press “command” + “k” to open the connection prompt, then type the address below and browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder
Code Block |
---|
smb://hpc-fs/home/ |
Drag and drop the script into the /workshop/2024-2/session4_RNAseq/data/mydata folder
STEP 7: We will make sure the file copied from our laptop or desktop does not contain unwanted characters. First, move to the data folder:
Code Block |
---|
cd $HOME/workshop/2024-2/session4_RNAseq/data/mydata |
How to use the dos2unix tool? Type:
...
Code Block |
---|
dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.txt |
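To see what dos2unix changes, it can help to check a file for Windows line endings (a carriage return before each newline). The sketch below is illustrative only: `demo.sh` is a throwaway file created for the example, `count_cr_lines` is a hypothetical helper, and `tr -d '\r'` is used as a stand-in that strips carriage returns much as dos2unix does, so the example runs even where dos2unix is not installed.

```shell
#!/bin/bash
# Work in a temporary directory so no real files are touched.
workdir="$(mktemp -d)"

# Create a two-line file with Windows (CRLF) line endings.
printf 'echo first\r\necho second\r\n' > "$workdir/demo.sh"

# Count the lines that contain a carriage return.
count_cr_lines() {
    grep -c $'\r' "$1" || true   # grep exits non-zero when the count is 0
}

before="$(count_cr_lines "$workdir/demo.sh")"

# Strip carriage returns into a new file, leaving the original untouched,
# in the same spirit as dos2unix -n writing to a second file.
tr -d '\r' < "$workdir/demo.sh" > "$workdir/demo_unix.sh"

after="$(count_cr_lines "$workdir/demo_unix.sh")"
echo "lines with carriage returns: before=$before after=$after"
```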
eResearch Downloading public data
ENA Browser
Go to the ENA Browser https://www.ebi.ac.uk/ena/browser/home
Search NGS data of interest
In the ‘View’ search box, enter one of the following identifiers:
Project accession (e.g., PRJNA229998)
Study accession (e.g., SRP033351)
Experiment accession (e.g., SRX384360)
Run accession (e.g., SRR1039508)
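The four identifier types can usually be told apart by their prefix. The following is a rough sketch, assuming INSDC-style accessions (PRJ for projects, and SRP, SRX and SRR, with their ERP/ERX/ERR and DRP/DRX/DRR counterparts, for studies, experiments and runs). `accession_type` is a hypothetical helper written for illustration, not an ENA tool.

```shell
#!/bin/bash
# Classify an accession by its prefix (assumed INSDC naming scheme).
accession_type() {
    case "$1" in
        PRJ*)     echo "project" ;;
        [SED]RP*) echo "study" ;;
        [SED]RX*) echo "experiment" ;;
        [SED]RR*) echo "run" ;;
        *)        echo "unknown" ;;
    esac
}

# Try it on the example identifiers listed above.
for acc in PRJNA229998 SRP033351 SRX384360 SRR1039508; do
    echo "$acc -> $(accession_type "$acc")"
done
```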
Once there, you can download any associated files by clicking the relevant links and then clicking on “Get download script”.
For example:
Code Block |
---|
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_2.fastq.gz |
Now create a PBS Pro submission script for the above commands and save it in a file called, for example, ‘launch_ENA_download.pbs’. Note: the script below downloads the data into the folder from which the job is submitted to the cluster.
Code Block |
---|
#!/bin/bash -l
#PBS -N download
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_2.fastq.gz |
Submit the download script to the cluster:
Code Block |
---|
qsub launch_ENA_download.pbs |
Monitor progress of job:
Code Block |
---|
qjobs |
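Each run in this example has two files, `_1` and `_2`, so a complete download contains both mates for every run. Once the job has finished, a quick check along the lines of the sketch below could flag any run with a missing mate. The function name `report_missing_mates` is hypothetical, and the demonstration uses empty placeholder files in a temporary folder rather than real data.

```shell
#!/bin/bash
# List files whose paired mate (_1 or _2) is absent from a folder.
report_missing_mates() {
    local dir="$1" f mate
    for f in "$dir"/*_1.fastq.gz "$dir"/*_2.fastq.gz; do
        [ -e "$f" ] || continue    # skip glob patterns that matched nothing
        case "$f" in
            *_1.fastq.gz) mate="${f%_1.fastq.gz}_2.fastq.gz" ;;
            *)            mate="${f%_2.fastq.gz}_1.fastq.gz" ;;
        esac
        [ -e "$mate" ] || echo "missing: $(basename "$mate")"
    done
}

# Demonstration with placeholder files: SRR1039509 lacks its _2 file.
demo="$(mktemp -d)"
touch "$demo/SRR1039508_1.fastq.gz" "$demo/SRR1039508_2.fastq.gz" "$demo/SRR1039509_1.fastq.gz"
report_missing_mates "$demo"
```

On the cluster, the same function could be pointed at the download folder, for example `report_missing_mates .`.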
Note: Files created or edited on Windows (for example, with Microsoft Excel or Notepad) often contain hidden Windows line-ending characters (carriage returns) that can break scripts on the HPC. Use dos2unix to remove them.
Now we are ready to submit the download script to the HPC cluster:
Code Block |
---|
qsub ena-file-download-selected-files-20241013-1123.pbs |
Monitor progress of job:
Code Block |
---|
qjobs |
Note: Downloading the above datasets will take about 50 minutes.
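Because the download runs for a long time, it is worth confirming afterwards that no file was cut short. A minimal sketch, assuming the files are standard gzip archives: `gzip -t` tests each archive without unpacking it. The helper name `check_gzip_files` and the demonstration files are made up for this example.

```shell
#!/bin/bash
# Test each gzip file and report OK or FAILED; return non-zero if any failed.
check_gzip_files() {
    local f status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "FAILED  $f"
            status=1
        fi
    done
    return "$status"
}

# Demonstration: one valid archive and one deliberately truncated copy.
demo="$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$demo/good.fastq.gz"
head -c 10 "$demo/good.fastq.gz" > "$demo/truncated.fastq.gz"
check_gzip_files "$demo/good.fastq.gz" "$demo/truncated.fastq.gz" || echo "at least one file failed"
```

In the download folder, this could be run as `check_gzip_files *.fastq.gz`. Any file reported as FAILED can be deleted and fetched again by resubmitting the script, since `wget -nc` skips files that already exist.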
The link below describes alternative approaches for downloading data from SRA or BaseSpace, or for using the nf-core/fetchngs pipeline: