Today we will learn how to download FASTQ files from a published paper:

Manuscript: Crotta et al. (2023). Repair of airway epithelia requires metabolic rewiring towards fatty acid oxidation. Nature Communications. http://doi.org/10.1038/s41467-023-36352-z

File: s41467-023-36352-z.pdf

STEP 1: Find where the data is available for download in the above manuscript

Click on the link above and search for “accession”, “Data availability”, “BioProject ID” or “GEO accession code”.
If only a GEO accession code is available, go to the GEO database and look up the BioProject ID. Note: ENA (Step 2) requires this identifier to download the data.

Which BioProject ID hosts the data used in the above manuscript?

Solution: PRJNA862097

STEP 2: Search for data for the identified BioProject ID at the European Nucleotide Archive (ENA) database

Go to https://www.ebi.ac.uk/ena/browser/home, search for the BioProject ID using the search option in the top right corner, and click on ‘view’:

...

STEP 3: (If applicable) Select one or more BioProject submission(s). Click on the first listed BioProject ID:

...

STEP 4: Select the FASTQ files (tick the boxes next to the file names) and click on “Get download script”. This will download a bash script (e.g., ena-file-download-selected-files-20241009-0005.sh).

...

STEP 5: Open the downloaded ENA script using TextEdit (Mac), Notepad (Windows) or a similar app. The downloaded script looks like this:

Code Block

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz
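The URLs in the download script appear to follow a regular layout: the first six characters of the run accession, then (for accessions longer than nine characters) the remaining digits zero-padded to three, then the accession itself. The sketch below rebuilds the directory URL from a run accession. The function name `ena_fastq_dir` is hypothetical, and the layout is inferred from the example URLs on this page (the nine-character case is an assumption), so the script generated by the ENA Browser remains the reliable source.

```shell
# Hypothetical helper: rebuild the ENA FTP directory for a run accession.
# Assumption: the layout is inferred from example URLs, not from an official API.
ena_fastq_dir() {
    local acc base prefix extra
    acc="$1"
    base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR206
    extra=$(printf '%s' "$acc" | cut -c10-)    # characters after the 9th, e.g. 44
    case ${#extra} in
        0) echo "$base/$prefix/$acc" ;;             # assumed: no extra sub-folder
        1) echo "$base/$prefix/00$extra/$acc" ;;    # e.g. 008
        2) echo "$base/$prefix/0$extra/$acc" ;;     # e.g. 044
        *) echo "$base/$prefix/$extra/$acc" ;;
    esac
}
```

For example, `ena_fastq_dir SRR20630344` prints the folder that holds SRR20630344.fastq.gz in the script above.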

Now, using TextEdit or Notepad, add the following lines to the top of the script (copy and paste them above the first wget line):

Code Block
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

You should have this:

Code Block

#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/044/SRR20630344/SRR20630344.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/049/SRR20630349/SRR20630349.fastq.gz
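If the ENA script is already on the HPC, the same header can be added from the terminal instead of a desktop text editor. The following is a minimal sketch, not part of the workshop material: the function name `add_pbs_header` is hypothetical, and the job name and resources are simply the ones used in the header above.

```shell
# Hypothetical helper: prepend the PBS header to an ENA download script.
# Usage: add_pbs_header <ena_script.sh> <output.pbs>
add_pbs_header() {
    local in out
    in="$1"
    out="$2"
    {
        # The quoted 'EOF' keeps $PBS_O_WORKDIR literal, so it is only
        # expanded later, when the job runs under PBS.
        cat <<'EOF'
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

EOF
        # Append the original wget lines unchanged.
        cat "$in"
    } > "$out"
}
```

The original file is left untouched; the output file is the one to submit with qsub.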

Now using the TextEdit or NotePad app, we will add the following lines to the top of the script:

Code Block
#!/bin/bash -l
#PBS -N ENAdownload
#PBS -l walltime=72:00:00
#PBS -l mem=16gb
#PBS -l ncpus=8

Copy the script to your HPC working folder $HOME/workshop/2024-2/session4_RNAseq/data
See below how to drag and drop the file using File Finder

NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.

To browse the working folder on the HPC, type the address below in your file browser:

Windows PC: open File Explorer and type the address below to connect to your home directory on the HPC. Remember to replace “USER” with your actual user name.

Code Block
\\hpc-fs\home\USER\workshop\2024-2\session4_RNAseq\data

Mac: open Finder and press “command” + “k” to open the “Connect to Server” prompt, then type the address below. Remember to replace “USER” with your actual user name.

Code Block
smb://hpc-fs/home/USER/workshop/2024-2/session4_RNAseq/data

Evaluate the nucleotide distributions in the 5'-end and 3'-end of the sequenced reads (Read1 and Read2). Look into the “MultiQC” folder and open the provided HTML report.

Copy the downloaded file to your HPC account, or copy its content into a file created on the HPC using nano (or another text editor).
Add the PBS Pro scheduler lines and submit the job. See step-by-step details at:

eResearch Downloading public data

Source: https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html#using-ena-file-downloader-command-line-tool

ENA Browser

Go to the ENA Browser https://www.ebi.ac.uk/ena/browser/home

Search NGS data of interest

In the ‘view’ search box, enter one of the following identifiers:

Project accession (e.g., PRJNA229998)
Study accession (e.g., SRP033351)
Experiment accession (e.g., SRX384360)
Run accession (e.g., SRR1039508)
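The four identifier types can be told apart by their prefix. The sketch below is an illustration only: the function name `accession_type` is hypothetical, and it assumes the usual INSDC prefixes, where the first letter indicates the archive that issued the accession (S for NCBI, E for ENA, D for DDBJ).

```shell
# Hypothetical helper: report which kind of accession an identifier looks like.
# Assumption: prefixes follow the INSDC convention (PRJ, xRP, xRX, xRR).
accession_type() {
    case "$1" in
        PRJ*)          echo "project" ;;
        [SED]RP[0-9]*) echo "study" ;;
        [SED]RX[0-9]*) echo "experiment" ;;
        [SED]RR[0-9]*) echo "run" ;;
        *)             echo "unknown" ;;
    esac
}
```

For instance, `accession_type SRX384360` prints `experiment`.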

Once there, you can download any associated files by clicking the relevant links and then clicking on “Get download script”.

For example:

Code Block


wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/055/SRR20630355/SRR20630355.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/047/SRR20630347/SRR20630347.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/050/SRR20630350/SRR20630350.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/042/SRR20630342/SRR20630342.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/053/SRR20630353/SRR20630353.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/043/SRR20630343/SRR20630343.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/039/SRR20630339/SRR20630339.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/056/SRR20630356/SRR20630356.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/054/SRR20630354/SRR20630354.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/041/SRR20630341/SRR20630341.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/045/SRR20630345/SRR20630345.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/051/SRR20630351/SRR20630351.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/040/SRR20630340/SRR20630340.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/048/SRR20630348/SRR20630348.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/052/SRR20630352/SRR20630352.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/046/SRR20630346/SRR20630346.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_2.fastq.gz

Now create a PBS Pro submission script for the above and save it in a file called, for example, ‘launch_ENA_download.pbs’. Note: the script below downloads the data into the folder from which the job is submitted to the cluster.

Code Block

#!/bin/bash -l
#PBS -N download
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_2.fastq.gz

Submit the download script to the cluster:

Code Block
qsub launch_ENA_download.pbs

Monitor progress of job:

Code Block
qjobs
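Because `wget -nc` skips files that already exist, the same script can be resubmitted safely if a job stops early. Once a job has ended, a quick way to see what is still missing is to compare the file names in the script with the files in the folder. This is a minimal sketch; the function name `list_missing_downloads` is hypothetical, and it assumes each wget line ends with the URL, as in the scripts above.

```shell
# Hypothetical helper: list files named in a download script that are not yet
# present in the current directory.
# Usage: list_missing_downloads <script>
list_missing_downloads() {
    grep '^wget ' "$1" | while read -r line; do
        url=${line##* }     # the last word on the line is the URL
        file=${url##*/}     # keep only the file name after the final slash
        [ -e "$file" ] || echo "$file"
    done
}
```

Run it from the download folder, for example `list_missing_downloads launch_ENA_download.pbs`. Note that a partially downloaded file still counts as present, both for this check and for `wget -nc`.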

Download data using the nf-core/fetchngs pipeline

Source: https://nf-co.re/fetchngs/1.12.0/

...

As an alternative to the above approach, we can use the Nextflow nf-core/fetchngs pipeline to download data.

To run this pipeline we need two inputs: 1) a list of SRA identifiers and 2) a PBS Pro script to fetch the data using sratools.

First, prepare a file with the list of SRA IDs of interest to be downloaded:

Hint:

In the terminal create a new folder called ‘fetchngs’. For example:

Code Block
mkdir $HOME/workshop/2024-2/session4_RNAseq/data/fetchngs
#then, move to the newly created folder
cd $HOME/workshop/2024-2/session4_RNAseq/data/fetchngs

Copy the following list of IDs. Hint: click on the top right corner of the block below to copy the text.

Code Block
SRR20622172
SRR20622173
SRR20622177
SRR20622176
SRR20622180
SRR20622174
SRR20622178
SRR20622179
SRR20622175

Alternatively, instead of a list of SRR identifiers, it is possible to download all data in a given BioProject ID:

Code Block
PRJNA862097

NOTE: Listing the run identifiers above or giving the BioProject ID in the ‘ids.csv’ file downloads exactly the same data.

Create a file called ‘ids.csv’ using nano and paste the list of IDs:

Code Block
nano ids.csv
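The pipeline expects one identifier per line, so it can help to check the file after pasting. The sketch below is a hypothetical helper (`validate_ids` is not part of nf-core/fetchngs) and its pattern only covers the identifier types used on this page (BioProject, study, experiment, sample and run accessions); other ID types accepted by the pipeline would be reported as unexpected.

```shell
# Hypothetical helper: check that every line of an ID file is a single
# accession. Lines with spaces, blank lines or Windows line endings are
# printed (with their line number) on stderr and the function returns 1.
validate_ids() {
    local pattern
    pattern='^(PRJ[A-Z][A-Z][0-9]+|[SED]R[RXPS][0-9]+)$'
    if grep -n -v -E "$pattern" "$1" >&2; then
        echo "The lines above do not look like single accessions" >&2
        return 1
    fi
    return 0
}
```

Running `validate_ids ids.csv` prints nothing and returns 0 when every line holds exactly one accession.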

Next, copy and paste the following PBS script to download the specified files in ‘ids.csv’.
NOTE: instead of listing individual SRR identifiers it is also possible to list the BioProject ID (e.g., PRJNA862107) which will fetch all SRR samples automatically.

Secondly, create a launch PBS script to download the data for the above IDs.

Copy the block of code below. Hint: click on the top right corner of the block below to copy the text.

Code Block

#!/bin/bash -l
#PBS -N nf_fetchngs
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=48:00:00
#work on current directory
cd $PBS_O_WORKDIR
#load java and set up memory settings to run nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
#run the nf-core/fetchngs pipeline
nextflow run nf-core/fetchngs \
   -profile singularity \
   --input ids.csv \
   --outdir data \
   --download_method sratools \
   --nf_core_pipeline rnaseq \
   -resume

Use nano to create a launch script, for example:

Code Block
nano launch_nf_core_fetchngs.pbs

Paste the block of code above and save the file.

Submit the download job to the HPC cluster:

Code Block
qsub launch_nf_core_fetchngs.pbs

Outputs:

Code Block

data
├── custom
│   └── user-settings.mkfg
├── fastq
│   ├── SRX16645917_SRR20622180.fastq.gz
│   ├── SRX16645918_SRR20622179.fastq.gz
│   ├── SRX16645919_SRR20622178.fastq.gz
│   ├── SRX16645920_SRR20622177.fastq.gz
│   ├── SRX16645921_SRR20622175.fastq.gz
│   ├── SRX16645922_SRR20622174.fastq.gz
│   ├── SRX16645923_SRR20622173.fastq.gz
│   ├── SRX16645924_SRR20622176.fastq.gz
│   └── SRX16645925_SRR20622172.fastq.gz
├── metadata
│   ├── SRR20622172.runinfo_ftp.tsv
│   ├── SRR20622173.runinfo_ftp.tsv
│   ├── SRR20622174.runinfo_ftp.tsv
│   ├── SRR20622175.runinfo_ftp.tsv
│   ├── SRR20622176.runinfo_ftp.tsv
│   ├── SRR20622177.runinfo_ftp.tsv
│   ├── SRR20622178.runinfo_ftp.tsv
│   ├── SRR20622179.runinfo_ftp.tsv
│   └── SRR20622180.runinfo_ftp.tsv
├── pipeline_info
│   ├── execution_report_2024-08-29_14-23-00.html
│   ├── execution_timeline_2024-08-29_14-23-00.html
│   ├── execution_trace_2024-08-29_14-23-00.txt
│   ├── nf_core_fetchngs_software_mqc_versions.yml
│   ├── params_2024-08-29_14-23-05.json
│   └── pipeline_dag_2024-08-29_14-23-00.html
└── samplesheet
    ├── id_mappings.csv
    ├── multiqc_config.yml
    └── samplesheet.csv
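A quick sanity check after the pipeline finishes is to count the FASTQ files and the rows in the generated samplesheet. This is a rough sketch: the function name `summarise_fetchngs` is hypothetical, and it assumes the folder layout shown above and that samplesheet.csv has a single header line followed by one row per sample.

```shell
# Hypothetical helper: summarise a fetchngs output folder.
# Usage: summarise_fetchngs <outdir>   (e.g. the folder given to --outdir)
summarise_fetchngs() {
    local outdir n_fastq n_rows
    outdir="$1"
    # Count compressed FASTQ files directly inside <outdir>/fastq
    n_fastq=$(find "$outdir/fastq" -maxdepth 1 -name '*.fastq.gz' | wc -l | tr -d ' ')
    # Count samplesheet rows, skipping the (assumed) single header line
    n_rows=$(tail -n +2 "$outdir/samplesheet/samplesheet.csv" | wc -l | tr -d ' ')
    echo "fastq_files=$n_fastq samplesheet_rows=$n_rows"
}
```

For single-end data like the tree above, the two numbers would be expected to match; paired-end runs produce two FASTQ files per row.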

STEP 6: Save the file and now let’s transfer it to the HPC. See below:

NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.

Windows PC: open File Explorer and type the address below to connect to your home directory on the HPC, and then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder

Code Block
\\hpc-fs\home\

Mac: open Finder and press “command” + “k” to open the “Connect to Server” prompt, then type the address below, and then browse to the /workshop/2024-2/session4_RNAseq/data/mydata folder

Code Block
smb://hpc-fs/home/

Drag and drop the script into the /workshop/2024-2/session4_RNAseq/data/mydata folder

STEP 7: We will ensure the copied file from our laptop / desktop does not have unwanted characters. Let’s move to the data folder:

Code Block
cd $HOME/workshop/2024-2/session4_RNAseq/data/mydata

How to use the dos2unix tool? Type:

Code Block
dos2unix --help

Now let’s run dos2unix conversion. Note the filename may vary, so adjust the filename as appropriate.

Code Block
dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.pbs

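Note: Files created or edited on Windows (for example, with Microsoft Excel or Notepad) usually contain Windows-style line endings, that is, an extra carriage return character at the end of each line, which can break scripts on the HPC. Use dos2unix to remove these characters.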
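To find out whether a file actually needs converting, it is possible to look for carriage return characters first. The following is a minimal sketch; the function name `has_crlf` is hypothetical and it relies only on standard `printf` and `grep`.

```shell
# Hypothetical helper: succeed (exit status 0) if the file contains at least
# one carriage return character, as left behind by Windows line endings.
has_crlf() {
    local cr
    cr=$(printf '\r')      # a literal carriage return
    grep -q "$cr" "$1"
}
```

For example, `has_crlf myscript.sh && echo "run dos2unix on this file"`, where `myscript.sh` stands for whichever file was copied from the laptop or desktop.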

Now we are ready to submit to the HPC cluster the script to download FASTQ files:

Code Block
qsub ena-file-download-selected-files-20241013-1123.pbs

Monitor progress of job:

Code Block
qjobs

Note: Downloading the above datasets will take about 50 minutes.
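When the job has finished, it is worth confirming that the files are complete, since an interrupted transfer can leave a truncated archive behind. The sketch below uses `gzip -t`, which tests the integrity of a compressed file without extracting it. The function name `check_fastq_gz` is hypothetical.

```shell
# Hypothetical helper: test every *.fastq.gz in a folder and report the result.
# Usage: check_fastq_gz [folder]   (defaults to the current folder)
# Returns 1 if at least one file fails the gzip integrity test.
check_fastq_gz() {
    local dir status f
    dir="${1:-.}"
    status=0
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue          # nothing matched the pattern
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            status=1
        fi
    done
    return $status
}
```

Any file reported as CORRUPT can be deleted and the download job resubmitted; with `wget -nc` only the missing files are fetched again.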

Find in the link below alternative approaches to download data from SRA, BaseSpace or use the nf-core/fetchngs pipeline:

Data Download
