Today we will learn to download FASTQ files from a published paper:
Manuscript: Crotta et al. (2023). Nature Communications. http://doi.org/10.1038/s41467-023-36352-z
...
Click on the link above and search for “Accession”, “Data availability”, “BioProject ID” or “GEO accession code”
If only a GEO accession code is available, go to the GEO database and look for the BioProject ID. Note: ENA (Step 2) requires this identifier to download the data.
Which BioProject ID hosts the data used in the above manuscript?
...
Code Block |
---|
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_1.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_2.fastq.gz |
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520.fastq.gz |
Let’s sort the above file. To do this, copy the script to your HPC working folder $HOME/workshop/2024-2/session4_RNAseq/data. Alternatively use
See below how to drag and drop the file using File Finder
NOTE: To proceed, you need to be on QUT’s WiFi network or signed in via VPN.
To browse the working folder on the HPC, type the following in the file finder:
Windows PC: open file finder and type the address below to connect to your home directory on the HPC. Remember to replace “USER” with your actual username.
Code Block |
---|
\\hpc-fs\home\USER\workshop\2024-4\session4_RNAseq\data |
Mac: open Finder and press “command” + “k” to open the connection prompt, then type the address below. Remember to replace “USER” with your actual username.
Code Block |
---|
smb://hpc-fs/home/USER/workshop/2024-4/session4_RNAseq/data |
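Once the script is in the working folder, the sorting step mentioned earlier could look like the sketch below. It assumes the wget commands were saved one per line in a file called `download.sh` (a hypothetical name), and it uses only four of the lines as sample input so that it can be tried safely in a scratch directory. Sorting on the ninth `/`-separated field orders the commands by FASTQ file name, so Read1 and Read2 of each sample end up next to each other.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing in the real data folder is touched.
workdir=$(mktemp -d)

# Hypothetical file name; the content is a subset of the wget lines shown earlier.
cat > "$workdir/download.sh" <<'EOF'
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz
EOF

# Split each line on "/" and sort by the 9th field, which is the FASTQ file name.
LC_ALL=C sort -t/ -k9,9 "$workdir/download.sh" > "$workdir/download_sorted.sh"

cat "$workdir/download_sorted.sh"
```

On the real file, the same `sort` command can be run directly in the working folder; the output name `download_sorted.sh` is also only a suggestion.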
Evaluate the nucleotide distributions in the 5'-end and 3'-end of the sequenced reads (Read1 and Read2). Look into the “MultiQC” folder and open the provided HTML report.
Copy the downloaded file to your HPC account or copy the content to a file created in the HPC using Nano (or other text editor)
Add the PBS Pro scheduler lines and submit a job. See step-by-step details at:
...
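As a rough illustration of what "adding the PBS Pro scheduler lines" can mean, the sketch below writes a small launch script to disk. The job name, resource requests (1 CPU, 4 GB, 12 hours) and the file names `launch_download.pbs` and `download.sh` are all assumptions for illustration, not values prescribed by this workshop; adjust them to your cluster's guidelines.

```shell
#!/usr/bin/env bash
# Write a hypothetical PBS launch script; every resource value is a placeholder.
cat > launch_download.pbs <<'EOF'
#!/bin/bash -l
#PBS -N ena_download
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=12:00:00

# Run from the folder the job was submitted from
cd "$PBS_O_WORKDIR"

# Either paste the wget commands here, or call the file that contains them
bash download.sh
EOF

# Show the result; submit it with: qsub launch_download.pbs
cat launch_download.pbs
```

The quoted heredoc delimiter (`'EOF'`) keeps `$PBS_O_WORKDIR` unexpanded in the file, so it is resolved by the scheduler at run time rather than when the file is written.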
ENA Browser
Go to the ENA Browser https://www.ebi.ac.uk/ena/browser/home
Search NGS data of interest
In the ‘view search box’ enter one of the following identifiers:
...
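For readers who prefer the command line, the same lookup can be approximated through the ENA Portal API. The sketch below only builds the request URL; the helper name `ena_filereport_url` is hypothetical, the accession is the BioProject ID quoted further down this page, and the chosen fields (`run_accession`, `fastq_ftp`) are one reasonable selection rather than the only option.

```shell
#!/usr/bin/env bash
# Build an ENA Portal API "filereport" URL for a study or run accession.
# ena_filereport_url is a hypothetical helper name used for illustration.
ena_filereport_url() {
    local accession="$1"
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp&format=tsv\n' "$accession"
}

url=$(ena_filereport_url PRJNA862097)
echo "$url"

# Fetching the table needs network access, for example:
#   curl -s "$url" > filereport.tsv
```

The resulting tab-separated table lists one run per row with its FASTQ FTP paths, which is the same information the browser's download options are built from.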
Monitor progress of job:
Code Block |
---|
qjobs |
Download data using the nf-core/fetchngs pipeline
Source: https://nf-co.re/fetchngs/1.12.0/
...
First, prepare a file with the list of SRA IDs of interest to be downloaded:
Hint:
In the terminal create a new folder called ‘fetchngs’. For example:
Code Block |
---|
mkdir $HOME/workshop/2024-2/session4_RNAseq/data/fetchngs |
# then, move to the newly created folder |
cd $HOME/workshop/2024-2/session4_RNAseq/data/fetchngs |
Copy the following list of IDs. Hint: click on the top right corner of the block below to copy the text.
...
Alternatively, instead of a list of SRR identifiers, it is possible to download all data in a given BioProject ID:
Code Block |
---|
PRJNA862097 |
NOTE: Either listing the IDs above or giving the BioProject ID in the ‘ids.csv’ file will download exactly the same data.
Create an ‘ids.csv’ file using nano and paste the list of IDs:
...
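If you would rather not open an editor, the file can also be written directly from the terminal. This is a minimal sketch that assumes you are inside the ‘fetchngs’ folder and want the BioProject variant of the file, using the BioProject ID quoted above; nf-core/fetchngs expects one identifier per line with no header.

```shell
#!/usr/bin/env bash
# Write one identifier per line, no header (here: the BioProject ID quoted above).
printf '%s\n' PRJNA862097 > ids.csv

# Check the content before launching the pipeline.
cat ids.csv
```

To list individual runs instead, pass several identifiers to the same `printf` call and each one will be written on its own line.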
Next, copy and paste the following PBS script to download the specified files in ‘ids.csv’.
NOTE: instead of listing individual SRR identifiers it is also possible to list the BioProject ID (e.g., PRJNA862107) which will fetch all SRR samples automatically.
Second, create a PBS launch script to download the data for the above IDs:
...
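For orientation, a launch script for this step might resemble the sketch below. It is not the workshop's official script: the job name, resource requests, output folder `results` and the `singularity` profile are assumptions, and how Nextflow is made available (module, conda, or otherwise) depends on your cluster. The pipeline version matches the documentation linked above.

```shell
#!/usr/bin/env bash
# Write a hypothetical PBS launch script for nf-core/fetchngs.
cat > launch_fetchngs.pbs <<'EOF'
#!/bin/bash -l
#PBS -N nfcore_fetchngs
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=24:00:00

# Run from the folder that contains ids.csv
cd "$PBS_O_WORKDIR"

# Site-specific step: make Nextflow available here (module, conda env, etc.)

nextflow run nf-core/fetchngs \
    -r 1.12.0 \
    -profile singularity \
    --input ids.csv \
    --outdir results
EOF

# Show the result; submit it with: qsub launch_fetchngs.pbs
cat launch_fetchngs.pbs
```

Pinning the release with `-r` keeps reruns reproducible, and `--input` accepts either the list of SRR identifiers or the single BioProject ID stored in ‘ids.csv’.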