eResearch Downloading public data
Aims:
Use the European Nucleotide Archive (ENA) to search and fetch data of interest.
Learn how to download public RNAseq data using the HPC.
Work in your Desktop / Laptop |
---|
ENA link: https://www.ebi.ac.uk/ena/browser/view/PRJNA862107
Search for data of interest
In the ‘View’ search box, enter one of the following identifiers:
To fetch all or selected files in the project:
Project accession (e.g., PRJNA862107)
To fetch individual files:
Experiment accession (e.g., SRX16645923)
Run accession (e.g., SRR20622173)
Once there, you can download any associated files by clicking the relevant links and then clicking on “Get download script.”
Mouse: Project PRJNA862107
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/072/SRR20622172/SRR20622172.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/073/SRR20622173/SRR20622173.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/077/SRR20622177/SRR20622177.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/076/SRR20622176/SRR20622176.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/080/SRR20622180/SRR20622180.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/074/SRR20622174/SRR20622174.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/078/SRR20622178/SRR20622178.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/079/SRR20622179/SRR20622179.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/075/SRR20622175/SRR20622175.fastq.gz
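The URLs listed above follow a regular pattern: the first six characters of the run accession, then a three-digit subfolder made of a zero plus the last two digits, then the accession itself. The sketch below rebuilds a URL from a run accession. It assumes single-end runs with 11-character accessions like those in this project, and `ena_fastq_url` is a hypothetical helper name, not an ENA tool.

```shell
# Hypothetical helper: rebuild the ENA FTP URL for a single-end run accession.
# Assumes 11-character accessions (e.g. SRR20622172), as in the list above.
ena_fastq_url() {
    acc="$1"
    prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR206
    last2="${acc#"${acc%??}"}"                 # last two characters, e.g. 72
    echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/0${last2}/${acc}/${acc}.fastq.gz"
}

ena_fastq_url SRR20622172
```

For other projects, the “Get download script” file remains the safer source of URLs, since paired-end runs and shorter accessions use different file and folder names.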
Work in the HPC |
---|
Use the terminal to log into the HPC and create a /data/ folder to download FASTQ files. For example:
mkdir -p $HOME/workshop
mkdir -p $HOME/workshop/data
cd $HOME/workshop/data
Work in your Desktop / Laptop |
---|
Copy the download script obtained from ENA to the newly created “data” folder on the HPC.
Use the ‘File Finder’ to connect to the HPC:
Windows (click on the tab of the file finder and type or copy-paste the following):
\\hpc-fs\work
Mac (Command + K):
Navigate to the /workshop/data/ folder, then drag and drop the ENA download script into it.
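Once the file is on the HPC, a quick check that it arrived intact can save a failed job later. The sketch below counts the `wget` lines in the script; for project PRJNA862107 the list above has nine. `count_downloads` and the file name in the example are assumptions for illustration, so substitute the name of the file you actually downloaded.

```shell
# Hypothetical helper: count the download commands in an ENA script.
count_downloads() {
    grep -c '^wget' "$1"
}

# Example (the file name is a placeholder, not the real ENA file name):
# count_downloads "$HOME/workshop/data/ena-file-download.sh"
```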
Work in the HPC |
---|
Now, let’s create a PBS Pro submission script to download the data. Two options are described below; use either one:
Option 1: Use a script to read the ENA file
Tip: this option does not require the use of a text editor like vi or nano.
First, let’s get a copy of a script called “launch_read_ENA_download.pbs” as follows:
List the files in the directory:
You should have the ENA file and the launch script. For example:
Now, let’s submit the job to the HPC cluster. We use the ‘qsub’ command to submit the script, and we use the -v option to pass the name of the ENA file to the job as the variable “input_file”. For example:
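The submission step described above can be sketched as follows. This is an illustration assembled from the description, not the workshop's exact command: `submit_download` is a hypothetical wrapper, and the ENA file name in the example is a placeholder. When `qsub` is not available (for instance, on a laptop), the wrapper only prints the command it would run.

```shell
# Hypothetical wrapper around the submission step described above.
# $1 is the ENA script name (use the name of the file you copied over).
submit_download() {
    if command -v qsub >/dev/null 2>&1; then
        # On the HPC: -v passes input_file into the job's environment
        qsub -v input_file="$1" launch_read_ENA_download.pbs
    else
        # Elsewhere: only show the command that would be submitted
        echo "qsub -v input_file=$1 launch_read_ENA_download.pbs"
    fi
}

# Example with a placeholder file name:
# submit_download ena-file-download.sh
```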
You can monitor the progress of the job by running the following command:
NOTE: The following code is for your reference only; we will not run it on the HPC. The content of launch_input_ENA_download.sh is:
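As an illustration only, and not the workshop's actual file, a PBS Pro job that reads an ENA file could be sketched like this. The job name, resource requests and the `run_ena_file` function are all assumptions; the only detail taken from the text is that the ENA file name reaches the job through the `input_file` variable.

```shell
#!/bin/bash
#PBS -N ENA_download
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=04:00:00
# The resource values above are placeholders, not the workshop's settings.

# Hypothetical function: run the commands in the ENA script given as $1.
run_ena_file() {
    # PBS starts jobs in $HOME; move to the folder the job was submitted from
    cd "${PBS_O_WORKDIR:-$PWD}" || return 1
    sh "$1"
}

# input_file is supplied at submission time with: qsub -v input_file=<ENA file>
if [ -n "${input_file:-}" ]; then
    run_ena_file "$input_file"
fi
```

Because the ENA file is itself a list of `wget -nc` commands, executing it inside the job is enough to download every file into the submission folder.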
Option 2: Create a PBS Pro submission script:
(For advanced users): Use vi or nano text editors to create the following PBS Pro script. Copy the following code and paste it into a new file called, for example, launch_ENA_download_SRR206.pbs
Once you have the launch PBS script ready, proceed to submit it to the HPC as follows:
You can monitor the progress of the job by running the following command:
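Once the job has finished, it may be worth confirming that the FASTQ files downloaded completely before using them. A minimal sketch, assuming `gzip` is available and the files are in the data folder created earlier; `check_fastq_gz` is a hypothetical helper name, not part of the workshop scripts.

```shell
# Hypothetical helper: test the gzip integrity of every .fastq.gz file in a folder.
check_fastq_gz() {
    dir="${1:-.}"
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue          # nothing to do if no files match
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "FAILED  $f"            # truncated or corrupt file
        fi
    done
}

check_fastq_gz "$HOME/workshop/data"
```

Because the ENA script uses `wget -nc`, a file reported as FAILED needs to be deleted before the job is resubmitted; otherwise wget skips it as already present.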
Additional data download user guides
To find additional options to download public or private data to the HPC see Data Download
Tip: additional options include i) NCBI’s Sequence Read Archive (SRA), and ii) Illumina’s BaseSpace.