miRBase and MirGeneDB
Download Reference microRNA data from miRBase
...
Code Block |
---|
mkdir -p $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase |
Download the miRBase datasets into this folder, either with wget in an interactive session (Option 1) or with a PBS Pro script (Option 2, see below).
OPTION #1: Use interactive session to run the following commands:
...
Code Block |
---|
wget https://mirbase.org/download/hsa.gff3 |
OPTION #2: submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (i.e., /myteam/data/reference/ ).
...
Let’s
copy the script to download miRBase files;
move to the reference folder; and
print the content of the launch_download_miRBase.pbs script with the code below:
Code Block |
---|
cp /work/training/2024/smallRNAseq/scripts/launch_download_miRBase.pbs $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/miRBase
cat launch_download_miRBase.pbs |
Code Block |
---|
#!/bin/bash -l
#PBS -N nfsmrnaseqdownload_miRBase
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=2:00:00
cd $PBS_O_WORKDIR
wget https://www.mirbase.org/download/hairpin.fa
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hsa.gff3 |
Submit the script to the HPC cluster:
Code Block |
---|
qsub launch_download_miRBase.pbs |
Monitor progress of the job:
Code Block |
---|
qjobs |
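Once the job finishes, a quick sanity check can confirm that the downloaded FASTA files contain records. The sketch below is illustrative and not part of the workshop scripts: the helper name `count_fasta_records` and the demo file are assumptions, and the demo records are made up to imitate the `>hsa-...` header style of miRBase FASTA files.

```shell
# Hypothetical helper: count FASTA records whose header starts with a prefix.
# Usage: count_fasta_records <fasta_file> [prefix]
count_fasta_records() {
    # FASTA headers start with ">"; grep -c counts the matching lines.
    grep -c "^>$2" "$1"
}

# Demo input with made-up records (not real miRBase content).
demo_dir=$(mktemp -d)
cat > "$demo_dir/demo_mature.fa" <<'EOF'
>hsa-demo-1 DEMO0001 Homo sapiens demo-1
UGAGGUAGUAGGUUGUAUAGUU
>mmu-demo-1 DEMO0002 Mus musculus demo-1
UGAGGUAGUAGGUUGUAUAGUU
>hsa-demo-2 DEMO0003 Homo sapiens demo-2
CUAUACAAUCUACUGUCUUUC
EOF

total=$(count_fasta_records "$demo_dir/demo_mature.fa")
human=$(count_fasta_records "$demo_dir/demo_mature.fa" "hsa-")
echo "total records: $total, human records: $human"
```

On the real downloads, the same helper could be called as `count_fasta_records mature.fa hsa-` from inside the miRBase folder.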
Fetch public small RNA-seq data
Today we will download small RNA-seq data from the ENA (European Nucleotide Archive).
...
Click on the link above and search for “accession”, “Data availability”, “BioProject ID”, “GEO accession code” or “Array Express” identifier.
If only an Array Express accession code is available, go to https://www.ebi.ac.uk/biostudies/arrayexpress and search for the Array Express identifier. Browse the database to locate the identifier for ENA.
Hint: it will take a couple of clicks to open multiple pages to find the identifier for the data deposited in ENA.
...
What is the Array Express identifier noted in the above manuscript, and which ENA identifier does it relate to?
Answer: Array Express: E-MTAB-2206, and ENA identifier: ERP004592
STEP 2: Search for data for the identified BioProject ID at the European Nucleotide Archive (ENA) database
...
STEP 3: Select FASTQ files (tick boxes next to the file names) and click on “Get download script”. NOTE: the script name will be different for each person downloading the bash script.
...
STEP 4: Download the metadata information for the study in TSV (Tab-Separated Values) format:
...
Open the
...
file using
...
an app for Text files (e.g., TextEdit, NotePad, etc):
Code Block |
---|
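run_accession	sample_accession	experiment_accession	study_accession	tax_id	scientific_name	fastq_ftp	submitted_ftp	sra_ftp	bam_ftp
ERR409882	SAMEA2300497	ERX376249	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409882/ERR409882.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409882/C_31.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409882
ERR409892	SAMEA2300502	ERX376254	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409892/ERR409892.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409892/H_13.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409892
ERR409893	SAMEA2300504	ERX376256	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409893/ERR409893.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409893/C_36.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409893
ERR409895	SAMEA2300492	ERX376244	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409895/ERR409895.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409895/H_09.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409895
ERR409897	SAMEA2300498	ERX376250	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409897/ERR409897.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409897/H_05.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409897
ERR409898	SAMEA2300501	ERX376253	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409898/ERR409898.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409898/C_29.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409898
ERR409899	SAMEA2300490	ERX376242	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409899/ERR409899.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409899/H_07.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409899
ERR409879	SAMEA2300495	ERX376247	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409879/ERR409879.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409879/C_39.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409879
ERR409880	SAMEA2300488	ERX376240	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409880/ERR409880.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409880/H_08.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409880
ERR409883	SAMEA2300487	ERX376239	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409883/ERR409883.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409883/C_35.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409883
ERR409884	SAMEA2300491	ERX376243	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409884/ERR409884.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409884/H_12.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409884
ERR409886	SAMEA2300493	ERX376245	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409886/ERR409886.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409886/H_06.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409886
ERR409888	SAMEA2300503	ERX376255	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409888/ERR409888.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409888/C_38.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409888
ERR409878	SAMEA2300496	ERX376248	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409878/ERR409878.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409878/C_33.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409878
ERR409889	SAMEA2300500	ERX376252	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409889/ERR409889.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409889/H_03.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409889
ERR409881	SAMEA2300509	ERX376261	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409881/ERR409881.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409881/H_10.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409881
ERR409885	SAMEA2300505	ERX376257	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409885/ERR409885.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409885/H_02.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409885
ERR409894	SAMEA2300507	ERX376259	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409894/ERR409894.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409894/H_14.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409894
ERR409887	SAMEA2300506	ERX376258	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409887/ERR409887.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409887/C_21.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409887
ERR409890	SAMEA2300489	ERX376241	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409890/ERR409890.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409890/H_01.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409890
ERR409896	SAMEA2300508	ERX376260	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409896/ERR409896.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409896/C_32.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409896
ERR409891	SAMEA2300494	ERX376246	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409891/ERR409891.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409891/C_14.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409891
ERR409900	SAMEA2300499	ERX376251	PRJEB5212	9606	Homo sapiens	ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409900/ERR409900.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409900/C_37.fastq.gz	ftp.sra.ebi.ac.uk/vol1/err/ERR409/ERR409900 |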
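The metadata report links each run accession to the file name originally submitted by the authors. As an illustration only, the sketch below pulls those two fields out of a tab-separated report with awk. The helper name `list_runs` and the demo file are assumptions, and the demo uses only a subset of the report columns, with two rows taken from the metadata shown above.

```shell
# Hypothetical helper: print "run_accession<TAB>submitted file name"
# for every data row of a tab-separated ENA report.
list_runs() {
    awk -F '\t' '
        NR == 1 {
            # Header row: remember the position of each column by name.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            # Keep only the last path element of the submitted_ftp field.
            n = split($(col["submitted_ftp"]), parts, "/")
            print $(col["run_accession"]) "\t" parts[n]
        }
    ' "$1"
}

# Demo report with a reduced set of columns.
demo_dir=$(mktemp -d)
{
    printf 'run_accession\tsample_accession\tfastq_ftp\tsubmitted_ftp\n'
    printf 'ERR409882\tSAMEA2300497\tftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409882/ERR409882.fastq.gz\tftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409882/C_31.fastq.gz\n'
    printf 'ERR409892\tSAMEA2300502\tftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409892/ERR409892.fastq.gz\tftp.sra.ebi.ac.uk/vol1/run/ERR409/ERR409892/H_13.fastq.gz\n'
} > "$demo_dir/demo_report.tsv"

list_runs "$demo_dir/demo_report.tsv"
```

Looking columns up by header name rather than by position keeps the helper working if the report is downloaded with a different column selection, as long as `run_accession` and `submitted_ftp` are included.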
See the link below for alternative approaches to downloading data from SRA or BaseSpace, or to using the nf-core/fetchngs pipeline:
For this approach you will need a list of SRA identifiers. For example, the identifiers for the human Huntington's disease study are:
Code Block |
---|
ERR409878
ERR409879
ERR409880
ERR409881
ERR409882
ERR409883
ERR409884
ERR409885
ERR409886
ERR409887
ERR409888
ERR409889
ERR409890
ERR409891
ERR409892
ERR409893
ERR409894
ERR409895
ERR409896
ERR409897
ERR409898
ERR409899
ERR409900 |
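Because the run accessions in this study are consecutive (ERR409878 to ERR409900), the same list could also be generated rather than typed. A minimal sketch follows; the output file name `my_accession_list.txt` is an assumption, chosen so that it does not overwrite the prepared SRA_Acc_List.txt.

```shell
# Write the consecutive run accessions, one per line.
# my_accession_list.txt is a hypothetical file name used only for this example.
seq 409878 409900 | sed 's/^/ERR/' > my_accession_list.txt

# Quick check: number of IDs, then the first and last entries.
n_ids=$(wc -l < my_accession_list.txt | tr -d ' ')
first_id=$(head -n 1 my_accession_list.txt)
last_id=$(tail -n 1 my_accession_list.txt)
echo "$n_ids accessions: $first_id ... $last_id"
```

This shortcut only works when the accessions form an unbroken numeric range; for other studies, copy the identifiers from the ENA or SRA report instead.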
The above list has already been prepared for you. Copy the list of IDs into the “my data” folder created previously:
Code Block |
---|
cp /work/training/2024/smallRNAseq/data/human_disease/SRA_Acc_List.txt $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata |
Now let’s also get a copy of the “launch_fetch_SRA.pbs” script into your “my data” folder:
Code Block |
---|
cp /work/training/2024/smallRNAseq/data/human_disease/launch_fetch_SRA.pbs $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata |
Check the content of the script:
Code Block |
---|
cat launch_fetch_SRA.pbs |
Use singularity container:
Code Block |
---|
singularity run -B $PWD /work/training/tools/sif_lib/sra-tools_v2.10.7.sif \
fastq-dump \
--split-files \
--outdir $HOME... |
STEP 5: Select FASTQ files (tick boxes next to the file names) and click on “Get download script”. This will download a bash script (e.g., ena-file-download-selected-files-20241009-0005.sh).
...
Open the downloaded ena file using TextEdit (NotePad or similar app). The downloaded script looks like this:
Code Block |
---|
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409878/ERR409878.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409879/ERR409879.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409880/ERR409880.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409881/ERR409881.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409882/ERR409882.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409883/ERR409883.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409884/ERR409884.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409885/ERR409885.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409886/ERR409886.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409887/ERR409887.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409888/ERR409888.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409889/ERR409889.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409890/ERR409890.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409891/ERR409891.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409892/ERR409892.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409893/ERR409893.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409894/ERR409894.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409895/ERR409895.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409896/ERR409896.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409897/ERR409897.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409898/ERR409898.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409899/ERR409899.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409900/ERR409900.fastq.gz |
Using TextEdit or NotePad, add the following lines to the top of the script (copy and paste them above the first wget line):
Code Block |
---|
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR |
You should have this:
Code Block |
---|
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409878/ERR409878.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409879/ERR409879.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409880/ERR409880.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409881/ERR409881.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409882/ERR409882.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409883/ERR409883.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409884/ERR409884.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409885/ERR409885.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409886/ERR409886.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409887/ERR409887.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409888/ERR409888.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409889/ERR409889.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409890/ERR409890.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409891/ERR409891.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409892/ERR409892.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409893/ERR409893.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409894/ERR409894.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409895/ERR409895.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409896/ERR409896.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409897/ERR409897.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409898/ERR409898.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409899/ERR409899.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409900/ERR409900.fastq.gz |
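The same edit can be made on the command line instead of in a text editor. The sketch below is an illustration under stated assumptions: `add_pbs_header` is a hypothetical helper, and the demo uses a two-line stand-in for the script downloaded from ENA.

```shell
# Hypothetical helper: write <output> = PBS header + contents of <input>.
# Usage: add_pbs_header <input.sh> <output.pbs>
add_pbs_header() {
    {
        # The quoted 'EOF' keeps $PBS_O_WORKDIR literal, so it is only
        # expanded when the job runs on the cluster.
        cat <<'EOF'
#!/bin/bash -l
#PBS -N ENA_data_download
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
EOF
        cat "$1"
    } > "$2"
}

# Demo: a two-line stand-in for the downloaded ENA script.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409878/ERR409878.fastq.gz' \
    'wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/ERR409879/ERR409879.fastq.gz' \
    > "$demo_dir/demo_download.sh"

add_pbs_header "$demo_dir/demo_download.sh" "$demo_dir/demo_download.pbs"
cat "$demo_dir/demo_download.pbs"
```

Writing to a new file, rather than editing the download script in place, leaves the original available if the header needs to be changed later.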
STEP 6: Save the file and now let’s transfer it to the HPC. See below:
NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.
Windows PC: open File Explorer and type the address below to connect to your home directory on the HPC, then browse to the /workshop/2024-2/session6_smallRNAseq/data/mydata folder
Code Block |
---|
\\hpc-fs\home\ |
Mac: open Finder and press “command” + “k” to open the connection prompt, type the address below, and then browse to the /workshop/2024-2/session6_smallRNAseq/data/mydata folder
Code Block |
---|
smb://hpc-fs/home/ |
Drag and drop the script into the /workshop/2024-2/session6_smallRNAseq/data/mydata folder
...
deprecated
Code Block |
---|
#!/bin/bash -l
#PBS -N rna
#PBS -l select=1:ncpus=1:mem=8gb
#PBS -l walltime=24:00:00
#Enable the container modules
source /pkg/shpc/enable
#Load the SRA-TOOLS module
module load sra-tools/3.0.5--h9f5acd7_1
#work on current directory (folder)
cd $PBS_O_WORKDIR
for i in $(cat SRR_Acc_List.txt);
do
echo $i
prefetch.3 $i
fasterq-dump.3 --split-files $i
done
gzip *fastq |
Submit the PBS script to the HPC cluster:
Code Block |
---|
qsub launch_fetch_SRA.pbs |
Monitor job progress:
Code Block |
---|
qjobs |
STEP 7: We will ensure the copied file from our laptop / desktop does not have unwanted characters. Let’s move to the data folder:
Code Block |
---|
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/mydata |
How to use the dos2unix tool? Type:
Code Block |
---|
dos2unix --help |
Now let’s run dos2unix conversion. Note the filename may vary, so adjust the filename as appropriate.
Code Block |
---|
dos2unix -n ena-file-download-selected-files-20241013-1123.sh ena-file-download-selected-files-20241013-1123.pbs |
Note: Files created or edited on a Windows PC (e.g., with Microsoft Excel or NotePad) often contain Windows-style line endings (carriage-return characters) that break scripts on the HPC. Use dos2unix to remove such characters.
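To see whether a file actually needs converting, one option is to look for carriage-return characters, which mark Windows-style line endings. A minimal sketch, assuming a hypothetical helper name `has_crlf` and throwaway demo files:

```shell
# Hypothetical helper: succeed (exit status 0) if the file contains
# carriage returns, i.e. Windows-style CRLF line endings.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

demo_dir=$(mktemp -d)
printf 'echo one\r\necho two\r\n' > "$demo_dir/windows_style.sh"   # CRLF endings
printf 'echo one\necho two\n'     > "$demo_dir/unix_style.sh"      # LF endings

for f in "$demo_dir/windows_style.sh" "$demo_dir/unix_style.sh"; do
    if has_crlf "$f"; then
        echo "$(basename "$f"): carriage returns found, run dos2unix"
    else
        echo "$(basename "$f"): no conversion needed"
    fi
done
```

Running dos2unix on a file that is already in Unix format is harmless, so the check is optional; it is mainly useful for diagnosing a script that fails with unexpected "command not found" errors.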
Now we are ready to submit the FASTQ download script to the HPC cluster:
Code Block |
---|
qsub ena-file-download-selected-files-20241013-1123.pbs |
Monitor progress of job:
Code Block |
---|
qjobs |
Note: Downloading the above datasets will take about 50 minutes.
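After the job completes, it is worth confirming that every expected FASTQ file arrived and is a valid gzip archive. The following is a sketch under stated assumptions: `check_fastq` is a hypothetical helper, and the demo creates small placeholder files instead of using the real downloads.

```shell
# Hypothetical helper: report whether <dir>/<accession>.fastq.gz exists
# and passes gzip's integrity test. Prints "ok", "missing" or "corrupt".
check_fastq() {
    file="$1/$2.fastq.gz"
    if [ ! -s "$file" ]; then
        echo "missing"
    elif gzip -t "$file" 2>/dev/null; then
        echo "ok"
    else
        echo "corrupt"
    fi
}

# Demo with placeholder data: one valid file, one invalid file, one absent.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$demo_dir/ERR409878.fastq.gz"
printf 'not a gzip file' > "$demo_dir/ERR409879.fastq.gz"

for acc in ERR409878 ERR409879 ERR409880; do
    echo "$acc: $(check_fastq "$demo_dir" "$acc")"
done
```

On the real data, the loop could read the accessions from a list file instead, for example `while read -r acc; do echo "$acc: $(check_fastq . "$acc")"; done < SRA_Acc_List.txt`. Because the download script uses `wget -nc`, files that already exist are skipped on resubmission, so a corrupt file should be deleted before the job is run again.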
See the link below for alternative approaches to downloading data from SRA or BaseSpace, or to using the nf-core/fetchngs pipeline: