Fetching Public RNA-seq data

Manuscript: Crotta et al. (2023). Nature Communications. http://doi.org/10.1038/s41467-023-36352-z

Find where the data is available for download in the above manuscript.

Hints:

  • Search for “Data availability”, or for a reference to a “BioProject ID” or “GEO accession code”

  • Find the BioProject ID and search for this ID at https://www.ebi.ac.uk/ena/browser/home

  • Select the FASTQ files of interest and click on “Get download script”

  • Copy the downloaded file to your HPC account, or copy its content into a file created on the HPC using nano (or another text editor)

  • Add the PBS Pro scheduler lines and submit a job. See step-by-step details at:

eResearch Downloading public data

Download data using the nf-core/fetchngs pipeline

Source: https://nf-co.re/fetchngs/1.12.0/

 


 

As an alternative to the above approach, we can use the Nextflow nf-core/fetchngs pipeline to download the data.

To run this pipeline we need two inputs: 1) a list of SRA identifiers and 2) a PBS Pro script that runs the pipeline and fetches the data using sratools.

First, prepare a file with the list of SRA IDs of interest to be downloaded:

Hint:

  • In the terminal, create a new folder called ‘fetchngs’ and move into it. For example:

mkdir -p $HOME/workshop/2024/rnaseq/data/fetchngs
# then move to the newly created folder
cd $HOME/workshop/2024/rnaseq/data/fetchngs
  • Copy the following list of IDs. Hint: click on the top right corner of the block below to copy the text.

SRR20622172
SRR20622173
SRR20622177
SRR20622176
SRR20622180
SRR20622174
SRR20622178
SRR20622179
SRR20622175

Alternatively, instead of a list of SRR identifiers, it is possible to download all the data in a given BioProject by its ID:

PRJNA862097

NOTE: Listing the SRR identifiers above or giving the BioProject ID in the ‘ids.csv’ file will download exactly the same data.

  • Create an ‘ids.csv’ file using nano and paste the list of IDs:

nano ids.csv
  • Next, use the PBS script given in the following step to download the files specified in ‘ids.csv’.

  • NOTE: instead of listing individual SRR identifiers, it is also possible to list the BioProject ID (e.g., PRJNA862107), which will fetch all of its SRR samples automatically.
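
The ‘ids.csv’ file described above can also be written without opening an editor. The sketch below assumes the nine run accessions listed earlier and the current folder as the destination. It writes the file with a here-document, then counts the lines that do not look like an SRR or PRJNA accession. The pattern check is an illustrative addition and is not part of the pipeline.

```shell
# Write the run accessions to ids.csv, one per line.
cat > ids.csv <<'EOF'
SRR20622172
SRR20622173
SRR20622177
SRR20622176
SRR20622180
SRR20622174
SRR20622178
SRR20622179
SRR20622175
EOF

# Count the lines that do NOT look like a run (SRR...) or BioProject (PRJNA...) accession.
# grep exits non-zero when that count is 0, so '|| true' keeps the script going.
malformed=$(grep -Evc '^(SRR|PRJNA)[0-9]+$' ids.csv || true)
total=$(wc -l < ids.csv | tr -d ' ')
echo "ids.csv: ${total} lines, ${malformed} malformed"
```

A non-zero malformed count usually points to a stray blank line or a character pasted by mistake.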

Second, create a PBS launch script to download the data for the above IDs:

  • Copy the block of code below. Hint: click on the top right corner of the block below to copy the text.

#!/bin/bash -l
#PBS -N nf_fetchngs
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=48:00:00
# run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
# load java and set the JVM memory limits for nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
# run the nf-core/fetchngs pipeline (the samplesheet is formatted for nf-core/rnaseq)
nextflow run nf-core/fetchngs \
   -profile singularity \
   --input ids.csv \
   --outdir data \
   --download_method sratools \
   --nf_core_pipeline rnaseq \
   -resume
  • Use nano to create a launch script, for example:

nano launch_nf_core_fetchngs.pbs
  • Paste the block of code above and save the file.
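
Before submitting, it can help to confirm that the commands the launch script calls are visible in your session. The helper below is a minimal sketch: `check_tools` is a hypothetical name, and the list of tools (java, nextflow, singularity) is inferred from the launch script above. On some clusters these commands only appear after a `module load`, so a "not found" here is a prompt to check, not necessarily an error.

```shell
# check_tools NAME...: report whether each command is on PATH; return 1 if any is missing.
check_tools() {
  status=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "not found: $tool"
      status=1
    fi
  done
  return "$status"
}

# Assumption: these are the three commands the launch script depends on.
check_tools java nextflow singularity || echo "load the missing tools before submitting"
```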

Submit the download job to the HPC cluster:

qsub launch_nf_core_fetchngs.pbs
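
Once the job is submitted, its progress can be followed from the same folder. The sketch below assumes the PBS Pro client command `qstat` is on your PATH, and relies on Nextflow writing its log to `.nextflow.log` in the directory it was launched from. `nf_log_tail` is a hypothetical helper name.

```shell
# List your own jobs; qstat is only available where the PBS Pro client tools are installed.
if command -v qstat >/dev/null 2>&1; then
  qstat -u "${USER:-$(id -un)}" || echo "qstat could not reach the scheduler"
else
  echo "qstat not found; run this on the HPC"
fi

# nf_log_tail [DIR] [N]: print the last N lines (default 20) of DIR/.nextflow.log.
nf_log_tail() {
  log="${1:-.}/.nextflow.log"
  if [ -f "$log" ]; then
    tail -n "${2:-20}" "$log"
  else
    echo "no log at $log yet"
  fi
}

# Look at the log in the current folder (it appears once the job has started).
nf_log_tail .
```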

Outputs:

data
├── custom
│   └── user-settings.mkfg
├── fastq
│   ├── SRX16645917_SRR20622180.fastq.gz
│   ├── SRX16645918_SRR20622179.fastq.gz
│   ├── SRX16645919_SRR20622178.fastq.gz
│   ├── SRX16645920_SRR20622177.fastq.gz
│   ├── SRX16645921_SRR20622175.fastq.gz
│   ├── SRX16645922_SRR20622174.fastq.gz
│   ├── SRX16645923_SRR20622173.fastq.gz
│   ├── SRX16645924_SRR20622176.fastq.gz
│   └── SRX16645925_SRR20622172.fastq.gz
├── metadata
│   ├── SRR20622172.runinfo_ftp.tsv
│   ├── SRR20622173.runinfo_ftp.tsv
│   ├── SRR20622174.runinfo_ftp.tsv
│   ├── SRR20622175.runinfo_ftp.tsv
│   ├── SRR20622176.runinfo_ftp.tsv
│   ├── SRR20622177.runinfo_ftp.tsv
│   ├── SRR20622178.runinfo_ftp.tsv
│   ├── SRR20622179.runinfo_ftp.tsv
│   └── SRR20622180.runinfo_ftp.tsv
├── pipeline_info
│   ├── execution_report_2024-08-29_14-23-00.html
│   ├── execution_timeline_2024-08-29_14-23-00.html
│   ├── execution_trace_2024-08-29_14-23-00.txt
│   ├── nf_core_fetchngs_software_mqc_versions.yml
│   ├── params_2024-08-29_14-23-05.json
│   └── pipeline_dag_2024-08-29_14-23-00.html
└── samplesheet
    ├── id_mappings.csv
    ├── multiqc_config.yml
    └── samplesheet.csv
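
After the job finishes, a quick check is to confirm that every run listed in ‘ids.csv’ has a FASTQ file in the output. The function below is a minimal sketch under two assumptions: the files are named `<experiment>_<run>.fastq.gz` as in the listing above (or `<experiment>_<run>_1.fastq.gz` and `_2` for paired-end data), and ‘ids.csv’ contains run accessions, not a BioProject ID. `report_missing` is a hypothetical name.

```shell
# report_missing IDS_FILE FASTQ_DIR
# Print every run accession in IDS_FILE that has no FASTQ file in FASTQ_DIR.
# Returns 0 when nothing is missing, 1 otherwise.
report_missing() {
  ids_file=$1
  fastq_dir=$2
  missing=0
  # '|| [ -n "$id" ]' also handles a last line without a trailing newline.
  while IFS= read -r id || [ -n "$id" ]; do
    [ -z "$id" ] && continue
    # The run ID must be followed by '.' or '_' so that one ID cannot match a longer one.
    if ! ls "$fastq_dir"/*_"$id"[._]*fastq.gz >/dev/null 2>&1; then
      echo "missing: $id"
      missing=1
    fi
  done < "$ids_file"
  return "$missing"
}
```

From the ‘fetchngs’ folder it would be called as `report_missing ids.csv data/fastq`. If ‘ids.csv’ holds a BioProject ID instead of run accessions, the check does not apply, because a BioProject ID never appears in a file name.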
