eResearch nf-core-RNAseq pipeline

Aims

Run the Nextflow nf-core/rnaseq pipeline on the HPC cluster. Exercises include:
- Running a test to verify the execution of the pipeline
- Running a QC check to determine read trimming parameters
- Running the full nf-core/rnaseq pipeline

Work in the HPC

Before we start using the HPC, let’s start an interactive session:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb

Get a copy of the scripts to be used in this module

Use the terminal to log into the HPC, create a workshop/scripts/ folder in your home directory, and copy the training scripts into it. For example:

mkdir -p $HOME/workshop/scripts
cp /work/training/rnaseq/scripts/* $HOME/workshop/scripts/
ls -l $HOME/workshop/scripts/

Line 1: The -p option creates parent directories as required. Thus the line 1 command creates both /workshop/ and the subfolder /workshop/scripts/
Line 2: Copies all files (as indicated by the asterisk) from /work/training/rnaseq/scripts/ to the newly created folder $HOME/workshop/scripts/
Line 3: Lists the files now present in $HOME/workshop/scripts/
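The behaviour of mkdir -p and of the asterisk in cp can be tried safely outside your home directory. The following is a minimal sketch that works in a temporary scratch folder; the folder layout and the dummy file names (a.pbs, b.sh) are assumptions made up for illustration, not part of the workshop material.

```shell
# Scratch area so nothing in $HOME is touched (hypothetical layout).
demo=$(mktemp -d)

# Stand-in for the shared training folder, holding two dummy scripts.
mkdir -p "$demo/shared/scripts"
touch "$demo/shared/scripts/a.pbs" "$demo/shared/scripts/b.sh"

# -p creates workshop/ and workshop/scripts/ in one call and
# does not complain if the folders already exist.
mkdir -p "$demo/home/workshop/scripts"
mkdir -p "$demo/home/workshop/scripts"   # repeating the command is safe

# The asterisk expands to every non-hidden file in the source folder.
cp "$demo"/shared/scripts/* "$demo/home/workshop/scripts/"

# Count the files that arrived in the destination.
copied=$(find "$demo/home/workshop/scripts" -type f | wc -l | tr -d ' ')
echo "copied files: $copied"
```

Without -p, the first mkdir would fail because the parent folder workshop/ does not exist yet.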

Copy public data to your $HOME

mkdir -p $HOME/workshop/data
cp /work/training/rnaseq/data/* $HOME/workshop/data/
# list the content of the $HOME/workshop/data/

Line 1: Creates the folder $HOME/workshop/data/ (and any missing parent directories)
Line 2: Copies all files from /work/training/rnaseq/data/, as indicated by the asterisk, to the newly created $HOME/workshop/data/ folder
Line 3: A quick challenge: write the listing command yourself (see the previous section for hints)

Create a folder for running the nf-core/rnaseq pipeline

Let’s create an “RNAseq” folder to run the nf-core/rnaseq pipeline and move into it. For example:

mkdir -p $HOME/workshop/RNAseq
mkdir $HOME/workshop/RNAseq/run1_test
mkdir $HOME/workshop/RNAseq/run2_QC
mkdir $HOME/workshop/RNAseq/run3_RNAseq
cd $HOME/workshop/RNAseq/run1_test
pwd

Lines 1-4: create the RNAseq folder and one sub-folder for each exercise
Line 5: change the directory to the folder “run1_test”
Line 6: print the current working directory
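The folder creation steps above can also be written as a loop, which is handy when more runs are added later. This is only a sketch: the variable names base and run are illustrative, and base points at a temporary folder here so the example can be tried without touching $HOME (in the workshop it would be $HOME/workshop/RNAseq).

```shell
# Hypothetical base folder; in the workshop this would be $HOME/workshop/RNAseq.
base="$(mktemp -d)/workshop/RNAseq"

# Create one sub-folder per exercise; -p also creates the missing parents.
for run in run1_test run2_QC run3_RNAseq; do
    mkdir -p "$base/$run"
done

# Move into the folder for the first exercise and confirm the location.
cd "$base/run1_test"
pwd
```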

Exercise 1: Running a test with nf-core sample data

First, let’s assess the execution of the nf-core/rnaseq pipeline by running a test using sample data.

Copy the launch_nf-core_RNAseq_test.pbs to the working directory

cd $HOME/workshop/RNAseq/run1_test
cp $HOME/workshop/scripts/launch_nf-core_RNAseq_test.pbs .

View the content of the script as follows:

cat launch_nf-core_RNAseq_test.pbs

#!/bin/bash -l
#PBS -N nfrnaseq_test
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run nf-core/rnaseq -r 3.12.0 -profile test,singularity --outdir results

The nextflow command in the script is made up of the following parts:

nextflow command: nextflow run
pipeline name: nf-core/rnaseq
pipeline version: -r 3.12.0
container type and sample data: -profile test,singularity
output directory: --outdir results
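To see how these parts fit together, the command can be assembled from variables and printed before it is run. This is a sketch for inspection only: the variable names (pipeline, version, profile, outdir, cmd) are made up for illustration, and the snippet prints the command rather than launching Nextflow.

```shell
# Each variable holds one part of the command described above.
pipeline="nf-core/rnaseq"      # pipeline name
version="3.12.0"               # pipeline version, passed with -r
profile="test,singularity"     # sample data and container type
outdir="results"               # output directory

# Assemble the full command line and print it for review.
cmd="nextflow run $pipeline -r $version -profile $profile --outdir $outdir"
echo "$cmd"
```

Printing the command first makes it easy to check the version and profile before submitting the job to the cluster.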

Submitting the job

Submit the test job to the HPC cluster as follows:

qsub launch_nf-core_RNAseq_test.pbs

Monitoring the Run

qjobs
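If qjobs is not available (it is assumed here to be a helper specific to this cluster), the standard PBS command qstat reports similar information. The sketch below also prints the tail of the .nextflow.log file that Nextflow writes in the launch directory; the variable scheduler_found is only for illustration.

```shell
# List your own jobs with the standard PBS command, if it is installed.
if command -v qstat >/dev/null 2>&1; then
    qstat -u "${USER:-$(id -un)}"
    scheduler_found=yes
else
    echo "qstat not found: run this on the HPC"
    scheduler_found=no
fi

# Nextflow writes its progress to .nextflow.log in the launch directory.
if [ -f .nextflow.log ]; then
    tail -n 20 .nextflow.log
fi
```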

Exercise 2: Run RNA-seq QC check

The pipeline requires preparing at least 2 files:

- Metadata file (samplesheet.csv) that specifies the sample names, the location of the FASTQ files (‘Read 1’ and ‘Read 2’), and the strandedness (forward, reverse, or auto). Note: auto is used when the strandedness of the data is unknown.
- PBS Pro script (launch_nf-core_RNAseq_QC.pbs) with instructions to run the pipeline
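To make the metadata format concrete, the sketch below writes a small samplesheet by hand and checks that the header has the four expected columns. The sample names and the /path/to/... locations are placeholders invented for illustration; for single-end data the fastq_2 column would be left empty.

```shell
# Write a hand-made samplesheet to a temporary file (placeholder samples).
sheet=$(mktemp)
cat > "$sheet" <<'EOF'
sample,fastq_1,fastq_2,strandedness
sampleA_rep1,/path/to/sampleA_rep1_R1.fastq.gz,/path/to/sampleA_rep1_R2.fastq.gz,auto
sampleA_rep2,/path/to/sampleA_rep2_R1.fastq.gz,/path/to/sampleA_rep2_R2.fastq.gz,auto
EOF

# Read the header row and count its comma-separated columns.
header=$(head -n 1 "$sheet")
ncols=$(printf '%s\n' "$header" | awk -F',' '{print NF}')
echo "header: $header"
echo "columns: $ncols"
```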

Create the metadata file (samplesheet.csv):

Change to the data folder directory:

cd $HOME/workshop/data/
pwd

Copy the bash script to the working folder

cp /work/training/rnaseq/scripts/create_samplesheet_nf-core_RNAseq.sh $HOME/workshop/data/

Note: you could replace ‘$HOME/workshop/data/’ with ‘.’ (a dot). A dot indicates the ‘current directory’, so the file is copied to the directory where you are currently located.

View the content of the script:

cat create_samplesheet_nf-core_RNAseq.sh

Example for Single-End data (only ‘Read 1’ is available):

#!/bin/bash -l

#User defined variables
##########################################################
#double quotes so that the shell expands $HOME
DIR="$HOME/workshop/data/"
INDEX='samplesheet.csv'
##########################################################

#load python module
module load python/3.10.8-gcccore-12.2.0

#fetch the script to create the sample metadata table
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py

#generate initial sample metadata file
./fastq_dir_to_samplesheet.py $DIR $INDEX \
    --strandedness auto \
    --read1_extension .fastq.gz
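A note on the DIR variable: the shell expands $HOME inside double quotes but leaves it untouched inside single quotes, so the choice of quotes decides whether the Python script receives a real path or the literal text $HOME. A minimal sketch (the variable names single and double are only for illustration):

```shell
# Single quotes keep the text literally, including the dollar sign.
single='$HOME/workshop/data/'

# Double quotes let the shell substitute the value of HOME.
double="$HOME/workshop/data/"

echo "single quotes: $single"
echo "double quotes: $double"
```

If the sample sheet comes out empty, checking how DIR is quoted is a reasonable first step.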

Let’s generate the metadata file by running the following command:

sh create_samplesheet_nf-core_RNAseq.sh

Check the newly created samplesheet.csv file:

ls -l
cat samplesheet.csv

sample,fastq_1,fastq_2,strandedness
CD49fmNGFRm_rep1,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622174_1.fastq.gz,,auto
CD49fmNGFRm_rep2,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622175_1.fastq.gz,,auto
CD49fmNGFRm_rep3,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622177_1.fastq.gz,,auto
CD49fpNGFRp_rep1,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622178_1.fastq.gz,,auto
CD49fpNGFRp_rep2,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622179_1.fastq.gz,,auto
CD49fpNGFRp_rep3,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622180_1.fastq.gz,,auto
MTEC_rep1,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622172_1.fastq.gz,,auto
MTEC_rep2,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622173_1.fastq.gz,,auto
MTEC_rep3,/work/eresearch_bio/training/data/rnaseq_data/mouse_PRJNA862107/SRR20622176_1.fastq.gz,,auto
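Before launching the pipeline it can help to confirm that every FASTQ file listed in the sample sheet exists. The sketch below shows one way to do that; it builds a small throwaway sample sheet in a temporary folder so it can be run anywhere, and the sample names s1 and s2 and their file names are invented for illustration. To check the real file, point the loop at $HOME/workshop/data/samplesheet.csv instead.

```shell
# Throwaway folder with one existing FASTQ file and a two-row sample sheet.
work=$(mktemp -d)
touch "$work/s1_1.fastq.gz"
cat > "$work/samplesheet.csv" <<EOF
sample,fastq_1,fastq_2,strandedness
s1,$work/s1_1.fastq.gz,,auto
s2,$work/s2_1.fastq.gz,,auto
EOF

# Number of samples = number of rows after the header.
nsamples=$(tail -n +2 "$work/samplesheet.csv" | wc -l | tr -d ' ')

# Read the sheet row by row and report every fastq_1 path that is missing.
missing=0
{
    read -r header    # skip the header row
    while IFS=',' read -r sample fastq_1 fastq_2 strandedness; do
        if [ ! -f "$fastq_1" ]; then
            echo "missing FASTQ for $sample: $fastq_1"
            missing=$((missing + 1))
        fi
    done
} < "$work/samplesheet.csv"

echo "samples: $nsamples, missing files: $missing"
```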