Prepared by the eResearch Office, QUT.
This page provides a guide to QUT users on how to install and run the nextflow nf-core/rnaseq workflow on the HPC.
Pre-requisites
...
Basic Unix command line knowledge (example: https://researchcomputing.princeton.edu/education/external-online-resources/linux ; https://swcarpentry.github.io/shell-novice/ )
Familiarity with a Unix text editor (for example, Vi/Vim or Nano)
Have an HPC account on QUT’s HPC compute. Apply for a new HPC account here.
R tutorials:
...
The nf-core/rnaseq workflow requires Nextflow to be installed in your account on the HPC. Find details on how to install and test Nextflow here. Prepare a nextflow.config file and run a PBS pro submission script for Nextflow pipelines.
...
Additional details on the workflow can be found at:
Overview: https://nf-co.re/rnaseq/3.10.1
Usage: https://nf-co.re/rnaseq/3.10.1/usage
GitHub: https://github.com/nf-core/rnaseq
Pipeline Summary
...
The pipeline is built using Nextflow and processes data using the following steps:
cat - Merge re-sequenced FastQ files
FastQC - Raw read QC
UMI-tools extract - UMI barcode extraction
TrimGalore - Adapter and quality trimming
BBSplit - Removal of genome contaminants
SortMeRNA - Removal of ribosomal RNA (optional)
STAR and Salmon - Fast splice-aware genome alignment and transcriptome quantification
STAR via RSEM - Alignment and quantification of expression levels
HISAT2 - Memory-efficient splice-aware alignment to a reference
UMI-tools dedup - UMI-based deduplication
picard MarkDuplicates - Duplicate read marking
StringTie - Transcript assembly and quantification
BEDTools and bedGraphToBigWig - Create bigWig coverage files
RSeQC - Various RNA-seq QC metrics
Qualimap - Various RNA-seq QC metrics
dupRadar - Assessment of technical / biological read duplication
Preseq - Estimation of library complexity
featureCounts - Read counting relative to gene biotype
DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
MultiQC - Present QC for raw reads, alignment, read counting, and sample similarity
Pseudo-alignment and quantification
Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
Workflow reporting and genomes
Reference genome files - Saving reference genome indices/files
Pipeline information - Report metrics generated during the workflow execution
...
Code Block |
---|
nextflow run nf-core/rnaseq -profile test,singularity --outdir results -r 3.10.1 |
Running the pipeline using custom data
Example of a typical command to run an RNA-seq analysis for human samples (reference genome GRCh38):
Code Block |
---|
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCh38 \
-profile singularity \
--aligner star_rsem \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 2 \
--three_prime_clip_r2 2 |
Note: if a run was interrupted, a step did not complete, or you want to change a parameter for a particular step, you do not need to re-run every process. Add the “-resume” option and Nextflow will reuse the cached results of the tasks that already completed successfully.
Code Block |
---|
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCh38 \
-profile singularity \
--aligner star_rsem \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 2 \
--three_prime_clip_r2 2 \
-resume |
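The “-resume” option only helps when Nextflow can find the cache from the earlier run, which it keeps in the folder the pipeline was launched from. The sketch below is a hypothetical helper (the function name can_resume is not part of Nextflow) that checks for the .nextflow/cache folder and the work folder. It assumes the default work directory name, "work"; if you changed it with -w or NXF_WORK, adjust the check accordingly.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nextflow): check whether a folder looks
# like it holds the cache of a previous run, so that -resume has something
# to reuse. Assumes the default work directory name "work".
can_resume() {
    local launch_dir="${1:-.}"

    # Nextflow records task metadata under .nextflow/cache and stores
    # intermediate task outputs under work/ in the launch directory.
    if [ -d "$launch_dir/.nextflow/cache" ] && [ -d "$launch_dir/work" ]; then
        echo "Found a Nextflow cache in $launch_dir: -resume can reuse completed tasks."
        return 0
    fi

    echo "No Nextflow cache found in $launch_dir: the pipeline would start from scratch." >&2
    return 1
}
```

For example, run `can_resume .` in the folder where you originally launched the pipeline before resubmitting with “-resume”.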
Preparing a ‘samplesheet.csv’ file
Prepare a sample sheet file that specifies the input files to be used. To do this, we use an nf-core script to generate the ‘samplesheet.csv’ file as follows (setting strandedness to auto allows the pipeline to determine the strandedness of your RNA-seq data automatically):
Code Block |
---|
#load python 3.10
module load python/3.10.8-gcccore-12.2.0
#download script and make executable
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py
#generate the samplesheet.csv file
./fastq_dir_to_samplesheet.py /path/to/directory/containing/fastq_files/ samplesheet.csv \
--strandedness auto \
--read1_extension _R1.fastq.gz \
--read2_extension _R2.fastq.gz |
Example samplesheet.csv (version 3.10.1):
Code Block |
---|
sample,fastq_1,fastq_2,strandedness
control_1,/path/to/directory/containing/fastq_files/control-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-1_R2.fastq.gz,auto
control_2,/path/to/directory/containing/fastq_files/control-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-2_R2.fastq.gz,auto
control_3,/path/to/directory/containing/fastq_files/control-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-3_R2.fastq.gz,auto
infected_1,/path/to/directory/containing/fastq_files/infected-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-1_R2.fastq.gz,auto
infected_2,/path/to/directory/containing/fastq_files/infected-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-2_R2.fastq.gz,auto
infected_3,/path/to/directory/containing/fastq_files/infected-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-3_R2.fastq.gz,auto |
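A mistyped path or a mismatched R1/R2 pair in the sample sheet is easy to miss and only surfaces once the job is running. The sketch below is a hypothetical helper (the function name check_samplesheet is ours, not part of nf-core/rnaseq) that checks the header, confirms that every listed FASTQ file exists, and checks the strandedness value. It assumes the four-column layout shown above and paths without commas; the pipeline performs its own, more thorough validation at start-up.

```shell
#!/bin/bash
# Hypothetical helper (not part of nf-core/rnaseq): sanity-check a sample
# sheet before submitting the job. Assumes the four-column layout
# sample,fastq_1,fastq_2,strandedness and file paths without commas.
check_samplesheet() {
    local sheet="$1"
    local expected="sample,fastq_1,fastq_2,strandedness"
    local status=0
    local header sample fq1 fq2 strand fq

    if [ ! -r "$sheet" ]; then
        echo "ERROR: cannot read $sheet" >&2
        return 1
    fi

    {
        # The first line must be the expected header.
        IFS= read -r header
        if [ "$header" != "$expected" ]; then
            echo "ERROR: unexpected header: $header" >&2
            return 1
        fi

        # Check every remaining row (also handles a last line with no newline).
        while IFS=, read -r sample fq1 fq2 strand || [ -n "$sample" ]; do
            [ -z "$sample" ] && continue    # skip blank lines

            if [ -z "$fq1" ]; then
                echo "ERROR: $sample: fastq_1 is empty" >&2
                status=1
            fi

            # fastq_2 may be empty for single-end data, so only test
            # the paths that are actually filled in.
            for fq in "$fq1" "$fq2"; do
                if [ -n "$fq" ] && [ ! -f "$fq" ]; then
                    echo "ERROR: $sample: file not found: $fq" >&2
                    status=1
                fi
            done

            case "$strand" in
                auto|unstranded|forward|reverse) ;;
                *)
                    echo "ERROR: $sample: invalid strandedness: $strand" >&2
                    status=1
                    ;;
            esac
        done
    } < "$sheet"

    return "$status"
}
```

For example, `check_samplesheet samplesheet.csv && echo "samplesheet looks OK"` prints the confirmation only when every row passes.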
Preparing to run on the HPC
...
Code Block |
---|
#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
#load java and set up memory settings to run nextflow
module load java
#export the variable so that the nextflow process can see it
export NXF_OPTS='-Xms1g -Xmx4g'
#run the rnaseq pipeline
nextflow run nf-core/rnaseq \
-profile singularity \
-r 3.10.1 \
--input samplesheet.csv \
--genome GRCh38 \
--outdir results \
--aligner star_salmon |
We recommend running the nf-core/rnaseq pipeline once and then inspecting the FastQC results folder to check for sequence biases at the 5' and 3' ends of the reads. If biases are present, use the PBS script below to tell the pipeline to remove a defined number of bases from the 5' end (--clip_r1 or --clip_r2) or from the 3' end (--three_prime_clip_r1 or --three_prime_clip_r2). You can also ask the pipeline to remove ribosomal RNA, as these sequences are non-informative (the --remove_ribo_rna option, which is not included in the script below).
Code Block |
---|
#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
#load java and set up memory settings to run nextflow
module load java
#export the variable so that the nextflow process can see it
export NXF_OPTS='-Xms1g -Xmx4g'
#run the rnaseq pipeline
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCh38 \
-profile singularity \
--aligner star_salmon \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 2 \
--three_prime_clip_r2 2 |
Submitting the job
Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.
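Before submitting, it can be worth confirming that the three files are actually in the run folder. The sketch below is a hypothetical helper (the function name preflight is ours) that assumes the file names used on this page: samplesheet.csv, nextflow.config, and launch.pbs. It only reports what is missing and prints the submit command; it does not submit anything itself.

```shell
#!/bin/bash
# Hypothetical helper: confirm the run folder holds the files named on this
# page before submitting. It never calls qsub itself.
preflight() {
    local run_dir="${1:-.}"
    local missing=0
    local f

    for f in samplesheet.csv nextflow.config launch.pbs; do
        # -s is true only for files that exist and are not empty.
        if [ ! -s "$run_dir/$f" ]; then
            echo "MISSING or empty: $run_dir/$f" >&2
            missing=1
        fi
    done

    if [ "$missing" -eq 0 ]; then
        echo "All inputs present. Submit from $run_dir with: qsub launch.pbs"
    fi
    return "$missing"
}
```

A typical use from inside the run folder would be `preflight . && qsub launch.pbs`, followed by `qstat -u $USER` to monitor the job.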
...
Code Block |
---|
#delete the existing assets associated with the RNAseq pipeline:
cd ~/.nextflow/assets/nf-core
rm -r rnaseq/
#run a test again with the new version that you are testing, for example, version 3.10.1. See details on how to run a test above (under 'Getting Started') |
Add output folders/files
sample data