Content Comparison

...

The nf-core/rnaseq workflow requires Nextflow to be installed in your account on the HPC. Find details on how to install and test Nextflow here. Prepare a nextflow.config file and a PBS Pro submission script for running Nextflow pipelines.

...

GitHub: https://github.com/nf-core/rnaseq

Pipeline Summary

...

The pipeline is built using Nextflow and processes data using the following steps:

Preprocessing
- cat - Merge re-sequenced FastQ files
- FastQC - Raw read QC
- UMI-tools extract - UMI barcode extraction
- TrimGalore - Adapter and quality trimming
- BBSplit - Removal of genome contaminants
- SortMeRNA - Removal of ribosomal RNA (optional)
Alignment and quantification
- STAR and Salmon - Fast splice-aware genome alignment and transcriptome quantification
- STAR via RSEM - Alignment and quantification of expression levels
- HISAT2 - Memory-efficient splice-aware alignment to a reference
Alignment post-processing
- SAMtools - Sort and index alignments
- UMI-tools dedup - UMI-based deduplication
- picard MarkDuplicates - Duplicate read marking
Other steps
- StringTie - Transcript assembly and quantification
- BEDTools and bedGraphToBigWig - Create bigWig coverage files
Quality control
- RSeQC - Various RNA-seq QC metrics
- Qualimap - Various RNA-seq QC metrics
- dupRadar - Assessment of technical / biological read duplication
- Preseq - Estimation of library complexity
- featureCounts - Read counting relative to gene biotype
- DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
- MultiQC - Present QC for raw reads, alignment, read counting, and sample similarity
Pseudo-alignment and quantification
- Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
Workflow reporting and genomes
- Reference genome files - Saving reference genome indices/files
- Pipeline information - Report metrics generated during the workflow execution

...

Code Block
nextflow run nf-core/rnaseq -profile test,singularity --outdir results -r 3.10.1

Running the pipeline using custom data

Example of a typical command to run an RNA-seq analysis for human samples:

Code Block

nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.10.1 \
        --genome GRCh38 \
        -profile singularity \
        --aligner star_rsem \
        --clip_r1 10 \
        --clip_r2 10 \
        --three_prime_clip_r1 2 \
        --three_prime_clip_r2 2

Note: if a run was interrupted, a step did not complete, or you want to change a parameter for a particular step, you do not need to re-run every process. Add the -resume option and Nextflow will reuse the cached results of the steps that are unaffected.

Code Block

nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.10.1 \
        --genome GRCh38 \
        -profile singularity \
        --aligner star_rsem \
        --clip_r1 10 \
        --clip_r2 10 \
        --three_prime_clip_r1 2 \
        --three_prime_clip_r2 2 \
        -resume
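
The -resume option only helps when the pipeline is relaunched from the directory that holds the earlier run's cache. The sketch below is one way to check for that before relaunching. The function name resume_flag is hypothetical, and it assumes that Nextflow left its usual ".nextflow" and "work" directories in the launch directory.

```shell
#!/bin/bash
# Hypothetical helper: print "-resume" only when a previous run appears to
# exist in the given launch directory (default: current directory).
# Assumption: Nextflow left a ".nextflow" cache directory and a "work"
# directory there during the earlier run.
resume_flag() {
    local launch_dir="${1:-.}"
    if [ -d "${launch_dir}/.nextflow" ] && [ -d "${launch_dir}/work" ]; then
        echo "-resume"
    fi
}

# Usage (not executed here):
#   nextflow run nf-core/rnaseq --input samplesheet.csv --outdir results \
#       -r 3.10.1 --genome GRCh38 -profile singularity $(resume_flag)
```

If the function prints nothing, the command simply starts a fresh run.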

Preparing a ‘samplesheet.csv’ file

Prepare a sample sheet file that specifies the input files to be used. To do this, we use an nf-core script to generate the ‘samplesheet.csv’ file as follows (setting strandedness to auto allows the pipeline to determine the strandedness of your RNA-seq data automatically):

Code Block

#load python 3.10
module load python/3.10.8-gcccore-12.2.0

#download script and make executable
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py

#generate the samplesheet.csv file
./fastq_dir_to_samplesheet.py /path/to/directory/containing/fastq_files/ samplesheet.csv \
    --strandedness auto \
    --read1_extension _R1.fastq.gz \
    --read2_extension _R2.fastq.gz
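
Before generating the sample sheet for paired-end data, it can help to confirm that every read-1 file has a matching read-2 file. The following is a minimal sketch, not part of nf-core; the function name missing_mates is hypothetical and the default suffixes simply mirror the extensions used in the command above.

```shell
#!/bin/bash
# Hypothetical helper: list every read-1 file in a directory that lacks a
# matching read-2 file. The default suffixes mirror the --read1_extension
# and --read2_extension values used above.
missing_mates() {
    local fastq_dir="$1"
    local r1_ext="${2:-_R1.fastq.gz}"
    local r2_ext="${3:-_R2.fastq.gz}"
    local r1 r2
    for r1 in "${fastq_dir}"/*"${r1_ext}"; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        r2="${r1%"${r1_ext}"}${r2_ext}"   # swap the read-1 suffix for read-2
        [ -e "$r2" ] || echo "$r1"
    done
}

# Usage (not executed here):
#   missing_mates /path/to/directory/containing/fastq_files/
```

No output means every read-1 file found has a partner.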

Example samplesheet.csv (version 3.10.1):

Code Block

sample,fastq_1,fastq_2,strandedness
control_1,/path/to/directory/containing/fastq_files/control-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-1_R2.fastq.gz,auto
control_2,/path/to/directory/containing/fastq_files/control-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-2_R2.fastq.gz,auto
control_3,/path/to/directory/containing/fastq_files/control-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-3_R2.fastq.gz,auto
infected_1,/path/to/directory/containing/fastq_files/infected-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-1_R2.fastq.gz,auto
infected_2,/path/to/directory/containing/fastq_files/infected-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-2_R2.fastq.gz,auto
infected_3,/path/to/directory/containing/fastq_files/infected-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-3_R2.fastq.gz,auto
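
A quick sanity check of the sample sheet can catch copy-and-paste mistakes before a long run is submitted. The sketch below is not an nf-core tool; the function name check_samplesheet is hypothetical. It assumes the four-column layout shown above, no quoted commas, and a strandedness value of auto, unstranded, forward or reverse.

```shell
#!/bin/bash
# Hypothetical helper: report rows of a samplesheet that do not match the
# layout shown above. Returns non-zero if any problem is found.
# Assumptions: four comma-separated columns, no quoted commas, and a
# strandedness value of auto, unstranded, forward or reverse.
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2,strandedness") {
                print "line 1: unexpected header"; bad = 1
            }
            next
        }
        NF != 4 { print "line " NR ": expected 4 columns, found " NF; bad = 1; next }
        $4 !~ /^(auto|unstranded|forward|reverse)$/ {
            print "line " NR ": unknown strandedness " $4; bad = 1
        }
        seen[$2]++ { print "line " NR ": fastq_1 already used on an earlier row"; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Usage (not executed here):
#   check_samplesheet samplesheet.csv && echo "samplesheet looks consistent"
```

The repeated-file check is there because the same read-1 path appearing on two rows usually means a row was copied and only partly edited.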

Preparing to run on the HPC

...

Code Block

#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

#run the rnaseq pipeline
nextflow run nf-core/rnaseq \
      -profile singularity \
      -r 3.10.1 \
      --input samplesheet.csv \
      --genome GRCh38 \
      --outdir results \
      --aligner star_salmon

We recommend running the nf-core/rnaseq pipeline once and then inspecting the FastQC results folder to check whether sequence biases are present at the 5' and 3' ends of the reads. If they are, use the PBS script below to tell the pipeline to remove a defined number of bases from the 5' end (--clip_r1 or --clip_r2) or the 3' end (--three_prime_clip_r1 or --three_prime_clip_r2). You can also ask the pipeline to remove ribosomal RNA by adding --remove_ribo_rna, as these sequences are non-informative.

Code Block

#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

#run the rnaseq pipeline
nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.10.1 \
        --genome GRCh38 \
        -profile singularity \
        --aligner star_salmon \
        --clip_r1 10 \
        --clip_r2 10 \
        --three_prime_clip_r1 2 \
        --three_prime_clip_r2 2
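
To review the FastQC reports from the first run before choosing clip values, the reports first need to be located. The sketch below is one way to list them. The function name list_fastqc_reports is hypothetical, and it assumes the reports sit somewhere under the --outdir folder and keep FastQC's usual "_fastqc.html" suffix.

```shell
#!/bin/bash
# Hypothetical helper: list FastQC HTML reports under a results folder.
# Assumptions: the folder is the one passed to --outdir (default "results")
# and report files keep FastQC's usual "_fastqc.html" suffix.
list_fastqc_reports() {
    local outdir="${1:-results}"
    find "$outdir" -type f -name '*_fastqc.html' | sort
}

# Usage (not executed here):
#   list_fastqc_reports results
```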

...

Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.
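
A short pre-submission check can confirm that the three files named above are in the run folder. This is a minimal sketch; the function name missing_run_files is hypothetical, and the file names are taken from the sentence above.

```shell
#!/bin/bash
# Hypothetical helper: print the name of each expected file that is missing
# from the run folder (default: current directory).
missing_run_files() {
    local run_dir="${1:-.}"
    local f
    for f in samplesheet.csv nextflow.config launch.pbs; do
        [ -f "${run_dir}/${f}" ] || echo "$f"
    done
}

# Usage (not executed here; qsub is the PBS job submission command):
#   [ -z "$(missing_run_files .)" ] && qsub launch.pbs
```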

...

Code Block

#delete the existing assets associated with the RNA-seq pipeline:
cd ~/.nextflow/assets/nf-core
rm -r rnaseq/

#run a test again with the version you are testing, for example version 3.10.1. See 'Getting Started' above for details on how to run a test

Add output folders/files

sample data


Versions Compared

	Old Version	New Version
Version	32	Current
Changes made by	Roberto Barrero Gumiel	Magdalena Antczak
Saved on	Apr 17, 2023	Jul 19, 2023
