Prepared by the eResearch Office, QUT.
This page is a guide for QUT users on how to install and run the Nextflow nf-core/rnaseq workflow on the HPC.
Pre-requisites
conda3 or miniconda3 installed ( https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html )
Basic Unix command line knowledge (example: https://researchcomputing.princeton.edu/education/external-online-resources/linux ; https://swcarpentry.github.io/shell-novice/ )
Familiarity with a Unix text editor (for example, Vi/Vim or Nano)
An HPC account on QUT’s HPC. Apply for a new HPC account here.
R tutorials:
Install Nextflow
The nf-core/rnaseq workflow requires Nextflow to be installed in your account on the HPC. Find details on how to install and test Nextflow here. You will also need to prepare a nextflow.config file and run a PBS Pro submission script for Nextflow pipelines.
Additional information is available here: https://nf-co.re/usage/installation
Additional details on the workflow can be found at:
Overview: https://nf-co.re/rnaseq/3.10.1
Usage: https://nf-co.re/rnaseq/3.10.1/usage
GitHub: https://github.com/nf-core/rnaseq
Pipeline Summary
...
The pipeline is built using Nextflow and processes data using the following steps:
cat - Merge re-sequenced FastQ files
FastQC - Raw read QC
UMI-tools extract - UMI barcode extraction
TrimGalore - Adapter and quality trimming
BBSplit - Removal of genome contaminants
SortMeRNA - Removal of ribosomal RNA (optional)
STAR and Salmon - Fast splice-aware genome alignment and transcriptome quantification
STAR via RSEM - Alignment and quantification of expression levels
HISAT2 - Memory-efficient splice-aware alignment to a reference
UMI-tools dedup - UMI-based deduplication
picard MarkDuplicates - Duplicate read marking
StringTie - Transcript assembly and quantification
BEDTools and bedGraphToBigWig - Create bigWig coverage files
RSeQC - Various RNA-seq QC metrics
Qualimap - Various RNA-seq QC metrics
dupRadar - Assessment of technical / biological read duplication
Preseq - Estimation of library complexity
featureCounts - Read counting relative to gene biotype
DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
MultiQC - Present QC for raw reads, alignment, read counting, and sample similarity
Pseudo-alignment and quantification
Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
Workflow reporting and genomes
Reference genome files - Saving reference genome indices/files
Pipeline information - Report metrics generated during the workflow execution
Getting Started
Download and run the workflow using the minimal test data provided by nf-core/rnaseq. We recommend using singularity as the profile on QUT’s HPC. The ‘conda’ profile is another option. Note: the ‘docker’ profile is unavailable on the HPC.
Code Block |
---|
nextflow run nf-core/rnaseq -profile test,singularity --outdir results -r 3.10.1 |
...
Preparing a ‘samplesheet.csv’ file
Prepare a sample sheet file that specifies the input files to be used. To do this, we use an nf-core script to generate the ‘samplesheet.csv’ file as follows (setting strandedness to auto allows the pipeline to determine the strandedness of your RNA-seq data automatically):
Code Block |
---|
# load Python 3.10
module load python/3.10.8-gcccore-12.2.0
# download the script and make it executable
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py
# generate the samplesheet.csv file
./fastq_dir_to_samplesheet.py /path/to/directory/containing/fastq_files/ samplesheet.csv \
--strandedness auto \
--read1_extension _R1.fastq.gz \
--read2_extension _R2.fastq.gz |
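Because the script above pairs files by the `_R1.fastq.gz` and `_R2.fastq.gz` extensions, it can help to confirm that every R1 file has a matching R2 file before generating the sample sheet. The following is a minimal sketch, not part of nf-core/rnaseq: the function name `list_unpaired` is hypothetical, and it assumes paired-end data named with exactly those two extensions.

```shell
# Hypothetical helper (not part of nf-core/rnaseq): print every R1 file in a
# folder that has no matching R2 file. Assumes the naming used above:
# <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
list_unpaired() {
    fastq_dir="$1"
    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when the folder has no R1 files.
        [ -e "$r1" ] || continue
        # Swap the R1 suffix for the R2 suffix to get the expected mate.
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        [ -e "$r2" ] || echo "no R2 for $r1"
    done
}
```

Running `list_unpaired /path/to/directory/containing/fastq_files` prints nothing when all files are paired.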
Example of a typical command to run an RNA-seq analysis for human samples (GRCh38 reference genome):
Code Block |
---|
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCh38 \
-profile singularity \
--aligner star_rsem \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 1 \
--three_prime_clip_r2 1 |
Note: if the run was interrupted, a particular step did not complete, or you want to modify a parameter for a particular step, you do not need to re-run every process. Adding “-resume” to the command tells Nextflow to resume the workflow, re-using the results of steps that have already completed.
Code Block |
---|
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCh38 \
-profile singularity \
--aligner star_rsem \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 1 \
--three_prime_clip_r2 1 \
-resume |
Preparing a ‘samplesheet.csv’ file
Prepare an index.csv file containing the information for the samples to be processed. Example index.csv files are shown below.
Example index.csv (previous versions):
...
index.csv (Version 3.10.1):
Code Block |
---|
sample,fastq_1,fastq_2,strandedness
control_1,/path/to/directory/containing/fastq_files/control-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-1_R2.fastq.gz,auto
control_2,/path/to/directory/containing/fastq_files/control-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-2_R2.fastq.gz,auto
control_3,/path/to/directory/containing/fastq_files/control-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-3_R2.fastq.gz,auto
infected_1,/path/to/directory/containing/fastq_files/infected-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-1_R2.fastq.gz,auto
infected_2,/path/to/directory/containing/fastq_files/infected-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-2_R2.fastq.gz,auto
infected_3,/path/to/directory/containing/fastq_files/infected-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-3_R2.fastq.gz,auto |
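A hand-edited sample sheet is easy to get wrong, so a quick structural check before launching can save a failed run. The sketch below is an assumption-laden illustration rather than the pipeline's own validation: the function name `check_samplesheet` is hypothetical, the header is taken from the version 3.10.1 example above, and the accepted strandedness values (auto, forward, reverse, unstranded) are assumed to match that version.

```shell
# Hypothetical helper: check that a sample sheet has the four-column header
# shown above, four fields per row, and a recognised strandedness value.
# Returns non-zero and prints a message for each problem found.
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2,strandedness") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        NF != 4 {
            print "line " NR ": expected 4 columns, found " NF
            bad = 1
            next
        }
        # Assumed set of accepted values for version 3.10.1.
        $4 !~ /^(auto|forward|reverse|unstranded)$/ {
            print "line " NR ": unknown strandedness \"" $4 "\""
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, `check_samplesheet samplesheet.csv && echo "sample sheet looks OK"` prints the message only when no problems are reported.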
Index format for version 3.3:
Code Block |
---|
group,fastq_1,fastq_2,strandedness
control_rep1,/path/to/fastq/control-1_R1.fastq.gz,/path/to/fastq/control-1_R2.fastq.gz,unstranded
control_rep2,/path/to/fastq/control-2_R1.fastq.gz,/path/to/fastq/control-2_R2.fastq.gz,unstranded
control_rep3,/path/to/fastq/control-3_R1.fastq.gz,/path/to/fastq/control-3_R2.fastq.gz,unstranded
infected_rep1,/path/to/fastq/infected-1_R1.fastq.gz,/path/to/fastq/infected-1_R2.fastq.gz,unstranded
infected_rep2,/path/to/fastq/infected-2_R1.fastq.gz,/path/to/fastq/infected-2_R2.fastq.gz,unstranded
infected_rep3,/path/to/fastq/infected-3_R1.fastq.gz,/path/to/fastq/infected-3_R2.fastq.gz,unstranded |
Preparing to run on the HPC
To run the pipeline on the HPC, you need to create a PBS submission script. For example, create a file called launch.pbs using a text editor of your choice (e.g., vi or nano), then copy and paste the code below:
Code Block |
---|
#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
# work in the directory (folder) the job was submitted from
cd "$PBS_O_WORKDIR"
# load java and set up memory settings to run nextflow
# (exported so that the nextflow process can see the setting)
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
# run the rnaseq pipeline
nextflow run nf-core/rnaseq \
-profile singularity \
-r 3.10.1 \
--input samplesheet.csv \
--genome GRCh38 \
--outdir results \
--aligner star_salmon |
We recommend running the nextflow nf-core/rnaseq pipeline once and then assessing the fastqc results folder to check whether sequence biases are present at the 5' and 3' ends of the sequences. Then, we can use the PBS script below to tell the pipeline to remove a defined number of bases from the 5'-end (--clip_r1 or --clip_r2) or the 3'-end (--three_prime_clip_r1 or --three_prime_clip_r2). We can also tell it to remove ribosomal RNA (--remove_ribo_rna), as these sequences are non-informative.
Code Block |
---|
#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
# work in the directory (folder) the job was submitted from
cd "$PBS_O_WORKDIR"
# load java and set up memory settings to run nextflow
# (exported so that the nextflow process can see the setting)
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
# run the rnaseq pipeline
# -with-dag can output files in .png, .pdf, .svg or .html
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCh38 \
-profile singularity \
--aligner star_salmon \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 2 \
--three_prime_clip_r2 2 \
--remove_ribo_rna \
-with-dag flowchart.png |
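To make the clipping parameters concrete, the sketch below shows which positions of a read the 5' and 3' clip values refer to. It is an illustration only: the function name `clip_read` and the example read are hypothetical, and the real trimming is done inside the pipeline by TrimGalore, which also handles adapters and base quality.

```shell
# Illustration only: drop $2 bases from the 5' end and $3 bases from the
# 3' end of the read given in $1, then print what is left.
clip_read() {
    read_seq="$1"
    five="$2"
    three="$3"
    start=$((five + 1))
    end=$((${#read_seq} - three))
    if [ "$start" -gt "$end" ]; then
        # Nothing is left once both ends are clipped.
        echo ""
    else
        printf '%s\n' "$read_seq" | cut -c"${start}-${end}"
    fi
}

# Equivalent in spirit to --clip_r1 10 with --three_prime_clip_r1 2
clip_read "NNNNNNNNNNACGTACGTACGTGG" 10 2   # prints ACGTACGTACGT
```

The first ten bases and the last two are removed, leaving the twelve bases in between.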
Submitting the job
Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.
Submit the run with this command (on Lyra):
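Before submitting, it can be worth confirming that the run folder really contains everything listed above. The following is a minimal sketch under stated assumptions: the function name `check_run_folder` is hypothetical, the three file names are the ones used on this page, and FASTQ paths in the sample sheet are assumed to be absolute (relative paths are checked against the current directory).

```shell
# Hypothetical helper: report anything missing from a run folder.
# Usage: check_run_folder /path/to/run_folder
check_run_folder() {
    run_dir="${1:-.}"
    missing=0
    for f in samplesheet.csv nextflow.config launch.pbs; do
        if [ ! -f "$run_dir/$f" ]; then
            echo "missing: $run_dir/$f"
            missing=1
        fi
    done
    if [ -f "$run_dir/samplesheet.csv" ]; then
        # Columns 2 and 3 of the sample sheet hold the FASTQ paths;
        # column 3 may be empty for single-end data.
        fastq_list=$(awk -F',' 'NR > 1 { print $2; if ($3 != "") print $3 }' "$run_dir/samplesheet.csv")
        while IFS= read -r fq; do
            [ -z "$fq" ] && continue
            if [ ! -f "$fq" ]; then
                echo "missing: $fq"
                missing=1
            fi
        done <<EOF
$fastq_list
EOF
    fi
    return "$missing"
}
```

For example, `check_run_folder . && qsub launch.pbs` submits the job only when nothing is reported missing.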
Code Block |
---|
qsub launch.pbs |
Monitoring the Run
To check on the jobs you are running, use the command:
Code Block |
---|
qstat -u $USER |
Alternatively, use the following command:
Code Block |
---|
qjobs |
Both commands list the jobs you are running. Nextflow will launch additional jobs during the run.
...
Finally, if you have configured the connection to NFTower, you can log on and check your run.
Troubleshooting
I have been using version 3.3, and now, when I run version 3.10.1, I get an error that the asset is corrupted. What should I do?
Code Block |
---|
# delete the existing assets associated with the RNAseq pipeline:
cd ~/.nextflow/assets/nf-core
rm -r rnaseq/
# run a test again with the new version that you are testing, for example, version 3.10.1.
# See details on how to run a test above (under 'Getting Started') |
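The same clean-up can be wrapped in a small guard so that nothing is deleted unless the cached folder is actually there. This is a sketch only: the function name `remove_cached_pipeline` is hypothetical, and the default path is the assets location used in the commands above.

```shell
# Hypothetical helper: remove a cached pipeline folder only if it exists.
# With no argument it targets the nf-core/rnaseq assets folder used above.
remove_cached_pipeline() {
    assets_dir="${1:-$HOME/.nextflow/assets/nf-core/rnaseq}"
    if [ -d "$assets_dir" ]; then
        rm -r -- "$assets_dir"
        echo "removed $assets_dir"
    else
        echo "nothing to remove at $assets_dir"
    fi
}
```

After calling `remove_cached_pipeline`, run the test from 'Getting Started' again so that Nextflow downloads a fresh copy of the pipeline.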
Add output folders/files
sample data
Running the pipeline using custom data
Example of a typical command to run an RNA-seq analysis for mouse samples:
Code Block |
---|
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCm38 \
-profile singularity \
--aligner star_rsem \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 2 \
--three_prime_clip_r2 2 |
Note: if the run was interrupted, a particular step did not complete, or you want to modify a parameter for a particular step, you do not need to re-run every process. Adding “-resume” to the command tells Nextflow to resume the workflow, re-using the results of steps that have already completed.
Code Block |
---|
nextflow run nf-core/rnaseq --input samplesheet.csv \
--outdir results \
-r 3.10.1 \
--genome GRCm38 \
-profile singularity \
--aligner star_rsem \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 2 \
--three_prime_clip_r2 2 \
-resume |