1. Introduction to nf-core/rnaseq

What is RNA-seq?

RNA-seq (RNA sequencing) is a powerful and widely used technique for analysing the quantity and sequences of RNA in a sample. It provides a snapshot of the entire transcriptome—the complete set of RNA transcripts produced by the genome at a given time.

Key Concepts of RNA-seq

  1. Transcriptome Analysis: RNA-seq is used to study the transcriptome, which includes messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and other non-coding RNAs. It shows which genes are expressed, at what level, and which alternative splicing variants are present.

  2. Sequencing Process: The RNA molecules in a sample are first converted into complementary DNA (cDNA) using reverse transcription. These cDNA molecules are then sequenced using high-throughput sequencing technologies, such as Illumina, PacBio, or Oxford Nanopore.

  3. Data Analysis: After sequencing, the resulting reads are aligned to a reference genome or assembled de novo to reconstruct the transcriptome. The data is then analysed to quantify gene expression levels, detect differentially expressed genes, identify novel transcripts, and study gene fusions or mutations.

  4. Applications:

    • Gene Expression Profiling: Identifying which genes are active and to what extent under various conditions.

    • Differential Gene Expression: Comparing gene expression across different conditions, such as in disease vs. healthy tissue.

    • Alternative Splicing: Detecting different isoforms of mRNA produced from the same gene.

    • Mutation Detection: Identifying mutations, such as SNPs (single nucleotide polymorphisms), in expressed genes.

  5. Advantages:

    • Comprehensive: RNA-seq can detect both known and novel transcripts because it does not rely on predefined probes; with de novo assembly it can be applied even without a reference genome.

    • Quantitative: It provides quantitative data on the abundance of RNA, allowing for precise measurement of gene expression levels.

    • Resolution: RNA-seq can detect low-abundance transcripts and distinguish between closely related isoforms.
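To make the "quantitative" and "differential gene expression" points concrete, here is a minimal sketch in Python. The gene names and read counts are invented for illustration, and counts-per-million scaling with a pseudocount of 1 is a simplifying assumption, not the normalisation that the pipeline or DESeq2 applies.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}

def log2_fold_change(reference, treated, pseudocount=1.0):
    """log2 ratio of treated over reference; the pseudocount avoids log(0)."""
    return math.log2((treated + pseudocount) / (reference + pseudocount))

# Hypothetical raw counts for three genes in two samples.
healthy = {"geneA": 500, "geneB": 1500, "geneC": 8000}
disease = {"geneA": 2000, "geneB": 1500, "geneC": 16500}

cpm_healthy = counts_per_million(healthy)
cpm_disease = counts_per_million(disease)

lfc = {
    gene: log2_fold_change(cpm_healthy[gene], cpm_disease[gene])
    for gene in healthy
}

for gene, value in lfc.items():
    print(f"{gene}: log2 fold change = {value:.2f}")
```

Note that geneB has the same raw count in both samples but a negative fold change after scaling, because the disease sample was sequenced twice as deeply. This is why raw counts are not compared directly.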

RNA-seq analysis pipeline

Source: https://nf-co.re/rnaseq/3.14.0/

nf-core/rnaseq is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.
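As an illustration of the samplesheet input, the sketch below writes a small CSV in Python. The column names (sample, fastq_1, fastq_2, strandedness) follow the layout described in the nf-core/rnaseq usage documentation for the 3.x releases and should be checked against the release you run; the sample names and FASTQ file names are hypothetical.

```python
import csv
import io

# Column layout assumed from the nf-core/rnaseq 3.x usage documentation.
COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]

def build_samplesheet(rows):
    """Return the samplesheet as CSV text, one row per FASTQ file (pair)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()

# Hypothetical samples: one paired-end, one single-end (fastq_2 left empty).
rows = [
    {
        "sample": "CONTROL_REP1",
        "fastq_1": "control_rep1_R1.fastq.gz",
        "fastq_2": "control_rep1_R2.fastq.gz",
        "strandedness": "auto",
    },
    {
        "sample": "TREATED_REP1",
        "fastq_1": "treated_rep1_R1.fastq.gz",
        "fastq_2": "",
        "strandedness": "auto",
    },
]

samplesheet_text = build_samplesheet(rows)
print(samplesheet_text)
```

Rows that share the same sample value are treated by the pipeline as re-sequenced runs of one sample and are merged before analysis.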

The pipeline offers three aligner options to process RNA-seq data: 1) star_salmon, 2) star_rsem and 3) hisat2 (not recommended; see the warning below).

Warning: Quantification isn’t performed when using --aligner hisat2, because there is no appropriate option for calculating accurate expression estimates from HISAT2-derived genomic alignments. However, you can use this route if you prefer HISAT2 for alignment, QC and other downstream analyses compatible with its output.
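The aligner choice is passed on the command line. The following Python sketch only assembles and prints such a command; it does not run Nextflow. The helper names (build_command, produces_quantification) are hypothetical, and the samplesheet path, output directory, genome key and profile are placeholder values to adapt to your own setup.

```python
import shlex

# Routes named in this section; only the STAR-based ones yield quantification.
QUANTIFYING_ALIGNERS = {"star_salmon", "star_rsem"}
ALIGNERS = QUANTIFYING_ALIGNERS | {"hisat2"}

def produces_quantification(aligner):
    """True if the chosen route ends in a gene expression matrix."""
    return aligner in QUANTIFYING_ALIGNERS

def build_command(aligner, samplesheet="samplesheet.csv", outdir="results",
                  genome="GRCh37", profile="docker"):
    """Assemble an nf-core/rnaseq command line (placeholder defaults)."""
    if aligner not in ALIGNERS:
        raise ValueError(f"unknown aligner: {aligner!r}")
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "-r", "3.14.0",
        "-profile", profile,
        "--input", samplesheet,
        "--outdir", outdir,
        "--genome", genome,
        "--aligner", aligner,
    ]

for choice in sorted(ALIGNERS):
    note = "" if produces_quantification(choice) else "  # no quantification"
    print(shlex.join(build_command(choice)) + note)
```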

Overview of the Tools

  • STAR (Spliced Transcripts Alignment to a Reference) 10.1093/bioinformatics/bts635

    • A highly efficient and accurate RNA-seq aligner that aligns sequencing reads to a reference genome.

    • It can handle large genomes and complex splicing events, making it a suitable choice for RNA-seq alignment.

  • SALMON 10.1038/nmeth.4197

    • A tool for quantifying transcript abundances either directly from RNA-seq reads, using lightweight mapping (quasi-alignment) instead of full alignment, or from reads that have already been aligned to the transcriptome.

    • Known for its speed and accuracy in quantification.

    • It utilizes lightweight algorithms for fast processing, making it suitable for large datasets.

  • RSEM (RNA-Seq by Expectation-Maximization) http://doi.org/10.1186/1471-2105-12-323

    • A tool that estimates gene and isoform expression levels from RNA-seq data.

    • It uses a probabilistic model to account for the uncertainty in read mapping, providing highly accurate quantification of transcripts.

    • Typically slower than SALMON, but provides comprehensive statistical outputs.
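The idea of handling uncertain read mapping with expectation-maximization can be sketched with a toy example in Python. This is not the RSEM or Salmon implementation: it assumes all transcripts have equal length, ignores sequencing errors and fragment models, and uses a hypothetical function name (em_abundances) and invented reads.

```python
def em_abundances(read_compat, n_transcripts, n_iter=50):
    """Estimate relative transcript abundances from ambiguous read mappings.

    read_compat: one list per read with the indices of the transcripts that
    read is compatible with. Equal transcript lengths are assumed.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read across its compatible transcripts in
        # proportion to the current abundance estimates.
        for compat in read_compat:
            weight_sum = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / weight_sum
        # M-step: renormalise the expected read counts into abundances.
        total = sum(expected)
        theta = [count / total for count in expected]
    return theta

# Toy data: 6 reads unique to transcript 0, 2 unique to transcript 1,
# and 4 reads that map equally well to both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4

theta = em_abundances(reads, n_transcripts=2)
print([round(value, 3) for value in theta])
```

The ambiguous reads are not discarded or split evenly; they are shared according to the evidence from the uniquely mapping reads, so the estimate settles at about 0.75 and 0.25 for the two transcripts.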

Comparison: STAR+SALMON vs. STAR+RSEM

1. Speed and Computational Efficiency

  • STAR+SALMON: Generally faster. SALMON's quasi-alignment method is designed for speed, making it a good choice if computational efficiency is a priority.

  • STAR+RSEM: Slower because RSEM performs a more detailed probabilistic analysis, but it can be more accurate in certain cases, especially when dealing with complex transcriptomes.

2. Quantification Accuracy

  • STAR+RSEM: Offers highly accurate transcript quantification due to its expectation-maximization algorithm, especially useful in complex samples where precise transcript isoform expression is critical.

  • STAR+SALMON: Also provides accurate quantification and is highly comparable to STAR+RSEM, especially when using its alignment-based mode. SALMON’s accuracy is generally considered very good, particularly for standard transcript-level analysis.

3. Ease of Use and Flexibility

  • STAR+SALMON: Easier to set up and run, with flexible options for alignment-free or alignment-based quantification. It is often preferred for large-scale studies where speed and scalability are crucial.

  • STAR+RSEM: Can be more computationally demanding, but offers detailed output that might be necessary for certain types of downstream analysis.

4. Output and Interpretation

  • STAR+RSEM: Provides extensive output, including posterior probabilities, which can be useful for downstream statistical analysis.

  • STAR+SALMON: Outputs are straightforward and easy to interpret, making it suitable for most common RNA-seq analyses.

In summary

  • Use STAR+SALMON if you need fast, efficient, and reliable quantification for large datasets, especially when working with many samples or requiring quick turnaround.

  • Use STAR+RSEM if your analysis demands high precision in transcript isoform quantification and you are working with complex transcriptomes where the detailed statistical output can be beneficial.

In practice, both pipelines are highly regarded, and the best choice often comes down to the specific requirements of your research project and the resources available.

Overview of pipeline tools:

Source: https://nf-co.re/rnaseq/3.14.0

  1. Merge re-sequenced FastQ files (cat)

  2. Auto-infer strandedness by subsampling and pseudoalignment (fq, Salmon)

  3. Read QC (FastQC)

  4. UMI extraction (UMI-tools)

  5. Adapter and quality trimming (Trim Galore!)

  6. Removal of genome contaminants (BBSplit)

  7. Removal of ribosomal RNA (SortMeRNA)

  8. Choice of multiple alignment and quantification routes:

    1. STAR -> Salmon

    2. STAR -> RSEM

    3. HISAT2 -> NO QUANTIFICATION

  9. Sort and index alignments (SAMtools)

  10. UMI-based deduplication (UMI-tools)

  11. Duplicate read marking (picard MarkDuplicates)

  12. Transcript assembly and quantification (StringTie)

  13. Create bigWig coverage files (BEDTools, bedGraphToBigWig)

  14. Extensive quality control:

    1. RSeQC

    2. Qualimap

    3. dupRadar

    4. Preseq

    5. DESeq2

    6. Kraken2 -> Bracken on unaligned sequences; optional

  15. Pseudoalignment and quantification (Salmon or Kallisto; optional)

  16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)
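Step 1 in the list above (merging re-sequenced FastQ files with cat) works because concatenated gzip files form a valid gzip stream. The Python sketch below imitates that behaviour on two tiny, invented FASTQ files written to a temporary directory; the function and file names are hypothetical, and the pipeline itself uses cat rather than this code.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def merge_fastq_gz(parts, merged):
    """Concatenate gzipped FASTQ files byte for byte, like `cat a.gz b.gz`."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

def count_reads(path):
    """Count FASTQ records, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    run1 = tmp / "sample_run1.fastq.gz"
    run2 = tmp / "sample_run2.fastq.gz"
    # Two hypothetical sequencing runs of the same sample.
    with gzip.open(run1, "wt") as fh:
        fh.write("@read1\nACGT\n+\nIIII\n")
    with gzip.open(run2, "wt") as fh:
        fh.write("@read2\nTTGA\n+\nIIII\n@read3\nGGCC\n+\nIIII\n")

    merged = tmp / "sample_merged.fastq.gz"
    merge_fastq_gz([run1, run2], merged)
    merged_read_count = count_reads(merged)

print(f"reads after merging: {merged_read_count}")
```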