Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

  • Create a metadata “samplesheet.csv” for small RNAseq datasets.

  • Learn to use a “nextflow.config” file in the working directory to override Nextflow parameters (e.g., specify where to find the pipeline assets).

  • Learn how to prepare a PBS script to run the expression profiling of small RNAs against the reference miRBase database annotated microRNAs.

Preparing the pipeline inputs

The pipeline requires preparing at least 2 files:

...

PBS Pro script (launch_nf-core_RNAseq_QC.pbs) with instructions to run the pipeline

...

Nextflow.config - revision 2.3.1 of the nf-core/smrnaseq pipeline may not be able to identify the location of reference adapter sequences, thus, we will use a local nextflow.config file to tell Nextflow where to find the reference adapters necessary to trim the raw small_RNA-Seq data

A. Create the metadata file (samplesheet.csv):

Change to the data folder directory:

Code Block
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease

Copy the bash script to the working folder

Code Block
cp /work/training/2024/smallRNAseq/scripts/create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease
  • Note: you could replace ‘$HOME/workshop/data’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located

View the content of the script:

Code Block
cat create_nf-core_smallRNAseq_samplesheet.sh

...

NOTE: modify ‘read1_extension’ as appropriate for your data. For example: _1.fastq.gz or _R1_001.fastq.gz or _R1.fq.gz , etc

Let’s generate the metadata file by running the following command:

Code Block
sh create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease

Check the newly created samplesheet.csv file:

Code Block
cat samplesheet.csv

sample,fastq_1

ERR409878,/work/training/2024/smallRNAseq/data/human_disease/ERR409878.fastq.gz

ERR409879,/work/training/2024/smallRNAseq/data/human_disease/ERR409879.fastq.gz

ERR409880,/work/training/2024/smallRNAseq/data/human_disease/ERR409880.fastq.gz

ERR409881,/work/training/2024/smallRNAseq/data/human_disease/ERR409881.fastq.gz

ERR409882,/work/training/2024/smallRNAseq/data/human_disease/ERR409882.fastq.gz

ERR409883,/work/training/2024/smallRNAseq/data/human_disease/ERR409883.fastq.gz

ERR409884,/work/training/2024/smallRNAseq/data/human_disease/ERR409884.fastq.gz

ERR409885,/work/training/2024/smallRNAseq/data/human_disease/ERR409885.fastq.gz

ERR409886,/work/training/2024/smallRNAseq/data/human_disease/ERR409886.fastq.gz

ERR409887,/work/training/2024/smallRNAseq/data/human_disease/ERR409887.fastq.gz

ERR409888,/work/training/2024/smallRNAseq/data/human_disease/ERR409888.fastq.gz

ERR409889,/work/training/2024/smallRNAseq/data/human_disease/ERR409889.fastq.gz

ERR409890,/work/training/2024/smallRNAseq/data/human_disease/ERR409890.fastq.gz

ERR409891,/work/training/2024/smallRNAseq/data/human_disease/ERR409891.fastq.gz

ERR409892,/work/training/2024/smallRNAseq/data/human_disease/ERR409892.fastq.gz

ERR409893,/work/training/2024/smallRNAseq/data/human_disease/ERR409893.fastq.gz

ERR409894,/work/training/2024/smallRNAseq/data/human_disease/ERR409894.fastq.gz

ERR409895,/work/training/2024/smallRNAseq/data/human_disease/ERR409895.fastq.gz

ERR409896,/work/training/2024/smallRNAseq/data/human_disease/ERR409896.fastq.gz

ERR409897,/work/training/2024/smallRNAseq/data/human_disease/ERR409897.fastq.gz

ERR409898,/work/training/2024/smallRNAseq/data/human_disease/ERR409898.fastq.gz

ERR409899,/work/training/2024/smallRNAseq/data/human_disease/ERR409899.fastq.gz

ERR409900,/work/training/2024/smallRNAseq/data/human_disease/ERR409900.fastq.gz

B. Prepare PBS Pro script to run the nf-core/smrnaseq pipeline

Copy the PBS Pro script for running the full small RNAseq pipeline (launch_nf-core_smallRNAseq_miRBase.pbs)

Copy and paste the code below to the terminal:

Code Block
cp $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease/samplesheet.csv $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/launch_nf-core_smallRNAseq_miRBase.pbs $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/nextflow.config $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
  • Line 1: Copy the samplesheet.csv file to the working directory

  • Line 2: Copy the launch_nf-core_smallRNAseq_human.pbs submission script to the working directory

  • Line 3: Copy the nextflow.config file from shared folder to my working directory.

  • Line 4: move to the working directory

View the content of the launch_nf-core_RNAseq_QC.pbs script:

Code Block
cat launch_nf-core_smallRNAseq_miRBase.pbs 

...

TIP: when running the nf-core/smrnaseq pipeline (release 2.3.1) the pipeline is not able to find the location of the reference adapter sequences for trimming of the raw small RNAseq pipeline, so we need to specify where to find the folder where the adapter sequences file is located. To do this, we prepare a “nextflow.config” file (see below). This file should be already in your working directory. Print the content as follows:

Code Block
cat nextflow.config
Code Block
singularity {
    runOptions = '-B $HOME/.nextflow/assets/nf-core/smrnaseq/assets'
}

Note: if a config file is placed in the working folder it can override parameters define by the global ~/.nextflow/config file or the config file define as part of the pipeline.

Submit the job to the HPC cluster:

Code Block
qsub launch_nf-core_smallRNAseq_miRBase.pbs

Monitor the progress:

Code Block
qjobs

The job will take several hours to run, hence we will use precomputed results for the statistical analysis in the next section.

Outputs

The pipeline will produce two folders, one called “work,” where all the processing is done, and another called “results,” where we can find the pipeline's outputs. The content of the results folder is as follows:

Code Block
results/
├── bowtie_index
│   ├── mirna_hairpin
│   └── mirna_mature
├── fastp
│   └── on_raw
├── fastqc
│   ├── raw
│   └── trimmed
├── mirna_quant
│   ├── bam
│   ├── edger_qc    <----- Expression mature miRNA (mature_counts.csv) and precursor-miRNAs (haripin_counts.csv) counts can be found in this subfolder. 
│   ├── mirtop
│   ├── reference
│   └── seqcluster
├── mirtrace
│   ├── mirtrace-report.html
│   ├── mirtrace-results.json
│   ├── mirtrace-stats-contamination_basic.tsv
│   ├── mirtrace-stats-contamination_detailed.tsv
│   ├── mirtrace-stats-length.tsv
│   ├── mirtrace-stats-mirna-complexity.tsv
│   ├── mirtrace-stats-phred.tsv
│   ├── mirtrace-stats-qcstatus.tsv
│   ├── mirtrace-stats-rnatype.tsv
│   ├── qc_passed_reads.all.collapsed
│   └── qc_passed_reads.rnatype_unknown.collapsed
├── multiqc
│   ├── multiqc_data
│   ├── multiqc_plots
│   └── multiqc_report.html
└── pipeline_info
    ├── execution_report_2024-08-20_16-55-53.html
    ├── execution_timeline_2024-08-20_16-55-53.html
    ├── execution_trace_2024-08-20_16-55-53.txt
    ├── nf_core_smrnaseq_software_mqc_versions.yml
    ├── params_2024-08-20_16-56-04.json
    └── pipeline_dag_2024-08-20_16-55-53.html

The quantification of the mature miRNA and hairpin expressions can be found in the /results/mirna_quant/edger_qc directory.

Code Block
cd /results/mirna_quant/edger_qc
Code Block
hairpin_counts.csv
hairpin_CPM_heatmap.pdf
hairpin_edgeR_MDS_distance_matrix.txt
hairpin_edgeR_MDS_plot_coordinates.txt
hairpin_edgeR_MDS_plot.pdf
hairpin_log2CPM_sample_distances_dendrogram.pdf
hairpin_log2CPM_sample_distances_heatmap.pdf
hairpin_log2CPM_sample_distances.txt
hairpin_logtpm.csv
hairpin_logtpm.txt
hairpin_normalized_CPM.txt
hairpin_unmapped_read_counts.txt
mature_counts.csv      <----- Expression mature miRNA. This file will be used to identify differentially expressed miRNAs (Session 7)
mature_CPM_heatmap.pdf
mature_edgeR_MDS_distance_matrix.txt
mature_edgeR_MDS_plot_coordinates.txt
mature_edgeR_MDS_plot.pdf
mature_log2CPM_sample_distances_dendrogram.pdf
mature_log2CPM_sample_distances_heatmap.pdf
mature_log2CPM_sample_distances.txt
mature_logtpm.csv
mature_logtpm.txt
mature_normalized_CPM.txt
mature_unmapped_read_counts.txt

Submitting the job

Code Block
qsub launch_nf-core_RNAseq_pipeline.pbs

Monitoring the Run

Code Block
qjobs

Outputs

The pipeline will produce two folders, one called “work,” where all the processing is done, and another called “results,” where we can find the pipeline's outputs. The content of the results folder is as follows:

The quantification of the gene and transcript expressions can be found in the ‘star_salmon’ directory.

Code Block
cd results/star_salmon

The following feature count tables are generated:

Copying data for hands-on exercises

Before we start using the HPC, let’s start an interactive session:

Code Block
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb

Get a copy of the scripts to be used in this module

Use the terminal to log into the HPC and create a /RNAseq/ folder to run the nf-core/rnaseq pipeline. For example:

Code Block
mkdir -p $HOME/workshop/small_RNAseq/scripts
cp /work/training/smallRNAseq/scripts/* $HOME/workshop/small_RNAseq/scripts/
ls -l $HOME/workshop/small_RNAseq/scripts/
  • Line 1: The -p indicates create 'parental directories as required. Thus the line 1 command creates both /workshop/ and the subfolder /workshop/scripts/

  • Line 2: Copies all files from /work/datasets/workshop/scripts/ as noted by an asterisk to the newly created folder $HOME/workshop/scripts/

  • Line 3: List the files in the script folder

Copy multiple subdirectories and files using rsync

Code Block
mkdir -p $HOME/workshop/small_RNAseq/data/
rsync -rv /work/training/smallRNAseq/data/ $HOME/workshop/small_RNAseq/data/
  • Line 1: The first command creates the folder /scripts/

  • Line 2: rsync copies all subfolders and files from the specified source folder to the selected destination folder. The -r = recursively will copy directories and files; -v = verbose messages of the transfer of files

Create a folder for running the nf-core small RNA-seq pipeline

Let’s create a “runs” folder to run the nf-core/rnaseq pipeline:

Code Block
mkdir -p $HOME/workshop/small_RNAseq
mkdir $HOME/workshop/small_RNAseq/run1_test
mkdir $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cd $HOME/workshop/small_RNAseq/
  • Lines 1-4: create sub-folders for each exercise

  • Line 5: change the directory to the folder “small_RNAseq”

Exercise 1: Running a test with nf-core sample data

First, let’s assess the execution of the nf-core/rnaseq pipeline by running a test using sample data.

Copy the launch_nf-core_smallRNAseq_test.pbs to the working directory

Code Block
cd $HOME/workshop/small_RNAseq/run1_test
cp $HOME/workshop/small_RNAseq/scripts/launch_nf-core_smallRNAseq_test.pbs .

View the content of the script as follows:

Code Block
cat launch_nf-core_smallRNAseq_test.pbs

#!/bin/bash -l

#PBS -N nfsmrnaseq

#PBS -l select=1:ncpus=2:mem=4gb

#PBS -l walltime=24:00:00

#work on current directory (folder)

cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow

module load java

export NXF_OPTS='-Xms1g -Xmx4g'

# run the test

nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0

where:

  • nextflow command: nextflow run

  • pipeline name: nf-core/smrnaseq

  • pipeline version: -r 2.1.0

  • container type and sample data: -profile test,singularity

  • output directory: --outdir results

Submitting the job

Now we can submit the small RNAseq test job to the HPC scheduler:

Code Block
qsub launch_nf-core_smallRNAseq_test.pbs

Monitoring the Run

Code Block
qjobs

Exercise 2: Running the small RNA pipeline using public human data

The pipeline requires preparing at least 2 files:

...

PBS Pro script (launch_nf-core_smallRNAseq_human.pbs) with instructions to run the pipeline

Create the metadata file (samplesheet.csv):

Change to the data folder directory:

Code Block
cd $HOME/workshop/small_RNAseq/data/human
pwd

Copy the bash script to the working folder

Code Block
cp /work/training/smallRNAseq/scripts/create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/small_RNAseq/data/human
  • Note: you could replace ‘$HOME/workshop/data’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located

View the content of the script:

Code Block
cat create_nf-core_smallRNAseq_samplesheet.sh

#!/bin/bash -l

#User defined variables.

##########################################################

DIR='$HOME/workshop/small_RNAseq/data/human'

INDEX='samplesheet.csv'

##########################################################

#load python module

module load python/3.10.8-gcccore-12.2.0

#fetch the script to create the sample metadata table

wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py

chmod +x fastq_dir_to_samplesheet.py

#generate initial sample metadata file

./fastq_dir_to_samplesheet.py  $DIR index.csv \

        --strandedness auto \

        --read1_extension .fastq.gz

#format index file

cat index.csv | awk -F "," '{print $1 "," $2}' > ${INDEX}

#Remove intermediate files:

rm index.csv fastq_dir_to_samplesheet.py

Let’s generate the metadata file by running the following command:

Code Block
sh create_RNAseq_samplesheet.sh

Check the newly created samplesheet.csv file:

Code Block
ls -l
cat samplesheet.cvs

sample,fastq_1

SRR20753704,/work/training/smallRNAseq/data/SRR20753704.fastq.gz

SRR20753705,/work/training/smallRNAseq/data/SRR20753705.fastq.gz

SRR20753706,/work/training/smallRNAseq/data/SRR20753706.fastq.gz

SRR20753707,/work/training/smallRNAseq/data/SRR20753707.fastq.gz

SRR20753708,/work/training/smallRNAseq/data/SRR20753708.fastq.gz

SRR20753709,/work/training/smallRNAseq/data/SRR20753709.fastq.gz

SRR20753716,/work/training/smallRNAseq/data/SRR20753716.fastq.gz

SRR20753717,/work/training/smallRNAseq/data/SRR20753717.fastq.gz

SRR20753718,/work/training/smallRNAseq/data/SRR20753718.fastq.gz

SRR20753719,/work/training/smallRNAseq/data/SRR20753719.fastq.gz

SRR20753720,/work/training/smallRNAseq/data/SRR20753720.fastq.gz

SRR20753721,/work/training/smallRNAseq/data/SRR20753721.fastq.gz

 

Copy the PBS Pro script for running the full small RNAseq pipeline (launch_nf-core_smallRNAseq_human.pbs)

Copy and paste the code below to the terminal:

Code Block
cp $HOME/workshop/small_RNAseq/data/human/samplesheet.csv $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cp $HOME/workshop/small_RNAseq/scripts/launch_nf-core_smallRNAseq_human.pbs $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cd $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
  • Line 1: Copy the samplesheet.csv file to the working directory

  • Line 2: copy the launch_nf-core_smallRNAseq_human.pbs submission script to the working directory

  • Line 3: move to the working directory

View the content of the launch_nf-core_RNAseq_QC.pbs script:

Code Block
cat launch_nf-core_smallRNAseq_human.pbs

#!/bin/bash -l

#PBS -N nfsmallRNAseq

#PBS -l select=1:ncpus=2:mem=4gb

#PBS -l walltime=24:00:00

#PBS -m abe

 

#run the tasks in the current working directory

cd $PBS_O_WORKDIR

#load java and assign up to 4GB RAM memory for nextflow to use

module load java

export NXF_OPTS='-Xms1g -Xmx4g'

 

#run the small RNAseq pipeline

nextflow run nf-core/smrnaseq -r 2.1.0 \

        -profile singularity \

        --outdir results \

        --input samplesheet.csv \

        --genome GRCh38-local \

        --mirtrace_species hsa \

        --three_prime_adapter 'TGGAATTCTCGGGTGCCAAGG' \

        --fastp_min_length 18 \

        --fastp_max_length 30 \

        --hairpin /work/training/smallRNAseq/data/mirbase/hairpin.fa \

        --mature /work/training/smallRNAseq/data/mirbase/mature.fa \

        --mirna_gtf /work/training/smallRNAseq/data/mirbase/hsa.gff3 \

        -resume

Submit the job to the HPC cluster:

Code Block
qsub launch_nf-core_smallRNAseq_human.pbs

Monitor the progress:

Code Block
qjobs

The job will take several hours to run, hence we will use precomputed results for the statistical analysis in the next section.

Precomputed results:

We ran the small RNA seq samples and the results can be found at:

Code Block
/work/training/smallRNAseq/runs/run2_smallRNAseq_human

The results of the miRNA profiling can be found in the folder call “edger”:

Code Block
results/
├── edger
├── fastp
├── fastqc
├── genome
├── index
├── mirdeep
├── mirdeep2
├── mirtop
├── mirtrace
├── multiqc
├── pipeline_info
├── samtools
└── unmapped

inside the “edger” folder find the “mature_counts.csv” file:

Code Block
hairpin_counts.csv
hairpin_CPM_heatmap.pdf
hairpin_edgeR_MDS_distance_matrix.txt
hairpin_edgeR_MDS_plot_coordinates.txt
hairpin_edgeR_MDS_plot.pdf
hairpin_log2CPM_sample_distances_dendrogram.pdf
hairpin_log2CPM_sample_distances_heatmap.pdf
hairpin_log2CPM_sample_distances.txt
hairpin_logtpm.csv
hairpin_logtpm.txt
hairpin_normalized_CPM.txt
hairpin_unmapped_read_counts.txt
mature_counts.csv       <-- we will use this file for the statistical analysis in the next section
mature_counts.txt
mature_CPM_heatmap.pdf
mature_edgeR_MDS_distance_matrix.txt
mature_edgeR_MDS_plot_coordinates.txt
mature_edgeR_MDS_plot.pdf
mature_log2CPM_sample_distances_dendrogram.pdf
mature_log2CPM_sample_distances_heatmap.pdf
mature_log2CPM_sample_distances.txt
mature_logtpm.csv
mature_logtpm.txt
mature_normalized_CPM.txt
mature_unmapped_read_counts.txt

Note: the “mature_counts.csv” needs to be transposed prior running the statistical analysis. This can be done either user the R script or using a script called “transpose_csv.py”.

Let’s initially create a “DESeq2” folder and copy the files needed for the statistical analysis:

...

Overview

  • Create a metadata “samplesheet.csv” for small RNAseq datasets.

  • Learn to use a “nextflow.config” file in the working directory to override Nextflow parameters (e.g., specify where to find the pipeline assets).

  • Learn how to prepare a PBS script to run the expression profiling of small RNAs against the reference miRBase database annotated microRNAs.

Preparing the pipeline inputs

The pipeline requires preparing at least 2 files:

  • Metadata file (samplesheet.csv) thatspecifies the name of the samples, location of FASTQ files ('Read 1' and ‘Read 2’), and strandedness (forward, reverse, or auto. Note: auto is used when the strandedness of the data is unknown)

  • PBS Pro script (launch_nf-core_RNAseq_QC.pbs) with instructions to run the pipeline

  • Nextflow.config - revision 2.3.1 of the nf-core/smrnaseq pipeline may not be able to identify the location of reference adapter sequences, thus, we will use a local nextflow.config file to tell Nextflow where to find the reference adapters necessary to trim the raw small_RNA-Seq data

A. Create the metadata file (samplesheet.csv):

Change to the data folder directory:

Code Block
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease

Copy the bash script to the working folder

Code Block
cp /work/training/2024/smallRNAseq/scripts/create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease
  • Note: you could replace ‘$HOME/workshop/data’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located

View the content of the script:

Code Block
cat create_nf-core_smallRNAseq_samplesheet.sh

...

NOTE: modify ‘read1_extension’ as appropriate for your data. For example: _1.fastq.gz or _R1_001.fastq.gz or _R1.fq.gz , etc

Let’s generate the metadata file by running the following command:

Code Block
sh create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease

Check the newly created samplesheet.csv file:

Code Block
cat samplesheet.csv

sample,fastq_1

ERR409878,/work/training/2024/smallRNAseq/data/human_disease/ERR409878.fastq.gz

ERR409879,/work/training/2024/smallRNAseq/data/human_disease/ERR409879.fastq.gz

ERR409880,/work/training/2024/smallRNAseq/data/human_disease/ERR409880.fastq.gz

ERR409881,/work/training/2024/smallRNAseq/data/human_disease/ERR409881.fastq.gz

ERR409882,/work/training/2024/smallRNAseq/data/human_disease/ERR409882.fastq.gz

ERR409883,/work/training/2024/smallRNAseq/data/human_disease/ERR409883.fastq.gz

ERR409884,/work/training/2024/smallRNAseq/data/human_disease/ERR409884.fastq.gz

ERR409885,/work/training/2024/smallRNAseq/data/human_disease/ERR409885.fastq.gz

ERR409886,/work/training/2024/smallRNAseq/data/human_disease/ERR409886.fastq.gz

ERR409887,/work/training/2024/smallRNAseq/data/human_disease/ERR409887.fastq.gz

ERR409888,/work/training/2024/smallRNAseq/data/human_disease/ERR409888.fastq.gz

ERR409889,/work/training/2024/smallRNAseq/data/human_disease/ERR409889.fastq.gz

ERR409890,/work/training/2024/smallRNAseq/data/human_disease/ERR409890.fastq.gz

ERR409891,/work/training/2024/smallRNAseq/data/human_disease/ERR409891.fastq.gz

ERR409892,/work/training/2024/smallRNAseq/data/human_disease/ERR409892.fastq.gz

ERR409893,/work/training/2024/smallRNAseq/data/human_disease/ERR409893.fastq.gz

ERR409894,/work/training/2024/smallRNAseq/data/human_disease/ERR409894.fastq.gz

ERR409895,/work/training/2024/smallRNAseq/data/human_disease/ERR409895.fastq.gz

ERR409896,/work/training/2024/smallRNAseq/data/human_disease/ERR409896.fastq.gz

ERR409897,/work/training/2024/smallRNAseq/data/human_disease/ERR409897.fastq.gz

ERR409898,/work/training/2024/smallRNAseq/data/human_disease/ERR409898.fastq.gz

ERR409899,/work/training/2024/smallRNAseq/data/human_disease/ERR409899.fastq.gz

ERR409900,/work/training/2024/smallRNAseq/data/human_disease/ERR409900.fastq.gz

B. Prepare PBS Pro script to run the nf-core/smrnaseq pipeline

Copy the PBS Pro script for running the full small RNAseq pipeline (launch_nf-core_smallRNAseq_miRBase.pbs)

Copy and paste the code below to the terminal:

Code Block
cp $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease/samplesheet.csv $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/launch_nf-core_smallRNAseq_miRBase.pbs $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/nextflow.config $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
  • Line 1: Copy the samplesheet.csv file to the working directory

  • Line 2: Copy the launch_nf-core_smallRNAseq_human.pbs submission script to the working directory

  • Line 3: Copy the nextflow.config file from shared folder to my working directory.

  • Line 4: move to the working directory

View the content of the launch_nf-core_RNAseq_QC.pbs script:

Code Block
cat launch_nf-core_smallRNAseq_miRBase.pbs 

...

TIP: when running the nf-core/smrnaseq pipeline (release 2.3.1) the pipeline is not able to find the location of the reference adapter sequences for trimming of the raw small RNAseq pipeline, so we need to specify where to find the folder where the adapter sequences file is located. To do this, we prepare a “nextflow.config” file (see below). This file should be already in your working directory. Print the content as follows:

Code Block
cat nextflow.config
Code Block
singularity {
    runOptions = '-B $HOME/.nextflow/assets/nf-core/smrnaseq/assets'
}

Note: if a config file is placed in the working folder it can override parameters define by the global ~/.nextflow/config file or the config file define as part of the pipeline.

Submit the job to the HPC cluster:

Code Block
qsub launch_nf-core_smallRNAseq_miRBase.pbs

Monitor the progress:

Code Block
qjobs

The job will take several hours to run, hence we will use precomputed results for the statistical analysis in the next section.

Outputs

The pipeline will produce two folders, one called “work,” where all the processing is done, and another called “results,” where we can find the pipeline's outputs. The content of the results folder is as follows:

Code Block
results/
├── bowtie_index
│   ├── mirna_hairpin
│   └── mirna_mature
├── fastp
│   └── on_raw
├── fastqc
│   ├── raw
│   └── trimmed
├── mirna_quant
│   ├── bam
│   ├── edger_qc    <----- Expression mature miRNA (mature_counts.csv) and precursor-miRNAs (haripin_counts.csv) counts can be found in this subfolder. 
│   ├── mirtop
│   ├── reference
│   └── seqcluster
├── mirtrace
│   ├── mirtrace-report.html
│   ├── mirtrace-results.json
│   ├── mirtrace-stats-contamination_basic.tsv
│   ├── mirtrace-stats-contamination_detailed.tsv
│   ├── mirtrace-stats-length.tsv
│   ├── mirtrace-stats-mirna-complexity.tsv
│   ├── mirtrace-stats-phred.tsv
│   ├── mirtrace-stats-qcstatus.tsv
│   ├── mirtrace-stats-rnatype.tsv
│   ├── qc_passed_reads.all.collapsed
│   └── qc_passed_reads.rnatype_unknown.collapsed
├── multiqc
│   ├── multiqc_data
│   ├── multiqc_plots
│   └── multiqc_report.html
└── pipeline_info
    ├── execution_report_2024-08-20_16-55-53.html
    ├── execution_timeline_2024-08-20_16-55-53.html
    ├── execution_trace_2024-08-20_16-55-53.txt
    ├── nf_core_smrnaseq_software_mqc_versions.yml
    ├── params_2024-08-20_16-56-04.json
    └── pipeline_dag_2024-08-20_16-55-53.html

The quantification of the mature miRNA and hairpin expressions can be found in the /results/mirna_quant/edger_qc directory.

Code Block
cd /results/mirna_quant/edger_qc
Code Block
hairpin_counts.csv
hairpin_CPM_heatmap.pdf
hairpin_edgeR_MDS_distance_matrix.txt
hairpin_edgeR_MDS_plot_coordinates.txt
hairpin_edgeR_MDS_plot.pdf
hairpin_log2CPM_sample_distances_dendrogram.pdf
hairpin_log2CPM_sample_distances_heatmap.pdf
hairpin_log2CPM_sample_distances.txt
hairpin_logtpm.csv
hairpin_logtpm.txt
hairpin_normalized_CPM.txt
hairpin_unmapped_read_counts.txt
mature_counts.csv      <----- Expression mature miRNA. This file will be used to identify differentially expressed miRNAs (Session 7)
mature_CPM_heatmap.pdf
mature_edgeR_MDS_distance_matrix.txt
mature_edgeR_MDS_plot_coordinates.txt
mature_edgeR_MDS_plot.pdf
mature_log2CPM_sample_distances_dendrogram.pdf
mature_log2CPM_sample_distances_heatmap.pdf
mature_log2CPM_sample_distances.txt
mature_logtpm.csv
mature_logtpm.txt
mature_normalized_CPM.txt
mature_unmapped_read_counts.txt

Let’s inspect the mature.csv file. Let’s use the ‘cat’ command to print it on the screen:

Code Block
cat mature_counts.csv
Code Block
"","hsa-let-7a-5p","hsa-let-7a-3p","hsa-let-7a-2-3p","hsa-let-7b-5p","hsa-let-7b-3p","hsa-let-7c-5p","hsa-let-7c-3p","hsa-let-7d-5p","hsa-let-7d-3p","hsa-
"ERR409882",364608,341,16,59417,1998,68342,44,14861,3790,29486,207,211184,228,1462,7002,2,49664,1,1091,174,326,43,6,468,7,1482,1615,9,17256,534,573,6526,0
"ERR409879",305651,184,6,52115,1476,58425,30,12397,2659,23604,201,198778,151,1013,5486,1,48381,4,945,202,194,40,7,368,3,1097,1317,6,12662,561,372,3693,2,1
"ERR409881",712880,165,9,83857,2335,162724,83,30556,4503,68044,385,456864,348,1818,9893,0,111712,5,1495,259,174,48,6,318,2,1466,2220,4,17865,466,551,10360
"ERR409884",182178,111,3,27892,913,39989,21,7751,1886,13902,159,127386,132,743,3651,3,40311,0,629,117,97,21,11,305,2,1147,902,2,8313,368,242,2276,0,1146,4
"ERR409889",568269,257,13,92339,2239,100021,45,20819,3511,44172,207,276474,259,1376,12407,5,83908,5,1971,467,403,70,30,1082,7,3082,3172,14,24112,819,421,6
"ERR409894",314053,137,9,44708,1220,74145,74,12313,2827,25295,196,196866,158,896,4681,3,43677,1,806,138,131,22,7,296,3,1181,1169,5,11145,611,360,3742,5,12
"ERR409887",178201,48,4,25678,733,41506,27,7833,1613,15724,121,123391,98,497,3288,0,39434,1,445,97,65,15,3,150,2,539,461,3,5837,186,161,2958,2,847,3,1544,
"ERR409880",318121,136,3,46347,1260,65606,39,11095,2269,24585,200,191072,194,1118,5599,2,67420,3,1242,155,168,22,2,505,6,1708,1836,3,11293,482,359,3652,1,
"ERR409890",332579,105,7,40131,955,73537,38,13528,2029,31807,158,207846,175,962,5146,0,42402,0,659,149,102,20,4,219,3,964,1086,4,11957,423,385,6017,4,1556

Note: the “mature_counts.csv” needs to be transposed prior running the statistical analysis. This can be done either user the R script or using a script called “transpose_csv.py”.

Let’s copy the transpose_csv.py script to the working folder:

Code Block
cp /work/training/2024/smallRNAseq/scripts/transpose_csv.py .

The check how to use the script do the following:

Code Block
python transpose_csv.py --help
Code Block
usage: transpose_csv.py [-h] --input INPUT --output OUTPUT

Transpose a CSV file and generate a tab-delimited TXT file.

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    Input CSV file containing mature miRNA counts.
  --output OUTPUT  Output tab-delimited TXT file.

To transpose the initial “mature_counst.csv” file do the following:

Code Block
python transpose_csv.py --input mature_counts.csv --out mature_counts.txt

Let’s now print the transposed mature counts table:

Code Block
cat mature_counts.txt
Code Block