
...

Overview

Create a metadata “samplesheet.csv” for small RNAseq datasets.
Learn to use a “nextflow.config” file in the working directory to override Nextflow parameters (e.g., specify where to find the pipeline assets).
Learn how to prepare a PBS script to run the expression profiling of small RNAs against the reference miRBase database annotated microRNAs.

Preparing the pipeline inputs

The pipeline requires preparing the following 3 files:

Metadata file (samplesheet.csv) that specifies the “sample name” and “location of FASTQ files” ('Read 1')
PBS Pro script (launch_nf-core_smallRNAseq_miRBase.pbs) with instructions to run the pipeline
nextflow.config file: revision 2.3.1 of the nf-core/smrnaseq pipeline may not be able to locate the reference adapter sequences, so we will use a local nextflow.config file to tell Nextflow where to find the reference adapters needed to trim the raw small RNA-Seq data

A. Create the metadata file (samplesheet.csv):

Change to the data folder directory:

Code Block
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease

Copy the bash script to the working folder

Code Block
cp /work/training/2024/smallRNAseq/scripts/create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease

Note: you could replace the destination path ‘$HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease’ with “.”. A dot indicates the ‘current directory’, so the file is copied to the directory where you are currently located.
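A minimal sketch of the “.” shorthand. It uses throwaway temporary folders and a hypothetical file name (demo_script.sh) instead of the workshop paths, so it can be tried anywhere:

```shell
# Hypothetical stand-ins for the shared scripts folder and your data folder.
src_dir=$(mktemp -d)
dest_dir=$(mktemp -d)
echo 'echo "hello"' > "$src_dir/demo_script.sh"

# Move into the destination first, then copy using "." as the target.
cd "$dest_dir"
cp "$src_dir/demo_script.sh" .

# The file is now in the directory you are currently located in.
ls -l demo_script.sh
```

The dot only works as intended if you have already changed to the destination folder, which is why the cd step comes first.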

View the content of the script:

Code Block
cat create_nf-core_smallRNAseq_samplesheet.sh

...

NOTE: modify ‘read1_extension’ as appropriate for your data, for example: _1.fastq.gz, _R1_001.fastq.gz or _R1.fq.gz.
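To see why the extension matters, the sketch below previews the sample names you would get after stripping a given suffix. The file names and the variable read1_extension are illustrative, and the real samplesheet script may derive names differently, so treat this only as a quick check of your own file naming:

```shell
# Mock FASTQ files in a temporary folder (hypothetical names).
fastq_dir=$(mktemp -d)
touch "$fastq_dir/sampleA_R1.fq.gz" "$fastq_dir/sampleB_R1.fq.gz"

# The suffix to strip; change this to match your own files.
read1_extension="_R1.fq.gz"

# Strip the folder and the extension to preview the sample names.
sample_names=$(
  for fq in "$fastq_dir"/*"$read1_extension"; do
    basename "$fq" "$read1_extension"
  done
)
echo "$sample_names"
```

If the suffix does not match your files, the loop finds nothing to strip, which is a hint that ‘read1_extension’ needs adjusting.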

Let’s generate the metadata file by running the following command:

Code Block
sh create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease

Check the newly created samplesheet.csv file:

Code Block
cat samplesheet.csv

sample,fastq_1
ERR409878,/work/training/2024/smallRNAseq/data/human_disease/ERR409878.fastq.gz
ERR409879,/work/training/2024/smallRNAseq/data/human_disease/ERR409879.fastq.gz
ERR409880,/work/training/2024/smallRNAseq/data/human_disease/ERR409880.fastq.gz
ERR409881,/work/training/2024/smallRNAseq/data/human_disease/ERR409881.fastq.gz
ERR409882,/work/training/2024/smallRNAseq/data/human_disease/ERR409882.fastq.gz
ERR409883,/work/training/2024/smallRNAseq/data/human_disease/ERR409883.fastq.gz
ERR409884,/work/training/2024/smallRNAseq/data/human_disease/ERR409884.fastq.gz
ERR409885,/work/training/2024/smallRNAseq/data/human_disease/ERR409885.fastq.gz
ERR409886,/work/training/2024/smallRNAseq/data/human_disease/ERR409886.fastq.gz
ERR409887,/work/training/2024/smallRNAseq/data/human_disease/ERR409887.fastq.gz
ERR409888,/work/training/2024/smallRNAseq/data/human_disease/ERR409888.fastq.gz
ERR409889,/work/training/2024/smallRNAseq/data/human_disease/ERR409889.fastq.gz
ERR409890,/work/training/2024/smallRNAseq/data/human_disease/ERR409890.fastq.gz
ERR409891,/work/training/2024/smallRNAseq/data/human_disease/ERR409891.fastq.gz
ERR409892,/work/training/2024/smallRNAseq/data/human_disease/ERR409892.fastq.gz
ERR409893,/work/training/2024/smallRNAseq/data/human_disease/ERR409893.fastq.gz
ERR409894,/work/training/2024/smallRNAseq/data/human_disease/ERR409894.fastq.gz
ERR409895,/work/training/2024/smallRNAseq/data/human_disease/ERR409895.fastq.gz
ERR409896,/work/training/2024/smallRNAseq/data/human_disease/ERR409896.fastq.gz
ERR409897,/work/training/2024/smallRNAseq/data/human_disease/ERR409897.fastq.gz
ERR409898,/work/training/2024/smallRNAseq/data/human_disease/ERR409898.fastq.gz
ERR409899,/work/training/2024/smallRNAseq/data/human_disease/ERR409899.fastq.gz
ERR409900,/work/training/2024/smallRNAseq/data/human_disease/ERR409900.fastq.gz
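Beyond reading the file by eye, a quick sanity check can catch a wrong header or a FASTQ path that does not exist. The following is a minimal sketch that runs on a mock samplesheet in a temporary folder (the sample names S1 and S2 are hypothetical); to use it on real data, point it at your own samplesheet.csv instead:

```shell
# Build a small mock samplesheet in a temporary folder (hypothetical files).
work_dir=$(mktemp -d)
touch "$work_dir/S1.fastq.gz"
printf 'sample,fastq_1\nS1,%s/S1.fastq.gz\nS2,%s/S2.fastq.gz\n' \
  "$work_dir" "$work_dir" > "$work_dir/samplesheet.csv"

# Check the header line.
header=$(head -n 1 "$work_dir/samplesheet.csv")
[ "$header" = "sample,fastq_1" ] && echo "header OK"

# Report every sample whose FASTQ path (column 2) does not exist on disk.
missing=$(
  tail -n +2 "$work_dir/samplesheet.csv" | while IFS=, read -r sample fastq; do
    [ -f "$fastq" ] || echo "$sample"
  done
)
echo "samples with missing FASTQ files: ${missing:-none}"
```

In this mock, only S1.fastq.gz was created, so S2 is reported as missing.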

B. Prepare PBS Pro script to run the nf-core/smrnaseq pipeline

Copy the PBS Pro script for running the full small RNAseq pipeline (launch_nf-core_smallRNAseq_miRBase.pbs)

Copy and paste the code below to the terminal:

Code Block
cp $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease/samplesheet.csv $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/launch_nf-core_smallRNAseq_miRBase.pbs $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/nextflow.config $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase

Line 1: Copy the samplesheet.csv file to the working directory
Line 2: Copy the launch_nf-core_smallRNAseq_miRBase.pbs submission script to the working directory
Line 3: Copy the nextflow.config file from the shared folder to the working directory
Line 4: Move to the working directory

View the content of the launch_nf-core_smallRNAseq_miRBase.pbs script:

Code Block
cat launch_nf-core_smallRNAseq_miRBase.pbs

#!/bin/bash -l
#PBS -N nfsmallRNAseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#run the tasks in the current working directory
cd $PBS_O_WORKDIR

#load java and assign up to 4GB RAM memory for nextflow to use
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

#run the small RNAseq pipeline
nextflow run nf-core/smrnaseq -r 2.3.1 \
-profile singularity \
--outdir results \
--input samplesheet.csv \
--genome GRCh38-local \
--mirtrace_species hsa \
--three_prime_adapter 'TGGAATTCTCGGGTGCCAAGG' \
--fastp_min_length 18 \
--fastp_max_length 30 \
--hairpin /work/training/smallRNAseq/data/mirbase/hairpin.fa \
--mature /work/training/smallRNAseq/data/mirbase/mature.fa \
--mirna_gtf /work/training/smallRNAseq/data/mirbase/hsa.gff3 \

Print the “nextflow.config” file (see below). Note: a config file placed in the working folder can override parameters defined by the global ~/.nextflow/config file or by the config file defined as part of the pipeline.
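As a rough illustration of what such a local override can look like, the sketch below writes a small nextflow.config into a throwaway folder. The parameter name fastp_known_mirna_adapters and the path /path/to/known_adapters.fa are assumptions made for illustration; check the parameter documentation of the pipeline revision you run, and use the nextflow.config supplied with the workshop for the actual exercise:

```shell
# Write a local nextflow.config into a throwaway run folder.
run_dir=$(mktemp -d)
cat > "$run_dir/nextflow.config" <<'EOF'
// Assumption: the parameter name and the path below are illustrative only.
params {
    fastp_known_mirna_adapters = '/path/to/known_adapters.fa'
}
EOF

# Nextflow picks up ./nextflow.config when it is launched from this folder.
cat "$run_dir/nextflow.config"
```

Keeping the override in the run folder means it applies only to runs launched from that folder and leaves the global configuration untouched.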

...

Code Block
mkdir -p $HOME/workshop/small_RNAseq/scripts
cp /work/training/smallRNAseq/scripts/* $HOME/workshop/small_RNAseq/scripts/
ls -l $HOME/workshop/small_RNAseq/scripts/

Line 1: The -p option creates parent directories as required. Thus the command creates $HOME/workshop/, $HOME/workshop/small_RNAseq/ and the subfolder $HOME/workshop/small_RNAseq/scripts/ if they do not already exist
Line 2: Copies all files (as denoted by the asterisk) from /work/training/smallRNAseq/scripts/ to the newly created folder $HOME/workshop/small_RNAseq/scripts/
Line 3: Lists the files in the scripts folder
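The same three steps can be tried safely on throwaway folders. This sketch uses temporary directories and hypothetical file names (a.sh, b.pbs) in place of the shared workshop folder:

```shell
# Throwaway stand-ins for the shared folder and your home folder.
shared=$(mktemp -d)
home_demo=$(mktemp -d)
touch "$shared/a.sh" "$shared/b.pbs"

# -p creates workshop/, small_RNAseq/ and scripts/ in one go,
# and does not fail if any of them already exist.
mkdir -p "$home_demo/workshop/small_RNAseq/scripts"

# The asterisk expands to every file in the source folder.
cp "$shared"/* "$home_demo/workshop/small_RNAseq/scripts/"

# Confirm that the files arrived.
ls -l "$home_demo/workshop/small_RNAseq/scripts/"
```

Without -p, mkdir would stop with an error because the intermediate folders do not exist yet.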

Copy multiple subdirectories and files using rsync

Code Block
mkdir -p $HOME/workshop/small_RNAseq/data/
rsync -rv /work/training/smallRNAseq/data/ $HOME/workshop/small_RNAseq/data/

Line 1: Creates the folder $HOME/workshop/small_RNAseq/data/
Line 2: rsync copies all subfolders and files from the specified source folder to the selected destination folder. -r copies directories and files recursively; -v prints verbose messages about the file transfer

Create a folder for running the nf-core small RNA-seq pipeline

...

Code Block
mkdir -p $HOME/workshop/small_RNAseq
mkdir $HOME/workshop/small_RNAseq/run1_test
mkdir $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cd $HOME/workshop/small_RNAseq/

Lines 1-3: create the “small_RNAseq” folder and a sub-folder for each exercise
Line 4: change the directory to the folder “small_RNAseq”

Exercise 1: Running a test with nf-core sample data

First, let’s assess the execution of the nf-core/smrnaseq pipeline by running a test using sample data.

...

Code Block
cat launch_nf-core_smallRNAseq_test.pbs

#!/bin/bash -l
#PBS -N nfsmrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

# run the test
nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0

where:

nextflow command: nextflow run
pipeline name: nf-core/smrnaseq
pipeline version: -r 2.1.0
container type and sample data: -profile test,singularity
output directory: --outdir results
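One way to keep these pieces easy to change is to hold each in a variable and print the assembled command before running it. This is only a sketch: the variable names are hypothetical, and the command is echoed rather than executed:

```shell
# Each piece of the command, named after the list above.
pipeline="nf-core/smrnaseq"
version="2.1.0"
profile="test,singularity"
outdir="results"

# Assemble the command and print it instead of running it (dry run).
cmd="nextflow run $pipeline -profile $profile --outdir $outdir -r $version"
echo "$cmd"
```

Printing the command first makes it easy to spot a typo in the version or profile before the job is submitted to the scheduler.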

Submitting the job

Now we can submit the small RNAseq test job to the HPC scheduler:

...

Monitoring the Run

Code Block
qjobs

Exercise 2: Running the small RNA pipeline using public human data

The pipeline requires preparing at least 2 files:

Metadata file (samplesheet.csv) that specifies the “sample name” and “location of FASTQ files” ('Read 1').
PBS Pro script (launch_nf-core_smallRNAseq_human.pbs) with instructions to run the pipeline

Create the metadata file (samplesheet.csv):

Change to the data folder directory:

...

Code Block
cp /work/training/smallRNAseq/scripts/create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/small_RNAseq/data/human

Note: you could replace the destination path ‘$HOME/workshop/small_RNAseq/data/human’ with “.”. A dot indicates the ‘current directory’, so the file is copied to the directory where you are currently located.

View the content of the script:

Code Block
cat create_nf-core_smallRNAseq_samplesheet.sh

#!/bin/bash -l

#User defined variables.
##########################################################
#use double quotes so that $HOME is expanded by the shell
DIR="$HOME/workshop/small_RNAseq/data/human"
INDEX='samplesheet.csv'
##########################################################

#load python module
module load python/3.10.8-gcccore-12.2.0

#fetch the script to create the sample metadata table
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py

#generate initial sample metadata file
./fastq_dir_to_samplesheet.py "$DIR" index.csv \
--strandedness auto \
--read1_extension .fastq.gz

#format index file: keep only the first two columns (sample and fastq_1)
awk -F "," '{print $1 "," $2}' index.csv > "${INDEX}"

#Remove intermediate files:
rm index.csv fastq_dir_to_samplesheet.py
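The formatting step is the part that turns the intermediate table into the two-column samplesheet shown elsewhere on this page. The sketch below runs the same awk call on a mock index.csv; the four-column layout and the rows are assumptions made for illustration, not output copied from the real script:

```shell
# Mock of the intermediate index.csv (hypothetical rows, assumed column layout).
tmp=$(mktemp -d)
cat > "$tmp/index.csv" <<'EOF'
sample,fastq_1,fastq_2,strandedness
S1,/data/S1.fastq.gz,,auto
S2,/data/S2.fastq.gz,,auto
EOF

# Same awk call as in the script: print only fields 1 and 2, comma-separated.
awk -F "," '{print $1 "," $2}' "$tmp/index.csv" > "$tmp/samplesheet.csv"
cat "$tmp/samplesheet.csv"
```

Because single-end small RNA data has no second read, dropping the trailing columns leaves exactly the “sample,fastq_1” layout.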

...

Copy the PBS Pro script for running the full small RNAseq pipeline (launch_nf-core_smallRNAseq_human.pbs)

Copy and paste the code below to the terminal:

Code Block

cp $HOME/workshop/small_RNAseq/data/human/samplesheet.csv $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cp $HOME/workshop/small_RNAseq/scripts/launch_nf-core_smallRNAseq_human.pbs $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cd $HOME/workshop/small_RNAseq/run2_smallRNAseq_human

Line 1: Copy the samplesheet.csv file to the working directory
Line 2: copy the launch_nf-core_smallRNAseq_human.pbs submission script to the working directory
Line 3: move to the working directory

View the content of the launch_nf-core_smallRNAseq_human.pbs script:

Code Block
cat launch_nf-core_smallRNAseq_human.pbs

#!/bin/bash -l
#PBS -N nfsmallRNAseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#PBS -m abe

#run the tasks in the current working directory
cd $PBS_O_WORKDIR

#load java and assign up to 4GB RAM memory for nextflow to use
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

#run the small RNAseq pipeline
nextflow run nf-core/smrnaseq -r 2.1.0 \
-profile singularity \
--outdir results \
--input samplesheet.csv \
--genome GRCh38-local \
--mirtrace_species hsa \
--three_prime_adapter 'TGGAATTCTCGGGTGCCAAGG' \
--fastp_min_length 18 \
--fastp_max_length 30 \
--hairpin /work/training/smallRNAseq/data/mirbase/hairpin.fa \
--mature /work/training/smallRNAseq/data/mirbase/mature.fa \
--mirna_gtf /work/training/smallRNAseq/data/mirbase/hsa.gff3 \
-resume

...

Note: the “mature_counts.csv” file needs to be transposed prior to running the statistical analysis. This can be done either using the R script or using a script called “transpose_csv.py”.
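To make clear what transposing means here, the sketch below swaps rows and columns of a tiny mock table with awk. It is not the workshop's R script or “transpose_csv.py”; the table layout and values are hypothetical, and it assumes a plain CSV without quoted fields or embedded commas:

```shell
# Small mock counts table (hypothetical values; real files are much larger).
tmp=$(mktemp -d)
cat > "$tmp/counts.csv" <<'EOF'
sample,miR-1,miR-2
S1,10,0
S2,7,3
EOF

# Store every cell, then print each original column as a row.
awk -F "," '
  { for (i = 1; i <= NF; i++) cell[NR, i] = $i; if (NF > ncol) ncol = NF }
  END {
    for (i = 1; i <= ncol; i++) {
      row = cell[1, i]
      for (j = 2; j <= NR; j++) row = row "," cell[j, i]
      print row
    }
  }
' "$tmp/counts.csv" > "$tmp/counts_transposed.csv"
cat "$tmp/counts_transposed.csv"
```

After the transpose, each row describes one miRNA and each column one sample.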

Let’s initially create a “DESeq2” folder and copy the files needed for the statistical analysis:

...
