Overview
Create a metadata “samplesheet.csv” for small RNAseq datasets.
Learn to use a “nextflow.config” file in the working directory to override Nextflow parameters (e.g., specify where to find the pipeline assets).
Learn how to prepare a PBS script that runs expression profiling of small RNAs against the microRNAs annotated in the miRBase reference database.
Preparing the pipeline inputs
The pipeline requires preparing three files:
Metadata file (samplesheet.csv) that specifies the sample names and the location of the FASTQ files (columns 'sample' and 'fastq_1')
PBS Pro script (launch_nf-core_smallRNAseq_miRBase.pbs) with instructions to run the pipeline
Config file (nextflow.config): revision 2.3.1 of the nf-core/smrnaseq pipeline may not be able to locate the reference adapter sequences, so we use a local nextflow.config file to tell Nextflow where to find the adapters needed to trim the raw small RNAseq data
A. Create the metadata file (samplesheet.csv):
Change to the data folder directory:
cd $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease
Copy the bash script to the working folder
cp /work/training/2024/smallRNAseq/scripts/create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease
Note: you could replace the destination path ‘$HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease’ with “.”. A dot indicates the current directory, so the file is copied to the directory where you are currently located.
View the content of the script:
cat create_nf-core_smallRNAseq_samplesheet.sh
NOTE: modify ‘read1_extension’ as appropriate for your data. For example: _1.fastq.gz or _R1_001.fastq.gz or _R1.fq.gz , etc
Let’s generate the metadata file by running the following command:
sh create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease
Check the newly created samplesheet.csv file:
cat samplesheet.csv
sample,fastq_1
ERR409878,/work/training/2024/smallRNAseq/data/human_disease/ERR409878.fastq.gz
ERR409879,/work/training/2024/smallRNAseq/data/human_disease/ERR409879.fastq.gz
ERR409880,/work/training/2024/smallRNAseq/data/human_disease/ERR409880.fastq.gz
ERR409881,/work/training/2024/smallRNAseq/data/human_disease/ERR409881.fastq.gz
ERR409882,/work/training/2024/smallRNAseq/data/human_disease/ERR409882.fastq.gz
ERR409883,/work/training/2024/smallRNAseq/data/human_disease/ERR409883.fastq.gz
ERR409884,/work/training/2024/smallRNAseq/data/human_disease/ERR409884.fastq.gz
ERR409885,/work/training/2024/smallRNAseq/data/human_disease/ERR409885.fastq.gz
ERR409886,/work/training/2024/smallRNAseq/data/human_disease/ERR409886.fastq.gz
ERR409887,/work/training/2024/smallRNAseq/data/human_disease/ERR409887.fastq.gz
ERR409888,/work/training/2024/smallRNAseq/data/human_disease/ERR409888.fastq.gz
ERR409889,/work/training/2024/smallRNAseq/data/human_disease/ERR409889.fastq.gz
ERR409890,/work/training/2024/smallRNAseq/data/human_disease/ERR409890.fastq.gz
ERR409891,/work/training/2024/smallRNAseq/data/human_disease/ERR409891.fastq.gz
ERR409892,/work/training/2024/smallRNAseq/data/human_disease/ERR409892.fastq.gz
ERR409893,/work/training/2024/smallRNAseq/data/human_disease/ERR409893.fastq.gz
ERR409894,/work/training/2024/smallRNAseq/data/human_disease/ERR409894.fastq.gz
ERR409895,/work/training/2024/smallRNAseq/data/human_disease/ERR409895.fastq.gz
ERR409896,/work/training/2024/smallRNAseq/data/human_disease/ERR409896.fastq.gz
ERR409897,/work/training/2024/smallRNAseq/data/human_disease/ERR409897.fastq.gz
ERR409898,/work/training/2024/smallRNAseq/data/human_disease/ERR409898.fastq.gz
ERR409899,/work/training/2024/smallRNAseq/data/human_disease/ERR409899.fastq.gz
ERR409900,/work/training/2024/smallRNAseq/data/human_disease/ERR409900.fastq.gz
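The content of create_nf-core_smallRNAseq_samplesheet.sh is not reproduced here. As a rough illustration of the same idea, the sketch below builds a two-column samplesheet from a folder of FASTQ files. It is a minimal sketch, not the workshop script: the function name build_samplesheet and its arguments are hypothetical, and it assumes one single-end FASTQ file per sample whose name ends with the chosen read 1 extension.

```python
import csv
from pathlib import Path


def build_samplesheet(fastq_dir, out_csv, read1_extension=".fastq.gz"):
    """Write a 'sample,fastq_1' samplesheet for the FASTQ files in fastq_dir.

    Hypothetical helper for illustration only; it assumes one single-end
    FASTQ file per sample, named <sample><read1_extension>.
    """
    fastq_dir = Path(fastq_dir).resolve()
    rows = []
    for fastq in sorted(fastq_dir.glob(f"*{read1_extension}")):
        # Sample name = file name with the read 1 extension removed.
        sample = fastq.name[: -len(read1_extension)]
        rows.append((sample, str(fastq)))

    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["sample", "fastq_1"])  # header shown in the samplesheet above
        writer.writerows(rows)
    return rows
```

Calling build_samplesheet(".", "samplesheet.csv") from the data folder would list every file ending in .fastq.gz with its absolute path; pass a different read1_extension (for example "_R1.fq.gz") if your files are named differently.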
B. Prepare PBS Pro script to run the nf-core/smrnaseq pipeline
Copy the PBS Pro script for running the full small RNAseq pipeline (launch_nf-core_smallRNAseq_miRBase.pbs)
Copy and paste the code below to the terminal:
cp $HOME/workshop/2024-2/session6_smallRNAseq/data/human_disease/samplesheet.csv $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/launch_nf-core_smallRNAseq_miRBase.pbs $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cp /work/training/2024/smallRNAseq/scripts/nextflow.config $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
cd $HOME/workshop/2024-2/session6_smallRNAseq/runs/run1_human_miRBase
Line 1: Copy the samplesheet.csv file to the working directory
Line 2: Copy the launch_nf-core_smallRNAseq_miRBase.pbs submission script to the working directory
Line 3: Copy the nextflow.config file from the shared folder to the working directory
Line 4: Move to the working directory
View the content of the launch_nf-core_smallRNAseq_miRBase.pbs script:
cat launch_nf-core_smallRNAseq_miRBase.pbs
TIP: release 2.3.1 of the nf-core/smrnaseq pipeline is not able to find the reference adapter sequences used to trim the raw small RNAseq data, so we need to specify the folder where the adapter sequences file is located. To do this, we prepare a “nextflow.config” file (see below). This file should already be in your working directory. Print its content as follows:
cat nextflow.config
singularity {
    runOptions = '-B $HOME/.nextflow/assets/nf-core/smrnaseq/assets'
}
Note: a config file placed in the working folder can override parameters defined in the global ~/.nextflow/config file or in the config file that ships with the pipeline.
Submit the job to the HPC cluster:
qsub launch_nf-core_smallRNAseq_miRBase.pbs
Monitor the progress:
qjobs
The job will take several hours to run, hence we will use precomputed results for the statistical analysis in the next section.
Outputs
The pipeline will produce two folders, one called “work,” where all the processing is done, and another called “results,” where we can find the pipeline's outputs. The content of the results folder is as follows:
results/
├── bowtie_index
│   ├── mirna_hairpin
│   └── mirna_mature
├── fastp
│   └── on_raw
├── fastqc
│   ├── raw
│   └── trimmed
├── mirna_quant
│   ├── bam
│   ├── edger_qc   <----- Expression mature miRNA (mature_counts.csv) and precursor-miRNAs (hairpin_counts.csv) counts can be found in this subfolder.
│   ├── mirtop
│   ├── reference
│   └── seqcluster
├── mirtrace
│   ├── mirtrace-report.html
│   ├── mirtrace-results.json
│   ├── mirtrace-stats-contamination_basic.tsv
│   ├── mirtrace-stats-contamination_detailed.tsv
│   ├── mirtrace-stats-length.tsv
│   ├── mirtrace-stats-mirna-complexity.tsv
│   ├── mirtrace-stats-phred.tsv
│   ├── mirtrace-stats-qcstatus.tsv
│   ├── mirtrace-stats-rnatype.tsv
│   ├── qc_passed_reads.all.collapsed
│   └── qc_passed_reads.rnatype_unknown.collapsed
├── multiqc
│   ├── multiqc_data
│   ├── multiqc_plots
│   └── multiqc_report.html
└── pipeline_info
    ├── execution_report_2024-08-20_16-55-53.html
    ├── execution_timeline_2024-08-20_16-55-53.html
    ├── execution_trace_2024-08-20_16-55-53.txt
    ├── nf_core_smrnaseq_software_mqc_versions.yml
    ├── params_2024-08-20_16-56-04.json
    └── pipeline_dag_2024-08-20_16-55-53.html
The quantification of mature miRNA and hairpin expression can be found in the results/mirna_quant/edger_qc directory, relative to the run directory:
cd results/mirna_quant/edger_qc
hairpin_counts.csv
hairpin_CPM_heatmap.pdf
hairpin_edgeR_MDS_distance_matrix.txt
hairpin_edgeR_MDS_plot_coordinates.txt
hairpin_edgeR_MDS_plot.pdf
hairpin_log2CPM_sample_distances_dendrogram.pdf
hairpin_log2CPM_sample_distances_heatmap.pdf
hairpin_log2CPM_sample_distances.txt
hairpin_logtpm.csv
hairpin_logtpm.txt
hairpin_normalized_CPM.txt
hairpin_unmapped_read_counts.txt
mature_counts.csv   <----- Expression mature miRNA. This file will be used to identify differentially expressed miRNAs (Session 7)
mature_CPM_heatmap.pdf
mature_edgeR_MDS_distance_matrix.txt
mature_edgeR_MDS_plot_coordinates.txt
mature_edgeR_MDS_plot.pdf
mature_log2CPM_sample_distances_dendrogram.pdf
mature_log2CPM_sample_distances_heatmap.pdf
mature_log2CPM_sample_distances.txt
mature_logtpm.csv
mature_logtpm.txt
mature_normalized_CPM.txt
mature_unmapped_read_counts.txt
Let’s inspect the mature_counts.csv file, using the ‘cat’ command to print it on the screen:
cat mature_counts.csv
"","hsa-let-7a-5p","hsa-let-7a-3p","hsa-let-7a-2-3p","hsa-let-7b-5p","hsa-let-7b-3p","hsa-let-7c-5p","hsa-let-7c-3p","hsa-let-7d-5p","hsa-let-7d-3p","hsa-
"ERR409882",364608,341,16,59417,1998,68342,44,14861,3790,29486,207,211184,228,1462,7002,2,49664,1,1091,174,326,43,6,468,7,1482,1615,9,17256,534,573,6526,0
"ERR409879",305651,184,6,52115,1476,58425,30,12397,2659,23604,201,198778,151,1013,5486,1,48381,4,945,202,194,40,7,368,3,1097,1317,6,12662,561,372,3693,2,1
"ERR409881",712880,165,9,83857,2335,162724,83,30556,4503,68044,385,456864,348,1818,9893,0,111712,5,1495,259,174,48,6,318,2,1466,2220,4,17865,466,551,10360
"ERR409884",182178,111,3,27892,913,39989,21,7751,1886,13902,159,127386,132,743,3651,3,40311,0,629,117,97,21,11,305,2,1147,902,2,8313,368,242,2276,0,1146,4
"ERR409889",568269,257,13,92339,2239,100021,45,20819,3511,44172,207,276474,259,1376,12407,5,83908,5,1971,467,403,70,30,1082,7,3082,3172,14,24112,819,421,6
"ERR409894",314053,137,9,44708,1220,74145,74,12313,2827,25295,196,196866,158,896,4681,3,43677,1,806,138,131,22,7,296,3,1181,1169,5,11145,611,360,3742,5,12
"ERR409887",178201,48,4,25678,733,41506,27,7833,1613,15724,121,123391,98,497,3288,0,39434,1,445,97,65,15,3,150,2,539,461,3,5837,186,161,2958,2,847,3,1544,
"ERR409880",318121,136,3,46347,1260,65606,39,11095,2269,24585,200,191072,194,1118,5599,2,67420,3,1242,155,168,22,2,505,6,1708,1836,3,11293,482,359,3652,1,
"ERR409890",332579,105,7,40131,955,73537,38,13528,2029,31807,158,207846,175,962,5146,0,42402,0,659,149,102,20,4,219,3,964,1086,4,11957,423,385,6017,4,1556
Note: the “mature_counts.csv” file needs to be transposed prior to running the statistical analysis. This can be done either in R or with a script called “transpose_csv.py”.
Let’s copy the transpose_csv.py script to the working folder:
cp /work/training/2024/smallRNAseq/scripts/transpose_csv.py .
To check how to use the script, run the following:
python transpose_csv.py --help
usage: transpose_csv.py [-h] --input INPUT --output OUTPUT

Transpose a CSV file and generate a tab-delimited TXT file.

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    Input CSV file containing mature miRNA counts.
  --output OUTPUT  Output tab-delimited TXT file.
To transpose the initial “mature_counts.csv” file do the following:
python transpose_csv.py --input mature_counts.csv --output mature_counts.txt
Let’s now print the transposed mature counts table:
cat mature_counts.txt
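The source of transpose_csv.py is not shown in this section. For readers who want to see what such a transposition involves, here is a minimal sketch using only the Python standard library. It is an illustration, not the workshop script: the function name transpose_counts is hypothetical, and it assumes the input is a comma-separated table with samples as rows and miRNAs as columns (as in mature_counts.csv above) in which every row has the same number of fields.

```python
import csv


def transpose_counts(input_csv, output_txt):
    """Transpose a CSV count table and write it as tab-delimited text.

    Hypothetical helper for illustration; assumes every row of the input
    has the same number of fields (zip would silently drop extras).
    """
    with open(input_csv, newline="") as handle:
        rows = list(csv.reader(handle))

    # zip(*rows) turns each column of the input into a row of the output,
    # so samples become columns and miRNAs become rows.
    transposed = list(zip(*rows))

    with open(output_txt, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(transposed)
    return len(transposed)
```

Under these assumptions, transpose_counts("mature_counts.csv", "mature_counts.txt") would produce a table with one miRNA per row and one sample per column, which is the orientation the note above says the statistical analysis expects.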