Table of Contents | ||
---|---|---|
|
Create working folder and copy data
Let’s create an interactive session on the HPC:
Code Block |
---|
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb |
Let’s create working folders for today. Copy and paste the following block of code into your terminal and hit enter for the last command:
Code Block |
---|
mkdir -p $HOME/workshop/ONTvariants
mkdir -p $HOME/workshop/ONTvariants/data
mkdir -p $HOME/workshop/ONTvariants/scripts
mkdir -p $HOME/workshop/ONTvariants/runs/run1_QC
mkdir -p $HOME/workshop/ONTvariants/runs/run2_mapping
mkdir -p $HOME/workshop/ONTvariants/runs/run3_variant_calling |
Now, let’s copy the scripts and data for today’s session:
Code Block |
---|
cp /work/training/ONTvariants/data/* $HOME/workshop/ONTvariants/data
cp /work/training/ONTvariants/scripts/* $HOME/workshop/ONTvariants/scripts
cd $HOME/workshop/ONTvariants |
Install tools using conda
Approach 1: Create a conda environment and install tools one at a time
Create a conda environment called ONTvariants_QC
...
Code Block |
---|
Collecting package metadata (current_repodata.json): done Solving environment: done ==> WARNING: A newer version of conda exists. <== current version: 4.12.0 latest version: 24.5.0 Please update conda by running $ conda update -n base -c defaults conda ## Package Plan ## environment location: /home/barrero/miniconda3/envs/ONTvariants_QC Proceed ([y]/n)? y Preparing transaction: done Verifying transaction: done Executing transaction: done # # To activate this environment, use # # $ conda activate ONTvariants_QC # # To deactivate an active environment, use # # $ conda deactivate |
Let’s activate the conda environment:
Code Block |
---|
conda activate ONTvariantONTvariants_QC |
Next, we need to install few tools for today’s exercises. Now let’s go the https://anaconda.org and search for the following tools and instructions on how to install them:
Code Block |
---|
nanoplot, porechop_abi, chopper, seqkit |
For example, search for nanoplot:
...
Next, let’s install chopper (Approx. 1min):
Code Block |
---|
conda install bioconda::chopper |
Finally, let’s install the seqkit suite of tools:
Code Block |
---|
conda install bioconda::seqkit |
Approach 2: Create environment and install tools all at once
This is a slower option, but it is convenient when installing many tools.
Prepare the following environment.yml file:
Code Block |
---|
name: ONTvariants_QC channels: - conda-forge - defaults - bioconda dependencies: - nanoplot - porechop_abi - chopper |
Create a new environment:
Code Block |
---|
cd $HOME/workshop/ONTvariants/scripts
conda env create -f environment_QC.yml |
Running QC
Now that we have installed all the tools needed for the QC of Nanopore reads, let’s run the preprocessing of reads.
Let’s initially move to the run1_QC working directory:
Code Block |
---|
cd $HOME/workshop/ONTvariants/runs/run1_QC |
Now let’s copy the script for the exercise:
Code Block |
---|
cp /work/training/ONTvariants/scripts/launch_ONTvariants_QC.pbs . |
Note: the above script copies the launch script for the scripts folder to the current directory denoted by the full stop “ . “ at the end of the command.
Let’s print the content of the script:
Code Block |
---|
cat launch_ONTvariants_QC.pbs |
Code Block |
---|
#!/bin/bash -l
#PBS -N run1_QC
#PBS -l select=1:ncpus=8:mem=16gb
#PBS -l walltime=48:00:00
#PBS -m abe
cd $PBS_O_WORKDIR
conda activate ONTvariants_QC
###############################################################
# Variables
###############################################################
FASTQ='/work/training/ONTvariants/data/SRR17138639_1.fastq.gz'
GENOME='/work/training/ONTvariants/data/chr20.fasta'
SAMPLEID='SRR17138639'
###############################################################
#STEP1: NanoPlot - overall QC report
NanoPlot -t 8 --fastq $FASTQ --prefix ${SAMPLEID}_QC_ --plots dot --N50 --tsv_stats
#STEP2: porechop_abi - remove adapters
porechop_abi -abi -t 8 --input $FASTQ --discard_middle --output ${SAMPLEID}_trimmed.fastq
#STEP3: chopper - retain reads with >Q10 and length>300b
chopper -q 10 -l 300 -i ${SAMPLEID}_trimmed.fastq > ${SAMPLEID}_trimmed_q10.fastq
#STEP4: get stats of trimmed FASTQ files
seqkit stats *.fastq > Report_trimmed_FASTQ_stats.txt |
Note:
Line 1: Defines that the script is a bash script.
Lines 2-5: Are commented out with “#” at the beginning and are ignored by bash, however, these PBS lines tell the scholar (PBS Pro) the name of the job (line 2), the number of CPUs and RAM memory to use (line 3), the time to run the script (line 4) and report if there are any errors (line 5).
Line 7: Tells the job to run on the current directory.
Line 9: Activate the conda environment where the QC tools were installed using conda.
Lines 11-17: User defined variables. Modify the FASTQ, genome and/or sample ID to use to run the job as appropriate. Note: in the lines below, the variable names are used instead of the actual names or locations of the files (e.g., $FASTQ)
Line 20: Run a Quality Control (QC) overview of the raw Nanopore reads using NanoPlot
Line 23: Remove adapter sequences from the 5'- and 3’-ends of the raw reads
Line 26: Filter reads with a quality score below Q10 (90% accuracy; -q 10) and shorter than 300 bases (-l 300)
Line 28: collect the stats for trimmed FASTQ files processed using porechop_abi and chopper
Submit the QC job to the HPC cluster:
Code Block |
---|
qsub launch_ONTvariants_QC.pbs |
Monitor the progress of the job:
Code Block |
---|
qjobs |
The job will take ~30 - 40 min to complete.
Outputs
The following list of outputs will be generated once the job has completed:
Code Block |
---|
.
├── launch_ONTvariants_QC.pbs
├── SRR17138639_1_porechop_abi_chopper_q10_300b.fa
├── SRR17138639_1_porechop_abi_chopper_q10_300b.fa_stats.txt
├── SRR17138639_1_porechop_abi_chopper_q10_300b.fastq
├── SRR17138639_1_porechop_abi.fastq
├── SRR17138639.fastq.gz
├── SRR17138639_QC_LengthvsQualityScatterPlot_dot.html
├── SRR17138639_QC_LengthvsQualityScatterPlot_dot.png
├── SRR17138639_QC_NanoPlot_20240517_2037.log
├── SRR17138639_QC_NanoPlot-report.html
├── SRR17138639_QC_NanoStats.txt
├── SRR17138639_QC_Non_weightedHistogramReadlength.html
├── SRR17138639_QC_Non_weightedHistogramReadlength.png
├── SRR17138639_QC_Non_weightedLogTransformed_HistogramReadlength.html
├── SRR17138639_QC_Non_weightedLogTransformed_HistogramReadlength.png
├── SRR17138639_QC_WeightedHistogramReadlength.html
├── SRR17138639_QC_WeightedHistogramReadlength.png
├── SRR17138639_QC_WeightedLogTransformed_HistogramReadlength.html
├── SRR17138639_QC_WeightedLogTransformed_HistogramReadlength.png
├── SRR17138639_QC_Yield_By_Length.html
└── SRR17138639_QC_Yield_By_Length.png |
As outputs find the porechop_abi processed file (SRR17138639_1_porechop_abi.fastq
) and the chopper output (SRR17138639_1_porechop_abi_chopper_q10_300b.fastq
). To visualise the QC reports, let’s connect to the HPC via file finder (see below).
NOTE: To proceed, you need to be on QUT’s WiFi network or signed via VPN.
To browse the working folder in the HPC type in the file finder:
Windows PC
Code Block |
---|
\\hpc-fs\work\training\ONTvariants |
Mac
Code Block |
---|
smb://hpc-fs/work/training/ONTvariants |
Once connected, browse to the “/ONTvariants/runs/run1_QC” folder.
Let’s open the “SRR17138639_QC_NanoPlot-report.html“ report:
Summary statistics
Metrics | dataset |
---|---|
number_of_reads | 5513156 |
number_of_bases | 7815960904.0 |
median_read_length | 586.0 |
mean_read_length | 1417.7 |
read_length_stdev | 2997.2 |
n50 | 4054.0 |
mean_qual | 11.4 |
median_qual | 13.4 |
longest_read_(with_Q):1 | 199230 (3.7) |
longest_read_(with_Q):2 | 169532 (3.9) |
longest_read_(with_Q):3 | 134047 (3.6) |
longest_read_(with_Q):4 | 133337 (3.6) |
longest_read_(with_Q):5 | 115232 (3.3) |
highest_Q_read_(with_length):1 | 26.1 (290) |
highest_Q_read_(with_length):2 | 25.4 (202) |
highest_Q_read_(with_length):3 | 24.9 (331) |
highest_Q_read_(with_length):4 | 24.5 (232) |
highest_Q_read_(with_length):5 | 24.5 (243) |
Reads >Q5: | 5417207 (98.3%) 7502.8Mb |
Reads >Q7: | 5275906 (95.7%) 6978.9Mb |
Reads >Q10: | 4853447 (88.0%) 6056.6Mb |
Reads >Q12: | 3905370 (70.8%) 4809.9Mb |
Reads >Q15: | 1324999 (24.0%) 1571.9Mb |
Next, let’s inspect the “SRR17138639_QC_LengthvsQualityScatterPlot_dot.png“ file. Alternatively for high resolution image open instead “SRR17138639_QC_LengthvsQualityScatterPlot_dot.html“
...
Next: ONTvariants - mapping