Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 18 Next »

Create working folder and copy data

Let’s create an interactive session on the HPC:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb

Let’s create working folders for today. Copy and paste the following block of code into your terminal and hit enter for the last command:

mkdir -p $HOME/workshop/ONTvariants
mkdir -p $HOME/workshop/ONTvariants/data
mkdir -p $HOME/workshop/ONTvariants/scripts
mkdir -p $HOME/workshop/ONTvariants/runs/run1_QC
mkdir -p $HOME/workshop/ONTvariants/runs/run2_mapping
mkdir -p $HOME/workshop/ONTvariants/runs/run3_variant_calling

Now, let’s copy the scripts and data for today’s session:

cp /work/training/ONTvariants/data/* $HOME/workshop/ONTvariants/data
cp /work/training/ONTvariants/scripts/* $HOME/workshop/ONTvariants/scripts
cd $HOME/workshop/ONTvariants

Install tools using conda

Create a conda environments

Create a conda environment called ONTvariants_QC

conda create -n ONTvariants_QC
Collecting package metadata (current_repodata.json): done
Solving environment: done


## Package Plan ##

  environment location: /home/barrero/miniconda3/envs/ONTvariants_QC


Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate ONTvariants_QC
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Let’s activate the conda environment:

conda activate ONTvariant_QC

Next, we need to install few tools for today’s exercises. Now let’s go the https://anaconda.org and search for the following tools and instructions on how to install them:

nanoplot, porechop_abi, chopper, seqkit

For example, search for nanoplot:

image-20240517-045848.png

If the tool you are looking is available in conda, a list of options will be presented. Typically choose the option at the top with most downloads and compatible for your system:

image-20240517-050025.png

Click on the link to the tool of interest and you will be presented with the conda command line to run in your system to install the tool:

image-20240517-050058.png

Copy and paste the first command shown above in your terminal where you have activated the ‘ONTvariant’ conda environment (Approx. 5-10min):

conda install bioconda::nanoplot

Now repeat the process for porechop_abi (Approx. 1min)’, then install it (Approx. 1min):

conda install bioconda::porechop_abi

Next, let’s install chopper (Approx. 1min):

conda install bioconda::chopper

Finally, let’s install the seqkit suite of tools:

conda install bioconda::seqkit

Alternative approach to create a conda env and install tools (we are not doing this - this just for your information) - installing all tools at once (slower option!)

Prepare the following environment.yml file:

name: ONTvariants_QC
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - nanoplot=1.42.0
  - porecho=0.2.4
  - porechop_abi=0.5.0
  - chopper=0.8.0

Create a new environment:

conda env create -f environment.yml

Running QC

Now that we have installed all the tools needed for the QC of Nanopore reads, let’s run the preprocessing of reads.

Let’s initially move to the run1_QC working directory:

cd $HOME/workshop/ONTvariants/runs/run1_QC

Now let’s copy the script for the exercise:

cp /work/training/ONTvariants/scripts/launch_ONTvariants_QC.pbs .

Note: the above script copies the launch script for the scripts folder to the current directory denoted by the full stop “ . “ at the end of the command.

Let’s print the content of the script:

cat launch_ONTvariants_QC.pbs 
#!/bin/bash -l
#PBS -N run1_QC
#PBS -l select=1:ncpus=8:mem=16gb
#PBS -l walltime=48:00:00
#PBS -m abe

cd $PBS_O_WORKDIR

conda activate ONTvariants_QC

###############################################################
# Variables
###############################################################
FASTQ='/work/training/ONTvariants/data/SRR17138639_1.fastq.gz'
GENOME='/work/training/ONTvariants/data/chr20.fasta'
SAMPLEID='SRR17138639'
###############################################################

#STEP1: NanoPlot - overall QC report
NanoPlot -t 8 --fastq $FASTQ --prefix ${SAMPLEID}_QC_ --plots dot --N50 --tsv_stats

#STEP2: porechop_abi - remove adapters
porechop_abi -abi -t 8 --input ${SAMPLEID}.fastq.gz --discard_middle --output ${SAMPLEID}_trimmed.fastq

#STEP3: chopper - retain reads with >Q10 and length>300b
chopper -q 10 -l 300 -i ${SAMPLEID}_trimmed.fastq > ${SAMPLEID}_trimmed_q10.fastq

Note:

  • Line 1: Defines that the script is a bash script.

  • Lines 2-5: Are commented out with “#” at the beginning and are ignored by bash, however, these PBS lines tell the scholar (PBS Pro) the name of the job (line 2), the number of CPUs and RAM memory to use (line 3), the time to run the script (line 4) and report if there are any errors (line 5).

  • Line 7: Tells the job to run on the current directory.

  • Line 9: Activate the conda environment where the QC tools were installed using conda.

  • Lines 11-17: User defined variables. Modify the FASTQ, genome and/or sample ID to use to run the job as appropriate. Note: in the lines below, the variable names are used instead of the actual names or locations of the files (e.g., $FASTQ)

  • Line 20: Run a Quality Control (QC) overview of the raw Nanopore reads using NanoPlot

  • Line 23: Remove adapter sequences from the 5'- and 3’-ends of the raw reads

  • Line 26: Filter reads with a quality score below Q10 (90% accuracy; -q 10) and shorter than 300 bases (-l 300)

Submit the QC job to the HPC cluster:

qsub launch_ONTvariants_QC.pbs

Monitor the progress of the job:

qjobs

The job will take ~30 - 40 min to complete.

Outputs

The following list of outputs will be generated once the job has completed:

.
├── launch_ONTvariants_QC.pbs
├── SRR17138639_1_porechop_abi_chopper_q10_300b.fa
├── SRR17138639_1_porechop_abi_chopper_q10_300b.fa_stats.txt
├── SRR17138639_1_porechop_abi_chopper_q10_300b.fastq
├── SRR17138639_1_porechop_abi.fastq
├── SRR17138639.fastq.gz
├── SRR17138639_QC_LengthvsQualityScatterPlot_dot.html
├── SRR17138639_QC_LengthvsQualityScatterPlot_dot.png
├── SRR17138639_QC_NanoPlot_20240517_2037.log
├── SRR17138639_QC_NanoPlot-report.html
├── SRR17138639_QC_NanoStats.txt
├── SRR17138639_QC_Non_weightedHistogramReadlength.html
├── SRR17138639_QC_Non_weightedHistogramReadlength.png
├── SRR17138639_QC_Non_weightedLogTransformed_HistogramReadlength.html
├── SRR17138639_QC_Non_weightedLogTransformed_HistogramReadlength.png
├── SRR17138639_QC_WeightedHistogramReadlength.html
├── SRR17138639_QC_WeightedHistogramReadlength.png
├── SRR17138639_QC_WeightedLogTransformed_HistogramReadlength.html
├── SRR17138639_QC_WeightedLogTransformed_HistogramReadlength.png
├── SRR17138639_QC_Yield_By_Length.html
└── SRR17138639_QC_Yield_By_Length.png

As outputs find the porechop_abi processed file (SRR17138639_1_porechop_abi.fastq) and the chopper output (SRR17138639_1_porechop_abi_chopper_q10_300b.fastq). To visualise the QC reports, let’s connect to the HPC via file finder (see below).

NOTE: To proceed, you need to be on QUT’s WiFi network or signed via VPN.

To browse the working folder in the HPC type in the file finder:

Windows PC

\\hpc-fs\work\training\ONTvariants

Mac

smb://hpc-fs/work/training/ONTvariants

Once connected, browse to the “/ONTvariants/runs/run1_QC” folder.

Let’s open the “SRR17138639_QC_NanoPlot-report.html“ report:

Summary statistics

Metrics

dataset

number_of_reads

5513156

number_of_bases

7815960904.0

median_read_length

586.0

mean_read_length

1417.7

read_length_stdev

2997.2

n50

4054.0

mean_qual

11.4

median_qual

13.4

longest_read_(with_Q):1

199230 (3.7)

longest_read_(with_Q):2

169532 (3.9)

longest_read_(with_Q):3

134047 (3.6)

longest_read_(with_Q):4

133337 (3.6)

longest_read_(with_Q):5

115232 (3.3)

highest_Q_read_(with_length):1

26.1 (290)

highest_Q_read_(with_length):2

25.4 (202)

highest_Q_read_(with_length):3

24.9 (331)

highest_Q_read_(with_length):4

24.5 (232)

highest_Q_read_(with_length):5

24.5 (243)

Reads >Q5:

5417207 (98.3%) 7502.8Mb

Reads >Q7:

5275906 (95.7%) 6978.9Mb

Reads >Q10:

4853447 (88.0%) 6056.6Mb

Reads >Q12:

3905370 (70.8%) 4809.9Mb

Reads >Q15:

1324999 (24.0%) 1571.9Mb

Next, let’s inspect the “SRR17138639_QC_LengthvsQualityScatterPlot_dot.png“ file. Alternatively for high resolution image open instead “SRR17138639_QC_LengthvsQualityScatterPlot_dot.html“

image-20240517-233321.png

  • No labels