Let’s create an NEW interactive session on the HPC:
Code Block |
---|
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb |
Install tools using conda
...
We will now create a new CONDA environment to install the tools needed for mapping. The reason we need to create a new a new environments is because the QC and mapping tools have no compatible dependencies.
Create a conda environment called snifflesONTvariants_mapping
Code Block |
---|
conda create -n ONTvariants_mapping |
Code Block |
---|
Collecting package metadata (current_repodata.json): done Solving environment: done ==> WARNING: A newer version of conda exists. <== current version: 4.12.0 latest version: 24.5.0 Please update conda by running $ conda update -n base -c defaults conda ## Package Plan ## environment location: /home/barrero/miniconda3/envs/ONTvariant Proceed ([y]/n)? y Preparing transaction: done Verifying transaction: done Executing transaction: done # # To activate this environment, use # # $ conda activate ONTvariants_mapping # # To deactivate an active environment, use # # $ conda deactivate |
...
Next, we need to install few tools for today’s exercises. Now let’s go the https://anaconda.org and search for the following tools and instructions on how to install them:
Code Block |
---|
samtools, sniffles, minimap2 |
For example, search for samtools:
...
Click on the link to the tool of interest and you will be presented with the conda command line to run in your system to install the tool:
...
Copy and paste the first command shown line from above in into your terminal where you have activated the ‘ONTvariant’ conda environmentsessions with the activated ‘ONTvariants_mapping’ environment. Install samtools (version 1.20):
Code Block |
---|
conda install bioconda::samtools |
Now repeat the process for ‘sniffles’, then install it:
Code Block |
---|
conda install bioconda::sniffles |
Next, let’s install minimap2 and seqkit(version 2.28):
Code Block |
---|
conda install bioconda::minimap2 |
Code Block |
---|
conda install bioconda::seqkit |
Now we are done installing all the tools that we need for today.
...
Alternative approach to create a conda env and install tools (we are not doing this - this just for your information) - installing all tools at once (slower option!)
Prepare the following environment.yml file:
Code Block |
---|
name: ONTvariants_mapping channels: - conda-forge - defaults - bioconda dependencies: - samtools=1.20 - sniffles=1.0.12 - minimap=2-minimap2=2.28 - seqkit=2.8.1 |
Create a new environment:
Code Block |
---|
conda env create -f environment.yml |
Installing more tools or dealing with compatibility issues between tools
As you have seen, we can search at anaconda.org for other tools that we might be interested to use.
Remember, if you run into compatibility issues or errors, you can always create a new conda environment for the tool of interest. NOTE: you can switch between conda environements as follows:
Code Block |
---|
conda activate myenvironment1
#... do so work with it
conda deactivate
#now we can activate another environment
conda activate myenvironment2
#... do some other tasks
conda deactivate #optional to deactivate last environment to use given that when we submit a job to the HPC cluster it will be closed automatically at the completion of the last task. |
Hands-on exercises
Exercise #1: running a test using a sample dataset
Code Block |
---|
Convert the sam file to bam (a binary sam format) using samtools’ view command
Sort the bam file (needed for fast access) using samtools sort command
Create an index of the bam file (needed by IGV) using samtools index command |
Ecoli
...
ONTvariants_QC
...
...
...
conda deactivate
conda activate ONTvariants_mapping
...
...
... |
Running mapping
Now that we have installed all the tools needed for the QC of Nanopore reads, let’s run the preprocessing of reads.
Let’s initially move to the run1_QC working directory:
Code Block |
---|
cd $HOME/workshop/ONTvariants/runs/run2_mapping |
Now let’s copy the script for the exercise:
Code Block |
---|
cp /work/training/ONTvariants/scripts/launch_ONTvariants_mapping.pbs . |
Note: the above script copies the launch script for the scripts folder to the current directory denoted by the full stop “ . “ at the end of the command.
Let’s print the content of the script:
Code Block |
---|
cat launch_ONTvariants_mapping.pbs |
Code Block |
---|
#!/bin/bash -l
#PBS -N run2_mapping
#PBS -l select=1:ncpus=8:mem=16gb
#PBS -l walltime=72:00:00
#PBS -m abe
cd $PBS_O_WORKDIR
conda activate ONTvariants_mapping
###############################################################
# Variables
###############################################################
FASTQ='/work/training/ONTvariants/runs/run1_QC/SRR17138639_1_porechop_abi_chopper_q10_300b.fastq'
GENOME='/work/training/ONTvariants/data/chr20.fasta'
SAMPLEID='SRR17138639'
GENOMEID='chr20'
###############################################################
#STEP1: Mapping preprocessed reads with minimap2 onto reference genome
#minimap2 -t 8 -a $GENOME $FASTQ | awk '$3!="*"' > ${SAMPLEID}_mapped_${GENOMEID}.sam
minimap2 -t 8 -a $GENOME $FASTQ > ${SAMPLEID}_mapped_${GENOMEID}.sam
##STEP2: samtools - SAM to sorted BAM
samtools view -bS ${SAMPLEID}_mapped_${GENOMEID}.sam > ${SAMPLEID}_mapped_${GENOMEID}.bam
samtools sort -o ${SAMPLEID}_mapped_${GENOMEID}.sorted.bam ${SAMPLEID}_mapped_${GENOMEID}.bam
samtools index ${SAMPLEID}_mapped_${GENOMEID}.sorted.bam |
Note:
Line 1: Defines that the script is a bash script.
Lines 2-5: Are commented out with “#” at the beginning and are ignored by bash, however, these PBS lines tell the scholar (PBS Pro) the name of the job (line 2), the number of CPUs and RAM memory to use (line 3), the time to run the script (line 4) and report if there are any errors (line 5).
Line 7: Tells the job to run on the current directory.
Line 9: Activate the conda environment where the QC tools were installed using conda.
Lines 15-18: User defined variables. Modify the FASTQ, genome and/or sample ID to use to run the job as appropriate. Note: in the lines below, the variable names are used instead of the actual names or locations of the files (e.g., $FASTQ)
Line 22: Map quality processed reads onto reference genome using minimap2. The output alignment file is in SAM format.
Line 25: Convert the SAM file to BAM (a binary SAM format) using samtools' view command.
Line 26: Sort the BAM file (required for fast access) using samtools' sort command
Line 27: Create an index of the sorted BAM file (needed for IGV) using the samtools' index command.
Submit the QC job to the HPC cluster:
Code Block |
---|
qsub launch_ONTvariants_mapping.pbs |
Monitor the progress of the job:
Code Block |
---|
qjobs |
Once the run has completed. The following files will be in the “run2_mapping” folder:
Code Block |
---|
.
├── launch_ONTvariants_mapping.pbs
├── SRR17138639_mapped_chr20.bam
├── SRR17138639_mapped_chr20.sam
├── SRR17138639_mapped_chr20.sorted.bam
└── SRR17138639_mapped_chr20.sorted.bam.bai |
To visualise the mapped reads (sorted BAM file) using IGV (see below), first we need to connect to the HPC using FileFinder:
NOTE: To proceed, you need to be on QUT’s WiFi network or signed via VPN.
To browse the working folder in the HPC type in the file finder:
Windows PC
Code Block |
---|
\\hpc-fs\work\training\ONTvariants\runs\run2_mapping |
Mac
Code Block |
---|
smb://hpc-fs/work/training/ONTvariants\runs\run2_mapping |
Visualisation of the alignment using IGV
We can visualise the mapped reads using a web-based IGV genome browser tool at https://igv.org
Click on IGV Web App:
...
Next you will be presented with a page with the human genome hg38 (GRCh38) loaded for all chromosomes including RefSeq (reference) genes:
...
Let’s now upload the BAM and the index BAM.BAI file by selecting Tracks → Local File → browse to the BAM and BAI files and select them both at the same time to upload:
...
Next select chromosome 20 (chromosome from the scroll down menu on the top left corner, click on ‘all’ then select ‘chr20’:
...
Zoom in a visualise the alignments:
...