Here we describe the process to fetch and extract umapped unmapped reads onto a reference genome(s)/transcriptome using BWA.
This guide assumes that conda is already installed - if you have not yet done so, see here
...
.
Pre-requisites
Installed conda as per https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
Raw FASTQ files have been quality trimmed (i.e., remove adapters and poor quality passes (i.e., >=Q30)).
Strategy #1: BWA aligner
STEP 1.1 - (Optional) Create a conda
...
environment with necessary tools if not yet
...
available
Creating conda environment, and you can give it any name (i.e., related to the workflow or tool being installed). For simplicity, here we will call it ‘myanalyses’
...
Activate the conda environment.
Code Block |
---|
conda activate myanalyses |
...
Code Block |
---|
conda install -c bioconda bedtools |
...
To look for other bioinformatics tools to install, search the tool of interest at: https://anaconda.org
STEP 1.2 - Mapping of reads onto reference sequences (i.e., can be abundant genomic sequences found in
...
an environmental sample).
Here we aim to generate a mapped file (BAM) that can be used to identify unmapped sequences of interest to run a follow-up de novo genome assembly, for example.
See BWA user manual for more details: https://bio-bwa.sourceforge.net/bwa.shtml
For the mapping, we can use either individual FASTQ files for each sample or merged files for a group of related samples - if you merge the FASTQ files, these need to be merged using the sample order for both READ1 (*R1.fastq) and READ2 (*R2.fastq) files. We assume that adaptors have already been already removed from the raw reads and the individual or merged files are quality processed reads (for example, processed by either cutadapt, trimmomatic, trim galore, fastp or another QC tool)
Code Block |
---|
#!/bin/bash -l #PBS -N bwa_mapping #PBS -l select=1:ncpus=8:mem=16gb #PBS -l walltime=24:00:00 #work on current directory (folder) cd $PBS_O_WORKDIR #define VARIABLES for the location of your FASTQ files and reference genome(s) ######################################################## R1=/path/to/sample_trimmed_R1.fastq R2=/path/to/sample_trimmed_R1.fastq MYGENOMES=/path/to/bwa_index/my_reference_genomes.fasta SAMPLEID='LocationX' ######################################################### conda activate myanalyses #create a BWA genome index file prior mapping (this needs to be done once and then either remove or commented out using "#" as the first character in the line ) bwa index ${MYGENOMES} #map reads onto reference bwa mem ${READ1} ${READ2} > ${SAMPLEID}_aln-pe.sam #convert SAM to BAM samtools view -S -b ${SAMPLEID}_aln-pe.sam > ${SAMPLEID}_aln-pe.bam #retrive umapped reads samtools view -b -f 4 ${SAMPLEID}_aln-pe.bam > ${SAMPLEID}_unmapped.bam #sort unmapped reads by name samtools sort -n -o ${SAMPLEID}_unmapped.sorted.bam ${SAMPLEID}_unmapped.bam #extract FASTQ files for R1 and R2 unmapped reads bedtools bamtofastq [OPTIONS] -i ${SAMPLEID}_unmapped.sorted.bam -fq ${SAMPLEID}_unmapped_R1.bam -fq2 ${SAMPLEID}_unmapped_R2.bam |
Strategy #2: bowtie aligner
STEP 2.1 - (Optional) Create a conda environment with necessary tools if not yet available
(optional) If not yet created as per above - Create a conda environment, and you can give it any name (i.e., related to the workflow or tool being installed). For simplicity, here we will call it ‘myanalyses’
Code Block |
---|
conda create --name myanalyses python=3.7 |
Activate the conda environment.
Code Block |
---|
conda activate myanalyses |
Let’s now necessary tools one at a time (faster than installing them all at once):
Code Block |
---|
conda install -c bioconda bowtie |
STEP 1.2 - Mapping of reads onto reference sequences (i.e., can be abundant genomic sequences found in an environmental sample).
Here we aim to generate a mapped file (BAM) that can be used to identify unmapped sequences of interest to run a follow-up de novo genome assembly, for example.
See bowtie user manual for more details: https://bowtie-bio.sourceforge.net/manual.shtml
For the mapping, we can use either individual FASTQ files for each sample or merged files for a group of related samples - if you merge the FASTQ files, these need to be merged using the sample order for both READ1 (*R1.fastq) and READ2 (*R2.fastq) files. We assume that adaptors have already been removed from the raw reads and the individual or merged files are quality processed reads (for example, processed by either cutadapt, trimmomatic, trim galore, fastp, or another QC tool)
Build bowtie-index for ref genomes:
Code Block |
---|
bowtie-build -f my_references.fasta my_references.fasta |
Run mapping allowing no mismatches (identity = 100%)
Code Block |
---|
#!/bin/bash -l
#PBS -N bwa_mapping
#PBS -l select=1:ncpus=8:mem=16gb
#PBS -l walltime=24:00:00
#work on current directory (folder)
cd $PBS_O_WORKDIR
#define VARIABLES for the location of your FASTQ files and reference genome(s)
####################################################################
READ1=’/path/to/mySample_trimmed_R1.fastq’
READ2=’/path/to/mySample_trimmed_R2.fastq’
BOWTIEINDEX=’/path/to/bowtie-index/my_references.fasta’
#####################################################################
#run mapping:
bowtie --threads 8 -a -v 0 --best --strata --sam $BOWTIEINDEX \
-1 $READ1 \
-2 $READ2 > bowtie_v0_mapped_reads_onto_my_references.sam
|