Exosome variant analysis

This guide shows, step by step, how to 1) convert BAM files (e.g., publicly available data) to FASTQ; and 2) run the Nextflow nf-core/sarek variant calling pipeline.

Create a Conda environment with the tools needed for downstream analysis

Create a Python 3.7 environment:

conda create --name liver python=3.7

Activate the conda environment:

conda activate liver

Prepare a file called environment.yml. Tip: use a text editor (e.g., vim or nano) to copy and paste the code below into the file.

channels:
  - bioconda
  - conda-forge
dependencies:
  - bedtools
  - samtools
  - seqkit
  - vcftools
  - emboss

Run the following command to install the additional tools:

conda env update --file environment.yml
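
Before moving on, it can help to confirm that the tools are actually on the PATH of the active environment. The sketch below is one way to do that. The function name check_tools is hypothetical, and the command names passed to it are assumptions: EMBOSS, for instance, installs individual programs such as seqret rather than a single command called emboss.

```shell
#!/bin/bash
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (run inside the activated environment; names are assumptions):
# check_tools bedtools samtools seqkit vcftools seqret
```

Any tool reported as missing suggests that the environment update did not complete or that a different environment is active.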

To deactivate the conda environment, run:

conda deactivate

Convert BAM to FASTQ

Move to the folder containing the BAM files and prepare the following script (e.g., launch_BAM2FASTQ.pbs):

#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

#activate the conda environment with the necessary tools
conda activate liver

#Sort reads in each BAM file by read name (-n) using 4 CPUs (-@ 4). The sorted BAM is written to the file given with -o
#(samtools 1.3 and later no longer accept the old 'prefix' form of the sort command)
for i in *.bam
do
  echo "$i"
  samtools sort -@ 4 -n -o "${i%%.bam}_sorted.bam" "$i"
done

#Extract paired-end reads in FASTQ format
for file in *sorted.bam
do
  echo "$file"
  bedtools bamtofastq -i "$file" -fq "${file%%.bam}_R1.fastq" -fq2 "${file%%.bam}_R2.fastq"
  #compress the FASTQ files for use with the sarek pipeline
  gzip -c -9 "${file%%.bam}_R1.fastq" > "${file%%.bam}_R1.fastq.gz"
  gzip -c -9 "${file%%.bam}_R2.fastq" > "${file%%.bam}_R2.fastq.gz"
done
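
The output names in the script are built with shell parameter expansion: ${i%%.bam} removes a trailing .bam from the value of i. As a small illustration, the sketch below uses a hypothetical file name, sample1.bam, to show the names produced at each step.

```shell
#!/bin/bash
# Hypothetical input name, used only to illustrate the expansion.
i="sample1.bam"

sorted="${i%%.bam}_sorted.bam"        # name-sorted BAM
r1="${sorted%%.bam}_R1.fastq.gz"      # compressed forward reads
r2="${sorted%%.bam}_R2.fastq.gz"      # compressed reverse reads

echo "$sorted"
echo "$r1"
echo "$r2"
```

Running it prints sample1_sorted.bam, sample1_sorted_R1.fastq.gz and sample1_sorted_R2.fastq.gz, which are the file names to expect in the folder once the job has finished.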

Submit the job to the PBS scheduler:

qsub launch_BAM2FASTQ.pbs

Check the submitted job(s):

qjobs
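
Once the job has finished, a quick sanity check is to confirm that each R1/R2 pair holds the same number of reads. The sketch below is one possible way to do this. The function names count_reads and check_pair are hypothetical, and the count assumes the usual four lines per FASTQ record.

```shell
#!/bin/bash
# Hypothetical helper: count the reads in a gzip-compressed FASTQ file,
# assuming four lines per record.
count_reads() {
  local lines
  lines=$(gzip -cd "$1" | wc -l)
  echo $(( lines / 4 ))
}

# Hypothetical helper: compare the read counts of an R1/R2 pair.
# Returns 0 when the counts match, 1 otherwise.
check_pair() {
  local n1 n2
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 read pairs"
  else
    echo "MISMATCH: R1=$n1 R2=$n2"
    return 1
  fi
}

# Example call (file names are placeholders):
# check_pair sample1_sorted_R1.fastq.gz sample1_sorted_R2.fastq.gz
```

A mismatch usually means the extraction or compression step did not complete for that sample, so it is worth checking the job's output and error files before continuing.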

Sarek

Create a conda environment with nf-core and Nextflow, then activate it:

conda create --name nf-core python=3.8 nf-core nextflow
conda activate nf-core

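Download release 3.1.2 of the sarek pipeline without compressing the download (-x none) and without container images (-c none):

nf-core download sarek -r 3.1.2 --outdir nf-core-sarek -x none -c none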
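
To feed the compressed FASTQ files into sarek, the pipeline expects a CSV samplesheet. The sketch below shows one way such a file could be generated. It assumes the column names patient, sample, lane, fastq_1 and fastq_2, which should be checked against the sarek documentation for the release in use. The function name make_samplesheet is hypothetical, and so are the choices to reuse the sample name as the patient ID and to label every sample lane_1.

```shell
#!/bin/bash
# Hypothetical helper: print a sarek-style samplesheet for every
# *_R1.fastq.gz file in a directory. Assumes one lane per sample and
# reuses the sample name as the patient ID (both are assumptions).
make_samplesheet() {
  local dir="$1"
  local r1 r2 sample
  echo "patient,sample,lane,fastq_1,fastq_2"
  for r1 in "$dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue              # skip if nothing matched
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # matching reverse-read file
    sample=$(basename "$r1" _R1.fastq.gz)
    echo "$sample,$sample,lane_1,$r1,$r2"
  done
}

# Example call (writes absolute paths when given $PWD):
# make_samplesheet "$PWD" > samplesheet.csv
```

If several samples come from the same patient, or tumour/normal pairs are involved, the patient column and any additional columns need to be edited by hand to reflect that.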