This guide walks through how to 1) convert BAM files (e.g., publicly available data) to FASTQ; and 2) run the Nextflow nf-core/sarek variant calling pipeline.
Date created: 19/01/2023; Last update: 20/01/2023
Create a Conda environment with the tools needed for downstream analysis
Create a Python 3.7 environment:
conda create --name liver python=3.7
Activate the conda environment:
conda activate liver
Prepare a file called environment.yml. Tip: use a text editor (e.g., vim or nano) to copy and paste the code below into the file.
channels:
  - bioconda
  - conda-forge
dependencies:
  - bedtools
  - samtools
  - seqkit
  - vcftools
  - emboss
Run the following command to install the additional tools into the active environment:
conda env update --file environment.yml
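Before moving on, it can help to confirm that the tools are actually on the PATH of the active environment. The sketch below is one way to do that; the function name check_tools is hypothetical, and emboss is left out because it installs a suite of programs rather than a single command of that name.

```shell
#!/bin/bash
# Report whether each named command is on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "OK       $tool"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Check the command-line tools listed in environment.yml
check_tools bedtools samtools seqkit vcftools \
    || echo "Some tools are missing; re-run the conda env update step."
```

Run it with the liver environment activated; any line starting with MISSING points to a package that did not install.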
To deactivate the conda environment, run:
conda deactivate
Convert BAM to FASTQ
Move to the folder that contains the BAM files and prepare the following script (e.g., launch_BAM2FASTQ.pbs):
#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

# Activate the conda environment with the necessary tools
conda activate liver

# Sort the reads in each BAM file by read name (-n) using 4 threads (-@ 4).
# The sorted output is written to <input>_sorted.bam (-o).
for i in *.bam
do
    echo "$i"
    samtools sort -@ 4 -n -o "${i%.bam}_sorted.bam" "$i"
done

# Extract paired-end reads in FASTQ format
for file in *_sorted.bam
do
    echo "$file"
    bedtools bamtofastq -i "$file" -fq "${file%.bam}_R1.fastq" -fq2 "${file%.bam}_R2.fastq"
    # Compress the FASTQ files for use with the sarek pipeline
    gzip -c -9 "${file%.bam}_R1.fastq" > "${file%.bam}_R1.fastq.gz"
    gzip -c -9 "${file%.bam}_R2.fastq" > "${file%.bam}_R2.fastq.gz"
done
Submit the job to the PBS scheduler:
qsub launch_BAM2FASTQ.pbs
Check the submitted job(s) (qjobs is a site-specific command; on a standard PBS installation, qstat -u $USER lists your jobs):
qjobs
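Once the job has finished, a quick sanity check is to confirm that each R1 file holds the same number of reads as its R2 mate. The following is a minimal sketch, not part of the original workflow; the function names count_reads and check_pair are hypothetical, and it assumes standard four-line FASTQ records and the *_sorted_R1.fastq.gz / *_sorted_R2.fastq.gz naming produced by the conversion script.

```shell
#!/bin/bash
# Count reads in a gzipped FASTQ file.
# Assumes 4 lines per record (no wrapped sequence lines).
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Compare read counts of an R1/R2 pair; returns non-zero on mismatch.
check_pair() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK        $1 ($n1 reads)"
    else
        echo "MISMATCH  $1 ($n1) vs $2 ($n2)"
        return 1
    fi
}

# Check every pair in the current folder
for r1 in *_sorted_R1.fastq.gz; do
    [ -e "$r1" ] || continue   # no matching files: skip
    check_pair "$r1" "${r1%_R1.fastq.gz}_R2.fastq.gz" || true
done
```

A MISMATCH line suggests that the conversion or compression of that sample should be repeated before running the pipeline.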
Run variant calling using the Nextflow nf-core/sarek pipeline
Three files are required to run Sarek:
launch.pbs → details how to run the workflow
nextflow.config → specifies how to run the workflow on the HPC
samplesheet.csv → provides information on the samples and data to be used (e.g., FASTQ, BAM or CRAM)
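A small pre-flight check can catch a missing or empty file before the job is queued. This is only a sketch; the function name check_inputs is hypothetical, and it assumes the three files sit in the folder from which the job will be submitted.

```shell
#!/bin/bash
# Verify that the three files needed to run Sarek exist and are not empty.
# Optional argument: the folder to check (defaults to the current folder).
check_inputs() {
    local dir=${1:-.} status=0 f
    for f in launch.pbs nextflow.config samplesheet.csv; do
        if [ -s "$dir/$f" ]; then
            echo "found    $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    return "$status"
}

check_inputs . || echo "Create the missing file(s) before submitting the job."
```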
Below is an example of a launch.pbs file:
#!/bin/bash -l
#PBS -N sarek
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR

# Limit the memory used by the Nextflow head process (JVM)
export NXF_OPTS='-Xms1g -Xmx4g'
module load java

nextflow run nf-core/sarek \
    -r 3.1.1 \
    -profile singularity \
    --genome GATK.GRCh38 \
    --input samplesheet.csv \
    -config nextflow.config
nextflow.config file:
singularity {
    enabled = true
    autoMounts = true
    cacheDir = "$HOME/NXF_SINGULARITY_CACHEDIR"
}

conda {
    cacheDir = "$HOME/NXF_CONDA_CACHEDIR"
}

process {
    executor = 'pbspro'
    beforeScript = {
        """
        source $HOME/.bashrc
        source $HOME/.profile
        """
    }
    scratch = false
}

// Keep the work directory after the run (top-level option, not a process directive)
cleanup = false
Example of a samplesheet.csv file:
patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R2.fastq.gz
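With many samples, typing the sample sheet by hand is error-prone. The sketch below builds one from a folder of paired FASTQ files. It is an illustration only: the function name make_samplesheet is hypothetical, it assumes the *_R1.fastq.gz / *_R2.fastq.gz naming used above, a single lane per sample, and it reuses the file name prefix for both the patient and sample columns, which you may want to edit afterwards.

```shell
#!/bin/bash
# Print a Sarek-style sample sheet for all R1/R2 pairs found in a folder.
# Assumptions: one lane per sample; patient and sample are both set to
# the file name prefix (everything before _R1.fastq.gz).
# Usage: make_samplesheet /absolute/path/to/fastq > samplesheet.csv
make_samplesheet() {
    local dir=$1 r1 r2 base
    echo "patient,sample,lane,fastq_1,fastq_2"
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # no matching files: skip
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no R2 mate found" >&2
            continue
        fi
        base=$(basename "$r1" _R1.fastq.gz)
        echo "${base},${base},1,${r1},${r2}"
    done
}
```

Passing an absolute folder path keeps the FASTQ paths in the sheet valid regardless of where the pipeline is launched from.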