This guide describes, step by step, how to 1) convert BAM files (e.g., publicly available data) to FASTQ; and 2) run the Nextflow nf-core/sarek variant calling pipeline.
...
Code Block:
conda activate liver
Prepare a file called environment.yml. Tip: use a text editor (e.g., vim, nano, or another editor) to copy and paste the code below into the file.
...
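The original environment.yml content is not reproduced here. As a rough sketch only, an environment named liver that provides the tools used by the conversion script below (samtools and bedtools) could look like this; the channels and the decision not to pin versions are assumptions, not the author's exact file:

Code Block:
# environment.yml (illustrative sketch; pin tool versions as needed)
name: liver
channels:
  - conda-forge
  - bioconda
dependencies:
  - samtools
  - bedtools

The environment is then created with conda env create -f environment.yml and activated with conda activate liver.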
Move to the folder where all the BAM files are present and prepare the following script (e.g., launch_BAM2FASTQ.pbs):
Code Block:
#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

# Activate the conda environment with the necessary tools
conda activate liver

# Sort reads in each BAM file by identifier name (-n) using 4 CPUs (-@ 4).
# The sorted output is named after the input BAM file ($i) with a '_sorted' suffix.
for i in `ls --color=never *.bam`
do
  echo $i
  samtools sort -@ 4 -n -o ${i%%.bam}_sorted.bam $i
done

# Extract paired-end reads in FASTQ format
for file in `ls --color=never *sorted.bam`
do
  echo $file
  bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq
  # Compress the FASTQ files so they can be used with the sarek pipeline
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz
  gzip -c -9 ${file%%.bam}_R2.fastq > ${file%%.bam}_R2.fastq.gz
done
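Assuming a standard PBS Pro scheduler (consistent with the #PBS directives above), the conversion job is submitted from the same folder with qsub, and its progress can be followed with qstat:

Code Block:
qsub launch_BAM2FASTQ.pbs
qstat -u $USER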
...
To run Sarek, three files are required:
launch.pbs → the PBS script that launches the workflow
nextflow.config → specifies how the workflow is run on the HPC
samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM); in the launch scripts below this file is referred to as index.csv (an example is given after this list)
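As an illustrative sketch only (sample names and file paths are placeholders, not taken from the original guide), a minimal samplesheet for paired-end FASTQ input might look like the following; see the nf-core/sarek usage documentation for the full set of columns (e.g., sex and status for tumour/normal pairs):

Code Block:
patient,sample,lane,fastq_1,fastq_2
patient1,sample1,lane_1,/path/to/sample1_sorted_R1.fastq.gz,/path/to/sample1_sorted_R2.fastq.gz

The gzipped FASTQ files produced by the BAM-to-FASTQ step above go into the fastq_1 and fastq_2 columns.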
Below is an example of a launch.pbs file for mapping reads onto the selected reference genome:
Code Block:
#!/bin/bash -l
#PBS -N sarek
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR

export NXF_OPTS='-Xms1g -Xmx4g'
module load java

nextflow run nf-core/sarek \
  -r 3.1.1 \
  -profile singularity \
  --genome GATK.GRCh38 \
  --input index.csv \
  --outdir results \
  -config nextflow.config
To run variant calling and annotation, use the following script instead:
Code Block:
#!/bin/bash -l
#PBS -N sarek
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR

export NXF_OPTS='-Xms1g -Xmx4g'
module load java

nextflow run nf-core/sarek \
  -r 3.1.1 \
  -profile singularity \
  --genome GATK.GRCh38 \
  --input index.csv \
  --outdir results \
  -config nextflow.config \
  --wes true \
  --tools 'deepvariant,freebayes,haplotypecaller,manta,mpileup,snpeff,strelka,vep' \
  -resume
nextflow.config file:
Code Block:
singularity {
    enabled = true
    autoMounts = true
    cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
}

conda {
    cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}

process {
    executor = 'pbspro'
    beforeScript = {
        """
        source $HOME/.bashrc
        source $HOME/.profile
        """
    }
    scratch = false
    cleanup = false
}
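Because cleanup = false keeps all intermediate files, the Nextflow work directory can grow large over repeated runs. Assuming a finished run no longer needs to be resumed, its task directories can be removed with Nextflow's built-in clean command, for example:

Code Block:
# Remove work files of the last run (use -n first for a dry run)
nextflow clean -f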
...