This guide provides a step-by-step guide to 1) convert BAM files (i.e., public) to FASTQ; and 2) run the nextflow nf-core/sarek variant calling pipeline.
...
Code Block |
---|
conda activate liver |
Prepare a file called environment.yml - Tip: use a text editor (i.e., vim, nano, or other) to copy and paste the code below into the file.
...
Move to the folder where all the BAM files are present and prepare the following script (i.e.,launch_BAM2FASTQ.pbs):
Code Block |
---|
#!/bin/bash -l #PBS -N BAM2FASTQ #PBS -l walltime=24:00:00 #PBS -l mem=8gb #PBS -l ncpus=4 cd $PBS_O_WORKDIR #activate the conda environment with the necessary tools conda activate liver #Sort reads in BAM file by indentifier-name (-n) using 4 CPUs (-@ 4). Note 'prefix' for sorted file noted after $i (input BAM file) for i in `ls --color=never *.bam` do echo $i samtools sort -@ 4 -n $i ${i%%.bam}_sorted done #Extract paired end reads in FASTQ format for file in `ls --color=never *sorted.bam` do echo $file bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq #compress FASTQ files to run using the sarek pipeline gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R2.fastq.gz done |
...
To run Sarek 3 files are required:
launch.pbs → details how to run the workflow
~/.nextflow/config → specify how to run the workflow in the HPC
samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)
We will run the Sarek pipeline in three phases:
Phase I: Preprocessing, mapping, markduplicates, recalibrate
Phase II: Variant calling
Phase III: Annotation
PHASE I - preprocessing
Below is an example of a launch_phase1.pbs file for mapping onto selected genome:
Code Block |
---|
#!/bin/bash -l
#PBS -N sarek_I
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java
nextflow run nf-core/sarek \
-r 3.1.1 \
-profile singularity \
--genome GATK.GRCh38 \
--input samplesheet.csv |
...
~/.nextflow/config file: (Note: You may already have this file if you installed Nextflow using this guide )
Code Block |
---|
singularity {
cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
autoMounts = true
}
conda {
cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}
singularity {
enabled = true
autoMounts = true
}
process {
executor = 'pbspro'
beforeScript = {
"""
source $HOME/.bashrc
source $HOME/.profile
"""
}
scratch = false
cleanup = false
} |
Example of a samplesheet.csv file:
Code Block |
---|
patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz |
Prepare a samplesheet.csv file that contains the information of all the samples to be processed. Once ready, submit the job to the PBS scheduler:
Code Block |
---|
qsub launch_phase1.pbs |
PHASE II - variant calling
Prepare/edit the following launch_phase2.pbs script:
Code Block |
---|
#!/bin/bash -l #PBS -N sarek_II #PBS -l walltime=24:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #run the sarek pipeline nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ --inputconfig samplesheetnextflow.csvconfig \ --wesstep variant_calling \ --tools 'deepvariant,freebayes,haplotypecaller,manta,mpileup,snpeff,strelka,vep'haplotypecaller \ --wes \ -resume |
~/.nextflow/config file: (Note: You may already have this file if you installed Nextflow using this guide )
...
Note: Sarek will automatically detect the input for the variant calling phase based on the results from the phase 1 outputs (i.e., results/csv/recalibrate.csv)
Submit the job to the PBS scheduler:
Code Block |
---|
qsub launch_phase2.pbs |
monitor the progress on the HPC:
Code Block |
---|
qjobs |
Alternatively, view the progress of the submitted job on the Nextflow Tower.
PHASE III - annotation
Prepare/edit the following launch_phase3.pbs script:
Code Block |
---|
#!/bin/bash -l #PBS -N sarek_III #PBS -l walltime=24:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #run the sarek pipeline nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ -config nextflow.config \ --step annotate \ source $HOME/.profile --tools vep,snpeff \ """ } --wes \ scratch = false cleanup = false } |
Example of an samplesheet.csv file:
Code Block |
---|
patient,sample,lane,fastq_1,fastq_2 healthy_11,1,1,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R2.fastq.gz-resume |
Similarly to Phase 2, the Sarek pipeline will automatically detect the VCF input file for running annotation using the selected tool(s).
Submit the job to the PBS scheduler:
Code Block |
---|
qsub launch_phase3.pbs |
monitor the progress on the HPC:
Code Block |
---|
qjobs |
Alternatively, view the progress of the submitted job on the Nextflow Tower.