Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents
stylenone

...

Source : https://nf-co.re/sarek/3.2.3

...

This guide provides a step-by-step guide to:

1) convert BAM files (i.e., public) to paired-end FASTQ files; and

...

image-20240508-064345.pngImage Added

Pre-

...

requisite:

This user guide assumes that you have already installed conda/miniconda3 and nextflowNextflow. If you have not done this yet, follow the instructions in the link below:

...

https://

...

eresearchqut.

...

atlassian.

...

net/

...

wiki/

...

spaces/

...

EG/

...

pages/

...

Code Block
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_22.11.1-1-Linux-x86_64.sh
bash Miniconda3-py37_22.11.1-1-Linux-x86_64.sh

...

nextflow installation: NextFlow quick start

...

VCFtools: https://vcftools.github.io/man_latest.html

Create a Conda environment with tools needed for downstream analyses

Log into the HPC using your user credentials. To install tools, we will use an interactive PBS session:

Code Block
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb

Create a Python 3.7 environment:

Code Block
conda create --name liver python=3.7

Activate the conda environment:

Code Block
conda activate liver

Then you can install one tool at a time or provide a list (see below); however, installing a list of tools may take a while.

We will use samtools and bcftools for the first part of the tutorial. Run the following commands:

Code Block
conda install -c bioconda samtools
Code Block
conda install -c bioconda bedtools

To install several tools, prepare a file called environment.yml (example below). Tip: use a text editor (i.e., vim, nano, or other) to copy and paste the code below into the file.

Code Block
channels:
  - bioconda
  - conda-forge
dependencies:
  - seqkit
  - vcftools
  - bcftools
  - emboss

Run the following command to install additional tools

Code Block
conda env update --file environment.yml

To deactivate the conda environment, run:

...

2207776769/Session+2+-+Installing+Nextflow

Convert BAM to FASTQ

Move to the folder where all the BAM files are present and prepare the following script (i.e.,launch_BAM2FASTQ.pbs):

Code Block
#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

#activate the conda environment with the necessary tools
conda activate liver

#Sort reads in BAM file by indentifier-name (-n) using 4 CPUs (-@ 4). Note 'prefix' for sorted file noted after $i (input BAM file)
for i in `ls --color=never *.bam`
do
  echo $i
  samtools sort -@ 4 -n $i ${i%%.bam}_sorted
done

#Extract paired end reads in FASTQ format
for file in `ls --color=never *sorted.bam`
do
  echo $file
  bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq
  #compress FASTQ files to run using the sarek pipeline
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R2.fastq.gz
done

...

To run Sarek 3 files are required:

  1. launch.pbs → details on how to run the workflow

  2. ~/.nextflow/config → specify how to run the workflow in the HPC

  3. samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)

...

PHASE I - preprocessing

Below is an example of a launch_phase1.pbs file for mapping onto the selected genome:

...

Code Block
singularity {
    cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}

conda {
    cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}

singularity {
    enabled = true
    autoMounts = true
}

process {
  executor = 'pbspro'
  beforeScript = {
      """
      source $HOME/.bashrc
      source $HOME/.profile
      """
  }
  scratch = false
  cleanup = false
}

Example of a samplesheet.csv file:

Code Block
patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz

...

Prepare/edit the following launch_phase2.pbs script:

Code Block
#!/bin/bash -l
#PBS -N sarek_II
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --step variant_calling \
        --tools haplotypecaller \
        --wes \
        -resume

...

Prepare/edit the following launch_phase3.pbs script:

Code Block
#!/bin/bash -l
#PBS -N sarek_III
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --step annotate \
        --tools vep,snpeff \
        --wes \
        -resume

...