Exercise 3: Run nf-core/sarek using liver samples

What is NAFLD?

Non-alcoholic fatty liver disease (NAFLD) is a condition characterized by an accumulation of fat in liver cells (hepatocytes). Excess fat in the liver can lead to significant damage over the years. There are two types of NAFLD:

...

STEP1: Create the metadata file (samplesheet.csv):

Change to the data folder directory:

...

Code Block
cp /work/training/sarek/scripts/create_samplesheet_nf-core_sarek.py $HOME/workshop/sarek/runs/run3_liver
  • Note: you could replace “$HOME/workshop/sarek/runs/run3_liver” with “.”. A dot means “current directory”, so the file is copied to the directory where you are currently located.

Check the help option to see how to run the script:

...

Code Block
python create_samplesheet_nf-core_sarek.py -h

usage: create_samplesheet_nf-core_sarek.py [-h] [--dir DIR] [--read1_extension READ1_EXTENSION] [--read2_extension READ2_EXTENSION] [--out OUT]

Extract metadata from fastq files in a directory.

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR             Directory to search for files (default: current directory)
  --read1_extension READ1_EXTENSION
                        Extension for fastq_1 files (default: R1_001.fastq.gz)
  --read2_extension READ2_EXTENSION
                        Extension for fastq_2 files (default: R2_001.fastq.gz)
  --out OUT             Output metadata CSV file
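
The source of create_samplesheet_nf-core_sarek.py is not shown on this page, so the following is only a minimal sketch of how a samplesheet row could be derived from a read 1 file name. It assumes file names follow the pattern <patient>_<sample>_<lane>_R1_001.fastq.gz, as the liver files in this exercise do. The function name row_from_read1 is hypothetical and is not part of the training script.

```python
from pathlib import PurePosixPath


def row_from_read1(path, read1_extension="R1_001.fastq.gz",
                   read2_extension="R2_001.fastq.gz"):
    """Build one samplesheet row (as a dict) from the path of a read 1 file.

    Hypothetical helper: assumes names look like
    <patient>_<sample>_<lane>_<read1_extension>.
    """
    p = PurePosixPath(path)
    name = p.name
    if not name.endswith(read1_extension):
        raise ValueError(f"{name} does not end with {read1_extension}")

    # Everything before the read 1 extension, e.g. "Control1_C1_L001_"
    prefix = name[: -len(read1_extension)]
    # Raises ValueError if the name does not have exactly three parts
    patient, sample, lane = prefix.rstrip("_").split("_")

    # The read 2 file is assumed to sit next to the read 1 file
    fastq_2 = p.with_name(prefix + read2_extension)
    return {
        "patient": patient,
        "sample": sample,
        "lane": lane,
        "fastq_1": str(p),
        "fastq_2": str(fastq_2),
    }


row = row_from_read1("/sarek/data/WES/liver/Control1_C1_L001_R1_001.fastq.gz")
print(",".join(row.values()))
```

The defaults for the two extensions mirror the defaults listed in the help text above; any other naming scheme would need a different split rule.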

Let’s generate the metadata file by running the following command:

...

Code Block
ls -l
cat samplesheet.csv

patient,sample,lane,fastq_1,fastq_2
Control1,C1,L001,/sarek/data/WES/liver/Control1_C1_L001_R1_001.fastq.gz,/sarek/data/WES/liver/Control1_C1_L001_R2_001.fastq.gz
Control2,C2,L001,/sarek/data/WES/liver/Control2_C2_L001_R1_001.fastq.gz,/sarek/data/WES/liver/Control2_C2_L001_R2_001.fastq.gz
Control3,C3,L001,/sarek/data/WES/liver/Control3_C3_L001_R1_001.fastq.gz,/sarek/data/WES/liver/Control3_C3_L001_R2_001.fastq.gz
Control4,C4,L001,/sarek/data/WES/liver/Control4_C4_L001_R1_001.fastq.gz,/sarek/data/WES/liver/Control4_C4_L001_R2_001.fastq.gz
NAFLD1,P1,L001,/sarek/data/WES/liver/NAFLD1_P1_L001_R1_001.fastq.gz,/sarek/data/WES/liver/NAFLD1_P1_L001_R2_001.fastq.gz
NAFLD2,P2,L001,/sarek/data/WES/liver/NAFLD2_P2_L001_R1_001.fastq.gz,/sarek/data/WES/liver/NAFLD2_P2_L001_R2_001.fastq.gz
NAFLD3,P3,L001,/sarek/data/WES/liver/NAFLD3_P3_L001_R1_001.fastq.gz,/sarek/data/WES/liver/NAFLD3_P3_L001_R2_001.fastq.gz
NAFLD4,P4,L001,/sarek/data/WES/liver/NAFLD4_P4_L001_R1_001.fastq.gz,/sarek/data/WES/liver/NAFLD4_P4_L001_R2_001.fastq.gz
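
Before submitting a run it can help to sanity-check the samplesheet. The following is a minimal sketch, not part of the training material or of nf-core/sarek, that checks the header and flags rows with the wrong number of fields or duplicate patient/sample/lane combinations. The function name check_samplesheet and the exact set of checks are assumptions made for illustration.

```python
import csv
import io

EXPECTED_HEADER = ["patient", "sample", "lane", "fastq_1", "fastq_2"]


def check_samplesheet(handle):
    """Return a list of problems found in a samplesheet (empty if none)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    seen = set()
    for row in reader:
        line_no = reader.line_num
        # Extra fields are stored under the key None; missing ones are None
        if None in row or any(v is None or v == "" for v in row.values()):
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        key = (row["patient"], row["sample"], row["lane"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate entry {key}")
        seen.add(key)
        if row["fastq_1"] == row["fastq_2"]:
            problems.append(f"line {line_no}: fastq_1 and fastq_2 are identical")
    return problems


example = io.StringIO(
    "patient,sample,lane,fastq_1,fastq_2\n"
    "Control1,C1,L001,Control1_C1_L001_R1_001.fastq.gz,Control1_C1_L001_R2_001.fastq.gz\n"
    "Control1,C1\n"
)
for problem in check_samplesheet(example):
    print(problem)
```

To check a real file, the same function could be called as check_samplesheet(open("samplesheet.csv", newline="")).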

Alternatively copy the samplesheet.csv file:

Code Block
cp /work/training/sarek/data/WES/liver/samplesheet.csv .

STEP2: Run the nf-core/sarek pipeline

...

Code Block
cp $HOME/workshop/sarek/scripts/launch_nf-core_sarek_liver.pbs $HOME/workshop/sarek/runs/run3_liver
cd $HOME/workshop/sarek/runs/run3_liver
  • Line 1: copy the launch_nf-core_sarek_liver.pbs submission script to the working directory

  • Line 2: move to the working directory

  • The samplesheet.csv file generated above must also be in the working directory

View the content of the launch_nf-core_sarek_liver.pbs script:

Code Block
cat launch_nf-core_sarek_liver.pbs

#!/bin/bash -l
#PBS -N nfsarek_liver
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=23.10.1

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.3.2 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --input samplesheet.csv \
        --wes \
        --outdir ./results \
        --step mapping \
        --tools haplotypecaller,snpeff,vep \
        --snpeff_cache /work/training/sarek/NXF_SINGULARITY_CACHEDIR/snpeff_cache \
        --vep_cache /work/training/sarek/NXF_SINGULARITY_CACHEDIR/vep_cache \
        -resume

  • The above script screens for germline (inherited) mutations using GATK’s HaplotypeCaller and then annotates the identified variants using snpEff and VEP.

  • The “-r 3.3.2” option pins the run to release 3.3.2 of the nf-core/sarek pipeline.

  • The pipeline can be started at different stages. Here we start from the beginning with “--step mapping”.
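
The command in the script mixes two kinds of options: those with a single dash (-r, -profile, -resume) are read by Nextflow itself, while those with a double dash (--input, --tools, ...) are parameters passed to the sarek pipeline. As a rough illustration only, the sketch below assembles the same kind of command from two dictionaries. The helper build_command is hypothetical and is not part of Nextflow or nf-core, and the two cache options are left out for brevity.

```python
import shlex


def build_command(pipeline, nextflow_options, pipeline_params):
    """Assemble a `nextflow run` argument list.

    nextflow_options get a single dash, pipeline_params a double dash.
    A value of True means the option is a flag without a value.
    """
    args = ["nextflow", "run", pipeline]
    for prefix, options in (("-", nextflow_options), ("--", pipeline_params)):
        for name, value in options.items():
            args.append(prefix + name)
            if value is not True:
                args.append(str(value))
    return args


cmd = build_command(
    "nf-core/sarek",
    # Options handled by Nextflow itself
    {"r": "3.3.2", "profile": "singularity", "resume": True},
    # Parameters handed to the pipeline
    {
        "genome": "GATK.GRCh38",
        "input": "samplesheet.csv",
        "wes": True,
        "outdir": "./results",
        "step": "mapping",
        "tools": "haplotypecaller,snpeff,vep",
    },
)
print(shlex.join(cmd))
```

Keeping the two groups apart makes it easier to see which settings belong to the workflow engine and which to the analysis; in this sketch the Nextflow options are simply emitted first.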

Submitting the job

Once you have created the folder for the run, the samplesheet.csv file, and the launch_nf-core_sarek_liver.pbs script, you are ready to submit the job to the HPC scheduler:

...

Once the pipeline has finished running, assess the results as follows:

NOTE: To proceed, you need to be on QUT’s WiFi network or connected via VPN.

To browse the working folder on the HPC, type the following in the file finder:

...

During execution of the workflow two output folders are generated:

  • work - where all intermediate results and tasks are run

  • results - where all final results for all stages of the pipeline are copied

...