Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Exercise 3: Run nf-core/sarek using a liver samples 

What is NAFLD?

Non-alcoholic fatty liver disease (NAFLD) is a condition characterized by an accumulation of fat in liver cells (hepatocytes). Excess fat in the liver can lead to significant damage over the years. There are two types of NAFLD:

...

STEP1: Create the metadata file (samplesheet.csv):

Change to the data folder directory:

...

Code Block
cp /work/training/sarek/scripts/create_samplesheet_nf-core_sarek.py $HOME/workshop/sarek/runs/run3_liver

Al
  • Note: you could replace ‘$HOME/workshop/sarek/runs/liver’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located

Check help option on how to run the script:

...

Code Block
cp $HOME/workshop/sarek/scripts/launch_nf-core_sarek_liver.pbs $HOME/workshop/sarek/runs/run3_liver
cd $HOME/workshop/sarek/runs/run3_liver
  • Line 1: Copy the samplesheet.csv file generated above to the working directory

  • Line 2: copy the launch_nf-core_sarek_trio.pbs submission script to the working directory

  • Line 3: move to the working directory

View the content of the launch_nf-core_RNAseq_QC.pbs script:

...

#!/bin/bash -l

#PBS -N nfsarek_liver

#PBS -l walltime=48:00:00

#PBS -l select=1:ncpus=1:mem=5gb

 

cd $PBS_O_WORKDIR

NXF_OPTS='-Xms1g -Xmx4g'

module load java

 

#specify the nextflow version to use to run the workflow

export NXF_VER=23.10.1

 

#run the sarek pipeline

nextflow run nf-core/sarek \

        -r 3.3.2 \

        -profile singularity \

        --genome GATK.GRCh38 \

        --input samplesheet.csv \

        --wes \

        --outdir ./results \

        --step mapping \

        --tools haplotypecaller,snpeff,vep \

        --snpeff_cache /work/training/sarek/NXF_SINGULARITY_CACHEDIR/snpeff_cache \

        --vep_cache /work/training/sarek/NXF_SINGULARITY_CACHEDIR/vep_cache \

        -resume

  • The above script will screen for germline (inherited) mutations using GATK’s haplotypecaller and then annotate the identified variants using snpeff and VEP.

  • Version 3.3.2 allows running the pipeline to do quality assessment only, without any alignment, read counting, or trimming.

  • The pipeline enables use to start at distinct stages, we are commencing from the start “--step mapping”

Submitting the job

Once you have created the folder for the run, the samplesheet.csv file, and launch.pbs, you are ready to submit the job to the HPC scheduler:

...

Once the pipeline has finished running - Assess the results as follows:

NOTE: To proceed, you need to be on QUT’s WiFi network or signed via VPN.

To browse the working folder in the HPC type in the file finder:

...

Evaluate the nucleotide distributions in the 5'-end and 3'-end of the sequenced reads (Read1 and Read2). Look into the “MultiQC” folder and open the provided HTML report.

During execution of the workflow two output folders are generated:

  • work - where all intermediate results and tasks are run

  • results - where all final results for all stages of the pipeline are copied

Let’s browse the results of the pipeline:

Code Block
results/
├── annotation
│   └── haplotypecaller
├── csv
│   ├── markduplicates.csv
│   ├── markduplicates_no_table.csv
│   ├── recalibrated.csv
│   └── variantcalled.csv
├── multiqc
│   ├── multiqc_data
│   ├── multiqc_plots
│   └── multiqc_report.html
├── pipeline_info
│   ├── execution_report_2024-05-11_20-17-13.html
│   ├── execution_timeline_2024-05-11_20-17-13.html
│   ├── execution_trace_2024-05-11_20-17-13.txt
│   ├── params_2024-05-11_23-31-28.json
│   ├── pipeline_dag_2024-05-11_20-17-13.html
│   └── software_versions.yml
├── preprocessing
│   ├── markduplicates
│   ├── recalibrated
│   └── recal_table
├── reports
│   ├── bcftools
│   ├── EnsemblVEP
│   ├── fastp
│   ├── fastqc
│   ├── markduplicates
│   ├── mosdepth
│   ├── samtools
│   ├── snpeff
│   └── vcftools
├── tabix
│   ├── wgs_calling_regions_noseconds.hg38.bed.gz
│   └── wgs_calling_regions_noseconds.hg38.bed.gz.tbi
└── variant_calling
    └── haplotypecaller