Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents
stylenone

...

In this section we’ll be revisiting the variant calling workflows from sessions 2 and 3. Specifically we’ll be doing some further analysis of the nfcore/sarek output.

In the ‘2.Setup’ section, you downloaded 3 nfcore/sarek runs to H:/workshop/sarek/runs. These represent sarek variant calling results from the 3 exercises run in session 2:

  • Exercise 2: Running the sarek variant calling pipeline with a family trio data (NA12878, NA12891, NA12892)

  • Exercise 3: Running the sarek variant calling pipeline with liver samples

  • Exercise 4: Use learned skills to prepare and run variant analysis for a second family trio (HG005, HG006, HG007)

See ‘Session 2 - Variant calling analysis’ for more details on the experimental datasets.

SNPeff and SNPSift

There is a variety of ways to analyse variant data, but we’ll specifically be looking at SNPeff results, which is a variant annotation and functional effect prediction tool that is run as part of the sarek pipeline.

https://pcingola.github.io/SnpEff/

SNPeff annotates each variant according to the annotation information available for the genome - whether the variant is an insertion or deletion, which genomic region it falls under (introns, exons, splice junctions, promoter regions, etc). It also categorises variants as HIGH, MODERATE or LOW impact, depending on how the variant affects codons, amino acids and protein structure.

SNPeff creates its own variant analysis reports (html reports) for each sample, which can be seen here:

H:/workshop/sarek/runs/run2_trio/results/reports/snpeff/haplotypecaller

We’ll be creating our of reports, with some additional functional information, using R.

R Markdown

We’ll be creating html reports using R Markdown. These reports contain ‘chunks’ of R code and markdown language, which allows text, headings, etc to be added to a report.

https://rmarkdown.rstudio.com/

As an example, in RStudio, select

‘File’ → “New File” → “R markdown”

Then click ‘OK’

This will create a basic example markdown report.

Click ‘Knit’ to generate the report. It will ask you to name the file and ask where to save the file. Call it ‘test’ and save it in H:/workshop

...

You will see it generates the test html report.

Functional analysis

For our report we will be taking the SNPeff results and seeing if the list of variants, and the genes associated with them, are overrepresented in metabolic pathways.

To do this we will use the R clusterProfiler package, which examines enrichment of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms, based on a set of input genes. In this case, the genes associated with the SNPeff variants.

...

Creating the report

The R script and R Markdown scripts have already been created, so you’ll just need to open them in RStudio.

In RStudio, select ‘File’ → “Open File”

Browse to H:/workshop/sarek/runs/run3_liver

Select the ‘modify.R’ file and click ‘Open’.

Open the ‘function_report.Rmd’ file in the same way (it’s in the same location as modify.R)

These are the two files you’ll need to generate the reports.

To run the scripts, you just need to change/update (and run) 3 lines in the modify.R file.

These are:

  • Line 8: setwd("H:/workshop/sarek/runs/run3_liver

...

  • ")

This sets the working directory you want to run the analysis from. This needs to be the same directory where you ran your sarek script from, i.e. where the sarek ‘results’ directory is.

IMPORTANT: while the modify'R script can be run from anywhere, you'll need to copy the function_report.Rmd script to this working directory, because that R markdown script calls to subdirectories from its location.

  • Line 69: maingroup <- "C1"

This tells the script which sample to run the analysis on. Choose a sample from the list of available samples (which are shown if you run the line above this: dir("./results/reports/snpeff/haplotypecaller/"))

  • Line 78: variant_type <- "variants_impact_HIGH"

This tells the script which category of variant results you want to analyse. SNPeff groups results by ‘impact’ and variety of other categories. To see which categories you can analyse, run the three lines above the ‘variant_type’ line:

library(data.table)

genetable <- fread(paste0("./results/reports/snpeff/haplotypecaller/", maingroup, "/", maingroup, ".haplotypecaller.filtered_snpEff.genes.txt"), skip = 1, stringsAsFactors = F)

gsub("variants_effect_", "", colnames(genetable))[5:length(colnames(genetable))]

Now just run the rmarkdown::render(.... line to automatically generate the report.

You will now see the HTML report in the working directory you defined in setwd(...). This is a single file, a self-contained report with all the tables and figures you added to it. It can be opened in a web browser, emailed to colleagues or published as a web page.