3. Illumina using nf-core/ampliseq

Overview

 

nf-core/ampliseq workflow

[Figure: nf-core/ampliseq workflow diagram]

https://nf-co.re/ampliseq/2.9.0

 

Run nf-core/ampliseq with a test dataset

 

First create a test directory:

cd $HOME/meta_workshop/illumina
mkdir test
cd test

 

Then run the nf-core/ampliseq test profile, which runs the workflow on a small built-in test dataset.

module load java
nextflow run nf-core/ampliseq -r 2.9.0 -profile test,singularity --outdir results

 

Nextflow generates thousands of temporary files in a ‘work’ directory and writes the results to whatever you named in --outdir. Clean up these files by removing the test directory:

cd $HOME/meta_workshop/illumina
rm -R test

 

Creating the samplesheet

 

https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input

The samplesheet requires just two columns for single-end data: sampleID and forwardReads. Paired-end data needs a third column, reverseReads.

The sampleID is whatever you want to call your samples. This can be based on the filenames (as we’ve done in creating the samplesheet.tsv file below) or names of your choice.

The forwardReads column holds the full path to the FASTQ file for each sample.

 

First, we’ll create a directory called ‘data’, which will store our samplesheet and metadata files.
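A minimal sketch of this step, assuming the ‘data’ directory sits inside the workshop folder used elsewhere on this page ($HOME/meta_workshop/illumina):

```shell
# Create the 'data' directory inside the workshop folder.
# -p creates missing parent directories and does not complain if 'data' already exists.
mkdir -p "$HOME/meta_workshop/illumina/data"

# Move into the workshop folder so relative paths such as data/samplesheet.tsv resolve.
cd "$HOME/meta_workshop/illumina"
```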

 

Next, we’ll create the samplesheet. NOTE: you don’t need to understand the Linux commands used here, but if you are (or become) proficient in Linux, you can adapt them for your own datasets. As a beginner, it’s probably easier to create the samplesheet and metadata files in Excel and then copy them to the HPC.
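One way to build both files from the command line is sketched below. It is an illustration rather than the exact workshop commands: the helper name make_sheets, the .fastq.gz file ending, and the use of the file name as the sample ID are all assumptions. The metadata skeleton uses the ID, Nose_size and Batch columns described later on this page and leaves the group columns empty.

```shell
# Sketch: write samplesheet.tsv and a metadata.tsv skeleton for single-end FASTQ files.
# Assumptions (not from the workshop): files end in .fastq.gz, and the file name
# without that extension is used as the sample ID.
make_sheets() {
    fastq_dir=$1    # directory holding the FASTQ files (hypothetical location)
    out_dir=$2      # directory that receives samplesheet.tsv and metadata.tsv

    # Header rows: two columns for single-end reads; metadata starts with an ID column
    printf 'sampleID\tforwardReads\n' > "$out_dir/samplesheet.tsv"
    printf 'ID\tNose_size\tBatch\n' > "$out_dir/metadata.tsv"

    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue                 # nothing matched the pattern
        id=$(basename "$fq" .fastq.gz)           # sample ID taken from the file name
        abs_dir=$(cd "$(dirname "$fq")" && pwd)  # absolute path of the FASTQ directory
        printf '%s\t%s\n' "$id" "$abs_dir/$(basename "$fq")" >> "$out_dir/samplesheet.tsv"
        printf '%s\t\t\n' "$id" >> "$out_dir/metadata.tsv"   # group columns left empty
    done
}
```

It could then be called as make_sheets /path/to/fastq "$HOME/meta_workshop/illumina/data", where /path/to/fastq is a placeholder for wherever your FASTQ files are stored.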

 

ANOTHER NOTE: if you create a samplesheet or metadata file in Excel and then copy it to the HPC, you should run the dos2unix command on the file(s). Windows ends each line with an extra carriage-return character that Linux tools can misread. dos2unix removes it.

e.g.
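As a small illustration, the sketch below writes a scratch file with Windows line endings and then converts it. The scratch file and the tr fallback (for machines where dos2unix is not installed) are assumptions added for the demonstration, not part of the workshop.

```shell
# Demonstration on a scratch file, so no workshop file is touched.
demo=$(mktemp)
printf 'ID\tBatch\r\nS1\tA\r\n' > "$demo"    # two lines with Windows (CRLF) endings

if command -v dos2unix >/dev/null 2>&1; then
    dos2unix "$demo"                          # rewrites the file in place
else
    # Stand-in where dos2unix is not installed: delete every carriage return
    tr -d '\r' < "$demo" > "$demo.tmp" && mv "$demo.tmp" "$demo"
fi

# Count the lines that still contain a carriage return; 0 means the file is clean
grep -c "$(printf '\r')" "$demo" || true
```

For the workshop files, the equivalent on the HPC would be dos2unix data/samplesheet.tsv data/metadata.tsv, run from the illumina directory.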

 

Creating the metadata

 

https://nf-co.re/ampliseq/2.9.0/docs/usage#metadata

The metadata file contains treatment group information for the samples.

It must have an ID column, which contains the same sample names as in your samplesheet.tsv.

It also has additional columns, one for each experimental condition or variable.
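Because the ID column has to match the samplesheet, a quick comparison of the two files can catch typos before a long run. The helper below is a sketch: check_ids is a hypothetical name, and it assumes both files are tab-separated with a single header row.

```shell
# Sketch: check that the metadata ID column lists the same samples as the samplesheet.
# check_ids is a hypothetical helper name, not part of ampliseq.
check_ids() {
    sheet_ids=$(mktemp)
    meta_ids=$(mktemp)
    # Column 1 without the header row, carriage returns removed, sorted
    tail -n +2 "$1" | cut -f1 | tr -d '\r' | sort > "$sheet_ids"
    tail -n +2 "$2" | cut -f1 | tr -d '\r' | sort > "$meta_ids"
    # diff prints the differing IDs and returns non-zero when the lists differ
    diff "$sheet_ids" "$meta_ids"
    status=$?
    rm -f "$sheet_ids" "$meta_ids"
    return "$status"
}
```

For example, check_ids data/samplesheet.tsv data/metadata.tsv && echo "IDs match" prints the confirmation only when both files list the same samples.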

 

In Windows File Explorer, go to the workshop ‘data’ directory you created earlier. The samplesheet.tsv and the metadata.tsv files are there.

Z:\meta_workshop\illumina\data

 

Open the metadata.tsv file. It will open in Excel. You’ll see it already has the column names and the sample IDs. We created it like this in the samplesheet creation section.

It does not contain treatment group information. The paper that the samples are based on didn’t have treatment groups, so we need to create dummy groups.

I’ve created two dummy columns, ‘Nose_size’ and ‘Batch’. Add some categories in these columns (e.g. ‘Big’, ‘Small’, etc). It doesn’t matter what you add, as this is dummy data for testing only. In your own dataset you’ll probably have actual treatment groups.

 

Once you’ve added the dummy data, save the file and run dos2unix on it.

 

Running nf-core/ampliseq

 

Run the full nf-core/ampliseq workflow by copying the following into PuTTY:

This moves to your $HOME/meta_workshop/illumina directory, loads the java module (Nextflow needs this) and runs the full ampliseq workflow, with all the parameters.
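A sketch of that command, assembled from the parameters described in this section, is shown here. The parameter values come from this page; the function name run_full_workflow and the check that nextflow is on the PATH are additions for illustration, so treat the wrapper as an assumption rather than the exact workshop text.

```shell
# Reconstruction of the full run from the parameters described in this section.
# run_full_workflow is a hypothetical helper name used only to group the steps.
run_full_workflow() {
    cd "$HOME/meta_workshop/illumina" || return 1
    module load java    # Nextflow needs Java
    nextflow run nf-core/ampliseq \
        -r 2.9.0 \
        -profile singularity \
        --single_end \
        --ignore_failed_trimming \
        --input "data/samplesheet.tsv" \
        --FW_primer "GGATTAGATACCCBRGTAGTC" \
        --RV_primer "TCACGRCACGAGCTGACGAC" \
        --outdir results
}

# Start the run only where Nextflow is on the PATH (assumed to be the case on the HPC)
if command -v nextflow >/dev/null 2>&1; then
    run_full_workflow
else
    echo "nextflow not found; paste this into a session on the HPC" >&2
fi
```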

 

The parameters:

-r 2.9.0 runs version 2.9.0 of the ampliseq workflow. Pinning the version keeps your analysis reproducible.

-profile singularity tells Nextflow to run each step inside Singularity containers, which is the container type we use on the HPC.

--single_end Since we have single-end data, we need to add this parameter. If we had paired-end data we wouldn’t need to add anything, as paired-end is the default.

--ignore_failed_trimming Some of the samples in the public dataset are poor quality and fail the adapter trimming step. We’re ignoring these in this practice session, but if you have your own dataset you’ll want to address this in other ways (e.g. re-sequence samples, remove as outliers, etc).

--input "data/samplesheet.tsv" The samplesheet you created. Note that in this case it must be in the ‘data’ subdirectory, but it can be anywhere you like, as long as you provide the full path to it.

--FW_primer "GGATTAGATACCCBRGTAGTC" --RV_primer "TCACGRCACGAGCTGACGAC" The forward and reverse primers used. These are taken from the paper:

Comparison of Illumina versus Nanopore 16S rRNA Gene Sequencing of the Human Nasal Microbiota

The hypervariable V5 and V6 regions (276 base pairs—bp) of the 16S rRNA gene were amplified using the 785F (5′-GGA TTA GAT ACC CBR GTA GTC-3′) and 1061R (5′-TCA CGR CAC GAG CTG ACG AC-3′) primers [20]

--outdir results The output directory for results. You can call this whatever you like.

 

The workflow takes approximately 40 minutes to complete.

 

Running again, with metadata

 

To save time, the previous run didn’t include the dummy metadata. You can now run the workflow again in the background, with the metadata option included, to see how it runs.

Additional parameters:

--metadata "data/metadata.tsv" Include the metadata file

--outdir results_with_metadata Output results to a new directory, called results_with_metadata
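Combining the earlier parameters with these two gives a command along the following lines. As before, the parameter values come from this page, while the function name run_with_metadata and the PATH check are illustrative assumptions.

```shell
# Sketch of the second run: same parameters as the first, plus --metadata,
# writing to a separate output directory so the first results are kept.
# run_with_metadata is a hypothetical helper name.
run_with_metadata() {
    cd "$HOME/meta_workshop/illumina" || return 1
    module load java    # Nextflow needs Java
    nextflow run nf-core/ampliseq \
        -r 2.9.0 \
        -profile singularity \
        --single_end \
        --ignore_failed_trimming \
        --input "data/samplesheet.tsv" \
        --metadata "data/metadata.tsv" \
        --FW_primer "GGATTAGATACCCBRGTAGTC" \
        --RV_primer "TCACGRCACGAGCTGACGAC" \
        --outdir results_with_metadata
}

# Start the run only where Nextflow is on the PATH (assumed to be the case on the HPC)
if command -v nextflow >/dev/null 2>&1; then
    run_with_metadata
else
    echo "nextflow not found; paste this into a session on the HPC" >&2
fi
```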

 

Interpreting ampliseq results

 

https://nf-co.re/ampliseq/2.9.0/docs/output

 

Z:\meta_workshop\illumina\results