Overview
nf-core/ampliseq workflow
https://nf-co.re/ampliseq/2.9.0
Run nf-core/ampliseq with a test dataset
First create a test directory:
cd $HOME/meta_workshop/illumina
mkdir test
cd test
Then run the nf-core/ampliseq test workflow, which runs on a small test dataset.
nextflow run nf-core/ampliseq -r 2.9.0 -profile test,singularity --outdir results
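Once the test run finishes, you can optionally confirm it produced output (the test writes into ‘results’ inside the current ‘test’ directory):
ls results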
Nextflow generates thousands of temporary files in a ‘work’ directory, plus the results in whatever you called your --outdir. Clean up these files by:
cd $HOME/meta_workshop/illumina
rm -R test
Creating the samplesheet
https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input
The samplesheet requires just two columns (three for paired-end sequences): sampleID and forwardReads.
The sampleID is whatever you want to call your samples. This can be based on the filenames (as we’ve done in creating the samplesheet.tsv file below) or names of your choice.
The forwardReads is the full path of each fastq file for each sample.
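A finished samplesheet looks something like this (tab-separated; the sample names and paths below are hypothetical examples, not files from the workshop dataset):
sampleID	forwardReads
sample1	/home/username/meta_workshop/illumina/fastq/sample1.fastq.gz
sample2	/home/username/meta_workshop/illumina/fastq/sample2.fastq.gz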
First, we’ll create a directory called ‘data’, which will store our samplesheet and metadata files.
cd $HOME/meta_workshop/illumina
mkdir data
Next we’ll create the samplesheet. NOTE: you don’t need to know the Linux commands here, but if you are or become Linux proficient, you can adapt the below as needed for your own datasets. As a beginner, it’s probably better to just create the samplesheet and metadata files in Excel, then copy to the HPC.
# Go to fastq directory
cd $HOME/meta_workshop/illumina/fastq

# Get filenames
find . -maxdepth 1 -type f -iname "*fastq.gz" > filenames.txt

# Keep just the sample name: cut removes the .fastq.gz extension (everything after the first '.') and the leading './'
cat filenames.txt | cut -d "." -f2 | cut -d "/" -f2 > sample_ID.txt

# NOTE - pipe this to a metadata file, which can have the metadata added later (just the sample IDs and column headers for now). This way the same sample IDs are in both the metadata and samplesheet files.
echo -e "ID\tNose_size\tBatch" | cat - sample_ID.txt > $HOME/meta_workshop/illumina/data/metadata.tsv

# "$PWD" adds the full path
# -maxdepth 1 doesn't look in subdirs
find "$PWD" -maxdepth 1 -type f -iname "*.fastq.gz" > fastq.txt

# Combine the two columns
paste sample_ID.txt fastq.txt > files_cols.txt

# Add headers
echo -e "sampleID\tforwardReads" | cat - files_cols.txt > $HOME/meta_workshop/illumina/data/samplesheet.tsv

# Cleanup
rm *.txt
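It’s worth a quick sanity check that the two files were created correctly (optional; this just prints the first few lines of each):
head $HOME/meta_workshop/illumina/data/samplesheet.tsv
head $HOME/meta_workshop/illumina/data/metadata.tsv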
ANOTHER NOTE: if you create a samplesheet or metadata file in Excel, then copy it to the HPC, you should run the dos2unix command on the file(s). Windows adds some extra line-ending characters that Linux doesn’t like; dos2unix fixes this.
e.g.
dos2unix $HOME/meta_workshop/illumina/data/samplesheet.tsv
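If you’re not sure whether a file still has Windows line endings, one quick check (assuming GNU coreutils, standard on Linux HPCs) is to print the invisible characters; a Windows-format file shows a ^M before the $ at the end of each line:
cat -A $HOME/meta_workshop/illumina/data/samplesheet.tsv | head -3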
Creating the metadata
https://nf-co.re/ampliseq/2.9.0/docs/usage#metadata
The metadata file contains treatment group information for the samples. It must have an ID column, which contains the same sample names as in your samplesheet.tsv. It also has additional columns, one for each experimental condition or variable.
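After you fill it in (see the steps below), the metadata file will look something like this (tab-separated; the group values shown are hypothetical dummy entries):
ID	Nose_size	Batch
sample1	Big	1
sample2	Small	2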
In Windows File Explorer, go to the workshop ‘data’ directory you created earlier. The samplesheet.tsv and the metadata.tsv files are there.
Z:\meta_workshop\illumina\data
Open the metadata.tsv file. It will open in Excel. You’ll see it already has the column names and the sample IDs. We created it like this in the samplesheet creation section.
It does not contain treatment group information. The paper that the samples are based on didn’t have treatment groups, so we need to create dummy groups.
I’ve created two dummy columns, ‘Nose_size’ and ‘Batch’. Add some categories in these columns (e.g. ‘Big’, ‘Small’, etc). It doesn’t matter what you add, as this is just dummy data. It’s just for testing purposes. In your own dataset you’ll probably have actual treatment groups.
Once you’ve added the dummy data, save the file and run dos2unix on it.
dos2unix $HOME/meta_workshop/illumina/data/metadata.tsv
Running nf-core/ampliseq
Run the full nf-core/ampliseq workflow by copying the following into PuTTY:
cd $HOME/meta_workshop/illumina
module load java
nextflow run nf-core/ampliseq -r 2.9.0 -profile singularity \
  --single_end --ignore_failed_trimming \
  --input "data/samplesheet.tsv" \
  --FW_primer "GGATTAGATACCCBRGTAGTC" --RV_primer "TCACGRCACGAGCTGACGAC" \
  --outdir results
This moves to your $HOME/meta_workshop/illumina directory, loads the java module (Nextflow needs this), and runs the full ampliseq workflow with all the parameters.
The parameters:
-r 2.9.0
runs version 2.9.0 of the ampliseq workflow. This is important for version control.
-profile singularity
Singularity is the container system we use on the HPC; Nextflow runs each step of the workflow inside a container.
--single_end
Since we have single-end data, we need to add this parameter. If we had paired-end data we wouldn’t need to add anything, as paired-end is the default.
--ignore_failed_trimming
Some of the samples in the public dataset are poor quality and fail the adapter trimming step. We’re ignoring these in this practice session, but if you have your own dataset you’ll want to address this in other ways (e.g. re-sequence samples, remove as outliers, etc).
--input "data/samplesheet.tsv"
The samplesheet you created. In this case the path is relative to the directory you run Nextflow from (our samplesheet is in the ‘data’ subdirectory), but it can be anywhere you like, as long as you provide the full path (see the example after this list).
--FW_primer "GGATTAGATACCCBRGTAGTC" --RV_primer "TCACGRCACGAGCTGACGAC"
The forward and reverse primers used. This is from the paper.
https://www.mdpi.com/2073-4425/11/9/1105
The hypervariable V5 and V6 regions (276 base pairs—bp) of the 16S rRNA gene were amplified using the 785F (5′-GGA TTA GAT ACC CBR GTA GTC-3′) and 1061R (5′-TCA CGR CAC GAG CTG ACG AC-3′) primers [20]
--outdir results
The output directory for results. You can call this whatever you like.
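For example, the equivalent full path for the workshop samplesheet would be:
--input "$HOME/meta_workshop/illumina/data/samplesheet.tsv"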
The workflow takes approximately 40 minutes to complete.
Interpreting ampliseq results
https://nf-co.re/ampliseq/2.9.0/docs/output
Z:\meta_workshop\illumina\results