Overview

nf-core/ampliseq workflow

https://nf-co.re/ampliseq/2.9.0

Run nf-core/ampliseq with a test dataset

First create a test directory:

cd $HOME/meta_workshop/illumina
mkdir test
cd test

Then run the nf-core/ampliseq test workflow, which uses a small built-in test dataset.

nextflow run nf-core/ampliseq -r 2.9.0 -profile test,singularity --outdir results

Nextflow generates thousands of temporary files in a ‘work’ directory, and writes the results to the directory you named with --outdir. Once you have finished looking at the test results, clean up these files:

cd $HOME/meta_workshop/illumina
rm -R test

Creating the samplesheet

https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input

The samplesheet requires just two columns for single-end sequences: sampleID and forwardReads. Paired-end sequences need a third column, reverseReads.

The sampleID is whatever you want to call your samples. This can be based on the filenames (as we’ve done in creating the samplesheet.tsv file below) or names of your choice.

The forwardReads is the full path of each fastq file for each sample.
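
As an illustration of the layout described above, the sketch below writes a two-sample, single-end samplesheet to a temporary file and prints it. The sample names and the /path/to/fastq directory are hypothetical placeholders, not files from the workshop.

```shell
# Sketch: write a two-sample, single-end samplesheet and print it.
# Sample names and paths are hypothetical placeholders.
demo_sheet="$(mktemp)"

# Columns are separated by tabs (\t), with one row per sample.
printf 'sampleID\tforwardReads\n' > "$demo_sheet"
printf 'sample1\t/path/to/fastq/sample1.fastq.gz\n' >> "$demo_sheet"
printf 'sample2\t/path/to/fastq/sample2.fastq.gz\n' >> "$demo_sheet"

cat "$demo_sheet"
```

The file is tab-separated, so if you build it in Excel, save it as tab-delimited text rather than as a workbook.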

First, we’ll create a directory called ‘data’, which will store our samplesheet and metadata files.

cd $HOME/meta_workshop/illumina
mkdir data

Next we’ll create the samplesheet. NOTE: you don’t need to know the Linux commands here, but if you are or become Linux proficient, you can adapt the below as needed for your own datasets. As a beginner, it’s probably better to just create the samplesheet and metadata files in Excel, then copy to the HPC.

# Go to fastq directory
cd $HOME/meta_workshop/illumina/fastq

# Get filenames (sorted, so the order matches the full paths collected further down)
find . -maxdepth 1 -type f -iname "*.fastq.gz" | sort > filenames.txt

# Keep the text between the leading './' and the next '.', i.e. drop the './' prefix and the .fastq.gz suffix.
# This assumes the sample names themselves contain no '.'
cut -d "." -f2 filenames.txt | cut -d "/" -f2 > sample_ID.txt

# NOTE - this writes the sample IDs to a metadata file, which can have the metadata added later (just the sample IDs and column headers for now). Thus the same sample IDs are in both the metadata and samplesheet files.
echo -e "ID\tNose_size\tBatch" | cat - sample_ID.txt > $HOME/meta_workshop/illumina/data/metadata.tsv

# "$PWD" adds full path
# -maxdepth 1 doesn't look in subdirs
# sort keeps the same order as filenames.txt
find "$PWD" -maxdepth 1 -type f -iname "*.fastq.gz" | sort > fastq.txt

# Combine it column-wise
paste sample_ID.txt fastq.txt > files_cols.txt

# Add headers
echo -e "sampleID\tforwardReads" | cat - files_cols.txt > $HOME/meta_workshop/illumina/data/samplesheet.tsv

# Cleanup (remove only the intermediate files created above)
rm filenames.txt sample_ID.txt fastq.txt files_cols.txt
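
Before using the samplesheet, it can help to confirm that every listed path exists and that no sample ID is repeated. The following is a minimal sketch of such a check, not part of the workflow itself. It builds a throwaway demo samplesheet in a temporary directory (the names sampleA and sampleB are hypothetical) so it can be run anywhere.

```shell
# Sketch: sanity-check a samplesheet before running the workflow.
# A throwaway demo directory stands in for the workshop data here.
# To check the real file, set:
#   sheet="$HOME/meta_workshop/illumina/data/samplesheet.tsv"
demo_dir="$(mktemp -d)"
touch "$demo_dir/sampleA.fastq.gz" "$demo_dir/sampleB.fastq.gz"
sheet="$demo_dir/samplesheet.tsv"
printf 'sampleID\tforwardReads\n' > "$sheet"
printf 'sampleA\t%s\n' "$demo_dir/sampleA.fastq.gz" >> "$sheet"
printf 'sampleB\t%s\n' "$demo_dir/sampleB.fastq.gz" >> "$sheet"

# 1. Count listed fastq paths (column 2) that do not exist on disk.
#    tail -n +2 skips the header row.
missing=0
while IFS="$(printf '\t')" read -r id path; do
    if [ ! -f "$path" ]; then
        echo "Missing file for $id: $path"
        missing=$((missing + 1))
    fi
done <<EOF
$(tail -n +2 "$sheet")
EOF

# 2. List any sample IDs (column 1) that appear more than once.
duplicates="$(tail -n +2 "$sheet" | cut -f1 | sort | uniq -d)"

echo "Missing files: $missing"
echo "Duplicate IDs: ${duplicates:-none}"
```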

ANOTHER NOTE: if you create a samplesheet or metadata file in Excel, then copy it to the HPC, you should run the dos2unix command on the file(s). Windows ends each line with an extra carriage-return character that Linux tools don’t expect. dos2unix removes it.

e.g.

dos2unix $HOME/meta_workshop/illumina/data/samplesheet.tsv
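
To see whether a file actually has Windows line endings, you can count the lines that contain a carriage return. The sketch below does this on a temporary demo file (not a workshop file). It also shows tr as a possible fallback, assuming dos2unix is not installed on your system.

```shell
# Sketch: detect and strip Windows (CRLF) line endings.
# A temporary demo file stands in for a samplesheet saved on Windows.
demo_file="$(mktemp)"
printf 'sampleID\tforwardReads\r\n' > "$demo_file"

# A carriage return character, stored in a variable for grep.
cr="$(printf '\r')"

# Count lines that contain a carriage return (0 means the file is clean).
# grep exits non-zero when the count is 0, hence the "|| true".
before="$(grep -c "$cr" "$demo_file" || true)"

# Fallback if dos2unix is unavailable: tr deletes every carriage return.
tr -d '\r' < "$demo_file" > "$demo_file.unix"
after="$(grep -c "$cr" "$demo_file.unix" || true)"

echo "Lines with carriage returns before: $before, after: $after"
```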

Creating the metadata

https://nf-co.re/ampliseq/2.9.0/docs/usage#metadata

The metadata file contains treatment group information for the samples.

It must have an ID column, which contains the same sample names as in your samplesheet.tsv.

It also has additional columns, one for each experimental condition or variable.

In Windows File Explorer, go to the workshop ‘data’ directory you created earlier. The samplesheet.tsv and the metadata.tsv files are there.

Z:\meta_workshop\illumina\data

Open the metadata.tsv file. It will open in Excel. You’ll see it already has the column names and the sample IDs. We created it like this in the samplesheet creation section.

It does not contain treatment group information. The paper that the samples are based on didn’t have treatment groups, so we need to create dummy groups.

I’ve created two dummy columns, ‘Nose_size’ and ‘Batch’. Add some categories in these columns (e.g. ‘Big’, ‘Small’, etc). It doesn’t matter what you add, as this is dummy data for testing purposes only. In your own dataset you’ll probably have actual treatment groups.

Once you’ve added the dummy data, save the file and run dos2unix on it.

dos2unix $HOME/meta_workshop/illumina/data/metadata.tsv
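
Because the ID column must contain the same sample names as the samplesheet, a quick comparison after editing in Excel can catch accidental changes. This is a minimal sketch using temporary demo files with hypothetical sample names. For the real check, point sheet and meta at data/samplesheet.tsv and data/metadata.tsv.

```shell
# Sketch: confirm the metadata IDs match the samplesheet sample IDs.
# Temporary demo files stand in for the workshop files.
demo_dir="$(mktemp -d)"
sheet="$demo_dir/samplesheet.tsv"
meta="$demo_dir/metadata.tsv"
printf 'sampleID\tforwardReads\n' > "$sheet"
printf 'sampleA\t/path/to/sampleA.fastq.gz\n' >> "$sheet"
printf 'sampleB\t/path/to/sampleB.fastq.gz\n' >> "$sheet"
printf 'ID\tNose_size\tBatch\n' > "$meta"
printf 'sampleB\tSmall\t1\n' >> "$meta"
printf 'sampleA\tBig\t2\n' >> "$meta"

# Take column 1 of each file, skip the header row, and sort.
tail -n +2 "$sheet" | cut -f1 | sort > "$demo_dir/sheet_ids.txt"
tail -n +2 "$meta" | cut -f1 | sort > "$demo_dir/meta_ids.txt"

# comm -3 prints only the IDs found in one file but not the other,
# so empty output means the two sets of IDs match.
mismatched="$(comm -3 "$demo_dir/sheet_ids.txt" "$demo_dir/meta_ids.txt")"

if [ -z "$mismatched" ]; then
    echo "IDs match"
else
    echo "IDs found in only one file:"
    echo "$mismatched"
fi
```

The row order does not need to be the same in both files for this check, since both ID lists are sorted before comparison.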

Running nf-core/ampliseq

cd $HOME/meta_workshop/illumina
module load java
nextflow run nf-core/ampliseq -r 2.9.0 -profile singularity --single_end --ignore_failed_trimming --input "data/samplesheet.tsv" --metadata "data/metadata.tsv" --FW_primer "GGATTAGATACCCBRGTAGTC" --RV_primer "TCACGRCACGAGCTGACGAC" --outdir results

Parameters:

-r 2.9.0 runs version 2.9.0 of the ampliseq workflow. Pinning the version is important for reproducibility, as it means the same workflow code runs each time.
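
The primer sequences passed to --FW_primer and --RV_primer in the command above include IUPAC ambiguity codes (B and R), which each stand for more than one possible base. A typo in a primer is easy to miss, so the sketch below checks that a primer contains only IUPAC nucleotide letters before you launch a run. Note that check_primer is a hypothetical helper written for this example, not part of Nextflow or ampliseq.

```shell
# Sketch: check that primers use only IUPAC nucleotide codes.
# check_primer is a hypothetical helper, not part of nextflow or ampliseq.
check_primer() {
    case "$1" in
        "") return 1 ;;                    # empty string
        *[!ACGTRYSWKMBDHVN]*) return 1 ;;  # contains a non-IUPAC character
        *) return 0 ;;
    esac
}

# The two primers used in the workshop command.
for primer in "GGATTAGATACCCBRGTAGTC" "TCACGRCACGAGCTGACGAC"; do
    if check_primer "$primer"; then
        echo "OK: $primer"
    else
        echo "Invalid characters in: $primer"
    fi
done
```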
