Overview
nf-core/ampliseq workflow
https://nf-co.re/ampliseq/2.9.0
Run nf-core/ampliseq with a test dataset
First create a test directory:
cd $HOME/meta_workshop/illumina
mkdir test
cd test
Then run the nf-core/ampliseq test workflow, which uses a small built-in test dataset.
nextflow run nf-core/ampliseq -r 2.9.0 -profile test,singularity --outdir results
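Once the test finishes, you can confirm it produced output by listing the directories it created (the exact contents vary with the workflow version, so treat this as a quick sanity check only):
ls results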
Nextflow generates thousands of temporary files in a ‘work’ directory, plus the results in whatever you called your --outdir. Clean up these files by:
cd $HOME/meta_workshop/illumina
rm -R test
Creating the samplesheet
https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input
The samplesheet requires just two columns (three for paired-end sequences): sampleID and forwardReads.
The sampleID is whatever you want to call your samples. This can be based on the filenames (as we’ve done in creating the samplesheet.tsv file below) or names of your choice.
The forwardReads column contains the full path to the fastq file for each sample.
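For example, a minimal single-end samplesheet looks like this (the sample names and paths below are made up for illustration; the real file separates the columns with tabs):
sampleID    forwardReads
sample1     /home/username/meta_workshop/illumina/fastq/sample1.fastq.gz
sample2     /home/username/meta_workshop/illumina/fastq/sample2.fastq.gz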
First, we’ll create a directory called ‘data’, which will store our samplesheet and metadata files
cd $HOME/meta_workshop/illumina
mkdir data
Next we’ll create the samplesheet. NOTE: you don’t need to know the Linux commands here, but if you are or become Linux proficient, you can adapt the below as needed for your own datasets. As a beginner, it’s probably better to just create the samplesheet and metadata files in Excel, then copy to the HPC.
# Go to fastq directory
cd $HOME/meta_workshop/illumina/fastq

# Get filenames
find -maxdepth 1 -type f -iname "*fastq.gz" > filenames.txt

# Cut everything after the '.' (i.e. .fastq.gz) and before the '/' (i.e. remove the leading './')
cat filenames.txt | cut -d "." -f2 | cut -d "/" -f2 > sample_ID.txt

# NOTE - pipe this to a metadata file, which can have the metadata added later
# (just the sample IDs and column headers for now). This way the same sample
# IDs appear in both the metadata and samplesheet files.
echo -e "ID\tNose_size\tBatch" | cat - sample_ID.txt > $HOME/meta_workshop/illumina/data/metadata.tsv

# "$PWD" adds the full path; -maxdepth 1 doesn't look in subdirectories
find "$PWD" -maxdepth 1 -type f -iname "*.fastq.gz" > fastq.txt

# Combine column-wise
paste sample_ID.txt fastq.txt > files_cols.txt

# Add headers
echo -e "sampleID\tforwardReads" | cat - files_cols.txt > $HOME/meta_workshop/illumina/data/samplesheet.tsv

# Cleanup
rm *.txt
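As an optional check, you can print the first few lines of the generated samplesheet to confirm the IDs and paths line up (the column command, available on most Linux systems, just aligns the tab-separated fields for display):
# Show the first 3 lines of the samplesheet, tab-aligned for readability
head -n 3 $HOME/meta_workshop/illumina/data/samplesheet.tsv | column -t -s $'\t'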
ANOTHER NOTE: if you create a samplesheet or metadata file in Excel, then copy it to the HPC, you should run the dos2unix command on the file(s). Windows adds carriage-return line endings that Linux doesn’t like; dos2unix fixes this.
e.g.
dos2unix $HOME/meta_workshop/illumina/data/samplesheet.tsv
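If you’re not sure whether a file needs converting, the standard Linux file command reports Windows line endings (this is a general check, not specific to the workshop):
# Prints "... with CRLF line terminators" if the file still has Windows endings
file $HOME/meta_workshop/illumina/data/samplesheet.tsv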
Creating the metadata
https://nf-co.re/ampliseq/2.9.0/docs/usage#metadata
The metadata file contains treatment group information for the samples.
It must have an ID column, which contains the same sample names as in your samplesheet.tsv.
It also has additional columns, one for each experimental condition or variable.
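Filled in, a metadata file might look like this (the sample IDs and group values below are illustrative only; the columns are tab-separated in the real file):
ID        Nose_size    Batch
sample1   Big          A
sample2   Small        A
sample3   Big          B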
In Windows File Explorer, go to the workshop ‘data’ directory you created earlier. The samplesheet.tsv and the metadata.tsv files are there.
Z:\meta_workshop\illumina\data
Open the metadata.tsv file. It will open in Excel. You’ll see it already has the column names and the sample IDs. We created it like this in the samplesheet creation section.
It does not contain treatment group information. The paper that the samples are based on didn’t have treatment groups, so we need to create dummy groups.
I’ve created two dummy columns, ‘Nose_size’ and ‘Batch’. Add some categories in these columns (e.g. ‘Big’, ‘Small’, etc). It doesn’t matter what you add, as this is just dummy data. It’s just for testing purposes. In your own dataset you’ll probably have actual treatment groups.
Once you’ve added the dummy data, save the file and run dos2unix on it.
dos2unix $HOME/meta_workshop/illumina/data/metadata.tsv
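Since both files sit in the same directory, you can also convert them in one go (dos2unix accepts multiple files):
dos2unix $HOME/meta_workshop/illumina/data/*.tsv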
Running nf-core/ampliseq
Run the full nf-core/ampliseq workflow by copying the following into PuTTY:
cd $HOME/meta_workshop/illumina
module load java
nextflow run nf-core/ampliseq -r 2.9.0 -profile singularity \
    --single_end --ignore_failed_trimming \
    --input "data/samplesheet.tsv" --metadata "data/metadata.tsv" \
    --FW_primer "GGATTAGATACCCBRGTAGTC" --RV_primer "TCACGRCACGAGCTGACGAC" \
    --outdir results
This moves to your $HOME/meta_workshop/illumina directory, loads the java module (Nextflow needs this), and runs the full ampliseq workflow with all the parameters.
The parameters:
-r 2.9.0
runs version 2.9.0 of the ampliseq workflow. Pinning the version like this is important for reproducibility and version control.
-profile singularity
specifies Singularity, the container engine we use on the HPC. Nextflow runs each step of the workflow inside a container.
--single_end
Since we have single-end data, we need to add this parameter. If we had paired-end data we wouldn’t need to add anything, as paired-end is the default.
--ignore_failed_trimming
Some of the samples in the public dataset are poor quality and fail the adapter trimming step. We’re ignoring these in this practice session, but with your own dataset you’ll want to address failures in other ways (e.g. re-sequencing samples, removing them as outliers, etc.).
--input "data/samplesheet.tsv" --metadata "data/metadata.tsv"
The samplesheet and metadata files you created. Here they’re in the ‘data’ subdirectory, but they can be anywhere you like, in which case you should provide the full path.
--FW_primer "GGATTAGATACCCBRGTAGTC" --RV_primer "TCACGRCACGAGCTGACGAC"
The forward and reverse primers used. This is from the paper.
https://www.mdpi.com/2073-4425/11/9/1105
The hypervariable V5 and V6 regions (276 base pairs—bp) of the 16S rRNA gene were amplified using the 785F (5′-GGA TTA GAT ACC CBR GTA GTC-3′) and 1061R (5′-TCA CGR CAC GAG CTG ACG AC-3′) primers [20]
--outdir results
The output directory for results. You can call this whatever you like.
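As with the test run, the full workflow leaves a large ‘work’ directory of intermediate files. Once you’re happy with the results, you can remove it to free up space (note this also deletes Nextflow’s cache, so you lose the ability to -resume the run):
cd $HOME/meta_workshop/illumina
rm -r work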