Overview
nf-core/ampliseq workflow
https://nf-co.re/ampliseq/2.9.0
Run nf-core/ampliseq with a test dataset
First create a test directory:
cd $HOME/meta_workshop/illumina
mkdir test
cd test
Then run the nf-core/ampliseq test workflow, which uses a small test dataset.
nextflow run nf-core/ampliseq -r 2.9.0 -profile test,singularity --outdir results
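If the test run fails immediately, a common cause is that one of the required tools is not on your PATH. The sketch below is one way to check before launching. check_tools is a hypothetical helper name, and the list of tools to pass to it (java, nextflow, singularity) is an assumption based on the commands used in this workshop; on an HPC you may need to load modules first.

```shell
# check_tools is a hypothetical helper, not part of Nextflow or nf-core.
# It reports whether each named tool can be found on the PATH and
# returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, run check_tools java nextflow singularity and load any module reported as missing before starting the workflow.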
Nextflow generates thousands of temporary files in a ‘work’ directory and writes the results to whatever directory you named with --outdir. Clean up these files with:
cd $HOME/meta_workshop/illumina
rm -R test
Creating the samplesheet
https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input
The samplesheet requires just two columns (three for paired-end sequences): sampleID and forwardReads.
The sampleID is whatever you want to call your samples. This can be based on the filenames (as we’ve done in creating the samplesheet.tsv file below) or names of your choice.
The forwardReads column is the full path of the fastq file for each sample.
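To make the layout concrete, the sketch below writes a two-sample, single-end samplesheet to a temporary directory and prints it. The sample names (sample_A, sample_B) and the /path/to/fastq locations are made-up placeholders, not files from the workshop dataset.

```shell
# Write an illustrative samplesheet to a temporary directory.
# sample_A, sample_B and the /path/to/fastq paths are made-up placeholders.
demo_dir=$(mktemp -d)
printf 'sampleID\tforwardReads\n' > "$demo_dir/samplesheet.tsv"
printf 'sample_A\t/path/to/fastq/sample_A.fastq.gz\n' >> "$demo_dir/samplesheet.tsv"
printf 'sample_B\t/path/to/fastq/sample_B.fastq.gz\n' >> "$demo_dir/samplesheet.tsv"

# Show the result; the two columns are separated by a single tab character.
cat "$demo_dir/samplesheet.tsv"
```

Each row after the header describes one sample, and the columns must be tab-separated because the file is a .tsv.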
First, we’ll create a directory called ‘data’, which will store our samplesheet and metadata files:
cd $HOME/meta_workshop/illumina
mkdir data
Next we’ll create the samplesheet. NOTE: you don’t need to know the Linux commands here, but if you are, or become, proficient in Linux, you can adapt the commands below for your own datasets. As a beginner, it’s probably easier to create the samplesheet and metadata files in Excel, then copy them to the HPC.
# Go to the fastq directory
cd $HOME/meta_workshop/illumina/fastq
# Get filenames (find prints each one as ./name.fastq.gz)
find . -maxdepth 1 -type f -iname "*fastq.gz" > filenames.txt
# Keep only the sample name: cut everything after the first '.' in the name (i.e. .fastq.gz) and everything up to the '/' (i.e. the leading ./)
cat filenames.txt | cut -d "." -f2 | cut -d "/" -f2 > sample_ID.txt
# NOTE - write the sample IDs to a metadata file too, which can have the metadata added later (just the sample IDs and column headers for now).
# This way the same sample IDs are in both the metadata and samplesheet files.
echo -e "ID\tNose_size\tBatch" | cat - sample_ID.txt > $HOME/meta_workshop/illumina/data/metadata.tsv
# "$PWD" adds the full path
# -maxdepth 1 doesn't look in subdirectories
find "$PWD" -maxdepth 1 -type f -iname "*.fastq.gz" > fastq.txt
# Combine the two files column-wise
paste sample_ID.txt fastq.txt > files_cols.txt
# Add headers
echo -e "sampleID\tforwardReads" | cat - files_cols.txt > $HOME/meta_workshop/illumina/data/samplesheet.tsv
# Cleanup
rm *.txt
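Whichever way you build the samplesheet, it is worth confirming that every path in it points to a real file before launching the workflow. The sketch below is one possible check. check_samplesheet is a hypothetical helper, not part of nf-core/ampliseq, and it assumes a single-end samplesheet with the sample ID in column 1 and the fastq path in column 2.

```shell
# check_samplesheet is a hypothetical helper, not part of nf-core/ampliseq.
# It prints every row whose forwardReads path (column 2) is not an existing
# file and returns non-zero if at least one path is missing.
check_samplesheet() {
    tab=$(printf '\t')
    # tail -n +2 skips the header row; read splits each row on tabs only
    missing=$(tail -n +2 "$1" | while IFS=$tab read -r id path; do
        [ -f "$path" ] || printf 'missing file for %s: %s\n' "$id" "$path"
    done)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    echo "all fastq paths exist"
}
```

For example: check_samplesheet $HOME/meta_workshop/illumina/data/samplesheet.tsv. A file that still has Windows line endings will also be reported here, because the trailing carriage return becomes part of the path.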
ANOTHER NOTE: if you create a samplesheet or metadata file in Excel and then copy it to the HPC, you should run the dos2unix command on the file(s). Windows ends each line with an extra carriage-return character that Linux tools don’t expect; dos2unix removes it.
e.g.
dos2unix $HOME/meta_workshop/illumina/data/samplesheet.tsv
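If you are not sure whether a file still has Windows line endings, you can test for the carriage-return character directly. This is a minimal sketch; has_crlf is a hypothetical helper name, and it only checks whether a carriage return occurs anywhere in the file.

```shell
# has_crlf is a hypothetical helper: it succeeds (exit status 0) when the
# file contains at least one carriage-return character, which indicates
# Windows-style line endings.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}
```

For example: has_crlf data/samplesheet.tsv && echo "run dos2unix on this file".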
Creating the metadata
https://nf-co.re/ampliseq/2.9.0/docs/usage#metadata
The metadata file contains treatment group information for the samples.
It must have an ID column, which contains the same sample names as in your samplesheet.tsv.
It also has additional columns, one for each experimental condition or variable.
In Windows File Explorer, go to the workshop ‘data’ directory you created earlier. The samplesheet.tsv and the metadata.tsv files are there.
Z:\meta_workshop\illumina\data
Open the metadata.tsv file. It will open in Excel. You’ll see it already has the column names and the sample IDs. We created it like this in the samplesheet creation section.
It does not contain treatment group information. The paper that the samples are based on didn’t have treatment groups, so we need to create dummy groups.
I’ve created two dummy columns, ‘Nose_size’ and ‘Batch’. Add some categories in these columns (e.g. ‘Big’, ‘Small’, etc). It doesn’t matter what you add, as this is just dummy data. It’s just for testing purposes. In your own dataset you’ll probably have actual treatment groups.
Once you’ve added the dummy data, save the file and run dos2unix on it.
dos2unix $HOME/meta_workshop/illumina/data/metadata.tsv
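Because the ID column has to contain the same names as the sampleID column of the samplesheet, a quick comparison after editing in Excel can catch typos or accidentally deleted rows. The sketch below is one way to do it. compare_ids is a hypothetical helper, and it assumes both files are tab-separated with a header row and the sample name in the first column.

```shell
# compare_ids is a hypothetical helper. It compares the first column of two
# tab-separated files (header rows skipped) and lists any sample ID that
# appears in only one of them.
compare_ids() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    tail -n +2 "$1" | cut -f1 | sort > "$tmp_a"
    tail -n +2 "$2" | cut -f1 | sort > "$tmp_b"
    # comm -3 drops lines common to both files, leaving only the mismatches
    mismatches=$(comm -3 "$tmp_a" "$tmp_b")
    rm -f "$tmp_a" "$tmp_b"
    if [ -n "$mismatches" ]; then
        echo "IDs found in only one file:"
        printf '%s\n' "$mismatches"
        return 1
    fi
    echo "sample IDs match"
}
```

For example: compare_ids data/samplesheet.tsv data/metadata.tsv, run from $HOME/meta_workshop/illumina. In the mismatch listing, IDs found only in the second file are indented with a tab.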
Running nf-core/ampliseq
cd $HOME/meta_workshop/illumina
module load java
nextflow run nf-core/ampliseq -r 2.9.0 -profile singularity \
    --single_end \
    --input "data/samplesheet.tsv" \
    --metadata "data/metadata.tsv" \
    --FW_primer "GGATTAGATACCCBRGTAGTC" \
    --RV_primer "TCACGRCACGAGCTGACGAC" \
    --outdir results
Parameters:
-r 2.9.0 runs version 2.9.0 of the ampliseq workflow. Pinning the version is important for reproducibility, since later releases may behave differently.