4. Input specifications

Samplesheet input

Nextflow pipelines generally need an input file, often referred to as a samplesheet, which contains information about the samples you would like to analyse.

The samplesheet can have as many columns as you like; however, the first columns must match those required by the pipeline.

The minimum information required varies between pipelines and is specified in the usage section of the documentation for the pipeline you want to run.

When running Nextflow, use this parameter to specify the samplesheet location: --input '[path to samplesheet file]'

The samplesheet has to be a comma-separated file with a minimum set of columns and a header row.
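The header requirement can be checked before launching a run. The following is a minimal Python sketch and not part of any pipeline: the function name `header_matches` and the extra column `notes` are made up for illustration, and each pipeline performs its own, stricter validation.

```python
import csv
import io


def header_matches(samplesheet_text, required_columns):
    """Return True if the header row starts with the required columns, in order."""
    reader = csv.reader(io.StringIO(samplesheet_text))
    header = next(reader, [])  # empty list if the file has no rows at all
    return header[: len(required_columns)] == list(required_columns)


# Hypothetical samplesheet with one extra, user-defined column ("notes").
example = "sample,fastq_1,notes\nClone1_N1,reads.fastq.gz,first batch\n"
print(header_matches(example, ["sample", "fastq_1"]))
```

The check only compares the leading columns, so any additional columns after the required ones are ignored.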

Examples of samplesheets

For the nf-core/smrnaseq pipeline, the samplesheet has to be a comma-separated file with the following 2 columns:

Column	Description
`sample`	Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`).
`fastq_1`	Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”.

Column names have to be specified in a header row, as shown in the samplesheet example below:

sample,fastq_1
Clone1_N1,s3://ngi-igenomes/test-data/smrnaseq/C1-N1-R1_S4_L001_R1_001.fastq.gz
Clone1_N3,s3://ngi-igenomes/test-data/smrnaseq/C1-N3-R1_S6_L001_R1_001.fastq.gz
Clone9_N1,s3://ngi-igenomes/test-data/smrnaseq/C9-N1-R1_S7_L001_R1_001.fastq.gz
Clone9_N2,s3://ngi-igenomes/test-data/smrnaseq/C9-N2-R1_S8_L001_R1_001.fastq.gz
Clone9_N3,s3://ngi-igenomes/test-data/smrnaseq/C9-N3-R1_S9_L001_R1_001.fastq.gz
Control_N1,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N1-R1_S1_L001_R1_001.fastq.gz
Control_N2,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N2-R1_S2_L001_R1_001.fastq.gz
Control_N3,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N3-R1_S3_L001_R1_001.fastq.gz

For the nf-core/rnaseq pipeline, the samplesheet has to be a comma-separated file with the following 4 columns:

Column	Description
`sample`	Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`).
`fastq_1`	Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”.
`fastq_2`	Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”.
`strandedness`	Sample strand-specificity. Must be one of `unstranded`, `forward`, `reverse` or `auto`.

Column names have to be specified in a header row, as shown in the samplesheet example below:

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto

Please note that in this example, the same sample (CONTROL_REP1) was sequenced across 3 lanes. The nf-core/rnaseq pipeline will concatenate the raw reads before performing any downstream analysis.
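To see which rows share a sample name and would therefore be merged, the rows can be grouped by the `sample` column. This is only an illustrative Python sketch: `check_row` and `files_per_sample` are hypothetical helper names, and the checks simply mirror the column descriptions in the table above rather than the pipeline's actual validation.

```python
import csv
import io
from collections import defaultdict

VALID_STRANDEDNESS = {"unstranded", "forward", "reverse", "auto"}
VALID_EXTENSIONS = (".fastq.gz", ".fq.gz")


def check_row(row):
    """Return a list of problems found in one samplesheet row (empty if none)."""
    problems = []
    if not row["fastq_1"]:
        problems.append("fastq_1 is empty")
    for column in ("fastq_1", "fastq_2"):
        path = row[column]
        if path and not path.endswith(VALID_EXTENSIONS):
            problems.append(f"{column} has an unexpected extension: {path}")
    if row["strandedness"] not in VALID_STRANDEDNESS:
        problems.append(f"unknown strandedness: {row['strandedness']}")
    return problems


def files_per_sample(samplesheet_text):
    """Group (fastq_1, fastq_2) pairs by sample name, keeping row order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        grouped[row["sample"]].append((row["fastq_1"], row["fastq_2"]))
    return dict(grouped)


samplesheet = """sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
"""

for sample, pairs in files_per_sample(samplesheet).items():
    print(sample, "has", len(pairs), "row(s)")
```

Running a check like this locally is optional; it just makes typos in file extensions or strandedness values visible before the pipeline starts.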

Exercise 1

The following samplesheet for the nf-core/rnaseq pipeline, which contains both single- and paired-end data, is ready for analysis.

How many samples does it have in total?
How many are single-end and how many are paired-end?

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,forward
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,forward
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,forward
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,,reverse

Solution:

There are 6 samples in total: the samplesheet has 7 rows, but TREATMENT_REP3 appears twice because it was sequenced across two lanes. There are 3 single-end and 3 paired-end samples.
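The count can be double-checked with a few lines of Python. This is a sketch for the exercise only: `count_samples` is a hypothetical helper, and it assumes that an empty `fastq_2` field means single-end data.

```python
import csv
import io

exercise_sheet = """sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,forward
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,forward
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,forward
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,,reverse
"""


def count_samples(samplesheet_text):
    """Return (total, single_end, paired_end) counts of unique sample names."""
    layout = {}
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # Assumption: an empty fastq_2 field marks the row as single-end.
        layout[row["sample"]] = "paired" if row["fastq_2"] else "single"
    single = sum(1 for kind in layout.values() if kind == "single")
    paired = sum(1 for kind in layout.values() if kind == "paired")
    return len(layout), single, paired


print(count_samples(exercise_sheet))  # prints (6, 3, 3)
```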

Exercise 2

Find the minimum set of columns required in the samplesheet to run nf-core/ampliseq.

Solution:

You will need to go to the usage page of nf-core/ampliseq, which can be found at https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input

(make sure you are using the latest version of the pipeline).

The samplesheet input section states that the samplesheet must contain at least 2 columns: sampleID and forwardReads.

Input folder

Some pipelines, such as nf-core/ampliseq, let you specify the path to the folder containing your input FASTQ files directly, as an alternative to using a samplesheet.

For example:

--input_folder 'path/to/data/'

File names must follow a specific pattern. The default is /*_R{1,2}_001.fastq.gz, but this can be adjusted with the --extension parameter.

For example, the following files in the folder data would be processed as sample1 and sample2:

data
    |-sample1_1_L001_R1_001.fastq.gz
    |-sample1_1_L001_R2_001.fastq.gz
    |-sample2_1_L001_R1_001.fastq.gz
    |-sample2_1_L001_R2_001.fastq.gz
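The matching of file names against the default pattern can be sketched in Python as follows. Two things here are assumptions rather than the pipeline's implementation: the glob pattern is approximated with a regular expression, and the sample name is taken to be the text before the first underscore, which is consistent with the example above but should be confirmed in the nf-core/ampliseq usage documentation. `group_reads` is a hypothetical helper name.

```python
import re

# Regular-expression approximation of the default pattern /*_R{1,2}_001.fastq.gz
READ_SUFFIX = re.compile(r"_R([12])_001\.fastq\.gz$")


def group_reads(filenames):
    """Map each sample name to its R1/R2 files; non-matching names are skipped."""
    samples = {}
    for name in filenames:
        match = READ_SUFFIX.search(name)
        if match is None:
            continue  # the file does not follow the expected pattern
        # ASSUMPTION: the sample name is the text before the first underscore.
        sample = name.split("_", 1)[0]
        samples.setdefault(sample, {})["R" + match.group(1)] = name
    return samples


files = [
    "sample1_1_L001_R1_001.fastq.gz",
    "sample1_1_L001_R2_001.fastq.gz",
    "sample2_1_L001_R1_001.fastq.gz",
    "sample2_1_L001_R2_001.fastq.gz",
    "notes.txt",  # ignored: does not match the pattern
]

for sample, reads in group_reads(files).items():
    print(sample, sorted(reads))
```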

All sequencing data should originate from one sequencing run, because processing relies on run-specific error models that are unreliable when data from several sequencing runs are mixed. Sequencing data originating from multiple sequencing runs additionally requires the parameter --multiple_sequencing_runs and a specific folder structure, for example:

data
    |-runA
    |   |-sample1_1_L001_R1_001.fastq.gz
    |   |-sample1_1_L001_R2_001.fastq.gz
    |   |-sample2_1_L001_R1_001.fastq.gz
    |   |-sample2_1_L001_R2_001.fastq.gz
    |
    |-runB
        |-sample3_1_L001_R1_001.fastq.gz
        |-sample3_1_L001_R2_001.fastq.gz
        |-sample4_1_L001_R1_001.fastq.gz
        |-sample4_1_L001_R2_001.fastq.gz

Here, sample1 and sample2 were sequenced in one sequencing run, and sample3 and sample4 in another.
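The relationship between sub-folders and sequencing runs can be illustrated with a short Python sketch. It is not how the pipeline is implemented: `samples_by_run` is a hypothetical helper, the run name is assumed to be the name of the folder directly containing each file, and the sample name is assumed to be the text before the first underscore of the file name.

```python
from pathlib import PurePosixPath


def samples_by_run(paths):
    """Map each run folder name to the sorted sample names found inside it."""
    runs = {}
    for raw_path in paths:
        path = PurePosixPath(raw_path)
        run = path.parent.name  # ASSUMPTION: the parent folder identifies the run
        sample = path.name.split("_", 1)[0]  # ASSUMPTION: text before first "_"
        runs.setdefault(run, set()).add(sample)
    return {run: sorted(names) for run, names in runs.items()}


paths = [
    "data/runA/sample1_1_L001_R1_001.fastq.gz",
    "data/runA/sample1_1_L001_R2_001.fastq.gz",
    "data/runA/sample2_1_L001_R1_001.fastq.gz",
    "data/runA/sample2_1_L001_R2_001.fastq.gz",
    "data/runB/sample3_1_L001_R1_001.fastq.gz",
    "data/runB/sample3_1_L001_R2_001.fastq.gz",
    "data/runB/sample4_1_L001_R1_001.fastq.gz",
    "data/runB/sample4_1_L001_R2_001.fastq.gz",
]

print(samples_by_run(paths))
```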