4. Input specifications
Samplesheet input
Nextflow pipelines generally need an input file, often referred to as a samplesheet, which contains information about the samples you would like to analyse.
The samplesheet can have as many columns as you like; however, the first columns must match those required by the pipeline.
The minimum information required varies between pipelines and is specified in the usage section of the pipeline that you want to run.
When running Nextflow, use this parameter to specify the samplesheet location: --input '[path to samplesheet file]'
The samplesheet has to be a comma-separated file with a minimum set of columns and a header row.
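A quick check of the header before launching a run can catch formatting problems early. The following is a minimal Python sketch and not part of any nf-core tooling: the helper name `check_header` is hypothetical, and the list of required columns passed to it is an assumption that should be taken from the usage section of the pipeline you want to run.

```python
import csv
import io


def check_header(csv_text, required_columns):
    """Return a list of problems found in a samplesheet (empty list if none).

    Hypothetical helper: `required_columns` must be supplied by the caller,
    based on the usage documentation of the chosen pipeline.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return ["samplesheet is empty"]

    problems = []
    header = [name.strip() for name in header]
    # The required columns must come first, in the expected order.
    leading = header[: len(required_columns)]
    if leading != list(required_columns):
        problems.append(
            f"expected first columns {list(required_columns)}, found {leading}"
        )
    # Every non-blank data row should have as many fields as the header.
    for line_number, row in enumerate(reader, start=2):
        if row and len(row) != len(header):
            problems.append(
                f"line {line_number}: {len(row)} fields, expected {len(header)}"
            )
    return problems


example = "sample,fastq_1,notes\nClone1_N1,reads_1.fastq.gz,extra column is fine\n"
print(check_header(example, ["sample", "fastq_1"]))
```

For this example the function reports no problems, because the two required columns come first and the extra `notes` column follows them.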
Examples of samplesheets
For the nf-core/smrnaseq pipeline, the samplesheet has to be a comma-separated file with the following 2 columns.
Column | Description |
---|---|
sample | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_). |
fastq_1 | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”. |
Column names have to be specified in a header row, as shown in the samplesheet example below:
sample,fastq_1
Clone1_N1,s3://ngi-igenomes/test-data/smrnaseq/C1-N1-R1_S4_L001_R1_001.fastq.gz
Clone1_N3,s3://ngi-igenomes/test-data/smrnaseq/C1-N3-R1_S6_L001_R1_001.fastq.gz
Clone9_N1,s3://ngi-igenomes/test-data/smrnaseq/C9-N1-R1_S7_L001_R1_001.fastq.gz
Clone9_N2,s3://ngi-igenomes/test-data/smrnaseq/C9-N2-R1_S8_L001_R1_001.fastq.gz
Clone9_N3,s3://ngi-igenomes/test-data/smrnaseq/C9-N3-R1_S9_L001_R1_001.fastq.gz
Control_N1,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N1-R1_S1_L001_R1_001.fastq.gz
Control_N2,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N2-R1_S2_L001_R1_001.fastq.gz
Control_N3,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N3-R1_S3_L001_R1_001.fastq.gz
For the nf-core/rnaseq pipeline, the samplesheet has to be a comma-separated file with the following 4 columns:
Column | Description |
---|---|
sample | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_). |
fastq_1 | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”. |
fastq_2 | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”. |
strandedness | Sample strand-specificity. Must be one of the values accepted by the pipeline (the examples in this section use auto, forward and reverse). |
Column names have to be specified in a header row, as shown in the samplesheet example below:
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
Please note that in this example, the same sample (CONTROL_REP1) was sequenced across 3 lanes. The nf-core/rnaseq pipeline will concatenate the raw reads before performing any downstream analysis.
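To see which rows share a sample name, and would therefore be merged, the samplesheet can be grouped by its sample column. This is a minimal Python sketch for inspecting a samplesheet by hand, not the pipeline's own implementation; the helper name `group_by_sample` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_sample(csv_text):
    """Map each sample name to the list of (fastq_1, fastq_2) rows that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # An empty fastq_2 field means the row describes single-end data.
        groups[row["sample"]].append((row["fastq_1"], row["fastq_2"] or None))
    return dict(groups)


samplesheet = """sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
"""

for sample, runs in group_by_sample(samplesheet).items():
    paired = [r2 is not None for _, r2 in runs]
    if all(paired):
        layout = "paired-end"
    elif not any(paired):
        layout = "single-end"
    else:
        layout = "mixed"
    print(f"{sample}: {len(runs)} row(s), {layout}")
```

Run on the three-lane example, this reports a single sample, CONTROL_REP1, with three paired-end rows.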
Exercise 1
The following samplesheet file for the nf-core/rnaseq pipeline consisting of both single- and paired-end data is ready for analysis.
How many samples does it have in total? Tip: Make sure you check whether there are samples with replicates.
How many are single-end and how many are paired-end? Tip: Single-end samples have only one fastq.gz file; paired-end samples have a pair of fastq.gz files (generally *_R{1,2}_001.fastq.gz).
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,forward
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,forward
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,forward
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,,reverse
Exercise 2
Find the minimum set of columns required in the samplesheet to run nf-core/ampliseq.
Input folder
Some pipelines, such as nf-core/ampliseq, let you directly specify the path to the folder that contains your input FASTQ files, as an alternative to using a samplesheet.
For example:
--input_folder 'path/to/folder/containing/the/data'
File names must follow a specific pattern; the default is /*_R{1,2}_001.fastq.gz, but this can be adjusted with the --extension parameter.
For example, the following files in the folder data would be processed as sample1 and sample2:
data
|-sample1_1_L001_R1_001.fastq.gz
|-sample1_1_L001_R2_001.fastq.gz
|-sample2_1_L001_R1_001.fastq.gz
|-sample2_1_L001_R2_001.fastq.gz
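The file-name matching can be pictured with a short Python sketch. It rests on two assumptions: the default pattern is translated by hand into a regular expression, and the sample name is taken to be the part of the file name before the first underscore, which agrees with the example above but is not confirmed here. The helper `pair_reads` is hypothetical and is not part of nf-core/ampliseq.

```python
import re
from collections import defaultdict

# Hand-written regex equivalent of the glob *_R{1,2}_001.fastq.gz
DEFAULT_PATTERN = re.compile(r"^(?P<prefix>.+)_R(?P<read>[12])_001\.fastq\.gz$")


def pair_reads(file_names):
    """Group matching file names into {sample: {"R1": name, "R2": name}}."""
    samples = defaultdict(dict)
    for name in sorted(file_names):
        match = DEFAULT_PATTERN.match(name)
        if match is None:
            continue  # file does not follow the expected naming pattern
        # Assumption: the sample name is the text before the first underscore.
        sample = match.group("prefix").split("_", 1)[0]
        samples[sample]["R" + match.group("read")] = name
    return dict(samples)


files = [
    "sample1_1_L001_R1_001.fastq.gz",
    "sample1_1_L001_R2_001.fastq.gz",
    "sample2_1_L001_R1_001.fastq.gz",
    "sample2_1_L001_R2_001.fastq.gz",
    "notes.txt",
]
print(sorted(pair_reads(files)))
```

Under these assumptions the four FASTQ files are grouped into sample1 and sample2, each with an R1 and an R2 file, and notes.txt is ignored because it does not match the pattern.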
All sequencing data should originate from one sequencing run, because processing relies on run-specific error models that are unreliable when data from several sequencing runs are mixed. Sequencing data originating from multiple sequencing runs additionally requires the parameter --multiple_sequencing_runs and a specific folder structure, for example:
Where sample1 and sample2 were sequenced in one sequencing run and sample3 and sample4 in another sequencing run.