Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Run RNA-seq pipeline using the Telomere-2-Telomere (T2T) latest human genome

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies (Nurk et al., Science, 2022 https://www.science.org/doi/10.1126/science.abj6987).

GRCh38 vs. T2T assemblies

  • GRCh38:

    • Genome Reference Consortium Human Build 38 was released in December 2013.

    • 24 chromosomes (including the X and Y chromosomes) and 261 additional scaffolds that have not been assigned to a chromosome

    • Approximately 151 gaps in the primary sequence of GRCh38. These gaps are typically located in highly repetitive or hard-to-sequence regions such as centromeres, telomeres, and regions of segmental duplications.

  • T2T:

    • The Telomere-to-Telomere (T2T) consortium released the T2T assembly in 2021 being the first truly gapless human genome assembly (T2T-CHM13), which further improved upon GRCh38 by closing these gaps. However, GRCh38 remains the reference genome widely used in genomics projects.

T2T genome

The latest T2T human genome and annotation has been downloaded from NCBI:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/

You can access this genome at: /work/training/references/ncbi/T2T

Check available files:

Code Block
ls -l /work/training/references/ncbi/T2T/
Code Block
GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
GCF_009914755.1_T2T-CHM13v2.0_genomic.gff
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf
GCF_009914755.1_T2T-CHM13v2.0_protein.faa
GCF_009914755.1_T2T-CHM13v2.0_rna.fna

Print the assembly report:

Code Block
cat /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt

...

Run RNA-seq pipeline using the Telomere-2-Telomere (T2T) latest human genome

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies (Nurk et al., Science, 2022 https://www.science.org/doi/10.1126/science.abj6987).

GRCh38 vs. T2T assemblies

  • GRCh38:

    • Genome Reference Consortium Human Build 38 was released in December 2013.

    • 24 chromosomes (including the X and Y chromosomes) and 261 additional scaffolds that have not been assigned to a chromosome

    • Approximately 151 gaps in the primary sequence of GRCh38. These gaps are typically located in highly repetitive or hard-to-sequence regions such as centromeres, telomeres, and regions of segmental duplications.

  • T2T:

    • The Telomere-to-Telomere (T2T) consortium released the T2T assembly in 2021 being the first truly gapless human genome assembly (T2T-CHM13), which further improved upon GRCh38 by closing these gaps. However, GRCh38 remains the reference genome widely used in genomics projects.

T2T genome

The latest T2T human genome and annotation has been downloaded from NCBI:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/

You can access this genome at: /work/training/references/ncbi/T2T

Check available files:

Code Block
ls -l /work/training/references/ncbi/T2T/
Code Block
GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
GCF_009914755.1_T2T-CHM13v2.0_genomic.gff
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf
GCF_009914755.1_T2T-CHM13v2.0_protein.faa
GCF_009914755.1_T2T-CHM13v2.0_rna.fna

Print the assembly report:

Code Block
cat /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
Code Block
# Assembly name:  T2T-CHM13v2.0
# Description:    T2T CHM13v2.0 Telomere-to-Telomere assembly of the CHM13 cell line, with chrY from NA24385
# Organism name:  Homo sapiens (human)
# Taxid:          9606
# BioSample:      SAMN03255769
# BioProject:     PRJNA559484
# Submitter:      T2T Consortium
# Date:           2022-01-24
# Assembly type:  haploid
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# Expected final version: no
# Genome coverage: 30x
# GenBank assembly accession: GCA_009914755.4
# RefSeq assembly accession: GCF_009914755.1
# RefSeq assembly and GenBank assemblies identical: no
#
## Assembly-Units:
## GenBank Unit Accession       RefSeq Unit Accession   Assembly-Unit name
## GCA_009914825.4      GCF_009914825.1 Primary Assembly
## GCA_009914855.2              non-nuclear
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule       Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn     Assembly-Unit   Sequence-Length    UCSC-style-name
1       assembled-molecule      1       Chromosome      CP068277.2      =       NC_060925.1     Primary Assembly        248387328       chr1
2       assembled-molecule      2       Chromosome      CP068276.2      =       NC_060926.1     Primary Assembly        242696752       chr2
3       assembled-molecule      3       Chromosome      CP068275.2      =       NC_060927.1     Primary Assembly        201105948       chr3
4       assembled-molecule      4       Chromosome      CP068274.2      =       NC_060928.1     Primary Assembly        193574945       chr4
5       assembled-molecule      5       Chromosome      CP068273.2      =       NC_060929.1     Primary Assembly        182045439       chr5
6       assembled-molecule      6       Chromosome      CP068272.2      =       NC_060930.1     Primary Assembly        172126628       chr6
7       assembled-molecule      7       Chromosome      CP068271.2      =       NC_060931.1     Primary Assembly        160567428       chr7
8       assembled-molecule      8       Chromosome      CP068270.2      =       NC_060932.1     Primary Assembly        146259331       chr8
9       assembled-molecule      9       Chromosome      CP068269.2      =       NC_060933.1     Primary Assembly        150617247       chr9
10      assembled-molecule      110       Chromosome      CP068277CP068268.2      =       NC_060925060934.1     Primary Assembly        248387328 134758134      chr1 2chr10
11      assembled-molecule      2 11      Chromosome      CP068276CP068267.2      =       NC_060926060935.1     Primary Assembly        242696752135127769       chr2chr11
3 12      assembled-molecule      312       Chromosome      CP068275CP068266.2      =       NC_060927060936.1     Primary Assembly        201105948133324548       chr3chr12
413       assembled-molecule      4 13      Chromosome      CP068274CP068265.2      =       NC_060928060937.1     Primary Assembly        193574945113566686       chr4chr13
514       assembled-molecule      514       Chromosome      CP068273CP068264.2      =       NC_060929060938.1     Primary Assembly        182045439101161492       chr5chr14
6 15      assembled-molecule      6 15      Chromosome      CP068272CP068263.2      =       NC_060930060939.1     Primary Assembly        17212662899753195       chr6 7chr15
16      assembled-molecule      716       Chromosome      CP068271CP068262.2      =       NC_060931060940.1     Primary Assembly        16056742896330374       chr7 8chr16
17      assembled-molecule      8 17      Chromosome      CP068270CP068261.2      =       NC_060932060941.1     Primary Assembly        14625933184276897       chr8 9chr17
18      assembled-molecule      918       Chromosome      CP068269CP068260.2      =       NC_060933060942.1     Primary Assembly        80542538 150617247       chr9chr18
1019      assembled-molecule      1019      Chromosome      CP068268CP068259.2      =       NC_060934060943.1     Primary Assembly        13475813461707364        chr10chr19
1120      assembled-molecule      1120      Chromosome      CP068267CP068258.2      =       NC_060935060944.1     Primary Assembly        66210255 135127769       chr11chr20
1221      assembled-molecule      1221      Chromosome      CP068266CP068257.2      =       NC_060936060945.1     Primary Assembly        13332454845090682        chr12chr21
1322      assembled-molecule      1322      Chromosome      CP068265CP068256.2      =       NC_060937060946.1     Primary Assembly        51324926  113566686      chr22
chr13X 14      assembled-molecule      14X       Chromosome      CP068264CP068255.2      =       NC_060938060947.1     Primary Assembly        101161492154259566       chrX
chr14Y 15      assembled-molecule      15Y       Chromosome      CP068263CP086569.2      =       NC_060939060948.1     Primary Assembly        62460029        chrY
MT     Primary Assemblyassembled-molecule      MT  99753195    Mitochondrion   CP068254.1 chr15 16    <>  assembled-molecule    na  16    non-nuclear  Chromosome   16569   CP068262.2      =       NC_060940.1     Primary Assembly        96330374        chr16
17      assembled-molecule      17      Chromosome      CP068261.2      =       NC_060941.1     Primary Assembly        84276897        chr17
18      assembled-molecule      18      Chromosome      CP068260.2      =       NC_060942.1     Primary Assembly        80542538        chr18
19      assembled-molecule      19      Chromosome      CP068259.2      =       NC_060943.1     Primary Assembly        61707364        chr19
20      assembled-molecule      20      Chromosome      CP068258.2      =       NC_060944.1     Primary Assembly        66210255        chr20
21      assembled-molecule      21      Chromosome      CP068257.2      =       NC_060945.1     Primary Assembly        45090682        chr21
22      assembled-molecule      22      Chromosome      CP068256.2      =       NC_060946.1     Primary Assembly        51324926        chr22
X       assembled-molecule      X       Chromosome      CP068255.2      =       NC_060947.1     Primary Assembly        154259566       chrX
Y       assembled-molecule      Y       Chromosome      CP086569.2      =       NC_060948.1     Primary Assembly        62460029        chrY
MT      assembled-molecule      MT      Mitochondrion   CP068254.1      <>      na      non-nuclear     16569   chrM

Create the metadata file (samplesheet.csv):

Change to the data folder directory:

Code Block
cd $HOMEchrM

Create the metadata file (samplesheet.csv):

Change to the data folder directory:

Code Block
cd $HOME/workshop/2024-2/session4_RNAseq/data/human

Copy the bash script to the working folder

Code Block
cp /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/workshop/2024-2/session4_RNAseq/data/human
  • Note: you could replace ‘$HOME/workshop/data’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located

View the content of the script:

Example for Paired-End data (when ‘Read 1’ and ‘Read2’ are available) - Copy available script if working with PE data:

Code Block
cat /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_SEdata.sh

...

NOTE: modify ‘read1_extension’ and ‘read2_extension’ as appropriate for your data. For example: R1.fastq.gz, R2.fastq.gz or R1_001.fq.gz, R2_001.fq.gz , etc

Prior running the “create_samplesheet” script, we need to know the path to the current (working directory) - run pwd and copy the path as we will use it in the subsequent block of code:

Code Block
pwd

Let’s generate the metadata file by running the following command:

Code Block
sh create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/workshop/2024-2/session4_RNAseq/data/human

Check the newly created samplesheet.csv file:

Code Block
cat samplesheet.cvs
Code Block
sample,fastq_1,fastq_2,strandedness
SRR1039508,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039508_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039508_2.fastq.gz,auto
SRR1039509,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039509_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039509_2.fastq.gz,auto
SRR1039510,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039510_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039510_2.fastq.gz,auto
SRR1039511,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039511_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039511_2.fastq.gz,auto
SRR1039514,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039514_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039514_2.fastq.gz,auto
SRR1039517,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039517_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039517_2.fastq.gz,auto
SRR1039518,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039518_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human

Copy the bash script to the working folder

Code Block
cp /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/SRR1039518_2.fastq.gz,auto
SRR1039519,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039519_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human
  • Note: you could replace ‘$HOME/workshop/data’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located

View the content of the script:

Example for Paired-End data (when ‘Read 1’ and ‘Read2’ are available) - Copy available script if working with PE data:

Code Block
cat /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_SEdata.sh

...

Prior running the “create_samplesheet” script, we need to know the path to the current (working directory) - type:

Code Block
pwd

Let’s generate the metadata file by running the following command:

Code Block
sh create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/session4_RNAseq/data/human/SRR1039519_2.fastq.gz,auto
SRR1039520,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039520_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039520_2.fastq.gz,auto
SRR1039521,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039521_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human

Check the newly created samplesheet.csv file:

Code Block
cat samplesheet.cvs/SRR1039521_2.fastq.gz,auto

Run RNAseq pipeline using a custom genome

We can use the nf-core/rnaseq pipeline to profile the expression of genes in a custom genome (e.g., T2T or any animal or plant genome) of your interest, as long as there is a reference genome (FASTA file) and genome annotation (GTF or GFF3).

To use your own genome assembly - you need 1) FASTA genome sequence and 2) GFF/GTF genome annotation file

Code Block
--fasta my_custom_genome.fasta  # de novo assembled genome or genome not available as an igenomes reference
--gtf my_custom_genome.gtf      # genome annotatio showing the location of genes

...

Code Block
cd $HOME/workshop/2024-2/session4_RNAseq/runs/run4run3_RNAseq/_T2T

Copy and paste the code below to the terminal:

Code Block
cp $HOME/work/trainingworkshop/2024-2/rnaseqsession4_RNAseq/data/human/samplesheet.csv $HOME/workshop/2024-2/session4_RNAseq/runs/run3_RNAseq_T2T
cp $HOME/workshop/2024-2/session4_RNAseq/scripts/launch_nf-core_RNAseq_pipeline_T2T.pbs $HOME/workshop/2024-2/session4_RNAseq/runs/run3_RNAseq_T2T
  • Line 1: Copy the samplesheet.csv for pre-downloaded human samples file to the working directory

  • Line 2: Copy the launch scrip to run expression profiling using the T2T genome

Print the content of the “launch_nf_core_RNAseq_T2T.pbs” script:

Code Block
cat launch_nf_-core_RNAseq_pipeline_T2T.pbs
Code Block
#!/bin/bash -l
#PBS -N nfRNAseq_T2T
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=48:00:00
#work on current directory
cd $PBS_O_WORKDIR
#load java and set up memory settings to run nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
#run the RNAseq pipeline
nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.14.0 \
        --fasta /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna \
        --gtf /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf \
        --remove_ribo_rna \
        -profile singularity \
        --aligner star_salmon \
        --extra_trimgalore_args "--quality 30 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1 " \
        -resume

...

Submit the job to the cluster

Code Block
qsub launch_nf_-core_RNAseq_pipeline_T2T.pbs

Tip: Read the help information for Nextflow pipelines

Information on how to run a nextflow pipeline and additional available parameters can be provided on the pipeline website (i.e., https://nf-co.re/rnaseq/3.12.0/docs/usage/ ). You can also run the following command to get help information:

...

Some pipelines may need file names, and others may want a CSV file with file names, the path to raw data files, and other information.

Public genomes

ENSEMBL publishes a range of genome assemblies and annotation files for a broad range of species. Look for species of interest at:

...