6. Using a custom genome (T2T)

Run the RNA-seq pipeline using the latest Telomere-to-Telomere (T2T) human genome

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies (Nurk et al., 2022, "The complete sequence of a human genome", Science).

GRCh38 vs. T2T assemblies

  • GRCh38:

    • Genome Reference Consortium Human Build 38 was released in December 2013.

    • Contains 24 chromosomes (including the X and Y chromosomes) and 261 additional scaffolds that have not been assigned to a chromosome.

    • Has approximately 151 gaps in its primary sequence. These gaps are typically located in highly repetitive or hard-to-sequence regions such as centromeres, telomeres, and segmental duplications.

  • T2T:

    • The Telomere-to-Telomere (T2T) Consortium released the T2T assembly (T2T-CHM13) in 2021. It is the first truly gapless human genome assembly and improves on GRCh38 by closing the gaps listed above. However, GRCh38 remains the reference genome most widely used in genomics projects.

T2T genome

The latest T2T human genome and its annotation have been downloaded from NCBI:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/

You can access this genome at: /work/training/references/ncbi/T2T

Check available files:

ls -l /work/training/references/ncbi/T2T/
GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
GCF_009914755.1_T2T-CHM13v2.0_genomic.gff
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf
GCF_009914755.1_T2T-CHM13v2.0_protein.faa
GCF_009914755.1_T2T-CHM13v2.0_rna.fna

Print the assembly report:

cat /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt

Create the metadata file (samplesheet.csv):

Change to the data directory:

Copy the bash script to the working directory:

  • Note: you can replace ‘$HOME/workshop/data’ with ‘.’ (a dot). The dot means ‘current directory’, so the file is copied to the directory you are currently in.

View the content of the script:

Example for paired-end (PE) data, where both ‘Read 1’ and ‘Read 2’ files are available. Copy this script if you are working with PE data:

[Screenshot of the paired-end samplesheet script: image-20241011-062402.png]

NOTE: modify ‘read1_extension’ and ‘read2_extension’ to match your data, for example R1.fastq.gz and R2.fastq.gz, or R1_001.fq.gz and R2_001.fq.gz.
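The workshop script itself is not reproduced here. As a rough sketch of the same idea, the function below pairs Read 1 and Read 2 files by their extensions and prints a samplesheet in the nf-core/rnaseq column layout (sample, fastq_1, fastq_2, strandedness). The function name make_samplesheet, its arguments, and the default extensions are assumptions made for illustration, not the contents of the “create_samplesheet” script.

```shell
#!/usr/bin/env bash
# Sketch only: make_samplesheet and its defaults are hypothetical,
# not the workshop's create_samplesheet script.
make_samplesheet() {
    local fastq_dir="$1"                         # directory holding the FASTQ files
    local read1_extension="${2:-_R1.fastq.gz}"   # suffix of the Read 1 files
    local read2_extension="${3:-_R2.fastq.gz}"   # suffix of the Read 2 files
    local r1 r2 sample

    # Column header expected by nf-core/rnaseq
    echo "sample,fastq_1,fastq_2,strandedness"

    for r1 in "$fastq_dir"/*"$read1_extension"; do
        [ -e "$r1" ] || continue                 # the glob matched no files
        # Sample name = Read 1 file name without its extension
        sample="$(basename "$r1" "$read1_extension")"
        r2="$fastq_dir/${sample}${read2_extension}"
        if [ -e "$r2" ]; then
            echo "${sample},${r1},${r2},auto"
        else
            echo "WARNING: no Read 2 file for ${sample}, skipping" >&2
        fi
    done
}
```

Under these assumptions, running make_samplesheet "$(pwd)" _R1_001.fq.gz _R2_001.fq.gz > samplesheet.csv from the data directory would write absolute paths into the sheet. The strandedness column is set to auto so that the pipeline infers it; replace it with forward, reverse, or unstranded if you know the library type.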

Before running the “create_samplesheet” script, we need the path to the current working directory. Run pwd and copy the path, as we will use it in the next block of code:

Let’s generate the metadata file by running the following command:

Check the newly created samplesheet.csv file:
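Beyond printing the file, it can help to confirm that every FASTQ path listed in the samplesheet actually exists before submitting a job. The following is a minimal sketch of such a check; check_samplesheet is a hypothetical helper, not part of the workshop material or of nf-core, and it assumes the four-column layout sample,fastq_1,fastq_2,strandedness with no commas inside file names.

```shell
#!/usr/bin/env bash
# Sketch only: check_samplesheet is a hypothetical helper.
# Prints the number of listed FASTQ files that do not exist.
check_samplesheet() {
    local sheet="$1" missing=0
    local header sample fq1 fq2 strand fq
    {
        read -r header                       # skip the header line
        while IFS=, read -r sample fq1 fq2 strand; do
            for fq in "$fq1" "$fq2"; do
                [ -n "$fq" ] || continue     # fastq_2 is empty for single-end samples
                if [ ! -e "$fq" ]; then
                    echo "MISSING: $fq (sample: $sample)" >&2
                    missing=$((missing + 1))
                fi
            done
        done
    } < "$sheet"
    echo "$missing"
}
```

Calling check_samplesheet samplesheet.csv prints 0 when all files are found. Any missing path is reported on standard error, which usually points to a wrong directory or a wrong read extension.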

Run RNAseq pipeline using a custom genome

We can use the nf-core/rnaseq pipeline to profile gene expression against a custom genome of interest (e.g., T2T, or any animal or plant genome), as long as a reference genome (FASTA file) and a genome annotation (GTF or GFF3 file) are available.

To use your own genome assembly, you need: 1) a FASTA file with the genome sequence, and 2) a GFF/GTF genome annotation file.

Move to the working directory for running the T2T expression profiling:
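A common problem with custom genomes is that the sequence names in the annotation do not match those in the FASTA file (for example ‘chr1’ versus ‘1’ versus an accession). The sketch below lists annotation sequence names that are absent from the FASTA; check_seqnames is a hypothetical helper written for this page, and it assumes an uncompressed, non-empty FASTA and a GTF whose first column is the sequence name.

```shell
#!/usr/bin/env bash
# Sketch only: check_seqnames is a hypothetical helper.
# Prints each sequence name used in the GTF that has no FASTA record.
check_seqnames() {
    local fasta="$1" gtf="$2"
    awk '
        # First file (FASTA): remember the first word of every header line
        NR == FNR {
            if ($0 ~ /^>/) { name = $1; sub(/^>/, "", name); seen[name] = 1 }
            next
        }
        # Second file (GTF): ignore comments and blank lines
        /^#/ || NF == 0 { next }
        # Report each unknown sequence name once
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}
```

An empty result means every annotated sequence is present in the genome. For the files on this page, the call would be check_seqnames with the _genomic.fna and _genomic.gtf paths under /work/training/references/ncbi/T2T; note that scanning a full human genome this way takes some time.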

Copy and paste the code below to the terminal:

  • Line 1: Copy the samplesheet.csv file for the pre-downloaded human samples to the working directory

  • Line 2: Copy the launch script for running expression profiling with the T2T genome

Print the content of the “launch_nf_core_RNAseq_T2T.pbs” script:

NOTE:

  • Do not specify the --genome parameter with pipeline version 3.14.0. Earlier versions required either --genome null or --genome custom, but this is no longer needed in the latest version.

  • See the explanation of the code below:

[Screenshot explaining the launch script: image-20241013-091550.png]
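The contents of “launch_nf_core_RNAseq_T2T.pbs” are not reproduced here. As an illustration of what a custom-genome launch might look like, the sketch below assembles a nextflow run command that passes the T2T FASTA and GTF through --fasta and --gtf and leaves out --genome. The singularity profile, the results directory name, and the DRY_RUN switch are assumptions made for this example, and the PBS resource directives are omitted because they are cluster specific.

```shell
#!/usr/bin/env bash
# Sketch only: not the workshop's launch_nf_core_RNAseq_T2T.pbs script.
# Assumed for illustration: -profile singularity, --outdir results, DRY_RUN.
ref_dir=/work/training/references/ncbi/T2T
prefix=GCF_009914755.1_T2T-CHM13v2.0

# Build the command as positional parameters so that quoting is preserved.
# --fasta and --gtf point at the custom genome; --genome is deliberately absent.
set -- nextflow run nf-core/rnaseq \
    -r 3.14.0 \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    --fasta "${ref_dir}/${prefix}_genomic.fna" \
    --gtf "${ref_dir}/${prefix}_genomic.gtf"

# Show the command; set DRY_RUN=0 to actually launch the pipeline
echo "$*"
if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
fi
```

Printing the command first makes it easy to review the paths before committing cluster time. On a PBS cluster, a script like this would typically be submitted with qsub followed by the script name.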

Submit the job to the cluster

Tip: Read the help information for Nextflow pipelines

Information on how to run a Nextflow pipeline, and on the additional parameters available, can be found on the pipeline website (e.g., rnaseq: Usage). You can also run the following command to get help information:
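One way to request the help text is to pass --help to the pipeline itself. The small wrapper below is a sketch: show_pipeline_help is a hypothetical name, and the default pipeline and version simply mirror those used on this page. It first checks that nextflow is available, since on many clusters it has to be loaded before use.

```shell
#!/usr/bin/env bash
# Sketch only: show_pipeline_help is a hypothetical convenience wrapper.
show_pipeline_help() {
    local pipeline="${1:-nf-core/rnaseq}"   # pipeline to query
    local version="${2:-3.14.0}"            # pipeline release to query
    if command -v nextflow >/dev/null 2>&1; then
        # --help (double dash) is handled by the pipeline, -r by Nextflow
        nextflow run "$pipeline" -r "$version" --help
    else
        echo "nextflow not found on PATH; load or install it first" >&2
        return 1
    fi
}
```

Calling show_pipeline_help with no arguments queries nf-core/rnaseq 3.14.0; pass another pipeline name and release to inspect a different one.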

Some pipelines may need file names, and others may want a CSV file with file names, the path to raw data files, and other information.

Public genomes

Ensembl publishes genome assemblies and annotation files for a broad range of species. Look for your species of interest at:

https://asia.ensembl.org/info/data/ftp/index.html