6. Using a custom genome (T2T)

Run the RNA-seq pipeline using the latest Telomere-to-Telomere (T2T) human genome

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies (Nurk et al., Science, 2022 https://www.science.org/doi/10.1126/science.abj6987).

T2T genome

The latest T2T human genome and its annotation have been downloaded from NCBI:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/

You can access this genome at:

/work/training/references/ncbi/T2T

Check available files:

ls -l /work/training/references/ncbi/T2T/

GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
GCF_009914755.1_T2T-CHM13v2.0_genomic.gff
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf
GCF_009914755.1_T2T-CHM13v2.0_protein.faa
GCF_009914755.1_T2T-CHM13v2.0_rna.fna

Run RNAseq pipeline using a custom genome

We can use the nf-core/rnaseq pipeline to profile gene expression against a custom genome (e.g., T2T or any animal or plant genome of interest), as long as a reference genome (FASTA file) and a genome annotation (GTF or GFF3 file) are available.

Which parameters are needed to use a custom genome?

--fasta my_custom_genome.fasta  # de novo assembled genome or genome not available as an iGenomes reference
--gtf my_custom_genome.gtf      # genome annotation showing the location of genes
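The FASTA and GTF files have to describe the same assembly: the sequence names in the first column of the GTF should also appear as FASTA headers, otherwise the aligner cannot place the annotated genes. The sketch below is one way to check this before launching a run. It uses small toy files so it can be tried anywhere; `missing_seqnames` and the toy file names are hypothetical and not part of nf-core/rnaseq.

```shell
# Toy inputs (hypothetical); replace with real FASTA/GTF paths for actual use.
tmpdir=$(mktemp -d)
printf '>chr1 toy sequence\nACGT\n>chr2 toy sequence\nTTGA\n' > "$tmpdir/toy.fna"
printf '#!genome-build toy\nchr1\ttoy\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\nchr3\ttoy\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$tmpdir/toy.gtf"

# Print GTF sequence names that have no matching FASTA header.
missing_seqnames() {
    fasta=$1
    gtf=$2
    # FASTA names: header text after '>' up to the first whitespace
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmpdir/fasta_names.txt"
    # GTF names: column 1 of every non-comment line
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmpdir/gtf_names.txt"
    # comm -13 keeps lines that occur only in the second (GTF) list
    comm -13 "$tmpdir/fasta_names.txt" "$tmpdir/gtf_names.txt"
}

missing=$(missing_seqnames "$tmpdir/toy.fna" "$tmpdir/toy.gtf")
echo "GTF sequence names missing from FASTA: ${missing:-none}"
```

In the toy data the GTF refers to chr3, which the FASTA does not contain, so that name is reported. To check the T2T reference, call `missing_seqnames` with the .fna and .gtf files under /work/training/references/ncbi/T2T instead.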

Copy and paste the code below to the terminal:

cp $HOME/workshop/2024/rnaseq/data/samplesheet.csv $HOME/workshop/2024-2/session4_RNAseq/runs/run4_RNAseq_T2T
cp $HOME/workshop/2024/rnaseq/scripts/launch_nf-core_RNAseq_pipeline_T2T.pbs $HOME/workshop/2024-2/session4_RNAseq/runs/run4_RNAseq_T2T
cd $HOME/workshop/2024-2/session4_RNAseq/runs/run4_RNAseq_T2T

Line 1: Copy the samplesheet.csv file to the working directory
Line 2: Copy the launch script to run expression profiling using the T2T genome
Line 3: Move to the working directory

Print the content of the “launch_nf_core_RNAseq_T2T.pbs” script:

cat launch_nf_core_RNAseq_T2T.pbs

#!/bin/bash -l
#PBS -N nfRNAseq_T2T
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=48:00:00
#work on current directory
cd $PBS_O_WORKDIR
#load java and set up memory settings to run nextflow
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
#run the RNAseq pipeline
nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.14.0 \
        --fasta /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna \
        --gtf /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf \
        --remove_ribo_rna \
        -profile singularity \
        --aligner star_salmon \
        --extra_trimgalore_args "--quality 30 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1 " \
        -resume
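The --extra_trimgalore_args option in the script above passes hard-clipping settings to Trim Galore: --clip_r1/--clip_r2 remove bases from the 5' end of each read, and --three_prime_clip_r1/--three_prime_clip_r2 remove bases from the 3' end. The sketch below only illustrates the clipping arithmetic on a made-up read. It does not run Trim Galore and it ignores adapter and quality trimming, which can shorten reads further.

```shell
# Hypothetical 31 bp read: 10 leading Ns, 20 informative bases, 1 trailing N
read_seq="NNNNNNNNNNACGTACGTACGTACGTACGTN"
clip5=10   # same value as --clip_r1 10
clip3=1    # same value as --three_prime_clip_r1 1

# Keep everything between the clipped 5' and 3' ends
trimmed=$(printf '%s\n' "$read_seq" | awk -v a="$clip5" -v b="$clip3" \
    '{ print substr($0, a + 1, length($0) - a - b) }')

echo "before: ${#read_seq} bp  after: ${#trimmed} bp"
echo "$trimmed"
```

With these settings every read loses 11 bases to hard clipping alone, so it is worth confirming that the remaining read length is still adequate for alignment.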

NOTE:

Do not specify the --genome parameter for pipeline version 3.14.0. Previous versions required either --genome null or --genome custom, but the latest version does not.

Submit the job to the cluster

qsub launch_nf_core_RNAseq_T2T.pbs
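After submitting, it can help to confirm that the job is queued or running and to look at the Nextflow log. The sketch below assumes a PBS scheduler that provides `qstat` and that it is run from the directory where the job was submitted. Nextflow writes `.nextflow.log` to its launch directory once the run starts.

```shell
# List your own jobs if the PBS client tools are available on this machine
if command -v qstat >/dev/null 2>&1; then
    qstat -u "${USER:-$(id -un)}"
    scheduler_found=yes
else
    echo "qstat not found: run this on a node with the PBS client tools"
    scheduler_found=no
fi

# Show the most recent Nextflow log lines once the run has started
if [ -f .nextflow.log ]; then
    tail -n 20 .nextflow.log
else
    echo "No .nextflow.log yet: the job may still be waiting in the queue"
fi
```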

Tip: Read the help information for Nextflow pipelines

Information on how to run a Nextflow pipeline, along with the additional parameters available, can be found on the pipeline website (e.g., https://nf-co.re/rnaseq/3.12.0/docs/usage/ ). You can also run the following command to get help information:

nextflow run nf-core/rnaseq --help

Some pipelines take input file names directly on the command line, while others, such as nf-core/rnaseq, expect a CSV samplesheet that lists the sample names, the paths to the raw data files, and other information.
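For nf-core/rnaseq 3.14.0 the samplesheet has four columns: sample, fastq_1, fastq_2 and strandedness. The sketch below writes a toy samplesheet and runs a few basic checks on it. The sample names and FASTQ paths are hypothetical placeholders, and the checks are only a quick sanity test, not the validation that the pipeline itself performs.

```shell
# Toy samplesheet with placeholder sample names and paths (hypothetical)
tmpdir=$(mktemp -d)
cat > "$tmpdir/samplesheet.csv" <<'EOF'
sample,fastq_1,fastq_2,strandedness
CTRL_rep1,/path/to/CTRL_rep1_R1.fastq.gz,/path/to/CTRL_rep1_R2.fastq.gz,auto
TREAT_rep1,/path/to/TREAT_rep1_R1.fastq.gz,/path/to/TREAT_rep1_R2.fastq.gz,auto
EOF

# Header line, number of sample rows, and rows that do not have exactly 4 fields
header=$(head -n 1 "$tmpdir/samplesheet.csv")
n_samples=$(tail -n +2 "$tmpdir/samplesheet.csv" | grep -c .)
bad_rows=$(awk -F',' 'NF != 4 { n++ } END { print n + 0 }' "$tmpdir/samplesheet.csv")

echo "header:    $header"
echo "samples:   $n_samples"
echo "bad rows:  $bad_rows"
```

The same three checks can be pointed at the samplesheet.csv in the working directory before submitting the job.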