Run RNA-seq pipeline using the Telomere-2-Telomere (T2T) latest human genome
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies (Nurk et al., Science, 2022 https://www.science.org/doi/10.1126/science.abj6987).
GRCh38 vs. T2T assemblies
GRCh38:
Genome Reference Consortium Human Build 38 was released in December 2013.
24 chromosomes (including the X and Y chromosomes) and 261 additional scaffolds that have not been assigned to a chromosome
Approximately 151 gaps in the primary sequence of GRCh38. These gaps are typically located in highly repetitive or hard-to-sequence regions such as centromeres, telomeres, and regions of segmental duplications.
T2T:
The Telomere-to-Telomere (T2T) consortium released the T2T assembly in 2021 being the first truly gapless human genome assembly (T2T-CHM13), which further improved upon GRCh38 by closing these gaps. However, GRCh38 remains the reference genome widely used in genomics projects.
T2T genome
The latest T2T human genome and annotation has been downloaded from NCBI:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/
You can access this genome at: /work/training/references/ncbi/T2T
Check available files:
Code Block |
---|
ls -l /work/training/references/ncbi/T2T/ |
Code Block |
---|
GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
GCF_009914755.1_T2T-CHM13v2.0_genomic.gff
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf
GCF_009914755.1_T2T-CHM13v2.0_protein.faa
GCF_009914755.1_T2T-CHM13v2.0_rna.fna |
Print the assembly report:
Code Block |
---|
cat /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt |
...
Run RNA-seq pipeline using the Telomere-2-Telomere (T2T) latest human genome
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies (Nurk et al., Science, 2022 https://www.science.org/doi/10.1126/science.abj6987).
GRCh38 vs. T2T assemblies
GRCh38:
Genome Reference Consortium Human Build 38 was released in December 2013.
24 chromosomes (including the X and Y chromosomes) and 261 additional scaffolds that have not been assigned to a chromosome
Approximately 151 gaps in the primary sequence of GRCh38. These gaps are typically located in highly repetitive or hard-to-sequence regions such as centromeres, telomeres, and regions of segmental duplications.
T2T:
The Telomere-to-Telomere (T2T) consortium released the T2T assembly in 2021 being the first truly gapless human genome assembly (T2T-CHM13), which further improved upon GRCh38 by closing these gaps. However, GRCh38 remains the reference genome widely used in genomics projects.
T2T genome
The latest T2T human genome and annotation has been downloaded from NCBI:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/
You can access this genome at: /work/training/references/ncbi/T2T
Check available files:
Code Block |
---|
ls -l /work/training/references/ncbi/T2T/ |
Code Block |
---|
GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
GCF_009914755.1_T2T-CHM13v2.0_genomic.gff
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf
GCF_009914755.1_T2T-CHM13v2.0_protein.faa
GCF_009914755.1_T2T-CHM13v2.0_rna.fna |
Print the assembly report:
Code Block |
---|
cat /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt |
Code Block |
---|
# Assembly name: T2T-CHM13v2.0 # Description: T2T CHM13v2.0 Telomere-to-Telomere assembly of the CHM13 cell line, with chrY from NA24385 # Organism name: Homo sapiens (human) # Taxid: 9606 # BioSample: SAMN03255769 # BioProject: PRJNA559484 # Submitter: T2T Consortium # Date: 2022-01-24 # Assembly type: haploid # Release type: major # Assembly level: Complete Genome # Genome representation: full # Expected final version: no # Genome coverage: 30x # GenBank assembly accession: GCA_009914755.4 # RefSeq assembly accession: GCF_009914755.1 # RefSeq assembly and GenBank assemblies identical: no # ## Assembly-Units: ## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name ## GCA_009914825.4 GCF_009914825.1 Primary Assembly ## GCA_009914855.2 non-nuclear # # Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by # unlocalized scaffolds. # Unplaced scaffolds are listed at the end. # RefSeq is equal or derived from GenBank object. # # Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name 1 assembled-molecule 1 Chromosome CP068277.2 = NC_060925.1 Primary Assembly 248387328 chr1 2 assembled-molecule 2 Chromosome CP068276.2 = NC_060926.1 Primary Assembly 242696752 chr2 3 assembled-molecule 3 Chromosome CP068275.2 = NC_060927.1 Primary Assembly 201105948 chr3 4 assembled-molecule 4 Chromosome CP068274.2 = NC_060928.1 Primary Assembly 193574945 chr4 5 assembled-molecule 5 Chromosome CP068273.2 = NC_060929.1 Primary Assembly 182045439 chr5 6 assembled-molecule 6 Chromosome CP068272.2 = NC_060930.1 Primary Assembly 172126628 chr6 7 assembled-molecule 7 Chromosome CP068271.2 = NC_060931.1 Primary Assembly 160567428 chr7 8 assembled-molecule 8 Chromosome CP068270.2 = NC_060932.1 Primary Assembly 146259331 chr8 9 assembled-molecule 9 Chromosome CP068269.2 = NC_060933.1 Primary Assembly 150617247 chr9 10 assembled-molecule 10 Chromosome CP068268.2 = NC_060934.1 Primary Assembly 134758134 chr10 11 assembled-molecule 1 11 Chromosome CP068277CP068267.2 = NC_060925060935.1 Primary Assembly 248387328135127769 chr1chr11 2 12 assembled-molecule 212 Chromosome CP068276CP068266.2 = NC_060926060936.1 Primary Assembly 242696752133324548 chr2chr12 313 assembled-molecule 3 13 Chromosome CP068275CP068265.2 = NC_060927060937.1 Primary Assembly 201105948113566686 chr3chr13 414 assembled-molecule 414 Chromosome CP068274CP068264.2 = NC_060928060938.1 Primary Assembly 193574945101161492 chr4chr14 5 15 assembled-molecule 5 15 Chromosome CP068273CP068263.2 = NC_060929060939.1 Primary Assembly 18204543999753195 chr5 6chr15 16 assembled-molecule 616 Chromosome CP068272CP068262.2 = NC_060930060940.1 Primary Assembly 17212662896330374 chr6 7chr16 17 assembled-molecule 7 17 Chromosome CP068271CP068261.2 = NC_060931060941.1 Primary Assembly 16056742884276897 chr7 8chr17 18 assembled-molecule 818 Chromosome CP068270CP068260.2 = NC_060932060942.1 Primary Assembly 14625933180542538 chr8 9chr18 19 assembled-molecule 9 19 Chromosome CP068269CP068259.2 = NC_060933060943.1 Primary Assembly 15061724761707364 chr9chr19 1020 assembled-molecule 1020 Chromosome CP068268CP068258.2 = NC_060934060944.1 Primary Assembly 66210255 134758134 chr10chr20 1121 assembled-molecule 1121 Chromosome CP068267CP068257.2 = NC_060935060945.1 Primary Assembly 13512776945090682 chr11chr21 1222 assembled-molecule 1222 Chromosome CP068266CP068256.2 = NC_060936060946.1 Primary Assembly 51324926 133324548 chr22 chr12X 13 assembled-molecule 13X Chromosome CP068265CP068255.2 = NC_060937060947.1 Primary Assembly 113566686154259566 chrX chr13Y 14 assembled-molecule 14Y Chromosome CP068264CP086569.2 = NC_060938060948.1 Primary Assembly 62460029 101161492 chr14chrY 15MT assembled-molecule MT Mitochondrion CP068254.1 <> 15 na Chromosome non-nuclear CP068263.2 16569 = NC_060939.1 Primary Assembly 99753195 chr15 16 assembled-molecule 16 Chromosome CP068262.2 = NC_060940.1 Primary Assembly 96330374 chr16 17 assembled-molecule 17 Chromosome CP068261.2 = NC_060941.1 Primary Assembly 84276897 chr17 18 assembled-molecule 18 Chromosome CP068260.2 = NC_060942.1 Primary Assembly 80542538 chr18 19 assembled-molecule 19 Chromosome CP068259.2 = NC_060943.1 Primary Assembly 61707364 chr19 20 assembled-molecule 20 Chromosome CP068258.2 = NC_060944.1 Primary Assembly 66210255 chr20 21 assembled-molecule 21 Chromosome CP068257.2 = NC_060945.1 Primary Assembly 45090682 chr21 22 assembled-molecule 22 Chromosome CP068256.2 = NC_060946.1 Primary Assembly 51324926 chr22 X assembled-molecule X Chromosome CP068255.2 = NC_060947.1 Primary Assembly 154259566 chrX Y assembled-molecule Y Chromosome CP086569.2 = NC_060948.1 Primary Assembly 62460029 chrY MT assembled-molecule MT Mitochondrion CP068254.1 <> na non-nuclear 16569 chrM |
Create the metadata file (samplesheet.csv):
Change to the data folder directory:
Code Block |
---|
cd $HOMEchrM |
Create the metadata file (samplesheet.csv):
Change to the data folder directory:
Code Block |
---|
cd $HOME/workshop/2024-2/session4_RNAseq/data/human |
Copy the bash script to the working folder
Code Block |
---|
cp /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/workshop/2024-2/session4_RNAseq/data/human |
Note: you could replace ‘$HOME/workshop/data’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located
View the content of the script:
Example for Paired-End data (when ‘Read 1’ and ‘Read2’ are available) - Copy available script if working with PE data:
Code Block |
---|
cat /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_SEdata.sh |
...
NOTE: modify ‘read1_extension’ and ‘read2_extension’ as appropriate for your data. For example: R1.fastq.gz, R2.fastq.gz or R1_001.fq.gz, R2_001.fq.gz , etc
Prior running the “create_samplesheet” script, we need to know the path to the current (working directory) - run pwd and copy the path as we will use it in the subsequent block of code:
Code Block |
---|
pwd |
Let’s generate the metadata file by running the following command:
Code Block |
---|
sh create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/workshop/2024-2/session4_RNAseq/data/human |
Check the newly created samplesheet.csv file:
Code Block |
---|
cat samplesheet.cvs |
Code Block |
---|
sample,fastq_1,fastq_2,strandedness
SRR1039508,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039508_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039508_2.fastq.gz,auto
SRR1039509,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039509_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039509_2.fastq.gz,auto
SRR1039510,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039510_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039510_2.fastq.gz,auto
SRR1039511,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039511_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039511_2.fastq.gz,auto
SRR1039514,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039514_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039514_2.fastq.gz,auto
SRR1039517,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039517_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039517_2.fastq.gz,auto
SRR1039518,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039518_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039518_2.fastq.gz,auto
SRR1039519,/home/user/workshop/2024-2/session4_RNAseq/data/human |
Copy the bash script to the working folder
Code Block |
---|
cp /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/SRR1039519_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human |
Note: you could replace ‘$HOME/workshop/data’ with “.” A dot indicates ‘current directory’ and will copy the file to the directory where you are currently located
View the content of the script:
Example for Paired-End data (when ‘Read 1’ and ‘Read2’ are available) - Copy available script if working with PE data:
Code Block |
---|
cat /work/training/2024/rnaseq/scripts/create_samplesheet_nf-core_RNAseq_SEdata.sh |
...
Let’s generate the metadata file by running the following command:
Code Block |
---|
sh create_samplesheet_nf-core_RNAseq_PEdata.sh $HOME/SRR1039519_2.fastq.gz,auto SRR1039520,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039520_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039520_2.fastq.gz,auto SRR1039521,/home/user/workshop/2024-2/session4_RNAseq/data/human/SRR1039521_1.fastq.gz,/home/user/workshop/2024-2/session4_RNAseq/data/human |
Check the newly created samplesheet.csv file:
Code Block |
---|
cat samplesheet.cvs/SRR1039521_2.fastq.gz,auto |
Run RNAseq pipeline using a custom genome
We can use the nf-core/rnaseq pipeline to profile the expression of genes in a custom genome (e.g., T2T or any animal or plant genome) of your interest, as long as there is a reference genome (FASTA file) and genome annotation (GTF or GFF3).
To use your own genome assembly - you need 1) FASTA genome sequence and 2) GFF/GTF genome annotation file
Code Block |
---|
--fasta my_custom_genome.fasta # de novo assembled genome or genome not available as an igenomes reference --gtf my_custom_genome.gtf # genome annotatio showing the location of genes |
...
Code Block |
---|
cd $HOME/workshop/2024-2/session4_RNAseq/runs/run4run3_RNAseq/_T2T |
Copy and paste the code below to the terminal:
Code Block |
---|
cp $HOME/work/trainingworkshop/2024-2/rnaseqsession4_RNAseq/data/human/samplesheet.csv $HOME/workshop/2024-2/session4_RNAseq/runs/run3_RNAseq_T2T cp $HOME/workshop/2024-2/session4_RNAseq/scripts/launch_nf-core_RNAseq_pipeline_T2T.pbs $HOME/workshop/2024-2/session4_RNAseq/runs/run3_RNAseq_T2T |
Line 1: Copy the samplesheet.csv for pre-downloaded human samples file to the working directory
Line 2: Copy the launch scrip to run expression profiling using the T2T genome
Print the content of the “launch_nf_core_RNAseq_T2T.pbs” script:
Code Block |
---|
cat launch_nf_-core_RNAseq_pipeline_T2T.pbs |
Code Block |
---|
#!/bin/bash -l #PBS -N nfRNAseq_T2T #PBS -l select=1:ncpus=2:mem=4gb #PBS -l walltime=48:00:00 #work on current directory cd $PBS_O_WORKDIR #load java and set up memory settings to run nextflow module load java export NXF_OPTS='-Xms1g -Xmx4g' #run the RNAseq pipeline nextflow run nf-core/rnaseq --input samplesheet.csv \ --outdir results \ -r 3.14.0 \ --fasta /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna \ --gtf /work/training/references/ncbi/T2T/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf \ --remove_ribo_rna \ -profile singularity \ --aligner star_salmon \ --extra_trimgalore_args "--quality 30 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1 " \ -resume |
...
Submit the job to the cluster
Code Block |
---|
qsub launch_nf_-core_RNAseq_pipeline_T2T.pbs |
Tip: Read the help information for Nextflow pipelines
Information on how to run a nextflow pipeline and additional available parameters can be provided on the pipeline website (i.e., https://nf-co.re/rnaseq/3.12.0/docs/usage/ ). You can also run the following command to get help information:
...
Some pipelines may need file names, and others may want a CSV file with file names, the path to raw data files, and other information.
Public genomes
ENSEMBL publishes a range of genome assemblies and annotation files for a broad range of species. Look for species of interest at:
...