Batch rename fasta headers
Aim:
We will use the seqkit replace tool to rename FASTA headers in bulk, using a list that maps each current header to its desired name.
Requirements
If seqkit is not yet installed, install it as follows:
#STEP1: Activate the ConsGenome environment
conda activate ConsGenome
#STEP2: install seqkit as follows
conda install -c bioconda seqkit
Input files
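1. FASTA file: prepare an input FASTA file with the sequence accession number as the header. The header should contain only the accession number, because the rename script matches the whole header line against the names list. For example: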
>KY709128
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>MG894709
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACATCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACTTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>KY818102
TGAGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGAGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>KT827367
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTTAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCACGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>JF967937
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACATACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCCTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>FJ687476
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTCAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTGTCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
2. New names list: also prepare a tab-separated list (names.txt) with two columns. Column 1 holds the accession number exactly as it appears in the FASTA header, and column 2 holds the new name, for example AccessionNumber_CollectionYear_Country:
KY709128 KY709128_2010_Philippines
MG894709 MG894709_2012_Philippines
KY818102 KY818102_2011_Philippines
KT827367 KT827367_2010_China
JF967937 JF967937_2010_Philippines
FJ687476 FJ687476_2007_South_Korea
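Before renaming, it can help to confirm that every FASTA header has an entry in the names list. The following is a minimal sketch, not part of seqkit: missing_names is a hypothetical helper name, and it assumes the names list is tab-separated and that each header consists of the accession number only.

```sh
# Hypothetical helper: print every FASTA header that has no entry
# in column 1 of the names list.
#   $1 = FASTA file
#   $2 = tab-separated names list (accession <TAB> new name)
missing_names() {
    awk -F'\t' '
        # First file (names list): remember each key from column 1.
        NR == FNR { known[$1] = 1; next }
        # Second file (FASTA): on header lines, drop the leading ">"
        # and report the header if it is not a known key.
        /^>/ { id = substr($0, 2); if (!(id in known)) print id }
    ' "$2" "$1"
}

# Usage (file names as in this page):
#   missing_names samples.fasta names.txt
```

If the command prints nothing, every header was found in the names list. Any accession it prints should be added to names.txt before running the rename.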
Batch rename script
## Usage: qsub launch_batch_rename.pbs
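cd "$PBS_O_WORKDIR"
conda activate ConsGenome
#User defined variables
# NAMES: tab-separated list (accession <TAB> new name)
NAMES=names.txt
# FILE: input FASTA file
FILE=samples.fasta
#PIPELINE
# -p "(.+)" captures the whole header; -r '{kv}' replaces it with the value found for that key in the -k file
seqkit replace -p "(.+)" -r '{kv}' -k "${NAMES}" "${FILE}" > "renamed_${FILE}"
echo "completed"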
Place both input files (names.txt and samples.fasta) in the same folder as the launch_batch_rename.pbs script. The job writes the renamed sequences to renamed_samples.fasta in that folder. Run it on the HPC as follows:
qsub launch_batch_rename.pbs
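Once the job has finished, a quick sanity check on the output can catch problems early. The sketch below is an illustration only: check_renamed is a hypothetical helper name. It compares the number of records before and after renaming and counts empty headers, on the assumption that a header missing from the names list may be left empty (see seqkit replace --help for how missing keys are handled in your version).

```sh
# Hypothetical helper: compare record counts and look for empty headers.
#   $1 = original FASTA file
#   $2 = renamed FASTA file
check_renamed() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    n_in=$(grep -c '^>' "$1" || true)
    n_out=$(grep -c '^>' "$2" || true)
    n_empty=$(grep -c '^>[[:space:]]*$' "$2" || true)
    echo "records in: $n_in, records out: $n_out, empty headers: $n_empty"
    # Succeed only if no record was lost and no header is empty.
    [ "$n_in" -eq "$n_out" ] && [ "$n_empty" -eq 0 ]
}

# Usage (file names as in this page):
#   check_renamed samples.fasta renamed_samples.fasta
#   grep '^>' renamed_samples.fasta    # list the new headers
```

A non-zero exit status means the counts differ or at least one header is empty, in which case the names list is worth rechecking.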