Batch rename fasta headers

Aim:

We will use seqkit replace to rename the FASTA headers in bulk, using a list that maps each current header to the desired new name.

Requirements

If seqkit is not yet available, install it as follows:

#STEP1: Activate the ConsGenome environment
conda activate ConsGenome
#STEP2: Install seqkit as follows
conda install -c bioconda seqkit
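Before installing, it may be worth checking whether seqkit is already on the PATH of the active environment. The following is a minimal sketch; the helper name check_tool is made up for illustration and is not part of seqkit or conda.

```shell
# check_tool is a hypothetical helper, not part of seqkit or conda.
# It prints "found" if the named program is on PATH, otherwise "missing".
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

# Report whether seqkit is available in the currently active environment.
echo "seqkit: $(check_tool seqkit)"
```

If this reports "missing" after activating ConsGenome, the install step above is needed.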

Input files

  1. FASTA file: prepare an input FASTA file in which each header consists only of the sequence accession number. For example:

>KY709128
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>MG894709
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACATCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACTTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>KY818102
TGAGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGAGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>KT827367
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTTAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCACGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>JF967937
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTTAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACATACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCCTAAAAGGGGTATCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG
>FJ687476
TGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTGGAAGGACTGTCGGGAGCTACGTGGGTGGATGTAGTGCTGGAGCATGGAAGTTGCGTCACTACCATGGCAAAAGACAAACCAACACTGGACATTGAACTCCTGAAGACGGAGGTCACAAACCCTGCAGTCCTGCGCAAACTGTGCATTGAAGCTAAAATATCAAATACCACCACCGATTCGAGATGTCCAACACAAGGAGAAGCCACGCTGGTGGAAGAGCAGGACACGAACTTTGTGTGCCGACGAACGCTCGTGGACAGAGGCTGGGGCAATGGTTGTGGGCTATTTGGAAAAGGTAGCTTAATAACGTGTGCTAAGTTCAAGTGTGTGACAAAACTGGAAGGAAAGATAGTCCAATATGAAAACTTAAAATATTCAGTCATAGTCACCGTACACACTGGAGACCAACACCAAGTTGGAAATGAGACCACAGAACATGGAACAACTGCAACCATAACACCTCAAGCTCCTACGTCGGAAATACAGCTGACAGACTACGGAGCTCTAACACTGGATTGTTCACCTAGAACAGGACTAGACTTTAATGAGATGGTGTTGTTGACGATGAAAGAAAAATCATGGCTCGTCCACAAACAATGGTTTCTGGACCTACCACTGCCTTGGACCTCAGGGGCCTCAACATCCCAAGAGACTTGGAATAGACAAGACCTGCTGGTCACATTCAAGACAGCTCATGCAAAAAAGCAGGAAGTAGTCGTGCTAGGATCACAAGAAGGAGCAATGCACACTGCGCTGACTGGAGCGACAGAAATCCAAACGTCTGGAACGACAACAATTTTTGCAGGGCACCTGAAATGCAGACTAAAAATGGATAAACTGACCTTAAAAGGGGTGTCATATGTAATGTGCACAGGGTCATTCAAGCTAGAGAAGGAAGTGGCTGAGACCCAGCATGGAACTGTTCTAGTGCAAGTTAAATACGAAGGAACAGATGCACCATGCAAGATCCCCTTCTCGTCCCAAGATGAGAAGGGAGTAACCCAGAATGGGAGATTGATAACAGCCAACCCCATAGTCACTGACAAAGAAAAACCAGTCAACATTGAAGCGGAGCCACCTTTTGGGGAGAGCTACCTTGTGGTAGGAGCAGGTGAAAAAGCTTTGAAACTAAGCTGGTTCAAGAAGGGAAGCAGTATAGGGAAAATGTTTGAAGCAACTGCCCGCGGAGCACGAAGGATGGCCATCCTGGGAGACACCGCATGGGACTTCGGTTCTATAGGAGGGGTGTTCACATCTGTGGGAAAACTGATACACCAGATTTTTGGGACTGCGTATGGAGTCTTGTTCAGCGGGGTTTCTTGGACCATGAAAATAGGAATAGGGATTCTGCTGACATGGCTAGGATTAAATTCAAGGAGCACATCCCTTTCAATGACGTGTATCGCAGTCGGCATGGTCACACTGTACCTAGGAGTCATGGTTCAGGCG

  2. New names list: also prepare a list with two tab-separated columns, for example Column 1 (accession number, matching the FASTA header) and Column 2 (AccessionNumber_CollectionYear_Country):

KY709128	KY709128_2010_Philippines
MG894709	MG894709_2012_Philippines
KY818102	KY818102_2011_Philippines
KT827367	KT827367_2010_China
JF967937	JF967937_2010_Philippines
FJ687476	FJ687476_2007_South_Korea
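Because the rename relies on an exact match between each FASTA header and the first column of the names list, a quick pre-flight check can catch mismatches before the job is submitted. The sketch below is one possible check using awk, not part of the original workflow. The variable names BAD_ROWS and MISSING, the temporary demo files, and the shortened placeholder sequences are all assumptions made for illustration.

```shell
# Demo inputs in a temporary directory; in practice point NAMES and FILE
# at your own names.txt and samples.fasta. Sequences are short placeholders.
WORKDIR=$(mktemp -d)
NAMES="$WORKDIR/names.txt"
FILE="$WORKDIR/samples.fasta"

printf 'KY709128\tKY709128_2010_Philippines\nMG894709\tMG894709_2012_Philippines\n' > "$NAMES"
printf '>KY709128\nTGCGGTGCG\n>MG894709\nTGCGGTGCG\n>KY818102\nTGAGGTGCG\n' > "$FILE"

# Line numbers of names-list rows that do not have exactly two tab-separated columns.
BAD_ROWS=$(awk -F'\t' 'NF != 2 { print NR }' "$NAMES")

# FASTA headers (text after ">") that have no key in column 1 of the names list.
MISSING=$(awk -F'\t' '
    NR == FNR { known[$1] = 1; next }   # first file: remember the keys
    /^>/ { id = substr($0, 2); if (!(id in known)) print id }
' "$NAMES" "$FILE")

echo "malformed rows: ${BAD_ROWS:-none}"
echo "headers without a new name: ${MISSING:-none}"
```

In this demo KY818102 is deliberately left out of the names list, so it is reported as a header without a new name.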

Batch rename script

## Usage: qsub launch_batch_rename.pbs
cd $PBS_O_WORKDIR
conda activate ConsGenome

#User defined variables
NAMES=names.txt
FILE=samples.fasta

#PIPELINE
seqkit replace -p "(.+)" -r '{kv}' -k ${NAMES} ${FILE} > renamed_${FILE}
echo "completed"
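In the seqkit command, the pattern (.+) captures the header text, and {kv} replaces it with the value found for that key in the file given by -k. To illustrate the lookup itself, here is the same idea sketched in plain awk. This is not how seqkit is implemented, only an illustration on shortened placeholder data. Leaving unmatched headers unchanged is a choice made in this sketch and may differ from seqkit's behaviour.

```shell
# Small demo inputs (placeholder sequences) in a temporary directory.
WORKDIR=$(mktemp -d)
printf 'KY709128\tKY709128_2010_Philippines\nFJ687476\tFJ687476_2007_South_Korea\n' > "$WORKDIR/names.txt"
printf '>KY709128\nTGCGGTGCG\n>FJ687476\nTGCGGTGCG\n' > "$WORKDIR/samples.fasta"

RENAMED=$(awk -F'\t' '
    NR == FNR { new[$1] = $2; next }    # first file: load key -> value pairs
    /^>/ {
        key = substr($0, 2)             # whole header text after ">"
        if (key in new) { print ">" new[key] } else { print }
        next
    }
    { print }                           # sequence lines pass through unchanged
' "$WORKDIR/names.txt" "$WORKDIR/samples.fasta")

echo "$RENAMED"
```

For real data the seqkit command in the script remains the intended tool; the awk version only makes the key-to-value substitution visible.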

Place both input files (names.txt and samples.fasta) in the same folder as the ‘launch_batch_rename.pbs’ script. The job writes the renamed sequences to renamed_samples.fasta in that folder. Submit it on the HPC as follows:

qsub launch_batch_rename.pbs
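Once the job has finished, it can be useful to confirm that no records were lost and that every output header is one of the intended new names. The following is a rough sketch of such a check, assuming the output file is named renamed_samples.fasta as in the script. The demo files, placeholder sequences, and variable names (N_IN, N_OUT, NOT_RENAMED) are illustrative assumptions.

```shell
# Demo files in a temporary directory stand in for the real job folder.
WORKDIR=$(mktemp -d)
NAMES="$WORKDIR/names.txt"
FILE="$WORKDIR/samples.fasta"
OUT="$WORKDIR/renamed_samples.fasta"

printf 'KY709128\tKY709128_2010_Philippines\n' > "$NAMES"
printf '>KY709128\nTGCGGTGCG\n>MG894709\nTGCGGTGCG\n' > "$FILE"
# Stand-in for the job output, with one header deliberately left unrenamed.
printf '>KY709128_2010_Philippines\nTGCGGTGCG\n>MG894709\nTGCGGTGCG\n' > "$OUT"

# Count records (header lines) before and after renaming.
N_IN=$(grep -c '^>' "$FILE")
N_OUT=$(grep -c '^>' "$OUT")

# Output headers that do not appear in column 2 of the names list.
NOT_RENAMED=$(awk -F'\t' '
    NR == FNR { wanted[$2] = 1; next }
    /^>/ { h = substr($0, 2); if (!(h in wanted)) print h }
' "$NAMES" "$OUT")

echo "records in: $N_IN, records out: $N_OUT"
echo "headers not in the new-names column: ${NOT_RENAMED:-none}"
```

Matching record counts and an empty list of unrenamed headers suggest the rename went through as intended.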
