Content Comparison

Aim:

This page provides tips on how to cluster oligonucleotide sequences (i.e., aptamers, miRNAs, etc) based on their sequence identity using two strategies: 1) mapper.pl script from the mirdeep2 package, and 2) cd-hit clustering approach.

...

Once installed you will be able to access the mapper.pl script. This script will be used to merge merges sequences that are 100% identical sequences and generate their also generates the copy count for each unique sequence (i.e., seq1_x578 ; seq1 has 578 copies). The available parameters options are:

Code Block

mapper.pl --help
No config or reads file could be found
/Users/barrero/anaconda3/bin/mapper.pl input_file_reads

This script takes as input a file with deep sequencing reads (these can be in
different formats, see the options below). The script then processes the reads
and/or maps them to the reference genome, as designated by the options given.
Options:

Read input file:
-a              input file is seq.txt format
-b              input file is qseq.txt format
-c              input file is fasta format
-e              input file is fastq format
-d              input file is a config file (see miRDeep2 documentation).
                options -a, -b, -c or -e must be given with option -d.

Preprocessing/mapping:
-g              three-letter prefix for reads (by default 'seq')
-h              parse to fasta format
-i              convert rna to dna alphabet (to map against genome)
-j              remove all entries that have a sequence that contains letters
                other than a,c,g,t,u,n,A,C,G,T,U,N
-k seq          clip 3' adapter sequence
-l int          discard reads shorter than int nts, default = 18
-m              collapse reads

-p genome       map to genome (must be indexed by bowtie-build). The 'genome'
                string must be the prefix of the bowtie index. For instance, if
                the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then
                the prefix is 'h_sapiens_37_asm'.
-q              map with one mismatch in the seed (mapping takes longer)

-r int          a read is allowed to map up to this number of positions in the genome
                default is 5

Output files:
-s file         print processed reads to this file
-t file         print read mappings to this file

Other:
-u              do not remove directory with temporary files
-v              outputs progress report

-n              overwrite existing files

-o              number of threads to use for bowtie

Example of use:

/Users/barrero$HOME/anaconda3/bin/mapper.pl reads_seq.txt -a -h -i -j -k TCGTATGCCGTCTTCTGCTTGT  -l 18 -m -p h_sapiens_37_asm -s reads.fa -t reads_vs_genome.arf -v

Merging and collapsing identical sequences using mapper.pl.

Code Block
mapper.pl S32_19to21nt.rename.fasta -c -m -s S32_19to21nt.collapsed.fa

Where:

-c input is a fasta file (see above for other input options)

-m merge identical sequences and generate its copy number

-s output filename

Example: Merged identical sequences showing copy number (i.e., _x57828)

Code Block

>seq_0_x57828
CTTGTTGCCTCCTTAGCAGGG
>seq_57828_x1554
AAAAAAAAAAAAAAAAAAAA
>seq_59382_x1212
AAAAAAATAAAAAAAAAAAA
>seq_60594_x1160
AAAAAAAAAAAAAAAAAAA
>seq_61754_x1025
CTTGTTGCCTCCTTAGCAGGT
>seq_62779_x904
AAAAAATAAAAAAAAAAAAA
>seq_63683_x644
AAAAAATTAAAAAAAAAAAA
>seq_64327_x519
ATAAAAATAAAAAAAAAAAA
>seq_64846_x449
ATAAAAAAAAAAAAAAAAAA
>seq_65295_x404
ATTATATTAAATATTAACTA
>seq_65699_x371
ATAAAATAAAAAAAAAAAAA
>seq_66070_x365
AAAAAAATAAAAAAAAAAA
>seq_66435_x364
AAAAAAAAAAAAAAAAAAAAA
>seq_66799_x354
AAAAAATAAAAAAAAAAAA
>seq_67153_x335
ATAAAATTAAAAAAAAAAAA
>seq_67488_x335
ATAAAAAAAAAAAAAAAAA
>seq_67823_x270
AAAAAAAAAAAAAAAAAAAT
>seq_68093_x234
AAAAAAATAAAAAAAAACAA
>seq_68327_x232
AAAAAAATAAAAAAAAAAAT
>seq_68559_x221
CTTGTTGTCTCCTTAGCAGGG
>seq_68780_x209
AAAAAAAAGAAAAAAAAAAA
>seq_68989_x183
CTTGTTTCCTCCTTAGCAGGG

Method 2: cd-hit clustering

...

Code Block

Usage: cd-hit [Options] 

Options

   -i	input filename in fasta format, required, can be in .gz format
   -o	output filename, required
   -c	sequence identity threshold, default 0.9
 	    this is the default cd-hit's "global sequence identity" calculated as:
 	    number of identical amino acids or bases in alignment
 	    divided by the full length of the shorter sequence

For additional information on other parameters options check man page as follows:

...

Code Block

 #!/bin/bash -l
#PBS -N MAPPER
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

# User defined variables:
FASTA=S32_19to21nt.rename.fasta      #input file
COLLAPSED=S32_19to21nt.collapsed.fa  #output file

mapper.pl ${FASTA} -c -m -s ${COLLAPSED}

Where:

#PBS -N name of the job. It can be any name (i.e., script name, tool name, etc)

...

Version	Old Version 2	New Version Current
Changes made by	Roberto Barrero Gumiel	Roberto Barrero Gumiel
Saved on	Aug 17, 2021	Aug 17, 2021

Versions Compared

Key

Aim:

Method 2: cd-hit clustering