Aim:
This page provides tips on how to cluster oligonucleotide sequences (i.e., aptamers, miRNAs, etc) based on their sequence identity using two strategies: 1) mapper.pl script from the mirdeep2 package, and 2) cd-hit clustering approach.
...
Once installed you will be able to access the mapper.pl script. This script will be used to merge merges sequences that are 100% identical sequences and generate their also generates the copy count for each unique sequence (i.e., seq1_x578 ; seq1 has 578 copies). The available parameters options are:
Code Block |
---|
mapper.pl --help No config or reads file could be found /Users/barrero/anaconda3/bin/mapper.pl input_file_reads This script takes as input a file with deep sequencing reads (these can be in different formats, see the options below). The script then processes the reads and/or maps them to the reference genome, as designated by the options given. Options: Read input file: -a input file is seq.txt format -b input file is qseq.txt format -c input file is fasta format -e input file is fastq format -d input file is a config file (see miRDeep2 documentation). options -a, -b, -c or -e must be given with option -d. Preprocessing/mapping: -g three-letter prefix for reads (by default 'seq') -h parse to fasta format -i convert rna to dna alphabet (to map against genome) -j remove all entries that have a sequence that contains letters other than a,c,g,t,u,n,A,C,G,T,U,N -k seq clip 3' adapter sequence -l int discard reads shorter than int nts, default = 18 -m collapse reads -p genome map to genome (must be indexed by bowtie-build). The 'genome' string must be the prefix of the bowtie index. For instance, if the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then the prefix is 'h_sapiens_37_asm'. -q map with one mismatch in the seed (mapping takes longer) -r int a read is allowed to map up to this number of positions in the genome default is 5 Output files: -s file print processed reads to this file -t file print read mappings to this file Other: -u do not remove directory with temporary files -v outputs progress report -n overwrite existing files -o number of threads to use for bowtie Example of use: /Users/barrero$HOME/anaconda3/bin/mapper.pl reads_seq.txt -a -h -i -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -p h_sapiens_37_asm -s reads.fa -t reads_vs_genome.arf -v |
Merging and collapsing identical sequences using mapper.pl.
Code Block |
---|
mapper.pl S32_19to21nt.rename.fasta -c -m -s S32_19to21nt.collapsed.fa |
Where:
-c input is a fasta file (see above for other input options)
-m merge identical sequences and generate its copy number
-s output filename
Example: Merged identical sequences showing copy number (i.e., _x57828)
Code Block |
---|
>seq_0_x57828
CTTGTTGCCTCCTTAGCAGGG
>seq_57828_x1554
AAAAAAAAAAAAAAAAAAAA
>seq_59382_x1212
AAAAAAATAAAAAAAAAAAA
>seq_60594_x1160
AAAAAAAAAAAAAAAAAAA
>seq_61754_x1025
CTTGTTGCCTCCTTAGCAGGT
>seq_62779_x904
AAAAAATAAAAAAAAAAAAA
>seq_63683_x644
AAAAAATTAAAAAAAAAAAA
>seq_64327_x519
ATAAAAATAAAAAAAAAAAA
>seq_64846_x449
ATAAAAAAAAAAAAAAAAAA
>seq_65295_x404
ATTATATTAAATATTAACTA
>seq_65699_x371
ATAAAATAAAAAAAAAAAAA
>seq_66070_x365
AAAAAAATAAAAAAAAAAA
>seq_66435_x364
AAAAAAAAAAAAAAAAAAAAA
>seq_66799_x354
AAAAAATAAAAAAAAAAAA
>seq_67153_x335
ATAAAATTAAAAAAAAAAAA
>seq_67488_x335
ATAAAAAAAAAAAAAAAAA
>seq_67823_x270
AAAAAAAAAAAAAAAAAAAT
>seq_68093_x234
AAAAAAATAAAAAAAAACAA
>seq_68327_x232
AAAAAAATAAAAAAAAAAAT
>seq_68559_x221
CTTGTTGTCTCCTTAGCAGGG
>seq_68780_x209
AAAAAAAAGAAAAAAAAAAA
>seq_68989_x183
CTTGTTTCCTCCTTAGCAGGG
|
Method 2: cd-hit clustering
...
Code Block |
---|
Usage: cd-hit [Options]
Options
-i input filename in fasta format, required, can be in .gz format
-o output filename, required
-c sequence identity threshold, default 0.9
this is the default cd-hit's "global sequence identity" calculated as:
number of identical amino acids or bases in alignment
divided by the full length of the shorter sequence
|
For additional information on other parameters options check man page as follows:
...
Code Block |
---|
#!/bin/bash -l
#PBS -N MAPPER
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
# User defined variables:
FASTA=S32_19to21nt.rename.fasta #input file
COLLAPSED=S32_19to21nt.collapsed.fa #output file
mapper.pl ${FASTA} -c -m -s ${COLLAPSED}
|
Where:
#PBS -N
name of the job. It can be any name (i.e., script name, tool name, etc)
...