Aim:
This page provides tips on how to cluster oligonucleotide sequences (i.e., aptamers, miRNAs, etc) based on their sequence identity using two strategies: 1) mapper.pl script from the mirdeep2 package, and 2) cd-hit clustering approach.
...
Code Block |
---|
mapper.pl --help
No config or reads file could be found
/Users/barrero/anaconda3/bin/mapper.pl input_file_reads
This script takes as input a file with deep sequencing reads (these can be in
different formats, see the options below). The script then processes the reads
and/or maps them to the reference genome, as designated by the options given.
Options:
Read input file:
-a input file is seq.txt format
-b input file is qseq.txt format
-c input file is fasta format
-e input file is fastq format
-d input file is a config file (see miRDeep2 documentation).
options -a, -b, -c or -e must be given with option -d.
Preprocessing/mapping:
-g three-letter prefix for reads (by default 'seq')
-h parse to fasta format
-i convert rna to dna alphabet (to map against genome)
-j remove all entries that have a sequence that contains letters
other than a,c,g,t,u,n,A,C,G,T,U,N
-k seq clip 3' adapter sequence
-l int discard reads shorter than int nts, default = 18
-m collapse reads
-p genome map to genome (must be indexed by bowtie-build). The 'genome'
string must be the prefix of the bowtie index. For instance, if
the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then
the prefix is 'h_sapiens_37_asm'.
-q map with one mismatch in the seed (mapping takes longer)
-r int a read is allowed to map up to this number of positions in the genome
default is 5
Output files:
-s file print processed reads to this file
-t file print read mappings to this file
Other:
-u do not remove directory with temporary files
-v outputs progress report
-n overwrite existing files
-o number of threads to use for bowtie
Example of use:
/Users/barrero/anaconda3/bin/mapper.pl reads_seq.txt -a -h -i -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -p h_sapiens_37_asm -s reads.fa -t reads_vs_genome.arf -v
|
Merging and collapsing identical sequences using mapper.pl.
Code Block |
---|
mapper.pl S32_19to21nt.rename.fasta -c -m -s S32_19to21nt.collapsed.fa |
Where:
-c input is a fasta file (see above for other input options)
-m merge identical sequences and generate its copy number
-s output filename
Example: Merged identical sequences showing copy number (i.e., _x57828)
Code Block |
---|
>seq_0_x57828
CTTGTTGCCTCCTTAGCAGGG
>seq_57828_x1554
AAAAAAAAAAAAAAAAAAAA
>seq_59382_x1212
AAAAAAATAAAAAAAAAAAA
>seq_60594_x1160
AAAAAAAAAAAAAAAAAAA
>seq_61754_x1025
CTTGTTGCCTCCTTAGCAGGT
>seq_62779_x904
AAAAAATAAAAAAAAAAAAA
>seq_63683_x644
AAAAAATTAAAAAAAAAAAA
>seq_64327_x519
ATAAAAATAAAAAAAAAAAA
>seq_64846_x449
ATAAAAAAAAAAAAAAAAAA
>seq_65295_x404
ATTATATTAAATATTAACTA
>seq_65699_x371
ATAAAATAAAAAAAAAAAAA
>seq_66070_x365
AAAAAAATAAAAAAAAAAA
>seq_66435_x364
AAAAAAAAAAAAAAAAAAAAA
>seq_66799_x354
AAAAAATAAAAAAAAAAAA
>seq_67153_x335
ATAAAATTAAAAAAAAAAAA
>seq_67488_x335
ATAAAAAAAAAAAAAAAAA
>seq_67823_x270
AAAAAAAAAAAAAAAAAAAT
>seq_68093_x234
AAAAAAATAAAAAAAAACAA
>seq_68327_x232
AAAAAAATAAAAAAAAAAAT
>seq_68559_x221
CTTGTTGTCTCCTTAGCAGGG
>seq_68780_x209
AAAAAAAAGAAAAAAAAAAA
>seq_68989_x183
CTTGTTTCCTCCTTAGCAGGG
|
Method 2: cd-hit clustering
...
Code Block |
---|
Usage: cd-hit [Options]
Options
-i input filename in fasta format, required, can be in .gz format
-o output filename, required
-c sequence identity threshold, default 0.9
this is the default cd-hit's "global sequence identity" calculated as:
number of identical amino acids or bases in alignment
divided by the full length of the shorter sequence
|
For additional information on other parameters options check man page as follows:
...
Code Block |
---|
#!/bin/bash -l
#PBS -N MAPPER
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
# User defined variables:
FASTA=S32_19to21nt.rename.fasta #input file
COLLAPSED=S32_19to21nt.collapsed.fa #output file
mapper.pl ${FASTA} -c -m -s ${COLLAPSED}
|
Where:
#PBS -N
name of the job. It can be any name (i.e., script name, tool name, etc)
...