Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Aim:

This page provides tips on how to cluster oligonucleotide sequences (i.e., aptamers, miRNAs, etc) based on their sequence identity using two strategies: 1) mapper.pl script from the mirdeep2 package, and 2) cd-hit clustering approach.

...

Once installed you will be able to access the mapper.pl script. This script will be used to merge merges sequences that are 100% identical sequences and generate their also generates the copy count for each unique sequence (i.e., seq1_x578 ; seq1 has 578 copies). The available parameters options are:

Code Block
mapper.pl --help
No config or reads file could be found
/Users/barrero/anaconda3/bin/mapper.pl input_file_reads

This script takes as input a file with deep sequencing reads (these can be in
different formats, see the options below). The script then processes the reads
and/or maps them to the reference genome, as designated by the options given.
Options:

Read input file:
-a              input file is seq.txt format
-b              input file is qseq.txt format
-c              input file is fasta format
-e              input file is fastq format
-d              input file is a config file (see miRDeep2 documentation).
                options -a, -b, -c or -e must be given with option -d.

Preprocessing/mapping:
-g              three-letter prefix for reads (by default 'seq')
-h              parse to fasta format
-i              convert rna to dna alphabet (to map against genome)
-j              remove all entries that have a sequence that contains letters
                other than a,c,g,t,u,n,A,C,G,T,U,N
-k seq          clip 3' adapter sequence
-l int          discard reads shorter than int nts, default = 18
-m              collapse reads

-p genome       map to genome (must be indexed by bowtie-build). The 'genome'
                string must be the prefix of the bowtie index. For instance, if
                the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then
                the prefix is 'h_sapiens_37_asm'.
-q              map with one mismatch in the seed (mapping takes longer)

-r int          a read is allowed to map up to this number of positions in the genome
                default is 5

Output files:
-s file         print processed reads to this file
-t file         print read mappings to this file

Other:
-u              do not remove directory with temporary files
-v              outputs progress report

-n              overwrite existing files

-o              number of threads to use for bowtie

Example of use:

/Users/barrero$HOME/anaconda3/bin/mapper.pl reads_seq.txt -a -h -i -j -k TCGTATGCCGTCTTCTGCTTGT  -l 18 -m -p h_sapiens_37_asm -s reads.fa -t reads_vs_genome.arf -v

...

Code Block
mapper.pl S32_19to21nt.rename.fasta -c -m -s S32_19to21nt.collapsed.fa

Where:

-c input is a fasta file (see above for other input options)

-m merge identical sequences and generate its copy number

-s output filename

Example: Merged identical sequences showing copy number (i.e., _x57828)

...