Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

...

...

...

...

...

Aim:

This page provides tips on how to cluster oligonucleotide sequences (i.e., aptamers, miRNAs, etc) based on the their sequence identity using two strategies: 1) mapper.pl script from the mirdeep2 package, and 2) cd-hit clustering approach.

Pre-requisites

Method 1: Clustering oligonucleotide sequences (i.e., aptamers, miRNAs or small RNAs)

...

Once installed you will be able to access the mapper.pl script. This script will be used to merge merges sequences that are 100% identical sequences and generate their also generates the copy count for each unique sequence (i.e., seq1_x578 ; seq1 has 578 copies). The available parameters options are:

Code Block
mapper.pl --help
No config or reads file could be found
/Users/barrero/anaconda3/bin/mapper.pl input_file_reads

This script takes as input a file with deep sequencing reads (these can be in
different formats, see the options below). The script then processes the reads
and/or maps them to the reference genome, as designated by the options given.
Options:

Read input file:
-a              input file is seq.txt format
-b              input file is qseq.txt format
-c              input file is fasta format
-e              input file is fastq format
-d              input file is a config file (see miRDeep2 documentation).
                options -a, -b, -c or -e must be given with option -d.

Preprocessing/mapping:
-g              three-letter prefix for reads (by default 'seq')
-h              parse to fasta format
-i              convert rna to dna alphabet (to map against genome)
-j              remove all entries that have a sequence that contains letters
                other than a,c,g,t,u,n,A,C,G,T,U,N
-k seq          clip 3' adapter sequence
-l int          discard reads shorter than int nts, default = 18
-m              collapse reads

-p genome       map to genome (must be indexed by bowtie-build). The 'genome'
                string must be the prefix of the bowtie index. For instance, if
                the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then
                the prefix is 'h_sapiens_37_asm'.
-q              map with one mismatch in the seed (mapping takes longer)

-r int          a read is allowed to map up to this number of positions in the genome
                default is 5

Output files:
-s file         print processed reads to this file
-t file         print read mappings to this file

Other:
-u              do not remove directory with temporary files
-v              outputs progress report

-n              overwrite existing files

-o              number of threads to use for bowtie

Example of use:

$HOME/Users/barrero/anaconda3/bin/mapper.pl reads_seq.txt -a -h -i -j -k TCGTATGCCGTCTTCTGCTTGT  -l 18 -m -p h_sapiens_37_asm -s reads.fa -t reads_vs_genome.arf -v

Merging and collapsing identical sequences using mapper.pl.

Code Block
mapper.pl S32_19to21nt.rename.fasta -c -m -s S32_19to21nt.collapsed.fa

Where:

-c input is a fasta file (see above for other input options)

-m merge identical sequences and generate its copy number

-s output filename

Example: Merged identical sequences showing copy number (i.e., _x57828)

Code Block
>seq_0_x57828
CTTGTTGCCTCCTTAGCAGGG
>seq_57828_x1554
AAAAAAAAAAAAAAAAAAAA
>seq_59382_x1212
AAAAAAATAAAAAAAAAAAA
>seq_60594_x1160
AAAAAAAAAAAAAAAAAAA
>seq_61754_x1025
CTTGTTGCCTCCTTAGCAGGT
>seq_62779_x904
AAAAAATAAAAAAAAAAAAA
>seq_63683_x644
AAAAAATTAAAAAAAAAAAA
>seq_64327_x519
ATAAAAATAAAAAAAAAAAA
>seq_64846_x449
ATAAAAAAAAAAAAAAAAAA
>seq_65295_x404
ATTATATTAAATATTAACTA
>seq_65699_x371
ATAAAATAAAAAAAAAAAAA
>seq_66070_x365
AAAAAAATAAAAAAAAAAA
>seq_66435_x364
AAAAAAAAAAAAAAAAAAAAA
>seq_66799_x354
AAAAAATAAAAAAAAAAAA
>seq_67153_x335
ATAAAATTAAAAAAAAAAAA
>seq_67488_x335
ATAAAAAAAAAAAAAAAAA
>seq_67823_x270
AAAAAAAAAAAAAAAAAAAT
>seq_68093_x234
AAAAAAATAAAAAAAAACAA
>seq_68327_x232
AAAAAAATAAAAAAAAAAAT
>seq_68559_x221
CTTGTTGTCTCCTTAGCAGGG
>seq_68780_x209
AAAAAAAAGAAAAAAAAAAA
>seq_68989_x183
CTTGTTTCCTCCTTAGCAGGG

Method 2: cd-hit clustering

...

Code Block
Usage: cd-hit [Options] 

Options

   -i	input filename in fasta format, required, can be in .gz format
   -o	output filename, required
   -c	sequence identity threshold, default 0.9
 	    this is the default cd-hit's "global sequence identity" calculated as:
 	    number of identical amino acids or bases in alignment
 	    divided by the full length of the shorter sequence

For additional information on other parameters options check man page as follows:

...

Code Block
 #!/bin/bash -l
#PBS -N MAPPER
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

# User defined variables:
FASTA=S32_19to21nt.rename.fasta      #input file
COLLAPSED=S32_19to21nt.collapsed.fa  #output file

mapper.pl ${FASTA} -c -m -s ${COLLAPSED}

Where:

#PBS -N name of the job. It can be any name (i.e., script name, tool name, etc)

...