Aim:
This page provides tips on how to cluster oligonucleotide sequences (i.e., aptamers, miRNAs, etc) based on their sequence identity using two strategies: 1) mapper.pl script from the mirdeep2 package, and 2) cd-hit clustering approach.
Pre-requisites
Installed conda3 or miniconda3 ( https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html )
Basic unix command line knowledge (example: https://researchcomputing.princeton.edu/education/external-online-resources/linux ; https://swcarpentry.github.io/shell-novice/ )
Familiarity with one unix text editors (example Vi/Vim or Nano):
Method 1: Clustering oligonucleotide sequences (i.e., aptamers, miRNAs or small RNAs)
Install the mirdeep2 package using conda
conda install -c bioconda mirdeep2
Once installed you will be able to access the mapper.pl script. This script merges sequences that are 100% identical and also generates the copy count for each unique sequence (i.e., seq1_x578 ; seq1 has 578 copies). The available parameters options are:
mapper.pl --help No config or reads file could be found /Users/barrero/anaconda3/bin/mapper.pl input_file_reads This script takes as input a file with deep sequencing reads (these can be in different formats, see the options below). The script then processes the reads and/or maps them to the reference genome, as designated by the options given. Options: Read input file: -a input file is seq.txt format -b input file is qseq.txt format -c input file is fasta format -e input file is fastq format -d input file is a config file (see miRDeep2 documentation). options -a, -b, -c or -e must be given with option -d. Preprocessing/mapping: -g three-letter prefix for reads (by default 'seq') -h parse to fasta format -i convert rna to dna alphabet (to map against genome) -j remove all entries that have a sequence that contains letters other than a,c,g,t,u,n,A,C,G,T,U,N -k seq clip 3' adapter sequence -l int discard reads shorter than int nts, default = 18 -m collapse reads -p genome map to genome (must be indexed by bowtie-build). The 'genome' string must be the prefix of the bowtie index. For instance, if the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then the prefix is 'h_sapiens_37_asm'. -q map with one mismatch in the seed (mapping takes longer) -r int a read is allowed to map up to this number of positions in the genome default is 5 Output files: -s file print processed reads to this file -t file print read mappings to this file Other: -u do not remove directory with temporary files -v outputs progress report -n overwrite existing files -o number of threads to use for bowtie Example of use: /Users/barrero/anaconda3/bin/mapper.pl reads_seq.txt -a -h -i -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -p h_sapiens_37_asm -s reads.fa -t reads_vs_genome.arf -v
Merging and collapsing identical sequences using mapper.pl.
mapper.pl S32_19to21nt.rename.fasta -c -m -s S32_19to21nt.collapsed.fa
Where:
-c input is a fasta file (see above for other input options)
-m merge identical sequences and generate its copy number
-s output filename
Example: Merged identical sequences showing copy number (i.e., _x57828)
>seq_0_x57828 CTTGTTGCCTCCTTAGCAGGG >seq_57828_x1554 AAAAAAAAAAAAAAAAAAAA >seq_59382_x1212 AAAAAAATAAAAAAAAAAAA >seq_60594_x1160 AAAAAAAAAAAAAAAAAAA >seq_61754_x1025 CTTGTTGCCTCCTTAGCAGGT >seq_62779_x904 AAAAAATAAAAAAAAAAAAA >seq_63683_x644 AAAAAATTAAAAAAAAAAAA >seq_64327_x519 ATAAAAATAAAAAAAAAAAA >seq_64846_x449 ATAAAAAAAAAAAAAAAAAA >seq_65295_x404 ATTATATTAAATATTAACTA >seq_65699_x371 ATAAAATAAAAAAAAAAAAA >seq_66070_x365 AAAAAAATAAAAAAAAAAA >seq_66435_x364 AAAAAAAAAAAAAAAAAAAAA >seq_66799_x354 AAAAAATAAAAAAAAAAAA >seq_67153_x335 ATAAAATTAAAAAAAAAAAA >seq_67488_x335 ATAAAAAAAAAAAAAAAAA >seq_67823_x270 AAAAAAAAAAAAAAAAAAAT >seq_68093_x234 AAAAAAATAAAAAAAAACAA >seq_68327_x232 AAAAAAATAAAAAAAAAAAT >seq_68559_x221 CTTGTTGTCTCCTTAGCAGGG >seq_68780_x209 AAAAAAAAGAAAAAAAAAAA >seq_68989_x183 CTTGTTTCCTCCTTAGCAGGG
Method 2: cd-hit clustering
Install cd-hit using conda as follows:
conda install -c bioconda cd-hit
Main parameters to be used are the following:
Usage: cd-hit [Options] Options -i input filename in fasta format, required, can be in .gz format -o output filename, required -c sequence identity threshold, default 0.9 this is the default cd-hit's "global sequence identity" calculated as: number of identical amino acids or bases in alignment divided by the full length of the shorter sequence
For additional information on other parameters options check man page as follows:
cd-hit --help
Running cd-hit
cd-hit -i S32_19to21nt.fasta -o out.fa -c 1.0
where:
-i input fasta file
-o output of cd-hit clusters (i.e, representative consensus)
-c identity thresholds for merging (i.e., -c 1.0 = 100% ; -c 0.90 = 90%)
$ cd-hit -i sample.fa -o sample_cd-hit.fa -c 1.0 ================================================================ Program: CD-HIT, V4.8.1, Mar 01 2019, 06:12:48 Command: cd-hit -i sample.fa -o sample_cd-hit.fa -c 1.0 Started: Tue Aug 17 14:47:32 2021 ================================================================ Output ---------------------------------------------------------------- total seq: 5000 longest and shortest : 22 and 19 Total letters: 100274 Sequences have been sorted Approximated minimal memory consumption: Sequence : 0M Buffer : 1 X 10M = 10M Table : 1 X 65M = 65M Miscellaneous : 0M Total : 76M Table limit with the given memory limit: Max number of representatives: 4000000 Max number of word counting entries: 90406164 comparing sequences from 0 to 5000 ..... 5000 finished 4409 clusters Approximated maximum memory consumption: 77M writing new database writing clustering information program completed ! Total CPU time 0.14
Running jobs on the HPC
QUT’s HPC uses the PBS Pro scheduler to submit jobs to compute nodes.
Prepare a PBS Pro submission script (i.e, launch.pbs) to submit a job to the HPC. For example:
#!/bin/bash -l #PBS -N MAPPER #PBS -l select=1:ncpus=2:mem=4gb #PBS -l walltime=24:00:00 cd $PBS_O_WORKDIR # User defined variables: FASTA=S32_19to21nt.rename.fasta #input file COLLAPSED=S32_19to21nt.collapsed.fa #output file mapper.pl ${FASTA} -c -m -s ${COLLAPSED}
Where:
#PBS -N
name of the job. It can be any name (i.e., script name, tool name, etc)
#PBS -l select=1:ncpus=2:mem=4gb
this tells the system that 2 CPUs and 4GB of memory will be used. Modify as appropriate. Note- the more CPUs and MEM you request the longer the job can take to commence running.
#PBS -l walltime=24:00:00
amount of time that the job is allowed to run - in this case up to 24 hours. The maximum wall time is 7 days or 168 hours. Depending on the requested time to run the job it will be placed in the quick, small, large or huge queue.
Once the PBS Pro script is ready, now you can submit the job to the HPC as follows:
qsub launch.pbs
To see the progress of your job, type:
qjobs
Alternatively:
qstat -u $user_name