Running a BLAST sequence similarity search

BLAST is a program that compares a DNA/RNA or protein sequence (a string of letters) against reference sequences (e.g., the whole non-redundant nucleic acid database, also known as NT).

Pre-requisites

Installing BLAST

Once miniconda3 has been installed, you will need to log out and log in again to enable the conda command.

To install BLAST or other bioinformatics tools, go to https://anaconda.org and search for the tool of interest. You can also use the command:

conda search -c bioconda blast

to search for tools in the bioconda channel.

If the tool is available, click on the tool link, which will open a new window showing the command line needed to install it. For example, for BLAST the suggested command is:

conda install -c bioconda blast

Run the above command to install BLAST. Conda will check whether the tool and its dependencies are available and will automatically install everything needed to run it (in this case, BLAST).

Note: Follow a similar process as above to install other tools.
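
After installation, a quick sanity check is to confirm that the program is on your PATH. The helper below is a minimal sketch; check_tool is a hypothetical function name used for illustration and is not part of conda or BLAST.

```shell
# check_tool NAME
# Report whether an executable called NAME can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH" >&2
        return 1
    fi
}

# Check for blastn; print a hint if it is missing.
check_tool blastn || echo "blastn is not available in the current environment"
```

If blastn is found, running blastn -version prints the installed BLAST version.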

Sample Data

Demo data for comparing DNA sequences generated by an RNA-seq approach against a reference Miscanthus sinensis mosaic virus (MsiMV) genome can be found at:

/work/eresearch_bio/sandpit/blast
-rw-rw---- 1 barrero 36K Jan 20 12:43 query_sample.fa
-rw-rw---- 1 barrero 9.7K Jan 20 12:44 MsiMV_genome.fasta
-rw-rw----+ 1 barrero 419 Jan 20 12:50 launch_blastN.pbs

We want to measure the similarity (from 0 to 100%) of the sequences (also called ‘reads’) in the query_sample.fa file against the reference MsiMV_genome.fasta sequence.

Note: RNA/DNA (and protein) sequences can be stored in ‘fasta format’. Each record starts with a header line: a “>” symbol followed by a sequence identifier. The DNA/RNA (or protein) sequence follows from the second line onwards.
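
To make the format concrete, the sketch below writes a tiny two-record fasta file and counts its records by counting header lines. The temporary file and its sequences are made up for illustration; the same grep command can be pointed at query_sample.fa to see how many reads it contains.

```shell
# Create a small illustrative fasta file (made-up identifiers and sequences).
demo_fasta="$(mktemp)"
cat > "$demo_fasta" <<'EOF'
>read_1
ACGTACGTACGT
>read_2
TTGACCGATAGC
GGCATTAG
EOF

# Every record starts with a ">" header line, so counting those lines
# gives the number of sequences, even when a sequence spans several lines.
n_records=$(grep -c '^>' "$demo_fasta")
echo "records: $n_records"
```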

Running blast on the HPC

We use a PBS Pro submission script to submit jobs to the HPC cluster. Create a file called ‘launch_blastN.pbs’ and fill it with the content below, replacing email@host with your email address and adjusting the input files passed to blastn as needed:

#!/bin/bash -l
#PBS -N blastN
#PBS -l walltime=10:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4
#PBS -m bae
#PBS -M email@host
#PBS -j oe

cd $PBS_O_WORKDIR

#define variables. For example name of the fasta file to use. Note: it can be either with a suffix .fa or .fasta or other.
QUERY=query_sample.fa
REFERENCE=MsiMV_genome.fasta
EVALUE=1e-10

#run blastn search
blastn -query $QUERY \
       -subject $REFERENCE \
       -out blastN_${QUERY}_vs_${REFERENCE}.out \
       -outfmt 6 \
       -evalue $EVALUE \
       -num_threads 4

Submit the job:

qsub launch_blastN.pbs

Check the progress of the submitted job:

qjobs
#alternatively use: qstat -u USERNAME

How to interpret the result? Check this tutorial.
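
As a starting point for reading the output: with -outfmt 6, blastn writes one tab-separated line per hit with twelve default columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch below keeps hits with at least 95% identity over at least 50 aligned bases. The thresholds, the function name filter_hits, and the sample rows are illustrative assumptions, not values from this tutorial.

```shell
# filter_hits FILE [MIN_IDENTITY] [MIN_LENGTH]
# Print query ID, percent identity, alignment length and e-value for hits
# that pass both thresholds. Column numbers follow the default -outfmt 6 layout.
filter_hits() {
    awk -F'\t' -v min_id="${2:-95}" -v min_len="${3:-50}" \
        '$3 >= min_id && $4 >= min_len { print $1 "\t" $3 "\t" $4 "\t" $11 }' "$1"
}

# Made-up rows in -outfmt 6 layout, used only to demonstrate the filter.
demo_hits="$(mktemp)"
printf 'read_1\tMsiMV\t99.00\t100\t1\t0\t1\t100\t201\t300\t1e-45\t180\n' >  "$demo_hits"
printf 'read_2\tMsiMV\t88.50\t80\t9\t0\t1\t80\t501\t580\t2e-20\t95\n'    >> "$demo_hits"
printf 'read_3\tMsiMV\t97.00\t30\t1\t0\t1\t30\t901\t930\t5e-11\t50\n'    >> "$demo_hits"

# Only read_1 passes: read_2 is below 95% identity, read_3 is shorter than 50.
filter_hits "$demo_hits"
```

With the job script above, the real output file would be blastN_query_sample.fa_vs_MsiMV_genome.fasta.out, so the call would be: filter_hits blastN_query_sample.fa_vs_MsiMV_genome.fasta.out 95 50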