(This post was initially published on the BiGCaT blog, now discontinued.)
While designing a new microarray for nugo we had to do some automated sequence alignments. More specifically, we had to find correct alignments for 200’000 25 nucleotide probes against 20’000 transcript sequences. We’re looking for exact matches, so we used BLAT, the BLAST-Like Alignment Tool by Jim Kent.
This tool produces correct alignments with great speed. Just for the sake of comparison, we ran both BLAT and BLAST on the same datasets. Standard NCBI BLAST completed the task in 3 days. With BLAT we produced the exact same output in 2 minutes.
Intuitively I understood that BLAT is faster because it is much stricter in making alignments. Looking for divergent sequences simply has to come at a speed penalty, because there is no such thing as a free lunch. BLAT doesn’t do that. You can’t use it for finding protein homology in very distantly related species. However, in our case we were only interested in exact matches anyway.
I wanted to be able to explain the difference in more exact words so I did a bit of reading. What is the reason why BLAT is so much faster than BLAST?
Here is the Short Answer: BLAST is designed to be faster than the Smith-Waterman (SW) algorithm but optimized for finding relatively weak homology. BLAT is optimized for speed at the expense of abandoning weak homology.
And here is the Long Answer:
Sequence alignment is a classic problem in bioinformatics, and it is really an extension of the string searching problem, which is a classic problem in computer science. The computer science guys have already come up with good solutions for those problems such as search indexes (Remember: every problem in computer science has been solved since around 1960. All the things that came after were adaptations of 1960 solutions to newer hardware).
As the BLAST developers realized, searches can be made more efficient than the SW algorithm by using a search index. A well designed index can be used to find positions of exact matches (seeds) in constant (O(1)) time, whereas the scanning algorithm used by SW takes linear (O(n)) time. Producing a complete alignment is proportional to the number of lookups needed in the index.
Let’s assume
K = word-size of our index (11 for DNA for both BLAT and BLAST by default)
A = alphabet size (4 for DNA, 20 for protein)
G = database size (can be in the order of several Gbp)
Q = query size (more modest, usually no more than 10 Kbp)
BLAST works by indexing the query. It looks up each K-mer of the database in the index of the query. Because there are ~G K-mers in the database, alignment speed is proportional to the database size G.
BLAT works by indexing the database. Then, each K-mer of the query is looked up in the index. The alignment speed is proportional to the query size Q, usually a much smaller number than G.
Where is the catch?
The catch comes when you want to allow non-perfect matches. To solve this problem, BLAST calculates possible mismatches in the query and indexes them as well, thus making the index much bigger. In practice, BLAST uses the top 50 significant mismatches, so the BLAST query index is proportional to 50 times the size of query. For speed this doesn’t matter because lookups are still O(1) (although you’ll get more false positives, but still that doesn’t affect speed too much).
For BLAT, since the index is already proportional to the size of the database, usually about 1.0 Gb in size for any relevant genome, making a 50 times larger index with mismatches is simply impossible. Also the number of false positive hits in the idnex would get too large. BLAT can solve this problem by doing repeated lookups in the index for each possible mismatch in each K-mer. If you allow one mismatch in each K-mer, you need to do a look up 1+(A-1)*K times per K-mer (32 when K=11 and A=4). This means a huge speed hit. If you allow two mismatches, the speed hit goes to 1+(A-1)^2*K or exactly 100 for K=11 and A=4.
BLAST does extensive preprocessing on the query, anticipating possible significant mismatces and indexing them as well. BLAT does the same on the database, but since the database is so large there are limits on the preprocessing that can be done.
References:
BLAST Altschul, SF et al., Journal of Molecular Biology 1990, 215:403-410
BLAT: Kent, WJ, Genome Research 2002, 12:656-664
Tags: algorithms, blast, blat