Perl saves the day

January 6th, 2006

(This post was initially published on the BiGCaT blog, now discontinued.)

from the there-is-more-than-one-way-to-do-it-dept.

Today I came across this older but very interesting article by Lincoln Stein, BioPerl developer. Did you know that Perl saved the Human Genome Project? Now who says Perl is obsolete? This article underlines the importance of Perl as a glue language, not only bringing together various tools and platforms, but also making the many different research groups of the Human Genome Project work together effectively.

Nutrigenomics Masterclass 2005 Writeup

November 26th, 2005

(This post was initially published on the BiGCaT blog, now discontinued.)

From 21 to 23 November the NuGO Masterclass took place in Wageningen, coordinated by Michael Müller and Ingeborg van Leeuwen-Bol. Since I know that our collaboration with NuGO will be very important for my future work, I was very interested in hearing the masters of Nutrigenomics themselves speak at this masterclass. Unfortunately I didn’t have the chance to visit the NuGO week in Italy, so this was clearly an important miss that needed to be corrected.
Most people visiting the masterclass were young nutritional scientists who wanted to learn more about the power of genomics. The class was not targeted at a bioinformaticist like me, and although I learned a lot about current nutrition research through all the poster presentations and talks, the masters didn’t tell me much about genomics that I didn’t already know.

A nice thing about this masterclass was that it took a very practical approach to common problems in nutrigenomics research. All the participants were allowed to present their current research individually and ask the masters and audience directly about the problems and bottlenecks they are facing. I think this is a great approach: it stimulates critical thinking and problem-solving skills, as opposed to simple unidirectional information absorption.

The Nutrigenomics Masters

Michael Müller and Duccio Cavalieri both used the metaphor of fishing in their talks, but in entirely different contexts. Müller used the fishing metaphor when talking about microarray data analysis: how do you fish the answer to your favorite question (YFQ) out of the sea of data that you just generated in your experiment? Cavalieri, on the other hand, suggested that before you do any experiment, you go fishing for a week, take a book or two with you, and really think through the experiment you are going to do. What you need to find out is: what is the ideal experiment to solve YFQ? Coming from the background of the yeast model system, Duccio promoted the reductionist approach, where you start with a simple model system and a simple question. He warned against using systems biology and microarrays as a magic black box. No, a computer will not do the research for you (and neither will the bioinformatician).

Cavalieri also talked extensively on the topic of pathway analysis, and demonstrated the Pathway Processor Toolbox, currently being developed at his lab. It will be ready for use soon, although unfortunately only for select NuGO partners. He also mentioned the work of BiGCaT student Thomas Kelder on the Toolbox, and he was clearly impressed. Well done, Thomas!

NuGO Chips

I soon realized that most of the attendants were probably going to use, or at least considering, the NuGO Affymetrix chip, the one that I’m currently designing together with Patrick Koks et al. Although this made me feel a bit like a traveling salesman, meeting the “customers” did teach me a lot and got me extra excited about the chip design.

Ben van Ommen’s visit to the conference was unfortunately very short. He talked about exchange grants and all the other possibilities of cooperation within the NuGO network, which is certainly very useful information. He advertised the NuGO chip some more, and he did it in such a way that I don’t think that getting enough people to use the chip is going to be a problem at all. He mentioned the NuGO website and admitted that it is not very user-friendly. They’re working on it.

Why is BLAT so much faster than BLAST?

November 8th, 2005

(This post was initially published on the BiGCaT blog, now discontinued.)

While designing a new microarray for NuGO we had to do some automated sequence alignments. More specifically, we had to find correct alignments for 200,000 25-nucleotide probes against 20,000 transcript sequences. We’re looking for exact matches, so we used BLAT, the BLAST-Like Alignment Tool by Jim Kent.

This tool produces correct alignments with great speed. Just for the sake of comparison, we ran both BLAT and BLAST on the same datasets. Standard NCBI BLAST completed the task in 3 days. With BLAT we produced the exact same output in 2 minutes.

Intuitively I understood that BLAT is faster because it is much stricter in making alignments. Searching for divergent sequences simply has to come with a speed penalty, because there is no such thing as a free lunch, and BLAT doesn’t pay that price: you can’t use it for finding protein homology in very distantly related species. In our case, however, we were only interested in exact matches anyway.

I wanted to be able to explain the difference in more exact words, so I did a bit of reading. Why exactly is BLAT so much faster than BLAST?

Here is the Short Answer: BLAST is designed to be faster than the Smith-Waterman (SW) algorithm while remaining optimized for finding relatively weak homology. BLAT is optimized for speed at the expense of abandoning weak homology.

And here is the Long Answer:

Sequence alignment is a classic problem in bioinformatics, and it is really an extension of the string searching problem, which is a classic problem in computer science. The computer science guys have already come up with good solutions for those problems such as search indexes (Remember: every problem in computer science has been solved since around 1960. All the things that came after were adaptations of 1960 solutions to newer hardware).

As the BLAST developers realized, searches can be made more efficient than the SW algorithm by using a search index. A well-designed index can be used to find positions of exact matches (seeds) in constant (O(1)) time, whereas the scanning algorithm used by SW takes linear (O(n)) time. The time needed to produce a complete alignment is then proportional to the number of lookups in the index.
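The index idea is easy to sketch in a few lines of Python (an illustrative toy with made-up sequences, not the actual BLAST/BLAT data structures): build a dictionary from every K-mer to its positions, after which finding all seed positions for a word is a single, effectively constant-time, dictionary lookup.

```python
from collections import defaultdict

K = 11  # word size, the default for DNA

def build_index(seq, k=K):
    """Map each k-mer in seq to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

# A toy "database": three copies of a short made-up sequence.
database = "ACGTACGTACGTTTGACCA" * 3
index = build_index(database)

# Finding all seed positions for one k-mer is a single dict lookup:
print(index["ACGTACGTACG"])  # [0, 19, 38]
```

Building the index costs one linear pass; every lookup afterwards is independent of the database size.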

Let’s assume
K = word-size of our index (11 for DNA for both BLAT and BLAST by default)
A = alphabet size (4 for DNA, 20 for protein)
G = database size (can be in the order of several Gbp)
Q = query size (more modest, usually no more than 10 Kbp)

BLAST works by indexing the query. It looks up each K-mer of the database in the index of the query. Because there are ~G K-mers in the database, the alignment time is proportional to the database size G.

BLAT works by indexing the database. Then, each K-mer of the query is looked up in the index. The alignment time is proportional to the query size Q, usually a much smaller number than G.
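The difference in indexing direction can be made concrete with a toy Python sketch (made-up sequences, not the real tools): the same seed hits come out either way, but the number of index lookups is proportional to whichever sequence is scanned.

```python
from collections import defaultdict

K = 11

def build_index(seq, k=K):
    """Map each k-mer to its positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def seed_hits(indexed_seq, scanned_seq, k=K):
    """Index one sequence, scan the other; return hits and the lookup count."""
    index = build_index(indexed_seq, k)
    hits, lookups = [], 0
    for i in range(len(scanned_seq) - k + 1):
        lookups += 1
        for j in index.get(scanned_seq[i:i + k], []):
            hits.append((i, j))
    return hits, lookups

database = "TTTT" * 500 + "ACGTACGTACG" + "GGGG" * 500  # G = 4011
query = "ACGTACGTACG"                                   # Q = 11

# BLAST-style: index the query, look up every database k-mer -> ~G lookups.
_, blast_lookups = seed_hits(query, database)
# BLAT-style: index the database, look up every query k-mer  -> ~Q lookups.
_, blat_lookups = seed_hits(database, query)
print(blast_lookups, blat_lookups)  # 4001 vs 1
```

Both directions find the same seed (the motif at database position 2000); only the number of lookups differs.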

Where is the catch?
The catch comes when you want to allow non-perfect matches. To solve this problem, BLAST calculates possible mismatches in the query and indexes them as well, thus making the index much bigger. In practice, BLAST uses the top 50 significant mismatches, so the BLAST query index is proportional to 50 times the size of the query. For speed this doesn’t matter, because lookups are still O(1) (you’ll get more false positives, but that doesn’t affect speed much either).

For BLAT, since the index is already proportional to the size of the database, usually about 1.0 Gb in size for any relevant genome, making a 50 times larger index with mismatches is simply impossible. Also, the number of false positive hits in the index would get too large. BLAT can solve this problem by doing repeated lookups in the index for each possible mismatch in each K-mer. If you allow one mismatch in each K-mer, you need to do a lookup 1+(A-1)*K times per K-mer (34 when K=11 and A=4). This means a huge speed hit. If you allow two mismatches, the number of lookups grows to 1+(A-1)*K+(A-1)^2*K*(K-1)/2, or 529 for K=11 and A=4.
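What “repeated lookups per K-mer” means can be sketched in Python (an illustrative toy, not BLAT’s actual code): enumerate the Hamming-distance-1 neighborhood of one 11-mer over the DNA alphabet; every word in it costs one extra index lookup.

```python
ALPHABET = "ACGT"  # A = 4

def one_mismatch_neighborhood(kmer):
    """All words at Hamming distance <= 1 from kmer: 1 + (A-1)*K of them."""
    words = {kmer}  # the exact word itself
    for i, original in enumerate(kmer):
        for base in ALPHABET:
            if base != original:
                # substitute one position at a time
                words.add(kmer[:i] + base + kmer[i + 1:])
    return words

kmer = "ACGTACGTACG"  # K = 11
neighborhood = one_mismatch_neighborhood(kmer)
print(len(neighborhood))  # 1 + (4-1)*11 = 34 lookups instead of 1
```

For two mismatches the neighborhood grows with the square of (A-1) times the number of position pairs, which is where the speed hit explodes.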

BLAST does extensive preprocessing on the query, anticipating possible significant mismatches and indexing them as well. BLAT does the same on the database, but since the database is so large there are limits on the preprocessing that can be done.

References:
BLAST: Altschul, SF et al., Journal of Molecular Biology 1990, 215:403-410
BLAT: Kent, WJ, Genome Research 2002, 12:656-664