Lab 7, May 16

Exercises are due Friday next week.

In Lab 7, you will continue to work on your toy BLAST program. If you didn't get everything working in the Lab 6 part of the that program, continue working on that as well as the new parts. If you absolutely cannot get the Lab 6 part finished and working by the time it is due, and hence can't continue of with Lab 7, let us know, and we will give you some parts of the code for Lab 6. But it is best if you do it all yourself.

In Lab 7, you will also learn about the Prosite data base of regular expressions.

If you have not already, please also read the third notes on Perl, although no excercises will be assigned from them. Please let me know if you find any errors or ambiguities in those notes.

What you need to turn in are your answers to the questions and exercises below. For Perl programs use script, or some other means, to print out the program and show how it runs on data.

 

Continuation of the toy BLAST program

There are several deficiencies of the toy BLAST program you did for Lab 6. Here we will tackle a few of them.

First, when a database string matches a substring in the query string by w characters, say, where k < w, then the same w-length substring will be found multiple times by the previous program, and if w > t, the threshold for reporting, it will get reported many times; that is not desireable. Real BLAST has ways to avoid *finding* the same substring multiple times, but we will only make sure that the substgring is not *reported* multiple times. We can do it using a hash called stringhash, like this: Whenever BLAST finds a reportable substring in a database, starting at position $i, say, in the database string, it looks to see if $stringhash{$i} is defined. If so, it does not report the string again. Otherwise it assigns the string to $stringhash{$i} and does report the string. Implement this change in the program.

We would like to process strings that are more than a single line long. So in the file each string will be held in consecutive lines, with strings seperated by blank lines. That is like saying that each string is a paragraph instead of just a single line. To read in a whole paragraph, put the line

$/ = "";

somewhere in the program before the reading begins. Read about this on page 102 of Johnson or some other Perl reference.

Read kmer4.pl and get it running, and try to understand how it works. That will be in preparation for our final change to the toy BLAST program, in lab 8.

Prosite:

http://www.expasy.ch/prosite/

Go to the prosite site and familiarize yourself with the various services available there, and scan the list of other pattern-based utilities.

Be sure to study completely the section called "The optimal way to develop patterns". I am not sure where it is, but it may be in the documents page somewhere. That material will be on the final exam!

We discussed in class that the regular expression that characterises the granin family of proteins is a false positive for BRCA1. Prosite has now been changed so that the search with BRCA1 does not return a granin protein.

Exercise: