In Lab 7, you will continue to work on your toy BLAST program. If you didn't get everything working in the Lab 6 part of the that program, continue working on that as well as the new parts. If you absolutely cannot get the Lab 6 part finished and working by the time it is due, and hence can't continue of with Lab 7, let us know, and we will give you some parts of the code for Lab 6. But it is best if you do it all yourself.
In Lab 7, you will also learn about the Prosite data base of regular expressions.
If you have not already, please also read the third notes on Perl, although no excercises will be assigned from them. Please let me know if you find any errors or ambiguities in those notes.
What you need to turn in are your answers to the questions and exercises below. For Perl programs use script, or some other means, to print out the program and show how it runs on data.
First, when a database string matches a substring in the query string by w characters, say, where k < w, then the same w-length substring will be found multiple times by the previous program, and if w > t, the threshold for reporting, it will get reported many times; that is not desireable. Real BLAST has ways to avoid *finding* the same substring multiple times, but we will only make sure that the substgring is not *reported* multiple times. We can do it using a hash called stringhash, like this: Whenever BLAST finds a reportable substring in a database, starting at position $i, say, in the database string, it looks to see if $stringhash{$i} is defined. If so, it does not report the string again. Otherwise it assigns the string to $stringhash{$i} and does report the string. Implement this change in the program.
We would like to process strings that are more than a single line long. So in the file each string will be held in consecutive lines, with strings seperated by blank lines. That is like saying that each string is a paragraph instead of just a single line. To read in a whole paragraph, put the line
$/ = "";
somewhere in the program before the reading begins. Read about this on page 102 of Johnson or some other Perl reference.
Read kmer4.pl and get it running, and try to understand how it works. That will be in preparation for our final change to the toy BLAST program, in lab 8.
Go to the prosite site and familiarize yourself with the various services available there, and scan the list of other pattern-based utilities.
Be sure to study completely the section called "The optimal way to develop patterns". I am not sure where it is, but it may be in the documents page somewhere. That material will be on the final exam!
We discussed in class that the regular expression that characterises the granin family of proteins is a false positive for BRCA1. Prosite has now been changed so that the search with BRCA1 does not return a granin protein.
Exercise:
Extra credit project: Some people have suggested that rather than doing alignment, a better way to measure the biological similarity of two sequences S and S' is to count the number of kmers that they have in common, divided by the total length of S and S'. In more detail, if a kmer is found in S t times, and in S' t' times, then that kmer contributes Min[t,t'] to the total count of common kmers.
Write a program that asks the user for the value of k, and then finds all the kmers in S and the number of times each one occurs in S, and finds all the kmers in S' and the number of times each occcurs in S'. The program then computes the ``similarity" of S and S' as above.
Download the file packed globins
Extend your program to take in a set of sequences and compute the above similarity for each pair of strings in the input. Try out different values of k. Also extend and run the NW program to compute the Needleman-Wunsch similarity for each pair of sequences. Do the two different methods give the same relative order of sequence pairs, i.e, which pairs are declared to be the most similar, etc? When do you think the kmer approach might be better at finding biologically related sequences, than are alignment based approaches? Which k should one use?