In this lab, we will use BLAST on the web and write the start of a toy BLAST program in Perl.
There are also some incomplete third Perl notes, available through the announcements. I don't think you will need the material in them for anything in this course, but you are welcome to read them. They go further into regular expression matching.
Optional, somewhat undercooked, Third Perl notes
Using the Coffee Break link above, read the October 22, 2001 entry
on RFS. This is supposed to be a good example of how similarity searching
can be useful in finding a gene/protein that may be causing or contributing to a genetic disease.
Be sure to follow the links HSPC, hypothetical protein, and related sequences.
A hypothetical protein is a protein sequence derived from a DNA gene sequence when there is
no knowledge that the derived amino acid sequence is actually a protein sequence.
Now we will attempt to use BLAST to reach the same conclusions they report in Coffee Break. This is not
as easy as it might seem.
Exercises:
Another kmer program, useful next week
kmer4.pl
The new elements of Perl include the substr function, the length function, two-dimensional hashes, the defined function, building up a list as a hash value, and maybe some others. You should read Johnson on these elements of Perl, although I will talk about them explicitly in class. Eventually, you will need to understand all of these for the larger BLAST program we will write.
For this lab, kmerfirst.pl will be the most useful. Program kmerfirst.pl finds the first position of each of the different kmers of length k. You will extract what you need from this, and then expand on it, to write the start of a toy BLAST program. Your program should do the following things:
1. Read in a query string Q from a file.
2. For k = 4, use program kmerfirst.pl to find the first location of each different k-mer in Q. This information is kept in a hash as in program kmerfirst.pl.
3. Successively read in one string at a time from a file called toygenbank. You should use toygenbank. When you hand in the lab for this week, be sure to show the program working on this toy data, but you can also make up more challenging data to add to it.
When a string S is read in, scan through its 4-mers, using the same hash as before. For this, extract and adapt what you need from kmerfirst.pl.
4. Whenever (if ever) a 4-mer in S is determined to be in Q, extract the location of the first occurrence of that 4-mer in Q. Then put the characters of Q and S in arrays (as we did in needleman.pl) so that you can examine individual characters. Then scan left from the k-mer in Q and in S, as long as you find matching characters. Repeat to the right. Let L denote the length of the whole match obtained in this way. If L is greater than 10, then print a message that a good HSP has been found between Q and S, and print S.
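Steps 2 and 3 above can be sketched roughly as follows. This is only an illustration, not the contents of kmerfirst.pl: the subroutine names first_kmer_positions and shared_kmers, the example strings, and the choice of 0-based positions (which is what substr uses) are all assumptions made here.

```perl
use strict;
use warnings;

# Step 2 (sketch): map each distinct k-mer of the query to the
# 0-based position of its first occurrence. Hypothetical helper.
sub first_kmer_positions {
    my ($seq, $k) = @_;
    my %first;
    for my $i (0 .. length($seq) - $k) {
        my $kmer = substr($seq, $i, $k);
        # Only record a k-mer the first time it is seen.
        $first{$kmer} = $i unless defined $first{$kmer};
    }
    return \%first;
}

# Step 3 (sketch): slide a window of width k along S and collect every
# window that is also a k-mer of Q, as [position in S, first position in Q].
sub shared_kmers {
    my ($first, $s, $k) = @_;
    my @hits;
    for my $j (0 .. length($s) - $k) {
        my $kmer = substr($s, $j, $k);
        push @hits, [$j, $first->{$kmer}] if defined $first->{$kmer};
    }
    return @hits;
}

my $k     = 4;
my $first = first_kmer_positions("ACGTACGTTA", $k);   # made-up query
for my $hit (shared_kmers($first, "GGCGTTAC", $k)) {  # made-up S
    my ($spos, $qpos) = @$hit;
    print "4-mer at S position $spos first occurs in Q at position $qpos\n";
}
```

In the real program, Q and each S would come from files rather than literals, and each hit would be handed on to the left and right scan of step 4.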
Notice that the same HSP gets reported multiple times. Explain why that happens.
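The left and right scan described in step 4 might look something like the sketch below. The subroutine name extend_match, its argument order, and the two example strings are assumptions for illustration. Positions are 0-based, and the characters are put in arrays as the step suggests.

```perl
use strict;
use warnings;

# Extend a shared k-mer left and right for as long as characters match.
# $qpos and $spos are the 0-based start positions of the k-mer in Q and S.
# Returns (start in Q, start in S, length L of the whole match).
sub extend_match {
    my ($q, $s, $qpos, $spos, $k) = @_;
    my @q = split //, $q;
    my @s = split //, $s;

    # Scan left from the first character of the k-mer.
    my ($ql, $sl) = ($qpos, $spos);
    while ($ql > 0 && $sl > 0 && $q[$ql - 1] eq $s[$sl - 1]) {
        $ql--;
        $sl--;
    }

    # Scan right from the last character of the k-mer.
    my ($qr, $sr) = ($qpos + $k - 1, $spos + $k - 1);
    while ($qr < $#q && $sr < $#s && $q[$qr + 1] eq $s[$sr + 1]) {
        $qr++;
        $sr++;
    }

    return ($ql, $sl, $qr - $ql + 1);
}

my $Q = "TTGACCATGGCATTAGC";   # made-up query
my $S = "AAGACCATGGCATTCCC";   # made-up database string
# The 4-mer CCAT starts at position 4 in both strings.
my ($qstart, $sstart, $L) = extend_match($Q, $S, 4, 4, 4);
print "good HSP found between Q and S:\n$S\n" if $L > 10;
```

Calling extend_match from a different 4-mer inside the same matching region gives the same start and length, which is worth keeping in mind when thinking about the question above.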
If you want to work ahead, next week the lab will ask you to modify and use kmer3.pl instead of kmerfirst.pl, so that every location of each different 4-mer in Q is collected. Then when a 4-mer common to Q and S is found, the left and right scan is done at each location in Q. Also, the program should use a hash to avoid reporting the same HSP multiple times.
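For those working ahead, here is one possible shape for the two ideas in that paragraph: collecting every location of each 4-mer as a list stored in a hash value, and using a second hash to report each HSP only once. This is not kmer3.pl. The names all_kmer_positions, report_once, and %seen are hypothetical, and so is the choice of hash key.

```perl
use strict;
use warnings;

# Map each distinct k-mer of $seq to a list (array reference) of every
# 0-based position at which it occurs.
sub all_kmer_positions {
    my ($seq, $k) = @_;
    my %where;
    for my $i (0 .. length($seq) - $k) {
        # push creates the list the first time a k-mer is seen.
        push @{ $where{ substr($seq, $i, $k) } }, $i;
    }
    return \%where;
}

my $where = all_kmer_positions("ACGTACGTTA", 4);   # made-up query
print "ACGT occurs at: @{ $where->{ACGT} }\n";

# Remember HSPs already reported for the current S, keyed by where they
# start in Q and S and how long they are. Empty %seen before the next S.
my %seen;
sub report_once {
    my ($qstart, $sstart, $len) = @_;
    my $key = "$qstart,$sstart,$len";
    return 0 if $seen{$key}++;    # already reported
    print "HSP: Q start $qstart, S start $sstart, length $len\n";
    return 1;
}
```

Keying on the start positions and length is just one option. Any key that is identical for two reports of the same HSP would serve.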
Here is a suggestion for a different way to repeatedly remove the first character of a string. Try test-it.pl
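For reference, one common Perl idiom for this uses the four-argument form of substr. It may or may not be what test-it.pl does, and the variable names here are made up.

```perl
use strict;
use warnings;

my $str = "ACGT";
my @removed;
while (length $str) {
    # Four-argument substr replaces the first character with the empty
    # string and returns the character that was removed.
    push @removed, substr($str, 0, 1, "");
}
print "removed in order: @removed\n";
```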