Here is a guide to the correspondence between posted class notes, videos and 2008 lectures. See also the video synopses.

April 1, 3, 2008: These first two lectures of the course are detailed in Lecture notes 1, 2, 3, and in Videos of lectures 1, 2 and most of 3.

April 8, 2008: This lecture corresponds to the end of Lecture notes 3 and the end of video Lecture 3.

April 10, 2008: The material in Lectures 5, 6, 7 (we won't do all of these in one lecture). The appropriate video is video Lecture 5 and the start of video Lecture 6.

April 15, 2008: Lecture notes 5, 6, 7 and video Lectures 6 (for discussion of traceback) and 7 (for development of the notation for DP), and part of video 4 (on Xparal).

April 17: Continuation of Lecture notes 5, 6, 7 on DP for global alignment, and introduction to the definition of local alignment and why we do local alignment.

April 22: Detailed development of the DP recursion for local alignment. See the two lecture notes listed as #8 in the 2002 notes, and video Lecture 10.

April 24: Introduction to statistical evaluation of database matching scores - Lecture notes 9.

May 1: Derivation of a lower bound on the expected length of the longest common subsequence (see video of Chip Martel's lecture), and an upper bound on the expected length of the longest common substring. Lecture notes 9 and 12, and video 12 for expected length of longest common substring (I think there was an error in the lecture video that was fixed in the notes).

May 6: Database matching probability. Working out the probability of a query string S matching completely, somewhere, in a long string D, representing the concatenation of the sequences in a database D. This is covered in video Lecture 13 (which is again not an award-winning performance), and in the class notes from April 28, 2003. How to misunderstand probability; space aliens, coincidences (no notes or video for this). Granin story from Lecture notes 17 (Granin is discussed towards the end of video Lecture 18).
The take-home message: When you want to assess the statistical significance of an observation (say the alignment value between two strings), it is wrong to just compute the probability that the particular observation was produced by chance (under some random model - for example between two randomly generated strings with the same character frequencies as the two given strings), and then reject the null hypothesis (that the observation was produced by chance, by some random process) if the probability of the observed value is "low" (user defined). It is wrong to do that, but it is very commonly done. Instead, one has to compute the total probability of all the possible outcomes whose individual probabilities are less than or equal to the probability of the observation. That total quantity, generally called a p-value, is what must be low in order to reject the null hypothesis. Alternatively, one can compute the total probability of all the possible outcomes with values higher than or equal to the observed value (if a higher value suggests more biological association).

Here is an extreme example of this error: Suppose we flip a fair coin 100 times, and record its specific ordered sequence of heads and tails. The probability of that specific sequence is (1/2)**100, which is an extremely small number. But can we conclude from the fact that the probability is so small that the sequence was not produced by a random process, i.e., that it was produced by some non-random force or influence (space aliens or a strange distortion of local gravity)? We cannot, once we realize that *every* specific sequence of 100 flips has that same tiny probability, so that the sum of the probabilities of all the outcomes (specific ordered sequences of heads and tails) whose individual probability is equal to or smaller than the probability of the observed sequence is 1!
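The coin-flipping arithmetic above can be checked directly. The following is only a minimal sketch, assuming a fair coin; the function names are made up for illustration and do not come from the course notes. It contrasts the probability of one specific sequence with the two kinds of total probability described above.

```python
from fractions import Fraction
from math import comb


def prob_specific_sequence(n):
    """Probability of any one specific ordered sequence of n fair coin flips."""
    return Fraction(1, 2) ** n


def p_value_all_sequences(n):
    """Total probability of all outcomes whose individual probability is
    less than or equal to that of the observed sequence.

    Every one of the 2**n ordered sequences has the same probability,
    so all of them are included in the sum.
    """
    return (2 ** n) * prob_specific_sequence(n)


def p_value_at_least_k_heads(n, k):
    """Total probability of observing k or more heads in n fair flips.

    This is the 'values higher than or equal to the observed value'
    style of p-value, using the number of heads as the value.
    """
    return sum(Fraction(comb(n, j), 2 ** n) for j in range(k, n + 1))


if __name__ == "__main__":
    print(float(prob_specific_sequence(100)))         # tiny
    print(p_value_all_sequences(100))                 # total over all sequences
    print(float(p_value_at_least_k_heads(100, 60)))   # tail probability for 60+ heads
```

Exact fractions are used so that the total over all sequences comes out as exactly 1 rather than a rounded floating-point value.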
That is, we are guaranteed in this experiment (coin flipping) to produce a specific observation that has an extremely low probability, so when such a low-probability event is observed, we cannot use the fact that it has a low probability to conclude that some force of nature was involved, i.e., to reject the null hypothesis. Mistakes of this kind, in using probability, are common, and the granin story (although a bit more involved) is one example.

But it could be worse. Those mistakes at least take place in the context of experiments where one can enumerate all the possible outcomes and have a sensible random model, so a precise and meaningful null hypothesis can be stated. Then in principle, the correct kind of p-value calculations can be done, even if they are complex or not actually practical. A worse, but related, mistake is to say that an event has very low probability, and therefore conclude that it must be due to a non-random force, in a context where we don't know the set of possible events, or the probabilities of each event. In those cases, it is meaningless to use this hypothesis testing approach - it gets us nowhere. Question: What is the probability of a duck? What is the probability of life? What can it possibly mean to say that the probability of life is low?

May 8, 13: BLAST and Perl for ToyBLAST. See video Lectures 14, 15, 16, 17.

May 15: Patrice Koehl gave a guest lecture on protein sequence and structure alignment.

May 20: Multiple sequence alignment (MSA) - video Lecture 19. Definition of multiple sequence alignment. Sum-of-pairs objective function for MSA. Lecture notes on Multiple Sequence Alignment.

May 22: Multiple sequence alignment - video Lecture 19 on DP for sum-of-pairs, and video Lecture 20 on the tree consistency theorem and its use. Lecture notes on MSA.

May 27: Multiple sequence alignment: progressive alignment using a guide tree - ClustalX and star.pl. See video Lectures 20, 21 and Lecture notes on MSA.
Although consistency only assures that n-1 of the (n choose 2) induced pairwise distances are equal to (as small as) their optimal pairwise distances, if the objective function used for pairwise alignment satisfies the triangle inequality, then the total induced (sum-of-pairs) distance of the best center star is at most twice the total distance of the (n choose 2) optimal pairwise alignments. You will try this out in practice in Lab 9. Start of discussion of the uses of multiple sequence alignment for PSSMs. See video 22 and lecture notes on profiles.

May 29: Finish discussion of PSSMs - see video 22 and lecture notes on profiles. Start of the discussion on Markov Models and Hidden Markov Models. Read the notes handed out in class. We discussed the underlying philosophy of what a model or a generator of sequences is. This is discussed in the context of PSSMs in video Lecture 23, starting around minute 18.

June 3: Discussion of Hidden Markov Models and the Viterbi algorithm. See lecture notes distributed in class, and video Lectures 23 and 24. The example given in class follows the discussion in the Durbin and Eddy book on CpG islands - CpG-rich segments in DNA. Video Lecture 24 introduced Markov Models and HMMs and CpG islands. Video Lecture 25 discussed the Viterbi algorithm to find the most probable chain in an HMM that generates a given sequence.
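
As a rough illustration of the Viterbi algorithm mentioned above, here is a minimal sketch for a two-state HMM. The states ("I" for island, "B" for background) and all of the probabilities are made up for this example; they are not the CpG island model from the book or from the class notes. The function name and argument layout are also assumptions. Log probabilities are used to avoid underflow on long sequences.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most probable state path for obs."""
    # best[i][s]: log probability of the best path for obs[:i+1] that ends in state s
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    # back[i][s]: the state preceding s on that best path
    back = [{}]
    for i in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[i - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score + math.log(emit_p[s][obs[i]])
            back[i][s] = prev
    # Traceback from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical parameters, chosen only for illustration.
STATES = ["I", "B"]
START_P = {"I": 0.5, "B": 0.5}
TRANS_P = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
EMIT_P = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

if __name__ == "__main__":
    log_p, path = viterbi("ATATCGCGCGCG", STATES, START_P, TRANS_P, EMIT_P)
    print("".join(path), log_p)
```

With these made-up numbers, the A/T prefix of the example string is labeled background and the C/G-rich stretch is labeled island; different transition or emission probabilities would move or remove that boundary.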