Lab 4, April 25

Exercises are due Friday May 2.

This week's lab concentrates on Perl and on LCS.

What you need to turn in are your answers to the Perl Exercises. For Perl programs, use the Unix script command to record a listing of the program and a run showing how it behaves on data.

Perl (and related) Exercises for the week:


Exercises
  1. Regular Expressions in Perl

    Continue reading the Second Perl Notes and do exercises 2.3 through 2.7.

    In the next exercises you will work with regular expression programs that are on the web. To start, be sure you have gotten the first three programs running and that you understand how the regular expression works in each of them. In the starting comments of each program, you are asked to run the program in certain ways. Be sure you do those, but DON'T turn in scripts of those executions.

    The three programs (along with a fourth) can be found at: Perl regular expression programs

    The fourth program is very similar to the third, but has some extensions. It is called accession.pl.

    Perl Exercise 1: Modify accession.pl so that it only prints the accession numbers it finds, and prints each one on a separate line. That is, it prints a list of the accession numbers it finds in the input file, but does not print anything else. Below is a testfile you can use.

    Perl Exercise 2: The meaning of [ ,.;:?] in the regular expression is that the digits of an accession number must be followed by any ONE of the six characters listed between the brackets [...]. So this is an OR of the six characters. Now in the second of the three programs linked above (the one that looks for a DNA string in an input line), we used | to indicate an OR. That regular expression had (A|T|C|G|a|t|c|g). Replace that with [ATCGatcg] and see whether the program still works. Script the result.
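
    As a small illustration of the two ways to write an OR over single characters, the sketch below compares the two forms on a few words taken from the testfile. The anchors (^ and $), the + quantifier, and the sub names are assumptions made for this example; they are not taken from the course program, whose surrounding pattern may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the DNA pattern: one or more DNA letters,
# anchored so that the whole word must consist of DNA letters.
my $with_alternation = qr/^(A|T|C|G|a|t|c|g)+$/;
my $with_class       = qr/^[ATCGatcg]+$/;

# Each sub returns 1 if the word matches its pattern and 0 otherwise.
sub matches_alternation {
    my ($word) = @_;
    return $word =~ $with_alternation ? 1 : 0;
}

sub matches_class {
    my ($word) = @_;
    return $word =~ $with_class ? 1 : 0;
}

# Print both results side by side for a few words from the testfile.
for my $word (qw(atcGGGAcatcaGGG XYzCATCGGAPQ U63121)) {
    printf "%-16s alternation=%d class=%d\n",
        $word, matches_alternation($word), matches_class($word);
}
```

    A character class can only choose among single characters, while | can also choose among longer alternatives, so the two forms are interchangeable only in cases like this one.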

    Notice that the third program reads in and processes every line in a file. This may be useful for you in the LCS exercise further on in this lab.
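
    For reference, a common Perl idiom for processing a file line by line looks roughly like the sketch below. It is not the course's third program; the sub name read_sequences and the sample data are made up for this example, and an in-memory string stands in for a real file so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Read every line from an open filehandle and return the non-empty
# lines with their trailing newlines removed.
sub read_sequences {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        push @lines, $line;
    }
    return @lines;
}

# Demonstration on an in-memory "file". With a real file you would
# write open(my $fh, '<', $filename) instead.
my $text = "ACGT\nGGCA\n\nTTAG\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @seqs = read_sequences($fh);
close $fh;

print scalar(@seqs), " sequences read\n";
```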

    The next paragraph holds the contents of a testfile you can use for exercises 1 and 2. You can cut and paste it into a file:

    The primary ACCESSION number might be something like U63121 or maybe the ACCESSION is FA10325, or the ACCESSION is PQ3469. I don't know what the sequences are for these numbers. Maybe they are atcGGGAcatcaGGG or maybe not. Is this another ACCESSION number NO9978 or is it a telephone number. Perhaps XYzCATCGGAPQ is a DNA string? What will the program do with it? Is that what you want it to do?

    Read about lists and arrays in Johnson if you have not already done so.

  2. LCS computation

    As detailed in Lab 3, your modification of the Needleman-Wunsch program can be used to compute the Longest Common Subsequence between two strings, by setting the match score and mismatch and space penalties appropriately (review this from Lab 3 if you are unclear about it).
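
    One standard choice of parameters is a match score of 1 with mismatch and space scores of 0, maximizing the total; the best alignment score is then exactly the LCS length. The sketch below is a stand-alone dynamic program under that assumption. It is not the Lab 3 Needleman-Wunsch program, and the sub name lcs_length is made up for this example, so check the parameter values against your Lab 3 notes.

```perl
use strict;
use warnings;

# Length of the longest common subsequence of $s and $t, computed as a
# global alignment score with match = 1, mismatch = 0, space = 0.
# Only two rows of the table are kept, since only the score is needed.
sub lcs_length {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Row 0: aligning the empty prefix of $s with any prefix of $t scores 0.
    my @prev = (0) x (@t + 1);

    for my $i (1 .. scalar @s) {
        my @curr = (0);    # column 0: empty prefix of $t
        for my $j (1 .. scalar @t) {
            # Diagonal move: match scores 1, mismatch scores 0.
            my $best = $prev[$j - 1] + ($s[$i - 1] eq $t[$j - 1] ? 1 : 0);
            # Vertical and horizontal moves insert a space, scoring 0.
            $best = $prev[$j]     if $prev[$j] > $best;
            $best = $curr[$j - 1] if $curr[$j - 1] > $best;
            $curr[$j] = $best;
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print lcs_length("AGCAT", "GAC"), "\n";    # prints 2 (for example "AC")
```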

    In class we showed that the expected length of the LCS between two random DNA strings, each of length n (or even between one random DNA string and one arbitrary DNA string, each of length n), is at least n/4, but that the true expected LCS length is at least n/2 and might be close to 3n/4. In this exercise, you will determine as best you can what the true expected length is.

    A Perl program to produce random DNA strings can be found at:

    random dna

    Get this program running and make sure you understand how it works. Then you may want to modify it so that it asks the user how many random strings to produce, and then it generates that many strings and writes them into a file. The main modification is to put a loop around part of the existing program.
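
    The general shape of such a loop is sketched below. This is not the course's random dna program: the sub names, the uniform choice among A, C, G and T, and the hard-coded count are all assumptions made for this example. In the exercise the count would instead be read from the user, and the output would go to a file.

```perl
use strict;
use warnings;

my @BASES = qw(A C G T);

# Return one random DNA string of the given length, each position
# chosen uniformly and independently from A, C, G, T.
sub random_dna {
    my ($length) = @_;
    return join '', map { $BASES[ int rand @BASES ] } 1 .. $length;
}

# Write $count random strings of the given length, one per line,
# to an already opened filehandle.
sub write_random_strings {
    my ($out, $count, $length) = @_;
    for (1 .. $count) {
        print {$out} random_dna($length), "\n";
    }
    return;
}

# Demonstration: three strings of length 10 on standard output.
# To write a file instead, open one for writing first, for example
# open(my $out, '>', 'random_strings.txt') with a name of your choice.
write_random_strings(\*STDOUT, 3, 10);
```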

    Now generate gobs (that is a technical term for a number that is at least 10) of random strings of length 10 each, and using your modified Needleman-Wunsch from Lab 3, compute the length of the LCS between the first of your strings and each of the other ones. Compute the average LCS length obtained. Then repeat with strings of length 20, then 50, then 100, then 200. What do you observe about the average LCS length?

    If you are comfortable enough with Perl and programming at this point, you may want to put everything together into a single program that generates the random strings, computes the LCS lengths, and computes and reports the averages. This may save you a lot of effort compared to running programs over and over again. In fact, if you do have a single program, then for each string length you should generate at least 100 random strings.
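
    One possible outline of such a single program is sketched below. It uses its own small helpers (random_string and lcs, both made-up names) rather than the course's random dna program or your Lab 3 program, and it assumes match = 1 with mismatch and space scores of 0. The file name in the usage comment is also hypothetical. The results vary from run to run, so no expected output is shown.

```perl
use strict;
use warnings;

my @LETTERS = qw(A C G T);

# One random DNA string of the given length (uniform, independent letters).
sub random_string {
    my ($length) = @_;
    return join '', map { $LETTERS[ int rand @LETTERS ] } 1 .. $length;
}

# LCS length by dynamic programming (match = 1, mismatch = 0, space = 0).
sub lcs {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0) x (@t + 1);
    for my $i (1 .. scalar @s) {
        my @curr = (0);
        for my $j (1 .. scalar @t) {
            my $best = $prev[$j - 1] + ($s[$i - 1] eq $t[$j - 1] ? 1 : 0);
            $best = $prev[$j]     if $prev[$j] > $best;
            $best = $curr[$j - 1] if $curr[$j - 1] > $best;
            $curr[$j] = $best;
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Generate $count strings of the given length and return the average
# LCS length between the first string and each of the others.
sub average_lcs {
    my ($length, $count) = @_;
    die "need at least 2 strings\n" if $count < 2;
    my @strings = map { random_string($length) } 1 .. $count;
    my $first   = shift @strings;
    my $total   = 0;
    $total += lcs($first, $_) for @strings;
    return $total / scalar @strings;
}

# Usage: perl lcs_experiment.pl 100
# The argument is the number of strings to generate per length.
if (@ARGV) {
    my $count = $ARGV[0];
    for my $n (10, 20, 50, 100, 200) {
        my $avg = average_lcs($n, $count);
        printf "n = %3d   average LCS = %7.2f   average / n = %.3f\n",
            $n, $avg, $avg / $n;
    }
}
else {
    print "usage: perl lcs_experiment.pl COUNT\n";
}
```

    Printing the ratio of the average to n alongside the average itself makes it easier to compare the results with the n/4, n/2 and 3n/4 figures mentioned above.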