Lab 8 finishes the Toy Blast program and introduces Multiple Sequence Alignment. You will learn how to use ClustalX, one of the most popular global multiple alignment programs, and you will use a homegrown multiple alignment program, star.pl, that we will discuss in class.
Turn in your answers to the questions and exercises below. For Perl programs, use script (the Unix command that records a terminal session), or some other means, to print out the program and show how it runs on data.
To bring up clustalx, start Xming so that you can view an X window remotely (at Hutchison), and log on to a csif machine (using PuTTY). clustalx should be available on any csif machine. To start it, enter the command: clustalx2
Your first task is to walk through the following practical tutorial on ClustalX.
The tutorial may show screenshots that differ a little from what you get now, so
you will have to be flexible and transfer what you learn from the tutorial to
the current version. You don't need to do the exercises in the tutorial or read everything;
just read enough that you can use ClustalX to do the alignment of the globins described below.
ClustalX is also widely available on other systems if you want
to work elsewhere.
As you work through the tutorial, you should try out
what it tells you to do, using the clustalx that you started on
a csif machine. However, if the tutorial asks you to download a sequence, you
will not be able to do that, because there are no sequences already loaded on
the csif machines. To solve that problem, go to the csif window, change to the directory
where you will do the ClustalX exercise, and use the command
cp ~cs124/fastaglobins.txt .
to copy the file of the globins in fasta format into a file called
fastaglobins.txt in your directory. If that doesn't work,
you can open a file in vi on the csif machine, in the same directory
where you are running ClustalX, and cut and paste
the desired sequences into that file (on the Dells in Hutchison you need to
use the middle mouse button to do the pasting).
Then you should be able to use the Load Sequences feature in
the ClustalX File menu. For example, you can do this with the globins in fasta format (below),
or with the sequences in inf1.fas from the tutorial, although I don't find their multiple alignment
to be very interesting.
Or, since you are running Xming, open your favorite graphical editor (kate, for instance) and paste the text using that editor.
You could also use WinSCP. Just launch it and log in. You'll see two lists of files, one on the base machine and one on the target. You can drag and drop the file to where you want it on the target machine.
If you still can't figure out how to do this, send Marisano an email at mjajames@ucdavis.edu.
ClustalX Tutorial
You don't need to read about PAUP or other parts of that tutorial. Later,
when we talk about phylogeny programs you might want to read about PAUP although we
won't be using it in this class.
Bring up the tutorial in a browser, and open
Another tutorial (with again some of the same issues) is at
Another ClustalX Tutorial
Download the file aligned globins, which shows a polished, hand-optimized multiple alignment of many globin sequences.
Download the file globins in fasta format, which has the same sequences as the aligned globins file, but with the spaces in the alignments removed. The sequences are also now in fasta format, which requires that each sequence entry begin with a line starting with the character >. The same line can also carry information about the sequence, but in this file I just use line1, line2, etc. If you want, you can change those labels to say what the actual sequences are.
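One quick way to check that a file really is in fasta format is to count its header lines. The sketch below builds a tiny made-up file (the name demo.fasta and its two sequences are illustrative, not the course data) and counts the lines that start with >. On the real globins file, the count should equal the number of sequences you expect.

```shell
# Work in a scratch directory so nothing in the lab directory is touched.
demo_dir=$(mktemp -d)

# A tiny made-up fasta file: each entry is a ">" header line followed by sequence lines.
printf '>line1\nVLSPADKTNV\n>line2\nVHLTPEEKSA\n' > "$demo_dir/demo.fasta"

# Count the entries: every sequence contributes exactly one line starting with ">".
entry_count=$(grep -c '^>' "$demo_dir/demo.fasta")
echo "entries: $entry_count"
```

On a csif machine the same check would be grep -c '^>' fastaglobins.txt, run in the directory that holds the file.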
Now use clustalx (loading the globin sequences in fasta format that you just downloaded) to get a multiple alignment of those sequences. Note that when you read in the sequences, clustalx shows you how they are aligned at present, with colorful graphics. That may trick you into thinking it has already done the multiple alignment, but it has not. You have to actually ask it (by the appropriate clicks) to do the multiple alignment. Now get the multiple alignment done. How does it compare to the original one in the aligned globins file?
Download the file globins for star.pl, which is the same globins but with all spaces and labels removed.
Also download the program star.pl, which we will use to multiply align the sequences. The workings of this
program will be explained in class.
You will also need a weight matrix, weight.txt, when you run the program star.pl. Download that matrix from weight.txt
NOTE MAY 30: Some students were having Windows-to-Unix conversion problems getting the downloaded
files to run with star.pl. As a solution, we hope, the needed files have now been put on the csif
machines so that you can use them there without transferring them from a Windows to a Unix environment.
When you have a csif Unix window open, put a copy of star.pl into the Unix directory where you are working
by using the command: cp ~cs124/star.pl .
You can get a copy of the needed globins for use with star.pl by using the command: cp ~cs124/multaligndata.txt .
That will create a file in your directory called multaligndata.txt,
which might still have the nasty \r characters in it.
You can get a copy of the needed weight matrix to use with star.pl by using the command: cp ~cs124/weight.txt .
That will create a file in your directory called weight.txt,
which might still have the nasty \r characters in it.
This should work once Marisano puts clean copies of the needed files there.
Because of the data problems for this exercise, we will
now consider it an Extra-Credit project. That is, if you do all the other parts of this lab well, you
will already get full credit for this lab. Or, if you have a copy of
the string and weight files with the extra \r in them, the \r characters can
be removed by using Marisano's script on the class newsgroup.
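If the newsgroup script is not at hand, the carriage returns can also be removed with standard Unix tools. The following is a minimal sketch, not Marisano's script: strip_cr is a name made up here, and the demo file merely stands in for multaligndata.txt or weight.txt.

```shell
# strip_cr FILE: rewrite FILE in place with every carriage return (\r) deleted.
# (strip_cr is a hypothetical helper name, not part of the course materials.)
strip_cr() {
    tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demonstrate on a throwaway file with Windows (\r\n) line endings.
cr_dir=$(mktemp -d)
printf 'ACDE\r\nFGHI\r\n' > "$cr_dir/demo.txt"
strip_cr "$cr_dir/demo.txt"

# Count the \r bytes that remain; the count is 0 once the file is clean.
remaining_cr=$(tr -cd '\r' < "$cr_dir/demo.txt" | wc -c | tr -d ' ')
echo "carriage returns left: $remaining_cr"
```

With the function defined in your csif shell, the real files would be cleaned with strip_cr multaligndata.txt and strip_cr weight.txt.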
Also, for star.pl you do not want the globin strings in fasta
format. You want a file with the raw strings, one per line in the file.
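If you only have a fasta file, a raw-string file can be derived from it. The sketch below is one way to do that with awk, under a few assumptions: fasta_to_raw is a made-up name, the demo sequences are illustrative rather than the real globins, and a sequence is allowed to wrap over several lines.

```shell
# fasta_to_raw FILE: print each sequence in FILE on one line, without its ">" label.
# (fasta_to_raw is a hypothetical helper name, not part of the course materials.)
fasta_to_raw() {
    awk '
        /^>/ { if (seq != "") print seq; seq = ""; next }  # header line: emit the finished sequence
             { gsub(/[ \t\r]/, ""); seq = seq $0 }          # sequence line: drop whitespace, then append
        END  { if (seq != "") print seq }                   # emit the last sequence
    ' "$1"
}

# Demonstrate on a throwaway file whose first sequence is wrapped over two lines.
raw_dir=$(mktemp -d)
printf '>line1\nVLSPAD\nKTNV\n>line2\nVHLTPEEK\n' > "$raw_dir/demo.fasta"
fasta_to_raw "$raw_dir/demo.fasta" > "$raw_dir/demo_raw.txt"
cat "$raw_dir/demo_raw.txt"
```

The demo prints VLSPADKTNV and VHLTPEEK on separate lines, which is the one-string-per-line layout described above. The sketch also drops any \r characters it meets in sequence lines.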
Run this multiple alignment program with the globins you just downloaded for star.pl.
The multiple alignment will be mixed in with other output, so cut it out and save it to a file. How does it compare (just by eyeballing the alignments) to the multiple alignment we started from in the file aligned globins, and to the alignment produced by clustalx? Take note of the ratio of the optimal pairwise to the induced scores produced by star.pl. We will explain this in class.
In the star.pl program, you are initially asked for a center number and told to use the mini center first. After the first multiple alignment is found, the program gives you the option to specify another center and find the resulting multiple alignment. Do this a couple of times, picking centers other than the mini center. Each time, see how well the resulting multiple alignment matches the original alignment in globins, and note the ratio of optimal to induced scores. Record these ratios, and state your conclusions.
2) Write the recurrences for the DP that finds the optimal two-sequence alignment when the objective is to minimize (# mismatches + # spaces) in the alignment. Implement these recurrences in Perl and get the program running. Run your program and the Needleman-Wunsch program on the same pairs of sequences and compare the resulting alignments. How do they compare?