Pfam
From Pfam: "The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)."
At the Pfam website, click on Help and read the section on scores. Then, in your own words, explain what Pfam scores are, what E-values mean in Pfam, and how they differ from E-values in BLAST.
Then go back to the Pfam main page, click on Sequence Search, and try out the example search given there. The first reported match is to PH, which should be a pattern described by an HMM. It would be natural for Pfam to make the details of the HMM available from this search, but I could not find them. See if you can find the details of the HMM that was used for this match to PH. The HMMs used in Pfam were created by a program called HMMER, which you can look at and download if interested.
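As a rough aid for the scores question, a bit score can be thought of as a base-2 log-odds ratio: how much more probable the sequence is under the family model than under a background (null) model. The sketch below is only a toy illustration of that idea. It assumes an ungapped, three-column profile over a four-letter alphabet with made-up probabilities; it is not the HMMER algorithm, which also handles insertions, deletions, and local alignment.

```python
import math

# Hypothetical three-column profile (made-up emission probabilities).
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
# Hypothetical null model: every letter equally likely.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def bit_score(seq, profile=PROFILE, background=BACKGROUND):
    """Sum of per-column log2(model probability / background probability)."""
    if len(seq) != len(profile):
        raise ValueError("toy model is ungapped: lengths must match")
    return sum(
        math.log2(column[letter] / background[letter])
        for letter, column in zip(seq, profile)
    )


if __name__ == "__main__":
    for candidate in ("ACG", "TTT"):
        print(candidate, round(bit_score(candidate), 2))
```

A positive score means the sequence looks more like the family than like background; a negative score means the opposite. The E-value is a separate quantity that depends on the score and on the size of the database searched.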
The next part of this lab concentrates on the Genscan gene finding program. This program uses Hidden Markov Models to predict genes from genomic DNA.
One of the most successful gene finding programs is Genscan. This program is built on the knowledge that genes show many structural similarities, especially on the organismal level. In addition, genomic organization can also be predicted using probabilistic models and can aid in gene-finding. Genscan uses Hidden Semi-Markov Models (HSMMs), similar to the HMMs discussed in lecture, to predict proteins. For those interested in learning more about the details of this type of program, check out this online audio and slide tutorial at home (since we have no audio), or just flip through the slides in lab. The tutorial is a bit dated at this point (it was probably put together around year 2000) since complete human genome reference sequence(s) are known now, providing additional information to build gene models, but the tutorial is still a good introduction to the use of Markov-type models to find features in sequences, and an introduction to Genscan in particular.
Here you can find a description of GenomeScan (an extension of Genscan). Although this program is slightly different, please read the section “1. OVERVIEW OF GENOMESCAN” which gives a good description of the types of biological data used to build the models for both programs. There is additional documentation at the Genscan web page.
Genscan was built using different sets of what is known as “training data”. Basically, this is known genomic sequence that is used to generate the probabilities for the models used by the program. Remember in class how we talked about transition and emission probabilities? Well, they have to come from somewhere. The creators of Genscan used three separate sets of training data: Human, Arabidopsis, and Maize. When using the program, you have to choose the model that is closest to the organism your sequence comes from. For example, the makers of Genscan and GenomeScan report some success using the Human/Vertebrate models on Drosophila.
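To make "the probabilities have to come from somewhere" concrete, here is a minimal sketch of estimating transition and emission probabilities by counting in labeled training sequences. Everything in it is an assumption for illustration: the two states (E for exon, I for intron), the tiny made-up training set, and the function name `estimate_hmm`. Genscan itself uses a much richer semi-Markov model with explicit length distributions, so this is not its training procedure.

```python
from collections import Counter, defaultdict


def estimate_hmm(training):
    """Estimate probabilities from (sequence, state_path) pairs by counting.

    Returns (transitions, emissions) as nested dicts of relative frequencies.
    No pseudocounts are added, so unseen events get no entry at all.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq, path in training:
        if len(seq) != len(path):
            raise ValueError("each base needs exactly one state label")
        for base, state in zip(seq, path):
            emit_counts[state][base] += 1
        for current, following in zip(path, path[1:]):
            trans_counts[current][following] += 1

    def normalize(table):
        return {
            state: {key: n / sum(counts.values()) for key, n in counts.items()}
            for state, counts in table.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


# Made-up training data: E = exon, I = intron.
TRAINING = [
    ("ATGGTAAG", "EEEIIIII"),
    ("GGCGTTAG", "EEEIIIII"),
]

transitions, emissions = estimate_hmm(TRAINING)

if __name__ == "__main__":
    print("transitions:", transitions)
    print("emissions:", emissions)
```

Swapping in training sequences with a different base composition or different exon and intron lengths changes every number in these tables, which is why the choice of organism model matters.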
1. What problems can you see arising from using data from one species to build a predictive model of gene and genome structure for another?
2. How many predicted peptides did you find?
3. How many amino acids long is the third peptide?
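If you want to double-check your answers to questions 2 and 3, a few lines of code can count the peptides and their lengths. This sketch assumes you have saved the predicted peptides as FASTA-formatted text (a `>` header line followed by sequence lines); the records in `EXAMPLE` are made up and are not real Genscan output.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example records, for illustration only.
EXAMPLE = """>toy_peptide_1
MKTAYIAKQR
QISFVK
>toy_peptide_2
MSDNE
"""

peptides = read_fasta(EXAMPLE)
lengths = [len(seq) for _, seq in peptides]

if __name__ == "__main__":
    for (name, _), n in zip(peptides, lengths):
        print(name, n, "aa")
```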
Copy the amino acid sequence for the third putative protein and paste it into BLASTp (find it at BLAST). Run the protein-protein BLAST on the non-redundant (nr) database.
4. What is the top hit? How many amino acids long is it? How does this compare to the Genscan predicted peptide?
Can you guess what’s going on here? How well do you think Genscan predicts Drosophila genes now? Refer back to your BLAST results. Do you see other hits with low E-values? There are quite a few with E-values of 0. If you have time, explore a few of these.
5. Any theories about what happened?
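For a feel of why some BLAST hits show an E-value of 0, recall the standard relation between a bit score S' and the E-value: E = m × n × 2^(-S'), where m and n are the query and database lengths. The sketch below works in log10 to avoid floating-point underflow. The lengths and scores are made-up values, and real BLAST uses effective (edge-corrected) lengths, so treat the output as an order-of-magnitude illustration only.

```python
import math


def log10_e_value(bit_score, query_len, db_len):
    """log10 of E = query_len * db_len * 2**(-bit_score)."""
    return (
        math.log10(query_len)
        + math.log10(db_len)
        - bit_score * math.log10(2)
    )


if __name__ == "__main__":
    # Hypothetical query of 400 residues against 1e11 database residues.
    for score in (30, 100, 800):
        print(score, "bits -> log10(E) =", round(log10_e_value(score, 400, 1e11), 1))
```

Each extra bit of score halves the E-value, so a long, near-identical match quickly reaches E-values too small to display, and they are reported as 0.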
To accurately interpret and annotate raw genomic sequence, the best strategy is to use a combination of tools. For an idea of how this is done, check out GeneBander. You don’t have to read the whole page, but look at the example. It incorporates a lot of the technology and ideas we’ve covered this quarter, combining BLAST, CLUSTAL, Genscan, and other programs to provide the best possible predictive picture. It sure would have helped our search if we had included the BLAST information!