Begun in 1988, PROSITE is a database of regular-expression-like patterns that can be searched against sequences of unknown function to aid in their classification. The designers of the PROSITE database had the following leading concepts in mind.
Completeness. For any compilation of protein family identifiers to be helpful, it is important that it contain as many biologically meaningful patterns and profiles as possible.
High Specificity. Patterns and profiles should be specific enough not to detect too many unrelated sequences, yet they should detect most if not all of the sequences that clearly belong to the set under consideration.
Documentation. Each of the entries in PROSITE is fully documented; this documentation (described in greater detail below) includes a concise description of the protein family or domain it was designed to detect, as well as a summary of the reasons the pattern was developed.
Periodic Reviewing. Entries in the database are reviewed periodically to ensure they are still valid and up to date.
A Symbiotic Relationship with SWISS-PROT. A very close relationship is maintained between the PROSITE database and SWISS-PROT. Updates of the two databases are usually done in parallel, so that a PROSITE entry and the corresponding annotations of related SWISS-PROT entries change concurrently.
In addition to sequence classification, PROSITE can also be used as a curated catalog of protein families. Each member of a PROSITE family has been included as the result of an expert review, so PROSITE can be more useful than a protein database search for gathering all the confirmed members of a protein family. There is a link at the bottom of each PROSITE entry for this purpose.
The core PROSITE database consists entirely of two flat text files. The first file (PROSITE.DAT) is computer-parseable and contains all of the information necessary for any program that uses PROSITE to scan sequences for the occurrence of a pattern or profile. The file also contains statistics on the number of true positives, false positives, and false negatives for the current release of SWISS-PROT. The second file (PROSITE.DOC) contains textual information documenting each pattern.
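As a rough illustration of what "computer-parseable" means here, the sketch below reads pattern lines out of flat-file text. It assumes a layout in which every line begins with a two-letter line code (such as ID or PA), a pattern may continue over several PA lines, and each entry ends with a `//` line. The sample text, the entry names EXAMPLE_ONE and EXAMPLE_TWO, and the function names are hypothetical and chosen only for illustration; the two patterns are the ones used as syntax examples later in this document.

```python
from collections import defaultdict

def parse_entries(text):
    """Split flat-file text into entries; each entry maps a line code to its lines."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # assumed end-of-entry marker
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        code, _, rest = line.partition(" ")
        if len(code) == 2:                 # keep only lines with a two-letter code
            current[code].append(rest.strip())
    if current:
        entries.append(dict(current))
    return entries

def pattern_of(entry):
    """Join the PA lines of one entry and drop the period that ends the pattern."""
    return "".join(entry.get("PA", [])).rstrip(".")

# Hypothetical sample text, not copied from a real PROSITE.DAT release.
SAMPLE = """\
ID   EXAMPLE_ONE; PATTERN.
PA   [AC]-x-V-x(4)-{ED}.
//
ID   EXAMPLE_TWO; PATTERN.
PA   <A-x-[ST](2)-
PA   x(0,1)-V.
//
"""

entries = parse_entries(SAMPLE)
for entry in entries:
    print(entry["ID"][0], "->", pattern_of(entry))
```

A real scanning program would also read the remaining line codes; this sketch keeps them in the dictionary but does not interpret them.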
One advantage of viewing PROSITE entries from the ExPASy home page is that they are formatted for greater readability in a web browser by the NiceSite tool. The following is the NiceSite view of the PROSITE entry for the flavodoxin family.
True positive hits:
FLAV_ANASP (P11241), FLAV_AZOCH (P23001), FLAV_AZOVI (P00324), FLAV_CHOCR (P14070), FLAV_CLOAB (P18855), FLAV_CLOBE (P00322), FLAV_DESDE (P26492), FLAV_DESGI (Q01095), FLAV_DESSA (P18086), FLAV_DESVH (P00323), FLAV_ECOLI (P23243), FLAV_ENTAG (P28579), FLAV_HAEIN (P44562), FLAV_MEGEL (P00321), FLAV_NOSSM (P35707), FLAV_RHOCA (P52967), FLAV_SYNP2 (P31158), FLAV_SYNP7 (P10340), FLAV_SYNY3 (P27319), FLAW_DESDE (P80312), FLAW_DESGI (Q01096), FLAW_ECOLI (P41050), FLAW_KLEOX (P56268), FLAW_KLEPN (P04668), FLAW_RHOCA (P18607), FLAW_SALTY (P55887)
`Potential' hits (sequences that belong to the set under consideration, but which were not picked up because the region(s) that are used as a 'fingerprint' (pattern or profile) is not yet available in the data bank (partial sequence)):
FLAV_KLEPN (O07026), FLAW_AZOCH (P23002), FLAW_AZOVI (P52964)
False positive hits (sequences which do not belong to the set under consideration):
HEMG_ECOLI (P27863), IE68_HSVSA (Q01042), MAP3_SCHPO (P31397), RUAP_SOYBN (P39657), TRPA_PSEAE (P07344), TRPA_PSESY (P34816), YT21_MYCTU (P71557)
The entry contains general information on the regular expression used to detect membership in the family, as well as the scan results for the current revision of SWISS-PROT.
The documentation corresponding to this entry is also available from the PROSITE web page. The following is the NiceSite view of that documentation:
Flavodoxins [1,E1] are electron-transfer proteins that function in various electron transport systems. Flavodoxins bind one FMN molecule, which serves as a redox-active prosthetic group. Flavodoxins are functionally interchangeable with ferredoxins. They have been isolated from prokaryotes, cyanobacteria, and some eukaryotic algae. The signature pattern for these proteins is derived from a conserved region in their N-terminal section, this region is involved in the binding of the FMN phosphate group.
From the web page, PROSITE entries can be accessed using the following methods. Both relevant PROSITE entries and documentation are retrieved.
Use this to search the description field of the PROSITE documentation entries. You may enter search terms as words or partial words. This means that to find the entry for flavodoxin, you could type either Flavodoxin or just flav. If you give more than one keyword, only the entries containing all of your keywords will be listed.
If you know the full name of an entry or the accession number you can use this box to retrieve the entry. You may give only one name or accession at a time. For example, the following will retrieve entries for the flavodoxin proteins.
Returns the PROSITE entry FLAVODOXIN
Also returns the PROSITE entry FLAVODOXIN
Returns the PROSITE documentation for FLAVODOXIN
Use this method to search for an author's name in the references field of the PROSITE documentation.
The following methods would be acceptable for finding the author Amos Bairoch.
If multiple authors match your query, an ordered list of relevant entries is shown. For example, the following is the result of a search for the author name Kimura.
Release 16, September 1999 and updates up to 20-Nov-1999
Please choose one of the following entries:
Use this method to look for a PROSITE pattern cited in a particular journal. You must know the journal name or its abbreviation and, optionally, a year and/or a volume number to narrow the search. If you are not sure of the exact journal name, a hyperlink to a list of journals cited in PROSITE is also provided.
Journal Name: Volume: Year:
Use this method to search the full text of PROSITE documentation entries. The search is case insensitive, and partial terms can be matched by checking the box that adds wildcards to the beginning and end of words.
You can also use the boolean operators AND, OR, and NOT to restrict your search, although parentheses (), which may be part of certain words in PROSITE, are not allowed.
If multiple words are given without boolean operators, they will be searched for as a phrase of adjacent words.
Some search examples provided by PROSITE:
- heat shock: will list entries containing "heat shock" (adjacent words).
- atpase AND coli: will list entries containing both "atpase" and "coli".
- coli AND {atpase OR atp synthetase}: will list entries containing "coli" and either "atpase" or "atp synthetase".
- atpase NOT coli: will list entries containing "atpase", except those containing "coli".
- aldehyde: with the option 'Prefix and append wildcard '*' to words', will list entries containing "aldehyde" as well as "glyceraldehyde", "lactaldehyde", "aspartate-semialdehyde", etc.; without that option, will only list entries containing the exact word "aldehyde".
Enter search keywords: Prefix and append wildcard '*' to words.
For more powerful retrievals you can use the Sequence Retrieval System (SRS) to search the PROSITE database. A link to an ExPASy SRS server is provided for this purpose.
The ScanPROSITE tool can be used to scan a sequence against all PROSITE patterns, or to scan a PROSITE pattern against the current SWISS-PROT and TrEMBL databases.
This tool is useful for identifying the function of, or the functional domains within, a protein.
The protein can be entered as a SWISS-PROT/TrEMBL accession number (AC), e.g. P05130, or as a sequence identifier (ID), e.g. KPC1_DROME.
Alternatively, a user-provided sequence can be pasted as text into the box provided.
An option is also available to exclude patterns with a high probability of occurrence.
This tool makes it possible to scan a PROSITE or user-provided pattern against SWISS-PROT and TrEMBL.
A dialog at the bottom also allows the search to be limited by organism or species. It does this by returning only those database hits that match the specified terms in the OC or OS fields.
A database search can take a few minutes, so an option to send the search results back by e-mail is also offered.
There is a distinct methodology behind the development of a PROSITE pattern.
The first and most important criterion is that a pattern must exhibit high specificity as well as high sensitivity. This means that a PROSITE pattern should be as short as possible while still detecting all or most of the sequences it was designed to describe and minimising the occurrence of false positives.
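These two criteria can be put into numbers using the hit counts from the flavodoxin entry shown earlier, which lists 26 true positives and 7 false positives. The sketch below takes "specificity" to mean the fraction of hits that are true positives (often called precision) and "sensitivity" to mean the fraction of family members that are detected; both definitions are assumptions made for this illustration. The excerpt lists no false negatives, so 0 is assumed, and the three 'potential' hits are treated separately because they are partial sequences.

```python
def precision(tp, fp):
    """Fraction of reported hits that really belong to the family."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Fraction of family members that the pattern detects."""
    return tp / (tp + fn)

# Counts taken from the flavodoxin entry shown earlier.
tp, fp = 26, 7
fn = 0            # assumption: no false negatives are listed in the excerpt
potential = 3     # partial sequences lacking the fingerprint region

print(f"precision   = {precision(tp, fp):.3f}")
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
# Pessimistic variant: count the 'potential' hits as misses.
print(f"sensitivity (potential hits as misses) = {sensitivity(tp, fn + potential):.3f}")
```

Lengthening a pattern usually raises the first figure and risks lowering the second, which is the trade-off the pattern designers are balancing.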
The authors generally start pattern development by studying reviews on a group or family of proteins. From there a table of relevant sequences is assembled. In addition to sequences from the literature, any new or unpublished sequences relevant to the family under consideration are added. From this list of sequences a multiple alignment is constructed. Particular attention is paid to regions thought or proven to be important to the biological function of that group of proteins. Here is a listing of some of the sites taken into consideration:
- Enzyme catalytic sites.
- Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin, etc.).
- Amino acids involved in binding a metal ion.
- Cysteines involved in disulfide bonds.
- Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.
The multiple sequence alignment is followed by an iterative process of identifying and testing possible patterns. First, short conserved sequences (not more than four or five residues long) that belong to the regions of biological importance are identified. These are referred to as 'core' patterns. The most recent version of the SWISS-PROT database is then scanned with these core patterns. If any of the core patterns detects all of the proteins under consideration and none (or very few) of the other proteins, that core pattern is suitable for use in PROSITE. If none of the core patterns is suitable, a further series of scans involving a gradual increase in the size of the core pattern(s) is necessary until the desired level of sensitivity and specificity is met.
It should be noted that the process by which PROSITE patterns are created is not automated. This has the great advantage of excluding 'false' patterns that are not of biological significance and are therefore unsuitable for detecting new sequences. From an information-theoretic point of view, such a pattern could be thought of as over-fitting the data. An example of the 'false' pattern hazard is given below.
Let us assume that we have a partial alignment of three sequences around an active site residue (in this example a histidine whose position is marked with an asterisk) as shown below:
       *
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Here we would start scanning with a core pattern with the sequence A-T-H-[D or E]. This pattern is small and would probably pick up too many false positive results. According to the procedure outlined above, we would then have to extend the core pattern. But in this case, any extension would be artificial and would group together residues which have different properties and which are represented only once in a given position of the alignment. For example, we could scan with the pattern [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H-[D or E]. This pattern would probably only pick up the sequences which are in the alignment, but it would be biologically meaningless; there is no consensus in the first three positions of the pattern and the pattern does not even group residues with identical physicochemical properties. Consequently, this pattern would probably fail to detect a new sequence containing the same active site but having a different N-terminal region.
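The hazard can be reproduced in a few lines. The sketch below writes the core pattern and the artificially extended pattern as Python regular expressions and scans the three aligned sequences plus one extra sequence. The extra sequence, GKLMPATHDGV, is hypothetical: it was made up to carry the same A-T-H-D active site with different residues in front of it.

```python
import re

# The three sequences from the alignment above.
alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]

# Core pattern A-T-H-[DE] and the artificial extension discussed above.
core = re.compile(r"ATH[DE]")
extended = re.compile(r"[RTD][DAQ][FEA]ATH[DE]")

# Hypothetical new family member: same active site, different N-terminal residues.
new_sequence = "GKLMPATHDGV"

for seq in alignment + [new_sequence]:
    print(seq,
          "core:", bool(core.search(seq)),
          "extended:", bool(extended.search(seq)))
```

Both patterns match every sequence in the alignment, but only the core pattern finds the new sequence. The extended pattern has simply memorised the three training sequences.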
A number of the patterns in the PROSITE database have been previously published. Patterns extracted from published articles are tested against the current release of the SWISS-PROT data bank to ensure that they are up to date. If they are, they are included without modification; otherwise they are updated using an extension of the method for finding new patterns described above.
PROSITE patterns are described using the following conventions:
- The standard IUPAC one-letter codes for the amino acids are used.
- The symbol `x' is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
- Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
- Each element in a pattern is separated from its neighbor by a `-'.
- Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
- When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol.
- A period ends the pattern.
Examples:
PA [AC]-x-V-x(4)-{ED}.
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
PA <A-x-[ST](2)-x(0,1)-V.
This pattern, which must be in the N-terminal of the sequence (`<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
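Because these conventions map closely onto ordinary regular-expression syntax, a pattern can be translated mechanically. The following is a minimal sketch, not the converter used by any PROSITE tool; it handles only the conventions listed above, and the function name `prosite_to_regex` is made up for this example.

```python
import re

# One pattern element: x, a single residue, [..] or {..}, with optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")          # a period ends the pattern
    anchored_start = pattern.startswith("<")       # N-terminal restriction
    anchored_end = pattern.endswith(">")           # C-terminal restriction
    pattern = pattern.lstrip("<").rstrip(">")
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            token = "."                            # any amino acid
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"     # excluded residues
        else:
            token = residue                        # literal residue or [..] group
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")

first = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
second = prosite_to_regex("<A-x-[ST](2)-x(0,1)-V.")
print(first)
print(second)
print(bool(re.search(first, "MACVGGGGK")))    # last residue is not E or D
print(bool(re.search(second, "MAKSTV")))      # fails: the match must start at the N-terminus
```

The resulting expressions can be passed to `re.search` to scan a sequence held as a plain uppercase string.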
The PROSITE pattern for the flavodoxin family of proteins is derived from a conserved region in their N-terminal section. This region is of biological significance and is involved in the binding of the FMN phosphate group.
Currently there is no PROSITE regular expression for detecting proteins belonging to the myoglobin family. In this workshop you will use methods similar to those used to develop the patterns in PROSITE to develop a regular expression for detecting members of the myoglobin family.
Step One. Obtain the Swiss-Prot sequences for myoglobins.
How did you do this? Answer: Use SRS.
Step Two. Create a ClustalW alignment of the sequences retrieved in step one. Since you will be examining this alignment closely, ClustalX is a good choice for creating the alignment.
Step Three. From the resulting multiple alignment choose two or three core regular expressions. Your regular expressions should contain only substitution groups and wildcard characters; variable-length wildcards are not allowed (e.g. [liv]..[g]...[r]). Search these expressions against Swiss-Prot.
Does increasing the number of members of a substitution group increase or decrease the number of sequences matching the pattern? Answer: Increase.
Does increasing the number of substitution groups increase or decrease the number of sequences retrieved? Answer: Decrease.
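Both effects can be checked on a toy data set before searching Swiss-Prot. In the sketch below the four sequences and their names are hypothetical, made up so that each pattern change alters the hit list; the patterns follow the lowercase style of the example in Step Three and are matched case-insensitively.

```python
import re

# Hypothetical toy sequences, invented for this illustration.
sequences = {
    "seq1": "MLKAGDEFRT",
    "seq2": "MVKAGNEFRT",
    "seq3": "MAKAGDEFRT",
    "seq4": "MLKAGDEFKT",
}

def hits(pattern):
    """Return the names of the sequences that contain a match for the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(name for name, seq in sequences.items() if regex.search(seq))

narrow_group = hits("[l]..[g]...[r]")       # one residue allowed at the first position
wide_group = hits("[liv]..[g]...[r]")       # more members in the substitution group
extra_group = hits("[liv]..[g][d]..[r]")    # one wildcard replaced by a new group

print("narrow group:", narrow_group)
print("wide group:  ", wide_group)
print("extra group: ", extra_group)
```

Widening the first group from [l] to [liv] adds seq2 to the hits, while turning a wildcard into the extra group [d] removes it again.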
Step Four. Choose the three patterns that seem to work best for you. Each of them should retrieve fewer than 150 total sequences from Swiss-Prot, and at least 50 true positives. Report the number of true positives, false positives, and false negatives for each of the three motifs.
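The bookkeeping for this step can be sketched as a comparison of two sets: the identifiers retrieved by a pattern and the identifiers of the known family members from Step One. All identifiers below are hypothetical placeholders, and the thresholds in `meets_step_four` are the ones stated in Step Four.

```python
def classify_hits(hit_ids, family_ids):
    """Count true positives, false positives and false negatives for one pattern."""
    hits, family = set(hit_ids), set(family_ids)
    return {
        "TP": len(hits & family),    # retrieved and in the family
        "FP": len(hits - family),    # retrieved but not in the family
        "FN": len(family - hits),    # in the family but missed
    }

def meets_step_four(counts, max_total=150, min_tp=50):
    """Check the Step Four limits: fewer than 150 hits in total, at least 50 true positives."""
    total = counts["TP"] + counts["FP"]
    return total < max_total and counts["TP"] >= min_tp

# Hypothetical identifiers: 60 family members and a pattern that finds 55 of them
# plus 4 unrelated sequences.
family = [f"FAMILY_{i}" for i in range(60)]
retrieved = family[:55] + [f"OTHER_{i}" for i in range(4)]

counts = classify_hits(retrieved, family)
print(counts, "acceptable:", meets_step_four(counts))
```

In practice the two lists would come from the Step One retrieval and from the pattern search results.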
A Possible Answer:
[g].......[f]..[h][p].....[f] 68 TP, 17 FP, 85 TOTAL
[g]......[l][f]..[h][py].....[f] 66 TP, 11 FP, 76 TOTAL
[g]......[l][f]..[h][py]....[k][f] 61 TP, 0 FP, 61 TOTAL
[h][g]...[l]..[l].........[k][f] 61 TP, 0 FP, 61 TOTAL
From the other region:
[k][k][k].............[h].....[i] 63 TP, 5 FP, 68 TOTAL