Minimotif Miner v4.0 Guide
There are thousands of short, contiguous sequences that encode critical molecular functions in proteins. Minimotif Miner (MnM) scans a protein sequence for the presence of these sequences, or minimotifs, which have been experimentally confirmed in one or more proteins. This is done by comparison against the MnM database, which includes structured annotations of the key functional details of such minimotifs from the literature. MnM defines three distinct classes of such functions: (1) binding to other proteins, nucleic acids, or small molecules, (2) post-translational modification of the minimotif by an enzyme or chemical, and (3) protein trafficking. We use the term minimotif to distinguish short, functional, contiguous peptides from other types of motifs (e.g., DNA motifs, structural motifs).
Typical results of MnM predict more than 50 new minimotifs for a protein query. A major limitation in this type of analysis is that the low sequence complexity of short minimotifs produces false positive predictions where the sequence occurs in a protein by random chance and not because it contains the predicted function. MnM 3.0 introduces a library of advanced heuristics and filters, which enable vast reduction of false positive predictions. These filters use minimotif complexity, protein surface location, molecular processes, cellular processes, protein-protein interactions, and genetic interactions. We recently combined all of these heuristics into a single, compound filter which makes significant progress toward solving this problem with high accuracy of minimotif prediction as measured by a performance benchmarking study which evaluated both sensitivity and specificity.
Please cite both of these papers:
Rajasekaran et al., 2009 Nucleic Acids Res. 37:D185-190 PMID: 18978034
Balla et al., 2006 Nat. Methods 3:175-177 PMID: 16489333
2. Minimotif syntax and glossary
Minimotifs are much too complex to describe using sequences alone. An important aspect of MnM is its highly structured model of protein function. We define minimotifs in terms of their relationship to three discrete data types: a source protein, a target molecule, and an activity relationship between them (Figure 1). These comprise a triplet that unambiguously conveys function in a structured manner and is used by many of MnM's algorithms. Each of these elements has a set of attributes (yellow ovals and purple squares). Each discretized triplet also has attributes that can be assigned to the triplet as a whole (orange boxes). More information on the syntax model and an example of new types of analyses enabled by this model can be found in Vyas et al. 2009 BMC Genomics 10:360 PMID: 19656396.
Elements of the minimotif syntax model
Discretized triplet - the basic structure of a minimotif is a definition of the chemistry of the minimotif (its protein sequence and modifications) and a function. These are structured into a 3-unit entity composed of the minimotif in its source protein, an activity, and a target of the minimotif.
Example in Figure 2:
Affinity - a measure of the strength of the interaction between the motif and its target. This is an attribute of the discretized triplet.
Structure - the three dimensional arrangement of atoms for a minimotif bound to its target. This is an attribute of the discretized triplet.
Reference - the publication that reported discovery of the minimotif. In most cases this is a PubMed ID. This is an attribute of the discretized triplet.
Experiment - the evidence that supports the identification of the minimotif. Most minimotifs are supported by more than one type of experiment. This is an attribute of the discretized triplet.
Source Protein - the protein that contains the minimotif sequence.
Source Protein Type - minimotif sequences are presented as instances, which are exact sequences that occur in the source protein, or as consensus sequences, which are an interpretation of multiple instances and indicate degeneracy.
Peptide or Protein - minimotifs can be studied as short peptides or in full-length proteins; this attribute indicates which was used.
Source Protein Name - the name of the protein that contains the minimotif
Source Protein Accession - a unique combination of letters and numbers assigned to the source protein in a database such as RefSeq.
Source Protein Residue - the location of the minimotif start (N-terminal) amino acid in the source protein accession number
Motif Modification (M-Mod) - a covalent change to the minimotif sequence.
M-Mod Residue - the minimotif amino acid that is covalently changed.
M-Mod Position - the location of the minimotif amino acid that is covalently changed in the source protein accession number.
M-Mod Type - the name of the chemical modification of the minimotif amino acid that is covalently changed.
M-Mod Type Code - a code from the Psi-Mod database that indicates the type of modification to the minimotif. If a code for the chemical change did not exist then an internal code is assigned.
Activity - all minimotifs, and likely cellular activities can be characterized into binding, modifying and trafficking. Modified indicates a covalent change to at least one amino acid in the minimotif. For some minimotifs, an activity may be known, but the target is not known. In this case, we use 'requires' for the activity as the minimotif is required for the activity.
Activity Subclass - a more specific phrase that describes the activity. For example "phosphorylates" is a subclass of the activity "modifies" (in this case, the motif modifies, by phosphorylation, a protein target).
Activity Code - an identifier from the Gene Ontology database that is annotated for the Activity Subclass.
Activity Modification (A-Mod) - an enzymatic modification of a minimotif site. When the target is an enzyme that covalently changes the minimotif sequence, the activity modification describes that change.
A-Mod Residue - the minimotif amino acid that is covalently changed.
A-Mod Position - the location of the minimotif amino acid that is covalently changed in the source protein accession number.
A-Mod Type - the name of the chemical modification of the minimotif amino acid that is covalently changed.
A-Mod Type Code - a code from the Psi-Mod database that indicates the type of modification to the minimotif. If a code for the chemical change did not exist then an internal code is assigned.
Target - the protein or other molecule that acts upon the minimotif.
Target Name - the molecule that acts upon the minimotif sequence.
Target Accession - a unique combination of letters and numbers assigned to the target, if the target is a protein.
Target Domain - when a target is a protein, the target domain indicates the domain that interacts with or modifies the minimotif.
Target Multidomain - when a target protein has more than one domain of the same type, the multidomain number indicates the relative position from the N-terminus.
Target Cell Localization - the subcellular location in a cell involved in the minimotif-mediated trafficking.
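For readers who want to handle minimotif annotations in their own scripts, the discretized triplet described above can be sketched as a small set of Python data classes. This is only an illustration of the source/activity/target model, not the MnM database schema: the class names, the field names, the spelling of the activity labels, and the example values are all assumptions made for this sketch, and only a few of the attributes from the glossary are included.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Activity labels are assumed spellings, chosen for illustration only.
ACTIVITIES = {"binds", "modifies", "traffics", "requires"}


@dataclass
class SourceProtein:
    name: str
    accession: str          # e.g. a RefSeq accession
    motif_sequence: str     # instance or consensus sequence
    start_residue: int      # position of the N-terminal motif residue
    is_consensus: bool = False


@dataclass
class Target:
    name: str
    accession: Optional[str] = None  # only meaningful for protein targets
    domain: Optional[str] = None


@dataclass
class Minimotif:
    source: SourceProtein
    activity: str
    target: Optional[Target] = None
    activity_subclass: Optional[str] = None
    references: List[str] = field(default_factory=list)  # e.g. PubMed IDs

    def __post_init__(self) -> None:
        if self.activity not in ACTIVITIES:
            raise ValueError(f"unknown activity: {self.activity!r}")
        # 'requires' is the activity used when the target is not known.
        if self.target is None and self.activity != "requires":
            raise ValueError("a target is needed unless the activity is 'requires'")


# Placeholder values, not a real database record.
example = Minimotif(
    source=SourceProtein("ExampleSource", "NP_000000", "PXLP", 42),
    activity="binds",
    target=Target("ExampleTarget", domain="ExampleDomain"),
)
```

Keeping the three elements as separate objects mirrors the glossary, where each element carries its own attributes and the triplet carries attributes such as references.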
3. How to perform a basic query search?
This page is used to enter a protein query to search the MnM database (Figure 3).
Protein name or accession number box: Enter a RefSeq accession number of a protein or the protein name. The RefSeq numbers start with (NP_) and can be retrieved using text entries from HomoloGene. When a protein name is entered, you will be able to choose the species from a list of search results. Select the SEARCH MOTIF Button to initiate a search.
Protein sequence box: If the accession number is not available, or if you wish to analyze a novel protein sequence, then paste the complete protein sequence into the text box below. All spaces in the actual sequence must be deleted before the sequence is submitted. This is also useful for analyzing a segment of a protein. If a protein sequence is entered, the Blast it checkbox should be selected so that other information can be cross-referenced in the analysis. Select the SEARCH MOTIF button to initiate a search.
Species box: Sequences can be analyzed with several proteomes. Use the pull-down menu to choose the species (human, mouse, fly, watercress, rice, or yeast) from which the specific protein sequence or accession number was derived. This allows correct display of the statistics in the results.
Sort Results Checkbox: The default mode is checked and will set parameters to score and rank order output. Analysis of a test data set shows that scores above 0.24 have ~2% false positives and recover >60% of true positives. This provides the best analysis for selecting minimotifs. If the box is unchecked, the user can use different types of information to remove false positives, but these approaches are less robust than the default filter setting.
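Before pasting input into the query page, it can help to check it locally. The sketch below shows one way to recognize a RefSeq protein accession of the NP_ form and to remove spaces and line breaks from a pasted sequence. The function names, the accession pattern (digits with an optional version suffix), and the restriction to the 20 standard one-letter amino acid codes are assumptions of this sketch, not rules taken from the MnM server, and protein names are not handled.

```python
import re

# RefSeq protein accessions of the NP_ form, with an optional version suffix.
REFSEQ_PROTEIN = re.compile(r"^NP_\d+(\.\d+)?$")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_refseq_protein(text: str) -> bool:
    """Return True if the text looks like an NP_ accession."""
    return bool(REFSEQ_PROTEIN.match(text.strip()))


def clean_sequence(text: str) -> str:
    """Remove all whitespace, upper-case the sequence, and reject other symbols."""
    sequence = re.sub(r"\s+", "", text).upper()
    unexpected = sorted(set(sequence) - AMINO_ACIDS)
    if not sequence or unexpected:
        raise ValueError(f"not a plain protein sequence: {unexpected}")
    return sequence
```

For example, is_refseq_protein("NP_000302") is true, while a sequence pasted with spaces or line breaks should first be passed through clean_sequence.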
4. How to perform a batch query search?
Some scientists may want to analyze many protein sequences at once. We have enabled this type of workflow as an email service through the batch query input mode on the MnM input page. The input file for the request must contain a list of protein accession numbers from one or more data sources (UniProtKB, MIM, RefSeq, Ensembl, UniGene, PIR, Entrez Gene) and/or protein sequences; the format is indicated in a hyperlink in this section of the input page. This feature was introduced in Minimotif Miner 3.
5. Interpreting results and example analysis
Summary: Multiple filters have been introduced since the release of Minimotif Miner, the most important being a combined filtering approach that can result in 90% accuracy of minimotif prediction with few false positives. The score used by this combined filter is the basis of the default ranking of the minimotif results list. In the minimotif results table, minimotifs with scores above the threshold of 0.91 are highlighted green (these produce no false positives on a test dataset), and those with scores between 0.24 and 0.90 are highlighted yellow (these produce high recovery of true minimotifs with only 2% false positives on a test dataset). Experimentally validated minimotifs are distinguished from predictions by blue highlighting in the results table. Minimotifs that lack other information or have scores of 0.23 or less may still be valid, but these motifs, as a whole, are less likely to be confirmed.
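When results are exported and processed outside the website, the highlighting rules in the summary can be reproduced with a few lines of Python. The thresholds 0.91 and 0.24 come from the text; the function name, the treatment of scores that fall exactly on a threshold, and the choice to return None for rows that are not highlighted are assumptions of this sketch.

```python
from typing import Optional

GREEN_THRESHOLD = 0.91   # no false positives on the test dataset
YELLOW_THRESHOLD = 0.24  # about 2% false positives on the test dataset


def highlight_color(score: Optional[float],
                    experimentally_validated: bool = False) -> Optional[str]:
    """Return the highlight color for one row of the results table."""
    if experimentally_validated:
        return "blue"
    if score is None:
        return None  # no combined filter score available
    if score >= GREEN_THRESHOLD:
        return "green"
    if score >= YELLOW_THRESHOLD:
        return "yellow"
    return None
```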
We have analyzed Prion protein (NP_000302) as an example. The results are presented in a Protein Sequence Window, Minimotif Results Table, and Protein Details Table. A menu can be used to interact with these windows to perform additional analyses and refine the search results.
Minimotif Results Table
This table is organized with each row containing information about a minimotif in the protein (Figure 4). The best scoring minimotifs are ranked by a scoring metric depending on the filtering approach selected. Elements of the table are:
Motif Expression A list of the minimotifs that were identified in the protein query. Each expression is hyperlinked to the PubMed abstract for the paper describing the discovery of the minimotif.
Annotation A short standardized description about the known function of the minimotif. See Vyas et al., 2009 BMC Genomics, for more information about the syntax.
Occurrence Position Identifies the amino acid number where the minimotif begins in the protein. Multiple numbers indicate that the minimotif occurs more than once and give the location of each occurrence in the protein.
Combined Filter Score (not in all views) A score that combines a number of algorithms (see section 7) to produce our best filter algorithm. On test datasets, scores above 0.91 produce no false positives and correctly predict 39% of true positives. Scores above 0.24 produce ~2% false positives and correctly predict >60% of true positives.
Frequency Score (not in all views) A score for an algorithm (see section 7) that scores and ranks motifs based on their complexity. Scores above 5 on a test data set were more likely to be a true positive.
Surface Prediction Score (not in all views) A score for an algorithm (see section 7) that scores motifs based on the probability that they are located on the surface of a protein. Scores range between 0 and 1, with higher scores reflecting a higher probability of being on the protein surface.
Number in Proteome The number of times a minimotif is observed in a proteome selected on the input page.
Protein Sequence Display Window and Menu
The protein sequence of the query is shown in this window (Figure 5). The Menu items can be used to manipulate this display. Hovering the mouse over any highlighted item will reveal a popup bubble with additional information. Functions for the menu selections are described:
New Search This button returns to the query input page.
Reset Current Search This button resets any changes to the original results page.
View SNPs This button will capitalize and highlight in blue all polymorphisms in this gene that have been entered into the SNP database. Move and hold the mouse over a highlighted sequence to view polymorphism information (see dbSNP database). Selecting any SNP will change the protein sequence and highlight the change in green.
View selected motif All minimotifs found are highlighted magenta in the protein sequence window. The view motif button will highlight in magenta any single minimotif selected in the Minimotif Results Table. Hover the mouse over any magenta highlighted sequence to view minimotif information.
View motifs from new SNPs Once one or more SNPs are selected from the Protein Sequence Display and are highlighted this function can be used to create a new SNP display table that shows all minimotifs introduced by the SNPs (colored green) and all minimotifs that would be eliminated by the SNP (colored red). Section 8 describes how this approach can be used to develop new hypotheses for the causes of disease.
View homologous proteins This button reveals Minimotif Results Table and ranks minimotifs with the most evolutionarily conserved minimotifs on top. Homologous proteins are retrieved from the HomoloGene cluster to which the query protein is assigned. Other minimotif metrics presented in this table are described in the View Advanced Table below.
View domains This button will color domains with alternating colors of cyan, yellow and green to indicate the different domains. Domains are as defined in the RefSeq database. Move and hover the mouse over a highlighted domain to view information.
View advanced table This table presents more detailed metrics for the minimotif frequency score.
Expected count in proteome: The number of times the minimotif is predicted to occur in the human proteome. To calculate this, proteomes are downloaded from the RefSeq database and the frequency of each amino acid in the proteome is calculated from the equation (AAX/AATOT) where AAX is the total number of each amino acid in the proteome and AATOT is the total number of amino acids in the proteome, which is 13,715,901 for the human proteome (RefSeq Vs. 4). Minimotif frequencies can be calculated by multiplying individual amino acid frequencies.
For example, if the frequencies of Pro and Leu in the human proteome are 0.062 and 0.099, respectively, then the probability of the PXLP minimotif is 0.062*0.099*0.062 or 3.8 x 10^-4. The inverse of this number indicates that this minimotif should be observed approximately once every 2,630 amino acids if the proteome were a random distribution of amino acids. When minimotifs are located at the N- or C-terminus, the probability is also based on the total number of proteins in the proteome. For example, the probability of the same PXLP> minimotif being on the C-terminus (> indicates the C-terminus) would be 3.8 x 10^-4 / 27,418 genes in the human proteome = 1.4 x 10^-8, and thus it would only be expected to occur about once every 70 million amino acids in the human proteome.
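The arithmetic in this example can be checked with a short script. The sketch below multiplies the frequencies of the non-wildcard residues, using the Pro and Leu frequencies and the count of 27,418 quoted above; the function name and the use of X as the wildcard symbol are assumptions of the sketch, and the frequency table holds only the two residues needed here.

```python
def motif_probability(motif: str, aa_freq: dict, wildcard: str = "X") -> float:
    """Multiply the proteome frequencies of all non-wildcard residues."""
    probability = 1.0
    for residue in motif:
        if residue != wildcard:
            probability *= aa_freq[residue]
    return probability


# Frequencies quoted in the text for the human proteome.
aa_freq = {"P": 0.062, "L": 0.099}

p_internal = motif_probability("PXLP", aa_freq)  # 0.062 * 0.099 * 0.062
spacing = 1.0 / p_internal                       # amino acids per expected hit

# Terminal motif: the text divides by the number of entries in the proteome.
p_c_terminal = p_internal / 27418

print(f"{p_internal:.2e} {spacing:.0f} {p_c_terminal:.2e}")
```

Wildcard positions contribute a factor of 1 because any residue is allowed there, which is why only three frequencies are multiplied for the four-residue PXLP motif.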
Actual number in protein The number of times a minimotif is observed in the protein query.
Frequency Score in protein is calculated by dividing Actual count/Expected count.
Expected count in proteome The number of times a minimotif is predicted to occur in the proteome selected on the query page.
Actual number in proteome The number of times a minimotif is observed in the proteome selected on the query page.
Enrichment Factor in Proteome is calculated by dividing Actual count in proteome/Expected count in proteome. A value of 1.00 indicates the minimotif is observed at its predicted frequency; numbers above 1 indicate minimotif enrichment; numbers below 1 indicate that the minimotif is under-represented.
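The actual counts and the actual/expected ratios in the advanced table can be approximated offline as follows. This is a rough sketch rather than the MnM implementation: it assumes X is the wildcard, that overlapping occurrences are counted, and that the expected count is supplied by the caller. The peptide used in the example is an invented toy sequence.

```python
import re


def motif_regex(motif: str, wildcard: str = "X") -> str:
    """Translate a simple motif into a regular expression ('X' matches any residue)."""
    return "".join("." if c == wildcard else re.escape(c) for c in motif)


def count_occurrences(motif: str, sequence: str, wildcard: str = "X") -> int:
    """Count motif start positions, including overlapping ones, via a lookahead."""
    return len(re.findall(f"(?={motif_regex(motif, wildcard)})", sequence))


def actual_over_expected(actual: float, expected: float) -> float:
    """Ratio used for both the Frequency Score and the Enrichment Factor."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return actual / expected


toy_sequence = "APALPGGPPLPK"                     # invented peptide
actual = count_occurrences("PXLP", toy_sequence)  # matches PALP and PPLP
```

A ratio of 1 means the motif is seen as often as predicted; the same function applies whether the counts refer to a single protein or to a whole proteome.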
This function can be used to refine MnM predictions or select for specific types or properties of minimotifs. See section 7 for more detail.
This will download the identified minimotifs into an Excel file.
A link to this user guide.
A link to a set of video tutorials on how to use different parts of MnM.
Protein Details Window
This table contains information about links to other databases for the protein query. Protein details will not be identified if a sequence was entered in the text window. Try to use RefSeq numbers for best results. An example is shown in Figure 6.
Protein Accession Number A unique identifier for the protein query from the RefSeq database.
Protein Name of protein query from the RefSeq record.
Aliases Other names for the protein or gene.
Official symbol Name of gene identified by entered accession number.
Protein length Number of amino acids in protein.
Taxon ID A unique number assigned to a species in the Taxonomy database.
Species The name of the organism associated with the Taxon ID.
Chromosome Chromosome number where the gene is located.
Gene ID The gene number from the NCBI database.
Locus ID The LocusLink number identifier from the NCBI database.
OMIM ID Annotation of the gene entered in the OMIM database.
Unigene ID ID for the entry in the UniGene database.
Swiss-Prot ID Identification number of analogous protein in the Swiss-Prot database.
Map Location of gene on chromosome.
Source Accession number for the nucleotide sequence from which the protein sequence is derived.
6. How to choose the right minimotif false positive filters?
You have just performed an analysis with MnM and, while you may be pleased with the amount of new information about your protein, you must now choose which minimotifs you will pursue with experimentation. Some minimotifs may have very strong biological relationships to your protein and this is a good place to start. To help you choose, we have built a series of filters based on other information about proteins. Each filter can be selected or deselected from the Filtering menu item at the bottom of the Protein Sequence Display on the Results page. We recommend the combined filter, which has been tested to produce the fewest false positives and is set as the default filter. The basis of each scoring metric and filter is described below:
Filter 1: Attribute Selection
Rationale - In many cases a user would like to restrict the search for minimotifs to a particular attribute. This function allows the user to include or exclude motifs that meet a set of criteria. These include whether a minimotif is involved in binding, post-translational modification, or does not have a target (requires). Activity subclasses can be restricted to trafficking minimotifs, phosphorylation minimotifs, or protease minimotifs. Finally, the results can be restricted to instances where a minimotif is known in a specific protein, or to consensus sequences, which are interpretations of multiple instances.
Publication - Rajasekaran et al., 2009 Nucleic Acids Res. 37:D185-190 PMID: 18978034
Filter 2: Frequency Score (FS)
Rationale - Scoring identifies motifs that are highly over-represented in a protein, including cases where a protein has multiple occurrences of the same motif.
Calculation - The Frequency Score (FS) measures the relative occurrence of motifs in the protein query with respect to the entire proteome. This is simply the frequency of the motif in the protein query divided by the frequency of the motif in the entire proteome. Frequencies are calculated based on the amino acid composition of the motif and the proteome. Scores represent the over-representation or under-representation of a motif, with a score of 1 indicating the motif is observed at its predicted frequency.
Validation - Analysis of over 2300 validated motifs annotated in the SwissProt database shows that the FS score is globally significant when compared to analysis with a randomized motif database. Analysis of a test dataset with thousands of true positives and true negatives demonstrates good performance of the frequency filter. See Rajasekaran et al., 2010.
Limitations - Complexity of motif strongly influences FS score. Assumes a random sequence in proteome. Motif identified may be buried and not accessible for function. Scores are not calculated considering that some motifs may be specific for certain subcellular or extra-cellular compartment. (This can be partially addressed by choosing organelles on the input page). Furthermore, the motif definition, which often varies between references, can influence the motif score.
Publication - Balla et al., 2006 Nat. Methods 3:175-177 PMID: 16489333; Rajasekaran et al., 2010 PLoS One 19:pii e12276 PMID: 20808856
Filter 3: Surface Prediction Score (SPS)
Rationale - In order to be functional, the motif, or at least a part of it, must be exposed to solvent. Thus, motif predictions that are buried are likely false positives. Scores range from 0 to 1, with the highest score preferred.
Calculation - The SPS score is calculated assuming a two state model (buried and exposed) and each amino acid of the motif gets a fraction score (between 0 and 1) representing the probability of being in that state. For each non-wildcard residue of the motif, the greater of the two probabilities is considered and a normalized probability of the motif being in the exposed state is calculated as the SPS score of the motif.
Validation - This surface prediction algorithm has a 75% accuracy in prediction of surface residues in 215 proteins with known crystal structures (Naderi-Manesh et al., 2001 Proteins 32:452-59 PMID: 11170200).
Limitations - Just because a motif is on the surface does not imply function. Some motifs may require a specific structure. Furthermore, this algorithm has a ~75% prediction accuracy for individual amino acids, and thus will inherently predict some false positive surface residues.
Publication - Balla et al., 2006 Nat. Methods 3:175-177 PMID: 16489333
Filter 4: Evolutionary Conservation
Rationale - Minimotifs are often conserved in related species, so determining those minimotifs that are conserved can help to eliminate false positives. Alternatively, this function can be used to identify those minimotifs that are unique to a species.
Calculation - We originally reported an ECS score, but after user feedback now provide just a conservation percentage of the minimotif in a table that can be revealed by selecting the VIEW: View homologous proteins menu item at the bottom of the Protein Sequence Display window. Rows in the Minimotif Results table are ranked with the most evolutionarily conserved minimotifs on top. Homologous proteins are retrieved from the HomoloGene cluster to which the query protein is assigned.
Publication - Balla et al., 2006 Nat. Methods 3:175-177 PMID: 16489333
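A conservation percentage of the kind described for this filter can be illustrated with a simple calculation: the share of homologous sequences in which the minimotif is found. The sketch below is only one possible reading. It assumes that a motif counts as conserved when it occurs anywhere in a homolog (no alignment or position check), that X is the wildcard, and it uses invented toy sequences; how MnM actually determines conservation is not specified here.

```python
import re
from typing import Iterable


def conservation_percentage(motif: str, homolog_sequences: Iterable[str],
                            wildcard: str = "X") -> float:
    """Percentage of homolog sequences that contain the motif at least once."""
    pattern = re.compile(
        "".join("." if c == wildcard else re.escape(c) for c in motif)
    )
    sequences = list(homolog_sequences)
    if not sequences:
        return 0.0
    hits = sum(1 for sequence in sequences if pattern.search(sequence))
    return 100.0 * hits / len(sequences)


# Invented toy homologs: the first two contain a PXLP match, the last two do not.
toy_homologs = ["AAPALPAA", "GGPSLPGG", "AAAAAAAA", "KKPALAKK"]
percent = conservation_percentage("PXLP", toy_homologs)
```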
Filter 5: Protein-Protein Interaction Scores (PPIS)
Rationale - The protein-protein interaction (PPI) filter has two principal uses. First, since MnM predicts many PPIs, the minimotif source and target may have an interaction already identified in one of the PPI databases; this can be used to remove false positive minimotifs. Second, if one is working on a known PPI, MnM can be used to identify minimotifs that mediate the PPI, providing a mechanistic basis to better understand the PPI.
Calculation - There are three types of PPI filters implemented in MnM. The PPI filter identifies those motifs where the Source and Target have a previously known PPI in one of several PPI databases including DIP, EntrezGene, HPRD, IntAct, MINT, and Virus-MINT. The PPI-HomoloGene filter uses a PPI in one species to predict a PPI between orthologs of other species. Likewise, the Extended PPI filter uses the same approach, but by using different BLAST thresholds one can choose the stringency of the homologs.
Validation - The filters were tested on datasets with 1000s of true positives and true negatives. The PPI filter had the highest accuracy recovering ~61% of the true positives and 2% false positives; this is the best performing PPI filter. The PPI-HomoloGene filter yielded a slightly better recovery of ~64% true positives, but a much poorer selectivity with ~15% false positives. This algorithm should only be used if one is interested in predicting minimotifs that have sources and targets with PPIs identified in other species. The PPI similarity algorithm has a number of different thresholds that performed at different levels.
Limitations - In our analysis of many PPI databases, we found minimal overlap of about 15%, which is consistent with literature reports that these databases are not comprehensive and that we may only know of ~10% of all PPIs. Therefore, true positives may be eliminated. Furthermore, the annotation of minimotifs is likely to be biased toward data in PPI databases. The accuracy will likely be lower than that reported in our test data set analysis.
Publication - Rajasekaran et al., 2010 Proteins 79:153-164 PMID: 20938975
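The basic PPI filter amounts to a lookup: a predicted minimotif is kept only when its source and target already appear as an interacting pair. A minimal sketch of that idea is shown below. The dictionary keys, the identifiers, and the decision to treat interactions as unordered pairs are assumptions made for illustration, and the toy interaction list stands in for the PPI databases named above.

```python
from typing import Dict, Iterable, List, Tuple


def ppi_filter(predictions: Iterable[Dict[str, str]],
               known_ppis: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Keep predictions whose source/target pair is a known interaction."""
    # frozenset makes the lookup independent of the order within a pair.
    known = {frozenset(pair) for pair in known_ppis}
    return [
        prediction for prediction in predictions
        if frozenset((prediction["source"], prediction["target"])) in known
    ]


# Hypothetical identifiers, not real database entries.
toy_ppis = [("ProtA", "ProtB"), ("ProtC", "ProtD")]
toy_predictions = [
    {"motif": "PXLP", "source": "ProtB", "target": "ProtA"},
    {"motif": "RXXS", "source": "ProtA", "target": "ProtD"},
]
kept = ppi_filter(toy_predictions, toy_ppis)
```

The PPI-HomoloGene and Extended PPI variants would differ only in how the set of known pairs is built, for example by first mapping each protein to its orthologs.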
Filter 6: Molecular Function Scores (MFS)
Rationale - One of the most difficult problems for many users is selecting minimotifs that are biologically relevant to the protein query. There are so many functions in the cell and the user may not know of many of these functions and how they relate to their protein of interest. We now provide a Molecular Function filter, which helps with this problem, as well as eliminating some false positives. This filter works well for identifying minimotif targets that are involved in the same or different molecular functions. For example, if the query protein is acetylated, a user may want to only look for minimotif predictions for other proteins with acetylation activity. Alternatively, the user may only be interested in potential new activities.
Calculation - The minimotif Source/Target pair is searched against the Gene Ontology (GO) database to determine if the two proteins have a related molecular function. Alternatively, the user can choose to exclude these and identify those ones suggesting different molecular functions. Since the GO database has a hierarchical directed graph relationship between cell processes, we can also determine if a minimotif Source/Target pair is related not through their most immediate molecular function (distance = 0), but through the parent, grandparent molecular functions and so forth (distances = 1,2,3,4,5). These distances can be selected on the MnM website in the filter motifs menu item.
Validation - The filters (with different distances) were tested on datasets with 1000s of true positive and true negative minimotifs. The distance =1 filter had the highest accuracy recovering ~59% of the true positives and 21% false positives; this is the best performing Molecular Function filter. As expected, the distance = 0 filter yielded fewer true positives (29%) and fewer false positives (12%), while the longer distances retrieved more true positives at the cost of more false positives. This algorithm should be used if one is interested in predicting minimotifs where one wants to identify pairs with the same or different molecular functions. For different molecular functions, select exclude from the motif filtering menu item.
Limitations - The prediction of minimotifs where the target and sources are involved in the same molecular function relies on annotations in the GO database which is growing.
Publication - Rajasekaran et al., 2010 PLoS One 19:pii e12276 PMID: 20808856
Filter 7: Cellular Function Scores (CFS)
Rationale - One of the most difficult problems for many users is selecting minimotifs that are biologically relevant to the protein query. There are so many functions in the cell and the user may not know of many of these functions and how they relate to their protein of interest. We now provide a Cell Function filter, which helps with this problem, as well as eliminating some false positives. This filter works well for identifying minimotif targets that are involved in the same or different cell processes. For example, if the query protein is involved in cell division, a user may want to only look for minimotif predictions for other proteins involved in cell division or may want to identify predictions that are involved in other cellular processes.
Calculation - The minimotif Source/Target pair is searched against the Gene Ontology (GO) database to determine if the two proteins are in the same cell process. Alternatively, the user can choose to exclude these and identify those involved in different cell processes. Since the GO database has a hierarchical directed graph relationship between cell processes, we can also determine if a minimotif Source/Target pair is related not through their most immediate cell process (distance = 0), but through the parent, grandparent processes and so forth (distances = 1,2,3,4,5). These distances can be selected on the MnM website in the filter motifs menu item.
Validation - The filters (with different distances) were tested on datasets with 1000s of true positive and true negative minimotifs. The distance = 1 filter had the highest accuracy, recovering ~26% of the true positives with 6% false positives; this is the best performing Cell Function filter. As expected, the distance = 0 filter yielded fewer true positives (11%) and fewer false positives (3%), while the longer distances retrieved more true positives at the cost of more false positives. This algorithm should be used if one wants to identify pairs with the same or different cell processes. For different cell processes, select exclude from the motif filtering menu item.
Limitations - The prediction of minimotifs where the target and source are involved in the same cell process relies on annotations in the GO database, which is growing.
Publication - Rajasekaran et al., 2010 PLoS One 19:pii e12276 PMID: 20808856
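The distance setting used by the Molecular Function and Cell Function filters can be pictured as walking up the ontology graph from each protein's annotated terms. The sketch below is one plausible reading, not the published algorithm: it assumes that distance = 0 means the source and target share a directly annotated term, and that distance = d means they share a term once each annotation set has been extended by up to d levels of parent terms. The term names and the tiny ontology are invented for illustration.

```python
from typing import Dict, Iterable, Set


def expand_terms(terms: Iterable[str], parents: Dict[str, Set[str]],
                 distance: int) -> Set[str]:
    """Return the terms plus all ancestors reachable within `distance` steps."""
    seen = set(terms)
    frontier = set(terms)
    for _ in range(distance):
        # Move one level up the graph, skipping terms already collected.
        frontier = {p for term in frontier for p in parents.get(term, ())} - seen
        seen |= frontier
    return seen


def related_within(source_terms: Iterable[str], target_terms: Iterable[str],
                   parents: Dict[str, Set[str]], distance: int) -> bool:
    """True if the two annotation sets overlap after expanding by `distance`."""
    return bool(expand_terms(source_terms, parents, distance)
                & expand_terms(target_terms, parents, distance))


# Invented toy ontology: child term -> set of parent terms.
toy_parents = {
    "kinase activity": {"transferase activity"},
    "transferase activity": {"catalytic activity"},
    "phosphatase activity": {"hydrolase activity"},
    "hydrolase activity": {"catalytic activity"},
}
source_terms = {"kinase activity"}
target_terms = {"phosphatase activity"}
```

With this toy graph the two proteins are unrelated at distances 0 and 1 and become related at distance 2, which illustrates why longer distances recover more pairs at the cost of more false positives.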
Filter 8: Genetic Interaction Scores (GIS)
Rationale - There are many examples of a genetic interaction between a protein containing a minimotif and a target protein. Since these proteins already have a functional relationship, we expect that a minimotif that predicts this relationship is more likely to be a true-positive prediction. This also provides a means of explaining the functional connection between two genes with a known genetic interaction.
Calculation - There are three algorithms listed in the Motif Filtering menu that use genetic interactions to reduce false positives. The Genetic Interaction filter identifies those motifs where the Source and Target have a previously known genetic interaction in one of several genetic interaction databases. The GI-HomoloGene filter uses a genetic interaction in one species to predict a genetic interaction between orthologs of other species. The GI-node filter determines if a minimotif Source/Target pair is related not through their most immediate genetic interaction (distance = 0), but through the parent, grandparent genetic interactions of the source and target, and so forth (distances = 1,2,3). These distances can be selected on the MnM website in the filter motifs menu item.
Validation - The filters were tested on datasets with 1000s of true positives and true negatives. The Genetic Interaction filter had the highest accuracy, recovering ~21% of the true positives with 3% false positives; this is the best performing genetic interaction filter. The GI-HomoloGene filter yielded a slightly better recovery of ~24% true positives, but a much poorer selectivity with ~12% false positives. This algorithm should only be used if one is interested in predicting minimotifs that have sources and targets with GIs identified in other species. The GI-node algorithm had a higher sensitivity.
Limitations - The data for genetic interactions are relatively sparse, with yeast having the highest density. For this reason, true positives can be eliminated.
Publication - submitted
Filter 9: The Combined Filter
Rationale - Each individual filter reduces false-positive predictions, but no filter is sufficient by itself. Since many of the data types used by the filters are independent of each other, we expected some of this information to be orthogonal, so that a filter using multiple pieces of information should reduce false positives better than any individual filter. Because the combined filter is the best performing thus far, it is set as the default filter on the MnM website. On a test data set, scores above 0.24 produced very few false positives. We recommend this filter and threshold score. To meet this criterion, choose the minimotifs in the table that are colored green or yellow.
Calculation - We used a linear regression model that includes most of the above filters, and estimated its coefficients by training the combined filter on a dataset containing thousands of true positive and true negative minimotifs. This produces a single score for each minimotif that can be used to rank-order them.
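The idea of a single linear score can be sketched as follows. The filter names, coefficient values, and intercept are hypothetical placeholders chosen for illustration; they are not the coefficients obtained by training the MnM combined filter, which are not given in this guide.

```python
# Hypothetical coefficients; the real values come from training on
# true-positive and true-negative minimotifs and are not given here.
COEFFICIENTS = {
    "ppi": 0.40,
    "genetic_interaction": 0.25,
    "cell_process": 0.20,
    "surface": 0.10,
}
INTERCEPT = 0.05

def combined_score(filter_values):
    """Linear combination of per-filter values (missing filters count as 0)."""
    return INTERCEPT + sum(
        coefficient * filter_values.get(name, 0.0)
        for name, coefficient in COEFFICIENTS.items()
    )

def rank_minimotifs(predictions):
    """Return (motif_id, score) pairs sorted from highest to lowest score."""
    scored = [(motif_id, combined_score(values))
              for motif_id, values in predictions.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy predictions: each minimotif carries the outputs of individual filters.
toy_predictions = {
    "motif_1": {"ppi": 1.0, "surface": 1.0},
    "motif_2": {"cell_process": 1.0},
}
for motif_id, score in rank_minimotifs(toy_predictions):
    print(motif_id, round(score, 2))
```

Because the score is a weighted sum, a minimotif supported by several independent lines of evidence ranks above one supported by a single filter, which is the behavior the rationale above relies on.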
Validation - The equation obtained from analysis of the training set was then used to evaluate sensitivity and selectivity on a test data set. With a score threshold of 0.91, ~40% of the true positive minimotifs were retained by the filter and no false positives were retained. With the high accuracy threshold of 0.24, >60% of the true positives were retained and only 2% of the false positives were retained. The 0.24 threshold, although less stringent than the 0.91 threshold, is therefore still likely to produce few false positives.
Limitations - As with any trained algorithm, the coefficients are likely to be overfit to the training data set, so we expect more false positives than predicted by the validation.
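The sensitivity and false-positive figures quoted for each threshold are simple fractions of retained predictions. A minimal sketch, assuming each test minimotif already has a combined score and that "retained" means a score strictly above the threshold; the function name `evaluate_threshold` and the toy scores are made up for illustration and are not MnM benchmark data.

```python
def evaluate_threshold(true_positive_scores, true_negative_scores, threshold):
    """Return (fraction of true positives retained,
               fraction of true negatives wrongly retained)."""
    retained_tp = sum(score > threshold for score in true_positive_scores)
    retained_fp = sum(score > threshold for score in true_negative_scores)
    return (retained_tp / len(true_positive_scores),
            retained_fp / len(true_negative_scores))

# Toy scores, for illustration only.
toy_true_positives = [0.95, 0.92, 0.50, 0.30, 0.10]
toy_true_negatives = [0.30, 0.20, 0.10, 0.05]

for threshold in (0.91, 0.24):
    sensitivity, false_positive_rate = evaluate_threshold(
        toy_true_positives, toy_true_negatives, threshold)
    print(threshold, sensitivity, false_positive_rate)
```

Lowering the threshold can only keep more predictions, so sensitivity and the false-positive rate both rise or stay the same; the choice between 0.91 and 0.24 is a choice of where to sit on that trade-off.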
Publication - submitted
7. How to identify new disease hypotheses?
MnM can be used to develop new hypotheses for the causes of almost any disease. We continue with the sample analysis of Prion protein. Prion is an interesting example because a D178N mutation causes one of two diseases. Allelic variation at position 128 determines whether individuals get Creutzfeldt-Jakob disease (V128) or fatal familial insomnia (M128) (Goldfarb et al., Science. 1992). Analysis of Prion with MnM predicts 39 potential functional motifs in Prion protein. Six of these motifs were already experimentally confirmed, and five were closely related to known effects of Prion and are consistent with published experiments.
If you are interested in a disease-specific protein, you can look to genomic data, such as SNPs. To illustrate this, from the Minimotif Miner Results page, select VIEW SNPs under the VIEW menu; all polymorphisms are then revealed in the Protein Sequence window (Figure 7). Next, select the two positions of interest (128 and 178, chosen because of their importance in causing disease). Each selected box turns green and the amino acid changes to the polymorphism. Then, in the VIEW menu, select VIEW Motifs From New SNPs. A new table appears (Figure 7) in which minimotifs introduced by the polymorphisms are colored green and minimotifs removed by the polymorphisms are colored red.
To illustrate the potential power of this analysis, we have painted these potential functions onto the surface of Prion protein for easy visualization (Figure 8); a program called MolMol was used to generate this figure. This analysis shows that the 128V variant, favoring Creutzfeldt-Jakob disease, eliminates a potential Vav2 SH2 binding motif (yellow) present in the 128M variant that favors fatal familial insomnia. Also, the D178N disease-associated mutation eliminates a potential Caspase 1 cleavage site (cyan). The Caspase apoptotic cascade is activated by Prion. Several other motifs (N-glycosylation, Grb2-SH2 binding) predicted by MnM surround these important mutations and may also be responsive to mutation and/or allelic variation. Motifs predicted on the opposing face of the protein, or on fragments of Prion that are absent from this structure, are not shown. This analysis provides an example of how MnM can be used for predicting new functions in proteins and generating new hypotheses for how mutations cause disease. Changes in minimotifs can also lead to perturbations of interaction networks and regulation of important catalytic activities, which may be critical for understanding disease-related phenotypes. More information about the use of MnM for prediction of new disease hypotheses can be found in Rajasekaran et al., 2009 Nucleic Acids Res. 37:D185-190 PMID: 18978034; Schiller, 2007 Current Protocols in Protein Science, Unit 2.12.1-2.12.14 PMID: 18429315.
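The comparison behind the Motifs From New SNPs table can be sketched offline: apply the selected substitutions to the sequence, scan both versions for motif patterns, and report the differences. The sequence, the motif patterns, and the names `toy_a` and `toy_b` below are invented for illustration; they are not Prion sequence or MnM database entries, and MnM's own matching logic may differ.

```python
import re

def apply_substitutions(sequence, substitutions):
    """substitutions maps a 1-based position to the replacement residue."""
    residues = list(sequence)
    for position, residue in substitutions.items():
        residues[position - 1] = residue
    return "".join(residues)

def find_motifs(sequence, motifs):
    """Return a set of (motif_name, 1-based start) for every match."""
    hits = set()
    for name, pattern in motifs.items():
        # The lookahead lets overlapping occurrences be reported.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.add((name, match.start() + 1))
    return hits

def motif_changes(sequence, substitutions, motifs):
    """Return (gained, lost) motif hits caused by the substitutions."""
    before = find_motifs(sequence, motifs)
    after = find_motifs(apply_substitutions(sequence, substitutions), motifs)
    return sorted(after - before), sorted(before - after)

# Toy sequence and regular-expression motifs, for illustration only.
toy_sequence = "GGYMNVGGDKSGG"
toy_motifs = {"toy_a": "YMN", "toy_b": "N[^P][ST]"}

gained, lost = motif_changes(toy_sequence, {4: "V", 9: "N"}, toy_motifs)
print("gained (green):", gained)
print("lost (red):", lost)
```

In this toy case the substitution at position 4 destroys the `toy_a` match and the substitution at position 9 creates a `toy_b` match, mirroring the red and green rows of the table.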
8. MnM 4.0 Database statistics
Table 1. Statistics for growth of minimotif entries in MnM
|Required for Cell Process||-||-||47||47|
9. MnM Query Engine for advanced users
To study a specific motif, or to query the MnM database directly, we built the Minimotif Miner Query Engine as an advanced tool for users (Vyas et al., 2009). This application can select different minimotif instances or consensus sequences to build a super-consensus for many instances. Position-specific scoring matrices can also be created from selected instances. This allows the data from multiple studies to be combined to arrive at a consensus sequence. An example application to SH3 domains showed that there are 10 distinct types of SH3 domain binding motifs, which are shown in Figure 9.
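To make the idea of deriving a consensus from several instances concrete, the sketch below builds a position frequency matrix from aligned, equal-length instances and reads a consensus from it. This is only the first step toward a position-specific scoring matrix (a full PSSM would normally also apply background frequencies and log-odds scoring), and the function names, the 0.75 cutoff, and the peptide instances are illustrative assumptions rather than the Query Engine's method or data.

```python
from collections import Counter

def position_frequency_matrix(instances):
    """Return one {residue: relative frequency} dict per aligned position."""
    length = len(instances[0])
    if any(len(instance) != length for instance in instances):
        raise ValueError("instances must be aligned and of equal length")
    matrix = []
    for column in zip(*instances):
        counts = Counter(column)
        matrix.append({residue: count / len(instances)
                       for residue, count in counts.items()})
    return matrix

def consensus(matrix, min_fraction=0.75):
    """Use the dominant residue at each position, or 'x' if none dominates."""
    symbols = []
    for column in matrix:
        residue, fraction = max(column.items(), key=lambda item: item[1])
        symbols.append(residue if fraction >= min_fraction else "x")
    return "".join(symbols)

# Toy instances, for illustration only.
toy_instances = ["PPLP", "PALP", "PPVP", "PKLP"]
matrix = position_frequency_matrix(toy_instances)
print(consensus(matrix))
```

Raising `min_fraction` makes the consensus more degenerate (more `x` positions), while lowering it commits to a residue on weaker evidence.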
PLEASE CITE USE OF MNM QUERY ENGINE: Vyas J, et al. (2009) BMC Genomics. 10, 360. PMID: 19656396
10. Minimotif annotation software MimoSA
Where do all the minimotifs come from? To curate MnM, we use the MimoSA software. MimoSA is a downloadable software application that can be used to manage bioinformatics projects that annotate information from the scientific literature (Vyas et al., 2010). MimoSA (Minimotif System for Annotation) was built to annotate minimotifs but is readily adapted to other domains. It interfaces with a MySQL database and has many features, such as form-based entry; read, write, edit, and delete functions; a database viewer; a paper browser; and a paper status manager. MimoSA uses a SPAM-like algorithm called TextMine that scores papers for desired content (Figure 10).
PLEASE CITE USE OF MIMOSA OR TEXTMINE: Jay Vyas, et al. (2010). BMC Bioinformatics, 11, 328. PMID: 19656955