WO2001092990A2 - Structure-based methods for assessing amino acid variances - Google Patents

Structure-based methods for assessing amino acid variances

Info

Publication number
WO2001092990A2
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
acid residue
model
target
protein
Prior art date
Application number
PCT/US2001/017351
Other languages
French (fr)
Other versions
WO2001092990A3 (en)
Inventor
Daniel I. Chasman
R. Mark Adams
Original Assignee
Variagenics, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Variagenics, Inc.
Priority to CA002410726A (CA2410726A1)
Priority to JP2002501137A (JP2004501446A)
Priority to EP01939635A (EP1350115A2)
Priority to AU2001265131A (AU2001265131A1)
Publication of WO2001092990A2
Publication of WO2001092990A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • This invention relates to computational methods for genetic variance modeling and prediction.
  • the human genome contains approximately 60,000 to 100,000 genes.
  • a variance i.e., a mutation or polymorphism
  • the variance can result in the production of a gene product, usually a protein, with altered or no activity.
  • the variance can be as small as the addition, deletion or substitution of a single nucleotide.
  • Such a single nucleotide variance is sometimes called "single nucleotide polymorphism" or SNP.
  • the invention features variance modeling and prediction methods that are useful for assessing whether an amino acid variation at a selected amino acid residue in a protein of interest is likely (or is not likely) to have an effect on the protein (e.g., alter a biological activity of the protein).
  • the methods of the invention can be used to generate a structural model or models of all or a part of the protein of interest, assess the quality of the structural model, and evaluate the potential functional consequences of an amino acid variation.
  • the methods of the invention achieve these goals by considering certain functional, structural, and phylogenetic features of specific amino acid residues.
  • the methods of the invention are useful even when they do not predict the effect of an amino acid variance with complete accuracy.
  • the growing number of known variances makes it extremely difficult to investigate all potentially significant variances.
  • techniques that allow one to predict (even imperfectly) which variances are more likely to affect the structure or activity of a selected protein are useful because they permit one to assign a priority to a variance.
  • One embodiment of the methods of the invention identifies a model amino acid residue in a model protein to represent a polymorphic amino acid residue (variance) of interest in a protein of interest, generates a record of the analysis of the model amino acid residue, generates an assessment of model quality, generates a summary of the functional, structural, and phylogenic features assessed, and generates a graphical representation of all or a part of the model protein that can be annotated with information related to the various features assessed.
  • Another embodiment of the methods of the invention generates statistically based predictions regarding the likelihood that an amino acid change at a polymorphic amino acid residue in a protein of interest will have an effect on the protein.
  • the methods of the invention entail identifying a model amino acid residue (the “model amino acid residue” or “model variance”) within a protein structure (the “model protein”) that serves as a structural model of a selected polymorphic amino acid (“polymorphic target amino acid residue” or “target variance”) of the protein of interest (the “target protein” or “target sequences”).
  • the polymorphic target amino acid residue is a particular amino acid residue within the protein of interest that is polymorphic.
  • a first amino acid e.g., Gly
  • a second amino acid e.g., Lys
  • the amino acid present at the polymorphic target amino acid residue can be, e.g., a third, fourth, or fifth amino acid.
  • the methods of the invention can be used to assess any number of changes in the amino acid present at the polymorphic target amino acid residue and any number of different polymorphic amino acid residues within a target protein.
  • the model protein must have sufficient structural information to perform the methods' analyses.
  • the structural information can be derived from x-ray crystallography, NMR, or some other technique for determining the structure of a protein at the amino acid or atomic level.
  • the model protein is selected from among proteins with structural information based, at least in part, on its sequence similarity to the target protein.
  • the model protein can be selected based on overall sequence similarity to the target protein or based on the presence of a portion having sequence similarity to a portion of the target protein which includes the polymorphic target amino acid residue.
  • the methods of the invention entail assessing certain functional, structural, and phylogenic features of the model amino acid and its environment within the model protein.
  • the values of the features of the model amino acid residue are then used to determine a potential for an effect of an amino acid change (or variance) at the polymorphic target amino acid residue by comparison to certain criteria.
  • Some features have categorical values. For these features there may be only two values: either the specified criteria for the feature are met or they are not.
  • One example of a categorical feature is "helix breaking". To meet the criteria for "helix breaking", the model amino acid residue must be in a region of helical secondary structure and one of the polymorphic amino acids must be either Gly or Pro.
  • model features can be either continuously valued or categorical.
  • solvent accessibility of the model amino acid can be a continuous value or, if cut-off values are defined, a categorical value.
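The distinction between categorical and continuous features can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the one-letter secondary-structure code "H" for helix, and the example cut-off are all assumptions introduced here.

```python
def is_helix_breaking(secondary_structure, alleles):
    """Categorical feature: True when the model residue lies in a helix
    and at least one of the polymorphic amino acids is Gly or Pro."""
    in_helix = secondary_structure == "H"  # "H" = helix (assumed code)
    return in_helix and any(aa in ("G", "P") for aa in alleles)


def categorize(value, cutoff):
    """Turn a continuous feature into a two-valued categorical feature
    by comparing it with a cut-off (labels are hypothetical)."""
    return "below" if value < cutoff else "at_or_above"


# Hypothetical inputs: an Ala/Pro polymorphism in a helix and in a strand,
# and a solvent exposure of 7.5 compared with a cut-off of 10.0.
helix_flag = is_helix_breaking("H", ["A", "P"])
no_helix_flag = is_helix_breaking("E", ["A", "P"])
exposure_class = categorize(7.5, cutoff=10.0)
```

The same continuous value can therefore be reported as is or collapsed into a category once a cut-off has been chosen.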
  • a significant feature of the variance modeling and prediction methods of the invention is the concept of a “structural neighborhood.” This is the region within a selected radius of the atoms of a particular amino acid residue. The amino acid residues and other structural features within the structural neighborhood of an amino acid residue strongly influence the effect of a change in the actual amino acid present at the position of the amino acid residue.
  • Another significant feature of the variance modeling and prediction methods of the invention is the selection of the functional, structural, and phylogenic features that are useful in predicting the effect of a variance. Among the features analyzed are: solvent exposure, nearness to a heterogen atom, and deviation from the average crystallographic B-factor of the model protein.
  • the methods of the invention are very powerful because they do not require structural information about the target protein beyond the sequence of the target protein or the sequence of the target protein in the region that includes the polymorphic target amino acid residue.
  • the methods of the invention rely on the use of public sequence and structure databases. These databases become more robust as more and more sequences and structures are added. Thus, the reliability of the models and predictions made by the methods of the invention will continually increase.
  • the methods of the invention can be used to predict which non-synonymous polymorphisms are likely to affect protein function; however, they have application in many other areas of protein science.
  • the methods can be applied to predicting whether a polymorphism will affect the interaction of a drug with a target protein.
  • the methods of the invention can be applied for this purpose. Where two or more polymorphisms occur in a single protein, the methods of the invention can help evaluate both their individual and their combined effects. More generally, the choice of the target protein and polymorphisms need not be dictated by the occurrence of natural genetic variation. For example the choice can be prospective as in the case of the engineering of an enzyme, where the methods of the invention can be applied to the evaluation of which potential mutations will alter the enzyme activity. Broadly, the methods of the invention can be used whenever it is important to assess a relationship between amino acid variation and any aspect of protein activity or structure.
  • As used herein, the terms “polymorphic amino acid residue,” “amino acid polymorphism,” “polymorphism,” and “variance” refer to an amino acid position within a protein that can be one or another of two or more different amino acids.
  • structure refers to the three dimensional arrangement of atoms in the protein.
  • Function refers to any measurable property of a protein. Examples of protein function include, but are not limited to, catalysis, binding to other proteins, binding to non-protein molecules (e.g., drugs), and isomerization between two or more structural forms.
  • Biologically relevant protein refers to any protein playing a role in the life of an organism.
  • Training dataset refers to a collection of one or more proteins each with one or more polymorphisms or mutations and information concerning the effects of each polymorphism on its protein's structure or function.
  • FIG. 1 is a flow chart depicting an example of some of the steps in the annotation mode.
  • FIG. 2 is a flow chart depicting an example of some of the steps in selecting predictive features using a training dataset.
  • FIG. 3 is a flow chart depicting an example of some of the steps in the probabilistic mode.
  • the invention features methods for variance modeling and prediction.
  • the methods can be used to assess the effect of any number of amino acid variations in a protein of interest (the "target protein").
  • the methods are useful for assessing the effect of an amino acid change at a selected polymorphic amino acid residue in a target protein.
  • the amino acid residue can be an amino acid residue that is known to exhibit polymorphism, i.e., one that is known to differ among individuals of a population. For example, some individuals have a Glu at amino acid 6 of their hemoglobin beta-chain. Other individuals have a Val at this position, and this polymorphism is the cause of sickle-cell anemia.
  • the amino acid residue is polymorphic, but whether the polymorphism has any effect on the protein of interest will not be known.
  • the variance modeling and prediction methods of the invention rely on the analysis of a residue ("the model amino acid residue") in a model protein that is used to represent the polymorphic amino acid residue (the "polymorphic target amino acid residue") in the protein of interest (the “target protein”).
  • the model amino acid residue and model protein are selected based on sequence similarity to all or a portion of the target protein.
  • the model protein is one for which there is considerable structural information available (e.g., the structure of the protein has been solved).
  • the methods of the invention entail examination of various physical, structural, and phylogenetic features of the model amino acid residue.
  • the features examined are ones that are useful in predicting whether a change in the amino acid present at the model amino acid residue will affect the activity of the model protein. Examples of such features include: solvent accessibility, relative crystallographic B factor, and proximity to a heteroatom. Because the model amino acid residue and the model protein are similar to the polymorphic target amino acid residue and target protein respectively, the prediction made for the model amino acid residue and model protein will be relevant to the polymorphic target amino acid residue and target protein.
  • the methods of the invention can be used to assess the impact of any known polymorphism.
  • the methods of the invention can also be used to assess the effect of any potential change at any selected amino acid residue, including amino acid residues that are not known to be polymorphic.
  • the methods of the invention can be used to provide: 1) an annotated model of the polymorphic target amino acid residue (annotation mode); 2) a prediction of the probability that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on an activity of the target protein (probabilistic mode); or 3) a classification of the polymorphic target amino acid residue as one which is either likely or not likely to have an effect on an activity of the target protein (classification mode).
  • a model amino acid residue is used to represent the polymorphic target amino acid residue.
  • all three modes entail determining the value of at least one selected physical, structural, and phylogenetic feature of the model amino acid residue.
  • the values of the selected features can be used to provide an annotated model of the target protein.
  • One skilled in the art can use the values of the selected features to assess the likelihood that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on an activity of the target protein.
  • FIG. 1 is a flow chart depicting some of the steps in one embodiment of the annotation mode.
  • the amino acid sequence of a target protein and the location of a polymorphic target amino acid residue within the target protein are identified (STEP 102).
  • Proteins with sequence homology to the target protein are identified using an algorithm for identifying homologous protein sequences (STEP 104).
  • a model protein is selected from among the selected proteins having sequence homology to the target protein and a model amino acid residue within the model protein is identified (STEP 106).
  • the structural neighborhood of the model amino acid is determined (STEP 108), and the values of selected physical, structural, or phylogenetic features of the model amino acid residue and its structural neighborhood are determined (STEP 110).
  • the results of the various determinations are then output (STEP 112).
  • the output can be a list of values or an annotated graphical depiction of all or a part of the model protein.
  • the probabilistic mode and the classification mode employ a database of polymorphisms (a "training dataset") satisfying two requirements.
  • This database of polymorphisms (“training polymorphisms”) can contain polymorphisms of a single protein, e.g., lac repressor or lysozyme, or polymorphisms of two or more proteins.
  • the effect on activity is known, as is the value of at least one physical, structural, and phylogenetic feature.
  • the training polymorphisms are statistically analyzed to identify a subset of all possible physical, structural, and phylogenetic features that is most useful for predicting whether the training polymorphism has an effect on activity. This subset will also be useful for predicting whether an amino acid change at a model amino acid residue or a polymorphic target amino acid residue has an effect on activity.
  • FIG. 2 is a flow chart depicting some of the steps in selecting a subset of features useful for prediction. First, a training dataset of unbiased training polymorphisms with a known effect on activity is provided (STEP 202). The value of selected physical, structural, or phylogenetic features for each training polymorphism in the dataset is then determined (STEP 204). Statistical analysis is then used to select a subset of features useful for making predictions (STEP 204).
  • the probabilistic mode entails selecting training polymorphisms that are similar to the model amino acid residue in terms of the subset of features. The proportion of selected training polymorphisms that have an effect on activity is determined and this information is used to predict whether a change in the amino acid present at the model amino acid residue will have an effect on the model protein. Because the model amino acid residue is selected to represent the polymorphic target amino acid residue, this prediction is also relevant to the polymorphic target amino acid residue and the target protein.
  • the probabilistic mode can be more readily understood by considering a specific example. In this example, the polymorphic target amino acid residue is amino acid 120 of target protein X.
  • amino acid residue 150 of protein A (the model protein) is selected as the model amino acid residue. Because the structure of protein A has been solved, the value of various selected features of amino acid residue 150 of model protein A can be determined.
  • known polymorphisms (training polymorphisms) of lac repressor are selected on the basis of the similarity of a selected subset of their features to the corresponding subset of features of amino acid residue 150 of protein A. The selected lac repressor polymorphisms are used to predict whether an amino acid change at amino acid 120 of target protein X will have an effect on protein X.
  • FIG. 3 is a flow chart depicting some of the steps in the probabilistic mode.
  • the amino acid sequence of a target protein and the location of a polymorphic target amino acid residue within the target protein are identified (STEP 302).
  • Proteins with sequence homology to the target protein are identified using an algorithm for identifying homologous protein sequences (STEP 304).
  • a model protein is selected from among the selected proteins having sequence homology to the target protein and a model amino acid residue within the model protein is identified (STEP 306).
  • the structural neighborhood of the model amino acid is determined (STEP 308), and the values of selected physical, structural, or phylogenetic features of the model amino acid residue and its structural neighborhood are determined (STEP 310).
  • Training polymorphisms in an unbiased training dataset that have physical, structural, or phylogenetic characteristics similar to the model amino acid residue and its structural neighborhood are then identified (STEP 312).
  • the proportion of training polymorphisms identified in STEP 312 that have an effect on activity of the protein is then used to assess the probability that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on the target protein (STEP 314).
  • the training polymorphisms and the associated information regarding the effect of the polymorphisms on activity and the values of various features of the polymorphisms are used to build a classification tree.
  • the classification tree can be used to classify the model amino acid residue as either one that is likely to have an effect on activity or one that is not likely to have an effect on activity. Because the model amino acid residue is selected to represent the polymorphic target amino acid residue, this classification is also relevant to the polymorphic target amino acid residue and the target protein.
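STEPS 312 and 314 can be sketched as a similarity search followed by a proportion. The text does not say how similarity between feature sets is measured, so the Euclidean distance and the `max_distance` cut-off below are assumptions, as are the function names and the toy training data.

```python
import math


def similar_training_polymorphisms(model_features, training_set, max_distance):
    """Return training polymorphisms whose feature vectors lie within
    max_distance (Euclidean, an assumed similarity measure) of the
    model residue's feature vector."""
    selected = []
    for features, has_effect in training_set:
        if math.dist(model_features, features) <= max_distance:
            selected.append((features, has_effect))
    return selected


def probability_of_effect(model_features, training_set, max_distance):
    """Proportion of similar training polymorphisms that affect activity.
    Returns None when no training polymorphism is similar enough."""
    selected = similar_training_polymorphisms(
        model_features, training_set, max_distance
    )
    if not selected:
        return None
    return sum(1 for _, has_effect in selected if has_effect) / len(selected)


# Hypothetical training data: (relative accessibility, entropy), effect flag.
training = [
    ((0.10, 0.2), True),
    ((0.15, 0.3), True),
    ((0.20, 0.1), False),
    ((0.90, 1.5), False),
]
p_effect = probability_of_effect((0.12, 0.2), training, max_distance=0.2)
```

In this toy case three training polymorphisms fall within the cut-off and two of them affect activity, so the estimate is 2/3. In practice the features would probably need to be scaled before distances are compared.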
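The classification mode can be illustrated with a deliberately simplified tree. The text does not specify how the tree is built; the sketch below assumes a single split chosen by Gini impurity (a one-level tree) and majority voting on each side, and every name and data value in it is hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of booleans."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows):
    """rows: list of (feature_vector, has_effect). Returns the
    (feature_index, threshold) that minimises weighted Gini impurity."""
    best = None
    n_features = len(rows[0][0])
    for i in range(n_features):
        for threshold in sorted({f[i] for f, _ in rows}):
            left = [y for f, y in rows if f[i] <= threshold]
            right = [y for f, y in rows if f[i] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, i, threshold)
    if best is None:
        raise ValueError("no feature separates the training polymorphisms")
    return best[1], best[2]


def classify(features, rows, split):
    """Majority vote among training rows on the same side of the split."""
    i, threshold = split
    side = [y for f, y in rows if (f[i] <= threshold) == (features[i] <= threshold)]
    return sum(side) * 2 > len(side)


# Hypothetical training rows: (relative accessibility, entropy), effect flag.
rows = [
    ((0.05, 0.1), True),
    ((0.10, 0.9), True),
    ((0.70, 0.2), False),
    ((0.95, 0.8), False),
]
split = best_split(rows)
affects = classify((0.08, 0.5), rows, split)
tolerated = classify((0.80, 0.5), rows, split)
```

A full classification tree would apply the same split search recursively to each side until a stopping rule is met.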
  • the annotation mode, the probabilistic mode, and the classification mode are three examples of how the methods of the invention can be used. Those skilled in the art will recognize that many other implementations are possible. For example, it may be possible to derive a mathematical relationship (e.g., a regression relationship) between a subset of all possible physical, structural, and phylogenetic features that can be used to predict the effect of a polymorphism.
  • Selection and Validation of a Model Protein and Model amino acid residue
  • An important step in the methods of the invention is the selection of a model amino acid residue within a model protein that can be used to represent the polymorphic target amino acid residue in the target protein. This is accomplished by first selecting a model protein(s).
  • a model protein(s) can be any protein(s) that is homologous in sequence to the target protein and for which there is significant structural information.
  • the selection involves searching for a protein(s) that is similar in sequence to the target protein in a curated structural database such as the Protein Data Bank (PDB).
  • Sequence similarity is typically assessed by the BLAST program (NCBI), which aligns two sequences and evaluates the alignment with two quality scores: the E-value (a measure of expectation by chance) and the number of aligned residues that are the same in the two sequences. It can also be assessed using other sequence alignment methods such as the Smith-Waterman or FASTA algorithms.
  • protein structures from the PDB are considered acceptable models for the target protein's structure if the E-value of the alignment for the target protein's sequence and the model protein's sequence is sufficiently small (e.g., an E-value less than 10^-4).
  • an alignment e.g., a BLAST alignment
  • residues in the target protein and residues in the model protein are used to identify the residue in the model protein that is taken as the model amino acid residue.
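The model-selection and residue-mapping steps described above can be sketched as follows, assuming the alignment is available as two equal-length gapped strings with '-' marking gaps. The function names and the toy alignment are hypothetical; only the 10^-4 E-value example comes from the text.

```python
def is_acceptable_model(e_value, cutoff=1e-4):
    """Accept a structure as a model when its alignment E-value is
    below the cut-off (10^-4 is the example value given in the text)."""
    return e_value < cutoff


def map_target_to_model(target_aln, model_aln, target_pos,
                        target_start=1, model_start=1):
    """Map a 1-based target residue number to the aligned model residue
    number. Returns None if the target residue is aligned to a gap or
    lies outside the alignment."""
    t_num, m_num = target_start - 1, model_start - 1
    for t_char, m_char in zip(target_aln, model_aln):
        if t_char != "-":
            t_num += 1
        if m_char != "-":
            m_num += 1
        if t_char != "-" and t_num == target_pos:
            return m_num if m_char != "-" else None
    return None


# Toy alignment: the model has one extra residue (A) after position 2,
# and the last target residue has no counterpart in the model.
target_aln = "MK-LVG"
model_aln = "MKALV-"
model_residue = map_target_to_model(target_aln, model_aln, 3)
unaligned = map_target_to_model(target_aln, model_aln, 5)
```

Here target residue 3 maps to model residue 4, while target residue 5 has no model residue and would need a different model protein.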
  • a crystal (or NMR) structure for the target protein itself already exists in the PDB and this, of course, is the best possible case.
  • the E-value of the alignment is essentially zero and the quality of the model is equivalent to the reliability of the crystallographic (or NMR) procedures.
  • a theoretical homology model of the target protein or a related protein may have been constructed, published, and deposited in the PDB. The homology model's quality can be assessed manually by reference to the homology modeling procedure and from the publication describing the model.
  • Other embodiments of the invention can incorporate an explicit step for construction of a fully optimized homology model for each target protein before assessing the function of individual residues.
  • the structure of the model protein in the vicinity of the model amino acid residue can be assessed for quality.
  • a structural neighborhood of the model amino acid residue is identified.
  • the structural neighborhood can be the collection of residues in the model protein's structure that have at least one atom within some distance or radius (e.g., 5 Å) of at least one atom in the model amino acid residue.
  • the residues in the structural neighborhood are residues that make the closest contact with the modeled variance, and the value of 5 Å for the radius can be used to reflect a generous approximate distance for van der Waals interactions.
  • the model quality near the model amino acid residue is computed as the fraction of residues in the structural neighborhood that are identically conserved in the BLAST alignment of the target protein and the protein whose structure is used for the model.
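The neighborhood definition and the local quality score can be sketched as below. The data layout (a dictionary of atom coordinates per residue and a dictionary of aligned target amino acids) is an assumption made for illustration, as is the choice to leave the central residue out of its own neighborhood.

```python
import math


def structural_neighborhood(atoms_by_residue, center, radius=5.0):
    """Residues (other than center) with at least one atom within
    radius angstroms of any atom of the center residue.
    atoms_by_residue maps residue id -> list of (x, y, z) coordinates."""
    center_atoms = atoms_by_residue[center]
    neighbors = []
    for res_id, atoms in atoms_by_residue.items():
        if res_id == center:
            continue  # excluding the residue itself is an assumption
        if any(math.dist(a, c) <= radius for a in atoms for c in center_atoms):
            neighbors.append(res_id)
    return neighbors


def neighborhood_quality(neighbors, model_seq, aligned_target):
    """Fraction of neighborhood residues that are identical in the target.
    aligned_target maps model residue id -> aligned target amino acid;
    unaligned residues are simply absent and count as not conserved."""
    if not neighbors:
        return 0.0
    same = sum(1 for r in neighbors if aligned_target.get(r) == model_seq[r])
    return same / len(neighbors)


# Hypothetical single-atom residues around model residue 10.
atoms = {
    10: [(0.0, 0.0, 0.0)],
    11: [(3.0, 0.0, 0.0)],
    12: [(3.0, 3.0, 0.0)],
    50: [(20.0, 0.0, 0.0)],
}
model_seq = {10: "A", 11: "L", 12: "K", 50: "G"}
aligned_target = {11: "L", 12: "R"}
neighbors = structural_neighborhood(atoms, center=10)
quality = neighborhood_quality(neighbors, model_seq, aligned_target)
```

In this toy case residues 11 and 12 are in the neighborhood and one of the two is conserved, so the local model quality is 0.5.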
  • Statistical measures of neighborhood similarity could also be used to assess quality, e.g., a structural neighborhood equivalent of the BLAST E-value.
  • the measure of conservation in the structural neighborhood affords a very precise measurement of the accuracy of the modeling near the model amino acid residue itself.
  • the structural neighborhood of the model amino acid residue is used to define the structural neighborhood of the polymorphic target amino acid residue.
  • the sequence of the region of the model protein corresponding to the structural neighborhood of the model amino acid residue is aligned with the sequence of the target protein and the aligned target protein amino acids are defined as part of the structural neighborhood of the polymorphic target amino acid residue.
  • Among the features examined in the model protein are: the distance between the model amino acid residue and any structural motifs or important functional residues, e.g., enzyme active sites in the model protein; distances between the model amino acid residue and any heterogens present in the model protein; and the distance between the model amino acid residue and any subunit interfaces in the model protein.
  • the sequence of the target protein and the sequence of the model protein are examined for matches to the entries in one or more databases of recognized domains, e.g., the PROSITE database domains (Bairoch et al. (1997) Nucl. Acids. Res. 24:217) or the pfam HMM database (Bateman et al., (2000) Nucl. Acids. Res. 28:263).
  • the PROSITE database is a compilation of two types of sequence signatures: profiles, typically representing whole protein domains, and patterns, typically representing just the most highly conserved functional or structural aspects of protein domains.
  • the minimum distance is determined between atoms in the model amino acid residue and atoms in the model's match to the PROSITE entry.
  • Small minimum distances e.g., 5 Å
  • heterogens are small chemical groups (non-protein molecules) in protein structures that are associated with a protein during the structure determination. Often heterogens are enzyme cofactors, substrates, glycosides, substrate analogs, or drugs. Their location in the structure of a protein may suggest the location of an enzymatic active site or an important functional motif. As with the matches to PROSITE patterns, the minimum distance between the atoms in the model amino acid residue and atoms in the model structure's heterogens are calculated and reported.
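The minimum-distance feature can be computed directly from atomic coordinates. The sketch below assumes coordinates are already available as (x, y, z) tuples in ångströms; the function name, the toy coordinates, and the use of a 5 Å cut-off as a flag are illustrative assumptions.

```python
import math


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom in atoms_a and
    any atom in atoms_b. Both arguments are lists of (x, y, z) tuples."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


# Hypothetical coordinates for a model residue and one heterogen.
residue_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
heterogen_atoms = [(4.5, 3.0, 0.0), (9.0, 0.0, 0.0)]
d = min_distance(residue_atoms, heterogen_atoms)
near_heterogen = d <= 5.0  # 5 angstroms, the example distance in the text
```

The same function applies unchanged to the atoms of a PROSITE match or of a subunit interface.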
  • the distances are interpreted to reflect a potential effect of the model amino acid residue on the model protein's function, and by extension on the target protein's function.
  • a model amino acid residue near an enzyme cofactor is interpreted to suggest that the variance will affect the enzyme's activity.
  • a relatively small distance e.g., within 5 Å
  • two or more model amino acid residues each modeling the same or different polymorphic target amino acid residues
  • This last possibility may be particularly relevant when multiple variances in a single target protein have biological properties that depend on their haplotype.
  • One important class of features that can be used to evaluate the tolerance of a protein to a polymorphism is related to intrinsic structural properties and phylogenetic aspects of the model amino acid residue.
  • the structural properties include the accessibility of the model amino acid residue to solvent and its secondary structure classification, e.g., helix or sheet. Both of these properties can be computed for a model amino acid residue in the context of the model protein structure from well-known algorithms. Both are used to implement the notion that amino acid polymorphisms at residues with certain structural dispositions are likely to affect protein structure or function.
  • the phylogenetic aspects of the polymorphic target amino acid residue are quantitative measures of the degree of phylogenetic variability (or alternatively conservation) at the polymorphic target amino acid residue within the family of related protein sequences containing the target protein.
  • Several measures of phylogenetic variability exist, e.g., the Kabat-Wu variability measure and phylogenetic weight, and any of these would suffice.
  • One convenient measure is the phylogenetic entropy. This value can be computed from a simultaneous multiple alignment of the target protein's sequence family. For example, all protein sequences in the public databases that are at least 30% identical to the target protein can be collected and aligned to each other with known algorithms, e.g. CLUSTALW.
  • This collection of simultaneously aligned sequences is known as a multiple alignment. It defines an association between each residue in each sequence and one (or none if there are gaps in the alignment) of the residues in each of the other sequences. Each position in the multiple alignment therefore represents a set of homologous residues in the set of homologous proteins.
  • the entropy of each position is computed as:
  • N = number of different amino acids at that position in the multiple alignment
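A per-position entropy of this kind can be sketched as follows, assuming the conventional Shannon form, S = -Σ p_i ln p_i summed over the N different amino acids at the position, where p_i is the fraction of sequences carrying amino acid i. The natural logarithm and the decision to ignore gap characters are assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log, an assumed base) of one alignment
    column given as a string of one-letter amino acid codes.
    Gap characters ('-') are ignored, which is also an assumption."""
    residues = [aa for aa in column if aa != "-"]
    counts = Counter(residues)
    total = len(residues)
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


# A fully conserved column and a column split evenly between two residues.
conserved = column_entropy("AAAA")
split_column = column_entropy("AAGG")
```

A fully conserved position has entropy 0, and an even split between two amino acids gives ln 2; larger values indicate greater phylogenetic variability.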
  • the structural and the phylogenetic information needed to analyze the intrinsic structural and phylogenetic features of the model amino acid residue can be found in the HSSP database (Sander et al. (1991) Proteins 9:56-68) which supplies continuously updated structural and phylogenetic information for each protein structure in the PDB database.
  • structural data for each residue in the protein structure includes its secondary structure assignment (e.g., helix, sheet, etc.) and an estimate of its solvent accessibility.
  • Each residue in the corresponding PDB structure is also associated with a phylogenetic entropy computed from a multiple alignment of proteins sharing at least 30% sequence identity with the model protein.
  • this phylogenetic information can be used to approximate the phylogenetic information for proteins that are similar to the target protein.
  • the entire multiple alignment for proteins related to the model protein, and therefore an amino acid profile for each residue, is also provided in the database.
  • HSSP structural and phylogenetic data for all model amino acid residues can be reported using the methods of the invention. These data can also be used in a series of tests for determining whether a change in the amino acid present at the polymorphic target amino acid residue is expected to have functional consequences for the target protein. The following functional tests can use information in the HSSP database.
  • 1) Buried Charge: The model amino acid residue is inaccessible and the polymorphic target amino acid residue includes a charged residue.
  • the polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of helical secondary structure based on structure analysis.
  • the model amino acid residue has less than about 10 Å² (≈1 water molecule) exposure to the solvent.
  • This feature can also be assessed using a relative accessibility value, which is the ratio of the observed solvent exposure of the model amino acid residue to the maximum solvent exposure for that amino acid in a polyalanine chain (or some other predetermined polypeptide chain).
  • a value of relative accessibility less than about 0.2 suggests that the model amino acid residue is inaccessible, while a value greater than about 0.8 suggests that the model amino acid residue is accessible.
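  • The relative accessibility test can be sketched as follows. The cutoffs of 0.2 and 0.8 come from the text; the function names, the "intermediate" label for values between the cutoffs, and the example areas are assumptions made for illustration.

```python
def relative_accessibility(observed_area, max_area):
    """Ratio of observed solvent exposure to the reference maximum (same units)."""
    if max_area <= 0:
        raise ValueError("max_area must be positive")
    return observed_area / max_area

def classify_accessibility(rel_acc, low=0.2, high=0.8):
    """Apply the approximate cutoffs from the text; the middle label is an assumption."""
    if rel_acc < low:
        return "inaccessible"
    if rel_acc > high:
        return "accessible"
    return "intermediate"

# Hypothetical example: 10 square Angstroms observed, 100 for the reference chain
label = classify_accessibility(relative_accessibility(10.0, 100.0))
```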
  • the polymorphic target amino acid residue includes an amino acid that is found not more than 10% of the time in the multiple alignment profile for the model amino acid residue.
  • the polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of turn secondary structure.
  • the polymorphic target amino acid residue includes an amino acid that is not found in the multiple alignment profile for the polymorphic target amino acid residue. If the target protein and the model protein are similar enough, this feature can be approximated by the multiple alignment profile of the model amino acid residue, e.g., from the HSSP file.
  • Unusual Amino Acid by Class The polymorphic target amino acid residue is not found in the minimum profile from Adams et al. (Protein Science 5:1240, 1996) that includes all of the amino acids in the phylogenetic profile of the target or model amino acid residue. This feature is preferable to the "Unusual Amino Acid" feature when the multiple alignment contains relatively few sequences. Classification schemes other than that proposed by Adams et al. can also be used.
  • Hydrophobicity Compatibility The average hydrophobicity of the model amino acid residue is outside of a predetermined range (i.e., the neighborhood is particularly hydrophobic or particularly hydrophilic) and the difference between the hydrophobicity of the first amino acid and the second amino acid exceeds a predetermined value.
  • Phylogenetic data (e.g., from the HSSP database) can be used to analyze the structural neighborhood of the model amino acid residue to determine whether the model amino acid residue is in a region of the model protein that is relatively conserved. For example, the entropy values from the HSSP database for each residue in a structural neighborhood are averaged. A structural neighborhood is judged to be unusually conserved or unusually variable on an absolute basis if its average entropy value is, respectively, significantly less than or greater than the average entropy for structural neighborhoods derived from representative PDB structures and their corresponding phylogenetic properties.
  • Representative structures from the PDB have been defined by others on the basis of fold families (see Holm and Sander, Science 273:595), and are available through the FSSP database at EMBL. Structural neighborhoods from about 600 representative structure families have been compiled and analyzed for phylogenetic entropy.
  • the average structural neighborhood entropy value is compared to the average and standard deviation of the entropy for all of the residues in the model protein polypeptide chain that contains the model amino acid residue.
  • a conventional significance statistic can be computed as:
  • N number of residues in the structural neighborhood
  • <En> average entropy of residues in the structural neighborhood
  • <Ec> average entropy of residues in the polypeptide chain that contains the model amino acid residue
  • S.D. Ec standard deviation in the entropy for residues in the polypeptide chain that contains the model amino acid residue.
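  • The significance formula is missing from this text. A minimal sketch, assuming the standard form in which the neighborhood average is compared to the chain average in units of the standard deviation expected for a mean of N residues, i.e. (<En> - <Ec>) / (S.D. Ec / √N). The function name and the use of the population standard deviation are assumptions.

```python
import math
import statistics

def neighborhood_significance(neighborhood_values, chain_values):
    """Z-like statistic: (<En> - <Ec>) / (S.D. Ec / sqrt(N)).

    neighborhood_values: per-residue values (e.g., entropies) in the neighborhood
    chain_values: per-residue values for the whole polypeptide chain
    """
    n = len(neighborhood_values)
    mean_n = statistics.mean(neighborhood_values)
    mean_c = statistics.mean(chain_values)
    sd_c = statistics.pstdev(chain_values)  # population S.D.; an assumption
    if sd_c == 0:
        raise ValueError("chain values have zero variance")
    return (mean_n - mean_c) / (sd_c / math.sqrt(n))
```

A strongly negative value indicates an unusually conserved neighborhood relative to the chain, and a strongly positive value an unusually variable one.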
  • B-factors are calculated for each residue in the model structure as the average of its atomic B-factors and compared first to absolute standards of low and high B-factor values (e.g., estimated at 15.0 Å² and 45.0 Å², respectively) computed from structural neighborhoods in a representative set of PDB structures. Subsequently, a relative measure of the model residue B-factor is determined by comparison to the mean and standard deviation for residues in the model protein.
  • model amino acid residues with significantly low or high B-factors relative to other residues in the model protein are judged to be relatively intolerant or tolerant, respectively, to amino acid variation.
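  • The residue-level B-factor tests can be sketched as below. The 15.0 and 45.0 Å² standards are the example values given in the text; the function names, the "typical" label, and the use of the population standard deviation are assumptions.

```python
import statistics

LOW_B, HIGH_B = 15.0, 45.0  # example absolute standards from the text (square Angstroms)

def residue_b_factor(atomic_b_factors):
    """Residue B-factor taken as the average of its atomic B-factors."""
    return statistics.mean(atomic_b_factors)

def absolute_b_class(residue_b):
    """Compare a residue B-factor to the absolute low/high standards."""
    if residue_b < LOW_B:
        return "low"
    if residue_b > HIGH_B:
        return "high"
    return "typical"  # label for the middle range is an assumption

def relative_b_factor(residue_b, all_residue_bs):
    """Number of standard deviations from the mean residue B-factor of the protein."""
    mean = statistics.mean(all_residue_bs)
    sd = statistics.pstdev(all_residue_bs)
    return (residue_b - mean) / sd
```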
  • the average B-factor of the model amino acid residue's structural neighborhood can be computed and compared to absolute standards of low and high B-factors compiled for structural neighborhoods in representative PDB structures.
  • a measure of the structural neighborhood B-factor relative to the model protein itself is determined by comparing the average of the residue B-factors for the structural neighborhood of the model amino acid residue to the average and standard deviation in the B-factors for residues in the polypeptide chain of the model amino acid residue. The significance of the structural neighborhood's average B-factor is computed as:
  • N number of residues in the structural neighborhood
  • <Bc> average residue B-factor for residues in the polypeptide chains involved in the structural neighborhood
  • S.D. Bc standard deviation in the residue B-factor for residues in the polypeptide chains involved in the structural neighborhood
  • S.D. Bn = S.D. Bc/√N standard deviation in the average B-factor for samples of N residues chosen from the same chain as the model amino acid residue.
  • Modeled variances in structural neighborhoods of significantly low average residue B-factor are judged to be in sufficiently rigid environments that they may have structural and functional consequences.
  • Modeled variances in structural neighborhoods of significantly high average residue B-factor are judged to be in sufficiently flexible environments that they may not have structural or functional consequences.
  • Other relative measures can be used, e.g., a t-distribution value.
  • the B-factor is related to the flexibility of the region of a polypeptide being analyzed.
  • the B-factor can be replaced in the methods of the invention by another suitable measure of flexibility.
  • the B-factor can be replaced by the r.m.s. deviation in residue position for an ensemble of structures or experimental determinations in the coupling constants and relaxation times of atoms that are diagnostic for mobility.
  • the methods of the invention can report the outcome of the quality and function tests for each of the variances during the course of the analysis and produce a graphical representation of the model protein, generated as script for a molecule rendering program, e.g., RasMol.
  • the protein structure can be represented by ribbons while the modeled variances, heterogens, and residues corresponding to PROSITE matches in the model structure are displayed in space filling representation. Residue labels are added for the modeled variances.
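  • A minimal sketch of generating such a rendering script is shown below. It assumes the commonly documented RasMol commands `load`, `select`, `ribbons`, `spacefill`, `wireframe`, and `label`; the function name, file name, and residue numbers are hypothetical, and residues matching PROSITE entries are omitted for brevity.

```python
def rasmol_script(pdb_path, variance_residues):
    """Return RasMol script text: ribbons for the protein, space-filling for
    heterogens and modeled variances, and residue labels on the variances."""
    lines = [
        f"load pdb {pdb_path}",
        "wireframe off",
        "select protein",
        "ribbons",
        "select hetero and not water",
        "spacefill",
    ]
    for resnum in variance_residues:
        # select by residue number, show as spheres, label with name + number
        lines += [f"select {resnum}", "spacefill", "label %n%r"]
    return "\n".join(lines) + "\n"

# Hypothetical usage: coordinates file and two modeled variance positions
script = rasmol_script("2pua.pdb", [45, 60])
```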
  • all of the output, including the graphical representation can be converted to a web browser readable form.
  • the model amino acid residue has less than 10 Å² (≈1 water molecule) exposure to the solvent.
  • the value is the solvent accessible area in Å².
  • a model amino acid residue can also be defined as inaccessible if it has a low value for its relative accessibility, e.g., less than 0.2.
  • the modeled variance is within 5.0 Å of at least one residue in a different polypeptide chain in the coordinates. The values are yes or no.
  • the model amino acid residue is within 5.0 Å of a residue that is absolutely conserved in the phylogenetic analysis. The values are yes or no.
  • the model amino acid residue is within 5.0 Å of a heterogen atom. The value is a distance in Angstroms.
  • the model amino acid residue is within 5.0 Å of at least one other model amino acid residue.
  • the value is a distance in Angstroms.
  • the model amino acid residue is within 5.0 Å of a residue in the coordinates that matches a PROSITE entry that is also matched by the target protein. The value is a distance in Angstroms.
  • the model amino acid residue is within 5.0 Å of a residue in the model structure that matches a PROSITE entry that is NOT matched by the target protein.
  • the value is a distance in Angstroms.
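  • The proximity tests above reduce to a distance check against a 5.0 Å cutoff. The sketch below assumes the distance between two groups is the smallest atom-to-atom distance, which the text does not specify; the function names and coordinates are illustrative.

```python
import math

def min_distance(atoms_a, atoms_b):
    """Smallest distance between any atom in group A and any atom in group B.

    Each group is a list of (x, y, z) coordinates in Angstroms.
    """
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)

def is_near(atoms_a, atoms_b, cutoff=5.0):
    """Yes/no form of the test, using the 5.0 Angstrom cutoff from the text."""
    return min_distance(atoms_a, atoms_b) <= cutoff

# Hypothetical coordinates for a model residue and a heterogen
residue_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
ligand_atoms = [(3.0, 4.0, 0.0)]
distance = min_distance(residue_atoms, ligand_atoms)
```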
  • Rare Amino Acid At least one of the residues encoded by the variance is found not more than 10% of the time in phylogenetic profile for the model amino acid residue. The values are yes or no.
  • the polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of helical secondary structure. The values are yes or no.
  • Turn Breaking The polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of turn secondary structure. The values are yes or no.
  • Unusual Amino Acid At least one of the residues encoded by the polymorphic target amino acid residue is not found in the phylogenetic profile for the polymorphic target amino acid residue. This parameter can be approximated using the phylogenetic profile of the model amino acid residue, e.g., from the HSSP file. The values are yes or no. This parameter can also be assessed using classes, as described above.
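  • The yes/no sequence tests above can be sketched as simple predicates. The profile is assumed here to be a mapping from one-letter amino acid code to its observed fraction at the aligned position; that representation, the function names, and the secondary-structure labels are assumptions.

```python
def breaks_structure(variant_residues, secondary_structure, region):
    """Helix or turn breaking: Gly or Pro paired with some other residue,
    with the model residue located in the named secondary-structure region."""
    residues = set(variant_residues)
    has_gly_or_pro = bool(residues & {"G", "P"})
    has_other = len(residues) > 1
    return has_gly_or_pro and has_other and secondary_structure == region

def rare_amino_acid(variant_residues, profile, threshold=0.10):
    """Yes if any variant residue occurs not more than 10% of the time."""
    return any(profile.get(aa, 0.0) <= threshold for aa in variant_residues)

def unusual_amino_acid(variant_residues, profile):
    """Yes if any variant residue is absent from the profile."""
    return any(profile.get(aa, 0.0) == 0.0 for aa in variant_residues)

# Hypothetical profile for one aligned position
profile = {"A": 0.85, "S": 0.10, "T": 0.05}
```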
  • the average B factor for the model amino acid residue is less than 15.0 or greater than 45.0. Lower values mean less motion for that residue in the crystal structure.
  • the average B factor for the model amino acid residue is at least 2 standard deviations above or below the average B factor for residues in the polypeptide chain of the model amino acid residue. The value is the number of standard deviations.
  • Low or High Neighbor B The average B factor for the structural neighborhood of the model amino acid residue is less than 15.0 or greater than 45.0.
  • the average B factor for the structural neighborhood of the model amino acid residue is at least 2 S.D. (S.D. as defined above) less than or greater than the average B factor for residues in the polypeptide chain of the model amino acid residue. The value is the number of standard deviations.
  • the entropy of the model amino acid residue is less than 0.5 or greater than 2.0.
  • the value is in entropy units ranging from 0.0, meaning absolute conservation, to ln 20 ≈ 3.0, meaning no conservation.
  • the entropy of the model amino acid residue is more than 2.0 S.D. below or above the average phylogenetic entropy for residues in the target protein.
  • Low or High Neighborhood Entropy The average entropy for the structural neighborhood of the model amino acid residue is less than 0.5 or greater than 2.0.
  • the average entropy for the structural neighborhood of the model amino acid residue is at least 2.0 S.D. (S.D. as defined above) smaller or greater than the average entropy for the polypeptide chain of the model amino acid residue. The value is the number of S.D.
  • the features of the model amino acid residue described above can be used as predictor variables in quantitative, statistical models for assessing whether a polymorphism will have an effect on protein structure or function.
  • the statistical models rely on actual experimental data concerning the effects of variances on protein activity.
  • the predictive models can employ continuous values (or discrete approximations of continuous values, e.g., high, medium, and low B-factor) of the predictor features. Described below are two statistical models for predicting whether a polymorphic target amino acid residue will affect protein structure or function by assessing some or all of the features of modeled variances.
  • the features in the predictive methods are slightly adapted from their definitions above.
  • Other statistical models for predicting the effects of polymorphic target amino acid residues can be used, and are indicated by reference at the end of this section. These alternative methods can use some or all of the same predictive features of model amino acid residues.
  • the features used for predictions fall into two broad categories: environment features and categorical features.
  • the class of categorical features is further divided into polymorphism-specific categorical features and special case categorical features. Each of these different features is briefly described below along with an example of how the feature is valued.
  • Solvent Accessibility This is a measure of the accessibility of the model amino acid residue to solvent. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
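  • Partitioning a continuous feature into bins holding equal numbers of training polymorphisms can be sketched as follows. The cut-point rule and function names are assumptions, and tied values can make the bins slightly unequal in practice.

```python
import bisect

def equal_count_bin_edges(values, n_bins):
    """Cut points that split the training values into n_bins groups of
    (approximately) equal size."""
    ordered = sorted(values)
    size = len(ordered)
    return [ordered[(i * size) // n_bins] for i in range(1, n_bins)]

def assign_bin(value, edges):
    """Index of the bin (0 = lowest) into which a feature value falls."""
    return bisect.bisect_right(edges, value)

# Hypothetical accessibility values for nine training polymorphisms
edges = equal_count_bin_edges([9, 1, 8, 2, 7, 3, 6, 4, 5], 3)
bins = [assign_bin(v, edges) for v in range(1, 10)]
```

The same helper applies to the other environment features when they are used in the classification tree model.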
  • Relative Accessibility This is a measure of the accessibility of the model amino acid residue to solvent relative to the maximum accessibility of that residue in peptide of specified composition, typically a polyalanine polypeptide. It is used as a continuous variable in the probabilistic model described below.
  • Relative B-Factor This is a measure of the crystallographic B-factor of the model amino acid residue normalized to the average and standard deviation of the B-factor for other residues in the same polypeptide chain of the model protein. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
  • This feature is a measure of the statistical significance (defined as above for same feature) of the average B-factor of the model amino acid residue's structural neighborhood relative to the average B-factor of the model amino acid residue's polypeptide chain. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
  • This feature is a measure of the statistical significance (defined as above for same feature) of the average phylogenetic entropy of the model amino acid residue's structural neighborhood relative to the average phylogenetic entropy of the model amino acid residue's polypeptide chain. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
  • Polymorphism-Specific Categorical Features are variance-specific because they are related to the identity of the amino acids that comprise the polymorphic target amino acid residue. In the statistical modeling approaches described below these features are given the value 1 (or yes) if the polymorphism meets the specified criteria, and 0 (or no) otherwise.
  • Unusual Amino Acid One of the amino acids of the polymorphism is not found in the phylogenetic profile of the target variable residue. This feature can be approximated by examination of the model amino acid residue's phylogenetic profile, e.g., from the HSSP file.
  • Unusual Amino Acid by Class One of the amino acids of the polymorphism is not found in the minimum profile from Adams et al. (Protein Science 5:1240, 1996) that includes all of the amino acids in the phylogenetic profile of the polymorphic target amino acid residue.
  • This feature can be approximated from the model amino acid residue's phylogenetic profile, e.g., from the HSSP file.
  • the feature is preferably used when the multiple alignment contains relatively few sequences. Classification schemes other than that proposed by Adams et al. can also be used.
  • the model amino acid residue has turn secondary structure and one of the amino acids of the polymorphism is glycine or proline.
  • the model amino acid residue has helical secondary structure assignments and one of the amino acids of the polymorphism is glycine or proline.
  • the model amino acid residue is near (e.g., 5 Å) a heterogen atom (ligand) in the model protein.
  • the model amino acid residue is near a PROSITE match (e.g., 5 Å) that is common to the target protein and the model protein.
  • Interface The model amino acid residue is near (e.g., 5 Å) the interface between two or more subunits in the model protein.
  • Training Data Sets Training datasets contain: 1) at least one amino acid variation in a protein for which there is sufficient structural information to assess at least one selected structural, phylogenetic, or physical feature, and 2) information describing the effect on protein function of each amino acid variation.
  • Mutants of the E. coli lac repressor (Markiewicz et al. (1994) J. Mol. Biol. 240:421-433) and lysozyme (Rennell et al. (1991) J. Mol. Biol. 222:67) can be used as training datasets.
  • Any other collection of polymorphisms, preferably unbiased, for which activity and structural information is available can also be used, even if the dataset contains polymorphisms of many different proteins.
  • the approach of the probabilistic model is to view each variance as having a probability that it will affect protein structure or function.
  • This outlook implicitly reflects the idea that the homology models are, in general, approximate descriptions. There may be some factors bearing on the variance's effect on structure or function that are not anticipated by the model. However, given enough unbiased data, such factors can be assessed in probabilistic terms through experimental data sets that examine the relationship between mutations and effects on protein structure or function. For example, one of the data sets used in the implementation of the methods of the invention involves over 4000 unbiased mutations in the E. coli lac repressor and their classification with respect to the repressor's biological function.
  • the probability values for the predictions combine measures of the intrinsic tolerance of the target protein's structure and function to amino acid variation at the polymorphic target amino acid residue, the nature of the chemical change caused by the variance, and additional classifications for the special cases of variances in particularly vulnerable locations in the model protein's structure.
  • To compute the probability that a polymorphic target amino acid residue will affect target protein structure or function, training polymorphisms with feature values similar to the feature values of the model amino acid residue are collected from the training data set.
  • the precise criteria for assessing feature value similarity between the training polymorphisms and the model amino acid residue are parameters of the prediction model.
  • these criteria are set so that the polymorphisms in the training set have environment feature values within some tolerance, e.g., 1 standard deviation, of the environment feature values of the model amino acid residue, and categorical feature values that are identical to the categorical feature values of the model amino acid residue.
  • the probability that the polymorphic target amino acid residue will affect target protein structure or function is defined as the proportion of residues in the sub-group of selected training polymorphisms that have effects on their own protein's structure or function. Defining the probabilities in this way assumes that the environment features and the categorical features calibrated for effects on structure and function with the training set have predictive meaning for the polymorphic target amino acid residue and target protein. It also assumes that the training polymorphisms represent an unbiased sampling of the effects on protein structure of polymorphisms with the specified feature values.
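  • The selection-and-proportion step described above can be sketched as follows. The data layout (tuples of environment values, categorical values, and an observed effect flag), the per-feature tolerances, and the function name are assumptions made for illustration; returning None when no training polymorphism matches is also an assumption.

```python
def effect_probability(query_env, query_cat, training, tolerances):
    """Proportion of matching training polymorphisms that affect function.

    query_env: environment feature values of the model amino acid residue
    query_cat: categorical feature values (0/1) of the modeled variance
    training: list of (env_values, cat_values, has_effect) tuples
    tolerances: allowed absolute difference per environment feature,
                e.g., one standard deviation of that feature
    """
    matches = [
        has_effect
        for env, cat, has_effect in training
        if cat == query_cat
        and all(abs(e - q) <= tol for e, q, tol in zip(env, query_env, tolerances))
    ]
    if not matches:
        return None  # too few similar training polymorphisms to estimate
    return sum(matches) / len(matches)

# Hypothetical training set with two environment and two categorical features
training = [
    ((0.10, 1.0), (0, 1), True),
    ((0.20, 1.2), (0, 1), False),
    ((0.15, 0.9), (0, 1), True),
    ((0.90, 1.0), (0, 1), True),   # environment too different
    ((0.10, 1.0), (1, 1), False),  # categorical features differ
]
p = effect_probability((0.10, 1.0), (0, 1), training, (0.2, 0.5))
```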
  • the features are assumed to reflect generic properties of polymorphisms that are useful for evaluating their effect on protein function and the training polymorphisms are assumed to reflect typical behavior for amino acid variation. Empirically, this assumption is valid, at least for soluble, globular proteins and the lac repressor and lysozyme training datasets.
  • the selected training polymorphisms will be more similar to the model amino acid residue, which itself was selected to be similar to the polymorphic target amino acid residue on the basis of sequence similarity.
  • the more features used to parameterize the model amino acid residue, the more difficult it becomes to identify enough training polymorphisms to make an adequate statistical comparison with the current training data sets.
  • some of the features are strongly correlated with others and contribute little to the characterization of model amino acid residues, e.g., accessibility and relative accessibility.
  • the reduced set of features used for selecting polymorphisms in the probabilistic model is selected using standard maximum likelihood statistical methods. Formally, this entails computing the gain in the likelihood for predictions made on the training data with each possible combination of a few environment features and a few categorical features compared with predictions based on a more general hypothesis. In this case, the more general hypothesis defines a polymorphism's probability of affecting function as the proportion of polymorphisms in the entire training data set that have effects on function.
  • the optimal set of parameters is the one that gives the maximum likelihood gain.
  • This exhaustive procedure is very computationally intensive. Computational time can be reduced by exploiting the observed strong effect of the environment features on the likelihood calculation. This observation leads to an approximate, stepwise procedure for maximizing the likelihood in which: first the environment features that alone maximize the likelihood are identified, and second, in conjunction with the selected optimized environment features, an optimal set of categorical features is identified. Applying this approximate procedure on two training data sets showed that the best environment features typically include one of the two accessibility features, one of the two B-factor features, and one of the two phylogenetic entropy features, as might be expected. Other statistical methods, e.g., discriminant function analysis, can be also used to choose the dominant features.
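  • The likelihood gain used to compare feature sets can be sketched as below, assuming a Bernoulli log-likelihood over the training outcomes and the general hypothesis described above as the baseline. The function names and the clamping constant are assumptions; the text does not give the exact likelihood expression.

```python
import math

def log_likelihood(outcomes, probabilities, eps=1e-9):
    """Bernoulli log-likelihood of observed effects (1/0) given predictions."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += math.log(p) if y else math.log(1.0 - p)
    return total

def likelihood_gain(outcomes, model_probabilities):
    """Gain over the general hypothesis, in which every polymorphism gets the
    overall proportion of training polymorphisms that affect function."""
    base = sum(outcomes) / len(outcomes)
    baseline = log_likelihood(outcomes, [base] * len(outcomes))
    return log_likelihood(outcomes, model_probabilities) - baseline

# Hypothetical outcomes and predictions from one candidate feature set
gain = likelihood_gain([1, 1, 0, 0], [0.9, 0.9, 0.1, 0.1])
```

In a stepwise search, this gain would be evaluated first for candidate sets of environment features and then for categorical features added to the best environment set.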
  • the number of parameters used can be reduced by the standard statistical method of principal component analysis. For example, the complete set of six environment features for all polymorphic residues in the training set is transformed to its principal components, and then just one or a few of the stronger principal components (those with the larger eigenvalues, with or without realignment with the original environment features) are used instead of all of the environment features.
  • the probability of a target polymorphism having an effect on protein structure and function is determined as described above with the chosen principal components replacing the environment features in the computation.
  • QUEST will directly accept both the continuously valued environment features and the categorical features as predictor variables. As with the probabilistic model, it proves useful to limit the number of variables to three environment features (or to use principal components) and a selection of the other, categorical features in order to accommodate the limited size of the training data sets.
  • QUEST uses ANOVA F-statistics to select variables and to define "split" values in each continuous parameter for optimal classification. Trees are then constructed with the selected variables and "split" criteria, and then pruned subject to node size criteria and cross-validation tests. Once the optimal tree is delineated, target polymorphisms can be assessed for whether they are predicted to affect protein structure or function.
  • Typical application of QUEST involves running it in default mode with the exception that the minimum node size is often increased, e.g., by a factor of about four, to simplify the classification trees without a serious loss of accuracy.
  • a guiding principle in the construction of classification trees with continuous predictor variables is that the "split" values for the predictors should make sense to a user familiar with the underlying scientific issues.
  • applying the environment features as continuous variables in the automated QUEST method can lead to classification trees that are excessively branched and hard to interpret.
  • An alternative way to implement the method involves categorizing the values of each of the environment features into a reasonable number of groups. For example, each environment feature can be categorized into high, medium, and low values. The categorized environment features can then be used with the other categorical features to construct simplified and robust classification trees by QUEST.
  • Other statistical methods can be used in the analysis of the environment and categorical features, and in their application for predicting whether target polymorphisms will affect protein structure or function. These include but are not limited to: discriminant function analysis for selection of environment features for each combination of categorical features (e.g., see StatSoft Inc., Electronic Text Book, http://www.statsoft.com, chapter on Discriminant Analysis), and logistic regression of the environment features for each combination of categorical features (e.g., see Montgomery and Peck (1992), Introduction to Linear Regression Analysis, Wiley, NY, Chapter 6). Some implementations might use neural nets or related models to assimilate training data for predictions of effects on structure or function caused by polymorphic target amino acid residues.
  • a computer program for the automated structural modeling and functional analysis can be written in any suitable language, e.g., Python 1.4.
  • Programs and supporting files e.g., databases
  • the program can be run on suitable computer systems known to those skilled in the art, for example, a Silicon Graphics O2 workstation operating under IRIX v. 6.5.
  • Useful databases include: the Protein Data Bank (PDB) of macromolecular structures and sequences corresponding to the structures; the Homology-Derived Secondary Structure of Proteins (HSSP; EMBL) database; the PROSITE database (EXPASY; currently using release 15) of profiles and patterns.
  • Useful software for implementing certain features of the method includes: BLAST 2.0.6 sequence alignment and database searching software (NCBI); RasMol 2.6.4 (Roger Sayle) program for visualizing (rendering) and annotating homology models; Chime (MDL, Inc.) http plug-in module for visualizing the models in a web browser; and Pfscan 1.0 software (Philipp Bucher; Swiss Institute for Experimental Cancer Research) for comparing amino acid sequences to PROSITE profiles.
  • the methods of the invention are not limited to use with any particular hardware/software configuration. They may find applicability in any computing or processing environment.
  • the methods of the invention may be implemented in computer programs executing on programmable computers that each includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices.
  • Program code may be applied to data entered using an input device to perform the methods and to generate output information for display.
  • Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system.
  • the programs can be implemented in assembly or machine language.
  • the language may be a compiled or an interpreted language.
  • Each computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the methods.
  • the methods may also be implemented as a computer-readable storage medium, configured with a computer program, where, upon execution, instructions in the computer program cause the computer to operate in accordance with the methods.
  • the methods of the invention are useful in a number of areas beyond simply making predictions about the effect of a known or theoretical polymorphism.
  • the methods of the invention can be used for identification and analysis of amino acid polymorphisms that affect the structure or function of proteins involved directly or indirectly in the action of pharmaceutical or diagnostic agents.
  • the methods can be used in the identification and analysis of structural or functional interactions between two or more amino acid polymorphisms in a protein of interest (e.g., in analysis of haplotypes).
  • the methods of the invention can be used to identify and analyze polymorphisms that have an effect on a catalytic activity of the protein of interest or a non-catalytic activity of the protein of interest (e.g., structure, stability, binding to a second protein or polypeptide chain, binding to a nucleic acid molecule, binding to a small molecule, and binding to a macromolecule that is neither a protein nor a nucleic acid).
  • the methods of the invention can also be used in the identification and analysis of candidate polymorphisms for polymorphism-specific targeting by pharmaceutical or diagnostic agents, for the identification and analysis of candidate polymorphisms for pharmacogenomic applications, and for experimental biochemical and structural analysis of pharmaceutical targets that exhibit amino acid polymorphism.
  • the methods of the invention can be used to identify amino acid substitutions that can be made to engineer the structure or function of a protein of interest (e.g., to increase or decrease a selected activity or to add or remove a selected activity).
  • the methods can also be used for the prospective or retrospective identification and analysis of alterations in a biological property related to a polymorphism.
  • Example 1 Annotation Mode
  • The method of the invention was used to analyze a number of polymorphic amino acid residues in lac repressor.
  • The annotation mode was used and purine repressor was selected as the model protein.
  • Reproduced below is a portion of the output of a computer program used to implement one embodiment of the method of the invention.
  • The output provides: a list of the polymorphic amino acids analyzed, an alignment of each region of lac repressor containing a polymorphic residue with the corresponding region of the model protein, a summary of the number of amino acid residues that are identical in each aligned region, a summary of the PDB file information for purine repressor used in the analysis, a prosite report for the model protein, a summary of the alignment of the amino acid residues in the neighborhood of each model amino acid residue with the amino acid residues in the neighborhood of the corresponding polymorphic amino acid residue, a summary of the determinations made for each model amino acid residue (including: distance to conserved motifs, distance to heterogens, distances between model amino acid residues, entropy, secondary structure, neighborhood entropy, B-factor, relative B-factor, neighborhood B-factor, and relative neighborhood B-factor), a list of the features on which determinations can be made, and a list of determinations made for each model amino acid
  • Model based on 2pua_A has:
  • This model uses BLAST entry 2pua_A and PDB coordinates 2pua
  • MOL_ID 1; MOLECULE: PURINE REPRESSOR; CHAIN: A; ENGINEERED: YES; MUTATION: R190A; BIOLOGICAL_UNIT: HOMODIMER; OTHER_DETAILS : METHYLPURINE-PUR-OPERATOR; MOL_ID: 2; MOLECULE: DNA; CHAIN: B; ENGINEERED: YES;
  • Variance list in model is [('A', 54, 0)] Modeled Variance ('A', 54, 0)
  • Variance list in model is [('A', 171, 0)] Modeled Variance ('A', 171, 0)
  • Number of residues in Neighborhood of radius 5.0 A is 12. Of these, 11 residues are covered by alignment Of these, 1 residues are the same
  • Variance list in model is [('A', 248, 0)] Modeled Variance ('A', 248, 0)
  • Number of residues in Neighborhood of radius 5.0 A is 18 Of these, 18 residues are covered by alignment Of these, 8 residues are the same
  • Variance list in model is [('A', 299, 0)] Modeled Variance ('A', 299, 0)
  • Average entropy of chain is 1.612
  • Ave. entropy for neighborhood is -0.878 S.D from ave entropy of chain of modeled variance.
  • Average entropy of chain is 1.612
  • Ave. entropy for neighborhood is -0.153 S.D from ave. entropy of chain of modeled variance.
  • Average entropy of chain is 1.612
  • Ave. entropy for neighborhood is -2.402 S.D from ave entropy of chain of modeled variance.
  • Average entropy of chain is 1.612
  • Ave. entropy for neighborhood is 0.287 S.D from ave entropy of chain of modeled variance.
  • Average B-factor of atoms in residue is : 50.9
  • Average B-factor of residues in residue's chain is : 43.6 S.D. in B-factor of residues in residue's chain is 15.8 Min and Max residue B-factors of residue's chain: 14.9 93.6 Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
  • Residue b factor is in the 8 th decile
  • Residue B-factor is 0.5 S.D. from average B-factor for chain
  • Average B-factor of atoms in residue's neighborhood is : 42.4
  • Average B-factor of atoms in chains of residue's neighborhood is : 44.0
  • Neighborhood B factor is -0.3 S.D. from average B-factor for chains in neighborhood
  • Residue b factor is in the 3 th decile
  • Residue B-factor is -0.6 S.D. from average B-factor for chain
  • Average B-factor of atoms in residue's neighborhood is : 35.0
  • Neighborhood B factor is -1.9 S.D. from average B-factor for chains in neighborhood
  • Average B-factor of atoms in residue is : 29.2
  • Average B-factor of residues in residue's chain is : 43.6 S.D. in B-factor of residues in residue's chain is 15.8 Min and Max residue B-factors of residue's chain: 14.9 93.6 Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
  • Residue b factor is in the 2 th decile
  • Residue B-factor is -0.9 S.D. from average B-factor for chain
  • Average B-factor of atoms in residue's neighborhood is : 28.4
  • Average B-factor of atoms in chains of residue's neighborhood is : 43.6
  • Neighborhood B factor is -4.1 S.D. from average B-factor for chains in neighborhood
  • Residue b factor is in the 8 th decile
  • Residue B-factor is 0.5 S.D. from average B-factor for chain
  • Average B-factor of atoms in residue's neighborhood is : 46.1
  • Average B-factor of atoms in chains of residue's neighborhood is : 43.6
  • Neighborhood B factor is 0.5 S.D. from average B-factor for chains in neighborhood
  • The modeled variance is inaccessible and the actual variance includes a charged residue.
  • The actual variance includes either a Gly or a Pro and some other amino acid, and the modeled variance is in a region of helical secondary structure from HSSP analysis. The only value is yes.
  • hi_b -- The modeled variance's crystallographic B-factor is greater than 45.0 A^2.
  • hi_decile_b -- The modeled variance's crystallographic B-factor is in the tenth decile of B-factors for the modeled variance's chain in the PDB file.
  • The modeled variance's phylogenetic variation is in the tenth decile of variation for the modeled variance's chain in the PDB file.
  • The average crystallographic B-factor for the modeled variance's neighborhood is greater than 45.0 A^2.
  • The average crystallographic B-factor for the modeled variance's neighborhood is at least 2.0 s.d. above the average B-factor for other neighborhoods in the residue's chain.
  • hi_nbhd_rel_var -- The average phylogenetic variation for the modeled variance's neighborhood is at least 2.0 s.d. above the average variation for other neighborhoods in the residue's chain.
  • hi_nbhd_var -- The average phylogenetic variation for the modeled variance's neighborhood is greater than 2.0 e.u. (8 residues with equal weight).
  • hi_rel_b -- The modeled variance's crystallographic B-factor is at least 2.0 s.d. above the average B-factor for the modeled variance's chain in the PDB file.
  • The modeled variance's phylogenetic variation is at least 2.0 s.d. above the average variation for the modeled variance's chain in the PDB file.
  • hi_var -- The modeled variance's phylogenetic variation is greater than 2.0 e.u. (8 residues with equal weight).
  • inaccessible -- The HSSP file indicates that the modeled variance has less than 10 A^2 (~1 water molecule) exposure to the solvent. The value is the solvent accessible area in A^2.
  • The modeled variance is within 5.0A of at least one residue in a different chain in the coordinates. The only value is yes.
  • lo_b -- The modeled variance's crystallographic B-factor is less than 15.0 A^2.
  • lo_decile_b -- The modeled variance's crystallographic B-factor is in the first decile of B-factors for the modeled variance's chain in the PDB file.
  • lo_decile_var -- The modeled variance's phylogenetic variation is in the first decile of variation for the modeled variance's chain in the PDB file.
  • The average crystallographic B-factor for the modeled variance's neighborhood is less than 15.0 A^2.
  • The average crystallographic B-factor for the modeled variance's neighborhood is at least 2.0 s.d. below the average B-factor for other neighborhoods in the residue's chain.
  • lo_nbhd_rel_var -- The average phylogenetic variation for the modeled variance's neighborhood is at least 2.0 s.d. below the average variation for other neighborhoods in the residue's chain.
  • lo_nbhd_var -- The average phylogenetic variation for the modeled variance's neighborhood is less than 0.69 e.u. (2 residues with equal weight).
  • lo_rel_b -- The modeled variance's crystallographic B-factor is at least 2.0 s.d. below the average B-factor for the modeled variance's chain in the PDB file.
  • lo_rel_var -- The modeled variance's phylogenetic variation is at least 2.0 s.d. below the average variation for the modeled variance's chain in the PDB file.
  • lo_var -- The modeled variance's phylogenetic variation is less than 0.69 e.u. (2 residues with equal weight).
  • near_conserved -- The modeled variance is within 5.0A of a residue that is absolutely conserved in the HSSP profile. The only value is yes.
  • near_het_atom -- The modeled variance is within 5.0A of a hetero atom in the coordinates. The value is a distance in Angstroms.
  • The modeled variance is within 5.0A of at least one other modeled variance. The value is a distance in Angstroms.
  • near_seq_prosite -- The modeled variance is within 5.0A of a residue in the coordinates that matches a prosite entry that is also matched by the primary sequence. See near_struct_prosite. The value is a distance in Angstroms.
  • near_struct_prosite -- The modeled variance is within 5.0A of a residue in the coordinates that matches a prosite entry that is NOT matched by the primary sequence. See near_seq_prosite. The value is a distance in Angstroms.
  • rare_aa -- At least one of the residues encoded by the variance is found not more than 10% of the time in the HSSP profile for the modeled variance. The only value is yes.
  • turn_breaking -- The actual variance includes either a Gly or a Pro and some other amino acid, and the modeled variance is in a turn from the HSSP analysis. The only value is yes.
  • unusual_aa -- At least one of the residues encoded by the variance is not found in the HSSP profile for the modeled variance. The only value is yes.
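The relative values and entropy thresholds in the output above can be illustrated with a short Python sketch. It is not the program that produced the output: the function names are hypothetical, and the use of natural-log Shannon entropy is an assumption, suggested only by the fact that the 2.0 e.u. and 0.69 e.u. cut-offs are close to ln 8 and ln 2 (8 and 2 residues with equal weight).

```python
import math

def shannon_entropy(counts):
    """Entropy in natural-log units of a residue-count profile (assumed definition)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)

def sd_from_average(value, average, sd):
    """Number of standard deviations by which `value` differs from `average`."""
    return (value - average) / sd

# Hypothetical profiles: 8 and 2 residue types observed with equal weight.
eight_equal = dict.fromkeys("ACDEFGHI", 1)
two_equal = dict.fromkeys("AG", 1)
print(round(shannon_entropy(eight_equal), 2))  # 2.08
print(round(shannon_entropy(two_equal), 2))    # 0.69

# Values reported in the output above: residue B-factors of 50.9 and 29.2,
# chain average 43.6, chain S.D. 15.8.
print(round(sd_from_average(50.9, 43.6, 15.8), 1))  # 0.5
print(round(sd_from_average(29.2, 43.6, 15.8), 1))  # -0.9

# A relative flag in the style of hi_rel_b: at least 2.0 s.d. above average.
hi_rel_b = sd_from_average(50.9, 43.6, 15.8) >= 2.0
print(hi_rel_b)  # False
```

The two relative B-factor values reproduce the "0.5 S.D." and "-0.9 S.D." lines of the output, which suggests the relative features are plain standard scores against the chain statistics.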
  • Example 2 Probabilistic Mode
  • In a second example, the probabilistic mode was used to assess the probability that a change in the amino acid present at each of 3245 known lac repressor polymorphisms would alter activity of lac repressor.
  • A set of 1468 lysozyme polymorphisms was used as the training dataset and maximum likelihood analysis was used to select the characteristics (from among the physical, structural and phylogenetic features described above) that would be used to analyze the model amino acid residues.
  • The selected training polymorphisms were then used to assess the probability that a change in the amino acid present at the polymorphic target amino acid residue would have an effect on activity of the target protein.
  • The assessment was based on the proportion of selected training polymorphisms that have an effect on the activity of the training protein, lysozyme.
  • In some cases no prediction was made because the number of selected training polymorphisms was too small to make a statistically significant prediction.
  • The predictions made were then compared to the known effects of the lac repressor polymorphisms and the accuracy of the predictions was analyzed. The results of this analysis are presented in Table 1 below.
  • The predictions are sorted by confidence level.
  • The values in the column under the heading "0.70" summarize the accuracy for predicting that mutations having a probability of affecting function of 0.70 or greater will affect function and that mutations with a probability of 0.30 (1.0 minus 0.70) or less will not affect function.
  • The accuracy of each class of predictions is assessed by the actual number of true positives, false positives, true negatives, and false negatives and by the statistical measures correlation coefficient, chi-squared value compared to a null hypothesis of predictions made knowing just the fraction of polymorphisms affecting function, selectivity, and sensitivity for the predictions.
  • The last value in each column is the misclassification rate (fraction of incorrectly predicted mutations). This example demonstrates that the probabilistic mode can be used to make predictions about the likely effect of a polymorphism.
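The confidence-level scoring described in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis behind Table 1: the probabilities and known effects are invented, the function names are hypothetical, the correlation coefficient is taken to be the Matthews (phi) coefficient, and "selectivity" is interpreted as specificity, which the text does not confirm. The chi-squared statistic is omitted.

```python
import math

def call(probability, confidence=0.70):
    """True = predicted to affect function, False = predicted not to, None = no call."""
    if probability >= confidence:
        return True
    if probability <= 1.0 - confidence + 1e-12:  # tolerance for float rounding
        return False
    return None

def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def accuracy_summary(calls, known_effects):
    """Tally the confusion matrix over the calls that were actually made."""
    tp = fp = tn = fn = 0
    for predicted, actual in zip(calls, known_effects):
        if predicted is None:
            continue  # outside the confidence level, so not scored
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),  # assumed meaning of "selectivity"
        "correlation": ratio(tp * tn - fp * fn, denom),
        "misclassification": ratio(fp + fn, tp + fp + tn + fn),
    }

# Invented probabilities and known effects, for illustration only.
probabilities = [0.95, 0.80, 0.72, 0.50, 0.25, 0.10, 0.75, 0.20]
known = [True, True, False, True, False, False, True, True]
summary = accuracy_summary([call(p) for p in probabilities], known)
print(summary["tp"], summary["fp"], summary["tn"], summary["fn"])  # 3 1 2 1
```

Raising the confidence level makes fewer calls but scores only the more extreme probabilities, which is the trade-off the per-column layout of the table is meant to expose.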
  • Example 3 Classification Mode:
  • The classification mode was used to classify each of 3245 known lac repressor polymorphisms as either a polymorphism that is likely to alter activity or a polymorphism that is not likely to alter activity.
  • 1468 lysozyme polymorphisms were used as a training dataset to build a classification tree using QUEST.
  • Three selected continuously valued features (relative accessibility, neighborhood relative B-factor, and neighborhood relative entropy) and three selected categorical features (unusual amino acid, unusual amino acid by class, and conserved position) were used in building the classification tree.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Peptides Or Proteins (AREA)

Abstract

Variance modeling and prediction methods that are useful for assessing whether an amino acid variation at a selected amino acid residue in a protein of interest is likely (or is not likely) to have an effect on the protein (e.g., alter a biological activity of the protein) are described. The methods can be used to generate a structural model or models of all or a part of the protein of interest, assess the quality of the structural model, and evaluate the potential functional consequences of an amino acid variation. The methods achieve these goals by considering certain functional, structural, and phylogenetic features of specific amino acid residues.

Description

STRUCTURE-BASED METHODS FOR ASSESSING AMINO ACID VARIANCES
TECHNICAL FIELD
This invention relates to computational methods for genetic variance modeling and prediction.
RELATED APPLICATION INFORMATION
This application claims priority from provisional application serial no. 60/208,628, filed June 1, 2000.
BACKGROUND
The human genome contains approximately 60,000 to 100,000 genes. A variance (i.e., a mutation or polymorphism) in any of these genes can result in the production of a gene product, usually a protein, with altered or no activity. The variance can be as small as the addition, deletion or substitution of a single nucleotide. Such a single nucleotide variance is sometimes called "single nucleotide polymorphism" or SNP.
Researchers have identified over 6700 human disorders that are believed to have a genetic component. Moreover, certain genetic changes, while not the immediate cause of a disorder, may predispose an individual to certain disorders. In addition, variations in specific genes have been implicated in the differences observed between individuals in their response to drugs and other therapeutic interventions. This is important because otherwise effective therapies are sometimes not approved for use or are withdrawn from use because a relatively small number of individuals have a severe adverse reaction to the therapy. If the adverse reaction to a given therapy can be attributed to the presence of a particular genetic variation, it may be possible to identify those individuals who should not be treated with the therapy. This would permit therapies to be tailored for individual patients and would increase the number of available therapies. Thus, there are many reasons for identifying and characterizing variances.
Of course, not all genetic changes are medically significant. In one study, the SNPs in 114 independent alleles of 106 genes relevant to cardiovascular disease, endocrinology or neuropsychiatry were screened, leading to the identification of 392 coding-region SNPs that proved to be divided roughly equally between those causing synonymous and non-synonymous changes (Cargill et al. (1999) Nat. Genet. 22:231).
Because there are many non-synonymous changes, it would be useful to predict whether a given polymorphism in a selected gene is likely to cause a change in the function of the gene product.
SUMMARY
The invention features variance modeling and prediction methods that are useful for assessing whether an amino acid variation at a selected amino acid residue in a protein of interest is likely (or is not likely) to have an effect on the protein (e.g., alter a biological activity of the protein). The methods of the invention can be used to generate a structural model or models of all or a part of the protein of interest, assess the quality of the structural model, and evaluate the potential functional consequences of an amino acid variation. The methods of the invention achieve these goals by considering certain functional, structural, and phylogenetic features of specific amino acid residues.
The methods of the invention are useful even when they do not predict the effect of an amino acid variance with complete accuracy. The growing number of known variances makes it extremely difficult to investigate all potentially significant variances. Thus, techniques that allow one to predict (even imperfectly) which variances are more likely to affect the structure or activity of a selected protein are useful because they permit one to assign a priority to a variance. As a result, one can decide to apportion more resources to the investigation of more promising variances and fewer resources to the investigation of less promising variances.
One embodiment of the methods of the invention, all of which are preferably implemented using a computer program, identifies a model amino acid residue in a model protein to represent a polymorphic amino acid residue (variance) of interest in a protein of interest, generates a record of the analysis of the model amino acid residue, generates an assessment of model quality, generates a summary of the functional, structural, and phylogenetic features assessed, and generates a graphical representation of all or a part of the model protein that can be annotated with information related to the various features assessed. Another embodiment of the methods of the invention generates statistically based predictions regarding the likelihood that an amino acid change at a polymorphic amino acid residue in a protein of interest will have an effect on the protein.
The methods of the invention entail identifying a model amino acid residue (the "model amino acid residue" or "model variance") within a protein structure (the "model protein") that serves as a structural model of a selected polymorphic amino acid ("polymorphic target amino acid residue" or "target variance") of the protein of interest (the "target protein" or "target sequences"). The polymorphic target amino acid residue is a particular amino acid residue within the protein of interest that is polymorphic. Thus, in a first variant of the target protein it is a first amino acid (e.g., Gly) and in a second variant of the target protein it is a second amino acid (e.g., Lys). Of course, there can be additional variants of the target protein in which the amino acid present at the polymorphic target amino acid residue can be, e.g., a third, fourth, or fifth amino acid. The methods of the invention can be used to assess any number of changes in the amino acid present at the polymorphic target amino acid residue and any number of different polymorphic amino acid residues within a target protein.
Information regarding protein structure is important to the methods of the invention, and the model protein must have sufficient structural information to perform the methods' analyses. The structural information can be derived from x-ray crystallography, NMR, or some other technique for determining the structure of a protein at the amino acid or atomic level. The model protein is selected from among proteins with structural information based, at least in part, on its sequence similarity to the target protein. Thus, the model protein can be selected based on overall sequence similarity to the target protein or based on the presence of a portion having sequence similarity to a portion of the target protein which includes the polymorphic target amino acid residue.
Once a model amino acid within a model protein has been identified, the methods of the invention entail assessing certain functional, structural, and phylogenetic features of the model amino acid and its environment within the model protein. The values of the features of the model amino acid residue are then used to determine a potential for an effect of an amino acid change (or variance) at the polymorphic target amino acid residue by comparison to certain criteria. Some features have categorical values. For these features there may be only two values: either the specified criteria for the feature are met or they are not. One example of a categorical feature is "helix breaking". To meet the criteria for "helix breaking", the model amino acid residue must be in a region of helical secondary structure and one of the polymorphic amino acids must be either Gly or Pro. Other features, called "environment features" since they describe the structural, physical, and phylogenetic disposition of the model amino acid residue, can be either continuously valued or categorical. For example, the solvent accessibility of the model amino acid can be a continuous value or, if cut-off values are defined, a categorical value.
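The distinction between categorical and continuously valued features can be made concrete with a minimal Python sketch. The function names, the set of secondary-structure codes treated as helical, and the 10.0 square-angstrom accessibility cut-off used as a default are assumptions for illustration only.

```python
# DSSP-style codes treated as helical here (alpha, 3-10, and pi helix);
# which codes should count is an assumption of this sketch.
HELIX_CODES = {"H", "G", "I"}

def helix_breaking(secondary_structure, polymorphic_residues):
    """Categorical feature: the model residue lies in a helix and at least one
    of the amino acids at the polymorphic position is Gly or Pro."""
    in_helix = secondary_structure in HELIX_CODES
    has_gly_or_pro = bool(set(polymorphic_residues) & {"Gly", "Pro"})
    return in_helix and has_gly_or_pro

def accessibility_category(accessible_area, cutoff=10.0):
    """Turn a continuous solvent-accessible area into a categorical value
    once a cut-off has been chosen."""
    return "inaccessible" if accessible_area < cutoff else "accessible"

print(helix_breaking("H", ["Gly", "Lys"]))  # True
print(helix_breaking("E", ["Gly", "Lys"]))  # False
print(accessibility_category(4.0))          # inaccessible
```

Keeping the raw accessible area alongside the category lets the same measurement serve both the continuous and the categorical analyses.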
A significant feature of the variance modeling and prediction methods of the invention is the concept of a "structural neighborhood." This is the region within a selected radius of the atoms of a particular amino acid residue. The amino acid residues and other structural features within the structural neighborhood of an amino acid residue strongly influence the effect of a change in the actual amino acid present at the position of the amino acid residue. Another significant feature of the variance modeling and prediction methods of the invention is the selection of the functional, structural, and phylogenetic features that are useful in predicting the effect of a variance. Among the features analyzed are: solvent exposure, nearness to a heterogen atom, and deviation from the average crystallographic B-factor of the model protein. These and other features are described in greater detail below.
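One way to compute a structural neighborhood is sketched below in Python. The `Atom` record, the function name, the toy coordinates, and the choice to exclude the central residue from its own neighborhood are all assumptions of this sketch; a real implementation would read atoms from a coordinate file.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record: chain identifier, residue number, (x, y, z).
Atom = namedtuple("Atom", ["chain", "resnum", "coord"])

def structural_neighborhood(atoms, chain, resnum, radius=5.0):
    """Return sorted (chain, resnum) keys of residues with at least one atom
    within `radius` angstroms of any atom of the central residue."""
    center = [a for a in atoms if (a.chain, a.resnum) == (chain, resnum)]
    if not center:
        raise ValueError("central residue not found")
    neighbors = set()
    for atom in atoms:
        key = (atom.chain, atom.resnum)
        if key == (chain, resnum):
            continue  # the central residue is not its own neighbor here
        if any(math.dist(atom.coord, c.coord) <= radius for c in center):
            neighbors.add(key)
    return sorted(neighbors)

# Toy coordinates for illustration only.
atoms = [
    Atom("A", 54, (0.0, 0.0, 0.0)),
    Atom("A", 54, (1.5, 0.0, 0.0)),
    Atom("A", 55, (3.0, 0.0, 0.0)),
    Atom("A", 90, (0.0, 4.9, 0.0)),
    Atom("A", 120, (20.0, 0.0, 0.0)),
    Atom("B", 7, (1.5, 3.0, 0.0)),
]
print(structural_neighborhood(atoms, "A", 54))
# [('A', 55), ('A', 90), ('B', 7)]
```

Because the test is atom-to-atom, the neighborhood picks up residues that are close in space but distant in sequence, including residues in other chains.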
The methods of the invention are very powerful because they do not require structural information about the target protein beyond the sequence of the target protein or the sequence of the target protein in the region that includes the polymorphic target amino acid residue. The methods of the invention rely on the use of public sequence and structure databases. These databases become more robust as more and more sequences and structures are added. Thus, the reliability of the models and predictions made by the methods of the invention will continually increase.
The methods of the invention can be used to predict which non-synonymous polymorphisms are likely to affect protein function; however, they have application in many other areas of protein science. For example, the methods can be applied to predicting whether a polymorphism will affect the interaction of a drug with a target protein. Where two or more polymorphisms occur in a single protein, the methods of the invention can help evaluate both their individual and their combined effects. More generally, the choice of the target protein and polymorphisms need not be dictated by the occurrence of natural genetic variation. For example, the choice can be prospective as in the case of the engineering of an enzyme, where the methods of the invention can be applied to the evaluation of which potential mutations will alter the enzyme activity. Broadly, the methods of the invention can be used whenever it is important to assess a relationship between amino acid variation and any aspect of protein activity or structure.
As used herein the terms "polymorphic amino acid residue," "amino acid polymorphism," "polymorphism," and "variance" refer to an amino acid position within a protein that can be one or another of two or more different amino acids. In the context of a protein, the term "structure" refers to the three dimensional arrangement of atoms in the protein. "Function" refers to any measurable property of a protein. Examples of protein function include, but are not limited to, catalysis, binding to other proteins, binding to non-protein molecules (e.g., drugs), and isomerization between two or more structural forms. "Biologically relevant protein" refers to any protein playing a role in the life of an organism. "Training dataset" refers to a collection of one or more proteins each with one or more polymorphisms or mutations and information concerning the effects of each polymorphism on its protein's structure or function.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flow chart depicting an example of some of the steps in the annotation mode.
FIG. 2 is a flow chart depicting an example of some of the steps in selecting predictive features using a training dataset.
FIG. 3 is a flow chart depicting an example of some of the steps in the probabilistic mode.
DETAILED DESCRIPTION
The invention features methods for variance modeling and prediction. The methods can be used to assess the effect of any number of amino acid variations in a protein of interest (the "target protein"). Thus, the methods are useful for assessing the effect of an amino acid change at a selected polymorphic amino acid residue in a target protein. The amino acid residue can be an amino acid residue that is known to exhibit polymorphism, i.e., one that is known to differ among individuals of a population. For example, some individuals have a Glu at amino acid 6 of their hemoglobin beta-chain. Other individuals have a Val at this position, and this polymorphism is the cause of sickle-cell anemia. In many cases, it will be known that the amino acid residue is polymorphic, but whether the polymorphism has any effect on the protein of interest will not be known. The variance modeling and prediction methods of the invention rely on the analysis of a residue ("the model amino acid residue") in a model protein that is used to represent the polymorphic amino acid residue (the "polymorphic target amino acid residue") in the protein of interest (the "target protein"). The model amino acid residue and model protein are selected based on sequence similarity to all or a portion of the target protein. The model protein is one for which there is considerable structural information available (e.g., the structure of the protein has been solved). The methods of the invention entail examination of various physical, structural, and phylogenetic features of the model amino acid residue. The features examined are ones that are useful in predicting whether a change in the amino acid present at the model amino acid residue will affect the activity of the model protein. Examples of such features include: solvent accessibility, relative crystallographic B factor, and proximity to a heteroatom. 
Because the model amino acid residue and the model protein are similar to the polymorphic target amino acid residue and target protein respectively, the prediction made for the model amino acid residue and model protein will be relevant to the polymorphic target amino acid residue and target protein.
The methods of the invention can be used to assess the impact of any known polymorphism. The methods of the invention can also be used to assess the effect of any potential change at any selected amino acid residue, including amino acid residues that are not known to be polymorphic.
The methods of the invention can be used to provide: 1) an annotated model of the polymorphic target amino acid residue (annotation mode); 2) a prediction of the probability that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on an activity of the target protein (probabilistic mode); or 3) a classification of the polymorphic target amino acid residue as one which is either likely or not likely to have an effect on an activity of the target protein (classification mode). In all three modes, a model amino acid residue is used to represent the polymorphic target amino acid residue. In addition, all three modes entail determining the value of at least one selected physical, structural, and phylogenetic feature of the model amino acid residue.
In one embodiment of the annotation mode, the values of the selected features can be used to provide an annotated model of the target protein. One skilled in the art can use the values of the selected features to assess the likelihood that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on an activity of the target protein.
FIG. 1 is a flow chart depicting some of the steps in one embodiment of the annotation mode. The amino acid sequence of a target protein and the location of a polymorphic target amino acid residue within the target protein are identified (STEP 102). Proteins with sequence homology to the target protein are identified using an algorithm for identifying homologous protein sequences (STEP 104). A model protein is selected from among the selected proteins having sequence homology to the target protein and a model amino acid residue within the model protein is identified (STEP 106). The structural neighborhood of the model amino acid is determined (STEP 108), and the values of selected physical, structural, or phylogenetic features of the model amino acid residue and its structural neighborhood are determined (STEP 110). The results of the various determinations are then output (STEP 112). The output can be a list of values or an annotated graphical depiction of all or a part of the model protein.
The probabilistic mode and the classification mode employ a database of polymorphisms (a "training dataset") satisfying two requirements. First, the effect of each polymorphism on protein activity must be known. Second, there must be sufficient structural information for the protein containing the polymorphism for determination of at least one selected physical, structural, and phylogenetic feature. This database of polymorphisms ("training polymorphisms") can contain polymorphisms of a single protein, e.g., lac repressor or lysozyme, or polymorphisms of two or more proteins. Thus, for each training polymorphism, the effect on activity is known, as is the value of at least one physical, structural, and phylogenetic feature. In one embodiment of the probabilistic mode, the training polymorphisms are statistically analyzed to identify a subset of all possible physical, structural, and phylogenetic features that is most useful for predicting whether the training polymorphism has an effect on activity. This subset will also be useful for predicting whether an amino acid change at a model amino acid residue or a polymorphic target amino acid residue has an effect on activity. FIG. 2 is a flow chart depicting some of the steps in selecting a subset of features useful for prediction. First, a training dataset of unbiased training polymorphisms with a known effect on activity is provided (STEP 202). The value of selected physical, structural, or phylogenetic features for each training polymorphism in the dataset is then determined (STEP 204). Statistical analysis is then used to select a subset of features useful for making predictions (STEP 206).
Once the subset of features has been identified, the probabilistic mode entails selecting training polymorphisms that are similar to the model amino acid residue in terms of the subset of features. The proportion of selected training polymorphisms that have an effect on activity is determined and this information is used to predict whether a change in the amino acid present at the model amino acid residue will have an effect on the model protein. Because the model amino acid residue is selected to represent the polymorphic target amino acid residue, this prediction is also relevant to the polymorphic target amino acid residue and the target protein. The probabilistic mode can be more readily understood by considering a specific example. In this example, the polymorphic target amino acid residue is amino acid 120 of target protein X. Based on sequence homology with protein X, amino acid residue 150 of protein A (the model protein) is selected as the model amino acid residue. Because the structure of protein A has been solved, the value of various selected features of amino acid residue 150 of model protein A can be determined. Next, known polymorphisms (training polymorphisms) of lac repressor are selected on the basis of the similarity of their values for a selected subset of features to the values of the same features for amino acid residue 150 of protein A. The selected lac repressor polymorphisms are used to predict whether an amino acid change at amino acid 120 of target protein X will have an effect on protein X. For example, if 8 of the 10 selected lac repressor polymorphisms have an effect on the activity of lac repressor, it is more likely that an amino acid alteration at the polymorphic target amino acid residue will affect an activity of the target protein than if only 2 of the 10 selected lac repressor polymorphisms have an effect on the activity of lac repressor.
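By way of illustration only, the selection of similar training polymorphisms and the computation of the proportion having an effect might be sketched as follows. The Euclidean distance, the choice of the 10 nearest training polymorphisms, and the names `proportion_with_effect`, `model_features`, and `training` are assumptions made for this sketch; the specification does not prescribe a particular similarity measure, and in practice the features would likely need to be normalized to comparable scales.

```python
import math


def proportion_with_effect(model_features, training, k=10):
    """Fraction of the k most similar training polymorphisms that affect activity.

    model_features: dict mapping feature name -> numeric value for the model residue.
    training: list of (features_dict, has_effect) pairs for the training polymorphisms.
    k: number of nearest training polymorphisms to select (hypothetical choice).
    """
    if not training:
        raise ValueError("training set is empty")
    names = sorted(model_features)

    def distance(features):
        # Euclidean distance over the selected feature subset (an assumption).
        return math.sqrt(sum((features[n] - model_features[n]) ** 2 for n in names))

    nearest = sorted(training, key=lambda item: distance(item[0]))[:k]
    return sum(1 for _, has_effect in nearest if has_effect) / len(nearest)
```

A returned value of 0.8 corresponds to the case described above in which 8 of the 10 selected training polymorphisms have an effect on activity.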
FIG. 3 is a flow chart depicting some of the steps in the probabilistic mode. The amino acid sequence of a target protein and the location of a polymorphic target amino acid residue within the target protein are identified (STEP 302). Proteins with sequence homology to the target protein are identified using an algorithm for identifying homologous protein sequences (STEP 304). A model protein is selected from among the selected proteins having sequence homology to the target protein and a model amino acid residue within the model protein is identified (STEP 306). The structural neighborhood of the model amino acid is determined (STEP 308), and the values of selected physical, structural, or phylogenetic features of the model amino acid residue and its structural neighborhood are determined (STEP 310). Training polymorphisms in an unbiased training dataset that have physical, structural, or phylogenetic characteristics similar to the model amino acid residue and its structural neighborhood are identified (STEP 312). The proportion of training polymorphisms identified in STEP 312 that have an effect on activity of the protein is then used to assess the probability that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on the target protein (STEP 314).
In one embodiment of the classification mode, the training polymorphisms and the associated information regarding the effect of the polymorphisms on activity and the values of various features of the polymorphisms are used to build a classification tree. The classification tree can be used to classify the model amino acid residue as either one that is likely to have an effect on activity or one that is not likely to have an effect on activity. Because the model amino acid residue is selected to represent the polymorphic target amino acid residue, this classification is also relevant to the polymorphic target amino acid residue and the target protein.
The annotation mode, the probabilistic mode, and the classification mode are three examples of how the methods of the invention can be used. Those skilled in the art will recognize that many other implementations are possible. For example, it may be possible to derive a mathematical relationship (e.g., a regression relationship) between a subset of all possible physical, structural, and phylogenetic features that can be used to predict the effect of a polymorphism.
Selection and Validation of a Model Protein and Model Amino Acid Residue
An important step in the methods of the invention is the selection of a model amino acid residue within a model protein that can be used to represent the polymorphic target amino acid residue in the target protein. This is accomplished by first selecting a model protein(s). A model protein(s) can be any protein(s) that is homologous in sequence to the target protein and for which there is significant structural information. Usually, the selection involves searching for a protein(s) that is similar in sequence to the target protein in a curated structural database such as the Protein Data Bank (PDB).
Sequence similarity is typically assessed by the BLAST program (NCBI), which aligns two sequences and evaluates the alignment with two quality scores: the E-value (a measure of expectation by chance) and the number of aligned residues that are the same in the two sequences. It can also be assessed using other sequence alignment methods such as the Smith-Waterman or FASTA algorithms. Using BLAST, protein structures from the PDB are considered acceptable models for the target protein's structure if the E-value of the alignment for the target protein's sequence and the model protein's sequence is sufficiently small (e.g., an E-value less than 10⁻⁴). This is a relatively strict standard, and alignments with E-values as large as 1 or larger can be used if there is corroborating structural or biological information. For example, a protein with an E-value greater than 10⁻⁴ that is structurally, functionally, or biologically similar to the target protein can be useful. When there is a selection of proteins in the PDB that are homologous to the target protein, the PDB sequence with the smallest E-value (i.e., most homologous to the target protein) is preferably chosen as the model protein.
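As a minimal sketch of the selection rule just described, and assuming the alignment results are already available as (identifier, E-value) pairs, the preferred model protein could be chosen as shown below. The function name `select_model_protein` and the identifiers in the usage note are hypothetical; the 10⁻⁴ cutoff is the suggested value from the text and can be relaxed when corroborating information exists.

```python
def select_model_protein(hits, max_evalue=1e-4):
    """Pick the acceptable hit with the smallest E-value.

    hits: list of (structure_id, evalue) pairs from a sequence search.
    max_evalue: E-value cutoff below which a structure is an acceptable model.
    Returns the best (structure_id, evalue) pair, or None if no hit qualifies.
    """
    acceptable = [hit for hit in hits if hit[1] < max_evalue]
    if not acceptable:
        return None
    return min(acceptable, key=lambda hit: hit[1])
```

For example, given hypothetical hits `[("1abc", 1e-3), ("2xyz", 1e-30), ("3def", 1e-10)]`, the function returns the `"2xyz"` entry, and it returns `None` when every hit has an E-value at or above the cutoff.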
After the model protein is selected, an alignment (e.g., a BLAST alignment) between residues in the target protein and residues in the model protein is used to identify the residue in the model protein that is taken as the model amino acid residue. In some cases, a crystal (or NMR) structure for the target protein itself already exists in the PDB and this, of course, is the best possible case. Here the E-value of the alignment is essentially zero and the quality of the model is equivalent to the reliability of the crystallographic (or NMR) procedures. In other cases, a theoretical homology model of the target protein or a related protein may have been constructed, published, and deposited in the PDB. The homology model's quality can be assessed manually by reference to the homology modeling procedure and from the publication describing the model. Other embodiments of the invention can incorporate an explicit step for construction of a fully optimized homology model for each target protein before assessing the function of individual residues.
The structure of the model protein in the vicinity of the model amino acid residue can be assessed for quality. To do this, a structural neighborhood of the model amino acid residue is identified. For example, the structural neighborhood can be the collection of residues in the model protein's structure that have at least one atom within some distance or radius (e.g., 5 Å) of at least one atom in the model amino acid residue. Intuitively, the residues in the structural neighborhood are residues that make the closest contact with the modeled variance, and the value of 5 Å for the radius can be used to reflect a generous approximate distance for Van der Waals interactions. The model quality near the model amino acid residue is computed as the fraction of residues in the structural neighborhood that are identically conserved in the BLAST alignment between the target protein and the protein whose structure is used for the model. Statistical measures of neighborhood similarity could also be used to assess quality, e.g., a structural neighborhood equivalent of the BLAST E-value. In contrast to the overall or global assessment of model quality reflected in the BLAST alignment statistics, the measure of conservation in the structural neighborhood affords a very precise measurement of the accuracy of the modeling near the model amino acid residue itself.
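The neighborhood definition and the local quality measure above can be sketched as follows. This is an illustrative sketch under stated assumptions: coordinates are supplied as plain (x, y, z) tuples in Å keyed by a residue identifier, the model amino acid residue itself is excluded from its own neighborhood, and the names `structural_neighborhood` and `neighborhood_quality` are hypothetical.

```python
import math


def structural_neighborhood(residue_atoms, center, radius=5.0):
    """Residues with at least one atom within `radius` of any atom of `center`.

    residue_atoms: dict mapping residue id -> list of (x, y, z) atom coordinates.
    center: residue id of the model amino acid residue.
    The center residue is left out of the result (an assumption of this sketch).
    """
    center_atoms = residue_atoms[center]
    neighbors = set()
    for res_id, atoms in residue_atoms.items():
        if res_id == center:
            continue
        if any(math.dist(a, c) <= radius for a in atoms for c in center_atoms):
            neighbors.add(res_id)
    return neighbors


def neighborhood_quality(neighbors, model_seq, target_aligned):
    """Fraction of neighborhood residues identical in the aligned target.

    model_seq: dict mapping model residue id -> one-letter amino acid code.
    target_aligned: dict mapping model residue id -> aligned target amino acid;
        residues aligned to a gap are simply absent from the dict.
    """
    if not neighbors:
        return 0.0
    same = sum(1 for r in neighbors if target_aligned.get(r) == model_seq[r])
    return same / len(neighbors)
```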
The structural neighborhood of the model amino acid residue is used to define the structural neighborhood of the polymorphic target amino acid residue. To do so, the sequence of the region of the model protein corresponding to the structural neighborhood of the model amino acid residue is aligned with the sequence of the target protein and the aligned target protein amino acids are defined as part of the structural neighborhood of the polymorphic target amino acid residue.
As explained in greater detail below, structural neighborhoods around a model amino acid residue are also used in determining the functional consequences of amino acid variation on the target protein.
Once the model quality has been assessed, the potential functional consequences of the variation are assessed by considering various features associated with both the model amino acid residue and the target amino acid residue. The values for many of the features described herein can be calculated using methods described in Proteins (T. Creighton, W.H. Freeman & Co., New York, 1992; hereby incorporated by reference).
Features Related to the Distance Between the Model Amino Acid Residue and Certain Structural Elements
Among the features examined in the model protein are: the distance between the model amino acid residue and any structural motifs or important functional residues, e.g., enzyme active sites in the model protein; distances between the model amino acid residue and any heterogens present in the model protein; and the distance between the model amino acid residue and any subunit interfaces in the model protein.
To identify significant structural motifs, the sequence of the target protein and the sequence of the model protein are examined for matches to the entries in one or more databases of recognized domains, e.g., the PROSITE database domains (Bairoch et al. (1997) Nucl. Acids. Res. 24:217) or the pfam HMM database (Bateman et al., (2000) Nucl. Acids. Res. 28:263). The PROSITE database is a compilation of two types of sequence signatures: profiles, typically representing whole protein domains, and patterns, typically representing just the most highly conserved functional or structural aspects of protein domains. For PROSITE profiles and patterns that match both the target protein sequence and the model protein sequence, the minimum distance is determined between atoms in the model amino acid residue and atoms in the model's match to the PROSITE entry. Small minimum distances (e.g., 5 Å or less) between the model amino acid residue and the PROSITE match are considered to indicate a potential consequence of the amino acid variation on the target protein's structure and function.
Another important feature is the distance between the model amino acid residue and any heterogen in the model protein. Heterogens are small chemical groups (non-protein molecules) in protein structures that are associated with a protein during the structure determination. Often heterogens are enzyme cofactors, substrates, glycosides, substrate analogs, or drugs. Their location in the structure of a protein may suggest the location of an enzymatic active site or an important functional motif. As with the matches to PROSITE patterns, the minimum distances between the atoms in the model amino acid residue and atoms in the model structure's heterogens are calculated and reported. When small (e.g., 5 Å), the distances are interpreted to reflect a potential effect of the model amino acid residue on the model protein's function, and by extension on the target protein's function. For example, a model amino acid residue near an enzyme cofactor is interpreted to suggest that the variance will affect the enzyme's activity.
Distance criteria are also used to assess a potential effect of the variance on the stability of the quaternary structure of the target protein. If the model amino acid residue is relatively near the subunit interface of the model protein (e.g., within 5 Å), this feature is reported and interpreted to have potential effects on the way in which protein subunits are associated.
Finally, a relatively small distance (e.g., within 5 Å) between two or more model amino acid residues (each modeling the same or different polymorphic target amino acid residues) is interpreted as reflecting a potential functional interaction between variable residues in the target protein. This last possibility may be particularly relevant when multiple variances in a single target protein have biological properties that depend on their haplotype.
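The distance features in this section all reduce to a minimum atom-to-atom distance between the model amino acid residue and some other group of atoms (a PROSITE match, a heterogen, another subunit, or another model amino acid residue). A minimal sketch follows, assuming coordinates are given as (x, y, z) tuples in Å; the function names and the dictionary layout are hypothetical.

```python
import math


def min_distance(atoms_a, atoms_b):
    """Smallest distance between any atom in atoms_a and any atom in atoms_b."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def distance_features(model_atoms, groups, cutoff=5.0):
    """Report the minimum distance to each named group and whether it is near.

    model_atoms: atom coordinates of the model amino acid residue.
    groups: dict mapping a label (e.g. "heterogen") -> list of atom coordinates.
    Returns a dict mapping label -> (minimum distance, distance <= cutoff).
    """
    report = {}
    for label, atoms in groups.items():
        d = min_distance(model_atoms, atoms)
        report[label] = (d, d <= cutoff)
    return report
```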
Features Related to Intrinsic Structural and Phylogenetic Aspects of the Model Amino Acid Residue
One important class of features that can be used to evaluate the tolerance of a protein to a polymorphism is related to intrinsic structural properties and phylogenetic aspects of the model amino acid residue. The structural properties include the accessibility of the model amino acid residue to solvent and its secondary structure classification, e.g., helix or sheet. Both of these properties can be computed for a model amino acid residue in the context of the model protein structure from well-known algorithms. Both are used to implement the notion that amino acid polymorphisms at residues with certain structural dispositions are likely to affect protein structure or function. The phylogenetic aspects of the polymorphic target amino acid residue are quantitative measures of the degree of phylogenetic variability (or alternatively conservation) at the polymorphic target amino acid residue within the family of protein related sequences containing the target protein. There are several ways to represent phylogenetic variability, e.g., Kabat-Wu variability measure, phylogenetic weight, and any of these would suffice. One convenient measure is the phylogenetic entropy. This value can be computed from a simultaneous multiple alignment of the target protein's sequence family. For example, all protein sequences in the public databases that are at least 30% identical to the target protein can be collected and aligned to each other with known algorithms, e.g. CLUSTALW. This collection of simultaneously aligned sequences is known as a multiple alignment. It defines an association between each residue in each sequence and one (or none if there are gaps in the alignment) of the residues in each of the other sequences. Each position in the multiple alignment therefore represents a set of homologous residues in the set of homologous proteins. The entropy of each position is computed as:
$$\text{Entropy} = -\sum_{i=1}^{N} f_i \ln f_i$$
Where: fi = frequency of amino acid i at that position in the multiple alignment
N = number of different amino acids at that position in the multiple alignment
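A minimal sketch of the per-position entropy calculation is given below. It assumes the entropy is the Shannon entropy with the natural logarithm, which is consistent with the range of 0.0 to ln 20 stated elsewhere in this description, and it assumes that gap characters (`-`) are ignored when computing frequencies. The function name `column_entropy` is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one multiple-alignment position.

    column: iterable of one-letter amino acid codes observed at the position,
        one per aligned sequence; '-' marks a gap and is skipped (assumption).
    Returns -sum(f_i * ln f_i) over the distinct amino acids observed.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("column contains no residues")
    total = len(residues)
    counts = Counter(residues)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

An absolutely conserved position gives an entropy of zero, and a position at which all 20 amino acids occur with equal frequency gives ln 20.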
The structural and the phylogenetic information needed to analyze the intrinsic structural and phylogenetic features of the model amino acid residue can be found in the HSSP database (Sander et al. (1991) Proteins 9:56-68) which supplies continuously updated structural and phylogenetic information for each protein structure in the PDB database. In the HSSP files, structural data for each residue in the protein structure includes its secondary structure assignment (e.g., helix, sheet, etc.) and an estimate of its solvent accessibility. Each residue in the corresponding PDB structure is also associated with a phylogenetic entropy computed from a multiple alignment of proteins sharing at least 30% sequence identity with the model protein. If the target protein is included in the HSSP multiple alignment of the model protein's sequence family, this phylogenetic information can be used to approximate the phylogenetic information for proteins that are similar to the target protein. The entire multiple alignment for proteins related to the model protein, and therefore an amino acid profile for each residue, is also provided in the database. HSSP structural and phylogenetic data for all model amino acid residues can be reported using the methods of the invention. This data can also be used in a series of tests for determining whether there are expected functional consequences of a change in the amino acid present at the polymorphic target amino acid residue on the target protein. The following functional tests can use information in the HSSP database.
1) Buried Charge: The model amino acid is inaccessible and the polymorphic target amino acid residue includes a charged residue.
2) Conserved Position: The model amino acid residue is absolutely conserved in the multiple alignment profile.
3) Helix Breaking: The polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of helical secondary structure based on structure analysis.
4) Inaccessible: The model amino acid residue has less than about 10 Å² (~1 water molecule) exposure to the solvent. This feature can also be assessed using a relative accessibility value, which is the ratio of the observed solvent exposure to the maximum solvent exposure for the same amino acid in a polyalanine chain (or some other predetermined polypeptide chain). A value of relative accessibility less than about 0.2 suggests that the model amino acid residue is inaccessible, while a value greater than about 0.8 suggests that the model amino acid residue is accessible.
5) Low or high entropy: Entropy values of the model amino acid residue less than about 0.5 or greater than about 2.0 can suggest intolerance or tolerance to a polymorphism, respectively. Similarly the model amino acid residue entropy can be measured relative to the entropy values for other residues in the model protein, and statistically significant values, e.g., less than or greater than about 2.0 standard deviations from the mean entropy can signify intolerance or tolerance to a polymorphism, respectively. Other relative measures can also be used, e.g., rank order.
6) Rare Amino Acid: The polymorphic target amino acid residue includes an amino acid that is found not more than 10% of the time in the multiple alignment profile for the model amino acid residue.
7) Turn Breaking: The polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of turn secondary structure.
8) Unusual Amino Acid: The polymorphic target amino acid residue includes an amino acid that is not found in the multiple alignment profile for the polymorphic target amino acid residue. If the target protein and the model protein are similar enough, this feature can be approximated by the multiple alignment profile of the model amino acid residue, e.g., from the HSSP file.
9) Unusual Amino Acid by Class: The polymorphic target amino acid residue is not found in the minimum profile from Adams et al. (Protein Science 5:1240, 1996) that includes all of the amino acids in the phylogenetic profile of the target or model amino acid residue. This feature is preferable to the "Unusual Amino Acid" feature when the multiple alignment contains relatively few sequences. Classification schemes other than that proposed by Adams et al. can also be used.
10) Hydrophobicity Compatibility: The average hydrophobicity of the model amino acid residue is outside of a predetermined range (i.e., the neighborhood is particularly hydrophobic or particularly hydrophilic) and the difference between the hydrophobicity of the first amino acid and the second amino acid exceeds a predetermined value.
11) Buried Volume Compatibility: The model amino acid residue is inaccessible to solvent and the maximum solvent accessibility of either the first or the second amino acid differs from the buried volume of the model amino acid residue by a predetermined amount.
For the above-described features, the numerical cutoff values are merely suggested appropriate values. Other values are useful and can be selected by those skilled in the art.
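Several of the yes/no tests listed above can be sketched as simple predicates over per-residue data of the kind found in an HSSP file. The sketch below is illustrative only: the input layout, the secondary structure labels, the function name `functional_flags`, and the choice of Asp, Glu, Lys, and Arg as the charged amino acids are assumptions, and the cutoffs are the suggested values from the text.

```python
CHARGED = set("DEKR")  # assumption: histidine is not counted as charged
GLY_PRO = set("GP")


def functional_flags(accessibility, secondary_structure, profile, variant_residues,
                     inaccessible_cutoff=10.0, rare_cutoff=0.10):
    """Evaluate a few of the functional tests for one modeled variance.

    accessibility: solvent exposure of the model residue in square angstroms.
    secondary_structure: label such as "helix" or "turn" (labels are hypothetical).
    profile: dict mapping amino acid -> frequency at this alignment position.
    variant_residues: the amino acids of the polymorphism, e.g. ("A", "P").
    """
    variants = set(variant_residues)
    inaccessible = accessibility < inaccessible_cutoff
    # Glycine or proline together with at least one other amino acid.
    gly_or_pro_variant = bool(variants & GLY_PRO) and len(variants) >= 2
    observed = {aa for aa, freq in profile.items() if freq > 0}
    return {
        "buried_charge": inaccessible and bool(variants & CHARGED),
        "conserved_position": len(observed) == 1,
        "helix_breaking": gly_or_pro_variant and secondary_structure == "helix",
        "turn_breaking": gly_or_pro_variant and secondary_structure == "turn",
        "inaccessible": inaccessible,
        "rare_amino_acid": any(profile.get(aa, 0.0) <= rare_cutoff for aa in variants),
        "unusual_amino_acid": any(profile.get(aa, 0.0) == 0.0 for aa in variants),
    }
```

Under these definitions, any amino acid flagged as unusual is also flagged as rare, since a frequency of zero is not more than 10%.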
Phylogenetic Features Related to the Structural Neighborhood of the Model Amino Acid Residue
Phylogenetic data (e.g., from the HSSP database) can be used to analyze the structural neighborhood of the model amino acid residue to determine whether the model amino acid residue is in a region of the model protein that is relatively conserved. For example, the entropy values from the HSSP database for each residue in a structural neighborhood are averaged. A structural neighborhood is judged to be unusually conserved or unusually variable on an absolute basis if its average entropy value is, respectively, significantly less than or greater than the average entropy for structural neighborhoods derived from representative PDB structures and their corresponding phylogenetic properties. Representative structures from the PDB have been defined by others on the basis of fold families (see Holm and Sander, Science 273:595), and are available through the FSSP database at EMBL. Structural neighborhoods from about 600 representative structure families have been compiled and analyzed for phylogenetic entropy.
To determine conservation relative to other structural neighborhoods in the model protein, the average structural neighborhood entropy value is compared to the average and standard deviation of the entropy for all of the residues in the model protein polypeptide chain that contains the model amino acid residue. A conventional significance statistic can be computed as:
Neighborhood Relative Entropy = (<En>-<Ec>)/(S.D. En)
Where:
N = number of residues in the structural neighborhood
<En> = average entropy of residues in the structural neighborhood
<Ec> = average entropy of residues in the polypeptide chain that contains the model amino acid residue
S.D. Ec = standard deviation in the entropy for residues in the polypeptide chain that contains the model amino acid residue.
S.D. En = S.D. Ec/√N = standard deviation in the average entropy for samples of N residues chosen from the same chain as the model amino acid residue.
This value is reported. Variances are ascribed potential structural and functional consequences or not if they occur in the structural neighborhoods that are extremely well conserved or highly variable, respectively, compared to the entropy values for residues in the polypeptide chain containing the model amino acid residue. Other relative measures can be used, e.g., t-distribution value.
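The significance statistic defined above can be sketched as follows. The use of the population standard deviation for S.D. Ec is an assumption of this sketch, since the text does not say whether the population or sample form is intended, and the function name `neighborhood_relative_entropy` is hypothetical.

```python
import math
import statistics


def neighborhood_relative_entropy(neighborhood_entropies, chain_entropies):
    """(<En> - <Ec>) / S.D. En, where S.D. En = S.D. Ec / sqrt(N).

    neighborhood_entropies: entropies of the N residues in the structural neighborhood.
    chain_entropies: entropies of all residues in the chain containing the model residue.
    """
    n = len(neighborhood_entropies)
    mean_n = statistics.fmean(neighborhood_entropies)
    mean_c = statistics.fmean(chain_entropies)
    sd_c = statistics.pstdev(chain_entropies)  # population S.D. (assumption)
    if sd_c == 0:
        raise ValueError("chain entropies have zero spread")
    sd_n = sd_c / math.sqrt(n)
    return (mean_n - mean_c) / sd_n
```

A strongly negative value indicates a neighborhood that is well conserved relative to the rest of the chain, and a strongly positive value indicates a highly variable neighborhood.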
While phylogenetic entropy can be used as a feature, those skilled in the art can use other measures of the phylogenetic variability of the polymorphic amino acid residue and the model amino acid residue. These measures are related to the variability in the amino acid present at a selected position in a selected set of related proteins.
Features Related to the Crystallographic B-Factor
A similar treatment is applied to the crystallographic B-factors (if available) to identify parts of the model protein that are unusually rigid and therefore relatively intolerant to amino acid variation. (If B-factors are not available, other measures of molecular rigidity can be used instead, e.g., ensemble statistics from NMR.) B-factors are calculated for each residue in the model structure as the average of its atomic B-factors and compared first to absolute standards of low and high B-factor values (e.g., estimated at 15.0 Å² and 45.0 Å², respectively) computed from structural neighborhoods in a representative set of PDB structures. Subsequently, a relative measure of the model residue B-factor is determined by comparison to the mean and standard deviation for residues in the model protein. Other relative measures can also be used, e.g., rank order. As above, model amino acid residues with significantly low or high B-factors relative to other residues in the model protein are judged to be relatively intolerant or tolerant, respectively, to amino acid variation. With similar interpretations of low and high values, the average B-factor of the model amino acid residue's structural neighborhood can be computed and compared to absolute standards of low and high B-factors compiled for structural neighborhoods in representative PDB structures. Finally, a measure of the structural neighborhood B-factor relative to the model protein itself is determined by comparing the average of the residue B-factors for the structural neighborhood of the model amino acid residue to the average and standard deviation in the B-factors for residues in the polypeptide chain of the model amino acid residue. The significance of the structural neighborhood's average B-factor is computed as:
Neighborhood Relative B-Factor = (<Bn>-<Bc>)/(S.D. Bn)
Where:
N = number of residues in the structural neighborhood
<Bn> = average residue B-factor for residues in the structural neighborhood
<Bc> = average residue B-factor for residues in the polypeptide chains involved in the structural neighborhood
S.D. Bc = standard deviation in the residue B-factor for residues in the polypeptide chains involved in the structural neighborhood
S.D. Bn = S.D. Bc/√N = standard deviation in the average B-factor for samples of N residues chosen from the same chain as the model amino acid residue.
This value is reported. Modeled variances in structural neighborhoods of significantly low average residue B-factor (e.g., 2.0 S.D. Bn below the chain average) are judged to be in sufficiently rigid environments that they may have structural and functional consequences. Modeled variances in structural neighborhoods of significantly high average residue B-factor (e.g., 2.0 S.D. Bn above the chain average) are judged to be in sufficiently flexible environments that they may not have structural or functional consequences. Other relative measures can be used, e.g., t-distribution value.
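The per-residue B-factor treatment can be sketched as follows: a residue B-factor is taken as the mean of its atomic B-factors, compared with the suggested absolute cutoffs of 15.0 and 45.0 Å², and then expressed relative to the other residues of the chain. The function names and the use of the population standard deviation are assumptions of this sketch.

```python
import statistics


def residue_b_factor(atomic_b_factors):
    """Residue B-factor as the average of the B-factors of its atoms."""
    return statistics.fmean(atomic_b_factors)


def classify_b_factor(value, low=15.0, high=45.0):
    """Compare a B-factor with the suggested absolute low and high standards."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "intermediate"


def relative_b_factor(residue_b, chain_residue_bs):
    """Number of standard deviations between a residue B-factor and the chain mean."""
    mean_c = statistics.fmean(chain_residue_bs)
    sd_c = statistics.pstdev(chain_residue_bs)  # population S.D. (assumption)
    if sd_c == 0:
        raise ValueError("chain B-factors have zero spread")
    return (residue_b - mean_c) / sd_c
```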
The B-factor is related to the flexibility of the region of a polypeptide being analyzed. Thus, the B-factor can be replaced in the methods of the invention by another suitable measure of flexibility. For example, where NMR data is available the B-factor can be replaced by the r.m.s. deviation in residue position for an ensemble of structures or experimental determinations in the coupling constants and relaxation times of atoms that are diagnostic for mobility.
Reporting and Display of Results
The methods of the invention can report the outcome of the quality and function tests for each of the variances during the course of the analysis and produce a graphical representation of the model protein, generated as script for a molecule rendering program, e.g., RasMol. In the standard representation, the protein structure can be represented by ribbons while the modeled variances, heterogens, and residues corresponding to PROSITE matches in the model structure are displayed in space filling representation. Residue labels are added for the modeled variances. Finally, all of the output, including the graphical representation can be converted to a web browser readable form.
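As a hedged illustration of the graphical output, a short script for a molecule rendering program could be generated along the following lines. The sketch assumes RasMol-style commands (`load`, `select`, `wireframe`, `ribbons`, `spacefill`, `label`) and a single-chain structure addressed by residue number; the function name `rasmol_script`, its parameters, and the file name in the usage note are hypothetical, and the exact commands should be checked against the rendering program actually used.

```python
def rasmol_script(pdb_path, variance_residues, motif_residues=()):
    """Build a RasMol-style script: ribbons for the protein, space filling
    for heterogens, motif residues, and modeled variances, plus labels.

    pdb_path: path of the model structure file (hypothetical input).
    variance_residues: residue numbers of the modeled variances.
    motif_residues: residue numbers matching a motif such as a PROSITE entry.
    """
    lines = [
        f"load {pdb_path}",
        "select all",
        "wireframe off",
        "ribbons on",
        "select hetero and not water",
        "spacefill on",
    ]
    for number in motif_residues:
        lines += [f"select {number}", "spacefill on"]
    for number in variance_residues:
        lines += [
            f"select {number}",
            "spacefill on",
            # Label only the alpha carbon so each residue gets a single label.
            f"select {number} and *.CA",
            "label %n%r",
        ]
    return "\n".join(lines)
```

Calling `rasmol_script("model.pdb", [150])` yields a script that loads the structure, draws ribbons, and displays and labels residue 150 in space filling representation.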
Values Assigned to Features
In the methods of the invention, various features are quantified. The following list provides suggested values for cutoffs. These are only suggested values. Those skilled in the art can select other cutoff values that are appropriate for specific situations.
Buried Charge: The model amino acid residue is inaccessible and the actual variance includes a charged residue. The values are yes or no.
Conserved Position: The model amino acid residue is absolutely conserved in the phylogenetic analysis. The values are yes or no.
Inaccessible: The model amino acid residue has less than 10 Å² (~1 water molecule) exposure to the solvent. The value is the solvent accessible area in Å². There can be approximately 1 water molecule per 10 Å² of solvent accessible surface. A model amino acid residue can also be defined as inaccessible if it has a low value for its relative accessibility, e.g., less than 0.2.
Interface: The modeled variance is within 5.0 Å of at least one residue in a different polypeptide chain in the coordinates. The values are yes or no.
Near Conserved: The model amino acid residue is within 5.0 Å of a residue that is absolutely conserved in the phylogenetic analysis. The values are yes or no.
Near Heterogen Atom: The model amino acid residue is within 5.0 Å of a heterogen atom. The value is a distance in Angstroms.
Near Other Variances: The model amino acid residue is within 5.0 Å of at least one other model amino acid residue. The value is a distance in Angstroms.
Near Sequence Prosite: The model amino acid residue is within 5.0 Å of a residue in the coordinates that matches a PROSITE entry that is also matched by the target protein. The value is a distance in Angstroms.
Near Structure Prosite: The model amino acid residue is within 5.0 Å of a residue in the model structure that matches a PROSITE entry that is NOT matched by the target protein. The value is a distance in Angstroms.
Rare Amino Acid: At least one of the residues encoded by the variance is found not more than 10% of the time in the phylogenetic profile for the model amino acid residue. The values are yes or no.
Helix Breaking: The polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of helical secondary structure. The values are yes or no.
Turn Breaking: The polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of turn secondary structure. The values are yes or no.
Unusual Amino Acid: At least one of the residues encoded by the polymorphic target amino acid residue is not found in the phylogenetic profile for the polymorphic target amino acid residue. This parameter can be approximated using the phylogenetic profile of the model amino acid residue, e.g., from the HSSP file. The values are yes or no. This parameter can also be assessed using classes, as described above.
Low or High B-Factor: The average B-factor for the model amino acid residue is less than 15.0 or greater than 45.0. Lower values mean less motion for that residue in the crystal structure.
Low or High Relative B: The average B factor for the model amino acid residue is at least 2 standard deviations above or below the average B factor for residues in the polypeptide chain of the model amino acid residue. The value is the number of standard deviations.
Low or High Neighbor B: The average B factor for the structural neighborhood of the model amino acid residue is less than 15.0 or greater than 45.0.
Low or High Neighbor Relative B: The average B factor for the structural neighborhood of the model amino acid residue is at least 2 S.D. (S.D. as defined above) less than or greater than the average B factor for residues in the polypeptide chain of the model amino acid residue. The value is the number of standard deviations.
Low or High Phylogenetic Entropy: The entropy of the model amino acid residue is less than 0.5 or greater than 2.0. The value is in entropy units ranging from 0.0, meaning absolute conservation, to ln 20 ≈ 3.0, meaning no conservation.
Low or High Relative Phylogenetic Entropy: The entropy of the model amino acid residue is at least 2.0 S.D. less than or greater than the average phylogenetic entropy for residues in the target protein.
Low or High Neighborhood Entropy: The average entropy for the structural neighborhood of the model amino acid residue is less than 0.5 or greater than 2.0.
Low or High Relative Neighborhood Entropy: The average entropy for the structural neighborhood of the model amino acid residue is at least 2.0 S.D. (S.D. as defined above) smaller or greater than the average entropy for the polypeptide chain of the model amino acid residue. The value is the number of S.D.
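As an illustration of the entropy scale used by these features, the sketch below computes an entropy from an amino acid frequency profile and applies the 0.5 and 2.0 thresholds. It assumes the entropy is the Shannon entropy with natural logarithms, which is consistent with the stated range of 0.0 to ln 20 but is not confirmed by this passage. The function names are hypothetical.

```python
import math


def phylogenetic_entropy(profile):
    """Shannon entropy (natural log) of an amino acid frequency profile.

    profile maps one-letter amino acid codes to frequencies; the
    frequencies are renormalized so that they sum to 1.
    """
    total = sum(profile.values())
    entropy = 0.0
    for freq in profile.values():
        if freq > 0:
            p = freq / total
            entropy -= p * math.log(p)
    return entropy


def entropy_flag(entropy, low=0.5, high=2.0):
    """Return 'low', 'high', or None using the thresholds quoted in the text."""
    if entropy < low:
        return "low"
    if entropy > high:
        return "high"
    return None
```

A fully conserved position gives 0.0 and a position with all 20 amino acids equally represented gives ln 20.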
Use of Features as Predictor Variables
Various features of the model amino acid residue described above can be used as predictor variables in quantitative, statistical models for assessing whether a polymorphism will have an effect on protein structure or function. The statistical models rely on actual experimental data concerning the effects of variances on protein activity. The predictive models can employ continuous values (or discrete approximations of continuous values, e.g., high, medium, and low B-factor) of the predictor features. Described below are two statistical models for predicting whether a polymorphic target amino acid residue will affect protein structure or function by assessing some or all of the features of modeled variances. The features in the predictive methods are slightly adapted from their definitions above. Other statistical models for predicting the effects of polymorphic target amino acid residues can be used, and are indicated by reference at the end of this section. These alternative methods can use some or all of the same predictive features of model amino acid residues.
The features used for predictions fall into two broad categories: environment features and categorical features. The class of categorical features is further divided into polymorphism-specific categorical features and special case categorical features. Each of these different features is briefly described below along with an example of how the feature is valued.
Environment Features: All environment features can be used in continuous or categorical forms, with or without normalization, depending on the statistical approach employed for the predictions.
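One way to move an environment feature from continuous to categorical form is to cut it into bins that each hold an equal number of training polymorphisms. The sketch below shows that idea in Python. The function names and the rule for placing cut points are assumptions made for illustration, not details taken from this document.

```python
import bisect


def equal_count_bin_edges(training_values, n_bins):
    """Cut points that split the training values into n_bins groups
    of (nearly) equal size."""
    ordered = sorted(training_values)
    n = len(ordered)
    # The k-th cut point is the value at the start of the k-th bin.
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def assign_bin(value, edges):
    """Bin index (0-based) of a value: the number of cut points <= value."""
    return bisect.bisect_right(edges, value)
```

With three bins, the indices 0, 1, and 2 can be read as low, medium, and high values of the feature.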
Solvent Accessibility: This is a measure of the accessibility of the model amino acid residue to solvent. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
Relative Accessibility: This is a measure of the accessibility of the model amino acid residue to solvent relative to the maximum accessibility of that residue in a peptide of specified composition, typically a polyalanine polypeptide. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
Relative B-Factor: This is a measure of the crystallographic B-factor of the model amino acid residue normalized to the average and standard deviation of the B-factor for other residues in the same polypeptide chain of the model protein. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
Neighborhood Relative B-Factor: This feature is a measure of the statistical significance (defined as above for the same feature) of the average B-factor of the model amino acid residue's structural neighborhood relative to the average B-factor of the model amino acid residue's polypeptide chain. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
Neighborhood Relative Entropy: This feature is a measure of the statistical significance (defined as above for the same feature) of the average phylogenetic entropy of the model amino acid residue's structural neighborhood relative to the average phylogenetic entropy of the model amino acid residue's polypeptide chain. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
Polymorphism-Specific Categorical Features: These features are variance-specific because they are related to the identity of the amino acids that comprise the polymorphic target amino acid residue. In the statistical modeling approaches described below these features are given the value 1 (or yes) if the polymorphism meets the specified criteria, and 0 (or no) otherwise.
Unusual Amino Acid: One of the amino acids of the polymorphism is not found in the phylogenetic profile of the target variable residue. This feature can be approximated by examination of the model amino acid residue's phylogenetic profile, e.g., from the HSSP file.
Unusual Amino Acid by Class: One of the amino acids of the polymorphism is not found in the minimum profile from Adams et al. (Protein Science 5:1240, 1996) that includes all of the amino acids in the phylogenetic profile of the polymorphic target amino acid residue. This feature can be approximated from the model amino acid residue's phylogenetic profile, e.g., from the HSSP file. The feature is preferably used when the multiple alignment contains relatively few sequences. Classification schemes other than that proposed by Adams et al. can also be used.
Conserved Position: The model amino acid residue is conserved in the phylogeny.
Buried Charge: The model amino acid residue is inaccessible to solvent and one of the amino acids of the polymorphic target amino acid residue is charged.
Turn Breaking: The model amino acid residue has turn secondary structure and one of the amino acids of the polymorphism is glycine or proline.
Helix Breaking: The model amino acid residue has helical secondary structure assignments and one of the amino acids of the polymorphism is glycine or proline.
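The polymorphism-specific categorical features above can each be written as a small yes/no test on the pair of amino acids and the model residue. The following Python sketch is illustrative only. The `ModelResidue` class, the secondary structure labels, the `buried_cutoff` parameter, and the choice of Asp, Glu, Lys, and Arg as the charged set are all assumptions; the text does not specify an accessibility cutoff for "inaccessible to solvent".

```python
from dataclasses import dataclass, field

BREAKERS = {"G", "P"}            # glycine and proline
CHARGED = {"D", "E", "K", "R"}   # assumption: histidine is not counted


@dataclass
class ModelResidue:
    secondary_structure: str     # hypothetical labels: "helix", "turn", ...
    accessibility: float         # solvent accessibility of the model residue
    profile: dict = field(default_factory=dict)  # amino acid -> frequency


def helix_breaking(alleles, residue):
    """1 if the residue is helical and one allele is Gly or Pro, else 0."""
    return int(residue.secondary_structure == "helix"
               and bool(BREAKERS & set(alleles)))


def turn_breaking(alleles, residue):
    """1 if the residue is in a turn and one allele is Gly or Pro, else 0."""
    return int(residue.secondary_structure == "turn"
               and bool(BREAKERS & set(alleles)))


def buried_charge(alleles, residue, buried_cutoff=0.0):
    """1 if the residue is buried and one allele is charged, else 0.

    buried_cutoff is a hypothetical parameter, not a value from the text.
    """
    return int(residue.accessibility <= buried_cutoff
               and bool(CHARGED & set(alleles)))


def unusual_amino_acid(alleles, residue):
    """1 if any allele is absent from the phylogenetic profile, else 0."""
    return int(any(aa not in residue.profile for aa in alleles))
```

Each function returns 1 or 0, matching the yes/no valuation described for these features.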
Special Case Categorical Features: These features are related to special cases pertaining to the location of the model amino acid residue in the model structure that are not polymorphism specific. They are assigned the value 1 (or yes) if the model amino acid residue meets the specified criteria, and 0 (or no) otherwise.
Near Hetero Atom: The model amino acid residue is near (e.g., 5 Å) a heterogen atom (ligand) in the model protein.
Near Prosite: The model amino acid residue is near a PROSITE match (e.g., 5 Å) that is common to the target protein and the model protein.
Interface: The model amino acid residue is near (e.g., 5 Å) the interface between two or more subunits in the model protein.
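All three special case features are proximity tests with a distance cutoff such as 5 Å. A minimal Python sketch of such a test follows. It assumes that "near" means the smallest atom-to-atom distance between the two groups is within the cutoff; the text does not state which atoms are compared. The function names are hypothetical, and coordinates are plain (x, y, z) tuples in Å.

```python
import math


def min_distance(atoms_a, atoms_b):
    """Smallest distance between any atom in atoms_a and any atom in atoms_b."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def is_near(residue_atoms, other_atoms, cutoff=5.0):
    """1 if any atom of the residue lies within cutoff of the other group, else 0."""
    if not residue_atoms or not other_atoms:
        return 0
    return int(min_distance(residue_atoms, other_atoms) <= cutoff)
```

The `other_atoms` argument can hold heterogen atoms, atoms of residues in a PROSITE match, or atoms of a neighboring subunit, depending on which feature is being valued.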
Training Data Sets
Training datasets contain: 1) at least one amino acid variation in a protein for which there is sufficient structural information to assess at least one selected structural, phylogenetic, or physical feature, and 2) information describing the effect on protein function of each amino acid variation. Mutants of the E. coli lac repressor (Markiewicz et al. (1994) J. Mol. Biol. 240:421-433) or lysozyme (Rennell et al. (1991) J. Mol. Biol. 222:67) can be used as training datasets. Any other collection of polymorphisms, preferably unbiased, for which activity and structural information is available can also be used, even if the dataset contains polymorphisms of many different proteins.
Probabilistic Mode
The approach of the probabilistic mode is to view each variance as having a probability that it will affect protein structure or function. This outlook implicitly reflects the idea that the homology models are, in general, approximate descriptions. There may be some factors bearing on the variance's effect on structure or function that are not anticipated by the model. However, given enough unbiased data, such factors can be assessed in probabilistic terms through experimental data sets that examine the relationship between mutations and effects on protein structure or function. For example, one of the data sets used in the implementation of the methods of the invention involves over 4000 unbiased mutations in the E. coli lac repressor and their classification with respect to the repressor's biological function.
The probability values for the predictions combine measures of the intrinsic tolerance of the target protein's structure and function to amino acid variation at the polymorphic target amino acid residue, the nature of the chemical change caused by the variance, and additional classifications for the special cases of variances in particularly vulnerable locations in the model protein's structure. For computing the probability that a polymorphic target amino acid residue will affect target protein structure or function, training polymorphisms with feature values similar to the feature values of the model amino acid residue are collected from the training data set. The precise criteria for assessing feature value similarity between the training polymorphisms and the model amino acid residue are parameters of the prediction model. Typically, but not exclusively, these criteria are set so that the polymorphisms in the training set have environment feature values within some tolerance, e.g., 1 standard deviation, of the environment feature values of the model amino acid residue, and categorical feature values that are identical to the categorical feature values of the model amino acid residue.
The probability that the polymorphic target amino acid residue will affect target protein structure or function is defined as the proportion of residues in the sub-group of selected training polymorphisms that have effects on their own protein's structure or function. Defining the probabilities in this way assumes that the environment features and the categorical features calibrated for effects on structure and function with the training set have predictive meaning for the polymorphic target amino acid residue and target protein. It also assumes that the training polymorphisms represent an unbiased sampling of the effects on protein structure of polymorphisms with the specified feature values. In other words, the features are assumed to reflect generic properties of polymorphisms that are useful for evaluating their effect on protein function and the training polymorphisms are assumed to reflect typical behavior for amino acid variation. Empirically, this assumption is valid, at least for soluble, globular proteins and the lac repressor and lysozyme training datasets.
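The selection-and-proportion rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the invention: each training polymorphism is represented as a hypothetical tuple of (environment values, categorical values, has_effect), the per-feature `tolerances` are supplied by the caller, and the function returns `None` when no training polymorphism is similar enough.

```python
def probability_of_effect(query_env, query_cat, training, tolerances):
    """Proportion of similar training polymorphisms that have an effect.

    query_env   environment feature values of the model residue
    query_cat   categorical feature values (0/1) of the model residue
    training    iterable of (env_values, cat_values, has_effect)
    tolerances  allowed absolute difference for each environment feature
    """
    selected = [
        has_effect
        for env, cat, has_effect in training
        # Categorical features must match exactly; environment features
        # must each fall within the stated tolerance.
        if tuple(cat) == tuple(query_cat)
        and all(abs(e - q) <= t
                for e, q, t in zip(env, query_env, tolerances))
    ]
    if not selected:
        return None  # assumption: no estimate without similar training data
    return sum(selected) / len(selected)
```

Setting each tolerance to one standard deviation of the corresponding feature in the training set reproduces the typical criterion mentioned in the text.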
In principle, the greater the number of features used to parameterize the model amino acid residue and thus to select the subset of training polymorphisms for estimating the probability of an effect, the greater the accuracy of the probabilistic model. When more features are used, the selected training polymorphisms will be more similar to the model amino acid residue, which itself was selected to be similar to the polymorphic target amino acid residue on the basis of sequence similarity. However, the more features used to parameterize the model amino acid residue, the more difficult it becomes to identify enough training polymorphisms to make an adequate statistical comparison with the current training data sets. In addition, some of the features are strongly correlated with others and contribute little to the characterization of model amino acid residues, e.g., accessibility and relative accessibility. In practice with the current data sets, this means that about three of the six current environmental features and about three or four categorical features can be used. When they become available, larger training data sets will permit the use of more features for characterizing each polymorphism. The reduced set of features used for selecting polymorphisms in the probabilistic model is selected using standard maximum likelihood statistical methods. Formally, this entails computing the gain in the likelihood for predictions made on the training data with each possible combination of a few environment features and a few categorical features compared with predictions based on a more general hypothesis. In this case, the more general hypothesis defines a polymorphism's probability of affecting function as the proportion of polymorphisms in the entire training data set that have effects on function. The optimal set of parameters is the one that gives the maximum likelihood gain. This exhaustive procedure is very computationally intensive.
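The text does not give a formula for the likelihood gain. One plausible reading, sketched below in Python, treats each training outcome as a Bernoulli event and takes the gain as the difference in log-likelihood between the feature-based predictions and the baseline that assigns every polymorphism the overall proportion of effects. The function names and the clipping constant `eps` are assumptions. The generator at the end shows why the exhaustive search is expensive: every combination of environment and categorical features is a separate candidate.

```python
import math
from itertools import combinations


def log_likelihood(probs, outcomes, eps=1e-9):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += math.log(p) if y else math.log(1.0 - p)
    return total


def likelihood_gain(model_probs, outcomes):
    """Log-likelihood of the model minus that of the baseline hypothesis,
    where the baseline is the overall proportion of effects."""
    baseline = sum(outcomes) / len(outcomes)
    return (log_likelihood(model_probs, outcomes)
            - log_likelihood([baseline] * len(outcomes), outcomes))


def candidate_feature_sets(env_features, cat_features, n_env, n_cat):
    """Yield every (environment subset, categorical subset) pair to be scored."""
    for env in combinations(env_features, n_env):
        for cat in combinations(cat_features, n_cat):
            yield env, cat
```

Choosing three of six environment features and three of four categorical features already gives 80 candidate sets, each of which requires predictions over the whole training set.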
Computational time can be reduced by exploiting the observed strong effect of the environment features on the likelihood calculation. This observation leads to an approximate, stepwise procedure for maximizing the likelihood in which: first, the environment features that alone maximize the likelihood are identified, and second, in conjunction with the selected optimized environment features, an optimal set of categorical features is identified. Applying this approximate procedure on two training data sets showed that the best environment features typically include one of the two accessibility features, one of the two B-factor features, and one of the two phylogenetic entropy features, as might be expected. Other statistical methods, e.g., discriminant function analysis, can also be used to choose the dominant features.
As an alternative, the number of parameters used can be reduced by the standard statistical method of principal component analysis. For example, the complete set of six environment features for all polymorphic residues in the training set is transformed to its principal components, and then just one or a few of the stronger principal components (those with the larger eigenvalues, with or without realignment with the original environment features) are used instead of all of the environment features. The probability of a target polymorphism having an effect on protein structure and function is determined as described above with the chosen principal components replacing the environment features in the computation.
Classification Mode
The problem of relating predictor variables to a categorical outcome is well described by the statistical methods for building classification trees (see Breiman et al. (1984) Classification and Regression Trees (Wadsworth: Belmont)). Through these methods, the influence of each continuous or categorical predictor on an outcome is statistically assessed, ranked, and used to build a tree that optimally classifies the categorical outcome in a training data set. In this case, the environment features and the polymorphism's categorical features from the structural model are the predictor variables, and the categorical outcome is whether the polymorphism will or will not affect protein structure or function. The QUEST classification tree method and program (Loh et al. (1997) Statistica Sinica 7:815) has been used to test classification tree analysis in predicting the effects of modeled polymorphisms on structure and function.
Implementing QUEST for these predictions is straightforward. QUEST will directly accept both the continuously valued environment features and the categorical features as predictor variables. As with the probabilistic model, it proves useful to limit the number of variables to three environment features (or to use principal components) and a selection of the other, categorical features in order to accommodate the limited size of the training data sets. QUEST uses ANOVA F-statistics to select variables and to define "split" values in each continuous parameter for optimal classification. Trees are then constructed with the selected variables and "split" criteria, and then pruned subject to node size criteria and cross-validation tests. Once the optimal tree is delineated, target polymorphisms can be assessed for whether they are predicted to affect protein structure or function. Typical application of QUEST involves running it in default mode with the exception that the minimum node size is often increased, e.g., by a factor of about four, to simplify the classification trees without a serious loss of accuracy.
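To make the role of the ANOVA F-statistic concrete, the Python sketch below computes a one-way F-statistic for each continuous feature, grouping training polymorphisms by outcome, and ranks the features by that statistic. This is not the QUEST program and does not reproduce its split selection or pruning; it only illustrates the kind of statistic the text says QUEST uses for variable selection. The function names and the data layout are hypothetical.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features_by_f(features, outcomes):
    """Order feature names from most to least discriminating.

    features  dict mapping a feature name to its list of values
    outcomes  list of 0/1 flags (1 = affects structure or function)
    """
    scores = {}
    for name, values in features.items():
        affected = [v for v, y in zip(values, outcomes) if y]
        neutral = [v for v, y in zip(values, outcomes) if not y]
        scores[name] = anova_f([affected, neutral])
    return sorted(scores, key=scores.get, reverse=True)
```

A feature whose values separate the two outcome groups cleanly receives a large F value and is ranked first.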
A guiding principle in the construction of classification trees with continuous predictor variables is that the "split" values for the predictors should make sense to a user familiar with the underlying scientific issues. But, in practice, applying the environment features as continuous variables in the automated QUEST method can lead to classification trees that are excessively branched and hard to interpret. An alternative way to implement the method involves categorizing the values of each of the environment features into a reasonable number of groups. For example, each environment feature can be categorized into high, medium, and low values. The categorized environment features can then be used with the other categorical features to construct simplified and robust classification trees by QUEST.
Other Statistical Models
Other statistical methods can be used in the analysis of the environment and categorical features, and in their application for predicting whether target polymorphisms will affect protein structure or function. These include but are not limited to: discriminant function analysis for selection of environment features for each combination of categorical features (e.g., see StatSoft Inc., Electronic Text Book, http://www.statsoft.com, Chapter on Discriminant Analysis), and logistic regression of the environment features for each combination of categorical features (e.g., see Montgomery and Peck (1992), Introduction to Linear Regression Analysis, Wiley, NY, Chapter 6). Some implementations might use neural nets or related models to assimilate training data for predictions of effects on structure or function caused by polymorphic target amino acid residues.
Implementation
A computer program for the automated structural modeling and functional analysis can be written in any suitable language, e.g., Python 1.4. Programs and supporting files (e.g., databases) that are incorporated into the analysis are available. The program can be run on suitable computer systems known to those skilled in the art, for example, a Silicon Graphics O2 workstation operating under IRIX v. 6.5. Useful databases include: the Protein Data Bank (PDB) of macromolecular structures and sequences corresponding to the structures; the Homology-Derived Secondary Structure of Proteins (HSSP; EMBL) database; the PROSITE database (EXPASY; currently using release 15) of profiles and patterns. Useful software for implementing certain features of the method include: BLAST 2.0.6 sequence alignment and database searching software (NCBI); RasMol 2.6.4 (Roger Sayle) program for visualizing (rendering) and annotating homology models; Chime (MDL, Inc.) http plug-in module for visualizing the models in a web browser; and Pfscan 1.0 software (Philipp Bucher; Swiss Institute for Experimental Cancer Research) for comparing amino acid sequences to PROSITE profiles.
The methods of the invention are not limited to use with any particular hardware/software configuration. They may find applicability in any computing or processing environment. The methods of the invention may be implemented in computer programs executing on programmable computers that each includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform the methods and to generate output information for display.
Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language. The language may be a compiled or an interpreted language.
Each computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the methods. The methods may also be implemented as a computer-readable storage medium, configured with a computer program, where, upon execution, instructions in the computer program cause the computer to operate in accordance with the methods.
Applications
The methods of the invention are useful in a number of areas beyond simply making predictions about the effect of a known or theoretical polymorphism. For example, the methods of the invention can be used for identification and analysis of amino acid polymorphisms that affect the structure or function of proteins involved directly or indirectly in the action of pharmaceutical or diagnostic agents. The methods can be used in the identification and analysis of structural or functional interactions between two or more amino acid polymorphisms in a protein of interest (e.g., in analysis of haplotypes).
The methods of the invention can be used to identify and analyze polymorphisms that have an effect on a catalytic activity of the protein of interest or a non-catalytic activity of the protein of interest (e.g., structure, stability, binding to a second protein or polypeptide chain, binding to a nucleic acid molecule, binding to a small molecule, and binding to a macromolecule that is neither a protein nor a nucleic acid).
The methods of the invention can also be used in the identification and analysis of candidate polymorphisms for polymorphism-specific targeting by pharmaceutical or diagnostic agents, for the identification and analysis of candidate polymorphisms for pharmacogenomic applications, and for experimental biochemical and structural analysis of pharmaceutical targets that exhibit amino acid polymorphism.
In addition, the methods of the invention can be used to identify amino acid substitutions that can be made to engineer the structure or function of a protein of interest (e.g., to increase or decrease a selected activity or to add or remove a selected activity).
The methods can also be used for the prospective or retrospective identification and analysis of alterations in a biological property related to a polymorphism.
EXAMPLES
Example 1: Annotation Mode
The method of the invention was used to analyze a number of polymorphic amino acid residues in lac repressor. In this example, the annotation mode was used and purine repressor was selected as the model protein. Reproduced below is a portion of the output of a computer program used to implement one embodiment of the method of the invention. In order, the output provides: a list of the polymorphic amino acids analyzed, an alignment of each region of lac repressor containing a polymorphic residue with the corresponding region of the model protein, a summary of the number of amino acid residues that are identical in each aligned region, a summary of the PDB file information for purine repressor used in the analysis, a PROSITE report for the model protein, a summary of the alignment of the amino acid residues in the neighborhood of each model amino acid residue with the amino acid residues in the neighborhood of the corresponding polymorphic amino acid residue, a summary of the determinations made for each model amino acid residue (including: distance to conserved motifs, distance to heterogens, distances between model amino acid residues, entropy, secondary structure, neighborhood entropy, B-factor, relative B-factor, neighborhood B-factor, and relative neighborhood B-factor), a list of the features on which determinations can be made, and a list of determinations made for each model amino acid residue.
Output:
Description: LACTOSE OPERON REPRESSOR
Variance # p_val nt_pos aa_pos AAs Source Index
1 1.000 0 56 Leu (1) Ser (1) LacR_Muts 1
2 1.000 0 172 Glu (1) Gly (1) LacR_Muts 1
3 1.000 0 247 Asp (1) Lys (1) LacR_Muts 1
4 1.000 0 298 Gln (1) Ala (1) LacR_Muts 1
5 1.000 0 316 Asn (1) Ser (1) LacR_Muts 1
2pua_A 3e-34 # variances: 4 mol:protein length: 340 Purine Repressor
Alignment 1
Identical AA: 96 Total AA: 307
56 VAQQ L AGKQ 53 VARS L KVNH
172 RLGV E HLVA 170 YMAG R YLIE
247 LVAN D QMAL 247 FCGG D IMAM
298 RLLG Q TSVD 298 DSLG E TAFN
316 -1 No Overlap
Alignment Summary for Chosen Model from BLAST report
Model based on 2pua_A has:
P value: 3e-34
# of variances matched 4
| Summary of Modeled Variances for P03023 |
This model uses BLAST entry 2pua_A and PDB coordinates 2pua
Variance 56 Alignment # 1 Quality of alignment (Identical AA/Total AA): 96/307
Quality of match (Identical AA/Total AA): 3/9
Sequence of Query VAQQ L AGKQ
Sequence of Model VARS L KVNH
Variance 172 Alignment # 1 Quality of alignment (Identical AA/Total AA): 96/307
Quality of match (Identical AA/Total AA): 1/9
Sequence of Query RLGV E HLVA
Sequence of Model YMAG R YLIE
Variance 247 Alignment # 1 Quality of alignment (Identical AA/Total AA): 96/307
Quality of match (Identical AA/Total AA): 3/9
Sequence of Query LVAN D QMAL
Sequence of Model FCGG D IMAM
Variance 298 Alignment # 1 Quality of alignment (Identical AA/Total AA): 96/307
Quality of match (Identical AA/Total AA): 3/9
Sequence of Query RLLG Q TSVD
Sequence of Model DSLG E TAFN
Checking for coordinates
uncompress /pdb/pdb/2pua.pdb
PDB coordinates already exist for 2pua
HSSP file exists for 2pua
compress /pdb/pdb/2pua.pdb
| Features of Coordinates Used in Models |
******************************************
PDB Header:
COMPLEX (DNA-BINDING PROTEIN/DNA) 04-OCT-97 2PUA
PDB Title:
CRYSTAL STRUCTURE OF THE LACI FAMILY MEMBER, PURR, BOUND TO DNA: MINOR GROOVE BINDING BY ALPHA HELICES
PDB Compound:
MOL_ID: 1; MOLECULE: PURINE REPRESSOR; CHAIN: A; ENGINEERED: YES; MUTATION: R190A; BIOLOGICAL_UNIT: HOMODIMER; OTHER_DETAILS : METHYLPURINE-PUR-OPERATOR; MOL_ID: 2; MOLECULE: DNA; CHAIN: B; ENGINEERED: YES;
OTHER_DETAILS : PURINE REPRESSOR BOUND TO 6-METHYLPURINE AS COREPRESSOR AND PERFECT PALINDROME PURF OPERATOR
**********************************
| PROSITE REPORT FOR COORDINATES |
**********************************
Prosite Report for chain 'A'
Scan Summary
Prosite ID Score Start End Description
*** NO prosite Scanning Matches ***
Search Summary
Prosite ID Score Start End Description
PS00356 1.000 4 22 HTH LACI FAMILY
Making model P03023__2pua.rsml with coordinates /pdb/pdb/2pua.pdb
*********************
| QUALITY OF MODELS |
*********************
| Model based on 2pua |
Variance in primary is 56
Variance list in model is [('A', 54, 0)] Modeled Variance ('A', 54, 0)
Neighbors aligned in alignment 0 for variance 56
Neighborhood Residue Aligned Residue Same in primary and model?
51R ('A', 49, 'A') no
52V ('A', 50, 'V') yes
53A ('A', 51, 'A') yes
54Q ('A', 52, 'R') no
55Q ('A', 53, 'S') no
56L ('A', 54, 'L') yes
57A ('A', 55, 'K') no
58G ('A', 56, 'V') no
59K ('A', 57, 'N') no
Neighbors not aligned in alignment 0 for variance 56 ('B', 707) ('B', 708)
Summary of neighborhood for this alignment:
Number of residues in Neighborhood of radius 5.0 A is 11 Of these, 9 residues are covered by alignment Of these, 3 residues are the same
Variance in primary is 172
Variance list in model is [('A', 171, 0)] Modeled Variance ('A', 171, 0)
Neighbors aligned in alignment 0 for variance 172
Neighborhood Residue Aligned Residue Same in primary and model?
168R ('A', 167, 'Y') no
169L ('A', 168, 'M') no
170G ('A', 169, 'A') no
171V ('A', 170, 'G') no
172E ('A', 171, 'R') no
173H ('A', 172, 'Y') no
174L ('A', 173, 'L') yes
175V ('A', 174, 'I') no
176A ('A', 175, 'E') no
177L ('A', 176, 'R') no
204Y ('A', 203, 'A') no
Neighbors not aligned in alignment 0 for variance 172 ('A', 340)
Summary of neighborhood for this alignment:
Number of residues in Neighborhood of radius 5.0 A is 12. Of these, 11 residues are covered by alignment Of these, 1 residues are the same
Variance in primary is 247
Variance list in model is [('A', 248, 0)] Modeled Variance ('A', 248, 0)
Neighbors aligned in alignment 0 for variance 247
Neighborhood Residue Aligned Residue Same in primary and model?
244V ('A', 245, 'C') no
245A ('A', 246, 'G') no
246N ('A', 247, 'G') no
247D yes
248Q no
249M yes
250A yes
251L no
Figure imgf000037_0001
276T no
279S no
282Y no
286S no
288T yes
Figure imgf000037_0002
Summary of neighborhood for this alignment:
Number of residues in Neighborhood of radius 5.0 A is 18 Of these, 18 residues are covered by alignment Of these, 8 residues are the same
Variance in primary is 298
Variance list in model is [('A', 299, 0)] Modeled Variance ('A', 299, 0)
Neighbors aligned in alignment 0 for variance 298
Neighborhood Residue Aligned Residue Same in primary and model?
86R ('A', 84, 'N') no
294R ('A', 295, 'D') no
295L ('A', 296, 'S') no
296L ('A', 297, 'L') yes
297G ('A', 298, 'G') yes
298Q ('A', 299, 'E') no
299T ('A', 300, 'T') yes
300S ('A', 301, 'A') no
301V ('A', 302, 'F') no
302D ('A', 303, 'N') no
303R ('A', 304, 'M') no
Summary of neighborhood for this alignment:
Number of residues in Neighborhood of radius 5.0 A is 11 Of these, 11 residues are covered by alignment Of these, 3 residues are the same
HSSP file already exists
| Summary of Entropy Statistics for Coordinates from HSSP file |
Chain Average Entropy
'A' 1.612
*********************************
I FUNCTION OF MODELED VARIANCES | *********************************
| ==> Function I for Model 2pua: Proximity to Prosite, Het Atoms, and Other Variances |
| Variance 56: Leu Ser |
Proximity to Prosite Features
Modeled Variance: residue 54 in PDB chain 'A' from BLAST alignment 1
For Chain A
Prosite Scan:
*** No Prosite Scan Matches ***
Prosite Search: PS00356
4 - 22 19 12.7
Nearest Het Atoms
Modeled Variance ('A', 54, 0)
Hetero_Group Type Distance Atom
('A', 599) 6MP 35.7 C7
Variance 172: Glu Gly
Proximity to Prosite Features
Modeled Variance: residue 171 in PDB chain 'A' from BLAST alignment 1
For Chain A
Prosite Scan:
*** No Prosite Scan Matches ***
Prosite Search: PS00356
4 - 22 21 60.3
Nearest Het Atoms
Modeled Variance ('A', 171, 0)
Hetero_Group Type Distance Atom
('A', 599) 6MP 18.6 C8
| Variance 247: Asp Lys |
Proximity to Prosite Features
Modeled Variance: residue 248 in PDB chain 'A' from BLAST alignment 1
For Chain A
Prosite Scan:
*** No Prosite Scan Matches ***
Prosite Search: PS00356
4 - 22 21 48.6
Nearest Het Atoms
Modeled Variance ('A', 248, 0)
Hetero_Group Type Distance Atom
('A', 599) 6MP 6.4 N9
Variance 298: Gln Ala
Proximity to Prosite Features
Modeled Variance: residue 299 in PDB chain 'A' from BLAST alignment 1
For Chain A
Prosite Scan:
*** No Prosite Scan Matches ***
Prosite Search: PS00356
4 - 22 19 30.
Nearest Het Atoms
Modeled Variance ('A', 299, 0)
Hetero_Group Type Distance Atom
('A', 599) 6MP 16.0 C8
Distances between modeled variances in Angstroms
Figure imgf000040_0001
| ==> Function II for Model 2pua: Intrinsic properties of variant position |
| Variance 56: Leu Ser |
For variance modeled by residue 54 in PDB chain 'A' from BLAST alignment 1:
Phylogeny
Conservation
Entropy: 0.734
Rel. Entropy: 24
Weight: 1.34
Amino Acid Profile at this Residue
L : 0.84
R : 0.03
V : 0.03
F : 0.03
K : 0.03
M : 0.03
A : 0.03
Structure
Secondary Structure: Alpha Helix
Accessibility: 144 A^2 (# waters * 10)
| Variance 172: Glu Gly |
For variance modeled by residue 171 in PDB chain 'A' from BLAST alignment 1:
Phylogeny
Conservation
Entropy: 2.035
Rel. Entropy: 68
Weight: 0.86
Amino Acid Profile at this Residue
R : 0.28
K : 0.19
Q : 0.13
E : 0.13
T : 0.06
N : 0.06
D : 0.06
V : 0.03
L : 0.03
H : 0.03
Structure
Secondary Structure: Alpha Helix
Accessibility: 58 A^2 (# waters * 10)
Variance 247: Asp Lys |
For variance modeled by residue 248 in PDB chain 'A' from BLAST alignment 1:
Phylogeny
Conservation
Entropy: 0.347
Rel. Entropy: 12
Weight: 1.49
Amino Acid Profile at this Residue
D : 0.91
S : 0.06
N : 0.03
Structure
Secondary Structure: Alpha Helix
Accessibility: 0 A^2 (# waters * 10)
Variance 298: Gln Ala
For variance modeled by residue 299 in PDB chain 'A' from BLAST alignment 1:
Phylogeny
Conservation
Entropy: 1.995
Rel. Entropy: 67
Weight: 0.84
Amino Acid Profile at this Residue
A : 0.26
E : 0.23
K : 0.11
R : 0.09
T : 0.09
Q : 0.09
H : 0.06
S : 0.06
V : 0.03
Structure
Secondary Structure: Alpha Helix
Accessibility: 91 A^2 (# waters * 10)
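The "Rel. Entropy" values reported above appear consistent, to within rounding, with the entropy expressed as a percentage of ln(20), the maximum entropy for 20 equally likely amino acids. The following is a minimal sketch under that assumption. The function names are hypothetical, and the unweighted Shannon entropy shown here will not exactly reproduce the reported entropies, which appear to be computed from weighted sequence profiles.

```python
import math

def shannon_entropy(profile):
    """Shannon entropy (natural log) of an amino acid frequency profile."""
    total = sum(profile.values())
    return -sum((f / total) * math.log(f / total) for f in profile.values() if f > 0)

def relative_entropy_percent(entropy, alphabet_size=20):
    """Entropy as a percentage of the maximum, ln(alphabet_size). Assumed convention."""
    return 100.0 * entropy / math.log(alphabet_size)

# (entropy, reported Rel. Entropy) pairs copied from the output above, keyed by model residue.
reported = {54: (0.734, 24), 171: (2.035, 68), 248: (0.347, 12), 299: (1.995, 67)}
for residue, (entropy, rel) in reported.items():
    print(residue, round(relative_entropy_percent(entropy), 1), rel)

# Unweighted entropy of the rounded profile printed for residue 248.
profile_248 = {"D": 0.91, "S": 0.06, "N": 0.03}
print(round(shannon_entropy(profile_248), 3))
```

Under the same definition, eight equally frequent residues give an entropy of ln(8), about 2.08, and two equally frequent residues give ln(2), about 0.69.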
| ==> Function III for Model 2pua: Properties of variance structure neighborhood |
Variance 56 |
Modeled Variance: residue 54 in PDB chain 'A' from BLAST alignment 1
Neighbor Entropy
('A', 49) 1.976
('A', 50) 1.588
('A', 51) 0.259
('A', 52) 0.808
('A', 53) 1.951
('A', 54) 0.734
('A', 55) 1.788
('A', 56) 1.985
('A', 57) 1.767
Total number of neighbors 11
Total number of neighbors found in hssp file 9
Neighborhood min entropy of 0.259 at [('A', 51)]
Neighborhood max entropy of 1.985 at [('A', 56)]
Average Neighbor Entropy 1.428
Total number of residues in chain 'A' is 339
Average entropy of chain is 1.612
S.D. of entropy of chain is 0.626
Deciles of entropy of chain is [0.0, 0.692, 1.028, 1.308, 1.531, 1.787,
1.922, 2.064, 2.169, 2.312, 2.649]
Chain min entropy of 0.000 at [('A', 8), ('A', 18), ('A', 19)] et al.
Chain max entropy of 2.649 at [('A', 190)]
Ave. entropy for neighborhood is -0.878 S.D from ave entropy of chain of modeled variance.
Variance 172 |
Modeled Variance: residue 171 in PDB chain 'A' from BLAST alignment 1
Neighbor Entropy
('A', 167) 2.098
('A', 168) 2.002
('A', 169) 0.981
('A', 170) 1.490
('A', 171) 2.035
('A', 172) 1.968
('A', 173) 0.701
('A', 174) 1.580
('A', 175) 1.866
('A', 176) 2.399
('A', 203) 1.022
('A', 340) 0.868
Total number of neighbors 12
Total number of neighbors found in hssp file 12
Neighborhood min entropy of 0.701 at [('A', 173)]
Neighborhood max entropy of 2.399 at [('A', 176)]
Average Neighbor Entropy 1.584
Total number of residues in chain 'A' is 339
Average entropy of chain is 1.612
S.D. of entropy of chain is 0.626
Deciles of entropy of chain is [0.0, 0.692, 1.028, 1.308, 1.531, 1.787,
1.922, 2.064, 2.169, 2.312, 2.649]
Chain min entropy of 0.000 at [('A', 8), ('A', 18), ('A', 19)] et al.
Chain max entropy of 2.649 at [('A', 190)]
Ave. entropy for neighborhood is -0.153 S.D from ave. entropy of chain of modeled variance.
Variance 247
Modeled Variance: residue 248 in PDB chain 'A' from BLAST alignment 1
Neighbor Entropy
('A', 245) 1.479
('A', 246) 1.890
('A', 247) 1.524
('A', 248) 0.347
('A', 249) 2.368
('A', 250) 1.435
('A', 251) 0.474
('A', 252) 1.876
('A', 253) 0.879
('A', 273) 0.692
('A', 274) 1.220
('A', 275) 0.815
('A', 276) 0.931
('A', 277) 1.970
('A', 280) 1.210
('A', 283) 2.181
('A', 287) 0.607
('A', 289) 0.731
Total number of neighbors 18
Total number of neighbors found in hssp file 18
Neighborhood min entropy of 0.347 at [('A', 248)]
Neighborhood max entropy of 2.368 at [('A', 249)]
Average Neighbor Entropy 1.257
Total number of residues in chain ' A ' is 339
Average entropy of chain is 1.612
S.D. of entropy of chain is 0.626
Deciles of entropy of chain is [0.0, 0.692, 1.028, 1.308, 1.531, 1.787,
1.922, 2.064, 2.169, 2.312, 2.649]
Chain min entropy of 0.000 at [('A', 8), ('A', 18), ('A', 19)] et al.
Chain max entropy of 2.649 at [('A', 190)]
Ave. entropy for neighborhood is -2.402 S.D from ave entropy of chain of modeled variance.
Variance 298
Modeled Variance: residue 299 in PDB chain 'A' from BLAST alignment 1
Neighbor Entropy
('A', 84) 2.327
('A', 295) 2.258
('A', 296) 1.876
('A', 297) 1.155
('A', 298) 0.691
('A', 299) 1.995
('A', 300) 1.958
('A', 301) 1.083
('A', 302) 1.863
('A', 303) 1.741
('A', 304) 1.380
Total number of neighbors 11
Total number of neighbors found in hssp file 11
Neighborhood min entropy of 0.691 at [('A', 298)]
Neighborhood max entropy of 2.327 at [('A', 84)]
Average Neighbor Entropy 1.666
Total number of residues in chain 'A' is 339
Average entropy of chain is 1.612
S.D. of entropy of chain is 0.626
Deciles of entropy of chain is [0.0, 0.692, 1.028, 1.308, 1.531, 1.787,
1.922, 2.064, 2.169, 2.312, 2.649]
Chain min entropy of 0.000 at [('A', 8), ('A', 18), ('A', 19)] et al.
Chain max entropy of 2.649 at [('A', 190)]
Ave. entropy for neighborhood is 0.287 S.D from ave entropy of chain of modeled variance.
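The "S.D from ave entropy of chain" figures reported for each neighborhood appear consistent, to within rounding, with a z-score of the neighborhood mean against the chain mean, using the chain standard deviation divided by the square root of the number of neighbors. This is an inference from the printed numbers, not a formula stated in the output. The function names below are hypothetical.

```python
import math
from statistics import mean

def z_from_mean(nbhd_mean, n, chain_mean, chain_sd):
    """z-score of a neighborhood mean using the standard error chain_sd / sqrt(n) (assumed)."""
    return (nbhd_mean - chain_mean) / (chain_sd / math.sqrt(n))

def neighborhood_z(neighbor_values, chain_mean, chain_sd):
    """Same z-score, computed from the individual neighbor values."""
    return z_from_mean(mean(neighbor_values), len(neighbor_values), chain_mean, chain_sd)

# Neighbor entropies printed for the variance modeled by residue 54.
nbhd_54 = [1.976, 1.588, 0.259, 0.808, 1.951, 0.734, 1.788, 1.985, 1.767]
print(round(neighborhood_z(nbhd_54, chain_mean=1.612, chain_sd=0.626), 3))

# Reported neighborhood averages and neighbor counts for residues 248 and 299.
print(round(z_from_mean(1.257, 18, 1.612, 0.626), 3))
print(round(z_from_mean(1.666, 11, 1.612, 0.626), 3))
```

The sketch gives values within about 0.005 of the reported -0.878, -2.402 and 0.287; the small differences are expected because the inputs are themselves rounded to three decimals.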
| ==> Function IV for Model 2pua: Analysis of crystallographic B factors |
| Variance 56: Leu Ser |
Modeled Variance: residue 54 in PDB chain 'A' from BLAST alignment 1
Residue Statistics
Residue Statistics
Average B-factor of atoms in residue is : 50.9
Average B-factor of residues in residue's chain is : 43.6
S.D. in B-factor of residues in residue's chain is : 15.8
Min and Max residue B-factors of residue's chain: 14.9 93.6
Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
36.2 40.0 43.8 48.8 58.9 67.4 93.6
Residue b factor is in the 8 th decile
Residue B-factor is 0.5 S.D. from average B-factor for chain
Neighborhood Statistics
Average B-factor of atoms in residue's neighborhood is : 42.4
Average B-factor of atoms in chains of residue's neighborhood is : 44.0
Neighborhood B factor is -0.3 S.D. from average B-factor for chains in neighborhood
Variance 172: Glu Gly |
Modeled Variance: residue 171 in PDB chain 'A' from BLAST alignment 1
Residue Statistics
Residue Statistics
Average B-factor of atoms in residue is : 33.4
Average B-factor of residues in residue's chain is : 43.6
S.D. in B-factor of residues in residue's chain is : 15.8
Min and Max residue B-factors of residue's chain: 14.9 93.6
Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
36.2 40.0 43.8 48.8 58.9 67.4 93.6
Residue b factor is in the 3 th decile
Residue B-factor is -0.6 S.D. from average B-factor for chain
Neighborhood Statistics
Average B-factor of atoms in residue's neighborhood is : 35.0
Average B-factor of atoms in chains of residue's neighborhood is : 43.6
Neighborhood B factor is -1.9 S.D. from average B-factor for chains in neighborhood
Variance 247: Asp Lys
Modeled Variance: residue 248 in PDB chain 'A' from BLAST alignment 1
Residue Statistics
Residue Statistics
Average B-factor of atoms in residue is : 29.2
Average B-factor of residues in residue's chain is : 43.6
S.D. in B-factor of residues in residue's chain is : 15.8
Min and Max residue B-factors of residue's chain: 14.9 93.6
Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
36.2 40.0 43.8 48.8 58.9 67.4 93.6
Residue b factor is in the 2 th decile
Residue B-factor is -0.9 S.D. from average B-factor for chain
Neighborhood Statistics
Average B-factor of atoms in residue's neighborhood is : 28.4
Average B-factor of atoms in chains of residue's neighborhood is : 43.6
Neighborhood B factor is -4.1 S.D. from average B-factor for chains in neighborhood
Variance 298: Gln Ala
Modeled Variance: residue 299 in PDB chain 'A' from BLAST alignment 1
Residue Statistics
Residue Statistics
Average B-factor of atoms in residue is : 51.1
Average B-factor of residues in residue's chain is 43.6
S.D. in B-factor of residues in residue's chain is : 15.8
Min and Max residue B-factors of residue's chain: 14.9 93.6
Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
36.2 40.0 43.8 48.8 58.9 67.4 93.6
Residue b factor is in the 8 th decile
Residue B-factor is 0.5 S.D. from average B-factor for chain
Neighborhood Statistics
Average B-factor of atoms in residue's neighborhood is : 46.1
Average B-factor of atoms in chains of residue's neighborhood is : 43.6
Neighborhood B factor is 0.5 S.D. from average B-factor for chains in neighborhood
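The per-residue B-factor statistics above (decile and number of standard deviations from the chain average) can be reproduced with a short sketch. It assumes the eleven printed values are the decile boundaries from chain minimum to chain maximum, and that a residue's decile is the interval its B-factor falls in. The function names are hypothetical.

```python
from bisect import bisect_right

# Decile boundaries printed for chain 'A' of 2pua (minimum, nine cut points, maximum).
CHAIN_DECILES = [14.9, 25.9, 30.2, 33.5, 36.2, 40.0, 43.8, 48.8, 58.9, 67.4, 93.6]

def decile_of(value, boundaries=CHAIN_DECILES):
    """Return the 1-based decile in which `value` falls, clamped to 1..10."""
    index = bisect_right(boundaries, value)
    return max(1, min(index, len(boundaries) - 1))

def z_score(value, chain_mean, chain_sd):
    """Number of standard deviations between `value` and the chain mean."""
    return (value - chain_mean) / chain_sd

# Average residue B-factors printed above for model residues 54, 171, 248 and 299.
for b_factor in (50.9, 33.4, 29.2, 51.1):
    print(b_factor, decile_of(b_factor), round(z_score(b_factor, 43.6, 15.8), 1))
```

For the four modeled residues this yields deciles 8, 3, 2 and 8 and z-scores 0.5, -0.6, -0.9 and 0.5, which agree with the printed output.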
| POSSIBLE FEATURES FOR MODELED VARIANCES |
buried_charge --
The modeled variance is inaccessible and the actual variance includes a charged residue. The only value is yes.
conserved_position --
The modeled variance is absolutely conserved in the HSSP profile. The only value is yes.
helix_breaking --
The actual variance includes either a Gly or a Pro and some other amino acid, and the modeled variance is in a region of helical secondary structure from HSSP analysis. The only value is yes.
hi_b --
The modeled variance's crystallographic B-factor is greater than 45.0 A^2.
hi_decile_b --
The modeled variance's crystallographic B-factor is in the tenth decile of B-factors for the modeled variance's chain in the PDB file.
hi_decile_var --
The modeled variance's phylogenetic variation is in the tenth decile of variation for the modeled variance's chain in the PDB file.
hi_nbhd_b --
The average crystallographic B-factor for the modeled variance's neighborhood is greater than 45.0 A^2.
hi_nbhd_rel_b --
The average crystallographic B-factor for the modeled variance's neighborhood is at least 2.0 s.d. above the average B-factor for other neighborhoods in the residue's chain.
hi_nbhd_rel_var --
The average phylogenetic variation for the modeled variance's neighborhood is at least 2.0 s.d. above the average variation for other neighborhoods in the residue's chain.
hi_nbhd_var --
The average phylogenetic variation for the modeled variance's neighborhood is greater than 2.0 e.u. (8 residues with equal weight).
hi_rel_b --
The modeled variance's crystallographic B-factor is at least 2.0 s.d. above the average B-factor for the modeled variance's chain in the PDB file.
hi_rel_var --
The modeled variance's phylogenetic variation is at least 2.0 s.d. above the average variation for the modeled variance's chain in the PDB file.
hi_var --
The modeled variance's phylogenetic variation is greater than 2.0 e.u. (8 residues with equal weight).
inaccessible --
The HSSP file indicates that the modeled variance has less than 10 A^2 (~ 1 water molecule) exposure to the solvent. The value is the solvent accessible area in A^2. There can be approximately 1 water molecule/10 A^2 solvent accessible surface.
interface --
The modeled variance is within 5.0 A of at least one residue in a different chain in the coordinates. The only value is yes.
lo_b --
The modeled variance's crystallographic B-factor is less than 15.0 A^2.
lo_decile_b --
The modeled variance's crystallographic B-factor is in the first decile of B-factors for the modeled variance's chain in the PDB file.
lo_decile_var --
The modeled variance's phylogenetic variation is in the first decile of variation for the modeled variance's chain in the PDB file.
lo_nbhd_b --
The average crystallographic B-factor for the modeled variance's neighborhood is less than 15.0 A^2.
lo_nbhd_rel_b --
The average crystallographic B-factor for the modeled variance's neighborhood is at least 2.0 s.d. below the average B-factor for other neighborhoods in the residue's chain.
lo_nbhd_rel_var --
The average phylogenetic variation for the modeled variance's neighborhood is at least 2.0 s.d. below the average variation for other neighborhoods in the residue's chain.
lo_nbhd_var --
The average phylogenetic variation for the modeled variance's neighborhood is less than 0.69 e.u. (2 residues with equal weight).
lo_rel_b --
The modeled variance's crystallographic B-factor is at least 2.0 s.d. below the average B-factor for the modeled variance's chain in the PDB file.
lo_rel_var --
The modeled variance's phylogenetic variation is at least 2.0 s.d. below the average variation for the modeled variance's chain in the PDB file.
lo_var --
The modeled variance's phylogenetic variation is less than 0.69 e.u. (2 residues with equal weight).
near_conserved --
The modeled variance is within 5.0 A of a residue that is absolutely conserved in the HSSP profile. The only value is yes.
near_het_atom --
The modeled variance is within 5.0 A of a hetero atom in the coordinates. The value is a distance in Angstroms.
near_other_variances --
The modeled variance is within 5.0 A of at least one other modeled variance. The value is a distance in Angstroms.
near_seq_prosite --
The modeled variance is within 5.0 A of a residue in the coordinates that matches a prosite entry that is also matched by the primary sequence. See near_struct_prosite. The value is a distance in Angstroms.
near_struct_prosite --
The modeled variance is within 5.0 A of a residue in the coordinates that matches a prosite entry that is NOT matched by the primary sequence. See near_seq_prosite. The value is a distance in Angstroms.
rare_aa --
At least one of the residues encoded by the variance is found not more than 10% of the time in the HSSP profile for the modeled variance. The only value is yes.
turn_breaking --
The actual variance includes either a Gly or a Pro and some other amino acid, and the modeled variance is in a turn from the HSSP analysis. The only value is yes.
unusual_aa --
At least one of the residues encoded by the variance is not found in the HSSP profile for the modeled variance. The only value is yes.
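A few of the threshold-based features defined above can be expressed as a simple rule set. The sketch below is illustrative only and covers a subset of the features. The function name, the argument names, and the choice of Asp, Glu, Lys and Arg as the charged residues are assumptions. It reads hi_b as a B-factor greater than 45.0 A^2, by analogy with hi_nbhd_b and lo_b.

```python
CHARGED = {"Asp", "Glu", "Lys", "Arg"}   # assumed set of charged residues
BREAKERS = {"Gly", "Pro"}

def flag_features(variant_aas, entropy, accessibility, b_factor,
                  secondary_structure, chain_entropy_mean, chain_entropy_sd):
    """Return a dict of flagged features for one modeled variance (subset of the list above)."""
    aas = set(variant_aas)
    flags = {}
    if accessibility < 10.0:                      # less than 10 A^2 exposed
        flags["inaccessible"] = accessibility
        if aas & CHARGED:
            flags["buried_charge"] = "yes"
    # Gly or Pro together with some other amino acid, at a helical position
    if secondary_structure == "helix" and aas & BREAKERS and aas - BREAKERS:
        flags["helix_breaking"] = "yes"
    if b_factor > 45.0:
        flags["hi_b"] = b_factor
    if b_factor < 15.0:
        flags["lo_b"] = b_factor
    if entropy > 2.0:
        flags["hi_var"] = entropy
    if entropy < 0.69:
        flags["lo_var"] = entropy
    rel_var = (entropy - chain_entropy_mean) / chain_entropy_sd
    if rel_var >= 2.0:
        flags["hi_rel_var"] = round(rel_var, 2)
    if rel_var <= -2.0:
        flags["lo_rel_var"] = round(rel_var, 2)
    return flags

# Values taken from the Function II and Function IV output for variances 247 and 172.
v247 = flag_features(("Asp", "Lys"), 0.347, 0.0, 29.2, "helix", 1.612, 0.626)
v172 = flag_features(("Glu", "Gly"), 2.035, 58.0, 33.4, "helix", 1.612, 0.626)
print(sorted(v247))
print(sorted(v172))
```

Features that need the amino acid profile, the structural neighborhood, or inter-residue distances (for example unusual_aa, lo_nbhd_rel_b, or interface) are deliberately left out of this sketch.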
Begin Summary
******************************************
| IDENTIFIED FEATURES FOR PRIMARY P03023 |
******************************************
Model Coordinates: 2pua
BLAST Alignment: 2pua_A
Variance: 56, Amino Acids: Leu or Ser
For residue 54 in PDB chain 'A' from BLAST alignment 1: Leu
Quality
E-Value of alignment 3e-34
Fraction identical residues in alignment: 0.31 (96/307)
Fraction identical residues in local alignment of variance:
0.33 (3/9)
Fraction identical residues in structural neighborhood of modeled variance: 0.33 (3/9)
Total number of residues in structural neighborhood: 11
Number of residues in phylogenetic entropy analysis: 37
Source of model: X-Ray
Function
Feature Value
hi_b 50.86 A^2
interface yes
unusual_aa yes
Variance: 172, Amino Acids: Glu or Gly
For residue 171 in PDB chain 'A' from BLAST alignment 1: Arg
Quality
E-Value of alignment 3e-34
Fraction identical residues in alignment: 0.31 (96/307)
Fraction identical residues in local alignment of variance:
0.11 (1/9)
Fraction identical residues in structural neighborhood of modeled variance: 0.09 (1/11)
Total number of residues in structural neighborhood: 12
Number of residues in phylogenetic entropy analysis: 32
Source of model: X-Ray
Function
Feature Value
helix_breaking yes
hi_var 2.04 eu
unusual_aa yes
Variance: 247, Amino Acids: Asp or Lys
For residue 248 in PDB chain 'A' from BLAST alignment 1: Asp
Quality
E-Value of alignment 3e-34
Fraction identical residues in alignment: 0.31 (96/307)
Fraction identical residues in local alignment of variance:
0.33 (3/9)
Fraction identical residues in structural neighborhood of modeled variance: 0.44 (8/18)
Total number of residues in structural neighborhood: 18
Number of residues in phylogenetic entropy analysis: 35
Source of model: X-Ray
Function
Feature Value
buried_charge yes
inaccessible 0 A^2 (# of waters * 10)
lo_decile_var 0.3 (0.0) eu
lo_nbhd_rel_b -4.06 s.d.
lo_nbhd_rel_var -2.40 s.d.
lo_rel_var -2.02 s.d.
lo_var 0.35 eu
unusual_aa yes
Variance: 298, Amino Acids: Gln or Ala
For residue 299 in PDB chain 'A' from BLAST alignment 1: Glu
Quality
E-Value of alignment 3e-34
Fraction identical residues in alignment: 0.31 (96/307)
Fraction identical residues in local alignment of variance:
0.33 (3/9)
Fraction identical residues in structural neighborhood of modeled variance: 0.27 (3/11)
Total number of residues in structural neighborhood: 11
Number of residues in phylogenetic entropy analysis: 35
Source of model: X-Ray
Function
Feature Value
hi_b 51.08 A^2
hi_nbhd_b 46.11 A^2
hi_var 2.00 eu
******************************************
| Total Number of Features Flagged Is 17 |
******************************************
Example 2: Probabilistic Mode
In a second example, the probabilistic mode was used to assess the probability that a change in the amino acid present at each of 3245 known lac repressor polymorphisms would alter activity of lac repressor. In this example, a set of 1468 lysozyme polymorphisms was used as the training dataset and maximum likelihood analysis was used to select the characteristics (from among the physical, structural and phylogenetic features described above) that would be used to analyze the model amino acid residues. This analysis led to the selection of three of the continuously valued parameters (relative accessibility, neighborhood relative B-factor, and neighborhood relative entropy) and three categorical features (unusual amino acid, unusual amino acid by class, and conserved position) as the characteristics most useful for predicting the effect of a polymorphism. Thus, these characteristics were used to analyze the model protein. Because the structure of a major portion of the lac repressor has been solved, lac repressor itself is used as the model protein (as well as being the target protein). Thus, the model amino acid residues are identical to the polymorphic target amino acid residues. In cases where there is not sufficient structural information for the target protein, a model protein would be chosen on the basis of sequence similarity and the predictions would be made on the model amino acid residues.
For each model amino acid residue, determinations were made for each of the three selected continuously valued features (relative accessibility, neighborhood relative B-factor, and neighborhood relative entropy) and each of the three selected categorical features (unusual amino acid, unusual amino acid by class, and conserved position). Once these determinations were made, polymorphic amino acids in the training dataset that were similar to each of the model amino acid residues were selected. A training polymorphism was deemed to be sufficiently similar to a model amino acid residue if the following criteria were met: the value of each selected continuously valued parameter was within one standard deviation of the value of the parameter for the model amino acid residue and the value of each selected categorical feature was the same as the value of the feature for the model amino acid residue.
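The similarity criterion described above can be sketched as a filter over the training dataset. The sketch assumes that "one standard deviation" refers to a per-feature standard deviation supplied by the caller (for example, computed over the training set), which the text does not specify. The `Site` class, the feature keys, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Site:
    continuous: dict                      # feature name -> numeric value
    categorical: dict                     # feature name -> category value
    affects_function: Optional[bool] = None   # known only for training polymorphisms

def is_similar(train, model, sd):
    """True if every continuous feature is within one s.d. and every categorical feature matches."""
    close = all(abs(train.continuous[k] - v) <= sd[k] for k, v in model.continuous.items())
    same = all(train.categorical[k] == v for k, v in model.categorical.items())
    return close and same

def select_similar(training, model, sd):
    """Training polymorphisms deemed sufficiently similar to the model amino acid residue."""
    return [t for t in training if is_similar(t, model, sd)]

# Hypothetical feature values, for illustration only.
cats = {"unusual_aa": True, "unusual_aa_by_class": False, "conserved_position": False}
model = Site({"rel_accessibility": 0.10, "nbhd_rel_b": -1.0, "nbhd_rel_entropy": -0.5}, cats)
sd = {"rel_accessibility": 0.25, "nbhd_rel_b": 1.0, "nbhd_rel_entropy": 1.0}
training = [
    Site({"rel_accessibility": 0.20, "nbhd_rel_b": -0.5, "nbhd_rel_entropy": 0.0}, cats, True),
    Site({"rel_accessibility": 0.80, "nbhd_rel_b": -0.5, "nbhd_rel_entropy": 0.0}, cats, False),
    Site({"rel_accessibility": 0.20, "nbhd_rel_b": -0.5, "nbhd_rel_entropy": 0.0},
         {**cats, "unusual_aa": False}, True),
]
selected = select_similar(training, model, sd)
print(len(selected))
```

In this toy dataset only the first training polymorphism passes: the second differs too much in relative accessibility and the third differs in a categorical feature.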
For each model amino acid residue, the selected training polymorphisms were then used to assess the probability that a change in the amino acid present at the polymorphic target amino acid residue would have an effect on activity of the target protein. The assessment was based on the proportion of selected training polymorphisms that have an effect on the activity of the training protein, lysozyme. For some model amino acid residues, no prediction was made, because the number of selected training polymorphisms was too small to make a statistically significant prediction. The predictions made were then compared to the known effects of the lac repressor polymorphisms and the accuracy of the predictions was analyzed. The results of this analysis are presented in Table 1 below.
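The probability estimate described above amounts to the fraction of selected training polymorphisms that affect function, with no prediction when too few are selected. A minimal sketch follows. The function name and the `min_count` cutoff are hypothetical: the text says only that the number must be large enough for a statistically significant prediction and gives no specific threshold.

```python
def probability_of_effect(selected_outcomes, min_count=10):
    """Fraction of selected training polymorphisms that affect function.

    `selected_outcomes` is a list of booleans (True = affects function).
    Returns None when fewer than `min_count` polymorphisms were selected;
    `min_count` is an illustrative placeholder, not a value from the source.
    """
    n = len(selected_outcomes)
    if n < min_count:
        return None
    return sum(selected_outcomes) / n

# Hypothetical outcomes: 9 of 12 similar training polymorphisms affect function.
print(probability_of_effect([True] * 9 + [False] * 3))
# Too few similar training polymorphisms: no prediction is made.
print(probability_of_effect([True, False, True]))
```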
Figure imgf000053_0001
Figure imgf000054_0001
In Table 1, the predictions are sorted by confidence level. Thus, for example, the values in the column under the heading "0.70" summarize the accuracy for predicting that mutations having a probability of affecting function of 0.70 or greater will affect function and that mutations with a probability of 0.3 (1.0 minus 0.7) or less will not affect function. The accuracy of each class of predictions is assessed by the actual number of true positives, false positives, true negatives, and false negatives and by the statistical measures correlation coefficient, chi-squared value compared to a null hypothesis of predictions made knowing just the fraction of polymorphisms affecting function, selectivity, and sensitivity for the predictions. The last value in each column is the misclassification rate (fraction of incorrectly predicted mutations). This example demonstrates that the probabilistic mode can be used to make predictions about the likely effect of a polymorphism.
Example 3: Classification Mode
In this example, the classification mode was used to classify each of 3245 known lac repressor polymorphisms as either a polymorphism that is likely to alter activity or a polymorphism that is not likely to alter activity. In this example, 1468 lysozyme polymorphisms were used as a training dataset to build a classification tree using QUEST. Three selected continuously valued features (relative accessibility, neighborhood relative B-factor, and neighborhood relative entropy) and three selected categorical features (unusual amino acid, unusual amino acid by class, and conserved position) were used in building the classification tree. The 3245 predictions made for the lac repressor polymorphisms were compared to the known effect of each polymorphism. This analysis revealed 704 true positives, 491 false positives, 1500 true negatives, and 550 false negatives (Correlation: 0.32; Chi Squared: 327.73; Sensitivity: 0.56; Specificity: 0.59) for an overall misclassification rate of only 0.32. This example demonstrates that the classification mode can be used to make predictions about the likely effect of a polymorphism.
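The summary statistics quoted above can be recomputed from the four counts. The sketch below assumes that the correlation is the Matthews (phi) correlation coefficient and that the chi-squared value is the Pearson statistic for the 2x2 table; both assumptions reproduce the quoted 0.32 and 327.73. The quoted 0.59 matches TP/(TP+FP), which is usually called precision or positive predictive value (and which Example 2 appears to call selectivity), rather than TN/(TN+FP). The function name is hypothetical.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Summary statistics for a 2x2 table of predicted versus known effects."""
    n = tp + fp + tn + fn
    cross = tp * tn - fp * fn
    margins = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return {
        "correlation": cross / math.sqrt(margins),    # Matthews / phi coefficient
        "chi_squared": n * cross ** 2 / margins,      # Pearson chi-squared, 1 d.f.
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),                  # matches the quoted 0.59
        "tn_rate": tn / (tn + fp),                    # conventional specificity
        "misclassification": (fp + fn) / n,
    }

# Counts reported for the 3245 lac repressor polymorphisms.
metrics = confusion_metrics(tp=704, fp=491, tn=1500, fn=550)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```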

Claims

WHAT IS CLAIMED IS:
1. A computer-assisted method using a programmed computer including a processor, an input device, and an output device, comprising:
(a) inputting into the programmed computer, through the input device, data including at least a portion of the amino acid sequence of a target protein having a polymorphic amino acid residue, wherein the amino acid residue present at the polymorphic target amino acid residue can be at least a first amino acid or a second amino acid;
(b) selecting, using the processor, a model amino acid residue within a model protein to represent a polymorphic target amino acid residue in the target protein based on overall sequence homology between the target protein and the model protein;
(c) making at least one determination, using the processor, useful for predicting whether changing the identity of the amino acid present at the polymorphic target amino acid residue from the first amino acid to the second amino acid will have an effect on the target protein based on at least one physical, structural or phylogenetic characteristic of the model amino acid residue; and
(d) outputting, to the output device, the results of the at least one determination.
2. The method of claim 1 wherein (c) comprises defining a structural neighborhood for the polymorphic target amino acid residue and for the model amino acid residue.
3. The method of claim 2 wherein (c) further comprises providing for the model amino acid residue a value for a parameter selected from the group consisting of: solvent accessibility, relative solvent accessibility, absolute B factor, relative B factor, neighborhood B factor, neighborhood relative B factor, absolute variability, relative variability, neighborhood variability, and neighborhood relative variability.
4. The method of claim 2 wherein (c) further comprises making at least one determination selected from the group consisting of: whether the amino acid present at the model amino acid residue is inaccessible to solvent and either the first amino acid or the second amino acid is charged; whether either the first amino acid or the second amino acid is absolutely conserved in proteins having a predetermined degree of identity to either the model protein or the target protein; whether the model amino acid residue has less or greater than a predetermined exposure to the solvent; whether the model amino acid residue has a relative solvent accessibility value less or greater than a predetermined value; whether the model amino acid residue is inaccessible to solvent and the maximum solvent accessibility of either the first amino acid residue or the second amino acid residue differs from the buried volume of the model amino acid residue by a predetermined amount; whether the B-factor of the model amino acid residue is outside a predetermined range; whether the relative B-factor of the model amino acid residue is outside a predetermined range; whether the average B-factor of the structural neighborhood of the model amino acid residue is outside a predetermined range; whether the relative B-factor of the structural neighborhood of the model amino acid residue is outside a predetermined range; whether the phylogenetic variability of either the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the relative phylogenetic variability of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the average phylogenetic variability of the structural neighborhood of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the relative phylogenetic variability of the structural neighborhood of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the amino acid at the target amino acid residue is Gly or Pro and the model amino acid residue is in a region of helical secondary structure or in a turn; whether the average hydrophobicity of the structural neighborhood of the model amino acid residue is outside a predetermined range and whether the difference between the hydrophobicity of the first amino acid and the hydrophobicity of the second amino acid is greater than a predetermined value; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid by class; whether one of the at least first amino acid residue and the second amino acid is a rare amino acid; whether the distance between the model amino acid residue and each heterogen present in the model protein is less or greater than a predetermined value; whether the distance between the model amino acid residue and each subunit interface present in the model protein is less or greater than a predetermined value; whether the distance between the model amino acid residue and each conserved motif present in the model protein is less or greater than a predetermined value; whether the distance between the model amino acid residue and a second model amino acid residue used to represent either the target polymorphic amino acid residue or a second polymorphic target amino acid residue in the target protein is less or greater than a predetermined value; and whether the distance between the model amino acid residue and a conserved amino acid residue present in the target or model protein is less or greater than a predetermined value.
5. The method of claim 4 wherein (c) comprises making at least three determinations selected from the group consisting of: whether the model amino acid residue has less or greater than a predetermined exposure to the solvent; whether the model amino acid residue has a relative solvent accessibility value less or greater than a predetermined value; whether the relative B-factor of the model amino acid residue is outside a predetermined range; whether the relative B-factor of the structural neighborhood of the model amino acid residue is outside a predetermined range; whether the relative phylogenetic variability of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; and whether the relative phylogenetic variability of the structural neighborhood of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range.
6. The method of claim 4 wherein (c) comprises:
(i) making at least one determination selected from the group consisting of: whether the model amino acid residue has less or greater than a predetermined exposure to the solvent; whether the model amino acid residue has a relative solvent accessibility value less or greater than a predetermined value;
(ii) making at least one determination selected from the group consisting of: whether the relative B-factor of the model amino acid residue is outside a predetermined range; whether the relative B-factor of the structural neighborhood of the model amino acid residue is outside a predetermined range;
(iii) making at least one determination selected from the group consisting of: whether the relative phylogenetic variability of the model amino acid residue or the polymoφhic target amino acid residue is outside a predetermined range; and whether the relative phylogenetic variability of the structural neighborhood of the model amino acid residue or the polymoφhic target amino acid residue is outside a predetermined range; and
(iv) making at least one least one determination selected from the group consisting of: whether the amino acid at the model amino acid residue is Gly or Pro and the model amino acid residue in a region of helical secondary structure or in a turn; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid by class; whether the distance between the model amino acid residue and each heterogen present in the model protein is less or greater than a predetermined value; and whether the distance between the model amino acid residue and each subunit interface present in the model protein is less or greater than a predetermined value; whether the amino acid present at the model amino acid residue is inaccessible to solvent and either the first amino acid or the second amino acid is charged; whether the distance between the model amino acid residue and each conserved motif present in the model protein is less or greater than a predetermined value; whether the amino acid present at the target amino acid residue is absolutely conserved in proteins having a predetermined degree of identity to either with the model protein or the target protein.
7. The method of claim 4 wherein (c) comprises making at least seven determinations selected from the group consisting of: whether the model amino acid residue has less or greater than a predetermined exposure to the solvent; whether the model amino acid residue has a relative solvent accessibility value less or greater than a predetermined value; whether the relative B-factor of the model amino acid residue is outside a predetermined range; whether the relative B-factor of the structural neighborhood of the model amino acid residue is outside a predetermined range; whether the relative phylogenetic variability of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the relative phylogenetic variability of the structural neighborhood of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the amino acid at the model amino acid residue is Gly or Pro and the model amino acid residue is in a region of helical secondary structure or in a turn; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid by class; whether the distance between the model amino acid residue and each heterogen present in the model protein is less or greater than a predetermined value; whether the distance between the model amino acid residue and each subunit interface present in the model protein is less or greater than a predetermined value; whether the amino acid present at the model amino acid residue is inaccessible to solvent and either the first amino acid or the second amino acid is charged; whether the distance between the model amino acid residue and each conserved motif present in the model protein is less or greater than a predetermined value; and whether the amino acid present at the target amino acid residue is absolutely conserved in proteins having a predetermined degree of identity to either the model protein or the target protein.
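For illustration only, a few of the threshold determinations recited in claims 6 and 7 can be sketched in Python as follows. The record fields, the cutoff values, and the set of amino acids treated as charged are assumptions made for this sketch; the claims say only that the values and ranges are "predetermined".

```python
from dataclasses import dataclass


@dataclass
class ModelResidue:
    amino_acid: str            # one-letter code at the model residue
    rel_accessibility: float   # relative solvent accessibility, 0..1
    rel_b_factor: float        # B-factor relative to the protein average
    secondary_structure: str   # "helix", "turn", "strand" or "coil"
    dist_to_heterogen: float   # angstroms to the nearest heterogen


# Hypothetical cutoffs, chosen only so the example runs.
THRESHOLDS = {"buried": 0.2, "b_low": 0.5, "b_high": 1.5, "heterogen": 5.0}
# Assumption: Asp, Glu, Lys and Arg are treated as the charged amino acids.
CHARGED = {"D", "E", "K", "R"}


def determinations(res, first_aa, second_aa, t=THRESHOLDS):
    """Return a dict of yes/no determinations for one model residue."""
    buried = res.rel_accessibility < t["buried"]
    return {
        "buried": buried,
        "b_factor_out_of_range":
            not (t["b_low"] <= res.rel_b_factor <= t["b_high"]),
        "gly_pro_in_helix_or_turn":
            res.amino_acid in {"G", "P"}
            and res.secondary_structure in {"helix", "turn"},
        "near_heterogen": res.dist_to_heterogen < t["heterogen"],
        "buried_charge":
            buried and (first_aa in CHARGED or second_aa in CHARGED),
    }
```

Each entry in the returned dict corresponds to one "whether ..." clause, so a downstream predictor could consume the flags without knowing how each was computed.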
8. A computer-assisted method using a programmed computer including a processor, an input device, and an output device, comprising:
(a) inputting into the programmed computer, through the input device, data including the amino acid sequence of a target protein;
(b) selecting, using the processor, a model amino acid residue within a model protein to represent a polymorphic target amino acid residue in a target protein based on overall sequence homology between the target protein and the model protein, wherein the amino acid present at the polymorphic target amino acid residue can be at least a first amino acid or a second amino acid;
(c) predicting, using the processor, whether changing the identity of the amino acid present at the polymorphic target amino acid residue from the first amino acid to the second amino acid will have an effect on the target protein based on at least one determination of a physical, structural or phylogenetic characteristic of the model amino acid residue; and
(d) outputting, to the output device, the results of the prediction.
9. The method of claim 8 wherein (c) further comprises selecting at least one training amino acid residue in a training dataset that is similar to the model amino acid residue in at least one selected physical, structural, or phylogenetic characteristic.
10. The method of claim 9 wherein the training dataset comprises variants of lac repressor.
11. The method of claim 9 wherein the training dataset comprises variants of T4 lysozyme.
12. The method of claim 9 wherein the training dataset comprises variants of at least two different proteins.
13. The method of claim 9 wherein the at least one determination in step (c) is selected using statistical analysis.
14. The method of claim 13 wherein the statistical analysis comprises maximum likelihood analysis.
15. The method of claim 13 wherein the statistical analysis comprises principal component analysis.
16. The method of claim 13 wherein the statistical analysis comprises discriminant function analysis.
17. The method of claim 13 wherein the statistical analysis comprises logistic regression.
18. The method of claim 8 wherein (c) comprises classification tree analysis.
19. The method of claim 9 wherein the amino acid present at the selected at least one training amino acid residue is the same as either the first amino acid or the second amino acid.
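As a rough sketch of the selection described in claims 9 and 19, the Python below picks the training residue closest to the model residue in a small feature space, considering only training residues whose amino acid matches the first or second amino acid. The record layout, the choice of Euclidean distance, and the feature values are assumptions for illustration; the claims do not fix a similarity measure, and the sample records are placeholders rather than real lac repressor or T4 lysozyme data.

```python
import math


def nearest_training_residue(model_features, training_set, first_aa, second_aa):
    """Return the most similar eligible training record, or None.

    A record is eligible when its amino acid equals the first or the
    second amino acid of the polymorphic target residue (claim 19).
    """
    candidates = [
        rec for rec in training_set
        if rec["amino_acid"] in (first_aa, second_aa)
    ]
    if not candidates:
        return None
    # Assumption: similarity is Euclidean distance between feature tuples.
    return min(
        candidates,
        key=lambda rec: math.dist(rec["features"], model_features),
    )


# Placeholder records: (relative accessibility, relative B-factor).
training = [
    {"amino_acid": "A", "features": (0.10, 1.0), "effect": True},
    {"amino_acid": "V", "features": (0.80, 1.2), "effect": False},
    {"amino_acid": "L", "features": (0.15, 1.0), "effect": False},
]
match = nearest_training_residue((0.15, 1.0), training, "A", "V")
```

Here the Leu record is the closest overall but is skipped because Leu is neither the first nor the second amino acid, so the Ala record is returned.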
20. The method of claim 8 wherein (c) comprises defining a structural neighborhood for the polymorphic target amino acid residue and for the model amino acid residue.
21. The method of claim 20 wherein (c) further comprises providing for the model amino acid residue a value for a parameter selected from the group consisting of: solvent accessibility, relative solvent accessibility, absolute B factor, relative B factor, neighborhood B factor, neighborhood relative B factor, absolute variability, relative variability, neighborhood variability, and neighborhood relative variability.
22. The method of claim 20 wherein (c) further comprises making at least one determination selected from the group consisting of: whether the amino acid present at the model amino acid residue is inaccessible to solvent and either the first amino acid or the second amino acid is charged; whether either the first amino acid or the second amino acid is absolutely conserved in proteins having a predetermined degree of identity to either the model protein or the target protein; whether the model amino acid residue has less or greater than a predetermined exposure to the solvent; whether the model amino acid residue has a relative solvent accessibility value less or greater than a predetermined value; whether the model amino acid residue is inaccessible to solvent and the maximum solvent accessibility of either the first amino acid residue or the second amino acid residue differs from the buried volume of the model amino acid residue by a predetermined amount; whether the B-factor of the model amino acid residue is outside a predetermined range; whether the relative B-factor of the model amino acid residue is outside a predetermined range; whether the average B-factor of the structural neighborhood of the model amino acid residue is outside a predetermined range; whether the relative B-factor of the structural neighborhood of the model amino acid residue is outside a predetermined range; whether the phylogenetic variability of either the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the relative phylogenetic variability of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the average phylogenetic variability of the structural neighborhood of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the relative phylogenetic variability of the structural neighborhood of the model amino acid residue or the polymorphic target amino acid residue is outside a predetermined range; whether the amino acid at the target amino acid residue is Gly or Pro and the model amino acid residue is in a region of helical secondary structure or in a turn; whether the average hydrophobicity of the structural neighborhood of the model amino acid residue is outside a predetermined range and whether the difference between the hydrophobicity of the first amino acid and the hydrophobicity of the second amino acid is greater than a predetermined value; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid; whether one of the at least first amino acid residue and the second amino acid is an unusual amino acid by class; whether one of the at least first amino acid residue and the second amino acid is a rare amino acid; whether the distance between the model amino acid residue and each heterogen present in the model protein is less or greater than a predetermined value; whether the distance between the model amino acid residue and each subunit interface present in the model protein is less or greater than a predetermined value; whether the distance between the model amino acid residue and each conserved motif present in the model protein is less or greater than a predetermined value; whether the distance between the model amino acid residue and a second model amino acid residue used to represent either the target polymorphic amino acid residue or a second polymorphic target amino acid residue in the target protein is less or greater than a predetermined value; and whether the distance between the model amino acid residue and a conserved amino acid residue present in the model or target protein is less or greater than a predetermined value.
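The structural neighborhood of claim 20 and the neighborhood averages and distance tests of claims 21 and 22 might be computed along the following lines. This is a sketch under stated assumptions: one representative coordinate per residue, a neighborhood defined as all other residues within a fixed radius, and an 8 angstrom default radius are all choices made for this example and are not taken from the claims.

```python
import math


def structural_neighborhood(coords, center, radius=8.0):
    """Ids of residues within `radius` angstroms of residue `center`.

    `coords` maps residue id -> (x, y, z) of one representative atom.
    Assumption: the central residue itself is excluded.
    """
    origin = coords[center]
    return [
        rid for rid, xyz in coords.items()
        if rid != center and math.dist(xyz, origin) <= radius
    ]


def neighborhood_mean(values, neighbors):
    """Average a per-residue value (B-factor, variability, ...) over neighbors."""
    if not neighbors:
        return None
    return sum(values[rid] for rid in neighbors) / len(neighbors)


def min_distance(point, others):
    """Shortest distance from `point` to any coordinate in `others`."""
    return min((math.dist(point, o) for o in others), default=math.inf)


# Toy coordinates and B-factors, not taken from any real structure.
coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0), 3: (20.0, 0.0, 0.0)}
b_factors = {1: 10.0, 2: 30.0, 3: 50.0}
neighbors = structural_neighborhood(coords, 1)
avg_b = neighborhood_mean(b_factors, neighbors)
```

The same `min_distance` helper could serve the heterogen, subunit-interface, and conserved-motif tests by passing the relevant coordinate list and comparing the result with the predetermined value.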
23. A computer program, residing on a computer-readable medium, including instructions for causing the computer to:
(a) receive data including the amino acid sequence of a target protein;
(b) select a model amino acid residue within a model protein to represent a polymorphic target amino acid residue in a target protein based on overall sequence homology between the target protein and the model protein, wherein the amino acid present at the polymorphic target amino acid residue can be at least a first amino acid or a second amino acid;
(c) make at least one determination useful for predicting whether changing the identity of the amino acid present at the polymorphic target amino acid residue from the first amino acid to the second amino acid will have an effect on the target protein based on at least one physical, structural or phylogenetic characteristic of the model amino acid residue; and
(d) output the results of the at least one determination.
24. A computer program, residing on a computer-readable medium, including instructions for causing the computer to:
(a) receive data including the amino acid sequence of a target protein;
(b) select a model amino acid residue within a model protein to represent a polymorphic target amino acid residue in a target protein based on overall sequence homology between the target protein and the model protein, wherein the amino acid present at the polymorphic target amino acid residue can be at least a first amino acid or a second amino acid;
(c) predict whether changing the identity of the amino acid present at the polymorphic target amino acid residue from the first amino acid to the second amino acid will have an effect on the target protein based on at least one physical, structural or phylogenetic characteristic of the model amino acid residue; and
(d) output the results of the prediction.
25. The method of claim 1 further comprising:
(e) outputting to the output device a graphical representation of the structure of at least a portion of the model protein with at least one model amino acid residue annotated with the results of at least one determination made in step (c).
26. A method comprising:
(a) providing the amino acid sequence of a target protein;
(b) selecting a model amino acid residue within a model protein to represent a polymorphic target amino acid residue in a target protein based on overall sequence homology between the target protein and the model protein, wherein the amino acid present at the polymorphic target amino acid residue can be at least a first amino acid or a second amino acid; and
(c) making at least one determination useful for predicting whether changing the identity of the amino acid present at the polymorphic target amino acid residue from the first amino acid to the second amino acid will have an effect on the target protein based on at least one physical, structural or phylogenetic characteristic of the model amino acid residue.
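Step (b) of the independent claims, selecting the model residue that represents a target residue, can be pictured as a position lookup through a pairwise sequence alignment. The Python sketch below assumes the alignment has already been computed and is supplied as two equal-length gapped strings with "-" as the gap character; the function name and the toy sequences are hypothetical and the claims do not prescribe this representation.

```python
def map_target_to_model(aligned_target, aligned_model, target_pos):
    """Map a 1-based target residue number to its 1-based model residue number.

    Returns None when the target residue is aligned to a gap in the
    model, or when `target_pos` lies outside the target sequence.
    """
    if len(aligned_target) != len(aligned_model):
        raise ValueError("aligned sequences must have equal length")
    t_index = 0  # residues of the target seen so far
    m_index = 0  # residues of the model seen so far
    for t_char, m_char in zip(aligned_target, aligned_model):
        if t_char != "-":
            t_index += 1
        if m_char != "-":
            m_index += 1
        if t_char != "-" and t_index == target_pos:
            return m_index if m_char != "-" else None
    return None


# Toy alignment: the model has an inserted T, the target an inserted V.
aligned_target = "MK-LVA"
aligned_model = "MKTL-A"
model_pos = map_target_to_model(aligned_target, aligned_model, 3)
```

In this toy case target residue 3 (Leu) maps to model residue 4, while target residue 4 (Val) has no model counterpart, so no structure-based determination could be made for it from this model.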
PCT/US2001/017351 2000-06-01 2001-05-30 Structure-based methods for assessing amino acid variances WO2001092990A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CA002410726A CA2410726A1 (en) 2000-06-01 2001-05-30 Structure-based methods for assessing amino acid variances
JP2002501137A JP2004501446A (en) 2000-06-01 2001-05-30 Structure-based methods for assessing amino acid diversity
EP01939635A EP1350115A2 (en) 2000-06-01 2001-05-30 Structure-based methods for assessing amino acid variances
AU2001265131A AU2001265131A1 (en) 2000-06-01 2001-05-30 Structure-based methods for assessing amino acid variances

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US20862800P 2000-06-01 2000-06-01
US60/208,628 2000-06-01
US61473500A 2000-07-12 2000-07-12
US09/614,735 2000-07-12

Publications (2)

Publication Number Publication Date
WO2001092990A2 WO2001092990A2 (en) 2001-12-06
WO2001092990A3 WO2001092990A3 (en) 2003-07-31

Family

ID=26903348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/017351 WO2001092990A2 (en) 2000-06-01 2001-05-30 Structure-based methods for assessing amino acid variances

Country Status (5)

Country Link
EP (1) EP1350115A2 (en)
JP (1) JP2004501446A (en)
AU (1) AU2001265131A1 (en)
CA (1) CA2410726A1 (en)
WO (1) WO2001092990A2 (en)

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHASMAN DANIEL ET AL: "Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: Structure-based assessment of amino acid variation." JOURNAL OF MOLECULAR BIOLOGY, vol. 307, no. 2, 2001, pages 683-706, XP002212639 ISSN: 0022-2836 *
DATABASE BIOSIS [Online] BIOSCIENCES INFORMATION SERVICE, PHILADELPHIA, PA, US; 1994 ALTSCHUL STEPHEN F ET AL: "Issues in searching molecular sequence databases." Database accession no. PREV199497283170 XP002212660 & NATURE GENETICS, vol. 6, no. 2, 1994, pages 119-129, ISSN: 1061-4036 *
REDDY BOOJALA V B ET AL: "Use of propensities of amino acids to the local structural environments to understand effect of substitution mutations on protein stability." PROTEIN ENGINEERING, vol. 11, no. 12, December 1998 (1998-12), pages 1137-1145, XP002212641 ISSN: 0269-2139 *
SUNYAEV SHAMIL ET AL: "Prediction of deleterious human alleles." HUMAN MOLECULAR GENETICS, vol. 10, no. 6, 2001, pages 591-597, XP002212640 ISSN: 0964-6906 *
SUNYAEV SHAMIL ET AL: "Towards a structural basis of human non-synonymous single nucleotide polymorphisms." TRENDS IN GENETICS, vol. 16, no. 5, May 2000 (2000-05), pages 198-200, XP002212759 ISSN: 0168-9525 *
WANG ZHEN ET AL: "SNPs, protein structure, and disease." HUMAN MUTATION, vol. 17, no. 4, 2001, pages 263-270, XP001104863 ISSN: 1059-7794 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9493753B2 (en) 2011-03-16 2016-11-15 Amano Enzyme Inc. Modified α-glucosidase and applications of same
US9650619B2 (en) 2011-03-16 2017-05-16 Amano Enzyme Inc. Modified alpha-glucosidase and applications of same
CN110223730A (en) * 2019-06-06 2019-09-10 河南师范大学 Protein and small molecule binding site prediction technique, prediction meanss
CN110223730B (en) * 2019-06-06 2022-09-27 河南师范大学 Prediction method and prediction device for protein and small molecule binding site
CN111128300A (en) * 2019-12-26 2020-05-08 上海市精神卫生中心(上海市心理咨询培训中心) Protein interaction influence judgment method based on mutation information
CN111128300B (en) * 2019-12-26 2023-03-24 上海市精神卫生中心(上海市心理咨询培训中心) Protein interaction influence judgment method based on mutation information
CN112257917A (en) * 2020-10-19 2021-01-22 北京工商大学 Time series abnormal mode detection method based on entropy characteristics and neural network
CN112257917B (en) * 2020-10-19 2023-05-12 北京工商大学 Time sequence abnormal mode detection method based on entropy characteristics and neural network

Also Published As

Publication number Publication date
WO2001092990A3 (en) 2003-07-31
AU2001265131A1 (en) 2001-12-11
CA2410726A1 (en) 2001-12-06
EP1350115A2 (en) 2003-10-08
JP2004501446A (en) 2004-01-15

Similar Documents

Publication Publication Date Title
Chasman et al. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation
Tang et al. Tools for predicting the functional impact of nonsynonymous genetic variation
Gerstein et al. Comparing genomes in terms of protein structure: surveys of a finite parts list
Fiser Template-based protein structure modeling
Jordan et al. Predicting protein-protein interface residues using local surface structural similarity
Vihinen How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis
Capriotti et al. Improving the prediction of disease-related variants using protein three-dimensional structure
Skolnick et al. FINDSITE: a combined evolution/structure-based approach to protein function prediction
Kleinman et al. Statistical potentials for improved structurally constrained evolutionary models
Pieper et al. MODBASE, a database of annotated comparative protein structure models
US8744982B2 (en) Gene-specific prediction
Fradera et al. Guided docking approaches to structure-based design and screening
Karchin et al. Improving functional annotation of non-synonomous SNPs with information theory
Flores et al. Hinge Atlas: relating protein sequence to sites of structural flexibility
Chen et al. Template-guided protein structure prediction and refinement using optimized folding landscape force fields
WO2003009210A1 (en) Methods of providing customized gene annotation reports
Eyal et al. Protein side‐chain rearrangement in regions of point mutations
Zimmermann et al. LOCUSTRA: accurate prediction of local protein structure using a two-layer support vector machine approach
Li et al. Improving predicted protein loop structure ranking using a Pareto-optimality consensus method
US8452542B2 (en) Structure-sequence based analysis for identification of conserved regions in proteins
EP1350115A2 (en) Structure-based methods for assessing amino acid variances
US20060121455A1 (en) COP protein design tool
Wanarase et al. Evaluation of SNPs from human IGFBP6 associated with gene expression: an in-silico study
Kahsay et al. Quasi-consensus-based comparison of profile hidden Markov models for protein sequences
Ünlü Computational prediction of actin–actin interaction

Legal Events

Date Code Title Description
AK Designated states. Kind code of ref document: A2. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US US UZ VN YU ZA ZW
AL Designated countries for regional patents. Kind code of ref document: A2. Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG
121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase. Ref document number: 2410726. Country of ref document: CA
WWE Wipo information: entry into national phase. Ref document number: 2001939635. Country of ref document: EP
ENP Entry into the national phase. Ref country code: JP. Ref document number: 2002 501137. Kind code of ref document: A. Format of ref document f/p: F
WWP Wipo information: published in national office. Ref document number: 2001939635. Country of ref document: EP
WWW Wipo information: withdrawn in national office. Ref document number: 2001939635. Country of ref document: EP