WO2001092990A2 - Structure-based methods for assessing amino acid variances - Google Patents
- Publication number
- WO2001092990A2 (PCT/US2001/017351)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- amino acid
- acid residue
- model
- target
- protein
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
Definitions
- This invention relates to computational methods for genetic variance modeling and prediction.
- the human genome contains approximately 60,000 to 100,000 genes.
- a variance i.e., a mutation or polymorphism
- the variance can result in the production of a gene product, usually a protein, with altered or no activity.
- the variance can be as small as the addition, deletion or substitution of a single nucleotide.
- Such a single nucleotide variance is sometimes called "single nucleotide polymorphism" or SNP.
- the invention features variance modeling and prediction methods that are useful for assessing whether an amino acid variation at a selected amino acid residue in a protein of interest is likely (or is not likely) to have an effect on the protein (e.g., alter a biological activity of the protein).
- the methods of the invention can be used to generate a structural model or models of all or a part of the protein of interest, assess the quality of the structural model, and evaluate the potential functional consequences of an amino acid variation.
- the methods of the invention achieve these goals by considering certain functional, structural, and phylogenetic features of specific amino acid residues.
- the methods of the invention are useful even when they do not predict the effect of an amino acid variance with complete accuracy.
- the growing number of known variances makes it extremely difficult to investigate all potentially significant variances.
- techniques that allow one to predict (even imperfectly) which variances are more likely to affect the structure or activity of a selected protein are useful because they permit one to assign a priority to a variance.
- One embodiment of the methods of the invention identifies a model amino acid residue in a model protein to represent a polymorphic amino acid residue (variance) of interest in a protein of interest, generates a record of the analysis of the model amino acid residue, generates an assessment of model quality, generates a summary of the functional, structural, and phylogenic features assessed, and generates a graphical representation of all or a part of the model protein that can be annotated with information related to the various features assessed.
- Another embodiment of the methods of the invention generates statistically based predictions regarding the likelihood that an amino acid change at a polymorphic amino acid residue in a protein of interest will have an effect on the protein.
- the methods of the invention entail identifying a model amino acid residue (the "model amino acid residue” or “model variance") within a protein structure (the “model protein") that serves as a structural model of a selected polymorphic amino acid (“polymorphic target amino acid residue” or “target variance") of the protein of interest (the “target protein” or “target sequences”).
- the polymorphic target amino acid residue is a particular amino acid residue within the protein of interest that is polymorphic.
- a first amino acid e.g., Gly
- a second amino acid e.g., Lys
- the amino acid present at the polymorphic target amino acid residue can be, e.g., a third, fourth, or fifth amino acid.
- the methods of the invention can be used to assess any number of changes in the amino acid present at the polymorphic target amino acid residue and any number of different polymorphic amino acid residues within a target protein.
- the model protein must have sufficient structural information to perform the methods' analyses.
- the structural information can be derived from x-ray crystallography, NMR, or some other technique for determining the structure of a protein at the amino acid or atomic level.
- the model protein is selected from among proteins with structural information based, at least in part, on its sequence similarity to the target protein.
- the model protein can be selected based on overall sequence similarity to the target protein or based on the presence of a portion having sequence similarity to a portion of the target protein which includes the polymorphic target amino acid residue.
- the methods of the invention entail assessing certain functional, structural, and phylogenic features of the model amino acid and its environment within the model protein.
- the values of the features of the model amino acid residue are then used to determine a potential for an effect of an amino acid change (or variance) at the polymorphic target amino acid residue by comparison to certain criteria.
- Some features have categorical values. For these features there may be only two values: either the specified criteria for the feature are met or they are not.
- One example of a categorical feature is "helix breaking". To meet the criteria for "helix breaking", the model amino acid residue must be in a region of helical secondary structure and one of the polymorphic amino acids must be either Gly or Pro.
- model features can be either continuously valued or categorical.
- solvent accessibility of the model amino acid can be a continuous value or, if cut-off values are defined, a categorical value.
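The "helix breaking" criterion described above can be sketched as a two-valued feature. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name, the one-letter amino acid codes, and the "helix" label for secondary structure are all hypothetical choices made for the example.

```python
from typing import Set

# Assumption: amino acids are given as one-letter codes (G = Gly, P = Pro).
HELIX_BREAKERS: Set[str] = {"G", "P"}

def is_helix_breaking(secondary_structure: str, alleles: Set[str]) -> bool:
    """Categorical 'helix breaking' feature.

    True when the model residue lies in helical secondary structure and at
    least one of the polymorphic amino acids is Gly or Pro.
    """
    in_helix = secondary_structure == "helix"  # hypothetical label
    return in_helix and bool(alleles & HELIX_BREAKERS)
```

For example, `is_helix_breaking("helix", {"G", "K"})` meets the criteria, while the same pair of amino acids at a residue in a sheet does not.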
- a significant feature of the variance modeling and prediction methods of the invention is the concept of a "structural neighborhood.” This is the region within a selected radius of the atoms of a particular amino acid residue. The amino acid residues and other structural features within the structural neighborhood of an amino acid residue strongly influence the effect of a change in the actual amino acid present at the position of the amino acid residue.
- Another significant feature of the variance modeling and prediction methods of the invention is the selection of the functional, structural, and phylogenic features that are useful in predicting the effect of a variance. Among the features analyzed are: solvent exposure, nearness to a heterogen atom, and deviation from the average crystallographic B-factor of the model protein.
- the methods of the invention are very powerful because they do not require structural information about the target protein beyond the sequence of the target protein or the sequence of the target protein in the region that includes the polymorphic target amino acid residue.
- the methods of the invention rely on the use of public sequence and structure databases. These databases become more robust as more and more sequences and structures are added. Thus, the reliability of the models and predictions made by the methods of the invention will continually increase.
- the methods of the invention can be used to predict which non-synonymous polymorphisms are likely to affect protein function; however, they have application in many other areas of protein science.
- the methods can be applied to predicting whether a polymorphism will affect the interaction of a drug with a target protein.
- the methods of the invention can be applied for this purpose. Where two or more polymorphisms occur in a single protein, the methods of the invention can help evaluate both their individual and their combined effects. More generally, the choice of the target protein and polymorphisms need not be dictated by the occurrence of natural genetic variation. For example the choice can be prospective as in the case of the engineering of an enzyme, where the methods of the invention can be applied to the evaluation of which potential mutations will alter the enzyme activity. Broadly, the methods of the invention can be used whenever it is important to assess a relationship between amino acid variation and any aspect of protein activity or structure.
- As used herein, the terms “polymorphic amino acid residue,” “amino acid polymorphism,” “polymorphism,” and “variance” refer to an amino acid position within a protein that can be one or another of two or more different amino acids.
- structure refers to the three dimensional arrangement of atoms in the protein.
- Function refers to any measurable property of a protein. Examples of protein function include, but are not limited to, catalysis, binding to other proteins, binding to non-protein molecules (e.g., drugs), and isomerization between two or more structural forms.
- Biologically relevant protein refers to any protein playing a role in the life of an organism.
- Training dataset refers to a collection of one or more proteins each with one or more polymorphisms or mutations and information concerning the effects of each polymorphism on its protein's structure or function.
- FIG. 1 is a flow chart depicting an example of some of the steps in the annotation mode.
- FIG. 2 is a flow chart depicting an example of some of the steps in selecting predictive features using a training dataset.
- FIG. 3 is a flow chart depicting an example of some of the steps in the probabilistic mode.
- the invention features methods for variance modeling and prediction.
- the methods can be used to assess the effect of any number of amino acid variations in a protein of interest (the "target protein").
- the methods are useful for assessing the effect of an amino acid change at a selected polymorphic amino acid residue in a target protein.
- the amino acid residue can be an amino acid residue that is known to exhibit polymorphism, i.e., one that is known to differ among individuals of a population. For example, some individuals have a Glu at amino acid 6 of their hemoglobin beta-chain. Other individuals have a Val at this position, and this polymorphism is the cause of sickle-cell anemia.
- the amino acid residue is polymorphic, but whether the polymorphism has any effect on the protein of interest will not be known.
- the variance modeling and prediction methods of the invention rely on the analysis of a residue ("the model amino acid residue") in a model protein that is used to represent the polymorphic amino acid residue (the "polymorphic target amino acid residue") in the protein of interest (the “target protein”).
- the model amino acid residue and model protein are selected based on sequence similarity to all or a portion of the target protein.
- the model protein is one for which there is considerable structural information available (e.g., the structure of the protein has been solved).
- the methods of the invention entail examination of various physical, structural, and phylogenetic features of the model amino acid residue.
- the features examined are ones that are useful in predicting whether a change in the amino acid present at the model amino acid residue will affect the activity of the model protein. Examples of such features include: solvent accessibility, relative crystallographic B factor, and proximity to a heteroatom. Because the model amino acid residue and the model protein are similar to the polymorphic target amino acid residue and target protein respectively, the prediction made for the model amino acid residue and model protein will be relevant to the polymorphic target amino acid residue and target protein.
- the methods of the invention can be used to assess the impact of any known polymorphism.
- the methods of the invention can also be used to assess the effect of any potential change at any selected amino acid residue, including amino acid residues that are not known to be polymorphic.
- the methods of the invention can be used to provide: 1) an annotated model of the polymorphic target amino acid residue (annotation mode); 2) a prediction of the probability that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on an activity of the target protein (probabilistic mode); or 3) a classification of the polymorphic target amino acid residue as one which is either likely or not likely to have an effect on an activity of the target protein (classification mode).
- a model amino acid residue is used to represent the polymorphic target amino acid residue.
- all three modes entail determining the value of at least one selected physical, structural, and phylogenetic feature of the model amino acid residue.
- the values of the selected features can be used to provide an annotated model of the target protein.
- One skilled in the art can use the values of the selected features to assess the likelihood that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on an activity of the target protein.
- FIG. 1 is a flow chart depicting some of the steps in one embodiment of the annotation mode.
- the amino acid sequence of a target protein and the location of a polymorphic target amino acid residue within the target protein are identified (STEP 102).
- Proteins with sequence homology to the target protein are identified using an algorithm for identifying homologous protein sequences (STEP 104).
- a model protein is selected from among the selected proteins having sequence homology to the target protein and a model amino acid residue within the model protein is identified (STEP 106).
- the structural neighborhood of the model amino acid is determined (STEP 108), and the values of selected physical, structural, or phylogenetic features of the model amino acid residue and its structural neighborhood are determined (STEP 110).
- the results of the various determinations are then output (STEP 112).
- the output can be a list of values or an annotated graphical depiction of all or a part of the model protein.
- the probabilistic mode and the classification mode employ a database of polymorphisms (a "training dataset") satisfying two requirements.
- This database of polymorphisms (“training polymorphisms”) can contain polymorphisms of a single protein, e.g., lac repressor or lysozyme, or polymorphisms of two or more proteins.
- the effect on activity is known, as is the value of at least one physical, structural, and phylogenetic feature.
- the training polymorphisms are statistically analyzed to identify a subset of all possible physical, structural, and phylogenetic features that is most useful for predicting whether the training polymorphism has an effect on activity. This subset will also be useful for predicting whether an amino acid change at a model amino acid residue or a polymorphic target amino acid residue has an effect on activity.
- FIG. 2 is a flow chart depicting some of the steps in selecting a subset of features useful for prediction. First, a training dataset of unbiased training polymorphisms with a known effect on activity is provided (STEP 202). The value of selected physical, structural, or phylogenetic features for each training polymorphism in the dataset is then determined (STEP 204). Statistical analysis is then used to select a subset of features useful for making predictions (STEP 204).
- the probabilistic mode entails selecting training polymorphisms that are similar to the model amino acid residue in terms of the subset of features. The proportion of selected training polymorphisms that have an effect on activity is determined and this information is used to predict whether a change in the amino acid present at the model amino acid residue will have an effect on the model protein. Because the model amino acid residue is selected to represent the polymorphic target amino acid residue, this prediction is also relevant to the polymorphic target amino acid residue and the target protein.
- the probabilistic mode can be more readily understood by considering a specific example. In this example, the polymorphic target amino acid residue is amino acid 120 of target protein X.
- amino acid residue 150 of protein A (the model protein) is selected as the model amino acid residue. Because the structure of protein A has been solved, the value of various selected features of amino acid residue 150 of model protein A can be determined.
- known polymorphisms (training polymorphisms) of lac repressor are selected on the basis of the similarity of a selected subset of their features to the analyzed subset of features of amino acid residue 150 of protein A. The selected lac repressor polymorphisms are used to predict whether an amino acid change at amino acid 120 of target protein X will have an effect on protein X.
- FIG. 3 is a flow chart depicting some of the steps in the probabilistic mode.
- the amino acid sequence of a target protein and the location of a polymorphic target amino acid residue within the target protein are identified (STEP 302).
- Proteins with sequence homology to the target protein are identified using an algorithm for identifying homologous protein sequences (STEP 304).
- a model protein is selected from among the selected proteins having sequence homology to the target protein and a model amino acid residue within the model protein is identified (STEP 306).
- the structural neighborhood of the model amino acid is determined (STEP 308), and the values of selected physical, structural, or phylogenetic features of the model amino acid residue and its structural neighborhood are determined (STEP 310).
- Training polymorphisms in an unbiased training dataset that have physical, structural, or phylogenetic characteristics similar to the model amino acid residue and its structural neighborhood are then identified (STEP 312).
- the proportion of training polymorphisms identified in STEP 312 that have an effect on activity of the protein is then used to assess the probability that a change in the amino acid present at the polymorphic target amino acid residue will have an effect on the target protein (STEP 314).
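STEPS 312 and 314 amount to selecting similar training polymorphisms and reporting the proportion that affect activity. The sketch below illustrates that calculation only. The record layout (`features`, `has_effect`), the function names, and the exact-match similarity rule are assumptions for the example; the source does not specify how similarity is measured.

```python
from typing import Callable, Dict, List, Optional

Features = Dict[str, object]

def probability_of_effect(
    model_features: Features,
    training: List[dict],
    is_similar: Callable[[Features, Features], bool],
) -> Optional[float]:
    """Proportion of similar training polymorphisms that affect activity.

    Returns None when no training polymorphism is similar to the model
    residue, since no proportion can be formed in that case.
    """
    similar = [t for t in training if is_similar(model_features, t["features"])]
    if not similar:
        return None
    with_effect = sum(1 for t in similar if t["has_effect"])
    return with_effect / len(similar)

def same_categories(a: Features, b: Features) -> bool:
    """Hypothetical similarity rule: all categorical feature values match."""
    return a == b
```

With three similar training polymorphisms of which two affect activity, the estimate is 2/3. A real implementation would likely need a tolerance-based similarity for continuous features.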
- the training polymorphisms and the associated information regarding the effect of the polymorphisms on activity and the values of various features of the polymorphisms are used to build a classification tree.
- the classification tree can be used to classify the model amino acid residue as either one that is likely to have an effect on activity or one that is not likely to have an effect on activity. Because the model amino acid residue is selected to represent the polymorphic target amino acid residue, this classification is also relevant to the polymorphic target amino acid residue and the target protein.
- the annotation mode, the probabilistic mode, and the classification mode are three examples of how the methods of the invention can be used. Those skilled in the art will recognize that many other implementations are possible. For example, it may be possible to derive a mathematical relationship (e.g., a regression relationship) between a subset of all possible physical, structural, and phylogenetic features that can be used to predict the effect of a polymorphism.
Selection and Validation of a Model Protein and Model Amino Acid Residue
- An important step in the methods of the invention is the selection of a model amino acid residue within a model protein that can be used to represent the polymorphic target amino acid residue in the target protein. This is accomplished by first selecting a model protein(s).
- a model protein(s) can be any protein(s) that is homologous in sequence to the target protein and for which there is significant structural information.
- the selection involves searching for a protein(s) that is similar in sequence to the target protein in a curated structural database such as the Protein Data Bank (PDB).
- PDB Protein Data Bank
- Sequence similarity is typically assessed by the BLAST program (NCBI), which aligns two sequences and evaluates the alignment with two quality scores: the E-value (a measure of expectation by chance) and the number of aligned residues that are the same in the two sequences. It can also be assessed using other sequence alignment methods such as the Smith-Waterman or FASTA algorithms.
- NCBI BLAST program
- protein structures from the PDB are considered acceptable models for the target protein's structure if the E-value of the alignment for the target protein's sequence and the model protein's sequence is sufficiently small (e.g., an E-value less than 10⁻⁴).
- an alignment e.g., a BLAST alignment
- residues in the target protein and residues in the model protein are used to identify the residue in the model protein that is taken as the model amino acid residue.
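The selection of a model protein and model amino acid residue can be illustrated with a small sketch. It assumes alignment hits are available as dictionaries with an `evalue` key and that each pairwise alignment is given as two gapped strings with their starting residue numbers; these structures and function names are hypothetical and do not correspond to any particular BLAST output format.

```python
from typing import List, Optional

def acceptable_models(hits: List[dict], max_evalue: float = 1e-4) -> List[dict]:
    """Keep hits whose E-value is below the cutoff, best (smallest) first."""
    kept = [h for h in hits if h["evalue"] < max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])

def map_target_to_model(
    target_aln: str,
    model_aln: str,
    target_start: int,
    model_start: int,
    target_pos: int,
) -> Optional[int]:
    """Return the model residue number aligned to target_pos.

    Returns None if target_pos is aligned to a gap in the model or lies
    outside the aligned region. Gaps are written as '-'.
    """
    t, m = target_start, model_start
    for tc, mc in zip(target_aln, model_aln):
        if tc != "-" and t == target_pos:
            return m if mc != "-" else None
        if tc != "-":
            t += 1
        if mc != "-":
            m += 1
    return None
```

For instance, with an alignment that starts at residue 118 of the target and residue 148 of the model, target residue 120 maps to model residue 150 when no gaps intervene, mirroring the protein X and protein A example given earlier.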
- a crystal (or NMR) structure for the target protein itself already exists in the PDB and this, of course, is the best possible case.
- the E-value of the alignment is essentially zero and the quality of the model is equivalent to the reliability of the crystallographic (or NMR) procedures.
- a theoretical homology model of the target protein or a related protein may have been constructed, published, and deposited in the PDB. The homology model's quality can be assessed manually by reference to the homology modeling procedure and from the publication describing the model.
- Other embodiments of the invention can incorporate an explicit step for construction of a fully optimized homology model for each target protein before assessing the function of individual residues.
- the structure of the model protein in the vicinity of the model amino acid residue can be assessed for quality.
- a structural neighborhood of the model amino acid residue is identified.
- the structural neighborhood can be the collection of residues in the model protein's structure that have at least one atom within some distance or radius (e.g., 5 Å) of at least one atom in the model amino acid residue.
- the residues in the structural neighborhood are residues that make the closest contact with the modeled variance, and the value of 5 Å for the radius can be used to reflect a generous approximate distance for van der Waals interactions.
- the model quality near the model amino acid residue is computed as the fraction of residues in the structural neighborhood that are identically conserved in the BLAST alignment of the target protein and the protein whose structure is used for the model.
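The neighborhood definition and the local quality measure can be sketched as follows. This is an illustration under stated assumptions: residues are represented as a mapping from a residue number to a list of atom coordinates in Å, and the alignment is a mapping from model residue number to a (model amino acid, target amino acid) pair. Both structures and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Atom = Tuple[float, float, float]

def structural_neighborhood(
    residues: Dict[int, List[Atom]], center: int, radius: float = 5.0
) -> List[int]:
    """Residues with at least one atom within `radius` of any atom of `center`."""
    center_atoms = residues[center]
    hits = []
    for res_id, atoms in residues.items():
        if res_id == center:
            continue
        if any(math.dist(a, c) <= radius for a in atoms for c in center_atoms):
            hits.append(res_id)
    return sorted(hits)

def neighborhood_quality(
    neighborhood: List[int], aligned: Dict[int, Tuple[str, str]]
) -> float:
    """Fraction of neighborhood residues identically conserved in the alignment.

    Residues missing from the alignment are counted as not conserved
    (an assumption; the source does not say how gaps are treated).
    """
    if not neighborhood:
        return 0.0
    conserved = sum(
        1 for r in neighborhood if r in aligned and aligned[r][0] == aligned[r][1]
    )
    return conserved / len(neighborhood)
```

A quality of 1.0 means every residue in contact with the model residue is identical in the target, whereas a low value signals that the local environment may differ even if the overall alignment is good.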
- Statistical measures of neighborhood similarity could also be used to assess quality, e.g., a structural neighborhood equivalent of the BLAST E-value.
- the measure of conservation in the structural neighborhood affords a very precise measurement of the accuracy of the modeling near the model amino acid residue itself.
- the structural neighborhood of the model amino acid residue is used to define the structural neighborhood of the polymorphic target amino acid residue.
- the sequence of the region of the model protein corresponding to the structural neighborhood of the model amino acid residue is aligned with the sequence of the target protein and the aligned target protein amino acids are defined as part of the structural neighborhood of the polymorphic target amino acid residue.
- Among the features examined in the model protein are: the distance between the model amino acid residue and any structural motifs or important functional residues, e.g., enzyme active sites in the model protein; distances between the model amino acid residue and any heterogens present in the model protein; and the distance between the model amino acid residue and any subunit interfaces in the model protein.
- the sequence of the target protein and the sequence of the model protein are examined for matches to the entries in one or more databases of recognized domains, e.g., the PROSITE database domains (Bairoch et al. (1997) Nucl. Acids. Res. 24:217) or the pfam HMM database (Bateman et al., (2000) Nucl. Acids. Res. 28:263).
- the PROSITE database is a compilation of two types of sequence signatures: profiles, typically representing whole protein domains, and patterns, typically representing just the most highly conserved functional or structural aspects of protein domains.
- the minimum distance is determined between atoms in the model amino acid residue and atoms in the model's match to the PROSITE entry.
- Small minimum distances e.g., 5 Å
- heterogens are small chemical groups (non-protein molecules) in protein structures that are associated with a protein during the structure determination. Often heterogens are enzyme cofactors, substrates, glycosides, substrate analogs, or drugs. Their location in the structure of a protein may suggest the location of an enzymatic active site or an important functional motif. As with the matches to PROSITE patterns, the minimum distances between the atoms in the model amino acid residue and atoms in the model structure's heterogens are calculated and reported.
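The minimum-distance calculation used for PROSITE matches and heterogens can be sketched in a few lines. Atom coordinates are assumed to be (x, y, z) tuples in Å; the function name and input layout are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Atom = Tuple[float, float, float]

def min_distance(
    residue_atoms: Sequence[Atom], other_atoms: Sequence[Atom]
) -> Optional[float]:
    """Smallest atom-to-atom distance between two groups of atoms.

    Returns None if either group is empty (e.g., no heterogen is present
    in the model structure).
    """
    distances = [math.dist(a, b) for a in residue_atoms for b in other_atoms]
    return min(distances) if distances else None
```

The same helper applies whether `other_atoms` holds heterogen atoms, atoms of residues matching a PROSITE entry, or atoms of another subunit.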
- the distances are interpreted to reflect a potential effect of the model amino acid residue on the model protein's function, and by extension on the target protein's function.
- a model amino acid residue near an enzyme cofactor is interpreted to suggest that the variance will affect the enzyme's activity.
- a relatively small distance e.g., within 5 Å
- two or more model amino acid residues each modeling the same or different polymorphic target amino acid residues
- This last possibility may be particularly relevant when multiple variances in a single target protein have biological properties that depend on their haplotype.
- One important class of features that can be used to evaluate the tolerance of a protein to a polymorphism is related to intrinsic structural properties and phylogenetic aspects of the model amino acid residue.
- the structural properties include the accessibility of the model amino acid residue to solvent and its secondary structure classification, e.g., helix or sheet. Both of these properties can be computed for a model amino acid residue in the context of the model protein structure from well-known algorithms. Both are used to implement the notion that amino acid polymorphisms at residues with certain structural dispositions are likely to affect protein structure or function.
- the phylogenetic aspects of the polymorphic target amino acid residue are quantitative measures of the degree of phylogenetic variability (or alternatively conservation) at the polymorphic target amino acid residue within the family of related protein sequences containing the target protein.
- There are several measures of phylogenetic variability, e.g., the Kabat-Wu variability measure and phylogenetic weight, and any of these would suffice.
- One convenient measure is the phylogenetic entropy. This value can be computed from a simultaneous multiple alignment of the target protein's sequence family. For example, all protein sequences in the public databases that are at least 30% identical to the target protein can be collected and aligned to each other with known algorithms, e.g. CLUSTALW.
- This collection of simultaneously aligned sequences is known as a multiple alignment. It defines an association between each residue in each sequence and one (or none if there are gaps in the alignment) of the residues in each of the other sequences. Each position in the multiple alignment therefore represents a set of homologous residues in the set of homologous proteins.
- the entropy of each position is computed as:
- N number of different amino acids at that position in the multiple alignment
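As an illustration of a per-position entropy, the sketch below uses the standard Shannon form, S = -Σ p_i ln p_i, summed over the N different amino acids observed at a position, where p_i is the observed frequency of the i-th amino acid. The use of the Shannon form, the natural logarithm, and the exclusion of gap characters are assumptions made for this example rather than details confirmed by the text.

```python
import math
from collections import Counter

def column_entropy(column: str) -> float:
    """Shannon entropy (natural log) of one multiple-alignment column.

    `column` holds one character per aligned sequence; '-' marks a gap and
    is ignored (an assumption).
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

A fully conserved position gives an entropy of zero, and a position split evenly between two amino acids gives ln 2 (about 0.693); larger values indicate greater phylogenetic variability.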
- the structural and the phylogenetic information needed to analyze the intrinsic structural and phylogenetic features of the model amino acid residue can be found in the HSSP database (Sander et al. (1991) Proteins 9:56-68) which supplies continuously updated structural and phylogenetic information for each protein structure in the PDB database.
- structural data for each residue in the protein structure includes its secondary structure assignment (e.g., helix, sheet, etc.) and an estimate of its solvent accessibility.
- Each residue in the corresponding PDB structure is also associated with a phylogenetic entropy computed from a multiple alignment of proteins sharing at least 30% sequence identity with the model protein.
- this phylogenetic information can be used to approximate the phylogenetic information for proteins that are similar to the target protein.
- the entire multiple alignment for proteins related to the model protein, and therefore an amino acid profile for each residue, is also provided in the database.
- HSSP structural and phylogenetic data for all model amino acid residues can be reported using the methods of the invention. This data can also be used in a series of tests for determining whether there are expected functional consequences of a change in the amino acid present at the polymorphic target amino acid residue on the target protein. The following functional tests can use information in the HSSP database. 1) Buried Charge: The model amino acid is inaccessible and the polymorphic target amino acid residue includes a charged residue.
- the polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of helical secondary structure based on structure analysis.
- the model amino acid residue has less than about 10 Å² (approximately 1 water molecule) exposure to the solvent.
- This feature can also be assessed using a relative accessibility value, which is the ratio of the observed solvent exposure of the model amino acid residue to the maximum solvent exposure for the same amino acid in a polyalanine chain (or some other predetermined polypeptide chain).
- a value of relative accessibility less than about 0.2 suggests that the model amino acid residue is inaccessible, while a value greater than about 0.8 suggests that the model amino acid residue is accessible.
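As a minimal sketch of the relative accessibility test, the ratio and the two cutoffs (about 0.2 and about 0.8) can be written as follows. The function names and the "intermediate" label for values between the cutoffs are assumptions made for illustration; the maximum exposure value must come from whatever reference chain is chosen.

```python
def relative_accessibility(observed_area, max_area):
    """Ratio of observed solvent exposure to the maximum exposure of the
    same amino acid in a reference chain (e.g., polyalanine)."""
    return observed_area / max_area


def classify_accessibility(rel_acc, low=0.2, high=0.8):
    """Apply the approximate cutoffs from the text."""
    if rel_acc < low:
        return "inaccessible"
    if rel_acc > high:
        return "accessible"
    return "intermediate"  # label assumed; the text does not name this case
```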
- the polymorphic target amino acid residue includes an amino acid that is found not more than 10% of the time in the multiple alignment profile for the model amino acid residue.
- the polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of turn secondary structure.
- the polymorphic target amino acid residue includes an amino acid that is not found in the multiple alignment profile for the polymorphic target amino acid residue. If the target protein and the model protein are similar enough, this feature can be approximated by the multiple alignment profile of the model amino acid residue, e.g., from the HSSP file.
- Unusual Amino Acid by Class: The polymorphic target amino acid residue is not found in the minimum profile from Adams et al. (Protein Science 5:1240, 1996) that includes all of the amino acids in the phylogenetic profile of the target or model amino acid residue. This feature is preferable to the "Unusual Amino Acid" feature when the multiple alignment contains relatively few sequences. Classification schemes other than that proposed by Adams et al. can also be used.
- Hydrophobicity Compatibility The average hydrophobicity of the model amino acid residue is outside of a predetermined range (i.e., the neighborhood is particularly hydrophobic or particularly hydrophilic) and the difference between the hydrophobicity of the first amino acid and the second amino acid exceeds a predetermined value.
- Phylogenetic data (e.g., from the HSSP database) can be used to analyze the structural neighborhood of the model amino acid residue to determine whether the model amino acid residue is in a region of the model protein that is relatively conserved. For example, the entropy values from the HSSP database for each residue in a structural neighborhood are averaged. A structural neighborhood is judged to be unusually conserved or unusually variable on an absolute basis if its average entropy value is, respectively, significantly less than or greater than the average entropy for structural neighborhoods derived from representative PDB structures and their corresponding phylogenetic properties.
- Representative structures from the PDB have been defined by others on the basis of fold families (see Holm and Sander, Science 273:595), and are available through the FSSP database at EMBL. Structural neighborhoods from about 600 representative structure families have been compiled and analyzed for phylogenetic entropy.
- the average structural neighborhood entropy value is compared to the average and standard deviation of the entropy for all of the residues in the model protein polypeptide chain that contains the model amino acid residue.
- a conventional significance statistic can be computed as:
- N number of residues in the structural neighborhood
- <En> average entropy of residues in the structural neighborhood
- <Ec> average entropy of residues in the polypeptide chain that contains the model amino acid residue
- S.D. Ec standard deviation in the entropy for residues in the polypeptide chain that contains the model amino acid residue.
- B-factors are calculated for each residue in the model structure as the average of its atomic B-factors and compared first to absolute standards of low and high B-factor values (e.g., estimated at 15.0 Å² and 45.0 Å², respectively) computed from structural neighborhoods in a representative set of PDB structures. Subsequently, a relative measure of the model residue B-factor is determined by comparison to the mean and standard deviation for residues in the model protein.
- model amino acid residues with significantly low or high B-factors relative to other residues in the model protein are judged to be relatively intolerant or tolerant, respectively, to amino acid variation.
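The two residue-level B-factor measures can be sketched as below: an absolute classification against the example standards of 15.0 and 45.0, and a relative value expressed in standard deviations from the chain mean. Function names are illustrative. The numbers in the usage note are taken from the program output reproduced in Example 1 (residue 50.9, chain mean 43.6, chain S.D. 15.8).

```python
def absolute_b_class(residue_b, low=15.0, high=45.0):
    """Compare a residue B-factor to absolute low/high standards."""
    if residue_b < low:
        return "low"
    if residue_b > high:
        return "high"
    return "typical"  # label assumed for values between the standards


def relative_b_factor(residue_b, chain_mean, chain_sd):
    """Residue B-factor in standard deviations from the chain mean."""
    return (residue_b - chain_mean) / chain_sd
```

With the Example 1 values, `relative_b_factor(50.9, 43.6, 15.8)` rounds to 0.5, matching the reported "0.5 S.D. from average B-factor for chain".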
- the average B-factor of the model amino acid residue's structural neighborhood can be computed and compared to absolute standards of low and high B-factors compiled for structural neighborhoods in representative PDB structures.
- a measure of the structural neighborhood B-factor relative to the model protein itself is determined by comparing the average of the residue B-factors for the structural neighborhood of the model amino acid residue to the average and standard deviation in the B-factors for residues in the polypeptide chain of the model amino acid residue. The significance of the structural neighborhood's average B-factor is computed as:
- N number of residues in the structural neighborhood
- <Bc> average residue B-factor for residues in the polypeptide chains involved in the structural neighborhood
- S.D. Bc standard deviation in the residue B-factor for residues in the polypeptide chains involved in the structural neighborhood
- S.D. Bn = S.D. Bc/√N standard deviation in the average B-factor for samples of N residues chosen from the same chain as the model amino acid residue.
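The significance equation is not reproduced in this text. A minimal sketch, assuming the conventional form in which the neighborhood average is compared to the chain average in units of the standard deviation of an N-residue sample mean (chain S.D. divided by the square root of N), is shown below. The same function applies to neighborhood entropy if entropy values are supplied instead of B-factors. The function name is hypothetical.

```python
import math


def neighborhood_significance(neighborhood_values, chain_mean, chain_sd):
    """Deviation of the neighborhood average from the chain average, in
    units of the S.D. expected for the mean of N residues from the chain."""
    n = len(neighborhood_values)
    neighborhood_mean = sum(neighborhood_values) / n
    sd_of_mean = chain_sd / math.sqrt(n)  # assumed S.D. of an N-residue average
    return (neighborhood_mean - chain_mean) / sd_of_mean
```

For example, four neighborhood residues averaging 40.0 in a chain with mean 44.0 and S.D. 8.0 give a value of -1.0.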
- Modeled variances in structural neighborhoods of significantly low average residue B-factor are judged to be in sufficiently rigid environments that they may have structural and functional consequences.
- Modeled variances in structural neighborhoods of significantly high average residue B-factor are judged to be in sufficiently flexible environments that they may not have structural or functional consequences.
- Other relative measures can be used, e.g., a t-distribution value.
- the B-factor is related to the flexibility of the region of a polypeptide being analyzed.
- the B-factor can be replaced in the methods of the invention by another suitable measure of flexibility.
- the B-factor can be replaced by the r.m.s. deviation in residue position for an ensemble of structures or experimental determinations in the coupling constants and relaxation times of atoms that are diagnostic for mobility.
- the methods of the invention can report the outcome of the quality and function tests for each of the variances during the course of the analysis and produce a graphical representation of the model protein, generated as script for a molecule rendering program, e.g., RasMol.
- the protein structure can be represented by ribbons while the modeled variances, heterogens, and residues corresponding to PROSITE matches in the model structure are displayed in space filling representation. Residue labels are added for the modeled variances.
- all of the output, including the graphical representation can be converted to a web browser readable form.
- the model amino acid residue has less than 10 Å² (approximately 1 water molecule) exposure to the solvent.
- the value is the solvent accessible area in Å².
- a model amino acid residue can also be defined as inaccessible if it has a low value for its relative accessibility, e.g., less than 0.2.
- the modeled variance is within 5.0 Å of at least one residue in a different polypeptide chain in the coordinates. The values are yes or no.
- the model amino acid residue is within 5.0 Å of a residue that is absolutely conserved in the phylogenetic analysis. The values are yes or no.
- the model amino acid residue is within 5.0 Å of a heterogen atom. The value is a distance in Angstroms.
- the model amino acid residue is within 5.0 Å of at least one other model amino acid residue.
- the value is a distance in Angstroms.
- the model amino acid residue is within 5.0 Å of a residue in the coordinates that matches a PROSITE entry that is also matched by the target protein. The value is a distance in Angstroms.
- the model amino acid residue is within 5.0 Å of a residue in the model structure that matches a PROSITE entry that is NOT matched by the target protein.
- the value is a distance in Angstroms.
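The proximity tests above all reduce to finding residues with at least one atom within 5.0 Å of the model amino acid residue. A minimal sketch follows, assuming atoms are supplied as (residue identifier, coordinate) pairs; the data layout and function name are hypothetical and not taken from the described program. It requires Python 3.8 or later for `math.dist`.

```python
import math


def structural_neighborhood(atoms, center_residue, radius=5.0):
    """Map each neighboring residue to its closest atom-atom distance
    (in Angstroms) to `center_residue`, keeping only those within `radius`.

    `atoms` is a list of (residue_id, (x, y, z)) tuples.
    """
    center_atoms = [xyz for rid, xyz in atoms if rid == center_residue]
    nearest = {}
    for rid, xyz in atoms:
        if rid == center_residue:
            continue
        d = min(math.dist(xyz, c) for c in center_atoms)
        if d <= radius and d < nearest.get(rid, float("inf")):
            nearest[rid] = d
    return nearest
```

Filtering the returned residues by chain, heterogen status, or PROSITE match would then give the individual yes/no or distance-valued features.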
- Rare Amino Acid: At least one of the residues encoded by the variance is found not more than 10% of the time in the phylogenetic profile for the model amino acid residue. The values are yes or no.
- the polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of helical secondary structure. The values are yes or no.
- Turn Breaking: The polymorphic target amino acid residue includes either a glycine or a proline and some other amino acid, and the model amino acid residue is in a region of turn secondary structure. The values are yes or no.
- Unusual Amino Acid: At least one of the residues encoded by the polymorphic target amino acid residue is not found in the phylogenetic profile for the polymorphic target amino acid residue. This parameter can be approximated using the phylogenetic profile of the model amino acid residue, e.g., from the HSSP file. The values are yes or no. This parameter can also be assessed using classes, as described above.
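The Rare Amino Acid and Unusual Amino Acid tests can be sketched against an amino acid profile for one alignment position. The profile is assumed here to be a dictionary mapping one-letter codes to their fraction of the aligned sequences; that representation and the function name are illustrative only.

```python
def profile_flags(variant_residues, profile, rare_cutoff=0.10):
    """Evaluate the rare and unusual tests for the residues of a variance.

    `profile` maps amino acid -> fraction of sequences at this position.
    """
    # Unusual: at least one variant residue never appears in the profile.
    unusual = any(profile.get(aa, 0.0) == 0.0 for aa in variant_residues)
    # Rare: at least one variant residue appears not more than 10% of the time.
    rare = any(profile.get(aa, 0.0) <= rare_cutoff for aa in variant_residues)
    return {"rare": rare, "unusual": unusual}
```

Under this reading an unusual residue is also counted as rare, since a frequency of zero is below the 10% cutoff.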
- the average B factor for the model amino acid residue is less than 15.0 or greater than 45.0. Lower values mean less motion for that residue in the crystal structure.
- the average B factor for the model amino acid residue is at least 2 standard deviations above or below the average B factor for residues in the polypeptide chain of the model amino acid residue. The value is the number of standard deviations.
- Low or High Neighbor B The average B factor for the structural neighborhood of the model amino acid residue is less than 15.0 or greater than 45.0.
- the average B factor for the structural neighborhood of the model amino acid residue is at least 2 S.D. (S.D. as defined above) less than or greater than the average B factor for residues in the polypeptide chain of the model amino acid residue. The value is the number of standard deviations.
- the entropy of the model amino acid residue is less than 0.5 or greater than 2.0.
- the value is in entropy units ranging from 0.0, meaning absolute conservation, to ln 20 ≈ 3.0, meaning no conservation.
- the entropy of the model amino acid residue is at least 2.0 S.D. less than or greater than the average phylogenetic entropy for residues in the target protein.
- Low or High Neighborhood Entropy The average entropy for the structural neighborhood of the model amino acid residue is less than 0.5 or greater than 2.0.
- the average entropy for the structural neighborhood of the model amino acid residue is at least 2.0 S.D. (S.D. as defined above) smaller or greater than the average entropy for the polypeptide chain of the model amino acid residue. The value is the number of S.D.
- The features of the model amino acid residue described above can be used as predictor variables in quantitative, statistical models for assessing whether a polymorphism will have an effect on protein structure or function.
- the statistical models rely on actual experimental data concerning the effects of variances on protein activity.
- the predictive models can employ continuous values (or discrete approximations of continuous values, e.g., high, medium, and low B-factor) of the predictor features. Described below are two statistical models for predicting whether a polymorphic target amino acid residue will affect protein structure or function by assessing some or all of the features of modeled variances.
- the features in the predictive methods are slightly adapted from their definitions above.
- Other statistical models for predicting the effects of polymorphic target amino acid residues can be used, and are indicated by reference at the end of this section. These alternative methods can use some or all of the same predictive features of model amino acid residues.
- the features used for predictions fall into two broad categories: environment features and categorical features.
- the class of categorical features is further divided into polymorphism-specific categorical features and special case categorical features. Each of these different features is briefly described below along with an example of how the feature is valued.
- Solvent Accessibility: This is a measure of the accessibility of the model amino acid residue to solvent. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
- Relative Accessibility: This is a measure of the accessibility of the model amino acid residue to solvent relative to the maximum accessibility of that residue in a peptide of specified composition, typically a polyalanine polypeptide. It is used as a continuous variable in the probabilistic model described below.
- Relative B-Factor: This is a measure of the crystallographic B-factor of the model amino acid residue normalized to the average and standard deviation of the B-factor for other residues in the same polypeptide chain of the model protein. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
- This feature is a measure of the statistical significance (defined as above for same feature) of the average B-factor of the model amino acid residue's structural neighborhood relative to the average B-factor of the model amino acid residue's polypeptide chain. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
- This feature is a measure of the statistical significance (defined as above for same feature) of the average phylogenetic entropy of the model amino acid residue's structural neighborhood relative to the average phylogenetic entropy of the model amino acid residue's polypeptide chain. It is used as a continuous variable in the probabilistic model described below. It can be converted into a categorical variable by partitioning, e.g., into 2, 3, or 4 bins each with an equal number of training data polymorphisms in each bin for the classification tree model described below. Other approaches can be used with these and other statistical models.
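Partitioning a continuous environment feature into bins that each hold an equal number of training polymorphisms can be sketched by ranking the values. This is one straightforward way to do it, not necessarily the one used in the described implementation; the function name is hypothetical, and tied values may land in different bins.

```python
def equal_count_bins(values, n_bins=3):
    """Assign each value a bin index from 0 to n_bins - 1 so that the bins
    contain (as nearly as possible) equal numbers of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Integer arithmetic maps ranks 0..len-1 evenly onto the bins.
        bins[i] = rank * n_bins // len(values)
    return bins
```

With three bins, indices 0, 1, and 2 could be read as low, medium, and high values of the feature.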
- Polymorphism-Specific Categorical Features are variance-specific because they are related to the identity of the amino acids that comprise the polymorphic target amino acid residue. In the statistical modeling approaches described below these features are given the value 1 (or yes) if the polymorphism meets the specified criteria, and 0 (or no) otherwise.
- Unusual Amino Acid: One of the amino acids of the polymorphism is not found in the phylogenetic profile of the target variable residue. This feature can be approximated by examination of the model amino acid residue's phylogenetic profile, e.g., from the HSSP file.
- Unusual Amino Acid by Class: One of the amino acids of the polymorphism is not found in the minimum profile from Adams et al. (Protein Science 5:1240, 1996) that includes all of the amino acids in the phylogenetic profile of the polymorphic target amino acid residue.
- This feature can be approximated from the model amino acid residue's phylogenetic profile, e.g., from the HSSP file.
- the feature is preferably used when the multiple alignment contains relatively few sequences. Classification schemes other than that proposed by Adams et al. can also be used.
- the model amino acid residue has turn secondary structure and one of the amino acids of the polymorphism is glycine or proline.
- the model amino acid residue has helical secondary structure assignments and one of the amino acids of the polymorphism is glycine or proline.
- the model amino acid residue is near (e.g., 5 Å) a heterogen atom (ligand) in the model protein.
- the model amino acid residue is near a PROSITE match (e.g., 5 Å) that is common to the target protein and the model protein.
- Interface: The model amino acid residue is near (e.g., 5 Å) the interface between two or more subunits in the model protein.
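The glycine/proline categorical features can be expressed as 1/0 flags as follows. The secondary structure labels ("helix", "turn") and the function name are assumptions for illustration; an actual implementation would map whatever assignment codes the structural database provides onto these classes.

```python
def breaking_flags(variant_residues, secondary_structure):
    """Return 1/0 helix-breaking and turn-breaking flags for a polymorphism.

    `variant_residues` holds the one-letter codes of the amino acids that
    make up the polymorphism; `secondary_structure` is "helix", "turn", or
    another label for the model amino acid residue.
    """
    residues = set(variant_residues)
    # One residue must be Gly or Pro, and there must be some other residue.
    gly_or_pro = len(residues) >= 2 and bool(residues & {"G", "P"})
    return {
        "helix_breaking": int(gly_or_pro and secondary_structure == "helix"),
        "turn_breaking": int(gly_or_pro and secondary_structure == "turn"),
    }
```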
- Training Data Sets: Training datasets contain: 1) at least one amino acid variation in a protein for which there is sufficient structural information to assess at least one selected structural, phylogenetic, or physical feature, and 2) information describing the effect on protein function of each amino acid variation.
- Mutants of the E. coli lac repressor (Markiewicz et al. (1994) J. Mol. Biol. 240:421-433) and
- lysozyme (Rennell et al. (1991) J. Mol. Biol. 222:67) can be used as training datasets.
- Any other collection of polymorphisms, preferably unbiased, for which activity and structural information is available can also be used, even if the dataset contains polymorphisms of many different proteins.
- the approach of the probabilistic model is to view each variance as having a probability that it will affect protein structure or function.
- This outlook implicitly reflects the idea that the homology models are, in general, approximate descriptions. There may be some factors bearing on the variance's effect on structure or function that are not anticipated by the model. However, given enough unbiased data, such factors can be assessed in probabilistic terms through experimental data sets that examine the relationship between mutations and effects on protein structure or function. For example, one of the data sets used in the implementation of the methods of the invention involves over 4000 unbiased mutations in the E. coli lac repressor and their classification with respect to the repressor's biological function.
- the probability values for the predictions combine measures of the intrinsic tolerance of the target protein's structure and function to amino acid variation at the polymorphic target amino acid residue, the nature of the chemical change caused by the variance, and additional classifications for the special cases of variances in particularly vulnerable locations in the model protein's structure.
- For computing the probability that a polymorphic target amino acid residue will affect target protein structure or function, training polymorphisms with feature values similar to the feature values of the model amino acid residue are collected from the training data set.
- the precise criteria for assessing feature value similarity between the training polymorphisms and the model amino acid residue are parameters of the prediction model.
- these criteria are set so that the polymorphisms in the training set have environment feature values within some tolerance, e.g., 1 standard deviation, of the environment feature values of the model amino acid residue, and categorical feature values that are identical to the categorical feature values of the model amino acid residue.
- the probability that the polymorphic target amino acid residue will affect target protein structure or function is defined as the proportion of residues in the sub-group of selected training polymorphisms that have effects on their own protein's structure or function. Defining the probabilities in this way assumes that the environment features and the categorical features calibrated for effects on structure and function with the training set have predictive meaning for the polymorphic target amino acid residue and target protein. It also assumes that the training polymorphisms represent an unbiased sampling of the effects on protein structure of polymorphisms with the specified feature values.
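The selection-and-proportion step of the probabilistic model can be sketched as below. The record layout (dictionaries with "env", "cat", and "effect" keys), the feature names in the usage test, and the assumption that environment features are already expressed in standard deviation units are all illustrative choices, not details from the described implementation.

```python
def effect_probability(query, training, tolerance=1.0):
    """Estimate the probability that a polymorphism affects structure or
    function as the proportion of similar training polymorphisms that do.

    `query` has "env" (continuous features) and "cat" (categorical features);
    each training record has the same keys plus a boolean "effect".
    Returns None when no training polymorphism is similar enough.
    """
    matches = [
        t for t in training
        if t["cat"] == query["cat"]  # categorical features must be identical
        and all(abs(t["env"][name] - value) <= tolerance
                for name, value in query["env"].items())
    ]
    if not matches:
        return None
    return sum(1 for t in matches if t["effect"]) / len(matches)
```

Returning None for an empty sub-group makes explicit the case, noted in the text, where too many features leave too few training polymorphisms for a comparison.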
- the features are assumed to reflect generic properties of polymorphisms that are useful for evaluating their effect on protein function and the training polymorphisms are assumed to reflect typical behavior for amino acid variation. Empirically, this assumption is valid, at least for soluble, globular proteins and the lac repressor and lysozyme training datasets.
- the selected training polymorphisms will be more similar to the model amino acid residue, which itself was selected to be similar to the polymorphic target amino acid residue on the basis of sequence similarity.
- the more features used to parameterize the model amino acid residue, the more difficult it becomes to identify enough training polymorphisms to make an adequate statistical comparison with the current training data sets.
- some of the features are strongly correlated with others and contribute little to the characterization of model amino acid residues, e.g., accessibility and relative accessibility.
- the reduced set of features used for selecting polymorphisms in the probabilistic model is selected using standard maximum likelihood statistical methods. Formally, this entails computing the gain in the likelihood for predictions made on the training data with each possible combination of a few environment features and a few categorical features compared with predictions based on a more general hypothesis. In this case, the more general hypothesis defines a polymorphism's probability of affecting function as the proportion of polymorphisms in the entire training data set that have effects on function.
- the optimal set of parameters is the one that gives the maximum likelihood gain.
- This exhaustive procedure is very computationally intensive. Computational time can be reduced by exploiting the observed strong effect of the environment features on the likelihood calculation. This observation leads to an approximate, stepwise procedure for maximizing the likelihood in which: first the environment features that alone maximize the likelihood are identified, and second, in conjunction with the selected optimized environment features, an optimal set of categorical features is identified. Applying this approximate procedure on two training data sets showed that the best environment features typically include one of the two accessibility features, one of the two B-factor features, and one of the two phylogenetic entropy features, as might be expected. Other statistical methods, e.g., discriminant function analysis, can be also used to choose the dominant features.
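The likelihood gain used to compare feature combinations can be sketched as a difference of log-likelihoods: one for the probabilities predicted with a candidate feature set, and one for the more general hypothesis that assigns every polymorphism the overall proportion of effects in the training set. The function names and the clamping constant are assumptions; the text does not specify how probabilities of exactly 0 or 1 are handled.

```python
import math


def log_likelihood(probabilities, outcomes, eps=1e-9):
    """Bernoulli log-likelihood of observed outcomes (1 = effect, 0 = none)."""
    total = 0.0
    for p, y in zip(probabilities, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0); clamping is an assumption
        total += math.log(p) if y else math.log(1.0 - p)
    return total


def likelihood_gain(model_probabilities, outcomes):
    """Log-likelihood gain of model predictions over the baseline that uses
    the overall proportion of effects in the training data."""
    baseline = sum(outcomes) / len(outcomes)
    return (log_likelihood(model_probabilities, outcomes)
            - log_likelihood([baseline] * len(outcomes), outcomes))
```

In the stepwise procedure described above, this gain would be evaluated first over candidate environment features and then over categorical features, keeping the combination with the largest value.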
- the number of parameters used can be reduced by the standard statistical method of principal component analysis. For example, the complete set of six environment features for all polymorphic residues in the training set is transformed to its principal components, and then just one or a few of the stronger principal components (those with the larger eigenvalues, with or without realignment with the original environment features) are used instead of all of the environment features.
- the probability of a target polymorphism having an effect on protein structure and function is determined as described above with the chosen principal components replacing the environment features in the computation.
- QUEST will directly accept both the continuously valued environment features and the categorical features as predictor variables. As with the probabilistic model, it proves useful to limit the number of variables to three environment features (or to use principal components) and a selection of the other, categorical features in order to accommodate the limited size of the training data sets.
- QUEST uses ANOVA F-statistics to select variables and to define "split" values in each continuous parameter for optimal classification. Trees are then constructed with the selected variables and "split" criteria, and then pruned subject to node size criteria and cross-validation tests. Once the optimal tree is delineated, target polymorphisms can be assessed for whether they are predicted to affect protein structure or function.
- Typical application of QUEST involves running it in default mode with the exception that the minimum node size is often increased, e.g., by a factor of about four, to simplify the classification trees without a serious loss of accuracy.
- a guiding principle in the construction of classification trees with continuous predictor variables is that the "split" values for the predictors should make sense to a user familiar with the underlying scientific issues.
- applying the environment features as continuous variables in the automated QUEST method can lead to classification trees that are excessively branched and hard to interpret.
- An alternative way to implement the method involves categorizing the values of each of the environment features into a reasonable number of groups. For example, each environment feature can be categorized into high, medium, and low values. The categorized environment features can then be used with the other categorical features to construct simplified and robust classification trees by QUEST.
- Other statistical methods can be used in the analysis of the environment and categorical features, and in their application for predicting whether target polymorphisms will affect protein structure or function. These include but are not limited to: discriminant function analysis for selection of environment features for each combination of categorical features (e.g., see StatSoft Inc., Electronic Text Book, http://www.statsoft.com, Chapter on Discriminant Analysis), and logistic regression of the environment features for each combination of categorical features (e.g., see Montgomery and Peck (1992), Introduction to Linear Regression Analysis, Wiley, NY, Chapter 6). Some implementations might use neural nets or related models to assimilate training data for predictions of effects on structure or function caused by polymorphic target amino acid residues.
- a computer program for the automated structural modeling and functional analysis can be written in any suitable language, e.g., Python 1.4.
- Programs and supporting files e.g., databases
- the program can be run on suitable computer systems known to those skilled in the art, for example, a Silicon Graphics O2 workstation operating under IRIX v. 6.5.
- Useful databases include: the Protein Data Bank (PDB) of macromolecular structures and sequences corresponding to the structures; the Homology-Derived Secondary Structure of Proteins (HSSP; EMBL) database; the PROSITE database (EXPASY; currently using release 15) of profiles and patterns.
- Useful software for implementing certain features of the method include: BLAST 2.0.6 sequence alignment and database searching software (NCBI); RasMol 2.6.4 (Roger Sayle) program for visualizing (rendering) and annotating homology models; Chime (MDL, Inc.) http plug-in module for visualizing the models in a web browser; and Pfscan 1.0 software (Philipp Bucher; Swiss Institute for Experimental Cancer Research) for comparing amino acid sequences to PROSITE profiles.
- the methods of the invention are not limited to use with any particular hardware/software configuration. They may find applicability in any computing or processing environment.
- the methods of the invention may be implemented in computer programs executing on programmable computers that each includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices.
- Program code may be applied to data entered using an input device to perform the methods and to generate output information for display.
- Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system.
- the programs can be implemented in assembly or machine language.
- the language may be a compiled or an interpreted language.
- Each computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the methods.
- the methods may also be implemented as a computer-readable storage medium, configured with a computer program, where, upon execution, instructions in the computer program cause the computer to operate in accordance with the methods.
- the methods of the invention are useful in a number of areas beyond simply making predictions about the effect of a known or theoretical polymorphism.
- the methods of the invention can be used for identification and analysis of amino acid polymorphisms that affect the structure or function of proteins involved directly or indirectly in the action of pharmaceutical or diagnostic agents.
- the methods can be used in the identification and analysis of structural or functional interactions between two or more amino acid polymorphisms in a protein of interest (e.g., in analysis of haplotypes).
- the methods of the invention can be used to identify and analyze polymorphisms that have an effect on a catalytic activity of the protein of interest or a non-catalytic activity of the protein of interest (e.g., structure, stability, binding to a second protein or polypeptide chain, binding to a nucleic acid molecule, binding to a small molecule, and binding to a macromolecule that is neither a protein nor a nucleic acid).
- the methods of the invention can also be used in the identification and analysis of candidate polymorphisms for polymorphism-specific targeting by pharmaceutical or diagnostic agents, for the identification and analysis of candidate polymorphisms for pharmacogenomic applications, and for experimental biochemical and structural analysis of pharmaceutical targets that exhibit amino acid polymorphism.
- the methods of the invention can be used to identify amino acid substitutions that can be made to engineer the structure or function of a protein of interest (e.g., to increase or decrease a selected activity or to add or remove a selected activity).
- the methods can also be used for the prospective or retrospective identification and analysis of alterations in a biological property related to a polymorphism.
- Example 1 Annotation Mode
- the method of the invention was used to analyze a number of polymorphic amino acid residues in lac repressor.
- the annotation mode was used and purine repressor was selected as the model protein.
- Reproduced below is a portion of the output of a computer program used to implement one embodiment of the method of the invention.
- the output provides: a list of the polymorphic amino acids analyzed, an alignment of each region of lac repressor containing a polymorphic residue with the corresponding region of the model protein, a summary of the number of amino acid residues that are identical in each aligned region, a summary of the PDB file information for purine repressor used in the analysis, a PROSITE report for the model protein, a summary of the alignment of the amino acid residues in the neighborhood of each model amino acid residue with the amino acid residues in the neighborhood of the corresponding polymorphic amino acid residue, a summary of the determinations made for each model amino acid residue (including: distance to conserved motifs, distance to heterogens, distances between model amino acid residues, entropy, secondary structure, neighborhood entropy, B-factor, relative B-factor, neighborhood B-factor, and relative neighborhood B-factor), a list of the features on which determinations can be made, and a list of determinations made for each model amino acid residue.
- Model based on 2pua_A has:
- This model uses BLAST entry 2pua_A and PDB coordinates 2pua
- MOL_ID: 1; MOLECULE: PURINE REPRESSOR; CHAIN: A; ENGINEERED: YES; MUTATION: R190A; BIOLOGICAL_UNIT: HOMODIMER; OTHER_DETAILS: METHYLPURINE-PUR-OPERATOR; MOL_ID: 2; MOLECULE: DNA; CHAIN: B; ENGINEERED: YES;
- Variance list in model is [('A', 54, 0)] Modeled Variance ('A', 54, 0)
- Variance list in model is [('A', 171, 0)] Modeled Variance ('A', 171, 0)
- Number of residues in Neighborhood of radius 5.0 A is 12. Of these, 11 residues are covered by alignment. Of these, 1 residues are the same
- Variance list in model is [('A', 248, 0)] Modeled Variance ('A', 248, 0)
- Number of residues in Neighborhood of radius 5.0 A is 18 Of these, 18 residues are covered by alignment Of these, 8 residues are the same
- Variance list in model is [('A', 299, 0)] Modeled Variance ('A', 299, 0)
- Average entropy of chain is 1.612
- Ave. entropy for neighborhood is -0.878 S.D from ave entropy of chain of modeled variance.
- Average entropy of chain is 1.612
- Ave. entropy for neighborhood is -0.153 S.D from ave. entropy of chain of modeled variance.
- Average entropy of chain is 1.612
- Ave. entropy for neighborhood is -2.402 S.D from ave entropy of chain of modeled variance.
- Average entropy of chain is 1.612
- Ave. entropy for neighborhood is 0.287 S.D from ave entropy of chain of modeled variance.
- Average B-factor of atoms in residue is : 50.9
- Average B-factor of residues in residue's chain is : 43.6 S.D. in B-factor of residues in residue's chain is 15.8 Min and Max residue B-factors of residue's chain: 14.9 93.6 Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
- Residue b factor is in the 8 th decile
- Residue B-factor is 0.5 S.D. from average B-factor for chain
- Average B-factor of atoms in residue's neighborhood is : 42.4
- Average B-factor of atoms in chains of residue's neighborhood is : 44.0
- Neighborhood B factor is -0.3 S.D. from average B-factor for chains in neighborhood
- Residue b factor is in the 3 th decile
- Residue B-factor is -0.6 S.D. from average B-factor for chain
- Average B-factor of atoms in residue's neighborhood is : 35.0
- Neighborhood B factor is -1.9 S.D. from average B-factor for chains in neighborhood
- Average B-factor of atoms in residue is : 29.2
- Average B-factor of residues in residue's chain is : 43.6 S.D. in B-factor of residues in residue's chain is 15.8 Min and Max residue B-factors of residue's chain: 14.9 93.6 Deciles for residue B-factors of residue's chain: 14.9 25.9 30.2 33.5
- Residue b factor is in the 2 th decile
- Residue B-factor is -0.9 S.D. from average B-factor for chain
- Average B-factor of atoms in residue's neighborhood is : 28.4
- Average B-factor of atoms in chains of residue's neighborhood is : 43.6
- Neighborhood B factor is -4.1 S.D. from average B-factor for chains in neighborhood
- Residue b factor is in the 8 th decile
- Residue B-factor is 0.5 S.D. from average B-factor for chain
- Average B-factor of atoms in residue's neighborhood is : 46.1
- Average B-factor of atoms in chains of residue's neighborhood is : 43.6
- Neighborhood B factor is 0.5 S.D. from average B-factor for chains in neighborhood
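The relative B-factor lines in the output above can be reproduced with simple arithmetic. The following is a minimal sketch, not the program used in the example: the function names are hypothetical, and the use of a population standard deviation and of a rank-based decile are assumptions, since the output does not state how either was computed.

```python
from statistics import mean, pstdev


def z_from_summary(value, chain_avg, chain_sd):
    """Number of standard deviations between a value and the chain average."""
    return (value - chain_avg) / chain_sd


def b_factor_summary(residue_b, chain_b):
    """Return (z_score, decile) of one residue B-factor within its chain.

    Assumptions: population standard deviation; decile taken from the
    fraction of chain residues with a strictly lower B-factor (1..10).
    """
    z_score = z_from_summary(residue_b, mean(chain_b), pstdev(chain_b))
    below = sum(1 for b in chain_b if b < residue_b)
    decile = min(10, below * 10 // len(chain_b) + 1)
    return z_score, decile


# Values taken from the output above: chain average 43.6, chain S.D. 15.8.
print(round(z_from_summary(50.9, 43.6, 15.8), 1))  # residue average 50.9
print(round(z_from_summary(29.2, 43.6, 15.8), 1))  # residue average 29.2
```

With the reported chain average and standard deviation, the two residue averages shown in the output (50.9 and 29.2) give 0.5 and -0.9 standard deviations, which agrees with the corresponding "Residue B-factor is ... S.D. from average B-factor for chain" lines.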
- The modeled variance is inaccessible and the actual variance includes a charged residue.
- The actual variance includes either a Gly or a Pro and some other amino acid, and the modeled variance is in a region of helical secondary structure from HSSP analysis. The only value is yes.
- hi_b -- The modeled variance's crystallographic B-factor is greater than 45.0 A^2.
- hi_decile_b -- The modeled variance's crystallographic B-factor is in the tenth decile of B-factors for the modeled variance's chain in the PDB file.
- The modeled variance's phylogenetic variation is in the tenth decile of variation for the modeled variance's chain in the PDB file.
- The average crystallographic B-factor for the modeled variance's neighborhood is greater than 45.0 A^2.
- The average crystallographic B-factor for the modeled variance's neighborhood is at least 2.0 s.d. above the average B-factor for other neighborhoods in the residue's chain.
- hi_nbhd_rel_var -- The average phylogenetic variation for the modeled variance's neighborhood is at least 2.0 s.d. above the average variation for other neighborhoods in the residue's chain.
- hi_nbhd_var -- The average phylogenetic variation for the modeled variance's neighborhood is greater than 2.0 e.u. (8 residues with equal weight).
- hi_rel_b -- The modeled variance's crystallographic B-factor is at least 2.0 s.d. above the average B-factor for the modeled variance's chain in the PDB file.
- The modeled variance's phylogenetic variation is at least 2.0 s.d. above the average variation for the modeled variance's chain in the PDB file.
- hi_var -- The modeled variance's phylogenetic variation is greater than 2.0 e.u. (8 residues with equal weight).
- inaccessible -- The HSSP file indicates that the modeled variance has less than 10 A^2 (~1 water molecule) exposure to the solvent. The value is the solvent accessible area in A^2.
- The modeled variance is within 5.0 A of at least one residue in a different chain in the coordinates. The only value is yes.
- lo_b -- The modeled variance's crystallographic B-factor is less than 15.0 A^2.
- lo_decile_b -- The modeled variance's crystallographic B-factor is in the first decile of B-factors for the modeled variance's chain in the PDB file.
- lo_decile_var -- The modeled variance's phylogenetic variation is in the first decile of variation for the modeled variance's chain in the PDB file.
- The average crystallographic B-factor for the modeled variance's neighborhood is less than 15.0 A^2.
- The average crystallographic B-factor for the modeled variance's neighborhood is at least 2.0 s.d. below the average B-factor for other neighborhoods in the residue's chain.
- lo_nbhd_rel_var -- The average phylogenetic variation for the modeled variance's neighborhood is at least 2.0 s.d. below the average variation for other neighborhoods in the residue's chain.
- lo_nbhd_var -- The average phylogenetic variation for the modeled variance's neighborhood is less than 0.69 e.u. (2 residues with equal weight).
- lo_rel_b -- The modeled variance's crystallographic B-factor is at least 2.0 s.d. below the average B-factor for the modeled variance's chain in the PDB file.
- lo_rel_var -- The modeled variance's phylogenetic variation is at least 2.0 s.d. below the average variation for the modeled variance's chain in the PDB file.
- lo_var -- The modeled variance's phylogenetic variation is less than 0.69 e.u. (2 residues with equal weight).
- near_conserved -- The modeled variance is within 5.0 A of a residue that is absolutely conserved in the HSSP profile. The only value is yes.
- near_het_atom -- The modeled variance is within 5.0 A of a hetero atom in the coordinates. The value is a distance in Angstroms.
- The modeled variance is within 5.0 A of at least one other modeled variance. The value is a distance in Angstroms.
- near_seq_prosite -- The modeled variance is within 5.0 A of a residue in the coordinates that matches a prosite entry that is also matched by the primary sequence. See near_struct_prosite. The value is a distance in Angstroms.
- near_struct_prosite -- The modeled variance is within 5.0 A of a residue in the coordinates that matches a prosite entry that is NOT matched by the primary sequence. See near_seq_prosite. The value is a distance in Angstroms.
- rare_aa -- At least one of the residues encoded by the variance is found not more than 10% of the time in the HSSP profile for the modeled variance. The only value is yes.
- turn_breaking -- The actual variance includes either a Gly or a Pro and some other amino acid, and the modeled variance is in a turn from the HSSP analysis. The only value is yes.
- unusual_aa -- At least one of the residues encoded by the variance is not found in the HSSP profile for the modeled variance. The only value is yes.
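The variation thresholds in the feature list (2.0 e.u. for "8 residues with equal weight" and 0.69 e.u. for "2 residues with equal weight") are consistent with a Shannon entropy computed with natural logarithms, since ln 8 is about 2.08 and ln 2 is about 0.69. The sketch below illustrates that reading only. The function names are hypothetical, the unweighted residue counts are an assumption, and the way the HSSP profile actually weights sequences is not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (natural log) of the residues at one aligned position.

    Assumption: every sequence in the column counts equally.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def variation_flags(entropy, hi=2.0, lo=0.69):
    """Apply the hi_var / lo_var thresholds quoted in the feature list."""
    return {"hi_var": entropy > hi, "lo_var": entropy < lo}


# Eight different residues, equally represented: entropy is ln 8.
print(variation_flags(shannon_entropy("ACDEFGHI")))
# A fully conserved position: entropy is 0.
print(variation_flags(shannon_entropy("AAAA")))
```

Under this reading, a position with exactly two equally frequent residues has entropy ln 2, which is slightly above 0.69, so it would not be flagged as lo_var; only positions less variable than that fall below the threshold.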
- Example 2 Probabilistic Mode
- In a second example, the probabilistic mode was used to assess the probability that a change in the amino acid present at each of 3245 known lac repressor polymorphisms would alter activity of lac repressor.
- A set of 1468 lysozyme polymorphisms was used as the training dataset, and maximum likelihood analysis was used to select the characteristics (from among the physical, structural and phylogenetic features described above) that would be used to analyze the model amino acid residues.
- The selected training polymorphisms were then used to assess the probability that a change in the amino acid present at the polymorphic target amino acid residue would have an effect on activity of the target protein.
- The assessment was based on the proportion of selected training polymorphisms that have an effect on the activity of the training protein, lysozyme.
- For some polymorphisms, no prediction was made because the number of selected training polymorphisms was too small to make a statistically significant prediction.
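The proportion-based assessment described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the example: the function name and the `min_count` cutoff are hypothetical, and the source does not say what number of selected training polymorphisms was treated as too small.

```python
def effect_probability(selected_effects, min_count=10):
    """Estimate the probability that a target polymorphism alters activity.

    selected_effects: one boolean per selected training polymorphism,
        True if that polymorphism affected activity of the training protein.
    min_count: hypothetical minimum number of selected training
        polymorphisms required before a prediction is reported.
    Returns the proportion with an effect, or None when no prediction is made.
    """
    if len(selected_effects) < min_count:
        return None
    return sum(selected_effects) / len(selected_effects)


# Ten selected training polymorphisms, seven of which affected activity.
print(effect_probability([True] * 7 + [False] * 3))
# Too few selected training polymorphisms: no prediction.
print(effect_probability([True, False]))
```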
- The predictions made were then compared to the known effects of the lac repressor polymorphisms and the accuracy of the predictions was analyzed. The results of this analysis are presented in Table 1 below.
- The predictions are sorted by confidence level.
- The values in the column under the heading "0.70" summarize the accuracy of predicting that mutations with a probability of affecting function of 0.70 or greater will affect function and that mutations with a probability of 0.30 (1.0 minus 0.70) or less will not affect function.
- The accuracy of each class of predictions is assessed by the actual numbers of true positives, false positives, true negatives, and false negatives, and by statistical measures: the correlation coefficient, the chi-squared value compared to a null hypothesis of predictions made knowing just the fraction of polymorphisms affecting function, the selectivity, and the sensitivity of the predictions.
- The last value in each column is the misclassification rate (fraction of incorrectly predicted mutations). This example demonstrates that the probabilistic mode can be used to make predictions about the likely effect of a polymorphism.
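The accuracy measures named above can be computed from the four confusion-matrix counts. The sketch below is hedged in several ways: the function name is hypothetical, the correlation coefficient is assumed to be the Matthews correlation coefficient, "selectivity" is assumed to mean the true-negative rate (it is sometimes used for other quantities), the chi-squared comparison is omitted, and the counts in the usage lines are invented for illustration rather than taken from Table 1.

```python
import math


def prediction_metrics(tp, fp, tn, fn):
    """Summary statistics for binary predictions of 'affects function'."""
    total = tp + fp + tn + fn
    # Fraction of function-altering mutations that were predicted as such.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Assumed reading of "selectivity": fraction of neutral mutations
    # that were predicted as neutral.
    selectivity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient (assumed form of the correlation).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    # Fraction of incorrectly predicted mutations.
    misclassification = (fp + fn) / total if total else 0.0
    return {
        "sensitivity": sensitivity,
        "selectivity": selectivity,
        "correlation": correlation,
        "misclassification": misclassification,
    }


# Illustrative counts only, not data from the example.
for name, value in prediction_metrics(tp=40, fp=10, tn=35, fn=15).items():
    print(f"{name}: {value:.3f}")
```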
- Example 3 Classification Mode:
- The classification mode was used to classify each of 3245 known lac repressor polymorphisms as either a polymorphism that is likely to alter activity or a polymorphism that is not likely to alter activity.
- 1468 lysozyme polymorphisms were used as a training dataset to build a classification tree using QUEST.
- Three selected continuously valued features (relative accessibility, neighborhood relative B-factor, and neighborhood relative entropy) and three selected categorical features (unusual amino acid, unusual amino acid by class, and conserved position) were used in building the classification tree.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA002410726A CA2410726A1 (en) | 2000-06-01 | 2001-05-30 | Structure-based methods for assessing amino acid variances |
JP2002501137A JP2004501446A (en) | 2000-06-01 | 2001-05-30 | Structure-based methods for assessing amino acid diversity |
EP01939635A EP1350115A2 (en) | 2000-06-01 | 2001-05-30 | Structure-based methods for assessing amino acid variances |
AU2001265131A AU2001265131A1 (en) | 2000-06-01 | 2001-05-30 | Structure-based methods for assessing amino acid variances |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US20862800P | 2000-06-01 | 2000-06-01 | |
US60/208,628 | 2000-06-01 | ||
US61473500A | 2000-07-12 | 2000-07-12 | |
US09/614,735 | 2000-07-12 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001092990A2 (en) | 2001-12-06 |
WO2001092990A3 (en) | 2003-07-31 |
Family
ID=26903348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/017351 WO2001092990A2 (en) | 2000-06-01 | 2001-05-30 | Structure-based methods for assessing amino acid variances |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1350115A2 (en) |
JP (1) | JP2004501446A (en) |
AU (1) | AU2001265131A1 (en) |
CA (1) | CA2410726A1 (en) |
WO (1) | WO2001092990A2 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9493753B2 (en) | 2011-03-16 | 2016-11-15 | Amano Enzyme Inc. | Modified α-glucosidase and applications of same |
CN110223730A (en) * | 2019-06-06 | 2019-09-10 | 河南师范大学 | Protein and small molecule binding site prediction technique, prediction meanss |
CN111128300A (en) * | 2019-12-26 | 2020-05-08 | 上海市精神卫生中心(上海市心理咨询培训中心) | Protein interaction influence judgment method based on mutation information |
CN112257917A (en) * | 2020-10-19 | 2021-01-22 | 北京工商大学 | Time series abnormal mode detection method based on entropy characteristics and neural network |
-
2001
- 2001-05-30 WO PCT/US2001/017351 patent/WO2001092990A2/en not_active Application Discontinuation
- 2001-05-30 JP JP2002501137A patent/JP2004501446A/en active Pending
- 2001-05-30 CA CA002410726A patent/CA2410726A1/en not_active Abandoned
- 2001-05-30 AU AU2001265131A patent/AU2001265131A1/en not_active Abandoned
- 2001-05-30 EP EP01939635A patent/EP1350115A2/en not_active Withdrawn
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9493753B2 (en) | 2011-03-16 | 2016-11-15 | Amano Enzyme Inc. | Modified α-glucosidase and applications of same |
US9650619B2 (en) | 2011-03-16 | 2017-05-16 | Amano Enzyme Inc. | Modified alpha-glucosidase and applications of same |
CN110223730A (en) * | 2019-06-06 | 2019-09-10 | 河南师范大学 | Protein and small molecule binding site prediction technique, prediction meanss |
CN110223730B (en) * | 2019-06-06 | 2022-09-27 | 河南师范大学 | Prediction method and prediction device for protein and small molecule binding site |
CN111128300A (en) * | 2019-12-26 | 2020-05-08 | 上海市精神卫生中心(上海市心理咨询培训中心) | Protein interaction influence judgment method based on mutation information |
CN111128300B (en) * | 2019-12-26 | 2023-03-24 | 上海市精神卫生中心(上海市心理咨询培训中心) | Protein interaction influence judgment method based on mutation information |
CN112257917A (en) * | 2020-10-19 | 2021-01-22 | 北京工商大学 | Time series abnormal mode detection method based on entropy characteristics and neural network |
CN112257917B (en) * | 2020-10-19 | 2023-05-12 | 北京工商大学 | Time sequence abnormal mode detection method based on entropy characteristics and neural network |
Also Published As
Publication number | Publication date |
---|---|
WO2001092990A3 (en) | 2003-07-31 |
AU2001265131A1 (en) | 2001-12-11 |
CA2410726A1 (en) | 2001-12-06 |
EP1350115A2 (en) | 2003-10-08 |
JP2004501446A (en) | 2004-01-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AK | Designated states | Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US US UZ VN YU ZA ZW |
 | AL | Designated countries for regional patents | Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
 | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
 | WWE | Wipo information: entry into national phase | Ref document number: 2410726 Country of ref document: CA |
 | WWE | Wipo information: entry into national phase | Ref document number: 2001939635 Country of ref document: EP |
 | ENP | Entry into the national phase | Ref country code: JP Ref document number: 2002 501137 Kind code of ref document: A Format of ref document f/p: F |
 | WWP | Wipo information: published in national office | Ref document number: 2001939635 Country of ref document: EP |
 | WWW | Wipo information: withdrawn in national office | Ref document number: 2001939635 Country of ref document: EP |