WO2003087805A2 - Method for efficiently computing the mass of modified peptides for mass spectrometry data-based identification

Info

Publication number: WO2003087805A2
Application number: PCT/EP2003/003998
Authority: WO
Inventors: Jacques Colinge; Alexandre Masselot
Original assignee: Geneprot, Inc.
Priority date: 2002-04-17
Filing date: 2003-04-16
Publication date: 2003-10-23
Also published as: AU2003226819A8; AU2003226819A1; WO2003087805A8; WO2003087805A3

Abstract

The present invention relates to an automated method for efficiently and precisely computing theoretical mass spectra of proteins containing modifications, such as Post-Translational Modifications (PTMs). The mass spectra may be either peptide mass fingerprints or peptide dissociation mass spectra. The efficient mass computation consists in a new two-stage procedure. A rich set of functionalities permits the precise assignment of modifications.

Description

METHOD FOR EFFICIENTLY COMPUTING THE MASS OF MODIFIED PEPTIDES FOR MASS SPECTROMETRY DATA-BASED IDENTIFICATION

FIELD OF THE INVENTION

The present invention relates to an automated method for efficiently and precisely computing theoretical mass spectra of proteins containing modifications, such as Post-Translational Modifications (PTMs). The mass spectra may be either peptide mass fingerprints or peptide dissociation mass spectra. The efficient mass computation consists in a new two-stage procedure. A rich set of functionalities permits the precise assignment of modifications along the protein sequence.

BACKGROUND

Mass spectrometry (MS) combined with database searching has become the preferred method for identifying proteins in proteomic research. In a typical proteomic project, the proteins of interest are separated, generally by one- or two-dimensional gel electrophoresis or by column chromatography. The proteins are generally digested with an enzyme, such as trypsin, and the resulting proteolytic peptides are analyzed by mass spectrometry or tandem mass spectrometry (MS-MS) followed by database searching (James, P. ed. 2000: Proteome Research: Mass Spectrometry, Springer, Berlin). A commonly used technique for MS analysis is peptide mapping, whereby proteins are digested with an enzyme and the molecular masses of the peptides are measured. Database search engines calculate the possible peptide masses using the specificity of the enzyme for each protein present in a given database. The measured masses are then compared to the calculated masses and a score is calculated. Proteins from organisms with sequenced genomes or preferably extensive protein databases can usually be successfully identified using peptide mapping. The success rate for peptide mapping is much lower, however, when applied to organisms with incomplete genomic information or when analyzing complex protein mixtures. In such cases, more experimental information can be obtained by tandem mass spectrometry (MS-MS). In this technique, ions corresponding to a single peptide are isolated by a mass spectrometer and fragmented by excitation, resulting in molecular dissociation reactions, and the masses of the fragment ions are measured. The simplest scoring method for peptide mapping consists in counting the number of measured peptide masses that match calculated peptide masses within the accuracy of the measurement.
This method is satisfactory for high-quality experimental data with the caveat, however, that it tends to give higher scores to larger proteins, for which the possibility of more peptides results in a higher probability of random matching. More sophisticated methods, which are also based on counting the number of matching peptide masses, make greater use of the experimental information by including existing knowledge on the proteins and thereby increase the selectivity and sensitivity of the identification. Similar concepts apply to matching MS-MS theoretical and experimental data. For an example of a generally applicable scoring method in MS-MS, see US patent 6,017,693.

In nature, proteins and peptides may undergo a variety of modifications. Examples include the formation of structural bridges within the polypeptide chain or between more than one polypeptide, and the addition of modifying groups (e.g., carbohydrate, glycosyl, acetyl, amidyl, phosphate, etc.). Post-translational modifications are typically enzymatic modifications to an amino acid sequence or polypeptide chain after it has been translated from messenger RNA. For many types of modifications, the literature provides rules based at least in part on the amino acid sequence, specifying sites of possible modification (see, for example, Post-translational modification of proteins, Krishna R.G., Wold F., Adv Enzymol Relat Areas Mol Biol, 1993, 67:265-98). In addition, proteolytic enzymes target certain polypeptide sequences, resulting in possible proteolysis of that protein or peptide.

Different modifications can be predicted to occur with different degrees of certainty, some having a high tendency to be present given an amino acid sequence or residue. Other modifications are more variable, in that there is a significant chance that the modification is not present at the site at which the modification can occur. Further, one of two (or even more) different modifications (but not both) may take place at a certain amino acid residue. Generally, protein modifications that occur in nature as part of normal cellular processes (e.g., post-translational modification, proteolysis) are at least somewhat variable. However, certain residues are commonly modified during proteomic methods (e.g., cysteine may be reduced and alkylated). Such nonnaturally-imposed modifications may be controlled and thus are considered more fixed. That is, there is a higher certainty that the modification will be present at the site at which it can occur. Accordingly, we define fixed modifications as modifications that substantially always occur at a given site and that result from a nonnaturally-occurring process. Variable modifications occur in nature and are not necessarily present at every potential site for that modification on a peptide or protein.

Most of the software currently available for analyzing mass spectrometry data addresses the task of computing protein modifications in an unsatisfactory manner. Generally, the end-user is offered few options for assigning modifications along the protein sequence. Indeed, in currently available software, the modifications tend to be associated with a generic rule, e.g., every lysine may be modified, and thus each site where the rule applies is considered as potentially modified in the computation. Although such rules are definitely useful, in cases where the end-user knows beforehand that only a specific lysine is modified, there is no reason to consider the other ones. Another typical limitation of currently available software is its inability to treat the N- and C-terminal peptides differently from the peptides issued from the core of the protein. As a result, potential modifications located at the C-terminus (or N-terminus) of the protein must be either omitted altogether or computed (erroneously) as potential modifications for the C-terminus (respectively, the N-terminus) of every peptide issued from the core of the protein.

Computation time is an important issue in generating theoretical mass spectra for proteins containing modifications, as including several potential modifications on a single protein or peptide may result in a combinatorial explosion of spectra to compute.

The present invention provides a computationally efficient method for the identification of proteins and peptides based on MS or MS-MS data. The invention also provides a means and method to allow both fixed and variable modifications to be taken into account in the identification process, as well as a means for the end-user to search specifically for the existence of a given modification at a given location. The invention improves the reliability and accuracy of protein or peptide identification from mass spectrometry data, thereby rendering MS and MS-MS based analysis amenable to the automation required by large-scale proteomics.

SUMMARY OF THE INVENTION

The invention provides a method of computing the mass of peptides and proteins. The computing method of the invention generates simulated mass spectra based on peptide and protein sequences from databases (hereinafter "test peptide" or "test protein") that one wants to compare to experimental mass spectra in order to identify the peptides and proteins present in an experimental sample. In preferred aspects, the invention computes the masses of peptides that are potentially modified at precise positions along the peptide sequence (e.g. at precise amino acid residues). Every possible combination of modifications on a peptide, including variable and fixed modifications, as defined below, may be taken into account. The provision of a more complete set of test peptides associated with modifications consequently improves the success rates of methods of identifying proteins and peptides, since the use of a more complete and accurate set of potential modifications allows more specific comparisons with experimental mass spectra to be conducted.

The invention provides a method of computing and validating the mass of modified peptides in two stages. The first stage relies on peptide mass data. Such peptides are generally formed through enzymatic cleavage of a desired protein. Possible masses are calculated for each peptide, based on known cleavage sites and possible modification sites within the protein sequence. Such calculated masses are correlated with the experimental mass data gathered by, for example, mass spectrometry (MS). Calculated masses that are not within a given tolerance range of an experimental mass may be discarded at this point. This prevents a waste of computation time in the second stage. In the second stage, a precise computation is carried out on the peptides belonging to sets of modifications retained in the first stage. This computation specifically assigns positions along the peptide sequence to each retained potential modification on the peptide. These simulated masses are then compared to experimentally fragmented peptides whose fragment masses are measured by MS-MS.

The invention thus provides an automatic method for computing the masses of potentially modified peptides in two main steps, given a list of potential modifications on a protein sequence (fixed or variable) and a list of enzyme cleavage sites that define peptides. For each peptide thus defined, the method comprises:

(1) associating said peptide with at least one set of modifications, wherein each combination of said peptide with a set of modifications corresponds to, or is associated with, a total mass; and

(2) for a subset of peptides, selected based on the total mass of a combination to which they belong,

(a) for one or more selected types of modifications, associating peptides in said subset with every possible set of modifications, wherein a modification is associated with a location on the peptide; and

(b) computing the corresponding fragmentation spectra.

In step (1) above, the modifications from a given set of simultaneously present modifications associated with a peptide need not be associated with, or assigned to, an amino acid residue of the peptide. Rather, the total number of each type of modification is considered, such that each combination of a peptide and a set of modifications results in a total mass. Different numbers of sets of modifications may be considered for each peptide, resulting in as many possible total masses.

Furthermore, in step (1) above, every variable modification must be considered as present or not present and thus several sets of simultaneously present modifications for a peptide are usually generated. As such, a peptide will typically be associated with more than one set of modifications.
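By way of non-limiting illustration, step (1) can be sketched as follows. The sketch enumerates, for one peptide, every set of modifications described only by the number of modifications of each type, and computes the total mass of each combination. The function name `stage1_combinations`, the two example modifications and the example peptide are assumptions made for illustration and do not appear in the specification; the mass values are standard monoisotopic masses. A fixed modification would simply be added to the base mass in every combination.

```python
from itertools import product

# Monoisotopic residue masses in Da (only the residues used in this example).
RESIDUE_MASS = {"A": 71.03711, "K": 128.09496, "M": 131.04049,
                "S": 87.03203, "T": 101.04768}
WATER = 18.01056  # added once per peptide for the terminal H and OH

# Hypothetical variable modifications: name -> (target residues, mass shift in Da).
VARIABLE_MODS = {"phospho": ("ST", 79.96633), "oxidation": ("M", 15.99491)}

def stage1_combinations(peptide, mods=VARIABLE_MODS):
    """Return [(counts, total_mass), ...] for every set of modifications.

    Only the number of modifications of each type is recorded; no
    modification is assigned to a residue at this stage.
    """
    names = list(mods)
    # Maximum count of each type = number of residues it could occupy.
    limits = [sum(peptide.count(r) for r in mods[n][0]) for n in names]
    base = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    result = []
    for counts in product(*(range(limit + 1) for limit in limits)):
        shift = sum(c * mods[n][1] for c, n in zip(counts, names))
        result.append((dict(zip(names, counts)), base + shift))
    return result
```

For the hypothetical peptide "SAMTK" (two phosphorylation sites, one oxidation site), this yields 3 x 2 = 6 count-based combinations, whereas assigning each modification to a residue would yield 2^3 = 8 configurations; the gap widens quickly as the number of sites grows.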

Referring to step (2) above, selecting a subset of peptides from step (1) can be carried out by any suitable means. A preferred method is by comparing total masses of modified peptides calculated in step (1) to an experimental mass spectrum. One or more modified peptides from step (1) which correspond to or are likely to correspond to a mass observed in the experimental mass spectrum are thus selected.
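The preferred selection described above can be sketched, purely as an example, as a tolerance filter. The function name `select_by_mass`, the absolute tolerance in Da and the default value of 0.5 are assumptions; the specification only requires that retained masses lie within a given tolerance range of an experimental mass.

```python
def select_by_mass(combinations, experimental_masses, tol=0.5):
    """Keep the (label, total_mass) pairs whose total mass lies within
    `tol` Da of at least one experimentally observed mass."""
    return [
        (label, mass)
        for label, mass in combinations
        if any(abs(mass - observed) <= tol for observed in experimental_masses)
    ]
```

Only the combinations returned by such a filter would be passed on to the location-specific computation of step (2).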

Referring again to step (2) above, only a subset of the peptides of step (1) are considered when generating sets of modifications associated with a location on the peptide, thereby reducing the computational resources required. More preferably, for a given peptide, step (2) is carried out only for a subset of the total combinations of said peptide with sets of modifications from step (1), rather than for all sets of modifications of step (1) for a given peptide. Thus, in this embodiment, the invention encompasses a method for computing the mass of potentially modified peptides, given a list of potential modifications on a protein sequence and a list of enzyme cleavage sites that define peptides, which method comprises:

(1) associating each peptide with at least one set of modifications, wherein each combination of said peptide with a set of modifications corresponds to or is associated with a total mass; and

(2) for a subset of said combinations selected based on their total mass,

(a) generating every possible modification location within each combination from said subset; and

(b) computing the corresponding fragmentation spectra.
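As a minimal sketch of step (2)(a), the following example expands one retained combination (a peptide plus a count of each modification type) into every possible assignment of modifications to residue positions. The names `located_sets`, `mod_targets` and `counts` are hypothetical, and the rule that a residue carries at most one modification is an assumption consistent with the statement that one of two different modifications, but not both, may take place at a given residue.

```python
from itertools import combinations, product

def located_sets(peptide, mod_targets, counts):
    """Yield dicts {0-based position: modification name} for every way of
    placing `counts[name]` copies of each modification on its target residues."""
    per_type = []
    for name, n in counts.items():
        sites = [i for i, aa in enumerate(peptide) if aa in mod_targets[name]]
        # Every choice of n distinct sites for this modification type.
        per_type.append([[(pos, name) for pos in combo]
                         for combo in combinations(sites, n)])
    for choice in product(*per_type):
        placed = [pair for group in choice for pair in group]
        positions = [pos for pos, _ in placed]
        if len(set(positions)) == len(positions):  # one modification per residue
            yield dict(placed)
```

For the hypothetical peptide "SAMTK" with one phosphorylation (targets S, T) and one oxidation (target M), the generator yields two located sets: phosphorylation on S with oxidation on M, and phosphorylation on T with oxidation on M.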

It will also be appreciated that step (1) may be replaced by any other method, which provides the basis for selecting a subset of peptides, which according to step (2) can be associated with modifications assigned to locations on a peptide. In this respect, the invention also provides a method for computing the mass of potentially modified peptides, said method comprising:

(1) providing a set of modification sites on a protein sequence and enzyme cleavage sites that define peptides, wherein each peptide corresponds to or is associated with a total mass; and

(2) for a subset of peptides, selected based on the total mass,

(a) associating each selected peptide with every possible set of modifications, wherein each modification is associated with a location on the peptide; and

(b) computing the corresponding fragmentation spectra.

In one aspect of the invention, some or all of the potential modification sites are known in advance for a test amino acid sequence. In one example, potential modification sites may be pre-determined and associated with an amino acid sequence (e.g., in a database), such that a set of preferred specific sites are known to be prone to bearing modifications. These pre-determined modifications may be fixed or variable modifications, in that their probability of occurring may be high or medium to low, respectively.

In other aspects of the invention, the locations of some or all of the modifications considered for a test amino acid sequence are identified using a set of rules defining potential modification sites. Both variable and fixed modifications may be determined in this way. In other aspects, some modification sites are imposed based on pre-determined modification sites and some modification sites are identified using a set of rules defining potential modification sites. Preferably the modification sites are imposed a priori or identified using a rule-based approach after computational enzyme digestion.

Preferably the rules for identifying modifications are computer-implemented. In further embodiments, at least a portion of modification sites are predicted by artificial neural networks or hidden Markov models.

In preferred embodiments, the amino acid sequence of a protein or peptide is translated into binary code. Thus, favored methods of associating a peptide with a modification comprise locating potential modification sites on a binary encoded protein sequence. Preferably associating a peptide with a modification further comprises using a bit mask to represent the rules defining putative modification sites. The occurrences of the potential modifications are found by using bit-wise operators. In other preferred embodiments, enzyme cleavage sites are located on a binary encoded protein sequence using bit masks to represent the rules defining cleavage sites. Cleavage sites are found using bit-wise operators.

In other preferred embodiments of the methods of the invention, possible missed cleavage sites are rapidly located on a binary encoded protein sequence using bit masks to represent the rules defining possible missed cleavage sites. Potential missed cleavage sites are found by using bit-wise operators.

In one aspect, the size of the binary representation of each amino acid residue is at least 20 bits. In further preferred embodiments, the binary representation size per amino acid is chosen to facilitate computation. Most preferably, the binary representation size per amino acid residue is 24 or 32 bits.

In further aspects, the binary representation is computed by using the alphabetical order of the amino acid one-letter code.
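One possible reading of the binary encoding described above is sketched below. It is assumed, for illustration only, that each residue is one-hot encoded in a machine word, that the bit index is the rank of the residue in the alphabetically ordered list of the 20 standard one-letter codes, and that a rule is a bit mask with one bit set per admissible residue. Python integers stand in for 24- or 32-bit words. The trypsin rule used (cleavage after K or R unless followed by P) is the conventional one and is not taken from the specification; all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes in alphabetical order
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}  # 20 bits, fits in 24 or 32

def encode(sequence):
    """Translate a protein sequence into a list of one-hot words."""
    return [BIT[aa] for aa in sequence]

def mask(residues):
    """Build a bit mask representing a rule such as 'S or T'."""
    m = 0
    for aa in residues:
        m |= BIT[aa]
    return m

def find_sites(encoded, residue_mask):
    """Positions whose residue satisfies the rule, found with a bit-wise AND."""
    return [i for i, word in enumerate(encoded) if word & residue_mask]

def tryptic_cleavage_sites(encoded):
    """Positions i such that cleavage occurs after residue i."""
    kr, p = mask("KR"), mask("P")
    return [i for i in range(len(encoded) - 1)
            if encoded[i] & kr and not encoded[i + 1] & p]
```

Under these assumptions, testing whether a residue satisfies a rule costs a single AND operation regardless of how many residues the rule admits, which is the kind of saving the bit-mask representation is intended to provide.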

In another embodiment, the second step of the methods above, which computes, for a subset of peptides, the fragmentation spectra of the peptides containing precisely located modifications, is conducted by taking into account the probabilities of each modification to occur.

In this embodiment, fragmentation spectra of peptides whose probabilities to occur are very low (e.g. resulting from an unlikely combination of variable modifications present simultaneously) are not calculated.

While it will be appreciated that a protein sequence can be obtained from any suitable source, in preferred embodiments protein sequences or amino acid sequences for use in the methods of the invention are obtained from a database. In other embodiments, protein sequences are computed or manually assembled.
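The probability-based restriction of the second step described above can be sketched as follows. The sketch assumes, for illustration only, that each variable modification site has a known probability of being modified, that sites are independent, and that configurations below a user-chosen threshold are skipped; the function name `probable_configurations` and the threshold value are hypothetical.

```python
from itertools import product

def probable_configurations(site_probs, min_prob=0.01):
    """Yield (modified_positions, probability) for every present/absent
    configuration whose probability is at least `min_prob`.

    `site_probs` maps a residue position to the probability that the
    variable modification is present there (sites assumed independent).
    """
    positions = sorted(site_probs)
    for flags in product((False, True), repeat=len(positions)):
        p = 1.0
        for pos, present in zip(positions, flags):
            p *= site_probs[pos] if present else 1.0 - site_probs[pos]
        if p >= min_prob:
            yield tuple(pos for pos, f in zip(positions, flags) if f), p
```

Fragmentation spectra would then be computed only for the configurations yielded, so that unlikely combinations of simultaneously present modifications incur no spectrum computation.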

In another aspect, the invention relates to a plurality of amino acid sequences stored on a computer-readable medium obtained by the methods of the present invention. Preferably, each of said plurality of amino acid sequences is associated with a modification and a total mass.

In another aspect, the invention provides a method of identifying proteins which is carried out in two correlation steps, wherein the first correlation step is based on a mass spectrometry (MS) spectrum and a second correlation step is based on a tandem MS (MS-MS) spectrum. In the first step, more than one test amino acid sequence is represented and peptide mass fingerprint (PMF) data are obtained by MS. The known mass, modification sites, and enzyme cleavage sites of the protein to be identified are used to pre-select a plurality of peptide sequences that may account for the PMF data from said test amino acid sequences. In a second step, the invention involves generating potential combinations of modifications for each of the peptides pre-selected in the first step and correlating the thus simulated peptides to experimental data obtained by MS-MS for the protein to be identified.

The test amino acid sequences are typically sequences stored in a database. The stored sequences are preferably amino acid sequences, although any suitable means of representation may be used (e.g., nucleotide sequences encoding amino acid sequences). The amino acid sequences may be generated via computer means during the process of correlation to the experimental mass spectrum.

The first correlation step permits the selected test peptides to be used as the basis of the second correlation step, in which predicted fragments of the selected peptides are correlated to experimental MS-MS data of the protein to be identified. The predicted fragments are associated with modifications, most preferably a set of post-translational modifications, allowing for accurate identification in the second correlation step. By associating the most extensive set of modifications only at the second correlation step, the invention achieves an important reduction in the amount of computation required to identify a protein. The correlation steps can be carried out using any of several known methods for correlating a test sequence to an experimental mass spectrum (see, for example, US patent 6,017,693).

In one aspect, the invention provides a method for identifying a peptide comprising:

(a) providing a plurality of test amino acid sequences, preferably stored in a database, each preferably associated with at least one modification, and each associated with a total mass;

(b) providing an experimental mass spectrum of a peptide;

(c) correlating said experimental mass spectrum of step (b) with said total masses from said test amino acid sequences of step (a) to select a subset of said test amino acid sequences;

(d) for each test amino acid sequence in said subset, given a list of modifications to be considered,

(1) associating said test amino acid sequence with every possible set of modifications, wherein each modification is associated with a location on the test amino acid sequence, and

(2) computing the corresponding fragmentation spectra;

(e) providing at least one fragmentation mass spectrum generated from the parent mass spectrum of step (b); and

(f) correlating said fragmentation mass spectrum of step (e) with said computed fragmentation spectra of step (d).

Preferably, the test amino acid sequence of step (a) represents enzymatically digested peptides derived from a protein sequence. In one aspect, step (a) further comprises providing the amino acid sequence of a protein, and generating amino acid test sequences corresponding to peptides obtained from said protein by enzymatic digestion.

In another preferred aspect of the invention, the MS and MS-MS mass spectra are acquired prior to any computational steps occurring. As such, the experimental part of the methods yields, for a given peptide, a set of fragmentation spectra with their corresponding masses in the MS spectrum of the peptide (hereinafter, "parent masses"). The MS spectrum of step (b) may therefore be simplified to represent only those parent masses for which fragmentation spectra have been acquired. This represents yet another improvement in computation time.

In another aspect, the invention provides a method for identifying a protein comprising:

(a) providing a set of modification sites on a test protein sequence and enzyme cleavage sites that define peptides, wherein each test peptide thus defined is preferably associated with at least one modification and wherein each test peptide is associated with a total mass;

(b) providing a list of experimental parent masses, for which experimental fragmentation mass spectra have been acquired, of a peptide released from the protein to be identified by enzymatic digestion;

(c) correlating said list of experimental parent masses of step (b) with said total masses from said test peptides of step (a) to select a subset of test peptides;

(d) for each test peptide in said subset, given a list of modifications to be considered,

(1) associating said test peptide with every possible set of modifications, wherein each modification is associated with a location on the peptide; and

(2) computing the corresponding fragmentation spectra;

(e) correlating said experimental fragmentation mass spectra of step (b) with said computed fragmentation spectra of step (d) to generate a score; and

(f) identifying the protein of interest by association with the highest ranking test peptides in the correlation of step (e).

It will be apparent to one skilled in the art that the method above is useful for the identification of a protein substantially pure in a sample, as well as for the identification of a number of different proteins mixed together in a sample.

In other aspects, the invention also encompasses computer systems and computer program products for carrying out the methods of computing the mass of peptides and of identifying a protein or peptide.
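Steps (e) and (f) of the method above leave the scoring function and the manner of associating peptides with proteins open. As one hedged possibility, the sketch below ranks proteins by the sum of the best score obtained for each of their peptides. The aggregation rule, the function name `rank_proteins` and the input layout are assumptions and are not taken from the specification.

```python
from collections import defaultdict

def rank_proteins(peptide_scores):
    """Rank proteins from (protein_id, peptide, score) triples.

    For each protein, the best score of each distinct peptide is kept and
    the kept scores are summed; proteins are returned best first.
    """
    best = defaultdict(dict)
    for protein, peptide, score in peptide_scores:
        previous = best[protein].get(peptide, float("-inf"))
        best[protein][peptide] = max(score, previous)
    totals = {protein: sum(scores.values()) for protein, scores in best.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Because several proteins can be returned, such a ranking applies equally to a substantially pure protein and to a mixture of proteins in a sample.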

The present invention provides a means for computing the mass of a peptide, given a set of modification and enzyme cleavage sites on a protein sequence, comprising: a computer program product including a computer usable medium having computer readable program code means embodied in said medium. The computer program product includes computer readable program code means for causing a computer to associate each peptide with at least one set of modifications, wherein each combination of peptide and set of modifications corresponds to or is associated with a total mass. The computer program product also includes computer readable program code means for causing a computer, only for a subset of peptides (and preferably only for a subset of combinations of peptides and set of modifications) selected based on their total mass, to: (a) associate each peptide in said subset with every possible set of modifications for one or more selected types of modifications, wherein each modification is associated with a location on the peptide, and (b) compute the corresponding fragmentation spectra.

In another embodiment, the present invention provides a computer program product including a computer usable medium having computer readable program code means embodied in said medium for identifying a protein or peptide. Optionally, the computer program product includes computer readable program code means for causing a computer to store experimental mass data from the protein or peptide to be identified. The computer program product includes computer readable program code means for causing a computer to correlate the experimental mass data (generally MS data) with a plurality of test amino acid sequences each of which sequences is associated with a total mass, to select a subset of test amino acid sequences based on the total mass. The computer program product includes computer readable program code means for causing a computer, for each test amino acid sequence in said subset, given a type of modification to be considered, to: (a) associate each test amino acid sequence in said subset with every possible set of modifications, wherein each modification is associated with a location on the peptide, and (b) compute the corresponding fragmentation spectra. Finally, the computer program product preferably further includes computer readable program code means for causing a computer to correlate experimental MS-MS data with said computed fragmentation spectra.

DESCRIPTION OF THE FIGURES

Figure 1 shows a procedure for the identification of proteins, involving searching a database of biological sequences with mass spectrometry data and comparing the experimental spectra with theoretical spectra generated from the biological sequences stored in the database.

Figure 2 illustrates the process of tandem mass spectrometry, whereby a first mass spectrum of an ionized analyte is acquired (upper panel, showing two main ions at 433 and 325 m/z). This first spectrum allows ions to be selected (the non-selected ions being filtered out) and further processed by fragmentation followed by a second analysis (lower panel, showing the fragmentation (MS-MS) spectrum of the parent ion selected at ca. 433 m/z).

DETAILED DESCRIPTION

The present invention relates to an automated method for efficiently and precisely computing theoretical mass spectra of proteins containing modifications, such as Post-Translational Modifications (PTMs). The mass spectra may be either peptide mass fingerprints or peptide dissociation mass spectra. The efficient mass computation consists in a new two-stage procedure. A rich set of functionalities permits the precise assignment of modifications.

The present invention also relates to improving current methods for identifying biological molecules, especially modified polypeptide molecules. In one embodiment, the invention provides a method for determining the identity of an experimental biological molecule. A common procedure used in proteomics projects is to compare experimental MS spectra with theoretical spectra generated from biological sequences stored in a database. A general scheme of this process is shown in Figure 1. It will be appreciated that the sequences may be any collection of biological sequences, not necessarily organized in a database. For example, sequences may be generated by computation or manual assembly. The proteins from which MS data are acquired may be modified as described herein.

Modifications include the addition of chemical moieties to an amino acid sequence (e.g., carbohydrate, glycosyl, acetyl, amidyl, phosphate, etc), as well as structural interaction between amino acids, and enzymatic cleavage. They may be due to biological processes or chemical reactions occurring while the proteins are prepared for MS data acquisition. Modifications change the expected unmodified mass of the protein. Therefore, it is of prime importance to consider possible protein modifications in the generation of theoretical mass spectra.

In the methods of the present invention, two kinds of MS data are considered: Peptide Mass Fingerprints (PMFs) and fragmentation spectra (by MS-MS). Proteins are generally digested into peptides using trypsin or another appropriate protease. The mass measurement of the peptides obtained by digestion provides a PMF. Such a PMF can be used for searching or comparison to a database. In certain circumstances, PMFs are not specific enough to permit non-ambiguous identification of the original protein. A second procedure can hence be applied: fragmentation of the peptides (Papayannopoulos, I. 1995: The interpretation of collision-induced dissociation mass spectra of peptides, Mass Spectrometry Reviews, 14:49-73). This procedure is called tandem mass spectrometry, tandem-MS, MS² or MS-MS. The masses of the fragments constitute a very specific data set to identify each peptide. By extension, the MS-MS data for several peptides of a protein constitute a very specific data set to identify the original protein (Henzel, W.J. et al. 1993: Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases, Proc. Natl. Acad. Sci. USA, 90:5011-5015; McCormack, A. L. et al. 1997: Direct analysis and identification of proteins in mixture by LC/MS-MS and database searching at the low-femtomole level, Anal. Chem., 69:767-776; James, P. ed. 2000: Proteome Research: Mass Spectrometry, Springer, Berlin).
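To make the notion of a theoretical PMF concrete, the following sketch digests a protein sequence in silico and computes the unmodified monoisotopic mass of each resulting peptide. It assumes the conventional trypsin rule (cleavage after K or R unless the next residue is P) and an optional number of missed cleavages; the function names and the example sequence are hypothetical, and the masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(protein, missed_cleavages=0):
    """Return the peptides released by the assumed trypsin rule."""
    cuts = [i + 1 for i in range(len(protein) - 1)
            if protein[i] in "KR" and protein[i + 1] != "P"]
    bounds = [0] + cuts + [len(protein)]
    peptides = []
    for start in range(len(bounds) - 1):
        # Extend each peptide over up to `missed_cleavages` uncut sites.
        last = min(start + 2 + missed_cleavages, len(bounds))
        for end in range(start + 1, last):
            peptides.append(protein[bounds[start]:bounds[end]])
    return peptides

def peptide_mass(peptide):
    """Unmodified monoisotopic mass of a peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def mass_fingerprint(protein, missed_cleavages=0):
    """Sorted list of theoretical peptide masses, i.e. a theoretical PMF."""
    return sorted(peptide_mass(p) for p in tryptic_digest(protein, missed_cleavages))
```

For the hypothetical sequence "MKSARPTKGG", the digest without missed cleavages gives the peptides MK, SARPTK and GG; the R-P bond is left intact under the assumed rule.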

Definitions

"Biological molecules", as used herein, include any biological polymer that can be degraded into constituent parts. The degradation is preferably into constituent parts at predictable positions to form predictable masses. Examples of biological molecules include proteins, nucleic acid molecules, polysaccharides and carbohydrates. Proteins are polymers of amino acids. Constituent parts of proteins comprise amino acids. A protein typically contains approximately at least ten amino acids, preferably at least fifty amino acids and more preferably at least 100 amino acids. As used herein, "protein" includes peptides, peptide fragments, polypeptides, amino acid sequences, and other terms common in the art.

As used herein, a "protein sequence" represents the identity and order of the amino acid residues that make up a protein. A protein sequence can be represented as a list of amino acids, for example. A protein sequence is usually ordered from the N-terminal to the C-terminal. As used herein, a "peptide" is part of a protein obtained by enzyme digestion. In terms of sequence, a peptide sequence is a sub-sequence of the entire protein sequence. A peptide sequence represents the identity and order of the amino acid residues that make up a peptide. As used herein, a "parent peptide" is a peptide observed in the first stage of tandem mass spectrometry that is fragmented in the second stage of tandem mass spectrometry.

As used herein, an "experimental biological molecule" is a biological molecule which is to be identified; the experimental biological molecule can also be referred to as an unknown biological molecule. As used herein, a "theoretical biological molecule", also referred to herein as a "test molecule", may be any biological molecule other than the experimental biological molecule which is to be determined. The theoretical biological molecule may be a predicted biological molecule (i.e., not experimentally determined), a randomly-generated biological molecule, or a known biological molecule (e.g., described in a database).

As used herein, a "variable modification" is, e.g., a modification that occurs in nature, generally as a part of normal cellular processes. The presence of a variable modification at its correlated modification site is not certain, and as such the probability of observing the modification can be considered medium to low. As used herein, a "fixed modification" is, e.g., a non-naturally imposed modification. The presence of a fixed modification is generally controlled by the experimental protocol, and thus is considered certain for a given set of conditions. As such, the probability of observing the modification can be considered high.

Hence, the terms "fixed" and "variable" are used herein to qualify the probability that a modification occurs at a given location on the amino acid sequence.

Mass data of biological molecules are quantifiable information about the masses of the constituent parts of the biological molecule. Mass data include individual mass spectra and groups of mass spectra. The mass spectra can be in the form of peptide maps. When referring to a peptide, a "set of modifications" refers to one possible grouping of modifications that may be present on a peptide simultaneously. A set of modifications may comprise several modifications, a single modification, or no modifications. The sites for a set of modifications can be determined based on any suitable rules. Thus, it is not necessary that all modifications which can occur in nature or a given environment are considered; rather, the rules defined by the user determine the extent and type of modifications that are considered.

The "location" of a modification on a peptide refers to the amino acid residue on which the modification occurs or is predicted to occur.

Computing the mass of modified peptide sequences

Disclosed herein is a method for efficiently and precisely computing theoretical mass spectra in the presence of protein modifications. Variable modifications induce a combinatorial explosion of the number of possible cases to consider, and the computation of MS-MS spectra requires extra processing time in addition to the computing time required for calculating peptide masses only. It is an object of the present invention to reduce the number of peptide masses and MS-MS spectra to be computed, while preserving the advantages of increased precision gained in the assignment by taking into account protein modifications, for the peptides that are most likely to occur. The present invention also provides functionalities and methods to precisely and rapidly locate modifications on those peptides. In order to reduce the number of computations, the theoretical spectra are computed in two stages as described below.

In a first step, a protein sequence is digested into a plurality of peptides, and these peptides are associated with a list of possible modifications. In this step, more than one set of modifications is provided, each set representing a possible combination of modifications that may be present on a peptide. A set of modifications may comprise 0, 1, or several modifications, depending on the size and sequence of the peptide.

The set of modification sites can be determined based on any suitable rules, for example, the bit-masking method provided herein may be used. It is not necessary that all modifications possible in nature or a given environment are considered. The rules defined by the user determine the extent and type of modifications that are considered.

Likewise, it is not necessary to associate modifications with a location on the peptide in this first step. While the association of modifications with locations on a peptide allows every possible combination of modifications to be computed, in this first step it is sufficient to consider the possible numbers of modifications of a certain type. However, it will be appreciated that associating a modification with a location on the peptide is also contemplated if desired. A total mass is associated with each of the combinations of peptide and set of modifications. These masses are sufficient to compare to an experimental MS spectrum, also referred to as a PMF. The total masses are also sufficient to compare with peptide parent masses of MS-MS data. Table 1 provides an example of a peptide associated with a list of three variable modifications. The exemplary peptide is shown with the modifications considered and with the rules for determining the site of each modification on the peptide. In this example, only the total masses are represented; the impact of the modification sites on the MS-MS spectra is analyzed further below (Table 2).

Table 1: Peptide PNCFMNGR

Variable modifications considered:      With the rule:
Meth_nterm: N-term methylation          N-terminal
Deam_N: Deamidation on N residues       N followed by G
Oxy: Oxidation                          H or M or W
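The first stage for the Table 1 peptide can be sketched as follows. This is a minimal illustration and not the implementation of the invention: the residue masses and modification mass shifts are standard monoisotopic values assumed for the example, the function names (`peptide_mass`, `count_sites`, `stage_one`) are hypothetical, and the peptide is assumed to carry the protein N-terminus so that Meth_nterm applies. Only the number of modifications of each type is enumerated; no location is assigned.

```python
from itertools import product

# Standard monoisotopic residue masses in Da (assumed values, not from the source).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

# Assumed monoisotopic mass shifts (Da) for the three variable modifications.
MOD_SHIFT = {"Meth_nterm": 14.01565, "Deam_N": 0.98402, "Oxy": 15.99491}


def peptide_mass(peptide):
    """Neutral monoisotopic mass of the unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_sites(peptide):
    """Number of candidate sites per modification, following the Table 1 rules."""
    return {
        "Meth_nterm": 1,  # assumption: the peptide carries the protein N-terminus
        "Deam_N": sum(1 for a, b in zip(peptide, peptide[1:]) if a == "N" and b == "G"),
        "Oxy": sum(1 for aa in peptide if aa in "HMW"),
    }


def stage_one(peptide):
    """List (counts, total mass) for every combination of modification counts."""
    sites = count_sites(peptide)
    names = sorted(sites)
    base = peptide_mass(peptide)
    results = []
    for counts in product(*(range(sites[n] + 1) for n in names)):
        shift = sum(c * MOD_SHIFT[n] for n, c in zip(names, counts))
        results.append((dict(zip(names, counts)), base + shift))
    return results


combos = stage_one("PNCFMNGR")
for counts, mass in combos:
    print(f"{mass:9.4f}  {counts}")
```

With one candidate site per modification, the sketch yields eight sets of modifications, each reduced to a single total mass that can be compared with a PMF peak or a parent mass.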

In a second step, to compare with MS-MS data, a theoretical MS-MS spectrum from a peptide is computed, but only for peptides for which there is a sufficiently close match between an experimental parent peptide mass and a theoretical total mass (as exemplified in Table 1).

As described above, the first step of the method has generated, for a given test protein, a plurality of peptides, each of which is associated with a plurality of possible modifications, or in preferred aspects, a list of sets of modifications. All combinations of peptide and set of modifications from the list correspond to a theoretical total mass. In the second step, these combinations of peptide and set of modifications, characterized by their total mass, are correlated to a given mass spectrum, preferably an experimental mass spectrum. Correlating a given mass spectrum to a combination of a peptide and a set of modifications can be carried out according to means well known in the art. This not only allows the peptides that may be responsible for the experimental mass spectrum to be identified, but also allows possible combinations of peptide and set of modifications to be considered. Selecting a limited number of combinations of peptide and set of modifications, as the basis for theoretical MS-MS spectra computation, as opposed to computing the totality of possible MS-MS spectra, increases computational efficiency tremendously. Thus, in a preferred embodiment, only the combinations of peptide and set of modifications having a total mass close to an experimental mass are selected. These selected peptides are then used for computing the MS-MS spectrum.

As the exact location of every modification impacts the MS-MS spectrum (see Table 2), more than one MS-MS spectrum may be generated from each combination of peptide and set of modifications (see also Table 9). Modifications can be associated with a peptide in computing an MS-MS spectrum in a manner analogous to that used in step 1. Since this second step includes consideration of subsets of the peptide sequence initially considered in step 1, however, the modifications are considered more extensively in the second step. In particular, modifications are associated with a particular, specific location on a peptide, allowing the method of the invention to take into account different variable modifications, which may often occur only in the alternative at a certain amino acid residue. Thus, in one example, a bit-masking method is used to identify sites of modifications, and the modifications are associated with locations (e.g. particular amino acid residues) on a peptide. For the types of modifications considered, a peptide is then associated with all possible combinations of modifications.

Table 2 demonstrates a theoretical MS-MS spectrum of tryptic peptide FPNCYQKPCNR. The modification considered here is Cys_CAM (iodoacetamide, +57 Da), related to the breakage of disulfide bonds, and is treated as a variable modification. The rule used in identifying modifications is that every cysteine (C) can be modified. The mass contribution of each modification can be seen in the column labeled "Total Mass" (total mass of the peptide). For each set of modifications, the b series ions (representing ions starting from the N-terminal residue of the test sequence, with the loss of C-terminal amino acid residues) and the y series ions (representing ions starting from the C-terminal residue of the test sequence, with the loss of N-terminal amino acid residues) are shown.

As will be appreciated from the table, the two cases where only one cysteine is modified share the same total mass, so the distinction has no consequence for PMF searches. It is only necessary to consider the number of modified cysteines in the first step of the present method. However, where the fragment masses are required, the exact location of each modification becomes important. When computing the fragmentation spectrum in this second step, the case "one cysteine modified" yields two distinct MS-MS spectra due to the two possible locations of the modified cysteine.
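The effect of the modification location on the fragment masses can be illustrated with the following sketch. It assumes singly charged b and y ions, standard monoisotopic residue masses, and a Cys_CAM shift of 57.02146 Da; these values and the function names `placements` and `fragment_ions` are assumptions of the example, and the numbers it prints are not taken from Table 2.

```python
from itertools import combinations

# Standard monoisotopic residue masses in Da (assumed values, not from the source).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
CAM = 57.02146  # assumed monoisotopic shift for Cys_CAM


def placements(peptide, residue, k):
    """All ways of placing k modifications on the occurrences of `residue`
    (0-based positions)."""
    sites = [i for i, aa in enumerate(peptide) if aa == residue]
    return list(combinations(sites, k))


def fragment_ions(peptide, mods):
    """Singly charged b and y ion masses; `mods` maps a 0-based position
    to a mass shift."""
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide)]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal fragments
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal fragments
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


peptide = "FPNCYQKPCNR"
for sites in placements(peptide, "C", 1):
    b_ions, y_ions = fragment_ions(peptide, {i: CAM for i in sites})
    print(sites, [round(m, 2) for m in b_ions])
```

Both placements of a single Cys_CAM give the same total mass, but the b ions between the two cysteines differ by the mass of the modification, which is why the location matters only in the second stage.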

Table 2

Theoretical MS-MS spectra can be generated from peptides using means well known in the art for computing a theoretical fragmentation spectrum (see Snyder, A. P., 2000: Interpreting Protein Mass-Spectra, Oxford University Press, Washington DC). The approach described herein saves computer power and memory, as only a subset of every possible theoretical MS-MS spectrum is computed. The method can thus be carried out as further described in detail. The method may also be used in the context of identifying experimental biological molecules which comprise modifications, particularly proteins having post-translational modifications.

Generating protein or peptide sequences associated with modifications

Test peptide sequences, which are to be compared with an experimental MS spectrum in a first step, and with an MS-MS spectrum in a second step, are associated with one or more modifications.

Again, fixed modifications are usually introduced experimentally and thus can be considered as certain under given conditions. Variable modifications, on the other hand, must be considered as possibly occurring but not systematically. In addition, the modifications may have a precise location assigned a priori, or their location may be defined by a rule. Such a rule very often takes the form of a simple pattern like: Cys_CAM on every cysteine, N-deamidation on every asparagine followed by a glycine, or N-term methylation at the N-terminal extremity of the protein (see Turner, J. P. et al., 1997, Letter code, structure and derivatives of amino acids, Molecular Biotechnology, 8:233-247, for a review). Also considered are modification sites predicted by a sequence-based algorithm (e.g., a neural network or hidden Markov model, see Blom, N. et al., 1999, Sequence- and structure-based prediction of eukaryotic phosphorylation sites, J. Mol. Biol., 294:1351-1362 and Hansen, J. E. et al., 1998, NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility, Glycoconjugate Journal, 15:115-130) and modification sites extracted from an appropriate database (e.g., Swissprot).

To generate the theoretical PMFs, it is sufficient to compute the mass of the peptides. This implies that locating the modifications on a peptide is not necessary. Thus, when considering a large set of amino acid or protein sequences, it is computationally more efficient to avoid associating modifications with locations on the amino acid sequence. It is only necessary to take into account the number of occurrences of every fixed modification and every combination of variable modifications (each considered both present and not present). On the other hand, the generation of the theoretical MS-MS spectra requires the exact location of the modifications (see Table 2). Thus, in a preferred example, a test amino acid sequence may be associated with one or more post-translational modifications. These post-translational modifications may have been experimentally determined or predicted.

Bit masks

In a preferred embodiment, possible modification sites for a given type of modification are rapidly located in silico by representing protein sequences and possible modification sites with bit masks.

Using a bit mask involves representing each amino acid with a binary number, as shown in Table 3. For each amino acid, the bit set to 1 is the i-th bit, where i is the rank of the one-letter code for that amino acid in the ordered list of one-letter symbols for amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). This requires a string of at least 20 bits to encode each amino acid.
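A minimal sketch of this encoding is given below. The name `encode` is hypothetical, and the sketch assumes that the first bit is the least significant (right-most) one, which agrees with the masks shown for N and G in Table 6.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordered one-letter codes
BITS = len(AMINO_ACIDS)               # 20 bits per amino acid


def encode(aa):
    """Return the bit mask in which only the i-th bit is set, where i is
    the 1-based rank of `aa` in AMINO_ACIDS (bit 1 = least significant)."""
    return 1 << AMINO_ACIDS.index(aa)


for aa in "AGNY":
    print(aa, format(encode(aa), "020b"))
```

Each residue thus occupies exactly one bit, so sets of residues can be expressed by combining masks with bit-wise operators.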

Table 3: The 20 amino acids, their one-letter code, mono-isotopic mass and binary representation

To define possible modification sites, a maximal pattern size p is fixed, for instance three amino acids: one before the modification site, one at the modification site and one after the modification site. Rules providing sites at which a given modification may be present can be used to determine patterns. Each possible modification site is thus associated with a pattern.

The building of the pattern then makes use of the bit-wise operators AND, OR and NOT. These bit-wise operators can be defined as in Table 4. NOT is a unary operator (i.e., it uses only one operand) that performs a logical NOT on every bit of its operand. AND and OR are binary operators that work bit by bit: they apply the classical logical operators to the corresponding bits of their two operands. The table below shows three examples covering every possibility.

Table 4: Bit-wise operators NOT, AND and OR

1100 AND 1010 = 1000

1100 OR 1010 = 1110

NOT 10 = 01

The building of the pattern can be carried out as follows:

1. At every pattern position corresponding to a mandatory amino acid we take the binary representation of this amino acid.

2. At every pattern position corresponding to a set of possible amino acids we take the bit-wise OR of the binary representations of these amino acids.

3. At every pattern position where any amino acid is acceptable we take the binary number 11111111111111111111 (20 times 1).

4. At every pattern position where every amino acid is acceptable but one, we take the bitwise NOT of the binary representation of this amino acid.

5. At every pattern position where every amino acid is acceptable but a set of amino acids, we take the bit-wise NOT of the result of the bit-wise OR of the binary representations of these amino acids.
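The five rules above can be sketched as small helper functions. All names (`encode`, `one_of`, `all_but`, `concat`, `show`) are hypothetical, bit 1 is assumed to be the least significant bit, and pattern positions are concatenated with the first position left-most; the Deam_N rule (any residue, then N, then G) serves as the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BITS = len(AMINO_ACIDS)
ANY = (1 << BITS) - 1  # rule 3: twenty bits set to 1


def encode(aa):
    """Rule 1: a mandatory amino acid."""
    return 1 << AMINO_ACIDS.index(aa)


def one_of(aas):
    """Rule 2: bit-wise OR over a set of possible amino acids."""
    mask = 0
    for aa in aas:
        mask |= encode(aa)
    return mask


def all_but(aas):
    """Rules 4 and 5: bit-wise NOT of one amino acid or of a set.
    Python integers are unbounded, so the result is cut back to 20 bits."""
    return ANY & ~one_of(aas)


def concat(masks):
    """Concatenate per-position masks, first position left-most."""
    value = 0
    for mask in masks:
        value = (value << BITS) | mask
    return value


def show(pattern, size):
    """Render a concatenated pattern as `size` groups of 20 bits."""
    bits = format(pattern, "0{}b".format(BITS * size))
    return " ".join(bits[i:i + BITS] for i in range(0, len(bits), BITS))


deam_n = concat([ANY, encode("N"), encode("G")])
print(show(deam_n, 3))
```

The masking with `ANY` in `all_but` is specific to languages with unbounded integers; with fixed-width machine words the unused high bits would need the same care.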

Method for determining PTM sites

Given an amino acid sequence represented as bits and a pattern as defined above, the amino acid sequence may be scanned to identify possible modification sites in order to associate the amino acid sequence with a set of modifications. The procedure for determining possible modification sites (pattern size is p) can be carried out as follows:

1. Encode the p first letters of the protein sequence in p binary representations as above (Table 3). See Table 5 for an example.

2. Concatenate the p binary representations of the amino acids in one single data structure. See Table 5 for an example.

3. Compute the binary representation of the pattern (of size p) associated with a modification of interest, using the building method described above and a concatenation similar to that used in step (2). See Table 7 for examples.

4. Compute the bit-wise logical AND (see Table 4) of the two binary representations (pattern from step (3) and p amino acids from step (2)).

5. If the result of the bit-wise AND is equal to the representation of the p amino acids, one putative site has been found.

6. Remove the binary representation of the first amino acid of the set of p, compute the representation of the next one in the protein sequence, shift the p-1 remaining representations to the left, and insert the new one. See Table 5 for an example.

7. Repeat steps 3 to 6 (optionally, 4 to 6), until the end of the protein sequence is reached.

This procedure is further exemplified in Table 5. The flexibility of the binary representation allows several possible amino acids to be considered for a given modification with extra conditions on the previous and next amino acids. The conditions on the previous and next amino acids may require specific amino acids to be present or not present. Table 7 provides further examples.
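The scanning procedure can be sketched as a sliding window over the encoded sequence. The function names are hypothetical, bit 1 is assumed to be the least significant bit, and the sketch returns the 0-based start of each matching window, so the modified residue is found at the offset of the site within the pattern (for example, start + 1 for a three-residue pattern centered on the site).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BITS = len(AMINO_ACIDS)
ANY = (1 << BITS) - 1


def encode(aa):
    return 1 << AMINO_ACIDS.index(aa)


def concat(masks):
    value = 0
    for mask in masks:
        value = (value << BITS) | mask
    return value


def find_sites(sequence, pattern_masks):
    """Return the 0-based start of every window of len(pattern_masks)
    residues that satisfies the pattern."""
    p = len(pattern_masks)
    if len(sequence) < p:
        return []
    pattern = concat(pattern_masks)                      # step 3
    keep = (1 << (BITS * p)) - 1                         # drops the oldest residue
    window = concat(encode(aa) for aa in sequence[:p])   # steps 1 and 2
    hits = []
    for start in range(len(sequence) - p + 1):
        if (window & pattern) == window:                 # steps 4 and 5
            hits.append(start)
        nxt = start + p
        if nxt < len(sequence):                          # step 6
            window = ((window << BITS) & keep) | encode(sequence[nxt])
    return hits


deam_n = [ANY, encode("N"), encode("G")]
print(find_sites("DANGPT", deam_n))         # window A-N-G starts at index 1
print(find_sites("FPNCYQKPCNR", [encode("C")]))
```

A window matches exactly when every residue bit it contains is also set in the pattern, which is what the equality test after the bit-wise AND expresses. One limitation of this sketch is that a fixed three-residue window cannot match a site on the very first or last residue; a variable pattern size avoids this for rules that ignore the neighbors.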

Table 5: Examples of bit masks (20 bit size) computed for a sequence of amino acids and the resulting patterns (p=3)

Table 6 thus provides an example of the application of the bit mask-encoding method to locate putative sites for asparagine deamidation. The rule is: asparagine (N) followed by glycine (G), with no condition on the amino acid N-terminal to the asparagine. A pattern size of p=3 amino acids is considered. The binary encoding of the amino acids is '1' at the position corresponding to their alphabetic order, '0' elsewhere. The sequence of letters used to represent amino acids is A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (see Snyder, A.P. 2000, Interpreting Protein Mass Spectra, Oxford University Press, Washington DC). Table 6 illustrates the steps introduced in the text with one occurrence of the modification found at 'G'.

Table 6: Application of the bit mask-encoding method

Protein sequence: DANGPT

Deam_N bit mask used: 1111 11111111 11111111 0000 00001000 00000000 0000 00000000 00100000

Table 7: Examples of patterns associated to modifications (20 bits, p=3)

It will be apparent to one skilled in the art that in some particular cases of modifications, the size of the pattern (default p=3) can be reduced to accelerate computation without losing information. For example, the Cys_CAM modification, as illustrated in Table 7 above, accepts any amino acid before and any amino acid after the site of the modification, provided that the amino acid at the site of modification is a cysteine. The size of the pattern for scanning for this modification in protein sequences can therefore be conveniently reduced to one, as only one amino acid is important for detecting potential modification sites. Similarly, the modification Deam_N (Table 7) can be searched with a pattern size of two. It is therefore advantageous to implement a method for scanning for potential modification sites according to the invention which utilizes a variable pattern size.

In a further preferred embodiment it is possible to use 24 bit long bit masks (where the 4 left-most bits are not used), as 24 is a multiple of 8 and this is more convenient for a computer. In a more advantageous embodiment, it is possible to use 32 bit long bit masks (where the 12 left-most bits are not used). In a further more advantageous embodiment, the bit set to 1 in the coding of amino acids may be computed from the amino acid letter position in the alphabet (26 letters instead of 20 amino acids). For instance, first bit for A, third bit for C, fourth bit for D, twenty-fifth for Y, etc. Table 10 provides examples of 32 bit long bit masks thus calculated for modifications.

In a further embodiment, enzyme cleavage sites are represented by bit masks. An analogous procedure is performed as for the modifications. For instance, trypsin cleavage occurs at each lysine (K) or arginine (R) site that is not followed by a proline (P). An example is provided in Table 8.

Table 8: Examples of enzyme patterns (20 bits, p=3)

Trypsin enzyme: cleaves on lysine (K) or arginine (R) if they are not followed by a proline (P).
Pattern: 11111111111111111111 00000100000100000000 11111110111111111111
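As an illustration, the trypsin rule can be applied with bit masks to digest a sequence in silico. This sketch tests only the two informative positions of the pattern (the cleavage residue and the residue that follows it), assumes bit 1 is the least significant bit, and uses hypothetical names (`K_OR_R`, `NOT_P`, `tryptic_peptides`) and a made-up test sequence.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BITS = len(AMINO_ACIDS)
ANY = (1 << BITS) - 1


def encode(aa):
    return 1 << AMINO_ACIDS.index(aa)


K_OR_R = encode("K") | encode("R")   # cleavage residue: K or R
NOT_P = ANY & ~encode("P")           # following residue: anything but P


def tryptic_peptides(sequence):
    """Split `sequence` after every K or R that is not followed by P."""
    peptides, start = [], 0
    for i in range(len(sequence) - 1):
        if encode(sequence[i]) & K_OR_R and encode(sequence[i + 1]) & NOT_P:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


print(format(K_OR_R, "020b"), format(NOT_P, "020b"))
print(tryptic_peptides("MKAHRPLKG"))
```

In the example sequence the arginine is followed by a proline, so no cleavage is generated there.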

In another preferred embodiment, missed cleavage sites are considered. The maximal number of missed cleavages may be specified as a parameter to the method.
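A minimal sketch of how missed cleavages bounded by such a parameter might be generated: consecutive peptides of a complete digest are joined, up to `max_missed` internal cleavage sites per product. The function name, the parameter name and the peptide list are hypothetical.

```python
def with_missed_cleavages(peptides, max_missed):
    """Given the ordered peptides of a complete digest, return every peptide
    spanning at most `max_missed` uncleaved sites."""
    products = []
    for i in range(len(peptides)):
        last = min(i + max_missed, len(peptides) - 1)
        for j in range(i, last + 1):
            products.append("".join(peptides[i:j + 1]))
    return products


digest = ["MK", "AHRPLK", "G"]  # hypothetical complete tryptic digest
print(with_missed_cleavages(digest, 1))
```

This variant treats every cleavage site as equally likely to be missed; a rule-based selection of missed cleavage sites would filter the joins instead of generating all of them.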

In an even more advantageous embodiment, missed cleavage sites are generated only in specific situations. This allows for better modeling of the natural occurrence of a missed cleavage. Thiede et al., 2000, give such a rule for tryptic digestion (Thiede, B. et al. 2000: Analysis of missed cleavage sites, tryptophan oxidation and N-terminal pyroglutamylation after in-gel tryptic digestion, Rapid Commun. Mass Spectrom., 14:496-502).

In a preferred embodiment, the rule for missed cleavage is represented in the same fashion as explained for possible modification sites. This pattern is then used to detect possible missed cleavage sites. Table 11 shows such a pattern for trypsin, based on the analysis made by Thiede et al., 2000.

In a further preferred embodiment, variable or fixed modifications assigned to precise locations of the protein sequence (pre-determined) are also considered, in addition to the locations determined on the basis of a rule. An example illustrating this with an acetylation forced to occur on residue 2 is presented in Table 9.

In a most preferred embodiment, overlapping variable modifications, i.e. variable modifications sharing the same location, are considered. In such a case, only one modification is considered as possible at a time. Thus, where every combination of modifications is generated, one of the overlapping modifications is included at a time (or zero modifications, as they are variable modifications). See Table 9 for an illustration of this.
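The handling of overlapping variable modifications can be sketched by letting each site take at most one of its candidate modifications, or none. The site map below follows the Table 9 example (1-based positions on peptide AKAHWNDAANG; 2 = methylation, 3 = deamidation, 4 = oxidation); the function names and the monoisotopic mass shifts are assumptions of the sketch.

```python
from itertools import product

# Candidate variable modifications per 1-based position (from the Table 9 example).
SITE_OPTIONS = {4: (2, 4), 5: (4,), 6: (2,), 7: (2,), 10: (2, 3)}

# Assumed monoisotopic mass shifts in Da, keyed by modification number.
MOD_SHIFT = {2: 14.01565, 3: 0.98402, 4: 15.99491}


def enumerate_assignments(site_options):
    """Yield every assignment {position: modification} in which each site
    carries at most one of its candidate modifications."""
    sites = sorted(site_options)
    choices = [(None,) + tuple(site_options[s]) for s in sites]
    for combo in product(*choices):
        yield {s: m for s, m in zip(sites, combo) if m is not None}


def total_shift(assignment):
    """Sum of the mass shifts of the modifications in one assignment."""
    return sum(MOD_SHIFT[m] for m in assignment.values())


assignments = list(enumerate_assignments(SITE_OPTIONS))
print(len(assignments), "located combinations")
```

Because the choice at a conflict site is one value out of "none or one candidate", two modifications can never be placed on the same residue.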

In a further aspect, modifications that can occur at N- or C-terminal ends of the protein, and preferably only at the N- or C-terminal ends, are considered. In one embodiment, modifications that can occur at N- or C-terminal ends of the peptides obtained after digestion, and preferably only at the N- or C-terminal ends, are considered.

In a further embodiment, algorithms are used to predict possible modification locations when a simple pattern, possibly represented as a bit mask, is not available or not precise enough. Such algorithms include artificial neural networks and hidden Markov models. The predicted locations are then used by the functionality that sets variable modifications at given locations. Predictive algorithms such as artificial neural networks and hidden Markov models typically consider a sliding window centered on the modification site. The parameters of such algorithms are drawn from known examples: the amino acid sequence in the sliding window is used as the input signal and the parameters are adjusted to predict the presence or absence of the modification, depending on the example (see Blom, N. et al., 1999, Sequence- and structure-based prediction of eukaryotic phosphorylation sites, J. Mol. Biol., 294:1351-1362; Hansen, J. E. et al., 1998, NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility, Glycoconjugate Journal, 15:115-130).

In a preferred embodiment, predicted locations are extracted from databases and then used by the functionality that sets modifications, variable or not, at a given location.

In a most preferred embodiment, the bit masks or the predictive algorithms are applied to the peptide sequences directly (after digestion) and not to the entire protein sequence.

In a preferred embodiment of the methods, nucleotide sequences are translated into amino acid sequences and used to predict modification locations.

Obtaining MS spectra

The method of the present invention includes providing experimental mass data for the experimental biological molecule within a certain mass range. Experimental mass data includes the measured masses and standard deviations associated with the measured masses. The method also includes generating theoretical mass data in the same mass range. In one embodiment the experimental mass data is a subset of the theoretical mass data.

For example, mass data for proteins can be generated in any manner that provides mass data within certain accuracy. Examples include matrix-assisted laser desorption/ionization (MALDI) mass spectrometry, and electrospray ionization mass spectrometry. Mass data may also be generated by a computer configured with software capable of calculating amino acid mass data.

A step in generating mass data of a biological molecule may include first cleaving the biological molecule into constituent parts. Biological molecules may be cleaved by methods known in the art. Preferably, the biological molecules are cleaved into constituent parts at predictable positions to form predictable masses. Methods of cleaving also include chemical degradation of the biological molecules. Biological molecules may be degraded by contact with an appropriate chemical substance.

For example, proteins are predictably degraded into peptides by means of cyanogen bromide and proteolytic enzymes (such as trypsin, endoproteinase Asp-N, V8 protease, endoproteinase Arg-C, etc). Nucleic acids may be predictably degraded into constituent parts by means of restriction endonucleases such as Eco RI, Sma I, BamH I, Hinc II, etc. Fragment mass data for the purposes of this invention is generated using multidimensional mass spectrometry (MS-MS). Various mass spectrometers can be used including a triple quadrupole mass spectrometer, a Fourier-transform cyclotron resonance mass spectrometer, a tandem time-of-flight mass spectrometer, and a quadrupole ion trap mass spectrometer. A single peptide from a protein digest is subjected to MS-MS measurement and the observed pattern of fragment ions is compared to the patterns of fragment ions predicted from database sequences.

Test sequences

The amino acid sequences that are associated with modifications according to the present method and used for generating theoretical spectra can be obtained from any suitable source and may be in any suitable format. The sequences, also referred to as "test sequences", may be any collection of biological sequences, not necessarily organized as a database. For example, amino acid sequences may be generated by computation or manual assembly.

A database comprises any compilation of information about characteristics of the biological molecules, test molecules or test sequences. Databases are the preferred method for storing both amino acid sequences and the nucleic acid sequences that code for these polypeptides. Different types of databases have advantages and disadvantages for use in polypeptide identification.

While the "database entry" for an amino acid sequence may appear to be a simple text file, many databases are organized into very flexible, complicated structures. The detailed implementation of the database on a particular system may be based on a collection of simple text files (a "flat-file" database), a collection of tables (a "relational" database), or it may be organized around concepts about a protein, gene, or organism (an "object-oriented" database).

In one aspect, amino acid sequences can be translated from nucleic acid sequence databases by analyzing the six possible reading frames. Protein mass data may thus be predicted from a nucleic acid sequence database. Most preferably, protein sequences (and corresponding mass data) may be obtained directly from databases containing a collection of amino acid sequences represented by single-letter or three-letter code starting at the N-terminus of the sequence. These codes may contain nonstandard characters to indicate ambiguity at a particular site (e.g., "B" indicating that the residue may be a "D" or "N"). Each sequence typically has an associated number-letter combination used internally by the database that is usually referred to as the accession number.

Databases may contain a combination of amino acid sequences, comments, literature references, and notes on known posttranslational modifications to the sequence. A database that contains these elements is referred to as "annotated". Annotated databases are used if some functional or structural information is known about the protein, as opposed to a sequence that is known only from the translation of a stretch of nucleic acid sequence. Non-annotated databases only contain the sequence, an accession number, and a descriptive title.

Exemplary databases and/or sources of test molecules or test sequences include the Genpept database, the GenBank database (described in Burks, et al., "GenBank: Current status and future directions," Methods in Enzymology, 183:3 (1990)), EMBL data library (described in Kahn, et al., "EMBL Data Library," Methods in Enzymology, 183:23 (1990)), the Protein Sequence Database (described in Barker, et al., "Protein Sequence Database," Methods in Enzymology, 183:31 (1990)), SWISS-PROT (described in Bairoch, et al., "The SWISS-PROT protein sequence data bank, recent developments," Nucleic Acids Res., 21:3093-3096 (1993)), and PIR-International (described in "Index of the Protein Sequence Database of the International Association of Protein Sequence Databanks (PIR-International)," Protein Seq. Data Anal., 5:67-192 (1993)).

Correlating experimental and theoretical spectra

Correlating experimentally determined mass data with mass data obtained according to the invention can be carried out according to any of several well known methods. In one example, the difference between each measured mass and each theoretical mass of the biological molecule in the database is calculated. If one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered correlated (or a "hit"). The total number of hits found for a particular experimental molecule for a particular database molecule is designated as r. Each measured mass associated with a hit is designated as m_i, wherein i is an ordinal number from 1 to r. The theoretical mass(es) associated with the i-th hit is designated as m_i0. There may be more than one theoretical mass associated with the i-th hit. The difference between each measured mass, m_i, and one of the theoretical masses associated with the i-th hit is determined. Any one of the theoretical masses m_i0 may be used to determine these differences.

For example, the theoretical mass with the smallest difference from the measured mass m_i may be used. Alternately, the average of the theoretical masses associated with the hit can be used to determine these differences.
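The hit-counting step can be sketched as follows, using the option of keeping the theoretical mass closest to each measured mass. The function name `correlate`, the example masses and the tolerance are hypothetical, and an absolute tolerance in Da is assumed.

```python
def correlate(measured, theoretical, tolerance):
    """Return one (measured mass, closest theoretical mass, difference)
    tuple per hit; the number of hits r is the length of the result."""
    hits = []
    for m in measured:
        close = [t for t in theoretical if abs(m - t) <= tolerance]
        if close:
            best = min(close, key=lambda t: abs(m - t))
            hits.append((m, best, m - best))
    return hits


measured = [1000.50, 1500.00, 2000.10]      # hypothetical experimental masses
theoretical = [1000.48, 1000.55, 2000.40]   # hypothetical database masses
hits = correlate(measured, theoretical, tolerance=0.1)
print("r =", len(hits), hits)
```

Averaging the theoretical masses in `close` instead of taking the minimum would give the alternative mentioned above.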

In a preferred embodiment, the user interface for controlling the computer program of the invention allows the user to specify some modifications at given locations. In these embodiments, the user will have the opportunity to search the experimental mass spectra for peptides of a given protein, with a given modification at a given location. In this way, the user, who is preferably a scientist, has the possibility to make biological assumptions and to test them by searching the experimental data in a specific way. The computer program of the invention thus provides the user with the flexibility to run an automated, general search on the experimental data, whereby fixed and variable modifications at locations defined by, e.g., rules known in the art are used to search the experimental data, or, alternatively, to conduct a more focused search, e.g. in a second analysis pass, where some modifications are imposed in their kind and their locations.

Table 9: An example with several modifications of different sorts (fixed, variable, with and without pre-determined location). Each combination of modifications is reported by the associated peptide total mass and, on a second line, the locations of the variable modifications.

Peptide AKAHWNDAANG

Modifications:

1. acetylation, location pre-determined; forced to occur on the amino acid at position 2 (K)

2. methylation, variable, potentially occurring on [C, K, R, H, D, E, N, and Q] (i.e. positions 4, 6, 7 and 10)

3. deamidation, variable, potentially occurring on [N] followed by a [G] (i.e. position 10)

4. oxidation, variable, potentially occurring on [H, M, and W] (i.e. positions 4 and 5)

Remarks:

There are the following conflict sites:

• at position 4 between modifications (2) and (4)

• at position 10, between (2) and (3)

And no conflict sites:

• at position 5, for modification (4)

• at position 6 and 7 for (2)

mass= 1195.54 :
1195.54 : AK(1)AHWNDAANG

mass= 1209.55 : (2)@3,
1209.55 : AK(1)AH(2)WNDAANG

mass= 1211.53 : (4)@3,
1211.53 : AK(1)AH(4)WNDAANG

mass= 1209.55 : (2)@9,
1209.55 : AK(1)AHWNDAAN(2)G

mass= 1223.57 : (2)@3, (2)@9,
1223.57 : AK(1)AH(2)WNDAAN(2)G

mass= 1225.55 : (4)@3, (2)@9,
1225.55 : AK(1)AH(4)WNDAAN(2)G

mass= 1196.52 : (3)@9,
1196.52 : AK(1)AHWNDAAN(3)G

mass= 1210.54 : (2)@3, (3)@9,
1210.54 : AK(1)AH(2)WNDAAN(3)G

mass= 1212.52 : (4)@3, (3)@9,
1212.52 : AK(1)AH(4)WNDAAN(3)G

mass= 1209.55 : (2)x1,
1209.55 : AK(1)AHWND(2)AANG
1209.55 : AK(1)AHWN(2)DAANG

mass= 1223.57 : (2)@3, (2)x1,
1223.57 : AK(1)AH(2)WND(2)AANG
1223.57 : AK(1)AH(2)WN(2)DAANG

mass= 1225.55 : (4)@3, (2)x1,
1225.55 : AK(1)AH(4)WND(2)AANG
1225.55 : AK(1)AH(4)WN(2)DAANG

mass= 1223.57 : (2)@9, (2)x1,
1223.57 : AK(1)AHWND(2)AAN(2)G
1223.57 : AK(1)AHWN(2)DAAN(2)G

mass= 1237.58 : (2)@3, (2)@9, (2)x1,
1237.58 : AK(1)AH(2)WND(2)AAN(2)G
1237.58 : AK(1)AH(2)WN(2)DAAN(2)G

mass= 1239.56 : (4)@3, (2)@9, (2)x1,
1239.56 : AK(1)AH(4)WND(2)AAN(2)G
1239.56 : AK(1)AH(4)WN(2)DAAN(2)G

mass= 1210.54 : (3)@9, (2)x1,
1210.54 : AK(1)AHWND(2)AAN(3)G
1210.54 : AK(1)AHWN(2)DAAN(3)G

mass= 1224.55 : (2)@3, (3)@9, (2)x1,
1224.55 : AK(1)AH(2)WND(2)AAN(3)G
1224.55 : AK(1)AH(2)WN(2)DAAN(3)G

mass= 1226.53 : (4)@3, (3)@9, (2)x1,
1226.53 : AK(1)AH(4)WND(2)AAN(3)G
1226.53 : AK(1)AH(4)WN(2)DAAN(3)G

mass= 1223.57 : (2)x2,
1223.57 : AK(1)AHWN(2)D(2)AANG

mass= 1237.58 : (2)@3, (2)x2,
1237.58 : AK(1)AH(2)WN(2)D(2)AANG

mass= 1239.56 : (4)@3, (2)x2,
1239.56 : AK(1)AH(4)WN(2)D(2)AANG

mass= 1237.58 : (2)@9, (2)x2,
1237.58 : AK(1)AHWN(2)D(2)AAN(2)G

mass= 1251.6 : (2)@3, (2)@9, (2)x2,
1251.6 : AK(1)AH(2)WN(2)D(2)AAN(2)G

mass= 1253.58 : (4)@3, (2)@9, (2)x2,
1253.58 : AK(1)AH(4)WN(2)D(2)AAN(2)G

mass= 1224.55 : (3)@9, (2)x2,
1224.55 : AK(1)AHWN(2)D(2)AAN(3)G

mass= 1238.57 : (2)@3, (3)@9, (2)x2,
1238.57 : AK(1)AH(2)WN(2)D(2)AAN(3)G

mass= 1240.55 : (4)@3, (3)@9, (2)x2,
1240.55 : AK(1)AH(4)WN(2)D(2)AAN(3)G

mass= 1211.53 : (4)x1,
1211.53 : AK(1)AHW(4)NDAANG

mass= 1225.55 : (2)@3, (4)x1,
1225.55 : AK(1)AH(2)W(4)NDAANG

mass= 1227.53 : (4)@3, (4)x1,
1227.53 : AK(1)AH(4)W(4)NDAANG

mass= 1225.55 : (2)@9, (4)x1,
1225.55 : AK(1)AHW(4)NDAAN(2)G

mass= 1239.56 : (2)@3, (2)@9, (4)x1,
1239.56 : AK(1)AH(2)W(4)NDAAN(2)G

mass= 1241.54 : (4)@3, (2)@9, (4)x1,
1241.54 : AK(1)AH(4)W(4)NDAAN(2)G

mass= 1212.52 : (3)@9, (4)x1,
1212.52 : AK(1)AHW(4)NDAAN(3)G

mass= 1226.53 : (2)@3, (3)@9, (4)x1,
1226.53 : AK(1)AH(2)W(4)NDAAN(3)G

mass= 1228.51 : (4)@3, (3)@9, (4)x1,
1228.51 : AK(1)AH(4)W(4)NDAAN(3)G

mass= 1225.55 : (2)x1, (4)x1,
1225.55 : AK(1)AHW(4)ND(2)AANG
1225.55 : AK(1)AHW(4)N(2)DAANG

mass= 1239.56 : (2)@3, (2)x1, (4)x1,
1239.56 : AK(1)AH(2)W(4)ND(2)AANG
1239.56 : AK(1)AH(2)W(4)N(2)DAANG

mass= 1241.54 : (4)@3, (2)x1, (4)x1,
1241.54 : AK(1)AH(4)W(4)ND(2)AANG
1241.54 : AK(1)AH(4)W(4)N(2)DAANG

mass= 1239.56 : (2)@9, (2)x1, (4)x1,
1239.56 : AK(1)AHW(4)ND(2)AAN(2)G
1239.56 : AK(1)AHW(4)N(2)DAAN(2)G

mass= 1253.58 : (2)@3, (2)@9, (2)x1, (4)x1,
1253.58 : AK(1)AH(2)W(4)ND(2)AAN(2)G
1253.58 : AK(1)AH(2)W(4)N(2)DAAN(2)G

mass= 1255.56 : (4)@3, (2)@9, (2)x1, (4)x1,
1255.56 : AK(1)AH(4)W(4)ND(2)AAN(2)G
1255.56 : AK(1)AH(4)W(4)N(2)DAAN(2)G

mass= 1226.53 : (3)@9, (2)x1, (4)x1,
1226.53 : AK(1)AHW(4)ND(2)AAN(3)G
1226.53 : AK(1)AHW(4)N(2)DAAN(3)G

mass= 1240.55 : (2)@3, (3)@9, (2)x1, (4)x1,
1240.55 : AK(1)AH(2)W(4)ND(2)AAN(3)G
1240.55 : AK(1)AH(2)W(4)N(2)DAAN(3)G

mass= 1242.53 : (4)@3, (3)@9, (2)x1, (4)x1,
1242.53 : AK(1)AH(4)W(4)ND(2)AAN(3)G
1242.53 : AK(1)AH(4)W(4)N(2)DAAN(3)G

mass= 1239.56 : (2)x2, (4)x1,
1239.56 : AK(1)AHW(4)N(2)D(2)AANG

mass= 1253.58 : (2)@3, (2)x2, (4)x1,
1253.58 : AK(1)AH(2)W(4)N(2)D(2)AANG

mass= 1255.56 : (4)@3, (2)x2, (4)x1,
1255.56 : AK(1)AH(4)W(4)N(2)D(2)AANG

mass= 1253.58 : (2)@9, (2)x2, (4)x1,
1253.58 : AK(1)AHW(4)N(2)D(2)AAN(2)G

mass= 1267.59 : (2)@3, (2)@9, (2)x2, (4)x1,
1267.59 : AK(1)AH(2)W(4)N(2)D(2)AAN(2)G

mass= 1269.57 : (4)@3, (2)@9, (2)x2, (4)x1,
1269.57 : AK(1)AH(4)W(4)N(2)D(2)AAN(2)G

mass= 1240.55 : (3)@9, (2)x2, (4)x1,
1240.55 : AK(1)AHW(4)N(2)D(2)AAN(3)G

mass= 1254.56 : (2)@3, (3)@9, (2)x2, (4)x1,
1254.56 : AK(1)AH(2)W(4)N(2)D(2)AAN(3)G

mass= 1256.54 : (4)@3, (3)@9, (2)x2, (4)x1,
1256.54 : AK(1)AH(4)W(4)N(2)D(2)AAN(3)G
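The enumeration behind Table 9 can be sketched in two stages: first list every mass-distinct combination, then expand the conflict-free modifications into explicit locations. The sketch below is an illustration under assumptions, not the program of the invention. The residue masses are standard monoisotopic values, the modification corrections are the monoisotopic values listed in Table 10, and the names `stage_one`, `stage_two`, `CONFLICT` and `FREE` are hypothetical. The table's notation is read here as `(m)@p` for modification m at 0-based conflict position p, and `(m)xk` for k occurrences of modification m on conflict-free sites.

```python
from itertools import combinations, product

# Standard monoisotopic residue masses (Da) for the residues of the peptide.
RESIDUE = {"A": 71.03711, "K": 128.09496, "H": 137.05891, "W": 186.07931,
           "N": 114.04293, "D": 115.02694, "G": 57.02146}
WATER = 18.01056
# Monoisotopic corrections from Table 10, keyed by the Table 9 numbering:
# 1 = acetylation, 2 = methylation, 3 = deamidation, 4 = oxidation.
DELTA = {1: 42.0106, 2: 14.0157, 3: 0.984, 4: 15.9949}

PEPTIDE = "AKAHWNDAANG"
# The fixed acetylation on K is always present.
BASE = sum(RESIDUE[aa] for aa in PEPTIDE) + WATER + DELTA[1]

# Conflict sites (0-based): at most one of the competing modifications.
CONFLICT = {3: (None, 2, 4), 9: (None, 2, 3)}
# Conflict-free sites (0-based) open to each variable modification.
FREE = {2: (5, 6), 4: (4,)}


def stage_one():
    """Enumerate mass-distinct combinations without placing free-site mods."""
    rows = []
    for counts in product(*(range(len(s) + 1) for s in FREE.values())):
        free = dict(zip(FREE, counts))
        for choice in product(*CONFLICT.values()):
            mass = BASE + sum(DELTA[m] for m in choice if m is not None)
            mass += sum(DELTA[m] * n for m, n in free.items())
            rows.append((mass, dict(zip(CONFLICT, choice)), free))
    return rows


def stage_two(free):
    """Expand free-site counts into every explicit set of locations."""
    per_mod = [list(combinations(FREE[m], n)) for m, n in free.items()]
    return list(product(*per_mod))


rows = stage_one()
for mass, conflict, free in rows[:3]:
    print(f"{mass:.2f}", conflict, free, stage_two(free))
```

Only the first stage is needed to compare total masses against an experimental spectrum; the second stage, which multiplies the number of candidates, can then be restricted to the combinations whose mass was actually matched.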

Table 10: Examples of modifications. The format uses 3 lines per modification. First line: modification number, short name, long name, [characters before : characters at the modification site : characters after]. A ^ (hat) character means "not", i.e. every character but the ones after ^. Second line: is N-terminal (True/False), is C-terminal (True/False), correction on the monoisotopic amino acid mass : correction on the average amino acid mass. Third line: pattern bit mask coded using 3 times 32 bits, wherein each set of characters before, at and after the modification site is coded by setting to 1 the bits corresponding to the position, in the alphabet, of the corresponding 1-letter code for amino acids.

0 ACET_nterm (Acetylation_nterm) [ACDEFGHIKLMNPQRSTVWY:^NKHFWY:ACDEFGHIKLMNPQRSTVWY]
T F 42.0106:42.0373
00000001011011111011110111111101 00000000001011111001100101011101 00000001011011111011110111111101

1 ACET_core (Acetylation_core) [ACDEFGHIKLMNPQRSTVWY:K:ACDEFGHIKLMNPQRSTVWY]
F F 42.0106:42.0373
00000001011011111011110111111101 00000000000000000000010000000000 00000001011011111011110111111101

2 PHOS (Phosphorylation) [ACDEFGHIKLMNPQRSTVWY:DHSTY:ACDEFGHIKLMNPQRSTVWY]
F F 79.9663:79.9799
00000001011011111011110111111101 00000001000011000000000010001000 00000001011011111011110111111101

3 AMID (Amidation) [ACDEFGHIKLMNPQRSTVWY:ACDEFGHIKLMNPQRSTVWY:G]
F T -0.984:-0.9847
00000001011011111011110111111101 00000001011011111011110111111101 00000000000000000000000001000000

4 BIOT (Biotin) [ACDEFGHIKLMNPQRSTVWY:K:ACDEFGHIKLMNPQRSTVWY]
F T 226.078:226.293
00000001011011111011110111111101 00000000000000000000010000000000 00000001011011111011110111111101

5 CAM_nterm (Carbamylation_nterm) [ACDEFGHIKLMNPQRSTVWY:ACDEFGHIKLMNPQRSTVWY:ACDEFGHIKLMNPQRSTVWY]
T F 43.0058:43.025
00000001011011111011110111111101 00000001011011111011110111111101 00000001011011111011110111111101

6 CAM_core (Carbamylation_core) [ACDEFGHIKLMNPQRSTVWY:K:ACDEFGHIKLMNPQRSTVWY]
F F 43.0058:43.025

7 CARB (Carboxylation) [ACDEFGHIKLMNPQRSTVWY:EN:ACDEFGHIKLMNPQRSTVWY]
F F 43.9898:44.0098
00000001011011111011110111111101 00000000000000000010000000010000 00000001011011111011110111111101

8 PYRR (Pyrrolidone_carboxylic_acid) [ACDEFGHIKLMNPQRSTVWY:Q:ACDEFGHIKLMNPQRSTVWY]
T F -17.0266:-17.0306
00000001011011111011110111111101 00000000000000010000000000000000 00000001011011111011110111111101

9 HYDR (Hydroxylation) [ACDEFGHIKLMNPQRSTVWY:DKNP:ACDEFGHIKLMNPQRSTVWY]
F F 15.9949:15.9994
00000001011011111011110111111101 00000000000000001010010000001000 00000001011011111011110111111101

10 GGLU (Gamma-carboxyglutamic_acid) [ACDEFGHIKLMNPQRSTVWY:E:ACDEFGHIKLMNPQRSTVWY]
F F 43.9898:44.0098
00000001011011111011110111111101 00000000000000000000000000010000 00000001011011111011110111111101

11 METH_nterm (Methylation_nterm) [ACDEFGHIKLMNPQRSTVWY:AP:ACDEFGHIKLMNPQRSTVWY]
T F 14.0157:14.0269
00000001011011111011110111111101 00000000000000001000000000000001 00000001011011111011110111111101

12 METH_core (Methylation_core) [ACDEFGHIKLMNPQRSTVWY:CDEHKNQR:ACDEFGHIKLMNPQRSTVWY]
F F 14.0157:14.0269
00000001011011111011110111111101 00000000000000110010010010011100 00000001011011111011110111111101

13 DIMETH_nterm (Di-Methylation_nterm) [ACDEFGHIKLMNPQRSTVWY:AP:ACDEFGHIKLMNPQRSTVWY]
T F 28.0314:28.0538
00000001011011111011110111111101 00000000000000001000000000000001 00000001011011111011110111111101

14 DIMETH_core (Di-Methylation_core) [ACDEFGHIKLMNPQRSTVWY:CDEHKNQR:ACDEFGHIKLMNPQRSTVWY]
F F 28.0314:28.0538
00000001011011111011110111111101 00000000000000110010010010011100 00000001011011111011110111111101

15 TRIMETH_nterm (Tri-Methylation_nterm) [ACDEFGHIKLMNPQRSTVWY:AP:ACDEFGHIKLMNPQRSTVWY]
T F 42.0471:42.0807
00000001011011111011110111111101 00000000000000001000000000000001 00000001011011111011110111111101

16 TRIMETH_core (Tri-Methylation_core) [ACDEFGHIKLMNPQRSTVWY:CDEHKNQR:ACDEFGHIKLMNPQRSTVWY]
F F 42.0471:42.0807
00000001011011111011110111111101 00000000000000110010010010011100 00000001011011111011110111111101

17 SULF_nterm (Sulfation_nterm) [ACDEFGHIKLMNPQRSTVWY:ACDEFGHIKLMNPQRSTVWY:ACDEFGHIKLMNPQRSTVWY]
T F 79.9568:80.0642
00000001011011111011110111111101 00000001011011111011110111111101 00000001011011111011110111111101

18 SULF (Sulfation_core) [ACDEFGHIKLMNPQRSTVWY:Y:ACDEFGHIKLMNPQRSTVWY]
F F 79.9568:80.0642
00000001011011111011110111111101 00000001000000000000000000000000 00000001011011111011110111111101

19 FORM (Formylation) [ACDEFGHIKLMNPQRSTVWY:ACDEFGHIKLMNPQRSTVWY:ACDEFGHIKLMNPQRSTVWY]
T F 27.9949:28.0104
00000001011011111011110111111101 00000001011011111011110111111101 00000001011011111011110111111101

20 DEAM_N (Deamidation_N) [ACDEFGHIKLMNPQRSTVWY:N:G]
F F 0.984:0.9847
00000001011011111011110111111101 00000000000000000010000000000000 00000000000000000000000001000000

21 DEAM_Q (Deamidation_Q) [ACDEFGHIKLMNPQRSTVWY:Q:ACDEFGHIKLMNPQRSTVWY]
F F 0.984:0.9847
00000001011011111011110111111101 00000000000000010000000000000000 00000001011011111011110111111101

22 Oxydation (Oxydation) [ACDEFGHIKLMNPQRSTVWY:HMW:ACDEFGHIKLMNPQRSTVWY]
F F 15.9949:15.999
00000001011011111011110111111101 00000000010000000001000010000000 00000001011011111011110111111101

23 Cys_CM (Carboxymethyl_cysteine) [ACDEFGHIKLMNPQRSTVWY:C:ACDEFGHIKLMNPQRSTVWY]
F F 58.0055:58.0367
00000001011011111011110111111101 00000000000000000000000000000100 00000001011011111011110111111101

24 Cys_CAM (Carboxyamidomethyl_cysteine) [ACDEFGHIKLMNPQRSTVWY:C:ACDEFGHIKLMNPQRSTVWY]
F F 57.0215:57.052
00000001011011111011110111111101 00000000000000000000000000000100 00000001011011111011110111111101

25 Cys_PE (Pyridyl-ethyl_cysteine) [ACDEFGHIKLMNPQRSTVWY:C:ACDEFGHIKLMNPQRSTVWY]
F F 105.058:105.145
00000001011011111011110111111101 00000000000000000000000000000100 00000001011011111011110111111101

26 Cys_PAM (Propionamide_cysteine) [ACDEFGHIKLMNPQRSTVWY:C:ACDEFGHIKLMNPQRSTVWY]
F F 71.0371:71.0788
00000001011011111011110111111101 00000000000000000000000000000100 00000001011011111011110111111101

27 MSO (Methionine_sulfoxide) [ACDEFGHIKLMNPQRSTVWY:M:ACDEFGHIKLMNPQRSTVWY]
F F 15.9949:15.9994
00000001011011111011110111111101 00000000000000000001000000000000 00000001011011111011110111111101

28 HSL (Homoserine_Lactone) [ACDEFGHIKLMNPQRSTVWY:S:ACDEFGHIKLMNPQRSTVWY]
F F 12.9617:13.0189
00000001011011111011110111111101 00000000000001000000000000000000 00000001011011111011110111111101
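The bit-mask encoding described in the caption of Table 10 can be illustrated with a short sketch. It assumes that bit k of a 32-bit word is set for the letter at 0-based alphabet position k (A is bit 0, C is bit 2, and so on), which reproduces the masks printed in the table. The function names `encode`, `parse_set` and `find_sites` are hypothetical, and the sketch only tests interior residues; how the invention treats residues at the peptide termini is not modeled here.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino-acid one-letter codes


def encode(letters):
    """Set bit k for each letter, where k is its 0-based alphabet position."""
    mask = 0
    for ch in letters:
        mask |= 1 << (ord(ch) - ord("A"))
    return mask


def parse_set(spec):
    """Encode a character set; a leading '^' means every amino acid but these."""
    if spec.startswith("^"):
        return encode(ALPHABET) & ~encode(spec[1:])
    return encode(spec)


def find_sites(sequence, before, at, after):
    """Return 0-based positions whose residue and both neighbours match."""
    codes = [encode(aa) for aa in sequence]  # binary-encoded sequence
    b, s, a = parse_set(before), parse_set(at), parse_set(after)
    return [i for i in range(1, len(codes) - 1)
            if codes[i - 1] & b and codes[i] & s and codes[i + 1] & a]


# Deamidation rule of Table 10 (N followed by G) applied to the Table 9 peptide.
print(find_sites("AKAHWNDAANG", ALPHABET, "N", "G"))
# Mask for the phosphorylation site set, printed as a 32-bit word.
print(format(parse_set("DHSTY"), "032b"))
```

Because each residue is a single set bit, testing a rule at one position reduces to three bitwise AND operations, regardless of how many amino acids the rule allows.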

Table 11: Example of an advanced rule for modeling trypsin activity. Use of the advanced rule reduces the number of incorrect theoretical peptides and therefore results in a more specific theoretical spectrum.

References cited herein are incorporated by reference in their entireties.

Claims

1. A method for computing the mass of potentially modified peptides, given a list of potential modifications on a protein sequence and a list of enzyme cleavage sites that define peptides, which method comprises:

(1) associating each peptide with at least one set of modifications, wherein each combination of said peptide with a set of modifications corresponds to or is associated with a total mass; and

(2) for a subset of said combinations, selected based on the total mass:

(a) generating every possible modification location within each combination from said subset; and

(b) computing the corresponding fragmentation spectra.

2. The method of claim 1, wherein a set of modifications comprises a plurality of modifications.

3. The method of claim 1, wherein a peptide is associated with a plurality of sets of modifications in step (1).

4. The method of claim 1, wherein said subset of combinations of peptides and set of modifications is selected in step (2) by comparing total masses from step (1) with an experimental mass spectrum.

5. The method of claim 4, wherein the method further comprises selecting a plurality of peptides from step (1) which correspond or are likely to correspond to a mass observed in the experimental mass spectrum.

6. The method according to claim 1, wherein potential modification sites are identified on a binary encoded amino acid sequence.

7. The method according to claim 1, wherein enzymatic cleavage sites are identified on a binary encoded amino acid sequence.

8. The method according to claim 6 or 7, wherein bit masks are used to represent rules defining potential modification or enzymatic cleavage sites.

9. The method according to claim 6 or 7, wherein the binary representation size per amino acid is chosen to facilitate computation.

10. The method according to claim 9, wherein the binary representation size per amino acid is 24 or 32 bits.

11. The method according to claim 6 or 7, wherein the binary representation is related to the amino acid one-letter code order in the alphabet.

12. The method according to claim 1, wherein some of the modification sites are imposed a priori and others are identified using a rule.

13. The method according to claim 12, wherein the modification sites are imposed a priori or identified using a rule after enzymatic digestion.

14. The method according to claim 1, wherein a modification site is predicted by an artificial neural network or a hidden Markov model.

15. The method according to claim 1, wherein amino acid sequences are taken from a database.

16. The method according to claim 1, wherein the amino acid sequence and the total mass in step (1) are the result of a computation.

17. The method according to claim 1, wherein the amino acid sequence and the total mass in step (1) are manually assembled.

18. The method according to claim 1, wherein amino acid sequences are obtained by translating nucleotide sequences.

19. A method for identifying a protein comprising:

(a) providing a set of modification sites on a test protein sequence and enzyme cleavage sites that define peptides, wherein each test peptide thus defined is associated with a total mass;

(d) for each test peptide in said subset, given a list of modifications to be considered, (1) associating said test peptide with every possible set of modifications, wherein each modification is associated with a location on the peptide; and (2) computing the corresponding fragmentation spectra;

(e) correlating said experimental fragmentation mass spectra of step (b) with said computed fragmentation spectra of step (d) to generate a score; and

(f) identifying the protein of interest by association with the highest ranking test peptides in the correlation of step (e).

20. The method of claim 19, wherein each of said test peptides of step (a) is associated with at least one modification.

21. The method of claim 19, wherein test amino acid sequences are stored in a database.