WO2020260419A1

WO2020260419A1 - Protein probability model

Info

Publication number: WO2020260419A1
Application number: PCT/EP2020/067751
Authority: WO
Inventors: Gorka PRIETO AGUJETA; Jesús VÁZQUEZ COBOS
Original assignee: Universidad Del Pais Vasco-Euskal Herriko Unibersitatea; Centro Nacional De Investigaciones Cardiovasculares Carlos Iii (F.S.P.)
Priority date: 2019-06-24
Filing date: 2020-06-24
Publication date: 2020-12-30

Abstract

In this invention, we propose a protein probability model derived from analytical considerations that i) integrates the information provided by all identified peptides, that ii) accurately predicts the random behaviour of decoy proteins, and iii) effectively resolves the peptide-to-protein paradox. Our results were validated by analysing the results from three search engines for several tissues from the Human Proteome Map.

Description

Protein probability model

Technical field

In this invention, we present a novel identification workflow based on a newly developed protein probability model.

State of the art

In shotgun proteomics, MS/MS spectra are matched against the theoretical spectra of peptides resulting from an in-silico digestion of a protein database. Each of these matches is called a peptide-to-spectrum match (PSM), and a score is assigned to each PSM depending on the similarity between the measured and theoretical spectra. The validity of identifications is checked by calculating the false discovery rate (FDR), which is usually set to a maximum of 1%. The same concept can be applied at the peptide level to control for the peptide FDR.

The most popular and widely-accepted strategy for computing FDRs in shotgun proteomics is the target-decoy approach. In this approach, two databases are used: a target database with all the protein sequences that could be present in the sample, and a decoy database with fictitious protein sequences that should not be detected. If the decoy database is correctly constructed (with the same size and statistical distributions as the target database), the probability of obtaining a false PSM in the target and decoy databases is identical. Several strategies have been proposed for the calculation of FDR by the target-decoy approach. The number of above-threshold matches in the decoy database can be directly used to estimate the number of false matches in the target database. Since spectra yielding high scores in the target database also tend to produce high scores in the decoy database, the search is usually performed against a concatenated target + decoy database, so that target and decoy sequences compete for the spectra. In turn, different formulae can be used to calculate the FDR at PSM or peptide levels using the competition strategy, depending on the population of matches used to estimate FDR. In a proposed refinement, spectra are searched separately against the two databases, and target and decoy matches compete a posteriori, so that the FDR can be calculated in the original target sequence population. In practice, the differences between these methods are small, and the competition target-decoy strategy for computing FDR is currently widely accepted in the proteomics community.
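The simplest estimate described above, in which the number of above-threshold decoy matches stands in for the number of false target matches, can be sketched as follows. This is a minimal illustration only: the function name `decoy_fdr`, the toy scores, and the convention that a higher score is better are assumptions and are not taken from the text.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR as (#decoy matches >= threshold) / (#target matches >= threshold).

    Assumes a higher score means a better match (hypothetical convention).
    """
    n_target = sum(score >= threshold for score in target_scores)
    n_decoy = sum(score >= threshold for score in decoy_scores)
    if n_target == 0:
        return 0.0  # nothing accepted, so no false discoveries to report
    return n_decoy / n_target


# Toy scores, invented for illustration only.
target_scores = [9.1, 8.4, 7.7, 6.0, 5.2, 4.9, 3.1, 2.0]
decoy_scores = [5.0, 3.3, 2.8, 2.1, 1.5, 1.2, 0.9, 0.4]

# Six targets and one decoy score at least 4.0, so the estimate is 1/6.
fdr = decoy_fdr(target_scores, decoy_scores, threshold=4.0)
```

The competition-based variants mentioned in the paragraph differ mainly in which population of matches enters the numerator and denominator.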

Nonetheless, the main goal of high-throughput proteomics is to identify not peptides, but the proteins present in a biological sample. The identity of these proteins can be inferred from the identified peptides, but controlling the FDR at the protein level is not straightforward. Since each protein can be identified by several peptides, true peptide identifications tend to concentrate on the proteins present in the sample, whereas false peptide identifications are randomly distributed among the proteins in the database. Therefore, the ratio between the number of peptides and the number of proteins matched by these peptides is higher for target peptides than for decoy peptides, and consequently the protein-level FDR can be much larger than the peptide-level FDR. Because of this protein FDR build-up problem, it is necessary to control FDR at the protein level and not just at the peptide level. Notably, this effect was not considered in one of the first drafts of the human proteome, which used a peptide-level FDR threshold but did not control the amplification of the error rate at the protein level, resulting in invalid identifications, as pointed out in later analyses. One strategy for minimizing the impact of error-rate amplification is to decrease the FDR threshold value at the PSM or peptide levels until an acceptable protein-level FDR is achieved. For instance, in the Human Plasma Peptide Atlas, a 1% protein-level FDR was achieved by using a stringent PSM-level FDR filter of 0.0002. However, such stringent criteria reduce the sensitivity of the identification workflow.
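The FDR build-up effect can be made concrete with a small arithmetic sketch. All numbers below are hypothetical and chosen only to illustrate the mechanism: true peptides are assumed to concentrate evenly on the proteins present in the sample, while each false peptide is assumed to hit a different protein.

```python
def protein_fdr(true_peptides, false_peptides, peptides_per_true_protein):
    """Protein-level FDR under two simplifying assumptions (illustrative only):
    true peptides concentrate evenly on true proteins, and every false
    peptide lands on a different, otherwise unmatched protein.
    """
    true_proteins = true_peptides // peptides_per_true_protein
    false_proteins = false_peptides
    return false_proteins / (true_proteins + false_proteins)


# Hypothetical experiment: 20000 accepted peptides at a 1% peptide-level FDR.
total_peptides = 20000
peptide_fdr = 0.01
false_peptides = round(total_peptides * peptide_fdr)   # 200 false peptides
true_peptides = total_peptides - false_peptides        # 19800 true peptides

# With 10 peptides per true protein there are 1980 true and 200 false proteins.
fdr_protein = protein_fdr(true_peptides, false_peptides, peptides_per_true_protein=10)
```

Under these assumptions a 1% peptide-level FDR becomes roughly a 9% protein-level FDR, which is the amplification the paragraph describes.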

Another way to control amplification of the error rate is to compute a protein-level score from the scores of their peptides and estimate the FDR by directly applying the target-decoy approach at the protein level. Several protein-scoring methods have been proposed, and the widespread implementation of FDR quality control has allowed comparison of the different approaches. The approach that currently seems the most efficient, especially for large datasets, is to assign each protein with the score of its best peptide. However, this is somewhat paradoxical, since from a purely statistical standpoint one would expect to obtain more information and evidence for correct protein identification by considering all peptides mapping to a protein. This peptide-to-protein paradox illustrates the urgent need for the development of a comprehensive protein scoring model that effectively integrates all the existing peptide information. Such a model would facilitate the development of automated workflows with increased quality of protein information and improve annotation in protein databases.

Brief description of the figures

Figure 1: Distribution of different decoy protein scores using three tissues from the Human Proteome Map as separate test datasets. The red line represents the uniform distribution. Decoys above this line are over-evaluated and decoys below the line are under-evaluated.

Figure 2: Number of identified genes as a function of the FDR threshold for different protein score types, using three tissues from the Human Proteome Map as separate tests. The number of identifications provided by LPS is so small (less than 400) that it is not depicted in the figure.

Figure 3: Competition between target and decoy proteins for FDR estimation using a simulated dataset and LPG scores. The dashed blue lines in (B) indicate the score threshold. The LPG scores of the decoy proteins follow an exponential distribution since they correspond to the logarithm of the probability, which follows a uniform distribution, as previously presented. (A) When no true-positive (TP) target proteins are present, decoy proteins and false-positive (FP) target proteins are distributed symmetrically across the diagonal. The FDRr definition is based on this symmetry. (B) When there are true-positive target proteins, four regions can be defined considering the diagonal and the score threshold. The decoy-only (do) region contains pairs in which the decoy protein has an above-threshold score and the target protein has a below-threshold score. The decoy-best (db) region contains pairs for which both scores are above-threshold but the decoy-protein score is better than the target-protein score. The target-only (to) and target-best (tb) regions are defined in a similar fashion.

Figure 4: Number of identifications using the different protein-level FDR algorithms for three sample tissues in the Human Proteome Map. For this comparison, the protein score has been calculated simply as the score of its best peptide. Note that the axes have different offset values to better highlight differences.

Figure 5: Number of identified genes as a function of the FDR threshold for different protein identification workflows, using three tissues from the Human Proteome Map as separate tests.

Figure 6: Venn diagrams showing the number of genes identified by different protein identification workflows in three tissues of the Human Proteome Map.

Figure 7: Comparison of target versus decoy peptides for each gene identified exclusively by any of the three different protein identification workflows discussed. The total number of peptides is considered, without filtering by FDR. Each point corresponds to a target-decoy pair. The comparison has been carried out in three tissues of the Human Proteome Map.

Brief description of the invention

An aspect of the invention refers to a method carried out by a computer for selecting, identifying or classifying proteins present in a sample, preferably a biological sample, which comprises the following steps: i. matching MS/MS spectra obtained from the set of peptides present in the sample against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and

ii. calculating the LPGF value (coLogarithm of Probability using the Gamma distribution for Filtered peptides), which is the probability of getting a decoy protein with an equal or lower p-value product from m identified peptides that are selected among n matched peptides, according to the following formulae:

where LPF (coLogarithm of Probability product using Filtered peptides) is the cologarithm of the product of p-values of the m identified peptides matching the protein, n is the total number of peptides matching the protein (considered as identified or not), G(x; k; Q) is the cumulative distribution function (CDF) of the gamma distribution with shape k and scale Q, and LPGM (coLogarithm of Probability using the Gamma distribution for the Maximum peptide score) is given by

$$\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$$

where p is preferably the lowest p-value among the set of p-values of all the peptides that match the protein, and wherein the peptide p-value is calculated from a score computed by matching MS/MS spectra obtained from the set of peptides present in the sample against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and wherein each of these matches is called a peptide-to-spectrum match (PSM); or wherein the LPGF value is calculated by using the following chi-squared distribution:

where Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom; wherein the protein probability provided by the LPGF value is used to classify or rank the proteins according to their probability and to optionally select or identify the proteins that are considered truly identified according to a statistical significance threshold.

A further aspect of the invention refers to a method carried out by a computer for selecting, identifying or classifying proteins present in a sample, which comprises the following steps: a. calculating the peptide p-values from the set of peptides identified in the sample, characterized in that:

i. MS/MS spectra obtained from the set of peptides from the sample are matched against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and wherein each of these matches is called a peptide-to-spectrum match (PSM), and a score is assigned to each PSM depending on the similarity between the MS/MS spectrum and the peptide;

ii. generating a single ordered list of decoy and target peptides according to the score of their corresponding PSM obtained in step i); and iii. determining the peptide p-values from the peptide list generated in step ii) containing the set of peptide scores, wherein for each peptide (target or decoy), the number of decoy peptides (d) with an equal or better score is counted and this number is divided by the total number of decoy peptides (D) to produce a p-value, $\text{p-value}(\text{peptide}_i) = d/D$ (1), or the cologarithm of the p-value (LP); and b. carrying out the method for selecting, identifying or classifying proteins present in a sample by:

iv. calculating the LPGF value from the peptide p-values obtained in step a) (coLogarithm of Probability using the Gamma distribution for Filtered peptides), which is the probability of getting a decoy protein with an equal or lower p-value product, using the p-values as obtained in step a), from m identified peptides that are selected among n matched peptides, according to the following formulae:

where LPF is the cologarithm of the product of p-values of the m identified peptides matching the protein, as identified in step a(i), n is the total number of peptides matching the protein (considered as identified or not), G(x; k; Q) is the CDF of the gamma distribution with shape k and scale Q, and LPGM is given by

$$\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$$

where p is the p-value obtained according to step a) of the peptide that matches the protein with the best score; or wherein the LPGF value is calculated by using the following chi-squared distribution:

where Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom; wherein the protein probability provided by the LPGF value is used to classify or rank the proteins according to their probability and to optionally select or identify the proteins that are considered truly identified according to a statistical significance threshold.

In a preferred embodiment of any of the previous aspects, the p-value, preferably p < 0.05, is used as the threshold for statistical significance of protein identification, so that all the proteins with LPGF < p are considered as identified according to said significance threshold in the protein probability model.

In another preferred embodiment of any of the previous aspects, a False Discovery Rate (FDR)- value, preferably FDR < 0.01, is used as the threshold for statistical significance of protein identification, so that the LPGF value of the target and decoy proteins are used to calculate the FDR of each protein, so that all the proteins whose FDR is lower than the FDR threshold are considered as identified according to said significance threshold.

A further aspect of the invention refers to a data processing apparatus/device/system comprising means for carrying out the steps of any of the methods previously described.

A further aspect of the invention refers to a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of any of the methods previously described.

A further aspect of the invention refers to a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of any of the methods previously described.

Detailed description of the invention

In this invention, we propose a protein probability model derived from analytical considerations that i) integrates the information provided by all identified peptides, ii) accurately predicts the random behaviour of decoy proteins, and iii) effectively resolves the peptide-to-protein paradox. Our results were validated by analysing the results from three search engines for several tissues from the Human Proteome Map (see examples).

Because protein confidence values are calculated from peptide confidence values and peptide confidence values do not change as peptides are assigned to proteins, the protein confidence values are dependent on the initial confidence values of the peptides. These initial peptide confidence values are based on a model for the relationship between the amount of evidence in data for a peptide and the probability of correctness of a peptide. In particular, in order to effectively resolve the peptide-to-protein paradox, the present invention first provides an accurate method for calculating peptide confidence values from a set of peptide-spectrum matches (PSM) obtained after searching against a target and a decoy protein database (concatenated or separated), wherein said PSMs are generated in a proteomics experiment in which one or more mass spectrometers that perform a plurality of scans of a sample produce a plurality of spectra and wherein a processor in communication with a target and a decoy protein database identify said set of PSM from the plurality of spectra. Once the set of PSM is generated, the peptide confidence values are obtained from a single ordered list of decoy and target peptides according to the score of their corresponding PSM, wherein for each peptide (target or decoy), the number of decoy peptides (d) with an equal or better score are counted and this number is divided by the total number of decoy peptides (D) to produce a p-value, wherein said p-value is essential in the present invention for later on calculating the protein confidence values and thus for constructing a protein probability model present in the sample. It is herein noted that when a peptide is matched by several PSMs, the PSM with the best score should be selected.

Thus, a first aspect of the invention refers to a method for calculating peptide confidence values from a set of peptides obtained after searching against a target and a decoy protein database (concatenated or separated), wherein preferably, but not limited to, said peptides are identified in a proteomics experiment (or in a (biological) sample) in which one or more mass spectrometers that perform a plurality of scans of the sample produce a plurality of spectra and wherein a processor in communication with a protein and a decoy database identifies said set of peptides from the plurality of spectra, characterized in that the method comprises: a. obtaining the best sequence candidate (or peptide) from each database for each MS/MS spectrum, according to the score obtained from the search engine, where the match between the best-scoring candidate and an MS/MS spectrum is called a peptide-spectrum match (PSM);

b. selecting the PSM with best score when a peptide is matched by two or more PSM, so that a peptide list is constructed from the matched peptides together with their corresponding best scores;

c. determining peptide p-values from the peptide list containing the set of peptide scores by producing a single ordered list of decoy and target peptides according to their scores, wherein for each peptide (target or decoy), the number of decoy peptides (d) with an equal or better score is counted and this number is divided by the total number of decoy peptides (D) to produce a p-value, $\text{p-value}(\text{peptide}_i) = d/D$ (1), or the cologarithm of the p-value (LP).

More specifically, the method comprises calculating the peptide p-values from the set of peptides identified in a sample, by the following steps:

i. MS/MS spectra obtained from the set of peptides from the sample are matched against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and wherein each of these matches is called a peptide-to-spectrum match (PSM), and a score is assigned to each PSM depending on the similarity between the MS/MS spectrum and the peptide;

ii. generating a single ordered list of decoy and target peptides according to the score of their corresponding PSM obtained in step i); and

iii. determining the peptide p-values from the peptide list generated in step ii) containing the set of peptide scores, wherein for each peptide (target or decoy), the number of decoy peptides (d) with an equal or better score is counted and this number is divided by the total number of decoy peptides (D) to produce a p-value, $\text{p-value}(\text{peptide}_i) = d/D$ (1), or the cologarithm of the p-value (LP).
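The counting rule of equation (1) can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the invention: the function name `peptide_p_values`, the tuple layout, and the convention that a higher score is better are all hypothetical, and the text does not say how a target peptide that outscores every decoy (d = 0) should be handled, so the sketch simply reports p = 0 and an infinite LP in that case.

```python
import math
from bisect import bisect_left


def peptide_p_values(peptides):
    """Return {peptide_id: d / D} for target and decoy peptides alike.

    peptides: list of (peptide_id, best_score, is_decoy) tuples.
    d = number of decoy peptides with an equal or better score,
    D = total number of decoy peptides. Higher score = better (assumed).
    """
    decoy_scores = sorted(score for _, score, is_decoy in peptides if is_decoy)
    total_decoys = len(decoy_scores)
    if total_decoys == 0:
        raise ValueError("at least one decoy peptide is needed")
    p_values = {}
    for peptide_id, score, _ in peptides:
        # Decoys at or after this index have a score >= the current score.
        d = total_decoys - bisect_left(decoy_scores, score)
        p_values[peptide_id] = d / total_decoys
    return p_values


def cologarithm(p_value):
    """LP = -log10(p); infinite when p is zero (assumed convention)."""
    return -math.log10(p_value) if p_value > 0 else math.inf


# Toy peptide list, invented for illustration.
peptides = [
    ("T1", 9.0, False), ("T2", 6.5, False), ("T3", 4.0, False),
    ("D1", 7.0, True), ("D2", 5.0, True), ("D3", 3.0, True), ("D4", 1.0, True),
]
p_values = peptide_p_values(peptides)
```

In this toy list T2 is beaten by one of four decoys (p = 0.25), while the worst decoy D4 receives p = 1.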

A second aspect of the invention refers to a method for determining a protein probability in a sample, preferably from the peptide p-values obtained according to the method of the first aspect of the invention although other peptide p-values obtained according to any further methods known to the skilled person (for a review see (Nesvizhskii, 2010)) are also useful in the context of the second aspect of the invention; and calculating said probability in terms of protein confidence values for each protein that evaluates the likelihood that a candidate protein is correctly identified, wherein the method integrates the p-values of the identified peptides according to the following formula:

where LPF is the cologarithm of the product of p-values of the m identified peptides matching the protein, n is the total number of peptides matching the protein (considered as identified or not), G(x; k; Q) is the CDF of the gamma distribution with shape k and scale Q, and LPGM is given by

$$\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$$

where p is the p-value of the peptide that matches the protein with the best score.
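The LPGM expression is the cologarithm of the probability that the smallest of n independent uniform p-values is at most p. A minimal sketch of its evaluation follows; the function name `lpgm` is hypothetical, and the use of `log1p`/`expm1` is only a numerical-stability choice for very small p, not something stated in the text.

```python
import math


def lpgm(best_p_value, n_peptides):
    """Compute -log10(1 - (1 - p)^n) for the best peptide p-value p
    and the total number n of peptides matching the protein."""
    if not 0.0 < best_p_value <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if n_peptides < 1:
        raise ValueError("n must be at least 1")
    if best_p_value == 1.0:
        return 0.0  # 1 - (1 - 1)^n = 1, whose cologarithm is 0
    # 1 - (1 - p)^n written as -expm1(n * log1p(-p)) to limit rounding error.
    probability = -math.expm1(n_peptides * math.log1p(-best_p_value))
    return -math.log10(probability)


# With a single matching peptide the protein keeps the peptide's own cologarithm;
# with more matching peptides the same best p-value is penalised.
single = lpgm(0.001, 1)
many = lpgm(0.001, 10)
```

For n = 1 the value reduces to the cologarithm of p itself, and it decreases as n grows, which reflects the larger chance of a good random match among more candidate peptides.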

Note that the above-mentioned expression can also be calculated using the chi-squared distribution, which produces exactly the same result:

where Chi²(x; k) is the CDF of the chi-squared distribution with k degrees of freedom.
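The statement that both distributions give exactly the same result rests on a standard identity between the gamma and chi-squared CDFs, written out below in the notation of this section. The specific scale value in the second line is an assumption made for illustration (it is the scale that arises for sums of base-10 cologarithms of independent uniform p-values) and is not quoted from the formulae of the invention.

```latex
% If X follows a gamma distribution with shape k and scale \theta,
% then 2X/\theta follows a chi-squared distribution with 2k degrees of freedom, so
G(x;\, k,\, \theta) \;=\; \mathrm{Chi}^2\!\left(\frac{2x}{\theta};\; 2k\right).
% Illustrative case (assumed parameters): each -\log_{10} p of a uniform p-value is
% exponential with scale \theta = \log_{10} e = 1/\ln 10, which gives
G\!\left(x;\, k,\, \tfrac{1}{\ln 10}\right) \;=\; \mathrm{Chi}^2\!\left(2x\ln 10;\; 2k\right).
```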

More specifically, the previous method for determining a protein probability comprises the following steps:

a. Optionally calculating the peptide p-values as described in the first aspect of the invention; and

b. carrying out the method for determining a protein probability in a sample by: i. calculating the LPGF (coLogarithm of Probability using the Gamma distribution for Filtered peptides) value from the peptide p-values obtained in step a) or obtained by any other methods, which is the probability of getting a decoy protein with an equal or lower p-value product, using the p-values as obtained in step a), from m identified peptides that are selected among n matched peptides, according to the following formulae:

$$\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$$

where p is the p-value obtained according to step a) of the peptide that matches the protein with the best score; or wherein the LPGF value is calculated by using the following chi-squared distribution:

where Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom.
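As a hedged illustration of the quantity "probability of getting an equal or lower p-value product", the sketch below evaluates the tail probability for the product of m independent uniform p-values, which is the classical Fisher-style combination. It is not the LPGF formula itself: it ignores the selection of m identified peptides among n matched peptides, which the formulae of the invention account for. The function names are hypothetical, and the closed form used is the gamma survival function for an integer shape m.

```python
import math


def product_tail_probability(p_values):
    """P(product of m independent uniform p-values <= observed product).

    Uses the closed form exp(-y) * sum_{j<m} y^j / j!, with y = -ln(product),
    i.e. one minus the gamma CDF with integer shape m and unit scale.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    m = len(p_values)
    y = -sum(math.log(p) for p in p_values)
    term = 1.0
    total = 1.0
    for j in range(1, m):
        term *= y / j      # term now equals y^j / j!
        total += term
    return math.exp(-y) * total


def cologarithm_of_tail(p_values):
    """Base-10 cologarithm of the tail probability (unfiltered case only)."""
    return -math.log10(product_tail_probability(p_values))


one_peptide = product_tail_probability([0.05])
two_peptides = product_tail_probability([0.1, 0.1])
```

With a single peptide the result is the peptide p-value itself, and two peptides with p = 0.1 each combine to about 0.056 rather than the naive product 0.01.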

It is noted that the LPGF value obtained as indicated above, is especially useful to classify or rank the proteins present in a (preferably biological) sample, according to their probability, and to optionally select or identify the proteins that are considered truly identified according to a statistical significance threshold. Therefore, a third aspect of the invention refers to a method of selecting, identifying or classifying a protein from a set of proteins using as an indicator the protein probability obtained according to any of the methods of the second aspect of the invention. In a preferred embodiment of this aspect of the invention, a p-value, preferably p < 0.05, is used as the threshold for statistical significance of protein identification, so that all the proteins with LPGF < p are considered as identified according to said significance threshold in the protein probability model. In another preferred embodiment of this aspect of the invention, a False Discovery Rate (FDR)-value, preferably FDR < 0.01, is used as the threshold for statistical significance of protein identification, so that the LPGF value of the target and decoy proteins are used to calculate the FDR of each protein, so that all the proteins whose FDR is lower than the FDR threshold are considered as identified according to said significance threshold.
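The FDR-based embodiment can be sketched as a ranking followed by a running target-decoy count. This is one simple way of turning protein scores into per-protein FDR values and is offered as an assumption-laden illustration: the function name, the tuple layout, the toy scores, and the use of the minimum FDR at equal or worse rank are not taken from the text, and the invention may rely on a different protein-level FDR algorithm.

```python
def targets_below_fdr(proteins, fdr_threshold=0.01):
    """Return target protein ids whose estimated FDR is lower than the threshold.

    proteins: list of (protein_id, score, is_decoy); a higher score (for
    example a cologarithm such as LPGF) is taken to mean a better protein.
    """
    ranked = sorted(proteins, key=lambda item: item[1], reverse=True)
    running = []
    targets = decoys = 0
    for protein_id, _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        running.append((protein_id, is_decoy, fdr))
    # Walk from the worst rank upwards so each protein gets the smallest
    # FDR observed at its own rank or any worse rank.
    accepted = []
    best_fdr = 1.0
    for protein_id, is_decoy, fdr in reversed(running):
        best_fdr = min(best_fdr, fdr)
        if not is_decoy and best_fdr < fdr_threshold:
            accepted.append(protein_id)
    accepted.reverse()  # restore best-first order
    return accepted


# Toy protein list, invented for illustration.
proteins = [
    ("P1", 12.0, False), ("P2", 10.5, False), ("P3", 9.0, False),
    ("D1", 8.0, True), ("P4", 7.5, False), ("D2", 3.0, True), ("P5", 2.5, False),
]
identified = targets_below_fdr(proteins, fdr_threshold=0.3)
```

In the toy list the first four targets fall below a 0.3 threshold, while P5 is rejected because two decoys rank above it.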

It is noted that although not limited, the processor indicated in the first aspect may determine the peptide score values by using a heuristic.

A fourth aspect of the invention refers to a computer program or a computer program product, which may comprise or not a tangible computer-readable storage medium, wherein the computer program contents include a program with instructions being executed on a processor so as to perform a method for calculating peptide p-values in proteomic analysis, the method comprising: providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise at least an analysis module which determines peptide p-values for the set of peptides using the analysis module by following the methodology indicated in the first or second aspect of the invention.

The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to a processor for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks. Volatile media includes dynamic memory, such as memory. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, papertape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.

Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor for execution. For example, the instructions may initially be carried on the magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector coupled to bus can receive the data carried in the infra-red signal and place the data on bus. Bus carries the data to memory, from which processor retrieves and executes the instructions. The instructions received by memory may optionally be stored on storage device either before or after execution by processor.

In accordance with various embodiments, instructions configured to be executed by a processor to perform a method are stored on a computer-readable medium. The computer-readable medium can be a device that stores digital information. For example, a computer-readable medium includes a compact disc read-only memory (CD-ROM) as is known in the art for storing software. The computer-readable medium is accessed by a processor suitable for executing instructions configured to be executed.

Additionally, the described implementation includes software but the present teachings may be implemented as a combination of hardware and software or in hardware alone.

A fifth aspect of the invention refers to a computer program or a computer program product, which may comprise or not a tangible computer-readable storage medium, wherein the computer program contents include a program with instructions being executed on a processor so as to perform a method for calculating protein probabilities in proteomic analysis, the method comprising: providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise at least an analysis module, wherein the analysis module calculates a protein probability from the peptide p-values obtained according to the first aspect or from other peptide p-values obtained according to any further methods known to the skilled person, and wherein said protein probability is determined by following the methodology indicated in the second aspect of the invention.

In a preferred embodiment of the fifth aspect of the invention, the analysis module is used for identifying or classifying a protein from a set of proteins of a sample, preferably of a biological sample, by using as an indicator the protein probability obtained according to the fifth aspect of the invention.

In another preferred embodiment of the fourth or fifth aspect, the system further comprises a measurement module, and said measurement module obtains a plurality of spectra from one or more mass spectrometers that perform a plurality of scans of a sample using the measurement module; and/or identifies a plurality of peptides from the plurality of spectra using the analysis module; and/or identifies a plurality of proteins from the plurality of peptides using the analysis module.

A sixth aspect of the invention refers to a computer implemented system for calculating peptide p-values from a set of peptides obtained after searching against a target and a decoy protein database (concatenated or separated), wherein said peptides are identified in a proteomics experiment in which one or more mass spectrometers that perform a plurality of scans of a sample produce a plurality of spectra and wherein a processor in communication with a protein database and the one or more mass spectrometers identify said set of peptides from the plurality of spectra, and a set of proteins from the plurality of peptides, characterized in that the system comprises: a. obtaining the best sequence candidate (or peptide) from each database per each MS/MS spectrum, according to the score obtained from the search engine, where the match between the best-scoring candidate and a MS/MS spectrum is called a peptide-spectrum match (PSM).

b. selecting the PSM with best score when a peptide is matched by two or more PSM, so that a peptide list is constructed from the matched peptides together with their corresponding best scores.
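Steps a and b amount to collapsing a list of PSMs into one entry per peptide. A minimal sketch follows, assuming that each PSM is available as a (spectrum, peptide, score) tuple and that a higher score is better; the function name and the example identifiers are hypothetical.

```python
def best_psm_per_peptide(psms):
    """Keep, for every peptide sequence, the PSM with the best score.

    psms: iterable of (spectrum_id, peptide_sequence, score) tuples,
    one per spectrum as reported by the search engine (assumed layout).
    Returns {peptide_sequence: (spectrum_id, score)}.
    """
    best = {}
    for spectrum_id, peptide, score in psms:
        if peptide not in best or score > best[peptide][1]:
            best[peptide] = (spectrum_id, score)
    return best


# Toy PSM list, invented for illustration: the first peptide is matched by two spectra.
psms = [
    ("scan_1", "AAIEGK", 3.2),
    ("scan_2", "AAIEGK", 5.7),
    ("scan_3", "VDIISR", 4.1),
]
peptide_list = best_psm_per_peptide(psms)
```

The resulting peptide list is what the p-value calculation of the first aspect would take as input.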

It is noted that such system may preferably comprise a communication mechanism for communicating information, and a processor coupled with said communication mechanism for processing information. Computer system may also include a memory, which can be a random access memory (RAM) or other dynamic storage device, coupled to the communication mechanism for determining base calls, and instructions to be executed by processor. Memory may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor. Computer system may further include a read only memory (ROM) or other static storage device coupled to the communication mechanism for storing static information and instructions for processor. A storage device such as a magnetic disk or optical disk may be provided and coupled to the communication mechanism for storing information and instructions.

A seventh aspect of the invention refers to a system for determining a protein probability from the peptide p-values obtained according to the system of the sixth aspect, and calculating said probability for each protein that evaluates the likelihood that a candidate protein is correctly identified, wherein the method integrates the p-values of the identified peptides according to the following formula:

where LPF is the cologarithm of the product of p-values of the m identified peptides matching the protein, n is the total number of peptides matching the protein (considered as identified or not), G(x; k; Q) is the CDF of the gamma distribution with shape k and scale Q, and LPGM is given by $\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$, where p is the p-value of the peptide that matches the protein with the best score.

The following examples are merely for illustrative purposes and are not meant to limit the present invention.

Examples

Proteomics Dataset

For this study, the proteomics dataset we used to test different algorithms was one of the first drafts of the human proteome, submitted to ProteomeXchange as PXD000561. This dataset is known as the Human Proteome Map (HPM) and contains experimental data from different human tissues. We carried out comparisons at the tissue level, and validated the reproducibility of the results in three tissues: Adult Heart, Adult Liver, and Adult Testis. We selected these tissues because they span the dataset size range of the HPM: Adult Heart has a lower number of identifications, Adult Testis has one of the largest numbers of identifications, and Adult Liver is an intermediate case. The raw files for each tissue were converted to mgf file format using msconvert with the peakPicking true 1- option.

Target and Decoy Databases

The target fasta database has been generated from GENCODE 15 version 25, with the addition of 47 common contaminants from UniProt. To account for PSM ambiguities in leucine vs isoleucine assignments, we replaced all leucines in the database with isoleucines. A separate decoy database was generated with Decoy Pyrat, a software tool that generates decoy sequences with minimal overlap between target and decoy peptides. Decoy Pyrat achieves this by first switching proteolytic cleavage sites with the preceding amino acid, then reversing the database, and finally shuffling any decoy sequences that become identical to target sequences.
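The first two steps of the decoy construction described above can be sketched as follows. This is only an illustrative sketch and not the Decoy Pyrat implementation: the function names are hypothetical, trypsin cleavage after K and R is assumed, and the final shuffling of decoy peptides that coincide with target peptides is omitted.

```python
def normalise_leucine(seq: str) -> str:
    """Replace every leucine (L) with isoleucine (I), as done for the target database."""
    return seq.replace("L", "I")


def swap_cleavage_sites(seq: str, sites: str = "KR") -> str:
    """Swap each cleavage-site residue with the residue that precedes it.

    Assumption: tryptic cleavage sites (K, R); a site at position 0 is left as is.
    """
    residues = list(seq)
    for i in range(1, len(residues)):
        if residues[i] in sites:
            residues[i - 1], residues[i] = residues[i], residues[i - 1]
    return "".join(residues)


def make_decoy(seq: str, sites: str = "KR") -> str:
    """Switch cleavage sites with the preceding residue, then reverse the sequence."""
    return swap_cleavage_sites(seq, sites)[::-1]


if __name__ == "__main__":
    target = normalise_leucine("ACKDER")
    print(target, "->", make_decoy(target))
```

A complete tool would additionally digest both databases in silico and reshuffle any decoy peptide that is identical to a target peptide.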

Database Search

Three different search engines were used for the database search: X!Tandem (ALANINE 2017.02.01), Comet (2016.01 rev. 3) and MSFragger (build 20170103.0). We used a fragment monoisotopic mass error window of 0.05 Da, a parent monoisotopic mass error window of 10 ppm, carbamidomethylation of cysteine as a fixed modification, oxidation of methionine as a variable modification, trypsin as the protease and a maximum of 2 missed cleavages. Separate searches were performed with the target and decoy databases, and the target and decoy PSMs were competed a posteriori. In this way, we conserved the option to use any of the previously presented PSM FDR algorithms.

Processing Details

When computing p-values, in order to avoid probabilities with a value of 0, which would be problematic when using logarithms, we apply an offset value to the numerator of Eq. 1. This value is 0.5 for decoy PSMs and -0.5 for target PSMs.

To compute a peptide-level FDR we have considered the best PSM for each spectrum (rank = 1), and then the best PSM for each peptide.
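The two selection steps can be sketched as below. The `PSM` record, its field names and the convention that a higher score is better are assumptions made for illustration; the actual search engine output formats differ.

```python
from typing import Callable, Iterable, NamedTuple


class PSM(NamedTuple):
    spectrum: str    # spectrum identifier (hypothetical field)
    peptide: str     # matched peptide sequence
    score: float     # search engine score; higher is assumed to be better
    is_decoy: bool   # True when the peptide comes from the decoy database


def best_per_key(psms: Iterable[PSM], key: Callable[[PSM], str]) -> list:
    """Keep, for each key value, the PSM with the highest score."""
    best = {}
    for psm in psms:
        k = key(psm)
        if k not in best or psm.score > best[k].score:
            best[k] = psm
    return list(best.values())


def peptide_list(psms: Iterable[PSM]) -> list:
    """Best PSM per spectrum (rank = 1), then best PSM per peptide."""
    rank1 = best_per_key(psms, key=lambda m: m.spectrum)
    return best_per_key(rank1, key=lambda m: m.peptide)
```

Pooling the target and decoy PSMs before the first step makes the target and decoy candidates of each spectrum compete a posteriori.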

Results

Derivation of a Protein Probability Model

Generic identification workflows start by considering PSM-level scores, which are integrated into peptide-level scores and then into protein- or gene-level scores. In this invention, we focused on the problem of computing protein-level probabilities from peptide-level probabilities. Hence, we made two simplifications. First, we avoided the problem of integrating PSM-level scores into peptide-level scores by assigning to each peptide the score of its best PSM; this simplification is common to most identification workflows. The second simplification is related to the protein inference problem. To avoid ambiguities in peptide-to-protein assignments and deviations in expected random behaviors, we used only unique peptides. Since most of the ambiguities come from different protein products of the same gene, we only considered peptides that were unique at the gene level. In this way, we did not filter out peptides shared by different proteins produced by the same gene, avoiding an excessive decrease in dataset size. The results are thus provided at the gene level, although sometimes we continue to use the protein term for simplicity.

Since the different search engines provide different types of scores, in our model we first translate the scores into a general purpose peptide-level p-value, in a process we call "calibration". The p-value is defined as the probability of obtaining a peptide with equal or better score by chance alone. Scores are converted into p-values by analyzing the population of decoy peptides, which are assumed to be produced by random events, by a simple nonparametric approach. The target and decoy databases are searched separately, and a single ordered list of decoy and target peptides is produced according to their scores. For each peptide (target or decoy), we count the number of decoy peptides (d) with an equal or better score and divide this number by the total number of decoy peptides (D):

$$p\text{-value}(\mathrm{peptide}_i) = \frac{d}{D} \qquad (1)$$
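A minimal sketch of this calibration is given below, assuming that a higher score is better; the function name is hypothetical, and the offset applied to the numerator to avoid zero p-values is deliberately left out.

```python
from bisect import bisect_left


def calibrate(scores, decoy_scores):
    """Return p-value = d / D for each score.

    d is the number of decoy peptides with an equal or better (here: higher
    or equal) score, and D is the total number of decoy peptides.
    """
    ordered = sorted(decoy_scores)
    total = len(ordered)
    # bisect_left gives the number of decoy scores strictly below s,
    # so total - bisect_left(...) counts the decoys with score >= s.
    return [(total - bisect_left(ordered, s)) / total for s in scores]


if __name__ == "__main__":
    decoys = [1.0, 2.0, 3.0, 4.0]
    print(calibrate([2.5, 4.0, 5.0, 0.0], decoys))
```

A peptide scoring above every decoy receives a p-value of exactly 0 in this sketch, which is the situation the numerator offset is meant to avoid.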

Eq. 1 ensures that the p-values for decoy peptides follow a uniform distribution. Since protein probability functions usually derive from multiplicative operations on their peptide probabilities, it was more convenient to use the cologarithm of the p-value:

$$\mathrm{LP}_i = -\log_{10}\left(p\text{-value}(\mathrm{peptide}_i)\right) \qquad (2)$$

Using this notation, we can easily represent the approaches most used by different bioinformatics tools to calculate protein scores:

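$$\mathrm{LPM} = \max_i \mathrm{LP}_i \qquad (3)$$

$$\mathrm{LPS} = \sum_{i=1}^{n} \mathrm{LP}_i \qquad (4)$$

$$\mathrm{LPF} = \sum_{i \in \text{identified}} \mathrm{LP}_i \qquad (5)$$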

where LPM stands for LP Maximum and assigns the protein the probability of its best peptide (i.e., the peptide with lowest probability). LPS stands for LP Sum and estimates the protein probability as the product of the probabilities of all peptides of the protein that have been matched in the database search. Finally LPF, which stands for LP Filter, is a variant of LPS that only integrates the probability of the identified peptides (i.e., peptides above a predefined peptide-level FDR threshold).
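These three conventional scores can be sketched directly from their definitions. The function names and the boolean `identified` flags (marking the peptides that pass the peptide-level FDR threshold) are assumptions made for illustration.

```python
import math


def lp(p_value: float) -> float:
    """Cologarithm of a peptide p-value."""
    return -math.log10(p_value)


def lpm(p_values) -> float:
    """LP Maximum: score of the best (lowest p-value) peptide."""
    return max(lp(p) for p in p_values)


def lps(p_values) -> float:
    """LP Sum: cologarithm of the product of all matched peptide p-values."""
    return sum(lp(p) for p in p_values)


def lpf(p_values, identified) -> float:
    """LP Filter: as LPS, but restricted to the identified peptides."""
    return sum(lp(p) for p, ok in zip(p_values, identified) if ok)
```

For a protein matched by peptides with p-values 0.001, 0.01 and 0.5, of which only the first two are identified, LPM is 3, LPS is about 5.3 and LPF is 5.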

Eqs. 3-5 are constructed from calibrated (i.e., uniformly distributed) and independent decoy peptide probabilities; however, when applied to decoy proteins, none of these scores followed the expected uniform distribution (Figure 1). This can be explained in the case of LPM, since the probability of the best score is not a true measure of protein probability. However, LPS increases much faster than would be expected for a true probability, and this marked deviation from a uniform distribution is somewhat surprising, since proteomics specialists have often assumed that the product of peptide probabilities is a good estimate of protein probability. With LPF, i.e., when only the decoy peptides above the peptide-FDR threshold are considered in the protein score, the deviation is less pronounced, but the score still grows too fast. These results indicate that these protein scores do not reflect the actual behavior of decoy proteins.

In an effort to derive a formula able to predict decoy behavior, we considered each of these cases separately. We tried to reformulate the problem in the form of an appropriate question from which a probability model could be derived, as proposed for the FDR. If we are to use the best peptide as a means of scoring proteins, the question to address is: what is the probability of getting a decoy protein that contains at least one decoy peptide with an equal or lower p-value when the protein has been matched by n peptides? Since this event is complementary to having all n peptides matched with a higher p-value, we arrive at the expression

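$$\mathrm{Probability}(\mathrm{LPM}_{protein}) = 1 - (1 - p)^n \qquad (6)$$

which is a well-known result of order statistics. Using the cologarithm notation, the equation is reexpressed thus:

$$\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^n\right) \qquad (7)$$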

Notably, when LPGM was used as a protein score it followed very accurately the uniform distribution and this result was reproduced in all the datasets analyzed (Figure 1). Besides, this finding was also consistently reproduced when different search engines were used. We concluded that LPGM behaved as a true protein probability.
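As an illustration, LPGM can be computed from the best peptide p-value and the number of matched peptides as sketched below. The function name is hypothetical, and the use of `log1p`/`expm1` is merely a numerical precaution for very small p-values, not part of the model.

```python
import math


def lpgm(p_best: float, n: int) -> float:
    """LPGM = -log10(1 - (1 - p)^n) for the best peptide p-value p and n matched peptides."""
    if not 0.0 < p_best <= 1.0 or n < 1:
        raise ValueError("expected 0 < p_best <= 1 and n >= 1")
    if p_best == 1.0:
        return 0.0  # probability 1, cologarithm 0
    # 1 - (1 - p)^n computed as -expm1(n * log1p(-p)) to limit rounding error.
    probability = -math.expm1(n * math.log1p(-p_best))
    return -math.log10(probability)


if __name__ == "__main__":
    # The same best peptide is less convincing when it is the best of 10 matches.
    print(lpgm(0.01, 1), lpgm(0.01, 10))
```

With a single matched peptide the score reduces to the peptide cologarithm, and it decreases as n grows, which is how the score penalizes proteins that accumulate many random matches.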

We used a similar approach for the other two protein scores. If a protein score is to consider all the matched peptides of a protein, as LPS does, we would have to take account of the set of p-values for each of the peptides. However, the formulation of a probability question here is not straightforward. Let us assume that a certain decoy protein is matched by a "configuration" comprising n matched peptides with a list of the corresponding p-values. To calculate the probability of obtaining such a configuration or a better one by chance would require the definition of a criterion to determine whether a specific configuration is better than another, so that the decoy proteins can be ranked accordingly. While there are several possible ways to define such a criterion, we reasoned that the product of peptide p-values had some of the properties needed to rank the configurations as expected for a random event. This is evident in some cases. For instance if several decoy proteins are matched by the same number of peptides but have different p-values, we can intuitively affirm that the protein most probably matched by chance is the one having the highest product of p-values. A similar argument can be made for the case of two proteins matched with the same number of peptides with the same p-values, except that the second protein is matched by an additional peptide; in such a case, the protein with the lower number of peptides is more likely to be detected by chance, which again corresponds to the protein with highest product of peptide p-values. In the most general case, we need to take into account the number of matched peptides and the product of p-values. These two parameters can be used to formulate a coherent probability question: what is the probability of getting a decoy protein with an equal or lower product of peptide p-values when it is matched by n peptides? 
According to statistical theory, the negative natural (Napierian) logarithm of the product of n independent and identically distributed uniform [0, 1] random variables follows a gamma(n, 1) distribution. Since peptide p-values are uniformly distributed, we can use the gamma distribution to construct a protein probability as follows:

$$\mathrm{Probability}(\mathrm{LPS}_{protein}) = 1 - G(-\ln(P); n, 1) \qquad (8)$$

where P is the product of p-values of the n peptides and G(x; k, θ) is the CDF of the gamma distribution with shape k and scale θ.

Using the cologarithmic notation, we obtain

$$\mathrm{LPGS} = -\log_{10}\left(1 - G(\mathrm{LPS} \cdot \ln 10; n, 1)\right) \qquad (9)$$

where LPGS stands for LP Gamma of the Sum.
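A dependency-free sketch of LPGS follows. It relies on the standard closed form of the gamma survival function for integer shape, 1 - G(x; n, 1) = exp(-x) · Σ x^k / k! for k from 0 to n - 1; the function names are hypothetical, and a statistics library could be used instead for the gamma CDF.

```python
import math


def gamma_sf_int(x: float, n: int) -> float:
    """1 - G(x; n, 1) for integer shape n and unit scale (closed-form sum)."""
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k          # term now equals x^k / k!
        total += term
    return math.exp(-x) * total


def lpgs(p_values) -> float:
    """LPGS = -log10(1 - G(LPS * ln 10; n, 1)) for the n matched peptide p-values."""
    n = len(p_values)
    x = -sum(math.log(p) for p in p_values)   # -ln(P) = LPS * ln(10)
    return -math.log10(gamma_sf_int(x, n))


if __name__ == "__main__":
    print(lpgs([0.1, 0.1]), lpgs([0.1, 0.1, 0.9]))
```

For two peptides with p-value 0.1 the product is 0.01, but the probability of obtaining such a product or a lower one by chance is 0.01 · (1 + ln 100), roughly 0.056, which illustrates why the raw product overstates the evidence. For very large LPS values `math.exp(-x)` underflows, so a production implementation would work in log space.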

Note that Eq. 9 is mathematically equivalent to using Fisher's method for combining independent probabilities, which can be formulated as

$$\mathrm{LPGS} = -\log_{10}\left(1 - \mathrm{Chi2}(2 \cdot \mathrm{LPS} \cdot \ln 10; 2n)\right)$$

where Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom.
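The equivalence between the gamma and chi-squared formulations follows from a standard relation between the two distributions, sketched here for convenience; the auxiliary variable x is introduced only for this illustration.

```latex
% Let x = LPS * ln(10) = -ln(P), with P the product of the n peptide p-values.
\begin{align*}
X \sim \mathrm{Gamma}(n, 1) \;&\Longrightarrow\; 2X \sim \chi^2_{2n}, \\
G(x; n, 1) &= \mathrm{Chi2}(2x; 2n), \\
1 - G(x; n, 1) &= e^{-x} \sum_{k=0}^{n-1} \frac{x^k}{k!} \qquad (n \text{ a positive integer}).
\end{align*}
```

Here 2x = -2 Σ ln(p-value) is Fisher's combined test statistic, and for n = 1 the last expression reduces to the peptide p-value itself.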

In marked contrast to LPS, LPGS has the expected protein probability properties and accurately predicts the observed probability of matching a decoy protein in all samples across different search engines (Figure 1). A similar deviation, although less pronounced, was also observed in the case of Comet. A close inspection of the three best matching peptides from these decoy proteins revealed the presence of sequences containing high proportions of repeated amino acids like Gly, Ala, Ser, Pro, Leu; all these sequences, typical of proteins like keratins and collagens, are highly homologous to target proteins and produce non-random PSMs. Hence, these deviations are produced by imperfections in the algorithm used to detect overlaps between target and decoy peptides at the time of generating the decoy database, and not by the protein probability algorithm.

The LPF score can be thought of as a refinement of LPS in which peptides that have not been identified according to the peptide FDR threshold are not included in the protein score. The question to address in this case is what is the probability of getting a decoy protein with an equal or lower p-value product from m identified peptides when the protein is matched by a total of n peptides? To construct a probability function using the gamma distribution we had to take into account the fact that several subsets of identified peptides are possible from the same set of matched peptides. We also had to consider proteins for which none of the matched peptides were positively identified; in these cases, the LPM value was assigned to LPF. The probability function then takes the form

Note again that the above-mentioned expression can also be calculated using the chi-squared distribution, which produces exactly the same result:

Again, LPGF accurately predicts the behavior of decoy proteins in all tissues and across all search engines (Figure 1) and thus acts as a true protein probability score. Note that, in contrast with LPGS, LPGF is robust to the imperfections in the decoy database, showing negligible deviations from the expected trend for top scoring proteins with MSFragger and Comet.

Analysis of the performance of these six scores, measured as the number of target proteins identified at fixed protein FDR thresholds, revealed three consistent patterns that were maintained in all the tissues and across all search engines (Figure 2). First, each of the three protein probability scores theoretically derived here (LPGM, LPGS, and LPGF) was more sensitive than its original counterpart (LPM, LPS, and LPF). This may be because LPM uses the information from only one peptide, discarding the evidence from the other identified peptides about the presence of the protein; in contrast, LPGM selects the best peptide after considering a total of n peptides. Similarly, LPS and LPF do not account for protein length, since they only consider the final product of peptide p-values, thus introducing a bias toward larger proteins, which tend to be matched by more decoy peptides. In contrast, LPGS takes into account the number of peptides used to calculate the product, and LPGF includes not only the number of identified peptides but also the total number of peptides matching the protein. In general, this finding highlights the importance of using true probabilities that accurately reflect decoy protein behavior instead of empirical protein scores.

The second key observation is that LPGM and LPM were consistently more sensitive than LPGS and LPS, respectively. This finding reflects the detrimental accumulation in LPGS and LPS of false target peptide matches, which only add random noise and contribute nothing to target protein identification. This phenomenon is avoided in LPGM and LPM, increasing their sensitivity despite the inclusion of only the best peptide in the score. The third and most important pattern is that LPGF was consistently more sensitive than LPGM. This contrasts with LPF, which was less sensitive than LPM in most cases. This finding shows that the inclusion of all relevant peptide information in a well-constructed protein probability score is always preferable to the simplification provided by using only the best peptide. More importantly, it resolves the peptide-protein paradox, demonstrating that the inclusion of relevant information produces better results than ignoring it, and establishing the basis for a rational approach to protein identification.

Derivation of a refined protein FDR

Let us assume a hypothetical experiment in which spectra are searched against a decoy and a target database and there are no protein identifications. If we calculate a protein score on the basis of the peptides that match each of the proteins, then the score distribution in the two databases is expected to be identical and the representation of decoy vs target scores would generate a cloud of points symmetrically distributed around the diagonal (Figure 3(A)).

The presence of true protein identifications can be viewed as an increase in the score of proteins harboring true target peptide matches, producing a horizontal displacement of scores to the right. The resulting distribution of target protein scores can thus be viewed as a superposition of the decoy distribution (containing the false-positive target protein scores) and the distribution of true-positive protein scores. Let us also suppose that a protein score threshold is applied to select a population of positively identified target proteins. The FDR of this population can be calculated in several ways. In the simplest approach, we use the number of above-threshold decoy proteins (d) to estimate the fraction of false positives in the population of above-threshold target proteins (t):

$$\mathrm{FDR}_n = \frac{d}{t} \qquad (11)$$

We next defined a set of regions delimited by the diagonal and the score threshold. This allowed us to express FDRn as

$$\mathrm{FDR}_n = \frac{d_o + d_b + t_b}{t_o + d_b + t_b} \qquad (12)$$

In the MAYU method, FDR is estimated as:

$$\mathrm{FDR}_m = \frac{d}{t} \cdot \frac{T - t}{D - d} \qquad (13)$$

where D and T are the sizes of the decoy and target databases. This equation can be derived under the basic assumption that the ratio of decoy matches above (d) to decoy matches below (D - d) the protein score threshold is identical to the ratio of false-positive (FP) target matches above threshold to target matches below threshold (T - t):

$$\frac{d}{D - d} = \frac{\mathrm{FP}}{T - t} \qquad (14)$$
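The conventional and MAYU-style estimates can be compared with a short sketch. It only evaluates the expressions as written in this description and is not the MAYU software; the function names and the counts in the example are hypothetical.

```python
def fdr_conventional(d: int, t: int) -> float:
    """Conventional estimate: above-threshold decoys over above-threshold targets."""
    return d / t


def fdr_mayu_style(d: int, t: int, D: int, T: int) -> float:
    """Estimate FP from d / (D - d) = FP / (T - t), then divide by t."""
    false_positives = d * (T - t) / (D - d)
    return false_positives / t


if __name__ == "__main__":
    # Hypothetical counts: 50 decoy and 5000 target proteins above threshold,
    # with decoy and target databases of 20000 entries each.
    print(fdr_conventional(50, 5000))
    print(fdr_mayu_style(50, 5000, 20000, 20000))
```

With these counts the estimated number of false positives (about 37.6) is lower than d = 50, so the MAYU-style FDR (about 0.75%) is lower than the conventional one (1%).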

In this approach, FP is considered a more accurate estimate, and replaces d in Eq. 11. However, as schematized in Figure 3(B), the separation between the false and true target protein distributions is, in general, incomplete. Therefore, the population of T - t target proteins below the score threshold is not only composed of false protein identifications, but also contains some false negatives. Therefore, the denominator in Eq. 14 is an overestimate to some extent, and FP is also overestimated.

In the picked protein FDR approach, the target and decoy scores of the same protein are treated as pairs rather than as independent entities, and the protein that receives the highest score is selected; the other one is discarded. This kind of competition generates two populations of score pairs on either side of the diagonal, so that the picked FDR can be expressed thus:

$$\mathrm{FDR}_p = \frac{d_o + d_b}{t_o + t_b} \qquad (15)$$

This equation highlights a conceptual problem in this definition of FDR: a subpopulation of target proteins (the db region) is discarded, despite having a score above threshold. In the original target-decoy approach at the peptide level, the decoy and target sequences compete for the same spectrum, and thus it makes sense to discard the target sequence when the decoy gets the higher score, because this shows that the target sequence does not achieve a score higher than that obtained by a random sequence. However, at the protein level, the situation is completely different; the decoy counterpart of a target protein is matched by different spectra, and therefore its actual score gives no information about the quality of the target protein identification. Hence, there is no reason to discard the target protein on the basis of the score of its decoy counterpart.

We reasoned that the target-decoy competition strategy at the protein level is a valid way to estimate statistical measures by exploiting the symmetry around the diagonal; however, direct comparison of target and decoy protein pairs should be avoided. With this idea in mind, we derived a refined FDR approach, based on a concept proposed previously at the peptide level. The population of target proteins with a score above threshold includes three regions (db, tb and to) as shown in Figure 3. Because of the symmetry of false-positives and decoy proteins around the diagonal, the expected total number of false positives in these regions is given by do + 2 x db. Therefore, the refined FDR is given by

$$\mathrm{FDR}_r = \frac{d_o + 2 \cdot d_b}{t_o + t_b + d_b}$$

This equation differs from Eq. 15 only in the term db, which is added to both the numerator and the denominator. Since the denominator is larger than the numerator, FDRp <= FDRr. In other words, the two methods differ only slightly, with FDRp being somewhat more sensitive than FDRr, since it requires a lower protein score threshold to reach the same FDR. However, the higher sensitivity of the picked approach is obtained by rescuing proteins whose score is just below threshold, while the refined approach rescues proteins in the db region, which may have a score well above threshold.
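The region-based estimates can be illustrated with the sketch below. The region definitions are our reading of the text and are assumptions: `to` and `do` hold pairs in which only the target or only the decoy score is above threshold, while `tb` and `db` hold pairs with both scores above threshold in which the target or the decoy, respectively, has the higher score (ties are assigned to `tb`). The picked expression is inferred from the statement that the refined FDR differs from it only by the term db in numerator and denominator. All function names are hypothetical.

```python
def region_counts(pairs, threshold):
    """Count (target_score, decoy_score) pairs per region for a score threshold."""
    counts = {"to": 0, "tb": 0, "db": 0, "do": 0}
    for target, decoy in pairs:
        target_ok, decoy_ok = target >= threshold, decoy >= threshold
        if target_ok and decoy_ok:
            counts["tb" if target >= decoy else "db"] += 1
        elif target_ok:
            counts["to"] += 1
        elif decoy_ok:
            counts["do"] += 1
    return counts


def fdr_conventional(c):
    return (c["do"] + c["db"] + c["tb"]) / (c["to"] + c["db"] + c["tb"])


def fdr_picked(c):
    # Assumed form: the db targets are discarded and counted as decoys.
    return (c["do"] + c["db"]) / (c["to"] + c["tb"])


def fdr_refined(c):
    return (c["do"] + 2 * c["db"]) / (c["to"] + c["tb"] + c["db"])


if __name__ == "__main__":
    pairs = [(5, 1), (6, 0.5), (4, 3.5), (3.2, 4), (1, 3.5), (0.5, 0.2), (7, 1)]
    c = region_counts(pairs, threshold=3)
    print(c, fdr_conventional(c), fdr_picked(c), fdr_refined(c))
```

In this toy example the refined approach keeps the target protein in the db region among the accepted proteins, whereas the picked approach discards it.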

The number of identifications obtained with the different protein-level FDR algorithms is shown in Figure 4; the same protein score was used in all cases. As expected, FDRn produces the smallest number of identifications, while FDRp and FDRr yield more identifications than MAYU. Also as expected, FDRp yields slightly more identifications than FDRr; however, FDRp loses some proteins that are clear positive identifications in FDRr.

For instance, in the Adult_Liver tissue results, we found that FDRp discarded a protein, GIGYF1, which is identified with a peptide with such a low p-value (7.2 × 10⁻⁵) that it should be accepted as a valid identification. This gene is eliminated because its decoy pair contains a decoy peptide with sequence IIIIEQR that yields a valid PSM (p-value = 8.0 × 10⁻⁶) because it has a high homology with a target protein. Note that these two peptides match different spectra, and therefore the decoy match does not give any relevant information.

Comparison with other protein identification workflows

To analyse the performance of the new formulations in practice, we compared the results obtained from analysis of HPM data from three tissues with several commonly used identification workflows. In one workflow, peptide identifications are filtered according to their peptide-level FDR, and the FDR of the resulting list of proteins is calculated using the conventional approach (FDRn); this method is frequently used by researchers in the field. In a second workflow, the proteins are assigned the score of their best peptide, and the picked FDR is used to validate the results. This workflow, which we will call FDRp(LPM), is used in the most recent version (3.0) of Percolator, 12 a popular algorithm included in the Proteome Discoverer package. In a third workflow, protein probabilities are calculated as the product of the probabilities of their identified peptides, and the FDR is calculated using the traditional approach (Eq. 11). This approach, which we will call FDRn(LPF), is used by PeptideShaker, a widely-used proteomics software tool. Finally, we used the best performing protein probability model proposed here (LPGF) in combination with the refined protein-level FDR; we call this workflow FDRr(LPGF).

Filtering by 1% FDR at the peptide level produced a list of proteins whose protein FDRn was > 7% in all cases (Table 1). In this approach, the peptide-level FDR could be adjusted until 1% protein FDR was achieved, but this came at the cost of a marked decrease in protein identification performance. This result highlights the error-rate amplification that occurs when peptides are integrated into proteins in the absence of an appropriate protein scoring model. Using the product of peptide probabilities (LPF) as the protein score allowed direct control of protein FDR; however, this method produced only a moderate increase in the number of proteins identified, and the improvement was not observed in all cases. The FDRp(LPM) workflow increased protein identification performance presumably because the use of only the best peptide minimizes the error-rate increase for decoy proteins (Table 1 and Figure 5). The algorithm proposed here (LPGF), in combination with the refined FDR, is able to use the information provided by all the peptides in a more efficient way, outperforming the other algorithms in all cases (Table 1 and Figure 5). This result was consistently reproduced when the same data were analysed in other search engines.

To explore these results in more depth, we analyzed the proteins differentially identified by the protein scoring workflows. FDRr(LPGF) included most of the identified proteins, missing only a small number of proteins identified by the other workflows (Figure 6). Closer inspection revealed that all proteins identified only by FDRp(LPM) were wonder-hits: proteins identified from only one peptide. Moreover, these proteins identified only by FDRp(LPM) were matched by unexpectedly large numbers of target peptides (more than 50 in most cases). Wonder-hits were the minority of proteins identified by FDRn(LPF) only; however, these proteins were also matched by an abnormally high number of target peptides. To analyze the random peptide matching behavior of these proteins, we plotted the number of peptides matched to the decoy and target databases separately for the subsets of proteins identified by only one of the three workflows. Proteins identified only by FDRp(LPM) or by FDRn(LPF) were matched by more than 20 peptides and sometimes by more than 100 peptides when searched against either the decoy or target databases; this indicates that these were proteins that, due to their size and sequence, tended to receive a large number of random matches and therefore were more likely to be identified by false peptides (Figure 7). In clear contrast, the proteins identified by FDRr(LPGF) only were matched in most cases by fewer than a dozen peptides (Figure 7), indicating that their positive identification was less likely due to random matching, and therefore that these proteins were most likely to be true positives. These findings were consistently reproduced with other search engines, indicating that this behavior was not specific to any particular search engine. We concluded that the novel protein identification workflow proposed here was not only more sensitive, but also provided identifications that were better supported by statistical considerations.

Table 1: Number of identifications provided by the different workflows using three tissues of the Human Proteome Map as separate target datasets

Target database size: 20407 genes.

Claims

1. A method carried out by a computer for selecting, identifying or classifying proteins present in a sample, which comprises the following steps: i. matching MS/MS spectra obtained from the set of peptides present in the sample against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and

$\mathrm{LPGM} = -\log_{10}(1 - (1 - p)^n)$ where p is the lowest p-value among the set of p-values of the peptides that match the protein, and wherein said p-value is calculated from a score computed by matching MS/MS spectra obtained from the set of peptides present in the sample against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and wherein each of these matches is called a peptide-to-spectrum match (PSM); or wherein the LPGF value is calculated by using the following chi-squared distribution:

2. A method carried out by a computer for selecting, identifying or classifying proteins present in a sample, which comprises the following steps: a. calculating the peptide confidence values from the set of peptides identified in the sample, characterized in that:

i. calculating the LPGF value from the peptide p-values obtained in step a), which is the probability of getting a decoy protein with an equal or lower p-value product from m identified peptides that are selected among n matched peptides, according to the following formulae:

3. The method according to any of claims 1 or 2, wherein a p-value, preferably p < 0.05, is used as the threshold for statistical significance of protein identification, so that all the proteins with LPGF < p are considered as identified according to said significance threshold in the protein probability model.

4. The method according to any of claims 1 or 2, wherein a False Discovery Rate (FDR)- value, preferably FDR < 0.01, is used as the threshold for statistical significance of protein identification, so that the LPGF value of the target and decoy proteins are used to calculate the FDR of each protein, so that all the proteins whose FDR is lower than the FDR threshold are considered as identified according to said significance threshold.

5. A data processing apparatus/device/system comprising means for carrying out the steps of any of the methods of claims 1 to 4.

6. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method of any of claims 1 to 4.

7. A computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of the method of any of claims 1 to 4.