CN117809753A

CN117809753A - Efficient identification method and system for mixed protein

Info

Publication number: CN117809753A
Application number: CN202410059002.1A
Authority: CN
Inventors: 曾昭沛; 陈德华; 张振华; 杨永生
Original assignee: Diniu Shanghai Health Technology Co ltd
Current assignee: Diniu Shanghai Health Technology Co ltd
Priority date: 2024-01-15
Filing date: 2024-01-15
Publication date: 2024-04-02

Abstract

The invention discloses a method for efficiently identifying mixed proteins, comprising the following steps: constructing, by means of a bidirectional recurrent neural network, a spectral library containing protein peptides and their corresponding mass spectrum information; optimizing the peptide identification results through repeated iterative searches; performing mass spectrum analysis with the MaxDIA protein analysis algorithm; and filtering by four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein library search results. The invention significantly improves the speed and accuracy of protein identification, simplifies the data analysis process, and improves the accuracy of protein quantification.

Description

Efficient identification method and system for mixed protein

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to a method and a system for efficiently identifying mixed proteins.

Background

In recent years, protein identification has found wide application in biology, drug development, clinical diagnostics and other fields. In general, protein identification is the process of determining which proteins are present in a biological sample, together with key information such as their structure and function. Conventional protein identification methods rely mainly on techniques such as antibodies, enzyme-linked immunosorbent assays (ELISA) and protein purification, but these methods are limited in specificity, detection range and complexity. In addition, organisms contain a great variety of proteins whose expression levels vary under different conditions, which makes identification by traditional methods more difficult.

In recent years, developments in mass spectrometry have provided new opportunities for protein identification. In mass spectrometry-based workflows, proteins are broken down into peptides, and the presence of a protein is determined by analyzing the mass spectra of these peptides. However, existing mass spectrometry methods still have shortcomings in identification speed and in the complexity and accuracy of data analysis.

The complexity of identifying proteins is mainly due to the diversity and modification of proteins, as well as the processing of mass spectrometry data on a large scale. Furthermore, quantitative protein identification has an important role in disease diagnosis, drug screening and basic biological research, but there is still a need for more efficient methods.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.

The present invention has been made in view of the above-described problems occurring in the prior art.

Therefore, the invention provides a high-efficiency identification method and system for mixed proteins, which solve the problems of slower speed and lower accuracy of the existing protein identification method.

In order to solve the technical problems, the invention provides the following technical scheme:

In a first aspect, the invention provides a method for efficiently identifying mixed proteins, comprising the following steps:

constructing, by means of a bidirectional recurrent neural network, a spectral library containing protein peptides and their corresponding mass spectrum information; optimizing the peptide identification results through repeated iterative searches;

performing mass spectrum analysis with the MaxDIA protein analysis algorithm;

and filtering by four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein library search results.

As a preferred embodiment of the method for efficiently identifying mixed proteins of the present invention:

obtaining the spectral library containing protein peptides and their corresponding mass spectrum information with the bidirectional recurrent neural network comprises the following steps:

encoding, wherein the encoder comprises three layers of bidirectional long short-term memory (LSTM) networks, takes the amino acid sequence of a peptide and its mass spectrum data as input, and outputs the intensity of each fragment ion;

and decoding, wherein the decoder is a multi-layer perceptron with ReLU activation functions, which processes the input amino acid representation and mass spectrum data through fully connected layers, outputs the intensities of the successive fragment ion types at each amino acid position of the input, and constructs a peptide spectrum from the intensity of each ion.

the mass spectrum analysis with the MaxDIA protein analysis algorithm comprises the following steps:

qualitative identification of proteins: preprocessing the DIA mass spectrum data, extracting features from the preprocessed data, comparing and matching the features against the mass spectra of known proteins using a protein database and a spectral library matching algorithm, and performing statistical verification and reliability evaluation of the identified proteins with a Bootstrap method;

estimating the relative expression level of the proteins: the intensity values of the individual samples are first summed without normalization, and the normalization coefficients are then determined by a global optimization in which they are free variables, so that the quantification error over the whole proteome is minimized.

the estimation of the relative expression level of the proteins comprises multiplying the signal intensities of all peptide ions in the j-th mass spectrometry run by a normalization coefficient N_j to correct for intensity variations between runs, and defining the total intensity of peptide ion P in sample A as:

where k runs over all isotopic peaks of peptide ion P in sample A, and XIC denotes the cross-sectional area at maximum intensity.
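The total-intensity definition can be written out explicitly. The expression below is one form consistent with the symbols defined above, offered as a reconstruction under assumptions rather than as the original formula: the name I_{P,A}(N) for the total intensity and the notation run(k) for the run in which isotopic peak k was measured are both introduced here for illustration.

```latex
I_{P,A}(N) \;=\; \sum_{k} N_{\mathrm{run}(k)} \, \mathrm{XIC}_{k}
```

In this form each XIC_k is scaled by the normalization coefficient N_j of the run j = run(k) in which it was recorded, and the sum runs over all isotopic peaks k of peptide ion P in sample A.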

the estimating of the relative expression level of the proteins further takes into account the selection of peptide information, and calculates the relative expression of a protein from the relative expression of the peptide ion signals. Specifically, for a protein Pro, the peptides P = {p_1, p_2, …, p_m} are identified by peptide matching, and their XIC intensity over the samples S = {A, B, C, …, Z} is XIC_{A,m}. Peptides used to calculate the abundance ratio of protein Pro between samples A and B must be detected in both samples, and the set of indices of the peptides meeting this condition is:

C = {α_1, α_2, …, α_n}

where α_n in set C is the index of a peptide, XIC_{A,α_i} is the XIC expression level of peptide p_{α_i} in sample A, and every peptide whose index is in set C has a distinguishable signal in both sample A and sample B;

the ratio of the median of the XIC expression levels of the peptides meeting the condition is taken as the protein abundance ratio, which is calculated as:

where r_AB is the ratio of the protein abundance in sample A to that in sample B, and median(XIC) is the median of the elements in the set.

the filtering by four different criteria comprises:

protein filtering based on contaminants, wherein keratin, an epidermal structural protein present in the outer layers of skin, hair and nails, eventually enters the mass spectrometer together with the sample and thereby affects the protein identification results;

protein filtering based on the decoy library, wherein a decoy protein is a non-target protein generated under the target-decoy library search strategy;

protein filtering based on missing values, with the threshold set to 30%;

protein filtering based on unique peptides, wherein proteins are made up of different permutations and combinations of peptides, the same peptide can occur in different proteins, and a library search first identifies peptides and then matches them to the proteins that contain them.

the multidimensional protein filtering algorithm further comprises screening the input protein set, comprising the following steps:

initializing exp_p' to exp_p, where exp_p' is the protein set after screening;

traversing each protein p in exp_p';

deleting p from exp_p' if p belongs to the contaminant set con_p;

deleting p from exp_p' if p belongs to the decoy protein set De_p;

deleting p from exp_p' if p is present in exp_p' and its missing-value rate mp is greater than a preset threshold;

deleting p from exp_p' if p is present in exp_p' and its similarity to other proteins in the set is greater than a preset threshold;

and finally returning exp_p', the screened protein set.
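The screening steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function and parameter names are hypothetical, missing values are represented as None, the 30% threshold is the one stated in the text, and the fourth test is implemented as "the protein has no peptide unique to it", which is an assumed stand-in for the similarity test.

```python
from typing import Dict, List, Optional, Set

MISSING_THRESHOLD = 0.30  # the 30% missing-value threshold stated in the text


def missing_rate(values: List[Optional[float]]) -> float:
    """Fraction of samples in which the protein has no quantified value."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def unique_peptides(protein: str, peptides: Dict[str, Set[str]]) -> Set[str]:
    """Peptides of `protein` that occur in no other protein of the mapping."""
    shared: Set[str] = set()
    for name, peps in peptides.items():
        if name != protein:
            shared |= peps
    return peptides.get(protein, set()) - shared


def filter_proteins(
    exp_p: List[str],
    con_p: Set[str],
    de_p: Set[str],
    intensities: Dict[str, List[Optional[float]]],
    peptides: Dict[str, Set[str]],
    missing_threshold: float = MISSING_THRESHOLD,
    min_unique: int = 1,  # assumed minimum number of unique peptides
) -> List[str]:
    """Return exp_p', the input set after the four screening tests."""
    # Uniqueness is judged against every protein of the input set (assumption).
    candidates = {p: peptides.get(p, set()) for p in exp_p}
    exp_p_prime = list(exp_p)          # initialise exp_p' to exp_p
    for p in list(exp_p_prime):        # iterate over a copy while deleting
        if p in con_p:                                  # contaminant
            exp_p_prime.remove(p)
        elif p in de_p:                                 # decoy protein
            exp_p_prime.remove(p)
        elif missing_rate(intensities.get(p, [])) > missing_threshold:
            exp_p_prime.remove(p)                       # too many missing values
        elif len(unique_peptides(p, candidates)) < min_unique:
            exp_p_prime.remove(p)                       # only shared peptides
    return exp_p_prime
```

Applying the tests in this order means that a protein is removed by the first criterion it fails, so each deletion can be attributed to a single cause when the result is audited.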

In a second aspect, the present invention provides a system for efficiently identifying mixed proteins, comprising:

a construction module, configured to construct, by means of a bidirectional recurrent neural network, a spectral library containing protein peptides and their corresponding mass spectrum information, and to optimize the peptide identification results through repeated iterative searches;

an analysis module, configured to perform mass spectrum analysis with the MaxDIA protein analysis algorithm;

and a filtering module, configured to filter by four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein library search results.

In a third aspect, the present invention provides a computing device comprising:

a memory for storing a program;

a processor for executing the program, wherein the program, when executed by the processor, performs the steps of the method for efficiently identifying mixed proteins.

In a fourth aspect, the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method for efficiently identifying mixed proteins.

The beneficial effects of the invention are as follows: the invention provides an efficient, accurate and general protein identification method based on mass spectrum data, which significantly improves the speed and accuracy of protein identification, simplifies the data analysis process and improves the accuracy of protein quantification. The method is applicable to a variety of biological samples, and is therefore expected to promote research and applications in biology, medicine, drug development and related fields, and to accelerate biomarker discovery, disease diagnosis and the study of biological mechanisms.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a schematic flow chart of a method for efficient identification of mixed proteins according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for efficiently identifying mixed proteins according to an embodiment of the present invention;

FIG. 3 is a diagram of the construction of the spectral library based on a bidirectional recurrent neural network in a method for efficiently identifying mixed proteins according to an embodiment of the present invention;

FIG. 4 is a diagram of Bootstrap DIA qualitative protein identification according to an embodiment of the present invention;

FIG. 5 is a flow chart of multidimensional protein filtering in a method for efficiently identifying mixed proteins according to an embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Example 1

Referring to FIGS. 1-5, for one embodiment of the present invention, a method for efficient identification of mixed proteins is provided, as shown in FIGS. 1 and 2, comprising the steps of:

S1: as shown in FIG. 3, a spectral library containing protein peptides and their corresponding mass spectrum information is constructed by means of a bidirectional recurrent neural network; the peptide identification results are optimized through repeated iterative searches;

Further, the operation steps include:

(1) The encoder comprises three layers of bidirectional long short-term memory (LSTM) networks. It takes the amino acid sequence of the peptide and its metadata (precursor ion, charge state, fragmentation method and the like) as input, and outputs the intensity of each b or y fragment ion;

(2) The decoder is a multi-layer perceptron with ReLU activation functions. It processes the input amino acid representation and metadata through fully connected layers, outputs the intensities of the successive fragment ion types at each amino acid position of the input, and constructs a peptide spectrum from the intensity of each ion.
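The last step, building a peptide spectrum from per-position b and y intensities, can be illustrated with a short Python sketch. It does not reproduce the neural network: the intensities are taken as given, only singly charged b and y ions are generated, and the function name and the peak tuple layout are assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
from typing import List, Tuple

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

Peak = Tuple[float, float, str]  # (m/z, relative intensity, ion label)


def build_spectrum(peptide: str, b_int: List[float], y_int: List[float]) -> List[Peak]:
    """Assemble a singly charged b/y spectrum from per-cleavage intensities.

    b_int[i - 1] and y_int[i - 1] are the intensities of the b and y ions
    produced by cleaving after residue i, so both lists need
    len(peptide) - 1 entries.
    """
    n = len(peptide)
    if len(b_int) != n - 1 or len(y_int) != n - 1:
        raise ValueError("expected one b and one y intensity per cleavage site")
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks: List[Peak] = []
    for i in range(1, n):
        b_mz = sum(masses[:i]) + PROTON            # N-terminal fragment b_i
        y_mz = sum(masses[i:]) + WATER + PROTON    # C-terminal fragment y_(n-i)
        peaks.append((b_mz, b_int[i - 1], f"b{i}"))
        peaks.append((y_mz, y_int[i - 1], f"y{n - i}"))
    top = max((inten for _, inten, _ in peaks), default=0.0)
    if top > 0:
        # Scale so that the most intense peak has relative intensity 1.0.
        peaks = [(mz, inten / top, label) for mz, inten, label in peaks]
    return sorted(peaks)  # ordered by m/z
```

For example, `build_spectrum("GAK", [0.2, 0.4], [1.0, 0.5])` yields four peaks labelled b1, b2, y1 and y2 in order of increasing m/z; library entries for other charge states or ion types would need additional terms not shown here.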

S2: mass spectrum analysis is performed with the MaxDIA protein analysis algorithm;

Further, the operation steps include:

(1) As shown in FIG. 4, qualitative identification of proteins: the DIA mass spectrum data are preprocessed, features are extracted from the preprocessed data, the features are compared and matched against the mass spectra of known proteins using a protein database and a spectral library matching algorithm, and the identified proteins are statistically verified and their reliability evaluated with a Bootstrap method;

This comprises several rounds of matching the spectra in the library to the DIA samples, the purpose of which is to guide the DIA identification process with as little prior knowledge as possible. Each round yields more information, which can be used in the following rounds;
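The idea of several matching rounds, each using what the previous round learned, can be sketched as follows. This is a deliberately reduced illustration and not MaxDIA's actual procedure: matching is done on retention time only, the information carried between rounds is a single systematic retention-time offset, and the tolerance schedule, function names and data layout are all assumptions.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple


def match_round(
    library: Dict[str, float],
    features: List[float],
    shift: float,
    tol: float,
) -> Dict[str, float]:
    """Match each library peptide to the nearest observed feature within tol."""
    matches: Dict[str, float] = {}
    if not features:
        return matches
    for peptide, rt_lib in library.items():
        expected = rt_lib + shift  # library time corrected by what is known so far
        nearest = min(features, key=lambda rt: abs(rt - expected))
        if abs(nearest - expected) <= tol:
            matches[peptide] = nearest
    return matches


def bootstrap_match(
    library: Dict[str, float],
    features: List[float],
    tolerances: Sequence[float] = (5.0, 1.0, 0.2),  # assumed schedule, wide to narrow
) -> Tuple[Dict[str, float], float]:
    """Run successive rounds; each round re-estimates the offset used by the next."""
    shift = 0.0  # first round starts with no prior calibration
    matches: Dict[str, float] = {}
    for tol in tolerances:
        matches = match_round(library, features, shift, tol)
        if matches:
            # Information gained in this round: the systematic offset
            # between observed and library retention times.
            shift = median(obs - library[pep] for pep, obs in matches.items())
    return matches, shift
```

With a library of retention times 10, 20 and 30 and observed features near 12, 22 and 32, a narrow tolerance finds nothing in a first round, whereas the wide first round recovers an offset of about 2 that lets the later, narrower rounds succeed.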

(2) Estimating the relative expression level of the proteins: the intensity values of the individual samples are first summed without normalization, and the normalization coefficients are then determined by a global optimization in which they are free variables, so that the quantification error over the whole proteome is minimized.
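A global fit of per-run normalization coefficients can be sketched in Python as below. The objective is an assumption chosen for illustration: the sum, over all peptides and all pairs of runs, of the squared logarithm of the ratio of normalized intensities. It is minimized by coordinate descent rather than by whatever optimizer the described method uses, and the function name and input layout (one peptide-to-intensity mapping per run) are hypothetical.

```python
import math
from typing import Dict, List


def fit_normalization(runs: List[Dict[str, float]], n_iter: int = 200) -> List[float]:
    """Return one coefficient N_j per run.

    Minimises sum over peptides p and run pairs (A, B) of
    (log(N_A * x_pA) - log(N_B * x_pB))**2, using only peptides
    observed with positive intensity in both runs of a pair.
    """
    if not runs:
        return []
    log_n = [0.0] * len(runs)
    for _ in range(n_iter):
        for j, run_j in enumerate(runs):
            terms = []
            for b, run_b in enumerate(runs):
                if b == j:
                    continue
                for pep, x_j in run_j.items():
                    x_b = run_b.get(pep)
                    if x_b is not None and x_b > 0 and x_j > 0:
                        terms.append(math.log(x_b) + log_n[b] - math.log(x_j))
            if terms:
                # Exact minimiser of the objective in log N_j, others held fixed.
                log_n[j] = sum(terms) / len(terms)
        # Only ratios of coefficients are determined; fix the geometric mean at 1.
        mean = sum(log_n) / len(log_n)
        log_n = [v - mean for v in log_n]
    return [math.exp(v) for v in log_n]
```

Because the objective depends only on ratios between runs, the coefficients are defined up to a common factor; the sketch resolves this by constraining their geometric mean to 1, which is a design choice rather than something stated in the text.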

Specifically, the normalization coefficients N_j need to be determined; the signal intensities of all peptide ions in the j-th mass spectrometry run are multiplied by N_j to correct for intensity variations between runs, which makes the determination of protein expression levels more reliable. The total intensity of peptide ion P in sample A is defined as:

where k runs over all isotopic peaks of peptide ion P in sample A, and XIC denotes the cross-sectional area at maximum intensity. The choice of peptide information is also important: MaxLFQ uses the relative expression of peptide ion signals, rather than the total signal, to calculate the relative expression of proteins, because the ratio of peptide ion signals reflects the intensity ratio of the corresponding proteins. Specifically, for a protein Pro, the peptides P = {p_1, p_2, …, p_m} are identified by peptide matching, and their XIC intensity over the samples S = {A, B, C, …, Z} is XIC_{A,m}. Peptides are searched first; peptides that can be used to calculate the abundance ratio of protein Pro between samples A and B must be detected in both samples, and the set of indices of the peptides meeting this condition is:

C = {α_1, α_2, …, α_n}

where α_n in set C is the index of a peptide, XIC_{A,α_i} is the XIC expression level of peptide p_{α_i} in sample A, and every peptide whose index is in set C has a distinguishable signal in both sample A and sample B. To reduce the influence of outliers, the ratio of the median of the XIC expression levels of the peptides meeting the condition is taken as the protein abundance ratio, which is calculated as:

where r_AB is the ratio of the protein abundance in sample A to that in sample B, and median(XIC) is the median of the elements in the set.
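The selection of shared peptides and the median-based ratio can be sketched in Python as follows. The sketch follows the wording of the text literally, dividing the median XIC in sample A by the median XIC in sample B over the set C. Treating any positive XIC as a "distinguishable signal" is an assumption, as are the function names and the peptide-to-XIC mapping used as input.

```python
from statistics import median
from typing import Dict, List, Optional


def shared_peptides(xic_a: Dict[str, float], xic_b: Dict[str, float]) -> List[str]:
    """Set C: peptides with a positive XIC in both sample A and sample B."""
    return sorted(p for p in xic_a if xic_a[p] > 0 and xic_b.get(p, 0.0) > 0)


def abundance_ratio(xic_a: Dict[str, float], xic_b: Dict[str, float]) -> Optional[float]:
    """Ratio r_AB of protein abundance, or None if no peptide is shared."""
    c = shared_peptides(xic_a, xic_b)
    if not c:
        return None
    # The median damps the effect of a single outlying peptide signal.
    return median(xic_a[p] for p in c) / median(xic_b[p] for p in c)
```

A common variant takes the median of the per-peptide ratios XIC_A / XIC_B instead; which of the two the missing formula expresses cannot be recovered from the text, so the choice made here should be read as an assumption.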

S3: filtering by four different criteria is performed with a multidimensional protein filtering algorithm to obtain high-confidence protein library search results.

Further, as shown in FIG. 5, the operation steps of the multidimensional protein filtering algorithm include:

(1) Protein filtering based on contaminants: common contaminants include keratin, serum albumin and the like. Keratin is an epidermal structural protein present in the outer layers of skin, hair and nails; it eventually enters the mass spectrometer along with the sample and thereby affects the protein identification results;

(2) Protein filtering based on the decoy library: a decoy protein is a non-target protein generated under the target-decoy library search strategy;

(3) Protein filtering based on missing values: the threshold is set to 30%;

(4) Protein filtering based on unique peptides: proteins are made up of different permutations and combinations of peptides, and the same peptide can occur in different proteins. A library search first identifies peptides and then matches them to the proteins that contain them, so the non-specificity of peptides with respect to proteins can cause some proteins to be identified incorrectly because of shared peptides. Proteins with low confidence therefore need to be removed.
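The target-decoy strategy behind criterion (2) can be illustrated with a short Python sketch. Building decoys by reversing each target sequence and marking them with a name prefix is a widely used convention, adopted here as an assumption; the text does not say how the decoy library is generated. The prefix REV_, the function names and the (name, score) hit layout are hypothetical.

```python
from typing import Dict, List, Tuple

DECOY_PREFIX = "REV_"  # hypothetical naming convention for decoy entries


def make_decoys(targets: Dict[str, str]) -> Dict[str, str]:
    """Create one decoy per target by reversing its amino acid sequence."""
    return {DECOY_PREFIX + name: seq[::-1] for name, seq in targets.items()}


def is_decoy(name: str) -> bool:
    return name.startswith(DECOY_PREFIX)


def estimated_fdr(hits: List[Tuple[str, float]], threshold: float) -> float:
    """Decoy hits divided by target hits among hits scoring >= threshold."""
    accepted = [name for name, score in hits if score >= threshold]
    n_decoy = sum(is_decoy(name) for name in accepted)
    n_target = len(accepted) - n_decoy
    return n_decoy / n_target if n_target else 0.0


def drop_decoys(hits: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Criterion (2): remove every decoy protein from the search result."""
    return [(name, score) for name, score in hits if not is_decoy(name)]
```

Counting decoy hits before they are dropped gives a rough estimate of how many of the remaining target hits are likely to be false, which is why the decoys are searched alongside the targets in the first place.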

The multidimensional protein filtering algorithm further comprises screening the input protein set to obtain a purer protein set, comprising the following steps:

traversing each protein p in exp_p';

deleting p from exp_p' if p belongs to the contaminant set con_p;

deleting p from exp_p' if p belongs to the decoy protein set De_p;

and finally returning exp_p', the screened protein set.

The embodiment also provides a high-efficiency identification system for mixed proteins, which comprises:

Further, the system also comprises:

a memory for storing a program;

and the processor is used for loading the program to execute the efficient identification method of the mixed protein.

The present embodiment also provides a computer-readable storage medium storing a program which, when executed by a processor, implements the method for efficiently identifying mixed proteins.

The storage medium according to the present embodiment belongs to the same inventive concept as the method for efficient identification of mixed proteins according to the above embodiment, and technical details not described in detail in the present embodiment can be seen in the above embodiment, and the present embodiment has the same advantageous effects as the above embodiment.

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments of the present invention.

Example 2

Referring to Table 1, this embodiment of the present invention provides a method for efficiently identifying mixed proteins; to verify its beneficial effects, the two schemes are compared.

Purpose of the experiment: to compare the performance of the prior art and of the present invention in protein identification.

Sample preparation: human serum samples were collected and labeled as the prior-art group and the present-invention group, respectively.

Traditional method: the samples were processed and identified using existing protein identification techniques. The specific steps include protein extraction, separation, concentration, fluorescent labeling, chip loading, hybridization, washing and detection.

Data of the two methods in terms of identification speed, accuracy, quantitative accuracy and applicability were recorded.

Statistical analysis was performed on the data and the performance of the two methods was compared.

The comparison results are shown in Table 1:

Table 1. Comparison table

As can be seen from Table 1, by filtering with four different criteria, the multidimensional protein filtering method of the invention can analyze a large number of protein samples in a short time. It can both identify the presence of proteins and provide quantitative information on them, with higher quantitative precision and wider applicability. The four steps effectively improve the accuracy of mixed protein identification and help in the search for potential biomarkers related to diseases, responses to drug treatment or disease progression.

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. A method for efficient identification of mixed proteins, comprising:

carrying out mass spectrum analysis by a protein analysis algorithm of MaxDIA;

2. The method for efficiently identifying a mixed protein according to claim 1, wherein:

3. The method for efficiently identifying a mixed protein according to claim 1 or 2, wherein: the mass spectrum analysis with the MaxDIA protein analysis algorithm comprises the following steps:

4. The method for efficiently identifying a mixed protein according to claim 3, wherein: the estimation of the relative expression level of the proteins comprises multiplying the signal intensities of all peptide ions in the j-th mass spectrometry run by a normalization coefficient N_j to correct for intensity variations between runs, and defining the total intensity of peptide ion P in sample A as:

5. The method for efficiently identifying a mixed protein according to claim 4, wherein: the estimating of the relative expression level of the proteins further takes into account the selection of peptide information, and calculates the relative expression of a protein from the relative expression of the peptide ion signals; specifically, for a protein Pro, the peptides P = {p_1, p_2, …, p_m} are identified by peptide matching, and their XIC intensity over the samples S = {A, B, C, …, Z} is XIC_{A,m}; peptides used to calculate the abundance ratio of protein Pro between samples A and B must be detected in both samples, and the set of indices of the peptides meeting this condition is:

C = {α_1, α_2, …, α_n}

where α_n in set C is the index of a peptide, XIC_{A,α_i} is the XIC expression level of peptide p_{α_i} in sample A, and every peptide whose index is in set C has a distinguishable signal in both sample A and sample B;

6. The method for efficiently identifying a mixed protein according to claim 5, wherein: the filtering by four different criteria comprises:

protein filtering based on missing values, with the threshold set to 30%;

7. The method for efficiently identifying a mixed protein according to claim 6, wherein: the multidimensional protein filtering algorithm further comprises screening the input protein set, comprising the following steps:

traversing each protein p in exp_p';

deleting p from exp_p' if p belongs to the contaminant set con_p;

deleting p from exp_p' if p belongs to the decoy protein set De_p;

and finally returning exp_p', the screened protein set.

8. An identification system based on the method for efficiently identifying a mixed protein according to any one of claims 1 to 7, characterized in that:

9. An electronic device, comprising:

a memory for storing a program;

a processor for loading the program to perform the steps of the method for efficiently identifying a mixed protein according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method for efficiently identifying a mixed protein according to any one of claims 1 to 7.