CN117809753A - Efficient identification method and system for mixed protein - Google Patents
Efficient identification method and system for mixed protein
- Publication number
- CN117809753A CN202410059002.1A
- Authority
- CN
- China
- Prior art keywords
- protein
- peptide
- exp
- proteins
- peptide fragment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
Abstract
The invention discloses a method for efficiently identifying mixed proteins, comprising the following steps: constructing, with a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information; optimizing the peptide identification results through repeated iterative searches; performing mass spectrum analysis with the MaxDIA protein analysis algorithm; and filtering by four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results. The invention improves the speed and accuracy of protein identification, simplifies the data analysis process, and improves the accuracy of protein quantification.
Description
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a method and a system for efficiently identifying mixed proteins.
Background
In recent years, protein identification has found wide application in biology, drug development, clinical diagnostics, and other fields. In general, protein identification is the process of determining which proteins are present in a biological sample, together with key information such as their structure and function. Conventional protein identification methods rely mainly on techniques such as antibodies, enzyme-linked immunosorbent assays (ELISA), and protein purification, but these methods are limited in specificity, detection range, and complexity. In addition, organisms contain a large variety of proteins whose expression levels vary under different conditions, which makes traditional identification methods even harder to apply.
Developments in mass spectrometry have provided new opportunities for protein identification. In mass spectrometry-based methods, proteins are digested into peptides, and the presence of a protein is determined by analyzing the mass spectra of these peptides. However, existing mass spectrometry methods still have problems with identification speed and with the complexity and accuracy of data analysis.
The difficulty of identifying proteins stems mainly from the diversity and modification of proteins and from the need to process mass spectrometry data on a large scale. Furthermore, quantitative protein identification plays an important role in disease diagnosis, drug screening, and basic biological research, and more efficient methods are still needed.
Disclosure of Invention
This section outlines some aspects of embodiments of the invention and briefly introduces some preferred embodiments. Simplifications or omissions may be made in this section, in the abstract, and in the title of the application to avoid obscuring their purpose; such simplifications or omissions should not be used to limit the scope of the invention.
The present invention has been made in view of the above-described problems occurring in the prior art.
Therefore, the invention provides a method and a system for efficiently identifying mixed proteins, which address the low speed and low accuracy of existing protein identification methods.
In order to solve the technical problems, the invention provides the following technical scheme:
In a first aspect, the invention provides a method for efficiently identifying mixed proteins, comprising the following steps:
constructing, with a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information, and optimizing the peptide identification results through repeated iterative searches;
performing mass spectrum analysis with the MaxDIA protein analysis algorithm;
and filtering by four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
constructing, through the algorithm and structure of the bidirectional recurrent neural network, a spectral library containing protein peptides and their corresponding mass spectrum information comprises:
encoding, wherein the encoder comprises three bidirectional long short-term memory (BiLSTM) layers, takes the amino acid sequence of a peptide and its mass spectrum data as input, and outputs the intensity of each fragment ion;
and decoding, wherein the decoder is a multilayer perceptron with ReLU activation functions that processes the input amino acid representation and mass spectrum data through fully connected layers, outputs intensity information for the consecutive fragment ion types at each amino acid position, and constructs a peptide spectrum from the intensity information of each ion.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the mass spectrum analysis with the MaxDIA protein analysis algorithm comprises the following steps:
qualitative identification of proteins: preprocessing the DIA mass spectrum data, extracting features from the preprocessed data, comparing and matching the features with the mass spectra of known proteins using a protein database and a spectral library matching algorithm, and performing statistical verification and reliability evaluation of the identified proteins with a Bootstrap method;
estimating the relative expression level of the proteins: the intensity values of the individual samples are first summed without normalization; the normalization factors are then determined by a global optimization in which the normalization factors are the free variables, so that the quantification error over the whole proteome is minimized.
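The global optimization of the normalization factors can be illustrated with a deliberately simplified sketch. It assumes that every peptide is measured in every sample and that the "quantification error" is the sum of squared log-ratios between normalized peptide intensities; the source states neither the objective nor how missing values are handled, so both are assumptions, and the function name `normalization_factors` is hypothetical. Under these assumptions the minimum has a closed form: each factor is the inverse of the sample's geometric-mean intensity, rescaled so the factors have a geometric mean of 1.

```python
import math
from typing import Dict, List


def normalization_factors(intensities: Dict[str, List[float]]) -> Dict[str, float]:
    """Return one normalization factor per sample.

    `intensities` maps a sample name to its peptide intensities, with the
    same peptide order in every sample and no missing values (assumptions).
    The factors minimize the sum of squared log-ratios of normalized
    intensities over all peptides and all sample pairs.
    """
    # Mean log-intensity of each sample (log of its geometric mean).
    mean_log = {
        sample: sum(math.log(v) for v in values) / len(values)
        for sample, values in intensities.items()
    }
    # Fix the free overall scale so that the factors multiply to 1.
    centre = sum(mean_log.values()) / len(mean_log)
    return {sample: math.exp(centre - m) for sample, m in mean_log.items()}


factors = normalization_factors({"A": [1.0, 2.0, 4.0], "B": [2.0, 4.0, 8.0]})
```

In this toy input sample B is uniformly twice as intense as sample A, so after multiplying by the factors the two samples coincide. With missing values the closed form no longer holds and a numerical least-squares solver would be needed.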
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the estimation of the relative expression level of the protein comprises multiplying the signal intensities of all peptide ions in the j-th mass spectrometry run by a normalization coefficient $N_j$ to correct for intensity variations between runs, and defining the total intensity of peptide ion P in sample A as:
where k indexes all isotopic peaks of peptide ion P in sample A and XIC (extracted ion current) represents the cross-sectional area at maximum intensity.
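The formula itself is not reproduced in this text. One form consistent with the definitions given here, offered as an assumption rather than as the patent's own expression, sums the normalized XIC values over the isotopic peaks; the notation $j(k)$ for the run in which peak $k$ was measured is introduced only for this sketch.

```latex
I_{P,A}(N) = \sum_{k} N_{j(k)} \, XIC_{k}
```

If sample A is measured in a single run $j$, this reduces to $N_j \sum_k XIC_k$.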
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
estimating the relative expression level of the protein further comprises, in view of the choice of peptide information, calculating the relative expression of the protein from the relative expression of the peptide ion signals. Specifically, for a protein Pro, the peptides $P = \{p_1, p_2, \ldots, p_m\}$ are identified by peptide matching, and the XIC intensity of peptide $p_m$ in sample A, among the samples $S = \{A, B, C, \ldots, Z\}$, is $XIC_{Am}$. A peptide can be used to calculate the abundance ratio of protein Pro between samples A and B only if its signal is detected in both samples; the set of indices of the peptides meeting this condition must satisfy:
$C = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$
where $\alpha_n$ in set C is the index of a peptide and $XIC_{A\alpha_i}$ is the XIC expression level of peptide $p_{\alpha_i}$ in sample A; every peptide whose index is in set C has a distinguishable signal in both samples A and B;
the ratio of the medians of the XIC expression levels of the qualifying peptides is taken as the protein abundance ratio, calculated as:
where $r_{AB}$ is the ratio of the protein's abundance in sample A to that in sample B, and median(XIC) is the median of the elements in the set.
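The condition on set C and the ratio formula are not reproduced in this text. A reading consistent with the surrounding definitions is sketched below; treating "distinguishable signal" as a strictly positive XIC is an assumption.

```latex
C = \{\alpha_i : XIC_{A\alpha_i} > 0 \ \text{and} \ XIC_{B\alpha_i} > 0\},
\qquad
r_{AB} = \frac{\operatorname{median}_{\alpha_i \in C}\left(XIC_{A\alpha_i}\right)}
              {\operatorname{median}_{\alpha_i \in C}\left(XIC_{B\alpha_i}\right)}
```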
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the filtering by four different criteria comprises:
protein filtering based on contaminants, wherein keratin is an epidermal structural protein present in the outer layers of skin, hair, and nails that can enter the mass spectrometer together with the sample and thereby affect the protein identification results;
protein filtering based on the decoy library, wherein a decoy protein is a non-target protein generated under the target-decoy database search strategy;
protein filtering based on missing values, with the threshold set to 30%;
protein filtering based on unique peptides, wherein proteins are formed by different arrangements and combinations of peptides, the same peptide can occur in different proteins, and a database search first identifies peptides and then matches the proteins that contain those peptides.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the multidimensional protein filtering algorithm further comprises screening the input protein set exp_p, with the following steps:
initializing exp_p' to exp_p, where exp_p' is the protein set after screening;
traversing each protein p in exp_p';
deleting p from exp_p' if p belongs to the contaminant set con_p;
deleting p from exp_p' if p belongs to the decoy protein set De_p;
deleting p from exp_p' if p exists in exp_p' and its missing-value rate mp is greater than a preset threshold;
deleting p from exp_p' if p is present in exp_p and its similarity to other proteins in the set is greater than a preset threshold;
and finally returning exp_p', the screened protein set.
In a second aspect, the present invention provides a system for efficiently identifying mixed proteins, comprising:
a construction module for constructing, with a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information, and for optimizing the peptide identification results through repeated iterative searches;
an analysis module for performing mass spectrum analysis with the MaxDIA protein analysis algorithm;
and a filtering module for filtering by four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results.
In a third aspect, the present invention provides a computing device comprising:
a memory for storing a program;
a processor for executing the program, wherein the program, when executed by the processor, performs the steps of the method for efficiently identifying mixed proteins.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method for efficiently identifying mixed proteins.
The beneficial effects of the invention are as follows. The invention provides an efficient, accurate, and general protein identification method based on mass spectrometry data. It improves the speed and accuracy of protein identification, simplifies the data analysis process, improves the accuracy of protein quantification, and is applicable to a variety of biological samples. It thereby supports research and applications in biology, medicine, and drug research, and is expected to accelerate biomarker discovery, disease diagnosis, and the study of biological mechanisms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a schematic flow chart of a method for efficient identification of mixed proteins according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for efficiently identifying mixed proteins according to an embodiment of the present invention;
FIG. 3 is a diagram of spectral library construction based on a bidirectional recurrent neural network in a method for efficiently identifying mixed proteins according to an embodiment of the present invention;
FIG. 4 is a diagram of Bootstrap DIA qualitative protein identification according to an embodiment of the present invention;
FIG. 5 is a flow chart of multidimensional protein filtering in a method for efficiently identifying mixed proteins according to an embodiment of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.
Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Example 1
Referring to FIGS. 1-5, one embodiment of the present invention provides a method for efficiently identifying mixed proteins which, as shown in FIGS. 1 and 2, comprises the following steps:
S1: As shown in FIG. 3, a spectral library containing protein peptides and their corresponding mass spectrum information is constructed with a bidirectional recurrent neural network algorithm, and the peptide identification results are optimized through repeated iterative searches.
Further, the operation steps include:
(1) The encoder comprises three bidirectional long short-term memory (BiLSTM) layers. It takes the amino acid sequence of a peptide and its metadata (precursor ion, charge state, fragmentation mode, etc.) as input and outputs the intensity of each b or y fragment ion;
(2) The decoder is a multilayer perceptron with ReLU activation functions. It processes the input amino acid representation and metadata through fully connected layers, outputs intensity information for the consecutive fragment ion types at each amino acid position, and constructs a peptide spectrum from the intensity information of each ion.
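The last step of S1, turning per-ion intensities into a peptide spectrum, can be sketched as follows. This is a minimal illustration and not the network itself: the predicted intensities are passed in as a plain dictionary, only singly charged b and y ions of unmodified peptides are considered, and the function names `fragment_mz` and `build_spectrum` are hypothetical. The residue masses are standard monoisotopic values.

```python
from typing import Dict, List, Tuple

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_mz(peptide: str) -> Dict[str, float]:
    """Return m/z of singly charged b and y ions, keyed as 'b1', 'y1', ..."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions: Dict[str, float] = {}
    for i in range(1, len(peptide)):
        ions[f"b{i}"] = sum(masses[:i]) + PROTON           # N-terminal fragment
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON  # C-terminal fragment
    return ions


def build_spectrum(peptide: str,
                   predicted_intensity: Dict[str, float]) -> List[Tuple[float, float]]:
    """Pair each predicted ion intensity with its m/z; drop zero-intensity ions."""
    mz = fragment_mz(peptide)
    peaks = [(mz[ion], inten) for ion, inten in predicted_intensity.items()
             if ion in mz and inten > 0]
    return sorted(peaks)


spectrum = build_spectrum("GAK", {"b1": 0.2, "y1": 1.0, "b2": 0.0})
```

The resulting list of (m/z, intensity) peaks is the kind of entry a spectral library would store for each peptide; the intensities themselves would come from the encoder and decoder described above.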
S2: Mass spectrum analysis is performed with the MaxDIA protein analysis algorithm.
Further, the operation steps include:
(1) As shown in FIG. 4, qualitative identification of proteins: the DIA mass spectrum data are preprocessed, features are extracted from the preprocessed data, the features are compared and matched with the mass spectra of known proteins using a protein database and a spectral library matching algorithm, and the identified proteins undergo statistical verification and reliability evaluation with a Bootstrap method;
The procedure matches the spectra in the library to the DIA samples in several rounds, with the aim of guiding the DIA identification process with as little prior knowledge as possible. Each round yields additional information that can be used in the following rounds;
(2) Estimating the relative expression level of the proteins: the intensity values of the individual samples are first summed without normalization; the normalization factors are then determined by a global optimization in which the normalization factors are the free variables, so that the quantification error over the whole proteome is minimized.
Specifically, the normalization coefficient $N_j$ must be determined. It multiplies the signal intensities of all peptide ions in the j-th mass spectrometry run to correct for intensity variations between runs, which makes the estimate of protein expression more reliable. The total intensity of peptide ion P in sample A is defined as:
where k indexes all isotopic peaks of peptide ion P in sample A and XIC (extracted ion current) represents the cross-sectional area at maximum intensity.
The choice of peptide information is also important. MaxLFQ calculates the relative expression of a protein from the relative expression of the peptide ion signals rather than from the total signal, because the ratio of peptide ion signals reflects the corresponding protein intensity ratio. Specifically, for a protein Pro, the peptides $P = \{p_1, p_2, \ldots, p_m\}$ are identified by peptide matching, and the XIC intensity of peptide $p_m$ in sample A, among the samples $S = \{A, B, C, \ldots, Z\}$, is $XIC_{Am}$. Peptides are searched first; a peptide can be used to calculate the abundance ratio of protein Pro between samples A and B only if its signal is detected in both samples. The set of indices of the peptides meeting this condition must satisfy:
$C = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$
where $\alpha_n$ in set C is the index of a peptide and $XIC_{A\alpha_i}$ is the XIC expression level of peptide $p_{\alpha_i}$ in sample A; every peptide whose index is in set C has a distinguishable signal in both samples A and B. To reduce the influence of outliers, the ratio of the medians of the XIC expression levels of the qualifying peptides is taken as the protein abundance ratio, calculated as:
where $r_{AB}$ is the ratio of the protein's abundance in sample A to that in sample B, and median(XIC) is the median of the elements in the set.
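The median-based ratio described above can be sketched in a few lines. The sketch follows the wording of the text (ratio of the two medians over the peptides detected in both samples); the function name `abundance_ratio`, the dictionary layout, and the treatment of `None` or zero as "not detected" are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, Optional


def abundance_ratio(xic_a: Dict[str, Optional[float]],
                    xic_b: Dict[str, Optional[float]]) -> Optional[float]:
    """Ratio of median XIC in sample A to median XIC in sample B.

    Only peptides with a signal in both samples (the set C) are used.
    Returns None when the two samples share no detected peptide.
    """
    shared = [pep for pep in xic_a if xic_a[pep] and xic_b.get(pep)]
    if not shared:
        return None
    return median(xic_a[pep] for pep in shared) / median(xic_b[pep] for pep in shared)


# p3 is an outlier in sample A and p4 is missing in sample B.
r_ab = abundance_ratio(
    {"p1": 10.0, "p2": 20.0, "p3": 400.0, "p4": 5.0},
    {"p1": 5.0, "p2": 10.0, "p3": 20.0, "p4": None},
)
```

Using medians keeps the single outlying peptide `p3` from dominating the estimate, which is the stated reason for preferring the median over a sum or mean.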
S3: Filtering by four different criteria is performed with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results.
Further, as shown in FIG. 5, the operation steps of the multidimensional protein filtering algorithm include:
(1) Protein filtering based on contaminants. Common contaminants include keratin and serum albumin. Keratin is an epidermal structural protein present in the outer layers of skin, hair, and nails; it can enter the mass spectrometer together with the sample and thereby affect the protein identification results;
(2) Protein filtering based on the decoy library. A decoy protein is a non-target protein generated under the target-decoy database search strategy;
(3) Protein filtering based on missing values, with the threshold set to 30%;
(4) Protein filtering based on unique peptides. Proteins are formed by different arrangements and combinations of peptides, and the same peptide can occur in different proteins. A database search first identifies peptides and then matches the proteins that contain those peptides. Because a peptide is not specific to a single protein, some proteins may be identified incorrectly on the basis of shared peptides, so proteins with low confidence need to be removed.
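Criteria (3) and (4) rest on two per-protein quantities that are simple to compute. The sketch below is illustrative only: it assumes a missing value is represented as `None`, that the 30% threshold applies to the fraction of samples without a value, and that a "unique" peptide is one mapped to exactly one protein in the input; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def missing_rate(values: List[Optional[float]]) -> float:
    """Fraction of samples in which the protein has no intensity value."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def unique_peptide_counts(protein_peptides: Dict[str, Set[str]]) -> Dict[str, int]:
    """Number of peptides per protein that occur in no other protein."""
    owners = Counter(pep for peps in protein_peptides.values() for pep in peps)
    return {prot: sum(owners[pep] == 1 for pep in peps)
            for prot, peps in protein_peptides.items()}


rate = missing_rate([1.0, None, 2.0, 3.0])  # one of four samples missing
counts = unique_peptide_counts({"P1": {"AAK", "CCR"}, "P2": {"AAK"}})
```

Here `P2` is supported only by a peptide it shares with `P1`, which is the situation criterion (4) is meant to flag, and a protein whose `missing_rate` exceeds 0.30 would be removed under criterion (3).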
The multidimensional protein filtering algorithm further comprises screening the input protein set to obtain a purer protein set, with the following steps:
initializing exp_p' to exp_p, where exp_p' is the protein set after screening;
traversing each protein p in exp_p';
deleting p from exp_p' if p belongs to the contaminant set con_p;
deleting p from exp_p' if p belongs to the decoy protein set De_p;
deleting p from exp_p' if p is present in exp_p' and its missing probability mp is greater than a preset threshold;
deleting p from exp_p' if p is present in exp_p and its similarity to other proteins in the set is greater than a preset threshold;
and finally returning exp_p', the screened protein set.
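The screening steps above can be sketched in Python as follows. This is an illustrative reading, not the patent's implementation: the function signature, the `missing_rate` dictionary, the `similarity` callable, and the 0.9 similarity threshold are hypothetical, and only the 30% missing-value threshold comes from the text. The sketch compares each protein with the proteins still retained, so that one member of a highly similar pair survives; that is a design choice of this example.

```python
def screen_proteins(exp_p, con_p, de_p, missing_rate, similarity,
                    missing_threshold=0.30, similarity_threshold=0.9):
    """Return exp_p', the protein set left after the four filters.

    exp_p: iterable of protein identifiers from the library search
    con_p: set of contaminant proteins
    de_p: set of decoy proteins
    missing_rate: dict protein -> fraction of samples with a missing value
    similarity: callable (p, q) -> similarity score (hypothetical measure)
    """
    exp_p_prime = set(exp_p)
    # Iterate over a sorted snapshot so the set can be modified safely
    # and the result does not depend on set iteration order.
    for p in sorted(exp_p_prime):
        if p in con_p:                                   # contaminant filter
            exp_p_prime.discard(p)
            continue
        if p in de_p:                                    # decoy filter
            exp_p_prime.discard(p)
            continue
        if missing_rate.get(p, 0.0) > missing_threshold:  # missing-value filter
            exp_p_prime.discard(p)
            continue
        # Similarity filter against the proteins that are still retained.
        if any(similarity(p, q) > similarity_threshold
               for q in exp_p_prime if q != p):
            exp_p_prime.discard(p)
    return exp_p_prime


if __name__ == "__main__":
    def toy_similarity(p, q):
        # Toy measure: only P4 and P5 are treated as near-duplicates.
        return 0.95 if {p, q} == {"P4", "P5"} else 0.0

    kept = screen_proteins(
        ["P1", "P2", "P3", "P4", "P5", "KRT1", "DECOY_P9"],
        con_p={"KRT1"},
        de_p={"DECOY_P9"},
        missing_rate={"P1": 0.1, "P3": 0.5},
        similarity=toy_similarity,
    )
    print(sorted(kept))
```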
This embodiment also provides a system for efficient identification of mixed proteins, comprising:
a construction module, configured to construct a spectrogram database containing protein peptide fragments and corresponding mass spectrum information through a bidirectional recurrent neural network algorithm, and to optimize the peptide fragment identification results through repeated iterative searches;
an analysis module, configured to analyze mass spectra through the protein analysis algorithm of MaxDIA;
and a filtering module, configured to filter according to four different criteria through a multidimensional protein filtering algorithm to obtain high-confidence protein library search results.
Further, the system also comprises:
a memory for storing a program;
and a processor for loading the program to execute the method for efficient identification of mixed proteins.
This embodiment also provides a computer-readable storage medium storing a program which, when executed by a processor, implements the method for efficient identification of mixed proteins.
The storage medium of this embodiment belongs to the same inventive concept as the method for efficient identification of mixed proteins of the above embodiment. Technical details not described in this embodiment can be found in the above embodiment, and this embodiment has the same advantageous effects as the above embodiment.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments of the present invention.
Example 2
Referring to Table 1, this embodiment of the present invention provides a method for efficient identification of mixed proteins; to verify its beneficial effects, the two schemes are compared.
Purpose of the experiment: to compare the performance of the prior art and the present invention in protein identification.
Sample preparation: human serum samples were collected and labeled as the prior-art group and the present-invention group, respectively.
Traditional method: the sample is processed and identified using existing protein identification techniques. The specific steps include protein extraction, separation, concentration, fluorescent labeling, chip loading, hybridization, washing, and detection.
Data of the two methods in terms of identification speed, accuracy, quantitative accuracy and applicability were recorded.
Statistical analysis was performed on the data and the performance of the two methods was compared.
The comparison results are shown in Table 1:
Table 1 Comparison table
As can be seen from Table 1, the multidimensional protein filtering of the present invention, which filters according to four different criteria, can analyze a large number of protein samples in a short time. It can identify the presence of proteins and provide quantitative information about them, with higher quantitative precision and wider applicability. The four filtering steps effectively improve the accuracy of mixed protein identification and help in the search for potential biomarkers related to diseases, drug treatment responses, or disease progression.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.
Claims (10)
1. A method for efficient identification of mixed proteins, comprising:
constructing a spectrogram database containing protein peptide fragments and corresponding mass spectrum information through a bidirectional recurrent neural network algorithm, and optimizing the peptide fragment identification results through repeated iterative searches;
performing mass spectrum analysis through the protein analysis algorithm of MaxDIA;
and filtering according to four different criteria through a multidimensional protein filtering algorithm to obtain high-confidence protein library search results.
2. The method for efficiently identifying a mixed protein according to claim 1, wherein:
the bidirectional recurrent neural network algorithm and structure used to obtain the spectrogram database containing protein peptide fragments and corresponding mass spectrum information comprise:
encoding, wherein the encoder comprises three layers of bidirectional long short-term memory networks, takes the amino acid sequence of a peptide fragment and its mass spectrum data as input, and outputs the intensity of each fragment ion;
and decoding, wherein the decoder is a multi-layer perceptron with ReLU activation functions, processes the input amino acid representation and mass spectrum data through fully connected layers, outputs intensity information for the consecutive fragment ion types at each amino acid position of the input, and constructs the peptide fragment spectrogram from the intensity information of each ion.
3. The method for efficiently identifying a mixed protein according to claim 1 or 2, wherein: the protein analysis algorithm of MaxDIA comprises the following steps:
qualitative identification of proteins: preprocessing the DIA mass spectrum data, extracting features from the preprocessed mass spectrum data, comparing and matching the features with the mass spectra of known proteins using a protein database and a spectral library matching algorithm, and performing statistical verification and reliability evaluation of the identified proteins using the Bootstrap method;
estimating the relative expression level of proteins: the intensity values of the individual samples are first added without normalization, and the normalization factors are then determined by a global optimization process in which the normalization factors are free variables, so that the quantitative error over the whole proteome is minimized.
4. The method for efficiently identifying a mixed protein according to claim 3, wherein: the estimation of the relative expression level of the protein comprises multiplying the signal intensities of all peptide ions in the $j$-th mass spectrometry run by a normalization coefficient $N_j$ to correct for intensity variations between different mass spectrometry runs, the total intensity of peptide fragment ion P in sample A being defined as:
where k ranges over all isotopic peaks of peptide ion P in sample A, and XIC represents the cross-sectional area at maximum intensity.
5. The method for efficiently identifying a mixed protein according to claim 4, wherein: the estimating of the relative expression level of the protein further comprises calculating the relative expression level of the protein from the relative expression levels of the peptide ion signals, taking the selection of peptide fragment information into account. Specifically, for a protein Pro, the peptide fragments $P=\{p_1, p_2, \ldots, p_m\}$ are identified by peptide fragment matching, and the intensity distribution of their XICs over the samples $S=\{A, B, C, \ldots, Z\}$ is denoted $XIC_{Am}$. Calculating the abundance ratio of protein Pro on samples A and B requires peptide fragment signals that are detected on both sample A and sample B, and the set of sequence numbers of the peptide fragments meeting this condition is:
$C=\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$
where $\alpha_n$ in set $C$ denotes the sequence number of a peptide fragment, $XIC_{A\alpha_i}$ denotes the XIC expression level of peptide fragment $p_{\alpha_i}$ on sample A, and the peptide fragments whose sequence numbers are in set $C$ have distinguishable peptide signals on both sample A and sample B;
the ratio of the medians of the XIC expression levels of the qualifying peptide fragments is taken as the protein abundance ratio, which is calculated as follows:
where $r_{AB}$ denotes the ratio of the protein abundance on sample A to that on sample B, and median(XIC) denotes the median of the elements in the set.
6. The method for efficiently identifying a mixed protein according to claim 5, wherein: the filtering according to four different criteria comprises:
contaminant-based protein filtering, wherein keratin is an epidermal structural protein present in the outer layers of skin, hair, and nails that can enter the mass spectrometer together with the sample and thereby affect the protein identification results;
decoy-library-based protein filtering, wherein decoy proteins are non-target proteins generated under the target-decoy library search strategy;
missing-value-based protein filtering, with the threshold set to 30%;
unique-peptide-based protein filtering, wherein proteins are formed by different permutations and combinations of peptide fragments, the same peptide fragment can be present in different proteins, peptide fragments are identified first during a protein library search, and the proteins containing those peptide fragments are then matched.
7. The method for efficiently identifying a mixed protein according to claim 6, wherein: the multidimensional protein filtering algorithm further comprises screening the input protein set, comprising the following steps:
initializing exp_p' to exp_p, where exp_p' is the protein set after screening;
traversing each protein p in exp_p';
deleting p from exp_p' if p belongs to the contaminant set con_p;
deleting p from exp_p' if p belongs to the decoy protein set De_p;
deleting p from exp_p' if p is present in exp_p' and its missing probability mp is greater than a preset threshold;
deleting p from exp_p' if p is present in exp_p and its similarity to other proteins in the set is greater than a preset threshold;
and finally returning exp_p', the screened protein set.
8. An identification system based on the method for efficiently identifying a mixed protein according to any one of claims 1 to 7, characterized by comprising:
a construction module, configured to construct a spectrogram database containing protein peptide fragments and corresponding mass spectrum information through a bidirectional recurrent neural network algorithm, and to optimize the peptide fragment identification results through repeated iterative searches;
an analysis module, configured to analyze mass spectra through the protein analysis algorithm of MaxDIA;
and a filtering module, configured to filter according to four different criteria through a multidimensional protein filtering algorithm to obtain high-confidence protein library search results.
9. An electronic device, comprising:
a memory for storing a program;
a processor for loading the program to perform the steps of the method for efficiently identifying a mixed protein according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method for efficiently identifying a mixed protein according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410059002.1A CN117809753A (en) | 2024-01-15 | 2024-01-15 | Efficient identification method and system for mixed protein |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410059002.1A CN117809753A (en) | 2024-01-15 | 2024-01-15 | Efficient identification method and system for mixed protein |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117809753A (en) | 2024-04-02 |
Family
ID=90421735
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410059002.1A Pending CN117809753A (en) | 2024-01-15 | 2024-01-15 | Efficient identification method and system for mixed protein |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117809753A (en) |
-
2024
- 2024-01-15 CN CN202410059002.1A patent/CN117809753A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||