CN117809753A - Efficient identification method and system for mixed protein - Google Patents


Info

Publication number
CN117809753A
Authority
CN
China
Prior art keywords
protein
peptide
exp
proteins
peptide fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410059002.1A
Other languages
Chinese (zh)
Inventor
曾昭沛
陈德华
张振华
杨永生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Diniu Shanghai Health Technology Co ltd
Original Assignee
Diniu Shanghai Health Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Diniu Shanghai Health Technology Co ltd filed Critical Diniu Shanghai Health Technology Co ltd
Priority to CN202410059002.1A priority Critical patent/CN117809753A/en
Publication of CN117809753A publication Critical patent/CN117809753A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses an efficient identification method for mixed proteins, comprising the following steps: constructing, by means of a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information; optimizing the peptide identification results through repeated iterative searches; performing mass spectrometry analysis with the MaxDIA protein analysis algorithm; and filtering against four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results. The invention significantly improves the speed and accuracy of protein identification, simplifies the data analysis process, and enhances the accuracy of protein quantification.

Description

Efficient identification method and system for mixed protein
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a method and a system for efficiently identifying mixed proteins.
Background
In recent years, protein identification has found wide application in biology, drug development, clinical diagnostics, and other fields. Protein identification is the process of determining which proteins are present in a biological sample, together with key information such as their structure and function. Conventional protein identification methods rely mainly on techniques such as antibodies, enzyme-linked immunosorbent assays (ELISA), and protein purification, but these methods are limited in specificity and detection range and are complex to carry out. In addition, organisms contain a great variety of proteins whose expression levels vary under different conditions, which makes identification by traditional methods still more difficult.
Advances in mass spectrometry have provided new opportunities for protein identification. In a mass spectrometry workflow, proteins are broken down into peptides, and the presence of a protein is determined by analyzing the mass spectra of these peptides. However, existing mass spectrometry methods still have shortcomings in identification speed and in the complexity and accuracy of data analysis.
The difficulty of identifying proteins stems mainly from the diversity of proteins and their modifications, and from the need to process mass spectrometry data on a large scale. Furthermore, quantitative protein identification plays an important role in disease diagnosis, drug screening, and basic biological research, so more efficient methods are still needed.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.
The present invention has been made in view of the above-described problems occurring in the prior art.
Therefore, the invention provides a high-efficiency identification method and system for mixed proteins, which solve the problems of slower speed and lower accuracy of the existing protein identification method.
In order to solve the technical problems, the invention provides the following technical scheme:
In a first aspect, the invention provides a method for efficiently identifying mixed proteins, comprising the following steps:
constructing, by means of a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information; and optimizing the peptide identification results through repeated iterative searches;
performing mass spectrometry analysis with the MaxDIA protein analysis algorithm;
and filtering against four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
obtaining the spectral library containing protein peptides and their corresponding mass spectrum information by means of the bidirectional recurrent neural network comprises the following:
encoding, wherein the encoder comprises three layers of bidirectional long short-term memory (BiLSTM) networks, takes the amino acid sequence of a peptide and its mass spectrum data as input, and outputs the intensity of each fragment ion;
and decoding, wherein the decoder is a multi-layer perceptron with ReLU activation functions, processes the input amino acid representation and mass spectrum data through fully connected layers, and outputs the intensities of the consecutive fragment ion types at the position of each amino acid, a peptide spectrum then being constructed from the intensity of each ion.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the mass spectrometry analysis with the MaxDIA protein analysis algorithm comprises the following steps:
qualitative identification of proteins: preprocessing the DIA mass spectrum data, extracting features from the preprocessed data, comparing and matching the features against the mass spectra of known proteins using a protein database and a spectral library matching algorithm, and statistically validating and assessing the reliability of the identified proteins using a Bootstrap method;
estimating the relative expression level of the protein: the intensity values of the individual samples are first summed without normalization, and the normalization factors are then determined by a global optimization procedure that treats them as free variables, so that the quantification error over the whole proteome is minimized.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the estimation of the relative expression level of the protein comprises multiplying the signal intensities of all peptide ions in the j-th mass spectrometry run by a normalization coefficient $N_j$ to correct for intensity variations between runs, the total intensity of peptide ion P in sample A being defined as:
$$I_{P,A}(N) = \sum_{k} N_{\mathrm{run}(k)} \, \mathrm{XIC}_k$$
where k indexes all isotope peaks of peptide ion P in sample A, run(k) is the mass spectrometry run in which peak k was measured, and XIC denotes the cross-sectional area at maximum intensity.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the estimating of the relative expression level of the protein further comprises, with regard to the selection of peptide information, calculating the relative expression of the protein from the relative expression of the peptide ion signals; specifically, for a protein Pro, the peptides $P = \{p_1, p_2, \ldots, p_m\}$ are identified by peptide matching, and the XIC intensity of peptide $p_m$ in sample A of the sample set $S = \{A, B, C, \ldots, Z\}$ is $\mathrm{XIC}_{Am}$; a peptide can be used to calculate the abundance ratio of protein Pro between samples A and B only if its signal is detected in both samples, and the set of indices of the peptides meeting this condition is:
$$C = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$$
where each $\alpha_i$ in set C is the index of a peptide, $\mathrm{XIC}_{A\alpha_i}$ is the XIC expression level of peptide $p_{\alpha_i}$ in sample A, and every peptide whose index is in C has a distinguishable signal in both samples A and B;
the ratio of the medians of the XIC expression levels of the qualifying peptides is taken as the protein abundance ratio, calculated as:
$$r_{AB} = \frac{\mathrm{median}\left(\{\mathrm{XIC}_{A\alpha_i}\}\right)}{\mathrm{median}\left(\{\mathrm{XIC}_{B\alpha_i}\}\right)}, \quad \alpha_i \in C$$
where $r_{AB}$ is the ratio of the protein abundance in sample A to that in sample B, and median(XIC) denotes the median of the elements of the set.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the filtering against four different criteria comprises:
contaminant-based protein filtering, wherein, for example, keratin, a structural protein of the epidermis present in the outer layers of skin, hair and nails, can enter the mass spectrometer together with the sample and thereby affect the protein identification results;
decoy-library-based protein filtering, wherein decoy proteins are non-target proteins generated under the target-decoy database search strategy;
missing-value-based protein filtering, with the threshold set to 30%;
unique-peptide-based protein filtering, wherein proteins are composed of different arrangements and combinations of peptides, the same peptide can occur in different proteins, and a database search first identifies peptides and then matches them to the proteins that contain them.
As a preferable embodiment of the method for efficiently identifying a mixed protein of the present invention, there is provided a method comprising:
the multidimensional protein filtering algorithm further comprises screening the input protein set exp_p, comprising the following steps:
initializing exp_p' to exp_p, where exp_p' is the protein set after screening;
traversing each protein p in exp_p';
deleting p from exp_p' if p belongs to the contaminant set con_p;
deleting p from exp_p' if p belongs to the decoy protein set De_p;
deleting p from exp_p' if p is in exp_p' and its missing-value rate mp is greater than a preset threshold;
deleting p from exp_p' if p is in exp_p' and its similarity to other proteins in the set is greater than a preset threshold;
and finally returning exp_p', the screened protein set.
In a second aspect, the present invention provides a system for efficient identification of mixed proteins, comprising:
a construction module for constructing, by means of a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information, and for optimizing the peptide identification results through repeated iterative searches;
an analysis module for performing mass spectrometry analysis with the MaxDIA protein analysis algorithm;
and a filtering module for filtering against four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results.
In a third aspect, the present invention provides a computing device comprising:
a memory for storing a program;
a processor for executing the program, which, when executed by the processor, performs the steps of the method for efficient identification of mixed proteins.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method for efficient identification of mixed proteins.
The beneficial effects of the invention are as follows: the invention provides an efficient, accurate and general protein identification method based on mass spectrometry data. It significantly improves the speed and accuracy of protein identification, simplifies the data analysis process, and enhances the accuracy of protein quantification, and it is applicable to a variety of biological samples. It thereby promotes scientific research and applications in biology, medicine, drug research and related fields, and is expected to accelerate biomarker discovery, disease diagnosis, and the study of biological mechanisms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a schematic flow chart of a method for efficient identification of mixed proteins according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for efficiently identifying mixed proteins according to an embodiment of the present invention;
FIG. 3 is a diagram of the spectral library construction based on a bidirectional recurrent neural network in a method for efficient identification of mixed proteins according to an embodiment of the present invention;
FIG. 4 is a diagram of the Bootstrap-based qualitative identification of proteins from DIA data according to an embodiment of the present invention;
FIG. 5 is a flow chart of the multidimensional protein filtering in a method for efficient identification of mixed proteins according to an embodiment of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.
Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Example 1
Referring to FIGS. 1-5, one embodiment of the present invention provides a method for efficient identification of mixed proteins which, as shown in FIGS. 1 and 2, comprises the following steps:
S1: as shown in FIG. 3, constructing, by means of a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information; and optimizing the peptide identification results through repeated iterative searches;
Further, the operation steps include:
(1) The encoder comprises three layers of bidirectional long short-term memory (BiLSTM) networks. It takes the amino acid sequence of the peptide and its metadata (precursor (parent) ion, charge state, fragmentation mode, and the like) as input and outputs the intensity of each b or y fragment ion;
(2) The decoder is a multi-layer perceptron with ReLU activation functions. It processes the input amino acid representation and metadata through fully connected layers and outputs the intensities of the consecutive fragment ion types at the position of each amino acid; a peptide spectrum is then constructed from the intensity of each ion.
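The last part of step (2), assembling a peptide spectrum from per-position ion intensities, can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_spectrum`, the restriction to singly charged b and y ions, and the convention of one b and one y intensity per cleavage site are all assumptions, and the intensity lists stand in for the decoder's output. The residue masses are the standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton


def build_spectrum(sequence, b_intensities, y_intensities):
    """Return sorted (m/z, intensity, label) peaks for singly charged b/y ions.

    b_intensities[i] and y_intensities[i] are the (hypothetical) predicted
    intensities for the cleavage site after residue i + 1, so each list
    must hold len(sequence) - 1 values.
    """
    n = len(sequence)
    if len(b_intensities) != n - 1 or len(y_intensities) != n - 1:
        raise ValueError("expected one b and one y intensity per cleavage site")
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    total = sum(masses)
    peaks = []
    prefix = 0.0
    for i in range(n - 1):
        prefix += masses[i]
        # b ion: N-terminal fragment containing residues 1..i+1.
        peaks.append((prefix + PROTON, b_intensities[i], f"b{i + 1}"))
        # y ion: complementary C-terminal fragment of length n - i - 1.
        y_mz = total - prefix + WATER + PROTON
        peaks.append((y_mz, y_intensities[i], f"y{n - i - 1}"))
    # Keep only peaks with a non-zero predicted intensity, ordered by m/z.
    return sorted(p for p in peaks if p[1] > 0)
```

Calling `build_spectrum("PEPTIDE", b, y)` with six intensities per ion series returns one library entry; repeating this for every candidate peptide would populate the spectral library.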
S2: performing mass spectrometry analysis with the MaxDIA protein analysis algorithm;
Further, the operation steps include:
(1) As shown in FIG. 4, qualitative identification of proteins: the DIA mass spectrum data are preprocessed, features are extracted from the preprocessed data, the features are compared and matched against the mass spectra of known proteins using a protein database and a spectral library matching algorithm, and the identified proteins are statistically validated and assessed for reliability using a Bootstrap method;
The Bootstrap procedure matches the spectra in the library to the DIA samples in several rounds, with the aim of guiding the DIA identification process with as little prior knowledge as possible. Each round yields more information, which can be used in the following rounds;
(2) Estimating the relative expression level of the protein: the intensity values of the individual samples are first summed without normalization, and the normalization factors are then determined by a global optimization procedure that treats them as free variables, so that the quantification error over the whole proteome is minimized.
Specifically, the normalization coefficient $N_j$ must be determined; it multiplies the signal intensities of all peptide ions in the j-th mass spectrometry run to correct for intensity variations between runs, which makes the estimated protein expression levels more reliable. The total intensity of peptide ion P in sample A is defined as:
$$I_{P,A}(N) = \sum_{k} N_{\mathrm{run}(k)} \, \mathrm{XIC}_k$$
where k indexes all isotope peaks of peptide ion P in sample A, run(k) is the mass spectrometry run in which peak k was measured, and XIC denotes the cross-sectional area at maximum intensity. The choice of peptide information is also important: MaxLFQ uses the relative expression of peptide ion signals, rather than the total signal, to calculate the relative expression of proteins, because the ratio of peptide ion signals reflects the corresponding protein intensity ratio. Specifically, for a protein Pro, the peptides $P = \{p_1, p_2, \ldots, p_m\}$ are identified by peptide matching, and the XIC intensity of peptide $p_m$ in sample A of the sample set $S = \{A, B, C, \ldots, Z\}$ is $\mathrm{XIC}_{Am}$. The peptides are searched first: a peptide can be used to calculate the abundance ratio of protein Pro between samples A and B only if its signal is detected in both samples, and the set of indices of the peptides meeting this condition is:
$$C = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$$
where each $\alpha_i$ in set C is the index of a peptide, $\mathrm{XIC}_{A\alpha_i}$ is the XIC expression level of peptide $p_{\alpha_i}$ in sample A, and every peptide whose index is in C has a distinguishable signal in both samples A and B. To reduce the influence of outliers, the ratio of the medians of the XIC expression levels of the qualifying peptides is taken as the protein abundance ratio, calculated as:
$$r_{AB} = \frac{\mathrm{median}\left(\{\mathrm{XIC}_{A\alpha_i}\}\right)}{\mathrm{median}\left(\{\mathrm{XIC}_{B\alpha_i}\}\right)}, \quad \alpha_i \in C$$
where $r_{AB}$ is the ratio of the protein abundance in sample A to that in sample B, and median(XIC) denotes the median of the elements of the set.
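The quantification step can be illustrated with a short sketch. It assumes the literal reading of the text, in which the abundance ratio is the ratio of the median XICs over the shared peptide set C; the published MaxLFQ algorithm instead takes the median of the pairwise peptide ratios, so this is one possible interpretation and not a confirmed implementation. The function names, the dictionary-based inputs, and the `min_signal` cutoff for a "distinguishable" signal are hypothetical.

```python
from statistics import median


def total_intensity(xic_peaks, norm_factors):
    """Sum N_j * XIC_k over all isotope peaks k of one peptide ion in one sample.

    xic_peaks: list of (run_index, xic) pairs, one per isotope peak.
    norm_factors: dict mapping run_index j to its normalization coefficient N_j.
    """
    return sum(norm_factors[run] * xic for run, xic in xic_peaks)


def shared_peptides(xic_a, xic_b, min_signal=0.0):
    """Set C: peptides whose XIC exceeds min_signal in both samples."""
    return sorted(
        pep for pep in xic_a
        if xic_a[pep] > min_signal and xic_b.get(pep, 0.0) > min_signal
    )


def abundance_ratio(xic_a, xic_b, min_signal=0.0):
    """r_AB as the ratio of the median XICs over the shared peptide set C."""
    shared = shared_peptides(xic_a, xic_b, min_signal)
    if not shared:
        return None  # no peptide is quantified in both samples
    return median(xic_a[p] for p in shared) / median(xic_b[p] for p in shared)
```

Using medians rather than sums keeps a single outlier peptide from dominating the ratio, which matches the stated motivation of reducing the influence of abnormal values.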
S3: filtering against four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results.
Further, as shown in FIG. 5, the operation steps of the multidimensional protein filtering algorithm include:
(1) Contaminant-based protein filtering: common contaminants include keratin, serum albumin, and the like. Keratin is a structural protein of the epidermis present in the outer layers of skin, hair and nails; it can enter the mass spectrometer together with the sample and thereby affect the protein identification results;
(2) Decoy-library-based protein filtering: decoy proteins are non-target proteins generated under the target-decoy database search strategy;
(3) Missing-value-based protein filtering: the threshold is set to 30%;
(4) Unique-peptide-based protein filtering: proteins are composed of different arrangements and combinations of peptides, and the same peptide can occur in different proteins. A database search first identifies peptides and then matches them to the proteins that contain them. Because a peptide is not specific to a single protein, some proteins may be identified incorrectly owing to shared peptides, so these low-confidence proteins need to be removed.
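Criterion (4) can be made concrete with a small sketch that counts, for each protein, the peptides that map to no other protein. The function names and the `min_unique=1` cutoff are assumptions for illustration; the text does not state how many unique peptides a protein must have to be retained.

```python
from collections import defaultdict


def unique_peptide_counts(protein_peptides):
    """Count, per protein, the peptides that map to no other protein.

    protein_peptides: dict mapping a protein id to an iterable of peptide sequences.
    """
    owners = defaultdict(set)  # peptide -> set of proteins containing it
    for protein, peptides in protein_peptides.items():
        for pep in peptides:
            owners[pep].add(protein)
    counts = {protein: 0 for protein in protein_peptides}
    for proteins in owners.values():
        if len(proteins) == 1:  # peptide is unique to a single protein
            counts[next(iter(proteins))] += 1
    return counts


def keep_with_unique_peptides(protein_peptides, min_unique=1):
    """Return the proteins supported by at least min_unique unique peptides."""
    counts = unique_peptide_counts(protein_peptides)
    return {protein for protein, c in counts.items() if c >= min_unique}
```

A protein identified only through shared peptides gets a count of zero and is dropped, which is the low-confidence case the criterion describes.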
The multidimensional protein filtering algorithm further comprises screening the input protein set exp_p to obtain a cleaner protein set, comprising the following steps:
initializing exp_p' to exp_p, where exp_p' is the protein set after screening;
traversing each protein p in exp_p';
deleting p from exp_p' if p belongs to the contaminant set con_p;
deleting p from exp_p' if p belongs to the decoy protein set De_p;
deleting p from exp_p' if p is in exp_p' and its missing-value rate mp is greater than a preset threshold;
deleting p from exp_p' if p is in exp_p' and its similarity to other proteins in the set is greater than a preset threshold;
and finally returning exp_p', the screened protein set.
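The screening steps above can be sketched in a few lines. The names exp_p and con_p follow the text, and de_p stands for De_p; everything else is an assumption: the text does not define the similarity measure, so Jaccard similarity of peptide sets is used as a stand-in, and the 0.8 similarity threshold is hypothetical. Only the 30% missing-value threshold comes from the description.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_proteins(exp_p, con_p, de_p, missing_rate, peptides,
                    missing_threshold=0.30, similarity_threshold=0.8):
    """Return exp_p', the protein set remaining after the four filters.

    missing_rate: dict protein -> fraction of samples with no value (mp).
    peptides: dict protein -> set of identified peptides (assumed input,
    used only for the stand-in similarity measure).
    """
    kept = set(exp_p)  # exp_p' starts as a copy of exp_p
    # Iterate over a sorted snapshot so that deleting from `kept` is safe
    # and the result does not depend on set ordering.
    for p in sorted(exp_p):
        if p in con_p or p in de_p:
            kept.discard(p)  # contaminant or decoy
        elif missing_rate.get(p, 0.0) > missing_threshold:
            kept.discard(p)  # too many missing values
        elif any(jaccard(peptides[p], peptides[q]) > similarity_threshold
                 for q in kept if q != p):
            # Near-duplicate of a protein still in exp_p'; because p is
            # removed before its partner is visited, one member of each
            # redundant pair survives.
            kept.discard(p)
    return kept
```

Comparing each protein only against those still in exp_p' is a design choice of this sketch: it keeps one representative of a group of highly similar proteins instead of deleting the whole group.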
This embodiment also provides a system for efficient identification of mixed proteins, comprising:
a construction module for constructing, by means of a bidirectional recurrent neural network algorithm, a spectral library containing protein peptides and their corresponding mass spectrum information, and for optimizing the peptide identification results through repeated iterative searches;
an analysis module for performing mass spectrometry analysis with the MaxDIA protein analysis algorithm;
and a filtering module for filtering against four different criteria with a multidimensional protein filtering algorithm to obtain high-confidence protein database search results.
The system further comprises:
a memory for storing a program;
and a processor for loading the program to execute the method for efficient identification of mixed proteins.
This embodiment also provides a computer-readable storage medium storing a program which, when executed by a processor, implements the method for efficient identification of mixed proteins.
The storage medium according to the present embodiment belongs to the same inventive concept as the method for efficient identification of mixed proteins according to the above embodiment, and technical details not described in detail in the present embodiment can be seen in the above embodiment, and the present embodiment has the same advantageous effects as the above embodiment.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments of the present invention.
Example 2
Referring to Table 1, this embodiment of the invention provides a method for efficient identification of mixed proteins; to verify its beneficial effects, the two schemes are compared experimentally.
Purpose of the experiment: to compare the performance of the prior art and the present invention in protein identification.
Sample preparation: human serum samples were collected and labeled as the prior-art group and the present-invention group, respectively.
Conventional method: the samples were processed and identified using an existing protein identification technique. The specific steps include protein extraction, separation, concentration, fluorescent labeling, chip loading, hybridization, washing, and detection.
Data on the identification speed, accuracy, quantitative precision, and applicability of the two methods were recorded.
The data were analyzed statistically and the performance of the two methods was compared.
The comparison results are shown in Table 1:
Table 1. Comparison table
As can be seen from Table 1, by filtering against four different criteria, the multidimensional protein filtering of the invention can analyze a large number of protein samples in a short time. It not only identifies which proteins are present but also provides quantitative information on them, with higher quantitative precision and wider applicability. The four filtering steps effectively improve the accuracy of mixed protein identification and help in the search for potential biomarkers related to disease, response to drug treatment, or disease progression.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (10)

1. A method for efficient identification of mixed proteins, comprising:
constructing, by a bidirectional recurrent neural network algorithm, a spectral library containing protein peptide fragments and their corresponding mass spectrum information, and optimizing the peptide fragment identification results through repeated iterative searches;
carrying out mass spectrum analysis by the MaxDIA protein analysis algorithm;
and filtering according to four different criteria by a multidimensional protein filtering algorithm to obtain protein library search results with a high confidence level.
2. The method for efficiently identifying a mixed protein according to claim 1, wherein:
obtaining, by the algorithm and structure of the bidirectional recurrent neural network, the spectral library containing protein peptide fragments and their corresponding mass spectrum information comprises the following steps:
encoding, wherein the encoder comprises a three-layer bidirectional long short-term memory (BiLSTM) network, takes the amino acid sequence of a peptide fragment and its mass spectrum data as input, and outputs the intensity of each fragment ion;
and decoding, wherein the decoder is a multi-layer perceptron with ReLU activation functions, processes the input amino acid representation and mass spectrum data through fully connected layers, outputs the intensity information of the consecutive fragment ion types at each input amino acid position, and constructs the peptide fragment spectrum from the intensity information of each ion.
3. The method for efficiently identifying a mixed protein according to claim 1 or 2, wherein: the mass spectrum analysis by the MaxDIA protein analysis algorithm comprises the following steps:
qualitative identification of protein: preprocessing DIA mass spectrum data, extracting features from the preprocessed mass spectrum data, comparing and matching the features with mass spectrograms of known proteins by using a protein database and a spectrum library matching algorithm, and carrying out statistical verification and reliability evaluation on the identified proteins by adopting a Bootstrap method;
estimating the relative expression level of the protein: the intensity values of the individual samples are added without normalization, and the normalization factor is then determined by a global optimization process with the normalization factor as a free variable, so that the quantitative error of the whole proteome is minimized.
4. The method for efficiently identifying a mixed protein according to claim 3, wherein: the estimating of the relative expression level of the protein comprises multiplying the signal intensities of all peptide ions in the j-th mass spectrometry run by a normalization coefficient $N_j$ to correct for intensity variations between different mass spectrometry runs, the total intensity of peptide ion P in sample A being defined as:
where k indexes all isotopic peaks of peptide ion P in sample A, and XIC denotes the cross-sectional area of the peak at maximum intensity.
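The defining formula of claim 4 is not reproduced in this text. As an illustration only, the sketch below assumes the form $I_{P,A} = \sum_k N_{run(k)} \cdot XIC_k$, which is consistent with the surrounding definitions but is not confirmed by the source. The function name `total_intensity` and the data layout are hypothetical.

```python
def total_intensity(peaks, norm):
    """Normalized total intensity of one peptide ion in one sample.

    peaks: list of (run_index, xic) pairs, one pair per isotopic peak k.
    norm:  mapping run_index -> normalization coefficient N_j (assumed known).

    Assumed form (not stated in the source): sum over k of N_run(k) * XIC_k.
    """
    return sum(norm[run] * xic for run, xic in peaks)


# Hypothetical data: three isotopic peaks measured in two runs.
peaks = [(0, 100.0), (0, 50.0), (1, 30.0)]
norm = {0: 1.0, 1: 2.0}
print(total_intensity(peaks, norm))
```

In the claimed method the coefficients $N_j$ are themselves obtained by a global optimization; here they are simply supplied as inputs.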
5. The method for efficiently identifying a mixed protein according to claim 4, wherein: the estimating of the relative expression level of the protein further comprises, in view of the selection of peptide fragment information, calculating the relative expression amount of the protein from the relative expression amounts of the peptide ion signals; specifically, for a protein Pro, the peptide fragments $P = \{p_1, p_2, \ldots, p_m\}$ are identified by peptide fragment matching, and the XIC intensity of peptide fragment $p_m$ in sample A, among the samples $S = \{A, B, C, \ldots, Z\}$, is $XIC_{Am}$; calculating the abundance ratio of protein Pro between samples A and B requires peptide fragments whose signals are detected in both samples A and B, and the set of indices of the peptide fragments meeting this condition is:
$C = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$
wherein $\alpha_n$ in set C denotes the index of a peptide fragment, $XIC_{A\alpha_i}$ denotes the XIC expression level of peptide fragment $p_{\alpha_i}$ in sample A, and the peptide fragments corresponding to the indices in set C have a distinguishable signal in both samples A and B;
taking the ratio of the medians of the XIC expression amounts of the qualifying peptide fragments as the protein abundance ratio, which is expressed as:
wherein $r_{AB}$ denotes the ratio of the protein abundance in sample A to that in sample B, and median(XIC) denotes the median of the elements in the set.
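The ratio described in claim 5 can be sketched as follows. Because the formula is not reproduced in this text, the sketch assumes $r_{AB} = \mathrm{median}(XIC_{A\alpha_i}) / \mathrm{median}(XIC_{B\alpha_i})$ over the shared index set C, following the wording "ratio of the medians". The function name `abundance_ratio` and the use of `None` or `0` for an undetected peptide are illustrative assumptions.

```python
from statistics import median


def abundance_ratio(xic_a, xic_b):
    """Protein abundance ratio between samples A and B.

    xic_a, xic_b: equal-length lists of XIC values per peptide index;
    None or 0 marks a peptide with no distinguishable signal (assumption).
    Returns None when no peptide is detected in both samples.
    """
    # Set C: indices of peptides with a signal in both samples.
    shared = [i for i in range(len(xic_a)) if xic_a[i] and xic_b[i]]
    if not shared:
        return None
    med_a = median([xic_a[i] for i in shared])
    med_b = median([xic_b[i] for i in shared])
    return med_a / med_b


# Hypothetical data: only peptides 0 and 3 are detected in both samples.
xic_a = [10.0, 20.0, None, 40.0]
xic_b = [5.0, 0.0, 8.0, 15.0]
print(abundance_ratio(xic_a, xic_b))
```

Restricting the calculation to set C avoids dividing by a missing signal; using medians limits the influence of a single outlying peptide.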
6. The method for efficiently identifying a mixed protein according to claim 5, wherein: the filtering by four different standards comprises:
contaminant-based protein filtering, wherein keratin, an epidermal structural protein present in the outer layers of skin, hair and nails, can enter the mass spectrometer together with the sample and thereby affect the protein identification result;
decoy-library-based protein filtering, wherein a decoy protein is a non-target protein generated under the target-decoy library search strategy;
missing-value-based protein filtering, wherein the threshold is set to 30%;
unique-peptide-based protein filtering, wherein a protein is formed by different permutations and combinations of peptide fragments, the same peptide fragment may be present in different proteins, the peptide fragments are obtained first during the protein library search, and the proteins containing those peptide fragments are then matched according to the peptide fragments.
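Two of the quantities used by these criteria can be sketched in a few lines. The function names, the use of `None` for a missing value, and the peptide-to-protein mapping layout are hypothetical; the claim does not state how the unique peptide count is turned into a keep/discard decision, so the sketch only computes the counts.

```python
from collections import defaultdict


def missing_rate(intensities):
    """Fraction of samples in which a protein has no quantified value (None)."""
    return sum(v is None for v in intensities) / len(intensities)


def unique_peptide_counts(peptide_to_proteins):
    """Count, for each protein, the peptides that map to that protein only."""
    counts = defaultdict(int)
    for proteins in peptide_to_proteins.values():
        if len(proteins) == 1:
            counts[next(iter(proteins))] += 1
    return dict(counts)


# Hypothetical data: a protein quantified in 2 of 4 samples.
print(missing_rate([1.0, None, 3.0, None]) > 0.30)  # compared with the 30% threshold

# Hypothetical mapping: PEPB is shared by P1 and P2, so it is not unique.
mapping = {"PEPA": {"P1"}, "PEPB": {"P1", "P2"}, "PEPC": {"P2"}, "PEPD": {"P1"}}
print(unique_peptide_counts(mapping))
```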
7. The method for efficiently identifying a mixed protein according to claim 6, wherein: the multidimensional protein filtering algorithm further comprises the step of screening the input protein set, and comprises the following steps:
initializing exp_p' to exp_p, where exp_p' is the set of proteins after screening;
traversing each protein p in exp_p';
deleting p from exp_p' if p belongs to the contaminant set con_p;
deleting p from exp_p' if p belongs to the decoy protein set De_p;
deleting p from exp_p' if p exists in exp_p' and its missing rate mp is greater than a preset threshold;
deleting p from exp_p' if p exists in exp_p' and its similarity to other proteins in the set is greater than a preset threshold;
and finally returning exp_p', i.e., the set of screened proteins.
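The screening steps of claim 7 can be sketched as below. This is a minimal illustration under stated assumptions: the `similarity` callable, the similarity threshold `sim_max`, and the protein names are hypothetical, and the default `mp_max=0.30` is taken from the 30% missing-value threshold of claim 6. The loop iterates over a sorted copy of the input so that deleting from exp_p' during traversal is safe and the result is deterministic.

```python
def screen_proteins(exp_p, con_p, de_p, missing, similarity,
                    mp_max=0.30, sim_max=0.9):
    """Return exp_p', the set of proteins remaining after the four filters.

    exp_p:      iterable of identified protein names
    con_p:      set of contaminant proteins
    de_p:       set of decoy proteins
    missing:    mapping protein -> missing rate mp (0.0 if absent)
    similarity: callable (p, q) -> score in [0, 1] (hypothetical helper)
    """
    exp_p_prime = set(exp_p)
    for p in sorted(exp_p):  # traverse a copy; deletions act on exp_p_prime
        if p in con_p or p in de_p:
            exp_p_prime.discard(p)
        elif missing.get(p, 0.0) > mp_max:
            exp_p_prime.discard(p)
        elif any(similarity(p, q) > sim_max for q in exp_p_prime if q != p):
            exp_p_prime.discard(p)
    return exp_p_prime


# Hypothetical inputs.
pair_scores = {frozenset(("TF", "TFL")): 0.95}
sim = lambda a, b: pair_scores.get(frozenset((a, b)), 0.0)
result = screen_proteins(
    exp_p=["ALB", "KRT1", "REV_ALB", "HP", "TF", "TFL"],
    con_p={"KRT1"},
    de_p={"REV_ALB"},
    missing={"HP": 0.5},
    similarity=sim,
)
print(sorted(result))
```

With a pairwise similarity test, which member of a highly similar pair is removed depends on traversal order; here the first one encountered is dropped and the other is kept.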
8. An identification system based on the method for efficiently identifying a mixed protein according to any one of claims 1 to 7, characterized by comprising:
a construction module for constructing, by a bidirectional recurrent neural network algorithm, a spectral library containing protein peptide fragments and their corresponding mass spectrum information, and for optimizing the peptide fragment identification results through repeated iterative searches;
an analysis module for carrying out mass spectrum analysis by the MaxDIA protein analysis algorithm;
and a filtering module for filtering according to four different criteria by a multidimensional protein filtering algorithm to obtain protein library search results with a high confidence level.
9. An electronic device, comprising:
a memory for storing a program;
a processor for loading the program to perform the steps of the method for efficiently identifying a mixed protein according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method for efficiently identifying a mixed protein according to any one of claims 1 to 7.
CN202410059002.1A 2024-01-15 2024-01-15 Efficient identification method and system for mixed protein Pending CN117809753A (en)

Publications (1)

Publication Number Publication Date
CN117809753A true CN117809753A (en) 2024-04-02

Legal Events

Date Code Title Description
PB01 Publication