CN114171130A - Core fucose identification method, system, equipment, medium and terminal - Google Patents
Core fucose identification method, system, equipment, medium and terminal
- Publication number
- CN114171130A (CN202111235011.4A)
- Authority
- CN
- China
- Prior art keywords
- core fucose
- fucose
- data
- core
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/20—Identification of molecular entities, parts thereof or of chemical compositions
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N30/00—Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
- G01N30/02—Column chromatography
- G01N30/62—Detectors specially adapted therefor
- G01N30/72—Mass spectrometers
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
The invention belongs to the technical field of core fucose identification and discloses a core fucose identification method, system, equipment, medium and terminal. The method comprises the following steps: introducing characteristic ions; preprocessing data; training a model; calculating a threshold; and identifying core fucose. Because core fucose is absent from the tissues of FUT8-knockout mice, the invention learns a representation of non-core fucose data with an autoencoder and treats core fucose identification as an anomaly detection problem. This avoids the label errors that arise in training data when non-core fucose data with fucose migration are confused with core fucose data. The method is simple to operate, and the training data contain only non-core fucose data. Technically, 10 characteristic ions are introduced, and their abundances are input into the autoencoder to decide whether a mass spectrum to be identified is a core fucose mass spectrum. The invention can also distinguish core fucose from non-core fucose with fucose migration, and identification is fast.
Description
Technical Field
The invention belongs to the technical field of core fucose identification, and particularly relates to a core fucose identification method, a system, equipment, a medium and a terminal.
Background
Core fucosylation alters the secondary and tertiary conformations of glycoproteins and thereby plays important roles in tumor progression, immune regulation, and stem cell differentiation. It has been reported that the core fucose level of tumor tissues is higher than that of normal tissues. Immunoglobulins bind to receptors on the surfaces of natural killer cells and macrophages to induce an immune response, and the affinity of an antibody for these receptors is reduced by 98% to 99% after core fucosylation of the N-glycans in the antibody. Furthermore, many core-fucosylated glycoproteins may serve as important tumor biomarkers; for example, alpha-fetoprotein is an important biomarker for hepatocellular carcinoma.
An autoencoder is a neural network that learns to reconstruct its input data in an unsupervised manner. The algorithm is based on the following concepts:
1. Encoder: maps the input data to a code that characterizes the input data.
2. Decoder: maps the code to a reconstruction of the input data.
3. Reconstruction error: the Euclidean distance between the output data and the input data.
4. Threshold: calculated from the mean and standard deviation of the reconstruction errors of the training data.
The autoencoder algorithm learns a representation of the training data such that the reconstruction error on the training data is minimized.
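The reconstruction error described above can be sketched in a few lines of Python. This is only an illustration: `toy_autoencoder` is a hypothetical stand-in for a trained encoder and decoder (it can reproduce only vectors whose entries are all equal), and the two input vectors are made-up values, not data from the invention.

```python
import math

def reconstruction_error(x, reconstruct):
    """Euclidean distance between an input vector and its reconstruction."""
    return math.dist(x, reconstruct(x))

# Hypothetical stand-in for a trained autoencoder (assumption): it can only
# reproduce vectors whose entries are all equal, so any other vector is
# reconstructed poorly.
def toy_autoencoder(x):
    mean = sum(x) / len(x)
    return [mean] * len(x)

typical = [0.25, 0.25, 0.25, 0.25]  # resembles what the toy model has "learned"
unusual = [0.70, 0.10, 0.10, 0.10]  # does not

err_typical = reconstruction_error(typical, toy_autoencoder)
err_unusual = reconstruction_error(unusual, toy_autoencoder)
print(err_typical, err_unusual)
```

Data resembling the training data yield a small error and data unlike it yield a large one, which is the property that the threshold later exploits.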
Core fucose: a modification in which fucose is attached through an α1,6 linkage to the innermost N-acetylglucosamine of an N-glycan.
Fucose migration: after activation, a terminal fucose unit migrates within the molecule to an adjacent or distant monosaccharide (see FIG. 7). Because the fucose residue is re-linked to a sterically adjacent or distant monosaccharide, fucose migration often produces misleading fragment ions and therefore erroneous mass spectrum data.
At present, core fucose identification algorithms based on mass spectrometry mainly fall into the following categories:
1. Manual spectrum interpretation. The intrinsic characteristic peaks of core fucose are matched manually against the mass spectrum data generated by glycopeptides at the MS2 stage of the mass spectrometer, and a glycopeptide is identified when the number of matched peaks exceeds a threshold;
2. Machine learning. Core fucose identification is treated as a binary classification problem: corresponding characteristic ions are selected, and a binary classification learning method is applied to identify core fucose. For example, Heeyoun applied a Support Vector Machine (SVM) and a Deep Neural Network (DNN) to core fucose identification.
However, existing core fucose identification methods cannot distinguish core fucose from non-core fucose in which fucose migration has occurred. Because fucose migration is not taken into account, the training data may contain mislabeled data; that is, some of the mass spectrum data labeled as core fucose in the training set are labeled incorrectly, so a model trained on them is unreliable for core fucose identification. Developing reliable identification of protein core fucose therefore helps to explain the function of protein core fucosylation and is also of great importance for discovering new biomarkers for the prognosis and diagnosis of cancer.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Existing core fucose identification algorithms based on mass spectrometry cannot distinguish a core fucose mass spectrum from a non-core fucose mass spectrum with fucose migration.
(2) Because fucose migration is not taken into account, the training data may contain mislabeled data; that is, some of the mass spectra labeled as core fucose in the training set are labeled incorrectly, so the trained model is unreliable for core fucose identification.
The difficulty in solving the above problems and defects is: it is only known that fucose migration occurs, while the conditions, positions, and statistical properties of the migration are unknown, which poses a great challenge to core fucose identification.
The significance of solving the above problems and defects is: core fucosylation changes the secondary and tertiary conformations of glycoproteins and thereby plays an important biological role in the development, progression, and metastasis of tumors. Because of fucose migration, non-core fucose is easily identified as core fucose, which misleads the understanding of tumor development. High-quality identification of core fucose is therefore of great significance for understanding the biological mechanisms of tumors.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a core fucose identification method, system, equipment, medium and terminal, and in particular an autoencoder-based core fucose identification method, system, equipment, medium and terminal. It aims to solve the problem that existing mass-spectrometry-based core fucose identification algorithms cannot distinguish a core fucose mass spectrum from a non-core fucose mass spectrum with fucose migration.
The technical scheme of the invention is summarized as follows: core fucose is absent from the tissues of FUT8-knockout mice; mass spectrum data of these tissues are obtained, characteristic ions are introduced, and the relative abundances of the characteristic ions are used to train an autoencoder model of non-core fucose. Core fucose identification is then treated as an anomaly detection problem for this model, thereby realizing the identification of core fucose.
The invention is realized in such a way that a method for identifying core fucose comprises the following steps:
step one, introducing characteristic ions;
step two, data preprocessing;
step three, training a model;
step four, calculating a threshold value;
and step five, identifying the core fucose.
Further, in step one, introducing the characteristic ions includes:
the pentasaccharide core is the intrinsic structure contained in every N-glycan, so in theory the Y ions generated by the fragmentation of core fucose include 10 ions, which are called the characteristic ions for fucose identification; the masses of the Y ions generated when fucose migrates to the pentasaccharide core differ from the masses of these 10 characteristic ions.
Further, in step two, the data preprocessing includes:
core fucose is absent from the tissues of FUT8-knockout mice; the mass spectrum data of these tissues are normalized by dividing the abundance of each Y ion by the sum of the abundances of all Y ions in the mass spectrum data, which gives the normalized mass spectrum data: if $a_j$ denotes the abundance of the j-th Y ion, its relative abundance is $\tilde{a}_j = a_j / \sum_l a_l$;
the relative abundances of the 10 characteristic ions in the normalized mass spectrum data are used as the training data.
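The normalization step can be sketched as follows. The ion labels (`char_1` to `char_10`, `other_Y`) and the abundance values are hypothetical placeholders, since the actual characteristic ions are those of FIG. 4 and are not reproduced here; treating a characteristic ion that is missing from a spectrum as having zero abundance is also an assumption.

```python
# Hypothetical labels for the 10 characteristic Y ions (assumption); the real
# ions are those shown in FIG. 4.
CHARACTERISTIC_IONS = [f"char_{i}" for i in range(1, 11)]

def normalize_spectrum(y_ion_abundances):
    """Divide each Y-ion abundance by the summed abundance of all Y ions."""
    total = sum(y_ion_abundances.values())
    if total <= 0:
        raise ValueError("spectrum contains no Y-ion signal")
    return {ion: a / total for ion, a in y_ion_abundances.items()}

def feature_vector(normalized, characteristic_ions=CHARACTERISTIC_IONS):
    """Relative abundances of the characteristic ions, in a fixed order.

    Assumption: a characteristic ion absent from the spectrum contributes 0.0.
    """
    return [normalized.get(ion, 0.0) for ion in characteristic_ions]

# Made-up spectrum: two characteristic ions plus one other Y ion.
spectrum = {"char_1": 200.0, "char_2": 100.0, "other_Y": 700.0}
normalized = normalize_spectrum(spectrum)
features = feature_vector(normalized)
print(features)
```

Note that the sum in the denominator runs over all Y ions, not only the 10 characteristic ones, so the 10 features do not in general sum to 1.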
Further, in step three, the model training includes:
training a non-core fucose autoencoder; the non-core fucose autoencoder is a 7-layer artificial neural network in which the input layer and the output layer each contain 10 artificial neurons, the hidden layers contain, from left to right, 9, 8, 7, 8 and 9 artificial neurons respectively, and the artificial neurons of adjacent layers are fully connected; the activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the tolerance is 0.0000001.
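To make the layer layout concrete, the following sketch builds an untrained network of the stated shape and runs one forward pass. It is not the trained model of the invention: the weights are random, and applying tanh only in the hidden layers with a linear output layer is an assumption, because the text does not say how the output layer is activated.

```python
import math
import random

LAYER_SIZES = [10, 9, 8, 7, 8, 9, 10]  # input, five hidden layers, output

def init_params(sizes, seed=0):
    """Random weights and zero biases for each pair of adjacent layers."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params

def count_parameters(params):
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)

def forward(x, params):
    """Fully connected forward pass; tanh on hidden layers, linear output
    (the linear output is an assumption)."""
    a = x
    last = len(params) - 1
    for idx, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, a)) + b
             for row, b in zip(weights, biases)]
        a = z if idx == last else [math.tanh(v) for v in z]
    return a

params = init_params(LAYER_SIZES)
output = forward([0.1] * 10, params)
print(count_parameters(params), len(output))
```

With full connections between adjacent layers, this layout has 436 weights and 51 biases, so the model is small enough to train and evaluate quickly.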
Further, in step four, the threshold calculation includes:
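let the normalized data set of non-core fucose mass spectra be $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(N)}\}$ and let the autoencoder be $f_\phi \circ g_\theta$; the reconstruction error of each $x^{(i)}$ in the data set $X$ is calculated and recorded as $e(x^{(i)}) = \lVert x^{(i)} - (f_\phi \circ g_\theta)(x^{(i)}) \rVert_2$, and $\mu$ and $\sigma$ denote the mean and standard deviation of these reconstruction errors.
The threshold $\alpha$ is calculated as:
$\alpha = \mu + k \cdot \sigma$, where $k$ is a parameter;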
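The threshold formula can be sketched as below. The list of training reconstruction errors is made up for illustration, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or the sample form is intended.

```python
import statistics

def threshold(train_errors, k):
    """alpha = mu + k * sigma over the training reconstruction errors."""
    mu = statistics.fmean(train_errors)
    sigma = statistics.pstdev(train_errors)  # population form (assumption)
    return mu + k * sigma

# Made-up reconstruction errors of non-core fucose training spectra.
train_errors = [0.1, 0.2, 0.3, 0.4]
alpha = threshold(train_errors, k=0.4)
print(alpha)
```

A larger k raises the threshold, so fewer spectra are flagged as anomalous.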
Further, in step five, the identification of core fucose comprises:
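let the normalized data set of the mass spectrum data to be identified be $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(M)}\}$; the reconstruction error of each $y^{(i)}$ in the data set $Y$ is calculated and recorded as $e(y^{(i)}) = \lVert y^{(i)} - (f_\phi \circ g_\theta)(y^{(i)}) \rVert_2$.
If $e(y^{(i)}) > \alpha$, the i-th mass spectrum to be identified is identified as core fucose; if $e(y^{(i)}) \le \alpha$, the i-th mass spectrum to be identified is identified as non-core fucose.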
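The decision rule amounts to a single comparison per spectrum, as in this minimal sketch. The error values and the threshold are made up, and assigning the boundary case (error equal to the threshold) to non-core fucose is an assumption.

```python
def identify(errors, alpha):
    """Label each spectrum by comparing its reconstruction error with alpha."""
    return ["core fucose" if e > alpha else "non-core fucose" for e in errors]

# Made-up reconstruction errors of four spectra to be identified.
labels = identify([0.05, 0.29, 0.31, 0.90], alpha=0.30)
print(labels)
```

Because the autoencoder has only seen non-core fucose data, a spectrum it reconstructs poorly is treated as an anomaly, that is, as core fucose.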
Another object of the present invention is to provide a core fucose identification system that applies the core fucose identification method, the core fucose identification system comprising:
a characteristic ion introduction module for introducing the characteristic ions for fucose identification;
a data preprocessing module for normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent;
a model training module for training the non-core fucose autoencoder;
a threshold calculation module for calculating the reconstruction error of each $x^{(i)}$ in the data set $X$;
a core fucose identification module for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent; training a non-core fucose autoencoder using the relative abundances of the 10 characteristic ions of the normalized mass spectrum data as training data; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent; training a non-core fucose autoencoder using the relative abundances of the 10 characteristic ions of the normalized mass spectrum data as training data; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
Another object of the present invention is to provide an information data processing terminal for implementing the core fucose identification system.
Taken together, the above technical solutions overcome a technical bias in the field and fill a gap in the field. The advantages and positive effects of the invention are as follows: the core fucose identification method provided by the invention solves the problem that a core fucose mass spectrum cannot be distinguished from a non-core fucose mass spectrum with fucose migration. Because core fucose is absent from the tissues of FUT8-knockout mice, the invention learns a representation of non-core fucose data with an autoencoder and treats core fucose identification as an anomaly detection problem, thereby avoiding label errors in the training data between non-core fucose data with fucose migration and core fucose data.
First, the invention uses FUT8-knockout mouse tissues to obtain data containing only non-core fucose for training the non-core fucose model, so that the training data contain no core fucose while still including non-core fucose with fucose migration. Second, the invention uses an autoencoder to learn a representation of the non-core fucose data and treats core fucose identification as an anomaly detection problem. Third, the invention introduces 10 characteristic ions, and the relative abundances of these ions are input into the autoencoder to decide whether a mass spectrum to be identified is a core fucose mass spectrum. The method is simple to operate, and the training data contain only non-core fucose data; the invention can distinguish core fucose from non-core fucose with fucose migration, and identification is fast.
The invention aims to identify as many of the mass spectra that are actually core fucose as possible; that is, the identification result should contain as many core fucose mass spectra as possible while tolerating a small number of mass spectra that are actually non-core fucose. The parameter k is therefore set to 0.4, and the resulting identification accuracy is shown in Table 1. The model achieves high accuracy in both non-core fucose identification and core fucose identification, which indicates that the technology and system of the invention are reliable and effective for core fucose identification.
Table 1. Core fucose identification results based on the autoencoder (k = 0.4)
Class | Number of spectra | Accuracy |
Non-core fucose | 1199 | 89.86% |
Core fucose | 426 | 98.83% |
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for identifying core fucose according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a core fucose identification method provided in an embodiment of the present invention.
FIG. 3 is a block diagram of a core fucose identification system according to an embodiment of the present invention;
in the figure: 1. a characteristic ion introduction module; 2. a data preprocessing module; 3. a model training module; 4. a threshold calculation module; 5. a core fucose identification module.
FIG. 4 is a schematic diagram of characteristic ions for core fucose identification provided by embodiments of the present invention.
FIG. 5 is a schematic diagram of the parameters of the autoencoder model according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the influence of the parameter k on the accuracy of the evaluation according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of an example of fucose migration phenomenon provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a method, system, device, medium and terminal for identifying core fucose, and the present invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the method for identifying core fucose provided by the embodiment of the present invention includes the following steps:
S101, introducing characteristic ions;
S102, preprocessing data;
S103, training a model;
S104, calculating a threshold;
S105, identifying core fucose.
The schematic diagram of the core fucose identification method provided by the embodiment of the invention is shown in figure 2.
The core fucose identification system provided by the embodiment of the invention is shown in figure 3, and comprises:
a characteristic ion introduction module 1 for introducing the characteristic ions for fucose identification;
a data preprocessing module 2 for normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent;
a model training module 3 for training the non-core fucose autoencoder;
a threshold calculation module 4 for calculating the reconstruction error of each $x^{(i)}$ in the data set $X$;
a core fucose identification module 5 for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
The technical solution of the present invention will be further described below with reference to the term explanation.
An auto-encoder: a neural network that learns to reconstruct input data in an unsupervised manner;
core fucose: an N-glycosylation modification.
The technical solution of the present invention is further described below with reference to specific examples.
Core fucose is absent from the tissues of FUT8-knockout mice. The invention learns a representation of non-core fucose mass spectrum data with an autoencoder and treats core fucose identification as an anomaly detection problem, thereby avoiding label errors in the training data between non-core fucose data with fucose migration and core fucose data.
The technical route of the invention is shown in figure 2.
The technical scheme of the invention is as follows:
(1) introduction of characteristic ions
The pentasaccharide core is the intrinsic structure contained in every N-glycan, so in theory the Y ions generated by the fragmentation of core fucose include the 10 ions shown in FIG. 4, which are called the characteristic ions for fucose identification. The masses of the Y ions generated when fucose migrates to the pentasaccharide core usually differ from those of the 10 characteristic ions.
(2) Data pre-processing
Core fucose is absent from the tissues of FUT8-knockout mice. The mass spectrum data of these tissues are normalized by dividing the abundance of each Y ion in the mass spectrum data by the sum of the abundances of all Y ions, which gives the normalized mass spectrum data: if $a_j$ denotes the abundance of the j-th Y ion, its relative abundance is $\tilde{a}_j = a_j / \sum_l a_l$.
The relative abundances of the above 10 characteristic ions in the normalized mass spectrum data are used as the training data.
(3) Model training
The invention trains a non-core fucose autoencoder, which is a 7-layer artificial neural network. As shown in FIG. 5, the input layer and the output layer each contain 10 artificial neurons, the hidden layers contain, from left to right, 9, 8, 7, 8 and 9 artificial neurons respectively, and the artificial neurons of adjacent layers are fully connected. The activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the tolerance is 0.0000001.
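How the three training settings interact can be illustrated with a scalar Adam loop on a toy objective. This is a sketch, not the training procedure of the invention: the quadratic loss is a stand-in for the reconstruction loss, the values of `beta1`, `beta2` and `eps` are the commonly used Adam defaults (assumptions, not stated in the text), and stopping as soon as the loss changes by less than the tolerance is a simplified reading of "tolerance".

```python
import math

def adam_minimize(grad, loss, w, lr=1e-4, max_iter=1000, tol=1e-7,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam with two stopping criteria: max_iter and loss tolerance."""
    m = v = 0.0
    prev = loss(w)
    for t in range(1, max_iter + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        cur = loss(w)
        if abs(prev - cur) < tol:                # simplified tolerance rule
            return w, cur, t
        prev = cur
    return w, prev, max_iter

# Toy objective (assumption): minimize (w - 0.05)^2 starting from w = 0.
toy_loss = lambda w: (w - 0.05) ** 2
toy_grad = lambda w: 2 * (w - 0.05)

initial_loss = toy_loss(0.0)
w_final, final_loss, n_iter = adam_minimize(toy_grad, toy_loss, 0.0)
print(n_iter, final_loss)
```

With a learning rate as small as 0.0001, each Adam step moves a parameter by roughly that amount at most, so the iteration cap and the tolerance together bound how far training can proceed.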
(4) Threshold calculation
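Let the normalized data set of non-core fucose mass spectra be $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(N)}\}$ and let the autoencoder be $f_\phi \circ g_\theta$. The reconstruction error of each $x^{(i)}$ in the data set $X$ is calculated and recorded as $e(x^{(i)}) = \lVert x^{(i)} - (f_\phi \circ g_\theta)(x^{(i)}) \rVert_2$, and $\mu$ and $\sigma$ denote the mean and standard deviation of these reconstruction errors.
The threshold $\alpha$ is calculated as:
$\alpha = \mu + k \cdot \sigma$ (3)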
(5) Core fucose identification
Let the normalized data set of the mass spectrum data to be identified be $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(M)}\}$. The reconstruction error of each $y^{(i)}$ in the data set $Y$ is calculated and recorded as $e(y^{(i)}) = \lVert y^{(i)} - (f_\phi \circ g_\theta)(y^{(i)}) \rVert_2$.
If $e(y^{(i)}) > \alpha$, the i-th mass spectrum to be identified is identified as core fucose; if $e(y^{(i)}) \le \alpha$, the i-th mass spectrum to be identified is identified as non-core fucose.
First, the invention uses FUT8-knockout mouse tissues to obtain data containing only non-core fucose, so that the training data contain no core fucose while still including non-core fucose with fucose migration. Second, the invention uses an autoencoder to learn a representation of the non-core fucose data and treats core fucose identification as an anomaly detection problem. Third, the invention introduces 10 characteristic ions, and the abundances of these ions are input into the autoencoder to decide whether a mass spectrum to be identified is a core fucose mass spectrum. The method is simple to operate, and the training data contain only non-core fucose data; the invention can distinguish core fucose from non-core fucose with fucose migration, and identification is fast.
The technical solution of the present invention is further described below with reference to simulation experiments.
Experimental examples: the following examples are for illustrative purposes and are not intended to limit the scope of the present invention.
In the experiment, an autoencoder is trained on a training set of 18700 non-core fucose mass spectra; the mass spectra to be identified comprise 1199 non-core fucose mass spectra and 426 high-mannose-type core fucose mass spectra. Fig. 6 shows how the identification accuracy on the mass spectra to be identified, obtained with the autoencoder trained on the training set, varies with the value of the parameter k (dark: identification accuracy for core fucose; light: identification accuracy for non-core fucose).
Table 1. Core fucose identification results based on the autoencoder (k = 0.4)
Spectrum type | Number of spectra | Accuracy |
---|---|---|
Non-core fucose | 1199 | 89.86% |
Core fucose | 426 | 98.83% |
Because the invention aims to identify as many true core fucose mass spectra as possible, the identification result should contain as many core fucose mass spectra as possible while tolerating a small number of spectra that are actually non-core fucose. The parameter k is therefore set to 0.4, and the resulting identification accuracy is shown in Table 1. As can be seen, the model achieves high accuracy for both non-core fucose identification and core fucose identification, which indicates that the technique and system of the invention are reliable and effective for core fucose identification.
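The trade-off behind the choice of k can be illustrated with a small sweep in Python. This sketch uses invented toy reconstruction errors, not the experimental data; the function name `accuracy_by_class`, the population standard deviation for σ, and the tie handling (an error equal to the threshold counts as non-core fucose) are assumptions.

```python
import statistics


def accuracy_by_class(train_errors, noncore_errors, core_errors, k):
    """Return (non-core accuracy, core accuracy) for one value of k."""
    alpha = statistics.mean(train_errors) + k * statistics.pstdev(train_errors)
    # Non-core spectra are correct when their error stays at or below the threshold.
    noncore_ok = sum(e <= alpha for e in noncore_errors) / len(noncore_errors)
    # Core spectra are correct when their error exceeds the threshold.
    core_ok = sum(e > alpha for e in core_errors) / len(core_errors)
    return noncore_ok, core_ok


# Toy reconstruction errors, invented purely for illustration.
train_errors = [0.01, 0.02, 0.03, 0.04]
noncore_errors = [0.015, 0.028, 0.035, 0.045]
core_errors = [0.032, 0.050, 0.080, 0.120]

sweep = {
    k: accuracy_by_class(train_errors, noncore_errors, core_errors, k)
    for k in (0.0, 0.4, 1.0, 2.0)
}
```

In this toy data, lowering k keeps every core fucose spectrum above the threshold at the cost of misclassifying more non-core spectra, which mirrors the reasoning for preferring a small k when missing core fucose is the more costly error.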
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, it may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description covers only specific embodiments of the present invention and is not to be construed as limiting its scope of protection; all modifications, equivalents, and improvements made within the spirit and principles of the invention are intended to fall within the scope defined by the appended claims.
Claims (10)
1. A method for identifying core fucose, comprising the steps of:
step one, introducing characteristic ions;
step two, data preprocessing;
step three, training a model;
step four, calculating a threshold value;
and step five, identifying the core fucose.
2. The method for identifying core fucose as claimed in claim 1, wherein in the first step, the introducing of the characteristic ions comprises:
the pentasaccharide core is an inherent structure of N-glycans; theoretically, the Y ions generated by fragmentation of core fucose comprise 10 ions, which are called the characteristic ions for fucose identification, while the masses of the Y ions generated when fucose migrates to the pentasaccharide core differ from the masses of these 10 characteristic ions.
3. The method for identifying core fucose as claimed in claim 1, wherein in the second step, the data preprocessing comprises:
core fucose is absent from mouse tissues from which FUT8 has been removed; the mass spectrum data of these tissues are normalized, namely the abundance of each Y ion in the mass spectrum data is divided by the sum of the abundances of all ions, to obtain normalized mass spectrum data;
the relative abundances of the 10 characteristic ions in the normalized mass spectrum data are used as training data.
4. The method for identifying core fucose as claimed in claim 1, wherein in step three, the model training comprises:
training a non-core fucose autoencoder; the non-core fucose autoencoder is a 7-layer artificial neural network, the input layer and the output layer each comprise 10 artificial neurons, the hidden layers comprise 9, 8, 7, 8, and 9 artificial neurons from left to right, and the artificial neurons of adjacent layers are fully connected; the artificial neurons use the tanh activation function, the network is trained by gradient descent with the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the tolerance is 0.0000001.
5. The method for identifying core fucose as claimed in claim 1, wherein in step four, the threshold calculation comprises:
taking the normalized data set of non-core fucose mass spectra as $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(N)}\}$ and the autoencoder as $f_\phi \cdot g_\theta$, and calculating the reconstruction error of each $x^{(i)}$ in the data set $X$;
the threshold $\alpha$ is calculated as:
$\alpha = \mu + k \cdot \sigma$;
6. The method of identifying core fucose as claimed in claim 1, wherein the identification of core fucose in step five comprises:
recording the normalized data set of the mass spectrum data to be identified as $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(M)}\}$, and calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
7. A core fucose identification system using the method for identifying core fucose as claimed in any one of claims 1 to 6, wherein the core fucose identification system comprises:
a characteristic ion introduction module for introducing the characteristic ions for fucose identification;
a data preprocessing module for normalizing the mass spectrum data of mouse tissues from which FUT8 has been removed, in which core fucose is absent;
a model training module for training the non-core fucose autoencoder;
a threshold calculation module for calculating the mean and the variance of the reconstruction errors of $x^{(i)}$ in the data set $X$ and then determining the threshold;
a core fucose identification module for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of mouse tissues from which FUT8 has been removed, in which core fucose is absent, and taking the relative abundances of the 10 characteristic ions in the normalized mass spectrum data as training data; training a non-core fucose autoencoder; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of mouse tissues from which FUT8 has been removed, in which core fucose is absent, and taking the relative abundances of the 10 characteristic ions in the normalized mass spectrum data as training data; training a non-core fucose autoencoder; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
10. An information data processing terminal, characterized in that the information data processing terminal is configured to implement the core fucose identification system as claimed in claim 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111235011.4A CN114171130A (en) | 2021-10-22 | 2021-10-22 | Core fucose identification method, system, equipment, medium and terminal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114171130A true CN114171130A (en) | 2022-03-11 |
Family
ID=80477172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111235011.4A Pending CN114171130A (en) | 2021-10-22 | 2021-10-22 | Core fucose identification method, system, equipment, medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114171130A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||