CN114171130A - Core fucose identification method, system, equipment, medium and terminal - Google Patents


Publication number
CN114171130A
Authority
CN
China
Prior art keywords
core fucose
fucose
data
core
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111235011.4A
Other languages
Chinese (zh)
Inventor
张军英
苏远杰
刘继源
孙士生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202111235011.4A priority Critical patent/CN114171130A/en
Publication of CN114171130A publication Critical patent/CN114171130A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/20Identification of molecular entities, parts thereof or of chemical compositions
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02Column chromatography
    • G01N30/62Detectors specially adapted therefor
    • G01N30/72Mass spectrometers
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • General Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention belongs to the technical field of core fucose identification and discloses a method, system, device, medium, and terminal for identifying core fucose. The method comprises the following steps: introducing characteristic ions; preprocessing data; training a model; calculating a threshold; and identifying core fucose. Core fucose is absent from FUT8-knockout mouse tissues. The invention learns a representation of non-core fucose data with an autoencoder and treats core fucose identification as an anomaly detection problem, thereby avoiding the problem that, in the training data, non-core fucose spectra with fucose migration carry erroneous core fucose labels. The method is simple to operate, and the training data contain only non-core fucose data. Technically, 10 characteristic ions are introduced and their relative abundances are input to the autoencoder to conclude whether the mass spectrum to be identified is a core fucose mass spectrum. The invention can also distinguish core fucose from non-core fucose with fucose migration, and identification is fast.

Description

Core fucose identification method, system, equipment, medium and terminal
Technical Field
The invention belongs to the technical field of core fucose identification, and particularly relates to a core fucose identification method, a system, equipment, a medium and a terminal.
Background
Core fucosylation alters the secondary and tertiary conformations of glycoproteins and thereby plays important roles in tumor progression, immune regulation, and stem cell differentiation. It has been reported that the core fucose level of tumor tissues is elevated compared with normal tissues. Immunoglobulins bind to receptors on the surfaces of natural killer cells and macrophages to induce an immune response, and the affinity of an antibody for these receptors is reduced by 98% to 99% after core fucosylation of the N-glycans in the antibody. Furthermore, many core fucose modified glycoproteins may serve as important tumor biomarkers; for example, alpha-fetoprotein is an important biomarker for hepatocellular carcinoma.
An autoencoder is a neural network that learns, in an unsupervised manner, to reconstruct its input data. The algorithm rests on the following concepts:
1. Encoder: maps the input data to a code that characterizes the input data.
2. Decoder: maps the code to a reconstruction of the input data.
3. Reconstruction error: the Euclidean distance between the output data and the input data.
4. Threshold: calculated from the mean and standard deviation of the reconstruction errors of the training data.
The autoencoder algorithm learns a representation of the training data such that the reconstruction error of the training data is minimized.
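The concepts above can be illustrated with a deliberately tiny example. The sketch below is not the network of the invention: it assumes a hand-written two-dimensional "autoencoder" whose code is simply the mean of the two inputs, so that points on the line x1 = x2 are reconstructed exactly while points off that line are not. The function names are illustrative only.

```python
import math

def encode(x):
    """Encoder: map a 2-D input to a 1-D code (here, the mean of its coordinates)."""
    return (x[0] + x[1]) / 2.0

def decode(code):
    """Decoder: map the 1-D code back to a 2-D reconstruction."""
    return (code, code)

def reconstruction_error(x):
    """Euclidean distance between the input and its reconstruction."""
    return math.dist(x, decode(encode(x)))

typical = (1.0, 1.0)    # lies on the line this toy model can represent
atypical = (1.0, -1.0)  # lies off that line

print(reconstruction_error(typical))   # 0.0
print(reconstruction_error(atypical))  # sqrt(2), about 1.414
```

Data resembling what the model represents well yields a small reconstruction error, and data unlike it yields a large one, which is the property a threshold on the error exploits.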
Core fucose: a modification in which fucose is attached through an α1,6 linkage to the innermost N-acetylglucosamine of an N-glycan.
Fucose migration: the intramolecular migration of a terminal fucose unit to an adjacent or distant monosaccharide after activation (see FIG. 7). Fucose migration often produces misleading fragment ions, because the fucose residue is re-linked to a sterically adjacent or distant monosaccharide, which leads to erroneous mass spectral data.
At present, core fucose identification algorithms based on mass spectrometry mainly comprise the following:
1. Manual spectrum interpretation. The intrinsic characteristic peaks of core fucose are matched manually against the mass spectrum data generated by glycopeptides at the MS2 stage of the mass spectrometer, and a glycopeptide is identified when the number of matched peaks exceeds a threshold;
2. Machine learning. The identification of core fucose is treated as a binary classification problem: corresponding characteristic ions are selected, and a binary classification learning method is applied to solve the identification problem. For example, Heeyoun applied a Support Vector Machine (SVM) and a Deep Neural Network (DNN) to core fucose identification.
However, existing core fucose identification methods cannot distinguish core fucose from non-core fucose in which fucose migration has occurred. Because fucose migration is not taken into account, the training data may contain mislabeled data; that is, some of the mass spectra labeled as core fucose in the training set carry wrong labels, so a model trained on them is unreliable for core fucose identification. Developing methods for identifying the core fucose of proteins therefore helps to explain the function of protein core fucosylation and is also of great importance for discovering new biomarkers that can be used for the prognosis and diagnosis of cancer.
From the above analysis, the problems and defects of the prior art are as follows:
(1) Existing core fucose identification algorithms based on mass spectrometry cannot distinguish a core fucose mass spectrum from a non-core fucose mass spectrum with fucose migration.
(2) Because fucose migration is not taken into account, the training data may contain mislabeled data; that is, some of the mass spectra labeled as core fucose in the training set carry wrong labels, so the trained model is unreliable for core fucose identification.
The difficulty in solving these problems and defects is that only the existence of fucose migration is known, while the conditions under which it occurs, the migration positions, and its statistical properties are unknown, which makes core fucose identification highly challenging.
The significance of solving these problems and defects is as follows: core fucosylation changes the secondary and tertiary conformations of glycoproteins and thereby plays an important biological role in the occurrence, progression, and metastasis of tumors. Because of fucose migration, non-core fucose is easily identified as core fucose, which misleads the understanding of tumor development. High-quality identification of core fucose is therefore important for understanding the biological mechanisms of tumors.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a method, system, device, medium, and terminal for identifying core fucose, and in particular an autoencoder-based method, system, device, medium, and terminal for identifying core fucose. The aim is to solve the problem that existing mass-spectrometry-based core fucose identification algorithms cannot distinguish a core fucose mass spectrum from a non-core fucose mass spectrum with fucose migration.
The technical scheme of the invention is summarized as follows: core fucose is absent from FUT8-knockout mouse tissues; mass spectrum data of these tissues are acquired, characteristic ions are introduced, and the relative abundances of the characteristic ions are used to train an autoencoder model of non-core fucose; the identification of core fucose is then treated as an anomaly detection problem for this model.
The invention is realized in such a way that a method for identifying core fucose comprises the following steps:
step one, introducing characteristic ions;
step two, data preprocessing;
step three, training a model;
step four, calculating a threshold value;
and step five, identifying the core fucose.
Further, in step one, the introduction of characteristic ions includes:
The pentasaccharide core is a structure inherent to every N-glycan. Theoretically, therefore, the Y ions generated after fragmentation of core fucose include 10 ions, referred to as the characteristic ions for fucose identification, whereas the Y ions generated when fucose migrates to the pentasaccharide core have masses different from those of these 10 characteristic ions.
Further, in step two, the data preprocessing includes:
Core fucose is absent from the tissues of FUT8-knockout mice. The mass spectrum data of these tissues are normalized by dividing the abundance of each Y ion in a mass spectrum by the sum of the abundances of all Y ions in that spectrum, giving the normalized mass spectrum data:
$$\tilde{a}_j = \frac{a_j}{\sum_{l} a_l},$$
where $a_j$ is the abundance of the $j$-th Y ion and the sum runs over all Y ions of the spectrum.
The relative abundances of the 10 characteristic ions in the normalized mass spectrum data are used as training data.
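The preprocessing step can be sketched in a few lines. The sketch below assumes each spectrum is available as a mapping from a Y-ion label to its abundance; the ion labels, the helper names, and the choice to use 0.0 for a characteristic ion that is missing from a spectrum are all assumptions made for illustration and are not taken from the patent.

```python
def normalize_spectrum(y_ion_abundances):
    """Divide each Y-ion abundance by the sum over all Y ions in the spectrum."""
    total = sum(y_ion_abundances.values())
    if total <= 0:
        raise ValueError("spectrum has no Y-ion signal")
    return {ion: a / total for ion, a in y_ion_abundances.items()}

def feature_vector(y_ion_abundances, characteristic_ions):
    """Relative abundances of the characteristic ions (0.0 if an ion is absent)."""
    relative = normalize_spectrum(y_ion_abundances)
    return [relative.get(ion, 0.0) for ion in characteristic_ions]

# Hypothetical labels: the actual 10 characteristic ions are those of FIG. 4.
CHARACTERISTIC_IONS = [f"Y_char_{j}" for j in range(1, 11)]

# Made-up spectrum with two characteristic ions and one other Y ion.
spectrum = {"Y_char_1": 30.0, "Y_char_2": 10.0, "Y_other": 60.0}
x = feature_vector(spectrum, CHARACTERISTIC_IONS)
print(x[:3])  # [0.3, 0.1, 0.0]
```

Because the denominator covers all Y ions, not only the 10 characteristic ones, the 10 features of a spectrum generally do not sum to 1.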
Further, in step three, the model training includes:
Training a non-core fucose autoencoder. The non-core fucose autoencoder is a 7-layer artificial neural network: the input layer and the output layer each contain 10 artificial neurons, the five hidden layers contain, from left to right, 9, 8, 7, 8, and 9 artificial neurons, and the artificial neurons of adjacent layers are fully connected. The activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the tolerance is 0.0000001.
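As a quick check on the size of this architecture, the number of trainable parameters can be counted directly from the layer widths. The sketch below assumes each neuron carries one bias term in addition to its incoming weights; the patent does not state whether biases are used.

```python
LAYER_SIZES = [10, 9, 8, 7, 8, 9, 10]  # input, five hidden layers, output

def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per neuron
    return total

print(len(LAYER_SIZES))               # 7
print(count_parameters(LAYER_SIZES))  # 487
```

Under that assumption the network has fewer than 500 parameters, which is consistent with the fast identification speed the text describes.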
Further, in step four, the threshold calculation includes:
Denote the normalized data set of non-core fucose mass spectra as $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(N)}\}$ and the autoencoder as $f_\phi \circ g_\theta$. The reconstruction error of $x^{(i)}$ in the data set $X$ is denoted $e_X^{(i)}$:
$$e_X^{(i)} = \left\| x^{(i)} - f_\phi\left(g_\theta\left(x^{(i)}\right)\right) \right\|_2 .$$
The threshold $\alpha$ is calculated as:
$$\alpha = \mu + k \cdot \sigma ;$$
where $\mu$ is the mean of $\{e_X^{(i)}\}_{i=1}^{N}$, $\sigma$ is the standard deviation of $\{e_X^{(i)}\}_{i=1}^{N}$, and $k$ is a user parameter.
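The threshold rule is a one-line computation once the training reconstruction errors are available. The sketch below uses made-up error values and assumes the population standard deviation; the patent does not say whether the population or the sample form is intended.

```python
import statistics

def threshold(train_errors, k):
    """alpha = mu + k * sigma over the reconstruction errors of the training data."""
    mu = statistics.fmean(train_errors)
    sigma = statistics.pstdev(train_errors)  # assumption: population standard deviation
    return mu + k * sigma

errors = [0.10, 0.20, 0.30, 0.40, 0.50]  # made-up reconstruction errors
alpha = threshold(errors, k=0.4)
print(round(alpha, 4))  # 0.3566
```

A smaller k lowers the threshold, so more spectra are flagged as anomalous; a larger k raises it.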
Further, in step five, the identification of core fucose includes:
Denote the normalized data set of the mass spectrum data to be identified as $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(M)}\}$. The reconstruction error of $y^{(i)}$ in the data set $Y$ is denoted $e_Y^{(i)}$:
$$e_Y^{(i)} = \left\| y^{(i)} - f_\phi\left(g_\theta\left(y^{(i)}\right)\right) \right\|_2 .$$
If $e_Y^{(i)} > \alpha$, the $i$-th mass spectrum to be identified is identified as core fucose; if $e_Y^{(i)} \le \alpha$, the $i$-th mass spectrum to be identified is identified as non-core fucose.
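The decision rule of step five can be sketched as follows. The `toy_reconstruct` function is a hypothetical stand-in for a trained autoencoder, the spectra are made up, and assigning an error exactly equal to the threshold to the non-core class is an assumption.

```python
import math

def reconstruction_error(spectrum, reconstruct):
    """Euclidean distance between a feature vector and its reconstruction."""
    return math.dist(spectrum, reconstruct(spectrum))

def identify(spectra, reconstruct, alpha):
    """Label each spectrum by comparing its reconstruction error with alpha."""
    labels = []
    for y in spectra:
        err = reconstruction_error(y, reconstruct)
        labels.append("core fucose" if err > alpha else "non-core fucose")
    return labels

def toy_reconstruct(y):
    """Stand-in model: reproduces only the first coordinate of its input."""
    return [y[0]] + [0.0] * (len(y) - 1)

spectra = [[0.5, 0.0, 0.0], [0.2, 0.3, 0.4]]
print(identify(spectra, toy_reconstruct, alpha=0.1))
# ['non-core fucose', 'core fucose']
```

The first vector is reconstructed exactly (error 0), while the second has error 0.5 and is therefore flagged as an anomaly, that is, as core fucose.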
Another object of the present invention is to provide a core fucose identification system that applies the method for identifying core fucose, the core fucose identification system comprising:
a characteristic ion introduction module for introducing the characteristic ions for fucose identification;
a data preprocessing module for normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent;
a model training module for training the non-core fucose autoencoder;
a threshold calculation module for calculating the reconstruction error of each $x^{(i)}$ in the data set $X$;
a core fucose identification module for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
A further object of the invention is to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent; training a non-core fucose autoencoder using the relative abundances of the 10 characteristic ions of the normalized mass spectrum data as training data; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
Another object of the present invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent; training a non-core fucose autoencoder using the relative abundances of the 10 characteristic ions of the normalized mass spectrum data as training data; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
Another object of the present invention is to provide an information data processing terminal for implementing the core fucose identification system.
Taking all the above technical schemes together, the invention overcomes a technical prejudice in the field and fills a gap in the field. The advantages and positive effects of the invention are as follows: the core fucose identification method provided by the invention solves the problem that a core fucose mass spectrum cannot be distinguished from a non-core fucose mass spectrum with fucose migration. Core fucose is absent from FUT8-knockout mouse tissues; the invention learns a representation of the non-core fucose data with an autoencoder and treats core fucose identification as an anomaly detection problem, thereby avoiding the problem that, in the training data, non-core fucose spectra with fucose migration carry erroneous core fucose labels.
The invention first uses FUT8-knockout mouse tissues to obtain data containing only non-core fucose for training the non-core fucose model, so that the training data contain no core fucose data and no mislabeled non-core fucose data with fucose migration. Technically, the invention uses an autoencoder to learn the characteristics of non-core fucose data and treats core fucose identification as an anomaly detection problem; it introduces 10 characteristic ions, whose relative abundances are input to the autoencoder to conclude whether the mass spectrum to be identified is a core fucose mass spectrum. The method is simple to operate, and the training data contain only non-core fucose data. The invention can distinguish core fucose from non-core fucose with fucose migration, and identification is fast.
The invention aims to identify as many of the mass spectra that are actually core fucose as possible; that is, the identification result should contain as many core fucose mass spectra as possible while tolerating a small number of mass spectra that are actually non-core fucose. The parameter k is therefore set to 0.4, and the resulting identification accuracy is shown in Table 1. The model achieves high accuracy for both non-core fucose identification and core fucose identification, which indicates that the technology and system of the invention are reliable and effective for core fucose identification.
Table 1. Core fucose identification results based on the autoencoder (k = 0.4)
Spectrum type      Number of spectra   Accuracy
Non-core fucose    1199                89.86%
Core fucose        426                 98.83%
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for identifying core fucose according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a core fucose identification method provided in an embodiment of the present invention.
FIG. 3 is a block diagram of a core fucose identification system according to an embodiment of the present invention;
in the figure: 1. a characteristic ion introduction module; 2. a data preprocessing module; 3. a model training module; 4. a threshold calculation module; 5. a core fucose identification module.
FIG. 4 is a schematic diagram of characteristic ions for core fucose identification provided by embodiments of the present invention.
FIG. 5 is a schematic diagram of the parameters of the autoencoder model according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the influence of the parameter k on the identification accuracy according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of an example of fucose migration phenomenon provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a method, system, device, medium and terminal for identifying core fucose, and the present invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the method for identifying core fucose provided by the embodiment of the present invention includes the following steps:
S101, introducing characteristic ions;
S102, preprocessing data;
S103, training a model;
S104, calculating a threshold;
S105, identifying core fucose.
The schematic diagram of the core fucose identification method provided by the embodiment of the invention is shown in figure 2.
As shown in FIG. 3, the core fucose identification system provided by the embodiment of the invention comprises:
a characteristic ion introduction module 1 for introducing the characteristic ions for fucose identification;
a data preprocessing module 2 for normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent;
a model training module 3 for training the non-core fucose autoencoder;
a threshold calculation module 4 for calculating the reconstruction error of each $x^{(i)}$ in the data set $X$;
a core fucose identification module 5 for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
The technical solution of the present invention will be further described below with reference to the term explanation.
Autoencoder: a neural network that learns to reconstruct input data in an unsupervised manner;
Core fucose: an N-glycosylation modification.
The technical solution of the present invention is further described below with reference to specific examples.
Core fucose is absent from the tissues of FUT8-knockout mice. The method learns a representation of the non-core fucose mass spectrum data with an autoencoder and treats core fucose identification as an anomaly detection problem, thereby avoiding the problem that, in the training data, non-core fucose spectra with fucose migration carry erroneous core fucose labels.
The technical route of the invention is shown in figure 2.
The technical scheme of the invention is as follows:
(1) introduction of characteristic ions
The pentasaccharide core is a structure inherent to every N-glycan. Theoretically, therefore, the Y ions generated after fragmentation of core fucose include the 10 ions shown in FIG. 4, referred to as the characteristic ions for fucose identification, whereas the Y ions generated when fucose migrates to the pentasaccharide core often have masses different from those of these 10 characteristic ions.
(2) Data pre-processing
Core fucose is absent from the tissues of FUT8-knockout mice. The mass spectrum data of these tissues are normalized by dividing the abundance of each Y ion in a mass spectrum by the sum of the abundances of all Y ions in that spectrum, giving the normalized mass spectrum data:
$$\tilde{a}_j = \frac{a_j}{\sum_{l} a_l},$$
where $a_j$ is the abundance of the $j$-th Y ion and the sum runs over all Y ions of the spectrum.
The relative abundances of the above 10 characteristic ions in the normalized mass spectrum data are used as training data.
(3) Model training
The invention trains a non-core fucose autoencoder, which is a 7-layer artificial neural network. As shown in FIG. 5, the input layer and the output layer each contain 10 artificial neurons, the five hidden layers contain, from left to right, 9, 8, 7, 8, and 9 artificial neurons, and the artificial neurons of adjacent layers are fully connected. The activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the tolerance is 0.0000001.
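The hyperparameters listed above map naturally onto the parameters of scikit-learn's `MLPRegressor`, although the patent does not state which library was used. The sketch below is therefore only one possible realization, trained on synthetic stand-in data rather than real spectra. Note one difference from the description: `MLPRegressor` applies tanh to the hidden layers only and uses an identity activation on the output layer.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the normalized features: N spectra x 10 characteristic ions.
X = rng.random((200, 10)) * 0.1

autoencoder = MLPRegressor(
    hidden_layer_sizes=(9, 8, 7, 8, 9),  # five hidden layers between 10 inputs and 10 outputs
    activation="tanh",
    solver="adam",
    learning_rate_init=1e-4,
    max_iter=1000,
    tol=1e-7,
    random_state=0,
)
autoencoder.fit(X, X)  # an autoencoder is trained to reproduce its own input

reconstruction = autoencoder.predict(X)
errors = np.linalg.norm(X - reconstruction, axis=1)  # Euclidean reconstruction errors
print(reconstruction.shape, errors.shape)  # (200, 10) (200,)
```

With such a small learning rate the optimizer may stop at the iteration limit before the tolerance is reached, in which case scikit-learn emits a convergence warning; the fitted model is still usable for the sketch.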
(4) Threshold calculation
Denote the normalized data set of non-core fucose mass spectra as $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(N)}\}$ and the autoencoder as $f_\phi \circ g_\theta$. The reconstruction error of $x^{(i)}$ in the data set $X$ is denoted $e_X^{(i)}$:
$$e_X^{(i)} = \left\| x^{(i)} - f_\phi\left(g_\theta\left(x^{(i)}\right)\right) \right\|_2 .$$
The threshold $\alpha$ is calculated as:
$$\alpha = \mu + k \cdot \sigma \tag{3}$$
where $\mu$ is the mean of $\{e_X^{(i)}\}_{i=1}^{N}$, $\sigma$ is the standard deviation of $\{e_X^{(i)}\}_{i=1}^{N}$, and $k$ is a user parameter.
(5) Core fucose identification
Denote the normalized data set of the mass spectrum data to be identified as $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(M)}\}$. The reconstruction error of $y^{(i)}$ in the data set $Y$ is denoted $e_Y^{(i)}$:
$$e_Y^{(i)} = \left\| y^{(i)} - f_\phi\left(g_\theta\left(y^{(i)}\right)\right) \right\|_2 .$$
If $e_Y^{(i)} > \alpha$, the $i$-th mass spectrum to be identified is identified as core fucose; if $e_Y^{(i)} \le \alpha$, the $i$-th mass spectrum to be identified is identified as non-core fucose.
The invention first uses FUT8-knockout mouse tissues to obtain data containing only non-core fucose, so that the training data contain no core fucose data and no mislabeled non-core fucose data with fucose migration. Technically, the invention uses an autoencoder to learn the characteristics of non-core fucose data and treats core fucose identification as an anomaly detection problem; it introduces 10 characteristic ions, whose abundances are input to the autoencoder to conclude whether the mass spectrum to be identified is a core fucose mass spectrum. The method is simple to operate, and the training data contain only non-core fucose data. The invention can distinguish core fucose from non-core fucose with fucose migration, and identification is fast.
The technical solution of the present invention is further described below with reference to simulation experiments.
Experimental examples: the following examples are for illustrative purposes and are not intended to limit the scope of the present invention.
In the experiment, an autoencoder is trained on a training set of 18700 non-core fucose mass spectra, and the mass spectra to be identified comprise 1199 non-core fucose mass spectra and 426 high-mannose type core fucose mass spectra. FIG. 6 shows how the identification accuracy obtained on the mass spectra to be identified with the autoencoder trained on the training set varies with the value of the parameter k (dark: identification accuracy for core fucose; light: identification accuracy for non-core fucose).
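The dependence of the two accuracies on k can be sketched as a simple sweep. All reconstruction errors below are made-up numbers chosen only to show the trade-off; they are not the experimental data, and the population standard deviation is again an assumption.

```python
import statistics

def accuracies(train_errors, noncore_errors, core_errors, k):
    """Return (non-core accuracy, core accuracy) for the threshold alpha = mu + k * sigma."""
    alpha = statistics.fmean(train_errors) + k * statistics.pstdev(train_errors)
    noncore_acc = sum(e <= alpha for e in noncore_errors) / len(noncore_errors)
    core_acc = sum(e > alpha for e in core_errors) / len(core_errors)
    return noncore_acc, core_acc

train = [0.1, 0.2, 0.3, 0.4, 0.5]     # made-up training reconstruction errors
noncore = [0.15, 0.25, 0.35, 0.45]    # made-up errors of known non-core spectra
core = [0.33, 0.6, 0.8, 0.9]          # made-up errors of known core spectra

for k in (0.0, 0.4, 1.0, 2.0):
    n_acc, c_acc = accuracies(train, noncore, core, k)
    print(f"k={k}: non-core accuracy={n_acc:.2f}, core accuracy={c_acc:.2f}")
```

On these toy numbers, raising k increases the non-core accuracy and lowers the core accuracy, which is the kind of trade-off a plot of accuracy against k makes visible.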
Table 1. Core fucose identification results based on the autoencoder (k = 0.4)
Spectrum type      Number of spectra   Accuracy
Non-core fucose    1199                89.86%
Core fucose        426                 98.83%
The invention aims to identify as many of the mass spectra that are actually core fucose as possible; that is, the identification result should contain as many core fucose mass spectra as possible while tolerating a small number of mass spectra that are actually non-core fucose. The parameter k is therefore set to 0.4, and the resulting identification accuracy is shown in Table 1. The model achieves high accuracy for both non-core fucose identification and core fucose identification, which indicates that the technology and system of the invention are reliable and effective for core fucose identification.
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly in software, they may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)).
The above description only illustrates specific embodiments of the present invention and is not to be construed as limiting its scope; all modifications, equivalents, and improvements made within the spirit and principles of the invention fall within the scope defined by the appended claims.

Claims (10)

1. A method for identifying core fucose, comprising the steps of:
step one, introducing characteristic ions;
step two, data preprocessing;
step three, training a model;
step four, calculating a threshold value;
and step five, identifying the core fucose.
2. The method for identifying core fucose as claimed in claim 1, wherein in step one, the introduction of characteristic ions comprises:
the pentasaccharide core is a structure inherent to every N-glycan, so that theoretically the Y ions generated after fragmentation of core fucose include 10 ions, referred to as the characteristic ions for fucose identification, whereas the Y ions generated when fucose migrates to the pentasaccharide core have masses different from those of these 10 characteristic ions.
3. The method for identifying core fucose as claimed in claim 1, wherein in step two, the data preprocessing comprises:
core fucose is absent from the tissues of FUT8-knockout mice; the mass spectrum data of these tissues are normalized by dividing the abundance of each Y ion in a mass spectrum by the sum of the abundances of all the Y ions in that spectrum, giving the normalized mass spectrum data:
$$\tilde{a}_j = \frac{a_j}{\sum_{l} a_l},$$
where $a_j$ is the abundance of the $j$-th Y ion and the sum runs over all Y ions of the spectrum;
the relative abundances of the 10 characteristic ions in the normalized mass spectrum data are used as training data.
4. The method for identifying core fucose as claimed in claim 1, wherein in step three, the model training comprises:
training a non-core fucose autoencoder; the non-core fucose autoencoder is a 7-layer artificial neural network, the input layer and the output layer each contain 10 artificial neurons, the five hidden layers contain, from left to right, 9, 8, 7, 8, and 9 artificial neurons, and the artificial neurons of adjacent layers are fully connected; the activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the tolerance is 0.0000001.
5. The method for identifying core fucose as claimed in claim 1, wherein in step four, the threshold calculation comprises:
denoting the normalized data set of non-core fucose mass spectra as $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(N)}\}$ and the autoencoder as $f_\phi \cdot g_\theta$, the reconstruction error of each $x^{(i)}$ in the data set $X$ is calculated and recorded as
Figure FDA0003317191740000021
Figure FDA0003317191740000022
The threshold α is calculated as:
α=μ+k·σ;
wherein μ is the mean value of
Figure FDA0003317191740000023
Figure FDA0003317191740000024
k is a user parameter.
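The threshold rule α = μ + k·σ from claim 5 can be sketched as below. The reconstruction errors are hypothetical values, the function name is illustrative, and taking σ as the population standard deviation of the training reconstruction errors is an assumption, since the defining formula is only available as a figure.

```python
import statistics


def reconstruction_threshold(errors, k):
    """Return alpha = mu + k * sigma over the training reconstruction errors."""
    mu = statistics.mean(errors)
    sigma = statistics.pstdev(errors)  # assumption: population standard deviation
    return mu + k * sigma


# Hypothetical reconstruction errors of non-core fucose training spectra.
training_errors = [0.01, 0.02, 0.03]
alpha = reconstruction_threshold(training_errors, k=3.0)
```

A larger k raises the threshold, so fewer spectra would exceed it; the claim leaves the choice of k to the user.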
6. The method of identifying core fucose as claimed in claim 1, wherein the identification of core fucose in step five comprises:
denoting the normalized data set of the mass spectrum data to be identified as $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(M)}\}$, the reconstruction error of each $y^{(i)}$ in the data set $Y$ is calculated and recorded as
Figure FDA0003317191740000025
Figure FDA0003317191740000026
if
Figure FDA0003317191740000027
the i-th mass spectrum to be identified is identified as core fucose; if
Figure FDA0003317191740000028
the i-th mass spectrum to be identified is identified as non-core fucose.
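The decision step in claim 6 can be sketched as follows. Because the conditions themselves appear only as figures, several points here are assumptions: the reconstruction error is taken as a sum of squared differences, a spectrum whose error exceeds the threshold is labelled core fucose (it is poorly reconstructed by a model trained on non-core fucose data), and ties fall on the non-core side. The `toy_reconstruct` function is a stand-in for the trained autoencoder, and the input values are illustrative rather than real normalized spectra.

```python
def squared_error(spectrum, reconstruction):
    """Sum of squared differences (an assumed form of the reconstruction error)."""
    return sum((a - b) ** 2 for a, b in zip(spectrum, reconstruction))


def identify(spectra, reconstruct, alpha):
    """Label each spectrum by comparing its reconstruction error with alpha."""
    labels = []
    for y in spectra:
        error = squared_error(y, reconstruct(y))
        labels.append("core fucose" if error > alpha else "non-core fucose")
    return labels


# Stand-in for the trained autoencoder: it always returns a fixed profile.
def toy_reconstruct(y):
    return [0.1] * len(y)


spectra = [[0.1] * 10, [0.9] + [0.1] * 9]
labels = identify(spectra, toy_reconstruct, alpha=0.05)
```

The first toy spectrum is reconstructed exactly and stays below the threshold, while the second deviates strongly in one feature and exceeds it.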
7. A core fucose identification system using the method for identifying core fucose as claimed in any one of claims 1 to 6, wherein the core fucose identification system comprises:
a characteristic ion introduction module for introducing the characteristic ions for fucose identification;
a data preprocessing module for normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent;
a model training module for training the non-core fucose autoencoder;
a threshold calculation module for calculating the mean and variance of the reconstruction errors of $x^{(i)}$ in the data set $X$ and then determining the threshold;
a core fucose identification module for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent, and taking the relative abundances of the 10 characteristic ions in the normalized mass spectrum data as training data; training a non-core fucose autoencoder; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of FUT8-knockout mouse tissues, in which core fucose is absent, and taking the relative abundances of the 10 characteristic ions in the normalized mass spectrum data as training data; training a non-core fucose autoencoder; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
10. An information data processing terminal, characterized in that the information data processing terminal is configured to implement the core fucose identification system as claimed in claim 7.
CN202111235011.4A 2021-10-22 2021-10-22 Core fucose identification method, system, equipment, medium and terminal Pending CN114171130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111235011.4A CN114171130A (en) 2021-10-22 2021-10-22 Core fucose identification method, system, equipment, medium and terminal

Publications (1)

Publication Number Publication Date
CN114171130A true CN114171130A (en) 2022-03-11

Family

ID=80477172

Country Status (1)

Country Link
CN (1) CN114171130A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150160233A1 (en) * 2012-05-21 2015-06-11 Indiana University Research And Technology Corporation Identification and Quantification of Intact Glycopeptides in Complex Samples
US20180101643A1 (en) * 2015-05-18 2018-04-12 The Regents Of The University Of California Systems and Methods for Predicting Glycosylation on Proteins
WO2018223025A1 (en) * 2017-06-01 2018-12-06 Brandeis University System and method for determining glycan topology using tandem mass spectra
CN113383236A (en) * 2018-11-23 2021-09-10 新加坡科技研究局 Method for multi-attribute identification of unknown biological samples
US20200273545A1 (en) * 2019-02-22 2020-08-27 Board Of Regents Of The Nevada System Of Higher Education, On Behalf Of The University Of Nevada Computer-implemented methods and systems for identifying a species from mass spectra
CN110009706A (en) * 2019-03-06 2019-07-12 上海电力学院 A kind of digital cores reconstructing method based on deep-neural-network and transfer learning
CN113495094A (en) * 2020-04-01 2021-10-12 中国电信股份有限公司 Molecular mass spectrum model training method, molecular mass spectrum simulation method and computer
CN113484400A (en) * 2021-07-01 2021-10-08 上海交通大学 Mass spectrogram molecular formula calculation method based on machine learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
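YANG, Y et al.: "Prediction of glycopeptide fragment mass spectra by deep learning", NATURE COMMUNICATIONS, vol. 15, no. 1, 10 April 2024 (2024-04-10), pages 1 - 12 *
Qiao Yantao; Miao Jiazheng; Sun Shiwei; Liu Jingang; Bu Dongbo: "A survey of protein sequence identification techniques based on tandem mass spectrometry", Journal of Frontiers of Computer Science and Technology, no. 02, 15 February 2010 (2010-02-15), pages 5 - 15 *
Su Yuanjie: "Research on methods and algorithms for core fucose identification based on mass spectrometry data", China Master's Theses Full-text Database (electronic journal), no. 04, 31 December 2022 (2022-12-31), pages 006 - 232 *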

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination