CN111896609B - Method for analyzing mass spectrum data based on artificial intelligence - Google Patents
Method for analyzing mass spectrum data based on artificial intelligence
- Publication number
- CN111896609B (application CN202010707525.4A)
- Authority
- CN
- China
- Prior art keywords
- layer
- sample
- feature
- mass spectrum
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N27/00—Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
- G01N27/62—Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode
- G01N27/64—Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode using wave or particle radiation to ionise a gas, e.g. in an ionisation chamber
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Chemical & Material Sciences (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Electrochemistry (AREA)
- Physics & Mathematics (AREA)
- Toxicology (AREA)
- Analytical Chemistry (AREA)
- Biochemistry (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Immunology (AREA)
- Pathology (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Abstract
A method for analyzing mass spectrometry data based on artificial intelligence, the method comprising: collecting small-molecule metabolite fingerprint spectra of each sample with a laser-assisted desorption/ionization mass spectrometer; extracting absolute intensities from the fingerprint spectra; and inputting the processed data into a multi-layer neural network to group the samples. A method for calculating the importance of each feature's contribution to sample discrimination converts the fingerprint spectrum data into two-dimensional images, applies a saliency feature analysis method to the data in the resulting metabolite screening picture library, ranks all features, and screens out the substances that contribute most to sample discrimination. The beneficial effects of the invention are that mass spectrum data are grouped rapidly and the interpretability of the classification model is greatly improved.
Description
Technical Field
The invention belongs to the field of artificial-intelligence-assisted mass spectrum data mining, and in particular relates to an artificial intelligence analysis technique that uses metabolite fingerprint spectra obtained by mass spectrometry to construct a sample grouping model and to calculate the importance of features to the grouping.
Background
Mass spectrometry offers high-throughput analysis and the simultaneous detection of many metabolites, and it is the primary method for untargeted metabolite detection. However, its application is hampered by lengthy sample pretreatment, which is needed because biological samples are highly complex and contain metabolites at low abundance. With the development of nanotechnology, the recently developed nano-assisted laser desorption/ionization mass spectrometry (LDI-MS) has become the most practical tool for metabolic analysis owing to its high analysis throughput (300 samples/hour) and accurate metabolite identification (mass error <50 ppm).
Deep learning is mainly applied to the auxiliary analysis of large, high-dimensional data sets. It accepts many types of input data and has become a leading technology for analyzing medical data. As a frontier of machine learning, it is now a major analysis tool, widely used in many fields, including biomedicine, because it learns the regularities of the data by optimizing a loss function and mines latent features of the data. Traditional machine learning methods, however, are not well suited to analyzing and mining mass spectrum data: each sample has a very large number of features, so overfitting or underfitting can reduce accuracy. Furthermore, deep learning is a black box, and it is difficult to pick out the important features behind a score in order to explain the mechanism underlying a diagnosis.
The main method commonly used for feature selection in deep learning is the saliency map, which is mainly applied in the image field: by comparing the salient regions of the examined images, the regions that differ most are screened out quickly and intuitively. The technique has been extended to complex scene understanding in fields such as neuroscience, psychology, and medical diagnosis, but it has not yet been applied to the analysis of mass spectrometry data.
Disclosure of Invention
To address the long analysis time, high data dimensionality, and complex feature combinations encountered in mass spectrum data analysis, the invention provides a method that groups samples rapidly using a classification model built on an improved multi-layer neural network and that calculates the importance of each feature's contribution to the classification. The method is rapid, accurate, and efficient, and it greatly improves the interpretability of the classification model.
A method for analyzing mass spectrum data based on artificial intelligence uses a multi-layer neural network to analyze and process the mass spectrum data so as to realize grouping of samples;
the method comprises the following steps:
step 1: pipetting the sample onto a mass spectrometry target plate and drying it to form a thin layer for subsequent mass spectrometry analysis;
step 2: collecting the small-molecule metabolite fingerprint spectrum of each sample over the range m/z 100 to 1000 in positive ion mode with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;
step 3: extracting the absolute intensities of the raw metabolic fingerprint and applying centering pretreatment to the data extracted from all samples;
step 4: inputting the data from step 3 into the neural network and grouping the samples.
Further, in step 2, at least 2 independent experiments are performed on each sample to eliminate intra-individual deviation and improve the reproducibility and stability of the analysis.
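The centering pretreatment of step 3 can be illustrated with a minimal sketch. It assumes that "centering" means subtracting, for each m/z bin, the mean intensity across all samples; the text does not define the operation further, and the function name `center_features` and the toy intensities are made up for illustration.

```python
from statistics import fmean

def center_features(spectra):
    """Subtract the per-feature (per m/z bin) mean across samples.

    spectra: list of samples, each a list of absolute intensities of equal length.
    Returns the centered matrix and the per-feature means.
    """
    n_features = len(spectra[0])
    means = [fmean(sample[j] for sample in spectra) for j in range(n_features)]
    centered = [[sample[j] - means[j] for j in range(n_features)] for sample in spectra]
    return centered, means

# Toy data: three samples, four m/z bins (values are invented for illustration).
spectra = [
    [10.0, 0.0, 5.0, 2.0],
    [14.0, 2.0, 5.0, 4.0],
    [12.0, 4.0, 5.0, 0.0],
]
centered, means = center_features(spectra)
```

The means computed on the training samples would normally be stored and reused for the blind test set, so that both sets are shifted identically.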
Further, the multi-layer neural network comprises: a network input, a network main body, and a network output; the network main body comprises a feature extraction part, a nonlinear feature interaction layer and a classification layer; the network input is processed by the feature extraction part, the output of the feature extraction part is processed by the nonlinear feature interaction layer, the output of the nonlinear feature interaction layer is processed by the classification layer, and the output of the classification layer is the network output;
the principle formulas from the network input up to the classification layer are:
x_input=concatenate(x_spectral,x_ext) (1)
x_fs=feature_extract(x_input) (2)
x_nl=feature_interaction(x_fs) (3)
y_pred=softmax(x_nl) (4)
Further, the network input is a 1×1024-dimensional multi-modal feature (x_input) that includes the raw mass spectral data input (x_spectral); the remaining positions are filled with 0. Because the number of samples is limited, a simple scaling and centering is applied to all multi-modal features.
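As a rough sketch of how such an input vector might be assembled: the code below scales the spectral intensities to the range [-1, 1] given in the embodiment, appends any external features, and zero-pads to 1024 positions. Min-max scaling per vector, the assumption that `x_ext` is already scaled, and the names `scale_to_unit_range` and `build_input` are illustrative choices, not details taken from the patent.

```python
INPUT_DIM = 1024  # input width stated in the text

def scale_to_unit_range(values):
    """Linearly map values to [-1, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def build_input(x_spectral, x_ext=()):
    """Concatenate scaled spectral features with external ones, then zero-pad."""
    features = scale_to_unit_range(list(x_spectral)) + list(x_ext)
    if len(features) > INPUT_DIM:
        raise ValueError("more than 1024 features")
    return features + [0.0] * (INPUT_DIM - len(features))

# Three spectral intensities and one external feature (all values invented).
x_input = build_input([0.0, 50.0, 100.0], x_ext=[0.5])
```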
Further, the feature extraction part (feature_extract) is formed by stacking four LocallyConnected1D layers. Each LocallyConnected1D layer divides all features into 32 intervals and performs fully connected feature extraction on each interval separately (32 fully connected layers, each with its own parameters). This reflects the positional correlation of features in mass spectrum data and remains compatible with the final 32 external multi-modal features. Compared with a four-layer fully connected architecture, it models the feature extraction of mass spectrum data finely while reducing the network width and the number of parameters, which also indirectly reduces overfitting.
Further, the principle formula of the four-layer LocallyConnected1D stack is:
Further, the nonlinear feature interaction layer (feature_interaction) learns the nonlinear relations among the 96 hidden features produced by the feature extraction part. Each layer of the feature interaction part extracts discretized ReLU activation features and, at the same time, an approximate quadratic relation of linear combinations of the features; the two are fused through a residual connection or through combination so that nonlinear features are extracted better, and dropout regularization further alleviates overfitting and enhances generalization. The nonlinear feature interaction layer can be regarded as a novel self-attention mechanism suited to fully connected layers. It can fuse discrete and quadratic features and, compared with a multi-layer fully connected architecture, it strengthens nonlinearity while reducing the network width and the number of parameters, which helps reduce overfitting with limited samples and improves the final classification performance.
Further, the principle formula of the nonlinear feature interaction layer is:
Further, untargeted detection is performed on the metabolic fingerprints after sample pretreatment to obtain a database of related metabolites, a mapping between the grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.
The method further comprises:
step 11: converting the mass spectrum data into two-dimensional images and constructing a metabolite screening picture library;
step 12: computing saliency maps over the data in the metabolite screening picture library and ranking all features, so as to screen out the substances that contribute most to sample discrimination.
Further, the neural network is trained with the sample data: 3/4 of the training set data are randomly taken as a training group and 1/4 as a test group. 10-fold cross-validation training is performed on the training group samples based on the multi-layer neural network, and classification is realized by computing the average accuracy of the final model.
The invention has the following technical effects: the mass spectrum data are rapidly grouped, and the interpretability of the classification model is greatly improved.
Drawings
Fig. 1 is a schematic diagram of a neural network structure in one embodiment of the invention.
Detailed Description
The following description of the preferred embodiments of the present application will make the technical contents thereof more clear and easier to understand. This application may be embodied in many different forms of embodiments and the scope of protection is not limited to the embodiments set forth herein.
The conception, specific structure and technical effects of the present invention will be further described below to fully understand the objects, features and effects of the present invention, but the protection of the present invention is not limited thereto.
In one embodiment of the present invention, data of a sample to be inspected is prepared first, and the steps are as follows:
step 1: pipetting the sample onto a mass spectrometry target plate and drying it to form a thin layer for subsequent mass spectrometry analysis;
step 2: collecting the small-molecule metabolite fingerprint spectrum of each sample over the range m/z 100 to 1000 in positive ion mode with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure, and performing five independent experiments on each sample to eliminate intra-individual deviation and improve the repeatability and stability of the diagnosis result;
step 3: extracting the absolute intensities from the raw metabolic fingerprint (mass-to-charge ratio between 100 and 1000 m/z) and applying centering pretreatment to the data extracted from all samples for subsequent machine learning;
untargeted detection is performed on the metabolic fingerprints after sample pretreatment to obtain a database of related metabolites, a mapping between the grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.
In this embodiment, the neural network structure for processing data extracted from a sample is as follows:
The input to the network is a 1×1024-dimensional multi-modal feature (x_input) that includes the raw mass spectral data input (x_spectral); the remaining positions are filled with 0s. Because the number of samples is limited, a simple scaling and centering to (-1, 1) is applied to all features. The main body of the network has two parts: the feature extraction part (feature_extract), which receives the input directly, followed by the nonlinear feature interaction layer (feature_interaction). Finally, the 96 recombined features are fed into a Softmax classification layer, which outputs the class probabilities.
Principle formulas from the 1024-dimensional input to the Softmax layer:
x_input=concatenate(x_spectral,x_ext) (1)
x_fs=feature_extract(x_input) (2)
x_nl=feature_interaction(x_fs) (3)
y_pred=softmax(x_nl) (4)
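Formulas (1) to (4) describe a plain function composition, which the following sketch makes explicit. The stages `feature_extract` and `feature_interaction` are passed in as callables; the extra `classify` step, which maps the recombined features to one score per class before the softmax, is an assumption added here so that the output has one probability per class. The stand-in functions at the bottom are toys, not the patented layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x_spectral, x_ext, feature_extract, feature_interaction, classify):
    """Compose the four stages in the order given by formulas (1) to (4)."""
    x_input = list(x_spectral) + list(x_ext)   # (1) concatenate
    x_fs = feature_extract(x_input)            # (2) feature extraction
    x_nl = feature_interaction(x_fs)           # (3) nonlinear feature interaction
    return softmax(classify(x_nl))             # (4) class probabilities

# Toy stand-ins so the composition can be executed end to end.
identity = lambda v: v
two_class_scores = lambda v: [sum(v), -sum(v)]
y_pred = predict([1.0, 2.0], [0.0], identity, identity, two_class_scores)
```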
The feature extraction part (feature_extract) is formed by stacking four LocallyConnected1D layers. Each LocallyConnected1D layer divides all features into 32 intervals and performs fully connected feature extraction on each interval separately (32 fully connected layers, each with its own parameters). This reflects the positional correlation of features in mass spectrum data and remains compatible with the final 32 external multi-modal features. Compared with a four-layer fully connected architecture, it models the feature extraction of mass spectrum data finely while reducing the network width and the number of parameters, which also indirectly reduces overfitting.
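The idea of a locally connected stack can be sketched as follows. Only the 1024-wide input, the 32 intervals, the four layers and the 96 output features come from the text; the per-interval widths 32, 16, 8, 4, 3, the ReLU activation, the random initialisation and all function names are hypothetical choices made so that the sketch runs.

```python
import random

N_SEGMENTS = 32  # number of intervals stated in the text

def init_layer(seg_in, seg_out, rng):
    """One weight matrix and bias vector per segment (no parameter sharing)."""
    weights = [[[rng.uniform(-0.1, 0.1) for _ in range(seg_in)]
                for _ in range(seg_out)] for _ in range(N_SEGMENTS)]
    biases = [[0.0] * seg_out for _ in range(N_SEGMENTS)]
    return weights, biases

def locally_connected(x, layer):
    """Apply a separate small dense layer with ReLU to each contiguous segment."""
    weights, biases = layer
    seg_in = len(x) // N_SEGMENTS
    out = []
    for s in range(N_SEGMENTS):
        chunk = x[s * seg_in:(s + 1) * seg_in]
        for row, b in zip(weights[s], biases[s]):
            out.append(max(0.0, sum(w * v for w, v in zip(row, chunk)) + b))
    return out

def count_params(layer):
    """Total number of weights and biases in one locally connected layer."""
    weights, biases = layer
    return sum(len(row) for seg in weights for row in seg) + sum(len(b) for b in biases)

rng = random.Random(0)
# Hypothetical per-segment widths; the last one gives 32 * 3 = 96 features.
widths = [32, 16, 8, 4, 3]
layers = [init_layer(a, b, rng) for a, b in zip(widths, widths[1:])]

x = [rng.uniform(-1.0, 1.0) for _ in range(1024)]
for layer in layers:
    x = locally_connected(x, layer)
total_params = sum(count_params(layer) for layer in layers)
```

With these assumed widths the stack has 22,880 parameters, whereas four fully connected layers of the same total widths (1024, 512, 256, 128, 96) would need 701,408, which illustrates the reduction in parameter scale that the text describes.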
Principle formula of the four-layer LocallyConnected1D stack:
The nonlinear feature interaction layer (feature_interaction) learns the nonlinear relations among the 96 hidden features produced by the feature extraction part. Each layer of the feature interaction part extracts discretized ReLU activation features and, at the same time, an approximate quadratic relation of linear combinations of the features; the two are fused through a residual connection or through combination so that nonlinear features are extracted better, and dropout regularization further alleviates overfitting and enhances generalization. The nonlinear feature interaction layer can be regarded as a novel self-attention mechanism suited to fully connected layers. It can fuse discrete and quadratic features and, compared with a multi-layer fully connected architecture, it strengthens nonlinearity while reducing the network width and the number of parameters, which helps reduce overfitting with limited samples and improves the final classification performance.
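Because the formula of this layer is not reproduced in the text, the following is only one plausible reading of the description, not the patented layer: a linear projection feeds a ReLU branch and a squared branch, both are added to the input as a residual, and inverted dropout is applied during training. The function name, the way the branches are summed, and the dropout variant are all assumptions.

```python
import random

def interaction_layer(x, weights, bias, drop_rate=0.0, rng=None):
    """Fuse ReLU and squared linear projections with a residual connection."""
    h = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    relu_part = [max(0.0, z) for z in h]   # piecewise-linear ("discretized") branch
    quad_part = [z * z for z in h]         # quadratic branch of the same linear combination
    fused = [xi + r + q for xi, r, q in zip(x, relu_part, quad_part)]  # residual sum
    if drop_rate > 0.0 and rng is not None:
        keep = 1.0 - drop_rate
        # Inverted dropout: zero some units and rescale the survivors (training only).
        fused = [f / keep if rng.random() < keep else 0.0 for f in fused]
    return fused

# Tiny two-feature example with an identity projection; the real layer
# would map the 96 hidden features to 96 recombined features.
x = [1.0, -2.0]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
y = interaction_layer(x, weights, bias)
```

For the first feature the ReLU branch gives 1 and the squared branch gives 1, so the fused value is 1 + 1 + 1 = 3; for the second the ReLU branch is 0 and the squared branch is 4, so the fused value is -2 + 0 + 4 = 2.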
Principle formula of the nonlinear feature interaction layer:
The neural network is trained with the sample data: 3/4 of the training set data are randomly taken as a training group and 1/4 as a test group. 10-fold cross-validation training is performed on the training group samples based on the multi-layer neural network, and classification is realized by computing the average accuracy of the final model;
the trained network is then used to analyze the blind test set samples, and the accuracy of its predictions verifies that the grouping model based on the multi-layer neural network achieves accurate classification;
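The 3/4 to 1/4 split and the 10-fold cross-validation can be sketched with index bookkeeping alone. The sample count of 200, the fixed seed and the function names are illustrative assumptions; model fitting is left as a placeholder comment.

```python
import random

def split_train_test(indices, train_fraction=0.75, seed=0):
    """Randomly split sample indices into a training group and a test group."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train_group, test_group = split_train_test(range(200))
fold_sizes = []
for train_idx, val_idx in k_fold(train_group, k=10):
    # A real run would fit the network on train_idx and score it on val_idx;
    # here only the fold sizes are recorded.
    fold_sizes.append((len(train_idx), len(val_idx)))
```

In an actual run, the accuracy obtained on each validation fold would be averaged over the ten folds, which corresponds to the average accuracy mentioned in the text.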
The contribution of features to the classification model is then analyzed in the following steps:
step 11: converting the mass spectrum data into two-dimensional images and constructing a metabolite screening picture library;
step 12: computing saliency maps over the data in the metabolite screening picture library and ranking all features, so as to screen out the substances that contribute most to sample discrimination.
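Steps 11 and 12 can be illustrated with a small sketch. Saliency maps are normally computed from backpropagated gradients of a class score with respect to the input; to stay dependency-free, the code below substitutes a central finite difference, which approximates the same gradient. The 32×32 image layout, the toy score function and all names are assumptions, not details from the patent.

```python
def to_image(vector, side=32):
    """Reshape a flat feature vector of length side*side into a square 2-D grid."""
    if len(vector) != side * side:
        raise ValueError("vector length must equal side * side")
    return [vector[r * side:(r + 1) * side] for r in range(side)]

def saliency(score_fn, x, eps=1e-4):
    """Absolute central-difference gradient of a class score w.r.t. each feature."""
    grads = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + eps] + x[j + 1:]
        down = x[:j] + [x[j] - eps] + x[j + 1:]
        grads.append(abs(score_fn(up) - score_fn(down)) / (2 * eps))
    return grads

def rank_features(saliency_values, top_k=3):
    """Indices of the top_k most salient features, most important first."""
    order = sorted(range(len(saliency_values)),
                   key=lambda j: saliency_values[j], reverse=True)
    return order[:top_k]

# Toy class score that depends on three features only (stand-in for a trained network).
def toy_score(x):
    return 3.0 * x[5] - 2.0 * x[40] + 0.5 * x[7] * x[7]

x = [1.0] * 1024
sal = saliency(toy_score, x)
top = rank_features(sal, top_k=3)
saliency_image = to_image(sal)
```

The ranked indices would then be mapped back to their m/z values to identify the candidate metabolites that contribute most to sample discrimination.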
Preferred embodiments of the present application are described in detail above. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the present application by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the conception of the present application shall be within the scope of protection defined by the claims.
Claims (5)
1. A method for analyzing mass spectrum data based on artificial intelligence is characterized in that a multi-layer neural network is used for analyzing and processing the mass spectrum data to realize grouping of samples; the method comprises the following steps:
step 1: pipetting the sample onto a mass spectrometry target plate and drying it to form a thin layer for subsequent mass spectrometry analysis;
step 2: collecting the small-molecule metabolite fingerprint spectrum of each sample over the range m/z 100 to 1000 in positive ion mode with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;
step 3: extracting absolute intensities from the fingerprint and applying centering pretreatment to the extracted data;
step 4: inputting the data processed in step 3 into the multi-layer neural network for sample grouping;
the multi-layer neural network comprises: network input, network main body, network output; the network main body comprises a feature extraction part, a nonlinear feature interaction layer and a classification layer; the network input is processed by the characteristic extraction part, the output of the characteristic extraction part is processed by the nonlinear characteristic interaction layer, the output of the nonlinear characteristic interaction layer is processed by the classification layer, and the output of the classification layer is the network output;
the principle formulas from the network input up to the classification layer are:
x_input=concatenate(x_spectral,x_ext) (1)
x_fs=feature_extract(x_input) (2)
x_nl=feature_interaction(x_fs) (3)
y_pred=softmax(x_nl) (4);
the network input is a 1-1024-dimensional multi-modal feature, wherein the multi-modal feature comprises an original mass spectrum data input, the rest part is filled with 0, and all the features are simply scaled and centered based on the finite property of a sample;
the feature extraction part is formed by stacking four layers of local connection layers, and each local connection layer divides all features into 32 sections for full-connection feature extraction;
the principle formula of the four-layer local connection layer stack is as follows:
the nonlinear feature interaction layer learns nonlinear relations of 96 hidden features obtained by the feature extraction part, and obtains 96 recombined features after feature recombination, and finally inputs the 96 recombined features into the Softmax classification layer for classification probability output;
the principle formula of the nonlinear characteristic interaction layer is as follows:
2. The method for analyzing mass spectrum data based on artificial intelligence according to claim 1, wherein in step 2, at least 2 independent experiments are performed for each of the samples.
3. The method for analyzing mass spectrum data based on artificial intelligence according to claim 1, wherein the fingerprint is subjected to non-target detection to obtain a related metabolite database, a mapping relation between grouping information and the metabolic spectrogram is constructed, and a training set sample and a blind test set sample are divided.
4. The method for analyzing mass spectrum data based on artificial intelligence according to claim 3, wherein 3/4 of the data of the training set sample is used as a training set and 1/4 is used as a test set, 10-fold cross-validation training is performed on the training set sample based on the multi-layer neural network, and classification is achieved by counting accurate average values of a final model.
5. A method of analyzing mass spectrometry data based on artificial intelligence as claimed in claim 3, comprising the steps of:
step 11: converting the fingerprint spectrum data into a two-dimensional image, and constructing a metabolite screening picture library;
step 12: calculating the data in the metabolite screening picture library by using a saliency feature analysis method, ranking all the features, and screening out the substances that contribute most to sample discrimination.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010707525.4A CN111896609B (en) | 2020-07-21 | 2020-07-21 | Method for analyzing mass spectrum data based on artificial intelligence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010707525.4A CN111896609B (en) | 2020-07-21 | 2020-07-21 | Method for analyzing mass spectrum data based on artificial intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111896609A (en) | 2020-11-06 |
CN111896609B (en) | 2023-08-08 |
Family
ID=73190809
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010707525.4A Active CN111896609B (en) | 2020-07-21 | 2020-07-21 | Method for analyzing mass spectrum data based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111896609B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022266928A1 (en) * | 2021-06-24 | 2022-12-29 | 中山大学 | Metabolic characteristic spectrum inference method and system, and computer device and storage medium |
CN118169218A (en) * | 2024-05-14 | 2024-06-11 | 杭州臻稀生物科技有限公司 | Mass spectrometry system and method based on artificial intelligence and cloud technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004038602A1 (en) * | 2002-10-24 | 2004-05-06 | Warner-Lambert Company, Llc | Integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications |
CN102282559A (en) * | 2008-10-20 | 2011-12-14 | 诺丁汉特伦特大学 | Data analysis method and system |
CN111292801A (en) * | 2020-01-21 | 2020-06-16 | 西湖大学 | Method for evaluating thyroid nodule by combining protein mass spectrum with deep learning |
Non-Patent Citations (1)
Title |
---|
Size-selected Core-shell Nanoalloys for Laser Desorption/ionization Detection of Small Metabolites; Jing Cao et al.; IEEE; pp. 350-353 * |
Also Published As
Publication number | Publication date |
---|---|
CN111896609A (en) | 2020-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Verbeeck et al. | Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry | |
CN107766933B (en) | Visualization method for explaining convolutional neural network | |
Chatzidakis et al. | Towards calibration-invariant spectroscopy using deep learning | |
CN110110743B (en) | Automatic recognition system and method for seven-class mass spectrum | |
Hu et al. | Emerging computational methods in mass spectrometry imaging | |
CN113011357B (en) | Depth fake face video positioning method based on space-time fusion | |
CN110309867B (en) | Mixed gas identification method based on convolutional neural network | |
Zhao et al. | Interpretable deep learning-assisted laser-induced breakdown spectroscopy for brand classification of iron ores | |
CN111896609B (en) | Method for analyzing mass spectrum data based on artificial intelligence | |
CN112149758B (en) | Hyperspectral open set classification method based on Euclidean distance and deep learning | |
CN116363440B (en) | Deep learning-based identification and detection method and system for colored microplastic in soil | |
CN110579554A (en) | 3D mass spectrometric predictive classification | |
CN113554176B (en) | Metabolic profile inference method, system, computer device, and storage medium | |
Li et al. | MSSort-DIAXMBD: A deep learning classification tool of the peptide precursors quantified by OpenSWATH | |
CN112560925A (en) | Complex scene target detection data set construction method and system | |
CN109447009B (en) | Hyperspectral image classification method based on subspace nuclear norm regularization regression model | |
CN110992301A (en) | Gas contour identification method | |
CN116665039A (en) | Small sample target identification method based on two-stage causal intervention | |
CN105844297A (en) | Local spatial information-based encapsulation type hyperspectral band selection method | |
WO2022266928A1 (en) | Metabolic characteristic spectrum inference method and system, and computer device and storage medium | |
CN114141316A (en) | Method and system for predicting biological toxicity of organic matters based on spectrogram analysis | |
Roy et al. | Identification of fraudulent alteration by similar pen ink in handwritten bank cheque | |
Tung et al. | SIGMA: Spectral interpretation using gaussian mixtures and autoencoder | |
Martyna et al. | Hybrid Likelihood Ratio Models for Forensic Applications: a Novel Solution to Determine the Evidential Value of Physicochemical Data | |
CN109190713A (en) | The minimally invasive fast inspection technology of oophoroma based on serum mass spectrum adaptive sparse feature selecting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||