CN111896609B

CN111896609B - Method for analyzing mass spectrum data based on artificial intelligence

Info

Publication number: CN111896609B
Application number: CN202010707525.4A
Authority: CN
Inventors: 钱昆; 徐伟; 曹敬
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2023-08-08
Anticipated expiration: 2040-07-21
Also published as: CN111896609A

Abstract

A method for analyzing mass spectrometry data based on artificial intelligence, the method comprising: collecting small-molecule metabolite fingerprint spectra of each sample with a laser-assisted desorption/ionization mass spectrometer; extracting the absolute intensities from the fingerprint spectra; and inputting the processed data into a multi-layer neural network to group the samples. To calculate each feature's contribution to sample discrimination, the fingerprint spectrum data are converted into two-dimensional images to construct a metabolite screening picture library; the data in the library are then evaluated with a saliency analysis method, all features are ranked, and the substances contributing most to sample discrimination are screened out. The beneficial effects of the invention are that mass spectrum data are grouped rapidly and the interpretability of the classification model is greatly improved.

Description

Method for analyzing mass spectrum data based on artificial intelligence

Technical Field

The invention belongs to the field of artificial-intelligence-assisted mass spectrometry data mining, and in particular relates to an artificial intelligence analysis technique that uses metabolite fingerprint spectra obtained by mass spectrometry to construct a sample grouping model and to calculate the importance of each feature to the grouping.

Background

Mass spectrometry detection offers high-throughput analysis and simultaneous detection of many metabolites, and is the primary method for untargeted metabolite detection. However, its application is hampered by lengthy sample pre-treatment, the high complexity of biological samples, and the low abundance of metabolites in them. With the development of nanotechnology, recently developed nano-assisted laser desorption/ionization mass spectrometry (LDI-MS) has become the most practical tool for metabolic analysis owing to its high analysis throughput (300 samples/hour) and accurate metabolite identification (mass error <50 ppm).

Deep learning is mainly applied to assist the analysis of large, high-dimensional data sets. It accepts many types of input data and has become a leading-edge technology for analyzing medical data. As a frontier of machine learning, deep learning is now a major analysis tool and is widely used in many fields, including biomedicine, because it learns the regularities in data by minimizing a loss function and mines latent features of the data. Traditional machine learning methods, however, are not well suited to analyzing and mining mass spectrum data: each sample carries a very large number of features, so problems that reduce accuracy, such as overfitting and underfitting, can occur. Furthermore, deep learning is a black box, and it is difficult to select the important features from its output scores in order to explain the mechanism behind a diagnosis.

The main feature selection method commonly used in deep learning is the saliency map, which is applied mainly in the image field. By comparing salient regions in the examined images, it quickly and intuitively picks out the regions that differ most, and the technique has been extended to complex scene understanding in fields such as neuroscience, psychology and medical diagnosis. However, this saliency analysis method has not yet been applied to mass spectrometry data analysis.

Disclosure of Invention

To address the long analysis time, high data dimensionality and complex feature combinations encountered in mass spectrum data analysis, the invention provides a method that groups samples rapidly with a classification model built on an improved multi-layer neural network and calculates the importance of each feature's contribution to the classification. The method is rapid, accurate and efficient, and greatly improves the interpretability of the classification model.

A method for analyzing mass spectrum data based on artificial intelligence uses a multi-layer neural network to analyze and process the mass spectrum data so as to group the samples;

the method comprises the following steps:

step 1: pipetting the sample onto a mass spectrometry target plate and drying it into a thin layer for subsequent mass spectrometry analysis;

step 2: collecting the small-molecule metabolite fingerprint spectrum of each analysis sample in positive ion mode over the range 100 to 1000 m/z with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;

step 3: extracting the absolute intensities from the raw metabolic fingerprint spectrum, and applying centering pretreatment to the data extracted from all samples;

step 4: inputting the data from step 3 into a neural network to group the samples.
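Steps 2 and 3 can be pictured with a minimal sketch. It assumes that each spectrum is available as paired m/z and intensity arrays, that intensities are summed onto a common grid of unit-width bins (the binning scheme is an assumption, since the source does not state one), and that "centering" means subtracting the per-feature mean over all samples. The helper names `bin_spectrum` and `center` are hypothetical.

```python
import numpy as np

def bin_spectrum(mz, intensity, lo=100.0, hi=1000.0, n_bins=900):
    """Sum absolute intensities into equal-width m/z bins on [lo, hi)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    keep = (mz >= lo) & (mz < hi)          # discard peaks outside 100-1000 m/z
    edges = np.linspace(lo, hi, n_bins + 1)
    binned, _ = np.histogram(mz[keep], bins=edges, weights=intensity[keep])
    return binned                           # no smoothing is applied

def center(X):
    """Subtract the per-feature mean computed over all samples (rows)."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0, keepdims=True)

# Two toy spectra given as (m/z values, absolute intensities).
spectra = [
    ([150.2, 150.7, 999.5, 1200.0], [10.0, 5.0, 2.0, 99.0]),
    ([150.1, 400.0], [4.0, 8.0]),
]
X = np.stack([bin_spectrum(mz, it) for mz, it in spectra])
Xc = center(X)   # centered matrix that would be passed on to step 4
```

In this toy run the peak at 1200.0 m/z is dropped because it lies outside the acquisition window, and every column of `Xc` has zero mean.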

Further, in step 2, at least 2 independent experiments are performed on each sample to eliminate within-individual bias and improve the reproducibility and stability of the analysis.

Further, a multi-layer neural network comprising: network input, network main body, network output; the network main body comprises a feature extraction part, a nonlinear feature interaction layer and a classification layer; the network input is processed by the feature extraction part, the output of the feature extraction part is processed by the nonlinear feature interaction layer, the output of the nonlinear feature interaction layer is processed by the classification layer, and the output of the classification layer is the network output;

the principle formulas from the network input up to the classification layer are:

x_input = concatenate(x_spectral, x_ext) (1)

x_fs = feature_extract(x_input) (2)

x_nl = feature_interaction(x_fs) (3)

y_pred = softmax(x_nl) (4)
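The composition in formulas (1) to (4) can be sketched as follows. The two stage functions are passed in as placeholders, and the split of the 1024 inputs into 992 spectral and 32 external positions is an assumption made only for the demonstration. Formula (4) is followed literally, with softmax applied directly to x_nl; a practical classification layer would normally also include a learned projection from the features to class scores.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x_spectral, x_ext, feature_extract, feature_interaction):
    """Compose formulas (1) to (4); both stage functions come from the caller."""
    x_input = np.concatenate([x_spectral, x_ext], axis=-1)  # (1)
    x_fs = feature_extract(x_input)                         # (2)
    x_nl = feature_interaction(x_fs)                        # (3)
    return softmax(x_nl)                                    # (4)

# Placeholder stages: a random linear map to two classes, then identity.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 2))
y_pred = forward(np.ones(992), np.zeros(32), lambda x: x @ W, lambda h: h)
```

The result `y_pred` is a probability vector over the two hypothetical groups.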

Further, the network input is a 1 × 1024-dimensional multi-modal feature (x_input) that contains the raw mass spectrum data input (x_spectral), with the remaining positions filled with 0. Because the number of samples is limited, a simple scaling and centering is applied to all multi-modal features.
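A minimal sketch of building this input is given below. It assumes that "scaling and centering" is a per-feature min-max rescaling to the (-1, 1) range mentioned in the embodiment, and that features that are constant across samples (such as the zero-filled positions) are left at 0. The function names are hypothetical.

```python
import numpy as np

def build_input(x_spectral, total_dim=1024):
    """Zero-fill spectral vectors along the last axis up to total_dim features."""
    x_spectral = np.asarray(x_spectral, dtype=float)
    pad = total_dim - x_spectral.shape[-1]
    if pad < 0:
        raise ValueError("spectral vector is longer than the network input")
    x_ext = np.zeros(x_spectral.shape[:-1] + (pad,))
    return np.concatenate([x_spectral, x_ext], axis=-1)

def scale_to_unit_range(X, eps=1e-12):
    """Min-max scale each feature (column) into [-1, 1] across samples."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0, keepdims=True)
    hi = X.max(axis=0, keepdims=True)
    varying = (hi - lo) > eps
    span = np.where(varying, hi - lo, 1.0)       # avoid division by zero
    scaled = 2.0 * (X - lo) / span - 1.0
    return np.where(varying, scaled, 0.0)        # constant features stay at 0

raw = np.array([[0.0, 5.0, 10.0],
                [10.0, 5.0, 0.0]])               # two toy samples, three features
X_in = scale_to_unit_range(build_input(raw))     # shape (2, 1024)
```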

Further, the feature extraction part (feature_extract) is a stack of four LocallyConnected1D layers. Each LocallyConnected1D layer divides all features into 32 segments and performs fully connected feature extraction on each segment separately (32 fully connected layers, each with its own parameters). This reflects the positional correlation of features in mass spectrum data and remains compatible with the final 32 external multi-modal features. Compared with a four-layer fully connected architecture, it reduces the network width and parameter count while still modeling the feature extraction of mass spectrum data in fine detail, which also indirectly reduces overfitting.

Further, the principle formula of the four-layer LocallyConnected1D stack is:

Further, the nonlinear feature interaction layer (feature_interaction) learns the nonlinear relationships among the 96 hidden features produced by the feature extraction part. Each layer of the feature interaction part simultaneously extracts discretized ReLU activation features and an approximate quadratic relationship of linear feature combinations; these are merged, through a residual connection or by combination, into fused features that capture nonlinearity better, and dropout regularization further mitigates overfitting and improves generalization. The nonlinear feature interaction layer can be regarded as a novel self-attention mechanism suited to fully connected layers. It can fuse discrete and quadratic features, and compared with a multi-layer fully connected architecture it strengthens nonlinearity while reducing the network width and parameter count, which helps reduce overfitting with limited samples and improves the final classification performance.

Further, the principle formula of the nonlinear feature interaction layer:

Further, untargeted detection is performed on the metabolic fingerprint spectra after sample pretreatment to obtain a database of related metabolites, a mapping between grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.

The method further comprises:

step 11: converting the mass spectrum data into two-dimensional images and constructing a metabolite screening picture library;

step 12: evaluating the data in the metabolite screening picture library with the saliency map method (Saliency Maps), ranking all features, and screening out the substances that contribute most to sample discrimination.
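The ranking in step 12 can be illustrated with a small sketch. It assumes that saliency is the absolute gradient of a class score with respect to each input feature, estimated here by central finite differences so that no deep learning framework is needed; in practice the gradient would usually come from automatic differentiation. The linear score function is only a stand-in for the trained network, and the function names are hypothetical.

```python
import numpy as np

def saliency(score_fn, x, eps=1e-4):
    """Estimate |d score / d x_i| for every feature by central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (score_fn(x + step) - score_fn(x - step)) / (2.0 * eps)
    return np.abs(grad)

def rank_features(score_fn, X, top_k=5):
    """Average saliency over samples; return feature indices, most important first."""
    mean_sal = np.mean([saliency(score_fn, x) for x in X], axis=0)
    order = np.argsort(mean_sal)[::-1]
    return order[:top_k], mean_sal

# Stand-in class score with known feature weights (hypothetical model).
w = np.array([0.0, 3.0, -5.0, 0.5, 0.0, 1.0])
top, mean_sal = rank_features(lambda x: float(w @ x), np.ones((4, 6)), top_k=3)
```

For this stand-in the ranking follows the magnitude of the weights, so the third feature (index 2) comes first. Mapping a ranked feature index back to an m/z value, and from there to a metabolite, depends on how the spectra were laid out and is not shown.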

Further, the neural network is trained with the sample data: 3/4 of the training set data are randomly taken as a training group and 1/4 as a test group. 10-fold cross-validation training is performed on the training group samples with the multi-layer neural network, and classification is assessed by averaging the accuracy of the final model over the folds.

The invention has the following technical effects: the mass spectrum data are rapidly grouped, and the interpretability of the classification model is greatly improved.

Drawings

Fig. 1 is a schematic diagram of a neural network structure in one embodiment of the invention.

Detailed Description

The following description of the preferred embodiments of the present application will make the technical contents thereof more clear and easier to understand. This application may be embodied in many different forms of embodiments and the scope of protection is not limited to the embodiments set forth herein.

The conception, specific structure and technical effects of the present invention will be further described below to fully understand the objects, features and effects of the present invention, but the protection of the present invention is not limited thereto.

In one embodiment of the present invention, the data of the samples to be examined are first prepared as follows:

step 2: collecting the small-molecule metabolite fingerprint spectrum of each analysis sample in positive ion mode over the range 100 to 1000 m/z with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure, and performing five independent experiments on each sample to eliminate within-individual bias and improve the repeatability and stability of the diagnostic result;

step 3: extracting the absolute intensities from the raw metabolic fingerprint spectrum (mass-to-charge ratio between 100 and 1000 m/z), and applying centering pretreatment to the data extracted from all samples for subsequent machine learning;

untargeted detection is performed on the metabolic fingerprint spectra after sample pretreatment to obtain a database of related metabolites, a mapping between grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.

In this embodiment, the neural network structure for processing data extracted from a sample is as follows:

The input to the network is a 1 × 1024-dimensional multi-modal feature (x_input) that contains the raw mass spectrum data input (x_spectral), with the remaining positions filled with 0s. Because the number of samples is limited, a simple scaling and centering to (-1, 1) is applied to all features. The main body of the network has two parts: a feature extraction part (feature_extract), which receives the input directly, followed by a nonlinear feature interaction layer (feature_interaction). Finally, the 96 recombined features are fed into a Softmax classification layer, which outputs the class probabilities.

The principle formulas from the 1024-dimensional input to the Softmax layer are:

x_input = concatenate(x_spectral, x_ext) (1)

x_fs = feature_extract(x_input) (2)

x_nl = feature_interaction(x_fs) (3)

y_pred = softmax(x_nl) (4)

The feature extraction part (feature_extract) is a stack of four LocallyConnected1D layers. Each LocallyConnected1D layer divides all features into 32 segments and performs fully connected feature extraction on each segment separately (32 fully connected layers, each with its own parameters). This reflects the positional correlation of features in mass spectrum data and remains compatible with the final 32 external multi-modal features. Compared with a four-layer fully connected architecture, it reduces the network width and parameter count while still modeling the feature extraction of mass spectrum data in fine detail, which also indirectly reduces overfitting.
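The segment-wise structure can be sketched in NumPy without any deep learning framework. The 32 segments, the four stacked layers and the 96 output features come from the description; the per-segment layer widths (32, 16, 8, 4, 3) and the ReLU activation are assumptions chosen only so that the output has 32 × 3 = 96 features. All function names are hypothetical.

```python
import numpy as np

N_SEGMENTS = 32

def init_local_layer(rng, in_units, out_units):
    """Create one independent dense layer (weights and bias) per segment."""
    scale = np.sqrt(2.0 / in_units)
    W = rng.normal(scale=scale, size=(N_SEGMENTS, in_units, out_units))
    b = np.zeros((N_SEGMENTS, out_units))
    return W, b

def local_layer(x, W, b):
    """Apply segment s's own weights to segment s only, then ReLU (assumed)."""
    # x: (batch, segments, in_units) -> (batch, segments, out_units)
    z = np.einsum("bsi,sio->bso", x, W) + b
    return np.maximum(z, 0.0)

def feature_extract(x_input, layers):
    """Run the stacked locally connected layers on a (batch, 1024) input."""
    batch = x_input.shape[0]
    h = x_input.reshape(batch, N_SEGMENTS, -1)   # 1024 -> 32 segments of 32
    for W, b in layers:
        h = local_layer(h, W, b)
    return h.reshape(batch, -1)                  # 32 segments * 3 units = 96

rng = np.random.default_rng(0)
widths = [32, 16, 8, 4, 3]   # assumed per-segment widths, not from the source
layers = [init_local_layer(rng, i, o) for i, o in zip(widths[:-1], widths[1:])]
x_fs = feature_extract(rng.uniform(-1, 1, size=(5, 1024)), layers)
n_params = sum(W.size + b.size for W, b in layers)
```

With these assumed widths the whole stack has 22,880 parameters, whereas a single fully connected layer from 1024 to 512 units alone would need 524,800. Because each segment has its own weights, changing the inputs of one segment affects only that segment's three output features.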

The principle formula of the four-layer LocallyConnected1D stack is:

The nonlinear feature interaction layer (feature_interaction) learns the nonlinear relationships among the 96 hidden features produced by the feature extraction part. Each layer of the feature interaction part simultaneously extracts discretized ReLU activation features and an approximate quadratic relationship of linear feature combinations; these are merged, through a residual connection or by combination, into fused features that capture nonlinearity better, and dropout regularization further mitigates overfitting and improves generalization. The nonlinear feature interaction layer can be regarded as a novel self-attention mechanism suited to fully connected layers. It can fuse discrete and quadratic features, and compared with a multi-layer fully connected architecture it strengthens nonlinearity while reducing the network width and parameter count, which helps reduce overfitting with limited samples and improves the final classification performance.
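Since the layer's formula is not reproduced in this text, the following is only one plausible reading of the description, not the patented formula. It assumes a ReLU branch, a quadratic branch formed as the elementwise product of two linear combinations of the input, fusion of both branches with the input through a residual sum, and inverted dropout during training. Parameter names such as `W_u` and `W_v` are hypothetical.

```python
import numpy as np

def init_interaction(rng, dim=96):
    """Random parameters for one interaction layer (hypothetical layout)."""
    s = 1.0 / np.sqrt(dim)
    return {
        "W_relu": rng.normal(scale=s, size=(dim, dim)),
        "b_relu": np.zeros(dim),
        "W_u": rng.normal(scale=s, size=(dim, dim)),
        "W_v": rng.normal(scale=s, size=(dim, dim)),
    }

def feature_interaction(x, p, rng=None, drop_rate=0.0):
    """Fuse ReLU and quadratic features with a residual sum (assumed form)."""
    relu_branch = np.maximum(x @ p["W_relu"] + p["b_relu"], 0.0)
    quad_branch = (x @ p["W_u"]) * (x @ p["W_v"])   # product of two linear combinations
    fused = x + relu_branch + quad_branch           # residual fusion
    if rng is not None and drop_rate > 0.0:         # training-time dropout only
        keep = rng.random(fused.shape) >= drop_rate
        fused = fused * keep / (1.0 - drop_rate)    # inverted dropout scaling
    return fused

rng = np.random.default_rng(0)
params = init_interaction(rng)
x_fs = rng.normal(size=(4, 96))                     # stand-in for extracted features
x_nl = feature_interaction(x_fs, params)            # inference: no dropout
```

The quadratic branch is even in its input while the residual term is odd, which is one way to see that the layer mixes first-order and second-order information about the 96 features.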

The principle formula of the nonlinear feature interaction layer is:

The neural network is trained with the sample data: 3/4 of the training set data are randomly taken as a training group and 1/4 as a test group. 10-fold cross-validation training is performed on the training group samples with the multi-layer neural network, and classification is assessed by averaging the accuracy of the final model over the folds.
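The splitting and cross-validation procedure can be sketched as follows. The nearest-centroid classifier is only a lightweight stand-in for the multi-layer neural network, and the toy data, the random seed and all function names are hypothetical.

```python
import numpy as np

def train_test_split_indices(n, rng, train_frac=0.75):
    """Randomly assign 3/4 of the indices to training and 1/4 to testing."""
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    return perm[:n_train], perm[n_train:]

def k_fold_indices(indices, k=10):
    """Yield (fit_idx, val_idx) pairs; every index is validated exactly once."""
    folds = np.array_split(indices, k)
    for i in range(k):
        fit = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield fit, folds[i]

def cross_val_accuracy(X, y, indices, fit_fn, predict_fn, k=10):
    """Mean validation accuracy over the k folds."""
    accs = []
    for fit_idx, val_idx in k_fold_indices(indices, k):
        model = fit_fn(X[fit_idx], y[fit_idx])
        accs.append(np.mean(predict_fn(model, X[val_idx]) == y[val_idx]))
    return float(np.mean(accs))

# Stand-in classifier (NOT the patent's network): nearest class centroid.
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroids(model, X):
    classes = np.array(sorted(model))
    dist = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes], axis=1)
    return classes[dist.argmin(axis=1)]

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)                                   # two toy groups
X = (10.0 * y + rng.uniform(-1, 1, size=80)).reshape(-1, 1)
train_idx, test_idx = train_test_split_indices(len(y), rng)
cv_acc = cross_val_accuracy(X, y, train_idx, fit_centroids, predict_centroids)
```

The held-out `test_idx` samples are not touched during cross-validation, so they remain available for a final check of the selected model.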

The trained network is then used to analyze the blind test set samples, and the prediction accuracy verifies that the grouping model based on the multi-layer neural network classifies accurately.

The contribution of features to the classification model is further analyzed in the following steps:

step 11: converting the mass spectrum data into two-dimensional images and constructing a metabolite screening picture library;
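The source does not state how a spectrum is laid out as an image. One simple possibility, shown here purely as an assumption, is to reshape the 1024-dimensional feature vector row by row into a 32 × 32 grayscale image, keeping a way to map a pixel back to its feature index. The function names are hypothetical.

```python
import numpy as np

def spectrum_to_image(x, side=32):
    """Rescale a feature vector to 0-255 and reshape it row-major into an image."""
    x = np.asarray(x, dtype=float)
    if x.size != side * side:
        raise ValueError("feature vector does not fill a side x side image")
    lo, hi = x.min(), x.max()
    scaled = np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8).reshape(side, side)

def pixel_to_feature(row, col, side=32):
    """Map an image position back to the index of the original feature."""
    return row * side + col

img = spectrum_to_image(np.arange(1024))   # toy spectrum with rising intensity
library = [img]                            # the picture library is a collection of such images
```

Keeping the pixel-to-feature mapping explicit matters because a salient pixel is only useful once it can be traced back to a feature and, through it, to a metabolite.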

Preferred embodiments of the present application are described in detail above. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the present application by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the conception of the present application shall be within the scope of protection defined by the claims.

Claims

1. A method for analyzing mass spectrum data based on artificial intelligence, characterized in that a multi-layer neural network is used to analyze and process the mass spectrum data so as to group samples; the method comprises the following steps:

step 2: collecting the small-molecule metabolite fingerprint spectrum of each sample in positive ion mode over the range 100 to 1000 m/z with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;

step 3: extracting the absolute intensities from the fingerprint spectrum, and applying centering pretreatment to the extracted data;

step 4: inputting the data processed in step 3 into the multi-layer neural network to group the samples;

the multi-layer neural network comprises: network input, network main body, network output; the network main body comprises a feature extraction part, a nonlinear feature interaction layer and a classification layer; the network input is processed by the feature extraction part, the output of the feature extraction part is processed by the nonlinear feature interaction layer, the output of the nonlinear feature interaction layer is processed by the classification layer, and the output of the classification layer is the network output;

x_input = concatenate(x_spectral, x_ext) (1)

x_fs = feature_extract(x_input) (2)

x_nl = feature_interaction(x_fs) (3)

y_pred = softmax(x_nl) (4);

the network input is a 1 × 1024-dimensional multi-modal feature, wherein the multi-modal feature comprises the raw mass spectrum data input and the remaining positions are filled with 0, and all features are simply scaled and centered because the number of samples is limited;

the feature extraction part is a stack of four locally connected layers, and each locally connected layer divides all features into 32 segments for fully connected feature extraction;

the principle formula of the four-layer locally connected stack is as follows:

the nonlinear feature interaction layer learns the nonlinear relationships among the 96 hidden features obtained by the feature extraction part, obtains 96 recombined features after feature recombination, and finally inputs the 96 recombined features into the Softmax classification layer, which outputs the class probabilities;

the principle formula of the nonlinear feature interaction layer is as follows:

2. The method for analyzing mass spectrum data based on artificial intelligence according to claim 1, wherein in step 2, at least 2 independent experiments are performed on each sample.

3. The method for analyzing mass spectrum data based on artificial intelligence according to claim 1, wherein untargeted detection is performed on the fingerprint spectra to obtain a database of related metabolites, a mapping between grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.

4. The method for analyzing mass spectrum data based on artificial intelligence according to claim 3, wherein 3/4 of the training set data are used as a training group and 1/4 as a test group, 10-fold cross-validation training is performed on the training group samples with the multi-layer neural network, and classification is assessed by averaging the accuracy of the final model over the folds.

5. The method for analyzing mass spectrum data based on artificial intelligence according to claim 3, further comprising the following steps:

step 11: converting the fingerprint spectrum data into two-dimensional images and constructing a metabolite screening picture library;

step 12: evaluating the data in the metabolite screening picture library with a saliency feature analysis method, ranking all features, and screening out the substances that contribute most to sample discrimination.