CN115862749A

CN115862749A - Mass spectrum data qualitative method based on Transformer

Info

Publication number: CN115862749A
Application number: CN202211548308.0A
Authority: CN
Inventors: 崔球; 刘欢; 崔天伦; 李世铭; 祁宽; 李敏怡; 王浩然; 王一岚
Original assignee: Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Current assignee: Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Priority date: 2022-12-05
Filing date: 2022-12-05
Publication date: 2023-03-28

Abstract

The invention provides a Transformer-based qualitative method for mass spectrum data. The method comprises: collecting primary high-resolution mass spectrum data and constructing a basic data set, wherein the data of the data set comprise isotope distribution data, mass-to-charge ratio data and abundance data of compound ions; randomly sampling the data set to construct a training set, a verification set and a test set; performing deep data processing on the training set data to convert it into an index data format, training the model, and selecting the optimal model as the deep learning model; and verifying the deep learning model with the verification set data and adjusting the parameters of the optimal model. Compared with the traditional mass-to-charge ratio-database matching method, the Transformer-based qualitative method for primary high-resolution mass spectrometry data can quickly obtain the chemical formula of the analyte, with short analysis time and high efficiency.

Description

Mass spectrum data qualitative method based on Transformer

Technical Field

The invention belongs to the technical field of organic molecule mass spectrum research, and particularly relates to a mass spectrum data qualitative method based on a Transformer.

Background

Mass spectrometry is a modern analytical technique that detects compounds by detecting gas-phase ions, and is widely used in practice because of its high specificity, flexibility and universality. Accurate mass measurement is a routine experiment on modern mass spectrometers. To characterize an analyte, the exact mass-to-charge ratio is first obtained by high-resolution mass spectrometry, and a database is then searched for a list of matching elemental compositions and corresponding molecular formulae. However, identifying an analyte by matching the peak mass-to-charge ratio against a database often yields a series of candidate molecular formulae for the analyte (up to several tens). In addition, when a large number of complex samples are analyzed, a primary mass spectrum contains data for the mixed analytes of each sample, and the traditional mass-to-charge ratio-database matching method cannot meet the requirement of high-throughput mass spectrometric analysis of large numbers of samples.

Machine learning techniques, especially deep learning, have been applied to spectrum analysis, making purely data-driven spectrum analysis possible.

One of the best-known machine learning methods is MS2PIP, which was originally built on random forests and later improved with the XGBoost algorithm. The method can perform qualitative analysis of mass spectra. However, it relies on database-assisted analysis, and its accuracy can drop significantly for unknown spectra.

The Transformer network architecture is already dominant in natural language processing and outperforms other methods on many tasks such as machine translation and text generation. More and more researchers are now trying to apply the powerful modeling capability of the Transformer model to the natural sciences.

Disclosure of Invention

The invention aims to provide a mass spectrum data qualitative method based on a Transformer to solve the problem of high-throughput and high-precision first-order mass spectrum qualitative analysis.

In order to achieve the purpose, the invention adopts the technical scheme that:

A Transformer-based qualitative method for mass spectrum data comprises the following steps:

S1: Data collection step: collecting primary high-resolution mass spectrum data and constructing a basic data set, wherein the data of the data set comprise isotope distribution data, mass-to-charge ratio data and abundance data of compound ions;

S2: Data set classification step: randomly sampling the data of the data set to construct a training set, a verification set and a test set;

S3: Model training step: training the model with the training set data to obtain a deep learning model; the model comprises an Embedding layer, a position Embedding layer, a multi-head attention layer, a LayerNorm layer, a Linear layer, an encoder layer and a decoder layer; deep data processing is performed on the training set data to convert it into an index data format, model training is carried out, and the optimal model is selected as the deep learning model;

S4: Model verification step: verifying the deep learning model with the verification set data and adjusting the parameters of the deep learning model, wherein the model verification comprises: performing deep data processing on the mass spectrum data of the verification set, inputting the isotope data of the data set into the encoder layer for encoding, and inputting the encoded data into the decoder layer; and performing deep processing on the real molecular formula data, inputting the processed real molecular formula data into the decoder layer, applying greedy search to the decoded data to find the final result, and converting the final result into a molecular formula.

In some embodiments of the present invention, the method further comprises a model test step S5:

performing deep data processing on the data in the test set, inputting the data into the encoder for encoding, obtaining an inference result through the decoder, and testing the accuracy of the verified model.

In some embodiments of the present invention, the data collecting step further includes a data preprocessing step, and the data preprocessing method includes:

converting the high-resolution mass spectrum data into csv format, using the csv file as the basic data set for model training, and storing the isotope data in character format.

In some embodiments of the present invention, the data set is subjected to deep processing to obtain a mass spectrometry data index sequence, and the deep processing includes:

converting the isotope data in the csv format into data in the FloatTensor format, and splitting the molecular formula into a list taking elements and the number of the elements as a unit by adopting a token function;

constructing a target dictionary and an input dictionary to perform index mapping on the split data;

the input dictionary includes:

molecular mass identifier: the mass of each molecular formula in the mass spectral data set;

ion relative abundance identifier: data representing the relative abundance of each ion of each molecular formula in the mass spectral data set, the relative abundance being in the range of 0-100;

the target dictionary includes:

sequence start flag and sequence end flag: used respectively to mark the start and the end of the index sequence corresponding to a given molecular formula;

molecular formula padding identifier: because the numbers of elements differ, the molecular formulas differ in length, while fixed-length molecular formulas are required during training, so all molecular formulas are padded to the same length;

element identifier: used to represent the element types in a given molecular formula;

element quantity identifier: used to represent the number of each element in a given molecular formula;

and performing model training by adopting the data after deep processing.

In some embodiments of the present invention, in step S4, greedy search is applied to the decoded data to find the final result, which is converted into a molecular formula according to the target dictionary.

In some embodiments of the invention, mark embedding and position embedding are generated for the different index mark segments in the deep-processed mass spectrum data index sequence, and the result is then input into the encoder layer.

The position embedding algorithm includes:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

wherein: PE_(pos,2i) and PE_(pos,2i+1) denote the position embedding values, pos is the relative position of a character in the mass spectrum data index mark segment, d_model is the manually specified dimension of the layer's output vector, 2i is an even dimension of the vector and 2i+1 is an odd dimension of the vector.

In some embodiments of the invention, during the model training process, the model error is calculated, back propagation is performed according to the error calculation result, and the weight of the model is updated.

In some embodiments of the present invention, the data volume ratio of the training set, the validation set, and the test set is 7:2:1.

The qualitative method of mass spectrum data provided by the invention has the following beneficial effects:

(1) Compared with the traditional mass-to-charge ratio-database matching method, the primary high-resolution mass spectrometry data qualitative method based on the Transformer can quickly acquire the chemical formula of the analyte, and is short in analysis time and high in efficiency.

(2) The method does not need to rely on a database, obtains the chemical formula of the analyte based on the mass spectrum data through a model training analysis method, and can still give a relatively accurate result under the condition that the database cannot be searched.

(3) In terms of efficiency and cost, the method does not require extensive searching, consumes little computation, has a short computation time, and supports high-throughput data analysis. Qualitative analysis of mass spectrum data can be completed on a 3080 Ti GPU, with a computation time of about 0.12 s per piece of data. With parallel computing, large batches of data can be processed in a short time.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for qualitative data analysis of high resolution mass spectrometry according to the present invention;

FIG. 2 is the isotope distribution mass spectrum of C₅₃H₁₀₃O₆;

FIG. 3 is a schematic diagram of a Transformer model architecture.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and do not limit it.

The invention provides a mass spectrum data qualitative method based on a Transformer, which can be used for quickly analyzing the chemical formula composition of a chemical substance based on primary high-resolution mass spectrum data of the chemical substance to be analyzed.

The qualitative method of mass spectrum data provided by the invention comprises the following steps; the overall flow is shown in FIG. 1.

S1: Data collection step.

Primary high-resolution mass spectral data are collected and a basic data set is constructed, wherein the data of the data set comprise isotope distribution data of compound ions.

The constructed data set is used as a public data set for subsequent model training, model verification and model testing. The data set includes important information such as the common name, systematic name, exact mass and molecular formula of each compound, together with the ion mass-to-charge ratios and corresponding relative abundances, the ion molecular formula, and the isotope distribution. The isotope distribution of a compound comprises the exact mass-to-charge ratios and corresponding relative abundance values of the ion peak M and its isotope peaks M+1, M+2, M+3 and M+4; the number of atoms of each element can be estimated from the ratios of these ion peaks.
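As a rough illustration of estimating an element count from peak ratios, the sketch below applies the common rule of thumb that each carbon atom contributes about 1.1% to the M+1 peak relative to M. This is an assumption added for illustration and is not the model used by the method; the name `estimate_carbon_count` and the constant are hypothetical, and contributions from hydrogen, oxygen and other elements are ignored, so the estimate tends to run high.

```python
# Approximate 13C/12C natural abundance ratio (assumed constant, about 1.1%).
C13_TO_C12_RATIO = 0.0108


def estimate_carbon_count(abundance_m, abundance_m1):
    """Estimate the number of carbon atoms from the M and M+1 abundances.

    Ignores M+1 contributions from other elements (H, N, O, ...),
    so the result is only a coarse, upward-biased estimate.
    """
    if abundance_m <= 0:
        raise ValueError("abundance of the M peak must be positive")
    return (abundance_m1 / abundance_m) / C13_TO_C12_RATIO


# Example using the M and M+1 abundances listed for C53H103O6 in Table 4.
estimate = estimate_carbon_count(100.0, 60.7610)
print(round(estimate))
```

For the Table 4 values this gives roughly 56 carbons against the true 53, which shows why a learned model over the full isotope pattern is preferable to a single-ratio rule.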

Because mass spectrometry data includes information on molecular weight and elemental composition, molecular building block information, interconnection order information, and the like, based on analysis of the mass spectrometry data, a compound composition result can be obtained: including a list of elemental compositions and corresponding molecular formulas.

The data set is subjected to preliminary processing, which comprises:

(1) Isotope distribution data of the compound ions are extracted, and records with missing data are deleted.

(2) To facilitate data reading during model training, the data are converted into csv format, and the resulting csv file is used as the basic data set for model training. The csv file stores the isotope data in character format.

S2: Data set classification step.

Randomly sampling the data of the public data set, and respectively constructing a training set, a verification set and a test set; the training set is used for training a qualitative model of mass spectrum data, the verification set is used for verifying the accuracy of the trained model, and the test set is used for testing whether the model meets the training requirements.

The data amount of the training set, the verification set and the test set is not limited, and in some embodiments, the ratio of the data amount of the training set, the data amount of the verification set and the data amount of the test set is 7:2:1.
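The random 7:2:1 split can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the fixed seed and the choice to give the rounding remainder to the test set are assumptions, not details from the source.

```python
import random


def split_dataset(records, ratios=(7, 2, 1), seed=0):
    """Shuffle records and split them into train/verification/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # any rounding remainder goes here
    return train, val, test


# Example with 1000 placeholder records.
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```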

S3: Model training step (the flow is shown in FIG. 3).

And training the model by adopting the training set data to obtain a deep learning model.

The model to be trained is a Transformer-based qualitative model for primary high-resolution mass spectrum data, and specifically comprises: an Embedding layer, a position Embedding layer, a multi-head attention layer, a LayerNorm layer, a Linear layer, 12 encoder layers in total and 6 decoder layers.

Before model training, the data of the data set undergo further deep processing, which comprises the following steps.

S31: Data index mapping.

The isotope data in csv format are converted into FloatTensor format, and a token function splits each molecular formula into a list whose units are the elements and the element counts.

An input dictionary and a target dictionary are constructed, and the split data are index-mapped with these dictionaries into an index sequence;

the input dictionary includes:

ion relative abundance identifier: data representing the relative abundance of each ion of each molecular formula in the mass spectral data set, the relative abundance being in the range of 0-100.

Table 1 input dictionary identification example


The target dictionary includes:

sequence start flag and sequence end flag: used respectively to mark the start and the end of the index sequence corresponding to a given molecular formula;

molecular formula padding identifier: because the numbers of elements differ, the molecular formulas differ in length, while fixed-length molecular formulas are required during training, so all molecular formulas are padded to the same length;

element quantity identifier: used to represent the number of each element in a given molecular formula.

In some embodiments of the present invention, the target dictionary may use numbers to represent the index mapping. For example, for C₄₆H₈₆O₁₀, the following index mapping table is designed:

TABLE 2 target dictionary identification example

Thus, for the input data:

798.6, 799.6, 800.6, 801.6, 0.0, 100, 52, 15, 3, 0; C46H86O10;

the mass spectrum data and the real molecular formula data are each mapped according to the above rules, and the mapped index data have the following format:

tensor([4,5,6,7,8,9,10,11,12,13]), tensor([4,5,6,7,8,9]).

Here tensor([4,5,6,7,8,9,10,11,12,13]) is the index sequence of the mass-to-charge ratios and abundances for the molecular formula, and tensor([4,5,6,7,8,9]) is the index sequence of the real molecular formula. The two sets of mapped data are each used as input data for subsequent model training.
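The splitting of a molecular formula and its mapping through a target dictionary can be sketched as below. This is an illustrative sketch, not the patent's token function: the special-token names, the reservation of indices 0 to 3 for them, the implicit count of 1 for an element without a number, and the helper names `tokenize_formula`, `build_vocab` and `encode` are all assumptions. With those assumptions the tokens of C46H86O10 receive indices 4 to 9, in line with the example above.

```python
import re

# Assumed special tokens occupying indices 0-3 of the target dictionary.
SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def tokenize_formula(formula):
    """Split 'C46H86O10' into ['C', '46', 'H', '86', 'O', '10']."""
    tokens = []
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        tokens.append(element)
        tokens.append(count or "1")  # assumption: a missing count means 1
    return tokens


def build_vocab(token_lists):
    """Assign consecutive indices to tokens in order of first appearance."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Add start/end flags, map tokens to indices and pad to max_len."""
    ids = [vocab["<sos>"]]
    ids += [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    ids.append(vocab["<eos>"])
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


tokens = tokenize_formula("C46H86O10")
vocab = build_vocab([tokens])
plain_ids = [vocab[tok] for tok in tokens]
padded_ids = encode(tokens, vocab, max_len=10)
print(plain_ids)
print(padded_ids)
```

In a real data set the dictionary would be built once over all formulas, so the index of a given element or count would depend on the whole corpus rather than on one formula.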

S32: Model training.

Mark embeddings and position embeddings are generated for the different index mark segments in the mass spectrum data index sequence, where the position embedding information captures the relationship between preceding and following isotope peaks.

The position embedding algorithm includes:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

wherein: PE_(pos,2i) and PE_(pos,2i+1) denote the position embedding values, pos is the relative position of a character in the mass spectrum data index segment, and d_model is the manually specified dimension of the layer's output vector, generally chosen empirically; 2i is an even dimension of the vector and 2i+1 is an odd dimension of the vector.
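A minimal sketch of the position embedding, assuming the standard sinusoidal form of the original Transformer with base 10000 (the base and the function name `positional_encoding` are assumptions; the symbols pos, d_model, 2i and 2i+1 follow the definitions above):

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position embeddings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(0, d_model, 2):  # k plays the role of 2i
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)          # even dimension 2i
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)  # odd dimension 2i+1
    return pe


# Ten positions, matching the ten-value input sequence; d_model=8 is arbitrary.
pe = positional_encoding(seq_len=10, d_model=8)
print(pe[0])
```

The table is added element-wise to the mark embeddings before they enter the encoder, so each position in the sequence receives a distinct, deterministic offset.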

Because mass spectral data have a sequential context, position embedding is needed to obtain relative position information; the mass spectral data, after position embedding and mark embedding, are then fed into the encoder layer.

The encoder layer first computes local attention information through MHA (multi-head attention), using the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

wherein Q, K and V are all computed by linear layers and are the data used to compute self-attention; d_k is the dimension of the key vectors, and dividing by its square root keeps the variance of the dot products stable; Attention(Q, K, V) denotes the local attention.
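The attention calculation for a single head can be sketched in plain Python as below. This assumes the standard scaled dot-product form; the linear projections that produce Q, K and V, the split into multiple heads and any masking are omitted, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# A zero query attends equally to both keys, so the output is the mean of V.
result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 2.0], [3.0, 4.0]])
print(result)
```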

After this calculation, a residual operation and a LayerNorm normalization are performed and the result is fed into the linear layer; these steps are repeated multiple times, and the final result is input into the decoder of the Transformer module. At the same time, the molecular formula mapping data corresponding to the mass spectrum data are input into the decoder.

After the decoder receives its input, the encoder output and the masked decoder target are fed into multi-head attention, which is computed in the same way as in the encoder. The decoder computes the final result, applies the softmax layer, passes the result to the linear layer, and outputs it.

S33: Model error calculation.

The error between the model output and the target is calculated; the error function is CrossEntropyLoss, with the following formula:

$$\mathrm{loss}(x, class) = -\log\left(\frac{\exp(x[class])}{\sum_{j}\exp(x[j])}\right) = -x[class] + \log\left(\sum_{j}\exp(x[j])\right)$$

wherein: x is the model output for the current sample, class is the index of the actual class, x[j] is the j-th output, and loss(x, class) is the error of the mass spectrum data qualitative analysis model.
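For a single output position, the cross-entropy error can be sketched as follows. The sketch assumes the usual log-softmax form of cross entropy over unnormalized scores; the function name and the example values are illustrative.

```python
import math


def cross_entropy(x, target):
    """Cross-entropy loss for one sample.

    x is the list of unnormalized output scores, target the true class index.
    """
    m = max(x)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return -x[target] + log_sum_exp


# A confident correct prediction gives a small loss, a wrong one a large loss.
good = cross_entropy([10.0, 0.0, 0.0], target=0)
bad = cross_entropy([10.0, 0.0, 0.0], target=1)
print(good, bad)
```

During training the per-position losses are averaged over the sequence and the batch, and padded positions are typically excluded from the average.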

S34: Model weight update.

The calculated loss is back-propagated and the model weights are updated.

S35: Dynamic learning rate adjustment.

The dynamic learning rate is adjusted according to the following formula, where d_model is the vector dimension output by the linear layer, stepNum is the current step number and warmupSteps is the number of warm-up steps:

$$lr = d_{model}^{-0.5} \cdot \min\left(stepNum^{-0.5},\; stepNum \cdot warmupSteps^{-1.5}\right)$$

wherein: lr is the dynamic learning rate. After the current learning rate is computed with this formula, it is passed to the optimizer.
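The warm-up schedule can be sketched as a small function. It assumes the schedule of the original Transformer, which is consistent with the symbols d_model, stepNum and warmupSteps defined above; the default values 512 and 4000 are illustrative assumptions, not values given in the source.

```python
def dynamic_lr(step_num, d_model=512, warmup_steps=4000):
    """Learning rate that rises linearly during warm-up, then decays."""
    step_num = max(step_num, 1)  # avoid 0 ** -0.5 on the first call
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)


# The rate peaks at the end of warm-up and decays as 1/sqrt(step) afterwards.
for step in (1, 1000, 4000, 16000):
    print(step, dynamic_lr(step))
```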

S36: Model optimization.

The model is saved according to the returned verification set result; the rule is that the model with the lowest verification set loss is saved as the current optimal model.
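The saving rule can be sketched with a small tracker that keeps a copy of the model state whenever the verification loss improves. The class name `BestModelTracker` and the use of a plain dictionary as the model state are assumptions for illustration; a real implementation would write the weights to disk.

```python
import copy


class BestModelTracker:
    """Keep the model state with the lowest verification loss seen so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_state = None

    def update(self, val_loss, state):
        """Store a copy of state if val_loss improves; return True if saved."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)
            return True
        return False


# Example: verification losses from four epochs with placeholder states.
tracker = BestModelTracker()
saved = [tracker.update(loss, {"epoch": epoch})
         for epoch, loss in enumerate([0.9, 0.5, 0.7, 0.4])]
print(saved, tracker.best_loss, tracker.best_state)
```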

S4: Model verification step.

The deep learning model is verified with the verification set data, and its parameters are adjusted. During training, the loss and accuracy on the verification set are compared with those on the training set to decide whether training should be stopped early to avoid overfitting, and the optimal model is saved according to the verification set loss.

Step S41: Acquire the required isotope mass spectrum data from the verification data set.

Step S42: the data is stored in a csv floating point tensor format.

Step S43: Load the optimal model saved in step S3.

Step S44: Following the data index mapping method described above, convert the extracted verification data into a mass spectrum data index sequence, and input the indexed isotope data into the encoder for encoding.

Step S45: Input the encoded data into the decoder together with the corresponding real molecular formula data for decoding, find the final result by greedy search, and convert the final result into a molecular formula using the dictionaries. The dictionaries referred to here are the input dictionary and the target dictionary described above.

Step S46: Repeat inference until the eos end flag is reached, then stop inference and output the result.
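The greedy search loop of steps S45 and S46 can be sketched as follows. The trained decoder is replaced by a toy function `toy_step` that replays a fixed index sequence, and the index values (1 for the start flag, 2 for the end flag, 4 to 9 for the formula tokens) are assumptions chosen to match the C46H86O10 example; none of these names come from the source.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len):
    """Pick the most probable index at each step until eos or max_len."""
    tokens = [sos_id]
    for _ in range(max_len):
        probs = step_fn(tokens)  # distribution over the target dictionary
        next_id = max(range(len(probs)), key=lambda i: probs[i])
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start flag


# Toy stand-in for the trained decoder: replays a fixed index sequence.
SCRIPT = [4, 5, 6, 7, 8, 9, 2]
VOCAB_SIZE = 10


def toy_step(tokens):
    probs = [0.01] * VOCAB_SIZE
    probs[SCRIPT[len(tokens) - 1]] = 0.91
    return probs


# Hypothetical inverse target dictionary for the example formula.
ID_TO_TOKEN = {4: "C", 5: "46", 6: "H", 7: "86", 8: "O", 9: "10"}

ids = greedy_decode(toy_step, sos_id=1, eos_id=2, max_len=20)
formula = "".join(ID_TO_TOKEN[i] for i in ids)
print(formula)
```

Greedy search keeps only the single best index at each step, which is fast but cannot recover from an early wrong choice; the `max_len` bound guards against a model that never emits the end flag.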

S5: Model test step.

The test set is input into the encoder for encoding, the inference result is obtained through the decoder, and the result with the highest probability is obtained by the greedy search algorithm. The test set shows whether the generalization of the trained model meets the final requirement. The model performs one round of inference over all data of the test set to obtain the final accuracy.

In the following, the implementation of the present solution is described with reference to a specific analysis of mass spectrum data. Tables 3 and 4 show, respectively, the isotope distribution of mass spectral data and the isotope distribution of a specific compound as reflected in high-resolution mass spectral data.

Table 3 isotope distribution diagram of mass spectral data

Table 4 Isotope distribution of C₅₃H₁₀₃O₆

Isotope peak	Mass-to-charge ratio	Relative abundance
M	835.7755	100
M+1	836.7788	60.7610
M+2	837.7822	19.1828
M+3	838.7855	4.0285
M+4	839.7889	0.4709
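The rows of Table 4 can be flattened into a single model input as sketched below. The layout of five mass-to-charge ratios followed by five relative abundances, with missing peaks padded by 0, is inferred from the ten-value input example given earlier in this description; the function name `build_input_row` is hypothetical.

```python
def build_input_row(peaks, n_peaks=5):
    """Flatten (m/z, relative abundance) pairs into [m/z..., abundance...].

    Isotope peaks that are absent are padded with 0.0.
    """
    if len(peaks) > n_peaks:
        raise ValueError("too many isotope peaks")
    padded = list(peaks) + [(0.0, 0.0)] * (n_peaks - len(peaks))
    mz_values = [mz for mz, _ in padded]
    abundances = [ab for _, ab in padded]
    return mz_values + abundances


# Isotope distribution of C53H103O6 from Table 4.
table4 = [
    (835.7755, 100.0),
    (836.7788, 60.7610),
    (837.7822, 19.1828),
    (838.7855, 4.0285),
    (839.7889, 0.4709),
]
row = build_input_row(table4)
print(row)
```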

Constructing a lipidomics isotope data set, specifically comprising the following steps.

(1) First, an isotope data set of lipid molecules such as fatty acids, glycerides, phospholipids and sphingolipids is constructed with a mass (m/z) calculation tool. The data set mainly comprises the molecular formulas and isotope distributions of the lipid compound ions, where the isotope distribution comprises the exact mass-to-charge ratios m/z, m/z1, m/z2, m/z3 and m/z4 of the isotope ion peaks M, M+1, M+2, M+3 and M+4 and the corresponding relative abundances I, I1, I2, I3 and I4. I is the base peak, and after normalization I1, I2, I3 and I4 are expressed as percentages of I, that is, as relative abundances.

Second, after preliminary processing of the mass spectrum data, 76951 records are obtained as the basic data set, which is converted into csv format.

Finally, the basic data set is randomly divided into a training set, a validation set and a test set, and the ratio is 7.

(2) A first-level high-resolution mass spectrometry qualitative analysis data model based on a Transformer is constructed, and the data model mainly comprises an Embedding layer, a position Embedding layer, a multi-head attention layer, a LayerNorm layer, a Linear layer, 12 layers of encoders and 6 layers of decoders.

(3) Model training is carried out. The training process is as described above and is not repeated here.

(4) The model is verified and tested with the split verification set and test set. The verification set is mainly used to tune hyperparameters during training and to check the accuracy and loss of the model during training. The test set is used to test the generalization of the model. The model is tested after training has finished and converged: the model weights are loaded, the abundance and mass-to-charge ratio data are read in and preliminarily processed, a batch dimension of 1 is added, the data are converted into tensor format and fed into the model for inference, the final result is obtained through the greedy search algorithm and output, and the final accuracy on the test set is obtained.

(5) The constructed Transformer model is used for inference on the lipid molecule isotope data set. The encoder first compresses the information, the decoder then decodes the features, and greedy search outputs the final result: at each time step a probability distribution over one character is generated, the maximum is taken and passed to the next time step, until all character indexes have been generated. Inference of the composition is completed by converting the character indexes into a character string through the known dictionary. Serial inference on a GPU takes about 0.1 s per sequence. The overall accuracy reaches 98%.
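The normalization described in step (1), in which the isotope peak intensities are expressed as percentages of the base peak I, can be sketched as follows. The function name and the raw intensity values are hypothetical; the sketch assumes, as the text states, that the first peak M is the base peak.

```python
def to_relative_abundance(intensities):
    """Scale raw intensities so that the first (base) peak equals 100."""
    base = intensities[0]
    if base <= 0:
        raise ValueError("base peak intensity must be positive")
    return [100.0 * value / base for value in intensities]


# Hypothetical raw intensities for the peaks M, M+1 and M+2.
relative = to_relative_abundance([2000.0, 1000.0, 500.0])
print(relative)
```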

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A Transformer-based qualitative method for mass spectrum data, characterized by comprising the following steps:

S1: a data collection step: collecting primary high-resolution mass spectrum data, and constructing a basic data set, wherein the data of the data set comprise isotope distribution data, isotope peak mass-to-charge ratio data and relative abundance data of a compound;

S3: a model training step: the model comprises an Embedding layer, a position Embedding layer, a multi-head attention layer, a LayerNorm layer, a Linear layer, an encoder layer and a decoder layer; performing deep data processing on the training set data, converting the training set data into an index data format, carrying out model training, and selecting an optimal model as a deep learning model;

2. The Transformer-based qualitative method of mass spectrometry data of claim 1, further comprising a model test step S5:

performing deep data processing on the data in the test set, inputting the data into an encoder for encoding, then obtaining an inference result through a decoder, and testing the accuracy of the verified model.

3. The Transformer-based qualitative method of mass spectrometry data of claim 1, wherein said data collection step further comprises a data initialization step, the data initialization method comprising:

4. The Transformer-based qualitative method of mass spectrometry data of claim 1, 2 or 3, wherein the data set is further deep-processed to obtain the mass spectrometry data index sequence, the deep processing comprising:

the input dictionary includes:

the target dictionary includes:

and performing model training by adopting the data after deep processing.

5. The method of claim 4, wherein in step S4, the decoded data is searched for final results using a greedy search and converted to molecular formulas based on a target dictionary.

6. The method of claim 4, wherein marker embedding and position embedding are generated for different index marker segments in the deep processed mass spectrometry data index sequence and then input to the encoder layer,

the location embedding algorithm includes:

7. The method of claim 6, wherein during the training of the model, the model error is calculated, back propagation is performed according to the error calculation result, and the weight of the model is updated.

8. The Transformer-based qualitative method of mass spectrometry data of claim 1, wherein the ratio of the data volumes of the training set, the validation set and the test set is 7:2:1.