CN111275130A - Deep learning prediction method, system, medium and device based on multiple modalities

Deep learning prediction method, system, medium and device based on multiple modalities

Info

Publication number
CN111275130A
Authority
CN
China
Prior art keywords
deep learning
feature extraction
prediction
constraint
data
Prior art date
Legal status
Granted
Application number
CN202010098684.9A
Other languages
Chinese (zh)
Other versions
CN111275130B (en)
Inventor
钱晓华
陈夏晗
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010098684.9A
Publication of CN111275130A
Application granted
Publication of CN111275130B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modality-based deep learning prediction method, system, medium and device. The multi-modality-based deep learning prediction method comprises: acquiring an image dataset comprising image data of at least two modalities; performing feature extraction on the image data to generate a feature extraction result corresponding to each modality; and fusing the feature extraction results and performing classification prediction on them in combination with preset constraint terms. The invention designs a multi-modal network structure: for the image of each modality, a convolutional neural network is used for feature extraction; the features are then fused at a fully connected layer in combination with constraint terms, and the feature information of the different modalities is synthesized to obtain the final classification result. In this way the information characteristics of each single modality are retained, multi-modal information can be comprehensively utilized, and the reliability of the final decision is improved.

Description

Deep learning prediction method, system, medium and device based on multiple modalities
Technical Field
The invention belongs to the technical field of deep learning and relates to a learning prediction method, and in particular to a multi-modality-based deep learning prediction method, system, medium and device.
Background
In the prior art, omics methods and deep learning approaches for three-dimensional images have achieved certain results, for example in the non-invasive evaluation of gene changes using images such as CT (Computed Tomography) and MRI (Magnetic Resonance Imaging). However, deep learning still has many deficiencies. First, image data sets are sometimes small, and too little data easily causes overfitting during model training. Second, making reasonable and full use of three-dimensional image information remains difficult: on the one hand, 3D neural networks have numerous parameters and a large amount of computation, requiring substantial computing resources; on the other hand, the information in a single 2D section is often insufficient and cannot comprehensively represent the three-dimensional characteristics of a tumor. Third, the information of a single-modality image is insufficient, and existing multi-modal information fusion methods are limited. Most multi-modal deep learning models directly splice the extracted features and input them into a fully connected layer for feature selection and fusion, ignoring the diversity among the features. This easily produces bias during training and makes the feature selection differences among modalities too large, so that multi-modal information cannot be fully utilized and the prediction effect is poor.
Therefore, how to provide a multi-modality-based deep learning prediction method, system, medium and device that overcomes the inability of the prior art to build a multi-modal deep learning model accounting for the diversity among features and to perform efficient classification prediction is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide a multi-modality-based deep learning prediction method, system, medium and device, which are used to solve the problem that the prior art cannot effectively combine the diversity among features to generate a multi-modal deep learning model and perform efficient prediction.
To achieve the above and other related objects, an aspect of the present invention provides a multi-modality-based deep learning prediction method, including: acquiring an image dataset comprising image data of at least two modalities; performing feature extraction on the image data to generate a feature extraction result corresponding to each modality; and fusing the feature extraction results and performing classification prediction on them in combination with preset constraint terms.
In an embodiment of the invention, the image dataset is a two-dimensional image dataset after a spiral transformation and data amplification.
In an embodiment of the present invention, the feature extraction is performed on the image data through a convolutional neural network, where the convolutional neural network includes a residual structure and a bilinear pooling structure.
In an embodiment of the present invention, the step of performing classification prediction on the feature extraction results in combination with a preset constraint term includes: connecting the feature extraction results to a fully connected layer for feature fusion to generate a prediction output result; and performing parameter optimization on the prediction model in combination with the preset constraint terms so as to make the prediction output result more accurate.
In an embodiment of the present invention, the preset constraint terms include a first constraint term, a second constraint term and a third constraint term, and the step of optimizing the parameters of the prediction model in combination with the preset constraint terms comprises: supervising the prediction output process through the first constraint term; performing feature selection on the feature extraction results through the second constraint term; applying constraints among the modalities through the third constraint term to preserve the diversity of the feature extraction results; adding the first, second and third constraint terms according to preset weights to determine a loss function; and performing parameter optimization on the prediction model using a gradient descent method so as to minimize the loss function.
In an embodiment of the invention, parameter optimization is performed in the prediction model by an iterative principle of a gradient descent method.
In an embodiment of the invention, the multi-modal based deep learning prediction method further includes: determining a final prediction model when the loss function is minimized; and evaluating the final prediction model through a preset evaluation index.
In another aspect, the present invention provides a multi-modality-based deep learning prediction system, comprising: a data acquisition module for acquiring an image dataset comprising image data of at least two modalities; a feature extraction module for performing feature extraction on the image data to generate a feature extraction result corresponding to each modality; and a prediction module for fusing the feature extraction results and performing classification prediction on them in combination with preset constraint terms.
Yet another aspect of the present invention provides a medium having stored thereon a computer program which, when executed by a processor, implements the multi-modality based deep learning prediction method.
A final aspect of the invention provides an apparatus comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory to cause the apparatus to perform the multi-modal based deep learning prediction method.
As described above, the method, system, medium and apparatus for deep learning prediction based on multiple modalities according to the present invention have the following advantages:
In terms of multi-modal fusion, the model-and-data hybrid-driven method adds prior knowledge in the process of fusing features. Compared with simple feature vector splicing, the method achieves higher accuracy, and the AUC (Area Under the ROC Curve, where ROC denotes the Receiver Operating Characteristic curve) also increases. In addition, the hybrid-driven method obtains better results in terms of precision: it not only plays the role of feature selection, but also allows the modalities to be combined more effectively, so that they supplement each other and act together. During testing, the combined action of intra-modal feature sparsification and inter-modal effect equalization greatly improves the final prediction effect.
Drawings
FIG. 1 is a diagram illustrating an example data set for a multi-modal based deep learning prediction method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an application scenario architecture of the multi-modal based deep learning prediction method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a coordinate system construction of the multi-modal-based deep learning prediction method according to an embodiment of the present invention.
FIG. 4 is a schematic flow chart diagram illustrating a multi-modal based deep learning prediction method in accordance with an embodiment of the present invention.
FIG. 5 is a schematic diagram illustrating data transformation of the multi-modal-based deep learning prediction method according to an embodiment of the present invention.
FIG. 6 is a graph showing the effect of two data amplifications of the multi-modal-based deep learning prediction method of the present invention in one embodiment.
FIG. 7 is a data distribution diagram of the multi-modal-based deep learning prediction method according to an embodiment of the present invention.
FIG. 8 is a flow chart of an analysis prediction method for a multi-modal based deep learning prediction method according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating model optimization of the multi-modal based deep learning prediction method according to an embodiment of the present invention.
FIG. 10 is a graph showing the effect of different loss functions in one embodiment of the multi-modal based deep learning prediction method of the present invention.
FIG. 11 is a schematic diagram of a multi-modal based deep learning prediction system according to an embodiment of the present invention.
Description of the element reference numerals
5 deep learning prediction system based on multiple modes
51 data acquisition module
52 feature extraction module
53 prediction module
S41-S45
S431 to S432
S432A-S432E
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The invention provides a multi-modality-based deep learning prediction method, system, medium and device, and proposes an intelligent prediction model based on spiral transformation and model driving. The input of the model is a multi-modal three-dimensional image, which is preprocessed by the spiral transformation into a two-dimensional plane, and the output is a final probability value. The whole model comprises three parts: spiral transformation preprocessing and data amplification, feature extraction, and feature fusion, and a new model-driven loss function constraint term is provided.
Example one
The embodiment provides a multi-modality-based deep learning prediction method, which includes:
acquiring an image dataset comprising image data of at least two modalities;
performing feature extraction on the image data to generate a feature extraction result corresponding to each modality;
and fusing the feature extraction results and performing classification prediction on them in combination with preset constraint terms.
The multi-modal based deep learning prediction method provided by the present embodiment will be described in detail below with reference to the drawings.
Please refer to fig. 1, which is a diagram illustrating an example of a data set of a multi-modal based deep learning prediction method according to an embodiment of the present invention. The multimodal deep learning prediction method according to the present invention is applicable to any data set in which a target region is approximated to a sphere, and in the present embodiment, the multimodal deep learning prediction method will be described in detail by taking a pancreatic cancer image data set as an example.
Pancreatic cancer is one of the most dangerous malignant tumors, characterized by late diagnosis, high mortality and low overall survival; the five-year survival rate of patients is less than 3.5 percent, 75 percent of patients have TP53 gene mutation, and 75 to 90 percent of patients have KRAS gene mutation. As a tumor suppressor gene, TP53 encodes the P53 protein and can inhibit cell proliferation in a number of cellular processes, while the proto-oncogene KRAS is closely related to cell division, differentiation and apoptosis. Mutated TP53/KRAS genes promote tumor cell proliferation, invasion and survival. In the treatment of pancreatic cancer, the mutation status of these genes is closely related to assessing the patient's prognosis and selecting a reasonable treatment. At present, surgical excision or biopsy is the main method for detecting TP53/KRAS gene mutation, but these methods have the defects of limitation, blindness and invasiveness, so there is a great clinical need to detect the gene changes of tumors with a non-invasive technique.
In recent years, noninvasive evaluation of genetic changes in living tissue using images has been the hotspot and key of many studies. Tumors are characterized by somatic mutations, and changes in the genes are ultimately reflected in the tumor phenotype. The characterization includes quantitative features such as intensity, shape, size or volume and texture of the tumor in the image, which provide information on the tumor phenotype and microenvironment.
In related research, a number of radiomics methods have achieved certain results. For example, Eran et al. demonstrated the correlation between CT images and gene expression in primary human liver cancer; Coudray et al. used lung cancer histopathology images to predict the mutation status of genes such as TP53; other studies extract radiomic features such as shape, texture and density for feature selection and then build a machine learning model to predict gene changes.
In another aspect, the method of deep learning is also applied to image prediction of changes in tumor markers. For example, a 3D convolutional neural network (3D-CNN) is constructed, and three-dimensional tumor images are directly classified; wang et al construct a deep learning model with a two-dimensional slice image as input and visualize the prediction region with attention mapping.
The above problems become more prominent in the prediction of TP53/KRAS gene mutation in pancreatic cancer, and the task of predicting pancreatic cancer gene mutation is very challenging. Current methods mainly have the following deficiencies. First, pancreatic tumors are small and very difficult to segment automatically; they are closely associated with the surrounding tissue and display similar intensities, so they are hard to identify, manual segmentation is time consuming, and for junior physicians marking tumor region labels is a significant challenge. Therefore, the invention does not perform tumor segmentation, and predicts TP53/KRAS in pancreatic cancer by a deep learning-based method. Second, in clinical diagnosis a pathological tissue biopsy is needed to confirm the gene mutation status; the operation is inconvenient, the examination period is long, the cost of obtaining effective labels is high, and the data volume is small. In order to fully utilize tumor information, this embodiment provides a new data amplification method based on the spiral transformation: the information contained in each amplified image differs to a certain extent, while the spatial correlation of information such as tumor texture is retained. Third, it is difficult to obtain a reliable diagnosis of gene changes using only a single-modality image, and it is not easy to obtain multi-modal MRI images or to effectively fuse multi-modal information. In this embodiment, through deep analysis of the feature information of the different modalities, constraint terms of the loss function are designed to constrain the fully connected weights within each modality and the predicted values between modalities, so that the diversity and correlation of the information between modalities are fully utilized.
Please refer to fig. 2, which illustrates an application scenario architecture of the multi-modal based deep learning prediction method according to an embodiment of the present invention. This embodiment provides a multi-modal network structure: for the image of each modality, the three-dimensional image data is converted into two-dimensional image data by the spiral transformation, feature extraction is performed through a convolutional neural network, and the features are then fused at a fully connected layer in combination with the constraint terms, synthesizing the feature information of the different modalities to obtain the final classification prediction result. In this way the information characteristics of each single modality are retained, multi-modal information can be comprehensively utilized, and the reliability of the final decision is improved. Fig. 3 shows the construction of the coordinate system used in the spiral transformation. As shown in fig. 3, a spatial rectangular coordinate system is established with O as the origin of coordinates. In three-dimensional space, a point A on the spiral line is determined by the azimuth angle Ψ, the polar angle Θ and the distance r from the origin.
Please refer to fig. 4, which is a schematic flowchart illustrating a multi-modal based deep learning prediction method according to an embodiment of the present invention. In the present embodiment, a multi-modal deep learning prediction method including a helical transformation and a data amplification method will be described in detail with respect to the acquisition of pancreatic cancer image data. As shown in fig. 4, the multi-modal based deep learning prediction method specifically includes the following steps:
s41, an image dataset is acquired, the image dataset comprising image data of at least two modalities.
In this embodiment, the image dataset is a two-dimensional image dataset after a spiral transformation and data amplification.
Specifically, before the deep learning framework is constructed, the data is preprocessed through a spiral transformation method. The three-dimensional target area is spirally expanded to a two-dimensional plane by fully utilizing three-dimensional information, the correlation between original adjacent pixels is reserved in the transformation process, and then the transformed image is used for predicting gene mutation.
The acquired image dataset is a pancreatic cancer image dataset. The pancreatic cancer data are acquired from magnetic resonance images of pancreatic cancer patients, and the acquired data need to contain image information of a plurality of parameters. In this embodiment, three modalities are adopted, namely ADC (Apparent Diffusion Coefficient), DWI (Diffusion Weighted Imaging) and T2 (T2-weighted imaging), and MRI data of 64 patients are collected in total; the data of the three modalities are image data corresponding to three different imaging parameters. At the same time, the location of the tumor has already been determined in the image data. In this example, the data set came from patients with pancreatic cancer treated by surgery at Rekin Hospital from January 2016 to December 2016, and each case includes the pathological examination of the tumor, i.e., the mutation status of TP53 (a tumor suppressor gene) and KRAS (a proto-oncogene).
Specifically, a point located inside the tumor (e.g., the center point of the tumor) in the original three-dimensional MRI is selected as the midpoint O of the spiral transformation, and the maximum distance from the tumor edge to the point O determines the maximum radius R of the spiral transformation. As shown in fig. 3, a spatial rectangular coordinate system is established with the point O as the coordinate origin. In three-dimensional space, a point A on the spiral line is determined by the azimuth angle Ψ, the polar angle Θ and the distance r from the origin. According to the transformation relationship of the coordinate system, the coordinates of the point A can be expressed as:

x = r sin Θ cos Ψ,  y = r sin Θ sin Ψ,  z = r cos Θ   (1)
the key to the spiral transformation is to construct a relationship of the two angles Θ and Ψ. According to different requirements, different relations can be constructed. For example, to have the sampling points evenly distributed at the two poles and equator of the sphere, the radian between fixed sampling points is constant. Let the circle on the equator have 2N sampling points, define the sampling radian as the distance d between two points on the equator:
Figure BDA0002386164750000062
the number of horizontal plane sampling points corresponding to the theta angle is set as
Figure BDA0002386164750000063
If the angle Θ takes N values over its range and the radius is specified, then when N is large enough the total number of sampling points can be obtained by integrating formula (3):

S = Σ n(Θ_i) ≈ (N/π) ∫ from 0 to π of 2N sin Θ dΘ = 4N²/π

That is, the total number of sampling points on the surface of the sphere with the specified radius is approximately 4N²/π.
Knowing the coordinates of point A, the radian between two adjacent points on the same circle can be expressed as Ψ* sin Θ; for this to equal the sampling radian d, Θ and Ψ must satisfy the relationship of equation (4):

Ψ* sin Θ = π / N,  i.e.  Ψ* = π / (N sin Θ)   (4)

where Ψ* is the difference between the angles Ψ (measured from the positive direction of the X axis) of two adjacent coordinate points.
Similarly, in the practical application process, different rotation transformation rules can be used to establish different relationships between Θ and Ψ, for example, to make Θ and Ψ vary uniformly within a value range, to make the surface density and bulk density of the sampling points equal, and to design a targeted distribution of the sampling points for a specific target object.
The gray value of each sampling point is then calculated by trilinear interpolation, with its coordinates in three-dimensional space mapped to positions in the original matrix. Finally, the gray values are filled into a two-dimensional matrix to obtain the two-dimensional image unfolded by the spiral transformation.
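To make the sampling procedure concrete, the following sketch (Python with NumPy/SciPy, not part of the original patent) illustrates one way a spiral unfolding of this kind could be implemented under the uniform-arc relation of equation (4). The function names, axis conventions and parameter defaults are assumptions for illustration; the interpolation uses first-order map_coordinates, which corresponds to (tri)linear interpolation.

```python
import numpy as np
from scipy.ndimage import map_coordinates  # order=1 gives (tri)linear interpolation


def spiral_directions(N=20):
    """Unit direction vectors sampled along a pole-to-pole spiral.

    Theta takes N values in (0, pi); on the ring at angle theta the azimuth
    advances in steps of pi / (N * sin(theta)) (equation (4)), so consecutive
    samples are roughly equally spaced on the unit sphere.
    """
    dirs = []
    psi = 0.0
    for theta in np.linspace(np.pi / N, np.pi * (1 - 1.0 / N), N):
        n_pts = max(1, int(round(2 * N * np.sin(theta))))   # points on this ring, eq. (3)
        d_psi = np.pi / (N * np.sin(theta))                  # azimuth step, eq. (4)
        for _ in range(n_pts):
            dirs.append([np.sin(theta) * np.cos(psi),        # equation (1) with r = 1
                         np.sin(theta) * np.sin(psi),
                         np.cos(theta)])
            psi += d_psi
    return np.asarray(dirs)                                   # (num_samples, 3)


def spiral_unfold(volume, center, R=60, N=20):
    """Unfold a 3D volume into a 2D image: rows are radii, columns are spiral samples.

    `center` is the chosen point O inside the tumour, given in the same axis
    order as `volume`; mapping the three direction components onto the volume
    axes is a convention choice.
    """
    dirs = spiral_directions(N)                               # (M, 3)
    radii = np.arange(1, R + 1, dtype=float)                  # (R,)
    pts = center[None, None, :] + radii[:, None, None] * dirs[None, :, :]  # (R, M, 3)
    coords = pts.reshape(-1, 3).T                             # (3, R*M) voxel coordinates
    gray = map_coordinates(volume, coords, order=1, mode='nearest')
    return gray.reshape(R, len(dirs))                         # 2D spiral-transformed image
```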
Most two-dimensional convolutional neural networks use slices of a cross section as input to the network, and only contain two-dimensional information of one slice. However, each layer of the three-dimensional target region has strong correlation in space, and the simple two-dimensional section ignores the inter-layer relation. Meanwhile, the visual angle of the cross section is single, the image characteristics of other visual angles cannot be comprehensively represented, and the representation of the texture characteristics on the three-dimensional space is insufficient. The spiral transformation is to expand the image from the three-dimensional space to the two-dimensional space from two poles to the equator in sequence by taking the radius or the diameter in the three-dimensional space as an axis. The transformation method preserves the correlation of features such as textures and the like on a 3D space to a certain extent. For a sample, the two-dimensional image obtained by the spiral transformation contains more comprehensive and complete three-dimensional information than the two-dimensional image obtained by a section, and a high-quality data set is provided for subsequent classification by using a neural network.
In particular, the present invention applies this method of spiral transformation to data amplification. The purpose of data amplification is to increase the diversity of data in the sample, to combat overfitting of the network. However, the most common data amplification method using geometric transformation hardly changes the information content of the original data, and the data before and after amplification are very similar, so that the improvement of the model result is limited.
The most common data amplification methods are geometric transformations of the image, such as horizontal flipping, scaling within a small range multiple (e.g., 0.8-1.15), rotation, etc., of the two-dimensional image. These methods increase the amount of data to some extent, but the transformation results are all from the original data. For example, horizontal flipping changes only the view angle of the two-dimensional image, hardly changes the information content of the data set, and the data before and after augmentation are very similar, thereby limiting the effect of model prediction.
For the convenience of comparing the transformation results under the two coordinate systems, the same coordinate origin and positive z-axis direction are kept and only the positive direction of the x axis is changed, by an angle ΔΨ. The coordinates of the corresponding point A′ can then be expressed as:

x′ = r sin Θ cos(Ψ + ΔΨ),  y′ = r sin Θ sin(Ψ + ΔΨ),  z′ = r cos Θ   (5)
If A′(x′, y′, z′) = A(x, y, z), combining formula (1) and formula (5) yields formula (6):

sin Θ cos Ψ = sin Θ cos(Ψ + ΔΨ),  sin Θ sin Ψ = sin Θ sin(Ψ + ΔΨ)   (6)
the formula (7) is obtained after simplification:
Figure BDA0002386164750000083
as can be seen from the equation set, when cos △ Ψ is 1, a '(x', y ', z') is a (x, y, z) if and only if △ Ψ is 2 π k, it means that different spiral transformation results can be obtained by changing the positive direction angle of the x axis in the plane of the spatial coordinate system XOY.
Similarly, besides changing the positive direction angle of a coordinate axis, other transformations can also be used to obtain different spiral transformation results from the same three-dimensional data, such as changing the location of the origin of the coordinate system, geometrically transforming the raw data, changing the parameters of the spiral transformation (including the number of rotations, the sampling intervals, etc.), horizontal or vertical flipping, and scaling within a small range (e.g., 0.8 to 1.15 times). The transformed two-dimensional image is a part of the original three-dimensional image, and the result of the spiral transformation is equivalent to a subset of the original data, so the amplified data obtained from different coordinate systems have a certain complementary relationship for the same three-dimensional original data.
Specifically, the acquired three-dimensional MRI is transformed into two-dimensional space according to a specified spiral transformation rule for Θ and Ψ (for example, the uniform-arc relation of equation (4)). With a maximum spiral transformation radius of 60 and N = 20, a 120 × 254 two-dimensional image is finally obtained.
In addition, in the data amplification process, this embodiment fixes the parameters of the spiral transformation and the origin and positive directions of the spatial rectangular coordinate system, and applies geometric transformations to the raw data. The three-dimensional data are rotated by different angles about the z axis, flipped horizontally and flipped vertically, and then converted into two-dimensional space through the spiral transformation, expanding the data to 27 times the original amount. The data set is then divided by patient into five equal parts according to the ratio of positive and negative samples, four of which serve as training sets and one as a test set.
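As a rough illustration of this amplification step, the sketch below (again an illustrative assumption, reusing the spiral_unfold helper sketched earlier) produces several different 2D unfoldings of one volume by flipping it and rotating it about the z axis before the spiral transformation; rotating the volume is equivalent to changing the positive direction of the x axis by ΔΨ, consistent with equation (7). The particular angles, and the combination of transforms that yields the 27-fold amplification described above, are not specified here.

```python
import numpy as np
from scipy.ndimage import rotate


def augment_spiral(volume, center, angles=(0, 40, 80, 120), R=60, N=20):
    """Produce several spiral unfoldings of the same 3D volume.

    Each sample combines an optional flip with a rotation about the z axis,
    so every unfolded 2D image traces a different path through the tumour and
    carries partly complementary information. For simplicity the same centre
    point is reused; strictly, it should be transformed together with the
    volume unless it lies on the rotation axis.
    """
    samples = []
    for flip in (False, True):
        vol_f = volume[:, :, ::-1].copy() if flip else volume      # horizontal flip
        for angle in angles:
            vol_r = rotate(vol_f, angle, axes=(1, 2), reshape=False, order=1)
            samples.append(spiral_unfold(vol_r, center, R=R, N=N))
    return np.stack(samples)                                        # (n_augmentations, R, M)
```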
It should be noted that we can use the method of spiral transformation to perform data amplification, and increase the information amount of the training sample in the deep learning method. The pancreatic cancer data set is only used as a specific embodiment of the invention, and the data amplification method of the spiral transformation is also suitable for other data sets of which the target area is similar to a sphere, so that a new data amplification idea is provided for solving the problem of insufficient data volume in deep learning.
Predicting the TP53 gene status of pancreatic cancer is a very challenging task: the tumor region is small, identification is difficult, and the difficulty of obtaining multi-modal data results in an insufficient sample size, which further increases the difficulty of the task. This embodiment alleviates the small-sample problem to some extent. Considering that a conventional section image loses a large amount of spatial information, and that directly using three-dimensional convolution adds a large amount of computation to an already parameter-heavy three-modality network, a novel spiral transformation method is provided in which the image is spirally transformed before being input into the convolutional neural network. Compared with 3D models, this reduces computational resources and model parameters.
During the experiment, the original images were amplified to 27 times their number both by the spiral transformation and by prior-art geometric amplification, and the effect of data amplification was evaluated using normalized mutual information. The geometric amplification applies geometric transformations such as horizontal and vertical flipping to the 2D section with the largest tumor area. Please refer to fig. 6, which shows the two data amplification effects of the multi-modal based deep learning prediction method in one embodiment: (a) is spiral transformation data amplification, and (b) is geometric transformation data amplification of the 2D section with the largest tumor area. The results of the two methods for one case are shown in fig. 6; the top left corner is the original image and the other three are amplified images. In order to compare the similarity between the images before and after amplification for the two methods, the normalized mutual information between each of the 26 amplified images and the original image was calculated for fig. 6(a) and fig. 6(b) and summed for each group, giving a mutual information sum of 32.8838 for the spiral transformation and 38.3224 for the section images. Normalized mutual information is a way of measuring the similarity of two images, a measure of how much one image contains the other; the larger its value, the higher the similarity of the two images, and it can be computed from the information entropy and joint information entropy of the images. In addition, a t-test on the two groups of data gives p = 6.4920 × 10⁻⁷, far less than 0.01, which shows that the normalized mutual information of the two groups differs significantly, i.e., the similarity of the images amplified by the spiral transformation is smaller.
The data obtained by the spiral transformation amplification and by the prior-art geometric transformation amplification are compared in tabular form in combination with fig. 6; see Table 1 for the data amplification comparison.
Table 1: data amplification comparison table
Helical transformation Geometric transformation
Normalized mutual information 32.8838 38.3224
Degree of dispersion 0.2709 0.0927
Euclidean distance 14.5826 7.7633
In order to facilitate visual observation of the data amplification effect, only the original data and the data from one doubling step (horizontal flipping and vertical flipping) are visualized after dimensionality reduction, and the data are normalized to calculate the degree of dispersion S of the two-dimensional discrete points; see fig. 7 for the data distribution in one embodiment of the multi-modal based deep learning prediction method of the present invention. As shown in fig. 7, the first graph of (a) shows the degree of dispersion of the data distribution under the geometric transformation, the second graph of (a) shows an enlarged view of the portion of the first graph where the data are more concentrated, and (b) shows the degree of dispersion of the data distribution under the spiral transformation and data amplification. As can be seen from Table 1, the degree of dispersion of the prior-art geometric transformation is 0.0927, while that of the spiral transformation in this embodiment is 0.2709, which is significantly higher. In addition, the Euclidean distances d from each amplified point to the original point were calculated and summed, giving 7.7633 for the geometric transformation and 14.5826 for the spiral transformation. In conclusion, the spiral transformation yields lower normalized mutual information, a higher degree of dispersion and larger Euclidean distances, so the similarity among the amplified data is smaller, the distribution range is wider, and the data amplification effect is better.
The results show that the data set obtained by the spiral transformation is also a two-dimensional image, and the data set is more widely distributed, namely contains more comprehensive three-dimensional information. The data amplification method of the method is proved to be capable of enabling a single 2D image to keep 3D information and reflecting the spatial distribution characteristics and spatial texture relation of a tumor region on one hand; on the other hand, when data amplification is carried out each time, different tumor information can be obtained only by changing the coordinate axis angle of the spiral transformation, so that the data amplified each time are different, the amplified sample contains more information, and the spiral transformation is a very effective data amplification method.
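For reference, a minimal sketch of how the normalized mutual information between an original and an amplified image could be computed from gray-level histograms is given below; the definition NMI = (H(A) + H(B)) / H(A, B) and the number of bins are assumptions, since the exact formulation used in the experiments is not spelled out in the text.

```python
import numpy as np


def normalized_mutual_information(img_a, img_b, bins=64):
    """NMI = (H(A) + H(B)) / H(A, B), estimated from a joint gray-level histogram."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1)                 # marginal distribution of image A
    p_b = p_ab.sum(axis=0)                 # marginal distribution of image B

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    return (entropy(p_a) + entropy(p_b)) / entropy(p_ab.ravel())
```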
In addition, when the spiral transformation and data amplification are applied to deep learning, the prior-knowledge constraints added to the model-driven loss function also help alleviate overfitting. The backbone network is initialized with ImageNet pre-trained parameters, drawing on the idea of transfer learning, so that the network parameters have a better initial distribution; the lowest-level features (such as angles and edges) can be extracted quickly under small-sample conditions, the convergence speed is accelerated, and overfitting is reduced.
S42, performing feature extraction on the image data to generate a feature extraction result corresponding to each modality.
In this embodiment, the feature extraction is performed on the image data by a convolutional neural network, which includes a residual structure and a bilinear pooling structure.
Specifically, in the embodiment, a multi-modal model including three tributaries is constructed, the feature extraction part of each tributary in the network framework adopts a residual block structure of ResNet18, and the pre-training result of ImageNet in ResNet18 is used as the initialization parameter. Under the condition of insufficient sample size, the transfer learning can quickly learn low-level features such as direction, color and the like, and only the high-level features need to be finely adjusted, so that the convergence speed can be increased, and the prediction accuracy can be improved. But additional parameters such as the full connectivity layer cannot be initialized with pre-training. The convolution neural network model and the structural parameters used in particular can be adjusted according to the actual requirements of the project.
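A minimal PyTorch sketch of one tributary's feature extractor as described above: the ResNet18 convolutional backbone initialized from ImageNet pre-training, with the pooling and classification head removed so that the last convolutional feature map can feed the bilinear pooling stage. The class name and the handling of single-channel spiral images are assumptions; the exact layer choices in the patented model may differ.

```python
import torch.nn as nn
from torchvision import models


class TributaryBackbone(nn.Module):
    """ResNet18 convolutional feature extractor for one modality."""

    def __init__(self, pretrained=True):
        super().__init__()
        resnet = models.resnet18(pretrained=pretrained)      # ImageNet-pretrained weights
        # keep everything up to and including the last residual block,
        # dropping global average pooling and the fully connected layer
        self.features = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, x):
        # x: (B, 3, H, W); a single-channel spiral image can be repeated
        # across 3 channels to match the pretrained first convolution
        return self.features(x)                              # (B, 512, H/32, W/32)
```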
Predicting changes in tumor-associated genes in this embodiment falls into the category of fine-grained classification. In order to obtain better fine-grained classification performance, a bilinear module is introduced. The bilinear pooling structure has been shown to improve fine-grained classification on bird, airplane and automobile data sets. Thus, for a single tributary network, the invention connects a bilinear pooling structure after the last convolutional layer. The bilinear pooling layer is composed of two feature extractors, which share weights in the invention. Bilinear pooling computes the outer product at each position of the feature map and sums the results.
Suppose the feature map output by the convolutional layer is f(I) ∈ R^(c×h×w). Then, for each spatial location i, bilinear pooling can be expressed as:

bilin_i = f(i, I) f(i, I)^T,  bilin_i ∈ R^(c×c)   (8)

Summing over all spatial positions gives the bilinear pooled output:

y(I) = Σ_i bilin_i   (9)
bilinear pooling is beneficial for extracting texture features, which are very important in fine-grained classification.
The dimensionality of the fused feature is the square of the number of channels of the original feature map. The final extracted feature y(I) is connected to the fully connected layer for fusion.
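The bilinear pooling of equations (8) and (9) can be written compactly as a sum of outer products over spatial positions. A sketch with shared weights between the two feature extractors follows; the signed square-root and L2 normalization at the end are common practice for bilinear features and are an assumption here, not something stated in the text.

```python
import torch


def bilinear_pool(feat):
    """Bilinear pooling with a shared feature extractor (equations (8)-(9)).

    feat: (B, c, h, w) feature map from the last convolutional layer.
    Returns a (B, c*c) vector: the sum over spatial positions i of the
    outer products f(i, I) f(i, I)^T.
    """
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)                         # one column per spatial position
    bilin = torch.bmm(f, f.transpose(1, 2))               # sum_i f_i f_i^T  -> (B, c, c)
    y = bilin.reshape(b, c * c)
    y = torch.sign(y) * torch.sqrt(torch.abs(y) + 1e-10)  # signed sqrt (assumed)
    return torch.nn.functional.normalize(y, dim=1)        # L2 normalization (assumed)
```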
S43, fusing the feature extraction results and performing classification prediction on them in combination with the preset constraint terms.
Specifically, in the network structure, the fully connected layer of the three tributary networks performs feature fusion by direct serial concatenation, so the feature vector y can be written as the combination of the sub-vectors, y = [y_1, y_2, ..., y_n]. Subsequently, y is classified through a fully connected layer and a softmax (logistic regression) layer to obtain the final predicted output p. This embodiment performs binary classification using two output values that respectively represent whether a gene mutation occurs, instead of selecting a single output and setting a threshold.
In the whole framework, the structure of each tributary network is the same, and the input is the spirally transformed image of each modality, X = [X_1, X_2, ..., X_n]. The feature extraction part uses a convolutional neural network containing a residual structure, and the final feature extraction result is obtained by concatenating the output vectors y_i of the tributary networks. In a particular implementation, the convolutional network framework best suited to the task may be selected. With n sub-networks, the feature vector y can be written as the combination of the sub-vectors, y = [y_1, y_2, ..., y_n] with y_i ∈ R^(c²) (where c is the number of feature map channels output by the convolutional layer):

y = W_c X   (10)

where W_c is the weight of the feature extraction part.
Please refer to fig. 8, which is a flowchart illustrating an analysis prediction method of the multi-modal based deep learning prediction method according to an embodiment of the present invention. As shown in fig. 8, in the present embodiment, S43 includes:
and S431, connecting the feature extraction result to a full connection layer for feature fusion to generate a prediction output result.
Specifically, the result y of the bilinear pooling is classified through a fully connected layer and softmax to obtain the final prediction output p. Two output values are used for the binary classification, respectively representing whether a gene mutation occurs, and the probabilities of the two classes (mutation and non-mutation) sum to 1. Fusing y at the fully connected layer can improve the fine-grained classification performance of the network.
p = softmax(W_f y) = softmax( Σ from i = 1 to n of W_f^(i) y_i )   (11)

where W_f^(1), W_f^(2), ..., W_f^(n) respectively denote the weights of y_1, y_2, ..., y_n at the fully connected layer, i.e., the feature fusion weights.
And S432, combining preset constraint terms, and performing parameter optimization on the prediction model so as to enable the prediction output result to be more accurate.
To fuse prior knowledge into the end-to-end training process, in addition to using the pancreatic cancer gene prediction loss (L_1) to strongly supervise the classification prediction, we also design two constraint terms, an intra-modal feature selection loss (L_2) and an inter-modal prediction constraint loss (L_3), so that the training process is model driven.
Please refer to fig. 9, which is a flowchart illustrating a model optimization process of the multi-modal based deep learning prediction method according to an embodiment of the present invention. As shown in fig. 9, in the present embodiment, S432 includes:
and S432A, supervising the prediction output process through the first constraint item.
Specifically, the prediction of the TP53 gene mutation status is strongly supervised by the given labels, so a generic binary cross-entropy loss function is used. The cross-entropy loss function is the basis of the classification process, and its mathematical expression is given by equation (12):

L_1 = - Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]   (12)

where y_i is the label (1 for mutation, 0 for non-mutation) and p_i is the predicted probability output for the specified class.
And S432B, selecting the features of the feature extraction result through the second constraint term.
Specifically, for better feature selection and feature fusion, some new loss function constraint terms are introduced on the basis of the cross entropy. The model uses a bilinear module to achieve fine-grained classification performance, but the output dimension of the bilinear module is very large and contains a certain amount of redundant information, so feature selection needs to be performed on the output feature vector to drive the redundant feature weights toward 0. On the one hand, features favorable for prediction can thus be selected; on the other hand, excessive features and overfitting of the neural network can be prevented. The feature selection loss in the constraint terms is expressed by equation (13):

L_2 = Σ from i = 1 to n of || W_f^(i) ||_1 = Σ from i = 1 to n of Σ from j = 1 to k_i of | w_j^(i) |   (13)

where W_f^(i) = [w_1^(i), w_2^(i), ..., w_(k_i)^(i)] is the weight of the fully connected layer for the i-th modality, n is the number of modalities, and k_i is the length of the weight vector of the i-th modality.
And S432C, performing constraint among the modalities through the third constraint item, and keeping the diversity of the feature extraction results.
Specifically, the features of the n modalities supplement and complement one another and act together, and during feature selection the selected features may become biased toward a certain modality. To prevent the effect deviation among modalities from becoming too large and to maintain the diversity of the features, a prediction constraint term L_3 is designed, expressed by equation (14):

L_3 = Σ over pairs i < j of | p_i^(1) - p_j^(1) |   (14)

where X is the input, p^(1) is the predicted probability that a gene mutation occurs (p_i^(1) denoting the mutation probability obtained from the i-th modality alone), and W_c is the weight of the feature extraction part. Constraints between the modalities help preserve the diversity of the features.
And S432D, adding the first constraint term, the second constraint term and the third constraint term according to a preset weight, and determining a loss function.
Specifically, the prediction constraint term L_3 and the feature selection term L_2 act together to obtain sparse features and add prior knowledge, so that each modality has a similar predicted value, redundant features are reduced, and the diversity of the features is preserved. The final loss function is a linear combination of the three parts:

L = α L_1 + β L_2 + γ L_3   (15)

where α, β and γ are the weights balancing the three parts of the loss function.
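The three-part loss of equation (15) could be assembled as in the sketch below. The L_2 term is the L1 norm of the per-modality fully connected weight slices, and the L_3 term penalizes pairwise differences between per-modality mutation probabilities, matching the reconstruction above; the exact form of L_3 in the original is an assumption, as is the way the per-modality probabilities are obtained.

```python
import itertools
import torch
import torch.nn.functional as F


def combined_loss(p, labels, fc_weight_slices, per_modality_probs,
                  alpha=1.0, beta=0.001, gamma=0.01):
    """L = alpha*L1 + beta*L2 + gamma*L3 (equation (15)).

    p                  : (B, 2) fused softmax output of the network
    labels             : (B,) gene-mutation labels (0/1)
    fc_weight_slices   : list of W_f^(i) tensors, one slice per modality
    per_modality_probs : list of (B,) mutation probabilities p_i^(1), one per
                         modality (their exact computation is assumed)
    """
    l1 = F.nll_loss(torch.log(p + 1e-10), labels)              # cross entropy, eq. (12)
    l2 = sum(w.abs().sum() for w in fc_weight_slices)          # feature selection, eq. (13)
    l3 = sum((pi - pj).abs().mean()                            # inter-modal constraint, eq. (14)
             for pi, pj in itertools.combinations(per_modality_probs, 2))
    return alpha * l1 + beta * l2 + gamma * l3
```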
S432E, performing parameter optimization on the prediction model by using a gradient descent method to minimize the loss function.
Specifically, the training goal of a convolutional neural network is to minimize the loss function, and in deep learning-based prediction models the parameters are usually updated by gradient descent. The proof that the prediction model of the invention can minimize the constructed loss function by gradient descent is as follows. Substituting equations (12) to (14) into equation (15) gives

L = α L_1 + β Σ from i = 1 to n of || W_f^(i) ||_1 + γ Σ over pairs i < j of | p_i^(1) - p_j^(1) |

Because the per-modality terms appearing in L_3 of equation (14) are scalars, L_3 can be converted into a term of the same absolute-value form over the fully connected weights:

L_3 = Σ over pairs i < j of | W_f^(i) y_i - W_f^(j) y_j |

The gene prediction loss L_1 is a function f(W) of the weights W, so the loss function can be expressed as the sum of f(W) and an L1 regularization term:

L = f(W) + g(W)

where W_f, W_s and W_t are all components of W, and g(W) is a constraint term with L1 regularization.
A model with L1 regularization can be minimized with the proximal gradient descent (PGD) method, which is a special form of gradient descent. Therefore, under the constraint of this loss function, the multi-modal fine-grained classification model can also be solved using proximal gradient descent, and end-to-end training is realized.
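For the L1-regularized part g(W), the proximal step of PGD reduces to soft-thresholding of the fully connected weights; a minimal sketch of one such update is given below (the step size, the split between f and g, and the update granularity are illustrative assumptions).

```python
import torch


def soft_threshold(w, thresh):
    """Proximal operator of the L1 norm: shrink each weight toward zero."""
    return torch.sign(w) * torch.clamp(w.abs() - thresh, min=0.0)


def pgd_step(fc_weight, grad_f, lr, beta):
    """One proximal gradient descent update on the fully connected weights:
    a gradient step on the smooth part f(W), followed by the proximal
    (soft-threshold) step for the beta * ||W||_1 term."""
    with torch.no_grad():
        fc_weight -= lr * grad_f                        # gradient step on f(W)
        fc_weight.copy_(soft_threshold(fc_weight, lr * beta))
    return fc_weight
```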
S44, when the loss function is minimized, determining a final prediction model.
Specifically, in the training process the coefficients of the loss function constraint terms in equation (15) are set to α = 1, β = 0.001 and γ = 0.01, the initial learning rate is set to 0.0001, and after five iterations the learning rate decreases according to a cosine decay:

lr_t = lr_0 × (1/2) × [ 1 + cos( π (t - t_0) / (T - t_0) ) ],  t ≥ t_0

where t_0 (set to 5) is the iteration at which the learning rate starts to change, t is the current iteration, T (set to 20) is the total number of iterations in the training process, and lr_0 (set to 0.0001) is the initial learning rate; the batch size is 32. The computing environment for the entire scheme is an Intel i7-8700 CPU @ 3.20 GHz, 32 GB RAM and a single NVIDIA TITAN X (Pascal) GPU.
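The learning-rate schedule described above (constant for the first t_0 iterations, then cosine decay over the remaining ones) can be sketched as follows; the exact decay formula is the assumed standard cosine form.

```python
import math


def cosine_lr(t, lr0=1e-4, t0=5, total=20):
    """Learning rate at iteration t: constant until t0, then cosine decay."""
    if t < t0:
        return lr0
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * (t - t0) / (total - t0)))


# example: the learning rate used at each of the 20 training iterations
schedule = [cosine_lr(t) for t in range(20)]
```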
After the training of the multi-modal fine-grained classification network is finished, a regression model is trained by taking the 27 predicted values of each amplified case as one sample. The probabilities that each test case is predicted as 1 by the CNN (Convolutional Neural Network) are then input into the regression model to obtain the final prediction result Y.
In terms of gene prediction performance, a hybrid-driven multi-modal pancreatic cancer TP53/KRAS gene mutation prediction model is provided to address the difficulty of predicting pancreatic cancer gene mutation, and the experimental results are analyzed. Specifically, the index data of the five-fold cross-validation experimental results are compiled in tabular form, with Accuracy, AUC (area under the ROC curve), Recall, Precision and F1 score as evaluation indexes; see Table 2.
Table 2: data sheet of experimental results
Accuracy AUC Recall Precision F1score
Cross validation
0 0.8462 0.8500 1.0000 0.8000 0.8889
Cross validation 1 0.6154 0.6500 0.6250 0.7143 0.6667
Cross validation 2 0.7692 0.8500 1.0000 0.7273 0.8421
Cross validation 3 0.8462 0.7250 1.0000 0.8000 0.8889
Cross validation 4 0.7500 0.8125 0.7500 0.8571 0.8000
Average 0.7654 0.7775 0.8750 0.7797 0.8173
Furthermore, the bilinear mechanism can extract discriminative texture information, which is the most important step in fine-grained classification. The bilinear module improves TP53/KRAS gene mutation prediction performance in our model. Specifically, the multi-modal model effects are compiled in tabular form, with Accuracy, area under the ROC curve, Recall, Precision and F1 score as evaluation indexes; see Table 3 for the multi-modal model effect comparison. As shown in Table 3, the model accuracy increased by 9% after adding the bilinear module. In addition to the classical bilinear pooling operation, the spiral transformation preserves texture information and its spatial correlation to the greatest extent, which facilitates feature extraction by the convolutional neural network. Besides clearly improving prediction performance on the multi-modal pancreatic cancer data set, methods such as the spiral transformation and data driving also have reference value for the processing of other data sets.
Table 3: multi-modal model effect comparison table
[Table 3 is provided as an image in the original publication; its values are not reproduced in the text.]
And S45, evaluating the final prediction model through a preset evaluation index.
Specifically, in order to fully evaluate the performance of the model, Accuracy (Accuracy), AUC (area under ROC curve), Recall (Recall), Precision (Precision), and F1score (F1 score) are used as evaluation indexes, which are widely used in the classification field. Each index is defined as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

F1 score = 2 × Precision × Recall / (Precision + Recall)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, FN the number of false negatives, and AUC is the area under the ROC curve.
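These indexes follow directly from the confusion-matrix counts; a small sketch is given below. AUC is usually computed separately from the ranked prediction probabilities (for example with sklearn.metrics.roc_auc_score).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}
```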
In terms of multi-modal fusion, the model-and-data hybrid-driven method adds prior knowledge in the process of fusing features. Specifically, the prediction effects of different loss functions are compiled in tabular form; see Table 4 for the comparison. Five-fold cross validation is used to verify the stability of the model and obtain the final result for predicting TP53 gene changes in pancreatic cancer. The Accuracy, AUC, Recall, Precision and F1 score of the implemented method are 0.7654, 0.7775, 0.8750, 0.7797 and 0.8173, respectively. Compared with simple feature vector splicing, the accuracy of the method is improved by 11%, and the AUC increases from 0.7475 to 0.7775. The hybrid-driven approach also gives better results in terms of precision and F1 score, indicating that the approach of the invention not only performs feature selection but also allows the three modalities to be combined more effectively, complement each other and act together.
Table 4: prediction effect comparison table of different loss functions
Loss value Accuracy AUC Recall Precision F1Score
L1 0.6564 0.7475 0.8750 0.6733 0.7599
αL1+βL2+γL3 0.7654 0.7775 0.8750 0.7797 0.8173
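The combined loss in the second row of Table 4 can be sketched as follows. Purely as an assumption for illustration, L1 is taken to be a cross-entropy supervision term, L2 an L1-norm sparsity term on the fused features (feature selection), and L3 the variance of the per-modality feature norms (inter-modal effect equalization); the exact forms of the constraint terms and the weights α, β, γ used by the invention may differ.

    # Sketch of a composite loss alpha*L1 + beta*L2 + gamma*L3 (forms and weights assumed).
    import torch
    import torch.nn.functional as F

    def composite_loss(logits, labels, fused_features, modality_features,
                       alpha=1.0, beta=0.01, gamma=0.1):
        l1 = F.cross_entropy(logits, labels)      # supervise the prediction output
        l2 = fused_features.abs().mean()          # sparsify the fused (intra-modal) features
        norms = torch.stack([f.norm(dim=1).mean() for f in modality_features])
        l3 = norms.var()                          # equalize the effect of each modality
        return alpha * l1 + beta * l2 + gamma * l3

    logits = torch.randn(4, 2)
    labels = torch.randint(0, 2, (4,))
    fused = torch.randn(4, 48)
    per_modality = [torch.randn(4, 16) for _ in range(3)]
    print(composite_loss(logits, labels, fused, per_modality))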
During testing, the final prediction performance is greatly improved by the combined action of intra-modal feature sparsification and inter-modal effect equalization. Please refer to fig. 10, which shows the effects of different loss functions in an embodiment of the multi-modal based deep learning prediction method of the present invention. As shown in fig. 10, the ROC curves (receiver operating characteristic curves) and PR curves (precision-recall curves) of the two models are plotted to visualize the relationships between sensitivity and specificity and between precision and recall. The results show that the curves of the multi-modal based deep learning prediction method lie closer to the upper left corner of the ROC plot and the upper right corner of the PR plot respectively, indicating better performance.
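A minimal way to reproduce the style of the curves in fig. 10 is sketched below with scikit-learn and matplotlib; the two score arrays are synthetic stand-ins for the hybrid-driven model and a baseline, so the plotted curves only illustrate the plotting procedure, not the reported performance.

    # Sketch of ROC and PR curves for two models (scores are synthetic stand-ins).
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, precision_recall_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=200)
    scores = {"hybrid-driven": y_true * 0.6 + rng.random(200) * 0.4,
              "baseline": y_true * 0.3 + rng.random(200) * 0.7}

    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
    for name, s in scores.items():
        fpr, tpr, _ = roc_curve(y_true, s)
        prec, rec, _ = precision_recall_curve(y_true, s)
        ax_roc.plot(fpr, tpr, label=name)
        ax_pr.plot(rec, prec, label=name)
    ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curve")
    ax_pr.set(xlabel="Recall", ylabel="Precision", title="PR curve")
    ax_roc.legend(); ax_pr.legend()
    plt.show()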
The present embodiment provides a computer storage medium having stored thereon a computer program that, when executed by a processor, implements the multi-modality based deep learning prediction method.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned computer-readable storage media comprise: various computer storage media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The multi-modal based deep learning prediction method is driven jointly by the model and the data, takes the diversity among features into account, and can fully exploit the correlation among the modalities. In addition, the data are amplified using a new spiral transformation, which effectively increases the amount of information in the training set and helps the model obtain better robustness.
Example two
The present embodiment provides a multi-modal based deep learning prediction system, which includes:
a data acquisition module to acquire an image dataset comprising image data of at least two modalities;
a feature extraction module, configured to perform feature extraction on the image data to generate a feature extraction result corresponding to each modality;
and a prediction module, configured to fuse, classify and predict the feature extraction results in combination with preset constraint terms.
The multi-modal based deep learning prediction system provided by the present embodiment will be described in detail with reference to the drawings. It should be noted that the division into the following modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. For example, a module may be a separately established processing element, or may be integrated into a chip of the system described below. Further, a module may be stored in the memory of the system in the form of program code, and a processing element of the system may call and execute the function of that module. The other modules are implemented similarly. All or part of the modules can be integrated together or implemented independently. The processing element described herein may be an integrated circuit with signal processing capability. In implementation, the steps of the above method or the following modules may be implemented by hardware integrated logic circuits in a processor element or by instructions in the form of software.
The following modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), one or more Field Programmable Gate Arrays (FPGAs), and the like. When some of the following modules are implemented in the form of a program code called by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or other processor capable of calling the program code. These modules may be integrated together and implemented in the form of a System-on-a-chip (SOC).
Please refer to fig. 11, which is a schematic structural diagram of a multi-modal based deep learning prediction system according to an embodiment of the present invention. As shown in fig. 11, the multi-modal based deep learning prediction system 5 includes: a data acquisition module 51, a feature extraction module 52 and a prediction module 53.
The data acquisition module 51 is configured to acquire an image dataset comprising image data of at least two modalities.
In this embodiment, the image dataset is a two-dimensional image dataset after a spiral transformation and data amplification.
The feature extraction module 52 is configured to perform feature extraction on the image data to generate a feature extraction result corresponding to each modality.
In this embodiment, the feature extraction module 52 is specifically configured to perform feature extraction on the image data through a convolutional neural network, where the convolutional neural network includes a residual structure and a bilinear pooling structure.
The prediction module 53 is configured to fuse, classify and predict the feature extraction results in combination with preset constraint terms.
In this embodiment, the prediction module 53 is specifically configured to connect the feature extraction results to a fully connected layer for feature fusion so as to generate a prediction output result, and to perform parameter optimization of the prediction model in combination with preset constraint terms so that the prediction output result becomes more accurate.
In particular, the prediction module 53 is configured to supervise the prediction output process through the first constraint term; to perform feature selection on the feature extraction results through the second constraint term; to apply constraints among the modalities through the third constraint term so as to preserve the diversity of the feature extraction results; to add the first, second and third constraint terms according to preset weights to determine a loss function; and to perform parameter optimization of the prediction model using a gradient descent method so as to minimize the loss function.
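As an illustrative sketch of how the prediction module 53 might concatenate the per-modality features, fuse them through a fully connected layer and update the parameters by gradient descent on such a loss, consider the PyTorch fragment below; the layer sizes, the plain SGD optimiser, the constraint weight and the use of only the supervision and sparsity terms are assumptions rather than the patent's actual configuration.

    # Sketch of feature fusion through a fully connected layer plus one gradient-descent step.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FusionPredictor(nn.Module):
        def __init__(self, dims=(256, 256, 256), hidden=64, n_classes=2):
            super().__init__()
            self.fuse = nn.Linear(sum(dims), hidden)     # fully connected fusion layer
            self.classify = nn.Linear(hidden, n_classes)

        def forward(self, modality_features):
            fused = F.relu(self.fuse(torch.cat(modality_features, dim=1)))
            return self.classify(fused), fused

    model = FusionPredictor()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # gradient descent on the loss

    feats = [torch.randn(8, 256) for _ in range(3)]            # dummy features for three modalities
    labels = torch.randint(0, 2, (8,))
    logits, fused = model(feats)
    loss = F.cross_entropy(logits, labels) + 0.01 * fused.abs().mean()  # supervision + sparsity
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    print(float(loss))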
The multi-modal based deep learning prediction system is driven jointly by the model and the data, takes the diversity among features into account, and can fully exploit the correlation among the modalities. In addition, the data are amplified using a new spiral transformation, which effectively increases the amount of information in the training set and helps the model obtain better robustness.
EXAMPLE III
This embodiment provides a device, the device comprising: a processor, a memory, a transceiver, a communication interface and/or a system bus. The memory and the communication interface are connected with the processor and the transceiver through the system bus and communicate with one another; the memory is used for storing a computer program; the communication interface is used for communicating with other devices; and the processor and the transceiver are used for running the computer program so that the device executes the steps of the multi-modal based deep learning prediction method.
The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The Memory may include a Random Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components.
The protection scope of the multi-modal based deep learning prediction method is not limited to the execution sequence of the steps listed in this embodiment; all schemes in which steps are added, removed or replaced on the basis of the prior art according to the principle of the present invention fall within the protection scope of the invention.
The invention also provides a multi-modal based deep learning prediction system capable of implementing the multi-modal based deep learning prediction method; however, the apparatus for implementing the method includes, but is not limited to, the structure of the system described herein, and all structural modifications and substitutions of the prior art made according to the principle of the invention fall within the protection scope of the invention. It should be noted that the multi-modal based deep learning prediction method and system are also applicable to other multimedia content, such as videos and social-feed (friend circle) messages, which likewise fall within the protection scope of the present invention.
In summary, for multi-modal fusion, the model-and-data hybrid-driven method adds prior knowledge to the feature fusion process; compared with simple feature vector concatenation, the method achieves higher accuracy and a larger AUC (area under the ROC curve, where ROC denotes the receiver operating characteristic curve). In addition, the hybrid-driven method obtains better results in terms of precision; it not only plays the role of feature selection but also enables the multiple modalities to be combined more effectively, so that they complement each other and act together. During testing, the final prediction performance is greatly improved by the combined action of intra-modal feature sparsification and inter-modal effect equalization. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical concept disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multi-mode-based deep learning prediction method is characterized by comprising the following steps:
acquiring an image dataset comprising image data of at least two modalities;
performing feature extraction on the image data to generate a feature extraction result corresponding to each modality;
and fusing, classifying and predicting the feature extraction results in combination with preset constraint terms.
2. The multi-modality based deep learning prediction method of claim 1,
the image data set is a two-dimensional image data set after spiral transformation and data amplification.
3. The multi-modality based deep learning prediction method of claim 1,
and performing feature extraction on the image data through a convolutional neural network, wherein the convolutional neural network comprises a residual structure and a bilinear pooling structure.
4. The multi-modal-based deep learning prediction method according to claim 3, wherein the step of fusing and classifying the feature extraction results in combination with preset constraint terms comprises:
connecting the feature extraction result to a full connection layer for feature fusion to generate a prediction output result;
and performing parameter optimization on the prediction model in combination with preset constraint terms, so that the prediction output result becomes more accurate.
5. The multi-modality-based deep learning prediction method of claim 4, wherein the preset constraint terms include a first constraint term, a second constraint term and a third constraint term; the step of optimizing the parameters of the prediction model by combining the preset constraint term comprises the following steps:
supervising a prediction output process by the first constraint term;
performing feature selection on the feature extraction result through the second constraint term;
performing constraint among the modalities through the third constraint term, so as to preserve the diversity of the feature extraction results;
adding the first constraint term, the second constraint term and the third constraint term according to a preset weight to determine a loss function;
and performing parameter optimization on the prediction model by using a gradient descent method so as to minimize the loss function.
6. The multi-modality based deep learning prediction method of claim 5,
and performing parameter optimization in the prediction model by using an iteration principle of a gradient descent method.
7. The multi-modality based deep learning prediction method of claim 5, further comprising:
determining a final prediction model when the loss function is minimized;
and evaluating the final prediction model through a preset evaluation index.
8. A multi-modal based deep learning prediction system, the multi-modal based deep learning prediction system comprising:
a data acquisition module to acquire an image dataset comprising image data of at least two modalities;
a feature extraction module, configured to perform feature extraction on the image data to generate a feature extraction result corresponding to each modality;
and a prediction module, configured to fuse, classify and predict the feature extraction results in combination with preset constraint terms.
9. A medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, implements the multi-modality based deep learning prediction method of any one of claims 1 to 7.
10. An apparatus, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the apparatus to perform the multi-modality based deep learning prediction method of any one of claims 1 to 7.
CN202010098684.9A 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment Active CN111275130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010098684.9A CN111275130B (en) 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010098684.9A CN111275130B (en) 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment

Publications (2)

Publication Number Publication Date
CN111275130A true CN111275130A (en) 2020-06-12
CN111275130B CN111275130B (en) 2023-09-08

Family

ID=71002149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010098684.9A Active CN111275130B (en) 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment

Country Status (1)

Country Link
CN (1) CN111275130B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016100814A1 (en) * 2014-12-19 2016-06-23 United Technologies Corporation Multi-modal sensor data fusion for perception systems
US20190347461A1 (en) * 2017-04-26 2019-11-14 South China University Of Technology Three-dimensional finger vein recognition method and system
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A kind of cross-module state association learning method based on more granularity hierarchical networks
CN108182441A (en) * 2017-12-29 2018-06-19 华中科技大学 Parallel multichannel convolutive neural network, construction method and image characteristic extracting method

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069717A (en) * 2020-08-19 2020-12-11 五邑大学 Magnetic storm prediction method and device based on multi-mode representation learning and storage medium
CN111915698A (en) * 2020-08-21 2020-11-10 南方科技大学 Vascular infiltration detection method and device, computer equipment and storage medium
CN111915698B (en) * 2020-08-21 2024-06-07 南方科技大学 Vascular infiltration detection method, vascular infiltration detection device, computer equipment and storage medium
CN112163374A (en) * 2020-09-27 2021-01-01 中国地质调查局发展研究中心 Processing method for multi-modal data intermediate layer fusion fully-connected geological map prediction model
CN112163374B (en) * 2020-09-27 2024-02-20 中国地质调查局自然资源综合调查指挥中心 Processing method for multi-modal data intermediate layer fusion full-connection geological map prediction model
CN112183669A (en) * 2020-11-04 2021-01-05 北京航天泰坦科技股份有限公司 Image classification method and device, equipment and storage medium
CN112183669B (en) * 2020-11-04 2024-02-13 航天科工(北京)空间信息应用股份有限公司 Image classification method, device, equipment and storage medium
CN112687022A (en) * 2020-12-18 2021-04-20 山东盛帆蓝海电气有限公司 Intelligent building inspection method and system based on video
CN113132931B (en) * 2021-04-16 2022-01-28 电子科技大学 Depth migration indoor positioning method based on parameter prediction
CN113132931A (en) * 2021-04-16 2021-07-16 电子科技大学 Depth migration indoor positioning method based on parameter prediction
WO2022221991A1 (en) * 2021-04-19 2022-10-27 深圳市深光粟科技有限公司 Image data processing method and apparatus, computer, and storage medium
CN113345576A (en) * 2021-06-04 2021-09-03 江南大学 Rectal cancer lymph node metastasis diagnosis method based on deep learning multi-modal CT
WO2023037298A1 (en) * 2021-09-08 2023-03-16 Janssen Research & Development, Llc Multimodal system and method for predicting cancer
CN114170162A (en) * 2021-11-25 2022-03-11 深圳先进技术研究院 Image prediction method, image prediction device and computer storage medium
CN114399108A (en) * 2022-01-13 2022-04-26 北京智进未来科技有限公司 Tea garden yield prediction method based on multi-mode information
WO2024108483A1 (en) * 2022-11-24 2024-05-30 中国科学院深圳先进技术研究院 Multimodal neural biological signal processing method and apparatus, and server and storage medium
CN116168258B (en) * 2023-04-25 2023-07-11 之江实验室 Object classification method, device, equipment and readable storage medium
CN116168258A (en) * 2023-04-25 2023-05-26 之江实验室 Object classification method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN111275130B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN111275130B (en) Multi-mode-based deep learning prediction method, system, medium and equipment
Naylor et al. Segmentation of nuclei in histopathology images by deep regression of the distance map
Avanzo et al. Machine and deep learning methods for radiomics
Wang et al. Central focused convolutional neural networks: Developing a data-driven model for lung nodule segmentation
Dou et al. Multilevel contextual 3-D CNNs for false positive reduction in pulmonary nodule detection
US20220367053A1 (en) Multimodal fusion for diagnosis, prognosis, and therapeutic response prediction
Zhou et al. Histopathology classification and localization of colorectal cancer using global labels by weakly supervised deep learning
Lee et al. Random forest based lung nodule classification aided by clustering
Mahapatra et al. Active learning based segmentation of Crohns disease from abdominal MRI
Liao et al. A segmentation method for lung parenchyma image sequences based on superpixels and a self-generating neural forest
US11922625B2 (en) Predicting overall survival in early stage lung cancer with feature driven local cell graphs (FeDeG)
CN111798424B (en) Medical image-based nodule detection method and device and electronic equipment
Feng et al. Supervoxel based weakly-supervised multi-level 3D CNNs for lung nodule detection and segmentation
Murtaza et al. Ensembled deep convolution neural network-based breast cancer classification with misclassification reduction algorithms
Hu et al. Detection and segmentation of lymphomas in 3D PET images via clustering with entropy-based optimization strategy
Bourigault et al. Multimodal PET/CT tumour segmentation and prediction of progression-free survival using a full-scale UNet with attention
Saha et al. DEMARCATE: Density-based magnetic resonance image clustering for assessing tumor heterogeneity in cancer
Wang et al. Computer-aided diagnosis based on extreme learning machine: a review
Tan et al. Pulmonary nodule detection using hybrid two‐stage 3D CNNs
Liu et al. AHU-MultiNet: Adaptive loss balancing based on homoscedastic uncertainty in multi-task medical image segmentation network
Pandey et al. A systematic review of modern approaches in healthcare systems for lung cancer detection and classification
CN114821137A (en) Multi-modal tumor data fusion method and device
Murmu et al. Deep learning model-based segmentation of medical diseases from MRI and CT images
Peng et al. A multi-center study of ultrasound images using a fully automated segmentation architecture
Lin et al. SGCL: Spatial guided contrastive learning on whole-slide pathological images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant