CN111275130B - Multi-mode-based deep learning prediction method, system, medium and equipment - Google Patents

Multi-mode-based deep learning prediction method, system, medium and equipment

Info

Publication number
CN111275130B
CN111275130B (application CN202010098684.9A)
Authority
CN
China
Prior art keywords
constraint
mode
deep learning
feature extraction
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010098684.9A
Other languages
Chinese (zh)
Other versions
CN111275130A (en)
Inventor
钱晓华 (Qian Xiaohua)
陈夏晗 (Chen Xiahan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010098684.9A priority Critical patent/CN111275130B/en
Publication of CN111275130A publication Critical patent/CN111275130A/en
Application granted granted Critical
Publication of CN111275130B publication Critical patent/CN111275130B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-mode-based deep learning prediction method, system, medium and device. The method comprises the following steps: acquiring an image dataset comprising image data of at least two modalities; extracting features from the image data to generate a feature extraction result for each modality; and fusing the feature extraction results in combination with preset constraint terms to perform classification prediction. The invention designs a multi-modal network structure: a convolutional neural network extracts features from the image of each modality, the features are then fused at a fully connected layer in combination with the constraint terms, and the feature information of the different modalities is combined to obtain the final classification result. In this way the information characteristics of each single modality are preserved, multi-modal information can be used comprehensively, and the reliability of the final decision is improved.

Description

Multi-mode-based deep learning prediction method, system, medium and equipment
Technical Field
The invention belongs to the technical field of deep learning and relates to a learning prediction method, in particular to a multi-mode-based deep learning prediction method, system, medium and device.
Background
In the prior art, radiomics and deep learning methods applied to three-dimensional images have achieved certain results, for example in the noninvasive evaluation of genetic changes using CT (computed tomography), MRI (magnetic resonance imaging) and similar images. However, deep learning still has a number of shortcomings. Image datasets are often small, and too little data easily causes overfitting during model training. Making full and reasonable use of three-dimensional image information also remains difficult: on the one hand, 3D neural networks have numerous parameters and a large computational load, occupying substantial computing resources; on the other hand, the information in a 2D section is often insufficient to comprehensively represent the three-dimensional characteristics of a tumor. In addition, single-modality images carry limited information, and existing multi-modal fusion methods are simplistic. Most multi-modal deep learning models directly concatenate the extracted features and feed them into a fully connected layer for feature selection and fusion, ignoring the diversity among the features; this easily introduces bias during training, makes feature selection differ too much between modalities, fails to make full use of multi-modal information, and leads to poor prediction performance.
Therefore, how to provide a multi-mode-based deep learning prediction method, system, medium and device that overcome the inability of the prior art to combine the diversity among features into a multi-modal deep learning model and to perform efficient classification prediction is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a multi-modal-based deep learning prediction method, system, medium and apparatus, which are used for solving the problem that the prior art cannot effectively combine the diversity among features to generate a multi-modal deep learning model and predict efficiently.
To achieve the above and other related objects, an aspect of the present invention provides a multi-modal based deep learning prediction method, including: acquiring an image dataset comprising image data of at least two modalities; extracting features of the image data to generate feature extraction results corresponding to each mode; and combining a preset constraint item to fuse the feature extraction results and carrying out classified prediction.
In an embodiment of the invention, the image dataset is a two-dimensional image dataset after spiral transformation and data amplification.
In one embodiment of the present invention, the image data is feature extracted by a convolutional neural network, which includes a residual structure and a bilinear pooling structure.
In an embodiment of the present invention, the step of classifying and predicting the feature extraction result by combining a preset constraint term includes: connecting the feature extraction result to a full connection layer for feature fusion to generate a prediction output result; and carrying out parameter optimization on the prediction model by combining with a preset constraint term so as to enable the prediction output result to be more accurate.
In an embodiment of the present invention, the preset constraint items include a first constraint item, a second constraint item, and a third constraint item; the step of parameter optimization of the prediction model by combining the preset constraint items comprises the following steps: monitoring a prediction output process through the first constraint item; performing feature selection on the feature extraction result through the second constraint item; constraint among modes is carried out through the third constraint item, and diversity of the feature extraction result is kept; adding the first constraint item, the second constraint item and the third constraint item according to preset weights, and determining a loss function; the predictive model is parameter optimized using a gradient descent method to minimize the loss function.
In one embodiment of the invention, the parameter optimization is performed in the predictive model by the iterative principle of the gradient descent method.
In an embodiment of the present invention, the multi-mode-based deep learning prediction method further includes: determining a final predictive model when the loss function is minimized; and evaluating the final prediction model through a preset evaluation index.
In another aspect, the present invention provides a multi-modal based deep learning prediction system, including: the data acquisition module is used for acquiring an image data set, wherein the image data set comprises image data of at least two modes; the feature extraction module is used for carrying out feature extraction on the image data so as to generate a feature extraction result corresponding to each mode; and the prediction module is used for fusing the feature extraction results and classifying and predicting the feature extraction results by combining with a preset constraint item.
In yet another aspect, the present invention provides a medium having stored thereon a computer program which, when executed by a processor, implements the multi-modality based deep learning prediction method.
In a final aspect the invention provides an apparatus comprising: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored by the memory so as to enable the device to execute the multi-mode-based deep learning prediction method.
As described above, the multi-mode-based deep learning prediction method, system, medium and device provided by the invention have the following beneficial effects:
in terms of multi-modal fusion, the hybrid model- and data-driven method adds prior knowledge during feature fusion; compared with simple feature-vector concatenation, the accuracy of the method is higher and the AUC (area under the ROC curve, where ROC denotes the receiver operating characteristic curve) is increased. In addition, the hybrid data- and model-driven method also obtains better results in terms of precision: it not only performs feature selection, but also allows the modalities to be combined more effectively so that they complement each other and act together. In testing, the jointly acting intra-modality feature sparsification and inter-modality effect equalization greatly improve the final prediction performance.
Drawings
FIG. 1 is a diagram illustrating an exemplary dataset of a multi-modal based deep learning prediction method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an application scenario of the multi-mode-based deep learning prediction method according to an embodiment of the invention.
FIG. 3 is a schematic diagram illustrating the construction of a coordinate system in an embodiment of the multi-modal based deep learning prediction method according to the present invention.
FIG. 4 is a schematic flow chart of a multi-modal based deep learning prediction method according to an embodiment of the invention.
Fig. 5 is a schematic diagram of data transformation of an embodiment of the multi-mode-based deep learning prediction method according to the present invention.
FIG. 6 is a diagram showing two data amplification effects of the multi-modal based deep learning prediction method according to an embodiment of the invention.
Fig. 7 is a schematic diagram showing a data distribution of a multi-mode-based deep learning prediction method according to an embodiment of the invention.
FIG. 8 is a flow chart illustrating the analysis and prediction of the multi-modal based deep learning prediction method according to an embodiment of the invention.
FIG. 9 is a flow chart of model optimization in an embodiment of the multi-modal based deep learning prediction method of the present invention.
FIG. 10 shows effect graphs of different loss functions of the multi-mode-based deep learning prediction method according to an embodiment of the invention.
FIG. 11 is a schematic diagram illustrating the structure of a multi-modal based deep learning prediction system according to an embodiment of the invention.
Description of element reference numerals
5. Multi-mode-based deep learning prediction system
51. Data acquisition module
52. Feature extraction module
53. Prediction module
Steps S41 to S45
Steps S431 to S432
Steps S432A to S432E
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention by way of specific examples. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit and scope of the invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other without conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
The invention provides a multi-mode-based deep learning prediction method, system, medium and device. The input of the model is a multi-modal three-dimensional image, which is preprocessed and converted to a two-dimensional plane by spiral transformation; the output of the model is the final probability value. The whole model comprises three parts: spiral-transformation preprocessing and data amplification, feature extraction, and feature fusion. A new model-driven loss-function constraint term is also proposed.
Example 1
The embodiment provides a multi-mode-based deep learning prediction method, which comprises the following steps:
acquiring an image dataset comprising image data of at least two modalities;
extracting features of the image data to generate feature extraction results corresponding to each mode;
and combining a preset constraint item to fuse the feature extraction results and carrying out classified prediction.
The multi-modal-based deep learning prediction method provided by the present embodiment will be described in detail below with reference to the drawings.
Referring to fig. 1, an exemplary diagram of a data set of an embodiment of a multi-modal based deep learning prediction method according to the present invention is shown. The multi-mode-based deep learning prediction method of the present invention is applicable to any data set in which the target region is approximately a sphere, and in this embodiment, the multi-mode-based deep learning prediction method is described in detail by taking the pancreatic cancer image data set as an example.
Pancreatic cancer is one of the most lethal malignant tumors; it is characterized by late diagnosis, a high mortality rate and a low overall survival rate, with a five-year survival rate below 3.5 percent. Among patients, 75 percent carry TP53 gene mutations and 75 to 90 percent carry KRAS gene mutations. TP53 is a tumor suppressor gene encoding the P53 protein, which can inhibit cell proliferation in many cellular processes, while the proto-oncogene KRAS is closely related to cell division, differentiation and apoptosis. Mutated TP53/KRAS genes promote tumor cell proliferation, invasion and survival. In the treatment of pancreatic cancer, the mutation status of these genes is closely related to assessing patient prognosis and choosing a reasonable treatment. At present, surgical resection or needle biopsy is the main way to detect TP53/KRAS gene mutations, but it suffers from limitations, blindness and invasiveness, so there is a great clinical demand for noninvasive techniques to detect the genetic changes of tumors.
In recent years, noninvasive evaluation of genetic changes of living tissues using images is a hotspot and key of many studies. Tumors are characterized by somatic mutations, and changes in genes are ultimately reflected in tumor phenotype. The specific expression includes that quantitative characteristics such as the intensity, shape, size or volume of the tumor in the image and texture provide information of tumor phenotype and microenvironment.
In the related research, many methods of image histology have achieved certain results. For example, eran et al demonstrated a correlation between CT images and gene expression of primary human liver cancer; coudray et al uses lung cancer histopathological images to extract histograms thereof and predict the conditions of TP53 and other gene mutations; some studies extract part of radiological features such as shape, texture, density, etc., perform feature selection, and then build machine learning models to predict transformation of genes.
In another aspect, the method of deep learning is also applied to image prediction of changes in tumor markers. For example, constructing a 3D convolutional neural network (3D-CNN), and directly classifying three-dimensional tumor images; wang et al takes two-dimensional slice images as input, constructs a deep learning model, and visualizes the predicted region with an attention map.
The above problems become more pronounced in the prediction of pancreatic cancer TP53/KRAS gene mutations, which makes this prediction task very challenging. The existing methods have the following main shortcomings. First, pancreatic tumors are small, so automatic segmentation is very difficult; they are tightly connected to the surrounding tissues and show similar intensities, so they are hard to identify, manual segmentation is time-consuming, and labeling tumor regions is also a great challenge for junior doctors. Therefore, the invention does not perform tumor segmentation and predicts pancreatic cancer TP53/KRAS based on a deep learning method. Second, in clinical diagnosis the mutation status of a gene must be confirmed by pathological tissue biopsy; the procedure is inconvenient, the test cycle is long, obtaining valid labels is expensive, and the amount of data is small. To make full use of the tumor information, this embodiment provides a new data amplification method: based on the spiral transformation, each amplified image contains somewhat different information while the spatial correlation of information such as tumor texture is preserved. Third, reliable diagnosis of genetic changes is difficult to obtain from single-modality images alone, and it is not easy to acquire multi-modal MRI images and fuse the multi-modal information effectively. In this embodiment, the feature information of the different modalities is analyzed in depth, constraint terms of the loss function are designed, the fully connected weights within each modality and the predicted values between modalities are constrained, and the diversity and correlation of the information between modalities are fully utilized.
Referring to fig. 2, a schematic diagram of an application scenario of the multi-mode-based deep learning prediction method according to an embodiment of the invention is shown. This embodiment provides a multi-modal network structure: for the image of each modality, the three-dimensional image data is first converted into two-dimensional image data using the spiral transformation, features are then extracted by a convolutional neural network, feature fusion is performed at the fully connected layer in combination with the constraint terms, and the feature information of the different modalities is combined to obtain the final classification prediction result. In this way the information characteristics of each single modality are preserved, multi-modal information can be used comprehensively, and the reliability of the final decision is improved. For the spiral transformation, please refer to fig. 3, which shows a schematic diagram of the coordinate-system construction in an embodiment of the multi-mode-based deep learning prediction method of the present invention. As shown in fig. 3, a rectangular spatial coordinate system is established with O as the origin of coordinates. In three dimensions, a point A on the spiral is determined by the azimuth angle ψ, the polar angle Θ and the distance r to the origin.
Referring to fig. 4, a schematic flow chart of a multi-mode-based deep learning prediction method according to an embodiment of the invention is shown. In this embodiment, a multi-modal-based deep learning prediction method including a spiral transformation and a data amplification method is described in detail by the acquisition of pancreatic cancer image data. As shown in fig. 4, the multi-mode-based deep learning prediction method specifically includes the following steps:
S41, acquiring an image data set, wherein the image data set comprises image data of at least two modalities.
In this embodiment, the image dataset is a two-dimensional image dataset after spiral transformation and data amplification.
Specifically, the data is preprocessed by a spiral transformation method before the deep learning framework is constructed. The three-dimensional information is fully utilized, the three-dimensional target area is spirally unfolded to a two-dimensional plane, the correlation between original adjacent pixels is reserved in the transformation process, and then the transformed image is used for predicting gene mutation.
The acquired image dataset is a pancreatic cancer image dataset. The pancreatic cancer data are acquired from magnetic resonance images of pancreatic cancer patients, and the acquired data must contain image information of multiple parameters. In this embodiment, MRI data of 64 patients in three modalities are used: ADC (apparent diffusion coefficient imaging), DWI (diffusion weighted imaging) and T2 (transverse relaxation time weighted imaging); the data of the three modalities are image data corresponding to three different imaging parameters. The location of the tumor has already been determined in the image data. In this example, the dataset came from pancreatic cancer patients who underwent surgery at Ruijin Hospital from January 2016 to December 2016, and each case includes the pathological examination results of the tumor, i.e., the mutation status of the TP53 gene (a tumor suppressor gene) and the KRAS gene (a proto-oncogene).
Specifically, a point inside the tumor (such as the center point of the tumor) in the original three-dimensional MRI is selected as the midpoint O of the spiral transformation, and the maximum distance from the tumor edge to the point O determines the maximum radius R of the spiral transformation. As shown in fig. 3, a rectangular spatial coordinate system is established with O as the origin of coordinates. In three dimensions, a point A on the spiral is determined by the azimuth angle ψ, the polar angle Θ and the distance r to the origin. According to the coordinate-system conversion relation, the coordinates of point A can be expressed as:

A = (x, y, z) = (r sinΘ cosψ, r sinΘ sinψ, r cosΘ)    (1)
the key to the spiral transformation is to construct the relationship of the two angles Θ and ψ. Depending on the requirements we can construct different relations. For example, to have the sampling points evenly distributed at the two poles and equator of the sphere, the radian between the fixed sampling points is constant. Let the circle on the equator have 2N sampling points, define the sampling radian as the distance d between two points on the equator:
setting the number of horizontal plane sampling points corresponding to the angle theta asSetting Θ to be divided into N angles in the value range, if N is large enough under the condition of the specified radius, the total sampling point number can be obtained through integral calculation of a formula (3):
from the above, the surface sampling total point number of the sphere with the specified radius
Knowing the coordinates of point A, the arc between two adjacent points can be expressed as ψ*·sinΘ, so Θ and ψ satisfy the relationship expressed by equation (4):

ψ* · sinΘ = d    (4)

where ψ* is the difference between the azimuth angles ψ (measured from the positive x-axis direction) of two adjacent coordinate points.
Similarly, in practical applications different rotation rules can be used to establish different relations between Θ and ψ: for example, Θ and ψ can be varied uniformly over their value ranges, the surface density and volume density of the sampling points can be made equal, or a targeted sampling-point distribution can be designed for a particular target object.
The gray value of each sampling point is then calculated by trilinear interpolation, with its coordinates in three-dimensional space mapped to positions in the original matrix. Finally, the gray values are filled into a two-dimensional matrix to obtain the two-dimensional image expanded by the spiral transformation.
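The sampling just described can be sketched in code. This is a minimal illustration only, assuming a NumPy/SciPy environment, the spherical coordinates of equation (1) and the fixed-arc rule of equation (4); the parameter names and default values (max_radius, n_rings, n_radial) are illustrative and not taken from the patent.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def spiral_transform(volume, center, max_radius=60, n_rings=20, n_radial=120):
    """Unroll a roughly spherical 3D region of `volume` around `center` (z, y, x)
    onto a 2D image: one column per spiral point, one row per radial sample."""
    # Build the spiral path (theta_k, psi_k): psi advances so that the arc
    # psi_step * sin(theta) stays equal to the equatorial arc d (eq. 4),
    # and theta steps ring by ring from one pole towards the other.
    d = np.pi / n_rings                     # sampling arc on the equator (eq. 2)
    thetas, psis = [], []
    theta, psi = d, 0.0
    while theta < np.pi - d:
        thetas.append(theta)
        psis.append(psi)
        psi += d / np.sin(theta)
        if psi >= 2.0 * np.pi:              # one full turn: move towards the equator
            psi -= 2.0 * np.pi
            theta += d
    thetas, psis = np.asarray(thetas), np.asarray(psis)

    # Cartesian sampling coordinates along every radius (eq. 1), offset by O.
    radii = np.linspace(0.0, max_radius, n_radial)
    x = center[2] + radii[:, None] * np.sin(thetas) * np.cos(psis)
    y = center[1] + radii[:, None] * np.sin(thetas) * np.sin(psis)
    z = center[0] + radii[:, None] * np.cos(thetas)
    coords = np.stack([z.ravel(), y.ravel(), x.ravel()])

    # order=1 gives (tri)linear interpolation of the gray values.
    plane = map_coordinates(volume, coords, order=1, mode='nearest')
    return plane.reshape(n_radial, thetas.size)
```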
Most two-dimensional convolutional neural networks use cross-sectional slices as inputs to the network, containing only two-dimensional information for one slice. However, each layer of the three-dimensional target region has a strong spatial correlation, and a simple two-dimensional section ignores the layer-to-layer correlation. Meanwhile, the visual angle of the cross section is single, the image characteristics of other visual angles cannot be comprehensively represented, and the three-dimensional texture characteristics are not fully represented. The spiral transformation is to sequentially expand an image from a three-dimensional space to a two-dimensional space from two poles to the equator with a radius or diameter in the three-dimensional space as an axis. The transformation method maintains the correlation of the characteristics such as textures and the like in the 3D space to a certain extent. For one sample, the two-dimensional image obtained by spiral transformation contains more comprehensive and complete three-dimensional information than the two-dimensional image obtained by one section, and a high-quality data set is provided for classifying by using a neural network subsequently.
In particular, the present invention applies this method of spiral transformation to data amplification. The purpose of data amplification is to increase the diversity of the data in the sample against network overfitting. However, the most commonly used geometric transformation data amplification method hardly changes the information amount of the original data, and the data before and after amplification are very similar, so that the improvement of model results is limited.
The most commonly used data amplification methods are geometric transformations of the image, such as horizontal flipping of a two-dimensional image, scaling within a small range of multiples (e.g., 0.8-1.15), rotation, etc. These methods increase the amount of data to some extent, but the transformation results are all from the original data. For example, the horizontal flip changes only the view angle of the two-dimensional image, hardly changes the information amount of the data set, and the data before and after amplification are very similar, thus limiting the effect of model prediction.
The result of the spiral transformation depends on the constructed rectangular spatial coordinate system and on the parameter settings of the transformation. With the same parameters, different spiral transformation results can be obtained for the same three-dimensional data by constructing different coordinate systems. To compare the differences between the transformation results under two coordinate systems, the same coordinate origin and positive z-axis direction are kept and only the positive direction of the x-axis is changed. Assuming the positive direction of the x-axis changes by Δψ, the corresponding coordinates A' of point A can be expressed as:

A' = (x', y', z') = (r sinΘ cos(ψ + Δψ), r sinΘ sin(ψ + Δψ), r cosΘ)    (5)

If A'(x', y', z') = A(x, y, z), combining equation (1) and equation (5) gives equation (6):

cosψ = cos(ψ + Δψ),  sinψ = sin(ψ + Δψ)    (6)

which simplifies to formula (7):

cosΔψ = 1    (7)

Solving this system shows that A'(x', y', z') = A(x, y, z) if and only if Δψ = 2πk. This means that different spiral transformation results can be obtained by changing the angle of the positive x-axis direction within the XOY plane of the spatial coordinate system. Referring to fig. 5, a schematic diagram of data transformation in an embodiment of the multi-mode-based deep learning prediction method according to the present invention is shown; it presents the spiral transformation results after three geometric transformations.
Similarly, in addition to changing the positive direction angle of the coordinate axis, different spiral transformation results can be obtained by using other transformation modes for the same three-dimensional data. Such as changing the origin position of the coordinate system, geometrically transforming the original data, changing parameters of the spiral transformation (including the number of rotations, sampling intervals, etc.), horizontal-vertical flipping, scaling in small multiples (e.g., 0.8-1.15 times), etc. The transformed two-dimensional image is a part of the original three-dimensional image, and the result of the spiral transformation is equivalent to a subset of the original data, so that for the same three-dimensional original data, the amplified data obtained based on different coordinate systems have a certain complementary relationship.
Specifically, the acquired three-dimensional MRI is converted into a two-dimensional plane according to the specified spiral transformation method. For example, with targeted sampling, Θ and ψ satisfy the relation given above, the maximum radius of the spiral transformation is 60 and N is 20, finally giving a two-dimensional image of 120 × 254.
In addition, during data amplification in this embodiment, the parameters of the spiral transformation and the origin and positive directions of the rectangular spatial coordinate system are fixed, and geometric transformations are applied to the original data. The three-dimensional data are rotated by different angles about the z-axis, flipped horizontally and flipped vertically, and then converted into a two-dimensional plane by the spiral transformation, amplifying the data to 27 times the original amount, as sketched below. The dataset was then divided into five equal parts at the patient level while preserving the ratio of positive and negative samples, with four parts used as training sets and one part as the test set.
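A sketch of this amplification idea follows, assuming the volume has been cropped so that the chosen midpoint O sits at the array centre (otherwise the rotation would move O); the rotation angles and the number of variants are illustrative only (the patent amplifies each case to 27 images), and spiral_transform refers to the sketch given earlier.

```python
import numpy as np
from scipy.ndimage import rotate

def augment_by_spiral(volume, center, angles_deg=(45, 90, 135, 180, 225, 270, 315)):
    """Apply geometric transforms to the 3D volume (rotations about the z-axis,
    horizontal/vertical flips) and spirally transform each variant, so every
    amplified 2D image samples the tumour from a different coordinate frame."""
    variants = [volume, np.flip(volume, axis=1), np.flip(volume, axis=2)]
    for a in angles_deg:
        # rotate in the x-y plane about the z-axis, keeping the array size
        variants.append(rotate(volume, angle=a, axes=(1, 2), reshape=False, order=1))
    return [spiral_transform(v, center) for v in variants]
```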
It should be noted that, the data amplification can be performed by using a spiral transformation method, so as to increase the information amount of the training sample in the deep learning method. The pancreatic cancer data set is only used as a specific embodiment of the invention, and the data amplification method of spiral transformation is also suitable for data sets with other target areas similar to spheres, so that a new data amplification idea is provided for solving the problem of insufficient deep learning data quantity.
TP53 gene prediction for pancreatic cancer is a very challenging task: the tumor region occupies only a small proportion of the image and is difficult to recognize, and the difficulty of acquiring multi-modal data leads to an insufficient sample size, which further increases the difficulty of the task. This embodiment alleviates the small-sample problem to some extent. Considering that a conventional section image loses a great deal of spatial information, and that directly using three-dimensional convolution would greatly increase the computation of a three-modality network that already has many parameters, a new spiral transformation method is proposed: the image is spirally transformed and then fed into a convolutional neural network. Compared with a 3D model, the computational resources and model parameters are reduced.
During testing, the original images were amplified to 27 times using the spiral transformation and, for comparison, the prior-art geometric amplification, and the effect of data amplification was evaluated using normalized mutual information. The geometric amplification applies geometric-transformation data amplification, such as horizontal and vertical flipping, to the 2D section with the largest tumor area. Referring to fig. 6, two data amplification effect graphs of the multi-mode-based deep learning prediction method according to an embodiment of the invention are shown: (a) is the spiral-transformation data amplification and (b) is the geometric-transformation data amplification of the 2D section with the largest tumor area. Fig. 6 shows the results of one case processed in the two ways; the top left corner is the original image and the other images are the amplified ones. To compare the similarity of the images before and after amplification by the two methods, the normalized mutual information between each of the 26 amplified images in fig. 6 (a) and fig. 6 (b) and the original image was calculated; summing each group gives 32.8838 for the spiral transformation and 38.3224 for the section images. Normalized mutual information is a way to measure the similarity of two images, i.e., the extent to which one image contains the other; the larger its value, the higher the similarity of the two images. It can be computed from the information entropies and the joint information entropy of the images. In addition, a t-test was performed on the two groups of data, giving p = 6.4920×10⁻⁷, far smaller than the 0.01 significance level, which indicates that the normalized mutual information of the two groups differs significantly, i.e., the images amplified by the spiral transformation are less similar to the original.
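For reference, normalized mutual information can be computed from the image entropies and the joint entropy as described; the sketch below uses one common NMI variant, (H(A)+H(B))/H(A,B), and an illustrative histogram bin count.

```python
import numpy as np

def normalized_mutual_information(img_a, img_b, bins=64):
    """NMI between two equally sized gray-level images from their joint histogram.
    Larger values mean the amplified image is more redundant with the original."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    h_xy = -np.sum(pxy[nz] * np.log(pxy[nz]))          # joint entropy H(A, B)
    h_x = -np.sum(px[px > 0] * np.log(px[px > 0]))     # entropy H(A)
    h_y = -np.sum(py[py > 0] * np.log(py[py > 0]))     # entropy H(B)
    return (h_x + h_y) / h_xy
```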
The normalized mutual information of the spiral-transformation amplification and of the prior-art geometric-transformation amplification, together with fig. 6, is summarized in the data amplification comparison table of Table 1.
Table 1: data amplification comparison table
Spiral transformation Geometric transformation
Normalizing mutual information 32.8838 38.3224
Degree of discretization 0.2709 0.0927
Euclidean distance 14.5826 7.7633
To observe the effect of data amplification intuitively, only the original data and the data obtained after two amplifications (horizontal flip and vertical flip) were visualized after dimensionality reduction; the data were normalized and the degree of dispersion S of the two-dimensional discrete points was calculated. Please refer to fig. 7, which shows a data-distribution schematic of the multi-mode-based deep learning prediction method in an embodiment of the present invention. As shown in fig. 7, the first plot in (a) shows the dispersion of the data distribution under the geometric transformation, the second plot in (a) is an enlargement of the more concentrated part of the first plot, and (b) shows the dispersion of the data distribution under the spiral transformation and data amplification. As can be seen from Table 1, the degree of dispersion of the prior-art geometric transformation is 0.0927, whereas that of the spiral transformation in this embodiment of the invention is 0.2709, which is clearly higher and therefore better. Furthermore, the Euclidean distance d from each amplified point to the original point was calculated and summed, giving 7.7633 for the geometric transformation and 14.5826 for the spiral transformation. In summary, the spiral transformation yields lower normalized mutual information, a higher degree of dispersion and a larger Euclidean distance, meaning that the similarity between the data is smaller and the distribution range is wider, so the data amplification effect is better.
The result shows that the data set obtained by spiral transformation is also a two-dimensional image, and the data set obtained by spiral transformation is widely distributed, namely contains more comprehensive three-dimensional information. On the one hand, the data amplification mode of the method can ensure that 3D information is reserved for a single 2D image, and the spatial distribution characteristics and the spatial texture relation of a tumor area are reflected; on the other hand, when data amplification is carried out each time, different tumor information can be obtained by changing the angle of the coordinate axis of the spiral transformation, so that the amplified data of each time are different, the amplified sample contains more information, and the spiral transformation is used as a very effective data amplification method.
In addition, when the spiral transformation and data amplification are applied to deep learning, the model-driven loss function adds prior-knowledge constraints, which also helps to relieve overfitting. The main network is initialized with ImageNet pre-trained parameters and, combined with the idea of transfer learning, the network parameters obtain a better initial distribution, so the lowest-level features (such as angles and edges) can be extracted quickly even with small samples, accelerating convergence and reducing overfitting.
And S42, carrying out feature extraction on the image data to generate a feature extraction result corresponding to each mode.
In this embodiment, feature extraction is performed on the image data by a convolutional neural network, which includes a residual structure and a bilinear pooling structure.
Specifically, in this embodiment a multi-modal model containing three tributaries is constructed. The feature extraction part of each tributary in the network framework adopts the residual block structure of ResNet18 and uses the ImageNet pre-training result of ResNet18 as its initialization parameters. When the sample size is insufficient, transfer learning can quickly learn low-level features such as orientations and colors, and only the high-level features need fine-tuning, which accelerates convergence and improves prediction accuracy. Additional parameters, such as those of the fully connected layers, cannot be initialized from pre-training. The specific convolutional neural network model and its structural parameters can be adjusted according to the actual requirements of the project.
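A per-modality tributary might be set up as follows in PyTorch; this is a hedged sketch rather than the patent's exact code, and it assumes the single-channel spiral images are replicated to three channels so the ImageNet-pretrained ResNet18 can be reused directly.

```python
import torch
import torch.nn as nn
from torchvision import models

class TributaryExtractor(nn.Module):
    """One per-modality feature-extraction branch: ResNet18 residual blocks
    initialised from ImageNet pre-training, keeping the last convolutional
    feature map (c x h x w) for the bilinear pooling that follows."""
    def __init__(self):
        super().__init__()
        # older torchvision versions use resnet18(pretrained=True) instead
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        # drop the global average pool and the ImageNet classifier head
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x):            # x: (batch, 3, H, W) spiral-transformed image
        return self.features(x)      # (batch, 512, H/32, W/32)
```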
This example predicts that the change in tumor-associated genes falls within the category of fine-grained classification. To obtain better fine-grained classification performance, a bilinear module is introduced into the method. The bilinear pooling structure has been shown to promote fine-grained classification in bird, airplane, and car datasets. Thus, for a single tributary network, the present invention connects a bilinear pooling structure at the last convolutional layer. The bilinear pooling layer consists of two feature extractors, which share weights in the present invention. Bilinear pooling calculates the outer product for each location of the feature and sums it.
Assuming the feature map output by the convolutional layer is f(I) ∈ R^(c×h×w), bilinear pooling at each spatial location i can be expressed as:

bilin_i = f(i, I) × f(i, I)^T,  bilin_i ∈ R^(c×c)    (8)

Summing over all spatial locations gives the bilinear pooled output:

y(I) = Σ_i bilin_i    (9)
bilinear pooling is advantageous for extracting texture features, which are important in fine-grained classification.
The number of fused features is the square of the number of channels of the original feature map. The bilinear pooled output y(I) is the final extracted feature of the tributary, which is then connected to the fully connected layer for fusion.
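A sketch of the shared-weight bilinear pooling of equations (8) and (9) on a batch of feature maps; the signed square-root and L2 normalization at the end are a common add-on and an assumption here, not something the patent states.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(fmap):
    """Bilinear pooling with shared extractors: for every spatial location the
    outer product of the c-dimensional descriptor with itself is computed and
    the products are summed over all locations, giving a c*c vector per sample."""
    b, c, h, w = fmap.shape
    f = fmap.reshape(b, c, h * w)                 # (b, c, hw)
    bilin = torch.bmm(f, f.transpose(1, 2))       # sum over locations of outer products
    bilin = bilin.reshape(b, c * c)
    # common practice: signed sqrt + L2 normalisation (assumption, not from the patent)
    bilin = torch.sign(bilin) * torch.sqrt(torch.abs(bilin) + 1e-10)
    return F.normalize(bilin, dim=1)
```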
S43, combining preset constraint items to fuse and classify and predict the feature extraction results.
Specifically, in the network structure the fully connected layers of the three tributary networks perform feature fusion by direct concatenation, and the feature vector y can be written as the combination of the sub-vectors y = [y_1, y_2, ..., y_n]. Then y is classified through a fully connected layer and a softmax layer to obtain the final predicted output p. In this embodiment two output values are used for binary classification, representing whether gene mutation occurs, instead of selecting a single output and setting a threshold for judgment.
In the whole framework, each tributary network has the same structure, and the input is the spirally transformed image of each modality, X = [x_1, x_2, ..., x_n]. The feature extraction part uses a convolutional neural network containing a residual structure, and the final feature extraction result is obtained by concatenating the output vectors y_i of the tributary networks. In a particular implementation, the convolutional network framework best suited to the task can be selected. With n sub-networks, the feature vector y can be written as a combination of the sub-vectors (c being the number of feature channels output by the convolutional layer):

y = [y_1, y_2, ..., y_n] = W_c X    (10)

where W_c is the weight of the feature extraction section.
Referring to fig. 8, a flowchart of an analysis and prediction method based on a multi-mode deep learning prediction method according to an embodiment of the invention is shown. As shown in fig. 8, in the present embodiment, S43 includes:
And S431, connecting the feature extraction result to a full-connection layer for feature fusion, and generating a prediction output result.
Specifically, the bilinear pooling result y is classified through a fully connected layer and softmax to obtain the final predicted output p; two output values are used for binary classification, representing whether gene mutation occurs, and the probabilities of the two classes (mutation and non-mutation) sum to 1. Fusing y in the fully connected layer improves the fine-grained classification performance of the network:

p = softmax(W_f1 y_1 + W_f2 y_2 + ... + W_fn y_n)    (11)

where W_f1, W_f2, ..., W_fn denote the weights of y_1, y_2, ..., y_n at the fully connected layer, i.e., the weights used for feature fusion.
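The concatenation-plus-softmax fusion of equations (10) and (11) can be sketched as follows; the feature dimension (512², from a ResNet18 feature map with 512 channels) and the number of modalities are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiModalHead(nn.Module):
    """Feature fusion sketch: the bilinear vectors y_i of the n modality branches
    are concatenated and passed through one fully connected layer with two outputs
    (mutation / no mutation), followed by softmax."""
    def __init__(self, n_modalities=3, feat_dim=512 * 512):
        super().__init__()
        self.fc = nn.Linear(n_modalities * feat_dim, 2)

    def forward(self, ys):                        # ys: list of n (batch, feat_dim) tensors
        y = torch.cat(ys, dim=1)                  # y = [y1, y2, ..., yn]
        return torch.softmax(self.fc(y), dim=1)   # p, two-class probabilities summing to 1
```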
S432, carrying out parameter optimization on the prediction model by combining with a preset constraint term so as to enable the prediction output result to be more accurate.
To fuse prior knowledge during the end-to-end training process, in addition to the pancreatic cancer gene prediction loss (L_1) used for strongly supervised classification, we also designed an intra-modality feature selection loss (L_2) and an inter-modality prediction constraint loss (L_3) as two constraint terms that drive the training process from the model side.
Referring to fig. 9, a flow chart of model optimization in an embodiment of the multi-modal-based deep learning prediction method of the present invention is shown. As shown in fig. 9, in the present embodiment, S432 includes:
S432A, supervising the prediction output process through the first constraint item.
Specifically, the prediction of TP53 gene mutation is trained with strong supervision from the given labels, so the common cross-entropy loss function for binary classification is used. The cross-entropy loss is the basis of the classification process, and its mathematical expression is given by formula (12):

L_1 = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]    (12)

where y_i is the label (mutation 1, non-mutation 0) and p_i is the predicted probability output for the specified class.
And S432B, selecting the characteristics of the characteristic extraction result through the second constraint item.
Specifically, to achieve better feature selection and feature fusion, some new loss-function constraint terms are introduced in addition to the cross-entropy. The model uses bilinear modules to obtain fine-grained classification performance, but the bilinear module has a very high output dimension and contains a certain amount of redundant information, so feature selection must be performed on the output feature vector to drive the redundant feature weights towards 0. On the one hand this selects features that are beneficial for prediction, and on the other hand it prevents the neural network from overfitting to too many features. The feature selection loss in the constraint terms is expressed by equation (13):

L_2 = Σ_{i=1}^{n} Σ_{j=1}^{k_i} | W_fi^(j) |    (13)

where W_fi is the fully connected layer weight of the i-th modality, n is the number of modalities, and k_i is the length of the weight vector of the i-th modality.
S432C, restraining among modes through the third constraint item, and maintaining the diversity of the feature extraction result.
Specifically, the features of the n modalities complement each other and act together. During feature selection the selected features may come to favor a certain modality; to prevent the effect deviation between modalities from becoming too large and to preserve the diversity of the features, a prediction constraint term L_3, expressed by formula (14), is designed on the per-modality predictions, where X is the input, p^(1) is the probability of gene mutation, and W_c is the weight of the feature extraction section. The constraint between the modalities helps to preserve the diversity of the features.
S432D, adding the first constraint term, the second constraint term and the third constraint term according to preset weights, and determining a loss function.
Specifically, the prediction constraint term L_3 and the feature selection term L_2 act together: prior knowledge is added while sparse features are obtained, so that each modality yields a similar predicted value, redundant features are reduced, and the diversity of the features is preserved. The final loss function is a linear combination of the three parts:

L = α L_1 + β L_2 + γ L_3    (15)

where α, β, γ are the weights that balance the three parts of the loss function.
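A sketch of the loss in formula (15) follows. L_1 and L_2 follow formulas (12) and (13); the exact form of L_3 in formula (14) is not reproduced in the text above, so the pairwise absolute difference of the per-modality mutation probabilities is used here as an assumed stand-in for the inter-modality constraint.

```python
import torch
import torch.nn.functional as F

def combined_loss(p, target, fc_weights_per_modality, p_per_modality,
                  alpha=1.0, beta=0.001, gamma=0.01):
    """L = alpha*L1 + beta*L2 + gamma*L3 (formula 15), sketched for illustration."""
    # L1: strongly supervised classification loss on the fused prediction (eq. 12)
    l1 = F.binary_cross_entropy(p[:, 1], target.float())
    # L2: intra-modality feature selection, L1 norm of each modality's FC weights (eq. 13)
    l2 = sum(w.abs().sum() for w in fc_weights_per_modality)
    # L3: keep the modality-wise mutation probabilities close to each other (assumption)
    l3 = 0.0
    n = len(p_per_modality)
    for i in range(n):
        for j in range(i + 1, n):
            l3 = l3 + (p_per_modality[i] - p_per_modality[j]).abs().mean()
    return alpha * l1 + beta * l2 + gamma * l3
```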
And S432E, performing parameter optimization on the prediction model by using a gradient descent method so as to minimize the loss function.
In particular, the training goal of convolutional neural networks is to minimize the loss function, typically with updating of parameters using gradient descent in a deep learning based predictive model. The predictive model of the present invention can minimize the constructed loss function by gradient descent method as follows:
Since the quantities appearing in L_3 of formula (14) are scalars, L_3 can be converted into an equivalent form.
gene predictive loss (L) 1 ) Is a function f (W) containing a weight W, so the loss function can be expressed as the sum of f (W) and the L1 regularization term:
wherein W is f ,W s ,W t Are all components of W, g (W) is a constraint term containing L1 regularization.
A model with L1 regularization can minimize its loss function using the proximal gradient descent method (PGD), a special form of gradient descent. Therefore, under the constraint of this loss function, the multi-modal fine-grained classification model can also be solved by proximal gradient descent, enabling end-to-end training.
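The proximal gradient idea for the L1-regularized part can be illustrated by a single soft-thresholding update; this is a sketch of the optimisation principle, not the patent's training code.

```python
import torch

def proximal_l1_step(weight, grad, lr, lam):
    """One proximal gradient (PGD) update for an L1-regularised weight: an ordinary
    gradient step on the smooth part f(W), followed by the soft-thresholding
    proximal operator of the L1 term g(W) = lam * ||W||_1."""
    w = weight - lr * grad                                            # gradient step on f(W)
    return torch.sign(w) * torch.clamp(w.abs() - lr * lam, min=0.0)   # prox of the L1 term
```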
S44, determining a final prediction model when the loss function is minimized.
Specifically, during training the coefficients of the loss-function constraint terms in formula (15) are set to α = 1, β = 0.001, γ = 0.01. The initial learning rate is set to 0.0001 and, after five iterations, is gradually reduced in a cosine-decay manner. The change of the learning rate can be expressed as:

lr(t) = lr_0,                                            t < t_0
lr(t) = lr_0 · (1 + cos(π (t - t_0) / (T - t_0))) / 2,   t ≥ t_0

where t_0 (set to 5) is the iteration at which the learning rate starts to change, t is the current iteration, and T (set to 20) is the total number of training iterations; lr_0 (set to 0.0001) is the initial learning rate, and the batch size is 32. The computing environment of the entire scheme is an Intel i7-8700 CPU @ 3.20 GHz, 32 GB RAM and a single NVIDIA TITAN X (Pascal) GPU.
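The schedule just described (constant for the first t_0 iterations, cosine decay afterwards) might look like this; the exact decay formula is reconstructed from the description, so treat it as indicative.

```python
import math

def learning_rate(t, lr0=1e-4, t0=5, total=20):
    """Learning-rate schedule sketch: constant for the first t0 iterations,
    then cosine decay over the remaining (total - t0) iterations."""
    if t < t0:
        return lr0
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * (t - t0) / (total - t0)))
```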
After the multi-modal fine-grained classification network has been trained, the 27 predicted values of the amplified images of each case are taken as one sample to train a regression model. The probability of each test case being predicted as 1 by the CNN (convolutional neural network) is input into the regression model to obtain the final prediction result Y.
In terms of gene prediction performance, a hybrid-driven multi-modal model for predicting pancreatic cancer TP53/KRAS gene mutations is proposed for this difficult prediction task, and analysis of the experimental results shows that the proposed model performs well on the comprehensive indexes. Specifically, the index data of the five-fold cross-validation experiments are summarized using Accuracy, AUC (area under the ROC curve), Recall, Precision and F1 score as evaluation indexes; see the experimental results data sheet in Table 2.
Table 2: experimental results data sheet
Accuracy AUC Recall Precision F1score
Cross validation 0 0.8462 0.8500 1.0000 0.8000 0.8889
Cross validation 1 0.6154 0.6500 0.6250 0.7143 0.6667
Cross validation 2 0.7692 0.8500 1.0000 0.7273 0.8421
Cross validation 3 0.8462 0.7250 1.0000 0.8000 0.8889
Cross validation 4 0.7500 0.8125 0.7500 0.8571 0.8000
Average of 0.7654 0.7775 0.8750 0.7797 0.8173
Furthermore, the bilinear mechanism can extract discriminative texture information, which is the most important process in fine-grained classification. The bilinear module improves the TP53/KRAS gene mutation prediction performance of our model; specifically, the effect of the multi-modal model is summarized using Accuracy, AUC (area under the ROC curve), Recall, Precision and F1 score as evaluation indexes, see the multi-modal model effect comparison table in Table 3. As shown in Table 3, the model accuracy was improved by 9% after the bilinear module was added. Besides the classical bilinear pooling operation, the spiral transformation preserves texture information and its spatial correlation to the greatest extent, which benefits feature extraction by the convolutional neural network. Methods such as the spiral transformation and data driving not only markedly improve the prediction performance on the multi-modal pancreatic cancer dataset but also have reference value for processing other datasets.
Table 3: multi-modal model effect comparison table
S45, evaluating the final prediction model through a preset evaluation index.
Specifically, to evaluate the performance of the model comprehensively, Accuracy, AUC (area under the ROC curve), Recall, Precision and F1 score, which are widely used in the classification field, are used as evaluation indexes. The indexes are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1score = 2 · Precision · Recall / (Precision + Recall)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives, and AUC is the area under the ROC curve.
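The five evaluation indexes can be computed with scikit-learn as sketched below; the 0.5 decision threshold is an illustrative assumption.

```python
import numpy as np
from sklearn import metrics

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the five evaluation indexes from the predicted mutation probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "Accuracy":  metrics.accuracy_score(y_true, y_pred),
        "AUC":       metrics.roc_auc_score(y_true, y_prob),
        "Recall":    metrics.recall_score(y_true, y_pred),
        "Precision": metrics.precision_score(y_true, y_pred),
        "F1score":   metrics.f1_score(y_true, y_pred),
    }
```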
In terms of multi-modal fusion, the hybrid model- and data-driven method adds prior knowledge during feature fusion. Specifically, the prediction performance under different loss functions is summarized in the comparison table of Table 4. Five-fold cross-validation was used to verify the stability of the model and to obtain the final result for predicting pancreatic cancer TP53 gene changes. The Accuracy, AUC, Recall, Precision and F1 score of this embodiment are 0.7654, 0.7775, 0.8750, 0.7797 and 0.8173, respectively. Compared with simple feature-vector concatenation, the accuracy of the method is improved by 11% and the AUC increases from 0.7475 to 0.7775. In terms of Precision and F1 score the hybrid-driven method also obtains better results, showing that the method not only performs feature selection but also allows the three modalities to be combined more effectively, complementing each other and acting together.
Table 4: prediction effect comparison table of different loss functions
Loss function Accuracy AUC Recall Precision F1Score
L1 0.6564 0.7475 0.8750 0.6733 0.7599
αL1+βL2+γL3 0.7654 0.7775 0.8750 0.7797 0.8173
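For illustration only, the following sketch shows one possible way to combine three constraint terms into a weighted loss αL1+βL2+γL3 as in Table 4; the exact forms of the second and third terms here are illustrative assumptions and are not the formulas of the invention.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, labels, fc_weights, modality_probs,
                alpha=1.0, beta=0.01, gamma=0.1):
    """Weighted sum of three constraint terms: alpha*L1 + beta*L2 + gamma*L3.

    L1: cross-entropy supervision of the fused prediction.
    L2: sparsity penalty on each modality's fully connected weights
        (intra-modality feature selection); illustrative assumption.
    L3: variance of the per-modality mutation probabilities
        (inter-modality effect balance); illustrative assumption."""
    l1 = F.cross_entropy(logits, labels)
    l2 = sum(w.norm(p=2) for w in fc_weights)        # one weight tensor per modality
    l3 = torch.stack(modality_probs).var()           # spread of per-modality predictions
    return alpha * l1 + beta * l2 + gamma * l3
```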
During testing, the combined action of intra-modality feature sparsification and inter-modality effect equalization greatly improves the final prediction effect. Referring to fig. 10, an effect diagram of different loss functions generated by the multi-mode-based deep learning prediction method according to an embodiment of the invention is shown. As shown in fig. 10, the ROC curves (receiver operating characteristic curves) and PR curves (precision-recall curves) of the two models are plotted to visualize the relationship between sensitivity and specificity and between precision and recall. The results show that the curves of the multi-mode-based deep learning prediction method are respectively closer to the upper left corner of the ROC plot and the upper right corner of the PR plot, indicating better performance.
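As a hedged illustration, ROC and PR curves of the kind shown in fig. 10 could be drawn from predicted probabilities as in the following sketch (the arrays y_true and y_prob are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, auc

def plot_roc_pr(y_true, y_prob, label="multi-modal model"):
    """Plot ROC (sensitivity vs. 1-specificity) and PR (precision vs. recall) curves."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    prec, rec, _ = precision_recall_curve(y_true, y_prob)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(fpr, tpr, label=f"{label} (AUC={auc(fpr, tpr):.4f})")
    ax1.plot([0, 1], [0, 1], "--", color="grey")   # chance level
    ax1.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curve")
    ax1.legend()
    ax2.plot(rec, prec, label=label)
    ax2.set(xlabel="Recall", ylabel="Precision", title="PR curve")
    ax2.legend()
    plt.tight_layout()
    plt.show()
```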
The present embodiment provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modality based deep learning prediction method.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the aforementioned computer-readable storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks.
The multi-mode-based deep learning prediction method is jointly driven by the model and the data, takes into account the diversity among features, and can fully combine the correlations among the modalities. In addition, the data is augmented by means of a new spiral transformation, which effectively increases the amount of information in the training set and helps the model achieve better robustness.
Example two
The embodiment provides a multi-mode-based deep learning prediction system, which comprises:
the data acquisition module is used for acquiring an image data set, wherein the image data set comprises image data of at least two modes;
the feature extraction module is used for carrying out feature extraction on the image data so as to generate a feature extraction result corresponding to each mode;
and the prediction module is used for fusing the feature extraction results and classifying and predicting the feature extraction results by combining with a preset constraint item.
The multi-modal based deep learning prediction system provided by this embodiment will be described in detail below with reference to the drawings. It should be noted that the division of the modules of the following system is merely a division by logical function; in actual implementation they may be fully or partially integrated into one physical entity or may be physically separated. The modules may all be implemented in the form of software called by a processing element, may all be implemented in hardware, or some modules may be implemented as software called by a processing element while the others are implemented in hardware. For example, a module may be a separately established processing element, or may be integrated in a chip of the system described below. In addition, a module may be stored in the memory of the following system in the form of program code, and its function may be called and executed by a processing element of the system. The implementation of the other modules is similar. All or part of the modules can be integrated together or implemented independently. The processing element described herein may be an integrated circuit with signal-processing capability. In implementation, each step of the above method or each module below may be completed by an integrated logic circuit of hardware in a processor element or by instructions in software form.
The following modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), one or more digital signal processors (Digital Signal Processor, DSP for short), one or more field programmable gate arrays (Field Programmable Gate Array, FPGA for short), and the like. When a module is implemented in the form of a processing element calling program code, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or another processor that can call program code. These modules may be integrated together and implemented in the form of a system-on-a-chip (System-on-a-Chip, SOC for short).
Referring to fig. 11, a schematic diagram of a multi-mode-based deep learning prediction system according to an embodiment of the invention is shown. As shown in fig. 11, the multi-modality-based deep learning prediction system 5 includes: a data acquisition module 51, a feature extraction module 52 and a prediction module 53.
The data acquisition module 51 is configured to acquire an image dataset comprising image data of at least two modalities.
In this embodiment, the image dataset is a two-dimensional image dataset obtained after spiral transformation and data augmentation.
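The spiral transformation itself is defined earlier in the specification; purely as an illustrative assumption (not the patented transform), a 3D volume could be resampled into a 2D image by reading voxel intensities along a spiral path around the volume centre, for example:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def spiral_to_2d(volume: np.ndarray, n_rows: int = 64, n_cols: int = 128) -> np.ndarray:
    """Sample voxels along an outward spiral around the volume centre and arrange
    the samples row by row into a 2D image. Assumption-laden sketch only."""
    cz, cy, cx = (np.array(volume.shape) - 1) / 2.0
    max_r = min(volume.shape) / 2.0 - 1.0
    t = np.linspace(0, 1, n_rows * n_cols)   # curve parameter
    r = max_r * t                             # radius grows along the spiral
    theta = np.pi * t                         # polar angle sweeps 0..pi
    phi = 2 * np.pi * n_rows * t              # azimuth winds n_rows times
    z = cz + r * np.cos(theta)
    y = cy + r * np.sin(theta) * np.sin(phi)
    x = cx + r * np.sin(theta) * np.cos(phi)
    samples = map_coordinates(volume, np.vstack([z, y, x]), order=1)  # trilinear sampling
    return samples.reshape(n_rows, n_cols)
```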
The feature extraction module 52 is configured to perform feature extraction on the image data to generate a feature extraction result corresponding to each modality.
In this embodiment, the feature extraction module 52 is specifically configured to perform feature extraction on the image data through a convolutional neural network, where the convolutional neural network includes a residual structure and a bilinear pooling structure.
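A minimal sketch of such a per-modality extractor is given below, assuming a ResNet-18 trunk followed by self-bilinear pooling; the backbone choice and feature sizes are assumptions rather than the exact network of the invention.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ModalityExtractor(nn.Module):
    """Residual backbone followed by self-bilinear pooling for one imaging modality (sketch)."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # keep the convolutional trunk, drop the global pooling and classifier head
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x):                              # x: (batch, 3, H, W)
        f = self.features(x)                           # (batch, 512, h, w)
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)
        g = torch.bmm(f, f.transpose(1, 2)) / (h * w)  # bilinear (outer-product) pooling
        return g.reshape(b, c * c)                     # per-modality feature vector
```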
The prediction module 53 is configured to fuse the feature extraction results and perform classification prediction in combination with preset constraint terms.
In this embodiment, the prediction module 53 is specifically configured to connect the feature extraction result to a full connection layer for feature fusion, so as to generate a predicted output result; and carrying out parameter optimization on the prediction model by combining with a preset constraint term so as to enable the prediction output result to be more accurate.
Specifically, the prediction module 53 is configured to supervise the prediction output process through the first constraint term; perform feature selection on the feature extraction results through the second constraint term; apply an inter-modality constraint through the third constraint term to preserve the diversity of the feature extraction results; add the first constraint term, the second constraint term and the third constraint term according to preset weights to determine a loss function; and perform parameter optimization on the prediction model using a gradient descent method so as to minimize the loss function, as sketched below.
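The following is a hedged sketch of how the fusion and constraint-driven optimisation described above might be wired together; the module names, layer sizes, the SGD optimiser and the simplified constraint terms are assumptions, not the exact implementation of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPredictor(nn.Module):
    """Concatenate per-modality features and classify through fully connected layers (sketch)."""
    def __init__(self, n_modalities=3, feat_dim=512, n_classes=2):
        super().__init__()
        # one linear branch per modality keeps the weights separable for the sparsity constraint
        self.branches = nn.ModuleList(nn.Linear(feat_dim, 64) for _ in range(n_modalities))
        self.classifier = nn.Linear(64 * n_modalities, n_classes)

    def forward(self, feats):                          # feats: list of (batch, feat_dim) tensors
        fused = torch.cat([b(f) for b, f in zip(self.branches, feats)], dim=1)
        return self.classifier(fused)

# one optimisation step minimising a weighted sum of the constraint terms
model = FusionPredictor()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)            # gradient descent
feats = [torch.randn(8, 512) for _ in range(3)]                     # dummy per-modality features
labels = torch.randint(0, 2, (8,))
logits = model(feats)
loss = F.cross_entropy(logits, labels)                              # first constraint: supervision
loss = loss + 0.01 * sum(b.weight.norm() for b in model.branches)   # second: sparsify FC weights
# a third, inter-modality balance term would be added here in the same way
optimizer.zero_grad()
loss.backward()
optimizer.step()
```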
The multi-mode-based deep learning prediction system in this embodiment is jointly driven by the model and the data, takes into account the diversity among features, and can fully combine the correlations among the modalities. In addition, the data is augmented by means of a new spiral transformation, which effectively increases the amount of information in the training set and helps the model achieve better robustness.
Example III
The present embodiment provides an apparatus including: a processor, a memory, a transceiver, a communication interface and/or a system bus; the memory and the communication interface are connected with the processor and the transceiver through the system bus and communicate with each other; the memory is used for storing a computer program, the communication interface is used for communicating with other devices, and the processor and the transceiver are used for running the computer program so that the apparatus executes the steps of the multi-modality based deep learning prediction method.
The system bus mentioned above may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The system bus may be classified into an address bus, a data bus, a control bus, and the like. The communication interface is used for realizing communication between the database access device and other devices (such as a client, a read-write library and a read-only library). The memory may comprise random access memory (Random Access Memory, RAM) and may also comprise non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processing, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field programmable gate array (Field Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The protection scope of the multi-mode-based deep learning prediction method is not limited to the execution sequence of the steps listed in this embodiment; all schemes realized by adding, removing or replacing steps of the prior art according to the principles of the invention are included in the protection scope of the invention.
The invention also provides a multi-mode-based deep learning prediction system, which can implement the multi-mode-based deep learning prediction method; however, the device for implementing the multi-mode-based deep learning prediction method includes, but is not limited to, the structure of the multi-mode-based deep learning prediction system listed in this embodiment, and all structural variations and substitutions of the prior art made according to the principles of the invention are included in the protection scope of the invention. It should be noted that the multi-mode-based deep learning prediction method and system are also applicable to content in other multimedia forms, such as video and social-feed messages, which are likewise included in the protection scope of the present invention.
In summary, in terms of multi-modal fusion, the multi-mode-based deep learning prediction method, system, medium and device add prior knowledge during feature fusion through the model-and-data hybrid driving method; compared with simple feature-vector concatenation, the accuracy of the method is higher and the AUC (Area Under the ROC Curve) is increased, where ROC (Receiver Operating Characteristic) refers to the receiver operating characteristic curve. In addition, the data-and-model hybrid driving method also obtains better results in terms of precision, which shows that it not only plays a role in feature selection but also enables the multiple modalities to be combined more effectively, complementing each other and acting together. During testing, the combined action of intra-modality feature sparsification and inter-modality effect equalization greatly improves the final prediction effect. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (8)

1. The multi-mode-based deep learning prediction method is characterized by comprising the following steps of:
acquiring an image dataset comprising image data of at least two modalities;
extracting features of the image data to generate feature extraction results corresponding to each mode;
combining a preset constraint item to fuse the feature extraction results and carrying out classified prediction;
the step of fusing and classifying and predicting the feature extraction result by combining with a preset constraint item comprises the following steps:
connecting the feature extraction result to a full connection layer for feature fusion to generate a prediction output result;
carrying out parameter optimization on the prediction model by combining with a preset constraint item;
the preset constraint items comprise a first constraint item, a second constraint item and a third constraint item; the step of parameter optimization of the prediction model by combining the preset constraint items comprises the following steps:
supervising a prediction output process through the first constraint item;
performing feature selection on the feature extraction result through the second constraint item;
performing constraint among modes through the third constraint item;
adding the first constraint item, the second constraint item and the third constraint item according to preset weights, and determining a loss function; performing parameter optimization on the prediction model by using a gradient descent method so as to minimize the loss function;
the loss function is a linear combination of a first constraint term, a second constraint term and a third constraint term:
wherein α, β, γ represent weights that balance the three-part loss function;
L1 is the first constraint term, wherein y_i is the label and p_i is the predicted probability of the specified class output;
L2 is the second constraint term, defined on the weights of the fully connected layer of the i-th modality, wherein n is the number of modalities and k_i is the length of the weight vector of the i-th modality;
L3 is the third constraint term, wherein X is the input, p^(1) is the probability of gene mutation, and W_c is the weight of the feature extraction section.
2. The multi-modal based deep learning prediction method as claimed in claim 1, wherein,
the image data set is a two-dimensional image data set after spiral transformation and data amplification.
3. The multi-modal based deep learning prediction method as claimed in claim 1, wherein,
and extracting the characteristics of the image data through a convolutional neural network, wherein the convolutional neural network comprises a residual error structure and a bilinear pooling structure.
4. The multi-modal based deep learning prediction method as claimed in claim 1, wherein,
and carrying out parameter optimization in the prediction model through an iteration principle of a gradient descent method.
5. The multi-modality based deep learning prediction method of claim 1, further comprising:
determining a final predictive model when the loss function is minimized;
and evaluating the final prediction model through a preset evaluation index.
6. A multi-modality based deep learning prediction system, the multi-modality based deep learning prediction system comprising:
the data acquisition module is used for acquiring an image data set, wherein the image data set comprises image data of at least two modes;
the feature extraction module is used for carrying out feature extraction on the image data so as to generate a feature extraction result corresponding to each mode;
the prediction module is used for fusing the feature extraction results and classifying and predicting the feature extraction results by combining preset constraint items;
the step of fusing and classifying and predicting the feature extraction result by combining with a preset constraint item comprises the following steps:
connecting the feature extraction result to a full connection layer for feature fusion to generate a prediction output result;
carrying out parameter optimization on the prediction model by combining with a preset constraint item;
the preset constraint items comprise a first constraint item, a second constraint item and a third constraint item; the step of parameter optimization of the prediction model by combining the preset constraint items comprises the following steps:
supervising a prediction output process through the first constraint item;
performing feature selection on the feature extraction result through the second constraint item;
performing constraint among modes through the third constraint item;
adding the first constraint item, the second constraint item and the third constraint item according to preset weights, and determining a loss function;
performing parameter optimization on the prediction model by using a gradient descent method so as to minimize the loss function;
the loss function is a linear combination of a first constraint term, a second constraint term and a third constraint term:
wherein α, β, γ represent weights that balance the three-part loss function;
L1 is the first constraint term, wherein y_i is the label and p_i is the predicted probability of the specified class output;
L2 is the second constraint term, defined on the weights of the fully connected layer of the i-th modality, wherein n is the number of modalities and k_i is the length of the weight vector of the i-th modality;
L3 is the third constraint term, wherein X is the input, p^(1) is the probability of gene mutation, and W_c is the weight of the feature extraction section.
7. A computer readable medium having stored thereon a computer program, which when executed by a processor implements the multi-modal based deep learning prediction method of any one of claims 1 to 5.
8. An apparatus, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, to cause the apparatus to perform the multi-modality based deep learning prediction method according to any one of claims 1 to 5.
CN202010098684.9A 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment Active CN111275130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010098684.9A CN111275130B (en) 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010098684.9A CN111275130B (en) 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment

Publications (2)

Publication Number Publication Date
CN111275130A CN111275130A (en) 2020-06-12
CN111275130B true CN111275130B (en) 2023-09-08

Family

ID=71002149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010098684.9A Active CN111275130B (en) 2020-02-18 2020-02-18 Multi-mode-based deep learning prediction method, system, medium and equipment

Country Status (1)

Country Link
CN (1) CN111275130B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069717A (en) * 2020-08-19 2020-12-11 五邑大学 Magnetic storm prediction method and device based on multi-mode representation learning and storage medium
CN111915698B (en) * 2020-08-21 2024-06-07 南方科技大学 Vascular infiltration detection method, vascular infiltration detection device, computer equipment and storage medium
CN112163374B (en) * 2020-09-27 2024-02-20 中国地质调查局自然资源综合调查指挥中心 Processing method for multi-modal data intermediate layer fusion full-connection geological map prediction model
CN112183669B (en) * 2020-11-04 2024-02-13 航天科工(北京)空间信息应用股份有限公司 Image classification method, device, equipment and storage medium
CN112687022A (en) * 2020-12-18 2021-04-20 山东盛帆蓝海电气有限公司 Intelligent building inspection method and system based on video
CN113132931B (en) * 2021-04-16 2022-01-28 电子科技大学 Depth migration indoor positioning method based on parameter prediction
WO2022221991A1 (en) * 2021-04-19 2022-10-27 深圳市深光粟科技有限公司 Image data processing method and apparatus, computer, and storage medium
CN113345576A (en) * 2021-06-04 2021-09-03 江南大学 Rectal cancer lymph node metastasis diagnosis method based on deep learning multi-modal CT
WO2023037298A1 (en) * 2021-09-08 2023-03-16 Janssen Research & Development, Llc Multimodal system and method for predicting cancer
CN114170162A (en) * 2021-11-25 2022-03-11 深圳先进技术研究院 Image prediction method, image prediction device and computer storage medium
CN114399108A (en) * 2022-01-13 2022-04-26 北京智进未来科技有限公司 Tea garden yield prediction method based on multi-mode information
WO2024108483A1 (en) * 2022-11-24 2024-05-30 中国科学院深圳先进技术研究院 Multimodal neural biological signal processing method and apparatus, and server and storage medium
CN116168258B (en) * 2023-04-25 2023-07-11 之江实验室 Object classification method, device, equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016100814A1 (en) * 2014-12-19 2016-06-23 United Technologies Corporation Multi-modal sensor data fusion for perception systems
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A kind of cross-module state association learning method based on more granularity hierarchical networks
CN108182441A (en) * 2017-12-29 2018-06-19 华中科技大学 Parallel multichannel convolutive neural network, construction method and image characteristic extracting method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919941B (en) * 2017-04-26 2018-10-09 华南理工大学 A kind of three-dimensional finger vein identification method and system

Also Published As

Publication number Publication date
CN111275130A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
CN111275130B (en) Multi-mode-based deep learning prediction method, system, medium and equipment
Avanzo et al. Machine and deep learning methods for radiomics
Koçak et al. Radiomics with artificial intelligence: a practical guide for beginners
Mazurowski et al. Deep learning in radiology: An overview of the concepts and a survey of the state of the art with focus on MRI
Halder et al. Lung nodule detection from feature engineering to deep learning in thoracic CT images: a comprehensive review
Jin et al. A deep 3D residual CNN for false‐positive reduction in pulmonary nodule detection
Burt et al. Deep learning beyond cats and dogs: recent advances in diagnosing breast cancer with deep neural networks
Masood et al. Automated decision support system for lung cancer detection and classification via enhanced RFCN with multilayer fusion RPN
Shin et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning
Liu et al. Pulmonary nodule classification in lung cancer screening with three-dimensional convolutional neural networks
Li et al. An effective computer aided diagnosis model for pancreas cancer on PET/CT images
Taşcı et al. Shape and texture based novel features for automated juxtapleural nodule detection in lung CTs
Gupta et al. Automatic detection of multisize pulmonary nodules in CT images: large‐scale validation of the false‐positive reduction step
US11896407B2 (en) Medical imaging based on calibrated post contrast timing
CN111798424B (en) Medical image-based nodule detection method and device and electronic equipment
Rani et al. Superpixel with nanoscale imaging and boosted deep convolutional neural network concept for lung tumor classification
Feng et al. Supervoxel based weakly-supervised multi-level 3D CNNs for lung nodule detection and segmentation
Aonpong et al. Genotype-guided radiomics signatures for recurrence prediction of non-small cell lung cancer
Hu et al. Detection and segmentation of lymphomas in 3D PET images via clustering with entropy-based optimization strategy
Peng et al. H-ProSeg: Hybrid ultrasound prostate segmentation based on explainability-guided mathematical model
Peng et al. H-SegMed: a hybrid method for prostate segmentation in TRUS images via improved closed principal curve and improved enhanced machine learning
Tan et al. Pulmonary nodule detection using hybrid two‐stage 3D CNNs
Wang et al. Computer-aided diagnosis based on extreme learning machine: a review
Behar et al. ResNet50-Based Effective Model for Breast Cancer Classification Using Histopathology Images.
Liu et al. A pyramid input augmented multi-scale CNN for GGO detection in 3D lung CT images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant