CN114927162A

CN114927162A - Multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution

Info

Publication number: CN114927162A
Application number: CN202210544114.7A
Authority: CN
Inventors: 王浩华; 高建; 林恺; 张强; 何昆仑; 石金龙
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2022-05-19
Filing date: 2022-05-19
Publication date: 2022-08-19
Anticipated expiration: 2042-05-19
Also published as: CN114927162B

Abstract

The invention discloses a multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution, which comprises four modules. The omics data preprocessing module cleans the raw omics data and pre-screens features so as to remove noise, errors and redundant features that may impair association-mining performance. The omics data hypergraph representation module computes the cosine similarity within each omics and constructs the hypergraph incidence matrix from it. The feature extraction module builds a hypergraph convolutional neural network to extract features from each omics data set. The multi-omics integration prediction module constructs Dirichlet distribution parameters from the initial results produced by each omics-specific hypergraph convolutional neural network and feeds them to a multi-omics integration algorithm for final label prediction. Based on multi-omics data and the corresponding phenotype labels, the method mines the latent correlations among the omics, effectively integrates the feature information of each omics, and achieves accurate prediction of the association between omics data and human phenotypes.

Description

Multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution.

Background

In recent years, biology-related technologies have developed rapidly; high-throughput sequencing in particular has made breakthrough progress in volume, speed, accuracy, diversity and application value. Omics data can now be obtained more efficiently and at lower cost than with conventional methods. Research on DNA, mRNA and methylation can be broadly divided into genomics, transcriptomics and epigenomics, and integrating these data provides the basis for multi-omics studies of various human phenotypes. At the same time, the complexity of an organism is spread across these different data types. Because research on any single omics can reveal only part of that complexity, integrating multiple omics data sets gives a better understanding of the organism and a more comprehensive view of life processes.

A phenotype is a quantifiable characteristic expressed during biological activity, that is, a characteristic index that can be objectively evaluated for an organism in a specific state, such as height, skin color or disease. Traditional statistical methods take the measured levels of genes, proteins and other substances in body fluids or tissues such as human blood and urine, compute statistics on these data, compare them against preset thresholds to infer candidate biomarkers for the corresponding omics, and then infer the phenotype of the data under test from those biomarkers. For example, genome-wide association studies (GWAS) compare the P values of sample data to identify DNA fragments, SNP sites and other genomic features related to disease. However, the omics-phenotype associations mined by traditional statistical methods have obvious limitations. On the one hand, these methods compute statistics only for a single marker within each omics, yet several markers with individually low statistical values may jointly play a decisive role in the phenotype, so the influence of such low-scoring markers on the phenotype cannot be ruled out. On the other hand, regulation in organisms is a multi-level, dynamic process, so methods that target a single omics are fundamentally limited and cannot account for upstream and downstream regulation across omics.

In view of the above, a comprehensive multi-omics approach is needed to make full use of these data to understand biological systems. With increasingly affordable computing power and high-throughput omics data, and the success of artificial intelligence in many fields, machine learning has become popular in biology. Machine learning can be used to mine information hidden in experimental data. Conventional statistical models are typically designed around statistical assumptions and draw inferences about a particular phenomenon from a given data set, whereas machine learning methods aim to learn from historical or existing data and use what is learned to make predictions or decisions on unseen data. For example, Xu et al. developed the HI-DFNForest framework, which learns high-level feature representations from three omics data sets using stacked autoencoders and integrates these representations to predict cancer subtypes. MOGONET, proposed by Wang et al., constructs a graph structure for each omics data set, performs initial prediction with a graph convolutional neural network, and then integrates the results through the multi-view integration network VCDN to achieve multi-omics phenotype classification. However, these methods still leave room for improvement in prediction accuracy and module design.

Disclosure of Invention

To solve the above problems, the invention provides a method for predicting the association between human phenotypes and omics data based on hypergraphs and the Dirichlet distribution. First, the raw data are cleaned and screened by a preprocessing module. Second, a neural network model based on hypergraph representation is developed from the combined data matrix formed by multiple omics data sets: each omics data set is represented as a hypergraph, using a KNN (K-Nearest Neighbor) algorithm based on cosine similarity to deeply mine the associations among different positional information within the omics. The represented data then undergo efficient feature extraction by a hypergraph convolutional neural network, which also supports association prediction between a single omics and the phenotype. Finally, a multi-omics (two or more omics) fusion algorithm based on the Dirichlet distribution is built on the joint matrix formed from the feature matrices; a loss function constructed from the Dirichlet distribution completes the information integration among the omics, and information is shared among the omics on the basis of the feature matrices so as to accurately predict the human phenotype.

To achieve this purpose, the specific technical scheme of the invention is as follows:

A multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution comprises the following steps:

Step (1): omics data cleaning and preprocessing

For each omics data set, redundant noise in the raw data is removed by a conventional preprocessing method. For example, for miRNA data only entries with a chip detection success rate of at least 95% are retained; for methylation data, normalized beta values are calculated as the methylation level of each methylation site. The screened data may still contain redundant features or noise that degrade prediction performance. To address this, features are pre-selected as follows.

First, features in the data set having a variance less than a threshold α are filtered out.

Second, for each phenotype label, the t-test of formula (1) is performed in turn to check whether the omics data of samples sharing the same label differ significantly, and samples whose t value exceeds a threshold γ are deleted. In formula (1), $\bar{x}$ is the sample mean, $\mu$ is the sample expectation, $\sigma(x)$ is the sample standard deviation, and $n$ is the number of samples.

Finally, because different omics data types have different expression ranges, the expression values are scaled to [0,1] by a linear transformation so that the model can process them. The output of this step is the preprocessed feature matrix X.
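The three pre-selection operations of step (1) can be illustrated with a minimal pure-Python sketch. The function names are hypothetical, and the t statistic is written in the usual one-sample form $t = (\bar{x} - \mu) / (\sigma(x) / \sqrt{n})$ with an $n - 1$ sample standard deviation; both choices are assumptions consistent with the symbols defined above rather than a quotation of formula (1).

```python
import math

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def filter_low_variance(X, alpha):
    """Drop feature columns of X (rows = samples) whose variance is below alpha."""
    n_features = len(X[0])
    keep = [j for j in range(n_features)
            if variance([row[j] for row in X]) >= alpha]
    return [[row[j] for j in keep] for row in X], keep

def t_statistic(values, mu):
    """Assumed one-sample t statistic: (mean - mu) / (std / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with n - 1 in the denominator (an assumption).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (mean - mu) / (std / math.sqrt(n))

def min_max_scale(X):
    """Linearly rescale every feature column of X to [0, 1]."""
    n_features = len(X[0])
    lows = [min(row[j] for row in X) for j in range(n_features)]
    highs = [max(row[j] for row in X) for j in range(n_features)]
    return [
        [(row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
         for j in range(n_features)]
        for row in X
    ]
```

Constant columns are mapped to 0.0 during scaling, which is a convention chosen here to avoid division by zero.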

Step (2): constructing the hypergraph structure of the omics data

(2.1) A hypergraph is defined as $G = (V, E, W)$, consisting of the vertex set $V = \{v_1, v_2, \ldots, v_m\}$, the hyperedge set $E = \{e_1, e_2, \ldots, e_l\}$, and $W$, the weight matrix of the hyperedges, which represents the importance of each hyperedge. In the hypergraph, each vertex corresponds to a sample, and each hyperedge contains an arbitrary subset of $V$. A cosine similarity operation is applied to the feature matrix X output by step (1) to measure the relationship between features in the omics.

Traditional hypergraph construction usually uses the Euclidean distance of formula (2) to compute the straight-line distance between vectors and thereby measure the proximity of different samples. Euclidean distance is better suited to reflecting absolute numerical differences and does not fully fit the implicit correlations between features in omics data. In the present invention, different samples are regarded as different vectors, and the cosine similarity matrix obtained by formula (3) measures how close two samples are by the angle between their vectors. Here $x_i$ is the feature vector of the i-th sample in the feature matrix X, $x_{ir}$ is the value of the r-th feature of the i-th sample, and $R$ is the total number of features. This approach is theoretically more consistent with the mode of action within omics, and its effectiveness is demonstrated by a control experiment.
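The contrast between the two measures can be shown with a short sketch that uses the textbook definitions of Euclidean distance and cosine similarity (assumed here to be what formulas (2) and (3) express). The function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def cosine_similarity_matrix(X):
    """n x n matrix of pairwise cosine similarities between the rows of X."""
    return [[cosine_similarity(a, b) for b in X] for a in X]
```

For two samples such as [1, 2] and [2, 4], the Euclidean distance is nonzero because the magnitudes differ, while the cosine similarity is 1 because the expression profiles point in the same direction.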

(2.2) KNN clustering is performed on the samples according to the resulting cosine similarity matrix. Because the cosine between two vectors decreases as the angle between them increases, the KNN step in the present invention returns the indexes of the k largest values in each row of the similarity matrix. These indexes form the hyperedge e associated with that hypergraph vertex; the k indexed positions are set to 1 in the matrix and all other positions are set to 0. The matrix H constructed in this way is the incidence matrix of the hypergraph G, defined as:

On this basis, the degree $D_v$ of a vertex is defined as:

where $w(e)$ is the weight of the hyperedge in the weight matrix, and the degree $D_e$ of a hyperedge is defined as:
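As a minimal sketch of step (2.2), assuming the standard hypergraph definitions ($h(v, e) = 1$ when vertex v belongs to hyperedge e and 0 otherwise, vertex degree as the weighted count of incident hyperedges, hyperedge degree as the number of vertices it contains), the incidence matrix and the two degrees could be computed as follows. The function names and the choice of rows as vertices and columns as hyperedges are assumptions made for illustration.

```python
def knn_incidence_matrix(S, k):
    """Build H (n vertices x n hyperedges) from an n x n similarity matrix S.

    Hyperedge j is generated by sample j and contains the k samples with the
    largest similarity in row j of S (normally including sample j itself).
    """
    n = len(S)
    H = [[0] * n for _ in range(n)]
    for j in range(n):
        # Indexes of the k largest similarities in row j.
        top_k = sorted(range(n), key=lambda i: S[j][i], reverse=True)[:k]
        for i in top_k:
            H[i][j] = 1
    return H

def vertex_degrees(H, w=None):
    """d(v) = sum over hyperedges e of w(e) * h(v, e); w defaults to all ones."""
    n_edges = len(H[0])
    w = w if w is not None else [1.0] * n_edges
    return [sum(w[e] * row[e] for e in range(n_edges)) for row in H]

def hyperedge_degrees(H):
    """delta(e) = number of vertices contained in hyperedge e."""
    n_edges = len(H[0])
    return [sum(row[e] for row in H) for e in range(n_edges)]
```

With this construction every hyperedge contains exactly k vertices, while vertex degrees vary with how often a sample appears among the neighbors of other samples.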

Step (3): building a hypergraph convolutional neural network for single-omics feature extraction

(3.1) First, the Laplacian matrix of the hypergraph incidence matrix is constructed according to the normalized Laplacian formula, converting the abstract node relationships in the hypergraph into a matrix that can be used as neural network input. For a conventional graph structure, the Laplacian matrix is constructed as:

where $I$ is the identity matrix, $D$ is the degree matrix of the vertices in the graph, and $A$ is the adjacency matrix of the graph.

Similarly, the Laplacian matrix of the hypergraph structure formed in step (2) is defined as:

where $D_v$ is the vertex degree matrix of the hypergraph obtained by formula (5), $D_e$ is the hyperedge degree matrix obtained by formula (6), and $H$ is the incidence matrix obtained by formula (4). For data sets without a given weight matrix $W$, $W$ defaults to the identity matrix $I$, meaning that all hyperedges have equal weight.
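A minimal sketch of the matrix built from $D_v$, $D_e$, $H$ and $W$ is given below. It assumes the normalized propagation form $D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2}$ that is widely used in hypergraph neural networks; whether formula (8) is this operator or the identity matrix minus it is an assumption that should be checked against the original formula. The function name is hypothetical.

```python
def hypergraph_laplacian(H, w=None):
    """Return D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} as a list of lists.

    H is an n x m incidence matrix (rows = vertices, columns = hyperedges);
    w holds the m hyperedge weights and defaults to all ones (W = I).
    """
    n, m = len(H), len(H[0])
    w = w if w is not None else [1.0] * m
    d_v = [sum(w[e] * H[v][e] for e in range(m)) for v in range(n)]
    d_e = [sum(H[v][e] for v in range(n)) for e in range(m)]
    L = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if d_v[u] == 0 or d_v[v] == 0:
                continue  # isolated vertices contribute nothing
            shared = sum(w[e] * H[u][e] * H[v][e] / d_e[e]
                         for e in range(m) if d_e[e] > 0)
            L[u][v] = shared / (d_v[u] ** 0.5 * d_v[v] ** 0.5)
    return L
```

The result is a symmetric n × n matrix whose entry (u, v) is nonzero only when samples u and v share at least one hyperedge.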

(3.2) The hypergraph Laplacian matrix of the single-omics data and the preprocessed feature data are input to a hypergraph convolutional neural network to perform an initial prediction task. The training goal of each hypergraph convolutional neural network is to learn the association between the input data and the corresponding labels. Specifically, the model requires two inputs. The first is the result of step (1), namely the preprocessed feature matrix $X \in \mathbb{R}^{n \times d}$, where n is the number of samples and d is the number of omics features. The second is the description of the hypergraph structure, namely the hypergraph Laplacian matrix $L_h \in \mathbb{R}^{n \times n}$ obtained by formula (8).

A HyperGraph Convolutional neural Network (HGCN) model is constructed by stacking 3 convolutional layers and 1 fully connected layer. The dimensions of the convolutional layers are set according to the dimensions of the feature matrix X, and the output dimension of the fully connected layer is the number of label categories. The convolutional layer is defined as:

$$\mathrm{HGConv}^{(l+1)} = f\left(\mathrm{HGConv}^{(l)}, L_h\right) = \sigma\left(L_h \, \mathrm{HGConv}^{(l)} \, Z^{(l)}\right) \tag{9}$$

where $\mathrm{HGConv}^{(l)}$ is the output of the l-th layer and $Z^{(l)}$ is the weight matrix of the l-th layer; when $l = 0$, $\mathrm{HGConv}^{(0)} = X$. $\sigma(\cdot)$ is the activation function of the hidden layers, set to the LeakyReLU function in this method, where k is the negative-slope parameter of the activation function, used to mitigate the vanishing gradients caused by dead neurons:

A dropout mechanism is added after the first two convolutional layers to reduce the risk of overfitting. The fully connected layer after the third convolutional layer performs feature integration. The model output $F_o$ is the feature extraction result, $F_o \in \mathbb{R}^{n \times b}$, where n is the number of samples and b is the number of label categories.
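One convolutional layer of formula (9) can be sketched in plain Python as below. The LeakyReLU is written in its standard form (x for x > 0, k·x otherwise); the default k = 0.01, the helper names and the omission of dropout and bias terms are assumptions made to keep the illustration short.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def leaky_relu(x, k=0.01):
    """LeakyReLU with negative slope k."""
    return x if x > 0 else k * x

def hgconv_layer(L_h, features, Z, k=0.01):
    """One layer of formula (9): sigma(L_h @ features @ Z) with sigma = LeakyReLU."""
    out = matmul(matmul(L_h, features), Z)
    return [[leaky_relu(v, k) for v in row] for row in out]

# Tiny usage example: 2 samples, 2 features, identity weights.
L_h = [[0.5, 0.5], [0.5, 0.5]]
X = [[1.0, -3.0], [3.0, -1.0]]
Z = [[1.0, 0.0], [0.0, 1.0]]
layer_out = hgconv_layer(L_h, X, Z, k=0.1)
```

Multiplying by $L_h$ mixes each sample's features with those of the samples that share a hyperedge with it, which is how the layer uses the hypergraph structure.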

The invention also supports predicting the phenotype from single-omics data with the HGCN alone; that is, the network is trained with a cross-entropy loss through the back-propagation of a single HGCN:

where $\mathrm{Loss}_{CE}(\cdot)$ is the cross-entropy loss function and $y$ is the sample label. The gradient is computed from the loss value $\mathrm{Loss}_{HGCN}$ and the network weights $Z$ are updated to complete one back-propagation pass. The model saved after a number of training iterations is then used for association prediction between single-omics data and the phenotype.
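For illustration, a cross-entropy loss over the HGCN output can be sketched as follows. Treating the rows of $F_o$ as unnormalized scores passed through a softmax, using integer class labels, and averaging over samples are assumptions; the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(F_o, y):
    """Mean cross-entropy between softmax(F_o) and integer labels y."""
    losses = [-math.log(softmax(row)[label]) for row, label in zip(F_o, y)]
    return sum(losses) / len(losses)
```

The loss is small when the score of the labeled class dominates its row and grows as the model favors a wrong class.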

Step (4): multi-omics integration algorithm based on the Dirichlet distribution

A corresponding HGCN is constructed for each omics data set as in step (3), and each network outputs a feature result matrix $F^o \in \mathbb{R}^{n \times b}$. Using formula (12), the Dirichlet distribution parameter matrix $\alpha^o$ of $F^o$ is constructed, with $\alpha_{ij}^o$ denoting each element of $\alpha^o$. From these parameters, the reliability $p_{ij}^o$ of each element $f_{ij}^o$ of $F^o$ is computed, forming the matrix $P^o$, and the uncertainty $u_i^o$ of the prediction for each sample under this omics is computed, forming the vector $U^o$:

$$\alpha^o = F^o + 1 \tag{12}$$
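A minimal sketch of how reliabilities and an uncertainty might be derived from formula (12) is shown below. It assumes non-negative network outputs and the subjective-logic mapping commonly paired with Dirichlet parameters, namely $p_{ij} = (\alpha_{ij} - 1) / S_i$ and $u_i = K / S_i$ with $S_i = \sum_j \alpha_{ij}$ and K the number of classes. This mapping and the function name are assumptions, not taken from the text.

```python
def dirichlet_opinion(evidence_row):
    """Map non-negative evidence for one sample to (alpha, beliefs, uncertainty)."""
    K = len(evidence_row)
    alpha = [e + 1.0 for e in evidence_row]    # formula (12): alpha = F + 1
    S = sum(alpha)                             # Dirichlet strength
    beliefs = [(a - 1.0) / S for a in alpha]   # assumed: p_ij = (alpha_ij - 1) / S_i
    uncertainty = K / S                        # assumed: u_i = K / S_i
    return alpha, beliefs, uncertainty
```

Under this mapping the beliefs and the uncertainty of a sample always sum to 1, and a sample with no evidence at all has uncertainty 1.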

Based on the belief matrix $P^o$ and the uncertainty vector $U^o$ of the single-omics predictions obtained above, multi-omics fusion prediction is performed. This process uses classical Dempster-Shafer (D-S) evidence theory, formula (13), to fuse information between omics pairwise:

where $p_i$ denotes the i-th row of the matrix $P$ and m is set to a value not less than 0. When $m = 0$, the formula fuses $P^0, U^0$ (the prediction of the first omics) with $P^1, U^1$ (the prediction of the second omics) to obtain $P^2, U^2$ as the fusion result of the two omics. When $m = 1$, the formula fuses $P^2, U^2$ (the fusion of the first two omics) with $P^3, U^3$ (the prediction of the third omics) to obtain $P^4, U^4$ as the fusion result of the three omics. Fusion proceeds in the same way until all omics have been fused, giving $P^{2m+2}, U^{2m+2}$.
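The pairwise fusion can be sketched for a single sample as below. The sketch assumes the reduced Dempster combination rule often used with belief-plus-uncertainty opinions: the conflict C is the belief mass the two sources assign to different classes, the fused belief is $(b^1_k b^2_k + b^1_k u^2 + b^2_k u^1) / (1 - C)$ and the fused uncertainty is $u^1 u^2 / (1 - C)$. That this matches formula (13) term by term is an assumption, and the function names are hypothetical.

```python
def ds_combine(b1, u1, b2, u2):
    """Fuse two opinions (beliefs, uncertainty) for one sample."""
    K = len(b1)
    # Conflict: belief mass that the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(K) for j in range(K) if i != j)
    scale = 1.0 / (1.0 - conflict)  # undefined when the sources fully conflict
    b = [scale * (b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) for k in range(K)]
    u = scale * u1 * u2
    return b, u

def fuse_all(opinions):
    """Fold ds_combine over a list of (beliefs, uncertainty) pairs, one per omics."""
    b, u = opinions[0]
    for b_next, u_next in opinions[1:]:
        b, u = ds_combine(b, u, b_next, u_next)
    return b, u
```

When two omics agree, the fused uncertainty drops below either input; fusing with a completely uncertain source (all beliefs 0, uncertainty 1) leaves the other opinion unchanged.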

After all omics have been fused, the Dirichlet distribution parameter α and the fused prediction F for the multi-omics case are recovered by inverting formula (12).

Finally, the multi-omics fusion prediction is trained. Unlike the cross-entropy calculation of formula (11), the fusion loss is calculated with formula (14):

$$\mathrm{Loss}_{MOIA} = \mathrm{Loss}_{right} + \lambda_{epoch}\,\mathrm{Loss}_{wrong}$$

where $\mathrm{Loss}_{right}$ is the loss function for the correct label, $\mathrm{Loss}_{wrong}$ is the loss function for the wrong labels, and $\mathrm{Loss}_{MOIA}$ is the total loss; $\lambda_{epoch}$ is a loss weight that changes dynamically with the current training epoch and takes a value in (0, 1); $K$ is the number of label categories; $y_i$ is the one-hot encoded label vector of the i-th sample, and $y_{ij}$ is its j-th element; $\alpha_i$ is the Dirichlet distribution parameter set of the i-th sample, and $\alpha_{ij}$ is the estimated Dirichlet parameter for the j-th class of the i-th sample; $\Gamma(\cdot)$ is the gamma function, where t is a constant integration parameter. The method makes full use of the Dirichlet parameter estimate α: computing $\mathrm{Loss}_{right}$ drives the model to predict the correct label as far as possible, and computing $\mathrm{Loss}_{wrong}$ further suppresses predictions of wrong labels, so that $\mathrm{Loss}_{MOIA}$ improves model accuracy from both directions. The loss value is used to compute gradients, complete back-propagation and update the neuron weights of the hypergraph convolutional neural networks. The trained model can then accurately predict the phenotype based on omics-specific information and cross-omics association learning.

The beneficial effects of the invention are:

(1) The designed hypergraph data structure is used as the input data type. Compared with a conventional graph structure, a hypergraph represents data containing multi-way relations with higher fidelity. A hypergraph convolutional neural network is built on this basis, and the associations between different features within an omics are mined by fully combining the original features with the hypergraph representation.

(2) A multi-omics integration algorithm based on the Dirichlet distribution is provided, which effectively exploits the complementary relations of biological features at different levels and further improves the mining of potentially unknown associations between human omics and phenotypes. It can help people better understand biological dynamic regulation and provide more comprehensive theoretical support for disease detection, subtyping and risk prediction.

Drawings

Fig. 1 is an overall architecture diagram of the present invention.

Figure 2 is a framework diagram of the present invention for implementing multiomic phenotypic association prediction.

Fig. 3 is an overall flow chart of the present invention.

Detailed Description

The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.

As shown in fig. 1, the multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution according to the present invention can be roughly divided into four modules: omics data preprocessing, omics data hypergraph representation, feature extraction by the hypergraph neural network, and multi-omics integration prediction.

(1) The preprocessing module handles the cleaning of raw omics data and the pre-screening of features. Each type of omics data is preprocessed separately to remove noise, errors and redundant features that may impair association-mining performance, which better supports the subsequent model. First, features with no probe signal or low variation (mean close to 0) are filtered out of each omics data set. Because different omics data types have different expression ranges, the expression values are optionally scaled by a linear transformation so that the model can process them.

(2) The omics data hypergraph representation module covers the cosine similarity calculation and the KNN clustering process. For each type of preprocessed feature data, the cosine similarity matrix between samples is computed first; the KNN algorithm then selects, for each hypergraph node, the k samples with the largest cosine values; finally, the positions of these most similar samples are set to 1 in the matrix and all other positions are set to 0, completing the hypergraph incidence matrix.

(3) The feature extraction module of the hypergraph neural network covers the construction and training of the neural network. The Laplacian matrix of the hypergraph structure is constructed according to the normalized hypergraph Laplacian formula, converting the abstract node relationships in the hypergraph into a matrix that can be used as neural network input. A hypergraph convolutional neural network (HGCN) is built for each omics; the preprocessed feature matrix of that omics and the corresponding hypergraph Laplacian matrix are the inputs of the HGCN, which learns the omics-specific association with the phenotype. The main advantage of the HGCN is that it exploits the latent correlations between samples in the omics data to achieve more effective feature extraction.

(4) The multi-omics integration prediction module constructs Dirichlet distribution parameters from the output of each HGCN model and uses a loss function different from conventional cross-entropy for the final label prediction learning. The multi-omics integration algorithm (MOIA) first computes the uncertainty from the Dirichlet distribution parameters of each omics, then mines the latent correlations among different omics through the classical D-S combination rule, thereby effectively integrating the features extracted by each omics-specific network.

As shown in fig. 3, taking the TCGA BRCA multi-omics data set as an example, association prediction of breast cancer subtypes proceeds as follows:

(1) First, feature screening is performed on each omics data set following the preprocessing step, retaining features highly related to the phenotype label. Taking as an example the three omics data sets (methylation, mRNA and miRNA) related to the BRCA phenotype obtained from the TCGA open-source database, features with a sample call retention rate of less than 5% are filtered out of the miRNA and mRNA data; for the methylation data, normalized beta values are calculated as the methylation level of each methylation site. Second, features in the training data set with a variance of less than 0.3 are filtered out.

Meanwhile, for each label prediction task, the t-test of formula (1) is performed in turn to evaluate whether a sample differs significantly from the other data with the same label; samples that differ too much are deleted, and each type of omics data is scaled to the range [0,1] by a linear transformation.

(2) The preprocessed and screened omics data are represented as a hypergraph structure. As shown in fig. 2, the incidence matrix and the Laplacian matrix of the hypergraph are constructed from the feature matrix: the cosine similarity matrix of the single-omics data is computed by formula (3), the indexes of the k = 10 largest cosine values in each row are selected to build the incidence matrix of the hypergraph G, and the hypergraph Laplacian matrix is obtained by formula (8).

(3) The feature matrix and the Laplacian matrix of each single omics are input into a hypergraph convolutional neural network (HGCN) for feature extraction. As shown in fig. 2, an HGCN is built for each omics; each HGCN learns the hypergraph representation from the node features of that omics and the association relationships between nodes. In this example, the dimension of the original features is 1000 × 612 and the number of classification labels is 5, so the hidden layer dimensions are set to 400, 400 and 200, the input layer dimension is 1000 and the output layer dimension is 5. The neural network operates according to formulas (9) and (10), and dropout with a parameter of 0.5 is added after the first two convolutional layers to reduce the risk of overfitting.
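The layer sizes quoted in this example imply the following weight-matrix shapes, sketched here under the assumption that each of the three convolutional layers maps one listed dimension to the next and that the fully connected layer maps the last hidden dimension to the number of classes. The function name is hypothetical and bias terms are ignored.

```python
def layer_shapes(input_dim, hidden_dims, n_classes):
    """Return the (rows, cols) shapes of the conv weights and of the FC weight."""
    dims = [input_dim] + list(hidden_dims)
    conv = [(dims[i], dims[i + 1]) for i in range(len(hidden_dims))]
    fc = (dims[-1], n_classes)
    return conv, fc

# Dimensions from the example: input 1000, hidden 400/400/200, 5 classes.
conv_shapes, fc_shape = layer_shapes(1000, [400, 400, 200], 5)
```

Each sample's 1000-dimensional feature vector is thus reduced step by step to 5 output values, one per subtype.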

(4) The result of the HGCN for each omics from the preceding steps is input into the multi-omics integration algorithm (MOIA) for final integrated prediction. MOIA can reveal latent cross-omics label correlations: Dirichlet distribution parameters are constructed by formula (12), and classical D-S evidence theory, as in formula (13), is applied to fuse information between omics pairwise. After all omics data have been fused, the hypergraph convolutional neural networks are trained by back-propagation with the loss function of formula (14). The final association prediction is thus based on omics-specific information and cross-omics association learning. The result, shown as the final output in fig. 2, is an n × 5 tensor (n is the number of samples); the 5 values in each row represent the probability distribution over the five BRCA subtypes (Normal-like, Basal-like, HER2-enriched, Luminal A and Luminal B) for that sample, and the subtype with the highest probability is the final prediction.
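Reading the final prediction from the n × 5 output amounts to taking the index of the largest value in each row, as in the sketch below. The column order of the subtypes (written here in their usual spelling) and the function name are assumptions made for illustration.

```python
SUBTYPES = ["Normal-like", "Basal-like", "HER2-enriched", "Luminal A", "Luminal B"]

def predict_subtypes(prob_rows, labels=SUBTYPES):
    """Return, for each row of probabilities, the label with the highest value."""
    predictions = []
    for row in prob_rows:
        best = max(range(len(row)), key=lambda j: row[j])
        predictions.append(labels[best])
    return predictions
```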

(5) Several control experiments on the same data set demonstrate that the method of the invention outperforms other existing methods. Some of the control experiments are as follows:

I. Compared with the MOGONET method published in Nature Communications in 2021, the single-omics prediction accuracy (ACC) of the invention is higher by 0.06-0.09, and the multi-omics integration prediction accuracy (ACC) is higher by about 0.04 (MOGONET: 0.8289; the invention: 0.8670). In addition, with reference to the experimental section of the MOGONET paper, the accuracy of the disclosed method is far superior to that of other conventional machine learning methods.

II. The single-omics prediction accuracies on the HGNN are mRNA (0.8517), methylation (0.7871) and miRNA (0.8061); the prediction accuracy after MOIA integration is 0.8670, which demonstrates the effectiveness of the MOIA module.

III. In a comparison of hypergraph construction methods, the hypergraph built with cosine similarity improves the final prediction accuracy by 0.02-0.04 over the hypergraph built with the conventional Euclidean distance, which demonstrates the effectiveness of the cosine similarity method.

Claims

1. A multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution, characterized by comprising the following steps:

Step (1): omics data cleaning and preprocessing

Redundant noise in the raw data is removed from each omics data set, and features are then pre-selected as follows:

first, features whose variance in the data set is smaller than a threshold α are filtered out;

second, for each phenotype label, the t-test of formula (1) is performed in turn to check whether the omics data of samples sharing the same label differ significantly, and samples whose t value exceeds a threshold γ are deleted, where

$\bar{x}$ is the sample mean, $\mu$ is the sample expectation, $\sigma(x)$ is the sample standard deviation, and $n$ is the number of samples;

finally, because different omics data types have different expression ranges, the expression values are scaled to [0,1] by a linear transformation, and the output is the preprocessed feature matrix X;

Step (2): constructing the hypergraph structure of the omics data

(2.1) a hypergraph is defined as $G = (V, E, W)$, consisting of the vertex set $V = \{v_1, v_2, \ldots, v_m\}$, the hyperedge set $E = \{e_1, e_2, \ldots, e_l\}$, and $W$, the weight matrix of the hyperedges, which represents the importance of each hyperedge; in the hypergraph, each vertex corresponds to a sample, and each hyperedge contains an arbitrary subset of $V$; a cosine similarity operation is applied to the feature matrix X output by step (1) to measure the relationship between features in the omics;

different samples are regarded as different vectors, and the cosine similarity matrix obtained by formula (3) measures how close two samples are by the angle between their vectors;

where $x_i$ is the feature vector of the i-th sample in the feature matrix X;

(2.2) KNN clustering is performed on the samples according to the resulting cosine similarity matrix; because the cosine between two vectors decreases as the angle between them increases, the KNN step returns the indexes of the k largest values in each row of the similarity matrix, these indexes form the hyperedge e associated with that hypergraph vertex, the k indexed positions are set to 1 in the matrix, and all other positions are set to 0; the matrix H constructed in this way is the incidence matrix of the hypergraph G, defined as:

on this basis, the degree $D_v$ of a vertex is defined as:

and (3) constructing a hypergraph convolution neural network to perform characteristic extraction of a monamics:

(3.1) firstly, constructing a Laplace matrix of a hypergraph incidence matrix according to a Laplace standardized formula, and converting abstract node relations in the hypergraph into matrix types capable of being used as neural network input;

the Laplace matrix of the hypergraph structure formed in the step (2) is defined as:

wherein $D_v$ is the vertex degree matrix of the hypergraph obtained from formula (5), $D_e$ is the hyperedge degree matrix obtained from formula (6), and H is the incidence matrix obtained from formula (4); for a data set without a specific weight matrix W, W is defined as the identity matrix I by default, i.e. all hyperedges have equal weight;
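The defining formula of the Laplace matrix is not legible in this text. The sketch below assumes the normalized operator widely used for hypergraph convolution, $D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2}$, which is consistent with the quantities $D_v$, $D_e$, H and W named above but is not confirmed by the source. The function name is hypothetical.

```python
import math


def hypergraph_laplacian(H, w=None):
    """Compute D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} for an incidence matrix H."""
    n, m = len(H), len(H[0])
    w = w if w is not None else [1.0] * m  # W = I when no weights are given
    d_e = [sum(H[i][e] for i in range(n)) for e in range(m)]         # hyperedge degrees
    d_v = [sum(w[e] * H[i][e] for e in range(m)) for i in range(n)]  # vertex degrees
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of H W D_e^{-1} H^T, skipping empty hyperedges.
            total = sum(H[i][e] * w[e] * H[j][e] / d_e[e] for e in range(m) if d_e[e])
            scale = math.sqrt(d_v[i] * d_v[j])
            L[i][j] = total / scale if scale else 0.0
    return L


H = [[1, 1, 0], [1, 1, 1], [0, 0, 1]]
L_h = hypergraph_laplacian(H)
```

The result is a symmetric n × n matrix, matching the stated shape of the second network input.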

(3.2) the hypergraph Laplace matrix of the single-omics data and the preprocessed feature data are input into a hypergraph convolutional neural network to execute an initial prediction task; the training goal of each hypergraph convolutional neural network is to learn the association between the input data and the corresponding labels; specifically, the model requires the following two inputs: one input is the result of step (1), i.e. the preprocessed feature matrix $X \in \mathbb{R}^{n \times d}$, where n is the number of samples and d is the number of omics features; the other input is the description of the hypergraph structure, i.e. the hypergraph Laplace matrix $L_h \in \mathbb{R}^{n \times n}$ obtained from formula (8);

The hypergraph convolutional neural network (HGCN) model is constructed by stacking 3 convolutional layers and 1 fully connected layer; the dimensions of the convolutional layers are set according to the dimension of the feature matrix X, and the output dimension of the fully connected layer is the number of label categories; the convolutional layer is defined as:

$$\mathrm{HGConv}^{(l+1)} = f\left(\mathrm{HGConv}^{(l)}, L_h\right) = \sigma\left(L_h \, \mathrm{HGConv}^{(l)} \, Z^{(l)}\right) \tag{9}$$

wherein $\mathrm{HGConv}^{(l)}$ is the output of the l-th layer and $Z^{(l)}$ is the weight matrix of the l-th layer; when $l = 0$, $\mathrm{HGConv}^{(0)} = X$; $\sigma(\cdot)$ is the activation function of the hidden layers, set to the LeakyReLU function, where k is the negative slope parameter of the activation function:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ kx, & x < 0 \end{cases}$$
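A single convolutional layer of formula (9) can be sketched in plain Python as below. The helper names and the default slope of 0.01 are assumptions for illustration; dropout and weight training are omitted.

```python
def matmul(A, B):
    """Multiply two row-major matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def leaky_relu(x, k=0.01):
    """LeakyReLU with negative slope k."""
    return x if x >= 0 else k * x


def hgconv_layer(L_h, features, Z, k=0.01):
    """One layer of formula (9): sigma(L_h * features * Z)."""
    pre = matmul(matmul(L_h, features), Z)
    return [[leaky_relu(v, k) for v in row] for row in pre]


L_h = [[0.5, 0.5], [0.5, 0.5]]   # toy 2 x 2 hypergraph operator
X = [[1.0, 0.0], [0.0, 1.0]]     # two samples, two features
Z = [[2.0], [-4.0]]              # layer weights mapping 2 features to 1
out = hgconv_layer(L_h, X, Z, k=0.25)
```

Here the pre-activation value is -1 for both samples, so the negative slope k = 0.25 determines the output.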

a dropout mechanism is added after each of the first two convolutional layers to reduce the risk of overfitting; the fully connected layer following the third convolutional layer performs feature integration; the model output $F^o$ is the feature extraction result, $F^o \in \mathbb{R}^{n \times b}$, where n is the number of samples and b is the number of label categories;

meanwhile, the method supports predicting the corresponding phenotype from single-omics data through the HGCN, i.e. the network is trained through the back-propagation process of a single HGCN using a cross entropy loss function:

wherein $\mathrm{Loss}_{CE}(\cdot)$ denotes the cross entropy loss function and y is the sample label; the gradient is calculated from the loss value $\mathrm{Loss}_{HGCN}$ and the network weights Z are updated to complete one back-propagation pass; the model saved after several training iterations is used to predict the association between single-omics data and the phenotype;
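For illustration, a cross entropy loss over the network output can be sketched as follows. The source names only the cross entropy loss; the softmax normalization, the mean over samples, and the use of integer class labels are assumptions of this sketch.

```python
import math


def cross_entropy(logits, labels):
    """Mean softmax cross entropy of row-wise scores against integer class labels."""
    total = 0.0
    for row, y in zip(logits, labels):
        peak = max(row)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[y]  # equals -log softmax(row)[y]
    return total / len(logits)


# Two samples, two classes; the first prediction is confident and correct.
loss = cross_entropy([[2.0, 0.0], [0.0, 0.0]], [0, 1])
```

The confident correct prediction contributes a small term, while the uninformative second row contributes log 2.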

a corresponding HGCN is constructed for each omics data set using step (3), and each neural network outputs a feature result matrix $F^o \in \mathbb{R}^{n \times b}$; first, the Dirichlet distribution parameter matrix $\alpha^o$ of $F^o$ is constructed according to formula (12), where $\alpha_{ij}^o$ denotes an element of $\alpha^o$; from these parameters, the reliability $p_{ij}^o$ of each element $f_{ij}^o$ of $F^o$ is calculated to form the matrix $P^o$, and the uncertainty parameters $u_i^o$ of the prediction results within the omics form the vector $U^o$:

$$\alpha^o = F^o + 1 \tag{12}$$
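The formulas for the reliability and the uncertainty are not legible in this text. The sketch below assumes the mapping commonly used with Dirichlet-based evidential classification: with $S_i = \sum_j \alpha_{ij}$, the reliability is $(\alpha_{ij} - 1)/S_i$ and the uncertainty is $b/S_i$, where b is the number of label categories. It also assumes that the entries of F are non-negative. The function name is hypothetical.

```python
def dirichlet_opinion(F):
    """Map non-negative evidence rows F to (alpha, P, U) via alpha = F + 1."""
    alpha = [[f + 1.0 for f in row] for row in F]       # formula (12)
    P, U = [], []
    for a_row in alpha:
        strength = sum(a_row)                            # Dirichlet strength S_i
        P.append([(a - 1.0) / strength for a in a_row])  # reliability per class (assumed form)
        U.append(len(a_row) / strength)                  # uncertainty of the sample (assumed form)
    return alpha, P, U


# First sample has evidence for class 0; second sample has no evidence at all.
alpha, P, U = dirichlet_opinion([[3.0, 1.0], [0.0, 0.0]])
```

Under this assumption the reliabilities and the uncertainty of each sample sum to 1, and a sample with no evidence receives uncertainty 1.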

The reliability distribution matrix $P^o$ and the uncertainty vector $U^o$ of the obtained single-omics prediction results are used for multi-omics fusion prediction; this process adopts the classical D-S evidence theory, i.e. formula (13), to realize pairwise information fusion between omics:

wherein $p_i$ denotes the i-th row of the matrix P; m is an integer not less than 0; specifically, when m = 0, the formula fuses the first omics prediction result $P^0$, $U^0$ with the second omics prediction result $P^1$, $U^1$ to obtain $P^2$, $U^2$ as the fusion result of the two omics; when m = 1, the formula fuses the fusion result of the first two omics $P^2$, $U^2$ with the third omics prediction result $P^3$, $U^3$ to obtain $P^4$, $U^4$ as the fusion result of the three omics; the fusion of further omics proceeds by analogy until all omics have been fused, giving $P^{2m+2}$, $U^{2m+2}$;
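Formula (13) is not legible in this text. As a sketch only, the pairwise fusion can be illustrated with the reduced Dempster combination rule often used for Dirichlet-based opinions; this specific form is an assumption, as are the function names. The rule is applied per sample, and the case of total conflict (conflict equal to 1) is not handled.

```python
def ds_combine(p1, u1, p2, u2):
    """Fuse two opinions (belief list, uncertainty) with the reduced Dempster rule."""
    # Conflict: belief mass the two sources assign to different classes.
    conflict = sum(a * b for i, a in enumerate(p1) for j, b in enumerate(p2) if i != j)
    scale = 1.0 / (1.0 - conflict)
    p = [scale * (a * b + a * u2 + b * u1) for a, b in zip(p1, p2)]
    u = scale * u1 * u2
    return p, u


def fuse_all(opinions):
    """Fold pairwise fusion over a list of (p, u) pairs, one per omics."""
    p, u = opinions[0]
    for p_next, u_next in opinions[1:]:
        p, u = ds_combine(p, u, p_next, u_next)
    return p, u


# Three omics for one sample; the third carries no evidence (u = 1).
p, u = fuse_all([([0.5, 0.25], 0.25), ([0.5, 0.0], 0.5), ([0.0, 0.0], 1.0)])
```

In this rule, fusing with a fully uncertain source leaves the opinion unchanged, and the fused uncertainty is never larger than either input uncertainty.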

After the fusion of all omics is finished, the Dirichlet distribution parameter α and the fusion prediction result F under multi-omics fusion are derived in reverse according to formula (12);
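The reverse derivation can be sketched as below, assuming the same opinion mapping as is common for Dirichlet-based evidence (uncertainty equal to the number of categories divided by the Dirichlet strength, reliability equal to evidence divided by the strength). Only the final step, α = F + 1, is taken from formula (12); the rest is an assumption, and the function name is hypothetical.

```python
def opinion_to_dirichlet(p, u):
    """Invert the opinion mapping: S = K / u, evidence = p * S, alpha = evidence + 1."""
    strength = len(p) / u                 # Dirichlet strength; requires u > 0
    evidence = [b * strength for b in p]  # fused prediction result F for this sample
    alpha = [e + 1.0 for e in evidence]   # formula (12)
    return alpha, evidence


alpha, F = opinion_to_dirichlet([0.5, 0.25], 0.25)
```

The recovered parameters are consistent with the forward mapping: the entries of alpha sum to the strength used in the inversion.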

finally, the multi-omics fusion prediction is trained, and the fusion loss is calculated using formula (14):

$$\mathrm{Loss}_{MOIA} = \mathrm{Loss}_{right} + \lambda_{epoch} \, \mathrm{Loss}_{wrong}$$

wherein $\mathrm{Loss}_{right}$ is the loss function for correct labels, $\mathrm{Loss}_{wrong}$ is the loss function for wrong labels, and $\mathrm{Loss}_{MOIA}$ is the total loss function; $\lambda_{epoch}$ is a loss weight that changes dynamically with the current number of training epochs and takes a value in (0, 1); K denotes the number of label categories; $y_i$ denotes the label vector of the i-th sample in the one-hot encoding of the sample labels, and $y_{ij}$ denotes the element corresponding to the j-th label of the i-th sample in the one-hot encoding; $\alpha_i$ is the Dirichlet distribution parameter set of the i-th sample, and $\alpha_{ij}$ denotes the Dirichlet distribution parameter estimate for the j-th classification result of the i-th sample; $\Gamma(\cdot)$ is the gamma function, and t is a constant integration parameter.