CN114927162A - Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution - Google Patents

Info

Publication number
CN114927162A
Authority
CN
China
Prior art keywords
matrix
hypergraph
omics
data
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210544114.7A
Other languages
Chinese (zh)
Other versions
CN114927162B (en
Inventor
王浩华
高建
林恺
张强
何昆仑
石金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202210544114.7A priority Critical patent/CN114927162B/en
Publication of CN114927162A publication Critical patent/CN114927162A/en
Application granted granted Critical
Publication of CN114927162B publication Critical patent/CN114927162B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution, comprising four modules. The omics data preprocessing module cleans the raw omics data and pre-screens features to remove noise, errors and redundant features that may degrade association mining. The omics data hypergraph representation module computes the cosine similarity within each omics and constructs the hypergraph incidence matrix from it. The feature extraction module builds a hypergraph convolutional neural network to extract features from each omics data set. The multi-omics integrated prediction module constructs Dirichlet distribution parameters from the initial results produced by each omics-specific hypergraph convolutional neural network and feeds them to a multi-omics integration algorithm for final label prediction. Based on multi-omics data and the corresponding phenotype labels, the method mines the latent correlations among the omics, effectively integrates the feature information of each omics, and accurately predicts the association between omics data and human phenotypes.

Description

Multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution.
Background
In recent years, biotechnology has developed rapidly; high-throughput sequencing in particular has made breakthroughs in throughput, speed, accuracy, diversity and application value. Omics data can now be obtained more efficiently and at lower cost than before. Research on DNA, mRNA and methylation can be broadly divided into genomics, transcriptomics and epigenomics, and integrating these data provides the basis for multi-omics studies of human phenotypes. Moreover, the complexity of an organism is spread across these data types: because research on any single omics can reveal only part of that complexity, integrating multiple omics data sets allows the organism to be understood better and life processes to be observed more comprehensively.
A phenotype is a quantifiable trait expressed in the course of biological activity, that is, a characteristic biochemical index that can be objectively evaluated for an organism in a specific state, such as height, skin color or disease. Traditional statistical methods take the measured levels of genes, proteins and other substances in body fluids or tissues such as human blood and urine, compute statistics over the data, compare them with preset thresholds to infer biomarkers of the corresponding omics, and then infer the phenotype of the test data from those biomarkers. For example, the GWAS method compares the P values of sample data to study disease-related DNA gene fragments, SNP sites and the like in genomics. However, the omics-phenotype associations mined by traditional statistical methods are clearly limited. On the one hand, such methods compute statistics only for a single marker in each omics, yet several markers with low statistical values may jointly play a decisive role in the phenotype, so the influence of low-scoring markers on the phenotype cannot be ruled out. On the other hand, because regulation in organisms is a multi-level dynamic process, methods that target a single omics are fundamentally limited and cannot account for the upstream and downstream regulation among omics.
In view of the above, a comprehensive multi-omics approach is needed to make full use of these data in understanding biological systems. With increasingly affordable computing power and high-throughput omics data, and with the success of artificial intelligence in many fields, machine learning has become popular in biology. Machine learning can be used to mine information hidden in experimental data. Conventional statistical models are typically designed around statistical assumptions and draw inferences about a particular phenomenon from a given data set, whereas machine learning methods aim to learn from historical or existing data and use what is learned to predict or classify unseen data. For example, Xu et al. developed the HI-DFNForest framework, which learns high-level feature representations from three omics data sets using stacked autoencoders and then integrates these representations to predict cancer subtypes. MOGONET, proposed by Wang et al., constructs a graph structure for each omics data set, performs an initial prediction with a graph convolutional neural network, and then integrates the results through the multi-view integration network VCDN to achieve multi-omics integrated phenotype classification. However, these methods still leave room for improvement in prediction accuracy and module design.
Disclosure of Invention
To solve the above problems, the invention provides a method for predicting the association between human phenotypes and omics data based on hypergraphs and the Dirichlet distribution. First, the raw data are cleaned and screened by a preprocessing module. Second, a neural network model based on hypergraph representation is developed from the combined data matrix formed by several omics data sets: each omics data set is represented as a hypergraph, a KNN (K-Nearest Neighbor) algorithm based on cosine similarity is used in building that representation, and the associations among different positions within each omics are mined in depth. Then, features are extracted efficiently from the represented data by a hypergraph convolutional neural network, which also supports association prediction between a single omics and the phenotype. Finally, a multi-omics combined matrix is formed from the feature matrices to construct a fusion algorithm for multiple (two or more) omics based on the Dirichlet distribution; a loss function built on the Dirichlet distribution integrates information across the omics, so that information is shared among omics on the basis of the feature matrices and the human phenotype is predicted accurately.
To achieve this purpose, the specific technical scheme of the invention is as follows:
A multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution comprises the following steps:
Step (1): omics data cleaning and preprocessing
Redundant noise in the raw data is removed with a conventional preprocessing method for each omics data type. For example, for miRNA omics data only data with a chip detection success rate of at least 95% are retained; for methylation omics data, normalized beta values are calculated as the expression level of each methylation site. The screened data may still contain redundant features or noise that harm prediction performance. To address this, features are pre-selected as follows.
First, features in the data set having a variance less than a threshold α are filtered out.
Second, for each phenotype label, the t hypothesis test of formula (1) is performed in turn to check whether the omics data of samples sharing that label differ significantly, and samples whose t value exceeds the threshold γ are deleted. Here $\bar{x}$ is the sample mean, $\mu$ is the sample expectation, $\sigma(x)$ is the standard deviation of the samples, and $n$ is the number of samples.
$$ t = \frac{\bar{x} - \mu}{\sigma(x) / \sqrt{n}} \tag{1} $$
Finally, because different omics data types have different expression ranges, the expression values are scaled to [0, 1] by a linear transformation so that the model can process them. The output of this step is the preprocessed feature matrix X.
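The preprocessing of step (1) can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation: the function names `preprocess` and `t_statistic` are hypothetical, the t statistic is assumed to take the usual one-sample form (mean minus expectation, divided by the standard error), and the use of the unbiased standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def preprocess(X, var_threshold):
    """Variance filter followed by per-feature min-max scaling to [0, 1].

    X: (n_samples, n_features) expression matrix.
    var_threshold: variance threshold (alpha in the text).
    Returns the scaled matrix and the boolean mask of retained features.
    """
    X = np.asarray(X, dtype=float)
    keep = X.var(axis=0) >= var_threshold        # drop low-variance features
    X = X[:, keep]
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # guard against constant columns
    return (X - lo) / span, keep

def t_statistic(x, mu):
    """Assumed one-sample form: (mean - mu) / (std / sqrt(n))."""
    x = np.asarray(x, dtype=float)
    return (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(x.size))
```

A sample would then be compared against the threshold γ via `abs(t_statistic(...))`; how the expectation μ is chosen per label is not specified in the text and is left to the caller here.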
Step (2) constructing a hypergraph structure of omics data
(2.1) A hypergraph is defined as $G = (V, E, W)$, consisting of the vertex set $V = \{v_1, v_2, \ldots, v_m\}$, the hyperedge set $E = \{e_1, e_2, \ldots, e_l\}$, and $W$, the weight matrix of the hyperedges, which represents the importance of each hyperedge. In the hypergraph, each vertex corresponds to a sample, and each hyperedge contains an arbitrary subset of $V$. Cosine similarity is computed on the feature matrix X output by step (1) to measure the relationships among the samples' features within each omics.
Traditional hypergraph construction usually uses the Euclidean distance of formula (2), the straight-line distance between vectors, to measure how close different samples are. The Euclidean distance is better suited to reflecting absolute numerical differences and does not fully fit the implicit correlations between features in omics data. In the present invention, different samples are regarded as different vectors, and the cosine similarity matrix obtained with formula (3) measures how close the vectors are in angle. Here $x_i$ is the feature vector of the i-th sample in the feature matrix X, $x_{ir}$ is the value of the r-th feature of the i-th sample, and $R$ is the total number of features. This measure is theoretically more consistent with how features interact in omics data, and its effectiveness is confirmed by a control experiment.
$$ d(x_i, x_j) = \sqrt{\sum_{r=1}^{R} (x_{ir} - x_{jr})^2} \tag{2} $$
$$ \cos(x_i, x_j) = \frac{\sum_{r=1}^{R} x_{ir}\, x_{jr}}{\sqrt{\sum_{r=1}^{R} x_{ir}^2}\, \sqrt{\sum_{r=1}^{R} x_{jr}^2}} \tag{3} $$
(2.2) KNN clustering is performed on the samples according to the cosine similarity matrix. Because the cosine between two vectors decreases as the angle between them increases, the KNN process in the present invention returns the indices of the k largest values in each row of the similarity matrix. These indices form the hyperedge set of the hypergraph vertices: the k selected entries are set to 1 in the matrix and the remaining entries to 0. The matrix H constructed in this way is the incidence matrix of the hypergraph G, defined as:
$$ h(v, e) = \begin{cases} 1, & v \in e \\ 0, & v \notin e \end{cases} \tag{4} $$
Accordingly, the vertex degree $D_v$ is defined as:
$$ D_v(v) = \sum_{e \in E} w(e)\, h(v, e) \tag{5} $$
where $w(e)$ is the weight of hyperedge e in the weight matrix. The hyperedge degree $D_e$ is defined as:
$$ D_e(e) = \sum_{v \in V} h(v, e) \tag{6} $$
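Step (2) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the incidence matrix is laid out with rows as vertices and columns as hyperedges (hyperedge i collects the k samples most similar to sample i, including sample i itself), and the vertex and hyperedge degrees are taken to be the usual weighted row sums and column sums of H.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows (samples) of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # avoid division by zero
    Xn = X / norms
    return Xn @ Xn.T

def knn_incidence(S, k):
    """Build H with H[v, e] = 1 if sample v is among the k samples
    most similar to sample e (row e of S defines hyperedge e)."""
    n = S.shape[0]
    top = np.argsort(-S, axis=1, kind="stable")[:, :k]   # k largest per row
    H = np.zeros((n, n))
    H[top.ravel(), np.repeat(np.arange(n), k)] = 1.0
    return H

def degrees(H, w=None):
    """Vertex degrees sum_e w(e) h(v, e) and hyperedge degrees sum_v h(v, e)."""
    w = np.ones(H.shape[1]) if w is None else np.asarray(w, dtype=float)
    return H @ w, H.sum(axis=0)
```

With this layout every hyperedge has degree k by construction, while vertex degrees vary with how often a sample appears among other samples' nearest neighbours.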
Step (3): building a hypergraph convolutional neural network for single-omics feature extraction
(3.1) First, the Laplacian matrix of the hypergraph incidence matrix is constructed according to the normalized Laplacian formula, converting the abstract node relations in the hypergraph into a matrix form that can serve as neural network input. For a traditional graph structure, the Laplacian matrix is constructed as:
$$ L = I - D^{-1/2} A D^{-1/2} \tag{7} $$
where I is the identity matrix, D is the degree matrix of the vertices in the graph, and A is the adjacency matrix of the graph structure.
Similarly, the Laplacian matrix for the hypergraph structure formed in step (2) is defined as:
$$ L_h = D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2} \tag{8} $$
where $D_v$ is the vertex degree matrix of the hypergraph obtained from formula (5), $D_e$ is the hyperedge degree matrix obtained from formula (6), and H is the incidence matrix obtained from formula (4). For a data set with no specific weight matrix W given, W defaults to the identity matrix I, meaning that all hyperedges have equal weight.
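The hypergraph matrix of step (3.1) can be sketched as follows. The sketch assumes the normalized propagation form $D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}$ commonly used in hypergraph neural networks, with no leading identity term; if the intended definition subtracts this product from I, the result of the function would simply be subtracted from `np.eye(n)`. The function name and the vector representation of W (one weight per hyperedge) are assumptions.

```python
import numpy as np

def hypergraph_laplacian(H, w=None):
    """Compute D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}.

    H: (n_vertices, n_edges) incidence matrix.
    w: optional hyperedge weights; defaults to all ones (W = I).
    """
    H = np.asarray(H, dtype=float)
    w = np.ones(H.shape[1]) if w is None else np.asarray(w, dtype=float)
    d_v = H @ w                      # vertex degrees
    d_e = H.sum(axis=0)              # hyperedge degrees
    dv_inv_sqrt = np.zeros_like(d_v)
    dv_inv_sqrt[d_v > 0] = d_v[d_v > 0] ** -0.5
    de_inv = np.zeros_like(d_e)
    de_inv[d_e > 0] = 1.0 / d_e[d_e > 0]
    left = (dv_inv_sqrt[:, None] * H) * (w * de_inv)[None, :]   # D_v^{-1/2} H W D_e^{-1}
    right = H.T * dv_inv_sqrt[None, :]                          # H^T D_v^{-1/2}
    return left @ right
```

The result is a symmetric n × n matrix, which matches the shape stated for the structural input of the network.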
(3.2) The hypergraph Laplacian matrix of the single-omics data and the preprocessed feature data are fed to a hypergraph convolutional neural network to perform the initial prediction task. The training goal of each hypergraph convolutional neural network is to learn the association between the input data and the corresponding labels. Specifically, the model requires two inputs. One is the result of step (1), the preprocessed feature matrix $X \in \mathbb{R}^{n \times d}$, where n is the number of samples and d is the number of omics features. The other describes the hypergraph structure: the hypergraph Laplacian matrix $L_h \in \mathbb{R}^{n \times n}$ obtained from formula (8).
The HyperGraph Convolutional Network (HGCN) model is built by stacking 3 convolutional layers and 1 fully connected layer. The dimension of each convolutional layer is set according to the dimension of the feature matrix X, and the output dimension of the fully connected layer is the number of label classes. The convolutional layer is defined as:
$$ HGConv^{(l+1)} = f\left(HGConv^{(l)}, L_h\right) = \sigma\left(L_h \, HGConv^{(l)} \, Z^{(l)}\right) \tag{9} $$
where $HGConv^{(l)}$ is the output of the l-th layer and $Z^{(l)}$ is the weight matrix of the l-th layer; when l = 0, $HGConv^{(0)} = X$. $\sigma(\cdot)$ is the activation function of the hidden layers, set to the LeakyReLU function in this method, where k is the negative-slope parameter of the activation function, used to avoid the vanishing gradients caused by dead neurons:
$$ \mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ kx, & x < 0 \end{cases} \tag{10} $$
A dropout mechanism is added after the first two convolutional layers to reduce the risk of overfitting. The fully connected layer after the third convolutional layer integrates the features. The model output $F_o \in \mathbb{R}^{n \times b}$ is the feature extraction result, where n is the number of samples and b is the number of label classes.
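The forward pass described above (three hypergraph convolutional layers with LeakyReLU, dropout after the first two, then a fully connected layer) can be sketched in NumPy as follows. This is a forward-only illustration: the function names, the default negative slope of 0.01, the inverted-dropout scaling and the bias-free fully connected layer are all assumptions, and a real implementation would use a deep learning framework to obtain gradients.

```python
import numpy as np

def leaky_relu(x, k=0.01):
    """LeakyReLU with negative slope k."""
    return np.where(x > 0, x, k * x)

def hgcn_forward(X, L_h, conv_weights, fc_weight, k=0.01, drop_p=0.0, rng=None):
    """Forward pass: A <- LeakyReLU(L_h @ A @ Z) per conv layer, then A @ fc_weight.

    X: (n, d) features; L_h: (n, n) hypergraph matrix;
    conv_weights: list of three weight matrices Z; fc_weight: (hidden, b).
    drop_p > 0 enables training-time dropout after the first two conv layers.
    """
    if rng is None:
        rng = np.random.default_rng()
    A = np.asarray(X, dtype=float)
    for layer, Z in enumerate(conv_weights):
        A = leaky_relu(L_h @ A @ Z, k)
        if drop_p > 0 and layer < 2:
            mask = rng.random(A.shape) >= drop_p
            A = A * mask / (1.0 - drop_p)        # inverted dropout scaling
    return A @ fc_weight                          # (n, b) output F_o
```

At inference time `drop_p` is left at 0 so that the output is deterministic.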
The invention also supports predicting the phenotype from a single omics data set with the HGCN alone; that is, a single HGCN is trained by back-propagation using the cross-entropy loss function:
$$ Loss_{HGCN} = Loss_{CE}(F_o, y) \tag{11} $$
where $Loss_{CE}(\cdot)$ is the cross-entropy loss function and y is the sample label. Gradients are computed from the loss value $Loss_{HGCN}$ and the network weights Z are updated to complete one back-propagation pass; the model saved after several training iterations is then used to predict the association between the single-omics data and the phenotype.
Step (4): multi-omics integration algorithm based on the Dirichlet distribution
An HGCN is built for each omics data set using step (3), and each network outputs a feature result matrix $F_o \in \mathbb{R}^{n \times b}$. Formula (12) is used to construct the Dirichlet distribution parameter matrix $\alpha^{o}$ of $F_o$, where $\alpha_{ij}^{o}$ denotes an element of $\alpha^{o}$. From these parameters, the belief $p_{ij}^{o}$ of each element $f_{ij}^{o}$ of $F_o$ is computed to form the matrix $P_o$, and the uncertainty parameters $u_i^{o}$ of the prediction results under that omics form the vector $U_o$:
$$ p_{ij}^{o} = \frac{\alpha_{ij}^{o} - 1}{\sum_{j=1}^{b} \alpha_{ij}^{o}} $$
$$ u_{i}^{o} = \frac{b}{\sum_{j=1}^{b} \alpha_{ij}^{o}} $$
$$ \alpha^{o} = F_o + 1 \tag{12} $$
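The construction of the Dirichlet parameters, beliefs and uncertainties can be sketched as follows. The sketch assumes the subjective-logic convention in which each row's belief masses and its uncertainty sum to 1 (belief = (α − 1) / S and uncertainty = b / S, with S the row sum of α), and it assumes the network output F is non-negative, for example after a ReLU or softplus; the function name is hypothetical.

```python
import numpy as np

def dirichlet_opinion(F):
    """Map non-negative evidence F (n, b) to Dirichlet parameters,
    per-class beliefs P (n, b) and per-sample uncertainty U (n,)."""
    F = np.asarray(F, dtype=float)
    alpha = F + 1.0                          # alpha = F + 1, formula (12)
    S = alpha.sum(axis=1, keepdims=True)     # Dirichlet strength per sample
    P = (alpha - 1.0) / S
    U = F.shape[1] / S[:, 0]
    return alpha, P, U
```

A sample with no evidence at all (a zero row in F) gets zero belief for every class and uncertainty 1, which is the intended behaviour of this parameterization.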
Based on the belief matrix $P_o$ and uncertainty vector $U_o$ of each single-omics prediction obtained above, multi-omics fusion prediction is performed. This process uses the classical D-S evidence theory, formula (13), to fuse information between omics pairwise:
$$ p_{i}^{(2m+2)} = \frac{p_{i}^{(2m)} \odot p_{i}^{(2m+1)} + u_{i}^{(2m+1)}\, p_{i}^{(2m)} + u_{i}^{(2m)}\, p_{i}^{(2m+1)}}{1 - \sum_{j \ne k} p_{ij}^{(2m)}\, p_{ik}^{(2m+1)}} $$
$$ u_{i}^{(2m+2)} = \frac{u_{i}^{(2m)}\, u_{i}^{(2m+1)}}{1 - \sum_{j \ne k} p_{ij}^{(2m)}\, p_{ik}^{(2m+1)}} \tag{13} $$
In the formula, $p_i$ denotes the i-th row of the matrix P, the parenthesized superscript identifies which matrix the row or element belongs to, $\odot$ is the element-wise product, and m takes values not less than 0. When m = 0, the formula fuses $P_0$, $U_0$ (the prediction result of the first omics) with $P_1$, $U_1$ (the prediction result of the second omics) to obtain $P_2$, $U_2$ as the fusion result of the two omics. When m = 1, the formula fuses $P_2$, $U_2$ (the fusion of the first two omics) with $P_3$, $U_3$ (the prediction result of the third omics) to obtain $P_4$, $U_4$ as the fusion result of the three omics. Multi-omics fusion continues in this way until all omics have been fused, giving $P_{2m+2}$, $U_{2m+2}$.
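The pairwise fusion can be sketched as follows, assuming the reduced Dempster-Shafer combination rule commonly used for Dirichlet-based opinions: agreeing belief masses and belief-times-uncertainty terms are kept, and the result is renormalized by one minus the conflict (the total mass assigned to disagreeing class pairs). The function names `ds_combine` and `fuse_all` are hypothetical.

```python
import numpy as np

def ds_combine(P_a, U_a, P_b, U_b):
    """Fuse two opinions (beliefs (n, b), uncertainties (n,)) row by row."""
    P_a, P_b = np.asarray(P_a, dtype=float), np.asarray(P_b, dtype=float)
    U_a, U_b = np.asarray(U_a, dtype=float), np.asarray(U_b, dtype=float)
    agree = (P_a * P_b).sum(axis=1)
    # conflict = sum over j != k of p_a[j] * p_b[k]
    conflict = P_a.sum(axis=1) * P_b.sum(axis=1) - agree
    scale = 1.0 / (1.0 - conflict)
    P = scale[:, None] * (P_a * P_b + P_a * U_b[:, None] + P_b * U_a[:, None])
    U = scale * U_a * U_b
    return P, U

def fuse_all(opinions):
    """Sequentially fuse a list of (P, U) pairs, one pair per omics."""
    P, U = opinions[0]
    for P_next, U_next in opinions[1:]:
        P, U = ds_combine(P, U, P_next, U_next)
    return P, U
```

Two properties are easy to check with this rule: fusing with a fully uncertain opinion (zero belief, uncertainty 1) leaves the other opinion unchanged, and the fused beliefs and uncertainty of each sample still sum to 1.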
After all omics have been fused, the Dirichlet distribution parameter α and the fused prediction result F under multi-omics fusion are derived back from the fused beliefs and uncertainties according to formula (12).
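This back-derivation can be sketched as follows, assuming beliefs and uncertainties follow the convention belief = (α − 1) / S and uncertainty = b / S, where S is the row sum of α and b is the number of classes. Under that assumption S = b / u, F = P · S and α = F + 1. The function name is hypothetical.

```python
import numpy as np

def opinion_to_dirichlet(P, U):
    """Recover evidence F and Dirichlet parameters alpha from beliefs P (n, b)
    and uncertainties U (n,), assuming u = b / S and p = (alpha - 1) / S."""
    P = np.asarray(P, dtype=float)
    U = np.asarray(U, dtype=float)
    S = P.shape[1] / U               # Dirichlet strength per sample
    F = P * S[:, None]               # fused evidence
    return F + 1.0, F
```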
Finally, the multi-omics fusion prediction is trained. Unlike the cross-entropy computation of formula (11), the fusion loss is computed with formula (14):
$$ Loss_{MOIA} = Loss_{right} + \lambda_{epoch}\, Loss_{wrong} $$
Figure BDA0003651406420000081
Figure BDA0003651406420000082
Figure BDA0003651406420000083
Figure BDA0003651406420000084
Figure BDA0003651406420000085
Here, $Loss_{right}$ is the loss function for the correct label, $Loss_{wrong}$ is the loss function for the wrong labels, and $Loss_{MOIA}$ is the total loss; $\lambda_{epoch}$ is a loss weight that changes dynamically with the current training epoch and takes a value in (0, 1); K is the number of label classes; $y_i$ is the one-hot encoded label vector of the i-th sample, and $y_{ij}$ is its element for the j-th label; $\alpha_i$ is the Dirichlet distribution parameter set of the i-th sample, and $\alpha_{ij}$ is the Dirichlet distribution parameter estimate for the j-th class of the i-th sample; $\Gamma(\cdot)$ is the gamma function, where t is a constant integration parameter. The method makes full use of the Dirichlet parameter estimate α: computing $Loss_{right}$ drives the model to predict the correct label as strongly as possible, computing $Loss_{wrong}$ further suppresses predictions of wrong labels, and $Loss_{MOIA}$ thus improves model accuracy from both directions. The loss value is used to compute gradients, complete back-propagation, and update the neuron weights of the hypergraph convolutional neural networks. The trained model can then predict the phenotype accurately based on omics-specific information and cross-omics association learning.
The beneficial effects of the invention are:
(1) A hypergraph data structure is used as the input data type. Compared with a traditional graph structure, it represents data containing multi-way relations with higher fidelity. A hypergraph convolutional neural network is built on this structure, and the associations between different features within each omics are mined by fully combining the original features with the hypergraph representation.
(2) A multi-omics integration algorithm based on the Dirichlet distribution is provided. It makes effective use of the complementary relations among biological features at different levels and further improves the mining of potentially unknown associations between human omics and phenotypes. It can help people better understand biological dynamic regulation and can provide more comprehensive theoretical support for disease detection, typing and risk prediction.
Drawings
Fig. 1 is an overall architecture diagram of the present invention.
Figure 2 is a framework diagram of the present invention for implementing multiomic phenotypic association prediction.
Fig. 3 is an overall flow chart of the present invention.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
As shown in fig. 1, the multi-omics association phenotype prediction method based on hypergraph representation and Dirichlet distribution according to the present invention can be roughly divided into four modules: omics data preprocessing, omics data hypergraph representation, feature extraction by the hypergraph neural network, and multi-omics integrated prediction.
(1) The preprocessing module cleans the raw omics data and pre-screens features. Each type of omics data is preprocessed separately to remove noise, errors and redundant features that may degrade association mining, which also supports the subsequent model algorithm. First, features with no probe signal or low variation (mean close to 0) are filtered out of each omics data set. Because different omics data types have different expression ranges, the expression values are optionally scaled by a linear transformation for the model to operate on.
(2) The omics data hypergraph representation module covers the cosine similarity calculation and the KNN clustering process. For each type of preprocessed feature data, the cosine similarity matrix between samples is computed first; then, for each hypergraph node, the k samples with the largest cosine values are selected by the KNN algorithm; finally, the entries of these most similar samples are set to 1 in the matrix and the remaining entries to 0, completing the hypergraph incidence matrix.
(3) The feature extraction module of the hypergraph neural network covers the construction and training of the neural network. The Laplacian matrix of the hypergraph structure is constructed according to the normalized hypergraph Laplacian formula, converting the abstract node relations in the hypergraph into a matrix form usable as neural network input. A hypergraph convolutional neural network (HGCN) is built for each omics; the preprocessed feature matrix of that omics and the corresponding hypergraph Laplacian matrix are the inputs of the HGCN, which learns the omics-specific association with the phenotype. The main advantage of the HGCN is that it incorporates the latent correlations between samples in omics data, giving more efficient feature extraction.
(4) The multi-omics integrated prediction module constructs Dirichlet distribution parameters from the output of each HGCN model, and uses them to design a loss function, different from the traditional cross entropy, for the final label prediction. The multi-omics integration algorithm (MOIA) first computes the uncertainty from the Dirichlet distribution parameters of each omics, then mines the latent correlations among different omics through the classical D-S combination rule, thereby effectively integrating the features extracted by each omics-specific network.
As shown in fig. 3, taking the BRCA proteomics data set of TCGA as an example for the association prediction of breast cancer subtypes, the following steps are performed:
(1) firstly, performing feature screening on each omics data according to a route of a preprocessing step, reserving features highly related to a phenotype tag, and filtering features with sample calling retention rate of less than 5% in miRNA and mRNA data by taking three omics data (methylation, mRNA and miRNA data) related to BRCA phenotype obtained from a TCGA starting database as an example; for methylation data, normalized beta values were calculated as the methylation level for each methylation site. Second, features in the training dataset with a variance of less than 0.3 are filtered out.
Meanwhile, for each tag prediction task, the t test in the formula (1) is sequentially executed to evaluate whether the sample data is significantly different from other data with the same tag, the sample with the overlarge difference is deleted, and each type of omics data is scaled to the range of [0,1] through linear transformation.
(2) And (3) characterizing the preprocessed screened omics data into a hypergraph structure. As shown in fig. 2, namely, the incidence matrix and the laplacian matrix of the hypergraph are constructed for the feature matrix data, the cosine similarity matrix of the single set of the mathematical data is calculated according to the formula (3), and k with the largest cosine value in each row of the matrix is selected to be 10 indexes, so as to construct the incidence matrix of the hypergraph G, and the laplacian matrix of the hypergraph is obtained through the formula (8).
(3) The feature matrix and the Laplacian matrix of each single omics are input into a hypergraph convolutional neural network (HGCN) for feature extraction. As shown in Fig. 2, one HGCN is built for each omics type, and each HGCN learns the hypergraph representation from the node features of that omics and the association relationships between the nodes. In this example the original feature dimension is 1000 × 612 and the number of classification labels is 5, so the input layer dimension is set to 1000, the hidden layer dimensions to 400, 400 and 200, and the output layer dimension to 5. The network operations follow formulas (9) and (10), and dropout with parameter 0.5 is added after the first two convolutional layers to reduce the probability of overfitting.
(4) The output of the HGCN for each omics from the preceding steps is input into the multi-omics integration algorithm (MOIA) for the final integrated prediction. MOIA reveals latent cross-omics label correlations: Dirichlet distribution parameters are constructed based on formula (12), and the classical D-S evidence theory of formula (13) is introduced to fuse the information of the omics pairwise. After all omics data have been fused, the hypergraph convolutional neural networks are trained by back propagation using the loss function of formula (14). The final association prediction is thus based on both omics-specific information and cross-omics correlation learning. The result, shown as the final output in Fig. 2, is an n × 5 tensor (n is the number of samples); the 5 values in each row represent the probability distribution of the sample over the five BRCA subtypes (Normal-like, Basal-like, HER2-enriched, Luminal A and Luminal B), and the subtype with the highest probability is the final prediction.
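The last step, reading the predicted subtype off each row of the n × 5 output, can be sketched as follows. The column order of the subtypes, the helper name and the toy probabilities are assumptions made for illustration only.

```python
# Assumed column order; the text lists the five subtypes but not their column indices.
SUBTYPES = ["Normal-like", "Basal-like", "HER2-enriched", "Luminal A", "Luminal B"]

def predict_subtypes(prob_rows, names=SUBTYPES):
    """Return, for each sample, the subtype whose predicted probability is highest."""
    return [names[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]

# Two hypothetical samples.
predictions = predict_subtypes([
    [0.05, 0.10, 0.05, 0.70, 0.10],
    [0.60, 0.10, 0.10, 0.10, 0.10],
])
```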
(5) Multiple groups of control experiments were performed on the same data set for comparison; they show that the method of the invention outperforms other existing methods. Some of the control experiments were as follows:
I. Compared with the MOGONET method published in Nature Communications in 2021, the single-omics prediction accuracy (ACC) of the invention is higher by 0.06-0.09, and the multi-omics integration prediction accuracy (ACC) is higher by about 0.04 (MOGONET: 0.8289; the invention: 0.8670). In addition, with reference to the experimental section of the MOGONET paper, the accuracy of the invention is far higher than that of other conventional machine learning methods.
II. The single-omics prediction accuracies on the HGNN were mRNA 0.8517, methylation 0.7871 and miRNA 0.8061; after MOIA integration the prediction accuracy is 0.8670, which demonstrates the effectiveness of the integration performed by the MOIA module.
III. In a comparison of hypergraph construction methods, the hypergraph structure constructed by the cosine similarity method improves the final prediction accuracy by 0.02-0.04 over the one constructed by the conventional Euclidean distance method, which demonstrates the effectiveness of the cosine similarity method.

Claims (1)

1. A multi-omics association phenotype prediction method based on hypergraph characterization and Dirichlet distribution, characterized by comprising the following steps:
Step (1): omics data cleaning and preprocessing
Redundant noise in the original data is removed from each omics data type, and features are then pre-selected as follows:
first, features whose variance in the data set is smaller than a threshold α are filtered out;
second, for each phenotype label, the t hypothesis test of formula (1) is performed in turn to check whether the omics data of samples with the same label differ significantly, and samples whose t value is larger than a threshold γ are deleted:
$$t = \frac{\bar{x} - \mu}{\sigma(x)/\sqrt{n}} \tag{1}$$
where $\bar{x}$ is the sample mean, μ represents the sample expectation, σ(x) represents the standard deviation of the sample, and n represents the number of samples;
finally, because different omics data types have different expression ranges, the expression values are scaled to [0,1] through a linear transformation, and the result is output as the preprocessed feature matrix X;
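For reference, the t statistic of formula (1) can be computed as in the sketch below. The helper names are hypothetical, and two details are assumptions not fixed by the claim: the sample standard deviation is used for σ(x), and the threshold γ is applied to the absolute value of t.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(values, mu):
    """One-sample t statistic: (mean - mu) / (std / sqrt(n))."""
    n = len(values)
    return (mean(values) - mu) / (stdev(values) / sqrt(n))

def flag_outlier_samples(t_values, gamma):
    """Indices of the samples whose |t| exceeds the threshold gamma (assumed two-sided)."""
    return [i for i, t in enumerate(t_values) if abs(t) > gamma]

# Hypothetical values compared against an expectation of 3.
t = t_statistic([2, 4, 4, 4, 5, 5, 7, 9], mu=3)
flagged = flag_outlier_samples([0.5, -3.2, 1.9, 2.7], gamma=2.0)
```

How the per-sample values and the expectation μ are chosen for each label is not spelled out in the claim, so the inputs above are placeholders rather than a reproduction of the method.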
Step (2): constructing the hypergraph structure of the omics data
(2.1) A hypergraph is defined as $G = (V, E, W)$, consisting of a vertex set $V = \{v_1, v_2, \ldots, v_m\}$ and a hyperedge set $E = \{e_1, e_2, \ldots, e_l\}$; W is the weight matrix of the hyperedges and represents the importance of each hyperedge. In the hypergraph, each vertex corresponds to a sample, and each hyperedge contains an arbitrary subset of V. A cosine similarity operation is performed on the feature matrix X output by step (1) to measure the relationships among the feature vectors within the omics;
different samples are regarded as different vectors, and the cosine similarity matrix is obtained with formula (3), which measures how close two samples are by the angle between their vectors:
$$\cos(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert} \tag{3}$$
where $x_i$ represents the feature vector of the i-th sample in the feature matrix X;
(2.2) The samples are grouped by KNN according to the cosine similarity matrix obtained. Because the cosine value between two vectors decreases as the angle between them increases, the KNN procedure returns the indices of the k largest values in each row of the similarity matrix; these indices form a hyperedge e over the vertices of the hypergraph. The entries at the k indices are set to 1 in the matrix and the remaining entries are set to 0. The matrix H constructed in this way is the incidence matrix of the hypergraph G, defined as:
$$h(v, e) = \begin{cases} 1, & v \in e \\ 0, & v \notin e \end{cases} \tag{4}$$
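A minimal sketch of steps (2.1) and (2.2) follows, under stated assumptions: each sample generates one hyperedge containing its k most similar samples (itself included), rows of H index vertices and columns index hyperedges, and no sample vector is all zeros. The function names and the toy data are hypothetical.

```python
from math import sqrt

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows (samples) of X, as in formula (3)."""
    norms = [sqrt(sum(v * v for v in row)) for row in X]
    n = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

def knn_incidence(S, k):
    """Build an n x n incidence matrix: hyperedge i joins the k samples most similar to sample i."""
    n = len(S)
    H = [[0] * n for _ in range(n)]
    for i, row in enumerate(S):
        top_k = sorted(range(n), key=lambda j: row[j], reverse=True)[:k]
        for v in top_k:
            H[v][i] = 1  # assumed orientation: rows are vertices, columns are hyperedges
    return H

# Four toy samples forming two obvious pairs; the example in the text uses k = 10.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
S = cosine_similarity_matrix(X)
H = knn_incidence(S, k=2)
```

With k = 2 each hyperedge holds a sample and its nearest neighbour, so the two pairs of similar samples end up sharing hyperedges.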
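accordingly, the degree of each vertex $v \in V$, collected in the diagonal matrix $D_v$, is defined as:
$$d(v) = \sum_{e \in E} w(e)\, h(v, e) \tag{5}$$
where w(e) is the weight of hyperedge e in the weight matrix; the degree of each hyperedge $e \in E$, collected in the diagonal matrix $D_e$, is defined as:
$$\delta(e) = \sum_{v \in V} h(v, e) \tag{6}$$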
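The two degree definitions reduce to weighted row sums and plain column sums of the incidence matrix. The sketch below assumes H is stored with vertices as rows and hyperedges as columns; the helper names and the toy matrix are hypothetical.

```python
def vertex_degrees(H, w):
    """d(v): sum over hyperedges e of w(e) * h(v, e)."""
    return [sum(w[e] * h for e, h in enumerate(row)) for row in H]

def edge_degrees(H):
    """delta(e): number of vertices contained in hyperedge e."""
    return [sum(col) for col in zip(*H)]

# 4 vertices, 3 hyperedges, unit hyperedge weights (W = I).
H = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1],
     [1, 1, 1]]
w = [1.0, 1.0, 1.0]
d_v = vertex_degrees(H, w)
d_e = edge_degrees(H)
```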
Step (3): constructing a hypergraph convolutional neural network for single-omics feature extraction:
(3.1) first, the Laplacian matrix of the hypergraph is constructed from the incidence matrix according to the normalized Laplacian formula, converting the abstract node relationships in the hypergraph into a matrix form that can be used as neural network input;
the Laplacian matrix of the hypergraph structure formed in step (2) is defined as:
Figure FDA0003651406410000024
where $D_v$ is the vertex degree matrix of the hypergraph obtained from formula (5), $D_e$ is the hyperedge degree matrix obtained from formula (6), and H is the incidence matrix obtained from formula (4); for a data set without a specific weight matrix W, W defaults to the identity matrix I, that is, all hyperedges have equal weight;
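The exact expression of formula (8) is not reproduced in the text above. The sketch below therefore assumes the normalized form commonly used in hypergraph neural networks, $L_h = D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2}$, which may differ from the patented formula. The function name is hypothetical, and every vertex and hyperedge is assumed to have non-zero degree.

```python
from math import sqrt

def hypergraph_propagation_matrix(H, w=None):
    """Assumed form: D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, with W = I by default."""
    n, m = len(H), len(H[0])
    w = w or [1.0] * m
    d_v = [sum(w[e] * H[v][e] for e in range(m)) for v in range(n)]
    d_e = [sum(H[v][e] for v in range(n)) for e in range(m)]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(H[i][e] * w[e] / d_e[e] * H[j][e] for e in range(m))
            L[i][j] = shared / sqrt(d_v[i] * d_v[j])
    return L

# 3 vertices, 2 hyperedges: {v0, v1} and {v1, v2}.
H = [[1, 0],
     [1, 1],
     [0, 1]]
L_h = hypergraph_propagation_matrix(H)
```

The result is a symmetric n × n matrix whose entry (i, j) is non-zero only when samples i and j share at least one hyperedge, which is what allows it to act as the neighbourhood operator in the convolution of formula (9).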
(3.2) the hypergraph Laplacian matrix of the single-omics data and the preprocessed feature data are input into the hypergraph convolutional neural network to perform the initial prediction task. The training goal of each hypergraph convolutional neural network is to learn the association between the input data and the corresponding labels. Specifically, the model requires two inputs: one is the result of step (1), namely the preprocessed feature matrix $X \in \mathbb{R}^{n \times d}$, where n is the number of samples and d is the number of omics features; the other is the description of the hypergraph structure, namely the hypergraph Laplacian matrix $L_h \in \mathbb{R}^{n \times n}$ obtained from formula (8);
The hypergraph convolutional neural network (HGCN) model is built by stacking 3 convolutional layers and 1 fully connected layer; the dimensions of the convolutional layers are set according to the dimensions of the feature matrix X, and the output dimension of the fully connected layer is the number of label categories. A convolutional layer is defined as:
$$\mathrm{HGConv}^{(l+1)} = f\left(\mathrm{HGConv}^{(l)}, L_h\right) = \sigma\left(L_h\, \mathrm{HGConv}^{(l)} Z^{(l)}\right) \tag{9}$$
where $\mathrm{HGConv}^{(l)}$ is the output of the l-th layer and $Z^{(l)}$ is the weight matrix of the l-th layer; when l = 0, $\mathrm{HGConv}^{(0)} = X$. σ(·) is the activation function of the hidden layers, set to the LeakyReLU function, where k is its negative slope parameter:
$$\sigma(x) = \begin{cases} x, & x \ge 0 \\ kx, & x < 0 \end{cases} \tag{10}$$
dropout is added after the first two convolutional layers to reduce the probability of overfitting; the fully connected layer after the third convolutional layer performs feature integration. The model output $F^o$ is the feature extraction result, $F^o \in \mathbb{R}^{n \times b}$, where n is the number of samples and b is the number of label categories;
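The layer stack described above can be sketched as an inference-time forward pass. This is an illustrative pure-Python version with toy sizes, not the trained model: the weight initialization, the absence of bias terms, the activation after the third convolution and the identity matrix used in place of $L_h$ are all assumptions, and dropout is omitted because it is inactive at inference.

```python
import random

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def leaky_relu(M, k=0.01):
    """LeakyReLU applied element-wise; k is the negative slope."""
    return [[v if v >= 0 else k * v for v in row] for row in M]

def init_weights(n_in, n_out, rng):
    """Small random weights (the initialization scheme is an assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def hgcn_forward(X, L_h, conv_weights, fc_weight, k=0.01):
    """Three hypergraph convolutions, sigma(L_h * out * Z), then a fully connected layer."""
    out = X
    for Z in conv_weights:
        out = leaky_relu(matmul(matmul(L_h, out), Z), k)
    return matmul(out, fc_weight)  # shape n x b

rng = random.Random(0)
n, d, hidden, b = 4, 6, [5, 5, 3], 2  # toy sizes for illustration
X = [[rng.random() for _ in range(d)] for _ in range(n)]
L_h = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # placeholder operator
dims = [d] + hidden
conv_weights = [init_weights(dims[i], dims[i + 1], rng) for i in range(3)]
fc_weight = init_weights(hidden[-1], b, rng)
F = hgcn_forward(X, L_h, conv_weights, fc_weight)
```

A real implementation would normally use a tensor library with automatic differentiation so that the weights Z can be updated by back propagation; the list-based version only makes the matrix shapes explicit.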
in addition, the method supports predicting the corresponding phenotype from single-omics data through the HGCN alone; in that case the network is trained through the back propagation process of a single HGCN using the cross entropy loss function:
Figure FDA0003651406410000032
where $\mathrm{Loss}_{CE}(\cdot)$ represents the cross entropy loss function and y is the sample label. The gradient is computed from the loss value $\mathrm{Loss}_{HGCN}$ and the network weights Z are updated to complete one back propagation pass; after several training iterations, the saved model is used for association prediction between single-omics data and the phenotype;
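As a reminder of what the cross entropy loss measures, the sketch below evaluates it on raw network outputs. Applying a softmax to the outputs and averaging over samples are assumptions; the claim only names the loss. The function names are hypothetical.

```python
from math import exp, log

def softmax(row):
    """Numerically stable softmax of one output row."""
    m = max(row)
    exps = [exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(F, labels):
    """Mean negative log-likelihood of the true class under softmax(F)."""
    return -sum(log(softmax(row)[y]) for row, y in zip(F, labels)) / len(F)

# The same outputs scored against a matching and a non-matching label.
loss_good = cross_entropy([[4.0, 0.0, 0.0]], [0])
loss_bad = cross_entropy([[4.0, 0.0, 0.0]], [1])
```

The loss is small when the largest output coincides with the true label and grows when it does not, which is the signal used to update the weights Z.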
Step (4): multi-omics integration algorithm based on the Dirichlet distribution:
an HGCN is constructed for each omics data type using step (3), and each network outputs a feature result matrix $F^o \in \mathbb{R}^{n \times b}$. First, the Dirichlet distribution parameter matrix $\alpha^o$ of $F^o$ is constructed with formula (12), where $\alpha^o_{ij}$ denotes an element of $\alpha^o$. From these parameters, the reliability $p^o_{ij}$ of each element $f^o_{ij}$ of $F^o$ is computed, forming the matrix $P^o$, together with the uncertainty $u^o_i$ of the prediction within the omics, forming the vector $U^o$:
Figure FDA0003651406410000041
Figure FDA0003651406410000042
$$\alpha^o = F^o + 1 \tag{12}$$
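The expressions for the reliability and the uncertainty are not reproduced in the text above. The sketch below assumes the usual subjective-logic forms, with Dirichlet strength $S_i = \sum_j \alpha_{ij}$, reliability $p_{ij} = (\alpha_{ij} - 1)/S_i$ and uncertainty $u_i = b/S_i$; these are assumptions and may differ from the patented formulas. The network outputs are also assumed to be non-negative so that they can act as evidence. The function name is hypothetical.

```python
def dirichlet_opinion(F):
    """Map non-negative evidence rows to (alpha, reliability P, uncertainty U)."""
    alpha = [[f + 1.0 for f in row] for row in F]  # formula (12)
    P, U = [], []
    for a_row in alpha:
        S = sum(a_row)                             # Dirichlet strength of the sample
        P.append([(a - 1.0) / S for a in a_row])
        U.append(len(a_row) / S)
    return alpha, P, U

# One confident sample and one sample with no evidence at all.
alpha, P, U = dirichlet_opinion([[8.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
```

Under these assumed forms the reliabilities and the uncertainty of a sample always sum to 1, and a sample with no evidence gets uncertainty 1.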
The reliability distribution matrix $P^o$ and the uncertainty vector $U^o$ of each single-omics prediction result are then used for multi-omics fusion prediction. This process adopts the classical D-S evidence theory, in the form of formula (13), to fuse the information of the omics pairwise:
Figure FDA0003651406410000043
Figure FDA0003651406410000044
where $p_i$ represents the i-th row of matrix P, and m takes values not less than 0. Specifically, when m = 0 the formula fuses the first omics prediction result $P_0$, $U_0$ with the second omics prediction result $P_1$, $U_1$ to obtain $P_2$, $U_2$ as the fusion result of two omics; when m = 1 the formula fuses the fusion result of the first two omics, $P_2$, $U_2$, with the third omics prediction result $P_3$, $U_3$ to obtain $P_4$, $U_4$ as the fusion result of three omics; further omics are fused in the same way until all omics have been fused, giving $P_{2m+2}$, $U_{2m+2}$;
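The pairwise fusion can be illustrated with the reduced Dempster combination rule often used for Dirichlet-based opinions. Formula (13) itself is not reproduced in the text, so the rule below is an assumption, and the function names are hypothetical. The functions act on one sample, that is, on one row $p_i$ and its uncertainty $u_i$.

```python
def ds_combine(p1, u1, p2, u2):
    """Fuse two opinions (reliability vector, uncertainty) with a reduced Dempster rule."""
    # Conflict: mass that the two sources assign to different classes.
    conflict = sum(a * b for i, a in enumerate(p1) for j, b in enumerate(p2) if i != j)
    scale = 1.0 / (1.0 - conflict)
    p = [scale * (a * b + a * u2 + b * u1) for a, b in zip(p1, p2)]
    u = scale * u1 * u2
    return p, u

def fuse_all(opinions):
    """Fold the pairwise rule over a list of (reliability, uncertainty) pairs, one per omics."""
    p, u = opinions[0]
    for p_next, u_next in opinions[1:]:
        p, u = ds_combine(p, u, p_next, u_next)
    return p, u

# Three hypothetical single-omics opinions for one sample.
opinions = [([0.6, 0.1, 0.1], 0.2), ([0.5, 0.2, 0.1], 0.2), ([0.3, 0.3, 0.1], 0.3)]
p_fused, u_fused = fuse_all(opinions)
```

With this rule the fused reliabilities and uncertainty still sum to 1, agreement between omics lowers the uncertainty, and an omics that carries no evidence (uncertainty 1) leaves the other opinion unchanged.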
after all omics have been fused, the Dirichlet distribution parameter α and the fused prediction result F under multi-omics fusion are derived in reverse from formula (12);
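The reverse derivation can be sketched as follows. It assumes the subjective-logic relations $u = b/S$ and $p_j = f_j/S$ (assumptions, since those formulas are not reproduced in the text) and then applies formula (12). The function name is hypothetical, and the fused uncertainty is assumed to be strictly positive.

```python
def opinion_to_dirichlet(p, u):
    """Recover evidence F and Dirichlet parameters alpha from a fused opinion."""
    S = len(p) / u                   # Dirichlet strength implied by the uncertainty
    F = [b * S for b in p]           # fused evidence
    alpha = [f + 1.0 for f in F]     # formula (12)
    return alpha, F

# Round trip for an opinion that corresponds to evidence [8, 1, 1].
alpha_fused, F_fused = opinion_to_dirichlet([8 / 13, 1 / 13, 1 / 13], 3 / 13)
```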
finally, the multi-omics fusion prediction is trained, with the fusion loss calculated by formula (14):
$$\mathrm{Loss}_{MOIA} = \mathrm{Loss}_{right} + \lambda_{epoch}\, \mathrm{Loss}_{wrong}$$
Figure FDA0003651406410000051
Figure FDA0003651406410000052
Figure FDA0003651406410000053
Figure FDA0003651406410000054
Figure FDA0003651406410000055
where $\mathrm{Loss}_{right}$ is the loss function for the correct label, $\mathrm{Loss}_{wrong}$ is the loss function for the wrong labels, and $\mathrm{Loss}_{MOIA}$ is the total loss function; $\lambda_{epoch}$ is a loss weight that changes dynamically with the current training epoch and takes a value in (0,1); K represents the number of label categories; $y_i$ represents the one-hot encoded label vector of the i-th sample, and $y_{ij}$ is its j-th element; $\alpha_i$ is the Dirichlet distribution parameter vector of the i-th sample, and $\alpha_{ij}$ is the Dirichlet distribution parameter estimate for the j-th class of the i-th sample; Γ(·) is the gamma function, in which t is the integration parameter.
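The component formulas of the loss are not reproduced in the text above. For orientation only, the block below writes out one loss of this shape that is common in evidential deep learning: an adjusted cross entropy for the correct label plus a KL divergence term that penalizes evidence for wrong labels. These expressions are assumptions and may differ from the patented formulas; the digamma function ψ, the strength $S_i$ and the adjusted parameters $\tilde{\alpha}_i$ are symbols introduced here and do not appear in the source. The last line is the standard integral definition of the gamma function.

```latex
\begin{align*}
S_i &= \sum_{j=1}^{K} \alpha_{ij} \\
\mathrm{Loss}_{right}(\alpha_i) &= \sum_{j=1}^{K} y_{ij}\,\bigl(\psi(S_i) - \psi(\alpha_{ij})\bigr) \\
\tilde{\alpha}_i &= y_i + (1 - y_i) \odot \alpha_i \\
\mathrm{Loss}_{wrong}(\alpha_i) &= \log\frac{\Gamma\bigl(\sum_{j=1}^{K}\tilde{\alpha}_{ij}\bigr)}{\Gamma(K)\,\prod_{j=1}^{K}\Gamma(\tilde{\alpha}_{ij})}
  + \sum_{j=1}^{K}\bigl(\tilde{\alpha}_{ij} - 1\bigr)\Bigl[\psi(\tilde{\alpha}_{ij}) - \psi\Bigl(\sum_{k=1}^{K}\tilde{\alpha}_{ik}\Bigr)\Bigr] \\
\Gamma(x) &= \int_{0}^{\infty} t^{\,x-1} e^{-t}\,dt
\end{align*}
```

In a loss of this form, $\lambda_{epoch}$ is typically increased gradually during training so that the penalty on wrong-label evidence does not dominate the early epochs.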
CN202210544114.7A 2022-05-19 2022-05-19 Multi-mathematic association phenotype prediction method based on hypergraph characterization and dirichlet allocation Active CN114927162B (en)


Publications (2)

Publication Number Publication Date
CN114927162A true CN114927162A (en) 2022-08-19
CN114927162B CN114927162B (en) 2024-06-14

Family

ID=82808101

Country Status (1)

Country Link
CN (1) CN114927162B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant