CN113345593A - Method for predicting disease association relation in biological association network

Info

Publication number: CN113345593A
Application number: CN202110287525.8A
Authority: CN
Inventors: 郭菲; 王浩; 唐继军; 丁漪杰
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-03-17
Filing date: 2021-03-17
Publication date: 2021-09-03

Abstract

The invention discloses a method for predicting disease associations in a biological association network, comprising the following steps: S1, constructing a multi-kernel representation of the non-coding RNAs and a multi-kernel representation of the diseases; S2, fusing the non-coding RNA kernels and the disease kernels separately by centered kernel alignment to obtain an optimal kernel for each; S3, decomposing each fused optimal kernel by singular value decomposition into low-rank matrices, A and B; S4, using the hypergraph-regularized tri-matrix factorization method, computing from the two matrices the corresponding hypergraph Laplacian matrices; S5, using cross-validation with the two hypergraph Laplacian matrices, computing a new association matrix, Y^* = AΘB^T. The invention addresses the prediction of associations between non-coding RNAs and diseases; by adding a hypergraph Laplacian regularization term to the tri-matrix factorization and fusing multiple kernels by centered kernel alignment, it significantly improves prediction accuracy.

Description

Method for predicting disease association relation in biological association network

Technical Field

The invention belongs to the field of biological association network prediction algorithms in bioinformatics, and particularly relates to a method for predicting disease association relationship in a biological association network.

Background

Accurately identifying associations between non-coding RNAs and diseases is of great value to biomedical research and disease treatment. However, conventional experimental techniques address only a single non-coding RNA or a single specific disease at a time, treating the two in isolation, and are time-consuming and expensive. Based on known non-coding RNA and disease information, many computational tools have been proposed to detect new associations. Because non-coding RNAs (ncRNAs), including circular RNAs (circRNAs), microRNAs (miRNAs), and long non-coding RNAs (lncRNAs), are closely related to the progression of various human diseases, it is important to develop an effective computational method to predict ncRNA-disease associations.

Disclosure of Invention

In view of the problems in the prior art, the present invention aims to provide a method for predicting disease associations in a biological association network. The method fuses multiple kernels using a multiple kernel learning algorithm based on centered kernel alignment, and then trains a tri-matrix factorization model with a hypergraph regularization term to predict new associations between non-coding RNAs and diseases.

In order to solve the problems in the prior art, the invention adopts the following technical scheme:

A method for disease association prediction in a biological association network, comprising the steps of:

S1, constructing a multi-kernel representation of the non-coding RNAs and a multi-kernel representation of the diseases, wherein u and v are the numbers of kernels in the feature spaces of the non-coding RNAs and the diseases, respectively;

S2, fusing the non-coding RNA kernels and the disease kernels separately by centered kernel alignment to obtain an optimal kernel for each;

S3, decomposing each fused optimal kernel by singular value decomposition into low-rank matrices, yielding A for the non-coding RNAs and B for the diseases [decomposition formula not reproduced in the source text], wherein A and B are low-rank approximation matrices, and r_nc and r_d are the latent feature space dimensions of the non-coding RNAs and the diseases, respectively;

S4, using the hypergraph-regularized tri-matrix factorization method, computing from the two matrices the corresponding hypergraph Laplacian matrices;

S5, using cross-validation with the two hypergraph Laplacian matrices, computing a new association matrix, Y^* = AΘB^T.

Further, each hypergraph Laplacian matrix in step S4 is calculated by the formula:

L^h = I - Θ

where I is the identity matrix.
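As an illustration of L^h = I - Θ, the following Python/NumPy sketch builds a hypergraph Laplacian. The text does not reproduce how the hyperedges are formed or how Θ is defined, so two assumptions are made here: each node generates one hyperedge containing itself and its k nearest neighbours in a feature matrix X (the helper `knn_incidence` and the parameter `k` are hypothetical), and Θ takes the form commonly used in hypergraph learning, Θ = D_v^(-1/2) H W D_e^(-1) H^T D_v^(-1/2), with unit hyperedge weights by default.

```python
import numpy as np

def knn_incidence(X, k=3):
    """Incidence matrix H (nodes x hyperedges).
    Assumption (not from the source): hyperedge j holds node j and its k nearest neighbours."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    H = np.zeros((n, n))
    for j in range(n):
        order = [i for i in np.argsort(d2[j]) if i != j]
        H[order[:k], j] = 1.0  # k nearest neighbours of node j
        H[j, j] = 1.0          # node j itself
    return H

def hypergraph_laplacian(H, w=None):
    """Return L^h = I - Theta, with Theta = Dv^-1/2 H W De^-1 H^T Dv^-1/2 (assumed form)."""
    n_nodes, n_edges = H.shape
    w = np.ones(n_edges) if w is None else np.asarray(w, dtype=float)
    d_v = H @ w            # vertex degree: total weight of incident hyperedges
    d_e = H.sum(axis=0)    # hyperedge degree: number of nodes in each hyperedge
    dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
    theta = dv_inv_sqrt @ H @ np.diag(w / d_e) @ H.T @ dv_inv_sqrt
    return np.eye(n_nodes) - theta
```

Under these assumptions the result is symmetric and positive semi-definite, which is the property a Laplacian regularization term relies on.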

Further, in step S5, the calculation using the hypergraph Laplacian matrices under cross-validation proceeds as follows:

A^T A Θ B^T B + λ₁ A^T L₁ A Θ B^T B + λ₂ A^T A Θ B^T L₂ B = A^T Y_train B

A Θ B^T + λ₁ L₁ A Θ B^T + λ₂ A Θ B^T L₂ = Y_train

(I + λ₁ L₁) A Θ B^T + λ₂ A Θ B^T L₂ = Y_train

A^-1 (I + λ₁ L₁) A Θ + λ₂ Θ B^T L₂ (B^T)^-1 = A^-1 Y_train (B^T)^-1

wherein Y_train is the matrix of known ncRNA-disease associations; Θ is the bi-projection matrix; λ₁ and λ₂ are the regularization coefficients of the two graphs, each set to 1; and L₁ and L₂ are normalized graph Laplacian matrices, calculated from diagonal degree matrices [Laplacian and degree-matrix formulas not reproduced in the source text].
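The last equation of the derivation has the form of a Sylvester equation, P Θ + Θ Q = R, with P = A^-1 (I + λ₁ L₁) A, Q = λ₂ B^T L₂ (B^T)^-1 and R = A^-1 Y_train (B^T)^-1. The Python/NumPy sketch below solves it by Kronecker vectorization. Since A and B are generally rectangular, the sketch reads A^-1 and (B^T)^-1 as Moore-Penrose pseudo-inverses, which is an assumption; the function name `solve_theta` is hypothetical.

```python
import numpy as np

def solve_theta(A, B, Y_train, L1, L2, lam1=1.0, lam2=1.0):
    """Solve P @ Theta + Theta @ Q = R for Theta (shape r_nc x r_d).
    Assumption: inverses in the derivation are taken as pseudo-inverses."""
    A_pinv = np.linalg.pinv(A)       # (r_nc x n), stands in for A^-1
    Bt_pinv = np.linalg.pinv(B.T)    # (m x r_d), stands in for (B^T)^-1
    n = A.shape[0]
    P = A_pinv @ (np.eye(n) + lam1 * L1) @ A   # r_nc x r_nc
    Q = lam2 * B.T @ L2 @ Bt_pinv              # r_d x r_d
    R = A_pinv @ Y_train @ Bt_pinv             # r_nc x r_d
    r_nc, r_d = R.shape
    # Column-major vec: vec(P Theta + Theta Q) = (I kron P + Q^T kron I) vec(Theta)
    K = np.kron(np.eye(r_d), P) + np.kron(Q.T, np.eye(r_nc))
    vec_theta = np.linalg.solve(K, R.flatten(order="F"))
    return vec_theta.reshape((r_nc, r_d), order="F")

def predict(A, theta, B):
    """New association matrix Y* = A Theta B^T."""
    return A @ theta @ B.T
```

The Kronecker system has (r_nc · r_d)² entries, so this form only suits small latent dimensions; for larger ones a dedicated routine such as `scipy.linalg.solve_sylvester` would be the more practical choice.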

Advantageous Effects

The invention uses a multi-kernel fusion method to identify associations between non-coding RNAs and diseases. An efficient multiple kernel learning algorithm is used to find the important features that influence the associations, evaluate the importance of each kernel matrix, and reduce the bias introduced by kernel fusion. A sound kernel matrix evaluation method is constructed that computes a weight coefficient for each kernel matrix; it effectively filters out noisy kernel matrices and retains useful ones as far as possible, providing a basis for improving the prediction accuracy of the model. Each kernel matrix represents information from a different heterogeneous data source, and its weight coefficient reflects how much that information contributes to the prediction model, which helps identify the key information affecting the accuracy of ncRNA-disease association prediction. The method achieves higher prediction accuracy than existing methods while remaining simple and efficient, addresses the low accuracy of existing ncRNA-disease association identification methods, and is of significance for advancing non-coding RNA research.

Drawings

FIG. 1 is a flow chart of the computational process of the present invention;

FIG. 2 shows the weight of each kernel in the five data sets;

FIG. 3 compares the AUC and AUPR of different kernel functions under 5-fold cross-validation on the 5 data sets;

FIG. 4 compares the AUC and AUPR of different matrix factorization methods under 5-fold cross-validation on the 5 data sets;

FIG. 5 shows the AUC under 5-fold cross-validation for different values of the parameters r_d and r_nc;

FIG. 6 shows the AUPR under 5-fold cross-validation for different values of the parameters r_d and r_nc;

FIG. 7 shows the optimal parameters r_nc and r_d computed on the five data sets;

FIG. 8 compares AUC results with existing leading methods under 5-fold cross-validation and leave-one-out cross-validation;

FIG. 9 shows ten new associations for lung cancer, liver cancer and pancreatic cancer.

Detailed Description

The invention is described in detail below with reference to the attached drawing figures:

As shown in FIG. 1, the invention accurately identifies associations between non-coding RNAs and diseases, which is of great value to biomedical research and disease treatment. Conventional experimental techniques, by contrast, address only a single non-coding RNA or a single specific disease at a time, and are time-consuming and expensive. Based on known non-coding RNA and disease information, many computational tools have been proposed to detect new associations. Because ncRNAs (circRNAs, miRNAs, and lncRNAs) are closely related to the progression of various human diseases, developing an efficient computational method is crucial for ncRNA-disease association prediction.

The basic idea of the invention is to fuse the multiple kernels of the non-coding RNAs and the multiple kernels of the diseases by centered kernel alignment, and to predict new associations by tri-matrix factorization with a hypergraph regularization term.

The invention mainly comprises the following steps: first, as many non-coding RNA kernels and disease kernels as possible are obtained; the non-coding RNA kernels and the disease kernels are then fused by centered kernel alignment; each fused kernel is decomposed into low-rank matrices by singular value decomposition; and finally a new association matrix is obtained by hypergraph-regularized tri-matrix factorization under cross-validation. The specific steps are as follows:

S1, constructing the non-coding RNA kernels and the disease kernels, wherein u and v are the numbers of kernels in the non-coding RNA space and the disease space, respectively.

S2, fusing the non-coding RNA kernels and the disease kernels separately by centered kernel alignment (CKA) to obtain an optimal kernel for each.

The alignment value describes the similarity between two kernels. CKA-based multiple kernel learning (CKA-MKL) computes the alignment between the ideal kernel matrix and the combined ncRNA kernels (or disease kernels) [alignment objective not reproduced in the source text], subject to non-negative kernel weights:

β^p ≥ 0, p = 1, 2, …, N

wherein K_ideal is the ideal kernel, and the ideal kernels of the ncRNAs and of the diseases are constructed from the known associations.
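The alignment-based fusion of step S2 can be illustrated with a short Python/NumPy sketch. Because the optimization objective is not reproduced in the text, the sketch swaps in a simpler heuristic rather than the patent's own solver: each kernel is weighted in proportion to its own centered alignment with the ideal kernel, clipped so that β^p ≥ 0 and normalized to sum to one. Building the ideal ncRNA kernel as Y Y^T from the known association matrix Y is also an assumption, and all function names are hypothetical.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix: Kc = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment(K1, K2):
    """Centered kernel alignment: <K1c, K2c>_F / (||K1c||_F ||K2c||_F)."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def fuse_kernels(kernels, K_ideal):
    """Heuristic fusion (assumption): weights proportional to each kernel's alignment."""
    scores = np.array([alignment(K, K_ideal) for K in kernels])
    beta = np.clip(scores, 0.0, None)      # enforce beta^p >= 0
    if beta.sum() == 0.0:                  # no kernel aligns positively: fall back to uniform
        beta = np.ones(len(kernels))
    beta = beta / beta.sum()
    K_star = sum(b * K for b, K in zip(beta, kernels))
    return beta, K_star
```

For example, with a known association matrix `Y` (ncRNAs in rows), `fuse_kernels(nc_kernels, Y @ Y.T)` would return the weights and the fused ncRNA kernel, and `fuse_kernels(d_kernels, Y.T @ Y)` the disease counterpart.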

S3, decomposing each fused optimal kernel by singular value decomposition into low-rank matrices, yielding A for the ncRNAs and B for the diseases; the calculation process is as follows [decomposition formula not reproduced in the source text], wherein A and B are low-rank approximation matrices, and r_nc and r_d are the latent feature space dimensions of the ncRNAs and the diseases, respectively.
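A minimal Python/NumPy sketch of the low-rank step follows. The decomposition formula is not reproduced in the text, so the sketch assumes a common choice: keep the top r singular vectors of the fused kernel and scale them by the square roots of the singular values, so that the factor times its transpose approximates the kernel. The function name `low_rank_factor` is hypothetical.

```python
import numpy as np

def low_rank_factor(K, r):
    """Rank-r factor F of a symmetric PSD kernel K, with F @ F.T approximating K.
    Assumption: F = U_r * sqrt(S_r), where K = U S V^T is the SVD."""
    U, s, _ = np.linalg.svd(K)           # singular values are returned in descending order
    return U[:, :r] * np.sqrt(s[:r])     # scale each kept column by sqrt of its singular value
```

Under this assumption, A would be `low_rank_factor(K_nc_star, r_nc)` and B would be `low_rank_factor(K_d_star, r_d)`, where `K_nc_star` and `K_d_star` stand for the fused kernels from step S2.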

S4, using the hypergraph-regularized tri-matrix factorization method, computing from the two matrices the corresponding hypergraph Laplacian matrices.

Each hypergraph Laplacian matrix in step S4 is calculated by the formula:

L^h = I - Θ

where I is the identity matrix.

S5, using cross-validation with the two hypergraph Laplacian matrices, computing a new association matrix, Y^* = AΘB^T.

The calculation in step S5 proceeds as follows:

A^T A Θ B^T B + λ₁ A^T L₁ A Θ B^T B + λ₂ A^T A Θ B^T L₂ B = A^T Y_train B

A Θ B^T + λ₁ L₁ A Θ B^T + λ₂ A Θ B^T L₂ = Y_train

(I + λ₁ L₁) A Θ B^T + λ₂ A Θ B^T L₂ = Y_train

A^-1 (I + λ₁ L₁) A Θ + λ₂ Θ B^T L₂ (B^T)^-1 = A^-1 Y_train (B^T)^-1

wherein Y_train is the matrix of known ncRNA-disease associations, and L₁ and L₂ are normalized graph Laplacian matrices, calculated from diagonal degree matrices [Laplacian and degree-matrix formulas not reproduced in the source text].
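For the normalized graph Laplacian mentioned here, the formulas are missing from the text. The Python/NumPy sketch below therefore assumes the standard symmetric normalization, L = I - D^(-1/2) K D^(-1/2), where K is a symmetric non-negative similarity (kernel) matrix and D is the diagonal matrix of its row sums. The function name is hypothetical.

```python
import numpy as np

def normalized_laplacian(K):
    """Symmetric normalized graph Laplacian L = I - D^-1/2 K D^-1/2 (assumed form).
    K must be symmetric and non-negative with strictly positive row sums."""
    d = K.sum(axis=1)                    # diagonal of the degree matrix D
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(K.shape[0]) - d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
```

The element-wise scaling by `d_inv_sqrt` on rows and columns is equivalent to multiplying by D^(-1/2) on both sides, without forming the diagonal matrices explicitly.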

Through steps S1 to S5 the invention computes the new association matrix; the accuracy of the newly predicted associations can then be assessed by looking them up and verifying them in other databases.

The implementation process of the invention comprises the following steps:

Following the calculation method above, the optimal r_nc and r_d for the D1 data set are obtained by grid search, testing values from 100 up to the maximum in steps of 100. The same grid search is used for the remaining data sets, and the optimal parameters r_nc and r_d for each data set are shown in FIG. 7. FIGS. 5 and 6 show the AUC and AUPR of the model under 5-fold cross-validation for different values of r_d and r_nc. Here, r_nc (horizontal axis) and r_d (vertical axis) range from 100 to 1500 with a step size of 100.
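The grid search described above can be sketched in plain Python. The scoring callback `score_fn` is hypothetical: it stands for whatever evaluation is run for a given pair (r_nc, r_d), for example the mean AUC over 5-fold cross-validation, and is not specified in the text.

```python
from itertools import product

def grid_search(score_fn, max_r_nc, max_r_d, start=100, step=100):
    """Try every (r_nc, r_d) on the grid and return (best_score, r_nc, r_d).
    score_fn(r_nc, r_d) is a hypothetical callback returning a score to maximize."""
    best = None
    grid = product(range(start, max_r_nc + 1, step), range(start, max_r_d + 1, step))
    for r_nc, r_d in grid:
        score = score_fn(r_nc, r_d)
        if best is None or score > best[0]:
            best = (score, r_nc, r_d)
    return best
```

With the range stated above, `grid_search(score_fn, 1500, 1500)` evaluates 15 × 15 = 225 parameter pairs.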

The performance of multi-kernel and single-kernel models was evaluated on the 5 data sets. FIG. 2 shows the weight of each kernel in the five data sets. Disease semantic similarity receives nearly the largest weight. ncRNA functional similarity also receives a large weight, which indicates that it carries more useful information.

FIG. 3 shows the results of 5-fold cross-validation (AUC and AUPR) on the 5 data sets for different kernel functions. The method of the invention (CKA-HGRTMF) gives the best AUC on D2 (0.9775), D3 (0.9023), D4 (0.8809) and D5 (0.9185), and the best AUPR on all 5 data sets. For the single-kernel comparison, different combinations of kernels from the two feature spaces are selected and tested with the HGRTMF model. Experiments show that the ncRNA and disease kernels with the largest weights give better results than the other single-kernel combinations.

HGRTMF was compared with other MF-based computational models, including collaborative matrix factorization (CMF), graph regularized matrix factorization (GRMF), tri-matrix factorization (TMF), NRLMF, and graph regularized tri-matrix factorization (GRTMF); the results are shown in FIG. 4. The method of the present invention (CKA-HGRTMF) achieved the best AUPR on D1 (0.9173), D2 (0.7712), D3 (0.6224) and D5 (0.5017), and the best AUC on D2 (0.9775), D3 (0.9023), D4 (0.8809) and D5 (0.9185), outperforming the other MF-based computational models. The AUPRs of CKA-HGRTMF on the 5 data sets were 0.8957, 0.7456, 0.6014, 0.3992 and 0.4250, respectively, and the AUCs were 0.9857, 0.9746, 0.8991, 0.8774 and 0.8991, respectively. The AUCs and AUPRs of CKA-GRTMF on D1, D2 and D5 are all higher than those of CKA-TMF. These results show that adding a graph regularization term to the computational model improves prediction performance.

To evaluate the performance of the CKA-HGRTMF model, the method was compared with other existing methods. FIG. 8 shows the AUC results under 5-fold cross-validation and leave-one-out cross-validation. The method of the invention (CKA-HGRTMF) achieved the best results both in 5-fold cross-validation on D2, D3, D4 and D5 and in leave-one-out cross-validation on all 5 data sets. The method also found ten new associations for lung cancer, liver cancer and pancreatic cancer; the results are shown in FIG. 9.
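The evaluation protocol can be illustrated with a Python/NumPy sketch of the two metrics and a 5-fold split. Several points are assumptions, since the text does not specify them: folds are drawn over individual entries of the association matrix, AUC is computed by pairwise comparison of positive and negative scores, and AUPR is estimated by average precision. All function names are hypothetical.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

def average_precision(y_true, y_score):
    """Average precision, used here as an estimate of AUPR (assumption)."""
    order = np.argsort(-y_score, kind="stable")   # rank entries by descending score
    hits = y_true[order]
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return np.sum(precision * hits) / hits.sum()

def five_fold_masks(shape, n_folds=5, seed=0):
    """Yield boolean test masks that split all matrix entries into n_folds folds."""
    rng = np.random.default_rng(seed)
    fold_of = (rng.permutation(int(np.prod(shape))) % n_folds).reshape(shape)
    for f in range(n_folds):
        yield fold_of == f
```

In each fold, the entries selected by the mask would be set to zero in a copy of the association matrix to form Y_train, and the scores predicted for those entries would be compared against their true values.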

In conclusion, the invention addresses the prediction of associations between non-coding RNAs and diseases. Adding a hypergraph Laplacian regularization term to the tri-matrix factorization and fusing kernels by centered kernel alignment significantly improves prediction accuracy. The calculation is simple and easy to implement, and requires only modest hardware and computing resources, so it is widely usable. The method is implemented in C++ and MATLAB; on an ordinary computer with a 2.5 GHz 8-core CPU and 24 GB of memory, it completes prediction tasks for thousands of samples in a short time.

Claims

1. A method for disease association prediction in a biological association network, comprising the steps of:

S1, constructing a multi-kernel representation of the non-coding RNAs and a multi-kernel representation of the diseases, wherein u and v are the numbers of kernels in the feature spaces of the non-coding RNAs and the diseases, respectively;

S2, fusing the non-coding RNA kernels and the disease kernels separately by centered kernel alignment to obtain an optimal kernel for each;

S3, decomposing each fused optimal kernel by singular value decomposition into low-rank matrices, yielding A for the non-coding RNAs and B for the diseases [decomposition formula not reproduced in the source text], wherein A and B are low-rank approximation matrices, and r_nc and r_d are the latent feature space dimensions of the non-coding RNAs and the diseases, respectively;

S4, using the hypergraph-regularized tri-matrix factorization method, computing from the two matrices the corresponding hypergraph Laplacian matrices;

S5, using cross-validation with the two hypergraph Laplacian matrices, computing a new association matrix, Y^* = AΘB^T.

2. The method for disease association prediction in a biological association network as claimed in claim 1, wherein each hypergraph Laplacian matrix in step S4 is calculated by the formula:

L^h = I - Θ

where I is the identity matrix.

3. The method for disease association prediction in biological association network as claimed in claim 1, wherein the step S5 is performed by cross-validation of the laplacian matrix of the hypergraph