CN111640468A

CN111640468A - Method for screening disease-related protein based on complex network

Info

Publication number: CN111640468A
Application number: CN202010418499.3A
Authority: CN
Inventors: 李旭; 任静; 王学敏; 张文; 闫凯境
Original assignee: Tianshili International Gene Network Drug Innovation Center Co ltd
Current assignee: Tianshili International Gene Network Drug Innovation Center Co ltd
Priority date: 2020-05-18
Filing date: 2020-05-18
Publication date: 2020-09-08
Anticipated expiration: 2040-05-18
Also published as: CN111640468B

Abstract

The invention discloses a method for screening disease-related proteins based on a complex network, comprising the following steps: 1) obtaining seed genes related to a target disease; 2) constructing a protein interaction network with the seed genes as its core, based on a protein-protein interaction database; 3) extracting feature data of the proteins in the protein interaction network; 4) using the protein feature data as training data and training a PU classifier with a machine learning algorithm; 5) predicting the proteins associated with the target disease in the protein interaction network based on the PU classifier. The method can identify disease-related proteins quickly and efficiently, and facilitates experimental verification by biomedical experts and follow-up work by researchers in the field.

Description

Method for screening disease-related protein based on complex network

Technical Field

The invention relates to the technical field of protein screening, in particular to a method for screening disease-related protein based on a complex network.

Background

The identification of disease-associated proteins plays an important role in the molecular typing, diagnosis and treatment of diseases. Accurate and efficient identification of disease-related proteins helps to find pathogenic genes and identify drug targets, and is of great significance for disease diagnosis and treatment and for drug design. The genome-wide association study (GWAS) is an important research tool for exploring disease susceptibility genes and can quickly find the more significant disease susceptibility loci. However, GWAS makes poor use of its data and masks a large number of potentially significant disease-related proteins. In addition, traditional single-locus GWAS association analysis treats each gene in isolation and ignores the interactions between genes in the organism, which makes it difficult to find the proteins that are truly related to a disease.

Protein-protein interaction (PPI) network analysis remedies these deficiencies. In recent years, as PPI data have steadily improved, studying protein interaction networks from a systems perspective with the theory and methods of computer networks and graph theory has become a popular field. Many researchers have turned to computational protein identification, and many classical algorithms such as Degree Centrality (DC), Betweenness Centrality (BC) and Closeness Centrality (CC) have been proposed; however, the identification accuracy of these algorithms is generally not high. How to obtain a representation of proteins from a protein interaction network, and thereby find proteins that are functionally similar to known disease-related proteins, is therefore a difficult point in protein network analysis.

Another difficulty in identifying disease-associated proteins is that labeled disease proteins are very rare in the protein interaction network around the disease seed genes, and the relationship between the large number of unlabeled proteins and the disease is unknown: these unlabeled proteins may or may not be disease-associated, and only the seed genes are positive samples. In general, establishing whether these unlabeled proteins are relevant to the disease depends on biological experiments or manual literature search and curation, which are expensive and time-consuming. The rapid development of big data and artificial intelligence technology provides a low-cost, efficient approach to disease protein screening. Predicting the proteins most related to a disease with machine learning can greatly promote new drug development and clinical treatment. However, a large amount of unlabeled data easily causes under-fitting or over-fitting during machine learning, so that the model learns too little of the information in the whole sample space or generalizes poorly. How to make reasonable use of unlabeled data to build the model, and thereby greatly reduce the need for labeled data, is a technical problem to be solved urgently.

Disclosure of Invention

In view of the defects and shortcomings of the prior art, the invention aims to provide a method for screening disease-related proteins based on a complex network. The method can identify disease-related proteins quickly and efficiently, and facilitates experimental verification by biomedical experts and follow-up work by researchers in the field.

The invention provides a method for screening disease-related proteins based on a complex network. Seed genes related to a disease are found through genome-wide association analysis (GWAS); a protein interaction network with the seed genes as its core is obtained from a protein interaction database (such as the BioGRID, STRING, IntAct or HPRD database); feature data of the proteins in the protein interaction network are extracted with the node2vec algorithm; and the proteins related to the disease in the protein interaction network are predicted with the semi-supervised PU-learning algorithm. The specific steps are as follows:

S1: performing whole-genome scanning of a target disease group and a control group using a case-control study design, and performing genome-wide association analysis (GWAS) to obtain seed genes related to the target disease;

S2: constructing a protein interaction network with the seed genes as its core, based on a protein-protein interaction database;

S3: extracting feature data of the proteins in the protein interaction network based on the node2vec algorithm.

Step S3 specifically includes:

S31: constructing an undirected graph G from the protein interaction network data obtained in S2, to obtain a node set (the protein set) and an edge set (the set of protein-protein interactions);

S32: representing the features of the proteins in the S2 protein interaction network as d-dimensional vectors using the node2vec method, so as to express the relationships between proteins and to further study the content similarity and structural similarity of the proteins in the network.
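As an illustration of S31, the node set and edge set can be held in a plain adjacency structure. The sketch below is a minimal example rather than the implementation used in the invention; the function name `build_undirected_graph` and the toy protein pairs are hypothetical, and dropping self-interactions is an assumption.

```python
from collections import defaultdict

def build_undirected_graph(interactions):
    """Build an undirected graph G from (protein_a, protein_b) pairs.

    Returns (nodes, edges, adjacency); each edge is stored once as a frozenset.
    """
    adjacency = defaultdict(set)
    edges = set()
    for a, b in interactions:
        if a == b:               # assumption: ignore self-interactions
            continue
        adjacency[a].add(b)
        adjacency[b].add(a)      # undirected: record both directions
        edges.add(frozenset((a, b)))
    return set(adjacency), edges, dict(adjacency)

# Hypothetical toy interaction list; duplicates in either orientation collapse.
toy_interactions = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("D", "D")]
nodes, edges, adjacency = build_undirected_graph(toy_interactions)
```

Storing each edge as a `frozenset` makes the pair order irrelevant, which matches the undirected nature of a protein-protein interaction.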

In practical machine learning tasks, the purpose of feature selection is to reduce the number of features and the dimensionality, improve model performance, strengthen the generalization ability of the model and reduce overfitting. Selecting a representative set of features for building the model is therefore very important. Because existing feature learning methods cannot capture the diverse connectivity patterns of a complex network (here, the protein interaction network), the invention uses the node2vec algorithm, which simulates a biased random walk with a second-order Markov chain in place of pure depth-first search (DFS) or breadth-first search (BFS), and explores both the first-order neighbors of a node and its higher-order, structurally similar neighbors, so that node neighborhoods are sampled diversely. The sampling strategy works as follows:

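Given the current vertex v, the probability of visiting the next vertex x is:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where $c_i$ denotes the i-th vertex of the walk, E is the edge set of G, $\pi_{vx}$ is the unnormalized transition probability between vertex v and vertex x, and Z is a normalization constant. node2vec introduces two hyperparameters p and q to control the random walk strategy. Suppose the current random walk has reached vertex v through the edge (t, v), and let $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $w_{vx}$ is the weight of the edge between vertices v and x. Then

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

where $d_{tx}$ is the shortest-path distance between vertex t and vertex x.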

S33: in the second-order random walk described in S32, the parameter p controls the probability of immediately revisiting the vertex that was just visited: the higher p is, the lower that probability. The parameter q controls whether the walk moves inward or outward: if q > 1, the random walk tends to visit vertices close to vertex t (biased toward BFS); if q < 1, it tends to visit vertices farther from vertex t (biased toward DFS). In the invention, when the protein features in the S2 protein interaction network are represented as d-dimensional vectors with the node2vec method, the preferred range of p is [2, 5], the preferred range of q is [0.1, 3], and the preferred range of the dimension d is [128, 256].
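The effect of p and q on the walk can be sketched in a few lines. The example below is an illustrative, unoptimized version of the second-order biased random walk, assuming an unweighted graph (every edge weight w_vx = 1); the function names and the toy graph are hypothetical and are not taken from the invention.

```python
import random

def alpha(p, q, d_tx):
    """Search bias alpha_pq(t, x) for shortest-path distance d_tx between t and x."""
    if d_tx == 0:
        return 1.0 / p      # x is t itself: step back to the previous vertex
    if d_tx == 1:
        return 1.0          # x is also a neighbour of t
    return 1.0 / q          # d_tx == 2: x moves away from t

def transition_probs(adj, t, v, p, q):
    """Normalised probabilities of moving from v to each neighbour x,
    given that the walk arrived at v from t (assumes w_vx = 1)."""
    weights = {}
    for x in adj[v]:
        d_tx = 0 if x == t else (1 if x in adj[t] else 2)
        weights[x] = alpha(p, q, d_tx)
    z = sum(weights.values())           # normalisation constant Z
    return {x: w / z for x, w in weights.items()}

def biased_walk(adj, start, length, p, q, rng):
    """Generate one second-order random walk of at most `length` vertices."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbours = sorted(adj[v])
        if not neighbours:
            break
        if len(walk) == 1:              # first step has no previous vertex
            walk.append(rng.choice(neighbours))
            continue
        probs = transition_probs(adj, walk[-2], v, p, q)
        walk.append(rng.choices(neighbours, [probs[x] for x in neighbours])[0])
    return walk

# Hypothetical toy graph with edges t-v, t-a, v-a and v-b.
adj = {"t": {"v", "a"}, "v": {"t", "a", "b"}, "a": {"v", "t"}, "b": {"v"}}
probs = transition_probs(adj, "t", "v", p=2.0, q=0.5)
walk = biased_walk(adj, "t", length=10, p=2.0, q=0.5, rng=random.Random(0))
```

With p = 2 and q = 0.5 the unnormalized weights from v are 1/2 for returning to t, 1 for the shared neighbour a and 2 for the outward vertex b, so the walk leaves the neighbourhood of t four times as often as it steps back.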

S4: taking the protein structural feature data obtained in S3 as input, and analyzing them with the semi-supervised PU-learning algorithm to predict the proteins related to the target disease in the protein interaction network.

PU-learning is a semi-supervised learning method for binary classification. Unlike traditional methods, it can handle the situation where positive samples are few and negative samples are missing, which improves the prediction of disease-related proteins. The PU-learning training set consists of positive samples and unlabeled samples. Specifically, in the protein structural feature data obtained in S3, only a few proteins (the seed genes) are known to be disease-related and serve as positive samples; the proteins that interact with these seed genes are unlabeled samples (which may or may not be associated with the disease).

The step S4 specifically includes:

S41: establishing the PU classifier with a two-step method: in the first step, reliable negative samples are found among the unlabeled samples; in the second step, the classifier is trained with the known positive samples and the reliable negative samples.

S411: specifically, the first step of S41 uses the Spy technique to find reliable negative samples. A subset of spy samples S (a subset of the seed proteins) is randomly selected from the labeled positive set P (the seed proteins related to the target disease) and placed into the unlabeled data set U (the unlabeled proteins whose relation to the target disease cannot be determined). With P - S as the positive set (PS) and U + S as the negative set (US), a naive Bayes (NB) classifier is trained. The NB classifier then scores each protein k in the negative set US, assigning it a probability label Pr(1|k), where 1 denotes the positive class. The probability labels of the spy set S determine which proteins in the unlabeled set U are most likely not associated with the disease: a threshold H is derived from the probability labels of the spy set S, and proteins with Pr(1|k) < H are treated as negative samples (proteins unrelated to the disease).
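One round of the Spy step can be sketched as follows. This is a minimal illustration under stated assumptions, not the invention's implementation: the tiny Gaussian naive Bayes model, the function names `fit_gaussian_nb` and `spy_step`, the simple quantile rule for H and the two-dimensional toy feature vectors are all hypothetical. The 15% spy fraction and the 10% quantile follow the values given in the embodiments.

```python
import math
import random

def fit_gaussian_nb(pos, neg):
    """Fit a tiny Gaussian naive Bayes model; returns a function k -> Pr(1|k)."""
    def stats(rows):
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        varis = [sum((x - m) ** 2 for x in c) / len(c) + 1e-6
                 for c, m in zip(cols, means)]
        return means, varis

    def log_lik(x, means, varis):
        return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                   for xi, m, v in zip(x, means, varis))

    (mp, vp), (mn, vn) = stats(pos), stats(neg)
    prior_p = len(pos) / (len(pos) + len(neg))

    def prob_positive(x):
        lp = math.log(prior_p) + log_lik(x, mp, vp)
        ln = math.log(1 - prior_p) + log_lik(x, mn, vn)
        m = max(lp, ln)                       # subtract the max for stability
        return math.exp(lp - m) / (math.exp(lp - m) + math.exp(ln - m))
    return prob_positive

def spy_step(P, U, spy_frac, quantile, rng):
    """One Spy round: returns indices of U judged to be reliable negatives."""
    n_spy = max(1, round(spy_frac * len(P)))
    spy_idx = set(rng.sample(range(len(P)), n_spy))
    S = [P[i] for i in spy_idx]
    PS = [P[i] for i in range(len(P)) if i not in spy_idx]   # PS = P - S
    US = U + S                                               # US = U + S
    prob = fit_gaussian_nb(PS, US)
    spy_scores = sorted(prob(s) for s in S)
    # Threshold H: a low quantile of the spy scores (simple index rule, assumed).
    H = spy_scores[int(quantile * (len(spy_scores) - 1))]
    return [i for i, u in enumerate(U) if prob(u) < H]

# Hypothetical toy features: 20 positives, 5 hidden positives and 30 negatives.
P = [(2.0 + 0.05 * i, 2.0 - 0.05 * i) for i in range(20)]
U = ([(2.0 + 0.05 * i, 2.0 + 0.05 * i) for i in range(5)]
     + [(-2.0 - 0.05 * i, -2.0 + 0.05 * i) for i in range(30)])
reliable_negatives = spy_step(P, U, spy_frac=0.15, quantile=0.1, rng=random.Random(0))
```

Because the spies are true positives hidden among the unlabeled data, their scores indicate how low a genuine positive can score, which is why the threshold is taken from them rather than fixed in advance.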

S412: repeating step S411 100 to 1000 times to obtain a set RN of high-frequency, stable negative samples (proteins unrelated to the disease). The invention improves on the basic algorithm by running step S411 many times, obtaining a set of negative samples each time, and then selecting the proteins that appear frequently as negative sample proteins, rather than relying on the result of a single run. For example, a protein that is selected as a negative sample protein across 100 runs is considered to belong to the stable negative sample set RN.
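The frequency-based selection can be illustrated with a simple vote count over repeated rounds. In the sketch below the function name `stable_negatives`, the threshold value and the protein identifiers are hypothetical; the rule of keeping samples whose share of rounds exceeds a set threshold is one plausible reading of "high-frequency".

```python
from collections import Counter

def stable_negatives(rounds, threshold):
    """Keep samples flagged as negative in more than `threshold` of all rounds."""
    counts = Counter(k for negatives in rounds for k in set(negatives))
    n = len(rounds)
    return {k for k, c in counts.items() if c / n > threshold}

# Hypothetical output of four Spy rounds (protein identifiers are made up).
rounds = [
    ["P1", "P2", "P3"],
    ["P1", "P3"],
    ["P1", "P2", "P3", "P4"],
    ["P1", "P3"],
]
RN = stable_negatives(rounds, threshold=0.75)
```

Here P1 and P3 are flagged in all four rounds and are kept, while P2 (two of four) and P4 (one of four) are discarded as unstable.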

S413: specifically, the second step of S41 uses the SVM algorithm. The SVM learning algorithm is run iteratively on the positive samples P (known disease-related proteins) and the negative sample set RN (disease-unrelated proteins) obtained in S412, until it converges or a set stopping condition is reached, yielding the PU classifier.

S42: using the PU classifier obtained in S41 to predict, for each protein in the protein interaction network, the probability that it is related to the disease, thereby predicting the disease-related proteins.

S5: the predicted proteins are evaluated with the following indexes: accuracy, precision, recall and F1 value. Ideally both precision and recall are as high as possible, but in general they trade off against each other: when precision is high, recall tends to be low, and when recall is high, precision tends to be low. The aim is to improve recall while maintaining precision. In the standard definitions below, TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively.

S51: accuracy evaluates the overall correctness of the model:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

S52: precision indicates how reliable the classifier is when it reports a protein as disease-related:

$$\text{Precision} = \frac{TP}{TP + FP}$$

S53: recall indicates whether the classifier can detect all disease-related proteins:

$$\text{Recall} = \frac{TP}{TP + FN}$$

S54: the F1 value is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

It serves as a combined evaluation index of precision and recall.
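The four indexes of S5 can be computed directly from a predicted protein set and a gold-standard set. The sketch below assumes the standard confusion-matrix definitions and that both sets are subsets of the network's protein set; the function name `evaluate` and the toy identifiers are hypothetical.

```python
def evaluate(predicted, gold, universe):
    """Accuracy, precision, recall and F1 of a predicted set against a gold set."""
    tp = len(predicted & gold)               # predicted and truly related
    fp = len(predicted - gold)               # predicted but not in the gold set
    fn = len(gold - predicted)               # related but missed
    tn = len(universe - predicted - gold)    # correctly left out
    accuracy = (tp + tn) / len(universe)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical toy network of ten proteins.
universe = {f"P{i}" for i in range(10)}
gold = {"P0", "P1", "P2", "P3"}
predicted = {"P0", "P1", "P4"}
scores = evaluate(predicted, gold, universe)
```

In this toy case two of the three predictions are correct (precision 2/3) and two of the four gold proteins are found (recall 1/2), giving an F1 value of 4/7.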

The invention has the technical effects that:

(1) The invention uses the node2vec algorithm to extract the structural features of the proteins in the protein interaction network; compared with traditional topological properties, this can improve the accuracy of protein identification in the protein interaction network.

(2) The invention combines the node2vec algorithm with the PU-learning algorithm, which reduces the workload of identifying disease-related proteins by biological experiment or manual curation, improves precision, and achieves automatic classification as far as possible.

Drawings

Fig. 1 is a schematic flow chart of a method for screening a disease-related protein based on a complex network according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.

Example one: coronary heart disease

This example demonstrates technical effect (1) of the invention: extracting the structural features of the proteins in the protein interaction network with the node2vec algorithm improves the accuracy of protein identification compared with traditional topological properties. In part S5 of this example, combining the node2vec algorithm with the PU-learning algorithm yields 958 coronary heart disease-related proteins, with an accuracy of 92.35% and a recall of 56.89%, whereas combining the topological properties with the PU-learning algorithm yields 8348 coronary heart disease-related proteins, with an accuracy of 32.93% and a recall of 10.66%.

As shown in fig. 1, an embodiment of the present invention provides a method for screening a disease-related protein based on a complex network, the method including:

S1: performing whole-genome scanning of a disease group (coronary heart disease) and a control group using a case-control study design, and obtaining 409 seed genes related to the disease (coronary heart disease) (set P) by genome-wide association analysis (GWAS);

S2: constructing a protein interaction network with the coronary heart disease seed genes as its core, based on a protein-protein interaction database (BioGRID, version 3.5.169, website https://thebiogrid.org/), wherein the network consists of 11303 proteins and 44194 protein-protein interactions;

S3: extracting feature data of the 11303 proteins in the S2 protein interaction network based on the node2vec algorithm. In this implementation, the random walk parameters are set to walk_length = 80, num_walks = 50, window = 10, min_count = 1 and batch_words = 4, the hyperparameters are set to p = 3 and q = 3, and the extracted protein feature data have 128 dimensions.
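For reference, the S3 settings can be collected into a small configuration. The parameter names match those of the open-source `node2vec` Python package, but the text does not name the tool that was used, so the grouping below and the commented usage line are assumptions; the helper `within_preferred_range` is hypothetical.

```python
# Settings reported in S3 of this example, grouped the way the open-source
# `node2vec` Python package groups them (an assumption about the tooling).
WALK_SETTINGS = {
    "dimensions": 128,   # d, dimensionality of each protein vector
    "walk_length": 80,   # vertices per random walk
    "num_walks": 50,     # walks started from each vertex
    "p": 3,              # return parameter
    "q": 3,              # in-out parameter
}
EMBEDDING_SETTINGS = {"window": 10, "min_count": 1, "batch_words": 4}

def within_preferred_range(settings):
    """Check p, q and d against the preferred ranges stated in the description."""
    return (2 <= settings["p"] <= 5
            and 0.1 <= settings["q"] <= 3
            and 128 <= settings["dimensions"] <= 256)

# With that package installed, the settings would typically be applied as
#   Node2Vec(graph, **WALK_SETTINGS).fit(**EMBEDDING_SETTINGS)
settings_ok = within_preferred_range(WALK_SETTINGS)
```

The chosen values p = 3, q = 3 and d = 128 all fall inside the preferred ranges [2, 5], [0.1, 3] and [128, 256].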

S4: taking the protein structural feature data obtained in S3 as input, and analyzing them with the semi-supervised PU-learning algorithm to predict the disease-related proteins in the protein interaction network. Specifically, using the Spy technique, 15% of the proteins known to be related to coronary heart disease are randomly selected as the spy sample set S, giving a positive set (PS = P - S) and a negative set (US = U + S). A naive Bayes algorithm is used to obtain an NB classifier, which scores each protein k in the negative set US and assigns it a probability label Pr(1|k); the probability labels of the spy set S determine which proteins in the unlabeled set U are most likely unrelated to the disease. Specifically, a threshold H is derived from the probability labels of the spy set S (the 10% quantile of the spy set S probabilities), and proteins with Pr(1|k) < H are treated as negative samples (disease-unrelated proteins). These steps are repeated 100 times to obtain the high-frequency negative sample set RN (containing 9455 proteins). The SVM algorithm is run on the positive samples (409 proteins known to be related to coronary heart disease) and the resulting negative sample set RN (disease-unrelated proteins) to predict, for each protein in the protein interaction network, the probability that it is related to the disease, giving a total of 958 proteins possibly related to coronary heart disease.

S5: evaluation. The genes related to coronary heart disease obtained by literature retrieval and manual curation (997 in total) are taken as the gold standard, and the 958 coronary heart disease-related proteins obtained in S4 are evaluated for accuracy (92.35%), precision (54.66%), recall (56.89%) and F1 value (55.75%). For comparison, six topological properties (degree centrality, closeness centrality, eigenvector centrality (evcent), betweenness centrality, average shortest-path distance and PageRank score) are used as the protein features, and the PU-learning algorithm then predicts 8348 coronary heart disease-related proteins, with accuracy (32.93%), precision (89.47%), recall (10.66%) and F1 value (19.05%). The results show that extracting the structural features of the proteins in the protein interaction network with the node2vec algorithm improves the accuracy of protein identification compared with traditional topological properties.
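As a quick arithmetic check, the reported F1 values are reproduced when 54.66% and 89.47% are read as the precision values and 56.89% and 10.66% as the recall values. The helper name `f1_score` below is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in S5 of this example.
f1_node2vec = f1_score(0.5466, 0.5689)   # node2vec features
f1_topology = f1_score(0.8947, 0.1066)   # six topological properties
```

The two results round to 0.5575 and 0.1905, in agreement with the stated F1 values of 55.75% and 19.05%.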

Example two: ischemic cardiomyopathy

To support the claimed parameter range and dimension d of node2vec in this method, dimension 128-. When the dimension is 64, the resulting stable negative sample set RN is smaller, and when the dimension is 512, the sensitivity of the model on the validation set drops sharply.

S1: 270 seed genes related to ischemic cardiomyopathy (set P) are obtained from the GWAS, IPA, DisGeNET and other databases;

S2: constructing a protein interaction network with the ischemic cardiomyopathy seed genes as its core, based on protein-protein interaction databases (BioGRID, HPRD, IntAct and STRING), wherein the network consists of 9329 proteins and 30274 protein-protein interactions; of the 270 seed genes in S1, 263 were identified in the protein interaction databases and 7 were not.

S3: extracting feature data of the 9329 proteins in the S2 protein interaction network based on the node2vec algorithm. In this implementation, the random walk parameters are set to walk_length = 80, num_walks = 50, window = 10, min_count = 1 and batch_words = 4, and the dimension d of the extracted protein feature data is set to 64, 128, 256 and 512 in turn.

S4: taking the protein structural feature data obtained in S3 as input, and analyzing them with the semi-supervised PU-learning algorithm to predict the disease-related proteins in the protein interaction network. Specifically, using the Spy technique, 15% of the proteins known to be related to ischemic cardiomyopathy are randomly selected as the spy sample set S, giving a positive set (PS = P - S) and a negative set (US = U + S). A naive Bayes algorithm is used to obtain an NB classifier, which scores each protein k in the negative set US and assigns it a probability label Pr(1|k); the probability labels of the spy set S determine which proteins in the unlabeled set U are most likely unrelated to the disease. Specifically, a threshold H is derived from the probability labels of the spy set S (the 10% quantile of the spy set S probabilities), and proteins with Pr(1|k) < H are treated as negative samples (disease-unrelated proteins). These steps are repeated 1000 times to obtain the high-frequency negative sample set RN. The SVM algorithm is run on the positive samples and the resulting negative sample set RN to predict, for each protein in the protein interaction network, the probability that it is related to the disease. The results are as follows:

the dimension 128-. When the dimension is 64, the resulting stable negative sample set RN is smaller, and when the dimension is 512, the sensitivity of the model on the validation set drops to 0.6735.

Example three: atrial fibrillation

In this example, to support the claimed parameter ranges of node2vec in this method, the preferred range of p is [2, 5] and the preferred range of q is [0.1, 3].

S1: 141 seed genes related to atrial fibrillation (set P) are obtained from the GWAS Catalog, MalaCards, DisGeNET and other databases;

S2: constructing a protein interaction network with the atrial fibrillation seed genes as its core, based on protein-protein interaction databases (BioGRID, HPRD, IntAct and STRING), wherein the network consists of 5745 proteins and 13606 protein-protein interactions; of the 141 seed genes in S1, 131 were identified in the protein interaction databases and 10 were not.

S3: extracting feature data of the 5745 proteins in the S2 protein interaction network based on the node2vec algorithm. In this implementation, the random walk parameters are set to walk_length = 80, num_walks = 50, window = 10, min_count = 1 and batch_words = 4; p is set to 0.1, 0.2, 0.5, 1, 2, 3, 4 and 5 in turn, q is set to 0.1, 0.2, 0.5, 1, 2, 3, 4 and 5 in turn, and the extracted protein feature data have 128 dimensions.
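The parameter sweep over p and q can be enumerated as a grid. The text lists the candidate values but does not state that every combination was run, so the full cross product below is an assumption, and the variable names are hypothetical.

```python
from itertools import product

P_VALUES = [0.1, 0.2, 0.5, 1, 2, 3, 4, 5]
Q_VALUES = [0.1, 0.2, 0.5, 1, 2, 3, 4, 5]

# Every (p, q) combination, with d fixed at 128 as in S3 (assumed full grid).
grid = [{"p": p, "q": q, "dimensions": 128}
        for p, q in product(P_VALUES, Q_VALUES)]

# Combinations inside the preferred ranges p in [2, 5] and q in [0.1, 3].
preferred = [g for g in grid if 2 <= g["p"] <= 5 and 0.1 <= g["q"] <= 3]
```

Under this assumption the sweep covers 64 settings, of which 24 lie inside the preferred ranges.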

S4: taking the protein structural feature data obtained in S3 as input, and analyzing them with the semi-supervised PU-learning algorithm to predict the disease-related proteins in the protein interaction network. Specifically, using the Spy technique, 15% of the proteins known to be related to atrial fibrillation are randomly selected as the spy sample set S, giving a positive set (PS = P - S) and a negative set (US = U + S). A naive Bayes algorithm is used to obtain an NB classifier, which scores each protein k in the negative set US and assigns it a probability label Pr(1|k); the probability labels of the spy set S determine which proteins in the unlabeled set U are most likely unrelated to the disease. Specifically, a threshold H is derived from the probability labels of the spy set S (the 10% quantile of the spy set S probabilities), and proteins with Pr(1|k) < H are treated as negative samples (disease-unrelated proteins). These steps are repeated 100 times to obtain the high-frequency negative sample set RN. The SVM algorithm is run on the positive samples and the resulting negative sample set RN to predict, for each protein in the protein interaction network, the probability that it is related to the disease. The results are as follows:

When the hyperparameters are p = 2 and q = 1, the model performs best, with an accuracy of up to 0.9994.

The technical solutions of the present invention are described clearly and completely above. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Claims

1. A method for screening disease-associated proteins based on a complex network, comprising the steps of:

1) obtaining a seed gene related to a target disease;

2) constructing a protein interaction network taking the seed gene as a core based on a protein-protein interaction database;

3) extracting characteristic data of the protein in the protein interaction network;

4) taking the characteristic data of the protein as training data, and training by adopting a machine learning algorithm to obtain a PU classifier;

5) predicting a protein associated with the target disease in the protein interaction network based on the PU classifier.

2. The method of claim 1, wherein the feature data of the proteins in the protein interaction network are extracted using the node2vec algorithm, comprising:

31) constructing an undirected graph G based on the protein interaction network data to obtain a node set and an edge set, wherein each node in the node set corresponds to a protein and each edge in the edge set represents an interaction between proteins;

32) performing graph embedding on the undirected graph G using the node2vec algorithm to obtain the protein features in the protein interaction network.

3. The method of claim 2, wherein the node2vec algorithm traverses the nodes iteratively, with two hyperparameters p and q controlling the transition probabilities, wherein the hyperparameter p takes values in the range [2, 5] and the hyperparameter q takes values in the range [0.1, 3].

4. The method of claim 3, wherein the protein features are represented as d-dimensional vectors, and the dimension d takes values in the range [128, 256].

5. The method of claim 1, wherein the machine learning algorithm is a semi-supervised algorithm PU-learning.

6. The method of claim 5, wherein training the PU classifier with the semi-supervised PU-learning algorithm comprises:

41) randomly selecting a subset of spy samples from the labeled positive set P to obtain a set S, and placing the set S into the unlabeled data set U; taking P - S as the positive set PS and U + S as the negative set US, and training an NB classifier to classify each protein sample in the negative set US; wherein the labeled positive set P is the set of seed proteins related to the target disease, and the proteins in the data set U are proteins whose relation to the target disease cannot be determined;

42) repeating step 41), and taking the negative samples that meet a set requirement as high-frequency stable negative samples to form a negative sample set RN;

43) iteratively running the SVM learning algorithm on the positive samples P and the negative sample set RN until the SVM converges or a set stopping condition is reached, to obtain the PU classifier.

7. The method according to claim 6, wherein a sample is taken as a high-frequency stable negative sample when the proportion of repetitions in which it is classified as a negative sample, out of the total number of repetitions, exceeds a set threshold.

8. The method of claim 1, wherein whole-genome scanning is performed on a target disease group and a control group, and the seed genes related to the target disease are obtained by genome-wide association analysis.

9. A storage medium having a computer program stored thereon, wherein the computer program, when run, is arranged to execute the steps of the method of any one of claims 1-8.

10. A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 8.