CN111640468A - Method for screening disease-related protein based on complex network - Google Patents

Method for screening disease-related protein based on complex network

Info

Publication number
CN111640468A
Authority
CN
China
Prior art keywords
protein
disease
algorithm
protein interaction
proteins
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010418499.3A
Other languages
Chinese (zh)
Other versions
CN111640468B (en)
Inventor
李旭
任静
王学敏
张文
闫凯境
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianshili International Gene Network Drug Innovation Center Co ltd
Original Assignee
Tianshili International Gene Network Drug Innovation Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianshili International Gene Network Drug Innovation Center Co ltd
Priority to CN202010418499.3A
Publication of CN111640468A
Application granted
Publication of CN111640468B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for screening disease-related proteins based on a complex network, comprising the following steps: 1) obtaining seed genes related to a target disease; 2) constructing a protein interaction network with the seed genes as its core, based on a protein-protein interaction database; 3) extracting feature data of the proteins in the protein interaction network; 4) using the protein feature data as training data and training a PU classifier with a machine learning algorithm; 5) predicting proteins associated with the target disease in the protein interaction network using the PU classifier. The method can identify disease-related proteins quickly and efficiently, which facilitates experimental verification by biomedical experts and follow-up work by researchers in the field.

Description

Method for screening disease-related protein based on complex network
Technical Field
The invention relates to the technical field of protein screening, in particular to a method for screening disease-related protein based on a complex network.
Background
The identification of disease-associated proteins plays an important role in the molecular typing, diagnosis and treatment of diseases. Accurate and efficient identification of disease-related proteins helps to find pathogenic genes and drug targets, and is of great significance for disease diagnosis, treatment and drug design. GWAS is an important research tool for exploring disease susceptibility genes and can quickly find the more significant disease susceptibility loci. However, GWAS makes poor use of the data and masks a large number of potentially significant disease-related proteins. In addition, traditional single-locus GWAS association analysis treats each gene in an organism independently, ignores the interactions between genes, and therefore has difficulty finding the proteins that are truly related to a disease.
Protein-protein interaction (PPI) network analysis remedies the above deficiencies. In recent years, as PPI data have steadily improved, studying protein interaction networks from a systems perspective using network science and graph-theoretic methods has become an active field. Many researchers have turned to computational protein identification, and classical measures such as degree centrality (DC), betweenness centrality (BC) and closeness centrality (CC) have been proposed; however, the identification accuracy of these measures is generally low. How to obtain a protein representation from a protein interaction network, and how to find proteins with functions similar to those of known disease-related proteins, therefore remain difficult points in protein network analysis.
Another difficulty in identifying disease-associated proteins is that labeled disease proteins are very rare in the protein interaction network around the disease seed genes, and the relationship between the large number of unlabeled proteins and the disease is unknown: these unlabeled proteins may or may not be disease-associated, and only the seed genes are positive samples. Determining the relevance of these unlabeled proteins to the disease generally depends on biological experiments or manual literature curation, which are expensive and time-consuming. The rapid development of big data and artificial intelligence technology provides a low-cost, efficient route to disease protein screening. Predicting the proteins most related to a disease with machine learning can greatly promote new drug development and clinical treatment. However, a large amount of unlabeled data easily causes under-fitting or over-fitting during training, so that the model learns too little about the whole sample space or generalizes poorly. How to make reasonable use of unlabeled data to build a model, and thereby greatly reduce the need for labeled data, is a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the defects and shortcomings of the prior art, the invention aims to provide a method for screening disease-related proteins based on a complex network. The method can identify disease-related proteins quickly and efficiently, which facilitates experimental verification by biomedical experts and follow-up work by researchers in the field.
The invention provides a method for screening disease-related proteins based on a complex network. Seed genes related to a disease are found through genome-wide association analysis (GWAS); a protein interaction network with the seed genes as its core is obtained from a protein interaction database (such as the BioGRID, STRING, IntAct or HPRD database); feature data of the proteins in the network are extracted with the node2vec algorithm; and the proteins related to the disease in the network are predicted with the semi-supervised PU-learning algorithm. The specific steps are as follows:
S1: performing genome-wide scanning of a target-disease group and a control group using a case-control study design, and performing genome-wide association analysis (GWAS) to obtain seed genes related to the target disease;
S2: constructing a protein interaction network with the seed genes as its core, based on a protein-protein interaction database;
S3: extracting feature data of the proteins in the protein interaction network using the node2vec algorithm;
Step S3 specifically includes:
S31: constructing an undirected graph G from the protein interaction network data obtained in S2, yielding a node set (the set of proteins) and an edge set (the set of protein-protein interactions);
S32: representing each protein in the S2 protein interaction network as a d-dimensional vector using the node2vec method, so that the vectors express the relationships between proteins and allow the content similarity and structural similarity of proteins in the network to be studied.
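Step S31 can be illustrated with a minimal Python sketch, assuming the interaction data are available as pairs of protein identifiers. The function name `build_undirected_graph`, the toy identifiers and the choice to drop self-interactions are assumptions made for illustration and are not taken from the patent.

```python
from collections import defaultdict

def build_undirected_graph(interactions):
    """Return (nodes, edges, adjacency) for an undirected PPI graph."""
    nodes = set()
    edges = set()
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: self-interactions are dropped
        nodes.update((a, b))
        # frozenset makes (a, b) and (b, a) the same undirected edge
        edges.add(frozenset((a, b)))
        adjacency[a].add(b)
        adjacency[b].add(a)
    return nodes, edges, dict(adjacency)

# Hypothetical toy interaction records (placeholder protein names).
toy_interactions = [
    ("P1", "P2"), ("P2", "P1"), ("P2", "P3"), ("P3", "P4"), ("P4", "P4"),
]
nodes, edges, adjacency = build_undirected_graph(toy_interactions)
print(len(nodes), len(edges))
```

Storing each edge as a `frozenset` is one simple way to deduplicate records that list the same interaction in both directions, which is common in interaction exports.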
In practical machine learning tasks, feature selection serves to reduce the number of features and the dimensionality, improve model performance, strengthen generalization and reduce overfitting. Selecting a representative feature set for building the model is therefore important. Existing feature learning methods cannot capture the diversity of connectivity patterns in a complex network such as a protein interaction network. To address this, the invention uses the node2vec algorithm, which simulates a biased random walk with a second-order Markov chain in place of pure depth-first search (DFS) or breadth-first search (BFS), and explores both the first-order neighbors of a node and higher-order, structurally equivalent neighbors, thereby diversifying the sampling of node neighborhoods. The sampling strategy is as follows:
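Given the current vertex v, the probability of visiting the next vertex x is:
\[ P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases} \]
where c_i denotes the i-th vertex of the walk, E is the edge set, \pi_{vx} is the unnormalized transition probability between vertex v and vertex x, and Z is a normalization constant. node2vec introduces two hyperparameters p and q to control the random walk strategy. Suppose the walk has just traversed the edge (t, v) and is now at vertex v, and let \pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}, where w_{vx} is the weight of the edge between vertices v and x. Then
\[ \alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases} \]
where d_{tx} is the shortest-path distance between vertex t and vertex x.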
S33: in the second-order random walk described in S32, the parameter p controls the probability of immediately revisiting the vertex just visited: the higher p is, the lower that probability. The parameter q controls whether the walk moves inward or outward: if q > 1, the walk tends to visit vertices close to vertex t (BFS-like behavior); if q < 1, it tends to visit vertices farther from vertex t (DFS-like behavior). In the invention, when node2vec is used to represent the protein features in the S2 protein interaction network as d-dimensional vectors, the preferred range of p is [2, 5], the preferred range of q is [0.1, 3], and the preferred range of the dimension d is [128, 256].
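The effect of p and q on a single step of the walk can be sketched in Python as follows. This is an illustrative sketch of the standard node2vec search bias (1/p for returning to t, 1 for a common neighbor of t and v, 1/q for moving farther away), not the patent's implementation; the function names, the toy graph and the unit edge weights are assumptions.

```python
def alpha(p, q, d_tx):
    """node2vec search bias for a candidate x at distance d_tx from t."""
    if d_tx == 0:
        return 1.0 / p   # step back to the previous vertex t
    if d_tx == 1:
        return 1.0       # x is also a neighbor of t
    return 1.0 / q       # x is two hops away from t

def transition_probabilities(adjacency, t, v, p, q, weights=None):
    """Probabilities of moving from v to each neighbor x, having arrived from t."""
    unnormalized = {}
    for x in adjacency[v]:
        if x == t:
            d_tx = 0
        elif x in adjacency[t]:
            d_tx = 1
        else:
            d_tx = 2
        # assumption: unweighted graph unless edge weights are supplied
        w_vx = 1.0 if weights is None else weights[frozenset((v, x))]
        unnormalized[x] = alpha(p, q, d_tx) * w_vx
    z = sum(unnormalized.values())  # normalization constant Z
    return {x: pi / z for x, pi in unnormalized.items()}

# Hypothetical toy graph: t-v, t-a, v-a, v-b.
toy_adjacency = {
    "t": {"v", "a"},
    "v": {"t", "a", "b"},
    "a": {"t", "v"},
    "b": {"v"},
}
probs = transition_probabilities(toy_adjacency, t="t", v="v", p=2, q=0.5)
print(probs)
```

With p = 2 and q = 0.5 in this toy graph, the walk is least likely to return to t and most likely to move outward to b, which matches the DFS-like behavior described for q < 1.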
S4: using the protein structural feature data obtained in S3 as input, and analyzing it with the semi-supervised PU-learning algorithm to predict the proteins related to the target disease in the protein interaction network.
PU-learning is a semi-supervised method for binary classification. Unlike conventional methods, it can handle the situation in which positive samples are scarce and negative samples are missing, which improves the prediction of disease-related proteins. The PU-learning training set consists of positive samples and unlabeled samples. Specifically, among the protein structural feature data obtained in S3, only a few proteins (those of the seed genes) are known to be disease-related and serve as positive samples; the proteins that interact with these seed genes are unlabeled samples (which may or may not be associated with the disease).
The step S4 specifically includes:
S41: building the PU classifier with a two-step method: in the first step, reliable negative samples are found among the unlabeled samples; in the second step, a classifier is trained on the known positive samples and the reliable negative samples.
S411: specifically, the first step of S41 uses the Spy technique to find reliable negative samples. A subset S of spy samples (some of the seed proteins) is randomly selected from the labeled positive set P (seed proteins related to the target disease) and placed into the unlabeled data set U (proteins whose relation to the target disease cannot be determined). With P - S as the positive set (PS) and U + S as the negative set (US), a naive Bayes algorithm is trained to obtain an NB classifier. The NB classifier is applied to every protein k in the negative set US, assigning each a probability label Pr(1|k), where 1 denotes the positive class. The probability labels of the spy set S are then used to decide which proteins in the unlabeled set U are most likely unrelated to the disease: a threshold H is derived from the probability labels of S, and proteins with Pr(1|k) < H are treated as negative samples (proteins unrelated to the disease).
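The spy step can be sketched in Python as below. The sketch covers only the set bookkeeping and the thresholding; training the naive Bayes classifier is omitted, and the `scores` dictionary is a hypothetical stand-in for its Pr(1|k) outputs. The 15% spy fraction and the 10% quantile follow the embodiments described later in this document; all function names and toy identifiers are assumptions.

```python
import random

def split_spies(positives, unlabeled, spy_fraction, rng):
    """Move a random fraction of the positives into the unlabeled pool as spies."""
    ordered = sorted(positives)
    n_spies = max(1, round(spy_fraction * len(ordered)))
    spies = set(rng.sample(ordered, n_spies))
    ps = set(ordered) - spies        # PS = P - S
    us = set(unlabeled) | spies      # US = U + S
    return ps, us, spies

def spy_threshold(spy_scores, quantile):
    """Threshold H: a low quantile of the spies' positive-class probabilities."""
    ordered = sorted(spy_scores)
    return ordered[int(quantile * (len(ordered) - 1))]

def reliable_negatives(scores, unlabeled, spies, quantile):
    """Unlabeled proteins whose Pr(1|k) falls below the spy threshold H."""
    h = spy_threshold([scores[s] for s in spies], quantile)
    return {k for k in unlabeled if scores[k] < h}

# Hypothetical toy data: 20 seed proteins and 10 unlabeled proteins.
positives = [f"S{i}" for i in range(1, 21)]
unlabeled = [f"U{i}" for i in range(1, 11)]
# Placeholder Pr(1|k) values standing in for the NB classifier output.
scores = {name: 0.9 for name in positives}
scores.update({f"U{i}": 0.1 for i in range(1, 6)})
scores.update({f"U{i}": 0.95 for i in range(6, 11)})

ps, us, spies = split_spies(positives, unlabeled, spy_fraction=0.15,
                            rng=random.Random(0))
rn = reliable_negatives(scores, unlabeled, spies, quantile=0.10)
print(sorted(rn))
```

The spies act as known positives hidden among the unlabeled samples, so an unlabeled protein that scores below nearly all of them is unlikely to be a positive.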
S412: repeating step S411 100 to 1000 times to obtain a set RN of high-frequency, stable negative samples (proteins unrelated to the disease). In this improvement to the algorithm, step S411 is run multiple times, each run yields a set of negative samples, and the proteins that appear frequently across runs are selected as negative samples, rather than relying on the result of a single run. For example, a protein that is selected as a negative sample across 100 runs is considered to belong to the stable negative sample set RN.
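The frequency-based selection can be sketched as follows, assuming each repetition of the spy step has already produced a set of candidate negatives. The function name `stable_negatives`, the toy runs and the 0.7 threshold are assumptions for illustration; the text above does not state a specific threshold.

```python
from collections import Counter

def stable_negatives(negative_sets, min_fraction):
    """Keep proteins flagged as negative in more than min_fraction of the runs."""
    counts = Counter(k for run in negative_sets for k in run)
    n_runs = len(negative_sets)
    return {k for k, c in counts.items() if c / n_runs > min_fraction}

# Hypothetical results of four repetitions of the spy step.
runs = [
    {"U1", "U2", "U3"},
    {"U1", "U2"},
    {"U1", "U3"},
    {"U1", "U2", "U4"},
]
rn = stable_negatives(runs, min_fraction=0.7)
print(sorted(rn))
```

Because each run draws a different random spy set, counting across runs filters out proteins that were flagged only by an unlucky draw.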
S413: specifically, the second step of S41 uses an SVM algorithm. The SVM is run iteratively on the positive sample set P (known disease-related proteins) and the negative sample set RN (disease-unrelated proteins) obtained in S412, until it converges or a preset stopping condition is reached, yielding the PU classifier.
S42: using the PU classifier obtained in S41 to predict, for each protein in the protein interaction network, the probability that it is related to the disease, thereby predicting the disease-related proteins.
S5: the predicted proteins are evaluated with the following metrics: accuracy, precision, recall and F1 value. Ideally both precision and recall are high, but in practice they usually trade off: when precision is high, recall tends to be low, and vice versa. The aim is to improve recall while maintaining precision.
S51: accuracy measures the overall correctness of the model:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
S52: precision measures how reliable the classifier is when it labels a protein as disease-related:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
S53: recall measures whether the classifier can detect all disease-related proteins:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
Here TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
Figure BDA0002495938940000044
S54: the F1 value is the harmonic mean of precision and recall, i.e.
\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
It serves as a combined evaluation index of precision and recall.
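The four evaluation metrics can be computed from confusion-matrix counts, as in the short Python sketch below. It uses the standard definitions, with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives; the function name and the example counts are hypothetical and are not results from the patent.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, tn=900, fp=10, fn=50)
print(metrics)
```

In this toy case accuracy is high because true negatives dominate, while recall is below one half, which illustrates why accuracy alone is not sufficient when positives are rare.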
The invention has the technical effects that:
(1) The invention uses the node2vec algorithm to extract the structural features of proteins in the protein interaction network, which improves the accuracy of protein identification in the network compared with traditional topological properties.
(2) The invention combines the node2vec algorithm with the PU-learning algorithm, which reduces the workload of identifying disease-related proteins through biological experiments or manual curation, improves precision, and automates the classification as far as possible.
Drawings
Fig. 1 is a schematic flow chart of a method for screening a disease-related protein based on a complex network according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
The first embodiment is as follows: coronary heart disease
This embodiment demonstrates technical effect (1): extracting the structural features of proteins in the protein interaction network with the node2vec algorithm improves the accuracy of protein identification compared with traditional topological properties. In part S5 of this example, combining the node2vec algorithm with the PU-learning algorithm yielded 958 coronary heart disease-related proteins, with an accuracy of 92.35% and a recall of 56.89%; combining the topological-property features with the PU-learning algorithm yielded 8348 coronary heart disease-related proteins, with an accuracy of 32.93% and a recall of 10.66%.
As shown in fig. 1, an embodiment of the present invention provides a method for screening a disease-related protein based on a complex network, the method including:
S1: using a case-control study design, genome-wide scanning is performed on a disease group (coronary heart disease) and a control group, and genome-wide association analysis (GWAS) yields 409 seed genes related to the disease (coronary heart disease) (set P);
S2: a protein interaction network with the coronary heart disease seed genes as its core is constructed from a protein-protein interaction database (BioGRID, version 3.5.169, website https://thebiogrid.org/); the network consists of 11303 proteins and 44194 protein-protein interactions;
S3: feature data of the 11303 proteins in the S2 protein interaction network are extracted with the node2vec algorithm. In this implementation, the random walk parameters are walk_length = 80, num_walks = 50, window = 10, min_count = 1 and batch_words = 4, the hyperparameters are p = 3 and q = 3, and the extracted protein feature data have 128 dimensions.
S4: the protein structural feature data obtained in S3 are used as input and analyzed with the semi-supervised PU-learning algorithm to predict the disease-related proteins in the protein interaction network. Specifically, using the Spy technique, 15% of the proteins known to be related to coronary heart disease are randomly selected as the spy sample set S, giving a positive set (PS = P - S) and a negative set (US = U + S). A naive Bayes algorithm is used to obtain an NB classifier, which classifies each protein k in the negative set US and assigns it a probability label Pr(1|k). The probability labels of the spy set S determine which proteins in the unlabeled set U are most likely unrelated to the disease: a threshold H is set from the probability labels of S (the 10% quantile of the spy probabilities), and proteins with Pr(1|k) < H are treated as negative samples (disease-unrelated proteins). These steps are repeated 100 times to obtain the high-frequency negative sample set RN (containing 9455 proteins). An SVM algorithm is run on the positive samples (409 proteins known to be related to coronary heart disease) and the negative sample set RN (disease-unrelated proteins), the probability that each protein in the protein interaction network is disease-related is predicted, and a total of 958 proteins possibly related to coronary heart disease are obtained.
S5: evaluation. Genes related to coronary heart disease obtained by literature retrieval and manual curation (997 in total) are used as the gold standard. The 958 coronary heart disease-related proteins obtained in S4 are evaluated, giving an accuracy of 92.35%, a precision of 54.66%, a recall of 56.89% and an F1 value of 55.75%. For comparison, 6 topological properties (degree centrality, closeness centrality, eigenvector centrality (evcent), betweenness centrality, average shortest-path distance and PageRank score) are used as the protein features, and the PU-learning algorithm predicts 8348 coronary heart disease-related proteins, with an accuracy of 32.93%, a precision of 89.47%, a recall of 10.66% and an F1 value of 19.05%. The results show that extracting the structural features of proteins in the protein interaction network with the node2vec algorithm improves the accuracy of protein identification compared with traditional topological properties.
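As a consistency check on the reported figures, the two F1 values can be recomputed from the harmonic-mean definition, taking 54.66% and 89.47% as the precision values and 56.89% and 10.66% as the recall values. The arithmetic below uses only the numbers reported in this example.

```latex
\[
F_1^{\text{node2vec}}
  = \frac{2 \times 54.66 \times 56.89}{54.66 + 56.89}
  = \frac{6219.21}{111.55}
  \approx 55.75\%
\]
\[
F_1^{\text{topological}}
  = \frac{2 \times 89.47 \times 10.66}{89.47 + 10.66}
  = \frac{1907.50}{100.13}
  \approx 19.05\%
\]
```

Both results match the reported F1 values, which supports reading the second percentage in each group as precision rather than accuracy.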
Example two: ischemic cardiomyopathy
To support the claimed parameter range and dimension d of node2vec in this method, dimension 128-. When the dimension is 64, the resulting stable negative sample set RN is small, and when the dimension is 512, the sensitivity of the model on the validation set drops sharply.
S1: 270 seed genes (set P) related to ischemic cardiomyopathy are obtained from GWAS databases, IPA, DisGeNET and other databases;
S2: a protein interaction network with the ischemic cardiomyopathy seed genes as its core is constructed from protein-protein interaction databases (BioGRID, HPRD, IntAct and STRING); the network consists of 9329 proteins and 30274 protein-protein interactions. Of the 270 seed genes from S1, 263 were identified in the protein interaction databases and 7 were not.
S3: feature data of the 9329 proteins in the S2 protein interaction network are extracted with the node2vec algorithm. In this implementation, the random walk parameters are walk_length = 80, num_walks = 50, window = 10, min_count = 1 and batch_words = 4, and the dimension d of the extracted protein feature data is set to 64, 128, 256 and 512 in turn.
S4: the protein structural feature data obtained in S3 are used as input and analyzed with the semi-supervised PU-learning algorithm to predict the disease-related proteins in the protein interaction network. Specifically, using the Spy technique, 15% of the proteins known to be related to ischemic cardiomyopathy are randomly selected as the spy sample set S, giving a positive set (PS = P - S) and a negative set (US = U + S). A naive Bayes algorithm is used to obtain an NB classifier, which classifies each protein k in the negative set US and assigns it a probability label Pr(1|k). The probability labels of the spy set S determine which proteins in the unlabeled set U are most likely unrelated to the disease: a threshold H is set from the probability labels of S (the 10% quantile of the spy probabilities), and proteins with Pr(1|k) < H are treated as negative samples (disease-unrelated proteins). These steps are repeated 1000 times to obtain the high-frequency negative sample set RN. An SVM algorithm is run on the positive samples and the negative sample set RN, and the probability that each protein in the protein interaction network is disease-related is predicted. The results are as follows:
Figure BDA0002495938940000061
the dimension 128-. When the dimension is selected to be 64, the obtained stable negative sample set RN results are less, and when the dimension is selected to be 512, the sensitivity of the data model on the verification set is reduced to 0.6735.
Example three: atrial fibrillation
In this example, to support the claimed parameter ranges of node2vec, the preferred range of p is [2, 5] and the preferred range of q is [0.1, 3].
S1: 141 atrial fibrillation-related seed genes (set P) are obtained from the GWAS Catalog, MalaCards, DisGeNET and other databases;
S2: a protein interaction network with the atrial fibrillation seed genes as its core is constructed from protein-protein interaction databases (BioGRID, HPRD, IntAct and STRING); the network consists of 5745 proteins and 13606 protein-protein interactions. Of the 141 seed genes from S1, 131 were identified in the protein interaction databases and 10 were not.
S3: feature data of the 5745 proteins in the S2 protein interaction network are extracted with the node2vec algorithm. In this implementation, the random walk parameters are walk_length = 80, num_walks = 50, window = 10, min_count = 1 and batch_words = 4; p takes the values 0.1, 0.2, 0.5, 1, 2, 3, 4 and 5; q takes the values 0.1, 0.2, 0.5, 1, 2, 3, 4 and 5; and the extracted protein feature data have 128 dimensions.
S4: the protein structural feature data obtained in S3 are used as input and analyzed with the semi-supervised PU-learning algorithm to predict the disease-related proteins in the protein interaction network. Specifically, using the Spy technique, 15% of the proteins known to be related to atrial fibrillation are randomly selected as the spy sample set S, giving a positive set (PS = P - S) and a negative set (US = U + S). A naive Bayes algorithm is used to obtain an NB classifier, which classifies each protein k in the negative set US and assigns it a probability label Pr(1|k). The probability labels of the spy set S determine which proteins in the unlabeled set U are most likely unrelated to the disease: a threshold H is set from the probability labels of S (the 10% quantile of the spy probabilities), and proteins with Pr(1|k) < H are treated as negative samples (disease-unrelated proteins). These steps are repeated 100 times to obtain the high-frequency negative sample set RN. An SVM algorithm is run on the positive samples and the negative sample set RN, and the probability that each protein in the protein interaction network is disease-related is predicted. The results are as follows:
Figure BDA0002495938940000071
When the hyperparameters are p = 2 and q = 1, the model performs best, reaching a maximum accuracy of 0.9994.
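The parameter sweep in this example can be enumerated with a short Python sketch. It only builds the grid of (p, q) pairs listed in S3 and marks which pairs fall inside the preferred ranges stated above; the evaluation of each pair is not shown, and the names `P_VALUES`, `Q_VALUES` and `in_preferred_range` are assumptions.

```python
from itertools import product

# Values of p and q swept in step S3 of this example.
P_VALUES = [0.1, 0.2, 0.5, 1, 2, 3, 4, 5]
Q_VALUES = [0.1, 0.2, 0.5, 1, 2, 3, 4, 5]

def in_preferred_range(p, q):
    """Preferred ranges stated in the text: p in [2, 5], q in [0.1, 3]."""
    return 2 <= p <= 5 and 0.1 <= q <= 3

grid = list(product(P_VALUES, Q_VALUES))
preferred = [(p, q) for p, q in grid if in_preferred_range(p, q)]
print(len(grid), len(preferred))
```

The best-performing pair reported in this example, p = 2 and q = 1, lies inside the preferred region of the grid.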
The technical solutions of the present invention are clearly and completely described above, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without any creative work belong to the protection scope of the present invention.

Claims (10)

1. A method for screening disease-associated proteins based on a complex network, comprising the steps of:
1) obtaining a seed gene related to a target disease;
2) constructing a protein interaction network taking the seed gene as a core based on a protein-protein interaction database;
3) extracting characteristic data of the protein in the protein interaction network;
4) taking the characteristic data of the protein as training data, and training by adopting a machine learning algorithm to obtain a PU classifier;
5) predicting a protein associated with the target disease in the protein interaction network based on the PU classifier.
2. The method of claim 1, wherein the characteristic data of the protein in the protein interaction network is extracted by using a node2vec algorithm, and the method comprises the following steps:
31) constructing an undirected graph G based on the protein interaction network data to obtain a node set and an edge set; each node in the node set corresponds to a protein, and edges in the edge set represent interaction relations between the proteins;
32) performing graph embedding on the undirected graph G using the node2vec algorithm to obtain the protein features in the protein interaction network.
3. The method of claim 2, wherein the node2vec algorithm walks iteratively through the nodes with transition probabilities controlled by two hyperparameters p and q; wherein the value range of the hyperparameter p is [2, 5], and the value range of the hyperparameter q is [0.1, 3].
4. The method of claim 3, wherein the protein features are represented as a d-dimensional vector, and the dimension d has a value in the range of [128, 256 ].
5. The method of claim 1, wherein the machine learning algorithm is a semi-supervised algorithm PU-learning.
6. The method of claim 5, wherein the method for obtaining the PU classifier by using the semi-supervised algorithm PU-learning training comprises:
41) randomly selecting a portion of the samples from the labeled positive example set P as spy samples to obtain a set S, and placing the set S into the unlabeled data set U; taking P-S as the positive example set PS and U+S as the negative example set US, and training an NB classifier to classify each protein sample in the negative example set US; wherein the labeled positive example set P is the set of seed proteins related to the target disease, and the proteins in the data set U are proteins whose relation to the target disease cannot be determined;
42) repeating step 41), and taking the negative samples that meet a set requirement as high-frequency stable negative samples to form a negative sample set RN;
43) iteratively running the SVM learning algorithm on the positive example set P and the negative sample set RN until the SVM converges or a set stopping condition is reached, thereby obtaining the PU classifier.
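One round of step 41) can be sketched as below. This is a minimal illustration under stated assumptions: NB is read as naive Bayes and implemented here as a small Gaussian variant, the claim does not say how the spies inform the decision so the sketch simply uses the predicted label, and all function names, the spy fraction, and the two-dimensional toy vectors are hypothetical. Step 43) is not shown.

```python
import math
import random


def fit_gaussian_nb(X, y):
    """Fit class priors and per-feature means and variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6  # variance floor
                     for col, m in zip(zip(*rows), means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the label with the highest log posterior for vector x."""
    def score(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=score)


def spy_round(P, U, spy_fraction, rng):
    """Run step 41) once; return indices of U samples classified as negative."""
    n_spies = min(len(P) - 1, max(1, int(len(P) * spy_fraction)))
    spy_idx = set(rng.sample(range(len(P)), n_spies))
    PS = [x for i, x in enumerate(P) if i not in spy_idx]  # P - S
    S = [x for i, x in enumerate(P) if i in spy_idx]       # spy set S
    US = U + S                                             # U + S
    model = fit_gaussian_nb(PS + US, [1] * len(PS) + [0] * len(US))
    return {i for i, x in enumerate(U) if predict_gaussian_nb(model, x) == 0}


# Toy feature vectors, for illustration only: the last two entries of U lie
# close to the positives, the first six lie far away.
P = [(2.0, 2.0), (2.2, 1.8), (1.8, 2.1), (2.1, 2.2), (1.9, 1.9), (2.0, 2.1)]
U = [(-2.0, -2.0), (-2.1, -1.9), (-1.8, -2.2), (-2.2, -2.1),
     (-1.9, -1.8), (-2.0, -2.2), (2.0, 1.9), (2.1, 2.0)]
negatives = spy_round(P, U, spy_fraction=0.2, rng=random.Random(0))
```

Calling `spy_round` repeatedly with different random spy selections yields one set of negative indices per repetition, which is the input that step 42) aggregates.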
7. The method according to claim 6, wherein a sample is taken as a high-frequency stable negative sample when the ratio of the number of times it is classified as a negative sample to the total number of repetitions exceeds a set threshold.
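The frequency criterion can be sketched as follows, assuming each repetition yields a set of sample indices classified as negative. The function name `stable_negatives`, the toy rounds, and the threshold 0.7 are illustrative assumptions; the claim only requires that the threshold be set.

```python
from collections import Counter


def stable_negatives(rounds, threshold):
    """Keep samples whose share of negative votes across rounds exceeds threshold."""
    counts = Counter(i for negatives in rounds for i in negatives)
    total = len(rounds)
    return {i for i, c in counts.items() if c / total > threshold}


# Hypothetical outcome of four repetitions, for illustration only.
rounds = [{0, 1, 2}, {0, 2, 3}, {0, 2}, {0, 1, 2}]
RN = stable_negatives(rounds, threshold=0.7)
```

Here samples 0 and 2 are classified as negative in all four repetitions and are kept, while samples 1 and 3 fall below the threshold and are left out of RN.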
8. The method of claim 1, wherein the target disease group population and the control group population are subjected to genome-wide scanning, and the target disease-associated seed gene is obtained by genome-wide association analysis.
9. A storage medium having a computer program stored thereon, wherein the computer program, when executed, performs the steps of the method of any one of claims 1 to 8.
10. A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 8.
CN202010418499.3A 2020-05-18 2020-05-18 Method for screening disease-related protein based on complex network Active CN111640468B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010418499.3A CN111640468B (en) 2020-05-18 2020-05-18 Method for screening disease-related protein based on complex network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010418499.3A CN111640468B (en) 2020-05-18 2020-05-18 Method for screening disease-related protein based on complex network

Publications (2)

Publication Number Publication Date
CN111640468A (en) 2020-09-08
CN111640468B (en) 2021-08-24

Family

ID=72331064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010418499.3A Active CN111640468B (en) 2020-05-18 2020-05-18 Method for screening disease-related protein based on complex network

Country Status (1)

Country Link
CN (1) CN111640468B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634982A (en) * 2020-11-23 2021-04-09 上海欧易生物医学科技有限公司 Method for screening key genes and key protein sets related to research purposes
CN112652355A (en) * 2020-12-08 2021-04-13 湖南工业大学 Medicine-target relation prediction method based on deep forest and PU learning
CN112927766A (en) * 2021-03-29 2021-06-08 天士力国际基因网络药物创新中心有限公司 Method for screening disease combination drug
CN112927765A (en) * 2021-03-29 2021-06-08 天士力国际基因网络药物创新中心有限公司 Method for repositioning medicine

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631244A (en) * 2015-12-30 2016-06-01 上海交通大学 Method for predicting common disease-causing genes of two diseases
CN106529203A (en) * 2016-12-21 2017-03-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method for predicting miRNA [micro-RNA (ribonucleic acid)] target proteins of miRNA regulation protein interaction networks
US20170159045A1 (en) * 2015-12-07 2017-06-08 Zymergen, Inc. Microbial strain improvement by a htp genomic engineering platform
CN107766697A (en) * 2017-09-18 2018-03-06 西安电子科技大学 A kind of general cancer gene expression and the association analysis method that methylates
US20180101643A1 (en) * 2015-05-18 2018-04-12 The Regents Of The University Of California Systems and Methods for Predicting Glycosylation on Proteins
CN108629159A (en) * 2018-05-14 2018-10-09 辽宁大学 A method of for finding the pathogenic key protein matter of alzheimer's disease
CN109155148A (en) * 2015-12-31 2019-01-04 思科利康有限公司 The method to identify protein ligand interaction is docked for protein groups
CN109411033A (en) * 2018-11-05 2019-03-01 杭州师范大学 A kind of curative effect of medication screening technique based on complex network
WO2019079180A1 (en) * 2017-10-16 2019-04-25 Illumina, Inc. Deep convolutional neural networks for variant classification
CN110400600A (en) * 2019-08-01 2019-11-01 枣庄学院 A kind of disease associated prediction technique of miRNA- based on rotation forest algorithm
CN110910953A (en) * 2019-11-28 2020-03-24 长沙学院 Key protein prediction method based on protein-domain heterogeneous network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180101643A1 (en) * 2015-05-18 2018-04-12 The Regents Of The University Of California Systems and Methods for Predicting Glycosylation on Proteins
US20170159045A1 (en) * 2015-12-07 2017-06-08 Zymergen, Inc. Microbial strain improvement by a htp genomic engineering platform
CN105631244A (en) * 2015-12-30 2016-06-01 上海交通大学 Method for predicting common disease-causing genes of two diseases
CN109155148A (en) * 2015-12-31 2019-01-04 思科利康有限公司 The method to identify protein ligand interaction is docked for protein groups
CN106529203A (en) * 2016-12-21 2017-03-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method for predicting miRNA [micro-RNA (ribonucleic acid)] target proteins of miRNA regulation protein interaction networks
CN107766697A (en) * 2017-09-18 2018-03-06 西安电子科技大学 A kind of general cancer gene expression and the association analysis method that methylates
WO2019079180A1 (en) * 2017-10-16 2019-04-25 Illumina, Inc. Deep convolutional neural networks for variant classification
CN108629159A (en) * 2018-05-14 2018-10-09 辽宁大学 A method of for finding the pathogenic key protein matter of alzheimer's disease
CN109411033A (en) * 2018-11-05 2019-03-01 杭州师范大学 A kind of curative effect of medication screening technique based on complex network
CN110400600A (en) * 2019-08-01 2019-11-01 枣庄学院 A kind of disease associated prediction technique of miRNA- based on rotation forest algorithm
CN110910953A (en) * 2019-11-28 2020-03-24 长沙学院 Key protein prediction method based on protein-domain heterogeneous network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ADITYA GROVER et al.: "node2vec: Scalable Feature Learning for Networks", 《ARXIV》 *
JIAJIE PENG et al.: "Predicting Parkinson’s Disease Genes Based on Node2vec and Autoencoder", 《FRONTIERS IN GENETICS》 *
JICHAO ZHAO et al.: "Protein complexes prediction via positive and unlabeled learning of the PPI networks", 《2016 13TH INTERNATIONAL CONFERENCE ON SERVICE SYSTEMS AND SERVICE MANAGEMENT (ICSSSM)》 *
PENG YANG et al.: "Ensemble Positive Unlabeled Learning for Disease Gene Identification", 《PLOS ONE》 *
QI ZHAO et al.: "SSCMDA: spy and super cluster strategy for MiRNA-disease association prediction", 《ONCOTARGET》 *
ZHOU XUAN (周漩): "Prediction of breast cancer-related genes based on topological parameters of protein interaction networks", 《Journal of Guangdong Pharmaceutical University》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634982A (en) * 2020-11-23 2021-04-09 上海欧易生物医学科技有限公司 Method for screening key genes and key protein sets related to research purposes
CN112652355A (en) * 2020-12-08 2021-04-13 湖南工业大学 Medicine-target relation prediction method based on deep forest and PU learning
CN112652355B (en) * 2020-12-08 2023-07-04 湖南工业大学 Drug-target relation prediction method based on deep forest and PU learning
CN112927766A (en) * 2021-03-29 2021-06-08 天士力国际基因网络药物创新中心有限公司 Method for screening disease combination drug
CN112927765A (en) * 2021-03-29 2021-06-08 天士力国际基因网络药物创新中心有限公司 Method for repositioning medicine
CN112927765B (en) * 2021-03-29 2022-02-22 天士力国际基因网络药物创新中心有限公司 Method for repositioning medicine

Also Published As

Publication number Publication date
CN111640468B (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN111640468B (en) Method for screening disease-related protein based on complex network
Moradi et al. A graph theoretic approach for unsupervised feature selection
Aliniya et al. A novel combinatorial merge-split approach for automatic clustering using imperialist competitive algorithm
Nguyen et al. Learning graph representation via frequent subgraphs
Wang et al. ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval
Blekas et al. Greedy mixture learning for multiple motif discovery in biological sequences
Alrefai et al. Optimized feature selection method using particle swarm intelligence with ensemble learning for cancer classification based on microarray datasets
Gorisse et al. Salsas: Sub-linear active learning strategy with approximate k-nn search
Wang et al. Machine learning-based methods for prediction of linear B-cell epitopes
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
Du et al. Identification and analysis of cancer diagnosis using probabilistic classification vector machines with feature selection
Wu et al. Semi-supervised multi-label collective classification ensemble for functional genomics
Nayak et al. A Comparative Study using Next Generation Sequencing Data and Machine Learning Approach for Crohn's Disease (CD) Identification
CN117198408A (en) Multimode comprehensive integrated drug repositioning system and method
Bai et al. A unified deep learning model for protein structure prediction
Salman et al. Gene expression analysis via spatial clustering and evaluation indexing
Iqbal et al. A distance-based feature-encoding technique for protein sequence classification in bioinformatics
Rao et al. Support vector machine based disease classification model employing hasten eagle Cuculidae search optimization
Elshazly et al. Lymph diseases diagnosis approach based on support vector machines with different kernel functions
CN115206423A (en) Label guidance-based protein action relation prediction method
Chellamuthu et al. Data mining and machine learning approaches in breast cancer biomedical research
CN111666902B (en) Training method of pedestrian feature extraction model, pedestrian recognition method and related device
Liang et al. Modern Hopfield Networks for graph embedding
CN108304546B (en) Medical image retrieval method based on content similarity and Softmax classifier
Özçift et al. Swarm optimized organizing map (SWOM): a swarm intelligence based optimization of self-organizing map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Li Xu, Ren Jing, Wang Xuemin, Zhang Wen, Yan Kaijing, Wang Wenjia
Inventor before: Li Xu, Ren Jing, Wang Xuemin, Zhang Wen, Yan Kaijing

GR01 Patent grant