CN111724855B - Protein compound identification method based on minimal spanning tree Prim - Google Patents

Publication number
CN111724855B
Authority
CN
China
Prior art keywords
protein
node
cluster
cohesiveness
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010378184.0A
Other languages
Chinese (zh)
Other versions
CN111724855A (en)
Inventor
梁冰
吕嘉庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202010378184.0A
Publication of CN111724855A
Application granted
Publication of CN111724855B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a protein complex identification method based on the Prim minimum spanning tree algorithm. The method comprises the following steps: S1, calculating the degree of each protein node in a protein-protein interaction network and identifying protein clusters in the network; S2, expanding each identified protein cluster using Prim's minimum spanning tree algorithm; S3, evaluating the cohesiveness of the protein cluster after each expansion; S4, if the cohesiveness of the protein cluster increases, continuing the expansion; if it decreases, withdrawing the last expansion and taking the protein nodes present before that expansion as the protein cluster, thereby obtaining a set of protein complexes. The method predicts protein complexes with high identification accuracy. Accurate identification of protein complexes can help identify disease-causing complexes, provide clues to the root cause of a disease, and provide a basis for identifying disease genes and developing new drug targets.

Description

Protein complex identification method based on minimal spanning tree Prim
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a protein complex identification method based on minimal spanning tree Prim.
Background
Protein-protein interaction (PPI) data are generally represented as a network in which each node is a protein and each edge between two nodes represents a known interaction between the two proteins. Most proteins are biologically active only when they are part of a protein complex. Protein complexes are biological molecules that perform cellular functions such as replication, transcription, and gene expression. From an evolutionary point of view, a protein complex is believed to grow outward from a core protein with a critical function. The core protein associates with certain proteins to form a protein complex, which in turn associates with further proteins, so that the complex becomes progressively larger. As the complex expands, the links between the proteins inside it become progressively tighter.
In view of this, great effort has gone into identifying protein complexes from protein-protein interactions through large-scale, time-consuming laboratory experiments, such as affinity purification (AP) followed by mass spectrometry (MS). More recently, computational methods have been used to identify protein complexes in order to reduce the trial and error involved in such experiments. Most existing methods rely on the idea that proteins in the same complex interact with one another relatively more often. These methods draw on different topological properties, such as density, k-core, and core-attachment structure, and on different graph clustering algorithms that describe how to find dense subgraphs and how to assign nodes to them. However, these methods cannot predict protein complexes accurately, and their identification accuracy is low.
Disclosure of Invention
To address the technical problems set out above, a protein complex identification method based on the Prim minimum spanning tree algorithm is provided. The method favors accurate prediction and achieves high identification accuracy.
The technical means adopted by the invention are as follows:
A protein complex identification method based on minimal spanning tree Prim comprises the following steps:
S1, calculating the degree of each protein node in a protein interaction network, and identifying protein clusters in the network;
S2, expanding each identified protein cluster using Prim's minimum spanning tree algorithm;
S3, evaluating the cohesiveness of the protein cluster after each expansion using a cohesiveness measure;
S4, if the cohesiveness of the protein cluster increases, continuing the expansion; if it decreases, withdrawing the last expansion and taking the protein nodes present before that expansion as the protein cluster, thereby obtaining a set of protein complexes.
Further, the step S1 specifically includes:
S11, representing a relationship network of N proteins as an undirected graph G = (V, E), wherein the vertex set V represents the proteins and the edge set E represents the interactions between protein pairs;
S12, calculating the degree of each protein node in the protein interaction network: for a node v in the protein network G = (V, E), the set of nodes directly connected to v is $D_v = \{k \in V \mid (v, k) \in E\}$, $v \in V$, and the number of elements in $D_v$ is the degree of node v, i.e. $\deg_v = |D_v|$, wherein $\deg_v$ denotes the degree of node v and $|D_v|$ denotes the number of elements in the set $D_v$.
Further, the step S2 specifically includes:
S21, initialization: taking the protein node with the currently largest degree as the root of the tree, and placing all nodes not in the tree into a min-priority queue Q keyed on the mincost field; for each node v, mincost[v] denotes the minimum weight among all edges connecting v to a vertex in the tree; if no such edge exists, mincost[v] = ∞.
S22, letting the root of the tree be u and setting mincost[u] = 0; finding every node v that is adjacent to u and not in the tree; if w(u, v) < mincost[v], the edge (u, v) is a better way to attach v to the tree than any found before, so updating mincost[v] and the parent node parent[v] of v, and updating Q, wherein w(u, v) denotes the weight of the edge connecting protein node u and protein node v.
Further, the formula for determining the cohesiveness of the protein clusters after each expansion in step S3 is specifically as follows:
[Formula image BDA0002481014290000031: cohesiveness of protein cluster C; the formula itself is not reproduced in this text]
wherein C represents a protein cluster, $W_{in}$ represents the sum of the weights of the edges completely contained in the protein cluster C, $W_{out}$ represents the sum of the weights of the edges connecting the proteins belonging to protein cluster C to the rest of the network, and P is used to reflect the uncertainty of the protein interaction network.
Further, the formula for judging the cohesiveness of the protein cluster is used for reflecting whether the protein complex has strong connection inside and good separation from the outside.
Further, step S4 also includes labeling each protein that has been added to a protein complex, so that the protein is not added again.
Compared with the prior art, the invention has the following advantages:
The protein complex identification method based on minimal spanning tree Prim provided by the invention predicts protein complexes accurately, with high identification accuracy. Accurate identification of protein complexes can help identify disease-causing complexes, provide clues to the root cause of a disease, and provide a basis for identifying disease genes and developing new drug targets.
For the above reasons, the present invention can be widely applied to the fields of bioinformatics and the like.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a MIPS standard protein complex provided by an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in FIG. 1, the present invention provides a method for identifying a protein complex based on minimal spanning tree Prim, comprising the steps of:
S1, calculating the degree of each protein node in the protein interaction network, and identifying protein clusters in the network;
Further, as a preferred embodiment of the present invention, step S1 specifically includes:
S11, representing a relationship network of N proteins as an undirected graph G = (V, E), wherein the vertex set V represents the proteins and the edge set E represents the interactions between protein pairs;
S12, calculating the degree of each protein node in the protein interaction network: for a node v in the protein network G = (V, E), the set of nodes directly connected to v is $D_v = \{k \in V \mid (v, k) \in E\}$, $v \in V$, and the number of elements in $D_v$ is the degree of node v, i.e. $\deg_v = |D_v|$, wherein $\deg_v$ denotes the degree of node v and $|D_v|$ denotes the number of elements in the set $D_v$. Most proteins interact with only a few other proteins, but some protein nodes have a large number of directly linked proteins: proteins involved in simple tasks may need only a few interaction partners, whereas those involved in more complex and global tasks have more extensive interactions.
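The degree computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `node_degrees`, the toy edge list, and the alphabetical tie-break when two nodes share the largest degree are all assumptions made for the example.

```python
from collections import defaultdict


def node_degrees(edges):
    """Return {protein: degree} for an undirected PPI edge list.

    The neighbor set built for each node plays the role of D_v,
    and its size is deg_v = |D_v|.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # assumption: self-interactions are ignored in this sketch
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(adj) for node, adj in neighbors.items()}


# Hypothetical toy network, not data from the patent.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
deg = node_degrees(edges)

# The node with the largest degree is the candidate root for the expansion step.
# Ties are broken alphabetically here (an assumption).
seed = max(sorted(deg), key=deg.get)
print(deg, seed)
```

Using sets for the neighbor lists means a duplicated interaction record does not inflate the degree.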
S2, expanding each identified protein cluster using Prim's minimum spanning tree algorithm;
Further, as a preferred embodiment of the present invention, step S2 specifically includes:
S21, initialization: taking the protein node with the currently largest degree as the root of the tree, and placing all nodes not in the tree into a min-priority queue Q keyed on the mincost field; for each node v, mincost[v] denotes the minimum weight among all edges connecting v to a vertex in the tree; if no such edge exists, mincost[v] = ∞.
S22, letting the root of the tree be u and setting mincost[u] = 0; finding every node v that is adjacent to u and not in the tree; if w(u, v) < mincost[v], the edge (u, v) is a better way to attach v to the tree than any found before, so updating mincost[v] and the parent node parent[v] of v, and updating Q, wherein w(u, v) denotes the weight of the edge connecting protein node u and protein node v.
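Steps S21 and S22 follow the standard shape of Prim's algorithm with a min-priority queue. The sketch below is one way to write it in Python using the standard-library `heapq`; the graph representation, the function name `prim_from_root`, and the example weights are assumptions. Because `heapq` has no decrease-key operation, the sketch pushes a new entry whenever mincost[v] improves and skips stale entries when they are popped, which is a common substitute for "updating Q".

```python
import heapq
import math


def prim_from_root(graph, root):
    """Prim's algorithm on graph = {node: {neighbor: weight}} (undirected).

    Returns (order, mincost, parent): the order in which nodes join the tree,
    the weight of the edge that attached each node, and each node's parent.
    """
    mincost = {v: math.inf for v in graph}   # S21: no known edge yet
    parent = {v: None for v in graph}
    mincost[root] = 0.0                      # S22: the root costs nothing
    in_tree = set()
    order = []
    queue = [(0.0, root)]
    while queue:
        cost, u = heapq.heappop(queue)
        if u in in_tree or cost > mincost[u]:
            continue  # stale entry left over from an earlier, worse edge
        in_tree.add(u)
        order.append(u)
        for v, w in graph[u].items():
            if v not in in_tree and w < mincost[v]:
                mincost[v] = w               # a cheaper edge into the tree
                parent[v] = u
                heapq.heappush(queue, (w, v))
    return order, mincost, parent


# Hypothetical weighted network; the weights are illustrative only.
graph = {
    "A": {"B": 0.2, "C": 0.5, "D": 0.9},
    "B": {"A": 0.2, "C": 0.1},
    "C": {"A": 0.5, "B": 0.1},
    "D": {"A": 0.9, "E": 0.3},
    "E": {"D": 0.3},
}
order, mincost, parent = prim_from_root(graph, "A")
print(order, parent)
```

In this example C first sees the edge from A (weight 0.5) and is later re-attached through B (weight 0.1), which is the mincost and parent update that S22 describes.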
S3, evaluating the cohesiveness of the protein cluster after each expansion using a cohesiveness measure;
Further, as a preferred embodiment of the present invention, the formula for determining the cohesiveness of the protein cluster after each expansion in step S3 is specifically as follows:
[Formula image BDA0002481014290000051: cohesiveness of protein cluster C; the formula itself is not reproduced in this text]
wherein C represents a protein cluster, $W_{in}$ represents the sum of the weights of the edges completely contained in the protein cluster C, $W_{out}$ represents the sum of the weights of the edges connecting the proteins belonging to protein cluster C to the rest of the network, and P is used to reflect the uncertainty of the protein interaction network.
Further, as a preferred embodiment of the present invention, the formula for determining the cohesiveness of the protein cluster reflects whether the protein complex is strongly connected internally and well separated from the rest of the network.
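Because the formula image is not available here, the exact expression cannot be confirmed from this text. One form that is consistent with the definitions of $W_{in}$, $W_{out}$ and P, and that is used by the ClusterONE algorithm, is shown below. It is an assumption, not a reconstruction of the patent's figure; in particular the factor $|C|$ on the penalty term is taken from ClusterONE rather than from this document.

```latex
f(C) = \frac{W_{in}(C)}{W_{in}(C) + W_{out}(C) + P\,|C|}
```

As a worked example under that assumed form with $P = 0$: a cluster with $W_{in} = 3$ and $W_{out} = 1$ has $f = 3/4 = 0.75$. If adding one more protein changes the sums to $W_{in} = 4$ and $W_{out} = 2$, then $f = 4/6 \approx 0.67$, so the cohesiveness has dropped and that expansion would be withdrawn.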
S4, if the cohesiveness of the protein cluster increases, continuing the expansion; if it decreases, withdrawing the last expansion and taking the protein nodes present before that expansion as the protein cluster, thereby obtaining a set of protein complexes. The procedure yields two arrays: mincost[v], the minimum weight among all edges connecting node v to a vertex in the tree, and parent[v], the parent of node v. Each protein that has been added to a protein complex is labeled and is not added again.
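Putting steps S1 to S4 together, the expand-and-withdraw loop might look like the following Python sketch. Several things here are assumptions rather than details taken from the patent: the cohesiveness form W_in / (W_in + W_out + P·|C|), the rule that only a strict decrease triggers the withdrawal, the fixed degree-based seed order, and the hypothetical `min_size` parameter that discards very small clusters.

```python
import heapq
from collections import defaultdict


def build_graph(weighted_edges):
    """Undirected weighted graph as {node: {neighbor: weight}}."""
    graph = defaultdict(dict)
    for u, v, w in weighted_edges:
        graph[u][v] = w
        graph[v][u] = w
    return dict(graph)


def cohesiveness(graph, cluster, penalty=0.0):
    """ASSUMED form: W_in / (W_in + W_out + penalty * |C|)."""
    w_in = w_out = 0.0
    for u in cluster:
        for v, w in graph[u].items():
            if v in cluster:
                w_in += w / 2.0  # each internal edge is visited from both ends
            else:
                w_out += w
    denom = w_in + w_out + penalty * len(cluster)
    return w_in / denom if denom > 0 else 0.0


def grow_cluster(graph, seed, used, penalty=0.0):
    """Grow a cluster from seed in Prim order; withdraw the first harmful step."""
    cluster = {seed}
    best = cohesiveness(graph, cluster, penalty)
    frontier = [(w, v) for v, w in graph[seed].items() if v not in used]
    heapq.heapify(frontier)
    while frontier:
        _, v = heapq.heappop(frontier)
        if v in cluster or v in used:
            continue
        cluster.add(v)                       # tentative expansion (S2)
        score = cohesiveness(graph, cluster, penalty)  # S3
        if score < best:
            cluster.remove(v)                # S4: cohesiveness fell, withdraw
            break
        best = score
        for x, w in graph[v].items():
            if x not in cluster and x not in used:
                heapq.heappush(frontier, (w, x))
    return cluster


def identify_complexes(graph, penalty=0.0, min_size=3):
    """Seed from the highest-degree unlabeled node; label proteins once used."""
    used, complexes = set(), []
    for seed in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        if seed in used:
            continue
        cluster = grow_cluster(graph, seed, used, penalty)
        used |= cluster                      # labeled proteins are not reused
        if len(cluster) >= min_size:         # min_size is a hypothetical filter
            complexes.append(cluster)
    return complexes


# Hypothetical network: two triangles joined by one bridge edge (C-D).
edges = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0),
         ("D", "E", 1.0), ("D", "F", 1.0), ("E", "F", 1.0), ("C", "D", 1.0)]
complexes = identify_complexes(build_graph(edges))
print(complexes)
```

Starting from C, the cluster grows to {A, B, C} with cohesiveness 0.75; adding D across the bridge would lower it to about 0.67, so that step is withdrawn and the second triangle is found from seed D instead.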
To verify the effectiveness of the method, its performance was compared on the Collins network with that of six protein complex identification algorithms based on network topological characteristics. In the experiment, the standard protein complexes come from MIPS; after filtering out the standard complexes containing fewer than 3 proteins, 203 standard protein complexes remain.
To evaluate the performance of the Prim-based protein complex identification method, it was compared with six methods: MCL (Markov Clustering), RRW (repeated random walk algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), MCODE (Molecular Complex Detection), COACH, and CMC. The results are shown in Table 1 below:
TABLE 1 Comparison of the performance of different protein complex identification algorithms
[Table image BDA0002481014290000061: the data of Table 1 are not reproduced in this text]
Experimental results show that, when its output is matched against the standard protein complexes, the minimum-spanning-tree-based method detects more matching complexes. That is, its accuracy exceeds that of the six existing methods: although it identifies a small number of complexes, the accuracy of those it does identify is high, which makes it useful when studying specific protein complexes. Although some of the identified complexes are not among the known protein complexes, they may well be confirmed as true protein complexes by future laboratory experiments.
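The text does not state how a predicted complex is judged to match a standard one. The Python sketch below illustrates one commonly used convention, an overlap score |A∩B|² / (|A|·|B|) with a threshold of 0.2, together with the size filter described above (standard complexes with fewer than 3 proteins are dropped). The scoring rule, the threshold, the function names, and the toy data are all assumptions and are not taken from the patent.

```python
def overlap_score(a, b):
    """Overlap score |A ∩ B|^2 / (|A| * |B|) between two protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def count_matches(predicted, standard, threshold=0.2, min_size=3):
    """Return (number of predicted complexes that match, number of references).

    Standard complexes with fewer than min_size proteins are filtered out
    first, as described for the MIPS reference set.
    """
    reference = [set(c) for c in standard if len(c) >= min_size]
    matched = sum(
        1 for p in predicted
        if any(overlap_score(p, r) >= threshold for r in reference)
    )
    return matched, len(reference)


# Hypothetical predicted and standard complexes.
predicted = [{"A", "B", "C"}, {"D", "E", "F"}, {"X", "Y", "Z"}]
standard = [{"A", "B", "C", "G"}, {"D", "E"}, {"E", "F", "H"}]

matched, n_reference = count_matches(predicted, standard)
precision = matched / len(predicted)
print(matched, n_reference, precision)
```

Here two of the three predicted complexes overlap a retained reference complex strongly enough to count as matches, while the two-protein standard complex is excluded by the size filter.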
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for identifying a protein complex based on minimal spanning tree Prim, comprising the steps of:
S1, calculating the degree of each protein node in a protein interaction network, and identifying protein clusters in the network;
S2, expanding each identified protein cluster using Prim's minimum spanning tree algorithm; step S2 specifically comprises the following steps:
S21, initialization: taking the protein node with the currently largest degree as the root of the tree, and placing all nodes not in the tree into a min-priority queue Q keyed on the mincost field; for each node v, mincost[v] denotes the minimum weight among all edges connecting v to a vertex in the tree; if no such edge exists, mincost[v] = ∞;
S22, letting the root of the tree be u and setting mincost[u] = 0; finding every node v that is adjacent to u and not in the tree; if w(u, v) < mincost[v], the edge (u, v) is a better way to attach v to the tree than any found before, so updating mincost[v] and the parent node parent[v] of v, and updating Q, wherein w(u, v) denotes the weight of the edge connecting protein node u and protein node v;
S3, evaluating the cohesiveness of the protein cluster after each expansion using a cohesiveness measure;
S4, if the cohesiveness of the protein cluster increases, continuing the expansion; if it decreases, withdrawing the last expansion and taking the protein nodes present before that expansion as the protein cluster, thereby obtaining a set of protein complexes.
2. The method for identifying a protein complex based on minimal spanning tree Prim according to claim 1, wherein the step S1 is specifically:
S11, representing a relationship network of N proteins as an undirected graph G = (V, E), wherein the vertex set V represents the proteins and the edge set E represents the interactions between protein pairs;
S12, calculating the degree of each protein node in the protein interaction network: for a node v in the protein network G = (V, E), the set of nodes directly connected to v is $D_v = \{k \in V \mid (v, k) \in E\}$, $v \in V$, and the number of elements in $D_v$ is the degree of node v, i.e. $\deg_v = |D_v|$, wherein $\deg_v$ denotes the degree of node v and $|D_v|$ denotes the number of elements in the set $D_v$.
3. The method for identifying a protein complex based on minimal spanning tree Prim according to claim 1, wherein the formula for determining the cohesiveness of the protein clusters after each expansion in step S3 is specifically as follows:
[Formula image FDA0004038548040000021: cohesiveness of protein cluster C; the formula itself is not reproduced in this text]
wherein C represents a protein cluster, $W_{in}$ represents the sum of the weights of the edges completely contained in the protein cluster C, $W_{out}$ represents the sum of the weights of the edges connecting the proteins belonging to protein cluster C to the rest of the network, and P is used to reflect the uncertainty of the protein interaction network.
4. The minimal spanning tree Prim-based protein complex recognition method according to claim 3, wherein the formula for determining protein cluster cohesiveness is used to reflect whether or not a protein complex has strong links inside and is well separated from the outside.
5. The minimal spanning tree Prim-based protein complex identification method according to claim 1, wherein said step S4 further comprises the step of labeling each protein that has been added to a protein complex, so that the protein is not added again.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010378184.0A CN111724855B (en) 2020-05-07 2020-05-07 Protein compound identification method based on minimal spanning tree Prim

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010378184.0A CN111724855B (en) 2020-05-07 2020-05-07 Protein compound identification method based on minimal spanning tree Prim

Publications (2)

Publication Number Publication Date
CN111724855A CN111724855A (en) 2020-09-29
CN111724855B true CN111724855B (en) 2023-03-10

Family

ID=72564239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010378184.0A Active CN111724855B (en) 2020-05-07 2020-05-07 Protein compound identification method based on minimal spanning tree Prim

Country Status (1)

Country Link
CN (1) CN111724855B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945333A (en) * 2012-12-04 2013-02-27 中南大学 Key protein predicating method based on priori knowledge and network topology characteristics
CN103745258A (en) * 2013-09-12 2014-04-23 北京工业大学 Minimal spanning tree-based clustering genetic algorithm complex web community mining method

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060106545A1 (en) * 2004-11-12 2006-05-18 Jubilant Biosys Ltd. Methods of clustering proteins
US8396884B2 (en) * 2006-02-27 2013-03-12 The Regents Of The University Of California Graph querying, graph motif mining and the discovery of clusters
US20080215301A1 (en) * 2006-05-22 2008-09-04 Yeda Research And Development Co. Ltd. Method and apparatus for predicting protein structure
US20080133197A1 (en) * 2006-12-04 2008-06-05 Electronics And Telecommunications Research Institute Layout method for protein-protein interaction networks based on seed protein
CN103414786B (en) * 2013-08-28 2016-03-16 电子科技大学 A kind of data aggregation method based on minimum spanning tree
US20170131247A1 (en) * 2015-11-09 2017-05-11 Thermo Finnigan Llc Minimal Spanning Trees for Extracted Ion Chromatograms
WO2017081687A1 (en) * 2015-11-10 2017-05-18 Ofek - Eshkolot Research And Development Ltd Protein design method and system
CN109033746B (en) * 2018-06-29 2020-01-14 大连理工大学 Protein compound identification method based on node vector
CN110517729B (en) * 2019-09-02 2021-05-04 吉林大学 Method for excavating protein compound from dynamic and static protein interaction network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
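Minimum spanning tree algorithm for uncertain graphs; Zhang Anzhen et al.; Intelligent Computer and Applications; 20191101 (No. 06); paragraphs 8-12+19 *
Minimum spanning tree classification algorithm based on granular space; Sun Mengmeng et al.; Journal of Nanjing University (Natural Science); 20170930 (No. 05); paragraphs 147-155 *
Protein complex identification algorithm based on distance measurement; Li Min et al.; Journal of Jilin University (Engineering and Technology Edition); 20100915 (No. 05); paragraphs 147-152 *
Clustering algorithm for gene expression data using minimum spanning trees; Yang Guohui et al.; Journal of Computer Research and Development; 20031030 (No. 10); paragraphs 24-28 *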

Also Published As

Publication number Publication date
CN111724855A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN105138866A (en) Method for identifying protein functions based on protein-protein interaction network and network topological structure features
Cho et al. Predicting protein function by frequent functional association pattern mining in protein interaction networks
Xu et al. Essential protein detection by random walk on weighted protein-protein interaction networks
CN109545275B (en) Uncertain PPI network function module mining method based on fuzzy spectral clustering
Xu et al. From function to interaction: A new paradigm for accurately predicting protein complexes based on protein-to-protein interaction networks
CN111667881B (en) Protein function prediction method based on multi-network topology structure
CN112582027A (en) Homologous protein detection method based on biological protein information network comparison
Haque et al. A common neighbor based technique to detect protein complexes in PPI networks
Ji et al. Improved ant colony optimization for detecting functional modules in protein-protein interaction networks
CN111724855B (en) Protein compound identification method based on minimal spanning tree Prim
CN112270950B (en) Network enhancement and graph regularization-based fusion network drug target relation prediction method
CN116884505A (en) Protein-small molecule compound docking method based on local template similarity
Yu et al. A method based on local density and random walks for complexes detection in protein interaction networks
Ng et al. Blocked pattern matching problem and its applications in proteomics
CN112116947A (en) Protein interaction identification and prediction method and device based on symbol network
Yu et al. A hybrid clustering algorithm for identifying modules in Protein-Protein Interaction networks
Lei et al. A random walk based approach for improving protein-protein interaction network and protein complex prediction
Efimov et al. Detecting protein complexes from noisy protein interaction data
Feng et al. A max-flow based approach to the identification of protein complexes using protein interaction and microarray data
Choi et al. Consistent and efficient reconstruction of latent tree models
Lei et al. Identifying Essential Proteins in Dynamic PPI Network with Improved FOA
Chakrabarty et al. Analysis of graph centrality measures for identifying Ankyrin repeats
Cingovska et al. Protein Function Prediction by Clustering of Protein-Protein Interaction Network
Carter et al. Deployment and retrieval simulation of a single tether satellite system
Yue et al. Multi-scale Protein Complex Discovery based on Graph Wavelet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant