CN111724855B

CN111724855B - Protein complex identification method based on minimal spanning tree Prim

Info

Publication number: CN111724855B
Application number: CN202010378184.0A
Authority: CN
Inventors: 梁冰; 吕嘉庆
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2020-05-07
Filing date: 2020-05-07
Publication date: 2023-03-10
Anticipated expiration: 2040-05-07
Also published as: CN111724855A

Abstract

The invention provides a protein complex identification method based on minimal spanning tree Prim. The method comprises the following steps: S1, calculating the degrees of the protein nodes in a protein interaction network and identifying protein clusters in the network; S2, expanding each identified protein cluster using a Prim spanning tree; S3, evaluating the cohesiveness of the protein cluster after each expansion using a cohesiveness measure; S4, if the cohesiveness of the protein cluster increases, continuing to expand; if the cohesiveness decreases, withdrawing the expansion and keeping the protein nodes that were in the cluster before that expansion, thereby obtaining a set of protein complexes. The method can predict protein complexes accurately, with high identification accuracy. Accurate protein complex identification can help identify the protein complexes that cause disease, provide clues to the root cause of a disease, and provide a basis for identifying disease genes and developing new drug targets.

Description

Protein complex identification method based on minimal spanning tree Prim

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a protein complex identification method based on minimal spanning tree Prim.

Background

Protein-protein interaction (PPI) data are generally represented as a network in which each node is a protein and an edge between two nodes represents a known interaction between the two proteins. Most proteins are biologically active only when they are part of a protein complex. Protein complexes are biomolecular assemblies that carry out cellular functions such as replication, transcription, and gene expression. From an evolutionary point of view, a protein complex is believed to grow outward from a core protein that has a critical function. The core protein associates with certain proteins to form a protein complex, which in turn continues to associate with further proteins, so that the complex grows progressively larger. As the expansion proceeds, the links between proteins within the complex become progressively more compact.

Accordingly, great effort has gone into identifying protein complexes from protein-protein interactions through large-scale, time-consuming laboratory experiments, such as affinity purification (AP) followed by mass spectrometry (MS). More recently, computational methods have been used to identify protein complexes in order to reduce the amount of trial and error in such experiments. Most existing methods rely on the idea that proteins in the same complex interact with each other relatively more often. These computational methods are built on different topological properties, such as density, k-core, and core-attachment structure, and draw on different graph clustering algorithms that describe how to find dense subgraphs and how to assemble nodes into them. However, these methods cannot predict protein complexes accurately, and their identification accuracy is low.

Disclosure of Invention

In view of the technical problems set forth above, a protein complex identification method based on minimal spanning tree Prim is provided. The method is oriented toward accurate prediction and achieves high identification accuracy.

The technical means adopted by the invention are as follows:

A protein complex identification method based on minimal spanning tree Prim comprises the following steps:

S1, calculating the degrees of the protein nodes in a protein interaction network, and identifying protein clusters in the network;

S2, expanding each identified protein cluster using a Prim spanning tree;

S3, evaluating the cohesiveness of the protein cluster after each expansion using a cohesiveness measure;

S4, if the cohesiveness of the protein cluster increases, continuing to expand; if the cohesiveness decreases, withdrawing the expansion and keeping the protein nodes that were in the cluster before that expansion, thereby obtaining a set of protein complexes.

Further, the step S1 specifically includes:

S11, representing a relationship network of N proteins by an undirected graph $G = (V, E)$, wherein the vertex set $V$ represents the proteins and the edge set $E$ represents the set of interactions between protein pairs;

S12, calculating the degrees of the protein nodes in the protein interaction network: for a node $v$ in the protein network $G = (V, E)$, let $D_v$ be the set of nodes directly connected to $v$, $D_v = \{k \in V \mid (v, k) \in E\}$, $v \in V$; the number of elements in $D_v$ is the degree of node $v$, i.e. $\deg_v = |D_v|$, wherein $\deg_v$ represents the degree of node $v$ and $|D_v|$ represents the number of elements in the set $D_v$.
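The degree computation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `node_degrees`, the edge-list input format, and the choice to ignore self-loops are assumptions, not details given in the patent.

```python
from collections import defaultdict

def node_degrees(edges):
    """Return {node: degree} for an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)  # neighbors[v] plays the role of the set D_v
    for u, v in edges:
        if u == v:
            continue  # assumption: self-interactions do not count toward degree
        neighbors[u].add(v)
        neighbors[v].add(u)
    # deg_v = |D_v|
    return {v: len(d) for v, d in neighbors.items()}

# Hypothetical toy network with four proteins.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
degrees = node_degrees(edges)
seed = max(degrees, key=degrees.get)  # node with the maximum degree
```

Storing each $D_v$ as a set means duplicate edge records in the input do not inflate the degree.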

Further, the step S2 specifically includes:

S21, initialization: taking the protein node that currently has the maximum degree as the root of the tree, and putting all nodes that are not in the tree into a min-priority queue Q keyed on the mincost field; for each node v, mincost[v] represents the minimum weight among all edges connecting node v to any vertex in the tree; if no such edge exists, mincost[v] = ∞.

S22, letting the root of the tree be u and setting mincost[u] = 0; finding all nodes v that are adjacent to the root and not in the tree; if w(u, v) < mincost[v], a better edge than before has been found for joining v to the tree, so the mincost value of node v and its parent node parent[v] are updated and Q is updated, wherein w(u, v) represents the weight of the edge connecting protein node u and protein node v.
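Steps S21 and S22 follow the standard Prim procedure, which can be sketched as below. The names `mincost` and `parent` come from the text; the function name `prim_order`, the adjacency-dictionary input format, and the use of a binary heap with lazy deletion in place of an updatable priority queue Q are assumptions made for this sketch.

```python
import heapq
import math

def prim_order(adj, root):
    """Run Prim's algorithm from `root`.

    adj: {node: {neighbor: weight}} for an undirected weighted graph.
    Returns (order, parent, mincost), where `order` lists nodes in the
    order they join the tree.
    """
    mincost = {v: math.inf for v in adj}  # no connecting edge known yet
    parent = {v: None for v in adj}
    mincost[root] = 0.0
    in_tree = set()
    order = []
    heap = [(0.0, root)]
    while heap:
        _, u = heapq.heappop(heap)
        if u in in_tree:
            continue  # stale heap entry left over from an earlier update
        in_tree.add(u)
        order.append(u)
        for v, w in adj[u].items():
            # A cheaper edge into v was found: update mincost and parent.
            if v not in in_tree and w < mincost[v]:
                mincost[v] = w
                parent[v] = u
                heapq.heappush(heap, (w, v))
    return order, parent, mincost

# Hypothetical toy network; weights are treated as edge costs.
adj = {
    "A": {"B": 1.0, "C": 4.0, "D": 3.0},
    "B": {"A": 1.0, "C": 2.0},
    "C": {"A": 4.0, "B": 2.0},
    "D": {"A": 3.0},
}
order, parent, mincost = prim_order(adj, "A")
```

Pushing a fresh heap entry on every improvement and skipping stale entries on pop has the same effect as updating Q in place, and avoids needing a decrease-key operation.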

Further, the formula for determining the cohesiveness of the protein clusters after each expansion in step S3 is specifically as follows:

wherein $C$ represents a protein cluster, $W_{in}$ represents the sum of the weights of the edges completely contained in protein cluster $C$, $W_{out}$ represents the sum of the weights of the edges connecting the proteins belonging to protein cluster $C$ to the rest of the network, and $P$ is used to reflect the uncertainty of the protein interaction network.
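The formula itself does not appear in this text, so the sketch below should be read as one plausible instance, not as the patent's formula. It assumes a ClusterONE-style ratio, $W_{in} / (W_{in} + W_{out} + p\,|C|)$, in which the penalty enters the denominator scaled by cluster size; that form, the function name `cohesiveness`, and the parameter name `p` are all assumptions.

```python
def cohesiveness(adj, cluster, p=0.0):
    """Assumed cohesiveness: W_in / (W_in + W_out + p * |C|).

    adj: {node: {neighbor: weight}} for an undirected weighted graph.
    cluster: iterable of nodes forming the protein cluster C.
    p: penalty term standing in for P (its exact role is an assumption).
    """
    cluster = set(cluster)
    w_in = 0.0
    w_out = 0.0
    for u in cluster:
        for v, w in adj[u].items():
            if v in cluster:
                w_in += w / 2.0  # each internal edge is visited from both ends
            else:
                w_out += w       # boundary edge leaving the cluster
    denom = w_in + w_out + p * len(cluster)
    return w_in / denom if denom > 0 else 0.0

# Hypothetical toy network: a triangle A-B-C with one edge out to D.
adj = {
    "A": {"B": 1.0, "C": 4.0, "D": 3.0},
    "B": {"A": 1.0, "C": 2.0},
    "C": {"A": 4.0, "B": 2.0},
    "D": {"A": 3.0},
}
score = cohesiveness(adj, {"A", "B", "C"})  # W_in = 7, W_out = 3
```

Under this assumed form the score lies between 0 and 1 and rises as internal weight grows relative to boundary weight, which matches the stated purpose of rewarding strong internal connection and good separation from the outside.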

Further, the formula for judging the cohesiveness of the protein cluster is used for reflecting whether the protein complex has strong connection inside and good separation from the outside.

Further, step S4 also includes labeling each protein that has already been added to one of the protein complexes, so that it is not added again.

Compared with the prior art, the invention has the following advantages:

The protein complex identification method based on minimal spanning tree Prim provided by the invention can accurately predict protein complexes and has high identification accuracy. Accurate protein complex identification can help identify the protein complexes that cause disease, provide clues to the root cause of a disease, and provide a basis for identifying disease genes and developing new drug targets.

For the above reasons, the present invention can be widely applied to the fields of bioinformatics and the like.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of a MIPS standard protein complex provided by an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As shown in FIG. 1, the present invention provides a method for identifying a protein complex based on minimal spanning tree Prim, comprising the steps of:

Further, as a preferred embodiment of the present invention, the step S1 specifically includes:

S12, calculating the degrees of the protein nodes in the protein interaction network: for a node $v$ in the protein network $G = (V, E)$, let $D_v$ be the set of nodes directly connected to $v$, $D_v = \{k \in V \mid (v, k) \in E\}$, $v \in V$; the number of elements in $D_v$ is the degree of node $v$, i.e. $\deg_v = |D_v|$, wherein $\deg_v$ represents the degree of node $v$ and $|D_v|$ represents the number of elements in the set $D_v$. Most proteins interact with only a few other proteins, but some protein nodes have a large number of directly linked proteins: proteins involved in simple tasks may require only a few interaction partners, whereas those involved in more complex and global tasks have more extensive interactions.

S2, expanding the identified protein cluster by adopting a Prim spanning tree;

Further, as a preferred embodiment of the present invention, the step S2 specifically includes:

S21, initialization: taking the protein node that currently has the maximum degree as the root of the tree, and putting all nodes that are not in the tree into a min-priority queue Q keyed on the mincost field; for each node v, mincost[v] represents the minimum weight among all edges connecting node v to any vertex in the tree; if no such edge exists, mincost[v] = ∞.

Further, as a preferred embodiment of the present invention, the formula for determining the cohesiveness of the protein clusters after each expansion in step S3 is specifically as follows:

wherein $C$ represents a protein cluster, $W_{in}$ represents the sum of the weights of the edges completely contained in protein cluster $C$, $W_{out}$ represents the sum of the weights of the edges connecting the proteins belonging to protein cluster $C$ to the rest of the network, and $P$ is used to reflect the uncertainty of the protein interaction network.

Further, as a preferred embodiment of the present invention, the formula for judging the cohesiveness of the protein cluster is used to reflect whether the protein complex has strong connection inside and good separation from the outside.

S4, if the cohesiveness of the protein cluster increases, continuing to expand; if the cohesiveness decreases, withdrawing the expansion and keeping the protein nodes that were in the cluster before that expansion, thereby obtaining a set of protein complexes. Two arrays are obtained: mincost[v], which represents the minimum weight among all edges connecting node v to any vertex in the tree, and parent[v], which represents the parent of node v. Each protein that has already been added to one of the protein complexes is labeled and is not added again.
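The expand, evaluate, and withdraw loop of steps S2 to S4 might be put together as in the following sketch. Several points are assumptions rather than details from the patent: the cohesiveness form (a ClusterONE-style ratio), stopping on the first non-improving expansion (ties count as non-improving), ordering seeds by their degree in the full network, and all function and parameter names.

```python
import heapq

def cohesiveness(adj, cluster, p=0.0):
    """Assumed form: W_in / (W_in + W_out + p * |C|)."""
    w_in = w_out = 0.0
    for u in cluster:
        for v, w in adj[u].items():
            if v in cluster:
                w_in += w / 2.0  # internal edges are seen from both ends
            else:
                w_out += w
    denom = w_in + w_out + p * len(cluster)
    return w_in / denom if denom > 0 else 0.0

def grow_complex(adj, seed, labeled, p=0.0):
    """Grow one cluster from `seed`, adding nodes in Prim order."""
    cluster = {seed}
    best = cohesiveness(adj, cluster, p)
    heap = [(w, v) for v, w in adj[seed].items() if v not in labeled]
    heapq.heapify(heap)
    while heap:
        _, v = heapq.heappop(heap)
        if v in cluster:
            continue
        cluster.add(v)                 # tentative expansion
        score = cohesiveness(adj, cluster, p)
        if score <= best:
            cluster.remove(v)          # withdraw the expansion and stop
            break
        best = score
        for nxt, w in adj[v].items():
            if nxt not in cluster and nxt not in labeled:
                heapq.heappush(heap, (w, nxt))
    return cluster

def find_complexes(adj, p=0.0):
    """Seed from high-degree nodes; labeled proteins are never reused."""
    labeled = set()
    complexes = []
    for seed in sorted(adj, key=lambda v: len(adj[v]), reverse=True):
        if seed in labeled:
            continue
        cluster = grow_complex(adj, seed, labeled, p)
        labeled |= cluster             # label every protein in the complex
        complexes.append(cluster)
    return complexes

# Hypothetical toy network: two triangles joined by the bridge C-D.
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
adj = {}
for a, b in pairs:
    adj.setdefault(a, {})[b] = 1.0
    adj.setdefault(b, {})[a] = 1.0
complexes = find_complexes(adj)
```

On this toy network the first cluster grows to the triangle A, B, C; adding D across the bridge lowers the assumed cohesiveness from 0.75 to about 0.67, so that expansion is withdrawn and D later seeds the second triangle.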

To verify the effectiveness of the method, its performance was compared on the Collins network with that of six protein complex identification algorithms based on network topological characteristics. In the experiment, the standard protein complexes come from MIPS; after filtering out the standard complexes containing fewer than 3 proteins, 203 standard protein complexes were obtained.
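The size filter applied to the reference set can be sketched as follows. The function name, the input format (a list of protein-name collections), and the example identifiers are hypothetical; only the threshold of 3 proteins comes from the text.

```python
def filter_reference_complexes(complexes, min_size=3):
    """Keep only reference complexes with at least `min_size` distinct proteins."""
    kept = []
    for members in complexes:
        unique = set(members)  # duplicates within one complex count once
        if len(unique) >= min_size:
            kept.append(unique)
    return kept

# Hypothetical reference complexes with placeholder protein names.
reference = [
    ["P1", "P2"],              # dropped: only 2 proteins
    ["P3", "P4", "P5"],        # kept
    ["P6", "P6", "P7"],        # dropped: only 2 distinct proteins
    ["P8", "P9", "P10", "P11"],
]
standard = filter_reference_complexes(reference)
```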

To evaluate the performance of the Prim-based protein complex identification method, it was compared with six methods: MCL (Markov Clustering), RRW (Repeated Random Walks), ClusterONE (Clustering with Overlapping Neighborhood Expansion), MCODE (Molecular Complex Detection), COACH, and CMC. The results are shown in Table 1 below:

TABLE 1 Comparison of the performance of different protein complex identification algorithms

Experimental results show that, when matched against the standard protein complexes, the protein complex identification method based on the minimum spanning tree detects more matched complexes. That is, the accuracy of the method exceeds that of the six existing methods: it predicts protein complexes accurately, and although the number of complexes it identifies is small, its accuracy is high. The method therefore offers a high rate of correct identification when a specific protein complex is studied. Although some of the identified complexes are not currently known protein complexes, they are likely candidates for confirmation as true protein complexes by future laboratory experiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for identifying a protein complex based on minimal spanning tree Prim, comprising the steps of:

S2, expanding the identified protein cluster using a Prim spanning tree; the step S2 specifically comprises the following steps:

S21, initialization: taking the protein node that currently has the maximum degree as the root of the tree, and putting all nodes that are not in the tree into a min-priority queue Q keyed on the mincost field; for each node v, mincost[v] represents the minimum weight among all edges connecting node v to any vertex in the tree; if no such edge exists, mincost[v] = ∞;

S22, letting the root of the tree be u and setting mincost[u] = 0; finding all nodes v that are adjacent to the root and not in the tree; if w(u, v) < mincost[v], a better edge than before has been found for joining v to the tree, so the mincost value of node v and its parent node parent[v] are updated and Q is updated, wherein w(u, v) represents the weight of the edge connecting protein node u and protein node v;

2. The method for identifying a protein complex based on minimal spanning tree Prim according to claim 1, wherein the step S1 is specifically:

S12, calculating the degrees of the protein nodes in the protein interaction network: for a node $v$ in the protein network $G = (V, E)$, let $D_v$ be the set of nodes directly connected to $v$, $D_v = \{k \in V \mid (v, k) \in E\}$, $v \in V$; the number of elements in $D_v$ is the degree of node $v$, i.e. $\deg_v = |D_v|$, wherein $\deg_v$ represents the degree of node $v$ and $|D_v|$ represents the number of elements in the set $D_v$.

3. The method for identifying a protein complex based on minimal spanning tree Prim according to claim 1, wherein the formula for determining the cohesiveness of the protein clusters after each expansion in step S3 is specifically as follows:

4. The minimal spanning tree Prim-based protein complex recognition method according to claim 3, wherein the formula for determining protein cluster cohesiveness is used to reflect whether or not a protein complex has strong links inside and is well separated from the outside.

5. The minimal spanning tree Prim-based protein complex identification method according to claim 1, wherein said step S4 further comprises labeling each protein that has already been added to one of the protein complexes, so that it is not added again.