CN107240026B

CN107240026B - Community discovery method suitable for noise network

Info

Publication number: CN107240026B
Application number: CN201710260472.4A
Authority: CN
Inventors: 杨清海; 蒋群利
Original assignee: Xidian University; Xian Cetc Xidian University Radar Technology Collaborative Innovation Research Institute Co Ltd
Current assignee: Xidian University; Xian Cetc Xidian University Radar Technology Collaborative Innovation Research Institute Co Ltd
Priority date: 2017-04-20
Filing date: 2017-04-20
Publication date: 2021-01-29
Anticipated expiration: 2037-04-20
Also published as: CN107240026A

Abstract

The invention belongs to the technical field of social networks and discloses a community discovery method suitable for a noise network, comprising the following steps: calculating the importance value of the nodes in the network, and establishing a core point set and a boundary point set; selecting core representative points to construct prior information; selecting boundary representative points to construct prior information; incorporating the prior information into an extremum optimization process; randomly dividing the network, according to its topological structure, into two parts with approximately equal numbers of nodes to form an initial community structure; calculating the contribution value of each node to the community module density, moving the node with the minimum contribution to the other part to perform self-organization optimization, and repeating the self-organization optimization process until the module density value of the network no longer increases; and removing the connecting edges between the two finally obtained communities and repeating the division on each sub-network until the module density value of the whole network reaches its maximum. The invention effectively improves the accuracy of community division at lower cost and improves the robustness of community division in a noise environment.

Description

Community discovery method suitable for noise network

Technical Field

The invention belongs to the technical field of social networks, and particularly relates to a community discovery method suitable for a noise network.

Background

Many real-world networks, such as telephone networks, mail networks and crime networks, contain wrong or missing connections because accurate and complete information about the network structure is difficult to obtain; such networks are called noise networks. Most current community discovery methods find the community structure from the connection relationships between nodes. Because these methods depend entirely on the topology of the network, they are poorly suited to networks with noise: as the noise ratio increases, their ability to find the real community structure declines rapidly. In a real network environment, however, part of the prior knowledge about the community division is often known. For example, we may know that some users belong to a certain community, or that two particular users belong to the same community or to different communities. Integrating this prior information into the community division can effectively improve its accuracy and its robustness in a noise environment. However, existing methods either do not state where the prior information comes from, or form it by randomly extracting some nodes from the network. In general, the prior information is obtained by having experts in the relevant field label nodes selected from the network. Such labeling is time-consuming, labor-intensive and costly, while random extraction is too blind: the resulting prior information may provide little guidance, so the quality of the community division cannot be effectively improved at low cost.

In summary, the problems of the prior art are as follows: existing community discovery methods are not suited to networks with noise, and their ability to discover the real community structure declines rapidly as the noise ratio increases; and when prior information is acquired, high-quality prior information cannot be obtained at low labor cost, because expert labeling is expensive and randomly selected prior information may provide little guidance, which limits the accuracy of community discovery.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a community discovery method suitable for a noise network.

The community discovery method suitable for the noise network is based on the organic combination of a community division method based on extremum optimization of module density and an active learning method;

the community division method based on extremum optimization of module density incorporates the prior information into the extremum optimization of module density, optimizes the local variables and the global variable by using a pairwise constraint set, and guides community discovery in the process of optimizing the objective function;

the active learning method constructs the pairwise constraint set by actively selecting, from the network, core nodes that can represent local community structures and nodes at community boundaries, thereby generating high-quality prior information.

Further, the community division method based on extremum optimization of module density comprises the following steps:

step one, calculating the importance value of each node in the network, and establishing a core point set and a boundary point set;

step two, selecting core representative points to construct prior information;

step three, selecting boundary representative points to construct prior information;

step four, combining the prior information into an extreme value optimization process;

step five, dividing the network into two parts with approximately equal node numbers randomly according to a topological structure to form an initial community structure;

step six, calculating the contribution value of each node to the community module density, moving the node with the minimum contribution to another part to perform self-organization optimization, and repeating the self-organization optimization process until the module density value of the network is not increased any more;

and step seven, removing the connecting edge between the two finally obtained communities, and then executing the step five and the step six on each sub-network until the module density value of the whole network reaches the maximum value.

Further, the first step specifically includes: the importance value of a node is calculated using an index that comprehensively measures node importance based on the degree and the clustering coefficient, expressed as:

$$p_i = f(k_i) + g(c_i);$$

wherein $k_i$ is the degree of node $i$; $c_i = \frac{2E_i}{k_i(k_i - 1)}$ is the clustering coefficient of node $i$, and $E_i$ is the number of edges that actually exist among the neighbors of node $i$. $f(k_i)$ is the normalization of $k_i$: the ratio of the difference between the degree of node $i$ and the minimum node degree in the network to the difference between the maximum and minimum node degrees in the network. $g(c_i)$ is the normalization of $c_i$: the ratio of the difference between the maximum clustering coefficient in the network and the clustering coefficient of node $i$ to the difference between the maximum and minimum clustering coefficients in the network:

$$f(k_i) = \frac{k_i - k_{\min}}{k_{\max} - k_{\min}}, \qquad g(c_i) = \frac{c_{\max} - c_i}{c_{\max} - c_{\min}}.$$

The core point set and the boundary point set are determined according to a given parameter: all nodes in the network whose importance value is greater than the given parameter constitute the core point set CS, and the boundary point set BS is the set of non-core nodes.
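The importance index and the threshold split described above can be sketched in Python. This is a minimal illustration rather than the patented implementation: the adjacency-dictionary representation, the names `importance_values` and `split_core_boundary`, the clustering-coefficient form 2E_i/(k_i(k_i-1)), and the choice to treat nodes of degree below 2 as having coefficient 0 are all assumptions made for the example.

```python
def importance_values(adj):
    """Return p_i = f(k_i) + g(c_i) for every node.

    adj maps each node to the set of its neighbours (undirected, no self-loops).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}

    clustering = {}
    for v, nbrs in adj.items():
        k = degree[v]
        if k < 2:
            clustering[v] = 0.0  # assumption: undefined coefficient treated as 0
            continue
        ordered = list(nbrs)
        # E_i: edges that actually exist among the neighbours of v
        links = sum(
            1
            for a in range(k)
            for b in range(a + 1, k)
            if ordered[b] in adj[ordered[a]]
        )
        clustering[v] = 2.0 * links / (k * (k - 1))

    k_min, k_max = min(degree.values()), max(degree.values())
    c_min, c_max = min(clustering.values()), max(clustering.values())

    def ratio(num, den):
        return num / den if den else 0.0  # assumption: a flat range contributes 0

    return {
        v: ratio(degree[v] - k_min, k_max - k_min)
        + ratio(c_max - clustering[v], c_max - c_min)
        for v in adj
    }


def split_core_boundary(importance, threshold):
    """Core set: importance above the threshold; boundary set: everything else."""
    core = {v for v, p in importance.items() if p > threshold}
    return core, set(importance) - core


# Two triangles joined by the bridge 2-3.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
p = importance_values(graph)
core_set, boundary_set = split_core_boundary(p, threshold=1.0)
```

On this toy graph the two bridge nodes have the highest degree and the lowest clustering coefficient, so with the illustrative threshold of 1.0 they form the core point set and the remaining four nodes form the boundary point set.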

Further, the second step specifically includes: if the representative point set RS is empty, the node k with the maximum importance value is selected from the core point set CS and added to RS; otherwise, the node i in CS with the minimum similarity to RS is selected as the candidate representative point, where the similarity between node i and a set C is expressed as:

$$S(i, C) = \max\{\mathrm{Sim}(i, j) \mid j \in C\};$$

wherein $\mathrm{Sim}(i, j)$ is the similarity between nodes $i$ and $j$, $N_{+i}$ is the set formed by node $i$ and its neighbor nodes, and $\delta$ is a gain value, set to 1;

prior information <i, j> is constructed for each pair of representative points in the representative point set RS, and the constraint type is labeled by a domain expert.
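The selection rule above (first the most important core node, then repeatedly the core node least similar to the points already chosen) can be sketched as follows. The pairwise similarity used here, a Jaccard overlap of the closed neighborhoods N_{+i} and N_{+j}, is a stand-in assumed for illustration only; it is not the Sim formula of the method, and it ignores the gain value δ. The function names and the importance values in the example are likewise hypothetical.

```python
from itertools import combinations


def sim(adj, i, j):
    # ASSUMPTION: Jaccard overlap of closed neighbourhoods, used only as a
    # placeholder for the similarity Sim(i, j) of the method.
    a = adj[i] | {i}
    b = adj[j] | {j}
    return len(a & b) / len(a | b)


def set_similarity(adj, i, group):
    """S(i, C) = max over j in C of Sim(i, j)."""
    return max(sim(adj, i, j) for j in group)


def select_core_representatives(adj, core, importance, count):
    """Pick up to `count` representative points from the core point set."""
    representatives = []
    candidates = set(core)
    while candidates and len(representatives) < count:
        ordered = sorted(candidates)  # fixed order so ties break reproducibly
        if not representatives:
            pick = max(ordered, key=lambda v: importance[v])
        else:
            pick = min(
                ordered, key=lambda v: set_similarity(adj, v, representatives)
            )
        representatives.append(pick)
        candidates.remove(pick)
    return representatives


# Two triangles joined by the bridge 2-3; importance values are made up.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
importance = {0: 0.5, 2: 2.0, 3: 1.9, 5: 0.4}
reps = select_core_representatives(graph, {0, 2, 3, 5}, importance, count=2)

# Every pair of representatives becomes a query <i, j> for the domain expert.
queries = list(combinations(reps, 2))
```

Choosing the least similar candidate each time spreads the representatives over different parts of the network, which is why the second pick in this example lands in the other triangle.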

Further, the third step specifically includes: a boundary point b1 with the maximum similarity to node i is selected from the boundary point set BS; if several nodes satisfy this condition, the one with the minimum importance value is selected as the representative point; prior information <i, b1> is constructed and submitted to a domain expert, who labels its constraint type;

a boundary point b2 with the minimum similarity to node i is selected from the boundary point set BS; if several nodes satisfy this condition, the one with the maximum importance value is selected as the representative point; prior information <i, b2> is constructed and submitted to a domain expert, who labels its constraint type.
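The two boundary selections and their tie-breaking rules can be expressed with a single sort key. The sketch below is illustrative: the name `pick_boundary_pair`, the `sim` callable and the numeric values in the example are assumptions, not part of the method.

```python
def pick_boundary_pair(i, boundary, sim, importance):
    """Return the queries <i, b1> and <i, b2> for representative point i.

    b1: maximum similarity to i, ties broken by minimum importance.
    b2: minimum similarity to i, ties broken by maximum importance.
    """
    def key(b):
        # Negating importance lets one key serve both rules:
        # max(key) prefers low importance, min(key) prefers high importance.
        return (sim(i, b), -importance[b])

    ordered = sorted(boundary)
    b1 = max(ordered, key=key)
    b2 = min(ordered, key=key)
    return (i, b1), (i, b2)


# Hypothetical similarities and importance values for illustration.
sim_table = {
    frozenset((2, 0)): 0.75, frozenset((2, 1)): 0.75,
    frozenset((2, 4)): 0.2, frozenset((2, 5)): 0.2,
}
importance = {0: 0.3, 1: 0.1, 4: 0.2, 5: 0.4}

query_b1, query_b2 = pick_boundary_pair(
    2, [0, 1, 4, 5], lambda a, b: sim_table[frozenset((a, b))], importance
)
```

Nodes 0 and 1 tie on maximum similarity, so the less important node 1 is taken as b1; nodes 4 and 5 tie on minimum similarity, so the more important node 5 is taken as b2.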

Further, the fourth step specifically includes: the module density $D$ of the network (the global variable) is related to the contribution $q_i$ of each node to the module density (the local variable). Using the known pairwise constraint information, the module density $D$ is optimized by penalizing violations of the constraints, and the general form of the penalty function is defined as:

wherein $\alpha_1$ and $\alpha_2$ are balance factors between the penalty and the reward; $\langle i, j, w, type \rangle \in C$ represents the community membership relation of nodes $i$ and $j$; $w$ represents the non-negative cost of violating the constraint; $C_i$ is the community to which node $i$ belongs; and $\delta(C_i, C_j) = 1$ when $C_i = C_j$, otherwise $\delta(C_i, C_j) = 0$;

Partitions that do not satisfy the constraints are penalized, i.e., the module density contribution value of node $i$ is reduced. In this case, $\alpha_1 = 0$ and $\alpha_2 = 1$ in $U(C)$, so the local variable $q'_i$ optimized in combination with the prior information is expressed as:

wherein the two edge counts denote, respectively, the number of edges connecting node $i$ to other nodes inside its community $C_i$ and the number of edges connecting node $i$ to nodes outside $C_i$, and $|C_i|$ denotes the number of nodes in community $C_i$;

Partitions that satisfy the constraints are rewarded, i.e., the value of the global variable $D$ is increased. In this case, $\alpha_1 = 1$ and $\alpha_2 = 0$ in $U(C)$, so the global variable $D'$ optimized in combination with the prior information is expressed as:

wherein the terms denote, respectively, the number of edges between $C_1$ and $C_2$; twice the number of edges inside $C_1$; and the total number of edges connecting nodes inside $C_1$ to nodes outside it.

A division of the network $G$ is given as $G_1(C_1, E_1), G_2(C_2, E_2), \ldots, G_m(C_m, E_m)$, where $C_i$ and $E_i$ are the vertex set and edge set of $G_i$ ($i = 1, 2, \ldots, m$), and $|C_i|$ is the number of nodes in community $C_i$.
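For orientation, the unconstrained module density that these terms describe can be computed as below. This sketch assumes the standard definition from the literature, D = Σ_c (twice the internal edges of c minus the edges leaving c) / |c|, and does not include the reward term of D'; the function names are illustrative.

```python
from collections import defaultdict


def module_density(adj, community_of):
    """Assumed standard form: sum over communities of (L_in - L_out) / size,

    where L_in counts each internal edge twice and L_out counts edges
    from the community to the rest of the network.
    """
    members = defaultdict(list)
    for v, c in community_of.items():
        members[c].append(v)

    total = 0.0
    for c, nodes in members.items():
        inner = sum(1 for v in nodes for u in adj[v] if community_of[u] == c)
        outer = sum(1 for v in nodes for u in adj[v] if community_of[u] != c)
        total += (inner - outer) / len(nodes)
    return total


def node_contribution(adj, community_of, v, sizes):
    """Per-node share of the density: (k_in - k_out) / |C_v|."""
    k_in = sum(1 for u in adj[v] if community_of[u] == community_of[v])
    return (2 * k_in - len(adj[v])) / sizes[community_of[v]]


graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {v: "A" for v in graph}

d_split = module_density(graph, split)
d_merged = module_density(graph, merged)
q_sum = sum(node_contribution(graph, split, v, {"A": 3, "B": 3}) for v in graph)
```

Under this definition the global value is the sum of the per-node contributions, which is what lets an extremum optimization improve D by acting on the node with the worst local value.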

Further, in the fifth step: the network $G$ is randomly divided, according to its topology, into two parts $G_1$ and $G_2$ with approximately equal numbers of nodes; the nodes connected by edges within each part form a community, giving the initial community structure.

Further, in the sixth step: the contribution value $q'_i$ of each node to the community module density is calculated, and the node with the minimum contribution is moved to the other part for self-organization optimization; the contribution value of each node is recalculated after each move; this self-organizing optimization process is repeated until the module density value $D'$ of the network no longer increases.

Further, in the seventh step: the connecting edges between the two finally obtained communities are removed to obtain a number of connected sub-networks; steps five and six are performed on each sub-network until the module density of the whole network is maximized.
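Steps five and six can be sketched as a single bisection routine. This is a simplified illustration under stated assumptions: it uses the unpenalized contribution (k_in - k_out)/|C| and the standard module density rather than the constrained q'_i and D', it stops at the first move that fails to increase the density, and it omits the recursion of step seven. The names `bisect`, `contribution` and `module_density` are hypothetical.

```python
import random


def module_density(adj, side):
    """Module density of a two-way split; side maps node -> 0 or 1."""
    total = 0.0
    for label in (0, 1):
        members = [v for v in adj if side[v] == label]
        if not members:
            continue  # an empty part contributes nothing
        inner = sum(1 for v in members for u in adj[v] if side[u] == label)
        outer = sum(1 for v in members for u in adj[v] if side[u] != label)
        total += (inner - outer) / len(members)
    return total


def contribution(adj, side, v):
    """Assumed local variable: (k_in - k_out) / size of v's part."""
    size = sum(1 for u in adj if side[u] == side[v])
    k_in = sum(1 for u in adj[v] if side[u] == side[v])
    k_out = len(adj[v]) - k_in
    return (k_in - k_out) / size


def bisect(adj, side=None, seed=0):
    """Split the network in two and move the worst node while density rises."""
    if side is None:
        nodes = sorted(adj)
        random.Random(seed).shuffle(nodes)  # random, roughly equal halves
        half = len(nodes) // 2
        side = {v: (0 if pos < half else 1) for pos, v in enumerate(nodes)}
    else:
        side = dict(side)  # do not modify the caller's split

    best = module_density(adj, side)
    while True:
        worst = min(sorted(adj), key=lambda v: contribution(adj, side, v))
        side[worst] = 1 - side[worst]
        score = module_density(adj, side)
        if score <= best:
            side[worst] = 1 - side[worst]  # undo the move and stop
            return side, best
        best = score


graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}  # deliberately poor initial split
final, density = bisect(graph, side=start)
```

Starting from the poor split shown, the routine moves node 2 and then node 3, arriving at the two triangles; the next candidate move lowers the density, so it is undone and the loop ends.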

Another object of the present invention is to provide a social network applying the community discovery method applicable to a noise network.

The invention has the advantages and positive effects that:

1. Because the prior information is integrated into the community division for semi-supervised community discovery, the influence of noise is effectively compensated, and the robustness of the community division in a noise environment is improved.

2. The prior information is obtained by an active learning technique, so prior information that effectively improves the quality of the community division can be obtained at lower labor cost.

3. The module density is used as the community evaluation function, which overcomes the resolution limit of methods based on modularity optimization.

With the noise ratio set to 2%, 4%, 6%, 8% and 10% respectively, adding only 10 pairwise constraints improves the NMI value on the Dolphins network by 1-7%, and adding 20 pairwise constraints improves it by 3-14%; similar results are obtained on the Football network, where adding 10 pairwise constraints improves the NMI value by 2-7% and adding 20 pairwise constraints improves it by 6-13%.

Drawings

Fig. 1 is a flowchart of a community discovery method suitable for a noise network according to an embodiment of the present invention.

Fig. 2 is a schematic flow chart illustrating an implementation of the community discovery method suitable for the noise network according to the embodiment of the present invention.

FIG. 3 is a performance evaluation graph of the present invention at different noise ratios on the Dolphins network, provided by an embodiment of the present invention.

Fig. 4 is a performance evaluation graph of the invention at different noise ratios on a Football network, provided by an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.

The community discovery method suitable for the noise network provided by the embodiment of the invention is an organic combination of a community division method based on extremum optimization of module density and an active learning method; the prior information is incorporated into the extremum optimization of module density, the local variables and the global variable are optimized by using a pairwise constraint set, and community discovery is guided in the process of optimizing the objective function; core nodes that can represent local community structures and nodes at community boundaries are actively selected from the network to construct the pairwise constraint set, thereby generating high-quality prior information.

As shown in fig. 1, the community discovery method applicable to a noise network according to an embodiment of the present invention includes the following steps:

S101: calculating the importance value of the nodes in the network, and establishing a core point set and a boundary point set;

S102: selecting core representative points to construct prior information;

S103: selecting boundary representative points to construct prior information;

S104: incorporating the prior information into an extremum optimization process;

S105: randomly dividing the network, according to its topological structure, into two parts with approximately equal numbers of nodes to form an initial community structure;

S106: calculating the contribution value of each node to the community module density, moving the node with the minimum contribution to the other part to perform self-organization optimization, and repeating the self-organization optimization process until the module density value of the network no longer increases;

S107: removing the connecting edges between the two finally obtained communities, and executing step S105 and step S106 on each sub-network until the module density value of the whole network reaches its maximum.

The community discovery method suitable for the noise network provided by the embodiment of the invention specifically comprises the following steps:

Step 1: calculate the importance value of the nodes in the network, and determine the core point set and the boundary point set.

Further, the importance of each node in the network is evaluated using a node importance index; according to a given parameter, all nodes whose importance value is greater than the parameter form the core point set, and the boundary point set is the set of non-core points.

Step 2: select core representative points to construct prior information.

Further, if the representative point set is empty, the node k with the largest importance value is selected from the core point set and added to the representative point set; otherwise, the node i with the minimum similarity to the representative point set is selected from the core point set as a candidate representative point;

Further, for each pair of representative points in the representative point set, prior information is constructed, and a domain expert labels its constraint type.

Step 3: select boundary representative points to construct prior information.

Further, a boundary point b1 with the maximum similarity to node i is selected from the boundary point set; if several nodes satisfy this condition, the one with the minimum importance value is selected as the representative point; prior information <i, b1> is constructed and submitted to a domain expert, who labels its constraint type;

Further, a boundary point b2 with the minimum similarity to node i is selected from the boundary point set; if several nodes satisfy this condition, the one with the maximum importance value is selected as the representative point; prior information <i, b2> is constructed and submitted to a domain expert, who labels its constraint type.

Step 4: judge whether the amount of acquired prior information has reached the specified number; if so, continue with step 5; otherwise, return to step 2.
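The loop formed by steps 2 to 4 amounts to querying an expert on candidate pairs until a budget of constraints is reached. The sketch below assumes a hypothetical `ask_expert` callback that returns a constraint type, a fixed cost for every constraint, and the illustrative type names "must-link" and "cannot-link"; none of these names come from the method itself.

```python
def collect_constraints(candidate_pairs, ask_expert, budget, cost=1.0):
    """Label candidate pairs until `budget` constraints have been gathered.

    candidate_pairs: iterable of (i, j) produced by the selection steps.
    ask_expert: callable (i, j) -> constraint type (hypothetical oracle).
    Returns tuples (i, j, cost, type).
    """
    constraints = []
    asked = set()
    for i, j in candidate_pairs:
        if len(constraints) >= budget:
            break  # the specified number has been reached
        pair = frozenset((i, j))
        if i == j or pair in asked:
            continue  # never query the same pair twice
        asked.add(pair)
        constraints.append((i, j, cost, ask_expert(i, j)))
    return constraints


# Ground-truth communities stand in for the domain expert in this example.
truth = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}


def oracle(i, j):
    return "must-link" if truth[i] == truth[j] else "cannot-link"


pairs = [(2, 5), (5, 2), (2, 1), (2, 4), (0, 1)]
labelled = collect_constraints(pairs, oracle, budget=3)
```

In an evaluation setting the expert is commonly simulated from known community labels, as done here; in practice `ask_expert` would wrap a human labeling step.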

Step 5: incorporate the prior information into the extremum optimization process.

Further, the module density of the network (the global variable) is related to the contribution of each node to the module density (the local variable), and the module density is optimized by using the known pairwise constraint information to penalize violations of the constraints and reward their satisfaction;

Further, partitions that do not satisfy the constraints are penalized, i.e., the value of the local variable is reduced;

Further, partitions that satisfy the constraints are rewarded, i.e., the value of the global variable is increased.

Step 6: initialization: the whole network is randomly divided into two parts with approximately equal numbers of nodes, and the nodes connected by edges within each part form a community, giving the initial community structure.

Step 7: iteration: the node with the minimum contribution to the community module density is moved to the other part for self-organization optimization, the contribution value of each node is recalculated after each move, and the self-organization optimization process is repeated until the module density value of the network no longer increases.

Step 8: optimization: the connecting edges between the two finally obtained communities are removed to obtain a number of connected sub-networks, and step 6 and step 7 are then performed on each sub-network until the module density value of the whole network reaches its maximum.

The application of the principles of the present invention will now be described in further detail with reference to the accompanying drawings.

As shown in fig. 2, the implementation steps of the present invention are as follows:

Step 1: calculate the importance value of each node, and establish the core point set and the boundary point set.

Further, the importance value of a node is calculated using an index that comprehensively measures node importance based on the degree and the clustering coefficient, expressed as:

$$p_i = f(k_i) + g(c_i);$$

wherein $k_i$ is the degree of node $i$; $c_i = \frac{2E_i}{k_i(k_i - 1)}$ is the clustering coefficient of node $i$, and $E_i$ is the number of edges that actually exist among the neighbors of node $i$. $f(k_i)$ is the normalization of $k_i$: the ratio of the difference between the degree of node $i$ and the minimum node degree in the network to the difference between the maximum and minimum node degrees in the network. $g(c_i)$ is the normalization of $c_i$: the ratio of the difference between the maximum clustering coefficient in the network and the clustering coefficient of node $i$ to the difference between the maximum and minimum clustering coefficients in the network.

Further, the core point set and the boundary point set are determined according to a given parameter: all nodes whose importance value is greater than the given parameter constitute the core point set CS, and the boundary point set BS is the set of non-core nodes.

Step 2: select representative points from the core point set to construct prior information.

Further, if the representative point set RS is empty, the node k with the maximum importance value is selected from the core point set CS and added to RS; otherwise, the node i in CS with the minimum similarity to RS is selected as the candidate representative point, where the similarity between node i and a set C is expressed as:

$$S(i, C) = \max\{\mathrm{Sim}(i, j) \mid j \in C\};$$

wherein $\mathrm{Sim}(i, j)$ is the similarity between nodes $i$ and $j$, $N_{+i}$ is the set formed by node $i$ and its neighbor nodes, and $\delta$ is a gain value, set to 1;

Further, for each pair of representative points in the representative point set RS, prior information <i, j> is constructed, and the constraint type is labeled by a domain expert.

Step 3: select representative points from the boundary point set to construct prior information.

Further, a boundary point b1 with the maximum similarity to node i is selected from the boundary point set BS; if several nodes satisfy this condition, the one with the minimum importance value is selected as the representative point; prior information <i, b1> is constructed and submitted to a domain expert, who labels its constraint type;

Further, a boundary point b2 with the minimum similarity to node i is selected from the boundary point set BS; if several nodes satisfy this condition, the one with the maximum importance value is selected as the representative point; prior information <i, b2> is constructed, and its constraint type is labeled by a domain expert.

Step 4: incorporate the prior information into the extremum optimization process.

Further, the module density $D$ of the network (the global variable) is related to the contribution $q_i$ of each node to the module density (the local variable); using the known pairwise constraint information, the module density $D$ is optimized by penalizing violations of the constraints and rewarding their satisfaction, and the general form of the penalty (reward) function is defined as:

Further, partitions that do not satisfy the constraints are penalized, i.e., the module density contribution value of node $i$ is reduced. In this case, $\alpha_1 = 0$ and $\alpha_2 = 1$ in $U(C)$, so the local variable $q'_i$ optimized in combination with the prior information is expressed as:

wherein the two edge counts denote, respectively, the number of edges connecting node $i$ to other nodes inside its community $C_i$ and the number of edges connecting node $i$ to nodes outside $C_i$, and $|C_i|$ denotes the number of nodes in community $C_i$;

Further, partitions that satisfy the constraints are rewarded, i.e., the value of the global variable $D$ is increased. In this case, $\alpha_1 = 1$ and $\alpha_2 = 0$ in $U(C)$, so the global variable $D'$ optimized in combination with the prior information is expressed as:

wherein the terms denote, respectively, the number of edges between $C_1$ and $C_2$, and twice the number of edges inside $C_1$.

A division of the network $G$ is given as $G_1(C_1, E_1), G_2(C_2, E_2), \ldots, G_m(C_m, E_m)$, where $C_i$ and $E_i$ are the vertex set and edge set of $G_i$ ($i = 1, 2, \ldots, m$), and $|C_i|$ is the number of nodes in community $C_i$.

Step 5: the network $G$ is randomly divided, according to its topology, into two parts $G_1$ and $G_2$ with approximately equal numbers of nodes; the nodes connected by edges within each part form a community, giving the initial community structure.

Step 6: the contribution value $q'_i$ of each node to the community module density is calculated, and the node with the minimum contribution is moved to the other part for self-organization optimization; the contribution value of each node is recalculated after each move; this self-organizing optimization process is repeated until the module density value $D'$ of the network no longer increases.

Step 7: the connecting edges between the two finally obtained communities are removed to obtain a number of connected sub-networks; steps 5 and 6 are performed on each sub-network until the module density of the whole network is maximized.
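The edge removal in the last step can be sketched as pruning every edge that crosses the two-way split and then collecting the connected components that remain. The function name `split_into_subnetworks` and the adjacency-dictionary representation are assumptions made for this illustration.

```python
def split_into_subnetworks(adj, side):
    """Drop edges between the two parts and return the connected sub-networks.

    adj maps node -> set of neighbours; side maps node -> 0 or 1.
    Each returned sub-network is itself an adjacency dictionary.
    """
    pruned = {
        v: {u for u in nbrs if side[u] == side[v]} for v, nbrs in adj.items()
    }

    seen = set()
    parts = []
    for start in pruned:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:  # depth-first traversal of one component
            v = stack.pop()
            component.add(v)
            for u in pruned[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        parts.append({v: pruned[v] for v in component})
    return parts


# Two triangles joined by the bridge 2-3, split along the bridge.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
subnetworks = split_into_subnetworks(
    graph, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
)
```

Each returned sub-network can then be passed back to the bisection of steps 5 and 6; a part that is not internally connected yields more than one sub-network, which is how a single split can produce several communities.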

The effects of the present invention will be described in detail below with reference to performance evaluation.

FIG. 3 is a performance evaluation graph of the present invention at different noise ratios on the Dolphins network.

FIG. 4 is a performance evaluation graph of the present invention at different noise ratios on the Football network.

The Dolphins network was constructed by D. Lusseau et al. from observations, over a period of up to 7 years, of a dolphin population inhabiting Doubtful Sound, New Zealand. The network includes 62 nodes and 159 edges; each node represents one dolphin in the population, and an edge indicates that the two connected dolphins have frequent contact.

The Football network was built by Girvan and Newman from the American college football league for the 2000 season. The network includes 115 nodes and 613 edges; each node represents a football team, and an edge indicates that the two teams played each other during the season. Games between teams in the same conference are more frequent than games between teams in different conferences.

As can be seen from FIG. 3 and FIG. 4, adding prior information improves the performance of the algorithm at every noise ratio on both networks. On the Dolphins network, adding 10 pairwise constraints improves the NMI value by 1-7%, and adding 20 pairwise constraints improves it by 3-14%; similar results are obtained on the Football network, where adding 10 pairwise constraints improves the NMI value by 2-7% and adding 20 pairwise constraints improves it by 6-13%.
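The NMI (normalized mutual information) values quoted above compare a detected partition with the known communities. A minimal sketch follows, assuming the common normalization 2·I(X;Y)/(H(X)+H(Y)); the text does not state which NMI variant was used, and the function name `nmi` is illustrative.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalised mutual information of two labelings of the same nodes.

    Assumed normalisation: 2 * I(A; B) / (H(A) + H(B)).
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both labelings put every node in a single community
    return 2.0 * mutual / (h_a + h_b)


same_up_to_names = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score depends only on how nodes are grouped, not on the community names, so a perfect match with swapped labels still gives the maximum value.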

As the noise ratio in the network increases, the performance of community discovery methods that depend entirely on the network topology declines rapidly; integrating the prior information into the community discovery process effectively compensates for the influence of the noise and maintains a higher accuracy of community division.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A community discovery system applicable to a noise network, for implementing a community discovery method applicable to a noise network, characterized in that the community discovery method is based on the organic combination of a community division method based on extremum optimization of module density and an active learning method;

the active learning method comprises actively selecting, from the network, core nodes capable of representing local community structures and nodes at community boundaries to construct a pairwise constraint set, thereby generating high-quality prior information;

the community division method based on extremum optimization of module density comprises the following steps:

selecting core representative points to construct prior information;

selecting boundary representative points to construct prior information;

2. The community discovery system for a noise network of claim 1, wherein said first step comprises: the importance value of a node is calculated using an index that comprehensively measures node importance based on the degree and the clustering coefficient, expressed as:

$$p_i = f(k_i) + g(c_i);$$

wherein $k_i$ is the degree of node $i$; $c_i = \frac{2E_i}{k_i(k_i - 1)}$ is the clustering coefficient of node $i$, and $E_i$ is the number of edges that actually exist among the neighbors of node $i$; $f(k_i)$ is the normalization of $k_i$, namely the ratio of the difference between the degree of node $i$ and the minimum node degree in the network to the difference between the maximum and minimum node degrees in the network; $g(c_i)$ is the normalization of $c_i$, namely the ratio of the difference between the maximum clustering coefficient in the network and the clustering coefficient of node $i$ to the difference between the maximum and minimum clustering coefficients in the network;

according to given parameters

3. The community discovery system for a noise network as claimed in claim 1, wherein said second step specifically comprises: if the representative point set RS is empty, selecting a node k with the maximum importance value from the core point set CS and adding the node k into the representative point set RS; otherwise, selecting the node i with the minimum similarity to the representative point set RS from the core point set CS as the candidate representative point, where the similarity between the node i and the set CS is represented as:

S(i,CS)＝max(Sim(i,j)|j∈CS)；

wherein,

and constructing prior information < i, j > for each pair of representative points in the representative point set RS, and marking the constraint types of the prior information < i, j >.

4. The community discovery system for a noise network of claim 1, wherein said step three specifically comprises: selecting a boundary point b1 with the maximum similarity to the node i from the boundary node set BS, if a plurality of nodes meeting the conditions exist, selecting the node with the minimum importance value as a representative point, constructing prior information < i, b1>, and marking the constraint type of the prior information < i, b1 >;

5. The community discovery system for a noise network of claim 1, wherein said step four comprises: the module density $D$ of the network (the global variable) is related to the contribution $q_i$ of each node to the module density (the local variable); using the known pairwise constraint information, the module density $D$ is optimized by penalizing violations of the constraints, and the general form of the penalty function is defined as:

partitions that do not satisfy the constraints are penalized, i.e., the module density contribution value of node $i$ is reduced; in this case, $\alpha_1 = 0$ and $\alpha_2 = 1$ in $U(C)$, so the local variable $q'_i$ optimized in combination with the prior information is expressed as:

wherein,

partitions that satisfy the constraints are rewarded, i.e., the value of the global variable $D$ is increased; in this case, $\alpha_1 = 1$ and $\alpha_2 = 0$ in $U(C)$, so the global variable $D'$ optimized in combination with the prior information is expressed as:

wherein the terms denote, respectively, the number of edges between $C_1$ and $C_2$, and twice the number of edges inside $C_1$;

6. The community discovery system for a noise network of claim 1, wherein in step five: the network $G$ is randomly divided, according to its topology, into two parts $G_1$ and $G_2$ with approximately equal numbers of nodes, and the nodes connected by edges within each part form a community, giving the initial community structure.

7. The community discovery system for a noise network of claim 1, wherein in step six: the contribution value $q'_i$ of each node to the community module density is calculated, and the node with the minimum contribution is moved to the other part for self-organization optimization; the contribution value of each node is recalculated after each move; this self-organizing optimization process is repeated until the module density value $D'$ of the network no longer increases.

8. The community discovery system for a noise network of claim 1, wherein in step seven: removing the connecting edge between the two finally obtained communities to obtain a plurality of connected sub-networks; steps 5 and 6 are performed for each subnetwork until the module density of the whole network is maximized.