CN111159483A - Social network graph summary generation method based on incremental computation - Google Patents

Social network graph summary generation method based on incremental computation

Info

Publication number
CN111159483A
Authority
CN
China
Prior art keywords
tensor
boolean
matrix
graph
decomposition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911373671.1A
Other languages
Chinese (zh)
Other versions
CN111159483B (en)
Inventor
Xie Xia
Wang Jian
Jin Hai
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201911373671.1A priority Critical patent/CN111159483B/en
Publication of CN111159483A publication Critical patent/CN111159483A/en
Application granted granted Critical
Publication of CN111159483B publication Critical patent/CN111159483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a social network graph summary generation method based on incremental computation, belonging to the field of social networks. The method comprises the following steps: represent the social network graph within the target time period as a tensor to obtain a Boolean tensor T_G; perform tensor decomposition on the Boolean tensor T_G to obtain the decomposed node matrices N_1 and N_2, attribute matrices A_1, ..., A_{h-3}, and time matrix T; cluster the node matrix N_1 or N_2 to obtain the cluster centers and the cluster of each node; regard the cluster centers as the supernodes of the graph summary, and compute the superedge weights between supernodes to obtain the graph summary. The invention fuses the nodes, node attributes and timestamps of the social network into multidimensional data and, based on the binary nature of the social network graph and the high-dimensional expressive power of tensors, achieves a unified representation of high-dimensional graph data and a Boolean quantitative representation of a complex social network. Incremental CP decomposition is introduced, which makes full use of prior information such as the decomposition result of the old graph tensor, reduces the size of the tensor to be decomposed, and improves the decomposition efficiency of the graph summary.

Description

Social network graph summary generation method based on incremental computation
Technical Field
The invention belongs to the field of social networks, and particularly relates to a social network graph summary generation method based on incremental computation.
Background
Social network analysis has been a hot topic in the data mining community in recent years; querying and reasoning about the interactions between entities in a social network can yield interesting and deep insights into various phenomena. However, because social network data is large, dynamic, and complex, the representation and mining of social network graph data are constrained by computing resources and cost overhead. Therefore, the starting point for analyzing these complex large graphs is usually a concise representation, i.e., a graph summary, which helps in understanding these datasets and in expressing queries in a meaningful way. Graph summarization plays a very important role in graph data processing, from reducing the number of bits required to encode the original graph to supporting more complex database operations.
In recent years, tensor methods have been applied to graph summarization, enabling more accurate weighted graph summaries to be generated. A tensor is a multi-dimensional form of data storage; the number of dimensions is referred to as the order of the tensor. Since real tensor data is often high-dimensional and sparse, tensor decomposition is generally used to retain the original information while reducing computational complexity and data loss.
Current graph summarization methods focus only on either the temporal dynamics or the node attributes of graph data, whereas user nodes in a social network carry various attributes and the connection relationships between users change from moment to moment, so social network graph data exhibits both dynamics and node attributes. In addition, for time-series dynamic graphs, current methods repeatedly recompute historical data, resulting in low computational efficiency.
Disclosure of Invention
Aiming at the defects and improvement needs of the prior art, the invention provides a social network graph summary generation method based on incremental computation, which adopts an incremental computation framework to uniformly express the dynamics and node attributes of social network graph data, and introduces a Boolean tensor decomposition method to achieve scalable and efficient graph summary computation.
To achieve the above object, according to a first aspect of the present invention, there is provided a social network graph summary generation method based on incremental computation, the method comprising the following steps:
S1. Represent the social network graph within the target time period as a tensor to obtain the target Boolean tensor T_G;
S2. Perform tensor decomposition on the target Boolean tensor T_G to obtain the decomposed node matrices N_1 and N_2;
S3. Cluster the node matrix N_1 or N_2 to obtain the cluster centers and the cluster of each node;
S4. Regard the cluster centers as the supernodes of the graph summary, and compute the superedge weights between supernodes to obtain the graph summary of the social network graph.
Preferably, the social network graph is a dynamic undirected graph whose snapshots correspond one-to-one to timestamps.
Preferably, step S2 comprises the following steps:
S21. Combine the old Boolean tensor T_old and the target Boolean tensor T_G into a Boolean tensor T_all, whose last order is the time dimension; the old Boolean tensor T_old is the tensor representation of the social network graph for the previous time period;
S22. Perform biased sampling on the Boolean tensor T_all to generate k sub-tensors sT_i;
S23. Perform parallel distributed Boolean CP decomposition on each sub-tensor sT_i, computing the factor matrices {M_i^(1), ..., M_i^(h)} of each sub-tensor;
S24. Merge the Boolean factor matrices {M_i^(1), ..., M_i^(h)} of each sub-tensor sT_i with the Boolean factor matrices {M_old^(1), ..., M_old^(h)} of the old Boolean tensor T_old to obtain the Boolean CP decomposition result {M_all^(1), ..., M_all^(h)} of the new Boolean tensor T_all, where 1 ≤ i ≤ k indexes the sub-tensors and 1 ≤ j ≤ h indexes the factor matrices M^(j).
Preferably, step S22 comprises the following steps:
S221. For the h-order old Boolean tensor T_old, sum along every order except order j to obtain the weight vector d^(j) of the indices of order j;
S222. Divide d^(j) by the number of non-zero elements in T_old to obtain the sampling probability vector P^(j) of the indices of each order;
S223. Compute the size L_j of the sampling index set for each order of T_old according to the set sampling factor;
S224. Sample the indices of the j-th order of T_old L_j times according to the sampling probability P^(j) to obtain the sampled index set V_j;
S225. Combine the sampled index sets {V_1, V_2, ..., V_h} with the index set of the target Boolean tensor T_G to obtain {V_1, V_2, ..., V_h ∪ V_new}, where V_new denotes the time-dimension index of T_G;
S226. Extract the sampled sub-tensor according to the index set {V_1, V_2, ..., V_h ∪ V_new};
S227. Repeat steps S221 to S226 until k sub-tensors are generated.
Preferably, step S23 comprises the following steps:
S231. Initialize the factor matrices {M_i^(1), ..., M_i^(h)} of sub-tensor sT_i Y times, each time as Boolean matrices whose entries are non-zero with probability p, and take the factor matrices with the smallest reconstruction error as the final initialization;
S232. Perform h optimization sub-steps: in each sub-step, fix (h-1) of the factor matrices and optimize the remaining one so that the overall reconstruction error is minimized; completing all h sub-steps completes one iteration;
S233. Repeat step S232 until the number of iteration rounds reaches k or the iteration error is smaller than e, and return the Boolean factor matrices {M_i^(1), ..., M_i^(h)}.
Preferably, step S24 comprises the following steps:
S241. Merge the Boolean factor matrices {M_1^(1), ..., M_1^(h)} of sub-tensor sT_1 with the Boolean factor matrices {M_old^(1), ..., M_old^(h)} of the old tensor T_old to obtain the merged Boolean factor matrix set {M^(1), ..., M^(h)};
S242. Merge the Boolean factor matrices {M_2^(1), ..., M_2^(h)} of sub-tensor sT_2 with the corresponding matrices of the merged set {M^(1), ..., M^(h)}, and so on, until the Boolean factor matrices {M_k^(1), ..., M_k^(h)} of sub-tensor sT_k are merged with the corresponding matrices of the set, yielding the Boolean CP decomposition matrices {M_all^(1), ..., M_all^(h)} of the new tensor T_all.
Preferably, the merging of the Boolean factor matrices comprises the following steps:
(1) Compute tensors V and U, where v_x is the x-th row of the sub-tensor factor matrix, u_x is the row of the old factor matrix at the corresponding sampled index, V is the tensor reconstructed from v_x together with the other factor matrices, and U is the tensor reconstructed from u_x together with the other factor matrices;
(2) Compute the reconstruction errors ε_1 and ε_2 of tensors V and U against the old tensor factor matrix:
ε_1 = ||V - T_x||
ε_2 = ||U - T_x||
where T_x is the slice tensor of the corresponding index row;
(3) Check whether ε_1 < ε_2 holds; if it does, update row u_x of the original tensor factor matrix with v_x, otherwise do not update.
Preferably, in step S3, the Hamming distance is selected as the distance metric, the number r of cluster centers is set, and K-Means clustering is used to obtain the cluster centers S_i, i = 1, ..., r, and the cluster to which each node belongs.
Preferably, step S4 comprises the following steps:
S41. Compute the superedge weight between the supernodes in the graph summary (the weight formula is given only as an image in the original), where S_i and S_j are the cluster centers computed by the clustering algorithm, l and m are the numbers of nodes in S_i and S_j respectively, L is the length of the Boolean tensor T_all in the time dimension, N is the number of nodes of T_all, and σ(S_i) is the number of points contained in S_i;
S42. Compute the reconstruction error of the graph summary (the error formula is likewise given only as an image in the original);
S43. Check whether the reconstruction error meets the set threshold; if so, take the clusters as the nodes of the graph summary and the superedge weights as the weights of its edges; otherwise, change the number of cluster centers and return to step S3.
To achieve the above object, according to a second aspect of the present invention, there is provided a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the social network graph summary generation method based on incremental computation according to the first aspect.
Generally, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
(1) Aiming at the problem that existing graph summarization methods focus only on either the dynamics or the node attributes of graph data, the invention fuses the nodes, node attributes and timestamps of the social network into multidimensional data and, based on the binary nature of the social network graph and the high-dimensional expressive power of tensors, achieves a unified representation of high-dimensional graph data and a Boolean quantitative representation of a complex social network.
(2) Aiming at the low computational efficiency of existing graph summarization methods, incremental Boolean CP decomposition is introduced, which makes full use of prior information such as the decomposition result of the old graph tensor, reduces the size of the tensor to be decomposed, and improves the decomposition efficiency of the graph summary.
Drawings
Fig. 1 is a flowchart of the social network graph summary generation method based on incremental computation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
First, some terms related to the present invention are explained.
Graph summary: a concise representation of an original graph. A large number of points and edges of the graph are aggregated into supernodes and superedges, which facilitates the visualization of large graphs and the mining of graph data. A supernode is a point set formed by aggregating several nodes of the graph, a superedge is an edge set formed by aggregating several edges of the graph, and the superedge weight is computed from the adjacency characteristics and weights of the edges in the set.
Boolean tensor: a tensor all of whose elements are 0 or 1. Owing to the binary nature of the adjacency matrix of an unweighted graph, a dynamic unweighted graph can be represented as a Boolean tensor; the order of a tensor is its number of dimensions.
Undirected unweighted graph: a graph whose edges have neither direction nor weight; a dynamic undirected unweighted graph is an undirected unweighted graph at each timestamp.
Tensor decomposition: a scheme for representing a tensor through a sequence of basic operations on other, simpler tensors, generally usable for tensor completion, dimensionality reduction, feature extraction, and so on.
CP decomposition: a common form of tensor decomposition in which the tensor is decomposed into the sum of a number of rank-1 tensors, a rank-1 tensor being a special tensor that can be expressed as the outer product of several vectors.
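For an order-3 tensor, the rank-R CP decomposition just described can be written as follows (a standard textbook formulation, stated here for reference rather than quoted from the patent); in the Boolean setting used below, the sum over rank-1 terms is replaced by an element-wise logical OR:

```latex
% Rank-R CP decomposition of an order-3 tensor (standard form):
\mathcal{T} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r ,
\qquad
\mathcal{T}_{ijk} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} .
```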
As shown in Fig. 1, the present invention provides a social network graph summary generation method based on incremental computation, which comprises the following steps:
S1. Represent the social network graph within the target time period as a tensor to obtain the Boolean tensor T_G.
Preferably, the social network graph is a dynamic undirected graph whose snapshots correspond one-to-one to timestamps.
Users in the social network are abstracted as nodes, and the relationships between users are abstracted as edges, yielding the social network graph. For example, in a microblog social network, microblog users are nodes, each node has several node attributes such as gender, education, and occupation, and the follow relationships between users are edges. The follow relationships between users change dynamically, and therefore the social network graph data is dynamic.
In this embodiment, the target time period is 1 day, i.e., a social network graph summary within 1 day needs to be generated. In the generated graph summary of the microblog social network, user nodes with similar attributes and followees are represented by a supernode, and the connection relationships between different user supernodes are represented by superedges.
The graph data is constructed as a high-order tensor, with the node attributes and the timestamp of the graph data serving as different dimensions of the tensor. The tensor is binary: each non-zero element represents an edge of the dynamic attribute graph together with its two nodes, their attributes, and its timestamp.
For a high-order sparse tensor, storing all elements would consume a large amount of storage space, so for the graph tensor the invention stores only tuples holding the index values of non-zero elements in the different dimensions. For example, the tuple (Node1, Node2, Node1.attribute, Node2.attribute, T, ...) records that at time T there is an edge with Node1 and Node2 as endpoints, and that the attributes of Node1 and Node2 are Node1.attribute and Node2.attribute respectively. To support the computation of large-scale graph data, the graph tuples are uploaded to a distributed file system (HDFS).
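The tuple-based storage just described can be sketched as follows. This is a minimal in-memory illustration with assumed field names; in the described system the tuples would be written to HDFS rather than kept in memory:

```python
# Minimal sketch of the tuple-based Boolean tensor storage: only the index
# tuples of non-zero elements are kept. Field names are illustrative.
def build_boolean_graph_tensor(edge_records):
    """edge_records: iterable of (node1, node2, attr1, attr2, t) tuples,
    one per observed edge; returns the set of non-zero index tuples."""
    nonzeros = set()
    for n1, n2, a1, a2, t in edge_records:
        nonzeros.add((n1, n2, a1, a2, t))
        nonzeros.add((n2, n1, a2, a1, t))  # undirected: store both orientations
    return nonzeros

# Usage: an edge between users 0 and 1 at time 3, attribute codes 2 and 5.
tensor_T_G = build_boolean_graph_tensor([(0, 1, 2, 5, 3)])
```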
S2. Perform tensor decomposition on the Boolean tensor T_G to obtain the decomposed node matrices N_1, N_2, attribute matrices A_1, ..., A_{h-3}, and time matrix T.
The rows of the decomposed node matrices N_1 and N_2 are feature vectors representing the adjacency characteristics of the nodes, the rows of the attribute matrices A_1, ..., A_{h-3} are feature vectors representing the attributes of graph nodes, and the time matrix T holds the feature vectors representing the graph in the time dimension.
Preferably, step S2 comprises the following steps:
S21. Combine the old Boolean tensor T_old and the Boolean tensor T_G into a Boolean tensor T_all, whose last order is the time dimension; the Boolean tensor T_old is the tensor representation of the social network graph in the previous time period.
In this embodiment, the Boolean tensor T_old is the tensor representation of the social network graph within the previous day.
S22. Perform biased sampling on the Boolean tensor T_all to generate k sub-tensors sT_i, 1 ≤ i ≤ k.
Biased sampling of the Boolean tensor T_all according to an importance measure increases the density of non-zero entries in the sampled sub-tensors and strengthens the influence of each sub-tensor's decomposition result on the update of T_all. The following process is illustrated with h = 2.
Suppose that the old tensor is T_old = [[0, 1], [1, 1]] and the sampling factor is 0.5.
Preferably, step S22 comprises the following steps:
S221. For the h-order old Boolean tensor T_old, sum along every order except order j to obtain the weight vector d^(j) of the indices of order j.
In this embodiment, d^(1) = [1, 2] and d^(2) = [1, 2].
S222. Divide d^(j) by the number of non-zero entries in T_old to obtain the sampling probability vector P^(j) of the indices of each order.
In this embodiment, P^(1) = [0.33, 0.67] and P^(2) = [0.33, 0.67].
s223, calculating T according to the set sampling factoroldSize L of sampling index of each orderj
In this embodiment, L1=2*0.5=1,L2=2*0.5=1。
S224. Sample the indices of the j-th order of T_old L_j times according to the sampling probability P^(j) to obtain the sampled index set V_j on that order.
In this embodiment, in the first dimension the sample size is 1, the full index set is [0, 1], and the sampling probabilities of the corresponding elements are [0.33, 0.67]; in the second dimension the sample size is 1, the full index set is [0, 1], and the sampling probabilities of the corresponding elements are [0.33, 0.67]. Suppose the sampling results are V_1 = [1] and V_2 = [1].
S225. Merge the sampled index sets with the index set of the new tensor to obtain {V_1, V_2, ..., V_h ∪ V_new}, where V_new denotes the time-dimension index of T_G.
In this embodiment, V_new = [2, 3], so the final sampling index set is {[1], [1, 2, 3]}.
S226. Extract the sampled sub-tensor according to the sampling index set {V_1, V_2, ..., V_h ∪ V_new}.
In this embodiment, the sub-tensor is T_all[1, {1, 2, 3}] = [1 1 1].
S227. Repeat steps S221 to S226 until k sub-tensors sT_1, ..., sT_k are generated.
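A minimal single-machine sketch of the biased sampling in steps S221 to S226 is given below, assuming the Boolean tensors are dense NumPy arrays; the argument names and the use of NumPy's weighted choice are assumptions, not the patent's notation:

```python
import numpy as np

def biased_sample_subtensor(T_all, T_old, new_time_idx, sample_factor=0.5,
                            rng=None):
    """Sketch of S221-S226: sample each order of T_old with probabilities
    proportional to its per-index non-zero counts, extend the time order
    with the new indices of T_G, and slice the sub-tensor out of T_all."""
    rng = np.random.default_rng(rng)
    nnz = T_old.sum()                      # number of non-zero entries
    idx_sets = []
    for j in range(T_old.ndim):
        axes = tuple(a for a in range(T_old.ndim) if a != j)
        d_j = T_old.sum(axis=axes)         # S221: index weights of order j
        p_j = d_j / nnz                    # S222: sampling probabilities
        L_j = max(1, int(T_old.shape[j] * sample_factor))   # S223
        idx_sets.append(np.sort(rng.choice(T_old.shape[j], size=L_j,
                                           replace=False, p=p_j)))  # S224
    idx_sets[-1] = np.union1d(idx_sets[-1], new_time_idx)  # S225 (time order)
    return T_all[np.ix_(*idx_sets)]        # S226: the sampled sub-tensor
```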
S23. Perform parallel distributed Boolean CP decomposition on each sub-tensor sT_i and compute the factor matrices {M_i^(1), ..., M_i^(h)} of each sub-tensor, 1 ≤ i ≤ k, 1 ≤ j ≤ h.
Preferably, step S23 comprises the following steps:
S231. Initialize the factor matrices {M_i^(1), ..., M_i^(h)} of sub-tensor sT_i Y times, each time as Boolean matrices whose entries are non-zero with probability p, and take the factor matrices with the smallest reconstruction error as the final initialization.
In this embodiment, Y is set according to actual requirements, typically an integer between 5 and 20.
S232. Perform h optimization sub-steps: in each sub-step, fix (h-1) of the factor matrices and optimize the remaining one so that the overall reconstruction error is minimized; completing all h sub-steps completes one iteration.
This embodiment uses least-squares optimization. The following process is illustrated with h = 3.
The factor matrices of the sub-tensor sT_i are M_i^(1), M_i^(2), and M_i^(3). Fix M_i^(2) and M_i^(3), and optimize M_i^(1) so that the reconstruction error is minimized; fix M_i^(1) and M_i^(3), and optimize M_i^(2) so that the reconstruction error is minimized; fix M_i^(1) and M_i^(2), and optimize M_i^(3) so that the reconstruction error is minimized.
S233. Repeat step S232 until the number of iteration rounds reaches k or the iteration error is smaller than e, and return the Boolean factor matrices {M_i^(1), ..., M_i^(h)}.
In this embodiment, k and e are set according to actual requirements.
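For illustration, a minimal single-machine sketch of this alternating scheme is shown below, assuming dense NumPy Boolean arrays. The greedy bit-flip update stands in for the least-squares optimization, the distributed execution is omitted, and the function names are assumptions:

```python
import numpy as np

def boolean_reconstruct(factors):
    """OR of R rank-1 Boolean outer products; factors[j] has shape (I_j, R)."""
    h, R = len(factors), factors[0].shape[1]
    T = np.zeros(tuple(f.shape[0] for f in factors), dtype=bool)
    for r in range(R):
        rank1 = factors[0][:, r]
        for j in range(1, h):
            rank1 = np.multiply.outer(rank1, factors[j][:, r])
        T |= rank1
    return T

def boolean_cp(T, R, Y=10, max_rounds=20, e=0, p=0.1, rng=None):
    """Sketch of S231-S233 for one sub-tensor T with target Boolean rank R."""
    rng = np.random.default_rng(rng)
    h = T.ndim
    # S231: Y random Boolean initializations; keep the least-error one.
    factors, best_err = None, None
    for _ in range(Y):
        cand = [rng.random((T.shape[j], R)) < p for j in range(h)]
        err = np.sum(boolean_reconstruct(cand) ^ T)
        if best_err is None or err < best_err:
            factors, best_err = cand, err
    # S232/S233: fix h-1 factors and re-optimize the remaining one in turn,
    # here by accepting any single bit flip that lowers the overall error.
    for _ in range(max_rounds):
        for j in range(h):
            for row in range(T.shape[j]):
                for r in range(R):
                    factors[j][row, r] ^= True
                    err = np.sum(boolean_reconstruct(factors) ^ T)
                    if err < best_err:
                        best_err = err
                    else:
                        factors[j][row, r] ^= True   # revert the flip
        if best_err <= e:                            # stopping criterion
            break
    return factors, best_err
```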
S24. Merge the Boolean factor matrices {M_i^(1), ..., M_i^(h)} of each sub-tensor sT_i with the Boolean factor matrices {M_old^(1), ..., M_old^(h)} of the old tensor T_old to obtain the Boolean CP decomposition result {M_all^(1), ..., M_all^(h)} of the new tensor T_all.
Merging the two introduces updates into the decomposition matrices of the old tensor T_old, which reduces the error of the decomposition.
Preferably, step S24 comprises the following steps:
S241. Merge the Boolean factor matrices {M_1^(1), ..., M_1^(h)} of sub-tensor sT_1 with the Boolean factor matrices {M_old^(1), ..., M_old^(h)} of the old tensor T_old to obtain the merged Boolean factor matrix set {M^(1), ..., M^(h)}.
S242. Merge the Boolean factor matrices {M_2^(1), ..., M_2^(h)} of sub-tensor sT_2 with the corresponding matrices of the merged set {M^(1), ..., M^(h)}, and so on, until the Boolean factor matrices {M_k^(1), ..., M_k^(h)} of sub-tensor sT_k are merged with the corresponding matrices of the set, yielding the Boolean CP decomposition matrices {M_all^(1), ..., M_all^(h)} of the new tensor T_all.
Preferably, the merging of the Boolean factor matrices comprises the following steps:
(1) Compute tensors V and U, where v_x is the x-th row of the sub-tensor factor matrix, u_x is the row of the old factor matrix at the corresponding sampled index, V is the tensor reconstructed from v_x together with the other factor matrices, and U is the tensor reconstructed from u_x together with the other factor matrices.
(2) Compute the reconstruction errors ε_1 and ε_2 of tensors V and U against the old tensor slice:
ε_1 = ||V - T_x||
ε_2 = ||U - T_x||
where T_x is the slice tensor of the corresponding index row, and || · || denotes the tensor 1-norm, i.e., the number of non-zero entries of a Boolean tensor.
(3) Check whether ε_1 < ε_2 holds. If it does, the updated row reduces the overall reconstruction error, so row u_x of the original tensor factor matrix is updated with v_x; otherwise no update is made.
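A sketch of this row-wise merge rule is given below, written for the first mode only (other modes are analogous after permuting axes) and reusing boolean_reconstruct from the sketch above; sample_idx is an assumed bookkeeping structure mapping sampled rows back to rows of the old factor matrix:

```python
import numpy as np

def merge_first_mode(old_factors, sub_factors, sample_idx, T_old):
    """Sketch of merge steps (1)-(3): keep the sub-tensor's row v_x only if
    it reconstructs the corresponding slice of T_old with a smaller error
    (the Boolean tensor 1-norm counts mismatched cells)."""
    others = old_factors[1:]
    for x, ox in enumerate(sample_idx):
        vx = sub_factors[0][x]       # candidate row v_x from the sub-tensor
        ux = old_factors[0][ox]      # current row u_x of the old factor
        Tx = T_old[ox]               # slice tensor T_x of this index row

        def slice_err(row):
            # Reconstruct the slice from `row` plus the other factor matrices,
            # then count differing cells: ||V - T_x|| resp. ||U - T_x||.
            rec = boolean_reconstruct([row[None, :]] + others)[0]
            return np.sum(rec ^ Tx)

        if slice_err(vx) < slice_err(ux):   # ε_1 < ε_2: the new row wins
            old_factors[0][ox] = vx
    return old_factors
```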
S3. Cluster the node matrix N_1 or N_2 to obtain the cluster centers and the cluster of each node.
Preferably, the clustering method in step S3 is K-Means, with the Hamming distance as the distance metric.
The method comprises the following steps:
S31. Take the set of row vectors {n_1, n_2, ..., n_l} of the node Boolean factor matrix N, where l is the number of rows of matrix N, i.e., the number of nodes in the graph.
S32. Choose any r of these vectors as the initial cluster centers, where r denotes the number of cluster centers, i.e., the number of supernodes generated in the final graph summary.
In this embodiment, r (the K of K-Means) is initialized to 100.
S33. Compute the Hamming distance from each remaining node to every cluster center, and assign the node to the nearest cluster.
S34. Update each cluster center with the element-wise rounded mean of all vectors in that cluster, completing one round of iteration.
S35. When the number of iterations reaches the specified value, output the cluster to which each point belongs.
In this embodiment, the number of iterations is specified to be 10.
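A compact sketch of steps S31 to S35 follows, assuming the rows of the node Boolean factor matrix are given as a dense NumPy Boolean array; keeping the centers Boolean via the rounded mean follows the reading of S34 above:

```python
import numpy as np

def hamming_kmeans(rows, r=100, iters=10, rng=None):
    """K-Means over Boolean row vectors under the Hamming distance."""
    rng = np.random.default_rng(rng)
    rows = rows.astype(bool)
    centers = rows[rng.choice(len(rows), size=r, replace=False)]  # S32
    labels = np.zeros(len(rows), dtype=int)
    for _ in range(iters):                                        # S35
        # S33: Hamming distance of every row to every center.
        dist = (rows[:, None, :] ^ centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(r):
            members = rows[labels == c]
            if len(members):
                # S34: rounded element-wise mean keeps the center Boolean.
                centers[c] = members.mean(axis=0) >= 0.5
    return centers, labels
```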
S4. Regard the cluster centers as the supernodes of the graph summary, and compute the superedge weights between the supernodes to obtain the complete graph summary.
Preferably, step S4 comprises the following steps:
S41. Compute the superedge weight between the supernodes in the graph summary according to the graph-node adjacency similarity formula (given only as an image in the original).
S42. Compute the reconstruction error of the graph summary according to the Euclidean distance between tensors (formula likewise given only as an image in the original).
Here S_i and S_j are the cluster centers computed by the clustering algorithm, l and m are the numbers of nodes in S_i and S_j respectively, L is the length of the Boolean tensor T_all in the time dimension, N is the number of nodes of T_all, σ(S_i) is the number of points contained in S_i, and | · | denotes the absolute value operator.
S43. Check whether the reconstruction error meets the set threshold. If it does, take the clusters as the nodes of the graph summary and the superedge weights as the weights of its edges; otherwise, change the number of cluster centers and return to step S3.
In this embodiment, the reconstruction-error threshold is set to 1000. If the reconstruction error does not meet the threshold, the number of cluster centers is increased.
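Since the weight formula survives only as an image, the sketch below implements one plausible instantiation of the adjacency-similarity weight, the cross-cluster edge density of T_all averaged over the time dimension; it assumes a node × node × time Boolean tensor and is an assumption, not the patent's exact formula:

```python
import numpy as np

def superedge_weights(T_all, labels, r):
    """One assumed instantiation of S41: density of edges between two
    clusters in the node x node x time Boolean tensor T_all."""
    L = T_all.shape[-1]                        # length of the time dimension
    W = np.zeros((r, r))
    for i in range(r):
        for j in range(r):
            Si = np.flatnonzero(labels == i)   # members of supernode S_i
            Sj = np.flatnonzero(labels == j)   # members of supernode S_j
            if len(Si) and len(Sj):
                block = T_all[np.ix_(Si, Sj)]  # cross-cluster adjacency
                W[i, j] = block.sum() / (len(Si) * len(Sj) * L)
    return W
```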
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A social network graph summary generation method based on incremental computation, characterized by comprising the following steps:
S1. Represent the social network graph within the target time period as a tensor to obtain the target Boolean tensor T_G;
S2. Perform tensor decomposition on the target Boolean tensor T_G to obtain the decomposed node matrices N_1 and N_2;
S3. Cluster the node matrix N_1 or N_2 to obtain the cluster centers and the cluster of each node;
S4. Regard the cluster centers as the supernodes of the graph summary, and compute the superedge weights between supernodes to obtain the graph summary of the social network graph.
2. The method of claim 1, wherein the social network graph is a dynamic undirected graph whose snapshots correspond one-to-one to timestamps.
3. The method according to claim 1 or 2, wherein step S2 comprises the following steps:
S21. Combine the old Boolean tensor T_old and the target Boolean tensor T_G into a Boolean tensor T_all, whose last order is the time dimension; the old Boolean tensor T_old is the tensor representation of the social network graph for the previous time period;
S22. Perform biased sampling on the Boolean tensor T_all to generate k sub-tensors sT_i;
S23. Perform parallel distributed Boolean CP decomposition on each sub-tensor sT_i, computing the factor matrices {M_i^(1), ..., M_i^(h)} of each sub-tensor;
S24. Merge the Boolean factor matrices {M_i^(1), ..., M_i^(h)} of each sub-tensor sT_i with the Boolean factor matrices {M_old^(1), ..., M_old^(h)} of the old Boolean tensor T_old to obtain the Boolean CP decomposition result {M_all^(1), ..., M_all^(h)} of the new Boolean tensor T_all, where 1 ≤ i ≤ k indexes the sub-tensors and 1 ≤ j ≤ h indexes the factor matrices M^(j).
4. The method of claim 3, wherein step S22 comprises the following steps:
S221. For the h-order old Boolean tensor T_old, sum along every order except order j to obtain the weight vector d^(j) of the indices of order j;
S222. Divide d^(j) by the number of non-zero elements in T_old to obtain the sampling probability vector P^(j) of the indices of each order;
S223. Compute the size L_j of the sampling index set for each order of T_old according to the set sampling factor;
S224. Sample the indices of the j-th order of T_old L_j times according to the sampling probability P^(j) to obtain the sampled index set V_j;
S225. Combine the sampled index sets {V_1, V_2, ..., V_h} with the index set of the target Boolean tensor T_G to obtain {V_1, V_2, ..., V_h ∪ V_new}, where V_new denotes the time-dimension index of T_G;
S226. Extract the sampled sub-tensor according to the index set {V_1, V_2, ..., V_h ∪ V_new};
S227. Repeat steps S221 to S226 until k sub-tensors are generated.
5. The method of claim 3, wherein step S23 comprises the following steps:
S231. Initialize the factor matrices {M_i^(1), ..., M_i^(h)} of sub-tensor sT_i Y times, each time as Boolean matrices whose entries are non-zero with probability p, and take the factor matrices with the smallest reconstruction error as the final initialization;
S232. Perform h optimization sub-steps: in each sub-step, fix (h-1) of the factor matrices and optimize the remaining one so that the overall reconstruction error is minimized; completing all h sub-steps completes one iteration;
S233. Repeat step S232 until the number of iteration rounds reaches k or the iteration error is smaller than e, and return the Boolean factor matrices {M_i^(1), ..., M_i^(h)}.
6. The method of claim 3, wherein step S24 comprises the following steps:
S241. Merge the Boolean factor matrices {M_1^(1), ..., M_1^(h)} of sub-tensor sT_1 with the Boolean factor matrices {M_old^(1), ..., M_old^(h)} of the old Boolean tensor T_old to obtain the merged Boolean factor matrix set {M^(1), ..., M^(h)};
S242. Merge the Boolean factor matrices {M_2^(1), ..., M_2^(h)} of sub-tensor sT_2 with the corresponding matrices of the merged set {M^(1), ..., M^(h)}, and so on, until the Boolean factor matrices {M_k^(1), ..., M_k^(h)} of sub-tensor sT_k are merged with the corresponding matrices of the set, yielding the Boolean CP decomposition matrices {M_all^(1), ..., M_all^(h)} of the new tensor T_all.
7. The method of claim 6, wherein the merging of the Boolean factor matrices comprises the following steps:
(1) Compute tensors V and U, where v_x is the x-th row of the sub-tensor factor matrix, u_x is the row of the old factor matrix at the corresponding sampled index, V is the tensor reconstructed from v_x together with the other factor matrices, and U is the tensor reconstructed from u_x together with the other factor matrices;
(2) Compute the reconstruction errors ε_1 and ε_2 of tensors V and U against the old tensor factor matrix:
ε_1 = ||V - T_x||
ε_2 = ||U - T_x||
where T_x is the slice tensor of the corresponding index row;
(3) Check whether ε_1 < ε_2 holds; if it does, update row u_x of the original tensor factor matrix with v_x, otherwise do not update.
8. The method of claim 1, wherein in step S3 the Hamming distance is selected as the distance metric, the number r of cluster centers is set, and K-Means clustering is used to obtain the cluster centers S_i, i = 1, ..., r, and the cluster to which each node belongs.
9. The method according to any one of claims 3 to 7, wherein step S4 comprises the following steps:
S41. Compute the superedge weight between the supernodes in the graph summary (the weight formula is given only as an image in the original), where S_i and S_j are the cluster centers computed by the clustering algorithm, l and m are the numbers of nodes in S_i and S_j respectively, L is the length of the Boolean tensor T_all in the time dimension, N is the number of nodes of T_all, and σ(S_i) is the number of points contained in S_i;
S42. Compute the reconstruction error of the graph summary (the error formula is likewise given only as an image in the original);
S43. Check whether the reconstruction error meets the set threshold; if so, take the clusters as the nodes of the graph summary and the superedge weights as the weights of its edges; otherwise, change the number of cluster centers and return to step S3.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the social network graph summary generation method based on incremental computation according to any one of claims 1 to 9.
CN201911373671.1A 2019-12-26 2019-12-26 Tensor-computation-based social network graph summary generation method Active CN111159483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911373671.1A CN111159483B (en) 2019-12-26 2019-12-26 Tensor-computation-based social network graph summary generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911373671.1A CN111159483B (en) 2019-12-26 2019-12-26 Tensor-computation-based social network graph summary generation method

Publications (2)

Publication Number Publication Date
CN111159483A true CN111159483A (en) 2020-05-15
CN111159483B CN111159483B (en) 2023-07-04

Family

ID=70558533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911373671.1A Active CN111159483B (en) 2019-12-26 Tensor-computation-based social network graph summary generation method

Country Status (1)

Country Link
CN (1) CN111159483B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881191A (en) * 2020-08-05 2020-11-03 厦门力含信息技术服务有限公司 Client portrait key feature mining system and method under mobile internet
CN112507245A (en) * 2020-12-03 2021-03-16 中国人民大学 Social network friend recommendation method based on graph neural network
CN113139098A (en) * 2021-03-23 2021-07-20 中国科学院计算技术研究所 Abstract extraction method and system for big homogeneous relation graph
CN113157981A (en) * 2021-03-26 2021-07-23 支付宝(杭州)信息技术有限公司 Graph network relation diffusion method and device
CN112287118B (en) * 2020-10-30 2023-06-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Event mode frequent subgraph mining and prediction method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312797A1 (en) * 2009-06-05 2010-12-09 Xerox Corporation Hybrid tensor-based cluster analysis
CN107545509A (en) * 2017-07-17 2018-01-05 西安电子科技大学 A kind of group dividing method of more relation social networks
CN107656928A (en) * 2016-07-25 2018-02-02 长沙有干货网络技术有限公司 The method that a kind of isomery social networks of user clustering is recommended
CN107767280A (en) * 2017-10-16 2018-03-06 湖北文理学院 A kind of high-quality node detecting method based on element of time
US20180204117A1 (en) * 2017-01-19 2018-07-19 Google Inc. Dynamic-length stateful tensor array
US20180349477A1 (en) * 2017-06-06 2018-12-06 Facebook, Inc. Tensor-Based Deep Relevance Model for Search on Online Social Networks
CN109697467A (en) * 2018-12-24 2019-04-30 宁波大学 A kind of summarization methods of complex network figure

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312797A1 (en) * 2009-06-05 2010-12-09 Xerox Corporation Hybrid tensor-based cluster analysis
CN107656928A (en) * 2016-07-25 2018-02-02 长沙有干货网络技术有限公司 The method that a kind of isomery social networks of user clustering is recommended
US20180204117A1 (en) * 2017-01-19 2018-07-19 Google Inc. Dynamic-length stateful tensor array
US20180349477A1 (en) * 2017-06-06 2018-12-06 Facebook, Inc. Tensor-Based Deep Relevance Model for Search on Online Social Networks
CN107545509A (en) * 2017-07-17 2018-01-05 西安电子科技大学 A kind of group dividing method of more relation social networks
CN107767280A (en) * 2017-10-16 2018-03-06 湖北文理学院 A kind of high-quality node detecting method based on element of time
CN109697467A (en) * 2018-12-24 2019-04-30 宁波大学 A kind of summarization methods of complex network figure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PAULI MIETTINEN: "Walk'n'Merge: A Scalable Algorithm for Boolean Tensor Factorization", IEEE *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881191A (en) * 2020-08-05 2020-11-03 厦门力含信息技术服务有限公司 Client portrait key feature mining system and method under mobile internet
CN111881191B (en) * 2020-08-05 2021-06-11 留洋汇(厦门)金融技术服务有限公司 Client portrait key feature mining system and method under mobile internet
CN112287118B (en) * 2020-10-30 2023-06-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Event mode frequent subgraph mining and prediction method
CN112507245A (en) * 2020-12-03 2021-03-16 中国人民大学 Social network friend recommendation method based on graph neural network
CN113139098A (en) * 2021-03-23 2021-07-20 中国科学院计算技术研究所 Abstract extraction method and system for big homogeneous relation graph
CN113139098B (en) * 2021-03-23 2023-12-12 中国科学院计算技术研究所 Abstract extraction method and system for homogeneity relation large graph
CN113157981A (en) * 2021-03-26 2021-07-23 支付宝(杭州)信息技术有限公司 Graph network relation diffusion method and device
CN113157981B (en) * 2021-03-26 2022-12-13 支付宝(杭州)信息技术有限公司 Graph network relation diffusion method and device

Also Published As

Publication number Publication date
CN111159483B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN111159483B (en) Tensor-computation-based social network graph summary generation method
Saldana et al. How many communities are there?
Zhu et al. Differential privacy and applications
Wyse et al. Inferring structure in bipartite networks using the latent blockmodel and exact ICL
CN112182245B (en) Knowledge graph embedded model training method and system and electronic equipment
Thirumuruganathan et al. Approximate query processing for data exploration using deep generative models
Wang et al. Multiway clustering via tensor block models
Wang et al. A united approach to learning sparse attributed network embedding
Yu et al. Zinb-based graph embedding autoencoder for single-cell rna-seq interpretations
Kacem et al. MapReduce-based k-prototypes clustering method for big data
Huang et al. Spectral clustering via adaptive layer aggregation for multi-layer networks
Li et al. Greedy optimization for K-means-based consensus clustering
Pu et al. Stochastic mirror descent for low-rank tensor decomposition under non-euclidean losses
CN110717043A (en) Academic team construction method based on network representation learning training
Dempsey et al. Hierarchical network models for exchangeable structured interaction processes
Salem et al. Clustering categorical data using the k-means algorithm and the attribute’s relative frequency
CN108154380A (en) The method for carrying out the online real-time recommendation of commodity to user based on extensive score data
Lu et al. An improved k-means distributed clustering algorithm based on spark parallel computing framework
Zhang et al. Node-level community detection within edge exchangeable models for interaction processes
CN117059284A (en) Diabetes parallel attribute reduction method based on co-evolution discrete particle swarm optimization
Ng et al. Inference and sampling for archimax copulas
Zhang et al. Perturbation analysis of randomized svd and its applications to high-dimensional statistics
US20240104387A1 (en) Learning logical rules over graph structured data using message passing
Li et al. An alternating nonmonotone projected Barzilai–Borwein algorithm of nonnegative factorization of big matrices
Chen et al. A hybrid tensor factorization approach for QoS prediction in time-aware mobile edge computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant