CN116595479A - Community discovery method, system, equipment and medium based on graph double self-encoder - Google Patents


Info

Publication number
CN116595479A
CN116595479A (application CN202310498705.XA)
Authority
CN
China
Prior art keywords
graph
encoder
information
reconstruction
representation information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310498705.XA
Other languages
Chinese (zh)
Inventor
李明娇
储星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202310498705.XA priority Critical patent/CN116595479A/en
Publication of CN116595479A publication Critical patent/CN116595479A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/25 — Fusion techniques
    • G06F18/23 — Clustering techniques
    • G06F18/232 — Non-hierarchical techniques
    • G06F18/2321 — Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 — Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods
    • G06N3/084 — Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a community discovery method, system, device and medium based on a graph dual self-encoder, relating to the technical field of community discovery. The method comprises: inputting a given citation network into a graph dual self-encoder to obtain graph structure representation information and graph attribute representation information; fusing the graph structure representation information and the graph attribute representation information to obtain fused graph representation information; and performing community division on the fused graph representation information with a clustering method to obtain a community discovery result. The method and device improve the accuracy of community division in the citation network.

Description

Community discovery method, system, equipment and medium based on graph double self-encoder
Technical Field
The invention relates to the technical field of community discovery, in particular to a community discovery method, system, device and medium based on a graph dual self-encoder.
Background
Citation analysis refers to the analysis of the citing and cited relationships of objects such as scientific journals, papers and authors, in order to reveal their quantitative characteristics and intrinsic laws. Citation analysis is very useful for topic selection, identifying research hotspots and trends in a field, searching for high-impact scientists, document backtracking, and so on. The most common citation analysis tools are Web of Science, Scopus and Google Scholar; however, these three tools are generally only useful for computing statistics such as journal impact factors, the number of times an article is cited, or the citations of a given author. They are of limited help to ordinary researchers, especially for understanding subject content; a good citation analysis method can find important documents from the perspective of document citation and explore the flow of scientific knowledge. The graph is a general data structure for exploring and modeling complex systems in the real world and, as an important medium for entity relationship interaction, is one of the hot spots of current research. A complex network is typically represented by a graph with a set of nodes (vertices) and connections (edges). A citation network is a complex network in which each node represents a document; an edge between two nodes indicates a citing or cited relationship between the two documents, and the absence of an edge indicates no such relationship. Community discovery refers to finding community structures with similar characteristics in a network graph, so as to understand their topological structure and attribute information, and thereby serve tasks such as classification and prediction in the real world. Community discovery has important practical significance and has been widely studied and applied to many real network problems.
Exploring the community structure of a citation network benefits the citation analysis process and is of great significance for discovering important documents and exploring the flow of scientific knowledge; a good community discovery method therefore greatly promotes the development of the citation analysis field.
With the advent of complex networks, a network not only has a large number of nodes but also various node features carrying important attribute information. This poses a challenge to conventional community discovery methods, which basically process only the structure information of the graph and do not fully exploit the content in the attribute information. These methods have achieved good results on networks without node features, but for today's large network datasets (e.g. citation networks), how to simultaneously use the network structure information and the node attribute information to detect community structure in a complex network is still an emerging research task. The graph neural network is the application and innovation of traditional deep learning methods on graph-structured data and is used to extract feature representations in the graph; the emergence of this technique remedies the shortcomings of traditional methods. As an artificial neural network for unsupervised learning, the autoencoder (AE) is widely used in feature extraction, and its success in the field of image processing has led researchers to apply autoencoders to community discovery.
Most recent community discovery methods based on graph self-encoders adopt structure reconstruction; the few encoders that adopt feature reconstruction still use a generic architecture, feature reconstruction without corruption may be unreliable, and the designed models are not robust enough, so existing graph self-encoders still have considerable room for improvement on community discovery problems. In recent years, a number of graph dual self-encoder models that reconstruct both the structure and the attribute features have appeared in the field of community discovery; these algorithms have good graph representation learning ability and community division performance, and the idea of using a graph dual self-encoder for community division has shown its potential.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) The network topology and the node attribute features cannot be considered at the same time. Traditional community discovery methods mainly comprise statistical inference methods and machine learning methods; these methods are all based on the structural features of the network and divide communities only by considering the edge relationships between nodes, ignoring the node features, so the resulting communities lack semantics. Other classical methods, such as K-Means, use only node attributes for community discovery and ignore the relationships between nodes, i.e. the structural features of the network.
(2) Most recent community discovery methods based on graph self-encoders adopt a single structure reconstruction or feature reconstruction mode, and the learned graph representations are insufficient. Most of these methods use structure reconstruction, so the structural information is over-weighted; the small number of encoders that adopt feature reconstruction still use a generic architecture, feature reconstruction without corruption may be unreliable, and the designed models suffer from weak robustness.
(3) Most objects in a graph are feature vectors carrying little semantic information; the multi-layer perceptron commonly used as the decoder in graph self-encoders may not bridge the gap between the encoder representation and the decoder target, so the graph features cannot be captured well, and the poor graph representation information is unfavorable for the subsequent community division.
(4) The mean square error (MSE) employed by current feature-reconstruction self-encoders for reconstruction loss computation is affected by the various feature vector norms and dimensions, with the risk of model instability.
The above problems limit the development of community discovery methods and further limit the progress of citation analysis technology.
Disclosure of Invention
The invention aims to provide a community discovery method, system, device and medium based on a graph dual self-encoder, which improve the accuracy of community division in a citation network.
In order to achieve the above object, the present invention provides the following solutions:
a community discovery method based on a graph dual self-encoder, comprising:
inputting a given citation network into a graph dual self-encoder to obtain graph structure representation information and graph attribute representation information;
fusing the graph structure representation information and the graph attribute representation information to obtain fused graph representation information;
and performing community division on the fused graph representation information with a clustering method to obtain a community discovery result.
Optionally, the graph dual self-encoder comprises a first encoder and a second encoder;
the first encoder is used to output, for each node, the feature obtained after fusing neighborhood information, according to the attention coefficients between the node and its neighbor nodes and the node features of the neighbor nodes, to obtain the graph structure representation information; the nodes are nodes in the citation network;
the second encoder is used to sample the nodes in the citation network with a random sampling strategy to obtain a sampling set, mask the features of the nodes in the sampling set with a first mask token, and learn graph information from both the nodes masked with the first mask token and the unmasked nodes, to obtain the graph attribute representation information.
Optionally, the graph autoencoder for structure reconstruction employs a graph attention network, and the graph autoencoder for feature reconstruction employs a graph neural network.
Optionally, the community discovery method based on the graph dual self-encoder further comprises training the graph dual self-encoder; the loss function used for training the graph dual self-encoder comprises a structural reconstruction loss, a feature reconstruction loss and a clustering loss.
Optionally, the graph dual self-encoder further comprises a first decoder and a second decoder, wherein the first decoder is the decoder of the graph autoencoder for structure reconstruction, and the second decoder is the decoder of the graph autoencoder for feature reconstruction;
the first decoder is used to perform an inner product operation on the graph structure representation information to obtain a reconstructed adjacency matrix;
the second decoder is configured to:
re-mask, with a second mask token, the nodes that were masked with the first mask token;
and, for each re-masked node, reconstruct its features with a graph neural network based on its neighbor nodes, to obtain a reconstructed feature matrix;
the structural reconstruction loss is expressed as:
L_S = −(1/N²) · Σ_{i=1..N} Σ_{j=1..N} [A_ij · log(Â_ij) + (1 − A_ij) · log(1 − Â_ij)]
wherein A_ij denotes the elements of the adjacency matrix of the initial graph of the citation network, N is the number of nodes in the citation network, and Â_ij denotes the elements of the reconstructed adjacency matrix;
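The inner-product decoder and the structural reconstruction loss above can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: it assumes the decoder applies a sigmoid to Z·Zᵀ and that the loss is the mean binary cross-entropy over all node pairs; all function names are chosen for illustration only.

```python
import numpy as np

def inner_product_decoder(Z):
    """Reconstruct the adjacency matrix as A_hat = sigmoid(Z @ Z.T) from embeddings Z."""
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))

def structure_loss(A, A_hat, eps=1e-9):
    """Mean binary cross-entropy between the original and reconstructed adjacency."""
    N = A.shape[0]
    bce = -(A * np.log(A_hat + eps) + (1 - A) * np.log(1 - A_hat + eps))
    return bce.sum() / (N * N)

# Toy example: 4 nodes with 2-dimensional embeddings.
rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 2))
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = inner_product_decoder(Z)
L_S = structure_loss(A, A_hat)
```

Because the inner product Z·Zᵀ is symmetric, the reconstructed adjacency is symmetric as well, matching an undirected citation graph.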
the feature reconstruction loss is expressed as:
L_X = (1/|Ṽ|) · Σ_{v_i ∈ Ṽ} (1 − (x_i^T · z_i)/(‖x_i‖ · ‖z_i‖))^γ, γ ≥ 1
wherein x_i represents the original features of node i in the citation network, z_i is the feature of node i after feature reconstruction, γ represents the scale factor, Ṽ represents the node set obtained by sampling the nodes in the citation network with a random sampling strategy, and T represents transposition.
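The feature reconstruction loss above can be sketched in NumPy as a scaled cosine error averaged over the sampled node set. A minimal sketch under the assumption that γ = 2 and the sampled indices are given; all names are illustrative.

```python
import numpy as np

def scaled_cosine_error(X, Z, mask_idx, gamma=2.0):
    """Scaled cosine error over the sampled node set:
    mean of (1 - cos(x_i, z_i)) ** gamma for v_i in the sampled set."""
    x, z = X[mask_idx], Z[mask_idx]
    cos = np.sum(x * z, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(z, axis=1))
    cos = np.clip(cos, -1.0, 1.0)   # guard against tiny floating-point overshoot
    return np.mean((1.0 - cos) ** gamma)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))               # original node features x_i
Z = X + 0.1 * rng.normal(size=(6, 4))     # imperfect reconstructed features z_i
mask_idx = np.array([0, 2, 5])            # indices of the sampled (masked) nodes
L_X = scaled_cosine_error(X, Z, mask_idx, gamma=2.0)
perfect = scaled_cosine_error(X, X, mask_idx, gamma=2.0)  # exact reconstruction
```

Raising the cosine error to the power γ down-weights easy (nearly aligned) samples, which is the separability property the mean square error lacks.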
Optionally, the loss function is expressed as:
Loss = L_X + L_S + ε · L_clu
wherein Loss represents the value of the loss function, L_X represents the feature reconstruction loss, L_S represents the structural reconstruction loss, L_clu represents the clustering loss, and ε represents the first hyperparameter.
Optionally, the fused graph representation information is expressed as: C = (1 − α) · Z + α · H;
wherein C represents the fused graph representation information, α represents the second hyperparameter, Z represents the graph structure representation information, and H represents the graph attribute representation information.
The invention also discloses a community discovery system based on the graph dual self-encoder, comprising:
a graph information representation module, used to input a given citation network into the graph dual self-encoder to obtain graph structure representation information and graph attribute representation information;
an information fusion module, used to fuse the graph structure representation information and the graph attribute representation information to obtain fused graph representation information;
and a clustering module, used to perform community division on the fused graph representation information with a clustering method to obtain a community discovery result.
The invention also discloses an electronic device, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the community discovery method based on the graph double self-encoder.
The invention also discloses a computer readable storage medium storing a computer program which when executed by a processor implements the graph-based dual self-encoder community discovery method.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention fuses the graph structure representation information and the graph attribute representation information and performs community division based on the fused graph representation information, so the graph attribute information and the structure information are fully mined, thereby improving the effect of community division in the citation network and optimizing the exploration of the citation network structure.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a community discovery method based on a graph dual self-encoder according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of community discovery provided in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an automatic encoder architecture employing feature reconstruction according to an embodiment of the present invention;
FIG. 4 is a schematic illustration of an MLP model provided by an embodiment of the invention;
FIG. 5 is a schematic diagram of reconstruction loss calculation according to an embodiment of the present invention;
FIG. 6 is a detailed schematic diagram of the dual self-encoder provided in an embodiment of the present invention;
FIG. 7 is a schematic diagram of a K-Means algorithm process provided by an embodiment of the present invention;
FIG. 8 is a diagram of dual self-encoder loss provided by an embodiment of the present invention;
FIG. 9 is a diagram showing an example of the structure of the drawings provided by the embodiment of the present invention;
FIG. 10 is a schematic diagram of an exemplary messaging process in the diagram structure provided by an embodiment of the present invention;
FIG. 11 is a schematic diagram of an automatic encoder of the type provided in an embodiment of the present invention;
FIG. 12 is a schematic diagram of a community discovery system result based on a graph dual self-encoder according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a community discovery method, system, device and medium based on a graph dual self-encoder, which improve the accuracy of community division in a citation network.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 2, the community discovery schematic diagram provided by an embodiment of the present invention vividly explains how to explore the structure of a citation network, the content body vital to the citation analysis process, so that the process is more consistent with the information represented by the graph itself and the division quality is higher. It abandons the traditional approach of exploring communities with only the graph topology or only the node attribute information, follows the wave of deep learning, and makes full use of the graph autoencoder as a tool to study community discovery. FIG. 2 shows a graph structure for community discovery, whose aim is to cluster graph nodes with related information; fig. 2 clusters the nodes into two communities C_1 and C_2 based on an occupational criterion, where community C_1 includes nodes 1, 2, 3 and 4, and community C_2 includes nodes 5, 6 and 7. If the structure is a citation network, articles with related citations are aggregated into the same community.
Among the community discovery methods based on graph autoencoders in recent years, most adopt structure reconstruction and a small number adopt node feature reconstruction. As shown in fig. 3, the common architecture diagram of a feature-reconstruction graph autoencoder provided by an embodiment of the present invention describes the operating mechanism of a traditional encoder: a traditional feature-reconstruction graph autoencoder simply relies on minimizing the error between the input and the reconstructed signal to obtain a hidden-layer feature representation of the input. This training strategy cannot guarantee that the essential features of the data are extracted, and simply relying on minimizing the reconstruction error may cause the features learned by the encoder to be mere copies of the original input, so the learned graph representation information is not ideal. In addition, feature reconstruction without corruption may make the architecture unreliable, causing the designed model to have weak robustness.
When designing an autoencoder, the decoder is generally chosen depending on the semantic level of the target X: if X contains more semantic information (e.g. X is a one-hot matrix), a relatively simple model such as a multi-layer perceptron (MLP) can be chosen as the decoder; the less semantic information X contains, the more complex the required decoder. In the feature-reconstruction graph autoencoders that have appeared in recent years, a simple MLP is usually chosen as the decoder. As shown in fig. 4, the MLP model provided by an embodiment of the invention sketches the structure of a multi-layer perceptron: it introduces one or more hidden layers on the basis of a single-layer neural network, with the hidden layers located between the input layer and the output layer. In fig. 4, the decoder reconstructs multi-dimensional node features carrying relatively little semantic information; the multi-layer perceptron has poor expressive power and cannot close the gap between the encoder representation and the decoder target, so the learned hidden-layer representation H tends to be almost identical to the input feature X, which is unfavorable for the subsequent community division.
As shown in fig. 5, the reconstruction loss calculation diagram provided by an embodiment of the present invention shows a simple reconstruction process and the loss calculation body. Feature-reconstruction graph autoencoders often use the mean square error (MSE) for the reconstruction loss. Since node features are multidimensional and continuous, using the conventional MSE as the criterion for feature reconstruction is not appropriate. In particular, experiments have found that the MSE loss can be minimized to near zero, which is insufficient for feature reconstruction; in addition, the MSE has problems of sensitivity and low separability. Sensitivity means that the MSE is sensitive to vector norms and dimensions, and extreme values in certain feature dimensions may cause the MSE to overfit them. Low separability means that the MSE is not separable enough to place the model's focus on the more difficult samples.
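The norm sensitivity of the MSE described above can be illustrated with a small NumPy experiment (toy data, illustrative names): scaling both the original and reconstructed vectors leaves the cosine error unchanged but inflates the MSE quadratically.

```python
import numpy as np

def mse(x, z):
    """Mean square error between an original and a reconstructed vector."""
    return np.mean((x - z) ** 2)

def cosine_error(x, z):
    """1 - cosine similarity; invariant to rescaling both vectors."""
    return 1.0 - np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))

x = np.array([1.0, 2.0, 3.0])
z = np.array([1.1, 1.9, 3.2])   # small reconstruction error

# Scaling both vectors by 100 leaves the direction (and cosine error)
# unchanged, but inflates the MSE by 100**2 -- the norm sensitivity above.
mse_small, mse_big = mse(x, z), mse(100 * x, 100 * z)
cos_small, cos_big = cosine_error(x, z), cosine_error(100 * x, 100 * z)
```

This is why a normalized criterion such as the cosine-based loss is more stable across feature dimensions with very different magnitudes.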
To solve the problems in the above community discovery methods, the invention designs a graph dual self-encoder combining structure reconstruction and feature reconstruction with a masking strategy to perform community discovery, fully using both the topology information and the attribute information of the graph, and alleviating problems such as imprecise hidden-layer vector learning and weak model stability by corrupting the node features before reconstructing them. In addition, when designing the graph autoencoder for feature reconstruction, the problems that the multi-layer perceptron is not expressive enough as a decoder and that the mean square error is unsuitable as a feature reconstruction criterion are fully considered, further improving the model.
Example 1
As shown in fig. 1, the present embodiment provides a community discovery method based on a graph dual self-encoder, which includes the following steps.
Step 101: inputting the given citation network into the graph dual self-encoder to obtain graph structure representation information and graph attribute representation information.
The given citation network is built from the dataset on which community discovery is to be performed.
In step 101, the graph dual self-encoder is a trained model. The structure of the graph dual self-encoder of the present invention is shown in fig. 6.
Given the global information of a citation network, let G = (V, A, X) denote the graph of the given citation network, where V is the set of nodes, N = |V| is the number of nodes, A ∈ {0,1}^(N×N) is the adjacency matrix of the graph with elements A_ij, and X ∈ R^(N×d) is the feature matrix of the nodes in the graph, where x_i represents the i-th sample, N is the number of samples, and d represents the first feature dimension.
The graph dual self-encoder includes a first encoder and a second encoder: the first encoder is the encoder of the graph autoencoder for structure reconstruction and employs a graph attention network (Graph Attention Network, GAT); the second encoder is the encoder of the graph autoencoder for feature reconstruction and employs a graph neural network (Graph Neural Network, GNN).
The first encoder is used to output, for each node, the feature obtained after fusing neighborhood information, according to the attention coefficients between the node and its neighbor nodes and the node features of the neighbor nodes, to obtain the graph structure representation information; the nodes are nodes in the citation network.
The specific workflow of the first encoder includes:
For any node i, let N_i denote the set of neighbor nodes of node i on the graph; the correlation coefficients e_ij between node i and its neighbors (j ∈ N_i) are computed one by one:
e_ij = a([W·x_i ‖ W·x_j]), j ∈ N_i
wherein x_i is the node feature; the correlation between node i and node j is computed through a learnable parameter W and a mapping function a(·): first, a linear mapping with the shared parameter W increases the dimension of the node features, a common feature augmentation method; [·‖·] concatenates the transformed features of node i and node j; finally, a(·) maps the concatenated high-dimensional features to a real number.
The correlation coefficients e_ij are normalized to obtain the attention coefficients α_ij:
α_ij = exp(LeakyReLU(e_ij)) / Σ_{k∈N_i} exp(LeakyReLU(e_ik))
According to the attention coefficients α_ij, the features are weighted and summed (aggregated) to obtain the new feature of each vertex i, with fused neighborhood information:
z'_i = σ(Σ_{j∈N_i} α_ij·W·x_j)
wherein σ(·) is the activation function. Z = [z'_i] is taken as the hidden-layer representation finally obtained by the graph autoencoder for structure reconstruction; Z gathers the new features output by the GAT for all nodes, i.e. the graph structure representation information.
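The first encoder's workflow above can be sketched as a single-head attention layer in NumPy. This is an illustrative sketch, not the patented implementation: the activation σ(·) is taken as tanh and the LeakyReLU slope as 0.2, both assumptions, and all names are chosen for illustration.

```python
import numpy as np

def gat_layer(X, A, W, a, leaky_slope=0.2):
    """Single-head graph-attention layer:
    e_ij = a^T [W x_i || W x_j], alpha = softmax over neighbors N_i,
    z_i = sigma(sum_j alpha_ij * W x_j)."""
    H = X @ W                                    # (N, d') transformed features W x_j
    N = X.shape[0]
    Z = np.zeros_like(H)
    for i in range(N):
        nbrs = np.flatnonzero(A[i])              # neighbor set N_i
        e = np.array([a @ np.concatenate([H[i], H[j]]) for j in nbrs])
        e = np.where(e > 0, e, leaky_slope * e)  # LeakyReLU
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                     # softmax over N_i
        Z[i] = np.tanh(alpha @ H[nbrs])          # sigma(.) taken as tanh here
    return Z

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                      # 4 nodes, 3-dimensional features
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = rng.normal(size=(3, 2))                      # shared linear mapping
a = rng.normal(size=(4,))                        # attention vector over concat dim 2*2
Z = gat_layer(X, A, W, a)                        # graph structure representation
```
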
The second encoder is used to sample the nodes in the citation network with a random sampling strategy to obtain a sampling set, mask the features of the nodes in the sampling set with a first mask token, and learn graph information from both the masked and unmasked nodes, to obtain the graph attribute representation information.
The specific workflow of the second encoder includes:
sampling the node set with a uniform random sampling strategy to obtain a sampling set Ṽ, and masking the features of each node in the sampling set with a first MASK token (mask mark), expressed by the following formula:
x̃_i = x_[M], if v_i ∈ Ṽ; x̃_i = x_i, otherwise
wherein Ṽ is the sampling set obtained by sampling the nodes, x_[M] ∈ R^d is a learnable vector, x_i is the node feature of v_i ∈ V, and x̃_i is the mask-marked node feature of node i in the processed node set.
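The uniform random sampling and mask-token replacement above can be sketched as follows (NumPy; the zero vector stands in for the learnable token x_[M], an assumption for illustration, and the function name is hypothetical).

```python
import numpy as np

def mask_features(X, mask_ratio, x_mask, rng):
    """Uniformly sample a node subset V_tilde and replace its feature rows with
    the mask token; returns the masked feature matrix and the sampled indices."""
    N = X.shape[0]
    n_mask = max(1, int(round(mask_ratio * N)))
    idx = rng.choice(N, size=n_mask, replace=False)   # sampling set V_tilde
    X_tilde = X.copy()
    X_tilde[idx] = x_mask                             # x_i <- x_[M] for v_i in V_tilde
    return X_tilde, idx

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 4))      # node feature matrix
x_mask = np.zeros(4)              # stand-in for the learnable vector x_[M]
X_tilde, idx = mask_features(X, mask_ratio=0.3, x_mask=x_mask, rng=rng)
```

Corrupting the input in this way forces the encoder to infer masked node features from the neighborhood rather than copying the input.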
The graph information learning (coding learning) is performed in a graph automatic encoder of a reconstructed feature, and the graph automatic encoder of the reconstructed feature specifically takes a graph rolling network (Graph Convolutional Networks, GCN) as an example, and the coding learning process comprises:
H^(l) = φ( D̃^(−1/2) Ã D̃^(−1/2) H^(l−1) W^(l−1) ), l = 1, …, L,
where L is the number of GCN layers, and H^(1), H^(l−1), H^(l) are the graph representations learned by layers 1, l−1 and l respectively; the input to the first layer is X̃, the feature matrix obtained by mask marking; φ(·) is the activation function of the fully connected layer, such as a ReLU or Sigmoid function; Ã = A + I is the adjacency matrix A with self-loops added via the identity matrix I, and D̃ is its degree matrix; W^(1), W^(l−1) are the weight matrices of layers 1 and l−1 of the GCN.
The H^(L) generated in the previous step is normalized with softmax to obtain the final hidden representation of the encoder stage. The last layer of the GCN module is a multi-class layer with a softmax function; H ∈ R^(N×d_h) denotes the hidden-layer representation encoded by the GCN encoder of the reconstruction-feature graph autoencoder, and d_h denotes the second feature dimension. H is described as:
H = softmax(H^(L) W^(L)),
where H^(L) is the representation learned by the L-th GCN layer and W^(L) is the weight matrix of the L-th GCN layer.
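A minimal NumPy sketch of this encoder, assuming ReLU hidden layers and a row-wise softmax on the output; the function name and weight shapes are hypothetical:

```python
import numpy as np

def gcn_encoder(X, A, weights):
    """L-layer GCN on the corrupted features X~:
    H^(l) = ReLU(D~^-1/2 (A+I) D~^-1/2 H^(l-1) W^(l)); softmax at the end."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt         # symmetrically normalized adjacency
    H = X
    for l, W in enumerate(weights):
        H = S @ H @ W
        if l < len(weights) - 1:
            H = np.maximum(H, 0.0)              # ReLU on hidden layers
    e = np.exp(H - H.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # row-wise softmax -> hidden H
```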
Step 102: and fusing the graph structure representation information and the graph attribute representation information to obtain fused graph representation information.
The fused graph representation information is expressed as: C = (1 − α) × Z + α × H;
where C denotes the fused graph representation information, α denotes the second hyperparameter, Z denotes the graph structure representation information, and H denotes the graph attribute representation information.
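The fusion step itself is a single weighted sum; a one-line sketch (function name assumed):

```python
import numpy as np

def fuse(Z, H, alpha=0.5):
    """C = (1 - alpha) * Z + alpha * H; alpha in [0, 1] trades structure (Z)
    against attribute (H) information in the fused representation."""
    return (1.0 - alpha) * Z + alpha * H
```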
Step 103: and carrying out community division on the fused graph representation information by adopting a clustering method to obtain a community discovery result.
The clustering method adopts a K-Means clustering method.
As shown in FIG. 7, the schematic process diagram of the K-Means algorithm provided by the embodiment of the invention vividly depicts one clustering pass of K-Means. The idea of K-Means is as follows: first, K objects are randomly selected as initial cluster centers; the distance between each object and each seed cluster center is then computed, and each object is assigned to the nearest cluster center, the assigned objects and their center forming a cluster; once all objects have been assigned, each cluster's center is recomputed from its members; this is repeated until no object changes cluster, the cluster centers no longer move, or the sum of squared errors reaches a local minimum, at which point the loop stops. In FIG. 7, (a) shows the original seed distribution, and (b) to (f) show with cross symbols how the cluster centers evolve. The K-Means algorithm operates as follows:
Input: number of class clusters K, iteration termination value Z
And (3) outputting: clustering results
1:For(t=1;t<=Z;t+=1){
2: giving data object X i The method comprises the steps of carrying out a first treatment on the surface of the Data X/data i More than K
3: calculating distance dist (X) between cluster center and object i ,Center);
4: x is to be i Distance X of demarcation i The center of the nearest cluster is located in the cluster;
5: for (up to X i Not able to be allocated) {
6: updating all cluster centers
7:}
8: outputting a clustering result;
9:}
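The pseudocode above corresponds to the following plain NumPy K-Means sketch; the function name and the convergence test are assumptions:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means on the fused representation C: random initial centers,
    assign each point to its nearest center, recompute centers, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # distance from every point to every center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no object changed cluster
            break
        labels = new_labels
        for j in range(k):                       # recompute each cluster center
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

On two well-separated blobs the algorithm recovers the two groups regardless of which points seed the centers.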
the community discovery method based on the graph double self-encoder further comprises training the graph double self-encoder; losses in the loss function employed to train the dual self-encoder of the graph include structural reconstruction losses, feature reconstruction losses, and clustering losses, as shown in fig. 8.
The dual self-encoder further includes a first decoder that is a decoder of the automatic encoder of the graph of the reconstructed structure and a second decoder that is a decoder of the automatic encoder of the graph of the reconstructed feature.
The first decoder is used for performing inner product operation on the graph structure representation information to obtain a reconstructed adjacent matrix.
The reconstructed adjacency matrix is expressed as:
Â = σ(Z Z^T),
where Â is the reconstructed adjacency matrix; its entries Â_ij are compared with the entries A_ij at the corresponding positions of the original adjacency matrix to calculate the structure reconstruction loss.
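Assuming the usual sigmoid inner-product decoder (the patent states only that an inner-product operation is used), the reconstruction can be sketched as:

```python
import numpy as np

def reconstruct_adjacency(Z):
    """Inner-product decoder: A_hat = sigmoid(Z Z^T);
    each entry is an edge probability between two nodes."""
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))
```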
The second decoder is configured to: re-mask, with a second mask token, the nodes already masked by the first mask token; then, for each re-masked node, reconstruct its features with a graph neural network from its neighbor nodes, obtaining the reconstructed feature matrix.
The specific workflow of the second decoder includes: masking again, with another mask token [DMASK] (the second mask token), the nodes that were masked by the first mask token; the node feature vectors corresponding to those nodes in the output H of the reconstruction-feature graph autoencoder are set to zero, yielding H̃, the re-masked vector that serves as the second decoder's input.
As shown in fig. 9 and 10, the exemplary graph and exemplary message-passing process provided by the embodiments of the invention illustrate, in a simple graph structure, how a node listens for information from its neighbors and then updates the information and passes it onward; this process is known as message passing.
The second decoder is used to reconstruct the input features of the masked nodes from the latent representations of their unmasked neighbors, and the reconstruction follows the message-passing process. The second decoder only performs the node feature reconstruction task during the self-supervised training phase; its architecture is therefore independent of the encoder choice, and any type of GNN may be used. Given f_E as the graph encoder and f_D as the graph decoder, the overall learning process of the reconstruction-feature graph autoencoder is expressed as:
H=f E (A,X),G′=f D (A,H)。
where G' denotes the reconstructed graph, the product generated by the second decoder.
In the method of performing community discovery with the graph double self-encoder based on citation analysis, after the graph double self-encoder model is designed, a reasonable objective function must be selected to optimize the model; a well-designed objective function improves the quality of community division and thereby promotes the community discovery process.
For a citation network, the various association modes between documents, for example co-authorship, co-citation and bibliographic coupling, can be used for community division; taking co-authorship as an example, the authors of a co-authored paper are clustered into the same community. On one hand, this process raises the propagation rate and sharing effect of scientific data: community division surfaces more documents related to a given research direction, and this data-recommendation effect helps acquire more comprehensive knowledge of that direction. On the other hand, community division makes it easier to learn the flow of scientific knowledge from the data within the same community and to explore the development and evolution of a line of research.
The loss function of the graph double self-encoder consists of reconstruction loss and cluster loss. The reconstruction loss consists of feature reconstruction loss and structure reconstruction loss; the goal of training is to minimize the reconstruction error between input and output, and the quality of the vectors the model finally learns determines its effect.
To enhance the robustness of the model, the invention uses the scaled cosine error (SCE) instead of the mean square error (MSE) to calculate the reconstruction loss after feature reconstruction.
Using the cosine error as the criterion for reconstructing the original node features eliminates the influence of dimension and vector norm, and the l2 normalization in the cosine error maps vectors onto the unit hypersphere, which greatly improves the training stability of representation learning.
The feature reconstruction loss is expressed as:
L_X = (1/|Ṽ|) Σ_{v_i∈Ṽ} (1 − x_i^T z_i / (‖x_i‖ · ‖z_i‖))^γ, γ ≥ 1,
where x_i denotes the original feature of node i in the citation network, z_i is the feature of node i after feature reconstruction, γ denotes the scale factor, Ṽ denotes the node set obtained by sampling nodes in the citation network with a random sampling strategy, and T denotes transposition.
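A sketch of the scaled cosine error over the sampled node set, following the loss described above; the function name is an assumption:

```python
import numpy as np

def sce_loss(X, Z, idx, gamma=2.0):
    """Scaled cosine error over the sampled (masked) nodes:
    L = mean_i (1 - cos(x_i, z_i))^gamma. l2-normalizing both vectors puts
    them on the unit hypersphere, removing norm/dimension effects."""
    x = X[idx] / np.linalg.norm(X[idx], axis=1, keepdims=True)
    z = Z[idx] / np.linalg.norm(Z[idx], axis=1, keepdims=True)
    cos = (x * z).sum(axis=1)          # cosine similarity per sampled node
    return np.mean((1.0 - cos) ** gamma)
```

A perfect reconstruction (up to scale) gives zero loss; an anti-aligned one gives (1 − (−1))^γ.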
The loss after structure reconstruction is calculated using the classical cross-entropy function. Cross entropy measures the degree of difference between two probability distributions over the same random variable; in machine learning it expresses the gap between the true probability distribution and the predicted one, and the smaller its value, the better the model's predictions.
The structure reconstruction loss is expressed as:
L_S = −(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [ A_ij log Â_ij + (1 − A_ij) log(1 − Â_ij) ],
where A_ij denotes an element of the adjacency matrix of the citation network's initial graph, N is the number of nodes in the citation network, and Â_ij denotes the corresponding element of the reconstructed adjacency matrix.
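The elementwise cross-entropy between the original adjacency matrix and the reconstructed edge probabilities can be sketched as follows, averaging over all N² entries; the clipping constant is an implementation assumption to avoid log(0):

```python
import numpy as np

def structure_loss(A, A_hat, eps=1e-12):
    """Elementwise binary cross-entropy between the original adjacency A
    and the reconstructed edge probabilities A_hat, averaged over entries."""
    A_hat = np.clip(A_hat, eps, 1.0 - eps)   # numerical safety for the logs
    return -np.mean(A * np.log(A_hat) + (1.0 - A) * np.log(1.0 - A_hat))
```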
During feature reconstruction and structure reconstruction, the model generates the corresponding graph representation information. In feature reconstruction, we want the extracted features to reflect the original input features, yielding a graph representation that captures the attribute information well, namely the vector H output by the reconstruction-feature graph autoencoder. In structure reconstruction, we want the extracted structure information to reflect the structural characteristics of the original input, yielding a graph representation that captures the structure information well, namely the vector Z output by the reconstruction-structure graph autoencoder.
Based on the clustering result obtained in step 103, the KL divergence between the clustering result distribution Q and the target distribution P is calculated; the smaller the KL divergence, the closer the distributions P and Q are, and repeatedly training Q makes its distribution approach P. The cluster loss is expressed as:
L_clu = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij),
where q_ij denotes an element of the clustering result distribution Q and p_ij an element of the target distribution P.
q_ij can be seen as the probability of assigning node i to cluster j; f_j = Σ_i q_ij is the sum of q_ij over i; h_i is the i-th row of H; μ_j is a cluster center obtained by K-Means initialization on the representation learned by the pre-trained autoencoder; t is the degree of freedom of the Student's t-distribution.
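A sketch of the soft assignment q_ij, the sharpened target distribution p_ij, and the KL loss described above, following the usual Student's-t formulation; the function name is an assumption:

```python
import numpy as np

def clustering_loss(H, mu, t=1.0):
    """Soft assignments q_ij from a Student's t kernel between embeddings h_i
    and cluster centers mu_j; target p_ij sharpens q; loss = KL(P || Q)."""
    d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # squared distances
    q = (1.0 + d2 / t) ** (-(t + 1.0) / 2.0)
    q /= q.sum(axis=1, keepdims=True)          # q_ij: soft cluster assignment
    f = q.sum(axis=0)                          # f_j = sum_i q_ij
    p = (q ** 2) / f                           # sharpen and normalize by frequency
    p /= p.sum(axis=1, keepdims=True)          # target distribution p_ij
    return np.sum(p * np.log(p / q))           # KL(P || Q)
```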
The loss function is expressed as:
Loss=L X +L S +εL clu
where Loss denotes the value of the loss function, L_X the feature reconstruction loss, L_S the structure reconstruction loss, and L_clu the cluster loss; ε > 0 is the first hyperparameter, balancing the clustering optimization against preservation of the local structure of the original data.
As shown in fig. 11, the autoencoder model schematic provided by the embodiment of the invention depicts the structure of a self-encoder: an unsupervised model that updates its parameters with back-propagation, whose ultimate goal is to bring the output x' infinitely close to the input x. In this process the self-encoder compresses the input data into lower-dimensional features and then reproduces the input data from them; the reproduction is the self-encoder's output. In essence, the self-encoder is a compression algorithm. A self-encoder consists of 3 parts: the encoder (Encoder), for data compression; the compressed feature vector (Compressed Feature Vector), the features compressed by the encoder; and the decoder (Decoder), for data decoding. In FIG. 11, W_enc denotes the weight matrix of the encoder and W_dec the weight matrix of the decoder.
The invention takes the loss function as an objective function, and optimizes the dual self-encoder of the graph by utilizing the objective function.
By minimizing the objective function, back-propagation with stochastic gradient descent (SGD) helps the graph double self-encoder model learn parameters that yield a better clustering effect, and improves training efficiency.
Meanwhile, the graph representations with different emphases generated by the two graph autoencoders are weighted and summed to produce a new graph representation, which is then clustered; this makes the community division result more accurate and of higher quality, explores the citation network structure more precisely, and promotes the citation analysis process.
To demonstrate the advance of the invention, experiments were conducted on a real-world dataset, Citeseer, which contains 3312 papers and 4732 citation links between them; all papers belong to 6 different academic research fields, and each paper is represented by a 3703-dimensional word vector of its keywords. The hyperparameters α and ε were set to 0.5 and 0.001 respectively, and three measures commonly used in the community discovery field, accuracy (ACC), normalized mutual information (NMI) and adjusted Rand index (ARI), were selected to evaluate model performance. On this basis, other community discovery methods were compared, detailed as follows:
K-Means (Krishna & Murty, 1999): it initializes K different clusters, calculates the center of each cluster using a mean calculation method, and then iteratively updates the cluster center until the criterion function converges.
TADW (Yang et al, 2015): the method integrates the node text information into the network representation learning through matrix decomposition, thereby combining rich topological structure and semantic information.
GAE & VGAE (Kipf & Welling, 2016): they integrate topology and attribute information into the learned representation using a graph autoencoder built from a graph convolutional network.
Graphencoder (Salehi & Davulcu, 2020): the method learns nonlinear representation of an original network through a stacked automatic encoder, and realizes a clustering result through a K-Means method.
Details of the experimental results are shown in Table 1, where the best values are indicated in bold. A, X and A & X indicate respectively that a method uses only the network topology, only the attribute information, or both.
Table 1 representation of the present graph dual self-encoder and other community finding algorithms in community detection tasks
Comparing the data shows that the model of the invention outperforms the other methods above on the Citeseer dataset. Specifically, compared with other community discovery algorithms that use both the network topology and the attribute information, the accuracy (Accuracy, ACC), normalized mutual information (Normalized Mutual Info, NMI) and adjusted Rand index (Adjusted Rand Index, ARI) of our model on this dataset improve by 20.4%, 27.1% and 21.4% on average respectively; compared with the traditional clustering algorithm K-Means, which uses only attribute information, the ACC, NMI and ARI of our model improve by 9.4%, 17.8% and 10.5% respectively; compared with the 2020 GraphEncoder algorithm, which uses only the network topology, they improve by 34.5%, 43.3% and 34.7% respectively, which verifies the effectiveness of the invention.
The invention promotes citation analysis by optimizing the community division algorithm and, in practice, helps researchers find important documents and trace the flow of scientific knowledge. The community discovery method based on the graph double self-encoder can be embedded into document search or document reading software to relate the document being searched or read to other related documents, helping researchers find important documents and explore domain knowledge flows (by analogy, a web page can recommend other movies related to the movie being searched or watched). A network comprises nodes and edges; a citation network is graph data with documents as nodes and citation links as edges. The Citeseer dataset is one such citation network, though far from the only one; this embodiment is an example illustrating the technical effect of the invention, verifying the optimization of community discovery on a citation network through the feedback of the measurement indexes.
Based on citation analysis, the invention designs a graph double self-encoder community discovery algorithm with higher community division quality; it provides a feasible solution to problems that have negatively affected the development of graph autoencoders, overcomes the shortcoming that attribute information is insufficiently explored, and enhances the robustness of the self-encoder, thereby improving the community division effect and optimizing the exploration of the citation network structure. Meanwhile, introducing the mask idea into the community discovery field provides a new scheme for future community discovery research and further promotes the development of the citation analysis field.
The community discovery method based on the graph double self-encoder provides a feasible solution to problems that have negatively affected the development of graph autoencoders: 1. Most reconstruction targets in a graph are feature vectors with little information, and the multi-layer perceptron (MLP) commonly used as the decoder in a GAE may fail to bridge the gap between the encoder representation and the decoder target, so the graph features are not captured well. For this problem, the invention proposes a more expressive graph neural network (GNN) as the decoder; with this improvement the graph autoencoder obtains a better hidden-layer representation, which helps improve subsequent community division. 2. The mean square error (MSE) used by the reconstruction-feature graph autoencoder to compute the reconstruction loss is affected by the varying norms and dimensions of the feature vectors, risking an unstable model. For this problem, the invention proposes the scaled cosine error (SCE) for computing the feature reconstruction loss; with this improvement the robustness of the model is enhanced.
The technical scheme of the invention solves the problem that traditional community discovery methods cannot consider the network topology and the node attribute features at the same time: the invention designs a graph double self-encoder model with both a reconstruction-structure part and a reconstruction-feature part, exploiting the structure information and the attribute information of the graph simultaneously. The technical scheme also addresses the insufficient graph representation learning of recent graph-autoencoder-based community discovery methods that adopt a single reconstruction-structure or reconstruction-feature mode. In addition, the common framework adopted by reconstruction-feature graph autoencoders performs feature reconstruction without corruption; experiments show this framework can be unreliable and the resulting models weakly robust. To solve this problem, inspired by the denoising autoencoder idea of first corrupting and then reconstructing the input, widely adopted in computer vision and natural language processing, the invention adopts a reconstruction mode that first corrupts the original features of some nodes and then reconstructs them, which enhances the robustness of the graph self-encoder.
The technical scheme of the invention fills a technical gap in the domestic and foreign industries: the invention designs a corrupted feature reconstruction mode by borrowing the denoising autoencoder idea, from computer vision and natural language processing, of first corrupting the input and then reconstructing it. In contrast to conventional encoders, the hidden-layer representation of a graph autoencoder using this feature reconstruction mode is not mapped directly from the original input but from a "corrupted" version of it. The encoder randomly zeroes a certain proportion of the nodes in the original input and leaves the remaining nodes untouched, yielding the "corrupted" version. This is equivalent to introducing a certain proportion of blank elements into the original input, reducing the information it contains; by learning, the model then tries to fill in the missing information and thereby learns the data structure, so that the extracted features reflect the characteristics of the original input. Injecting noise into the input and then reconstructing a clean, noise-free input from the noisy "corrupted" samples encourages higher-level feature expressions of the input. A conventional autoencoder simply relies on minimizing the error between the input and the reconstructed signal to obtain a hidden-layer feature representation; this training strategy cannot guarantee that the essential features of the data are extracted, and merely minimizing the reconstruction error may make the features learned by the encoder mere copies of the original input. Corrupted feature reconstruction avoids these problems.
Using the mask-denoising idea for community discovery for the first time fills a technical gap in the domestic and foreign industries and is a fully innovative exploration of community discovery methods; it further benefits the exploration of citation network structure and can promote the development of the citation analysis field.
Example 2
As shown in fig. 12, a community discovery system based on a graph double self-encoder includes the following structure.
The graph information representation module 201 is configured to input a given citation network into the graph double self-encoder to obtain the graph structure representation information and the graph attribute representation information.
The information fusion module 202 is configured to fuse the graph structure representation information and the graph attribute representation information to obtain fused graph representation information.
And the clustering module 203 is configured to perform community division on the fused graph representation information by using a clustering method, so as to obtain a community discovery result.
Example 3
An embodiment of the present invention provides an electronic device including a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the graph-based dual self-encoder community discovery method of embodiment 1.
Alternatively, the electronic device may be a server.
In addition, the embodiment of the present invention also provides a computer readable storage medium storing a computer program, which when executed by a processor, implements the graph-based dual self-encoder community discovery method of embodiment 1.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (10)

1. A community discovery method based on graph double self-encoders, comprising:
inputting a given citation network into a graph double self-encoder to obtain graph structure representation information and graph attribute representation information;
fusing the graph structure representation information and the graph attribute representation information to obtain fused graph representation information;
and carrying out community division on the fused graph representation information by adopting a clustering method to obtain a community discovery result.
2. The graph-dual self-encoder based community discovery method of claim 1, wherein the graph-dual self-encoder comprises a first encoder and a second encoder;
the first encoder is used for outputting, for each node, the feature obtained by fusing neighborhood information according to the attention coefficients between the node and its neighbor nodes and the node features of the neighbor nodes, obtaining the graph structure representation information; the nodes are nodes in the citation network;
The second encoder is used for sampling nodes in the citation network with a random sampling strategy to obtain a sample set, masking the features of the nodes in the sample set with a first mask token, and learning graph information from both the nodes masked by the first mask token and the unmasked nodes to obtain the graph attribute representation information.
3. The community discovery method based on graph double self-encoders according to claim 2, characterized in that the graph automatic encoder of the reconstruction structure adopts a graph attention network; the graph automatic encoder of the reconstruction feature adopts a graph neural network.
4. The graph-dual self-encoder based community discovery method of claim 2, further comprising training the graph-dual self-encoder; the loss functions used for training the graph double self-encoder comprise structural reconstruction loss, characteristic reconstruction loss and clustering loss.
5. The community discovery method based on a graph double self-encoder according to claim 4, wherein the graph double self-encoder further comprises a first decoder and a second decoder, the first decoder is a decoder of a graph automatic encoder of a reconstruction structure, and the second decoder is a decoder of a graph automatic encoder of a reconstruction feature;
The first decoder is used for performing inner product operation on the graph structure representation information to obtain a reconstructed adjacent matrix;
the second decoder is configured to:
re-masking the node subjected to the first mask token masking processing by adopting a second mask token;
for the node subjected to the re-masking processing, reconstructing the characteristics of the node subjected to the re-masking processing by adopting a graph neural network based on the neighbor nodes of the node subjected to the re-masking processing, and obtaining a reconstructed characteristic matrix;
the structural reconstruction loss is expressed as:
wherein A_ij denotes an element of the adjacency matrix of the initial graph of the citation network, N is the number of nodes in the citation network, and Â_ij denotes the corresponding element of the reconstructed adjacency matrix;
the feature reconstruction loss is expressed as:
wherein x_i denotes the original feature of node i in the citation network, z_i is the feature of node i after feature reconstruction, γ denotes the scale factor, Ṽ denotes the node set obtained by sampling nodes in the citation network with a random sampling strategy, and T denotes transposition.
6. The graph-double self-encoder based community finding method as claimed in claim 5, wherein the loss function is expressed as:
Loss=L X +L S +εL clu
where Loss represents the value of the Loss function, L X Representing the loss of feature reconstruction, L S Representing structural reconstruction loss, L clu Representing the cluster loss and epsilon representing the first hyper-parameter.
7. The community discovery method based on the graph double self-encoder according to claim 1, wherein the fused graph representation information is expressed as: C = (1 − α) × Z + α × H;
wherein C represents the graph representation information after fusion, α represents the second super parameter, Z represents the graph structure representation information, and H represents the graph attribute representation information.
8. A graph-based dual self-encoder community discovery system, comprising:
the graph information representation module is used for inputting a given citation network into the graph double self-encoder to obtain graph structure representation information and graph attribute representation information;
the information fusion module is used for fusing the graph structure representation information and the graph attribute representation information to obtain fused graph representation information;
and the clustering module is used for carrying out community division on the fused graph representation information by adopting a clustering method to obtain a community discovery result.
9. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the graph-based dual self-encoder community discovery method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the graph-based dual self-encoder community finding method as claimed in any one of claims 1 to 7.
CN202310498705.XA 2023-05-06 2023-05-06 Community discovery method, system, equipment and medium based on graph double self-encoder Pending CN116595479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310498705.XA CN116595479A (en) 2023-05-06 2023-05-06 Community discovery method, system, equipment and medium based on graph double self-encoder


Publications (1)

Publication Number Publication Date
CN116595479A true CN116595479A (en) 2023-08-15

Family

ID=87600014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310498705.XA Pending CN116595479A (en) 2023-05-06 2023-05-06 Community discovery method, system, equipment and medium based on graph double self-encoder

Country Status (1)

Country Link
CN (1) CN116595479A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113240A (en) * 2023-10-23 2023-11-24 华南理工大学 Dynamic network community discovery method, device, equipment and storage medium
CN117113240B (en) * 2023-10-23 2024-03-26 华南理工大学 Dynamic network community discovery method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
Reddy et al. A deep neural networks based model for uninterrupted marine environment monitoring
CN107516110B (en) Medical question-answer semantic clustering method based on integrated convolutional coding
CN112417219B (en) Hyper-graph convolution-based hyper-edge link prediction method
CN112966074B (en) Emotion analysis method and device, electronic equipment and storage medium
CN108108854B (en) Urban road network link prediction method, system and storage medium
CN109389151B (en) Knowledge graph processing method and device based on semi-supervised embedded representation model
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
CN114048331A (en) Knowledge graph recommendation method and system based on improved KGAT model
CA3194689A1 (en) Infinitely scaling a/b testing
Sun et al. Dual-decoder graph autoencoder for unsupervised graph representation learning
Ning et al. Conditional generative adversarial networks based on the principle of homologycontinuity for face aging
CN116595479A (en) Community discovery method, system, equipment and medium based on graph double self-encoder
CN114743037A (en) Deep medical image clustering method based on multi-scale structure learning
Du et al. Image recommendation algorithm combined with deep neural network designed for social networks
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Fang et al. Hyperspherical variational co-embedding for attributed networks
CN117408336A (en) Entity alignment method for structure and attribute attention mechanism
CN116821519A (en) Intelligent recommendation method for system filtering and noise reduction based on graph structure
Zhuang et al. Synthesis and generation for 3D architecture volume with generative modeling
CN116739402A (en) Health portrait construction method based on data mining
CN112861882B (en) Image-text matching method and system based on frequency self-adaption
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium
CN115344794A (en) Scenic spot recommendation method based on knowledge map semantic embedding
CN114861863A (en) Heterogeneous graph representation learning method based on meta-path multi-level graph attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination