CN111428741B

CN111428741B - Network community discovery method and device, electronic equipment and readable storage medium

Info

Publication number: CN111428741B
Application number: CN201811565878.4A
Authority: CN
Inventors: 陈川; 钱慧; 林志伟; 凌国惠; 张宗一; 郑子彬
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2023-04-07
Anticipated expiration: 2038-12-20
Also published as: CN111428741A

Abstract

An embodiment of the invention provides a network community discovery method and device, electronic equipment and a readable storage medium, and belongs to the technical field of community discovery. The method comprises: obtaining multi-source social network data of social network users, wherein the multi-source social network data comprises data corresponding to at least two data sources; respectively determining the association relationship between each auxiliary data source and the main data source based on the user relationship between the social network users corresponding to each auxiliary data source and the social network users corresponding to the main data source; clustering the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source to obtain a clustering result; and obtaining the network community division result of the social network users corresponding to the main data source according to the clustering result. The scheme of the embodiment of the invention can effectively improve the accuracy of community discovery.

Description

Network community discovery method and device, electronic equipment and readable storage medium

Technical Field

The invention relates to the technical field of community discovery, in particular to a network community discovery method and device, electronic equipment and a readable storage medium.

Background

Community discovery is a generalized form of clustering: it is used to discover the community structure in a network, and to divide and extract sets of entities with similar attributes in the network structure. One network community corresponds to one cluster (class) in the clustering.

In recent years, various community discovery algorithms have been proposed, but most of them are single-source community discovery algorithms. A single-source community discovery algorithm divides the data instances into a plurality of communities according to the data characteristics of a single data source, so that the similarity of data instances within a community is large and the similarity of data instances in different communities is small. Although community discovery can be achieved with the existing single-source community discovery algorithms, these schemes rely on the data of a single data source, so they suffer from a single viewpoint and a low fault tolerance, and the accuracy of the community discovery result is low.

Disclosure of Invention

The object of the present invention is to solve at least one of the technical problems of the prior art. The scheme provided by the embodiment of the invention is as follows:

in a first aspect, the present invention provides a method for discovering a web community, the method including:

obtaining multi-source social network data of social network users, wherein the multi-source social network data comprises data corresponding to at least two data sources;

respectively determining the association relationship between each auxiliary data source and a main data source based on the user relationship between the social network users corresponding to each auxiliary data source and the social network users corresponding to the main data source, wherein the main data source is a specified one of the at least two data sources, and each auxiliary data source is a data source other than the main data source among the at least two data sources;

and performing community division on the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, to obtain the division result of the network communities of the social network users corresponding to the main data source.

In an alternative of the first aspect, the respectively determining an association relationship between each auxiliary data source and a main data source based on a user relationship between a social network user corresponding to each auxiliary data source and a social network user corresponding to the main data source includes:

respectively constructing a relationship matrix between each auxiliary data source and the main data source based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source;

the relationship matrix corresponding to each auxiliary data source is used for representing the incidence relationship between each auxiliary data source and the main data source, and the elements in the relationship matrix are used for representing the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source.

In an alternative of the first aspect, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network user corresponding to the row where the element is located in the relationship matrix and the social network user corresponding to the column where the element is located in the relationship matrix are the same user.
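The relationship matrix described above can be sketched as follows. This is a minimal illustration, assuming that each data source lists its users by a shared identifier; the function name `build_relation_matrix` and the example identifiers are hypothetical and do not come from the source.

```python
import numpy as np

def build_relation_matrix(aux_user_ids, main_user_ids):
    """Return S with one row per auxiliary-source user and one column per
    main-source user; S[i, j] is 1 when both entries are the same user, else 0."""
    main_index = {uid: j for j, uid in enumerate(main_user_ids)}
    S = np.zeros((len(aux_user_ids), len(main_user_ids)))
    for i, uid in enumerate(aux_user_ids):
        j = main_index.get(uid)
        if j is not None:          # the user also appears in the main data source
            S[i, j] = 1.0
    return S

# Hypothetical example: the auxiliary source only covers two of the four main users.
main_users = ["u1", "u2", "u3", "u4"]
aux_users = ["u3", "u1"]
S = build_relation_matrix(aux_users, main_users)
```

Because the rows follow the auxiliary data source and the columns follow the main data source, the matrix stays well defined when the two sources cover different numbers of users.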

In an alternative of the first aspect, based on the data of the main data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the main data source, performing community division on the social network users corresponding to the main data source to obtain a division result of the network community of the social network users corresponding to the main data source, includes:

obtaining a first objective function through a first clustering algorithm based on data of the main data source, wherein the first objective function comprises a community indication matrix before solving corresponding to the main data source;

obtaining a second objective function through a second clustering algorithm based on the data of each auxiliary data source and the incidence relation between each auxiliary data source and the main data source;

obtaining a final objective function based on the first objective function and the second objective function;

solving the final objective function to obtain a solved community indication matrix corresponding to the main data source;

and obtaining a network community division result of the social network user corresponding to the main data source based on the solved community indication matrix.

In an alternative of the first aspect, the number of rows of the solved community indication matrix is the number of social network users corresponding to the main data source, and the number of columns of the solved community indication matrix is the number of pre-divided network communities.

In an alternative of the first aspect, obtaining a second objective function through a second clustering algorithm based on data of each auxiliary data source and an association relationship between each auxiliary data source and a main data source includes:

obtaining sub-objective functions corresponding to each auxiliary data source through a second clustering algorithm based on the data of each auxiliary data source and the incidence relation between each auxiliary data source and the main data source;

and obtaining a second objective function based on the sub-objective functions corresponding to each auxiliary data source.

In an alternative of the first aspect, obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source includes:

and obtaining a second objective function based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source.

In an alternative of the first aspect, obtaining a first objective function through a first clustering algorithm based on data of a main data source includes:

calculating a user similarity matrix corresponding to the main data source based on the data of the main data source;

obtaining a first objective function through a first clustering algorithm based on a user similarity matrix corresponding to a main data source;

based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, obtaining the sub-objective function corresponding to each auxiliary data source through a second clustering algorithm, which comprises the following steps:

calculating a user similarity matrix corresponding to each auxiliary data source based on the data of each auxiliary data source;

obtaining sub-objective functions corresponding to each auxiliary data source through a second clustering algorithm based on the user similarity matrix corresponding to each auxiliary data source and the relation matrix corresponding to each auxiliary data source;

the relationship matrix corresponding to each auxiliary data source is used for representing the incidence relationship between each auxiliary data source and the main data source, the relationship matrix corresponding to each auxiliary data source is a matrix constructed based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source, and elements in the relationship matrix are used for representing the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source.

In an alternative of the first aspect, the second clustering algorithm is a spectral clustering algorithm, and the sub-objective functions are:

$\mathrm{Tr}(U^{T} L_{v',v} U)$

wherein,

[formula image not reproduced in this text]

$\mathrm{Tr}$ represents the trace of a matrix, $U$ represents the community indication matrix before solving corresponding to the main data source, $U^{T}$ represents the transposed matrix of $U$, $v$ represents the main data source, $v'$ represents an auxiliary data source, $S_{v',v}$ represents the relationship matrix between the auxiliary data source $v'$ and the main data source $v$, $S_{v,v'}$ denotes the transposed matrix of $S_{v',v}$, $A_{v'}$ represents the user similarity matrix corresponding to the auxiliary data source $v'$, [symbol not reproduced in this text] represents the degree matrix of the matrix $S_{v,v'} A_{v'} S_{v',v}$, and $\|\cdot\|_{F}$ represents the F-norm.
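The sub-objective can be illustrated numerically. The expression that defines $L_{v',v}$ is not reproduced in this text, so the sketch below assumes the common unnormalized graph Laplacian, i.e. the degree matrix of $S_{v,v'} A_{v'} S_{v',v}$ minus that matrix itself; this choice, the function names and the toy data are assumptions, not the patent's stated formula.

```python
import numpy as np

def aux_laplacian(S_aux_main, A_aux):
    """Assumed form L = D - W, where W = S^T A S projects the auxiliary
    similarity onto main-source users and D is the degree matrix of W."""
    W = S_aux_main.T @ A_aux @ S_aux_main   # shape: (n_main, n_main)
    D = np.diag(W.sum(axis=1))              # degree matrix of W
    return D - W

def sub_objective(U, S_aux_main, A_aux):
    """Tr(U^T L U) for one auxiliary data source."""
    L = aux_laplacian(S_aux_main, A_aux)
    return float(np.trace(U.T @ L @ U))

# Toy data: 3 main-source users, 2 of whom also appear in the auxiliary source.
S = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])             # rows: auxiliary users, columns: main users
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])                  # auxiliary user similarity matrix
U_same = np.array([[1.0], [1.0], [0.0]])    # users 0 and 1 share an indicator value
U_diff = np.array([[1.0], [0.0], [0.0]])    # users 0 and 1 get different values

value_same = sub_objective(U_same, S, A)
value_diff = sub_objective(U_diff, S, A)
```

Under this assumed Laplacian, the term is zero when the two users that are similar in the auxiliary source receive the same indicator row, and positive when they do not, which is the behaviour a trace-minimization objective of this kind rewards.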

In an alternative of the first aspect, if the second objective function is a function obtained based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source, the second objective function is:

wherein $V$ denotes the total number of main and auxiliary data sources, and $\mu_{v'}$ represents the weight of the auxiliary data source $v'$.
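The formula for the second objective function is not reproduced in this text. As a rough sketch only, the following assumes that it is a weighted sum of the per-source sub-objectives, with one weight per auxiliary data source; the function name `second_objective` and the toy matrices are hypothetical.

```python
import numpy as np

def second_objective(U, laplacians, weights):
    """Assumed form: sum over auxiliary sources of mu_v' * Tr(U^T L_v',v U)."""
    return float(sum(mu * np.trace(U.T @ L @ U)
                     for mu, L in zip(weights, laplacians)))

# Two hypothetical auxiliary sources, already expressed as Laplacians
# over the same two main-source users.
L1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
L2 = np.array([[2.0, -2.0], [-2.0, 2.0]])
U = np.array([[1.0], [0.0]])

value = second_objective(U, [L1, L2], [0.25, 0.75])
```

With such a form, a source with a larger weight contributes more to the penalty, so learning the weights amounts to deciding how much each auxiliary source is trusted.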

In an alternative of the first aspect, the method further comprises:

constructing a regular term of a weight vector, wherein the weight vector is a vector formed by weights corresponding to each auxiliary data source;

obtaining a final objective function based on the first objective function and the second objective function, including:

and obtaining a final objective function according to the first objective function, the second objective function and the regular terms of the weight vector, wherein the weight vector is a term to be solved in the final objective function.

In an alternative of the first aspect, the regularization term of the weight vector is:

where $\mu$ denotes the weight vector, $\beta$ denotes a first regularization coefficient, and $\|\mu\|_{2}^{2}$ represents the square of the 2-norm of $\mu$.

In an alternative of the first aspect, the method further comprises:

acquiring Must-link supervision information, wherein the Must-link supervision information is used for identifying that two social network users belong to the same network community;

constructing a constraint function according to the Must-link supervision information;

and obtaining a final objective function based on the first objective function, the second objective function and the constraint function.

In an alternative of the first aspect, constructing the constraint function according to the Must-link supervision information includes:

constructing a constraint matrix according to the Must-link supervision information;

and obtaining a constraint function based on the constraint matrix and the community indication matrix before solving corresponding to the main data source.

In an alternative of the first aspect, the elements of each row in the constraint matrix represent one piece of Must-link supervision information, the number of rows of the constraint matrix is the number of pieces of Must-link supervision information, and the number of columns of the constraint matrix is the number of social network users corresponding to the main data source.

In an alternative of the first aspect, the constraint function is:

$\gamma \|Z\|_{1}$

wherein $\gamma$ is a second regularization coefficient, $MU = Z$, $M$ represents the constraint matrix, $U$ represents the community indication matrix before solving corresponding to the main data source, and $\|Z\|_{1}$ represents the 1-norm of $Z$.
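The text fixes the shape of the constraint matrix but not the values of its entries. The sketch below assumes one common encoding, in which the row for a Must-link pair holds +1 at one user and -1 at the other, so that the corresponding row of $MU$ is the difference of the two users' indicator rows. It also assumes the entrywise 1-norm. Both choices and all names are illustrative assumptions.

```python
import numpy as np

def build_constraint_matrix(must_links, n_users):
    """One row per Must-link pair (i, j); assumed encoding: +1 at i, -1 at j."""
    M = np.zeros((len(must_links), n_users))
    for row, (i, j) in enumerate(must_links):
        M[row, i] = 1.0
        M[row, j] = -1.0
    return M

def constraint_value(M, U, gamma):
    """gamma * ||Z||_1 with Z = M U, using the entrywise 1-norm (assumption)."""
    Z = M @ U
    return float(gamma * np.abs(Z).sum())

# Hypothetical example: users 0 and 1 are known to share a community.
M = build_constraint_matrix([(0, 1)], n_users=3)
U_ok = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # pair placed together
U_bad = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])   # pair split apart

penalty_ok = constraint_value(M, U_ok, gamma=0.5)
penalty_bad = constraint_value(M, U_bad, gamma=0.5)
```

Under this encoding the penalty vanishes when a Must-link pair shares the same indicator row and grows as the rows diverge, which is how the supervision information steers the division.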

In an alternative of the first aspect, the final objective function comprises a first objective function, a second objective function, regular terms of the weight vector and a constraint function.

In an alternative of the first aspect, solving the final objective function to obtain a solved community indication matrix includes:

and solving the final objective function by using the Alternating Direction Method of Multipliers (ADMM) and the Lagrange multiplier method to obtain the solved community indication matrix.

In an alternative of the first aspect, solving the final objective function by using ADMM and the Lagrange multiplier method to obtain the solved community indication matrix includes:

initializing the community indication matrix U before solving, the weight vector μ and the Lagrange multipliers;

and repeatedly executing the operations of fixing μ and iteratively updating U, and of fixing U and Z and iteratively updating μ, until a convergence condition is met, wherein U at the time the convergence condition is met is the solved community indication matrix.
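The individual update formulas are not given in this text. One step that commonly arises when ADMM is applied to a 1-norm term under a constraint of the form $MU = Z$ is the update of $Z$, which has a closed form through soft-thresholding. The sketch below shows that standard step only, assuming an augmented Lagrangian with penalty parameter `rho` and multiplier `Lam` (both hypothetical names); it is not presented as the patent's exact procedure.

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||.||_1, applied entrywise."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def update_Z(M, U, Lam, gamma, rho):
    """Minimize gamma*||Z||_1 + <Lam, MU - Z> + (rho/2)*||MU - Z||_F^2 over Z."""
    return soft_threshold(M @ U + Lam / rho, gamma / rho)

# Toy values: one Must-link row over two users, one indicator column.
M = np.array([[1.0, -1.0]])
U = np.array([[0.9], [0.1]])
Lam = np.zeros((1, 1))
Z = update_Z(M, U, Lam, gamma=0.5, rho=1.0)
```

In a full solver this step would alternate with the updates of U and μ and with a multiplier update until the convergence condition is met.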

In an alternative of the first aspect, obtaining a community division result of the network community based on the solved community indication matrix includes:

and clustering the solved community indication matrix by adopting a K-means algorithm to obtain a community division result of the network community.
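The final clustering step can be sketched with a small NumPy implementation of Lloyd's K-means applied to the rows of the solved indication matrix, one row per main-source user. The function name `kmeans_rows`, the random initialization and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def kmeans_rows(U, k, n_iter=100, seed=0):
    """Cluster the rows of U into k groups with plain Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every row to every center.
        dists = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            U[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical solved indication matrix: 4 users, 2 pre-divided communities.
U = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])
labels = kmeans_rows(U, k=2)
```

Each label is the community assigned to the corresponding user of the main data source. The numeric label values themselves are arbitrary; only the grouping matters.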

In a second aspect, the present invention provides an apparatus for discovering a web community, the apparatus comprising:

the multi-source social network data acquisition module is used for acquiring multi-source social network data of a social network user, and the multi-source social network data comprises data corresponding to at least two data sources;

the data source relation determining module is used for respectively determining the association relation between each auxiliary data source and the main data source based on the user relation between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source, wherein the main data source is one of at least two specified data sources, and the auxiliary data source is a data source except the main data source in the at least two data sources;

and the community division result determining module is used for clustering the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source and the incidence relation between each auxiliary data source and the main data source to obtain the division result of the network community of the social network users corresponding to the main data source.

In an alternative of the second aspect, the data source relationship determining module is specifically configured to:

In an alternative of the second aspect, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network users corresponding to the row where the element is located in the relationship matrix and the social network users corresponding to the column where the element is located are the same user.

In an alternative of the second aspect, the community division result determining module is specifically configured to:

In an alternative of the second aspect, the number of rows of the solved community indication matrix is the number of social network users corresponding to the main data source, and the number of columns of the solved community indication matrix is the number of pre-divided network communities.

In an alternative of the second aspect, the community division result determining module is specifically configured to, when obtaining the second objective function through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source:

and obtaining a second objective function based on the corresponding sub-objective functions of each auxiliary data source.

In an alternative of the second aspect, when the community partition result determining module obtains the first objective function through the first clustering algorithm based on the data of the main data source, the community partition result determining module is specifically configured to:

when the community division result determining module obtains the sub-objective function corresponding to each auxiliary data source through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, the community division result determining module is specifically configured to:

In an alternative of the second aspect, the second clustering algorithm is a spectral clustering algorithm, and the sub-objective functions are:

$\mathrm{Tr}(U^{T} L_{v',v} U)$

wherein,

[formula image not reproduced in this text]

$\mathrm{Tr}$ represents the trace of a matrix, $U$ represents the community indication matrix before solving corresponding to the main data source, $U^{T}$ represents the transposed matrix of $U$, $v$ represents the main data source, $v'$ represents an auxiliary data source, $S_{v',v}$ represents the relationship matrix between the auxiliary data source $v'$ and the main data source $v$, $S_{v,v'}$ denotes the transposed matrix of $S_{v',v}$, $A_{v'}$ represents the user similarity matrix corresponding to the auxiliary data source $v'$, [symbol not reproduced in this text] represents the degree matrix of the matrix $S_{v,v'} A_{v'} S_{v',v}$, and $\|\cdot\|_{F}$ represents the F-norm.

In an alternative of the second aspect, if the second objective function is a function obtained based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source, the second objective function is:

wherein,

In an alternative of the second aspect, the apparatus further comprises:

the first regular term construction module is used for constructing a regular term of a weight vector, and the weight vector is a vector formed by weights corresponding to each auxiliary data source;

the community division result determining module is specifically configured to, when obtaining the final objective function based on the first objective function and the second objective function:

In an alternative of the second aspect, the regularization term of the weight vector is:

$\|\mu\|_{2}^{2}$ represents the square of the 2-norm of $\mu$.

In an alternative of the second aspect, the apparatus further comprises a constraint function construction module, the constraint function construction module being configured to:

acquiring the Must-link supervision information, wherein the Must-link supervision information is used for identifying that two social network users belong to the same network community;

In an alternative of the second aspect, when the constraint function building module builds the constraint function according to the Must-link supervision information, the constraint function building module is specifically configured to:

In an alternative of the second aspect, the elements of each row in the constraint matrix represent one piece of Must-link supervision information, the number of rows of the constraint matrix is the number of pieces of Must-link supervision information, and the number of columns of the constraint matrix is the number of social network users corresponding to the main data source.

In an alternative of the second aspect, the constraint function is:

$\gamma \|Z\|_{1}$

In an alternative of the second aspect, the final objective function comprises a first objective function, a second objective function, regular terms of the weight vector and a constraint function.

In an alternative of the second aspect, the community division result determining module is specifically configured to, when solving the final objective function to obtain a solved community indication matrix:

and solving the final objective function by using the Alternating Direction Method of Multipliers (ADMM) and the Lagrange multiplier method to obtain the solved community indication matrix.

In an alternative of the second aspect, the community division result determining module is specifically configured to, when solving the final objective function by using ADMM and the Lagrange multiplier method to obtain the solved community indication matrix:

and repeatedly executing the operations of fixing μ and iteratively updating U, and of fixing U and Z and iteratively updating μ, until the convergence condition is met, wherein U at the time the convergence condition is met is the solved community indication matrix.

In an alternative of the second aspect, when the community division result determining module obtains the community division result of the network community based on the solved community indication matrix, the community division result determining module is specifically configured to:

In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a processor and a memory; the memory has stored therein readable instructions which, when loaded and executed by the processor, implement a method of discovery of a network community as set forth in the first aspect or any of the alternatives to the first aspect above.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which readable instructions are stored, and when the readable instructions are loaded and executed by a processor, the method for discovering a web community as shown in the first aspect or any alternative scheme of the first aspect is implemented.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

according to the scheme provided by the embodiment of the invention, one main data source can be selected according to actual requirements, and the information of a plurality of auxiliary data sources can be synthesized for community discovery.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly described below.

FIG. 1 is a schematic diagram illustrating a discovery method for a network community according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a discovery method for a network community according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating relationships between nodes in three data sources in an example of the invention;

FIG. 4 is a schematic structural diagram illustrating a discovery apparatus for a network community according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

For better illustration and understanding of the solutions of the embodiments of the present invention, the following briefly describes technical solutions related to the solutions provided in the embodiments of the present invention.

(1) Spectral clustering

The spectral clustering algorithm converts the community discovery problem into a graph cut problem, such that nodes within a community (one node represents one user) have high similarity and nodes in different communities have low similarity. Given a data set (the data corresponding to the users to be partitioned), a spectral clustering algorithm first uses a similarity function to calculate the similarity between data instances and constructs an undirected weighted graph; it then constructs a Laplacian matrix from the similarity matrix, calculates the eigenvalues of the Laplacian matrix, selects the eigenvectors corresponding to the K smallest eigenvalues to construct an indication matrix, and finally obtains a clustering result.
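The steps of spectral clustering can be sketched as follows. This is a minimal illustration that assumes the unnormalized Laplacian (degree matrix minus similarity matrix) and a hand-written toy similarity matrix; the function name `spectral_indicator` is hypothetical.

```python
import numpy as np

def spectral_indicator(A, k):
    """Build the Laplacian L = D - A from a symmetric similarity matrix A and
    return the eigenvectors belonging to its k smallest eigenvalues."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    return eigvecs[:, :k]

# Toy graph: users 0-1 and users 2-3 are strongly similar, with a weak 1-2 link.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
U = spectral_indicator(A, k=2)
fiedler = U[:, 1]                          # eigenvector of the second-smallest eigenvalue
```

In this toy graph the sign pattern of the second eigenvector already separates the two groups; in general the rows of the indication matrix are passed to a clustering step such as K-means to obtain the final result.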

(2) Semi-supervised nonnegative matrix factorization

Like conventional semi-supervised methods, semi-supervised non-negative matrix factorization improves the clustering effect by adding cluster labels and pairwise constraint information (including Must-link and Cannot-link). The objective function of non-negative matrix factorization minimizes the loss of the matrix factorization, while semi-supervised non-negative matrix factorization further uses the constraint information to guide the matrix factorization process; using constraint information is an important way to improve the community discovery effect.

Although various community discovery technologies exist in the prior art, the existing community discovery technologies are generally community division based on data of a single data source, and the community division accuracy is low.

The invention provides a network community discovery method that aims to solve the above problems in the prior art and to improve the accuracy of network community division. It fuses the multi-source information of a multi-source social network by using a multi-view learning mechanism, divides network communities based on multi-source data, and thereby improves the accuracy of network community discovery. In addition, the embodiment of the invention can effectively handle the problem of missing data in some sources, and further improves the accuracy of community discovery by adding an automatic screening technique and Must-link supervision information on this basis.

Fig. 1 shows a schematic diagram of a discovery method for a network community in an optional embodiment of the present invention. As shown in the figure, the method may be divided into two major parts. The first part selects a reasonable similarity calculation method according to the characteristics of each data source in the multi-source social network data (namely, the multi-source information shown in the figure) to obtain, for each data source, the similarity matrix between users, i.e., the user similarity matrix (the similarity matrix shown in the figure). The second part selects a main data source (the weight of the main data source can be regarded as 1), adopts a multi-view learning mechanism to fuse the other multi-source information, adopts an automatic screening technique to learn the weights of the data sources ($W_{1}$, $W_{2}$ and $W_{3}$ shown in the figure, i.e., the weights $\mu_{v'}$ of the auxiliary data sources, which will be described in detail later), and guides community discovery by adding the Must-link supervision information to obtain the final community discovery result, such as the division into three network communities shown in the figure.

The following describes the technical solution of the present invention and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

Fig. 2 is a flowchart illustrating a discovery method for a web community according to the present invention, and as shown in fig. 2, the method may include the following steps:

Step S110: obtaining multi-source social network data of social network users.

The multi-source social network data is the social network data of at least two data sources, that is, the multi-source data of the social network. It reflects the fact that social network users have multi-source information, such as the attribute characteristics and behavior characteristics of a user and the interactions between users; specifically, it may include topology information (such as friend relationships), attribute information, and behavior information (published posts, likes, forwards, and the like) of the users.

In practical applications, not all users have all of this multi-source information, so some sources have missing data: a data source may have corresponding social network data only for some of the users. For example, some users have behavior information (that is, behavior data) and some do not; when the behavior information is used as a data source, the social network data corresponding to that data source contains the data of only some of the users. The number of social network users corresponding to different data sources is therefore likely to differ.

After the multi-source social network data is obtained, one data source needs to be designated as a main data source according to needs, namely, the data source plays a main role in dividing network communities, and the data sources except the main data source in the multiple data sources are called auxiliary data sources. As can be seen, there is data from one primary data source and data from at least one secondary data source in the multi-source social network data.

The main data source is the data source that plays the main decisive role in the network community division result. In practical applications, which data source is used as the main data source can be chosen according to the community division requirement, that is, according to the goal of community discovery. For example, if advertisements are to be delivered, the users' interest data source may be used as the main data source and the other data sources as auxiliary data sources; as another example, when users need to be classified, the users' direct friend relationships may be used as the main data source and the other data sources as auxiliary data sources.

Step S120: determine, for each auxiliary data source, the association relationship between that auxiliary data source and the main data source, based on the user relationship between the social network users corresponding to the auxiliary data source and the social network users corresponding to the main data source.

For an auxiliary data source, the user relationship is a relationship between a social network user corresponding to the auxiliary data source and a social network user corresponding to the main data source, and in practical application, which user relationship is specifically selected may be configured according to actual requirements. For example, in an optional manner, the user relationship may refer to whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user, or may refer to whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are in a friend relationship, or the like.

Since the user relationship is a relationship between the users of an auxiliary data source and the users of the main data source, the association relationship between each auxiliary data source and the main data source can be obtained from the user relationship corresponding to that auxiliary data source. The association relationship corresponding to an auxiliary data source reflects the relationship between the users of the main data source and the users of that auxiliary data source, and the auxiliary data source is fused with the main data source through this association relationship. For example, when the user relationship is whether a social network user corresponding to the auxiliary data source and a social network user corresponding to the main data source are the same user, the corresponding association relationship reflects which users the auxiliary data source and the main data source have in common.

In addition, the association relationship is determined based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source, so that even if some data sources lack part of data, multi-source data can be effectively fused.

Step S130: perform community division on the social network users corresponding to the main data source, based on the data of the main data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the main data source, to obtain the network community division result of the social network users corresponding to the main data source.

As can be seen from the above description, the association relationship corresponding to each type of secondary data source reflects the relationship between the primary data source and the user corresponding to the secondary data source, so that the data of each type of secondary data source and the association relationship between each type of secondary data source and the primary data source can assist the community discovery for the user corresponding to the primary data source.

According to the scheme provided by the embodiment of the invention, one main data source can be selected according to actual requirements, and the information of a plurality of auxiliary data sources can be synthesized for carrying out community discovery.

In an optional embodiment of the present invention, the determining, based on a user relationship between a social network user corresponding to each auxiliary data source and a social network user corresponding to a main data source, an association relationship between each auxiliary data source and the main data source includes:

In particular, as an alternative, the association relationship between an auxiliary data source and the main data source may be characterized by a relationship matrix, each element of which corresponds to the user relationship between one social network user in the auxiliary data source and one social network user in the main data source.

In an optional embodiment of the present invention, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network user corresponding to the row where the element is located in the relationship matrix and the social network user corresponding to the column where the element is located are the same user.

As an alternative, the user relationship between a social network user corresponding to the auxiliary data source and a social network user corresponding to the main data source may be whether the two are the same user. Specifically, if the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user, the value of the element at the corresponding position in the relationship matrix may be 1; if they are not the same user, the value of the element at the corresponding position may be 0.

In this scheme, a relationship matrix corresponding to an auxiliary data source, that is, a relationship matrix between the auxiliary data source and a main data source, reflects the conditions of the same user corresponding to the main data source and the auxiliary data source, and if the auxiliary data source and the main data source correspond to the same user, the relationship (which can be reflected by user similarity) of the same user based on the auxiliary data source can be used to assist the community discovery of the user corresponding to the main data source.

As an example, fig. 3 shows a schematic diagram of the relationships between the nodes corresponding to three data sources (one node corresponds to one social network user). To make nodes of different data sources easy to distinguish, the nodes corresponding to the same data source are drawn on the same plane, as shown in the figure. In this example, the direct friend relationship may be used as the main data source, i.e., source2 shown in the figure, while source1 and source3 are the two auxiliary data sources. A connecting line, i.e., an edge, between nodes of a data source represents the relationship between the two nodes. The weight of the edge between two nodes in the same data source can be the similarity between the two nodes, and when the similarity is zero, no edge is drawn between the two nodes.

As can be seen from the figure, in this example the social network users corresponding to the main data source source2 include all users, specifically 7 users. The number of social network users corresponding to source1 is 6, and the number corresponding to source3 is also 6. The social network users corresponding to source1 are all the users of source2 except the user corresponding to node P₁ in the figure, and the social network users corresponding to source3 are all the users of source2 except the user corresponding to node P₂. It can be seen that source1 and source2 have 6 users in common, and source3 and source2 also have 6 users in common.

The number of social network users corresponding to the auxiliary data source is used as the number of rows of the relationship matrix, and the number of social network users corresponding to the main data source is used as the number of columns. The relationship matrix S₁₂ between source1 and source2 and the relationship matrix S₃₂ between source3 and source2 are then as follows:

Taking S₁₂ as an example, the element in the first row and first column of S₁₂ indicates that the user corresponding to the first row and the user corresponding to the first column are the same user. The rows of S₁₂ correspond to the users of source1 and the columns correspond to the users of source2, so a 1 in the matrix marks a user that source1 and source2 have in common. The user corresponding to the third column of S₁₂ is the user of node P₁; since P₁ is not a user common to source1 and source2, all elements in the third column of S₁₂ are 0.

By constructing a relationship matrix between each auxiliary data source and the main data source, the data of different data sources are fused, and the difficulty of fusion caused by missing data in different data sources can be effectively overcome.
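The construction of such a relationship matrix can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `relation_matrix` and the user identifiers `u1` to `u7` are hypothetical, and the only assumption is that users can be matched across data sources by a shared identifier.

```python
def relation_matrix(aux_users, main_users):
    """Build S with one row per auxiliary-source user and one column per
    main-source user; S[r][c] is 1 when both refer to the same user."""
    column_of = {user: col for col, user in enumerate(main_users)}
    matrix = [[0] * len(main_users) for _ in aux_users]
    for row, user in enumerate(aux_users):
        col = column_of.get(user)
        if col is not None:
            matrix[row][col] = 1
    return matrix


# Hypothetical identifiers: the main source covers 7 users,
# the auxiliary source has no data for user "u3".
main_users = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"]
aux_users = ["u1", "u2", "u4", "u5", "u6", "u7"]
S = relation_matrix(aux_users, main_users)
```

In this sketch the column of the user missing from the auxiliary source is all zeros, which mirrors the all-zero column described for S₁₂ above.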

In an optional embodiment of the present invention, based on data of a main data source, data of each auxiliary data source, and an association relationship between each auxiliary data source and the main data source, performing community division on social network users corresponding to the main data source to obtain a division result of a network community of the social network users corresponding to the main data source, including:

The community indication matrix is a matrix that indicates the community division result, i.e., the target matrix that indicates the clustering result. Specifically, the solved community indication matrix is the clustering target matrix corresponding to the main data source: optimizing the final objective function iteratively updates the community indication matrix until the required cluster indication matrix of the clustering result is obtained.

Specifically, the number of rows of the solved community indication matrix may be the number of social network users corresponding to the main data source, the number of columns is the preset number of communities, i.e., the number of clusters, and the elements in each column of the solved community indication matrix correspond to the clustering result of one cluster.

Optimizing the final objective function means iterating on the final objective function based on a pre-configured convergence condition, i.e., a constraint condition, until the value of the final objective function satisfies the convergence condition. In an optional manner, the convergence condition may be that the difference between the values of the final objective function after two successive iterations is smaller than a set value, or that the difference between the community indication matrix obtained in the current iteration and that obtained in the previous iteration is smaller than a set threshold, and so on, with each parameter to be solved in the final objective function satisfying its respective preset condition.

In practical applications, a clustering index may be set when the objective function is optimized, and whether the algorithm has converged may be judged with the help of this index. For example, the clustering index may be NMI (Normalized Mutual Information), ACC (clustering accuracy), and the like. The corresponding convergence conditions may differ for different clustering indexes. For NMI and ACC, for example, the algorithm can be judged to have converged when the value of the objective function decreases slowly and stabilizes while the NMI or ACC index increases slowly and stabilizes.

In the scheme of the invention, the final objective function comprises a first objective function obtained from the data of the main data source and a second objective function obtained from the data of the auxiliary data sources and the association relationship between each auxiliary data source and the main data source. Because the second objective function is built on these association relationships, it characterizes the influence of each auxiliary data source on the clustering of the social network users corresponding to the main data source. The final objective function therefore fuses the multi-source data effectively, and determining the community division result from the community indication matrix obtained by solving the final objective function can greatly improve the accuracy of community discovery compared with existing community discovery techniques.
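The first convergence condition mentioned above, a small change in the objective value between two successive iterations, can be sketched as follows. The function name `has_converged`, the tolerance `tol` and the recorded objective values are hypothetical and only serve as illustration.

```python
def has_converged(objective_values, tol=1e-6):
    """Return True when the last two objective values differ by less than tol."""
    if len(objective_values) < 2:
        return False
    return abs(objective_values[-1] - objective_values[-2]) < tol


# Hypothetical objective values, one per iteration of some optimizer.
history = []
stopped_at = None
for iteration, value in enumerate([5.0, 3.2, 3.05, 3.0500004], start=1):
    history.append(value)
    if has_converged(history):
        stopped_at = iteration
        break
```

A check on the change of the community indication matrix between iterations could be written in the same way, with a matrix norm in place of the absolute difference.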

In an optional embodiment of the present invention, obtaining, by a second clustering algorithm, a second objective function based on data of each auxiliary data source and an association relationship between each auxiliary data source and a main data source includes:

The sub-objective function corresponding to each auxiliary data source characterizes the influence of that data source on the clustering of the main data source. After the sub-objective function corresponding to each auxiliary data source is determined, a second objective function representing the total influence of the auxiliary data sources on the clustering can be obtained from these sub-objective functions. For example, one alternative is to add the sub-objective functions corresponding to the auxiliary data sources to obtain the second objective function.

In an optional embodiment of the present invention, obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source includes:

In practical applications, each auxiliary data source may provide only partial information, and the information of each view influences the clustering corresponding to the main data source differently. Each auxiliary data source may therefore be assigned a weight parameter, i.e., a weight, which indicates the importance of that auxiliary data source, i.e., of its data, to the clustering. The weight corresponding to each auxiliary data source is non-negative, and the weights corresponding to all auxiliary data sources sum to 1.

In an optional embodiment of the present invention, obtaining a first objective function through a first clustering algorithm based on data of a main data source includes:

obtaining the sub-objective function corresponding to each auxiliary data source through a second clustering algorithm based on the data of each auxiliary data source and the incidence relation between each auxiliary data source and the main data source, wherein the sub-objective function comprises:

The user similarity matrix is a matrix that represents the similarity between users. For each data source, the number of rows and the number of columns of the user similarity matrix are generally both equal to the number of social network users corresponding to that data source, and an element of the matrix is the similarity between the social network user corresponding to its row and the social network user corresponding to its column. For example, if the data source is the behavior information of social network users, the similarity between users may be calculated from the obtained behavior information.

For different types of data sources, different methods for calculating the similarity can be adopted, so that the calculated similarity can better reflect the relationship between users corresponding to a certain data source.

As an alternative, the following provides a way to compute similarity for several different types of data sources.

(1) Graph-based topological structure relationships: a friend relationship can generally be represented by a connecting line between nodes, and the more common friends two nodes have, the closer their relationship. For this type of data source, the Jaccard coefficient (Jaccard similarity coefficient) can be used to calculate the similarity between two users. The Jaccard coefficient is calculated as:

$$\mathrm{Jaccard}(A, B) = \frac{|N(A) \cap N(B)|}{|N(A) \cup N(B)|}$$

where A and B denote two different users, N(A) denotes the friends of user A, |N(A) ∩ N(B)| denotes the number of common friends of users A and B, and |N(A) ∪ N(B)| denotes the number of all friends of users A and B. The larger the Jaccard coefficient, i.e., Jaccard(A, B), the higher the similarity between users A and B.
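As a small illustration of the Jaccard coefficient on friend sets, consider the following sketch. The friend lists are made up, and returning 0 when neither user has any friends is an assumption of this sketch rather than something the description specifies.

```python
def jaccard(friends_a, friends_b):
    """Jaccard coefficient of two friend sets: |common| / |all|."""
    a, b = set(friends_a), set(friends_b)
    union = a | b
    if not union:
        # Assumed convention: two users without friends get similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical friend lists: D and E are common friends; C, D, E, F are all friends.
friends = {"A": {"C", "D", "E"}, "B": {"D", "E", "F"}}
similarity = jaccard(friends["A"], friends["B"])
```

Here the two users share 2 of 4 friends in total, so the coefficient is 0.5.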

(2) Numerical attribute relationship: an object type can be determined by the value of the object attribute, and the similarity between two instances can be solved by using a kernel function mode aiming at the numerical attribute of the object.

The cosine kernel function measures the similarity of two objects by the cosine of the angle between their vectors (the value lies in the range 0 to 1 when the attribute values are non-negative). Geometrically, the smaller the angle between two vectors in a multi-dimensional space (the larger the cosine), the more nearly the vectors point in the same direction and the greater the similarity. The Gaussian kernel function is used to represent the weight of the edge connecting two nodes in the data graph structure, i.e., their similarity.

For example, let $X_i = (x_{i1}; x_{i2}; \ldots; x_{im})$ and $X_j = (x_{j1}; x_{j2}; \ldots; x_{jm})$ denote the attribute vectors of two samples, where m is the dimension of each sample vector. The similarity of the two samples under the cosine kernel function and the Gaussian kernel function, respectively, can be computed as follows:

cosine kernel function:

$$\cos(X_i, X_j) = \frac{X_i \cdot X_j}{\lVert X_i \rVert \, \lVert X_j \rVert}$$

gaussian kernel function:

$$K(X_i, X_j) = \exp\!\left(-\frac{\lVert X_i - X_j \rVert^{2}}{2\sigma^{2}}\right)$$

wherein $X_i \cdot X_j$ represents the dot product of the two vectors, $\lVert X_i \rVert \, \lVert X_j \rVert$ denotes the product of the 2-norms of the two vectors, $\lVert X_i - X_j \rVert$ denotes the Euclidean distance between the two vectors, and $\sigma$ is a parameter used to control the abrupt change in similarity caused by large differences in Euclidean distance, so that the speed at which the output of the kernel function decreases with increasing distance can be adjusted.
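The two kernels can be sketched in a few lines. The sketch assumes the common Gaussian form exp(-‖X_i - X_j‖² / (2σ²)); the parameter name `sigma`, the function names and the sample vectors are assumptions made for illustration, and returning 0 for a zero vector in the cosine kernel is a convention chosen here.

```python
import math


def cosine_kernel(x_i, x_j):
    """Cosine of the angle between two attribute vectors."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = math.sqrt(sum(a * a for a in x_i))
    norm_j = math.sqrt(sum(b * b for b in x_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_i * norm_j)


def gaussian_kernel(x_i, x_j, sigma=1.0):
    """Gaussian similarity; a larger sigma makes it decay more slowly with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical attribute vectors pointing in the same direction.
x_i = [1.0, 2.0, 3.0]
x_j = [2.0, 4.0, 6.0]
cos_sim = cosine_kernel(x_i, x_j)
gauss_narrow = gaussian_kernel(x_i, x_j, sigma=1.0)
gauss_wide = gaussian_kernel(x_i, x_j, sigma=10.0)
```

The cosine kernel ignores the difference in length between the two vectors, while the Gaussian kernel depends on their distance; with the wider `sigma` the same distance yields a higher similarity.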

(3) Document type: as an alternative, document similarity can be calculated using the bag-of-words model. The principle of the bag-of-words model is that the attribute vector of a document is obtained by counting the occurrences of each keyword in the document, so that the problem of measuring document similarity is converted into a problem of vector similarity.

It should be noted that the three similarity calculation methods above are only optional ways of computing similarity for these three types of data sources and are not the only ones. In practical applications, the scheme for calculating the similarity of the data of each data source may be configured as needed.
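A minimal sketch of the bag-of-words idea follows. The keyword vocabulary, the two documents and the choice of cosine similarity for comparing the count vectors are assumptions for illustration; the description only states that document similarity is reduced to vector similarity.

```python
import math
from collections import Counter


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary keyword occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two count vectors (0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical keyword vocabulary and user documents.
vocabulary = ["football", "music", "travel"]
vec_a = bag_of_words("football football travel", vocabulary)
vec_b = bag_of_words("football travel travel", vocabulary)
doc_similarity = cosine(vec_a, vec_b)
```

With these inputs the count vectors are [2, 0, 1] and [1, 0, 2], and their cosine similarity is 4/5.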

In practical application, when the objective function is obtained through a clustering algorithm, which clustering algorithm is specifically adopted can be determined according to actual requirements. The first clustering algorithm and the second clustering algorithm may be the same or different. In an alternative, the first clustering algorithm and the second clustering algorithm may be spectral clustering algorithms.

Spectral clustering is based on the principle of graph partitioning. The main idea is to treat all data points as nodes of a graph, with points connected by edges: the edge weight between two points that are far apart is low, and the edge weight between two points that are close together is high. The graph formed by all data points is then cut so that the sum of edge weights between different subgraphs is as low as possible and the sum of edge weights within each subgraph is as high as possible, thereby achieving clustering. Spectral clustering can cluster sample spaces of arbitrary shape and converges to the optimal solution of the partitioning objective function. The spectral clustering objective function can be expressed as:

$$\min_{U} \sum_{i,j=1}^{N} A_{ij}\,\lVert U_i - U_j \rVert^{2}$$

where $U_i$ is the indication vector of user i, i.e., a vector indicating which community user i belongs to, $A_{ij}$ is the similarity between user i and user j, i.e., the weight of the connecting edge in the network structure, and N is the number of users to be divided into communities.

Spectral clustering minimizes $\sum_{i,j=1}^{N} A_{ij}\,\lVert U_i - U_j \rVert^{2}$; that is, the indication vectors of nodes in the same network community become more similar, and the indication vectors of nodes in different communities become more dissimilar. The spectral clustering objective function can be converted into:

$$\min_{U} \operatorname{tr}(U^{T} L U)$$

where $U \in \mathbb{R}^{N \times C}$ is the community indication matrix, each row of U is the indication vector of one user, $U^{T}$ is the transpose of U, N is the number of users, C is the number of network communities, L is the normalized Laplacian matrix, A is the user similarity matrix, D is the degree matrix of A, $L = I - D^{-1/2} A D^{-1/2}$, and tr(·) denotes the trace of a matrix.
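The trace form can be illustrated with a short NumPy sketch. It assumes the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2} and the usual relaxation in which the columns of U are the eigenvectors of L with the smallest eigenvalues; the similarity matrix (two disconnected triangles) and the function names are made up for illustration.

```python
import numpy as np


def normalized_laplacian(A):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2} of a similarity matrix."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    inv_sqrt = np.zeros_like(degrees)
    nonzero = degrees > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degrees[nonzero])
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]


def spectral_embedding(A, n_communities):
    """Relaxed minimizer of tr(U^T L U): eigenvectors of the smallest eigenvalues."""
    L = normalized_laplacian(A)
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigenvalues in ascending order
    U = eigenvectors[:, :n_communities]
    return U, float(np.trace(U.T @ L @ U))


# Hypothetical similarity matrix: two triangles with no edges between them.
triangle = np.ones((3, 3)) - np.eye(3)
A = np.block([[triangle, np.zeros((3, 3))], [np.zeros((3, 3)), triangle]])
U, objective = spectral_embedding(A, n_communities=2)
```

In this toy graph the rows of U coincide within each triangle and differ between the two triangles, so a simple clustering of the rows (k-means is a common choice) recovers the two communities.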

Therefore, when the first clustering algorithm is a spectral clustering algorithm, the user similarity matrix corresponding to the main data source may be used as the similarity matrix of the spectral clustering algorithm to obtain the first objective function. Specifically, the first objective function may be expressed as:

$$\operatorname{tr}(U^{T} L_{v} U)$$

Here U denotes the community indication matrix corresponding to the main data source before solving, i.e., the term to be solved in the final objective function; after the final objective function has been optimized, U is the clustering result of the spectral clustering. $L_v$ is the normalized Laplacian matrix computed from A, where A is the user similarity matrix corresponding to the main data source, and $\operatorname{tr}(U^{T} L_{v} U)$ denotes the trace of $U^{T} L_{v} U$.

In an alternative of the present invention, when the second clustering algorithm is a spectral clustering algorithm, the sub-objective function may be:

$$\operatorname{Tr}(U^{T} L_{v',v} U)$$

where Tr denotes the trace of a matrix, U denotes the community indication matrix corresponding to the main data source before solving, $U^{T}$ denotes the transpose of U, v denotes the main data source, v′ denotes an auxiliary data source, $S_{v',v}$ denotes the relationship matrix between the auxiliary data source v′ and the main data source v, $S_{v,v'}$ denotes the transpose of $S_{v',v}$, and $A_{v'}$ denotes the user similarity matrix corresponding to the auxiliary data source v′.

The F-norm of a matrix, i.e., the Frobenius norm (also called the Euclidean norm or E-norm), is denoted $\lVert\cdot\rVert_F$. For any matrix T, its F-norm $\lVert T \rVert_F$ is the square root of the sum of the squares of the elements of T: the squares of the elements are summed first, and the square root is then taken.

For any auxiliary data source v′, suppose two nodes (i.e., social network users) belong to the same community in v′. If these two nodes are also nodes of the main data source, that is, if they are common nodes of the main data source and the auxiliary data source, then their indication vectors should be as similar as possible when community division is performed based on the main data source. The elements of the relationship matrix identify the user relationship (such as whether they are the same user) between the users corresponding to the main data source and the users corresponding to the auxiliary data source. The indication vectors of the nodes in each auxiliary data source can therefore be expressed through the indication vectors of the nodes in the main data source and the relationship matrix between that auxiliary data source and the main data source.

Still taking fig. 3 as an example, node P₂ and node P₃ are two nodes common to source1 and source2. In the main data source source2, node P₂ and node P₃ are unrelated, i.e., their similarity is zero, while in source1 node P₂ and node P₃ are related. The relationship in source1 between users that it shares with the main data source can therefore be used to assist the clustering based on the main data source; that is, the influence of the auxiliary data source is fused into a clustering result that is dominated by the main data source, which improves the accuracy of the clustering result.

Specifically, if the community indication matrix corresponding to the main data source is U and the relationship matrix between the auxiliary data source v′ and the main data source v is $S_{v',v}$, the indication matrix corresponding to the auxiliary data source v′ may be expressed as $S_{v',v} U$, i.e., the product of the relationship matrix and the community indication matrix corresponding to the main data source. When spectral clustering is performed based on the auxiliary data source v′, the aim is to make $\sum_{i,j} A_{v'}(i,j)\,[(S_{v',v}U)_i - (S_{v',v}U)_j]^{2}$, i.e., $\operatorname{Tr}(U^{T} L_{v',v} U)$, as small as possible.

Here $A_{v'}(i,j)$ denotes the similarity between user i and user j in the auxiliary data source v′, that is, the value of the element corresponding to user i and user j in the similarity matrix corresponding to the auxiliary data source v′.
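As a worked step, the pairwise sum and the trace expression for an auxiliary data source can be connected as below. This is a sketch under stated assumptions: $A_{v'}$ is taken to be symmetric, $D_{v'}$ is its degree matrix, and the unnormalized Laplacian $D_{v'} - A_{v'}$ is used only because it makes the identity exact. The definition of $L_{v',v}$ actually used in this embodiment is not reproduced here and may differ, for example by a normalization.

```latex
\begin{aligned}
\sum_{i,j} A_{v'}(i,j)\,\bigl\lVert (S_{v',v}U)_i - (S_{v',v}U)_j \bigr\rVert^{2}
  &= 2\,\operatorname{Tr}\!\bigl((S_{v',v}U)^{T}\,(D_{v'} - A_{v'})\,(S_{v',v}U)\bigr) \\
  &= 2\,\operatorname{Tr}\!\bigl(U^{T}\,S_{v,v'}\,(D_{v'} - A_{v'})\,S_{v',v}\,U\bigr)
\end{aligned}
```

Under these assumptions, the pairwise sum is minimized by the same U that minimizes a trace of the form $\operatorname{Tr}(U^{T} L_{v',v} U)$ with $L_{v',v} = S_{v,v'}\,L_{v'}\,S_{v',v}$, where $L_{v'}$ is a Laplacian of $A_{v'}$. This shows how the relationship matrix carries the auxiliary source's graph structure over to the users of the main data source.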

In an alternative aspect of the present invention, the second objective function may be:

$$\sum_{v'=1,\,v'\neq v}^{V} \operatorname{Tr}(U^{T} L_{v',v} U)$$

where V denotes the total number of main and auxiliary data sources, i.e., the number of data sources. In an alternative of the present invention, if the second objective function is obtained based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source, the second objective function may be:

$$\sum_{v'=1,\,v'\neq v}^{V} \mu_{v'} \operatorname{Tr}(U^{T} L_{v',v} U)$$

where V denotes the total number of main and auxiliary data sources, i.e., the number of data sources, and $\mu_{v'}$ denotes the weight corresponding to the auxiliary data source v′.
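A weighted sum of sub-objectives of this kind can be evaluated as in the following sketch. It assumes that each sub-objective is computed as Tr((S U)^T L_aux (S U)), with L_aux a Laplacian of the auxiliary similarity matrix and S the relationship matrix; the function name, the tiny matrices and the weights are hypothetical, and U is a plain 0/1 indicator matrix chosen for readability rather than a solved indication matrix.

```python
import numpy as np


def second_objective(U, aux_laplacians, relations, weights):
    """Weighted sum over auxiliary sources of Tr((S U)^T L_aux (S U))."""
    total = 0.0
    for L_aux, S, mu in zip(aux_laplacians, relations, weights):
        SU = S @ U  # indication vectors of the auxiliary source's users
        total += mu * float(np.trace(SU.T @ L_aux @ SU))
    return total


# Laplacian of a single edge between two equally weighted users.
L_pair = np.array([[1.0, -1.0], [-1.0, 1.0]])
# Auxiliary source 1 links main users 0 and 2; auxiliary source 2 links users 1 and 2.
S_1 = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
S_2 = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
weights = [0.75, 0.25]

U_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # user 2 grouped with user 0
U_b = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # user 2 grouped with user 1
value_a = second_objective(U_a, [L_pair, L_pair], [S_1, S_2], weights)
value_b = second_objective(U_b, [L_pair, L_pair], [S_1, S_2], weights)
```

The division that agrees with the more heavily weighted auxiliary source (`U_a`) yields the smaller value, which is the sense in which the weights express how strongly each auxiliary source steers the result.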

In an alternative aspect of the present invention, the method may further comprise:

correspondingly, the obtaining of the final objective function based on the first objective function and the second objective function specifically includes:

That is to say, the final objective function may further include a regularization term of the weight vector, the weight vector being one of the terms to be solved in the final objective function. When the final objective function is optimized with this regularization term, the weight corresponding to each auxiliary data source can be solved automatically. The regularization term makes the weight vector sparse, so that data sources unrelated to the main data source (which contain a large amount of noise and irrelevant information) are removed, while the weights of the related auxiliary data sources, i.e., the degree of influence of each auxiliary data source on the main data source, are solved automatically. This scheme thus screens the auxiliary data sources automatically: auxiliary data sources related to the main data source are retained and unrelated data sources are removed. Specifically, the weight of a data source that is screened out is 0, so its influence on the community division result is 0.

In an alternative of the present invention, the regularization term of the weight vector may be $\beta\lVert\mu\rVert_2^2$, where μ denotes the weight vector, β denotes the first regularization coefficient, and $\lVert\mu\rVert_2^2$ denotes the square of the 2-norm of μ. The 2-norm of a vector is commonly used to compute the length of the vector; specifically, it is the square root of the sum of the squares of the absolute values of the vector elements.

In practical applications, the L2 regularization term can be used to make the weight vector sparse. Compared with an L1 regularization term, it controls the sparsity of the weight vector more reasonably and avoids removing too many auxiliary data sources (i.e., too many weights becoming zero after optimization) because the sparsity is too strong. β can be configured as needed, and its value can be adjusted for different application scenarios. The regularization coefficient generally takes a value greater than zero; for example, β may be set to a value in the range $10^{-5}$ to $10^{5}$.
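How an L2 term together with non-negative weights that sum to 1 can switch off an auxiliary source may be illustrated as follows. The sketch rests on an assumption that is not spelled out in this passage: with U fixed, the weights are taken to minimize Σ μ_v′ h_v′ + β‖μ‖² subject to μ ≥ 0 and Σ μ_v′ = 1, where h_v′ stands for the value of the sub-objective of source v′. That problem is equivalent to projecting −h/(2β) onto the probability simplex; the function names and the cost values are hypothetical.

```python
def project_to_simplex(y):
    """Euclidean projection of y onto {mu : mu >= 0, sum(mu) = 1} (sort-based method)."""
    tau = 0.0
    running_sum = 0.0
    for k, value in enumerate(sorted(y, reverse=True), start=1):
        running_sum += value
        candidate = (running_sum - 1.0) / k
        if value - candidate > 0:
            tau = candidate  # threshold of the last index that stays positive
    return [max(v - tau, 0.0) for v in y]


def solve_weights(costs, beta):
    """Minimize sum(mu * cost) + beta * ||mu||^2 over the simplex (assumed subproblem)."""
    return project_to_simplex([-c / (2.0 * beta) for c in costs])


# Hypothetical sub-objective values: the third source fits the main source poorly.
costs = [1.0, 2.0, 10.0]
mu_sparse = solve_weights(costs, beta=1.0)
mu_smooth = solve_weights(costs, beta=100.0)
```

With the smaller β the poorly fitting source receives weight 0, while the larger β spreads the weights more evenly and keeps all three sources, which matches the role of β as the knob that controls how many auxiliary sources are removed.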

correspondingly, in step S150, obtaining a final objective function based on the first objective function and the second objective function may include:

Introducing supervision information into clustering benefits community discovery and improves the accuracy of community division. Existing supervision information includes class labels and pairwise constraint information (Must-link and Cannot-link, which respectively indicate that two users must belong to the same community and must belong to different communities). In practical applications, the cost of collecting pairwise constraint information is lower than that of collecting class labels, the cost of collecting Must-links is lower than that of collecting Cannot-links, and in a social network the number of Must-links collected is smaller than the number of Cannot-links.

Specifically, while the multi-source social network data is collected, some supervision information such as Must-link information can also be obtained. Using the Must-link supervision information as constraint information for community discovery can effectively improve the community discovery result.

In an alternative scheme of the present invention, constructing a constraint function according to the Must-link supervision information may specifically include:

Optionally, the elements of each row of the constraint matrix represent one piece of Must-link supervision information, the number of rows of the constraint matrix is the number of pieces of Must-link supervision information, and the number of columns of the constraint matrix is the number of social network users corresponding to the main data source.

In the embodiment of the invention, the Must-link supervision information is converted into a constraint matrix $M \in \mathbb{R}^{n \times N}$, where n is the number of Must-links, i.e., the number of pieces of Must-link supervision information, and N is the number of social network users corresponding to the main data source. One piece of Must-link information (corresponding to one row of the constraint matrix) is expressed as follows:

(1 -1 0…0)

This expression indicates that node 1 (the user corresponding to the first column in the constraint matrix) and node 2 (the user corresponding to the second column in the constraint matrix) necessarily belong to the same community.
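The construction of the constraint matrix can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_must_link_matrix` and `mat_mul`, the 0-based user indices, and the toy indication matrix `U` are all assumptions introduced here. Each Must-link pair (i, j) becomes one row with 1 in column i and -1 in column j, so a row of MU is zero exactly when the two users have the same indicator vector.

```python
def build_must_link_matrix(pairs, num_users):
    """Return an n x N matrix with one row per Must-link pair (i, j).

    Row k holds +1 in column i and -1 in column j, zeros elsewhere.
    Indices are 0-based here, which is an assumption of this sketch.
    """
    matrix = []
    for i, j in pairs:
        if i == j or not (0 <= i < num_users and 0 <= j < num_users):
            raise ValueError(f"invalid Must-link pair: {(i, j)}")
        row = [0] * num_users
        row[i] = 1
        row[j] = -1
        matrix.append(row)
    return matrix


def mat_mul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical data: users 0 and 1 are Must-linked, as are users 2 and 3.
M = build_must_link_matrix([(0, 1), (2, 3)], num_users=5)

# Toy indication matrix U (5 users, 2 communities); rows 0 and 1 agree, rows 2 and 3 do not.
U = [[1, 0], [1, 0], [0, 1], [1, 0], [0, 1]]
Z = mat_mul(M, U)

print(M[0])  # [1, -1, 0, 0, 0]
print(Z)     # [[0, 0], [-1, 1]]
```

The zero first row of Z shows a satisfied Must-link; the non-zero second row shows a violated one.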

For the Must-link information to help the clustering, the constraint function MU = 0 may be added, which constrains node pairs that necessarily belong to the same community to have the same indicator vector. However, this equality constraint has two defects: first, an overly strong constraint distorts the clustering result; second, uncertain or inferred Must-link supervision information cannot be expressed. Therefore, in practical application, L₁ regularization can be applied to relax the strength of the constraint function while controlling the number of Must-links that satisfy the condition.

Thus, in an alternative, γ‖Z‖₁ may be used as the constraint function, where γ is a second regularization coefficient, MU = Z, M represents the constraint matrix, U represents the community indication matrix before solving corresponding to the main data source, and ‖Z‖₁ represents the 1-norm of Z. Similarly, γ can be configured according to actual needs; for example, γ can be set to a value in the range 10⁻⁵ to 10⁵ as required.

Here, the 1-norm of a matrix, also called the column-sum norm, is the maximum over all columns of the sum of the absolute values of the elements in a column.
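As a small illustration of the constraint function γ‖Z‖₁ under the column-sum definition given above, the following sketch evaluates the norm and the resulting penalty. The helper names and the values of Z and γ are hypothetical and chosen only for the example.

```python
def column_sum_norm(z):
    """Matrix 1-norm: the largest sum of absolute values over the columns of z."""
    return max(sum(abs(v) for v in col) for col in zip(*z))


def constraint_penalty(z, gamma):
    """Value of the relaxed constraint term gamma * ||Z||_1."""
    return gamma * column_sum_norm(z)


# Hypothetical Z = MU with three Must-link rows and two communities.
Z = [[0.0, 0.0],
     [-1.0, 1.0],
     [0.5, -2.0]]

print(column_sum_norm(Z))  # 3.0 (column sums of absolute values are 1.5 and 3.0)
print(constraint_penalty(Z, gamma=0.1))
```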

In an alternative of the present invention, the final objective function may include a first objective function, a second objective function, a regularization term of the weight vector, and a constraint function.

Specifically, the mathematical expression of the final objective function can be expressed as:

the first objective function + the second objective function + the regularization term of the weight vector + the constraint function.

Here, the third term can adopt an L2 regular term, namely the β‖μ‖₂² term described above.

As an important step in the automatic screening (autofilter) technique, the purpose of adding the L₂ regularization term of the weight vector μ to the objective function is to make the weight vector sparse, eliminate irrelevant data sources (containing a large amount of noise and irrelevant information), and automatically find the importance of the relevant data sources. Here, β controls the sparsity of μ: when the value of β is small, only one data source has a non-zero weight, and when the value of β is large, all the data sources tend to be

When β is between these two values, a sparse μ will be obtained. In addition, each type of auxiliary data source corresponds to a sub-objective function w_v′. When the spectral clustering algorithm is selected, the sub-objective function is w_v′ = Tr(U^T L_v′,v U) = Σ_i,j A_v′(i,j)[(S_v′,v U)_i − (S_v′,v U)_j]². As can be seen from the foregoing description, a greater w_v′ means that the data source v′ contains more noise and irrelevant information, and the greater w_v′ is, the more likely the weight of the data source v′ is to be set to 0. Therefore, based on

data sources containing a large amount of noise and irrelevant information can also be deleted, and the weights of the relevant data sources are obtained automatically: the smaller w_v′ is, the larger the weight μ_v′ of the corresponding auxiliary data source is.
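The sub-objective function w_v′ can be illustrated with a short sketch. Two assumptions are made here because the defining formula of L_v′,v is not reproduced in this text: the unnormalized graph Laplacian D − A_v′ is used, and L_v′,v is taken to be S_v′,v^T (D − A_v′) S_v′,v. Under these assumptions the trace form equals one half of the pairwise sum (the constant factor depends on the Laplacian convention). All function names and data below are hypothetical.

```python
def mat_mul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def laplacian(a):
    """Unnormalized Laplacian D - A (an assumption of this sketch)."""
    n = len(a)
    return [[(sum(a[i]) if i == j else 0.0) - a[i][j] for j in range(n)] for i in range(n)]


def sub_objective_trace(u, s, a):
    """Tr((SU)^T (D - A) (SU)), the trace form of the sub-objective."""
    su = mat_mul(s, u)
    product = mat_mul(mat_mul(transpose(su), laplacian(a)), su)
    return sum(product[i][i] for i in range(len(product)))


def sub_objective_pairwise(u, s, a):
    """0.5 * sum_ij A(i,j) * ||(SU)_i - (SU)_j||^2, the pairwise form."""
    su = mat_mul(s, u)
    total = 0.0
    for i, row_i in enumerate(su):
        for j, row_j in enumerate(su):
            total += a[i][j] * sum((p - q) ** 2 for p, q in zip(row_i, row_j))
    return 0.5 * total


# Hypothetical data: 3 main users, 2 auxiliary users (same persons as main users 0 and 2).
S = [[1, 0, 0], [0, 0, 1]]
A = [[0.0, 0.8], [0.8, 0.0]]          # the two auxiliary users are similar
U_agree = [[1, 0], [1, 0], [1, 0]]    # similar users share a community
U_split = [[1, 0], [1, 0], [0, 1]]    # similar users are separated

print(sub_objective_trace(U_agree, S, A))  # 0.0
print(sub_objective_trace(U_split, S, A))
print(sub_objective_pairwise(U_split, S, A))
```

An indication matrix that separates users whom the auxiliary source considers similar yields a larger w_v′, which is the quantity the weighting scheme penalizes.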

In an optional embodiment of the present invention, when both the first clustering algorithm and the second clustering algorithm are spectral clustering algorithms, the final objective function may be written as:

Here, s.t. denotes the constraint conditions of the final objective function, and v′ ≠ v indicates that v′ is not the main data source, i.e. v′ is an auxiliary data source. The first term in the final objective function is the first objective function; the second term is an alternative form of the second objective function; the third term is the term corresponding to the automatic screening technique described above, namely the regular term of the weight vector, which is used to determine the weight of each auxiliary data source automatically; and the fourth term is an alternative form of the constraint function obtained based on the Must-link supervision information.

In an alternative scheme of the present invention, solving the final objective function to obtain a solved community indication matrix may include:

To solve the final objective function, the embodiment of the invention uses an iterative algorithm to obtain a relatively optimal solution. In theory, the target problem can be decomposed into two sub-problems: predicting the indication matrix and automatically solving the weights. When the final objective function includes the weight vector μ, the community indication matrix U before solving, and the variable Z of the constraint function (Z = MU), the first sub-problem is solved by fixing μ and iteratively optimizing U and Z using the ADMM (alternating direction method of multipliers) method. The second sub-problem is solved by fixing U and Z and obtaining a closed-form solution of μ using the Lagrange multiplier method. These two processes are repeated until the preset convergence condition is met. Similarly, when the final objective function includes only the weight vector μ and the community indication matrix U before solving, μ may be fixed and U iteratively optimized using the ADMM method, then U may be fixed and a closed-form solution of μ obtained using the Lagrange multiplier method, and these two processes are repeated until the preset convergence condition is met.

Taking the case where the final objective function includes the weight vector μ, the community indication matrix U before solving, and the constraint function Z as an example, in the alternative of the present invention, solving the final objective function by using ADMM and the Lagrange multiplier method to obtain the solved community indication matrix specifically includes:

As can be seen from the foregoing description, the convergence condition can be configured according to actual requirements.

The following final objective function is taken as an example to explain a specific optimization solving process of the final objective function:

The specific optimization procedure for the final objective function comprises the following steps:

● Fix μ and iteratively update U and Z using the ADMM method. The sub-problem to be solved is:

s.t. MU = Z

The augmented Lagrangian form of the above problem is:

where ρ is the penalty coefficient and Y is the Lagrange multiplier.

With μ fixed, the iterative updates of U and Z are as follows:

Y = Y + ρ(MU − Z)

where shrink denotes the soft-threshold function, defined as:

shrink(x, y) = sign(x) ⊙ max{|x| − y, 0}

It can be understood that x and y in shrink(x, y) are merely two placeholder parameters. sign is the sign function, which takes the value 1 when x is greater than 0, 0 when x is equal to 0, and −1 when x is less than 0. If x and y are both scalars, shrink(x, y) = sign(x)·max{|x| − y, 0}, i.e. the product of two numbers. In this scheme, x corresponds to the matrix Y + ρMU and y corresponds to γ, so sign(x) and |x| − y are both matrices, and ⊙ denotes the Hadamard product, i.e. the multiplication of the corresponding elements of the two matrices on either side of it.

As an example, take y = 0.5. sign(x) is evaluated element by element, and max takes the element-wise maximum between |x| − y and the zero matrix; shrink(x, y) is then the Hadamard product of these two results.
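The soft-threshold function can be sketched in a few lines. The matrix `x` below is a hypothetical input chosen for illustration (the example matrices of the original are not reproduced in this text); only the threshold y = 0.5 is taken from the description.

```python
def sign(v):
    """Sign function: 1 for v > 0, 0 for v == 0, -1 for v < 0."""
    return (v > 0) - (v < 0)


def shrink(x, y):
    """Element-wise soft threshold: sign(x) * max(|x| - y, 0) for a matrix x and scalar y."""
    return [[sign(v) * max(abs(v) - y, 0.0) for v in row] for row in x]


# Hypothetical input matrix; entries with magnitude below y are set to zero,
# the others are moved towards zero by y.
x = [[1.5, 0.2],
     [0.0, -2.0]]

print(shrink(x, 0.5))  # [[1.0, 0.0], [0.0, -1.5]]
```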

● Fix U and Z and update μ. The sub-problem to be solved at this time is:

where w = [w₁, w₂, …, w_V] excluding w_v is a (V−1)×1 vector, V is the total number of main and auxiliary data sources, w^T is the transpose of w, and w_v′ represents Tr(U^T L_v′,v U), i.e. the sub-objective function corresponding to the auxiliary data source v′. Assuming that the elements of w are sorted in non-descending order, i.e. w₁ ≤ w₂ ≤ … ≤ w_V, applying the Lagrange multiplier method gives the following closed-form solution of the sub-problem:

where argmax_v′ (θ − w_v′ > 0), denoted p, represents the maximum value of v′ satisfying the condition θ − w_v′ > 0.
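The weight update can be sketched under an explicit assumption, since the sub-problem and its closed-form solution are not reproduced in this text. The sketch assumes the sub-problem is to minimize w^T μ + β‖μ‖₂² subject to μ ≥ 0 and the weights summing to 1. For that assumed problem, the Lagrange multiplier method gives μ_v′ = (θ − w_v′)/(2β) for the p smallest w_v′ and 0 otherwise, with θ = (2β + w₁ + … + w_p)/p. The function name `solve_weights` and the sample values are hypothetical.

```python
def solve_weights(w, beta):
    """Closed-form minimizer of w.mu + beta*||mu||^2 over the probability simplex.

    The simplex constraint (mu >= 0, sum(mu) == 1) is an assumption of this sketch.
    """
    order = sorted(range(len(w)), key=lambda i: w[i])  # non-descending order of w
    theta, p, prefix = 0.0, 0, 0.0
    for k, idx in enumerate(order, start=1):
        prefix += w[idx]
        candidate = (2.0 * beta + prefix) / k
        if candidate - w[idx] > 0:      # keep the largest k with theta - w_k > 0
            theta, p = candidate, k
    mu = [0.0] * len(w)
    for rank, idx in enumerate(order):
        if rank < p:
            mu[idx] = (theta - w[idx]) / (2.0 * beta)
    return mu


# Hypothetical sub-objective values: the third auxiliary source is very noisy.
w = [0.2, 0.4, 5.0]

print(solve_weights(w, beta=1.0))     # the noisy source gets weight 0
print(solve_weights(w, beta=0.01))    # small beta: a single non-zero weight
print(solve_weights(w, beta=1000.0))  # large beta: weights close to each other
```

The three calls mirror the behaviour described for β: a small value keeps one data source, a large value spreads the weight across all sources, and an intermediate value yields a sparse μ.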

The above optimization steps, namely fixing μ and iteratively updating U and Z, then fixing U and Z and updating μ, are repeated until convergence, at which point the solved U is obtained.

In the alternative scheme of the invention, the community division result of the network community is obtained based on the solved community indication matrix, which comprises the following steps:

The community indication matrix obtained directly by optimizing the final objective function may not fully indicate the community to which each sample belongs. For example, when clustering is performed using a spectral clustering algorithm, the solved community indication matrix generally cannot fully indicate the attribution of each sample. Therefore, after the solved community indication matrix is obtained, a conventional clustering, for example K-Means clustering, needs to be performed on its rows to further improve the effect of community division.
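The re-clustering of the rows of the solved community indication matrix can be sketched with a small K-Means (Lloyd's algorithm) written in plain Python. The deterministic farthest-point initialization, the function names, and the matrix `U_solved` are assumptions made for this illustration; in practice any standard K-Means implementation could be used.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans_rows(rows, k, iterations=100):
    """Cluster the rows of a matrix with Lloyd's algorithm; returns one label per row."""
    # Farthest-point initialization, chosen here so the sketch is deterministic.
    centers = [list(rows[0])]
    while len(centers) < k:
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [r for r, label in zip(rows, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical relaxed indication matrix: 4 users, 2 communities.
U_solved = [[0.9, 0.1],
            [0.8, 0.2],
            [0.1, 0.9],
            [0.2, 0.7]]

print(kmeans_rows(U_solved, k=2))  # [0, 0, 1, 1]
```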

In conclusion, the method provided by the embodiment of the invention can fuse the data of multiple data sources well, can further screen out data sources containing a large amount of noise and irrelevant information, and can further incorporate Must-link supervision information to improve the accuracy of community discovery. The method can be applied to various application scenarios in which classification is needed. In practical application, a main data source, namely a target data source, can be selected according to the actual application requirements, and the other data sources are used as auxiliary data sources, so that a final clustering result guided by the main data source is obtained. For example, the method can be applied to classifying users of an instant messaging application, so that better services can be provided to the users based on the classification result, such as delivering advertisements that better match the users' needs.

To better illustrate the provided aspects of embodiments of the present invention, further description is provided below with reference to a specific example. The scheme of the embodiment of the invention can be applied to community division of social network users (hereinafter referred to as users) in instant messaging software, and the example specifically takes application a as an example, in which a spectral clustering algorithm is adopted by a first clustering algorithm and a second clustering algorithm. The method for community division of users in application a according to the solution of the embodiment of the present invention may specifically include:

First, multi-source social network data of the users in application A is obtained. In this example, the at least two data sources include three data sources, namely the friend relationships, the user attributes, and the user behavior information of the users; correspondingly, the multi-source social network data includes data corresponding to the friend relationships, data corresponding to the user attributes, and data corresponding to the user behavior information. In this example, the friend relationships of the users are used as the main data source, and the user attributes and the user behavior information are used as two auxiliary data sources. In addition, in the process of acquiring the multi-source social network data, some Must-link supervision information can be collected for constructing the constraint function.

After the multi-source social network data is obtained, for a friend relationship data source, a user similarity matrix corresponding to the data source is calculated based on data corresponding to the friend relationship, elements in the matrix represent the similarity between two users corresponding to the data source, and the similarity between the two users can be calculated by adopting a Jaccard coefficient. And taking the user similarity matrix corresponding to the friend relation data source as a similarity matrix of a spectral clustering algorithm to obtain a first objective function based on the main data source.
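A user similarity matrix based on the Jaccard coefficient can be sketched as follows. The sketch assumes each user is represented by the set of their friends and compares those sets; the user names, the choice not to include a user in their own friend set, and the function name are hypothetical.

```python
def jaccard_similarity_matrix(friends):
    """Return (users, matrix) where matrix[i][j] is |N(i) & N(j)| / |N(i) | N(j)|.

    `friends` maps each user to the set of that user's friends.
    The similarity is defined as 0.0 when both friend sets are empty.
    """
    users = sorted(friends)
    matrix = []
    for a in users:
        row = []
        for b in users:
            union = friends[a] | friends[b]
            common = friends[a] & friends[b]
            row.append(len(common) / len(union) if union else 0.0)
        matrix.append(row)
    return users, matrix


# Hypothetical friend relationships for four users.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"carol"},
}

users, similarity = jaccard_similarity_matrix(friends)
print(users)             # ['alice', 'bob', 'carol', 'dave']
print(similarity[0][3])  # 0.5 (alice and dave share carol; the union is {bob, carol})
```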

In order to obtain the second objective function, a relationship matrix between each auxiliary data source and the main data source and a user similarity matrix corresponding to each auxiliary data source need to be calculated. In this example, the relationship matrix corresponding to each auxiliary data source may be obtained in the manner shown in fig. 3 in the foregoing. Specifically, for example, for a data source of the user attribute, if a user corresponding to the data source is the same user as a user corresponding to the friend relationship data source, a corresponding element in the relationship matrix is 1, and if the user is not the same user, the corresponding element in the relationship matrix is 0. For the data source corresponding to the user attribute, a cosine kernel function, for example, may be used to calculate the similarity between the users corresponding to the data source, so as to obtain a matrix for similarity corresponding to the data source, and for the user behavior information, a scheme, such as a bag-of-words model, may be used to calculate the similarity of data of different user behavior information, and the similarity may represent the similarity between different users, so as to obtain a user similarity matrix corresponding to the data source.
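The relationship matrix and a cosine-based user similarity matrix for an auxiliary data source can be sketched as below. The user identifiers, the feature vectors, and the function names are hypothetical; rows of the relationship matrix correspond to users of the auxiliary data source and columns to users of the main data source, with 1 marking the same user.

```python
import math


def relationship_matrix(aux_users, main_users):
    """Element is 1 if the auxiliary-source user and the main-source user are the same user."""
    return [[1 if a == m else 0 for m in main_users] for a in aux_users]


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between feature vectors (0.0 for a zero vector)."""
    norms = [math.sqrt(sum(v * v for v in f)) for f in features]
    matrix = []
    for f, nf in zip(features, norms):
        row = []
        for g, ng in zip(features, norms):
            dot = sum(p * q for p, q in zip(f, g))
            row.append(dot / (nf * ng) if nf and ng else 0.0)
        matrix.append(row)
    return matrix


# Hypothetical users: the attribute source covers only two of the three main users.
main_users = ["u1", "u2", "u3"]
aux_users = ["u3", "u1"]
S = relationship_matrix(aux_users, main_users)
print(S)  # [[0, 0, 1], [1, 0, 0]]

# Hypothetical numeric attribute vectors for the two auxiliary-source users.
attributes = [[1.0, 0.0], [1.0, 1.0]]
A = cosine_similarity_matrix(attributes)
print(A[0][1])  # close to 0.7071
```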

Then, a sub-targeting function corresponding to the data source can be obtained based on the relationship matrix and the user similarity matrix corresponding to the data source, which is the user attribute, and a sub-targeting function corresponding to the data source can be obtained based on the relationship matrix and the user similarity matrix corresponding to the data source, which is the user behavior information. And then, based on the weights respectively corresponding to the two auxiliary data sources (i.e. the weights respectively corresponding to the two sub-objective functions), obtaining a second objective function by means of weighted summation of the two sub-objective functions.

In this example, the influence of each auxiliary data source may be controlled by constructing a regular term of the weight vector, and a constraint function may be constructed based on the collected Must-link supervision information. Then, the final objective function to be optimized is obtained based on the first objective function, the second objective function, the regular term of the weight vector, and the constraint function, and the community indication matrix of the users corresponding to the friend relationship data source is obtained by optimizing the final objective function. By re-clustering the community indication matrix, the obtained clustering result can be used as the final community division result, which achieves community division with the data corresponding to the friend relationships of the users as the main data and the other two kinds of data as auxiliary data. It should be noted that, in practical applications, the execution order of the steps in the embodiment of the present invention is not absolute and may be changed. In the above example, the steps of determining the first objective function and the second objective function need not be executed in a fixed order; for example, the first objective function may be determined after the calculation of the user similarity matrix corresponding to the main data source is completed. In practical applications, the execution order of the steps can be flexibly adjusted or the steps can be interleaved, which is clear to those skilled in the art and is not enumerated here.

It can be understood that, in the embodiment of the present invention, network community discovery means obtaining the various "circles of people" formed by the network interaction behavior of users by mining the users' social network interaction data. A network community is a set of nodes in the network with higher similarity or closely related connections: the connections between nodes inside the same network community are relatively tight, and the connections between nodes of different network communities are relatively sparse. One network community can be regarded as a "group" or a "cluster".

It can be seen that the discovery of the network community is a discovery of a community structure implemented based on social network data of a user, and a user community based on a certain relationship is determined by mining a certain relationship (such as a user relationship, content posted by the user, attention or a friend relationship of the user in the social network, and the like) between individuals in the social network. For example, based on the community discovery of the friend relationship of the users in the social network, the network community based on the user connection can be obtained, the connection among the users in the same network community is relatively close, and the connection among the users in different network communities is relatively sparse. For another example, based on the community discovery of the user interests, network communities divided based on the interests can be obtained, the interests of the users in the same network community are similar, and the interests of the users in different network communities are greatly different. The method can also be used for community discovery based on attribute data of users, wherein attribute differences among users in the same network community are relatively small, and attribute differences among users in different network communities are relatively large.

Based on the same principle as the network community discovery method provided by the embodiment of the present invention, the embodiment of the present invention further provides a network community discovery apparatus, as shown in fig. 4, the network community discovery apparatus 100 may include a multi-source social data obtaining module 110, a data source relationship determining module 120, and a community division result determining module 130. Specifically, the method comprises the following steps:

the multi-source social data obtaining module 110 is configured to acquire multi-source social network data of a social network user, where the multi-source social network data includes data corresponding to at least two data sources;

a data source relationship determining module 120, configured to determine, based on a user relationship between a social network user corresponding to each type of auxiliary data source and a social network user corresponding to a main data source, an association relationship between each type of auxiliary data source and the main data source, respectively, where the main data source is one of at least two specified data sources, and the auxiliary data source is a data source other than the main data source of the at least two data sources;

the community division result determining module 130 is configured to cluster the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the main data source, so as to obtain the division result of the network community of the social network users corresponding to the main data source.

Optionally, the data source relationship determining module is specifically configured to:

Optionally, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network user corresponding to the row where the element is located in the relationship matrix and the social network user corresponding to the column where the element is located are the same user.

Optionally, the community division result determining module is specifically configured to:

and obtaining the network community division result of the social network user corresponding to the main data source based on the solved community indication matrix.

Optionally, the number of rows of the solved community indication matrix is the number of social network users corresponding to the main data source, and the number of columns of the solved community indication matrix is the number of pre-divided network communities.

Optionally, when the community division result determining module obtains the second objective function through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, the community division result determining module is specifically configured to:

Optionally, when the community division result determining module obtains the first objective function through the first clustering algorithm based on the data of the main data source, the community division result determining module is specifically configured to:

When the community division result determining module obtains the sub-objective function corresponding to each auxiliary data source through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, it is specifically configured to:

Optionally, the second clustering algorithm is a spectral clustering algorithm, and the sub-objective functions are:

Tr(U ^T L _v′,v U)

wherein,

Tr represents the trace of a matrix, U represents the community indication matrix before solving corresponding to the main data source, U^T represents the transpose of U, v represents the main data source, v′ represents an auxiliary data source, S_v′,v represents the relationship matrix between the auxiliary data source v′ and the main data source v, S_v,v′ represents the transpose of S_v′,v, and A_v′ represents the user similarity matrix corresponding to the auxiliary data source v′.

Optionally, if the second objective function is a function obtained based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source, the second objective function is:

wherein,

Optionally, the apparatus further comprises:

the first regular term building module is used for building regular terms of weight vectors, and the weight vectors are vectors formed by weights corresponding to each auxiliary data source;

Optionally, the regular term of the weight vector is:

β‖μ‖₂²

where μ denotes the weight vector, β denotes a first regularization coefficient, and ‖μ‖₂² represents the square of the 2-norm of μ.

Optionally, the apparatus further includes a constraint function construction module, where the constraint function construction module is configured to:

acquiring Must-link supervision information, where the Must-link supervision information is used for identifying that two social network users belong to the same network community;

Optionally, when the constraint function building module builds the constraint function according to the Must-link supervision information, the constraint function building module is specifically configured to:

constructing a constraint matrix according to the Must-link supervision information;

Optionally, the elements of each row in the constraint matrix represent one piece of Must-link supervision information, the number of rows of the constraint matrix is the number of pieces of Must-link supervision information, and the number of columns of the constraint matrix is the number of social network users corresponding to the main data source.

Optionally, the constraint function is:

γ||Z|| ₁

where γ is a second regularization coefficient, MU = Z, M represents the constraint matrix, U represents the community indication matrix before solving corresponding to the main data source, and ‖Z‖₁ represents the 1-norm of Z.

Optionally, the final objective function includes a first objective function, a second objective function, a regular term of the weight vector, and a constraint function.

Optionally, the community division result determining module is specifically configured to, when solving the final objective function to obtain a solved community indication matrix:

and solving the final objective function by using ADMM (alternating direction method of multipliers) and the Lagrange multiplier method to obtain the solved community indication matrix.

Optionally, the community division result determining module is specifically configured to, when solving the final objective function by using ADMM and the Lagrange multiplier method to obtain the solved community indication matrix:

Optionally, when the community division result determining module obtains the community division result of the network community based on the solved community indication matrix, the community division result determining module is specifically configured to:

The device provided by the embodiment of the invention can be applied to various electronic devices, such as mobile terminal devices, fixed terminal devices and servers.

It is understood that the above modules in the apparatus in the embodiments of the present disclosure have functions of implementing corresponding steps in the method shown in any embodiment of the present disclosure, and the functions may be implemented by hardware or by hardware executing corresponding software, and the hardware or software includes one or more modules corresponding to the above functions. The modules can be realized independently, and also can be realized by integrating a plurality of modules. For the detailed functional description of the data processing apparatus, reference may be made to the corresponding description in the foregoing method, and details are not repeated here.

Based on the same principle as the data processing method and the data processing apparatus provided by the embodiment of the present invention, an embodiment of the present invention also provides an electronic device, which may include a processor and a memory. Wherein the memory has stored therein readable instructions, which when loaded and executed by the processor, may implement the method shown in any of the embodiments of the present invention.

Embodiments of the present invention further provide a computer-readable storage medium, where readable instructions are stored, and when the readable instructions are loaded and executed by a processor, the method shown in any embodiment of the present invention is implemented.

Fig. 5 is a schematic structural diagram of an electronic device applicable to the embodiment of the present invention, and as shown in fig. 5, the electronic device may specifically be a server, and the server may be used to implement the method for discovering a network community shown in any embodiment of the present invention.

Specifically, as shown in fig. 5, the server 2000 may mainly include at least one processor 2001, a memory 2002, a network interface 2003, an input/output interface 2004, and the like. The components may communicate with each other via a bus 2005.

In particular, the memory 2002 may be used to store an operating system, application programs, etc., which may include program code or instructions that when invoked by the processor 2001 implement the methods illustrated in embodiments of the present invention, and may also include programs for implementing other functions or services.

The Memory 2002 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.

The processor 2001 is connected to the memory 2002 via the bus 2005, and realizes a corresponding function by calling an application program stored in the memory 2002. The Processor 2001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 2001 may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs and microprocessors, and the like.

The server 2000 may be connected to a network through a network interface 2003 to communicate with other devices (e.g., user terminal devices or other servers) through the network to realize data interaction. For example, the server 2000 communicates with a user terminal device through a network interface to obtain multi-source social network data of a user. The network interface 2003 may include a wired network interface and/or a wireless network interface, among others.

The server 2000 may be connected to a desired input/output device such as a keyboard, a display device, etc. through an input/output interface 2004, and a storage device such as a hard disk, etc. may be connected through the interface so that data in the server 2000 may be stored in the storage device or data in the storage device may be stored in the server 200. It is to be appreciated that the input/output interface 2004 can be a wired interface or a wireless interface. Depending on the actual application scenario, the device connected to the input/output interface 2004 may be a component of the server 200, or may be an external device connected to the server 200 as needed.

The bus 2005, which connects the various components, may include a path that carries information between the components. The bus 2005 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. According to function, the bus 2005 may be divided into an address bus, a data bus, a control bus, and the like.

Optionally, for the solution provided by the embodiments of the present invention, the memory 2002 may be used to store the application program code for executing the solution of the present invention, with execution controlled by the processor 2001. The processor 2001 executes the application program code stored in the memory 2002 to implement the actions of the method or apparatus provided by the embodiments of the present invention.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the order of the steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for discovering a network community, comprising:

respectively determining the association relationship between each auxiliary data source and a main data source based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source, wherein the main data source is a specified one of the at least two data sources, each auxiliary data source is a data source, among the at least two data sources, other than the main data source, and the user relationship indicates whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user or associated users;

obtaining a first objective function through a first clustering algorithm based on the data of the main data source, wherein the first objective function comprises a community indication matrix corresponding to the main data source before solving;

and obtaining the division result of the network community of the social network user corresponding to the main data source based on the solved community indication matrix.

2. The method according to claim 1, wherein the respectively determining the association relationship between each auxiliary data source and the main data source based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source comprises:

3. The method of claim 1, wherein obtaining a second objective function through a second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source comprises:

obtaining sub-objective functions corresponding to each auxiliary data source through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source;

and obtaining the second objective function based on the sub-objective functions corresponding to each auxiliary data source.

4. The method of claim 3, wherein obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source comprises:

obtaining the second objective function based on the sub-objective functions corresponding to the auxiliary data sources and the weights corresponding to the auxiliary data sources.
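
By way of non-limiting illustration, the weighting in claim 4 can be sketched as a weighted sum of per-source sub-objective values. The weighted-sum form, the function name `second_objective`, and the sample numbers are assumptions made for this sketch; the claim itself only states that the second objective function is based on the sub-objective functions and the weights.

```python
def second_objective(sub_objectives, weights):
    """Combine per-source sub-objective values into one value.

    ASSUMPTION: a plain weighted sum is used here for illustration only;
    the claim does not fix the exact way the weights enter the objective.
    """
    if len(sub_objectives) != len(weights):
        raise ValueError("one weight is required per auxiliary data source")
    return sum(w * f for w, f in zip(weights, sub_objectives))


# Hypothetical values: two auxiliary data sources, the second weighted higher.
value = second_objective([2.0, 4.0], [0.25, 0.75])
```

Under this assumption, an auxiliary data source with a larger weight contributes more to the second objective function, which is why the weight vector is treated as a quantity to be solved together with the community indication matrix.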

5. The method of claim 3, wherein obtaining a first objective function based on the data of the main data source by a first clustering algorithm comprises:

obtaining the first objective function through the first clustering algorithm based on the user similarity matrix corresponding to the main data source;

the obtaining of the sub-objective functions corresponding to each auxiliary data source through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source comprises:

obtaining sub-objective functions corresponding to each auxiliary data source through the second clustering algorithm based on the user similarity matrix corresponding to each auxiliary data source and the relation matrix corresponding to each auxiliary data source;

6. The method of claim 5, wherein the second clustering algorithm is a spectral clustering algorithm, and the sub-objective functions are:

$$\operatorname{Tr}\left(U^{T} L_{v',v} U\right)$$

wherein,

$\operatorname{Tr}$ represents the trace of a matrix, $U$ represents the community indication matrix, before solving, corresponding to the main data source, $U^{T}$ represents the transpose of $U$, $v$ represents the main data source, $v'$ represents an auxiliary data source, $S_{v',v}$ represents the relationship matrix between the auxiliary data source $v'$ and the main data source $v$, $S_{v,v'}$ represents the transpose of $S_{v',v}$, $A_{v'}$ represents the user similarity matrix corresponding to the auxiliary data source,
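
By way of non-limiting illustration, the trace form of the sub-objective function can be evaluated as sketched below. The construction of the Laplacian-like matrix is an assumption: the sketch projects the auxiliary similarity matrix onto the users of the main data source as S^T A S (with S of shape auxiliary users by main users) and takes the unnormalized graph Laplacian D - W of the result. The helper names and the toy matrices are likewise hypothetical.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def laplacian(w):
    """Unnormalized graph Laplacian L = D - W, where D holds the row sums of W."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]


def sub_objective(u, s, a):
    """Evaluate Tr(U^T L U) for one auxiliary data source.

    ASSUMPTION: L is built from the projected similarity S^T A S; this is
    an illustrative choice, not a definition taken from the claims.
    """
    projected = matmul(matmul(transpose(s), a), s)  # main users x main users
    utlu = matmul(matmul(transpose(u), laplacian(projected)), u)
    return sum(utlu[i][i] for i in range(len(utlu)))


# Toy data: three users present in both sources (S is the identity),
# and the auxiliary source says users 0 and 1 are similar.
S = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
U_same = [[1, 0], [1, 0], [0, 1]]   # users 0 and 1 share a community
U_split = [[1, 0], [0, 1], [0, 1]]  # users 0 and 1 are separated

low = sub_objective(U_same, S, A)
high = sub_objective(U_split, S, A)
```

In this toy case the indication matrix that keeps the two similar users together yields the smaller trace value, which is the behaviour a spectral clustering objective is minimized to obtain.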

7. The method of claim 4, further comprising:

obtaining a final objective function based on the first objective function and the second objective function includes:

obtaining the final objective function according to the first objective function, the second objective function and a regularization term of the weight vector, wherein the weight vector is a term to be solved in the final objective function.

8. The method of any one of claims 1 to 7, further comprising:

acquiring Must-link (must-link constraint) supervision information, wherein the Must-link supervision information identifies two social network users as belonging to the same network community;

and obtaining the final objective function based on the first objective function, the second objective function and the constraint function.
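
By way of non-limiting illustration, one possible constraint function built from Must-link supervision information is sketched below. The squared-distance penalty, the function name `must_link_penalty`, and the sample pairs are assumptions made for this sketch; the claims state only that a constraint function is constructed according to the Must-link supervision information.

```python
def must_link_penalty(u, must_link_pairs):
    """Sum of squared distances between indication rows of must-linked users.

    ASSUMPTION: this penalty form is illustrative only. It is zero when every
    must-linked pair has identical rows in the community indication matrix.
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(u[i], u[j]))
        for i, j in must_link_pairs
    )


# Toy indication matrix: users 0 and 1 in community 0, user 2 in community 1.
U = [[1, 0], [1, 0], [0, 1]]

satisfied = must_link_penalty(U, [(0, 1)])  # pair already in the same community
violated = must_link_penalty(U, [(0, 2)])   # pair placed in different communities
```

Adding such a term to the final objective function penalizes solutions that separate users known to belong to the same network community.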

9. The method of claim 8, wherein the constructing a constraint function according to the Must-link supervision information comprises:

10. The method of claim 7, wherein the final objective function comprises the first objective function, the second objective function, the regularization term of the weight vector, and a constraint function; wherein the constraint function is obtained by:

and constructing a constraint function according to the Must-link supervision information.

11. The method according to any one of claims 1 to 7, wherein the solving the final objective function to obtain a solved community indication matrix comprises:

solving the final objective function by using the alternating direction method of multipliers (ADMM) and the Lagrange multiplier method to obtain a solved community indication matrix.

12. An apparatus for discovering a network community, comprising:

a multi-source social data acquisition module, configured to acquire multi-source social network data of social network users, wherein the multi-source social network data comprises data corresponding to at least two data sources;

a data source relationship determining module, configured to respectively determine, based on a user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to a main data source, an association relationship between each auxiliary data source and the main data source, wherein the main data source is a specified one of the at least two data sources, and each auxiliary data source is a data source, among the at least two data sources, other than the main data source; the user relationship indicates whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user or associated users;

a community division result determination module for performing the following operations:

13. An electronic device, comprising a processor and a memory;

the memory has stored therein readable instructions which, when loaded and executed by the processor, implement the method for discovering a network community as claimed in any one of claims 1 to 11.

14. A computer-readable storage medium, wherein the storage medium has stored therein readable instructions which, when loaded and executed by a processor, implement the method for discovering a network community as claimed in any one of claims 1 to 11.