CN112417234B - Data clustering method and device and computer readable storage medium - Google Patents

Data clustering method and device and computer readable storage medium

Info

Publication number
CN112417234B
CN112417234B · CN201910784526.6A
Authority
CN
China
Prior art keywords
data set
original data
matrix
clustering
weight matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910784526.6A
Other languages
Chinese (zh)
Other versions
CN112417234A (en)
Inventor
赵剑
邱思远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201910784526.6A priority Critical patent/CN112417234B/en
Publication of CN112417234A publication Critical patent/CN112417234A/en
Application granted granted Critical
Publication of CN112417234B publication Critical patent/CN112417234B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiment of the invention discloses a data clustering method and device and a computer readable storage medium, wherein the data clustering method comprises the following steps: receiving and converting an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, the clustering result corresponding to the original data set is obtained by utilizing spectral clustering, so that an ideal clustering effect can be obtained, and the clustering performance is effectively improved.

Description

Data clustering method and device and computer readable storage medium
Technical Field
The present invention relates to data detection technology, and in particular, to a data clustering method and apparatus, and a computer readable storage medium.
Background
When the data sets of the high-dimensional data are clustered, the high-dimensional data from different subspaces can be segmented into the respective low-dimensional subspaces according to the potential subspace structures of the data sets, and the different subspaces correspond to different categories. Subspace clustering algorithms are widely used in many fields, wherein linear representation-based subspace clustering algorithms represented by sparse subspace clustering algorithms (Sparse subspace clustering, SSC), low rank representation subspace clustering algorithms (Low rank representation for subspace clustering, LRR) and least squares regression subspace clustering algorithms (Robust and efficient subspace segmentation via least squares regression, LSR) have attracted widespread interest to researchers due to their algorithm simplicity and high dimensional data clustering effectiveness.
At present, common subspace clustering algorithms based on linear representation are implemented by constraining the representation coefficient Z with the l1-norm, the nuclear norm, or the F-norm, so as to obtain a representation coefficient Z with a block-diagonal structure. However, a single norm constraint on the coefficient Z usually yields a representation coefficient with defects, so the final clustering result is not ideal enough and the clustering performance is low.
Disclosure of Invention
To solve the above-mentioned technical problems, embodiments of the present invention desire to provide a data clustering method and apparatus, and a computer-readable storage medium,
in order to achieve the above object, the technical solution of the embodiment of the present invention is as follows:
the embodiment of the invention provides a data clustering method, which comprises the following steps:
receiving and converting an original data set;
determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set;
determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix;
establishing a similarity matrix corresponding to the original data set according to the representation coefficient;
and based on the similarity matrix, obtaining a clustering result corresponding to the original data set by utilizing spectral clustering.
The embodiment of the application provides a data clustering method and device and a computer readable storage medium, wherein the data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, a clustering result corresponding to the original data set is obtained by utilizing spectral clustering. Therefore, in the embodiment of the application, the data clustering device can firstly acquire a denoised low-rank dictionary from the original data set, then construct the target coefficient by combining the weight matrix acquired according to the original data set, so as to acquire the similarity matrix corresponding to the original data set, and perform clustering processing on the original data set by using the similarity matrix to acquire a corresponding clustering result.
Drawings
FIG. 1 is a basic framework of a subspace clustering algorithm based on linear representation;
fig. 2 is a schematic implementation flow diagram of a data clustering method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a partial relationship;
fig. 4 is a second implementation flow chart of a data clustering method according to an embodiment of the present application;
fig. 5 is a schematic diagram of a composition structure of a data clustering device according to an embodiment of the present application;
fig. 6 is a schematic diagram of a composition structure of a data clustering device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the application and not limiting of the application. It should be noted that, for convenience of description, only a portion related to the related application is shown in the drawings.
With the rapid development of information technology, data is ubiquitous in daily life, and the huge scale and complex structure of data bring many challenges to data processing; how to effectively mine valuable information from data has become a major problem. Classical clustering algorithms can effectively solve the problem of low-dimensional data clustering, but application environments change with each passing day, and high-dimensional data is now visible everywhere in work and life: the dimensionality of image data, video data, and text data often reaches tens of thousands of dimensions (for example, a picture taken by a smartphone can contain tens of thousands of pixels), and traditional clustering algorithms cannot obtain ideal results on the high-dimensional clustering problem. The main problems faced by high-dimensional data clustering are: the data distribution in a high-dimensional space is sparser than that in a low-dimensional space, the distances between data points are almost equal, and the data contain some irrelevant attributes, so clustering cannot be performed in the high-dimensional space according to the distance relationships between data points; however, most traditional clustering methods cluster based on distance. How to design a new clustering algorithm to solve the high-dimensional clustering problem has therefore become a key research topic in fields such as data mining and machine learning. The subspace clustering algorithm is an extension of the traditional clustering algorithm: according to the potential subspace structure of the data set, high-dimensional data from different subspaces are segmented into their respective low-dimensional subspaces, and different subspaces correspond to different categories.
Subspace clustering algorithms are widely used in many fields, for example: image clustering, motion segmentation, and the like. Currently, among algorithms of subspace clustering, a subspace clustering algorithm based on linear representation becomes a research hotspot in the field due to the superior clustering performance.
Subspace clustering algorithms based on linear representations are expected to better construct similarity matrices by exploiting global information between data points. Among them, the linear-representation-based subspace clustering algorithms represented by the sparse subspace clustering algorithm (Sparse subspace clustering, SSC), the low-rank representation subspace clustering algorithm (Low rank representation for subspace clustering, LRR), and the least squares regression subspace clustering algorithm (Robust and efficient subspace segmentation via least squares regression, LSR) have attracted widespread interest among researchers due to their algorithmic simplicity and effectiveness on high-dimensional data clustering. These algorithms do not need to know the dimensions of the subspaces: they obtain the representation coefficient of each data point through the self-representation of the data, establish a similarity matrix from the obtained representation coefficients, and apply spectral clustering to obtain the clustering result.
Under the linear-representation assumption, the SSC algorithm enforces sparsity of the representation coefficient matrix through l1-norm minimization, so that the inter-class representation coefficients are zero and the intra-class representation coefficients are sparse. The LRR algorithm minimizes the nuclear norm to reveal the lowest-rank representation of the global structure of the data, which groups highly correlated data well together; it also obtains good robustness when processing noisy and heavily contaminated data. The LSR algorithm uses the F-norm to constrain the representation coefficients, giving the coefficients a grouping effect and maintaining the aggregation of related data. Under the assumption of independent subspaces, the representation matrix obtained by the LSR algorithm has a block-diagonal structure; when the data points are insufficient, the obtained representation coefficient matrix still has a block-diagonal structure under the assumption of orthogonal subspaces. Meanwhile, the objective function of the LSR algorithm has an analytic solution, which avoids an iterative solving process and greatly reduces the time complexity of the algorithm. Fig. 1 shows the basic framework of a subspace clustering algorithm based on linear representation: as shown in Fig. 1, the algorithm first linearly represents the input data set to obtain the representation coefficients, then constructs a similarity matrix from the representation coefficients, and finally performs spectral clustering on the constructed similarity matrix to obtain the clustering result.
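As noted above, the LSR objective admits an analytic solution. In the usual formulation of LSR, min_Z ‖X − XZ‖_F² + λ‖Z‖_F², that solution is Z = (XᵀX + λI)⁻¹XᵀX, which the following minimal NumPy sketch computes (the function name and default λ are illustrative, not from the source):

```python
import numpy as np

def lsr_coefficients(X, lam=0.1):
    """Closed-form LSR representation coefficients (a sketch).

    Solves min_Z ||X - X Z||_F^2 + lam * ||Z||_F^2, whose analytic
    solution is Z = (X^T X + lam * I)^{-1} X^T X, so no iterative
    optimization is needed. X holds one sample per column (m x n).
    """
    n = X.shape[1]
    G = X.T @ X                                   # n x n Gram matrix
    return np.linalg.solve(G + lam * np.eye(n), G)
```

Because only one linear solve is needed, this illustrates why LSR avoids the iterative solving process mentioned above.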
Classical subspace clustering algorithms based on linear representation constrain the representation coefficient with the l1-norm, the nuclear norm, or the F-norm to find a representation coefficient Z with a block-diagonal structure; however, a single norm constraint on the coefficient Z usually leaves the found representation coefficient Z deficient. For example, the SSC algorithm minimizes the l1-norm to obtain the sparsest representation of the samples as the coefficient matrix; if data from the same subspace are highly correlated, l1-norm minimization typically selects a small number of data points at random for the linear representation and ignores other related data points, so the found coefficient matrix cannot guarantee connections between data points within a class. Thus, although the SSC algorithm can construct a sparse similarity matrix, satisfactory results may not be obtained. The LRR algorithm finds the lowest-rank representation among the high-dimensional data, so the global structure of the data can be obtained; it solves the optimization problem using nuclear-norm minimization instead of rank minimization. The low-rank representation clustering algorithm can obtain a representation coefficient matrix with good block-diagonal properties, but because the algorithm only focuses on the global rank constraint, the final representation coefficient matrix lacks sparsity: a large number of non-zero elements still exist in the inter-class representation coefficients, and the intra-class representation coefficients differ greatly, so the final clustering result is not ideal enough.
To overcome the defects of classical subspace clustering algorithms based on linear representation, the non-negative low-rank sparse graph for semi-supervised learning introduces the l1-norm and the nuclear norm into the objective function simultaneously, eliminating the inter-class representation coefficients while keeping the intra-class coefficients compact. The structure-constrained low-rank representation algorithm adds a structured sparse constraint to the low-rank representation subspace clustering algorithm, so that it can better sparsify the inter-class representation coefficients and can handle more general subspace distribution structures. Smooth representation clustering constrains the representation coefficients through the local relationships among the data, so that the intra-class representation coefficients tend to be smooth, and an ideal clustering quality is obtained.
The application provides a data clustering method based on the structure-constrained low-rank representation algorithm and smooth representation clustering. The data clustering method uses a structured smooth low-rank representation subspace clustering algorithm (Structured smooth low-rank representation subspace clustering, SSLRR) to introduce a local similarity constraint into the LRR objective function, improving the intra-class consistency of the representation coefficients through the local relationships among data points, and introduces a structured sparse constraint into the objective function to increase the inter-class sparsity of the representation coefficients. To make the algorithm handle noisy data better, it first obtains a low-rank structured dictionary through a low-rank recovery technique for linearly representing the original data set, which improves the robustness of the algorithm on noisy data while achieving higher clustering performance.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Example 1
In the embodiment of the present invention, as shown in fig. 2, a method for performing data clustering by a data clustering device may include the following steps:
step 101, receiving and converting an original data set.
In an embodiment of the present application, the data clustering device may first receive an original data set, and after receiving the original data set, perform dimension conversion on the original data set.
Further, in an embodiment of the present application, the original data set may be high-dimensional data; for example, the original data set may be the Extended Yale B face data set, the AR face data set, or high-dimensional data such as a handwritten digit data set.
It should be noted that, in the embodiment of the present application, the data clustering device may be a device integrated with a data clustering algorithm, and the data clustering device may be used for clustering, analysis, and experimentation on data sets. For example, the data clustering device may be installed with a subspace clustering application, such as a face clustering application or a handwritten digit clustering application.
Further, in embodiments of the present application, the raw data set may be a high-dimensional data set, e.g., the raw data set X = [x_1, x_2, ..., x_n] ∈ R^{m×n}, wherein each column represents a data sample, n represents the number of data points, m represents the dimension of the data, and x_i represents the i-th sample in the data set.
It should be noted that, in the embodiment of the present application, the data clustering device may perform a dimension reduction process on the original data set after receiving the original data set, so as to perform the dimension conversion on the original data set. Specifically, when performing the dimensionality reduction, the data clustering device may reduce the dimensionality of the data to 6×k dimensions by principal component analysis (Principal Component Analysis, PCA), where k represents the category parameter.
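The PCA preprocessing step described above can be sketched as follows (a minimal SVD-based sketch; the patent only states the 6×k target dimension, so the centering and SVD details here are standard choices, not from the source):

```python
import numpy as np

def pca_reduce(X, k):
    """Project samples (one per column of X) onto their top 6*k
    principal components, matching the 6*k-dimensional target stated
    above, where k is the category parameter (number of classes)."""
    d = 6 * k
    Xc = X - X.mean(axis=1, keepdims=True)        # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d].T @ Xc                        # d x n reduced data
```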
Step 102, determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set.
In the implementation of the present application, after receiving and converting the original data set, the data clustering device may determine a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set, respectively.
It should be noted that, in the embodiment of the present application, the original data set received by the data clustering device may carry random noise, that is, there may be data contaminated by noise in the original data set. To better address the noisy-data clustering problem, the data clustering device may use robust principal component analysis (Robust principal component analysis, RPCA) to recover a clean low-rank dictionary from the original data set.
Further, in an embodiment of the present application, the data clustering means may extract the low-rank dictionary from the original data set according to a first objective function, where the first objective function may be used to denoise the original data set. Specifically, the expression of the first objective function is shown in formula (1):

min_{A,E} ‖A‖_* + γ‖E‖_1  s.t. X = A + E    (1)

wherein ‖A‖_* represents the nuclear norm of the matrix and ‖E‖_1 represents the l1-norm of the matrix. Specifically, the first objective function may be solved using the inexact augmented Lagrange multiplier algorithm, resulting in the low-rank dictionary A.
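Formula (1) is the standard RPCA decomposition, and the inexact augmented Lagrange multiplier scheme alternates singular value thresholding for A with soft thresholding for E. The sketch below follows that standard scheme; the parameter defaults (γ = 1/√max(m,n), μ, ρ) are common choices, not values from the patent:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Elementwise soft thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca_ialm(X, gamma=None, mu=1.0, rho=1.5, iters=200, tol=1e-7):
    """Inexact ALM sketch of formula (1):
    min ||A||_* + gamma * ||E||_1  s.t.  X = A + E."""
    m, n = X.shape
    if gamma is None:
        gamma = 1.0 / np.sqrt(max(m, n))
    A = np.zeros_like(X); E = np.zeros_like(X); Y = np.zeros_like(X)
    for _ in range(iters):
        A = svt(X - E + Y / mu, 1.0 / mu)        # low-rank update
        E = shrink(X - A + Y / mu, gamma / mu)   # sparse-error update
        R = X - A - E                            # constraint residual
        Y = Y + mu * R
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(R) <= tol * max(np.linalg.norm(X), 1.0):
            break
    return A, E
```

The returned A then plays the role of the low-rank dictionary used for the linear representation.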
Further, in the embodiment of the present application, the data clustering device may further obtain a weight matrix corresponding to the original data set according to the original data set. The weight matrix may include a first weight matrix and a second weight matrix. Specifically, the first weight matrix is used for reducing the representation coefficients, and the second weight matrix is used for representing the local relationships of the data in the original data set in the original space.
It should be noted that, in the embodiment of the present application, the first weight matrix and the second weight matrix may be related to the third objective function used by the data clustering device when clustering the original data set, so the data clustering device may determine the first weight matrix and the second weight matrix according to the original data set.
Further, in the embodiment of the present application, the weights in the first weight matrix may be obtained by formula (2),
wherein W_ij is the weight value in the first weight matrix, computed from the normalizations of the data points x_i and x_j respectively; the matrix B is defined according to formula (3),
and the parameter σ is the average of all elements in the matrix B. Defining the first weight matrix by formula (2) makes the weights between data points in different subspaces larger, while the weights between data points in the same subspace tend to zero, so that minimizing the data term ‖W ⊙ Z‖_1 can better reduce the representation coefficients between data points in different subspaces, wherein ⊙ denotes the Hadamard product, and in the examples of the present application H = ‖W ⊙ Z‖_1 is defined.
Further, in embodiments of the present application, to better characterize the local relationships between data points in the original data set, the data clustering means may determine the local relationships between data points through a locally linear embedding (locally linear embedding, LLE) graph. First, the K nearest neighbours of each data point x_i are determined; then x_i is linearly reconstructed from its K neighbour points, and the weight values are solved by minimizing the reconstruction error. The weight value M_ij in the second weight matrix represents the contribution of the j-th data point to the reconstruction of the i-th data point; the closer two data points are, the greater the weight between them. For example, Fig. 3 is a schematic diagram of a local relationship: in a high-dimensional space, when the number of neighbour points K = 3, the data point x_i has three neighbour points x_j, x_k, x_l, and the linear reconstruction relationship between them is shown in Fig. 3, where W_ij, W_ik, W_il are respectively the weights between the data point x_i and x_j, x_k, x_l. Based on two constraints: (1) each data point is linearly reconstructed from its K nearest neighbour data points, and when a data point x_j does not belong to the K nearest neighbours of the data point, M_ij = 0; (2) the sum of the reconstruction weight coefficients of each data point is 1, the second objective function used by the data clustering means to solve the second weight matrix may be expressed as formula (4),

min_M Σ_{i=1}^{n} ‖x_i − Σ_{j∈Q_i} M_ij x_j‖²  s.t. Σ_{j∈Q_i} M_ij = 1    (4)

wherein n represents the number of data points and Q_i represents the index set of the K nearest neighbour points of each data point x_i. Defining formula (5),
V_jk = (x_i − x_j)^T (x_i − x_k)    (5)
then M_ij can be expressed as formula (6),

M_ij = Σ_{k∈Q_i} (V^{−1})_{jk} / Σ_{l,s∈Q_i} (V^{−1})_{ls}    (6)
further, in the embodiment of the present application, the data clustering device may determine the second weight matrix according to the formula (6), specifically, the second weight matrix may be a symmetric non-negative weight matrix, for example, the second weight matrix M may be represented by the formula (7),
It should be noted that, in the embodiment of the present application, after receiving the original data set, the data clustering device may determine the low-rank dictionary, the first weight matrix, and the second weight matrix based on the original data set according to formulas (1) through (7) above, so as to subsequently determine the representation coefficient according to the low-rank dictionary and the weight matrix.
And 103, determining the representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.
In an embodiment of the present application, after determining the low-rank dictionary and the weight matrix corresponding to the original data set according to the original data set, the data clustering device may further determine the representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.
It should be noted that, in the embodiment of the present application, the LRR algorithm can obtain the global structure of the data well through the low-rank criterion, but the inter-class representation coefficients contain a large number of non-zero elements, which affects the accuracy of clustering. In embodiments of the present application, the l1-norm may be introduced into the LRR objective function, i.e., into the third objective function that performs the clustering process (the third objective function may be the objective function corresponding to the LRR algorithm), so that the l1-norm can be used to increase the sparsity of the representation coefficients. In particular, the third objective function may be expressed according to formula (8),
min_{Z,E} ‖Z‖_* + β‖Z‖_1 + γ‖E‖_1  s.t. X = AZ + E    (8)
where β and γ are used to balance the effects of the low-rank, sparse, and noise terms. In particular, in embodiments of the present application, minimizing the structured sparsity constraint term is superior to standard l1-norm minimization, so formula (8) can be converted into formula (9) to represent the third objective function,
min_{Z,E} ‖Z‖_* + βH + γ‖E‖_{2,1}  s.t. X = AZ + E    (9)
wherein W is the first weight matrix among the weight matrices. To better capture the local relationships of the data in the original data set, it can be assumed that if the data points x_i and x_j are close in the potential geometry of the data distribution, then the two data points remain similar when embedded or projected into the new space. Specifically, the data clustering means may define L = D − M as the Laplacian matrix, where D is the degree matrix with D_ii = Σ_j M_ij. Formula (9) is then converted through the Laplacian matrix, and the resulting converted third objective function is shown in formula (10); that is, mathematically, the assumed relationship can be expressed as formula (10),

(1/2) Σ_{i,j} M_ij ‖z_i − z_j‖² = tr(Z L Z^T)    (10)
wherein M is the second weight matrix among the weight matrices, reflecting the local relationships of the data in the original data set in the original space, and z_i and z_j are respectively the representation coefficients corresponding to the data points x_i and x_j. Fusing formula (9) and formula (10) constrains the representation coefficients through the local relationships among the data points, so that the intra-class representation coefficients tend to be smooth, which promotes the final clustering accuracy; the converted third objective function can be represented by formula (11),

min_{Z,E} ‖Z‖_* + β‖W ⊙ Z‖_1 + α tr(Z L Z^T) + γ‖E‖_{2,1}  s.t. X = AZ + E    (11)
wherein α is used to balance the effect of the graph regularization term against the other three terms.
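The graph-regularization identity underlying formula (10) — that tr(Z L Z^T) with L = D − M equals half the weighted sum of squared coefficient differences — can be checked numerically with a small sketch:

```python
import numpy as np

def graph_reg(Z, M):
    """Graph regularization term tr(Z L Z^T) with Laplacian L = D - M.

    For a symmetric weight matrix M this equals
    (1/2) * sum_ij M_ij * ||z_i - z_j||^2, which is what ties the
    representation coefficients to the local data relationships."""
    L = np.diag(M.sum(axis=1)) - M
    return float(np.trace(Z @ L @ Z.T))
```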
Further, in the embodiment of the present application, in order to solve formula (11) effectively, the data clustering device may iteratively solve formula (11) using the alternating direction method of multipliers. Specifically, the data clustering device may introduce preset auxiliary variables J, T ∈ R^{n×n}, and formula (11) can then be converted into formula (12),

min_{Z,E,J,T} ‖J‖_* + β‖W ⊙ T‖_1 + α tr(Z L Z^T) + γ‖E‖_{2,1}  s.t. X = AZ + E, Z = J, Z = T    (12)
equation (13) can be obtained using the lagrange multiplier reconstruction equation (12),
wherein Y_A, Y_B, and Y_C represent the Lagrange multipliers, and μ represents a penalty parameter controlling the convergence of the third objective function.
It should be noted that, in the embodiment of the present application, the data clustering device may update J iteratively using the singular value soft-thresholding operation, based on Y_C and Z; meanwhile, the data clustering device may update T iteratively using the shrinkage (soft-thresholding) operation, based on Y_B and Z; furthermore, the data clustering device may solve for Z using the Bartels-Stewart algorithm, iterating based on the low-rank dictionary. In the iteration process, the representation coefficient Z has a unique solution, so the optimal value of the representation coefficient can be obtained.
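The J and T updates mentioned above can be sketched with their standard proximal forms — singular value soft thresholding for the nuclear norm and weighted elementwise shrinkage for the l1 term. The exact thresholds in the patent's derivation are not spelled out in the text, so the 1/μ and βW/μ values below are the usual ADMM choices, assumed rather than quoted:

```python
import numpy as np

def update_J(Z, Y_C, mu):
    """J-update: singular value soft thresholding (prox of ||.||_*)
    applied to Z + Y_C / mu."""
    B = Z + Y_C / mu
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt

def update_T(Z, Y_B, W, beta, mu):
    """T-update: elementwise shrinkage (prox of beta * ||W (.) T||_1)
    applied to Z + Y_B / mu, with per-entry thresholds beta * W / mu."""
    B = Z + Y_B / mu
    return np.sign(B) * np.maximum(np.abs(B) - beta * W / mu, 0.0)
```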
Step 104, establishing a similarity matrix corresponding to the original data set according to the representation coefficient.
In the embodiment of the present application, the data clustering device may establish a similarity matrix corresponding to the original data set according to the representation coefficients after determining the representation coefficients corresponding to the original data set according to the low-rank dictionary and the weight matrix.
It should be noted that, in the embodiment of the present application, after obtaining the representation coefficient, the data clustering apparatus may construct the similarity matrix from the representation coefficient; specifically, the data clustering apparatus may establish the similarity matrix through formula (14),

S = (|Z| + |Z|^T) / 2    (14)
it should be noted that, in the embodiment of the present application, the similarity matrix determined by the data clustering device according to the formula (14) may be used to perform spectral clustering on the original data set.
Step 105, based on the similarity matrix, obtaining a clustering result corresponding to the original data set by utilizing spectral clustering.
In the embodiment of the application, after the data clustering device establishes the similarity matrix corresponding to the original data set according to the representation coefficient, the clustering result corresponding to the original data set can be obtained by utilizing spectral clustering based on the similarity matrix.
Further, in the embodiment of the present application, the data clustering device may further determine a category parameter corresponding to the original data set after performing the dimension reduction processing on the original data set.
It should be noted that, in the embodiment of the present application, after determining the similarity matrix, the data clustering device may further determine a normalized symmetric Laplacian matrix according to the similarity matrix. Then, according to the class parameter K of the original data set, the device may obtain K eigenvectors of the normalized symmetric Laplacian matrix and normalize the target matrix formed by the K eigenvectors. Finally, a K-means clustering algorithm may be applied to the normalized target matrix, so that the class assignment of the original data set can be output, that is, the clustering result corresponding to the original data set can be obtained.
According to the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, a clustering result corresponding to the original data set is obtained by utilizing spectral clustering. Therefore, in the embodiment of the application, the data clustering device can firstly acquire a denoised low-rank dictionary from the original data set, then construct the target coefficient by combining the weight matrix acquired according to the original data set, so as to acquire the similarity matrix corresponding to the original data set, and perform clustering processing on the original data set by using the similarity matrix to acquire a corresponding clustering result.
Example two
Based on the above embodiment, in still another embodiment of the present application, when solving the third objective function after conversion, that is, when solving the above formula (11), the data clustering device may iteratively solve the third objective function after conversion according to a preset auxiliary variable to obtain the representation coefficient.
Further, in the embodiment of the present application, the data clustering device may introduce preset auxiliary variables J, T ∈ R^(n×n), reconstruct the problem with the augmented Lagrangian multiplier method after introducing the preset auxiliary variables to obtain the above formula (13), and sequentially update the preset auxiliary variable J, the preset auxiliary variable T, Z, E, the Lagrangian multipliers and μ to obtain the optimal representation coefficient Z*.
In the embodiment of the present application, exemplarily, for the original data set X = [x_1, x_2, ..., x_n] ∈ R^(m×n), when determining the representation coefficients, the smooth low-rank representation subspace clustering algorithm proposed by the data clustering device may comprise the following steps:
Step 201, initializing the variables.
Set the maximum iteration number maxIter = 1000 and the current iteration number k = 0; initialize Z = J = T = 0, E = 0, Y_A = 0, Y_B = Y_C = 0, μ = 10^(-6), max_μ = 10^10, ρ = 1.1, ε = 10^(-8). The iteration continues while ||Z − J||_∞ > ε, ||Z − T||_∞ > ε, or ||X − AZ − E||_∞ > ε.
Step 202, updating a preset auxiliary variable J.
Fix the other variables to update the preset auxiliary variable J. Specifically, when updating the variable J, let P = Z + Y_C/μ and perform singular value decomposition on P, SVD(P) = [U, Σ, V]. Thresholding is then performed on the singular value matrix Σ: G_τ(Σ) = diag((σ_i − τ)_+), wherein σ_i is a main diagonal element of Σ and also a singular value of the matrix P, and τ is the threshold, taken as τ = 1/μ. G_τ(Σ) represents: if the diagonal element σ_i is larger than τ, take σ_i = σ_i − τ; otherwise, take σ_i = 0. The optimal solution of each iteration of J is finally J = U G_τ(Σ) V^T.
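The singular value soft-thresholding of step 202 may be sketched as follows; the threshold value tau = 1/mu is the standard choice for this nuclear-norm subproblem and is an assumption here, since the exact value is not legible in the text.

```python
import numpy as np

def svt(P, tau):
    """Singular value thresholding (step 202): SVD(P) = U S V^T,
    shrink each singular value by tau, and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # (sigma_i - tau)_+
    return U @ np.diag(s_shrunk) @ Vt
```

In the iteration above one would call svt(Z + Y_C/mu, 1.0/mu) to obtain the new J.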
Step 203, updating the preset auxiliary variable T.
Fix the other variables to update the preset auxiliary variable T. Specifically, when updating the variable T, let Q = Z + Y_B/μ. The variable T may then be expressed as T = S_ε(Q), where each element t_ij in T satisfies the relationship of the following formula (15):
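The element-wise shrinkage operator S_ε of step 203 admits the following minimal sketch; treating ε as a single scalar threshold is an assumption, since formula (15) is not reproduced in the text (a weighted per-element threshold would follow the same pattern).

```python
import numpy as np

def shrink(Q, eps):
    """Element-wise shrinkage (soft threshold) operator S_eps used to
    update T in step 203: each entry is moved toward zero by eps."""
    return np.sign(Q) * np.maximum(np.abs(Q) - eps, 0.0)
```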
Step 204, updating the variable Z.
Fix the other variables to update the variable Z. Specifically, when updating the variable Z, the Bartels-Stewart algorithm is used to solve the equation μA^T AZ + αZ(2I + L) + (−A^T Y_A + Y_B + Y_C + μ(A^T E − A^T X − J − T)) = 0. A^T A is a positive semi-definite matrix, so any eigenvalue p_i of A^T A satisfies p_i ≥ 0; 2I + L is a positive definite matrix, so any eigenvalue μ_i of 2I + L satisfies μ_i > 0. Since p_i + μ_i > 0 for any eigenvalues p_i and μ_i, the variable Z has a unique solution in the iterative process.
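A sketch of the Z-update of step 204: because μA^T A is symmetric positive semi-definite and α(2I + L) is symmetric positive definite, the Sylvester equation can be solved by diagonalizing both coefficient matrices, which is a symmetric special case of the Bartels-Stewart approach. The function below solves P Z + Z S = Q for symmetric P and S; the mapping of step 204's constants onto P, S and Q is an assumption.

```python
import numpy as np

def solve_sym_sylvester(P, S, Q):
    """Solve P @ Z + Z @ S = Q for symmetric P (PSD) and S (PD) via
    eigendecomposition. Uniqueness requires p_i + s_j > 0 for all
    eigenvalue pairs, as argued in step 204."""
    dp, Up = np.linalg.eigh(P)
    ds, Us = np.linalg.eigh(S)
    Qt = Up.T @ Q @ Us                      # rotate into eigenbases
    Zt = Qt / (dp[:, None] + ds[None, :])   # element-wise solve
    return Up @ Zt @ Us.T                   # rotate back
```

In step 204 one would call it with P = μA^T A, S = α(2I + L) and Q equal to the negated constant term.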
Step 205, update variable E.
Fix the other variables to update the variable E, where E satisfies the following formula (16):
Specifically, when the variable E is updated, let U = X − AZ + Y_A/μ, with u_i representing the i-th column of the matrix U; each column of E then satisfies the condition of the following formula (17):
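Formula (17) is not reproduced in the text; the closed-form column-wise update shown below is the standard proximal operator of the l2,1 norm, so both the scaling rule and the threshold parameter are assumptions.

```python
import numpy as np

def l21_prox(U, thresh):
    """Column-wise proximal operator of the l2,1 norm (step 205):
    each column u_i is scaled by max(1 - thresh/||u_i||, 0), i.e.
    short columns are zeroed and long columns are shrunk."""
    E = np.zeros_like(U)
    norms = np.linalg.norm(U, axis=0)
    keep = norms > thresh
    E[:, keep] = U[:, keep] * (1.0 - thresh / norms[keep])
    return E
```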
step 206, updating the Lagrangian multiplier.
The Lagrangian multipliers Y_A, Y_B and Y_C are updated. Specifically, they may be updated according to Y_A = Y_A + μ(X − AZ − E), Y_B = Y_B + μ(Z − T) and Y_C = Y_C + μ(Z − J).
Step 207, updating penalty parameter μ.
The penalty parameter is updated according to μ = min(ρμ, max_μ).
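Steps 206 and 207 together are a few lines of matrix arithmetic; the sketch below assumes the variable names used in this embodiment.

```python
import numpy as np

def update_duals(X, A, Z, E, J, T, YA, YB, YC, mu, rho=1.1, mu_max=1e10):
    """Steps 206-207: gradient-ascent updates of the three Lagrange
    multipliers on the constraint residuals, then geometric growth of
    the penalty mu, capped at mu_max."""
    YA = YA + mu * (X - A @ Z - E)
    YB = YB + mu * (Z - T)
    YC = YC + mu * (Z - J)
    mu = min(rho * mu, mu_max)
    return YA, YB, YC, mu
```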
Step 208, let k = k + 1, and repeat the above steps 202 to 207 until the optimal representation coefficient Z* is output.
According to the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, a clustering result corresponding to the original data set is obtained by utilizing spectral clustering. Therefore, in the embodiment of the application, the data clustering device can firstly acquire a denoised low-rank dictionary from the original data set, then construct the target coefficient by combining the weight matrix acquired according to the original data set, so as to acquire the similarity matrix corresponding to the original data set, and perform clustering processing on the original data set by using the similarity matrix to acquire a corresponding clustering result.
Example III
Based on the first embodiment and the second embodiment, in still another embodiment of the present application, fig. 4 is a schematic diagram illustrating a second implementation flow of a data clustering method according to the embodiment of the present application, and as shown in fig. 4, a method for obtaining, by using a data clustering device, a clustering result corresponding to an original data set by using spectral clustering based on a similarity matrix may include the following steps:
Step 301, calculating according to the similarity matrix to obtain a normalized symmetric Laplacian matrix corresponding to the original data set.
In the embodiment of the application, after the similarity matrix is determined, the data clustering device may perform clustering processing on the original data set according to a normalized symmetric spectral clustering algorithm.
Further, in the embodiment of the present application, the data clustering device may first acquire the normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix. For example, based on the similarity matrix C obtained by the above formula (14), the normalized symmetric Laplacian matrix L_sym corresponding to the original data set is calculated.
Step 302, constructing a target matrix according to the class parameters and the normalized symmetric Laplacian matrix.
In the embodiment of the application, the data clustering device may further construct the target matrix by combining the class parameters corresponding to the original data set after obtaining the normalized symmetric laplace matrix according to the similarity matrix.
It should be noted that, in the embodiment of the present application, when the class parameter is k, the data clustering device may first calculate the first k eigenvectors u_1, u_2, …, u_k of the Laplacian matrix L_sym, and then form the target matrix U = [u_1, u_2, …, u_k] ∈ R^(n×k) from the k eigenvectors.
Step 303, performing normalization processing on the target matrix to obtain a normalized target matrix.
In the embodiment of the application, after forming the target matrix according to the class parameter and the normalized symmetric Laplacian matrix, the data clustering device can normalize the target matrix to obtain the normalized target matrix. Specifically, the data clustering device may normalize the target matrix U by rows to obtain the normalized target matrix T ∈ R^(n×k).
Step 304, clustering the normalized target matrix to obtain a clustering result corresponding to the original data set.
In the embodiment of the application, after normalizing the target matrix to obtain the normalized target matrix, the data clustering device may perform clustering on the normalized target matrix to obtain a clustering result corresponding to the original data set.
Further, in an embodiment of the present application, the data clustering device may regard each row q_i ∈ R^k of the normalized target matrix T as a point in the space R^k, and then use a K-means clustering algorithm to obtain the clustering result corresponding to the original data set.
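Steps 301 to 304 can be sketched end to end as follows. The plain Lloyd's K-means with a deterministic farthest-point initialization stands in for whatever K-means variant the text assumes, and using the eigenvectors belonging to the k smallest eigenvalues is the usual reading of "the first k eigenvectors" of L_sym.

```python
import numpy as np

def spectral_cluster(C, k, iters=100):
    """Sketch of steps 301-304: build the normalized symmetric Laplacian
    L_sym = I - D^{-1/2} C D^{-1/2} from similarity matrix C, take the k
    eigenvectors with smallest eigenvalues, row-normalize, then run a
    basic K-means on the rows."""
    d = C.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(C)) - d_inv_sqrt[:, None] * C * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)     # eigenvalues in ascending order
    U = vecs[:, :k]                        # first k eigenvectors
    Tn = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # deterministic farthest-point initialization of the k centers
    centers = [Tn[0]]
    for _ in range(1, k):
        dists = np.min([((Tn - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Tn[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):                 # Lloyd's K-means iterations
        labels = np.argmin(((Tn[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Tn[labels == j].mean(axis=0)
    return labels
```

On a block-structured similarity matrix, the rows of the normalized eigenvector matrix collapse onto one point per subspace, which is why the final K-means step recovers the class assignment.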
According to the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, a clustering result corresponding to the original data set is obtained by utilizing spectral clustering. Therefore, in the embodiment of the application, the data clustering device can firstly acquire a denoised low-rank dictionary from the original data set, then construct the target coefficient by combining the weight matrix acquired according to the original data set, so as to acquire the similarity matrix corresponding to the original data set, and perform clustering processing on the original data set by using the similarity matrix to acquire a corresponding clustering result.
Example IV
Based on the first to third embodiments, the data clustering device performs clustering processing on the original data set according to the SSLRR to obtain a corresponding clustering result. In order to verify the clustering effect of the SSLRR, the embodiments of the present application provide the following two proofs from a theoretical point of view.
Mode one: the optimal solution of SSLRR has a block diagonal structure.
Without taking noise into account, for the problem of equation (18):
given a set of m-dimensional datasets x= [ X ] 1 ,x 2 ,...,x n ]=[X 1 ,X 2 ,…,X k ]∈R m×n And dataSet X is taken from k independent linear subspacesWherein X is i Is m x n i Each column of which comes from the same subspace S i And n is 1 +n 2 +…+n i =n,Z * Is the optimal solution to minimize the problem (18), then the coefficient Z is represented * Having a block diagonal structure.
Let Z* be the optimal solution of the objective function (18), and define formula (19),
and let Z_C = Z* − Z_D, Z_C ≥ 0. According to the assumption of orthogonality of the subspaces, Z_D is also a feasible solution of the objective function (18), and from the nuclear norm property of the matrix it can be obtained that ||Z*||_* ≥ ||Z_D||_*. From Z_C ≥ 0, tr(Z* L Z*^T) = tr((Z_D + Z_C) L (Z_D + Z_C)^T) ≥ tr(Z_D L (Z_D)^T). Since the weight matrix W is a non-negative matrix, it can further be obtained that ‖W⊙Z*‖_1 ≥ ‖W⊙Z_D‖_1.
From ||Z*||_* ≥ ||Z_D||_*, tr(Z* L Z*^T) ≥ tr(Z_D L (Z_D)^T) and ‖W⊙Z*‖_1 ≥ ‖W⊙Z_D‖_1, it can be deduced that:
||Z*||_* + tr(Z* L Z*^T) + ‖W⊙Z*‖_1 ≥ ||Z_D||_* + tr(Z_D L (Z_D)^T) + ‖W⊙Z_D‖_1 (21)
Since Z* is the optimal solution of formula (18), equality must hold in (21), i.e. ||Z*||_* + tr(Z* L Z*^T) + ‖W⊙Z*‖_1 = ||Z_D||_* + tr(Z_D L (Z_D)^T) + ‖W⊙Z_D‖_1, so Z_C = 0 and thus Z* = Z_D. The optimal solution Z* of formula (18) therefore has a block diagonal structure.
Mode two: and (5) time complexity analysis.
For the data set X = [x_1, x_2, ..., x_n] ∈ R^(m×n), in step 101 the time complexity of recovering the low-rank dictionary A using RPCA is O(t_1 n^3), where t_1 represents the number of algorithm iterations. The time complexities of updating J, T, E and the Lagrangian multipliers Y_A, Y_B, Y_C in steps 202 to 207 are O(n^3), O(n^2), O(mn^2), O(mn^2), O(n^2) and O(n^2), respectively; when updating Z, the Bartels-Stewart algorithm is used to solve the Sylvester equation, so the time complexity is O(n^3). The overall time complexity of steps 202 to 207 is therefore O(3t_2 n^2 + 2t_2 mn^2 + 2t_2 n^3), which is O(2t_2 n^3) if m < n, where t_2 represents the number of iterations of the alternating direction method of multipliers. The overall time complexity of the spectral clustering of step 105 is O(n^3). Therefore, the time complexity of the algorithm SSLRR proposed in the present application is O((t_1 + 2t_2 + 1) n^3).
According to the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, a clustering result corresponding to the original data set is obtained by utilizing spectral clustering. Therefore, in the embodiment of the application, the data clustering device can firstly acquire a denoised low-rank dictionary from the original data set, then construct the target coefficient by combining the weight matrix acquired according to the original data set, so as to acquire the similarity matrix corresponding to the original data set, and perform clustering processing on the original data set by using the similarity matrix to acquire a corresponding clustering result.
Example five
Based on the first to fourth embodiments, fig. 5 is a schematic diagram of the composition structure of the data clustering device according to the embodiment of the present application, as shown in fig. 5, in the embodiment of the present invention, the data clustering device 1 includes a receiving unit 11, a converting unit 12, a determining unit 13, a creating unit 14 and an acquiring unit 15,
the receiving unit 11 is configured to receive an original data set.
The conversion unit 12 is configured to convert the original data set.
The determining unit 13 is configured to determine, according to the original data set, a low-rank dictionary and a weight matrix corresponding to the original data set; and determining the representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.
The establishing unit 14 is configured to establish a similarity matrix corresponding to the original data set according to the representation coefficient.
The obtaining unit 15 is configured to obtain, based on the similarity matrix, a clustering result corresponding to the original data set by using spectral clustering.
Further, in the embodiment of the present application, the converting unit 12 is specifically configured to perform a dimension reduction process on the original data set after receiving the original data set.
Further, in an embodiment of the present application, the determining unit 13 is specifically configured to determine the low rank dictionary from the raw dataset according to a first objective function; the first objective function is used for denoising the original data set; or, the determining unit 13 is further specifically configured to obtain a third objective function according to the first weight matrix; obtaining a Laplace matrix according to the second weight matrix; converting the third objective function according to the Laplace matrix to obtain a converted third objective function; and solving the converted third objective function to obtain the representation coefficient.
Further, in the embodiment of the present application, the weight matrix includes a first weight matrix and the second weight matrix, and the determining unit 13 is further specifically configured to calculate the first weight matrix according to the original data set; the first weight matrix is used for reducing the representation coefficient; and determining the second weight matrix according to a second objective function and the original dataset; the second weight matrix is used for representing the local relation of the data in the original data set in the original space.
Further, in the embodiment of the present application, the determining unit 13 is further specifically configured to iteratively solve the converted third objective function according to a preset auxiliary variable, so as to obtain the representation coefficient.
Further, in the embodiment of the present application, the determining unit 13 is further configured to determine a category parameter corresponding to the original data set after performing the dimension reduction processing on the original data set.
Further, in the embodiment of the present application, the obtaining unit 15 is specifically configured to obtain, according to the similarity matrix calculation, a normalized symmetric laplace matrix corresponding to the original data set; forming a target matrix according to the category parameters and the normalized symmetric Laplacian matrix; normalizing the target matrix to obtain a normalized target matrix; and clustering the normalized target matrix to obtain a clustering result corresponding to the original data set.
Fig. 6 is a schematic diagram of a second composition structure of the data clustering device according to the embodiment of the present application, as shown in fig. 6, the data clustering device 1 according to the embodiment of the present application may further include a processor 16, a memory 17 storing instructions executable by the processor 16, and further, the data clustering device 1 may further include a communication interface 18, and a bus 19 for connecting the processor 16, the memory 17 and the communication interface 18.
In embodiments of the present application, the processor 16 may be at least one of an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, and a microprocessor. It can be understood that, for different devices, the electronic component used to implement the above processor function may be something else, which is not specifically limited in the embodiments of the present application. The data clustering device 1 may further comprise a memory 17, which may be connected to the processor 16, wherein the memory 17 is adapted to store executable program code comprising computer operating instructions; the memory 17 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, for example, at least two disk memories.
In the present embodiment, the bus 19 is used to connect the communication interface 18, the processor 16 and the memory 17, and to enable intercommunication among these devices.
In an embodiment of the present application, the memory 17 is used for storing instructions and data.
Further, in an embodiment of the present application, the processor 16 is configured to receive and convert the raw data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; and based on the similarity matrix, obtaining a clustering result corresponding to the original data set by utilizing spectral clustering.
In practical applications, the memory 17 may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD); or a combination of the above types of memories, and provides instructions and data to the processor 16.
In addition, each functional module in the present embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional modules.
The integrated units, if implemented in the form of software functional modules and not sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present embodiment may be embodied, in essence or in part, in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, a clustering result corresponding to the original data set is obtained by utilizing spectral clustering. Therefore, in the embodiment of the application, the data clustering device can firstly acquire a denoised low-rank dictionary from the original data set, then construct the target coefficient by combining the weight matrix acquired according to the original data set, so as to acquire the similarity matrix corresponding to the original data set, and perform clustering processing on the original data set by using the similarity matrix to acquire a corresponding clustering result.
The embodiment of the application provides a computer-readable storage medium having a program stored thereon, which when executed by a processor, implements the data clustering method as described above.
Specifically, the program instructions corresponding to one data clustering method in the present embodiment may be stored on a storage medium such as an optical disc, a hard disc, or a usb disk, and when the program instructions corresponding to one data clustering method in the storage medium are read or executed by an electronic device, the method includes the following steps:
receiving and converting an original data set;
determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set;
determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix;
establishing a similarity matrix corresponding to the original data set according to the representation coefficient;
and based on the similarity matrix, obtaining a clustering result corresponding to the original data set by utilizing spectral clustering.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block and/or flow of the flowchart illustrations and/or block diagrams, and combinations of blocks and/or flow diagrams in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application.

Claims (14)

1. A data clustering method, applied to a data clustering device, the method comprising:
receiving and converting an original data set; the original data set is at least one of the following: a face data set and a handwritten digital data set; wherein the original data set is a high-dimensional image data set containing random noise;
determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set;
determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix;
establishing a similarity matrix corresponding to the original data set according to the representation coefficient;
Based on the similarity matrix, obtaining a clustering result corresponding to the original data set by utilizing spectral clustering;
the weight matrix includes a first weight matrix and a second weight matrix, and the determining the weight matrix corresponding to the original data set according to the original data set includes: calculating the first weight matrix according to the original data set; the first weight matrix is used for reducing the representation coefficient; determining the second weight matrix according to a second objective function and the original data set; the second weight matrix is used for representing the local relation of the data in the original data set in the original space;
wherein, the determining, according to the low-rank dictionary and the weight matrix, the representation coefficient corresponding to the original data set includes: obtaining a third objective function according to the first weight matrix; obtaining a Laplace matrix according to the second weight matrix; converting the third objective function according to the Laplace matrix to obtain a converted third objective function; and solving the converted third objective function to obtain the representation coefficient.
2. The method of claim 1, wherein the converting the original dataset comprises:
After receiving the original data set, the original data set is subjected to a dimensionality reduction process.
3. The method of claim 1, wherein said determining a low rank dictionary corresponding to said original data set from said original data set comprises:
determining the low rank dictionary from the raw dataset according to a first objective function; the first objective function is used for denoising the original data set.
4. The method of claim 1, wherein said solving the transformed third objective function to obtain the representation coefficients comprises:
and carrying out iterative solution on the converted third objective function according to a preset auxiliary variable to obtain the representation coefficient.
5. The method of claim 2, wherein after said dimensionality reduction of said original dataset, said method further comprises:
and determining the category parameters corresponding to the original data set.
6. The method of claim 5, wherein the obtaining, based on the similarity matrix, a clustering result corresponding to the original dataset using spectral clustering comprises:
Calculating according to the similarity matrix to obtain a normalized symmetric Laplacian matrix corresponding to the original data set;
forming a target matrix according to the category parameters and the normalized symmetric Laplacian matrix;
normalizing the target matrix to obtain a normalized target matrix;
and clustering the normalized target matrix to obtain a clustering result corresponding to the original data set.
7. A data clustering device, characterized in that the data clustering device comprises: a receiving unit, a converting unit, a determining unit, a setting-up unit and an acquiring unit,
the receiving unit is used for receiving the original data set; the original data set is at least one of the following: a face data set and a handwritten digit data set; wherein the original data set is a high-dimensional image data set containing random noise;
the conversion unit is used for converting the original data set;
the determining unit is used for determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix;
the establishing unit is used for establishing a similarity matrix corresponding to the original data set according to the representation coefficient;
the acquisition unit is used for acquiring a clustering result corresponding to the original data set by utilizing spectral clustering based on the similarity matrix;
the determining unit is further specifically configured to obtain a third objective function according to the first weight matrix; obtain a Laplacian matrix according to the second weight matrix; convert the third objective function according to the Laplacian matrix to obtain a converted third objective function; and solve the converted third objective function to obtain the representation coefficient;
the weight matrix comprises the first weight matrix and the second weight matrix; the determining unit is further specifically configured to calculate the first weight matrix according to the original data set, the first weight matrix being used for reducing the representation coefficient; and to determine the second weight matrix according to a second objective function and the original data set, the second weight matrix being used for representing the local relation of the data in the original data set in the original space.
8. The data clustering device as claimed in claim 7, wherein,
the conversion unit is specifically configured to perform a dimension reduction process on the original data set after receiving the original data set.
9. The data clustering device as claimed in claim 7, wherein,
the determining unit is specifically configured to determine the low-rank dictionary from the original data set according to a first objective function; the first objective function is used for denoising the original data set.
10. The data clustering device as claimed in claim 7, wherein,
the determining unit is further specifically configured to iteratively solve the converted third objective function according to a preset auxiliary variable, so as to obtain the representation coefficient.
11. The data clustering device as claimed in claim 8, wherein,
the determining unit is further configured to determine a category parameter corresponding to the original data set after performing the dimension reduction processing on the original data set.
12. The data clustering device as claimed in claim 11, wherein,
the acquisition unit is specifically configured to obtain a normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix; form a target matrix according to the category parameter and the normalized symmetric Laplacian matrix; normalize the target matrix to obtain a normalized target matrix; and cluster the normalized target matrix to obtain the clustering result corresponding to the original data set.
13. A data clustering device, comprising a processor, a memory storing instructions executable by the processor, a communication interface, and a bus connecting the processor, the memory, and the communication interface, wherein the instructions, when executed by the processor, implement the method of any one of claims 1 to 6.
14. A computer-readable storage medium having stored thereon a program for use in a data clustering device, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 6.
CN201910784526.6A 2019-08-23 2019-08-23 Data clustering method and device and computer readable storage medium Active CN112417234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910784526.6A CN112417234B (en) 2019-08-23 2019-08-23 Data clustering method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112417234A CN112417234A (en) 2021-02-26
CN112417234B true CN112417234B (en) 2024-01-26

Family

ID=74779690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910784526.6A Active CN112417234B (en) 2019-08-23 2019-08-23 Data clustering method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112417234B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601232B (en) * 2022-12-14 2023-03-10 华东交通大学 Color image decoloring method and system based on singular value decomposition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130191425A1 (en) * 2012-01-20 2013-07-25 Fatih Porikli Method for Recovering Low-Rank Matrices and Subspaces from Data in High-Dimensional Matrices
CN106446924A (en) * 2016-06-23 2017-02-22 首都师范大学 Construction of spectral clustering adjacency matrix based on L3CRSC and application thereof
CN107292258A (en) * 2017-06-14 2017-10-24 南京理工大学 High spectrum image low-rank representation clustering method with filtering is modulated based on bilateral weighted

Also Published As

Publication number Publication date
CN112417234A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
Li et al. Learning low-rank and discriminative dictionary for image classification
Qiu et al. Learning transformations for clustering and classification.
Cheng et al. Sparse representation and learning in visual recognition: Theory and applications
Zhu et al. Unsupervised feature selection by regularized self-representation
CN109522956B (en) Low-rank discriminant feature subspace learning method
Yuan et al. Projective nonnegative matrix factorization for image compression and feature extraction
Liu et al. On the performance of manhattan nonnegative matrix factorization
Zhou et al. Double shrinking sparse dimension reduction
Shao et al. Deep linear coding for fast graph clustering
Jin et al. Low-rank matrix factorization with multiple hypergraph regularizer
Zhang et al. Discriminative sparse flexible manifold embedding with novel graph for robust visual representation and label propagation
CN105608478B (en) image feature extraction and classification combined method and system
Li et al. Affinity learning via a diffusion process for subspace clustering
Wang et al. Person re-identification in identity regression space
Nguyen et al. Discriminative low-rank dictionary learning for face recognition
Chen et al. Dictionary learning from ambiguously labeled data
CN110717519A (en) Training, feature extraction and classification method, device and storage medium
Xie et al. Inducing wavelets into random fields via generative boosting
Chen et al. Semi-supervised dictionary learning with label propagation for image classification
Li et al. Unsupervised active learning via subspace learning
CN112417234B (en) Data clustering method and device and computer readable storage medium
Benuwa et al. Kernel based locality–sensitive discriminative sparse representation for face recognition
Wei et al. Spectral clustering steered low-rank representation for subspace segmentation
Shafiee et al. The role of dictionary learning on sparse representation-based classification
Kärkkäinen et al. A Douglas–Rachford method for sparse extreme learning machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant