CN117131395A - Method for clustering data based on Gaussian mixture model - Google Patents
- Publication number
- CN117131395A (application CN202311166058.9A)
- Authority
- CN
- China
- Prior art keywords
- sample
- clustering
- mixture model
- gaussian mixture
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application discloses a method for clustering data based on a Gaussian mixture model, which comprises the following steps: S1, acquiring an original sample and marking it as a first sample; S2, randomly generating a new sample from the original sample and marking it as a second sample; S3, mixing the first sample with the second sample and performing clustering training to obtain a clustering model of sample nodes; S4, inserting data; S5, reading data and acquiring a clustering task and target data samples; S6, updating a cluster tree; and S7, carrying out cluster analysis and classification of the data with the Gaussian mixture model to obtain a clustering result. The method can reduce the computational complexity and the time complexity of data clustering.
Description
Technical Field
The application relates to a data clustering method, in particular to a method for clustering data based on a Gaussian mixture model.
Background
In recent years, clustering algorithms have remained an active and important research hotspot in artificial intelligence and machine learning. Clustering is a fundamental unsupervised learning method: given a set of data, the goal is to divide it into clusters according to mutual similarity, such that samples in the same cluster are as similar as possible, while samples in different clusters are as dissimilar as possible. Stated another way, clustering is the unsupervised classification of data samples or feature vectors into clusters. Through the efforts of many researchers across many research areas, the main problems of cluster learning have largely been solved, reflecting its broad appeal and applicability as an exploratory data analysis step. However, combining and improving clustering methods remains relatively difficult: different research contexts and scientific fields often involve different assumptions, which slows the cross-domain combination of common, effective concepts and methods.
The "Gaussian mixture model data clustering method based on transfer learning" disclosed in application CN201911130984.4 is an increasingly mature technique. Taking a sparse data set to be clustered as the target domain, it selects as the source domain a data set that is similar in type to the target domain and has sufficient data. The source-domain data are first clustered with a Gaussian mixture model to obtain the class means and class covariance matrices of the source domain. The target-domain data are then clustered using these class means and covariance matrices together with a transfer-learning-based Gaussian mixture model clustering algorithm: the posterior probability of each data point is obtained, and the component with the largest posterior probability is used as the point's class label, effectively improving the accuracy of data clustering in the target domain.
However, most existing clustering methods are static, that is, the whole data set must be scanned before each run, as in the K-means, EM-MDL, DENCLUE, and CLIQUE methods. In the big-data era, the traditional static clustering method faces major technical bottlenecks. First, as data volume grows, so does the memory it occupies; once a data set exceeds the computer's memory, the data can no longer be loaded into memory in advance. Second, as data arrives ever faster, the computational complexity and time complexity become unacceptable if the entire data set must be re-clustered every time a clustering method is executed.
Disclosure of Invention
In order to solve at least one technical problem mentioned in the background art, the application aims to provide a method for clustering data based on a Gaussian mixture model, which reduces the computational complexity and the time complexity of the data clustering method.
In order to achieve the above purpose, the present application provides the following technical solutions:
A method for clustering data based on a Gaussian mixture model comprises the following steps:
S1, acquiring an original sample, and marking the original sample as a first sample;
S2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
S3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes;
S4, inserting data;
S5, reading data, and acquiring a clustering task and a target data sample;
S6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
In some embodiments of the present application, before step S1 the method further comprises:
S0, initializing parameters of the Gaussian mixture model.
In some embodiments of the application, the parameter initialization includes:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μk and covariance Σk of each Gaussian distribution, and the mixing coefficient πk corresponding to each Gaussian distribution, satisfying πk ≥ 0 and π1 + π2 + … + πK = 1; the initial parameters of the Gaussian mixture model are Θ(0) = (π1(0), μ1(0), Σ1(0), …, πK(0), μK(0), ΣK(0)).
In some embodiments of the application, the parameter initialization further comprises: the regularization coefficient λ, the initial value of the update coefficient γ, the neighbor count l, and the iteration termination threshold δ; the iteration index t is initialized to 1, i.e. t = 1.
In some embodiments of the present application, in S2, for each original sample, a random vector is generated by randomly sampling from a predefined Gaussian distribution; the random vector is added to the original sample to obtain a new sample, which is marked as the second sample.
In some embodiments of the application, in S7, the Gaussian mixture model is composed of K Gaussian distributions:
p(xi | Θ) = π1 N(xi | μ1, Σ1) + … + πK N(xi | μK, ΣK),
where Θ = (π1, μ1, Σ1, …, πK, μK, ΣK) are the parameters of the Gaussian mixture model, μk and Σk are the mean and covariance of the k-th Gaussian distribution, N(xi | μk, Σk) is the corresponding Gaussian density, and πk is the corresponding mixing coefficient, satisfying πk ≥ 0 and π1 + … + πK = 1; xi denotes one sample in the sample set X, i = 1, …, n, and each sample xi contains d-dimensional features.
In some embodiments of the present application, in S7, the clustering is implemented by using the maximum likelihood estimation method to solve the optimization objective function of the established Gaussian mixture model clustering.
Compared with the prior art, the application has the following beneficial effects:
1. Depending on the clustering density required by the user, nodes at different levels of the cluster tree can be selected as the final clustering result, making the method more widely applicable and the clustering effect better.
2. Based on the user's historical data, the data package and data-association features are analyzed, and the basic Gaussian mixture model algorithm is improved accordingly, which improves the clustering effect. Applied to data-stream clustering, the method enables rapid modeling and detection while effectively improving detection accuracy.
Drawings
FIG. 1 is a flow chart of a data clustering method of the present application.
Detailed Description
The embodiments of the present application are described below clearly and completely. It is apparent that the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
Referring to FIG. 1, the present embodiment provides a method for clustering data based on a Gaussian mixture model, which comprises the following steps:
s0, firstly, initializing parameters of the Gaussian mixture model.
The parameter initialization includes:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μk and covariance Σk of each Gaussian distribution, and the mixing coefficient πk corresponding to each Gaussian distribution, satisfying πk ≥ 0 and π1 + π2 + … + πK = 1; the initial parameters of the Gaussian mixture model are Θ(0) = (π1(0), μ1(0), Σ1(0), …, πK(0), μK(0), ΣK(0)).
The parameter initialization further includes: the regularization coefficient λ, the initial value of the update coefficient γ, the neighbor count l, and the iteration termination threshold δ; the iteration index t is initialized to 1, i.e. t = 1.
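As an illustrative, non-authoritative sketch of this initialization, the following Python snippet sets up the parameters named above; the concrete values of K, d, λ, γ, l, and δ are assumptions for the example, not values prescribed by the application:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 2  # assumed component count and feature dimension

pi = np.full(K, 1.0 / K)            # mixing coefficients pi_k, sum to 1
mu = rng.normal(size=(K, d))        # mean mu_k of each Gaussian distribution
sigma = np.stack([np.eye(d)] * K)   # covariance Sigma_k of each distribution

lam = 1e-3         # regularization coefficient lambda (assumed value)
gamma0 = 1.0       # initial update coefficient gamma (assumed value)
l_neighbors = 5    # neighbor count l (assumed value)
delta = 1e-6       # iteration termination threshold delta (assumed value)
t = 1              # iteration index starts at t = 1
```

Any other seeding strategy (for example, k-means-initialized means) would fit the same parameter layout.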
S1, acquiring an original sample, and marking the original sample as a first sample;
S2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
Specifically, for each original sample, a random vector (a perturbation vector) is generated by sampling from a predefined Gaussian distribution and added to the original sample to obtain a new sample, which is marked as the second sample.
The main purpose of generating new samples is to introduce variation and diversity, so that the clustering algorithm can better understand the distribution and structure of the data:
(1) Enhancing data diversity
By adding the random vector, the second sample deviates to some extent from the first sample, which introduces more data diversity.
(2) Simulating data variation
Adding random vectors simulates the nuances introduced into raw data by noise, measurement error, or other factors, so that the clustering algorithm can better handle clustering of the original data.
(3) Improving the adaptability of the clustering algorithm
Introducing the second sample helps the algorithm adapt to different types of data, reduces its dependence on a specific data distribution, and improves the adaptability of the clustering algorithm.
(4) Preventing overfitting
The introduced randomness prevents the clustering algorithm from overfitting in cases where it would otherwise depend too heavily on the raw data.
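A minimal sketch of the perturbation step S2, assuming NumPy and an illustrative noise scale of 0.1 (the application does not fix a particular scale or distribution parameters):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb(first_sample, scale=0.1):
    """S2: add a Gaussian random vector to each original sample."""
    noise = rng.normal(loc=0.0, scale=scale, size=first_sample.shape)
    return first_sample + noise

first = rng.normal(size=(100, 2))    # the original "first sample"
second = perturb(first)              # the perturbed "second sample"
mixed = np.vstack([first, second])   # the mixed set used in step S3
```

The `scale` parameter controls how far the second sample deviates from the first, trading diversity against fidelity to the original distribution.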
S3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes; the specific steps are as follows:
(1) Preparing data
The first sample and the second sample are fused, and the fused sample set serves as the initial sample set for establishing the initial Gaussian mixture model.
(2) Gaussian mixture model initialization
Two Gaussian distributions are randomly selected as the initial cluster centers, and for each distribution the mean and covariance are initialized.
(3) Expectation-maximization iteration:
A: E-step — calculate the probability that each sample data point belongs to each distribution.
B: M-step — using these probabilities, update the mean and covariance of each distribution to maximize the objective function (the likelihood function).
It should be noted that the objective function is the optimization objective of the Gaussian mixture model clustering, established with the maximum likelihood estimation method after randomly initializing and selecting representative points for each mixture component.
(4) Iteration termination condition
During iteration, the process stops when either the set number of iterations is reached or the change in the objective function falls below the corresponding set threshold. The threshold values may be set according to the user's requirements and are not specifically limited here.
(5) Construction of cluster tree
A cluster tree is constructed on the basis of the clustering model. Each cluster corresponds to a node of the tree, and the hierarchy of the tree represents nesting and containment relationships between different clusters.
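The expectation-maximization iteration of steps (2)–(4) can be sketched as follows. This is a generic EM update for a Gaussian mixture, not the application's exact algorithm (which additionally uses side-information constraints and a cluster tree); the synthetic data, component count, and ridge term are assumptions for illustration:

```python
import numpy as np

def gauss_pdf(X, mean, cov):
    """Multivariate Gaussian density N(x | mean, cov) for each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis form
    return np.exp(-0.5 * quad) / norm

def em_step(X, pi, mu, sigma, ridge=1e-6):
    """One EM update: E-step responsibilities, M-step re-estimation."""
    K = len(pi)
    # E-step: probability that each sample belongs to each distribution
    dens = np.stack([pi[k] * gauss_pdf(X, mu[k], sigma[k])
                     for k in range(K)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing coefficients, means, covariances
    Nk = resp.sum(axis=0)
    pi_new = Nk / len(X)
    mu_new = (resp.T @ X) / Nk[:, None]
    sigma_new = np.stack([
        ((X - mu_new[k]).T * resp[:, k]) @ (X - mu_new[k]) / Nk[k]
        + ridge * np.eye(X.shape[1])   # small ridge keeps covariances invertible
        for k in range(K)])
    return pi_new, mu_new, sigma_new

# Synthetic two-cluster data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
pi = np.array([0.5, 0.5])
mu = np.array([[-1.0, -1.0], [1.0, 1.0]])
sigma = np.stack([np.eye(2), np.eye(2)])
for _ in range(20):
    pi, mu, sigma = em_step(X, pi, mu, sigma)
```

After the loop the two estimated means should sit near the two true cluster centers, while the mixing coefficients remain a valid probability vector.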
S4, inserting data;
Specifically, new data are inserted into the sample set obtained by mixing the first sample and the second sample. Bringing new data into the cluster analysis allows subsequent model updates and cluster analysis to capture the data distribution and structure more accurately.
S5, reading data, and acquiring a clustering task and a target data sample;
(1) Defining clustering tasks
The goals and requirements of the clustering task are defined.
(2) Selecting feature data
Feature data are extracted from the fused sample set in preparation for subsequent cluster analysis.
(3) Screening target data samples
The sample set fused in step S4 is taken as the target data sample.
S6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
The Gaussian mixture model consists of K Gaussian distributions:
p(xi | Θ) = π1 N(xi | μ1, Σ1) + … + πK N(xi | μK, ΣK),
where Θ = (π1, μ1, Σ1, …, πK, μK, ΣK) are the parameters of the Gaussian mixture model, μk and Σk are the mean and covariance of the k-th Gaussian distribution, N(xi | μk, Σk) is the corresponding Gaussian density, and πk is the corresponding mixing coefficient, satisfying πk ≥ 0 and π1 + … + πK = 1; xi denotes one sample in the sample set X, i = 1, …, n, and each sample xi contains d-dimensional features.
The clustering is implemented by using the maximum likelihood estimation method to solve the optimization objective function of the established Gaussian mixture model clustering.
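A sketch of this objective and the resulting hard assignment: the log-likelihood log L(Θ) = Σi log Σk πk N(xi | μk, Σk) is evaluated and each sample is assigned to the component with the largest posterior. The parameters below are assumed known for illustration; in the method they would come from the fitted model:

```python
import numpy as np

def gmm_objective_and_labels(X, pi, mu, sigma):
    """Log-likelihood sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k),
    plus hard cluster labels by maximum posterior (step S7)."""
    d = X.shape[1]
    comp = []
    for k in range(len(pi)):
        diff = X - mu[k]
        inv = np.linalg.inv(sigma[k])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[k]))
        quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
        comp.append(pi[k] * np.exp(-0.5 * quad) / norm)
    dens = np.stack(comp, axis=1)                 # n x K weighted densities
    return np.log(dens.sum(axis=1)).sum(), dens.argmax(axis=1)

# Assumed, well-separated parameters purely for illustration
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 0.5, size=(30, 2)),
               rng.normal(3.0, 0.5, size=(30, 2))])
pi = np.array([0.5, 0.5])
mu = np.array([[-3.0, -3.0], [3.0, 3.0]])
sigma = np.stack([np.eye(2), np.eye(2)])
loglik, labels = gmm_objective_and_labels(X, pi, mu, sigma)
```

Maximizing `loglik` over Θ is exactly the maximum likelihood objective the text describes; `labels` is the posterior-based classification of S7.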
The application constructs the Gaussian mixture model under side-information equivalence-set constraints: side information (data-association features) among the user's historical data points is integrated into the model and expressed as constraints within it, ensuring that every component of the model follows the side information during parameter optimization and clustering, and thus capturing the internal structure and patterns of the data better.
In general, Gaussian mixture model clustering uses only the feature information of the data and ignores the association features between data points. By introducing side information (data-association features) into the model as equations or constraints, the application reflects the clustering structure of the data more accurately and ensures that the model components conform to the relationships conveyed by the side information when the model parameters are optimized.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (7)
1. A method for clustering data based on a Gaussian mixture model, characterized by comprising the following steps:
S1, acquiring an original sample, and marking the original sample as a first sample;
S2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
S3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes;
S4, inserting data;
S5, reading data, and acquiring a clustering task and a target data sample;
S6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
2. The method for clustering data based on the Gaussian mixture model according to claim 1, characterized in that, before said S1, the method further comprises:
s0, initializing parameters of the Gaussian mixture model.
3. The method for clustering data based on the Gaussian mixture model according to claim 2, wherein the parameter initialization comprises:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μk and covariance Σk of each Gaussian distribution, and the mixing coefficient πk corresponding to each Gaussian distribution, satisfying πk ≥ 0 and π1 + π2 + … + πK = 1; the initial parameters of the Gaussian mixture model are Θ(0) = (π1(0), μ1(0), Σ1(0), …, πK(0), μK(0), ΣK(0)).
4. The method for clustering data based on the Gaussian mixture model according to claim 3, wherein the parameter initialization further comprises: the regularization coefficient λ, the initial value of the update coefficient γ, the neighbor count l, and the iteration termination threshold δ; the iteration index t is initialized to 1, i.e. t = 1.
5. The method according to claim 1, wherein in S2, for each original sample, a random vector is generated by randomly sampling from a predefined Gaussian distribution; the random vector is added to the original sample to obtain a new sample, and the new sample is marked as a second sample.
6. The method for clustering data based on the Gaussian mixture model according to claim 1, wherein in S7 the Gaussian mixture model is composed of K Gaussian distributions:
p(xi | Θ) = π1 N(xi | μ1, Σ1) + … + πK N(xi | μK, ΣK),
where Θ = (π1, μ1, Σ1, …, πK, μK, ΣK) are the parameters of the Gaussian mixture model, μk and Σk are the mean and covariance of the k-th Gaussian distribution, N(xi | μk, Σk) is the corresponding Gaussian density, and πk is the corresponding mixing coefficient, satisfying πk ≥ 0 and π1 + … + πK = 1; xi denotes one sample in the sample set X, i = 1, …, n, and each sample xi contains d-dimensional features.
7. The method for clustering data based on the Gaussian mixture model according to claim 1, wherein in S7 the clustering is implemented by using the maximum likelihood estimation method to solve the optimization objective function of the established Gaussian mixture model clustering.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311166058.9A CN117131395A (en) | 2023-09-11 | 2023-09-11 | Method for clustering data based on Gaussian mixture model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117131395A true CN117131395A (en) | 2023-11-28 |
Family
ID=88858154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311166058.9A Pending CN117131395A (en) | 2023-09-11 | 2023-09-11 | Method for clustering data based on Gaussian mixture model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117131395A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117672445A (en) * | 2023-12-18 | 2024-03-08 | 郑州大学 | Diabetes mellitus debilitation current situation analysis method and system based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378632B (en) | Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method | |
CN109117793B (en) | Direct-push type radar high-resolution range profile identification method based on deep migration learning | |
Jain et al. | Feature selection: Evaluation, application, and small sample performance | |
CN113326731B (en) | Cross-domain pedestrian re-identification method based on momentum network guidance | |
CN108121975B (en) | Face recognition method combining original data and generated data | |
CN112883839B (en) | Remote sensing image interpretation method based on adaptive sample set construction and deep learning | |
Xiao et al. | A fast method for particle picking in cryo-electron micrographs based on fast R-CNN | |
CN109002755B (en) | Age estimation model construction method and estimation method based on face image | |
CN112906770A (en) | Cross-modal fusion-based deep clustering method and system | |
CN112765352A (en) | Graph convolution neural network text classification method based on self-attention mechanism | |
CN113076970A (en) | Gaussian mixture model clustering machine learning method under deficiency condition | |
CN112464004A (en) | Multi-view depth generation image clustering method | |
CN109033833B (en) | Malicious code classification method based on multiple features and feature selection | |
CN111259917B (en) | Image feature extraction method based on local neighbor component analysis | |
CN117131395A (en) | Method for clustering data based on Gaussian mixture model | |
CN113987236B (en) | Unsupervised training method and unsupervised training device for visual retrieval model based on graph convolution network | |
CN112434553A (en) | Video identification method and system based on deep dictionary learning | |
CN111259264B (en) | Time sequence scoring prediction method based on generation countermeasure network | |
CN112883931A (en) | Real-time true and false motion judgment method based on long and short term memory network | |
CN114036308A (en) | Knowledge graph representation method based on graph attention neural network | |
Lin et al. | A deep clustering algorithm based on gaussian mixture model | |
Zhuang et al. | A handwritten Chinese character recognition based on convolutional neural network and median filtering | |
Cho et al. | Genetic evolution processing of data structures for image classification | |
CN112668633B (en) | Adaptive graph migration learning method based on fine granularity field | |
Cho | Content-based structural recognition for flower image classification |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |