CN117131395A - Method for clustering data based on Gaussian mixture model - Google Patents

Method for clustering data based on Gaussian mixture model

Info

Publication number
CN117131395A
CN117131395A (application CN202311166058.9A)
Authority
CN
China
Prior art keywords
sample
clustering
mixture model
gaussian mixture
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311166058.9A
Other languages
Chinese (zh)
Inventor
姜励
聂劲松
周炜
张�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Public Information Industry Co ltd
Original Assignee
Zhejiang Public Information Industry Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Public Information Industry Co ltd filed Critical Zhejiang Public Information Industry Co ltd
Priority to CN202311166058.9A priority Critical patent/CN117131395A/en
Publication of CN117131395A publication Critical patent/CN117131395A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method for clustering data based on a Gaussian mixture model, which comprises the following steps: S1, acquiring an original sample and marking it as a first sample; S2, randomly generating a new sample from the original sample and marking it as a second sample; S3, mixing the first sample with the second sample and performing clustering training to obtain a clustering model of sample nodes; S4, inserting data; S5, reading data and acquiring a clustering task and a target data sample; S6, updating a cluster tree; and S7, performing cluster analysis and classification of the data with a Gaussian mixture model to obtain a clustering result. The method reduces the computational complexity and time complexity of data clustering.

Description

Method for clustering data based on Gaussian mixture model
Technical Field
The application relates to a data clustering method, in particular to a method for clustering data based on a Gaussian mixture model.
Background
In recent years, clustering algorithms have remained an active and important research topic in artificial intelligence and machine learning. Clustering is an important unsupervised learning method: given a set of data, the goal is to partition the data into clusters according to their mutual similarity, such that samples belonging to the same cluster are as similar as possible and samples belonging to different clusters are as dissimilar as possible. In other words, clustering is the unsupervised classification of data samples or feature vectors into individual clusters. Across many research areas and through the efforts of many researchers, the main problems of cluster learning have largely been addressed, reflecting its broad appeal and applicability as one of the exploratory data analysis steps. However, combining and improving clustering methods remains a relatively difficult problem: different research contexts and scientific fields often rest on different assumptions, which slows the transfer and combination of otherwise common, effective and general concepts and methods.
The 'Gaussian mixture model data clustering method based on transfer learning' disclosed in application CN201911130984.4 is an increasingly mature technique. It takes the sparse data set to be clustered as the target domain and selects, as the source domain, a data set that is similar in data type to the target domain and has a sufficient amount of data. First, the data in the source domain are clustered with a Gaussian mixture model to obtain the class means and class covariance matrices of the source domain. The data in the target domain are then clustered using these class means and covariance matrices together with a Gaussian mixture model clustering algorithm based on transfer learning, which yields the posterior probability of each data point; the component with the largest posterior probability is taken as the class label of the data point, effectively improving the accuracy of data clustering in the target domain.
However, most existing clustering methods are static clustering methods, that is, the whole data set must be scanned before the method is executed, as with the K-means, EM-MDL, DENCLUE and CLIQUE methods. In the big data era, such static clustering faces two major technical bottlenecks. First, as the data volume keeps growing, the memory occupied by the data grows with it; once the space occupied by a data set exceeds the computer's memory, the data can no longer be stored in memory in advance. Second, as data arrives ever faster, the computational complexity and time complexity become unacceptable if the entire data set has to be re-clustered every time the clustering method is executed.
Disclosure of Invention
In order to solve at least one of the technical problems mentioned in the background art, the application aims to provide a method for clustering data based on a Gaussian mixture model that reduces the computational complexity and time complexity of data clustering.
In order to achieve the above purpose, the present application provides the following technical solutions:
a method for clustering data based on a Gaussian mixture model comprises the following steps:
s1, acquiring an original sample, and marking the original sample as a first sample;
s2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
s3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes;
s4, inserting data;
s5, reading data, and acquiring a clustering task and a target data sample;
s6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
In some embodiments of the present application, before the step S1, the method further includes:
s0, initializing parameters of the Gaussian mixture model.
In some embodiments of the application, the parameter initialization includes:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μ_k and covariance Σ_k of each Gaussian distribution, and the mixing coefficient π_k corresponding to each Gaussian distribution, satisfying π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1; the initial parameters of the Gaussian mixture model are Θ^(0) = (π_1^(0), μ_1^(0), Σ_1^(0), …, π_K^(0), μ_K^(0), Σ_K^(0)).
In some embodiments of the application, the parameter initialization further comprises: the regularization coefficient λ, the initial value of the update coefficient γ, the number of neighbors l and the iteration termination threshold δ; the iteration index t is initialized to 1, i.e. t = 1.
In some embodiments of the present application, in S2, for each original sample, a random vector is generated by randomly sampling from a predefined Gaussian distribution; the random vector is added to the original sample to obtain a new sample, which is marked as the second sample.
In some embodiments of the application, in S7, the Gaussian mixture model is composed of K Gaussian distributions:
p(x_i | Θ) = Σ_{k=1}^{K} π_k N_k(x_i | μ_k, Σ_k),
where Θ = (π_1, μ_1, Σ_1, …, π_K, μ_K, Σ_K) is the parameter set of the Gaussian mixture model, μ_k and Σ_k are the mean and covariance of the k-th Gaussian distribution, N_k(x_i | μ_k, Σ_k) is the corresponding Gaussian density, and π_k is the corresponding mixing coefficient, satisfying π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1; x_i denotes one sample in the sample set X, i = 1, …, n, and each sample x_i contains d-dimensional features.
In some embodiments of the present application, in S7, the clustering is implemented by using maximum likelihood estimation to solve the optimization objective function of the established Gaussian mixture model clustering.
Compared with the prior art, the application has the beneficial effects that:
1. According to the clustering density required by the user, nodes at different levels of the cluster tree can be selected as the final clustering result, which makes the method more widely applicable and improves the clustering effect.
2. According to the historical data of the user, the data package and the data association characteristic are analyzed, the basic Gaussian mixture model algorithm is improved, the clustering effect is improved, the method is applied to data stream clustering, and the detection accuracy is effectively improved while rapid modeling and detection are carried out.
Drawings
FIG. 1 is a flow chart of a data clustering method of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, the present embodiment provides a method for clustering data based on a gaussian mixture model, which includes the following steps:
s0, firstly, initializing parameters of the Gaussian mixture model.
The parameter initialization includes:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μ_k and covariance Σ_k of each Gaussian distribution, and the mixing coefficient π_k corresponding to each Gaussian distribution, satisfying π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1; the initial parameters of the Gaussian mixture model are Θ^(0) = (π_1^(0), μ_1^(0), Σ_1^(0), …, π_K^(0), μ_K^(0), Σ_K^(0)).
the parameter initialization further includes: regularized coefficient lambda, updated coefficient gamma initial value, neighbor number l and iteration termination value delta, and iteration sequence number t is initialized to 1, namely t=1.
S1, acquiring an original sample, and marking the original sample as a first sample;
s2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
Specifically, for each original sample, a random vector (perturbation vector) is generated by sampling from a predefined Gaussian distribution; the random vector is added to the original sample to obtain a new sample, which is marked as the second sample (a minimal sketch of this step is given after the list below).
The main purpose of generating new samples is to introduce variations and diversity to enable clustering algorithms to better understand the distribution and structure of data:
(1) Enhancing data diversity.
By adding the random vector, the second sample deviates to some extent from the first sample, which helps introduce more data diversity.
(2) Simulating data variation
Adding random vectors can simulate the small variations that noise, measurement errors or other factors introduce into the raw data, so that the clustering algorithm copes better with such variation when clustering the original data.
(3) Improving adaptability of clustering algorithm
The introduction of the second sample can help the algorithm to better adapt to different types of data, reduce the dependence of the algorithm on specific data distribution and improve the adaptability of the clustering algorithm.
(4) Preventing overfitting
Introducing randomness prevents the clustering algorithm from over-fitting in some cases through excessive dependence on the raw data.
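A minimal sketch of S2 in Python/NumPy, assuming the predefined Gaussian distribution is a zero-mean isotropic Gaussian whose scale sigma is a hypothetical tuning parameter (the patent does not specify the distribution's parameters):

```python
import numpy as np

def generate_second_samples(X, sigma=0.05, seed=0):
    """S2 (illustrative): add a Gaussian perturbation vector to each original sample.

    X     : array of shape (n, d), the first (original) samples
    sigma : assumed standard deviation of the predefined Gaussian perturbation
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=X.shape)  # one random vector per sample
    return X + noise  # the second samples

# Usage: mix the first and second samples as in S3, e.g.
# X2 = generate_second_samples(X1)
# X_mixed = np.vstack([X1, X2])
```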
S3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes; the method comprises the following specific steps:
(1) Preparing data
The first sample and the second sample are fused, and the fused sample set is taken as the initial sample set for establishing the initial Gaussian mixture model.
(2) Gaussian mixture model initialization
Two Gaussian distributions are randomly selected as the initial cluster centers. For each distribution, the mean and covariance are initialized.
(3) Expectation-maximization (EM) iteration:
A (E-step): the probability that each sample data point belongs to each distribution is calculated.
B (M-step): the probability of each distribution (the mixing coefficient), the mean and the covariance are updated to maximize the objective function (the likelihood function).
It should be noted that the objective function is the optimization objective function of the Gaussian mixture model clustering, established by the maximum likelihood estimation method, with the representative points of each Gaussian mixture model component selected by random initialization.
(4) Iteration termination condition
During the iteration, the iteration stops as soon as either the set number of iterations is reached or the change of the objective function reaches the corresponding threshold. The threshold values may be set according to the user's requirements and are not particularly limited here.
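For reference, the standard EM updates for a Gaussian mixture model take the following form (the patent's variant additionally involves the regularization coefficient λ, the update coefficient γ and side-information constraints, which are not reproduced here); γ_ik below denotes the responsibility of component k for sample x_i, not the patent's update coefficient γ:

```latex
% E-step: responsibilities
\gamma_{ik} = \frac{\pi_k \,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
% M-step: parameter updates, with N_k = \sum_{i=1}^{n} \gamma_{ik}
\mu_k = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik}\, x_i,\qquad
\Sigma_k = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\top},\qquad
\pi_k = \frac{N_k}{n}
```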
(5) Construction of cluster tree
A cluster tree is constructed on the basis of the clustering model. Each cluster corresponds to a node of the tree, and the hierarchical structure of the tree represents the nesting and containment relationships between different clusters.
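A minimal sketch of S3, using scikit-learn's GaussianMixture as a stand-in for the clustering training (an assumption: the patent describes its own EM variant); the two-level cluster tree built here, with one child node per fitted component under a root node, is likewise only a simplified illustration of the nesting structure described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_cluster_model(X_mixed, K=2, max_iter=100, tol=1e-4, seed=0):
    """S3 (illustrative): fit a GMM on the mixed samples and return labels plus a tiny cluster tree."""
    gmm = GaussianMixture(n_components=K, covariance_type="full",
                          max_iter=max_iter, tol=tol, random_state=seed)
    labels = gmm.fit_predict(X_mixed)

    # Simplified cluster tree: a root node containing one child node per component.
    tree = {"node": "root", "children": []}
    for k in range(K):
        tree["children"].append({
            "node": f"cluster_{k}",
            "size": int(np.sum(labels == k)),
            "mean": gmm.means_[k],
            "covariance": gmm.covariances_[k],
        })
    return gmm, labels, tree
```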
S4, inserting data;
Specifically, new data is inserted into the sample set obtained by mixing the first sample and the second sample. Bringing the new data into the cluster analysis allows the subsequent model update and cluster analysis to capture the data distribution and structure more accurately.
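A trivial sketch of S4 under the same assumptions as the snippets above, appending newly arriving data to the working sample set before the model is updated:

```python
import numpy as np

def insert_data(X_mixed, X_new):
    """S4 (illustrative): bring newly arriving samples into the working sample set."""
    return np.vstack([X_mixed, np.atleast_2d(X_new)])
```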
S5, reading data, and acquiring a clustering task and a target data sample;
(1) Defining clustering tasks
The targets and requirements of the clustering task are defined.
(2) Selecting feature data
Characteristic data is extracted from the fused sample set in preparation for the subsequent cluster analysis.
(3) Screening target data samples
The sample set fused in step S4 is taken as the target data sample.
S6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
The Gaussian mixture model is composed of K Gaussian distributions:
p(x_i | Θ) = Σ_{k=1}^{K} π_k N_k(x_i | μ_k, Σ_k),
where Θ = (π_1, μ_1, Σ_1, …, π_K, μ_K, Σ_K) is the parameter set of the Gaussian mixture model, μ_k and Σ_k are the mean and covariance of the k-th Gaussian distribution, N_k(x_i | μ_k, Σ_k) is the corresponding Gaussian density, and π_k is the corresponding mixing coefficient, satisfying π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1; x_i denotes one sample in the sample set X, i = 1, …, n, and each sample x_i contains d-dimensional features.
The clustering is implemented by using maximum likelihood estimation to solve the optimization objective function of the established Gaussian mixture model clustering.
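A minimal sketch of the maximum-likelihood objective of S7 and of the final cluster assignment, evaluated with SciPy's multivariate normal density; the function names log_likelihood and assign_clusters are introduced here only for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, weights, means, covs):
    """Log-likelihood of the GMM: sum_i log( sum_k pi_k * N_k(x_i | mu_k, Sigma_k) )."""
    n, K = X.shape[0], len(weights)
    density = np.zeros((n, K))
    for k in range(K):
        density[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    return np.sum(np.log(density.sum(axis=1)))

def assign_clusters(X, weights, means, covs):
    """S7 (illustrative): assign each sample to the component with the largest posterior."""
    n, K = X.shape[0], len(weights)
    post = np.zeros((n, K))
    for k in range(K):
        post[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    return np.argmax(post, axis=1)
```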
The application constructs the Gaussian mixture model under the constraints of a side-information equivalence set: the side information (data association characteristics) among the user's historical data points is integrated into the model and expressed as constraints, so that each component of the model follows the side information during parameter optimization and clustering, and the internal structure and patterns of the data are captured better.
In general, only the feature information of the data is used when clustering with a Gaussian mixture model, and the association characteristics among the data are not exploited. By introducing the side information (data association characteristics) into the model as equations or constraints, the application can reflect the clustering structure of the data more accurately and ensure that the components of the model conform to the relationships conveyed by the side information when the model parameters are optimized.
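Purely as an illustration of how such side information could enter the objective (the patent does not give the exact form of the constraints here), a penalized log-likelihood over a hypothetical must-link set M of sample pairs that should share a component might be written as:

```latex
\max_{\Theta}\; \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)
\;-\; \lambda \sum_{(i,j)\in M}\sum_{k=1}^{K}\bigl(\gamma_{ik}-\gamma_{jk}\bigr)^{2}
```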
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (7)

1. The method for clustering data based on the Gaussian mixture model is characterized by comprising the following steps of:
s1, acquiring an original sample, and marking the original sample as a first sample;
s2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
s3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes;
s4, inserting data;
s5, reading data, and acquiring a clustering task and a target data sample;
s6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
2. The method for clustering data based on the Gaussian mixture model according to claim 1, wherein, before said S1, the method further comprises:
s0, initializing parameters of the Gaussian mixture model.
3. The method for clustering data based on the gaussian mixture model according to claim 2, wherein said initializing parameters comprises:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μ_k and covariance Σ_k of each Gaussian distribution, and the mixing coefficient π_k corresponding to each Gaussian distribution, satisfying π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1; the initial parameters of the Gaussian mixture model are Θ^(0) = (π_1^(0), μ_1^(0), Σ_1^(0), …, π_K^(0), μ_K^(0), Σ_K^(0)).
4. A method for clustering data based on a Gaussian mixture model according to claim 3, wherein said parameter initialization further comprises: the regularization coefficient λ, the initial value of the update coefficient γ, the number of neighbors l and the iteration termination threshold δ; and the iteration index t is initialized to 1, i.e. t = 1.
5. The method according to claim 1, wherein in S2, for each original sample, a random vector is generated by randomly sampling from a predefined gaussian distribution, the random vector is added to the original sample, a new sample is obtained, and the new sample is marked as a second sample.
6. The method for clustering data based on the Gaussian mixture model as recited in claim 1, wherein in S7 the Gaussian mixture model is composed of K Gaussian distributions:
p(x_i | Θ) = Σ_{k=1}^{K} π_k N_k(x_i | μ_k, Σ_k),
where Θ = (π_1, μ_1, Σ_1, …, π_K, μ_K, Σ_K) is the parameter set of the Gaussian mixture model, μ_k and Σ_k are the mean and covariance of the k-th Gaussian distribution, N_k(x_i | μ_k, Σ_k) is the corresponding Gaussian density, and π_k is the corresponding mixing coefficient, satisfying π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1; x_i denotes one sample in the sample set X, i = 1, …, n, and each sample x_i contains d-dimensional features.
7. The method for clustering data based on the Gaussian mixture model according to claim 1, wherein in S7, the clustering is implemented by using maximum likelihood estimation to solve the optimization objective function of the established Gaussian mixture model clustering.
CN202311166058.9A 2023-09-11 2023-09-11 Method for clustering data based on Gaussian mixture model Pending CN117131395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311166058.9A CN117131395A (en) 2023-09-11 2023-09-11 Method for clustering data based on Gaussian mixture model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311166058.9A CN117131395A (en) 2023-09-11 2023-09-11 Method for clustering data based on Gaussian mixture model

Publications (1)

Publication Number Publication Date
CN117131395A true CN117131395A (en) 2023-11-28

Family

ID=88858154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311166058.9A Pending CN117131395A (en) 2023-09-11 2023-09-11 Method for clustering data based on Gaussian mixture model

Country Status (1)

Country Link
CN (1) CN117131395A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117672445A (en) * 2023-12-18 2024-03-08 郑州大学 Diabetes mellitus debilitation current situation analysis method and system based on big data

Similar Documents

Publication Publication Date Title
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN109117793B (en) Direct-push type radar high-resolution range profile identification method based on deep migration learning
Jain et al. Feature selection: Evaluation, application, and small sample performance
CN113326731B (en) Cross-domain pedestrian re-identification method based on momentum network guidance
CN108121975B (en) Face recognition method combining original data and generated data
CN112883839B (en) Remote sensing image interpretation method based on adaptive sample set construction and deep learning
Xiao et al. A fast method for particle picking in cryo-electron micrographs based on fast R-CNN
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN112906770A (en) Cross-modal fusion-based deep clustering method and system
CN112765352A (en) Graph convolution neural network text classification method based on self-attention mechanism
CN113076970A (en) Gaussian mixture model clustering machine learning method under deficiency condition
CN112464004A (en) Multi-view depth generation image clustering method
CN109033833B (en) Malicious code classification method based on multiple features and feature selection
CN111259917B (en) Image feature extraction method based on local neighbor component analysis
CN117131395A (en) Method for clustering data based on Gaussian mixture model
CN113987236B (en) Unsupervised training method and unsupervised training device for visual retrieval model based on graph convolution network
CN112434553A (en) Video identification method and system based on deep dictionary learning
CN111259264B (en) Time sequence scoring prediction method based on generation countermeasure network
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN114036308A (en) Knowledge graph representation method based on graph attention neural network
Lin et al. A deep clustering algorithm based on gaussian mixture model
Zhuang et al. A handwritten Chinese character recognition based on convolutional neural network and median filtering
Cho et al. Genetic evolution processing of data structures for image classification
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
Cho Content-based structural recognition for flower image classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination