CN117131395A - Method for clustering data based on Gaussian mixture model - Google Patents
- Publication number
- CN117131395A (application CN202311166058.9A)
- Authority
- CN
- China
- Prior art keywords
- sample
- clustering
- mixture model
- gaussian mixture
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application discloses a method for clustering data based on a Gaussian mixture model, which comprises the following steps: S1, acquiring an original sample and marking it as a first sample; S2, randomly generating a new sample from the original sample and marking it as a second sample; S3, mixing the first sample with the second sample and performing clustering training to obtain a clustering model of sample nodes; S4, inserting data; S5, reading data and acquiring a clustering task and target data samples; S6, updating a cluster tree; and S7, carrying out cluster analysis and classification of the data with the Gaussian mixture model to obtain a clustering result. The method can reduce the computational complexity and the time complexity of data clustering.
Description
Technical Field
The application relates to a data clustering method, in particular to a method for clustering data based on a Gaussian mixture model.
Background
In recent years, clustering algorithms have remained an active and important research hotspot in artificial intelligence and machine learning. Clustering is a fundamental unsupervised learning method: given a set of data, the goal is to divide it into clusters according to mutual similarity, such that samples in the same cluster are as similar as possible, while samples in different clusters are as dissimilar as possible. Stated another way, clustering is the unsupervised classification of data samples or feature vectors into clusters. Through the efforts of many researchers across many research areas, the main problems of cluster learning have largely been solved, reflecting its broad appeal and applicability as an exploratory data analysis step. However, combining and improving clustering methods remains relatively difficult: different research contexts and scientific fields often involve different assumptions, which slows the cross-domain combination of common, effective concepts and methods.
The "Gaussian mixture model data clustering method based on transfer learning" disclosed in application CN201911130984.4 is an increasingly mature technique. Taking a sparse data set to be clustered as the target domain, it selects as the source domain a data set that is similar in type to the target domain and has sufficient data. The source-domain data are first clustered with a Gaussian mixture model to obtain the class means and class covariance matrices of the source domain. The target-domain data are then clustered using these class means and covariance matrices together with a transfer-learning-based Gaussian mixture model clustering algorithm: the posterior probability of each data point is obtained, and the component with the largest posterior probability is used as the point's class label, effectively improving the accuracy of data clustering in the target domain.
However, most existing clustering methods are static, that is, the whole data set must be scanned before each run, as in the K-means, EM-MDL, DENCLUE, and CLIQUE methods. In the big-data era, the traditional static clustering method faces major technical bottlenecks. First, as data volume grows, so does the memory it occupies; once a data set exceeds the computer's memory, the data can no longer be loaded into memory in advance. Second, as data arrives ever faster, the computational complexity and time complexity become unacceptable if the entire data set must be re-clustered every time a clustering method is executed.
Disclosure of Invention
In order to solve at least one technical problem mentioned in the background art, the application aims to provide a method for clustering data based on a Gaussian mixture model, which reduces the computational complexity and the time complexity of the data clustering method.
In order to achieve the above purpose, the present application provides the following technical solutions:
A method for clustering data based on a Gaussian mixture model comprises the following steps:
S1, acquiring an original sample, and marking the original sample as a first sample;
S2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
S3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes;
S4, inserting data;
S5, reading data, and acquiring a clustering task and a target data sample;
S6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
In some embodiments of the present application, before step S1 the method further comprises:
S0, initializing parameters of the Gaussian mixture model.
In some embodiments of the application, the parameter initialization includes:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μk and covariance Σk of each Gaussian distribution, and the mixing coefficient πk corresponding to each Gaussian distribution, satisfying πk ≥ 0 and π1 + π2 + … + πK = 1; the initial parameters of the Gaussian mixture model are Θ(0) = (π1(0), μ1(0), Σ1(0), …, πK(0), μK(0), ΣK(0)).
In some embodiments of the application, the parameter initialization further comprises: the regularization coefficient λ, the initial value of the update coefficient γ, the neighbor count l, and the iteration termination threshold δ; the iteration index t is initialized to 1, i.e. t = 1.
In some embodiments of the present application, in S2, for each original sample, a random vector is generated by randomly sampling from a predefined Gaussian distribution; the random vector is added to the original sample to obtain a new sample, which is marked as the second sample.
In some embodiments of the application, in S7, the Gaussian mixture model is composed of K Gaussian distributions:
p(xi | Θ) = π1 N(xi | μ1, Σ1) + … + πK N(xi | μK, ΣK),
where Θ = (π1, μ1, Σ1, …, πK, μK, ΣK) are the parameters of the Gaussian mixture model, μk and Σk are the mean and covariance of the k-th Gaussian distribution, N(xi | μk, Σk) is the corresponding Gaussian density, and πk is the corresponding mixing coefficient, satisfying πk ≥ 0 and π1 + … + πK = 1; xi denotes one sample in the sample set X, i = 1, …, n, and each sample xi contains d-dimensional features.
In some embodiments of the present application, in S7, the clustering is implemented by using the maximum likelihood estimation method to solve the optimization objective function of the established Gaussian mixture model clustering.
Compared with the prior art, the application has the following beneficial effects:
1. Depending on the clustering density required by the user, nodes at different levels of the cluster tree can be selected as the final clustering result, making the method more widely applicable and the clustering effect better.
2. Based on the user's historical data, the data package and data-association features are analyzed, and the basic Gaussian mixture model algorithm is improved accordingly, which improves the clustering effect. Applied to data-stream clustering, the method enables rapid modeling and detection while effectively improving detection accuracy.
Drawings
FIG. 1 is a flow chart of a data clustering method of the present application.
Detailed Description
The embodiments of the present application are described below clearly and completely. It is apparent that the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
Referring to FIG. 1, the present embodiment provides a method for clustering data based on a Gaussian mixture model, which comprises the following steps:
s0, firstly, initializing parameters of the Gaussian mixture model.
The parameter initialization includes:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μk and covariance Σk of each Gaussian distribution, and the mixing coefficient πk corresponding to each Gaussian distribution, satisfying πk ≥ 0 and π1 + π2 + … + πK = 1; the initial parameters of the Gaussian mixture model are Θ(0) = (π1(0), μ1(0), Σ1(0), …, πK(0), μK(0), ΣK(0)).
The parameter initialization further includes: the regularization coefficient λ, the initial value of the update coefficient γ, the neighbor count l, and the iteration termination threshold δ; the iteration index t is initialized to 1, i.e. t = 1.
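As an illustrative, non-authoritative sketch of this initialization, the following Python snippet sets up the parameters named above; the concrete values of K, d, λ, γ, l, and δ are assumptions for the example, not values prescribed by the application:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 2  # assumed component count and feature dimension

pi = np.full(K, 1.0 / K)            # mixing coefficients pi_k, sum to 1
mu = rng.normal(size=(K, d))        # mean mu_k of each Gaussian distribution
sigma = np.stack([np.eye(d)] * K)   # covariance Sigma_k of each distribution

lam = 1e-3         # regularization coefficient lambda (assumed value)
gamma0 = 1.0       # initial update coefficient gamma (assumed value)
l_neighbors = 5    # neighbor count l (assumed value)
delta = 1e-6       # iteration termination threshold delta (assumed value)
t = 1              # iteration index starts at t = 1
```

Any other seeding strategy (for example, k-means-initialized means) would fit the same parameter layout.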
S1, acquiring an original sample, and marking the original sample as a first sample;
S2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
Specifically, for each original sample, a random vector (a perturbation vector) is generated by sampling from a predefined Gaussian distribution and added to the original sample to obtain a new sample, which is marked as the second sample.
The main purpose of generating new samples is to introduce variation and diversity, so that the clustering algorithm can better understand the distribution and structure of the data:
(1) Enhancing data diversity
By adding the random vector, the second sample deviates to some extent from the first sample, which introduces more data diversity.
(2) Simulating data variation
Adding random vectors simulates the nuances introduced into raw data by noise, measurement error, or other factors, so that the clustering algorithm can better handle clustering of the original data.
(3) Improving the adaptability of the clustering algorithm
Introducing the second sample helps the algorithm adapt to different types of data, reduces its dependence on a specific data distribution, and improves the adaptability of the clustering algorithm.
(4) Preventing overfitting
The introduced randomness prevents the clustering algorithm from overfitting in cases where it would otherwise depend too heavily on the raw data.
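A minimal sketch of the perturbation step S2, assuming NumPy and an illustrative noise scale of 0.1 (the application does not fix a particular scale or distribution parameters):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb(first_sample, scale=0.1):
    """S2: add a Gaussian random vector to each original sample."""
    noise = rng.normal(loc=0.0, scale=scale, size=first_sample.shape)
    return first_sample + noise

first = rng.normal(size=(100, 2))    # the original "first sample"
second = perturb(first)              # the perturbed "second sample"
mixed = np.vstack([first, second])   # the mixed set used in step S3
```

The `scale` parameter controls how far the second sample deviates from the first, trading diversity against fidelity to the original distribution.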
S3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes; the specific steps are as follows:
(1) Preparing data
The first sample and the second sample are fused, and the fused sample set serves as the initial sample set for establishing the initial Gaussian mixture model.
(2) Gaussian mixture model initialization
Two Gaussian distributions are randomly selected as the initial cluster centers, and for each distribution the mean and covariance are initialized.
(3) Expectation-maximization iteration:
A: E-step — calculate the probability that each sample data point belongs to each distribution.
B: M-step — using these probabilities, update the mean and covariance of each distribution to maximize the objective function (the likelihood function).
It should be noted that the objective function is the optimization objective of the Gaussian mixture model clustering, established with the maximum likelihood estimation method after randomly initializing and selecting representative points for each mixture component.
(4) Iteration termination condition
During iteration, the process stops when either the set number of iterations is reached or the change in the objective function falls below the corresponding set threshold. The threshold values may be set according to the user's requirements and are not specifically limited here.
(5) Construction of cluster tree
A cluster tree is constructed on the basis of the clustering model. Each cluster corresponds to a node of the tree, and the hierarchy of the tree represents nesting and containment relationships between different clusters.
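The expectation-maximization iteration of steps (2)–(4) can be sketched as follows. This is a generic EM update for a Gaussian mixture, not the application's exact algorithm (which additionally uses side-information constraints and a cluster tree); the synthetic data, component count, and ridge term are assumptions for illustration:

```python
import numpy as np

def gauss_pdf(X, mean, cov):
    """Multivariate Gaussian density N(x | mean, cov) for each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis form
    return np.exp(-0.5 * quad) / norm

def em_step(X, pi, mu, sigma, ridge=1e-6):
    """One EM update: E-step responsibilities, M-step re-estimation."""
    K = len(pi)
    # E-step: probability that each sample belongs to each distribution
    dens = np.stack([pi[k] * gauss_pdf(X, mu[k], sigma[k])
                     for k in range(K)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing coefficients, means, covariances
    Nk = resp.sum(axis=0)
    pi_new = Nk / len(X)
    mu_new = (resp.T @ X) / Nk[:, None]
    sigma_new = np.stack([
        ((X - mu_new[k]).T * resp[:, k]) @ (X - mu_new[k]) / Nk[k]
        + ridge * np.eye(X.shape[1])   # small ridge keeps covariances invertible
        for k in range(K)])
    return pi_new, mu_new, sigma_new

# Synthetic two-cluster data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
pi = np.array([0.5, 0.5])
mu = np.array([[-1.0, -1.0], [1.0, 1.0]])
sigma = np.stack([np.eye(2), np.eye(2)])
for _ in range(20):
    pi, mu, sigma = em_step(X, pi, mu, sigma)
```

After the loop the two estimated means should sit near the two true cluster centers, while the mixing coefficients remain a valid probability vector.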
S4, inserting data;
Specifically, new data are inserted into the sample set obtained by mixing the first sample and the second sample. Bringing new data into the cluster analysis allows subsequent model updates and cluster analysis to capture the data distribution and structure more accurately.
S5, reading data, and acquiring a clustering task and a target data sample;
(1) Defining clustering tasks
The goals and requirements of the clustering task are defined.
(2) Selecting feature data
Feature data are extracted from the fused sample set in preparation for subsequent cluster analysis.
(3) Screening target data samples
The sample set fused in step S4 is taken as the target data sample.
S6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
The Gaussian mixture model consists of K Gaussian distributions:
p(xi | Θ) = π1 N(xi | μ1, Σ1) + … + πK N(xi | μK, ΣK),
where Θ = (π1, μ1, Σ1, …, πK, μK, ΣK) are the parameters of the Gaussian mixture model, μk and Σk are the mean and covariance of the k-th Gaussian distribution, N(xi | μk, Σk) is the corresponding Gaussian density, and πk is the corresponding mixing coefficient, satisfying πk ≥ 0 and π1 + … + πK = 1; xi denotes one sample in the sample set X, i = 1, …, n, and each sample xi contains d-dimensional features.
The clustering is implemented by using the maximum likelihood estimation method to solve the optimization objective function of the established Gaussian mixture model clustering.
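A sketch of this objective and the resulting hard assignment: the log-likelihood log L(Θ) = Σi log Σk πk N(xi | μk, Σk) is evaluated and each sample is assigned to the component with the largest posterior. The parameters below are assumed known for illustration; in the method they would come from the fitted model:

```python
import numpy as np

def gmm_objective_and_labels(X, pi, mu, sigma):
    """Log-likelihood sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k),
    plus hard cluster labels by maximum posterior (step S7)."""
    d = X.shape[1]
    comp = []
    for k in range(len(pi)):
        diff = X - mu[k]
        inv = np.linalg.inv(sigma[k])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[k]))
        quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
        comp.append(pi[k] * np.exp(-0.5 * quad) / norm)
    dens = np.stack(comp, axis=1)                 # n x K weighted densities
    return np.log(dens.sum(axis=1)).sum(), dens.argmax(axis=1)

# Assumed, well-separated parameters purely for illustration
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 0.5, size=(30, 2)),
               rng.normal(3.0, 0.5, size=(30, 2))])
pi = np.array([0.5, 0.5])
mu = np.array([[-3.0, -3.0], [3.0, 3.0]])
sigma = np.stack([np.eye(2), np.eye(2)])
loglik, labels = gmm_objective_and_labels(X, pi, mu, sigma)
```

Maximizing `loglik` over Θ is exactly the maximum likelihood objective the text describes; `labels` is the posterior-based classification of S7.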
The application constructs the Gaussian mixture model under side-information equivalence-set constraints: side information (data-association features) among the user's historical data points is integrated into the model and expressed as constraints within it, ensuring that every component of the model follows the side information during parameter optimization and clustering, and thus capturing the internal structure and patterns of the data better.
In general, Gaussian mixture model clustering uses only the feature information of the data and ignores the association features between data points. By introducing side information (data-association features) into the model as equations or constraints, the application reflects the clustering structure of the data more accurately and ensures that the model components conform to the relationships conveyed by the side information when the model parameters are optimized.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (7)
1. A method for clustering data based on a Gaussian mixture model, characterized by comprising the following steps:
S1, acquiring an original sample, and marking the original sample as a first sample;
S2, randomly generating a new sample from the original sample, and marking the new sample as a second sample;
S3, mixing the first sample with the second sample, and performing clustering training to obtain a clustering model of sample nodes;
S4, inserting data;
S5, reading data, and acquiring a clustering task and a target data sample;
S6, updating a cluster tree;
and S7, carrying out cluster analysis and classification on the data by adopting a Gaussian mixture model to obtain a clustering result.
2. The method for clustering data based on the Gaussian mixture model according to claim 1, characterized in that, before said S1, the method further comprises:
s0, initializing parameters of the Gaussian mixture model.
3. The method for clustering data based on the Gaussian mixture model according to claim 2, wherein the parameter initialization comprises:
the number K of Gaussian distributions in the Gaussian mixture model, the mean μk and covariance Σk of each Gaussian distribution, and the mixing coefficient πk corresponding to each Gaussian distribution, satisfying πk ≥ 0 and π1 + π2 + … + πK = 1; the initial parameters of the Gaussian mixture model are Θ(0) = (π1(0), μ1(0), Σ1(0), …, πK(0), μK(0), ΣK(0)).
4. The method for clustering data based on the Gaussian mixture model according to claim 3, wherein the parameter initialization further comprises: the regularization coefficient λ, the initial value of the update coefficient γ, the neighbor count l, and the iteration termination threshold δ; the iteration index t is initialized to 1, i.e. t = 1.
5. The method according to claim 1, wherein in S2, for each original sample, a random vector is generated by randomly sampling from a predefined Gaussian distribution; the random vector is added to the original sample to obtain a new sample, and the new sample is marked as a second sample.
6. The method for clustering data based on the Gaussian mixture model according to claim 1, wherein in S7 the Gaussian mixture model is composed of K Gaussian distributions:
p(xi | Θ) = π1 N(xi | μ1, Σ1) + … + πK N(xi | μK, ΣK),
where Θ = (π1, μ1, Σ1, …, πK, μK, ΣK) are the parameters of the Gaussian mixture model, μk and Σk are the mean and covariance of the k-th Gaussian distribution, N(xi | μk, Σk) is the corresponding Gaussian density, and πk is the corresponding mixing coefficient, satisfying πk ≥ 0 and π1 + … + πK = 1; xi denotes one sample in the sample set X, i = 1, …, n, and each sample xi contains d-dimensional features.
7. The method for clustering data based on the Gaussian mixture model according to claim 1, wherein in S7 the clustering is implemented by using the maximum likelihood estimation method to solve the optimization objective function of the established Gaussian mixture model clustering.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311166058.9A CN117131395A (en) | 2023-09-11 | 2023-09-11 | Method for clustering data based on Gaussian mixture model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117131395A true CN117131395A (en) | 2023-11-28 |
Family
ID=88858154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311166058.9A Pending CN117131395A (en) | 2023-09-11 | 2023-09-11 | Method for clustering data based on Gaussian mixture model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117131395A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117672445A (en) * | 2023-12-18 | 2024-03-08 | 郑州大学 | Diabetes mellitus debilitation current situation analysis method and system based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378632B (en) | Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method | |
CN109117793B (en) | Direct-push type radar high-resolution range profile identification method based on deep migration learning | |
Jain et al. | Feature selection: Evaluation, application, and small sample performance | |
CN113326731B (en) | Cross-domain pedestrian re-identification method based on momentum network guidance | |
CN108121975B (en) | Face recognition method combining original data and generated data | |
CN112883839B (en) | Remote sensing image interpretation method based on adaptive sample set construction and deep learning | |
Xiao et al. | A fast method for particle picking in cryo-electron micrographs based on fast R-CNN | |
CN109002755B (en) | Age estimation model construction method and estimation method based on face image | |
CN112906770A (en) | Cross-modal fusion-based deep clustering method and system | |
CN112765352A (en) | Graph convolution neural network text classification method based on self-attention mechanism | |
CN113076970A (en) | Gaussian mixture model clustering machine learning method under deficiency condition | |
CN112464004A (en) | Multi-view depth generation image clustering method | |
CN109033833B (en) | Malicious code classification method based on multiple features and feature selection | |
CN111259917B (en) | Image feature extraction method based on local neighbor component analysis | |
CN117131395A (en) | Method for clustering data based on Gaussian mixture model | |
CN113987236B (en) | Unsupervised training method and unsupervised training device for visual retrieval model based on graph convolution network | |
CN112434553A (en) | Video identification method and system based on deep dictionary learning | |
CN111259264B (en) | Time sequence scoring prediction method based on generation countermeasure network | |
CN112883931A (en) | Real-time true and false motion judgment method based on long and short term memory network | |
CN114036308A (en) | Knowledge graph representation method based on graph attention neural network | |
Lin et al. | A deep clustering algorithm based on gaussian mixture model | |
Zhuang et al. | A handwritten Chinese character recognition based on convolutional neural network and median filtering | |
Cho et al. | Genetic evolution processing of data structures for image classification | |
CN112668633B (en) | Adaptive graph migration learning method based on fine granularity field | |
Cho | Content-based structural recognition for flower image classification |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |