CN117195027A - Cluster weighted clustering integration method based on member selection

Cluster weighted clustering integration method based on member selection

Info

Publication number: CN117195027A
Application number: CN202311166210.3A
Authority: CN
Prior art keywords: cluster, clustering, matrix, target, members
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 徐秀芳, 高婷, 徐森, 黄曙荣, 花小朋, 许贺洋, 郭乃瑄, 卞学胜, 孙雯, 刘轩绮
Current and original assignees: Yancheng Institute of Technology; Yancheng Institute of Technology Technology Transfer Center Co Ltd
Application filed by Yancheng Institute of Technology and Yancheng Institute of Technology Technology Transfer Center Co Ltd
Priority to CN202311166210.3A
Filing and priority date: 2023-09-08; Publication date: 2023-12-08

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cluster weighted clustering integration method based on member selection, which comprises the following steps: constructing a cluster member set; inputting the cluster member set into a pre-trained decision tree model, outputting a label for each cluster member in the set, screening out the cluster members whose label matches the preset label, and generating a target cluster set; determining a cluster layer weighting coefficient for each cluster in the target cluster set; determining a target CA (co-association) matrix of the target cluster set according to the cluster layer weighting coefficients; and executing a hierarchical clustering algorithm on the target CA matrix to obtain the final clustering result. The method first selects high-quality cluster members, then considers the diversity of the cluster members at the cluster layer, measures the uncertainty of each cluster, and assigns weights at the cluster layer; finally, the CA matrix is fine-tuned according to the weights and the high-confidence information, thereby improving the accuracy and robustness of clustering.

Description

Cluster weighted clustering integration method based on member selection
Technical Field
The invention relates to the technical field of data clustering, in particular to a cluster weighted clustering integration method based on member selection.
Background
Cluster analysis is one of the hot topics of machine learning research. It is widely used for data compression, information retrieval, image segmentation, and text clustering, and is receiving more and more attention in fields such as biology, geology, geography, and anomaly detection. Cluster analysis is a form of unsupervised machine learning: lacking prior knowledge of the data set, it automatically divides the data into a number of groups or clusters based only on similarity measures between data points, samples, and objects, so that the similarity between points in the same cluster is as high as possible and the similarity between points in different clusters is as low as possible. Introducing the idea of ensemble learning into cluster analysis gave rise to research on clustering integration, which mainly comprises the following two steps: the first step takes the data set as input, runs a clustering algorithm, and outputs multiple different clustering results; this is called cluster member generation. The second step takes the set formed by all cluster members, namely the cluster ensemble, as input, combines the cluster members, and outputs the final clustering result; this step is called clustering integration, also known as consensus function design.
Clustering algorithm: the basis of member selection. Common clustering algorithms include K-means, DBSCAN, hierarchical clustering, and the like; these algorithms assign data points to different clusters. Member selection method: used to select the cluster members that participate in the integration. Common member selection methods include selection based on clustering performance (e.g., selecting cluster members with higher stability and consistency), selection based on diversity (e.g., selecting cluster members with greater variability), and selection based on heuristic rules or models (e.g., selecting empirically better cluster members). Integration strategy: determines how the selected cluster members are combined to generate the final clustering result. Common integration strategies include voting (the final cluster assignment is determined by the votes of the cluster members), weighting (the final cluster assignment is determined by the weights of the cluster members), and the like. Clustering performance evaluation: used to evaluate the quality and effect of the integrated clustering result. Common evaluation indexes include inter-cluster distance, intra-cluster compactness, silhouette coefficients, and the like.
Currently, a common limitation of most approaches is that they treat all clusters and all base clusterings in the ensemble equally, even though lower-quality clusters or lower-quality base clusterings may occur.
Prior art 1 is a dissertation (Zhuang Dong, Research on a Clustering Ensemble Algorithm Based on Member Selection [D], Hangzhou Dianzi University), which discloses a clustering integration algorithm based on member selection, comprising: Step 1: in the cluster member generation stage, select three clustering algorithms, namely the K-means clustering algorithm, the fuzzy c-means clustering algorithm, and the kernel-based fuzzy c-means clustering algorithm; Step 2: measure the variability between cluster members based on normalized mutual information (Normalized Mutual Information, NMI) and the adjusted Rand index (Adjusted Rand Index, ARI); Step 3: after the cluster member differences are calculated, select some of the cluster members to form a new data set; Step 4: cluster the new data set using the K-means algorithm, with the number of clusters set to the number of cluster members to be selected. In this technical scheme, the generated cluster members are regarded as a new data set, with the diversity indexes of each cluster member used as the feature values of a sample; the new data set is then clustered using the K-means algorithm, after which a joint quality evaluation function is used to calculate and select the highest-quality cluster member in each cluster to form the required member subset, so that the cluster members in the subset simultaneously satisfy the requirements of large difference and high quality.
Prior art 2 is a journal paper (Shao Changlong, Sun Tongfeng, Ding Shifei, A Clustering Ensemble Algorithm Based on Information Entropy Weighting [J]), which discloses: Step 1: in the cluster member generation stage, select the K-means algorithm and randomly generate the base cluster members; Step 2: measure the stability of each cluster in the base clusterings, introducing an information-entropy-based cluster evaluation index (Information Entropy Index, IEI); Step 3: form a weighted co-association matrix S using the IEI index; Step 4: regard the matrix S as an undirected graph and perform graph partitioning on it using the Ncut algorithm to obtain the final result. In this technical scheme, information entropy is introduced to evaluate an uncertainty index of each cluster, which is used as the measurement index for cluster weighting.
The prior art has the following technical problems: 1. As the number of cluster members increases, a large number of redundant cluster members may appear. Traditional clustering integration algorithms integrate all cluster members after they are generated; when there are many redundant cluster members, the integration loses meaning and the space complexity increases. 2. In the integration stage, cluster members of different quality are considered to contribute equally to the integration result and are treated alike, which aggravates the influence of poor-quality cluster members on the integration result. 3. In the integration process, even when cluster members are treated differently by evaluating or weighting them, they are regarded as independent individuals, and the local diversity of the clusters within the same cluster member is ignored.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the above-described technology. To this end, the invention aims to provide a cluster weighted clustering integration method based on member selection, which selects high-quality cluster members, then considers the diversity of the cluster members at the cluster layer, measures the uncertainty of the clusters, and assigns weights at the cluster layer; finally, the CA matrix is fine-tuned according to the weights and the high-confidence information, thereby improving the accuracy and robustness of clustering.
In order to achieve the above objective, an embodiment of the present invention provides a cluster weighted clustering integration method based on member selection, including:
constructing a cluster member set;
inputting the cluster member set into a pre-trained decision tree model, outputting a label for each cluster member in the cluster member set, screening out the cluster members whose label matches the preset label, and generating a target cluster set;
determining a cluster layer weighting coefficient of each cluster in the target cluster set;
determining a target CA matrix of a target cluster set according to the cluster layer weighting coefficient;
and executing a hierarchical clustering algorithm according to the target CA matrix to obtain a final clustering result.
According to some embodiments of the invention, constructing a cluster member set includes:
obtaining the number r of cluster members and the number k of clusters;
initializing i=1;
judging whether i is less than or equal to r;
when it is determined that i is less than or equal to r, clustering with the K-Means algorithm to generate a cluster member and obtain its clustering result;
assigning i = i + 1 and continuing the judgment until i is no longer less than or equal to r, at which point the cluster member set is constructed.
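By way of illustration, this member-generation loop can be sketched in Python as follows; scikit-learn's KMeans is assumed, and the data matrix X and the seeding scheme are illustrative choices, not prescribed by the invention:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_member_set(X: np.ndarray, r: int, k: int) -> list[np.ndarray]:
    """Run K-Means r times; each label vector is one cluster member."""
    members = []
    i = 1
    while i <= r:                          # judge whether i <= r
        km = KMeans(n_clusters=k, n_init=10, random_state=i)
        members.append(km.fit_predict(X))  # clustering result of member i
        i += 1                             # assign i = i + 1, loop until i > r
    return members
```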
According to some embodiments of the present invention, a method for obtaining a pre-trained decision tree model includes:
acquiring a sample cluster member set;
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
comparing the Davies-Bouldin index of each sample cluster member with the average value, labeling the sample cluster members whose Davies-Bouldin index is lower than the average value with a "high" label, and labeling those whose index is higher than the average value with a "low" label;
training with the labeled sample cluster members as the training set to obtain the trained decision tree model.
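As a hedged illustration of the labeling step, the following sketch uses scikit-learn's davies_bouldin_score; the "high"/"low" strings mirror the labels described above:

```python
from sklearn.metrics import davies_bouldin_score

def label_sample_members(X, sample_members):
    """Label a member 'high' if its Davies-Bouldin index is below the
    overall average (a lower DBI indicates better clustering), else 'low'."""
    dbi = [davies_bouldin_score(X, labels) for labels in sample_members]
    avg = sum(dbi) / len(dbi)
    return ["high" if v < avg else "low" for v in dbi]
```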
According to some embodiments of the present invention, training is performed based on labeled sample cluster members as a training set to obtain a trained decision tree model, including:
determining the ARI, NMI, and F-measure indexes of each sample cluster member and taking them as the feature attribute set; the ARI takes values in [-1, 1], and the NMI and F-measure indexes take values in [0, 1];
calculating and comparing the Gini coefficients of the feature attribute set with respect to the ARI, NMI, and F-measure indexes; the "feature attribute 1" with the smallest Gini coefficient is selected as the root node, and values of feature attribute 1 close to 1 are marked "high"; then the labeled cluster members whose feature attribute 1 values are not close to 1 are taken as a new label set, the Gini coefficients of the two remaining feature attributes are calculated, and the smaller one is selected as "feature attribute 2", an internal node; finally, the remaining feature attribute, as "feature attribute 3", becomes the last internal node, yielding the trained decision tree model.
According to some embodiments of the invention, determining a target CA matrix for a target cluster set from cluster layer weighting coefficients includes:
constructing a CA matrix for the target cluster set;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix.
According to some embodiments of the invention, the hierarchical clustering algorithm is the average-linkage method.
According to some embodiments of the invention, further comprising: and evaluating the final clustering result based on the external index, determining an evaluation value, and judging the validity of the final clustering result based on the evaluation value.
According to some embodiments of the invention, the external index is an F-measure value or an NMI value.
According to some embodiments of the invention, calculating the Gini coefficients of the feature attribute set with respect to the ARI, NMI, and F-measure indexes includes:

Gini(D) = 1 - ∑_i P_i²

Gini(D, a) = ∑_v (|D_v| / |D|) · Gini(D_v)

where Gini(D) represents the Gini impurity of data set D; P_i represents the proportion of class label i in data set D, i.e., the number of samples belonging to class i divided by the total number of samples; Gini(D, a) represents the Gini impurity of data set D conditioned on feature attribute a; D_v represents the data subset obtained when feature attribute a takes the value v, and |D_v| its number of samples; Gini(D_v) represents the Gini impurity of data subset D_v.
According to some embodiments of the invention, constructing a CA matrix for the target cluster set includes:

A_ij = (1/M) · ∑_{m=1}^{M} δ_m(o_i, o_j), with δ_m(o_i, o_j) = 1 if Cls_m(o_i) = Cls_m(o_j) and 0 otherwise

where A is the CA matrix; m denotes the m-th cluster member; M denotes the total number of cluster members in the cluster set; Cls_m(o_i) denotes the cluster in which sample point o_i lies; Cls_m(o_j) denotes the cluster in which sample point o_j lies.
The invention provides a cluster weighted clustering integration method based on member selection, which introduces a decision tree to assist in selecting high-quality cluster members: a decision tree model is trained to assist member selection, considering the quality and diversity of cluster members from multiple angles as well as the differences inside cluster members, thereby distinguishing cluster members more comprehensively. Then, considering the diversity of the cluster members at the cluster layer, the uncertainty of the clusters is measured and weights are assigned at the cluster layer; finally, the CA matrix is fine-tuned according to the weights and the high-confidence information, thereby improving the accuracy and robustness of clustering.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow diagram of a cluster weighted clustering integration method based on member selection, in accordance with one embodiment of the invention;
FIG. 2 is a flow diagram of building a cluster member set according to one embodiment of the invention;
FIG. 3 is a flow chart of a cluster weighted clustering integration method based on member selection in accordance with yet another embodiment of the invention;
FIG. 4 is a flow chart of training a decision tree according to one embodiment of the invention;
FIG. 5 is a schematic diagram of a trained decision tree model according to one embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
As shown in fig. 1, an embodiment of the present invention provides a cluster weighted clustering integration method based on member selection, including steps S1-S5:
S1, constructing a cluster member set;
S2, inputting the cluster member set into a pre-trained decision tree model, outputting a label for each cluster member in the cluster member set, screening out the cluster members whose label matches the preset label, and generating a target cluster set;
S3, determining a cluster layer weighting coefficient for each cluster in the target cluster set;
S4, determining a target CA matrix of the target cluster set according to the cluster layer weighting coefficients;
S5, executing a hierarchical clustering algorithm according to the target CA matrix to obtain the final clustering result.
The working principle of the technical scheme is as follows: in this embodiment, constructing the cluster member set means setting the number of cluster members r and the number of clusters k, where r ∈ N+ and N is the number of data sample points; r cluster members are then randomly generated using the K-Means algorithm as the cluster ensemble P, thereby generating the cluster member set.
In this embodiment, the preset label is the label "high". The cluster member set is input into the pre-trained decision tree model, the label of each cluster member is output, and the cluster members carrying the preset label are screened out; that is, the high-quality cluster members are determined based on the decision tree model. The target cluster set is the set constructed from these high-quality cluster members. When the cluster member set is processed and identified based on the decision tree model, the ARI, NMI, and F-measure indexes in the feature attribute set of each cluster member are computed, and the decision tree model predicts the cluster quality label ("high" or "low") according to the learned node division rules. The cluster members labeled "high" form a new cluster ensemble P' that participates in the subsequent processing; that is, the target cluster set is generated.
In this embodiment, the method for determining the cluster layer weighting coefficient of each cluster in the target cluster set includes: introducing information entropy over the target cluster set and calculating an uncertainty index IEI for each cluster, which is used as the cluster layer weighting coefficient. The IEI is calculated as follows:
where n represents the total number of clusters; p(C_i, C_j) measures the similarity between cluster C_i and cluster C_j. The IEI index reflects the likelihood that the points in cluster C_i remain in the same cluster in the other base clusterings: the larger the IEI, the more likely the points in C_i are to be grouped into the same cluster in other base clusterings. High-confidence information is captured from the CA matrix and then supplemented and refined to obtain an ideal CA matrix; the high-confidence matrix (HC, High-Confidence Matrix) obtained from this information is denoted H; max(H) is the maximum value in the H matrix; min(H) is the minimum value in the H matrix; H(C_i) is the value corresponding to cluster C_i in the H matrix;
or KL divergence is introduced to determine a cluster layer weighting coefficient;
or a plurality of evaluation indexes are introduced and fused together to form a new evaluation index to determine the cluster layer weighting coefficient.
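The exact IEI formula is conveyed above only through its terms (n, p(C_i, C_j), max(H), min(H), H(C_i)). The following sketch is therefore an assumption-laden illustration of an entropy-based cluster uncertainty score followed by a min-max normalization, in the spirit of the description rather than the patent's literal formula:

```python
import numpy as np

def cluster_entropy(cluster_idx: np.ndarray, other_members: list[np.ndarray]) -> float:
    """Average entropy of how one cluster's points scatter across the
    clusters of the other base clusterings (lower = more stable).
    ASSUMPTION: this stands in for the patent's unreproduced IEI formula."""
    h = 0.0
    for labels in other_members:
        counts = np.bincount(labels[cluster_idx])
        p = counts[counts > 0] / counts.sum()
        h += -(p * np.log(p)).sum()
    return h / len(other_members)

def minmax_weights(values) -> np.ndarray:
    """Min-max normalize per-cluster scores into [0, 1], echoing the
    max(H)/min(H) normalization terms mentioned in the text."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + 1e-12)
```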
In this embodiment, the target CA matrix is a final CA matrix obtained by adjusting the initial CA matrix according to the weight distribution and the high confidence information of the cluster layer.
The beneficial effects of the technical scheme are as follows: a decision tree is introduced to assist in selecting high-quality cluster members; a decision tree model is trained to assist member selection, considering the quality and diversity of cluster members from multiple angles as well as the differences inside cluster members, so that cluster members are treated more comprehensively and differentially. Then, considering the diversity of the cluster members at the cluster layer, the uncertainty of the clusters is measured and weights are assigned at the cluster layer; finally, the CA matrix is fine-tuned according to the weights and the high-confidence information, thereby improving the accuracy and robustness of clustering.
In an embodiment, the cluster member set may also be generated using hierarchical clustering, spectral clustering, or other clustering methods.
As shown in fig. 2, constructing a cluster member set according to some embodiments of the invention includes:
obtaining the number r of cluster members and the number k of clusters;
initializing i=1;
judging whether i is less than or equal to r;
when it is determined that i is less than or equal to r, clustering with the K-Means algorithm to generate a cluster member and obtain its clustering result;
assigning i = i + 1 and continuing the judgment until i is no longer less than or equal to r, at which point the cluster member set is constructed.
The beneficial effects of the technical scheme are that the K-Means algorithm is used for clustering, so that the cluster member set can be constructed accurately.
According to some embodiments of the present invention, a method for obtaining a pre-trained decision tree model includes:
acquiring a sample cluster member set;
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
comparing the Davies-Bouldin index of each sample cluster member with the average value, labeling the sample cluster members whose Davies-Bouldin index is lower than the average value with a "high" label, and labeling those whose index is higher than the average value with a "low" label;
training with the labeled sample cluster members as the training set to obtain the trained decision tree model.
The working principle of the technical scheme is as follows: in this embodiment, the Davies-Bouldin index, also known as the classification appropriateness index, is an index for evaluating the quality of a clustering algorithm.
The beneficial effects of the technical scheme are that the clustering integration algorithm is combined with the decision tree to obtain a trained decision tree model, and high- or low-quality cluster members are determined based on the decision tree model, facilitating accurate cluster analysis.
As shown in fig. 3-4, training based on labeled sample cluster members as a training set according to some embodiments of the present invention, to obtain a trained decision tree model includes:
determining the ARI, NMI, and F-measure indexes of each sample cluster member and taking them as the feature attribute set; the ARI takes values in [-1, 1], and the NMI and F-measure indexes take values in [0, 1];
calculating and comparing the Gini coefficients of the feature attribute set with respect to the ARI, NMI, and F-measure indexes; the "feature attribute 1" with the smallest Gini coefficient is selected as the root node, and values of feature attribute 1 close to 1 are marked "high"; then the labeled cluster members whose feature attribute 1 values are not close to 1 are taken as a new label set, the Gini coefficients of the two remaining feature attributes are calculated, and the smaller one is selected as "feature attribute 2", an internal node; finally, the remaining feature attribute, as "feature attribute 3", becomes the last internal node, yielding the trained decision tree model.
The working principle of the technical scheme is as follows: in this embodiment, the adjusted Rand index (Adjusted Rand Index, ARI), normalized mutual information (Normalized Mutual Information, NMI), and the F-measure index (FMI, defined as the geometric mean of precision and recall) are taken as the feature attribute set of each sample cluster member. All three feature attributes are better the closer they are to 1. The ARI has the value range [-1, 1] and is divided into two classes: one greater than 0 and one less than or equal to 0. The NMI and F-measure indexes have the value range [0, 1] and are likewise divided into two classes: one greater than 0.5 and one less than or equal to 0.5. The importance of each feature attribute is calculated using the Gini coefficient to determine the root node and internal nodes of the decision tree; the larger the Gini coefficient, the greater the uncertainty of the feature attribute. The decision tree model is shown in fig. 5.
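In practice this manual construction could be delegated to a library; the following sketch assumes scikit-learn's DecisionTreeClassifier with the Gini criterion as a stand-in for the hand-built tree described above, with the (ARI, NMI, F-measure) triples as features and the "high"/"low" tags as labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_quality_tree(features: np.ndarray, labels: list) -> DecisionTreeClassifier:
    """features: shape (n_members, 3), columns [ARI, NMI, F-measure];
    labels: 'high'/'low' strings from the Davies-Bouldin labeling step."""
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
    return tree.fit(features, labels)
```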
The attributes of the root node and the internal nodes of the decision tree are measured using the Gini coefficient. For example, the Gini coefficient of the ARI attribute is determined as follows.

First, the Gini impurity for ARI greater than 0 is calculated as:

Gini(D_v1) = 1 - (P_1² + P_2²)

where Gini(D_v1) represents the Gini impurity of data subset D_v1; D_v1 represents the data subset obtained when the feature attribute ARI is greater than 0; P_1 is the number of samples with class label "high" divided by the number of samples with ARI greater than 0; P_2 is the number of samples with class label "low" divided by the number of samples with ARI greater than 0.

Next, the Gini impurity for ARI less than or equal to 0 is calculated as:

Gini(D_v2) = 1 - (P_1'² + P_2'²)

where Gini(D_v2) represents the Gini impurity of data subset D_v2; D_v2 represents the data subset obtained when the feature attribute ARI is less than or equal to 0; P_1' is the number of samples with class label "high" divided by the number of samples with ARI less than or equal to 0; P_2' is the number of samples with class label "low" divided by the number of samples with ARI less than or equal to 0.

Finally, the Gini coefficient of ARI is calculated as:

Gini(D, ARI) = (|D_v1| / |D|) · Gini(D_v1) + (|D_v2| / |D|) · Gini(D_v2)

where Gini(D, ARI) represents the Gini coefficient of data set D with respect to the feature attribute ARI; |D_v1| represents the number of samples with ARI greater than 0; |D| represents the number of samples of data set D.
In this embodiment, when constructing the decision tree, the value range of each feature attribute is divided equally into two intervals, giving two branches.
The beneficial effects of the technical scheme are as follows: on this basis, the clustering algorithm and the decision tree are effectively combined to obtain a decision tree model. The decision tree selects the best feature attribute as the root node according to the Gini coefficient and continues to select the more important feature attributes in the subsequent internal node divisions, so the decision tree model can be used to classify and predict new cluster members.
In an embodiment, when constructing the decision tree, the value range of each feature attribute can be further subdivided into several intervals, i.e., multiple branches, which increases the branches of the decision tree and the number of decision classes, further improving classification precision.
According to some embodiments of the invention, determining a target CA matrix for a target cluster set from cluster layer weighting coefficients includes:
constructing a CA matrix for the target cluster set;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix.
The working principle of the technical scheme is as follows: in this embodiment, constructing a CA matrix for the target cluster set includes:

A_ij = (1/M) · ∑_{m=1}^{M} δ_m(o_i, o_j), with δ_m(o_i, o_j) = 1 if Cls_m(o_i) = Cls_m(o_j) and 0 otherwise

where A is the CA matrix; m denotes the m-th cluster member; M denotes the total number of cluster members in the cluster set; Cls_m(o_i) denotes the cluster in which sample point o_i lies; Cls_m(o_j) denotes the cluster in which sample point o_j lies.
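Under this standard co-association definition, the CA matrix can be sketched as follows (NumPy assumed; the label vectors are those produced in the member-generation step):

```python
import numpy as np

def ca_matrix(members: list[np.ndarray]) -> np.ndarray:
    """Co-association matrix: A_ij is the fraction of cluster members
    that place samples o_i and o_j in the same cluster."""
    n = len(members[0])
    A = np.zeros((n, n))
    for labels in members:
        # outer equality test: 1 where two samples share a cluster
        A += (labels[:, None] == labels[None, :]).astype(float)
    return A / len(members)  # divide by the total number of members M
```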
In this embodiment, the CA matrix is weighted based on the cluster layer weighting coefficients to obtain the processed data B.
in this embodiment, capturing high confidence information from the CA matrix to obtain the HC matrix includes:
High-confidence information is captured from the CA matrix and then supplemented and refined to obtain an ideal CA matrix. The high-confidence matrix (HC, High-Confidence Matrix) is denoted H and obtained as follows:

H = Ψ_Ω(A)

where Ω = {(i, j) | a_ij ≥ α} records the positions of the highly reliable elements, and Ψ_Ω(·) is an element-wise operator. When the ratio of the number of times two sample points are classified into the same cluster to the total number of cluster members reaches a predefined threshold α ∈ [0, 1], the corresponding position of the A matrix is regarded as a piece of highly reliable information (i.e., an element of the H matrix).
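The capture step H = Ψ_Ω(A) can be sketched as an element-wise thresholding; the value α = 0.8 below is an illustrative choice, since the text only requires α ∈ [0, 1]:

```python
import numpy as np

def hc_matrix(A: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """H = Psi_Omega(A) with Omega = {(i, j) | A_ij >= alpha}: keep
    high-confidence co-association values, zero out the rest."""
    return np.where(A >= alpha, A, 0.0)
```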
In this embodiment, determining the target CA matrix of the target cluster set according to the processing data B and the HC matrix includes:
the final CA matrix, designated C, is obtained from the H matrix and the B matrix, and the method is as follows:
s.t. Ψ_Ω(E) = 0, F = Fᵀ, 0 ≤ F ≤ 1

where Φ is the Laplacian matrix; γ_1 and γ_2 represent the Lagrangian terms, with default value 1; λ is used to balance the error loss term; Y_1 and Y_2 represent the Lagrange multipliers; ‖·‖_F denotes the Frobenius norm of a matrix; E denotes the error term; F is an intermediate matrix used to relax the constraints on the value range and symmetry of the ideal CA matrix.
According to some embodiments of the invention, the hierarchical clustering algorithm is the average-linkage method.
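A hedged sketch of this final step, assuming SciPy's average-linkage implementation and using 1 - C as the distance derived from the target CA matrix C:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def final_clustering(C: np.ndarray, k: int) -> np.ndarray:
    """Average-linkage hierarchical clustering on the target CA matrix C,
    cut into k clusters; labels returned are 1..k."""
    D = 1.0 - C                 # co-association similarity -> distance
    np.fill_diagonal(D, 0.0)    # squareform expects a zero diagonal
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```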
According to some embodiments of the invention, further comprising: and evaluating the final clustering result based on the external index, determining an evaluation value, and judging the validity of the final clustering result based on the evaluation value.
The working principle and beneficial effects of the technical scheme are as follows: clustering validity evaluation indexes are generally classified into internal indexes and external indexes. In most cases the class labels of the data set are known (but not used during clustering), so an external index can be used to evaluate the validity of the clustering. The F-measure is a fairly common comprehensive index for evaluating the quality of text clustering: the larger the F value, the higher the clustering quality, and when the clustering result is completely consistent with the true categories, the F value reaches its maximum of 1. In addition, the NMI value is also a popular clustering validity evaluation index, which quantifies the degree of agreement between the clustering result and the true text category labels.
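For the NMI evaluation, a minimal sketch using scikit-learn follows; the clustering F-measure would be computed analogously from the contingency table and is omitted here:

```python
from sklearn.metrics import normalized_mutual_info_score

def evaluate(final_labels, true_labels) -> float:
    """Returns NMI in [0, 1]; 1 means perfect agreement with true classes."""
    return normalized_mutual_info_score(true_labels, final_labels)
```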
According to some embodiments of the invention, calculating the Gini coefficients of the feature attribute set with respect to the ARI, NMI, and F-measure indexes includes:

Gini(D) = 1 - ∑_i P_i²

Gini(D, a) = ∑_v (|D_v| / |D|) · Gini(D_v)

where Gini(D) represents the Gini impurity of data set D; P_i represents the proportion of class label i in data set D, i.e., the number of samples belonging to class i divided by the total number of samples; Gini(D, a) represents the Gini impurity of data set D conditioned on feature attribute a; D_v represents the data subset obtained when feature attribute a takes the value v, and |D_v| its number of samples; Gini(D_v) represents the Gini impurity of data subset D_v.
The working principle and beneficial effects of the technical scheme are as follows: based on the above formulas, the Gini coefficients of the labeled cluster members with respect to the ARI, NMI, and F-measure indexes are obtained, so the corresponding Gini coefficients can be calculated accurately, the accuracy of comparing the Gini coefficients of the labeled cluster members is improved, and the root node and internal nodes are divided accurately.
The applicability of the invention is specifically illustrated using the medical field as an example. In the medical field, the clustering integration algorithm can be combined with decision trees to help doctors make disease diagnosis and treatment decisions. Medical diagnosis often involves a large number of complex cases and clinical data, and doctors need to classify patients quickly and accurately in order to formulate personalized treatment regimens. However, the complexity and diversity of diseases make rapid classification and prediction on large-scale data challenging. Furthermore, different patients may exhibit different symptoms and characteristics, so a method is needed that can differentiate patient populations and provide accurate diagnoses.
First, patient data is clustered using a clustering algorithm to generate multiple cluster members. Each cluster member divides the patients into different clusters, each cluster representing a patient population with similar symptoms and features. Next, high-quality cluster members are selected with the aid of the decision tree model. To further improve the accuracy and interpretability of the clustering result, attention is paid to the internal diversity of the cluster members: the diversity of the internal clusters is measured, the CA matrix is fine-tuned, and finally hierarchical clustering analysis is used to obtain the final clustering result. In this way a more optimized clustering result is obtained, which better reveals the disease characteristics and similarities of patients, makes it convenient for doctors to select medical services for different patients or customize personalized treatment plans, and improves patients' medical experience and treatment outcomes.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. The cluster weighted clustering integration method based on member selection is characterized by comprising the following steps of:
constructing a cluster member set;
inputting the cluster member set into a pre-trained decision tree model, outputting a label for each cluster member in the cluster member set, screening out the cluster members whose label matches the preset label, and generating a target cluster set;
determining a cluster layer weighting coefficient of each cluster in the target cluster set;
determining a target CA matrix of a target cluster set according to the cluster layer weighting coefficient;
and executing a hierarchical clustering algorithm according to the target CA matrix to obtain a final clustering result.
2. The method for member selection-based cluster weighted clustering integration of claim 1, wherein constructing a cluster member set comprises:
obtaining the number r of cluster members and the number k of clusters;
initializing i=1;
judging whether i is less than or equal to r;
when it is determined that i is less than or equal to r, clustering with the K-Means algorithm to generate a cluster member and obtain its clustering result;
assigning i = i + 1 and continuing the judgment until i is no longer less than or equal to r, at which point the cluster member set is constructed.
3. The method for clustering and integrating cluster weights based on member selection as claimed in claim 1, wherein the method for obtaining the pre-trained decision tree model comprises the following steps:
acquiring a sample cluster member set;
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
comparing the Davies-Bouldin index of each sample cluster member with the average value, labeling the sample cluster members whose Davies-Bouldin index is lower than the average value with a "high" label, and labeling those whose index is higher than the average value with a "low" label;
training with the labeled sample cluster members as the training set to obtain the trained decision tree model.
4. The method of clustering weighted cluster integration based on member selection of claim 3, wherein training based on labeled sample cluster members as a training set to obtain a trained decision tree model comprises:
determining the ARI, NMI, and F-measure indexes of each sample cluster member and taking them as the feature attribute set; the ARI takes values in [-1, 1], and the NMI and F-measure indexes take values in [0, 1];
calculating and comparing the Gini coefficients of the feature attribute set with respect to the ARI, NMI, and F-measure indexes; the "feature attribute 1" with the smallest Gini coefficient is selected as the root node, and values of feature attribute 1 close to 1 are marked "high"; then the labeled cluster members whose feature attribute 1 values are not close to 1 are taken as a new label set, the Gini coefficients of the two remaining feature attributes are calculated, and the smaller one is selected as "feature attribute 2", an internal node; finally, the remaining feature attribute, as "feature attribute 3", becomes the last internal node, yielding the trained decision tree model.
5. The method for member selection-based cluster weighted clustering integration of claim 1, wherein determining the target CA matrix of the target cluster set according to the cluster layer weighting coefficients comprises:
constructing a CA matrix for the target cluster set;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix.
6. The method for clustering and integrating cluster weights based on member selection according to claim 1, wherein the hierarchical clustering algorithm is the average-linkage method.
7. The method for member selection-based cluster weighted clustering integration of claim 1, further comprising: and evaluating the final clustering result based on the external index, determining an evaluation value, and judging the validity of the final clustering result based on the evaluation value.
8. The method for member selection-based cluster weighted clustering integration of claim 7, wherein the external index is an F-measure value or an NMI value.
9. The method of member selection-based cluster weighted clustering integration of claim 4, wherein calculating the Gini coefficients of the feature attribute set with respect to the three indexes ARI, NMI, and F-measure comprises:

Gini(D) = 1 - ∑_i P_i²

Gini(D, a) = ∑_v (|D_v| / |D|) · Gini(D_v)

wherein Gini(D) represents the Gini impurity of data set D; P_i represents the proportion of class label i in data set D, i.e., the number of samples belonging to class i divided by the total number of samples; Gini(D, a) represents the Gini impurity of data set D conditioned on feature attribute a; D_v represents the data subset obtained when feature attribute a takes the value v, and |D_v| its number of samples; Gini(D_v) represents the Gini impurity of data subset D_v.
10. The method for member selection-based cluster weighted clustering integration of claim 5, wherein constructing a CA matrix for the target cluster set comprises:
A_ij = (1/M) · ∑_{m=1}^{M} δ_m(o_i, o_j), with δ_m(o_i, o_j) = 1 if Cls_m(o_i) = Cls_m(o_j) and 0 otherwise

wherein A is the CA matrix; m denotes the m-th cluster member; M denotes the total number of cluster members in the cluster set; Cls_m(o_i) denotes the cluster in which sample point o_i lies; Cls_m(o_j) denotes the cluster in which sample point o_j lies.
CN202311166210.3A 2023-09-08 2023-09-08 Cluster weighted clustering integration method based on member selection Pending CN117195027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311166210.3A CN117195027A (en) 2023-09-08 2023-09-08 Cluster weighted clustering integration method based on member selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311166210.3A CN117195027A (en) 2023-09-08 2023-09-08 Cluster weighted clustering integration method based on member selection

Publications (1)

Publication Number Publication Date
CN117195027A 2023-12-08

Family

ID=89001143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311166210.3A Pending CN117195027A (en) 2023-09-08 2023-09-08 Cluster weighted clustering integration method based on member selection

Country Status (1)

Country Link
CN (1) CN117195027A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688412A (en) * 2024-02-02 2024-03-12 中国人民解放军海军青岛特勤疗养中心 Intelligent data processing system for orthopedic nursing
CN117688412B (en) * 2024-02-02 2024-05-07 中国人民解放军海军青岛特勤疗养中心 Intelligent data processing system for orthopedic nursing


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination