CN110569922A

CN110569922A - Interactive hierarchical clustering implementation method, device and equipment and readable storage medium

Info

Publication number: CN110569922A
Application number: CN201910878011.2A
Authority: CN
Inventors: 黄启军; 唐兴兴; 李诗琦; 陈瑞钦; 卓本刚
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2019-12-13
Anticipated expiration: 2039-09-17
Also published as: CN110569922B

Abstract

The invention discloses an interactive hierarchical clustering implementation method, device and equipment and a readable storage medium, wherein the method comprises the following steps: when a clustering instruction for clustering a sample set is detected, extracting clustering configuration parameters from the clustering instruction; and carrying out clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set. The method and the device realize that in the hierarchical clustering operation process, a user can configure various parameters of each hierarchical clustering operation by combining the experience of the user in the field, so that the human experience can be fully fused in the clustering process of the machine, the operability of the hierarchical clustering algorithm is improved, and the clustering effect of the hierarchical clustering model obtained by training is also improved.

Description

Interactive hierarchical clustering implementation method, device and equipment and readable storage medium

Technical Field

The invention relates to the technical field of machine learning, and in particular to an interactive hierarchical clustering implementation method, device and equipment and a readable storage medium.

Background

Clustering is an unsupervised learning task that looks for natural groupings of observed samples based on the internal structure of the data. With the development of machine learning technology, continuous efforts are being made to improve the performance of various clustering methods. Hierarchical clustering is one of the main clustering methods; its principle is to analyze data at different levels based on the similarity between clusters, so as to form a tree-shaped clustering structure. Hierarchical clustering generally has two partitioning strategies: a bottom-up agglomeration strategy and a top-down splitting strategy. However, current hierarchical clustering algorithms only allow parameters to be configured before clustering starts, and the clustering operation at each level cannot be adjusted using domain experience, which leads to a poor clustering effect. In other words, existing hierarchical clustering algorithms have low operability.

Disclosure of Invention

The invention mainly aims to provide an interactive hierarchical clustering implementation method, device and equipment and a readable storage medium, and aims to solve the problem of low operability of the conventional hierarchical clustering algorithm.

In order to achieve the above object, the present invention provides an interactive hierarchical clustering implementation method, which comprises the following steps:

When a clustering instruction for clustering a sample set is detected, extracting clustering configuration parameters from the clustering instruction;

And carrying out clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set.

Optionally, the clustering configuration parameters include algorithm configuration parameters, and the step of performing clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set includes:

determining a clustering algorithm according to the algorithm configuration parameters, wherein the clustering algorithm comprises a custom distance algorithm;

and clustering the sample set under the current hierarchical clustering model by adopting the determined clustering algorithm to obtain a new hierarchical clustering model of the sample set.

Optionally, the clustering configuration parameters include policy configuration parameters, and the step of performing clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set includes:

Determining a clustering strategy and a cluster to be clustered in the current hierarchical clustering model according to the strategy configuration parameters, wherein the clustering strategy comprises an agglomeration strategy and a splitting strategy;

and carrying out clustering operation on the sample data in the cluster to be clustered according to the determined clustering strategy to obtain a new hierarchical clustering model of the sample set.

Optionally, after the step of performing a clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set, the method further includes:

performing statistical analysis on the sample set under the new hierarchical clustering model to obtain a statistical analysis result of each preset index;

And outputting the statistical analysis result of each preset index.

Optionally, the preset index includes one or more of a cluster validity index, a cluster stability index, a single feature analysis index, a multi-feature analysis index, and a sample spot check index.

Optionally, the method further includes:

Calculating the distance between each pair of clusters into which the sample set is divided in the new hierarchical clustering model to obtain a distance matrix;

And outputting the distance matrix according to a preset visualization mode.

Optionally, before the step of outputting the distance matrix according to a preset visualization manner, the method further includes:

Carrying out normalization processing on each distance value in the distance matrix, and updating the distance matrix with the normalized distance values.
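The distance-matrix and normalization steps above can be illustrated with a minimal sketch. The text does not specify how the distance between clusters is defined or which normalization is used; the sketch assumes Euclidean distance between cluster centroids and min-max scaling, and the function names `cluster_distance_matrix` and `normalize_matrix` are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def cluster_distance_matrix(clusters):
    """Distance between every pair of clusters.

    Assumption: inter-cluster distance is the Euclidean distance
    between centroids; the source does not fix this choice.
    """
    centers = [centroid(c) for c in clusters]
    return [[math.dist(a, b) for b in centers] for a in centers]

def normalize_matrix(matrix):
    """Min-max scale all entries into [0, 1] (assumed normalization)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # every entry identical: map everything to 0
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]

clusters = [
    [[0.0, 0.0], [0.0, 2.0]],   # centroid (0, 1)
    [[3.0, 5.0], [3.0, 5.0]],   # centroid (3, 5)
    [[0.0, 11.0]],              # centroid (0, 11)
]
dist = cluster_distance_matrix(clusters)
norm = normalize_matrix(dist)
```

After normalization the entries are comparable on a common scale, which suits visual output such as a heat map.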

In order to achieve the above object, the present invention further provides an interactive hierarchical clustering implementation apparatus, including:

The extraction module is used for extracting clustering configuration parameters from a clustering instruction when the clustering instruction for clustering the sample set is detected;

And the clustering module is used for clustering the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set.

In order to achieve the above object, the present invention further provides interactive hierarchical clustering implementation equipment, including: a memory, a processor, and an interactive hierarchical clustering implementation program which is stored on the memory and can run on the processor, wherein when the interactive hierarchical clustering implementation program is executed by the processor, the steps of the interactive hierarchical clustering implementation method described above are implemented.

In addition, to achieve the above object, the present invention further provides a computer readable storage medium on which an interactive hierarchical clustering implementation program is stored, wherein when the interactive hierarchical clustering implementation program is executed by a processor, the steps of the interactive hierarchical clustering implementation method described above are implemented.

According to the invention, when a clustering instruction is detected, clustering configuration parameters are extracted from the clustering instruction, and the sample set under the current hierarchical clustering model is clustered according to the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set. In this way, during the hierarchical clustering process, a user can configure the parameters of the clustering operation at each level based on the user's own domain experience, so that human experience is fully integrated into the machine's clustering process, the operability of the hierarchical clustering algorithm is improved, and the clustering effect of the hierarchical clustering model obtained by training is also improved.

Drawings

FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a first embodiment of a method for implementing interactive hierarchical clustering according to the present invention;

FIG. 3 is a schematic diagram of an interaction flow in a hierarchical clustering model training process according to an embodiment of the present invention;

FIG. 4 is a functional block diagram of an apparatus for implementing interactive hierarchical clustering according to a preferred embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

It should be noted that, the interactive hierarchical clustering implementation device in the embodiment of the present invention may be a smart phone, a personal computer, a server, and other devices, and is not limited herein.

As shown in fig. 1, the interactive hierarchical clustering implementation device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the device architecture shown in FIG. 1 does not constitute a limitation of the interactive hierarchical clustering implementation device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and an interactive hierarchical clustering implementation program. The operating system is a program that manages and controls the hardware and software resources of the device, and supports the operation of the interactive hierarchical clustering implementation program and other software or programs.

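In the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing communication connection with each participating device; and the processor 1001 may be configured to invoke the interactive hierarchical clustering implementation program stored in the memory 1005 and perform the following operations:

when a clustering instruction for clustering a sample set is detected, extracting clustering configuration parameters from the clustering instruction;

and carrying out clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set.

Further, the clustering configuration parameters include algorithm configuration parameters, and the step of performing clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set includes:

determining a clustering algorithm according to the algorithm configuration parameters, wherein the clustering algorithm comprises a custom distance algorithm;

and clustering the sample set under the current hierarchical clustering model by adopting the determined clustering algorithm to obtain a new hierarchical clustering model of the sample set.

Further, the clustering configuration parameters include policy configuration parameters, and the step of performing clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set includes:

determining a clustering strategy and a cluster to be clustered in the current hierarchical clustering model according to the strategy configuration parameters, wherein the clustering strategy comprises an agglomeration strategy and a splitting strategy;

and carrying out clustering operation on the sample data in the cluster to be clustered according to the determined clustering strategy to obtain a new hierarchical clustering model of the sample set.

Further, after the step of performing a clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set, the processor 1001 may be configured to invoke the interactive hierarchical clustering implementation program stored in the memory 1005, and further perform the following operations:

performing statistical analysis on the sample set under the new hierarchical clustering model to obtain a statistical analysis result of each preset index;

and outputting the statistical analysis result of each preset index.

Further, the preset index comprises one or more of a clustering effectiveness index, a clustering stability index, a single-feature analysis index, a multi-feature analysis index and a sample spot check index.

Further, the processor 1001 may be configured to invoke the interactive hierarchical clustering implementation program stored in the memory 1005, and further perform the following operations:

calculating the distance between each pair of clusters into which the sample set is divided in the new hierarchical clustering model to obtain a distance matrix;

and outputting the distance matrix according to a preset visualization mode.

Further, before the step of outputting the distance matrix according to the preset visualization manner, the processor 1001 may be configured to call the interactive hierarchical clustering implementation program stored in the memory 1005, and further perform the following operations:

carrying out normalization processing on each distance value in the distance matrix, and updating the distance matrix with the normalized distance values.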

Based on the above structure, various embodiments of the interactive hierarchical clustering implementation method are provided.

Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the interactive hierarchical clustering implementation method of the present invention.

While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown or described herein. The following embodiments are described with a clustering device as an execution subject. In this embodiment, the interactive hierarchical clustering implementation method includes:

Step S10, when a clustering instruction for clustering the sample set is detected, extracting clustering configuration parameters from the clustering instruction;

In this embodiment, interactive hierarchical clustering is proposed based on the concept of hierarchical clustering. Its principle is that, before the clustering operation at each level, a user can choose to configure the parameters of that clustering operation, so that human experience is integrated into the machine's clustering process and the operability of hierarchical clustering is improved.

Specifically, in order to implement the above interactive hierarchical clustering, a configuration interface for clustering configuration parameters may be provided in the clustering device, so that a user can configure the clustering configuration parameters through the configuration interface. There may be multiple clustering configuration parameters, and the configuration interface may provide configuration options or custom input boxes for each of them, so that after analyzing the sample set the user can select configuration options according to the user's own domain experience, or enter custom parameters in the custom input boxes. The clustering configuration parameters may include any configurable parameters, such as the clustering operation algorithm, the distance algorithm, and the initial values involved in the algorithm. That is, the user can configure the clustering operation algorithm, the distance algorithm and so on used in the clustering operation at each level; for example, the user may configure the k-means algorithm for the first-level clustering operation on the sample set and set the number of target clusters in the first-level clustering operation to N. It should be noted that the clustering configuration parameters are not specifically limited in this embodiment.

After configuring the clustering configuration parameters, the user can operate a clustering control to trigger a clustering instruction in the clustering device. The clustering device triggers the clustering instruction based on the user operation and carries the currently configured clustering configuration parameters in the clustering instruction. It should be noted that the clustering device may set default clustering configuration parameters; when the user does not reconfigure them, the clustering device carries the default clustering configuration parameters in the clustering instruction. When the clustering device detects a clustering instruction for clustering the sample set, it extracts the clustering configuration parameters from the clustering instruction.
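Step S10 can be illustrated with a minimal sketch of a clustering instruction that carries either user-configured or default parameters. The class `ClusteringInstruction`, the functions `build_instruction` and `extract_config`, and all parameter keys are hypothetical names chosen for illustration; the text does not prescribe a data format.

```python
from dataclasses import dataclass, field

# Hypothetical defaults carried when the user does not reconfigure anything.
DEFAULT_CONFIG = {"algorithm": "k-means", "distance": "euclidean", "n_clusters": 2}

@dataclass
class ClusteringInstruction:
    """A clustering instruction that carries its configuration parameters."""
    sample_set_id: str
    config: dict = field(default_factory=dict)

def build_instruction(sample_set_id, user_config=None):
    """Triggered by the user's operation: user values override the defaults."""
    merged = dict(DEFAULT_CONFIG)
    merged.update(user_config or {})
    return ClusteringInstruction(sample_set_id, merged)

def extract_config(instruction):
    """Step S10: extract the configuration parameters from the instruction."""
    return dict(instruction.config)

default_params = extract_config(build_instruction("samples-1"))
custom_params = extract_config(
    build_instruction("samples-1", {"algorithm": "DBSCAN", "n_clusters": 5})
)
```

Copying the defaults before merging keeps the device-wide defaults unchanged between instructions.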

And step S20, clustering the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set.

After extracting the clustering configuration parameters, the clustering device performs a clustering operation on the sample set under the current hierarchical clustering model according to the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set. Specifically, the clustering device determines the new-level clustering operation according to the clustering configuration parameters, for example the clustering algorithm to be used by the new-level clustering operation and the clusters to be agglomerated or split. The new-level clustering operation is performed on the previous clustering result, and after it a new hierarchical clustering model is obtained. The new hierarchical clustering model includes the clustering result after the current clustering operation, namely the number of clusters into which the sample set is divided and the cluster to which each piece of sample data belongs, as well as the data of all previous hierarchical clustering models, such as the clustering algorithm adopted at each level and the number of clusters at each level.
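One possible way to hold the contents of a hierarchical clustering model described above (the current result plus the data of all previous levels) is sketched below. The class names `Level` and `HierarchicalModel` and their fields are assumptions made for illustration, not a structure defined in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Level:
    """Result of one level: the algorithm used and each sample's cluster."""
    algorithm: str
    labels: list  # labels[i] is the cluster id of sample i

    @property
    def n_clusters(self):
        return len(set(self.labels))

@dataclass
class HierarchicalModel:
    """All levels so far; the last entry is the current clustering result."""
    levels: list = field(default_factory=list)

    def current_labels(self, n_samples):
        # Before any clustering, every sample is in a single cluster 0.
        if not self.levels:
            return [0] * n_samples
        return list(self.levels[-1].labels)

    def add_level(self, algorithm, labels):
        """Return a NEW model that keeps the history and appends one level."""
        return HierarchicalModel(self.levels + [Level(algorithm, list(labels))])

model0 = HierarchicalModel()
model1 = model0.add_level("k-means", [0, 0, 1, 1])
model2 = model1.add_level("agglomerate", [0, 0, 0, 0])
```

Returning a new model from `add_level` keeps earlier models intact, which is convenient if the user wants to compare or go back to a previous level.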

After the new-level clustering operation, the clustering device can feed the new-level clustering model back to the user, that is, the clustering result of the sample set can be fed back to the user. The user can judge the clustering effect of the currently obtained hierarchical clustering model by analyzing the clustering result of the sample set under the current hierarchical clustering model, can select to further cluster the clustering result when the effect is considered to be poor, and at the moment, the user can configure clustering configuration parameters in clustering equipment by combining the experience of the user in the field and trigger a clustering instruction in the clustering equipment. And the clustering equipment extracts the clustering configuration parameters configured by the user according to the clustering instruction, and performs new-level clustering operation on the sample set under the current-level clustering model according to the clustering configuration parameters, namely clustering again on the basis of the clustering result of the currently obtained sample set to obtain a new-level clustering model. And after the user obtains a satisfactory hierarchical clustering model, the user can choose not to train the hierarchical clustering model any more, and can cluster other sample sets through the finally obtained hierarchical clustering model.

In the embodiment, when the clustering instruction is detected, the clustering configuration parameters are extracted from the clustering instruction, and the clustering operation is performed on the sample set under the current hierarchical clustering model according to the clustering configuration parameters to obtain the new hierarchical clustering model of the sample set, so that a user can configure various parameters of each hierarchical clustering operation by combining the experience of the user in the hierarchical clustering operation process, thereby fully fusing the human experience in the clustering process of the machine, improving the operability of the hierarchical clustering algorithm, and improving the clustering effect of the hierarchical clustering model obtained by training.

Further, based on the first embodiment, a second embodiment of the interactive hierarchical clustering implementation method according to the present invention is provided, where in the second embodiment of the interactive hierarchical clustering implementation method according to the present invention, the cluster configuration parameters include algorithm configuration parameters, and the step S20 includes:

Step S201, determining a clustering algorithm according to the algorithm configuration parameters, wherein the clustering algorithm comprises a custom distance algorithm;

In this embodiment, the cluster configuration parameters may include algorithm configuration parameters, and configuration options or custom input boxes for the algorithm configuration parameters may be provided in the clustering device so that a user can configure them. Specifically, multiple clustering operation algorithms and multiple distance algorithms can be preset in the clustering device as configuration options for the user to select. For example, commonly used clustering operation algorithms may be provided, such as k-means, GMM (Gaussian Mixture Model) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), together with commonly used distance algorithms, such as Euclidean distance, Manhattan distance, Chebyshev distance and Minkowski distance. An input box for a custom distance algorithm can also be provided so that the user can define a distance algorithm. For example, for a discrete variable A taking the values A1, A2, A3 and A4, the user may define the distance between A1 and A2 as 0.5 and the distance between every other pair of values as 1.

After extracting the algorithm configuration parameters, the clustering device determines the clustering algorithm to be adopted by the new-level clustering operation according to the algorithm configuration parameters. For example, if the extracted algorithm configuration parameters specify k-means as the clustering operation algorithm and Euclidean distance as the distance algorithm, the clustering device determines that the new-level clustering operation uses k-means with Euclidean distance. When the user has defined a custom distance algorithm, the clustering algorithm determined from the algorithm configuration parameters includes that custom distance algorithm; that is, during the new-level clustering operation, the clustering device uses the custom distance algorithm to calculate the distance between samples or between clusters.
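The custom distance example for the discrete variable A can be sketched as a lookup table. The helper `make_discrete_distance` is a hypothetical name; the sketch assumes that the distance is symmetric, that identical values have distance 0, and that every pair not listed explicitly falls back to 1.

```python
def make_discrete_distance(pair_distances, default=1.0):
    """Build a distance function for a discrete variable.

    pair_distances maps an unordered pair of values to its distance;
    pairs that are not listed fall back to `default` (assumed to be 1).
    """
    table = {frozenset(pair): d for pair, d in pair_distances.items()}

    def distance(a, b):
        if a == b:
            return 0.0  # assumption: identical values are at distance 0
        return table.get(frozenset((a, b)), default)

    return distance

# User-defined distance for variable A with values A1..A4.
dist_a = make_discrete_distance({("A1", "A2"): 0.5})
```

Here `dist_a("A1", "A2")` and `dist_a("A2", "A1")` both give 0.5, while any other pair of distinct values gives 1.0.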

Step S202, performing a clustering operation on the sample set under the current hierarchical clustering model by adopting the determined clustering algorithm to obtain a new hierarchical clustering model of the sample set.

After determining the clustering algorithm to be adopted by the new-level clustering operation, the clustering device uses it to perform a clustering operation on the sample set under the current hierarchical clustering model to obtain a new hierarchical clustering model of the sample set. For example, if the clustering device determines that the clustering algorithm is k-means with Euclidean distance, the Euclidean distance is used to calculate the distance between samples while k-means clusters the sample set under the current hierarchical clustering model.
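As an illustration of steps S201 and S202, the following is a minimal k-means sketch in which the distance function is selected from the algorithm configuration parameters. The registry `DISTANCES`, the function `kmeans` and the explicit initial centers are assumptions made so that the example is deterministic; the text does not specify an implementation. Note that the mean-based center update is the standard choice for Euclidean distance; with other distances it is only a simplification.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical registry mapping a configured name to a distance function.
DISTANCES = {"euclidean": euclidean, "manhattan": manhattan}

def kmeans(samples, init_centers, distance, max_iter=100):
    """Plain k-means with a pluggable distance for the assignment step."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: distance(s, centers[k]))
            for s in samples
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [s for s, l in zip(samples, labels) if l == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

config = {"algorithm": "k-means", "distance": "euclidean"}
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(
    samples, [samples[0], samples[2]], DISTANCES[config["distance"]]
)
```

A user-defined distance function could be registered in the same table, so the clustering routine itself does not need to change.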

In this embodiment, the algorithm configuration parameters are extracted from the clustering instruction, the clustering algorithm adopted by the hierarchical clustering operation is determined according to the algorithm configuration parameters, the determined clustering algorithm is used for performing clustering operation on the sample set under the current hierarchical clustering model to obtain a new hierarchical clustering model of the sample set, and the determined clustering algorithm may include a user-defined distance algorithm, so that in the hierarchical clustering process, a user can participate in selection of the clustering algorithm adopted by each hierarchical clustering operation according to own field experience, thereby improving the operability of hierarchical clustering and improving the clustering effect of the finally obtained hierarchical clustering model.

Further, the cluster configuration parameters include policy configuration parameters, and step S20 includes:

Step S203, determining a clustering strategy and a cluster to be clustered in the current hierarchical clustering model according to the strategy configuration parameters, wherein the clustering strategy comprises an agglomeration strategy and a splitting strategy;

In this embodiment, the cluster configuration parameters may include policy configuration parameters, and configuration options for the policy configuration parameters may be provided in the clustering device so that a user can configure them. The configuration options may include two strategy options, an agglomeration strategy option and a splitting strategy option, and may also include an option for setting the clusters to be clustered. Based on experience, the user can judge whether the new-level clustering operation should agglomerate certain clusters or split one or more clusters, then select the agglomeration strategy or the splitting strategy in the clustering device, select the clusters to be agglomerated or split, and trigger a clustering instruction in the clustering device. The clustering device triggers the clustering instruction according to the user operation and carries the policy configuration parameters configured by the user in the clustering instruction. For example, if the user selects the agglomeration strategy and selects the first cluster and the second cluster into which the sample set is divided under the current hierarchical clustering model, the policy configuration parameters carried in the clustering instruction indicate that the clustering strategy is the agglomeration strategy and that the clusters to be clustered are the first cluster and the second cluster.

The clustering device then determines the clustering strategy and the cluster to be clustered according to the policy configuration parameters carried in the clustering instruction.

And S204, carrying out clustering operation on the sample data in the cluster to be clustered according to the determined clustering strategy to obtain a new hierarchical clustering model of the sample set.

The clustering device performs a clustering operation on the sample data in the cluster to be clustered according to the determined clustering strategy to obtain a new hierarchical clustering model of the sample set. For example, if the determined clustering strategy is the agglomeration strategy and the clusters to be clustered are the first cluster and the second cluster, the clustering device performs an agglomeration operation on the sample data in the first cluster and the second cluster to obtain a new hierarchical clustering model of the sample set, in which the sample data of the first cluster and the second cluster are placed in a single cluster.
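The agglomeration example above can be sketched as a relabeling of the current clustering result. The function name `agglomerate` and the convention of keeping the smallest selected cluster id for the merged cluster are assumptions made for illustration.

```python
def agglomerate(labels, clusters_to_merge):
    """Merge the selected clusters into one cluster.

    labels[i] is the current cluster id of sample i. The merged cluster
    keeps the smallest of the selected ids (an arbitrary convention).
    """
    targets = set(clusters_to_merge)
    merged_id = min(targets)
    return [merged_id if label in targets else label for label in labels]

# Policy parameters: agglomeration strategy, clusters to be clustered 1 and 2.
current_labels = [1, 1, 2, 2, 3]
new_labels = agglomerate(current_labels, [1, 2])
```

Samples outside the selected clusters keep their labels, so only the part of the model chosen by the user changes.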

It should be noted that the configuration options of the policy configuration parameters may also include an option for setting the number of target clusters of the agglomeration operation and the splitting operation, so that the user can select the number of target clusters. For example, if the clustering device determines from the policy configuration parameters that the clustering strategy is the splitting strategy, the cluster to be clustered is the fifth cluster and the number of target clusters is 3, the clustering device splits the sample data of the fifth cluster, into which the sample set is divided under the current hierarchical clustering model, into 3 sub-clusters.
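The splitting example (fifth cluster, 3 target clusters) can be sketched as follows. The text does not say which algorithm performs the split; this sketch assumes a simple scheme that picks seeds by farthest-point selection and assigns each member to its nearest seed under Euclidean distance. The function name `split_cluster` and the way new cluster ids are allocated are hypothetical.

```python
import math

def split_cluster(samples, labels, cluster_id, n_sub):
    """Split one cluster into n_sub sub-clusters; other clusters are untouched."""
    idx = [i for i, label in enumerate(labels) if label == cluster_id]
    if not 1 <= n_sub <= len(idx):
        raise ValueError("n_sub must be between 1 and the cluster size")
    # Farthest-point seeding: start from the first member, then repeatedly
    # add the member that is farthest from all seeds chosen so far.
    seeds = [idx[0]]
    while len(seeds) < n_sub:
        candidates = [i for i in idx if i not in seeds]
        seeds.append(max(
            candidates,
            key=lambda i: min(math.dist(samples[i], samples[s]) for s in seeds),
        ))
    # The first sub-cluster reuses the old id; the others get fresh ids.
    next_id = max(labels) + 1
    sub_ids = [cluster_id] + [next_id + j for j in range(n_sub - 1)]
    new_labels = list(labels)
    for i in idx:
        nearest = min(
            range(n_sub), key=lambda j: math.dist(samples[i], samples[seeds[j]])
        )
        new_labels[i] = sub_ids[nearest]
    return new_labels

samples = [[0.0], [0.1], [5.0], [5.1], [9.0], [9.1], [20.0]]
labels = [5, 5, 5, 5, 5, 5, 1]
split_labels = split_cluster(samples, labels, cluster_id=5, n_sub=3)
```

In practice the split could be delegated to whichever clustering algorithm the algorithm configuration parameters select, applied only to the members of the chosen cluster.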

In this embodiment, by extracting the policy configuration parameter from the clustering instruction, determining the clustering policy of the hierarchical clustering operation and the cluster to be clustered according to the policy configuration parameter, and performing the clustering operation on the sample data in the cluster to be clustered according to the determined clustering policy to obtain a new hierarchical clustering model of the sample set, the user can participate in the selection of the clustering policy of each hierarchical clustering operation according to the experience of the user in the hierarchical clustering process, so that two clustering policies of agglomeration and splitting are fused in the hierarchical clustering process, the operability of hierarchical clustering is improved, and the clustering effect of the finally obtained hierarchical clustering model is also improved.

Steps S201 and S202, and steps S203 and S204 in the present embodiment may be implemented individually or in combination. When the method is combined with implementation, the clustering configuration parameters comprise algorithm configuration parameters and strategy configuration parameters, the clustering equipment extracts the algorithm configuration parameters and the strategy configuration parameters from the clustering instruction, determines a clustering algorithm according to the algorithm configuration parameters, determines a clustering strategy and a cluster to be clustered according to the strategy configuration parameters, and performs clustering operation on the cluster to be clustered according to the determined clustering algorithm and the determined clustering strategy to obtain a new hierarchical clustering model.

Further, based on the second embodiment, a third embodiment of the interactive hierarchical clustering implementation method according to the present invention is provided. In the third embodiment, after step S20, the method further includes:

Step S30, carrying out statistical analysis on the sample set under the new hierarchical clustering model to obtain the statistical analysis result of each preset index;

The clustering device performs statistical analysis on the sample set under the new hierarchical clustering model to obtain the statistical analysis result of each preset index. A preset index may be an index set in advance for evaluating the clustering quality of the hierarchical clustering model. Specifically, the clustering device performs statistical analysis on the sample set under the new hierarchical clustering model according to each preset index. For example, if the preset index is the similarity of the sample data within each cluster, the clustering device computes, for each cluster into which the sample set is divided under the new hierarchical clustering model, the similarity of the sample data in that cluster, so as to obtain a similarity statistical result.
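The intra-cluster similarity example can be sketched with one concrete statistic. The text does not define how similarity is measured; this sketch assumes the mean pairwise Euclidean distance within each cluster, where a smaller value means more similar samples. The function name `mean_intra_cluster_distance` is hypothetical.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(samples, labels):
    """Per-cluster mean pairwise distance (smaller means more similar).

    Clusters with a single sample have no pairs and are reported as 0.0.
    """
    result = {}
    for cluster_id in sorted(set(labels)):
        members = [s for s, label in zip(samples, labels) if label == cluster_id]
        pairs = list(combinations(members, 2))
        if pairs:
            result[cluster_id] = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
        else:
            result[cluster_id] = 0.0
    return result

samples = [[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 5.0], [9.0, 9.0]]
labels = [0, 0, 1, 1, 2]
similarity_stats = mean_intra_cluster_distance(samples, labels)
```

The resulting dictionary, keyed by cluster id, is the kind of per-index result that could then be output for the user to inspect.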

step S40, outputting the statistical analysis result of each of the preset indexes.

And after the clustering equipment obtains the statistical analysis result of each preset index, outputting the statistical analysis result for the user to check the statistical analysis result of each preset index. The step of outputting the statistical analysis result may be outputting and displaying the statistical analysis result on a current display page of the clustering device, and if the clustering device is a personal computer of a user, the user may visually check the statistical analysis result on the computer display page.

Furthermore, an index analysis interface can be arranged in the clustering equipment, a viewing control of each index is arranged in the index analysis interface, and a user can trigger the clustering equipment to perform statistical analysis on the sample set by operating the viewing control of the index to be viewed, so as to obtain a statistical analysis result of the index, and output and display the statistical analysis result. In addition, the clustering equipment can render each statistical analysis result, so that each statistical analysis result output and displayed is more visual, and the statistical analysis results can be analyzed by a user more conveniently.

further, the preset index may include one or more of a cluster validity index, a cluster stability index, a single feature analysis index, a multi-feature analysis index, and a sample spot check index.

The cluster validity index may include the Jaccard coefficient, the silhouette coefficient, and other indexes that can be used to judge whether the clustering effect of the hierarchical clustering model is good or bad. The Jaccard coefficient is used for comparing the similarity and difference between finite sample sets; the larger the Jaccard coefficient, the higher the sample similarity. The silhouette coefficient can be used for evaluating, on the basis of the same original data, the influence of different algorithms, or of different operation modes of an algorithm, on the clustering result. The clustering device calculates the Jaccard coefficient and the silhouette coefficient over the clusters into which the sample set is divided under the new hierarchical clustering model, so that the user can judge from these values whether the new hierarchical clustering model divides the sample set reasonably.

The cluster stability index may be a change value of a cluster validity index, such as the change in the Jaccard coefficient or the change in the silhouette coefficient; the smaller the change, the better the clustering stability of the new hierarchical clustering model. Specifically, when the clustering device changes the initial values of the new hierarchical clustering model, or uses the new hierarchical clustering model to perform the clustering operation on different sample sets, it records the change in the Jaccard coefficient and the change in the silhouette coefficient, so that the user can judge the stability of the new hierarchical clustering model from these changes.
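The two validity indexes and the stability index can be sketched as below. The use of Euclidean distance, the convention of scoring singleton clusters as 0, and the function names are assumptions of this example and are not stated in the source.

```python
from math import dist


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def silhouette(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    by_label = {}
    for p, label in zip(points, labels):
        by_label.setdefault(label, []).append(p)
    if len(by_label) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for p, label in zip(points, labels):
        own = by_label[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in by_label.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def change_value(before, after):
    """Stability index: absolute change of a validity index between two runs."""
    return abs(after - before)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(points, [0, 0, 1, 1])   # compact, well-separated clusters
poor = silhouette(points, [0, 1, 0, 1])   # interleaved clusters
overlap = jaccard({1, 2, 3}, {2, 3, 4})
drift = change_value(good, poor)
```

A small `drift` between two runs would suggest a stable model; here the two labelings differ strongly, so the change is large.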

The single-feature analysis index may be an index for analyzing each dimensional feature of the sample data individually. Specifically, the clustering device may compute statistics on each dimensional feature of the sample data in each cluster under the new hierarchical clustering model to obtain the sample distribution of each feature in each cluster, and output the statistical results graphically, for example as box plots or histograms, so that the user can perform single-feature analysis on each cluster into which the sample set is divided and thereby judge whether the new hierarchical clustering model divides the sample set reasonably.
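A minimal sketch of the single-feature statistics follows. It computes, per cluster and per feature, the five numbers a box plot is drawn from; the feature names `order_amount` and `item_count` and the sample values are made up for illustration.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Five-number summary used to draw a box plot (needs at least 2 values)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def single_feature_summary(clusters, feature_names):
    """Per-cluster, per-feature distribution summary."""
    return {
        label: {
            name: box_plot_stats([sample[i] for sample in samples])
            for i, name in enumerate(feature_names)
        }
        for label, samples in clusters.items()
    }


# Hypothetical order data: (order_amount, item_count) per sample.
clusters = {
    "C1": [(10.0, 1), (20.0, 2), (30.0, 3)],
    "C2": [(100.0, 5), (300.0, 7), (200.0, 9)],
}
summary = single_feature_summary(clusters, ["order_amount", "item_count"])
```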

The multi-feature analysis index may be an index for analyzing the dimensional features of the sample data jointly. Specifically, the clustering device may reduce the dimensionality of each piece of sample data in the sample set under the new hierarchical clustering model using methods such as t-SNE or the principal component analysis (PCA) algorithm, project the reduced sample data into a two-dimensional or three-dimensional plot, distinguish the sample data by color or marker size according to the cluster to which each belongs, and output and display the plot, so that the user can perform multi-feature analysis on each cluster into which the sample set is divided and thereby judge whether the new hierarchical clustering model divides the sample set reasonably.
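The projection step can be sketched with a small PCA written in plain Python. This is a teaching sketch only: it finds the principal components by power iteration with deflation, which is one of several ways to compute PCA and is not stated in the source, and it does not cover t-SNE. All names and the sample data are hypothetical.

```python
from math import sqrt


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _top_eigenvector(m, iterations=200):
    """Power iteration: dominant eigenvector of a symmetric PSD matrix."""
    v = [1.0 / (i + 1) for i in range(len(m))]  # arbitrary non-zero start
    for _ in range(iterations):
        w = _matvec(m, v)
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left in any direction
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, _matvec(m, v)))
    return v, eigenvalue


def pca_project(samples, n_components=2):
    """Project samples onto their first principal components."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - means[j] for j in range(d)] for s in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        v, eigenvalue = _top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


samples = [(0, 0, 0), (1, 1, 0), (2, 2, 0.1),
           (10, 10, 0), (11, 11, 0.1), (12, 12, 0)]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
projected = pca_project(samples, n_components=2)
# Each 2-D point keeps its cluster label so a plot can color it accordingly.
plot_points = [(x, y, label) for (x, y), label in zip(projected, labels)]
```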

The sample spot check index may be an index for performing spot check on sample data in each cluster. The clustering equipment extracts sample data in each cluster divided by the sample set under the new hierarchical clustering model, outputs and displays the extracted sample data of each cluster, and allows a user to analyze partial sample data in each cluster, thereby judging whether the new hierarchical clustering model divides the sample set reasonably.
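A spot check can be sketched as drawing a few samples from every cluster. The source only says that sample data are extracted from each cluster; random sampling, the sample size `k` and the fixed seed are assumptions of this example.

```python
import random


def spot_check(clusters, k=2, seed=0):
    """Draw up to k samples from every cluster for manual inspection."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return {
        label: rng.sample(samples, min(k, len(samples)))
        for label, samples in clusters.items()
    }


clusters = {
    "C1": ["order-1", "order-2", "order-3", "order-4"],
    "C2": ["order-5"],
}
checked = spot_check(clusters, k=2)
```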

In this embodiment, after the new hierarchical clustering model of the sample set is obtained, statistical analysis is performed on the sample set under the new hierarchical clustering model according to the preset indexes, and the statistical analysis result of each preset index is output. After each new hierarchical clustering operation on the sample set, the user can therefore judge more intuitively, from the statistical analysis results of the indexes, whether the division of the sample set is reasonable, and can determine more accurately whether the sample set needs to be divided further. This improves the operability of hierarchical clustering and the clustering effect of the hierarchical clustering model finally obtained.

Further, based on the third embodiment, a fourth embodiment of the interactive hierarchical clustering implementation method of the present invention is provided. In the fourth embodiment, after step S20, the method further includes:

Step S50, calculating the distance between each cluster divided by the sample set in the new hierarchical clustering model to obtain a distance matrix;

After obtaining the new hierarchical clustering model, the clustering device may calculate the distances between the clusters into which the sample set is divided in the new hierarchical clustering model to obtain a distance matrix. Specifically, the clustering device calculates the distance between every two clusters in the new hierarchical clustering model according to the currently configured distance algorithm. For example, if the sample set is divided into 3 clusters, C1, C2, and C3, in the new hierarchical clustering model, the pairwise distances between the 3 clusters are calculated to obtain the distance matrix shown in Table 1 below.

	C1	C2	C3
C1	0	0.1	0.8
C2	0.1	0	0.3
C3	0.8	0.3	0

TABLE 1
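A distance matrix of this shape could be computed as sketched below. The source leaves the distance algorithm configurable; measuring the Euclidean distance between cluster centroids is an assumption of this example, and the sample coordinates are made up.

```python
from math import dist


def centroid(samples):
    """Component-wise mean of the samples in one cluster."""
    n = len(samples)
    return tuple(sum(s[j] for s in samples) / n for j in range(len(samples[0])))


def cluster_distance_matrix(clusters, distance=dist):
    """Return the sorted cluster labels and their pairwise distance matrix."""
    labels = sorted(clusters)
    centers = {label: centroid(clusters[label]) for label in labels}
    matrix = [[distance(centers[a], centers[b]) for b in labels] for a in labels]
    return labels, matrix


clusters = {
    "C1": [(0.0, 0.0), (0.0, 2.0)],   # centroid (0, 1)
    "C2": [(3.0, 0.0), (3.0, 2.0)],   # centroid (3, 1)
    "C3": [(0.0, 5.0)],               # centroid (0, 5)
}
labels, matrix = cluster_distance_matrix(clusters)
```

Passing a different callable as `distance` stands in for the user-configured or custom distance algorithm.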

Step S60, outputting the distance matrix according to a preset visualization mode.

After obtaining the distance matrix, the clustering device outputs it according to a preset visualization mode. The preset visualization mode can be set as required. For example, the distance matrix may be displayed in the display interface of the clustering device in the form of the table above, and the distance values may further be rendered in different colors according to their magnitude, so as to help the user analyze the distances between clusters more intuitively.

In this embodiment, after the new hierarchical clustering model of the sample set is obtained, the distances between the clusters into which the sample set is divided in the new hierarchical clustering model are calculated and output as a distance matrix, so that the user can analyze the distances between clusters more intuitively and judge whether the sample set needs to be divided further.

Further, before step S60, the method further includes:
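One possible color rendering is sketched below, using the values of Table 1. The linear white-to-red scale and the function name are assumptions of this example; the source does not prescribe a color scheme.

```python
def distance_to_color(value, lo, hi):
    """Map a distance to an RGB hex string on a linear white-to-red scale."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    fade = round(255 * (1 - t))  # green and blue channels fade out with distance
    return f"#ff{fade:02x}{fade:02x}"


matrix = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.3],
    [0.8, 0.3, 0.0],
]
lo = min(min(row) for row in matrix)
hi = max(max(row) for row in matrix)
colors = [[distance_to_color(v, lo, hi) for v in row] for row in matrix]
```

The smallest distance is drawn in white and the largest in full red, so nearby clusters and distant clusters stand out at a glance.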

Step S70, performing normalization processing on each distance value in the distance matrix, and updating the distance matrix with each distance value after the normalization processing.

Since the features of the different dimensions of the sample data may be of different types, such as continuous, ordinal, or discrete, the clustering device may, in order to support clustering on features of different types, normalize each distance value in the distance matrix after the distance values have been calculated. Normalization is needed because different evaluation indexes often have different dimensions and units, which affects the result of data analysis. To eliminate the influence of dimension between indexes, the data are standardized so that the indexes become comparable; after standardization of the original data, all indexes are on the same order of magnitude and are suitable for comprehensive comparison and evaluation. In this embodiment, a conventional normalization method, such as Z-score standardization, may be used. After normalization the distance values are comparable, which makes it more convenient for the user to analyze the distances between clusters.
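Step S70 can be sketched with Z-score standardization as below, again using the values of Table 1. The source does not say which entries enter the mean and standard deviation; this sketch standardizes the distances between distinct clusters and leaves the diagonal untouched, which is an assumption.

```python
from statistics import mean, pstdev


def z_score_normalize(matrix):
    """Z-score the off-diagonal distances of a symmetric distance matrix."""
    n = len(matrix)
    pairs = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    mu, sigma = mean(pairs), pstdev(pairs)
    normalized = [row[:] for row in matrix]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal is a self-distance and is left as is
                normalized[i][j] = 0.0 if sigma == 0 else (matrix[i][j] - mu) / sigma
    return normalized


matrix = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.3],
    [0.8, 0.3, 0.0],
]
normalized = z_score_normalize(matrix)
```

After standardization a negative value means the two clusters are closer than the average pair, and a positive value means they are farther apart.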

It should be noted that, after obtaining the new hierarchical clustering model, the clustering device may output the statistical analysis results of each preset index and the distance matrix of each cluster, so that the user may more intuitively and fully analyze the clustering effect of the new hierarchical clustering model according to the statistical analysis results and the distance matrix, thereby improving the operability of hierarchical clustering.

Further, FIG. 3 shows the interaction flow during training of the hierarchical clustering model. Based on the new hierarchical clustering model obtained by the clustering device, and with reference to the statistical analysis results of the indexes and the distance matrix, the user selects the clustering strategy for the next level and selects either several clusters to be aggregated or one cluster to be split; the user may also select a clustering algorithm and configure the distance configuration parameters in the clustering device. The clustering device configures and runs the algorithm for the next-level clustering operation according to the clustering configuration parameters configured by the user, obtains a new hierarchical clustering model, and updates and outputs the statistical analysis results of the indexes and the distance matrix. This cycle repeats until the user, after analyzing the data, determines that the new hierarchical clustering model meets the clustering target, at which point training of the hierarchical clustering model stops.
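The loop described above can be sketched as follows. The three callbacks and the toy model (a dictionary mapping cluster labels to samples) are hypothetical stand-ins: `analyze` plays the role of the statistical analysis, `next_instruction` the role of the user, and `cluster_once` the role of one clustering operation.

```python
def interactive_training(model, analyze, next_instruction, cluster_once):
    """Repeat analyze -> user decision -> clustering until the user stops."""
    while True:
        report = analyze(model)
        instruction = next_instruction(model, report)
        if instruction is None:  # user judges that the clustering target is met
            return model
        model = cluster_once(model, instruction)


# Toy stand-ins for the device's analysis, the user, and one clustering step.
def analyze(model):
    return {label: len(samples) for label, samples in model.items()}


def next_instruction(model, report):
    label, size = max(report.items(), key=lambda item: item[1])
    return {"strategy": "split", "target": label} if size > 2 else None


def cluster_once(model, instruction):
    target = instruction["target"]
    samples = sorted(model[target])
    half = len(samples) // 2
    new_model = {k: v for k, v in model.items() if k != target}
    new_model[target + ".1"] = samples[:half]
    new_model[target + ".2"] = samples[half:]
    return new_model


final_model = interactive_training(
    {"C1": [1, 2, 3, 4, 5]}, analyze, next_instruction, cluster_once
)
```

In this toy run the scripted "user" keeps splitting the largest cluster until no cluster holds more than two samples, then stops the training.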

The following is a specific example, which further explains the interaction process between the user and the clustering device in the hierarchical clustering process.

For a sample set (containing S pieces of sample data) formed from a batch of cosmetics orders, the user needs to train a hierarchical clustering model in the clustering device so that the sample set is clustered as expected. The user configures the clustering configuration parameters in the clustering device, selects the k-means algorithm for the first-level clustering operation, and configures the target number of clusters of the first-level clustering operation as N (N < S). According to the configured clustering configuration parameters, the clustering device divides the sample set into N clusters using the k-means algorithm, then performs statistical analysis for each preset index, and outputs the statistical analysis results together with the computed distance matrix of the clusters. By analyzing the statistical analysis results and the distance matrix, the user can judge from experience whether the clustering effect is good.
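The first-level operation can be illustrated with a compact k-means (Lloyd's algorithm). Seeding the centers with the first N samples and using Euclidean distance are simplifying assumptions of this sketch, and the two-dimensional sample data are made up.

```python
from math import dist


def k_means(samples, n_clusters, iterations=100):
    """Lloyd's algorithm; returns the cluster index of each sample and the centers."""
    if not 0 < n_clusters < len(samples):
        raise ValueError("n_clusters must satisfy 0 < N < S")
    centers = [tuple(s) for s in samples[:n_clusters]]  # naive seeding (assumption)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest center.
        new_assignment = [
            min(range(n_clusters), key=lambda c: dist(s, centers[c]))
            for s in samples
        ]
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers


samples = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0),
           (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
assignment, centers = k_means(samples, n_clusters=2)
```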

For example, when N is configured as 2, the sample set is divided into two clusters, C1 and C2. Through analysis, the user finds that C1 mainly contains orders of male customers buying shampoo and shower gel, and C2 mainly contains orders of female customers buying a variety of items. The user judges that the division into C1 and C2 is consistent with experience, but that the orders in C2 are particularly mixed and that C2 should be split further. The user can then select a custom distance algorithm and configure the clustering strategy as the splitting strategy, so as to separate three groups of users (users who only buy with coupons, users who seldom use coupons, and users who never use coupons) and associate these three user groups with the orders in C2. In this way the clustering device incorporates the user's experience and, through the second-level clustering operation, splits C2 into 3 clusters to obtain a new hierarchical clustering model in which the sample set is divided into 4 clusters. The clustering device then updates the statistical analysis results and the distance matrix according to the new hierarchical clustering model for further analysis by the user.

As another example, when N is configured as 10, the first-level clustering operation divides the sample set into 10 clusters, C1, C2, …, C10. Through analysis, the user finds that C1 contains orders of male customers buying shampoo and C2 contains orders of male customers buying shower gel, and judges that C1 and C2 should be aggregated into one cluster. The user can then configure the clustering strategy as the agglomeration strategy and select C1 and C2 as the clusters to be clustered. In this way the clustering device incorporates the user's experience and, through the second-level clustering operation, aggregates C1 and C2 into a cluster C1.1 to obtain a new hierarchical clustering model in which the sample set is divided into 9 clusters. The clustering device then updates the statistical analysis results and the distance matrix according to the new hierarchical clustering model for further analysis by the user.

By analogy, throughout the hierarchical clustering process of the clustering device, the user continually analyzes the results and applies personal experience to steer the process toward the modeling target, and finally obtains a satisfactory hierarchical clustering model with a good clustering effect.
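The two second-level operations in these examples can be sketched as below. The model representation (a dictionary from cluster label to orders), the `coupon_usage` field and the label scheme for split clusters are hypothetical; the grouping rule passed to `split` stands in for the user's custom distance algorithm.

```python
def agglomerate(model, targets, new_label):
    """Agglomeration strategy: merge the selected clusters into one cluster."""
    merged = [sample for t in targets for sample in model[t]]
    new_model = {k: v for k, v in model.items() if k not in targets}
    new_model[new_label] = merged
    return new_model


def split(model, target, group_of):
    """Splitting strategy: divide one cluster by a user-supplied grouping rule."""
    new_model = {k: v for k, v in model.items() if k != target}
    for sample in model[target]:
        new_model.setdefault(f"{target}.{group_of(sample)}", []).append(sample)
    return new_model


# N = 10: aggregating C1 and C2 into C1.1 leaves 9 clusters.
ten = {f"C{i}": [f"order-{i}"] for i in range(1, 11)}
nine = agglomerate(ten, ["C1", "C2"], "C1.1")

# N = 2: splitting C2 by coupon usage into 3 clusters gives 4 clusters.
two = {
    "C1": [{"id": 1, "coupon_usage": "never"}],
    "C2": [
        {"id": 2, "coupon_usage": "always"},
        {"id": 3, "coupon_usage": "seldom"},
        {"id": 4, "coupon_usage": "never"},
        {"id": 5, "coupon_usage": "always"},
    ],
}
four = split(two, "C2", group_of=lambda order: order["coupon_usage"])
```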

In addition, an embodiment of the present invention further provides an interactive hierarchical clustering implementation apparatus, and with reference to fig. 4, the interactive hierarchical clustering implementation apparatus includes:

The extraction module 10 is configured to, when a clustering instruction for performing a clustering operation on a sample set is detected, extract clustering configuration parameters from the clustering instruction;

The clustering module 20 is configured to perform the clustering operation on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set.

Further, the clustering configuration parameters include algorithm configuration parameters, and the clustering module 20 includes:

The first determining unit is used for determining a clustering algorithm according to the algorithm configuration parameters, wherein the clustering algorithm comprises a custom distance algorithm;

The first clustering unit is used for clustering the sample set under the current hierarchical clustering model by adopting the determined clustering algorithm to obtain a new hierarchical clustering model of the sample set.

Further, the clustering configuration parameters include policy configuration parameters, and the clustering module 20 further includes:

The second determining unit is used for determining a clustering strategy and a cluster to be clustered in the current hierarchical clustering model according to the strategy configuration parameters, wherein the clustering strategy comprises an agglomeration strategy and a splitting strategy;

The second clustering unit is used for clustering the sample data in the cluster to be clustered according to the determined clustering strategy to obtain a new hierarchical clustering model of the sample set.

Further, the interactive hierarchical clustering implementation apparatus further includes:

The statistical analysis module is used for, after the clustering operation is performed on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain the new hierarchical clustering model of the sample set, performing statistical analysis on the sample set under the new hierarchical clustering model to obtain the statistical analysis result of each preset index;

The output module is used for outputting the statistical analysis result of each preset index.

The calculation module is used for, after the clustering operation is performed on the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain the new hierarchical clustering model of the sample set, calculating the distance between each cluster divided by the sample set in the new hierarchical clustering model to obtain a distance matrix;

The output module is further used for outputting the distance matrix according to a preset visualization mode.

The normalization processing module is used for performing normalization processing on each distance value in the distance matrix before the distance matrix is output according to a preset visualization mode, and updating the distance matrix with the normalized distance values.

The specific implementations of the interactive hierarchical clustering implementation apparatus of the present invention are substantially the same as the embodiments of the interactive hierarchical clustering implementation method described above, and are not repeated here.

In addition, an embodiment of the present invention further provides a computer-readable storage medium on which an interactive hierarchical clustering implementation program is stored; when executed by a processor, the interactive hierarchical clustering implementation program implements the steps of the interactive hierarchical clustering implementation method described above.

The embodiments of the interactive hierarchical clustering implementation apparatus and the computer-readable storage medium of the present invention can refer to the embodiments of the interactive hierarchical clustering implementation method of the present invention, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structural or process modifications made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims

1. An interactive hierarchical clustering implementation method is characterized by comprising the following steps:

2. The method according to claim 1, wherein the cluster configuration parameters include algorithm configuration parameters, and the step of performing a clustering operation on the sample set under the current hierarchical cluster model based on the cluster configuration parameters to obtain a new hierarchical cluster model of the sample set comprises:

3. The method according to claim 1, wherein the cluster configuration parameters include policy configuration parameters, and the step of performing a clustering operation on the sample set under the current hierarchical cluster model based on the cluster configuration parameters to obtain a new hierarchical cluster model of the sample set comprises:

4. The method according to claim 1, wherein the step of clustering the sample set under the current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set further comprises:

And outputting the statistical analysis result of each preset index.

5. The method according to claim 4, wherein the preset indexes include one or more of a cluster validity index, a cluster stability index, a single-feature analysis index, a multi-feature analysis index, and a sample spot check index.

6. The method according to any one of claims 1 to 5, wherein after the step of performing a clustering operation on the sample set under a current hierarchical clustering model based on the clustering configuration parameters to obtain a new hierarchical clustering model of the sample set, the method further comprises:

and outputting the distance matrix according to a preset visualization mode.

7. The method of claim 6, wherein the step of outputting the distance matrix according to a preset visualization mode further comprises:

8. An interactive hierarchical clustering implementation apparatus, characterized in that the interactive hierarchical clustering implementation apparatus includes:

9. An interactive hierarchical clustering implementation apparatus, characterized in that the interactive hierarchical clustering implementation apparatus comprises: a memory, a processor, and an interactive hierarchical clustering implementation program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the interactive hierarchical clustering implementation method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon an interactive hierarchical clustering implementation program, which when executed by a processor implements the steps of the interactive hierarchical clustering implementation method according to any one of claims 1 to 7.