CN110738245A - automatic clustering algorithm selection system and method for scientific data analysis - Google Patents


Info

Publication number: CN110738245A
Application number: CN201910931657.2A
Authority: CN (China)
Prior art keywords: performance, algorithm, module, data set, meta
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 刘悦, 田文杰, 李勃澄, 祝垲
Current and original assignee: University of Shanghai for Science and Technology
Application filed by University of Shanghai for Science and Technology
Priority to CN201910931657.2A
Publication of CN110738245A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 - Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic clustering algorithm selection system and method for scientific data analysis. The system comprises a data preprocessing module and a clustering algorithm selector construction module. The data preprocessing module comprises a metadata extraction module and a first single-scale selection module; the clustering algorithm selector construction module comprises a metadata set construction module, a second single-scale selection module and a multi-scale fusion module. The multi-scale fusion module uses the single-scale features to build a Stacking set, and a random forest multi-output regression model is trained on the Stacking set to obtain a clustering algorithm selector. A given data set is preprocessed by the data preprocessing module and then input to the clustering algorithm selector, which selects an optimal clustering algorithm for the given data set. The invention selects the optimal clustering algorithm for a given data set more comprehensively while effectively reducing computing resources and time cost.

Description

Automatic clustering algorithm selection system and method for scientific data analysis
Technical Field
The invention relates to the technical field of computer artificial intelligence, and in particular to an automatic clustering algorithm selection system and method for scientific data analysis.
Background
Cluster analysis divides a data set into several clusters by computing the similarity between data objects, so that objects in the same cluster are highly similar while objects in different clusters differ greatly; it is an unsupervised learning method. Cluster analysis is widely applied in many scientific fields such as biology and astronomy. In the biological field, for example, the raw data volume is large and the data are diverse or heterogeneous; analysing biological scientific data by experimental methods alone is time-consuming and costly, so fast and effective computational methods are needed. Clustering is an important method in data mining: after unlabeled biological data are preprocessed, a suitable clustering algorithm is selected or improved and a clustering model is built, so that the hidden intrinsic properties and rules of the biological data can be obtained, providing a basis for further data analysis and being of great significance for scientific analysis and research by domain experts. Research on the clustering algorithms themselves has continued to develop, and many clustering algorithms have been proposed, such as partition-based K-Means, density-based DBSCAN and hierarchy-based BIRCH. However, no single clustering algorithm performs best on all data sets, so selecting the most suitable algorithm for a given data set by trial and error demands considerable expert experience, computing resources and time.
Automated machine learning (AutoML) uses machine learning to reduce the cost of scientific discovery and analysis and the required participation of domain experts, improving the usability of machine learning by automating its various stages, such as algorithm selection, hyper-parameter optimization and feature engineering. Existing AutoML work, however, focuses mainly on supervised classification learning, and research on unsupervised clustering is scarce.
Disclosure of Invention
The invention aims to provide an automatic clustering algorithm selection system and method for scientific data analysis, so as to solve the problems in the prior art, greatly reduce the computation cost of clustering algorithm selection, and select the most appropriate clustering algorithm for a given data set more comprehensively.
In order to achieve the purpose, the invention provides the following scheme:
the invention provides an automatic clustering algorithm selection system for scientific data analysis, which comprises a data preprocessing module and a clustering algorithm selector construction module;
the data preprocessing module is used for preprocessing a given data set and comprises a metadata extraction module and a first single-scale selection module; the metadata extraction module is used for extracting the meta-features of the given data set and evaluating the performance of the preprocessing algorithms on the given data set, and the first single-scale selection module is used for extracting the single-scale features of the given data set;
the clustering algorithm selector construction module is used for constructing the clustering algorithm selector and comprises: the metadata set construction module, used for extracting the meta-features of the data sets and the performance of the candidate algorithm set to construct a meta-data set; the second single-scale selection module, configured to extract the single-scale features of the meta-data set, where the single-scale features include hidden data set features, hidden algorithm features and hidden performance features; and the multi-scale fusion module, used for constructing a Stacking set from the single-scale features of the meta-data set extracted by the second single-scale selection module and training a random forest multi-output regression model on the Stacking set to obtain the clustering algorithm selector.
Preferably, the metadata extraction module comprises a first meta-feature calculation module and a first performance evaluation module; the first meta-feature calculation module is used for extracting the meta-features of the given data set, and the first performance evaluation module is used for evaluating the performance of the preprocessing algorithms on the given data set. The metadata set construction module comprises a second meta-feature calculation module and a second performance evaluation module; the second meta-feature calculation module is used for extracting the meta-features of the data sets in the data set space and constructing a meta-feature matrix, and the second performance evaluation module is used for evaluating the performance of the algorithms in the candidate algorithm set and constructing a performance matrix.
Preferably, the first single-scale selection module comprises a first nearest meta-feature matching module, a first nearest performance feature matching module and a first performance matrix decomposition module; the first nearest meta-feature matching module is used for extracting the hidden data set features of the given data set, the first nearest performance feature matching module is used for extracting the hidden algorithm features of the given data set, and the first performance matrix decomposition module is used for extracting the hidden performance features of the given data set;
the second single-scale selection module comprises a second nearest meta-feature matching module, a second nearest performance feature matching module and a second performance matrix decomposition module; the second nearest meta-feature matching module is used for extracting the hidden data set features of the meta-data set and constructing a hidden data set feature matrix; the second nearest performance feature matching module is used for extracting the hidden algorithm features of the meta-data set and constructing a hidden algorithm feature matrix; and the second performance matrix decomposition module is used for extracting the hidden performance features of the meta-data set and constructing a hidden performance feature matrix.
The invention also provides an automatic clustering algorithm selection method for scientific data analysis, which specifically comprises the following steps:
s1, constructing a clustering algorithm selector, comprising the following steps:
s1-1, extracting meta-features of the data set and performance of the candidate algorithm set through a meta-data set construction module, and constructing a meta-data set;
selecting multi-field scientific data sets used for clustering as the data set space D and a set of candidate clustering algorithms as the algorithm space A; extracting the meta-features of the data sets in the data set space D through the second meta-feature calculation module and constructing a meta-feature matrix F; the second performance evaluation module uses Bayesian optimization to evaluate the optimal performance p_ij of each algorithm a_j ∈ A on each data set d_i ∈ D and constructs a performance matrix P. Performance evaluation is time-constrained: if an algorithm's evaluation time exceeds a preset threshold, evaluation stops and the lowest value of the clustering performance metric is taken as that algorithm's performance. The meta-feature matrix F and the performance matrix P are combined into a meta-data set M, which is divided into a meta-training set M_train and a meta-prediction set M_pred;
S1-2, extracting the single-scale features of the meta-data set through the second single-scale selection module: inputting the meta-training set M_train and the meta-prediction set M_pred constructed in step S1-1 into the second nearest meta-feature matching module, the second nearest performance feature matching module and the second performance matrix decomposition module, respectively, to construct the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta-prediction set M_pred;
s1-3, the multi-scale fusion module uses the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta-prediction set M_pred constructed in step S1-2 to build a Stacking set S, and trains a random forest multi-output regression model on the Stacking set S to obtain the clustering algorithm selector;
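A minimal sketch of step S1-3, under stated assumptions: all matrix shapes and contents below are illustrative stand-ins (random data), and the Stacking set is assumed to be the column-wise concatenation of the three single-scale prediction matrices. scikit-learn's `RandomForestRegressor` natively supports multi-output regression, so one model predicts the performance of every candidate algorithm at once.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_datasets, n_algos = 66, 6  # sizes taken from the embodiment

# Stand-ins for the three single-scale feature matrices
# (one column per candidate algorithm in each).
F_meta = rng.random((n_datasets, n_algos))   # hidden data set features
F_perf = rng.random((n_datasets, n_algos))   # hidden algorithm features
F_svd = rng.random((n_datasets, n_algos))    # hidden performance features
P_true = rng.random((n_datasets, n_algos))   # evaluated performance matrix

# Stacking set: concatenate the single-scale features column-wise.
S = np.hstack([F_meta, F_perf, F_svd])

# Multi-output regression: one performance target per candidate algorithm.
selector = RandomForestRegressor(n_estimators=100, random_state=0)
selector.fit(S, P_true)

# Expected performance of all candidates for a (preprocessed) data set.
expected = selector.predict(S[:1])
best_algo = int(np.argmax(expected))
```

The multi-output formulation is what allows a per-algorithm view of the prediction: each output column corresponds to one candidate algorithm's expected performance.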
s2, selecting a clustering algorithm for the given data set, comprising the following steps:
s2-1, preprocessing the given data set by using the data preprocessing module: the given data set is input into the metadata extraction module, the first meta-feature calculation module calculates the meta-features of the given data set, and the first performance evaluation module evaluates the performance of the preprocessing algorithms on the given data set;
inputting the meta-features obtained by the first meta-feature calculation module and the algorithm performance obtained by the first performance evaluation module into the first single-scale selection module, and extracting the hidden data set features, hidden algorithm features and hidden performance features of the given data set through the first nearest meta-feature matching module, the first nearest performance feature matching module and the first performance matrix decomposition module, respectively;
s2-2, inputting the hidden data set features, hidden algorithm features and hidden performance features of the given data set obtained in step S2-1 into the clustering algorithm selector to obtain the expected performance of the candidate algorithms, and selecting a suitable clustering algorithm for the given data set according to the expected performance.
Preferably, the workflow of the second nearest meta-feature matching module in step S1-2 includes:
step 1) for each data set i in the meta-prediction set M_pred, compute the Pearson correlation coefficient to measure the similarity SI_ij between the meta-feature vector f_i of data set i and the meta-feature vector f_j of each data set in the meta-training set M_train, as shown in formula (1), and obtain the similarity matrix SI:

$SI_{ij} = \dfrac{E[(f_i - \mu_{f_i})(f_j - \mu_{f_j})]}{\sigma_{f_i}\,\sigma_{f_j}}$    (1)

where f_i denotes the meta-feature vector of a data set in the meta-prediction set M_pred, f_j denotes the meta-feature vector of a data set in the meta-training set M_train, $\mu_{f_i}$ and $\mu_{f_j}$ denote the means of the vectors f_i and f_j, $\sigma_{f_i}$ and $\sigma_{f_j}$ denote their standard deviations, and E(·) denotes expectation;
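Formula (1) can be sketched with NumPy as follows; the shapes and data are illustrative assumptions (4 query data sets, 8 training data sets, 10 meta-features each), and the function name is hypothetical.

```python
import numpy as np

def pearson_similarity(F_pred, F_train):
    """SI[i, j] = Pearson correlation of meta-feature vectors f_i and f_j."""
    # Centre each meta-feature vector (row) by its own mean.
    A = F_pred - F_pred.mean(axis=1, keepdims=True)
    B = F_train - F_train.mean(axis=1, keepdims=True)
    # E[(f_i - mu_i)(f_j - mu_j)] computed as a mean of products,
    # divided by the product of the standard deviations.
    cov = A @ B.T / A.shape[1]
    sigma = np.outer(A.std(axis=1), B.std(axis=1))
    return cov / sigma

rng = np.random.default_rng(0)
F_pred = rng.random((4, 10))    # meta-features of 4 query data sets
F_train = rng.random((8, 10))   # meta-features of 8 training data sets
SI = pearson_similarity(F_pred, F_train)
```

Each row of `SI` then ranks the training data sets by similarity to one query data set, which is exactly what step 2) consumes.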
step 2) based on the similarity matrix SI, obtain the performance vectors topk_i of the k data sets in the meta-training set M_train most similar to data set i, and aggregate these k performance vectors to obtain the normalized predicted performance $\hat{p}_{ij}$ of algorithm a_j ∈ A on data set i, as shown in formula (2), yielding the predicted performance matrix $\hat{P}_i$ of data set i:

$\hat{p}_{ij} = \dfrac{1}{k}\sum_{t=1}^{k} \dfrac{topk_{i,j,t}}{\overline{topk}_{i,t}}$    (2)

where i denotes a data set in the meta-prediction set M_pred, j denotes an algorithm in the algorithm space A, $\hat{p}_{ij}$ denotes the predicted performance of algorithm a_j ∈ A on data set i, $topk_{i,j,t}$ denotes the performance of the t-th most similar data set on algorithm j, $\overline{topk}_{i,t}$ denotes the performance average of the t-th data set over all algorithms, and 1 ≤ t ≤ k;

step 3) according to the predicted performance matrix $\hat{P}_i$, select the optimal algorithm, generate the algorithm binary coding vector of data set i, and construct the hidden data set feature matrix.
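Steps 2) and 3) can be sketched as follows, under the assumption (consistent with formula (2)) that each neighbour's performance row is normalized by its own mean over all algorithms before averaging over the k neighbours; all data and the function name are illustrative.

```python
import numpy as np

def predict_and_encode(SI_row, P_train, k=3):
    """Predict performance for one query data set from its k nearest
    neighbours in meta-feature space, then one-hot encode the winner."""
    topk = np.argsort(SI_row)[::-1][:k]           # k most similar data sets
    perf = P_train[topk]                          # (k, n_algos) performance rows
    # Formula (2): normalize each neighbour's row by its mean over all
    # algorithms, then average over the k neighbours.
    p_hat = (perf / perf.mean(axis=1, keepdims=True)).mean(axis=0)
    code = np.zeros_like(p_hat)
    code[np.argmax(p_hat)] = 1.0                  # algorithm binary coding vector
    return p_hat, code

rng = np.random.default_rng(1)
SI_row = rng.random(8)                 # similarity to 8 training data sets
P_train = rng.random((8, 6)) + 0.1     # performance of 6 candidate algorithms
p_hat, code = predict_and_encode(SI_row, P_train)
```

Stacking the `code` vectors over all query data sets gives the hidden data set feature matrix described in step 3).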
Preferably, the workflow of the second nearest performance feature matching module in step S1-2 includes:
step 1) take the meta-training set M_train as the historical performance matrix and calculate the normalized performance mean $\bar{p}_j$ of each algorithm a_j ∈ A over all data sets in the meta-training set M_train, as shown in formula (3):

$\bar{p}_j = \dfrac{1}{m}\sum_{i=1}^{m} \dfrac{p_{ij}}{\bar{p}_i}$    (3)

where m denotes the number of data sets in the meta-training set M_train, $p_{ij}$ denotes the performance of algorithm a_j on data set d_i in M_train, and $\bar{p}_i$ denotes the performance average of all algorithms on data set d_i;
step 2) according to the normalized performance mean $\bar{p}_j$, select the h algorithms with the best overall performance as the preprocessing algorithms;

step 3) for each data set d_i in the meta-prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) calculate the Euclidean distance between the performance vector p_i and the performance vectors p_j of the corresponding algorithms in the meta-training set M_train to measure the similarity of p_i and p_j, as shown in formula (4):

$d(p_i, p_j) = \sqrt{\sum_{l} (p_{i,l} - p_{j,l})^2}$    (4)

where p_i denotes a performance vector in the meta-prediction set M_pred and p_j denotes a performance vector in the meta-training set M_train;
step 5) sort the similarities obtained in step 4) to obtain the performance vectors topk_i of the k most similar data sets, aggregate topk_i using formula (2) to obtain the predicted performance $\hat{p}_{ij}$, select the algorithm with the best predicted performance, generate the algorithm binary coding vector of data set i, and construct the hidden algorithm feature matrix.
Preferably, the workflow of the second performance matrix decomposition module in step S1-2 includes:
step 1) take the meta-training set M_train as the historical performance matrix and calculate the normalized performance mean $\bar{p}_j$ of each algorithm a_j ∈ A over all data sets in M_train, as shown in formula (3);

step 2) according to the normalized performance mean $\bar{p}_j$, select the h algorithms with the best overall performance as the preprocessing algorithms;

step 3) for each data set d_i in the meta-prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) decompose the performance matrix P into three matrices U, Σ and V^T using singular value decomposition (SVD), obtaining the performance matrix model shown in formula (5); the matrix P has m rows and n columns, meaning there are m data sets in the data set space D and n clustering algorithms in the candidate algorithm space:

$P = U \Sigma V^T$    (5)

step 5) use the matrix model obtained in step 4) to complete the performance vector p_i with missing values, obtaining the predicted performance of all algorithms on the meta-prediction set M_pred; select the algorithm with the optimal predicted performance for each data set according to the predicted performance, generate the algorithm binary coding vector of the data set, and construct the hidden performance feature matrix.
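A toy sketch of the matrix-decomposition branch under stated assumptions: the patent does not specify how missing entries are handled during decomposition, so here they are mean-imputed, a truncated SVD of formula (5) reconstructs a low-rank model, and the missing entries are read back as predictions. All data, the rank, and the function name are illustrative.

```python
import numpy as np

def svd_complete(P, rank=2):
    """Fill missing entries (NaN) of the performance matrix via
    truncated SVD: P ~= U_r S_r V_r^T (formula (5))."""
    mask = np.isnan(P)
    # Mean-impute each algorithm's column before decomposing.
    filled = np.where(mask, np.nanmean(P, axis=0, keepdims=True), P)
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Keep observed entries, take the low-rank estimate where missing.
    return np.where(mask, approx, P)

rng = np.random.default_rng(4)
P = rng.random((8, 6))                 # 8 data sets x 6 algorithms
P[2, 4] = np.nan                       # an unevaluated (data set, algorithm) pair
completed = svd_complete(P)
best = int(np.argmax(completed[2]))    # optimal algorithm for data set 2
```

Observed entries pass through unchanged; only the missing cells receive the low-rank prediction.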
The invention discloses the following technical effects:
1. the invention provides an automatic clustering algorithm selection system and method for scientific data based on multi-scale feature fusion, which comprehensively considers the meta-features and performance features of the data, learns the hidden features of the data, and uses the hidden features to select an applicable clustering algorithm for the data, selecting the most appropriate algorithm for a given data set more comprehensively.
2. A random forest multi-output regression model based on multi-scale features is established, with an independent model for each candidate algorithm; therefore, when candidate algorithms are added or deleted, the existing models need not be retrained and only a model for the newly added candidate algorithm needs to be built. In the performance matrix evaluation, a time constraint is applied: an algorithm whose evaluation time exceeds a preset threshold stops being evaluated, which saves computing resources.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a clustering algorithm selector construction of the present invention;
FIG. 2 is a flow chart of the present invention for selecting a clustering algorithm for a given data set;
FIG. 3 is a result of algorithm selection for an Arrhythmia dataset according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an algorithm selection result of a Dermatology dataset according to an embodiment of the present invention;
FIG. 5 shows the algorithm selection result of the Vertebral Column dataset according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention; obviously, the described embodiments are only some embodiments of the present invention, rather than all embodiments.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, a more detailed description is provided below in conjunction with the accompanying drawings and the detailed description.
Referring to fig. 1-2, the present embodiment provides an automatic clustering algorithm selection system applied to the biomedical field, including:
the data preprocessing module and the clustering algorithm selector building module:
the data preprocessing module is used for preprocessing a given data set and comprises a metadata extraction module and a first single-scale selection module. The metadata extraction module is used for extracting the meta-features of the given data set and evaluating the performance of the preprocessing algorithms on the given data set; it comprises a first meta-feature calculation module, used for extracting the meta-features of the given data set, and a first performance evaluation module, used for evaluating the performance of the preprocessing algorithms on the given data set. The first single-scale selection module is used for extracting the single-scale features of the given data set and comprises a first nearest meta-feature matching module, used for extracting the hidden data set features of the given data set, a first nearest performance feature matching module, used for extracting the hidden algorithm features of the given data set, and a first performance matrix decomposition module, used for extracting the hidden performance features of the given data set;
the clustering algorithm selector construction module comprises a metadata set construction module, a second single-scale selection module and a multi-scale fusion module. The metadata set construction module is used for extracting the meta-features of the data sets and the performance of the candidate algorithm set to construct the meta-data set; it comprises a second meta-feature calculation module, used for extracting the meta-features of the data sets in the data set space and constructing a meta-feature matrix, and a second performance evaluation module, used for evaluating the performance of the algorithms in the candidate algorithm set and constructing a performance matrix. The second single-scale selection module is used for extracting the single-scale features of the meta-data set and comprises a second nearest meta-feature matching module, used for extracting the hidden data set features of the meta-data set and constructing a hidden data set feature matrix, a second nearest performance feature matching module, used for extracting the hidden algorithm features of the meta-data set and constructing a hidden algorithm feature matrix, and a second performance matrix decomposition module, used for extracting the hidden performance features of the meta-data set and constructing a hidden performance feature matrix. The multi-scale fusion module is used for constructing a Stacking set from the single-scale features of the meta-data set extracted by the second single-scale selection module and training a random forest multi-output regression model on the Stacking set to obtain the clustering algorithm selector.
The embodiment provides an automatic clustering algorithm selection method applied to the biomedical field, which comprises the following steps:
s1, constructing a clustering algorithm selector, comprising the following steps:
s1-1, extracting meta-features of the data set and performance of the candidate algorithm set through a meta-data set construction module, and constructing a meta-data set;
in the embodiment, 198 multi-field scientific data sets from OpenML are selected as the data set space D, and 6 clustering algorithms from scikit-learn are selected as the algorithm space A, namely KMeans, MeanShift, DBSCAN, AffinityPropagation, Birch and AgglomerativeClustering. 132 of the 198 data sets are randomly selected and input to the metadata set construction module to construct the meta-data set; the remaining data sets are used to test the generalization performance of the method. The second meta-feature calculation module extracts descriptive features (i.e. meta-features) of the data sets in the data set space D, such as sample count, skewness and signal-to-noise ratio, based on statistics, information theory, landmarkers and other methods, and constructs the meta-feature matrix F; the extracted meta-features are shown in Table 1. Considering that the performance of a clustering algorithm depends to a great extent on its hyper-parameter configuration, the second performance evaluation module uses Bayesian optimization, with the silhouette coefficient as the clustering performance metric, to search for the optimal performance p_ij of each algorithm a_j ∈ A on each data set d_i ∈ D, reducing the influence of the hyper-parameters on the algorithm selection result, and constructs the performance matrix P accordingly. When evaluating the optimal performance of an algorithm, a time constraint is applied: if the time exceeds a preset threshold, the clustering algorithm is considered unsuitable for the data set, evaluation stops, and the lowest silhouette coefficient value "-1" is taken as the algorithm performance. The meta-feature matrix F and the performance matrix P are combined into the meta-data set M; 66 of the 132 entries are selected to construct the meta-training set M_train, and the remaining 66 construct the meta-prediction set M_pred.
TABLE 1
[Table 1: the meta-features extracted by the meta-feature calculation module; table image not reproduced]
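The performance-evaluation step can be sketched as follows; this is a simplified stand-in under stated assumptions: a small random hyper-parameter search replaces Bayesian optimization, a wall-clock budget implements the time constraint, KMeans stands in for any candidate algorithm, and all names and parameters are illustrative.

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def evaluate_algorithm(X, n_trials=10, time_budget=5.0, seed=0):
    """Best silhouette coefficient found within the time budget.

    Returns -1 (the lowest silhouette value) if no trial completes,
    mirroring the patent's handling of algorithms that exceed the
    evaluation-time threshold.
    """
    rng = np.random.default_rng(seed)
    best, start = -1.0, time.monotonic()
    for _ in range(n_trials):
        if time.monotonic() - start > time_budget:
            break                                   # time constraint: stop early
        k = int(rng.integers(2, 9))                 # sampled hyper-parameter
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X)
        best = max(best, silhouette_score(X, labels))
    return best

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
p = evaluate_algorithm(X)
```

Repeating this over every (data set, algorithm) pair fills one cell p_ij of the performance matrix P at a time.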
S1-2, extracting the single-scale features of the meta-data set through the second single-scale selection module: inputting the meta-training set M_train and the meta-prediction set M_pred constructed in step S1-1 into the second nearest meta-feature matching module, the second nearest performance feature matching module and the second performance matrix decomposition module, respectively, to construct the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta-prediction set M_pred;
s1-2-1, constructing the hidden data set feature matrix of the meta-prediction set M_pred through the second nearest meta-feature matching module;
in order to reduce the random error caused by Bayesian optimization and improve the robustness of the method, the k most similar data sets are selected, and the normalized performance of the algorithms in the algorithm space on these k data sets is taken as the predicted performance of the data set to construct the hidden data set feature vector; k = 3 in this embodiment. The steps are as follows:
s1-2-1-1, for each data set i in the meta-prediction set M_pred, compute the Pearson correlation coefficient to measure the similarity SI_ij between the meta-feature vector f_i of data set i and the meta-feature vector f_j of each data set in the meta-training set M_train, as shown in formula (1), and obtain the similarity matrix SI:

$SI_{ij} = \dfrac{E[(f_i - \mu_{f_i})(f_j - \mu_{f_j})]}{\sigma_{f_i}\,\sigma_{f_j}}$    (1)

where f_i denotes the meta-feature vector of a data set in the meta-prediction set M_pred, f_j denotes the meta-feature vector of a data set in the meta-training set M_train, $\mu_{f_i}$ and $\mu_{f_j}$ denote the means of the vectors f_i and f_j, $\sigma_{f_i}$ and $\sigma_{f_j}$ denote their standard deviations, and E(·) denotes expectation;
s1-2-1-2, obtaining a meta training set M based on the similarity matrix SItrainThe corresponding performance vectors topk of the k data sets most similar to the data set iiAnd the performance vectors topk corresponding to the k data setsiPerforming grouping to obtain algorithm aje.A in dataset
Figure BDA00022204527200001114
Tropism of Shang Gui HuaPredicting performanceAs shown in equation (2), and obtain the predicted performance matrix of the data set i
Figure BDA0002220452720000116
Figure BDA0002220452720000117
Wherein i represents a meta prediction set MpredJ denotes the algorithm in the algorithm space a,
Figure BDA0002220452720000118
representation algorithm aje.A in dataset
Figure BDA0002220452720000119
Above classification predicted Performance, topki,j,tRepresents topkiThe performance average over all algorithms for the t-th dataset,
Figure BDA00022204527200001110
represents topkiT is more than or equal to 1 and less than or equal to k, and k is 3;
S1-2-1-3: according to the predicted performance matrix \hat{P}_i, select the optimal algorithms, generate the binary algorithm coding vector of data set i, and construct the hidden data set feature matrix.
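The aggregation in steps S1-2-1-2 and S1-2-1-3 can be sketched as follows, under the assumption (suggested by the symbol descriptions around equation (2)) that each similar data set's performance row is normalized by its mean over all algorithms before the k rows are averaged; all names and numbers are illustrative.

```python
import numpy as np

def predicted_performance(topk):
    """topk: (k, n_algorithms) performance rows of the k most similar data sets.
    Each row is normalized by that data set's mean over all algorithms, then
    the k normalized rows are averaged per algorithm (equation (2))."""
    row_means = topk.mean(axis=1, keepdims=True)
    return (topk / row_means).mean(axis=0)

# Performance rows of the k = 2 most similar data sets (toy values).
topk_i = np.array([[0.6, 0.2, 0.4],
                   [0.9, 0.3, 0.6]])
p_hat = predicted_performance(topk_i)
best = int(np.argmax(p_hat))   # step S1-2-1-3: pick the best-predicted algorithm
```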
S1-2-2: construct the hidden algorithm feature matrix of the meta prediction set M_pred through the second nearest performance feature matching module;
The second nearest performance feature matching module takes algorithm performance as a descriptive feature of the data sets and considers the similarity of data sets in the algorithm performance space: an applicable algorithm is selected according to the algorithm performance of the data set closest to the given data set in that space. However, for a new data set the algorithm performance is initially empty, which turns the problem into a cold-start problem. Therefore, considering the historical performance of the algorithms, the algorithms with the best comprehensive performance are selected for pre-evaluation, and algorithm selection is then carried out according to the evaluated performance. As with nearest meta-feature matching, nearest performance feature matching also considers the k closest data sets. The steps are as follows:
S1-2-2-1: take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation (3):

\bar{p}_j = (1/m) \sum_{i=1}^{m} p_{ij} / \bar{p}_i    (3)

where m denotes the number of data sets in the meta training set M_train (m = 66 in this embodiment); p_{ij} denotes the performance of algorithm a_j on data set d_i in M_train, and \bar{p}_i denotes the performance average of all algorithms on data set d_i in M_train;
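A minimal sketch of the normalization in equation (3), assuming each performance entry is divided by its data set's mean over all algorithms before averaging down the columns; the matrix values are illustrative.

```python
import numpy as np

# Rows: data sets d_i in the meta training set; columns: algorithms a_j.
P_train = np.array([[0.6, 0.2, 0.4],
                    [0.9, 0.3, 0.6],
                    [0.5, 0.5, 0.5]])
dataset_means = P_train.mean(axis=1, keepdims=True)        # mean over all algorithms on d_i
normalized_mean = (P_train / dataset_means).mean(axis=0)   # equation (3), one value per algorithm
ranking = np.argsort(-normalized_mean)                     # best comprehensive performance first
```

The first entries of `ranking` are the algorithms that would be chosen as preprocessing algorithms in the next step.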
S1-2-2-2: according to the normalized performance mean \bar{p}_j, select the 2 algorithms with the best comprehensive performance, MeanShift and AgglomerativeClustering, as the preprocessing algorithms;
S1-2-2-3: for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
S1-2-2-4: calculate the Euclidean distance between the performance vector p_i and the performance vector p_j of the corresponding algorithms in the meta training set M_train to measure the similarity of p_i and p_j, as shown in equation (4):

d(p_i, p_j) = sqrt( \sum_l (p_{i,l} - p_{j,l})^2 )    (4)

where p_i denotes a performance vector of the meta prediction set M_pred and p_j denotes a performance vector of the meta training set M_train;
S1-2-2-5: sort the similarities obtained in step S1-2-2-4 to obtain the performance vectors topk_i of the k data sets with the highest similarity, aggregate topk_i using equation (2) to obtain the predicted performance \hat{P}_{i,j}, select the algorithms with the best predicted performance, generate the binary algorithm coding vector of data set i, and construct the hidden algorithm feature matrix.
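Steps S1-2-2-3 through S1-2-2-5 amount to a nearest-neighbour search in performance space. A sketch under assumed names and toy values (in the method, the vectors come from the two preprocessing algorithms):

```python
import numpy as np

def k_nearest_by_performance(p_i, P_train_pre, k=3):
    """Return indices of the k meta-training data sets whose performance vectors
    on the preprocessing algorithms are closest to p_i in Euclidean distance
    (equation (4))."""
    dists = np.linalg.norm(P_train_pre - p_i, axis=1)
    return np.argsort(dists)[:k]

p_i = np.array([0.5, 0.4])                 # new data set, 2 preprocessing algorithms
P_train_pre = np.array([[0.9, 0.1],
                        [0.55, 0.35],
                        [0.2, 0.8],
                        [0.5, 0.45]])
nearest = k_nearest_by_performance(p_i, P_train_pre, k=2)
```

The returned rows would then be aggregated with equation (2), exactly as in nearest meta-feature matching.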
S1-2-3: construct the hidden performance feature matrix of the meta prediction set M_pred through the second performance matrix decomposition module, which specifically comprises the following steps:
S1-2-3-1: take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation (3);
S1-2-3-2: according to the normalized performance mean, select the 2 algorithms with the best comprehensive performance, MeanShift and AgglomerativeClustering, as the preprocessing algorithms;
S1-2-3-3: for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
S1-2-3-4: decompose the performance matrix P into three matrices U, Σ and V^T using singular value decomposition (SVD) to obtain a performance matrix model, as shown in equation (5); the matrix P has m rows and n columns, indicating that there are m data sets in the data set space D and n clustering algorithms in the candidate algorithm space:

P = U Σ V^T    (5)
S1-2-3-5: use the matrix model obtained in step S1-2-3-4 to complete the performance vector p_i with missing values, obtaining the predicted performance of all algorithms on the meta prediction set M_pred; select the algorithms with the best performance for each data set according to the predicted performance, generate the binary algorithm coding vector of the data set, and construct the hidden performance feature matrix.
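One possible reading of steps S1-2-3-4 and S1-2-3-5 is sketched below: build a low-rank SVD model of the historical performance matrix (equation (5)), then complete a partially observed performance vector by least-squares projection of its observed entries onto the top-r right singular vectors. The rank r, the completion scheme and all values are assumptions for illustration.

```python
import numpy as np

# Historical performance matrix (rank 2 by construction); rows: data sets, cols: algorithms.
P = np.array([[0.6, 0.2, 0.4, 0.3],
              [0.9, 0.3, 0.6, 0.45],
              [0.4, 0.6, 0.5, 0.55],
              [0.2, 0.8, 0.5, 0.65]])
U, s, Vt = np.linalg.svd(P, full_matrices=False)   # equation (5): P = U Σ V^T
r = 2
V_r = Vt[:r].T                                     # top-r right singular vectors

def complete(p_partial, observed):
    """Fit the observed entries (the preprocessing algorithms) to the rank-r
    row space by least squares, then predict the full performance row."""
    coef, *_ = np.linalg.lstsq(V_r[observed], p_partial, rcond=None)
    return V_r @ coef

p_full = complete(np.array([0.9, 0.3]), observed=[0, 1])
best = int(np.argmax(p_full))                      # best-predicted algorithm
```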
S1-3: the multi-scale fusion module constructs a Stacking set S from the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta prediction set M_pred extracted in step S1-2, and trains a random forest multi-output regression model on the Stacking set S to obtain the clustering algorithm selector.
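A sketch of this fusion step using scikit-learn, with randomly generated stand-ins for the three single-scale feature matrices and the evaluated performances; the shapes and variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_datasets, n_algorithms = 40, 6
H_data = rng.random((n_datasets, n_algorithms))   # hidden data set features
H_algo = rng.random((n_datasets, n_algorithms))   # hidden algorithm features
H_perf = rng.random((n_datasets, n_algorithms))   # hidden performance features
S = np.hstack([H_data, H_algo, H_perf])           # Stacking set S
Y = rng.random((n_datasets, n_algorithms))        # evaluated performance per algorithm

# RandomForestRegressor supports multi-output regression natively, so one
# model predicts the expected performance of every candidate algorithm at once.
selector = RandomForestRegressor(n_estimators=50, random_state=0).fit(S, Y)
expected = selector.predict(S[:1])                # expected performance vector
```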
S2: select a clustering algorithm for a given data set. In this embodiment three scientific data sets from the biomedical field are given: the Arrhythmia data set for distinguishing the presence or absence of cardiac arrhythmia, the Dermatology data set for determining the type of erythemato-squamous disease, and the Vertebral-Column data set for determining bone type according to biomechanical features. Optimal algorithm selection is performed on them to test the clustering algorithm selector constructed in step S1, which specifically comprises the following steps:
S2-1: preprocess the three given data sets with the data preprocessing module, namely input the given data sets into the metadata extraction module; the first meta-feature calculation module extracts the meta-feature vectors of the given data sets, such as sample number, skewness and signal-to-noise ratio, based on statistics, information theory, landmarking and other methods; the first performance evaluation module evaluates, using Bayesian optimization, the performance of the preprocessing algorithms MeanShift and AgglomerativeClustering, and calculates the performance vectors of the two algorithms on the given data sets;
input the meta-features obtained by the first meta-feature calculation module and the algorithm performance obtained by the first performance evaluation module into the first single-scale selection module: the first nearest meta-feature matching module calculates the hidden data set features of the given data set based on its meta-feature vector, and the first nearest performance feature matching module and the first performance matrix decomposition module calculate the hidden algorithm features and hidden performance features of the given data set, respectively, based on the performance vector of the preprocessing algorithms;
s2-2, inputting the hidden data set characteristic, the hidden algorithm characteristic and the hidden performance characteristic of the given data set obtained in the step S2-1 into a clustering algorithm selector to obtain the expected performance of the candidate algorithm, and selecting a proper clustering algorithm for the given data set according to the expected performance.
The experimental results are as follows:
As shown in Fig. 3, the result of algorithm selection among the above 6 clustering algorithms for the Arrhythmia data set is given, where SIL denotes the silhouette coefficient of the clustering; the higher the silhouette coefficient, the better the clustering effect. The experimental results show that the silhouette coefficient of the Agglomerative algorithm is 0.59, better than the other clustering algorithms (KMeans: 0.13, AffinityPropagation: 0.05, MeanShift: 0.49, DBSCAN: 0.33, Birch: 0.11). The algorithms with better expected performance selected by the present application are MeanShift and Agglomerative, which include the Agglomerative algorithm with the best actual performance; the actual performance of the selected MeanShift algorithm is second only to the Agglomerative algorithm.
As shown in Fig. 4, the result of algorithm selection among the above 6 clustering algorithms for the Dermatology data set is given. The experimental results show that KMeans and MeanShift perform best, both with a silhouette coefficient of 0.51, better than the other clustering algorithms (Agglomerative: 0.49, AffinityPropagation: 0.43, DBSCAN: 0.34, Birch: 0.38). The algorithms with better expected performance selected by the present application are Agglomerative and MeanShift, which include the best-performing MeanShift algorithm and the Agglomerative algorithm ranked in the top three of actual performance among the candidate clustering algorithms; the performance difference between KMeans and MeanShift is small.
As shown in Fig. 5, the result of algorithm selection among the above 6 clustering algorithms for the Vertebral-Column data set is given. The experimental results show that MeanShift, Agglomerative and DBSCAN are the best-performing algorithms, all with a silhouette coefficient of 0.86. The algorithms selected by the present application are KMeans and DBSCAN; although the selected KMeans algorithm performs poorly, with a silhouette coefficient of only 0.45, the optimal DBSCAN algorithm is included in the selection.
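The silhouette-based comparison in Figs. 3-5 can be reproduced in miniature with scikit-learn; the toy blobs below stand in for the biomedical data sets, which are not included here, so the scores differ from those reported.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
candidates = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "Birch": Birch(n_clusters=3),
}
# SIL: silhouette coefficient of each clustering; higher means better separation.
sil = {name: silhouette_score(X, algo.fit_predict(X)) for name, algo in candidates.items()}
best = max(sil, key=sil.get)
```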
In conclusion, the clustering algorithm selector constructed by the present application can accurately select the optimal algorithm for the three given data sets from the biomedical field. In addition, when selecting clustering algorithms for the 66 data sets used for testing, the constructed selector never chose the clustering algorithm with the worst actual performance on any data set; at the same time, the algorithms selected for 100% of the data sets include a clustering algorithm ranked in the top 3 of actual performance, those selected for 84.8% of the data sets include one ranked in the top 2, and those selected for 77.3% of the data sets include the clustering algorithm with the best actual performance, effectively reducing the algorithm selection space.
The method provided by the invention can effectively reduce the algorithm selection space while meeting the performance requirements of the data set, thereby reducing the time and cost of applying machine learning to scientific discovery and research in the biomedical field. The clustering algorithm selection method can also be applied across multiple scientific disciplines, improving the usability of machine learning.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (7)

1. An automatic clustering algorithm selection system for scientific data analysis, characterized in that the system comprises a data preprocessing module and a clustering algorithm selector construction module;
the data preprocessing module is used for preprocessing a given data set and comprises a metadata extraction module and a first single-scale selection module, wherein the metadata extraction module is used for extracting the meta-features of the given data set and evaluating the performance of a preprocessing algorithm on the given data set;
the clustering algorithm selector constructing module is used for constructing a clustering algorithm selector, and comprises: the metadata set construction module is used for extracting the metadata features of the data set and the performance of the candidate algorithm set to construct a metadata set; the second single-scale selection module is configured to extract single-scale features of the metadata set, where the single-scale features include: a hidden data set feature, a hidden algorithm feature and a hidden performance feature; the multi-scale fusion module is used for constructing a Stacking set by using the single-scale features of the metadata set extracted by the second single-scale selection module, and training a random forest multi-output regression model based on the Stacking set to obtain the clustering algorithm selector.
2. The automatic clustering algorithm selection system for scientific data analysis according to claim 1, wherein the metadata extraction module comprises a first meta-feature calculation module and a first performance evaluation module; the first meta-feature calculation module is used for extracting meta-features of a given data set, and the first performance evaluation module is used for evaluating the performance of the preprocessing algorithm on the given data set; the metadata set construction module comprises a second meta-feature calculation module and a second performance evaluation module; the second meta-feature calculation module is used for extracting meta-features of the data sets in the data set space and constructing a meta-feature matrix, and the second performance evaluation module is used for evaluating the performance of the algorithms in the candidate algorithm set and constructing a performance matrix.
3. The scientific data analysis-oriented automatic clustering algorithm selection system as claimed in claim 1, wherein the first single-scale selection module comprises a first nearest meta-feature matching module, a first nearest performance feature matching module and a first performance matrix decomposition module; the first nearest meta-feature matching module is used for extracting the hidden data set features of the given data set, the first nearest performance feature matching module is used for extracting the hidden algorithm features of the given data set, and the first performance matrix decomposition module is used for extracting the hidden performance features of the given data set;
the second single-scale selection module comprises a second nearest meta-feature matching module, a second nearest performance feature matching module and a second performance matrix decomposition module, wherein the second nearest meta-feature matching module is used for extracting the hidden data set features of the metadata set and constructing a hidden data set feature matrix; the second nearest performance feature matching module is used for extracting the hidden algorithm features of the metadata set and constructing a hidden algorithm feature matrix; and the second performance matrix decomposition module is used for extracting the hidden performance features of the metadata set and constructing a hidden performance feature matrix.
4. An automatic clustering algorithm selection method for scientific data analysis, characterized in that the method comprises the following steps:
s1, constructing a clustering algorithm selector, comprising the following steps:
s1-1, extracting meta-features of the data set and performance of the candidate algorithm set through a meta-data set construction module, and constructing a meta-data set;
selecting multi-field scientific data sets for clustering as the data set space D and a candidate clustering algorithm set as the algorithm space A; extracting the meta-features of the data sets in the data set space D through the second meta-feature calculation module, and constructing the meta-feature matrix F; the second performance evaluation module evaluates, using Bayesian optimization, the optimal performance p_{ij} of each algorithm a_j ∈ A on each data set d_i ∈ D, and constructs the performance matrix P; the performance evaluation of an algorithm is time-constrained: if the evaluation time exceeds a preset threshold, the evaluation is stopped and the lowest value of the clustering performance metric is taken as the algorithm performance; the meta-feature matrix F and the performance matrix P are combined to construct the metadata set M, which is divided into a meta training set M_train and a meta prediction set M_pred;
S1-2: extract the single-scale features of the metadata set through the second single-scale selection module: input the meta training set M_train and the meta prediction set M_pred constructed in step S1-1 into the second nearest meta-feature matching module, the second nearest performance feature matching module and the second performance matrix decomposition module respectively, and construct the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta prediction set M_pred;
S1-3: the multi-scale fusion module constructs a Stacking set S from the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta prediction set M_pred constructed in step S1-2, and trains a random forest multi-output regression model on the Stacking set S to obtain the clustering algorithm selector;
s2, selecting a clustering algorithm for the given data set, comprising the following steps:
S2-1: preprocess the given data set with the data preprocessing module, namely input the given data set into the metadata extraction module; the first meta-feature calculation module calculates the meta-features of the given data set, and the first performance evaluation module evaluates the performance of the preprocessing algorithm on the given data set;
input the meta-features obtained by the first meta-feature calculation module and the algorithm performance obtained by the first performance evaluation module into the first single-scale selection module, and extract the hidden data set features, hidden algorithm features and hidden performance features of the given data set through the first nearest meta-feature matching module, the first nearest performance feature matching module and the first performance matrix decomposition module respectively;
S2-2: input the hidden data set features, hidden algorithm features and hidden performance features of the given data set obtained in step S2-1 into the clustering algorithm selector to obtain the expected performance of the candidate algorithms, and select a suitable clustering algorithm for the given data set according to the expected performance.
5. The scientific data analysis-oriented automatic clustering algorithm selection method according to claim 4, characterized in that: the second nearest meta feature matching module workflow in the step S1-2 includes:
step 1) for each data set i in the meta prediction set M_pred, calculate the Pearson correlation coefficient to measure the similarity SI_{i,j} between the meta-feature vector f_i of data set i and the meta-feature vector f_j of each data set j in the meta training set M_train, as shown in equation 1, and obtain the similarity matrix SI:

SI_{i,j} = E[(f_i - \bar{f}_i)(f_j - \bar{f}_j)] / (\sigma_{f_i} \sigma_{f_j})    (1)

where f_i denotes the meta-feature vector of a data set in the meta prediction set M_pred, f_j denotes the meta-feature vector of a data set in the meta training set M_train, \bar{f}_i and \bar{f}_j denote the means of the vectors f_i and f_j, \sigma_{f_i} and \sigma_{f_j} denote their standard deviations, and E(·) denotes expectation;
step 2) based on the similarity matrix SI, obtain the performance vectors topk_i of the k data sets in the meta training set M_train most similar to data set i, and aggregate topk_i to obtain the normalized predicted performance \hat{P}_{i,j} of each algorithm a_j ∈ A on data set i, as shown in equation 2, and obtain the predicted performance matrix \hat{P}_i of data set i:

\hat{P}_{i,j} = (1/k) \sum_{t=1}^{k} topk_{i,j,t} / \overline{topk}_{i,t}    (2)

where i denotes a data set in the meta prediction set M_pred, j denotes an algorithm in the algorithm space A, \hat{P}_{i,j} denotes the predicted performance of algorithm a_j ∈ A on data set i, topk_{i,j,t} denotes the performance of the t-th data set in topk_i on algorithm j, \overline{topk}_{i,t} denotes the performance average of the t-th data set in topk_i over all algorithms, and 1 ≤ t ≤ k;
step 3) according to the predicted performance matrix \hat{P}_i, select the optimal algorithms, generate the binary algorithm coding vector of data set i, and construct the hidden data set feature matrix.
6. The scientific data analysis-oriented automatic clustering algorithm selection method according to claim 4, characterized in that: the second closest performance characteristic matching module workflow in the step S1-2 includes:
step 1) take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation 3:

\bar{p}_j = (1/m) \sum_{i=1}^{m} p_{ij} / \bar{p}_i    (3)

where m denotes the number of data sets in the meta training set M_train, p_{ij} denotes the performance of algorithm a_j on data set d_i in M_train, and \bar{p}_i denotes the performance average of all algorithms on data set d_i in M_train;
step 2) according to the normalized performance mean \bar{p}_j, select the h algorithms with the best comprehensive performance as the preprocessing algorithms;
step 3) for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) calculate the Euclidean distance between the performance vector p_i and the performance vector p_j of the corresponding algorithms in the meta training set M_train to measure the similarity of p_i and p_j, as shown in equation 4:

d(p_i, p_j) = sqrt( \sum_l (p_{i,l} - p_{j,l})^2 )    (4)

where p_i denotes a performance vector of the meta prediction set M_pred and p_j denotes a performance vector of the meta training set M_train;
step 5) sort the similarities obtained in step 4) to obtain the performance vectors topk_i of the k data sets with the highest similarity, aggregate topk_i using equation 2 to obtain the predicted performance \hat{P}_{i,j}, select the algorithms with the best predicted performance, generate the binary algorithm coding vector of data set i, and construct the hidden algorithm feature matrix.
7. The scientific data analysis-oriented automatic clustering algorithm selection method according to claim 4, characterized in that: the second performance matrix decomposition module workflow in the step S1-2 includes:
step 1) take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation 3;
step 2) according to the normalized performance mean \bar{p}_j, select the h algorithms with the best comprehensive performance as the preprocessing algorithms;
step 3) for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) decompose the performance matrix P into three matrices U, Σ and V^T using singular value decomposition (SVD) to obtain a performance matrix model, as shown in equation 5; the matrix P has m rows and n columns, indicating that there are m data sets in the data set space D and n clustering algorithms in the candidate algorithm space:

P = U Σ V^T    (5)

step 5) use the matrix model obtained in step 4) to complete the performance vector p_i with missing values, obtain the predicted performance of all algorithms on the meta prediction set M_pred, select the algorithm with the best performance for each data set according to the predicted performance, generate the binary algorithm coding vector of the data set, and construct the hidden performance feature matrix.
CN201910931657.2A 2019-09-29 2019-09-29 automatic clustering algorithm selection system and method for scientific data analysis Pending CN110738245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910931657.2A CN110738245A (en) 2019-09-29 2019-09-29 automatic clustering algorithm selection system and method for scientific data analysis


Publications (1)

Publication Number Publication Date
CN110738245A true CN110738245A (en) 2020-01-31




Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723862A (en) * 2020-06-18 2020-09-29 广东电网有限责任公司清远供电局 Switch cabinet state evaluation method and device
CN112733067A (en) * 2020-12-22 2021-04-30 上海机器人产业技术研究院有限公司 Data set selection method for robot target detection algorithm
CN114492572A (en) * 2021-12-21 2022-05-13 成都产品质量检验研究院有限责任公司 Material structure classification method and system based on machine learning clustering algorithm



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200131