CN110738245A - automatic clustering algorithm selection system and method for scientific data analysis - Google Patents


Info

Publication number: CN110738245A
Application number: CN201910931657.2A
Authority: CN (China)
Prior art keywords: performance, algorithm, module, data set, meta
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 刘悦, 田文杰, 李勃澄, 祝垲
Current and original assignee: University of Shanghai for Science and Technology
Application filed by University of Shanghai for Science and Technology
Priority to CN201910931657.2A
Publication of CN110738245A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 - Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic clustering algorithm selection system and method for scientific data analysis. The system comprises a data preprocessing module and a clustering algorithm selector construction module. The data preprocessing module comprises a metadata extraction module and a first single-scale selection module; the clustering algorithm selector construction module comprises a metadata set construction module, a second single-scale selection module and a multi-scale fusion module. The multi-scale fusion module uses the single-scale features to build a Stacking set, and a random forest multi-output regression model is trained on the Stacking set to obtain a clustering algorithm selector. A given data set is preprocessed by the data preprocessing module and then input to the clustering algorithm selector, which selects an optimal clustering algorithm for the given data set. The invention selects the optimal clustering algorithm for a given data set more comprehensively while effectively reducing computing resources and time cost.

Description

Automatic clustering algorithm selection system and method for scientific data analysis
Technical Field
The invention relates to the technical field of computer artificial intelligence, and in particular to an automatic clustering algorithm selection system and method for scientific data analysis.
Background
Cluster analysis divides a data set into several clusters by computing the similarity between data objects, so that objects in the same cluster are highly similar while objects in different clusters differ greatly; it is an unsupervised learning method. Cluster analysis is widely applied in many scientific fields such as biology and astronomy. In the biological field, for example, the raw data volume is large and the data are diverse or heterogeneous; analysing biological scientific data by experimental methods alone is time-consuming and costly, so fast and effective computational methods are needed. Clustering is an important method in data mining: after unlabeled biological data are preprocessed, a suitable clustering algorithm is selected or improved and a clustering model is built, so that the hidden intrinsic properties and rules of the biological data can be obtained, providing a basis for further data analysis and being of great significance for scientific analysis and research by domain experts. Research on the clustering algorithms themselves has continued to develop, and many clustering algorithms have been proposed, such as partition-based K-Means, density-based DBSCAN and hierarchy-based BIRCH. However, no single clustering algorithm performs best on all data sets, so selecting the most suitable algorithm for a given data set by trial and error demands considerable expert experience, computing resources and time.
Automated machine learning (AutoML) uses machine learning to reduce the cost of scientific discovery and analysis and the required participation of domain experts, improving the usability of machine learning by automating its various stages, such as algorithm selection, hyper-parameter optimization and feature engineering. Existing AutoML work, however, focuses mainly on supervised classification learning, and research on unsupervised clustering is scarce.
Disclosure of Invention
The invention aims to provide an automatic clustering algorithm selection system and method for scientific data analysis, so as to solve the problems in the prior art, greatly reduce the computation cost of clustering algorithm selection, and select the most appropriate clustering algorithm for a given data set more comprehensively.
In order to achieve the purpose, the invention provides the following scheme:
the invention provides an automatic clustering algorithm selection system for scientific data analysis, which comprises a data preprocessing module and a clustering algorithm selector construction module;
the data preprocessing module is used for preprocessing a given data set and comprises a metadata extraction module and a first single-scale selection module; the metadata extraction module is used for extracting the meta-features of the given data set and evaluating the performance of the preprocessing algorithms on the given data set, and the first single-scale selection module is used for extracting the single-scale features of the given data set;
the clustering algorithm selector construction module is used for constructing the clustering algorithm selector and comprises: the metadata set construction module, used for extracting the meta-features of the data sets and the performance of the candidate algorithm set to construct a meta-data set; the second single-scale selection module, configured to extract the single-scale features of the meta-data set, where the single-scale features include hidden data set features, hidden algorithm features and hidden performance features; and the multi-scale fusion module, used for constructing a Stacking set from the single-scale features of the meta-data set extracted by the second single-scale selection module and training a random forest multi-output regression model on the Stacking set to obtain the clustering algorithm selector.
Preferably, the metadata extraction module comprises a first meta-feature calculation module and a first performance evaluation module; the first meta-feature calculation module is used for extracting the meta-features of the given data set, and the first performance evaluation module is used for evaluating the performance of the preprocessing algorithms on the given data set. The metadata set construction module comprises a second meta-feature calculation module and a second performance evaluation module; the second meta-feature calculation module is used for extracting the meta-features of the data sets in the data set space and constructing a meta-feature matrix, and the second performance evaluation module is used for evaluating the performance of the algorithms in the candidate algorithm set and constructing a performance matrix.
Preferably, the first single-scale selection module comprises a first nearest meta-feature matching module, a first nearest performance feature matching module and a first performance matrix decomposition module; the first nearest meta-feature matching module is used for extracting the hidden data set features of the given data set, the first nearest performance feature matching module is used for extracting the hidden algorithm features of the given data set, and the first performance matrix decomposition module is used for extracting the hidden performance features of the given data set;
the second single-scale selection module comprises a second nearest meta-feature matching module, a second nearest performance feature matching module and a second performance matrix decomposition module; the second nearest meta-feature matching module is used for extracting the hidden data set features of the meta-data set and constructing a hidden data set feature matrix; the second nearest performance feature matching module is used for extracting the hidden algorithm features of the meta-data set and constructing a hidden algorithm feature matrix; and the second performance matrix decomposition module is used for extracting the hidden performance features of the meta-data set and constructing a hidden performance feature matrix.
The invention also provides an automatic clustering algorithm selection method for scientific data analysis, which specifically comprises the following steps:
s1, constructing a clustering algorithm selector, comprising the following steps:
s1-1, extracting meta-features of the data set and performance of the candidate algorithm set through a meta-data set construction module, and constructing a meta-data set;
selecting multi-field scientific data sets used for clustering as the data set space D and a set of candidate clustering algorithms as the algorithm space A; extracting the meta-features of the data sets in the data set space D through the second meta-feature calculation module and constructing a meta-feature matrix F; the second performance evaluation module uses Bayesian optimization to evaluate the optimal performance p_ij of each algorithm a_j ∈ A on each data set d_i ∈ D and constructs a performance matrix P. Performance evaluation is time-constrained: if an algorithm's evaluation time exceeds a preset threshold, evaluation stops and the lowest value of the clustering performance metric is taken as that algorithm's performance. The meta-feature matrix F and the performance matrix P are combined into a meta-data set M, which is divided into a meta-training set M_train and a meta-prediction set M_pred;
S1-2, extracting the single-scale features of the meta-data set through the second single-scale selection module: inputting the meta-training set M_train and the meta-prediction set M_pred constructed in step S1-1 into the second nearest meta-feature matching module, the second nearest performance feature matching module and the second performance matrix decomposition module, respectively, to construct the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta-prediction set M_pred;
s1-3, the multi-scale fusion module uses the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta-prediction set M_pred constructed in step S1-2 to build a Stacking set S, and trains a random forest multi-output regression model on the Stacking set S to obtain the clustering algorithm selector;
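A minimal sketch of step S1-3, under stated assumptions: all matrix shapes and contents below are illustrative stand-ins (random data), and the Stacking set is assumed to be the column-wise concatenation of the three single-scale prediction matrices. scikit-learn's `RandomForestRegressor` natively supports multi-output regression, so one model predicts the performance of every candidate algorithm at once.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_datasets, n_algos = 66, 6  # sizes taken from the embodiment

# Stand-ins for the three single-scale feature matrices
# (one column per candidate algorithm in each).
F_meta = rng.random((n_datasets, n_algos))   # hidden data set features
F_perf = rng.random((n_datasets, n_algos))   # hidden algorithm features
F_svd = rng.random((n_datasets, n_algos))    # hidden performance features
P_true = rng.random((n_datasets, n_algos))   # evaluated performance matrix

# Stacking set: concatenate the single-scale features column-wise.
S = np.hstack([F_meta, F_perf, F_svd])

# Multi-output regression: one performance target per candidate algorithm.
selector = RandomForestRegressor(n_estimators=100, random_state=0)
selector.fit(S, P_true)

# Expected performance of all candidates for a (preprocessed) data set.
expected = selector.predict(S[:1])
best_algo = int(np.argmax(expected))
```

The multi-output formulation is what allows a per-algorithm view of the prediction: each output column corresponds to one candidate algorithm's expected performance.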
s2, selecting a clustering algorithm for the given data set, comprising the following steps:
s2-1, preprocessing the given data set by using the data preprocessing module: the given data set is input into the metadata extraction module, the first meta-feature calculation module calculates the meta-features of the given data set, and the first performance evaluation module evaluates the performance of the preprocessing algorithms on the given data set;
inputting the meta-features obtained by the first meta-feature calculation module and the algorithm performance obtained by the first performance evaluation module into the first single-scale selection module, and extracting the hidden data set features, hidden algorithm features and hidden performance features of the given data set through the first nearest meta-feature matching module, the first nearest performance feature matching module and the first performance matrix decomposition module, respectively;
s2-2, inputting the hidden data set features, hidden algorithm features and hidden performance features of the given data set obtained in step S2-1 into the clustering algorithm selector to obtain the expected performance of the candidate algorithms, and selecting a suitable clustering algorithm for the given data set according to the expected performance.
Preferably, the workflow of the second nearest meta-feature matching module in step S1-2 includes:
step 1) for each data set i in the meta-prediction set M_pred, compute the Pearson correlation coefficient to measure the similarity SI_ij between the meta-feature vector f_i of data set i and the meta-feature vector f_j of each data set in the meta-training set M_train, as shown in formula (1), and obtain the similarity matrix SI:

$SI_{ij} = \dfrac{E[(f_i - \mu_{f_i})(f_j - \mu_{f_j})]}{\sigma_{f_i}\,\sigma_{f_j}}$    (1)

where f_i denotes the meta-feature vector of a data set in the meta-prediction set M_pred, f_j denotes the meta-feature vector of a data set in the meta-training set M_train, $\mu_{f_i}$ and $\mu_{f_j}$ denote the means of the vectors f_i and f_j, $\sigma_{f_i}$ and $\sigma_{f_j}$ denote their standard deviations, and E(·) denotes expectation;
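Formula (1) can be sketched with NumPy as follows; the shapes and data are illustrative assumptions (4 query data sets, 8 training data sets, 10 meta-features each), and the function name is hypothetical.

```python
import numpy as np

def pearson_similarity(F_pred, F_train):
    """SI[i, j] = Pearson correlation of meta-feature vectors f_i and f_j."""
    # Centre each meta-feature vector (row) by its own mean.
    A = F_pred - F_pred.mean(axis=1, keepdims=True)
    B = F_train - F_train.mean(axis=1, keepdims=True)
    # E[(f_i - mu_i)(f_j - mu_j)] computed as a mean of products,
    # divided by the product of the standard deviations.
    cov = A @ B.T / A.shape[1]
    sigma = np.outer(A.std(axis=1), B.std(axis=1))
    return cov / sigma

rng = np.random.default_rng(0)
F_pred = rng.random((4, 10))    # meta-features of 4 query data sets
F_train = rng.random((8, 10))   # meta-features of 8 training data sets
SI = pearson_similarity(F_pred, F_train)
```

Each row of `SI` then ranks the training data sets by similarity to one query data set, which is exactly what step 2) consumes.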
step 2) based on the similarity matrix SI, obtain the performance vectors topk_i of the k data sets in the meta-training set M_train most similar to data set i, and aggregate these k performance vectors to obtain the normalized predicted performance $\hat{p}_{ij}$ of algorithm a_j ∈ A on data set i, as shown in formula (2), yielding the predicted performance matrix $\hat{P}_i$ of data set i:

$\hat{p}_{ij} = \dfrac{1}{k}\sum_{t=1}^{k} \dfrac{topk_{i,j,t}}{\overline{topk}_{i,t}}$    (2)

where i denotes a data set in the meta-prediction set M_pred, j denotes an algorithm in the algorithm space A, $\hat{p}_{ij}$ denotes the predicted performance of algorithm a_j ∈ A on data set i, $topk_{i,j,t}$ denotes the performance of the t-th most similar data set on algorithm j, $\overline{topk}_{i,t}$ denotes the performance average of the t-th data set over all algorithms, and 1 ≤ t ≤ k;

step 3) according to the predicted performance matrix $\hat{P}_i$, select the optimal algorithm, generate the algorithm binary coding vector of data set i, and construct the hidden data set feature matrix.
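Steps 2) and 3) can be sketched as follows, under the assumption (consistent with formula (2)) that each neighbour's performance row is normalized by its own mean over all algorithms before averaging over the k neighbours; all data and the function name are illustrative.

```python
import numpy as np

def predict_and_encode(SI_row, P_train, k=3):
    """Predict performance for one query data set from its k nearest
    neighbours in meta-feature space, then one-hot encode the winner."""
    topk = np.argsort(SI_row)[::-1][:k]           # k most similar data sets
    perf = P_train[topk]                          # (k, n_algos) performance rows
    # Formula (2): normalize each neighbour's row by its mean over all
    # algorithms, then average over the k neighbours.
    p_hat = (perf / perf.mean(axis=1, keepdims=True)).mean(axis=0)
    code = np.zeros_like(p_hat)
    code[np.argmax(p_hat)] = 1.0                  # algorithm binary coding vector
    return p_hat, code

rng = np.random.default_rng(1)
SI_row = rng.random(8)                 # similarity to 8 training data sets
P_train = rng.random((8, 6)) + 0.1     # performance of 6 candidate algorithms
p_hat, code = predict_and_encode(SI_row, P_train)
```

Stacking the `code` vectors over all query data sets gives the hidden data set feature matrix described in step 3).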
Preferably, the workflow of the second nearest performance feature matching module in step S1-2 includes:
step 1) take the meta-training set M_train as the historical performance matrix and calculate the normalized performance mean $\bar{p}_j$ of each algorithm a_j ∈ A over all data sets in the meta-training set M_train, as shown in formula (3):

$\bar{p}_j = \dfrac{1}{m}\sum_{i=1}^{m} \dfrac{p_{ij}}{\bar{p}_i}$    (3)

where m denotes the number of data sets in the meta-training set M_train, $p_{ij}$ denotes the performance of algorithm a_j on data set d_i in M_train, and $\bar{p}_i$ denotes the performance average of all algorithms on data set d_i;
step 2) according to the normalized performance mean $\bar{p}_j$, select the h algorithms with the best overall performance as the preprocessing algorithms;

step 3) for each data set d_i in the meta-prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) calculate the Euclidean distance between the performance vector p_i and the performance vectors p_j of the corresponding algorithms in the meta-training set M_train to measure the similarity of p_i and p_j, as shown in formula (4):

$d(p_i, p_j) = \sqrt{\sum_{l} (p_{i,l} - p_{j,l})^2}$    (4)

where p_i denotes a performance vector in the meta-prediction set M_pred and p_j denotes a performance vector in the meta-training set M_train;
step 5) sort the similarities obtained in step 4) to obtain the performance vectors topk_i of the k most similar data sets, aggregate topk_i using formula (2) to obtain the predicted performance $\hat{p}_{ij}$, select the algorithm with the best predicted performance, generate the algorithm binary coding vector of data set i, and construct the hidden algorithm feature matrix.
Preferably, the workflow of the second performance matrix decomposition module in step S1-2 includes:
step 1) take the meta-training set M_train as the historical performance matrix and calculate the normalized performance mean $\bar{p}_j$ of each algorithm a_j ∈ A over all data sets in M_train, as shown in formula (3);

step 2) according to the normalized performance mean $\bar{p}_j$, select the h algorithms with the best overall performance as the preprocessing algorithms;

step 3) for each data set d_i in the meta-prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) decompose the performance matrix P into three matrices U, Σ and V^T using singular value decomposition (SVD), obtaining the performance matrix model shown in formula (5); the matrix P has m rows and n columns, meaning there are m data sets in the data set space D and n clustering algorithms in the candidate algorithm space:

$P = U \Sigma V^T$    (5)

step 5) use the matrix model obtained in step 4) to complete the performance vector p_i with missing values, obtaining the predicted performance of all algorithms on the meta-prediction set M_pred; select the algorithm with the optimal predicted performance for each data set according to the predicted performance, generate the algorithm binary coding vector of the data set, and construct the hidden performance feature matrix.
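A toy sketch of the matrix-decomposition branch under stated assumptions: the patent does not specify how missing entries are handled during decomposition, so here they are mean-imputed, a truncated SVD of formula (5) reconstructs a low-rank model, and the missing entries are read back as predictions. All data, the rank, and the function name are illustrative.

```python
import numpy as np

def svd_complete(P, rank=2):
    """Fill missing entries (NaN) of the performance matrix via
    truncated SVD: P ~= U_r S_r V_r^T (formula (5))."""
    mask = np.isnan(P)
    # Mean-impute each algorithm's column before decomposing.
    filled = np.where(mask, np.nanmean(P, axis=0, keepdims=True), P)
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Keep observed entries, take the low-rank estimate where missing.
    return np.where(mask, approx, P)

rng = np.random.default_rng(4)
P = rng.random((8, 6))                 # 8 data sets x 6 algorithms
P[2, 4] = np.nan                       # an unevaluated (data set, algorithm) pair
completed = svd_complete(P)
best = int(np.argmax(completed[2]))    # optimal algorithm for data set 2
```

Observed entries pass through unchanged; only the missing cells receive the low-rank prediction.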
The invention discloses the following technical effects:
1. the invention provides an automatic clustering algorithm selection system and method for scientific data based on multi-scale feature fusion, which comprehensively considers the meta-features and performance features of the data, learns the hidden features of the data, and uses the hidden features to select an applicable clustering algorithm for the data, selecting the most appropriate algorithm for a given data set more comprehensively.
2. A random forest multi-output regression model based on multi-scale features is established, with an independent model for each candidate algorithm; therefore, when candidate algorithms are added or deleted, the existing models need not be retrained and only a model for the newly added candidate algorithm needs to be built. In the performance matrix evaluation, a time constraint is applied: an algorithm whose evaluation time exceeds a preset threshold stops being evaluated, which saves computing resources.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a clustering algorithm selector construction of the present invention;
FIG. 2 is a flow chart of the present invention for selecting a clustering algorithm for a given data set;
FIG. 3 is a result of algorithm selection for an Arrhythmia dataset according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an algorithm selection result of a Dermatology dataset according to an embodiment of the present invention;
FIG. 5 shows the algorithm selection result of the Vertebral Column dataset according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention; obviously, the described embodiments are only some embodiments of the present invention, rather than all embodiments.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, a more detailed description is provided below in conjunction with the accompanying drawings and the detailed description.
Referring to fig. 1-2, the present embodiment provides an automatic clustering algorithm selection system applied to the biomedical field, including:
the data preprocessing module and the clustering algorithm selector building module:
the data preprocessing module is used for preprocessing a given data set and comprises a metadata extraction module and a first single-scale selection module. The metadata extraction module is used for extracting the meta-features of the given data set and evaluating the performance of the preprocessing algorithms on the given data set; it comprises a first meta-feature calculation module, used for extracting the meta-features of the given data set, and a first performance evaluation module, used for evaluating the performance of the preprocessing algorithms on the given data set. The first single-scale selection module is used for extracting the single-scale features of the given data set and comprises a first nearest meta-feature matching module, used for extracting the hidden data set features of the given data set, a first nearest performance feature matching module, used for extracting the hidden algorithm features of the given data set, and a first performance matrix decomposition module, used for extracting the hidden performance features of the given data set;
the clustering algorithm selector construction module comprises a metadata set construction module, a second single-scale selection module and a multi-scale fusion module. The metadata set construction module is used for extracting the meta-features of the data sets and the performance of the candidate algorithm set to construct the meta-data set; it comprises a second meta-feature calculation module, used for extracting the meta-features of the data sets in the data set space and constructing a meta-feature matrix, and a second performance evaluation module, used for evaluating the performance of the algorithms in the candidate algorithm set and constructing a performance matrix. The second single-scale selection module is used for extracting the single-scale features of the meta-data set and comprises a second nearest meta-feature matching module, used for extracting the hidden data set features of the meta-data set and constructing a hidden data set feature matrix, a second nearest performance feature matching module, used for extracting the hidden algorithm features of the meta-data set and constructing a hidden algorithm feature matrix, and a second performance matrix decomposition module, used for extracting the hidden performance features of the meta-data set and constructing a hidden performance feature matrix. The multi-scale fusion module is used for constructing a Stacking set from the single-scale features of the meta-data set extracted by the second single-scale selection module and training a random forest multi-output regression model on the Stacking set to obtain the clustering algorithm selector.
The embodiment provides an automatic clustering algorithm selection method applied to the biomedical field, which comprises the following steps:
s1, constructing a clustering algorithm selector, comprising the following steps:
s1-1, extracting meta-features of the data set and performance of the candidate algorithm set through a meta-data set construction module, and constructing a meta-data set;
in the embodiment, 198 multi-field scientific data sets from OpenML are selected as the data set space D, and 6 clustering algorithms from scikit-learn are selected as the algorithm space A, namely KMeans, MeanShift, DBSCAN, AffinityPropagation, Birch and AgglomerativeClustering. 132 of the 198 data sets are randomly selected and input to the metadata set construction module to construct the meta-data set; the remaining data sets are used to test the generalization performance of the method. The second meta-feature calculation module extracts descriptive features (i.e. meta-features) of the data sets in the data set space D, such as sample count, skewness and signal-to-noise ratio, based on statistics, information theory, landmarkers and other methods, and constructs the meta-feature matrix F; the extracted meta-features are shown in Table 1. Considering that the performance of a clustering algorithm depends to a great extent on its hyper-parameter configuration, the second performance evaluation module uses Bayesian optimization, with the silhouette coefficient as the clustering performance metric, to search for the optimal performance p_ij of each algorithm a_j ∈ A on each data set d_i ∈ D, reducing the influence of the hyper-parameters on the algorithm selection result, and constructs the performance matrix P accordingly. When evaluating the optimal performance of an algorithm, a time constraint is applied: if the time exceeds a preset threshold, the clustering algorithm is considered unsuitable for the data set, evaluation stops, and the lowest silhouette coefficient value "-1" is taken as the algorithm performance. The meta-feature matrix F and the performance matrix P are combined into the meta-data set M; 66 of the 132 entries are selected to construct the meta-training set M_train, and the remaining 66 construct the meta-prediction set M_pred.
TABLE 1
[Table 1: the meta-features extracted by the meta-feature calculation module; table image not reproduced]
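The performance-evaluation step can be sketched as follows; this is a simplified stand-in under stated assumptions: a small random hyper-parameter search replaces Bayesian optimization, a wall-clock budget implements the time constraint, KMeans stands in for any candidate algorithm, and all names and parameters are illustrative.

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def evaluate_algorithm(X, n_trials=10, time_budget=5.0, seed=0):
    """Best silhouette coefficient found within the time budget.

    Returns -1 (the lowest silhouette value) if no trial completes,
    mirroring the patent's handling of algorithms that exceed the
    evaluation-time threshold.
    """
    rng = np.random.default_rng(seed)
    best, start = -1.0, time.monotonic()
    for _ in range(n_trials):
        if time.monotonic() - start > time_budget:
            break                                   # time constraint: stop early
        k = int(rng.integers(2, 9))                 # sampled hyper-parameter
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X)
        best = max(best, silhouette_score(X, labels))
    return best

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
p = evaluate_algorithm(X)
```

Repeating this over every (data set, algorithm) pair fills one cell p_ij of the performance matrix P at a time.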
S1-2, extracting the single-scale features of the meta-data set through the second single-scale selection module: inputting the meta-training set M_train and the meta-prediction set M_pred constructed in step S1-1 into the second nearest meta-feature matching module, the second nearest performance feature matching module and the second performance matrix decomposition module, respectively, to construct the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta-prediction set M_pred;
s1-2-1, constructing the hidden data set feature matrix of the meta-prediction set M_pred through the second nearest meta-feature matching module;
in order to reduce the random error caused by Bayesian optimization and improve the robustness of the method, the k most similar data sets are selected, and the normalized performance of the algorithms in the algorithm space on these k data sets is taken as the predicted performance of the data set to construct the hidden data set feature vector; k = 3 in this embodiment. The steps are as follows:
s1-2-1-1, for each data set i in the meta-prediction set M_pred, compute the Pearson correlation coefficient to measure the similarity SI_ij between the meta-feature vector f_i of data set i and the meta-feature vector f_j of each data set in the meta-training set M_train, as shown in formula (1), and obtain the similarity matrix SI:

$SI_{ij} = \dfrac{E[(f_i - \mu_{f_i})(f_j - \mu_{f_j})]}{\sigma_{f_i}\,\sigma_{f_j}}$    (1)

where f_i denotes the meta-feature vector of a data set in the meta-prediction set M_pred, f_j denotes the meta-feature vector of a data set in the meta-training set M_train, $\mu_{f_i}$ and $\mu_{f_j}$ denote the means of the vectors f_i and f_j, $\sigma_{f_i}$ and $\sigma_{f_j}$ denote their standard deviations, and E(·) denotes expectation;
s1-2-1-2, obtaining a meta training set M based on the similarity matrix SItrainThe corresponding performance vectors topk of the k data sets most similar to the data set iiAnd the performance vectors topk corresponding to the k data setsiPerforming grouping to obtain algorithm aje.A in dataset
Figure BDA00022204527200001114
Tropism of Shang Gui HuaPredicting performanceAs shown in equation (2), and obtain the predicted performance matrix of the data set i
Figure BDA0002220452720000116
Figure BDA0002220452720000117
Wherein i represents a meta prediction set MpredJ denotes the algorithm in the algorithm space a,
Figure BDA0002220452720000118
representation algorithm aje.A in dataset
Figure BDA0002220452720000119
Above classification predicted Performance, topki,j,tRepresents topkiThe performance average over all algorithms for the t-th dataset,
Figure BDA00022204527200001110
represents topkiT is more than or equal to 1 and less than or equal to k, and k is 3;
S1-2-1-3: according to the predicted performance matrix \hat{P}_i, select the optimal algorithms, generate the binary algorithm coding vector of data set i, and construct the hidden data set feature matrix.
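The aggregation in steps S1-2-1-2 and S1-2-1-3 can be sketched as follows, under the assumption (suggested by the symbol descriptions around equation (2)) that each similar data set's performance row is normalized by its mean over all algorithms before the k rows are averaged; all names and numbers are illustrative.

```python
import numpy as np

def predicted_performance(topk):
    """topk: (k, n_algorithms) performance rows of the k most similar data sets.
    Each row is normalized by that data set's mean over all algorithms, then
    the k normalized rows are averaged per algorithm (equation (2))."""
    row_means = topk.mean(axis=1, keepdims=True)
    return (topk / row_means).mean(axis=0)

# Performance rows of the k = 2 most similar data sets (toy values).
topk_i = np.array([[0.6, 0.2, 0.4],
                   [0.9, 0.3, 0.6]])
p_hat = predicted_performance(topk_i)
best = int(np.argmax(p_hat))   # step S1-2-1-3: pick the best-predicted algorithm
```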
S1-2-2: construct the hidden algorithm feature matrix of the meta prediction set M_pred through the second nearest performance feature matching module;
The second nearest performance feature matching module takes algorithm performance as a descriptive feature of the data sets and considers the similarity of data sets in the algorithm performance space: an applicable algorithm is selected according to the algorithm performance of the data set closest to the given data set in that space. However, for a new data set the algorithm performance is initially empty, which turns the problem into a cold-start problem. Therefore, considering the historical performance of the algorithms, the algorithms with the best comprehensive performance are selected for pre-evaluation, and algorithm selection is then carried out according to the evaluated performance. As with nearest meta-feature matching, nearest performance feature matching also considers the k closest data sets. The steps are as follows:
S1-2-2-1: take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation (3):

\bar{p}_j = (1/m) \sum_{i=1}^{m} p_{ij} / \bar{p}_i    (3)

where m denotes the number of data sets in the meta training set M_train (m = 66 in this embodiment); p_{ij} denotes the performance of algorithm a_j on data set d_i in M_train, and \bar{p}_i denotes the performance average of all algorithms on data set d_i in M_train;
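A minimal sketch of the normalization in equation (3), assuming each performance entry is divided by its data set's mean over all algorithms before averaging down the columns; the matrix values are illustrative.

```python
import numpy as np

# Rows: data sets d_i in the meta training set; columns: algorithms a_j.
P_train = np.array([[0.6, 0.2, 0.4],
                    [0.9, 0.3, 0.6],
                    [0.5, 0.5, 0.5]])
dataset_means = P_train.mean(axis=1, keepdims=True)        # mean over all algorithms on d_i
normalized_mean = (P_train / dataset_means).mean(axis=0)   # equation (3), one value per algorithm
ranking = np.argsort(-normalized_mean)                     # best comprehensive performance first
```

The first entries of `ranking` are the algorithms that would be chosen as preprocessing algorithms in the next step.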
S1-2-2-2: according to the normalized performance mean \bar{p}_j, select the 2 algorithms with the best comprehensive performance, MeanShift and AgglomerativeClustering, as the preprocessing algorithms;
S1-2-2-3: for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
S1-2-2-4: calculate the Euclidean distance between the performance vector p_i and the performance vector p_j of the corresponding algorithms in the meta training set M_train to measure the similarity of p_i and p_j, as shown in equation (4):

d(p_i, p_j) = sqrt( \sum_l (p_{i,l} - p_{j,l})^2 )    (4)

where p_i denotes a performance vector of the meta prediction set M_pred and p_j denotes a performance vector of the meta training set M_train;
S1-2-2-5: sort the similarities obtained in step S1-2-2-4 to obtain the performance vectors topk_i of the k data sets with the highest similarity, aggregate topk_i using equation (2) to obtain the predicted performance \hat{P}_{i,j}, select the algorithms with the best predicted performance, generate the binary algorithm coding vector of data set i, and construct the hidden algorithm feature matrix.
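Steps S1-2-2-3 through S1-2-2-5 amount to a nearest-neighbour search in performance space. A sketch under assumed names and toy values (in the method, the vectors come from the two preprocessing algorithms):

```python
import numpy as np

def k_nearest_by_performance(p_i, P_train_pre, k=3):
    """Return indices of the k meta-training data sets whose performance vectors
    on the preprocessing algorithms are closest to p_i in Euclidean distance
    (equation (4))."""
    dists = np.linalg.norm(P_train_pre - p_i, axis=1)
    return np.argsort(dists)[:k]

p_i = np.array([0.5, 0.4])                 # new data set, 2 preprocessing algorithms
P_train_pre = np.array([[0.9, 0.1],
                        [0.55, 0.35],
                        [0.2, 0.8],
                        [0.5, 0.45]])
nearest = k_nearest_by_performance(p_i, P_train_pre, k=2)
```

The returned rows would then be aggregated with equation (2), exactly as in nearest meta-feature matching.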
S1-2-3: construct the hidden performance feature matrix of the meta prediction set M_pred through the second performance matrix decomposition module, which specifically comprises the following steps:
S1-2-3-1: take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation (3);
S1-2-3-2: according to the normalized performance mean, select the 2 algorithms with the best comprehensive performance, MeanShift and AgglomerativeClustering, as the preprocessing algorithms;
S1-2-3-3: for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
S1-2-3-4: decompose the performance matrix P into three matrices U, Σ and V^T using singular value decomposition (SVD) to obtain a performance matrix model, as shown in equation (5); the matrix P has m rows and n columns, indicating that there are m data sets in the data set space D and n clustering algorithms in the candidate algorithm space:

P = U Σ V^T    (5)
S1-2-3-5: use the matrix model obtained in step S1-2-3-4 to complete the performance vector p_i with missing values, obtaining the predicted performance of all algorithms on the meta prediction set M_pred; select the algorithms with the best performance for each data set according to the predicted performance, generate the binary algorithm coding vector of the data set, and construct the hidden performance feature matrix.
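One possible reading of steps S1-2-3-4 and S1-2-3-5 is sketched below: build a low-rank SVD model of the historical performance matrix (equation (5)), then complete a partially observed performance vector by least-squares projection of its observed entries onto the top-r right singular vectors. The rank r, the completion scheme and all values are assumptions for illustration.

```python
import numpy as np

# Historical performance matrix (rank 2 by construction); rows: data sets, cols: algorithms.
P = np.array([[0.6, 0.2, 0.4, 0.3],
              [0.9, 0.3, 0.6, 0.45],
              [0.4, 0.6, 0.5, 0.55],
              [0.2, 0.8, 0.5, 0.65]])
U, s, Vt = np.linalg.svd(P, full_matrices=False)   # equation (5): P = U Σ V^T
r = 2
V_r = Vt[:r].T                                     # top-r right singular vectors

def complete(p_partial, observed):
    """Fit the observed entries (the preprocessing algorithms) to the rank-r
    row space by least squares, then predict the full performance row."""
    coef, *_ = np.linalg.lstsq(V_r[observed], p_partial, rcond=None)
    return V_r @ coef

p_full = complete(np.array([0.9, 0.3]), observed=[0, 1])
best = int(np.argmax(p_full))                      # best-predicted algorithm
```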
S1-3: the multi-scale fusion module constructs a Stacking set S from the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta prediction set M_pred extracted in step S1-2, and trains a random forest multi-output regression model on the Stacking set S to obtain the clustering algorithm selector.
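A sketch of this fusion step using scikit-learn, with randomly generated stand-ins for the three single-scale feature matrices and the evaluated performances; the shapes and variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_datasets, n_algorithms = 40, 6
H_data = rng.random((n_datasets, n_algorithms))   # hidden data set features
H_algo = rng.random((n_datasets, n_algorithms))   # hidden algorithm features
H_perf = rng.random((n_datasets, n_algorithms))   # hidden performance features
S = np.hstack([H_data, H_algo, H_perf])           # Stacking set S
Y = rng.random((n_datasets, n_algorithms))        # evaluated performance per algorithm

# RandomForestRegressor supports multi-output regression natively, so one
# model predicts the expected performance of every candidate algorithm at once.
selector = RandomForestRegressor(n_estimators=50, random_state=0).fit(S, Y)
expected = selector.predict(S[:1])                # expected performance vector
```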
S2: select a clustering algorithm for a given data set. In this embodiment three scientific data sets from the biomedical field are given: the Arrhythmia data set for distinguishing the presence or absence of cardiac arrhythmia, the Dermatology data set for determining the type of erythemato-squamous disease, and the Vertebral-Column data set for determining bone type according to biomechanical features. Optimal algorithm selection is performed on them to test the clustering algorithm selector constructed in step S1, which specifically comprises the following steps:
S2-1: preprocess the three given data sets with the data preprocessing module, namely input the given data sets into the metadata extraction module; the first meta-feature calculation module extracts the meta-feature vectors of the given data sets, such as sample number, skewness and signal-to-noise ratio, based on statistics, information theory, landmarking and other methods; the first performance evaluation module evaluates, using Bayesian optimization, the performance of the preprocessing algorithms MeanShift and AgglomerativeClustering, and calculates the performance vectors of the two algorithms on the given data sets;
input the meta-features obtained by the first meta-feature calculation module and the algorithm performance obtained by the first performance evaluation module into the first single-scale selection module: the first nearest meta-feature matching module calculates the hidden data set features of the given data set based on its meta-feature vector, and the first nearest performance feature matching module and the first performance matrix decomposition module calculate the hidden algorithm features and hidden performance features of the given data set, respectively, based on the performance vector of the preprocessing algorithms;
s2-2, inputting the hidden data set characteristic, the hidden algorithm characteristic and the hidden performance characteristic of the given data set obtained in the step S2-1 into a clustering algorithm selector to obtain the expected performance of the candidate algorithm, and selecting a proper clustering algorithm for the given data set according to the expected performance.
The experimental results are as follows:
As shown in Fig. 3, the result of algorithm selection among the above 6 clustering algorithms for the Arrhythmia data set is given, where SIL denotes the silhouette coefficient of the clustering; the higher the silhouette coefficient, the better the clustering effect. The experimental results show that the silhouette coefficient of the Agglomerative algorithm is 0.59, better than the other clustering algorithms (KMeans: 0.13, AffinityPropagation: 0.05, MeanShift: 0.49, DBSCAN: 0.33, Birch: 0.11). The algorithms with better expected performance selected by the present application are MeanShift and Agglomerative, which include the Agglomerative algorithm with the best actual performance; the actual performance of the selected MeanShift algorithm is second only to the Agglomerative algorithm.
As shown in Fig. 4, the result of algorithm selection among the above 6 clustering algorithms for the Dermatology data set is given. The experimental results show that KMeans and MeanShift perform best, both with a silhouette coefficient of 0.51, better than the other clustering algorithms (Agglomerative: 0.49, AffinityPropagation: 0.43, DBSCAN: 0.34, Birch: 0.38). The algorithms with better expected performance selected by the present application are Agglomerative and MeanShift, which include the best-performing MeanShift algorithm and the Agglomerative algorithm ranked in the top three of actual performance among the candidate clustering algorithms; the performance difference between KMeans and MeanShift is small.
As shown in Fig. 5, the result of algorithm selection among the above 6 clustering algorithms for the Vertebral-Column data set is given. The experimental results show that MeanShift, Agglomerative and DBSCAN are the best-performing algorithms, all with a silhouette coefficient of 0.86. The algorithms selected by the present application are KMeans and DBSCAN; although the selected KMeans algorithm performs poorly, with a silhouette coefficient of only 0.45, the optimal DBSCAN algorithm is included in the selection.
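The silhouette-based comparison in Figs. 3-5 can be reproduced in miniature with scikit-learn; the toy blobs below stand in for the biomedical data sets, which are not included here, so the scores differ from those reported.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
candidates = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "Birch": Birch(n_clusters=3),
}
# SIL: silhouette coefficient of each clustering; higher means better separation.
sil = {name: silhouette_score(X, algo.fit_predict(X)) for name, algo in candidates.items()}
best = max(sil, key=sil.get)
```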
In conclusion, the clustering algorithm selector constructed by the present application can accurately select the optimal algorithm for the three given data sets from the biomedical field. In addition, when selecting clustering algorithms for the 66 data sets used for testing, the constructed selector never chose the clustering algorithm with the worst actual performance on any data set; at the same time, the algorithms selected for 100% of the data sets include a clustering algorithm ranked in the top 3 of actual performance, those selected for 84.8% of the data sets include one ranked in the top 2, and those selected for 77.3% of the data sets include the clustering algorithm with the best actual performance, effectively reducing the algorithm selection space.
The method provided by the invention can effectively reduce the algorithm selection space while meeting the performance requirements of the data set, thereby reducing the time and cost of applying machine learning to scientific discovery and research in the biomedical field. The clustering algorithm selection method can also be applied across multiple scientific disciplines, improving the usability of machine learning.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (7)

1. An automatic clustering algorithm selection system for scientific data analysis, characterized in that the system comprises a data preprocessing module and a clustering algorithm selector construction module;
the data preprocessing module is used for preprocessing a given data set and comprises a metadata extraction module and a first single-scale selection module, wherein the metadata extraction module is used for extracting the meta-features of the given data set and evaluating the performance of a preprocessing algorithm on the given data set;
the clustering algorithm selector constructing module is used for constructing a clustering algorithm selector, and comprises: the metadata set construction module is used for extracting the metadata features of the data set and the performance of the candidate algorithm set to construct a metadata set; the second single-scale selection module is configured to extract single-scale features of the metadata set, where the single-scale features include: a hidden data set feature, a hidden algorithm feature and a hidden performance feature; the multi-scale fusion module is used for constructing a Stacking set by using the single-scale features of the metadata set extracted by the second single-scale selection module, and training a random forest multi-output regression model based on the Stacking set to obtain the clustering algorithm selector.
2. The automatic clustering algorithm selection system for scientific data analysis according to claim 1, wherein the metadata extraction module comprises a first meta-feature calculation module and a first performance evaluation module; the first meta-feature calculation module is used for extracting meta-features of a given data set, and the first performance evaluation module is used for evaluating the performance of the preprocessing algorithm on the given data set; the metadata set construction module comprises a second meta-feature calculation module and a second performance evaluation module; the second meta-feature calculation module is used for extracting meta-features of the data sets in the data set space and constructing a meta-feature matrix, and the second performance evaluation module is used for evaluating the performance of the algorithms in the candidate algorithm set and constructing a performance matrix.
3. The scientific data analysis-oriented automatic clustering algorithm selection system as claimed in claim 1, wherein the first single-scale selection module comprises a first nearest meta-feature matching module, a first nearest performance feature matching module and a first performance matrix decomposition module; the first nearest meta-feature matching module is used for extracting the hidden data set features of the given data set, the first nearest performance feature matching module is used for extracting the hidden algorithm features of the given data set, and the first performance matrix decomposition module is used for extracting the hidden performance features of the given data set;
the second single-scale selection module comprises a second nearest meta-feature matching module, a second nearest performance feature matching module and a second performance matrix decomposition module, wherein the second nearest meta-feature matching module is used for extracting the hidden data set features of the metadata set and constructing a hidden data set feature matrix; the second nearest performance feature matching module is used for extracting the hidden algorithm features of the metadata set and constructing a hidden algorithm feature matrix; and the second performance matrix decomposition module is used for extracting the hidden performance features of the metadata set and constructing a hidden performance feature matrix.
4. An automatic clustering algorithm selection method for scientific data analysis, characterized in that the method comprises the following steps:
s1, constructing a clustering algorithm selector, comprising the following steps:
s1-1, extracting meta-features of the data set and performance of the candidate algorithm set through a meta-data set construction module, and constructing a meta-data set;
selecting multi-field scientific data sets for clustering as the data set space D and a candidate clustering algorithm set as the algorithm space A; extracting the meta-features of the data sets in the data set space D through the second meta-feature calculation module, and constructing the meta-feature matrix F; the second performance evaluation module evaluates, using Bayesian optimization, the optimal performance p_{ij} of each algorithm a_j ∈ A on each data set d_i ∈ D, and constructs the performance matrix P; the performance evaluation of an algorithm is time-constrained: if the evaluation time exceeds a preset threshold, the evaluation is stopped and the lowest value of the clustering performance metric is taken as the algorithm performance; the meta-feature matrix F and the performance matrix P are combined to construct the metadata set M, which is divided into a meta training set M_train and a meta prediction set M_pred;
S1-2: extract the single-scale features of the metadata set through the second single-scale selection module: input the meta training set M_train and the meta prediction set M_pred constructed in step S1-1 into the second nearest meta-feature matching module, the second nearest performance feature matching module and the second performance matrix decomposition module respectively, and construct the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta prediction set M_pred;
S1-3: the multi-scale fusion module constructs a Stacking set S from the hidden data set feature matrix, hidden algorithm feature matrix and hidden performance feature matrix of the meta prediction set M_pred constructed in step S1-2, and trains a random forest multi-output regression model on the Stacking set S to obtain the clustering algorithm selector;
s2, selecting a clustering algorithm for the given data set, comprising the following steps:
S2-1: preprocess the given data set with the data preprocessing module, namely input the given data set into the metadata extraction module; the first meta-feature calculation module calculates the meta-features of the given data set, and the first performance evaluation module evaluates the performance of the preprocessing algorithm on the given data set;
input the meta-features obtained by the first meta-feature calculation module and the algorithm performance obtained by the first performance evaluation module into the first single-scale selection module, and extract the hidden data set features, hidden algorithm features and hidden performance features of the given data set through the first nearest meta-feature matching module, the first nearest performance feature matching module and the first performance matrix decomposition module respectively;
S2-2: input the hidden data set features, hidden algorithm features and hidden performance features of the given data set obtained in step S2-1 into the clustering algorithm selector to obtain the expected performance of the candidate algorithms, and select a suitable clustering algorithm for the given data set according to the expected performance.
5. The scientific data analysis-oriented automatic clustering algorithm selection method according to claim 4, characterized in that: the second nearest meta feature matching module workflow in the step S1-2 includes:
step 1) for each data set i in the meta prediction set M_pred, calculate the Pearson correlation coefficient to measure the similarity SI_{i,j} between the meta-feature vector f_i of data set i and the meta-feature vector f_j of each data set j in the meta training set M_train, as shown in equation 1, and obtain the similarity matrix SI:

SI_{i,j} = E[(f_i - \bar{f}_i)(f_j - \bar{f}_j)] / (\sigma_{f_i} \sigma_{f_j})    (1)

where f_i denotes the meta-feature vector of a data set in the meta prediction set M_pred, f_j denotes the meta-feature vector of a data set in the meta training set M_train, \bar{f}_i and \bar{f}_j denote the means of the vectors f_i and f_j, \sigma_{f_i} and \sigma_{f_j} denote their standard deviations, and E(·) denotes expectation;
step 2) based on the similarity matrix SI, obtain the performance vectors topk_i of the k data sets in the meta training set M_train most similar to data set i, and aggregate topk_i to obtain the normalized predicted performance \hat{P}_{i,j} of each algorithm a_j ∈ A on data set i, as shown in equation 2, and obtain the predicted performance matrix \hat{P}_i of data set i:

\hat{P}_{i,j} = (1/k) \sum_{t=1}^{k} topk_{i,j,t} / \overline{topk}_{i,t}    (2)

where i denotes a data set in the meta prediction set M_pred, j denotes an algorithm in the algorithm space A, \hat{P}_{i,j} denotes the predicted performance of algorithm a_j ∈ A on data set i, topk_{i,j,t} denotes the performance of the t-th data set in topk_i on algorithm j, \overline{topk}_{i,t} denotes the performance average of the t-th data set in topk_i over all algorithms, and 1 ≤ t ≤ k;
step 3) according to the predicted performance matrix \hat{P}_i, select the optimal algorithms, generate the binary algorithm coding vector of data set i, and construct the hidden data set feature matrix.
6. The scientific data analysis-oriented automatic clustering algorithm selection method according to claim 4, characterized in that: the second closest performance characteristic matching module workflow in the step S1-2 includes:
step 1) take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation 3:

\bar{p}_j = (1/m) \sum_{i=1}^{m} p_{ij} / \bar{p}_i    (3)

where m denotes the number of data sets in the meta training set M_train, p_{ij} denotes the performance of algorithm a_j on data set d_i in M_train, and \bar{p}_i denotes the performance average of all algorithms on data set d_i in M_train;
step 2) according to the normalized performance mean \bar{p}_j, select the h algorithms with the best comprehensive performance as the preprocessing algorithms;
step 3) for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) calculate the Euclidean distance between the performance vector p_i and the performance vector p_j of the corresponding algorithms in the meta training set M_train to measure the similarity of p_i and p_j, as shown in equation 4:

d(p_i, p_j) = sqrt( \sum_l (p_{i,l} - p_{j,l})^2 )    (4)

where p_i denotes a performance vector of the meta prediction set M_pred and p_j denotes a performance vector of the meta training set M_train;
step 5) sort the similarities obtained in step 4) to obtain the performance vectors topk_i of the k data sets with the highest similarity, aggregate topk_i using equation 2 to obtain the predicted performance \hat{P}_{i,j}, select the algorithms with the best predicted performance, generate the binary algorithm coding vector of data set i, and construct the hidden algorithm feature matrix.
7. The scientific data analysis-oriented automatic clustering algorithm selection method according to claim 4, characterized in that: the second performance matrix decomposition module workflow in the step S1-2 includes:
step 1) take the meta training set M_train as the historical performance matrix, and calculate the normalized performance mean \bar{p}_j of each algorithm a_j ∈ A over all data sets in M_train, as shown in equation 3;
step 2) according to the normalized performance mean \bar{p}_j, select the h algorithms with the best comprehensive performance as the preprocessing algorithms;
step 3) for each data set d_i in the meta prediction set M_pred, extract the performance vector p_i corresponding to the preprocessing algorithms from the performance matrix P;
step 4) decompose the performance matrix P into three matrices U, Σ and V^T using singular value decomposition (SVD) to obtain a performance matrix model, as shown in equation 5; the matrix P has m rows and n columns, indicating that there are m data sets in the data set space D and n clustering algorithms in the candidate algorithm space:

P = U Σ V^T    (5)

step 5) use the matrix model obtained in step 4) to complete the performance vector p_i with missing values, obtain the predicted performance of all algorithms on the meta prediction set M_pred, select the algorithm with the best performance for each data set according to the predicted performance, generate the binary algorithm coding vector of the data set, and construct the hidden performance feature matrix.
CN201910931657.2A 2019-09-29 2019-09-29 automatic clustering algorithm selection system and method for scientific data analysis Pending CN110738245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910931657.2A CN110738245A (en) 2019-09-29 2019-09-29 automatic clustering algorithm selection system and method for scientific data analysis


Publications (1)

Publication Number Publication Date
CN110738245A true CN110738245A (en) 2020-01-31




Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723862A (en) * 2020-06-18 2020-09-29 广东电网有限责任公司清远供电局 Switch cabinet state evaluation method and device
CN112733067A (en) * 2020-12-22 2021-04-30 上海机器人产业技术研究院有限公司 Data set selection method for robot target detection algorithm
CN114492572A (en) * 2021-12-21 2022-05-13 成都产品质量检验研究院有限责任公司 Material structure classification method and system based on machine learning clustering algorithm



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200131