CN107103336A - Mixed-attribute data clustering method based on density peaks - Google Patents

Mixed-attribute data clustering method based on density peaks (Download PDF)

Info

Publication number
CN107103336A
CN107103336A CN201710294126.8A CN201710294126A CN107103336A CN 107103336 A CN107103336 A CN 107103336A CN 201710294126 A CN201710294126 A CN 201710294126A CN 107103336 A CN107103336 A CN 107103336A
Authority
CN
China
Prior art keywords
point
data
clustered
mixed attributes
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710294126.8A
Other languages
Chinese (zh)
Inventor
刘世华
叶展翔
周炳忠
张�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou Polytechnic
Original Assignee
Wenzhou Polytechnic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou Polytechnic filed Critical Wenzhou Polytechnic
Priority to CN201710294126.8A priority Critical patent/CN107103336A/en
Publication of CN107103336A publication Critical patent/CN107103336A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a mixed-attribute data clustering method based on density peaks, comprising: obtaining a mixed-attribute data set to be clustered, and calculating the distance between every two data points in the data set as well as the cutoff distance of the data set; obtaining the local density of each data point from the pairwise distances and the cutoff distance, and calculating each point's relative distance; defining the γ parameter curve formed by the local density and relative distance of each data point and obtaining the γ parameter values; building a knee-point index matrix from the sequence number, γ parameter value and relative distance of each data point, and obtaining the cluster center points with a preset knee-point algorithm; and, according to the cluster center points, producing and outputting the clustering result of the mixed-attribute data set to be clustered. Compared with the traditional k-prototypes algorithm, the embodiment of the invention clusters better, runs more efficiently, finds the number of clusters automatically and is insensitive to outliers.

Description

Mixed-attribute data clustering method based on density peaks
Technical field
The present invention relates to the field of computer data mining and processing technology, and more particularly to a mixed-attribute data clustering method based on density peaks.
Background technology
Clustering analysis has long been one of the research hotspots in data mining and machine learning. With the development of the big-data era, all kinds of data emerge endlessly, and most of them contain several attribute types at the same time, such as numerical and categorical attributes, whereas traditional clustering algorithms such as K-Means are designed mainly for numerical attribute data. To handle the clustering of mixed-attribute data, researchers have proposed various solutions, which can be roughly divided by technical route into type-conversion methods, cluster-ensemble methods, prototype-based methods, density-based methods and hierarchy-based methods.
Type-conversion methods convert one kind of attribute into another and then cluster the converted data. For example, the SpectralCAT algorithm proposed by David and Averbuch first converts numerical attributes into categorical attributes and then processes the converted data with a spectral clustering method.
The idea of cluster ensembles is to partition one group of objects with several algorithms and then merge the results of the different algorithms with a consensus function to obtain the final clustering. It was first proposed by A. Strehl and J. Ghosh in 2002 and has since become one of the mainstream approaches to mixed-attribute clustering. Zhao Yu et al. proposed CEMC, a mixed-attribute clustering algorithm based on cluster ensembles, introducing the ensemble methodology into the mixed-attribute clustering problem. He et al. proposed CEBMDC, a mixed-attribute clustering algorithm based on cluster ensembles and the Squeezer algorithm, in which both the clustering of the categorical attribute subset and the final ensemble step are carried out with Squeezer.
The k-prototypes algorithm proposed by Huang in 1997 follows the basic idea of k-means: it combines the cluster center of the numerical attributes with the modes of the categorical attributes to form a new mixed-attribute data center called a prototype, and constructs a prototype-based distance measure and cost function for mixed-attribute data, so that the mixed attributes are clustered directly with a k-means-like procedure. Prototype-based methods are simple in idea and efficient; their key lies in the definition of the distance measure between data tuples. Yiu-ming Cheung et al. proposed a unified similarity metric, which normalizes the distance measure of the numerical part so that the similarity value is constrained to the interval [0,1], assigns a weight to the similarity measure of each categorical attribute and normalizes it, and finally obtains a unified distance measure formula. Based on this formula they proposed an iterative algorithm, OCIL, to cluster mixed-attribute data; by further introducing a competition and penalization mechanism, they improved OCIL and proposed a mixed-attribute clustering algorithm able to determine the number of clusters automatically (PCL-OC). Their experiments comparing OCIL with k-prototypes show a considerable improvement in clustering accuracy, but the computation of the unified metric value is relatively expensive.
Li and Biswas proposed the SBAC (Similarity Based Agglomerative Clustering) algorithm [i], an agglomerative hierarchical clustering algorithm based on the Goodall similarity. The method works fairly well, but its computational complexity is higher than O(n^2 log n).
The RDBC_M algorithm proposed by Huang et al. uses a dimension-oriented distance formula that computes the distance for each dimension independently: Euclidean distance is used for numerical attributes, while for categorical attributes a distance matrix, in which the similarity between the different values of an attribute is defined through expert scoring, is used to measure the per-dimension distance; building this matrix requires manual scoring.
The MDCDen and DC-MDACC algorithms proposed by Chen Jinyin et al. divide mixed-attribute data into three classes, numerically dominated, categorically dominated and balanced mixed-attribute data, and then define a different distance metric function for each class. They require a dominance analysis of the data set first.
The prototype-based methods above still suffer from drawbacks such as the need to specify the number of clusters, sensitivity to the choice of initial cluster centers, the inability to find clusters of arbitrary shape and sensitivity to abnormal points. Hierarchy-based methods have high time and space complexity and an irreversible clustering process. The similarity measure of categorical attributes in the RDBC_M algorithm requires evaluation and assignment by domain experts, and the MDCDen algorithm needs three parameters to be tuned to obtain good results.
In 2014, Alex Rodriguez and Alessandro Laio published in the journal Science a clustering algorithm that quickly searches for and finds density peaks (referred to herein as the DPC algorithm). The algorithm clusters well, is efficient, has few parameters, can discover the number of clusters, can cluster data of different shapes and identifies outliers automatically. The input of the DPC algorithm is the distance matrix between data points, so once the distance measurement between the data points of mixed-attribute data is solved, the algorithm can be used directly for clustering analysis; however, no research report on clustering mixed-attribute data with the DPC algorithm has been found so far.
Therefore, a reasonable method for computing the distance between mixed-attribute data points and a clustering method for mixed-attribute data are urgently needed that cluster better than the traditional k-prototypes algorithm, run efficiently, find the number of clusters automatically and are insensitive to outliers.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a mixed-attribute data clustering method based on density peaks that clusters better than the traditional k-prototypes algorithm, runs efficiently, finds the number of clusters automatically and is insensitive to outliers.
In order to solve the above technical problem, an embodiment of the invention provides a mixed-attribute data clustering method based on density peaks, the method comprising:
S1, obtaining a mixed-attribute data set to be clustered, and, according to the mixed-attribute data set to be clustered, calculating the distance between every two data points in the data set as well as the cutoff distance of the data set;
S2, obtaining the local density of each data point in the mixed-attribute data set to be clustered from the calculated pairwise distances and cutoff distance, and further calculating the relative distance of each data point in the data set from the obtained local densities;
S3, defining the γ parameter curve formed by the local density of each data point in the mixed-attribute data set to be clustered and its corresponding relative distance, and determining the γ parameter value of each data point in the data set;
S4, building a knee-point index matrix from the sequence number, γ parameter value and relative distance of each data point in the mixed-attribute data set to be clustered, solving the built knee-point index matrix with a preset knee-point algorithm, and obtaining the cluster center points of the data set;
S5, according to the obtained cluster center points of the mixed-attribute data set to be clustered, producing and outputting the clustering result of the data set; wherein every data point of the data set other than the obtained cluster center points is assigned to the cluster of its nearest neighbor with higher local density, completing the representation and output of the clustering result.
Wherein the distance between every two data points in the mixed-attribute data set to be clustered is computed by the formula D(Xi,Xj) = d(Xi,Xj)^r + d(Xi,Xj)^c, where d(Xi,Xj)^r denotes the distance of the numerical attribute part of the data set and d(Xi,Xj)^c the distance of the categorical attribute part;
wherein d(Xi,Xj)^r is realized by a formula that takes the Euclidean distance between the numerical attributes of data points Xi and Xj after normalization, so that the distance value d(Xi,Xj)^r lies in the interval [0,1];
wherein d(Xi,Xj)^c is realized by an entropy-weighted matching formula, in which the matching distance of data points Xi and Xj on the t-th categorical attribute is weighted by the entropy weight of that attribute, and p(a_ts) is the probability of the s-th value (s = 1, 2, ..., m_t) appearing when the total number of distinct values of the t-th categorical attribute is m_t.
Wherein the γ parameter value of each data point in the mixed-attribute data set to be clustered is obtained by the formula γi = ρi × δi, where γi is the γ parameter value of the i-th data point, ρi is the local density of the i-th data point and δi is the relative distance of the i-th data point.
Wherein the step S4 specifically includes:
determining the sequence number, γ parameter value and relative distance of each data point in the mixed-attribute data set to be clustered, and further forming a sequence-number set, a γ parameter-value set and a relative-distance set, respectively; wherein the sequence-number set is I = [1, 2, ..., n], the γ parameter-value set is γ = [γ1, γ2, ..., γn] and the relative-distance set is δ = [δ1, δ2, ..., δn], n being the total number of data points in the mixed-attribute data set to be clustered and a positive integer;
building, from the formed sequence-number set, γ parameter-value set and relative-distance set, the knee-point index matrix CT = [I; γ; δ]; wherein CT = [I; γ; δ] is the 3×n matrix whose row vectors are the sequence numbers, γ parameter values and relative distances of the data points and whose number of columns is the total number of data points n;
rearranging the knee-point index matrix CT = [I; γ; δ] so that its columns are sorted first by the γ row in descending order and then by the δ row in descending order, obtaining the adjusted knee-point index matrix; computing the second derivative of the γ row of the adjusted matrix and taking the resulting position as the knee point; and retaining, in the adjusted matrix, all column vectors whose column index is less than or equal to the knee point, forming a candidate center set;
judging whether the number of column vectors in the candidate center set is less than or equal to 2;
if so, taking the data points whose sequence numbers appear in the candidate center set as the cluster center points;
if not, further computing the second derivative of the δ row of the candidate center set, taking the resulting position as the secondary knee point, and taking the data points whose sequence numbers appear in the columns of the candidate center set up to and including the secondary knee point as the cluster center points.
Implementing the embodiments of the present invention has the following beneficial effects:
The embodiments of the present invention provide a unified distance measure formula for mixed-attribute data points, build the knee-point index matrix of the mixed-attribute data points on this basis, and then propose an automatic cluster-center determination method based on the knee point; the method therefore clusters better than the traditional k-prototypes algorithm, runs efficiently, finds the number of clusters automatically and is insensitive to outliers.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings obtained from these drawings without creative effort still fall within the scope of the present invention.
Fig. 1 is a flow chart of the mixed-attribute data clustering method based on density peaks provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
As shown in Fig. 1, an embodiment of the present invention proposes a mixed-attribute data clustering method based on density peaks, the method comprising:
Step S101, obtaining a mixed-attribute data set to be clustered, and, according to the data set, calculating the distance between every two data points in the data set as well as the cutoff distance of the mixed-attribute data set to be clustered;
The specific process is as follows. The mixed-attribute data set to be clustered is determined; let S = {X1, X2, ..., Xn} be a d-dimensional mixed-attribute data set of n data points to be clustered, in which the j-th data point is represented as Xj = [xj1, xj2, ..., xjd]. Suppose that in S the numerical attributes occupy dr dimensions and the categorical attributes occupy dc dimensions, so that dr + dc = d; for example, the first dr attributes are numerical attributes and the last dc attributes are categorical attributes. The cutoff distance of the mixed-attribute data set S to be clustered is also determined at this step.
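Once the pairwise distances described in the next paragraphs have been computed, the cutoff distance mentioned here can be chosen. A minimal Python sketch is given below; the percentile heuristic (roughly 2% of all pairwise distances below the cutoff) comes from the original DPC paper rather than from this patent, and the function name and default value are assumptions.

```python
import numpy as np

def cutoff_distance(dist, percent=2.0):
    """Cutoff distance chosen so that roughly `percent` percent of all pairwise
    distances fall below it (the rule of thumb from the original DPC paper).
    `dist` is the symmetric n x n matrix of mixed-attribute distances."""
    n = dist.shape[0]
    pairwise = np.sort(dist[np.triu_indices(n, k=1)])  # each unordered pair once
    pos = max(int(round(len(pairwise) * percent / 100.0)) - 1, 0)
    return pairwise[pos]
```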
A unified distance measure is defined for the mixed-attribute data set S to be clustered, and the distance between every two data points in S is calculated. For any two data points Xi and Xj in S, their distance can be obtained by formula (1):
D(Xi,Xj) = d(Xi,Xj)^r + d(Xi,Xj)^c    (1);
In formula (1), d(Xi,Xj)^r denotes the distance of the numerical attribute part of the mixed-attribute data set to be clustered and d(Xi,Xj)^c the distance of the categorical attribute part;
d(Xi,Xj)^r can be expressed by formula (2):
In formula (2), d(Xi,Xj)^r is the Euclidean distance between the numerical attributes of data points Xi and Xj after normalization; because the Euclidean distance is non-negative and the attributes are normalized, the distance value d(Xi,Xj)^r of the numerical attribute part is guaranteed to lie in the interval [0,1].
d(Xi,Xj)^c can be expressed by formula (3):
Formula (3) uses an entropy-weighted matching approach: the matching distance of data points Xi and Xj on the t-th categorical attribute is weighted by the entropy weight of that attribute, and p(a_ts) is the probability of the s-th value (s = 1, 2, ..., m_t) appearing when the total number of distinct values of the t-th categorical attribute is m_t.
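Because formulas (2) and (3) are referred to above only through their descriptions, the Python sketch below gives one plausible reading of the unified distance D = d^r + d^c. The min-max scaling, the division by the square root of the number of numerical dimensions (to keep d^r inside [0,1]) and the use of the raw Shannon entropy of each categorical attribute as its weight are assumptions of the sketch, not formulas confirmed by the text.

```python
import numpy as np

def entropy_weight(column):
    """Shannon entropy of one categorical column, used here as its weight (assumption)."""
    _, counts = np.unique(column, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mixed_distance(xi_num, xj_num, xi_cat, xj_cat, num_min, num_max, cat_weights):
    """D(Xi, Xj) = d_r + d_c for one pair of mixed-attribute data points.
    num_min / num_max: per-attribute minima and maxima over the whole data set,
    used for min-max scaling of the numerical part.
    cat_weights: one entropy weight per categorical attribute."""
    lo, hi = np.asarray(num_min, float), np.asarray(num_max, float)
    rng = np.where(hi - lo == 0, 1.0, hi - lo)
    diff = (np.asarray(xi_num, float) - np.asarray(xj_num, float)) / rng
    # scaled Euclidean distance divided by sqrt(d_r) so that d_r stays in [0, 1]
    d_r = float(np.sqrt((diff ** 2).sum() / len(diff))) if len(diff) else 0.0
    # entropy-weighted simple matching: 1 wherever the categorical values differ
    mismatch = (np.asarray(xi_cat) != np.asarray(xj_cat)).astype(float)
    d_c = float((np.asarray(cat_weights, float) * mismatch).sum())
    return d_r + d_c
```

Looping mixed_distance over all pairs (i, j) fills the distance matrix that the density-peaks steps below consume.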
Step S102, obtaining the local density of each data point in the mixed-attribute data set to be clustered from the calculated pairwise distances and cutoff distance, and further calculating the relative distance of each data point in the data set from the obtained local densities;
The specific process is as follows. According to the distance D(Xi,Xj) between every two data points in the mixed-attribute data set S to be clustered and the cutoff distance dc, the local density ρi of each data point in S is calculated by formula (4):
In formula (4), ρi is the local density of the i-th data point;
then, by formula (5), the relative distance δi of each data point in S is calculated:
In formula (5), δi is the relative distance of the i-th data point: when the local density of a data point Xi is not the maximum density, its δi is the minimum of the distances from Xi to all points whose density is higher than its own; otherwise, δi is taken as the maximum of the distances from Xi to all other points.
In summary, from the decision graph built from the local density ρi and the distance δi of each data point, the user can clearly discover the number of clusters and select the center points.
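A minimal sketch of step S102, assuming the cutoff-kernel local density of the original DPC paper for formula (4); the relative distance follows the description of formula (5) above. The helper also records each point's nearest denser neighbour, which the assignment in step S105 reuses.

```python
import numpy as np

def density_and_delta(dist, d_c):
    """rho_i: number of points within the cutoff distance d_c (cutoff kernel, assumption).
    delta_i: distance to the nearest point of higher density; for the densest point,
    the maximum distance to any other point, as described above."""
    n = dist.shape[0]
    rho = (dist < d_c).sum(axis=1) - 1          # subtract 1 to exclude the point itself
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)             # index of each point's nearest denser neighbour
    order = np.argsort(-rho)                    # indices from densest to sparsest
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()            # densest point: farthest distance
        else:
            denser = order[:rank]               # all points denser than point i
            j = denser[np.argmin(dist[i, denser])]
            delta[i] = dist[i, j]
            nearest_denser[i] = j
    return rho, delta, nearest_denser
```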
Step S103, defining the γ parameter curve formed by the local density of each data point in the mixed-attribute data set to be clustered and its corresponding relative distance, and determining the γ parameter value of each data point;
The specific process is as follows. To determine the cluster centers automatically, the γ parameter value γi = ρi × δi is first defined to build the γ parameter curve, and the γ values γi of the curve are arranged in descending order; the points with larger γi must then be points with a larger local density ρi or a larger relative distance δi. Here γi is the γ parameter value of the i-th data point.
Step S104, building a knee-point index matrix from the sequence number, γ parameter value and relative distance of each data point in the mixed-attribute data set to be clustered, solving the built knee-point index matrix with a preset knee-point algorithm, and obtaining the cluster center points of the data set;
The specific process is as follows. The cluster center points can be determined by calculating the two knee points of γi and δi: the center points lying before the knee point all satisfy that both the local density ρi and the distance δi are relatively large. The knee point can therefore be solved according to the definition of the inflection point of a function, by computing the second derivative f''(x) of a function f(x) and finding the point x0 such that f''(x0) = 0 and the sign of f''(x) differs on the two sides of x0. The specific steps are as follows:
The sequence number, γ parameter value and relative distance of each data point in the mixed-attribute data set S to be clustered are determined, and a sequence-number set, a γ parameter-value set and a relative-distance set are further formed respectively; the sequence-number set is I = [1, 2, ..., n], the γ parameter-value set is γ = [γ1, γ2, ..., γn] and the relative-distance set is δ = [δ1, δ2, ..., δn], where n is the total number of data points in S and is a positive integer.
From the sequence-number set, γ parameter-value set and relative-distance set, the knee-point index matrix CT = [I; γ; δ] is built; CT = [I; γ; δ] is the 3×n matrix whose row vectors are the sequence numbers, γ parameter values and relative distances of the data points and whose number of columns is the total number of data points n.
The columns of CT = [I; γ; δ] are rearranged so that they are sorted first by the γ row in descending order and then by the δ row in descending order, giving the adjusted knee-point index matrix; the second derivative of the γ row of the adjusted matrix is computed, the resulting position is taken as the knee point, and all column vectors of the adjusted matrix whose column index is less than or equal to the knee point are retained, forming the candidate center set.
Whether the number of column vectors in the candidate center set is less than or equal to 2 is then judged.
If so, the data points whose sequence numbers appear in the candidate center set are taken as the cluster center points.
If not, the second derivative of the δ row of the candidate center set is further computed and the resulting position is taken as the secondary knee point; the data points whose sequence numbers appear in the columns of the candidate center set up to and including the secondary knee point are taken as the cluster center points.
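Putting steps S103 and S104 together, the sketch below builds the CT matrix and selects the center points. Approximating the second-derivative test by the position of the largest discrete second difference is an assumption; the patent only states that the knee point is obtained from the second derivative.

```python
import numpy as np

def knee_position(values):
    """1-based position of the knee of a descending sequence, taken here as the point
    with the largest discrete second difference (one reading of the test above)."""
    return int(np.argmax(np.diff(values, n=2))) + 2

def select_centers(rho, delta):
    """Build CT = [I; gamma; delta], sorted by gamma (then delta) in descending order,
    and return the 1-based sequence numbers of the selected cluster centre points."""
    gamma = rho * delta
    order = np.lexsort((-delta, -gamma))        # primary key gamma, secondary key delta
    CT = np.vstack([order + 1.0, gamma[order], delta[order]])
    k1 = knee_position(CT[1])                   # knee of the gamma row
    candidates = CT[:, :k1]                     # columns up to and including the knee
    if candidates.shape[1] <= 2:
        return candidates[0].astype(int)
    k2 = knee_position(candidates[2])           # secondary knee on the delta row
    return candidates[0, :k2].astype(int)
```

The bank-credit example that follows illustrates the same two-stage selection.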
As an example, taking bank credit data, the adjusted knee-point index matrix (10 columns are shown) is given in Table 1 below:
Table 1
The second derivative of the γ row of the adjusted knee-point index matrix is solved, as shown in Table 2 below:
Table 2
From γ'' it can be seen that the knee point appears at the 8th column, so the first 8 columns are taken as the candidate center set HSCT; since the knee point position of 8 is greater than 2, the second derivative of the δ row of the candidate center set HSCT is solved in turn, as shown in Table 3 below:
Table 3
From Table 3 it can be seen that the δ knee point appears at the 2nd column, so the number of cluster centers of this data set is 2, i.e. the 407th and 127th data points are taken as the cluster center points.
Step S105, according to the obtained cluster center points of the mixed-attribute data set to be clustered, producing and outputting the clustering result of the data set; wherein every data point of the data set other than the obtained cluster center points is assigned to the cluster of its nearest neighbor with higher local density, completing the representation and output of the clustering result.
The specific process is as follows. According to the cluster center points, the clustering result of the mixed-attribute data set to be clustered is represented and output: the non-center points are assigned, in order, to the same category as their nearest higher-density neighbor, so that one pass of clustering is completed and the clustering result is output; that is, every data point in the data set S other than the cluster center points is assigned to the cluster of its nearest neighbor with higher local density, and the representation and output of the clustering result are completed.
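A minimal sketch of this single-pass assignment, reusing the nearest-denser-neighbour indices from the density sketch above; the function name and the label convention are illustrative.

```python
import numpy as np

def assign_labels(rho, nearest_denser, center_ids):
    """Assign every non-centre point to the cluster of its nearest neighbour of higher
    local density, visiting points in order of decreasing density so that each point's
    denser neighbour is already labelled when it is reached. `center_ids` holds 1-based
    sequence numbers, as returned by select_centers; the densest point is expected to
    be one of the centres."""
    labels = np.full(len(rho), -1)
    for k, seq in enumerate(center_ids):
        labels[int(seq) - 1] = k                # centre points seed their own clusters
    for i in np.argsort(-rho):                  # densest points first
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels
```

Chaining the sketches, labels = assign_labels(rho, nearest_denser, select_centers(rho, delta)) completes one clustering pass.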
In the embodiment of the present invention, for a data set of n data points, the space complexity of the algorithm comes mainly from the storage of the distance matrix, which requires 3 × n(n−1)/2 storage units: the distance matrix is stored as three rows, the first and second rows holding the sequence numbers of the two data points and the third row holding their distance.
Step S103 needs three arrays of length n to store the local density ρ, the distance δ and their product γ, so, together with the distance matrix above, the space complexity is O(n^2); computing the local densities, the relative distances and their products has time complexity O(n^2). In step S104 the sorting time in the knee-point computation depends on the sorting algorithm used, at best O(n log n) and at most O(n^2), so its time complexity does not exceed O(n^2); in step S105 the assignment of data points for the representation and output of the clustering result has time complexity O(n). The total complexity of the algorithm is therefore O(n^2); it clusters better than the traditional k-prototypes algorithm, runs efficiently and finds the number of clusters automatically.
Implementing the embodiments of the present invention has the following beneficial effects:
The embodiments of the present invention provide a unified distance measure formula for mixed-attribute data points, build the knee-point index matrix of the mixed-attribute data points on this basis, and then propose an automatic cluster-center determination method based on the knee point; the method therefore clusters better than the traditional k-prototypes algorithm, runs efficiently, finds the number of clusters automatically and is insensitive to outliers.
One of ordinary skill in the art will appreciate that all or part of the steps in the method of the above embodiment can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (4)

1. A mixed-attribute data clustering method based on density peaks, characterized in that the method comprises:
S1, obtaining a mixed-attribute data set to be clustered, and, according to the mixed-attribute data set to be clustered, calculating the distance between every two data points in the data set as well as the cutoff distance of the data set;
S2, obtaining the local density of each data point in the mixed-attribute data set to be clustered from the calculated pairwise distances and cutoff distance, and further calculating the relative distance of each data point in the data set from the obtained local densities;
S3, defining the γ parameter curve formed by the local density of each data point in the mixed-attribute data set to be clustered and its corresponding relative distance, and determining the γ parameter value of each data point in the data set;
S4, building a knee-point index matrix from the sequence number, γ parameter value and relative distance of each data point in the mixed-attribute data set to be clustered, solving the built knee-point index matrix with a preset knee-point algorithm, and obtaining the cluster center points of the data set;
S5, according to the obtained cluster center points of the mixed-attribute data set to be clustered, producing and outputting the clustering result of the data set; wherein every data point of the data set other than the obtained cluster center points is assigned to the cluster of its nearest neighbor with higher local density, completing the representation and output of the clustering result.
2. The mixed-attribute data clustering method based on density peaks according to claim 1, characterized in that the distance between every two data points in the mixed-attribute data set to be clustered is computed by the formula D(Xi,Xj) = d(Xi,Xj)^r + d(Xi,Xj)^c, where d(Xi,Xj)^r denotes the distance of the numerical attribute part of the data set and d(Xi,Xj)^c the distance of the categorical attribute part;
wherein d(Xi,Xj)^r is realized by a formula that takes the Euclidean distance between the numerical attributes of data points Xi and Xj after normalization, so that the distance value d(Xi,Xj)^r lies in the interval [0,1];
wherein d(Xi,Xj)^c is realized by an entropy-weighted matching formula, in which the matching distance of data points Xi and Xj on the t-th categorical attribute is weighted by the entropy weight of that attribute, and p(a_ts) is the probability of the s-th value (s = 1, 2, ..., m_t) appearing when the total number of distinct values of the t-th categorical attribute is m_t.
3. The mixed-attribute data clustering method based on density peaks according to claim 2, characterized in that the γ parameter value of each data point in the mixed-attribute data set to be clustered is obtained by the formula γi = ρi × δi, where γi is the γ parameter value of the i-th data point, ρi is the local density of the i-th data point and δi is the relative distance of the i-th data point.
4. The mixed-attribute data clustering method based on density peaks according to claim 3, characterized in that the step S4 specifically comprises:
determining the sequence number, γ parameter value and relative distance of each data point in the mixed-attribute data set to be clustered, and further forming a sequence-number set, a γ parameter-value set and a relative-distance set, respectively; wherein the sequence-number set is I = [1, 2, ..., n], the γ parameter-value set is γ = [γ1, γ2, ..., γn] and the relative-distance set is δ = [δ1, δ2, ..., δn], n being the total number of data points in the mixed-attribute data set to be clustered and a positive integer;
building, from the formed sequence-number set, γ parameter-value set and relative-distance set, the knee-point index matrix CT = [I; γ; δ]; wherein CT = [I; γ; δ] is the 3×n matrix whose row vectors are the sequence numbers, γ parameter values and relative distances of the data points and whose number of columns is the total number of data points n;
rearranging the knee-point index matrix CT = [I; γ; δ] so that its columns are sorted first by the γ row in descending order and then by the δ row in descending order, obtaining the adjusted knee-point index matrix; computing the second derivative of the γ row of the adjusted matrix and taking the resulting position as the knee point; and retaining, in the adjusted matrix, all column vectors whose column index is less than or equal to the knee point, forming a candidate center set;
judging whether the number of column vectors in the candidate center set is less than or equal to 2;
if so, taking the data points whose sequence numbers appear in the candidate center set as the cluster center points;
if not, further computing the second derivative of the δ row of the candidate center set, taking the resulting position as the secondary knee point, and taking the data points whose sequence numbers appear in the columns of the candidate center set up to and including the secondary knee point as the cluster center points.
CN201710294126.8A 2017-04-28 2017-04-28 A kind of mixed attributes data clustering method based on density peaks Pending CN107103336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710294126.8A CN107103336A (en) 2017-04-28 2017-04-28 A kind of mixed attributes data clustering method based on density peaks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710294126.8A CN107103336A (en) 2017-04-28 2017-04-28 A kind of mixed attributes data clustering method based on density peaks

Publications (1)

Publication Number Publication Date
CN107103336A true CN107103336A (en) 2017-08-29

Family

ID=59656642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710294126.8A Pending CN107103336A (en) 2017-04-28 2017-04-28 A kind of mixed attributes data clustering method based on density peaks

Country Status (1)

Country Link
CN (1) CN107103336A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334898A (en) * 2018-01-23 2018-07-27 华中科技大学 A kind of multi-modal industrial process modal identification and Fault Classification
CN111209347A (en) * 2018-11-02 2020-05-29 北京京东尚科信息技术有限公司 Method and device for clustering mixed attribute data
CN111209347B (en) * 2018-11-02 2024-04-16 北京京东振世信息技术有限公司 Method and device for clustering mixed attribute data
CN110320894A (en) * 2019-08-01 2019-10-11 陕西工业职业技术学院 A kind of accurate Coal Pulverizing System of Thermal Power Plant fault detection method for dividing overlapping area data category
CN110320894B (en) * 2019-08-01 2022-04-15 陕西工业职业技术学院 Thermal power plant pulverizing system fault detection method capable of accurately dividing aliasing area data categories
CN111257905B (en) * 2020-02-07 2022-03-04 中国地质大学(武汉) Slice self-adaptive filtering algorithm based on single photon laser point cloud density segmentation
CN111257905A (en) * 2020-02-07 2020-06-09 中国地质大学(武汉) Slice self-adaptive filtering algorithm based on single photon laser point cloud density segmentation
CN111339294A (en) * 2020-02-11 2020-06-26 普信恒业科技发展(北京)有限公司 Client data classification method and device and electronic equipment
CN113158817B (en) * 2021-03-29 2023-07-18 南京信息工程大学 Objective weather typing method based on rapid density peak clustering
CN113158817A (en) * 2021-03-29 2021-07-23 南京信息工程大学 Objective weather typing method based on rapid density peak clustering
CN113743457A (en) * 2021-07-29 2021-12-03 暨南大学 Quantum density peak value clustering method based on quantum Grover search technology
CN113743457B (en) * 2021-07-29 2023-07-28 暨南大学 Quantum density peak clustering method based on quantum Grover search technology
CN113923043A (en) * 2021-10-27 2022-01-11 温州职业技术学院 User entity behavior analysis method based on density peak value adaptive clustering
CN113923043B (en) * 2021-10-27 2024-02-09 温州职业技术学院 User entity behavior analysis method based on density peak value self-adaptive clustering
CN116434880A (en) * 2023-03-06 2023-07-14 哈尔滨理工大学 High-entropy alloy hardness prediction method based on fuzzy self-consistent clustering integration
CN116434880B (en) * 2023-03-06 2023-09-08 哈尔滨理工大学 High-entropy alloy hardness prediction method based on fuzzy self-consistent clustering integration

Similar Documents

Publication Publication Date Title
CN107103336A (en) A kind of mixed attributes data clustering method based on density peaks
Ding et al. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood
Michalski et al. Automated construction of classifications: Conceptual clustering versus numerical taxonomy
Greene et al. Producing a unified graph representation from multiple social network views
Zhang et al. Multilevel projections with adaptive neighbor graph for unsupervised multi-view feature selection
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
Nguyen et al. SLINT: a schema-independent linked data interlinking system
Zhang et al. Local community detection based on network motifs
Xu et al. An improved k-means clustering algorithm
CN109726749A (en) A kind of Optimal Clustering selection method and device based on multiple attribute decision making (MADM)
CN109492022A (en) The searching method of semantic-based improved k-means algorithm
Zhou et al. Relevance feature mapping for content-based multimedia information retrieval
Zhou et al. ECMdd: Evidential c-medoids clustering with multiple prototypes
CN101980251A (en) Remote sensing classification method for binary tree multi-category support vector machines
CN109858518A (en) A kind of large data clustering method based on MapReduce
Niu et al. Overlapping community detection with adaptive density peaks clustering and iterative partition strategy
CN106156795A (en) A kind of determination method and device of suspicious money laundering account
Melendez-Melendez et al. An improved algorithm for partial clustering
Xue et al. GOMES: A group-aware multi-view fusion approach towards real-world image clustering
CN107392048A (en) Differential privacy protection method in data visualization and evaluation index thereof
CN109697471A (en) A kind of density peaks clustering method based on KNN
Shen et al. A dimensionality reduction framework for detection of multiscale structure in heterogeneous networks
Chen et al. PurTreeClust: A purchase tree clustering algorithm for large-scale customer transaction data
CN110516741A (en) Classification based on dynamic classifier selection is overlapped unbalanced data classification method
CN109886332A (en) Improvement DPC clustering algorithm and system based on symmetrical neighborhood

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20170829