CN107315647A

CN107315647A - Outlier detection method and system

Info

Publication number: CN107315647A
Application number: CN201710497183.6A
Authority: CN
Inventors: 徐骄
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2017-11-03

Abstract

The invention relates to an outlier detection method and system. The method comprises the following steps: obtaining a sample space to be detected, wherein the sample space comprises a plurality of sample points and each sample point comprises a plurality of dimensions; selecting a plurality of sample points from the sample space as the center points of corresponding clusters; calculating a distance weight between each unselected sample point in the sample space and each center point, wherein the distance weight is the ratio of the number of dimensions in which the sample point and the center point have the same value to the total number of dimensions; calculating the distance between each unselected sample point and each center point according to the distance weight; determining the cluster to which each unselected sample point belongs according to the distance; and determining sample points that do not belong to any cluster to be outliers. The invention is not constrained by data volume and can accurately detect outliers even when the data volume is large.

Description

Outlier detection method and system

Technical field

The present invention relates to the technical field of outlier detection, and more particularly to an outlier detection method and an outlier detection system.

Background art

Outlier detection, also known as anomaly detection, is the process of finding, in historical data, objects whose behavior differs markedly from what is expected; such objects are called outliers or anomalies. With the development of science and technology, the application prospects of outlier detection are increasingly broad. For example, work in the data science field today is essentially organized around the data flow: data obtained from data sources is stored, then preprocessed, then modeled, analyzed, and mined, and finally put to use (data realization). The quality and accuracy of data preprocessing are critical to every subsequent step; if the data contains outliers, the data mining algorithm may overfit and its results cannot be used directly in business. As another example, some business scenarios contain anomalous or outlying objects that differ significantly from most other objects, and in such cases mining the data for outliers is particularly important. Credit card fraud detection is a typical example: its main purpose is to detect the purchasing patterns or behavior of anomalous objects.

In the conventional art, outliers are typically detected in the EDA (Exploratory Data Analysis) stage by drawing scatter plots, box plots, and the like with graphical tools, which is fairly intuitive and allows distance-based outliers to be detected visually. With this plotting approach, however, visualizing the full data set becomes difficult when the data volume is large.

Summary of the invention

In view of this, it is necessary to address the above problem that outliers cannot be detected when the data volume is large by providing an outlier detection method and system that are not constrained by data volume and can accurately detect outliers even when the data volume is large.

An outlier detection method, including the steps of:

obtaining a sample space to be detected, wherein the sample space includes multiple sample points and each sample point includes several dimensions;

choosing several sample points from the sample space as the center points of corresponding clusters;

calculating a distance weight between each unselected sample point in the sample space and each center point, wherein the distance weight is the ratio of the number of dimensions in which the sample point and the center point have the same value to the total number of dimensions;

calculating the distance between each unselected sample point and each center point according to the distance weight;

determining the cluster to which each unselected sample point belongs according to the distance;

determining sample points that do not belong to any cluster to be outliers.

When screening outliers in the sample space, the above outlier detection method first calculates the distance weight between each sample point and each center point, then weights the distance between the sample point and the center point by the distance weight, and detects the outliers in the sample space according to the weighted distance. Because outliers are screened out directly in the sample space without plotting, the method is not constrained by data volume and can accurately detect outliers even when the data volume is large.

In one embodiment, after the sample points that do not belong to any cluster are determined to be outliers, the method further includes the steps of: calculating the mean square deviation of each cluster and obtaining a threshold for each cluster from the mean square deviation; obtaining, in each cluster, the sample points whose distance from the corresponding center point is greater than the threshold, and taking them as candidate outliers; and screening all candidate outliers to obtain supplementary outliers. Supplementing the outliers by applying a standard-deviation distance strategy to the direct clustering result corrects the clustering result: it prevents sample points that are farthest from the final center point, and are in fact outliers, from being wrongly assigned to a cluster during clustering, which further improves the accuracy of outlier detection.

In one embodiment, screening all candidate outliers to obtain the supplementary outliers includes: sorting all candidate outliers by their distance from the center point, and selecting a preset number of candidate outliers, starting from the candidate outlier with the largest distance, as the supplementary outliers.

In one embodiment, calculating the mean square deviation of each cluster includes: obtaining the standard deviation of each cluster in each dimension according to the total number of sample points in the cluster, the value of each sample point of the cluster in each dimension, and the mean of all sample points of the cluster in each dimension; and calculating the average of the standard deviations over all dimensions of each cluster to obtain the mean square deviation of each cluster.

In one embodiment, calculating the distance between each unselected sample point and each center point according to the distance weight includes: calculating the Mahalanobis distance between each unselected sample point and each center point according to the reciprocal of the distance weight between the sample point and the center point, the values of the dimensions of the sample point, and the values of the dimensions of the center point. The distance weight is a fraction, and the larger it is, the closer the two sample points are, so its reciprocal is used as the weighting factor in the Mahalanobis distance. In addition, the plotting approach of the conventional art cannot reflect relationships among multiple variables, whereas the Mahalanobis distance used by the invention is not affected by units of measurement and also takes the correlation between variables into account, so that the clustering result fits the actual clusters more closely, a better clustering effect is obtained, and outliers are detected more accurately.

In one embodiment, after the cluster to which each unselected sample point belongs is determined according to the distance and before the sample points that do not belong to any cluster are determined to be outliers, the method further includes the steps of: judging whether the obtained clusters satisfy a set convergence condition; and, if they do not, choosing the center point of each cluster again and re-determining the cluster to which each unselected sample point belongs according to the newly chosen center points.

In one embodiment, determining the cluster to which each unselected sample point belongs includes: if exactly one center point is nearest to a sample point, assigning the sample point to the cluster of that nearest center point; if several center points are equally nearest to a sample point, not assigning the sample point to any cluster.

An outlier detection system, including:

a sample space acquisition module, for obtaining a sample space to be detected, wherein the sample space includes multiple sample points and each sample point includes several dimensions;

a center point selection module, for choosing several sample points from the sample space as the center points of corresponding clusters;

a distance weight obtaining module, for calculating a distance weight between each unselected sample point in the sample space and each center point, wherein the distance weight is the ratio of the number of dimensions in which the sample point and the center point have the same value to the total number of dimensions;

a distance obtaining module, for calculating the distance between each unselected sample point and each center point according to the distance weight;

a cluster division module, for determining the cluster to which each unselected sample point belongs according to the distance;

an outlier detection module, for determining sample points that do not belong to any cluster to be outliers.

When screening outliers in the sample space, the above outlier detection system first calculates the distance weight between each sample point and each center point, then weights the distance between the sample point and the center point by the distance weight, and detects the outliers in the sample space according to the weighted distance. Because outliers are screened out directly in the sample space without plotting, the system is not constrained by data volume and can accurately detect outliers even when the data volume is large.

In one embodiment, the outlier detection system further includes an outlier supplement module connected to the outlier detection module. The outlier supplement module includes: a threshold obtaining unit, for calculating the mean square deviation of each cluster and obtaining a threshold for each cluster from the mean square deviation; a candidate outlier obtaining unit, for obtaining, in each cluster, the sample points whose distance from the corresponding center point is greater than the threshold and taking them as candidate outliers; and a supplementary outlier obtaining unit, for screening all candidate outliers to obtain supplementary outliers. Supplementing the outliers by applying a standard-deviation distance strategy to the direct clustering result corrects the clustering result: it prevents sample points that are farthest from the final center point, and are in fact outliers, from being wrongly assigned to a cluster during clustering, which further improves the accuracy of outlier detection.

In one embodiment, the supplementary outlier obtaining unit sorts all candidate outliers by their distance from the center point and selects a preset number of candidate outliers, starting from the candidate outlier with the largest distance, as the supplementary outliers.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the outlier detection method of one embodiment;

Fig. 2 is a schematic flowchart of outlier detection during the clustering process in a specific embodiment;

Fig. 3 is a schematic flowchart of the outlier detection method of another embodiment;

Fig. 4 is a schematic structural diagram of the outlier detection system of one embodiment;

Fig. 5 is a schematic structural diagram of the outlier detection system of another embodiment;

Fig. 6 is a schematic structural diagram of the outlier supplement module of one embodiment.

Detailed description of the embodiments

When performing outlier detection, the conventional art generally uses one of two approaches: (1) detecting outliers by plotting in the EDA stage; (2) statistics-based outlier detection. Approach (1) is unsuitable when the data volume is large. Approach (2) is unsuitable for high-dimensional sample spaces and requires prior knowledge of the distribution of the data in the sample space, which quite possibly cannot be obtained before outlier detection is carried out.

To overcome these drawbacks, the invention uses a distance-based clustering algorithm, such as the K-Means algorithm, to cluster the data in the sample space and thereby detect outliers. However, outlier detection based purely on clustering also has limitations; for example, outliers can affect the cluster partitioning or the clustering result during clustering. To reduce the impact of this limitation on outlier detection, the scheme of the invention makes the following improvement: when the data in the whole sample space is clustered, the distance evaluation is weighted, so that every round of clustering includes the weighted-distance factor.

To make the improvement made by the invention clearer, the technical scheme for outlier detection during the clustering process is described clearly and completely below with reference to the accompanying drawings and preferred embodiments.

As shown in Fig. 1, an outlier detection method includes the steps of:

S110, obtaining a sample space to be detected, wherein the sample space includes multiple sample points and each sample point includes several dimensions;

S120, choosing several sample points from the sample space as the center points of corresponding clusters;

S130, calculating a distance weight between each unselected sample point in the sample space and each center point, wherein the distance weight is the ratio of the number of dimensions in which the sample point and the center point have the same value to the total number of dimensions;

S140, calculating the distance between each unselected sample point and each center point according to the distance weight;

S150, determining the cluster to which each unselected sample point belongs according to the distance;

S160, determining sample points that do not belong to any cluster to be outliers.

The above outlier detection method does not require the user to have any domain knowledge. It screens out outliers directly in the sample space without plotting, so it is not constrained by data volume and can accurately detect outliers even when the data volume is large. Adding the distance weight factor makes the clustering result fit the actual clusters more closely, yielding a better clustering effect and more accurate screening of outliers. In addition, the distance weight factor depends on the number of dimensions, which supports outlier detection on large-scale high-dimensional data on big data platforms and addresses the computational complexity and difficulty brought by high-dimensional samples; the method is therefore applicable to high-dimensional sample spaces and does not require prior knowledge of the distribution of the data in the sample space. Each step is described in detail below:

In step S110, the set of all basic outcomes of a random event E is the sample space of E. The sample space to be detected is the data in which outliers are to be detected. The elements of the sample space are called sample points or elementary events. A sample point includes several dimensions; for example, the dimensions of a sample point include id (identity), gender, age, salary, address, and job. The values of a sample point's dimensions constitute the content of its vector. For example, the vector of sample point 1 is (ID number A, female, 30, 4000, address B, job C), and the vector of sample point 2 is (ID number D, male, 30, 3500, address E, job F).

In step S120, K sample points are randomly selected from the sample space as the center points of K clusters, where K is an integer greater than or equal to 1. For example, sample point 1, sample point 2, and sample point 3 are randomly selected from the sample space as initial center points: sample point 1 is the initial center point of cluster 1, sample point 2 is the initial center point of cluster 2, and sample point 3 is the initial center point of cluster 3. As many clusters are obtained as sample points are chosen, and the initial center point of each cluster is the corresponding chosen sample point.

In step S130, because each center point necessarily belongs to its own cluster, the chosen center points need not be considered when the sample points are partitioned into clusters; only the unselected sample points are considered, although the invention is not limited in this respect. A larger distance weight indicates a higher similarity between two sample points. For a sample point A, the distance weight between A and a center point C is calculated as w = same_num / sum_num, where w is the distance weight between sample point A and center point C, same_num is the number of dimensions in which A and C have the same value, and sum_num is the total number of dimensions of A or C (every sample point has the same total number of dimensions). In this way the distance weights between sample point A and each of the K center points can be calculated, and likewise the distance weights between every sample point and the K center points.

For example, if the dimensions of a sample point are id, gender, age, salary, address, and job, the total number of dimensions is 6. If the values of the dimensions of sample point 1 are (A1, B1, C1, D1, E1, F1) and the values of the dimensions of a center point are (A2, B1, C1, D2, E2, F2), then 2 dimensions have the same value, and the distance weight between sample point 1 and that center point is 2/6.
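The weight calculation in step S130 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the function name `distance_weight` and the tuple representation of sample points are assumptions made for the example.

```python
def distance_weight(sample, center):
    """Return same_num / sum_num for two points with equal dimension counts.

    Hypothetical helper: points are assumed to be tuples of dimension values.
    """
    if len(sample) != len(center) or len(sample) == 0:
        raise ValueError("points must have the same, non-zero number of dimensions")
    # same_num: number of dimensions in which both points hold the same value
    same_num = sum(1 for s, c in zip(sample, center) if s == c)
    # sum_num: total number of dimensions
    sum_num = len(sample)
    return same_num / sum_num


# The worked example from the text: only the second and third dimensions match.
sample_1 = ("A1", "B1", "C1", "D1", "E1", "F1")
center = ("A2", "B1", "C1", "D2", "E2", "F2")
w = distance_weight(sample_1, center)
```

Here `w` evaluates to 2/6, matching the example above.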

In step S140, for a given sample point, the distance with distance weight between the sample point and a center point is calculated according to the distance weight between them. The distance can be calculated in many ways. For example, in one embodiment, calculating the distance between each unselected sample point and each center point according to the distance weight may include: calculating the Mahalanobis distance between each unselected sample point and each center point according to the reciprocal of the distance weight between the sample point and the center point, the values of the dimensions of the sample point, and the values of the dimensions of the center point. Specifically, the Mahalanobis distance between a sample point A and a center point C is calculated by the following formula:

[formula not reproduced in the source text]

In the above formula, the two vector symbols denote sample point A and center point C respectively, and the dimension values of each sample point (the values of its dimensions) are the content of its vector; the distance symbol denotes the distance between sample point A and center point C; w denotes the distance weight between sample point A and center point C; T denotes transposition; and -1 denotes matrix inversion.

The Mahalanobis distance is used rather than the Euclidean distance for two reasons: first, the Mahalanobis distance is independent of units of measurement; second, it takes the correlation between variables into account, which overcomes the inability of the plotting approach in the conventional art to reflect relationships among multiple variables and makes the clustering result fit the actual clusters more closely. In addition, the distance weight is a fraction, and the larger it is, the closer the two sample points are; its reciprocal is therefore used as the weighting factor in the Mahalanobis distance, so that a smaller reciprocal corresponds to a smaller distance between the two sample points.
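Because the formula itself is missing from this text, the following Python sketch shows only one plausible reading of step S140, under clearly stated assumptions: the reciprocal 1/w multiplies an ordinary Mahalanobis distance, all dimensions are numeric, the inverse covariance matrix `inv_cov` is computed elsewhere and passed in, and a weight of 0 (no matching dimension) is treated as an infinite distance. The function name and all of these choices are illustrative, not taken from the patent.

```python
import math


def weighted_mahalanobis(x, c, inv_cov, w):
    """Mahalanobis distance between x and c scaled by 1/w (assumed form)."""
    if w == 0:
        # Assumption: with no matching dimension the reciprocal is undefined,
        # so the point is treated as infinitely far from this center.
        return math.inf
    diff = [xi - ci for xi, ci in zip(x, c)]
    n = len(diff)
    # Quadratic form diff^T * inv_cov * diff
    quad = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(quad) / w


# With an identity inverse covariance this reduces to Euclidean distance / w.
identity = [[1.0, 0.0], [0.0, 1.0]]
d = weighted_mahalanobis((3.0, 4.0), (0.0, 0.0), identity, 0.5)
```

In this example the unweighted distance is 5, and dividing by w = 0.5 gives 10, so a smaller weight (fewer matching dimensions) pushes the point farther from the center.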

In step S150, in one embodiment, determining the cluster to which each unselected sample point belongs includes: if exactly one center point is nearest to a sample point, assigning the sample point to the cluster of that nearest center point; if several center points are equally nearest to a sample point, not assigning the sample point to any cluster.
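The assignment rule of step S150 can be sketched as follows. The function name `assign_cluster` and the use of `None` for "no cluster" are assumptions for illustration, as is the use of exact equality to detect ties; a real implementation might compare distances within a tolerance.

```python
def assign_cluster(distances):
    """Return the index of the unique nearest center, or None on a tie.

    distances[k] is the weighted distance from one sample point to center k.
    """
    if not distances:
        return None
    nearest = min(distances)
    # Collect every center that attains the minimum distance.
    nearest_indices = [k for k, d in enumerate(distances) if d == nearest]
    return nearest_indices[0] if len(nearest_indices) == 1 else None


unique = assign_cluster([2.0, 0.5, 3.0])  # center 1 is the only nearest center
tied = assign_cluster([1.0, 1.0, 3.0])    # centers 0 and 1 tie, so no cluster
```

A sample point for which the function returns `None` is left outside every cluster in that round and is therefore a suspected outlier.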

In step S160, a sample point that is not assigned to any cluster in one iteration is only a suspected outlier in that iteration; whether it is an outlier is determined once the convergence condition is met. The convergence condition can be set according to actual needs, for example: the number of iterations is 10 and the cluster assignment of all sample points has not changed during the last 5 iterations. If the convergence condition is met, clustering ends and the sample points that do not belong to any cluster are determined to be outliers. If a sample point is determined to be an outlier, an outlier mark can be added to it so that the user can check it.

In one embodiment, after the cluster to which each unselected sample point belongs is determined according to the distance and before the sample points that do not belong to any cluster are determined to be outliers, the method further includes the steps of: judging whether the obtained clusters satisfy a set convergence condition; and, if they do not, choosing the center point of each cluster again and re-determining the cluster to which each unselected sample point belongs according to the newly chosen center points. That is, if the set convergence condition is not satisfied, a center point is chosen again in each of the existing K clusters and steps S130 to S150 are repeated until the convergence condition is met.

To better explain the clustering process of the invention, a specific embodiment based on the K-Means clustering algorithm is described in detail below.

As shown in Fig. 2, the K-Means clustering process comprises the following steps:

S1, randomly selecting K initial center points from the sample space;

S2, calculating the distance weight between each sample point and each of the K center points;

S3, calculating the distance between each sample point and each of the K center points according to the distance weight;

S4, partitioning the sample points into clusters according to the distance: if exactly one center point is nearest to a sample point, the sample point is assigned to the cluster of that nearest center point; otherwise the sample point is not assigned to any cluster;

S5, judging whether the number of iterations equals 10 and the cluster partitioning has not changed during the last 5 iterations; if so, the sample points that finally do not belong to any cluster are determined to be outliers and clustering ends; if not, the center points of the K clusters are chosen again and the process returns to step S2.
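The convergence test in step S5 can be sketched as a small helper. The function name `has_converged`, the representation of each round's result as a list of cluster indices (with `None` for unassigned points), and the reading "at least 10 rounds, with the last 5 recorded assignments identical" are assumptions for illustration; the text itself only gives 10 and 5 as example values.

```python
def has_converged(history, max_rounds=10, stable_rounds=5):
    """Check the example convergence condition of step S5.

    history[i] is the cluster assignment produced by round i + 1, e.g.
    [0, 2, None, 1] where None marks a point assigned to no cluster.
    """
    if len(history) < max_rounds:
        return False
    recent = history[-stable_rounds:]
    # Converged only if the last `stable_rounds` assignments are identical.
    return all(assignment == recent[0] for assignment in recent)


stable = [[0, 1, None]] * 5 + [[0, 0, None]] * 5
unstable = [[0, 1, None]] * 9 + [[0, 0, None]]
done = has_converged(stable)
not_done = has_converged(unstable)
```

When the helper returns true, the points still marked `None` in the final assignment are the ones step S5 reports as outliers.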

Outlier detection based purely on clustering has another limitation: it depends heavily on the chosen clustering algorithm, and different clustering algorithms may produce different clustering results for the same sample space data. To reduce the impact of this limitation on outlier detection, the scheme of the invention makes the following improvement: after clustering is complete, the standard deviation of the distance is calculated for each cluster; candidate outliers of each cluster are then selected according to the standard deviation of the cluster and the distances between the sample points in the cluster and its center point; and finally the supplementary outliers are selected from the candidate outliers. Supplementing the outliers in the direct clustering result corrects the clustering result: it prevents sample points that are farthest from the final center point, and are in fact outliers, from being wrongly assigned to a cluster during clustering, which further improves the accuracy of outlier detection.

To make this further improvement clearer, the technical scheme for outlier detection after clustering ends is described clearly and completely below.

In one embodiment, as shown in Fig. 3, after the sample points that do not belong to any cluster are determined to be outliers, the method may further include the steps of:

S170, calculating the mean square deviation of each cluster and obtaining a threshold for each cluster from the mean square deviation;

The clusters in this step are the clusters produced when the final round of clustering ends. In one embodiment, calculating the mean square deviation of each cluster includes: obtaining the standard deviation of each cluster in each dimension according to the total number of sample points in the cluster, the value of each sample point of the cluster in each dimension, and the mean of all sample points of the cluster in each dimension; and calculating the average of the standard deviations over all dimensions of each cluster to obtain the mean square deviation of each cluster. That is, for a given cluster, the mean square deviation is calculated using the following formula:

[formula not reproduced in the source text]

In the above formula, σ denotes the mean square deviation of the cluster; M denotes the total number of dimensions of a sample point (for example, if each sample point has 6 dimensions, M is 6); N denotes the total number of sample points in the cluster (for example, if a cluster contains 100 users, N is 100); x_i denotes the value of a dimension of the i-th sample point, that is, its vector value; and μ denotes the mean, over all sample points in the cluster, of the dimension corresponding to x_i. The standard deviation of a cluster is calculated for every dimension, so if there are M dimensions in total, each cluster takes the average of its M standard deviations as its mean square deviation.

After the standard deviation of each cluster is obtained, the threshold of each cluster can be determined from it; for example, the threshold is 3σ. The user can also set other thresholds according to actual needs, and the invention is not limited in this respect.
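The threshold calculation of step S170 can be sketched as below. Since the formula is missing from this text, the sketch assumes the population standard deviation (division by N) in each dimension and numeric dimension values; the function name `cluster_threshold` and the `factor` parameter are illustrative.

```python
import math


def cluster_threshold(points, factor=3.0):
    """Return factor * sigma, where sigma averages the per-dimension std devs.

    points is a non-empty list of equal-length numeric tuples (one cluster).
    Assumption: each per-dimension standard deviation divides by N.
    """
    n = len(points)
    m = len(points[0])
    stds = []
    for j in range(m):
        mean_j = sum(p[j] for p in points) / n
        var_j = sum((p[j] - mean_j) ** 2 for p in points) / n
        stds.append(math.sqrt(var_j))
    sigma = sum(stds) / m  # average over the M dimensions
    return factor * sigma


# Two points: std dev is 1 in the first dimension and 2 in the second.
threshold = cluster_threshold([(0.0, 0.0), (2.0, 4.0)])
```

For this toy cluster σ is (1 + 2) / 2 = 1.5, so the example threshold 3σ is 4.5.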

S180, obtaining, in each cluster, the sample points whose distance from the corresponding center point is greater than the threshold, and taking the obtained sample points as candidate outliers;

For each cluster, the sample points whose distance from its center point is greater than the threshold are screened out and taken as candidate outliers. The distance here is the weighted distance between a sample point in the cluster and the center point of the final cluster, that is, the distance with distance weight. To reduce the amount of computation needed to obtain the candidate outliers, optionally, for each cluster the judgment D > Aσ is made starting from the sample point farthest from the cluster center, where A is a constant; that is, it is judged whether the weighted distance between the sample point and the final cluster center in the clustering (for example, K-Means clustering) is greater than A times the standard deviation. If it is, the sample point is added to the candidate outlier set, Candidate_Set = {x, ...}. The judgment ends when a sample point in the cluster is encountered whose distance is less than or equal to A times the standard deviation.
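The early-terminating scan of step S180 can be sketched as follows. The function name `candidate_outliers` and the dictionary from sample identifiers to weighted distances are assumptions made for the example.

```python
def candidate_outliers(distances, sigma, a=3.0):
    """Collect the samples of one cluster whose distance D satisfies D > a * sigma.

    distances maps a sample id to its weighted distance from the final
    cluster center. Samples are visited from farthest to nearest, and the
    scan stops at the first sample that does not exceed the threshold.
    """
    threshold = a * sigma
    candidates = []
    ranked = sorted(distances.items(), key=lambda item: item[1], reverse=True)
    for sample_id, d in ranked:
        if d <= threshold:
            break  # every remaining sample is at least as close
        candidates.append(sample_id)
    return candidates


cluster_distances = {"p1": 1.0, "p2": 7.5, "p3": 4.0, "p4": 9.0}
candidate_set = candidate_outliers(cluster_distances, sigma=1.5, a=3.0)
```

With σ = 1.5 and A = 3 the threshold is 4.5, so only `p4` and `p2` enter the candidate set and the scan stops at `p3`.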

S190, screening all candidate outliers to obtain the supplementary outliers;

The screening rule can be set according to the user's needs. For example, in one embodiment, screening all candidate outliers to obtain the supplementary outliers includes: sorting all candidate outliers by their distance from the center point, and selecting a preset number of candidate outliers, starting from the candidate outlier with the largest distance, as the supplementary outliers. The preset number can be determined according to actual needs; for example, the sample points in the final Candidate_Set are arranged in descending order of their distance from the corresponding center point, and the top 80% are selected as the supplementary outliers. The supplementary outliers screened out, together with the outliers determined during clustering, constitute all the outliers of this detection.
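The top-fraction selection of step S190 can be sketched as below. The function name `supplementary_outliers`, the list-of-pairs input, and rounding the count down are assumptions for illustration; the text only gives 80% as an example.

```python
def supplementary_outliers(candidates, fraction=0.8):
    """Keep the farthest `fraction` of the candidate outliers.

    candidates is a list of (sample_id, distance) pairs gathered from all
    clusters. Assumption: the number kept is rounded down.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    keep = int(len(ranked) * fraction)
    return [sample_id for sample_id, _ in ranked[:keep]]


candidate_pairs = [("a", 5.0), ("b", 9.0), ("c", 6.0), ("d", 7.0), ("e", 8.0)]
supplement = supplementary_outliers(candidate_pairs)
```

Of the five candidates, the four farthest (`b`, `e`, `d`, `c`) are kept; the final outlier set is their union with the points left unassigned during clustering.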

Based on the same inventive concept, the invention also provides an outlier detection system. Embodiments of the system are described in detail below with reference to the accompanying drawings.

As shown in Fig. 4, an outlier detection system includes:

a sample space acquisition module 110, for obtaining a sample space to be detected, wherein the sample space includes multiple sample points and each sample point includes several dimensions;

a center point selection module 120, for choosing several sample points from the sample space as the center points of corresponding clusters;

a distance weight obtaining module 130, for calculating a distance weight between each unselected sample point in the sample space and each center point, wherein the distance weight is the ratio of the number of dimensions in which the sample point and the center point have the same value to the total number of dimensions;

a distance obtaining module 140, for calculating the distance between each unselected sample point and each center point according to the distance weight;

a cluster division module 150, for determining the cluster to which each unselected sample point belongs according to the distance;

an outlier detection module 160, for determining sample points that do not belong to any cluster to be outliers.

When the data in the whole sample space is clustered, the above outlier detection system weights the distance evaluation so that every round of clustering includes the weighted-distance factor, which effectively addresses the limitation of purely clustering-based outlier detection that outliers can affect the cluster partitioning or the clustering result during clustering. The system does not require the user to have any domain knowledge. It screens out outliers directly in the sample space without plotting, so it is not constrained by data volume and can accurately detect outliers even when the data volume is large. Adding the distance weight factor makes the clustering result fit the actual clusters more closely, yielding a better clustering effect and more accurate screening of outliers. In addition, the distance weight factor depends on the number of dimensions, which supports outlier detection on large-scale high-dimensional data on big data platforms and addresses the computational complexity and difficulty brought by high-dimensional samples; the system is therefore applicable to high-dimensional sample spaces and does not require prior knowledge of the distribution of the data in the sample space. Each module is described in detail below:

The sample space to be detected is the data in which outliers are to be detected. The elements of the sample space are called sample points or elementary events. A sample point includes several dimensions, and the values of the dimensions of each sample point form the content of that sample point's vector. The sample space acquisition module 110 obtains the sample space to be detected, and the central point selection module 120 randomly selects K samples from the sample space as the central points of K clusters, where K is an integer greater than or equal to 1.

Because each central point necessarily belongs to its own cluster, the selected central points need not be considered when dividing sample points into clusters; only the unselected sample points are considered, although the present invention is not limited in this respect. A larger distance weight indicates a higher similarity between two sample points. For a given sample point A, the distance weight acquisition module 130 calculates the distance weight between sample point A and a given central point C as w = same_num / sum_num, where w is the distance weight between sample point A and central point C, same_num is the number of dimensions in which sample point A and central point C have the same value, and sum_num is the total number of dimensions of sample point A or central point C, since every sample point has the same total number of dimensions. Using this method, the distance weight acquisition module 130 can calculate the distance weight between sample point A and each of the K central points and, similarly, the distance weight between every sample point and each of the K central points.
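The calculation w = same_num / sum_num can be illustrated with a minimal Python sketch. The function names `distance_weight` and `distance_weights` and the use of plain sequences for sample points are assumptions made for illustration only and do not appear in the original text.

```python
from typing import List, Sequence


def distance_weight(sample: Sequence, center: Sequence) -> float:
    """Return w = same_num / sum_num for one sample point and one central point."""
    if len(sample) != len(center):
        raise ValueError("sample point and central point must have the same number of dimensions")
    if not sample:
        raise ValueError("at least one dimension is required")
    # same_num: number of dimensions whose values are identical in both points.
    same_num = sum(1 for s, c in zip(sample, center) if s == c)
    # sum_num: total number of dimensions (the same for every sample point).
    sum_num = len(sample)
    return same_num / sum_num


def distance_weights(sample: Sequence, centers: Sequence[Sequence]) -> List[float]:
    """Return the distance weight between one sample point and each of the K central points."""
    return [distance_weight(sample, center) for center in centers]
```

For instance, a sample point (1, 2, 3, 4) and a central point (1, 2, 0, 4) agree in three of four dimensions, so the sketch returns a distance weight of 0.75.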

For a given sample point, the distance acquisition module 140 calculates the distance, incorporating the distance weight, between the sample point and a given central point according to the distance weight between them. The distance acquisition module 140 may calculate the distance in many ways. For example, in one embodiment, the distance acquisition module 140 calculates the Mahalanobis distance between each unselected sample point and each central point according to the reciprocal of the distance weight between the sample point and the central point, the values of the dimensions of the sample point, and the values of the dimensions of the central point.

The distance acquisition module 140 uses the Mahalanobis distance rather than the Euclidean distance for two reasons. First, the Mahalanobis distance is independent of the units of measurement. Second, the Mahalanobis distance takes the correlation between variables into account, which overcomes the inability of the plotting approach in the conventional art to reflect relations among multiple variables, so that the clustering result fits the actual clusters more closely. In addition, the distance weight is a fraction between 0 and 1, and a larger value indicates that two sample points are closer. The reciprocal of the distance weight is therefore used as the distance weight factor: the larger the weight, the smaller its reciprocal, and the smaller the distance between the two sample points.
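The role of the reciprocal of the distance weight can be sketched as follows. This passage does not give the exact formula, so the sketch rests on several assumptions: the Mahalanobis distance is simplified to its diagonal-covariance form (each squared difference divided by a per-dimension variance), the result is multiplied by 1 / w, and a weight of 0 is treated as an infinite distance. The function name `weighted_distance` and its parameters are hypothetical.

```python
import math
from typing import Sequence


def weighted_distance(
    sample: Sequence[float],
    center: Sequence[float],
    variances: Sequence[float],
    w: float,
) -> float:
    """Diagonal-covariance Mahalanobis distance scaled by the reciprocal of the distance weight."""
    if not (len(sample) == len(center) == len(variances)):
        raise ValueError("all inputs must have the same number of dimensions")
    if any(v <= 0 for v in variances):
        raise ValueError("per-dimension variances must be positive")
    if w <= 0:
        # Assumption: with no matching dimension the reciprocal is undefined,
        # so the two points are treated as infinitely far apart.
        return math.inf
    # Simplified Mahalanobis term: squared differences normalized by variance.
    squared = sum((s - c) ** 2 / v for s, c, v in zip(sample, center, variances))
    # A larger weight gives a smaller reciprocal and hence a smaller distance.
    return math.sqrt(squared) / w
```

A full Mahalanobis distance would use the inverse covariance matrix instead of per-dimension variances; the diagonal form is used here only to keep the sketch short.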

In one embodiment, the cluster division module 150 determines the cluster to which each unselected sample point belongs as follows: if exactly one central point is nearest to a sample point, the sample point is assigned to the cluster of that nearest central point; if several central points are equally nearest to a sample point, the sample point is not assigned to any cluster.
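The assignment rule above can be sketched in Python. The function name `assign_cluster` is hypothetical, and the sketch assumes that ties are detected by exact equality of the distances; a practical implementation might compare within a tolerance instead.

```python
from typing import Optional, Sequence


def assign_cluster(distances: Sequence[float]) -> Optional[int]:
    """Return the index of the single nearest central point, or None on a tie."""
    if not distances:
        raise ValueError("at least one central point is required")
    nearest = min(distances)
    # Collect every central point that attains the minimum distance.
    nearest_indices = [i for i, d in enumerate(distances) if d == nearest]
    # Exactly one nearest central point: assign; several: leave unassigned.
    return nearest_indices[0] if len(nearest_indices) == 1 else None
```

Here `distances[i]` is the distance from the sample point to the i-th central point, and `None` marks a sample point that is not assigned to any cluster in the current round.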

Not assigning the sample point to any cluster only means that the sample point is a suspected outlier in the current round; whether the sample point is an outlier is decided once the convergence condition is met. The convergence condition can be set according to actual needs, for example: the number of rounds is set to 10, and the cluster assignment of all sample points does not change in the last 5 rounds. If the convergence condition is met, clustering ends, and the outlier detection module 160 determines the sample points that do not belong to any cluster to be outliers. If a sample point is determined to be an outlier, an outlier mark can be added to the sample point so that the user can check it.
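The example convergence condition can be sketched as below. The sketch reads the condition as "at least 10 rounds have been run and the assignments in the last 5 rounds are identical", which is one possible interpretation; the function name `has_converged` and the representation of each round as a list of cluster indices (or `None` for unassigned points) are assumptions.

```python
from typing import List, Optional, Sequence


def has_converged(
    history: Sequence[List[Optional[int]]],
    max_rounds: int = 10,
    stable_rounds: int = 5,
) -> bool:
    """Check the example condition: enough rounds run and no change in the latest rounds."""
    if len(history) < max_rounds:
        return False
    # Assignments recorded in the last `stable_rounds` rounds.
    recent = history[-stable_rounds:]
    # Converged only if every recent round equals the first of them.
    return all(assignment == recent[0] for assignment in recent)
```

Once `has_converged` returns `True`, the sample points whose entry in the final round is `None` would be the ones reported as outliers.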

In one embodiment, the outlier detection system may further include a central point reselection module connected between the cluster division module 150 and the outlier detection module 160. The central point reselection module is configured to judge whether the obtained clusters meet the set convergence condition and, when they do not, to select the central point of each cluster again and to redetermine, according to the reselected central points, the cluster to which each unselected sample point belongs. That is, if the set convergence condition is not met, a central point is reselected in each of the existing K clusters, the distance weight and the distance between each sample point and the reselected central points are recalculated, and the sample points are divided into clusters again according to the recalculated distances. This is repeated until the convergence condition is met.

Purely clustering-based outlier detection has another limitation: it depends heavily on the selected clustering algorithm, and different clustering algorithms may produce different clustering results for the same sample space data. To reduce the effect of this limitation on outlier detection, in one embodiment, as shown in Fig. 5, the outlier detection system may further include an outlier supplement module 170 connected to the outlier detection module. As shown in Fig. 6, the outlier supplement module 170 includes:

A threshold obtaining unit 1701, configured to calculate the average deviation of each cluster and obtain the threshold of each cluster according to the average deviation;

A candidate outlier obtaining unit 1702, configured to obtain, in each cluster, the sample points whose distance to the corresponding central point is greater than the threshold, and to take the obtained sample points as candidate outliers;

A supplementary outlier obtaining unit 1703, configured to screen all candidate outliers to obtain supplementary outliers.

In one embodiment, the threshold obtaining unit 1701 calculates the average deviation of each cluster as follows: the standard deviation of each cluster in each dimension is obtained from the total number of sample points in the cluster, the value of each sample point of the cluster in each dimension, and the mean of all sample points of the cluster in each dimension; the average of the standard deviations over all dimensions of each cluster is then calculated to obtain the average deviation of that cluster.
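The per-cluster calculation described above can be sketched as follows. The function name `cluster_average_deviation` is hypothetical, and the sketch assumes the population standard deviation (dividing by the number of sample points in the cluster); the text does not state whether the sample form (dividing by that number minus one) is intended.

```python
import math
from typing import Sequence


def cluster_average_deviation(points: Sequence[Sequence[float]]) -> float:
    """Average, over all dimensions, of the per-dimension standard deviation of one cluster."""
    n = len(points)
    if n == 0:
        raise ValueError("the cluster must contain at least one sample point")
    dims = len(points[0])
    if dims == 0:
        raise ValueError("sample points must have at least one dimension")
    stds = []
    for j in range(dims):
        # Mean of all sample points of the cluster in dimension j.
        mean_j = sum(p[j] for p in points) / n
        # Population variance in dimension j (assumption: divide by n).
        var_j = sum((p[j] - mean_j) ** 2 for p in points) / n
        stds.append(math.sqrt(var_j))
    # Average of the standard deviations over all dimensions.
    return sum(stds) / dims
```

For a cluster containing the two points (0, 0) and (2, 4), the per-dimension standard deviations are 1 and 2, so the sketch returns 1.5. How the threshold is derived from this value is not specified in this passage and is therefore left out of the sketch.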

In one embodiment, the supplementary outlier obtaining unit 1703 sorts all candidate outliers by their distance to the central point and, starting from the candidate outlier with the largest distance, selects a preset number of candidate outliers as the supplementary outliers.
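The candidate selection and screening steps can be sketched together. The function name `supplementary_outliers`, the tuple layout (point identifier, cluster index, distance to the cluster's central point), and the per-cluster `thresholds` list are assumptions for illustration; the thresholds are taken as given.

```python
from typing import Hashable, List, Sequence, Tuple


def supplementary_outliers(
    members: Sequence[Tuple[Hashable, int, float]],
    thresholds: Sequence[float],
    preset_number: int,
) -> List[Hashable]:
    """Pick the preset number of candidate outliers with the largest distances."""
    if preset_number < 0:
        raise ValueError("preset_number must be non-negative")
    # Candidate outliers: points whose distance to their own central point
    # is greater than the threshold of their cluster.
    candidates = [
        (point_id, dist)
        for point_id, cluster, dist in members
        if dist > thresholds[cluster]
    ]
    # Sort from the largest distance to the smallest.
    candidates.sort(key=lambda item: item[1], reverse=True)
    # Keep the first `preset_number` candidates as supplementary outliers.
    return [point_id for point_id, _ in candidates[:preset_number]]
```

If fewer candidates exist than the preset number, the sketch simply returns all of them.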

Compared with the prior art, the above outlier detection method and system have the following advantages:

1. The user does not need any domain knowledge, and category marks can be added to the outliers. Outliers are screened out directly in the sample space without plotting, so the method is not restricted by the data volume and can detect outliers accurately even when the data volume is large;

2. The Mahalanobis distance between a sample point and a central point is calculated, which, compared with the plotting approach in the conventional art, takes the correlation between variables into account and makes the detected outliers more accurate;

3. Adding the distance weight factor makes the clustering result fit the actual clusters more closely, which yields a better clustering effect and more accurate screening of outliers. In addition, the distance weight factor depends on the number of dimensions, which supports outlier detection on large-scale high-dimensional data on a big data platform, mitigates the computational complexity and difficulty brought by high-dimensional samples, suits high-dimensional sample spaces, and requires no prior knowledge of the distribution characteristics of the data in the sample space;

4. The result of the direct clustering is supplemented with outliers using a strategy based on the standard-deviation distance, which corrects the clustering result well. This prevents sample points that are farthest from the final central point, and are in fact outliers, from being wrongly assigned to a cluster during clustering, and improves the accuracy of outlier detection;

5. Purely clustering-based outlier detection has two limitations: first, outliers present during clustering can affect the cluster division or the clustering result; second, it depends heavily on the selected clustering algorithm, and different clustering algorithms may produce different clustering results for the same sample space data. The present invention reduces the effect of both limitations on outlier detection.

A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.

The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. An outlier detection method, characterized by comprising the steps of:

Obtaining a sample space to be detected, wherein the sample space includes a plurality of sample points, and each sample point includes several dimensions;

Calculating a distance weight between each unselected sample point in the sample space and each central point, wherein the distance weight is the ratio of the number of dimensions in which the sample point and the central point have the same value to the total number of dimensions;

Determining the sample points that do not belong to any cluster to be outliers.

2. The outlier detection method according to claim 1, characterized in that, after the sample points that do not belong to any cluster are determined to be outliers, the method further comprises the steps of:

Calculating the average deviation of each cluster, and obtaining the threshold of each cluster according to the average deviation;

Obtaining, in each cluster, the sample points whose distance to the corresponding central point is greater than the threshold, and taking the obtained sample points as candidate outliers;

Screening all candidate outliers to obtain supplementary outliers.

3. The outlier detection method according to claim 2, characterized in that screening all candidate outliers to obtain supplementary outliers comprises:

Sorting all candidate outliers by their distance to the central point and, starting from the candidate outlier with the largest distance, selecting a preset number of candidate outliers as the supplementary outliers.

4. The outlier detection method according to claim 2, characterized in that calculating the average deviation of each cluster comprises:

Obtaining the standard deviation of each cluster in each dimension from the total number of sample points in the cluster, the value of each sample point of the cluster in each dimension, and the mean of all sample points of the cluster in each dimension;

Calculating the average of the standard deviations over all dimensions of each cluster to obtain the average deviation of each cluster.

5. The outlier detection method according to any one of claims 1 to 4, characterized in that calculating, according to the distance weight, the distance between each unselected sample point and each central point comprises:

Calculating the Mahalanobis distance between each unselected sample point and each central point according to the reciprocal of the distance weight between the sample point and the central point, the values of the dimensions of the sample point, and the values of the dimensions of the central point.

6. The outlier detection method according to any one of claims 1 to 4, characterized in that, after the cluster to which each unselected sample point belongs is determined according to the distance and before the sample points that do not belong to any cluster are determined to be outliers, the method further comprises the steps of:

Judging whether the obtained clusters meet the set convergence condition;

If the obtained clusters do not meet the set convergence condition, selecting the central point of each cluster again, and redetermining, according to the reselected central points, the cluster to which each unselected sample point belongs.

7. The outlier detection method according to claim 6, characterized in that determining the cluster to which each unselected sample point belongs comprises:

If exactly one central point is nearest to a sample point, assigning the sample point to the cluster of that nearest central point; if several central points are equally nearest to a sample point, not assigning the sample point to any cluster.

8. An outlier detection system, characterized by comprising:

A sample space acquisition module, configured to obtain a sample space to be detected, wherein the sample space includes a plurality of sample points, and each sample point includes several dimensions;

A distance weight acquisition module, configured to calculate a distance weight between each unselected sample point in the sample space and each central point, wherein the distance weight is the ratio of the number of dimensions in which the sample point and the central point have the same value to the total number of dimensions;

A distance acquisition module, configured to calculate, according to the distance weight, the distance between each unselected sample point and each central point;

9. The outlier detection system according to claim 8, characterized by further comprising an outlier supplement module connected to the outlier detection module, the outlier supplement module comprising:

A threshold obtaining unit, configured to calculate the average deviation of each cluster and obtain the threshold of each cluster according to the average deviation;

A candidate outlier obtaining unit, configured to obtain, in each cluster, the sample points whose distance to the corresponding central point is greater than the threshold, and to take the obtained sample points as candidate outliers;

A supplementary outlier obtaining unit, configured to screen all candidate outliers to obtain supplementary outliers.

10. The outlier detection system according to claim 9, characterized in that the supplementary outlier obtaining unit sorts all candidate outliers by their distance to the central point and, starting from the candidate outlier with the largest distance, selects a preset number of candidate outliers as the supplementary outliers.