CN107122475A

CN107122475A - Big data outlier detection method and system

Info

Publication number: CN107122475A
Application number: CN201710302132.3A
Authority: CN
Inventors: 江有归; 封雷; 刘东升
Original assignee: HANGZHOU ADTIME TECHNOLOGY Co Ltd
Current assignee: HANGZHOU ADTIME TECHNOLOGY Co Ltd
Priority date: 2017-05-02
Filing date: 2017-05-02
Publication date: 2017-09-01

Abstract

The invention discloses a big data outlier detection method. The method comprises: calculating dimension attribute weights of high-dimensional data, the high-dimensional data having several dimension attributes; removing the dimension attributes whose weight is below a first predetermined threshold; partitioning the data set formed by the high-dimensional data using a standard cut block to obtain a partition result related to the data distribution; projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the partition result; determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold; calculating an angle variance factor for the high-dimensional data that do not belong to the normal blocks; and marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold. The invention also discloses a big data outlier detection system. The invention solves the problem of poor outlier detection performance in the prior art.

Description

Big data outlier detection method and system

Technical field

The present invention relates to the technical field of data processing, and in particular to a big data outlier detection method and system.

Background art

In the big data era, data quality directly affects the performance of big data analysis and processing methods, and in turn the decision-making process. By analyzing massive data, the patterns and rules hidden in a data set can be discovered. However, abnormal data in the data set can severely interfere with the analysis. Data anomaly detection has therefore become a popular research topic in the field of exploiting big data through data mining.

Abnormal data in big data have the following characteristics: (1) they differ markedly from normal data; (2) their generation mechanism differs from that of normal data and may be unknown; (3) their dimensionality is high: like normal data, abnormal data are high-dimensional.

Because the data points of a high-dimensional data set are sparsely distributed in the high-dimensional space, conventional methods cannot handle anomaly detection in big data. Existing high-dimensional anomaly detection methods generally use the Euclidean distance between data points as the index for judging whether a data point is abnormal.

However, distance in a high-dimensional space is not an effective index for high-dimensional data. Some schemes therefore use the cosine angle instead of the Euclidean distance as the index, but the time complexity of such methods rises sharply as the data set grows, and ordinary angle comparison methods do not work well on data sets that are not circular in shape.

Therefore, the prior art still needs to be improved.

Summary of the invention

In view of the above deficiencies of the prior art, the object of the invention is to provide a big data outlier detection method and system, intended to solve the problem of poor outlier detection performance in the prior art.

To achieve this object, the invention adopts the following technical solution:

A big data outlier detection method, comprising:

calculating dimension attribute weights of high-dimensional data, the high-dimensional data having several dimension attributes;

removing the dimension attributes whose weight is below a first predetermined threshold;

partitioning the data set formed by the high-dimensional data using a standard cut block to obtain a partition result related to the data distribution;

projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the partition result;

determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;

calculating an angle variance factor for the high-dimensional data that do not belong to the normal blocks;

marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.

In the method, the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.

In the method, the method further comprises:

determining the data-dense interval of the data set on each dimension attribute;

obtaining several data-dense regions of the data set from the data-dense intervals;

taking the hyper-rectangle with the smallest volume among the data-dense regions as the standard cut block.

In the method, calculating the dimension attribute weights of the high-dimensional data specifically comprises:

calculating the dimension attribute weight using the following formula:

$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$

where r(p_i) is the dimension attribute weight; p_i and p_j are the i-th and j-th dimension attributes, respectively; r_u(p_i) is the independent weight of the i-th dimension attribute when its correlation with the other dimension attributes is not considered; and r_c(p_i, p_j) is the joint weight of the i-th and j-th dimension attributes.

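In the method, the independent weight is calculated using the following formula:

$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$

where E_i is the entropy of p_i, computed over the attribute values of the dimension attribute p_i, indexed by j.

The joint weight is calculated using the following formula:

$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$

where, in the definition of H(p_i, p_j), x_i ∈ p_i denotes that x_i is one of the values of the dimension attribute p_i; q(x_i) is the probability density of x_i; and q(x_i, x_j) is the joint probability density of x_i and x_j.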

A big data outlier detection system, comprising:

a dimensionality reduction module, configured to calculate dimension attribute weights of high-dimensional data, the high-dimensional data having several dimension attributes, and to remove the dimension attributes whose weight is below a first predetermined threshold;

a data set partitioning module, configured to partition the data set formed by the high-dimensional data using a standard cut block to obtain a partition result related to the data distribution; to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the partition result; and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;

an abnormal data determination module, configured to calculate an angle variance factor for the high-dimensional data that do not belong to the normal blocks, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.

In the system, the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.

In the system, the data set partitioning module is further configured to: determine the data-dense interval of the data set on each dimension attribute; obtain several data-dense regions of the data set from the data-dense intervals; and take the hyper-rectangle with the smallest volume among the data-dense regions as the standard cut block.

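In the system, the dimensionality reduction module is specifically configured to calculate the dimension attribute weight using the following formula:

$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$

In the system, the dimensionality reduction module is specifically configured to calculate the independent weight using the following formula:

$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$

and to calculate the joint weight using the following formula:

$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$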

Beneficial effects: The big data outlier detection method and system provided by the invention are based on the ideas of data region partitioning and attribute reduction. They effectively overcome the high time complexity and unsatisfactory results of existing abnormal data detection methods on large-scale high-dimensional data, and perform well on the complex problem of outlier detection in large-scale high-dimensional data sets.

Brief description of the drawings

Fig. 1 is a flowchart of the big data outlier detection method provided by an embodiment of the invention;

Fig. 2 is a diagram of the two-dimensional plane blocks provided by an embodiment of the invention;

Fig. 3 is a diagram of adjacent data blocks provided by an embodiment of the invention;

Fig. 4 is a functional block diagram of the big data outlier detection system provided by an embodiment of the invention.

Detailed description of the embodiments

The invention provides a big data outlier detection method and system. To make the purpose, technical solution, and effects of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

Fig. 1 is a flowchart of a big data outlier detection method provided by an embodiment of the invention. The method comprises the following steps:

100. Calculate the dimension attribute weights of the high-dimensional data, the high-dimensional data having several dimension attributes.

200. Remove the dimension attributes whose weight is below the first predetermined threshold.

Steps 100 and 200 are preprocessing. Big data generally consist of high-dimensional data, which are characterized by a large number of dimension attributes. Let p_i denote the attribute in the i-th dimension. Depending on the needs of the analysis, the importance of each dimension attribute differs. Therefore, before outlier detection, redundant attributes can first be removed according to their weights to reduce the dimensionality.

In high-dimensional data, certain correlations exist between dimension attributes, so the weight of one dimension attribute is influenced by the other attributes. In this embodiment of the invention, the dimension attribute weight may be calculated, taking the influence of the other attributes into account, as follows:

For a high-dimensional data set D, let P = {p_1, p_2, …, p_i, …, p_n} be its set of dimension attributes, where n is the number of attributes. The weight r(p_i) of a dimension attribute p_i in the set can be calculated by formula (1):

$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right| \tag{1}$$

where r(p_i) is the dimension attribute weight; p_i and p_j are the i-th and j-th dimension attributes, respectively; r_u(p_i) is the independent weight of the i-th dimension attribute when its correlation with the other dimension attributes is not considered; and r_c(p_i, p_j) is the joint weight of the i-th and j-th dimension attributes.

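The independent weight can be calculated by formula (2):

$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)} \tag{2}$$

where E_i is the entropy of p_i, computed over the attribute values of the dimension attribute p_i, indexed by j.

The joint weight can be calculated by formula (3):

$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)} \tag{3}$$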

After the dimension attribute weights have been calculated, a suitable threshold can be set according to actual needs and used to screen out some dimension attributes, thereby reducing the dimensionality. Dimension attributes whose weight is below the threshold are removed from the set, which reduces the cost of subsequent operations. Preferably, a first predetermined threshold η ∈ [0.2, 0.25] gives a good screening effect.
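Steps 100 and 200 can be illustrated with a minimal Python sketch. It assumes the independent weights r_u and the joint weights r_c have already been computed and are passed in as plain lists; the function names `attribute_weights` and `reduce_attributes` and the sample values are hypothetical and do not come from the patent. The sketch applies formula (1) and then keeps the attributes whose weight is not below η.

```python
from typing import List, Sequence


def attribute_weights(r_u: Sequence[float],
                      r_c: Sequence[Sequence[float]]) -> List[float]:
    """Combine independent weights r_u[i] and joint weights r_c[i][j]
    into r(p_i) as written in formula (1)."""
    n = len(r_u)
    if n < 2:
        raise ValueError("formula (1) divides by 2(n-1), so n must be >= 2")
    weights = []
    for i in range(n):
        # Sum over j != i of [r_c(p_i, p_j) - r_u(p_i)]
        correction = sum(r_c[i][j] - r_u[i] for j in range(n) if j != i)
        weights.append(abs(r_u[i] - correction / (2 * (n - 1))))
    return weights


def reduce_attributes(weights: Sequence[float], eta: float) -> List[int]:
    """Return the indices of the attributes that survive the screening,
    i.e. whose weight is not below the first predetermined threshold."""
    return [i for i, w in enumerate(weights) if w >= eta]


# Hypothetical input values, chosen only for illustration.
r_u = [0.5, 0.3, 0.2]
r_c = [[0.0, 0.4, 0.1],
       [0.4, 0.0, 0.5],
       [0.1, 0.5, 0.0]]

weights = attribute_weights(r_u, r_c)
kept = reduce_attributes(weights, eta=0.2)
print(weights, kept)
```

With these sample values the third attribute falls below η = 0.2 and is removed, while the first two are kept.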

300. Partition the data set formed by the high-dimensional data using a standard cut block to obtain a partition result related to the data distribution.

Further, on each dimension attribute of the data set D there is an interval in which the data are relatively concentrated. From these intervals, the data-dense regions R of the whole data set D can be obtained. In some embodiments, the hyper-rectangle with the smallest volume in R is selected as the standard cut block, and the data set D is partitioned according to the length l_i of the standard cut block in each dimension.
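The partition in step 300 can be sketched as follows. The text does not say how the data-dense regions are found, so this sketch assumes they are already available as lists of per-dimension (low, high) intervals. The names `standard_block_lengths` and `partition`, and the choice of the per-dimension minimum as the grid origin, are assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Region = Sequence[Tuple[float, float]]  # one (low, high) interval per dimension


def standard_block_lengths(regions: Sequence[Region]) -> List[float]:
    """Pick the hyper-rectangle with the smallest volume and return its
    side length l_i in each dimension."""
    def volume(region: Region) -> float:
        return math.prod(high - low for low, high in region)

    smallest = min(regions, key=volume)
    return [high - low for low, high in smallest]


def partition(points: Sequence[Sequence[float]],
              lengths: Sequence[float]) -> Dict[Tuple[int, ...], List[int]]:
    """Assign every point to a grid cell whose sides are the lengths l_i.
    The grid origin is the per-dimension minimum (an assumption)."""
    dims = range(len(lengths))
    origin = [min(p[d] for p in points) for d in dims]
    cells: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for index, p in enumerate(points):
        key = tuple(math.floor((p[d] - origin[d]) / lengths[d]) for d in dims)
        cells[key].append(index)
    return dict(cells)


# Hypothetical dense regions and data points.
regions = [[(0.0, 2.0), (0.0, 3.0)],   # volume 6
           [(1.0, 2.0), (4.0, 5.0)]]   # volume 1, becomes the standard block
points = [(0.0, 0.0), (0.4, 0.9), (1.2, 0.1), (2.5, 2.5)]

lengths = standard_block_lengths(regions)
cells = partition(points, lengths)
print(lengths, cells)
```

Each key of `cells` is the integer grid coordinate of a block, and each value lists the indices of the points that fall in it.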

400. Project the high-dimensional data onto a two-dimensional plane to obtain the two-dimensional plane blocks corresponding to the partition result. After the partition, the data can be projected onto a two-dimensional plane to obtain the corresponding two-dimensional plane blocks rec_i.

500. When the data density of a two-dimensional plane block is greater than the second predetermined threshold, determine the block to be a normal block.

Fig. 2 is a schematic diagram of the two-dimensional plane blocks provided by an embodiment of the invention. As shown in Fig. 2, the data density differs from one two-dimensional plane block to another. The data density of a two-dimensional plane block can be calculated by formula (4), in which count(rec_i) and acr(rec_i) are, respectively, the number of data points contained in rec_i and the area of the block. When the data density is greater than a certain threshold, the block is marked as a normal block. In some embodiments, the threshold can be the average data density of all blocks.
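Steps 400 and 500 can be sketched in Python under several stated assumptions: the projection simply keeps two chosen coordinates, the data density is taken as count(rec_i) divided by acr(rec_i), and the average is taken over the non-empty blocks only. None of these choices is confirmed by the text, and the names `block_densities` and `normal_blocks` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Set, Tuple

Block = Tuple[int, int]


def block_densities(points: Sequence[Sequence[float]],
                    dims: Tuple[int, int],
                    lengths: Tuple[float, float]) -> Dict[Block, float]:
    """Project points onto the two coordinates in dims, bin them into
    blocks of the given side lengths, and return count / area per block."""
    a, b = dims
    counts = Counter(
        (math.floor(p[a] / lengths[0]), math.floor(p[b] / lengths[1]))
        for p in points
    )
    area = lengths[0] * lengths[1]
    return {block: count / area for block, count in counts.items()}


def normal_blocks(densities: Dict[Block, float]) -> Set[Block]:
    """Mark a block as normal when its density exceeds the average density
    of all (non-empty) blocks, used here as the second threshold."""
    threshold = sum(densities.values()) / len(densities)
    return {block for block, d in densities.items() if d > threshold}


# Hypothetical three-dimensional points; the third coordinate is dropped.
points = [(0.1, 0.2, 9.0), (0.3, 0.4, 9.0), (0.5, 0.5, 9.0),
          (0.7, 0.1, 9.0), (0.9, 0.8, 9.0), (1.5, 0.5, 9.0),
          (3.2, 3.7, 9.0)]

densities = block_densities(points, dims=(0, 1), lengths=(1.0, 1.0))
normal = normal_blocks(densities)
print(densities, normal)
```

Here the average density is 7/3, so only the block holding five points is marked as normal.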

After at least one normal block has been obtained, a suitable strategy can be used to complete the classification of the remaining blocks of the data set (that is, to extend the normal blocks). For example, as shown in Fig. 3, if the density of a block adjacent to a normal block also satisfies the above condition, that block can be added to the queue of contiguous normal blocks.
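One way to read this extension strategy is as a breadth-first search that starts from a known normal block and absorbs adjacent blocks that also pass the density test. The sketch below assumes four-neighbour adjacency on the block grid, which the text does not specify; the function name `grow_normal_region` and the sample densities are hypothetical.

```python
from collections import deque
from typing import Dict, Set, Tuple

Block = Tuple[int, int]


def grow_normal_region(seed: Block,
                       densities: Dict[Block, float],
                       threshold: float) -> Set[Block]:
    """Starting from a normal block, repeatedly add adjacent blocks whose
    density also exceeds the threshold (4-neighbour adjacency assumed)."""
    region = {seed}
    queue = deque([seed])
    while queue:
        x, y = queue.popleft()
        for neighbour in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            # Blocks missing from the map hold no data, so density 0.
            if (neighbour not in region
                    and densities.get(neighbour, 0.0) > threshold):
                region.add(neighbour)
                queue.append(neighbour)
    return region


# Hypothetical block densities.
densities = {(0, 0): 5.0, (1, 0): 4.0, (2, 0): 1.0,
             (3, 0): 6.0, (1, 1): 3.5}

region = grow_normal_region((0, 0), densities, threshold=3.0)
print(sorted(region))
```

Block (3, 0) is dense but is separated from the seed by the sparse block (2, 0), so it would start a separate contiguous region rather than join this one.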

600. Calculate the angle variance factor of the high-dimensional data that do not belong to the normal blocks.

After the normal blocks have been determined in step 500, multiple contiguous normal data blocks are obtained in the data set D. The data points that are not in these normal data blocks are identified and mapped back to the high-dimensional space, and the angle variance factor of these data points is calculated.

In some embodiments, the angle variance factor can be calculated by formula (5), in which x_ix_k denotes the vector difference of the data points x_i and x_k; x_i and x_j fall inside the normal data block REC_i, and x_k falls outside the normal data blocks.

700. When the angle variance factor is greater than the third predetermined threshold, mark the high-dimensional data as abnormal data. After the angle variance factor has been calculated, a third predetermined threshold can likewise be set to determine whether the data are abnormal.
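Formula (5) is not reproduced in the text, so the following Python sketch is only a stand-in: it computes the plain variance of the angles between the difference vectors x_i − x_k and x_j − x_k over all pairs (x_i, x_j) in a normal block. How the patent weights or normalizes this quantity is unknown, so the sketch stops at the raw variance and leaves the comparison with the third threshold to the caller. The function name `angle_variance` and the sample points are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def angle_variance(x_k: Sequence[float],
                   block_points: Sequence[Sequence[float]]) -> float:
    """Variance of the angles (in radians) between x_i - x_k and x_j - x_k
    for all pairs of points (x_i, x_j) inside a normal block."""
    angles: List[float] = []
    for x_i, x_j in combinations(block_points, 2):
        a = [p - q for p, q in zip(x_i, x_k)]
        b = [p - q for p, q in zip(x_j, x_k)]
        norm_a = math.sqrt(sum(v * v for v in a))
        norm_b = math.sqrt(sum(v * v for v in b))
        if norm_a == 0.0 or norm_b == 0.0:
            continue  # x_k coincides with a block point; angle undefined
        cosine = sum(u * v for u, v in zip(a, b)) / (norm_a * norm_b)
        # Clamp to guard against rounding slightly outside [-1, 1].
        angles.append(math.acos(max(-1.0, min(1.0, cosine))))
    if not angles:
        raise ValueError("need at least one valid pair of block points")
    mean = sum(angles) / len(angles)
    return sum((t - mean) ** 2 for t in angles) / len(angles)


# Hypothetical normal block (unit square corners) and two outside points.
block = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
far = angle_variance((10.0, 10.0), block)
near = angle_variance((0.5, 1.5), block)
print(far, near)
```

A point far from the block sees all block points within a narrow cone, so its raw angle variance is much smaller than that of a point close to the block. Any threshold rule built on this stand-in has to account for that direction, which may differ from the factor defined by formula (5).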

An embodiment of the invention also provides a big data outlier detection system. As shown in Fig. 4, the system comprises: a dimensionality reduction module 100, configured to calculate dimension attribute weights of high-dimensional data, the high-dimensional data having several dimension attributes, and to remove the dimension attributes whose weight is below a first predetermined threshold; a data set partitioning module 200, configured to partition the data set formed by the high-dimensional data using a standard cut block to obtain a partition result related to the data distribution, to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the partition result, and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold; and an abnormal data determination module 300, configured to calculate an angle variance factor for the high-dimensional data that do not belong to the normal blocks, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.

Specifically, the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.

In other embodiments of the invention, the data set partitioning module 200 is further configured to: determine the data-dense interval of the data set on each dimension attribute; obtain several data-dense regions of the data set from the data-dense intervals; and take the hyper-rectangle with the smallest volume among the data-dense regions as the standard cut block.

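The dimensionality reduction module 100 is specifically configured to calculate the dimension attribute weight using the following formula:

$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$

Optionally, the independent weight is calculated using the following formula:

$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$

The joint weight is calculated using the following formula:

$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$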

It will be understood that those of ordinary skill in the art may make equivalent substitutions or changes according to the technical solution and inventive concept of the invention, and that all such changes or substitutions shall fall within the protection scope of the appended claims.

Claims

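1. A big data outlier detection method, characterized in that the method comprises:

calculating dimension attribute weights of high-dimensional data, the high-dimensional data having several dimension attributes;

removing the dimension attributes whose weight is below a first predetermined threshold;

partitioning the data set formed by the high-dimensional data using a standard cut block to obtain a partition result related to the data distribution;

projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the partition result;

determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;

calculating an angle variance factor for the high-dimensional data that do not belong to the normal blocks;

marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.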

2. The method according to claim 1, characterized in that the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.

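3. The method according to claim 1, characterized in that the method further comprises:

determining the data-dense interval of the data set on each dimension attribute;

obtaining several data-dense regions of the data set from the data-dense intervals;

taking the hyper-rectangle with the smallest volume among the data-dense regions as the standard cut block.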

4. The method according to claim 1, characterized in that calculating the dimension attribute weights of the high-dimensional data specifically comprises:

calculating the dimension attribute weight using the following formula:

$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$

where r(p_i) is the dimension attribute weight; p_i and p_j are the i-th and j-th dimension attributes, respectively; r_u(p_i) is the independent weight of the i-th dimension attribute when its correlation with the other dimension attributes is not considered; and r_c(p_i, p_j) is the joint weight of the i-th and j-th dimension attributes.

5. The method according to claim 4, characterized in that the independent weight is calculated using the following formula:

$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$

The joint weight is calculated using the following formula:

$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$

where, in the definition of H(p_i, p_j), x_i ∈ p_i denotes that x_i is one of the values of the dimension attribute p_i; q(x_i) is the probability density of x_i; and q(x_i, x_j) is the joint probability density of x_i and x_j.

6. A big data outlier detection system, characterized by comprising:

a dimensionality reduction module, configured to calculate dimension attribute weights of high-dimensional data, the high-dimensional data having several dimension attributes, and to remove the dimension attributes whose weight is below a first predetermined threshold;

a data set partitioning module, configured to partition the data set formed by the high-dimensional data using a standard cut block to obtain a partition result related to the data distribution; to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the partition result; and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;

an abnormal data determination module, configured to calculate an angle variance factor for the high-dimensional data that do not belong to the normal blocks, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.

7. The system according to claim 6, characterized in that the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.

8. The system according to claim 6, characterized in that the data set partitioning module is further configured to: determine the data-dense interval of the data set on each dimension attribute; obtain several data-dense regions of the data set from the data-dense intervals; and take the hyper-rectangle with the smallest volume among the data-dense regions as the standard cut block.

9. The system according to claim 6, characterized in that the dimensionality reduction module is specifically configured to calculate the dimension attribute weight using the following formula:

$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$

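10. The system according to claim 9, characterized in that the dimensionality reduction module is specifically configured to calculate the independent weight using the following formula:

$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$

and to calculate the joint weight using the following formula:

$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$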