CN107122475A - Big data abnormal point detecting method and its system - Google Patents

Big data abnormal point detecting method and its system

Info

Publication number
CN107122475A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710302132.3A
Other languages
Chinese (zh)
Inventor
江有归
封雷
刘东升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HANGZHOU ADTIME TECHNOLOGY Co Ltd
Original Assignee
HANGZHOU ADTIME TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HANGZHOU ADTIME TECHNOLOGY Co Ltd
Priority to CN201710302132.3A
Publication of CN107122475A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data abnormal point detection method. The method includes: calculating the dimension attribute weights of high-dimensional data, the high-dimensional data having a number of dimension attributes; removing the dimension attributes whose weight is less than a first predetermined threshold; in the data set formed by the high-dimensional data, dividing the data set using a standard slicing block to obtain a division result related to the data distribution; projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result; determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold; calculating the angle variance factor of the high-dimensional data that does not belong to the normal blocks; and marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold. The invention also discloses a big data abnormal point detection system. The invention addresses the poor performance of data abnormal point detection in the prior art.

Description

Big data abnormal point detecting method and its system
Technical field
The present invention relates to the technical field of data processing, and in particular to a big data abnormal point detection method and system.
Background
In the big data era, data quality directly affects the effectiveness of big data analysis and processing methods, and also influences the decision-making process. By analyzing massive data, the patterns and rules implicit in a data set can be discovered. However, abnormal data in the data set can severely interfere with the analysis process. Therefore, in research on exploiting big data through data mining methods, data anomaly detection has become a popular research topic.
Abnormal data in big data has the following characteristics: 1. it differs obviously from normal data; 2. its generation mechanism differs from that of normal data and may be unknown; 3. its dimensionality is high: abnormal data, like normal data, is high-dimensional data.
The data points of a high-dimensional data set are sparsely distributed in the high-dimensional space, so conventional methods cannot handle the abnormal data detection problem in big data. Existing high-dimensional abnormal data detection methods usually use the Euclidean distance between data points as the index for judging whether a data point is abnormal.
However, the distance between high-dimensional data points in a high-dimensional space cannot serve as an effective index. Some schemes therefore introduce the cosine angle to replace the Euclidean distance as the index, but the time complexity of this approach increases sharply as the data set grows, and ordinary angle comparison methods do not work well on data sets that are not round in shape.
Therefore, the prior art still needs to be improved.
Summary of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a big data abnormal point detection method and system, intended to solve the problem of poor data abnormal point detection performance in the prior art.
To achieve the above object, the present invention adopts the following technical solutions:
A big data abnormal point detection method, wherein the method includes:
calculating the dimension attribute weights of high-dimensional data, the high-dimensional data having a number of dimension attributes;
removing the dimension attributes whose dimension attribute weight is less than a first predetermined threshold;
in the data set formed by the high-dimensional data, dividing the data set using a standard slicing block to obtain a division result related to the data distribution;
projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result;
determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
calculating the angle variance factor of the high-dimensional data that does not belong to the normal blocks;
marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
In the described method, the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
In the described method, the method further includes:
determining the data-dense interval of the data set on each dimension attribute;
obtaining several data-dense regions of the data set according to the data-dense intervals;
taking the hypermatrix with the smallest volume among the data-dense regions as the standard slicing block.
In the described method, calculating the dimension attribute weights of the high-dimensional data specifically includes:
calculating the dimension attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimension attribute weight; $p_i$ and $p_j$ are the $i$-th and $j$-th dimension attributes respectively; $r_u(p_i)$ is the independent weight of the $i$-th dimension attribute when its dependency relationships with other dimension attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the $i$-th and $j$-th dimension attributes.
In the described method, the independent weight is calculated using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$ [entropy expression not reproduced], defined in terms of the $j$-th attribute value of dimension attribute $p_i$; and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where [definition of $H(p_i, p_j)$ not reproduced]; $x_i \in p_i$ indicates that $x_i$ is one of the values of dimension attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
A big data abnormal point detection system, wherein the system includes:
a dimensionality reduction module, configured to calculate the dimension attribute weights of high-dimensional data, the high-dimensional data having a number of dimension attributes, and to remove the dimension attributes whose dimension attribute weight is less than a first predetermined threshold;
a data set partitioning module, configured to divide, in the data set formed by the high-dimensional data, the data set using a standard slicing block to obtain a division result related to the data distribution; to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result; and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
an abnormal data judgment module, configured to calculate the angle variance factor of the high-dimensional data that does not belong to the normal blocks, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
In the described system, the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
In the described system, the data set partitioning module is further configured to: determine the data-dense interval of the data set on each dimension attribute; obtain several data-dense regions of the data set according to the data-dense intervals; and take the hypermatrix with the smallest volume among the data-dense regions as the standard slicing block.
In the described system, the dimensionality reduction module is specifically configured to calculate the dimension attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimension attribute weight; $p_i$ and $p_j$ are the $i$-th and $j$-th dimension attributes respectively; $r_u(p_i)$ is the independent weight of the $i$-th dimension attribute when its dependency relationships with other dimension attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the $i$-th and $j$-th dimension attributes.
In the described system, the dimensionality reduction module is specifically configured to calculate the independent weight using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$ [entropy expression not reproduced], defined in terms of the $j$-th attribute value of dimension attribute $p_i$; and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where [definition of $H(p_i, p_j)$ not reproduced]; $x_i \in p_i$ indicates that $x_i$ is one of the values of dimension attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
Beneficial effects: The big data abnormal point detection method and system provided by the present invention are based on the ideas of data region partitioning and dimensionality reduction. They effectively overcome the drawbacks of existing abnormal data detection methods, namely high time complexity and unsatisfactory results when processing large-scale high-dimensional data, and perform well on the challenging problem of abnormal point detection in large-scale high-dimensional data sets.
Brief description of the drawings
Fig. 1 is a flowchart of the big data abnormal point detection method provided by an embodiment of the present invention;
Fig. 2 is a diagram of the two-dimensional plane blocks provided by an embodiment of the present invention;
Fig. 3 is a diagram of adjacent data blocks provided by an embodiment of the present invention;
Fig. 4 is a functional block diagram of the big data abnormal point detection system provided by an embodiment of the present invention.
Detailed description
The present invention provides a big data abnormal point detection method and system. To make the purpose, technical solution and effects of the present invention clearer and more definite, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
Fig. 1 is a flowchart of a big data abnormal point detection method provided by an embodiment of the present invention. The method includes the following steps:
100. Calculate the dimension attribute weights of the high-dimensional data, the high-dimensional data having a number of dimension attributes.
200. Remove the dimension attributes whose dimension attribute weight is less than the first predetermined threshold.
Steps 100 and 200 belong to the preprocessing stage. Big data generally contains high-dimensional data, which is characterized by a large number of dimension attributes. Let $p_i$ denote the attribute on the $i$-th dimension. Depending on the needs of the analysis, the importance of each dimension attribute differs. Therefore, before abnormal point detection is carried out, some redundant attributes can first be removed according to the attribute weights to reduce the dimensionality.
In high-dimensional data, certain dependency relationships exist between dimension attributes, and the weight of a dimension attribute is influenced by the other attributes. In embodiments of the present invention, the dimension attribute weight can therefore be calculated as follows, taking the influence of other attributes into account:
Let $P = \{p_1, p_2, \ldots, p_i, \ldots, p_n\}$ be the dimension attribute set of a high-dimensional data set $D$, where $n$ is its length. The weight $r(p_i)$ of a dimension attribute $p_i$ in the set can be calculated by formula (1):
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right| \tag{1}$$
where $r(p_i)$ is the dimension attribute weight; $p_i$ and $p_j$ are the $i$-th and $j$-th dimension attributes respectively; $r_u(p_i)$ is the independent weight of the $i$-th dimension attribute when its dependency relationships with other dimension attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the $i$-th and $j$-th dimension attributes.
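The combination in formula (1) can be sketched in a few lines of Python. The expression used here is the one given in claim 4; the function name `attribute_weights` and the array layout (a vector of independent weights and a square matrix of joint weights) are assumptions made for illustration only.

```python
import numpy as np

def attribute_weights(r_u, r_c):
    """Combine independent weights r_u (length n) and joint weights r_c (n x n)
    as in formula (1). Hypothetical helper, not taken from the patent."""
    r_u = np.asarray(r_u, dtype=float)
    r_c = np.asarray(r_c, dtype=float)
    n = r_u.shape[0]
    if n < 2:
        raise ValueError("formula (1) needs at least two dimension attributes")
    # Mask that keeps only the terms with j != i.
    off_diag = ~np.eye(n, dtype=bool)
    # Row i holds r_c(p_i, p_j) - r_u(p_i) for every j != i.
    diff = (r_c - r_u[:, None]) * off_diag
    correction = diff.sum(axis=1) / (2.0 * (n - 1))
    return np.abs(r_u - correction)
```

The diagonal of `r_c` is ignored, so any placeholder value can be stored there.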
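The independent weight can be calculated by formula (2):
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)} \tag{2}$$
where $E_i$ is the entropy of $p_i$ [entropy expression not reproduced], defined in terms of the $j$-th attribute value of dimension attribute $p_i$.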
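As a rough illustration of formula (2), the sketch below takes already-computed entropies $E_i$ as input and applies the expression given in claim 5, including its leading minus sign. How $E_i$ itself is obtained is not shown in this text, so the function deliberately leaves that to the caller; the name `independent_weights` is hypothetical.

```python
import numpy as np

def independent_weights(entropies):
    """Apply formula (2) to a vector of entropies E_i (one per dimension attribute).
    The entropy computation itself is assumed to happen elsewhere."""
    e = np.asarray(entropies, dtype=float)
    one_minus_e = 1.0 - e
    total = one_minus_e.sum()
    if total == 0:
        raise ValueError("sum of (1 - E_i) is zero; formula (2) is undefined")
    # Leading minus sign kept exactly as written in the claims.
    return -one_minus_e / total
```

Because formula (1) takes an absolute value at the end, the sign produced here does not by itself make the final weight negative.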
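The joint weight can be calculated by formula (3):
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)} \tag{3}$$
where [definition of $H(p_i, p_j)$ not reproduced]; $x_i \in p_i$ indicates that $x_i$ is one of the values of dimension attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.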
After the dimension attribute weights are calculated, a suitable threshold can be set according to actual needs to screen out or remove some dimension attributes, thereby reducing the dimensionality. Dimension attributes whose weight is below the threshold are removed from the set, which reduces the cost of subsequent operations. Preferably, setting the first predetermined threshold η to η ∈ [0.2, 0.25] gives a good screening effect.
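The screening step can be illustrated with a minimal sketch. It assumes the weights are already available as a mapping from attribute name to weight; the function name `reduce_attributes` and the default `eta=0.2` (the lower end of the preferred range) are illustrative choices.

```python
def reduce_attributes(weights, eta=0.2):
    """Return the names of the dimension attributes that survive step 200.

    weights: dict mapping attribute name -> dimension attribute weight.
    Attributes whose weight is less than eta are removed.
    """
    return [name for name, w in weights.items() if w >= eta]


# Hypothetical weights for three attributes.
kept = reduce_attributes({"age": 0.6, "zip": 0.1, "income": 0.2})
```

An attribute whose weight equals the threshold is kept here, since the text only removes weights that are less than the threshold.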
300. In the data set formed by the high-dimensional data, divide the data set using a standard slicing block to obtain a division result related to the data distribution.
Further, for data set $D$, an interval in which the data is relatively concentrated exists on each dimension attribute [interval notation not reproduced]. These intervals yield the data-dense regions of the whole data set $D$ [definition of $R$ not reproduced]. In some embodiments, the hypermatrix with the smallest volume in $R$ can be selected as the standard slicing block, and data set $D$ is divided according to the length $l_i$ of the standard slicing block in each dimension.
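One possible reading of the division step is a regular grid whose cell side in dimension $i$ is the block length $l_i$. The sketch below assumes that reading, treats the standard block as axis-aligned, and anchors the grid at the per-dimension minimum of the data; none of these choices is stated in the text, and `assign_blocks` is a hypothetical name.

```python
import numpy as np

def assign_blocks(data, block_lengths):
    """Map each data point to the integer index of the grid block it falls in.

    data: array of shape (m, d); block_lengths: the d side lengths l_i.
    The grid origin (per-dimension minimum) is an assumption of this sketch.
    """
    data = np.asarray(data, dtype=float)
    lengths = np.asarray(block_lengths, dtype=float)
    origin = data.min(axis=0)
    return np.floor((data - origin) / lengths).astype(int)
```

Points that share an index row belong to the same block, which is what the later density step counts.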
400. Project the high-dimensional data onto a two-dimensional plane to obtain the two-dimensional plane blocks corresponding to the division result. After the division, the corresponding two-dimensional plane blocks $rec_i$ can be obtained by projecting the data onto a two-dimensional plane.
500. When the data density of a two-dimensional plane block is greater than the second predetermined threshold, determine it to be a normal block.
Fig. 2 is a schematic diagram of the two-dimensional plane blocks provided by an embodiment of the present invention. As shown in Fig. 2, the data density differs from one two-dimensional plane block to another. The data density of a two-dimensional plane block can be calculated by formula (4) [formula (4) not reproduced],
where $count(rec_i)$ and $acr(rec_i)$ are respectively the number of data points contained in $rec_i$ and the area of the block. When the data density is greater than a certain threshold, the block is marked as a normal block. In some embodiments, the threshold can be the average data density of all blocks.
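The density test can be sketched as follows. Formula (4) is not shown in this text, so the sketch assumes the natural definition, density = $count(rec_i) / acr(rec_i)$, assumes all blocks share one area, and averages over occupied blocks only; `normal_blocks` is a hypothetical name.

```python
from collections import Counter

def normal_blocks(cells_2d, block_area):
    """Compute per-block density and the set of normal blocks.

    cells_2d: iterable of (row, col) block indices, one per projected data point.
    block_area: area shared by every two-dimensional plane block (assumed equal).
    """
    counts = Counter(map(tuple, cells_2d))
    # Assumed form of formula (4): number of points divided by block area.
    density = {cell: c / block_area for cell, c in counts.items()}
    # Second predetermined threshold: mean density over the occupied blocks.
    threshold = sum(density.values()) / len(density)
    normal = {cell for cell, d in density.items() if d > threshold}
    return density, normal
```

If empty grid cells should also count toward the average, the threshold would be lower; the text does not say which is intended.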
After at least one normal block is obtained, a certain strategy can be used to complete the judgment of the remaining blocks of the data set (that is, to extend the normal blocks). For example, as shown in Fig. 3, if the density of a block adjacent to a normal block also satisfies the above condition, that block can be added to the queue of continuous normal blocks.
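The extension strategy reads like a breadth-first region growth from a seed block. The sketch below assumes four-neighbor adjacency (up, down, left, right), which is a guess since Fig. 3 is not available here; `grow_normal_region` is a hypothetical name.

```python
from collections import deque

def grow_normal_region(seed, density, threshold):
    """Extend a normal block to the connected region of adjacent dense blocks.

    seed: (row, col) of a block already known to be normal.
    density: dict mapping (row, col) -> data density of that block.
    """
    region = {seed}
    queue = deque([seed])
    while queue:
        x, y = queue.popleft()
        # Four-neighbor adjacency is an assumption of this sketch.
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb not in region and density.get(nb, 0.0) > threshold:
                region.add(nb)
                queue.append(nb)
    return region
```

Blocks that are dense but not connected to the seed are left out of this region and would start a region of their own.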
600. Calculate the angle variance factor of the high-dimensional data that does not belong to the normal blocks.
After the normal block division of step 500 is completed, multiple continuous normal data blocks are obtained in data set $D$. The data points that are not in these normal data blocks are identified and returned to the high-dimensional space, and the angle variance factor of these data points is calculated.
In some embodiments, the angle variance factor can be calculated by formula (5) [formula (5) not reproduced],
where $x_i x_k$ denotes the vector difference of data points $x_i$ and $x_k$; $x_i$ and $x_j$ fall within the normal data block $REC_i$, and $x_k$ falls outside the normal data blocks.
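Formula (5) is not shown in this text. As a stand-in, the sketch below uses the plain variance of the distance-weighted angle term commonly used in angle-based outlier detection, taken over all pairs of points $x_i$, $x_j$ from a normal block as seen from the candidate point $x_k$. This choice, and the names `angle_variance_factor` and `is_abnormal`, are assumptions and may differ from the patent's actual formula.

```python
from itertools import combinations

import numpy as np

def angle_variance_factor(x_k, normal_points):
    """Variance of <a, b> / (|a|^2 |b|^2) over pairs of difference vectors
    a = x_i - x_k, b = x_j - x_k, with x_i, x_j taken from a normal block."""
    x_k = np.asarray(x_k, dtype=float)
    diffs = np.asarray(normal_points, dtype=float) - x_k
    terms = []
    for a, b in combinations(diffs, 2):
        na2, nb2 = a @ a, b @ b
        if na2 == 0 or nb2 == 0:
            continue  # skip reference points that coincide with x_k
        terms.append((a @ b) / (na2 * nb2))
    if not terms:
        raise ValueError("need at least two reference points distinct from x_k")
    return float(np.var(terms))

def is_abnormal(factor, third_threshold):
    # Comparison direction follows the step described in the text.
    return factor > third_threshold
```

The pairwise loop is quadratic in the number of reference points, which is one reason for restricting the computation to points outside the normal blocks.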
700. When the angle variance factor is greater than the third predetermined threshold, mark the high-dimensional data as abnormal data. After the angle variance factor is calculated, whether a data point is abnormal data can likewise be determined by setting the third predetermined threshold.
An embodiment of the present invention also provides a big data abnormal point detection system. As shown in Fig. 4, the system includes: a dimensionality reduction module 100, configured to calculate the dimension attribute weights of high-dimensional data, the high-dimensional data having a number of dimension attributes, and to remove the dimension attributes whose dimension attribute weight is less than the first predetermined threshold; a data set partitioning module 200, configured to divide, in the data set formed by the high-dimensional data, the data set using a standard slicing block to obtain a division result related to the data distribution, to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result, and to determine a two-dimensional plane block to be a normal block when its data density is greater than the second predetermined threshold; and an abnormal data judgment module 300, configured to calculate the angle variance factor of the high-dimensional data that does not belong to the normal blocks, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than the third predetermined threshold.
Specifically, the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
In other embodiments of the present invention, the data set partitioning module 200 is further configured to: determine the data-dense interval of the data set on each dimension attribute; obtain several data-dense regions of the data set according to the data-dense intervals; and take the hypermatrix with the smallest volume among the data-dense regions as the standard slicing block.
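The dimensionality reduction module 100 is specifically configured to calculate the dimension attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimension attribute weight; $p_i$ and $p_j$ are the $i$-th and $j$-th dimension attributes respectively; $r_u(p_i)$ is the independent weight of the $i$-th dimension attribute when its dependency relationships with other dimension attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the $i$-th and $j$-th dimension attributes.
Optionally, the independent weight is calculated using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$ [entropy expression not reproduced], defined in terms of the $j$-th attribute value of dimension attribute $p_i$; and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where [definition of $H(p_i, p_j)$ not reproduced]; $x_i \in p_i$ indicates that $x_i$ is one of the values of dimension attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.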
It can be understood that those of ordinary skill in the art may make equivalent substitutions or changes according to the technical solution and inventive concept of the present invention, and all such changes or substitutions shall fall within the protection scope of the claims appended to the present invention.

Claims (10)

1. A big data abnormal point detection method, characterized in that the method includes:
calculating the dimension attribute weights of high-dimensional data, the high-dimensional data having a number of dimension attributes;
removing the dimension attributes whose dimension attribute weight is less than a first predetermined threshold;
in the data set formed by the high-dimensional data, dividing the data set using a standard slicing block to obtain a division result related to the data distribution;
projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result;
determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
calculating the angle variance factor of the high-dimensional data that does not belong to the normal blocks;
marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
2. The method according to claim 1, characterized in that the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
3. The method according to claim 1, characterized in that the method further includes:
determining the data-dense interval of the data set on each dimension attribute;
obtaining several data-dense regions of the data set according to the data-dense intervals;
taking the hypermatrix with the smallest volume among the data-dense regions as the standard slicing block.
4. The method according to claim 1, characterized in that calculating the dimension attribute weights of the high-dimensional data specifically includes:
calculating the dimension attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimension attribute weight; $p_i$ and $p_j$ are the $i$-th and $j$-th dimension attributes respectively; $r_u(p_i)$ is the independent weight of the $i$-th dimension attribute when its dependency relationships with other dimension attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the $i$-th and $j$-th dimension attributes.
5. The method according to claim 4, characterized in that the independent weight is calculated using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$ [entropy expression not reproduced], defined in terms of the $j$-th attribute value of dimension attribute $p_i$; and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where [definition of $H(p_i, p_j)$ not reproduced]; $x_i \in p_i$ indicates that $x_i$ is one of the values of dimension attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
6. A big data abnormal point detection system, characterized in that the system includes:
a dimensionality reduction module, configured to calculate the dimension attribute weights of high-dimensional data, the high-dimensional data having a number of dimension attributes, and to remove the dimension attributes whose dimension attribute weight is less than a first predetermined threshold;
a data set partitioning module, configured to divide, in the data set formed by the high-dimensional data, the data set using a standard slicing block to obtain a division result related to the data distribution; to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result; and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
an abnormal data judgment module, configured to calculate the angle variance factor of the high-dimensional data that does not belong to the normal blocks, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
7. The system according to claim 6, characterized in that the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
8. The system according to claim 6, characterized in that the data set partitioning module is further configured to: determine the data-dense interval of the data set on each dimension attribute; obtain several data-dense regions of the data set according to the data-dense intervals; and take the hypermatrix with the smallest volume among the data-dense regions as the standard slicing block.
9. The system according to claim 6, characterized in that the dimensionality reduction module is specifically configured to calculate the dimension attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimension attribute weight; $p_i$ and $p_j$ are the $i$-th and $j$-th dimension attributes respectively; $r_u(p_i)$ is the independent weight of the $i$-th dimension attribute when its dependency relationships with other dimension attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the $i$-th and $j$-th dimension attributes.
10. The system according to claim 9, characterized in that the dimensionality reduction module is specifically configured to calculate the independent weight using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$ [entropy expression not reproduced], defined in terms of the $j$-th attribute value of dimension attribute $p_i$; and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where [definition of $H(p_i, p_j)$ not reproduced]; $x_i \in p_i$ indicates that $x_i$ is one of the values of dimension attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
CN201710302132.3A 2017-05-02 2017-05-02 Big data abnormal point detecting method and its system Pending CN107122475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710302132.3A CN107122475A (en) 2017-05-02 2017-05-02 Big data abnormal point detecting method and its system

Publications (1)

Publication Number Publication Date
CN107122475A (en) 2017-09-01

Family

ID=59726642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710302132.3A Pending CN107122475A (en) 2017-05-02 2017-05-02 Big data abnormal point detecting method and its system

Country Status (1)

Country Link
CN (1) CN107122475A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536777A (en) * 2018-03-28 2018-09-14 联想(北京)有限公司 A kind of data processing method, server cluster and data processing equipment
CN108536777B (en) * 2018-03-28 2022-03-25 联想(北京)有限公司 Data processing method, server cluster and data processing device
CN112015723A (en) * 2019-05-28 2020-12-01 顺丰科技有限公司 Data grading method and device, computer equipment and storage medium
CN110826904A (en) * 2019-11-01 2020-02-21 三一重能有限公司 Data processing method and device for fan, processing equipment and readable storage medium
CN115389624A (en) * 2022-10-27 2022-11-25 智能网联汽车(山东)协同创新研究院有限公司 Sound wave test system for processing
CN115389624B (en) * 2022-10-27 2023-02-10 智能网联汽车(山东)协同创新研究院有限公司 Sound wave test system for processing

Similar Documents

Publication Publication Date Title
CN107122475A (en) Big data abnormal point detecting method and its system
CN102843559B (en) Method and device for quickly selecting HEVC intra prediction mode on basis of texture characteristics
CN106021771A (en) Method and device for diagnosing faults
EP4009590A1 (en) Traffic abnormality detection method, and model training method and apparatus
CN107103332A (en) A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN105678813A (en) Skin color detection method and device
CN110533022B (en) Target detection method, system, device and storage medium
CN110084165A (en) The intelligent recognition and method for early warning of anomalous event under the open scene of power domain based on edge calculations
CN104992403B (en) Mixed operation operator image redirection method based on visual similarity measurement
US20210065021A1 (en) Working condition state modeling and model correcting method
CN107679089A (en) A kind of cleaning method for electric power sensing data, device and system
CN101251896B (en) Object detecting system and method based on multiple classifiers
CN110991527B (en) Similarity threshold determining method considering average fluctuation rate of voltage curve
CN108924148A (en) A kind of source signal collaborative compression perception data restoration methods
CN108830006B (en) Linear-nonlinear industrial process fault detection method based on linear evaluation factor
CN109101661A (en) The detection method and device of abnormal point in a kind of data sample set
CN106991049A (en) A kind of Software Defects Predict Methods and forecasting system
CN111476307A (en) Lithium battery surface defect detection method based on depth field adaptation
CN105469118A (en) Kernel function based rare category detection method fusing active learning and nonparametric semi-supervised clustering
CN108388745A (en) Least square method supporting vector machine flexible measurement method based on distributed parallel local optimum parameter
CN116071352A (en) Method for generating surface defect image of electric power safety tool
CN106970779A (en) A kind of streaming balance chart division methods calculated towards internal memory
CN115761511A (en) SAR image target detection method combined with high-confidence knowledge distillation
CN107463528A (en) The gauss hybrid models split-and-merge algorithm examined based on KS
CN108898264B (en) Method and device for calculating quality metric index of overlapping community set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170901