CN107122475A - Big data abnormal point detecting method and its system - Google Patents
Big data abnormal point detecting method and its system
- Publication number
- CN107122475A (application CN201710302132.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- dimensional
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a big data outlier detection method. The method includes: calculating dimensional attribute weights of high-dimensional data, the high-dimensional data having several dimensional attributes; removing the dimensional attributes whose weight is less than a first predetermined threshold; dividing the data set formed by the high-dimensional data using a standard cutting block to obtain a division result related to the data distribution; projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result; determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold; calculating an angle variance factor for the high-dimensional data that does not belong to a normal block; and marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold. The invention also discloses a big data outlier detection system. The invention addresses the poor outlier detection performance of the prior art.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a big data outlier detection method and system.
Background
In the big data era, data quality directly affects the effectiveness of big data analysis and processing methods, and in turn the decision-making process. Analysing massive data makes it possible to discover the patterns and rules implicit in a data set. Abnormal data in the data set, however, can severely interfere with the analysis, so data anomaly detection has become an active research topic in the field of exploiting big data through data mining.
Abnormal data in big data has the following characteristics: (1) it differs markedly from normal data; (2) its generation mechanism differs from that of normal data and may be unknown; (3) its dimensionality is high: abnormal data, like normal data, is high-dimensional.
Data points in a high-dimensional data set are sparsely distributed in the high-dimensional space, so conventional methods cannot handle anomaly detection in big data. Existing high-dimensional anomaly detection methods generally use the Euclidean distance between data points as the index for judging whether a data point is abnormal.
However, distance in a high-dimensional space is not an effective measure for high-dimensional data. Some schemes therefore use the cosine of the angle between data points in place of the Euclidean distance, but the time complexity of this approach rises sharply as the data set grows, and common angle-comparison methods do not work well on non-circular data sets.
Therefore, the prior art still needs to be improved.
Summary of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a big data outlier detection method and system, intended to solve the problem of poor outlier detection performance in the prior art.
To achieve the above object, the present invention adopts the following technical solutions:
A big data outlier detection method, wherein the method includes:
calculating dimensional attribute weights of high-dimensional data, the high-dimensional data having several dimensional attributes;
removing the dimensional attributes whose weight is less than a first predetermined threshold;
dividing the data set formed by the high-dimensional data using a standard cutting block to obtain a division result related to the data distribution;
projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result;
determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
calculating an angle variance factor for the high-dimensional data that does not belong to a normal block;
marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
The method described above, wherein the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
The method described above, wherein the method further includes:
determining the data-dense interval of the data set on each dimensional attribute;
obtaining several data-dense regions of the data set from the data-dense intervals;
taking the hyper-rectangle with the smallest volume among the data-dense regions as the standard cutting block.
The method described above, wherein calculating the dimensional attribute weights of the high-dimensional data specifically includes:
calculating the dimensional attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimensional attribute weight; $p_i$ and $p_j$ are the i-th and j-th dimensional attributes respectively; $r_u(p_i)$ is the independent weight of the i-th dimensional attribute when its correlations with other dimensional attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the i-th and j-th dimensional attributes.
The method described above, wherein the independent weight is calculated using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$, computed over the attribute values of $p_i$ (indexed by $j$); and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where, in the definition of $H(p_i, p_j)$, $x_i \in p_i$ denotes that $x_i$ is one of the values of dimensional attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
A big data outlier detection system, including:
a dimensionality reduction module, configured to calculate dimensional attribute weights of high-dimensional data, the high-dimensional data having several dimensional attributes, and to remove the dimensional attributes whose weight is less than a first predetermined threshold;
a data set partitioning module, configured to divide the data set formed by the high-dimensional data using a standard cutting block to obtain a division result related to the data distribution; to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result; and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
an abnormal data determination module, configured to calculate an angle variance factor for the high-dimensional data that does not belong to a normal block, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
The system described above, wherein the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
The system described above, wherein the data set partitioning module is further configured to: determine the data-dense interval of the data set on each dimensional attribute; obtain several data-dense regions of the data set from the data-dense intervals; and take the hyper-rectangle with the smallest volume among the data-dense regions as the standard cutting block.
The system described above, wherein the dimensionality reduction module is specifically configured to calculate the dimensional attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimensional attribute weight; $p_i$ and $p_j$ are the i-th and j-th dimensional attributes respectively; $r_u(p_i)$ is the independent weight of the i-th dimensional attribute when its correlations with other dimensional attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the i-th and j-th dimensional attributes.
The system described above, wherein the dimensionality reduction module is specifically configured to calculate the independent weight using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$, computed over the attribute values of $p_i$ (indexed by $j$); and
to calculate the joint weight using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where, in the definition of $H(p_i, p_j)$, $x_i \in p_i$ denotes that $x_i$ is one of the values of dimensional attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
Beneficial effects: the big data outlier detection method and system provided by the present invention are based on data region cutting and on attribute reduction for dimensionality reduction. They effectively overcome the high time complexity and unsatisfactory results of existing anomaly detection methods when processing large-scale high-dimensional data, and perform well on the challenging problem of outlier detection in large-scale high-dimensional data sets.
Brief description of the drawings
Fig. 1 is a flowchart of the big data outlier detection method provided by an embodiment of the present invention;
Fig. 2 is a diagram of two-dimensional plane blocks provided by an embodiment of the present invention;
Fig. 3 is a diagram of adjacent data blocks provided by an embodiment of the present invention;
Fig. 4 is a functional block diagram of the big data outlier detection system provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention provides a big data outlier detection method and system. To make the purpose, technical solutions and effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Fig. 1 is a flowchart of a big data outlier detection method provided by an embodiment of the present invention. The method includes the following steps:
100. Calculate the dimensional attribute weights of the high-dimensional data, the high-dimensional data having several dimensional attributes.
200. Remove the dimensional attributes whose weight is less than the first predetermined threshold.
Steps 100 and 200 are preprocessing. Big data generally consists of high-dimensional data, which is characterised by a large number of dimensional attributes. Let $p_i$ denote the attribute in the i-th dimension. Depending on the needs of the analysis, the importance of each dimensional attribute differs. Therefore, before outlier detection, redundant attributes can first be removed according to their weights, achieving attribute reduction and dimensionality reduction.
Because the dimensional attributes of high-dimensional data are correlated to some degree, the weight of one dimensional attribute is influenced by the other attributes. In embodiments of the present invention, the weight of a dimensional attribute can be calculated as follows, taking the influence of the other attributes into account:
For a high-dimensional data set $D$, let $P = \{p_1, p_2, \ldots, p_i, \ldots, p_n\}$ be its set of dimensional attributes, where $n$ is the size of the set. The weight $r(p_i)$ of a dimensional attribute $p_i$ in the set can be calculated by formula (1):
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right| \tag{1}$$
where $r(p_i)$ is the dimensional attribute weight; $p_i$ and $p_j$ are the i-th and j-th dimensional attributes respectively; $r_u(p_i)$ is the independent weight of the i-th dimensional attribute when its correlations with other dimensional attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the i-th and j-th dimensional attributes.
The independent weight can be calculated by formula (2):
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)} \tag{2}$$
where $E_i$ is the entropy of $p_i$, computed over the attribute values of $p_i$ (indexed by $j$).
The joint weight can be calculated by formula (3):
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)} \tag{3}$$
where, in the definition of $H(p_i, p_j)$, $x_i \in p_i$ denotes that $x_i$ is one of the values of dimensional attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
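The weight calculation of formulas (1) to (3) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the defining expressions for $E_i$ and $H(p_i, p_j)$ are not reproduced in this text, so the code assumes $E_i$ is the Shannon entropy normalised by the logarithm of the record count, uses the joint entropy of the paired values as a stand-in for $H$ (passed in as the hypothetical parameter `pair_measure` so that another measure can be substituted), and sums the denominator of formula (3) over ordered pairs with $i \neq j$. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in Counter(values).values())


def joint_entropy(col_a, col_b):
    """Stand-in for H(p_i, p_j): entropy of the paired values (an assumption)."""
    return entropy(list(zip(col_a, col_b)))


def attribute_weights(columns, pair_measure=joint_entropy):
    """Return r(p_i) for each column, following formulas (1)-(3).

    columns: one list of discretised values per dimensional attribute;
    at least two attributes and two records are required.
    """
    n = len(columns)
    m = len(columns[0])
    # Assumption: E_i is the entropy normalised by log(m), so 0 <= E_i <= 1.
    E = [entropy(col) / math.log(m) for col in columns]
    denom = sum(1 - e for e in E)
    r_u = [-(1 - e) / denom for e in E]  # formula (2), sign as printed in the claims

    # Assumption: the sum in formula (3) runs over ordered pairs with i != j.
    H = {(i, j): pair_measure(columns[i], columns[j])
         for i in range(n) for j in range(n) if i != j}
    h_total = sum(H.values())
    r_c = {pair: h / h_total for pair, h in H.items()}

    weights = []
    for i in range(n):
        spread = sum(r_c[(i, j)] - r_u[i] for j in range(n) if j != i)
        weights.append(abs(r_u[i] - spread / (2 * (n - 1))))  # formula (1)
    return weights
```

Calling `attribute_weights` with one list per attribute returns one non-negative weight per attribute. The result depends on the two assumed definitions, so the numbers are only meaningful relative to those choices.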
After the dimensional attribute weights have been calculated, a suitable threshold can be set according to practical needs to screen out some dimensional attributes and so achieve dimensionality reduction. Dimensional attributes whose weight is below the threshold are removed from the set, reducing the cost of subsequent operations. Preferably, the first predetermined threshold $\eta$ is set such that $\eta \in [0.2, 0.25]$, which gives a good screening effect.
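The screening step can be illustrated with a short Python sketch. The function name `reduce_attributes` and the default `eta=0.2` (the lower end of the preferred range) are illustrative choices, not taken from the source. Only attributes whose weight is strictly less than the threshold are removed, matching the wording of step 200.

```python
def reduce_attributes(names, weights, eta=0.2):
    """Keep the attributes whose weight is not less than the threshold eta."""
    return [name for name, weight in zip(names, weights) if weight >= eta]
```

An attribute whose weight equals the threshold exactly is kept under this reading.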
300. In the data set formed by the high-dimensional data, divide the data set using a standard cutting block to obtain a division result related to the data distribution.
Further, for the data set $D$, each dimensional attribute has an interval in which the data is relatively concentrated. These intervals give the data-dense regions $R$ of the whole data set $D$. In some embodiments, the hyper-rectangle with the smallest volume in $R$ can be selected as the standard cutting block, and the data set $D$ is divided according to the length $l_i$ of the standard cutting block in each dimension.
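A minimal Python sketch of this division step follows. It assumes the data-dense regions are already available as axis-aligned boxes, each given as a list of `(low, high)` pairs, and that the grid is anchored at the per-dimension minimum of the data; neither detail is specified in the source, and the function names are hypothetical.

```python
import math


def smallest_block(regions):
    """Pick the region with the smallest volume as the standard cutting block."""
    def volume(region):
        return math.prod(high - low for low, high in region)
    return min(regions, key=volume)


def partition(points, block):
    """Assign every point to a grid cell whose side lengths are those of the block."""
    lengths = [high - low for low, high in block]
    dims = range(len(lengths))
    # Assumption: the grid starts at the smallest coordinate in each dimension.
    origin = [min(p[d] for p in points) for d in dims]
    cells = {}
    for p in points:
        key = tuple(int((p[d] - origin[d]) // lengths[d]) for d in dims)
        cells.setdefault(key, []).append(p)
    return cells
```

The returned dictionary maps an integer cell index to the points that fall in that cell, which is one way to represent a division result that follows the data distribution.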
400. Project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result. After the division, the data can be projected onto a two-dimensional plane to obtain the corresponding two-dimensional plane blocks $rec_i$.
500. When the data density of a two-dimensional plane block is greater than the second predetermined threshold, determine it to be a normal block.
Fig. 2 is a schematic diagram of the two-dimensional plane blocks provided by an embodiment of the present invention. As shown in Fig. 2, the data density differs between two-dimensional plane blocks. The data density of a two-dimensional plane block can be calculated by formula (4), the ratio $\mathrm{count}(rec_i) / \mathrm{acr}(rec_i)$, where $\mathrm{count}(rec_i)$ and $\mathrm{acr}(rec_i)$ are respectively the number of data points contained in $rec_i$ and the area of the block. When the data density is greater than a certain threshold, the block is marked as a normal block. In some embodiments, the threshold can be the average data density of all blocks.
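The density test can be sketched in Python as follows. The sketch assumes the points have already been projected to two dimensions, that blocks are axis-aligned cells of size `width` by `height` anchored at the origin, and that the average is taken over non-empty blocks only; these are illustrative assumptions, and the function name is hypothetical.

```python
from collections import Counter


def normal_blocks(points_2d, width, height):
    """Return (density per block, mean density, set of normal blocks)."""
    counts = Counter((int(x // width), int(y // height)) for x, y in points_2d)
    area = width * height
    density = {block: count / area for block, count in counts.items()}
    # Second threshold: the average density over all non-empty blocks.
    threshold = sum(density.values()) / len(density)
    normal = {block for block, d in density.items() if d > threshold}
    return density, threshold, normal
```

Whether empty blocks should count toward the average changes the threshold, so that choice would need to be settled against the actual data layout.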
After at least one normal block has been obtained, a suitable strategy can be used to complete the classification of the remaining blocks of the data set (extending the normal blocks). For example, as shown in Fig. 3, if the density of a block adjacent to a normal block also satisfies the above condition, that block can be added to the queue of contiguous normal blocks.
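One possible reading of this extension strategy is a breadth-first search that groups adjacent normal blocks into contiguous regions. The Python sketch below assumes blocks are identified by integer grid indices and that adjacency means sharing an edge (4-neighbourhood); the source does not state which neighbourhood Fig. 3 uses, and the function name is hypothetical.

```python
from collections import deque


def contiguous_normal_regions(normal):
    """Group edge-adjacent normal blocks into contiguous regions."""
    remaining = set(normal)
    regions = []
    while remaining:
        start = remaining.pop()
        region, queue = {start}, deque([start])
        while queue:
            bx, by = queue.popleft()
            for neighbour in ((bx + 1, by), (bx - 1, by), (bx, by + 1), (bx, by - 1)):
                if neighbour in remaining:
                    # The neighbour also met the density condition: extend the region.
                    remaining.remove(neighbour)
                    region.add(neighbour)
                    queue.append(neighbour)
        regions.append(region)
    return regions
```

Each returned set is one contiguous normal data region; blocks outside every region hold the candidate points for the next step.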
600. Calculate the angle variance factor of the high-dimensional data that does not belong to a normal block.
After the normal blocks have been determined in step 500, multiple contiguous normal data blocks are obtained in the data set $D$. The data points that do not lie in these normal data blocks are identified and mapped back to the high-dimensional space, and the angle variance factor of these data points is calculated.
In some embodiments, the angle variance factor can be calculated by formula (5):
where $\overrightarrow{x_i x_k}$ is the vector difference between data points $x_i$ and $x_k$; $x_i$ and $x_j$ fall within the normal data block $REC_i$, and $x_k$ falls outside the normal data blocks.
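Because formula (5) is not reproduced in this text, the following Python sketch does not implement the patent's angle variance factor. It shows a generic angle-based quantity built from the same ingredients: for a candidate point $x_k$ outside the normal blocks, it takes every pair $(x_i, x_j)$ of points inside a normal block and computes the variance of the cosine of the angle between $x_i - x_k$ and $x_j - x_k$. The function name and the unweighted variance are assumptions, and how this quantity relates to the third threshold depends on the exact form of formula (5).

```python
import math
from itertools import combinations


def angle_variance(candidate, normal_points):
    """Variance of cos(angle) seen from `candidate` over all pairs of normal points."""
    cosines = []
    for a, b in combinations(normal_points, 2):
        u = [ai - ci for ai, ci in zip(a, candidate)]
        v = [bi - ci for bi, ci in zip(b, candidate)]
        norm_u, norm_v = math.hypot(*u), math.hypot(*v)
        if norm_u == 0 or norm_v == 0:
            continue  # the candidate coincides with a normal point; angle undefined
        cosines.append(sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v))
    if not cosines:
        return 0.0
    mean = sum(cosines) / len(cosines)
    return sum((c - mean) ** 2 for c in cosines) / len(cosines)
```

For this particular quantity, a point far from the normal region sees the normal points within a narrow cone and therefore yields a small variance, while a point surrounded by normal points yields a larger one.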
700. When the angle variance factor is greater than the third predetermined threshold, mark the high-dimensional data as abnormal data.
Once the angle variance factor has been calculated, a third predetermined threshold can likewise be set to determine whether the data is abnormal.
An embodiment of the present invention also provides a big data outlier detection system. As shown in Fig. 4, the system includes: a dimensionality reduction module 100, configured to calculate dimensional attribute weights of high-dimensional data, the high-dimensional data having several dimensional attributes, and to remove the dimensional attributes whose weight is less than a first predetermined threshold; a data set partitioning module 200, configured to divide the data set formed by the high-dimensional data using a standard cutting block to obtain a division result related to the data distribution, to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result, and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold; and an abnormal data determination module 300, configured to calculate an angle variance factor for the high-dimensional data that does not belong to a normal block, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
Specifically, the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
In other embodiments of the present invention, the data set partitioning module 200 is further configured to: determine the data-dense interval of the data set on each dimensional attribute; obtain several data-dense regions of the data set from the data-dense intervals; and take the hyper-rectangle with the smallest volume among the data-dense regions as the standard cutting block.
The dimensionality reduction module 100 is specifically configured to calculate the dimensional attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimensional attribute weight; $p_i$ and $p_j$ are the i-th and j-th dimensional attributes respectively; $r_u(p_i)$ is the independent weight of the i-th dimensional attribute when its correlations with other dimensional attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the i-th and j-th dimensional attributes.
Optionally, the independent weight is calculated using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$, computed over the attribute values of $p_i$ (indexed by $j$); and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where, in the definition of $H(p_i, p_j)$, $x_i \in p_i$ denotes that $x_i$ is one of the values of dimensional attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
It will be understood that those of ordinary skill in the art can make equivalent substitutions or changes according to the technical solutions and inventive concept of the present invention, and all such changes or substitutions shall fall within the protection scope of the appended claims.
Claims (10)
1. A big data outlier detection method, characterized in that the method includes:
calculating dimensional attribute weights of high-dimensional data, the high-dimensional data having several dimensional attributes;
removing the dimensional attributes whose weight is less than a first predetermined threshold;
dividing the data set formed by the high-dimensional data using a standard cutting block to obtain a division result related to the data distribution;
projecting the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result;
determining a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
calculating an angle variance factor for the high-dimensional data that does not belong to a normal block;
marking the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
2. The method according to claim 1, characterized in that the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
3. The method according to claim 1, characterized in that the method further includes:
determining the data-dense interval of the data set on each dimensional attribute;
obtaining several data-dense regions of the data set from the data-dense intervals;
taking the hyper-rectangle with the smallest volume among the data-dense regions as the standard cutting block.
4. The method according to claim 1, characterized in that calculating the dimensional attribute weights of the high-dimensional data specifically includes:
calculating the dimensional attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimensional attribute weight; $p_i$ and $p_j$ are the i-th and j-th dimensional attributes respectively; $r_u(p_i)$ is the independent weight of the i-th dimensional attribute when its correlations with other dimensional attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the i-th and j-th dimensional attributes.
5. The method according to claim 4, characterized in that the independent weight is calculated using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$, computed over the attribute values of $p_i$ (indexed by $j$); and
the joint weight is calculated using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where, in the definition of $H(p_i, p_j)$, $x_i \in p_i$ denotes that $x_i$ is one of the values of dimensional attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
6. A big data outlier detection system, characterized in that it includes:
a dimensionality reduction module, configured to calculate dimensional attribute weights of high-dimensional data, the high-dimensional data having several dimensional attributes, and to remove the dimensional attributes whose weight is less than a first predetermined threshold;
a data set partitioning module, configured to divide the data set formed by the high-dimensional data using a standard cutting block to obtain a division result related to the data distribution; to project the high-dimensional data onto a two-dimensional plane to obtain two-dimensional plane blocks corresponding to the division result; and to determine a two-dimensional plane block to be a normal block when its data density is greater than a second predetermined threshold;
an abnormal data determination module, configured to calculate an angle variance factor for the high-dimensional data that does not belong to a normal block, and to mark the high-dimensional data as abnormal data when the angle variance factor is greater than a third predetermined threshold.
7. The system according to claim 6, characterized in that the second predetermined threshold is the average of the data densities of all two-dimensional plane blocks.
8. The system according to claim 6, characterized in that the data set partitioning module is further configured to: determine the data-dense interval of the data set on each dimensional attribute; obtain several data-dense regions of the data set from the data-dense intervals; and take the hyper-rectangle with the smallest volume among the data-dense regions as the standard cutting block.
9. The system according to claim 6, characterized in that the dimensionality reduction module is specifically configured to calculate the dimensional attribute weight using the following formula:
$$r(p_i) = \left| r_u(p_i) - \frac{\sum_{j=1,\, j \neq i}^{n} \left[ r_c(p_i, p_j) - r_u(p_i) \right]}{2(n-1)} \right|$$
where $r(p_i)$ is the dimensional attribute weight; $p_i$ and $p_j$ are the i-th and j-th dimensional attributes respectively; $r_u(p_i)$ is the independent weight of the i-th dimensional attribute when its correlations with other dimensional attributes are not considered; and $r_c(p_i, p_j)$ is the joint weight of the i-th and j-th dimensional attributes.
10. The system according to claim 9, characterized in that the dimensionality reduction module is specifically configured to calculate the independent weight using the following formula:
$$r_u(p_i) = -\frac{1 - E_i}{\sum_{i=1}^{n} (1 - E_i)}$$
where $E_i$ is the entropy of $p_i$, computed over the attribute values of $p_i$ (indexed by $j$); and
to calculate the joint weight using the following formula:
$$r_c(p_i, p_j) = \frac{H(p_i, p_j)}{\sum_{p_i, p_j \in P} H(p_i, p_j)}$$
where, in the definition of $H(p_i, p_j)$, $x_i \in p_i$ denotes that $x_i$ is one of the values of dimensional attribute $p_i$; $q(x_i)$ is the probability density of $x_i$; and $q(x_i, x_j)$ is the joint probability density of $x_i$ and $x_j$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710302132.3A CN107122475A (en) | 2017-05-02 | 2017-05-02 | Big data abnormal point detecting method and its system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710302132.3A CN107122475A (en) | 2017-05-02 | 2017-05-02 | Big data abnormal point detecting method and its system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107122475A true CN107122475A (en) | 2017-09-01 |
Family
ID=59726642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710302132.3A Pending CN107122475A (en) | 2017-05-02 | 2017-05-02 | Big data abnormal point detecting method and its system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107122475A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108536777A (en) * | 2018-03-28 | 2018-09-14 | 联想(北京)有限公司 | A kind of data processing method, server cluster and data processing equipment |
CN110826904A (en) * | 2019-11-01 | 2020-02-21 | 三一重能有限公司 | Data processing method and device for fan, processing equipment and readable storage medium |
CN112015723A (en) * | 2019-05-28 | 2020-12-01 | 顺丰科技有限公司 | Data grading method and device, computer equipment and storage medium |
CN115389624A (en) * | 2022-10-27 | 2022-11-25 | 智能网联汽车(山东)协同创新研究院有限公司 | Sound wave test system for processing |
- 2017-05-02 CN CN201710302132.3A patent/CN107122475A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108536777A (en) * | 2018-03-28 | 2018-09-14 | 联想(北京)有限公司 | A kind of data processing method, server cluster and data processing equipment |
CN108536777B (en) * | 2018-03-28 | 2022-03-25 | 联想(北京)有限公司 | Data processing method, server cluster and data processing device |
CN112015723A (en) * | 2019-05-28 | 2020-12-01 | 顺丰科技有限公司 | Data grading method and device, computer equipment and storage medium |
CN110826904A (en) * | 2019-11-01 | 2020-02-21 | 三一重能有限公司 | Data processing method and device for fan, processing equipment and readable storage medium |
CN115389624A (en) * | 2022-10-27 | 2022-11-25 | 智能网联汽车(山东)协同创新研究院有限公司 | Sound wave test system for processing |
CN115389624B (en) * | 2022-10-27 | 2023-02-10 | 智能网联汽车(山东)协同创新研究院有限公司 | Sound wave test system for processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107122475A (en) | Big data abnormal point detecting method and its system | |
CN102843559B (en) | Method and device for quickly selecting HEVC intra prediction mode on basis of texture characteristics | |
CN106021771A (en) | Method and device for diagnosing faults | |
EP4009590A1 (en) | Traffic abnormality detection method, and model training method and apparatus | |
CN107103332A (en) | A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset | |
CN105678813A (en) | Skin color detection method and device | |
CN110533022B (en) | Target detection method, system, device and storage medium | |
CN110084165A (en) | The intelligent recognition and method for early warning of anomalous event under the open scene of power domain based on edge calculations | |
CN104992403B (en) | Mixed operation operator image redirection method based on visual similarity measurement | |
US20210065021A1 (en) | Working condition state modeling and model correcting method | |
CN107679089A (en) | A kind of cleaning method for electric power sensing data, device and system | |
CN101251896B (en) | Object detecting system and method based on multiple classifiers | |
CN110991527B (en) | Similarity threshold determining method considering average fluctuation rate of voltage curve | |
CN108924148A (en) | A kind of source signal collaborative compression perception data restoration methods | |
CN108830006B (en) | Linear-nonlinear industrial process fault detection method based on linear evaluation factor | |
CN109101661A (en) | The detection method and device of abnormal point in a kind of data sample set | |
CN106991049A (en) | A kind of Software Defects Predict Methods and forecasting system | |
CN111476307A (en) | Lithium battery surface defect detection method based on depth field adaptation | |
CN105469118A (en) | Kernel function based rare category detection method fusing active learning and nonparametric semi-supervised clustering | |
CN108388745A (en) | Least square method supporting vector machine flexible measurement method based on distributed parallel local optimum parameter | |
CN116071352A (en) | Method for generating surface defect image of electric power safety tool | |
CN106970779A (en) | A kind of streaming balance chart division methods calculated towards internal memory | |
CN115761511A (en) | SAR image target detection method combined with high-confidence knowledge distillation | |
CN107463528A (en) | The gauss hybrid models split-and-merge algorithm examined based on KS | |
CN108898264B (en) | Method and device for calculating quality metric index of overlapping community set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170901 |