CN107528904A

CN107528904A - Method and apparatus for distributed data anomaly detection

Info

Publication number: CN107528904A
Application number: CN201710778563.7A
Authority: CN
Inventors: 林晨
Original assignee: Star Link Information Technology (shanghai) Co Ltd
Current assignee: Transwarp Technology Shanghai Co Ltd
Priority date: 2017-09-01
Filing date: 2017-09-01
Publication date: 2017-12-29
Anticipated expiration: 2037-09-01
Also published as: CN107528904B

Abstract

The application provides a method and apparatus for distributed data anomaly detection. Compared with the prior art, the host node collects the data set stored on each working node and broadcasts it to the other working nodes, so that each working node calculates the anomaly probability of each data point from the data set stored on that working node together with the data sets broadcast by the host node. In this way the required computation is distributed reasonably across the working nodes and the computing capacity of the whole cluster is fully used, so that the performance of the algorithm is significantly improved.

Description

Method and apparatus for distributed data anomaly detection

Technical field

The application relates to the field of computer technology, and in particular to a technique for distributed data anomaly detection.

Background technology

Anomaly detection aims to find anomalous data points in a data set. These points differ markedly from the other data points, their appearance is not random, and they may result from an entirely different mechanism. Analyzing anomalous data points can uncover the important information hidden behind them and thereby bring considerable economic value. Anomaly detection is now widely used in practice: in finance, it is used to detect credit card fraud; in information security, it is used for intrusion detection; in medicine, it can be used to assess a patient's potential health problems; and so on.

The existing LOF (Local Outlier Factor) algorithm is a density-based anomaly detection algorithm. It assumes that the density of an anomalous data point differs noticeably from the density of its neighboring points, and it therefore detects anomalous data points by comparing the density of the current point with that of its neighbors. Because of its good detection accuracy, the algorithm is widely used and has become a classic anomaly detection algorithm. With the arrival of the big data era, however, its application is limited by its relatively high time complexity.

Summary of the invention

The purpose of the application is to provide a method and apparatus for distributed data anomaly detection.

According to one aspect of the application, a method for distributed data anomaly detection is provided, wherein the method comprises:

a host node obtaining a first data set stored on a first working node in a cluster, wherein the first data set contains at least one data point;

the host node broadcasting the first data set to the second working nodes in the cluster, so that each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node, wherein the second working nodes are the working nodes in the cluster other than the first working node.

According to another aspect of the application, a method for distributed data anomaly detection is also provided, wherein the method comprises:

a first working node in a cluster sending the first data set stored on that working node to the host node, wherein the first data set contains at least one data point;

a second working node in the cluster obtaining the first data set of the first working node as broadcast by the host node, wherein the second working nodes are the working nodes in the cluster other than the first working node;

each working node in the cluster calculating the anomaly probability of each data point based on the data sets on that working node.

According to a further aspect of the application, a host node for distributed data anomaly detection is also provided, wherein the host node is configured to:

obtain a first data set stored on a first working node in a cluster, wherein the first data set contains at least one data point; and broadcast the first data set to the second working nodes in the cluster, so that each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node, wherein the second working nodes are the working nodes in the cluster other than the first working node.

According to yet another aspect of the application, a cluster for distributed data anomaly detection is also provided, wherein the cluster comprises:

a first working node, configured to send the first data set stored on that working node to the host node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the first data set contains at least one data point;

a second working node, configured to obtain the first data set of the first working node as broadcast by the host node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the second working nodes are the working nodes in the cluster other than the first working node.

Compared with the prior art, in this application the host node collects the data set on each working node and broadcasts it, so that each working node calculates the anomaly probability of each data point from the data set stored on that working node together with the data sets broadcast by the host node. In this way the required computation is distributed reasonably across the working nodes and the computing capacity of the whole cluster is fully used, so that the performance of the algorithm is significantly improved.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 shows a flow chart of a method for distributed data anomaly detection according to one aspect of the application;

Fig. 2 shows a schematic diagram of a host node obtaining and broadcasting the first data set according to a preferred embodiment of the application;

Fig. 3 shows a schematic diagram of the first working node obtaining the first global K-distance neighbors of the data points in the first data set according to a preferred embodiment of the application.

The same or similar reference numerals in the drawings denote the same or similar parts.

Detailed description of the embodiments

The invention is described in further detail below with reference to the accompanying drawings.

In a typical configuration of the application, the terminal, the service network device, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

To further explain the technical means adopted by the application and the effects achieved, the technical solution of the application is described clearly and completely below with reference to the accompanying drawings and preferred embodiments.

Fig. 1 shows a method for distributed data anomaly detection according to one aspect of the application, wherein the method comprises:

S11: a first working node in the cluster sends the first data set stored on that working node to the host node; correspondingly, the host node obtains the first data set stored on the first working node in the cluster, wherein the first data set contains at least one data point;

S12: the host node broadcasts the first data set to the second working nodes in the cluster, so that each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node; correspondingly, the second working nodes obtain the first data set of the first working node as broadcast by the host node, wherein the second working nodes are the working nodes in the cluster other than the first working node;

S13: each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node.

In this embodiment, in step S11, the first working node in the cluster sends the first data set stored on that working node to the host node; correspondingly, the host node obtains the first data set stored on the first working node in the cluster, wherein the first data set contains at least one data point.

Here, the cluster is made up of the nodes of a distributed system. In the embodiments of the application, for example, the distributed system mainly comprises a host node and working nodes. The host node is one node in the cluster and collects and broadcasts data. The first working node is any one of the working nodes in the distributed system, and the second working nodes are the working nodes other than the first working node. Each node in the distributed system includes, but is not limited to, a mobile smart device, a personal computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, a cloud consists of a large number of computers or network servers based on cloud computing; cloud computing is a kind of distributed computing in which a group of loosely coupled computers forms one virtual supercomputer.

Specifically, the collection of all data points on which anomaly detection is to be performed is called the data set. In this application the whole data set can be divided into multiple data partitions, which are stored in a distributed manner across the working nodes; each working node can store one or more partitions of the whole data set. The partition stored on the first working node is called the first data set, and the partition stored on a second working node is called the second data set. In this application, the distributed data set RDD provided by Spark can be used: the raw data set is read into memory in the data format RDD[(sequence number, vector)], and Spark divides the RDD into multiple partitions that are stored in a distributed manner on the working nodes of the Spark cluster.
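The partitioning described above can be illustrated without Spark by a minimal plain-Python sketch. The function name `partition_dataset` and the contiguous chunking rule are assumptions made for illustration only; Spark's own partitioner decides the actual placement.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[int, Tuple[float, ...]]  # (sequence number, vector)


def partition_dataset(vectors: Sequence[Sequence[float]],
                      num_workers: int) -> Dict[int, List[Record]]:
    """Hypothetical helper: number the vectors and split them into
    contiguous partitions, one per working node."""
    records = [(seq, tuple(vec)) for seq, vec in enumerate(vectors)]
    size = -(-len(records) // num_workers)  # ceiling division
    return {w: records[w * size:(w + 1) * size] for w in range(num_workers)}


# Twelve points on four working nodes gives three points per node,
# mirroring the layout sketched for Fig. 2.
partitions = partition_dataset([(float(i), float(i)) for i in range(12)], 4)
```

With this layout, the partition on any one working node plays the role of the first data set while that node takes its turn as the first working node.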

The LOF algorithm requires the k-distance neighbors of each data point. Here, the k-distance of a point is the distance between the current point and the data point that is k-th closest to it, and the k-distance neighbors of a point are all data points whose distance to the current point does not exceed its k-distance. Taking the first working node as an example, calculating the k-distance neighbors of each data point in the first data set therefore requires sending the first data set to the other working nodes, that is, to the second working nodes, so that the k-distance neighbors of each data point of the first data set can be obtained with respect to the whole data set.
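The definition of the k-distance and the k-distance neighbors can be made concrete with a short sketch. It assumes Euclidean distance and a small in-memory dictionary of points; the name `k_distance_neighbors` is hypothetical, and `k` must be smaller than the number of points.

```python
import math
from typing import Dict, List, Sequence, Tuple


def k_distance_neighbors(p_id: int,
                         points: Dict[int, Sequence[float]],
                         k: int) -> Tuple[float, List[Tuple[int, float]]]:
    """Return (k-distance of p, [(neighbor id, distance), ...]).

    The neighbor list holds every point whose distance to p does not
    exceed the k-distance, so ties can yield more than k neighbors.
    """
    dists = sorted(
        (math.dist(points[p_id], vec), q_id)
        for q_id, vec in points.items()
        if q_id != p_id
    )
    k_dist = dists[k - 1][0]  # distance to the k-th closest point
    neighbors = [(q_id, d) for d, q_id in dists if d <= k_dist]
    return k_dist, neighbors
```

For example, with points `{0: (0, 0), 1: (1, 0), 2: (-1, 0)}` and `k=1`, point 0 has a k-distance of 1.0 and two k-distance neighbors, because points 1 and 2 are equally close.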

Accordingly, in step S11, the first working node in the cluster sends the first data set stored on that working node to the host node, and the host node obtains the first data set by means of a collect operation. Here, the working nodes take turns acting as the first working node; that is, every working node in the cluster sends the data set on that working node to the host node, and the host node broadcasts that data set to the other working nodes. In other words, in step S12, the host node broadcasts the first data set to the second working nodes in the cluster by means of a broadcast variable, so that all working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on that working node; correspondingly, the second working nodes obtain the first data set of the first working node as broadcast by the host node.

Fig. 2 shows a schematic diagram of the host node obtaining and broadcasting the first data set. In the figure, the Driver is the host node and the four nodes below it are working nodes, each of which stores a data set. In this embodiment each data set contains three data points, stored in the data format [(sequence number, vector)]. Any one of the four working nodes can serve as the first working node, with the others serving as second working nodes. In this embodiment the leftmost working node is the first working node and the other working nodes are second working nodes. The host node collects the data set on the first working node, that is, the first data set, and broadcasts it to the second working nodes. Further, the second working nodes take turns acting as the first working node, so that the data set on every working node is broadcast to the other working nodes.
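The collect-then-broadcast rotation can be simulated in plain Python as follows. This is a sketch of the data flow only, not of Spark's collect or broadcast-variable machinery; the function name and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, Tuple[float, ...]]  # (sequence number, vector)


def collect_and_broadcast(
        partitions: Dict[int, List[Record]]
) -> Dict[int, Dict[int, List[Record]]]:
    """Simulate the host node: each working node takes a turn as the
    first working node, its data set is collected, and a copy is
    delivered to every other (second) working node.

    Returns: receiving worker id -> {source worker id -> data set}.
    """
    received: Dict[int, Dict[int, List[Record]]] = {w: {} for w in partitions}
    for first in partitions:
        collected = list(partitions[first])  # host node collects the first data set
        for second in partitions:
            if second != first:  # a node never receives its own data set
                received[second][first] = collected
    return received
```

After the rotation, every working node holds its own partition plus a copy of every other partition, which is what lets it compute distances against the whole data set.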

Continuing with this embodiment, in step S13, each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node. That is, each working node calculates the anomaly probability of each data point based on the first data set broadcast by the host node through a broadcast variable and the data set stored on the current working node.

Preferably, a second data set is stored on the second working node, and step S13 comprises:

S131 (not shown): the first working node in the cluster obtains the first global K-distance neighbors of the data points in the first data set, and the second working node obtains the second global K-distance neighbors of the data points in the second data set;

S132 (not shown): the working nodes in the cluster determine in parallel, based on the first global K-distance neighbors and the second global K-distance neighbors, the reachability distance between each data point on that working node and its K-distance neighbors;

S133 (not shown): the working nodes in the cluster determine in parallel, based on the reachability distance between each data point and its K-distance neighbors, the reachability density of each data point on that working node;

S134 (not shown): the working nodes in the cluster determine in parallel, based on the reachability density of each data point, the anomaly probability of each data point on that working node.

In this embodiment, in step S131, the first working node in the cluster obtains the first global K-distance neighbors of the data points in the first data set, and the second working node obtains the second global K-distance neighbors of the data points in the second data set.

Here, the first global K-distance neighbors are the K-distance neighbors of the data points of the first data set with respect to all data sets on all working nodes, and the second global K-distance neighbors are the K-distance neighbors of the data points of the second data set with respect to all data sets on all working nodes. The first and second global K-distance neighbors are collectively called global K-distance neighbors, meaning the K-distance neighbors of each data point with respect to all data sets on all working nodes. For example, because the host node broadcasts the data set on each working node to the other working nodes, each working node can calculate the K-distance neighbors, across all data sets, of the data points in its own data set, and thereby obtain the global K-distance neighbors of each data point.

Preferably, the first working node obtaining the first global K-distance neighbors of the data points in the first data set comprises:

S1311 (not shown): the first working node calculates the first local K-distance neighbors of the data points in the first data set, wherein the first local K-distance neighbors are the K-distance neighbors, within the first data set, of the data points in the first data set;

S1312 (not shown): the second working node calculates the second local K-distance neighbors of the data points in the first data set, wherein the second local K-distance neighbors are the K-distance neighbors, within the second data set, of the data points in the first data set;

S1313 (not shown): the first working node obtains the first global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors.

In this embodiment, in step S1311, the first working node calculates the distances between the data points in the first data set to obtain the K-distance neighbors of each data point of the first data set within the first data set, that is, the first local K-distance neighbors.

Similarly, in step S1312, because the second working node has received the first data set, it can calculate the distance between each data point in the first data set and each data point in the second data set, and thereby obtain the K-distance neighbors of each data point of the first data set within the second data set, that is, the second local K-distance neighbors.

Further, in step S1313, the first working node obtains the first global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors.

The global K-distance neighbors of the data points in the second data set can be determined in the same way as those of the data points in the first data set.
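Steps S1311 and S1312 apply the same computation to different target data sets, which the following sketch makes explicit. It assumes Euclidean distance and, as a simplification, keeps exactly the k closest points rather than every point within the k-distance; the name `local_k_nearest` is hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

Record = Tuple[int, Sequence[float]]  # (sequence number, vector)


def local_k_nearest(first_data: List[Record],
                    local_data: List[Record],
                    k: int) -> Dict[int, List[Tuple[int, float]]]:
    """For each point of the first data set, find its k closest points
    in the data set held locally by one working node.

    Called with local_data == first_data this yields the first local
    neighbors (S1311); called on a second working node with that node's
    own data set it yields the second local neighbors (S1312).
    """
    result: Dict[int, List[Tuple[int, float]]] = {}
    for p_id, p_vec in first_data:
        candidates = [
            (math.dist(p_vec, q_vec), q_id)
            for q_id, q_vec in local_data
            if q_id != p_id  # a point is not its own neighbor
        ]
        result[p_id] = [(q_id, d) for d, q_id in heapq.nsmallest(k, candidates)]
    return result
```

Each working node can run this independently on the broadcast copy of the first data set, so the distance computations are spread over the cluster.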

Preferably, step S1313 comprises: S13131 (not shown): the first working node obtains the second local K-distance neighbors determined by the second working node; S13132 (not shown): the first working node obtains the global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors.

In this embodiment, the global K-distance neighbors of the data points in the first data set can be determined by the first working node; that is, the first local K-distance neighbors and the second local K-distance neighbors are gathered on the first working node, which calculates the global K-distance neighbors of the data points in the first data set. Similarly, the global K-distance neighbors of the data points in the second data set can be determined by the second working node, so that the calculation is distributed and computational efficiency is improved.

Preferably, step S1313 comprises: the first working node and the second working node send the first local K-distance neighbors and the second local K-distance neighbors to an intermediate working node; the first working node receives the global K-distance neighbors of the data points in the first data set from the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors.

In this preferred embodiment, the global K-distance neighbors need not be calculated by a working node; instead, the calculation can be carried out by an intermediate node. Here, the intermediate node is a separately provided node other than the host node and the working nodes. It obtains the first local K-distance neighbors and the second local K-distance neighbors and performs the corresponding calculation to determine the global K-distance neighbors.

Preferably, the storage format of the local and global K-distance neighbors includes the sequence number identifier of the data point and the sequence number identifiers of that data point's local and global K-distance neighbors. That is, when the local and global K-distance neighbors of each data point are stored, the record can include the sequence number identifier of the data point and those of its local and global K-distance neighbors. For example, the K-distance neighbors of data point p can be recorded as (data point p, [(neighbor o₁, distance d₁), ..., (neighbor o_k, distance d_k)]), where d is the distance between data point p and the neighbor point.

For example, in the embodiment shown in Fig. 3, the global or local K-distance neighbor of data point 0 is data point 1. Since K=1, the K-distance neighbor of data point 0 can be stored as (0, 1, d1), where d1 is the distance between data point 0 and data point 1.

Preferably, obtaining the global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors comprises:

reducing together the records in the first local K-distance neighbors and the second local K-distance neighbors that have the same sequence number identifier, and determining the global K-distance neighbors of the data points in the first data set based on the result of the reduction. Specifically, a reduceByKey operation can bring all records with the same sequence number to the same working node, for example to the first working node or to the intermediate node.

Fig. 3 shows an embodiment in which the first working node obtains the first global K-distance neighbors of the data points in the first data set. In this embodiment, the data set stored on each working node contains three data points, and each data point is stored in the data format (sequence number, vector); for example, the data points with sequence numbers 0, 1, and 2 are stored on the first working node. In this embodiment, K=1 is assumed for the K-distance neighbors.

In this embodiment, the Driver collects the data set on the first working node and broadcasts it to the second working nodes. In step S1311, the first working node calculates the first local K-distance neighbors of the data points in the first data set, that is, the local K-distance neighbors of the data points with sequence numbers 0, 1, and 2 within their own data set (the first data set). For example, the first local K-distance neighbor of the data point with sequence number 0 within its own data set is the data point with sequence number 1.

Likewise, in step S1312, the second working node calculates the second local K-distance neighbors of the data points in the first data set. That is, the second working node calculates the distance between each data point in the first data set and each data point in the second data set, and takes the closest points as the local K-distance neighbors of that data point within the second data set. For example, the second local K-distance neighbor of the data point with sequence number 0 on second working node 2 is the data point with sequence number 4.

Continuing with this embodiment, in step S1313, the first working node obtains the first global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors. Specifically, the first local K-distance neighbors and the second local K-distance neighbors can each be sorted by distance in ascending order, and the K neighbors with the smallest distances are then selected by traversing the first and second local K-distance neighbors.

Specifically, a reduceByKey operation can bring all records with the same sequence number to the same working node. For example, as shown in Fig. 3 with K=1, the local K-distance neighbors of the data point with sequence number 0 are reduced together and sorted by their distance to the data point with sequence number 0, and the K data points with the smallest distances are selected. In this embodiment, since K=1, the global K-distance neighbor of the data point with sequence number 0 is determined to be (0.484, 0.598), that is, the data point with sequence number 1. Similarly, the global K-distance neighbor of the data point with sequence number 1 is found to be the data point with sequence number 2.
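The reduction by sequence number can be sketched in plain Python as a group-by-key followed by a top-K selection. This imitates the effect of a reduceByKey on records keyed by sequence number and is not Spark code; the function name and the distances used in the usage example are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Neighbor = Tuple[int, float]  # (neighbor sequence number, distance)


def merge_local_neighbors(
        local_records: Iterable[Tuple[int, List[Neighbor]]],
        k: int) -> Dict[int, List[Neighbor]]:
    """Group local neighbor lists by the data point's sequence number
    and keep the k closest candidates as the global neighbors."""
    grouped: Dict[int, List[Neighbor]] = defaultdict(list)
    for seq, neighbors in local_records:
        grouped[seq].extend(neighbors)  # same key ends up in one place
    return {
        seq: heapq.nsmallest(k, candidates, key=lambda nd: nd[1])
        for seq, candidates in grouped.items()
    }


# Hypothetical distances: point 0 has local neighbor 1 on the first
# working node and local neighbor 4 on a second working node.
records = [(0, [(1, 0.2)]), (0, [(4, 0.7)]), (1, [(2, 0.3)]), (1, [(5, 0.9)])]
global_neighbors = merge_local_neighbors(records, k=1)
```

With these illustrative distances, the global neighbor of point 0 is point 1 and the global neighbor of point 1 is point 2, matching the outcome described for Fig. 3.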

Preferably, step S132 comprises: each working node in the cluster determines the reachability distance between each data point and its K-distance neighbors, based either on the global K-distance neighbors of each data point on that working node alone, or on the global K-distance neighbors of each data point on that working node together with those of data points on other working nodes.

In this embodiment, when the global K-distance neighbors of every data point on a working node are themselves data points on that working node, the working node needs only the global K-distance neighbors of its own data points to determine the reachability distance between each of its data points and its K-distance neighbors. When the global K-distance neighbors of some data point on a working node are data points on other working nodes, the determination must be based on the global K-distance neighbors of each data point on that working node and of the data points involved on the other working nodes. The working nodes calculate in parallel, so that the reachability distances to the K-distance neighbors are determined for all data points.

Here, the reachability distance between data point p and data point o is defined as follows:

$$\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\ d(p, o) \,\}$$

where d(p, o) is the distance between data points p and o, and k-distance(o) is the k-distance of data point o.

Specifically, in the K-distance neighbor records obtained above, each data point is swapped with its neighbor points. For example, if the K-distance neighbors of data point p are recorded as RDD[(data point p, [(neighbor o₁, distance d₁), ..., (neighbor o_k, distance d_k)])], then after p is swapped with its neighbors the result can be recorded as Swapped_RDD[(neighbor o₁, data point p, distance d₁), ..., (neighbor o_k, data point p, distance d_k), ...]. Every record of an RDD is stored as a key-value pair; here the neighbor point is the key, and the data point and distance form the value. A cogroup operation (an RDD operation provided by Spark) is then applied by key to Swapped_RDD and the original RDD. The cogroup operation brings the records with the same key in Swapped_RDD and the original RDD to the same working node. For the point that serves as the key of each record, the k-distance of that neighbor point can therefore be compared with the distance from the data point to the neighbor point, and the larger value is taken as the reachability distance from the data point to its neighbor. The result is recorded as RDD[(data point p, [(neighbor o₁, reachability distance d₁), ..., (neighbor o_k, reachability distance d_k)])].
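The swap-and-cogroup step can be imitated in plain Python. The sketch assumes every point that appears as a neighbor also has its own neighbor list, and it takes the k-distance of a point to be the largest distance in that list; the function name is hypothetical and the dictionary lookups stand in for Spark's cogroup.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Neighbor = Tuple[int, float]  # (point id, distance)


def reach_distances(
        neighbors: Dict[int, List[Neighbor]]
) -> Dict[int, List[Neighbor]]:
    """Compute reach-dist(p, o) = max(k-distance(o), d(p, o)) for every
    data point p and each of its neighbors o."""
    # k-distance of each point: distance to its farthest listed neighbor.
    k_distance = {p: max(d for _, d in nbrs) for p, nbrs in neighbors.items()}

    # Swap: key every (p, o, d) record by the neighbor o instead of p.
    swapped: Dict[int, List[Neighbor]] = defaultdict(list)
    for p, nbrs in neighbors.items():
        for o, d in nbrs:
            swapped[o].append((p, d))

    # Cogroup-like join on o: pair each swapped record with k-distance(o).
    result: Dict[int, List[Neighbor]] = defaultdict(list)
    for o, entries in swapped.items():
        for p, d in entries:
            result[p].append((o, max(k_distance[o], d)))
    return dict(result)
```

Keying by the neighbor is what makes the join local: once the records for a given o sit together, the comparison with k-distance(o) needs no further communication.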

In this application, because records with different keys are located on different working nodes after the reduction, the whole process of calculating the reachability distance is executed in parallel: each working node is independently responsible for its own calculation task without affecting the others, which further improves the execution speed of the algorithm.

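Further, in step S133, the working nodes in the cluster calculate the reachability density of each data point in parallel; that is, each working node calculates the reachability density of each data point originally stored on that working node. The reachability density is defined as follows (written here in the form used by the standard LOF algorithm):

$$\mathrm{lrd}_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{\left| N_{MinPts}(p) \right|} \right)$$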

where lrd_MinPts(p) denotes the reachability density of p, and N_MinPts(p) denotes the set of all k-distance neighbors of the current data point p when k = MinPts. Specifically, a groupByKey operation is applied to each data point in the RDD obtained in step S132, which brings all records with the same key to the same working node. For each record after the reduction, the sum of the reachability distances to all neighbor points of the data point corresponding to the key and the number of neighbor points are calculated, and the reachability density is calculated according to the above formula. The result is RDD[(data point p, [(neighbor o₁, reachability density d₁), ..., (neighbor o_k, reachability density d_k)])]. Likewise, the process of calculating the reachability density is executed fully in parallel: each working node is independently responsible for its own calculation task without affecting the others.
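The per-key calculation described above amounts to dividing the number of neighbors by the sum of the reachability distances. A minimal sketch, assuming the reachability distances are already grouped by data point and that their sum is nonzero (duplicate points would need special handling); the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def reachability_density(
        reach: Dict[int, List[Tuple[int, float]]]
) -> Dict[int, float]:
    """reach maps p -> [(neighbor o, reach-dist(p, o)), ...].

    The density is the inverse of the mean reachability distance from
    p to its neighbors: |N(p)| / sum of reach-dist(p, o).
    """
    return {
        p: len(nbrs) / sum(rd for _, rd in nbrs)
        for p, nbrs in reach.items()
    }
```

For a point with reachability distances 1.0 and 3.0 to its two neighbors, the density is 2 / 4.0 = 0.5.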

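Further, in step S134, the working nodes in the cluster determine in parallel, based on the reachability density of each data point, the anomaly probability of each data point on that working node. Here, the anomaly probability is denoted LOF_MinPts(p) and is defined as follows (written here in the form used by the standard LOF algorithm):

$$\mathrm{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \dfrac{\mathrm{lrd}_{MinPts}(o)}{\mathrm{lrd}_{MinPts}(p)}}{\left| N_{MinPts}(p) \right|}$$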

Specifically, a cogroup operation is applied to the RDD obtained in step S133 and Swapped_RDD to obtain the reachability density of the data point, the sum of the reachability densities of all its neighbor points, and the number of neighbor points, and the anomaly probability that the data point is an anomalous data point is calculated according to the above formula. Likewise, the process of calculating the anomaly probability is executed fully in parallel: each working node is independently responsible for its own calculation task without affecting the others.
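The final step combines a point's own reachability density with the densities of its neighbors. The sketch below follows the standard LOF ratio and assumes the densities are available in a dictionary covering every neighbor; the function name is hypothetical and the dictionary lookup stands in for the cogroup.

```python
from typing import Dict, List


def lof_scores(neighbors: Dict[int, List[int]],
               lrd: Dict[int, float]) -> Dict[int, float]:
    """neighbors maps p -> ids of its K-distance neighbors; lrd maps each
    point to its reachability density.

    The score is the mean density of the neighbors divided by the
    density of p itself.
    """
    return {
        p: sum(lrd[o] for o in nbrs) / (len(nbrs) * lrd[p])
        for p, nbrs in neighbors.items()
    }
```

A score close to 1 means the point is about as dense as its neighbors, while a score well above 1 marks a point that lies in a sparser region than its neighbors and is therefore a likely anomaly.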

A kind of cluster for data distribution formula abnormality detection is additionally provided according to the application on the other hand, wherein, institute Stating cluster includes：

First working node, for sending the first data set stored on this working node to host node, and it is based on this work Make the data set on node, calculate the abnormal probability of each data point.First data set includes at least one data point；

The host node, configured to obtain the first data set and to broadcast the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes, wherein the second working node is a working node in the cluster other than the first working node;

The second working node, configured to obtain the first data set of the first working node broadcast by the host node, and to calculate the abnormal probability of each data point based on the first data set and the data set on this working node.

In this embodiment, the first working node in the cluster sends the first data set stored on it to the host node; correspondingly, the host node obtains the first data set stored on the first working node in the cluster, wherein the first data set includes at least one data point.

Specifically, the host node obtains the first data set stored on the first working node by way of a collect operation. Here, each working node becomes the first working node in turn; that is, each working node in the cluster can send its own data set to the host node, and the host node can broadcast that data set to the working nodes other than the sending working node. In other words, the host node broadcasts the first data set to the second working node in the cluster by way of a broadcast variable, so that all working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes. Correspondingly, the second working node obtains the first data set of the first working node broadcast by the host node.

Continuing this embodiment, the working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes. That is, each working node in the cluster calculates the abnormal probability of each data point based on the first data set broadcast by the host node through the broadcast variable and the data set stored on the current working node.
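
The collect-and-broadcast rounds described above may be simulated in a single process as follows. This is only a sketch of the data flow: it uses ordinary Python dictionaries rather than a real cluster, and the function name `broadcast_rounds` and the worker labels are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical point layout: (sequence number, coordinates).
Point = Tuple[int, Tuple[float, ...]]


def broadcast_rounds(
    partitions: Dict[str, List[Point]]
) -> Dict[str, Dict[str, List[Point]]]:
    """For each working node taken in turn as the first working node, return
    the copy of its data set that every other (second) working node receives."""
    rounds: Dict[str, Dict[str, List[Point]]] = {}
    for first, data in partitions.items():
        collected = list(data)  # the host node collects the first data set
        # The host node broadcasts it to every working node except the sender.
        rounds[first] = {w: list(collected) for w in partitions if w != first}
    return rounds


partitions = {
    "w1": [(0, (0.0,))],
    "w2": [(1, (1.0,))],
    "w3": [(2, (2.0,))],
}
rounds = broadcast_rounds(partitions)
```

After all rounds, every working node has seen every other working node's data set while still holding only its own data set permanently.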

Preferably, a second data set is stored on the second working node. The first working node in the cluster obtains the first global K-distance neighbours of the data points in the first data set, and the second working node obtains the second global K-distance neighbours of the data points in the second data set. Further, the working nodes in the cluster determine in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reach distance between each data point on the respective working node and its K-distance neighbours. Further, the working nodes in the cluster determine in parallel, based on the reach distance between each data point and its K-distance neighbours, the reachable density of each data point on the respective working node. Further, the working nodes in the cluster determine in parallel, based on the reachable density of each data point, the abnormal probability of each data point on the respective working node.

In this embodiment, the first working node in the cluster obtains the first global K-distance neighbours of the data points in the first data set, and the second working node obtains the second global K-distance neighbours of the data points in the second data set.

The first working node calculates the first local K-distance neighbours of the data points in the first data set, wherein the first local K-distance neighbours include the K-distance neighbours, within the first data set, of the data points in the first data set;

The second working node calculates the second local K-distance neighbours of the data points in the first data set, wherein the second local K-distance neighbours include the K-distance neighbours, within the second data set, of the data points in the first data set;

The first working node obtains the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

In this embodiment, the first working node calculates the distances between the data points in the first data set in order to obtain the K-distance neighbours of each data point within the first data set, i.e. the first local K-distance neighbours.

Similarly, because the second working node has obtained the first data set, the second working node can calculate the distance between each data point in the first data set and each data point in the second data set, thereby obtaining the K-distance neighbours, within the second data set, of each data point in the first data set, i.e. the second local K-distance neighbours.

Further, the first working node obtains the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

Preferably, the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes: the first working node obtains the second local K-distance neighbours determined by the second working node; and the first working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

Preferably, the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes: the first working node and the second working node transmit the first local K-distance neighbours and the second local K-distance neighbours to an intermediate working node; and the first working node receives the global K-distance neighbours of the data points in the first data set sent by the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

In this embodiment, the Driver end collects the data set on the first working node and broadcasts it to the second working node. The first working node calculates the first local K-distance neighbours of the data points in the first data set, i.e. the local K-distance neighbours of the data points with sequence numbers 0, 1 and 2 within its own data set (i.e. the first data set). For example, the first local K-distance neighbour of the data point with sequence number 0 within its own data set is the data point with sequence number 1.

Likewise, the second working node calculates the second local K-distance neighbours of the data points in the first data set; that is, the second working node calculates the distance between each data point in the first data set and each data point in the second data set, and takes the nearest points as the local K-distance neighbours of that data point within the second data set. For example, the second local K-distance neighbour of the data point with sequence number 0 on the second working node 2 is the data point with sequence number 4.
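
The example above can be reproduced with the following sketch. The one-dimensional coordinates are hypothetical and were chosen only so that, with K = 1, data point 0 has data point 1 as its first local neighbour and data point 4 as its second local neighbour. The function name `local_k_neighbours` is likewise an assumption, and the sketch simply keeps the K nearest candidates without any special handling of ties.

```python
import math
from typing import Dict, List, Tuple

Coords = Tuple[float, ...]


def local_k_neighbours(
    queries: Dict[int, Coords], candidates: Dict[int, Coords], k: int
) -> Dict[int, List[Tuple[float, int]]]:
    """For each query point, return its k nearest candidates as
    (distance, sequence number) pairs in ascending order of distance."""
    result: Dict[int, List[Tuple[float, int]]] = {}
    for qid, q in queries.items():
        dists = sorted(
            (math.dist(q, c), cid)
            for cid, c in candidates.items()
            if cid != qid  # a point is not its own neighbour
        )
        result[qid] = dists[:k]
    return result


# Hypothetical data: sequence numbers 0-2 on the first working node, 3-5 on the second.
first_data_set = {0: (0.0,), 1: (1.0,), 2: (5.0,)}
second_data_set = {3: (9.0,), 4: (2.0,), 5: (12.0,)}

# Computed on the first working node, within its own data set.
first_local = local_k_neighbours(first_data_set, first_data_set, k=1)
# Computed on the second working node, which received the first data set by broadcast.
second_local = local_k_neighbours(first_data_set, second_data_set, k=1)
```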

Continuing this embodiment, the first working node obtains the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours. Specifically, the first local K-distance neighbours and the second local K-distance neighbours may each be sorted in ascending order of distance, and the K K-distance neighbours with the smallest distances are then selected by traversing the first and second local K-distance neighbours.
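
The selection step above may be sketched as a merge of two lists that are already sorted by distance. The function name `merge_global` and the (distance, sequence number) pair layout are hypothetical; `heapq.merge` is used here only as one convenient way of traversing both sorted lists together.

```python
import heapq
from itertools import islice
from typing import List, Tuple

Neighbour = Tuple[float, int]  # hypothetical layout: (distance, sequence number)


def merge_global(
    first_local: List[Neighbour], second_local: List[Neighbour], k: int
) -> List[Neighbour]:
    """Traverse two ascending neighbour lists together and keep the k nearest."""
    return list(islice(heapq.merge(first_local, second_local), k))


# Both inputs must already be sorted in ascending order of distance.
global_k1 = merge_global([(1.0, 1)], [(2.0, 4)], k=1)
global_k2 = merge_global([(1.0, 1), (5.0, 2)], [(2.0, 4), (9.0, 3)], k=2)
```

Because the merge is lazy, the traversal stops as soon as K neighbours have been selected and never needs to visit the remainder of either list.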

Preferably, the working nodes in the cluster determining in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reach distance between each data point on the respective working node and its K-distance neighbours includes: each working node in the cluster determines the reach distance between each data point and its K-distance neighbours based on the global K-distance neighbours of each data point on this working node, or based on the global K-distance neighbours of each data point on this working node and of the data points on other working nodes.

$$\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\ d(p, o) \,\}$$

Here, d(p, o) denotes the distance between data points p and o.
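
As a small worked illustration of the reach distance, the following sketch assumes that d(p, o) is the Euclidean distance and that the k-distance of o has already been computed. The function name `reach_dist` and its parameter names are hypothetical.

```python
import math
from typing import Tuple

Coords = Tuple[float, ...]


def reach_dist(p: Coords, o: Coords, k_distance_o: float) -> float:
    """Reach distance of p with respect to o: the larger of the k-distance
    of o and the (assumed Euclidean) distance between p and o."""
    return max(k_distance_o, math.dist(p, o))


# p is far from o, so the actual distance dominates.
far = reach_dist((0.0, 0.0), (3.0, 4.0), k_distance_o=2.0)
# p lies inside the k-distance neighbourhood of o, so the k-distance dominates.
near = reach_dist((0.0, 0.0), (1.0, 0.0), k_distance_o=2.0)
```

Taking the maximum means that all points lying within the k-distance of o are treated as equally distant from o.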

Further, the working nodes in the cluster calculate the reachable density of each data point in parallel; that is, each working node calculates the reachable density of each data point initially stored on that working node. The reachable density is defined as follows:

$$\mathrm{lrd}_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right)$$

Here, lrd_MinPts(p) denotes the reachable density of p, and N_MinPts(p) denotes the set of all k-distance neighbours of the current data point p when k = MinPts. Specifically, a groupByKey operation is performed on the RDD obtained after the reach distances are calculated, so that all records with the same key are reduced to the same working node. For each reduced record, the sum of the reach distances of all neighbours of the data point corresponding to the key and the number of those neighbours are calculated, and the reachable density is calculated according to the above formula. The result is RDD[data point p, [(neighbour o_1, reachable density d_1), ..., (neighbour o_k, reachable density d_k)]]. As before, the reachable density is calculated fully in parallel: each working node is independently responsible for its own computing task, and the working nodes do not affect one another.

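Further, the working nodes in the cluster determine in parallel, based on the reachable density of each data point, the abnormal probability of each data point on the respective working node. Here, the abnormal probability is denoted LOF_MinPts(p) and is defined as follows:

$$\mathrm{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \dfrac{\mathrm{lrd}_{MinPts}(o)}{\mathrm{lrd}_{MinPts}(p)}}{|N_{MinPts}(p)|}$$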

Specifically, a cogroup operation is performed on the RDD obtained after the reachable densities are calculated and Swapped_RDD. For each data point, the reachable density of the data point, the sum of the reachable densities of all its neighbours, and the number of neighbours are calculated, and the abnormal probability that the data point is an abnormal data point is calculated according to the above formula. As before, the abnormal probability is calculated fully in parallel: each working node is independently responsible for its own computing task, and the working nodes do not affect one another.

It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realised in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced by the invention. No reference sign in a claim should be construed as limiting the claim concerned. In addition, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices stated in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.

Claims

1. A method for distributed data abnormality detection, wherein the method includes:

A host node obtains a first data set stored on a first working node in a cluster, wherein the first data set includes at least one data point;

The host node broadcasts the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes, wherein the second working node is a working node in the cluster other than the first working node.

2. A method for distributed data abnormality detection, wherein the method includes:

A first working node in a cluster sends a first data set stored on this working node to the host node, wherein the first data set includes at least one data point;

A second working node in the cluster obtains the first data set of the first working node broadcast by the host node, wherein the second working node is a working node in the cluster other than the first working node;

3. The method according to claim 2, wherein a second data set is stored on the second working node, and the working nodes in the cluster calculating the abnormal probability of each data point based on the data sets on the respective working nodes includes:

The first working node in the cluster obtains the first global K-distance neighbours of the data points in the first data set, and the second working node obtains the second global K-distance neighbours of the data points in the second data set;

The working nodes in the cluster determine in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reach distance between each data point on the respective working node and its K-distance neighbours;

The working nodes in the cluster determine in parallel, based on the reach distance between each data point and its K-distance neighbours, the reachable density of each data point on the respective working node;

The working nodes in the cluster determine in parallel, based on the reachable density of each data point, the abnormal probability of each data point on the respective working node.

4. The method according to claim 3, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set includes:

The first working node calculates the first local K-distance neighbours of the data points in the first data set, wherein the first local K-distance neighbours include the K-distance neighbours, within the first data set, of the data points in the first data set;

The second working node calculates the second local K-distance neighbours of the data points in the first data set, wherein the second local K-distance neighbours include the K-distance neighbours, within the second data set, of the data points in the first data set;

The first working node obtains the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

5. The method according to claim 4, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes:

The first working node obtains the second local K-distance neighbours determined by the second working node;

The first working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

6. The method according to claim 4, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes:

The first working node and the second working node transmit the first local K-distance neighbours and the second local K-distance neighbours to an intermediate working node;

The first working node receives the global K-distance neighbours of the data points in the first data set sent by the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

7. The method according to any one of claims 4 to 6, wherein the storage format of the local K-distance neighbours and the global K-distance neighbours includes the sequence number identifier of the data point and the sequence number identifiers of the local and global K-distance neighbours of the data point.

8. The method according to claim 11, wherein obtaining the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes:

Reducing the data points that have the same sequence number identifier in the first local K-distance neighbours and the second local K-distance neighbours;

Determining the global K-distance neighbours of the data points in the first data set based on the result of the reduction.

9. The method according to claim 3, wherein the working nodes in the cluster determining in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reach distance between each data point on the respective working node and its K-distance neighbours includes:

Each working node in the cluster determines in parallel the reach distance between each data point on this working node and its K-distance neighbours, based on the global K-distance neighbours of each data point on this working node, or based on the global K-distance neighbours of each data point on this working node and of the data points on other working nodes.

10. A host node for distributed data abnormality detection, wherein the host node is configured to:

Obtain a first data set stored on a first working node in a cluster, wherein the first data set includes at least one data point; and broadcast the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes, wherein the second working node is a working node in the cluster other than the first working node.

11. A cluster for distributed data abnormality detection, wherein the cluster includes:

A first working node, configured to send a first data set stored on this working node to the host node and to calculate the abnormal probability of each data point based on the data set on this working node, wherein the first data set includes at least one data point;

A second working node, configured to obtain the first data set of the first working node broadcast by the host node and to calculate the abnormal probability of each data point based on the data set on this working node, wherein the second working node is a working node in the cluster other than the first working node.

12. The cluster according to claim 11, wherein a second data set is stored on the second working node, wherein the first working node obtains the first global K-distance neighbours of the data points in the first data set and the second working node obtains the second global K-distance neighbours of the data points in the second data set; the working nodes in the cluster determine in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reach distance between each data point on the respective working node and its K-distance neighbours;

And determine in parallel, based on the reach distance between each data point and its K-distance neighbours, the reachable density of each data point on the respective working node;

And determine in parallel, based on the reachable density of each data point, the abnormal probability of each data point on the respective working node.

13. The cluster according to claim 12, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set includes:

14. The cluster according to claim 13, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes:

15. The cluster according to claim 13, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes:

And receiving the global K-distance neighbours of the data points in the first data set sent by the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.

16. The cluster according to any one of claims 13 to 15, wherein the storage format of the local K-distance neighbours and the global K-distance neighbours includes the sequence number identifier of the data point and the sequence number identifiers of the local and global K-distance neighbours of the data point.

17. The cluster according to claim 16, wherein obtaining the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours includes:

18. The cluster according to claim 12, wherein the working nodes in the cluster determining in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reach distance between each data point on the respective working node and its K-distance neighbours includes: