CN107528904B - Method and apparatus for data distributed anomaly detection - Google Patents


Info

Publication number
CN107528904B
CN107528904B (grant) · CN201710778563.7A (application) · CN107528904A (publication)
Authority
CN
China
Prior art keywords: distance, data, working node, neighbor, node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710778563.7A
Other languages
Chinese (zh)
Other versions
CN107528904A (en
Inventor
林晨
Current Assignee
Transwarp Technology Shanghai Co Ltd
Original Assignee
Transwarp Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Transwarp Technology Shanghai Co Ltd filed Critical Transwarp Technology Shanghai Co Ltd
Priority to CN201710778563.7A priority Critical patent/CN107528904B/en
Publication of CN107528904A publication Critical patent/CN107528904A/en
Application granted granted Critical
Publication of CN107528904B publication Critical patent/CN107528904B/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/06Generation of reports
    • H04L43/062Generation of reports related to network traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425Traffic logging, e.g. anomaly detection

Abstract

The application aims to provide a method and device for distributed anomaly detection on data. Compared with the prior art, the master node collects the data set on each working node and broadcasts it to the other working nodes, so that every working node calculates the anomaly probability of each data point based on both the data set stored on that working node and the data sets broadcast by the master node.

Description

Method and apparatus for data distributed anomaly detection
Technical Field
The present application relates to the field of computer technologies, and in particular, to a technique for data distributed anomaly detection.
Background
Anomaly detection aims to discover anomalous data points in a data set that differ significantly from the other data points, whose occurrence is not random but may arise from an entirely different generating mechanism. Analyzing these anomalous data points can uncover important information hidden behind them and thereby bring considerable economic value. Anomaly detection is already widely applied in real life: in the financial field, anomaly detection techniques are used for credit-card fraud detection; in the information-security field, they are used for intrusion detection; and in the medical field, they can be used to assess potential underlying health problems of a patient.
The existing LOF (Local Outlier Factor) algorithm is a density-based anomaly detection algorithm. It assumes that the density around an anomalous data point differs significantly from the density around its neighboring points, and therefore detects anomalies by comparing the relative densities of the current point and its neighbors. Thanks to its good detection accuracy, the algorithm is widely used and has become a classic anomaly detection algorithm; with the arrival of the big-data era, however, its relatively high time complexity limits its range of application.
Disclosure of Invention
The application aims to provide a method and equipment for data distributed anomaly detection.
According to one aspect of the application, a method for data distributed anomaly detection is provided, wherein the method comprises:
the method comprises the steps that a master node acquires a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point;
and the master node broadcasts the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes, wherein the second working node is any working node in the cluster other than the first working node.
According to another aspect of the present application, there is also provided a method for data distributed anomaly detection, wherein the method comprises:
a first working node in the cluster sends a first data set stored on that working node to a master node, wherein the first data set comprises at least one data point;
a second working node in the cluster acquires the first data set of the first working node broadcast by the master node, wherein the second working node is any working node in the cluster other than the first working node;
and the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes.
According to another aspect of the present application, there is also provided a master node for data distributed anomaly detection, wherein the master node is configured to:
obtaining a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point; and broadcasting the first data set to a second working node in the cluster so that the working node in the cluster calculates the abnormal probability of each data point based on the data set on the working node, wherein the second working node is the other working nodes except the first working node in the cluster.
According to yet another aspect of the present application, there is also provided a cluster for data distributed anomaly detection, wherein the cluster comprises:
the first working node is configured to send a first data set stored on that working node to the master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the first data set comprises at least one data point;
and the second working node is configured to acquire the first data set of the first working node broadcast by the master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the second working node is any working node in the cluster other than the first working node.
Compared with the prior art, the master node collects the data set on each working node and broadcasts it to the other working nodes, so that every working node calculates the anomaly probability of each data point based on both the data set stored on that working node and the data sets broadcast by the master node.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 illustrates a flow diagram of a method for distributed anomaly detection of data, according to one aspect of the subject application;
FIG. 2 is a schematic diagram illustrating a master node acquiring and broadcasting a first data set in accordance with a preferred embodiment of the present application;
fig. 3 is a schematic diagram illustrating a first working node obtaining a first global K-distance neighbor of a data point in the first data set according to a preferred embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transient media), such as modulated data signals and carrier waves.
To further illustrate the technical means and effects adopted by the present application, the following description clearly and completely describes the technical solution of the present application with reference to the accompanying drawings and preferred embodiments.
FIG. 1 illustrates a method for data distributed anomaly detection, according to one aspect of the present application, wherein the method comprises:
s11, a first working node in the cluster sends a first data set stored on the working node to a master node, and correspondingly, the master node acquires the first data set stored on the first working node in the cluster, wherein the first data set comprises at least one data point;
s12 the master node broadcasts the first data set to a second working node in the cluster, so that the working node in the cluster calculates an abnormal probability of each data point based on the data set on the working node, and accordingly the second working node obtains the first data set of the first working node broadcasted by the master node, where the second working node is another working node except the first working node in the cluster;
S13, the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes.
In this embodiment, in step S11, a first working node in the cluster sends a first data set stored on the working node to the master node, and accordingly, the master node obtains the first data set stored on the first working node in the cluster, where the first data set includes at least one data point.
Here, the cluster is composed of the nodes of a distributed system. In the embodiments of the present application, the distributed system mainly comprises a master node and working nodes; the master node is one node in the cluster and is responsible for collecting and broadcasting data. The first working node is any one of the working nodes in the distributed system, and the second working nodes are the working nodes other than the first working node. Each node in the distributed system includes, but is not limited to, various mobile intelligent devices, personal computers, network hosts, a single network server, a set of multiple network servers, or a cloud of multiple servers, where the cloud is composed of a large number of computers or network servers based on cloud computing, a form of distributed computing in which a group of loosely coupled computers forms a virtual supercomputer.
Specifically, the set of all data points that need to undergo anomaly detection is referred to as the data set. In this application, the entire data set is divided into multiple partitions that are stored distributively across the working nodes: each working node stores one or more partitions of the whole data set, the partition stored on the first working node is referred to as the first data set, and the partition stored on a second working node is referred to as a second data set. The resilient distributed dataset (RDD) abstraction provided by Spark may be used to read the original data set into memory in the format RDD[(sequence number, vector)]; Spark divides this RDD into multiple partitions that are stored distributively on the working nodes of the Spark cluster.
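The partitioned layout described above can be illustrated with a plain-Python stand-in for Spark's RDD[(sequence number, vector)] partitioning. This is a hedged sketch only: the function and variable names are illustrative, not taken from the patent, and a round-robin assignment is assumed in place of Spark's actual partitioner.

```python
def partition_dataset(points, num_workers):
    """Distribute (sequence number, vector) records across worker partitions,
    mimicking how Spark splits an RDD[(index, vector)] into partitions.
    Round-robin assignment is an illustrative assumption."""
    partitions = [[] for _ in range(num_workers)]
    for idx, vec in enumerate(points):
        partitions[idx % num_workers].append((idx, vec))
    return partitions

# Six 2-D points split across two workers:
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.0, 0.3)]
parts = partition_dataset(data, 2)
# parts[0] holds records 0, 2, 4; parts[1] holds records 1, 3, 5
```

Each partition plays the role of a "first data set" when its worker's turn comes to send data to the master node.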
The LOF algorithm requires the k-distance neighbors of each data point, where the k-distance is the distance between the current point and its k-th nearest data point, and the k-distance neighbors are all data points whose distance from the current point does not exceed the k-distance. Taking the first working node as an example, to compute the k-distance neighbors of each data point in the first data set over the whole data set, the first data set must be sent to the other working nodes, i.e., the second working nodes, so that the k-distance neighbors of each data point in the first data set within the entire data set can be obtained.
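As a concrete illustration of these two definitions (a sketch, not the patent's implementation; `k_distance_neighbors` is a hypothetical helper name), the k-distance and the k-distance neighborhood of a point can be computed as:

```python
import math

def k_distance_neighbors(p, candidates, k):
    """Return (k-distance, neighbors) for point p.
    k-distance: distance from p to its k-th nearest other point.
    neighbors: all points whose distance from p does not exceed the k-distance."""
    dists = sorted((math.dist(p[1], q[1]), q[0])
                   for q in candidates if q[0] != p[0])
    k_dist = dists[min(k, len(dists)) - 1][0]
    neighbors = [(idx, d) for d, idx in dists if d <= k_dist]
    return k_dist, neighbors

pts = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (3.0, 0.0))]
kd, nbrs = k_distance_neighbors(pts[0], pts, k=1)
# kd == 1.0, nbrs == [(1, 1.0)]
```

Note that the neighborhood can contain more than k points when several candidates are tied at exactly the k-distance, which is why LOF defines it via a distance threshold rather than a fixed count.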
Therefore, in step S11, the first working node in the cluster sends the first data set stored on that working node to the master node, and the master node collects it. Each working node takes the role of the first working node in turn: every working node in the cluster sends its data set to the master node, and the master node broadcasts that data set to the other working nodes. That is, in step S12, the master node broadcasts the first data set to the second working nodes in the cluster by means of a broadcast variable, so that all working nodes in the cluster can compute the anomaly probability of each data point based on the data sets on the working nodes; accordingly, the second working nodes acquire the first data set of the first working node broadcast by the master node.
Fig. 2 is a schematic diagram of the master node acquiring and broadcasting the first data set. Driver is the master node; the four nodes below it are working nodes, each storing a data set of three data points in the [(sequence number, vector)] format. Any one of the four working nodes may serve as the first working node, with the others as second working nodes; in this embodiment the leftmost working node is taken as the first working node. The master node collects the data set on the first working node, i.e., the first data set, and broadcasts it to the second working nodes. The second working nodes then take the role of the first working node in turn, so that the data set on every working node is broadcast to all the other working nodes.
Continuing in this embodiment, in said step S13, the worker nodes in the cluster calculate the anomaly probability of each data point based on the data set on the worker node. That is, the working nodes in the cluster calculate the anomaly probability of each data point based on the first data set broadcast by the master node through the broadcast variable and the data set stored on the current working node.
Preferably, a second data set is stored on the second working node, and the step S13 includes: S131 (not shown), the first working node in the cluster acquires the first global K-distance neighbors of the data points in the first data set, and the second working node acquires the second global K-distance neighbors of the data points in the second data set; S132 (not shown), the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the first and second global K-distance neighbors; S133 (not shown), the working nodes in the cluster determine, in parallel, the reachable density of each data point on the working node based on the reachable distances between each data point and its K-distance neighbors; S134 (not shown), the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point.
In this embodiment, in the step S131, the first worker node in the cluster acquires a first global K-distance neighbor of a data point in the first data set, and the second worker node acquires a second global K-distance neighbor of a data point in the second data set.
The first global K-distance neighbor represents a K-distance neighbor of a data point in the first data set in all data sets of all working nodes, and the second global K-distance neighbor represents a K-distance neighbor of a data point in the second data set in all data sets of all working nodes. The first global K-distance neighbor and the second global K-distance neighbor are collectively called as global K-distance neighbors and represent the K-distance neighbors of each data point in all data sets of all working nodes. For example, since the master node broadcasts the data set on each working node to other working nodes, each working node may calculate K-distance neighbors of data points in the data set on the working node in all data sets, so as to obtain a global K-distance neighbor of each data point.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set includes:
S1311 (not shown), the first working node computes the first local K-distance neighbors of the data points in the first data set, i.e., the K-distance neighbors of those data points within the first data set itself;
S1312 (not shown), the second working node computes the second local K-distance neighbors of the data points in the first data set, i.e., the K-distance neighbors of those data points within the second data set;
s1313 (not shown), the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, in step S1311, the first working node obtains a K-distance neighbor, that is, a first local K-distance neighbor, of each data point in the first data set by calculating a distance between each data point in the first data set.
Similarly, in the step S1312, since the second working node obtains the first data set, the second working node calculates a distance between each data point in the first data set and each data point in the second data set, so as to obtain a K-distance neighbor, i.e., a second local K-distance neighbor, of each data point in the first data set in the second data set.
Further, in the step S1313, the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
The global K-distance neighbors for data points in the second data set may be determined in the same manner as the global K-distance neighbors for data points in the first data set.
Preferably, wherein the step S1313 includes: s13131 (not shown), the first working node acquires a second local K-distance neighbor determined by the second working node; s13132 (not shown), the first working node obtains a global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, the global K-distance neighbor of the data point in the first data set may be determined by the first working node, that is, the first local K-distance neighbor and the second local K-distance neighbor are aggregated onto the first working node to perform calculation to obtain the global K-distance neighbor of the data point in the first data set. Similarly, the global K-distance neighbors of the data points in the second data set can be determined through the second working node, so that distributed calculation is realized, and the operation efficiency is improved.
Preferably, the step S1313 includes: the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node; the first working node receives a global K-distance neighbor of the data point in the first data set, wherein the global K-distance neighbor of the data point in the first data set is obtained by the intermediate working node based on the first local K-distance neighbor and the second local K-distance neighbor.
In this preferred embodiment, the computation of the global K-distance neighbors need not be performed on a working node; it may instead be implemented by an intermediate node, i.e., a node provided in addition to the master node and the working nodes. The intermediate node determines the global K-distance neighbors by obtaining the first local K-distance neighbors and the second local K-distance neighbors and performing the corresponding computation.
Preferably, the stored representation of the local and global K-distance neighbors includes the sequence-number identifier of each data point together with the sequence-number identifiers of its local and global K-distance neighbors. That is, when the local or global K-distance neighbors of a data point are stored, both the data point's sequence number and its neighbors' sequence numbers are recorded. For example, the K-distance neighbors of a data point p may be written as: (data point p, [(neighbor o1, distance d1), ..., (neighbor ok, distance dk)]), where di is the distance between data point p and neighbor oi.
For example, in the embodiment shown in fig. 3, the global (and local) K-distance neighbor of data point 0 is data point 1; since K = 1, the K-distance neighbors of data point 0 can be stored as (0, [(1, d1)]), where d1 is the distance between data point 0 and data point 1.
Preferably, the obtaining a global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor comprises:
carrying out reduction on data points with the same serial number identification in the first local K-distance neighbor and the second local K-distance neighbor; determining a global K-distance neighbor of a data point in the first data set based on the specification result. Specifically, all data points with the same serial number can be reduced to the same working node through the reducibykey operation, for example, the data points are reduced to the first working node or an intermediate node.
As shown in fig. 3, an embodiment of the first working node obtaining the first global K-distance neighbor of the data point in the first data set is shown, in which the data set stored on each working node includes three data points, and each data point is stored in a (sequence number, vector) data format, for example, the data points with sequence numbers 0, 1, and 2 are stored on the first working node. In this embodiment, it is assumed that K in the K-distance neighbor is 1.
In this embodiment, the Driver end collects the data set on the first working node and broadcasts it to the second working nodes. In step S1311, the first working node calculates the first local K-distance neighbors of the data points in the first data set, that is, the local K-distance neighbors of the data points with sequence numbers 0, 1, and 2 within that data set (the first data set); for example, the first local K-distance neighbor of the data point with sequence number 0 is the data point with sequence number 1.
Likewise, in step S1312, the second working node calculates the second local K-distance neighbors of the data points in the first data set: it computes the distance between each data point in the first data set and each data point in the second data set and takes the closest point as the local K-distance neighbor within the second data set. For example, the second local K-distance neighbor of the data point with sequence number 0 on second working node 2 is the data point with sequence number 4.
Continuing in this embodiment, in the step S1313, the first working node obtains the first global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors. Specifically, the first and second local K-distance neighbors can be sorted in ascending order of distance, and the K local neighbors with the smallest distances selected by traversing them.
Specifically, all records with the same sequence number may be reduced onto the same working node through a reduceByKey operation. For example, as shown in fig. 3 with K = 1, the local K-distance neighbors of the data point with sequence number 0 are reduced together and sorted by their distance from that data point, and the K closest data points are kept. In this embodiment the merged candidate distances for the data point with sequence number 0 are (0.484, 0.598); since K = 1, its global K-distance neighbor is the data point with sequence number 1 (at distance 0.484), and similarly the global K-distance neighbor of the data point with sequence number 1 is the data point with sequence number 2.
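The reduce-and-select step above can be sketched in plain Python as a stand-in for the reduceByKey operation (names are illustrative, and the distances reuse the Fig. 3 example): each worker emits, per data-point sequence number, its local candidate neighbors, and the candidates are merged by key with only the k closest kept.

```python
def merge_local_neighbors(local_lists, k):
    """reduceByKey-style merge: gather every worker's local
    (neighbor, distance) candidates per data-point sequence number
    and keep the k closest as the global K-distance neighbors."""
    merged = {}
    for partition in local_lists:
        for idx, candidates in partition:
            merged.setdefault(idx, []).extend(candidates)
    return {idx: sorted(cands, key=lambda nd: nd[1])[:k]
            for idx, cands in merged.items()}

# Candidates for data point 0 from two workers (distances as in Fig. 3):
local1 = [(0, [(1, 0.484)])]   # worker 1: neighbor is point 1 at 0.484
local2 = [(0, [(4, 0.598)])]   # worker 2: neighbor is point 4 at 0.598
global_nbrs = merge_local_neighbors([local1, local2], k=1)
# point 1 wins as the global 1-distance neighbor of point 0
```

In Spark the same effect is achieved by keying each candidate list by data-point sequence number and reducing with a function that merges two sorted candidate lists and truncates to k.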
Preferably, wherein the step S132 includes: the working nodes in the cluster determine the reachable distance between each data point and the K-distance neighbors of the data points based on the global K-distance neighbors of each data point on the working node or based on the global K-distance neighbors of each data point on the working node and the data points on other working nodes.
In this embodiment, when the global K-distance neighbor of each data point on the working node is a data point on the working node, the working node in the cluster only needs to determine the reachable distance between each data point on the working node and the K-distance neighbor thereof based on the global K-distance neighbor of each data point on the working node; when the global K-distance neighbors of the data points existing on the working node are the data points on other working nodes, the reachable distance between each data point on the working node and the K-distance neighbors of the data points on the working node needs to be determined based on each data point on the working node and the global K-distance neighbors of the data points on the other working nodes. And each working node performs parallel calculation, so that the reachable distance of the K-distance neighbors of all the data points is determined.
Here, the reachable distance between the data point o and the data point p is defined as follows:
reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }
where d (p, o) refers to the distance between data points p and o.
In particular, each data point is swapped with its neighbors in the K-distance neighbor result. For example, if the K-distance neighbors of data point p are stored as RDD[(data point p, [(neighbor o1, distance d1), ..., (neighbor ok, distance dk)])], then swapping data point p with its neighbors yields Swapped_RDD[(neighbor o1, (data point p, distance d1)), ..., (neighbor ok, (data point p, distance dk)), ...], where each record of the RDD is stored as a key-value pair with the neighbor as the key and the (data point, distance) pair as the value. The Swapped_RDD is then grouped by key with the original RDD (an operation on RDDs provided by Spark), which reduces the records sharing the same key in the Swapped_RDD and the original RDD onto the same working node. For each data point serving as a key, the k-distance of its neighbor can then be compared with the distance between the data point and that neighbor, and the larger of the two taken as the reachable distance; the result is recorded as RDD[(data point p, [(neighbor o1, reachable distance d1), ..., (neighbor ok, reachable distance dk)])].
In this application, because records with different keys reside on different working nodes after the reduction, the computation of reachable distances is executed fully in parallel; each working node is independently responsible for its own computation tasks without affecting the others, which further improves the execution speed of the algorithm.
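The per-record comparison can be written as a small helper (a hypothetical plain-Python sketch of the logic; in the patent this comparison runs per key after the group operation):

```python
def reachable_distances(global_neighbors, k_distances):
    """reach-dist_k(p, o) = max(k-distance(o), d(p, o)) for every point p
    and each of its neighbors o.
    global_neighbors: {p: [(o, d(p, o)), ...]}
    k_distances:      {o: k-distance of o}"""
    return {p: [(o, max(k_distances[o], d)) for o, d in nbrs]
            for p, nbrs in global_neighbors.items()}

gn = {0: [(1, 0.5)], 1: [(0, 0.5)]}
kd = {0: 0.5, 1: 0.7}
rd = reachable_distances(gn, kd)
# rd[0] == [(1, 0.7)]: d(0,1)=0.5 < k-distance(1)=0.7, so the k-distance wins
```

Taking the maximum smooths the distances of points deep inside a neighbor's k-neighborhood, which stabilizes the density estimate in the next step.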
Further, in step S133, the working nodes in the cluster calculate the reachable density of each data point in parallel; that is, each working node calculates the reachable density of the data points originally stored on it. The reachable density is defined as follows:
lrd_MinPts(p) = 1 / ( ( Σ_{o ∈ N_MinPts(p)} reach-dist_MinPts(p, o) ) / |N_MinPts(p)| )
where lrd_MinPts(p) denotes the reachable density of p, and N_MinPts(p) denotes the set of all k-distance neighbors of the current data point p when k = MinPts. Specifically, a groupByKey operation is performed on the RDD obtained in step S132, reducing data with the same key to the same working node. For each reduced record, the sum of the reachable distances of all neighbors of the data point serving as the key and the number of those neighbors are computed, and the reachable density is calculated according to the above formula; the result is recorded as RDD[(data point p, [(neighbor o_1, reachable density d_1), …, (neighbor o_k, reachable density d_k)])]. Similarly, the process of calculating the reachable density is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
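The reachable density amounts to the inverse of the mean reachable distance over a point's neighbors; a minimal sketch with hypothetical reachable-distance values might read:

```python
# reachable distances of each point's MinPts neighbors (illustrative values)
reach_dists = {
    "p":  [1.5, 2.0],
    "o1": [2.0, 2.0],
}

def lrd(dists):
    # lrd(p) = |N(p)| / sum of reach-dist(p, o) over neighbors o,
    # i.e. the inverse of the mean reachable distance
    return len(dists) / sum(dists)

densities = {pt: lrd(d) for pt, d in reach_dists.items()}
```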
Further, in step S134, the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point. Here, the anomaly probability is expressed as LOF_MinPts(p) and defined as follows:
LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} ( lrd_MinPts(o) / lrd_MinPts(p) ) ) / |N_MinPts(p)|
Specifically, the RDD obtained in step S133 and the Swapped_RDD are grouped by key; the reachable density of each data point, the sum of the reachable densities of all its neighbors, and the number of those neighbors are computed, and the anomaly probability that the data point is an abnormal data point is calculated according to the above formula. Similarly, the process of calculating the anomaly probability is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
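A minimal numeric sketch of this final LOF step, using hypothetical reachable densities and neighbor lists, might look like:

```python
# hypothetical reachable densities and neighbor lists
lrd_of = {"p": 0.5, "o1": 1.0, "o2": 1.0}
neighbors = {"p": ["o1", "o2"]}

def lof(p):
    # LOF(p) = mean over neighbors o of lrd(o) / lrd(p);
    # values well above 1 flag p as a likely outlier
    nbrs = neighbors[p]
    return sum(lrd_of[o] / lrd_of[p] for o in nbrs) / len(nbrs)
```

Here lof("p") evaluates to 2.0, i.e., p is noticeably less dense than its neighbors and would receive a high anomaly probability.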
Compared with the prior art, in the present application the data sets on the working nodes are collected and broadcast through the master node, so that each working node calculates the anomaly probability of each data point based on the data set stored on it and the data sets broadcast by the master node.
There is also provided in accordance with another aspect of the present application a cluster for data distributed anomaly detection, wherein the cluster includes:
a first working node, configured to send a first data set stored on the working node to the master node and to calculate the anomaly probability of each data point based on the data set on the working node, wherein the first data set comprises at least one data point;
the master node is configured to acquire the first data set and broadcast the first data set to a second working node in the cluster, so that the working node in the cluster calculates an abnormal probability of each data point based on the data set on the working node, where the second working node is another working node except the first working node in the cluster;
and a second working node, configured to acquire the first data set of the first working node broadcast by the master node, and to calculate the anomaly probability of each data point based on the first data set and the data set on the working node.
In this embodiment, a first working node in a cluster sends a first data set stored on the working node to a master node, and accordingly, the master node obtains the first data set stored on the first working node in the cluster, where the first data set includes at least one data point.
Here, the cluster is composed of each node in the distributed system, for example, in the embodiment of the present application, the distributed system mainly includes a master node and a working node, and the master node is one node in the cluster and can play a role in collecting and broadcasting data; the first working node is any one of all working nodes in the distributed system, and the second working node is other working nodes except the first working node. Here, each node in the distributed system includes, but is not limited to, various mobile intelligent devices, personal computers, network hosts, a single network server, a plurality of network server sets, or a cloud of multiple servers; wherein, the Cloud is composed of a large number of computers or network servers based on Cloud Computing (Cloud Computing), wherein the Cloud Computing is one of distributed Computing, and is a virtual super computer composed of a group of loosely coupled computer sets.
Specifically, the set of all data points that need to undergo anomaly detection is referred to as the data set. In this application, the entire data set is divided into a plurality of data partitions that are stored in a distributed manner across the working nodes; each working node stores one or more partitions of the entire data set. A partition stored on a first working node is referred to as a first data set, and a partition stored on a second working node is referred to as a second data set. In this application, the distributed data set (RDD) provided by Spark may be used to read the original data set into memory in the data format RDD[(serial number, vector)], where the RDD is divided into multiple partitions by Spark and stored in a distributed manner on the working nodes of the Spark cluster.
The LOF algorithm requires the k-distance neighbors of each data point, where the k-distance refers to the distance between the current point and its k-th nearest data point, and the k-distance neighbors are all data points whose distance from the current point does not exceed the k-distance. Taking the first working node as an example, when calculating the k-distance neighbors of each data point in the first data set on the first working node, the first data set needs to be sent to the other working nodes, i.e., the second working nodes, so that the k-distance neighbors of each data point in the first data set over the whole data set can be obtained.
Therefore, a first working node in the cluster sends the first data set stored on it to the master node; accordingly, the master node collects the first data set stored on the first working node in the cluster. In this case, each working node becomes the first working node in turn, that is, every working node in the cluster sends its data set to the master node, and the master node broadcasts that data set to the other working nodes. Specifically, the master node broadcasts the first data set to the second working nodes in the cluster by means of a broadcast variable, so that all working nodes in the cluster can calculate the anomaly probability of each data point based on the data sets they hold; accordingly, each second working node acquires the first data set of the first working node broadcast by the master node.
Fig. 2 is a schematic diagram illustrating the master node acquiring and broadcasting the first data set, where Driver is the master node and the four nodes below it are working nodes, each storing a data set. In this embodiment, each data set has three data points, stored in the data format [(serial number, vector)]. Any one of the four working nodes may serve as the first working node, with the others serving as second working nodes; here the leftmost working node is taken as the first working node and the remaining working nodes as second working nodes. The master node collects the data set of the first working node, i.e., the first data set, and broadcasts it to the second working nodes. Furthermore, the second working nodes take turns acting as the first working node, so that the data set on each working node is broadcast to all other working nodes.
Continuing with this embodiment, the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets they hold, that is, based on the first data set broadcast by the master node through the broadcast variable and the data set stored on the current working node.
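A toy sketch of one collect-and-broadcast round follows, with plain Python standing in for Spark's collect and broadcast-variable mechanisms; the worker names and vectors are invented for illustration:

```python
# partitions of the whole data set, keyed by (hypothetical) worker name
partitions = {
    "worker1": [(0, (0.1, 0.2)), (1, (0.2, 0.2)), (2, (0.9, 0.8))],
    "worker2": [(3, (0.5, 0.5)), (4, (0.15, 0.25)), (5, (0.8, 0.9))],
}

def broadcast_round(first):
    """Master collects the first worker's partition and ships a copy to
    every other worker, mirroring one round of the scheme in the text."""
    first_set = partitions[first]                      # master collects
    return {w: list(first_set) for w in partitions if w != first}

received = broadcast_round("worker1")
```

Repeating broadcast_round for each worker in turn gives every node access to the full data set, which is what the global K-distance-neighbor computation requires.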
Preferably, the second working node has a second data set stored thereon. The first working node in the cluster acquires a first global K-distance neighbor of the data points in the first data set, and the second working node acquires a second global K-distance neighbor of the data points in the second data set. Further, the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors, based on the first global K-distance neighbor and the second global K-distance neighbor; further, the working nodes in the cluster determine, in parallel, the reachable density of each data point on the working node based on the reachable distance between each data point and its K-distance neighbors; further, the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point.
In this embodiment, the first worker node in the cluster obtains a first global K-distance neighbor of data points in the first data set, and the second worker node obtains a second global K-distance neighbor of data points in the second data set.
The first global K-distance neighbor represents a K-distance neighbor of a data point in the first data set in all data sets of all working nodes, and the second global K-distance neighbor represents a K-distance neighbor of a data point in the second data set in all data sets of all working nodes. The first global K-distance neighbor and the second global K-distance neighbor are collectively called as global K-distance neighbors and represent the K-distance neighbors of each data point in all data sets of all working nodes. For example, since the master node broadcasts the data set on each working node to other working nodes, each working node may calculate K-distance neighbors of data points in the data set on the working node in all data sets, so as to obtain a global K-distance neighbor of each data point.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set includes:
the first working node calculating a first local K-distance neighbor of data points in the first data set, wherein the first local K-distance neighbor comprises a K-distance neighbor of data points in the first data set;
the second working node calculating a second local K-distance neighbor of data points in the first data set, wherein the second local K-distance neighbor comprises the K-distance neighbor of data points in the first data set in the second data set;
the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, the first working node obtains the K-distance neighbors of each data point in the first data set, i.e., the first local K-distance neighbors, by calculating the distances between the data points in the first data set.
Similarly, since the second working node obtains the first data set, the second working node calculates a distance between each data point in the first data set and each data point in the second data set, so as to obtain a K-distance neighbor, i.e., a second local K-distance neighbor, of each data point in the first data set in the second data set.
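For instance, the second working node's computation can be sketched as a brute-force nearest-neighbor search of the broadcast points against the local partition; the Euclidean distance, the sample points, and k = 1 below are illustrative choices, not mandated by the text:

```python
import math

first_set = [(0, (0.0, 0.0)), (1, (1.0, 0.0))]    # broadcast first data set
second_set = [(4, (0.1, 0.0)), (5, (5.0, 5.0))]   # partition local to this node
k = 1

def local_knn(points, candidates, k):
    """K-distance neighbors of each broadcast point within the local partition."""
    result = {}
    for sn, vec in points:
        # distance from this broadcast point to every local candidate
        dists = sorted((math.dist(vec, cvec), csn) for csn, cvec in candidates)
        result[sn] = [(csn, d) for d, csn in dists[:k]]
    return result

second_local = local_knn(first_set, second_set, k)
```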
Further, the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
The global K-distance neighbors for data points in the second data set may be determined in the same manner as the global K-distance neighbors for data points in the first data set.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor includes: the first working node acquires a second local K-distance neighbor determined by the second working node; the first working node obtains a global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, the global K-distance neighbor of the data point in the first data set may be determined by the first working node, that is, the first local K-distance neighbor and the second local K-distance neighbor are aggregated onto the first working node to perform calculation to obtain the global K-distance neighbor of the data point in the first data set. Similarly, the global K-distance neighbors of the data points in the second data set can be determined through the second working node, so that distributed calculation is realized, and the operation efficiency is improved.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor includes: the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node; the first working node receives a global K-distance neighbor of the data point in the first data set, wherein the global K-distance neighbor of the data point in the first data set is obtained by the intermediate working node based on the first local K-distance neighbor and the second local K-distance neighbor.
In this preferred embodiment, the computation of the global K-distance neighbor need not be performed by a working node; instead it may be carried out by an intermediate node, which is a node provided in addition to the master node and the working nodes. The intermediate node determines the global K-distance neighbor by obtaining the first local K-distance neighbor and the second local K-distance neighbor and performing the corresponding calculation.
Preferably, the local K-distance neighbors and the global K-distance neighbors are stored using the serial number identifiers of the data points together with the serial number identifiers of their local and global K-distance neighbors. That is, when storing the local and global K-distance neighbors of each data point, the serial number identifier of the data point and the serial number identifiers of its local and global K-distance neighbors are included. For example, the K-distance neighbors of data point p may be written as: (data point p, [(neighbor o_1, distance d_1), …, (neighbor o_k, distance d_k)]), where d is the distance between data point p and the corresponding neighbor.
For example, in the embodiment shown in fig. 3, the global or local K-distance neighbor of data point 0 is data point 1, and since K is 1, the K-distance neighbor of data point 0 can be stored as (0, 1, d1), where d1 is the distance between data point 0 and data point 1.
Preferably, the obtaining a global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor comprises:
reducing data points with the same serial number identifier in the first local K-distance neighbor and the second local K-distance neighbor; and determining the global K-distance neighbors of the data points in the first data set based on the reduction result. Specifically, all data points with the same serial number can be reduced to the same working node through a reduceByKey operation, for example, to the first working node or to an intermediate node.
As shown in fig. 3, an embodiment of the first working node obtaining the first global K-distance neighbor of the data point in the first data set is shown, in which the data set stored on each working node includes three data points, and each data point is stored in a (sequence number, vector) data format, for example, the data points with sequence numbers 0, 1, and 2 are stored on the first working node. In this embodiment, it is assumed that K in the K-distance neighbor is 1.
In this embodiment, the Driver end collects the data set on the first working node and broadcasts it to the second working nodes. The first working node calculates the first local K-distance neighbors of the data points in the first data set, that is, the local K-distance neighbors of the data points with serial numbers 0, 1, and 2 within that data set (i.e., the first data set); for example, the first local K-distance neighbor of the data point with serial number 0 is the data point with serial number 1.
Similarly, each second working node calculates second local K-distance neighbors of the data points in the first data set, that is, it calculates the distance between each data point in the first data set and each data point in its second data set, and takes the closest point as the local K-distance neighbor of that data point within the second data set. For example, the second local K-distance neighbor of the data point with serial number 0 on the second working node is the data point with serial number 4.
Continuing in this embodiment, the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor. Specifically, the first local K-distance neighbors and the second local K-distance neighbors can be sorted in ascending order of distance, and then the K local neighbors with the minimum distance can be selected by traversing the first local K-distance neighbors and the second local K-distance neighbors.
Specifically, all data points with the same serial number may be reduced to the same working node through a reduceByKey operation. For example, as shown in fig. 3, where K is 1, the local K-distance neighbors of the data point with serial number 0 are reduced together and sorted by their distance to that data point, and the K data points with the smallest distances are kept. In this embodiment, since K is 1 and the candidate distances are (0.484, 0.598), the global K-distance neighbor of the data point with serial number 0 is determined to be the data point with serial number 1; similarly, the global K-distance neighbor of the data point with serial number 1 is the data point with serial number 2.
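This merge step can be sketched in plain Python as a reduce-by-key over serial numbers followed by keeping the k smallest distances; the candidate lists below are hypothetical, loosely echoing the distances mentioned for fig. 3:

```python
from heapq import nsmallest

# (serial number, local candidate list) pairs gathered from several workers
local_candidates = [
    (0, [(1, 0.484)]),    # candidate from the first working node
    (0, [(4, 0.598)]),    # candidate from a second working node
    (1, [(2, 0.3)]),
    (1, [(5, 0.7)]),
]
k = 1

# reduce-by-key: concatenate candidate lists that share a serial number
merged = {}
for sn, nbrs in local_candidates:
    merged.setdefault(sn, []).extend(nbrs)

# keep only the k nearest candidates per data point
global_knn = {sn: nsmallest(k, nbrs, key=lambda t: t[1])
              for sn, nbrs in merged.items()}
```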
Preferably, the working nodes in the cluster determining, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the first global K-distance neighbor and the second global K-distance neighbor comprises: the working nodes in the cluster determine the reachable distance between each data point and its K-distance neighbors based on the global K-distance neighbors among the data points on the working node itself, or based on the global K-distance neighbors among the data points on the working node and data points on other working nodes.
In this embodiment, when the global K-distance neighbors of a data point on the working node are themselves data points on that working node, the working node only needs to determine the reachable distance between each of its data points and their K-distance neighbors based on those local neighbors; when some global K-distance neighbors of data points on the working node are data points on other working nodes, the reachable distance between each data point on the working node and its K-distance neighbors needs to be determined based on the data points on the working node together with the global K-distance neighbors located on the other working nodes. Each working node computes in parallel, thereby determining the reachable distances for the K-distance neighbors of all data points.
Here, the reachable distance between the data point o and the data point p is defined as follows:
reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }
where d (p, o) refers to the distance between data points p and o.
In particular, the K-distance neighbors of each data point obtained above are swapped with their data point. For example, the K-distance neighbors of data point p are denoted as RDD[(data point p, [(neighbor o_1, distance d_1), …, (neighbor o_k, distance d_k)])]; after swapping data point p with each of its neighbors, this may be denoted as Swapped_RDD[(neighbor o_1, (data point p, distance d_1)), …, (neighbor o_k, (data point p, distance d_k)), …], where each record of the RDD is stored as a key-value pair with the neighbor as the key and the data point and distance as the value. The Swapped_RDD is then grouped by key with the original RDD (a group operation on RDDs provided by Spark), which reduces the records with the same key in the Swapped_RDD and the original RDD to the same working node. Thus, for each key, the k-distance of the neighbor and the distance between the data point and the neighbor can be compared, and the larger value is taken as the reachable distance between the data point and that neighbor. The calculation result is recorded as RDD[(data point p, [(neighbor o_1, reachable distance d_1), …, (neighbor o_k, reachable distance d_k)])].
In this application, because data with different keys reside on different working nodes after reduction, the process of calculating the reachable distance is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others, which further improves the execution speed of the algorithm.
Further, the working nodes in the cluster calculate the reachable density of each data point in parallel, that is, each working node calculates the reachable density of the data points initially stored on it. The reachable density is defined as follows:
lrd_MinPts(p) = 1 / ( ( Σ_{o ∈ N_MinPts(p)} reach-dist_MinPts(p, o) ) / |N_MinPts(p)| )
where lrd_MinPts(p) denotes the reachable density of p, and N_MinPts(p) denotes the set of all k-distance neighbors of the current data point p when k = MinPts. Specifically, a groupByKey operation is performed on the RDD obtained after calculating the reachable distances, reducing data with the same key to the same working node. For each reduced record, the sum of the reachable distances of all neighbors of the data point serving as the key and the number of those neighbors are computed, and the reachable density is calculated according to the above formula; the result is recorded as RDD[(data point p, [(neighbor o_1, reachable density d_1), …, (neighbor o_k, reachable density d_k)])]. Similarly, the process of calculating the reachable density is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
Further, the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point. Here, the anomaly probability is expressed as LOF_MinPts(p) and defined as follows:

LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} ( lrd_MinPts(o) / lrd_MinPts(p) ) ) / |N_MinPts(p)|
Specifically, the RDD obtained after calculating the reachable densities and the Swapped_RDD are grouped by key; the reachable density of each data point, the sum of the reachable densities of all its neighbors, and the number of those neighbors are computed, and the anomaly probability that the data point is an abnormal data point is calculated according to the above formula. Similarly, the process of calculating the anomaly probability is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
Compared with the prior art, in the present application the data sets on the working nodes are collected and broadcast through the master node, so that each working node calculates the anomaly probability of each data point based on the data set stored on it and the data sets broadcast by the master node.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (18)

1. A method for data distributed anomaly detection, wherein the method comprises:
the method comprises the steps that a main node acquires a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point;
and broadcasting the first data set to a second working node in the cluster by the master node so that the working node in the cluster calculates the abnormal probability of each data point based on the data set on the working node, wherein the second working node is the other working nodes except the first working node in the cluster.
2. A method for data distributed anomaly detection, wherein the method comprises:
a first working node in a cluster sends a first data set stored on the working node to a main node, wherein the first data set comprises at least one data point;
a second working node in the cluster acquires a first data set of the first working node broadcasted by the main node, wherein the second working node is other working nodes except the first working node in the cluster;
and calculating the abnormal probability of each data point by the working nodes in the cluster based on the data set on the working node.
3. The method of claim 2, wherein the second working node has a second data set stored thereon, and wherein working nodes in the cluster calculate the probability of anomaly for each data point based on the data set on the working node comprises:
the first working node in the cluster acquires a first global K-distance neighbor of a data point in the first data set, and the second working node acquires a second global K-distance neighbor of a data point in the second data set;
the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the first global K-distance neighbor and the second global K-distance neighbor;
the working nodes in the cluster determine, in parallel, the reachable density of each data point on the working node based on the reachable distance between each data point and its K-distance neighbors;
and the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point.
4. The method of claim 3, wherein the first worker node obtaining a first global K-distance neighbor of a data point in the first dataset comprises:
the first working node calculating a first local K-distance neighbor of data points in the first data set, wherein the first local K-distance neighbor comprises a K-distance neighbor of data points in the first data set;
the second working node calculating a second local K-distance neighbor of data points in the first data set, wherein the second local K-distance neighbor comprises the K-distance neighbor of data points in the first data set in the second data set;
the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
5. The method of claim 4, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node acquires a second local K-distance neighbor determined by the second working node;
the first working node obtains a global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
6. The method of claim 4, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node;
the first working node receives a global K-distance neighbor of the data point in the first data set, wherein the global K-distance neighbor of the data point in the first data set is obtained by the intermediate working node based on the first local K-distance neighbor and the second local K-distance neighbor.
7. The method according to any of claims 4 to 6, wherein the local K-distance and global K-distance neighbors are stored in a manner comprising a sequence number identification of the data point and sequence number identifications of the local and global K-distance neighbors of the data point.
8. The method of claim 7, wherein the obtaining a global K-distance neighbor of a data point in the first data set based on the first and second local K-distance neighbors comprises:
reducing data points with the same serial number identification in the first local K-distance neighbor and the second local K-distance neighbor;
determining a global K-distance neighbor of a data point in the first data set based on the reduction result.
10. The method of claim 3, wherein the working nodes in the cluster determining, in parallel, the reachable distance of each data point on the working node from its K-distance neighbors based on the first global K-distance neighbor and the second global K-distance neighbor comprises:
the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the global K-distance neighbors of the data points on the working node, or based on the global K-distance neighbors of the data points on the working node and data points on other working nodes.
10. A master node for data distributed anomaly detection, wherein the master node is configured to:
obtain a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point; and broadcast the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes, wherein the second working node is a working node in the cluster other than the first working node.
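As an illustrative aside (not part of the claims), the master node's gather-and-broadcast role in claim 10 can be sketched as follows; the function and variable names are hypothetical, and a real system would use its framework's broadcast primitive rather than Python dictionaries:

```python
def master_broadcast(workers):
    """workers: {node_id: list_of_points}.

    Simulates the master broadcasting each working node's data set to
    every other working node, and returns the full view each worker
    holds afterwards, so each can compute anomaly probabilities locally.
    """
    full_view = {}
    for source_id, dataset in workers.items():
        for target_id in workers:
            if target_id != source_id:
                # target starts from its own local data set ...
                full_view.setdefault(target_id, list(workers[target_id]))
                # ... and receives the broadcast copy of source's data
                full_view[target_id].extend(dataset)
    return full_view

workers = {"w1": [(0.0, 0.0)], "w2": [(1.0, 1.0)], "w3": [(2.0, 2.0)]}
view = master_broadcast(workers)
# every worker now sees all three data points
```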
11. A cluster for data distributed anomaly detection, wherein the cluster comprises:
a first working node configured to send a first data set stored on the working node to a master node and to calculate the anomaly probability of each data point based on the data set on the working node, wherein the first data set comprises at least one data point;
a second working node configured to acquire the first data set of the first working node broadcast by the master node and to calculate the anomaly probability of each data point based on the data set on the working node, wherein the second working node is a working node in the cluster other than the first working node.
12. The cluster of claim 11, wherein a second data set is stored on the second working node; the first working node obtains a first global K-distance neighbor of the data points in the first data set, and the second working node obtains a second global K-distance neighbor of the data points in the second data set; the working nodes in the cluster determine in parallel, based on the first global K-distance neighbor and the second global K-distance neighbor, the reachable distance between each data point on the working node and its K-distance neighbors;
determine in parallel the reachable density of each data point on the working node based on the reachable distance between each data point and its K-distance neighbors;
and determine in parallel the anomaly probability of each data point on the working node based on the reachable density of each data point.
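The patent does not give formulas for the last two steps of claim 12; as an illustrative aside, they can be sketched under the standard LOF definitions (assumed here): the reachable density (local reachability density) of a point is the inverse of the mean reachable distance to its K-distance neighbors, and the anomaly probability (LOF score) is the mean ratio of the neighbors' reachable densities to the point's own.

```python
def lrd(reach_dists):
    """Reachable density: 1 / mean(reachable distance to K neighbors)."""
    return len(reach_dists) / sum(reach_dists)

def lof(point_lrd, neighbor_lrds):
    """Anomaly score: mean of neighbor lrd / own lrd.

    Scores near 1 indicate inliers; scores well above 1 indicate that
    the point is in a much sparser region than its neighbors, i.e. an
    anomaly.
    """
    return sum(n / point_lrd for n in neighbor_lrds) / len(neighbor_lrds)

# A point whose neighbors are twice as dense as itself scores 2.0:
score = lof(0.5, [1.0, 1.0, 1.0])
```

Both functions use only per-point quantities and the neighbors' densities, so, once the global K-distance neighbors are in place, each working node can evaluate them for its own data points in parallel as the claim describes.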
13. The cluster of claim 12, wherein the first working node obtaining a first global K-distance neighbor of the data points in the first data set comprises:
the first working node calculating a first local K-distance neighbor of the data points in the first data set, wherein the first local K-distance neighbor comprises the K-distance neighbors, within the first data set, of the data points in the first data set;
the second working node calculating a second local K-distance neighbor of the data points in the first data set, wherein the second local K-distance neighbor comprises the K-distance neighbors, within the second data set, of the data points in the first data set;
the first working node obtaining the first global K-distance neighbor of the data points in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
14. The cluster of claim 13, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node acquires a second local K-distance neighbor determined by the second working node;
the first working node obtains a global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
15. The cluster of claim 13, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node;
the first working node receives a global K-distance neighbor of the data points in the first data set sent by the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbor of the data points in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
16. The cluster according to any of claims 13 to 15, wherein the local K-distance neighbors and the global K-distance neighbors are stored as the sequence number identification of the data point together with the sequence number identifications of the data point's local and global K-distance neighbors.
17. The cluster of claim 16, wherein the obtaining a global K-distance neighbor of a data point in the first data set based on the first and second local K-distance neighbors comprises:
performing a reduce operation on data points having the same sequence number identification in the first local K-distance neighbor and the second local K-distance neighbor;
determining a global K-distance neighbor of a data point in the first data set based on the result of the reduce operation.
18. The cluster of claim 12, wherein the working nodes in the cluster determining in parallel, based on the first global K-distance neighbor and the second global K-distance neighbor, the reachable distance between each data point on the working node and its K-distance neighbors comprises:
the working nodes in the cluster determining in parallel the reachable distance between each data point on the working node and its K-distance neighbors, based on the global K-distance neighbors of each data point on the working node, or based on the global K-distance neighbors of each data point on the working node together with the data points on the other working nodes.
CN201710778563.7A 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection Active CN107528904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710778563.7A CN107528904B (en) 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710778563.7A CN107528904B (en) 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection

Publications (2)

Publication Number Publication Date
CN107528904A CN107528904A (en) 2017-12-29
CN107528904B true CN107528904B (en) 2020-02-18

Family

ID=60683289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710778563.7A Active CN107528904B (en) 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection

Country Status (1)

Country Link
CN (1) CN107528904B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108471428B (en) * 2018-06-27 2021-05-28 北京云端智度科技有限公司 DDoS attack active defense technology and equipment applied to CDN system
CN109388512A (en) * 2018-10-30 2019-02-26 江苏物联网研究发展中心 For the assessment and analysis system of large-scale computer cluster intensity of anomaly
CN111835696B (en) * 2019-04-23 2023-05-09 阿里巴巴集团控股有限公司 Method and device for detecting abnormal request individuals
CN110580030A (en) * 2019-10-11 2019-12-17 南京铁道职业技术学院 Pharmaceutical factory environment purification control system based on Internet of things
WO2023131962A1 (en) * 2022-01-04 2023-07-13 Telefonaktiebolaget Lm Ericsson (Publ) First node, second node and methods performed thereby for handling anomalous values

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017028930A1 (en) * 2015-08-20 2017-02-23 Telefonaktiebolaget Lm Ericsson (Publ) Methods and apparatus for running an analytics function
CN107070753A (en) * 2017-06-15 2017-08-18 郑州云海信息技术有限公司 A kind of data monitoring method of distributed cluster system, apparatus and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702660B (en) * 2009-11-12 2011-12-14 中国科学院计算技术研究所 abnormal domain name detection method and system
CN104852886B (en) * 2014-02-14 2019-05-24 腾讯科技(深圳)有限公司 The guard method of user account number and device
US10367838B2 (en) * 2015-04-16 2019-07-30 Nec Corporation Real-time detection of abnormal network connections in streaming data
CN105871634B (en) * 2016-06-01 2019-02-15 北京蓝海讯通科技股份有限公司 Detect the method for cluster exception and the system of application, management cluster

Also Published As

Publication number Publication date
CN107528904A (en) 2017-12-29

Similar Documents

Publication Publication Date Title
CN107528904B (en) Method and apparatus for data distributed anomaly detection
US11710131B2 (en) Method and apparatus of identifying a transaction risk
JP6321681B2 (en) Method and apparatus for identifying website users
CN110826594B (en) Track clustering method, equipment and storage medium
US20180181751A1 (en) Anomaly Detection in Distributed Ledger Systems
US10402427B2 (en) System and method for analyzing result of clustering massive data
US8484212B2 (en) Providing reconstructed data based on stored aggregate data in response to queries for unavailable data
US11100073B2 (en) Method and system for data assignment in a distributed system
CN109145225B (en) Data processing method and device
CN106022936B (en) Community structure-based influence maximization algorithm applicable to thesis cooperative network
CN113285960B (en) Data encryption method and system for service data sharing cloud platform
CN106570029B (en) Data processing method and system for distributed relational database
CN113468226A (en) Service processing method, device, electronic equipment and storage medium
CN111626311B (en) Heterogeneous graph data processing method and device
CN115617279A (en) Distributed cloud data processing method and device and storage medium
CN109993338B (en) Link prediction method and device
CN110046194A (en) A kind of method, apparatus and electronic equipment of expanding node relational graph
CN112015825B (en) Model registration method and device based on block chain and electronic equipment
Yan et al. Federated clustering with GAN-based data synthesis
CN115629945A (en) Alarm processing method and device and electronic equipment
CN115955323A (en) Network security situation sensing method and device and electronic equipment
CN113873025A (en) Data processing method and device, storage medium and electronic equipment
US20140258333A1 (en) Fast distributed database frequency summarization
WO2018228259A1 (en) Relationship diagram processing method and apparatus
CN115374109B (en) Data access method, device, computing equipment and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee after: Star link information technology (Shanghai) Co.,Ltd.

Address before: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee before: TRANSWARP TECHNOLOGY (SHANGHAI) Co.,Ltd.
