CN107528904B - Method and apparatus for data distributed anomaly detection - Google Patents


Info

Publication number
CN107528904B
CN107528904B (grant) · CN201710778563.7A (application) · CN107528904A (publication)
Authority
CN
China
Prior art keywords: distance, data, working node, neighbor, node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710778563.7A
Other languages
Chinese (zh)
Other versions
CN107528904A (en
Inventor
林晨
Current Assignee
Transwarp Technology Shanghai Co Ltd
Original Assignee
Transwarp Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Transwarp Technology Shanghai Co Ltd filed Critical Transwarp Technology Shanghai Co Ltd
Priority to CN201710778563.7A priority Critical patent/CN107528904B/en
Publication of CN107528904A publication Critical patent/CN107528904A/en
Application granted granted Critical
Publication of CN107528904B publication Critical patent/CN107528904B/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/06Generation of reports
    • H04L43/062Generation of reports related to network traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425Traffic logging, e.g. anomaly detection

Abstract

The application aims to provide a method and device for distributed anomaly detection on data. Compared with the prior art, the master node collects the data set on each working node and broadcasts it to the other working nodes, so that every working node calculates the anomaly probability of each data point based on both the data set stored on that working node and the data sets broadcast by the master node.

Description

Method and apparatus for data distributed anomaly detection
Technical Field
The present application relates to the field of computer technologies, and in particular, to a technique for data distributed anomaly detection.
Background
Anomaly detection aims to discover anomalous data points in a data set that differ significantly from the other data points, whose occurrence is not random but may arise from an entirely different generating mechanism. Analyzing these anomalous data points can uncover important information hidden behind them and thereby bring considerable economic value. Anomaly detection is already widely applied in real life: in the financial field, anomaly detection techniques are used for credit-card fraud detection; in the information-security field, they are used for intrusion detection; and in the medical field, they can be used to assess potential underlying health problems of a patient.
The existing LOF (Local Outlier Factor) algorithm is a density-based anomaly detection algorithm. It assumes that the density around an anomalous data point differs significantly from the density around its neighboring points, and therefore detects anomalies by comparing the relative densities of the current point and its neighbors. Thanks to its good detection accuracy, the algorithm is widely used and has become a classic anomaly detection algorithm; with the arrival of the big-data era, however, its relatively high time complexity limits its range of application.
Disclosure of Invention
The application aims to provide a method and equipment for data distributed anomaly detection.
According to one aspect of the application, a method for data distributed anomaly detection is provided, wherein the method comprises:
the method comprises the steps that a master node acquires a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point;
and the master node broadcasts the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes, wherein the second working node is any working node in the cluster other than the first working node.
According to another aspect of the present application, there is also provided a method for data distributed anomaly detection, wherein the method comprises:
a first working node in the cluster sends a first data set stored on that working node to a master node, wherein the first data set comprises at least one data point;
a second working node in the cluster acquires the first data set of the first working node broadcast by the master node, wherein the second working node is any working node in the cluster other than the first working node;
and the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes.
According to another aspect of the present application, there is also provided a master node for data distributed anomaly detection, wherein the master node is configured to:
obtaining a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point; and broadcasting the first data set to a second working node in the cluster so that the working node in the cluster calculates the abnormal probability of each data point based on the data set on the working node, wherein the second working node is the other working nodes except the first working node in the cluster.
According to yet another aspect of the present application, there is also provided a cluster for data distributed anomaly detection, wherein the cluster comprises:
the first working node is configured to send a first data set stored on that working node to the master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the first data set comprises at least one data point;
and the second working node is configured to acquire the first data set of the first working node broadcast by the master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the second working node is any working node in the cluster other than the first working node.
Compared with the prior art, the master node collects the data set on each working node and broadcasts it to the other working nodes, so that every working node calculates the anomaly probability of each data point based on both the data set stored on that working node and the data sets broadcast by the master node.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 illustrates a flow diagram of a method for distributed anomaly detection of data, according to one aspect of the subject application;
FIG. 2 is a schematic diagram illustrating a master node acquiring and broadcasting a first data set in accordance with a preferred embodiment of the present application;
fig. 3 is a schematic diagram illustrating a first working node obtaining a first global K-distance neighbor of a data point in the first data set according to a preferred embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transient media), such as modulated data signals and carrier waves.
To further illustrate the technical means and effects adopted by the present application, the following description clearly and completely describes the technical solution of the present application with reference to the accompanying drawings and preferred embodiments.
FIG. 1 illustrates a method for data distributed anomaly detection, according to one aspect of the present application, wherein the method comprises:
s11, a first working node in the cluster sends a first data set stored on the working node to a master node, and correspondingly, the master node acquires the first data set stored on the first working node in the cluster, wherein the first data set comprises at least one data point;
s12 the master node broadcasts the first data set to a second working node in the cluster, so that the working node in the cluster calculates an abnormal probability of each data point based on the data set on the working node, and accordingly the second working node obtains the first data set of the first working node broadcasted by the master node, where the second working node is another working node except the first working node in the cluster;
S13, the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes.
In this embodiment, in step S11, a first working node in the cluster sends a first data set stored on the working node to the master node, and accordingly, the master node obtains the first data set stored on the first working node in the cluster, where the first data set includes at least one data point.
Here, the cluster is composed of the nodes of a distributed system. In the embodiments of the present application, the distributed system mainly comprises a master node and working nodes; the master node is one node in the cluster and is responsible for collecting and broadcasting data. The first working node is any one of the working nodes in the distributed system, and the second working nodes are the working nodes other than the first working node. Each node in the distributed system includes, but is not limited to, various mobile intelligent devices, personal computers, network hosts, a single network server, a set of multiple network servers, or a cloud of multiple servers, where the cloud is composed of a large number of computers or network servers based on cloud computing, a form of distributed computing in which a group of loosely coupled computers forms a virtual supercomputer.
Specifically, the set of all data points that need to undergo anomaly detection is referred to as the data set. In this application, the entire data set is divided into multiple partitions that are stored distributively across the working nodes: each working node stores one or more partitions of the whole data set, the partition stored on the first working node is referred to as the first data set, and the partition stored on a second working node is referred to as a second data set. The resilient distributed dataset (RDD) abstraction provided by Spark may be used to read the original data set into memory in the format RDD[(sequence number, vector)]; Spark divides this RDD into multiple partitions that are stored distributively on the working nodes of the Spark cluster.
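The partitioned layout described above can be illustrated with a plain-Python stand-in for Spark's RDD[(sequence number, vector)] partitioning. This is a hedged sketch only: the function and variable names are illustrative, not taken from the patent, and a round-robin assignment is assumed in place of Spark's actual partitioner.

```python
def partition_dataset(points, num_workers):
    """Distribute (sequence number, vector) records across worker partitions,
    mimicking how Spark splits an RDD[(index, vector)] into partitions.
    Round-robin assignment is an illustrative assumption."""
    partitions = [[] for _ in range(num_workers)]
    for idx, vec in enumerate(points):
        partitions[idx % num_workers].append((idx, vec))
    return partitions

# Six 2-D points split across two workers:
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.0, 0.3)]
parts = partition_dataset(data, 2)
# parts[0] holds records 0, 2, 4; parts[1] holds records 1, 3, 5
```

Each partition plays the role of a "first data set" when its worker's turn comes to send data to the master node.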
The LOF algorithm requires the k-distance neighbors of each data point, where the k-distance is the distance between the current point and its k-th nearest data point, and the k-distance neighbors are all data points whose distance from the current point does not exceed the k-distance. Taking the first working node as an example, to compute the k-distance neighbors of each data point in the first data set over the whole data set, the first data set must be sent to the other working nodes, i.e., the second working nodes, so that the k-distance neighbors of each data point in the first data set within the entire data set can be obtained.
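As a concrete illustration of these two definitions (a sketch, not the patent's implementation; `k_distance_neighbors` is a hypothetical helper name), the k-distance and the k-distance neighborhood of a point can be computed as:

```python
import math

def k_distance_neighbors(p, candidates, k):
    """Return (k-distance, neighbors) for point p.
    k-distance: distance from p to its k-th nearest other point.
    neighbors: all points whose distance from p does not exceed the k-distance."""
    dists = sorted((math.dist(p[1], q[1]), q[0])
                   for q in candidates if q[0] != p[0])
    k_dist = dists[min(k, len(dists)) - 1][0]
    neighbors = [(idx, d) for d, idx in dists if d <= k_dist]
    return k_dist, neighbors

pts = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (3.0, 0.0))]
kd, nbrs = k_distance_neighbors(pts[0], pts, k=1)
# kd == 1.0, nbrs == [(1, 1.0)]
```

Note that the neighborhood can contain more than k points when several candidates are tied at exactly the k-distance, which is why LOF defines it via a distance threshold rather than a fixed count.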
Therefore, in step S11, the first working node in the cluster sends the first data set stored on that working node to the master node, and the master node collects it. Each working node takes the role of the first working node in turn: every working node in the cluster sends its data set to the master node, and the master node broadcasts that data set to the other working nodes. That is, in step S12, the master node broadcasts the first data set to the second working nodes in the cluster by means of a broadcast variable, so that all working nodes in the cluster can compute the anomaly probability of each data point based on the data sets on the working nodes; accordingly, the second working nodes acquire the first data set of the first working node broadcast by the master node.
Fig. 2 is a schematic diagram of the master node acquiring and broadcasting the first data set. Driver is the master node; the four nodes below it are working nodes, each storing a data set of three data points in the [(sequence number, vector)] format. Any one of the four working nodes may serve as the first working node, with the others as second working nodes; in this embodiment the leftmost working node is taken as the first working node. The master node collects the data set on the first working node, i.e., the first data set, and broadcasts it to the second working nodes. The second working nodes then take the role of the first working node in turn, so that the data set on every working node is broadcast to all the other working nodes.
Continuing in this embodiment, in said step S13, the worker nodes in the cluster calculate the anomaly probability of each data point based on the data set on the worker node. That is, the working nodes in the cluster calculate the anomaly probability of each data point based on the first data set broadcast by the master node through the broadcast variable and the data set stored on the current working node.
Preferably, a second data set is stored on the second working node, and the step S13 includes: S131 (not shown), the first working node in the cluster acquires the first global K-distance neighbors of the data points in the first data set, and the second working node acquires the second global K-distance neighbors of the data points in the second data set; S132 (not shown), the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the first and second global K-distance neighbors; S133 (not shown), the working nodes in the cluster determine, in parallel, the reachable density of each data point on the working node based on the reachable distances between each data point and its K-distance neighbors; S134 (not shown), the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point.
In this embodiment, in the step S131, the first worker node in the cluster acquires a first global K-distance neighbor of a data point in the first data set, and the second worker node acquires a second global K-distance neighbor of a data point in the second data set.
The first global K-distance neighbor represents a K-distance neighbor of a data point in the first data set in all data sets of all working nodes, and the second global K-distance neighbor represents a K-distance neighbor of a data point in the second data set in all data sets of all working nodes. The first global K-distance neighbor and the second global K-distance neighbor are collectively called as global K-distance neighbors and represent the K-distance neighbors of each data point in all data sets of all working nodes. For example, since the master node broadcasts the data set on each working node to other working nodes, each working node may calculate K-distance neighbors of data points in the data set on the working node in all data sets, so as to obtain a global K-distance neighbor of each data point.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set includes:
S1311 (not shown), the first working node computes the first local K-distance neighbors of the data points in the first data set, i.e., the K-distance neighbors of those data points within the first data set itself;
S1312 (not shown), the second working node computes the second local K-distance neighbors of the data points in the first data set, i.e., the K-distance neighbors of those data points within the second data set;
s1313 (not shown), the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, in step S1311, the first working node obtains a K-distance neighbor, that is, a first local K-distance neighbor, of each data point in the first data set by calculating a distance between each data point in the first data set.
Similarly, in the step S1312, since the second working node obtains the first data set, the second working node calculates a distance between each data point in the first data set and each data point in the second data set, so as to obtain a K-distance neighbor, i.e., a second local K-distance neighbor, of each data point in the first data set in the second data set.
Further, in the step S1313, the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
The global K-distance neighbors for data points in the second data set may be determined in the same manner as the global K-distance neighbors for data points in the first data set.
Preferably, wherein the step S1313 includes: s13131 (not shown), the first working node acquires a second local K-distance neighbor determined by the second working node; s13132 (not shown), the first working node obtains a global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, the global K-distance neighbor of the data point in the first data set may be determined by the first working node, that is, the first local K-distance neighbor and the second local K-distance neighbor are aggregated onto the first working node to perform calculation to obtain the global K-distance neighbor of the data point in the first data set. Similarly, the global K-distance neighbors of the data points in the second data set can be determined through the second working node, so that distributed calculation is realized, and the operation efficiency is improved.
Preferably, the step S1313 includes: the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node; the first working node receives a global K-distance neighbor of the data point in the first data set, wherein the global K-distance neighbor of the data point in the first data set is obtained by the intermediate working node based on the first local K-distance neighbor and the second local K-distance neighbor.
In this preferred embodiment, the computation of the global K-distance neighbors need not be performed on a working node; it may instead be implemented by an intermediate node, i.e., a node provided in addition to the master node and the working nodes. The intermediate node determines the global K-distance neighbors by obtaining the first local K-distance neighbors and the second local K-distance neighbors and performing the corresponding computation.
Preferably, the stored representation of the local and global K-distance neighbors includes the sequence-number identifier of each data point together with the sequence-number identifiers of its local and global K-distance neighbors. That is, when the local or global K-distance neighbors of a data point are stored, both the data point's sequence number and its neighbors' sequence numbers are recorded. For example, the K-distance neighbors of a data point p may be written as: (data point p, [(neighbor o1, distance d1), ..., (neighbor ok, distance dk)]), where di is the distance between data point p and neighbor oi.
For example, in the embodiment shown in fig. 3, the global (and local) K-distance neighbor of data point 0 is data point 1; since K = 1, the K-distance neighbors of data point 0 can be stored as (0, [(1, d1)]), where d1 is the distance between data point 0 and data point 1.
Preferably, the obtaining a global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor comprises:
carrying out reduction on data points with the same serial number identification in the first local K-distance neighbor and the second local K-distance neighbor; determining a global K-distance neighbor of a data point in the first data set based on the specification result. Specifically, all data points with the same serial number can be reduced to the same working node through the reducibykey operation, for example, the data points are reduced to the first working node or an intermediate node.
As shown in fig. 3, an embodiment of the first working node obtaining the first global K-distance neighbor of the data point in the first data set is shown, in which the data set stored on each working node includes three data points, and each data point is stored in a (sequence number, vector) data format, for example, the data points with sequence numbers 0, 1, and 2 are stored on the first working node. In this embodiment, it is assumed that K in the K-distance neighbor is 1.
In this embodiment, the Driver end collects the data set on the first working node and broadcasts it to the second working nodes. In step S1311, the first working node calculates the first local K-distance neighbors of the data points in the first data set, that is, the local K-distance neighbors of the data points with sequence numbers 0, 1, and 2 within that data set (the first data set); for example, the first local K-distance neighbor of the data point with sequence number 0 is the data point with sequence number 1.
Likewise, in step S1312, the second working node calculates the second local K-distance neighbors of the data points in the first data set: it computes the distance between each data point in the first data set and each data point in the second data set and takes the closest point as the local K-distance neighbor within the second data set. For example, the second local K-distance neighbor of the data point with sequence number 0 on second working node 2 is the data point with sequence number 4.
Continuing in this embodiment, in the step S1313, the first working node obtains the first global K-distance neighbors of the data points in the first data set based on the first local K-distance neighbors and the second local K-distance neighbors. Specifically, the first and second local K-distance neighbors can be sorted in ascending order of distance, and the K local neighbors with the smallest distances selected by traversing them.
Specifically, all records with the same sequence number may be reduced onto the same working node through a reduceByKey operation. For example, as shown in fig. 3 with K = 1, the local K-distance neighbors of the data point with sequence number 0 are reduced together and sorted by their distance from that data point, and the K closest data points are kept. In this embodiment the merged candidate distances for the data point with sequence number 0 are (0.484, 0.598); since K = 1, its global K-distance neighbor is the data point with sequence number 1 (at distance 0.484), and similarly the global K-distance neighbor of the data point with sequence number 1 is the data point with sequence number 2.
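The reduce-and-select step above can be sketched in plain Python as a stand-in for the reduceByKey operation (names are illustrative, and the distances reuse the Fig. 3 example): each worker emits, per data-point sequence number, its local candidate neighbors, and the candidates are merged by key with only the k closest kept.

```python
def merge_local_neighbors(local_lists, k):
    """reduceByKey-style merge: gather every worker's local
    (neighbor, distance) candidates per data-point sequence number
    and keep the k closest as the global K-distance neighbors."""
    merged = {}
    for partition in local_lists:
        for idx, candidates in partition:
            merged.setdefault(idx, []).extend(candidates)
    return {idx: sorted(cands, key=lambda nd: nd[1])[:k]
            for idx, cands in merged.items()}

# Candidates for data point 0 from two workers (distances as in Fig. 3):
local1 = [(0, [(1, 0.484)])]   # worker 1: neighbor is point 1 at 0.484
local2 = [(0, [(4, 0.598)])]   # worker 2: neighbor is point 4 at 0.598
global_nbrs = merge_local_neighbors([local1, local2], k=1)
# point 1 wins as the global 1-distance neighbor of point 0
```

In Spark the same effect is achieved by keying each candidate list by data-point sequence number and reducing with a function that merges two sorted candidate lists and truncates to k.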
Preferably, wherein the step S132 includes: the working nodes in the cluster determine the reachable distance between each data point and the K-distance neighbors of the data points based on the global K-distance neighbors of each data point on the working node or based on the global K-distance neighbors of each data point on the working node and the data points on other working nodes.
In this embodiment, when the global K-distance neighbor of each data point on the working node is a data point on the working node, the working node in the cluster only needs to determine the reachable distance between each data point on the working node and the K-distance neighbor thereof based on the global K-distance neighbor of each data point on the working node; when the global K-distance neighbors of the data points existing on the working node are the data points on other working nodes, the reachable distance between each data point on the working node and the K-distance neighbors of the data points on the working node needs to be determined based on each data point on the working node and the global K-distance neighbors of the data points on the other working nodes. And each working node performs parallel calculation, so that the reachable distance of the K-distance neighbors of all the data points is determined.
Here, the reachable distance between the data point o and the data point p is defined as follows:
reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }
where d (p, o) refers to the distance between data points p and o.
In particular, each data point is swapped with its neighbors in the K-distance neighbor result. For example, if the K-distance neighbors of data point p are stored as RDD[(data point p, [(neighbor o1, distance d1), ..., (neighbor ok, distance dk)])], then swapping data point p with its neighbors yields Swapped_RDD[(neighbor o1, (data point p, distance d1)), ..., (neighbor ok, (data point p, distance dk)), ...], where each record of the RDD is stored as a key-value pair with the neighbor as the key and the (data point, distance) pair as the value. The Swapped_RDD is then grouped by key with the original RDD (an operation on RDDs provided by Spark), which reduces the records sharing the same key in the Swapped_RDD and the original RDD onto the same working node. For each data point serving as a key, the k-distance of its neighbor can then be compared with the distance between the data point and that neighbor, and the larger of the two taken as the reachable distance; the result is recorded as RDD[(data point p, [(neighbor o1, reachable distance d1), ..., (neighbor ok, reachable distance dk)])].
In this application, because records with different keys reside on different working nodes after the reduction, the computation of reachable distances is executed fully in parallel; each working node is independently responsible for its own computation tasks without affecting the others, which further improves the execution speed of the algorithm.
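The per-record comparison can be written as a small helper (a hypothetical plain-Python sketch of the logic; in the patent this comparison runs per key after the group operation):

```python
def reachable_distances(global_neighbors, k_distances):
    """reach-dist_k(p, o) = max(k-distance(o), d(p, o)) for every point p
    and each of its neighbors o.
    global_neighbors: {p: [(o, d(p, o)), ...]}
    k_distances:      {o: k-distance of o}"""
    return {p: [(o, max(k_distances[o], d)) for o, d in nbrs]
            for p, nbrs in global_neighbors.items()}

gn = {0: [(1, 0.5)], 1: [(0, 0.5)]}
kd = {0: 0.5, 1: 0.7}
rd = reachable_distances(gn, kd)
# rd[0] == [(1, 0.7)]: d(0,1)=0.5 < k-distance(1)=0.7, so the k-distance wins
```

Taking the maximum smooths the distances of points deep inside a neighbor's k-neighborhood, which stabilizes the density estimate in the next step.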
Further, in step S133, the working nodes in the cluster calculate the reachable density of each data point in parallel; that is, each working node calculates the reachable density of the data points originally stored on it. The reachable density is defined as follows:
lrd_MinPts(p) = 1 / ( ( Σ_{o ∈ N_MinPts(p)} reach-dist_MinPts(p, o) ) / |N_MinPts(p)| )
where lrd_MinPts(p) denotes the reachable density of p, and N_MinPts(p) denotes the set of all k-distance neighbors of the current data point p when k = MinPts. Specifically, a groupByKey operation is performed on the RDD obtained in step S132, reducing data with the same key to the same working node. For each reduced record, the sum of the reachable distances of all neighbors of the data point serving as the key and the number of those neighbors are computed, and the reachable density is calculated according to the above formula; the result is recorded as RDD[(data point p, [(neighbor o_1, reachable density d_1), …, (neighbor o_k, reachable density d_k)])]. Similarly, the process of calculating the reachable density is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
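The reachable density amounts to the inverse of the mean reachable distance over a point's neighbors; a minimal sketch with hypothetical reachable-distance values might read:

```python
# reachable distances of each point's MinPts neighbors (illustrative values)
reach_dists = {
    "p":  [1.5, 2.0],
    "o1": [2.0, 2.0],
}

def lrd(dists):
    # lrd(p) = |N(p)| / sum of reach-dist(p, o) over neighbors o,
    # i.e. the inverse of the mean reachable distance
    return len(dists) / sum(dists)

densities = {pt: lrd(d) for pt, d in reach_dists.items()}
```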
Further, in step S134, the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point. Here, the anomaly probability is expressed as LOF_MinPts(p) and defined as follows:
LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} ( lrd_MinPts(o) / lrd_MinPts(p) ) ) / |N_MinPts(p)|
Specifically, the RDD obtained in step S133 and the Swapped_RDD are grouped by key; the reachable density of each data point, the sum of the reachable densities of all its neighbors, and the number of those neighbors are computed, and the anomaly probability that the data point is an abnormal data point is calculated according to the above formula. Similarly, the process of calculating the anomaly probability is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
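A minimal numeric sketch of this final LOF step, using hypothetical reachable densities and neighbor lists, might look like:

```python
# hypothetical reachable densities and neighbor lists
lrd_of = {"p": 0.5, "o1": 1.0, "o2": 1.0}
neighbors = {"p": ["o1", "o2"]}

def lof(p):
    # LOF(p) = mean over neighbors o of lrd(o) / lrd(p);
    # values well above 1 flag p as a likely outlier
    nbrs = neighbors[p]
    return sum(lrd_of[o] / lrd_of[p] for o in nbrs) / len(nbrs)
```

Here lof("p") evaluates to 2.0, i.e., p is noticeably less dense than its neighbors and would receive a high anomaly probability.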
Compared with the prior art, in the present application the data sets on the working nodes are collected and broadcast through the master node, so that each working node calculates the anomaly probability of each data point based on the data set stored on it and the data sets broadcast by the master node.
There is also provided in accordance with another aspect of the present application a cluster for data distributed anomaly detection, wherein the cluster includes:
a first working node, configured to send a first data set stored on the working node to the master node and to calculate the anomaly probability of each data point based on the data set on the working node, wherein the first data set comprises at least one data point;
the master node is configured to acquire the first data set and broadcast the first data set to a second working node in the cluster, so that the working node in the cluster calculates an abnormal probability of each data point based on the data set on the working node, where the second working node is another working node except the first working node in the cluster;
and a second working node, configured to acquire the first data set of the first working node broadcast by the master node, and to calculate the anomaly probability of each data point based on the first data set and the data set on the working node.
In this embodiment, a first working node in a cluster sends a first data set stored on the working node to a master node, and accordingly, the master node obtains the first data set stored on the first working node in the cluster, where the first data set includes at least one data point.
Here, the cluster is composed of each node in the distributed system, for example, in the embodiment of the present application, the distributed system mainly includes a master node and a working node, and the master node is one node in the cluster and can play a role in collecting and broadcasting data; the first working node is any one of all working nodes in the distributed system, and the second working node is other working nodes except the first working node. Here, each node in the distributed system includes, but is not limited to, various mobile intelligent devices, personal computers, network hosts, a single network server, a plurality of network server sets, or a cloud of multiple servers; wherein, the Cloud is composed of a large number of computers or network servers based on Cloud Computing (Cloud Computing), wherein the Cloud Computing is one of distributed Computing, and is a virtual super computer composed of a group of loosely coupled computer sets.
Specifically, the set of all data points that need to undergo anomaly detection is referred to as the data set. In this application, the entire data set is divided into a plurality of data partitions that are stored in a distributed manner across the working nodes; each working node stores one or more partitions of the entire data set. A partition stored on a first working node is referred to as a first data set, and a partition stored on a second working node is referred to as a second data set. In this application, the distributed data set (RDD) provided by Spark may be used to read the original data set into memory in the data format RDD[(serial number, vector)], where the RDD is divided into multiple partitions by Spark and stored in a distributed manner on the working nodes of the Spark cluster.
The LOF algorithm requires the k-distance neighbors of each data point, where the k-distance refers to the distance between the current point and its k-th nearest data point, and the k-distance neighbors are all data points whose distance from the current point does not exceed the k-distance. Taking the first working node as an example, when calculating the k-distance neighbors of each data point in the first data set on the first working node, the first data set needs to be sent to the other working nodes, i.e., the second working nodes, so that the k-distance neighbors of each data point in the first data set over the whole data set can be obtained.
Therefore, a first working node in the cluster sends the first data set stored on it to the master node; accordingly, the master node collects the first data set stored on the first working node in the cluster. In this case, each working node becomes the first working node in turn, that is, every working node in the cluster sends its data set to the master node, and the master node broadcasts that data set to the other working nodes. Specifically, the master node broadcasts the first data set to the second working nodes in the cluster by means of a broadcast variable, so that all working nodes in the cluster can calculate the anomaly probability of each data point based on the data sets they hold; accordingly, each second working node acquires the first data set of the first working node broadcast by the master node.
Fig. 2 is a schematic diagram illustrating the master node acquiring and broadcasting the first data set, where Driver is the master node and the four nodes below it are working nodes, each storing a data set. In this embodiment, each data set has three data points, stored in the data format [(serial number, vector)]. Any one of the four working nodes may serve as the first working node, with the others serving as second working nodes; here the leftmost working node is taken as the first working node and the remaining working nodes as second working nodes. The master node collects the data set of the first working node, i.e., the first data set, and broadcasts it to the second working nodes. Furthermore, the second working nodes take turns acting as the first working node, so that the data set on each working node is broadcast to all other working nodes.
Continuing with this embodiment, the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets they hold, that is, based on the first data set broadcast by the master node through the broadcast variable and the data set stored on the current working node.
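A toy sketch of one collect-and-broadcast round follows, with plain Python standing in for Spark's collect and broadcast-variable mechanisms; the worker names and vectors are invented for illustration:

```python
# partitions of the whole data set, keyed by (hypothetical) worker name
partitions = {
    "worker1": [(0, (0.1, 0.2)), (1, (0.2, 0.2)), (2, (0.9, 0.8))],
    "worker2": [(3, (0.5, 0.5)), (4, (0.15, 0.25)), (5, (0.8, 0.9))],
}

def broadcast_round(first):
    """Master collects the first worker's partition and ships a copy to
    every other worker, mirroring one round of the scheme in the text."""
    first_set = partitions[first]                      # master collects
    return {w: list(first_set) for w in partitions if w != first}

received = broadcast_round("worker1")
```

Repeating broadcast_round for each worker in turn gives every node access to the full data set, which is what the global K-distance-neighbor computation requires.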
Preferably, the second working node has a second data set stored thereon. The first working node in the cluster acquires a first global K-distance neighbor of the data points in the first data set, and the second working node acquires a second global K-distance neighbor of the data points in the second data set. Further, the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors, based on the first global K-distance neighbor and the second global K-distance neighbor; further, the working nodes in the cluster determine, in parallel, the reachable density of each data point on the working node based on the reachable distance between each data point and its K-distance neighbors; further, the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point.
In this embodiment, the first worker node in the cluster obtains a first global K-distance neighbor of data points in the first data set, and the second worker node obtains a second global K-distance neighbor of data points in the second data set.
The first global K-distance neighbor represents a K-distance neighbor of a data point in the first data set in all data sets of all working nodes, and the second global K-distance neighbor represents a K-distance neighbor of a data point in the second data set in all data sets of all working nodes. The first global K-distance neighbor and the second global K-distance neighbor are collectively called as global K-distance neighbors and represent the K-distance neighbors of each data point in all data sets of all working nodes. For example, since the master node broadcasts the data set on each working node to other working nodes, each working node may calculate K-distance neighbors of data points in the data set on the working node in all data sets, so as to obtain a global K-distance neighbor of each data point.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set includes:
the first working node calculating a first local K-distance neighbor of data points in the first data set, wherein the first local K-distance neighbor comprises a K-distance neighbor of data points in the first data set;
the second working node calculating a second local K-distance neighbor of data points in the first data set, wherein the second local K-distance neighbor comprises the K-distance neighbor of data points in the first data set in the second data set;
the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, the first working node obtains the K-distance neighbors of each data point in the first data set, i.e., the first local K-distance neighbors, by calculating the distances between the data points in the first data set.
Similarly, since the second working node obtains the first data set, the second working node calculates a distance between each data point in the first data set and each data point in the second data set, so as to obtain a K-distance neighbor, i.e., a second local K-distance neighbor, of each data point in the first data set in the second data set.
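For instance, the second working node's computation can be sketched as a brute-force nearest-neighbor search of the broadcast points against the local partition; the Euclidean distance, the sample points, and k = 1 below are illustrative choices, not mandated by the text:

```python
import math

first_set = [(0, (0.0, 0.0)), (1, (1.0, 0.0))]    # broadcast first data set
second_set = [(4, (0.1, 0.0)), (5, (5.0, 5.0))]   # partition local to this node
k = 1

def local_knn(points, candidates, k):
    """K-distance neighbors of each broadcast point within the local partition."""
    result = {}
    for sn, vec in points:
        # distance from this broadcast point to every local candidate
        dists = sorted((math.dist(vec, cvec), csn) for csn, cvec in candidates)
        result[sn] = [(csn, d) for d, csn in dists[:k]]
    return result

second_local = local_knn(first_set, second_set, k)
```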
Further, the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
The global K-distance neighbors for data points in the second data set may be determined in the same manner as the global K-distance neighbors for data points in the first data set.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor includes: the first working node acquires a second local K-distance neighbor determined by the second working node; the first working node obtains a global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
In this embodiment, the global K-distance neighbor of the data point in the first data set may be determined by the first working node, that is, the first local K-distance neighbor and the second local K-distance neighbor are aggregated onto the first working node to perform calculation to obtain the global K-distance neighbor of the data point in the first data set. Similarly, the global K-distance neighbors of the data points in the second data set can be determined through the second working node, so that distributed calculation is realized, and the operation efficiency is improved.
Preferably, the obtaining, by the first working node, a first global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor includes: the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node; the first working node receives a global K-distance neighbor of the data point in the first data set, wherein the global K-distance neighbor of the data point in the first data set is obtained by the intermediate working node based on the first local K-distance neighbor and the second local K-distance neighbor.
In this preferred embodiment, the computation of the global K-distance neighbor need not be performed by a working node; instead it may be carried out by an intermediate node, which is a node provided in addition to the master node and the working nodes. The intermediate node determines the global K-distance neighbor by obtaining the first local K-distance neighbor and the second local K-distance neighbor and performing the corresponding calculation.
Preferably, the local K-distance neighbors and the global K-distance neighbors are stored using the serial number identifiers of the data points together with the serial number identifiers of their local and global K-distance neighbors. That is, when storing the local and global K-distance neighbors of each data point, the serial number identifier of the data point and the serial number identifiers of its local and global K-distance neighbors are included. For example, the K-distance neighbors of data point p may be written as: (data point p, [(neighbor o_1, distance d_1), …, (neighbor o_k, distance d_k)]), where d is the distance between data point p and the corresponding neighbor.
For example, in the embodiment shown in fig. 3, the global or local K-distance neighbor of data point 0 is data point 1, and since K is 1, the K-distance neighbor of data point 0 can be stored as (0, 1, d1), where d1 is the distance between data point 0 and data point 1.
Preferably, the obtaining a global K-distance neighbor of a data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor comprises:
reducing data points with the same serial number identifier in the first local K-distance neighbor and the second local K-distance neighbor; and determining the global K-distance neighbors of the data points in the first data set based on the reduction result. Specifically, all data points with the same serial number can be reduced to the same working node through a reduceByKey operation, for example, to the first working node or to an intermediate node.
As shown in fig. 3, an embodiment of the first working node obtaining the first global K-distance neighbor of the data point in the first data set is shown, in which the data set stored on each working node includes three data points, and each data point is stored in a (sequence number, vector) data format, for example, the data points with sequence numbers 0, 1, and 2 are stored on the first working node. In this embodiment, it is assumed that K in the K-distance neighbor is 1.
In this embodiment, the Driver end collects the data set on the first working node and broadcasts it to the second working nodes. The first working node calculates the first local K-distance neighbors of the data points in the first data set, that is, the local K-distance neighbors of the data points with serial numbers 0, 1, and 2 within that data set (i.e., the first data set); for example, the first local K-distance neighbor of the data point with serial number 0 is the data point with serial number 1.
Similarly, each second working node calculates second local K-distance neighbors of the data points in the first data set, that is, it calculates the distance between each data point in the first data set and each data point in its second data set, and takes the closest point as the local K-distance neighbor of that data point within the second data set. For example, the second local K-distance neighbor of the data point with serial number 0 on the second working node is the data point with serial number 4.
Continuing in this embodiment, the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor. Specifically, the first local K-distance neighbors and the second local K-distance neighbors can be sorted in ascending order of distance, and then the K local neighbors with the minimum distance can be selected by traversing the first local K-distance neighbors and the second local K-distance neighbors.
Specifically, all data points with the same serial number may be reduced to the same working node through a reduceByKey operation. For example, as shown in fig. 3, where K is 1, the local K-distance neighbors of the data point with serial number 0 are reduced together and sorted by their distance to that data point, and the K data points with the smallest distances are kept. In this embodiment, since K is 1 and the candidate distances are (0.484, 0.598), the global K-distance neighbor of the data point with serial number 0 is determined to be the data point with serial number 1; similarly, the global K-distance neighbor of the data point with serial number 1 is the data point with serial number 2.
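This merge step can be sketched in plain Python as a reduce-by-key over serial numbers followed by keeping the k smallest distances; the candidate lists below are hypothetical, loosely echoing the distances mentioned for fig. 3:

```python
from heapq import nsmallest

# (serial number, local candidate list) pairs gathered from several workers
local_candidates = [
    (0, [(1, 0.484)]),    # candidate from the first working node
    (0, [(4, 0.598)]),    # candidate from a second working node
    (1, [(2, 0.3)]),
    (1, [(5, 0.7)]),
]
k = 1

# reduce-by-key: concatenate candidate lists that share a serial number
merged = {}
for sn, nbrs in local_candidates:
    merged.setdefault(sn, []).extend(nbrs)

# keep only the k nearest candidates per data point
global_knn = {sn: nsmallest(k, nbrs, key=lambda t: t[1])
              for sn, nbrs in merged.items()}
```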
Preferably, the working nodes in the cluster determining, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the first global K-distance neighbor and the second global K-distance neighbor comprises: the working nodes in the cluster determine the reachable distance between each data point and its K-distance neighbors based on the global K-distance neighbors among the data points on the working node itself, or based on the global K-distance neighbors among the data points on the working node and data points on other working nodes.
In this embodiment, when the global K-distance neighbors of a data point on the working node are themselves data points on that working node, the working node only needs to determine the reachable distance between each of its data points and their K-distance neighbors based on those local neighbors; when some global K-distance neighbors of data points on the working node are data points on other working nodes, the reachable distance between each data point on the working node and its K-distance neighbors needs to be determined based on the data points on the working node together with the global K-distance neighbors located on the other working nodes. Each working node computes in parallel, thereby determining the reachable distances for the K-distance neighbors of all data points.
Here, the reachable distance between the data point o and the data point p is defined as follows:
reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }
where d (p, o) refers to the distance between data points p and o.
In particular, the K-distance neighbors of each data point obtained above are swapped with their data point. For example, the K-distance neighbors of data point p are denoted as RDD[(data point p, [(neighbor o_1, distance d_1), …, (neighbor o_k, distance d_k)])]; after swapping data point p with each of its neighbors, this may be denoted as Swapped_RDD[(neighbor o_1, (data point p, distance d_1)), …, (neighbor o_k, (data point p, distance d_k)), …], where each record of the RDD is stored as a key-value pair with the neighbor as the key and the data point and distance as the value. The Swapped_RDD is then grouped by key with the original RDD (a group operation on RDDs provided by Spark), which reduces the records with the same key in the Swapped_RDD and the original RDD to the same working node. Thus, for each key, the k-distance of the neighbor and the distance between the data point and the neighbor can be compared, and the larger value is taken as the reachable distance between the data point and that neighbor. The calculation result is recorded as RDD[(data point p, [(neighbor o_1, reachable distance d_1), …, (neighbor o_k, reachable distance d_k)])].
In this application, because data with different keys reside on different working nodes after reduction, the process of calculating the reachable distance is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others, which further improves the execution speed of the algorithm.
Further, the working nodes in the cluster calculate the reachable density of each data point in parallel, that is, each working node calculates the reachable density of the data points initially stored on it. The reachable density is defined as follows:
lrd_MinPts(p) = 1 / ( ( Σ_{o ∈ N_MinPts(p)} reach-dist_MinPts(p, o) ) / |N_MinPts(p)| )
where lrd_MinPts(p) denotes the reachable density of p, and N_MinPts(p) denotes the set of all k-distance neighbors of the current data point p when k = MinPts. Specifically, a groupByKey operation is performed on the RDD obtained after calculating the reachable distances, reducing data with the same key to the same working node. For each reduced record, the sum of the reachable distances of all neighbors of the data point serving as the key and the number of those neighbors are computed, and the reachable density is calculated according to the above formula; the result is recorded as RDD[(data point p, [(neighbor o_1, reachable density d_1), …, (neighbor o_k, reachable density d_k)])]. Similarly, the process of calculating the reachable density is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
Further, the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point. Here, the anomaly probability is expressed as LOF_MinPts(p) and defined as follows:

LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} ( lrd_MinPts(o) / lrd_MinPts(p) ) ) / |N_MinPts(p)|
Specifically, the RDD obtained after calculating the reachable densities and the Swapped_RDD are grouped by key; the reachable density of each data point, the sum of the reachable densities of all its neighbors, and the number of those neighbors are computed, and the anomaly probability that the data point is an abnormal data point is calculated according to the above formula. Similarly, the process of calculating the anomaly probability is executed entirely in parallel; each working node independently handles its own computation tasks without affecting the others.
Compared with the prior art, in the present application the data sets on the working nodes are collected and broadcast through the master node, so that each working node calculates the anomaly probability of each data point based on the data set stored on it and the data sets broadcast by the master node.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (18)

1. A method for data distributed anomaly detection, wherein the method comprises:
the method comprises the steps that a main node acquires a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point;
and broadcasting the first data set to a second working node in the cluster by the master node so that the working node in the cluster calculates the abnormal probability of each data point based on the data set on the working node, wherein the second working node is the other working nodes except the first working node in the cluster.
2. A method for data distributed anomaly detection, wherein the method comprises:
a first working node in a cluster sends a first data set stored on the working node to a main node, wherein the first data set comprises at least one data point;
a second working node in the cluster acquires a first data set of the first working node broadcasted by the main node, wherein the second working node is other working nodes except the first working node in the cluster;
and calculating the abnormal probability of each data point by the working nodes in the cluster based on the data set on the working node.
3. The method of claim 2, wherein the second working node has a second data set stored thereon, and wherein working nodes in the cluster calculate the probability of anomaly for each data point based on the data set on the working node comprises:
the first working node in the cluster acquires a first global K-distance neighbor of a data point in the first data set, and the second working node acquires a second global K-distance neighbor of a data point in the second data set;
the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the first global K-distance neighbor and the second global K-distance neighbor;
the working nodes in the cluster determine, in parallel, the reachable density of each data point on the working node based on the reachable distance between each data point and its K-distance neighbors;
and the working nodes in the cluster determine, in parallel, the anomaly probability of each data point on the working node based on the reachable density of each data point.
4. The method of claim 3, wherein the first worker node obtaining a first global K-distance neighbor of a data point in the first dataset comprises:
the first working node calculating a first local K-distance neighbor of data points in the first data set, wherein the first local K-distance neighbor comprises a K-distance neighbor of data points in the first data set;
the second working node calculating a second local K-distance neighbor of data points in the first data set, wherein the second local K-distance neighbor comprises the K-distance neighbor of data points in the first data set in the second data set;
the first working node obtains a first global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
5. The method of claim 4, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node acquires a second local K-distance neighbor determined by the second working node;
the first working node obtains a global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
6. The method of claim 4, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node;
the first working node receives a global K-distance neighbor of the data point in the first data set, wherein the global K-distance neighbor of the data point in the first data set is obtained by the intermediate working node based on the first local K-distance neighbor and the second local K-distance neighbor.
7. The method according to any of claims 4 to 6, wherein the local K-distance and global K-distance neighbors are stored in a manner comprising a sequence number identification of the data point and sequence number identifications of the local and global K-distance neighbors of the data point.
8. The method of claim 7, wherein the obtaining a global K-distance neighbor of a data point in the first data set based on the first and second local K-distance neighbors comprises:
reducing data points with the same serial number identification in the first local K-distance neighbor and the second local K-distance neighbor;
determining a global K-distance neighbor of a data point in the first data set based on the reduction result.
10. The method of claim 3, wherein the working nodes in the cluster determining, in parallel, the reachable distance of each data point on the working node from its K-distance neighbors based on the first global K-distance neighbor and the second global K-distance neighbor comprises:
the working nodes in the cluster determine, in parallel, the reachable distance between each data point on the working node and its K-distance neighbors based on the global K-distance neighbors of the data points on the working node, or based on the global K-distance neighbors of the data points on the working node and data points on other working nodes.
10. A master node for data distributed anomaly detection, wherein the master node is configured to:
obtain a first data set stored on a first working node in a cluster, wherein the first data set comprises at least one data point; and broadcast the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the working nodes, wherein the second working node is a working node in the cluster other than the first working node.
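As an illustrative aside (not part of the claims), the master node's gather-and-broadcast role in claim 10 can be sketched as follows; the function and variable names are hypothetical, and a real system would use its framework's broadcast primitive rather than Python dictionaries:

```python
def master_broadcast(workers):
    """workers: {node_id: list_of_points}.

    Simulates the master broadcasting each working node's data set to
    every other working node, and returns the full view each worker
    holds afterwards, so each can compute anomaly probabilities locally.
    """
    full_view = {}
    for source_id, dataset in workers.items():
        for target_id in workers:
            if target_id != source_id:
                # target starts from its own local data set ...
                full_view.setdefault(target_id, list(workers[target_id]))
                # ... and receives the broadcast copy of source's data
                full_view[target_id].extend(dataset)
    return full_view

workers = {"w1": [(0.0, 0.0)], "w2": [(1.0, 1.0)], "w3": [(2.0, 2.0)]}
view = master_broadcast(workers)
# every worker now sees all three data points
```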
11. A cluster for data distributed anomaly detection, wherein the cluster comprises:
a first working node configured to send a first data set stored on the working node to a master node and to calculate the anomaly probability of each data point based on the data set on the working node, wherein the first data set comprises at least one data point;
a second working node configured to acquire the first data set of the first working node broadcast by the master node and to calculate the anomaly probability of each data point based on the data set on the working node, wherein the second working node is a working node in the cluster other than the first working node.
12. The cluster of claim 11, wherein a second data set is stored on the second working node; the first working node obtains a first global K-distance neighbor of the data points in the first data set, and the second working node obtains a second global K-distance neighbor of the data points in the second data set; the working nodes in the cluster determine in parallel, based on the first global K-distance neighbor and the second global K-distance neighbor, the reachable distance between each data point on the working node and its K-distance neighbors;
determine in parallel the reachable density of each data point on the working node based on the reachable distance between each data point and its K-distance neighbors;
and determine in parallel the anomaly probability of each data point on the working node based on the reachable density of each data point.
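The patent does not give formulas for the last two steps of claim 12; as an illustrative aside, they can be sketched under the standard LOF definitions (assumed here): the reachable density (local reachability density) of a point is the inverse of the mean reachable distance to its K-distance neighbors, and the anomaly probability (LOF score) is the mean ratio of the neighbors' reachable densities to the point's own.

```python
def lrd(reach_dists):
    """Reachable density: 1 / mean(reachable distance to K neighbors)."""
    return len(reach_dists) / sum(reach_dists)

def lof(point_lrd, neighbor_lrds):
    """Anomaly score: mean of neighbor lrd / own lrd.

    Scores near 1 indicate inliers; scores well above 1 indicate that
    the point is in a much sparser region than its neighbors, i.e. an
    anomaly.
    """
    return sum(n / point_lrd for n in neighbor_lrds) / len(neighbor_lrds)

# A point whose neighbors are twice as dense as itself scores 2.0:
score = lof(0.5, [1.0, 1.0, 1.0])
```

Both functions use only per-point quantities and the neighbors' densities, so, once the global K-distance neighbors are in place, each working node can evaluate them for its own data points in parallel as the claim describes.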
13. The cluster of claim 12, wherein the first working node obtaining a first global K-distance neighbor of the data points in the first data set comprises:
the first working node calculating a first local K-distance neighbor of the data points in the first data set, wherein the first local K-distance neighbor comprises the K-distance neighbors, within the first data set, of the data points in the first data set;
the second working node calculating a second local K-distance neighbor of the data points in the first data set, wherein the second local K-distance neighbor comprises the K-distance neighbors, within the second data set, of the data points in the first data set;
the first working node obtaining the first global K-distance neighbor of the data points in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
14. The cluster of claim 13, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node acquires a second local K-distance neighbor determined by the second working node;
the first working node obtains a global K-distance neighbor of the data point in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
15. The cluster of claim 13, wherein the first worker node obtaining a first global K-distance neighbor of data points in the first dataset based on a first local K-distance neighbor and a second local K-distance neighbor comprises:
the first working node and the second working node transmit the first local K-distance neighbor and the second local K-distance neighbor to an intermediate working node;
the first working node receives a global K-distance neighbor of the data points in the first data set sent by the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbor of the data points in the first data set based on the first local K-distance neighbor and the second local K-distance neighbor.
16. The cluster according to any of claims 13 to 15, wherein the local K-distance neighbors and the global K-distance neighbors are stored as the sequence number identification of the data point together with the sequence number identifications of the data point's local and global K-distance neighbors.
17. The cluster of claim 16, wherein the obtaining a global K-distance neighbor of a data point in the first data set based on the first and second local K-distance neighbors comprises:
performing a reduce operation on data points having the same sequence number identification in the first local K-distance neighbor and the second local K-distance neighbor;
determining a global K-distance neighbor of a data point in the first data set based on the result of the reduce operation.
18. The cluster of claim 12, wherein the working nodes in the cluster determining in parallel, based on the first global K-distance neighbor and the second global K-distance neighbor, the reachable distance between each data point on the working node and its K-distance neighbors comprises:
the working nodes in the cluster determining in parallel the reachable distance between each data point on the working node and its K-distance neighbors, based on the global K-distance neighbors of each data point on the working node, or based on the global K-distance neighbors of each data point on the working node together with the data points on the other working nodes.
CN201710778563.7A 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection Active CN107528904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710778563.7A CN107528904B (en) 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710778563.7A CN107528904B (en) 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection

Publications (2)

Publication Number Publication Date
CN107528904A CN107528904A (en) 2017-12-29
CN107528904B true CN107528904B (en) 2020-02-18

Family

ID=60683289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710778563.7A Active CN107528904B (en) 2017-09-01 2017-09-01 Method and apparatus for data distributed anomaly detection

Country Status (1)

Country Link
CN (1) CN107528904B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108471428B (en) * 2018-06-27 2021-05-28 北京云端智度科技有限公司 DDoS attack active defense technology and equipment applied to CDN system
CN109388512A (en) * 2018-10-30 2019-02-26 江苏物联网研究发展中心 For the assessment and analysis system of large-scale computer cluster intensity of anomaly
CN111835696B (en) * 2019-04-23 2023-05-09 阿里巴巴集团控股有限公司 Method and device for detecting abnormal request individuals
CN110580030A (en) * 2019-10-11 2019-12-17 南京铁道职业技术学院 Pharmaceutical factory environment purification control system based on Internet of things
WO2023131962A1 (en) * 2022-01-04 2023-07-13 Telefonaktiebolaget Lm Ericsson (Publ) First node, second node and methods performed thereby for handling anomalous values

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017028930A1 (en) * 2015-08-20 2017-02-23 Telefonaktiebolaget Lm Ericsson (Publ) Methods and apparatus for running an analytics function
CN107070753A (en) * 2017-06-15 2017-08-18 郑州云海信息技术有限公司 A kind of data monitoring method of distributed cluster system, apparatus and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702660B (en) * 2009-11-12 2011-12-14 中国科学院计算技术研究所 abnormal domain name detection method and system
CN104852886B (en) * 2014-02-14 2019-05-24 腾讯科技(深圳)有限公司 The guard method of user account number and device
US10367838B2 (en) * 2015-04-16 2019-07-30 Nec Corporation Real-time detection of abnormal network connections in streaming data
CN105871634B (en) * 2016-06-01 2019-02-15 北京蓝海讯通科技股份有限公司 Detect the method for cluster exception and the system of application, management cluster

Also Published As

Publication number Publication date
CN107528904A (en) 2017-12-29

Similar Documents

Publication Publication Date Title
CN107528904B (en) Method and apparatus for data distributed anomaly detection
US11710131B2 (en) Method and apparatus of identifying a transaction risk
JP6321681B2 (en) Method and apparatus for identifying website users
CN110826594B (en) Track clustering method, equipment and storage medium
US20180181751A1 (en) Anomaly Detection in Distributed Ledger Systems
US10402427B2 (en) System and method for analyzing result of clustering massive data
US8484212B2 (en) Providing reconstructed data based on stored aggregate data in response to queries for unavailable data
US11100073B2 (en) Method and system for data assignment in a distributed system
CN109145225B (en) Data processing method and device
CN106022936B (en) Community structure-based influence maximization algorithm applicable to thesis cooperative network
CN113285960B (en) Data encryption method and system for service data sharing cloud platform
CN106570029B (en) Data processing method and system for distributed relational database
CN113468226A (en) Service processing method, device, electronic equipment and storage medium
CN111626311B (en) Heterogeneous graph data processing method and device
CN115617279A (en) Distributed cloud data processing method and device and storage medium
CN109993338B (en) Link prediction method and device
CN110046194A (en) A kind of method, apparatus and electronic equipment of expanding node relational graph
CN112015825B (en) Model registration method and device based on block chain and electronic equipment
Yan et al. Federated clustering with GAN-based data synthesis
CN115629945A (en) Alarm processing method and device and electronic equipment
CN115955323A (en) Network security situation sensing method and device and electronic equipment
CN113873025A (en) Data processing method and device, storage medium and electronic equipment
US20140258333A1 (en) Fast distributed database frequency summarization
WO2018228259A1 (en) Relationship diagram processing method and apparatus
CN115374109B (en) Data access method, device, computing equipment and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee after: Star link information technology (Shanghai) Co.,Ltd.

Address before: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee before: TRANSWARP TECHNOLOGY (SHANGHAI) Co.,Ltd.
