CN107528904A - Method and apparatus for distributed anomaly detection of data - Google Patents
Method and apparatus for distributed anomaly detection of data
- Publication number: CN107528904A
- Application number: CN201710778563.7A
- Authority
- CN
- China
- Prior art keywords
- distance
- working node
- data
- data point
- neighbours
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/06—Generation of reports
- H04L43/062—Generation of reports related to network traffic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
Abstract
The purpose of the application is to provide a method and apparatus for distributed anomaly detection of data. Compared with the prior art, in this application the master node collects the data set on each working node and broadcasts it, so that each working node calculates the anomaly probability of each data point based on both the data set stored on that working node and the data sets broadcast by the master node. This distributes the required computation reasonably across the working nodes and makes full use of the computing capability of the whole cluster, so that the performance of the algorithm is significantly improved.
Description
Technical field
The application relates to the field of computer technology, and in particular to a technique for distributed anomaly detection of data.
Background
Anomaly detection aims to find the anomalous data points in a data set. These points differ markedly from the other data points, and their occurrence is not random; they may result from an entirely different mechanism. Analyzing anomalous data points can uncover the important information hidden behind them and thereby bring considerable economic value. Anomaly detection is now widely applied in practice: in finance, it is used to detect credit card fraud; in information security, it is used for intrusion detection; in medicine, it can be used to assess a patient's potential health problems.
The existing LOF (Local Outlier Factor) algorithm is a density-based anomaly detection algorithm. It assumes that the density of an anomalous data point differs markedly from the density of its neighboring points, and it therefore detects anomalous data points by comparing the density of the current point with that of its neighbors. Owing to its good detection accuracy, the algorithm is widely used and has become a classic anomaly detection algorithm. With the arrival of the big data era, however, its application is limited by its high time complexity.
Summary of the invention
The purpose of the application is to provide a method and apparatus for distributed anomaly detection of data.
According to one aspect of the application, a method for distributed anomaly detection of data is provided, wherein the method comprises:
a master node obtains a first data set stored on a first working node in a cluster, wherein the first data set contains at least one data point;
the master node broadcasts the first data set to the second working nodes in the cluster, so that each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node, wherein the second working nodes are the working nodes in the cluster other than the first working node.
According to another aspect of the application, a method for distributed anomaly detection of data is also provided, wherein the method comprises:
a first working node in a cluster sends the first data set stored on that working node to the master node, wherein the first data set contains at least one data point;
each second working node in the cluster obtains the first data set of the first working node as broadcast by the master node, wherein the second working nodes are the working nodes in the cluster other than the first working node;
each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node.
According to yet another aspect of the application, a master node for distributed anomaly detection of data is also provided, wherein the master node is configured to:
obtain a first data set stored on a first working node in a cluster, wherein the first data set contains at least one data point; and broadcast the first data set to the second working nodes in the cluster, so that each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node, wherein the second working nodes are the working nodes in the cluster other than the first working node.
According to a further aspect of the application, a cluster for distributed anomaly detection of data is also provided, wherein the cluster comprises:
a first working node, configured to send the first data set stored on that working node to the master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the first data set contains at least one data point;
a second working node, configured to obtain the first data set of the first working node as broadcast by the master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the second working node is a working node in the cluster other than the first working node.
Compared with the prior art, in this application the master node collects the data set on each working node and broadcasts it, so that each working node calculates the anomaly probability of each data point based on both the data set stored on that working node and the data sets broadcast by the master node. This distributes the required computation reasonably across the working nodes and makes full use of the computing capability of the whole cluster, so that the performance of the algorithm is significantly improved.
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the drawings:
Fig. 1 shows a flow chart of a method for distributed anomaly detection of data according to one aspect of the application;
Fig. 2 shows a schematic diagram of a master node obtaining and broadcasting the first data set according to a preferred embodiment of the application;
Fig. 3 shows a schematic diagram of the first working node obtaining the first global k-distance neighbors of the data points in the first data set according to a preferred embodiment of the application.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description
The invention is described in further detail below with reference to the drawings.
In a typical configuration of the application, the terminal, the devices of the service network and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces and memory.
Memory may include computer-readable media in the form of volatile memory, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
To further explain the technical means adopted by the application and the effects achieved, the technical solution of the application is described clearly and completely below with reference to the drawings and preferred embodiments.
Fig. 1 shows a method for distributed anomaly detection of data according to one aspect of the application, wherein the method comprises:
S11: a first working node in the cluster sends the first data set stored on that working node to the master node; correspondingly, the master node obtains the first data set stored on the first working node in the cluster, wherein the first data set contains at least one data point;
S12: the master node broadcasts the first data set to the second working nodes in the cluster, so that each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node; correspondingly, each second working node obtains the first data set of the first working node as broadcast by the master node, wherein the second working nodes are the working nodes in the cluster other than the first working node;
S13: each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node.
In this embodiment, in step S11, a first working node in the cluster sends the first data set stored on that working node to the master node; correspondingly, the master node obtains the first data set stored on the first working node in the cluster, wherein the first data set contains at least one data point.
Here, the cluster consists of the nodes of a distributed system. In the embodiments of the application, for example, the distributed system mainly comprises a master node and working nodes. The master node is one node of the cluster and serves to collect and broadcast data; the first working node is any one of the working nodes in the distributed system, and the second working nodes are the working nodes other than the first working node. The nodes of the distributed system include, but are not limited to, smart mobile devices, personal computers, network hosts, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, a cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a virtual supercomputer made up of a group of loosely coupled computers.
Specifically, the collection of all data points on which anomaly detection is to be performed is called the data set. In this application, the whole data set can be divided into multiple data partitions, which are stored in a distributed manner on the working nodes; each working node may store one or more partitions of the whole data set. The partition stored on the first working node is called the first data set, and the partition stored on a second working node is called the second data set. In this application, the resilient distributed dataset (RDD) provided by Spark can be used: the raw data set is read into memory in the data format RDD[(sequence number, vector)], and Spark divides the RDD into multiple partitions, which are stored in a distributed manner on the working nodes of the Spark cluster.
The LOF algorithm requires the k-distance neighbors of each data point. Here, the k-distance of a point is the distance between that point and its k-th nearest data point, and the k-distance neighbors of a point are all data points whose distance from it does not exceed its k-distance. Taking the first working node as an example: to calculate the k-distance neighbors of each data point in the first data set, the first data set must be sent to the other working nodes, i.e. the second working nodes, so that the k-distance neighbors of each data point of the first data set within the whole data set can be obtained.
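The two definitions above can be illustrated with a minimal single-machine sketch. The function name `k_distance_neighbors`, the Euclidean metric and the sample coordinates are assumptions made for illustration; the application does not prescribe a particular distance function.

```python
import math

def k_distance_neighbors(p, points, k):
    """Return (k_distance, neighbors) for the point with sequence number `p`.

    `points` maps a sequence number to a coordinate tuple. The neighbors are
    all other points whose distance to p does not exceed the k-distance, so
    ties can yield more than k neighbors.
    """
    dists = sorted(
        (math.dist(points[p], vec), q) for q, vec in points.items() if q != p
    )
    k_dist = dists[k - 1][0]  # distance to the k-th nearest point
    neighbors = [(q, d) for d, q in dists if d <= k_dist]
    return k_dist, neighbors

# Hypothetical sample data: sequence number -> vector.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0), 3: (3.0, 4.0)}
k_dist, neighbors = k_distance_neighbors(0, points, k=1)
```

In this sample, points 1 and 2 are equally close to point 0, so both are k-distance neighbors even though k = 1.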
Therefore, in step S11, the first working node in the cluster sends the first data set stored on it to the master node; correspondingly, the master node obtains the first data set stored on the first working node by means of a collect operation. Each working node takes its turn as the first working node; that is, every working node in the cluster sends its data set to the master node, and the master node broadcasts that data set to the other working nodes. Thus, in step S12, the master node broadcasts the first data set to the second working nodes in the cluster by means of a broadcast variable, so that all working nodes in the cluster calculate the anomaly probability of each data point based on the data sets on the respective working node; correspondingly, each second working node obtains the first data set of the first working node as broadcast by the master node.
Fig. 2 is a schematic diagram of the master node obtaining and broadcasting the first data set. The Driver is the master node and the four nodes below it are working nodes, each of which stores a data set. In this embodiment, each data set contains three data points, stored in the data format [(sequence number, vector)]. Any one of the four working nodes can serve as the first working node, with the others serving as second working nodes. In this embodiment, the leftmost working node is the first working node and the other working nodes are second working nodes. The master node collects the data set of the first working node, i.e. the first data set, and broadcasts it to the second working nodes. Further, the second working nodes take turns as the first working node, so that the data set on every working node is broadcast to the other working nodes.
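The collect-then-broadcast rotation can be simulated in plain Python without a Spark cluster. This is only a sketch of the data flow: the function `collect_and_broadcast`, the node names and the sample records are hypothetical, and an actual deployment would rely on Spark's collect operation and broadcast variables as described above.

```python
def collect_and_broadcast(partitions):
    """Simulate the master relaying each working node's data set to the others.

    `partitions` maps a working-node id to its local list of (seq, vector)
    records. Returns, per node, the data sets it received, keyed by the node
    they originated from.
    """
    received = {node: {} for node in partitions}
    for first_node, first_data_set in partitions.items():  # each node takes a turn
        collected = list(first_data_set)                    # "collect" to the master
        for second_node in partitions:
            if second_node != first_node:                   # "broadcast" to the rest
                received[second_node][first_node] = collected
    return received

# Hypothetical partitions of a small data set.
partitions = {
    "w1": [(0, (0.1, 0.2)), (1, (0.4, 0.5))],
    "w2": [(2, (0.9, 0.8))],
    "w3": [(3, (0.3, 0.3))],
}
received = collect_and_broadcast(partitions)
```

After the rotation, every working node holds its own partition plus a copy of every other partition, which is what allows it to compute distances against the whole data set.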
Continuing this embodiment, in step S13, each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node. That is, each working node calculates the anomaly probability of each data point based on the first data set broadcast by the master node through a broadcast variable and on the data set stored on the current working node.
Preferably, a second data set is stored on the second working node, and step S13 comprises:
S131 (not shown): the first working node in the cluster obtains the first global k-distance neighbors of the data points in the first data set, and the second working node obtains the second global k-distance neighbors of the data points in the second data set;
S132 (not shown): the working nodes in the cluster, based on the first global and second global k-distance neighbors, determine in parallel the reachability distance between each data point on the respective working node and its k-distance neighbors;
S133 (not shown): the working nodes in the cluster, based on the reachability distance between each data point and its k-distance neighbors, determine in parallel the local reachability density of each data point on the respective working node;
S134 (not shown): the working nodes in the cluster, based on the local reachability density of each data point, determine in parallel the anomaly probability of each data point on the respective working node.
In this embodiment, in step S131, the first working node in the cluster obtains the first global k-distance neighbors of the data points in the first data set, and the second working node obtains the second global k-distance neighbors of the data points in the second data set.
Here, the first global k-distance neighbors are the k-distance neighbors of the data points of the first data set within all data sets of all working nodes, and the second global k-distance neighbors are the k-distance neighbors of the data points of the second data set within all data sets of all working nodes. The first and second global k-distance neighbors are collectively called global k-distance neighbors, i.e. the k-distance neighbors of each data point within all data sets of all working nodes. For example, because the master node broadcasts the data set of each working node to the other working nodes, each working node can calculate, for the data points in its own data set, their k-distance neighbors within all data sets, and thereby obtain the global k-distance neighbors of each data point.
Preferably, the first working node obtaining the first global k-distance neighbors of the data points in the first data set comprises:
S1311 (not shown): the first working node calculates the first local k-distance neighbors of the data points in the first data set, wherein the first local k-distance neighbors are the k-distance neighbors of those data points within the first data set;
S1312 (not shown): the second working node calculates the second local k-distance neighbors of the data points in the first data set, wherein the second local k-distance neighbors are the k-distance neighbors of those data points within the second data set;
S1313 (not shown): the first working node obtains the first global k-distance neighbors of the data points in the first data set based on the first local and second local k-distance neighbors.
In this embodiment, in step S1311, the first working node calculates the distances between the data points of the first data set, and thereby obtains the k-distance neighbors of each data point of the first data set within the first data set, i.e. the first local k-distance neighbors.
Similarly, in step S1312, because the second working node has obtained the first data set, it can calculate the distance between each data point of the first data set and each data point of the second data set, and thereby obtain the k-distance neighbors of each data point of the first data set within the second data set, i.e. the second local k-distance neighbors.
Further, in step S1313, the first working node obtains the first global k-distance neighbors of the data points in the first data set based on the first local and second local k-distance neighbors.
The global k-distance neighbors of the data points in the second data set can be determined in the same way as those of the data points in the first data set.
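Steps S1311 and S1312 apply the same computation to different reference sets, which the following sketch makes explicit. The helper `local_neighbors`, the Euclidean metric and the coordinates are assumptions for illustration; for brevity the sketch keeps exactly the k nearest points and ignores ties.

```python
import math

def local_neighbors(query_set, local_set, k):
    """For each (seq, vec) in query_set, find its k nearest points in local_set."""
    result = {}
    for seq, vec in query_set:
        dists = sorted(
            (math.dist(vec, other_vec), other_seq)
            for other_seq, other_vec in local_set
            if other_seq != seq  # a point is not its own neighbor
        )
        result[seq] = [(other_seq, d) for d, other_seq in dists[:k]]
    return result

# Hypothetical data sets, three points each, stored as (sequence number, vector).
first_data_set = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (5.0, 5.0))]
second_data_set = [(3, (0.0, 2.0)), (4, (6.0, 5.0)), (5, (9.0, 9.0))]

# S1311: runs on the first working node, against its own data set.
first_local = local_neighbors(first_data_set, first_data_set, k=1)
# S1312: runs on a second working node, against that node's data set.
second_local = local_neighbors(first_data_set, second_data_set, k=1)
```

Here point 2 has a closer neighbor in the second data set (point 4) than in its own, which is why the local results still have to be merged in step S1313.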
Preferably, step S1313 comprises: S13131 (not shown): the first working node obtains the second local k-distance neighbors determined by the second working node; S13132 (not shown): the first working node obtains the global k-distance neighbors of the data points in the first data set based on the first local and second local k-distance neighbors.
In this embodiment, the global k-distance neighbors of the data points in the first data set can be determined by the first working node: the first local and second local k-distance neighbors are gathered on the first working node, which computes the global k-distance neighbors of the data points in the first data set. Similarly, the global k-distance neighbors of the data points in the second data set can be determined by the second working node. The computation is thus distributed, which improves efficiency.
Preferably, step S1313 comprises: the first working node and the second working node transmit the first local and second local k-distance neighbors to an intermediate working node; the first working node receives, from the intermediate working node, the global k-distance neighbors of the data points in the first data set, wherein the intermediate working node obtains these global k-distance neighbors based on the first local and second local k-distance neighbors.
In this preferred embodiment, the global k-distance neighbors need not be computed by a working node; they can instead be computed by an intermediate node. Here, the intermediate node is a node provided separately from the master node and the working nodes. It obtains the first local and second local k-distance neighbors and performs the corresponding calculation to determine the global k-distance neighbors.
Preferably, local and global k-distance neighbors are stored as the sequence number of the data point together with the sequence numbers of its local or global k-distance neighbors. That is, when the local and global k-distance neighbors of each data point are stored, the record can contain the sequence number of the data point and the sequence numbers of its local or global k-distance neighbors. For example, the k-distance neighbors of data point p can be written as (data point p, [(neighbor o1, distance d1), ..., (neighbor ok, distance dk)]), where each d is the distance between data point p and the corresponding neighbor.
For example, in the embodiment shown in Fig. 3, the global or local k-distance neighbor of data point 0 is data point 1. Since K = 1, the k-distance neighbors of data point 0 can be stored as (0, 1, d1), where d1 is the distance between data point 0 and data point 1.
Preferably, obtaining the global k-distance neighbors of the data points in the first data set based on the first local and second local k-distance neighbors comprises:
reducing together the records in the first local and second local k-distance neighbors that carry the same sequence number; and determining the global k-distance neighbors of the data points in the first data set based on the result of the reduction. Specifically, a reduceByKey operation can bring all records with the same sequence number onto the same working node, for example onto the first working node or the intermediate node.
Fig. 3 shows an embodiment in which the first working node obtains the first global k-distance neighbors of the data points in the first data set. In this embodiment, the data set stored on each working node contains three data points, each stored in the data format (sequence number, vector); for example, the data points with sequence numbers 0, 1 and 2 are stored on the first working node. This embodiment assumes K = 1 for the k-distance neighbors.
In this embodiment, the Driver collects the data set on the first working node and broadcasts it to the second working nodes. In step S1311, the first working node calculates the first local k-distance neighbors of the data points in the first data set; that is, it calculates the local k-distance neighbors of the data points with sequence numbers 0, 1 and 2 within its own data set (the first data set). For example, the first local k-distance neighbor of data point 0 within this data set is data point 1.
Likewise, in step S1312, the second working node calculates the second local k-distance neighbors of the data points in the first data set; that is, the second working node calculates the distance between each data point of the first data set and each data point of the second data set, and takes the nearest points in the second data set as the local k-distance neighbors of that data point. For example, the second local k-distance neighbor of data point 0 on working node 2 is data point 4.
Continuing this embodiment, in step S1313, the first working node obtains the first global k-distance neighbors of the data points in the first data set based on the first local and second local k-distance neighbors. Specifically, the first local and second local k-distance neighbors can each be sorted in ascending order of distance, and the K neighbors with the smallest distances are then selected by traversing the two sorted lists.
Specifically, a reduceByKey operation can bring all records with the same sequence number onto the same working node. For example, as shown in Fig. 3 with K = 1, the local k-distance neighbors of data point 0 are brought together and sorted by their distance to data point 0, and the K data points with the smallest distances are selected. In this embodiment, since K = 1, the global k-distance neighbor of data point 0 is determined to be (0.484, 0.598), i.e. the data point with sequence number 1; similarly, the global k-distance neighbor of data point 1 is the data point with sequence number 2.
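The merge of local neighbor lists into global ones can be sketched as a group-by-key followed by a top-K selection. The sketch below imitates the effect of a reduceByKey in plain Python; the function `merge_to_global` and all distances are hypothetical and are not taken from Fig. 3.

```python
import heapq
from collections import defaultdict

def merge_to_global(local_lists, k):
    """Group local neighbor lists by sequence number, then keep the k nearest."""
    grouped = defaultdict(list)
    for point_id, neighbors in local_lists:  # records with the same key meet here
        grouped[point_id].extend(neighbors)
    return {
        point_id: heapq.nsmallest(k, candidates, key=lambda nd: nd[1])
        for point_id, candidates in grouped.items()
    }

# Records of the form (data point, [(neighbor, distance), ...]); values are made up.
local_lists = [
    (0, [(1, 0.20)]),  # first local neighbors, from the first working node
    (0, [(4, 0.35)]),  # second local neighbors, from a second working node
    (1, [(2, 0.10)]),
    (1, [(5, 0.60)]),
]
global_neighbors = merge_to_global(local_lists, k=1)
```

Because each local list already holds the K best candidates from its own partition, the K globally nearest points are guaranteed to be among the merged candidates.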
Preferably, step S132 comprises: each working node in the cluster determines the reachability distance between each data point and its k-distance neighbors, based either on the global k-distance neighbors of each data point on that working node alone, or on those together with the global k-distance neighbors of data points on other working nodes.
In this embodiment, when the global k-distance neighbors of every data point on a working node are themselves data points on that working node, the working node needs only the global k-distance neighbors of its own data points to determine the reachability distance between each of its data points and its k-distance neighbors. When some data point on a working node has a global k-distance neighbor located on another working node, the reachability distances must be determined from the global k-distance neighbors of the data points on that working node together with those of the data points involved on the other working nodes. The working nodes compute in parallel, so that the reachability distances to the k-distance neighbors of all data points are determined.
Here, the reachability distance between data point o and data point p is defined as follows:
$$\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\ d(p, o) \,\}$$
where d(p, o) is the distance between data points p and o.
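As a small worked example of this definition, the function below takes the two quantities directly; the name `reach_dist` and the numbers are chosen only for illustration.

```python
def reach_dist(k_distance_o, d_po):
    """Reachability distance of p with respect to o.

    k_distance_o: the k-distance of the neighbor o
    d_po:         the actual distance between p and o
    """
    return max(k_distance_o, d_po)

# p lies inside o's k-distance radius: the distance is lifted to k-distance(o).
inside = reach_dist(k_distance_o=0.5, d_po=0.2)
# p lies outside o's k-distance radius: the actual distance is used.
outside = reach_dist(k_distance_o=0.5, d_po=0.9)
```

Taking the maximum smooths out the distances of points that lie very close to o, so that small fluctuations inside a dense region do not dominate the density estimate.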
Specifically, each data point in the obtained k-distance neighbor records is swapped with its neighbors. For example, if the k-distance neighbors of data point p are recorded as RDD[(data point p, [(neighbor o_1, distance d_1), ..., (neighbor o_k, distance d_k)])], then after swapping p with its neighbors the result can be recorded as Swapped_RDD[(neighbor o_1, (data point p, distance d_1)), ..., (neighbor o_k, (data point p, distance d_k)), ...]. Each record of an RDD is stored as a key-value pair; here the neighbor is the key, and the data point and the distance are the value. A cogroup operation (an RDD operation provided by Spark) is then performed on Swapped_RDD and the original RDD by key. The cogroup operation brings the records with the same key in Swapped_RDD and the original RDD onto the same working node. Therefore, for the data point that serves as the key of a record, its k-distance can be compared with the distance from each data point to it, and the larger value is taken as the reachability distance from that data point to this neighbor. The result is recorded as RDD[(data point p, [(neighbor o_1, reachability distance d_1), ..., (neighbor o_k, reachability distance d_k)])].
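The swap-and-cogroup data flow can be imitated with ordinary dictionaries. This is a sketch of the logic only, not Spark code: the function `reachability_distances` is hypothetical, it takes the k-distance of a point to be the largest distance in its neighbor list, and the sample uses four made-up points on a line at positions 0, 1, 3 and 7 with k = 2.

```python
from collections import defaultdict

def reachability_distances(knn):
    """knn maps p -> [(o, d(p, o)), ...], the k-distance neighbors of p."""
    # k-distance of each point: the largest distance in its neighbor list.
    k_distance = {p: max(d for _, d in nbrs) for p, nbrs in knn.items()}

    # Swap: key every record by the neighbor o instead of by the point p.
    swapped = defaultdict(list)
    for p, nbrs in knn.items():
        for o, d in nbrs:
            swapped[o].append((p, d))

    # "cogroup" on o: o's k-distance meets every (p, d) record that names o.
    result = defaultdict(list)
    for o, records in swapped.items():
        for p, d in records:
            result[p].append((o, max(k_distance[o], d)))
    return dict(result)

knn = {
    0: [(1, 1.0), (2, 3.0)],
    1: [(0, 1.0), (2, 2.0)],
    2: [(1, 2.0), (0, 3.0)],
    3: [(2, 4.0), (1, 6.0)],
}
reach = reachability_distances(knn)
```

Keying by the neighbor is what makes the step local: everything needed to evaluate the maximum for a given o ends up in one group, so each group can be processed on a different working node.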
In this application, because the records with different keys are located on different working nodes after the reduction, the calculation of reachability distances is executed entirely in parallel: each working node is independently responsible for its own computing task without affecting the others, which further improves the execution speed of the algorithm.
Further, in the step S133, working node parallel computation each data point in cluster it is reachable
Density, i.e., each working node calculate the reachable density of each data point of initial storage on this working node respectively.Wherein, may be used
It is defined as follows up to density:
Here, lrd_MinPts(p) denotes the reachability density of p, and N_MinPts(p) denotes all k-distance neighbours of the current data point p when k = MinPts. Specifically, a groupByKey operation is performed on the RDD obtained in step S132, so that all records with the same key are reduced to the same working node. For each record after the reduction, the sum of the reachability distances to all neighbour points of the data point corresponding to the key and the number of those neighbour points are calculated, and the reachability density is calculated according to the above formula. The result is RDD[(data point p, [(neighbour point o1, reachability density d1), ..., (neighbour point ok, reachability density dk)])]. Likewise, the process of calculating the reachability density in the present application is executed entirely in parallel: each working node is independently responsible for its own calculation task without affecting the others.
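The reachability density calculation described above can be sketched as follows. The sketch is illustrative only: the function name `reachability_density`, the input layout and the sample values are assumptions, the grouping by key is represented by an ordinary dictionary, and returning infinity for a zero distance sum (duplicate points) is a choice made here, not something stated in the application.

```python
def reachability_density(reach):
    """reach: {p: [(o, reach_dist(p, o)), ...]}; returns {p: reachability density of p}."""
    lrd = {}
    for p, records in reach.items():
        total = sum(d for _, d in records)  # sum of reachability distances to all neighbour points
        count = len(records)                # number of neighbour points
        # Density is the inverse of the mean reachability distance (assumed infinite when the sum is zero).
        lrd[p] = count / total if total > 0 else float("inf")
    return lrd

# Hypothetical reachability distances for four points with k = 2.
reach = {
    0: [(1, 2.0), (2, 3.0)],
    1: [(0, 3.0), (2, 3.0)],
    2: [(1, 2.0), (0, 3.0)],
    3: [(1, 6.0), (2, 4.0)],
}
lrd = reachability_density(reach)
```

A point whose neighbours are far away, such as point 3 in the sample data, receives a lower density than the others.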
Further, in step S134, the working nodes in the cluster determine in parallel, based on the reachability density of each data point, the abnormal probability of each data point on the respective working node. Here, the abnormal probability is denoted LOF_MinPts(p) and is defined as follows:
$$ LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|} $$
Specifically, a cogroup operation is performed on the RDD obtained in step S133 and Swapped_RDD to calculate the reachability density of the data point, the sum of the reachability densities of all neighbour points of the data point, and the number of neighbour points; the abnormal probability that the data point is an abnormal data point is then calculated according to the above formula. Likewise, the process of calculating the abnormal probability in the present application is executed entirely in parallel: each working node is independently responsible for its own calculation task without affecting the others.
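The final step can be sketched in the same style. This is an illustrative sketch under stated assumptions: the function name `lof_scores`, the input layout and the sample values are hypothetical, all densities are assumed finite and non-zero, and the score is taken to be the standard LOF ratio, i.e. the mean reachability density of the neighbour points divided by the reachability density of the point itself.

```python
def lof_scores(knn, lrd):
    """knn: {p: [(o, d(p, o)), ...]}; lrd: {p: reachability density}; returns {p: LOF value}."""
    scores = {}
    for p, neighbours in knn.items():
        neighbour_sum = sum(lrd[o] for o, _ in neighbours)  # sum of the neighbours' densities
        count = len(neighbours)                             # number of neighbour points
        scores[p] = neighbour_sum / (count * lrd[p])
    return scores

# Hypothetical inputs for four points with k = 2.
knn = {
    0: [(1, 1.0), (2, 3.0)],
    1: [(0, 1.0), (2, 2.0)],
    2: [(1, 2.0), (0, 3.0)],
    3: [(2, 4.0), (1, 6.0)],
}
lrd = {0: 0.4, 1: 1.0 / 3.0, 2: 0.4, 3: 0.2}
scores = lof_scores(knn, lrd)
```

Values clearly above 1 indicate a point that is less dense than its neighbours; in the sample data this is point 3.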
Compared with the prior art, in the present application the host node collects and broadcasts the data set on each working node, so that each working node calculates the abnormal probability of each data point based on the data set stored on that working node and the data sets broadcast by the host node. In this way the required computation is distributed reasonably across the working nodes and the computing capability of the whole cluster is fully used, so that the performance of the algorithm is significantly improved.
According to another aspect of the present application, a cluster for distributed data abnormality detection is further provided, wherein the cluster includes:
a first working node, configured to send a first data set stored on this working node to a host node, and to calculate the abnormal probability of each data point based on the data set on this working node, the first data set including at least one data point;
the host node, configured to obtain the first data set and to broadcast the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes, wherein the second working node is any working node in the cluster other than the first working node;
the second working node, configured to obtain the first data set of the first working node broadcast by the host node, and to calculate the abnormal probability of each data point based on the first data set and the data set on this working node.
In this embodiment, the first working node in the cluster sends the first data set stored on it to the host node; correspondingly, the host node obtains the first data set stored on the first working node in the cluster, wherein the first data set includes at least one data point.
Here, the cluster is made up of the nodes of a distributed system. For example, in the embodiments of the present application, the distributed system mainly includes a host node and working nodes. The host node is one node in the cluster and serves to collect and broadcast data; the first working node is any one of the working nodes in the distributed system, and the second working nodes are the working nodes other than the first working node. Each node in the distributed system includes, but is not limited to, a smart mobile device, a personal computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, the cloud is formed by a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a group of loosely coupled computers forms a virtual supercomputer.
Specifically, the set formed by all data points on which abnormality detection is to be performed is referred to as the data set. In the present application, the whole data set can be divided into multiple data partitions, which can be stored across the working nodes in a distributed manner; each working node can store one or more partitions of the whole data set. The partition stored on the first working node is referred to as the first data set, and the partition stored on the second working node is referred to as the second data set. In the present application, the resilient distributed dataset (RDD) provided by Spark can be used: the raw data set is read into memory in the data format RDD[(sequence number, vector)], and Spark divides the RDD into multiple partitions that are stored in a distributed manner on the working nodes of the Spark cluster.
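As a minimal sketch of this data layout, the following plain-Python function numbers the vectors and splits the resulting (sequence number, vector) records into contiguous partitions. The function name `partition_dataset` and the contiguous split are assumptions for illustration; Spark decides the actual partitioning itself.

```python
import math

def partition_dataset(vectors, num_workers):
    """Attach a sequence number to every vector and split the records into contiguous partitions."""
    records = list(enumerate(vectors))              # [(sequence number, vector), ...]
    size = math.ceil(len(records) / num_workers)    # records per partition (the last one may be shorter)
    return [records[i * size:(i + 1) * size] for i in range(num_workers)]

# Twelve hypothetical two-dimensional points spread over four working nodes.
vectors = [(float(i), 0.0) for i in range(12)]
partitions = partition_dataset(vectors, 4)
```

With twelve points and four working nodes, each partition holds three records, and the first partition holds the sequence numbers 0, 1 and 2.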
The LOF algorithm requires the k-distance neighbours of each data point to be calculated, where the k-distance is the distance between the current point and its k-th nearest data point, and the k-distance neighbours are all data points whose distance from the current point does not exceed the k-distance. Therefore, taking the first working node as an example, when the k-distance neighbours of each data point in the first data set on the first working node are calculated, the first data set needs to be sent to the other working nodes, i.e. the second working nodes, so as to obtain the k-distance neighbours of each data point of the first data set within the whole data set.
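The two definitions can be sketched on a single machine as follows. The function name `k_distance_neighbours`, the use of Euclidean distance and the sample points are assumptions for illustration, and the sketch assumes that there are at least k other points.

```python
import math

def k_distance_neighbours(p, points, k):
    """points: {sequence number: vector}. Returns (k-distance of p, [(neighbour, distance), ...])."""
    # Distances from p to every other point, in ascending order.
    dists = sorted(
        (math.dist(points[p], vec), seq) for seq, vec in points.items() if seq != p
    )
    k_dist = dists[k - 1][0]  # distance to the k-th nearest point
    # Every point no farther away than the k-distance is a k-distance neighbour (ties included).
    neighbours = [(seq, d) for d, seq in dists if d <= k_dist]
    return k_dist, neighbours

points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0), 3: (5.0, 5.0)}
k_dist, neighbours = k_distance_neighbours(0, points, 1)
```

Because points 1 and 2 are equally close to point 0, both are returned even though k = 1; the number of k-distance neighbours can therefore exceed k.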
Therefore, the first working node in the cluster sends the first data set stored on it to the host node; correspondingly, the host node obtains the first data set stored on the first working node by means of a collect operation. Here, the working nodes take turns acting as the first working node; that is, each working node in the cluster can send its data set to the host node, and the host node can broadcast that data set to the working nodes other than the sending one. In other words, the host node broadcasts the first data set to the second working nodes in the cluster by means of a broadcast variable, so that all working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes; correspondingly, the second working node obtains the first data set of the first working node broadcast by the host node.
Fig. 2 is a schematic diagram of the host node obtaining and broadcasting the first data set, in which Driver is the host node and the four nodes below it are working nodes. A data set is stored on each working node; in this embodiment, each data set contains three data points stored in the data format [(sequence number, vector)]. Any one of the four working nodes can serve as the first working node, with the others serving as second working nodes. In this embodiment, the leftmost working node serves as the first working node and the other working nodes serve as second working nodes; the host node collects (collect) the data set on the first working node, i.e. the first data set, and broadcasts (broadcast) the first data set to the second working nodes. Further, the second working nodes take turns acting as the first working node, so that the data set on each working node is broadcast to the other working nodes.
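The rotation described above, in which every working node acts once as the first working node, can be simulated in plain Python. This is only a sketch under assumptions: the function name `collect_and_broadcast`, the worker names and the data are hypothetical, and ordinary lists stand in for the collect operation and the broadcast variable.

```python
def collect_and_broadcast(partitions):
    """partitions: {worker: [(sequence number, vector), ...]}. Returns what each worker receives."""
    received = {worker: [] for worker in partitions}
    # Each working node takes its turn as the first working node.
    for first_worker, first_data_set in partitions.items():
        collected = list(first_data_set)  # collect: the host node pulls the first data set
        for worker in partitions:         # broadcast: every other working node receives a copy
            if worker != first_worker:
                received[worker].extend(collected)
    return received

partitions = {
    "w0": [(0, (0.0,))],
    "w1": [(1, (1.0,))],
    "w2": [(2, (2.0,))],
}
received = collect_and_broadcast(partitions)
```

After all rounds, every working node holds its own partition plus a copy of every other partition, which is what allows it to compare its points against the whole data set.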
Continuing in this embodiment, the working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes. That is, each working node in the cluster calculates the abnormal probability of each data point based on the first data set broadcast by the host node through the broadcast variable and the data set stored on the current working node.
Preferably, a second data set is stored on the second working node. The first working node in the cluster obtains first global K-distance neighbours of the data points in the first data set, and the second working node obtains second global K-distance neighbours of the data points in the second data set. Further, the working nodes in the cluster determine in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reachability distance between each data point on the respective working node and its K-distance neighbours. Further, the working nodes in the cluster determine in parallel, based on the reachability distance between each data point and its K-distance neighbours, the reachability density of each data point on the respective working node. Further, the working nodes in the cluster determine in parallel, based on the reachability density of each data point, the abnormal probability of each data point on the respective working node.
In this embodiment, the first working node in the cluster obtains the first global K-distance neighbours of the data points in the first data set, and the second working node obtains the second global K-distance neighbours of the data points in the second data set.
Here, the first global K-distance neighbours are the K-distance neighbours of the data points in the first data set within all data sets of all working nodes, and the second global K-distance neighbours are the K-distance neighbours of the data points in the second data set within all data sets of all working nodes. The first global K-distance neighbours and the second global K-distance neighbours are collectively referred to as global K-distance neighbours, i.e. the K-distance neighbours of each data point within all data sets of all working nodes. For example, because the host node broadcasts the data set on each working node to the other working nodes, each working node can calculate the K-distance neighbours, within all data sets, of the data points in its own data set, thereby obtaining the global K-distance neighbours of each data point.
Preferably, the first working node obtaining the first global K-distance neighbours of the data points in the first data set includes:
the first working node calculating first local K-distance neighbours of the data points in the first data set, wherein the first local K-distance neighbours include the K-distance neighbours, within the first data set, of the data points in the first data set;
the second working node calculating second local K-distance neighbours of the data points in the first data set, wherein the second local K-distance neighbours include the K-distance neighbours, within the second data set, of the data points in the first data set;
the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set.
In this embodiment, the first working node calculates the distances between the data points in the first data set to obtain the K-distance neighbours of each data point of the first data set within the first data set, i.e. the first local K-distance neighbours.
Similarly, because the second working node has obtained the first data set, the second working node can calculate the distance between each data point in the first data set and each data point in the second data set, thereby obtaining the K-distance neighbours of each data point of the first data set within the second data set, i.e. the second local K-distance neighbours.
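The two local searches can be sketched with one helper that is run once against the first data set and once against the second. The function name `local_neighbours`, the use of Euclidean distance and the sample points are assumptions for illustration; for brevity the sketch keeps exactly K nearest candidates and does not add further points that tie at the K-th distance.

```python
import heapq
import math

def local_neighbours(queries, local_points, k):
    """For each (sequence number, vector) in queries, return its k nearest points in local_points."""
    result = {}
    for q_seq, q_vec in queries:
        candidates = [
            (math.dist(q_vec, vec), seq) for seq, vec in local_points if seq != q_seq
        ]
        # Keep the k candidates with the smallest distances as (neighbour, distance) pairs.
        result[q_seq] = [(seq, d) for d, seq in heapq.nsmallest(k, candidates)]
    return result

first_data_set = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (5.0, 0.0))]
second_data_set = [(3, (0.0, 2.0)), (4, (0.5, 0.0)), (5, (9.0, 0.0))]

# Computed on the first working node: neighbours within its own data set.
first_local = local_neighbours(first_data_set, first_data_set, 1)
# Computed on the second working node after it has received the broadcast first data set.
second_local = local_neighbours(first_data_set, second_data_set, 1)
```

In this sample, point 0 has point 1 as its first local neighbour and point 4 as its second local neighbour.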
Further, the first working node obtains, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set.
The global K-distance neighbours of the data points in the second data set can be determined in the same way as those of the data points in the first data set.
Preferably, the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set includes: the first working node obtaining the second local K-distance neighbours determined by the second working node; and the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the global K-distance neighbours of the data points in the first data set.
In this embodiment, the global K-distance neighbours of the data points in the first data set can be determined by the first working node; that is, the first local K-distance neighbours and the second local K-distance neighbours are aggregated on the first working node, where the global K-distance neighbours of the data points in the first data set are calculated. Similarly, the global K-distance neighbours of the data points in the second data set can be determined by the second working node, thereby realising distributed computation and improving computational efficiency.
Preferably, the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set includes: the first working node and the second working node transmitting the first local K-distance neighbours and the second local K-distance neighbours, respectively, to an intermediate working node; and the first working node receiving the global K-distance neighbours of the data points in the first data set sent by the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.
In this preferred embodiment, the global K-distance neighbours need not be calculated by a working node; instead, the calculation can be carried out by an intermediate node. Here, the intermediate node is a node provided separately, in addition to the host node and the working nodes. The intermediate node can obtain the first local K-distance neighbours and the second local K-distance neighbours and perform the corresponding calculation to determine the global K-distance neighbours.
Preferably, the local K-distance neighbours and the global K-distance neighbours are stored in a form that includes the sequence number identifier of the data point and the sequence number identifiers of its local and global K-distance neighbours. That is, when the local K-distance neighbours and the global K-distance neighbours of each data point are stored, the stored record can include the sequence number identifier of the data point and the sequence number identifiers of its local and global K-distance neighbours. For example, the K-distance neighbours of data point p can be denoted as (data point p, [(neighbour point o1, distance d1), ..., (neighbour point ok, distance dk)]), where each d is the distance between data point p and the corresponding neighbour point.
For example, in the embodiment shown in Fig. 3, the global or local K-distance neighbour of data point 0 is data point 1. Because K = 1, the K-distance neighbours of data point 0 can be stored as (0, 1, d1), where d1 is the distance between data point 0 and data point 1.
Preferably, obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the global K-distance neighbours of the data points in the first data set includes:
reducing the data points that have the same sequence number identifier in the first local K-distance neighbours and the second local K-distance neighbours; and determining the global K-distance neighbours of the data points in the first data set based on the reduction result. Specifically, a reduceByKey operation can be used to reduce all data points having the same sequence number to the same working node, for example to the first working node or the intermediate node.
Fig. 3 shows an embodiment in which the first working node obtains the first global K-distance neighbours of the data points in the first data set. In this embodiment, the data set stored on each working node includes three data points, each stored in the data format (sequence number, vector); for example, the data points with sequence numbers 0, 1 and 2 are stored on the first working node. In this embodiment it is assumed that K = 1 for the K-distance neighbours.
In this embodiment, the Driver collects the data set on the first working node and broadcasts it to the second working nodes. The first working node calculates the first local K-distance neighbours of the data points in the first data set, i.e. the local K-distance neighbours of the data points with sequence numbers 0, 1 and 2 within its own data set (the first data set). For example, the first local K-distance neighbour of the data point with sequence number 0 within its own data set is the data point with sequence number 1.
Likewise, the second working node calculates the second local K-distance neighbours of the data points in the first data set; that is, the second working node calculates the distance between each data point in the first data set and each data point in the second data set, and takes the nearest point as the local K-distance neighbour of that data point within the second data set. For example, the second local K-distance neighbour of the data point with sequence number 0 on second working node 2 is the data point with sequence number 4.
Continuing in this embodiment, the first working node obtains, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set. Specifically, the first local K-distance neighbours and the second local K-distance neighbours can each be sorted in ascending order of distance, and the K neighbours with the smallest distances can then be selected by traversing the first and second local K-distance neighbours.
Specifically, a reduceByKey operation can be used to reduce all data points having the same sequence number to the same working node. For example, as shown in Fig. 3, where K = 1, the local K-distance neighbours of the data point with sequence number 0 are reduced together and sorted by their distance from the data point with sequence number 0, and the K data points with the smallest distances are determined. In this embodiment, because K = 1, the global K-distance neighbour determined for the data point with sequence number 0 is (0.484, 0.598), i.e. the global K-distance neighbour of the data point with sequence number 0 is the data point with sequence number 1. Similarly, the global K-distance neighbour of the data point with sequence number 1 is found to be the data point with sequence number 2.
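The reduction of local candidates into global neighbours can be sketched as follows. This is an illustrative sketch only: the function name `merge_global` and the sample candidate lists are hypothetical, and a dictionary keyed by sequence number stands in for the reduceByKey operation.

```python
import heapq
from collections import defaultdict

def merge_global(local_results, k):
    """local_results: list of {sequence number: [(neighbour, distance), ...]}, one per working node."""
    grouped = defaultdict(list)
    for partial in local_results:
        for seq, neighbours in partial.items():
            grouped[seq].extend(neighbours)  # gather the candidates that share a sequence number
    # Keep the k candidates with the smallest distance for every data point.
    return {
        seq: heapq.nsmallest(k, candidates, key=lambda item: item[1])
        for seq, candidates in grouped.items()
    }

# Hypothetical local results for the data points of the first data set with K = 1.
first_local = {0: [(1, 1.0)], 1: [(0, 1.0)], 2: [(1, 4.0)]}
second_local = {0: [(4, 0.5)], 1: [(4, 0.5)], 2: [(5, 3.0)]}
global_neighbours = merge_global([first_local, second_local], 1)
```

Only the candidate lists are exchanged and merged here, not the raw vectors, so the merge step stays small regardless of the vector dimension.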
Preferably, the working nodes in the cluster determining in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reachability distance between each data point on the respective working node and its K-distance neighbours includes: the working nodes in the cluster determining the reachability distance between each data point and its K-distance neighbours based on the global K-distance neighbours of each data point on the respective working node, or based on the global K-distance neighbours of each data point on the respective working node and of the data points on other working nodes.
In this embodiment, when the global K-distance neighbours of every data point on a working node are data points on the same working node, the working node only needs the global K-distance neighbours of the data points on that working node to determine the reachability distance between each data point on that working node and its K-distance neighbours. When the global K-distance neighbours of some data point on a working node are data points on other working nodes, the reachability distance between each data point on that working node and its K-distance neighbours needs to be determined based on the global K-distance neighbours of each data point on that working node and of the data points involved on the other working nodes. The working nodes calculate in parallel, thereby determining the reachability distances to the K-distance neighbours of all data points.
Here, the reachability distance between data point o and data point p is defined as follows:
$$ reach\text{-}dist_k(p, o) = \max\{\, k\text{-}distance(o),\ d(p, o) \,\} $$
where d(p, o) is the distance between data points p and o.
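A short worked example may help to read the definition. The numbers are hypothetical and chosen only for illustration: data point o has a k-distance of 2, data point p lies at distance 1 from o, and data point q lies at distance 6 from o.

```latex
\begin{align*}
reach\text{-}dist_k(p, o) &= \max\{\, k\text{-}distance(o),\ d(p, o) \,\} = \max\{2,\ 1\} = 2 \\
reach\text{-}dist_k(q, o) &= \max\{\, k\text{-}distance(o),\ d(q, o) \,\} = \max\{2,\ 6\} = 6
\end{align*}
```

For points inside the k-distance neighbourhood of o the value is held at k-distance(o), which smooths out small differences in distance; for points outside it, the value is simply the actual distance.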
Specifically, the data points in the obtained K-distance neighbour records are swapped with their neighbour points. For example, the K-distance neighbours of data point p are denoted as RDD[(data point p, [(neighbour point o1, distance d1), ..., (neighbour point ok, distance dk)])]. After data point p is swapped with its neighbour points, the result can be denoted as Swapped_RDD[(neighbour point o1, (data point p, distance d1)), ..., (neighbour point ok, (data point p, distance dk)), ...], where each record of the RDD is stored as a key-value pair with the neighbour point as the key and the data point and the distance as the value. A cogroup operation (an RDD operation provided by Spark) is then performed on Swapped_RDD and the original RDD by key. The cogroup operation brings the records with the same key in Swapped_RDD and in the original RDD to the same working node. Therefore, for each record, the k-distance of the neighbour point serving as the key can be compared with the distance from the data point to that neighbour point, and the larger value is taken as the reachability distance from the data point to the neighbour point. The result is denoted as RDD[(data point p, [(neighbour point o1, reachability distance d1), ..., (neighbour point ok, reachability distance dk)])].
In the present application, because records with different keys are located on different working nodes after the reduction, the process of calculating the reachability distance is executed entirely in parallel: each working node is independently responsible for its own calculation task without affecting the others, which further improves the execution speed of the algorithm.
Further, the working nodes in the cluster calculate the reachability density of each data point in parallel; that is, each working node calculates the reachability density of each data point initially stored on that working node. The reachability density is defined as follows:
$$ lrd_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} reach\text{-}dist_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right) $$
Here, lrd_MinPts(p) denotes the reachability density of p, and N_MinPts(p) denotes all k-distance neighbours of the current data point p when k = MinPts. Specifically, a groupByKey operation is performed on the RDD obtained after the reachability distances are calculated, so that all records with the same key are reduced to the same working node. For each record after the reduction, the sum of the reachability distances to all neighbour points of the data point corresponding to the key and the number of those neighbour points are calculated, and the reachability density is calculated according to the above formula. The result is RDD[(data point p, [(neighbour point o1, reachability density d1), ..., (neighbour point ok, reachability density dk)])]. Likewise, the process of calculating the reachability density in the present application is executed entirely in parallel: each working node is independently responsible for its own calculation task without affecting the others.
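Further, the working nodes in the cluster determine in parallel, based on the reachability density of each data point, the abnormal probability of each data point on the respective working node. Here, the abnormal probability is denoted LOF_MinPts(p) and is defined as follows:
$$ LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|} $$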
Specifically, a cogroup operation is performed on the RDD obtained after the reachability densities are calculated and Swapped_RDD to calculate the reachability density of the data point, the sum of the reachability densities of all neighbour points of the data point, and the number of neighbour points; the abnormal probability that the data point is an abnormal data point is then calculated according to the above formula. Likewise, the process of calculating the abnormal probability in the present application is executed entirely in parallel: each working node is independently responsible for its own calculation task without affecting the others.
Compared with the prior art, in the present application the host node collects and broadcasts the data set on each working node, so that each working node calculates the abnormal probability of each data point based on the data set stored on that working node and the data sets broadcast by the host node. In this way the required computation is distributed reasonably across the working nodes and the computing capability of the whole cluster is fully used, so that the performance of the algorithm is significantly improved.
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realised in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be regarded in all respects as exemplary and non-restrictive, the scope of the invention being defined by the appended claims rather than by the above description; all changes falling within the meaning and range of equivalency of the claims are therefore intended to be embraced in the invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Claims (18)
1. A method for distributed data abnormality detection, wherein the method includes:
a host node obtaining a first data set stored on a first working node in a cluster, wherein the first data set includes at least one data point;
the host node broadcasting the first data set to a second working node in the cluster, so that the working nodes in the cluster calculate the abnormal probability of each data point based on the data sets on the respective working nodes, wherein the second working node is any working node in the cluster other than the first working node.
2. A method for distributed data abnormality detection, wherein the method includes:
a first working node in a cluster sending a first data set stored on this working node to a host node, wherein the first data set includes at least one data point;
a second working node in the cluster obtaining the first data set of the first working node broadcast by the host node, wherein the second working node is any working node in the cluster other than the first working node;
the working nodes in the cluster calculating the abnormal probability of each data point based on the data sets on the respective working nodes.
3. The method according to claim 2, wherein a second data set is stored on the second working node, and the working nodes in the cluster calculating the abnormal probability of each data point based on the data sets on the respective working nodes includes:
the first working node in the cluster obtaining first global K-distance neighbours of the data points in the first data set, and the second working node obtaining second global K-distance neighbours of the data points in the second data set;
the working nodes in the cluster determining in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reachability distance between each data point on the respective working node and its K-distance neighbours;
the working nodes in the cluster determining in parallel, based on the reachability distance between each data point and its K-distance neighbours, the reachability density of each data point on the respective working node;
the working nodes in the cluster determining in parallel, based on the reachability density of each data point, the abnormal probability of each data point on the respective working node.
4. The method according to claim 3, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set includes:
the first working node calculating first local K-distance neighbours of the data points in the first data set, wherein the first local K-distance neighbours include the K-distance neighbours, within the first data set, of the data points in the first data set;
the second working node calculating second local K-distance neighbours of the data points in the first data set, wherein the second local K-distance neighbours include the K-distance neighbours, within the second data set, of the data points in the first data set;
the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set.
5. The method according to claim 4, wherein the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set includes:
the first working node obtaining the second local K-distance neighbours determined by the second working node;
the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the global K-distance neighbours of the data points in the first data set.
6. The method according to claim 4, wherein the first working node obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the first global K-distance neighbours of the data points in the first data set includes:
the first working node and the second working node transmitting the first local K-distance neighbours and the second local K-distance neighbours, respectively, to an intermediate working node;
the first working node receiving the global K-distance neighbours of the data points in the first data set sent by the intermediate working node, wherein the intermediate working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.
7. The method according to any one of claims 4 to 6, wherein the local K-distance neighbours and the global K-distance neighbours are stored in a form that includes the sequence number identifier of the data point and the sequence number identifiers of the local and global K-distance neighbours of the data point.
8. The method according to claim 11, wherein obtaining, based on the first local K-distance neighbours and the second local K-distance neighbours, the global K-distance neighbours of the data points in the first data set includes:
reducing the data points that have the same sequence number identifier in the first local K-distance neighbours and the second local K-distance neighbours;
determining the global K-distance neighbours of the data points in the first data set based on the reduction result.
9. The method according to claim 3, wherein the working nodes in the cluster determining in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reachability distance between each data point on the respective working node and its K-distance neighbours comprises:
each working node in the cluster determining in parallel the reachability distance between each data point on that working node and its K-distance neighbours, based on the global K-distance neighbours of each data point on that working node, or based on the global K-distance neighbours of each data point on that working node and of the data points on other working nodes.
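The sketch below assumes the reachability distance follows the usual Local Outlier Factor definition, reach(p, o) = max(k_distance(o), d(p, o)); this section of the patent does not spell out the formula, so treat that as an assumption. It also shows why a working node may need the global K-distance neighbours of points held elsewhere: the K-distance of a neighbour o is read from o's own neighbour list. All names and sample values are hypothetical.

```python
def k_distance(neighbours):
    """Distance from a point to the farthest of its global K nearest neighbours."""
    return max(dist for dist, _ in neighbours)


def reachability_distances(point_id, global_knn):
    """Map each neighbour o of point_id to max(k_distance(o), d(point_id, o))."""
    return {
        o: max(k_distance(global_knn[o]), dist)
        for dist, o in global_knn[point_id]
    }


# Global neighbour lists as (distance, neighbour_id) pairs, with K = 2.
global_knn = {
    "p": [(1.0, "q"), (3.0, "r")],
    "q": [(1.0, "p"), (2.0, "r")],
    "r": [(2.0, "q"), (3.0, "p")],
}

reach_p = reachability_distances("p", global_knn)
print(reach_p)
```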
10. A master node for distributed anomaly detection of data, wherein the master node is configured to:
obtain a first data set stored on a first working node in a cluster, wherein the first data set includes at least one data point; and broadcast the first data set to a second working node in the cluster, so that each working node in the cluster calculates the anomaly probability of each data point based on the data sets on that working node, wherein the second working node is any working node in the cluster other than the first working node.
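The broadcast step can be simulated in a single process as follows. The `Worker` and `Master` classes are hypothetical stand-ins for the working nodes and the master node, and real network transport is deliberately left out of this sketch.

```python
class Worker:
    """Stand-in for a working node holding its own partition of the data."""

    def __init__(self, name, local_points):
        self.name = name
        self.local_points = list(local_points)
        self.received = {}  # data sets broadcast from other working nodes

    def receive(self, source_name, points):
        self.received[source_name] = list(points)


class Master:
    """Stand-in for the master node that relays one worker's data set to the rest."""

    def __init__(self, workers):
        self.workers = workers

    def broadcast_from(self, source):
        data_set = source.local_points  # obtain the first data set
        for worker in self.workers:
            if worker is not source:  # every working node except the source
                worker.receive(source.name, data_set)


w1 = Worker("w1", [(0.0, 0.0), (1.0, 0.0)])
w2 = Worker("w2", [(5.0, 5.0)])
w3 = Worker("w3", [(9.0, 1.0)])
master = Master([w1, w2, w3])
master.broadcast_from(w1)
print(w2.received)
```

After the broadcast, each receiving worker holds both its own partition and a copy of the first data set, which is what lets it compute neighbours of the first data set's points locally.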
11. A cluster for distributed anomaly detection of data, wherein the cluster comprises:
a first working node, configured to send a first data set stored on that working node to a master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the first data set includes at least one data point;
a second working node, configured to obtain the first data set of the first working node as broadcast by the master node and to calculate the anomaly probability of each data point based on the data sets on that working node, wherein the second working node is any working node in the cluster other than the first working node.
12. The cluster according to claim 11, wherein a second data set is stored on the second working node, and wherein:
the first working node obtains first global K-distance neighbours of the data points in the first data set, and the second working node obtains second global K-distance neighbours of the data points in the second data set; the working nodes in the cluster determine in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reachability distance between each data point on the respective working node and its K-distance neighbours;
determine in parallel, based on the reachability distance between each data point and its K-distance neighbours, the reachability density of each data point on the respective working node;
and determine in parallel, based on the reachability density of each data point, the anomaly probability of each data point on the respective working node.
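The chain from reachability distance to reachability density to an anomaly score can be sketched as below. It assumes the standard Local Outlier Factor formulas (density as the inverse of the mean reachability distance, score as the mean density ratio against the neighbours); how the patent turns this into an "anomaly probability" is not stated in this section, so the score here is only an LOF-style stand-in. The function names and the four 1-D sample points (a=0, b=1, c=2, z=10) are hypothetical.

```python
def k_distance(neighbours):
    """Distance from a point to the farthest of its K nearest neighbours."""
    return max(dist for dist, _ in neighbours)


def reachability_density(p, knn):
    """Inverse of the mean reachability distance from p to its neighbours."""
    reach = [max(k_distance(knn[o]), dist) for dist, o in knn[p]]
    return len(reach) / sum(reach)


def outlier_factor(p, knn):
    """Mean ratio of the neighbours' densities to the density of p."""
    density_p = reachability_density(p, knn)
    ratios = [reachability_density(o, knn) / density_p for _, o in knn[p]]
    return sum(ratios) / len(ratios)


# Global 2-nearest-neighbour lists for the points a=0, b=1, c=2, z=10 on a line.
knn = {
    "a": [(1.0, "b"), (2.0, "c")],
    "b": [(1.0, "a"), (1.0, "c")],
    "c": [(1.0, "b"), (2.0, "a")],
    "z": [(8.0, "c"), (9.0, "b")],
}

scores = {p: outlier_factor(p, knn) for p in knn}
print(scores)
```

Each working node would evaluate these functions only for the data points it owns, which is what makes the three steps parallel across the cluster; scores well above 1 mark points that are much sparser than their neighbours.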
13. The cluster according to claim 12, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set comprises:
the first working node calculates first local K-distance neighbours of the data points in the first data set, wherein the first local K-distance neighbours comprise the K-distance neighbours, within the first data set, of the data points in the first data set;
the second working node calculates second local K-distance neighbours of the data points in the first data set, wherein the second local K-distance neighbours comprise the K-distance neighbours, within the second data set, of the data points in the first data set;
the first working node obtains the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.
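The two local searches in this claim can be illustrated with a brute-force sketch: the same query points (the first data set) are searched once against the first data set and once against the second. Euclidean distance, unique identifiers across data sets, and the name `local_knn` are assumptions of this example, not details taken from the patent.

```python
import heapq
import math


def local_knn(queries, candidates, k):
    """For each query point, find its k nearest points among the candidates.

    queries, candidates: dicts mapping point id to a coordinate tuple.
    A point is never reported as its own neighbour.
    """
    result = {}
    for qid, q in queries.items():
        dists = [
            (math.dist(q, c), cid)
            for cid, c in candidates.items()
            if cid != qid
        ]
        result[qid] = heapq.nsmallest(k, dists)
    return result


first_data_set = {"a1": (0.0, 0.0), "a2": (1.0, 0.0), "a3": (0.0, 2.0)}
second_data_set = {"b1": (0.5, 0.0), "b2": (5.0, 5.0)}

# Computed on the first working node: neighbours inside the first data set.
first_local = local_knn(first_data_set, first_data_set, k=1)
# Computed on the second working node after it received the first data set.
second_local = local_knn(first_data_set, second_data_set, k=1)
print(first_local["a1"], second_local["a1"])
```

Merging `first_local` and `second_local` per point then yields the global neighbours; for `a1` the global nearest neighbour is `b1`, which lives on the other working node.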
14. The cluster according to claim 13, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours comprises:
the first working node obtains the second local K-distance neighbours determined by the second working node;
the first working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.
15. The cluster according to claim 13, wherein the first working node obtaining the first global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours comprises:
the first working node and the second working node transmit the first local K-distance neighbours and the second local K-distance neighbours to an intermediate working node;
and the global K-distance neighbours of the data points in the first data set, as sent by the intermediate working node, are received, wherein the intermediate working node obtains the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours.
16. The cluster according to any one of claims 13 to 15, wherein the local K-distance neighbours and the global K-distance neighbours are stored in a form that includes the serial-number identifier of the data point and the serial-number identifiers of that data point's local and global K-distance neighbours.
17. The cluster according to claim 16, wherein obtaining the global K-distance neighbours of the data points in the first data set based on the first local K-distance neighbours and the second local K-distance neighbours comprises:
reducing the data points in the first local K-distance neighbours and the second local K-distance neighbours that have the same serial-number identifier;
determining the global K-distance neighbours of the data points in the first data set based on the result of the reduction.
18. The cluster according to claim 12, wherein the working nodes in the cluster determining in parallel, based on the first global K-distance neighbours and the second global K-distance neighbours, the reachability distance between each data point on the respective working node and its K-distance neighbours comprises:
each working node in the cluster determining in parallel the reachability distance between each data point on that working node and its K-distance neighbours, based on the global K-distance neighbours of each data point on that working node, or based on the global K-distance neighbours of each data point on that working node and of the data points on other working nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710778563.7A CN107528904B (en) | 2017-09-01 | 2017-09-01 | Method and apparatus for data distributed anomaly detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107528904A (en) | 2017-12-29 |
CN107528904B (en) | 2020-02-18 |
Family
ID=60683289
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710778563.7A Active CN107528904B (en) | 2017-09-01 | 2017-09-01 | Method and apparatus for data distributed anomaly detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107528904B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101702660A (en) * | 2009-11-12 | 2010-05-05 | 中国科学院计算技术研究所 | Abnormal domain name detection method and system |
CN104852886A (en) * | 2014-02-14 | 2015-08-19 | 腾讯科技(深圳)有限公司 | Protection method and device for user account |
CN104579846A (en) * | 2015-01-21 | 2015-04-29 | 清华大学 | Network flow anomaly detection method based on adjustable segmented entropy |
US20170149814A1 (en) * | 2015-04-16 | 2017-05-25 | Nec Laboratories America, Inc. | Real-Time Detection of Abnormal Network Connections in Streaming Data |
WO2017028930A1 (en) * | 2015-08-20 | 2017-02-23 | Telefonaktiebolaget Lm Ericsson (Publ) | Methods and apparatus for running an analytics function |
CN105871634A (en) * | 2016-06-01 | 2016-08-17 | 北京蓝海讯通科技股份有限公司 | Method and application for detecting cluster anomalies and cluster managing system |
CN107070753A (en) * | 2017-06-15 | 2017-08-18 | 郑州云海信息技术有限公司 | A kind of data monitoring method of distributed cluster system, apparatus and system |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108471428A (en) * | 2018-06-27 | 2018-08-31 | 北京云端智度科技有限公司 | Applied to the ddos attack initiative type safeguard technology and equipment in CDN system |
CN108471428B (en) * | 2018-06-27 | 2021-05-28 | 北京云端智度科技有限公司 | DDoS attack active defense technology and equipment applied to CDN system |
CN109388512A (en) * | 2018-10-30 | 2019-02-26 | 江苏物联网研究发展中心 | For the assessment and analysis system of large-scale computer cluster intensity of anomaly |
CN111835696A (en) * | 2019-04-23 | 2020-10-27 | 阿里巴巴集团控股有限公司 | Method and device for detecting abnormal request individuals |
CN111835696B (en) * | 2019-04-23 | 2023-05-09 | 阿里巴巴集团控股有限公司 | Method and device for detecting abnormal request individuals |
CN110580030A (en) * | 2019-10-11 | 2019-12-17 | 南京铁道职业技术学院 | Pharmaceutical factory environment purification control system based on Internet of things |
WO2023131962A1 (en) * | 2022-01-04 | 2023-07-13 | Telefonaktiebolaget Lm Ericsson (Publ) | First node, second node and methods performed thereby for handling anomalous values |
Also Published As
Publication number | Publication date |
---|---|
CN107528904B (en) | 2020-02-18 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN107528904A (en) | Method and apparatus for data distributed anomaly detection | |
US10120956B2 (en) | Methods and systems for distributed computation of graph data | |
He et al. | Computing nearest-neighbor fields via propagation-assisted kd-trees | |
EP2821944A1 (en) | System and method for parallel search on explicitly represented graphs | |
CN105630847B (en) | Date storage method, data query method, apparatus and system | |
CN102822822B (en) | Image management apparatus, image management method, program, record medium, integrated circuit | |
KR20130113653A (en) | System and method for analyzing cluster result of mass data | |
Dai et al. | A new replica placement policy for hadoop distributed file system | |
Ibrahim et al. | Intelligent data placement mechanism for replicas distribution in cloud storage systems | |
CN108268586A (en) | Across the data processing method of more tables of data, device, medium and computing device | |
US20110179013A1 (en) | Search Log Online Analytic Processing | |
Dai et al. | Improving load balance for data-intensive computing on cloud platforms | |
CN105488176A (en) | Data processing method and device | |
US20100205075A1 (en) | Large-scale item affinity determination using a map reduce platform | |
CN108204821A (en) | A kind of paths planning method and device | |
Ma et al. | In-memory distributed indexing for large-scale media data retrieval | |
Aksehirli et al. | Efficient cluster detection by ordered neighborhoods | |
CN109993338B (en) | Link prediction method and device | |
US20160055211A1 (en) | Apparatus and method for memory storage and analytic execution of time series data | |
Li et al. | Algorithms for loading parallel grid files | |
CN108073356A (en) | A kind of data storage, lookup method, device and data handling system | |
CN112015825B (en) | Model registration method and device based on block chain and electronic equipment | |
CN106649566A (en) | Serial number generation method and device | |
JP6784096B2 (en) | Data distribution program, data distribution method, and data distribution device | |
Van Dam et al. | Duplicate detection in web shops using LSH to reduce the number of computations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | ||
Address after: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai; Patentee after: Star link information technology (Shanghai) Co.,Ltd.; Address before: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai; Patentee before: TRANSWARP TECHNOLOGY (SHANGHAI) Co.,Ltd. |