CN103150470A - Visualization method for concept drift of data stream in dynamic data environment - Google Patents

Visualization method for concept drift of data stream in dynamic data environment

Publication number
CN103150470A
Authority
CN
China
Prior art keywords
concept
data
data block
kdq
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100520887A
Other languages
Chinese (zh)
Other versions
CN103150470B (en)
Inventor
冯林
姚远
陈沣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201310052088.7A
Publication of CN103150470A
Application granted
Publication of CN103150470B
Expired - Fee Related
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of intelligent information processing and discloses a method for visualizing concept drift of a data stream in a dynamic data environment. The method comprises the following steps: converting the data stream into static data blocks; building a concept representation for each concept drift pattern and saving these representations in a concept pool; and, when a new data block arrives, using the KL divergence to search the concept pool for a similar concept representation. If a similar representation exists, its statistics are updated; if not, the new data block is added to the concept pool and saved as a new concept. The method can detect various types of drift in a data stream and, through these statistics, can fully analyze the concept drift process in the stream. Finally, it uses Bayes' formula to draw a concept drift transition graph from the statistical results and visualizes that graph, assisting data mining at the concept level.

Description

A method for visualizing data stream concept drift in a dynamic data environment
Technical field
The present invention relates to the field of intelligent information processing technology, and in particular to a method for visualizing data stream concept drift in a dynamic environment. It is applicable to network intrusion detection, network security monitoring, sensor data monitoring, power grid supply, and similar areas.
Background art
With the continued development of information technology, traditional data mining methods face new challenges. The foremost of these is the change in the form of data, from traditional static data to dynamic data streams. How to mine data streams effectively and extract the knowledge they contain is therefore attracting growing attention from industry.
Unlike static data, a data stream has three characteristics: massive volume, real-time arrival, and dynamic change. These characteristics require traditional data mining models to be adjusted to suit data streams. Most current data stream models and methods are therefore built around the data attributes of the stream itself, for example data stream classification models, clustering models, and dimensionality reduction models. For mining at the level of the concepts contained in a data stream, however, there is still no corresponding technique.
The few existing techniques related to data stream concepts mainly detect or classify the concept drift present in a data stream in real time, in order to support follow-up work. The problem of visualizing concept drift remains unaddressed in both academia and industry. Although concept drift visualization is still at an exploratory stage, a concept is a high-level representation of data, so visualizing it is of great significance for understanding data and extracting knowledge from it. Drawing on other visualization methods, for example manifold graphs and circle representations, concept drift can be visualized once the characteristics of the concepts have been obtained. This gives follow-up work an intuitive representation and helps it proceed smoothly and effectively. There is a need in the art for a method for visualizing data stream concept drift in a dynamic environment.
Summary of the invention
The object of the invention is to solve the above problems in the prior art and, given the lack of research on concept drift visualization, to provide a method for visualizing data stream concept drift in a dynamic environment.
To achieve this object, the technical solution adopted by the present invention is a method for visualizing data stream concept drift in a dynamic data environment, comprising the following steps:
Step 1: the dynamic data stream collection module 102 collects data in time order from the massive real-time data stream 101;
Step 2: the data stream division module 103 reads the stream data from step 1 and divides the data stream according to the order in which the data arrive. Each data block produced by the data stream division module 103 contains N records; N is a fixed variable set in advance by the user;
Step 3: the static data blocks produced by the data stream division module 103 are input to the kdq-tree module 104 to build a kdq tree. The threshold corresponding to the kdq tree is either computed with a bootstrap method based on the KL divergence or given directly by the user;
Step 4: the kdq tree built by the kdq-tree module 104 and its corresponding threshold are saved in the concept pool 106;
Step 5: the concept detection module 105 obtains a new data block from the data stream division module 103 and detects whether the new data block represents a new concept. The detection result is given by comparing the KL divergence value between the original data block and the new data block with the threshold of the corresponding kdq tree saved in the concept pool 106. Computing the KL divergence requires discretizing the original data block; the discretization result is obtained by passing the data block through the kdq tree;
Step 6: when the data stream division module 103 obtains a new data block, the block is compared with the concepts saved in the concept pool 106. If a similar concept is found, the concept statistics module 107 is updated; otherwise the data block is added to the concept pool 106 as a new concept;
Step 7: steps 1 to 6 are repeated until the data stream ends. The statistics in the concept statistics module 107 are then aggregated to compute the statistics of each concept in the concept pool 106;
Step 8: these statistics are input to the concept drift module 108, which uses Bayes' formula to construct the concept transition graph, completing the concept drift visualization process.
Building the kdq tree in step 3 comprises the following sub-steps:
Step 3.1: first select the first attribute of the data block as the current attribute and find the median value v in the current dimension. The median splits the data block so that the two resulting subsets have approximately equal sample counts; that is, the number of records whose value for the current attribute is greater than v is approximately equal to the number of records whose value is less than or equal to v;
Step 3.2: within each resulting subset, search the subsequent attributes for one that satisfies the split condition, select it as the current attribute, and repeat the median search to continue splitting the subset;
Step 3.3: repeat the above process until the termination condition is satisfied;
The split condition is: the difference between the maximum and minimum values of the data in the current dimension is greater than a variable ε, whose value is specified by the user;
The termination condition is: the size of the current data block is less than n_min, or the difference between the maximum and minimum values in every dimension is less than ε, where the value of n_min is specified in advance by the user.
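Steps 3.1 to 3.3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` class, the function names, the cyclic search order over dimensions, and the handling of a degenerate median (one that would leave the right subset empty) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Sequence

Point = Sequence[float]


@dataclass
class Node:
    dim: Optional[int] = None        # split dimension; None marks a leaf
    value: Optional[float] = None    # median decision value v
    left: Optional["Node"] = None    # records with x[dim] <= v
    right: Optional["Node"] = None   # records with x[dim] > v
    size: int = 0                    # number of records that reached this node


def build_kdq_tree(points: List[Point], eps: float, n_min: int,
                   start_dim: int = 0) -> Node:
    """Recursively split `points` at medians until a termination condition holds."""
    n_dims = len(points[0]) if points else 0
    if len(points) >= n_min:  # otherwise the block is too small: make a leaf
        # Assumption: search dimensions cyclically, starting after the last split.
        for offset in range(n_dims):
            dim = (start_dim + offset) % n_dims
            values = [p[dim] for p in points]
            if max(values) - min(values) <= eps:
                continue  # split condition (range > eps) not satisfied
            v = median(values)
            left = [p for p in points if p[dim] <= v]
            right = [p for p in points if p[dim] > v]
            if not right:
                continue  # degenerate median; try the next dimension
            return Node(dim, v,
                        build_kdq_tree(left, eps, n_min, dim + 1),
                        build_kdq_tree(right, eps, n_min, dim + 1),
                        len(points))
    return Node(size=len(points))  # leaf: no dimension can be split further


def count_leaves(node: Node) -> int:
    """Number of cells in the partition induced by the tree."""
    if node.dim is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)
```

Because each split is made at the median, the leaves hold roughly equal numbers of records from the block that built the tree, which is the property the later KL divergence comparison relies on.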
In step 4, the kdq-tree module 104 uses the bootstrap method to determine the threshold corresponding to the kdq tree, comprising the following steps:
Step 4.1: draw N records from the original data block with replacement, that is, the drawn records are not deleted from the original data block, and form a new data block from the drawn records;
Step 4.2: partition the new data block with the kdq tree to obtain its discretization result;
Step 4.3: compute the KL divergence value between the new data block and the original data block using the KL divergence formula, and add the result to a queue;
The KL divergence is computed with formulas (1) and (2):
[Formula (1): equation image not reproduced in the source text]
[Formula (2): equation image not reproduced in the source text]
In formula (1), $kl_1$ denotes the KL divergence between the data distributions of data block $C_a$ and data block $C_b$; $P_{C_a}(x)$ denotes the probability distribution of data block $C_a$ after discretization, and $P_{C_b}(x)$ that of data block $C_b$ after discretization; $w_{b,j}$ denotes the number of records of data block $C_b$ in the $j$-th interval after discretization, and $w_{a,j}$ the number of records of data block $C_a$ in the $j$-th interval after discretization; $N_b$ denotes the total number of records in data block $C_b$; $T$ denotes the total number of intervals obtained by discretizing a data block. The discretization result of a data block is obtained from the kdq tree;
In formula (2), $kl_2$ denotes the KL divergence between the labeled data distributions of data block $C_a$ and data block $C_b$; $P_{C_a}(Y|x)$ denotes the probability distribution of each label in data block $C_a$ after discretization, and $P_{C_b}(Y|x)$ that of the labels in data block $C_b$ after discretization; $w_{b,i,j}$ denotes the number of records of data block $C_b$ with label $i$ in the $j$-th interval after discretization, and $w_{a,i,j}$ the number of records of data block $C_a$ with label $i$ in the $j$-th interval after discretization; $N_b$ denotes the total number of records in data block $C_b$; $T$ denotes the total number of intervals obtained by discretizing a data block, and $|Y|$ denotes the number of distinct labels in the data. The discretization result of a data block is obtained from the kdq tree;
Step 4.4: repeat steps 4.1 to 4.3 a total of k times, where k is a constant set in advance by the user;
Step 4.5: sort the values in the queue by magnitude and take the value at the 1-α position (the (1-α) quantile) as the threshold.
Here α denotes the confidence level for the occurrence of concept drift; α is greater than 0 and less than 1 and is specified in advance by the user.
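The bootstrap procedure of steps 4.1 to 4.5 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `bin_of` is a hypothetical stand-in for routing a record through the kdq tree to a leaf index, the additive smoothing constant that avoids `log(0)` is an assumption, and so is the direction of the divergence, since the formula images are missing from the source text.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Sequence


def kl_divergence(counts_b: Sequence[int], counts_a: Sequence[int],
                  smooth: float = 0.5) -> float:
    """KL(P_b || P_a) over T bins, computed from raw bin counts.

    The additive smoothing is an assumption that keeps empty bins from
    producing log(0) or a division by zero.
    """
    t = len(counts_a)
    n_b = sum(counts_b) + smooth * t
    n_a = sum(counts_a) + smooth * t
    total = 0.0
    for w_b, w_a in zip(counts_b, counts_a):
        p_b = (w_b + smooth) / n_b
        p_a = (w_a + smooth) / n_a
        total += p_b * math.log(p_b / p_a)
    return total


def bin_counts(block: Sequence, bin_of: Callable[[object], int], n_bins: int) -> List[int]:
    """Count how many records of `block` fall into each bin."""
    counter = Counter(bin_of(x) for x in block)
    return [counter[j] for j in range(n_bins)]


def bootstrap_threshold(block: Sequence, bin_of: Callable[[object], int], n_bins: int,
                        k: int = 200, alpha: float = 0.05, seed: int = 0) -> float:
    """Estimate the drift threshold as the (1 - alpha) quantile of bootstrap KL values."""
    rng = random.Random(seed)
    base = bin_counts(block, bin_of, n_bins)
    values = []
    for _ in range(k):
        # Step 4.1: draw N records with replacement from the original block.
        resample = rng.choices(block, k=len(block))
        # Steps 4.2 and 4.3: discretize the resample and compare it with the original.
        values.append(kl_divergence(bin_counts(resample, bin_of, n_bins), base))
    values.sort()  # step 4.5
    index = max(0, min(k - 1, math.ceil((1 - alpha) * k) - 1))
    return values[index]
```

A block whose divergence from the original exceeds the returned threshold would then be treated as drifted; fixing `seed` makes the estimate reproducible.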
In step 4.2, partitioning the new data block with the kdq tree to obtain the discretization result comprises the following steps:
Step 4.2.1: route each record of the new data block according to the decision value at each node of the kdq tree. If the record's value in the node's dimension is less than or equal to the node's decision value, the record enters the left subtree of that node; otherwise it enters the right subtree;
Step 4.2.2: repeat this decision process until every record in the data block has reached a leaf node of the kdq tree, which yields the partition of the data block;
Step 4.2.3: from the kdq-tree partition of the current data block, divide the number of samples in each cell by the total number of samples in the data block to obtain the probability distributions of the current data block over the kdq-tree discretization, $P_C(x)$ and $P_C(Y|x)$.
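The routing in steps 4.2.1 to 4.2.3 can be sketched as below. The nested-tuple tree encoding (an internal node is `(dim, value, left, right)` and a leaf is an integer cell index) is a hypothetical representation chosen to keep the example self-contained, and only the unlabeled distribution over cells is computed.

```python
from collections import Counter
from typing import List, Sequence, Tuple, Union

# Hypothetical encoding: internal node = (dim, value, left, right), leaf = cell index.
Tree = Union[int, Tuple[int, float, "Tree", "Tree"]]


def leaf_of(tree: Tree, point: Sequence[float]) -> int:
    """Route one record down the tree: left if x[dim] <= value, else right."""
    node = tree
    while not isinstance(node, int):
        dim, value, left, right = node
        node = left if point[dim] <= value else right
    return node


def leaf_distribution(tree: Tree, block: List[Sequence[float]], n_leaves: int) -> List[float]:
    """Fraction of the block's records that fall into each leaf cell."""
    if not block:
        raise ValueError("block must contain at least one record")
    counts = Counter(leaf_of(tree, p) for p in block)
    return [counts[j] / len(block) for j in range(n_leaves)]
```

For example, with `tree = (0, 1.5, 0, (1, 0.5, 1, 2))`, records whose first coordinate is at most 1.5 land in cell 0, and the remaining records are separated on their second coordinate into cells 1 and 2.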
In step 5, detection of concept drift by the concept detection module 105 comprises the following sub-steps:
Step 5.1: pass the data block through the kdq tree to obtain the kdq-tree discretization result for the data block;
Step 5.2: using the KL divergence formula, compute the KL divergence value between the kdq-tree discretization result of the new data block and the kdq-tree discretization result of each concept saved in the concept pool module 106;
Step 5.3: compare each KL divergence value with the threshold corresponding to the kdq tree. If the computed KL divergence value is less than the threshold, no concept drift has occurred; otherwise concept drift has occurred.
Step 6 comprises the following sub-steps:
Step 6.1: compute the KL divergence value between the kdq-tree discretization result of the new data block and each concept saved in the concept pool module 106.
Step 6.2: sort the resulting KL divergence values. If the smallest KL divergence value is still greater than the threshold set by the concept detection module, the concept pool module 106 contains no concept corresponding to this data block; that is, the data block represents a new concept, so its kdq-tree discretization result is stored in the concept pool module 106 as a new concept. If a concept that satisfies the threshold is found, the statistics for that concept are updated in the concept statistics module 107.
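The matching logic of steps 6.1 and 6.2 can be sketched as follows. The `Concept` record and the `divergence` and `fit` callables are hypothetical placeholders: in the method described above they would correspond to the KL divergence under a stored kdq tree and to building a kdq tree plus its bootstrap threshold. Using a separate threshold per stored concept is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Concept:
    model: Any          # opaque stand-in for a stored kdq tree and its histogram
    threshold: float    # drift threshold saved together with the tree
    count: int = 1      # how many data blocks have matched this concept


def match_or_add(pool: List[Concept], block: Any,
                 divergence: Callable[[Any, Any], float],
                 fit: Callable[[Any], Tuple[Any, float]]) -> int:
    """Return the index of the matching concept, adding a new one if none matches."""
    best_idx, best_kl = None, float("inf")
    for idx, concept in enumerate(pool):
        kl = divergence(concept.model, block)
        # A divergence below the concept's threshold means "no drift from this concept".
        if kl < concept.threshold and kl < best_kl:
            best_idx, best_kl = idx, kl
    if best_idx is None:
        model, threshold = fit(block)      # the block represents a new concept
        pool.append(Concept(model, threshold))
        return len(pool) - 1
    pool[best_idx].count += 1              # update the concept statistics
    return best_idx
```

The sequence of returned indices is the concept history of the stream, which is the input needed later for estimating transition probabilities.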
Step 8 comprises the following sub-steps:
Step 8.1: aggregate the statistics held in the concept statistics module 107 for each concept stored in the concept pool module 106;
Step 8.2: use Bayes' formula to compute the transition probabilities between different concepts;
[Transition probability formula: equation image not reproduced in the source text]
Here $P(C_i)$ denotes the probability that the $i$-th concept occurs, $P(C_j)$ denotes the probability that the $j$-th concept occurs, and $P(C_j|C_i)$ denotes the probability that the $j$-th concept occurs given that the $i$-th concept has occurred.
Step 8.3: collect the computed transition probabilities between concepts, input them to the data stream concept drift module 108, and draw the data stream concept drift graph, completing the visualization process.
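One simple way to obtain $P(C_j|C_i)$ from the collected statistics is to count consecutive concept pairs in the block sequence, as in the sketch below. This counting estimate, P(C_i followed by C_j) / P(C_i), is an assumption made for illustration; the source formula is available only as an image, so this may differ from the exact expression used in the patent.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def transition_probabilities(sequence: Sequence[Hashable]) -> Dict[Hashable, Dict[Hashable, float]]:
    """Estimate P(next = j | current = i) from a sequence of concept labels."""
    pair_counts = Counter(zip(sequence, sequence[1:]))   # counts of i -> j transitions
    from_counts = Counter(sequence[:-1])                 # how often i has a successor
    probs: Dict[Hashable, Dict[Hashable, float]] = defaultdict(dict)
    for (i, j), count in pair_counts.items():
        probs[i][j] = count / from_counts[i]
    return dict(probs)
```

For the label sequence `["C1", "C1", "C2", "C1"]`, concept C1 is followed once by itself and once by C2, so both transitions receive probability 0.5, while C2 moves to C1 with probability 1.0.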
The data stream 101 includes data from network intrusion detection, network security monitoring, sensor data monitoring, and power grid supply.
The beneficial effects of the invention are as follows. The invention uses the kdq tree and the KL divergence to detect concept drift, can detect different types of concept drift, and stores the different concepts in the concept pool. By comparing each new data block with the concepts saved in the concept pool, it can count how often each concept occurs in the stream and record the transitions between different concepts. Finally it constructs the concept drift graph, completing the data stream concept drift visualization task.
Description of drawings
Fig. 1 is a flow block diagram of the concept drift visualization method in a dynamic data environment according to the present invention.
Fig. 2 shows a specific embodiment in which the data stream division module divides the data stream.
Fig. 3 is a flow chart of the kdq-tree module building a kdq tree.
Fig. 4 shows the specific flow by which the bootstrap method obtains the threshold corresponding to the kdq tree.
Fig. 5 shows a specific embodiment in which the concept detection module performs concept detection on a data block against the concept pool.
Fig. 6 is a concept drift transition graph in a dynamic data environment according to the present invention.
Reference numerals: 101, data stream; 102, data stream collection module; 103, data stream division module; 104, kdq-tree module; 105, concept detection module; 106, concept pool module; 107, concept statistics module; 108, data stream concept drift module.
Embodiment
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, the framework of the data stream concept drift visualization method in a dynamic data environment according to the present invention comprises the data stream 101, the data stream collection module 102, the data stream division module 103, the kdq-tree module 104, the concept detection module 105, the concept pool module 106, the concept statistics module 107, and the data stream concept drift module 108. The method comprises the following steps:
Step 1: the data stream collection module 102 collects data in time order from the data stream 101. The data stream 101 includes any type of data stream known to persons of ordinary skill in the art, in particular network intrusion detection data streams, network security monitoring data streams, sensor monitoring data streams, and power grid supply data streams. Data streams are usually generated in real time, which makes computing over and storing their data very difficult.
Step 2: the data stream division module 103 reads data from the data stream collection module 102 and divides the data stream in time order according to the preset data block capacity. Each data block produced by the data stream division module 103 contains N records; N is a fixed variable set in advance by the user. The kdq-tree module 104, the concept detection module 105, and the concept pool module 106 obtain the current data block they need from the division result of the data stream division module 103.
Step 3: the static data blocks produced by the data stream division module 103 are input to the kdq-tree module 104 to build a kdq tree. The threshold corresponding to the kdq tree is either computed with a bootstrap method based on the KL divergence or given directly by the user;
Step 4: the kdq tree built by the kdq-tree module 104 and its corresponding threshold are saved in the concept pool module 106;
Step 5: the concept detection module 105 obtains a new data block from the data stream division module 103 and detects whether the new data block represents a new concept. The detection result is given by comparing the KL divergence value between the original data block and the new data block with the threshold of the corresponding kdq tree saved in the concept pool module 106. Computing the KL divergence requires discretizing the original data block; the discretization result is obtained by passing the data block through the kdq tree;
Step 6: the concept detection module 105 uses the KL divergence to compare the kdq-tree representation of the new data block with the concepts saved in the concept pool 106, looking for a similar concept. If a similar concept is found, the statistics for that concept are updated in the concept statistics module 107. Otherwise the new data block represents a new concept, and its kdq-tree structure and kdq-tree threshold are stored in the concept pool.
Step 7: steps 1 to 6 are repeated until the data stream 101 ends. Once the data stream 101 has been fully processed, the information in the concept statistics module is aggregated by concept, and the transition probabilities between different concepts are computed.
Step 8: this information is input to the concept drift transition graph module 108, which uses Bayes' formula to draw the concept drift visualization graph, completing the concept drift visualization process.
The data stream collection module 102, the data stream division module 103, the kdq-tree module 104, the concept detection module 105, the concept pool module 106, the concept statistics module 107, and the data stream concept drift module 108 are all stored in the memory of a computer system.
Referring to Fig. 2, which shows a specific implementation of the data stream division module 103 of Fig. 1 dividing the data stream into blocks: the data stream division module 103 divides the data stream according to the order in which the data arrive at the data stream collection module 102, producing in sequence the first data block, the second data block, and so on up to the m-th data block. Each data block contains N records, and the value of N can be adjusted dynamically by the data stream division module 103 according to the characteristics of the data stream.
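The block division of Fig. 2 amounts to cutting the stream into consecutive chunks of N records in arrival order. A minimal sketch, assuming a fixed N and assuming that an incomplete trailing block is discarded (the source does not say how a partial final block is handled):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_into_blocks(stream: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield consecutive blocks of exactly n records, preserving arrival order."""
    if n <= 0:
        raise ValueError("n must be positive")
    iterator = iter(stream)
    while True:
        block = list(islice(iterator, n))
        if len(block) < n:
            return  # assumption: drop an incomplete trailing block
        yield block
```

Because the function is a generator over an iterator, it never needs the whole stream in memory, which suits data that arrive continuously.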
Fig. 3 shows the process by which the kdq-tree module 104 builds the kdq tree in step 3, as follows:
Step 3.1: select the first dimension of the data block as the current dimension and find the median value v in the current dimension. The median must split the samples in the data block into two subsets of equal size; that is, the number of records whose value in the current dimension is greater than v is approximately equal to the number of records whose value is less than or equal to v;
Step 3.2: within each resulting subset, search the subsequent dimensions for one that satisfies the split condition, use it as the current dimension, and repeat the median search to continue splitting the subset;
Step 3.3: repeat steps 3.1 and 3.2 until the termination condition is satisfied;
The current dimension satisfies the split condition when the difference between the maximum and minimum values of the data in that dimension is greater than ε, whose value is specified by the user;
The termination condition is: the size of the current data block is less than n_min, or the difference between the maximum and minimum values in every dimension is less than ε. When the termination condition is satisfied, a partition of the original data block has been obtained.
If the result of each split is taken as the left and right decision branches of the node for the current attribute value, a tree structure over the data splits is obtained, namely the kdq tree. A characteristic of the kdq tree is that its partition (discretization) of the original data block is approximately equal in size across cells. When the data stream environment is stable and no concept drift occurs, the discretization of the current data block is also approximately equal.
Referring to Fig. 4, in step 4 the kdq-tree module 104 uses the bootstrap method to determine the threshold corresponding to the kdq tree, comprising the following steps:
Step 4.1: draw N records from the original data block with replacement, that is, the drawn records are not deleted from the original data block. Each draw is completely random, and the drawn samples are mutually independent. Form a new data block from the drawn records;
Step 4.2: repeat the drawing process until the new data block contains m records, obtaining k new data blocks in total, and partition each new data block with the kdq tree to obtain its discretization result;
Step 4.3: compute the KL divergence value between the new data block and the original data block using the KL divergence formula, and add the result to a queue for sorting;
In this embodiment the KL divergence is used to judge the similarity between data blocks. It is computed with formulas (1) and (2):
[Formula (1): equation image not reproduced in the source text]
[Formula (2): equation image not reproduced in the source text]
In formula (1), $kl_1$ denotes the KL divergence between the data distributions of data block $C_a$ and data block $C_b$; $P_{C_a}(x)$ denotes the probability distribution of data block $C_a$ after discretization, and $P_{C_b}(x)$ that of data block $C_b$ after discretization; $w_{b,j}$ denotes the number of records of data block $C_b$ in the $j$-th interval after discretization, and $w_{a,j}$ the number of records of data block $C_a$ in the $j$-th interval after discretization; $N_b$ denotes the total number of records in data block $C_b$; $T$ denotes the total number of intervals obtained by discretizing a data block. The discretization result of a data block is obtained from the kdq tree.
In formula (2), $kl_2$ denotes the KL divergence between the labeled data distributions of data block $C_a$ and data block $C_b$; $P_{C_a}(Y|x)$ denotes the probability distribution of each label in data block $C_a$ after discretization, and $P_{C_b}(Y|x)$ that of the labels in data block $C_b$ after discretization; $w_{b,i,j}$ denotes the number of records of data block $C_b$ with label $i$ in the $j$-th interval after discretization, and $w_{a,i,j}$ the number of records of data block $C_a$ with label $i$ in the $j$-th interval after discretization; $N_b$ denotes the total number of records in data block $C_b$; $T$ denotes the total number of intervals obtained by discretizing a data block, and $|Y|$ denotes the number of distinct labels in the data. The discretization result of a data block is obtained from the kdq tree;
Step 4.4: repeat steps 4.1 to 4.3 a total of k times, where k is a constant set in advance by the user.
Step 4.5: sort the values in the queue by magnitude and take the value at the 1-α position (the (1-α) quantile) as the threshold. Here α denotes the confidence level for the occurrence of concept drift; α is greater than 0 and less than 1 and is specified in advance by the user. This yields the kdq-tree threshold for confidence level α: if the KL divergence value between a new data block and the original data block exceeds this threshold, concept drift has occurred in the new data block with probability 1-α.
When a new data block arrives, the concept detection module 105 performs concept drift detection on it as follows. The current data block is passed through the current kdq tree to obtain its discretization result under that tree. The KL divergence formula is then used to compute the KL divergence value between the current data block and the original data block, and this value is compared with the threshold corresponding to the kdq tree. If the computed KL divergence value is less than the threshold, no concept drift has occurred; otherwise a new concept has appeared.
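The images for formulas (1) and (2) are not available in this text. Purely as a hedged reconstruction from the symbol definitions given for them, and assuming the standard discrete and conditional KL divergences taken from $C_b$ to $C_a$, the two quantities would take the following form. The symbol $N_a$ (total number of records in $C_a$) does not appear in the source and is introduced here as an assumption; it equals $N_b$ when all blocks contain N records. The direction of the divergence is likewise an assumption.

```latex
% Hedged reconstruction; N_a and the direction KL(C_b || C_a) are assumptions.
kl_1 = \sum_{j=1}^{T} P_{C_b}(j)\,\log\frac{P_{C_b}(j)}{P_{C_a}(j)},
\qquad P_{C_b}(j) = \frac{w_{b,j}}{N_b},
\quad  P_{C_a}(j) = \frac{w_{a,j}}{N_a}

kl_2 = \sum_{j=1}^{T}\sum_{i=1}^{|Y|} \frac{w_{b,i,j}}{N_b}\,
       \log\frac{w_{b,i,j}/w_{b,j}}{w_{a,i,j}/w_{a,j}}
```

In the second expression, $w_{b,i,j}/w_{b,j}$ is the fraction of records in cell $j$ of $C_b$ that carry label $i$, so $kl_2$ compares the label distributions cell by cell, weighted by how much of $C_b$ falls in each cell.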
In step 4.2, partitioning the new data block with the kdq tree to obtain the discretization result comprises the following steps:
Step 4.2.1: route each record of the new data block according to the decision value at each node of the kdq tree. If the record's value in the node's dimension is less than or equal to the node's decision value, the record enters the left subtree of that node; otherwise it enters the right subtree;
Step 4.2.2: repeat this decision process until every record in the data block has reached a leaf node of the kdq tree, which yields the partition of the data block;
Step 4.2.3: from the kdq-tree partition of the current data block, divide the number of samples in each cell by the total number of samples in the data block to obtain the probability distributions of the current data block over the kdq-tree discretization, $P_C(x)$ and $P_C(Y|x)$.
Referring to Fig. 5, which shows an application example of the overall concept drift detection process: first the data stream 101 is divided by the data stream division module 103 into different data blocks, for example A, B, and so on. At this point the concept that each data block represents is unknown. Each data block is input to the kdq-tree module 104 to be partitioned, and the partitioned data block is then compared with the concepts saved in the concept pool module 106. The comparison uses the KL divergence, with the similarity threshold chosen by the bootstrap method. Each new data block is compared with every data block saved in the concept pool module 106 to decide whether it represents a new concept.
In Fig. 5, a data block representing concept A is compared with the concepts in the concept pool 106. Concept A is found to exist in the concept pool already, so the statistics for concept A are updated in the concept statistics module 107. If no similar concept is found, for example when a data block contains concept E and the concept pool 106 holds no related concept, concept E is added to the concept pool 106 and also to the concept statistics module 107, in preparation for subsequent statistics.
The above process is repeated. The number of repetitions depends on the length of the data stream; the process continues until the entire data stream has been processed or a preset termination parameter is reached.
Referring to Fig. 6, which shows an example of the final concept drift visualization graph output by the concept drift module 108 of the present invention: circles represent different concepts and are labeled C1, C2, and so on. The edges between concepts represent transitions between different concepts, and the weight on an edge is the transition probability between the two concepts. Concept drift takes two forms, self-transition and outward transition. A self-transition means that adjacent data blocks share the same concept, that is, the concept has not changed and transitions to itself. An outward transition means that a new concept has appeared, so the old concept transitions to the new one. In the example, 5000 samples are used as the visualization interval, so one concept drift visualization graph is output for every 5000 samples. In each subgraph, a double circle marks the current concept, that is, the most recent concept at the cut-off point.
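A graph of the kind shown in Fig. 6 can be produced by emitting Graphviz DOT text from the transition probabilities. The sketch below is one possible rendering, not the patent's drawing procedure: the function name, the nested-dictionary input layout, and the choice of DOT as the output format are assumptions. The current concept is drawn with a double circle, as described above.

```python
from typing import Dict, Optional


def to_dot(probs: Dict[str, Dict[str, float]], current: Optional[str] = None) -> str:
    """Render transition probabilities as a Graphviz DOT digraph string."""
    lines = ["digraph concept_drift {", "  node [shape=circle];"]
    if current is not None:
        # Mark the most recent concept with a double circle.
        lines.append(f'  "{current}" [shape=doublecircle];')
    for src in sorted(probs):
        for dst in sorted(probs[src]):
            weight = probs[src][dst]
            # A self-loop (src == dst) is a self-transition: the concept did not change.
            lines.append(f'  "{src}" -> "{dst}" [label="{weight:.2f}"];')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command-line tool, for example to produce one image per 5000-sample interval.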
The above is the preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions falling under the idea of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention should also be regarded as within its protection scope.

Claims (8)

1. A data stream concept drift visualization method in a dynamic data environment, comprising the following steps:
Step 1: a dynamic data stream collection module (102) collects data in chronological order from a massive real-time data stream (101);
Step 2: a data stream partitioning module (103) reads the data stream data from step 1 and partitions the data stream according to the order in which the data arrive; each data block obtained by the data stream partitioning module (103) contains N records, where N is a fixed value set in advance by the user;
Step 3: the static data blocks obtained after partitioning by the data stream partitioning module (103) are input to a kdq-tree module (104) to build a kdq tree; the threshold corresponding to the kdq tree is either computed by a bootstrap method based on the KL divergence or given directly by the user;
Step 4: the kdq tree built by the kdq-tree module (104) and the threshold corresponding to that kdq tree are stored in a concept pool (106);
Step 5: a concept detection module (105) obtains a new data block from the data stream partitioning module (103) and detects whether the new data block is a new concept; the detection result of the concept detection module (105) is given by comparing the KL divergence value between the original data block and the new data block with the threshold of the corresponding kdq tree stored in the concept pool (106); the original data block must be discretized when the KL divergence is computed, and the discretization result is obtained by passing the data block through the kdq tree;
Step 6: when the data stream partitioning module (103) obtains a new data block, the data block is compared with the concepts stored in the concept pool (106); if a similar concept is found, the concept counting module (107) is updated; otherwise the data block is added to the concept pool (106) as a new concept;
Step 7: steps 1 to 6 are repeated until the data stream ends; the statistics in the concept counting module (107) are then aggregated to compute the statistics of each concept in the concept pool (106);
Step 8: the above statistics are input to a concept drift module (108), which uses the Bayesian formula to construct a concept transition graph, completing the concept drift visualization process.
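The partitioning in step 2 of claim 1 can be illustrated with a short sketch. This is an assumption-laden illustration, not the patented implementation: the function name `split_into_blocks` is hypothetical, and yielding a final block with fewer than N records is a choice made here because the claim does not say how a trailing partial block is handled.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

R = TypeVar("R")


def split_into_blocks(stream: Iterable[R], n: int) -> Iterator[List[R]]:
    """Yield consecutive blocks of n records, preserving arrival order."""
    if n <= 0:
        raise ValueError("n must be positive")
    it = iter(stream)
    while True:
        block = list(islice(it, n))  # take the next n records in arrival order
        if not block:
            return
        # Assumption: a shorter final block is still emitted.
        yield block


blocks = list(split_into_blocks(range(10), 4))
print(blocks)
```

Because the function consumes an iterator lazily, it also works on an unbounded stream, one block at a time.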
2. The data stream classification method in a dynamic data environment according to claim 1, characterized in that building the kdq tree in said step 3 comprises the following sub-steps:
Step 3.1: first, the first attribute of the data block is selected as the current attribute, and the median value v in the current dimension is found; the data block is split so that the two resulting subsets contain approximately equal numbers of samples, i.e., the number of records whose current-attribute value is greater than v is approximately equal to the number of records whose current-attribute value is less than or equal to v;
Step 3.2: in each resulting subset, an attribute that satisfies the split condition is sought among the subsequent attributes and selected as the current attribute; the median search is repeated and the data subset is split further;
Step 3.3: the above process is repeated until the termination condition is satisfied;
The split condition is: the difference between the maximum and minimum values of the data in the current dimension is greater than ε, where ε is specified by the user;
The termination condition is: the size of the current data block is less than n_min, or the difference between the maximum and minimum values in every dimension is less than ε, where n_min is specified in advance by the user.
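The sub-steps of claim 2 can be sketched as a recursive median split. This is a minimal sketch under assumptions, not the patent's code: the names `Node`, `build_kdq`, and `count_leaves` are hypothetical; "subsequent attributes" is interpreted here as cycling through the dimensions starting from the one after the last split; and the fallback to the midpoint when the median equals the maximum is an addition made here so that both subsets are never empty.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Sequence


@dataclass
class Node:
    dim: Optional[int] = None       # split attribute; None marks a leaf
    value: Optional[float] = None   # split value v
    left: Optional["Node"] = None   # records with x[dim] <= v
    right: Optional["Node"] = None  # records with x[dim] > v


def build_kdq(data: List[Sequence[float]], eps: float, n_min: int,
              start_dim: int = 0) -> Node:
    if eps < 0 or n_min < 1:
        raise ValueError("eps must be >= 0 and n_min >= 1")
    # Termination condition: block smaller than n_min.
    if len(data) < n_min:
        return Node()
    d = len(data[0])
    # Look for the next attribute whose range exceeds eps (split condition).
    for offset in range(d):
        dim = (start_dim + offset) % d
        col = [row[dim] for row in data]
        lo, hi = min(col), max(col)
        if hi - lo > eps:
            break
    else:
        # Termination condition: no dimension has a range above eps.
        return Node()
    v = median(col)
    if v >= hi:
        # Assumption: fall back to the midpoint so the right subset is non-empty.
        v = (lo + hi) / 2
    left = [r for r in data if r[dim] <= v]
    right = [r for r in data if r[dim] > v]
    nxt = (dim + 1) % d
    return Node(dim, v, build_kdq(left, eps, n_min, nxt),
                build_kdq(right, eps, n_min, nxt))


def count_leaves(node: Node) -> int:
    if node.dim is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
root = build_kdq(points, eps=0.5, n_min=2)
print(root.dim, root.value, count_leaves(root))
```

Each leaf of the resulting tree corresponds to one discretization interval, so the number of leaves plays the role of T in the KL divergence definitions.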
3. The data stream concept drift visualization method in a dynamic environment according to claim 1, characterized in that, in said step 4, the kdq-tree module (104) uses the bootstrap method to determine the threshold corresponding to the kdq tree, comprising the following steps:
Step 4.1: N records are drawn from the original data block by sampling with replacement, i.e., the drawn records are not deleted from the original data block, and the drawn records form a new data block;
Step 4.2: the new data block is partitioned using the kdq tree to obtain its discretization result;
Step 4.3: the KL divergence value between the new data block and the original data block is computed according to the KL divergence formulas, and the result is added to a queue;
The KL divergence formulas are:
[Formula (1): definition of kl_1; shown as an image in the original and not reproduced in this text]
[Formula (2): definition of kl_2; shown as an image in the original and not reproduced in this text]
In formula (1), kl_1 denotes the KL divergence between the data distributions of data block C_a and data block C_b; P_{C_a}(x) denotes the probability distribution of data block C_a after discretization, and P_{C_b}(x) denotes the probability distribution of data block C_b after discretization; w_{b,j} denotes the number of records of data block C_b in the j-th interval after discretization, and w_{a,j} denotes the number of records of data block C_a in the j-th interval after discretization; N_b denotes the total number of records in data block C_b; T denotes the total number of intervals obtained by discretizing the data block; the discretization result of the data block is obtained from the kdq tree;
In formula (2), kl_2 denotes the KL divergence between the labeled data distributions of data block C_a and data block C_b; P_{C_a}(Y|x) denotes the probability distribution of each label of data block C_a after discretization, and P_{C_b}(Y|x) denotes the probability distribution of the labels of data block C_b after discretization; w_{b,i,j} denotes the number of records of data block C_b with label i in the j-th interval after discretization, and w_{a,i,j} denotes the number of records of data block C_a with label i in the j-th interval after discretization; N_b denotes the total number of records in data block C_b; T denotes the total number of intervals obtained by discretizing the data block; |Y| denotes the number of distinct labels in the data; the discretization result of the data block is obtained from the kdq tree;
Step 4.4: steps 4.1 to 4.3 are repeated k times in total, where k is a constant set in advance by the user;
Step 4.5: the values in the queue are sorted by magnitude, and the value ranked at 1-α (the (1-α) quantile) is taken as the threshold;
where α denotes the confidence level for the occurrence of concept drift, with 0 < α < 1, and is specified in advance by the user.
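Steps 4.1 to 4.5 can be sketched as follows. Because formula (1) is not available in this text, the sketch assumes the standard discrete KL divergence over per-interval counts, sum over j of p_j * log(p_j / q_j), with an additive smoothing constant of 0.5 so that empty intervals do not produce a division by zero; that smoothing, the direction of the divergence (resample against original), and the names `kl_divergence`, `bootstrap_threshold`, and the `discretize` callback are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence


def kl_divergence(counts_a: Sequence[int], counts_b: Sequence[int],
                  smooth: float = 0.5) -> float:
    """Discrete KL divergence between two per-interval count vectors."""
    if len(counts_a) != len(counts_b):
        raise ValueError("count vectors must have the same number of intervals")
    t = len(counts_a)
    n_a = sum(counts_a) + smooth * t
    n_b = sum(counts_b) + smooth * t
    total = 0.0
    for w_a, w_b in zip(counts_a, counts_b):
        p = (w_a + smooth) / n_a  # smoothed share of block a in this interval
        q = (w_b + smooth) / n_b  # smoothed share of block b in this interval
        total += p * math.log(p / q)
    return total


def bootstrap_threshold(block: Sequence[float],
                        discretize: Callable[[Sequence[float]], List[int]],
                        k: int, alpha: float, seed: int = 0) -> float:
    """Estimate the drift threshold as the (1 - alpha) quantile of bootstrap KL values."""
    if k < 1 or not 0 < alpha < 1:
        raise ValueError("need k >= 1 and 0 < alpha < 1")
    rng = random.Random(seed)
    base = discretize(block)
    n = len(block)
    values = []
    for _ in range(k):
        # Step 4.1: draw n records with replacement.
        resample = [block[rng.randrange(n)] for _ in range(n)]
        # Steps 4.2 and 4.3: discretize the resample and record its KL value.
        values.append(kl_divergence(discretize(resample), base))
    values.sort()  # step 4.5
    idx = max(0, min(k - 1, math.ceil((1 - alpha) * k) - 1))
    return values[idx]


def two_bins(blk: Sequence[float]) -> List[int]:
    # Stand-in for a kdq-tree discretization: two fixed intervals.
    return [sum(x <= 0.5 for x in blk), sum(x > 0.5 for x in blk)]


block = [i / 20 for i in range(20)]
threshold = bootstrap_threshold(block, two_bins, k=200, alpha=0.05)
print(threshold)
```

In a full pipeline the `discretize` callback would route records through the kdq tree built for the block rather than through fixed bins.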
4. The data stream concept drift visualization method in a dynamic environment according to claim 3, characterized in that said step 4.2, in which the new data block is partitioned using the kdq tree to obtain its discretization result, comprises the following steps:
Step 4.2.1: the data in the new data block are divided according to the decision value of each node of the kdq tree; if the value of a record in the specified dimension is less than or equal to the decision value of the node in the kdq tree, the record enters the left subtree of that node, otherwise it enters the right subtree of that node;
Step 4.2.2: the decision process is repeated until all data in the data block have been assigned to leaf nodes of the kdq tree, which yields the partition result for this data block;
Step 4.2.3: for the kdq-tree partition result of the current data block, the number of samples in each partition region is divided by the total number of samples in the data block, giving the probability distributions of the current data block over the kdq-tree discretization result [the two distribution symbols are shown as images in the original and are not reproduced in this text].
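The routing and normalization in steps 4.2.1 to 4.2.3 can be sketched as below. The `Node` layout, the left-to-right leaf ordering, and the function names are assumptions introduced for illustration; the tree in the example is built by hand rather than by the construction procedure of claim 2.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    dim: Optional[int] = None       # decision dimension; None marks a leaf
    value: Optional[float] = None   # decision value of the node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaves(node: Node) -> List[Node]:
    """Return the leaves in left-to-right order."""
    if node.dim is None:
        return [node]
    return leaves(node.left) + leaves(node.right)


def discretize(tree: Node, block: Sequence[Sequence[float]]) -> List[float]:
    """Share of the block's records that falls into each leaf region."""
    if not block:
        raise ValueError("block must not be empty")
    leaf_list = leaves(tree)
    index = {id(leaf): j for j, leaf in enumerate(leaf_list)}
    counts = [0] * len(leaf_list)
    for record in block:
        node = tree
        # Steps 4.2.1 and 4.2.2: descend until a leaf is reached.
        while node.dim is not None:
            node = node.left if record[node.dim] <= node.value else node.right
        counts[index[id(node)]] += 1
    # Step 4.2.3: divide each region's count by the block size.
    return [c / len(block) for c in counts]


tree = Node(0, 0.5, Node(), Node(1, 0.5, Node(), Node()))
block = [(0.2, 0.9), (0.7, 0.1), (0.9, 0.8), (0.6, 0.7)]
probs = discretize(tree, block)
print(probs)
```

The returned list has one entry per leaf, so two blocks discretized by the same tree yield directly comparable distributions.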
5. The data stream concept drift visualization method in a dynamic environment according to claim 1, characterized in that the detection of concept drift by the concept detection module (105) in said step 5 comprises the following sub-steps:
Step 5.1: the data block is passed through the kdq tree to obtain the kdq tree's discretization result for the data block;
Step 5.2: using the KL divergence formula, the KL divergence value is computed between the kdq-tree discretization result of the new data block and the kdq-tree discretization result of each concept stored in the concept pool module (106);
Step 5.3: the above KL divergence value is compared with the threshold corresponding to the kdq tree; if the computed KL divergence value is less than the threshold corresponding to the kdq tree, no concept drift has occurred; otherwise concept drift has occurred.
6. The data stream concept drift visualization method in a dynamic environment according to claim 1, characterized in that said step 6 comprises the following sub-steps:
Step 6.1: the KL divergence value is computed between the kdq-tree discretization of the new data block and each concept stored in the concept pool module (106);
Step 6.2: the resulting KL divergence values are sorted; if the smallest KL divergence value is still greater than the threshold set by the concept detection module, the concept pool module (106) contains no concept corresponding to this data block, i.e., the data block represents a new concept, and the kdq-tree discretization result of the data block is stored in the concept pool module (106) as a new concept; if a concept satisfying the threshold is found, the statistics for that concept are updated in the concept counting module (107).
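The matching logic of steps 6.1 and 6.2 (together with the threshold comparison of claim 5) can be sketched as a single function over a concept pool. Everything here is a hedged illustration: the `Concept` record, the `match_or_add` and `make_concept` names, the smoothed KL divergence, the fixed two-interval discretization, and the fixed threshold of 0.1 are assumptions, and choosing the concept with the smallest divergence among those under their own threshold is one reading of the claim.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Concept:
    discretize: Callable[[Sequence[float]], List[int]]  # block -> per-interval counts
    ref_counts: List[int]  # counts of the block that founded this concept
    threshold: float       # drift threshold stored with the concept
    hits: int = 1          # number of blocks assigned to this concept


def kl(counts_a: Sequence[int], counts_b: Sequence[int], smooth: float = 0.5) -> float:
    t = len(counts_a)
    n_a, n_b = sum(counts_a) + smooth * t, sum(counts_b) + smooth * t
    return sum(((a + smooth) / n_a) * math.log(((a + smooth) / n_a) / ((b + smooth) / n_b))
               for a, b in zip(counts_a, counts_b))


def match_or_add(pool: List[Concept], block: Sequence[float],
                 make_concept: Callable[[Sequence[float]], Concept]) -> int:
    """Return the index of the matched concept, adding a new one if none matches."""
    best, best_kl = None, math.inf
    for idx, concept in enumerate(pool):
        d = kl(concept.discretize(block), concept.ref_counts)  # step 6.1
        if d < concept.threshold and d < best_kl:
            best, best_kl = idx, d
    if best is None:
        # Step 6.2: no stored concept is close enough, so store a new one.
        pool.append(make_concept(block))
        return len(pool) - 1
    pool[best].hits += 1  # update the statistics of the matched concept
    return best


def bins(blk: Sequence[float]) -> List[int]:
    return [sum(x <= 0.5 for x in blk), sum(x > 0.5 for x in blk)]


def make(blk: Sequence[float]) -> Concept:
    return Concept(bins, bins(blk), threshold=0.1)


low = [0.1] * 8 + [0.9] * 2
high = [0.1] * 2 + [0.9] * 8
pool: List[Concept] = []
sequence = [match_or_add(pool, b, make) for b in (low, low, high, low)]
print(sequence, [c.hits for c in pool])
```

The returned index sequence records which concept each block was assigned to, which is the input needed to count transitions between concepts.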
7. The data stream classification method in a dynamic data environment according to claim 1, characterized in that said step 8 comprises the following sub-steps:
Step 8.1: the counting information stored in the concept counting module (107) for each concept in the concept pool module (106) is aggregated;
Step 8.2: the transition probabilities between different concepts are computed using the Bayesian formula;
where P(C_i) denotes the probability that the i-th concept occurs, P(C_j) denotes the probability that the j-th concept occurs, and P(C_j | C_i) denotes the probability that the j-th concept occurs given that the i-th concept has occurred;
Step 8.3: the computed transition probabilities between the concepts are aggregated and input to the data stream concept drift module (108), which draws the data stream concept drift graph, completing the visualization process.
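Step 8.2 can be illustrated by estimating P(C_j | C_i) from the sequence of concepts assigned to consecutive data blocks. The Bayesian formula itself does not appear in this text, so the sketch assumes the usual conditional-probability estimate, the count of transitions from i to j divided by the count of all transitions leaving i; the function name `transition_probabilities` and the nested-dict result are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def transition_probabilities(sequence: Sequence[int]) -> Dict[int, Dict[int, float]]:
    """Estimate P(C_j | C_i) from consecutive concept assignments."""
    # Count each adjacent pair (i, j): block t has concept i, block t+1 has concept j.
    pair_counts = Counter(zip(sequence, sequence[1:]))
    # Count how often each concept appears as the starting point of a transition.
    from_counts = Counter(sequence[:-1])
    probs: Dict[int, Dict[int, float]] = {}
    for (i, j), c in pair_counts.items():
        probs.setdefault(i, {})[j] = c / from_counts[i]
    return probs


probs = transition_probabilities([0, 0, 1, 0, 0, 1])
print(probs)
```

Entries with i equal to j correspond to self-transitions and the remaining entries to outward transitions; each row sums to 1.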
8. The data stream classification method in a dynamic data environment according to claim 1, characterized in that said data stream (101) comprises data from network intrusion monitoring, network security monitoring, sensor data monitoring, and power grid supply.
CN201310052088.7A 2013-02-18 2013-02-18 Data flow concept drift method for visualizing under a kind of dynamic data environment Expired - Fee Related CN103150470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310052088.7A CN103150470B (en) 2013-02-18 2013-02-18 Data flow concept drift method for visualizing under a kind of dynamic data environment

Publications (2)

Publication Number Publication Date
CN103150470A true CN103150470A (en) 2013-06-12
CN103150470B CN103150470B (en) 2015-12-23

Family

ID=48548544

Country Status (1)

Country Link
CN (1) CN103150470B (en)

Also Published As

Publication number Publication date
CN103150470B (en) 2015-12-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151223

Termination date: 20190218