CN101127037A - Periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering - Google Patents


Info

Publication number
CN101127037A
Authority
CN
China
Prior art keywords
tree
cycle
time
algorithm
support
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006100529523A
Other languages
Chinese (zh)
Inventor
曾斌
曾凯
姜小丽
王宇熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LINAN MICROGRID INFORMATION ENGINEERING Co Ltd
Original Assignee
LINAN MICROGRID INFORMATION ENGINEERING Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LINAN MICROGRID INFORMATION ENGINEERING Co Ltd filed Critical LINAN MICROGRID INFORMATION ENGINEERING Co Ltd
Priority to CNA2006100529523A priority Critical patent/CN101127037A/en
Publication of CN101127037A publication Critical patent/CN101127037A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a periodic association rule discovery algorithm based on clustering time-series vectors with the difference sequence method. First, to address the weakness of existing periodic association rule discovery algorithms in dividing a cycle into time segments, an algorithm called CMDSA is proposed. It clusters time-series vectors composed of item supports, taken as the feature points of the time-domain data, and controls the number of clusters with the DB Index criterion to reach the best clustering result. Each time segment of the periodic association rules is thereby identified more accurately, and more useful periodic association rules are found than with existing algorithms. Second, because existing periodic association rule algorithms are all based on the Apriori algorithm and are inefficient, a CFP-tree algorithm based on the FP-tree is proposed. CFP-tree applies a periodic pruning technique based on conditional FP-trees to improve efficiency. Periodic association rule discovery with CFP-tree is therefore far better in time and space efficiency than the existing Apriori-based algorithms.

Description

Periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering
1 Technical field
The present invention relates to an algorithm for discovering periodic association rules in time series in the field of data mining. Specifically, it concerns periodic temporal association rules among attribute states under temporal constraints, and applies to finding time-dependent associations that recur periodically among the states of a finite set of attributes. Temporal association rules are defined for equivalent event mappings, between different attributes, and within the same attribute; the rules are extracted by computing support and confidence. The main steps of the algorithm that mines periodic association rules from time series while establishing the validity of the temporal association rules are given.
2 Background
Changes in the real world are inseparable from time. Periodic association rules in temporal data therefore help people make correct decisions, and studying them in real-world data is of great and lasting value for discovering periodic economic patterns and for fields such as forecasting and disaster prevention.
Research on periodic association rules within time segments is still at an early stage at home and abroad. For example, Ouyang Weimin [1] proposed the discovery of association rules with temporal constraints, but did not discuss periodic association rules. In "Cyclic Association Rules" by Ozden B. [2], the time segments of periodic association rules are determined manually: the user gives a time unit and a cycle length (an integer multiple of the time unit) based on experience, the data is divided into time segments of equal length, and periodic association rules are then derived from the transactions in those segments. This often makes the segmentation very inaccurate and is likely to miss some periodic association rules. For example, let the time unit be 1 hour and the cycle length 24 hours. For the periodic association rule milk → bread (7 AM to 8 AM), if the rule milk → bread mainly occurs in 6:45 AM to 7:20 AM, the rule may not hold for 7 AM to 8 AM. The association rule "customers who buy milk between 6:45 AM and 7:20 AM every day also buy bread" then cannot be found. "Study of Frequent Cyclic Association Rule" by Huang Yimin [3] is mainly an improvement of the algorithm in Ozden B.'s "Cyclic Association Rules".
The main problems of existing periodic association rule discovery algorithms are as follows.
Problem 1: choosing the feature points of the time-domain data
To address the problem in Ozden B.'s "Cyclic Association Rules" [2], Xu Min proposed a new periodic association rule discovery model [4][5] that uses cluster analysis to divide a cycle into time segments of different lengths, so that periodic association rules can be found more accurately. However, this model takes the number of transactions occurring at each time instant as the feature point of the time-domain data. The clustering is driven by transactions, whereas each item has its own distribution, so this way of clustering cannot reflect the pattern of an individual item. An example illustrates the problem. Suppose that, over some span of time, the situation from 0:00 to 14:00 every day is as shown in Fig. 1. Clustering by the number of transactions at each instant, as in [4][5], groups instants 1-5 into one class and instants 6-12 into another. Clustering by the item count at each instant (the number of transactions containing item A) groups instants 3-8 into one class. The supports of item A are then:
Segment 3-8: 15*5 / (25*2 + (25+20)/2 + 20*2) ≈ 66%
Segment 1-5: (2.5 + (2.5+15)/2 + 15*2) / (25*4) ≈ 41.3%
Segment 6-12: (15*2 + (15+2.5)/2 + 2.5*3) / (20*6) ≈ 38.6%
If the minimum support is 54%, the clustering method of [4][5] cannot find that item A is frequent, whereas clustering by the item count at each instant can.
Although clustering by the item count at each instant, as in [21], solves the problem of [4][5], it has a problem of its own. It considers only the number of occurrences of an item at each instant and ignores the total number of transactions at that instant. Two instants may have the same item count but different total transaction counts, and therefore different item supports. Only the item support at each instant truly reflects, and decides, whether the item is frequent. Clustering by item count thus looks at the item count one-sidedly and cannot reflect the inherent pattern of the item support at each instant. The example of Fig. 1 illustrates this as well. Clustering by item count groups instants 3-8 into one class; if the minimum support is raised to 72%, item A is not frequent in 3-8. If instead the clustering uses the ratio of the number of transactions containing the item at each instant to the total number of transactions at that instant (the support of item A at each instant), instants 6-8 are grouped into one class in which the support of item A is approximately 75%, so item A is frequent in segment 6-8. Clustering by the item support at each instant therefore finds the frequent item A in segment 6-8, whereas clustering by the item count at each instant misses it.
The present invention therefore adopts the CMDSA algorithm, which clusters by the item supports at each instant. Of course, more than the single item A of the example occurs at each instant; many items occur. At each instant the support of each item is taken as one component of a vector, so the supports of all items at that instant form a time-series vector, and these time-series vectors are what is clustered.
Problem 2: determining the number of time segments in a cycle
[2][4][5] have a further problem: the number of time segments into which a cycle is divided is determined manually. In the periodic association rule model of [4][5], the lengths of the time segments are obtained automatically by clustering on the concentration of transaction counts, but that clustering is the best one the Fisher algorithm can find for a number of clusters fixed manually in advance. This ignores another indicator of clustering quality, the number of clusters, which should be judged from the actual data rather than prescribed manually. Only when a cluster validity function [11][12][13] determines the number of clusters from the actual data can the clustering reflect how the real data varies and reach the best clustering result. The present invention uses the DB Index criterion [9][19] to judge cluster validity and determine the best number of clusters.
Problem 3: choosing the base algorithm for discovering periodic association rules
The periodic association rule algorithms of [2][3][4][5][21] are all based on the Apriori algorithm [6]. The temporally constrained association rule discovery algorithm of [1] and the partial periodic pattern mining algorithm of [18] are also based on the Apriori algorithm [6]. They share its problems: a very large set of candidates to process, a great deal of time spent matching patterns against database transactions, high resource consumption, and low running efficiency. The present invention therefore replaces Apriori with the FP-tree.
3 Summary of the invention
The present invention proposes a periodic association rule discovery algorithm based on clustering time-series vectors with the difference sequence method [7] (CARDSATSV). It consists of two parts: CMDSA and CFP-tree. First, CMDSA clusters the time-series vectors formed from item supports, using the difference sequence method [7] and the DB Index criterion [9][19], to determine dynamically each time segment of the association rules within a cycle; the DB Index criterion [9][19] controls the number of clusters so as to reach the best clustering result. CMDSA addresses Problem 1 and Problem 2 and finds as many useful association rules as possible. Then, to address Problem 3, CFP-tree discovers periodic association rules in the transaction database of each time segment of the cycle with an FP-tree based method [8][16]. Exploiting the clear advantage of the FP-tree over the Apriori algorithm, CFP-tree applies a periodic pruning technique based on conditional FP-trees to raise efficiency substantially. Theory and experiments indicate that the FP-tree based periodic association rule algorithm CFP-tree is far superior in time and space efficiency to periodic association rule algorithms based on Apriori.
4 Description of drawings
Fig. 1 Example used in the case study of periodic association rules
Fig. 2 Generation tree of the frequent itemsets in time segment [sj, ej] of the first cycle
Fig. 3 Generation tree of the periodic frequent itemsets in time segment [sj, ej]
Fig. 4 Comparison of the number of useful periodic frequent itemsets found by CARDSATSV and by the periodic association rule model of [4][5]
Fig. 5 Comparison of the running times of approach 2 and approach 3 for T = 30 days
Fig. 6 Comparison of the running times of approach 2 and approach 3 for T = 30 days
Fig. 7 Comparison of the running times of CARDSATSV and the Apriori-based periodic association rule algorithm under different minimum supports for T = 30 days
5 Embodiments
(1) Basic concepts and properties of time-domain data
Definition 1 (time-domain data). Time-domain data is a set of transactions that carry a time attribute. Let the time range of the whole transaction set be $T$. $T$ can be written as $T = \bigcup_i T_i$ with $T_i \cap T_j = \emptyset$ and $|T_i| = |T_j|$ for $i \ne j$; $i, j = 1, 2, \dots, n$, where $|T_i|$ denotes the time length of $T_i$. Here $|T_i|$ is called the cycle length and $T_i$ the $i$-th cycle. The length $|T_i|$ is user-defined, for example one year, one month, or one week. The goal of the invention is to find the associations among frequent items within certain time segments of all cycles $T_i$.
Definition 2 (time-series vector). Consider a series of observations obtained in chronological order, each observation being an $n$-dimensional vector. These vectors with a time attribute are called time-series vectors.
A sequence of time-series vectors is itself a kind of time series (for time-series concepts see [17]). The notation for time-series vectors is as follows:
1) A time-series vector sequence composed of time-series vectors is written $E = \{e_i \mid i = 1, \dots, m\}$.
2) Each time-series vector $e_i = \langle x_1, \dots, x_n \rangle$ is an $n$-dimensional vector. In the invention, the supports of the items occurring at each instant form the components of $e_i$, and every component takes a real value in the closed interval $[0, 1]$.
3) Lend(E) = $m$ denotes the length of the time-series vector sequence, i.e. the number of vectors in $E$.
4) Time(E) denotes the time span covered by $E$, i.e. the interval between the instant of the first element and the instant of the last element of $E$.
5) In the invention, the time interval between any two adjacent time-series vectors in $E$ is equal. Granularity(E), called the time granularity, denotes this interval between two adjacent elements of $E$.
6) $E_s$ denotes a subsequence of $E$. If two subsequences $E_{s1}$ and $E_{s2}$ of $E$ have no element in common, $E_{s1}$ and $E_{s2}$ are said to be non-overlapping.
Definition 3 (periodic itemset). A periodic itemset is an itemset whose definition is extended with a periodicity property. A periodic itemset for itemset $X$ is written $X[C, s_i, e_i, s_X]$, where $C$ is the cycle length, $s_i$ and $e_i$ are the start and end of the $i$-th time segment in the cycle, and $s_X$ is the periodic support of $X[C, s_i, e_i, s_X]$ in the $i$-th time segment $[s_i, e_i]$ over all cycles. Specifically, let $r_{ti}$ denote the ratio of the number of transactions containing $X$ in the $i$-th time segment $[s_i, e_i]$ of the $t$-th cycle to the number of all transactions in that segment, i.e. the support of $X$ in $[s_i, e_i]$ in cycle $t$. Over the $i$-th time segment $[s_i, e_i]$ of all $n$ cycles, $s_X = \min\{r_{1i}, r_{2i}, \dots, r_{ni}\}$.
Definition 4 (periodic frequent itemset). If, over the $i$-th time segment $[s_i, e_i]$ of all $n$ cycles, the periodic itemset $X[C, s_i, e_i, s_X]$ satisfies $s_X \ge s_{min}$, then $X[C, s_i, e_i, s_X]$ is called a periodic frequent itemset, where $s_{min}$ is the minimum support threshold.
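Definitions 3 and 4 can be illustrated with a minimal Python sketch. The function names and the input layout (one pair of counts per cycle for the same time segment) are assumptions made for illustration, not part of the patent; every segment is assumed to contain at least one transaction.

```python
def periodic_support(counts_per_cycle):
    """Periodic support s_X of an itemset X in one time segment [s_i, e_i].

    counts_per_cycle holds one (transactions_containing_X, total_transactions)
    pair per cycle; totals are assumed to be positive.
    """
    # r_ti: support of X in the segment during cycle t
    ratios = [with_x / total for with_x, total in counts_per_cycle]
    # Definition 3: s_X is the minimum support over all cycles
    return min(ratios)


def is_periodic_frequent(counts_per_cycle, s_min):
    """Definition 4: X is periodic frequent when s_X >= s_min."""
    return periodic_support(counts_per_cycle) >= s_min


# Three cycles with supports 0.75, 0.60 and 0.75 in the same segment
counts = [(30, 40), (24, 40), (33, 44)]
print(periodic_support(counts))
print(is_periodic_frequent(counts, 0.54), is_periodic_frequent(counts, 0.72))
```

Because the minimum is taken over the cycles, one cycle with low support is enough to stop an itemset from being periodic frequent.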
Definition 5 (general periodic frequent itemset). Let the time range of the whole transaction set $T \to \infty$, so that the number of cycles $n \to \infty$. If the periodic support of the periodic itemset $X[C, s_i, e_i, s_X]$ in the time segment $[s_i, e_i]$ of every cycle satisfies $s_X \ge s_{min}$, then $X[C, s_i, e_i, s_X]$ is called a general periodic frequent itemset.
Definition 6 (periodic association rule). A periodic association rule is an implication of the form $X \to Y[C, s_i, e_i, s_{X \to Y}, c_{X \to Y}]$, where $X$ and $Y$ are periodic itemsets, $C$ is the cycle length, $s_i$ and $e_i$ are the start and end of the $i$-th time segment in the cycle, and $s_{X \to Y}$ and $c_{X \to Y}$ are the periodic support and periodic confidence of the rule in the $i$-th time segment $[s_i, e_i]$ over all cycles. Let $s_{ti}$ denote the ratio of the number of transactions containing $X \cup Y$ in the $i$-th time segment $[s_i, e_i]$ of the $t$-th cycle to the number of all transactions in that segment, and let $c_{ti}$ denote the ratio of the number of transactions containing $X \cup Y$ in that segment to the number of transactions containing $X$. Over the $i$-th time segment $[s_i, e_i]$ of all $n$ cycles, $s_{X \to Y} = \min\{s_{1i}, s_{2i}, \dots, s_{ni}\}$ and $c_{X \to Y} = \min\{c_{1i}, c_{2i}, \dots, c_{ni}\}$.
Definition 7 (strong and weak periodic association rules). For a periodic association rule $X \to Y[C, s_i, e_i, s_{X \to Y}, c_{X \to Y}]$, if over the $i$-th time segment $[s_i, e_i]$ of all $n$ cycles $s_{X \to Y} \ge s_{min}$ and $c_{X \to Y} \ge c_{min}$, the rule is called a strong periodic association rule. If the support and confidence thresholds are met in the $i$-th time segment $[s_i, e_i]$ of at least one of the $n$ cycles, i.e. $s_{ti} \ge s_{min}$ and $c_{ti} \ge c_{min}$ for some cycle $t$, the rule is called a weak periodic association rule. Here $s_{min}$ and $c_{min}$ are the minimum support and minimum confidence defined for mining valid association rules.
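The following sketch shows how Definitions 6 and 7 could be evaluated for a single time segment. The function name, the tuple layout and the return labels are hypothetical, and all counts are assumed to be positive. A rule that meets the strong condition also meets the weak one, so the sketch reports the stronger label first.

```python
def classify_rule(per_cycle, s_min, c_min):
    """Classify X -> Y in one time segment as 'strong', 'weak' or 'none'.

    per_cycle holds one (n_xy, n_x, n_total) triple per cycle:
    transactions containing X u Y, transactions containing X, and all
    transactions in the segment (all assumed > 0).
    """
    supports = [n_xy / n_total for n_xy, _, n_total in per_cycle]   # s_ti
    confidences = [n_xy / n_x for n_xy, n_x, _ in per_cycle]        # c_ti
    s_rule, c_rule = min(supports), min(confidences)                # Definition 6
    if s_rule >= s_min and c_rule >= c_min:
        return "strong", s_rule, c_rule
    # Weak: thresholds are met in at least one cycle
    if any(s >= s_min and c >= c_min for s, c in zip(supports, confidences)):
        return "weak", s_rule, c_rule
    return "none", s_rule, c_rule


cycles = [(20, 25, 40), (10, 20, 40)]   # supports 0.5, 0.25; confidences 0.8, 0.5
print(classify_rule(cycles, s_min=0.4, c_min=0.6))
print(classify_rule(cycles, s_min=0.2, c_min=0.5))
```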
Property 1. By Definition 5, every subset of a periodic frequent itemset $A$ is also a periodic frequent itemset in the time segment of $A$ in all cycles.
Theorem 1. For any periodic frequent itemset $X[C, s_i, e_i, s_X]$ (that is, $X$ is a periodic frequent itemset in the time segment $[s_i, e_i]$), the corresponding time segment $[s_i, e_i]$ of every cycle contains at least one frequent itemset $Y$ that includes $X$, i.e. $X \subseteq Y$.
Proof: In the corresponding time segment $[s_i, e_i]$ of each cycle, $Y$ can at least be taken to be $X$ itself. By Definition 3, the periodic support $s_X$ of $X[C, s_i, e_i, s_X]$ is the minimum of the supports of $X$ in the corresponding time segment $[s_i, e_i]$ over all cycles, so the support of $Y$ in each cycle is at least $s_X$. Since $X[C, s_i, e_i, s_X]$ is a periodic frequent itemset, $s_X \ge s_{min}$ by Definition 4. Hence $Y$ is a frequent itemset and the conclusion holds.
Property 2. As the time range $T$ of the whole transaction set keeps growing (the number of cycles keeps increasing), the periodic frequent itemsets found in $T$ approach the general periodic frequent itemsets, and the gap between the two approaches 0. When $T$ is infinite, the periodic frequent itemsets found in $T$ equal the general periodic frequent itemsets and the gap is 0. Put differently, the larger the span of $T$, the more abstract, general, and universal the periodic association rules found.
(2) Clustering algorithm for time-series vectors based on the difference sequence method and the DB Index criterion (CMDSA)
Clustering is performed once, in the first cycle, to determine each time segment of the association rules within a cycle; the other cycles are then divided in the same way.
The time range $T$ of the whole transaction set contains several cycles. Because the transactions in each cycle differ, clustering each cycle separately necessarily yields different divisions into time segments. Suppose, for example, that $T$ contains 3 cycles. After clustering the first cycle, the transactions of 8:30-9:30 form one class and frequent itemset A exists in 8:30-9:30. After clustering the second cycle, the transactions of 8:50-9:50 form one class and frequent itemset A exists in 8:50-9:50. After clustering the third cycle, the transactions of 9:00-10:00 form one class and frequent itemset A exists in 9:00-10:00. The periodic frequent itemset A then necessarily exists in 9:00-9:30, the intersection of the time segments that contain frequent itemset A in the three cycles. Hence, by Theorem 1, if the division into time segments is obtained by clustering any cycle $i$ in $T$, the frequent itemsets found in cycle $i$ under that division must include all periodic frequent itemsets. It therefore suffices to cluster a single cycle (say the first) to determine each time segment of the association rules within a cycle, find all frequent itemsets of the first cycle, and compare each of them against all transactions of the corresponding time segment in the other cycles (that is, apply periodic pruning). This finds all periodic frequent itemsets and, from them, all strong periodic association rules.
It would of course be possible to cluster every cycle and then derive a division that suits all cycles from the different time segments of each cycle. But $n$ cycles require $n$ clusterings, and the computational cost grows with $n$, so this is not advisable.
Choosing the time-domain feature points for clustering: clustering by the number of transactions at each instant as in [4][5], clustering by the item count at each instant, and the invention's clustering by the time-series vector formed from the item supports at each instant are all ways of clustering time-domain data. They differ only in the object that is clustered, i.e. the choice of time-domain feature point. [4][5] take the number of transactions at each instant as the feature point. The invention takes as the feature point the time-series vector formed, for every item, from the ratio of the number of transactions containing the item at each instant to the total number of transactions at that instant (the support of the item at that instant).
The time-domain feature points of the invention are defined as follows:
1) Within one cycle, the support of each item at each instant (the ratio of the number of transactions containing the item at that instant to the total number of transactions at that instant) forms one component of the time-series vector $e_i$. The $n$-dimensional time-series vector is $e_i = \langle x_1, \dots, x_n \rangle$, and every component of $e_i$ takes a real value in the closed interval $[0, 1]$.
2) The time-series vector sequence composed of these vectors is written $E = \{e_i \mid i = 1, \dots, m\}$, with sequence length Lend(E) = $m$ and Time(E) equal to one cycle. In the invention the time interval between any two adjacent time-series vectors in $E$ is equal, so the time granularity Granularity(E) is a constant.
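The construction of the feature points can be sketched as follows. This is a minimal illustration under stated assumptions: transactions are given as hypothetical `(slot_index, set_of_items)` pairs for one cycle, slots are numbered from 0, and a slot without transactions is given support 0.0 for every item (the patent does not say how empty instants are handled).

```python
def support_vectors(transactions, items, num_slots):
    """Build the time-series vector sequence E for one cycle.

    transactions: list of (slot_index, set_of_items) pairs.
    items: fixed item order; component j of e_t is the support of items[j].
    Returns a list of num_slots vectors with components in [0, 1].
    """
    totals = [0] * num_slots
    counts = [[0] * len(items) for _ in range(num_slots)]
    for slot, basket in transactions:
        totals[slot] += 1
        for j, item in enumerate(items):
            if item in basket:
                counts[slot][j] += 1
    # support = transactions containing the item / all transactions in the slot
    return [
        [c / totals[t] if totals[t] else 0.0 for c in counts[t]]
        for t in range(num_slots)
    ]


transactions = [(0, {"milk", "bread"}), (0, {"milk"}), (1, {"bread"})]
E = support_vectors(transactions, ["milk", "bread"], num_slots=3)
print(E)  # one vector per time slot
```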
In the invention, the main task of the CMDSA algorithm is to cluster the time-series vector sequence $E$ of the first cycle.
Preparation before clustering the time-series vectors: before clustering starts, two questions must be answered. Into how many classes should the time-series vectors of a cycle be divided to obtain the best clustering result? And what index decides which clustering result is best, in other words, under what criterion is a clustering result optimal?
Cluster validity function [11][12][13]: answering these questions requires a cluster validity function to control the number of clusters; the DB Index criterion [9][19] is adopted here. Among the rules for judging cluster validity, within-class scatter and between-class distance are commonly used. The DB Index criterion [9][19] uses both, and the invention adopts it as the criterion of classification validity. Its main content is as follows:
1) Within-class mean scatter: $S_i = \frac{1}{|C_i|} \sum_{X \in C_i} \|X - Z_i\|$, where $Z_i$ is the centre of class $C_i$ and $|C_i|$ is the number of samples in class $C_i$.
2) Between-class distance: $d_{ij} = \|Z_i - Z_j\|$, i.e. the distance between the two class centres.
3) DB Index: $DB_k = \frac{1}{k} \sum_{i=1}^{k} R_i$, where $R_i = \max_{j = 1, \dots, k,\ j \ne i} \frac{S_i + S_j}{d_{ij}}$ and $k$ is the number of classes.
Under the DB Index criterion, the smaller the value of $DB_k$, the better the classification.
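The three formulas above translate directly into a short Python sketch. The function name `db_index` and the input layout (a list of clusters, each a list of equal-length vectors) are assumptions for illustration; the sketch uses the Euclidean norm and assumes at least two non-empty clusters with distinct centres.

```python
import math


def db_index(clusters):
    """DB Index of a partition; smaller values mean a better clustering.

    clusters: list of non-empty lists of equal-length numeric vectors.
    """
    # Z_i: component-wise mean of each class
    centres = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
    # S_i: mean distance of the class members to their centre
    scatter = [
        sum(math.dist(x, z) for x in c) / len(c)
        for c, z in zip(clusters, centres)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # R_i: worst-case ratio of combined scatter to centre distance d_ij
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
            for j in range(k) if j != i
        )
    return total / k


compact = [[[0.0], [2.0]], [[10.0], [12.0]]]
mixed = [[[0.0], [10.0]], [[2.0], [12.0]]]
print(db_index(compact), db_index(mixed))
```

The compact partition has small scatter and well-separated centres, so its DB Index is lower than that of the mixed partition, which is the behaviour the criterion relies on.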
Optimizing CMDSA by reducing the search range of the best cluster number $c_{opt}$: a cycle has $n$ time-series vectors $e$, so the cluster number $c$ could be allowed to range from 2 to $n$, with the DB Index criterion and the difference sequence method used to determine the best cluster number $c_{opt}$, i.e. $2 \le c_{opt} \le n$. This is inefficient: when $n$ is large, the cost of searching for $c_{opt}$ is very high, so the upper bound $c_{max}$ of the search range must be reduced. To determine $c_{max}$, many researchers use the empirical rule $c_{max} \le \sqrt{n}$, which is mentioned in [13]. [15] mentions the rule $c_{max} \le 2 \ln n$. These rules lack theoretical support. [14] gives a new method for determining $c_{max}$ that justifies $c_{max} \le \sqrt{n}$ theoretically. [14] discusses the range of the best cluster number for fuzzy clustering and mainly treats fuzzy partitions; since a hard partition is a special case of a fuzzy partition, the theory of [14] applies equally to the hard clustering of the invention. Following [14], the invention therefore adopts the rule $c_{max} \le \sqrt{n}$, so the search range of $c_{opt}$ is 2 to $\sqrt{n}$. The DB Index criterion and the difference sequence method are then used to determine the best cluster number $c_{opt}$. Compared with $c_{max} \le n$, the rule $c_{max} \le \sqrt{n}$ saves a great deal of computation, and the larger $n$ is, the larger the gap in computational cost between the two becomes; the gap grows geometrically.
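As a small illustration of the reduced search range, the sketch below lists the candidate cluster numbers under the rule $c_{max} \le \sqrt{n}$. The function name is hypothetical, and rounding $\sqrt{n}$ down to an integer is an assumption; the text does not state how a non-integer bound is handled.

```python
import math


def candidate_cluster_counts(n):
    """Candidate cluster numbers 2..floor(sqrt(n)) for n time-series vectors."""
    c_max = math.isqrt(n)          # integer part of sqrt(n), an assumption
    return list(range(2, c_max + 1))


# 24 hourly vectors in a one-day cycle: 3 candidates instead of 23 for 2..n
print(candidate_cluster_counts(24))
print(len(range(2, 24 + 1)))
```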
Difference sequence method [7]: in the invention, the time-ordered time-series vectors $e_i$ of the instants in the first cycle are clustered. Vectors with a time parameter are ordered samples, so the Fisher algorithm [10][20] could be considered; it is what [4][5] use to cluster the number of transactions at each instant. However, the Fisher algorithm does not take the order of the samples into account when computing the diameter of each class. The difference sequence method is therefore used here to cluster the ordered time-series vectors $e_i$. It takes the order of the time-stamped samples into account, is simple to compute, and gives results that are easy to visualize.
Concepts of the difference sequence method [7]: let there be $m$ ordered samples $x_1, x_2, \dots, x_m$, each with $n$ index observations. The $i$-th sample is written $x_i = (x_{i1}, \dots, x_{in})$, where $x_{ij}$ is the $j$-th index observation of the $i$-th sample, $1 \le i \le m$, $1 \le j \le n$. A non-negative number $g_i = g(x_i, x_{i+1})$ expresses the difference between the $i$-th sample $x_i$ and the $(i+1)$-th sample $x_{i+1}$, $i = 1, 2, \dots, m-1$, with $g_i = g(x_i, x_{i+1}) = 0$ when $x_i = x_{i+1}$. Usually $g_i$ is taken to be the weighted $l_p$ norm $g_i = \left[ \sum_{j=1}^{n} w_j |x_{ij} - x_{i+1,j}|^p \right]^{1/p}$, $i = 1, 2, \dots, m-1$, where $w_j \ge 0$ are weights. The weights $w_j$ mainly serve to remove differences of scale between indices and to reflect the importance of each index.
Concept 1 (difference sequence). $g_i = g(x_i, x_{i+1})$ is called the difference between the $i$-th and $(i+1)$-th samples, and $(g_1, g_2, \dots, g_{m-1})$ is called the difference sequence of the samples.
Concept 2 (second difference sequence). The difference $h_i = h(g_i, g_{i+1})$ of $(g_1, g_2, \dots, g_{m-1})$ is called the second difference of the samples, usually taken as $h_i = |g_{i+1} - g_i|$, $i = 1, 2, \dots, m-2$. $(h_1, h_2, \dots, h_{m-2})$ is called the second difference sequence of the samples.
Concept 3 (difference sequence method). The method of classifying ordered samples by means of the difference sequence is called the difference sequence method.
Concept 4 ($k$-class cut points). To gather the ordered samples $(x_1, x_2, \dots, x_m)$ into $k$ classes ($1 < k < m$), first determine $k-1$ integers $i_1, \dots, i_{k-1}$ satisfying $1 \le i_1 \le i_2 \le \dots \le i_{k-1}$, then gather the samples into the $k$ classes $(x_1, \dots, x_{i_1})$, $(x_{i_1+1}, \dots, x_{i_2})$, ..., $(x_{i_{k-2}+1}, \dots, x_{i_{k-1}})$, $(x_{i_{k-1}+1}, \dots, x_m)$. The integers $i_1, \dots, i_{k-1}$ are called the $k$-class cut points.
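Concepts 1, 2 and 4 can be sketched in a few lines of Python. The function names are hypothetical, equal weights $w_j = 1$ and $p = 2$ are assumed defaults, and the cut points passed to `split_at` are assumed to be sorted and distinct so that no class is empty.

```python
def difference_sequence(samples, weights=None, p=2):
    """g_1..g_{m-1}: weighted l_p difference of each sample and its successor."""
    w = weights or [1.0] * len(samples[0])
    return [
        sum(wj * abs(a - b) ** p for wj, a, b in zip(w, x, y)) ** (1.0 / p)
        for x, y in zip(samples, samples[1:])
    ]


def second_difference_sequence(g):
    """h_1..h_{m-2}: absolute change between neighbouring differences."""
    return [abs(b - a) for a, b in zip(g, g[1:])]


def split_at(samples, cut_points):
    """Classes defined by 1-based cut points i_1 < ... < i_{k-1} (Concept 4)."""
    bounds = [0] + list(cut_points) + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]


samples = [[0, 0], [3, 4], [3, 4]]
g = difference_sequence(samples)
h = second_difference_sequence(g)
print(g, h, split_at(samples, [1]))
```

Python lists are 0-based, so `g[0]` holds $g_1$ and `h[0]` holds $h_1$.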
Basic idea of the difference sequence method [7]: first consider the difference between each sample and the next one, then consider the differences as a whole and select the cut points used for clustering. At a cut point the samples change from one class to another, so the difference between samples at a cut point should be large. Because the samples are random, the differences between samples vary even within one class; near a cut point, however, the change in the difference should be large, that is, the second difference should be large. The second difference is therefore largest where the samples are divided into two classes, and next largest where they are divided into three.
The steps of the difference sequence method are: first divide the samples into two classes, then on that basis into three classes, and so on up to $k$ classes. Specifically, first determine the 2-class cut point $i_1$. Take $l$ ($1 \le l \le m-2$) such that
$$h_l = \max_{1 \le i \le m-2} h_i \qquad (a)$$
$i_1$ is then determined by formula (b):
[Formula (b): image A20061005295200112, not reproduced in this text]
Formula (b) means that, of the two points $l$ and $l+1$, the one with the larger difference is taken as the cut point $i_1$. If $l_1$ and $l_2$ ($l_1 \ne l_2$) both satisfy (a), then when
$$\max(g_{l_1}, g_{l_1+1}) > \max(g_{l_2}, g_{l_2+1}) \qquad (c)$$
$i_1$ is determined from $l_1$ by formula (b). Once the 2-class cut point $i_1$ is determined, the samples $(x_1, x_2, \dots, x_m)$ are divided into the two classes $(x_1, \dots, x_{i_1})$ and $(x_{i_1+1}, \dots, x_m)$. On this basis the samples can be divided into three classes: compute the largest second difference of each of the two classes (see (a)), and divide the class with the larger of these two values into two classes by the method above, which gives three classes. Continuing in this way divides the samples into $k$ classes.
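The splitting procedure described above can be sketched as follows, starting from a difference sequence $g$. This is an illustration under assumptions, not the patent's implementation: formula (b) is only available as an image, so the sketch follows its verbal description and picks $l$ when $g_l \ge g_{l+1}$ and $l+1$ otherwise (the handling of ties is a guess). A class with fewer than three samples has no second difference and is not split further, and the function names are hypothetical.

```python
def two_class_cut(G, H, a, b):
    """Best cut inside samples x_a..x_b (1-based, inclusive).

    G and H are 1-based (index 0 unused). Returns (h_l, cut) or None when
    the class has fewer than 3 samples and therefore no second difference.
    """
    if b - a < 2:
        return None
    # (a) with tie-break (c): largest h_l, then largest max(g_l, g_{l+1})
    l = max(range(a, b - 1), key=lambda i: (H[i], max(G[i], G[i + 1])))
    # (b), as described in words: take the side with the larger difference
    cut = l if G[l] >= G[l + 1] else l + 1
    return H[l], cut


def cut_points(g, k):
    """k-class cut points for m = len(g) + 1 ordered samples."""
    m = len(g) + 1
    G = [None] + list(g)
    H = [None] + [abs(G[i + 1] - G[i]) for i in range(1, m - 1)]
    segments, cuts = [(1, m)], []
    while len(segments) < k:
        options = [(two_class_cut(G, H, a, b), (a, b)) for a, b in segments]
        options = [(res, seg) for res, seg in options if res is not None]
        if not options:
            break                      # nothing left that can be split
        (_, cut), (a, b) = max(options, key=lambda o: o[0][0])
        segments.remove((a, b))
        segments += [(a, cut), (cut + 1, b)]
        cuts.append(cut)
    return sorted(cuts)


g = [1, 1, 9, 1, 1, 6, 1]      # large jumps after samples 3 and 6
print(cut_points(g, 2), cut_points(g, 3))
```

In the example the largest second differences sit next to the jump of 9, so the first cut falls after sample 3; the next split is made in the class containing the jump of 6, giving a second cut after sample 6.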
Clustering algorithm for time series vectors based on the diversity sequence method and the DB Index criterion (CMDSA): the diversity sequence method only classifies; it cannot judge how many classes give the optimal classification, so the DB Index criterion is also needed to determine the optimal number of classes. The present invention combines the two to cluster the time series vector sequence E of the first cycle, determine the distribution pattern of the time-domain data feature points, and thereby determine each time region of the association rules within the cycle.
The CMDSA algorithm:
1) Step 1: first compute the m-1 diversity factors g[i] and the m-2 secondary diversity factors h[i] of the time series vector sequence E, which has m time series vectors in one cycle, and form the corresponding diversity factor array and secondary diversity factor array.
2) Step 2: the cluster number c of the time series vector sequence E ranges from 2 to
Figure A20061005295200113
so only the corresponding number of cut points needs to be found. First find the largest secondary diversity factor value h[i] in the secondary diversity factor array and determine the corresponding cut point; then find the largest of the remaining h[i] and determine its cut point, and so on. To avoid rescanning the values h[i] that have already been found, sorting can be used: once the largest h[i] is found it is swapped with h[0], the first position of the secondary diversity factor array; the second largest h[i] is swapped with h[1], the second position; and so on up to the
Figure A20061005295200121
-th largest h[i].
3) Because sorting disturbs the initial position of each h[i], and the corresponding cut point position has to be determined after an h[i] is found, the initial position of each h[i] must be recorded. Each h[i] is therefore a structure variable with 2 components: h[i].data is the value of the secondary diversity factor, and h[i].place stores the initial position i.
4) Sub-step 1 of step 2: each time an h[i] is found in 2), the cut point is determined from it. Specifically, each time an h[i] is determined, h[i].place is used in the formula
Figure A20061005295200122
to determine the cut point position k_wei. Then k_wei is inserted at the proper position of the ordered singly linked list Fen, which stores the cut point positions in ascending order; after k_wei is inserted, the cut point positions in Fen are still in ascending order.
5) Sub-step 2 of step 2: each time a new cut point is determined in 4), the procedure INDEX(c, Fen) is applied to the updated Fen to compute the value DB_c corresponding to the new cut point. DB_c is compared with DB*; if DB_c is less than DB*, DB* is overwritten with DB_c and Fen* is overwritten with the linked list Fen corresponding to DB_c. When 2) has finished, the minimum DB_c value has thus been found. Because a smaller DB_c value means a better clustering, the linked list Fen* corresponding to the minimum DB_c value is the optimal classification.
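Steps 2) to 5) can be sketched as below. This is a rough illustration under stated assumptions: the INDEX procedure is not reproduced in this section, so the standard Davies-Bouldin index is used in its place; the formula that maps h[i].place to a cut point position is an image in the source, so the ordered cut positions are taken as an input (`cut_order`); a Python list kept sorted with `bisect.insort` stands in for the linked list Fen. All function names are hypothetical. Cut positions are 0-based, each marks the last sample of a segment, and they must be distinct and smaller than the last index.

```python
import bisect
import math

def largest_h_positions(h, count):
    """Original positions (h[i].place) of the count largest h values."""
    ranked = sorted(enumerate(h), key=lambda pair: -pair[1])
    return [place for place, _ in ranked[:count]]

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def db_index(samples, cuts):
    """Davies-Bouldin index of the contiguous segments given by sorted cuts."""
    bounds = [-1] + list(cuts) + [len(samples) - 1]
    segments = [samples[a + 1:b + 1] for a, b in zip(bounds, bounds[1:])]
    centers = [centroid(s) for s in segments]
    scatter = [sum(math.dist(p, c) for p in s) / len(s)
               for s, c in zip(segments, centers)]
    k = len(segments)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of segment i to any other segment.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

def best_partition(samples, cut_order):
    """Add cuts one at a time and keep the cut list with the smallest DB."""
    fen, best_fen, best_db = [], None, math.inf
    for k_wei in cut_order:
        bisect.insort(fen, k_wei)      # Fen stays in ascending order
        db_c = db_index(samples, fen)
        if db_c < best_db:             # smaller DB means better clustering
            best_db, best_fen = db_c, list(fen)
    return best_fen, best_db
```

The sketch divides by the distance between segment centroids, so it assumes that no two segments share the same centroid.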
(3) Periodic association rule discovery algorithm based on the FP-tree (CFP-tree)
After the CMDSA algorithm has clustered the time series vectors formed from the support of each item at each moment, and thereby determined each time region of the association rules within the cycle (the time segments are determined from the ordered singly linked list Fen* of optimal cut point positions obtained above), we begin to discover useful periodic association rules on each time region of the cycle. There are several choices of periodic association rule discovery algorithm: Apriori-based or FP-tree-based. The present invention adopts the FP-tree-based periodic association rule discovery algorithm (CFP-tree).
Three approaches to discovering periodic frequent itemsets:
Approach one: the obvious approach is to construct an FP-tree in each time period of every cycle to produce frequent itemsets, and then to compare the frequent itemsets of the corresponding time periods of all cycles to produce the periodic frequent itemsets. The time overhead of this approach is huge, so it is inadvisable.
Approach two: by Theorem 1 of Section 3, for any periodic frequent itemset there is, in the corresponding time period of every cycle, a frequent itemset that contains it. Therefore, once the frequent itemsets of each time period of one cycle have been mined, comparing them with the transaction databases of the corresponding time periods of all other cycles finds all periodic frequent itemsets. One can thus first construct an FP-tree in each time period of the first cycle, use FP-growth to produce the frequent itemsets of each time period of the first cycle, and then compare these frequent itemsets with the transaction databases of the corresponding time periods of all other cycles to produce all periodic frequent itemsets.
Approach three: the method of approach two can be optimized further. By Theorem 1, one can first construct an FP-tree in each time period of the first cycle and, starting from the periodic frequent item list L, use the periodic pruning technique based on the conditional FP-tree to generate all periodic frequent itemsets, and then find all strong periodic association rules. Approach three is the CFP-tree algorithm of the present invention.
Constructing an FP-tree in each time segment of the first cycle [8][16]
CMDSA, as described above, yields c reasonable time segments in the first cycle. With one transaction database per time segment there are c transaction databases D_1, ..., D_c (D_i corresponds to the i-th time segment), and with one FP-tree per transaction database, c FP-trees need to be constructed.
The concrete steps for constructing FP-tree_i from the transaction database D_i of each time period [s_i, e_i] of the first cycle are:
1) Scan the transaction database D_i once, producing the set of frequent items F_i and the corresponding supports. Compare F_i with all transactions of the transaction databases D_i of the corresponding time period [s_i, e_i] in all cycles other than the first, obtaining the set of periodic frequent items F (the 1-itemset formed by each item x in F is a periodic frequent itemset) and the periodic support of each x. Sort F in descending order of the periodic support of x to generate the periodic frequent item list L_i. Here r_{ti} denotes, for an item x in F, the ratio of the number of transactions containing x in the i-th time period [s_i, e_i] of the t-th cycle to the total number of transactions in that time period. Over the i-th time period [s_i, e_i] of all n cycles, the periodic support of x is s = min{r_{1i}, r_{2i}, ..., r_{ni}}.
2) Create the root node of FP-tree_i and label it "null". For each transaction t in D_i do the following: select the frequent items of t according to the periodic frequent item list L, delete the non-frequent items, and arrange the frequent items of t in the order in which they appear in L. Denote the sorted frequent item list by [p|P], where p is the first element and P is the list of the remaining elements, and call insert_tree([p|P], T). The function insert_tree([p|P], T) works as follows: if T has a child node N with N.item_name = p.item_name, increase the count of N by 1; otherwise create a new node N with count 1 whose parent is T, and link it to the nodes that have the same item name. If P is non-empty, call insert_tree(P, N) recursively.
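The two construction steps above can be sketched as follows. This is a simplified illustration, not the patent's implementation: transactions are plain Python sets, `periods` is a hypothetical list holding the transactions of the same time period for cycle 1, cycle 2, and so on, and the periodic support is computed directly for every item of the first cycle rather than in two passes. The node-link chain is kept by prepending each new node to a per-item header entry.

```python
class Node:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.link = None   # next node that carries the same item

def periodic_support(periods, item):
    """s = min over all cycles of the share of transactions containing item."""
    return min(sum(item in t for t in tx) / len(tx) for tx in periods)

def periodic_frequent_list(periods, s_min):
    """Periodic frequent items, sorted by descending periodic support."""
    items = {x for t in periods[0] for x in t}
    scored = {x: periodic_support(periods, x) for x in items}
    frequent = [x for x in scored if scored[x] >= s_min]
    return sorted(frequent, key=lambda x: (-scored[x], x))

def insert_tree(items, node, header):
    """Insert the sorted frequent-item list [p|P] below node."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:
        child = Node(p, node)
        node.children[p] = child
        child.link = header.get(p)   # thread onto the node-link chain
        header[p] = child
    child.count += 1
    insert_tree(rest, child, header)

def build_tree(transactions, order):
    """Build the FP-tree of one time period from the item list `order`."""
    rank = {x: i for i, x in enumerate(order)}
    root, header = Node(None, None), {}
    for t in transactions:
        items = sorted((x for x in t if x in rank), key=rank.__getitem__)
        insert_tree(items, root, header)
    return root, header
```

A typical call would first compute `order = periodic_frequent_list(periods, s_min)` and then `build_tree(periods[0], order)`, so that only items that are frequent in every cycle enter the tree of the first cycle.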
Periodic pruning technique based on the conditional FP-tree. The idea is mainly an improvement of the FP-growth of references [8][16] that adds to FP-growth the ability to find periodic frequent itemsets. CFP-growth periodically prunes the conditional FP-tree of the pattern β in FP-growth. That is, the set formed from β and the conditional FP-tree of β (for example, if β is {kma} and the conditional FP-tree of β is {(f:3, b:3)}|kma, the set formed is {{fkam}, {bkam}}) is compared with all transactions of the corresponding time segment [s_j, e_j] of the other cycles; the itemsets of the set whose periodic support is less than s_min are pruned, and the periodic conditional FP-tree of pattern β is generated. When the traversal of the FP-tree of each time segment [s_j, e_j] of the first cycle is finished, all periodic frequent itemsets have been generated.
The periodic pruning technique based on the conditional FP-tree is what makes approach three better than approach two. The effectiveness of this technique is analysed below:
A) Related concepts and properties
For the concepts of conditional pattern base and conditional FP-tree, see references [8][16]; they are not repeated here.
Theorem 2. Let a be an itemset in the transaction database D_i of some time period [s_j, e_j] of some cycle j, let B be the conditional pattern base of a, let M be the itemset formed by all items occurring in B, and let β be any subset of M, β ⊆ M. Then, in cycle j, the support of a ∪ β in D_i, support_{a∪β}, equals the support of β in B, i.e. support_{a∪β} = support_β.
Proof: by the definition of the conditional pattern base in references [8][16], the conditional pattern base of a is formed by all prefix paths leading to a, so every transaction that appears in B appears in D_i where a appears; in other words, if such a transaction contains β, it certainly contains a. If an itemset β occurs n times in B, then β also occurs n times together with a in D_i; and since all items that occur together with a in D_i are collected in the conditional pattern base of a, a ∪ β occurs exactly n times in D_i. The conclusion holds.
Inference 1. Let a be an itemset in the transaction database D_i of some time period [s_j, e_j] of some cycle j, let B be the conditional pattern base of a, let M be the itemset formed by all items occurring in B, and let β be any subset of M, β ⊆ M. Then, in cycle j, the support of a ∪ β in D_i is less than or equal to the support of a in D_i, i.e. support_{a∪β} <= support_a.
Proof: by the definition of the conditional pattern base in references [8][16], every transaction that appears in B appears in D_i where a appears. If an itemset β occurs n times in B, then β also occurs n times together with a in D_i, which shows that a occurs at least n times in D_i. There may also be occurrences of a without β, in which case a occurs n times or more, so support_β <= support_a. By Theorem 2, support_{a∪β} = support_β, hence support_{a∪β} <= support_a.
Inference 2. Let a be an itemset in time period [s_j, e_j] of some cycle j, let B be the conditional pattern base of a, let M be the itemset formed by all items occurring in B, and let β be any subset of M, β ⊆ M. Then, in time period [s_j, e_j] of any cycle k other than cycle j, the support of a ∪ β in the transaction database D_i of time period [s_j, e_j] of cycle k is less than or equal to the support of a in D_i, i.e. support_{a∪β} <= support_a.
Proof: i) B is the conditional pattern base of itemset a in time period [s_j, e_j] of cycle j, and for all β ⊆ M, Inference 1 gives support_{a∪β} <= support_a in time period [s_j, e_j] of cycle j. ii) In time period [s_j, e_j] of any other cycle k, however, B is not necessarily the conditional pattern base of a. For all β ⊆ M, there are three possible positional relations between β and a in the FP-tree of time period [s_j, e_j] of cycle k: 1) β and a are on the same path and β is on the prefix path of a; 2) β and a are on the same path and a is on the prefix path of β; 3) β and a are not on the same path. iii) In time period [s_j, e_j] of every cycle, each transaction is sorted according to the periodic frequent item list L_i of time period [s_j, e_j] before it is inserted into the FP-tree. Since B is the conditional pattern base of a in time period [s_j, e_j] of cycle j, for all β ⊆ M every item of β precedes every item of a in L_i. Hence, in time period [s_j, e_j] of any cycle k other than j, relation 2), in which a is on the prefix path of β, cannot occur in the FP-tree. iv) In time period [s_j, e_j] of a cycle k other than j, if the positional relation between β and a in the FP-tree is 1), then by Inference 1 the support of a ∪ β in the transaction database D_i of that time period is less than or equal to the support of a in D_i, i.e. support_{a∪β} <= support_a; if the relation is 3), then support_{a∪β} = 0. v) In summary, the conclusion holds.
Theorem 3. Suppose a[C, s_i, e_i, s_a, c_a] is a periodic itemset in time period [s_i, e_i], B is the conditional pattern base of itemset a in time period [s_j, e_j] of some cycle j, M is the itemset formed by all items occurring in B, and β is any subset of M, β ⊆ M. Then the periodic support s_{a∪β} of the periodic itemset a ∪ β [C, s_i, e_i, s_{a∪β}, c_{a∪β}] in time period [s_i, e_i] is less than or equal to the periodic support s_a of a[C, s_i, e_i, s_a, c_a], i.e. s_{a∪β} <= s_a.
Proof: i) By Definition 3 of Section 2, over time period [s_i, e_i] of all cycles, s_a = min{s_{a,1}, ..., s_{a,n}}, where s_{a,i} is the support of a in time period [s_i, e_i] of the i-th cycle. Without loss of generality, let s_{a,k}, the support in time period [s_i, e_i] of the k-th cycle, be the minimum; then s_{a,k} = s_a. ii) When k = j, Inference 1 gives s_{a,k} >= s_{a∪β,k}; when k ≠ j, Inference 2 gives s_{a,k} >= s_{a∪β,k} in time period [s_j, e_j] of cycle k, where s_{a∪β,k} is the support of a ∪ β in time period [s_j, e_j] of cycle k. So s_a >= s_{a∪β,k}. iii) Moreover, s_{a∪β,k} >= min{s_{a∪β,1}, ..., s_{a∪β,n}} = s_{a∪β}, hence s_{a∪β} <= s_a.
B) Analysis of the periodic pruning technique based on the conditional FP-tree
We illustrate this with the example in reference [8]. The present invention cites the frequent itemset generation tree of that example, Figure 6 of reference [8] (see Fig. 2 of the present invention), and we assume that the frequent itemset generation tree of Fig. 2 belongs to time period [s_i, e_i] of the first cycle. Take the generation tree node ma in Fig. 2 as an example and let M be the itemset formed by all items occurring in the conditional pattern base of ma. For the child nodes mac and maf of ma we have {c} ⊆ M and {f} ⊆ M, so by Theorem 3 the periodic support of ma in the FP-tree of Fig. 2 is no less than the periodic supports of its child nodes mac and maf; likewise, the periodic supports of mac and maf are no less than those of their own child nodes. Hence, in the subtree rooted at ma, the root ma has the largest periodic support: if ma is not a periodic frequent itemset, no other node of the subtree is a periodic frequent itemset either. With the periodic pruning technique based on the conditional FP-tree of approach three, only the root ma needs to be compared once with the transaction databases of the corresponding time period of the other cycles, and none of the other nodes needs to be compared, so the subtree rooted at ma is never generated in the periodic frequent itemset generation tree of Fig. 3. Compared with approach two, this saves the cost of generating the subtree rooted at ma and also the cost of the periodic comparisons for that subtree. When ma is not a periodic frequent itemset, approach two first generates Fig. 2 in the first cycle and obtains Fig. 3 after periodic pruning, whereas approach three generates Fig. 3 directly in the first cycle. The periodic pruning technique based on the conditional FP-tree in approach three therefore effectively controls the size of the periodic frequent itemset generation tree and reduces the search space.
Discovering periodic frequent itemsets: Input: the c FP-trees of the first cycle, FP-tree_1, ..., FP-tree_c, where FP-tree_i is the FP-tree produced for the i-th time period [s_i, e_i].
Output: all periodic frequent itemsets
void Main()   // main program
begin
    for (i = 1; i <= c; i++) do begin
        CFP-growth(FP-tree_i, null);
    end
end
Procedure CFP-growth(Tree, α)   // periodically prune the conditional FP-tree of pattern β to generate the periodic frequent itemsets
begin
    if Tree contains a single path P then
        for each combination (denoted β) of the nodes in path P do   // β is a merge of node elements of path P
            generate pattern α ∪ β, with support equal to the minimum periodic support of the node elements in β;
    else for each element α_i in Tree do
    begin
        generate pattern β = α_i ∪ α, with support equal to the periodic support of α_i;
        construct the conditional pattern base of β and the conditional FP-tree Tree_β^c of β;
        form the set C from Tree_β^c and β;   // e.g. {(f:3, b:3)}|kma forms the set {{fkam}, {bkam}}
        Tree_β = CFP-Pruning(C);   // periodically prune the set C formed from Tree_β^c and β; outputs Tree_β
        if Tree_β ≠ φ then CFP-growth(Tree_β, β);   // if Tree_β is non-empty, call CFP-growth recursively
    end
end
Function CFP-Pruning(C)
begin
    for (j = 2; j <= n; j++) do begin   // cycles 2 to n
        C_i = φ;   // empty C_i
        for all c ∈ C do begin   // compute the support of every itemset c of C in the transaction database D_ji of the i-th time period [s_i, e_i] of cycle j; prune the itemsets whose support is less than s_min
            compute the support support_c of itemset c in D_ji;
            if support_c >= s_min then C_i = C_i ∪ {c};
        end
        C = C_i;
        if C = φ then exit;   // if all itemsets have been pruned, leave all loops
    end
    if C ≠ φ then begin
        generate Tree_β from C;   // e.g. C = {{fkam}, {bkam}}, β = kam, Tree_β is {(f:3, b:3)}|kma
        return Tree_β;   // output Tree_β
    end
    else return φ;
end
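The pruning loop of CFP-Pruning can be sketched in Python as follows. This is a simplified illustration under stated assumptions: candidate itemsets are frozensets, each other cycle is given as a list of transaction sets for the same time period, and the function returns the surviving itemsets rather than rebuilding the conditional FP-tree Tree_β from them. The names `support`, `cfp_pruning` and `other_cycles` are hypothetical.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def cfp_pruning(candidates, other_cycles, s_min):
    """Keep the candidate itemsets that reach s_min in every other cycle."""
    survivors = list(candidates)
    for transactions in other_cycles:      # cycles 2..n, same time period
        survivors = [c for c in survivors
                     if support(c, transactions) >= s_min]
        if not survivors:                  # everything pruned: stop early
            break
    return survivors
```

With the candidates {f,k,a,m} and {b,k,a,m} from the example in the pseudocode, a candidate that falls below s_min in any later cycle is dropped and is never compared against the remaining cycles.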
Finding strong periodic association rules: the strong periodic association rules of each time period [s_i, e_i] of each cycle can be produced directly from all the periodic frequent itemsets found in that time period. Specifically, for every non-empty subset a of each periodic frequent itemset L, if support(L)/support(a) >= c_min, then there is a strong periodic association rule a → (L-a) [C, s_i, e_i, s_{a→(L-a)}, c_{a→(L-a)}].
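The rule generation step can be sketched as below. It is a minimal illustration that assumes the periodic supports of L and of all its subsets are already available in a dictionary keyed by frozenset (`support_of`, a hypothetical name), and it enumerates only proper non-empty subsets a so that the consequent L-a is never empty.

```python
from itertools import combinations

def strong_rules(itemset, support_of, c_min):
    """Return (antecedent, consequent, confidence) triples for itemset L."""
    full = frozenset(itemset)
    rules = []
    for size in range(1, len(full)):
        for subset in combinations(sorted(full), size):
            a = frozenset(subset)
            confidence = support_of[full] / support_of[a]
            if confidence >= c_min:
                rules.append((a, full - a, confidence))
    return rules
```

For L = {a, b} with periodic supports 0.6 for {a}, 0.4 for {b} and 0.3 for {a, b}, a threshold c_min = 0.7 keeps only the rule b → a, whose confidence is 0.3/0.4.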
(4) Performance comparison of the present invention with existing algorithms
If the experimental data used to test a theoretical method are too simple, it is often difficult to test the correctness and validity of the method comprehensively. The present invention therefore uses data actually collected at the front line of industrial control production to verify the theory; these data are large in volume and reflect complex and changeable real conditions.
The software/hardware platform and parameters of the experimental test are as follows. (1) Dawning Tiankuo S240XP server (two CPUs (Intel Xeon MP, 2.0 GHz clock), 1 GB memory). (2) Programming language VC++ 6.0, operating system Windows 2000 Advanced Server, database Oracle9i. (3) 600,000 preprocessed records are taken per day. The cycle is one day, so the time span of the time series vector sequence E is Time(E) = 1 day. The sampling granularity Granularity(E) is 1 second, i.e. one sample per second, so the number of samples in one cycle is Lend(E) = 86400. The experimental data have 300 items, so each sample has 300 attribute indices (items) and is a 300-dimensional time series vector. Because Lend(E) = 86400, C_max = sqrt(Lend(E)) ≈ 294. Clustering with the CMDSA algorithm gives C_opt = 95; in actual production, dividing one day into about 94 segments under the same conditions is reasonable, so the value C_opt = 95 selected by the validity function (DB Index criterion) is practicable.
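The sample count and the upper bound on the cluster number quoted above can be checked with a few lines. The sketch assumes, as the quoted value of about 294 suggests, that the upper bound is the square root of the number of samples; the variable names are illustrative.

```python
import math

lend_e = 24 * 60 * 60                  # one sample per second for one day
c_max = round(math.sqrt(lend_e))       # assumed bound: sqrt of sample count
print(lend_e, c_max)                   # 86400 294
```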
1) Comparison of the number of periodic frequent itemsets found under different minimum supports by the CMDSA-based CARDSATSV algorithm and by the existing Fisher-based periodic association rule model published in references [4][5]
Fig. 4 compares, with a cycle of one day, the number of periodic frequent itemsets that the two find under different total time lengths T (T = 1 day, 30 days, 60 days) and different minimum supports. As Fig. 4 shows, for the same minimum support and the same total time length T, the CMDSA-based CARDSATSV algorithm finds more useful periodic frequent itemsets than the Fisher-based periodic association rule model of references [4][5], which shows that the CARDSATSV algorithm of the present invention is better than the periodic association rule model of references [4][5]. And because the Fisher-based periodic association rule model of references [4][5] finds more useful periodic association rules than the "Cyclic Association Rules" algorithm of reference [2], the CARDSATSV algorithm is also better than the algorithm of reference [2].
2) Experimental analysis of the periodic pruning technique based on the conditional FP-tree in the CFP-tree algorithm. Let Time(T_30, 3) be the running time of the algorithm of approach three, which uses the periodic pruning technique based on the conditional FP-tree, with a cycle of one day and a total time length T = 30 days; let Time(T_30, 2) be the running time of the algorithm of approach two, which does not use periodic pruning, with a cycle of one day and a total time length T = 30 days.
The second row of Fig. 6 is the ratio of Time(T_30, 3) to Time(T_30, 2) under different minimum supports; the third row is the difference between Time(T_30, 3) and Time(T_30, 2) under different minimum supports. Fig. 5 and Fig. 6 show: 1) For the same minimum support, the running time of approach two at T = 30 days is higher than that of approach three at T = 30 days, because in going from T = 1 day to T = 30 days approach two has more candidate itemsets to compare with the transactions of the other cycles than approach three. 2) As the minimum support decreases, the ratio Time(T_30, 3)/Time(T_30, 2) in Fig. 6 becomes lower and lower, and the difference Time(T_30, 2) - Time(T_30, 3) becomes larger and larger. This shows that the lower the minimum support, the more candidate itemsets must be compared with the transactions of the other cycles in going from T = 1 day to T = 30 days, the more non-frequent candidate itemsets the periodic pruning technique of approach three prunes without comparing them with the other cycles, the greater the periodic reduction of the non-frequent candidate itemsets of the first cycle, and the higher the operating efficiency.
3) Comparison of the running time of the CARDSATSV algorithm and of the Apriori-based periodic association rule model under different minimum supports
First consider the time overhead of the CARDSATSV algorithm, which has two parts: the time overhead of CMDSA and that of CFP-tree. The CMDSA algorithm clusters only the time-domain data feature points of one cycle, and the number of sampled feature points is generally lower than the number of transactions in one cycle (for example, the present invention samples the time-domain data feature points once per second, 86,400 sample points in total, whereas one cycle has 600,000 transactions). Compared with the CFP-tree algorithm, which runs over the whole transaction-set time span T (T contains several cycles), the running time of the CMDSA algorithm is far smaller, and the larger the span T, the larger the gap between the time overheads of CMDSA and CFP-tree. The time overhead of CARDSATSV is therefore mainly that of the CFP-tree algorithm.
Now compare the two algorithms. Compared with references [4][5], CARDSATSV of the present invention not only finds more useful periodic association rules; its CFP-tree algorithm is also far more efficient than the Apriori-based periodic association rule algorithm of references [4][5]. The Apriori algorithm produces a large number of candidate sets (when there are 15,000 frequent itemsets of length 1, the number of candidate sets of length 2 exceeds 18M) and scans the database repeatedly (the Apriori algorithm scans the database once for almost every candidate itemset). CFP-tree, by contrast, produces all frequent itemsets of the first cycle with 2 scans of the database in the first cycle. At the same time, because CFP-tree periodically prunes the conditional FP-tree of each frequent itemset β of the first cycle, it need not generate the non-periodic frequent itemsets that contain β and are frequent only in the first cycle, and it saves the time of comparing those itemsets with the transactions of the other cycles, which greatly improves the overall efficiency of the algorithm. Most of the work of the CFP-tree algorithm in the first cycle is carried out in memory, which also saves a great deal of time.
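The scale of the Apriori candidate set mentioned above can be checked with a short calculation. The sketch assumes that the length-2 candidates are all unordered pairs of the frequent 1-itemsets, which is the usual Apriori join step; the variable names are illustrative.

```python
import math

frequent_1_itemsets = 15_000
candidate_2_itemsets = math.comb(frequent_1_itemsets, 2)
print(candidate_2_itemsets)  # 112492500
```

Under that assumption the count is about 112 million, comfortably above the 18M bound quoted in the text.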
Fig. 7 compares the running time of the CARDSATSV algorithm and of the Apriori-based periodic association rule algorithm under different minimum supports, for a transaction-set total time length of T = 30 days.
Fig. 7 shows: 1) The time overhead of both algorithms decreases as the minimum support increases, because the higher the minimum support, the more items are eliminated. 2) The time overhead of the CARDSATSV algorithm is far lower than that of the Apriori-based periodic association rule algorithm.
The periodic association rule discovery algorithm based on time sequence vector diverse sequence method clustering (CARDSATSV) proposed by the present invention finds, to the greatest extent, useful association rules that current periodic association rule discovery algorithms cannot find, and its time and space efficiency is greatly improved over current Apriori-based periodic association rule algorithms. The experiments show that the CARDSATSV algorithm of the present invention is suitable for analysing large volumes of periodic industrial data. The CARDSATSV algorithm is part of the information-acquisition similarity-ratio induction fuzzy algorithm technology of the industrial production line embedded main control system (Chinese patent application No. 200610052850.1). That system is an informatization and transformation project for the characteristic industrial sector of a region in Zhejiang; while promoting the region's characteristic industrial sector and raising production efficiency, it provides accurate and complete decision information for local government and enterprise management. Data actually collected at the front line of industrial control production are large in volume, long in time span and complex in variation. For such data, the information-acquisition similarity-ratio induction fuzzy algorithm of the industrial production line embedded main control system (Chinese patent application No. 200610052850.1), which is based on the CARDSATSV algorithm of the present invention, offers an efficient and complete solution. In the concrete implementation of the system, the CARDSATSV algorithm not only identified the technical difficulties of mining association rules from periodic data but also proposed solutions to those difficulties, and it has been verified and used in experiments and in production practice. The practical outcome of the CARDSATSV algorithm, the industrial production line embedded main control system (Chinese patent application No. 200610052850.1), has entered its early operation phase. The various actual feedback data show that the CARDSATSV algorithm is suitable for analysing large volumes of periodic industrial data and provides a reference for theoretical research and practice in mining periodic association rules.
References:
[1] Ouyang Weimin, Cai Qingsheng. Discovery of association rules with temporal constraints in databases. Journal of Software, 1999, 10(5): 527~532
[2] Ozden B, Ramaswamy S, Silberschatz A. Cyclic Association Rules[J]. IEEE Trans on Data Engineering, 1998, 412~421
[3] Huang Yimin. Study of Frequent Cyclic Association Rule. Computer Science, 2000, 27(4): 43~45
[4] Xu Min, Jin Yuanping. A new periodic association rule model. Computer Engineering and Science, 2000, 22(4): 78~81
[5] Xu Min, et al. Mining Cyclic Generalized Association Rules. Transactions of Nanjing University of Aeronautics & Astronautics, 2002, 19(1)
[6] Agrawal R, et al. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, 1994, 487~499
[7] Cheng Qiansheng. A new sample clustering method: the diversity sequence method. Chinese Science Bulletin, 1994, 39(2)
[8] Han J, et al. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, Dallas, TX, 2000, 1~12
[9] Sergios Theodoridis, Konstantinos Koutroumbas. Pattern Recognition (Second Edition)[M]. Beijing: Mechanical Industrial Press, 2003. 163-205
[10] Fisher, W.D., J. Am. Stat. Assoc., 1958: 789~798.
[11] Bezdek J C, Pal N R. Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 1998, 28(3): 301~315
[12] Xie X L, et al. A validity method for fuzzy clustering. IEEE Trans Patt Anal Mach Intell, 1991, 13(8): 841~847
[13] Ramze R M, et al. A new cluster validity index for the fuzzy c-mean. Pattern Recognition Letters, 1998, 19: 237~246
[14] Yu Jian, Cheng Qiansheng. The search range of the optimal number of clusters in fuzzy clustering methods[J]. Science in China, 2002, 32(2): 274-280
[15] Fan Jiulun, Pei Jihong, Xie Weixin. Cluster validity based on possibility distribution. Acta Electronica Sinica, 1998, 26(4): 113-115
[16] Fan M, Meng XF, et al. Data Mining: Concepts and Techniques. Beijing: Mechanical Industrial Press, 2001 (in Chinese).
[17] Jiawei Han, Wan Gong, Yiwen Yin. Mining Segment-wise Periodic Pattern in Time Related Databases. Proc. of 1998 International Conf[R]. on Knowledge Discovery and Data Mining (KDD'98), New York City, NY, 1998.
[18] Han J, Gong W, Yin Y. Efficient Mining of Partial Periodic Patterns in Time Series Database. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), Sydney, Australia, Apr 1999: 106~115
[19] Maria Halkidi, et al. On Clustering Validation Techniques. Journal of Intelligent Information Systems, 2001, 17(2-3): 107~145
[20] Hartigan, J.A. Clustering Algorithms. John Wiley & Sons, 1975.
[21] Wang Lei, Tan Yuejin, et al. A fast time-domain association rule discovery algorithm based on clustering. Computer Simulation, 2005, 7(22).

Claims (6)

1. the periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering (CARDSATSV), CARDSATSV is made up of two parts: CMDSA and CFP-tree.At first CMDSA adopts each time zone of determining the correlation rule in the cycle based on the cluster of diversity sequence method and DB Index criterion dynamically to the time series vector of being made up of project support degree, controls the cluster number to reach best cluster effect with DB Index criterion.CFP-tree adopts the periodicity tailoring technology based on condition FP tree that the transaction database on each time zone in the cycle is carried out the periodically discovery of correlation rule.
2. The CMDSA algorithm according to claim 1, characterized in that CMDSA clusters the time series vectors based on the diversity sequence method and the DB Index criterion. The diversity sequence method only performs the classification and cannot determine how many classes give the optimal classification, so the DB Index criterion is also needed to determine the optimal number of classes. CMDSA combines the two to cluster the time series vector sequence E of the first cycle, determines the distribution pattern of the feature points of the time-domain data, and thereby determines each time region of the association rules within the cycle.
3. The CFP-tree algorithm according to claim 1, characterized in that an FP-tree is constructed for each time period of the first cycle; starting from the cycle-frequent item list L, all cycle-frequent itemsets are generated by the periodic pruning technique based on conditional FP-trees, and all strong periodic association rules are then found.
4. The construction of an FP-tree for each time period of the first cycle according to claims 1 and 3, characterized in that FP-tree$_i$ is constructed from the transaction database $D_i$ of each time period $[s_i, e_i]$ of the first cycle by the following steps:
1) Scan the transaction database $D_i$ once to produce the frequent item set $F_i$ and the corresponding supports. Compare $F_i$ with all transactions of the transaction databases $D_i$ of the corresponding time period $[s_i, e_i]$ in every cycle other than the first, obtaining the cycle-frequent item set F (the 1-itemset formed by each item x in F is a cycle-frequent itemset) and the periodic support of each x. Sort F in descending order of the periodic support of x to generate the cycle-frequent item list $L_i$. Here $r_{ti}$ denotes, for an item x in F, the ratio of the number of transactions containing x in the i-th time period $[s_i, e_i]$ of the t-th cycle to the number of all transactions in that time period. Over the i-th time period $[s_i, e_i]$ of all n cycles, the periodic support of x is $s = \min\{r_{1i}, r_{2i}, \ldots, r_{ni}\}$.
2) Create the root node of FP-tree$_i$ and label it "null". Process each transaction t in $D_i$ as follows: select the frequent items of t according to the cycle-frequent item list L, delete the non-frequent items, and sort the remaining frequent items in the order in which they appear in L. Denote the sorted frequent item list of t by [p|P], where p is the first element and P is the list of remaining elements, then call insert_tree([p|P], T). The function insert_tree([p|P], T) proceeds as follows: if T has a child node N with N.item_name = p.item_name, increase the count of N by 1; otherwise create a new node N with count 1, set its parent to T, and link it to the nodes that carry the same item name. If P is non-empty, recursively call insert_tree(P, N).
5. The periodic pruning technique based on conditional FP-trees according to claims 1 and 3, characterized in that FP-growth is improved by adding the capability of finding cycle-frequent itemsets, yielding CFP-growth. CFP-growth periodically prunes the conditional FP-tree of each pattern β in FP-growth. The set of itemsets formed from pattern β and the conditional FP-tree of β (for example, if β is {kma} and the conditional FP-tree of β is {(f:3, b:3)}|kma, the set formed is {{fkam}, {bkam}}) is compared with all transactions in the corresponding time period $[s_j, e_j]$ of the other cycles; the itemsets whose periodic support is less than $s_{min}$ are pruned from the set, and the periodic conditional FP-tree of pattern β is generated. When the traversal of the FP-tree of each time period $[s_j, e_j]$ of the first cycle is finished, all cycle-frequent itemsets have been generated.
6. The discovery of all strong periodic association rules according to claims 1 and 3, characterized in that the strong periodic association rules of each time period $[s_i, e_i]$ of each cycle can be produced directly from all cycle-frequent itemsets found in that time period. Specifically, for every non-empty subset a of each cycle-frequent itemset L, if support(L)/support(a) >= $c_{min}$, there is a strong periodic association rule $a \rightarrow (L - a)\ [C, s_i, e_i, s_{a \rightarrow (L-a)}, c_{a \rightarrow (L-a)}]$.
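As an illustration of the FP-tree construction described in claim 4, the following is a minimal Python sketch, not the patented implementation. It assumes that `cycles` holds, for one time period $[s_i, e_i]$, one non-empty list of transactions (Python sets) per cycle, with the first cycle at index 0. The names `Node`, `periodic_support`, `cycle_frequent_list`, `build_fp_tree`, the `header` dictionary used for node links, and the threshold parameter `s_min` are hypothetical choices made for this example. Ties in periodic support are broken alphabetically, which the claim does not specify.

```python
class Node:
    """One FP-tree node: an item, a count, a parent, and children keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def periodic_support(item, cycles):
    """s = min over cycles of (transactions containing item) / (all transactions)."""
    return min(sum(item in t for t in db) / len(db) for db in cycles)


def cycle_frequent_list(cycles, s_min):
    """List L: items of the first cycle whose periodic support reaches s_min,
    sorted by descending periodic support (assumed tie-break: item name)."""
    items = {x for t in cycles[0] for x in t}
    support = {x: periodic_support(x, cycles) for x in items}
    frequent = [x for x in items if support[x] >= s_min]
    return sorted(frequent, key=lambda x: (-support[x], x))


def insert_tree(items, node, header):
    """Insert the ordered list [p|P] below `node`, following insert_tree in claim 4."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:
        child = Node(p, parent=node)
        node.children[p] = child
        header.setdefault(p, []).append(child)  # link nodes sharing the same item
    child.count += 1
    insert_tree(rest, child, header)


def build_fp_tree(cycles, s_min):
    """Build the FP-tree of the first cycle, keeping only cycle-frequent items."""
    L = cycle_frequent_list(cycles, s_min)
    rank = {x: i for i, x in enumerate(L)}
    root, header = Node(None), {}  # the root plays the role of the "null" node
    for t in cycles[0]:
        ordered = sorted((x for x in t if x in rank), key=rank.get)
        insert_tree(ordered, root, header)
    return root, header, L


if __name__ == "__main__":
    cycles = [
        [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],  # first cycle
        [{"a", "b"}, {"a"}, {"b", "c"}],            # second cycle
    ]
    root, header, L = build_fp_tree(cycles, s_min=0.5)
    print(L, root.children["a"].count)
```

In this toy input, item c is frequent in the first cycle but appears in only one of three transactions in the second, so its periodic support falls below the assumed threshold and it never enters the tree.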
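The rule generation step of claim 6 can be sketched as follows. This is a hedged example under stated assumptions rather than the patent's own code: `cycles` is a list with one non-empty list of transactions (Python sets) per cycle for a single time period, the supports in support(L)/support(a) are taken to be periodic supports (the minimum over all cycles), and only proper non-empty subsets a are enumerated so that the consequent L − a is never empty. The function names and the returned tuple layout are hypothetical.

```python
from itertools import combinations


def periodic_support(itemset, cycles):
    """Minimum, over all cycles, of the fraction of transactions containing itemset."""
    itemset = frozenset(itemset)
    return min(sum(itemset <= t for t in db) / len(db) for db in cycles)


def strong_periodic_rules(frequent_itemsets, cycles, c_min):
    """Return (antecedent, consequent, support, confidence) for every rule
    a -> (L - a) whose confidence support(L) / support(a) is at least c_min."""
    rules = []
    for L in frequent_itemsets:
        L = frozenset(L)
        s_L = periodic_support(L, cycles)
        # Assumption: proper non-empty subsets only, so that L - a is non-empty.
        for k in range(1, len(L)):
            for a in combinations(sorted(L), k):
                a = frozenset(a)
                s_a = periodic_support(a, cycles)
                if s_a == 0:
                    continue  # cannot happen for cycle-frequent itemsets; guards division
                confidence = s_L / s_a
                if confidence >= c_min:
                    rules.append((a, L - a, s_L, confidence))
    return rules


if __name__ == "__main__":
    cycles = [
        [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}],  # first cycle
        [{"a", "b"}, {"a"}, {"a", "b"}, {"b"}],  # second cycle
    ]
    for rule in strong_periodic_rules([{"a", "b"}], cycles, c_min=0.8):
        print(rule)
```

With this toy data the itemset {a, b} has periodic support 0.5, item a has 0.75 and item b has 0.5, so only the rule b → a passes the assumed confidence threshold of 0.8.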

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006100529523A CN101127037A (en) 2006-08-15 2006-08-15 Periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2006100529523A CN101127037A (en) 2006-08-15 2006-08-15 Periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering

Publications (1)

Publication Number Publication Date
CN101127037A true CN101127037A (en) 2008-02-20

Family

ID=39095069

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2006100529523A Pending CN101127037A (en) 2006-08-15 2006-08-15 Periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering

Country Status (1)

Country Link
CN (1) CN101127037A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521314B (en) * 2011-12-01 2016-01-20 北京人大金仓信息技术股份有限公司 A kind of what-if analysis methodology based on Interval Coding and query rewrite
CN102521314A (en) * 2011-12-01 2012-06-27 北京人大金仓信息技术股份有限公司 What-if analysis method based on interval coding and query rewriting
CN102521325A (en) * 2011-12-02 2012-06-27 西北工业大学 XML (Extensible Markup Language) structural similarity measuring method based on frequency-associated tag sequence
CN102930372A (en) * 2012-09-25 2013-02-13 浙江图讯科技有限公司 Data analysis method for association rule of cloud service platform system orienting to safe production of industrial and mining enterprises
CN103341506A (en) * 2013-07-10 2013-10-09 鞍钢股份有限公司 Strip-shaped time series data mining method based on data patterns
CN103341506B (en) * 2013-07-10 2015-03-11 鞍钢股份有限公司 Strip-shaped time series data mining method based on data patterns
CN104346169A (en) * 2014-10-14 2015-02-11 济南大学 Process object raw data time series finding and adjusting method
CN104346169B (en) * 2014-10-14 2017-06-16 济南大学 A kind of flow object initial data sequential finds and method of adjustment
CN104731925A (en) * 2015-03-26 2015-06-24 江苏物联网研究发展中心 MapReduce-based FP-Growth load balance parallel computing method
CN104750830A (en) * 2015-04-01 2015-07-01 东南大学 Cycle mining method of time series data
CN105550281A (en) * 2015-12-10 2016-05-04 复旦大学 Rapid search method for searching longest significant sub-sequence on large-scale time sequence
CN105608510A (en) * 2015-12-31 2016-05-25 连云港杰瑞电子有限公司 Traffic period automatic division method based on Fisher algorithm
CN106095930A (en) * 2016-06-12 2016-11-09 西南石油大学 Petroleum Production data Frequent Pattern Mining method based on weak asterisk wildcard
CN108182178A (en) * 2018-01-25 2018-06-19 刘广泽 Groundwater level analysis method and system based on event text data mining
CN108173876A (en) * 2018-01-30 2018-06-15 福建师范大学 Dynamic rules base construction method based on maximum frequent pattern
CN108173876B (en) * 2018-01-30 2020-11-06 福建师范大学 Dynamic rule base construction method based on maximum frequent pattern
CN109344150A (en) * 2018-08-03 2019-02-15 昆明理工大学 A kind of spatiotemporal data structure analysis method based on FP-tree
WO2020178662A1 (en) * 2019-03-01 2020-09-10 International Business Machines Corporation Association rule mining system
US11036741B2 (en) 2019-03-01 2021-06-15 International Business Machines Corporation Association rule mining system
GB2594901A (en) * 2019-03-01 2021-11-10 Ibm Association rule mining system
CN110390160A (en) * 2019-07-19 2019-10-29 浪潮(北京)电子信息产业有限公司 A kind of periodicity detection methods of clock signal, device and relevant device
CN110390160B (en) * 2019-07-19 2022-03-22 浪潮(北京)电子信息产业有限公司 Method and device for detecting period of time sequence signal and related equipment
CN113593262A (en) * 2019-11-14 2021-11-02 北京百度网讯科技有限公司 Traffic signal control method, traffic signal control device, computer equipment and storage medium
CN116226231A (en) * 2023-02-23 2023-06-06 北京思维实创科技有限公司 Data segmentation method and related device
CN116226231B (en) * 2023-02-23 2023-10-27 北京思维实创科技有限公司 Data segmentation method and related device

Similar Documents

Publication Publication Date Title
CN101127037A (en) Periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering
Cao et al. A dissimilarity measure for the k-modes clustering algorithm
Lühr et al. Incremental clustering of dynamic data streams using connectivity based representative points
CN100416560C (en) Method and apparatus for clustered evolving data flow through on-line and off-line assembly
CN104216874B (en) Method and system for mining weighted positive and negative patterns between Chinese words based on correlation coefficient
Chen et al. A survey of approximate quantile computation on large-scale data
CN113052225A (en) Alarm convergence method and device based on clustering algorithm and time sequence association rule
Al Aghbari et al. On clustering large number of data streams
CN109582714A (en) A kind of government affairs item data processing method based on time fading correlation
Apiletti et al. Pampa-HD: A parallel MapReduce-based frequent pattern miner for high-dimensional data
Gou et al. A* search: an efficient and flexible approach to materialized view selection
CN102799616A (en) Outlier point detection method in large-scale social network
Sari et al. Optimization of the FP-Growth Algorithm in Data Mining Techniques to Get the Electric Power Theft Pattern for the Development of Smart City
Ma et al. POD: A parallel outlier detection algorithm using weighted kNN
Pham et al. Fast streaming algorithms for k-submodular maximization under a knapsack constraint
CN110287237B (en) Social network structure analysis based community data mining method
Zhou et al. A graph clustering algorithm using attraction-force similarity for community detection
Sharma et al. A survey on clustering algorithms for data streams
Ansarifar et al. A novel algorithm for adaptive data stream clustering
Visalakshi et al. Distributed data clustering: A comparative analysis
Al-Khamees et al. Survey: Clustering techniques of data stream
Zhang et al. Self‐Adaptive K‐Means Based on a Covering Algorithm
Chen et al. Research and application of cluster analysis algorithm
Yang et al. Research on topic mining algorithm based on deep learning extension
Kowsalya et al. A weighted frequent itemset mining algorithm for intelligent decision in smart system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C57 Notification of unclear or unknown address
DD01 Delivery of document by public notice

Addressee: Zeng Bin

Document name: Notification of Publication of the Application for Invention

C57 Notification of unclear or unknown address
DD01 Delivery of document by public notice

Addressee: Zeng Bin

Document name: Notification of the application for patent for invention to go through the substantive examination procedure

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication