CN101127037A

CN101127037A - Periodic association rule discovery algorithm based on clustering of time-series vectors by the difference sequence method

Info

Publication number: CN101127037A
Application number: CNA2006100529523A
Authority: CN
Inventors: 曾斌; 曾凯; 姜小丽; 王宇熙
Original assignee: LINAN MICROGRID INFORMATION ENGINEERING Co Ltd
Current assignee: LINAN MICROGRID INFORMATION ENGINEERING Co Ltd
Priority date: 2006-08-15
Filing date: 2006-08-15
Publication date: 2008-02-20

Abstract

The invention relates to a periodic association rule discovery algorithm based on clustering time-series vectors with the difference sequence method. First, to address the weakness of existing periodic association rule discovery algorithms in dividing a cycle into time segments, an algorithm called CMDSA is proposed. CMDSA takes the time-series vector formed by the item supports as the feature of the time-domain data and clusters on it, while the DB Index criterion controls the number of clusters so as to reach the best clustering result. Each time segment of a periodic association rule can therefore be identified more accurately, and more useful periodic association rules can be found than with existing algorithms. Second, because existing periodic association rule algorithms are all based on the Apriori algorithm and are inefficient, an algorithm called CFP-tree, based on the FP-tree, is proposed. CFP-tree uses a periodic pruning technique based on the conditional FP-tree to improve efficiency, so that the CFP-tree-based periodic association rule discovery algorithm is far better than the earlier Apriori-based algorithms in both time and space efficiency.

Description

Periodic association rule discovery algorithm based on clustering of time-series vectors by the difference sequence method

1 Technical field

The present invention relates to an algorithm for discovering periodic association rules in time series in the field of data mining. It specifically concerns periodic temporal association rules between attribute states under temporal constraints, and is applicable to problems in which the states of a finite set of attributes are associated periodically over time. Temporal association rules are defined for equivalent event mappings, for different attributes and for the same attribute, and are extracted by computing support and confidence. The main steps of an algorithm that mines the periodic association rules of a time series while determining the validity of the temporal association rules are given.

2 Background art

Changes in the real world are all bound up with time, so periodic association rules in temporal data can help people make correct decisions. Studying them in real-world data is of important and far-reaching significance for discovering periodic economic laws and for most fields such as prediction and disaster prevention.

Research on periodic association rules within time regions is currently at an early stage both in China and abroad. For example, Ouyang Weimin [1] proposed discovering association rules with temporal constraints, but did not discuss periodic association rules. In "Cyclic Association Rules" by Ozden B. [2], the time segments of periodic association rules are fixed manually: the user supplies a time unit and a cycle length (an integral multiple of the time unit) from experience, the data are divided into time segments of equal length, and periodic association rules are then mined from the transactions in these segments. This often makes the division into time segments very inaccurate, and some periodic association rules are likely to be missed. For example, let the time unit be 1 hour and the cycle length 24 hours. The periodic association rule milk → bread (7 AM to 8 AM) may fail to hold if milk → bread is mainly distributed over 6:45 AM to 7:20 AM, so the association rule "customers who buy milk between 6:45 AM and 7:20 AM every day also buy bread" cannot be found. "Study of Frequent Cyclic Association Rule" by Huang Yimin [3] is mainly an improvement of the algorithm in Ozden's "Cyclic Association Rules".

The main problems of existing periodic association rule discovery algorithms are as follows:

Problem one: selection of the time-domain data feature points

To address the problem in Ozden's "Cyclic Association Rules" [2], Xu Min proposed a new periodic association rule discovery model [4][5] that uses cluster analysis to divide a cycle into time segments of different lengths, so that periodic association rules can be found more accurately. However, this model takes the number of transactions occurring at each moment as the time-domain data feature point for clustering. Its clustering is performed on transactions, whereas each item has its own distribution pattern, so this way of clustering cannot reflect the pattern of a single item. We illustrate the problem with an example. Suppose that over some period the situation from hour 0 to hour 14 of every day is as shown in Fig. 1. Clustering by the number of transactions at each moment, as in [4][5], groups the transaction counts of time regions 1-5 and 6-12 into one class each. Clustering by the number of occurrences of the item at each moment groups the counts of item A (the number of transactions containing item A) in time region 3-8 into one class. The supports of item A are then:

time region 3-8: (15×5) / (25×2 + (25+20)/2 + 20×2) ≈ 66%;
time region 1-5: (2.5 + (2.5+15)/2 + 15×2) / (25×4) ≈ 41.3%;
time region 6-12: (15×2 + (15+2.5)/2 + 2.5×3) / (20×6) ≈ 38.6%.

If the minimum support is 54%, the clustering method of [4][5] cannot find that item A is frequent, whereas clustering by the number of item occurrences at each moment can.

Although clustering by the number of item occurrences at each moment, as in [21], solves the problem of [4][5], it still has a problem of its own: it considers only the number of occurrences of the item at each moment and ignores the total number of transactions at each moment. Two moments may have the same item count but different total transaction counts, and hence different item supports. Yet only the item support at each moment truly reflects and decides whether the item is frequent. Clustering by item count therefore looks at the item count one-sidedly and cannot reflect the inherent pattern of the item support at each moment. We again use the example of Fig. 1. Clustering by item count groups 3-8 into one class; if the minimum support is raised to 72%, item A is not frequent in 3-8. But if clustering is performed on the ratio of the number of transactions containing the item at each moment to the total number of transactions at that moment (the support of item A at each moment), the supports of item A in time segment 6-8 are grouped into one class whose item A support is approximately 75%, so item A is frequent in 6-8. Clustering by the item support at each moment thus finds the frequent item A in time segment 6-8, whereas clustering by item count misses it.
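The support figures quoted for Fig. 1 can be checked with a few lines of arithmetic. The sketch below uses only the numbers that appear in the example above (Fig. 1 itself is not reproduced here); the helper name `support` and the variable names are illustrative assumptions.

```python
# Supports of item A, recomputed from the figures quoted in the example.
# Numerator: transactions containing A; denominator: all transactions.

def support(containing_a: float, total: float) -> float:
    """Support = transactions containing the item / all transactions."""
    return containing_a / total

# Time region 3-8 (found when clustering by item count)
seg_3_8 = support(15 * 5, 25 * 2 + (25 + 20) / 2 + 20 * 2)      # about 0.667
# Time region 1-5 (found when clustering by transaction count)
seg_1_5 = support(2.5 + (2.5 + 15) / 2 + 15 * 2, 25 * 4)        # 0.4125
# Time region 6-12 (found when clustering by transaction count)
seg_6_12 = support(15 * 2 + (15 + 2.5) / 2 + 2.5 * 3, 20 * 6)   # about 0.385

MIN_SUPPORT = 0.54
for name, s in [("3-8", seg_3_8), ("1-5", seg_1_5), ("6-12", seg_6_12)]:
    print(f"region {name}: support {s:.4f}, frequent: {s >= MIN_SUPPORT}")
```

Only region 3-8 clears the 54% threshold, which matches the conclusion drawn in the text.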

The present invention therefore adopts the CMDSA algorithm, which clusters on the item supports at each moment. Of course, more than the single item A of the example occurs at each moment; many items occur. At each moment we therefore take the support of each item as one component of a vector, so that the supports of all items at that moment form a time-series vector, and it suffices to cluster these time-series vectors.
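As a minimal sketch of how such time-series vectors could be built, assume the transactions of one cycle are already grouped by moment and each transaction is a set of item names. The function names and the toy data are hypothetical and not part of the original text.

```python
from typing import List, Set

def support_vector(transactions: List[Set[str]], items: List[str]) -> List[float]:
    """Support of each item among the transactions of a single moment."""
    total = len(transactions)
    if total == 0:
        return [0.0] * len(items)
    return [sum(item in t for t in transactions) / total for item in items]

def support_vector_sequence(by_moment: List[List[Set[str]]],
                            items: List[str]) -> List[List[float]]:
    """One support vector per moment, in chronological order."""
    return [support_vector(ts, items) for ts in by_moment]

# Toy data: two moments, three items (illustrative only).
by_moment = [
    [{"milk", "bread"}, {"milk"}, {"eggs"}, {"bread"}],
    [{"milk", "bread"}, {"milk", "bread"}],
]
E = support_vector_sequence(by_moment, ["milk", "bread", "eggs"])
```

Each component lies in [0, 1], so moments with very different transaction volumes remain comparable.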

Problem two: determining the number of time segments in a cycle

References [2][4][5] share another problem: the number of time segments into which a cycle is divided is fixed manually. Although the lengths of the time segments in the periodic association rule model of [4][5] are obtained automatically by clustering on the concentration of transaction counts, the clustering is the best one found by the Fisher algorithm for a manually fixed number of clusters. This ignores another indicator of clustering quality, the number of clusters. The number of clusters should be judged from the actual data rather than prescribed manually; only when a cluster validity function [11][12][13] determines the number of clusters from the actual data can the clustering reflect the real variation of the data and reach the best effect. The present invention uses the DB Index criterion [9][19] to judge cluster validity and determine the best number of clusters.

Problem three: choice of the underlying algorithm for discovering periodic association rules

The periodic association rule algorithms of [2][3][4][5][21] are all based on the Apriori algorithm [6], as are the algorithm for discovering association rules with temporal constraints in [1] and the algorithm for mining partial periodic patterns in [18]. They suffer from very large candidate sets to process, time-consuming matching of patterns against database transactions, high resource consumption and low running efficiency. The present invention therefore replaces Apriori with the FP-tree.

3 Summary of the invention

The present invention proposes a periodic association rule discovery algorithm based on clustering time-series vectors with the difference sequence method [7] (CARDSATSV). It consists of two parts, CMDSA and CFP-tree. First, CMDSA clusters the time-series vectors formed by the item supports, using the difference sequence method [7] and the DB Index criterion [9][19], to determine dynamically each time region of the association rules within a cycle; the DB Index criterion [9][19] controls the number of clusters so as to reach the best clustering effect. CMDSA addresses problems one and two and finds as many useful association rules as possible. Then, for problem three, CFP-tree discovers periodic association rules in the transaction database of each time region of the cycle with a method based on the FP-tree [8][16]. Exploiting the clear advantage of the FP-tree over the Apriori algorithm, CFP-tree uses a periodic pruning technique based on the conditional FP-tree to raise efficiency substantially. Theory and experiment show that the FP-tree-based periodic association rule algorithm CFP-tree is far superior in time and space efficiency to the Apriori-based periodic association rule algorithms.

4 Description of drawings

Fig. 1 Example for the case study of periodic association rules

Fig. 2 Frequent itemset generation tree for time segment [sj, ej] of the first cycle

Fig. 3 Periodic frequent itemset generation tree for time segment [sj, ej]

Fig. 4 Comparison of the number of useful periodic frequent itemsets found by CARDSATSV and by the periodic association rule model of [4][5]

Fig. 5 Comparison of the running times of approach two and approach three when T = 30 days

Fig. 6 Comparison of the running times of approach two and approach three when T = 30 days

Fig. 7 Comparison of the running times of CARDSATSV and the Apriori-based periodic association rule algorithm under different minimum supports when T = 30 days

5 Embodiments

(1) Basic concepts and properties of time-domain data

Definition 1 (time-domain data). Time-domain data are a set of transactions that carry a time attribute. Let the time region of the whole transaction set be T. T can be written as T = ∪ T_i, with T_i ∩ T_j = ∅ and |T_i| = |T_j| for i ≠ j, where i, j = 1, 2, ..., n and |T_i| denotes the time length of T_i. Here |T_i| is called the cycle length and T_i the i-th cycle; |T_i| is user-defined, for example one year, one month or one week. The goal of the present invention is to find the association relations between frequent items within certain time segments of every cycle T_i.

Definition 2 (time-series vector). Given a series of observations obtained in chronological order, each of which is an n-dimensional vector, these vectors with a time attribute are called time-series vectors.

The sequence formed by time-series vectors is itself a kind of time series (for time-series concepts see [17]). The notation for time-series vectors is as follows:

1) The time-series vector sequence formed by the time-series vectors is written E = {e_i | i = 1, ..., m}.

2) Each time-series vector e_i = <x_1, ..., x_n> is an n-dimensional vector. In the present invention the supports of the items occurring at each moment form the components of e_i, and each component of e_i takes real values in the closed interval [0, 1].

3) Lend(E) = m denotes the length of the time-series vector sequence, i.e. the number of vectors in E.

4) Time(E) denotes the time span covered by the sequence E, i.e. the interval between the moment of the first element and the moment of the last element of E.

5) In the present invention the time interval between any two adjacent time-series vectors in E is defined to be equal. Granularity(E) is called the time granularity and denotes the time interval between two adjacent elements of E.

6) E_S denotes a subsequence of E. If two subsequences E_S1 and E_S2 of E have no element in common, E_S1 and E_S2 are said to be non-overlapping.

Definition 3 (periodic itemset). A periodic itemset is also an itemset; it adds a definition of the periodic property to the definition of an itemset. A periodic itemset over an itemset X is written X[C, s_i, e_i, s_X], where C is the cycle length, s_i and e_i are the start and end of the i-th time segment within a cycle, and s_X is the periodic support of X[C, s_i, e_i, s_X] in the i-th time region [s_i, e_i] over all cycles. Specifically, r_ti denotes the ratio of the number of transactions containing X in the i-th time segment [s_i, e_i] of the t-th cycle to the number of all transactions in that segment, i.e. the support of X in [s_i, e_i] of cycle t. Over the i-th time segments [s_i, e_i] of all n cycles, s_X = min{r_1i, r_2i, ..., r_ni}.

Definition 4 (periodic frequent itemset). If the periodic itemset X[C, s_i, e_i, s_X] satisfies s_X ≥ s_min over the i-th time segments [s_i, e_i] of all n cycles, then X[C, s_i, e_i, s_X] is called a periodic frequent itemset, where s_min is the minimum support threshold.
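Definitions 3 and 4 can be sketched in code as follows. The sketch assumes that the transactions falling into the same time segment [s_i, e_i] have already been collected for each cycle; the function names and toy data are hypothetical.

```python
from typing import List, Set

def support(itemset: Set[str], transactions: List[Set[str]]) -> float:
    """r_ti: share of the segment's transactions that contain the itemset."""
    if not transactions:
        return 0.0
    return sum(itemset <= t for t in transactions) / len(transactions)

def periodic_support(itemset: Set[str],
                     segment_by_cycle: List[List[Set[str]]]) -> float:
    """s_X = min over all cycles of the support within the same segment."""
    return min(support(itemset, txs) for txs in segment_by_cycle)

def is_periodic_frequent(itemset: Set[str],
                         segment_by_cycle: List[List[Set[str]]],
                         s_min: float) -> bool:
    """Definition 4: frequent in the segment of every cycle."""
    return periodic_support(itemset, segment_by_cycle) >= s_min

# Toy data: the same time segment observed in two cycles.
cycles = [
    [{"milk", "bread"}, {"milk", "bread"}, {"eggs"}],  # support 2/3
    [{"milk", "bread"}, {"milk"}],                     # support 1/2
]
s_x = periodic_support({"milk", "bread"}, cycles)
```

Taking the minimum means a single weak cycle is enough to disqualify an itemset, which is what makes the periodic pruning described later possible.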

Definition 5 (general periodic frequent itemset). As the time region T of the whole transaction set tends to infinity, the number of cycles n → ∞. If the periodic support of the periodic itemset X[C, s_i, e_i, s_X] in the time segment [s_i, e_i] of every cycle satisfies s_X ≥ s_min, then X[C, s_i, e_i, s_X] is called a general periodic frequent itemset.

Definition 6 (periodic association rule). A periodic association rule is an implication of the form X → Y[C, s_i, e_i, s_{X→Y}, c_{X→Y}], where X and Y are periodic itemsets, C is the cycle length, s_i and e_i are the start and end of the i-th time segment within a cycle, and s_{X→Y} and c_{X→Y} are the periodic support and periodic confidence of the rule in the i-th time region [s_i, e_i] over all cycles. Here s_ti denotes the ratio of the number of transactions containing X ∪ Y in the i-th time segment [s_i, e_i] of the t-th cycle to the number of all transactions in that segment, and c_ti denotes the ratio of the number of transactions containing X ∪ Y in that segment to the number of transactions containing X. Over the i-th time segments [s_i, e_i] of all n cycles, s_{X→Y} = min{s_1i, s_2i, ..., s_ni} and c_{X→Y} = min{c_1i, c_2i, ..., c_ni}.

Definition 7 (strong and weak periodic association rules). For a periodic association rule X → Y[C, s_i, e_i, s_{X→Y}, c_{X→Y}], if s_{X→Y} ≥ s_min and c_{X→Y} ≥ c_min over the i-th time segments [s_i, e_i] of all n cycles, the rule is called a strong periodic association rule. If the support and confidence of the rule reach s_min and c_min in the time segment [s_i, e_i] of at least one of the n cycles, the rule is called a weak periodic association rule. Here s_min and c_min are the minimum support and minimum confidence defined for mining valid association rules.
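Definitions 6 and 7 can be illustrated with a short sketch. It assumes, as before, that the transactions of the same time segment are given per cycle; the names `rule_stats` and `classify_rule` and the label "none" for rules that are neither strong nor weak are assumptions made for illustration.

```python
from typing import List, Set, Tuple

def rule_stats(x: Set[str], y: Set[str],
               transactions: List[Set[str]]) -> Tuple[float, float]:
    """(s_ti, c_ti): support and confidence of X -> Y in one cycle's segment."""
    n_x = sum(x <= t for t in transactions)
    n_xy = sum((x | y) <= t for t in transactions)
    s = n_xy / len(transactions) if transactions else 0.0
    c = n_xy / n_x if n_x else 0.0
    return s, c

def classify_rule(x: Set[str], y: Set[str],
                  segment_by_cycle: List[List[Set[str]]],
                  s_min: float, c_min: float) -> str:
    """'strong' if every cycle meets both thresholds, 'weak' if at least one does."""
    passes = [s >= s_min and c >= c_min
              for s, c in (rule_stats(x, y, txs) for txs in segment_by_cycle)]
    if all(passes):
        return "strong"   # equivalent to min s_ti >= s_min and min c_ti >= c_min
    if any(passes):
        return "weak"
    return "none"

cycles = [
    [{"milk", "bread"}, {"milk", "bread"}, {"milk"}, {"eggs"}],  # s = 0.5, c = 2/3
    [{"milk", "bread"}, {"eggs"}, {"eggs"}, {"milk"}],           # s = 0.25, c = 0.5
]
label = classify_rule({"milk"}, {"bread"}, cycles, s_min=0.4, c_min=0.6)
```

Under Definition 7 every strong rule also satisfies the weak condition; the sketch simply reports the stronger label first.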

Property 1. By Definition 5, every subset of a periodic frequent itemset A is a periodic frequent itemset in the time region of A in all cycles.

Theorem 1. For any periodic frequent itemset X[C, s_i, e_i, s_X] (that is, X[C, s_i, e_i, s_X] is a periodic frequent itemset in the time segment [s_i, e_i]), there is in the corresponding time segment [s_i, e_i] of every cycle at least one frequent itemset Y that contains X, i.e. X ⊆ Y.

Proof: In the corresponding time segment [s_i, e_i] of each cycle, the itemset Y can at least be taken to be X itself. By Definition 3, the periodic support s_X of X[C, s_i, e_i, s_X] is the minimum of the supports of X over the corresponding time segments [s_i, e_i] of all cycles, so the support of Y in each cycle is at least s_X. Since X[C, s_i, e_i, s_X] is a periodic frequent itemset, Definition 4 gives s_X ≥ s_min. Hence Y is a frequent itemset and the conclusion holds.

Property 2. As the time region T of the whole transaction set keeps growing (i.e. as the number of cycles keeps growing), the periodic frequent itemsets found in T approach the general periodic frequent itemsets and the gap between the two tends to 0. When T is infinite, the periodic frequent itemsets found in T equal the general periodic frequent itemsets and the gap is 0. At a deeper level, the larger the span of T, the more abstract, general and universal the periodic association rules that are found.

(2) Clustering algorithm for time-series vectors based on the difference sequence method and the DB Index criterion (CMDSA)

Clustering is carried out once, in the first cycle, to determine the time regions of the association rules within a cycle; the other cycles are subsequently divided in the same way.

The whole transaction time region T contains several cycles. Because the transactions differ from cycle to cycle, clustering each cycle separately inevitably yields different time-segment divisions. Suppose, for example, that T has 3 cycles. After clustering the first cycle, the transactions of 8:30-9:30 form one class and the frequent itemset A exists in 8:30-9:30; after clustering the second cycle, the transactions of 8:50-9:50 form one class and A exists in 8:50-9:50; after clustering the third cycle, the transactions of 9:00-10:00 form one class and A exists in 9:00-10:00. The periodic frequent itemset A then necessarily lies in the intersection 9:00-9:30 of the time segments that contain A in the three cycles. Hence, if the time-segment division is obtained by clustering any one cycle i of T, Theorem 1 shows that the frequent itemsets obtained in cycle i under this division must include all periodic frequent itemsets. It is therefore enough to cluster a single cycle (without loss of generality the first) to determine the time regions of the association rules within a cycle, then find all frequent itemsets in the first cycle and compare each of them against all transactions of the corresponding time segments in the other cycles (that is, apply periodic pruning). This finds all periodic frequent itemsets and, from them, all strong periodic association rules.
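The three-cycle example can be read as an interval intersection. The following sketch, with times expressed as minutes since midnight and hypothetical helper names, reproduces the 9:00-9:30 result.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) in minutes since midnight

def hm(hours: int, minutes: int) -> int:
    """Convert a clock time to minutes since midnight."""
    return 60 * hours + minutes

def intersect(segments: List[Interval]) -> Optional[Interval]:
    """Common part of all segments, or None if they do not overlap."""
    start = max(s for s, _ in segments)
    end = min(e for _, e in segments)
    return (start, end) if start <= end else None

# Segments containing frequent itemset A in cycles 1, 2 and 3.
segments = [(hm(8, 30), hm(9, 30)), (hm(8, 50), hm(9, 50)), (hm(9, 0), hm(10, 0))]
common = intersect(segments)  # (540, 570), i.e. 9:00-9:30
```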

We could of course cluster every cycle and then derive, from the different time segments of the individual cycles, a time division that suits all cycles. But with n cycles this requires n clusterings, and the computational cost grows with n, which is not advisable.

Choice of time-domain data feature points for clustering: clustering by the number of transactions at each moment as in [4][5], clustering by the number of items at each moment, and the present invention's clustering by the time-series vector formed from the item supports at each moment are all clusterings of time-domain data; they differ only in the object clustered, i.e. in the choice of time-domain data feature point. References [4][5] take the number of transactions at each moment as the feature point, whereas the present invention takes the time-series vector formed, for each item, by the ratio of the number of transactions containing the item at each moment to the total number of transactions at that moment (the item support).

The time-domain data feature points of the present invention are defined as follows:

1) Within one cycle, the support of each item at each moment (the ratio of the number of transactions containing the item at that moment to the total number of transactions at that moment) forms one component of the time-series vector e_i. The n-dimensional time-series vector is e_i = <x_1, ..., x_n>, and each component of e_i takes real values in the closed interval [0, 1].

2) The time-series vector sequence formed by these time-series vectors is written E = {e_i | i = 1, ..., m}, where the sequence length is Lend(E) = m and Time(E) is one cycle. In the present invention the time interval between any two adjacent time-series vectors in E is defined to be equal, so the time granularity Granularity(E) is a constant.

In the present invention the CMDSA algorithm mainly performs the clustering of the time-series vector sequence E of the first cycle.

Work before clustering the time-series vectors: before clustering begins we face the question of how many classes the time-series vectors of one cycle should be divided into in order to obtain the best clustering result. What is the index for judging the best clustering result? In other words, by what criterion is a clustering result best?

Cluster validity function [11][12][13]: solving the above problem requires a cluster validity function to control the number of clusters; here we use the DB Index criterion [9][19]. Among rules for judging cluster validity, within-class dispersion and between-class distance are often used. The DB Index criterion [9][19] uses both, and the present invention adopts it as the criterion for classification validity. The DB Index criterion is as follows:

1) Within-class dispersion (mean distance to the class center)

S_{i} = \frac{1}{| C_{i} |} \sum_{X \in C_{i}} \| X - Z_{i} \|

where Z_i is the class center of class C_i and |C_i| is the number of samples in class C_i.

2) Between-class distance d_ij = ‖Z_i − Z_j‖, i.e. the between-class distance is the distance between the two class centers.

3) DB Index

{DB}_{k} = \frac{1}{k} \sum_{i = 1}^{k} R_{i}

where

R_{i} = \max_{j = 1, \ldots, k, \; j \neq i} \frac{S_{i} + S_{j}}{d_{ij}}

and k is the number of classes.

Under the DB Index criterion, the smaller the value of DB_k, the better the classification.
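The three formulas translate directly into code. The sketch below assumes the Euclidean norm for ‖·‖ (the text does not name the norm), takes the class center Z_i to be the arithmetic mean of the class, and assumes at least two classes with distinct centers; all names are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def dist(a: Vector, b: Vector) -> float:
    """Euclidean distance (assumed norm)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster: List[Vector]) -> List[float]:
    """Class center Z_i, taken here as the component-wise mean."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def db_index(clusters: List[List[Vector]]) -> float:
    """DB_k = (1/k) * sum_i max_{j != i} (S_i + S_j) / d_ij."""
    k = len(clusters)
    centers = [centroid(c) for c in clusters]
    scatter = [sum(dist(x, z) for x in c) / len(c)        # S_i
               for c, z in zip(clusters, centers)]
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                     for j in range(k) if j != i)          # R_i
    return total / k

# Two compact, well separated classes versus a poor split of the same points.
good = db_index([[[0, 0], [0, 2]], [[10, 0], [10, 2]]])
bad = db_index([[[0, 0], [10, 0]], [[0, 2], [10, 2]]])
```

The compact, well separated partition gets the smaller value, in line with the rule that a smaller DB_k indicates a better classification.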

Optimizing the CMDSA algorithm by narrowing the search range of the best number of clusters c_opt: a cycle has n time-series vectors e, so the number of clusters c can range from 2 to n, and the DB Index criterion together with the difference sequence method determines the best number of clusters c_opt. If c is searched from 2 to n, i.e. 2 ≤ c_opt ≤ n, efficiency becomes a problem: when n is very large, the computational cost of finding c_opt is very high. We therefore need to reduce the upper bound c_max of the search range of c_opt. To determine c_max, many researchers use the empirical rule

c_{\max} \leq \sqrt{n}

which is mentioned in [13]. Reference [15] mentions another rule: c_max ≤ 2 ln n. These rules, however, lack theoretical support. Reference [14] gives a new method for determining c_max that establishes in theory the validity of c_max ≤ √n. Reference [14] discusses the range of the best number of clusters for fuzzy clustering and mainly treats fuzzy partitions; since a hard partition is a special case of a fuzzy partition, the theory of [14] applies equally to the hard clustering of the present invention. Following [14], the present invention therefore adopts the rule

c_{\max} \leq \sqrt{n}

We take the search range of c_opt to be from 2 to √n, and then use the DB Index criterion and the difference sequence method to finally determine the best number of clusters c_opt. Compared with c_max ≤ n, the bound c_max ≤ √n saves a very large amount of computation, and the gap in computational cost between the two bounds grows geometrically as n increases.
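To give a feel for the saving, the sketch below lists the candidate cluster counts under the square-root rule. Rounding √n down to an integer is an assumption made here, since the text only states the inequality; the function name is hypothetical.

```python
import math
from typing import List

def candidate_cluster_counts(n: int) -> List[int]:
    """Cluster counts c with 2 <= c <= floor(sqrt(n))."""
    c_max = math.isqrt(n)  # integer square root, rounded down (assumption)
    return list(range(2, c_max + 1))

# With n = 100 vectors per cycle, 9 candidates are examined instead of 99.
reduced = candidate_cluster_counts(100)
full = list(range(2, 100 + 1))
```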

The difference sequence method [7]: in the present invention we cluster the time-ordered time-series vectors e_i of the first cycle. Such vectors with a time parameter are ordered samples, for which the Fisher algorithm [10][20] could be considered; the clustering by number of transactions at each moment in [4][5] uses exactly the Fisher algorithm [10][20]. However, the Fisher algorithm does not take the order of the samples into account when computing the diameter of each class. Here we use the difference sequence method to cluster the ordered time-series vectors e_i. The difference sequence method not only takes into account the order of samples with a time parameter, but is also simple to compute and gives intuitive results.

Concepts of the difference sequence method [7]: let x_1, x_2, ..., x_m be m ordered samples, each with n index observations. The i-th sample is written x_i = (x_i1, ..., x_in), where x_ij is the j-th index observation of the i-th sample, 1 ≤ i ≤ m, 1 ≤ j ≤ n. A non-negative number g_i = g(x_i, x_{i+1}) represents the difference between the i-th sample x_i and the (i+1)-th sample x_{i+1}, i = 1, 2, ..., m−1, with g_i = g(x_i, x_{i+1}) = 0 when x_i = x_{i+1}. Usually g_i is taken to be the weighted l_p norm

g_{i} = \left[ \sum_{j = 1}^{n} w_{j} {| x_{ij} - x_{i+1, j} |}^{p} \right]^{1 / p}, \quad i = 1, 2, \ldots, m - 1,

where w_j ≥ 0 are the weights. The weights w_j serve mainly to remove differences in scale between indices and to reflect the importance of each index.
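A minimal sketch of the weighted l_p difference between consecutive samples follows. The default of unit weights and p = 2 is an assumption chosen for illustration; the function name is hypothetical.

```python
from typing import List, Optional, Sequence

def difference_sequence(samples: List[Sequence[float]],
                        weights: Optional[Sequence[float]] = None,
                        p: float = 2.0) -> List[float]:
    """g_i = (sum_j w_j * |x_ij - x_(i+1)j|^p)^(1/p) for consecutive samples."""
    if not samples:
        return []
    n = len(samples[0])
    w = [1.0] * n if weights is None else list(weights)
    return [
        sum(wj * abs(a - b) ** p for wj, a, b in zip(w, x, x_next)) ** (1.0 / p)
        for x, x_next in zip(samples, samples[1:])
    ]

g = difference_sequence([[0, 0], [3, 4], [3, 4]])          # p = 2
g_l1 = difference_sequence([[0, 0], [3, 4], [3, 4]], p=1)  # p = 1
```

Identical consecutive samples give a difference of 0, as required by the definition.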

Concept 1 (difference sequence). g_i = g(x_i, x_{i+1}) is called the difference between the i-th and (i+1)-th samples, and (g_1, g_2, ..., g_{m−1}) is called the difference sequence of the samples.

Concept 2 (second-order difference sequence). The difference h_i = h(g_i, g_{i+1}) of (g_1, g_2, ..., g_{m−1}) is called the second-order difference of the samples, usually taken as h_i = |g_i − g_{i+1}|, i = 1, 2, ..., m−2; (h_1, h_2, ..., h_{m−2}) is called the second-order difference sequence of the samples.

Concept 3 (difference sequence method). The method of classifying ordered samples by means of the difference sequence is called the difference sequence method.

Concept 4 (k-class cut points). To gather the ordered samples (x_1, x_2, ..., x_m) into k classes (1 < k < m), first determine k−1 integer points i_1, ..., i_{k−1} satisfying 1 ≤ i_1 < i_2 < ... < i_{k−1} < m, then gather the samples into the k classes (x_1, ..., x_{i_1}), (x_{i_1+1}, ..., x_{i_2}), ..., (x_{i_{k−2}+1}, ..., x_{i_{k−1}}), (x_{i_{k−1}+1}, ..., x_m). The points i_1, ..., i_{k−1} are called the k-class cut points.
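Concepts 2 and 4 can be sketched as two small helpers. Cut points are taken as 1-based positions i_1 < ... < i_{k−1}, matching the notation of the text; the function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def second_differences(g: Sequence[float]) -> List[float]:
    """h_i = |g_i - g_(i+1)| for consecutive differences."""
    return [abs(a - b) for a, b in zip(g, g[1:])]

def split_by_cut_points(samples: Sequence[T], cuts: Sequence[int]) -> List[List[T]]:
    """Split ordered samples into classes; class t ends at sample number cuts[t]."""
    bounds = [0] + sorted(cuts) + [len(samples)]
    return [list(samples[a:b]) for a, b in zip(bounds, bounds[1:])]

h = second_differences([1, 1, 6, 1])
classes = split_by_cut_points([1, 2, 3, 4, 5, 6], [2, 4])
```

With k − 1 cut points the helper always returns k classes, each a contiguous run of the ordered samples.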

Basic idea of the difference sequence method [7]: first consider the difference between each sample and the next, then consider the differences as a whole and select the cut points used for clustering. At a cut point the samples change from one class to another, so the sample difference at a cut point should be large. Because the samples are random, the differences between samples vary even within one class; but near a cut point the sample difference should change sharply, i.e. the second-order difference of the samples should be large. Hence the second-order difference is largest at the cut point that divides the samples into two classes, next largest at the cut point added for three classes, and so on.

The steps of the difference sequence method are: first divide the samples into two classes, then on that basis into three classes, and so on up to k classes. Specifically, first determine the 2-class cut point i_1. Take l (1 ≤ l ≤ m−2) such that

h_{l} = \max_{1 \leq i \leq m - 2} h_{i} \qquad (a)

The cut point i_1 is then determined by formula (b), which takes whichever of the two points l and l+1 has the larger difference as the cut point:

i_{1} = \begin{cases} l, & g_{l} > g_{l + 1} \\ l + 1, & g_{l} < g_{l + 1} \end{cases} \qquad (b)

If l_1 and l_2 (l_1 ≠ l_2) both satisfy (a) and

\max(g_{l_1}, g_{l_1 + 1}) > \max(g_{l_2}, g_{l_2 + 1}) \qquad (c)

then i_1 is determined from l_1 according to (b). Once the 2-class cut point i_1 is determined, the samples (x_1, x_2, ..., x_m) are divided into the two classes (x_1, ..., x_{i_1}) and (x_{i_1+1}, ..., x_m). On this basis the samples can be divided into three classes: find the maximum second-order difference (see (a)) in each of the two classes, and split the class holding the larger of the two values into two classes by the method above. The samples are thus divided into three classes. Continuing in this way divides the samples into k classes.
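The splitting procedure can be sketched as follows, under some stated assumptions. The code works with 0-based indices and half-open class boundaries, so a returned boundary b corresponds to the cut point i = b of the text. When g_l equals g_{l+1} the later point is chosen, and remaining ties between classes are broken arbitrarily; neither case is specified in the text. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

def best_split(g: Sequence[float], h: Sequence[float],
               a: int, b: int) -> Optional[Tuple[float, int]]:
    """Largest second-order difference inside samples[a:b] and its split index."""
    if b - a < 3:          # fewer than 3 samples: no second-order difference
        return None
    # Rule (a), with rule (c) as tie-breaker on the larger neighbouring difference.
    l = max(range(a, b - 2), key=lambda i: (h[i], max(g[i], g[i + 1])))
    # Rule (b): cut after the sample that precedes the larger difference.
    split = l + 1 if g[l] > g[l + 1] else l + 2
    return h[l], split

def difference_sequence_clusters(g: Sequence[float], m: int, k: int) -> List[int]:
    """Boundaries [0, ..., m] of up to k classes over m ordered samples."""
    h = [abs(x - y) for x, y in zip(g, g[1:])]
    bounds = [0, m]
    while len(bounds) - 1 < k:
        found = [best_split(g, h, a, b) for a, b in zip(bounds, bounds[1:])]
        found = [f for f in found if f is not None]
        if not found:
            break                       # no class can be split further
        _, split = max(found)           # split the class with the largest h
        bounds = sorted(bounds + [split])
    return bounds

# Scalar samples forming three obvious groups.
x = [10, 11, 12, 50, 51, 52, 90, 91]
g = [abs(b - a) for a, b in zip(x, x[1:])]
two = difference_sequence_clusters(g, len(x), 2)
three = difference_sequence_clusters(g, len(x), 3)
```

For these samples the three-class result separates the groups around 10, 50 and 90.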

Clustering algorithm (CMDSA) based on the time series vector of diversity sequence method and DB Index criterion: diversity sequence method just classification can't judge that how many classes of branch reach optimal classification, so also need DB INDEX criterion to judge the optimal classification number.The present invention comes the time series vector sequence E of period 1 is carried out cluster in conjunction with both, determines the regularity of distribution of time domain data unique point, and then each time zone of the correlation rule in definite cycle.

The CMDSA algorithm:

1) Step 1: First compute the m-1 diversity factors g[i] and the m-2 secondary diversity factors h[i] of the time series vector sequence E, which has m time series vectors in one period, and store them in a diversity factor array and a secondary diversity factor array.

2) Step 2: Because the cluster number c of the time series vector sequence E ranges from 2 to the maximum cluster number C_max, it is enough to find C_max - 1 cut points. First find the largest secondary diversity factor h[i] in the secondary diversity factor array and determine the corresponding cut point; then find the largest h[i] among the remaining values and determine its cut point, and so on. To avoid rescanning values of h[i] that have already been found, sorting can be used: after the largest h[i] is found, exchange it with h[0] at the first position of the array, exchange the second largest h[i] with h[1] at the second position, and continue up to the (C_max - 1)-th largest h[i].

3) Because sorting disturbs the initial position of each h[i], and the cut point corresponding to h[i] must still be determined after h[i] is found, the initial position of h[i] has to be recorded. Each h[i] is therefore a structure with two components: h[i].data holds the value of the secondary diversity factor, and h[i].place holds the initial position i.

4) Sub-step 1 of step 2: each time an h[i] is found in 2), determine the cut point from it. Specifically, each time an h[i] is determined, use h[i].place and the cut-point formula (b) to determine the cut-point position k_wei, then insert k_wei at the corresponding position of the ordered singly linked list Fen, which stores the cut-point positions in ascending order. After k_wei is inserted, the cut-point positions in Fen remain in ascending order.

5) Sub-step 2 of step 2: each time a new cut point is determined in 4), call the procedure INDEX(c, Fen) on the updated Fen to compute the DB_c value corresponding to the new cut point, and compare DB_c with DB*. If DB_c is less than DB*, overwrite DB* with DB_c and overwrite Fen* with the linked list Fen corresponding to DB_c. When 2) has finished executing, the minimum DB_c value has been found. Because a smaller DB_c value means a better clustering, the linked list Fen corresponding to the minimum DB_c value is the optimal classification.
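The steps above can be sketched as a short loop. This is an illustrative outline under stated assumptions: `cmdsa_select` is a hypothetical name, `db_index` is a caller-supplied stand-in for the INDEX procedure (its computation is not reproduced here), `c_max` is assumed to be given, and duplicate cut points are skipped, which the text does not specify.

```python
import bisect

def cmdsa_select(g, h, c_max, db_index):
    """Pick cut points greedily and keep the partition with the lowest DB value.

    g, h     : dicts of diversity / secondary diversity factors (1-based keys)
    c_max    : largest cluster count to try (assumed given)
    db_index : callable(list_of_cut_points) -> float, stand-in for INDEX
    """
    # Sorting the positions by h (descending) keeps each value tied to its
    # original position, the role of h[i].data / h[i].place in step 3.
    order = sorted(h, key=lambda i: h[i], reverse=True)
    fen, best_fen, best_db = [], None, float("inf")
    for place in order[: c_max - 1]:
        # Cut-point rule as described for formula (b) (assumed: tie -> place).
        k_wei = place if g[place] >= g[place + 1] else place + 1
        if k_wei in fen:           # assumption: ignore a cut point found twice
            continue
        bisect.insort(fen, k_wei)  # Fen stays in ascending order
        db = db_index(fen)
        if db < best_db:           # a smaller DB value means better clustering
            best_db, best_fen = db, list(fen)
    return best_fen, best_db
```

A Python list with `bisect.insort` replaces the singly linked list Fen here; both keep the cut points ordered, and the list makes the copy into Fen* a one-liner.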

(3) Periodic association rule discovery algorithm based on the FP-tree (CFP-tree)

After the CMDSA algorithm has clustered the time series vectors formed by the supports of the items occurring at each moment, and each time region of the association rules within the period has thereby been determined (each time segment is determined from the ordered singly linked list Fen* of optimal cut-point positions obtained above), useful periodic association rules can be searched for in each time region of the period. There are several choices for the periodic association rule discovery algorithm: Apriori-based or FP-tree-based. The present invention adopts the FP-tree-based periodic association rule discovery algorithm (CFP-tree).

Three approaches to finding periodic frequent itemsets:

Approach 1: The most obvious idea is to construct an FP-tree for each time period of every period, generate the frequent itemsets, and then compare the frequent itemsets of the corresponding time periods across all periods to produce the periodic frequent itemsets. The time overhead of this is huge, so it is not advisable.

Approach 2: By Theorem 1 of Section 3, for any periodic frequent itemset there is, in the corresponding time period of every period, a frequent itemset that contains it. So it is enough to mine the frequent itemsets of each time period of one period and then compare them with the transaction databases of the corresponding time periods of all other periods to find all periodic frequent itemsets. An FP-tree can therefore be constructed for each time period of the first period, FP-growth can be used to produce the frequent itemsets of each time period of the first period, and these frequent itemsets can then be compared with the transaction databases of the corresponding time periods of all other periods to produce all periodic frequent itemsets.

Approach 3: Approach 2 can be optimized further. By Theorem 1, an FP-tree can first be constructed for each time period of the first period; then, starting from the periodic frequent item list L, the periodic pruning technique based on the conditional FP-tree is used to generate all periodic frequent itemsets, from which all strong periodic association rules are found. Approach 3 is the CFP-tree algorithm of the present invention.

Constructing an FP-tree for each time segment of the first period [8][16]

CMDSA yields c reasonable time segments in the first period. Each time segment has one transaction database, so there are c transaction databases D_1, ..., D_c (D_i corresponds to the i-th time segment). One FP-tree is built per transaction database, so c FP-trees must be constructed.

Concrete steps for constructing FP-tree_i from the transaction database D_i of each time period [s_i, e_i] of the first period:

1) Scan the transaction database D_i once to produce the frequent item set F_i and the corresponding supports. Compare F_i with all transactions of the transaction databases D_i of the corresponding time period [s_i, e_i] in all periods other than the first, obtaining the periodic frequent item set F (the 1-itemset formed by each item x in F is a periodic frequent itemset) and the periodic support of each x. Sort F in descending order of the periodic support of x to generate the periodic frequent item list L_i. Here r_ti denotes, for an item x in F, the ratio of the number of transactions containing x in the i-th time period [s_i, e_i] of the t-th period to the number of all transactions in that time period. Over the i-th time period [s_i, e_i] of all n periods, the periodic support of x is s = min{r_1i, r_2i, ..., r_ni}.

2) Create the root node of FP-tree_i and label it "null". For each transaction t in D_i, do the following: select the frequent items in t according to the periodic frequent item list L, delete the non-frequent items, and arrange the frequent items of t in the order in which they appear in L. Denote the sorted frequent item list as [p|P], where p is the first element and P is the list of the remaining elements, then call insert_tree([p|P], T). The function insert_tree([p|P], T) works as follows: if T has a child node N with N.item_name = p.item_name, increment the count of N by 1; otherwise create a new node N with count 1, set its parent node to T, and link it to the next node with the same item name. If P is non-empty, recursively call insert_tree(P, N).
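The two construction steps can be sketched as follows. This is a simplified illustration: the names `Node`, `periodic_support` and `build_fp_tree` are assumptions, transactions are modelled as Python sets, ties in the support order are broken alphabetically, and the node links between nodes of the same item are omitted.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def periodic_support(item, periods):
    """s = min over periods of (transactions containing item) / (all transactions)."""
    return min(sum(item in t for t in db) / len(db) for db in periods)

def insert_tree(items, T):
    """insert_tree([p|P], T): follow or create the child for p, then recurse on P."""
    if not items:
        return
    p, P = items[0], items[1:]
    N = T.children.get(p)
    if N is None:                  # no child N with N.item_name == p.item_name
        N = T.children[p] = Node(p, T)
    N.count += 1                   # a new node ends with count 1
    insert_tree(P, N)

def build_fp_tree(periods, s_min):
    """periods[0] is D_i of the first period; periods[1:] hold the same
    time period of the other periods. Returns (root, L)."""
    items = {x for t in periods[0] for x in t}
    support = {x: periodic_support(x, periods) for x in items}
    # Periodic frequent item list L in descending order of periodic support.
    L = sorted((x for x in items if support[x] >= s_min),
               key=lambda x: (-support[x], x))
    rank = {x: r for r, x in enumerate(L)}
    root = Node(None, None)        # the root plays the role of the "null" node
    for t in periods[0]:
        insert_tree(sorted((x for x in t if x in rank), key=rank.get), root)
    return root, L
```

Only the first period's transactions are inserted into the tree; the other periods contribute solely through the periodic support used to build L.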

Periodic pruning technique based on the conditional FP-tree. The idea is to improve the FP-growth of [8][16] by adding the ability to find periodic frequent itemsets. CFP-growth periodically prunes the conditional FP-tree of pattern β in FP-growth: the set formed from pattern β and its conditional FP-tree (for example, if β is {kma} and the conditional FP-tree of β is {(f:3, b:3)}|kma, the set formed is {{fkam}, {bkam}}) is compared with all transactions in the corresponding time segment [s_j, e_j] of the other periods, the itemsets whose periodic support is less than s_min are pruned from the set, and the periodic conditional FP-tree of pattern β is generated. When the traversal of the FP-tree of each time segment [s_j, e_j] of the first period is finished, all periodic frequent itemsets have been generated.

The periodic pruning technique based on the conditional FP-tree is what makes Approach 3 better than Approach 2. Its effectiveness is analysed below.

A) Related concepts and properties

For the concepts of conditional pattern base and conditional FP-tree, see [8][16]; they are not repeated here.

Theorem 2. Let a be an itemset in the transaction database D_i of some time period [s_j, e_j] of some period j, let B be the conditional pattern base of a, let M be the itemset formed by all items appearing in B, and let β be any subset of M, β ⊆ M. Then, in period j, the support support_{a∪β} of a ∪ β in D_i equals the support of β in B, i.e. support_{a∪β} = support_β.

Proof: By the definition of the conditional pattern base in [8][16], all prefix paths leading to a form the conditional pattern base of a, so every transaction that occurs in B occurs in D_i where a occurs. In other words, if a transaction in B contains β, it certainly contains a as well. If an itemset β occurs n times in B, then β also occurs n times in D_i together with a, and all items that occur together with a in D_i are collected in the conditional pattern base of a, so a ∪ β occurs exactly n times in D_i. The conclusion holds.

Corollary 1. Let a be an itemset in the transaction database D_i of some time period [s_j, e_j] of some period j, let B be the conditional pattern base of a, let M be the itemset formed by all items appearing in B, and let β be any subset of M, β ⊆ M. Then, in period j, the support support_{a∪β} of a ∪ β in D_i is less than or equal to the support of a in D_i, i.e. support_{a∪β} ≤ support_a.

Proof: By the definition of the conditional pattern base in [8][16], every transaction that occurs in B occurs in D_i where a occurs. If an itemset β occurs n times in B, then β also occurs n times in D_i together with a, which shows that a occurs at least n times in D_i. Since a may also occur without β, the number of occurrences of a is greater than or equal to n, so support_β ≤ support_a. By Theorem 2, support_{a∪β} = support_β, so support_{a∪β} ≤ support_a.

Corollary 2. Let a be an itemset in time period [s_j, e_j] of some period j, let B be the conditional pattern base of a, let M be the itemset formed by all items appearing in B, and let β be any subset of M, β ⊆ M. Then, in time period [s_j, e_j] of any period k other than period j, the support support_{a∪β} of a ∪ β in the transaction database D_i of time period [s_j, e_j] of period k is less than or equal to the support of a in D_i, i.e. support_{a∪β} ≤ support_a.

Proof: i) B is the conditional pattern base of itemset a in time period [s_j, e_j] of period j, so for ∀β ⊆ M, Corollary 1 gives support_{a∪β} ≤ support_a in time period [s_j, e_j] of period j. ii) In time period [s_j, e_j] of any other period k, however, B is not necessarily the conditional pattern base of a. For ∀β ⊆ M, the position relation of β and a in the FP-tree of time period [s_j, e_j] of period k has three possibilities: 1) β and a are on the same path and β is the prefix path of a; 2) β and a are on the same path and a is the prefix path of β; 3) β and a are not on the same path. iii) In time period [s_j, e_j] of every period, each transaction is inserted into the FP-tree after being sorted in the order of the periodic frequent list L_i of time period [s_j, e_j]. Since B is the conditional pattern base of a in time period [s_j, e_j] of period j, for ∀β ⊆ M every item of β precedes every item of a in L_i. Hence, in time period [s_j, e_j] of any period k other than j, case 2) (a is the prefix path of β) cannot occur. iv) In time period [s_j, e_j] of a period k other than j, if the position relation of β and a in the FP-tree is case 1), Corollary 1 gives that the support support_{a∪β} of a ∪ β in the transaction database D_i of that time period is less than or equal to the support of a in D_i, i.e. support_{a∪β} ≤ support_a; if it is case 3), support_{a∪β} = 0. v) In summary, the conclusion holds.

Theorem 3. Suppose a[C, s_i, e_i, s_a, c_a] is a periodic itemset in time period [s_i, e_i], B is the conditional pattern base of itemset a in time period [s_j, e_j] of some period j, M is the itemset formed by all items appearing in B, and β is any subset of M, β ⊆ M. Then, for the periodic itemset a ∪ β [C, s_i, e_i, s_{a∪β}, c_{a∪β}] in time period [s_i, e_i], the periodic support s_{a∪β} is less than or equal to the periodic support s_a of a[C, s_i, e_i, s_a, c_a], i.e. s_{a∪β} ≤ s_a.

Proof: i) By Definition 3 of Section 2, over the time period [s_i, e_i] of all periods, s_a = min{s_{a,1}, ..., s_{a,n}}, where s_{a,i} is the support of a in time period [s_i, e_i] of the i-th period. Without loss of generality, let s_{a,k} in time period [s_i, e_i] of the k-th period be the minimum, so s_{a,k} = s_a. ii) When k = j, Corollary 1 gives s_{a,k} ≥ s_{a∪β,k}; when k ≠ j, Corollary 2 gives s_{a,k} ≥ s_{a∪β,k} in time period [s_j, e_j] of the period k other than j, where s_{a∪β,k} is the support of a ∪ β in time period [s_j, e_j] of period k. So s_a ≥ s_{a∪β,k}. iii) Moreover, s_{a∪β,k} ≥ min{s_{a∪β,1}, ..., s_{a∪β,n}} = s_{a∪β}, so s_{a∪β} ≤ s_a.
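The argument in the proof of Theorem 3 can be written as a single chain of inequalities. The symbol k^* is introduced here only for illustration and denotes the period in which a has its smallest support; the comma-separated subscripts are a notational choice.

```latex
s_{a \cup \beta}
  = \min_{1 \le k \le n} s_{a \cup \beta,\, k}
  \le s_{a \cup \beta,\, k^{*}}
  \le s_{a,\, k^{*}}
  = \min_{1 \le k \le n} s_{a,\, k}
  = s_{a}
```

The first inequality holds because a minimum is no larger than any of its terms; the second is the per-period bound that the proof obtains for k^* = j and for k^* ≠ j.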

B) Analysis of the periodic pruning technique based on the conditional FP-tree

The example of [8] is used to illustrate the point. The frequent itemset generation tree of that example, Figure 6 of [8], is reproduced as Fig. 2 of the present invention, and it is assumed that the generation tree in Fig. 2 belongs to time period [s_i, e_i] of the first period. Take the generation tree node ma in Fig. 2 as an example, and let M be the itemset formed by all items appearing in the conditional pattern base of ma. For the child nodes mac and maf of ma, {c} ⊆ M and {f} ⊆ M, so by Theorem 3 the periodic support of ma in the FP-tree of Fig. 2 is no less than the periodic supports of all its child nodes mac and maf. In the same way, the periodic supports of mac and maf are no less than those of their own child nodes. So in the subtree rooted at ma, the root node ma has the largest periodic support. If ma is not a periodic frequent itemset, no other node of that subtree is a periodic frequent itemset either. With the periodic pruning technique based on the conditional FP-tree of Approach 3, only the root node ma needs to be compared once with the transaction databases of the corresponding time period of the other periods, and none of the other nodes needs to be compared. The subtree rooted at ma is therefore not generated in the periodic frequent itemset generation tree (Fig. 3). Compared with Approach 2, this saves the cost of generating the subtree rooted at ma and also the cost of the periodic comparison for that subtree. When ma is not a periodic frequent itemset, Approach 2 first generates Fig. 2 in the first period and obtains Fig. 3 after periodic pruning, whereas Approach 3 generates Fig. 3 directly in the first period. So the periodic pruning technique based on the conditional FP-tree in Approach 3 effectively controls the size of the periodic frequent itemset generation tree and reduces the search space.

Discovering periodic frequent itemsets

Input: the c FP-trees of the first period, FP-tree_1, ..., FP-tree_c, where FP-tree_i is the FP-tree built for the i-th time period [s_i, e_i].

Output: all periodic frequent itemsets.

Void Main() // main program
begin
    for (i = 1; i <= c; i++) do begin
        CFP-growth(FP-tree_i, Null);
    end
end

Procedure CFP-growth(Tree, α) // periodically prune the conditional FP-tree of pattern β and generate the periodic frequent itemsets
begin
    if Tree contains a single path P then
        for each combination (denoted β) of the nodes in path P do // β is the union of the node elements in path P
            generate the pattern α ∪ β, with support equal to the minimum periodic support of the node elements in β;
    else for each element α_i in Tree do
    begin
        generate the pattern β = α_i ∪ α, with support equal to the periodic support of α_i;
        construct the conditional pattern base of β and the conditional FP-tree Tree_β^c of β;
        form the set C from Tree_β^c and β; // for example, {(f:3, b:3)}|kma gives the set {{fkam}, {bkam}}
        Tree_β = CFP-Pruning(C); // periodically prune the set C formed from Tree_β^c and β; output Tree_β
        if Tree_β ≠ φ then CFP-growth(Tree_β, β); // if Tree_β is non-empty, call CFP-growth recursively
    end
end

Function CFP-Pruning(C)
begin
    for (j = 2; j <= n; j++) do begin // periods 2 to n
        C_i = φ; // empty C_i
        for all c ∈ C do begin // compute the support of every itemset c of C in the transaction database D_ji of the i-th time period [s_i, e_i] of period j, and prune the itemsets whose support support_c is less than s_min
            compute the support support_c of itemset c in D_ji;
            if support_c >= s_min then C_i = C_i ∪ {c};
        end
        C = C_i;
        if C = φ then exit; // if all itemsets have been pruned, leave all loops
    end
    if C ≠ φ then begin
        generate Tree_β from C; // for example, C = {{fkam}, {bkam}}, β = kam, Tree_β is {(f:3, b:3)}|kma
        return Tree_β; // output Tree_β
    end
    else return φ;
end
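The pruning loop of CFP-Pruning can be sketched in a few lines. This is a simplified illustration, not the full procedure: it returns the surviving itemsets instead of rebuilding Tree_β, itemsets are modelled as frozensets and transactions as sets, and the names `support` and `cfp_pruning` are assumptions.

```python
def support(itemset, db):
    """Fraction of the transactions in db that contain every item of itemset."""
    return sum(itemset <= t for t in db) / len(db)

def cfp_pruning(C, other_periods, s_min):
    """Keep only the itemsets of C whose support reaches s_min in the
    matching time period of every other period (periods 2..n)."""
    for db in other_periods:       # j = 2 .. n
        C = [c for c in C if support(c, db) >= s_min]
        if not C:                  # everything pruned: stop early
            break
    return C
```

Because the candidate set only shrinks, each later period is checked against fewer itemsets, which is where the saving over comparing every candidate against every period comes from.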

Finding strong periodic association rules: the strong periodic association rules in each time period [s_i, e_i] of each period can be produced directly from all the periodic frequent itemsets found in that time period. Specifically, for every non-empty subset a of each periodic frequent itemset L, if support(L)/support(a) ≥ c_min, then there is a strong periodic association rule a → (L-a) [C, s_i, e_i, s_{a→(L-a)}, c_{a→(L-a)}].
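Rule generation from one periodic frequent itemset can be sketched as below. The function name `strong_rules` and the `support_of` lookup table are assumptions made for illustration, and only proper subsets a are enumerated so that the consequent L - a is never empty.

```python
from itertools import combinations

def strong_rules(L, support_of, c_min):
    """Return (a, L - a, confidence) for every non-empty proper subset a of L
    with support(L) / support(a) >= c_min.

    support_of: dict mapping frozenset -> periodic support (assumed precomputed).
    """
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):
        for a in map(frozenset, combinations(sorted(L), r)):
            conf = support_of[L] / support_of[a]
            if conf >= c_min:
                rules.append((a, L - a, conf))
    return rules
```

For example, with support({a, b}) = 0.4, support({a}) = 0.5 and support({b}) = 0.8, a threshold c_min = 0.7 keeps the rule a → b and rejects b → a.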

(4) Performance comparison of the present invention with existing algorithms

If the experimental data used to test a theoretical method are too simple, it is often hard to test the correctness and effectiveness of the method comprehensively. The present invention therefore uses information acquisition data from a real industrial control production front line to validate the theory; these data are large in volume and reflect complex and changeable real conditions.

The software and hardware platform and the parameters of the experiment are as follows. (1) Dawning Tiankuo S240XP server (two CPUs (Intel Xeon MP, 2.0 GHz), 1 GB memory). (2) Programming language VC++ 6.0, operating system Windows 2000 Advanced Server, database Oracle9i. (3) 600,000 preprocessed data records are taken per day. The period is one day, so the time span of the time series vector sequence E is Time(E) = 1 day. The sampling granularity Granularity(E) is 1 second, i.e. one sample per second, so the number of samples in one period is Lend(E) = 86400. The experimental data have 300 items, so each sampled value has 300 attribute indices (items) and is a 300-dimensional time series vector. Because the number of samples in one period is Lend(E) = 86400,

C_{\max} = \sqrt{Lend (E)} \approx 294

After clustering with the CMDSA algorithm, C_opt is 95. In actual production it is reasonable, under the same conditions, to divide one day into about 94 segments, so the C_opt of 95 judged by the validity function (the DB Index criterion) is practicable.
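The arithmetic behind the value of C_max quoted above can be checked directly. The helper name `c_max` and the rounding to the nearest integer are assumptions for illustration.

```python
import math

def c_max(samples_per_period):
    """Upper bound on the cluster count, taken as the rounded square root
    of the number of samples in one period."""
    return round(math.sqrt(samples_per_period))

samples = 24 * 60 * 60   # one sample per second over one day
```

With 86400 samples the square root is about 293.94, which rounds to the 294 stated in the text.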

1) Comparison of the number of periodic frequent itemsets found, under different minimum supports, by the CMDSA-based CARDSATSV algorithm and by the existing Fisher-based periodic association rule model of [4][5]

Fig. 4 compares, with a period of one day, the number of periodic frequent itemsets found by the two methods under different total time lengths T (T = 1 day, 30 days, 60 days) and different minimum supports. As Fig. 4 shows, under the same minimum support and the same total time length T, the CMDSA-based CARDSATSV algorithm finds more useful periodic frequent itemsets than the Fisher-based periodic association rule model of [4][5], which shows that the CARDSATSV algorithm of the present invention is better than the periodic association rule model of [4][5]. And because the Fisher-based periodic association rule model of [4][5] finds more useful periodic association rules than the "Cyclic Association Rules" algorithm of [2], the CARDSATSV algorithm is also better than the algorithm of [2].

2) Experimental analysis of the periodic pruning technique based on the conditional FP-tree in the CFP-tree algorithm. Let Time(T_30, 3) be the running time of the algorithm following Approach 3, which uses the periodic pruning technique based on the conditional FP-tree, with a period of one day and a total time length of T = 30 days. Let Time(T_30, 2) be the running time of the algorithm following Approach 2, which does not use periodic pruning, with a period of one day and a total time length of T = 30 days.

The second row of Fig. 6 gives the ratio of Time(T_30, 3) to Time(T_30, 2) under different minimum supports, and the third row gives their difference. Fig. 5 and Fig. 6 show the following. 1) Under the same minimum support, the running time of the T = 30 days (Approach 2) algorithm is higher than that of the T = 30 days (Approach 3) algorithm, because going from T = 1 day to T = 30 days Approach 2 has more candidate itemsets to compare with the transactions of the other periods than Approach 3. 2) As the minimum support decreases, the ratio Time(T_30, 3)/Time(T_30, 2) in Fig. 6 becomes lower and lower, and the difference Time(T_30, 2) - Time(T_30, 3) becomes larger and larger. This shows that the lower the minimum support, the more candidate itemsets must be compared with the transactions of the other periods when going from T = 1 day to T = 30 days, the more non-frequent candidate itemsets the periodic pruning technique of Approach 3 removes without comparing them with the other periods, the stronger the periodic reduction of the non-frequent candidate itemsets of the first period, and the higher the running efficiency.

3) Comparison of the running times, under different minimum supports, of the CARDSATSV algorithm and the Apriori-based periodic association rule model

First consider the time overhead of the CARDSATSV algorithm, which has two parts: the time overhead of CMDSA and that of CFP-tree. The CMDSA algorithm clusters only the time-domain data feature points of one period, and the number of sampled feature points is generally lower than the number of transactions in one period (in the present invention, for example, the feature points are sampled once per second, 86400 sample points in total, whereas one period has 600,000 transactions). Compared with the CFP-tree algorithm, which runs over the total time length T of the transaction set (T spans multiple periods), the running time of the CMDSA algorithm is far smaller, and the larger the span of T, the larger the gap between the time overheads of CMDSA and CFP-tree. The time overhead of CARDSATSV is therefore dominated by the CFP-tree algorithm.

Next compare the two algorithms. Compared with [4][5], the CARDSATSV of the present invention not only finds more useful periodic association rules; its CFP-tree algorithm is also far more efficient than the Apriori-based periodic association rule algorithm of [4][5]. The Apriori algorithm generates a large number of candidate sets (when there are 15000 frequent itemsets of length 1, the number of candidate sets of length 2 exceeds 18M) and scans the database repeatedly (Apriori scans the database almost once for every candidate). CFP-tree, by contrast, produces all frequent itemsets of the first period with only 2 scans of the first-period database. In addition, because CFP-tree periodically prunes the conditional FP-tree of a frequent itemset β of the first period, it need not generate the non-periodic frequent itemsets that contain β and are frequent only in the first period, and it saves the time of comparing those itemsets with the transactions of the other periods, which greatly improves overall efficiency. Most of the work of the CFP-tree algorithm in the first period is done in memory, which also saves a large amount of time.
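The size of the length-2 candidate set mentioned above can be checked with a short calculation. It assumes that Apriori pairs every two frequent 1-itemsets, so the candidate count is the binomial coefficient C(n, 2); the helper name is hypothetical.

```python
import math

def candidate_pairs(n_frequent_items):
    """Number of length-2 candidates formed from n frequent 1-itemsets."""
    return math.comb(n_frequent_items, 2)

pairs = candidate_pairs(15000)   # 15000 * 14999 / 2 = 112,492,500
```

Under this assumption 15000 frequent items already give more than 112 million candidate pairs, which is consistent with the lower bound of 18M stated in the text.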

Fig. 7 compares the running times of the CARDSATSV algorithm and the Apriori-based periodic association rule algorithm under different minimum supports, with a total time length of the transaction set of T = 30 days.

Fig. 7 shows: 1) The time overhead of both algorithms decreases as the minimum support increases, because the higher the minimum support, the more items are eliminated. 2) The time overhead of the CARDSATSV algorithm is far lower than that of the Apriori-based periodic association rule algorithm.

The periodic association rule discovery algorithm based on time series vector diversity sequence method clustering (CARDSATSV) proposed by the present invention finds, to the greatest extent, useful association rules that current periodic association rule discovery algorithms cannot find, and it greatly improves time and space efficiency compared with current Apriori-based periodic association rule algorithms. The experiments show that the CARDSATSV algorithm of the present invention is suitable for the analysis of massive periodic industrial data. The CARDSATSV algorithm is part of the technology of the information acquisition similarity-ratio induction fuzzy algorithm of the industrial production line embedded main control system (Chinese patent application number 200610052850.1). That system is an informatization and transformation project for a characteristic industrial sector in a region of Zhejiang; while helping the sector raise its production efficiency, it provides accurate and complete decision information for the local government and for enterprise management. Real industrial control production front-line acquisition data are large in volume, long in time span, and complex in variation. For such data, the information acquisition similarity-ratio induction fuzzy algorithm of the industrial production line embedded main control system (Chinese patent application number 200610052850.1), which is based on the CARDSATSV algorithm, provides an efficient and complete solution. During the concrete implementation of the system, the CARDSATSV algorithm not only identified the technical difficulties of mining periodic association rules but also proposed solutions to them, and it has been verified and applied in production, experiment and practice through that implementation. The practical outcome of the CARDSATSV algorithm, the industrial production line embedded main control system (Chinese patent application number 200610052850.1), has entered its early operation phase. The actual feedback data show that the CARDSATSV algorithm is suitable for the analysis of massive periodic industrial data and provides a reference for theoretical research on, and the practice of, mining periodic association rules.

References:

[1] Ouyang Weimin, Cai Qingsheng. Discovering association rules with temporal constraints in databases. Journal of Software, 1999, 10(5): 527-532

[2] Ozden B, Ramaswamy S, Silberschatz A. Cyclic Association Rules[J]. IEEE Trans on Data Engineering, 1998, 412-421

[3] Huang Yimin. Study of Frequent Cyclic Association Rule. Computer Science, 2000, 27(4): 43-45

[4] Xu Min, Jin Yuanping. A new periodic association rule model. Computer Engineering and Science, 2000, 22(4): 78-81

[5] Xu Min, et al. Mining Cyclic Generalized Association Rules. Transactions of Nanjing University of Aeronautics & Astronautics, 2002, 19(1)

[6] Agrawal R, et al. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, 1994, 487-499

[7] Cheng Qiansheng. A new sample clustering method: the diversity sequence method. Chinese Science Bulletin, 1994, 39(2)

[8] Han J, et al. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, Dallas, TX, 2000, 1-12

[9] Sergios Theodoridis, Konstantinos Koutroumbas. Pattern Recognition (Second Edition)[M]. Beijing: Mechanical Industrial Press, 2003. 163-205

[10] Fisher W D. J. Am. Stat. Assoc., 1958: 789-798

[11] Bezdek J C, Pal N R. Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 1998, 28(3): 301-315

[12] Xie X L, et al. A validity method for fuzzy clustering. IEEE Trans Patt Anal Mach Intell, 1991, 13(8): 841-847

[13] Ramze R M, et al. A new cluster validity index for the fuzzy c-mean. Pattern Recognition Letters, 1998, 19: 237-246

[14] Yu Jian, Cheng Qiansheng. The search range of the optimal number of clusters in fuzzy clustering methods [J]. Science in China, 2002, 32(2): 274-280

[15] Fan Jiulun, Pei Jihong, Xie Weixin. Cluster validity based on possibility distribution. Acta Electronica Sinica, 1998, 26(4): 113-115

[16] Fan M, Meng XF, et al. Data Mining: Concepts and Techniques. Beijing: Mechanical Industrial Press, 2001 (in Chinese).

[17] Jiawei Han, Wan Gong, Yiwen Yin. Mining Segment-wise Periodic Pattern in Time Related Databases. Proc. of 1998 International Conf. on Knowledge Discovery and Data Mining (KDD'98)[R], New York City, NY, 1998.

[18] Han J, Gong W, Yin Y. Efficient Mining of Partial Periodic Patterns in Time Series Database. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), Sydney, Australia, Apr 1999: 106-115

[19] Maria Halkidi, et al. On Clustering Validation Techniques. Journal of Intelligent Information Systems, 2001, 17(2-3): 107-145

[20] Hartigan J A. Clustering Algorithms. John Wiley & Sons, 1975.

[21] Wang Lei, Tan Yuejin, et al. A fast time-domain association rule discovery algorithm based on clustering. Computer Simulation, 2005, 7(22).

Claims

1. the periodic associated rule discovery algorithm based on time sequence vector diverse sequence method clustering (CARDSATSV), CARDSATSV is made up of two parts: CMDSA and CFP-tree.At first CMDSA adopts each time zone of determining the correlation rule in the cycle based on the cluster of diversity sequence method and DB Index criterion dynamically to the time series vector of being made up of project support degree, controls the cluster number to reach best cluster effect with DB Index criterion.CFP-tree adopts the periodicity tailoring technology based on condition FP tree that the transaction database on each time zone in the cycle is carried out the periodically discovery of correlation rule.

2. CMDSA algorithm according to claim 1 is characterized in that CMDSA carries out cluster based on diversity sequence method and DB Index criterion to time series vector.Diversity sequence method just classification can't judge that how many classes of branch reach optimal classification, so also need DB INDEX criterion to judge the optimal classification number.CMDSA comes the time series vector sequence E of period 1 is carried out cluster in conjunction with both, determines the regularity of distribution of time domain data unique point, and then each time zone of the correlation rule in definite cycle.

3. The CFP-tree algorithm according to claim 1, characterized in that an FP-tree is constructed for each time period of the first cycle; starting from the cycle frequent item list L, the periodic pruning technique based on conditional FP-trees is used to generate all cycle frequent itemsets, and all strong periodic association rules are then found.

4. The construction of an FP-tree for each time period of the first cycle according to claims 1 and 3, characterized in that FP-tree_i is constructed from the transaction database D_i of each time period [s_i, e_i] of the first cycle by the following steps:

1) Scan the transaction database D_i once to produce the frequent item set F_i and the corresponding supports. Compare F_i with all transactions of the transaction databases D_i of the corresponding time period [s_i, e_i] in every cycle other than the first, obtaining the cycle frequent item set F (the 1-itemset formed by each item x in F is a cycle frequent itemset) and the periodic support of each x. Sort F in descending order of periodic support to generate the cycle frequent item list L_i. Here, for an item x in F, r_ti denotes the ratio of the number of transactions containing x in the i-th time period [s_i, e_i] of the t-th cycle to the total number of transactions in that time period. Over the i-th time period [s_i, e_i] of all n cycles, the periodic support of x is s = min{r_1i, r_2i, ..., r_ni}.

2) Create the root node of FP-tree_i and label it "null". For each transaction t in D_i, proceed as follows: select the frequent items of t according to the cycle frequent item list L_i, delete the non-frequent items, and arrange the remaining items in the order in which they appear in L_i. Denote the sorted frequent item list of t as [p|P], where p is the first element and P is the list of the remaining elements, then call insert_tree([p|P], T). The function insert_tree([p|P], T) works as follows: if T has a child node N with N.item_name = p.item_name, increase the count of N by 1; otherwise create a new node N with count 1, set its parent to T, and link it through the node-link structure to the nodes with the same item name. If P is non-empty, recursively call insert_tree(P, N).
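
The two steps of this claim can be sketched as follows. This is a minimal illustration rather than the patented implementation. It assumes that `cycles` is a list with one non-empty transaction list per cycle, all for the same time period, with the first cycle at index 0 and each transaction given as a set of item names. The names `FPNode`, `periodic_support`, `build_fp_tree`, and the `header` table are hypothetical. As a simplification, the periodic support is computed directly as the minimum over all cycles instead of first filtering on the first cycle; both give the same cycle frequent items for a single threshold `s_min`.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item name; None for the "null" root
        self.count = 0
        self.parent = parent
        self.children = {}      # item name -> child FPNode
        self.link = None        # next node carrying the same item name

def periodic_support(item, cycles):
    """s = min over cycles t of r_ti, the fraction of transactions containing item."""
    return min(sum(1 for t in txs if item in t) / len(txs) for txs in cycles)

def insert_tree(items, node, header):
    """Insert the ordered frequent items [p|P] below node, as in insert_tree([p|P], T)."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:
        child = FPNode(p, node)
        node.children[p] = child
        child.link = header.get(p)   # chain nodes that share the item name
        header[p] = child
    child.count += 1
    insert_tree(rest, child, header)

def build_fp_tree(cycles, s_min):
    first = cycles[0]
    items = {x for t in first for x in t}
    support = {x: periodic_support(x, cycles) for x in items}
    # Cycle frequent item list, descending periodic support (ties broken by name).
    L = sorted((x for x in items if support[x] >= s_min),
               key=lambda x: (-support[x], x))
    rank = {x: i for i, x in enumerate(L)}
    root, header = FPNode(None, None), {}
    for t in first:
        ordered = sorted((x for x in set(t) if x in rank), key=rank.get)
        insert_tree(ordered, root, header)
    return root, header, L

# Illustrative data: two cycles, three transactions each, for one time period.
cycles = [
    [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],
    [{"a", "b"}, {"a"}, {"b", "c"}],
]
root, header, L = build_fp_tree(cycles, s_min=0.5)
```

In this example item c is frequent in the first cycle but has periodic support 1/3, so it is dropped before the tree is built.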

5. The periodic pruning technique according to claims 1 and 3, characterized in that it is an improvement on FP-growth that adds to FP-growth the function of finding cycle frequent itemsets based on conditional FP-trees. CFP-growth performs periodic pruning on the conditional FP-tree of each pattern β in FP-growth. Specifically, the set formed from pattern β and the conditional FP-tree of β (for example, if β is {kma} and the conditional FP-tree of β is {(f:3, b:3)}|kma, the resulting set is {{fkam}, {bkam}}) is compared with all transactions in the corresponding time segment [s_j, e_j] of the other cycles; the itemsets in the set whose periodic support is less than s_min are pruned, and the periodic conditional FP-tree of pattern β is generated. When the traversal of the FP-tree of each time segment [s_j, e_j] of the first cycle is finished, all cycle frequent itemsets have been generated.
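
The pruning step of this claim can be illustrated with a small sketch. It assumes that the candidate set is formed by joining β with each item of its conditional FP-tree, as in the {{fkam}, {bkam}} example, and that `other_cycles` holds one non-empty transaction list per remaining cycle for the same time segment. The function names and the sample transactions are hypothetical, chosen only for illustration.

```python
def itemset_periodic_support(itemset, cycles):
    """Minimum, over the given cycles, of the fraction of transactions containing itemset."""
    s = frozenset(itemset)
    return min(sum(1 for t in txs if s <= set(t)) / len(txs) for txs in cycles)

def periodic_prune(beta, cond_items, other_cycles, s_min):
    """Keep only the candidates beta + {item} whose periodic support reaches s_min."""
    kept = {}
    for item in cond_items:
        candidate = frozenset(beta) | {item}
        s = itemset_periodic_support(candidate, other_cycles)
        if s >= s_min:
            kept[candidate] = s
    return kept

# beta = {k, m, a}; its conditional FP-tree contributes the items f and b.
other_cycles = [
    [{"f", "k", "a", "m"}, {"f", "k", "a", "m", "b"}],
    [{"f", "k", "a", "m"}, {"b", "k"}],
]
kept = periodic_prune({"k", "m", "a"}, ["f", "b"], other_cycles, s_min=0.5)
```

With this data {f, k, a, m} survives with periodic support 0.5, while {b, k, a, m} is pruned because it never occurs in the second of the other cycles.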

6. The discovery of all strong periodic association rules according to claims 1 and 3, characterized in that the strong periodic association rules in each time period [s_i, e_i] of each cycle can be generated directly from all the cycle frequent itemsets found in that time period. Specifically, for every non-empty subset a of each cycle frequent itemset L, if support(L)/support(a) >= c_min, then there is a strong periodic association rule a → (L-a) [C, s_i, e_i, s_{a→(L-a)}, c_{a→(L-a)}].
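
The rule generation in this claim can be sketched as below. The sketch assumes that `freq` maps every cycle frequent itemset of one time period, together with all of its non-empty subsets, to its periodic support, and that only proper subsets are used as antecedents so that the consequent L-a is non-empty. The function name `strong_rules` and the sample supports are hypothetical.

```python
from itertools import combinations

def strong_rules(freq, c_min):
    """Return (antecedent, consequent, support, confidence) for every strong rule."""
    rules = []
    for L, s_L in freq.items():
        for k in range(1, len(L)):                 # proper, non-empty subsets only
            for a in combinations(sorted(L), k):
                a = frozenset(a)
                conf = s_L / freq[a]               # support(L) / support(a)
                if conf >= c_min:
                    rules.append((a, L - a, s_L, conf))
    return rules

# Illustrative periodic supports for one time period.
freq = {
    frozenset("a"): 0.8,
    frozenset("b"): 0.5,
    frozenset("ab"): 0.4,
}
rules = strong_rules(freq, c_min=0.7)
```

Only the rule b → a passes the threshold here, since its confidence is about 0.8 while that of a → b is 0.5.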