CN106446081A

CN106446081A - Method for mining association relationship of time series data based on change consistency

Info

Publication number: CN106446081A
Application number: CN201610814069.7A
Authority: CN
Inventors: 王文青; 杨天社; 鲍军鹏; 张海龙; 吴冠; 李方正; 王超; 齐勇
Original assignee: Xian Jiaotong University; China Xian Satellite Control Center
Current assignee: Xian Jiaotong University; China Xian Satellite Control Center
Priority date: 2016-09-09
Filing date: 2016-09-09
Publication date: 2017-02-22
Anticipated expiration: 2036-09-09
Also published as: CN106446081B

Abstract

The invention discloses a method for mining association relationships in time series data based on change consistency. First, the time series variables are preprocessed. Second, each single variable is divided into windows with a sliding window, a discrete wavelet transform is applied to each window, and the maximum wavelet detail coefficient of each window is extracted. Third, WDC clustering is applied to the maximum wavelet detail coefficients of all windows of a single variable to pick out the windows whose wavelet features differ from those of most windows; these windows correspond to the change points of the variable. Finally, CCP clustering is applied to the change points of all variables. Variables in the same cluster of the clustering result have similar change points, so they exhibit change consistency and are regarded as having a potential association relationship. Because the method starts from change consistency among variables, it can discover variables with a linear association relationship and can also detect variables with a complex nonlinear association relationship, which makes it useful for association analysis among the variables of a large, complex system.

Description

Method for mining association relationships in time series data based on change consistency

Technical field

The invention belongs to the fields of intelligent information processing and computer technology, and in particular relates to a method for mining association relationships in time series data based on change consistency.

Background technology

In a large, complex system it is often necessary to detect the association relationships among multiple variables, which is significant for summarizing the operating rules of the system and for early warning. Complex association relationships may exist among the variables of a system, and they are usually governed by the internal rules of the system. In space and time, an association may appear as a co-occurrence relationship, a causal relationship, a trend relationship, and so on. When one variable changes, it causes corresponding changes in other variables.

Summary of the invention

The object of the invention is to provide a method for mining association relationships in time series data based on change consistency. The method uses wavelet transform theory to detect the change points of each single variable, and uses clustering to examine the similarity between the change point vectors of multiple variables, thereby discovering potential association relationships among time series variables.

To achieve the above object, the technical solution of the invention is as follows:

A method for mining association relationships in time series data based on change consistency, wherein the system implementing the method comprises a data preprocessing module, a feature extraction module, a WDC clustering module and a CCP clustering module. The specific steps are:

1) First, the data preprocessing module 1-1 performs outlier removal, equal-interval interpolation and normalization on the original time series data to obtain the time series variables in a valid data format;

2) Second, the feature extraction module 1-2 applies a discrete wavelet transform to the data of each window of the valid-format time series variable and extracts the maximum wavelet detail coefficient;

3) Then, the WDC clustering module 1-3 performs WDC clustering on the maximum wavelet detail coefficients of all windows of a single variable; in the clustering result, the windows in clusters smaller than a threshold are change points;

4) Finally, the CCP clustering module 1-4 performs CCP clustering on the change point vectors of all variables; variables in the same cluster of the clustering result are associated, and the association relationship and its strength are output for the variables within each cluster.

The data preprocessing module performs outlier removal, equal-interval interpolation and normalization on the original time series data as follows:

First, the mean and standard deviation of each window are computed. For each data point, the difference between the point and the mean of its observation window is compared with 5 times the standard deviation of that window; if the difference is larger, the data point is an outlier and is rejected;

Then, equal-interval interpolation is applied to the time series after outlier removal. If the sampling interval is Δt and the initial time is T, the set of equally spaced times after interpolation is {T + n·Δt | n = 0, 1, 2, 3, …}. The value assigned to T + i·Δt is the original observation at the nearest time not later than T + i·Δt, i.e. the observation at the time immediately preceding the first time in the original series that is greater than T + i·Δt;

Finally, linear normalization is applied to the interpolated data. The time series is first scanned to obtain the maximum (max) and minimum (min) of the observations, and the normalized value of each observation point is computed according to the formula

x_i' = \frac{x_i - \min}{\Delta}

which maps the value range of the original time series onto the interval [0, 1], where x_i is the value of the i-th observation point and Δ = max - min.
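The three preprocessing operations above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`remove_outliers`, `resample_hold`, `normalize`), the use of non-overlapping observation windows of 50 points, and the choice of the first observation time as T are all assumptions that the text does not specify.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def remove_outliers(times, values, window=50, k=5.0):
    """Reject points farther than k standard deviations from their window mean.

    Assumption: observation windows are consecutive, non-overlapping blocks
    of `window` points; the text does not say how they are laid out.
    """
    kept_t, kept_v = [], []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        mu, sigma = mean(chunk), pstdev(chunk)
        for t, v in zip(times[start:start + window], chunk):
            if abs(v - mu) <= k * sigma:  # inside the 5-sigma band: keep
                kept_t.append(t)
                kept_v.append(v)
    return kept_t, kept_v


def resample_hold(times, values, dt):
    """Resample onto the grid T, T+dt, T+2dt, ... with T = times[0] (assumed).

    Each grid time takes the value of the latest original observation whose
    time is not greater than the grid time.
    """
    out_t, out_v = [], []
    n = 0
    while times[0] + n * dt <= times[-1]:
        t = times[0] + n * dt
        idx = bisect_right(times, t) - 1  # last original time <= t
        out_t.append(t)
        out_v.append(values[idx])
        n += 1
    return out_t, out_v


def normalize(values):
    """Linear normalization x' = (x - min) / (max - min) onto [0, 1]."""
    lo, hi = min(values), max(values)
    delta = hi - lo
    if delta == 0:  # constant series: mapped to 0.0 by assumption
        return [0.0 for _ in values]
    return [(v - lo) / delta for v in values]
```

One design point worth noting: with a population standard deviation, a single point in a window of w points can deviate from the window mean by at most sqrt(w - 1) standard deviations, so a 5-sigma test can only flag an isolated outlier when the window holds at least 27 points. The default of 50 in this sketch was chosen with that in mind.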

The feature extraction steps of the feature extraction module are as follows. First, the data of a single variable are cut with a sliding window. If the sampling start time of the original data is t, the sampling interval is n seconds, the window size is m and the sliding distance is l, then the time period of the first window is [t, t + n·m]. The start time of the second window is the start time of the first window shifted later by l, so the time period of the second window is [t + l, t + l + n·m]. Proceeding in the same way yields N windows;

Second, a discrete wavelet transform is applied to the data in each window. The number of wavelet decomposition levels L is set according to the window size, and the maximum wavelet detail coefficient cD_i in the window is taken as the feature of the window; [i, cD_i] denotes the wavelet feature of the i-th window of the original data.
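The windowing and feature extraction just described can be illustrated with a short Python sketch. Several choices here are assumptions, because the text does not fix them: the Haar wavelet is used as the basis, the sliding distance is expressed in samples rather than in time units, and the "maximum" detail coefficient is taken by magnitude. The function names are hypothetical.

```python
import math


def sliding_windows(values, m, step):
    """Windows of m samples, each starting `step` samples after the previous."""
    return [values[s:s + m] for s in range(0, len(values) - m + 1, step)]


def haar_details(window, levels):
    """Detail coefficients of a multi-level Haar DWT, all levels concatenated.

    Assumption: Haar basis. A trailing unpaired sample at any level is dropped.
    """
    approx, details = list(window), []
    for _ in range(levels):
        if len(approx) < 2:
            break
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx) - 1, 2)]
        details.extend((a - b) / math.sqrt(2) for a, b in pairs)
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details


def window_features(values, m, step, levels=1):
    """Return the list of [i, cD_i] features, with i starting at 1.

    Requires m >= 2 so that every window yields at least one detail coefficient.
    """
    feats = []
    for i, w in enumerate(sliding_windows(values, m, step), start=1):
        cd = max(abs(d) for d in haar_details(w, levels))  # magnitude (assumed)
        feats.append((i, cd))
    return feats
```

In this sketch a window that contains a jump between two paired samples gets a large cD_i, while a flat window gets a cD_i of zero, which is the contrast the later WDC clustering relies on.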

The WDC clustering steps of the WDC clustering module are:

1) Initialize the clusters: each window forms its own cluster, whose center is the feature vector of the window itself, i.e. its wavelet feature [i, cD_i]. The number of windows is denoted m and the number of clusters n, so at this point n = m;

2) Compute the sum of squared errors SSE_n of the clustering result according to the following formula:

SSE_n = \sum_{i=1}^{n} \sum_{j=1}^{w} (cD_j - c_i)^2

where n is the number of clusters; w is the number of windows in a cluster; j is the index of a window in cluster i; c_i is the center of cluster i;

3) Compute the distance between the centers of every two clusters according to the following formula:

dist(c_i, c_j) = |c_i - c_j|, i ≠ j

where dist(c_i, c_j) is the Manhattan distance between cluster i and cluster j; c_i and c_j are the centers of the two clusters;

4) Merge the two nearest clusters and update the cluster center according to the following formula:

c = \frac{1}{w} \sum_{i=1}^{w} cD_i

where c is the cluster center; w is the number of windows in the cluster; cD_i is the maximum wavelet detail coefficient of window i;

5) Decrease n by 1;

6) Repeat steps 2) to 5) until n = 1;

7) Select the clustering result at which the SSE declines fastest according to the following formula, and denote it result = {c₁, c₂, …, c_k}, where k is the number of clusters in that layer of the clustering result:

\max \frac{SSE_i}{SSE_{i-1}}, i = 2, 3, …, m

where i is the clustering layer number; m is the number of windows, i.e. the maximum number of clustering layers;

8) distance of any two cluster in result is calculated, closest two cluster is picked out, is denoted as c_i,c_j；

9) if dist is (c_i,c_j)≤d, d=0.2, then merge the two clusters, and calculates the cluster heart of new cluster, then repeats step Rapid 8；

10) if dist is (c_i,c_j)>D, then exit cluster process；

11) in cluster result, in less cluster, contained window is the Parameters variation point, and less cluster is exactly window in cluster Several ratios with total window number are less than the cluster of given threshold value 0.2, and all labels compared with window in tuftlets then constitute the change of the parameter Point set, i.e. cpv={ cp₁,cp₂,…,cp_m, wherein cp_iIt is window label.
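The eleven WDC steps above can be sketched in Python as follows. This is one possible reading rather than a definitive implementation. The assumptions are: clustering is done on the scalar cD_i values only; layers are ordered from m singleton clusters down to one cluster; the layer kept in step 7 is the one just before the merge with the largest SSE_i / SSE_{i-1} ratio; and ratios whose denominator is (numerically) zero are skipped using a hypothetical tolerance `eps`. All function names are hypothetical.

```python
def center(members):
    """Cluster center: the mean of the cD values of its windows."""
    return sum(members.values()) / len(members)


def sse(clusters):
    """Sum over clusters of squared deviations from the cluster center."""
    return sum((v - center(c)) ** 2 for c in clusters for v in c.values())


def merge_nearest(clusters):
    """Return (distance, new cluster list) after merging the two nearest centers."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = abs(center(clusters[a]) - center(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
    d, a, b = best
    merged = {**clusters[a], **clusters[b]}
    rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
    return d, rest + [merged]


def wdc_change_points(cds, d_max=0.2, small_ratio=0.2, eps=1e-12):
    """Return the sorted window labels (1-based) flagged as change points."""
    # Steps 1-6: start from singletons and merge down to one cluster,
    # recording the SSE and the partition of every layer.
    clusters = [{i: cd} for i, cd in enumerate(cds, start=1)]
    layers = [(sse(clusters), clusters)]
    while len(clusters) > 1:
        _, clusters = merge_nearest(clusters)
        layers.append((sse(clusters), clusters))
    # Step 7 (assumed reading): keep the layer just before the largest
    # relative jump in SSE; fall back to the singleton layer if none qualifies.
    chosen, best_ratio = layers[0][1], 0.0
    for prev, cur in zip(layers, layers[1:]):
        if prev[0] > eps and cur[0] / prev[0] > best_ratio:
            best_ratio, chosen = cur[0] / prev[0], prev[1]
    # Steps 8-10: keep merging the closest pair while its distance is <= d.
    clusters = chosen
    while len(clusters) > 1:
        d, merged = merge_nearest(clusters)
        if d > d_max:
            break
        clusters = merged
    # Step 11: windows in clusters holding less than 20% of all windows.
    total = len(cds)
    return sorted(i for c in clusters if len(c) / total < small_ratio for i in c)
```

For example, ten windows whose cD values are all near zero except for window 5 at 0.9 would leave window 5 alone in a cluster holding 10% of the windows, so it would be reported as the only change point.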

The CCP clustering steps of the CCP clustering module are:

1) Each single variable forms its own cluster. Given n variables, the number of clusters is denoted k, so k = n;

2) Compute the change consistency coefficient CoC of every two clusters according to the following formula:

CoC(c) = \frac{2}{z(z-1)} \sum_{\{x, y\} \subseteq c, x \neq y} CoC(x, y)

where CoC(c) is the change consistency coefficient of cluster c (the new cluster obtained by merging c_i and c_j); x and y are any two variables in cluster c; z is the number of variables in the cluster. There are z(z-1)/2 combinations of two variables, and the change consistency coefficient of a cluster equals the mean of the change consistency coefficients of all pairs of variables in the cluster:

CoC(x, y) = \frac{2 |cpv_{xy}|}{|cpv_x| + |cpv_y|}

where CoC(x, y) is the change consistency coefficient of the two variables x and y; |cpv_x| is the number of change points of variable x, i.e. the size of its change point set; |cpv_y| is the number of change points of variable y; |cpv_xy| is the number of change points common to x and y:

cpv_{xy} = cpv_x \cap cpv_y

where cpv_x and cpv_y are the change point sets of variables x and y respectively;

3) Select the two clusters c_i, c_j with the strongest change consistency, and denote the change consistency coefficient between them max_CoC;

4) If max_CoC is greater than or equal to the given threshold 0.8, merge clusters c_i and c_j, decrease k by 1, and go to step 2);

5) If max_CoC is less than the given threshold, exit the clustering process. In the final clustering result, the variables in the same cluster have an association relationship, and the strength of the association between them is the change consistency coefficient CoC of the corresponding cluster.
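The CCP steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: change point sets are represented as Python sets of window labels keyed by a variable name, the CoC of two empty sets is taken to be 0 (the formula is undefined in that case), and a cluster with a single variable has no strength, reported as `None`. The function names are hypothetical.

```python
from itertools import combinations


def coc_pair(cpv_x, cpv_y):
    """CoC(x, y) = 2 |cpv_x & cpv_y| / (|cpv_x| + |cpv_y|)."""
    total = len(cpv_x) + len(cpv_y)
    return 2 * len(cpv_x & cpv_y) / total if total else 0.0  # 0.0 is assumed


def coc_cluster(names, cpv):
    """Mean pairwise CoC over the z(z-1)/2 variable pairs of a cluster."""
    pairs = list(combinations(names, 2))
    return sum(coc_pair(cpv[x], cpv[y]) for x, y in pairs) / len(pairs)


def ccp_cluster(cpv, threshold=0.8):
    """Greedy merging: returns a list of (variable names, CoC or None)."""
    clusters = [[name] for name in cpv]  # step 1: one cluster per variable
    while len(clusters) > 1:
        # Steps 2-3: CoC of every candidate merged cluster, keep the strongest.
        score, i, j = max(
            ((coc_cluster(clusters[i] + clusters[j], cpv), i, j)
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[0],
        )
        if score < threshold:  # step 5: stop when the best pair is too weak
            break
        merged = clusters[i] + clusters[j]  # step 4: merge and continue
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [(c, coc_cluster(c, cpv) if len(c) > 1 else None) for c in clusters]
```

With four hypothetical variables whose change point sets are {3, 7, 12}, {3, 7, 12}, {3, 7, 12, 20} and {1, 5}, this sketch groups the first three (their mean pairwise CoC is 19/21, about 0.905) and leaves the fourth in its own cluster.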

Change consistency means that several time series variables always change at nearby moments. That is, if over a long period multiple variables either almost always change together or almost always remain unchanged together, these variables have a potential association relationship. Based on the change consistency of variables, the invention mines subsets of associated variables from a large set of variables. Compared with the prior art, the invention has the following advantages. The invention examines the association relationships among multiple variables from the perspective of change consistency; such a relationship can be nonlinear, for example an exponential, logarithmic or polynomial functional relationship. The invention focuses on the association that variables exhibit when they change, whereas general association rule mining methods mine frequent patterns under normal conditions. Compared with the traditional association rule mining methods Apriori and FP-Tree, the invention is suited to association analysis over a large number of variables and finds potential associations among parameters.

Description of the drawings

Fig. 1 is the module block diagram of the system of the invention.

Fig. 2 is the flow chart of the WDC clustering module of the invention.

Fig. 3 shows the CCP clustering module of the invention.

Table 1 lists the data simulation functions of the example time series variables.

Fig. 4 shows fragments of the simulated data curves of some of the example time series variables.

Table 2 shows the association relationship mining results for the example time series variables from the CCP clustering module.

Detailed description of the embodiments

The invention is described in further detail below with reference to the accompanying drawings and an embodiment.

Referring to Fig. 1, the system implementing the invention comprises a data preprocessing module 1-1, a feature extraction module 1-2, a WDC clustering module 1-3 and a CCP clustering module 1-4. The specific technical solution of the invention is:

Step 1: The data preprocessing module 1-1 performs outlier removal, equal-interval interpolation and normalization on the original time series data to obtain the time series variables in a valid data format;

Finally, linear normalization is applied to the interpolated data. The time series is first scanned to obtain the maximum (max) and minimum (min) of the observations, and the normalized value of each observation point is computed according to the formula

x_i' = \frac{x_i - \min}{\Delta}

which maps the value range of the original time series onto the interval [0, 1], where x_i is the value of the i-th observation point and Δ = max - min;

Step 2: Second, the feature extraction module 1-2 applies a discrete wavelet transform to the data of each window of the valid-format time series variable and extracts the maximum wavelet detail coefficient;

First, the data of a single variable are cut with a sliding window. If the sampling start time of the original data is t, the sampling interval is n seconds, the window size is m and the sliding distance is l, then the time period of the first window is [t, t + n·m]. The start time of the second window is the start time of the first window shifted later by l, so the time period of the second window is [t + l, t + l + n·m]. Proceeding in the same way yields N windows;

Second, a discrete wavelet transform is applied to the data in each window. The number of wavelet decomposition levels L is set according to the window size, and the maximum wavelet detail coefficient cD_i in the window is taken as the feature of the window; [i, cD_i] denotes the wavelet feature of the i-th window of the original data;
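To make the window time periods in Step 2 concrete, the short Python sketch below lists the spans [t + k·l, t + k·l + n·m] that fit inside a recording of a given length. The function name `window_spans` and the parameter `duration` are hypothetical, and the numeric values in the usage note are illustrative only; the text does not state the window size or sliding distance used in the example.

```python
def window_spans(t, n, m, l, duration):
    """Start and end times of the windows [t + k*l, t + k*l + n*m].

    t: sampling start time, n: sampling interval in seconds,
    m: window size in samples, l: sliding distance in seconds,
    duration: total recording length in seconds (hypothetical parameter).
    """
    spans, k = [], 0
    while k * l + n * m <= duration:
        spans.append((t + k * l, t + k * l + n * m))
        k += 1
    return spans
```

With a 20-minute sampling interval (n = 1200), and assuming for illustration m = 6 and l = 3600, a 4-hour recording would give the three windows (0, 7200), (3600, 10800) and (7200, 14400). More generally this sketch yields N = floor((duration - n·m) / l) + 1 windows whenever duration is at least n·m.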

Step 3: Referring to Fig. 2, the WDC (Wavelet Detail Coefficient) clustering module 1-3 then performs WDC clustering on the maximum wavelet detail coefficients of all windows of a single variable; in the clustering result, the windows in clusters smaller than a threshold are change points;

1) First perform step 2-1: initialize the clusters. Each window forms its own cluster, whose center is the wavelet feature cD_i of the window. The number of windows is denoted m and the number of clusters n, so at this point n = m;

2) Then perform step 2-2: compute the sum of squared errors SSE_n (Sum of Squared Error) of the clustering result according to the following formula:

SSE_n = \sum_{i=1}^{n} \sum_{j=1}^{w} (cD_j - c_i)^2

3) Perform step 2-3: compute the distance between the centers of every two clusters according to the following formula:

dist(c_i, c_j) = |c_i - c_j|, i ≠ j

4) Perform step 2-4: merge the two nearest clusters and update the cluster center according to the following formula:

c = \frac{1}{w} \sum_{i=1}^{w} cD_i

5) Perform step 2-5: decrease n by 1;

6) Perform step 2-6: repeat steps 2) to 5) until n = 1;

7) Perform step 2-7: select the clustering result at which the SSE declines fastest according to the following formula, and denote it result = {c₁, c₂, …, c_k}, where k is the number of clusters in that layer of the clustering result:

\max \frac{SSE_i}{SSE_{i-1}}, i = 2, 3, …, m

8) Perform step 2-8: compute the distance between every two clusters in result and select the two closest clusters, denoted c_i, c_j;

9) Perform step 2-9: if dist(c_i, c_j) ≤ d, where d = 0.2, merge the two clusters, compute the center of the new cluster, and repeat step 8);

10) Perform step 2-10: if dist(c_i, c_j) > d, exit the clustering process;
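A small worked example may help with steps 2-2 to 2-4 and 2-9. The two cD values below are hypothetical, and the SSE and center formulas are taken as given in claim 4.

```latex
% Two hypothetical windows with cD_1 = 0.10 and cD_2 = 0.14, each its own cluster.
\begin{align*}
\mathrm{dist}(c_1, c_2) &= |0.10 - 0.14| = 0.04 \\
c &= \tfrac{1}{2}(0.10 + 0.14) = 0.12 \\
\text{SSE contribution of the merged cluster} &= (0.10 - 0.12)^2 + (0.14 - 0.12)^2 = 0.0008
\end{align*}
```

Because 0.04 ≤ d = 0.2, two clusters with these centers would also be merged by the test in step 2-9 if they were still separate in result.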

Step 4: Referring to Fig. 3, the CCP (Clustering based on Change Point) clustering module 1-4 finally performs CCP clustering on the change point vectors of all variables; variables in the same cluster of the clustering result are associated, and the association relationship and its strength are output for the variables within each cluster;

1) First perform step 3-1: each single variable forms its own cluster. Given n variables, the number of clusters is denoted k, so k = n;

2) Perform step 3-2: compute the change consistency coefficient CoC of every two clusters according to the following formula:

CoC(x, y) = \frac{2 |cpv_{xy}|}{|cpv_x| + |cpv_y|}

where CoC(x, y) is the change consistency coefficient of the two variables x and y; |cpv_x| is the number of change points of variable x (i.e. the size of its change point set); |cpv_y| is the number of change points of variable y; |cpv_xy| is the number of change points common to x and y:

cpv_{xy} = cpv_x \cap cpv_y

3) Perform step 3-3: select the two clusters c_i, c_j with the strongest change consistency, and denote the change consistency coefficient between them max_CoC;

4) Perform step 3-4: if max_CoC is greater than or equal to the given threshold 0.8, merge clusters c_i and c_j, decrease k by 1, and go to step 2);

5) Perform step 3-5: if max_CoC is less than the given threshold, exit the clustering process. In the final clustering result, the variables in the same cluster have an association relationship, and the strength of the association between them is the change consistency coefficient CoC of the corresponding cluster.
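As a worked illustration of the pairwise coefficient used in step 3-2, consider two hypothetical variables whose change point sets are chosen only for this example; the CoC(x, y) formula is the one given in claim 5.

```latex
% Hypothetical change point sets (window labels):
%   cpv_x = {3, 7, 12},  cpv_y = {3, 7, 15, 20}
\begin{align*}
cpv_{xy} &= cpv_x \cap cpv_y = \{3, 7\} \\
CoC(x, y) &= \frac{2\,|cpv_{xy}|}{|cpv_x| + |cpv_y|} = \frac{2 \cdot 2}{3 + 4} = \frac{4}{7} \approx 0.571
\end{align*}
```

Since 0.571 is below the threshold 0.8, these two variables would not be merged in step 3-4. Two variables with identical, non-empty change point sets give CoC = 1, the largest possible value.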

Referring to Table 1, which gives the simulation functions of the example time series variables, 20 days of data are simulated for each variable from its simulation function, with a sampling interval of 20 minutes. There are three groups of correlated variables, each containing 11 variables: the variables of group A are related to g₁(x), those of group B to g₂(x), and those of group C to g₃(x). The formulas are as follows:

Table 1

Referring to Fig. 4, which shows fragments of the simulated data curves of some of the example time series variables, the parts marked with yellow and white bars in the figure represent windows, and "cDi" denotes the maximum wavelet detail coefficient of the i-th window.

Referring to Table 2, which gives the association relationship mining results for the example time series variables from the CCP clustering module, the variables in the same cluster are considered to have an association relationship, and the strength of the association between them is the change consistency coefficient CoC of the corresponding cluster.

Table 2

Claims

1. A method for mining association relationships in time series data based on change consistency, characterized in that the system implementing the method comprises a data preprocessing module (1-1), a feature extraction module (1-2), a WDC clustering module (1-3) and a CCP clustering module (1-4), and the specific steps are:

1) First, the data preprocessing module (1-1) performs outlier removal, equal-interval interpolation and normalization on the original time series data to obtain the time series variables in a valid data format;

2) Second, the feature extraction module (1-2) applies a discrete wavelet transform to the data of each window of the valid-format time series variable and extracts the maximum wavelet detail coefficient;

3) Then, the WDC clustering module (1-3) performs WDC clustering on the maximum wavelet detail coefficients of all windows of a single variable; in the clustering result, the windows in clusters smaller than a threshold are change points;

4) Finally, the CCP clustering module (1-4) performs CCP clustering on the change point vectors of all variables; variables in the same cluster of the clustering result are associated, and the association relationship and its strength are output for the variables within each cluster.

2. The method for mining association relationships in time series data based on change consistency according to claim 1, characterized in that the data preprocessing module (1-1) performs outlier removal, equal-interval interpolation and normalization on the original time series data through the following steps:

First, the mean and standard deviation of each window are computed. For each data point, the difference between the point and the mean of its observation window is compared with 5 times the standard deviation of that window; if the difference is larger, the data point is an outlier and is rejected;

Then, equal-interval interpolation is applied to the time series after outlier removal. If the sampling interval is Δt and the initial time is T, the set of equally spaced times after interpolation is {T + n·Δt | n = 0, 1, 2, 3, …}. The value assigned to T + i·Δt is the original observation at the nearest time not later than T + i·Δt, i.e. the observation at the time immediately preceding the first time in the original series that is greater than T + i·Δt;

Finally, linear normalization is applied to the interpolated data. The time series is first scanned to obtain the maximum (max) and minimum (min) of the observations, and the normalized value of each observation point is computed according to the formula

x_i' = \frac{x_i - \min}{\Delta}

which maps the value range of the original time series onto the interval [0, 1], where x_i is the value of the i-th observation point and Δ = max - min.

3. The method for mining association relationships in time series data based on change consistency according to claim 1, characterized in that the feature extraction steps of the feature extraction module (1-2) comprise: first, the data of a single variable are cut with a sliding window. If the sampling start time of the original data is t, the sampling interval is n seconds, the window size is m and the sliding distance is l, then the time period of the first window is [t, t + n·m]. The start time of the second window is the start time of the first window shifted later by l, so the time period of the second window is [t + l, t + l + n·m]. Proceeding in the same way yields N windows;

Second, a discrete wavelet transform is applied to the data in each window. The number of wavelet decomposition levels L is set according to the window size, and the maximum wavelet detail coefficient cD_i in the window is taken as the feature of the window; [i, cD_i] denotes the wavelet feature of the i-th window of the original data.

4. The method for mining association relationships in time series data based on change consistency according to claim 3, characterized in that the WDC clustering steps of the WDC clustering module (1-3) comprise:

1) Initialize the clusters: each window forms its own cluster, whose center is the feature vector of the window itself, i.e. its wavelet feature [i, cD_i]. The number of windows is denoted m and the number of clusters n, so at this point n = m;

2) Compute the sum of squared errors SSE_n of the clustering result according to the following formula:

SSE_n = \sum_{i=1}^{n} \sum_{j=1}^{w} (cD_j - c_i)^2

where n is the number of clusters; w is the number of windows in a cluster; j is the index of a window in cluster i; c_i is the center of cluster i;

3) Compute the distance between the centers of every two clusters according to the following formula:

dist(c_i, c_j) = |c_i - c_j|, i ≠ j

4) Merge the two nearest clusters and update the cluster center according to the following formula:

c = \frac{1}{w} \sum_{i=1}^{w} cD_i

5) Decrease n by 1;

6) Repeat steps 2) to 5) until n = 1;

7) Select the clustering result at which the SSE declines fastest according to the following formula, and denote it result = {c₁, c₂, …, c_k}, where k is the number of clusters in that layer of the clustering result:

\max \frac{SSE_i}{SSE_{i-1}}, i = 2, 3, …, m

8) Compute the distance between every two clusters in result and select the two closest clusters, denoted c_i, c_j;

9) If dist(c_i, c_j) ≤ d, where d = 0.2, merge the two clusters, compute the center of the new cluster, and repeat step 8);

10) If dist(c_i, c_j) > d, exit the clustering process;

11) In the clustering result, the windows contained in the smaller clusters are the change points of the parameter. A smaller cluster is one whose ratio of window count to total window count is below the given threshold 0.2. The labels of all windows in the smaller clusters constitute the change point set of the parameter, i.e. cpv = {cp₁, cp₂, …, cp_m}, where cp_i is a window label.

5. The method for mining association relationships in time series data based on change consistency according to claim 1, characterized in that the CCP clustering steps of the CCP clustering module (1-4) comprise:

1) Each single variable forms its own cluster. Given n variables, the number of clusters is denoted k, so k = n;

2) Compute the change consistency coefficient CoC of every two clusters according to the following formula:

CoC(c) = \frac{2}{z(z-1)} \sum_{\{x, y\} \subseteq c, x \neq y} CoC(x, y)

where CoC(c) is the change consistency coefficient of cluster c (the new cluster obtained by merging c_i and c_j); x and y are any two variables in cluster c; z is the number of variables in the cluster. There are z(z-1)/2 combinations of two variables, and the change consistency coefficient of a cluster equals the mean of the change consistency coefficients of all pairs of variables in the cluster:

CoC(x, y) = \frac{2 |cpv_{xy}|}{|cpv_x| + |cpv_y|}

where CoC(x, y) is the change consistency coefficient of the two variables x and y; |cpv_x| is the number of change points of variable x, i.e. the size of its change point set; |cpv_y| is the number of change points of variable y; |cpv_xy| is the number of change points common to x and y:

cpv_{xy} = cpv_x \cap cpv_y

3) Select the two clusters c_i, c_j with the strongest change consistency, and denote the change consistency coefficient between them max_CoC;

4) If max_CoC is greater than or equal to the given threshold 0.8, merge clusters c_i and c_j, decrease k by 1, and go to step 2);

5) If max_CoC is less than the given threshold, exit the clustering process. In the final clustering result, the variables in the same cluster have an association relationship, and the strength of the association between them is the change consistency coefficient CoC of the corresponding cluster.