CN108536862A

CN108536862A - A time series similarity measurement method based on dynamic time warping

Info

Publication number: CN108536862A
Application number: CN201810355812.6A
Authority: CN
Inventors: 刘良桂; 李炜; 贾会玲; 张宇
Original assignee: Zhejiang Sci Tech University ZSTU
Current assignee: Zhejiang Sci Tech University ZSTU
Priority date: 2018-04-19
Filing date: 2018-04-19
Publication date: 2018-09-14

Abstract

The present invention discloses a time series similarity measurement method that combines the dynamic time warping (DTW) algorithm with the derivative dynamic time warping (DDTW) algorithm, improving the accuracy of time series similarity measurement and providing a solid foundation for subsequent research on time series.

Description

A time series similarity measurement method based on dynamic time warping

Technical field

The present invention relates to the field of data analysis, and in particular to a method for measuring the similarity between time series.

Background art

Nowadays, with the continuous development of Internet technology, electronic devices and software technology, every industry is generating enormous amounts of data at every moment. Data volume is growing exponentially, and the data are characterized by large scale, great variety, fast updating and high intrinsic value. A time series is a sequence of numerical values or symbols that is associated with time and ordered in time. Such sequences are very common in real life, especially in fields such as economics, meteorology and biomedicine, and some non-time-series data can also be converted into time series data for analysis. How to mine hidden, useful information from massive time series data is therefore one of the topics that the data mining field currently needs to study in depth.

Time series data mining is a core sub-topic of data mining and has a very wide range of applications. As important foundational research within time series data mining, time series similarity measurement is a prerequisite for other data mining tasks such as classification, clustering, anomaly detection and pattern recognition. From a certain perspective, therefore, the quality of the similarity measure determines to some extent the efficiency of time series data mining algorithms. There are many time series similarity measures; common ones include Euclidean distance (ED) and dynamic time warping (DTW). However, many of these measures compute only the distance between two time series and do not consider the shape features of the series. A better method is therefore needed, one that takes the shape features of the time series into account while computing the distance between them.
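To illustrate the difference between the two measures named above, the following minimal sketch compares Euclidean distance with a textbook DTW on two series that contain the same peak shifted by one time step. The function names and the choice of absolute difference as the local cost are assumptions made for illustration; the text does not prescribe a particular DTW variant.

```python
import math


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance; both series must have equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with |x - y| as the local cost (assumed)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats the match with b[j-1]
                cost[i][j - 1],      # b[j-1] repeats the match with a[i-1]
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]


# The same peak, shifted by one time step.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
ed = euclidean_distance(a, b)   # 2.0: penalises the shift
dtw = dtw_distance(a, b)        # 0.0: warping absorbs the shift
```

Euclidean distance compares values at identical time indices, so a pure time shift is counted as a difference, whereas DTW may stretch either series along the time axis before comparing.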

Summary of the invention

In order to better calculate the similarity between time series, the present invention provides a method for calculating time series similarity that considers not only the distance between time series but also their shape features. The specific technical solution is as follows:

(1) The lengths of the m time series to be measured are unified to n, where n is not less than the length of the longest of the m time series to be measured;

(2) The m time series of length n are assembled into a matrix T_{m×n};

(3) The matrix T_{m×n} is reduced in dimension by the PCA dimensionality-reduction algorithm to obtain a new matrix T_{m×l}, where l denotes the length of each time series after dimensionality reduction.

(4) The DTW distance Dist₁ between two time series (A and B) in the matrix T_{m×l} is calculated.

(5) The derivative of each time series in the matrix T_{m×l} is calculated to form a derivative time series, and the DTW distance Dist₂ between the derivative time series of the two time series A and B from step 4 is then calculated; this is the DDTW distance of the time series.

(6) The resulting time series similarity is Dist = α * Dist₁ + (1 - α) * Dist₂, where α ∈ (0, 1).

(7) A clustering operation is carried out according to the similarity Dist, and the cophenetic correlation coefficient between the clustering result and the similarity Dist is calculated; different values of α are tried, and the value α' that maximizes the cophenetic correlation coefficient is found.

(8) Using the value α' obtained in step 7, the final similarity of time series A and B is obtained as Dist = α' * Dist₁ + (1 - α') * Dist₂.

Further, in step 1, n is the length of the longest of the m time series to be measured.

Further, in step 1, any time series whose length is less than n is padded with zeros at its end so that its length becomes n.
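The length unification of step 1 and the matrix assembly of step 2 can be sketched as follows. The function name `pad_to_matrix` is hypothetical, and the matrix is represented as a plain list of rows purely for illustration.

```python
def pad_to_matrix(series_list, n=None):
    """Zero-pad every series at its end to length n and stack them as rows.

    If n is omitted, the length of the longest series is used; a smaller n
    is rejected because step 1 requires n to be at least that length.
    """
    longest = max(len(s) for s in series_list)
    if n is None:
        n = longest
    if n < longest:
        raise ValueError("n must not be less than the longest series length")
    return [list(s) + [0] * (n - len(s)) for s in series_list]


# Three series of unequal length become a 3 x 3 matrix.
T = pad_to_matrix([[1, 2, 3], [4], [5, 6]])
```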

With the time series similarity measurement method of the present invention, the calculation of time series similarity takes into account not only the distance between time series but also their shape features, so that the similarity measurement is more accurate.
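As an illustration of how distance and shape are combined in steps 4 to 6, the sketch below computes Dist₁, Dist₂ and their weighted sum for two series. It rests on assumptions the text does not settle: the local DTW cost is the absolute difference, and the derivative is estimated with the slope formula commonly used for derivative DTW (the average of the left slope and the slope through both neighbours), with the endpoints copying their nearest interior estimate. All function names are hypothetical.

```python
def dtw_distance(a, b):
    """Textbook DTW with |x - y| as the local cost (assumed)."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


def derivative(series):
    """Slope estimate per point (assumed formula); endpoints copy neighbours."""
    if len(series) < 3:
        raise ValueError("need at least three points to estimate slopes")
    inner = [
        ((series[i] - series[i - 1]) + (series[i + 1] - series[i - 1]) / 2) / 2
        for i in range(1, len(series) - 1)
    ]
    return [inner[0]] + inner + [inner[-1]]


def combined_distance(a, b, alpha):
    """Dist = alpha * Dist1 + (1 - alpha) * Dist2 with alpha in (0, 1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    dist1 = dtw_distance(a, b)                          # DTW on raw values
    dist2 = dtw_distance(derivative(a), derivative(b))  # DDTW on slopes
    return alpha * dist1 + (1 - alpha) * dist2


# Same shape, different level: the value term is large, the shape term is zero.
a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, 7]
dist = combined_distance(a, b, alpha=0.5)
```

In this example the two series have identical slopes, so the shape term contributes nothing and the combined distance is half of the plain DTW distance.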

Description of the drawings

Fig. 1 shows the cophenetic correlation coefficients calculated by the different methods.

Detailed description of the embodiments

The time series similarity measurement method of the present invention is further explained below with reference to a specific embodiment.

The present invention provides a method for calculating time series similarity that considers not only the distance between time series but also their shape features. Taking time series of mobile phone communication records as an example, the present invention is described in detail as follows:

1. The lengths of the 2076 time series to be measured are unified to 4032, where 4032 is the length of the longest of the 2076 time series to be measured; any time series shorter than 4032 is padded with zeros at its end so that its length becomes 4032. That is, m = 2076 and n = 4032;

2. The 2076 time series of length 4032 are assembled into a matrix T_{m×n};

3. The matrix T_{m×n} is reduced in dimension by the PCA dimensionality-reduction algorithm to obtain a new matrix T_{m×l}, where l denotes the length of each time series after dimensionality reduction; here l = 8.

4. The DTW distance Dist₁ between any two time series in the matrix T_{m×l} is calculated.

5. The derivative of each time series in the matrix T_{m×l} is calculated to form a derivative time series, and the DTW distance Dist₂ between any two derivative time series is then calculated; this is the DDTW distance of the time series.

6. The resulting time series similarity is Dist = α * Dist₁ + (1 - α) * Dist₂, where α ∈ (0, 1).

7. A clustering operation is carried out according to the similarity Dist, and the cophenetic correlation coefficient between the clustering result and the similarity Dist is calculated; different values of α are tried, and the value α' that maximizes the cophenetic correlation coefficient is found.

8. Using the value α' obtained in step 7, the final similarity of the time series is obtained as Dist = α' * Dist₁ + (1 - α') * Dist₂.
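The dimensionality reduction of step 3 can be sketched with NumPy as below. This is one common way to compute PCA (centring the columns and projecting onto the leading right singular vectors); the text does not say which PCA implementation was used, the function name `pca_reduce` is hypothetical, and the random 20 × 50 matrix is only a small stand-in for the 2076 × 4032 matrix of the embodiment.

```python
import numpy as np


def pca_reduce(T, l):
    """Project the rows of T (m x n) onto the top-l principal components."""
    T = np.asarray(T, dtype=float)
    centered = T - T.mean(axis=0)  # centre each column (time index)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:l].T     # shape (m, l)


# Stand-in data: 20 series of length 50 reduced to length 8 (l = 8 as in step 3).
rng = np.random.default_rng(0)
T_mn = rng.normal(size=(20, 50))
T_ml = pca_reduce(T_mn, 8)
```

Each row of the result is then treated as a shortened series of length l, which keeps the pairwise DTW computations of steps 4 and 5 small.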

As can be seen from the figure, the cophenetic correlation coefficients obtained with the DTW and DDTW methods are higher than those obtained with traditional similarity measures (such as Euclidean distance). Meanwhile, the cophenetic correlation coefficient obtained with the method of the present invention achieves the best result at certain values of α. It follows that the method of the present invention reflects the similarity between two time series more accurately.
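The search for α' in steps 7 and 8 can be sketched as a simple grid search. The sketch assumes that the coefficient in question is the cophenetic correlation coefficient of a hierarchical clustering and that single linkage is used; the text names neither the clustering algorithm nor the grid of α values, and the function names and the two small distance matrices are hypothetical illustration data, not results from the embodiment.

```python
import math
from itertools import combinations


def cophenetic_distances(dist):
    """Single-linkage clustering (assumed); returns the merge height per pair."""
    m = len(dist)
    clusters = [[i] for i in range(m)]
    coph = [[0.0] * m for _ in range(m)]
    while len(clusters) > 1:
        best = None
        for p, q in combinations(range(len(clusters)), 2):
            d = min(dist[i][j] for i in clusters[p] for j in clusters[q])
            if best is None or d < best[0]:
                best = (d, p, q)
        height, p, q = best
        for i in clusters[p]:          # first merge height of each new pair
            for j in clusters[q]:
                coph[i][j] = coph[j][i] = height
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]                # q > p, so index p stays valid
    return coph


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cophenetic_correlation(dist):
    """Correlation between original and cophenetic pairwise distances."""
    coph = cophenetic_distances(dist)
    pairs = list(combinations(range(len(dist)), 2))
    return pearson([dist[i][j] for i, j in pairs],
                   [coph[i][j] for i, j in pairs])


def blend_distances(dist1, dist2, alpha):
    """Element-wise Dist = alpha * Dist1 + (1 - alpha) * Dist2."""
    m = len(dist1)
    return [[alpha * dist1[i][j] + (1 - alpha) * dist2[i][j]
             for j in range(m)] for i in range(m)]


def best_alpha(dist1, dist2, alphas):
    """Return (alpha', coefficient) maximising the cophenetic correlation."""
    scored = [(cophenetic_correlation(blend_distances(dist1, dist2, a)), a)
              for a in alphas]
    coeff, alpha = max(scored)
    return alpha, coeff


# Hypothetical pairwise DTW (dist1) and DDTW (dist2) matrices for four series.
dist1 = [[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]]
dist2 = [[0, 2, 3, 3], [2, 0, 4, 1], [3, 4, 0, 5], [3, 1, 5, 0]]
alphas = [k / 10 for k in range(1, 10)]
alpha_prime, coeff = best_alpha(dist1, dist2, alphas)
```

The pair search is cubic in the number of series, which is acceptable for a small illustration but would call for a library implementation at the scale of the embodiment.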

Claims

1. A time series similarity measurement method, characterized by comprising the following steps:

(1) The lengths of the m time series to be measured are unified to n, where n is not less than the length of the longest of the m time series to be measured;

(2) The m time series of length n are assembled into a matrix T_{m×n};

(3) The matrix T_{m×n} is reduced in dimension by the PCA dimensionality-reduction algorithm to obtain a new matrix T_{m×l}, where l denotes the length of each time series after dimensionality reduction.

(4) The DTW distance Dist₁ between two time series (A and B) in the matrix T_{m×l} is calculated.

(5) The derivative of each time series in the matrix T_{m×l} is calculated to form a derivative time series, and the DTW distance Dist₂ between the derivative time series of the two time series A and B from step 4 is then calculated; this is the DDTW distance of the time series.

(6) The resulting time series similarity is Dist = α * Dist₁ + (1 - α) * Dist₂, where α ∈ (0, 1).

(7) A clustering operation is carried out according to the similarity Dist, and the cophenetic correlation coefficient between the clustering result and the similarity Dist is calculated; different values of α are tried, and the value α' that maximizes the cophenetic correlation coefficient is found.

(8) Using the value α' obtained in step 7, the final similarity of time series A and B is obtained as Dist = α' * Dist₁ + (1 - α') * Dist₂.

2. The method according to claim 1, characterized in that, in step 1, n is the length of the longest of the m time series to be measured.

3. The method according to claim 1, characterized in that, in step 1, any time series whose length is less than n is padded with zeros at its end so that its length becomes n.