CN114357037A

CN114357037A - Time sequence data analysis method and device, electronic equipment and storage medium

Info

Publication number: CN114357037A
Application number: CN202210279483.8A
Authority: CN
Inventors: 李峰; 张潇澜; 周镇镇
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-03-22
Filing date: 2022-03-22
Publication date: 2022-04-15

Abstract

The application discloses a time series data analysis method and device, an electronic device, and a computer-readable storage medium. The method comprises the following steps: acquiring a data pattern database, which comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to each dimension, and a sample data set corresponding to each data pattern; acquiring time series data containing a plurality of target dimensions; if a target dimension does not exist in the data pattern database, calculating the similarity between the sample data sets in the data pattern database and the time series data, and judging whether there is a target sample data set whose similarity to the time series data is greater than or equal to a threshold; if so, reusing (multiplexing) the data pattern corresponding to the target sample data set; if not, training on the time series data to obtain a target data pattern, and storing the target dimension, the target data pattern and the time series data in the data pattern database. Data pattern analysis of multidimensional time series data is thereby realized.

Description

Time sequence data analysis method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for analyzing time series data, an electronic device, and a computer-readable storage medium.

Background

To improve the operation and maintenance efficiency of enterprises and projects and to reduce operation and maintenance costs, intelligent operation and maintenance has emerged. It uses algorithms such as data mining and machine learning to predict faults and quickly locate root causes, and has high practical value. Big data analysis is a key technology of intelligent operation and maintenance; mining abnormal patterns from multiple dimensions of massive monitoring data is a research hotspot in intelligent operation and maintenance and an important research direction in the field of data mining.

Compared with one-dimensional time series data, multidimensional time series data describes the real-time state of hardware and components through different indicators, so its data characteristics are richer and more comprehensive, and it plays a more important role in the intelligent decision-making stage of AIOps (intelligent operation and maintenance). However, data in different dimensions have different units of measure and different distribution patterns, so they cannot be characterized by a single distribution model, which increases the complexity of the algorithm. In addition, the distribution pattern of data in the same dimension changes gradually over time, that is, concept drift occurs, so that an existing data distribution model cannot be reused.

Therefore, how to analyze the data pattern of the multidimensional time series data is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application aims to provide a time sequence data analysis method and device, an electronic device and a computer readable storage medium, which realize the analysis of a data pattern of multi-dimensional time sequence data.

In order to achieve the above object, the present application provides a time series data analysis method, including:

acquiring a data pattern database; wherein the data pattern database comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to the dimensions, and a sample data set corresponding to each of the data patterns;

acquiring time sequence data containing a plurality of target dimensions, if the target dimensions do not exist in the data pattern database, calculating the similarity between a sample data set in the data pattern database and the time sequence data, and judging whether a target sample data set with the similarity larger than or equal to a threshold value with the time sequence data exists or not;

if so, multiplexing a data mode corresponding to the target sample data set;

if not, training the time sequence data to obtain a target data pattern, and storing the target dimension, the target data pattern and the time sequence data into the data pattern database.

After the obtaining of the time series data including a plurality of target dimensions, the method further includes:

if the target dimension exists in the data pattern database, judging whether the time sequence data has concept drift;

if the time sequence data has concept drift, calculating the similarity between a sample data set in the data pattern database and the time sequence data, and judging whether a target sample data set with the similarity larger than or equal to a threshold value with the time sequence data exists or not;

if so, multiplexing a data mode corresponding to the target sample data set;

if not, training the time sequence data to obtain a target data pattern, and storing the target dimension, the target data pattern and the time sequence data into the data pattern database;

and if the concept drift does not occur in the time sequence data, multiplexing the data mode at the previous moment.

Wherein said calculating a similarity between the sample data set in the data pattern database and the time series data comprises:

dividing the sample data set and the time sequence data in the data pattern database into a plurality of subsequences respectively;

performing a linear fit on each of said subsequences to determine a data pattern for each of said subsequences;

calculating a distance between a data pattern of a subsequence in the sample data set and a data pattern of a subsequence in the time series data;

calculating the similarity between the data mode of the sample data set and the data mode of the time sequence data by using a dynamic time warping algorithm; and the element of the distance matrix in the dynamic time warping algorithm is the distance between the data pattern of the subsequence in the sample data set and the data pattern of the subsequence in the time series data.

Wherein, dividing the sample data set and the time sequence data in the data pattern database into a plurality of subsequences respectively comprises:

determining a first compression rate corresponding to a sample data set in the data pattern database and a second compression rate corresponding to the time sequence data;

the sample data set is divided into a plurality of subsequences based on the first compression rate, and the time series data is divided into a plurality of subsequences based on the second compression rate.

After determining the data pattern of each of the subsequences, the method further comprises:

determining a trend of change for each of the subsequences;

and if the variation trends of the two adjacent subsequences are the same, merging the two adjacent subsequences, and re-fitting the merged subsequences.

Wherein the determining the variation trend of each of the subsequences comprises:

if the middle time observation value of the current subsequence is larger than the middle time observation value of the previous subsequence, determining the change trend of the current subsequence as rising;

and if the middle-time observation value of the current subsequence is smaller than the middle-time observation value of the previous subsequence, determining the change trend of the current subsequence as a decrease.

Wherein said calculating a distance between a data pattern of a subsequence in the sample data set and a data pattern of a subsequence in the time-series data comprises:

calculating a distance between a data pattern of subsequences in the sample data set and a data pattern of subsequences in the time series data based on the trend and direction vectors of subsequences in the sample data set and the trend and direction vectors of subsequences in the time series data.

In order to achieve the above object, the present application provides a time series data analysis apparatus, including:

the acquisition module is used for acquiring a data pattern database; wherein the data pattern database comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to the dimensions, and a sample data set corresponding to each of the data patterns;

the first judgment module is used for acquiring time sequence data containing a plurality of target dimensions and judging whether the target dimensions exist in the data pattern database or not; if not, starting the working process of the second judgment module;

the second judging module is used for calculating the similarity between the sample data set in the data pattern database and the time sequence data and judging whether a target sample data set with the similarity larger than or equal to a threshold value with the time sequence data exists or not; if yes, starting the working process of the first multiplexing module; if not, starting the working process of the training module;

the first multiplexing module is used for multiplexing a data mode corresponding to the target sample data set;

the training module is used for training the time sequence data to obtain a target data pattern, and storing the target dimension, the target data pattern and the time sequence data into the data pattern database.

To achieve the above object, the present application provides an electronic device including:

a memory for storing a computer program;

and the processor is used for realizing the steps of the time sequence data analysis method when executing the computer program.

To achieve the above object, the present application provides a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the steps of the time series data analysis method as described above.

According to the above scheme, the time series data analysis method provided by the application comprises the following steps: acquiring a data pattern database, wherein the data pattern database comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to the dimensions, and a sample data set corresponding to each of the data patterns; acquiring time series data containing a plurality of target dimensions; if a target dimension does not exist in the data pattern database, calculating the similarity between the sample data sets in the data pattern database and the time series data, and judging whether there is a target sample data set whose similarity to the time series data is greater than or equal to a threshold; if so, reusing the data pattern corresponding to the target sample data set, and storing the target dimension, the data pattern corresponding to the target sample data set and the time series data into the data pattern database; if not, training the time series data to obtain a target data pattern, and storing the target dimension, the target data pattern and the time series data into the data pattern database.

The time series data analysis method provided by the application constructs a data pattern database, which comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to each dimension, and a sample data set corresponding to each data pattern. For multidimensional time series data, each data pattern corresponding to each dimension in the pattern database is traversed, and whether concept recurrence occurs for the data of a new dimension is judged from the similarity between the sample data sets and the multidimensional time series data. If concept recurrence occurs, the existing data pattern is reused; if not, the new-dimension data is trained and the obtained pattern and sample data set are stored in the pattern database. The method is thus suitable for data pattern analysis of multidimensional time series data. The application also discloses a time series data analysis device, an electronic device and a computer-readable storage medium, which achieve the same technical effects.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

FIG. 1 is a flow chart illustrating an analysis of a data pattern of time series data according to the related art;

FIG. 2 is a flow chart illustrating a method of time series data analysis in accordance with an exemplary embodiment;

FIG. 3 is a flow chart illustrating another method of time series data analysis in accordance with an exemplary embodiment;

FIG. 4 is a flow diagram illustrating analysis of a data pattern of time series data in accordance with an exemplary embodiment;

FIG. 5 is a block diagram illustrating a time series data analysis apparatus according to an exemplary embodiment;

FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In addition, in the embodiments of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.

Concept drift is a common phenomenon in the field of time series data analysis. In the related art, a database is first maintained, which includes a set of classifiers and the set of data samples corresponding to each classifier. As shown in fig. 1, for a new data block, it is first detected whether concept drift has occurred. If there is no concept drift, the classifier of the previous moment is reused. If concept drift has occurred, the data distribution is further analyzed to judge whether concept recurrence has occurred. If the concept recurs, the closest classifier is selected from the database as the base classifier for the new data block. Otherwise, a new classifier is trained using the new data block and stored in the database along with the training data samples.

In the related art, a common method for judging concept recurrence, such as the KNN (K-Nearest Neighbor) algorithm, calculates the distribution similarity between a new data block and the data samples stored in the database, and judges whether concept recurrence occurs. Methods for measuring the similarity of two time series include the following: (1) Distance between value points. The similarity is measured by the distance between the value points of the two time series, such as the Euclidean distance; the smaller the distance, the higher the similarity. (2) Dynamic Time Warping (DTW), which addresses the translation of data along the time axis. The data are warped and aligned to find the time points at which the two time series correspond, and the Euclidean distance is then used to measure their similarity. (3) Time series pattern similarity. The time series are segmented, trend information is extracted from the sub-intervals, and the similarity of the two time series is calculated from the trend information; the larger the pattern difference, the lower the similarity.
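As a minimal illustration of measure (1), the sketch below computes the Euclidean distance between equal-length series. The toy series `peak`, `shifted` and `flat` are invented for this example and do not come from the application; they only show why a translation along the time axis is a problem for a point-by-point distance.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-by-point Euclidean distance; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: the same peak, shifted by one time step.
peak = [0, 0, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 0, 0]
flat = [0, 0, 0, 0, 0, 0]

d_shift = euclidean(peak, shifted)  # the shifted copy of the same shape
d_flat = euclidean(peak, flat)      # a series with no peak at all
```

Here the shifted copy of the peak ends up farther from `peak` than the flat series does, although its shape is identical. This is the time-axis translation problem that measure (2) is meant to address.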

On the one hand, data in different dimensions have different units of measure and different distribution patterns, and the related art does not consider the correlations among dimensions, so it cannot be applied directly to multidimensional time series data analysis. On the other hand, the similarity calculations used to measure concept recurrence do not comprehensively consider the trend characteristics of the data and therefore have limitations. The Euclidean-distance-based method requires the time series to have equal lengths and is not robust enough; it is very sensitive to noise and cannot mask the influence of noise on the result; and it cannot comprehensively consider the shape and trend of the data. The dynamic time warping algorithm is computationally complex and has high time complexity, which limits its range of application. Describing the trend of the original data by combining the trend information of segmented subsequences amounts to a further discretization and symbolization of the original data. The trend of a subsequence can effectively describe how the data changes over a short period, but the combination cannot accurately reflect the overall long-term shape of the time series data; the method also handles translation along the time axis poorly, which limits its practical use. Moreover, because differences in time series distribution among dimensions make multidimensional time series analysis more complex, the prior art has limitations in the field of multidimensional data.

Therefore, the present application provides an adaptive method for analyzing the data patterns of multidimensional time series data. The method not only effectively handles concept drift in time series data, but also, for data of different dimensions, automatically searches the database to judge whether concept recurrence occurs, which avoids retraining and improves the overall efficiency of the framework. The application also provides an adaptive algorithm based on a time series pattern matching mechanism to detect concept recurrence. The algorithm comprehensively considers the local and overall trend characteristics of the time series data and improves the accuracy of the similarity calculation, thereby improving the performance and efficiency of the whole framework, and has high practical value.

The embodiment of the application discloses a time sequence data analysis method, which realizes the analysis of a data mode on multidimensional time sequence data.

Referring to fig. 2, a flow chart of a method of time series data analysis is shown according to an exemplary embodiment, as shown in fig. 2, including:

S101: acquiring a data pattern database; wherein the data pattern database comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to the dimensions, and a sample data set corresponding to each of the data patterns;

The object of study in this embodiment is time series data, i.e., data that changes over time, recorded as $X = \{(t_1, x_1), (t_2, x_2), \ldots, (t_k, x_k)\}$, where $t_i$ denotes the $i$-th observation time and $x_i$ denotes the observed value at that time. The time series data in this embodiment includes multiple dimensions, which can be understood as multiple observation indicators, such as CPU utilization, temperature, and the like.

In this embodiment, a data pattern database is constructed, where the data pattern database includes dimension names of multiple dimensions, one or more data patterns corresponding to each dimension, and a sample data set corresponding to each data pattern, for example, as shown in table 1:

TABLE 1

Here, Dimension is the dimension name, DataFeature is the data pattern, and Sample is the sample data set. Each record in the data pattern database is $(dim_i, f_i^k, s_i^k)$, where $dim_i$ denotes the $i$-th dimension, $f_i^k$ denotes the $k$-th data pattern of the $i$-th dimension, and $s_i^k$ denotes the sample data set corresponding to that data pattern.
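One way to picture the record structure $(dim_i, f_i^k, s_i^k)$ is the following sketch. The class names `PatternRecord` and `PatternDatabase`, the field names, and the example values are assumptions made for illustration; the application does not prescribe a storage layout, and the data pattern is treated here as an opaque object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A sample data set: (observation time, observed value) pairs.
Sample = List[Tuple[float, float]]


@dataclass
class PatternRecord:
    """One record (dim_i, f_i^k, s_i^k) of the data pattern database."""
    dimension: str   # dimension name, e.g. "cpu_utilization"
    pattern: object  # trained data pattern f_i^k (opaque in this sketch)
    sample: Sample   # sample data set s_i^k belonging to that pattern


@dataclass
class PatternDatabase:
    """Maps each dimension name to its one or more pattern records."""
    records: Dict[str, List[PatternRecord]] = field(default_factory=dict)

    def has_dimension(self, dimension: str) -> bool:
        return dimension in self.records

    def add(self, record: PatternRecord) -> None:
        self.records.setdefault(record.dimension, []).append(record)

    def all_records(self) -> List[PatternRecord]:
        return [r for recs in self.records.values() for r in recs]


# Hypothetical contents: two patterns for one dimension, one for another.
db = PatternDatabase()
db.add(PatternRecord("cpu_utilization", "f_cpu_1", [(0.0, 0.2), (1.0, 0.4)]))
db.add(PatternRecord("cpu_utilization", "f_cpu_2", [(0.0, 0.9), (1.0, 0.7)]))
db.add(PatternRecord("temperature", "f_temp_1", [(0.0, 40.0), (1.0, 41.5)]))
```

Keeping a list of records per dimension reflects the statement that a dimension may have one or more data patterns, each with its own sample data set.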

S102: acquiring time sequence data containing a plurality of target dimensions, and judging whether the target dimensions exist in the data pattern database; if not, the step S103 is entered;

S103: calculating the similarity between the sample data set in the data pattern database and the time sequence data, and judging whether a target sample data set with the similarity larger than or equal to a threshold value with the time sequence data exists or not; if yes, entering S104; if not, entering S105;

S104: multiplexing a data mode corresponding to the target sample data set;

S105: training the time sequence data to obtain a target data pattern, and storing the target dimension, the target data pattern and the time sequence data into the data pattern database.

In a specific implementation, the time series data to be analyzed is acquired; it comprises data of a plurality of target dimensions. For a target dimension that does not exist in the data pattern database, i.e., time series data of a new dimension, the data pattern database is traversed, the similarity between each sample data set and the time series data is calculated to obtain a similarity set, and the maximum value $Sim_{max}$ is output. If $Sim_{max}$ is greater than or equal to the threshold, the corresponding data pattern is used to describe the data distribution of the new-dimension time series data, and a small amount of data may be used to correct the data pattern; if $Sim_{max}$ is smaller than the threshold, the data of the current dimension is trained, and the resulting data pattern and the sample data are stored in the pattern database.

Specifically, a similarity threshold is set, and the new dimension and its data are denoted $dim_{new}$ and $data_{new}$. All sample data sets in the data pattern database are traversed, the similarity of each to $data_{new}$ is calculated, and the highest similarity value $Sim_{max} = sim(s_i^k, data_{new})$ is obtained. If $Sim_{max}$ is smaller than the threshold, a training process is started for the new data $data_{new}$, the first data pattern of the dimension is obtained as $f_{new}^1$, and the result is saved to the data pattern database as $(dim_{new}, f_{new}^1, s_{new}^1)$. If $Sim_{max}$ is greater than or equal to the threshold, concept recurrence has occurred, and $data_{new}$ reuses the pattern $f_i^k$.
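The decision for a new dimension (steps S103 to S105) can be sketched as follows. The function name `match_or_train`, the tuple layout of a record, and the stand-in functions `toy_sim` and `toy_train` are hypothetical; in particular, `toy_sim` is not the similarity measure of the application, only a placeholder so that the control flow can be run.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Series = Sequence[float]
Record = Tuple[str, object, Series]  # (dimension name, data pattern, sample data set)


def match_or_train(
    database: List[Record],
    dim_new: str,
    data_new: Series,
    threshold: float,
    sim: Callable[[Series, Series], float],
    train: Callable[[Series], object],
) -> Tuple[object, bool]:
    """Return (pattern, reused) for a dimension not yet in the database."""
    best: Optional[Record] = None
    sim_max = float("-inf")
    for record in database:  # traverse every stored sample data set
        score = sim(record[2], data_new)
        if score > sim_max:
            sim_max, best = score, record
    if best is not None and sim_max >= threshold:
        return best[1], True  # concept recurrence: reuse the stored pattern
    pattern = train(data_new)  # otherwise train and store a new pattern
    database.append((dim_new, pattern, data_new))
    return pattern, False


def toy_sim(a: Series, b: Series) -> float:
    # Placeholder similarity: 1 / (1 + mean absolute difference).
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


def toy_train(data: Series) -> object:
    # Placeholder "training": the pattern is just the mean of the data.
    return ("mean", sum(data) / len(data))


db: List[Record] = [("cpu", "f_cpu_1", [0.1, 0.2, 0.3])]
p1, reused1 = match_or_train(db, "gpu", [0.1, 0.2, 0.35], 0.9, toy_sim, toy_train)
p2, reused2 = match_or_train(db, "temp", [40.0, 41.0, 42.0], 0.9, toy_sim, toy_train)
```

With these toy inputs the first call finds a sufficiently similar sample set and reuses `f_cpu_1`, while the second call falls below the threshold, trains a new pattern, and appends a record to the database.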

The time series data analysis method provided by the embodiment of the application constructs a data pattern database, which comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to each dimension, and a sample data set corresponding to each data pattern. For multidimensional time series data, each data pattern corresponding to each dimension in the pattern database is traversed, and whether concept recurrence occurs for the data of a new dimension is judged from the similarity between the sample data sets and the multidimensional time series data. If concept recurrence occurs, the existing data pattern is reused; if not, the new-dimension data is trained and the obtained pattern and sample data set are stored in the pattern database. The method is thus suitable for data pattern analysis of multidimensional time series data.

The embodiment of the application discloses another time series data analysis method, which describes the processing that follows step S102 of the previous embodiment. Specifically:

Referring to fig. 3, a flow chart of another method of time series data analysis is shown according to an exemplary embodiment, as shown in fig. 3, including:

S201: if the target dimension exists in the data pattern database, judging whether the time sequence data has concept drift; if yes, entering S202; if not, entering S205;

S202: calculating the similarity between the sample data set in the data pattern database and the time sequence data, and judging whether a target sample data set with the similarity larger than or equal to a threshold value with the time sequence data exists or not; if yes, entering S203; if not, entering S204;

S203: multiplexing a data mode corresponding to the target sample data set;

S204: training the time sequence data to obtain a target data pattern, and storing the target dimension, the target data pattern and the time sequence data into the data pattern database;

S205: multiplexing the data pattern at the previous moment.

In this embodiment, for time series data of a target dimension existing in a data pattern database, if no concept drift occurs, multiplexing a data pattern at the previous moment; if concept drift occurs, it is necessary to further detect whether concept recurrence occurs. If the concept reappearance occurs, the existing mode is reused, otherwise retraining is needed, and the mode and the sample data set are stored in the database.

Specifically, the data of the new window is denoted $data_{new}$. If no concept drift occurs in the data, the historical data pattern is reused. If concept drift occurs, all sample data sets in the data pattern database are traversed to judge whether concept recurrence occurs. If no concept recurrence occurs, the data is fully trained, and the training result is saved to the data pattern database as $(dim_{new}, f_{new}^1, s_{new}^1)$. If concept recurrence occurs, $data_{new}$ reuses the pattern $f_i^k$.

The complete flow for analyzing the data patterns of multidimensional time series data is shown in fig. 4. For time series data of a new dimension, the data pattern database is traversed, the similarity between every sample data set and the new data is calculated, and the highest similarity value $S_{max}$ is selected. If $S_{max}$ is greater than or equal to the threshold, the data pattern corresponding to $S_{max}$ is taken as the data pattern of the new-dimension time series data; otherwise, the data is retrained and the result is stored in the data pattern database. For time series data of an existing dimension, it is judged whether concept drift occurs. If not, the pattern is reused. If so, it is further judged whether concept recurrence occurs, with the following flow: the data pattern database is traversed, the similarity between every sample data set and the new data is calculated, and the highest similarity value $S_{max}$ is selected; if $S_{max}$ is greater than or equal to the threshold, the data pattern corresponding to $S_{max}$ is taken as the data pattern of the time series data; otherwise, the data is retrained and the result is stored in the data pattern database.
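The combined flow for new and existing dimensions can be sketched as a single dispatch function. Everything named here (`analyze`, the `current` dictionary holding the pattern in use at the previous moment, and the placeholder functions `toy_sim`, `toy_train`, `toy_drift`) is an assumption for illustration; the application does not specify the drift detector or the training procedure at this point. The comparison uses "greater than or equal to", following steps S103 and S202.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Series = Sequence[float]
Record = Tuple[str, object, Series]  # (dimension, data pattern, sample data set)


def analyze(
    database: List[Record],
    current: Dict[str, object],  # pattern in use per dimension at the previous moment
    dim: str,
    data: Series,
    threshold: float,
    sim: Callable[[Series, Series], float],
    train: Callable[[Series], object],
    drifted: Callable[[object, Series], bool],
) -> str:
    """Choose a pattern for `data` and report which branch was taken."""
    known = any(rec[0] == dim for rec in database)
    if known and dim in current and not drifted(current[dim], data):
        return "reuse_previous"  # S205: no concept drift
    # New dimension, or existing dimension with drift: check for recurrence.
    best = max(database, key=lambda rec: sim(rec[2], data), default=None)
    if best is not None and sim(best[2], data) >= threshold:
        current[dim] = best[1]
        return "reuse_matched"   # concept recurrence
    current[dim] = train(data)   # no recurrence: train and store
    database.append((dim, current[dim], data))
    return "trained"


def toy_sim(a: Series, b: Series) -> float:
    # Placeholder similarity: 1 / (1 + mean absolute difference).
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


def toy_train(data: Series) -> object:
    return ("mean", sum(data) / len(data))


def toy_drift(pattern: object, data: Series) -> bool:
    # Placeholder drift test: the mean moved by more than 1.0.
    return abs(pattern[1] - sum(data) / len(data)) > 1.0


db: List[Record] = []
current: Dict[str, object] = {}
args = (0.9, toy_sim, toy_train, toy_drift)
steps = [
    analyze(db, current, "cpu", [0.1, 0.2, 0.3], *args),
    analyze(db, current, "cpu", [0.1, 0.2, 0.4], *args),
    analyze(db, current, "cpu", [5.0, 5.1, 5.2], *args),
    analyze(db, current, "cpu", [0.1, 0.2, 0.3], *args),
]
```

In this toy run the four windows take the branches trained, reuse of the previous pattern, trained again after drift, and finally reuse of the first stored pattern when the original concept recurs.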

On the basis of the foregoing embodiment, as a preferred implementation, the calculating the similarity between the sample data set in the data pattern database and the time series data includes: dividing the sample data set and the time sequence data in the data pattern database into a plurality of subsequences respectively; performing a linear fit on each of said subsequences to determine a data pattern for each of said subsequences; calculating a distance between a data pattern of a subsequence in the sample data set and a data pattern of a subsequence in the time series data; calculating the similarity between the data mode of the sample data set and the data mode of the time sequence data by using a dynamic time warping algorithm; and the element of the distance matrix in the dynamic time warping algorithm is the distance between the data pattern of the subsequence in the sample data set and the data pattern of the subsequence in the time series data.

In a specific implementation, the sample data set and the time series data are each divided into a plurality of subsequences, the data pattern of each subsequence is extracted, and the set of subsequence data patterns is taken as the data pattern of the whole time series data or sample data set. Finally, the similarity between the data pattern of the sample data set and that of the time series data is measured using a dynamic time warping algorithm. In this way, the local characteristics (the data patterns of the subsequences) and the overall trend (the combination of the subsequences) of the time series data are both considered, and the dynamic time warping algorithm handles translation of the trend along the time axis, giving high accuracy and good practical value.

Assume the sample data set and the time series data are $X_A$ and $X_B$, recorded as $X_A = \{(t^a_1, x^a_1), (t^a_2, x^a_2), \ldots, (t^a_m, x^a_m)\}$ and $X_B = \{(t^b_1, x^b_1), (t^b_2, x^b_2), \ldots, (t^b_n, x^b_n)\}$, where $a$ and $b$ are data patterns. Let the compression rate be $E$ and the number of divided subsequences be $r$; the compression rate is calculated as $E = 1 - (r + 1)/s$, where $s$ is the length of the time series data. The data $X_A$ and $X_B$ are divided according to a given compression rate $E$, and the division criterion is the trend turning point. The trend is defined as one of three states: rising, falling, and invariant. The three states correspond to the slope $k$ of the subsequence after linear fitting: if $k$ is positive, the trend is rising; if $k$ is negative, falling; if $k$ equals zero, invariant. It should be noted that the sample data set and the time series data may use the same compression rate or different compression rates. That is, dividing the sample data set in the data pattern database and the time series data into a plurality of subsequences respectively includes: determining a first compression rate corresponding to the sample data set in the data pattern database and a second compression rate corresponding to the time series data; and dividing the sample data set into a plurality of subsequences based on the first compression rate, and dividing the time series data into a plurality of subsequences based on the second compression rate.
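The relation $E = 1 - (r + 1)/s$ and the slope-based trend states can be illustrated with a small sketch. Two simplifications are assumed here and are not taken from the application: the series is cut into equal-width pieces rather than at trend turning points, and the fit is an ordinary least-squares line over the sample indices. The function names and the example series are hypothetical.

```python
from typing import List, Sequence


def subsequence_count(length: int, compression: float) -> int:
    """Solve E = 1 - (r + 1) / s for r, the number of subsequences."""
    return max(1, round(length * (1.0 - compression)) - 1)


def split_evenly(values: Sequence[float], r: int) -> List[Sequence[float]]:
    """Assumption: equal-width cuts stand in for trend turning points."""
    bounds = [round(i * len(values) / r) for i in range(r + 1)]
    return [values[bounds[i]:bounds[i + 1]] for i in range(r)]


def slope(values: Sequence[float]) -> float:
    """Least-squares slope k of the values against indices 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_t = (n - 1) / 2.0
    mean_x = sum(values) / n
    num = sum((t - mean_t) * (x - mean_x) for t, x in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def state(k: float, eps: float = 1e-9) -> str:
    """Map the fitted slope k to one of the three trend states."""
    if k > eps:
        return "rising"
    if k < -eps:
        return "falling"
    return "invariant"


series = [1, 2, 3, 4, 4, 4, 4, 4, 3, 2, 1, 0]  # s = 12
r = subsequence_count(len(series), 2 / 3)       # E = 2/3 gives r = 3
slopes = [slope(sub) for sub in split_evenly(series, r)]
states = [state(k) for k in slopes]
```

For this series of length 12 and a compression rate of 2/3, the sketch yields three subsequences whose fitted slopes are positive, zero, and negative, i.e. rising, invariant, and falling.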

As a preferred embodiment, after determining the data pattern of each of the subsequences, the method further includes: determining a trend of change for each of the subsequences; and if the variation trends of the two adjacent subsequences are the same, merging the two adjacent subsequences, and re-fitting the merged subsequences.

In a specific implementation, the variation trend tr of adjacent subsequences is defined with the value range {-1, +1}. If the middle-time observation value of the current subsequence is greater than the middle-time observation value of the previous subsequence, the variation trend of the current subsequence is determined as rising and recorded as +1; if the middle-time observation value of the current subsequence is smaller than that of the previous subsequence, the variation trend of the current subsequence is determined as falling and recorded as -1. The calculation formula is as follows:

$$tr = \begin{cases} +1, & mid(cur) > mid(pre) \\ -1, & mid(cur) < mid(pre) \end{cases}$$

where mid(cur) is the observation value at the middle time of the current subsequence, and mid(pre) is the observation value at the middle time of the previous subsequence.

And if the change trends of the continuous adjacent subsequences are the same, combining the subsequences into one subsequence, re-fitting the combined subsequence, and re-determining the data mode of the combined subsequence.
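The trend comparison and merging step can be sketched as follows. This is a minimal sketch under stated assumptions: each subsequence is a list of (t, x) pairs, the middle time is taken as index `len(sub) // 2`, equal middle values yield 0 (the text only defines +1 and -1), and trends are computed on the original subsequences before merging. The names `mid_value`, `trend` and `merge_same_trend` are hypothetical. Re-fitting each merged subsequence is left to the caller.

```python
def mid_value(sub):
    """Observation at the middle time of a subsequence of (t, x) pairs."""
    return sub[len(sub) // 2][1]


def trend(pre, cur):
    """+1 if cur's middle observation exceeds pre's, -1 if it is smaller."""
    a, b = mid_value(cur), mid_value(pre)
    if a > b:
        return 1
    if a < b:
        return -1
    return 0  # equal case is not defined in the text; assumption


def merge_same_trend(subs):
    """Merge runs of adjacent subsequences that share the same trend."""
    if not subs:
        return []
    merged = [list(subs[0])]  # the first subsequence has no predecessor
    prev_tr = None
    for pre, cur in zip(subs, subs[1:]):
        tr = trend(pre, cur)
        if tr == prev_tr:
            merged[-1].extend(cur)  # same trend as its neighbour: merge
        else:
            merged.append(list(cur))
        prev_tr = tr
    return merged


if __name__ == "__main__":
    subs = [
        [(0, 1), (1, 2), (2, 3)],
        [(3, 4), (4, 5), (5, 6)],
        [(6, 7), (7, 8), (8, 9)],
        [(9, 5), (10, 4), (11, 3)],
    ]
    print([len(s) for s in merge_same_trend(subs)])
```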

As a preferred embodiment, the calculating the distance between the data pattern of the subsequence in the sample data set and the data pattern of the subsequence in the time-series data includes: calculating a distance between a data pattern of subsequences in the sample data set and a data pattern of subsequences in the time series data based on the trend and direction vectors of subsequences in the sample data set and the trend and direction vectors of subsequences in the time series data.

In a specific implementation, the distance between the data patterns of two subsequences $sub_1$ and $sub_2$ is denoted $dis(sub_1, sub_2)$. The direction vectors of the respective linear fits are denoted

and

, where $k_1$ and $k_2$ are the slopes of the respective linear fits, and the respective trends are $tr_1$ and $tr_2$. The distance between the data patterns of the two subsequences is calculated as:

；

It can be seen that if the variation trend of the two subsequences is +1, the subsequence distance contributes positively to the similarity calculation, and if the variation trend is -1, the subsequence distance contributes negatively to the similarity calculation.

Further, a dynamic time warping (DTW) algorithm is used to calculate the similarity of the two subsequences, and each element of the distance matrix of the DTW algorithm is the distance between the data patterns of two subsequences. The DTW algorithm can measure the similarity of two time series of equal length, is also applicable to time series of unequal length, and is insensitive to sudden changes or abnormal points in the time series data. Assume two subsequences $P = \{p_1, p_2, \ldots, p_m\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$. The distances $dis(p_i, q_j)$ between any two patterns in the two subsequences form the pattern distance matrix $D(P, Q) = \{dis(p_i, q_j)\}$, where $1 \le i \le m$ and $1 \le j \le n$:

$$D(P, Q) = \begin{bmatrix} dis(p_1, q_1) & dis(p_1, q_2) & \cdots & dis(p_1, q_n) \\ dis(p_2, q_1) & dis(p_2, q_2) & \cdots & dis(p_2, q_n) \\ \vdots & \vdots & \ddots & \vdots \\ dis(p_m, q_1) & dis(p_m, q_2) & \cdots & dis(p_m, q_n) \end{bmatrix}$$

Finding the shortest path in this matrix means taking $dis(p_1, q_1)$ as the starting point and finding a path $MP = \{mp_1, mp_2, \ldots, mp_s\}$ that minimizes the sum of the elements on the path, where, for two consecutive values $mp_i = dis(p_t, q_k)$ and $mp_{i+1} = dis(p_h, q_g)$, $h \ge t$ and $g \ge k$, which ensures the monotonicity of the path in the matrix. The mean of the accumulated values on the path MP is the similarity value of the two sub-patterns:

$$sim(P, Q) = \frac{1}{s} \sum_{i=1}^{s} mp_i$$

Here s is the number of elements on the path MP. The larger the value of sim(P, Q), the more similar the distribution trends of the two time series data.

Furthermore, a dynamic programming algorithm is used to obtain the optimal path. The specific steps are as follows: construct a cumulative matrix TotalMatrix = {tm(i, j)} that records the accumulated value from the first element $dis(p_1, q_1)$ at the top-left corner of the matrix D(P, Q) to the current element $dis(p_i, q_j)$, where tm(i, j) is calculated as:

$$tm(i, j) = dis(p_i, q_j) + \min\{tm(i-1, j),\ tm(i, j-1),\ tm(i-1, j-1)\}$$

where $tm(0, 0) = 0$ and $tm(i, 0) = tm(0, j) = \infty$.
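The dynamic-programming step can be sketched as follows, assuming the textbook DTW recurrence tm(i, j) = dis(p_i, q_j) + min{tm(i-1, j), tm(i, j-1), tm(i-1, j-1)} with tm(0, 0) = 0 and infinite values on the remaining boundary. The function name `dtw_path_mean` is hypothetical, the input is a precomputed m x n matrix of pattern distances (how each distance is computed is outside this sketch), and the path is assumed to end at the bottom-right element.

```python
import math


def dtw_path_mean(dist):
    """Return (mean of values on the minimum-sum warping path, path).

    dist is an m x n list of lists holding dis(p_i, q_j).
    The path is a list of 0-based (i, j) index pairs.
    """
    m, n = len(dist), len(dist[0])
    # Cumulative matrix with one extra boundary row and column.
    tm = [[math.inf] * (n + 1) for _ in range(m + 1)]
    tm[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            tm[i][j] = dist[i - 1][j - 1] + min(
                tm[i - 1][j], tm[i][j - 1], tm[i - 1][j - 1]
            )
    # Backtrack from the bottom-right corner to recover the path.
    i, j = m, n
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (tm[i - 1][j - 1], i - 1, j - 1),
            (tm[i - 1][j], i - 1, j),
            (tm[i][j - 1], i, j - 1),
        )
    path.reverse()
    return tm[m][n] / len(path), path


if __name__ == "__main__":
    value, path = dtw_path_mean([[1, 2], [3, 1], [5, 1]])
    print(value, path)
```

Because the matrix may be rectangular, the same routine handles pattern sequences of unequal length, which is the property of DTW that the description relies on.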

Further, an evaluation index is defined: the KL divergence (Kullback-Leibler divergence), also known as information divergence or relative entropy, is an asymmetric measure of the difference between two probability distributions. The larger the KL divergence, the greater the difference between the two data distributions. Assuming that P(x) and Q(x) are two probability distributions over the random variable X, the KL divergence for discrete and continuous random variables is calculated, respectively, as:

$$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

$$KL(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx$$
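For the discrete case, the KL divergence can be computed directly from two probability vectors. The following is a minimal sketch: the function name `kl_divergence` is hypothetical, the natural logarithm is assumed, terms with P(x) = 0 are taken to contribute zero, and the result is infinite when Q(x) = 0 at a point where P(x) > 0.

```python
import math


def kl_divergence(p, q):
    """Discrete KL(P || Q) for two probability vectors of equal length."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    # The two directions differ, which illustrates the asymmetry.
    print(kl_divergence(p, q), kl_divergence(q, p))
```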

The following takes anomaly detection on multi-dimensional time series data as an example to describe the technical details of the whole framework. The distribution of the time series data is estimated using a window-based kernel density estimation (wkde) algorithm. For $Data_w$, suppose the pattern of the (w-1)-th window data is denoted $f_{w-1}$, the pattern of the w-th window data is denoted $f_w$, and the reused pattern under concept recurrence is denoted $f_{old}$. With $KL(f_w \| f_{w-1}) = 0.596$ and $KL(f_w \| f_{old}^{2}) = 0.339$, it follows that $KL(f_w \| f_{w-1}) > KL(f_w \| f_{old}^{2})$, which shows that the reused pattern $f_{old}$ describes the distribution of $Data_w$ better.
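The comparison above can be illustrated with a small sketch that estimates each distribution on a shared grid and picks the stored pattern with the smallest KL divergence from the current window. The details of the wkde algorithm are not given in the text, so this sketch substitutes a plain Gaussian kernel density estimate; the grid, the bandwidth, the smoothing constant `eps`, and the names `kde_on_grid`, `kl` and `closest_pattern` are all assumptions, and the sample values are made up for illustration.

```python
import math


def kde_on_grid(samples, grid, bandwidth):
    """Gaussian kernel density on grid points, normalised to sum to 1."""
    dens = [
        sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in samples)
        for g in grid
    ]
    total = sum(dens)
    return [d / total for d in dens]


def kl(p, q, eps=1e-12):
    """Discrete KL(P || Q); eps avoids division by zero (assumption)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def closest_pattern(window, candidates, grid, bandwidth=0.5):
    """Return (name of candidate with smallest KL from the window, scores)."""
    f_w = kde_on_grid(window, grid, bandwidth)
    scores = {
        name: kl(f_w, kde_on_grid(samples, grid, bandwidth))
        for name, samples in candidates.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    grid = [i * 0.25 for i in range(-20, 41)]  # -5.0 .. 10.0
    window = [0.0, 0.2, -0.1, 0.1]
    candidates = {
        "f_prev": [5.0, 5.2, 4.9, 5.1],  # previous window, drifted away
        "f_old": [0.1, 0.0, 0.3, -0.2],  # stored pattern that recurs
    }
    print(closest_pattern(window, candidates, grid)[0])
```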

Therefore, the similarity of the time sequence data is measured by using the similarity between the data modes, and the dynamic time warping algorithm is adopted, so that the problem of translation of the time sequence data on a time axis is solved, local and global trend characteristics of the time sequence data are comprehensively considered, and the precision of the time sequence data similarity measurement is improved.

In the following, a time series data analysis apparatus provided by an embodiment of the present application is introduced, and a time series data analysis apparatus described below and a time series data analysis method described above may be referred to each other.

Referring to fig. 5, a block diagram of a time series data analysis apparatus according to an exemplary embodiment is shown. As shown in fig. 5, the apparatus includes:

an obtaining module 501, configured to obtain a data pattern database; wherein the data pattern database comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to the dimensions, and a sample data set corresponding to each of the data patterns;

a first determining module 502, configured to obtain time series data including multiple target dimensions, and determine whether the target dimensions exist in the data pattern database; if not, the work flow of the second determining module 503 is started;

the second determining module 503 is configured to calculate a similarity between a sample data set in the data pattern database and the time series data, and determine whether a target sample data set with a similarity greater than or equal to a threshold exists; if yes, starting the work flow of the first multiplexing module 504; if not, starting the working process of the training module 505;

the first multiplexing module 504 is configured to multiplex a data pattern corresponding to the target sample data set;

the training module 505 is configured to train the time series data to obtain a target data pattern, and store the target dimension, the target data pattern, and the time series data in the data pattern database.

The time series data analysis device provided by the embodiment of the application constructs a data pattern database, wherein the data pattern database comprises dimension names of a plurality of dimensions, one or more data patterns corresponding to each dimension, and a sample data set corresponding to each data pattern. For multidimensional time series data, each data pattern corresponding to each dimension in the pattern database is traversed, and whether concept recurrence occurs for the data pattern of a new dimension is judged from the similarity between the sample data set and the multidimensional time series data. If concept recurrence occurs, the existing data pattern is multiplexed; if not, the new dimension data is trained, and the obtained pattern and the sample data set are stored in the pattern database. The device is therefore suitable for data pattern analysis of multidimensional time series data.

On the basis of the above embodiment, as a preferred implementation, the method further includes:

a third determining module 506, configured to determine whether concept drift occurs in the time series data when the target dimension exists in the data pattern database; if yes, the work flow of the second determining module 503 is started; if not, the work flow of the second multiplexing module 507 is started;

the second multiplexing module 507 is configured to multiplex the data pattern at the previous time.
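The control flow across modules 501 to 507 can be sketched as a single function. This is a sketch under stated assumptions, not the device's implementation: the database is modelled as a dict mapping a dimension name to a list of (pattern, sample set) pairs, the "data pattern at the previous time" is taken to be the most recently stored pattern for that dimension, and `similarity`, `train` and `drifted` are hypothetical callables supplied by the caller, as is the default `threshold`.

```python
def analyze(dim, series, db, similarity, train, drifted, threshold=0.8):
    """Reuse a stored pattern when possible, otherwise train a new one."""
    entries = db.get(dim)
    # Modules 502/506/507: known dimension without concept drift.
    if entries and not drifted(series, entries):
        return entries[-1][0]
    # Module 503: compare the series with every stored sample set.
    best_sim, best_pattern = None, None
    for stored in db.values():
        for pattern, samples in stored:
            sim = similarity(samples, series)
            if sim >= threshold and (best_sim is None or sim > best_sim):
                best_sim, best_pattern = sim, pattern
    # Module 504: a sufficiently similar sample set exists, reuse it.
    if best_sim is not None:
        return best_pattern
    # Module 505: train and store the new pattern with its sample set.
    pattern = train(series)
    db.setdefault(dim, []).append((pattern, list(series)))
    return pattern


if __name__ == "__main__":
    db = {"cpu": [("p_cpu", [1, 2, 3])]}
    same = lambda a, b: 1.0 if list(a) == list(b) else 0.0
    print(analyze("mem", [1, 2, 3], db, same, lambda s: "p_new",
                  lambda s, e: False))
```

Choosing the most similar candidate when several exceed the threshold is a design choice of this sketch; the text only requires that some sample data set reach the threshold.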

On the basis of the foregoing embodiment, as a preferred implementation, the second determining module 503 includes:

the dividing unit is used for dividing the sample data set and the time sequence data in the data pattern database into a plurality of subsequences respectively;

a fitting unit for performing a linear fit on each of the subsequences to determine a data pattern for each of the subsequences;

a first calculation unit, configured to calculate a distance between a data pattern of a subsequence in the sample data set and a data pattern of a subsequence in the time-series data;

the second calculation unit is used for calculating the similarity between the data mode of the sample data set and the data mode of the time sequence data by utilizing a dynamic time warping algorithm; wherein, an element of a distance matrix in the dynamic time warping algorithm is a distance between a data pattern of a subsequence in the sample data set and a data pattern of a subsequence in the time series data;

and the judging unit is used for judging whether a target sample data set whose similarity to the time series data is greater than or equal to a threshold exists.

On the basis of the foregoing embodiment, as a preferred implementation manner, the dividing unit is specifically configured to determine a first compression rate corresponding to a sample data set in the data pattern database and a second compression rate corresponding to the time series data; the sample data set is divided into a plurality of subsequences based on the first compression rate, and the time series data is divided into a plurality of subsequences based on the second compression rate.

On the basis of the foregoing embodiment, as a preferred implementation manner, the second determining module 503 further includes:

a determining unit, configured to determine a variation trend of each of the subsequences after determining the data pattern of each of the subsequences;

and the merging unit is used for merging the two adjacent subsequences and refitting the merged subsequences if the variation trends of the two adjacent subsequences are the same.

On the basis of the foregoing embodiment, as a preferred implementation manner, the determining unit is specifically configured to determine that the trend of change of the current subsequence is an increase if the intermediate-time observed value of the current subsequence is greater than the intermediate-time observed value of the previous subsequence; and if the middle-time observation value of the current subsequence is smaller than the middle-time observation value of the previous subsequence, determining the change trend of the current subsequence as a decrease.

On the basis of the foregoing embodiment, as a preferred implementation manner, the first calculating unit is specifically configured to calculate a distance between a data pattern of a subsequence in the sample data set and a data pattern of a subsequence in the time-series data based on a variation trend and a direction vector of the subsequence in the sample data set and a variation trend and a direction vector of the subsequence in the time-series data.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, and fig. 6 is a structural diagram of an electronic device according to an exemplary embodiment, as shown in fig. 6, the electronic device includes:

a communication interface 1 capable of information interaction with other devices such as network devices and the like;

and the processor 2 is connected with the communication interface 1 to realize information interaction with other equipment, and is used for executing the time sequence data analysis method provided by one or more technical schemes when running a computer program. And the computer program is stored on the memory 3.

In practice, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection and communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are all labeled as bus system 4 in fig. 6.

The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.

It will be appreciated that the memory 3 may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to include, without being limited to, these and any other suitable types of memory.

The method disclosed in the above embodiment of the present application may be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.

When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present application are realized, and for brevity, are not described herein again.

In an exemplary embodiment, the present application further provides a storage medium, i.e. a computer storage medium, specifically a computer readable storage medium, for example, including a memory 3 storing a computer program, which can be executed by a processor 2 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for analyzing time series data, comprising:

if so, multiplexing a data mode corresponding to the target sample data set;

2. The method of claim 1, wherein after obtaining the time series data comprising a plurality of target dimensions, the method further comprises:

if so, multiplexing a data mode corresponding to the target sample data set;

3. The method of analyzing time series data according to claim 1, wherein said calculating a similarity between the sample data set in the data pattern database and the time series data comprises:

4. The method according to claim 3, wherein dividing the sample data set and the time series data in the data pattern database into a plurality of subsequences respectively comprises:

5. The method of claim 3, wherein the determining the data pattern for each of the subsequences further comprises:

determining a trend of change for each of the subsequences;

6. The method of analyzing time series data according to claim 5, wherein said determining a trend of change for each of said subsequences comprises:

7. The method of claim 3, wherein the calculating a distance between a data pattern of a subsequence in the sample data set and a data pattern of a subsequence in the time series data comprises:

8. A time series data analysis apparatus, comprising:

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method of time series data analysis according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the time series data analysis method according to any one of claims 1 to 7.