CN109408498B

CN109408498B - Time series feature identification and decomposition method based on feature matrix decision tree

Info

Publication number: CN109408498B
Application number: CN201811170289.6A
Authority: CN
Inventors: 苏鹭梅; 朱文婷; 郑小龙; 郑锐洁; 张宝琼; 叶恺昕
Original assignee: Xiamen University of Technology
Current assignee: Xiamen University of Technology
Priority date: 2018-10-09
Filing date: 2018-10-09
Publication date: 2022-12-13
Anticipated expiration: 2038-10-09
Also published as: CN109408498A

Abstract

The invention provides a time series feature recognition and decomposition method based on a feature matrix decision tree, which mainly comprises sample data preprocessing, sample data period determination, sample data feature selection and extraction and multivariate time series feature recognition and decomposition model establishment. The method can improve the speed and accuracy of feature identification, and is particularly suitable for non-invasive load identification and decomposition in the power industry.

Description

Time series feature identification and decomposition method based on feature matrix decision tree

Technical Field

The invention relates to the field of big data analysis and mining, in particular to data identification and decomposition based on time series.

Background

In recent years, time-series data models have become ubiquitous in internet big data, machine data, sensor data and similar sources. They are widely applied in finance, e-commerce platforms, the process industry and other fields, and have therefore attracted broad attention.

Time series analysis is currently an active topic: a mathematical model is fitted to the series, its parameters are estimated, and the result is applied to prediction, industrial adaptive control, optimal filtering and many other tasks. Traditional time series analysis focuses on the analysis and change-point identification of univariate time series. As time series methods mature and their fields of application widen, models are no longer required only for univariate series but also for multivariate series, which therefore need to be analyzed, identified and decomposed.

Analysis, identification and decomposition of a time series not only extract the hidden features of the series but also determine the range of its periodic fluctuation and the division of the multivariate time series into its components. Decomposing a multivariate time series makes the data simpler and more intuitive and makes its rules and trends easier to obtain, so that the method can be applied in various fields.

Disclosure of Invention

Therefore, the invention provides a time series feature identification and decomposition method based on a feature matrix decision tree, which comprises the following specific scheme:

the time series feature identification and decomposition method based on the feature matrix decision tree comprises the following steps:

100. data preprocessing: carrying out data cleaning, data integration and data reduction on the sample data;

200. determining a sample period: the screened feature values are grouped at fixed intervals according to a specific period, the period being found by applying a Fourier transform to the time-series feature quantity to obtain an intensity spectrum, finding the frequency component with the largest amplitude, and taking its reciprocal as the period;

300. selecting and extracting features: feature subsets are evaluated by combining a sequential forward feature selection algorithm with a K-means clustering algorithm, the optimal feature subset is determined to complete feature selection, and highly discriminative features are then extracted from the selected sample features;

400. and establishing a multivariate time series characteristic identification and decomposition model.

Further, the data cleaning in the data preprocessing of step 100 uses the Grubbs method. Specifically, a "suspicious value" is identified in the sample data, its deviation from the mean is calculated, and the statistic $G_i$ is computed; $G_i$ is then compared with the critical value $G_P(n)$ looked up in the Grubbs table. If $G_i$ is greater than $G_P(n)$, the measured value is judged to be an outlier, and the "suspicious value" can be removed from the data sample so that it does not take part in the calculation of the mean.

Further, the data integration in the data preprocessing of step 100 uses a correlation coefficient method. Specifically, a correlation coefficient is obtained from the sample standard deviations and the sample covariance, and the strength of the relationship between the two variables is judged from its value. The correlation coefficient lies between -1 and 1, where 1 means that the two variables are perfectly positively linearly correlated, -1 means that they are perfectly negatively linearly correlated, and 0 means that they are uncorrelated; the closer the coefficient is to 0, the weaker the correlation.

Further, the method for data reduction in data preprocessing described in step 100 is a regression analysis method, and the relationship between variables is refined and solidified on the basis of the association degree between each parameter obtained by data integration, and irrelevant variables are removed, so that the dimensionality of the analyzed data sample is reduced, and a reliable model is mined.

Further, the specific process of step 300 is:

310. determining an optimal feature subset according to a sequential forward feature selection algorithm: suppose k features have already been selected and form a feature group $X_k$ of size k; the d-k unselected features $x_j$, j = 1, 2, ..., d-k, are ranked by the value of the criterion J obtained when each is combined with the features already selected; the sequential forward feature selection algorithm starts with an empty feature set, and in each subsequent cycle the best feature in the original feature set is selected and added to the set until the number of features increases to m;

320. evaluating the degree of separation of the features between samples of different classes with a K-means clustering algorithm: given a sample set, the K-means algorithm divides it into K clusters, each cluster center being the mean of the samples in the cluster; every other object is assigned to the nearest cluster according to its distance from the clusters, the centers of the new clusters are then recomputed, and this iterative relocation is repeated, reducing the sum of the distances between the samples and the center of their cluster, until the objective function is minimized, thereby selecting the optimal features;

330. extracting features based on a time-series feature selection algorithm: the feature values of the sample data are calculated, invalid periods in the sample data are eliminated, 15 valid periods of data are selected as the sample data, the feature values of these 15 periods are calculated, and the most discriminative features are extracted by classifying the feature values.

Further, the step of establishing the multivariate time series identification model in step 400 comprises the following sub-steps:

410. based on the C4.5 decision tree classification algorithm, each feature is assigned to one class, which corresponds to a leaf node of the decision tree; attribute values (i.e. sample feature parameters) are compared at the internal nodes of the decision tree in a top-down recursive manner, and the branch taken downward from each node is decided by the attribute value, until each class contains only a single result, i.e. the leaves are pure; identification and decomposition are then carried out with the optimal feature parameters obtained, to decide the data class to which the feature parameters belong;

420. an improved sliding-window bilateral CUSUM event detection algorithm is introduced to segment the time series, and the event detection program continuously tracks the change of the feature parameters at each sampling point. Whether a given feature parameter changes is detected over the whole time series, which identifies the features in the series; the instant in the series to which the current group of feature values belongs is then determined, and feature decomposition is carried out to establish which state of which data source the current data is in at the current moment;

430. establishing a category characteristic matrix based on a time sequence, averaging the characteristic values of data through training samples, solving a standard deviation of the mean value as a fluctuation level, introducing a category characteristic matrix decision tree, and establishing a time sequence characteristic probability model, thereby establishing the optimal solution of the current multivariate time sequence characteristic and finally realizing the automatic identification and decomposition of the characteristic.

The method can improve the speed and the accuracy of feature identification, and is particularly suitable for non-intrusive load identification in the power industry.

Drawings

FIG. 1 is a general flow diagram of the process of the present invention;

FIG. 2 is a flow chart of feature selection and extraction in the method of the present invention;

FIG. 3 is a flow chart of the multivariate time series identification modeling in the method of the present invention;

FIG. 4 is a schematic diagram of power spectrum analysis of a computer;

FIG. 5 is a flowchart of a sliding window bilateral CUSUM event detection method in the method of the present invention;

FIG. 6 is a schematic diagram of four stages of event detection in the CUSUM event detection method of FIG. 5.

Detailed Description

To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.

The invention will now be further described with reference to the accompanying drawings and detailed description.

Referring to FIG. 1, the overall flow of the process of the present invention is described. The method mainly comprises sample data preprocessing 100, sample data period determination 200, feature selection and extraction 300 and establishment of a multivariate time series feature identification and decomposition model 400.

In the sample data preprocessing 100, the present embodiment first cleans the data using the Grubbs method, i.e. it corrects recognizable errors in the data file, handles invalid, missing and abnormal values, and checks data consistency. A "suspicious value" is identified in the sample data, its deviation from the mean is calculated, and the statistic $G_i$ is computed; $G_i$ is then compared with the critical value $G_P(n)$ looked up in the Grubbs table, and if $G_i$ is greater than $G_P(n)$ the measured value is judged to be an outlier. The Grubbs method thus allows the "suspicious value" to be removed from the data samples so that it does not take part in the calculation of the mean.
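The Grubbs check described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function name `grubbs_outlier`, the sample readings and the choice of passing the table value in as an argument are assumptions, and the critical value 2.176 is the commonly tabulated figure for n = 10 at the 95 % level, which should be checked against the Grubbs table actually used.

```python
import math

def grubbs_outlier(values, critical_value):
    """Return (index, G) of the most suspicious value if G exceeds the
    critical value taken from a Grubbs table, otherwise None."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if std == 0:
        return None  # all values identical: nothing to flag
    # The "suspicious value" is the one that deviates most from the mean.
    idx = max(range(n), key=lambda i: abs(values[i] - mean))
    g = abs(values[idx] - mean) / std
    return (idx, g) if g > critical_value else None

# Hypothetical readings with one obvious outlier at the end.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.0, 15.0]
G_CRIT_N10 = 2.176  # assumed table value for n = 10, 95 % level
flagged = grubbs_outlier(readings, G_CRIT_N10)
if flagged is not None:
    cleaned = readings[:flagged[0]] + readings[flagged[0] + 1:]
```

In practice the test would be repeated on the cleaned sample with the critical value for the reduced n until no further value is flagged.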

Considering that some electrical equipment parameters may be highly correlated, the sample data are integrated using a correlation-coefficient method that reflects how closely two variables are related. The correlation coefficient is obtained from the sample standard deviations and the sample covariance, and the strength of the relationship between the two variables is judged from its value. The correlation coefficient is calculated as:

$$r_{xy} = \frac{S_{xy}}{S_x S_y} \qquad (1)$$

Sample covariance $S_{xy}$:

$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \qquad (2)$$

Sample standard deviation $S_x$:

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2} \qquad (3)$$

Sample standard deviation $S_y$:

$$S_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2} \qquad (4)$$

where $r_{xy}$ is the sample correlation coefficient, $S_{xy}$ is the sample covariance, $S_x$ and $S_y$ are the sample standard deviations of x and y, n is the number of samples, and $\bar{x}$ and $\bar{y}$ are the sample means. The degree of correlation corresponding to $r_{xy}$ is shown in Table 1:

TABLE 1 Reference table of the degree of correlation for the correlation coefficient $r_{xy}$

The correlation coefficient lies between -1 and 1, where 1 means that the two variables are perfectly positively linearly correlated, -1 means that they are perfectly negatively linearly correlated, and 0 means that they are uncorrelated. The closer the coefficient is to 0, the weaker the correlation.

Because the number of data samples to be analyzed is very large, the data samples are reduced, i.e. the parameter dimension is reduced. Specifically, regression analysis is used: on the basis of the degree of association between the parameters obtained from data integration, the relationships among the variables are refined and fixed, and irrelevant variables are removed, which reduces the dimensionality of the data samples and yields a reliable model. Taking the current, active power, reactive power, power factor and second-harmonic current of a laser printer and a notebook computer as examples, the results of regression analysis in MATLAB 2016a are shown in Tables 2 and 3:
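As a small illustration of the correlation-coefficient calculation, the sketch below computes the sample covariance and the sample standard deviations and divides one by the product of the others. The function name and the example current and power values are hypothetical and are not taken from Tables 2 and 3.

```python
import math

def sample_correlation(x, y):
    """Sample correlation coefficient from the sample covariance and the
    two sample standard deviations (all with the n - 1 divisor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical measurements: current in A and active power in W.
current = [1.0, 2.0, 3.0, 4.0]
active_power = [220.0, 440.0, 660.0, 880.0]
r_same = sample_correlation(current, active_power)
r_opposite = sample_correlation(current, active_power[::-1])
```

A pair of parameters whose coefficient is close to 1 or -1 carries largely redundant information, which is what the data reduction step exploits.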

TABLE 2 correlation between laser printer parameters

TABLE 3 correlation between computer parameters

In the step of determining the sample data period 200, the present embodiment uses spectrum analysis. The screened feature values are grouped at fixed intervals according to a specific period, which is found by applying a Fourier transform to the time-series feature quantity to obtain an intensity spectrum, finding the frequency component with the largest amplitude, and taking its reciprocal as the period, thereby improving the resolution of feature extraction.

The Fourier transform of the periodic discrete-time signal x(nT) can be expressed as:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N}, \quad k = 0, 1, \ldots, N-1 \qquad (5)$$

where x(n), n = 0, 1, ..., N-1, is the finite-length discrete signal.

FIG. 4 shows the spectral analysis of a computer. The period estimated from the raw data is about 400 s, and the frequency with the second-highest amplitude found by the algorithm is about 0.0025 Hz (a period of 1/0.0025 = 400 s), which is consistent. The frequency with the highest amplitude is not used because the data are not strictly periodic: the highest amplitude occurs near zero frequency, and the frequency with the second-highest amplitude is closer to the period of the data.

In the feature selection and extraction 300, the present embodiment evaluates feature subsets by combining the sequential forward feature selection algorithm with the K-means clustering algorithm, determines the optimal feature subset to complete feature selection, and then extracts highly discriminative features from the selected sample features. As shown in FIG. 2, the specific process is as follows:
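The period estimate can be illustrated with a direct discrete Fourier transform. The sketch below is an assumption-laden illustration: the function name, the synthetic signal and the sampling interval are invented for the example, and skipping only the zero-frequency bin is one simple way of ignoring the peak near zero mentioned above.

```python
import cmath
import math

def dominant_period(samples, dt):
    """Estimate the period as the reciprocal of the non-zero frequency
    with the largest DFT amplitude (direct O(N^2) evaluation)."""
    n = len(samples)
    best_k, best_amp = 1, -1.0
    for k in range(1, n // 2 + 1):  # k = 0 (the mean) is deliberately skipped
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        if abs(coeff) > best_amp:
            best_k, best_amp = k, abs(coeff)
    frequency = best_k / (n * dt)  # in Hz when dt is in seconds
    return 1.0 / frequency

# Synthetic power trace: 400 s period, sampled every 10 s for 2000 s.
DT = 10.0
trace = [50.0 + 10.0 * math.sin(2 * math.pi * (i * DT) / 400.0)
         for i in range(200)]
period = dominant_period(trace, DT)
```

For long records a fast Fourier transform routine would replace the double loop; the direct sum is used here only to keep the example free of dependencies.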

310. The optimal feature subset is determined according to a sequential forward feature selection algorithm. Suppose k features have been selected and form a feature group $X_k$ of size k. The d-k unselected features $x_j$, j = 1, 2, ..., d-k, are ranked by the value of the criterion J obtained after each is combined with the features already selected, i.e. if

$$J(X_k + x_1) \ge J(X_k + x_2) \ge \dots \ge J(X_k + x_{d-k}) \qquad (6)$$

then the feature set selected in the next step is

$$X_{k+1} = X_k + x_1 \qquad (7)$$

The sequential forward feature selection algorithm starts with an empty feature set, and in each subsequent cycle, the best feature in the original feature set is selected and added to the set until the number of features increases to m.
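A minimal sketch of the greedy loop in step 310 is given below. The criterion J is left as a caller-supplied function; the feature names, the individual scores and the redundancy penalty used in the example are hypothetical and only serve to make the selection order visible.

```python
def sequential_forward_selection(candidates, criterion, m):
    """Start from an empty set and, in each cycle, add the candidate that
    maximizes criterion(selected + [candidate]) until m features are chosen."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < m:
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical criterion: individual scores minus a penalty when two
# strongly correlated features are chosen together.
SCORES = {"current": 0.6, "active_power": 0.9,
          "reactive_power": 0.5, "power_factor": 0.4}

def example_criterion(subset):
    value = sum(SCORES[f] for f in subset)
    if "current" in subset and "active_power" in subset:
        value -= 0.5  # assumed redundancy penalty
    return value

chosen = sequential_forward_selection(SCORES, example_criterion, m=2)
```

With these assumed numbers the first cycle picks `active_power`, and the second prefers `reactive_power` over `current` because of the penalty, which mirrors the ordering in (6) and the update in (7).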

320. The degree of separation of the features between samples of different classes is evaluated with a K-means clustering algorithm. Geometrically, the greater the separability between classes, the larger the between-class distance and the further apart samples of different classes lie; likewise, the smaller the within-class distance, the more tightly each class is aggregated. Given a sample set, the K-means algorithm divides it into K clusters, each cluster center being the mean of the samples in the cluster. Every other object is assigned to the nearest cluster according to its distance from the clusters, the centers of the new clusters are then recomputed, and this iterative relocation is repeated, reducing the sum of the distances between the samples and the center of their cluster, until the objective function is minimized, thereby selecting the optimal features.

330. Because the clustering result largely cannot complete feature selection for all classes of sample data, the present embodiment proposes, in order to improve the efficiency of feature selection, a feature extraction method based on time-domain statistical features. The operating feature values of the electrical equipment are calculated, invalid periods in the sample data are eliminated, and 15 valid periods of data are selected as the sample data; the feature values of these 15 periods are calculated, and strongly discriminative features are extracted by comparing time-domain statistical features such as the mean, the variance and the skewness.
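The iterative relocation in step 320 can be sketched as below. This is a plain Lloyd-style K-means under stated assumptions: the initial centers are supplied by the caller because the text does not say how they are chosen, squared Euclidean distance to the cluster center is used for assignment, and the sample points are invented.

```python
def kmeans(points, initial_centers, max_iterations=100):
    """Assign each point to the nearest center, recompute each center as
    the mean of its members, and repeat until the assignment is stable."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the centers are final
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Hypothetical two-feature samples from two device classes.
samples = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
           (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centers, labels = kmeans(samples, initial_centers=[(0.0, 0.0), (6.0, 6.0)])
```

A feature subset for which the resulting clusters coincide with the known device classes, with well-separated centers, would score well as the criterion J in the forward selection of step 310.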
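The time-domain statistics named in step 330 can be computed per period as in the sketch below. The function names are hypothetical, the series is assumed to have already had its invalid periods removed, and the population (divide-by-n) forms of variance and skewness are an assumption since the text does not state which form is used.

```python
import math

def time_domain_features(period_samples):
    """Mean, variance and skewness of one period of samples."""
    n = len(period_samples)
    mean = sum(period_samples) / n
    variance = sum((x - mean) ** 2 for x in period_samples) / n
    std = math.sqrt(variance)
    if std == 0:
        skewness = 0.0  # a constant period has no asymmetry
    else:
        skewness = sum((x - mean) ** 3 for x in period_samples) / (n * std ** 3)
    return {"mean": mean, "variance": variance, "skewness": skewness}

def features_per_period(series, period_length, n_periods=15):
    """Split the series into consecutive periods and compute the
    statistics of each; 15 periods follows the text above."""
    return [
        time_domain_features(series[i * period_length:(i + 1) * period_length])
        for i in range(n_periods)
    ]

# Hypothetical series: 15 identical periods of 5 samples each.
series = [1.0, 2.0, 3.0, 4.0, 5.0] * 15
table = features_per_period(series, period_length=5)
```

Comparing these rows across devices shows which statistic separates them most clearly.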

In the process of establishing the multivariate time series feature identification and decomposition model 400, referring to fig. 3, the process of the multivariate time series feature identification model of the embodiment is described, which includes the following sub-steps:

410. Based on the C4.5 decision tree classification algorithm, each feature is assigned to one class, which corresponds to a leaf node of the decision tree. Attribute values (i.e. sample feature parameters) are compared at the internal nodes of the decision tree in a top-down recursive manner, and the branch taken downward from each node is decided by the attribute value, until each class contains only a single result, i.e. the leaves are pure. Identification and decomposition are then carried out with the optimal feature parameters obtained, to decide the class to which the feature parameters belong.

The C4.5 decision tree classification algorithm is a supervised classification learning algorithm. Suppose the sample set is denoted C and the proportion of samples of the k-th class in it is $P_k$ (k = 1, 2, ..., a), where a is the total number of classes in the sample set. The information entropy of the sample set is defined as:

$$\mathrm{Ent}(C) = -\sum_{k=1}^{a} P_k \log_2 P_k$$

Suppose the sample set is divided according to attribute B. If B has X possible values, X branch nodes are generated, where the x-th branch node (x = 1, 2, ..., X) contains all samples of the set whose value on attribute B is $B^x$, denoted $C^x$. The information gain obtained by dividing the sample set on attribute B can be defined as:

$$\mathrm{Gain}(C, B) = \mathrm{Ent}(C) - \sum_{x=1}^{X} \frac{|C^x|}{|C|}\, \mathrm{Ent}(C^x)$$

Further, the information gain ratio of attribute B is:

$$\mathrm{GainRatio}(C, B) = \frac{\mathrm{Gain}(C, B)}{\mathrm{IV}(B)}, \qquad \mathrm{IV}(B) = -\sum_{x=1}^{X} \frac{|C^x|}{|C|} \log_2 \frac{|C^x|}{|C|}$$

The gain ratios of the different attributes are calculated with this formula, and the attribute with the largest gain ratio is selected as the splitting attribute of the current split. The gain ratios of the remaining attributes are then calculated in the same way and splitting continues until all attributes have been used or all samples take the same value on every attribute, so that no further split is possible.
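The attribute choice at a node can be illustrated with the short sketch below, which computes the information entropy, the information gain and the gain ratio for discrete attribute values. The function names, the device labels and the attribute values are hypothetical, and continuous feature parameters are assumed to have been discretized beforehand.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """Information gain of splitting on the attribute, divided by the
    split information of that attribute (0.0 if the attribute is constant)."""
    n = len(labels)
    branches = {}
    for value, label in zip(attribute_values, labels):
        branches.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(
        len(b) / n * entropy(b) for b in branches.values())
    split_info = -sum(
        len(b) / n * math.log2(len(b) / n) for b in branches.values())
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical training samples for two devices.
devices = ["printer", "printer", "notebook", "notebook"]
power_level = ["high", "high", "low", "low"]  # separates the classes
time_slot = ["am", "pm", "am", "pm"]          # carries no class information
ratios = {
    "power_level": gain_ratio(power_level, devices),
    "time_slot": gain_ratio(time_slot, devices),
}
split_attribute = max(ratios, key=ratios.get)
```

Repeating this choice on each branch until the leaves are pure gives the top-down recursion described in step 410.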

420. An improved sliding-window bilateral CUSUM event detection algorithm is introduced to segment the time series, and the event detection program continuously tracks the change of the feature parameters at each sampling point. Whether a given feature parameter changes is detected over the whole time series, which identifies the features in the series; the instant in the series to which the current group of feature values belongs is then determined, and feature decomposition is carried out to establish which state of which data source the current data is in at the current moment.

The following describes an improved sliding window bilateral CUSUM event detection algorithm by taking detection of residential electric equipment as an example, and the algorithm specifically includes the following steps:

setting an active power time sequence

Two consecutive sliding windows, $W_s$ (steady-state mean window) and $W_u$ (transient mean window), are defined on the time series with lengths s and u respectively, and their means $A_s$ and $A_u$ are calculated as follows:

then define respectively

And

to detect whether a load is switched on (the power increases) or switched off (the power decreases) at the current moment, and a fluctuation level ε is defined to characterize the time series in steady state, calculated as follows:

Taking as an example whether a device is switched on or changes state in the time series, the sliding-window bilateral CUSUM event detection method proceeds as follows. For the detection of a switch-on event, when the mean $A_u$ of the detection window is greater than $A_s + \varepsilon$,

starts to increase. A threshold K for deciding that an event has occurred then needs to be set; when

holds, an event may be occurring at this moment. To avoid a load switch-on or switch-off event being recognized several times because the series oscillates, a time delay factor d (initial value 0) is introduced and incremented by 1 at each step;

And

make a comparison if

Then it is considered that what caused the active power change at that time is a fluctuation, and let

d = 0, thereby avoiding multiple identifications of a single event caused by fluctuations in the device data. When

let d = d + 1 and calculate

until

The occurrence time of the detected event can then be derived from $t - d$. The sliding-window bilateral CUSUM event detection process for a load switch-on event is shown in FIG. 5, and the process for detecting a switch-off event is obtained in the same manner.
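The switch-on branch of the detector can be sketched as follows. Because the defining formulas are not reproduced in this text, the sketch rests on explicit assumptions: the cumulative statistic `g` grows by $A_u - (A_s + \varepsilon)$ at each sample and is clipped at zero, a return to zero is treated as a fluctuation that resets d, and after a detection the windows are moved past the event so that the same step is not reported twice. The function name and the power trace are hypothetical.

```python
def detect_switch_on(power, s, u, epsilon, threshold):
    """Return the indices t - d at which a switch-on event is declared."""
    events = []
    g = 0.0        # assumed one-sided cumulative sum for power increases
    d = 0          # delay factor: samples since g last left zero
    t = s + u - 1  # first index at which both windows are full
    while t < len(power):
        a_s = sum(power[t - u - s + 1:t - u + 1]) / s  # steady-state mean
        a_u = sum(power[t - u + 1:t + 1]) / u          # transient mean
        g = max(0.0, g + a_u - (a_s + epsilon))
        if g == 0.0:
            d = 0   # statistic fell back to zero: treat as a fluctuation
        else:
            d += 1
        if g > threshold:
            event = t - d   # last sample before the statistic began to rise
            events.append(event)
            g, d = 0.0, 0
            t = event + s + u  # assumed restart: both windows after the event
            continue
        t += 1
    return events

# Hypothetical active power trace: 100 W, then a 500 W step at index 20.
trace = [100.0] * 20 + [600.0] * 20
switch_on_events = detect_switch_on(trace, s=5, u=3, epsilon=5.0, threshold=300.0)
```

A switch-off detector would mirror this with the sign of $A_u - A_s$ reversed, which is what makes the method bilateral.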

When the sliding windows of the sliding-window bilateral CUSUM event detection program slide over the occurrence time of an event, the process can be divided into 4 stages, as shown in FIG. 6, where $P_0$ is the active power before the event and $\Delta P$ is the difference between the active power after the event and $P_0$.

a. The first stage: the transient detection window has not yet reached the event, and the means of both windows remain unchanged, i.e. $A_u - A_s = 0$;

b. The second stage: the event occurrence time lies within the transient detection window; $A_u$ changes continuously while $A_s$ does not. Let $P_1 = P_0 + \Delta P$ and $t_d = t - t_1$ with $t_d \in (1, u)$; at each moment of this stage the corresponding $A_s$ and $A_u$ are respectively $A_s = P_0$,

c. The third stage: the event occurrence time lies within the mean calculation window; $A_u$ is unchanged while $A_s$ changes continuously, with $(t_d - u) \in (1, s-1)$; at each moment of this stage the corresponding $A_s$ and $A_u$ are respectively

d. The fourth stage: both windows have slid past the event occurrence time, and neither $A_s$ nor $A_u$ changes.

The above calculation and analysis of the threshold K assume devices that switch on instantaneously, but many residential electrical devices, such as microwave ovens and printers, do not. To reduce the event identification error rate, a compromise is adopted: the mean of the maximum and minimum values of the threshold is used as the threshold for deciding that an event has occurred, where the maximum and minimum values are

From the above derivation, K can be determined once $A_s$, $A_u$ and the minimum power of the devices to be identified are known. The threshold K for determining the occurrence of an event is then obtained as:

$$K = (K_{\max} + K_{\min})/2 \qquad (12)$$

430. establishing a category characteristic matrix based on a time sequence, averaging characteristic values of data through training samples, solving a standard deviation of the mean value as a fluctuation level, introducing a category characteristic matrix decision tree, and establishing a time sequence characteristic probability model, so that an optimal solution of the current multivariate time sequence characteristic is established, and automatic identification and decomposition of the multivariate time sequence characteristic are finally realized.
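One way to picture the category feature matrix of step 430 is a table holding, for every class and every feature, the training mean and the standard deviation used as the fluctuation level. The sketch below builds such a table and assigns an observation to the class with the smallest standardized distance. The text does not spell out the probability model, so this matching rule, the function names and the training values are all assumptions made for illustration.

```python
import math

def build_feature_matrix(training):
    """training maps a class name to a list of feature vectors; the result
    maps each class to a list of (mean, fluctuation_level) per feature."""
    matrix = {}
    for name, rows in training.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            std = math.sqrt(
                sum((v - mean) ** 2 for v in column) / (len(column) - 1))
            stats.append((mean, max(std, 1e-9)))  # floor avoids division by zero
        matrix[name] = stats
    return matrix

def classify(matrix, observation):
    """Pick the class whose means are closest to the observation when each
    feature is scaled by its fluctuation level (assumed matching rule)."""
    def score(stats):
        return sum(((x - m) / sd) ** 2 for x, (m, sd) in zip(observation, stats))
    return min(matrix, key=lambda name: score(matrix[name]))

# Hypothetical training data: [active power in W, power factor].
training = {
    "laser_printer": [[400.0, 0.95], [420.0, 0.96], [410.0, 0.94]],
    "notebook": [[60.0, 0.60], [65.0, 0.62], [62.0, 0.58]],
}
feature_matrix = build_feature_matrix(training)
guess = classify(feature_matrix, [405.0, 0.95])
```

In the method described here the matrix would feed a decision tree and a time-series feature probability model rather than this single distance comparison.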

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A time series feature recognition and decomposition method based on a feature matrix decision tree is used for non-intrusive load recognition in the power industry, and is characterized by comprising the following steps:

200. determining a sample period: performing data screening and grouping on the characteristic values obtained by screening according to a preset period and a preset number at intervals, wherein the grouping method is to perform Fourier transform on the time sequence characteristic quantity to obtain an intensity spectrum, find out the maximum frequency component and determine the reciprocal of the maximum frequency component as the period;

300. selecting and extracting features, namely evaluating feature subsets by adopting a combination sequence forward feature selection algorithm and a K-means clustering algorithm and determining an optimal feature subset to complete feature selection, and then extracting high-identification-degree features from the selected sample features, wherein the specific process comprises the following steps of:

310. determining an optimal feature subset according to a sequential forward feature selection algorithm: k selected features form a feature group $X_k$ of size k, and the d-k unselected features $x_j$, j = 1, 2, ..., d-k, are ranked by the value of the criterion J obtained when each is combined with the features already selected; the sequential forward feature selection algorithm starts from an empty feature set, selects the best feature in the original feature set in each subsequent cycle, and adds it to the optimal feature subset until the number of features increases to m;

330. extracting features based on a time sequence feature selection algorithm, calculating a feature value of sample data, eliminating invalid periods in the sample data, selecting 15 period data with feasibility as the sample data, calculating the feature value of the 15 period data, and extracting and obtaining the features with the highest identification degree through feature value classification;

400. establishing a multivariate time series feature recognition and decomposition model, comprising the following substeps:

410. based on the C4.5 decision tree classification algorithm, each feature is assigned to one class, namely a leaf node of the decision tree; attribute values, namely sample feature parameters, are compared at the internal nodes of the decision tree in a top-down recursive manner, and the branch taken downward from each node is decided by the attribute value, until each class contains only a single result, namely the leaves are pure; identification and decomposition are carried out with the optimal feature parameters obtained, to decide the class to which the feature parameters belong;

420. introducing an improved sliding window bilateral CUSUM event detection algorithm to detect a load input event, and specifically, setting an active power time sequence

defining two consecutive sliding windows on the time series, namely a steady-state mean window $W_s$ and a transient mean window $W_u$, with lengths s and u respectively, and calculating their means $A_s$ and $A_u$; defining respectively

And

to detect whether a load is switched on or switched off at the current moment, and defining a fluctuation level ε characterizing the time series in steady state; when the mean $A_u$ of the detection window is greater than $A_s + \varepsilon$,

starts to increase, and a threshold K for determining the occurrence of an event needs to be set; when

holds, an event may be occurring at this moment; a time delay factor d with an initial value of 0 is introduced and incremented by 1 at each step;

And

make a comparison if

Then it is considered that what caused the active power change at that time is a fluctuation, and order

d = 0; when

let d = d + 1 and calculate

until

the occurrence time of the detected event can then be derived from $t - d$; the time series is segmented, and the event detection program continuously tracks the change of the feature parameters at each sampling point; whether a given feature parameter changes is detected over the whole time series, which identifies the features in the series; the instant in the series to which the current group of feature values belongs is then determined, and feature decomposition is carried out to establish which state of which data source the current data is in at the current moment;

2. The method of claim 1, wherein the data cleaning in the data preprocessing of step 100 uses the Grubbs method, specifically: a suspicious value is identified in the sample data, its deviation from the mean is calculated, and the statistic $G_i$ is computed; $G_i$ is compared with the critical value $G_P(n)$ looked up in the Grubbs table, and the data is judged to be an outlier if $G_i$ is greater than $G_P(n)$.

3. The method of claim 1, wherein the method of data integration in the data preprocessing described in step 100 is a correlation coefficient method.

4. The method of claim 1, wherein the reduction of the data in the pre-processing of the data in step 100 is a regression analysis.