CN111612082B - Method and device for detecting abnormal subsequence in time sequence - Google Patents

Method and device for detecting abnormal subsequence in time sequence

Publication number
CN111612082B
Authority
CN
China
Prior art keywords
time
numerical
sequence
interval
point
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010456099.1A
Other languages
Chinese (zh)
Other versions
CN111612082A (en)
Inventor
翟波
张亚
曾海芳
覃桢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Xiaopenguin Medical Technology Co ltd
Original Assignee
Hebei Xiaopenguin Medical Technology Co ltd
Application filed by Hebei Xiaopenguin Medical Technology Co ltd
Priority to CN202010456099.1A
Publication of CN111612082A
Application granted
Publication of CN111612082B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An embodiment of the invention provides a method and device for detecting an abnormal subsequence in a time sequence. The method comprises: forming a tuple from a single numerical value and a single time point, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any time point; constructing a plurality of split points that divide the numerical space of the time sequence into a plurality of numerical intervals, acquiring the probability density of the time sequence, acquiring from the probability density the probability that any time point in the time sequence falls into any numerical interval, constructing an interval table from the probability and the plurality of numerical intervals, and constructing an extended interval table from the interval table; and acquiring the weight of each time point of each subsequence of the time sequence in the extended interval table, taking the average of all the weights as the score of each subsequence, where a smaller score indicates that the subsequence is more likely to be abnormal. The invention ensures the detection precision and reliability of the abnormal subsequence.

Description

Method and device for detecting abnormal subsequence in time sequence
Technical Field
The embodiment of the invention relates to the technical field of data mining, in particular to a method and equipment for detecting abnormal subsequences in a time sequence.
Background
In real life, many fields produce large amounts of time-series data, such as patients' electrocardiographic and electroencephalographic data, industrial sensor data, and network traffic data. Time-series data is data ordered by the time at which it was generated. It therefore records how some activity fluctuates along the time dimension, and an abnormal subsequence in the data may carry more important information than the many normal subsequences. For example, abnormal electrocardiographic data may mean that a patient suffers from some type of heart disease, and abnormal electroencephalographic data may be caused by brain diseases such as epilepsy. Detecting abnormal subsequences (patterns) in time series is therefore an important field: most of the data in a time series that contains abnormal patterns is normal and the abnormal patterns occur very rarely, but these rare patterns carry very important information. Unsupervised time-series anomaly detection algorithms do not need labeled data and belong to the lazy-learning family of machine learning algorithms. Existing unsupervised abnormal-subsequence detection algorithms compare pairs of subsequences of a time series to judge whether one is abnormal. However, time-series data is dynamic and often high-dimensional, so such pairwise comparison often incurs a large time overhead, and converting the time series to another representation often loses information in the time dimension, which reduces detection accuracy. Research on detecting abnormal subsequences in time-series data is therefore of great practical significance.
Therefore, developing a method for detecting abnormal subsequences in a time sequence, which can effectively overcome the above-mentioned drawbacks of the related art, is a technical problem to be solved in the industry.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the invention provides a method and equipment for detecting an abnormal subsequence in a time sequence.
In a first aspect, an embodiment of the present invention provides a method for detecting an abnormal subsequence in a time sequence, including: forming a tuple from a single numerical value and a single time point, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any time point; constructing a plurality of split points that divide the numerical space of the time sequence into a plurality of numerical intervals, acquiring the probability density of the time sequence, acquiring from the probability density the probability that any time point in the time sequence falls into any numerical interval, constructing an interval table from the probability and the plurality of numerical intervals, and constructing an extended interval table from the interval table; acquiring the weight of each time point of each subsequence of the time sequence in the extended interval table, taking the average of all the weights as the score of each subsequence, where a smaller score indicates that the subsequence is more likely to be abnormal; wherein the numerical space is made up of all the values in the plurality of tuples, and the probability that any numerical point falls within any numerical interval is the same.
Based on the foregoing method embodiment, the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention uses a single value and a single time point to form a tuple, and forms a plurality of tuples into a time sequence, including:
$$P=\{(t_1,p_1),(t_2,p_2),(t_3,p_3),\ldots,(t_n,p_n)\}$$
wherein $n$ is the length of the time sequence and is any integer; $(t_n, p_n)$ is a tuple; $P$ is the time sequence; $t_n$ is a single time point; $p_n$ is a single value.
Based on the foregoing method embodiment, in the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, defining the similarity of different time sequences at any time point includes: if, at any time point from $t_1$ to $t_n$, the values of the first time sequence and the second time sequence lie in the same numerical interval, judging that the first time sequence and the second time sequence are similar at that time point.
Based on the content of the embodiment of the method, the method for detecting the abnormal subsequence in the time sequence provided in the embodiment of the invention, wherein the probability density is as follows:
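$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^{2}}{2}}$$
the probability is:
$$P(\beta_i < x \le \beta_{i+1}) = \int_{\beta_i}^{\beta_{i+1}} f(x)\,dx = \frac{1}{S}$$
wherein $x$ is any time point; $S$ is the number of numerical intervals; $\beta_i$ is the $i$-th split point; $i = 0, \ldots, S-1$.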
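Based on the foregoing method embodiment, in the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, constructing a plurality of split points includes:
[Formula for the construction function $G$; the formula image is not reproduced in this text]
[Formula for the iterative solution of the split points; the formula image is not reproduced in this text]
wherein $p'$ is the derivative of $p$ with respect to time; $G$ is the constructed function: if $G$ is zero, $\beta_i$ is determined to be a split point; if $G$ is not zero, $\beta_i$ is determined not to be a split point.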
Based on the foregoing method embodiment, in the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, an interval table is constructed according to the probability and a plurality of numerical intervals, and correspondingly, elements of the interval table include:
[Formula for the elements of the interval table; the formula image is not reproduced in this text]
wherein $j$ denotes the $j$-th numerical interval; ITable is an element of the interval table.
Based on the foregoing method embodiment, in the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, taking the average of all the weights as the score of each subsequence includes:
[Formula for $score(P)$; the formula image is not reproduced in this text]
[Formula for $score(t_i)$; the formula image is not reproduced in this text]
wherein $t_i$ is the $i$-th time point; $score(t_i)$ is the weight of time point $t_i$ in the extended interval table; $score(P)$ is the score of the subsequence; $w$ is a weight; $r_{j+1,i}$ is the compactness coefficient between the position of time point $t_i$ in the numerical space and the adjacent upper interval; $r_{j-1,i}$ is the compactness coefficient between the position of time point $t_i$ in the numerical space and the adjacent lower interval.
In a second aspect, an embodiment of the present invention provides an apparatus for detecting an abnormal subsequence in a time sequence, including:
the sequence construction module is used for forming a tuple by adopting a single numerical value and a single time point, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any time point;
an extended interval table construction module, configured to construct a plurality of split points that divide the numerical space of the time sequence into a plurality of numerical intervals, acquire the probability density of the time sequence, acquire from the probability density the probability that any time point in the time sequence falls into any numerical interval, construct an interval table from the probability and the plurality of numerical intervals, and construct an extended interval table from the interval table;
an anomaly determination module, configured to acquire the weight of each time point of each subsequence of the time sequence in the extended interval table and take the average of all the weights as the score of each subsequence, where a smaller score indicates that the subsequence is more likely to be abnormal;
wherein the numerical space is made up of all the values in the plurality of tuples, and the probability that any numerical point falls within any numerical interval is the same.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to execute the method for detecting an abnormal subsequence in a time sequence provided by any of the possible implementations of the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform a method of detecting an abnormal sub-sequence in a time sequence provided by any one of the various possible implementations of the first aspect.
According to the method and the device for detecting the abnormal subsequence in the time sequence, the time sequence and the similarity thereof are redefined, the numerical space is divided into a plurality of numerical intervals, the probability density and the corresponding falling probability of the time sequence are further obtained, an extended interval table is constructed on the basis, the subsequence of the time sequence is scored according to the weight in the extended interval table, algorithm detection efficiency can be improved on the premise that the time sequence is complete in time dimension information, and detection precision and reliability of the abnormal subsequence are guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without any inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for detecting an abnormal subsequence in a time sequence according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a position of a time point of electrocardiographic data in a numerical space according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of similarity of numerical points at the same time points in different time sequences according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an apparatus for detecting abnormal subsequences in a time sequence according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In addition, the technical features of the various embodiments or the single embodiments provided in the present invention may be combined with each other arbitrarily to form a feasible technical solution, but it is necessary to base that a person skilled in the art can implement the solution, and when the combination of the technical solutions contradicts or cannot implement the solution, it should be considered that the combination of the technical solutions does not exist and is not within the scope of protection claimed in the present invention.
The embodiment of the invention provides a method for detecting abnormal subsequences in a time sequence, referring to fig. 1, the method comprises the following steps:
101. adopting a single numerical value and a single moment to form a tuple, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any moment;
102. constructing a plurality of splitting points, dividing a numerical space in a time sequence into a plurality of numerical intervals, acquiring probability density of the time sequence, acquiring probability that any time point in the time sequence falls into any numerical interval according to the probability density, constructing an interval table according to the probability and the plurality of numerical intervals, and constructing an extended interval table according to the interval table;
103. acquiring the weight of each time point of each subsequence of the time sequence in the extended interval table, taking the average of all the weights as the score of each subsequence, where a smaller score indicates that the subsequence is more likely to be abnormal;
wherein the numerical space is made up of all the values in the plurality of tuples, and the probability that any numerical point falls within any numerical interval is the same.
Based on the foregoing disclosure of the foregoing method embodiment, as an optional embodiment, the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, where a single value and a single point of time are adopted to form a tuple, and a plurality of tuples are formed into a time sequence, includes:
$$P=\{(t_1,p_1),(t_2,p_2),(t_3,p_3),\ldots,(t_n,p_n)\} \tag{1}$$
wherein $n$ is the length of the time sequence and is any integer; $(t_n, p_n)$ is a tuple; $P$ is the time sequence; $t_n$ is a single time point; $p_n$ is a single value.
Specifically, assume a time series as in formula (1). If each tuple $(t_i, p_i)$ is regarded as a coordinate in two-dimensional space, it locates a point in that space. The tuple $(t_i, p_i)$ can therefore be understood as using $p_i$ to represent the position of time point $t_i$ in the numerical space. FIG. 2 illustrates this representation, taking one electrocardiogram record from ECG200 as an example: the subscript of $p_i$ in the time series $P$ denotes the time point $t_i$ (from $t_0$ to $t_{95}$), and the figure shows the position of each time point of the electrocardiographic data in the numerical space.
Based on the foregoing method embodiment, as an optional embodiment, in the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, defining the similarity of different time sequences at any time point includes: if, at any time point from $t_1$ to $t_n$, the values of the first time sequence and the second time sequence lie in the same numerical interval, judging that the first time sequence and the second time sequence are similar at that time point.
Specifically, for any time series of equal length, the time points $t_i$ are identical, so the difference between the time series can only be reflected in the values $p_i$ at each time point, and $p_i$ represents the position of the corresponding $t_i$ in the numerical space. The similarity of time series can therefore be computed by measuring the adjacency, in the numerical space, of corresponding time points of the series. If the positions of a time point in the numerical space are adjacent, the series are similar at that time point; if the positions are far apart, the series are dissimilar at that time point. That is, if the positions in the numerical space of time point $t_i$ of time series $P$ and $Q$ lie in the same numerical interval, then $P$ and $Q$ are adjacent, also called similar, at $t_i$. For example, in FIG. 3, where the whole numerical space is divided into five intervals by straight lines, the time series $P$ and $Q$ are adjacent at time point $t_{12}$ and not adjacent at time point $t_0$ (the figure covers time points $t_0$ to $t_{18}$).
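The same-interval test described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the patented implementation: the function names, the sample values, and the split points are assumptions chosen only to give five intervals, as in the FIG. 3 description.

```python
from bisect import bisect_right

def interval_index(value, split_points):
    """Index of the value interval containing `value` (0 .. len(split_points))."""
    return bisect_right(split_points, value)

def similar_at(p_values, q_values, split_points):
    """Per time point: True when both series fall in the same value interval."""
    if len(p_values) != len(q_values):
        raise ValueError("series must have equal length")
    return [
        interval_index(p, split_points) == interval_index(q, split_points)
        for p, q in zip(p_values, q_values)
    ]

# Hypothetical split points giving five intervals.
splits = [-0.84, -0.25, 0.25, 0.84]
P = [1.2, 0.1, -0.5]   # made-up values of series P at t_0, t_1, t_2
Q = [-1.0, 0.2, -0.3]  # made-up values of series Q at the same time points
flags = similar_at(P, Q, splits)
print(flags)  # [False, True, True]
```

A value that lies exactly on a split point is assigned to the upper interval here; the text does not say how ties are handled, so that choice is an assumption.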
Based on the foregoing content of the method embodiment, as an optional embodiment, the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, the probability density is:
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^{2}}{2}} \tag{2}$$
the probability is:
$$P(\beta_i < x \le \beta_{i+1}) = \int_{\beta_i}^{\beta_{i+1}} f(x)\,dx = \frac{1}{S} \tag{3}$$
wherein $x$ is any time point; $S$ is the number of numerical intervals; $\beta_i$ is the $i$-th split point; $i = 0, \ldots, S-1$.
Specifically, the numerical intervals are obtained by dividing the whole numerical space into intervals of equal probability, that is, the probability that any data point falls within any numerical interval is the same. Dividing the numerical space into $S$ numerical intervals requires determining $S-1$ split points $\beta_1 < \beta_2 < \beta_3 < \ldots < \beta_{S-1}$. The $S$ intervals are $[\beta_0, \beta_1], [\beta_1, \beta_2], \ldots, [\beta_{S-1}, \beta_S]$, where $\beta_0 = -\infty$ and $\beta_S = +\infty$. Assuming that the time series follows the standard normal distribution $X \sim N(0, 1)$, the probability density function of the time series is as shown in formula (2), and the probability that any time point of the time series falls within any numerical interval is computed as shown in formula (3).
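Under the standard normal assumption, the equal-probability split points are simply the quantiles at $k/S$. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `equiprobable_split_points` is an illustrative assumption, not a name from the patent.

```python
from statistics import NormalDist

def equiprobable_split_points(num_intervals):
    """Return the S-1 finite split points that cut N(0, 1) into S equal-probability intervals."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf(k / num_intervals) for k in range(1, num_intervals)]

splits = equiprobable_split_points(4)

# Each interval, including the two unbounded outer ones, should carry probability 1/S.
std_normal = NormalDist()
cdf_values = [0.0] + [std_normal.cdf(b) for b in splits] + [1.0]
interval_probs = [hi - lo for lo, hi in zip(cdf_values, cdf_values[1:])]
```

For $S = 4$ this gives split points close to $-0.674$, $0$ and $0.674$, and every entry of `interval_probs` is approximately $0.25$.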
Based on the foregoing disclosure of the foregoing method embodiment, as an optional embodiment, a method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, the constructing a plurality of split points includes:
[Formula (4), the construction function $G(x)$; the formula image is not reproduced in this text]
[Formula (5), the iterative solution for $\beta_{i+1}$; the formula image is not reproduced in this text]
wherein $p'$ is the derivative of $p$ with respect to time; $G$ is the constructed function: if $G$ is zero, $\beta_i$ is determined to be a split point; if $G$ is not zero, $\beta_i$ is determined not to be a split point.
Specifically, the split points that divide the numerical intervals can be taken from the SAX literature. To verify the validity of the algorithm over a larger number of numerical intervals, Newton's method is used to compute the split points for those cases. First, the function $G(x)$ is constructed as shown in formula (4), where $\beta_i$ is the known previous split point ($\beta_0 = -\infty$, $\beta_S = +\infty$). An iterative solution for $\beta_{i+1}$ can then be constructed. After each iteration, formula (4) is used to check whether the stop condition is met (if $G$ equals 0, the value is determined to be a split point; otherwise it is not), and each split point is determined in this way.
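A Newton iteration for the split points might look like the sketch below. The exact form of $G$ is not recoverable from this text, so the code assumes $G(x) = \Phi(x) - \Phi(\beta_i) - 1/S$, where $\Phi$ is the standard normal CDF; this is one function whose zero is the next equal-probability split point, and its derivative is the density $f(x)$. The function names, the starting point `x0`, and the tolerance are likewise assumptions.

```python
import math

def pdf(x):
    """Standard normal density f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def next_split_point(prev_cdf, num_intervals, x0=0.0, tol=1e-12, max_iter=100):
    """Solve the assumed G(x) = cdf(x) - prev_cdf - 1/S = 0 by Newton's method."""
    x = x0
    for _ in range(max_iter):
        g = cdf(x) - prev_cdf - 1.0 / num_intervals
        if abs(g) < tol:          # stop condition: G is (numerically) zero
            return x
        x -= g / pdf(x)           # Newton step, since G'(x) = f(x)
    raise RuntimeError("Newton iteration did not converge")

def newton_split_points(num_intervals):
    """Compute beta_1 .. beta_{S-1} one after another from the previous split point."""
    points, prev_cdf = [], 0.0    # cdf(beta_0) = cdf(-inf) = 0
    for _ in range(num_intervals - 1):
        beta = next_split_point(prev_cdf, num_intervals)
        points.append(beta)
        prev_cdf = cdf(beta)
    return points

points = newton_split_points(4)
```

In floating point the stop condition "G equals 0" has to be relaxed to a small tolerance, which is why the sketch compares `abs(g)` with `tol` instead of testing for exact equality.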
Based on the foregoing disclosure of the foregoing method embodiment, as an optional embodiment, the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention constructs an interval table according to the probability and a plurality of numerical intervals, and correspondingly, elements of the interval table include:
[Formula (6), the form of each element of the ITable; the formula image is not reproduced in this text]
wherein $j$ denotes the $j$-th numerical interval; ITable is an element of the interval table.
Specifically, each interval of the Interval Table (ITable) records the set of time points whose data points lie in that interval. Because the time points of the time series are aligned, the set of time points for each interval can be converted into a binary representation. Each interval table is therefore a two-dimensional $S \times n$ matrix, where $S$ is the number of numerical intervals and $n$ is the length of the time series, and each element of the table can only be 0 or 1. An element equal to 1 indicates that the position of the time point in the numerical space falls in the corresponding interval; otherwise it does not. The form of each element of the ITable is shown in formula (6). For subsequences of equal length, the resulting interval tables are not only similar in structure but also contain the same number of binary 1s. What distinguishes one interval table from another is where the binary 1s appear. Combined with the "few and different" character of abnormal data, it follows that if the data point of a subsequence is abnormal at some time point, the position of the 1 in the interval table of that subsequence differs from its position in most other interval tables.
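As a small illustration of the $S \times n$ binary matrix described above, the following sketch builds an interval table for one subsequence. The helper name, the split points, and the sample values are assumptions made for the example.

```python
from bisect import bisect_right

def build_interval_table(values, split_points):
    """Return an S x n 0/1 matrix; row j marks the time points whose value lies in interval j."""
    num_intervals = len(split_points) + 1
    table = [[0] * len(values) for _ in range(num_intervals)]
    for i, value in enumerate(values):
        table[bisect_right(split_points, value)][i] = 1
    return table

splits = [-0.67, 0.0, 0.67]            # hypothetical split points, S = 4
subsequence = [-1.0, 0.3, 0.9, -0.2]   # made-up values at t_0 .. t_3
itable = build_interval_table(subsequence, splits)
for row in itable:
    print(row)
# [1, 0, 0, 0]
# [0, 0, 0, 1]
# [0, 1, 0, 0]
# [0, 0, 1, 0]
```

Every column contains exactly one 1, which is why two interval tables of the same length always hold the same number of 1s and differ only in where they appear.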
Based on the foregoing disclosure of the foregoing method embodiment, as an optional embodiment, the method for detecting an abnormal subsequence in a time sequence provided in the embodiment of the present invention, where the averaging of all weights as a score of each subsequence includes:
[Formula (7), $score(P)$; the formula image is not reproduced in this text]
[Formula (8), $score(t_i)$; the formula image is not reproduced in this text]
wherein $t_i$ is the $i$-th time point; $score(t_i)$ is the weight of time point $t_i$ in the extended interval table; $score(P)$ is the score of the subsequence; $w$ is a weight; $r_{j+1,i}$ is the compactness coefficient between the position of time point $t_i$ in the numerical space and the adjacent upper interval; $r_{j-1,i}$ is the compactness coefficient between the position of time point $t_i$ in the numerical space and the adjacent lower interval.
Specifically, each extended interval table (Extended Interval Table, EITable) is a matrix of size $S \times n$, where $n$ is the length of the subsequences that make up the EITable and $S$ is the number of numerical intervals. Each element $w_{j,i}$ of the EITable is an integer greater than or equal to 0 and represents the weight, computed over the data set, of time point $t_i$ in numerical interval $j$. The structure of the EITable is shown in Table 1.
TABLE 1
[Table 1, structure of the EITable; the table image is not reproduced in this text]
After the EITable is constructed, the weight distribution over the $S$ intervals of each time point in the time-series data set is available. The greater a weight, the more subsequences in the data set have that time point located in that interval of the numerical space; the smaller the weight, the fewer subsequences do. From the "few and different" character of abnormal data it can be inferred that, if a subsequence is abnormal, the distribution in the numerical space of some or even all of its time points differs markedly from that of the corresponding time points of most other subsequences. This difference shows up in the EITable as different weights for the time points of the abnormal subsequence, and those weights are necessarily small. The abnormality of a subsequence is therefore judged by computing its weight score in the EITable. The weight score is computed as follows: using the constructed extended interval table (EITable), the weight of each time point of each subsequence is looked up in the table, and the average of the weights of all time points of the subsequence is taken as the score of the subsequence, as shown in formula (7). The larger the score of a subsequence $P$ computed with formula (7), the more likely $P$ fits the distribution of most subsequences; the smaller the score, the less likely $P$ fits the distribution of most non-self-matching subsequences, and the more likely $P$ is abnormal.
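The weight lookup and averaging can be sketched as follows. Two simplifying assumptions are made and are not taken from the patent text: the weight $w_{j,i}$ is taken to be the number of subsequences whose value at time point $t_i$ falls in interval $j$ (the element-wise sum of the individual interval tables), and each time point contributes only the weight of its own interval, with the adjacent-interval terms left out. The data set and split points are made up.

```python
from bisect import bisect_right

def interval_index(value, split_points):
    """Index of the value interval containing `value`."""
    return bisect_right(split_points, value)

def build_eitable(subsequences, split_points):
    """Assumed weights: w[j][i] counts the subsequences whose value at time point i lies in interval j."""
    n = len(subsequences[0])
    eitable = [[0] * n for _ in range(len(split_points) + 1)]
    for seq in subsequences:
        for i, value in enumerate(seq):
            eitable[interval_index(value, split_points)][i] += 1
    return eitable

def subsequence_score(seq, eitable, split_points):
    """Average, over time points, of the weight of the interval each point falls in."""
    weights = [eitable[interval_index(v, split_points)][i] for i, v in enumerate(seq)]
    return sum(weights) / len(weights)

splits = [-0.67, 0.0, 0.67]      # hypothetical split points, S = 4
dataset = [
    [-1.0, 0.1, 0.9],
    [-0.9, 0.2, 0.8],
    [-0.8, 0.3, 1.0],
    [0.5, -1.2, -0.1],           # deliberately unlike the other three
]
eitable = build_eitable(dataset, splits)
scores = [subsequence_score(seq, eitable, splits) for seq in dataset]
print(scores)  # [3.0, 3.0, 3.0, 1.0]
```

The fourth subsequence receives the smallest score, so under the rule stated above it is the most likely to be abnormal.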
$score(t_i)$ is the weight of time point $t_i$ in the EITable and consists of three parts: the weight of the interval to which $t_i$ belongs and the weights of the two intervals adjacent to it above and below. If $t_i$ belongs to the first or the last numerical interval, the weight obtained from the EITable consists of only two parts: the weight of the interval to which $t_i$ belongs and the weight of the single adjacent interval. Before the contribution of an adjacent interval is computed, the compactness between the position of $t_i$ in the numerical space and that adjacent interval must be computed. If the position of $t_i$ lies very close to the adjacent interval, $t_i$ can approximately be counted as belonging to that interval; if it lies some distance away, it can only be said that $t_i$ has a neighbor relationship with a small part of the data in the adjacent interval. The formula for $score(t_i)$ is therefore as shown in formula (8).
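One possible reading of the three-part weight is sketched below. Neither the exact form of formula (8) nor the definition of the compactness coefficients is recoverable from this text, so the sketch assumes $score(t_i) = w_{j,i} + r_{j+1,i}\,w_{j+1,i} + r_{j-1,i}\,w_{j-1,i}$ and a hypothetical coefficient that falls linearly from 1 at the shared boundary to 0 at a distance `bandwidth`. The function names, the `bandwidth` parameter, and the sample numbers are all illustrative assumptions.

```python
from bisect import bisect_right

def closeness(distance, bandwidth):
    """Hypothetical compactness coefficient: 1 at the boundary, 0 at or beyond `bandwidth`."""
    return max(0.0, 1.0 - distance / bandwidth)

def point_score(value, i, eitable, split_points, bandwidth=0.2):
    """Own-interval weight plus neighbour weights scaled by the assumed coefficient."""
    j = bisect_right(split_points, value)
    score = float(eitable[j][i])                  # interval that contains the point
    if j + 1 < len(eitable):                      # an upper neighbour exists
        r_up = closeness(split_points[j] - value, bandwidth)
        score += r_up * eitable[j + 1][i]
    if j - 1 >= 0:                                # a lower neighbour exists
        r_down = closeness(value - split_points[j - 1], bandwidth)
        score += r_down * eitable[j - 1][i]
    return score

splits = [0.0, 1.0]            # hypothetical split points, three intervals
eitable = [[2], [5], [4]]      # made-up weights for a single time point
near_upper = point_score(0.95, 0, eitable, splits)   # close to the upper boundary
far_lower = point_score(-0.5, 0, eitable, splits)    # first interval, far from its neighbour
```

With these numbers `near_upper` is about 8.0 (the own weight 5 plus 0.75 times the upper weight 4), while `far_lower` is 2.0 because the point lies in the first interval, which has only one neighbour, and is too far from it to receive any contribution.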
According to the method for detecting the abnormal subsequence in the time sequence, which is provided by the embodiment of the invention, the time sequence and the similarity thereof are redefined, the numerical value space is divided into a plurality of numerical value intervals, the probability density and the corresponding falling probability of the time sequence are further obtained, an extended interval table is constructed on the basis, the subsequence of the time sequence is scored according to the weight in the extended interval table, the algorithm detection efficiency can be improved on the premise that the time dimension information of the time sequence is complete, and the detection precision and the reliability of the abnormal subsequence are ensured.
In order to more clearly illustrate the essence of the technical scheme of the invention, an integral embodiment is proposed on the basis of the above embodiment, and the overall view of the technical scheme of the invention is presented. It should be noted that, the overall embodiment is only for further embodying the technical essence of the present invention, and not limiting the scope of the present invention, and any combined technical solution meeting the technical essence of the present invention obtained by combining technical features on the basis of each embodiment of the present invention by a person skilled in the art is within the scope of protection of the present patent as long as the practical implementation is possible.
First, the time series data set selected in this experiment is shown in the following table 2 (UCR experimental data set):
TABLE 2
Figure BDA0002509298290000101
Figure BDA0002509298290000111
Table 2 contains 4 different types of time series data sets. The time series lengths of these data sets range from 65 to 2709, and abnormal time series make up a different proportion of each data set. This diversity allows the effectiveness of the proposed algorithm for time series anomaly detection to be verified from several aspects. To quantify how accurately an algorithm detects abnormal time series, the proposed algorithm is evaluated with the AUC index. AUC is the area under the ROC curve, that is, the area enclosed by the ROC curve and the coordinate axes. The ROC curve is used to evaluate binary classifiers: the data samples are sorted by the classifier's prediction result, different thresholds are taken in turn in this order, samples whose prediction result is greater than the threshold are treated as positive examples, and samples whose prediction result is smaller than the threshold are treated as negative examples. Each threshold yields one pair (FPR, TPR), where FPR is the false positive rate and TPR is the true positive rate. Plotting these pairs with FPR as the abscissa and TPR as the ordinate gives the ROC curve. In machine learning, the true positive rate is also known as sensitivity, and the false positive rate is also known as the false alarm probability.
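The threshold sweep described above can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code used in the experiment: the function names `roc_points` and `auc` are hypothetical, a sample counts as positive when its score is greater than or equal to the threshold, the area is computed with the trapezoidal rule, and both classes are assumed to be present in the labels.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) pairs obtained by using each distinct score,
    from highest to lowest, as the decision threshold.

    labels: 1 for positive (abnormal), 0 for negative (normal).
    Assumes at least one positive and one negative sample.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))  # 8 of the 9 positive/negative pairs are ranked correctly
```

In this toy example the AUC equals 8/9, which matches the fraction of positive/negative pairs in which the positive sample receives the higher score.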
The selected comparison algorithms are: the angle-based anomaly detection algorithm (FastVOA) proposed in 2012; the anomaly detection algorithm combining piecewise aggregate approximation with a random walk model (PAPR-RW) proposed in 2017; the kernel density based anomaly detection algorithm (RDOS) proposed in 2017; and the interval set-based time series anomaly detection algorithm (international) proposed in 2018. The parameters of the comparison algorithms follow the settings suggested in the respective reference documents: the number of neighbors in the RDOS algorithm is set to 10; the number of hash functions in the FastVOA algorithm is set to 100; the international algorithm uses the suggested parameters, such as a boundary width factor of 0.2; the number of subspaces in the PAPR-RW algorithm is set to the suggested values from 6 to 9, and its other three parameters are set to 0.3, 0.4 and 0.3, respectively. The AUC score of each algorithm on each data set is shown in Table 3. The two best results on each data set are shown in bold, and NA indicates that the algorithm could not be run on that data set in the current experimental environment.
TABLE 3
Figure BDA0002509298290000121
In Table 3, the EITable column gives the experimental results of the proposed algorithm, and the other columns give the results of the selected comparison algorithms. The results show that EITable achieves the better detection result in most cases, with AUC improvements of varying degree over the other algorithms. For example: on the MoteStrain data set, the two proposed algorithms improve on the other algorithms by more than ten percent; on the Lighting2 data set, improvements of ten percent and twenty percent are reported relative to the RDOS algorithm; on the ECG200 data set, the improvement over the other algorithms is likewise ten percent; and good results are also obtained on the three DiatomSizeReduction data sets.
In addition to verifying effectiveness under the AUC index, the difference in CPU time between the proposed algorithm and the comparison algorithms was measured. Table 4 compares the CPU running time of each method on the data sets of Table 2 (run time comparison over different time series data sets). The results in Table 4 show that EITable requires less running time on most data sets, since it has only linear time complexity. The international algorithm partitions the time series and computes a similarity matrix, which takes little time on small data sets, so it achieves the best running time on some of the data sets. The RDOS algorithm must calculate the Euclidean distances between time series and find the k nearest neighbors, which is time-consuming. PAPR-RW requires the longest running time because it must first convert the time series representation and compute a similarity matrix, and then feed that matrix into the RW model for multiple rounds of iterative optimization.
TABLE 4
Figure BDA0002509298290000131
The embodiments of the present invention are implemented as programmed processing on a device with processor functionality. In engineering practice, the technical solutions and functions of the embodiments can therefore be packaged into various modules. On this basis, and building on the above embodiments, an embodiment of the present invention provides an apparatus for detecting an abnormal subsequence in a time sequence, which is configured to perform the method for detecting an abnormal subsequence in a time sequence of the above method embodiment. Referring to fig. 4, the apparatus includes:
a sequence construction module 401, configured to form a tuple with a single value and a single time point, form a time sequence from a plurality of tuples, and define similarities of different time sequences at any time point;
the extended interval table construction module 402 is configured to construct a plurality of splitting points, divide a numerical space in a time sequence into a plurality of numerical intervals, obtain probability density of the time sequence, obtain probability that any time point in the time sequence falls into any numerical interval according to the probability density, construct an interval table according to the probability and the plurality of numerical intervals, and construct an extended interval table according to the interval table;
an anomaly determination module 403, configured to obtain weights of each time point of each sub-sequence of a time sequence in the extended interval table, average the weights of all the time points as a score of each sub-sequence, and if the score is smaller, determine that the sub-sequence is less likely to be anomalous;
wherein the value space is made up of all values in the number of tuples; the probability that any numerical point falls within any numerical interval is the same.
The device for detecting the abnormal subsequence in the time sequence provided by the embodiment of the invention adopts the sequence construction module, the extended interval table construction module and the abnormality judgment module, redefines the time sequence and the similarity thereof, divides the numerical space into a plurality of numerical intervals, further obtains the probability density and the corresponding falling probability of the time sequence, constructs the extended interval table on the basis, scores the subsequence of the time sequence according to the weight in the extended interval table, and can improve the algorithm detection efficiency and ensure the detection precision and the reliability of the abnormal subsequence on the premise of ensuring the integrity of the information of the time sequence in the time dimension.
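The first two modules can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the patented construction: the probability density and split-point formulas are given only as figures, so the sketch substitutes empirical quantiles to obtain intervals of roughly equal probability, and the names `equal_probability_splits`, `interval_index`, `interval_counts` and `similar_at` are hypothetical.

```python
from bisect import bisect_right


def equal_probability_splits(values, s):
    """Inner split points chosen as empirical quantiles, so that each of the
    s numerical intervals receives about the same share of the values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // s] for k in range(1, s)]


def interval_index(value, splits):
    """Index of the numerical interval into which a value falls."""
    return bisect_right(splits, value)


def interval_counts(values, splits):
    """Number of values per interval, a simple stand-in for an interval table."""
    counts = [0] * (len(splits) + 1)
    for v in values:
        counts[interval_index(v, splits)] += 1
    return counts


def similar_at(series_a, series_b, i, splits):
    """Two series of (time, value) tuples are similar at position i when
    both values fall into the same numerical interval."""
    return interval_index(series_a[i][1], splits) == interval_index(series_b[i][1], splits)


# A time sequence is a list of (time point, value) tuples.
series = [(t, float(p)) for t, p in enumerate([1, 2, 3, 4, 5, 6, 7, 8])]
splits = equal_probability_splits([p for _, p in series], 4)
print(splits)                                           # [3.0, 5.0, 7.0]
print(interval_counts([p for _, p in series], splits))  # [2, 2, 2, 2]
```

Using quantiles keeps the construction to one sort plus binary searches; any other way of obtaining equal-probability intervals from the estimated density would fit the same interface.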
The method of the embodiments of the present invention is carried out by electronic equipment, so the relevant electronic device is introduced here. To this end, an embodiment of the present invention provides an electronic device which, as shown in fig. 5, includes: at least one processor 501, a communication interface 504, at least one memory 502 and a communication bus 503, wherein the at least one processor 501, the communication interface 504 and the at least one memory 502 communicate with each other via the communication bus 503. The at least one processor 501 may invoke logic instructions in the at least one memory 502 to perform the following method: adopting a single numerical value and a single moment to form a tuple, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any moment; constructing a plurality of splitting points, dividing a numerical space in a time sequence into a plurality of numerical intervals, acquiring probability density of the time sequence, acquiring probability that any time point in the time sequence falls into any numerical interval according to the probability density, constructing an interval table according to the probability and the plurality of numerical intervals, and constructing an extended interval table according to the interval table; acquiring the weight of each time point of each sub-sequence of the time sequence in the extended interval table, taking an average value of all the weights as the score of each sub-sequence, and determining that the sub-sequence is less likely to be abnormal if the score is smaller; wherein the value space is made up of all values in the number of tuples; the probability that any numerical point falls within any numerical interval is the same.
Further, the logic instructions in the at least one memory 502 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. Examples include: adopting a single numerical value and a single moment to form a tuple, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any moment; constructing a plurality of splitting points, dividing a numerical space in a time sequence into a plurality of numerical intervals, acquiring probability density of the time sequence, acquiring probability that any time point in the time sequence falls into any numerical interval according to the probability density, constructing an interval table according to the probability and the plurality of numerical intervals, and constructing an extended interval table according to the interval table; acquiring the weight of each time point of each sub-sequence of the time sequence in the extended interval table, taking an average value of all the weights as the score of each sub-sequence, and determining that the sub-sequence is less likely to be abnormal if the score is smaller; wherein the value space is made up of all values in the number of tuples; the probability that any numerical point falls within any numerical interval is the same. 
The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. Based on this knowledge, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In this patent, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for detecting an abnormal subsequence in a time series, comprising:
adopting a single numerical value and a single moment to form a tuple, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any moment;
constructing a plurality of splitting points, dividing a numerical space in a time sequence into a plurality of numerical intervals, acquiring probability density of the time sequence, acquiring probability that any time point in the time sequence falls into any numerical interval according to the probability density, constructing an interval table according to the probability and the plurality of numerical intervals, and constructing an extended interval table according to the interval table;
acquiring the weight of each time point of each sub-sequence of the time sequence in the extended interval table, taking an average value of all the weights as the score of each sub-sequence, and determining that the sub-sequence is less likely to be abnormal if the score is smaller;
wherein the value space is made up of all values in the number of tuples; the probability that any numerical point falls into any numerical interval is the same;
the probability density is:
Figure FDA0004152149140000011
the probability is:
Figure FDA0004152149140000012
wherein x is any time point; S is the number of numerical intervals; β_i is the i-th split point;
i=0,…,S-1;
the construction of a number of split points includes:
Figure FDA0004152149140000013
Figure FDA0004152149140000014
wherein p' is the derivative of p with respect to time; G is a constructed function: if G is zero, β_i is determined to be a split point; if G is not zero, β_i is determined not to be a split point;
the interval table is constructed according to the probability and a plurality of numerical intervals, and correspondingly, the elements of the interval table comprise:
Figure FDA0004152149140000015
wherein j is the j-th numerical interval; ITable is an element of the interval table;
said taking an average value of all the weights as the score of each sub-sequence comprises:
Figure FDA0004152149140000021
Figure FDA0004152149140000022
wherein t_i is the i-th time point; score(t_i) is the weight of time point t_i in the extended interval table; score(P) is the score of the subsequence; w is a weight; r_{j+1,i} is the compactness coefficient between the position of time point t_i in the numerical space and the adjacent upper interval; r_{j-1,i} is the compactness coefficient between the position of time point t_i in the numerical space and the adjacent lower interval.
2. The method of claim 1, wherein the employing a single value with a single point in time to form a tuple, and the grouping of tuples into a time series, comprises:
P = {(t_1, p_1), (t_2, p_2), (t_3, p_3), ..., (t_n, p_n)}
wherein n is the length of the time sequence and is any integer; (t_n, p_n) is the tuple; P is the time sequence; t_n is the single time point; p_n is the single value.
3. The method for detecting an abnormal subsequence in a time series as claimed in claim 2, wherein said defining the similarity of different time sequences at any time point comprises: if the values of a first time sequence and a second time sequence at any time point between t_1 and t_n fall in the same numerical interval, determining that the first time sequence and the second time sequence are similar at that time point.
4. An apparatus for detecting an abnormal subsequence in a time series, comprising:
the sequence construction module is used for forming a tuple by adopting a single numerical value and a single time point, forming a plurality of tuples into a time sequence, and defining the similarity of different time sequences at any time point;
the extended interval table construction module is used for constructing a plurality of splitting points, dividing a numerical space in the time sequence into a plurality of numerical intervals, acquiring probability density of the time sequence, acquiring probability that any time point in the time sequence falls into any numerical interval according to the probability density, constructing an interval table according to the probability and the plurality of numerical intervals, and constructing an extended interval table according to the interval table;
an anomaly determination module, configured to obtain weights of each time point of each sub-sequence of a time sequence in the extended interval table, average the weights of all the time points as a score of each sub-sequence, and if the score is smaller, determine that the sub-sequence is less likely to be anomalous;
wherein the value space is made up of all values in the number of tuples; the probability that any numerical point falls into any numerical interval is the same;
the probability density is:
Figure FDA0004152149140000031
the probability is:
Figure FDA0004152149140000032
wherein x is any time point; S is the number of numerical intervals; β_i is the i-th split point;
i=0,…,S-1;
the construction of a number of split points includes:
Figure FDA0004152149140000033
Figure FDA0004152149140000034
wherein p' is the derivative of p with respect to time; G is a constructed function: if G is zero, β_i is determined to be a split point; if G is not zero, β_i is determined not to be a split point;
the interval table is constructed according to the probability and a plurality of numerical intervals, and correspondingly, the elements of the interval table comprise:
Figure FDA0004152149140000035
wherein j is the j-th numerical interval; ITable is an element of the interval table;
said taking an average value of all the weights as the score of each sub-sequence comprises:
Figure FDA0004152149140000036
Figure FDA0004152149140000037
wherein t_i is the i-th time point; score(t_i) is the weight of time point t_i in the extended interval table; score(P) is the score of the subsequence; w is a weight; r_{j+1,i} is the compactness coefficient between the position of time point t_i in the numerical space and the adjacent upper interval; r_{j-1,i} is the compactness coefficient between the position of time point t_i in the numerical space and the adjacent lower interval.
5. An electronic device, comprising:
at least one processor, at least one memory, and a communication interface; wherein,
the processor, the memory and the communication interface are communicated with each other;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-3.
6. A non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the method of any one of claims 1 to 3.
CN202010456099.1A 2020-05-26 2020-05-26 Method and device for detecting abnormal subsequence in time sequence Active CN111612082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010456099.1A CN111612082B (en) 2020-05-26 2020-05-26 Method and device for detecting abnormal subsequence in time sequence

Publications (2)

Publication Number Publication Date
CN111612082A CN111612082A (en) 2020-09-01
CN111612082B true CN111612082B (en) 2023-06-23

Family

ID=72196337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010456099.1A Active CN111612082B (en) 2020-05-26 2020-05-26 Method and device for detecting abnormal subsequence in time sequence

Country Status (1)

Country Link
CN (1) CN111612082B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant