CN116089491B - Retrieval matching method and device based on time sequence database - Google Patents


Info

Publication number
CN116089491B
Authority
CN
China
Prior art keywords
sequence
time
matching
subsequence
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211616863.2A
Other languages
Chinese (zh)
Other versions
CN116089491A (en
Inventor
***
朱妤晴
王一力
安彦哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202211616863.2A
Publication of CN116089491A
Application granted
Publication of CN116089491B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a retrieval matching method and device based on a time sequence database. The method comprises: acquiring time sequence meta information and time sequence data change trend information; screening a candidate sequence set from a pre-created time sequence database based on the time sequence meta information; and performing matching calculation on the candidate sequence set based on the time sequence data change trend information, so as to find the best matching subsequence of each sequence in the candidate sequence set. By first screening the candidate sequence set from the time sequence database with the time sequence meta information and then performing the matching calculation with the change trend information, the method queries the database for the subsequence that best matches the shape described by the user. Users can thus find the required sequence segment using mixed information, and queries complete with shorter delay, which greatly improves the performance of sequence matching.

Description

Retrieval matching method and device based on time sequence database
Technical Field
The present invention relates to the field of database technologies, and in particular to a retrieval matching method and device based on a time sequence database.
Background
With the rapid development of the industrial internet, mobile internet, internet of things, internet of vehicles and smart grid, more and more industries need to store and analyze time sequence data generated in real time, which has driven the rapid emergence and development of a wide variety of time sequence databases. In time sequence data analysis, users often need to perform time sequence matching retrieval, that is, to query for time sequences that conform to a certain change trend, such as querying for the physical machines in machine room A whose CPU occupancy surged and then plunged within the last two hours, or for the stocks that have undergone three large rises and falls in the last few days.
Searching a time sequence database for subsequences that match a given change trend requires calculating the similarity between time sequences, and the main method for measuring that similarity is dynamic time warping (Dynamic Time Warping, DTW). The DTW algorithm addresses the difficulty that Euclidean distance has in handling shape similarity when it is used as the metric. The core of the DTW algorithm is a dynamic programming calculation with time complexity $O(n \times m^2)$, where n is the number of subsequences meeting the requirement and m is the length of the query sequence, so the user's query delay is too long. In addition, although the DTW algorithm can calculate the similarity between two sequences well, it needs an explicit sequence as input, and users often cannot give the values of that sequence accurately.
In summary, the prior art has the problems that the query time delay is too long and the input cannot be accurately given.
Disclosure of Invention
The invention provides a retrieval matching method and device based on a time sequence database, which are used for solving the defects that the query time delay is too long and the input cannot be accurately given in the prior art, supporting users to find out required sequence segments by using mixed information, and realizing the query with shorter time delay.
The invention provides a retrieval matching method based on a time sequence database, which comprises the following steps:
acquiring time sequence meta information and time sequence data change trend information;
screening a candidate sequence set from a pre-established time sequence database based on the time sequence meta information;
and carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set.
According to the retrieval matching method based on the time sequence database, provided by the invention, a candidate sequence set is screened out from a pre-established time sequence database based on the time sequence meta information, and the method concretely comprises the following steps:
inquiring target time sequence data from the time sequence database by utilizing the time sequence meta information, and obtaining an inquiry result;
constructing a query result set containing all query results;
and taking the query result set as the candidate sequence set under the condition that the query result set is a non-empty set.
According to the retrieval matching method based on the time sequence database, the time sequence data change trend information comprises a sequence Q with fixed length and time intervals between adjacent data points in the sequence Q, the time intervals are used for screening out subsequences with highest similarity with the sequence Q in the candidate sequence set, and the subsequences with highest similarity are used as the best matching subsequences.
According to the retrieval matching method based on the time sequence database, the matching calculation is carried out on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set, and the method concretely comprises the following steps:
carrying out data normalization processing on each sequence in the candidate sequence set to obtain a shape sequence of each sequence in the candidate sequence set;
calculating the shape distance in parallel between the shape sequence of the sequence Q and the shape sequence of each sequence in the candidate sequence set to obtain a shape distance calculation result;
pruning is carried out on the shape distance calculation result to obtain the best matching subsequence of each sequence in the candidate sequence set.
According to the retrieval matching method based on the time sequence database, the pre-established curve matching module is utilized to carry out matching calculation on the candidate sequence set based on the time sequence data change trend information;
wherein, the curve matching module includes:
the normalization module is used for normalizing the sequence Q with the fixed length and the candidate sequence set input by the user so as to convert the sequence Q and the candidate sequence set into a shape sequence;
a parallel computing module, through which the shape distance between the shape sequence of the sequence Q and each sequence of the given shape sequence set is calculated in parallel;
a result summarizing module, through which the shape distance calculation results are summarized and the subsequence most similar to the time sequence change trend input by the user is output.
According to the retrieval matching method based on the time sequence database, a preset formula is used as a conversion formula of the matching calculation;
wherein, the preset formula is:
dp[i][j]=min(dp[i-1][j-1],dp[i-1][j],dp[i][j-1])+dis[i][j]
wherein dp[i][j] represents the minimum DTW distance between the two sequences for subsequence i at length j; dis[i][j] represents the shortest distance between the two sequences for subsequence i at length j; min() returns the smallest of the given values; i is the subsequence index and j is the query sequence length.
According to the retrieval matching method based on the time sequence database, pruning is carried out on the shape distance calculation result to obtain the best matching subsequence of each sequence in the candidate sequence set, and the method specifically comprises the following steps:
when the lower bound of Kim_FL is calculated, discarding the current subsequence when the distance of the current subsequence in the candidate sequence set exceeds the historical shortest distance;
when Keogh lower bound is calculated, discarding the current subsequence when the distance of the current subsequence in the candidate sequence set exceeds the historical shortest distance;
and discarding the current subsequence when the distance of the current subsequence in the distance matrix of the current DTW in the candidate sequence set exceeds the historical shortest distance.
The invention also provides a retrieval matching device based on the time sequence database, which comprises:
the acquisition module is used for acquiring time sequence meta information and time sequence data change trend information;
the screening module is used for screening a candidate sequence set from a time sequence database which is created in advance based on the time sequence meta information;
and the matching module is used for carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the retrieval matching method based on the time sequence database when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a time-series database based search matching method as described in any of the above.
According to the retrieval matching method and device based on the time sequence database, time sequence meta information and time sequence data change trend information are obtained; screening a candidate sequence set from a pre-established time sequence database based on the time sequence meta information; and carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set. According to the time sequence meta information and the time sequence data change trend information, the candidate sequence set is screened from the time sequence database, then the matching calculation is carried out, the subsequence which is the best matched with the shape described by the user in the database is queried, the user is supported to find out the required sequence section by utilizing the mixed information, and the query with shorter time delay is realized, so that the performance of sequence matching is greatly improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a first flow chart of the retrieval matching method based on a time sequence database provided by the present invention;
FIG. 2 is a second flow chart of the method for matching search based on time sequence database according to the present invention;
FIG. 3 is a schematic diagram of a pattern sequence of one embodiment of a time-series database-based search matching method provided by the present invention;
FIG. 4 is a schematic diagram of a curve matching module of the retrieval matching method based on a time sequence database;
FIG. 5 is a flowchart of an improved DTW algorithm of the time sequence database-based search matching method provided by the invention;
FIG. 6 is a schematic diagram of a retrieval matching device based on a time sequence database;
FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
610: an acquisition module; 620: a screening module; 630: a matching module;
710: a processor; 720: a communication interface; 730: a memory; 740: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The time series database-based search matching method of the present invention is described below with reference to fig. 1 to 5. Fig. 1 and fig. 2 are schematic flow diagrams of a retrieval matching method based on a time sequence database, and as shown in fig. 1, an embodiment of the present invention provides a retrieval matching method based on a time sequence database, including:
Step 110: acquiring time sequence meta information and time sequence data change trend information. Both may be input by the user: the data input follows the user's intent, and a description of the change trend replaces an explicit time sequence as the input of query matching. That is, the data input is the time sequence meta information given by the user and the time sequence data change trend defined by the user. The time sequence data change trend is described by a sequence Q of fixed length and the time intervals between the data points of the sequence.
Step 120: screening a candidate sequence set from a pre-created time sequence database based on the time sequence meta information.
After the time sequence meta information and time sequence data change trend information given by the user are obtained, the time sequence database is queried according to the time sequence meta information to obtain a candidate time sequence set. The time sequence database is created in advance.
In actual operation, the corresponding time sequence data is queried in the time sequence database according to the time sequence meta information given by the user. If the query result set is empty, the method returns that no best matching subsequence exists; if the result set is not empty, the query result set is taken as the candidate sequence set.
The sequence Q is a fixed-length sequence describing the change trend of the time sequence data. Together with the time intervals between its data points, it supports querying for the subsequence most similar to Q in the candidate time sequence set.
Step 130: performing matching calculation on the candidate sequence set based on the time sequence data change trend information, so as to find the best matching subsequence of each sequence in the candidate sequence set.
After the candidate sequence set is obtained, matching calculation is carried out on the candidate sequence set according to time sequence data change trend information, and the best matching subsequence of each sequence is found from the candidate time sequence set.
According to the method and the device, the most similar sequence segments are searched and matched from massive time sequence data according to the shape of the user description curve and the time sequence meta-information, so that a new application scene of the time sequence database is developed.
Based on the above embodiment, in the method, the candidate sequence set is screened from a pre-created time sequence database based on the time sequence meta information, which specifically includes:
inquiring target time sequence data from the time sequence database by utilizing the time sequence meta information, and obtaining an inquiry result;
constructing a query result set containing all query results;
and taking the query result set as the candidate sequence set under the condition that the query result set is a non-empty set.
Specifically, according to the acquired time sequence meta information, inquiring corresponding time sequence data in a time sequence database, and if the inquiring result set is empty, returning that no best matching subsequence exists; and if the result set is not empty, taking the query result set as a candidate sequence set.
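The screening step can be illustrated with a minimal sketch. It assumes a hypothetical in-memory list of records standing in for the time sequence database, with metadata held as key/value pairs; the names `screen_candidates`, `meta` and `values` are illustrative and do not come from the patent, which does not specify the query interface.

```python
from typing import Any, Dict, List, Optional

# Hypothetical in-memory stand-in for a time sequence database:
# every record carries a metadata dict and a list of values.
Record = Dict[str, Any]


def screen_candidates(database: List[Record],
                      meta: Dict[str, str]) -> Optional[List[List[float]]]:
    """Return the candidate sequence set, or None when nothing matches."""
    result_set = [
        record["values"]
        for record in database
        if all(record["meta"].get(key) == value for key, value in meta.items())
    ]
    # An empty result set means that no best matching subsequence exists.
    return result_set if result_set else None


database = [
    {"meta": {"room": "A", "metric": "cpu"}, "values": [0.1, 0.9, 0.2]},
    {"meta": {"room": "B", "metric": "cpu"}, "values": [0.3, 0.3, 0.4]},
]
candidates = screen_candidates(database, {"room": "A", "metric": "cpu"})
```

Returning `None` for an empty result set mirrors the "no best matching subsequence" branch; a real deployment would issue the metadata query through the database's own query language instead.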
Based on the above embodiment, in the method, the time series data change trend information includes a sequence Q with a fixed length and a time interval between each adjacent data point in the sequence Q, so as to screen out a sub-sequence with the highest similarity with the sequence Q in the candidate sequence set, and take the sub-sequence with the highest similarity as the best matching sub-sequence.
Specifically, the time sequence data change trend is described by a sequence Q with a fixed length and time intervals among data points of the sequence, and the time sequence data change trend is used for inquiring a subsequence which is most similar to the Q in a candidate time sequence set, and the subsequence which is most similar to the Q is used as the most matched subsequence.
Based on the above embodiment, in the method, the matching calculation is performed on the candidate sequence set based on the time series data change trend information so as to find a best matching subsequence of each sequence from the sequences of the candidate sequence set, which specifically includes:
carrying out data normalization processing on each sequence in the candidate sequence set to obtain a shape sequence of each sequence in the candidate sequence set;
calculating the shape distance in parallel between the shape sequence of the sequence Q and the shape sequence of each sequence in the candidate sequence set to obtain a shape distance calculation result;
pruning is carried out on the shape distance calculation result to obtain the best matching subsequence of each sequence in the candidate sequence set.
Specifically, after the candidate sequence set is obtained, each sequence in the candidate sequence set is subjected to data normalization processing, that is, normalization processing is performed on the result data set and the sequence Q, so as to obtain a shape sequence.
The normalization process of any sequence is as follows:
A time sequence generally has three states: rising, falling and unchanged, denoted {1, -1, 0} respectively. Suppose a sequence of length S is divided equally into K segments. A slope is calculated for each segment: a positive slope indicates rising, a negative slope falling, and 0 unchanged. The sequence can then be expressed as a sequence such as [1, 0, -1, -1, ...], and merging adjacent identical patterns gives a pattern sequence such as [1, 0, -1, ...].
Because adjacent identical patterns are merged, no two neighbouring entries of the resulting pattern sequence are equal, and each pattern may span a different length of time. After merging, sequence S1 may have N patterns and sequence S2 may have M patterns, as shown in FIG. 3. The two sequences must then be brought to the same number of patterns, i.e. the patterns of both sequences are recalculated using the union of all their change endpoints. The two sequences thus share common segmentation points and yield pattern sequences of equal length.
The shape sequence builds on the pattern sequence to further improve the measurement. Specifically, the patterns are refined according to how the slope changes, expanding the original three patterns to 7 states: {accelerating decrease, steady decrease, decelerating decrease, unchanged, decelerating increase, steady increase, accelerating increase}, described by the pattern set M = {-3, -2, -1, 0, 1, 2, 3}. A threshold th is set to help distinguish the 7 states; the slope ki of each segment pattern is shown in the table below.
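The pattern-sequence construction and the alignment on shared endpoints can be sketched as follows. This is an illustrative reading, not the patented implementation: it assumes the slope of a segment is taken as the difference between its end and start values, and that the number of points minus one is divisible by the number of segments. The function names are hypothetical.

```python
from typing import List, Tuple


def pattern_sequence(values: List[float], k: int) -> List[Tuple[int, int]]:
    """Split values into k equal segments; return merged (pattern, end_index) pairs.

    pattern is 1 (rising), -1 (falling) or 0 (unchanged).
    """
    seg_len = (len(values) - 1) // k  # neighbouring segments share an endpoint
    merged: List[Tuple[int, int]] = []
    for s in range(k):
        start, end = s * seg_len, (s + 1) * seg_len
        slope = values[end] - values[start]
        pattern = (slope > 0) - (slope < 0)  # sign of the slope
        if merged and merged[-1][0] == pattern:
            merged[-1] = (pattern, end)      # extend the previous pattern
        else:
            merged.append((pattern, end))
    return merged


def align(values: List[float], endpoints: List[int]) -> List[int]:
    """Recompute the patterns of one sequence over shared segment endpoints."""
    out, start = [], 0
    for end in endpoints:
        slope = values[end] - values[start]
        out.append((slope > 0) - (slope < 0))
        start = end
    return out


s1 = [0, 1, 2, 2, 2, 1, 0]
s2 = [0, 0, 0, 1, 2, 3, 4]
p1 = pattern_sequence(s1, 6)   # [(1, 2), (0, 4), (-1, 6)]
p2 = pattern_sequence(s2, 6)   # [(0, 2), (1, 6)]
shared = sorted({e for _, e in p1} | {e for _, e in p2})  # union of endpoints
a1, a2 = align(s1, shared), align(s2, shared)
```

After alignment both sequences have three patterns, `[1, 0, -1]` and `[0, 1, 1]`, so they can be compared position by position even though their merged pattern sequences had different lengths.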
TABLE 1 sequence pattern comparison Table
After unifying the original sequences into the shape sequences, the DTW distance, i.e., the shape distance, is calculated in parallel between the shape sequence of the sequence Q and the shape sequence of each sequence in the dataset.
Pruning is carried out on the shape distance result obtained through calculation, so that the best matching subsequence of each sequence in the candidate sequence set is obtained.
On the basis of the traditional method, the invention provides that the pattern distance and the shape distance are used for replacing the Euclidean distance, so that a user can use the time sequence change trend to replace a definite time sequence as the input of query matching.
Based on the above embodiment, in the method, a pre-created curve matching module is utilized to perform matching calculation on the candidate sequence set based on the time series data change trend information;
wherein, the curve matching module includes:
the normalization module is used for normalizing the sequence Q with the fixed length and the candidate sequence set input by the user so as to convert the sequence Q and the candidate sequence set into a shape sequence;
a parallel computing module, through which the shape distance between the shape sequence of the sequence Q and each sequence of the given shape sequence set is calculated in parallel;
a result summarizing module, through which the shape distance calculation results are summarized and the subsequence most similar to the time sequence change trend input by the user is output.
Specifically, as shown in fig. 4, a curve matching module is used to perform matching calculation on the candidate sequence set. The curve matching module comprises a normalization module (1-1), a parallel computing module (1-2) and a result summarizing module (1-3).
The inputs of the normalization module (1-1) are the fixed-length sequence Q from the user-defined time sequence data change trend and the candidate sequence set obtained by querying the time sequence database with the time sequence meta information; its output is the corresponding shape sequences. That is, the normalization module (1-1) uniformly converts the change trend sequence Q input by the user and the given sequence data set into shape sequences through the normalization processing described above.
The inputs of the parallel computing module (1-2) are the shape sequence of Q and each sequence of the given shape sequence set, and its output is the shape distances. That is, the parallel computing module (1-2) calculates, in parallel and using the conversion formula, the shape distance between the shape sequence of Q and each sequence of the given shape sequence set.
The result summarizing module (1-3) summarizes the results and returns a subsequence most similar to the time sequence change trend input by the user.
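The division of labour between the parallel computing module and the result summarizing module can be sketched as below. This is a minimal illustration under stated assumptions: inputs are already shape sequences, every candidate is at least as long as Q, subsequences are taken as sliding windows of the same length as Q, and a thread pool stands in for whatever parallel runtime the system actually uses. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def dtw(a: List[int], b: List[int]) -> float:
    """Plain DTW distance with squared point differences."""
    inf = float("inf")
    dp = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]) + cost
    return dp[len(a)][len(b)]


def best_subsequence(q: List[int], series: List[int]) -> Tuple[float, int]:
    """Return (distance, start offset) of the window of len(q) closest to q."""
    windows = range(len(series) - len(q) + 1)
    return min((dtw(q, series[s:s + len(q)]), s) for s in windows)


def match_all(q: List[int], candidates: List[List[int]]) -> List[Tuple[float, int]]:
    """Compute the best window of every candidate in parallel, in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda series: best_subsequence(q, series), candidates))


q = [1, 0, -1]
candidates = [[0, 1, 0, -1, 0], [1, 1, 1, 1]]
results = match_all(q, candidates)   # one (distance, offset) pair per candidate
```

Here the first candidate contains Q exactly at offset 1, so its distance is 0, while the second candidate's best window has distance 5. Threads only illustrate the structure; a CPU-bound Python deployment would more likely use separate processes.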
Based on the above embodiment, in the method, a preset formula is used as a conversion formula of the matching calculation;
wherein, the preset formula is:
dp[i][j]=min(dp[i-1][j-1],dp[i-1][j],dp[i][j-1])+dis[i][j]
wherein dp[i][j] represents the minimum DTW distance between the two sequences for subsequence i at length j; dis[i][j] represents the shortest distance between the two sequences for subsequence i at length j; min() returns the smallest of the given values; i is the subsequence index and j is the query sequence length.
Specifically, the time complexity of the original DTW algorithm is too high and its query delay too long. The method provided by the invention optimizes the original DTW algorithm, as shown in fig. 5, so that the execution speed can be improved by up to tens of times, depending on the data scale.
Fig. 5 is a flowchart of the improved DTW algorithm of the retrieval matching method based on a time sequence database provided by the present invention. As shown in fig. 5, the improved DTW algorithm calculates the minimum DTW (Dynamic Time Warping) distance between the shape sequence Q input by the user and the shape sequence S to be matched; the input is the shape sequences Q and S, and the output is the minimum DTW distance between them.
Further, a variable dtw_distance is declared to represent the DTW distance during the iteration and is initialized to the minimum DTW value obtained so far between the already-calculated shape subsequences and the input shape sequence Q. The Kim_FL lower bound is calculated from the shape sequences Q and C and the result is recorded as lb; if lb is greater than dtw_distance, dtw_distance is output and the algorithm ends (pruning 1). Otherwise, the Keogh lower bound is calculated from the shape sequences Q and C and the result is recorded as lb; if lb is greater than dtw_distance, dtw_distance is output and the algorithm ends (pruning 2). Otherwise, the DTW distance between the shape sequences Q and C is calculated iteratively by dynamic programming; if lb + dis[i][j] becomes greater than dtw_distance during the iteration, dtw_distance is set to lb + dis[i][j] and the algorithm ends (pruning 3). Otherwise, the algorithm ends when i is greater than or equal to the length of Q.
Wherein dis[i][j] represents the shortest distance between the two sequences for subsequence i at length j.
The DTW distance between the shape sequence Q and the candidate shape sequence (C above, S in the steps below) is calculated iteratively by dynamic programming. The specific steps are as follows:
1. Declare the variable i = 0;
2. If i is smaller than the length of the shape sequence Q, continue to step 3; otherwise jump to step 9;
3. Declare the variable j = 0;
4. If j is smaller than the length of the shape sequence S, continue to step 5; otherwise jump to step 8;
5. dis[i][j] = min(dis[i-1][j-1], dis[i-1][j], dis[i][j-1]) + Math.pow(Q.get(i) - S.get(j), 2);
wherein dis[i][j] represents the minimum DTW distance between the two sequences for subsequence i at length j, and Math.pow(Q.get(i) - S.get(j), 2) is the square of the difference between the value of the shape sequence Q at i and the value of the shape sequence S at j; entries with an index of -1 are treated as infinite, except that dis[0][0] has no predecessor and equals its squared difference alone;
6. If lb + dis[i][j] is greater than dtw_distance, set dtw_distance equal to lb + dis[i][j] and jump to step 10 (pruning 3); otherwise continue to step 7;
7. j++, then jump to step 4;
8. i++, then jump to step 2;
9. dtw_distance = dis[Q.size-1][S.size-1];
10. Output dtw_distance.
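The loop above can be sketched in runnable form. Note one deliberate substitution: instead of the `lb + dis[i][j]` check of step 6, this sketch abandons when the minimum of the current row exceeds the best distance found so far, which is a common and conservative early-abandoning variant. The function name, the `best_so_far` parameter and the use of infinity as the "abandoned" signal are assumptions made for illustration.

```python
from typing import List


def dtw_early_abandon(q: List[int], s: List[int], best_so_far: float) -> float:
    """DTW distance between q and s, or infinity once it cannot beat best_so_far."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(s)          # row for the empty prefix of q
    for i in range(1, len(q) + 1):
        cur = [inf] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            cost = (q[i - 1] - s[j - 1]) ** 2
            cur[j] = min(prev[j - 1], prev[j], cur[j - 1]) + cost
        # Every warping path crosses row i, so the row minimum is a lower
        # bound on the final distance; stop as soon as it is already too large.
        if min(cur) > best_so_far:
            return inf
        prev = cur
    return prev[len(s)]


exact = dtw_early_abandon([1, 0, -1], [1, 0, -1], float("inf"))
abandoned = dtw_early_abandon([1, 0, -1], [1, 1, 1], 0.5)
completed = dtw_early_abandon([1, 0, -1], [1, 1, 1], 10.0)
```

Keeping only two rows of the matrix is enough here because each cell depends solely on the current and the previous row.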
The core of the DTW algorithm is a dynamic programming calculation. For any two given sequences A and B, of lengths N and M respectively, the DTW algorithm calculates the distance between the subsequence formed by the first i points of A and the subsequence formed by the first j points of B. The difference from the Euclidean distance is that the points of the two sequences are not required to correspond one-to-one; one-to-many correspondences are allowed. The conversion formula therefore follows readily:
dp[i][j]=min(dp[i-1][j-1],dp[i-1][j],dp[i][j-1])+dis[i][j]
When the calculation is complete, dp[N][M] is the required shape distance. The time complexity of the algorithm is $O(n \times m^2)$, where n is the number of subsequences meeting the requirement and m is the length of the query sequence. When the dp transition of the above formula is carried out, the offset range between the two sequences is limited, i.e. the distance between i and j in the above conversion formula is limited to not exceed a given value k, so that the complexity changes from $O(n \times m^2)$ to $O(n \times m \times k)$. The practical meaning of k is that the matching offset between the two sequences cannot exceed a given distance.
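A band-limited version of the recurrence can be sketched as follows. It is an illustration of the constraint that the distance between i and j must not exceed k, using squared point differences as dis[i][j]; the function name `dtw_banded` is hypothetical, and the sketch returns infinity when the two lengths differ by more than k because no admissible alignment exists in that case.

```python
from typing import List


def dtw_banded(a: List[int], b: List[int], k: int) -> float:
    """DTW distance where aligned indices i and j must satisfy |i - j| <= k."""
    inf = float("inf")
    n, m = len(a), len(b)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are visited: at most 2k + 1 per row.
        for j in range(max(1, i - k), min(m, i + k) + 1):
            dis = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]) + dis
    return dp[n][m]


strict = dtw_banded([0, 1, 1], [0, 0, 1], k=0)   # point-by-point alignment only
relaxed = dtw_banded([0, 1, 1], [0, 0, 1], k=1)  # one step of warping allowed
```

With k = 0 the alignment is forced to be one-to-one and the distance is 1; allowing k = 1 lets the second 0 of the second sequence align with the first point of the first sequence, which brings the distance down to 0.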
Based on the above embodiment, in the method, pruning is performed on the shape distance calculation result to obtain the best matching subsequence of each sequence in the candidate sequence set, which specifically includes:
when the Kim_FL lower bound is calculated, discarding the current subsequence if its distance in the candidate sequence set exceeds the historical shortest distance;
when the Keogh lower bound is calculated, discarding the current subsequence if its distance in the candidate sequence set exceeds the historical shortest distance;
and discarding the current subsequence if its distance in the distance matrix of the current DTW exceeds the historical shortest distance.
Specifically, after the shape distance has been obtained by matching calculation according to the conversion formula, the candidate subsequences are filtered using lower-bound distances that have low computational time complexity.
1) Kim_FL lower bound: the maximum of the distances between the start values, end values, maximum values and minimum values of the two sequences is calculated, so the complexity is O(n). However, because all time sequences are uniformly normalized before calculation, the value of each point can only be one of {-3, -2, -1, 0, 1, 2, 3}, so the time complexity can be optimized to O(1).
2) Keogh lower bound: an upper bound U (searching for the first position greater than U) and a lower bound L (searching for the first position greater than or equal to L) are established for the query sequence Q so as to envelop the query sequence. The area of the hatched portion in the graph is calculated by the following formula. The two sequences U and L are obtained by preprocessing the query sequence Q in O(m) time, and each time a sequence C is input, O(m) processing is performed to calculate the Keogh lower bound.
The formula calculates the Keogh lower bound, where Q is the shape sequence to be queried, input by the user; C is the shape sequence whose distance is currently being calculated; i is the subsequence subscript; ci is the value of the shape sequence C at i; Ui is the value of the shape sequence U at i; Li is the value of the shape sequence L at i.
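A conventional way to write this bound with the symbols defined above is given below. It is the standard LB_Keogh definition from the literature and is offered as an assumption about the intended formula; the warping window r used to build U and L is likewise an assumed symbol. The bound is shown in squared form, without a square root, to stay consistent with the squared differences accumulated in the DTW recurrence.

```latex
LB_{Keogh}(Q, C) = \sum_{i=1}^{m}
\begin{cases}
(c_i - U_i)^2 & \text{if } c_i > U_i \\
(c_i - L_i)^2 & \text{if } c_i < L_i \\
0 & \text{otherwise}
\end{cases}
\qquad
U_i = \max(q_{i-r}, \dots, q_{i+r}), \quad
L_i = \min(q_{i-r}, \dots, q_{i+r})
```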
When the Kim_FL and Keogh lower bounds are calculated for a sequence, the subsequence can be discarded if its distance already exceeds the shortest distance recorded so far.
When DTW is calculated, because DTW is computed from left to right, pruning can be applied again by adding the DTW distance computed so far to the Keogh lower bound of the remaining positions, i.e. using DTW[k]+Keogh[k+1..n] as a tighter lower bound.
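The cascade of lower bounds described above can be sketched as follows. All names (`lb_kim`, `envelope`, `lb_keogh`, `passes_lower_bounds`) and the window parameter `r` are hypothetical. The sketch assumes equal-length sequences and squared differences, and builds the envelope with simple slices, which costs O(m×r) rather than the O(m) preprocessing mentioned in the text.

```python
def lb_kim(q, c):
    """Largest squared gap among start, end, maximum and minimum values."""
    pairs = ((q[0], c[0]), (q[-1], c[-1]), (max(q), max(c)), (min(q), min(c)))
    return max((x - y) ** 2 for x, y in pairs)

def envelope(q, r):
    """Upper and lower envelopes of q within a window of r points each side."""
    upper = [max(q[max(0, i - r): i + r + 1]) for i in range(len(q))]
    lower = [min(q[max(0, i - r): i + r + 1]) for i in range(len(q))]
    return upper, lower

def lb_keogh(c, upper, lower):
    """Squared distance from c to the envelope; zero where c lies inside it."""
    total = 0.0
    for ci, ui, li in zip(c, upper, lower):
        if ci > ui:
            total += (ci - ui) ** 2
        elif ci < li:
            total += (ci - li) ** 2
    return total

def passes_lower_bounds(q, c, upper, lower, best_so_far):
    """Cheapest test first; False means c can be discarded without running DTW."""
    if lb_kim(q, c) > best_so_far:
        return False
    if lb_keogh(c, upper, lower) > best_so_far:
        return False
    return True
```

Only candidates for which `passes_lower_bounds` returns True would go on to the full DTW calculation, so most of the quadratic work is avoided when the bounds are tight.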
All DTW calculation results are then summarized by a summarizing module to obtain the subsequence that best matches the time sequence change trend Q.
According to the invention, by limiting the range of the matching error between the two sequences in the dynamic programming process and by applying diversified pruning, the most similar subsequence of each sequence is found among the sequences of the candidate sequence set, which greatly improves sequence matching performance. The method can effectively solve specific problems in industrial Internet of Things data retrieval and management in fields related to China's key manufacturing industries, such as large aircraft manufacturing, and supports wider applications based on big data and live data assets.
Based on the above embodiments, the present invention provides an illustration of one complete retrieval match performed with the above retrieval matching method based on a time sequence database, implemented through the following steps S1 to S5:
S1: the user provides the time sequence meta information, the trend description sequence Q, and the time interval between adjacent data points in the sequence Q;
S2: a candidate sequence set, comprising candidate sequence 1 to candidate sequence n, is screened from a pre-created time sequence database according to the time sequence meta information;
S3: data normalization is performed on the sequence Q and on candidate sequence 1 to candidate sequence n to obtain shape sequence Q' and shape sequence 1' to shape sequence n';
S4: the shape distances between shape sequence Q' and shape sequence 1' to shape sequence n' are calculated in parallel using the conversion formula;
S5: the shape distance calculation results are pruned to obtain the best matching subsequence of each sequence in the candidate sequence set.
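Steps S1 to S5 can be sketched end to end as below. This is an illustration under several assumptions rather than the patented implementation: the database is modelled as a plain dictionary of (meta information, values) pairs, normalization is taken to be z-normalization followed by rounding and clipping to the integers -3 to 3, each sliding window is normalized independently, and the parallelism of S4, the pruning of S5 and the time interval of S1 are left out for brevity. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def to_shape(values):
    """Assumed normalization: z-score, then round and clip to [-3, 3]."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0] * len(values)
    return [max(-3, min(3, round((v - mu) / sigma))) for v in values]

def dtw(a, b):
    """Plain DTW with squared-difference point cost."""
    dp = [[math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = cost + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def best_matches(query, wanted_meta, database):
    """Return {sequence name: (distance, start index)} for every candidate."""
    q_shape = to_shape(query)                       # S3 for the query
    width = len(query)
    results = {}
    for name, (meta, values) in database.items():
        # S2: keep only sequences whose meta information matches.
        if any(meta.get(key) != val for key, val in wanted_meta.items()):
            continue
        best = (math.inf, None)
        for start in range(len(values) - width + 1):
            window = to_shape(values[start:start + width])   # S3
            d = dtw(q_shape, window)                         # S4
            if d < best[0]:                                  # S5, without pruning
                best = (d, start)
        results[name] = best
    return results
```

For example, querying the trend `[0, 10, 0]` against a sequence `[5, 5, 5, 5, 25, 5, 5, 5]` locates the spike at start index 3, even though the absolute values differ, because both are reduced to the same shape sequence.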
In the above specific embodiment, the present invention provides a retrieval matching method based on a time sequence database, by acquiring time sequence meta information and time sequence data change trend information; screening a candidate sequence set from a pre-established time sequence database based on the time sequence meta information; and carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set. According to the time sequence meta information and the time sequence data change trend information, the candidate sequence set is screened from the time sequence database, then the matching calculation is carried out, the subsequence which is the best matched with the shape described by the user in the database is queried, the user is supported to find out the required sequence section by utilizing the mixed information, and the query with shorter time delay is realized, so that the performance of sequence matching is greatly improved. 
The invention screens the candidate sequence set from the time sequence database based on the time sequence meta information provided by the user and then, according to the user-defined description of the time sequence data change trend, replaces the Euclidean distance with the pattern distance and the shape distance on the basis of the traditional Dynamic Time Warping (DTW) algorithm, so that the user can use a time sequence change trend instead of an exact time sequence as the input for query matching. In addition, by limiting the range of the matching error between the two sequences in the dynamic programming process and by applying diversified pruning, the most similar subsequence of each sequence is found among the sequences of the candidate sequence set, which greatly improves sequence matching performance and supports the user in finding the required sequence segment using mixed information, so as to support diversified industrial applications.
The retrieval matching device based on a time sequence database provided by the invention is described below; the device described below and the retrieval matching method based on a time sequence database described above may be referred to in correspondence with each other.
Fig. 6 is a schematic structural diagram of a retrieval matching device based on a time sequence database according to an embodiment of the present invention, as shown in fig. 6, the embodiment of the present invention provides a retrieval matching device based on a time sequence database, including: an acquisition module 610; a screening module 620; a matching module 630;
wherein:
an acquisition module 610, configured to acquire time sequence meta information and time sequence data change trend information;
a screening module 620, configured to screen a candidate sequence set from a time sequence database created in advance based on the time sequence meta information;
and a matching module 630, configured to perform a matching calculation on the candidate sequence set based on the time-series data variation trend information, so as to find a best matching subsequence of each sequence from the sequences in the candidate sequence set.
Based on the above embodiment, in the apparatus, the screening of the candidate sequence set from the pre-created time sequence database based on the time sequence meta information specifically includes:
inquiring target time sequence data from the time sequence database by utilizing the time sequence meta information, and obtaining an inquiry result;
constructing a query result set containing all query results;
and taking the query result set as the candidate sequence set under the condition that the query result set is a non-empty set.
Based on the above embodiment, in the apparatus, the time series data change trend information includes a sequence Q with a fixed length and a time interval between each adjacent data point in the sequence Q, so as to screen out a sub-sequence with the highest similarity with the sequence Q in the candidate sequence set, and take the sub-sequence with the highest similarity as the best matching sub-sequence.
Based on the above embodiment, in the apparatus, the matching calculation is performed on the candidate sequence set based on the time series data change trend information, so as to find a best matching subsequence of each sequence from the sequences of the candidate sequence set, and specifically includes:
carrying out data normalization processing on each sequence in the candidate sequence set to obtain a shape sequence of each sequence in the candidate sequence set;
calculating the shape distance in parallel between the shape sequence of the sequence Q and the shape sequence of each sequence in the candidate sequence set to obtain a shape distance calculation result;
pruning is carried out on the shape distance calculation result to obtain the best matching subsequence of each sequence in the candidate sequence set.
Based on the above embodiment, in the apparatus, matching calculation is performed on the candidate sequence set based on the time series data change trend information by using a curve matching module created in advance;
wherein, the curve matching module includes:
a normalization module, configured to normalize the fixed-length sequence Q input by the user and the candidate sequence set, so as to convert the sequence Q and the candidate sequence set into shape sequences;
a parallel computing module, through which the shape distances between the shape sequence and each sequence of the given shape sequence set are calculated in parallel;
a result summarizing module, through which the shape distance calculation results are summarized and the subsequence most similar to the time sequence change trend input by the user is output.
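The division of labour between the parallel computing module and the result summarizing module can be sketched as follows. The sketch assumes that the inputs are already shape sequences produced by the normalization module, and all names (`shape_distance`, `best_window`, `match_all`, `workers`) are hypothetical. A thread pool is used only to keep the example self-contained; for CPU-bound work in pure Python a process pool would be the more natural choice.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def shape_distance(a, b):
    """Plain DTW with squared-difference point cost."""
    dp = [[math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = cost + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def best_window(q_shape, shape):
    """Best (distance, start) over all windows of len(q_shape) in one sequence."""
    width = len(q_shape)
    best = (math.inf, None)
    for start in range(len(shape) - width + 1):
        d = shape_distance(q_shape, shape[start:start + width])
        if d < best[0]:
            best = (d, start)
    return best

def match_all(q_shape, shape_set, workers=4):
    """Parallel step per sequence, then a summary step picking the overall best."""
    if not shape_set:
        return None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_sequence = list(pool.map(lambda s: best_window(q_shape, s), shape_set))
    winner = min(range(len(per_sequence)), key=lambda i: per_sequence[i][0])
    distance, start = per_sequence[winner]
    return winner, start, distance
```

`match_all` returns the index of the winning sequence, the start of its best window and the corresponding distance; `per_sequence` holds the best match of each individual sequence if those are needed as well.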
Based on the above embodiment, in the apparatus, a preset formula is used as a conversion formula for the matching calculation;
wherein, the preset formula is:
dp[i][j]=min(dp[i-1][j-1],dp[i-1][j],dp[i][j-1])+dis[i][j]
wherein dp[i][j] represents the minimum DTW distance between the two sequences for subsequence index i and length j; dis[i][j] represents the distance between the two sequences at subsequence index i and length j; min() returns the smallest of the specified numbers; i is the subsequence index and j is the query sequence length.
Based on the above embodiment, in the apparatus, pruning is performed on the shape distance calculation result to obtain a best matching subsequence of each sequence in the candidate sequence set, which specifically includes:
when the Kim_FL lower bound is calculated, discarding the current subsequence if its distance in the candidate sequence set exceeds the historical shortest distance;
when the Keogh lower bound is calculated, discarding the current subsequence if its distance in the candidate sequence set exceeds the historical shortest distance;
and discarding the current subsequence if its distance in the distance matrix of the current DTW exceeds the historical shortest distance.
In the above specific embodiment, the present invention provides a retrieval matching device based on a time series database, by acquiring time series meta information and time series data change trend information; screening a candidate sequence set from a pre-established time sequence database based on the time sequence meta information; and carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set. According to the time sequence meta information and the time sequence data change trend information, the candidate sequence set is screened from the time sequence database, then the matching calculation is carried out, the subsequence which is the best matched with the shape described by the user in the database is queried, the user is supported to find out the required sequence section by utilizing the mixed information, and the query with shorter time delay is realized, so that the performance of sequence matching is greatly improved.
Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a time-series database based search matching method comprising: acquiring time sequence meta information and time sequence data change trend information; screening a candidate sequence set from a pre-established time sequence database based on the time sequence meta information; and carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the retrieval matching method based on a time sequence database provided above, the method comprising: acquiring time sequence meta information and time sequence data change trend information; screening a candidate sequence set from a pre-established time sequence database based on the time sequence meta information; and carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A time sequence database-based search matching method, comprising:
acquiring time sequence meta information and time sequence data change trend information;
screening a candidate sequence set from a pre-established time sequence database based on the time sequence meta information;
performing matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find a best matching subsequence of each sequence from the sequences of the candidate sequence set;
the time sequence data change trend information comprises a sequence Q with fixed length and time intervals between adjacent data points in the sequence Q, and is used for screening out a subsequence with highest similarity with the sequence Q in the candidate sequence set, and taking the subsequence with highest similarity as the best matching subsequence.
2. The method for matching search based on time series database according to claim 1, wherein the step of screening candidate sequence sets from a pre-created time series database based on the time series meta information comprises the following steps:
inquiring target time sequence data from the time sequence database by utilizing the time sequence meta information, and obtaining an inquiry result;
constructing a query result set containing all query results;
and taking the query result set as the candidate sequence set under the condition that the query result set is a non-empty set.
3. The method for matching search based on time series database according to claim 1, wherein the step of performing matching calculation on the candidate sequence set based on the time series data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set comprises the following steps:
carrying out data normalization processing on each sequence in the candidate sequence set to obtain a shape sequence of each sequence in the candidate sequence set;
calculating the shape distance in parallel between the shape sequence of the sequence Q and the shape sequence of each sequence in the candidate sequence set to obtain a shape distance calculation result;
pruning is carried out on the shape distance calculation result to obtain the best matching subsequence of each sequence in the candidate sequence set.
4. The retrieval matching method based on a time series database according to claim 3, wherein the matching calculation is performed on the candidate sequence set based on the time series data change trend information by using a curve matching module created in advance;
wherein, the curve matching module includes:
the normalization module is used for normalizing the sequence Q with the fixed length and the candidate sequence set input by the user so as to convert the sequence Q and the candidate sequence set into a shape sequence;
a parallel computing module through which each sequence of the shape sequence and the given shape sequence set computes the shape distance in parallel;
a result summarizing module; and summarizing the calculation results of the shape distances through the result summarizing module, and outputting a subsequence most similar to the time sequence change trend input by the user.
5. A time series database based search matching method according to claim 3, characterized in that a preset formula is used as a conversion formula for the matching calculation;
wherein, the preset formula is:
dp[i][j]=min(dp[i-1][j-1],dp[i-1][j],dp[i][j-1])+dis[i][j], wherein dp[i][j] represents the minimum dtw distance between two sequences of subsequence i at length j; dis[i][j] represents the shortest distance between two sequences of the subsequence i at a length j; min() represents a number with the minimum value in the return of the specified number; i is a sub-sequence index and j is the query sequence length.
6. The method for matching search based on time series database according to claim 3, wherein pruning is performed on the shape distance calculation result to obtain a best matching subsequence of each sequence in the candidate sequence set, specifically comprising:
when the lower bound of Kim_FL is calculated, discarding the current subsequence when the distance of the current subsequence in the candidate sequence set exceeds the historical shortest distance;
when Keogh lower bound is calculated, discarding the current subsequence when the distance of the current subsequence in the candidate sequence set exceeds the historical shortest distance;
and discarding the current subsequence when the distance of the current subsequence in the distance matrix of the current DTW in the candidate sequence set exceeds the historical shortest distance.
7. A time series database-based search matching device, comprising:
the acquisition module is used for acquiring time sequence meta information and time sequence data change trend information;
the screening module is used for screening a candidate sequence set from a time sequence database which is created in advance based on the time sequence meta information;
the matching module is used for carrying out matching calculation on the candidate sequence set based on the time sequence data change trend information so as to find the best matching subsequence of each sequence from the sequences of the candidate sequence set;
the time sequence data change trend information comprises a sequence Q with fixed length and time intervals between adjacent data points in the sequence Q, and is used for screening out a subsequence with highest similarity with the sequence Q in the candidate sequence set, and taking the subsequence with highest similarity as the best matching subsequence.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the time-series database-based search matching method of any one of claims 1 to 6 when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the time-series database-based search matching method of any of claims 1 to 6.
CN202211616863.2A 2022-12-15 2022-12-15 Retrieval matching method and device based on time sequence database Active CN116089491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211616863.2A CN116089491B (en) 2022-12-15 2022-12-15 Retrieval matching method and device based on time sequence database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211616863.2A CN116089491B (en) 2022-12-15 2022-12-15 Retrieval matching method and device based on time sequence database

Publications (2)

Publication Number Publication Date
CN116089491A CN116089491A (en) 2023-05-09
CN116089491B true CN116089491B (en) 2024-01-30

Family

ID=86211215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211616863.2A Active CN116089491B (en) 2022-12-15 2022-12-15 Retrieval matching method and device based on time sequence database

Country Status (1)

Country Link
CN (1) CN116089491B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000305940A (en) * 1999-04-21 2000-11-02 Nippon Telegr & Teleph Corp <Ntt> Method and device for retrieving time series data and storage medium storing time series data retrieval program
JP2004348594A (en) * 2003-05-23 2004-12-09 Nippon Telegr & Teleph Corp <Ntt> Time series data search method, device, and time program, and program storage medium
WO2021052156A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Data analysis method, apparatus and device, and computer readable storage medium
CN113360725A (en) * 2021-06-04 2021-09-07 重庆邮电大学 Electric power time sequence data retrieval method based on edge collaborative classification
CN114218292A (en) * 2021-11-08 2022-03-22 中国人民解放军国防科技大学 Multi-element time sequence similarity retrieval method
CN114357037A (en) * 2022-03-22 2022-04-15 苏州浪潮智能科技有限公司 Time sequence data analysis method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
褚蓉 ; 钮焱 ; .基于形状特征的DTW距离相似性搜索算法.软件导刊.2018,(第03期),82-84+87. *

Also Published As

Publication number Publication date
CN116089491A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
JP7343568B2 (en) Identifying and applying hyperparameters for machine learning
Norouzi et al. Fast exact search in hamming space with multi-index hashing
CN106096066B (en) Text Clustering Method based on random neighbor insertion
CN102254015B (en) Image retrieval method based on visual phrases
US8972415B2 (en) Similarity search initialization
EP2457151A1 (en) Ranking search results based on word weight
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
CN108549696B (en) Time series data similarity query method based on memory calculation
Mueen et al. AWarp: Fast warping distance for sparse time series
CN102184205A (en) Multi-mode string matching algorithm based on extended precision chaos hash
Yuen et al. Superseding nearest neighbor search on uncertain spatial databases
CN111723360B (en) Credential code processing method, device and storage medium
WO2017053779A1 (en) Data storage and retrieval system using online supervised hashing
US11763136B2 (en) Neural hashing for similarity search
CN116089491B (en) Retrieval matching method and device based on time sequence database
CN111026736B (en) Data blood margin management method and device and data blood margin analysis method and device
CN114880360A (en) Data retrieval method and device based on Bayesian optimization
Li et al. Linear time motif discovery in time series
CN113495901B (en) Quick retrieval method for variable-length data blocks
CN112416754B (en) Model evaluation method, terminal, system and storage medium
Chen et al. CGAP-align: a high performance DNA short read alignment tool
WO2016110125A1 (en) Hash method for high dimension vector, and vector quantization method and device
CN116760723B (en) Data prediction method, device, equipment and medium based on prediction tree model
CN116665772B (en) Genome map analysis method, device and medium based on memory calculation
CN113590260A (en) Statistical calculation method, device and equipment for calculation resources and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant