CN113742387A - Data processing method, device and computer readable storage medium - Google Patents

Info

Publication number
CN113742387A
Authority
CN
China
Prior art keywords
data
abnormal
sequences
segment
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010473617.0A
Other languages
Chinese (zh)
Inventor
蒋勇
彭鑫
叶德忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN202010473617.0A
Priority to PCT/CN2021/086644 (WO2021238455A1)
Publication of CN113742387A
Legal status: Pending

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F18/20 Analysing (under G06F18/00 Pattern recognition)
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method, a data processing device, and a computer-readable storage medium. The data processing method comprises the following steps: acquiring a target data sequence; acquiring a first abnormal data segment in the target data sequence; acquiring a first data search space in the target data sequence; acquiring, according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment in the first data search space; and labeling the second abnormal data segment. In embodiments of the invention, the first abnormal data segment and the first data search space are obtained in the target data sequence, so that the first abnormal data segment can serve as a template for obtaining and labeling the corresponding second abnormal data segment in the first data search space. The other abnormal data segments in the target data sequence are thereby labeled, which improves the efficiency of labeling abnormal data and saves human resources and time resources.

Description

Data processing method, device and computer readable storage medium
Technical Field
Embodiments of the present invention relate to, but are not limited to, the field of information processing technologies, and in particular to a data processing method, a device, and a computer-readable storage medium.
Background
With the development of big data and artificial intelligence technologies, more intelligent and efficient machine learning techniques, such as anomaly detection on indicators, trend prediction, and failure root cause analysis, are increasingly being introduced into the operation and maintenance of communication networks. These techniques generally rely on a high-quality training data set to work well, and reliable label data is part of such a data set. In addition, deep learning has achieved great success in recent years in fields such as image and speech recognition, a success that relies on labeled data sets produced by a large amount of manual labeling.
However, manually labeling a huge training data set is very expensive and consumes a great deal of human resources and time. For example, a medium-scale network produces millions of time series; manually labeling all the abnormal data in them is not a feasible task, and even when empirical formulas and similar methods are used to assist labeling, the result is inaccurate and incomplete. Therefore, how to improve the efficiency of labeling abnormal data is a technical problem that urgently needs to be solved.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
In a first aspect, embodiments of the present invention provide a data processing method, a device, and a computer-readable storage medium, which can improve the labeling efficiency of abnormal data in data.
In a second aspect, an embodiment of the present invention provides a data processing method, including:
acquiring a target data sequence;
acquiring a first abnormal data segment in the target data sequence;
acquiring a first data search space in the target data sequence;
acquiring a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment;
and marking the second abnormal data segment.
In a third aspect, an embodiment of the present invention further provides an apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data processing method of the second aspect as described above when executing the computer program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions for executing the data processing method described above.
The embodiment of the invention comprises the following steps: acquiring a target data sequence; acquiring a first abnormal data segment in a target data sequence; acquiring a first data search space in a target data sequence; acquiring a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment; and marking the second abnormal data segment. According to the scheme provided by the embodiment of the invention, the first abnormal data segment and the first data searching space are obtained in the target data sequence, so that the first abnormal data segment can be used as the abnormal data segment template, and the corresponding second abnormal data segment can be obtained and labeled in the first data searching space according to the abnormal data segment template, namely, the purpose of labeling other abnormal data segments in the target data sequence is realized.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification. They illustrate embodiments of the invention and, together with the embodiments, serve to explain the principles of the invention; they do not limit the invention.
FIG. 1 is a schematic diagram of a system architecture platform for performing a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method provided by an embodiment of the invention;
FIG. 3 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 4 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 5 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 6 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 7 is a flow chart of a data processing method according to another embodiment of the invention;
FIG. 8 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 9 is a flow chart of a data processing method according to another embodiment of the invention;
FIG. 10 is a flow chart of a data processing method according to another embodiment of the invention;
FIG. 11 is a flow chart of a data processing method according to another embodiment of the invention;
FIG. 12 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 13 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 14 is a flow chart of a data processing method according to another embodiment of the invention;
FIG. 15 is a flow diagram of a heuristic algorithm provided by one embodiment of the present invention;
FIG. 16 is a flow chart of a heuristic algorithm provided by another embodiment of the present invention;
FIG. 17 is a main flow chart of a data processing method according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms "first," "second," and the like in the description, in the claims, or in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Most data collected during network operation, for example time series index data, exhibits some periodicity. Abnormal data that recurs often appears at the same position in different periods, and its shape shows a certain similarity from one occurrence to the next. In particular, in time series index data that spans a long duration and contains a large amount of abnormal data, most of the abnormal data can be attributed to a few types of abnormality with similar characteristics, and the amount of truly unique abnormal data is relatively small. In addition, similar abnormal data may exist across different but closely related time series index data. For example, if the Central Processing Unit (CPU) utilization of one network element rises abnormally during a certain period, the same may also occur in the CPU utilization time series of another network element that carries similar services.
Based on the above, the present invention provides a data processing method, a device, and a computer-readable storage medium. Exploiting the periodic character of the abnormal data that recurs in most data, a first abnormal data segment and a first data search space are obtained in a target data sequence. The first abnormal data segment can then serve as an abnormal data segment template, according to which the corresponding second abnormal data segment can be obtained and labeled in the first data search space; that is, the other abnormal data segments in the target data sequence are labeled. Therefore, for time series index data with a large amount of abnormal data but few types of abnormality, the labeling efficiency can be improved compared with the traditional approach of manually labeling abnormal data segments, which saves human resources and time resources.
The embodiments of the present invention will be further explained with reference to the drawings.
As shown in fig. 1, fig. 1 is a schematic diagram of a system architecture platform for executing a data processing method according to an embodiment of the present invention.
In the example of fig. 1, the system architecture platform includes a memory 110 and a processor 120, where the memory 110 and the processor 120 may be connected by a bus or other means, and fig. 1 illustrates a connection by a bus as an example.
The memory 110, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory 110 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 110 may optionally include memory located remotely from processor 120, which may be connected to the system architecture platform via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be understood by those skilled in the art that the system architecture platform can be applied to various network controllers or network managers, and the embodiment is not limited thereto. In addition, the network controller or the network manager having the system architecture platform may be applied to various network systems, for example, may be applied to a 3G communication network system, an LTE communication network system, a 5G communication network system, a mobile communication network system that is evolved later, and the like, which is not limited in this embodiment.
Those skilled in the art will appreciate that the system architecture platform illustrated in FIG. 1 does not constitute a limitation on embodiments of the invention, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
In the system architecture platform shown in fig. 1, the processor 120 may call a data processing program stored in the memory 110 to perform a data processing method.
Based on the above system architecture platform, the following provides various embodiments of the data processing method of the present invention.
As shown in fig. 2, fig. 2 is a flowchart of a data processing method according to an embodiment of the present invention, which includes, but is not limited to, step S100, step S200, step S300, step S400, and step S500.
Step S100, a target data sequence is acquired.
In an embodiment, the target data sequence may be time series index data, or may be other sequence data, where the other sequence data may be non-time series index data such as service type sequence data or service quantity sequence data, and the embodiment is not limited in particular. In addition, the target data sequence may be obtained automatically by the device having the system architecture platform in the network, or may be obtained by entering the target data sequence into the device having the system architecture platform through manual operation, which is not limited in this embodiment.
Step S200, a first abnormal data segment in the target data sequence is obtained.
In an embodiment, the first abnormal data segment is a data segment in the target data sequence in which abnormal data exists, and the first abnormal data segment in the target data sequence may be determined and selected manually, and then is entered into the device having the system architecture platform, so that the device acquires the first abnormal data segment, or is saved in a memory of the device, so that the device may acquire the first abnormal data segment from the memory. After the first abnormal data segment in the target data sequence is obtained, necessary basic conditions can be provided for labeling the other abnormal data segments in the target data sequence in the subsequent steps.
It should be noted that, in time series index data, abnormal data often occurs at one or more consecutive time points. A time point at which abnormal data occurs is referred to as an abnormal time point, and the data segment corresponding to a set of consecutive abnormal time points is referred to as an abnormal data segment. An abnormal data segment may last for a long time (i.e., contain many abnormal time points). An abnormal data segment is therefore required to have at least the following characteristics: a start time point, an end time point, a length of at least 3 time points, and no overlap in time with any other abnormal data segment.
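The constraints on an abnormal data segment can be illustrated with a minimal sketch. The `AbnormalSegment` class and the `validate_segments` function are hypothetical names introduced here for illustration, and representing time points as integer indices with inclusive bounds is an assumption rather than something the description specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AbnormalSegment:
    """Hypothetical container: inclusive index range of one abnormal data segment."""
    start: int  # index of the first abnormal time point
    end: int    # index of the last abnormal time point

    def length(self) -> int:
        return self.end - self.start + 1


def validate_segments(segments: List[AbnormalSegment], min_points: int = 3) -> bool:
    """Return True if every segment has at least `min_points` time points
    and no two segments overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    for seg in ordered:
        if seg.start > seg.end or seg.length() < min_points:
            return False
    # After sorting by start, overlap can only occur between neighbours.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start <= prev.end:
            return False
    return True
```

A manually selected first abnormal data segment could be passed through such a check before it is used as a template.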
Step S300, a first data search space is obtained in the target data sequence.
In one embodiment, the data search space is a portion of candidate anomaly data extracted from the target data sequence by a machine learning method. By acquiring the data search space, most normal data can be filtered, and similar data segments appearing in the normal data are prevented from being mistakenly judged as similar abnormal data and searched, so that the search accuracy of the similar abnormal data can be improved; in addition, by acquiring the data search space, the search range of similar abnormal data can be narrowed, and the search efficiency can be improved.
In an embodiment, the first abnormal data segment may be a data segment outside the first data search space or a data segment in the first data search space, which is not specifically limited in this embodiment. When the first abnormal data segment is taken from the first data search space, it has already passed the preliminary extraction performed when the first data search space was acquired, so the first abnormal data segment can be acquired more accurately and effectively.
Step S400, according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment is obtained in the first data search space.
In an embodiment, the first abnormal data segment may be used as an abnormal data segment template, and the data segments in the first data search space may be compared with it to find a data segment that is the same as or similar to the first abnormal data segment, that is, a second abnormal data segment. Using the first abnormal data segment as a template and acquiring the corresponding second abnormal data segment in the first data search space thus achieves the purpose of finding abnormal data in the target data sequence. Compared with the traditional approach of finding abnormal data manually, this improves the efficiency of finding abnormal data in time series index data that has a large amount of abnormal data but few types of abnormality, and so saves human resources and time resources.
In an embodiment, all data in the first data search space may be treated as one data segment, and its similarity to the first abnormal data segment may be calculated with the Dynamic Time Warping (DTW) algorithm; when the result indicates that the two are similar, the second abnormal data segment corresponding to the first abnormal data segment in the first data search space can be determined. Alternatively, the data in the first data search space may be divided into a plurality of data segments of the same length as the first abnormal data segment, and the similarity between the first abnormal data segment and each of these data segments may be calculated with a measure such as the Euclidean distance or the Pearson correlation coefficient, so as to determine the second abnormal data segment corresponding to the first abnormal data segment in the first data search space.
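As an illustration of the DTW-based comparison, the following is a minimal sketch of the textbook dynamic-programming form of DTW, which can compare two segments of different lengths. The function name `dtw_distance` and the use of the absolute difference as the point-wise cost are assumptions made for this example; the description does not fix either choice.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    cost[i][j] holds the minimal accumulated cost of aligning
    a[:i] with b[:j]; the point-wise cost is |a[i-1] - b[j-1]|.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A smaller distance means the two segments are more alike. Where "similar" is cut off (for example, a distance threshold) would have to be chosen for the data at hand.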
And step S500, marking the second abnormal data segment.
In an embodiment, after the second abnormal data segment corresponding to the first abnormal data segment is found in the first data search space, the second abnormal data segment can be labeled, so that abnormal label data can be formed conveniently, a high-quality training data set can be obtained, and the method can be used for machine learning technologies such as deep learning technologies.
In an embodiment, because the data processing method uses steps S100, S200, S300, S400, and S500 above, the first abnormal data segment and the first data search space are obtained in the target data sequence, and the first abnormal data segment can be used as an abnormal data segment template. The corresponding second abnormal data segment can then be obtained and labeled in the first data search space according to this template; that is, the other abnormal data segments in the target data sequence are labeled. Therefore, for time series index data with a large amount of abnormal data but few types of abnormality, the data processing method of this embodiment improves the efficiency of labeling abnormal data compared with conventional manual labeling of abnormal data segments, thereby saving human resources and time resources.
Referring also to fig. 3, in an embodiment, step S300 includes, but is not limited to, the following steps:
step S310, acquiring a first abnormal characteristic value of a target data sequence;
step S320, determining a first data position corresponding to the first abnormal characteristic value in the target data sequence according to the first abnormal characteristic value;
step S330, a first data search space is obtained according to the first data position.
As will be understood by those skilled in the art, the abnormal data is data deviating from most data in the data set, and based on this, the first abnormal characteristic value in the present embodiment refers to a deviation value between the abnormal data and the normal data.
In an embodiment, after the first abnormal characteristic value of the target data sequence is obtained, the first data position corresponding to the first abnormal characteristic value in the target data sequence may be determined according to it; that is, the position of abnormal data in the target data sequence may be determined according to the first abnormal characteristic value. For example, when the distance by which a data point deviates from the majority of the data is greater than or equal to the first abnormal characteristic value, the position of that data point may be determined to be the position of abnormal data.
In one embodiment, the first data location may include a start exception location, an intermediate exception location, and an end exception location, and the first data search space is obtained when the start exception location, the intermediate exception location, and the end exception location are determined.
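One way to picture how data positions yield a search space is sketched below. It is an illustrative assumption, not the described method itself: each point is flagged when its deviation is greater than or equal to a threshold, and consecutive flagged points are grouped into runs, with the first index of a run as the start position, the last as the end position, and the indices between them as intermediate positions. The name `search_space_positions` and the `threshold` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def search_space_positions(
    deviations: Sequence[float], threshold: float
) -> List[Tuple[int, int]]:
    """Group consecutive points whose deviation >= threshold into runs.

    Returns inclusive (start, end) index pairs; together the runs form
    a candidate search space within the target data sequence.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, dev in enumerate(deviations):
        flagged = dev >= threshold
        if flagged and start is None:
            start = i                      # a new run begins here
        elif not flagged and start is not None:
            runs.append((start, i - 1))    # the run ended at the previous point
            start = None
    if start is not None:                  # a run that reaches the end of the data
        runs.append((start, len(deviations) - 1))
    return runs
```

For example, `search_space_positions([0.1, 2.0, 3.0, 2.5, 0.2, 0.0, 4.0], 1.0)` groups indices 1 to 3 into one run and index 6 into another.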
It should be noted that, after the first abnormal characteristic value is obtained, different algorithms such as the LOF (Local Outlier Factor) algorithm, the DBSCAN algorithm, or the Isolation Forest (iForest) algorithm may be used to obtain the first data search space, which is not limited in this embodiment. Taking the isolation forest algorithm as an example: first, an iTree (isolation tree) is constructed from the target data sequence, with the data in the target data sequence used as the sample data of the tree. The sample data is then split in two using the first abnormal characteristic value, so that the sample data that conforms to the first abnormal characteristic value and the sample data that does not form two separate data sets. This process is then repeated on each of the two data sets until the data can no longer be divided or the maximum height of the tree is reached. In this way the first data position corresponding to the first abnormal characteristic value in the target data sequence can be obtained, and the first data search space can be determined according to the first data position.
In addition, as can be understood by those skilled in the art, the LOF algorithm, the DBSCAN algorithm, and the isolation forest algorithm are all commonly used in the art, so their specific principles are not described in detail here.
Referring also to fig. 4, in an embodiment, step S310 includes, but is not limited to, the following steps:
step S311, acquiring first baseline prediction data of the target data sequence;
in step S312, a first abnormal feature value is obtained according to the deviation value of the first baseline predicted data from the data in the target data sequence.
In one embodiment, when the step of obtaining the first abnormal characteristic value of the target data sequence is performed, the baseline prediction data of the normal data in the target data sequence may be obtained, and then the first abnormal characteristic value may be obtained according to the deviation value (i.e. the absolute difference value) between the data in the target data sequence and the baseline prediction data.
In an embodiment, different baseline prediction methods may be used to obtain the baseline prediction data of the normal data in the target data sequence, and this embodiment is not specifically limited in this respect. For example, the baseline prediction method may be a differencing method, a moving average method, a weighted moving average method, an exponentially weighted moving average method, an autoregressive integrated moving average (ARIMA) method, or a triple exponential smoothing method, or a regression method such as random forest or XGBoost (eXtreme Gradient Boosting). Using several baseline prediction methods yields several first abnormal characteristic values; combining these different first abnormal characteristic values and executing the corresponding steps to obtain the first data search space helps improve the accuracy and generalization capability of the resulting first data search space.
In an embodiment, after the first baseline predictive data of the target data sequence is acquired, the first abnormal characteristic value may be obtained by the following formula:
$$R = |P_i - X_i|$$
where $R$ is the first abnormal characteristic value, $X_i$ is the data in the target data sequence, and $P_i$ is the first baseline prediction data of the target data sequence.
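The deviation formula can be made concrete with a minimal sketch that uses a simple moving average as the baseline predictor, one of the baseline methods the description lists. The function names, the default window size, and the handling of the first point (which has no history and is predicted as itself) are assumptions made for illustration.

```python
from typing import List, Sequence


def moving_average_baseline(x: Sequence[float], window: int = 3) -> List[float]:
    """Predict each point as the mean of up to `window` preceding points.

    The first point has no history, so it is predicted as itself
    (an assumption of this sketch), giving it a deviation of zero.
    """
    baseline: List[float] = []
    for i in range(len(x)):
        history = x[max(0, i - window):i]
        baseline.append(sum(history) / len(history) if history else x[i])
    return baseline


def abnormal_feature_values(x: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Absolute deviation between baseline prediction and observation, per point."""
    return [abs(p - v) for p, v in zip(baseline, x)]
```

For the sequence `[1.0, 1.0, 1.0, 10.0, 1.0]` with `window=3`, the spike at index 3 gets a deviation of 9.0. The point after it gets 3.0, because the spike is included in that point's moving-average history.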
Those skilled in the art can understand that the differencing method, the moving average method, the weighted moving average method, the exponentially weighted moving average method, the ARIMA method, the triple exponential smoothing method, random forest, and XGBoost are all algorithms commonly used in the art, so their specific principles are not described again here.
Referring additionally to fig. 5, in an embodiment, step S400 includes, but is not limited to, the following steps:
step S410, determining a third data segment in the first data search space;
step S420, similarity calculation is carried out on the first abnormal data segment and the third data segment, and a first similarity metric value corresponding to the third data segment is obtained;
step S430, determining the corresponding third data segment as a second abnormal data segment according to the first similarity metric value.
In an embodiment, when the step of obtaining, according to the first abnormal data segment, the second abnormal data segment corresponding to the first abnormal data segment in the first data search space is performed, the third data segment may first be determined in the first data search space. After the third data segment is determined, similarity calculation is performed on the first abnormal data segment and the third data segment by using a similarity measurement algorithm, so as to obtain the first similarity metric value corresponding to the third data segment. When the first similarity metric value indicates that the first abnormal data segment is similar to the third data segment, the corresponding third data segment may be determined to be a second abnormal data segment (i.e., one of the remaining abnormal data segments in the first data search space). That is, whether the third data segment is a second abnormal data segment is determined by comparing the degree of similarity between the first abnormal data segment and the third data segment. Compared with the traditional approach of manually labeling abnormal data segments, this improves the efficiency of labeling abnormal data in the data, so that human resources and time resources can be saved.
In an embodiment, the number of the third data segments may be one or multiple, and the embodiment is not limited in particular. When the number of the third data segments is one, all data in the first data search space may be determined as the third data segments, or part of continuous data in the first data search space may be determined as the third data segments, which is not specifically limited in this embodiment; when the number of the third data segments is multiple, the data in the first data search space may be divided into multiple data segments with equal length, or the data in the first data search space may be divided into multiple data segments with different lengths, which is not limited in this embodiment.
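One possible way to divide the first data search space into multiple third data segments of equal length can be sketched as follows. This is only an assumption for illustration: the function name `sliding_segments` and the `step` parameter are hypothetical, and a step smaller than the segment length produces overlapping segments, which the embodiment neither requires nor excludes.

```python
def sliding_segments(search_space, length, step=1):
    """Return (start_index, segment) pairs, all segments having the same length.

    step == length gives a non-overlapping division of the search space;
    a smaller step gives overlapping candidate segments.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    last_start = len(search_space) - length
    return [
        (start, search_space[start:start + length])
        for start in range(0, last_start + 1, step)
    ]


overlapping = sliding_segments([1, 2, 3, 4, 5], length=3, step=1)
disjoint = sliding_segments([1, 2, 3, 4, 5, 6], length=3, step=3)
```

Keeping the start index with each segment makes it possible to map a selected segment back to its position in the target data sequence when it is labeled.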
In an embodiment, the similarity calculation for the first abnormal data segment and the third data segment may be implemented by using different similarity measurement algorithms. For example, for a plurality of third data segments with equal length, the similarity calculation may be performed on the first abnormal data segment and the third data segment by means of the Euclidean distance, the Pearson correlation coefficient, or the Spearman rank correlation coefficient; for another example, for a plurality of third data segments with different lengths, a DTW (Dynamic Time Warping) algorithm or an improved fast DTW algorithm may be used to perform the similarity calculation on the first abnormal data segment and the third data segment. The specific implementation of the similarity calculation may be appropriately selected according to actual use requirements, and this embodiment is not particularly limited. It is noted that the improved fast DTW algorithms may include the FastDTW algorithm, the SparseDTW algorithm, the LB_Keogh algorithm, the LB_Improved algorithm, etc., wherein the FastDTW algorithm reduces the computational complexity without a great loss of precision by constraining the search space and by data abstraction.
It can be understood by those skilled in the art that the Euclidean distance, the Pearson correlation coefficient, the Spearman rank correlation coefficient, the DTW algorithm, and the various improved fast DTW algorithms are algorithms commonly used in the art, and therefore the specific principles of these algorithms are not described in detail here.
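For illustration only, the two kinds of similarity measurement mentioned above can be sketched in Python as follows. The code shows the Euclidean distance for equal-length segments and the classic (unaccelerated) DTW recurrence for segments of different lengths; it is not FastDTW or any of the other improved variants, and the function names and the choice of absolute difference as the local cost are assumptions.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two segments of equal length."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses |x - y| as the local cost and keeps only two rows of the
    cumulative cost table. Works for segments of different lengths.
    """
    if not a or not b:
        raise ValueError("segments must be non-empty")
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0                       # empty prefix aligned with empty prefix
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


d_euclid = euclidean_distance([0, 0], [3, 4])
d_dtw_same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])   # stretched copy
d_dtw_different = dtw_distance([1, 2, 3], [5, 5, 5])
```

With both functions a smaller value means a higher degree of similarity, which matches the convention used for the first similarity metric value in the following steps.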
In addition, in an embodiment, step S430 includes, but is not limited to, the following steps:
step S431, when the first similarity metric value is smaller than the preset threshold, determining that the third data segment corresponding to the first similarity metric value is the second abnormal data segment.
In an embodiment, the first similarity metric value represents a degree of similarity between the first abnormal data segment and the third data segment, and a smaller value of the first similarity metric value represents a higher degree of similarity between the first abnormal data segment and the third data segment, so that when the first similarity metric value is smaller than a preset threshold, it can be determined that the first abnormal data segment and the third data segment have a higher degree of similarity, and thus the third data segment corresponding to the first similarity metric value can be determined as the second abnormal data segment.
In an embodiment, the preset threshold may be appropriately selected according to the similarity metric algorithm used; for example, different preset thresholds may be used for the Euclidean distance and the DTW algorithm, and the embodiment is not particularly limited.
In addition, referring to fig. 6, in an embodiment, when the number of the third data segments is more than two, the step S430 may include, but is not limited to, the following steps:
step S432, acquiring a first similarity metric value of which the numerical value is smaller than a preset threshold value;
step S433, sorting the first similarity metric values with the numerical values smaller than a preset threshold value from small to large so as to adjust the sorting of the corresponding third data segments;
step S434, determining the first N third data segments as second abnormal data segments, where N is greater than or equal to 1.
In an embodiment, when the number of the third data segments is more than two, the number of the obtained first similarity metric values corresponding to the third data segments is also more than two. In this case, the first similarity metric values smaller than the preset threshold may be obtained first, so as to screen out the third data segments having a certain degree of similarity with the first abnormal data segment and to exclude the remaining third data segments with a lower degree of similarity. Then, the first similarity metric values smaller than the preset threshold are sorted from small to large to adjust the order of the corresponding third data segments, so that the third data segments having a certain degree of similarity with the first abnormal data segment are reordered from the highest degree of similarity to the lowest. Finally, according to the actual application, the first N third data segments are determined as second abnormal data segments.
To illustrate by a specific example, a FastDTW algorithm may be used to compare each third data segment in the first data search space with the first abnormal data segment, so as to calculate and obtain a similarity metric value of each third data segment, and then sort all the third data segments according to the similarity metric values, so as to obtain several third data segments with higher similarity degrees with the first abnormal data segment, so that the second abnormal data segment that needs to be labeled may be determined according to the several third data segments.
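Steps S432 to S434 can be sketched as follows. This is a simplified illustration under stated assumptions: the function `select_second_abnormal`, the stand-in `l1_distance` (used here in place of FastDTW purely to keep the example short), the threshold and the sample segments are all hypothetical.

```python
def l1_distance(a, b):
    """Stand-in similarity metric: sum of absolute differences (smaller = more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_second_abnormal(candidates, template, distance, threshold, top_n):
    """Filter candidates by threshold, sort ascending by metric, keep the first N.

    candidates: list of (start_index, segment) pairs (the third data segments)
    template:   the first abnormal data segment
    Returns a list of (metric_value, start_index, segment).
    """
    scored = []
    for start, segment in candidates:
        value = distance(template, segment)
        if value < threshold:                 # S432: keep values below the threshold
            scored.append((value, start, segment))
    scored.sort(key=lambda item: item[0])     # S433: sort from small to large
    return scored[:top_n]                     # S434: first N segments


template = [0, 5, 0]
candidates = [(0, [0, 0, 0]), (3, [0, 5, 1]), (6, [0, 5, 0]), (9, [9, 9, 9])]
selected = select_second_abnormal(candidates, template, l1_distance,
                                  threshold=3, top_n=2)
selected_starts = [start for _, start, _ in selected]
```

Here the segments starting at positions 6 and 3 are selected, in that order, while the segments at positions 0 and 9 are excluded by the threshold.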
In one embodiment, the optimal value of N may be different for different first abnormal data segments and different first data search spaces. For example, if the value of N is too small, some abnormal data segments may be missed and left unlabeled, and if the value of N is too large, the accuracy may be reduced because some data segments with lower similarity are identified as abnormal. Therefore, the value of N needs to be properly selected according to the actual application, and if the value of N needs to be selected more accurately, the optimal value of N can be obtained by establishing a curve of accuracy rate and recall rate to calculate the AUC value.
It should be noted that the accuracy rate (precision) refers to the proportion of the labeled abnormal data segments that are labeled correctly; the recall rate refers to the proportion of the manually labeled abnormal data segment samples that are correctly labeled; and AUC (Area Under Curve) is an evaluation index of a model which, simply put, is the probability that, when a pair of samples (one positive sample and one negative sample) is randomly drawn and both are predicted by a trained classifier, the positive sample receives a higher score than the negative sample.
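As a hedged illustration of how the accuracy rate and the recall rate vary with N, the following sketch computes both for every prefix of a ranked list of candidate segments. It assumes that manual labels are available for a small validation set; the function name, the input format and the sample flags are assumptions, and the sketch stops short of computing the AUC value itself.

```python
def precision_recall_at_n(ranked_is_abnormal, total_abnormal):
    """Return (N, precision, recall) for every N from 1 to len(ranked_is_abnormal).

    ranked_is_abnormal[i] is True when the i-th ranked segment (most similar
    first) is truly abnormal according to manual labels.
    total_abnormal is the number of manually labeled abnormal segments.
    """
    if total_abnormal <= 0:
        raise ValueError("total_abnormal must be positive")
    points = []
    hits = 0
    for n, flag in enumerate(ranked_is_abnormal, start=1):
        if flag:
            hits += 1
        points.append((n, hits / n, hits / total_abnormal))
    return points


# Hypothetical ranking: the third most similar segment is a false positive.
points = precision_recall_at_n([True, True, False, True], total_abnormal=3)
```

In this example the accuracy rate stays at 1.0 up to N = 2 and the recall rate reaches 1.0 only at N = 4, which shows the trade-off that the choice of N has to balance.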
Additionally, referring to fig. 7, in an embodiment, step S100 may include, but is not limited to, the following steps:
step S110, acquiring a plurality of data sequences to be detected;
step S120, clustering a plurality of data sequences to be detected to obtain a target data class;
step S130, determining a target data sequence from each target data class.
In an embodiment, the number of data sequences to be detected acquired from the network is very large, so manually determining the first abnormal data segment in each data sequence to be detected would involve a heavy workload. In addition, because the acquisition time of a data sequence to be detected may be short, the sequence may not contain a large amount of similar abnormal data, in which case it is difficult to acquire the first abnormal data segment from that sequence. However, for one data index, many data sequences to be detected can be collected according to the different resource objects bound in the network. For example, a medium-scale network contains tens of thousands of port resources, so taking the data index of port traffic as an example, tens of thousands of data sequences to be detected can be collected, and these sequences often have a certain similarity. For example, the traffic time series data of a base station access port deployed in school A and the traffic time series data of a base station access port deployed in school B are likely to be quite similar, because the students of school A and school B have similar daily routines. The abnormal data features of similar data sequences to be detected also have a certain commonality. Based on the above, the obtained multiple data sequences to be detected may be clustered to obtain target data classes, and then a target data sequence is determined from each target data class, so as to provide the necessary basis for the subsequent steps.
In an embodiment, the number of the target data classes obtained by clustering the plurality of data sequences to be detected may be one or more, and is determined according to the similarity of the data sequences to be detected, for example, if the plurality of data sequences to be detected are similar to each other, all the data sequences to be detected may be classified as one target data class, and if some of the plurality of data sequences to be detected are similar to each other, the plurality of data sequences to be detected may be classified into a plurality of target data classes, each of the target data classes including a part of the data sequences to be detected.
In addition, referring to fig. 8, in an embodiment, the data processing method may further include the following steps:
step S600, respectively acquiring second data search spaces in other data sequences to be detected in each target data class;
step S700, the first abnormal data segment in the target data sequence is utilized to respectively obtain a second abnormal data segment in a second data search space in the rest data sequences to be tested.
In an embodiment, when a target data sequence is determined in each target data class, the second abnormal data segment in the target data sequence may be obtained and labeled through the steps in the above embodiment, and the specific technical principle and the technical effect thereof may refer to the related description in the above embodiment, which is not described herein again.
In an embodiment, since the data sequences to be detected in each target data class have a certain similarity, the first abnormal data segment obtained in the target data sequence may be suitable for obtaining and labeling the second abnormal data segment for the remaining data sequences to be detected in the same target data class, so that the second data search spaces may be obtained in the remaining data sequences to be detected in each target data class, and then the second abnormal data segments may be obtained in the second data search spaces in the remaining data sequences to be detected by using the first abnormal data segment in the target data sequence. The second abnormal data segment can be obtained in the second data search space in the rest of the data sequences to be detected only according to the first abnormal data segment in the target data sequence, so that the operation steps of respectively obtaining the corresponding first abnormal data segments in the rest of the data sequences to be detected can be saved, the second abnormal data segments in the data sequences to be detected can be obtained more simply and efficiently, and the labeling efficiency of the abnormal data in the time sequence index data can be improved.
It should be noted that the second data search space in this embodiment is the same type of technical features as the first data search space in the above embodiment, and the two are different only in that the first data search space belongs to the target data sequence, and the second data search space belongs to the remaining data sequences to be tested in the same target data class. In order to avoid content duplication, the second data search space is not described in detail here, and for the related explanation of the second data search space, reference may be made to the related explanation of the first data search space in the above embodiments.
It should be noted that step S700 in this embodiment is similar to step S400 in the embodiment shown in fig. 2, and both have similar technical principles and technical effects, and the difference between the two is only that the execution object is different, the execution object of step S400 in the embodiment is the first data search space of the target data sequence, and the execution object of step S700 in this embodiment is the second data search space of the remaining data sequences to be tested in the same target data class. In order to avoid content duplication, the detailed description of step S700 is not provided herein, and for the related explanation of step S700, the related explanation for step S400 in the above embodiment may be referred to.
Referring additionally to fig. 9, in an embodiment, step S600 includes, but is not limited to, the following steps:
step S610, respectively obtaining second abnormal characteristic values of other data sequences to be detected in each target data class;
step S620, respectively determining second data positions corresponding to the second abnormal characteristic values in the rest to-be-detected data sequences according to the second abnormal characteristic values;
step S630, respectively obtaining second data search spaces of the remaining data sequences to be detected according to the second data positions.
In an embodiment, the second abnormal characteristic value and the second data position in this embodiment belong to the same types of technical features as the first abnormal characteristic value and the first data position in the above embodiment, respectively. The only difference is that the first abnormal characteristic value and the first data position belong to the target data sequence, whereas the second abnormal characteristic value and the second data position belong to the remaining data sequences to be detected in the same target data class. In order to avoid duplication of content, the second abnormal characteristic value and the second data position are not described in detail here; for their explanation, reference may be made to the explanation of the first abnormal characteristic value and the first data position in the above embodiment.
In an embodiment, step S610, step S620, and step S630 in this embodiment are similar to step S310, step S320, and step S330 in the embodiment shown in fig. 3, and have similar technical principles and technical effects, and the difference between the two is only that execution objects are different, where the execution objects of step S310, step S320, and step S330 in the embodiment are target data sequences, and the execution objects of step S610, step S620, and step S630 in this embodiment are remaining data sequences to be tested in the same target data class. In order to avoid content duplication, the detailed description of step S610, step S620 and step S630 is not provided herein, and for the related explanation of step S610, step S620 and step S630, the related explanation for step S310, step S320 and step S330 in the above-mentioned embodiment may be referred to.
Referring to fig. 10, in an embodiment, the step S610 includes, but is not limited to, the following steps:
step S611, second baseline prediction data of the rest to-be-detected data sequences are respectively obtained in each target data class;
step S612, respectively obtaining second abnormal characteristic values of the remaining to-be-detected data sequences according to the deviation values of the second baseline prediction data and the data in the remaining to-be-detected data sequences.
In an embodiment, the second baseline prediction data in this embodiment belongs to the same type of technical feature as the first baseline prediction data in the above embodiment. The only difference is that the first baseline prediction data belongs to the target data sequence, whereas the second baseline prediction data belongs to the remaining data sequences to be detected in the same target data class. In order to avoid duplication of content, the second baseline prediction data is not described in detail here; for its explanation, reference may be made to the explanation of the first baseline prediction data in the above embodiment.
In an embodiment, step S611 and step S612 in this embodiment are similar to step S311 and step S312 in the embodiment shown in fig. 4, and have similar technical principles and technical effects, and the difference between the two is only that execution objects are different, where the execution objects of step S311 and step S312 in the embodiment are target data sequences, and the execution objects of step S611 and step S612 in this embodiment are remaining data sequences to be tested in the same target data class. In order to avoid content duplication, step S611 and step S612 are not described in detail here, and for the related explanation of step S611 and step S612, the related explanation of step S311 and step S312 in the above embodiment may be referred to.
Referring to fig. 11, in an embodiment, step S700 includes, but is not limited to, the following steps:
step S710, respectively determining a fourth data segment in the second data search space in the rest data sequences to be detected;
step S720, similarity calculation is carried out on the first abnormal data segment in the target data sequence and the fourth data segment in the rest data sequences to be tested respectively, and a second similarity metric value corresponding to the fourth data segment is obtained;
step S730, respectively determining the corresponding fourth data segment in the remaining data sequences to be tested as the second abnormal data segment in the remaining data sequences to be tested according to the second similarity metric.
In an embodiment, the fourth data segment and the second similarity metric value in this embodiment belong to the same types of technical features as the third data segment and the first similarity metric value in the foregoing embodiment, respectively. The only difference is the object to which they belong: the third data segment belongs to the first data search space of the target data sequence and the first similarity metric value corresponds to the third data segment, whereas the fourth data segment belongs to the second data search space of the remaining data sequences to be detected in the same target data class and the second similarity metric value corresponds to the fourth data segment. In order to avoid duplication of content, the fourth data segment and the second similarity metric value are not described in detail here; for their explanation, reference may be made to the explanation of the third data segment and the first similarity metric value in the above embodiments.
In an embodiment, step S710, step S720, and step S730 in this embodiment are similar to step S410, step S420, and step S430 in the embodiment shown in fig. 5, and have similar technical principles and technical effects, and the difference between the two is only that the execution objects are different, in the embodiment, the execution object of step S410, step S420, and step S430 is a first data search space of the target data sequence, and the execution object of step S710, step S720, and step S730 in this embodiment is a second data search space of the remaining data sequences to be tested in the same target data class. In order to avoid content duplication, the steps S710, S720 and S730 are not described in detail here, and for the related explanation of the steps S710, S720 and S730, the related explanation of the steps S410, S420 and S430 in the above embodiments can be referred to.
In addition, in an embodiment, step S730 includes, but is not limited to, the following steps:
step S731, when the second similarity metric is smaller than the preset threshold, determining that the corresponding fourth data segment in the remaining to-be-detected data sequences is the second abnormal data segment in the remaining to-be-detected data sequences.
In an embodiment, the second similarity metric represents a degree of similarity between the first abnormal data segment and the fourth data segment, and a smaller value of the second similarity metric represents a higher degree of similarity between the first abnormal data segment and the fourth data segment, so that when the second similarity metric is smaller than a preset threshold, it can be determined that the first abnormal data segment and the fourth data segment have a higher degree of similarity, and thus the fourth data segment corresponding to the second similarity metric can be determined as the second abnormal data segment.
In an embodiment, the preset threshold may be appropriately selected according to the similarity metric algorithm used; for example, different preset thresholds may be used for the Euclidean distance and the DTW algorithm, and the embodiment is not particularly limited.
In addition, referring to fig. 12, in an embodiment, when the number of the fourth data segments is more than two, the step S730 may include, but is not limited to, the following steps:
step S732, respectively obtaining second similarity metric values corresponding to the rest of the data sequences to be detected, wherein the numerical values of the second similarity metric values are smaller than a preset threshold value;
step S733, sorting the second similarity metric values with the numerical values smaller than a preset threshold value from small to large so as to respectively adjust the sorting of corresponding fourth data segments in the rest data sequences to be tested;
step S734, respectively determine the first N fourth data segments as second abnormal data segments in the remaining data sequences to be tested, where N is greater than or equal to 1.
In an embodiment, step S732, step S733, and step S734 in this embodiment are similar to step S432, step S433, and step S434 in the embodiment shown in fig. 6, and have similar technical principles and technical effects, and the difference between the two is only that the execution objects are different, in the embodiment, the execution objects of step S432, step S433, and step S434 are target data sequences, and the execution objects of step S732, step S733, and step S734 in this embodiment are the remaining data sequences to be measured in the same target data class. In order to avoid content duplication, detailed descriptions of step S732, step S733 and step S734 are not provided herein, and for the related explanations of step S732, step S733 and step S734, the related explanations of step S432, step S433 and step S434 in the above-described embodiment may be referred to.
Additionally, referring to fig. 13, in an embodiment, step S120 may include, but is not limited to, the following steps:
step S121, respectively carrying out data preprocessing on a plurality of data sequences to be detected to obtain a plurality of first preprocessed data sequences;
step S122, respectively carrying out baseline extraction processing on the plurality of first preprocessed data sequences to obtain a plurality of second preprocessed data sequences;
step S123, clustering the plurality of second preprocessed data sequences according to the similarity to obtain a target data class.
In an embodiment, when a plurality of data sequences to be detected need to be clustered, data preprocessing may be performed on the plurality of data sequences to be detected, respectively, to obtain a plurality of first preprocessed data sequences, then baseline extraction processing may be performed on the plurality of first preprocessed data sequences, respectively, to obtain a plurality of second preprocessed data sequences, and then the plurality of second preprocessed data sequences are clustered according to the similarity, so as to obtain the target data class. After the plurality of data sequences to be detected are clustered to obtain corresponding target data classes, the abnormal data segment labeling processing aiming at each data sequence to be detected can be converted into the abnormal data segment labeling processing aiming at each target data class, so that the processing complexity and the processing time can be reduced, and the labeling efficiency of abnormal data in the data can be improved.
In an embodiment, the baseline extraction processing is performed on the first preprocessed data sequence, so that an abnormal part and a noise part in the data sequence to be detected can be smoothed, and the accuracy of similarity measurement between the data sequences to be detected can be improved.
It should be noted that the baseline extraction processing in step S122 in this embodiment has a similar technical principle to the step of acquiring the baseline prediction data by using the baseline prediction method in the embodiment shown in fig. 4. For the explanation of the baseline extraction processing in step S122, reference may be made to the explanation of the baseline prediction method in the embodiment shown in fig. 4, and therefore details are not repeated here.
Additionally, referring to fig. 14, in an embodiment, step S121 may include, but is not limited to, the following steps:
step S1211, performing missing value filling processing on the plurality of data sequences to be detected respectively to obtain a plurality of filling data sequences;
step S1212, performing data normalization processing on the plurality of padding data sequences, respectively, to obtain a plurality of first preprocessed data sequences.
In an embodiment, the data sequences to be detected may contain missing values and may differ from one another in dimension. To address these problems, data padding is first performed on the missing values to obtain a padding data sequence, and then data normalization processing is performed on the padding data sequence to obtain a first preprocessed data sequence.
In an embodiment, a linear interpolation filling method may be used to perform the missing value filling processing. The linear interpolation filling method keeps the waveform of the data sequence to be detected smooth, which facilitates the baseline extraction processing. For example, for time series index data, the specific position of a missing value can be determined according to time continuity, and after the position is determined, the specific value to be filled can be obtained from the data before and after that position; for example, the average of the preceding and following data can be used as the value to be filled. It can be understood by those skilled in the art that the linear interpolation filling method is an algorithm commonly used in the art, and therefore its specific principle is not described in detail here.
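The missing value filling described above can be sketched as follows, purely as an illustration. The sketch assumes that missing samples are represented by `None` and that the samples are evenly spaced in time; copying the nearest known value into leading or trailing gaps is an additional assumption that the embodiment does not specify.

```python
def fill_missing_linear(values):
    """Fill None entries by linear interpolation between the nearest known neighbours.

    For a single missing point this yields the average of the previous and
    next values. Leading or trailing gaps copy the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("sequence has no known values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:                       # gap at the start
            filled[i] = values[right]
        elif right is None:                    # gap at the end
            filled[i] = values[left]
        else:                                  # interior gap: interpolate
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


padded = fill_missing_linear([1.0, None, 3.0, None, None, 6.0])
edges = fill_missing_linear([None, 2.0, None])
```

The first call fills the single gap with the mean of its neighbours and spreads the two-point gap evenly between 3.0 and 6.0.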
In an embodiment, the data normalization processing is performed on the padding data sequences, and the data sequences to be tested can be transformed and mapped onto a specific interval, so that dimensional differences among different data sequences to be tested can be eliminated, and the padding data sequences can be put together to perform similarity comparison. In this embodiment, the data normalization process may be performed by using a Z-Score method, and the calculation formula is as follows:
$x'_i = \dfrac{x_i - \bar{x}}{\sigma}$
wherein $x'_i$ is the first preprocessed data sequence, $x_i$ is the data sequence to be detected, $\bar{x}$ is the mean of the data sequence to be detected, and $\sigma$ is the standard deviation of the data sequence to be detected.
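A minimal sketch of the Z-Score normalization follows. The function name is hypothetical, the use of the population standard deviation (dividing by the number of samples) is an assumption, and returning zeros for a constant sequence is a choice made here only to avoid division by zero.

```python
import math


def z_score(xs):
    """Map a sequence to zero mean and unit standard deviation."""
    if not xs:
        return []
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs) / len(xs)   # population variance
    sigma = math.sqrt(variance)
    if sigma == 0:
        return [0.0 for _ in xs]          # constant sequence: no spread to scale by
    return [(x - mean) / sigma for x in xs]


normalized = z_score([2, 4, 4, 4, 5, 5, 7, 9])   # mean 5, standard deviation 2
```

After this transformation, sequences recorded in different units or magnitudes can be compared on the same scale.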
In addition, in an embodiment, the clustering the plurality of second preprocessed data sequences according to the similarity in step S123 may specifically include, but is not limited to, the following steps:
step S1231, clustering a plurality of second preprocessed data sequences according to the similarity by using a DBSCAN algorithm; the parameters of the DBSCAN algorithm comprise a distance function, a neighborhood number threshold and a neighborhood distance threshold; the result of the DBSCAN algorithm includes the classification number and the abnormal ratio.
As will be appreciated by those skilled in the art, the DBSCAN algorithm is one of the commonly used clustering algorithms, and the DBSCAN algorithm does not need to determine the number of cluster centers in advance. The key parameters of the DBSCAN algorithm comprise a distance function, a neighborhood number threshold and a neighborhood distance threshold, and the result of the DBSCAN algorithm comprises a classification number and an abnormal proportion.
In an embodiment, a Euclidean distance function may be adopted as the distance function; the neighborhood number threshold may be set to 4; and the neighborhood distance threshold needs to be dynamically estimated according to the data set, since this parameter has a significant influence on the clustering result.
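To make the roles of the three parameters concrete, the following is a compact, unoptimized sketch of the DBSCAN procedure written only for illustration. It assumes that a point counts as its own neighbour, that the "classification number" corresponds to the number of clusters found, and that the "abnormal proportion" corresponds to the fraction of points labeled as noise; these mappings, the function names and the sample points are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_pts=4, distance=euclidean):
    """Return (labels, cluster_count, noise_ratio); label -1 marks noise.

    eps is the neighbourhood distance threshold and min_pts the
    neighbourhood number threshold (each point counts itself).
    """
    n = len(points)
    neighbours = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1                    # provisionally noise
            continue
        labels[i] = cluster                   # i is a core point: start a cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # former noise becomes a border point
            if labels[j] is not None:
                continue                      # already assigned, do not expand
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])   # j is also core: keep expanding
        cluster += 1
    noise_ratio = labels.count(-1) / n if n else 0.0
    return labels, cluster, noise_ratio


# Two tight groups of four points and one isolated point (hypothetical data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels, cluster_count, noise_ratio = dbscan(pts, eps=1.5, min_pts=4)
```

In this example two clusters are found and the isolated point is reported as noise; with a much smaller `eps` every point would become noise, which illustrates why the neighborhood distance threshold has to be estimated from the data.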
Additionally, referring to FIG. 15, in one embodiment, the neighborhood distance threshold may be derived by a heuristic algorithm, wherein the heuristic algorithm includes, but is not limited to, the following steps:
step S810, calculating the similarity between every two of the plurality of second preprocessed data sequences through a distance function to obtain similarity matrix data;
step S820, calculating k-dist distance based on the similarity matrix data to obtain a k-dist sequence;
step S830, obtaining an initial distance threshold parameter based on a k-dist sequence;
step S840, adjusting the initial distance threshold parameter to obtain the neighborhood distance threshold.
In one embodiment, the k-dist distance refers to the distance between a data object and its k-th nearest object. When a suitable neighborhood distance threshold needs to be determined, the similarity between every two of the plurality of second preprocessed data sequences may be calculated through a distance function, such as the Euclidean distance function, to form similarity matrix data; the k-dist distance is then calculated based on the similarity matrix data to obtain a k-dist sequence; an initial distance threshold parameter is obtained based on the k-dist sequence; and the suitable neighborhood distance threshold is obtained by adjusting the initial distance threshold parameter. After the suitable neighborhood distance threshold is obtained through the heuristic algorithm, it can be used when clustering the plurality of second preprocessed data sequences according to the similarity by using the DBSCAN algorithm in this embodiment, so that the target data class can be obtained.
In an embodiment, before step S810 is executed, initial thresholds such as a maximum distance threshold, a minimum length threshold, a slope threshold, and a slope difference threshold of a neighborhood may be set, and after the initial thresholds are set, step S810, step S820, step S830, and step S840 are executed.
In an embodiment, when step S820 is executed, after the k-dist distance of each k-dist point is calculated based on the similarity matrix data, the obtained k-dist distances may be sorted from small to large, and k-dist points with a k-dist distance of 0 and k-dist points with a k-dist distance exceeding a neighborhood maximum distance threshold are excluded, so that the remaining k-dist points form a k-dist sequence.
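Steps S810 and S820, including the sorting and filtering described above, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the default `k=4` (chosen here to match the neighborhood number threshold) is an assumption, since the text does not state which k is used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_matrix(sequences, dist=euclidean):
    """Pairwise distances between all sequences (step S810)."""
    n = len(sequences)
    return [[dist(sequences[i], sequences[j]) for j in range(n)]
            for i in range(n)]


def k_dist_sequence(matrix, k=4, max_dist=float("inf")):
    """Sorted k-dist values without zeros and without values above max_dist.

    max_dist plays the role of the neighborhood maximum distance threshold.
    """
    values = []
    for i, row in enumerate(matrix):
        others = sorted(d for j, d in enumerate(row) if j != i)
        if len(others) >= k:
            values.append(others[k - 1])  # distance to the k-th nearest object
    return [d for d in sorted(values) if 0 < d <= max_dist]
```

For example, for the one-dimensional sequences `[[0], [1], [2], [3], [10]]` with `k=2` and `max_dist=5`, the isolated sequence `[10]` has a k-dist of 8 and is filtered out, leaving only the k-dist values of the four nearby sequences.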
Additionally, in an embodiment, step S830 may include, but is not limited to, the following steps:
step S831, calculating the slopes between each k-dist point in the k-dist sequence and its two adjacent points before and after, and when the slopes with the two adjacent points before and after are both smaller than a preset slope threshold and the difference between the two slopes is smaller than a preset slope difference threshold, determining the current k-dist point as a candidate distance threshold;
in step S832, the largest one of the candidate distance threshold values is determined as the initial distance threshold parameter.
In an embodiment, when the initial distance threshold parameter needs to be obtained based on the k-dist sequence, the k-dist points lying on a relatively flat part of the k-dist sequence may first be determined as candidate distance thresholds. Specifically, the slopes between each k-dist point in the k-dist sequence and its two adjacent points before and after are calculated first. If the slope between the current k-dist point and the previous adjacent point (which can be defined as a left slope) and the slope between the current k-dist point and the next adjacent point (which can be defined as a right slope) are both smaller than a preset slope threshold, and the difference between the left slope and the right slope is smaller than a preset slope difference threshold, the current k-dist point is determined as a candidate distance threshold. When a plurality of candidate distance thresholds are obtained, they may be sorted from large to small, and the candidate distance threshold with the maximum value is taken as the initial distance threshold parameter.
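The candidate selection of steps S831 and S832 can be sketched as follows. The sketch assumes that the k-dist sequence is already sorted in ascending order and that adjacent points are one index apart, so a slope is simply the difference between neighboring values. The function name and the use of the absolute difference between the two slopes are assumptions made for illustration.

```python
def initial_distance_threshold(k_dist, slope_threshold, slope_diff_threshold):
    """Return the largest k-dist value on a flat part of the curve, or None."""
    candidates = []
    # The first and last points have only one neighbor and are skipped.
    for i in range(1, len(k_dist) - 1):
        left = k_dist[i] - k_dist[i - 1]   # slope to the previous point
        right = k_dist[i + 1] - k_dist[i]  # slope to the next point
        if (left < slope_threshold
                and right < slope_threshold
                and abs(left - right) < slope_diff_threshold):
            candidates.append(k_dist[i])
    return max(candidates) if candidates else None
```

Taking the largest candidate places the initial threshold just before the point where the sorted k-dist curve starts to rise steeply.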
Additionally, in an embodiment, step S840 may include, but is not limited to, the following steps:
step S841, obtaining the stepping length;
step S842, adjusting the initial distance threshold parameter according to the step length to obtain a distance adjustment threshold, and when the classification number decreases, determining the distance adjustment threshold obtained by the previous adjustment as the neighborhood distance threshold.
In an embodiment, after the initial distance threshold parameter is determined, it may be further optimized. It should be noted that the optimization needs to reduce the abnormal proportion as much as possible while keeping the classification number unchanged. As the value of the distance threshold parameter increases, the abnormal proportion continuously decreases, and the classification number may also decrease. Therefore, the value of the initial distance threshold parameter can be increased gradually in a stepping manner to determine the optimal neighborhood distance threshold: starting from the initial distance threshold parameter, the classification number and the abnormal proportion are recalculated each time one step length is added, and the stepping stops when the classification number decreases. At this time, the distance adjustment threshold obtained by the previous adjustment can be determined as the optimal neighborhood distance threshold.
In an embodiment, the step length may be set according to an empirical value, or may be set according to the candidate distance thresholds. For example, when the step length is set according to the candidate distance thresholds, it may be set to one tenth of the difference between the maximum and the minimum of the candidate distance thresholds, which is not particularly limited in this embodiment.
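The stepping adjustment of steps S841 and S842 can be sketched as follows. The clustering itself is passed in as a callback `cluster_fn(eps)` returning the classification number and the abnormal proportion, so that the sketch stays independent of any particular DBSCAN implementation. The function names and the `max_steps` safety bound are assumptions made for illustration.

```python
def step_from_candidates(candidates):
    """One tenth of the spread of the candidate distance thresholds."""
    return (max(candidates) - min(candidates)) / 10


def tune_neighborhood_threshold(initial_eps, step, cluster_fn, max_steps=100):
    """Increase eps step by step until the classification number decreases.

    cluster_fn(eps) must return (classification_number, abnormal_proportion).
    Returns the last eps obtained before the classification number decreased.
    """
    prev_classes, _ = cluster_fn(initial_eps)
    best = initial_eps
    # max_steps is a hypothetical bound that guarantees termination.
    for n in range(1, max_steps + 1):
        candidate = initial_eps + n * step
        classes, _ = cluster_fn(candidate)
        if classes < prev_classes:
            break
        best, prev_classes = candidate, classes
    return best
```

Because the abnormal proportion does not increase as the threshold grows, the value returned is the one with the lowest abnormal proportion among those tried before the classification number dropped.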
In order to better explain the heuristic algorithms provided in the above embodiments, the following detailed description is given with specific examples:
In a specific example, as shown in fig. 16, the heuristic algorithm specifically includes the following steps:
Step S901, setting thresholds.
In this step, initial thresholds such as a neighborhood maximum distance threshold, a minimum length threshold, a slope threshold, and a slope difference threshold are set, respectively.
Step S902, calculating a sequence similarity matrix.
In this step, the similarity between each pair of data sequences is calculated by a distance function to form similarity matrix data.
Step S903, calculating k-dist distances and sorting.
In this step, the k-dist distance of each k-dist point is calculated based on the similarity matrix data, and the k-dist distances are sorted from small to large.
Step S904, filtering by the maximum distance threshold.
In this step, k-dist points with a k-dist distance of 0 and k-dist points with a k-dist distance exceeding the neighborhood maximum distance threshold are excluded.
Step S905, sequentially taking k-dist sequence values.
Step S906, determining whether both the left slope and the right slope are smaller than the slope threshold.
In this step, the slopes between each k-dist point and its two adjacent points before and after are calculated. If the slope between the current k-dist point and the previous adjacent point (which may be defined as a left slope) and the slope between the current k-dist point and the next adjacent point (which may be defined as a right slope) are both smaller than the preset slope threshold, step S907 is performed; otherwise, step S905 is performed.
Step S907, determining whether the difference between the left slope and the right slope is smaller than the slope difference threshold.
In this step, when the difference between the left slope and the right slope is smaller than the slope difference threshold, step S908 is performed; otherwise, step S905 is performed.
Step S908, determining the current k-dist point as a candidate distance threshold, repeating steps S905 to S907, and executing step S909 after all candidate distance thresholds are obtained.
Step S909, sorting the candidate distance thresholds from large to small, and taking the largest candidate distance threshold as the initial distance threshold parameter.
Step S910, executing the clustering algorithm to obtain the classification number and the abnormal proportion.
Step S911, adding one step length to the distance threshold parameter, and executing the clustering algorithm.
Step S912, determining whether the classification number has decreased; if yes, step S913 is executed; otherwise, step S911 is executed.
Step S913, determining the previous distance threshold as the optimal distance threshold.
In addition, in an embodiment, step S130 may include, but is not limited to, the following steps:
step S131, respectively calculating the average sum of the distances between each data sequence to be measured and the rest data sequences to be measured in each target data class, and determining the data sequence to be measured corresponding to the data sequence with the minimum average sum of the distances as the target data sequence.
In an embodiment, after clustering a plurality of data sequences to be tested to obtain target data classes, a core data sequence for representing the corresponding target data class may be determined for each target data class, that is, a target data sequence is determined from each target data class. When a target data sequence is determined from a target data class, the distance average sum of each to-be-detected data sequence and other to-be-detected data sequences in the target data class may be calculated first, and then the to-be-detected data sequence with the smallest distance average sum is selected as a core data sequence representing the target data class, that is, the target data sequence of the target data class.
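The selection of the core data sequence described above can be sketched as follows. The function name `core_sequence_index` and the handling of a class containing a single sequence are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def core_sequence_index(sequences, dist=euclidean):
    """Index of the sequence with the smallest mean distance to the others."""
    n = len(sequences)
    if n == 1:
        # Assumption: a class with one sequence is represented by it.
        return 0

    def mean_distance(i):
        total = sum(dist(sequences[i], sequences[j])
                    for j in range(n) if j != i)
        return total / (n - 1)

    return min(range(n), key=mean_distance)
```

The sequence chosen this way is an actual member of the class rather than a computed average, so an abnormal data segment found in it is a real observed segment.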
In one embodiment, the target data sequence may be determined by the following formula:
Figure BDA0002515107950000141
where
Figure BDA0002515107950000142
and
Figure BDA0002515107950000143
respectively represent different data sequences to be detected, and euclidean() represents the Euclidean distance.
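Consistent with the description of step S131, the selection rule can be written in the following form. This is a hedged reconstruction rather than the original formula: the symbols $C$ (one target data class), $s_i$ and $s_j$ (data sequences to be detected in $C$), and $s^{*}$ (the resulting target data sequence) are assumed notation.

```latex
s^{*} \;=\; \arg\min_{s_i \in C} \; \frac{1}{|C| - 1}
\sum_{\substack{s_j \in C \\ j \neq i}} \mathrm{euclidean}(s_i, s_j)
```

The factor $1/(|C| - 1)$ averages over the remaining sequences in the class and does not change which sequence attains the minimum within one class.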
In order to better explain the data processing method provided in the above embodiments, the following detailed description is made with specific examples:
In a specific example, fig. 17 is a main flow block diagram of the data processing method provided in this example. Based on the main flow block diagram shown in fig. 17, the data processing method specifically includes the following steps:
first, preprocessing the data sequences to be detected that are acquired from the network, wherein the preprocessing mainly comprises missing value filling and data standardization, to form a time series data set of equal length;
then, performing baseline extraction on each data sequence to be detected by a moving average method to form a baseline data set;
then, clustering the baseline data set by using the DBSCAN algorithm with a Euclidean distance metric function, with the threshold parameter tuned automatically by the heuristic algorithm;
then, determining a core sequence for each class in the clustering result through distance measurement;
then, determining an abnormal data segment template for each individual core sequence;
then, automatically generating a data search space for each sequence;
then, for each sequence, automatically acquiring the abnormal data segment template of the core sequence of the class to which the sequence belongs, searching the data search space of the sequence for similar abnormal segments, acquiring N similar abnormal segments with a high degree of similarity, and labeling them as abnormal;
and finally, manually checking and locally correcting the labeling results of the similar abnormal segments of each sequence to form the final labeled abnormal data.
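The search for similar abnormal segments in the flow above can be sketched as a sliding-window comparison against the abnormal data segment template. This is a simplified illustration, not the method of the embodiment: the whole series is used as the search space, overlapping windows are not merged, and the function name and parameters are hypothetical.

```python
import math


def find_similar_segments(series, template, threshold, top_n):
    """Return up to top_n (start_index, distance) pairs, closest first.

    A window is kept only if its Euclidean distance to the template is
    smaller than the preset threshold.
    """
    m = len(template)
    scored = []
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, template)))
        if d < threshold:
            scored.append((start, d))
    # A smaller distance means a higher similarity.
    scored.sort(key=lambda item: item[1])
    return scored[:top_n]
```

For example, with the series `[0, 0, 5, 9, 5, 0, 0, 0, 5, 8, 5, 0]` and the template `[5, 9, 5]`, the windows starting at indices 2 and 8 are the only ones within a threshold of 2, and they would be the segments labeled as abnormal.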
In addition, another embodiment of the present invention also provides an apparatus, including: a memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor and memory may be connected by a bus or other means.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It should be noted that the device in this embodiment may include the system architecture platform in the embodiment shown in fig. 1, and the device in this embodiment and the system architecture platform in the embodiment shown in fig. 1 belong to the same inventive concept, so that both have the same implementation principle and technical effect, and are not described in detail here.
Non-transitory software programs and instructions required to implement the data processing method of the above-described embodiment are stored in the memory, and when executed by the processor, perform the data processing method of the above-described embodiment, for example, perform the above-described method steps S100 to S500 in fig. 2, the method steps S310 to S330 in fig. 3, the method steps S311 to S312 in fig. 4, the method steps S410 to S430 in fig. 5, the method steps S432 to S434 in fig. 6, the method steps S110 to S130 in fig. 7, the method steps S600 to S700 in fig. 8, the method steps S610 to S630 in fig. 9, the method steps S611 to S612 in fig. 10, the method steps S710 to S730 in fig. 11, the method steps S732 to S734 in fig. 12, the method steps S121 to S123 in fig. 13, the method steps S1211 to S1212 in fig. 14, the method steps S810 to S840 in fig. 15, and the method steps S901 to S913 in fig. 16.
The above described embodiments of the device are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Furthermore, another embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or a controller, for example, by a processor in the above-mentioned device embodiment, cause the processor to execute the data processing method in the above-mentioned embodiment, for example, to execute the above-mentioned method steps S100 to S500 in fig. 2, method steps S310 to S330 in fig. 3, method steps S311 to S312 in fig. 4, method steps S410 to S430 in fig. 5, method steps S432 to S434 in fig. 6, method steps S110 to S130 in fig. 7, method steps S600 to S700 in fig. 8, method steps S610 to S630 in fig. 9, method steps S611 to S612 in fig. 10, method steps S710 to S730 in fig. 11, method steps S732 to S734 in fig. 12, method steps S121 to S123 in fig. 13, method steps S1211 to S1212 in fig. 14, method steps S810 to S840 in fig. 15, and method steps S901 to S913 in fig. 16.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (22)

1. A data processing method, comprising:
acquiring a target data sequence;
acquiring a first abnormal data segment in the target data sequence;
acquiring a first data search space in the target data sequence;
acquiring a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment;
and marking the second abnormal data segment.
2. The data processing method of claim 1, wherein the obtaining a first data search space in the target data sequence comprises:
acquiring a first abnormal characteristic value of the target data sequence;
determining a first data position corresponding to the first abnormal characteristic value in the target data sequence according to the first abnormal characteristic value;
and acquiring the first data search space according to the first data position.
3. The data processing method according to claim 2, wherein the obtaining of the first abnormal feature value of the target data sequence comprises:
obtaining first baseline predictive data for the target data sequence;
obtaining a first anomaly characteristic value according to a deviation value of the first baseline predictive data and data in the target data sequence.
4. The data processing method according to claim 1, wherein the obtaining a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment comprises:
determining a third data segment in the first data search space;
similarity calculation is carried out on the first abnormal data segment and the third data segment, and a first similarity metric value corresponding to the third data segment is obtained;
and determining the corresponding third data segment as a second abnormal data segment according to the first similarity metric value.
5. The data processing method according to claim 4, wherein the determining that the corresponding third data segment is a second abnormal data segment according to the first similarity metric value comprises:
and when the first similarity metric value is smaller than a preset threshold value, determining the third data segment corresponding to the first similarity metric value as a second abnormal data segment.
6. The data processing method according to claim 4, wherein when the number of the third data segments is two or more, the determining that the corresponding third data segment is a second abnormal data segment according to the first similarity metric value includes:
acquiring the first similarity metric value of which the value is smaller than a preset threshold value;
sorting the first similarity metric values with the values smaller than a preset threshold value from small to large so as to adjust the sorting of the corresponding third data segments;
and determining the first N third data segments as second abnormal data segments, wherein N is greater than or equal to 1.
7. The data processing method of any one of claims 1 to 6, wherein the obtaining the target data sequence comprises:
acquiring a plurality of data sequences to be detected;
clustering a plurality of data sequences to be detected to obtain a target data class;
determining one of the target data sequences from each of the target data classes, respectively.
8. The data processing method of claim 7, further comprising:
respectively acquiring second data search spaces in other data sequences to be detected in each target data class;
and respectively acquiring the second abnormal data segments in the second data search spaces in the rest data sequences to be detected by using the first abnormal data segments in the target data sequence.
9. The data processing method according to claim 8, wherein the obtaining of the second data search space in the remaining data sequences to be tested in each of the target data classes respectively comprises:
respectively acquiring second abnormal characteristic values of other data sequences to be detected in each target data class;
respectively determining second data positions corresponding to the second abnormal characteristic values in the rest to-be-detected data sequences according to the second abnormal characteristic values;
and respectively acquiring second data search spaces of the other data sequences to be detected according to the second data positions.
10. The data processing method according to claim 9, wherein the obtaining of the second abnormal characteristic values of the remaining data sequences to be tested in each of the target data classes respectively comprises:
respectively acquiring second baseline prediction data of the other to-be-detected data sequences in each target data class;
and respectively obtaining second abnormal characteristic values of the rest data sequences to be detected according to the second baseline prediction data and the deviation values of the data in the rest data sequences to be detected.
11. The data processing method according to claim 8, wherein the obtaining the second abnormal data segment in the second data search space in the remaining data sequences to be tested by using the first abnormal data segment in the target data sequence comprises:
determining fourth data segments in the second data search spaces in the rest data sequences to be detected respectively;
respectively carrying out similarity calculation on the first abnormal data segment in the target data sequence and the fourth data segment in the rest data sequences to be detected to obtain a second similarity metric value corresponding to the fourth data segment;
and respectively determining the corresponding fourth data segment in the rest of the data sequences to be tested as a second abnormal data segment in the rest of the data sequences to be tested according to the second similarity metric.
12. The data processing method according to claim 11, wherein the determining, according to the second similarity metric values, that the corresponding fourth data segment in the remaining data sequences to be tested is a second abnormal data segment in the remaining data sequences to be tested respectively comprises:
and when the second similarity metric value is smaller than a preset threshold value, determining that the corresponding fourth data segment in the rest of the data sequences to be tested is a second abnormal data segment in the rest of the data sequences to be tested.
13. The data processing method according to claim 11, wherein when the number of the fourth data segments is two or more, the determining, according to the second similarity metric value, that the corresponding fourth data segment in the remaining data sequences to be tested is the second abnormal data segment in the remaining data sequences to be tested respectively includes:
respectively acquiring the second similarity metric values of which the numerical values corresponding to the rest data sequences to be detected are smaller than a preset threshold value;
sorting the second similarity metric values with the values smaller than a preset threshold value from small to large so as to respectively adjust the sorting of the corresponding fourth data segments in the rest data sequences to be tested;
and respectively determining the first N fourth data segments as second abnormal data segments in the rest data sequences to be detected, wherein N is more than or equal to 1.
14. The data processing method of claim 7, wherein the clustering the plurality of data sequences to be tested to obtain a target data class comprises:
respectively carrying out data preprocessing on the plurality of data sequences to be detected to obtain a plurality of first preprocessed data sequences;
respectively performing baseline extraction processing on the first preprocessed data sequences to obtain a plurality of second preprocessed data sequences;
and clustering the second preprocessing data sequences according to the similarity to obtain a target data class.
15. The data processing method according to claim 14, wherein the performing data preprocessing on the plurality of data sequences to be detected respectively to obtain a plurality of first preprocessed data sequences comprises:
respectively carrying out missing value filling processing on the plurality of data sequences to be detected to obtain a plurality of filling data sequences;
and respectively carrying out data standardization processing on the plurality of filling data sequences to obtain a plurality of first preprocessing data sequences.
16. The data processing method of claim 14, wherein the clustering the plurality of second preprocessed data sequences by similarity comprises:
clustering the second preprocessed data sequences according to the similarity by using a DBSCAN algorithm; the parameters of the DBSCAN algorithm comprise a distance function, a neighborhood number threshold and a neighborhood distance threshold; the result of the DBSCAN algorithm comprises a classification number and an abnormal proportion.
17. The data processing method of claim 16, wherein the neighborhood distance threshold is derived by a heuristic algorithm, wherein the heuristic algorithm comprises the steps of:
calculating the similarity between every two of the second preprocessed data sequences through the distance function to obtain similarity matrix data;
calculating k-dist distance based on the similarity matrix data to obtain a k-dist sequence;
obtaining an initial distance threshold parameter based on the k-dist sequence;
adjusting the initial distance threshold parameter to obtain the neighborhood distance threshold.
18. The data processing method according to claim 17, wherein said deriving an initial distance threshold parameter based on the k-dist sequence comprises:
calculating the slopes of each k-dist point in the k-dist sequence and two adjacent points before and after the k-dist sequence, and when the slopes of the two adjacent points before and after the k-dist sequence are both smaller than a preset slope threshold value and the difference value of the slopes of the two adjacent points before and after the k-dist sequence is smaller than a preset slope difference threshold value, determining the current k-dist point as a candidate distance threshold value;
and determining the largest one of the candidate distance threshold values as an initial distance threshold parameter.
19. The data processing method according to claim 17 or 18, wherein said adjusting the initial distance threshold parameter to obtain the neighborhood distance threshold comprises:
acquiring the stepping length;
and adjusting the initial distance threshold parameter according to the step length to obtain a distance adjustment threshold, and when the classification number is reduced, determining the distance adjustment threshold obtained by the previous step of adjustment as the neighborhood distance threshold.
20. The data processing method of claim 7, wherein said determining one of said target data sequences from each of said target data classes comprises:
and respectively calculating the average sum of the distances between each data sequence to be detected and the rest of the data sequences to be detected in each target data class, and determining the data sequence to be detected corresponding to the data sequence with the minimum average sum of the distances as the target data sequence.
21. An apparatus, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the data processing method according to any one of claims 1 to 20 when executing the computer program.
22. A computer-readable storage medium storing computer-executable instructions for performing the data processing method of any one of claims 1 to 20.
CN202010473617.0A 2020-05-29 2020-05-29 Data processing method, device and computer readable storage medium Pending CN113742387A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010473617.0A CN113742387A (en) 2020-05-29 2020-05-29 Data processing method, device and computer readable storage medium
PCT/CN2021/086644 WO2021238455A1 (en) 2020-05-29 2021-04-12 Data processing method and device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN113742387A true CN113742387A (en) 2021-12-03

Family

ID=78724518

Country Status (2)

Country Link
CN (1) CN113742387A (en)
WO (1) WO2021238455A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676774A (en) * 2022-03-25 2022-06-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN117476136A (en) * 2023-12-28 2024-01-30 山东松盛新材料有限公司 High-purity carboxylate synthesis process parameter optimization method and system

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114872290B (en) * 2022-05-20 2024-02-06 深圳市信润富联数字科技有限公司 Self-adaptive production abnormality monitoring method for injection molding part
CN115792479B (en) * 2023-02-08 2023-05-09 东营市建筑设计研究院 Intelligent power consumption monitoring method and system for intelligent socket
CN115858894B (en) * 2023-02-14 2023-05-16 温州众成科技有限公司 Visual big data analysis method
CN116029842B (en) * 2023-03-28 2023-06-20 北京环球医疗救援有限责任公司 Cleaning and denoising method and system for medical insurance big data
CN116383190B (en) * 2023-05-15 2023-08-25 青岛场外市场清算中心有限公司 Intelligent cleaning method and system for massive financial transaction big data
CN116331044B (en) * 2023-05-31 2023-08-04 山东芯演欣电子科技发展有限公司 Charging data storage system for direct-current charging pile
CN116994675B (en) * 2023-09-28 2023-12-01 佳木斯大学 Brocade based on near infrared data Lantern calyx epidermis detection method
CN117150233B (en) * 2023-10-30 2024-02-13 广东电网有限责任公司湛江供电局 Power grid abnormal data management method, system, equipment and medium
CN117196446B (en) * 2023-11-06 2024-01-19 北京中海通科技有限公司 Product risk real-time monitoring platform based on big data
CN117455127B (en) * 2023-12-26 2024-03-15 临沂市园林环卫保障服务中心 Plant carbon sink dynamic data monitoring system based on wisdom gardens

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP5538597B2 (en) * 2013-06-19 2014-07-02 株式会社日立製作所 Anomaly detection method and anomaly detection system
CN104636999B (en) * 2015-01-04 2018-06-22 江苏联宏智慧能源股份有限公司 A kind of abnormal energy data detection method of building
CN109882834B (en) * 2019-03-27 2020-03-03 新奥数能科技有限公司 Method and device for monitoring operation data of boiler equipment
CN110558971A (en) * 2019-08-02 2019-12-13 苏州星空大海医疗科技有限公司 Method for generating countermeasure network electrocardiogram abnormity detection based on single target and multiple targets
CN111061711B (en) * 2019-11-28 2023-09-01 同济大学 Big data stream unloading method and device based on data processing behavior

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN114676774A (en) * 2022-03-25 2022-06-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN117476136A (en) * 2023-12-28 2024-01-30 山东松盛新材料有限公司 High-purity carboxylate synthesis process parameter optimization method and system
CN117476136B (en) * 2023-12-28 2024-03-15 山东松盛新材料有限公司 High-purity carboxylate synthesis process parameter optimization method and system

Also Published As

Publication number Publication date
WO2021238455A1 (en) 2021-12-02

Similar Documents

Publication Publication Date Title
CN113742387A (en) Data processing method, device and computer readable storage medium
US20230012451A1 (en) Methods and apparatus to generate anomaly detection datasets
CN110444011B (en) Traffic flow peak identification method and device, electronic equipment and storage medium
JP6897749B2 (en) Learning methods, learning systems, and learning programs
CN113361645A (en) Target detection model construction method and system based on meta-learning and knowledge memory
CN110751191A (en) Image classification method and system
CN112766218A (en) Cross-domain pedestrian re-identification method and device based on asymmetric joint teaching network
CN110766075A (en) Tire area image comparison method and device, computer equipment and storage medium
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
CN113326177A (en) Index anomaly detection method, device, equipment and storage medium
CN114328942A (en) Relationship extraction method, apparatus, device, storage medium and computer program product
CN116451081A (en) Data drift detection method, device, terminal and storage medium
CN109739840A (en) Method, apparatus and terminal device for processing null values in data
CN111798237B (en) Abnormal transaction diagnosis method and system based on application log
CN113139673A (en) Method, device, terminal and storage medium for predicting air quality
CN112738724A (en) Method, device, equipment and medium for accurately identifying regional target crowd
CN110569277A (en) Method and system for automatically identifying and classifying configuration data information
CN116843368B (en) Marketing data processing method based on ARMA model
CN112699908A (en) Method for labeling picture, electronic terminal, computer readable storage medium and equipment
CN117113148B (en) Risk identification method, device and storage medium based on time sequence diagram neural network
CN110837805B (en) Method, device and equipment for measuring confidence of video tag and storage medium
CN116188834B (en) Full-slice image classification method and device based on self-adaptive training model
CN117058498B (en) Training method of segmentation map evaluation model, and segmentation map evaluation method and device
CN116091867B (en) Model training and image recognition method, device, equipment and storage medium
CN112968968B (en) Internet of things equipment flow fingerprint identification method and device based on unsupervised clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination