CN115758183A - Training method and device for log anomaly detection model - Google Patents
- Publication number
- CN115758183A (CN202211350703.8A)
- Authority
- CN
- China
- Prior art keywords
- log
- template
- log data
- anomaly detection
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The invention provides a training method and device for a log anomaly detection model. The training method comprises: acquiring multiple kinds of sample log data generated by a signal system; parsing the sample log data based on an event template to obtain a first template sequence; sorting the plurality of log templates based on the edit distances among them and the word vector similarity among the parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, a log anomaly detection model being obtained once training is complete. The method increases the dimensionality of the template sequence corresponding to the sample log data, thereby improving the training efficiency and classification accuracy of the clustering model.
Description
Technical Field
The invention relates to the technical field of data anomaly detection, in particular to a training method and a training device for a log anomaly detection model.
Background
As signal systems in the urban rail field continue to grow in scale and complexity, the volume of logs generated per unit time has increased sharply. Promptly monitoring, and raising early warnings on, the log data generated in abnormal operating scenarios can prevent signal system breakdowns and reduce economic losses.
In the related art, determining whether a signal system has failed by relying solely on manual analysis and professional experience is inefficient, which makes operation and maintenance costly. When a detection model is used to detect anomalies in the variously formatted logs generated by a signal system, the log data is unstructured: for example, time formats are inconsistent, and different manufacturers define different technical vocabulary and acronyms. This makes the logs harder to parse and weakens the discriminative features extracted by the detection model, so anomalous logs are detected with low accuracy.
Disclosure of Invention
The invention provides a training method and device for a log anomaly detection model, to overcome the defects of the prior art when processing massive volumes of log messages, namely that judging system faults from manual experience is inefficient and that detecting log anomalies with a detection model is inaccurate, and thereby to improve both the accuracy and the efficiency of anomalous log detection.
The invention provides a training method of a log anomaly detection model, which comprises the following steps:
acquiring various sample log data generated by a signal system;
analyzing the multiple sample log data based on an event template to obtain a first template sequence, wherein the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
sorting the plurality of log templates based on the edit distances among the plurality of log templates and the word vector similarity among the parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed sample log data entries, and the first keyword is obtained based on printing parameters in the multiple kinds of sample log data;
and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished.
According to the training method of the log anomaly detection model provided by the invention, the first counting matrix is input into the clustering model for hierarchical clustering training, and the log anomaly detection model is obtained after the training is finished, and the method comprises the following steps:
inputting the first counting matrix into a clustering model for clustering to obtain a plurality of sample clusters;
obtaining a target distance of any two sample clusters in the plurality of sample clusters based on a mean clustering formula, and fusing the any two sample clusters to obtain a fused cluster when the target distance is smaller than a first threshold value;
and when the target distance is greater than the first threshold value or the number of the fusion clusters is smaller than a second threshold value, obtaining the log anomaly detection model.
According to the training method of the log anomaly detection model provided by the invention, the first counting matrix is input to the clustering model for hierarchical clustering training, and the log anomaly detection model is obtained after the training is finished, and the training method further comprises the following steps:
carrying out normalization processing and vectorization processing on the counting matrix to obtain a counting vector;
and inputting the counting vector into a clustering model for hierarchical clustering training, and obtaining the log anomaly detection model after training is completed.
According to the training method of the log anomaly detection model provided by the invention, the step of determining the event template comprises the following steps:
acquiring a token object in the unstructured log data;
classifying the unstructured log data based on the token object and a second keyword in the unstructured log data to obtain the event template.
According to the training method of the log anomaly detection model provided by the invention, the mean value clustering formula comprises the following steps:
$$d(C_i, C_j) = \lVert \bar{C}_i - \bar{C}_j \rVert$$
wherein $C_i$ and $C_j$ are any two sample clusters of the plurality of sample clusters, $\bar{C}_i$ is the mean of $C_i$, and $\bar{C}_j$ is the mean of $C_j$.
The invention also provides a log anomaly detection method, which comprises the following steps:
acquiring log data to be detected;
analyzing the log data to be detected based on an event template to obtain a first template sequence, wherein the first template sequence comprises a plurality of log templates, each log template is used for storing analyzed log data to be detected corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the log data to be detected;
sorting the plurality of log templates based on the edit distances among the plurality of log templates and the word vector similarity among the parsed log data to be detected, obtaining a second template sequence from the sorted log templates, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a second counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed entries of the log data to be detected, and the first keyword is obtained based on printing parameters in the multiple kinds of sample log data;
clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
The invention provides a training device of a log anomaly detection model, which comprises:
the acquisition module is used for acquiring various sample log data generated by the signal system;
the analysis module is used for analyzing the multiple sample log data based on event templates to obtain a first template sequence, the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event templates are determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
the feature extraction module is used for sorting the plurality of log templates based on the edit distances among the plurality of log templates and the word vector similarity among the parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed sample log data entries, and the first keyword is obtained based on printing parameters in the multiple kinds of sample log data;
and the clustering module is used for inputting the first counting matrix into a clustering model for hierarchical clustering training and obtaining a log anomaly detection model after the training is finished.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the training method of the log anomaly detection model or the log anomaly detection method is realized.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of the log anomaly detection model or the log anomaly detection method as described in any one of the above.
The present invention also provides a computer program product comprising a computer program, which when executed by a processor implements the method for training a log anomaly detection model or the method for detecting log anomalies as described in any one of the above.
According to the training method and device for the log anomaly detection model, provided by the invention, the sample log data are analyzed according to the event template to obtain the first template sequence, the template similarity information and the statistical analysis information are added into the first template sequence, and the first keywords are utilized for statistical classification, so that the dimensionality of the first template sequence is increased, and the training efficiency and the classification accuracy of the clustering model are improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a training method of a log anomaly detection model provided by the present invention;
FIG. 2 is a schematic diagram of an interface for parsing sample log data based on an event template according to the present invention;
FIG. 3 is a schematic flow chart of a log anomaly detection method provided by the present invention;
FIG. 4 is a schematic structural diagram of a training apparatus for a log anomaly detection model provided in the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a training method of a log anomaly detection model provided by the present invention, and as shown in fig. 1, the training method of the log anomaly detection model includes: step 110, step 120, step 130 and step 140.
Step 110, acquiring multiple kinds of sample log data generated by a signal system.
In this step, the multiple kinds of sample log data may include log data generated during normal operation by different modules of the communication platform inside the signal system, as well as logs containing fault information.
In some embodiments, the collected sample log data may be collected and aggregated in the form of a text file for use as input data in subsequent processes to train the clustering model.
In the embodiment, the collected logs with various formats can be recorded in files or templates according to a uniform format, and after preprocessing steps such as sequencing, similarity analysis and the like are performed on the files or templates, sample data which can be directly used for training the clustering model is obtained.
In some embodiments, the corresponding relationship between the version information of different modules and the log data generated by the modules can be further added to the text file, so that an anomaly analysis report can be formed subsequently.
In this embodiment, version information of different modules may be added to the text file, where the version information may be represented by a complete version number, or may be a designated identifier (such as a number, a serial number, or other characters), so as to assist in determining log anomalies of other modules.
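Step 120, parsing the multiple kinds of sample log data based on an event template to obtain a first template sequence, wherein the first template sequence comprises a plurality of log templates, each log template is used for storing the parsed sample log data corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data.
It should be noted that the system kernel, the various application servers, and most application programs in the signal system may all output logs, which differ in content, scale, and purpose. This log data often contains much that is useful to users, who may analyze different types of log data to meet different functional requirements.
In this step, the target parameter may be the category information corresponding to the sample log data. For example, "Receive from node blk_0001" and "Send 120 bytes to blk_0002" belong to two different categories of log data, and the target parameter may be a keyword that occurs frequently in the log data, such as "Receive", "from", "Send", or "bytes to".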
In this embodiment, the sample log data is classified into the corresponding event template according to the target parameter corresponding to each sample log data and analyzed to obtain the corresponding log template.
In this step, a text entry of log data usually consists of a preset fixed template portion and a specific parameter portion. After a large amount of log data is obtained, event templates can be extracted from the unstructured log data through preprocessing operations such as word segmentation, and anomaly detection is then performed on the template sequence.
In this step, the first template sequence may include a plurality of log templates; each log template may represent log data in a different format, and one log template may represent one or more pieces of log data. In this embodiment, for the log data "Receive from node blk_0001" generated by the signal system, the fixed template portions are "Receive", "from", and "node"; such fixed template portions may be common or most frequently occurring word segments, while the specific parameter portion, such as blk_0001, is usually a pure number or a combination of numbers with letters or other symbols.
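Step 130, sorting the plurality of log templates based on the edit distances among the plurality of log templates and the word vector similarity among the parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed sample log data entries, and the first keyword is obtained based on printing parameters in the multiple kinds of sample log data.
In this step, a log template may first be selected at random, and the edit distance between that template and the other templates in the first template sequence, together with the word vector similarity within each event template, is calculated to sort the templates; after sorting, log templates with adjacent serial numbers generally have higher similarity.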
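The renumbering idea can be illustrated with a minimal sketch. It assumes a token-level Levenshtein distance and ranks every template by its distance from a chosen seed template. The word vector similarity term is omitted, and the function names and the `*` placeholder for variable fields are illustrative assumptions rather than part of the described method.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def renumber_templates(templates: list[str], seed_index: int = 0) -> dict[str, int]:
    """Assign serial numbers 1..n by increasing edit distance from the seed.

    Assumption: only edit distance is used; a fuller version would also
    blend in a word vector similarity score.
    """
    seed = templates[seed_index].split()
    ranked = sorted(templates, key=lambda t: edit_distance(seed, t.split()))
    return {template: number for number, template in enumerate(ranked, start=1)}


templates = ["Receive from node *", "Send * bytes to *", "Receive from server *"]
numbering = renumber_templates(templates)
```

In this toy input the two "Receive" templates differ by a single token, so they receive adjacent serial numbers, while the "Send" template is placed last.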
In this step, the printing parameter may be an Init, debug, or Warning parameter in the sample log data.
In this embodiment, assume that the first template sequence contains 10 log templates, initially numbered 1 to 10. The templates are renumbered according to the edit distances between them, so that two highly similar templates receive adjacent serial numbers. Beforehand, in the log collection stage, log sequences are taken with a fixed window size of 100, yielding a count matrix of size 100 × 10. A mathematical statistics information column and a first keyword information column are then added to the count matrix. The mathematical statistics information may comprise the serial number of the log template that stores the largest number of parsed log data entries or target parameters in the first template sequence (that is, the most frequently occurring log template) and the most frequently occurring log label class, where the label classes are encoded, for example, as Init (0), Debug (1), and Warning (2).
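A minimal sketch of how such a count matrix might be assembled is given below. It assumes that each row summarises one fixed-size window of parsed log events, and that each event is a (template serial number, label class) pair. The function names and the toy data are hypothetical, and only the Init/Debug/Warning encoding is taken from the text.

```python
from collections import Counter

# Label encoding taken from the embodiment: Init (0), Debug (1), Warning (2).
LEVEL_CODES = {"Init": 0, "Debug": 1, "Warning": 2}


def window_row(events: list[tuple[int, str]], num_templates: int) -> list[int]:
    """Build one row: per-template counts plus two statistics columns."""
    counts = [0] * num_templates
    for template_no, _level in events:
        counts[template_no - 1] += 1  # template serial numbers start at 1
    top_template = counts.index(max(counts)) + 1
    top_level = Counter(level for _, level in events).most_common(1)[0][0]
    return counts + [top_template, LEVEL_CODES[top_level]]


def count_matrix(events: list[tuple[int, str]], window_size: int,
                 num_templates: int) -> list[list[int]]:
    """Slice the event stream into non-overlapping windows, one row each."""
    return [
        window_row(events[start:start + window_size], num_templates)
        for start in range(0, len(events) - window_size + 1, window_size)
    ]


# Toy stream: 3 templates, window size 3 (the text uses 10 templates, size 100).
events = [(1, "Init"), (2, "Debug"), (2, "Debug"),
          (3, "Warning"), (3, "Warning"), (3, "Debug")]
matrix = count_matrix(events, window_size=3, num_templates=3)
```

Each row here has the template counts followed by the most frequent template number and the encoded most frequent label class; the keyword columns described next would be appended in the same way.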
In this embodiment, the first keyword information column may be obtained by calculating the TF-IDF (Term Frequency-Inverse Document Frequency) values of the first template sequence and taking the three words with the highest TF-IDF values, such as "extension a", "sending", and "central server". The Chinese words are then converted into numbers using the correspondence between words and serial numbers in the corpus. Finally, the serial number of the most frequent log template, the most frequent log label class, and the three keyword columns are added to the count matrix, giving an information matrix of size 100 × 15, that is, the first count matrix.
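The keyword columns can be sketched as follows. This is an illustrative assumption, not the patented procedure: it uses a smoothed TF-IDF variant (inverse document frequency computed as ln((1 + N) / (1 + df)) + 1), treats each window of tokens as one document, and maps words to numbers through a hypothetical `vocab_index` dictionary.

```python
import math
from collections import Counter


def top_keywords(doc: list[str], corpus: list[list[str]], k: int = 3) -> list[str]:
    """Return up to k tokens of `doc` with the highest TF-IDF score."""
    term_freq = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for term, freq in term_freq.items():
        doc_freq = sum(1 for other in corpus if term in other)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
        scores[term] = (freq / len(doc)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def keyword_columns(doc: list[str], corpus: list[list[str]],
                    vocab_index: dict[str, int], k: int = 3) -> list[int]:
    """Convert the top-k keywords into their serial numbers in the corpus."""
    return [vocab_index[word] for word in top_keywords(doc, corpus, k)]


corpus = [["send", "extension", "a", "send"],
          ["send", "central", "server"],
          ["receive", "central", "server"]]
vocabulary = sorted({token for doc in corpus for token in doc})
vocab_index = {token: number for number, token in enumerate(vocabulary)}
columns = keyword_columns(corpus[0], corpus, vocab_index)
```

The three resulting integers would be appended to the corresponding row of the count matrix, alongside the two statistics columns.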
Step 140, inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished.
In the step, after a first counting matrix after dimension expansion is obtained on an original counting matrix, the first counting matrix is used as the input of a clustering model to carry out hierarchical clustering training.
It should be noted that hierarchical clustering builds a multi-level clustering tree by calculating the similarity between data points of different classes. The hierarchy can be built bottom-up or top-down. The bottom-up agglomerative method treats each log vector sample as its own cluster, merges the two clusters with the smallest distance, and repeats until a termination condition is met. In contrast, the top-down divisive method treats all log vectors as a single cluster, splits apart the two clusters within it that are farthest from each other, and repeats the splitting until the number of clusters meets the termination condition.
In this embodiment, the termination condition may be that the distance of two clusters exceeds a first threshold or that the number of clusters exceeds a second threshold.
In this embodiment, both the first threshold and the second threshold may be set in a customized manner according to actual requirements.
In some embodiments, other clustering models or algorithms may also be used to cluster the first count matrix, resulting in a usable log anomaly detection model.
In this embodiment, the other clustering model or algorithm may be a K-means clustering algorithm, a Gaussian mixture clustering model, a density clustering algorithm, or the like.
According to the training method of the log anomaly detection model, provided by the invention, the sample log data is analyzed according to the event template to obtain the first template sequence, the template similarity information and the statistical analysis information are added into the first template sequence, and the first keywords are utilized for statistical classification, so that the dimensionality of the first template sequence is increased, and the training efficiency and the classification accuracy of the clustering model are improved.
In some embodiments, inputting the first count matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is completed, includes: inputting the first counting matrix into a clustering model for clustering to obtain a plurality of sample clusters; determining the target distance of any two sample clusters in the plurality of sample clusters based on a mean clustering formula, and fusing any two sample clusters to obtain a fused cluster when the target distance is smaller than a first threshold value; and when the target distance is greater than a first threshold value or the number of the fusion clusters is smaller than a second threshold value, obtaining a log anomaly detection model.
It should be noted that the hierarchical clustering mode of the clustering model may be bottom-up aggregation hierarchical clustering or top-down split hierarchical clustering.
In this embodiment, the target distance may refer to an average distance between any two sample clusters in the plurality of sample clusters, or may be a minimum distance or a maximum distance; the corresponding mean clustering formula is used for clustering based on the average distance of the two sample clusters, or clustering based on the minimum distance or the maximum distance of the two sample clusters.
Next, a description will be given by taking an example of clustering performed by the clustering model.
In this embodiment, the first count matrix is used as the input of agglomerative hierarchical clustering. Given the set of log data samples contained in the first count matrix, $D = \{x_1, x_2, \ldots, x_m\}$, a similarity measurement function $s$, and a target number of sample clusters $k$, clustering proceeds according to the following steps:
(1) Initializing each sample log data entry as its own sample cluster, $c_i = \{x_i\}$, $i = 1, 2, \ldots, m$;
(2) Calculating the distance $d(i, j)$ between every two sample log data entries, wherein $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, m$;
(3) Finding the two most similar sample clusters through the similarity measurement function $s$ and fusing them;
(4) Repeating steps (2) and (3), and stopping clustering once the number of fused clusters is less than or equal to $k$ (corresponding to the second threshold) or the target distance between any two sample clusters exceeds the first threshold; the clustering model at that point is taken as the usable log anomaly detection model.
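Steps (1) to (4) can be sketched in a few lines. The sketch assumes Euclidean distance between cluster means as the target distance, and the parameter names `k` and `max_distance` stand in for the second and first thresholds. It is a naive illustration, not an optimized or authoritative implementation.

```python
import math

Point = tuple[float, ...]


def centroid(cluster: list[Point]) -> list[float]:
    """Component-wise mean of all samples in a cluster."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]


def mean_distance(ci: list[Point], cj: list[Point]) -> float:
    """Assumed target distance: Euclidean distance between cluster means."""
    return math.dist(centroid(ci), centroid(cj))


def agglomerate(samples: list[Point], k: int, max_distance: float) -> list[list[Point]]:
    clusters = [[x] for x in samples]           # step (1): one cluster per sample
    while len(clusters) > max(k, 1):            # step (4): stop at k clusters
        best = None
        for i in range(len(clusters)):          # step (2): pairwise distances
            for j in range(i + 1, len(clusters)):
                d = mean_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:                    # step (4): first threshold exceeded
            break
        clusters[i] = clusters[i] + clusters[j]  # step (3): fuse the closest pair
        del clusters[j]
    return clusters


samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
two_clusters = agglomerate(samples, k=2, max_distance=5.0)
still_two = agglomerate(samples, k=1, max_distance=5.0)
```

With `k=1` the loop still stops at two clusters, because the remaining pair is farther apart than `max_distance`; this mirrors the two alternative stopping conditions in step (4).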
In some embodiments, after training of the clustering model is complete, the resulting log anomaly detection model may contain only two fused sample clusters, a normal sample cluster and an abnormal sample cluster, since the number of clusters decreases as fusion proceeds during training. According to the training method of the log anomaly detection model provided by the invention, the counting matrix after dimensionality expansion is input into the clustering model for hierarchical clustering training; the automatic clustering process based on the mean distance requires little computation and can address the problem of sensitivity to outliers.
In some embodiments, inputting the first count matrix into the clustering model for hierarchical clustering training, and obtaining the log anomaly detection model after the training is completed, further includes: carrying out normalization processing and vectorization processing on the counting matrix to obtain a counting vector; and inputting the counting vector into a clustering model to perform hierarchical clustering training to obtain a log anomaly detection model.
In this embodiment, normalizing the first count matrix reduces the influence that the most frequent event template in the first count matrix has on the less frequent event templates.
In this embodiment, the first count matrix may be vectorized by converting it into vector form, so that subsequent steps can measure the distance between the cluster center of the log vector corresponding to each sample log data entry and the cluster centers of the other log vectors.
According to the training method of the log anomaly detection model, the first counting matrix is subjected to normalization processing and vectorization processing respectively to obtain the counting vector capable of being used for distance measurement, interference of data noise in the counting matrix is reduced, and the accuracy of the clustering process is improved.
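The text does not specify which normalization is used. The sketch below assumes column-wise min-max scaling as one possible choice, and treats vectorization simply as turning each matrix row into a tuple that a distance function can consume. Both function names are hypothetical.

```python
def normalize_columns(matrix: list[list[float]]) -> list[list[float]]:
    """Min-max scale each column to [0, 1] (assumed normalization scheme)."""
    scaled_columns = []
    for column in zip(*matrix):
        low, high = min(column), max(column)
        span = high - low
        # A constant column carries no information; map it to zeros.
        scaled_columns.append([0.0 if span == 0 else (v - low) / span
                               for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def to_vectors(matrix: list[list[float]]) -> list[tuple[float, ...]]:
    """Represent every row of the matrix as one count vector."""
    return [tuple(row) for row in matrix]


# The first template dominates the raw counts; after scaling, both columns
# span the same range and contribute comparably to a distance measure.
raw = [[90, 1], [10, 3], [50, 2]]
vectors = to_vectors(normalize_columns(raw))
```

After scaling, a frequently occurring template no longer dominates the distance between two rows merely because its raw counts are larger.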
In some embodiments, the step of determining an event template comprises: acquiring a token object in unstructured log data; and classifying the unstructured log data based on the token object and the second keyword corresponding to the unstructured log data to obtain an event template.
In this embodiment, when unstructured log data is parsed, the number of tokens in each log entry is used as an important criterion for extracting templates. Variable tokens are first converted into a specific symbol, such as x or another identifier; the different formats of the sample log data are then distinguished by their token counts, that is, each sample log data entry is classified by its number of tokens; finally, the event template is obtained according to the second keyword of the sample log.
Fig. 2 is a schematic diagram of an interface for parsing sample log data based on an event template according to the present invention. In the embodiment shown in Fig. 2, consider two pieces of log data generated by a signal system: "Receive from node blk_0001" and "Send 120 bytes to blk_0002". The fixed template portions of the former are "Receive", "from", and "node"; such fixed template portions may be common or most frequently occurring word segments, while the specific parameter portion, such as blk_0001, is usually a pure number or a combination of numbers with letters or other symbols. The fixed template portions of the latter are "Send" and "bytes to", with the corresponding specific parameter portions "120" and "blk_0002". The event templates corresponding to the two logs can then be expressed as "Receive from node" and "Send bytes to", respectively.
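The split into a fixed template portion and a parameter portion can be illustrated with a minimal sketch. It rests on a simple heuristic that is an assumption here: any whitespace-separated token containing a digit is treated as a variable parameter and replaced by `*`. Log lines are then grouped by token count and masked template, and the function names are hypothetical.

```python
from collections import defaultdict


def to_template(line: str) -> str:
    """Mask variable tokens (assumed: any token containing a digit) with '*'."""
    return " ".join(
        "*" if any(ch.isdigit() for ch in token) else token
        for token in line.split()
    )


def group_by_template(lines: list[str]) -> dict[tuple[int, str], list[str]]:
    """Group log lines by (token count, masked template)."""
    groups = defaultdict(list)
    for line in lines:
        groups[(len(line.split()), to_template(line))].append(line)
    return dict(groups)


logs = [
    "Receive from node blk_0001",
    "Send 120 bytes to blk_0002",
    "Receive from node blk_0007",
]
groups = group_by_template(logs)
```

Here the two "Receive" lines fall into one group and the "Send" line into another, which corresponds to the two event templates of the Fig. 2 example.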
According to the training method of the log anomaly detection model, the event template is used to structure the multiple kinds of sample log data into the first template sequence, which can be used directly in the hierarchical clustering process of the clustering model; the format of the sample log data is thereby unified, and log parsing efficiency is improved.
In some embodiments, the mean clustering formula comprises:
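$$d(C_i, C_j) = \lVert \bar{C}_i - \bar{C}_j \rVert$$
wherein $C_i$ and $C_j$ are any two sample clusters of the plurality of sample clusters, $\bar{C}_i$ is the mean of $C_i$, and $\bar{C}_j$ is the mean of $C_j$, wherein:
$$\bar{C}_i = \frac{1}{|C_i|} \sum_{p \in C_i} p, \qquad \bar{C}_j = \frac{1}{|C_j|} \sum_{q \in C_j} q$$
wherein $|C_i|$ is the number of samples in sample cluster $C_i$, $|C_j|$ is the number of samples in sample cluster $C_j$, $p$ is any sample log data in $C_i$, and $q$ is any sample log data in $C_j$.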
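As a small worked illustration, assume that the mean clustering distance is the Euclidean distance between the two cluster means, and take the hypothetical clusters $C_i = \{(0,0), (0,2)\}$ and $C_j = \{(3,1)\}$:

```latex
\bar{C}_i = \tfrac{1}{2}\bigl[(0,0) + (0,2)\bigr] = (0,1), \qquad
\bar{C}_j = (3,1), \qquad
d(C_i, C_j) = \lVert (0,1) - (3,1) \rVert = \sqrt{3^2 + 0^2} = 3
```

With a first threshold of, say, 5, this pair would be fused; with a first threshold of 2, it would be left separate.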
In this embodiment, the target distance may be an average distance between any two of the plurality of sample clusters, and the average distance is compared with a first threshold for comparative analysis, and when the average distance is smaller than the first threshold, the two sample clusters are merged to obtain a fused cluster.
In this embodiment, after all sample clusters in the first counting matrix have been traversed in turn, it is determined whether the number of fused clusters has fallen below the second threshold or whether the average distance between the fused clusters exceeds the first threshold. If either condition holds, the clustering process stops and the log anomaly detection model is obtained; otherwise, the clustering model continues hierarchical clustering training until one of the threshold conditions is met.
According to the training method of the log anomaly detection model, the clustering model is subjected to hierarchical clustering training by setting the mean value clustering formula, so that the calculated amount in the clustering process can be reduced, and the clustering efficiency is improved.
Fig. 3 is a schematic flow diagram of a log anomaly detection method provided by the present invention, and as shown in fig. 3, the log anomaly detection method includes: step 310, step 320, step 330 and step 340.
Step 310, acquiring log data to be detected.
In this step, the log data to be detected are log data generated by different modules in the communication platform inside the signal system at run-time.
In some embodiments, the collected log data to be detected can be collected in a text file for use as input data when a subsequent process tests the log anomaly detection model.
In this embodiment, the log data to be detected may be recorded in a file or a template according to the same format as the sample log data, and the file or the template may be subjected to preprocessing such as sorting and similarity analysis to obtain input data of the log anomaly detection model.
In some embodiments, the corresponding relationship between the version information of different modules and the log data generated by the modules can be further added to the text file, so that an anomaly analysis report can be formed subsequently.
In this embodiment, version information of different modules may be added to the text file, where the version information may be represented by a complete version number, or may be a designated identifier (such as a number, a serial number, or other characters), so as to assist in determining log anomalies of other modules.
Step 320, parsing the log data to be detected based on the event template to obtain a first template sequence, wherein the first template sequence comprises a plurality of log templates, each log template is used for storing the parsed log data to be detected corresponding to the target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the log data to be detected.
It should be noted that, a system kernel, various application servers, and most application programs in the signal system may output logs, and the content, scale, and use of the logs are different, and these log data often include many contents useful for users, and users may analyze different types of log data to achieve different functional requirements.
In this step, the target parameter may be the category information corresponding to the log data to be detected. For example, "Receive from node blk_0001" and "Send 120 bytes to blk_0002" belong to two different categories of log data, and the target parameter may be a keyword with a high word frequency in the log data, such as "Receive", "from", "Send", or "bytes to".
In the embodiment, the log data to be detected are classified into the corresponding event templates according to the target parameters for analysis, so as to obtain the corresponding log templates.
In this step, a text entry of log data usually consists of a preset fixed template portion and a specific parameter portion. After a large amount of log data is obtained, event templates can be extracted from the unstructured log data through preprocessing operations such as word segmentation, and anomaly detection is then performed on the template sequence.
In this step, the first template sequence may include a plurality of log templates; each log template may represent log data in a different format, and one log template may represent one or more pieces of log data. In this embodiment, for the log entry "Received from node blk_0001" generated by the signal system, the fixed template parts are "Received", "from", and "node"; such fixed template parts may be common or most frequently occurring tokens. The specific parameter part, such as blk_0001, is usually a pure number or a combination of numbers with letters or other symbols.
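The split between a fixed template portion and a parameter portion can be illustrated with a minimal Python sketch. The rule used here (any token containing a digit is treated as a parameter and replaced with a `<*>` placeholder) and the names `to_template` and `build_template_sequence` are assumptions made for illustration; the embodiment does not prescribe a specific parsing rule.

```python
from collections import defaultdict


def to_template(line: str) -> str:
    """Replace parameter-like tokens with a <*> placeholder.

    Assumption: a token is a parameter if it contains at least one digit;
    all other tokens are treated as the fixed template portion.
    """
    tokens = line.split()
    return " ".join("<*>" if any(c.isdigit() for c in t) else t for t in tokens)


def build_template_sequence(lines):
    """Group raw log lines under the template each one reduces to."""
    groups = defaultdict(list)
    for line in lines:
        groups[to_template(line)].append(line)
    return dict(groups)


logs = [
    "Received from node blk_0001",
    "Received from node blk_0007",
    "Send 120 bytes to blk_0002",
]
first_template_sequence = build_template_sequence(logs)
for template, members in first_template_sequence.items():
    print(template, len(members))
```

In this sketch each key of the returned dictionary plays the role of one log template, and the list stored under it holds the log entries that the template represents.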
Step 330, sorting the plurality of log templates based on edit distances among the plurality of log templates and word vector similarity among the plurality of analyzed log data to be detected, numbering the sorted log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix, wherein the target log template is the log template that stores the largest number of analyzed log data to be detected, and the first keyword is obtained based on printing parameters in the plurality of sample log data.
In this step, a log template may first be selected at random, and the edit distances between it and the other templates in the first template sequence, together with the word vector similarity within each event template, are calculated to perform the sorting. In general, log templates with adjacent sequence numbers have higher similarity.
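The sorting described above can be sketched as follows. This is only an illustration under stated assumptions: the edit distance is computed over tokens, the word vector similarity is approximated by the cosine similarity of bag-of-words counts, and the two measures are combined by sorting on the edit distance first and on similarity second. None of these choices is specified by the embodiment, and the function names are hypothetical. A fixed reference index replaces the random selection so that the result is reproducible.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        cur = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def cosine_similarity(a, b):
    """Cosine similarity of bag-of-words counts (stand-in for word vectors)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def order_templates(templates, reference_index=0):
    """Sort templates by closeness to a reference template, then number them."""
    ref = templates[reference_index].split()

    def key(template):
        tokens = template.split()
        return (edit_distance(ref, tokens), -cosine_similarity(ref, tokens))

    return list(enumerate(sorted(templates, key=key)))


templates = [
    "Received from node <*>",
    "Send <*> bytes to <*>",
    "Received from server <*>",
]
second_template_sequence = order_templates(templates)
```

With this ordering, templates that differ from the reference by a single token receive sequence numbers next to it, which matches the observation that templates with adjacent numbers tend to be similar.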
In this step, the printing parameter may be an Init, Debug, or Warning parameter in the sample log data.
In this embodiment, when the data to be detected is a single log, the corresponding first template sequence includes one log template; when the data to be detected is a plurality of logs, the corresponding first template sequence includes a plurality of log templates. A statistical information column and a first keyword information column are added to the first template sequence corresponding to the data to be detected. The statistical information column may hold the sequence number of the log template that stores the largest number of analyzed log data or target parameters in the first template sequence, together with the most frequent log label type, where the label types are encoded as, for example, Init (0), Debug (1), and Warning (2).
In this embodiment, the first keyword information column may be obtained by calculating the TF-IDF (Term Frequency-Inverse Document Frequency) values of the first template sequence and taking the three words with the highest TF-IDF values, such as "extension a", "sending", and "central server". The Chinese words are then converted into numbers using the correspondence between words and sequence numbers in the corpus. The sequence number of the most frequent log template, the most frequent log label type column, and the three keyword columns are added to the counting matrix, finally yielding the information matrix with expanded dimensions, that is, the second counting matrix.
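A minimal sketch of the dimension expansion is given below. It assumes that each analyzed log entry is available as a template sequence number, a label, and a token list, that TF-IDF scores are summed per word across entries, and that a `vocab` dictionary maps words to corpus sequence numbers. The names `top_keywords`, `extend_count_row`, `LABEL_CODES`, and `vocab` are hypothetical; only the label codes Init (0), Debug (1), and Warning (2) come from the text.

```python
import math
from collections import Counter

# Label codes as given in the description.
LABEL_CODES = {"Init": 0, "Debug": 1, "Warning": 2}


def top_keywords(docs, k=3):
    """Return the k tokens with the highest summed TF-IDF score across docs."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    scores = Counter()
    for doc in docs:
        for tok, cnt in Counter(doc).items():
            scores[tok] += (cnt / len(doc)) * math.log(n / doc_freq[tok])
    # Sort by score descending, then alphabetically to break ties.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def extend_count_row(template_ids, labels, docs, vocab):
    """Build one count row and append the statistical and keyword columns."""
    counts = Counter(template_ids)
    row = [counts.get(i, 0) for i in range(max(template_ids) + 1)]
    most_common_template = counts.most_common(1)[0][0]
    most_common_label = Counter(labels).most_common(1)[0][0]
    keyword_ids = [vocab[w] for w in top_keywords(docs)]
    return row + [most_common_template, LABEL_CODES[most_common_label]] + keyword_ids


row = extend_count_row(
    template_ids=[0, 1, 1, 2],
    labels=["Init", "Debug", "Debug", "Warning"],
    docs=[["x", "x", "y"], ["y", "z"]],
    vocab={"x": 0, "y": 1, "z": 2},
)
```

The first columns of `row` are the per-template counts; the appended columns carry the most frequent template number, the most frequent label code, and the three keyword numbers.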
Step 340, clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
In this step, the log anomaly detection model is obtained in the same manner as the log anomaly detection model in step 140, which is not described again here.
In this step, the target clustering distance is the distance between the cluster centroid obtained by inputting the second counting matrix into the log anomaly detection model and the target cluster centroid.
In this step, the third threshold may be set in a customized manner according to actual requirements.
In this embodiment, the second counting matrix is input into the log anomaly detection model, and when the distance between the second counting matrix and the centroid of the target cluster is smaller than a third threshold, the log data to be detected corresponding to the second counting matrix is determined to be anomalous log data.
It should be noted that the sample log data includes log data generated during normal operation of the module and log data containing fault information. When the sample log data is used for hierarchical clustering of the clustering model, a first cluster centroid of the log data generated during normal operation and a second cluster centroid of the log data containing fault information can be obtained respectively; the first cluster centroid and the second cluster centroid form the target cluster centroid.
According to the log anomaly detection method provided by the invention, the log data to be detected is converted into the second counting matrix and identified by the log anomaly detection model, so that hidden fault information in the signal system can be located and provided to professional technicians for subsequent fault handling.
The following describes the training device of the log anomaly detection model provided by the present invention. The training device described below and the training method described above may be read with reference to each other.
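The threshold decision of step 340 can be sketched as below. The sketch assumes a Euclidean distance and assumes that the centroid compared against is the centroid of the cluster of fault-containing logs; the text only states that the target cluster centroid is formed from the first and second cluster centroids, so this reading is an assumption. The function names and the example values are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_anomalous(vector, fault_centroid, third_threshold):
    """Flag the sample when its distance to the assumed fault centroid
    is smaller than the third threshold, as stated in step 340."""
    return euclidean(vector, fault_centroid) < third_threshold


# Illustrative values only; real vectors come from the second counting matrix.
fault_centroid = [1.0, 2.0]
print(is_anomalous([1.0, 1.0], fault_centroid, third_threshold=1.5))
print(is_anomalous([10.0, 10.0], fault_centroid, third_threshold=1.5))
```

Because the third threshold is user-defined, tightening it reduces the number of log entries flagged, and loosening it increases that number.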
Fig. 4 is a schematic structural diagram of a training apparatus for a log anomaly detection model provided by the present invention, and as shown in fig. 4, the training apparatus for the log anomaly detection model includes: an acquisition module 410, a parsing module 420, a feature extraction module 430, and a clustering module 440.
the acquisition module 410 is configured to acquire multiple types of sample log data generated by a signal system;
the parsing module 420 is configured to analyze the multiple sample log data based on an event template to obtain a first template sequence, where the first template sequence includes multiple log templates, each log template is used to store analyzed sample log data corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used to indicate category information of the sample log data;
the feature extraction module 430 is configured to sort the multiple log templates based on edit distances between the multiple log templates and word vector similarities between the multiple analyzed sample log data to obtain a second template sequence, and to add a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix, where the target log template is the log template that stores the largest number of analyzed sample log data among the multiple log templates, and the first keyword is obtained based on printing parameters in the multiple sample log data;
and the clustering module 440 is configured to input the first counting matrix to a clustering model for hierarchical clustering training, and obtain a log anomaly detection model after training is completed.
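The hierarchical clustering performed by the clustering module 440 can be sketched in plain Python as follows. This is an illustration, not the claimed implementation: it assumes Euclidean distance and average linkage between clusters, merges the closest pair while their distance is below a first threshold, and stops once the number of clusters is no greater than a second threshold. The function names and the sample rows are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    return sum(euclidean(u, v) for u in c1 for v in c2) / (len(c1) * len(c2))


def agglomerate(rows, first_threshold, second_threshold):
    """Merge the closest clusters until no pair is closer than first_threshold
    or the cluster count has dropped to second_threshold."""
    clusters = [[r] for r in rows]
    while len(clusters) > max(second_threshold, 1):
        (i, j), dist = min(
            (
                ((i, j), average_linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist >= first_threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters = agglomerate(rows, first_threshold=3.0, second_threshold=1)
centroids = [centroid(c) for c in clusters]
```

The centroids computed at the end correspond to the cluster centroids that the detection method later compares new counting vectors against.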
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor 510, a communications interface 520, a memory 530, and a communication bus 540, wherein the processor 510, the communications interface 520, and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a method of training a log anomaly detection model, the method comprising: acquiring various sample log data generated by a signal system; analyzing the multiple sample log data based on the event template to obtain a first template sequence; sorting the plurality of log templates based on edit distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data, numbering the sorted log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after training is completed.
Or executing a log anomaly detection method, wherein the method comprises the following steps: acquiring log data to be detected; analyzing the log data to be detected based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed log data to be detected to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix; and clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
Furthermore, when the logic instructions in the memory 530 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing a method for training a log anomaly detection model provided by the above methods, the method comprising: acquiring various sample log data generated by a signal system; analyzing the log data of the multiple samples based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished. 
Or executing a log anomaly detection method, wherein the method comprises the following steps: acquiring log data to be detected; analyzing the log data to be detected based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the analyzed log data to be detected, numbering the sequenced plurality of log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix; clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of training a log anomaly detection model provided by the above methods, the method comprising: acquiring various sample log data generated by a signal system; analyzing the log data of various samples based on an event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after training is completed. Or executing a log anomaly detection method, wherein the method comprises the following steps: acquiring log data to be detected; analyzing the log data to be detected based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed log data to be detected, numbering the sequenced plurality of log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix; clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A training method of a log anomaly detection model is characterized by comprising the following steps:
acquiring various sample log data generated by a signal system;
analyzing the multiple sample log data based on an event template to obtain a first template sequence, wherein the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix, wherein the target log template is the log template which stores the analyzed sample log data in the plurality of log templates and has the largest number, and the first keyword is obtained based on printing parameters in the plurality of sample log data;
and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished.
2. The training method of the log anomaly detection model according to claim 1, wherein the inputting the first count matrix into a clustering model for hierarchical clustering training and obtaining the log anomaly detection model after the training is completed comprises:
inputting the first counting matrix into a clustering model for clustering to obtain a plurality of sample clusters;
obtaining a target distance between any two sample clusters in the plurality of sample clusters based on a mean clustering formula, and fusing any two sample clusters to obtain a fused cluster when the target distance is smaller than a first threshold value;
and when the target distance is greater than the first threshold value or the number of the fusion clusters is smaller than a second threshold value, obtaining the log anomaly detection model.
3. The method for training the log anomaly detection model according to claim 1, wherein the step of inputting the first counting matrix into the clustering model for hierarchical clustering training and obtaining the log anomaly detection model after the training is completed further comprises:
carrying out normalization processing and vectorization processing on the counting matrix to obtain a counting vector;
and inputting the counting vector into a clustering model for hierarchical clustering training, and obtaining the log anomaly detection model after training is finished.
4. The method for training a log anomaly detection model according to claim 1, wherein the step of determining the event template comprises:
acquiring a token object in the unstructured log data;
classifying the unstructured log data based on the token object and a second keyword in the unstructured log data to obtain the event template.
6. A log anomaly detection method is characterized by comprising the following steps:
acquiring log data to be detected;
analyzing the log data to be detected based on an event template to obtain a first template sequence, wherein the first template sequence comprises a plurality of log templates, each log template is used for storing the analyzed log data to be detected corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the log data to be detected;
sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the analyzed log data to be detected, numbering the sequenced log templates to obtain a second template sequence, adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix, wherein the target log template is the log template which stores the analyzed log data to be detected in the plurality of log templates and has the largest number, and the first keyword is obtained based on printing parameters in the log data of various samples;
clustering the second counting matrix based on the log anomaly detection model according to any one of claims 1 to 5 to obtain a target clustering distance, and determining the log data to be detected as anomalous log data when the target clustering distance is smaller than a third threshold value.
7. A training device for log anomaly detection models is characterized by comprising:
the acquisition module is used for acquiring various sample log data generated by the signal system;
the analysis module is used for analyzing the multiple sample log data based on event templates to obtain a first template sequence, the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event templates are determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
the feature extraction module is used for sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix, wherein the target log template is the log template which stores the analyzed sample log data in the plurality of log templates and has the largest number, and the first keyword is obtained based on printing parameters in the plurality of sample log data;
and the clustering module is used for inputting the first counting matrix into a clustering model for hierarchical clustering training and obtaining a log anomaly detection model after the training is finished.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a training method of a log anomaly detection model according to any one of claims 1 to 5 or a log anomaly detection method according to claim 6 when executing the program.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements a training method of a log anomaly detection model according to any one of claims 1 to 5 or the log anomaly detection method according to claim 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a training method for a log anomaly detection model according to any one of claims 1 to 5 or a log anomaly detection method according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211350703.8A CN115758183A (en) | 2022-10-31 | 2022-10-31 | Training method and device for log anomaly detection model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115758183A true CN115758183A (en) | 2023-03-07 |
Family
ID=85354695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211350703.8A Pending CN115758183A (en) | 2022-10-31 | 2022-10-31 | Training method and device for log anomaly detection model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115758183A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116910592A (en) * | 2023-09-13 | 2023-10-20 | 中移(苏州)软件技术有限公司 | Log detection method and device, electronic equipment and storage medium |
CN116910592B (en) * | 2023-09-13 | 2023-11-24 | 中移(苏州)软件技术有限公司 | Log detection method and device, electronic equipment and storage medium |
CN117827620A (en) * | 2024-03-05 | 2024-04-05 | 云账户技术(天津)有限公司 | Abnormality diagnosis method, training device, training equipment, and recording medium |
CN117827620B (en) * | 2024-03-05 | 2024-05-10 | 云账户技术(天津)有限公司 | Abnormality diagnosis method, training device, training equipment, and recording medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||