CN115758183A - Training method and device for log anomaly detection model - Google Patents

Info

Publication number
CN115758183A
Authority
CN
China
Prior art keywords
log
template
log data
anomaly detection
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211350703.8A
Other languages
Chinese (zh)
Inventor
李金壑
耿鹏
王紫成
张帅
徐硕
陈逸
王中林
何富君
李俊松
郭俊垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CRSC Urban Rail Transit Technology Co Ltd
Original Assignee
CRSC Urban Rail Transit Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CRSC Urban Rail Transit Technology Co Ltd
Priority to CN202211350703.8A
Publication of CN115758183A

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a training method and device for a log anomaly detection model. The training method comprises: acquiring multiple kinds of sample log data generated by a signal system; parsing the sample log data based on an event template to obtain a first template sequence comprising a plurality of log templates; sorting the plurality of log templates based on the edit distances among the log templates and the word vector similarity among the parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, the log anomaly detection model being obtained once training is complete. The method increases the dimensionality of the template sequence corresponding to the sample log data, thereby improving the training efficiency of the clustering model and the accuracy of classification.

Description

Training method and device for log anomaly detection model
Technical Field
The invention relates to the technical field of data anomaly detection, in particular to a training method and a training device for a log anomaly detection model.
Background
As signal systems in the urban rail field continue to grow in scale and complexity, the volume of logs generated per unit time has increased sharply. Timely monitoring of, and early warning on, log data generated in abnormal operating scenarios can prevent a signal system breakdown and reduce economic loss.
In the related art, determining whether a signal system has failed by relying solely on manual analysis and professional experience is inefficient, which makes operation and maintenance costly. When a detection model is used to perform anomaly detection on the variously formatted logs generated by the signal system, the log data is unstructured: time formats are inconsistent, and different manufacturers define their own technical vocabulary or acronyms. This makes log parsing more difficult and weakens the discriminative power of the features the detection model extracts, so the accuracy of anomalous-log detection is low.
Disclosure of Invention
The invention provides a training method and a training device for a log anomaly detection model, intended to overcome two shortcomings of the prior art when massive volumes of log messages are processed, namely the low efficiency of judging system faults by manual experience and the low accuracy of detecting log anomalies with a detection model, and thereby to improve both the accuracy and the efficiency of anomalous-log detection.
The invention provides a training method of a log anomaly detection model, which comprises the following steps:
acquiring various sample log data generated by a signal system;
analyzing the multiple sample log data based on an event template to obtain a first template sequence, wherein the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
sorting the plurality of log templates based on edit distances among the plurality of log templates and word vector similarity among the plurality of parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed sample log data, and the first keyword is obtained based on printing parameters in the multiple sample log data;
and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished.
According to the training method of the log anomaly detection model provided by the invention, the first counting matrix is input into the clustering model for hierarchical clustering training, and the log anomaly detection model is obtained after the training is finished, and the method comprises the following steps:
inputting the first counting matrix into a clustering model for clustering to obtain a plurality of sample clusters;
obtaining a target distance of any two sample clusters in the plurality of sample clusters based on a mean clustering formula, and fusing the any two sample clusters to obtain a fused cluster when the target distance is smaller than a first threshold value;
and when the target distance is greater than the first threshold value or the number of the fusion clusters is smaller than a second threshold value, obtaining the log anomaly detection model.
According to the training method of the log anomaly detection model provided by the invention, the first counting matrix is input to the clustering model for hierarchical clustering training, and the log anomaly detection model is obtained after the training is finished, and the training method further comprises the following steps:
carrying out normalization processing and vectorization processing on the counting matrix to obtain a counting vector;
and inputting the counting vector into a clustering model for hierarchical clustering training, and obtaining the log anomaly detection model after training is completed.
According to the training method of the log anomaly detection model provided by the invention, the step of determining the event template comprises the following steps:
acquiring a token object in the unstructured log data;
classifying the unstructured log data based on the token object and a second keyword in the unstructured log data to obtain the event template.
According to the training method of the log anomaly detection model provided by the invention, the mean clustering formula comprises:
[Formula image BDA0003918740800000031]
where C_i and C_j are any two sample clusters of the plurality of sample clusters, [symbol image BDA0003918740800000032] is the mean of C_i, and [symbol image BDA0003918740800000033] is the mean of C_j.
The invention also provides a log anomaly detection method, which comprises the following steps:
acquiring log data to be detected;
analyzing the log data to be detected based on an event template to obtain a first template sequence, wherein the first template sequence comprises a plurality of log templates, each log template is used for storing analyzed log data to be detected corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the log data to be detected;
sorting the plurality of log templates based on edit distances among the plurality of log templates and word vector similarity among the parsed log data to be detected to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a second counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed log data to be detected, and the first keyword is obtained based on printing parameters in the multiple sample log data;
clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
The invention provides a training device of a log anomaly detection model, which comprises:
the acquisition module is used for acquiring various sample log data generated by the signal system;
the analysis module is used for analyzing the multiple sample log data based on event templates to obtain a first template sequence, the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event templates are determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
the feature extraction module is used for sorting the plurality of log templates based on edit distances among the plurality of log templates and word vector similarity among the plurality of parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed sample log data, and the first keyword is obtained based on printing parameters in the multiple sample log data;
and the clustering module is used for inputting the first counting matrix into a clustering model for hierarchical clustering training and obtaining a log anomaly detection model after the training is finished.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the training method of the log anomaly detection model or the log anomaly detection method is realized.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for training a log anomaly detection model or the method for detecting log anomalies as described in any one of the above.
The present invention also provides a computer program product comprising a computer program, which when executed by a processor implements the method for training a log anomaly detection model or the method for detecting log anomalies as described in any one of the above.
According to the training method and device for the log anomaly detection model, provided by the invention, the sample log data are analyzed according to the event template to obtain the first template sequence, the template similarity information and the statistical analysis information are added into the first template sequence, and the first keywords are utilized for statistical classification, so that the dimensionality of the first template sequence is increased, and the training efficiency and the classification accuracy of the clustering model are improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a training method of a log anomaly detection model provided by the present invention;
FIG. 2 is a schematic diagram of an interface for parsing sample log data based on an event template according to the present invention;
FIG. 3 is a schematic flow chart of a log anomaly detection method provided by the present invention;
FIG. 4 is a schematic structural diagram of a training apparatus for a log anomaly detection model provided in the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a training method of a log anomaly detection model provided by the present invention, and as shown in fig. 1, the training method of the log anomaly detection model includes: step 110, step 120, step 130 and step 140.
Step 110, obtaining a plurality of sample log data generated by the signal system.
In this step, the various sample log data may be log data generated by different modules in the communication platform inside the signal system in normal operation, as well as a log containing fault information.
In some embodiments, the collected sample log data may be collected and aggregated in the form of a text file for use as input data in subsequent processes to train the clustering model.
In the embodiment, the collected logs with various formats can be recorded in files or templates according to a uniform format, and after preprocessing steps such as sequencing, similarity analysis and the like are performed on the files or templates, sample data which can be directly used for training the clustering model is obtained.
In some embodiments, the corresponding relationship between the version information of different modules and the log data generated by the modules can be further added to the text file, so that an anomaly analysis report can be formed subsequently.
In this embodiment, version information of different modules may be added to the text file, where the version information may be represented by a complete version number, or may be a designated identifier (such as a number, a serial number, or other characters), so as to assist in determining log anomalies of other modules.
Step 120, analyzing the multiple sample log data based on the event template to obtain a first template sequence, where the first template sequence includes multiple log templates, each log template is used to store the analyzed sample log data corresponding to the target parameter, the event template is determined based on the unstructured log data, and the target parameter is used to represent the category information of the sample log data.
It should be noted that, a system kernel, various application servers, and most application programs in the signal system may output logs, and the content, scale, and use of the logs are also different, and these log data often include many contents useful for users, and users may analyze different types of log data to achieve different functional requirements.
In this step, the target parameter may be category information corresponding to the sample log data. For example, Receive from node blk_0001 and Send 120 bytes to blk_0002 belong to two different categories of log data, and the target parameter may be a keyword that occurs frequently in the log data, such as Receive, from, Send, and bytes to.
In this embodiment, the sample log data is classified into the corresponding event template according to the target parameter corresponding to each sample log data and analyzed to obtain the corresponding log template.
In this step, the text entry of the log data usually consists of a preset fixed template portion and a specific parameter portion, and after a large amount of log data is obtained, an event template can be obtained from unstructured log data through preprocessing operations such as a word segmentation technology, and then the template sequence is subjected to anomaly detection.
In this step, the first template sequence may include a plurality of log templates; each log template may represent log data in a different format, and one log template may represent one or more pieces of log data. In this embodiment, for the log data Receive from node blk_0001 generated by the signal system, the fixed template parts are "Receive", "from", and "node"; such fixed parts may be common or the most frequently occurring word segments, whereas the specific parameter part, such as blk_0001, is usually a pure number or a combination of numbers with letters or other symbols.
Step 130, sorting the plurality of log templates based on edit distances among the plurality of log templates and word vector similarity among the plurality of parsed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template to the second template sequence to obtain a first counting matrix; the target log template is the log template, among the plurality of log templates, that stores the largest number of parsed sample log data, and the first keyword is obtained based on printing parameters in the multiple sample log data.
In this step, a log template may be randomly selected first, and the edit distance between the log template and other templates in the first template sequence and the similarity of word vectors in each event template are calculated to perform sorting, where generally, the log templates with similar sequence numbers have higher similarity.
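The ordering idea described above can be sketched as follows. This is a minimal illustration and not the procedure claimed in the patent: it uses only token-level edit distance (the word vector similarity term is omitted), and the greedy nearest-neighbour ordering, the function names, and the `<*>` wildcard are all assumptions introduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def order_templates(templates, start=0):
    """Greedy ordering (an assumed strategy): always append the template
    closest, by token-level edit distance, to the one placed last."""
    tokens = [t.split() for t in templates]
    order = [start]
    remaining = set(range(len(templates))) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining,
                  key=lambda k: (edit_distance(tokens[last], tokens[k]), k))
        order.append(nxt)
        remaining.remove(nxt)
    return [templates[k] for k in order]


# Hypothetical templates; "<*>" marks a variable parameter.
templates = [
    "Receive from node <*>",
    "Send <*> bytes to <*>",
    "Receive from host <*>",
]
ordered = order_templates(templates)
for number, template in enumerate(ordered, start=1):
    print(number, template)
```

With this ordering the two "Receive" templates end up with adjacent serial numbers, which is the property the paragraph above describes.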
In this step, the printing parameter may be the Init, Debug, or Warning parameter in the sample log data.
In this embodiment, assume the first template sequence contains 10 log templates, initially numbered 1 to 10. The templates are renumbered by calculating the edit distance between them, so that two highly similar templates receive adjacent serial numbers. Before this, in the log collection stage, log sequences are taken with a fixed window size of 100, giving a count matrix of size 100 × 10. A mathematical statistics information column and first keyword information columns are then added to the count matrix. The mathematical statistics information may comprise the serial number of the log template that stores the most parsed log data (or has the most target parameters) in the first template sequence, together with a column for the most frequently occurring log label class, where the label classes are encoded as Init (0), Debug (1), and Warning (2).
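A small sketch of how such a count matrix with statistics columns might be assembled is given below. It assumes that each fixed-size window of the log sequence yields one row and that ties are broken toward the smaller index; neither detail is stated in the text, and the function names and toy data are hypothetical.

```python
from collections import Counter


def count_matrix(template_ids, n_templates, window):
    """One row per fixed-size window; column t counts occurrences of template t."""
    rows = []
    for start in range(0, len(template_ids) - window + 1, window):
        counts = Counter(template_ids[start:start + window])
        rows.append([counts.get(t, 0) for t in range(n_templates)])
    return rows


def add_statistics(rows, level_ids, window):
    """Append two columns to each row: the index of the most frequent
    template and the code of the most frequent log level in that window."""
    extended = []
    for r, row in enumerate(rows):
        top_template = max(range(len(row)), key=lambda t: (row[t], -t))
        levels = Counter(level_ids[r * window:(r + 1) * window])
        top_level = max(levels, key=lambda lv: (levels[lv], -lv))
        extended.append(row + [top_template, top_level])
    return extended


# Toy data: 9 log lines, 3 templates, window of 3 (the text uses 100 and 10).
template_ids = [0, 1, 1, 2, 0, 0, 2, 2, 2]
level_ids = [0, 0, 1, 2, 2, 2, 1, 1, 0]  # Init = 0, Debug = 1, Warning = 2

base = count_matrix(template_ids, n_templates=3, window=3)
extended = add_statistics(base, level_ids, window=3)
for row in extended:
    print(row)
```

Each row grows from 3 to 5 columns here; with 10 templates and three additional keyword columns, the same construction would give the 15 columns mentioned in the description.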
In this embodiment, the first keyword information columns may be obtained by calculating the TF-IDF (Term Frequency-Inverse Document Frequency) values of the first template sequence and taking the three words with the highest TF-IDF values, such as "extension a", "sending", and "central server". The Chinese words are then converted into numbers using the correspondence between words and serial numbers in the corpus. Finally, the most frequent log template serial number, the most frequent log label class, and the three keyword columns are added to the count matrix, giving an information matrix of size 100 × 15, that is, the first count matrix.
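The keyword step can be illustrated with a plain TF-IDF computation. The sketch below is an assumption-laden example: it uses whitespace tokenisation on English text, the unsmoothed formula tf × log(N / df), alphabetical tie-breaking, and a vocabulary that assigns the next free integer to unseen words. The patent does not specify any of these choices, and the sample sentences are invented.

```python
import math
from collections import Counter


def top_keywords(documents, k=3):
    """Return, for each non-empty document, the k tokens with the highest TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {
            tok: (cnt / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in tf.items()
        }
        # Highest score first; ties resolved alphabetically for determinism.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        result.append(ranked[:k])
    return result


def encode(keywords, vocabulary):
    """Map keywords to integer ids; unseen words receive the next free id."""
    return [vocabulary.setdefault(word, len(vocabulary)) for word in keywords]


docs = [
    "send data to central server",
    "receive data from extension",
    "send heartbeat to central server",
]
keywords = top_keywords(docs)
vocabulary = {}
keyword_columns = [encode(words, vocabulary) for words in keywords]
print(keywords)
print(keyword_columns)
```

The three integer ids per row play the role of the three keyword columns appended to the count matrix.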
Step 140, inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished.
In the step, after a first counting matrix after dimension expansion is obtained on an original counting matrix, the first counting matrix is used as the input of a clustering model to carry out hierarchical clustering training.
It should be noted that hierarchical clustering builds a multi-level clustering tree by calculating the similarity between data points of different classes. It can proceed bottom-up or top-down. The bottom-up agglomerative method starts with each log vector sample as its own cluster, repeatedly merges the two clusters with the smallest distance, and continues until a termination condition is met. In contrast, the top-down divisive method starts with all log vectors as a single cluster, finds the two most distant clusters within it and splits them apart, and repeats the splitting until the number of clusters meets the termination condition.
In this embodiment, the termination condition may be that the distance of two clusters exceeds a first threshold or that the number of clusters exceeds a second threshold.
In this embodiment, both the first threshold and the second threshold may be set in a customized manner according to actual requirements.
In some embodiments, other clustering models or algorithms may also be used to cluster the first count matrix, yielding a usable log anomaly detection model.
In this embodiment, the other clustering model or algorithm may be a K-means clustering algorithm, a Gaussian mixture clustering model, a density clustering algorithm, or the like.
According to the training method of the log anomaly detection model, provided by the invention, the sample log data is analyzed according to the event template to obtain the first template sequence, the template similarity information and the statistical analysis information are added into the first template sequence, and the first keywords are utilized for statistical classification, so that the dimensionality of the first template sequence is increased, and the training efficiency and the classification accuracy of the clustering model are improved.
In some embodiments, inputting the first count matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is completed, includes: inputting the first counting matrix into a clustering model for clustering to obtain a plurality of sample clusters; determining the target distance of any two sample clusters in the plurality of sample clusters based on a mean clustering formula, and fusing any two sample clusters to obtain a fused cluster when the target distance is smaller than a first threshold value; and when the target distance is greater than a first threshold value or the number of the fusion clusters is smaller than a second threshold value, obtaining a log anomaly detection model.
It should be noted that the hierarchical clustering mode of the clustering model may be bottom-up aggregation hierarchical clustering or top-down split hierarchical clustering.
In this embodiment, the target distance may refer to an average distance between any two sample clusters in the plurality of sample clusters, or may be a minimum distance or a maximum distance; the corresponding mean clustering formula is used for clustering based on the average distance of the two sample clusters, or clustering based on the minimum distance or the maximum distance of the two sample clusters.
Next, a description will be given by taking an example of clustering performed by the clustering model.
In this embodiment, the first count matrix is used as the input of agglomerative hierarchical clustering. Given the set of log data samples contained in the first count matrix, D = {x_1, x_2, ..., x_m}, a similarity measurement function s, and a target number of sample clusters k, clustering proceeds according to the following steps:
(1) Initialize each sample log data as a sample cluster: c_i = x_i, i = 1, 2, ..., m;
(2) Calculate the distance d(i, j) between every two sample log data, where i = 1, 2, ..., m and j = 1, 2, ..., m;
(3) Find the two most similar sample clusters by means of the similarity measurement function s and fuse them;
(4) Repeat until the number of fused clusters is less than or equal to k (corresponding to the second threshold) or the target distance between any two sample clusters exceeds the first threshold; then stop clustering and take the clustering model as a usable log anomaly detection model.
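Steps (1) to (4) can be sketched in a few lines of Python. This is an illustrative reading rather than the patented implementation: it assumes Euclidean distance between cluster means as the similarity measure, recomputes all pairwise distances on every pass for clarity instead of speed, and uses hypothetical names (`agglomerate`, `max_distance`) for the procedure and the first threshold.

```python
import math


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return [sum(v[d] for v in cluster) / n for d in range(len(cluster[0]))]


def mean_distance(ci, cj):
    """Euclidean distance between the two cluster means (assumed measure)."""
    return math.dist(centroid(ci), centroid(cj))


def agglomerate(samples, k, max_distance):
    """Merge the two closest clusters until at most k clusters remain or the
    closest pair is farther apart than max_distance."""
    clusters = [[list(x)] for x in samples]               # step (1)
    while len(clusters) > k:                              # step (4), count test
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean_distance(clusters[i], clusters[j])  # step (2)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:                              # step (4), distance test
            break
        clusters[i] = clusters[i] + clusters[j]           # step (3)
        del clusters[j]
    return clusters


samples = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
clusters = agglomerate(samples, k=2, max_distance=5.0)
print([len(c) for c in clusters])
```

With a small distance threshold the isolated point (50, 50) stays in its own cluster, which is the kind of separation an anomaly detector would rely on; raising the threshold lets the count condition decide when merging stops.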
In some embodiments, after training of the clustering model is completed, the resulting log anomaly detection model may contain only two fused sample clusters, namely a normal sample cluster and an abnormal sample cluster; the number of fused clusters decreases during training. According to the training method of the log anomaly detection model provided by the invention, the count matrix after dimension expansion is input into the clustering model for hierarchical clustering training; automatic clustering based on the mean distance requires little computation and can address the problem of sensitivity to outliers.
In some embodiments, inputting the first count matrix into the clustering model for hierarchical clustering training, and obtaining the log anomaly detection model after the training is completed, further includes: carrying out normalization processing and vectorization processing on the counting matrix to obtain a counting vector; and inputting the counting vector into a clustering model to perform hierarchical clustering training to obtain a log anomaly detection model.
In this embodiment, normalizing the first count matrix reduces the influence that the most numerous event template in the first count matrix has on the less numerous event templates.
In this embodiment, vectorization processing may be performed on the first count matrix by converting the first count matrix into a vector form for representing, so that a subsequent process measures distances between a cluster center of a log vector corresponding to each sample log data in the first count vector and cluster centers of other log vectors.
According to the training method of the log anomaly detection model, the first counting matrix is subjected to normalization processing and vectorization processing respectively to obtain the counting vector capable of being used for distance measurement, interference of data noise in the counting matrix is reduced, and the accuracy of the clustering process is improved.
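As a minimal sketch of the normalization and vectorization step, the example below applies column-wise min-max scaling and then treats each row as one vector. The patent does not name a particular normalization scheme, so min-max scaling, the handling of constant columns, and the function name are assumptions made for illustration.

```python
def normalize_columns(matrix):
    """Min-max scale each column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*matrix))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in matrix
    ]


# Toy count matrix: the second column dominates in raw magnitude.
raw = [
    [1, 200, 5],
    [3, 100, 5],
    [2, 400, 5],
]
normalized = normalize_columns(raw)
vectors = [tuple(row) for row in normalized]  # one count vector per row
for vector in vectors:
    print(vector)
```

After scaling, the frequently occurring template in the second column no longer dominates a Euclidean distance computed between rows.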
In some embodiments, the step of determining an event template comprises: acquiring a token object in unstructured log data; and classifying the unstructured log data based on the token object and the second keyword corresponding to the unstructured log data to obtain an event template.
In this embodiment, when the unstructured log data is parsed, the number of tokens in each piece of log data is used as an important criterion for extracting the event template. Variable tokens are first converted into a specific symbol, such as x or another identifier; the different formats of the sample log data are then distinguished by token count, that is, each piece of sample log data is classified by its number of tokens; finally, the event template is obtained according to the second keyword of the sample log.
Fig. 2 is a schematic diagram of an interface for parsing sample log data based on an event template according to the present invention. In the embodiment shown in Fig. 2, the signal system generates two pieces of log data: Receive from node blk_0001 and Send 120 bytes to blk_0002. The fixed template parts of the former are "Receive", "from", and "node"; such fixed parts may be common or the most frequent word segments, while the specific parameter part, such as blk_0001, is usually a pure number or a combination of numbers with letters or other symbols. The fixed template parts of the latter are "Send" and "bytes to", with the corresponding specific parameter parts "120" and "blk_0002". The event templates for the two logs can then be expressed as Receive from node and Send bytes to, respectively.
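The separation of fixed template parts from parameter parts can be sketched as below. The rule that any token containing a digit is a variable parameter, the `<*>` wildcard, and the function names are assumptions chosen for this example; the patent only states that variable tokens are replaced with a specific symbol and that token count is used to distinguish formats.

```python
import re
from collections import defaultdict

# Assumption: a token containing at least one digit is a variable parameter.
_HAS_DIGIT = re.compile(r"\d")


def to_template(line, wildcard="<*>"):
    """Replace variable-looking tokens with a wildcard; keep the fixed words."""
    return " ".join(
        wildcard if _HAS_DIGIT.search(tok) else tok for tok in line.split()
    )


def group_by_template(lines):
    """Group raw log lines by (token count, template)."""
    groups = defaultdict(list)
    for line in lines:
        template = to_template(line)
        groups[(len(template.split()), template)].append(line)
    return dict(groups)


logs = [
    "Receive from node blk_0001",
    "Send 120 bytes to blk_0002",
    "Receive from node blk_0007",
]
groups = group_by_template(logs)
for (n_tokens, template), members in groups.items():
    print(n_tokens, template, len(members))
```

Dropping the wildcards from the two resulting templates leaves the fixed parts Receive from node and Send bytes to, matching the event templates named in the text.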
According to the training method of the log anomaly detection model provided by the invention, the event template is used to structure the multiple sample log data into the first template sequence, which can be used directly in the hierarchical clustering process of the clustering model; the format of the sample log data is unified, and log parsing efficiency is improved.
In some embodiments, the mean clustering formula comprises:
[Formula image BDA0003918740800000111]
where C_i and C_j are any two sample clusters of the plurality of sample clusters, [symbol image BDA0003918740800000112] is the mean of C_i, and [symbol image BDA0003918740800000113] is the mean of C_j, wherein:
[Formula image BDA0003918740800000114]
where |C_i| is the number of samples in sample cluster C_i, |C_j| is the number of samples in sample cluster C_j, p is any sample log data in C_i, and q is any sample log data in C_j.
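The formula images themselves are not reproduced in this text. One standard formulation that is consistent with the symbol definitions above is the distance between cluster means, shown here as an assumption rather than as the patent's exact expression; the notation d_mean and the overbar are introduced for illustration only.

```latex
d_{\mathrm{mean}}(C_i, C_j) = \left\lVert \bar{C}_i - \bar{C}_j \right\rVert,
\qquad
\bar{C}_i = \frac{1}{|C_i|} \sum_{p \in C_i} p,
\qquad
\bar{C}_j = \frac{1}{|C_j|} \sum_{q \in C_j} q
```

Under this reading, two sample clusters are compared through a single pair of mean vectors, so the cost of one comparison does not grow with the product of the cluster sizes.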
In this embodiment, the target distance may be the average distance between any two of the plurality of sample clusters. The average distance is compared with the first threshold, and when it is smaller than the first threshold, the two sample clusters are merged to obtain a fused cluster.
In this embodiment, after sequentially traversing all sample clusters in the first counting matrix, it is determined whether the number of fused clusters exceeds a second threshold or whether the average distance between the fused clusters exceeds a first threshold, if so, the clustering process is stopped to obtain a log anomaly detection model, and if the threshold determination condition is not met, the clustering model continues hierarchical clustering training until any threshold determination condition is met.
According to the training method of the log anomaly detection model, the clustering model is subjected to hierarchical clustering training by setting the mean value clustering formula, so that the calculated amount in the clustering process can be reduced, and the clustering efficiency is improved.
Fig. 3 is a schematic flow diagram of the log anomaly detection method provided by the present invention. As shown in Fig. 3, the log anomaly detection method includes step 310, step 320, step 330 and step 340.
Step 310: acquiring log data to be detected.
In this step, the log data to be detected are log data generated by different modules in the communication platform inside the signal system at run-time.
In some embodiments, the collected log data to be detected can be collected in a text file for use as input data when a subsequent process tests the log anomaly detection model.
In this embodiment, the log data to be detected may be recorded in a file or a template according to the same format as the sample log data, and the file or the template may be subjected to preprocessing such as sorting and similarity analysis to obtain input data of the log anomaly detection model.
In some embodiments, the corresponding relationship between the version information of different modules and the log data generated by the modules can be further added to the text file, so that an anomaly analysis report can be formed subsequently.
In this embodiment, version information of different modules may be added to the text file, where the version information may be represented by a complete version number, or may be a designated identifier (such as a number, a serial number, or other characters), so as to assist in determining log anomalies of other modules.
Step 320: analyzing the log data to be detected based on the event template to obtain a first template sequence, wherein the first template sequence comprises a plurality of log templates, each log template is used for storing the analyzed log data to be detected corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the log data to be detected.
It should be noted that the system kernel, the various application servers, and most application programs in the signal system may output logs whose content, scale, and purpose differ. These log data often contain much information useful to users, who may analyze different types of log data to meet different functional requirements.
In this step, the target parameter may be category information corresponding to the log data to be detected. For example, "Received from node blk_0001" and "Send 120 bytes to blk_0002" belong to two different categories of log data, and the target parameter may be a keyword with a high word frequency in the log data, such as "received", "from", "send", or "bytes to".
In this embodiment, the log data to be detected are classified into the corresponding event templates according to the target parameters and analyzed, so as to obtain the corresponding log templates.
In this step, a text entry of the log data usually consists of a preset fixed template part and a specific parameter part. After a large amount of log data is obtained, an event template can be derived from the unstructured log data through preprocessing operations such as word segmentation, and anomaly detection is then performed on the template sequence.
In this step, the first template sequence may include a plurality of log templates. Each log template may represent log data in a different format, and one log template may represent one or more pieces of log data. In this embodiment, for the log data "Received from node blk_0001" generated by the signal system, the fixed template parts are "Received", "from", and "node"; such fixed template parts may be common or most frequently occurring word segments, while the specific parameter part, such as blk_0001, is usually a pure number or a combination of numbers with letters or other symbols.
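The split into a fixed template part and a parameter part can be illustrated with a short sketch. It assumes, purely for illustration, that any whitespace-separated token containing a digit is a parameter, and it uses a hypothetical "<*>" placeholder; neither the rule nor the placeholder comes from the source.

```python
import re
from collections import defaultdict

HAS_DIGIT = re.compile(r"\d")  # assumed rule: tokens with digits are parameters

def to_template(line):
    """Return (template string, list of parameter tokens) for one log line."""
    tokens = line.split()
    fixed = ["<*>" if HAS_DIGIT.search(t) else t for t in tokens]
    params = [t for t in tokens if HAS_DIGIT.search(t)]
    return " ".join(fixed), params

def group_by_template(lines):
    """Group the parameter lists of all log lines under their template."""
    groups = defaultdict(list)
    for line in lines:
        template, params = to_template(line)
        groups[template].append(params)
    return dict(groups)
```

With this rule, "Received from node blk_0001" maps to the template "Received from node <*>" with the parameter blk_0001, and several log lines that differ only in their parameters fall under one log template.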
Step 330: sequencing the plurality of log templates based on edit distances among the plurality of log templates and word vector similarity among the plurality of analyzed log data to be detected, numbering the sequenced log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix, wherein the target log template is the log template, among the plurality of log templates, that stores the largest number of analyzed log data to be detected, and the first keyword is obtained based on printing parameters in the plurality of sample log data.
In this step, a log template may first be selected at random, and the edit distance between it and the other templates in the first template sequence, together with the word vector similarity within each event template, is calculated to perform the sorting. In general, log templates with nearby sequence numbers have higher similarity.
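The ordering by edit distance can be sketched as follows. This is a simplified illustration under stated assumptions: the distance is a token-level Levenshtein distance, the seed template is chosen by index rather than at random so the result is reproducible, and the word vector similarity term is omitted. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x by y
        prev = cur
    return prev[-1]

def number_templates(templates, seed_index=0):
    """Sort templates by distance to the seed and assign sequence numbers."""
    seed = templates[seed_index].split()
    ordered = sorted(templates, key=lambda t: edit_distance(seed, t.split()))
    return {template: number for number, template in enumerate(ordered)}
```

Because the sort key is the distance to the seed, templates that differ from the seed by a single token receive small, adjacent sequence numbers, which matches the observation that nearby numbers indicate similar templates.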
In this step, the printing parameter may be an Init, Debug, or Warning parameter in the sample log data.
In this embodiment, when the data to be detected is a single log, the corresponding first template sequence includes one log template; when the data to be detected is a plurality of logs, the corresponding first template sequence includes a plurality of log templates. A mathematical statistics column and a first keyword column are added to the first template sequence corresponding to the data to be detected. The mathematical statistics column may contain the sequence number of the log template that stores the largest number of analyzed log data (or the most frequent target parameter) in the first template sequence, together with the most frequent log label class, where the log label classes are encoded, for example, as Init (0), Debug (1) and Warning (2).
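A minimal sketch of the two statistics columns is given below. The label codes Init (0), Debug (1) and Warning (2) follow the example in the text; the function name, the input layout (one template number and one label per log line), and the tie-breaking behaviour are assumptions.

```python
from collections import Counter

# Label codes as given in the example: Init (0), Debug (1), Warning (2).
LABEL_CODES = {"Init": 0, "Debug": 1, "Warning": 2}

def statistics_columns(template_ids, labels):
    """Return [most frequent template number, code of most frequent label]."""
    top_template = Counter(template_ids).most_common(1)[0][0]
    top_label = Counter(labels).most_common(1)[0][0]
    return [top_template, LABEL_CODES[top_label]]
```

For instance, if the template numbers of four log lines are 3, 1, 3, 2 and their labels are Debug, Warning, Debug, Init, the two extra columns are 3 and 1.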
In this embodiment, the first keyword column may be obtained by calculating the TF-IDF (Term Frequency-Inverse Document Frequency) values of the first template sequence and taking the three words with the highest TF-IDF values, such as "extension A", "sending" and "central server". The Chinese words are then converted into numbers using the correspondence between words and sequence numbers in the corpus. The sequence number of the most frequent log template, the most frequent log label class, and the three keyword columns are added to the count matrix, finally yielding the dimension-expanded information matrix, that is, the first count matrix.
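The keyword step can be illustrated with a small TF-IDF sketch. The text does not specify the exact TF-IDF variant, so the choices here (whitespace tokenization, natural logarithm without smoothing, scores summed over all entries, ties broken alphabetically) are assumptions, as are the function names and the example vocabulary.

```python
import math
from collections import Counter

def top_keywords(documents, k=3):
    """Return the k tokens with the highest summed TF-IDF score."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many entries each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    scores = Counter()
    for toks in tokenized:
        for tok, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log(n_docs / doc_freq[tok])
            scores[tok] += tf * idf
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]

def encode(keywords, vocabulary):
    """Map keywords to numbers using a word-to-sequence-number vocabulary."""
    return [vocabulary[word] for word in keywords]
```

The three numbers returned by `encode` would then be appended as the three keyword columns next to the statistics columns.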
Step 340: clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected to be anomalous log data when the target clustering distance is smaller than a third threshold.
In this step, the log anomaly detection model is obtained in the same manner as in step 140, which is not described again here.
In this step, the target clustering distance is the distance between the cluster centroid obtained after the second counting matrix is input into the log anomaly detection model and the target clustering centroid.
In this step, the third threshold may be set in a customized manner according to actual requirements.
In this embodiment, the second counting matrix is input into the log anomaly detection model, and when the distance between the second counting matrix and the centroid of the target cluster is smaller than a third threshold, the log data to be detected corresponding to the second counting matrix is determined to be anomalous log data.
It should be noted that the sample log data includes log data generated by normal operation of the module and log data including fault information, and when the sample log data is used for hierarchical clustering of the clustering model, a first clustering centroid of the log data generated by normal operation and a second clustering centroid of the log data including fault information can be obtained respectively, and the first clustering centroid and the second clustering centroid form the target clustering centroid.
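The threshold test can be sketched as below. The text leaves open which centroid the distance is measured against; this sketch assumes, as one possible reading, that the distance is taken to the centroid of the cluster of fault-containing logs, so that a small distance indicates an anomaly. The function names and the Euclidean norm are also assumptions.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def is_anomalous(vector, fault_centroid, third_threshold):
    """Flag the input when it lies closer to the fault centroid than the threshold."""
    return math.dist(vector, fault_centroid) < third_threshold
```

In use, `fault_centroid` would be computed once from the fault cluster found during training, and `third_threshold` would be tuned to the deployment, as the text notes.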
According to the log anomaly detection method provided by the invention, the log data to be detected is converted into the second counting matrix and identified by the log anomaly detection model, so that hidden fault information in the signal system can be located and provided to professional technicians for subsequent fault handling.
The training device of the log anomaly detection model provided by the present invention is described below. The training device described below and the training method described above may be referred to in correspondence with each other.
Fig. 4 is a schematic structural diagram of a training apparatus for a log anomaly detection model provided by the present invention, and as shown in fig. 4, the training apparatus for the log anomaly detection model includes: an acquisition module 410, a parsing module 420, a feature extraction module 430, and a clustering module 440.
The acquisition module 410 is configured to obtain multiple types of sample log data generated by a signal system;
the parsing module 420 is configured to analyze the multiple types of sample log data based on an event template to obtain a first template sequence, where the first template sequence includes multiple log templates, each log template is used to store analyzed sample log data corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used to indicate category information of the sample log data;
the feature extraction module 430 is configured to sort the multiple log templates based on edit distances between the multiple log templates and word vector similarities between the multiple analyzed sample log data to obtain a second template sequence, and to add a first keyword and statistical information of a target log template to the second template sequence to obtain a first count matrix, where the target log template is the log template, among the multiple log templates, that stores the largest number of analyzed sample log data, and the first keyword is obtained based on printing parameters in the multiple sample log data;
and the clustering module 440 is configured to input the first count matrix into a clustering model for hierarchical clustering training, and to obtain a log anomaly detection model after training is completed.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor) 510, a communication Interface (Communications Interface) 520, a memory (memory) 530, and a communication bus 540, wherein the processor 510, the communication Interface 520, and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a method of training a log anomaly detection model, the method comprising: acquiring various sample log data generated by a signal system; analyzing the log data of the multiple samples based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data, numbering the sequenced plurality of log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after training is completed. 
Or executing a log anomaly detection method, wherein the method comprises the following steps: acquiring log data to be detected; analyzing the log data to be detected based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed log data to be detected to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix; and clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing a method for training a log anomaly detection model provided by the above methods, the method comprising: acquiring various sample log data generated by a signal system; analyzing the log data of the multiple samples based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished. 
Or executing a log anomaly detection method, wherein the method comprises the following steps: acquiring log data to be detected; analyzing the log data to be detected based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the analyzed log data to be detected, numbering the sequenced plurality of log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix; clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of training a log anomaly detection model provided by the above methods, the method comprising: acquiring various sample log data generated by a signal system; analyzing the log data of various samples based on an event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix; and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after training is completed. Or executing a log anomaly detection method, wherein the method comprises the following steps: acquiring log data to be detected; analyzing the log data to be detected based on the event template to obtain a first template sequence; sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed log data to be detected, numbering the sequenced plurality of log templates to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix; clustering the second counting matrix based on the log anomaly detection model to obtain a target clustering distance, and determining the log data to be detected as anomalous log data under the condition that the target clustering distance is smaller than a third threshold value.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method of a log anomaly detection model is characterized by comprising the following steps:
acquiring various sample log data generated by a signal system;
analyzing the multiple sample log data based on an event template to obtain a first template sequence, wherein the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, and adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix, wherein the target log template is the log template which stores the analyzed sample log data in the plurality of log templates and has the largest number, and the first keyword is obtained based on printing parameters in the plurality of sample log data;
and inputting the first counting matrix into a clustering model for hierarchical clustering training, and obtaining a log anomaly detection model after the training is finished.
2. The training method of the log anomaly detection model according to claim 1, wherein the inputting the first count matrix into a clustering model for hierarchical clustering training and obtaining the log anomaly detection model after the training is completed comprises:
inputting the first counting matrix into a clustering model for clustering to obtain a plurality of sample clusters;
obtaining a target distance between any two sample clusters in the plurality of sample clusters based on a mean clustering formula, and fusing any two sample clusters to obtain a fused cluster when the target distance is smaller than a first threshold value;
and when the target distance is greater than the first threshold value or the number of the fusion clusters is smaller than a second threshold value, obtaining the log anomaly detection model.
3. The method for training the log anomaly detection model according to claim 1, wherein the step of inputting the first counting matrix into the clustering model for hierarchical clustering training and obtaining the log anomaly detection model after the training is completed further comprises:
carrying out normalization processing and vectorization processing on the counting matrix to obtain a counting vector;
and inputting the counting vector into a clustering model for hierarchical clustering training, and obtaining the log anomaly detection model after training is finished.
4. The method for training a log anomaly detection model according to claim 1, wherein the step of determining the event template comprises:
acquiring a token object in the unstructured log data;
classifying the unstructured log data based on the token object and a second keyword in the unstructured log data to obtain the event template.
5. The training method of the log anomaly detection model according to claim 2, wherein the mean clustering formula comprises:
Figure FDA0003918740790000021
wherein C_i and C_j are any two sample clusters of the plurality of sample clusters, the symbol shown in Figure FDA0003918740790000022 is the mean of C_i, and the symbol shown in Figure FDA0003918740790000023 is the mean of C_j.
6. A log anomaly detection method is characterized by comprising the following steps:
acquiring log data to be detected;
analyzing the log data to be detected based on an event template to obtain a first template sequence, wherein the first template sequence comprises a plurality of log templates, each log template is used for storing the analyzed log data to be detected corresponding to a target parameter, the event template is determined based on unstructured log data, and the target parameter is used for representing the category information of the log data to be detected;
sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the analyzed log data to be detected, numbering the sequenced log templates to obtain a second template sequence, adding a first keyword and statistical information of a target log template into the second template sequence to obtain a second counting matrix, wherein the target log template is the log template which stores the analyzed log data to be detected in the plurality of log templates and has the largest number, and the first keyword is obtained based on printing parameters in the log data of various samples;
clustering the second counting matrix based on the log anomaly detection model according to any one of claims 1 to 5 to obtain a target clustering distance, and determining the log data to be detected as anomalous log data when the target clustering distance is smaller than a third threshold value.
7. A training device for log anomaly detection models is characterized by comprising:
the acquisition module is used for acquiring various sample log data generated by the signal system;
the analysis module is used for analyzing the multiple sample log data based on event templates to obtain a first template sequence, the first template sequence comprises multiple log templates, each log template is used for storing analyzed sample log data corresponding to a target parameter, the event templates are determined based on unstructured log data, and the target parameter is used for representing the category information of the sample log data;
the feature extraction module is used for sequencing the plurality of log templates based on editing distances among the plurality of log templates and word vector similarity among the plurality of analyzed sample log data to obtain a second template sequence, adding a first keyword and statistical information of a target log template into the second template sequence to obtain a first counting matrix, wherein the target log template is the log template which stores the analyzed sample log data in the plurality of log templates and has the largest number, and the first keyword is obtained based on printing parameters in the plurality of sample log data;
and the clustering module is used for inputting the first counting matrix into a clustering model for hierarchical clustering training and obtaining a log anomaly detection model after the training is finished.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a training method of a log anomaly detection model according to any one of claims 1 to 5 or a log anomaly detection method according to claim 6 when executing the program.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements a training method of a log anomaly detection model according to any one of claims 1 to 5 or the log anomaly detection method according to claim 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a training method for a log anomaly detection model according to any one of claims 1 to 5 or a log anomaly detection method according to claim 6.
CN202211350703.8A 2022-10-31 2022-10-31 Training method and device for log anomaly detection model Pending CN115758183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211350703.8A CN115758183A (en) 2022-10-31 2022-10-31 Training method and device for log anomaly detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211350703.8A CN115758183A (en) 2022-10-31 2022-10-31 Training method and device for log anomaly detection model

Publications (1)

Publication Number Publication Date
CN115758183A true CN115758183A (en) 2023-03-07

Family

ID=85354695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211350703.8A Pending CN115758183A (en) 2022-10-31 2022-10-31 Training method and device for log anomaly detection model

Country Status (1)

Country Link
CN (1) CN115758183A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910592A (en) * 2023-09-13 2023-10-20 中移(苏州)软件技术有限公司 Log detection method and device, electronic equipment and storage medium
CN116910592B (en) * 2023-09-13 2023-11-24 中移(苏州)软件技术有限公司 Log detection method and device, electronic equipment and storage medium
CN117827620A (en) * 2024-03-05 2024-04-05 云账户技术(天津)有限公司 Abnormality diagnosis method, training device, training equipment, and recording medium
CN117827620B (en) * 2024-03-05 2024-05-10 云账户技术(天津)有限公司 Abnormality diagnosis method, training device, training equipment, and recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination