CN115442211B - Network log analysis method and device based on twin neural network and fixed analysis tree - Google Patents


Info

Publication number
CN115442211B
CN115442211B (application CN202210999336.8A)
Authority
CN
China
Prior art keywords
log
message
neural network
log message
twin neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210999336.8A
Other languages
Chinese (zh)
Other versions
CN115442211A (en)
Inventor
徐小龙
孙雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210999336.8A priority Critical patent/CN115442211B/en
Publication of CN115442211A publication Critical patent/CN115442211A/en
Application granted granted Critical
Publication of CN115442211B publication Critical patent/CN115442211B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/069Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a network log parsing method and device based on a twin neural network and a fixed parse tree, wherein the method comprises the following steps: first, a new log message is preprocessed by a simple regular expression based on domain knowledge; a suitable log event list (i.e., a leaf node of the tree) is then determined by the twin neural network within the internal nodes of the tree according to the encoding; if a suitable log event is found, the log message is matched to the log event stored in the log event list; otherwise, a new log event is created from the log message and added to the log event list; when all log messages have been matched, similar log events are finally merged according to the log event conditions. The parsing method accurately extracts log events from network logs and has good practicability in log parsing tasks with high accuracy and real-time requirements.

Description

Network log analysis method and device based on twin neural network and fixed analysis tree
Technical Field
The invention relates to a network log parsing method and device based on a twin neural network and a fixed parse tree, and belongs to the technical field of natural language processing and artificial intelligence.
Background
Modern software and hardware systems typically record valuable runtime information (e.g., important events and related variables) in logs, including some of the most important information for diagnosing network or system anomalies. When an anomaly occurs in the network or system, log messages are typically used in the subsequent troubleshooting process, in which service personnel examine the root cause of the problem and decide what they should do to recover from the fault. Logging therefore plays an important role in helping practitioners understand the runtime behavior of software systems and diagnose system failures. However, since logs are typically very large (e.g., tens or hundreds of GB), previous studies have proposed methods for automatically analyzing logs that assist practitioners in various software maintenance and operation activities, such as debugging, anomaly detection, fault prediction, system understanding, performance diagnosis, and improvement.
Logs are generated by logging statements in the source code (e.g., log.info() statements). A logging statement is composed of a log level (e.g., info, error), static text (e.g., "Received block" and "of size"), and dynamic variables (e.g., "$blockId" and an IP address). During system operation, a logging statement generates an original log message: a line of unstructured text containing the static text and the values of the dynamic variables specified in the statement (e.g., "blk_7526945448667194862"). The log message also contains information such as the time of occurrence of the event (e.g., "081109 210637"). In other words, the logging statement defines a log event for the log messages generated at runtime (some studies call this a log template). The goal of log parsing is to extract the static log event, the dynamic variables, and the header information (i.e., timestamp, log level, and logger name) from the original log message into a structured format. Although prior log parsers can parse logs to some degree, the rapid growth in log size raises several challenges:
1) Heavy dependence on specially designed regular expressions. In practice, practitioners often write special log parsing scripts that rely heavily on specially designed regular expressions. Since modern software systems often contain a large number of constantly evolving log events, developing and maintaining such regular expressions requires significant effort.
2) The semantic information of log messages is ignored. Some previous log parsing methods adopt specially designed rules that depend on domain knowledge, and design complex algorithms and data structures to parse logs, but they do not consider the semantic information of the log text. Log parsing then depends entirely on the algorithm model and on the log type and structure, so the robustness of such methods is low.
3) Log data is very complex. A log message consists of unstructured text generated by a vendor's logging statements in the source code. A complex distributed system, in particular, is composed of components from multiple vendors whose logging statements emit messages out of order, so the log format is highly diverse, which greatly increases the difficulty of log parsing.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a network log parsing method based on a twin neural network and a fixed parse tree. The method incorporates a twin neural network on top of a fixed-depth parse tree and uses the twin neural network to learn the correlation between log messages; it can learn log semantics without defining different rules from domain knowledge, has good generality, and has good practicability in log parsing systems with high accuracy and real-time requirements. In order to achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a weblog parsing method based on a twin neural network and a fixed parse tree, including:
acquiring an original log message;
preprocessing the original log message through a simple regular expression to obtain a preprocessed log message and a corresponding log message length;
dividing the original log messages into log groups according to log message length, wherein the log messages stored in each log group have the same length;
for the current log message, selecting a path to the second-layer node based on the divided log group, and searching the intermediate node to finally search the most similar leaf node;
based on the searched most similar leaf node, determining the similarity of the current log message and the most similar log message of the leaf node by adopting a trained twin neural network model, and obtaining a log message similarity result;
updating the condition of the log message template through a fixed depth analysis tree according to the log message similarity result;
and when all log information analysis is completed, obtaining an analysis tree.
In some embodiments, dividing the original log message into log groups according to log message length includes: based on the assumption that the log messages with the same log event have the same log message length, the original logs are divided into different log groups according to the difference of the length of the log messages after preprocessing, and the length of the log messages stored in each log group is the same, wherein the length of the log messages is defined as the number of tokens in the log messages.
In some embodiments, selecting a path to a second tier node based on the partitioned log group, the searching intermediate node eventually searching for the most similar leaf node, comprises:
starting the search downwards from the root of the parse tree; upon reaching the level representing the log message length in the parse tree, namely the second level, the search continues downwards, and the next internal node is selected by the token at the starting position of the log message; leaf nodes at a fixed depth are then searched according to the preset depth parameter of the parse tree, where each leaf node contains a list of log event groups.
In some embodiments, determining the similarity of the current log message to the most similar log message of the leaf node using the trained twin neural network model comprises:
the twin neural network comprises two sub-networks LSTM_a and LSTM_b, each processing one sentence of a given pair;
an LSTM learns a mapping from the space of variable-length token sequences to a fixed-dimensional representation space R^d. Each log message, represented as a token sequence x_1, ..., x_T, is passed to the LSTM; for each t ∈ {1, ..., T}, one LSTM update is parameterized by the weight matrices W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o and the biases b_i, b_f, b_c, b_o:
i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
where W_i, U_i are the two weight matrices of the input gate; W_f, U_f the two weight matrices of the forget gate; W_c, U_c the two weight matrices of the candidate cell; W_o, U_o the two weight matrices of the output gate; b_i, b_f, b_c, b_o the biases of the input gate, forget gate, candidate cell, and output gate, respectively; h_{t-1} is the hidden state at step t-1; i_t the input-gate activation at step t; f_t the forget-gate activation; o_t the output-gate activation; c̃_t the candidate cell state at step t; c_t the cell state at step t; c_{t-1} the cell state at step t-1; h_t the hidden state at step t; sigmoid and tanh are activation functions; and ⊙ is the Hadamard (element-wise) product;
the hidden state is updated at each sequence index by these equations; the final representation of a log message is encoded by the model's last hidden state h_T ∈ R^d; for a given pair of log messages, a predefined similarity function g: R^d × R^d → R is applied to their LSTM representations; the similarity in the representation space is then used to infer the latent semantic similarity of the sentences;
the only error signal back-propagated during training comes from the similarity between the log message representations h_T^(a) and h_T^(b), and is restricted to the simple similarity function g(h_T^(a), h_T^(b)) = exp(-||h_T^(a) - h_T^(b)||_1), where h_T^(a) and h_T^(b) denote the hidden states of log message a and log message b at the final step T, ||·||_1 is the Manhattan distance, and exp is the exponential function.
In some embodiments, updating the log message template case by a fixed depth parse tree based on the log message similarity result includes:
if a similar log event is found in the log event group of the most similar leaf node, the log ID of the current log message is added to the log IDs of the similar log group, where the log IDs contain only the IDs of log messages and the log event is the template shared by those log messages; in addition, the log event in the returned log group is updated;
if similar log events cannot be found in the log event group of the most similar leaf node, a new log event is created according to the current log message and added to the log event group list of the leaf node, and the parsing tree is updated by the new log event.
In some embodiments, the method for web log parsing based on a twin neural network and a fixed parse tree further comprises merging similar log events in leaf nodes of the parse tree.
Further, merging similar log events in leaf nodes includes: among all leaf nodes of the subtrees partitioned by the same length, for log events with few occurrences, if the similarity between log events is greater than a set threshold (the similarity being computed by the twin neural network model), the log events are merged and updated following the steps for updating the parse tree.
In a second aspect, the present invention provides an online log parsing method, including:
acquiring an online log message sequence and preprocessing the online log message sequence into a filtered log sequence;
dividing the log messages into different log groups according to log message length, wherein the log message lengths in different log groups are different;
inputting the filtered log into a fixed depth analysis tree, and obtaining the similarity of the log message and the log event when the log reaches a leaf node, wherein the similarity is given by a twin neural network, then updating the analysis tree, and finally merging the similar events.
In a third aspect, the invention provides a weblog parsing device based on a twin neural network and a fixed parse tree, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
In a fourth aspect, the present invention provides a storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of the first aspect.
Compared with the prior art, the network log analysis method based on the twin neural network and the fixed analysis tree provided by the embodiment of the invention has the following beneficial effects:
the network log analysis method based on the twin neural network and the fixed analysis tree comprises the following steps: when a new original log message arrives, it will first be preprocessed by a simple regular expression based on domain knowledge. Then, determining a proper log event list (namely leaf nodes of the tree) through the twin neural network in the internal nodes of the tree according to codes, and if the proper log event is found, matching the log event stored in the log event list; otherwise, a new log event is created from the log message and added to the log event list. When all log messages are matched, finally merging similar log events according to the log event conditions; compared with other inventions, the invention combines the rule model and can learn the semantic information of the log information, so that the analysis of the log information is more robust; the method has the advantages that excellent effects are achieved on the public data set based on log analysis, the correlation among log messages is learned by utilizing the twin neural network, the twin neural network can learn log semantics, different rules are not required to be defined by any domain knowledge, the method has good universality, and the method has good practicability in a log analysis system with high accuracy and real-time requirements.
Drawings
FIG. 1 is a flow chart of a weblog parsing method based on a twin neural network and a fixed parse tree provided by the invention;
FIG. 2 is a block diagram of a fixed depth parse tree in a weblog parse method based on a twin neural network and a fixed parse tree provided by the invention;
fig. 3 is a schematic diagram of a twin LSTM neural network in a weblog parsing method based on a twin neural network and a fixed parse tree provided by the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
the embodiment provides a weblog analysis method based on a twin neural network and a fixed analysis tree, which comprises the following steps:
acquiring an original log message;
preprocessing an original log message through a simple regular expression to obtain a preprocessed log message and a corresponding log message length:
dividing original log information into log groups according to the length of the log information, wherein the length of the log information stored in each log group is the same;
for the current log message, selecting a path to the second-layer node based on the divided log group, and searching the intermediate node to finally search the most similar leaf node;
based on the searched most similar leaf node, determining the similarity of the current log message and the most similar log message of the leaf node by adopting a trained twin neural network model, and obtaining a log message similarity result;
updating the condition of the log message template through a fixed depth analysis tree according to the log message similarity result;
and when all log information analysis is completed, obtaining an analysis tree.
In some embodiments, merging similar log events in parse tree leaf nodes is also included.
In some embodiments, as shown in fig. 1, the network log parsing method based on a twin neural network and a fixed parse tree includes: performing simple preprocessing on each arriving log message through a regular expression;
dividing log groups according to the length of log messages, and guiding the log messages with different lengths into different log groups;
through log groups with different log message lengths, searching intermediate nodes to finally search the most similar leaf nodes;
the log message similarity is determined based on the twin neural network;
updating the log message template condition through a fixed depth parsing tree;
and finally merging similar log events in leaf nodes of the parse tree.
The method comprises the following specific steps:
step 1: the original log message of the system is obtained and filtered by a simple regular expression.
Allowing the log parser user to provide simple regular expressions based on domain knowledge representing common variables, such as IP addresses and block IDs;
tokens matched in the original log message by these regular expressions are replaced; e.g., block IDs are matched by "blk_[0-9]+" and replaced (i.e., treated as dynamic variables). The regular expressions used are typically very simple, and they are used to match tokens rather than whole log messages.
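A minimal sketch of this preprocessing, assuming Python; the two patterns below are hypothetical stand-ins for the domain-knowledge expressions (the patent names only IP addresses and block IDs as examples of common variables), and the `<*>` placeholder follows the log-event examples later in the description.

```python
import re

# Hypothetical domain-knowledge patterns; the exact expressions are illustrative.
PATTERNS = [
    (re.compile(r"blk_-?\d+"), "<*>"),                         # HDFS block IDs
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?"), "<*>"),  # IPv4, optional port
]

def preprocess(raw_message):
    """Replace known variable tokens, lowercase, and split into tokens."""
    msg = raw_message
    for pattern, repl in PATTERNS:
        msg = pattern.sub(repl, msg)
    return msg.lower().split()

tokens = preprocess(
    "Received block blk_7526945448667194862 of size 67108864 from /10.251.203.80"
)
```

Note that the patterns match individual tokens (block ID, IP), not the whole message; the remaining numeric token "67108864" is left for the parse tree's digit-routing rule.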
Step 2: the log groups are divided according to the length of the log messages, and the log messages with different lengths are led into different log groups.
Dividing the original logs into different log groups according to the different lengths of the logs after pretreatment (the length of the log information stored in each log group is the same);
the length of the log message is defined as the number of tokens in the log message, and a path to the second layer node is selected according to the length of the log message of the preprocessed log message;
this is based on the assumption that log messages with the same log event may have the same log message length.
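This grouping step can be sketched as follows; the dict-of-lists layout keyed by token count is an illustrative choice, not the claimed data structure.

```python
from collections import defaultdict

def group_by_length(token_lists):
    """Partition preprocessed log messages by token count: messages produced
    by the same log event are assumed to have the same length."""
    groups = defaultdict(list)
    for tokens in token_lists:
        groups[len(tokens)].append(tokens)
    return groups

logs = [
    ["received", "block", "<*>", "of", "size", "<*>", "from", "<*>"],
    ["deleting", "block", "<*>", "file", "<*>"],
    ["received", "block", "<*>", "of", "size", "<*>", "from", "<*>"],
]
groups = group_by_length(logs)
```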
Step 3: and searching the intermediate nodes through log groups with different log message lengths to finally search the most similar leaf nodes.
When the level representing the length of the log message (i.e., the second level) is reached, the search continues downwards, based on the assumption that tokens at the beginning of a log message are more likely to be constants;
specifically, the next internal node is selected by the token at the starting position of the log message. As in fig. 2, for the log message "Received block blk_7526945448667194862 of size 67108864 from/10.251.203.80", the search moves from the second-tier node "Length: 8" to the third-tier node "received" (all tokens of the log message have been lowercased), because the token at the first position of the log message is "received";
the search then traverses towards the leaf node linked to the internal node "received"; when the tree depth is set to 5, the search continues with the token at the second position of the log message;
to avoid branch explosion of the parse tree, a parameter MaxChild is defined that limits the maximum number of branches of an internal node. If a node's branches reach this upper limit, any unmatched token is routed to a special internal node "<*>" of that node. Furthermore, if a token contains a number, it is also matched to the special internal node "<*>", as described above. Although a non-numeric variable of the true log event may also appear at the beginning of the log message, causing the parser to mistake the non-numeric token for a constant, this can still be resolved by post-processing.
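The fixed-depth descent with the MaxChild limit can be sketched as follows; the `TREE_DEPTH` and `MAX_CHILD` values are assumed, and routing digit-bearing or overflowing tokens to `"<*>"` follows the rule just described.

```python
MAX_CHILD = 100   # assumed branching limit; the patent calls this MaxChild
TREE_DEPTH = 3    # internal levels below the length level (assumed value)

class Node:
    def __init__(self):
        self.children = {}   # token -> Node at internal levels
        self.groups = []     # log-event groups, populated only at leaves

def contains_digit(token):
    return any(ch.isdigit() for ch in token)

def descend(root, tokens):
    """Walk from the length node down TREE_DEPTH prefix tokens to a leaf,
    creating nodes on the way; tokens containing digits, or tokens that
    would overflow a node's branch limit, fall into the '<*>' child."""
    node = root.children.setdefault(len(tokens), Node())  # second level: length
    for token in tokens[:TREE_DEPTH]:
        if contains_digit(token):
            token = "<*>"
        elif token not in node.children and len(node.children) >= MAX_CHILD:
            token = "<*>"
        node = node.children.setdefault(token, Node())
    return node  # leaf holding the candidate log-event group list

root = Node()
leaf = descend(root, ["received", "block", "<*>", "of", "size", "<*>", "from", "<*>"])
```

Repeating the descent with an identical message returns the same leaf, so its stored log-event groups become the candidates for the similarity check.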
Step 4: log message similarity is determined based on the twin neural network.
Two networks LSTM_a and LSTM_b each process one sentence of a given pair, but we focus only on the weight-tied twin architecture, i.e., LSTM_a = LSTM_b in the twin neural network.
As shown in FIG. 3, an LSTM learns a mapping from the space of variable-length token sequences to a fixed-dimensional representation space R^d. Each log message, represented as a token sequence x_1, ..., x_T, is passed to the LSTM; for each t ∈ {1, ..., T}, one LSTM update is parameterized by the weight matrices W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o and the biases b_i, b_f, b_c, b_o:
i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i) (1)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f) (2)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o) (3)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c) (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (5)
h_t = o_t ⊙ tanh(c_t) (6)
where W_i, U_i are the two weight matrices of the input gate; W_f, U_f the two weight matrices of the forget gate; W_c, U_c the two weight matrices of the candidate cell; W_o, U_o the two weight matrices of the output gate; b_i, b_f, b_c, b_o the biases of the input gate, forget gate, candidate cell, and output gate, respectively; h_{t-1} is the hidden state at step t-1; i_t the input-gate activation at step t; f_t the forget-gate activation; o_t the output-gate activation; c̃_t the candidate cell state at step t; c_t the cell state at step t; c_{t-1} the cell state at step t-1; h_t the hidden state at step t; sigmoid and tanh are activation functions; and ⊙ is the Hadamard (element-wise) product;
the hidden state of the sequence is updated at each index by equations (1)-(6); the final representation of a log message is encoded by the model's last hidden state h_T ∈ R^d. For a given pair of log messages, our method applies a predefined similarity function g: R^d × R^d → R to their LSTM representations. The similarity in the representation space is then used to infer the latent semantic similarity of the sentences. The only error signal back-propagated during training comes from the similarity between the log message representations h_T^(a) and h_T^(b). We restrict it to the simple similarity function g(h_T^(a), h_T^(b)) = exp(-||h_T^(a) - h_T^(b)||_1), where h_T^(a) and h_T^(b) denote the final hidden states of the two log messages at step T, ||·||_1 is the Manhattan distance, and exp is the exponential function.
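The similarity function g itself is simple enough to illustrate directly; the toy vectors below are placeholders for the final hidden states h_T^(a) and h_T^(b), not outputs of a trained network.

```python
import math

def malstm_similarity(h_a, h_b):
    """g(h_a, h_b) = exp(-||h_a - h_b||_1): maps the Manhattan distance
    between the two final hidden states into the interval (0, 1]."""
    l1 = sum(abs(x - y) for x, y in zip(h_a, h_b))
    return math.exp(-l1)

# Toy final hidden states standing in for h_T^(a) and h_T^(b):
identical = malstm_similarity([0.2, -0.5, 0.1], [0.2, -0.5, 0.1])  # exp(0) = 1.0
distant = malstm_similarity([0.2, -0.5, 0.1], [0.9, 0.4, -0.3])    # closer to 0
```

Identical representations score exactly 1, and the score decays exponentially as the L1 distance grows, which keeps the training signal bounded.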
Step 5: the log message template case is updated by a fixed depth parse tree.
If a similar log event is found in the log event group, the log ID of the current log message is added to the log IDs of the similar log group; the log IDs contain only the IDs of log messages, and the log event is the template shared by those log messages;
the log event in the returned log group is then updated. Specifically, the tokens at the same positions in the log message and the log event are scanned. If the two tokens are identical, the token at that position is not modified; otherwise, we update the token in the log event with a wildcard (i.e., <*>);
if similar log events cannot be found, a new log event is created according to the current log message and added to the log event group, and then the analysis tree is updated by the new log event.
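The token-wise template update in the two cases above can be sketched as a minimal illustration: matching tokens are kept, and disagreeing tokens become the wildcard <*>.

```python
def merge_template(event, message):
    """Token-wise merge of a stored log event with a matched log message.
    Both sequences have equal length because they share a log group."""
    return [e if e == m else "<*>" for e, m in zip(event, message)]

event = ["received", "block", "<*>", "of", "size", "67108864", "from", "<*>"]
msg = ["received", "block", "<*>", "of", "size", "9437184", "from", "<*>"]
updated = merge_template(event, msg)
```

Here the differing size token is generalized to <*>, so the stored event converges toward the static text of the underlying logging statement.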
Step 6: and finally merging similar log events in leaf nodes of the parse tree.
The logs are divided into different log groups according to log message length, and the search proceeds token by token from the beginning of the log message, based on the assumption that tokens at the beginning of a log message are more likely to be constants. However, a non-numeric token at the beginning of a log message may also be a dynamic variable; that is, when a dynamic variable occupies a front position of the log message and its value is non-numeric, the following can easily occur: the leading non-numeric variable is mistaken for an internal node of the parse tree (i.e., a constant), resulting in parsing errors;
combining the characteristics of the parse tree, the following post-processing is performed after all log messages have been parsed: among all leaf nodes of the subtrees partitioned by the same length, for log events with few occurrences, if the similarity between log events is greater than a set threshold (the similarity being computed by the twin neural network model), the log events are merged and updated following the steps for updating the parse tree. This allows logs that were misclassified into two log events to be recombined into the same log event.
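A sketch of this post-processing pass, under assumptions: `token_overlap` is only a stand-in for the twin-network similarity score, and the threshold and the greedy merge order (most-frequent group first) are illustrative choices.

```python
def postprocess_merge(groups, similarity, threshold=0.9):
    """Greedy post-processing pass: fold each log-event group into the first
    kept group whose event is similar enough; disagreeing tokens become <*>.
    `similarity` stands in for the twin-network score; the threshold is an
    assumed parameter."""
    kept = []
    for event, ids in sorted(groups, key=lambda g: len(g[1]), reverse=True):
        for i, (kept_event, kept_ids) in enumerate(kept):
            if similarity(kept_event, event) > threshold:
                merged = [a if a == b else "<*>" for a, b in zip(kept_event, event)]
                kept[i] = (merged, kept_ids + ids)
                break
        else:
            kept.append((event, ids))
    return kept

# Stand-in similarity: fraction of aligned tokens that agree.
def token_overlap(a, b):
    return sum(x == y for x, y in zip(a, b)) / max(len(a), 1)

groups = [
    (["open", "file", "<*>"], [1, 2, 3]),
    (["open", "file", "failed"], [4]),       # rare variant, likely misparsed
    (["close", "socket", "<*>"], [5, 6]),
]
result = postprocess_merge(groups, token_overlap, threshold=0.6)
```

The rare "open file failed" group is folded back into the frequent "open file <*>" event, while the unrelated socket event is kept separate.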
The method combines the rule model of the fixed depth analysis tree, can learn the semantic information of the log information by utilizing the twin neural network and give out the similarity, and highlights that the method has robustness to the analysis of the log information.
Embodiment two:
the embodiment of the invention provides an online log message analysis method, which comprises the following steps:
acquiring an online log message sequence and preprocessing the online log message sequence into a filtered log sequence;
dividing the log messages into different log groups according to log message length, wherein the log message lengths in different log groups are different;
inputting the filtered log into a fixed depth analysis tree, and obtaining the similarity of the log message and the log event when the log reaches a leaf node, wherein the similarity is given by a twin neural network, then updating the analysis tree, and finally merging the similar events.
The method specifically comprises the following steps:
step 1: and acquiring an online log message sequence and preprocessing the online log message sequence into a filtered log sequence.
Allowing the log parser user to provide simple regular expressions based on domain knowledge representing common variables, such as IP addresses and block IDs;
tokens matched in the original log message by these regular expressions are replaced; e.g., block IDs are matched by "blk_[0-9]+" and replaced (i.e., treated as dynamic variables). The regular expressions used are typically very simple, and they are used to match tokens rather than whole log messages.
Specifically, the online log message sequence is:
“Received block blk_7526945448667194862 of size 67108864 from/10.251.203.80”
the resulting filtered log message is as follows:
Date:081109 Time:210637 Pid:1283
Level:INFO Component:dfs.DataNode$PacketResponder
Log Event:received block<*>of size<*>from/<*>
Dynamic variable(s):blk_7526945448667194862,67108864,/10.251.203.80
step 2: the log sequence length is divided into different log groups, and the log message lengths in the different log groups are different.
As shown in fig. 2, the log messages filtered in step 1 are divided into log groups:
“Length:8”
step 3: inputting the filtered log into a fixed depth analysis tree, and obtaining the similarity of the log message and the log event when the log reaches a leaf node, wherein the similarity is given by a twin neural network, updating the analysis tree, and finally merging the similar events.
The invention combines a rule-based model with learned semantic information of log messages, making log parsing more robust, and achieves excellent results on public log-parsing datasets. The twin neural network learns the correlations between log messages and captures log semantics, so no domain knowledge is needed to define different rules. The method therefore generalizes well and is practical for log-parsing systems with high accuracy and real-time requirements.
Embodiment III:
In a third aspect, this embodiment provides a weblog parsing apparatus based on a twin neural network and a fixed parse tree, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate according to the instructions to perform the steps of the method according to Embodiment 1.
Embodiment IV:
In a fourth aspect, this embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method described in Embodiment 1.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (9)

1. The network log analysis method based on the twin neural network and the fixed analysis tree is characterized by comprising the following steps of:
acquiring an original log message;
preprocessing the original log message with simple regular expressions to obtain a preprocessed log message and the corresponding log message length;
dividing the original log messages into log groups according to log message length, wherein the log messages stored in each log group have the same length;
for the current log message, selecting a path to a second-layer node based on the divided log groups, and searching through the intermediate nodes to find the most similar leaf node;
based on the found most similar leaf node, determining the similarity between the current log message and the most similar log event of the leaf node using a trained twin neural network model, to obtain a log message similarity result;
updating the log message template in the fixed-depth parse tree according to the log message similarity result;
and when all log messages have been parsed, obtaining the final parse tree.
2. The method for web log parsing based on a twin neural network and a fixed parse tree according to claim 1, wherein dividing the original log messages into log groups according to log message length comprises: based on the assumption that log messages belonging to the same log event have the same length, dividing the preprocessed logs into different log groups by message length, wherein the log messages stored in each log group have the same length, and wherein the log message length is defined as the number of tokens in the log message.
3. The method of network log parsing based on a twin neural network and a fixed parse tree according to claim 1, wherein selecting paths to nodes of a second layer based on divided log groups, searching intermediate nodes to finally search for most similar leaf nodes, comprises:
starting the search downward from the root of the parse tree; upon reaching the layer of the parse tree that encodes log message length, namely the second layer, the search continues downward, selecting the next internal node by the token at the start position of the log message; leaf nodes are found at a fixed depth according to the preset depth parameter of the parse tree, wherein each leaf node contains a list of log event groups.
4. The method for web log parsing based on a twin neural network and a fixed parse tree according to claim 1, wherein determining the similarity of the current log message to the most similar log message of the leaf node using the trained twin neural network model comprises:
the twin neural network comprises two identical sub-networks, LSTMa and LSTMb, each of which processes one sentence of a given pair of sentences;
the LSTM learns a mapping from a variable-length sequence of vectors into a fixed-dimensional representation space; each log message is represented as a sequence of tokens x_1, ..., x_T and passed to the LSTM; for each t ∈ {1, ..., T}, one LSTM update is parameterized by the weight matrices W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o and the biases b_i, b_f, b_c, b_o:
i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = i_t ⊙ c̃_t + f_t ⊙ c_{t-1}
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)
wherein W_i, U_i are the two weight matrices of the input gate; W_f, U_f are the two weight matrices of the forget gate; W_c, U_c are the two weight matrices of the candidate cell; W_o, U_o are the two weight matrices of the output gate; b_i, b_f, b_c, b_o are the biases of the input gate, the forget gate, the candidate cell, and the output gate, respectively; h_{t-1} is the hidden state at time step t-1; i_t, f_t, and o_t are the input-gate, forget-gate, and output-gate activations at time step t; c̃_t is the candidate cell state at time step t; c_t and c_{t-1} are the cell states at time steps t and t-1; h_t is the hidden state at time step t; sigmoid and tanh are activation functions; and ⊙ denotes the Hadamard (element-wise) product;
the hidden state is updated at each sequence index by the above equations; the final representation of a log message is encoded by the last hidden state h_T of the model; for a given pair of log messages, a predefined similarity function g is applied to their LSTM representations, and the similarity in this representation space is then used to infer the latent semantic similarity of the sentences;
the only error signal back-propagated during training comes from the similarity between the log message representations h_T^(a) and h_T^(b), which is restricted to the simple similarity function g(h_T^(a), h_T^(b)) = exp(−‖h_T^(a) − h_T^(b)‖_1), wherein h_T^(a) and h_T^(b) denote the final hidden states for log message a and log message b respectively, ‖·‖_1 denotes the Manhattan (L1) distance, and exp is the exponential function.
5. The method for web log parsing based on a twin neural network and a fixed parse tree according to claim 1, wherein updating the log message template in the fixed-depth parse tree according to the log message similarity result comprises:
if a similar log event is found in the log event group of the most similar leaf node, adding the log ID of the current log message to the Log IDs of that log group, wherein the Log IDs contain only the IDs of the log messages and the log event is the template of those log messages; in addition, the log event of the returned log group is updated;
if no similar log event can be found in the log event group of the most similar leaf node, a new log event is created from the current log message and added to the log event group list of the leaf node, and the parse tree is updated with the new log event.
6. The method of network log parsing based on a twin neural network and a fixed parse tree according to claim 1, further comprising merging similar log events in leaf nodes of the parse tree.
7. The method for web log parsing based on a twin neural network and a fixed parse tree according to claim 6, wherein merging similar log events in leaf nodes comprises: for log events that occur only rarely across all leaf nodes of the subtree for the same length partition, if the similarity between such log events, computed with the twin neural network model, is greater than a set threshold, the log events are merged and the parse tree is updated according to the steps for updating the parse tree.
8. The weblog analysis device based on the twin neural network and the fixed analysis tree is characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1 to 7.
9. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 7.
CN202210999336.8A 2022-08-19 2022-08-19 Network log analysis method and device based on twin neural network and fixed analysis tree Active CN115442211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210999336.8A CN115442211B (en) 2022-08-19 2022-08-19 Network log analysis method and device based on twin neural network and fixed analysis tree


Publications (2)

Publication Number Publication Date
CN115442211A CN115442211A (en) 2022-12-06
CN115442211B true CN115442211B (en) 2023-08-04

Family

ID=84243060



Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519890A (en) * 2018-04-08 2018-09-11 Wuhan University A robust code summary generation method based on a self-attention mechanism
CN111930592A (en) * 2020-07-20 2020-11-13 Jiaxing Power Supply Company of State Grid Zhejiang Electric Power Co., Ltd. Method and system for real-time detection of anomalies in log sequences
CN113626400A (en) * 2021-07-11 2021-11-09 Nanjing University of Science and Technology Log event extraction method and system based on a log tree and a parse tree

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11218500B2 (en) * 2019-07-31 2022-01-04 Secureworks Corp. Methods and systems for automated parsing and identification of textual data


Non-Patent Citations (1)

Title
Design and Implementation of a Machine Learning-Based Log Parsing System; Zhong Ya, Guo Yuanbo; Journal of Computer Applications (Issue 02); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant