CN111367964A - Method for automatically analyzing log - Google Patents

Method for automatically analyzing log

Info

Publication number
CN111367964A
CN111367964A (application CN202010132165.XA; granted as CN111367964B)
Authority
CN
China
Prior art keywords
log
probability
analysis
state
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010132165.XA
Other languages
Chinese (zh)
Other versions
CN111367964B (en)
Inventor
Li Ningning (李宁宁)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eisoo Information Technology Co Ltd
Original Assignee
Shanghai Eisoo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eisoo Information Technology Co Ltd filed Critical Shanghai Eisoo Information Technology Co Ltd
Priority to CN202010132165.XA priority Critical patent/CN111367964B/en
Publication of CN111367964A publication Critical patent/CN111367964A/en
Application granted granted Critical
Publication of CN111367964B publication Critical patent/CN111367964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for automatically analyzing logs, which comprises the following steps: S1, obtaining sample log data; S2, respectively establishing a log database and a log analysis model according to the sample log data; S3, acquiring target log data and preprocessing it; S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path; and S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log. Compared with the prior art, the method builds a hidden Markov log analysis model and combines it with the Viterbi algorithm, overcoming the low efficiency of the traditional approach of manually formulating regular expressions; it can rapidly and accurately identify the internal structure of a log automatically and extract the effective information.

Description

Method for automatically analyzing log
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for automatically analyzing logs.
Background
With the continuing development of computer technology, computer systems have become increasingly complex. For IT operation and maintenance, raw logs cannot directly provide valid information; the fields in a raw log must first be parsed before the valid information can be extracted. The traditional log analysis method is to manually write corresponding regular-expression rules. This approach works well when there are few log categories and the log structure changes infrequently. However, as more and more functions are integrated into a system, a large number of IT subsystems arise and, with them, a large volume of log data of various types. Designing a regular matching rule for every log is extremely time- and labor-consuming. How to rapidly and accurately parse text logs has therefore become an urgent problem.
Disclosure of Invention
The present invention aims to overcome the defects of the prior art and provide a method for automatically analyzing a log which, based on natural language processing technology, automatically identifies the internal structure of a text log by computer so as to quickly and accurately extract effective information from the log.
The purpose of the invention can be realized by the following technical scheme: a method of automatically parsing a log, comprising the steps of:
S1, obtaining sample log data;
S2, respectively establishing a log database and a log analysis model according to the sample log data;
S3, acquiring target log data and preprocessing it;
S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log.
Further, the step S2 specifically includes the following steps:
S21, marking the structure of the sample logs according to the effective information of the sample log data to establish a log database;
S22, constructing a hidden Markov model from the marked log structure information in the log database to serve as the log analysis model.
Further, the sample log data in step S21 includes eight kinds of log data: apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN.
Further, when the structure of the sample log is labeled in step S21, the log structure is specifically labeled with the identifiers B, M, E, S, O to obtain labels corresponding one-to-one to the characters in the log structure, where B, M, and E denote the beginning, middle, and end of a character string respectively, S denotes a single character, and O denotes a character that is not part of the log structure.
Further, the log structure information labeled in step S22 includes a log structure character string and a corresponding character tag string, where each character in the log structure character string is a distinct observation and each tag in the character tag string is a distinct state.
Further, the specific process of constructing the hidden Markov model in step S22 is as follows:
S221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
S222, counting the probabilities of states emitting observations in the log database to obtain an observation probability matrix;
S223, counting the initial state probabilities in the log database to obtain the initial probability distribution;
S224, constructing the hidden Markov model from the trained state transition matrix, observation probability matrix, and initial probability distribution.
Further, the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
The observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
The initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
Further, the preprocessing in step S3 specifically refers to cleaning invalid characters from the target log, including garbled characters, carriage-return symbols, and spaces.
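A minimal Python sketch of such a preprocessing step (the function name and the exact character filtering are illustrative assumptions, not specified by the patent):

def preprocess_log(line: str) -> str:
    # Trim carriage returns, newlines, tabs, and surrounding spaces from both ends.
    line = line.strip(" \t\r\n")
    # Drop non-printable characters, a rough stand-in for "garbled" bytes.
    return "".join(ch for ch in line if ch.isprintable())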
Compared with the prior art, the invention has the following advantages. The log analysis model is built on a hidden Markov model, so when different types of log data are processed, the logs can be analyzed automatically without manually formulating regular expressions or retraining the model; this improves analysis speed and greatly saves the manpower and time spent on log analysis. In addition, the hidden Markov model is combined with the Viterbi algorithm to obtain the maximum-probability path, which ensures the accuracy of log analysis.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a process diagram of log parsing model construction in an embodiment;
FIG. 3 is a process for applying the log parsing model in an embodiment;
FIG. 4 is a sample of Apache access data in the embodiment;
FIG. 5 is a sample of Apache error log data in the embodiment;
FIG. 6 is a sample of Aruba wireless data in the embodiment;
FIG. 7 is a sample of Nginx access data in the embodiment;
FIG. 8 is a sample of Nginx error data in the embodiment;
FIG. 9 is a sample of Exchange data in the embodiment;
FIG. 10 is a sample of the Juniper firewall log in the embodiment;
FIG. 11 is a sample of VPN data in the embodiment;
FIG. 12 is a diagram illustrating a log structure annotation in an embodiment;
FIG. 13 is a diagram illustrating a process of computing a maximum probability path;
FIG. 14 is a flow chart illustrating the usage of the REST API service in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in FIG. 1, a method for automatically parsing a log includes the following steps:
S1, obtaining sample log data;
S2, respectively establishing a log database and a log analysis model according to the sample log data;
S3, acquiring target log data and preprocessing it;
S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log.
In this embodiment, the method is used to automatically analyze text logs, and an application service based on a REST (Representational State Transfer) API is built at the same time, as shown in FIGS. 2 to 3:
1. Preparation work
Before the logs are analyzed, log data must be collected; this includes establishing the log database and determining the log entity tags.
1.1 Log repository establishment
Various types of log data are collected, and effective information of the log data is marked.
1.2 Log parsing model construction
A model is built from the collected log data according to the marked log structure information. The present invention uses a hidden Markov model as the analysis model and therefore calculates the hidden Markov model's three parameters: the initial probability distribution, the state transition probability matrix, and the observation probability matrix.
Specifically, a Hidden Markov Model (HMM) is a probabilistic graphical model. HMMs are used primarily to describe the transitions of hidden states in a system and the probabilities of observations appearing given those hidden states. The power of an HMM lies in its ability to estimate the hidden variable sequence corresponding to a given observed variable sequence and to make predictions about future observed variables.
Speech recognition is an example: given a piece of audio data, the task is to recognize the text corresponding to the audio. Here the audio data is the observed variable and the text is the hidden variable. Pronunciation varies slightly with context, but is statistically regular overall; on the other hand, when we speak a sentence, there are transfer regularities between successive words.
In terms of model representation:
the HMM includes three parameters, an initial probability distribution, a state transition probability matrix, and an observation probability matrix.
Let Q be the set of all possible states and V the set of all possible observations:
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M}
where N is the number of possible states and M is the number of possible observations.
Let I be a state sequence of length T and O the corresponding observation sequence:
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
π is the initial state probability vector:
π = (π_i)
where
π_i = P(i_1 = q_i), i = 1, 2, ..., N
is the probability of being in state q_i at time t = 1.
A is the state transition probability matrix:
A = [a_ij]_{N×N}
where
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
is the probability of transitioning to state q_j at time t+1 given state q_i at time t.
B is the observation probability matrix:
B = [b_j(k)]_{N×M}
where
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
is the probability of generating observation v_k at time t given state q_j.
HMMs mainly address three problems:
Probability calculation. Given the model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), compute the probability P(O|λ) that the observation sequence O occurs under model λ.
Learning. Given an observation sequence O = (o_1, o_2, ..., o_T), estimate the parameters of the model λ = (A, B, π) so as to maximize the observation sequence probability P(O|λ) under that model; i.e., estimate the parameters by maximum likelihood estimation.
Prediction, also known as the decoding problem. Given the model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), find the state sequence I = (i_1, i_2, ..., i_T) with the maximum conditional probability P(I|O); i.e., given the observation sequence, find the most likely corresponding state sequence.
When labeling the logs, the structure of each log is labeled first, preliminarily determining the log structures of the different log types. In this embodiment, eight relatively typical log types are selected and labeled: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN; their data samples are shown in FIGS. 4 to 11, respectively.
The log content is then marked. The internal structure of the log is shown in Table 1:
TABLE 1 (table image not reproduced in the text version)
The log data is marked as follows. For sequence annotation problems, annotation generally uses identifiers such as B, M, E, S, O: B, M, and E denote the beginning, middle, and end of a character string respectively, S denotes a single character, and O denotes a character that is not part of any log structure. The internal log structure above is annotated with B, M, E, S accordingly. For the log: 192.168.3.1 - - [08/Aug/2017:00:31:26 +0800] "GET /qx/xts/images/x_gkbg.jpg HTTP/1.1" 200 1171, the corresponding labels are as shown in FIG. 12.
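The mapping from segmented fields to per-character tags can be mechanized; the following Python helper is an illustrative sketch (the function and the data layout it assumes are modeled on the example sequences below, not code from the patent):

def bmes_tags(segments):
    # segments: list of (text, field_type) pairs; field_type None means
    # the text is not part of any log structure (it receives the "o" label).
    pairs = []
    for text, field in segments:
        if field is None:
            pairs.extend((ch, "o-s") for ch in text)
        elif len(text) == 1:
            pairs.append((text, field + "-s"))                     # single character
        else:
            pairs.append((text[0], field + "-b"))                  # beginning
            pairs.extend((ch, field + "-m") for ch in text[1:-1])  # middle
            pairs.append((text[-1], field + "-e"))                 # end
    return pairs

For instance, bmes_tags([("127.0.0.1", "host"), (" ", None), ("get", "http-method")]) would reproduce tag strings of the form shown in the example sequences below.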
The collected sample logs are marked one by one in this way to construct the hidden Markov model. For the hidden Markov model, three important variables must first be counted: π, A, and B. For a log, the log's character string is the observation sequence, each character is an observation, and the characters' labels are the hidden variables, i.e., the states in the hidden Markov model. The calculation of these parameters is presented below.
The state transition matrix is an M × M matrix, M being the number of states, i.e., of log labels. It is computed by counting:
a_ij = (number of times state q_i is immediately followed by state q_j) / (number of occurrences of state q_i)
The observation probability matrix is an M × N matrix, where M is the number of log labels and N is the number of character types. It is computed as:
b_j(k) = (number of times character v_k is labeled with state q_j) / (number of occurrences of state q_j)
The initial state probability π_i is computed as the proportion of the S logs whose initial state is q_i.
For example, take three pieces of log data, i.e., three observation sequences:
"127.0.0.1 get 200"
"192.168.10.1 post 404"
"127.0.0.1 get 403"
The log structure sequences are shown below, with the log structure type in square brackets and \s representing a space character.
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]"
"1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]"
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]"
Then we can get a total of ten states as { host-b, host-m, host-e, o-s, http-method-b, http-method-m, http-method-e, http-code-b, http-code-m, http-code-e }.
The observation set is {1, 2, 7, 0, 9, 6, 8, 4, 3, g, e, t, p, o, s, ., \s}, for a total of seventeen observations, where \s represents a space.
First, we count adjacent state pairs to estimate the probability of moving from one state to the next. For example, to compute the state transition probability p(host-e | host-m) from "host-m" to "host-e": host-m is immediately followed by host-e 3 times, while host-m occurs 24 times in total. The state transition probability from host-m to host-e is therefore
p(host-e | host-m) = 3/24 = 0.125
In this way we obtain a 10 × 10 state transition matrix A.
Second, we count the observation probability matrix. Suppose we want to compute the observation probability p(3 | http-code-e) of "http-code-e" emitting "3": the observation character "3" is labeled with the state "http-code-e" 1 time, and the state "http-code-e" occurs 3 times in total, so the observation probability from state "http-code-e" to character "3" is
p(3 | http-code-e) = 1/3 ≈ 0.33
In this way we obtain a 10 × 17 observation probability matrix B.
Finally, we count the initial state probability π. There are 3 sequences in total, all of which begin in state "host-b" (3 occurrences), and no other state appears initially; therefore the initial probability of state "host-b" is 1.0 and that of every other state is 0.
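The counting described above is straightforward to implement; here is a hedged Python sketch (the function name and data layout are assumptions: each training log is a list of (character, state) pairs):

from collections import Counter, defaultdict

def estimate_hmm(tagged_logs):
    # tagged_logs: list of sequences, each a list of (character, state) pairs.
    init = Counter()                 # counts of initial states
    trans = defaultdict(Counter)     # trans[q_i][q_j]: adjacent-state counts
    emit = defaultdict(Counter)      # emit[q_j][v_k]: state-to-character counts
    for seq in tagged_logs:
        init[seq[0][1]] += 1
        for ch, state in seq:
            emit[state][ch] += 1
        for (_, s_prev), (_, s_next) in zip(seq, seq[1:]):
            trans[s_prev][s_next] += 1
    pi = {q: n / len(tagged_logs) for q, n in init.items()}
    A = {q: {r: n / sum(row.values()) for r, n in row.items()}
         for q, row in trans.items()}
    B = {q: {v: n / sum(row.values()) for v, n in row.items()}
         for q, row in emit.items()}
    return pi, A, B

On the three example sequences above, this counting should reproduce, e.g., A["host-m"]["host-e"] = 3/24 = 0.125 and pi["host-b"] = 1.0. (The denominator here counts outgoing transitions, which coincides with the occurrence count of host-m in this example, since host-m never ends a sequence.)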
2. Log structure parsing
The steps of log structure parsing are described below using a piece of log data as an example.
The first step: preprocess the input log, cleaning carriage returns, spaces, and garbled characters from the front and back ends of the log.
The second step: parse the structure with the Viterbi algorithm using the trained initial probability distribution, state transition probability matrix, and observation probability matrix, selecting the structure with the maximum probability.
The third step: output the parsed structure of the log, extract the effective information in it, and mark the corresponding positions.
Specifically, for a newly input log, the log data is preprocessed first; preprocessing mainly removes invalid characters such as garbled text. Then the optimal parse is found with the Viterbi algorithm based on the three hidden Markov parameters. The Viterbi algorithm is a dynamic-programming method for finding the most probable path; here a path corresponds to a log parsing structure.
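A compact Python sketch of the Viterbi step over the dictionary-shaped parameters sketched earlier (illustrative only, not the patent's code):

def viterbi(obs, states, pi, A, B):
    # delta[t][q]: max probability of any path ending in state q at time t.
    delta = [{q: pi.get(q, 0.0) * B.get(q, {}).get(obs[0], 0.0) for q in states}]
    psi = [{}]                       # psi[t][q]: best predecessor of q at time t
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for q in states:
            best = max(states, key=lambda j: delta[t - 1][j] * A.get(j, {}).get(q, 0.0))
            psi[t][q] = best
            delta[t][q] = (delta[t - 1][best] * A.get(best, {}).get(q, 0.0)
                           * B.get(q, {}).get(obs[t], 0.0))
    # Trace the maximum-probability path back from the best final state.
    path = [max(states, key=lambda q: delta[-1][q])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path

In practice one would work with log-probabilities to avoid underflow on long logs and smooth characters unseen in training; this sketch assumes every character was observed during training.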
For example, suppose we obtain a state transition matrix A (a 3 × 3 matrix; the matrix image is not reproduced in the text version) and an observation probability matrix B (a 3 × 2 matrix; the image is likewise not reproduced, though the calculations below imply its "m" column is (0.3, 0.6, 0.4) and its "n" column is (0.7, 0.4, 0.6)), with initial probability distribution:
π = (0.3, 0.2, 0.5)^T
The state set is {"a", "b", "c"} and the observation set is {"m", "n"}; we solve for the optimal parsing structure of the observation sequence ("m", "n", "m").
First, at initialization, for each state i, i = 1, 2, 3, compute the probability of being in state i at time t = 1 and observing the character o_1 = "m"; denote this probability δ_1(i). Then
δ_1(i) = π_i b_i(o_1) = π_i b_i(m), i = 1, 2, 3
Substituting the actual data:
δ_1(1) = 0.3 × 0.3 = 0.09
δ_1(2) = 0.2 × 0.6 = 0.12
δ_1(3) = 0.5 × 0.4 = 0.20
ψ_1(i) = 0, i = 1, 2, 3.
At t = 2, for each state i, i = 1, 2, 3, find the maximum probability over paths that are in some state j at t = 1 and in state i at t = 2 observing the character o_2 = "n"; denote this probability δ_2(i). Then
δ_2(i) = max_{1≤j≤3} [δ_1(j) a_ji] b_i(o_2)
Meanwhile, for each state i, i = 1, 2, 3, record the previous state j on that most probable path:
ψ_2(i) = arg max_{1≤j≤3} [δ_1(j) a_ji]
and (3) calculating:
Figure BDA0002396094240000093
ψ2(1)=3
δ2(2)=0.024,ψ2(2)=3
δ2(3)=0.048,ψ2(3)=3
Similarly, at t = 3:
δ_3(i) = max_{1≤j≤3} [δ_2(j) a_ji] b_i(o_3)
ψ_3(i) = arg max_{1≤j≤3} [δ_2(j) a_ji]
δ_3(1) = 0.00756, ψ_3(1) = 1
δ_3(2) = 0.00864, ψ_3(2) = 3
δ_3(3) = 0.00768, ψ_3(3) = 3
Let P* denote the probability of the optimal path; then
P* = max_{1≤i≤3} δ_3(i) = 0.00864
The end point of the optimal path is
i*_3 = arg max_{1≤i≤3} δ_3(i) = 2
From the end point of the optimal path, trace backward:
at t = 2, i*_2 = ψ_3(i*_3) = ψ_3(2) = 3;
at t = 1, i*_1 = ψ_2(i*_2) = ψ_2(3) = 3.
The optimal path, i.e., the optimal state sequence, is thus
I* = (i*_1, i*_2, i*_3) = (3, 3, 2)
i.e., ("c", "c", "b"). FIG. 13 shows the process of computing the maximum-probability path.
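Once the optimal tag sequence is decoded, the third step (extracting the effective information) reduces to grouping characters by their B/M/E/S spans; a sketch under the same assumptions as the helpers above:

def extract_fields(chars, tags):
    # chars: the log characters; tags: decoded labels such as "host-b", "o-s".
    fields, buf, current = [], [], None
    for ch, tag in zip(chars, tags):
        field, pos = tag.rsplit("-", 1)
        if field == "o":             # character outside any log structure
            continue
        if pos in ("b", "s"):        # a field begins (or is a single character)
            buf, current = [ch], field
        else:
            buf.append(ch)
        if pos in ("e", "s"):        # a field ends: emit (type, text)
            fields.append((current, "".join(buf)))
    return fields

For example, decoding "127.0.0.1 get 200" with the parameters counted earlier should yield [("host", "127.0.0.1"), ("http-method", "get"), ("http-code", "200")].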
After the model is built, this embodiment evaluates the log parsing model. The requirement on the parsing model is to "discover as many log entities as possible, and make the discovered log entities as accurate as possible"; that is, both the recall and the precision should be high. To balance recall against precision, the f1-measure is used to evaluate the model.
precision = correct_extract / extract_entity
recall = correct_extract / data_entity
f1-measure = 2 × precision × recall / (precision + recall)
In the formulas above, correct_extract denotes the number of correctly extracted log entities, extract_entity denotes the total number of log entities extracted, and data_entity denotes the number of log entities in the data. For example, consider a log of the following format:
"Jan 12 17:47:48 127.0.0.1 xxx, info, download, 175.42.41.4"
The correct parsing structure is "Jan 12 17:47:48", "127.0.0.1", "info", "175.42.41.4". If the model parses it into "Jan", "12", "17:47:48", "127.0.0.1", "info", "175.42.41.4", then the evaluation is as follows:
correct_extract = {"127.0.0.1", "info", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1", "info", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1", "info", "175.42.41.4"}
precision = 3/6 = 0.5
recall = 3/4 = 0.75
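The three measures are easy to compute over entity sets; a minimal sketch (treating entities as exact strings is an assumption here — positional matching would be stricter):

def evaluate(extracted, truth):
    # correct_extract: entities that appear in both the output and the ground truth.
    correct = len(set(extracted) & set(truth))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

On the example above this gives precision 0.5, recall 0.75, and f1-measure 0.6.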
To evaluate the parsing results, this embodiment splits each log data set, taking 60% of the data as the training set and 40% as the test set. The model trained on the training set is used to predict the test set, and the results are evaluated. Table 2 shows the parsing results for the various log models.
TABLE 2 (table images not reproduced in the text version)
To verify the applicability of the model to logs with large data volumes, this embodiment also tests the parsing speed on each log type; the results are shown in Table 3:
TABLE 3
Log categories Number of logs File size File format Analysis time
Apache Access 523 51KB txt 0.1s
Apache error 30001 4.23MB txt 4.1s
Aruba wireless 380752 62.6MB txt 63.2s
Nginx access 2231408 482MB txt 420.1s
Nginx error 33026 13.5MB txt 10.1s
Exchange 648492 357MB txt 301.2s
Juniper firewall log 33034 12.4MB txt 23.5s
VPN log 18581 2.64MB txt 2.5s
3. Construction of the REST service
The log parsing method is wrapped as a library and exposed through a REST service, so that users can call it via the REST API.
To put log parsing into practical use, this embodiment provides a log parsing service based on a REST API, which is convenient for users.
The architecture is shown in FIG. 14. The service is implemented in Python 3; the tornado framework serves as the basic framework of the REST service, the log parsing and classification are integrated into the service as a library, and a REST API is provided. The interface design is shown in Table 4:
TABLE 4
(table images not reproduced in the text version)
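Since Table 4 is not reproduced, the following is only a plausible tornado sketch of such an endpoint; the route path, the request field, and the globals (STATES, PI, A, B, plus the helper functions sketched earlier) are all assumptions, not the patent's interface:

import json

import tornado.ioloop
import tornado.web

class ParseHandler(tornado.web.RequestHandler):
    def post(self):
        # Expect a JSON body such as {"log": "<raw log line>"}.
        body = json.loads(self.request.body)
        line = preprocess_log(body["log"])
        tags = viterbi(line, STATES, PI, A, B)   # trained model parameters
        self.write({"fields": extract_fields(line, tags)})

def make_app():
    return tornado.web.Application([(r"/api/v1/parse", ParseHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()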
In summary, the traditional manual parsing technique requires formulating a large number of regular expressions for different types of logs, whereas the present method uses natural language processing and data mining techniques, needs no manually formulated regular expressions, and saves labor and time. Manually formulated regular expressions must also be rewritten whenever the log structure changes, while the method used by the invention needs no retraining. This embodiment additionally builds the log parsing service on a REST API, providing an approach for putting log parsing into engineering practice.

Claims (8)

1. A method for automatically parsing a log, comprising the steps of:
S1, obtaining sample log data;
S2, respectively establishing a log database and a log analysis model according to the sample log data;
S3, acquiring target log data and preprocessing it;
S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log.
2. The method for automatically parsing a log according to claim 1, wherein step S2 specifically includes the following steps:
S21, marking the structure of the sample logs according to the effective information of the sample log data to establish a log database;
S22, constructing a hidden Markov model from the marked log structure information in the log database to serve as the log analysis model.
3. The method for automatically parsing a log according to claim 2, wherein the sample log data in the step S21 includes eight kinds of log data: apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN.
4. The method as claimed in claim 2, wherein in step S21, when labeling the structure of the sample log, the log structure is labeled with the identifiers B, M, E, S, O to obtain labels corresponding one-to-one to the characters in the log structure, wherein B, M, and E represent the beginning, middle, and end of a character string respectively, S represents a single character, and O represents a character that is not part of the log structure.
5. The method of claim 4, wherein the log structure information labeled in step S22 includes a log structure character string and a corresponding character tag string, wherein each character in the log structure character string is a distinct observation and each tag in the character tag string is a distinct state.
6. The method for automatically parsing a log according to claim 5, wherein the specific process of constructing the hidden Markov model in step S22 is as follows:
S221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
S222, counting the probabilities of states emitting observations in the log database to obtain an observation probability matrix;
S223, counting the initial state probabilities in the log database to obtain the initial probability distribution;
S224, constructing the hidden Markov model from the trained state transition matrix, observation probability matrix, and initial probability distribution.
7. The method for automatically parsing a log according to claim 6, wherein the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
and the initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
wherein Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
8. The method according to claim 1, wherein the preprocessing in step S3 specifically refers to cleaning invalid characters from the target log, including garbled characters, carriage-return symbols, and spaces.
CN202010132165.XA 2020-02-29 2020-02-29 Method for automatically analyzing log Active CN111367964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010132165.XA CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010132165.XA CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Publications (2)

Publication Number Publication Date
CN111367964A true CN111367964A (en) 2020-07-03
CN111367964B CN111367964B (en) 2023-11-17

Family

ID=71206461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010132165.XA Active CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Country Status (1)

Country Link
CN (1) CN111367964B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912570A (en) * 2016-03-29 2016-08-31 北京工业大学 English resume key field extraction method based on hidden Markov model
CN107070852A (en) * 2016-12-07 2017-08-18 东软集团股份有限公司 Network attack detecting method and device
CN107273269A (en) * 2017-06-12 2017-10-20 北京奇虎科技有限公司 Daily record analysis method and device
CN108021552A (en) * 2017-11-09 2018-05-11 国网浙江省电力公司电力科学研究院 A kind of power system operation ticket method for extracting content and system
CN108881194A (en) * 2018-06-07 2018-11-23 郑州信大先进技术研究院 Enterprises user anomaly detection method and device
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN109947891A (en) * 2017-11-07 2019-06-28 北京国双科技有限公司 Document analysis method and device


Also Published As

Publication number Publication date
CN111367964B (en) 2023-11-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant