CN111367964A - Method for automatically analyzing log - Google Patents

Method for automatically analyzing log

Info

Publication number
CN111367964A
CN111367964A (application CN202010132165.XA; granted as CN111367964B)
Authority
CN
China
Prior art keywords
log
probability
analysis
state
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010132165.XA
Other languages
Chinese (zh)
Other versions
CN111367964B (en)
Inventor
Li Ningning (李宁宁)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eisoo Information Technology Co Ltd
Original Assignee
Shanghai Eisoo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eisoo Information Technology Co Ltd filed Critical Shanghai Eisoo Information Technology Co Ltd
Priority to CN202010132165.XA priority Critical patent/CN111367964B/en
Publication of CN111367964A publication Critical patent/CN111367964A/en
Application granted granted Critical
Publication of CN111367964B publication Critical patent/CN111367964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for automatically analyzing logs, which comprises the following steps: S1, obtaining sample log data; S2, respectively establishing a log database and a log analysis model according to the sample log data; S3, acquiring target log data and preprocessing it; S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path; and S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log. Compared with the prior art, the method builds a hidden Markov log analysis model and combines it with the Viterbi algorithm, overcoming the low efficiency of the traditional approach of manually formulating regular expressions; it can rapidly and accurately identify the internal structure of a log automatically and extract the effective information.

Description

Method for automatically analyzing log
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for automatically analyzing logs.
Background
With the continuing development of computer technology, computer systems have become increasingly complex. For IT operation and maintenance, raw logs cannot directly provide valid information; the fields in a raw log must first be parsed before the valid information can be extracted. The traditional log analysis method is to manually write corresponding regular-expression rules. This approach works well when there are few log categories and the log structure changes infrequently. However, as more and more functions are integrated into a system, a large number of IT subsystems arise and, with them, a large volume of log data of various types. Designing a regular matching rule for every log is extremely time- and labor-consuming. How to rapidly and accurately parse text logs has therefore become an urgent problem.
Disclosure of Invention
The present invention aims to overcome the defects of the prior art and provide a method for automatically analyzing a log which, based on natural language processing technology, automatically identifies the internal structure of a text log by computer so as to quickly and accurately extract effective information from the log.
The purpose of the invention can be realized by the following technical scheme: a method of automatically parsing a log, comprising the steps of:
S1, obtaining sample log data;
S2, respectively establishing a log database and a log analysis model according to the sample log data;
S3, acquiring target log data and preprocessing it;
S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log.
Further, the step S2 specifically includes the following steps:
S21, marking the structure of the sample logs according to the effective information of the sample log data to establish a log database;
S22, constructing a hidden Markov model from the marked log structure information in the log database to serve as the log analysis model.
Further, the sample log data in step S21 includes eight kinds of log data: apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN.
Further, when the structure of the sample log is labeled in step S21, the log structure is specifically labeled with the identifiers B, M, E, S, O to obtain labels corresponding one-to-one to the characters in the log structure, where B, M, and E denote the beginning, middle, and end of a character string respectively, S denotes a single character, and O denotes a character that is not part of the log structure.
Further, the log structure information labeled in step S22 includes a log structure character string and a corresponding character tag string, where each character in the log structure character string is a distinct observation and each tag in the character tag string is a distinct state.
Further, the specific process of constructing the hidden Markov model in step S22 is as follows:
S221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
S222, counting the probabilities of states emitting observations in the log database to obtain an observation probability matrix;
S223, counting the initial state probabilities in the log database to obtain the initial probability distribution;
S224, constructing the hidden Markov model from the trained state transition matrix, observation probability matrix, and initial probability distribution.
Further, the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
The observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
The initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
Further, the preprocessing in step S3 specifically refers to cleaning invalid characters from the target log, including garbled characters, carriage-return symbols, and spaces.
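A minimal Python sketch of such a preprocessing step (the function name and the exact character filtering are illustrative assumptions, not specified by the patent):

def preprocess_log(line: str) -> str:
    # Trim carriage returns, newlines, tabs, and surrounding spaces from both ends.
    line = line.strip(" \t\r\n")
    # Drop non-printable characters, a rough stand-in for "garbled" bytes.
    return "".join(ch for ch in line if ch.isprintable())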
Compared with the prior art, the invention has the following advantages. The log analysis model is built on a hidden Markov model, so when different types of log data are processed, the logs can be analyzed automatically without manually formulating regular expressions or retraining the model; this improves analysis speed and greatly saves the manpower and time spent on log analysis. In addition, the hidden Markov model is combined with the Viterbi algorithm to obtain the maximum-probability path, which ensures the accuracy of log analysis.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a process diagram of log parsing model construction in an embodiment;
FIG. 3 is a process for applying the log parsing model in an embodiment;
FIG. 4 is a sample of Apache access data in the embodiment;
FIG. 5 is a sample of Apache error log data in the embodiment;
FIG. 6 is a sample of Aruba wireless data in the embodiment;
FIG. 7 is a sample of Nginx access data in the embodiment;
FIG. 8 is a sample of Nginx error data in the embodiment;
FIG. 9 is a sample of Exchange data in the embodiment;
FIG. 10 is a sample of the Juniper firewall log in the embodiment;
FIG. 11 is a sample of VPN data in the embodiment;
FIG. 12 is a diagram illustrating a log structure annotation in an embodiment;
FIG. 13 is a diagram illustrating a process of computing a maximum probability path;
FIG. 14 is a flow chart illustrating the usage of the REST API service in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in FIG. 1, a method for automatically parsing a log includes the following steps:
S1, obtaining sample log data;
S2, respectively establishing a log database and a log analysis model according to the sample log data;
S3, acquiring target log data and preprocessing it;
S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log.
In this embodiment, the method is used to automatically analyze text logs, and an application service based on a REST (Representational State Transfer) API is built at the same time, as shown in FIGS. 2 to 3:
1. Preparation work
Before the logs are analyzed, log data must be collected; this includes establishing the log database and determining the log entity tags.
1.1 Log repository establishment
Various types of log data are collected, and effective information of the log data is marked.
1.2 Log parsing model construction
A model is built from the collected log data according to the marked log structure information. The present invention uses a hidden Markov model as the analysis model and therefore calculates the hidden Markov model's three parameters: the initial probability distribution, the state transition probability matrix, and the observation probability matrix.
Specifically, a Hidden Markov Model (HMM) is a probabilistic graphical model. HMMs are used primarily to describe the transitions of hidden states in a system and the probabilities of observations appearing given those hidden states. The power of an HMM lies in its ability to estimate the hidden variable sequence corresponding to a given observed variable sequence and to make predictions about future observed variables.
Speech recognition is an example: given a piece of audio data, the task is to recognize the text corresponding to the audio. Here the audio data is the observed variable and the text is the hidden variable. Pronunciation varies slightly with context, but is statistically regular overall; on the other hand, when we speak a sentence, there are transfer regularities between successive words.
In terms of model representation:
the HMM includes three parameters, an initial probability distribution, a state transition probability matrix, and an observation probability matrix.
Let Q be the set of all possible states and V the set of all possible observations:
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M}
where N is the number of possible states and M is the number of possible observations.
Let I be a state sequence of length T and O the corresponding observation sequence:
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
π is the initial state probability vector:
π = (π_i)
where
π_i = P(i_1 = q_i), i = 1, 2, ..., N
is the probability of being in state q_i at time t = 1.
A is the state transition probability matrix:
A = [a_ij]_{N×N}
where
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
is the probability of transitioning to state q_j at time t+1 given state q_i at time t.
B is the observation probability matrix:
B = [b_j(k)]_{N×M}
where
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
is the probability of generating observation v_k at time t given state q_j.
HMMs mainly address three problems:
Probability calculation. Given the model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), compute the probability P(O|λ) that the observation sequence O occurs under model λ.
Learning. Given an observation sequence O = (o_1, o_2, ..., o_T), estimate the parameters of the model λ = (A, B, π) so as to maximize the observation sequence probability P(O|λ) under that model; i.e., estimate the parameters by maximum likelihood estimation.
Prediction, also known as the decoding problem. Given the model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), find the state sequence I = (i_1, i_2, ..., i_T) with the maximum conditional probability P(I|O); i.e., given the observation sequence, find the most likely corresponding state sequence.
When labeling the logs, the structure of each log is labeled first, preliminarily determining the log structures of the different log types. In this embodiment, eight relatively typical log types are selected and labeled: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN; their data samples are shown in FIGS. 4 to 11, respectively.
The log content is then marked. The internal structure of the log is shown in Table 1:
TABLE 1 (table image not reproduced in the text version)
The log data is marked as follows. For sequence annotation problems, annotation generally uses identifiers such as B, M, E, S, O: B, M, and E denote the beginning, middle, and end of a character string respectively, S denotes a single character, and O denotes a character that is not part of any log structure. The internal log structure above is annotated with B, M, E, S accordingly. For the log: 192.168.3.1 - - [08/Aug/2017:00:31:26 +0800] "GET /qx/xts/images/x_gkbg.jpg HTTP/1.1" 200 1171, the corresponding labels are as shown in FIG. 12.
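The mapping from segmented fields to per-character tags can be mechanized; the following Python helper is an illustrative sketch (the function and the data layout it assumes are modeled on the example sequences below, not code from the patent):

def bmes_tags(segments):
    # segments: list of (text, field_type) pairs; field_type None means
    # the text is not part of any log structure (it receives the "o" label).
    pairs = []
    for text, field in segments:
        if field is None:
            pairs.extend((ch, "o-s") for ch in text)
        elif len(text) == 1:
            pairs.append((text, field + "-s"))                     # single character
        else:
            pairs.append((text[0], field + "-b"))                  # beginning
            pairs.extend((ch, field + "-m") for ch in text[1:-1])  # middle
            pairs.append((text[-1], field + "-e"))                 # end
    return pairs

For instance, bmes_tags([("127.0.0.1", "host"), (" ", None), ("get", "http-method")]) would reproduce tag strings of the form shown in the example sequences below.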
The collected sample logs are marked one by one in this way to construct the hidden Markov model. For the hidden Markov model, three important variables must first be counted: π, A, and B. For a log, the log's character string is the observation sequence, each character is an observation, and the characters' labels are the hidden variables, i.e., the states in the hidden Markov model. The calculation of these parameters is presented below.
The state transition matrix is an M × M matrix, M being the number of states, i.e., of log labels. It is computed by counting:
a_ij = (number of times state q_i is immediately followed by state q_j) / (number of occurrences of state q_i)
The observation probability matrix is an M × N matrix, where M is the number of log labels and N is the number of character types. It is computed as:
b_j(k) = (number of times character v_k is labeled with state q_j) / (number of occurrences of state q_j)
The initial state probability π_i is computed as the proportion of the S logs whose initial state is q_i.
For example, take three pieces of log data, i.e., three observation sequences:
"127.0.0.1 get 200"
"192.168.10.1 post 404"
"127.0.0.1 get 403"
The log structure sequences are shown below, with the log structure type in square brackets and \s representing a space character.
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]"
"1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]"
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]"
Then we can get a total of ten states as { host-b, host-m, host-e, o-s, http-method-b, http-method-m, http-method-e, http-code-b, http-code-m, http-code-e }.
The observation set is {1, 2, 7, 0, 9, 6, 8, 4, 3, g, e, t, p, o, s, ., \s}, for a total of seventeen observations, where \s represents a space.
First, we count adjacent state pairs to estimate the probability of moving from one state to the next. For example, to compute the state transition probability p(host-e | host-m) from "host-m" to "host-e": host-m is immediately followed by host-e 3 times, while host-m occurs 24 times in total. The state transition probability from host-m to host-e is therefore
p(host-e | host-m) = 3/24 = 0.125
In this way we obtain a 10 × 10 state transition matrix A.
Second, we count the observation probability matrix. Suppose we want to compute the observation probability p(3 | http-code-e) of "http-code-e" emitting "3": the observation character "3" is labeled with the state "http-code-e" 1 time, and the state "http-code-e" occurs 3 times in total, so the observation probability from state "http-code-e" to character "3" is
p(3 | http-code-e) = 1/3 ≈ 0.33
In this way we obtain a 10 × 17 observation probability matrix B.
Finally, we count the initial state probability π. There are 3 sequences in total, all of which begin in state "host-b" (3 occurrences), and no other state appears initially; therefore the initial probability of state "host-b" is 1.0 and that of every other state is 0.
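The counting described above is straightforward to implement; here is a hedged Python sketch (the function name and data layout are assumptions: each training log is a list of (character, state) pairs):

from collections import Counter, defaultdict

def estimate_hmm(tagged_logs):
    # tagged_logs: list of sequences, each a list of (character, state) pairs.
    init = Counter()                 # counts of initial states
    trans = defaultdict(Counter)     # trans[q_i][q_j]: adjacent-state counts
    emit = defaultdict(Counter)      # emit[q_j][v_k]: state-to-character counts
    for seq in tagged_logs:
        init[seq[0][1]] += 1
        for ch, state in seq:
            emit[state][ch] += 1
        for (_, s_prev), (_, s_next) in zip(seq, seq[1:]):
            trans[s_prev][s_next] += 1
    pi = {q: n / len(tagged_logs) for q, n in init.items()}
    A = {q: {r: n / sum(row.values()) for r, n in row.items()}
         for q, row in trans.items()}
    B = {q: {v: n / sum(row.values()) for v, n in row.items()}
         for q, row in emit.items()}
    return pi, A, B

On the three example sequences above, this counting should reproduce, e.g., A["host-m"]["host-e"] = 3/24 = 0.125 and pi["host-b"] = 1.0. (The denominator here counts outgoing transitions, which coincides with the occurrence count of host-m in this example, since host-m never ends a sequence.)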
2. Log structure parsing
The steps of log structure parsing are described below using a piece of log data as an example.
The first step: preprocess the input log, cleaning carriage returns, spaces, and garbled characters from the front and back ends of the log.
The second step: parse the structure with the Viterbi algorithm using the trained initial probability distribution, state transition probability matrix, and observation probability matrix, selecting the structure with the maximum probability.
The third step: output the parsed structure of the log, extract the effective information in it, and mark the corresponding positions.
Specifically, for a newly input log, the log data is preprocessed first; preprocessing mainly removes invalid characters such as garbled text. Then the optimal parse is found with the Viterbi algorithm based on the three hidden Markov parameters. The Viterbi algorithm is a dynamic-programming method for finding the most probable path; here a path corresponds to a log parsing structure.
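A compact Python sketch of the Viterbi step over the dictionary-shaped parameters sketched earlier (illustrative only, not the patent's code):

def viterbi(obs, states, pi, A, B):
    # delta[t][q]: max probability of any path ending in state q at time t.
    delta = [{q: pi.get(q, 0.0) * B.get(q, {}).get(obs[0], 0.0) for q in states}]
    psi = [{}]                       # psi[t][q]: best predecessor of q at time t
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for q in states:
            best = max(states, key=lambda j: delta[t - 1][j] * A.get(j, {}).get(q, 0.0))
            psi[t][q] = best
            delta[t][q] = (delta[t - 1][best] * A.get(best, {}).get(q, 0.0)
                           * B.get(q, {}).get(obs[t], 0.0))
    # Trace the maximum-probability path back from the best final state.
    path = [max(states, key=lambda q: delta[-1][q])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path

In practice one would work with log-probabilities to avoid underflow on long logs and smooth characters unseen in training; this sketch assumes every character was observed during training.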
For example, suppose we obtain a state transition matrix A (a 3 × 3 matrix; the matrix image is not reproduced in the text version) and an observation probability matrix B (a 3 × 2 matrix; the image is likewise not reproduced, though the calculations below imply its "m" column is (0.3, 0.6, 0.4) and its "n" column is (0.7, 0.4, 0.6)), with initial probability distribution:
π = (0.3, 0.2, 0.5)^T
The state set is {"a", "b", "c"} and the observation set is {"m", "n"}; we solve for the optimal parsing structure of the observation sequence ("m", "n", "m").
First, at initialization, for each state i, i = 1, 2, 3, compute the probability of being in state i at time t = 1 and observing the character o_1 = "m"; denote this probability δ_1(i). Then
δ_1(i) = π_i b_i(o_1) = π_i b_i(m), i = 1, 2, 3
Substituting the actual data:
δ_1(1) = 0.3 × 0.3 = 0.09
δ_1(2) = 0.2 × 0.6 = 0.12
δ_1(3) = 0.5 × 0.4 = 0.20
ψ_1(i) = 0, i = 1, 2, 3.
At t = 2, for each state i, i = 1, 2, 3, find the maximum probability over paths that are in some state j at t = 1 and in state i at t = 2 observing the character o_2 = "n"; denote this probability δ_2(i). Then
δ_2(i) = max_{1≤j≤3} [δ_1(j) a_ji] b_i(o_2)
Meanwhile, for each state i, i = 1, 2, 3, record the previous state j on that most probable path:
ψ_2(i) = arg max_{1≤j≤3} [δ_1(j) a_ji]
and (3) calculating:
Figure BDA0002396094240000093
ψ2(1)=3
δ2(2)=0.024,ψ2(2)=3
δ2(3)=0.048,ψ2(3)=3
Similarly, at t = 3:
δ_3(i) = max_{1≤j≤3} [δ_2(j) a_ji] b_i(o_3)
ψ_3(i) = arg max_{1≤j≤3} [δ_2(j) a_ji]
δ_3(1) = 0.00756, ψ_3(1) = 1
δ_3(2) = 0.00864, ψ_3(2) = 3
δ_3(3) = 0.00768, ψ_3(3) = 3
Let P* denote the probability of the optimal path; then
P* = max_{1≤i≤3} δ_3(i) = 0.00864
The end point of the optimal path is
i*_3 = arg max_{1≤i≤3} δ_3(i) = 2
From the end point of the optimal path, trace backward:
at t = 2, i*_2 = ψ_3(i*_3) = ψ_3(2) = 3;
at t = 1, i*_1 = ψ_2(i*_2) = ψ_2(3) = 3.
The optimal path, i.e., the optimal state sequence, is thus
I* = (i*_1, i*_2, i*_3) = (3, 3, 2)
i.e., ("c", "c", "b"). FIG. 13 shows the process of computing the maximum-probability path.
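Once the optimal tag sequence is decoded, the third step (extracting the effective information) reduces to grouping characters by their B/M/E/S spans; a sketch under the same assumptions as the helpers above:

def extract_fields(chars, tags):
    # chars: the log characters; tags: decoded labels such as "host-b", "o-s".
    fields, buf, current = [], [], None
    for ch, tag in zip(chars, tags):
        field, pos = tag.rsplit("-", 1)
        if field == "o":             # character outside any log structure
            continue
        if pos in ("b", "s"):        # a field begins (or is a single character)
            buf, current = [ch], field
        else:
            buf.append(ch)
        if pos in ("e", "s"):        # a field ends: emit (type, text)
            fields.append((current, "".join(buf)))
    return fields

For example, decoding "127.0.0.1 get 200" with the parameters counted earlier should yield [("host", "127.0.0.1"), ("http-method", "get"), ("http-code", "200")].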
After the model is built, this embodiment evaluates the log parsing model. The requirement on the parsing model is to "discover as many log entities as possible, and make the discovered log entities as accurate as possible"; that is, both the recall and the precision should be high. To balance recall against precision, the f1-measure is used to evaluate the model.
precision = correct_extract / extract_entity
recall = correct_extract / data_entity
f1-measure = 2 × precision × recall / (precision + recall)
In the formulas above, correct_extract denotes the number of correctly extracted log entities, extract_entity denotes the total number of log entities extracted, and data_entity denotes the number of log entities in the data. For example, consider a log of the following format:
"Jan 12 17:47:48 127.0.0.1 xxx, info, download, 175.42.41.4"
The correct parsing structure is "Jan 12 17:47:48", "127.0.0.1", "info", "175.42.41.4". If the model parses it into "Jan", "12", "17:47:48", "127.0.0.1", "info", "175.42.41.4", then the evaluation is as follows:
correct_extract = {"127.0.0.1", "info", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1", "info", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1", "info", "175.42.41.4"}
precision = 3/6 = 0.5
recall = 3/4 = 0.75
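The three measures are easy to compute over entity sets; a minimal sketch (treating entities as exact strings is an assumption here — positional matching would be stricter):

def evaluate(extracted, truth):
    # correct_extract: entities that appear in both the output and the ground truth.
    correct = len(set(extracted) & set(truth))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

On the example above this gives precision 0.5, recall 0.75, and f1-measure 0.6.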
To evaluate the parsing results, this embodiment splits each log data set, taking 60% of the data as the training set and 40% as the test set. The model trained on the training set is used to predict the test set, and the results are evaluated. Table 2 shows the parsing results for the various log models.
TABLE 2 (table images not reproduced in the text version)
To verify the applicability of the model to logs with large data volumes, this embodiment also tests the parsing speed on each log type; the results are shown in Table 3:
TABLE 3
Log categories Number of logs File size File format Analysis time
Apache Access 523 51KB txt 0.1s
Apache error 30001 4.23MB txt 4.1s
Aruba wireless 380752 62.6MB txt 63.2s
Nginx access 2231408 482MB txt 420.1s
Nginx error 33026 13.5MB txt 10.1s
Exchange 648492 357MB txt 301.2s
Juniper firewall log 33034 12.4MB txt 23.5s
VPN log 18581 2.64MB txt 2.5s
3. Construction of the REST service
The log parsing method is wrapped as a library and exposed through a REST service, so that users can call it via the REST API.
To put log parsing into practical use, this embodiment provides a log parsing service based on a REST API, which is convenient for users.
The architecture is shown in FIG. 14. The service is implemented in Python 3; the tornado framework serves as the basic framework of the REST service, the log parsing and classification are integrated into the service as a library, and a REST API is provided. The interface design is shown in Table 4:
TABLE 4
(table images not reproduced in the text version)
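Since Table 4 is not reproduced, the following is only a plausible tornado sketch of such an endpoint; the route path, the request field, and the globals (STATES, PI, A, B, plus the helper functions sketched earlier) are all assumptions, not the patent's interface:

import json

import tornado.ioloop
import tornado.web

class ParseHandler(tornado.web.RequestHandler):
    def post(self):
        # Expect a JSON body such as {"log": "<raw log line>"}.
        body = json.loads(self.request.body)
        line = preprocess_log(body["log"])
        tags = viterbi(line, STATES, PI, A, B)   # trained model parameters
        self.write({"fields": extract_fields(line, tags)})

def make_app():
    return tornado.web.Application([(r"/api/v1/parse", ParseHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()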
In summary, the traditional manual parsing technique requires formulating a large number of regular expressions for different types of logs, whereas the present method uses natural language processing and data mining techniques, needs no manually formulated regular expressions, and saves labor and time. Manually formulated regular expressions must also be rewritten whenever the log structure changes, while the method used by the invention needs no retraining. This embodiment additionally builds the log parsing service on a REST API, providing an approach for putting log parsing into engineering practice.

Claims (8)

1. A method for automatically parsing a log, comprising the steps of:
S1, obtaining sample log data;
S2, respectively establishing a log database and a log analysis model according to the sample log data;
S3, acquiring target log data and preprocessing it;
S4, analyzing the structure of the preprocessed target log data with the Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log.
2. The method for automatically parsing a log according to claim 1, wherein step S2 specifically includes the following steps:
S21, marking the structure of the sample logs according to the effective information of the sample log data to establish a log database;
S22, constructing a hidden Markov model from the marked log structure information in the log database to serve as the log analysis model.
3. The method for automatically parsing a log according to claim 2, wherein the sample log data in the step S21 includes eight kinds of log data: apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN.
4. The method as claimed in claim 2, wherein in step S21, when labeling the structure of the sample log, the log structure is labeled with the identifiers B, M, E, S, O to obtain labels corresponding one-to-one to the characters in the log structure, wherein B, M, and E represent the beginning, middle, and end of a character string respectively, S represents a single character, and O represents a character that is not part of the log structure.
5. The method of claim 4, wherein the log structure information labeled in step S22 includes a log structure character string and a corresponding character tag string, wherein each character in the log structure character string is a distinct observation and each tag in the character tag string is a distinct state.
6. The method for automatically parsing a log according to claim 5, wherein the specific process of constructing the hidden Markov model in step S22 is as follows:
S221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
S222, counting the probabilities of states emitting observations in the log database to obtain an observation probability matrix;
S223, counting the initial state probabilities in the log database to obtain the initial probability distribution;
S224, constructing the hidden Markov model from the trained state transition matrix, observation probability matrix, and initial probability distribution.
7. The method for automatically parsing a log according to claim 6, wherein the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
and the initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
wherein Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
8. The method according to claim 1, wherein the preprocessing in step S3 specifically refers to cleaning invalid characters from the target log, including garbled characters, carriage-return symbols, and spaces.
CN202010132165.XA 2020-02-29 2020-02-29 Method for automatically analyzing log Active CN111367964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010132165.XA CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010132165.XA CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Publications (2)

Publication Number Publication Date
CN111367964A true CN111367964A (en) 2020-07-03
CN111367964B CN111367964B (en) 2023-11-17

Family

ID=71206461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010132165.XA Active CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Country Status (1)

Country Link
CN (1) CN111367964B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912570A (en) * 2016-03-29 2016-08-31 北京工业大学 English resume key field extraction method based on hidden Markov model
CN107070852A (en) * 2016-12-07 2017-08-18 东软集团股份有限公司 Network attack detecting method and device
CN107273269A (en) * 2017-06-12 2017-10-20 北京奇虎科技有限公司 Daily record analysis method and device
CN108021552A (en) * 2017-11-09 2018-05-11 国网浙江省电力公司电力科学研究院 A kind of power system operation ticket method for extracting content and system
CN108881194A (en) * 2018-06-07 2018-11-23 郑州信大先进技术研究院 Enterprises user anomaly detection method and device
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN109947891A (en) * 2017-11-07 2019-06-28 北京国双科技有限公司 Document analysis method and device


Also Published As

Publication number Publication date
CN111367964B (en) 2023-11-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant