CN111367964A - Method for automatically analyzing log - Google Patents
Method for automatically analyzing log
- Publication number
- CN111367964A CN111367964A CN202010132165.XA CN202010132165A CN111367964A CN 111367964 A CN111367964 A CN 111367964A CN 202010132165 A CN202010132165 A CN 202010132165A CN 111367964 A CN111367964 A CN 111367964A
- Authority
- CN
- China
- Prior art keywords
- log
- probability
- analysis
- state
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/547—Remote procedure calls [RPC]; Web services
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a method for automatically analyzing logs, which comprises the following steps: S1, obtaining sample log data; S2, respectively establishing a log database and a log analysis model according to the sample log data; S3, acquiring target log data and preprocessing the target log data; S4, analyzing the structure of the preprocessed target log data by adopting a Viterbi algorithm based on the log analysis model, and obtaining the analysis structure of the target log by solving the path with the maximum probability; S5, extracting effective information from the analysis structure of the target log and marking the corresponding positions, thereby completing the analysis of the target log. Compared with the prior art, the method solves the problem of low log analysis efficiency in the traditional approach of manually formulating regular expressions by constructing a hidden Markov log analysis model combined with the Viterbi algorithm; it can rapidly and accurately identify the internal structure of a log automatically and extract the effective information.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for automatically analyzing logs.
Background
With the continuing development of computer technology, computer systems are becoming more and more complex. For IT operation and maintenance, a raw log cannot directly provide valid information; the fields in the raw log need to be parsed before the valid information can be extracted. The traditional log analysis method is to manually write a corresponding regular expression for each log format. This approach works well if there are few log categories and the log structure changes infrequently. However, as more functions are integrated into a system, a large number of IT subsystems are created, and with them a large amount of log data of various types. For these logs, designing a regular matching rule for each one is very time- and labor-consuming. Therefore, how to rapidly and accurately parse text logs has become a problem to be solved urgently.
Disclosure of Invention
The present invention aims to overcome the defects of the prior art and provide a method for automatically analyzing a log, which is based on a natural language processing technology and automatically identifies the internal structure of a text log through a computer so as to quickly and accurately extract effective information from the log.
The purpose of the invention can be realized by the following technical scheme: a method of automatically parsing a log, comprising the steps of:
s1, obtaining sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, analyzing the structure of the preprocessed target log data by adopting a Viterbi algorithm based on a log analysis model, and obtaining the analysis structure of the target log by solving a path with the maximum probability;
and S5, extracting effective information from the analysis structure of the target log, and marking the corresponding position, namely completing the analysis of the target log.
Further, the step S2 specifically includes the following steps:
s21, marking the structure of the sample log according to the effective information of the sample log data to establish a log database;
and S22, constructing a hidden Markov model according to the marked log structure information in the log database to be used as a log analysis model.
Further, the sample log data in step S21 includes eight kinds of log data: apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN.
Further, when the structure of the sample log is labeled in step S21, the structure of the log is specifically labeled with B, M, E, S, O identifiers to obtain labels corresponding one-to-one to the characters in the log structure, where S denotes a single character, B, M and E denote the beginning, middle and end of a character string respectively, and O denotes a character that is not part of the log structure.
Further, the log structure information labeled in step S22 includes a log structure character string and a corresponding character tag string, where the characters in the log structure character string are the observations and the tags in the character tag string are the states.
Further, the specific process of constructing the hidden markov model in step S22 is as follows:
s221, counting transition probabilities of adjacent front and back states in a log database to obtain a state transition matrix;
s222, counting the transition probability from the state to the observed quantity in the log database to obtain an observation probability matrix;
s223, counting the initial state probability in the log database to obtain initial probability distribution;
s224, constructing a hidden Markov model by training the state transition matrix, the observation probability matrix and the initial probability distribution.
Further, the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
the initial probability distribution is specifically:
π = (π_1, π_2, ..., π_N)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution with π_i the probability of being in state q_i at time t = 1, A is the state transition probability matrix with a_ij the probability of transferring to state q_j at time t + 1 given state q_i at time t, and B is the observation probability matrix with b_j(k) the probability of generating observation v_k at time t given state q_j.
Further, the preprocessing in step S3 specifically refers to removing invalid characters from the target log structure, including garbled (mojibake) characters, carriage returns, and extraneous spaces.
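As a sketch, this cleanup step might look like the following (a minimal illustration; the exact character classes removed — here, everything outside printable ASCII — are our assumption, chosen to suit the ASCII logs used in the embodiment):

```python
import re

def preprocess(raw_log):
    # Strip carriage returns / newlines / spaces at both ends, then drop
    # non-printable or mojibake bytes anywhere in the line (assumed rule:
    # keep printable ASCII only).
    line = raw_log.strip()
    return re.sub(r"[^\x20-\x7e]", "", line)
```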
Compared with the prior art, the invention constructs the log analysis model based on a hidden Markov model, so that when different types of log data are processed, the logs can be analyzed automatically without manually formulating regular expressions or retraining the model; this raises the analysis speed and greatly saves the manpower and time spent on log analysis. In addition, the hidden Markov model is combined with the Viterbi algorithm to obtain the maximum-probability path, which ensures the accuracy of log analysis.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a process diagram of log parsing model construction in an embodiment;
FIG. 3 is a process for applying the log parsing model in an embodiment;
FIG. 4 is a sample Apache access data in an embodiment;
FIG. 5 is a sample Apache error log data in an embodiment;
FIG. 6 shows a data sample of the Aruba wireless in the embodiment;
FIG. 7 is a sample of Nginx access data in an embodiment;
FIG. 8 is a sample Nginx error data in an example embodiment;
FIG. 9 is a sample of Exchange data in the example;
FIG. 10 is a sample of the Juniper firewall log in the embodiment;
FIG. 11 is a sample VPN in an embodiment;
FIG. 12 is a diagram illustrating a log structure annotation in an embodiment;
FIG. 13 is a diagram illustrating a process of computing a maximum probability path;
FIG. 14 is a flow chart illustrating the usage of the REST API service in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, a method for automatically parsing a log includes the following steps:
s1, obtaining sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, analyzing the structure of the preprocessed target log data by adopting a Viterbi algorithm based on a log analysis model, and obtaining the analysis structure of the target log by solving a path with the maximum probability;
and S5, extracting effective information from the analysis structure of the target log, and marking the corresponding position, namely completing the analysis of the target log.
In this embodiment, the method is adopted to automatically analyze the text log, and an application service based on a REST (Representational State Transfer) API is constructed at the same time, as shown in fig. 2 to 3:
1. preparation work
Before parsing logs, log data needs to be collected; this preparation includes establishing the log database and determining the log entity tags.
1.1 Log repository establishment
Various types of log data are collected, and effective information of the log data is marked.
1.2 Log parsing model construction
And according to the collected log data, a model is built from the marked log structure information. The present invention uses a hidden Markov model as the parsing model, and therefore calculates the three parameters of the hidden Markov model: the initial probability distribution, the state transition probability matrix, and the observation probability matrix.
Specifically, a Hidden Markov Model (HMM) is a probabilistic graphical model. HMMs are mainly used to describe the transitions among the hidden states of a system and the probabilities of the observations those hidden states produce. The strength of an HMM lies in its ability to estimate the hidden-variable sequence corresponding to a given sequence of observed variables and to make predictions about future observed variables.
Speech recognition is an example: given a piece of audio data, the task is to recognize the text corresponding to the audio. Here the audio data is the observed variable and the text is the hidden variable. Pronunciation varies slightly across contexts, but on the whole it is statistically regular. On the other hand, when we speak a sentence, there are also transition rules between successive words.
In terms of model representation:
the HMM includes three parameters, an initial probability distribution, a state transition probability matrix, and an observation probability matrix.
Let Q be the set of all possible states and V be the set of all possible observations.
Q={q1,q2,...,qN},V={v1,v2,...,vM}
Where N is the number of possible states and M is the number of possible observations.
I is the state sequence of length T and O is the corresponding observation sequence.
I={i1,i2,...,iT},O={o1,o2,...,oT}
π is the initial state probability vector:
π = (π_1, π_2, ..., π_N)
where
π_i = P(i_1 = q_i), i = 1, 2, ..., N
is the probability of being in state q_i at time t = 1.
A is the state transition probability matrix:
A = [a_ij]_{N×N}
where
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
is the probability of transferring to state q_j at time t + 1 given state q_i at time t.
B is the observation probability matrix:
B = [b_j(k)]_{N×M}
where
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
is the probability of generating observation v_k at time t given state q_j.
HMMs mainly address three problems:
The probability calculation problem. Given the model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), compute the probability P(O|λ) that the sequence O occurs under model λ.
The learning problem. Given an observation sequence O = (o_1, o_2, ..., o_T), estimate the parameters of the model λ = (A, B, π) so as to maximize the observation-sequence probability P(O|λ) under that model; i.e., estimate the parameters by maximum likelihood.
The prediction problem, also known as the decoding problem. Given the model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), find the state sequence I = (i_1, i_2, ..., i_T) that maximizes the conditional probability P(I|O); that is, given the observation sequence, find the most likely corresponding state sequence.
When labeling the log, labeling the structure of the log first, and preliminarily determining the log structures of different log types. In this embodiment, eight types of relatively typical logs are selected and marked, which are Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN, and data samples thereof are shown in fig. 4 to 11, respectively.
The log content is then marked. The internal structure in the log is shown in table 1:
TABLE 1
The log data is marked as follows. For a sequence annotation problem, the annotation generally uses identifiers such as B, M, E, S, O. S denotes a single character; B, M and E denote the beginning, middle and end of a string respectively; and O denotes a character that is not part of the log structure. The log internal structure above is annotated with B, M, E, S accordingly. For the log: "192.168.3.1 - - [08/Aug/2017:00:31:26 +0800] "GET /qx/xts/images/x_gkbg.jpg HTTP/1.1" 200 1171", the corresponding labels are as shown in FIG. 12.
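A minimal sketch of this labeling scheme (the `<type>-b/m/e/s` tag naming follows the examples later in this embodiment; the helper name is ours):

```python
def bmes_tags(field, field_type):
    """B/M/E/S tags for the characters of one labeled field.
    A single-character field gets S; longer fields get B, M..., E."""
    if len(field) == 1:
        return [f"{field_type}-s"]
    return ([f"{field_type}-b"]
            + [f"{field_type}-m"] * (len(field) - 2)
            + [f"{field_type}-e"])

# "192.168.3.1" has 11 characters: one B, nine M, one E
tags = bmes_tags("192.168.3.1", "host")
```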
The collected sample log data are marked one by one to construct a hidden Markov model. For the hidden Markov model, three important variables need to be counted first, namely π, A and B. For a log, the log's character string is an observation sequence, and each character is an observation. The labels of the characters are the hidden variables, i.e., the states in the hidden Markov model. These parameter calculations are presented below.
The state transition matrix is an N × N matrix, where N is the number of states, i.e., of distinct log labels.
The observation probability matrix is an N × M matrix, where N is the number of log labels and M is the number of distinct character types. The calculation method is as follows:
The initial state probability π_i is calculated as the frequency with which the initial state of the sample logs is q_i.
For example, take three pieces of log data, i.e., three observation sequences:
"127.0.0.1 get 200"
"192.168.10.1 post 404"
"127.0.0.1 get 403"
The log structure sequences are shown below, with the log structure type in square brackets and \s representing a space character.
“1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]”
“1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]
o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]”
“1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]”
Then we can get a total of ten states as { host-b, host-m, host-e, o-s, http-method-b, http-method-m, http-method-e, http-code-b, http-code-m, http-code-e }.
The observation set is {1, 2, 7, 0, 9, 6, 8, 4, 3, g, e, t, p, o, s, \s}, for a total of sixteen observations, where \s represents a space.
First, we count adjacent state pairs to estimate the probability of transferring from the previous state to the next. For example, to compute the state transition probability p(host-e | host-m) from "host-m" to "host-e": host-m is immediately followed by host-e 3 times, while host-m occurs 24 times in total, so the state transition probability from host-m to host-e is 3/24 = 0.125. In this way we obtain a 10 × 10 state transition matrix A.
Second, we count the observation probability matrix. Suppose we want to compute the observation probability p(3 | http-code-e) of emitting "3" from "http-code-e". The observation character "3" is marked with the state "http-code-e" 1 time, and the state "http-code-e" appears 3 times in total, so the observation probability from state "http-code-e" to observation character "3" is 1/3. In this way we obtain a 10 × 16 observation probability matrix B.
Finally, we count the initial state probability π. There are 3 sequences in total, and "host-b" appears as the initial state 3 times while no other state does, so the initial probability of state "host-b" is 1.0 and that of every other state is 0.
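The counting above can be reproduced with a short script; it rebuilds the three toy logs' tag sequences and recovers the same estimates (3/24 for host-m → host-e, 1/3 for "3" under http-code-e, 1.0 for the initial host-b). The helper names are ours, not the patent's:

```python
from collections import Counter

def tag_field(text, name):
    """Character-level B/M/E tags for one field (S for a single character)."""
    if len(text) == 1:
        return [f"{name}-s"]
    return [f"{name}-b"] + [f"{name}-m"] * (len(text) - 2) + [f"{name}-e"]

def tag_log(ip, method, code):
    """Characters and tags for a toy log "<ip> <method> <code>"."""
    chars = list(ip) + [" "] + list(method) + [" "] + list(code)
    tags = (tag_field(ip, "host") + ["o-s"]
            + tag_field(method, "http-method") + ["o-s"]
            + tag_field(code, "http-code"))
    return chars, tags

logs = [("127.0.0.1", "get", "200"),
        ("192.168.10.1", "post", "404"),
        ("127.0.0.1", "get", "403")]

trans, emit = Counter(), Counter()        # adjacent-state and state->char counts
state_total, from_total, first = Counter(), Counter(), Counter()
for ip, method, code in logs:
    chars, tags = tag_log(ip, method, code)
    first[tags[0]] += 1                   # initial state of each sequence
    for t, (ch, tag) in enumerate(zip(chars, tags)):
        state_total[tag] += 1
        emit[(tag, ch)] += 1
        if t + 1 < len(tags):             # only states that have a successor
            trans[(tag, tags[t + 1])] += 1
            from_total[tag] += 1

a = trans[("host-m", "host-e")] / from_total["host-m"]       # 3 / 24 = 0.125
b = emit[("http-code-e", "3")] / state_total["http-code-e"]  # 1 / 3
pi = first["host-b"] / len(logs)                             # 3 / 3 = 1.0
```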
2 Log Structure parsing
The steps of log structure analysis are described below with a piece of log data as an example.
The first step: preprocess the input log, removing the carriage returns, spaces and garbled characters at the beginning and end of the log.
The second step: using the trained initial probability distribution, state transition probability matrix and observation probability matrix, parse the structure with the Viterbi algorithm and select the structure with the maximum probability.
The third step: output the parse structure of the log, extract the effective information in it, and mark the corresponding positions.
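The third step — reading entities back out of a decoded tag path — can be sketched as follows (the helper name and the tag-string handling are our assumptions, following the `<type>-b/m/e/s` convention used in this embodiment):

```python
def extract_entities(chars, tags):
    """Group characters into (field_type, text) entities using their
    decoded tags; characters tagged o-* belong to no field."""
    entities, buf, cur = [], "", None
    for ch, tag in zip(chars, tags):
        kind, _, pos = tag.rpartition("-")   # "host-b" -> ("host", "b")
        if kind == "o":
            continue
        if pos in ("b", "s"):                # a new field starts here
            buf, cur = ch, kind
        else:                                # m / e extend the current field
            buf += ch
        if pos in ("e", "s"):                # field complete
            entities.append((cur, buf))
            buf, cur = "", None
    return entities

chars = list("127.0.0.1 get 200")
tags = (["host-b"] + ["host-m"] * 7 + ["host-e"] + ["o-s"]
        + ["http-method-b", "http-method-m", "http-method-e"] + ["o-s"]
        + ["http-code-b", "http-code-m", "http-code-e"])
fields = extract_entities(chars, tags)
```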
Specifically, for a newly input log, the log data is preprocessed first; the preprocessing mainly removes invalid characters such as garbled bytes. Then, based on the three hidden Markov parameters, the optimal parse is found using the Viterbi algorithm. The Viterbi algorithm is a dynamic-programming method for finding the most probable path; here, a path corresponds to a log parse structure.
For example, suppose we have obtained a state transition matrix A and an observation probability matrix B, and the initial probability distribution is:
π = (0.3, 0.2, 0.5)^T.
the state set is { "a", "b", "c" }, the observation set is { "m", "n" }, and the optimal analytic structure is solved for the observation sequence ("m", "n", "m").
First, at initialization (t = 1), for each state i, i = 1, 2, 3, compute the probability that, starting in state i, the first observation o_1 is the character "m"; denote this probability δ_1(i). Then
δ_1(i) = π_i b_i(o_1) = π_i b_i(m), i = 1, 2, 3
Substituting the actual data:
δ_1(1) = 0.3 × 0.3 = 0.09
δ_1(2) = 0.2 × 0.6 = 0.12
δ_1(3) = 0.5 × 0.4 = 0.20
ψ_1(i) = 0, i = 1, 2, 3.
Next, for t = 2 and each state i, i = 1, 2, 3, find the maximum probability over all paths that are in some state j at t = 1 and in state i at t = 2 with observation o_2 = "n"; denote this probability δ_2(i):
δ_2(i) = max_j [δ_1(j) a_ji] · b_i(o_2)
Meanwhile, for each state i, i = 1, 2, 3, record the previous state j on the most probable path:
ψ_2(i) = arg max_j [δ_1(j) a_ji]
Calculating:
ψ_2(1) = 3
δ_2(2) = 0.024, ψ_2(2) = 3
δ_2(3) = 0.048, ψ_2(3) = 3
Similarly, when t = 3:
δ_3(1) = 0.00756, ψ_3(1) = 1
δ_3(2) = 0.00864, ψ_3(2) = 3
δ_3(3) = 0.00768, ψ_3(3) = 3
Let P* denote the probability of the optimal path; then P* = max_i δ_3(i) = 0.00864, attained at state i_3 = 2. Backtracking through ψ gives i_2 = ψ_3(2) = 3 and i_1 = ψ_2(3) = 3, so the optimal path, i.e., the optimal state sequence, is (q_3, q_3, q_2), i.e., ("c", "c", "b"). FIG. 13 shows the process of calculating the maximum-probability path.
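The recursion above fits in a few lines of code. Since the example's A and B matrices appear only as figures in the original, the sketch below substitutes its own illustrative three-state matrices; the algorithm, not the specific numbers, matches the worked example:

```python
def viterbi(pi, A, B, obs):
    """Most probable state path (and its probability) for an observation
    index sequence, by dynamic programming over path probabilities."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]   # delta[t][i]: best prob ending in state i
    psi = [[0] * N for _ in range(T)]       # psi[t][i]: predecessor on that path
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = [delta[t - 1][i] * A[i][j] for i in range(N)]
            best = max(range(N), key=scores.__getitem__)
            psi[t][j] = best
            delta[t][j] = scores[best] * B[j][obs[t]]
    last = max(range(N), key=delta[T - 1].__getitem__)
    path = [last]                            # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return delta[T - 1][last], list(reversed(path))

# Illustrative parameters (not the patent's): states {a, b, c}, observations {m, n}
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
prob, path = viterbi(pi, A, B, obs=[0, 1, 0])   # observe "m", "n", "m"
states = [["a", "b", "c"][i] for i in path]
```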
After model construction is complete, this embodiment evaluates the log parsing model. The requirement on the parser is that "as many log entities as possible are discovered, and the discovered log entities are as accurate as possible"; that is, both the recall and the precision should be high. To balance recall against precision, f1-measure is also needed to evaluate the model.
The metrics are:
precision = correct_extract / extract_entity
recall = correct_extract / data_entity
f1-measure = 2 × precision × recall / (precision + recall)
where correct_extract denotes the number of correctly extracted log entities, extract_entity denotes the total number of log entities extracted, and data_entity denotes the number of log entities in the data. For example, consider a log with the following format:
"Jan 12 17:47:48 127.0.0.1 xxx, info, download, 175.42.41.4"
The correct parse structure is "Jan 12 17:47:48", "127.0.0.1", "info", "175.42.41.4". Suppose the model instead parses the log into "Jan", "12", "17:47:48", "127.0.0.1", "info", "175.42.41.4". Then the evaluation is as follows:
correct_extract = { "127.0.0.1", "info", "175.42.41.4" }
extract_entity = { "Jan", "12", "17:47:48", "127.0.0.1", "info", "175.42.41.4" }
data_entity = { "Jan 12 17:47:48", "127.0.0.1", "info", "175.42.41.4" }
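On an example of this shape the metrics come out to precision = 3/6, recall = 3/4, f1 = 0.6; a minimal sketch (the function and variable names are ours):

```python
def evaluate(extracted, gold):
    """Entity-level precision / recall / f1 for one parsed log."""
    correct = [e for e in extracted if e in gold]   # correct_extract
    precision = len(correct) / len(extracted)
    recall = len(correct) / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = ["Jan 12 17:47:48", "127.0.0.1", "info", "175.42.41.4"]
extracted = ["Jan", "12", "17:47:48", "127.0.0.1", "info", "175.42.41.4"]
p, r, f = evaluate(extracted, gold)   # 0.5, 0.75, 0.6
```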
In order to evaluate the parsing results, this embodiment splits each log data set, taking 60% of the data as a training set and 40% as a test set. The model trained on the training set is used to predict on the test set, and the results are finally evaluated. Table 2 shows the parsing results of the various log models.
TABLE 2
In order to verify the applicability of the model to logs with large data volumes, this embodiment tests the parsing speed on each log type; the test results are shown in table 3:
TABLE 3
Log categories | Number of logs | File size | File format | Analysis time |
Apache Access | 523 | 51KB | txt | 0.1s |
Apache error | 30001 | 4.23MB | txt | 4.1s |
Aruba wireless | 380752 | 62.6MB | txt | 63.2s |
Nginx access | 2231408 | 482MB | txt | 420.1s |
Nginx error | 33026 | 13.5MB | txt | 10.1s |
Exchange | 648492 | 357MB | txt | 301.2s |
Juniper firewall log | 33034 | 12.4MB | txt | 23.5s |
VPN log | 18581 | 2.64MB | txt | 2.5s |
3 Construction of the REST service
The log parsing method is packaged as a library behind a REST service, so that users can call it through a REST API.
To put log parsing into practical use, this embodiment provides a log parsing service based on a REST API for user convenience.
The architecture is shown in fig. 14: the service is implemented in Python 3, with the tornado framework as the basic framework of the REST service; log parsing and classification are integrated into the service as a library, and a REST API is provided. The interface design is shown in table 4:
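A self-contained sketch of such a service follows. The embodiment uses tornado; for a dependency-free illustration this sketch uses the standard library's `http.server` instead, with a whitespace-splitting stand-in for the HMM parser, and the `/parse` endpoint name is our assumption (table 4's interface design is not reproduced here):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_log(raw):
    # Stand-in for the HMM parser: the real service would run
    # preprocessing + Viterbi decoding here.
    return raw.split()

class ParseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        raw = self.rfile.read(int(self.headers["Content-Length"])).decode()
        body = json.dumps({"entities": parse_log(raw)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # silence per-request logging in the demo
        pass

# Serve on an ephemeral port and exercise the endpoint once.
server = HTTPServer(("127.0.0.1", 0), ParseHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/parse",
    data=b"127.0.0.1 get 200", method="POST")
resp = json.loads(urllib.request.urlopen(req).read())
server.shutdown()
```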
TABLE 4
In summary, the traditional manual parsing technique requires formulating a large number of regular expressions for different types of logs, whereas this method uses natural language processing and data mining techniques and needs no manually written regular expressions, saving labor and time. A manually formulated regular expression must also be rewritten whenever the log structure changes, but the method used by the invention requires no retraining. This embodiment further builds a log parsing service based on a REST API, offering an approach to the engineering application of log parsing.
Claims (8)
1. A method for automatically parsing a log, comprising the steps of:
s1, obtaining sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, analyzing the structure of the preprocessed target log data by adopting a Viterbi algorithm based on a log analysis model, and obtaining the analysis structure of the target log by solving a path with the maximum probability;
and S5, extracting effective information from the analysis structure of the target log, and marking the corresponding position, namely completing the analysis of the target log.
2. The method for automatically parsing a log according to claim 1, wherein the step S2 specifically includes the following steps:
s21, marking the structure of the sample log according to the effective information of the sample log data to establish a log database;
and S22, constructing a hidden Markov model according to the marked log structure information in the log database to be used as a log analysis model.
3. The method for automatically parsing a log according to claim 2, wherein the sample log data in the step S21 includes eight kinds of log data: apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log, and VPN.
4. The method as claimed in claim 2, wherein in step S21, when labeling the structure of the sample log, the structure of the sample log is labeled with B, M, E, S, O identifiers to obtain labels corresponding one-to-one to the characters in the log structure, wherein S represents a single character, B, M and E represent the beginning, middle and end of a character string respectively, and O represents a character that is not part of the log structure.
5. The method of claim 4, wherein the log structure information labeled in step S22 includes a log structure character string and a corresponding character label string, wherein the characters in the log structure character string are the observations and the labels in the character label string are the states.
6. The method for automatically parsing a log according to claim 5, wherein the specific process of constructing the hidden Markov model in step S22 is as follows:
s221, counting transition probabilities of adjacent front and back states in a log database to obtain a state transition matrix;
s222, counting the transition probability from the state to the observed quantity in the log database to obtain an observation probability matrix;
s223, counting the initial state probability in the log database to obtain initial probability distribution;
s224, constructing a hidden Markov model by training the state transition matrix, the observation probability matrix and the initial probability distribution.
7. The method for automatically parsing a log according to claim 6, wherein the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
the initial probability distribution is specifically:
π = (π_1, π_2, ..., π_N)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution with π_i the probability of being in state q_i at time t = 1, A is the state transition probability matrix with a_ij the probability of transferring to state q_j at time t + 1 given state q_i at time t, and B is the observation probability matrix with b_j(k) the probability of generating observation v_k at time t given state q_j.
8. The method according to claim 1, wherein the preprocessing in step S3 cleans invalid characters from the target log structure, including garbled characters, carriage returns, and spaces.
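A preprocessing step like the one in claim 8 can be sketched with regular expressions. The patent does not specify which character classes count as "invalid"; the ranges kept below (printable ASCII plus the common CJK block) are an assumption for illustration.

```python
import re

def clean_log(raw):
    """Sketch of the claim-8 preprocessing: strip carriage returns,
    drop garbled/non-printable characters, and collapse extra spaces.
    The retained character ranges are an assumption, not the patent's."""
    raw = raw.replace('\r', '')
    # Drop anything outside printable ASCII and the basic CJK block,
    # which removes mojibake bytes and control characters.
    raw = re.sub(r'[^\x20-\x7e\u4e00-\u9fff]', '', raw)
    # Collapse runs of spaces left over from alignment padding.
    return re.sub(r' {2,}', ' ', raw).strip()
```

For example, `clean_log('err\r\n  code\x00 42  ')` yields the compacted string `'err code 42'`.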
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010132165.XA CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111367964A true CN111367964A (en) | 2020-07-03 |
CN111367964B CN111367964B (en) | 2023-11-17 |
Family
ID=71206461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010132165.XA Active CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111367964B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105912570A (en) * | 2016-03-29 | 2016-08-31 | 北京工业大学 | English resume key field extraction method based on hidden Markov model |
CN107070852A (en) * | 2016-12-07 | 2017-08-18 | 东软集团股份有限公司 | Network attack detecting method and device |
CN107273269A (en) * | 2017-06-12 | 2017-10-20 | 北京奇虎科技有限公司 | Daily record analysis method and device |
CN108021552A (en) * | 2017-11-09 | 2018-05-11 | 国网浙江省电力公司电力科学研究院 | A kind of power system operation ticket method for extracting content and system |
CN108881194A (en) * | 2018-06-07 | 2018-11-23 | 郑州信大先进技术研究院 | Enterprises user anomaly detection method and device |
CN109388803A (en) * | 2018-10-12 | 2019-02-26 | 北京搜狐新动力信息技术有限公司 | Chinese word cutting method and system |
CN109947891A (en) * | 2017-11-07 | 2019-06-28 | 北京国双科技有限公司 | Document analysis method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111367964B (en) | 2023-11-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||