CN116450393A - Log anomaly detection method and system integrating BERT feature codes and variant transformers - Google Patents

Log anomaly detection method and system integrating BERT feature codes and variant transformers

Info

Publication number
CN116450393A
CN116450393A (application CN202310417120.0A)
Authority
CN
China
Prior art keywords
log
bert
sequence
variant
log sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310417120.0A
Other languages
Chinese (zh)
Inventor
方巍
贾雪磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202310417120.0A priority Critical patent/CN116450393A/en
Publication of CN116450393A publication Critical patent/CN116450393A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • G06F11/0769Readable error formats, e.g. cross-platform generic formats, human understandable formats
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method and a system for detecting log anomalies by fusing BERT feature codes with a variant Transformer, relating to the technical field of intelligent log operation and maintenance. The method comprises the following steps: receiving a log sequence, parsing the log sequence, and inputting the parsed log sequence into a pre-established BERT model to obtain log sequence feature codes with semantic information and position information; inputting the log sequence feature codes with semantic information and position information into a pre-established anomaly detection model for training, obtaining the log sequence likely to appear in the future, and accurately predicting based on the obtained log sequence to obtain a detection result. The method can reduce the memory consumption and computation time of the attention layer while achieving equal or better prediction accuracy.

Description

Log anomaly detection method and system integrating BERT feature codes and variant transformers
Technical Field
The invention relates to the technical field of log intelligent operation and maintenance processing, in particular to a log anomaly detection method and system integrating BERT feature codes and variant transformers.
Background
A log is time-series text data consisting of a timestamp and a text message, recording the running state of a service in real time. By collecting and analyzing logs, faults that have occurred or potential faults in the network can be discovered or predicted. Moreover, modern network systems are large: they print on the order of 50 GB of logs (roughly 120-200 million lines) per hour, so relying on manual analysis of log data to identify whether a failure has occurred in the network is inefficient. This motivates modeling the log data with intelligent methods to find the potential relationships between log sequences.
In recent years, many research teams have carried out work on log anomaly detection and achieved significant results. As early as 2004, Mike Chen et al. proposed error detection on HDFS logs using a decision tree method; this team was also among the earliest to use supervised machine-learning models for anomaly detection on log data, which is of great significance. In order to develop an effective fault-tolerance strategy, the Yinglong Liang team predicted system failure events and used an SVM (Support Vector Machine) to perform anomaly detection on the log data of IBM BlueGene/L. However, existing feature extraction and encoding methods do not fully consider the semantic information or position information among the words in a log sequence. The self-attention structure in the Transformer model effectively captures the correlations of features within a text sequence and reduces the dependence on external information, enabling it to better discover the intrinsic links between data or features [69]. Many scholars have therefore proposed replacing RNNs with Transformer models for log anomaly detection, such as HitAnomaly and NeuralLog, which achieve good experimental results; however, the internal self-attention mechanism must compute the correlation between every point and all other points when computing correlations within the sequence, and the resulting large-scale matrix multiplications give the method high time and space complexity, so its computational efficiency is not high.
Disclosure of Invention
In order to solve the above-mentioned shortcomings in the background art, the present invention is directed to a method and a system for detecting log anomalies by fusing BERT feature codes with variant transformers.
The aim of the invention can be achieved by the following technical scheme: a method for detecting log abnormality by fusing BERT feature codes and variant transformers comprises the following steps:
collecting a log sequence, analyzing the log sequence, and inputting the analyzed log sequence into a pre-established BERT model to obtain a log sequence feature code with semantic information and position information;
inputting the characteristic codes of the log sequences with semantic information and position information into a pre-established log abnormality detection model based on a variant Transformer for training to obtain a log sequence which possibly appears in the future, and accurately predicting the obtained log sequence to obtain a detection result.
Preferably, the BERT model adopts the BERT_BASE configuration of formula (1): the number of Transformer Encoder blocks is 12, the hidden layer size is 768, the number of self-attention heads is 12, and the total parameter size is 110M.
Preferably, the BERT model is as follows:
BERT BASE (L=12, H=768, A=12, TotalParam=110M).
Preferably, the log sequence is tokenized to obtain a segmented text sequence; a special mark [CLS] is added at the beginning of the text sequence, where [CLS] represents the result mark of the text sequence and can be placed at the beginning or at the end, and different sentences are separated by the mark [SEP].
Preferably, the output Embedding of each word of the log sequence is composed of three parts: Token Embedding, Segment Embedding and Position Embedding.
Preferably, the sequence vector containing the three types of Embedding is input into the BERT network for feature extraction, a sequence vector containing rich semantic features is finally output, and the degree of association between different words is used to determine a weight matrix that characterizes the words:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V
where Q, K and V are word vector matrices and d_k is the dimension of the word embedding.
Preferably, Q, K and V are projected through a plurality of different linear transformations, and the attention outputs of the different heads are finally spliced by Concat, with the following formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_n)·W^O
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
where W^O, W_i^Q, W_i^K and W_i^V are weight matrices.
preferably, the anomaly detection model uses a log sequence feature code of 10 time steps as input to predict log sequence output for the next 5 time steps.
Preferably, the log code sequence of 10 time steps is input into the Encoder of the model and summed with the position codes before input, the positions of the elements are reflected using additional position codes, and the position information is encoded using Sin and Cos functions:
PE(pos, 2j) = sin(pos / 10000^(2j/d_model))
PE(pos, 2j+1) = cos(pos / 10000^(2j/d_model))    (4)
where pos represents the position, j is the encoding dimension index, and d_model represents the length of the vector, the first expression of formula (4) being used for even dimensions and the second for odd dimensions;
then, in the encoding process, the attention layer is entered for the vector attention operation: the input dimension is first split into two parts, one part passes through a 1×1 convolution and the other part passes through a normal self-attention layer, and the two outputs are then spliced, completing the encoding operation; a mask operation is then added in the decoding operation, the attention layer of the decoder being identical to the attention layer of the encoder, and the decoded output gives the detection result.
A log anomaly detection system fusing BERT feature encoding with a variant Transformer, comprising:
a log encoding module, configured to receive a log sequence, parse the log sequence, and input the parsed log sequence into a pre-established BERT model to obtain log sequence feature codes with semantic information and position information; and
a prediction module, configured to input the log sequence feature codes with semantic information and position information into a pre-established log anomaly detection model based on a variant Transformer for training, obtain the log sequence likely to appear in the future, and accurately predict based on the obtained log sequence to obtain a detection result.
The invention has the beneficial effects that:
firstly, BERT is used for feature coding in a log sequence data coding stage, semantic information and position information of each word token in a sequence can be fully considered due to unique advantage of BERT, three different Embeddings are accumulated in an input stage, so that the position information is contained in the final input, a CSPAttion module combining CSPNet and Self-attribute is designed to replace Self-attribute in an original converter, and the structure can reduce memory consumption and time consumption of Attention layer calculation, achieve equivalent or exceeding prediction precision, and prove that the process is shown in an appendix.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort;
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the CSPAttention structure of the present invention;
FIG. 3 is a schematic illustration of the self-attention structure of the present invention;
FIG. 4 is a schematic diagram of the attention structure after dimension division of the present invention;
FIG. 5 is a schematic diagram of the input composition of the present invention;
FIG. 6 is a logical schematic of the coding portion of the present invention;
FIG. 7 is an Encoder diagram in BERT of the present invention;
FIG. 8 is a schematic diagram of the variant Transformer of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1, a method for detecting log anomalies by fusing BERT feature codes with variant transformers comprises the following steps:
step 1: log sequence feature encoding stage
In this step, the invention uses BERT to output the log sequence feature codes. In the normal log anomaly detection flow, the log needs to be parsed before encoding; that step is not the focus of the invention and is not described further here. The BERT_BASE configuration shown in formula (1) is adopted: the number of Transformer Encoder blocks is 12, the hidden layer size is 768, the number of self-attention heads is 12, and the total parameter size is 110M. The invention uses BERT only to generate feature codes with semantic information and position information; no classification or prediction task is attached.
BERT BASE (L=12,H=768,A=12,TotalParam=110M) (1)
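As an illustration of this feature-encoding stage, the following is a minimal sketch assuming the HuggingFace transformers library and the public bert-base-uncased checkpoint (which matches L=12, H=768, A=12); the patent does not prescribe a specific implementation or checkpoint, and the example log line is hypothetical.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")   # 12 layers, hidden size 768, 12 heads

# A parsed log line (template text after log parsing) used as an example.
log_line = "Receiving block blk_* src: /* dest: /*"
inputs = tokenizer(log_line, return_tensors="pt")        # adds [CLS] and [SEP] automatically

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional feature vector per token, carrying semantic and position information.
feature_code = outputs.last_hidden_state
print(feature_code.shape)   # e.g. torch.Size([1, num_tokens, 768])
```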
In the input stage, three types of Embeddings are summed before being input. As can be seen from FIG. 5, the log sequence is first tokenized to obtain a segmented text sequence; a special mark [CLS] is added at the beginning of the sequence, representing the result mark of the sequence, and it can be placed at the beginning or at the end. Different sentences are separated by the mark [SEP]. At this point the representation of each word of the log sequence consists of three parts: Token Embedding, Segment Embedding and Position Embedding. The sequence vector containing the three types of Embedding is input into the BERT network for feature extraction, and a sequence vector containing rich semantic features is finally output; the whole flow is shown in FIG. 6. BERT itself is simply a stack of Transformer encoders, and the encoder structure is shown in FIG. 7. The most critical component in the encoder is multi-head Self-Attention, which characterizes words by determining a weight matrix according to the degree of association between different words in the same sequence:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V    (2)
In the above formula, Q, K and V are word vector matrices and d_k is the dimension of the word embedding. The so-called multi-head attention mechanism projects Q, K and V through a plurality of different linear transformations and finally splices the attention outputs of the different heads with Concat. The formula is shown in (3).
MultiHead(Q, K, V) = Concat(head_1, ..., head_n)·W^O
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)    (3)
Through the multi-head attention operation, information under different representation subspaces can be obtained, where W^O and W_i^Q, W_i^K, W_i^V are weight matrices.
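The following is a minimal sketch of formulas (2) and (3), i.e. scaled dot-product attention and multi-head concatenation; the dimensions and the use of random (rather than learned) projection matrices are illustrative assumptions, not the patent's reference implementation.

```python
import math
import torch

def attention(Q, K, V):
    # Formula (2): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head(Q, K, V, n_heads=12):
    # Formula (3): project Q, K, V per head, run attention, splice the heads with Concat.
    d_model = Q.size(-1)
    d_head = d_model // n_heads
    heads = []
    for i in range(n_heads):
        # W_i^Q, W_i^K, W_i^V would normally be learned; random here for illustration.
        Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
        heads.append(attention(Q @ Wq, K @ Wk, V @ Wv))
    W_o = torch.randn(d_model, d_model)                 # output projection W^O
    return torch.cat(heads, dim=-1) @ W_o

x = torch.randn(10, 768)          # a log sequence of 10 tokens, 768-dimensional embeddings
print(multi_head(x, x, x).shape)  # torch.Size([10, 768])
```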
In addition, it can be seen in FIG. 6 that a position code (Positional Embedding) is further added to the input part. The data processed by an ordinary recurrent neural network is unidirectional; to address this, a position code computed with a sine-cosine scheme is added at the encoder input and summed with the original input embedding, so that the relative position information of each word in the sequence is obtained. A residual network is also added to the encoder to alleviate the gradient problem when the network becomes too deep.
Step 2: anomaly detection model construction and prediction stage
After the encoding in the first step, the feature codes are input into the anomaly detection model for training. The method trains only on normal logs: because normal logs account for the vast majority in a real production environment, the model only needs to learn the features of normal logs, and when the model is applied in a real environment, an anomaly can be reported whenever the predicted log differs greatly from the real log. A sliding window is used to control the size of the input and is typically set to 10: a log sequence of 10 time steps is taken as input to predict the output of the next 5 time steps, the loss is computed against the corresponding 5 time steps of the normal log, and the loss is then minimized iteratively. In this step the invention designs a log anomaly detection method based on a variant Transformer, in which the encoder takes the history of the log sequence as input and the decoder predicts, in an autoregressive way, what is likely to appear in the future log. The Self-Attention module in the original Transformer is replaced by a new attention mechanism: a CSPAttention module combining CSPNet and Self-Attention replaces the Self-Attention in the original Transformer, and this structure can greatly reduce the memory consumption and computation time of the attention layer while achieving equal or better prediction accuracy. The structure of the model is shown in FIG. 8.
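To make the sliding-window step concrete, the following is a minimal sketch assuming the parsed log sequence has already been reduced to a list of template (log-key) ids; the function name, window sizes and toy data are illustrative only.

```python
from typing import List, Tuple

def build_windows(log_keys: List[int],
                  history: int = 10,
                  horizon: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Slide over a normal log sequence and pair each 10-step history
    with the 5 steps that follow it as the prediction target."""
    samples = []
    for start in range(len(log_keys) - history - horizon + 1):
        x = log_keys[start:start + history]                       # encoder input
        y = log_keys[start + history:start + history + horizon]   # decoder target
        samples.append((x, y))
    return samples

# Example: a toy parsed log sequence of template ids.
windows = build_windows(list(range(20)))
print(len(windows), windows[0])   # 6 windows; the first pairs steps 0-9 with steps 10-14
```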
The log code sequence of 10 time steps is input into the Encoder of the model and summed with the position codes before input. The additional position codes reflect the positions of the elements; as in most common sequence models, the invention encodes the position information with Sin and Cos functions:
PE(pos, 2j) = sin(pos / 10000^(2j/d_model))
PE(pos, 2j+1) = cos(pos / 10000^(2j/d_model))    (4)
where pos represents the position, j is the encoding dimension index, and d_model represents the length of the vector.
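A compact sketch of this sinusoidal position encoding is given below, assuming PyTorch; the tensor layout and the summation with 768-dimensional BERT features are illustrative assumptions.

```python
import torch

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) table of sin/cos position codes as in formula (4)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    j = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    div = torch.pow(10000.0, j / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions use cos
    return pe

# The position codes are summed with the input features before the encoder.
x = torch.randn(10, 768)                 # 10 time steps of 768-dimensional features
x = x + sinusoidal_position_encoding(10, 768)
```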
The attention layer is then entered for the vector attention operation. This is also where the present invention makes its improvement: the input dimension is split into two parts, one part passes through a 1×1 convolution and the other part passes through a normal Self-Attention layer, and the two outputs are finally spliced together. The result then passes through Layer Normalization and Feed Forward operations, and residual connections are added around both the attention and Feed Forward operations to prevent gradient vanishing or explosion. This completes the encoder part, and the decoder operates as follows. Because the task of this model is to predict the logs of future time steps, a mask operation is included in the decoding operation; the intermediate attention layer of the decoder has the same structure and function as the attention layer in the encoder.
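The following is a minimal sketch of the CSPAttention idea described above (split the feature dimension, pass one half through a 1×1 convolution and the other half through ordinary self-attention, then splice), assuming PyTorch; module names, head count and the use of nn.MultiheadAttention are assumptions for illustration, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class CSPAttention(nn.Module):
    """Sketch of a CSPNet-style attention block: half of the channels pass through
    a 1x1 convolution, the other half through ordinary multi-head self-attention,
    and the two halves are concatenated again."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        assert d_model % 2 == 0
        self.half = d_model // 2
        self.conv = nn.Conv1d(self.half, self.half, kernel_size=1)          # 1x1 convolution branch
        self.attn = nn.MultiheadAttention(self.half, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        a, b = x[..., :self.half], x[..., self.half:]
        a = self.conv(a.transpose(1, 2)).transpose(1, 2)                     # convolution branch
        b, _ = self.attn(b, b, b)                                            # self-attention branch
        return torch.cat([a, b], dim=-1)                                     # splice the two halves

# Example: 10 time steps of 768-dimensional features, batch of 2.
out = CSPAttention(768)(torch.randn(2, 10, 768))
print(out.shape)   # torch.Size([2, 10, 768])
```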
A log anomaly detection system fusing BERT feature encoding with a variant Transformer, comprising:
a log encoding module, configured to receive a log sequence, parse the log sequence, and input the parsed log sequence into a pre-established BERT model to obtain log sequence feature codes with semantic information and position information; and
a prediction module, configured to input the log sequence feature codes with semantic information and position information into a pre-established log anomaly detection model based on a variant Transformer for training, obtain the log sequence likely to appear in the future, and accurately predict based on the obtained log sequence to obtain a detection result.
Proof: the time complexity of CSPAttention is at most 50% of that of traditional self-attention
Proof:
(1): First compute the time complexity of the original self-attention mechanism.
(2): Assume the input sequence is X, with sequence length L and embedding dimension E_1.
(3): Let the projection matrices of head i be W_i^Q, W_i^K (of width d_2) and W_i^V (of width d_3), i = 1, ..., n, and write d_2×n = E_2 and d_3×n = E_3.
(4): Computing Q_i = X·W_i^Q over all heads has time complexity L·E_1·d_2·n = L·E_1·E_2.
(5): Computing K_i = X·W_i^K over all heads has time complexity L·E_1·d_2·n = L·E_1·E_2.
(6): Computing V_i = X·W_i^V over all heads has time complexity L·E_1·d_3·n = L·E_1·E_3.
(7): Computing Q_i·K_i^T over all heads has time complexity L^2·d_2·n = E_2·L^2.
(8): Multiplying the attention weights by V_i over all heads has time complexity L^2·d_3·n = E_3·L^2.
(9): The output projection has time complexity L·E_3·E_4.
(10): In self-attention one typically has E_1 = E_2 = E_3 = E_4 = E.
(11): The time complexity of the original self-attention mechanism is therefore 4E^2·L + 2E·L^2.
(12): Next compute the time complexity of CSPAttention.
(13): The time complexity of the convolution branch (a 1×1 convolution over half of the dimensions) is L·(E/2)^2.
(14): The time complexity of the self-attention branch over the other half is 4(E/2)^2·L + 2(E/2)·L^2 = L·E^2 + E·L^2.
(15): Adding steps (13) and (14), the time complexity of CSPAttention is 1.25L·E^2 + E·L^2.
(16): Comparing the complexity of step (15) with that of step (11), 1.25L·E^2 + E·L^2 ≤ 0.5·(4E^2·L + 2E·L^2), so the time complexity of CSPAttention is reduced by at least 50%.
End of proof
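As an illustrative arithmetic check of this comparison (not part of the patent), substituting a few sample values of E and L into the two derived expressions confirms that the ratio stays at or below 50%:

```python
# Compare the derived complexities: self-attention 4*E^2*L + 2*E*L^2
# versus CSPAttention 1.25*E^2*L + E*L^2, for a few sample sizes.
for E, L in [(768, 10), (768, 512), (512, 1024)]:
    self_attn = 4 * E**2 * L + 2 * E * L**2
    csp_attn = 1.25 * E**2 * L + E * L**2
    print(f"E={E:4d} L={L:4d}  ratio={csp_attn / self_attn:.3f}")
# The ratio is at most 0.5 for all E, L > 0, since 1.25 <= 0.5*4 and 1 <= 0.5*2.
```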
Based on the same inventive concept, the present invention also provides a computer apparatus comprising: one or more processors, and a memory for storing one or more computer programs; the program includes program instructions, and the processor is configured to execute the program instructions stored in the memory. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computational and control core of the terminal for implementing one or more instructions, in particular for loading and executing one or more instructions within a computer storage medium to implement the method described above.
It should be further noted that, based on the same inventive concept, the present invention also provides a computer storage medium having a computer program stored thereon which, when executed by a processor, performs the above method. The storage medium may take the form of any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing has shown and described the basic principles, principal features, and advantages of the present disclosure. It will be understood by those skilled in the art that the present disclosure is not limited to the embodiments described above; the foregoing embodiments and description merely illustrate the principles of the disclosure, and various changes and modifications may be made without departing from the spirit and scope of the disclosure, which is defined by the appended claims.

Claims (10)

1. A method for detecting log abnormality by fusing BERT feature codes and variant transformers is characterized by comprising the following steps:
collecting a log sequence, analyzing the log sequence, and inputting the analyzed log sequence into a pre-established BERT model to obtain a log sequence feature code with semantic information and position information;
inputting the characteristic codes of the log sequences with semantic information and position information into a pre-established log abnormality detection model based on a variant Transformer for training to obtain a log sequence which possibly appears in the future, and accurately predicting the obtained log sequence to obtain a detection result.
2. The method for detecting log anomalies by fusing BERT feature codes and variant transformers according to claim 1, wherein the BERT model adopts the BERT_BASE configuration of formula (1), the number of Transformer Encoder blocks is 12, the hidden layer size is 768, the number of self-attention heads is 12, and the total parameter size is 110M.
3. The method for detecting log anomalies by fusing BERT feature codes and variant transformers according to claim 2, wherein the BERT model is as follows:
BERT BASE (L=12, H=768, A=12, TotalParam=110M).
4. the method for detecting log anomalies by fusing BERT feature codes and variant transformers according to claim 1, wherein the log sequences are segmented by token segmentation technology to obtain segmented text sequences; and a special mark [ CLS ] is added to the beginning of the text sequence, wherein [ CLS ] represents the result mark of the text sequence, and can be placed at the beginning or at the tail, and different sentences are separated by a mark [ SEP ].
5. The method for detecting log anomalies by merging BERT feature codes and variant transformers according to claim 4, wherein the output Embedding of each word of the log sequence consists of three parts, including: token Embedding, segment Embedding and Position Embedding.
6. The method for detecting log anomalies by fusing BERT feature codes and variant transformers according to claim 5, wherein the sequence vector containing the three types of Embedding is input into the BERT network for feature extraction, a sequence vector containing rich semantic features is finally output, and a weight matrix is determined according to the degree of association between different words to characterize the words:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V
where Q, K and V are word vector matrices and d_k is the dimension of the word embedding.
7. The method for detecting log anomalies by fusing BERT feature codes and variant transformers according to claim 6, wherein Q, K and V are projected through a plurality of different linear transformations, and the attention outputs of the different heads are finally spliced by Concat, with the following formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_n)·W^O
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
where W^O, W_i^Q, W_i^K and W_i^V are weight matrices.
8. The method of claim 1, wherein the anomaly detection model uses 10 time-step log sequence feature codes as inputs to predict a 5 time-step log sequence output in the future.
9. The method for detecting log anomalies by fusing BERT feature codes and variant transformers according to claim 8, wherein the log code sequence of 10 time steps is input into the Encoder of the model and summed with the position codes before input, the positions of the elements are reflected using additional position codes, and the position information is encoded using Sin and Cos functions:
PE(pos, 2j) = sin(pos / 10000^(2j/d_model))
PE(pos, 2j+1) = cos(pos / 10000^(2j/d_model))    (4)
where pos represents the position, j is the encoding dimension index, and d_model represents the length of the vector, the first expression of formula (4) being used for even dimensions and the second for odd dimensions;
then, in the encoding process, the attention layer is entered for the vector attention operation: the input dimension is first split into two parts, one part passes through a 1×1 convolution and the other part passes through a normal self-attention layer, and the two outputs are then spliced, completing the encoding operation; a mask operation is then added in the decoding operation, the attention layer of the decoder being identical to the attention layer of the encoder, and the decoded output gives the detection result.
10. A log anomaly detection system fusing BERT feature encoding with a variant Transformer, comprising:
a log encoding module, configured to receive a log sequence, parse the log sequence, and input the parsed log sequence into a pre-established BERT model to obtain log sequence feature codes with semantic information and position information; and
a prediction module, configured to input the log sequence feature codes with semantic information and position information into a pre-established log anomaly detection model based on a variant Transformer for training, obtain the log sequence likely to appear in the future, and accurately predict based on the obtained log sequence to obtain a detection result.
CN202310417120.0A 2023-04-19 2023-04-19 Log anomaly detection method and system integrating BERT feature codes and variant transformers Pending CN116450393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310417120.0A CN116450393A (en) 2023-04-19 2023-04-19 Log anomaly detection method and system integrating BERT feature codes and variant transformers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310417120.0A CN116450393A (en) 2023-04-19 2023-04-19 Log anomaly detection method and system integrating BERT feature codes and variant transformers

Publications (1)

Publication Number Publication Date
CN116450393A (en) 2023-07-18

Family

ID=87133373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310417120.0A Pending CN116450393A (en) 2023-04-19 2023-04-19 Log anomaly detection method and system integrating BERT feature codes and variant transformers

Country Status (1)

Country Link
CN (1) CN116450393A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891900A (en) * 2024-03-18 2024-04-16 腾讯科技(深圳)有限公司 Text processing method and text processing model training method based on artificial intelligence
CN117972596A (en) * 2023-11-30 2024-05-03 北京谷器数据科技有限公司 Risk prediction method based on operation log


Similar Documents

Publication Publication Date Title
CN116450393A (en) Log anomaly detection method and system integrating BERT feature codes and variant transformers
Tay et al. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference
CN116627708B (en) Storage fault analysis system and method thereof
CN108427720A (en) System log sorting technique
WO2021151292A1 (en) Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium
Chen et al. Joint entity and relation extraction for legal documents with legal feature enhancement
CN109445844B (en) Code clone detection method based on hash value, electronic equipment and storage medium
CN115618269B (en) Big data analysis method and system based on industrial sensor production
CN113343677B (en) Intention identification method and device, electronic equipment and storage medium
CN115344414A (en) Log anomaly detection method and system based on LSTM-Transformer
CN116776270A (en) Method and system for detecting micro-service performance abnormality based on transducer
CN113553245B (en) Log anomaly detection method combining bidirectional slice GRU and gate control attention mechanism
WO2024148880A1 (en) System detection method and apparatus based on multi-source heterogeneous data
CN117591913A (en) Statement level software defect prediction method based on improved R-transducer
Chen et al. MTQA: Text‐Based Multitype Question and Answer Reading Comprehension Model
Huang et al. Software defect prediction model based on attention mechanism
CN114969334B (en) Abnormal log detection method and device, electronic equipment and readable storage medium
CN115221045A (en) Multi-target software defect prediction method based on multi-task and multi-view learning
Wang et al. FastTransLog: A Log-based Anomaly Detection Method based on Fastformer
CN114969335B (en) Abnormality log detection method, abnormality log detection device, electronic device and readable storage medium
Ezukwoke et al. Leveraging pre-trained models for failure analysis triplets generation
Mandakath Gopinath Root Cause Prediction from Log Data using Large Language Models
CN117521656B (en) Chinese text-oriented end-to-end Chinese entity relationship joint extraction method
CN118260765A (en) Code clone detection method and device for power data safety monitoring system
CN118132304A (en) Log anomaly detection method and system based on pre-training model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination