CN115277180A

CN115277180A - Block chain log anomaly detection and tracing system

Info

Publication number: CN115277180A
Application number: CN202210882913.5A
Authority: CN
Inventors: 牛伟纳; 张小松; 廖旭涵; 赵丽睿; 周孝笑; 朱宇坤; 张然
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-07-26
Filing date: 2022-07-26
Publication date: 2022-11-01
Anticipated expiration: 2042-07-26
Also published as: CN115277180B

Abstract

The invention relates to the field of blockchain applications. The system addresses the lack of a data anomaly detection function in current blockchain architectures and performs data detection securely, accurately and reliably. A template is extracted from the data logs and quantity features are counted. A model is trained on feature representations of the logs, where the features are divided into quantity features and time-sequence features. A log sequence to be detected is processed by the data processing module and then fed to the models trained by the quantity and time-sequence model training module; each model outputs a value between 0 and 1, recorded as the time-sequence deviation degree and the quantity deviation degree respectively, from which a final deviation degree is calculated. Logs whose deviation exceeds the threshold are written into a table and given a threat mark, and the log sequence to which each threat log belongs is given as the tracing output. If an alarm is found to be false during auditing, the log sequence is marked as a false alarm so that the system can dynamically adjust the threshold and thereby improve accuracy.

Description

Block chain log anomaly detection and traceability system

Technical Field

The invention belongs to the field of block chain data security, and provides a block chain log anomaly detection and tracing system.

Background

Blockchain is one of the most popular technologies today and is widely used in scenarios such as finance and supply chains. Blockchains generally take one of three forms: public chains, alliance (consortium) chains and private chains. In the early stage of blockchain applications the public chain was the main form: anyone can participate in supervision, so the authenticity of on-chain information is the strongest, but the large number of participants makes it inefficient. When a blockchain is used on a small scale within an enterprise, a private chain is usually chosen; it has few participants, but it is highly centralized and can generally only operate in a single-center setting. The alliance chain, which combines the advantages of the two, is the form chosen by most applications at present. An alliance chain is jointly supervised by several main participating organizations; each organization independently controls which individuals it authorizes to join the blockchain network, and those individuals register and take part in supervision as members of that organization. On-chain information is transparent to all participants, and operations such as adding data are supervised collectively, so the chain is traceable and tamper-resistant. At present, blockchain applications at home and abroad mainly aim to guarantee that auditable data is tamper-proof, credible, complete and traceable.

Logs are the most representative auditable data. They record operational information, such as various parameters, while a system is running, and system developers can find, locate and resolve problems in time by auditing the logs regularly or when abnormal behavior occurs. Existing logging systems have problems, however. If the system is deliberately attacked, a prepared attacker can tamper with the log records, so that developers cannot locate errors from the falsified records, which makes repairing the system and locating problems harder. In addition, with the widely used log anomaly detection methods, developers detect anomalies based on their domain knowledge, combining log severity levels with keyword searches, regular expressions and the like. This approach relies heavily on manual work and becomes harder as the system grows larger and more complex.

A log is time-series text data composed of a timestamp and a text message. It records the running state of a service in real time and exhibits certain quantitative correspondences; for example, if several files are opened, the same number of files should be closed. An anomaly may be indicated if the execution order of the logs is wrong or such a correspondence does not hold. However, log specifications are currently not uniform, logs printed by different types of equipment differ in format, and log data is unstructured, so logs are difficult to process automatically in batches. These problems make log analysis very difficult.

A log parser based on a fixed-depth tree first preprocesses the raw log messages with simple regular expressions derived from domain knowledge, and then searches for a log group according to specially designed rules encoded in the internal nodes of the tree. If a matching log group is found, the log message is matched to the log event stored in that group; otherwise a new log group is created. In the end every log is assigned to a log group, so the parser in effect classifies the logs, and the common pattern of logs of the same form is extracted as a template.

Disclosure of Invention

The invention discloses an automatic blockchain log anomaly detection and tracing scheme based on an alliance (federation) chain. In conventional blockchain applications, data on the alliance chain is not checked for anomalies automatically but only manually, relying on experience. As the data on the alliance chain becomes more varied and complex, relying on manual detection alone is no longer feasible. It is therefore necessary to develop a sound, high-accuracy log anomaly detection technique to increase the automation of the system. The log anomaly detection and tracing scheme addresses the lack of a data anomaly detection function in current blockchain architectures and performs data detection securely, accurately and reliably.

In order to solve the technical problems, the invention adopts the following technical scheme:

a blockchain log anomaly detection and tracing system, comprising:

a data processing module: extracts templates from the data logs, where a log template comprises a constant part and a variable part, structures the unstructured log data into template logs that are easy to analyze, and counts quantity features according to the template, the quantity features being the number of occurrences of each word and of each word combination in the template;

a quantity and time-sequence model training module: trains models on feature representations of the logs, where the features are divided into quantity features and time-sequence features, which are input into the quantity model and the time-sequence model respectively for training; the quantity features are the number of occurrences of each word and of each word combination in the template, and the time-sequence features are the order of the logs;

a deviation calculation module: for a log sequence to be detected, processes the data through the data processing module and then feeds it to the models trained by the quantity and time-sequence model training module; each model outputs a value between 0 and 1, recorded as the time-sequence deviation degree and the quantity deviation degree respectively, from which a final deviation degree is calculated;

an exception tracing module: writes logs whose deviation exceeds the threshold into a table, gives them a threat mark, and gives the log sequence to which each threat log belongs as the tracing output; if an alarm is found to be false during auditing, the log sequence is marked as a false alarm so that the system can dynamically adjust the threshold and thereby improve accuracy.

In the above technical solution, the data processing module uses the Drain log template extractor combined with multidimensional feature combination to output statistical features, specifically:

1) The Drain log template extractor extracts templates from the existing logs of the blockchain network;

2) For each template extracted by Drain, the number of occurrences of each word and of each word combination in the template is counted;

3) When a node uploads a log to the blockchain network, Drain is used to classify the log into the corresponding template, and its quantity features are counted.
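As a rough illustration of step 3), the sketch below masks variable fields with regular expressions and assigns an uploaded log line to a template. It is a simplification under stated assumptions: exact matching of the masked line stands in for Drain's tree search, and the mask rules, `mask_variables` and `assign_template` are hypothetical names chosen for this example. The mask labels are borrowed from the template example given later in the description.

```python
import re

# Hypothetical mask rules; a real deployment would tune these to its own log formats.
MASK_RULES = [
    (re.compile(r"blk_-?\d+"), "[ID]"),                           # block identifiers
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+"), "[IPANDPORT]"),  # IPv4 address plus port
]


def mask_variables(line):
    """Replace the variable parts of a raw log line with masks."""
    for pattern, mask in MASK_RULES:
        line = pattern.sub(mask, line)
    return line


def assign_template(line, templates):
    """Return the index of the template for this line, adding a new template if none matches.

    Exact matching of the masked line is a stand-in for Drain's tree search.
    """
    masked = mask_variables(line)
    if masked not in templates:
        templates.append(masked)
    return templates.index(masked)
```

Two lines that differ only in block identifier, address or port end up in the same template, which is the property the later counting step relies on.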

In the above technical solution, in the quantity and time-sequence model training module:

the time-sequence features and the quantity features of the log sequence are acquired separately;

the time-sequence features are used to train an attention-based GRU model, yielding the time-sequence model;

the quantity features are used to train a gradient boosting decision tree, yielding the quantity model;

the time-sequence model and the quantity model with the highest accuracy during training are saved.

In the above technical solution, training the attention-based GRU model comprises the following steps:

A. The logs are text data and the extracted templates are also text, so semantic conversion is needed before the text is input into the model: the input log template text is converted into a log template vector using semantic vectors pre-trained with GloVe. A program cannot process words directly but can process numeric vectors; the GloVe vocabulary is a one-to-one mapping between words and numeric vectors, so words can be converted by table lookup.

B. The batch of log template vectors is converted into log template sequence vectors using a sliding window;

C. The log template sequence vectors are input into the model so that the model learns the time-sequence features;

D. The training result is saved to obtain the time-sequence model.
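Steps A and B can be sketched as follows. This is a minimal illustration, not the patented implementation: the tiny vocabulary passed in by the caller stands in for the 300-dimensional GloVe table, unknown words are mapped to a zero vector (an assumption), and `line_vector` and `sliding_windows` are hypothetical helper names.

```python
def line_vector(words, vocab, dim):
    """Sum the vectors of the words in one log template (table lookup per word)."""
    vec = [0.0] * dim
    for word in words:
        # Assumption: words missing from the vocabulary contribute a zero vector.
        for i, x in enumerate(vocab.get(word, [0.0] * dim)):
            vec[i] += x
    return vec


def sliding_windows(vectors, window):
    """Turn a list of template vectors into overlapping sequences of length `window`."""
    return [vectors[i:i + window] for i in range(len(vectors) - window + 1)]
```

Each window is one training sample for the time-sequence model; if fewer than `window` logs are available, no sample is produced.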

In the above technical solution, the exception tracing module operates as follows:

1) A threshold is set and the deviation value is judged against it; if the deviation value exceeds the threshold, the log is marked as abnormal (marked as 1);

2) Data that turns out to be a false alarm can be marked as a false alarm (marked as 0). Whether to adjust the threshold is decided according to whether the number of true anomalies below the false alarm's deviation value within a certain period is particularly small; if it is, the threshold is too low and needs to be raised;

3) After being processed by the data processing module, the log data to be detected is input into the models trained in the quantity and time-sequence model training module to obtain a deviation value;

4) Whether the log is marked as abnormal is judged according to the threshold;

5) If the log is abnormal, the log sequence related to the anomaly is traced and output.

In the above technical solution, the process of tracing and outputting an anomaly is as follows:

1) The one-to-one correspondence between the original text and the vector of each log to be tested is cached in memory;

2) If a log vector is marked as abnormal, its original log text and the related log sequence information are obtained by table lookup;

3) Otherwise, the cache is cleared and the next batch of logs is detected.

By adopting the above technical solution, the invention has the following beneficial effects:

1. The log data is stored on the alliance chain, which guarantees the credibility and reliability of log data auditing.

2. Log anomalies are detected automatically by combining the advantages of deep learning and machine learning and using both the time-sequence features and the quantity features of the logs.

3. Through the attention mechanism on the time-sequence features, the screening of quantity features and a comprehensive deviation calculation with final weight assignment, irrelevant features and influences are effectively removed and the accuracy of the results is improved.

Drawings

FIG. 1 is the basic flow of blockchain log anomaly detection and tracing;

FIG. 2 is the process of log anomaly detection;

FIG. 3 is the architecture of the blockchain log anomaly detection and tracing system.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

The detailed description of the embodiments of the present invention is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The blockchain log anomaly detection and tracing system comprises the following modules:

A data processing module: extracts log templates from unstructured logs and counts the occurrences of the fixed words and of the fixed-word combinations in each template.

A quantity and time-sequence model training module: converts the log sequence into numeric vectors and inputs them into the time-sequence model for training. At the same time, the quantity features counted by the data processing module are input into the quantity model for training, and the training results are saved.

A deviation calculation module: inputs the log sequence to be tested into the models trained in the quantity and time-sequence model training module, calculates the time-sequence deviation degree and the quantity deviation degree respectively, and then calculates the comprehensive deviation degree.

An exception tracing module: marks a log as abnormal or not according to the value of its deviation degree; if it is abnormal, the related log sequence is traced and output, and an operator can dynamically adjust the deviation threshold through feedback.

The main processes of the scheme for the four modules comprise:

A. The data processing module extracts log templates from the large amount of unstructured log data existing on the blockchain network and, for each template, counts the occurrences of its fixed words and of its fixed-word combinations.

B. The log text sequence is converted into numeric vectors through the GloVe vocabulary and temporarily cached. The sequences containing time-sequence information are input into the time-sequence model, the quantity features counted from the templates corresponding to these sequences are input into the quantity model, the two models are trained separately, and the training result with the highest accuracy is saved.

C. The log sequence to be tested is processed by the data processing module and input into the trained quantity and time-sequence models to calculate the time-sequence deviation t and the quantity deviation n. Weights w1 and w2 are assigned to the time-sequence deviation and the quantity deviation according to their influence on the final result, and the comprehensive deviation is calculated as y = w1·t + w2·n.

D. Whether the log is marked as abnormal is determined by whether the comprehensive deviation y is larger than a set threshold m. If it is abnormal, the abnormal log sequence is output using the log association information cached in memory; if not, the cache is cleared.
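Steps C and D reduce to a weighted sum followed by a comparison, sketched below. The source gives no values for w1, w2 or m, so the defaults here are purely illustrative assumptions; weights that sum to 1 keep y within the 0-1 range of the two inputs. The function names are hypothetical.

```python
def comprehensive_deviation(t, n, w1=0.5, w2=0.5):
    """Combine the time-sequence deviation t and the quantity deviation n as y = w1*t + w2*n.

    The default weights are illustrative assumptions, not values from the source.
    """
    if not (0.0 <= t <= 1.0 and 0.0 <= n <= 1.0):
        raise ValueError("both deviations are expected to lie in [0, 1]")
    return w1 * t + w2 * n


def is_abnormal(y, m):
    """Mark a log as abnormal when its comprehensive deviation y exceeds the threshold m."""
    return y > m
```

For example, with t = 0.8, n = 0.4 and weights 0.7 and 0.3, y is approximately 0.68, which would be flagged under a threshold of m = 0.5.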

Further, in step B, the log templates are parsed first, using the Drain log parser. Conventional variable information is replaced with masks, the logs are classified and aggregated according to their length and prefix similarity, and the log templates are finally obtained.

In particular, the system combines both the time-sequence features and the quantity features of the log sequence. The time-sequence feature is the order in which logs are executed; for example, a file is created, then written, then deleted, in that order. The quantity feature is a correspondence between counts; for example, a file that is opened a certain number of times should be closed the same number of times.

On this basis, a classic, high-accuracy deep learning algorithm and a machine learning method are used:

1) For time-sequence training, an attention-based GRU (Gated Recurrent Unit) algorithm is used. It can focus on specific parts of the input sequence according to their contribution, ignore irrelevant sequence noise, and yield a good time-sequence model through training.

2) For quantity training, a gradient boosting decision tree is used to screen the one-dimensional and two-dimensional quantity features, remove irrelevant features, and finally obtain a quantity model that is effective at distinguishing anomalies.
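The description does not specify the form of the attention mechanism. As one possible reading, the sketch below applies dot-product attention over the per-step hidden states that a GRU would produce: each state is scored against a query vector, the scores are normalized with softmax, and the states are averaged with those weights so that low-scoring (noisy) steps contribute little. The GRU itself is omitted, and `attention_pool` and the query vector are assumptions made for illustration.

```python
import math


def attention_pool(hidden_states, query):
    """Dot-product attention over a sequence of hidden states.

    Returns the softmax weights and the weighted sum (context vector).
    """
    # Score each time step by its dot product with the query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax over the time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted average of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context
```

In a trained model the query (or an equivalent scoring layer) would be learned together with the GRU parameters; here it is simply passed in.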

In step D, the threshold is initially set manually to a low value so that as many anomalies as possible are captured. The threshold is then adjusted dynamically through operator feedback by observing the false alarm rate: among the logs whose deviation is greater than or equal to the threshold and less than or equal to the deviation value of a marked false alarm, the system checks whether the number of false alarms greatly exceeds the number of true anomalies; if it does, the threshold is updated to that false alarm's deviation value.
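One way to read this feedback rule is sketched below. It assumes that auditors have reviewed the flagged logs and labelled each as a false alarm or a confirmed anomaly, and it interprets "greatly exceeds" as exceeding by a configurable factor. The function name `adjust_threshold`, the `ratio` parameter and its default value are hypothetical.

```python
def adjust_threshold(threshold, reviewed, false_alarm_dev, ratio=3.0):
    """Raise the threshold to a false alarm's deviation when false alarms dominate below it.

    reviewed: list of (deviation, is_false_alarm) pairs for audited logs.
    ratio: assumed factor by which false alarms must outnumber confirmed anomalies.
    """
    # Only logs between the current threshold and the false alarm's deviation are considered.
    in_band = [is_fa for dev, is_fa in reviewed if threshold <= dev <= false_alarm_dev]
    false_alarms = sum(in_band)
    anomalies = len(in_band) - false_alarms
    if false_alarms > ratio * anomalies:
        return false_alarm_dev
    return threshold
```

With a starting threshold of 0.3, four false alarms and one confirmed anomaly in the band up to 0.40 would move the threshold to 0.40 under the default ratio, while a stricter ratio of 5 would leave it unchanged.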

The technical scheme of the invention is further explained as follows:

1. Extraction of log templates

Log parsing addresses the problems that current log specifications are not uniform, that logs printed by different types of equipment differ in format, and that log data is unstructured, making the logs easier for personnel to analyze. A log parser extracts log templates from the logs, which makes the types of processed logs clearer and reduces the processing difficulty. The main steps of the log parser are as follows.

1. Preprocessing: obvious variable parts are replaced with masks using regular expressions.

2. Classification by log length: logs are grouped according to the number of tokens in the original log.

3. Classification by tree depth: logs are classified according to a preset depth, which is generally set to 4 and can be fine-tuned for the actual scenario; the depth affects the number of nodes traversed during the search and the precision.

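4. Classification by similarity: after the above classification, the similarity between the log and each candidate class is computed as

$$\mathrm{simSeq} = \frac{\sum_{i=1}^{n} \mathrm{equ}\big(seq_1(i),\, seq_2(i)\big)}{n}, \qquad \mathrm{equ}(t_1, t_2) = \begin{cases} 1, & t_1 = t_2 \\ 0, & \text{otherwise} \end{cases}$$

where seq1(i) is the i-th token of log sequence 1, seq2(i) is the i-th token of log sequence 2, t1 and t2 are the tokens at the same position in the two sequences, and n is the length of the longer of the two sequences being compared. The similarity is used to judge whether the sequences belong to the same class; if not, a new class is added.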
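The position-by-position comparison in step 4 can be sketched as follows, assuming the similarity is the fraction of positions at which the two sequences hold the same token, as the definitions of t1, t2 and n suggest. The similarity threshold `st` and the helper names are assumptions for illustration; the source does not state a threshold value.

```python
def sim_seq(seq1, seq2):
    """Fraction of positions at which two token sequences hold the same token."""
    n = max(len(seq1), len(seq2))  # length of the longer sequence
    if n == 0:
        return 1.0
    matches = sum(1 for t1, t2 in zip(seq1, seq2) if t1 == t2)
    return matches / n


def assign_group(tokens, groups, st=0.5):
    """Return the index of the most similar group, or add a new group if none reaches st."""
    best_index, best_sim = -1, -1.0
    for i, group in enumerate(groups):
        s = sim_seq(tokens, group)
        if s > best_sim:
            best_index, best_sim = i, s
    if best_sim >= st:
        return best_index
    groups.append(list(tokens))
    return len(groups) - 1
```

A log that shares two of three tokens with an existing group joins it under the assumed threshold of 0.5, while a log with no tokens in common starts a new group.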

2. Retention period of log data

The purpose of log anomaly detection is to trace abnormal logs to their source, locate threats and investigate them in time. However, logs are converted into numeric vectors during anomaly detection and are then no longer readable, and because the number of logs processed is huge, not all log sequence relationships can be stored. How long data is retained and which content is cached are therefore of great significance for log anomaly detection. The specific caching process is as follows:

Step one: in the data processing module, each original log is labelled with a sequence number and a cache table is established whose entries are log sequence number - log template. The sequence number replaces the original log in subsequent intermediate processing, which improves time and space efficiency.

Step two: in the quantity and time-sequence model training module, the original log text is converted through the GloVe vocabulary into a numeric vector, and a cache table is established whose entries are log sequence number - numeric vector.

Step three: in the deviation calculation module, the deviation degree of the log to be detected is calculated, and a cache table is established whose entries are log sequence number - comprehensive deviation degree.

Step four: in the exception tracing module, if the deviation exceeds the threshold, the log sequence is returned according to the log sequence number and the window size initially set by the system, and the cache tables are cleared.
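The caching steps can be sketched with a small in-memory structure keyed by log sequence number. This is an illustration under assumptions: only the text and deviation tables are modelled, the returned "log sequence" is taken to be the window of logs ending at the abnormal one, and `TraceCache` and its methods are hypothetical names.

```python
class TraceCache:
    """In-memory cache keyed by log sequence number, cleared after each detection batch."""

    def __init__(self, window):
        self.window = window   # window size initially set by the system
        self.raw = {}          # sequence number -> original log text
        self.deviation = {}    # sequence number -> comprehensive deviation degree

    def put(self, seq, text, deviation):
        self.raw[seq] = text
        self.deviation[seq] = deviation

    def trace(self, threshold):
        """Return {abnormal sequence number: original logs in its window}, then clear the cache."""
        result = {}
        for seq, dev in self.deviation.items():
            if dev > threshold:
                # Assumption: the window ends at the abnormal log.
                start = seq - self.window + 1
                result[seq] = [self.raw[s] for s in range(start, seq + 1) if s in self.raw]
        self.raw.clear()
        self.deviation.clear()
        return result
```

Because the cache is cleared on every call to `trace`, memory use is bounded by the batch size rather than by the total number of logs processed.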

3. Formats of log data

Four basic data formats are used in log anomaly processing:

1) Original log data: Receiving block blk_-354458 src: /10.250.19.102:39325 dest: /10.250.19.102:50010.

2) Log template: Receiving block [ID] src: [ ] dest: /[IPANDPORT].

3) GloVe vocabulary: Receiving: [300-dimensional numeric vector], block: [300-dimensional numeric vector], src: [300-dimensional numeric vector], dest: [300-dimensional numeric vector]. The vector for a log line is formed by adding the vectors of its words.

4) Quantity features: Receiving:1, block:1, src:1, dest:1, Receiving-block:1, Receiving-src:1 …

These are expressed as a vector [1,1,1, …]. For standardization, n is calculated as the sum of the occurrence counts, and the features are finally expressed as the vector [1/n,1/n,1/n, …]. Here Receiving-block is the combination of the words Receiving and block, and Receiving-src is the combination of the words Receiving and src.
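The quantity features and their normalization can be sketched as follows. The sketch assumes that "combined words" are unordered pairs of the fixed words in a template and that bracketed masks are not counted; `quantity_features` is a hypothetical helper name.

```python
import re
from collections import Counter
from itertools import combinations


def quantity_features(template):
    """Count fixed words and word pairs in a template and normalize by the total count n."""
    # Drop bracketed masks such as [ID] so that only the fixed words remain.
    fixed = re.sub(r"\[[^\]]*\]", " ", template)
    words = re.findall(r"[A-Za-z]+", fixed)
    counts = Counter(words)
    # Assumption: a "combined word" is a pair of fixed words, written word1-word2.
    counts.update("-".join(pair) for pair in combinations(words, 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}
```

For a template with the four fixed words Receiving, block, src and dest, this yields four single-word features and six pair features, each equal to 1/10 after normalization.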

Examples

The specific data execution process is as follows:

Step one: the original log Receiving block blk_-354458 src: /10.250.19.102:39325 dest: /10.250.19.102:50010 is input into the data processing module for classification, yielding the log template Receiving block [ID] src: [ ] dest: /[IPANDPORT] and the counted quantity features Receiving:1, block:1, src:1, dest:1, Receiving-block:1, Receiving-src:1 …, which are normalized to the vector [1/n,1/n,1/n,1/n,1/n,1/n, …].

Step two: the original log Receiving block blk_-354458 src: /10.250.19.102:39325 dest: /10.250.19.102:50010 is converted into a 300-dimensional numeric vector by table lookup in the GloVe vocabulary, giving the log sequence. The log sequence is input into the time-sequence model and the quantity vector obtained in step one is input into the quantity model, yielding a time-sequence deviation degree (0-1) and a quantity deviation degree (0-1) respectively, from which the comprehensive deviation degree (0-1) is calculated.

In the training stage, the model fits a very complex function from the input log template vectors and their labels, for example 0 for a normal log and 1 for an abnormal log. After training, a log template vector is input and the trained function produces an output between 0 and 1; the closer the output is to 0, the more likely the log is normal, and the closer it is to 1, the more likely it is abnormal.

Step three: whether the comprehensive deviation degree reaches the threshold is judged; if so, the log is marked as abnormal and the other original logs related to the input log are output.

Claims

1. A blockchain log anomaly detection and tracing system, characterized by comprising:

a data processing module: extracts templates from the data logs, where a log template comprises a constant part and a variable part, structures the unstructured log data into template logs that are easy to analyze, and counts quantity features according to the template, the quantity features being the number of occurrences of each word and of each word combination in the template;

a quantity and time-sequence model training module: trains models on feature representations of the logs, where the features are divided into quantity features and time-sequence features, which are input into the quantity model and the time-sequence model respectively for training; the quantity features are the number of occurrences of each word and of each word combination in the template, and the time-sequence features are the order of the logs;

2. The system of claim 1, wherein the data processing module uses the Drain log template extractor combined with multidimensional feature combination to output statistical features, specifically:

1) The Drain log template extractor extracts templates from the existing logs of the blockchain network;

2) For each template extracted by Drain, the number of occurrences of each word and of each word combination in the template is counted;

3) When a node uploads a log to the blockchain network, Drain is used to classify the log into the corresponding template, and its quantity features are counted.

3. The system of claim 1, wherein, in the quantity and time-sequence model training module:

the quantity features are used to train a gradient boosting decision tree, yielding the quantity model;

the time-sequence model and the quantity model with the highest accuracy during training are saved.

4. The system of claim 3, wherein training the attention-based GRU model comprises the following steps:

A. The logs are text data and the extracted templates are also text, so semantic conversion is needed before the text is input into the model: the input log template text is converted into a log template vector using semantic vectors pre-trained with GloVe;

D. The training result is saved to obtain the time-sequence model.

5. The system of claim 1, wherein the exception tracing module operates as follows:

1) A threshold is set and the deviation value is judged against it; if the deviation value exceeds the threshold, the log is marked as abnormal;

2) Data that turns out to be a false alarm can be marked as a false alarm. Whether to adjust the threshold is decided according to whether the number of true anomalies below the false alarm's deviation value within a certain period is particularly small; if it is, the threshold is too low and needs to be raised;

4) Whether the log is marked as abnormal is judged according to the threshold;

5) If the log is abnormal, the log sequence related to the anomaly is traced and output.

6. The system of claim 5, wherein the process of tracing and outputting an anomaly is as follows:

3) Otherwise, the cache is cleared and the next batch of logs is detected.