CN117271669A

CN117271669A - Distributed system and method for retrospectively analyzing historical data

Info

Publication number: CN117271669A
Application number: CN202311492636.8A
Authority: CN
Inventors: 孙伟鹏
Original assignee: Ruian Zhiyuan Beijing Information Technology Co ltd
Current assignee: Ruian Zhiyuan Beijing Information Technology Co ltd
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2023-12-22

Abstract

The invention discloses a distributed system and a method for retrospective analysis of historical data. The system consists of storage nodes, analysis nodes, a management node and cache nodes. The management node is the man-machine interaction node of the system. The storage nodes, analysis nodes and cache nodes are distributed nodes; the distributed nodes are peers, and none of them acts as a central node. The storage nodes store historical data in a distributed manner using classified partition storage. The analysis rules of the analysis nodes are uniformly configured, stored and managed through the management node, and must be precompiled into a rule tree. The method for retrospectively analyzing historical data comprises: starting the system; precompiling and grouping the analysis rules; acquiring data; analyzing the data; performing secondary analysis on the data; and tuning the analysis rules. By slightly sacrificing single-node performance, the method makes service processing more convenient, and the overall performance of the system is improved by adding distributed nodes.

Description

Distributed system and method for retrospectively analyzing historical data

Technical Field

The invention relates to the field of computer information processing, in particular to a distributed system and a method for retrospectively analyzing massive and multi-source historical data, belonging to the field of big data analysis.

Background

We have entered an era of information explosion: information technology is highly developed, and information is generated, propagated and acquired at very high speed. Meanwhile, with the development of new technologies such as artificial intelligence, big data and cloud computing, the software and hardware devices, hosts, application systems and security protection facilities of all kinds of information systems generate large amounts of data. These devices are scattered across different positions in the network topology, so the volume and variety of data keep growing, and various technologies and tools are required to store, manage and analyze the data.

With the development of storage technology, storing data has become easy, and the data that enterprises, institutions and organizations store every day is measured in TB or even PB. For data at this scale, however, simple queries are usually all that is feasible; analyzing the data again is very difficult.

In business security and information security, today's enterprises and organizations face a more complex situation than before: endless intrusions and attacks from outside, and violations and leaks from inside.

Current enterprise information security analysts, auditors and data auditing platforms face the following problems:

1. The stored data grows exponentially, and this mass of data must be analyzed.

2. The speed of data processing is far lower than the speed of data receiving and storing.

3. The means of network attack are more and more diversified, and attack techniques are more and more hidden.

4. Some abnormal behavior looks normal when each event is viewed in isolation; the anomaly appears only when multiple seemingly unrelated normal events are taken together.

5. After the analysis rules have been tuned, it is difficult to re-analyze the historical data rapidly and accurately.

The industry currently handles abnormal events mainly with two schemes:

1. Analysis rules are predefined, and received data is analyzed in real time on a centralized or distributed platform. The disadvantages of this scheme are: first, the rules must not be too complex; second, the processing capacity is generally limited; third, events separated by long time windows are difficult to correlate; finally, the analysis rules often need to be tuned, and the adjusted rules cannot be applied to historical data that has already been processed.

2. Backtracking analysis can analyze historical data, but the different rules are usually evaluated independently, so the same judgments are often repeated; performance is therefore low, and the number of supported rules is limited.

To address these problems in the current industry, the invention mainly achieves the following:

1. Complex analysis rules are supported.

2. By precompiling the analysis rules, duplicate matches are reduced, thereby improving performance.

3. The complexity of the rules has little impact on performance.

4. Event association over long time windows is supported.

5. After the rules are tuned, the data can be analyzed again, which reduces missed reports.

6. Processing speed is greatly improved by scaling out, that is, by adding distributed nodes.

Disclosure of Invention

To overcome the above shortcomings, the invention provides a distributed system and a method for retrospectively analyzing historical data. Precompiling the analysis rules reduces repeated calculation and thereby improves the matching performance of a single computing node; adding distributed nodes breaks through the processing bottleneck of a single computing node and thereby improves the analysis performance of the whole system.

In order to solve the technical problems, the invention adopts the following technical scheme: a distributed system for retrospectively analyzing historical data comprises a storage node, an analysis node, a management node and a cache node;

the management node is a man-machine interaction node in the system;

the storage nodes, the analysis nodes and the cache nodes are distributed nodes; the distributed nodes are peers, and none of them is a central node;

the storage nodes adopt classified partition storage to carry out distributed storage on historical data;

the analysis rules of the analysis nodes are uniformly configured, stored and managed through the management nodes;

the analysis rules need to be precompiled into rule trees, different priorities are set according to tree nodes, and the rule trees are stored on cache nodes.

Preferably, the classification partition storage rule is configurable: storing according to data sources or storing according to device types; the storage node may also be an analysis node.

Preferably, the analysis rules are precompiled for rules that are already in an enabled state: when the analysis rules are optimized, the rule tree needs to be recompiled, and the cache nodes are synchronously updated; after tuning the analysis rules, selecting whether to carry out retrospective analysis on the historical data again.

Preferably, precompilation splits the analysis rules into a plurality of calculation factors, wherein:

1) The smallest computational factor is called an operator;

2) Forming a logic expression by a plurality of operators;

3) A matching condition is formed by a plurality of logic expressions, which is called an event;

4) Forming event association conditions by a plurality of events;

5) When the data store is stored in terms of data source partitions, all events have an attribute named data source, which cannot be null.
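
The hierarchy of calculation factors listed above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names `Operator`, `LogicExpression` and `Event`, the helper `flag`, and the `source` field on each record are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Record = Dict[str, object]


@dataclass(frozen=True)
class Operator:
    """Smallest calculation factor: a named Boolean test on one record."""
    name: str
    test: Callable[[Record], bool]

    def evaluate(self, record: Record) -> bool:
        return bool(self.test(record))


@dataclass(frozen=True)
class LogicExpression:
    """'and' / 'or' over operators or nested logic expressions."""
    kind: str  # "and" or "or"
    terms: Tuple[object, ...]

    def evaluate(self, record: Record) -> bool:
        results = (term.evaluate(record) for term in self.terms)
        return all(results) if self.kind == "and" else any(results)


@dataclass(frozen=True)
class Event:
    """Matching condition bound to a mandatory data source."""
    name: str
    data_source: str
    condition: LogicExpression

    def __post_init__(self) -> None:
        # Item 5) above: the data source attribute cannot be null.
        if not self.data_source:
            raise ValueError("an event's data source must not be null")

    def matches(self, record: Record) -> bool:
        return (record.get("source") == self.data_source
                and self.condition.evaluate(record))


def flag(name: str) -> Operator:
    """Hypothetical operator: true when the record carries name=True."""
    return Operator(name, lambda record: record.get(name) is True)


A, B, C, D = (flag(n) for n in "ABCD")
event = Event("R1-E1", "S1", LogicExpression("or", (
    LogicExpression("and", (A, B)),
    LogicExpression("and", (C, D)),
)))
print(event.matches({"source": "S1", "A": True, "B": True}))
```

Event association conditions across several events are left out of this sketch; they would sit one level above `Event`.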

Preferably, the calculation result of each calculation factor is stored in the cache node, and when all the analysis rules are matched, the calculation factors are deleted from the cache node.

Preferably, the configuration of the analysis rules must be performed by the management node, and all configuration information of the analysis rules is stored on the cache node.

Preferably, the backtracking analysis rule consists of n events, wherein n is greater than or equal to 1, and the backtracking analysis rule specifically comprises the following steps:

(1) When n=1, event matching is successful, the corresponding analysis rule matching is completed, and a result is directly returned;

(2) When n is greater than 1, the corresponding analysis rule proceeds to the next matching judgment only after all n events have been successfully matched; therefore, whenever any single event is successfully matched, the intermediate state of that event needs to be saved in the cache node for subsequent multi-event matching.

Preferably, all configuration information on the management node is stored on the cache node to improve access performance.

A method of retrospective analysis of historical data, comprising the steps of:

1) Starting a system;

2) Pre-compiling analysis rules and grouping, and storing the results into a cache node;

3) Acquiring data from the storage node according to the precompiled rules;

4) The analysis node analyzes the data and directly outputs either a result or state data;

5) The analysis node carries out secondary analysis on the state data;

6) The analysis rules are tuned if necessary.

Preferably, the analysis process of the analysis node is: precompilation is carried out on all the enabled analysis rules, after precompilation is completed, a plurality of different rule trees are generated, each rule tree corresponds to one data source, and each data source corresponds to one partition in the storage node; the analysis node obtains corresponding data from the cache node according to different rule trees, and the data enter the rule tree; and carrying out matching calculation on each piece of data according to the priority of the operator, and storing the data into the cache node after the matching calculation is successful.

The distributed system and method for retrospectively analyzing historical data provided by the invention reduce the amount of data entering the analysis nodes through configurable data storage rules; reduce repeated calculation by precompiling the analysis rules, thereby improving the matching performance of a single computing node; and break through the processing bottleneck of a single computing node by adding distributed nodes, thereby improving the analysis performance of the whole system. In addition, backtracking analysis is used instead of real-time analysis to cope with real-world scenarios in which rules are tuned frequently, so that no data is left out of the analysis and false alarms are reduced. Compared with real-time analysis, the data source is historical data: a slight sacrifice in single-node performance makes service processing more convenient, and the overall performance of the system is improved by adding distributed nodes.

Drawings

FIG. 1 is a diagram of the system of the present invention.

FIG. 2 is a flow chart of the overall system of the present invention.

FIG. 3 is a schematic diagram of a rule tree according to the present invention.

Detailed Description

The invention will be described in further detail with reference to the drawings and the detailed description.

The invention provides a distributed system for retrospectively analyzing historical data, which is organized as follows:

1. As shown in FIG. 1, the whole system consists of storage nodes, analysis nodes, a management node and cache nodes;

1) All node types other than the management node are distributed nodes;

2) The distributed nodes are peers, and none of them is a central node;

3) If the device performance allows, a node may be both an analysis node and a storage node, while also being a cache node.

Preferably, each node adopts independent computing resources and storage resources respectively; in some cases, such as where the computing resources of a device are very powerful, multiple nodes may share the computing resources. The management node is typically a single node, but a slave node may also be deployed to enhance availability.

2. Massive historical data is stored in a distributed manner

1) Data is stored by classification, and the storage rule is configurable; for example, data can be stored according to data source or according to device type;

2) The classified storage of the data is used for serving an analysis function of the analysis node;

3) In this example, we assume that data is stored in terms of data source partitions.

Preferably, classified partition storage should not affect how data is distributed across the storage nodes. It is mainly used in conjunction with the analysis rules to reduce the amount of data that participates in analysis, and thereby improves performance.
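
As a rough illustration of configurable classified partition storage, the sketch below keeps one in-memory list per partition. The class name `PartitionedStore` and the record fields `source` and `device_type` are assumptions made for the example; a real storage node would persist each partition on distributed storage.

```python
from collections import defaultdict


class PartitionedStore:
    """In-memory stand-in for a storage node with a configurable partition key."""

    def __init__(self, partition_key="source"):
        # The storage rule is configurable: e.g. "source" or "device_type".
        self.partition_key = partition_key
        self._partitions = defaultdict(list)

    def put(self, record):
        # Route the record to the partition named by its key field.
        self._partitions[record[self.partition_key]].append(record)

    def read(self, partition):
        # Only the requested partition is touched; others are never scanned.
        return list(self._partitions.get(partition, []))

    def partitions(self):
        return sorted(self._partitions)


records = [
    {"source": "S1", "device_type": "firewall", "msg": "login"},
    {"source": "S2", "device_type": "host", "msg": "file copy"},
    {"source": "S1", "device_type": "host", "msg": "logout"},
]

by_source = PartitionedStore("source")
by_device = PartitionedStore("device_type")
for r in records:
    by_source.put(r)
    by_device.put(r)

print(by_source.partitions(), len(by_source.read("S1")))
print(by_device.partitions(), len(by_device.read("host")))
```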

3. As mentioned above, the distributed storage node may also be an analysis node;

4. The analysis rules are uniformly configured, stored and managed through the management node.

5. Rules need to be precompiled into rule trees, and different priorities are set according to tree nodes:

1) Only rules in the enabled state are precompiled, so that invalid rules do not participate;

2) For performance improvement, the rule tree is stored on the distributed cache nodes;

3) After the rule is optimized, the rule tree needs to be recompiled, and the cache node is updated;

The smallest leaf node of the rule tree is called an operator (of Boolean type). Multiple rules may share one or several operators. As a simple example: rule R1 has the expression A and B, and rule R2 has the expression A and C. A, B and C are the three smallest operators, but their priorities differ: A has the highest priority, because B and C need to be calculated only if A is satisfied; if A is not satisfied, neither rule can match.
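
The R1/R2 example can be sketched as follows. The sketch assumes that an operator's priority is the number of rules sharing it, which is one plausible reading of the text rather than the patent's stated formula; the function names and the record fields `a`, `b`, `c` are hypothetical.

```python
from collections import Counter

# Each rule is the conjunction (AND) of its operators, as in the example.
RULES = {"R1": ["A", "B"], "R2": ["A", "C"]}

# Hypothetical Boolean operators over a record.
OPERATORS = {
    "A": lambda r: r["a"],
    "B": lambda r: r["b"],
    "C": lambda r: r["c"],
}


def operator_priority(rules):
    """Order operators so that those shared by more rules come first."""
    counts = Counter(op for ops in rules.values() for op in ops)
    return sorted(counts, key=lambda op: (-counts[op], op))


def match_rules(record, rules, operators):
    """Return (matched rule names, operators actually evaluated)."""
    alive = set(rules)      # rules that can still match
    evaluated = []
    for op in operator_priority(rules):
        if not any(op in rules[r] for r in alive):
            continue        # no surviving rule needs this operator
        evaluated.append(op)
        if not operators[op](record):
            # A failed operator eliminates every rule that contains it.
            alive -= {r for r in alive if op in rules[r]}
    return sorted(alive), evaluated


print(match_rules({"a": False, "b": True, "c": True}, RULES, OPERATORS))
print(match_rules({"a": True, "b": True, "c": False}, RULES, OPERATORS))
```

When A fails, neither B nor C is evaluated, and each operator is evaluated at most once per record regardless of how many rules reference it.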

6. As described above, in the process of pre-compiling the rule, the analysis rule is split into a plurality of calculation factors:

1) The smallest calculation factor is called an operator;

2) A logic expression is formed from a plurality of operators; a logic expression is itself essentially an operator;

3) A matching condition, called an event, is formed from a plurality of logic expressions;

4) An event association condition is formed from a plurality of events.

7. Not all data need to be analyzed, only data related to the analysis rules will be analyzed:

1) As previously described, a rule may be broken into multiple events;

2) Importantly, there must be one data source tag for each event;

3) As previously described, the data is stored according to the data source partition;

4) Finally, only the data of the corresponding partitions needs to be fetched from the distributed storage nodes, which greatly reduces the amount of data participating in analysis, greatly reduces disk I/O, and also reduces CPU load.
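
The effect of the data source tag can be sketched as below: the set of partitions to read is derived from the enabled rules, and every other partition is skipped. The dictionary layout of the rules, the `enabled` flag name and the in-memory `STORAGE` mapping are assumptions made for illustration.

```python
# Hypothetical rule configuration: every event carries a data source tag.
RULES = [
    {"name": "R1", "enabled": True,
     "events": [{"name": "R1-E1", "source": "S1"},
                {"name": "R1-E2", "source": "S2"}]},
    {"name": "R2", "enabled": True,
     "events": [{"name": "R2-E1", "source": "S1"}]},
    {"name": "R3", "enabled": False,
     "events": [{"name": "R3-E1", "source": "S3"}]},
]

# Stand-in for storage nodes partitioned by data source.
STORAGE = {
    "S1": [{"id": 1}, {"id": 2}],
    "S2": [{"id": 3}],
    "S3": [{"id": 4}, {"id": 5}, {"id": 6}],
    "S4": [{"id": 7}],
}


def partitions_needed(rules):
    """Data sources referenced by at least one event of an enabled rule."""
    return {event["source"]
            for rule in rules if rule["enabled"]
            for event in rule["events"]}


def fetch_for_analysis(storage, rules):
    """Read only the partitions that some enabled rule can match against."""
    return {source: storage.get(source, [])
            for source in sorted(partitions_needed(rules))}


selected = fetch_for_analysis(STORAGE, RULES)
total = sum(len(rows) for rows in STORAGE.values())
read = sum(len(rows) for rows in selected.values())
print(sorted(selected), f"{read} of {total} records read")
```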

Preferably, rule precompilation is a key step in this example: through rule state filtering, rule splitting, operator priority marking and similar steps, an efficient rule matching tree is generated; see FIG. 3 (rule tree schematic).

8. The calculation result of each calculation factor is stored on the distributed cache nodes, which are therefore critical nodes; when matching of all relevant rules is completed, the results are deleted from the cache nodes.
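
A minimal sketch of this caching behaviour is given below, using a plain dictionary in place of the distributed cache nodes. The class `FactorCache`, its method names, and the idea of tracking pending rules per record are assumptions for illustration, not details taken from the disclosure.

```python
class FactorCache:
    """Dictionary-backed stand-in for the distributed cache nodes."""

    def __init__(self):
        self._results = {}   # (record_id, factor name) -> bool
        self._pending = {}   # record_id -> rules not yet finished

    def start(self, record_id, rule_names):
        self._pending[record_id] = set(rule_names)

    def get_or_compute(self, record_id, factor, compute):
        # Reuse the stored result so a shared factor is computed only once.
        key = (record_id, factor)
        if key not in self._results:
            self._results[key] = compute()
        return self._results[key]

    def finish(self, record_id, rule_name):
        pending = self._pending.get(record_id, set())
        pending.discard(rule_name)
        if not pending:
            # All relevant rules are done: drop this record's cached factors.
            self._pending.pop(record_id, None)
            for key in [k for k in self._results if k[0] == record_id]:
                del self._results[key]

    def size(self):
        return len(self._results)


calls = []


def compute_a():
    calls.append("A")
    return True


cache = FactorCache()
cache.start("d1", ["R1", "R2"])
cache.get_or_compute("d1", "A", compute_a)   # computed for R1
cache.get_or_compute("d1", "A", compute_a)   # reused for R2
cache.finish("d1", "R1")
size_after_first = cache.size()
cache.finish("d1", "R2")
size_after_all = cache.size()
print(len(calls), size_after_first, size_after_all)
```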

Preferably, a large number of distributed cache nodes are used, which improves the efficiency of secondary matching and lengthens the time window over which events can be matched.

9. Storage and management of rule configuration information

1) The configuration of rules must be performed by the management node;

2) The management node is the only man-machine interaction node in the system;

3) All rule configuration information is stored on the distributed cache nodes.

In this example the management node is non-critical but still necessary, and is used for human-computer interaction. Preferably, all configuration information on the management node is stored on the distributed cache nodes to improve access performance.

10. Process of retrospective analysis

The whole backtracking analysis process builds on the scheme above:

1. First, all enabled rules are precompiled. Precompilation is a complex process: the analysis rules are split into operators, the operators are marked with priorities, and the operators are then recombined;

2. After compilation, a plurality of different rule trees are generated, which can also be called event matching trees;

3. Critically, a plurality of different rule trees are generated because each rule tree corresponds to one data source;

4. Critically, each data source in turn corresponds to one partition in the distributed storage nodes;

5. According to the different rule trees, the distributed analysis nodes obtain the corresponding data from the distributed partitioned storage nodes, and only that data enters the analysis rule matching tree;

6. For each piece of data, matching calculation is carried out according to operator priority, and after the matching calculation succeeds, the data is stored in the distributed cache;

7. Importantly, the number of attributes a piece of data may carry is unbounded, but the number of operators is finite, so the amount of matching calculation is greatly reduced;

8. When the rules are tuned, the rule matching tree must be recompiled and the copy in the cache synchronously updated;

9. After the rules are tuned, it can be chosen whether to trace back over the historical data again;

10. A specific backtracking analysis rule may consist of n (n ≥ 1) events:

1) When n=1, event matching is successful, the corresponding analysis rule matching is completed, and a result is directly returned;

2) When n is greater than 1, the corresponding analysis rule proceeds to the next matching judgment only after all n events have been successfully matched; therefore, whenever any single event is successfully matched, the intermediate state of that event needs to be saved in the distributed cache for later multi-event matching.
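
The two cases above can be sketched as a small matcher that keeps intermediate state for multi-event rules. The class `MultiEventMatcher`, the `associate` callback standing in for the event association condition, and the `user` field are hypothetical; the dictionary `state` stands in for the distributed cache.

```python
class MultiEventMatcher:
    """Tracks which events of one rule have matched so far."""

    def __init__(self, rule_name, event_names, associate=None):
        self.rule_name = rule_name
        self.event_names = list(event_names)
        # Association condition over all matched events; defaults to "always".
        self.associate = associate or (lambda matched: True)
        self.state = {}  # stand-in for the cache: event name -> record

    def on_event_matched(self, event_name, record):
        """Return the matched records once the rule is satisfied, else None."""
        if len(self.event_names) == 1:
            # n = 1: the rule is complete as soon as its event matches.
            return {event_name: record}
        # n > 1: save the intermediate state for later multi-event matching.
        self.state[event_name] = record
        if all(name in self.state for name in self.event_names):
            matched = dict(self.state)
            if self.associate(matched):
                self.state.clear()
                return matched
        return None


single = MultiEventMatcher("R2", ["R2-E1"])
print(single.on_event_matched("R2-E1", {"user": "u1"}))

same_user = lambda m: m["R1-E1"]["user"] == m["R1-E2"]["user"]
multi = MultiEventMatcher("R1", ["R1-E1", "R1-E2"], associate=same_user)
first = multi.on_event_matched("R1-E1", {"user": "u1"})
second = multi.on_event_matched("R1-E2", {"user": "u1"})
print(first, second)
```

Because the intermediate state lives in a cache rather than in the analysis process, the two events may arrive far apart in the historical data and still be associated.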

Compilation of the rule tree is illustrated by the following embodiment:

Here R in R1 stands for Rule, E in E1 stands for Event, and R1-E1 denotes event expression E1 in rule R1.

1. Rule R1 consists of two event expressions:

R1-E1: data source S1, condition (A and B) or (C and D);

R1-E2: data source S2, condition (E and F);

The association condition between event expressions R1-E1 and R1-E2 is denoted R1.C.

2. Rule R2 has data source S1 and only one event expression, R2-E1: (A and C);

3. The composition factors of the two rules above are shown in the following table:

4. The rules are compiled into a rule tree as shown in FIG. 3. The specific method is as follows:

1. The rules are split into event expressions: as described in the table, rule R1 consists of two event expressions, R1-E1 and R1-E2, and rule R2 has only one event expression, R2-E1;

2. Each event expression consists of a data source and a logical expression composed of operators, as described in the above table;

3. The data source has the highest priority, so it participates in the calculation first and determines the direction in which the data flows;

4. Next, operators that appear in several event expressions have higher priority, such as operator A in this example, which belongs to both R1-E1 and R2-E1.
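
The compilation steps above can be sketched for rules R1 and R2 as follows. The nested-tuple encoding of expressions, the function names, and the choice to rank operators by how many event expressions of the same data source use them are assumptions made for illustration; the actual tree layout is the one shown in FIG. 3.

```python
from collections import Counter

# Expressions are nested tuples: (connective, term, term, ...).
RULES = [
    {"name": "R1", "enabled": True, "events": [
        {"name": "R1-E1", "source": "S1",
         "expr": ("or", ("and", "A", "B"), ("and", "C", "D"))},
        {"name": "R1-E2", "source": "S2", "expr": ("and", "E", "F")},
    ]},
    {"name": "R2", "enabled": True, "events": [
        {"name": "R2-E1", "source": "S1", "expr": ("and", "A", "C")},
    ]},
]


def operators_in(expr):
    """Collect the operator names used anywhere in an expression."""
    if isinstance(expr, str):
        return {expr}
    found = set()
    for term in expr[1:]:
        found |= operators_in(term)
    return found


def compile_rule_trees(rules):
    """Build one tree per data source from the enabled rules."""
    trees = {}
    for rule in rules:
        if not rule["enabled"]:
            continue  # rule state filtering
        for event in rule["events"]:
            # The data source has the highest priority: it selects the tree.
            tree = trees.setdefault(
                event["source"], {"events": {}, "usage": Counter()})
            tree["events"][event["name"]] = event["expr"]
            tree["usage"].update(operators_in(event["expr"]))
    for tree in trees.values():
        usage = tree.pop("usage")
        # Operators used by more event expressions are evaluated first.
        tree["operator_order"] = sorted(
            usage, key=lambda op: (-usage[op], op))
    return trees


trees = compile_rule_trees(RULES)
for source in sorted(trees):
    print(source, sorted(trees[source]["events"]),
          trees[source]["operator_order"])
```

Under this ranking, operators A and C both come first in the S1 tree, because each appears in R1-E1 and in R2-E1.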

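As shown in FIG. 2, a method for retrospectively analyzing historical data includes the following steps:

1) Starting a system;

2) Pre-compiling analysis rules and grouping, and storing the results into a cache node;

3) Acquiring data from the storage node according to the precompiled rules;

4) The analysis node analyzes the data: directly outputting the result or state data;

5) The analysis node carries out secondary analysis on the state data;

6) The analysis rules are tuned if necessary.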

The above embodiments do not limit the present invention, and the present invention is not limited to the above examples; its scope is defined by the following claims.

Claims

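1. A distributed system for retrospective analysis of historical data, comprising: the system consists of a storage node, an analysis node, a management node and a cache node;

the management node is a man-machine interaction node in the system;

the storage nodes, the analysis nodes and the cache nodes are distributed nodes, the distributed nodes are equal, and the distributed nodes are all decentralised nodes;

the storage node adopts classified partition storage to perform distributed storage on historical data;

the analysis rules of the analysis nodes are uniformly configured, stored and managed through the management nodes;

the analysis rules need to be precompiled into rule trees, different priorities are set according to tree nodes, and the rule trees are stored on cache nodes.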

2. The distributed system for retrospective analysis of historical data of claim 1, wherein: the classification partition storage rule is configurable: storing according to data sources or storing according to device types; the storage node may also be an analysis node.

3. The distributed system for retrospective analysis of historical data of claim 2, wherein: only analysis rules that are already in an enabled state are precompiled; when the analysis rules are optimized, the rule tree needs to be recompiled, and the cache nodes are synchronously updated; after tuning the analysis rules, selecting whether to carry out retrospective analysis on the historical data again.

4. A distributed system for retrospective analysis of historical data as defined in claim 3 wherein: precompilation splits the analysis rules into a plurality of calculation factors, wherein:

1) The smallest computational factor is called an operator;

2) Forming a logic expression by a plurality of operators;

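3) A matching condition is formed by a plurality of logic expressions, which is called an event;

4) Forming event association conditions by a plurality of events;

5) When the data store is stored in terms of data source partitions, all events have an attribute named data source, which cannot be null.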

5. The distributed system for retrospective analysis of historical data of claim 4, wherein: and the calculation result of each calculation factor is stored on the cache node, and when all analysis rules are matched, the calculation factors are deleted from the cache node.

6. The distributed system for retrospective analysis of historical data of claim 5, wherein: the configuration of the analysis rules must be performed by the management node, and all configuration information of the analysis rules is stored on the cache node.

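7. The distributed system for retrospective analysis of historical data of claim 6, wherein: the backtracking analysis rule consists of n events, wherein n is more than or equal to 1, and the backtracking analysis rule specifically comprises the following steps:

(1) When n=1, event matching is successful, the corresponding analysis rule matching is completed, and a result is directly returned;

(2) When n is greater than 1, the corresponding analysis rule proceeds to the next matching judgment only after all n events have been successfully matched; therefore, whenever any single event is successfully matched, the intermediate state of that event needs to be saved in the cache node for subsequent multi-event matching.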

8. The distributed system for retrospective analysis of historical data of claim 7, wherein: all configuration information on the management node is stored on the cache node to improve access performance.

9. A method for retrospectively analyzing historical data is characterized in that: the method comprises the following steps:

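1) Starting a system;

2) Pre-compiling analysis rules and grouping, and storing the results into a cache node;

3) Acquiring data from the storage node according to the precompiled rules;

4) The analysis node analyzes the data: directly outputting the result or state data;

5) The analysis node carries out secondary analysis on the state data;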

6) The analysis rules are tuned if necessary.

10. The method of retrospective analysis of historical data of claim 9, wherein: the analysis process of the analysis node is as follows: precompilation is carried out on all the enabled analysis rules, after precompilation is completed, a plurality of different rule trees are generated, each rule tree corresponds to one data source, and each data source corresponds to one partition in the storage node; the analysis node obtains corresponding data from the cache node according to different rule trees, and the data enter the rule tree; and carrying out matching calculation on each piece of data according to the priority of the operator, and storing the data into the cache node after the matching calculation is successful.