CN111638883A

CN111638883A - Decision engine implementation method based on decision tree

Info

Publication number: CN111638883A
Application number: CN202010407619.XA
Authority: CN
Inventors: 谢世茂; 彭恒; 陈杰
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2020-05-14
Filing date: 2020-05-14
Publication date: 2020-09-08
Anticipated expiration: 2040-05-14
Also published as: CN111638883B

Abstract

The invention relates to a decision engine implementation method based on a decision tree, comprising the following steps: A. selecting the required decision nodes by dragging visual controls, and forming a decision tree from the decision nodes; B. checking the correctness of the decision tree; C. checking the correctness of the expression contained in each node; D. parsing the decision tree: deriving all decision paths from the branch structure of the decision tree, and converting the information and configured conditions contained in each node on each decision path into SQL, so that each decision path is expressed by one SQL statement; E. sending the SQL statements of the decision tree to a big data cluster for distributed computation. The invention can generate an executable decision scheme without programming, greatly reduces the difficulty of operation for the user, and allows decision schemes over large data volumes to be computed concurrently in a distributed manner.

Description

Decision engine implementation method based on decision tree

Technical Field

The invention relates to a data processing method in the financial field, in particular to a decision engine implementation method based on a decision tree.

Background

In banking, as business scale has grown, practitioners have accumulated a variety of rules for scenarios such as risk control, anti-fraud, and marketing. These rules can identify users with higher overdue rates and lower credit, as well as users with better repayment capability and stronger willingness to borrow. Based on these rules, decisions such as freezing or unfreezing an account and adjusting the credit limit or interest rate can be made for each user. This reduces the bank's risk of non-performing assets and increases its income.

Banking policies are currently implemented mainly by purchasing a third-party rule engine, such as ILOG or URule. However, such rule engines are generally expensive, and configuring rules requires some programming skill, so they suit the risk control department but are hard to learn for operations and other business departments. In addition, taking ILOG as an example, ILOG serves external callers through an API (application programming interface), which has two main disadvantages:

A) It is not suitable for large-scale business. As data scale grows, a bank may need to make more than ten million decisions per day. To keep using ILOG at that volume, multiple ILOG servers must be deployed to process requests concurrently and meet timeliness requirements, which further increases resource consumption.

B) Different business departments need to develop separate access programs for calling ILOG or configuring ILOG rules, which greatly increases the development workload.

There is therefore a need for an implementation that requires no programming and is suited to big-data banking strategies, so that it can serve many different types of departments and users.

Disclosure of Invention

The invention provides a decision engine implementation method based on a decision tree, which can generate an executable decision scheme without programming, reduce the operation difficulty of a user and be suitable for decision calculation and operation of big data.

The decision engine implementation method based on the decision tree comprises the following steps:

A. selecting the required decision nodes on a display device by dragging visual controls with a signal input device, configuring the selected decision nodes, and establishing parent/child relations according to the logical relations between the decision nodes, thereby forming a decision tree from the root node to the leaf nodes;

B. checking the correctness of the decision tree, including the correctness of the required field information in the code corresponding to the decision tree, the correctness of the parent and child nodes of each node, and the correctness of the condition settings of each node;

C. checking the correctness of the expression contained in each node, including the correctness of the expression's format and of the functions and variables it contains;

D. parsing the decision tree: traversing the decision tree to derive all decision paths from its branch structure, and converting the information and configured conditions contained in each node on each decision path into SQL, so that each decision path is expressed by one SQL statement;

E. sending the SQL statements of the decision tree to a big data cluster for distributed computation.

The decision tree is built by dragging controls on a visual page, and rules are then configured for each decision node, so the method is simple to use, requires no actual programming, and greatly lowers the barrier for users. The decision tree is parsed into a script of several SQL statements and submitted to a big data cluster for distributed computation, which keeps computation efficient for tens of millions of decisions. In addition, the invention does not interact with external systems through an API (application programming interface); the user only needs to configure the required decisions directly on the decision tree (decision engine), which spares business departments from investing manpower in developing calling programs. Only the construction of the decision tree and the conversion to SQL statements take place locally; the SQL statements are executed by distributed computation on the big data cluster, which greatly increases actual processing capacity and allows large volumes of data to be computed in parallel. By contrast, a traditional decision engine stores its data in local memory and must evaluate records one by one, so its concurrency is weak. In the invention, computing tasks are distributed to the cores of each cluster node for concurrent computation, and the larger the decision scale, the greater the speed advantage over the traditional scheme.

Specifically, the step B includes:

B1. parsing the JSON code corresponding to the decision tree, wherein each object in the JSON code represents a node and each node contains a node ID and associated node IDs, and throwing a decision tree parsing exception if the JSON code cannot be parsed or lacks required fields;

B2. building in memory, from the information of each node, an adjacency list that stores the tree structure of the decision tree, counting the parent nodes of each node, and throwing an in-degree exception if any node has more than one parent node;

B3. determining by a union-find (disjoint-set) algorithm whether the decision tree contains a loop or is disconnected, and throwing a loop exception or a forest exception accordingly;

B4. checking whether any node's condition settings are missing information, and if so throwing a missing-information exception and displaying the name of the node with the missing information.

Specifically, the step C includes:

C1. scanning the expressions contained in each node and checking for illegal characters;

C2. extracting the functions and variables contained in all expressions with a regular expression, checking in turn whether each exists in the basic field table or the declared function table, and throwing an unknown-variable or unknown-function error if it does not;

C3. checking the syntax of the expression, and throwing an expression syntax error exception if a syntax error exists.

Further, in step C1, if disallowed characters remain in the expression after the content inside quotation marks has been removed, the expression is judged illegal.

Specifically, the step D includes:

D1. obtaining the graph structure of the decision tree in adjacency-list form;

D2. traversing the whole decision tree from its root node by a depth-first search over the graph structure, and recording the parent node of each node during the traversal;

D3. backtracking recursively from each leaf node to the root node through the parent nodes recorded in step D2, collecting the conditions configured in each node along the way into a condition list, and assembling the contents of the condition list into the SQL statement for each decision path.

The decision engine implementation method based on a decision tree can generate an executable decision scheme without programming, greatly reduces the difficulty of operation for the user, and allows decision schemes over large data volumes to be computed concurrently in a distributed manner. Testing shows that a 6-node cluster completes about 3 million decisions in about 1-2 minutes.

The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.

Drawings

FIG. 1 is a flow chart of a decision engine implementation method based on a decision tree according to the present invention.

FIG. 2 is a schematic diagram of the adjacency list of Example a in step B2.

FIG. 3 is a schematic diagram of the adjacency list of Example b in step B2.

FIG. 4 is a schematic diagram of the adjacency list of Example a in step B3.

FIG. 5 is a schematic diagram of the adjacency list of Example b in step B3.

FIG. 6 is a schematic diagram of the decision tree structure obtained in step D1.

FIG. 7 is a schematic diagram of a decision path obtained from the decision tree of FIG. 6.

Detailed Description

As shown in FIG. 1, the decision engine implementation method based on decision tree of the present invention includes:

A. Selecting the required decision nodes on a display device by dragging visual controls with a signal input device, configuring the selected decision nodes, and establishing parent/child relations according to the logical relations among the decision nodes, so that a decision tree from the root node to the leaf nodes is formed.

The system for building a decision tree by dragging decision nodes is an existing one and is not described in detail here; its implementation principle is the same as that of existing software with drag-and-drop controls, such as Visio and Dreamweaver.

B. Detecting correctness of the decision tree, comprising:

B1. Parsing the JSON code corresponding to the decision tree, wherein each object in the JSON code represents a node and each node contains a node ID and associated node IDs. If the JSON code cannot be parsed or lacks required fields, a decision tree parsing exception is thrown. For example:

Example a: given the following JSON:

[{"node_id":"a","next_nodes":["b","c"]},{"node_id":"b","next_nodes":[]},{"node_id":"c","next_nodes":[]}]

The JSON in actual code has many more fields; this is only an illustration. The sample is well-formed JSON and can be parsed, so the check passes.

In this and the following examples of this embodiment, node_id records the label of the current node, and next_nodes records the successor nodes of that node.

Example b: given the following JSON:

[{"node_id":"a"}]

A node that is not a leaf node must list its child nodes. Here the next_nodes information cannot be resolved because the required key is missing, so a parsing exception is thrown.
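The B1 check can be sketched in Python as follows. This is a minimal illustration rather than the implementation described by the invention: the names DecisionTreeParseError and parse_tree, and the choice of node_id and next_nodes as the only required keys, are assumptions made for the example.

```python
import json
from typing import Dict, List


class DecisionTreeParseError(Exception):
    """Raised when the tree JSON cannot be parsed or lacks required fields."""


# Assumption: only these two keys are treated as required in this sketch.
REQUIRED_KEYS = ("node_id", "next_nodes")


def parse_tree(raw: str) -> List[Dict]:
    """Parse the JSON text of a decision tree and validate required keys."""
    try:
        nodes = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise DecisionTreeParseError(f"JSON cannot be parsed: {exc}") from exc
    if not isinstance(nodes, list):
        raise DecisionTreeParseError("top-level value must be a list of nodes")
    for node in nodes:
        if not isinstance(node, dict):
            raise DecisionTreeParseError("every node must be a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in node]
        if missing:
            raise DecisionTreeParseError(
                f"node {node.get('node_id', '?')!r} lacks required fields: {missing}"
            )
    return nodes
```

Called on the two samples above, parse_tree returns three nodes for the first and raises DecisionTreeParseError for the second, because next_nodes is absent.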

B2. Building in memory, from the information of each node, an adjacency list that stores the tree structure of the decision tree, and counting the parent nodes of each node. If any node has more than one parent node, an in-degree exception is thrown.

Example a: an adjacency list is built from the parsing result of the previous step. Assume it contains the relations a->b, b->c, b->d, b->e; drawn as a graph, the structure is shown in FIG. 2.

This is a standard tree structure in which no node has an in-degree greater than 1, so the check passes.

Example b: assume the adjacency list contains the relations a->b, a->c, b->d, c->d; drawn as a graph, the structure is shown in FIG. 3.

Node d in FIG. 3 has an in-degree of 2, which violates the definition of a tree as a data structure, so an in-degree exception is thrown.
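A possible shape for the B2 check is sketched below in Python. The function names build_adjacency and check_in_degree and the exception InDegreeError are hypothetical; the input is assumed to be the list of node objects produced by the parsing step, each with node_id and next_nodes.

```python
from collections import defaultdict
from typing import Dict, List


class InDegreeError(Exception):
    """Raised when some node has more than one parent node."""


def build_adjacency(nodes: List[dict]) -> Dict[str, List[str]]:
    """Map each node ID to the list of its successor node IDs."""
    return {node["node_id"]: list(node["next_nodes"]) for node in nodes}


def check_in_degree(adjacency: Dict[str, List[str]]) -> Dict[str, int]:
    """Count the parents of every node and reject any in-degree above 1."""
    in_degree = defaultdict(int)
    for children in adjacency.values():
        for child in children:
            in_degree[child] += 1
    offenders = sorted(n for n, degree in in_degree.items() if degree > 1)
    if offenders:
        raise InDegreeError(f"nodes with more than one parent: {offenders}")
    return dict(in_degree)
```

For the relations a->b, a->c, b->d, c->d the function raises InDegreeError naming node d, while the relations a->b, b->c, b->d, b->e pass.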

B3. Determining by a union-find (disjoint-set) algorithm whether the decision tree contains a loop or is disconnected, and throwing a loop exception or a forest exception accordingly. For example:

Example a: assume the adjacency list contains the relations a->b, b->c, c->a, drawn as shown in FIG. 4.

Although every node has an in-degree of 1, the structure contains a loop, which violates the definition of a tree as a data structure. The union-find algorithm identifies the loop in the graph structure, and the user is notified of the loop exception.

Example b: assume the adjacency list contains the relations a->b, c->d, drawn as shown in FIG. 5.

Here the graph has no loop and no node with an in-degree greater than 1, but it is disconnected. The union-find algorithm detects this forest exception, and the user is notified.
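One way to realize the B3 check with a union-find structure is sketched below in Python. The names LoopError, ForestError, and check_single_tree are hypothetical, and the sketch assumes the in-degree check has already passed, so edges can be treated as undirected when merging sets.

```python
from typing import Dict, List


class LoopError(Exception):
    """Raised when the graph contains a loop."""


class ForestError(Exception):
    """Raised when the graph splits into more than one component."""


def check_single_tree(adjacency: Dict[str, List[str]]) -> None:
    """Use union-find to verify the graph is one connected, loop-free tree."""
    nodes = set(adjacency)
    for children in adjacency.values():
        nodes.update(children)
    parent = {node: node for node in nodes}

    def find(x: str) -> str:
        # Follow parent links to the set representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, children in adjacency.items():
        for target in children:
            root_s, root_t = find(source), find(target)
            if root_s == root_t:
                # Both ends are already connected, so this edge closes a loop.
                raise LoopError(f"edge {source}->{target} closes a loop")
            parent[root_s] = root_t

    if len({find(node) for node in nodes}) > 1:
        raise ForestError("the graph is disconnected (forest)")
```

With the relations a->b, b->c, c->a the third edge joins two nodes that already share a representative, so LoopError is raised; with a->b, c->d two representatives remain after all merges, so ForestError is raised.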

B4. Checking whether any node's condition settings are missing information. If so, a missing-information exception is thrown and the name of the node with the missing information is displayed. For example:

Example a: suppose the following JSON structure is received for a node:

{"expr":"max((crdt_limit/aval_limit)*crdt_limit,300000)","operator_type":">","value":""}

In this JSON structure the graph-structure information is omitted and only the data related to the node's condition is kept. The value field is empty, so the user is informed that information is missing. The expr field is checked separately by the expression checks; only if it is also free of errors can the decision tree be parsed into SQL statements.

In this embodiment, expr holds the content of the expression, operator_type holds the comparison operator, which is one of five symbols (for example '>', '<', '<=' and '='), and value holds the object of the comparison, which may be a constant.
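The B4 check can be illustrated with a short Python sketch. The names MissingInfoError and check_conditions are hypothetical, as is the node name raise_limit used in the usage note; the sketch assumes that expr, operator_type, and value are the only fields that must be filled in.

```python
from typing import Dict, List


class MissingInfoError(Exception):
    """Raised when a node's condition settings are incomplete."""


# Assumption: these are the condition fields that must be non-empty.
CONDITION_KEYS = ("expr", "operator_type", "value")


def check_conditions(conditions: Dict[str, dict]) -> None:
    """conditions maps a node name to its condition object."""
    incomplete: Dict[str, List[str]] = {}
    for name, condition in conditions.items():
        missing = [
            key for key in CONDITION_KEYS
            if condition.get(key) is None or str(condition[key]).strip() == ""
        ]
        if missing:
            incomplete[name] = missing
    if incomplete:
        # Report every incomplete node by name, as the step requires.
        raise MissingInfoError(f"missing information: {incomplete}")
```

Passing {"raise_limit": {...}} with the condition object shown above raises MissingInfoError, and the message names raise_limit and its empty value field.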

C. Detecting the correctness of the expression contained in each node, including:

C1. Scanning the expressions contained in each node and checking for illegal characters. An expression may contain only digits, letters, underscores, parentheses, the addition, subtraction, multiplication and division signs, the percent sign (modulo operation), commas, decimal points, and quotation marks. If any other character remains after the content inside quotation marks has been removed, the expression is judged illegal. For example:

Example a: max((crdt_limit/aval_limit)*crdt_limit,300000). The check succeeds because every symbol in the expression is a legal character.

In this expression, max is a function that returns the larger of two values, aval_limit is the available credit, and crdt_limit is the user's granted credit limit (maximum amount). The expression is intended to raise the limit according to how the user has used it, with 300,000 as the stated cap.

Example b: crdt_limit^1.2. The illegal symbol ^ is found during scanning, so an error is thrown.
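The character scan of step C1 can be sketched with two regular expressions in Python. The function name has_only_legal_chars is hypothetical, and two points are assumptions of this sketch rather than statements from the text: spaces are tolerated, and both single and double quotation marks delimit quoted content.

```python
import re

# Quoted literals, whose content is exempt from the character check.
QUOTED = re.compile(r"'[^']*'|\"[^\"]*\"")

# Digits, letters, underscore, parentheses, + - * / %, comma, decimal point,
# quotation marks, plus the space character (an assumption of this sketch).
ALLOWED = re.compile(r"[A-Za-z0-9_()+\-*/%,.'\" ]*")


def has_only_legal_chars(expr: str) -> bool:
    """Return True if expr, minus its quoted literals, uses only legal characters."""
    remainder = QUOTED.sub("", expr)
    return ALLOWED.fullmatch(remainder) is not None
```

A stray, unmatched quotation mark still passes this scan because quotation marks are legal characters; it is left for the syntax check to reject.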

C2. Extracting the functions and variables contained in all expressions with the regular expression '[A-Za-z][A-Za-z_0-9]+', checking in turn whether each exists in the basic field table or the declared function table, and throwing an unknown-variable or unknown-function error if it does not. For example:

Example a: max((crdt_limit/aval_limit)*crdt_limit,300000). The regular expression reads the 4 strings max, crdt_limit, aval_limit, and crdt_limit from the expression. Each is looked up in the function table and the field table in the database to find any undefined function or variable; if all are defined, the check passes.

Example b: maxx(sample1+sample2,20). The 3 strings maxx, sample1, and sample2 are extracted. The lookup finds that maxx is neither a variable nor a function, so an undefined exception is thrown.
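The extraction and lookup of step C2 can be sketched in Python with re.findall. The function name find_unknown_names is hypothetical, and the two sets passed in are in-memory stand-ins, assumed for the example, for the field table and function table that the text says live in a database.

```python
import re
from typing import List, Set

# The pattern given in step C2: a letter followed by one or more
# letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z_0-9]+")


def find_unknown_names(expr: str, fields: Set[str], functions: Set[str]) -> List[str]:
    """Return identifiers in expr that are neither known fields nor known functions."""
    names = IDENTIFIER.findall(expr)
    # dict.fromkeys drops repeats while keeping first-seen order.
    return [
        name for name in dict.fromkeys(names)
        if name not in fields and name not in functions
    ]
```

Note that the pattern requires at least two characters, so a one-letter name would not be extracted; the sketch keeps the pattern exactly as stated.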

C3. Checking the syntax of the expression, and throwing an expression syntax error exception if a syntax error exists. There are two ways to do this. The first is to parse the expression directly by recursive descent and judge whether it passes. The second is to compile the expression as a script and let the script compiler perform the check; in this mode the expression is compiled as a Groovy script. If no exception is raised, the expression is correct. For example:

Example a: max((crdt_limit/aval_limit)*crdt_limit,300000). The expression conforms to the syntax, so it passes the check.

Example b: max(aatt,bbcc. The expression lacks a closing parenthesis, so a syntax error is detected.
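The first approach, recursive descent, can be sketched in Python as below. The grammar is an assumption chosen to cover the operators listed in step C1 (arithmetic, modulo, parentheses, function calls, numbers, single-quoted strings); the names ExpressionSyntaxError, Parser, and check_syntax are hypothetical, and the sketch only validates the expression without evaluating it.

```python
import re


class ExpressionSyntaxError(Exception):
    """Raised when an expression does not match the assumed grammar."""


TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[A-Za-z_][A-Za-z_0-9]*|'[^']*'|\S)")
NAME = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


class Parser:
    def __init__(self, text: str) -> None:
        self.tokens = TOKEN.findall(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ExpressionSyntaxError(
                f"expected {expected or 'an operand'}, got {token!r}"
            )
        self.pos += 1
        return token

    def expr(self) -> None:  # expr := term (('+' | '-') term)*
        self.term()
        while self.peek() in ("+", "-"):
            self.take()
            self.term()

    def term(self) -> None:  # term := factor (('*' | '/' | '%') factor)*
        self.factor()
        while self.peek() in ("*", "/", "%"):
            self.take()
            self.factor()

    def factor(self) -> None:  # sign, group, name or call, number, string
        token = self.take()
        if token in ("+", "-"):
            self.factor()
        elif token == "(":
            self.expr()
            self.take(")")
        elif NAME.fullmatch(token):
            if self.peek() == "(":  # function call with optional arguments
                self.take()
                if self.peek() != ")":
                    self.expr()
                    while self.peek() == ",":
                        self.take()
                        self.expr()
                self.take(")")
        elif not (token[0] in "0123456789" or (len(token) >= 2 and token[0] == "'")):
            raise ExpressionSyntaxError(f"unexpected token {token!r}")


def check_syntax(text: str) -> None:
    parser = Parser(text)
    parser.expr()
    if parser.peek() is not None:
        raise ExpressionSyntaxError(f"unexpected trailing token {parser.peek()!r}")
```

For max(aatt,bbcc the parser reaches the end of input while expecting ")", so ExpressionSyntaxError is raised; the well-formed example passes silently.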

D. Parsing the decision tree:

D1. Obtaining the graph structure of the decision tree in adjacency-list form. For example:

Example: assume the decision tree shown in FIG. 6 is obtained. FIG. 6 is a standard binary tree, and every node has been filled in with correct expression conditions and other required information, so the checks pass.

The credit limit field is named crdt_limit, the overdue flag field is named is_overdue, and the blacklist flag field is named is_black. When the decision tree is parsed, all field names involved are converted into SQL, and the SQL statements use these three English field names.

D2. Traversing the whole decision tree from its root node by a depth-first search over the graph structure, and recording the parent node of each node during the traversal. Taking the decision tree of FIG. 6 as an example, the tree has 4 leaf nodes, so the depth-first search yields 4 decision paths, as shown in FIG. 7.
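The traversal can be sketched in Python as an iterative depth-first search that records each node's parent, followed by a walk from a leaf back to the root. The function names record_parents and path_to and the node labels in the usage note are hypothetical, since the labels of FIG. 6 are not reproduced in the text.

```python
from typing import Dict, List, Optional, Tuple


def record_parents(
    adjacency: Dict[str, List[str]], root: str
) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Depth-first traversal that records each node's parent and collects leaves."""
    parent: Dict[str, Optional[str]] = {root: None}
    leaves: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        children = adjacency.get(node, [])
        if not children:
            leaves.append(node)
        # Push in reverse so children are visited in their listed order.
        for child in reversed(children):
            parent[child] = node
            stack.append(child)
    return parent, leaves


def path_to(leaf: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Backtrack from a leaf to the root and return the path root-first."""
    path = []
    node: Optional[str] = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

For a binary tree such as {"root": ["high", "low"], "high": ["A", "B"], "low": ["C", "D"]}, the traversal finds the four leaves A, B, C, D, and path_to("C", parent) gives root, low, C.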

D3. Backtracking recursively from each leaf node to the root node through the parent nodes recorded in step D2, collecting the conditions configured in each node along the way into a condition list, and assembling the contents of the condition list into the SQL statement for each decision path. Still taking FIG. 6 and FIG. 7 as an example:

After the 4 decision paths shown in FIG. 7 are resolved, each path is converted into SQL conditions. The SQL conditions for the 4 decision paths are:

A) crdt_limit>20000, is_overdue=1;

B) crdt_limit>20000, is_overdue=0;

C) crdt_limit<=20000, is_black=1;

D) crdt_limit<=20000, is_black=0;

The SQL statements are then assembled from these condition lists, with each leaf node corresponding to one SQL statement.

Splicing the SQL conditions yields the following 4 SQL statements:

A) select (crdt_limit/2) from search_table where (1=1) and (crdt_limit>20000) and (is_overdue=1);

B) select (crdt_limit*1.2) from search_table where (1=1) and (crdt_limit>20000) and (is_overdue=0);

C) select ('freeze') from search_table where (1=1) and (crdt_limit<=20000) and (is_black=1);

D) select ('no action') from search_table where (1=1) and (crdt_limit<=20000) and (is_black=0);
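The assembly step can be sketched in Python as simple string construction. The function name build_sql is hypothetical; the leading (1=1) follows the statements above and lets every condition be appended uniformly with "and", even when the condition list is empty.

```python
from typing import List


def build_sql(action: str, conditions: List[str], table: str = "search_table") -> str:
    """Assemble one SQL statement from a leaf's action and its path conditions."""
    where = "".join(f" and ({condition})" for condition in conditions)
    return f"select ({action}) from {table} where (1=1){where};"


# One (action, condition list) pair per decision path, root condition first.
paths = [
    ("crdt_limit/2", ["crdt_limit>20000", "is_overdue=1"]),
    ("crdt_limit*1.2", ["crdt_limit>20000", "is_overdue=0"]),
    ("'freeze'", ["crdt_limit<=20000", "is_black=1"]),
    ("'no action'", ["crdt_limit<=20000", "is_black=0"]),
]
statements = [build_sql(action, conditions) for action, conditions in paths]
```

This sketch interpolates already-validated expressions directly into the statement, which is only reasonable because the character, name, and syntax checks of step C run first.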

Testing shows that a 6-node cluster completes about 3 million decisions in about 1-2 minutes.

The core idea of the invention is similar to the operating model of Spark (a fast, general-purpose computing engine for large-scale data processing): tasks are only parsed and submitted locally, and are computed on a big data cluster. In a big data cluster, data is stored across the nodes in a distributed manner, so the volume of data that can be computed at one time is far larger than the memory of a single local node, and the invention can complete millions of decisions at a time. A traditional decision engine depends on local memory during computation; if local memory is too small, large-scale computation is hard to support, and the data must be split and computed in batches. To reach the same scale, the traditional scheme is therefore more complicated than the present one in both programming and data management.

Moreover, the invention makes batch decisions very quickly. A traditional decision engine stores its data in local memory and must evaluate records one by one, so its concurrency is weak. Most of the invention's computation is not local: computing tasks can be distributed to every core of every cluster node for concurrent computation, and the larger the decision scale, the greater the speed advantage over the traditional scheme.

Claims

1. The decision engine implementation method based on the decision tree is characterized by comprising the following steps:

2. A decision tree based decision engine implementation according to claim 1, characterized by: the step B comprises the following steps:

3. A decision tree based decision engine implementation according to claim 1, characterized by: the step C comprises the following steps:

4. A decision tree based decision engine implementation according to claim 3, characterized by: in step C1, if there are other characters in the expression after removing the quotation marks in the expression, the expression is determined to be illegal.

5. A decision tree based decision engine implementation according to claim 1, characterized by: the step D comprises the following steps: