CN114444721A - Model training method and device, electronic equipment and computer storage medium


Info

Publication number: CN114444721A
Application number: CN202210113822.5A
Authority: CN (China)
Legal status: Pending
Prior art keywords: sample, decision, branch, combination, accuracy
Other languages: Chinese (zh)
Inventors: 邸红叶, 徐慎昆, 冀晨光, 何秋果
Current Assignee: Alibaba Innovation Co
Original Assignee: Alibaba Singapore Holdings Pte Ltd
Application filed by Alibaba Singapore Holdings Pte Ltd
Priority: CN202210113822.5A

Classifications

    • G06N 20/00 - Machine learning
    • G06F 18/214 - Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24765 - Pattern recognition: classification techniques; rule-based classification
    • G06N 5/01 - Computing arrangements using knowledge-based models: dynamic search techniques; heuristics; dynamic trees; branch-and-bound


Abstract

An embodiment of the present application provides a model training method and apparatus, an electronic device, and a computer storage medium. The model training method comprises the following steps: obtaining sample feature data of a target object and a sample label corresponding to the sample feature data; binning the sample feature data according to the sample label corresponding to the sample feature data; determining training samples according to the binned sample feature data and the corresponding sample labels; and training a decision tree with the training samples to determine the screening conditions corresponding to the nodes of the decision tree. The method yields a more stable and robust model.

Description

Model training method and device, electronic equipment and computer storage medium
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a model training method and apparatus, an electronic device, and a computer storage medium.
Background
In the prior art, manually analyzing and mining large amounts of data to produce screening conditions is labor-intensive and cannot keep pace with the fast iteration now required, so trained decision trees have been used to mine data features instead. However, the trained decision trees are unstable, prone to overfitting, and perform poorly in actual use.
Disclosure of Invention
In view of the above, embodiments of the present application provide a model training scheme to at least partially solve the above problems.
According to a first aspect of embodiments of the present application, there is provided a model training method, including: obtaining sample feature data of a target object and a sample label corresponding to the sample feature data; binning the sample feature data according to the sample label corresponding to the sample feature data; determining training samples according to the binned sample feature data and the corresponding sample labels; and training a decision tree with the training samples to determine the screening conditions corresponding to the nodes of the decision tree.
According to a second aspect of embodiments of the present application, there is provided a risk identification method, including: acquiring attribute information of an identified object, where the attribute information includes at least one attribute and an attribute value corresponding to the attribute; and screening, according to the attributes and attribute values in the attribute information, against each screening condition in a screening condition combination to obtain a screening result indicating the risk degree of the identified object, where the screening conditions in the screening condition combination are determined according to the foregoing method.
According to a third aspect of embodiments of the present application, there is provided a rule mining method, including: obtaining sample feature data and a sample label corresponding to the sample feature data; binning the sample feature data according to the sample feature data and the corresponding sample label; training a decision tree with the binned sample feature data and the corresponding sample labels to determine the screening conditions corresponding to the nodes of the decision tree; determining a risk identification rule according to the connection relationships among the nodes of the decision tree, the screening conditions corresponding to the nodes, and the amount of sample feature data screened by the screening rules of the nodes, where the risk identification rule includes at least one screening condition; and outputting the risk identification rule.
According to a fourth aspect of embodiments of the present application, there is provided an electronic device, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the foregoing method.
According to a fifth aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described above.
According to a sixth aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the aforementioned method.
According to the model training scheme provided by the embodiments of the present application, the sample feature data of the target object is binned according to the sample labels of the sample feature data so that the binned sample feature data is discretized; training samples are then determined from the processed sample feature data and the sample labels, and the training samples are used to train a decision tree and determine the screening conditions corresponding to its nodes. This avoids overfitting of the trained decision tree and improves its robustness.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in the present application, and those skilled in the art can obtain other drawings from them.
FIG. 1A is a flowchart illustrating steps of a model training method according to a first embodiment of the present disclosure;
FIG. 1B is a schematic diagram illustrating the sub-steps of step S104 in FIG. 1A;
FIG. 1C is a diagram of a decision tree for an example scenario in the embodiment shown in FIG. 1A;
FIG. 2A is a flowchart illustrating steps of a model training method according to a second embodiment of the present application;
FIG. 2B is a flowchart illustrating the sub-steps of step S210 in FIG. 2A;
FIG. 3 is a flowchart illustrating steps of a risk identification method according to a third embodiment of the present application;
FIG. 4 is a flowchart illustrating steps of a rule mining method according to a fourth embodiment of the present application;
FIG. 5 is a block diagram of a model training apparatus according to a fifth embodiment of the present application;
FIG. 6 is a block diagram of a risk identification device according to a sixth embodiment of the present application;
FIG. 7 is a block diagram illustrating a structure of a rule mining apparatus according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device according to an eighth embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, these solutions are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application fall within the protection scope of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
Example one
Referring to fig. 1A, a flowchart illustrating steps of a model training method according to a first embodiment of the present application is shown.
In this embodiment, the method may be used to train a machine learning model, such as a decision tree, for automatically mining screening conditions, where the decision tree includes, but is not limited to, a conventional decision tree, a random forest, a GBDT model, and the like. The method comprises the following steps:
Step S102: obtaining sample feature data of the target object and a sample label corresponding to the sample feature data.
Different sample feature data can be obtained for different application scenarios of the screening conditions; for example, the sample feature data may include object profile data of the target object, device feature data, and any other suitable feature data. Similarly, the sample label may differ across application scenarios: it may be a manual annotation, or a predicted ground-truth value obtained through data mining or analysis. For example, a sample label can indicate whether a default occurred, whether a set trajectory was deviated from, and the like.
In this embodiment, the sample feature data includes an attribute value of at least one attribute. For example, in an application scenario of determining whether to go out for play, attribute values of attributes such as weather, temperature, humidity, and wind of the target object may be acquired.
The following table shows the attribute values of the aforementioned attributes for a number of different target objects.
Target object    Weather    Temperature    Humidity    Wind
1                Sunny      Hot            High        Weak
2                Cloudy     Cold           Normal      Strong
3                Rainy      Mild           Normal      Weak
4                Cloudy     Mild           High        Strong
For each target object, its sample feature data may be formed from the attribute values of its corresponding attributes. For example, the attribute values are converted into word vectors by a word embedding method and the word vectors are concatenated into the sample feature data; or, as another example, a corresponding vector representation is configured for each attribute value and these vectors are concatenated into the sample feature data.
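As a purely illustrative sketch (not taken from the patent text), the concatenation can be pictured as follows, using simple one-hot vectors in place of learned word embeddings; the attribute vocabularies below are assumptions made up for the example:

```python
import numpy as np

# Assumed vocabularies for the illustrative attributes above.
WEATHER = {"Sunny": 0, "Cloudy": 1, "Rainy": 2}
TEMPERATURE = {"Cold": 0, "Mild": 1, "Hot": 2}
HUMIDITY = {"Normal": 0, "High": 1}
WIND = {"Weak": 0, "Strong": 1}

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def to_sample_feature_vector(weather, temperature, humidity, wind):
    # One vector per attribute value, concatenated into a single sample feature vector.
    return np.concatenate([
        one_hot(WEATHER[weather], len(WEATHER)),
        one_hot(TEMPERATURE[temperature], len(TEMPERATURE)),
        one_hot(HUMIDITY[humidity], len(HUMIDITY)),
        one_hot(WIND[wind], len(WIND)),
    ])

x1 = to_sample_feature_vector("Sunny", "Hot", "High", "Weak")   # target object 1
```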
Of course, in other scenarios the sample feature data may be other data; for example, in a risk-control scenario, the sample feature data may be profile feature data (including height and other attributes), device feature data, and other suitable feature data of the target object. Of course, such feature data are obtained only with the authorization and permission of the target object.
The sample label may be obtained in any suitable manner: it may be labeled manually or determined when the sample feature data is acquired. For example, in the aforementioned application scenario of determining whether to go out to play, the sample label may be determined directly from the real going-out situation; the attribute values carrying the sample labels are shown in the following table:
Target object    Weather    Temperature    Humidity    Wind      Go out?
1                Sunny      Hot            High        Weak      No
2                Cloudy     Cold           Normal      Strong    Yes
3                Rainy      Mild           Normal      Weak      Yes
4                Cloudy     Mild           High        Strong    Yes
5                Rainy      Mild           High        Strong    No
In the table, the last column is the sample label, whose value may be "yes" or "no".
Step S104: binning the sample feature data according to the sample label corresponding to the sample feature data.
Binning discretizes the sample feature data: continuous sample feature data is cut into several intervals, and the sample feature data within each interval can be regarded as one category. Discretizing the sample feature data in this way reduces the influence of abnormal data on the trained decision tree, improves the robustness of the trained decision tree, and reduces the risk of overfitting.
In one feasible manner, the sample feature data may be numerical feature data and/or character (categorical) variable data, and supervised binning is performed on the sample feature data of a set dimension according to the sample labels to obtain the binned sample feature data.
As shown in fig. 1B, the binning process for the sample feature data may be implemented by the following sub-steps:
Sub-step S1041: sorting the sample feature data by the attribute value of a selected target attribute, and determining sample feature data with the same attribute value as one attribute value group.
The target attribute may be an attribute selected from the attributes corresponding to the sample feature data. For example, the attribute values of the attributes corresponding to the sample feature data are shown in the following table:
Target object    Attribute 1    Attribute 2    Attribute 3    Default?
AA               19.1           176            M              No
BB               20.4           156            F              Yes
CC               52             187            M              Yes
DD               31             165            F              Yes
EE               31.4           166            M              No
FF               31             187            M              Yes
In this embodiment, with attribute 1 as the target attribute, sorting by the value of the target attribute gives:
Target object    Attribute 1    Attribute 2    Attribute 3    Default?
AA               19.1           176            M              No
BB               20.4           156            F              Yes
DD               31             165            F              Yes
FF               31             187            M              Yes
EE               31.4           166            M              No
CC               52             187            M              Yes
Based on the sorting result, equal attribute values form one attribute value group: the attribute values corresponding to target object AA form attribute value group 1; similarly, the attribute values corresponding to target object BB form attribute value group 2, the attribute values corresponding to target objects DD and FF (both 31) form attribute value group 3, the attribute values corresponding to target object EE form attribute value group 4, and the attribute values corresponding to target object CC form attribute value group 5.
Sub-step S1042: binding each two adjacent attribute value groups into an attribute value set, following the group order indicated by the sorting result.
For example, following the group order, attribute value group 1 and attribute value group 2 are bound into attribute value set 1, attribute value group 2 and attribute value group 3 are bound into attribute value set 2, and so on; the rest are not repeated.
Sub-step S1043: for each attribute value set, calculating the chi-square value of the attribute value set from the sample feature data and the sample labels in the set.
The chi-square value can be calculated with the chi-square test formula from the amount of sample feature data and the corresponding sample labels; in the preceding example, the sample label indicates whether a default occurred. The chi-square value indicates the correlation between the two attribute value groups in an attribute value set.
Sub-step S1044: processing the sample feature data of the attribute value sets according to the calculated chi-square values to obtain the binned sample feature data.
In an example, sub-step S1044 can be implemented as: according to the calculated chi-square values, merging the sample feature data of the two attribute value groups in the attribute value set with the smallest chi-square value to obtain the binned sample feature data.
Specifically, if the selected minimum chi-square value does not reach the chi-square stopping threshold, the sample feature data in that attribute value set is merged. For example, in the above example, if the chi-square value of the attribute value set composed of attribute value group 3 and attribute value group 4 is the smallest, the sample feature data in attribute value group 3 and attribute value group 4 is merged to form a new attribute value group.
The process can then return to sub-step S1042 for attribute value group 1, attribute value group 2, the new attribute value group, and attribute value group 5, and stop once the chi-square value meets the chi-square stopping threshold or the number of remaining attribute value groups meets the required number.
Each attribute value group obtained in this way can be regarded as one bin; the sample feature data in different bins is more clearly differentiated, thereby achieving discretization of the sample feature data.
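The following is a minimal, illustrative sketch of the supervised chi-square binning (ChiMerge-style) described in sub-steps S1041-S1044, assuming a single numeric target attribute and a binary sample label; the maximum bin count and the chi-square stopping threshold are assumptions, not values prescribed by the patent:

```python
import numpy as np
from collections import Counter

def chi2_of_pair(labels_a, labels_b, label_values=("yes", "no")):
    # Chi-square statistic of one attribute value set (two adjacent attribute value groups).
    table = np.array([[Counter(g)[v] for v in label_values] for g in (labels_a, labels_b)], dtype=float)
    total = table.sum()
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / total
    expected[expected == 0] = 1e-9                # avoid division by zero
    return float(((table - expected) ** 2 / expected).sum())

def chi_merge(values, labels, max_bins=3, chi_stop=3.84):
    # Sub-step S1041: sort by attribute value and group equal values together.
    order = np.argsort(values)
    groups = []                                    # each entry: [attribute value, labels in the group]
    for i in order:
        if groups and groups[-1][0] == values[i]:
            groups[-1][1].append(labels[i])
        else:
            groups.append([values[i], [labels[i]]])
    # Sub-steps S1042-S1044: repeatedly merge the adjacent pair with the smallest chi-square value.
    while len(groups) > max_bins:
        chis = [chi2_of_pair(groups[k][1], groups[k + 1][1]) for k in range(len(groups) - 1)]
        k = int(np.argmin(chis))
        if chis[k] >= chi_stop:                    # stopping threshold met: groups are sufficiently distinct
            break
        groups[k][1].extend(groups[k + 1][1])      # merge the pair into a new attribute value group
        del groups[k + 1]
    return [g[0] for g in groups]                  # representative (lower) value of each bin

# Attribute 1 and the "Default?" label from the example table above.
values = [19.1, 20.4, 52, 31, 31.4, 31]
labels = ["no", "yes", "yes", "yes", "no", "yes"]
print(chi_merge(values, labels))
```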
Step S106: and determining a training sample according to the sample characteristic data subjected to the binning processing and the corresponding sample label.
In a feasible mode, the sample feature data after the binning processing and the corresponding sample labels form training samples. After the binning processing, the sample characteristic data of different bins are more concentrated, so that the discretization degree is improved.
Step S108: and training a decision tree by using the training sample to determine a screening condition corresponding to the node contained in the decision tree.
In one example, the decision tree includes at least one root node and a plurality of leaf nodes, each node has a corresponding screening condition, the sample feature data in the training sample is input into the decision tree, the relevant sample meeting the screening condition of each node is determined, and the screening condition is adjusted according to the sample label corresponding to the sample feature data, so that the relevant sample meeting the screening condition corresponds to the sample label.
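As an illustrative sketch only (the patent does not prescribe a particular library), such a tree can be fitted on the binned data with scikit-learn; the toy feature matrix and labels below are made-up stand-ins for the output of steps S104-S106:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins: binned sample feature data (steps S104/S106) and the sample labels.
X_binned = np.array([[0, 5.0, 0.5],
                     [1, 3.2, 0.9],
                     [1, 6.1, 1.2],
                     [0, 4.4, 0.7]])
y = np.array([0, 1, 1, 0])           # e.g. 0 = "no default", 1 = "default"

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_binned, y)
# Each internal node of clf now holds a screening condition of the form "feature_i <= threshold",
# which is the screening condition determined for that node in step S108.
```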
As shown in FIG. 1C, the trained decision tree includes nodes I to V, where node I is the root node, nodes II, IV, and V are leaf nodes, and node III is an intermediate node.
In an example, the screening condition corresponding to node I is that the value of the 3rd feature in the sample feature data is less than or equal to 0.8; the number of related samples corresponding to node I is 112, and the computed Gini coefficient is 0.665. The Gini coefficient indicates the purity of the related samples corresponding to a node.
Node II indicates that, among the 112 related samples, 37 satisfy the screening condition of node I; its Gini coefficient of 0 indicates that the sample labels of those 37 related samples are all the same, e.g., all "default" or all "no default".
The number of related samples corresponding to node III is 75, and its screening condition is that the value of the 2nd feature in the related samples is less than or equal to 4.95.
The number of related samples corresponding to node IV is 36, meaning that among the 75 related samples, 36 have a 2nd feature value less than or equal to 4.95; its Gini coefficient is 0.153, indicating that most of these related samples have the same sample label.
The number of related samples corresponding to node V is 39, meaning that 39 related samples have a 2nd feature value greater than 4.95; its Gini coefficient is 0.05, and the sample labels of most of these related samples are the same.
Thus, by training the decision tree, the sample feature data can be learned efficiently and classified accurately. From the principle of the decision tree, each path from the root node to a leaf node is a decision branch. For example, still taking the decision tree shown in FIG. 1C as an example, "node I - node II" forms decision branch 1, "node I - node III - node IV" forms decision branch 2, and "node I - node III - node V" forms decision branch 3.
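Continuing the earlier scikit-learn sketch (again illustrative rather than prescribed by the patent), the decision branches, i.e. the root-to-leaf paths with their screening conditions, related-sample counts, and Gini coefficients, can be read directly off the fitted tree:

```python
def decision_branches(clf):
    """Enumerate the root-to-leaf paths (decision branches) of a fitted DecisionTreeClassifier."""
    tree = clf.tree_
    branches = []

    def walk(node, conditions):
        left, right = tree.children_left[node], tree.children_right[node]
        if left == -1:                                   # leaf node: one complete decision branch
            branches.append({
                "conditions": list(conditions),          # the screening conditions along the path
                "n_related_samples": int(tree.n_node_samples[node]),
                "gini": float(tree.impurity[node]),
            })
            return
        f, t = tree.feature[node], tree.threshold[node]
        walk(left, conditions + [f"feature_{f} <= {t:.2f}"])
        walk(right, conditions + [f"feature_{f} > {t:.2f}"])

    walk(0, [])
    return branches

for branch in decision_branches(clf):
    print(branch)
```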
Each decision branch includes at least one screening condition, and each node in a decision branch has a certain number of corresponding related samples. Because the decision tree is trained with the binned sample feature data, and the binned sample feature data is discretized, the trained decision tree is more stable, overfitting is effectively prevented, and the decision tree can be trained quickly and stably on a large amount of high-dimensional sample feature data.
The sample feature data is binned in a supervised manner according to the sample labels, so that after binning the sample feature data within the same bin is more concentrated and the sample feature data is discretized, which solves the problem that continuous sample feature data easily causes overfitting when training a decision tree.
According to this embodiment, the sample feature data of the target object is binned according to the sample labels of the sample feature data so that the binned sample feature data is discretized; training samples are then determined from the processed sample feature data and the sample labels, and the training samples are used to train a decision tree and determine the screening conditions corresponding to its nodes. This avoids overfitting of the trained decision tree and improves its robustness.
The model training method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: server, mobile terminal (such as mobile phone, PAD, etc.), PC, etc.
Example two
Referring to fig. 2A, a flow chart of the steps of the method of embodiment two of the present application is shown.
The method is used to obtain rules that can accurately classify data. The method comprises the following steps:
Step S202: obtaining sample feature data of the target object and a sample label corresponding to the sample feature data.
Step S204: and performing box separation on the sample characteristic data according to the sample label corresponding to the sample characteristic data.
Step S206: and determining a training sample according to the sample characteristic data subjected to the binning processing and the corresponding sample label.
Step S208: and training a decision tree by using the training samples to determine a screening condition corresponding to the node contained in the decision tree.
Step S210: and combining different decision branches, and determining a target decision branch combination according to the screening conditions corresponding to the nodes contained in the decision branch combination, the related samples corresponding to the nodes and the training samples.
Step S212: and determining a screening condition combination according to the screening conditions corresponding to the nodes contained in the target decision branch combination.
The implementation manners of step S202 to step S208 may be the same as or similar to the implementation manners of step S102 to step S108 in the foregoing embodiments, and therefore, the description is omitted.
As shown in FIG. 2B, step S210 can be implemented by sub-steps S2101 to S2103 described below.
Sub-step S2101: for each decision branch, calculating the branch accuracy according to the number and sample labels of the related samples corresponding to the nodes in the decision branch, and the number and sample labels of the training samples.
Taking the trained decision tree shown in FIG. 1C as an example, decision branch 1 includes "node I - node II", decision branch 2 includes "node I - node III - node IV", and decision branch 3 includes "node I - node III - node V". For each decision branch, the branch accuracy can be determined by procedures A1 to C1 described below.
Procedure A1: determining the number of related samples of the decision branch and the sample labels corresponding to those related samples, according to the related samples corresponding to the nodes in the decision branch.
Taking decision branch 1 as an example, the number of related samples corresponding to node I is 112, which represents the total number of training samples. The number of related samples corresponding to node II is 37, and since its Gini coefficient is 0, the sample labels of all 37 related samples are the same, e.g., "no default".
Procedure B1: calculating the precision and the recall corresponding to the decision branch according to the number and sample labels of the related samples of the decision branch and the number and sample labels of the training samples.
The precision indicates the accuracy of the classification and can be determined from the numbers of different sample labels among the related samples. For example, the precision is the ratio of the number of related samples labeled "no default" to the number of related samples labeled "default" among the related samples corresponding to node II.
The recall indicates the completeness of the screening and may be based on the ratio of the number of related samples at node II corresponding to a certain sample label to the total number of training samples with that label, for example the ratio of the number of related samples labeled "no default" to the number of all "no default" training samples.
Procedure C1: determining the branch accuracy according to the precision and the recall of the decision branch.
In order to reflect both the recall and the precision of a decision branch, in this embodiment the branch accuracy is the product of the precision and the recall divided by the sum of the precision and the recall, multiplied by a constant; the value of the constant can be chosen as required, e.g., 1, 2, 3, and so on.
If the constant is 2, the branch accuracy can be expressed as: 2 × [(precision × recall) / (precision + recall)].
The branch accuracy may represent the accuracy and completeness of the decision branch classification.
The branch accuracy of decision branch 2 and decision branch 3 may be calculated in a similar manner, and thus will not be described in detail.
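A minimal sketch of procedures A1-C1 follows; with the constant set to 2, the branch accuracy is the familiar F1-style combination of precision and recall. Note that the precision here is taken in the usual sense (related samples with the label of interest over all related samples at the leaf), and the sample counts are placeholders, both of which are assumptions for illustration:

```python
def branch_accuracy(n_label_at_leaf, n_other_at_leaf, n_label_in_training, constant=2.0):
    """Branch accuracy = constant * (precision * recall) / (precision + recall).

    n_label_at_leaf:     related samples at the branch's leaf carrying the label of interest
    n_other_at_leaf:     related samples at the leaf carrying other labels
    n_label_in_training: training samples carrying the label of interest
    """
    precision = n_label_at_leaf / (n_label_at_leaf + n_other_at_leaf)
    recall = n_label_at_leaf / n_label_in_training
    if precision + recall == 0:
        return 0.0
    return constant * precision * recall / (precision + recall)

# Decision branch 1 in the example: 37 related samples at node II, all "no default",
# out of an assumed 60 "no default" training samples (the 60 is a placeholder).
print(branch_accuracy(37, 0, 60))
```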
Sub-step S2102: combining different decision branches according to the branch accuracy, and determining the combination accuracy corresponding to each decision branch combination according to the related samples corresponding to the different decision branch combinations and the training samples.
In one feasible manner, to balance the combination effect and the efficiency of combining decision branches, sub-step S2102 can be implemented by the following procedures:
Procedure A2: obtaining the initial combination accuracy of the decision branch combination.
If the decision branch combination is empty, the initial combination accuracy may be initialized to 0. Alternatively, if the decision branch combination is not empty, the initial combination accuracy may be determined according to the related samples corresponding to the decision branches it contains and the training samples.
Procedure B2: according to the branch accuracy, selecting the decision branch with the highest branch accuracy from the decision branches not yet included in the decision branch combination and adding it to the decision branch combination.
Taking decision branches 1-3 as an example, the branch with the highest branch accuracy (e.g., decision branch 1) is selected according to the respective branch accuracies of decision branches 1-3 and added to the decision branch combination.
Procedure C2: determining the related samples corresponding to the decision branch combination according to the related samples corresponding to the decision branches contained in the combination.
If the decision branch combination contains only decision branch 1, the related samples corresponding to the combination are the related samples corresponding to node II.
Alternatively, if the decision branch combination contains decision branches 1 and 2, the related samples corresponding to the combination are the related samples corresponding to nodes II and IV.
Procedure D2: determining the updated combination accuracy of the decision branch combination according to the related samples corresponding to the combination and the training samples.
If the decision branch combination contains only one decision branch, the updated combination accuracy is the branch accuracy of that decision branch.
If the decision branch combination contains more than one decision branch, the updated combination accuracy can be calculated from the related samples corresponding to the combination and the training samples in the same way as the branch accuracy; since the calculation principle is the same, the details are not repeated.
Based on the calculated updated combination accuracy, its magnitude is compared with the initial combination accuracy. If the updated combination accuracy is greater than the initial combination accuracy, procedure E2 is performed; otherwise, if the updated combination accuracy is less than or equal to the initial combination accuracy, the newly added decision branch is removed from the decision branch combination, the process returns to procedure B2, the decision branch with the highest branch accuracy among the remaining decision branches is selected and added, and execution continues until the set termination condition is satisfied. The termination condition is, for example, that all decision branches have been traversed.
Procedure E2: if the updated combination accuracy is greater than the initial combination accuracy, retaining the added decision branch in the decision branch combination, taking the updated combination accuracy as the new initial combination accuracy, and returning to procedure B2 to select the decision branch with the highest branch accuracy from the decision branches not yet included in the combination and add it, continuing until the termination condition is met.
That is, if the updated combination accuracy is greater than the initial combination accuracy, the combined decision branches classify better, and both precision and recall improve, so the newly added decision branch is retained in the combination, the updated combination accuracy becomes the new initial combination accuracy, and procedure B2 is executed again until the termination condition is met.
The termination condition is, for example, that all decision branches have been traversed; all possible decision branch combinations can be traversed by repeating the above process.
The following takes decision branches 1-3 as an example to illustrate the process of determining the combination of decision branches:
Initially, the decision branch combination is empty, and suppose the branch accuracies of decision branches 1-3 decrease in that order. On this basis, the initial combination accuracy of the decision branch combination is 0. Decision branch 1, which has the highest branch accuracy, is selected from decision branches 1-3 and added to the combination; at this point, the updated combination accuracy of the combination equals the branch accuracy of decision branch 1, which is greater than 0, so decision branch 1 is retained in the combination and the new initial combination accuracy is the branch accuracy of decision branch 1.
Then, decision branch 2, which has the higher branch accuracy of decision branches 2 and 3, is selected and added to the combination, so the combination now contains decision branches 1 and 2. The related samples of the combination are determined from the related samples corresponding to decision branches 1 and 2, and, combined with the training samples, the updated combination accuracy of the combination at this point is determined.
If the updated combination accuracy is less than or equal to the initial combination accuracy, the combination of decision branches 1 and 2 does not improve the classification precision and/or recall, so decision branch 2 is deleted from the combination, decision branch 1 remains in the combination, and the initial combination accuracy also remains unchanged.
Or, if the updated combination accuracy is greater than the initial combination accuracy, the decision branch 2 is retained in the decision branch combination, and the updated combination accuracy is used as the new initial combination accuracy.
Decision branch 3 is then added to the decision branch combination and the above process is repeated to determine whether decision branch 3 remains within the decision branch combination.
The decision branch combination can be automatically and quickly obtained through the process.
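The greedy selection of procedures A2-E2 can be sketched as follows; this is an illustrative reading of the process, and combination_accuracy is an assumed helper that scores a set of decision branches against the training samples on the same principle as the branch accuracy:

```python
def greedy_combine(branches, combination_accuracy):
    """branches: candidate decision branches, each carrying a precomputed "branch_accuracy".
    combination_accuracy(combo): scores a list of branches against the training samples."""
    # Try branches in descending branch accuracy (procedure B2 always picks the best remaining one).
    remaining = sorted(branches, key=lambda b: b["branch_accuracy"], reverse=True)
    combo, best = [], 0.0                       # empty combination, initial combination accuracy 0 (A2)
    for branch in remaining:
        candidate = combo + [branch]            # C2/D2: add the branch and re-score with its related samples
        updated = combination_accuracy(candidate)
        if updated > best:                      # E2: keep the branch only if the combination improves
            combo, best = candidate, updated
    return combo, best
```

In line with the optional repetition described in the next paragraph, the loop can be run again over the branches left out of the combination after the first round.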
Optionally, after one round of the loop is completed, the above process can be repeated for the decision branches not in the combination, so that different combinations are traversed sufficiently, omissions are avoided, the number of repeated combinations is effectively reduced, and efficiency is improved.
Sub-step S2103: selecting a decision branch combination whose combination accuracy meets the set threshold as the target decision branch combination.
After multiple rounds of the loop, multiple different decision branch combinations and their corresponding combination accuracies (i.e., the final initial combination accuracy or updated combination accuracy) can be obtained; the set threshold can be configured as required, and the desired target decision branch combinations are screened out according to the set threshold.
Because the non-leaf nodes of the trained decision tree have corresponding screening conditions, combining different decision branches makes it possible to automatically mine, from the decision tree, screening conditions or sets of screening conditions that can accurately classify the feature data, and thereby determine a better target decision branch combination.
Based on the determined target decision branch combination, the screening condition combination can be determined through step S212 to form a rule capable of classification. In one case, if the target decision branch combination includes decision branch 1, the corresponding screening condition is that the value of the 3rd feature is less than or equal to 0.8. If the target decision branch combination includes multiple decision branches, the corresponding screening conditions can be connected with OR, thereby obtaining the screening condition combination.
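As a small follow-on sketch (again illustrative rather than taken from the patent), a target decision branch combination can be turned into a readable screening condition combination by joining the conditions within each branch with AND and the branches with OR, using the branch dictionaries produced by decision_branches() above:

```python
def to_rule(target_branch_combination):
    """Connect the screening conditions of each decision branch with AND, and the branches with OR."""
    return " OR ".join(
        "(" + " AND ".join(branch["conditions"]) + ")"
        for branch in target_branch_combination
    )

branches = decision_branches(clf)
print(to_rule([branches[0]]))        # a single-branch combination
print(to_rule(branches[:2]))         # a two-branch combination joined with OR
```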
In this way, supervised binning can be combined with the decision tree: performing supervised binning on the obtained sample feature data with sample labels makes it possible to handle high-dimensional sample feature data and to bin both numerical and character sample feature data, and training the decision tree with the processed sample feature data reduces the influence of observation errors in the sample feature data on the decision tree.
By binning the sample feature data in this way, continuous sample feature data can be discretized, so that the decision tree trained on the discretized sample feature data is more stable and the risk of overfitting is reduced. The supervised binning makes use of the sample labels, so the information is used more effectively.
For the trained decision tree, it follows from the principle of decision trees that each path from the root node to a leaf node forms a decision branch, and each node in the decision tree has a certain number of related samples; the precision and recall of each decision branch can be calculated with these related samples to determine the branch accuracy. Based on the related samples corresponding to the nodes of the decision tree and the training samples, different decision branches can be automatically sorted and combined, and the combination accuracy of the different decision branch combinations determined, for example by a greedy, layer-by-layer sorted search over the leaf nodes. The decision branch combinations meeting the set threshold are then selected according to the combination accuracies of the different combinations to obtain the corresponding screening condition combinations. In this way, the required screening condition combination can be obtained efficiently and quickly while ensuring that it classifies well.
The decision tree in this embodiment may be a conventional decision tree, a random forest, or a GBDT model. For the supervised binning, an appropriate method such as chi-square binning or minimum-entropy binning may be adopted.
Compared with the existing grid method, the method in this embodiment does not require manually dividing grids or evaluating intermediate result indicators, and can prevent the decision tree from overfitting.
According to this embodiment, the sample feature data of the target object is binned according to the sample labels of the sample feature data so that the binned sample feature data is discretized; training samples are then determined from the processed sample feature data and the sample labels, and the training samples are used to train a decision tree and determine the screening conditions corresponding to its nodes. This avoids overfitting of the trained decision tree and improves its robustness.
The model training method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: server, mobile terminal (such as mobile phone, PAD, etc.), PC, etc.
EXAMPLE III
Referring to fig. 3, a flow diagram of a risk identification method is shown.
The method comprises the following steps:
Step S302: acquiring attribute information of an identified object, where the attribute information includes at least one attribute and an attribute value corresponding to the attribute.
Attributes include, but are not limited to, attributes related to the profile feature data of the identified object, such as preferences, and may also include device-related attributes, such as device name and model. Note that this attribute information is obtained only with the authorization and permission of the identified object.
Step S304: screening, according to the attributes and attribute values in the attribute information, against each screening condition in a screening condition combination to obtain a screening result indicating the risk degree of the identified object, where the screening conditions in the screening condition combination are determined according to the method of the foregoing embodiments.
A screening condition indicates the screened attribute and the corresponding threshold. For example, a certain screening condition may be that the attribute value of the 3rd attribute in the sample feature data is less than or equal to 0.8, where the 3rd attribute was determined during training and is, for example, a payment duration. The screening condition then actually means: if the payment duration of the identified object is greater than 0.8, determine whether other applicable screening conditions exist in the screening condition combination, and if so, continue to screen the identified object with those screening conditions until the classification of the identified object is determined, where the classification indicates the risk degree of the identified object.
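For illustration only, applying a screening condition combination to the attribute information of an identified object might look like the sketch below; the attribute names, thresholds, and rule format are assumptions made up for the example, not values taken from the patent:

```python
# Each screening condition: (attribute name, comparison operator, threshold).
screening_condition_combination = [
    ("payment_duration", "<=", 0.8),
    ("device_login_count", ">", 5),
]

def screen(attribute_info, conditions):
    """Return True if the identified object satisfies every screening condition in the combination."""
    ops = {"<=": lambda a, b: a <= b, ">": lambda a, b: a > b}
    return all(ops[op](attribute_info[attr], threshold) for attr, op, threshold in conditions)

identified_object = {"payment_duration": 0.5, "device_login_count": 7}
screening_result = screen(identified_object, screening_condition_combination)
print(screening_result)   # the screening result indicating the risk degree of the identified object
```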
In this way, the trained screening conditions can classify identified objects accurately and estimate their risk degree, enabling risk early warning and risk control and thereby improving operational safety.
Example four
Referring to fig. 4, a flowchart illustrating method steps of a fourth embodiment of the present application is shown.
The method is used for automatically mining the rules, and comprises the following steps:
Step S402: obtaining sample feature data and a sample label corresponding to the sample feature data.
The sample feature data may differ depending on the usage scenario of the rules to be mined; for example, it may be profile feature data, device feature data, and so on, without limitation. The sample labels may be "default", "no default", and the like; of course, the sample label may be another label appropriate to the usage scenario.
Step S404: binning the sample feature data according to the sample feature data and the corresponding sample label.
The process of the binning processing is as described in the previous embodiment, and therefore, is not described again.
Step S406: and training the decision tree by using the sample characteristic data after the box separation and the corresponding sample labels to determine the corresponding screening conditions of the nodes contained in the decision tree.
When the decision tree is trained, the attributes and the screening threshold values in the screening conditions of each node in the decision tree can be initialized, then the sample characteristic data are screened according to the screening conditions after initialization, so that the sample characteristic data meeting the screening conditions can be determined, gini coefficients can be calculated according to respective sample labels, and the attributes or the screening threshold values in the screening conditions can be adjusted according to the gini coefficients, so that the classification and screening of the sample characteristic data are more accurate. This is repeated until the termination condition is satisfied.
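A minimal sketch of the Gini coefficient referred to here, i.e., the impurity of the related samples reaching a node, computed from their sample labels:

```python
from collections import Counter

def gini(sample_labels):
    """Gini impurity of the related samples at a node: 1 minus the sum of squared label proportions."""
    n = len(sample_labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(sample_labels).values())

print(gini(["default"] * 5 + ["no default"] * 5))   # 0.5: maximally mixed labels
print(gini(["no default"] * 37))                    # 0.0: all labels identical, as at node II in FIG. 1C
```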
Step S408: and determining a risk identification rule according to the connection relation among the nodes of the decision tree, the screening conditions corresponding to the nodes and the quantity of the sample characteristic data screened by the screening rules of the nodes, wherein the risk identification rule comprises at least one screening condition.
Based on the trained decision tree, the screening conditions of a certain decision branch can be combined according to the connection relation between nodes, the sample characteristic data is screened according to the screening condition combination, so that the classification of the decision branch is finally determined, the accuracy and the recall rate can be determined according to the determined classification and the original sample label of the sample characteristic data, and then the branch accuracy of the decision branch is determined.
Based on the branch accuracy of each decision branch, according to the branch accuracy, selecting the decision branch with the highest branch accuracy from the decision branches not included in the decision branch combination, adding the decision branch into the decision branch combination, and determining the relevant sample corresponding to the decision branch combination according to the relevant sample corresponding to the decision branch included in the decision branch combination. And determining the updating combination accuracy of the decision branch combination according to the relevant samples and the training samples corresponding to the decision branch combination, if the updating combination accuracy is greater than the initial combination accuracy, keeping the added decision branches in the decision branch combination, taking the updating combination accuracy as a new initial combination accuracy, returning to the decision branches which are not included in the decision branch combination according to the branch accuracy, selecting the decision branch with the highest branch accuracy to be added into the decision branch combination for continuous execution until a termination condition is met.
Screening conditions in the decision branch combination are combined into a risk identification rule, so that automatic mining of the risk rule is realized based on sample characteristic data, and the risk identification rule can be used as a mining platform to provide rule mining service for the outside.
Step S410: and outputting the risk identification rule.
The risk identification rule can be output to a user, so that the user does not need to dig the rule, the risk identification rule can be automatically dug for the user as long as sample characteristic data are provided, and the use convenience is improved.
EXAMPLE five
Referring to fig. 5, a block diagram of a model training apparatus according to a fifth embodiment of the present application is shown.
In this embodiment, the model training apparatus includes:
a first obtaining module 502, configured to obtain sample feature data of a target object and a sample label corresponding to the sample feature data;
a binning module 504, configured to perform binning processing on the sample feature data according to a sample label corresponding to the sample feature data;
a determining module 506, configured to determine a training sample according to the sample feature data subjected to binning processing and the corresponding sample label;
a training module 508, configured to train a decision tree using the training samples to determine a screening condition corresponding to a node included in the decision tree.
Optionally, the sample feature data includes an attribute value corresponding to at least one attribute, and the binning module 504 is configured to: sort the sample feature data according to the attribute value of a selected target attribute, and determine sample feature data with the same attribute value as one attribute value group; bind each two adjacent attribute value groups into an attribute value set according to the group order indicated by the sorting result; for each attribute value set, calculate the chi-square value of the attribute value set from the sample feature data and sample labels in the set; and process the sample feature data of the attribute value sets according to the calculated chi-square values to obtain the binned sample feature data.
Optionally, the binning module 504 is configured to, when processing the sample feature data of the attribute value sets according to the calculated chi-square values to obtain the binned sample feature data, merge the sample feature data of the two attribute value groups in the attribute value set with the smallest chi-square value according to the calculated chi-square values to obtain the binned sample feature data.
Optionally, the trained decision tree at least includes a root node and a plurality of leaf nodes, and a path from each leaf node to the root node forms a corresponding decision branch; the device further comprises:
the combination module 510 is configured to combine different decision branches, and determine a target decision branch combination according to a screening condition corresponding to a node included in the decision branch combination, a related sample corresponding to the node, and a training sample;
a screening module 512, configured to determine a screening condition combination according to a screening condition corresponding to a node included in the target decision branch combination.
Optionally, the combining module 510 is configured to: for each decision branch, calculate the branch accuracy according to the number and sample labels of the related samples corresponding to the nodes in the decision branch and the number and sample labels of the training samples; combine different decision branches according to the branch accuracy, and determine the combination accuracy corresponding to each decision branch combination according to the related samples corresponding to the different decision branch combinations and the training samples; and select a decision branch combination whose combination accuracy meets the set threshold as the target decision branch combination.
Optionally, the combining module 510 is configured to, when calculating the branch accuracy according to the number and sample labels of the related samples corresponding to the nodes in the decision branch and the number and sample labels of the training samples: determine the number of related samples of the decision branch and the sample labels corresponding to those related samples according to the related samples corresponding to the nodes in the decision branch; calculate the precision and the recall corresponding to the decision branch according to the number and sample labels of the related samples of the decision branch and the number and sample labels of the training samples; and determine the branch accuracy according to the precision and the recall of the decision branch.
Optionally, the combining module 510 is configured to, when combining different decision branches according to the branch accuracy and determining the combination accuracy corresponding to each decision branch combination according to the related samples corresponding to the different decision branch combinations and the training samples: obtain the initial combination accuracy of the decision branch combination; according to the branch accuracy, select the decision branch with the highest branch accuracy from the decision branches not yet included in the combination and add it to the combination; determine the related samples corresponding to the combination according to the related samples corresponding to the decision branches it contains; determine the updated combination accuracy of the combination according to those related samples and the training samples; and, if the updated combination accuracy is greater than the initial combination accuracy, retain the added decision branch in the combination, take the updated combination accuracy as the new initial combination accuracy, and continue selecting, from the decision branches not yet included in the combination, the one with the highest branch accuracy and adding it, until the termination condition is met.
The model training apparatus of this embodiment is used to implement the corresponding model training method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the model training apparatus of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and is not repeated here.
EXAMPLE six
Referring to fig. 6, a block diagram of a risk identification device according to a sixth embodiment of the present application is shown.
The risk identification device comprises:
a second obtaining module 602, configured to obtain attribute information of an identification object, where the attribute information includes at least one attribute and an attribute value corresponding to the attribute;
and a condition screening module 604, configured to screen the attributes and attribute values in the attribute information against each screening condition in a screening condition combination to obtain a screening result indicating the risk degree of the identification object, where the screening conditions in the screening condition combination are determined by the foregoing apparatus.
The apparatus of this embodiment is used to implement the corresponding method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
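As a minimal sketch of how the condition screening module 604 might apply a mined screening condition combination, the following assumes each screening condition is an (attribute, operator, threshold) triple and reports the fraction of conditions the identification object satisfies as its risk degree; both the triple format and the risk score are illustrative assumptions, not a format mandated by this embodiment.

```python
import operator

# Comparison operators a screening condition may use (illustrative).
OPS = {">": operator.gt, ">=": operator.ge, "<=": operator.le, "==": operator.eq}

def screen(attributes, condition_combination):
    """Apply a screening condition combination to one identification object.

    attributes: dict mapping attribute name -> attribute value for the object.
    condition_combination: list of (attribute, op, threshold) screening conditions.
    """
    hits = [
        (attr, op, threshold)
        for attr, op, threshold in condition_combination
        if attr in attributes and OPS[op](attributes[attr], threshold)
    ]
    # The more screening conditions the object satisfies, the higher its risk degree.
    return {"matched_conditions": hits,
            "risk_degree": len(hits) / len(condition_combination)}

# Example usage with made-up attributes and screening conditions.
result = screen(
    {"order_count": 120, "refund_rate": 0.4},
    [("order_count", ">", 100), ("refund_rate", ">", 0.3), ("account_age_days", "<=", 7)],
)
print(result)   # risk_degree = 2/3 for this hypothetical object
```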
EXAMPLE seven
Referring to fig. 7, a block diagram of a rule mining device according to a seventh embodiment of the present application is shown.
The rule mining device comprises:
a third obtaining module 702, configured to obtain sample feature data and a sample label corresponding to the sample feature data;
a sample binning module 704, configured to bin the sample feature data according to the sample feature data and the corresponding sample labels;
a decision training module 706, configured to train a decision tree using the binned sample feature data and the corresponding sample labels to determine the screening conditions corresponding to the nodes contained in the decision tree;
a rule mining module 708, configured to determine a risk identification rule according to the connection relationships among the nodes of the decision tree, the screening condition corresponding to each node, and the number of sample feature data items screened by the screening condition of each node, where the risk identification rule includes at least one screening condition;
an output module 710, configured to output the risk identification rule.
The apparatus of this embodiment is used to implement the corresponding method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
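A compact sketch of this rule mining flow (binning, decision tree training, and rule extraction) is given below. It uses a simplified chi-square merge for the sample binning module 704 and scikit-learn's DecisionTreeClassifier for the decision training module 706; both choices, as well as the `min_samples` filter used when turning root-to-leaf paths into risk identification rules, are illustrative assumptions rather than the embodiment's prescribed implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def chi2_pair(a, b):
    # Chi-square statistic for two adjacent attribute value groups,
    # each given as a [negative_count, positive_count] pair.
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for col in range(2):
            expected = sum(row) * (a[col] + b[col]) / total
            if expected > 0:
                chi2 += (row[col] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, max_bins=5):
    # Bin one feature by repeatedly merging the adjacent pair of attribute value
    # groups whose chi-square value is smallest (sample binning module 704).
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, [0, 0])[int(y)] += 1
    bins = sorted(groups.items())                      # [(attribute value, [neg, pos]), ...]
    while len(bins) > max_bins:
        chis = [chi2_pair(bins[i][1], bins[i + 1][1]) for i in range(len(bins) - 1)]
        k = int(np.argmin(chis))                       # pair with the minimum chi-square value
        merged = [x + y for x, y in zip(bins[k][1], bins[k + 1][1])]
        bins[k:k + 2] = [(bins[k][0], merged)]
    edges = [v for v, _ in bins]                       # lower edge of each resulting bin
    return np.searchsorted(edges, values, side="right") - 1

def mine_rules(X_binned, y, feature_names, max_depth=3, min_samples=30):
    # Train a decision tree on the binned features (decision training module 706) and
    # emit every root-to-leaf path as a candidate risk identification rule together
    # with the number of samples it screens (rule mining module 708).
    clf = DecisionTreeClassifier(max_depth=max_depth).fit(X_binned, y)
    t = clf.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == t.children_right[node]:          # leaf node
            if t.n_node_samples[node] >= min_samples:
                rules.append({"conditions": conditions,
                              "samples": int(t.n_node_samples[node])})
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conditions + [f"{name} <= {thr:.2f}"])
        walk(t.children_right[node], conditions + [f"{name} > {thr:.2f}"])

    walk(0, [])
    return rules

# Example with stand-in data: one numeric feature, binary sample labels.
rng = np.random.default_rng(0)
amount = rng.integers(0, 50, size=500)
label = (amount > 35).astype(int)
binned = chimerge(amount, label, max_bins=4).reshape(-1, 1)
for rule in mine_rules(binned, label, ["amount_bin"], min_samples=20):
    print(rule)                                        # output module 710 (here: stdout)
```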
EXAMPLE eight
Referring to fig. 8, a schematic structural diagram of an electronic device according to an eighth embodiment of the present application is shown, and the specific embodiment of the present application does not limit a specific implementation of the electronic device.
As shown in fig. 8, the electronic device may include: a processor 802, a communication interface 804, a memory 806, and a communication bus 808.
Wherein:
The processor 802, the communication interface 804, and the memory 806 communicate with one another via the communication bus 808.
The communication interface 804 is configured to communicate with other electronic devices or servers.
The processor 802 is configured to execute the program 810, and may specifically execute the relevant steps in the above-described embodiment of the model training method.
In particular, the program 810 may include program code comprising computer operating instructions.
The processor 802 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The electronic device may include one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 806 is configured to store a program 810. The memory 806 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
The program 810 may be specifically configured to cause the processor 802 to perform operations corresponding to the aforementioned methods.
For specific implementation of each step in the program 810, reference may be made to corresponding steps and corresponding descriptions in units in the above embodiment of the model training method, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
Embodiments of the present application provide a computer storage medium, on which a computer program is stored, which when executed by a processor implements the method as described above.
The embodiment of the present application provides a computer program product, which includes computer instructions for instructing a computing device to execute operations corresponding to the foregoing method.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described herein may be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or an FPGA. It will be appreciated that the computer, processor, microprocessor, controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Further, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the methods illustrated herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (12)

1. A model training method, comprising:
obtaining sample characteristic data of a target object and a sample label corresponding to the sample characteristic data;
according to the sample label corresponding to the sample characteristic data, binning the sample characteristic data;
determining a training sample according to the sample characteristic data subjected to the binning processing and the corresponding sample label;
and training a decision tree by using the training sample to determine a screening condition corresponding to the node contained in the decision tree.
2. The method of claim 1, wherein the sample feature data comprises an attribute value corresponding to at least one attribute, and the binning the sample feature data according to the sample label corresponding to the sample feature data comprises:
sorting the sample feature data according to the attribute value of a selected target attribute, and grouping sample feature data having the same attribute value into one attribute value group;
combining every two adjacent attribute value groups into an attribute value set according to the attribute value group order indicated by the sorting result;
for each attribute value set, calculating a chi-square value of the attribute value set according to the sample feature data and the sample labels in the attribute value set;
and processing the sample feature data of the attribute value sets according to the calculated chi-square values to obtain the binned sample feature data.
3. The method of claim 2, wherein the processing the sample feature data of the attribute value sets according to the calculated chi-square values to obtain the binned sample feature data comprises:
combining, according to the calculated chi-square values, the sample feature data of the two attribute value groups in the attribute value set with the minimum chi-square value to obtain the binned sample feature data.
4. The method of claim 1, wherein the trained decision tree comprises at least a root node and a plurality of leaf nodes, and a path from each leaf node to the root node forms a respective decision branch;
the method further comprises the following steps:
combining different decision branches, and determining a target decision branch combination according to screening conditions corresponding to nodes contained in the decision branch combination, related samples corresponding to the nodes and training samples;
and determining a screening condition combination according to the screening conditions corresponding to the nodes contained in the target decision branch combination.
5. The method of claim 4, wherein the combining different decision branches and determining a target decision branch combination according to the screening condition corresponding to the node contained in the decision branch combination, the relevant sample corresponding to the node, and the training sample comprises:
for each decision branch, calculating a branch accuracy according to the number and sample labels of the relevant samples corresponding to the nodes in the decision branch and the number and sample labels of the training samples;
combining different decision branches according to the branch accuracy, and determining a combination accuracy corresponding to each decision branch combination according to the relevant samples corresponding to different decision branch combinations and the training samples;
and selecting a decision branch combination whose combination accuracy meets a set threshold as the target decision branch combination.
6. The method of claim 5, wherein the calculating, for each decision branch, a branch accuracy from the number of relevant samples and sample labels corresponding to the nodes in the decision branch and the number of training samples and sample labels comprises:
determining the number of the relevant samples of the decision branch and the sample labels corresponding to the relevant samples according to the relevant samples corresponding to the nodes in the decision branch;
respectively calculating the accuracy and the recall rate corresponding to the decision branch according to the number and sample labels of the relevant samples of the decision branch and the number and sample labels of the training samples;
and determining the branch accuracy according to the accuracy and the recall rate of the decision branch.
7. The method of claim 5, wherein the combining different decision branches according to the branch accuracy, and determining the combined accuracy corresponding to each decision branch combination according to the relevant samples corresponding to different decision branch combinations and the training sample comprises:
acquiring initial combination accuracy of the decision branch combination;
according to the branch accuracy, selecting the decision branch with the highest branch accuracy from the decision branches not included in the decision branch combination and adding the decision branch into the decision branch combination;
determining a relevant sample corresponding to a decision branch combination according to the relevant sample corresponding to the decision branch contained in the decision branch combination;
determining the updating combination accuracy of the decision branch combination according to the relevant sample and the training sample corresponding to the decision branch combination;
if the updated combination accuracy is greater than the initial combination accuracy, retaining the added decision branch in the decision branch combination, taking the updated combination accuracy as a new initial combination accuracy, and returning to the step of selecting, according to the branch accuracy, the decision branch with the highest branch accuracy from the decision branches not included in the decision branch combination and adding it to the decision branch combination, until a termination condition is met.
8. A risk identification method, comprising:
acquiring attribute information of an identification object, wherein the attribute information comprises at least one attribute and an attribute value corresponding to the attribute;
and screening the attributes and the attribute values in the attribute information against each screening condition in a screening condition combination to obtain a screening result indicating the risk degree of the identification object, wherein the screening conditions in the screening condition combination are determined according to the method of any one of claims 1 to 7.
9. A method of rule mining, comprising:
obtaining sample characteristic data and a sample label corresponding to the sample characteristic data;
binning the sample characteristic data according to the sample characteristic data and the corresponding sample label;
training a decision tree by using the binned sample characteristic data and the corresponding sample labels to determine screening conditions corresponding to nodes contained in the decision tree;
determining a risk identification rule according to the connection relationships among the nodes of the decision tree, the screening condition corresponding to each node, and the quantity of the sample characteristic data screened by the screening condition of each node, wherein the risk identification rule comprises at least one screening condition;
and outputting the risk identification rule.
10. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the method according to any one of claims 1-7, the operation corresponding to the method according to claim 8, or the operation corresponding to the method according to claim 9.
11. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1-7, or the method of claim 8, or the method of claim 9.
12. A computer program product comprising computer instructions to instruct a computing device to perform operations corresponding to the method of any of claims 1-7, or operations corresponding to the method of claim 8, or operations corresponding to the method of claim 9.
CN202210113822.5A 2022-01-30 2022-01-30 Model training method and device, electronic equipment and computer storage medium Pending CN114444721A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210113822.5A CN114444721A (en) 2022-01-30 2022-01-30 Model training method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210113822.5A CN114444721A (en) 2022-01-30 2022-01-30 Model training method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN114444721A true CN114444721A (en) 2022-05-06

Family

ID=81372600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210113822.5A Pending CN114444721A (en) 2022-01-30 2022-01-30 Model training method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN114444721A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076906A (en) * 2023-08-18 2023-11-17 云和恩墨(北京)信息技术有限公司 Distributed intelligent fault diagnosis method and system, computer equipment and storage medium
CN117076906B (en) * 2023-08-18 2024-02-23 云和恩墨(北京)信息技术有限公司 Distributed intelligent fault diagnosis method and system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112258093B (en) Data processing method and device for risk level, storage medium and electronic equipment
CN110516910A (en) Declaration form core based on big data protects model training method and core protects methods of risk assessment
CN112052951A (en) Pruning neural network method, system, equipment and readable storage medium
CN111627080B (en) Gray level image coloring method based on convolution nerve and condition generation antagonistic network
CN112580780A (en) Model training processing method, device, equipment and storage medium
CN112766402A (en) Algorithm selection method and device and electronic equipment
CN114444721A (en) Model training method and device, electronic equipment and computer storage medium
CN113268485B (en) Data table association analysis method, device, equipment and storage medium
CN116822803A (en) Carbon emission data graph construction method, device and equipment based on intelligent algorithm
CN117036060A (en) Vehicle insurance fraud recognition method, device and storage medium
CN116976491A (en) Information prediction method, device, equipment, storage medium and program product
CN117194219A (en) Fuzzy test case generation and selection method, device, equipment and medium
CN116089713A (en) Recommendation model training method, recommendation device and computer equipment
CN113688191B (en) Feature data generation method, electronic device, and storage medium
CN115565115A (en) Outfitting intelligent identification method and computer equipment
CN114978765A (en) Big data processing method serving information attack defense and AI attack defense system
CN115186749A (en) Data identification method, device, equipment and storage medium
CN114461619A (en) Energy internet multi-source data fusion method and device, terminal and storage medium
CN115099344A (en) Model training method and device, user portrait generation method and device, and equipment
CN110825846B (en) Data processing method and device
CN115982634A (en) Application program classification method and device, electronic equipment and computer program product
CN114170476A (en) Image retrieval model training method and device, electronic equipment and storage medium
CN112580781A (en) Processing method, device and equipment of deep learning model and storage medium
CN113298504A (en) Service big data grouping identification method and system based on artificial intelligence
CN111027296A (en) Report generation method and system based on knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240311

Address after: # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore
