CN109409434B - Liver disease data classification rule extraction method based on random forest - Google Patents


Info

Publication number
CN109409434B
CN109409434B (application CN201811292849.5A)
Authority
CN
China
Prior art keywords
rule
liver disease
data
random forest
norm
Prior art date
Legal status
Active
Application number
CN201811292849.5A
Other languages
Chinese (zh)
Other versions
CN109409434A (en
Inventor
Huang Liqin (黄立勤)
Chen Song (陈宋)
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University
Publication of CN109409434A
Application granted
Publication of CN109409434B

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 18/00 Pattern recognition › G06F 18/20 Analysing
    • G06F 18/24323 Classification techniques relating to the number of classes — tree-organised classifiers
    • G06F 18/214 Design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411 Classification techniques relating to the classification model — based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a random-forest-based method for extracting liver disease data classification rules, comprising the following steps. Step 1: preprocess the unbalanced or irregular liver disease data and obtain the liver disease data set with SMOTE (the synthetic minority oversampling technique). Step 2: perform binary sparse coding on the liver disease data set with a random forest model to obtain a liver disease rule set. Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain an encoded liver disease rule set. Step 4: perform feature extraction and deletion on the encoded rule set. Step 5: verify against the original data to generate the final rule set. The elastic-norm rule extraction and feature selection method of the invention, which combines the L1 and L2 norms, not only selects relatively few features but also improves generalization ability and classification accuracy, and the secondary rule extraction and verification method greatly improves the reliability of the generated rules.

Description

Liver disease data classification rule extraction method based on random forest
Technical Field
The invention belongs to the field of data processing of disease and diagnosis information, and particularly relates to a method for extracting liver disease data classification rules based on random forests.
Background
Liver cancer is the second leading cause of cancer death worldwide, and primary hepatitis can progress to fibrosis, cirrhosis and even liver cancer. Most existing liver disease diagnosis methods are black-box models that focus on the classification problem itself, so the classification rules used for diagnosing liver disease are hard to make both accurate and interpretable, and the information hidden in the data cannot be fully exposed. In practical medical applications, some black-box models achieve high accuracy but give no reason for their classifications, which matters greatly to doctors. Knowledge-representation rules extracted from the data are easier to understand than other representations, so an interpretation of the classifier can be expressed as a set of compact, effective rules. Concise and effective rule extraction provides detailed low-level explanations, which are increasingly valued in medical settings that demand not only high accuracy but also intelligibility. Rule extraction has long been a research subject in artificial intelligence: many experimental studies combine data from multiple sources to understand the underlying problem, and finding and interpreting the most important information in these sources is essential. What is needed, therefore, is an efficient algorithm that can simultaneously extract decision rules and select the relationships among key features, so as to explain the risk factors affecting liver disease while preserving predictive performance, and to provide doctors with relational expressions of the influencing factors for diagnosis.
Many diagnostic methods have been applied successfully to hepatitis data sets with different classification algorithms: attribute-weighted clustering; extreme learning machines; support vector machines; neural networks; fuzzy rule extraction based on support vector machines; classification and regression trees; support vector identification. Hsieh et al. proposed a fuzzy hyper-rectangular composite neural network based on particle swarm optimization, in which the rule pruning produced by the particle swarm algorithm does not reduce (and can even improve) recognition performance. Barakat, N. and A. P. Bradley proposed rule extraction that applies a decision tree algorithm to the output vectors of an SVM model. In similar work, rules are extracted from SVM prediction models using naive Bayes trees, TREPAN, RIPPER and CART; another line of work extracts rules from support vector models using ANFIS and DENFIS. Recently, T. Marthipmaja et al. proposed a new hybrid algorithm, support vector data description plus RIPPER, to improve the interpretability of one-class SVM classification. Most of this work concentrates on SVM classifiers. To improve the interpretability of the generated rules, Sheng Liu, Ronak Y. Patel et al. proposed a systematic model of rule extraction and feature selection based on random forests: rules are extracted from the data by a random forest, the features appearing in the rules are selected and fed back to the random forest for classification verification, and the generated rules, used for classification, can reach the accuracy of classification on the original data. The feature search algorithm may be the most important part of a feature selection method.
For feature selection, many search strategies exist, such as branch and bound, divide and conquer, greedy methods, evolutionary algorithms and annealing algorithms. Among these, greedy search strategies, such as forward selection (delta search) or backward elimination, are among the most popular techniques.
As summarized above, SVMs, neural networks, decision trees and random forests are the basic models used to study rule extraction, and limiting and extracting the number of rules mainly relies on L1 or L2 norm regularization to achieve sparsity of rules and features, i.e., feature selection and interpretability.
As described above, in actual liver disease diagnosis it is very important to have an interpretable model together with high predictive performance, so that the underlying problem can be well understood. The most advanced algorithms, such as support vector machines (SVM), artificial neural networks (ANN) and random forests (RF), generally achieve high prediction accuracy, but beyond accuracy, the construction of these models is hard to explain: they are "black-box models" or contain many decision rules that cannot be clearly interpreted. On the other hand, some algorithms, especially those based on decision trees, are easy to interpret, but their predictive performance is generally lower than that of SVM, ANN or RF.
Secondly, in liver disease diagnosis, generating too many diagnosis rules has no practical value for doctors. Rule extraction algorithms built on a basic decision-tree model can generate large rule sets, which are meaningless for intuitive interpretation by users. Although L1 norm regularization can extract rules and features, it sets the weights of weakly relevant rules directly to 0, which easily causes overfitting; the L2 norm instead shrinks weakly relevant rules to small values, which easily causes underfitting of the data.
Disclosure of Invention
To address these problems of the prior art, the present invention selects a model that balances classification performance with interpretability, and adopts an iterative elastic-norm implementation in the rule extraction process.
For liver disease data, the invention provides a new elastic-norm convergence algorithm combining L1 and L2 to select a small number of effective rules. Through a hybrid rule extraction and feature selection method, the result of rule extraction is used for feature selection: the features appearing in the generated rules are fed back to the random forest and elastic-norm coding steps to extract important rules, and the two steps alternate iteratively until the selected features and rules no longer change. Finally, and most necessarily, so that doctors or users can trust the validity and accuracy of the generated rules, the invention quantifies their performance using coverage and accuracy to achieve an optimal balance between interpretability and classification accuracy.
The invention builds a binary-coded forest generated from a random forest (RF) on the liver data, which maps sample points into the space defined by the entire set of leaf nodes (rules). A coding method based on binary coding and the elastic norm then extracts representative rules. The features appearing in the selected rules are used as the feature subset for the next cycle, with which a new RF and rule set are constructed; the process repeats until the stopping condition is met, i.e., the number of features remains stable and the number of rules converges.
The following technical scheme is specifically adopted:
A random-forest-based liver disease data classification rule extraction method, characterized by comprising the following steps:
Step 1: preprocess the unbalanced or irregular liver disease data and obtain the liver disease data set with SMOTE (the synthetic minority oversampling technique);
Step 2: perform binary sparse coding on the liver disease data set with the random forest model to obtain the liver disease rule set;
Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain the encoded liver disease rule set;
Step 4: perform feature extraction and deletion on the encoded liver disease rule set;
Step 5: verify against the original data and generate the final rule set.
Because the raw data of the liver data set is unbalanced, many problems arise in pattern recognition. For example, with an unbalanced data set, the classifier tends to "learn" the majority class and classify it with the highest accuracy; in practical applications this bias is not acceptable. Through SMOTE (the synthetic minority oversampling technique), the present invention can create "synthetic" instances for each minority class with few samples.
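The SMOTE step above can be sketched as follows. This is a minimal NumPy illustration of the synthetic-instance idea, not the patent's implementation; the neighbour count `k` and the interpolation scheme are assumed defaults.

```python
import numpy as np

def smote_oversample(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic synthetic minority-class samples by interpolating
    a randomly chosen point toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        dist = np.linalg.norm(minority - x, axis=1)
        dist[i] = np.inf                      # exclude the point itself
        neighbours = np.argsort(dist)[:k]     # k nearest minority neighbours
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()                    # random point on the segment
        out.append(x + gap * (nb - x))
    return np.vstack(out)
```

Because each synthetic point lies on a segment between two existing minority samples, the new instances stay inside the minority region rather than being arbitrary noise.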
Further, in step 2, performing binary sparse coding on the liver disease data set with the random forest model comprises the following steps:
Step 2A: train on the liver disease data set to obtain a random forest comprising a plurality of decision trees; in each decision tree, a path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, every sample of the liver disease data set falls from the root node into exactly one leaf node;
Step 2C: define a binary feature vector that captures the leaf-node structure of the random forest. For a sample x_i, the corresponding binary vector encoding the leaf nodes is defined as:
X_i = [X_i1, ..., X_iq]^T, where q is the total number of leaf nodes and
X_ij = 1 if sample x_i reaches leaf node j, and X_ij = 0 otherwise.
X_i then lies in the leaf-node space, in which each sample is mapped to a vertex of the q-dimensional hypercube and each dimension corresponds to one decision rule. This mapping therefore records, for each sample, which rules are valid and which are invalid.
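Assuming a scikit-learn random forest, the binary leaf-node coding of step 2C can be sketched with `RandomForestClassifier.apply`, which returns the leaf index each sample reaches in each tree (columns belonging to internal nodes simply stay zero):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def leaf_binary_code(forest, X):
    """Binary vector over all tree nodes: entry 1 iff the sample lands in
    that leaf (step 2C). Internal-node columns are always zero."""
    leaves = forest.apply(X)                 # shape (n_samples, n_trees)
    blocks = []
    for t, est in enumerate(forest.estimators_):
        block = np.zeros((X.shape[0], est.tree_.node_count), dtype=np.uint8)
        block[np.arange(X.shape[0]), leaves[:, t]] = 1
        blocks.append(block)
    return np.hstack(blocks)                 # one hypercube vertex per sample
```

Each row sums to the number of trees, since a sample reaches exactly one leaf per tree — the "exactly one leaf node" property of step 2B.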
Further, in step 3, performing elastic-norm sparse-coding rule extraction on the liver disease rule set comprises the following steps:
Step 3A: construct new training samples from the mapping result of step 2C:
{(X_1, y_1), (X_2, y_2), ..., (X_p, y_p)};
where X_i is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label. The classification formula is defined as:
class(X) = argmax_{k = 1, ..., K} (W_k^T X + b_k)   (1)
where the weight vector W_k and the scalar b_k define the linear discriminant function of the k-th class.
Since each binary attribute represents a decision rule, the weights W_k in equation (1) measure the importance of the rules: the magnitude of a weight indicates the importance of the corresponding rule. Clearly, if a rule's weights are 0 for all classes in the above classifier, the rule can be safely removed. Rule extraction is therefore a problem of learning the weight vectors.
Step 3B: perform elastic-norm regularized learning, where the objective function is:
min over W, b, ξ of Σ_{k=1..K} [ P ||W_k||_1 + (1 − P) ||W_k||_2^2 ] + λ Σ_{i=1..p} ξ_i
subject to (W_{y_i} − W_k)^T X_i + b_{y_i} − b_k ≥ 1 − ξ_i for all k ≠ y_i,
ξ_i ≥ 0, i = 1, ..., p   (2)
The objective function consists of two terms. The first is the elastic-norm formula combining the L1 and L2 norms,
P ||W_k||_1 + (1 − P) ||W_k||_2^2,
which controls the number of non-zero weights and extracted rules; P is a probability factor that trades off the L1 and L2 norms. The second term, the sum of the slack variables ξ_i, is related to the empirical error, because a non-zero slack variable represents a misclassified sample; λ is the regularization parameter. The sparsity of the result and the empirical error depend on the regularization parameters. L1 and L2 norm sparse coding have been widely applied in statistics and machine learning: the L1 norm can remove unimportant features, while the L2 norm can prevent overfitting the data. After step 3B, the invention selects the P value that gives the highest model cross-validation accuracy and substitutes it into formula (2).
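A practical stand-in for the elastic-norm objective (2) — not the patent's exact solver — is scikit-learn's `SGDClassifier` with a hinge loss and `penalty='elasticnet'`, where `l1_ratio` plays the role of the probability factor P and `alpha` that of λ, applied to the binary rule codes:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy binary rule codes (rows = samples, columns = rules) and class labels.
rng = np.random.default_rng(0)
X_rules = rng.integers(0, 2, size=(80, 12)).astype(float)
y = (X_rules[:, 0] + X_rules[:, 1] > 1).astype(int)  # only rules 0 and 1 matter

# Hinge loss + elastic-net penalty: l1_ratio ~ P, alpha ~ lambda in (2).
clf = SGDClassifier(loss="hinge", penalty="elasticnet",
                    l1_ratio=0.5, alpha=1e-3,
                    max_iter=5000, random_state=0).fit(X_rules, y)

weights = np.abs(clf.coef_).max(axis=0)   # rule importance as in equation (1)
```

Rules whose weight magnitude stays at (or near) zero across all classes are candidates for removal, as the description around equation (1) states.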
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, compute its out-of-bag error on the corresponding out-of-bag (OOB) data, denoted errOOB1;
Step 3D: randomly add noise interference to the feature across all samples of the OOB data, and compute the out-of-bag error again, denoted errOOB2;
Step 3E: with Ntree decision trees in the random forest, the importance of the feature is defined as
Σ (errOOB2 − errOOB1) / Ntree   (3)
The importance of every feature is calculated in this way.
Equation (3) is used as the importance measure of a feature because, if the out-of-bag accuracy drops sharply after noise is randomly added to the feature, the feature has a large influence on the classification result of the samples, i.e., the feature is highly important.
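Steps 3C–3E amount to permutation importance. A single-model sketch follows; the patent averages over each tree's own out-of-bag data, which is simplified here to one held-out evaluation for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def noise_importance(model, X_heldout, y_heldout, seed=0):
    """errOOB2 - errOOB1 per feature: the error increase after shuffling
    ('adding noise to') one feature on held-out data, cf. equation (3)."""
    rng = np.random.default_rng(seed)
    err1 = 1.0 - model.score(X_heldout, y_heldout)          # errOOB1
    imp = np.empty(X_heldout.shape[1])
    for j in range(X_heldout.shape[1]):
        Xp = X_heldout.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])                # noise in feature j
        imp[j] = (1.0 - model.score(Xp, y_heldout)) - err1  # errOOB2 - errOOB1
    return imp
```

On data where only one feature carries signal, shuffling that feature should raise the error far more than shuffling a noise feature, matching the rationale given for equation (3).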
Further, in step 4, performing feature extraction and deletion on the encoded liver disease rule set comprises the following steps:
The distribution of features in the random forest is determined by the forest's learning process and usually differs from the feature distribution obtained by rule extraction with the preceding formula; this difference can be exploited to select features. If a feature does not appear in any rule extracted by equation (2), it is deleted, because it has no effect on the classifier defined by equation (1). Under this idea, both rules and features can be selected.
The regularization parameter λ can be selected by cross-validation on the training set. By reconstructing the random forest with the selected features, rules can be selected further to obtain a more compact rule set. In each iteration, the features selected in the previous iteration are used to construct a new random forest, from which new rules are generated; the iteration continues until the selected features no longer change.
Specifically:
Step 4A: if a feature does not appear in any rule extracted by formula (2), delete it;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct and train the random forest;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change.
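The step 2A → 4B loop can be sketched generically; `fit_forest` and `extract` below are placeholder callbacks standing in for the random-forest training and the equation-(2) rule/feature selection, which are assumptions for illustration:

```python
def iterate_selection(X_cols, fit_forest, extract, max_iter=20):
    """Repeat: rebuild the forest on the surviving features, extract rules,
    keep only features appearing in the rules (step 4A); stop when the
    feature set is stable (step 4C). X_cols: initial feature indices."""
    features = list(X_cols)
    rules = []
    for _ in range(max_iter):
        forest = fit_forest(features)            # step 2A on current features
        rules, used = extract(forest, features)  # rules + positions of used features
        survivors = [features[j] for j in sorted(used)]
        if survivors == features:                # step 4C: no change -> stop
            break
        features = survivors
    return features, rules
```

The loop terminates because the feature set can only shrink or stay the same, which is exactly the stopping condition of step 4C.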
Further, in step 5, verifying against the original data and generating the final rule set comprises the following steps:
Step 5A: given a class-labeled liver disease data set D, let n_covers be the number of data items covered by the rule set R and n_correct the number of data items accurately classified by R. The coverage and accuracy of the rule set R are defined as:
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
The higher a rule's coverage and accuracy, the greater its credibility for auxiliary diagnosis; the final rule set is generated from the rules with relatively high coverage and accuracy.
The elastic-norm rule extraction and feature selection method of the invention and its preferred scheme, which combines the L1 and L2 norms, not only selects relatively few features but also improves generalization ability and classification accuracy.
The secondary rule extraction and verification method of the invention (i.e., verifying against the original data to generate the final rule set) greatly improves the reliability of the generated rules.
The invention can extract and verify multiple types of rules, overcoming the limitation of earlier work in which rule extraction was restricted to classification alone.
An unbalanced training data set causes many problems in pattern recognition. For example, with an unbalanced data set, the classifier tends to "learn" the majority class and classify it with the highest accuracy; in practical applications this bias is not acceptable. To obtain a uniform distribution of the sample data, the invention solves this problem with the synthetic minority oversampling technique, whose algorithm creates "synthetic" instances for each minority class with few samples.
The advantages of the invention in the specific examples over the prior art are as follows:
1. because the existing liver disease data rule extraction algorithm is mainly based on SVM or decision tree, the feature search algorithm is probably the most important part in the feature selection method. And aiming at feature selection, a plurality of search strategies such as branching and constraint, a divide-and-conquer method, a greedy method, an evolutionary algorithm, an annealing algorithm and the like are provided. Among them, greedy search strategies, forward selection (delta search) or backward elimination, for example, are one of the most popular techniques, but their computational efficiency, robustness, is prone to over-or under-fitting.
The method adopts a basic model of random forests for liver disease data, solves the defect that the SVM cannot explain the rule with high precision, and adopts the elastic norm convergence combining L1 and L2 innovatively, so that the method can solve the problem of overfitting caused by excessive deletion rules or characteristics of the L1 norm; the problem that the L2 norm has too many rules or characteristics to cause under-fitting is solved.
2. Because a rule set generated by the result of the liver disease rule extraction algorithm does not have an effective verification algorithm, namely the generated rule is the final rule, the reliability of the strategy is poor.
The invention adopts a rule verification algorithm as a secondary verification step for generating a rule set. Two problems can be solved: 1. when the number of the rules is small, the credibility condition of each rule in the original sample can be verified; 2. when the ratio of the rules is more, the method can be used as a means for simplifying the rules and an algorithm for verifying the reliability of the rules again.
3. In the case of data noise or missing in medical data, especially in the original data of liver diseases, data abnormality can bias the accuracy of the model and the generated rule to the part of the data which is normal.
Firstly, missing values in a few-number-of-synthesis oversampling technology are processed, the missing values are filled by using median, and data continuity is guaranteed; second, resampling is used to keep the amount of samples of different classes consistent, and cross-validation is used to ensure sufficient training samples.
4. In the existing algorithm for rule extraction of liver data, the same type is mostly adopted for rule extraction, and obviously, the problems of increased calculation time, unrealistic practical application and the like are caused.
The random forest model adopted by the invention firstly classifies and stores data and confirms that the number of different types of samples is kept consistent again before running to the sample rule extraction, and then calculates the samples finished by liver disease classification at the same time to carry out rule extraction and feature selection. Such processing improves overall computational efficiency.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow diagram of an embodiment of a method;
FIG. 2 is a schematic diagram of binary encoding for a random forest in an embodiment of the present invention;
FIG. 3 is a schematic diagram of rule culling in an embodiment of the invention;
FIG. 4 is a schematic representation of the manner in which the L1 and L2 norms are combined in an embodiment of the invention;
FIG. 5 is a flow chart of a main algorithm of the method according to the embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
as shown in fig. 1, an embodiment of the present invention includes the steps of:
Step 1: preprocess the unbalanced or irregular liver disease data and obtain the liver disease data set with SMOTE (the synthetic minority oversampling technique);
Step 2: perform binary sparse coding on the liver disease data set with the random forest model to obtain the liver disease rule set;
Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain the encoded liver disease rule set;
Step 4: perform feature extraction and deletion on the encoded liver disease rule set;
Step 5: verify against the original data and generate the final rule set.
Because the raw data of the liver data set is unbalanced, many problems arise in pattern recognition. For example, with an unbalanced data set, the classifier tends to "learn" the majority class and classify it with the highest accuracy; in practical applications this bias is not acceptable. Through SMOTE processing (mainly comprising data balancing and missing-data handling), the invention creates "synthetic" instances for each minority class with few samples.
In step 2, performing binary sparse coding on the liver disease data set with the random forest model comprises the following steps:
Step 2A: train on the liver disease data set to obtain a random forest comprising a plurality of decision trees; in each decision tree, a path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, every sample of the liver disease data set falls from the root node into exactly one leaf node;
Step 2C: define a binary feature vector that captures the leaf-node structure of the random forest. For a sample x_i, the corresponding binary vector encoding the leaf nodes is defined as:
X_i = [X_i1, ..., X_iq]^T, where q is the total number of leaf nodes.
As shown in fig. 2, in an example to which the present invention is applied, the mapping relationship of the nodes is:
X_ij = 1 if sample x_i reaches leaf node j, and X_ij = 0 otherwise.
X_i then lies in the leaf-node space, in which each sample is mapped to a vertex of the q-dimensional hypercube and each dimension corresponds to one decision rule. This mapping therefore records, for each sample, which rules are valid and which are invalid.
In step 3, performing elastic-norm sparse-coding rule extraction on the liver disease rule set comprises the following steps:
Step 3A: construct new training samples from the mapping result of step 2C:
{(X_1, y_1), (X_2, y_2), ..., (X_p, y_p)};
where X_i is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label. The classification formula is defined as:
class(X) = argmax_{k = 1, ..., K} (W_k^T X + b_k)   (1)
where the weight vector W_k and the scalar b_k define the linear discriminant function of the k-th class.
Since each binary attribute represents a decision rule, the weights W_k in equation (1) measure the importance of the rules: the magnitude of a weight indicates the importance of the corresponding rule. Clearly, if a rule's weights are 0 for all classes in the above classifier, the rule can be safely removed. Rule extraction is therefore a problem of learning the weight vectors.
Step 3B: perform elastic-norm regularized learning, where the objective function is:
min over W, b, ξ of Σ_{k=1..K} [ P ||W_k||_1 + (1 − P) ||W_k||_2^2 ] + λ Σ_{i=1..p} ξ_i
subject to (W_{y_i} − W_k)^T X_i + b_{y_i} − b_k ≥ 1 − ξ_i for all k ≠ y_i,
ξ_i ≥ 0, i = 1, ..., p   (2)
As shown in fig. 4, the objective function consists of two terms. The first is the elastic-norm formula combining the L1 and L2 norms,
P ||W_k||_1 + (1 − P) ||W_k||_2^2,
which controls the number of non-zero weights and extracted rules; P is a probability factor that trades off the L1 and L2 norms. The second term, the sum of the slack variables ξ_i, is related to the empirical error, because a non-zero slack variable represents a misclassified sample; λ is the regularization parameter. The sparsity of the result and the empirical error depend on the regularization parameters. L1 and L2 norm sparse coding have been widely applied in statistics and machine learning: the L1 norm can remove unimportant features, while the L2 norm can prevent overfitting the data. After step 3B, the invention selects the P value that gives the highest model cross-validation accuracy and substitutes it into formula (2).
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, compute its out-of-bag error on the corresponding out-of-bag (OOB) data, denoted errOOB1;
Step 3D: randomly add noise interference to the feature across all samples of the OOB data, and compute the out-of-bag error again, denoted errOOB2;
Step 3E: with Ntree decision trees in the random forest, the importance of the feature is defined as
Σ (errOOB2 − errOOB1) / Ntree   (3)
The importance of every feature is calculated in this way.
Equation (3) is used as the importance measure of a feature because, if the out-of-bag accuracy drops sharply after noise is randomly added to the feature, the feature has a large influence on the classification result of the samples, i.e., the feature is highly important.
In step 4, performing feature extraction and deletion on the encoded liver disease rule set comprises the following steps:
The distribution of features in the random forest is determined by the forest's learning process and usually differs from the feature distribution obtained by rule extraction with the preceding formula; this difference can be exploited to select features. If a feature does not appear in any rule extracted by equation (2), it is deleted, because it has no effect on the classifier defined by equation (1). Under this idea, both rules and features can be selected.
The regularization parameter λ can be selected by cross-validation on the training set. By reconstructing the random forest with the selected features, rules can be selected further to obtain a more compact rule set. In each iteration, the features selected in the previous iteration are used to construct a new random forest, from which new rules are generated; the iteration continues until the selected features no longer change.
As shown in fig. 3, which presents the final result of processing the random forest of fig. 2 according to the present invention, the specific steps are:
Step 4A: if a feature does not appear in any rule extracted by formula (2), delete it;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct and train the random forest;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change.
In step 5, original-data verification and generation of the final rule set comprise the following steps:
Step 5A: given a class-labeled liver disease data set D, let n_covers be the number of data instances covered by the rule set R, and n_correct the number of data instances correctly classified by R; the coverage and accuracy of the rule set R are defined as:
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
The higher a rule's coverage and accuracy, the greater its credibility for auxiliary diagnosis; the final rule set is generated from the rules with relatively high coverage and accuracy.
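These two measures can be sketched in a few lines. The (feature, operator, threshold) rule encoding and the majority vote among matching rules below are illustrative assumptions, not the patent's exact rule representation:

```python
import numpy as np

def covers(rule, x):
    """A rule is a list of (feature_index, op, threshold) conditions
    (hypothetical encoding); all conditions must hold for the rule to fire."""
    return all((x[f] <= t) if op == "<=" else (x[f] > t) for f, op, t in rule)

def coverage_accuracy(rule_set, X, y):
    """coverage(R) = n_covers / |D|, accuracy(R) = n_correct / n_covers."""
    n_covers = n_correct = 0
    for x, label in zip(X, y):
        matched = [cls for rule, cls in rule_set if covers(rule, x)]
        if matched:
            n_covers += 1
            # Majority vote among the matching rules decides the prediction.
            if max(set(matched), key=matched.count) == label:
                n_correct += 1
    return n_covers / len(X), (n_correct / n_covers if n_covers else 0.0)

# Toy rule set: "feature 0 <= 0.5 -> class 0" and "feature 0 > 0.5 -> class 1".
R = [([(0, "<=", 0.5)], 0), ([(0, ">", 0.5)], 1)]
X = np.array([[0.2], [0.4], [0.9]])
y = np.array([0, 1, 1])
cov, acc = coverage_accuracy(R, X, y)   # cov = 1.0, acc = 2/3
```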
As shown in FIG. 5, the pseudo-code of an embodiment of the present invention runs as follows:
Preconditions:
1: initialize the feature variable F.
2: randomly select training samples X from the total sample set D.
Output:
1: selected features F_f
2: selected rules R_f
Main steps of the algorithm:
1: features F_i, i = 1;
2: while the selected feature set still changes, do:
3: run the random forest model on the training samples X restricted to the features F_i;
4: the random forest generates a rule set R_r;
5: encode the training samples X with the rule set R_r;
6: solve the linear formula (2) to obtain the cross-validation accuracy C_i and the weight values W;
7: when a weight exceeds a threshold (a sufficiently small preset value), record the indices of all such parameters;
8: pass the rules of R_r at the recorded indices to R_i;
9: pass the features appearing in R_i to F_{i+1}, and set i = i + 1;
10: end loop;
11: select the i with the largest cross-validation accuracy C_i as i*;
12: pass F_{i*} to F_f;
13: pass R_{i*} to R_f;
14: return F_f and R_f.
The present invention is not limited to the preferred embodiment described above; any other random-forest-based liver disease data classification rule extraction method obtained in any form according to the teaching of the present invention falls within its scope, and by the same design, rules can be effectively extracted from other kinds of sample data with pronounced irregularity and imbalance.

Claims (1)

1. A random-forest-based liver disease data classification rule extraction method, characterized by comprising the following steps:
Step 1: preprocess the unbalanced or irregular liver disease data using the synthetic minority oversampling technique (SMOTE) to obtain a liver disease data set;
Step 2: perform binary sparse coding on the liver disease data set with a random forest model to obtain a liver disease rule set;
Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain a coded liver disease rule set;
Step 4: perform feature extraction and deletion on the coded liver disease rule set;
Step 5: perform original-data verification to generate the final rule set;
in step 2, binary sparse coding of the liver disease data set with the random forest model comprises the following steps:
Step 2A: train on the liver disease data set to obtain a random forest comprising a plurality of decision trees; in each decision tree, a path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, every sample of the liver disease data set follows exactly one path from the root node to a single leaf node;
Step 2C: define a binary feature vector capturing the leaf-node structure of the random forest: for a sample, the corresponding binary vector encoding the leaf nodes is defined as
X_i = [x_i1, ..., x_iq]^T, where q is the total number of leaf nodes, and
x_ij = 1 if sample i falls into leaf node j, and x_ij = 0 otherwise;
the space of X_i is then the leaf-node space, in which each sample is mapped to a vertex of a hypercube, and each dimension of this rule space corresponds to one decision rule;
in step 3, elastic-norm sparse-coding rule extraction from the liver disease rule set comprises the following steps:
Step 3A: construct new training samples from the mapping result of step 2C:
{(X_1, y_1), (X_2, y_2), ..., (X_p, y_p)};
where X_i is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label; the formula defining the class is
f(X_i) = argmax_{k ∈ {1,...,K}} (W_k^T X_i + b_k) (1)
where the weight vector W_k and the scalar b_k define the linear discriminant function of the kth class;
Step 3B: perform elastic-norm regularized learning with the objective function
min_{W,b,ξ} Σ_{k=1}^{K} [ P||W_k||_1 + (1 − P)||W_k||_2^2 ] + λ Σ_{i,k} ξ_ik
s.t. (W_yi − W_k)^T X_i + b_yi − b_k + ξ_ik ≥ 1, k ≠ y_i
ξ_ik ≥ 0, i = 1, ..., p (2)
The objective function consists of two terms. The first term is the elastic-norm formula combining the L1 and L2 norms,
P||W_k||_1 + (1 − P)||W_k||_2^2,
which controls the number of non-zero weights and hence the number of extracted rules; P is the probability factor for selecting the L1 or L2 norm, with 0 ≤ P ≤ 1. The second term is the sum of the slack (relaxation) variables ξ_ik, weighted by the regularization parameter λ;
in step 4, feature extraction and deletion on the coded liver disease rule set comprise the following steps:
Step 4A: if a feature does not appear in any rule extracted by formula (2), delete that feature;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct and retrain the random forest;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change;
in step 5, original-data verification and generation of the final rule set comprise the following steps:
Step 5A: given a class-labeled liver disease data set D, let n_covers be the number of data instances covered by the rule set R, and n_correct the number of data instances correctly classified by R; the coverage and accuracy of the rule set R are defined as
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
The higher a rule's coverage and accuracy, the greater its credibility for auxiliary diagnosis; the final rule set is generated from the rules with relatively high coverage and accuracy;
after step 3B, the P value with the highest model cross-validation accuracy is selected and substituted into formula (2); the importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, calculate its out-of-bag error on the corresponding out-of-bag data OOB, denoted errOOB1;
Step 3D: randomly add noise interference to the values of a feature across all samples of the out-of-bag data OOB, and recompute the out-of-bag error, denoted errOOB2;
Step 3E: let the random forest contain Ntree decision trees; the importance of the feature is defined as
∑(errOOB2 − errOOB1)/Ntree (3)
and the importance of every feature is computed in this way.
CN201811292849.5A 2018-02-05 2018-11-01 Liver disease data classification rule extraction method based on random forest Active CN109409434B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810115962X 2018-02-05
CN201810115962 2018-02-05

Publications (2)

Publication Number Publication Date
CN109409434A CN109409434A (en) 2019-03-01
CN109409434B true CN109409434B (en) 2021-05-18

Family

ID=65470743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811292849.5A Active CN109409434B (en) 2018-02-05 2018-11-01 Liver disease data classification rule extraction method based on random forest

Country Status (1)

Country Link
CN (1) CN109409434B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156537A (en) * 2014-08-19 2014-11-19 中山大学 Cellular automaton urban growth simulating method based on random forest
CN105844300A (en) * 2016-03-24 2016-08-10 河南师范大学 Optimized classification method and optimized classification device based on random forest algorithm



Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Combined Rule Extraction and Feature Elimination in Supervised Classification; Sheng Liu et al.; IEEE Transactions on NanoBioscience; 2012-09-30; vol. 11, no. 3; Sections II-III *
Regularization and variable selection via the elastic net; Hui Zou et al.; Journal of the Royal Statistical Society; 2005-11-10; vol. 67, no. 2; Section 2.1 *
Two-stage variable selection based on the random forest algorithm; Feng Panfeng et al.; Journal of Systems Science and Mathematical Sciences; 2018-01-31; vol. 38, no. 1; pp. 119-130 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant