CN109409434B - Liver disease data classification rule extraction method based on random forest - Google Patents


Info

Publication number
CN109409434B
CN109409434B (application CN201811292849.5A)
Authority
CN
China
Prior art keywords
rule
liver disease
data
random forest
norm
Prior art date
Legal status
Active
Application number
CN201811292849.5A
Other languages
Chinese (zh)
Other versions
CN109409434A (en
Inventor
Huang Liqin (黄立勤)
Chen Song (陈宋)
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University
Publication of CN109409434A
Application granted
Publication of CN109409434B

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 18/00 Pattern recognition › G06F 18/20 Analysing
    • G06F 18/24323 Classification techniques relating to the number of classes — tree-organised classifiers
    • G06F 18/214 Design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411 Classification techniques relating to the classification model — based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a random-forest-based method for extracting liver disease data classification rules, comprising the following steps. Step 1: preprocess the unbalanced or irregular liver disease data and obtain the liver disease data set with SMOTE (the synthetic minority oversampling technique). Step 2: perform binary sparse coding on the liver disease data set with a random forest model to obtain a liver disease rule set. Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain an encoded liver disease rule set. Step 4: perform feature extraction and deletion on the encoded rule set. Step 5: verify against the original data to generate the final rule set. The elastic-norm rule extraction and feature selection method of the invention, which combines the L1 and L2 norms, not only selects relatively few features but also improves generalization ability and classification accuracy, and the secondary rule extraction and verification method greatly improves the reliability of the generated rules.

Description

Liver disease data classification rule extraction method based on random forest
Technical Field
The invention belongs to the field of data processing of disease and diagnosis information, and particularly relates to a method for extracting liver disease data classification rules based on random forests.
Background
Liver cancer is the second leading cause of cancer death worldwide, and primary hepatitis can progress to fibrosis, cirrhosis and even liver cancer. Most existing liver disease diagnosis methods are black-box models that focus on the classification problem itself, so the classification rules used for diagnosing liver disease are hard to make both accurate and interpretable, and the information hidden in the data cannot be fully exposed. In practical medical applications, some black-box models achieve high accuracy but give no reason for their classifications, which matters greatly to doctors. Knowledge-representation rules extracted from the data are easier to understand than other representations, so an interpretation of the classifier can be expressed as a set of compact, effective rules. Concise and effective rule extraction provides detailed low-level explanations, which are increasingly valued in medical settings that demand not only high accuracy but also intelligibility. Rule extraction has long been a research subject in artificial intelligence: many experimental studies combine data from multiple sources to understand the underlying problem, and finding and interpreting the most important information in these sources is essential. What is needed, therefore, is an efficient algorithm that can simultaneously extract decision rules and select the relationships among key features, so as to explain the risk factors affecting liver disease while preserving predictive performance, and to provide doctors with relational expressions of the influencing factors for diagnosis.
Many diagnostic methods have been applied successfully to hepatitis data sets with different classification algorithms: attribute-weighted clustering; extreme learning machines; support vector machines; neural networks; fuzzy rule extraction based on support vector machines; classification and regression trees; support vector identification. Hsieh et al. proposed a fuzzy hyper-rectangular composite neural network based on particle swarm optimization, in which the rule pruning produced by the particle swarm algorithm does not reduce (and can even improve) recognition performance. Barakat, N. and A. P. Bradley proposed rule extraction that applies a decision tree algorithm to the output vectors of an SVM model. In similar work, rules are extracted from SVM prediction models using naive Bayes trees, TREPAN, RIPPER and CART; another line of work extracts rules from support vector models using ANFIS and DENFIS. Recently, T. Marthipmaja et al. proposed a new hybrid algorithm, support vector data description plus RIPPER, to improve the interpretability of one-class SVM classification. Most of this work concentrates on SVM classifiers. To improve the interpretability of the generated rules, Sheng Liu, Ronak Y. Patel et al. proposed a systematic model of rule extraction and feature selection based on random forests: rules are extracted from the data by a random forest, the features appearing in the rules are selected and fed back to the random forest for classification verification, and the generated rules, used for classification, can reach the accuracy of classification on the original data. The feature search algorithm may be the most important part of a feature selection method.
For feature selection, many search strategies exist, such as branch and bound, divide and conquer, greedy methods, evolutionary algorithms and annealing algorithms. Among these, greedy search strategies, such as forward selection (delta search) or backward elimination, are among the most popular techniques.
As summarized above, SVMs, neural networks, decision trees and random forests are the basic models used to study rule extraction, and limiting and extracting the number of rules mainly relies on L1 or L2 norm regularization to achieve sparsity of rules and features, i.e., feature selection and interpretability.
As described above, in actual liver disease diagnosis it is very important to have an interpretable model together with high predictive performance, so that the underlying problem can be well understood. The most advanced algorithms, such as support vector machines (SVM), artificial neural networks (ANN) and random forests (RF), generally achieve high prediction accuracy, but beyond accuracy, the construction of these models is hard to explain: they are "black-box models" or contain many decision rules that cannot be clearly interpreted. On the other hand, some algorithms, especially those based on decision trees, are easy to interpret, but their predictive performance is generally lower than that of SVM, ANN or RF.
Secondly, in liver disease diagnosis, generating too many diagnosis rules has no practical value for doctors. Rule extraction algorithms built on a basic decision-tree model can generate large rule sets, which are meaningless for intuitive interpretation by users. Although L1 norm regularization can extract rules and features, it sets the weights of weakly relevant rules directly to 0, which easily causes overfitting; the L2 norm instead shrinks weakly relevant rules to small values, which easily causes underfitting of the data.
Disclosure of Invention
To address these problems of the prior art, the present invention selects a model that balances classification performance with interpretability, and adopts an iterative elastic-norm implementation in the rule extraction process.
For liver disease data, the invention provides a new elastic-norm convergence algorithm combining L1 and L2 to select a small number of effective rules. Through a hybrid rule extraction and feature selection method, the result of rule extraction is used for feature selection: the features appearing in the generated rules are fed back to the random forest and elastic-norm coding steps to extract important rules, and the two steps alternate iteratively until the selected features and rules no longer change. Finally, and most necessarily, so that doctors or users can trust the validity and accuracy of the generated rules, the invention quantifies their performance using coverage and accuracy to achieve an optimal balance between interpretability and classification accuracy.
The invention builds a binary-coded forest generated from a random forest (RF) on the liver data, which maps sample points into the space defined by the entire set of leaf nodes (rules). A coding method based on binary coding and the elastic norm then extracts representative rules. The features appearing in the selected rules are used as the feature subset for the next cycle, with which a new RF and rule set are constructed; the process repeats until the stopping condition is met, i.e., the number of features remains stable and the number of rules converges.
The following technical scheme is specifically adopted:
A random-forest-based liver disease data classification rule extraction method, characterized by comprising the following steps:
Step 1: preprocess the unbalanced or irregular liver disease data and obtain the liver disease data set with SMOTE (the synthetic minority oversampling technique);
Step 2: perform binary sparse coding on the liver disease data set with the random forest model to obtain the liver disease rule set;
Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain the encoded liver disease rule set;
Step 4: perform feature extraction and deletion on the encoded liver disease rule set;
Step 5: verify against the original data and generate the final rule set.
Because the raw data of the liver data set is unbalanced, many problems arise in pattern recognition. For example, with an unbalanced data set, the classifier tends to "learn" the majority class and classify it with the highest accuracy; in practical applications this bias is not acceptable. Through SMOTE (the synthetic minority oversampling technique), the present invention can create "synthetic" instances for each minority class with few samples.
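The SMOTE step above can be sketched as follows. This is a minimal NumPy illustration of the synthetic-instance idea, not the patent's implementation; the neighbour count `k` and the interpolation scheme are assumed defaults.

```python
import numpy as np

def smote_oversample(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic synthetic minority-class samples by interpolating
    a randomly chosen point toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        dist = np.linalg.norm(minority - x, axis=1)
        dist[i] = np.inf                      # exclude the point itself
        neighbours = np.argsort(dist)[:k]     # k nearest minority neighbours
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()                    # random point on the segment
        out.append(x + gap * (nb - x))
    return np.vstack(out)
```

Because each synthetic point lies on a segment between two existing minority samples, the new instances stay inside the minority region rather than being arbitrary noise.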
Further, in step 2, performing binary sparse coding on the liver disease data set with the random forest model comprises the following steps:
Step 2A: train on the liver disease data set to obtain a random forest comprising a plurality of decision trees; in each decision tree, a path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, every sample of the liver disease data set falls from the root node into exactly one leaf node;
Step 2C: define a binary feature vector that captures the leaf-node structure of the random forest. For a sample x_i, the corresponding binary vector encoding the leaf nodes is defined as:
X_i = [X_i1, ..., X_iq]^T, where q is the total number of leaf nodes and
X_ij = 1 if sample x_i reaches leaf node j, and X_ij = 0 otherwise.
X_i then lies in the leaf-node space, in which each sample is mapped to a vertex of the q-dimensional hypercube and each dimension corresponds to one decision rule. This mapping therefore records, for each sample, which rules are valid and which are invalid.
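Assuming a scikit-learn random forest, the binary leaf-node coding of step 2C can be sketched with `RandomForestClassifier.apply`, which returns the leaf index each sample reaches in each tree (columns belonging to internal nodes simply stay zero):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def leaf_binary_code(forest, X):
    """Binary vector over all tree nodes: entry 1 iff the sample lands in
    that leaf (step 2C). Internal-node columns are always zero."""
    leaves = forest.apply(X)                 # shape (n_samples, n_trees)
    blocks = []
    for t, est in enumerate(forest.estimators_):
        block = np.zeros((X.shape[0], est.tree_.node_count), dtype=np.uint8)
        block[np.arange(X.shape[0]), leaves[:, t]] = 1
        blocks.append(block)
    return np.hstack(blocks)                 # one hypercube vertex per sample
```

Each row sums to the number of trees, since a sample reaches exactly one leaf per tree — the "exactly one leaf node" property of step 2B.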
Further, in step 3, performing elastic-norm sparse-coding rule extraction on the liver disease rule set comprises the following steps:
Step 3A: construct new training samples from the mapping result of step 2C:
{(X_1, y_1), (X_2, y_2), ..., (X_p, y_p)};
where X_i is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label. The classification formula is defined as:
class(X) = argmax_{k = 1, ..., K} (W_k^T X + b_k)   (1)
where the weight vector W_k and the scalar b_k define the linear discriminant function of the k-th class.
Since each binary attribute represents a decision rule, the weights W_k in equation (1) measure the importance of the rules: the magnitude of a weight indicates the importance of the corresponding rule. Clearly, if a rule's weights are 0 for all classes in the above classifier, the rule can be safely removed. Rule extraction is therefore a problem of learning the weight vectors.
Step 3B: perform elastic-norm regularized learning, where the objective function is:
min over W, b, ξ of Σ_{k=1..K} [ P ||W_k||_1 + (1 − P) ||W_k||_2^2 ] + λ Σ_{i=1..p} ξ_i
subject to (W_{y_i} − W_k)^T X_i + b_{y_i} − b_k ≥ 1 − ξ_i for all k ≠ y_i,
ξ_i ≥ 0, i = 1, ..., p   (2)
The objective function consists of two terms. The first is the elastic-norm formula combining the L1 and L2 norms,
P ||W_k||_1 + (1 − P) ||W_k||_2^2,
which controls the number of non-zero weights and extracted rules; P is a probability factor that trades off the L1 and L2 norms. The second term, the sum of the slack variables ξ_i, is related to the empirical error, because a non-zero slack variable represents a misclassified sample; λ is the regularization parameter. The sparsity of the result and the empirical error depend on the regularization parameters. L1 and L2 norm sparse coding have been widely applied in statistics and machine learning: the L1 norm can remove unimportant features, while the L2 norm can prevent overfitting the data. After step 3B, the invention selects the P value that gives the highest model cross-validation accuracy and substitutes it into formula (2).
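A practical stand-in for the elastic-norm objective (2) — not the patent's exact solver — is scikit-learn's `SGDClassifier` with a hinge loss and `penalty='elasticnet'`, where `l1_ratio` plays the role of the probability factor P and `alpha` that of λ, applied to the binary rule codes:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy binary rule codes (rows = samples, columns = rules) and class labels.
rng = np.random.default_rng(0)
X_rules = rng.integers(0, 2, size=(80, 12)).astype(float)
y = (X_rules[:, 0] + X_rules[:, 1] > 1).astype(int)  # only rules 0 and 1 matter

# Hinge loss + elastic-net penalty: l1_ratio ~ P, alpha ~ lambda in (2).
clf = SGDClassifier(loss="hinge", penalty="elasticnet",
                    l1_ratio=0.5, alpha=1e-3,
                    max_iter=5000, random_state=0).fit(X_rules, y)

weights = np.abs(clf.coef_).max(axis=0)   # rule importance as in equation (1)
```

Rules whose weight magnitude stays at (or near) zero across all classes are candidates for removal, as the description around equation (1) states.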
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, compute its out-of-bag error on the corresponding out-of-bag (OOB) data, denoted errOOB1;
Step 3D: randomly add noise interference to the feature across all samples of the OOB data, and compute the out-of-bag error again, denoted errOOB2;
Step 3E: with Ntree decision trees in the random forest, the importance of the feature is defined as
Σ (errOOB2 − errOOB1) / Ntree   (3)
The importance of every feature is calculated in this way.
Equation (3) is used as the importance measure of a feature because, if the out-of-bag accuracy drops sharply after noise is randomly added to the feature, the feature has a large influence on the classification result of the samples, i.e., the feature is highly important.
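Steps 3C–3E amount to permutation importance. A single-model sketch follows; the patent averages over each tree's own out-of-bag data, which is simplified here to one held-out evaluation for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def noise_importance(model, X_heldout, y_heldout, seed=0):
    """errOOB2 - errOOB1 per feature: the error increase after shuffling
    ('adding noise to') one feature on held-out data, cf. equation (3)."""
    rng = np.random.default_rng(seed)
    err1 = 1.0 - model.score(X_heldout, y_heldout)          # errOOB1
    imp = np.empty(X_heldout.shape[1])
    for j in range(X_heldout.shape[1]):
        Xp = X_heldout.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])                # noise in feature j
        imp[j] = (1.0 - model.score(Xp, y_heldout)) - err1  # errOOB2 - errOOB1
    return imp
```

On data where only one feature carries signal, shuffling that feature should raise the error far more than shuffling a noise feature, matching the rationale given for equation (3).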
Further, in step 4, performing feature extraction and deletion on the encoded liver disease rule set comprises the following steps:
The distribution of features in the random forest is determined by the forest's learning process and usually differs from the feature distribution obtained by rule extraction with the preceding formula; this difference can be exploited to select features. If a feature does not appear in any rule extracted by equation (2), it is deleted, because it has no effect on the classifier defined by equation (1). Under this idea, both rules and features can be selected.
The regularization parameter λ can be selected by cross-validation on the training set. By reconstructing the random forest with the selected features, rules can be selected further to obtain a more compact rule set. In each iteration, the features selected in the previous iteration are used to construct a new random forest, from which new rules are generated; the iteration continues until the selected features no longer change.
Specifically:
Step 4A: if a feature does not appear in any rule extracted by formula (2), delete it;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct and train the random forest;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change.
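The step 2A → 4B loop can be sketched generically; `fit_forest` and `extract` below are placeholder callbacks standing in for the random-forest training and the equation-(2) rule/feature selection, which are assumptions for illustration:

```python
def iterate_selection(X_cols, fit_forest, extract, max_iter=20):
    """Repeat: rebuild the forest on the surviving features, extract rules,
    keep only features appearing in the rules (step 4A); stop when the
    feature set is stable (step 4C). X_cols: initial feature indices."""
    features = list(X_cols)
    rules = []
    for _ in range(max_iter):
        forest = fit_forest(features)            # step 2A on current features
        rules, used = extract(forest, features)  # rules + positions of used features
        survivors = [features[j] for j in sorted(used)]
        if survivors == features:                # step 4C: no change -> stop
            break
        features = survivors
    return features, rules
```

The loop terminates because the feature set can only shrink or stay the same, which is exactly the stopping condition of step 4C.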
Further, in step 5, verifying against the original data and generating the final rule set comprises the following steps:
Step 5A: given a class-labeled liver disease data set D, let n_covers be the number of data items covered by the rule set R and n_correct the number of data items accurately classified by R. The coverage and accuracy of the rule set R are defined as:
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
The higher a rule's coverage and accuracy, the greater its credibility for auxiliary diagnosis; the final rule set is generated from the rules with relatively high coverage and accuracy.
The elastic-norm rule extraction and feature selection method of the invention and its preferred scheme, which combines the L1 and L2 norms, not only selects relatively few features but also improves generalization ability and classification accuracy.
The secondary rule extraction and verification method of the invention (i.e., verifying against the original data to generate the final rule set) greatly improves the reliability of the generated rules.
The invention can extract and verify multiple types of rules, overcoming the limitation of earlier work in which rule extraction was restricted to classification alone.
An unbalanced training data set causes many problems in pattern recognition. For example, with an unbalanced data set, the classifier tends to "learn" the majority class and classify it with the highest accuracy; in practical applications this bias is not acceptable. To obtain a uniform distribution of the sample data, the invention solves this problem with the synthetic minority oversampling technique, whose algorithm creates "synthetic" instances for each minority class with few samples.
The advantages of the invention in the specific examples over the prior art are as follows:
1. because the existing liver disease data rule extraction algorithm is mainly based on SVM or decision tree, the feature search algorithm is probably the most important part in the feature selection method. And aiming at feature selection, a plurality of search strategies such as branching and constraint, a divide-and-conquer method, a greedy method, an evolutionary algorithm, an annealing algorithm and the like are provided. Among them, greedy search strategies, forward selection (delta search) or backward elimination, for example, are one of the most popular techniques, but their computational efficiency, robustness, is prone to over-or under-fitting.
The method adopts a basic model of random forests for liver disease data, solves the defect that the SVM cannot explain the rule with high precision, and adopts the elastic norm convergence combining L1 and L2 innovatively, so that the method can solve the problem of overfitting caused by excessive deletion rules or characteristics of the L1 norm; the problem that the L2 norm has too many rules or characteristics to cause under-fitting is solved.
2. Because a rule set generated by the result of the liver disease rule extraction algorithm does not have an effective verification algorithm, namely the generated rule is the final rule, the reliability of the strategy is poor.
The invention adopts a rule verification algorithm as a secondary verification step for generating a rule set. Two problems can be solved: 1. when the number of the rules is small, the credibility condition of each rule in the original sample can be verified; 2. when the ratio of the rules is more, the method can be used as a means for simplifying the rules and an algorithm for verifying the reliability of the rules again.
3. In the case of data noise or missing in medical data, especially in the original data of liver diseases, data abnormality can bias the accuracy of the model and the generated rule to the part of the data which is normal.
Firstly, missing values in a few-number-of-synthesis oversampling technology are processed, the missing values are filled by using median, and data continuity is guaranteed; second, resampling is used to keep the amount of samples of different classes consistent, and cross-validation is used to ensure sufficient training samples.
4. In the existing algorithm for rule extraction of liver data, the same type is mostly adopted for rule extraction, and obviously, the problems of increased calculation time, unrealistic practical application and the like are caused.
The random forest model adopted by the invention firstly classifies and stores data and confirms that the number of different types of samples is kept consistent again before running to the sample rule extraction, and then calculates the samples finished by liver disease classification at the same time to carry out rule extraction and feature selection. Such processing improves overall computational efficiency.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow diagram of an embodiment of a method;
FIG. 2 is a schematic diagram of binary encoding for a random forest in an embodiment of the present invention;
FIG. 3 is a schematic diagram of rule culling in an embodiment of the invention;
FIG. 4 is a schematic representation of the manner in which the L1 and L2 norms are combined in an embodiment of the invention;
FIG. 5 is a flow chart of a main algorithm of the method according to the embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
as shown in fig. 1, an embodiment of the present invention includes the steps of:
Step 1: preprocess the unbalanced or irregular liver disease data and obtain the liver disease data set with SMOTE (the synthetic minority oversampling technique);
Step 2: perform binary sparse coding on the liver disease data set with the random forest model to obtain the liver disease rule set;
Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain the encoded liver disease rule set;
Step 4: perform feature extraction and deletion on the encoded liver disease rule set;
Step 5: verify against the original data and generate the final rule set.
Because the raw data of the liver data set is unbalanced, many problems arise in pattern recognition. For example, with an unbalanced data set, the classifier tends to "learn" the majority class and classify it with the highest accuracy; in practical applications this bias is not acceptable. Through SMOTE processing (mainly comprising data balancing and missing-data handling), the invention creates "synthetic" instances for each minority class with few samples.
In step 2, performing binary sparse coding on the liver disease data set with the random forest model comprises the following steps:
Step 2A: train on the liver disease data set to obtain a random forest comprising a plurality of decision trees; in each decision tree, a path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, every sample of the liver disease data set falls from the root node into exactly one leaf node;
Step 2C: define a binary feature vector that captures the leaf-node structure of the random forest. For a sample x_i, the corresponding binary vector encoding the leaf nodes is defined as:
X_i = [X_i1, ..., X_iq]^T, where q is the total number of leaf nodes.
As shown in fig. 2, in an example to which the present invention is applied, the mapping relationship of the nodes is:
X_ij = 1 if sample x_i reaches leaf node j, and X_ij = 0 otherwise.
X_i then lies in the leaf-node space, in which each sample is mapped to a vertex of the q-dimensional hypercube and each dimension corresponds to one decision rule. This mapping therefore records, for each sample, which rules are valid and which are invalid.
In step 3, performing elastic-norm sparse-coding rule extraction on the liver disease rule set comprises the following steps:
Step 3A: construct new training samples from the mapping result of step 2C:
{(X_1, y_1), (X_2, y_2), ..., (X_p, y_p)};
where X_i is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label. The classification formula is defined as:
class(X) = argmax_{k = 1, ..., K} (W_k^T X + b_k)   (1)
where the weight vector W_k and the scalar b_k define the linear discriminant function of the k-th class.
Since each binary attribute represents a decision rule, the weights W_k in equation (1) measure the importance of the rules: the magnitude of a weight indicates the importance of the corresponding rule. Clearly, if a rule's weights are 0 for all classes in the above classifier, the rule can be safely removed. Rule extraction is therefore a problem of learning the weight vectors.
Step 3B: perform elastic-norm regularized learning, where the objective function is:
min over W, b, ξ of Σ_{k=1..K} [ P ||W_k||_1 + (1 − P) ||W_k||_2^2 ] + λ Σ_{i=1..p} ξ_i
subject to (W_{y_i} − W_k)^T X_i + b_{y_i} − b_k ≥ 1 − ξ_i for all k ≠ y_i,
ξ_i ≥ 0, i = 1, ..., p   (2)
As shown in fig. 4, the objective function consists of two terms. The first is the elastic-norm formula combining the L1 and L2 norms,
P ||W_k||_1 + (1 − P) ||W_k||_2^2,
which controls the number of non-zero weights and extracted rules; P is a probability factor that trades off the L1 and L2 norms. The second term, the sum of the slack variables ξ_i, is related to the empirical error, because a non-zero slack variable represents a misclassified sample; λ is the regularization parameter. The sparsity of the result and the empirical error depend on the regularization parameters. L1 and L2 norm sparse coding have been widely applied in statistics and machine learning: the L1 norm can remove unimportant features, while the L2 norm can prevent overfitting the data. After step 3B, the invention selects the P value that gives the highest model cross-validation accuracy and substitutes it into formula (2).
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, compute its out-of-bag error on the corresponding out-of-bag (OOB) data, denoted errOOB1;
Step 3D: randomly add noise interference to the feature across all samples of the OOB data, and compute the out-of-bag error again, denoted errOOB2;
Step 3E: with Ntree decision trees in the random forest, the importance of the feature is defined as
Σ (errOOB2 − errOOB1) / Ntree   (3)
The importance of every feature is calculated in this way.
Equation (3) is used as the importance measure of a feature because, if the out-of-bag accuracy drops sharply after noise is randomly added to the feature, the feature has a large influence on the classification result of the samples, i.e., the feature is highly important.
In step 4, performing feature extraction and deletion on the encoded liver disease rule set comprises the following steps:
The distribution of features in the random forest is determined by the forest's learning process and usually differs from the feature distribution obtained by rule extraction with the preceding formula; this difference can be exploited to select features. If a feature does not appear in any rule extracted by equation (2), it is deleted, because it has no effect on the classifier defined by equation (1). Under this idea, both rules and features can be selected.
The regularization parameter λ can be selected by cross-validation on the training set. By reconstructing the random forest with the selected features, rules can be selected further to obtain a more compact rule set. In each iteration, the features selected in the previous iteration are used to construct a new random forest, from which new rules are generated; the iteration continues until the selected features no longer change.
As shown in fig. 3, which presents the final result of processing the random forest of fig. 2 according to the present invention, the specific steps are:
Step 4A: if a feature does not appear in any rule extracted by formula (2), delete it;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct and train the random forest;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change.
In step 5, original-data verification and generation of the final rule set comprise the following steps:
Step 5A: given a class-labeled liver disease data set D, let n_covers be the number of data instances covered by the rule set R, and n_correct the number of data instances correctly classified by R; the coverage and accuracy of the rule set R are defined as:
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
The higher a rule's coverage and accuracy, the greater its credibility for auxiliary diagnosis; the final rule set is generated from the rules with relatively high coverage and accuracy.
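These two measures can be sketched in a few lines. The (feature, operator, threshold) rule encoding and the majority vote among matching rules below are illustrative assumptions, not the patent's exact rule representation:

```python
import numpy as np

def covers(rule, x):
    """A rule is a list of (feature_index, op, threshold) conditions
    (hypothetical encoding); all conditions must hold for the rule to fire."""
    return all((x[f] <= t) if op == "<=" else (x[f] > t) for f, op, t in rule)

def coverage_accuracy(rule_set, X, y):
    """coverage(R) = n_covers / |D|, accuracy(R) = n_correct / n_covers."""
    n_covers = n_correct = 0
    for x, label in zip(X, y):
        matched = [cls for rule, cls in rule_set if covers(rule, x)]
        if matched:
            n_covers += 1
            # Majority vote among the matching rules decides the prediction.
            if max(set(matched), key=matched.count) == label:
                n_correct += 1
    return n_covers / len(X), (n_correct / n_covers if n_covers else 0.0)

# Toy rule set: "feature 0 <= 0.5 -> class 0" and "feature 0 > 0.5 -> class 1".
R = [([(0, "<=", 0.5)], 0), ([(0, ">", 0.5)], 1)]
X = np.array([[0.2], [0.4], [0.9]])
y = np.array([0, 1, 1])
cov, acc = coverage_accuracy(R, X, y)   # cov = 1.0, acc = 2/3
```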
As shown in FIG. 5, the pseudo-code of an embodiment of the present invention runs as follows:
Preconditions:
1: initialize the feature variable F.
2: randomly select training samples X from the total sample set D.
Output:
1: selected features F_f
2: selected rules R_f
Main steps of the algorithm:
1: features F_i, i = 1;
2: while the selected feature set still changes, do:
3: run the random forest model on the training samples X restricted to the features F_i;
4: the random forest generates a rule set R_r;
5: encode the training samples X with the rule set R_r;
6: solve the linear formula (2) to obtain the cross-validation accuracy C_i and the weight values W;
7: when a weight exceeds a threshold (a sufficiently small preset value), record the indices of all such parameters;
8: pass the rules of R_r at the recorded indices to R_i;
9: pass the features appearing in R_i to F_{i+1}, and set i = i + 1;
10: end loop;
11: select the i with the largest cross-validation accuracy C_i as i*;
12: pass F_{i*} to F_f;
13: pass R_{i*} to R_f;
14: return F_f and R_f.
The present invention is not limited to the preferred embodiment described above; any other random-forest-based liver disease data classification rule extraction method obtained in any form according to the teaching of the present invention falls within its scope, and by the same design, rules can be effectively extracted from other kinds of sample data with pronounced irregularity and imbalance.

Claims (1)

1. A random-forest-based liver disease data classification rule extraction method, characterized by comprising the following steps:
Step 1: preprocess the unbalanced or irregular liver disease data using the synthetic minority oversampling technique (SMOTE) to obtain a liver disease data set;
Step 2: perform binary sparse coding on the liver disease data set with a random forest model to obtain a liver disease rule set;
Step 3: perform elastic-norm sparse-coding rule extraction on the liver disease rule set to obtain a coded liver disease rule set;
Step 4: perform feature extraction and deletion on the coded liver disease rule set;
Step 5: perform original-data verification to generate the final rule set;
in step 2, binary sparse coding of the liver disease data set with the random forest model comprises the following steps:
Step 2A: train on the liver disease data set to obtain a random forest comprising a plurality of decision trees; in each decision tree, a path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, every sample of the liver disease data set follows exactly one path from the root node to a single leaf node;
Step 2C: define a binary feature vector capturing the leaf-node structure of the random forest: for a sample, the corresponding binary vector encoding the leaf nodes is defined as
X_i = [x_i1, ..., x_iq]^T, where q is the total number of leaf nodes, and
x_ij = 1 if sample i falls into leaf node j, and x_ij = 0 otherwise;
the space of X_i is then the leaf-node space, in which each sample is mapped to a vertex of a hypercube, and each dimension of this rule space corresponds to one decision rule;
in step 3, elastic-norm sparse-coding rule extraction from the liver disease rule set comprises the following steps:
Step 3A: construct new training samples from the mapping result of step 2C:
{(X_1, y_1), (X_2, y_2), ..., (X_p, y_p)};
where X_i is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label; the formula defining the class is
f(X_i) = argmax_{k ∈ {1,...,K}} (W_k^T X_i + b_k) (1)
where the weight vector W_k and the scalar b_k define the linear discriminant function of the kth class;
Step 3B: perform elastic-norm regularized learning with the objective function
min_{W,b,ξ} Σ_{k=1}^{K} [ P||W_k||_1 + (1 − P)||W_k||_2^2 ] + λ Σ_{i,k} ξ_ik
s.t. (W_yi − W_k)^T X_i + b_yi − b_k + ξ_ik ≥ 1, k ≠ y_i
ξ_ik ≥ 0, i = 1, ..., p (2)
The objective function consists of two terms. The first term is the elastic-norm formula combining the L1 and L2 norms,
P||W_k||_1 + (1 − P)||W_k||_2^2,
which controls the number of non-zero weights and hence the number of extracted rules; P is the probability factor for selecting the L1 or L2 norm, with 0 ≤ P ≤ 1. The second term is the sum of the slack (relaxation) variables ξ_ik, weighted by the regularization parameter λ;
in step 4, feature extraction and deletion on the coded liver disease rule set comprise the following steps:
Step 4A: if a feature does not appear in any rule extracted by formula (2), delete that feature;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct and retrain the random forest;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change;
in step 5, original-data verification and generation of the final rule set comprise the following steps:
Step 5A: given a class-labeled liver disease data set D, let n_covers be the number of data instances covered by the rule set R, and n_correct the number of data instances correctly classified by R; the coverage and accuracy of the rule set R are defined as
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
The higher a rule's coverage and accuracy, the greater its credibility for auxiliary diagnosis; the final rule set is generated from the rules with relatively high coverage and accuracy;
after step 3B, the P value with the highest model cross-validation accuracy is selected and substituted into formula (2); the importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, calculate its out-of-bag error on the corresponding out-of-bag data OOB, denoted errOOB1;
Step 3D: randomly add noise interference to the values of a feature across all samples of the out-of-bag data OOB, and recompute the out-of-bag error, denoted errOOB2;
Step 3E: let the random forest contain Ntree decision trees; the importance of the feature is defined as
∑(errOOB2 − errOOB1)/Ntree (3)
and the importance of every feature is computed in this way.
CN201811292849.5A 2018-02-05 2018-11-01 Liver disease data classification rule extraction method based on random forest Active CN109409434B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810115962X 2018-02-05
CN201810115962 2018-02-05

Publications (2)

Publication Number Publication Date
CN109409434A CN109409434A (en) 2019-03-01
CN109409434B true CN109409434B (en) 2021-05-18

Family

ID=65470743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811292849.5A Active CN109409434B (en) 2018-02-05 2018-11-01 Liver disease data classification rule extraction method based on random forest

Country Status (1)

Country Link
CN (1) CN109409434B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156537A (en) * 2014-08-19 2014-11-19 中山大学 Cellular automaton urban growth simulating method based on random forest
CN105844300A (en) * 2016-03-24 2016-08-10 河南师范大学 Optimized classification method and optimized classification device based on random forest algorithm



Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Combined Rule Extraction and Feature Elimination in Supervised Classification; Sheng Liu et al.; IEEE Transactions on NanoBioscience; 2012-09-30; vol. 11, no. 3; Sections II-III *
Regularization and variable selection via the elastic net; Hui Zou et al.; Journal of the Royal Statistical Society; 2005-11-10; vol. 67, no. 2; Section 2.1 *
Two-stage variable selection based on the random forest algorithm; Feng Panfeng et al.; Journal of Systems Science and Mathematical Sciences; 2018-01-31; vol. 38, no. 1; pp. 119-130 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant