CN102750286B - A novel decision tree classification method for handling missing data - Google Patents

A novel decision tree classification method for handling missing data

Info

Publication number
CN102750286B
CN102750286B
Authority
CN
China
Prior art keywords
node
child node
attribute
split
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110100232.0A
Other languages
Chinese (zh)
Other versions
CN102750286A (en)
Inventor
吴军 (Wu Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHANGZHOU LENCITY INFORMATION TECHNOLOGY Co Ltd
Original Assignee
CHANGZHOU LENCITY INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHANGZHOU LENCITY INFORMATION TECHNOLOGY Co Ltd filed Critical CHANGZHOU LENCITY INFORMATION TECHNOLOGY Co Ltd
Priority to CN201110100232.0A priority Critical patent/CN102750286B/en
Publication of CN102750286A publication Critical patent/CN102750286A/en
Application granted granted Critical
Publication of CN102750286B publication Critical patent/CN102750286B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a novel decision tree classification method for handling missing data, comprising the following steps: initializing the raw sample data set to be processed and assigning initial weight values to the data set; selecting a characteristic attribute by which a node of the data set is split into child nodes; classifying the sample data into the child nodes according to the characteristic attribute of the node; splitting off each child node according to the characteristic quantity computed for it; and determining leaf nodes according to the sample attributes of the split-off child nodes. The invention handles missing data effectively; it generates comprehensible rules; its computational cost is comparatively modest; it can process both continuous and categorical fields; it shows clearly which fields are important; and, by selecting attributes with the information gain ratio, it overcomes the tendency of information-gain-based selection to favour attributes with many values.

Description

A novel decision tree classification method for handling missing data
Technical field
The invention belongs to the fields of data mining and machine learning, and relates to a novel decision tree classification method capable of handling missing data.
Background technology
With the rapid development of information technology, people collect, store, and access ever larger quantities of data, and a wealth of useful knowledge lies hidden in these large volumes of historical data. How to discover and analyse the relations and rules existing among these data is currently a very important problem. Data mining (DM) technology arose in this context; it merges theory and techniques from databases, artificial intelligence, machine learning, statistics, and several other fields. Data mining tools can predict future trends and thus give good support to human decision-making. Commonly used methods include neural networks, genetic algorithms, decision trees, rule-based reasoning, and Bayesian classification. Among these, decision tree methods are relatively easy for people to understand and produce output of high precision, so they are widely applied in data mining. Traditional decision tree methods also have shortcomings: it is difficult for them to find rules based on combinations of multiple variables, the divisions between different branches can be abrupt, and the computational complexity of the traditional algorithms is comparatively high. The decision tree is nevertheless one of the most widely used induction algorithms at present; it is a method of approximating a discrete-valued function and can also be regarded as a Boolean function. It is an instance-based inductive learning algorithm, commonly used to build classifiers and forecast models, and it aims to infer, from a group of unordered, random examples, classification rules expressed in decision tree form. It proceeds top-down and recursively: the internal nodes of the tree compare and test attribute values, branches descend from each node according to the different attribute values, and conclusions are reached at the leaf nodes. A path from the root to a leaf node therefore corresponds to one conjunctive rule, and the whole decision tree corresponds to a set of disjunctive rules.
Many decision tree algorithms have been implemented to date, for example CLS, proposed by Hunt et al.; ID3, proposed by Quinlan in 1986, and C4.5, proposed in 1993; as well as CART, C5.0, Fuzzy C4.5, OC1, QUEST, and CAL5. The shortcomings of traditional decision tree algorithms include: (1) The presence of missing data is a major cause of degraded classifier performance, and most current classifiers cannot handle classification problems with missing data effectively. (2) The algorithms tend to favour attributes with more values, yet in many cases the attribute with more values is not the optimal attribute. (3) During tree construction each node contains only one feature; this is a univariate approach, so the correlations between features are not captured tightly. Although the features are connected in one tree, the connections between them remain loose. (4) The algorithms are sensitive to noise, which is not easy to remove, i.e., to feature values taken wrongly or classes assigned wrongly. (5) When the training set grows, the ID3 decision tree changes with it: during construction, the mutual information of each feature changes as examples are added, so the tree changes as well, which is unsuitable for learning from changing data sets. (6) Although the algorithms are theoretically clear, their computation is rather complex; learning and training on a data set occupies a large amount of machine memory and consumes considerable resources, affecting the time and cost of learning from the data.
Summary of the invention
To overcome the above defects, the technical problem to be solved by the present invention is to propose a decision tree classification method for handling missing data that estimates the possible attribute values of missing data while recursively constructing the branches of the decision tree, completing the construction of the tree with refined classification rules.
The technical solution adopted by the present invention is a novel decision tree classification method for handling missing data, comprising the following steps:
a. initializing the raw sample data set to be processed and assigning initial weight values to the data set;
b. selecting a characteristic attribute by which a node of the data set is split into child nodes;
c. classifying the sample data into the child nodes according to the characteristic attribute of the node;
d. splitting off each child node according to the characteristic quantity computed for it;
e. determining leaf nodes according to the sample attributes of the split-off child nodes.
According to another preferred embodiment, the data set comprises both missing data and non-missing data.
According to another preferred embodiment, the node characteristic quantities comprise the information entropy and the information gain ratio of each characteristic quantity.
According to another preferred embodiment, when splitting off a child node, the information gain ratio of each characteristic quantity is calculated; if the characteristic quantity selected for the child node has the maximum information gain ratio, the node continues to be split into child nodes; if it does not, the characteristic attribute used for splitting is reselected until the selected characteristic quantity has the maximum information gain ratio.
According to another preferred embodiment, when determining leaf nodes according to the sample attributes of the child nodes, if a child node contains samples of only one class, that node is set as a leaf node and its splitting terminates; if a child node contains samples of more than one class, the characteristic attribute used for splitting is reselected.
The beneficial effects of the invention are: 1. missing data can be handled effectively; 2. comprehensible rules can be generated; 3. the amount of computation is comparatively modest; 4. both continuous and categorical fields can be processed; 5. it is shown clearly which fields are important; 6. attributes are selected with the information gain ratio, overcoming the tendency of information-gain-based selection to favour attributes with many values; 7. in contrast to traditional classifier algorithms, which simply discard samples with missing data, this algorithm lets the classifier classify missing data according to the probabilities with which the data may occur.
Accompanying drawing explanation
Fig. 1 is the flow chart of the preferred embodiment of the present invention.
In the figure: 1, initialize the raw data; 2, select a characteristic attribute for each node; 3, classify the samples into the child nodes according to the characteristic attribute of the node; 4, calculate the information entropy of the characteristic quantity selected for each child node; 5, split off each child node according to the information gain ratio of its characteristic quantity; 6, determine leaf nodes according to the sample attributes of the child nodes; 7, end.
Embodiment
The present invention is now explained in further detail through a preferred embodiment with reference to the accompanying drawing. The drawing is a simplified schematic that illustrates only the basic structure of the invention, so it shows only the parts relevant to the invention.
As shown in Fig. 1, a novel decision tree classification method for handling missing data comprises the following steps:
a. initializing the raw sample data set to be processed (1) and assigning initial weight values to the data set, wherein the data set comprises both missing data and non-missing data;
b. selecting a characteristic attribute (2) by which a node of the data set is split into child nodes;
c. classifying the sample data into the child nodes according to the characteristic attribute of the node (3);
d. splitting off each child node according to the characteristic quantity computed for it, which comprises calculating the information entropy (4) and the information gain ratio of the characteristic quantity selected for each child node;
e. determining leaf nodes according to the sample attributes of the split-off child nodes (6).
When splitting off a child node according to the information gain ratio of its characteristic quantity (5), if the characteristic quantity selected for the child node has the maximum information gain ratio, the node continues to be split into child nodes; if it does not, the characteristic attribute used for splitting is reselected until the selected characteristic quantity has the maximum information gain ratio.
When determining leaf nodes according to the sample attributes of the child nodes (6), if a child node contains samples of only one class, that node is set as a leaf node and its splitting terminates; if a child node contains samples of more than one class, the characteristic attribute used for splitting is reselected.
The basic idea of the invention is to assign a weight to each missing-data sample and each non-missing-data sample, to apply the information entropy principle during classification, to select the attribute with the maximum information gain ratio as the classification attribute, to give each class node a probability, and to construct the branches of the decision tree recursively, completing the construction of the tree so that the classifier can classify missing data according to the probabilities with which the data may occur.
As shown in Fig. 1, the first step of this embodiment is to initialize the raw data (1). Each sample is represented in the data format of Table 1, and each sample has a weight with initial value 1. The weight expresses the importance of the sample: if a sample has weight 10, its importance in the classification process is equivalent to that of ten samples of weight 1.
| Sample # | Weather | Temperature | Windy | Class |
|---|---|---|---|---|
| 1 | Fine | Cold | No | Do not play ball |
| 2 | — | Cold | Yes | Do not play ball |
| 3 | Rain | Hot | — | Play ball |
| 4 | Rain | Mild | — | Play ball |
| 5 | Fine | — | No | Play ball |
| 6 | — | Cold | Yes | Play ball |
| 7 | Cloudy | Cold | No | Do not play ball |

("—" marks a missing value.)
Table 1
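The weighted sample representation of Table 1 can be sketched in Python as follows (an illustrative sketch only: the field names, the dictionary layout, and the use of None for missing entries are assumptions, not part of the patent text; only the per-sample weight with initial value 1 is prescribed above):

```python
# Each training sample carries a weight, initialized to 1; a missing entry
# (a blank cell of Table 1) is represented as None. Field names are assumed.
samples = [
    {"weather": "fine",   "temperature": "cold", "windy": "no",  "label": "no play", "weight": 1.0},
    {"weather": None,     "temperature": "cold", "windy": "yes", "label": "no play", "weight": 1.0},
    {"weather": "rain",   "temperature": "hot",  "windy": None,  "label": "play",    "weight": 1.0},
    {"weather": "rain",   "temperature": "mild", "windy": None,  "label": "play",    "weight": 1.0},
    {"weather": "fine",   "temperature": None,   "windy": "no",  "label": "play",    "weight": 1.0},
    {"weather": None,     "temperature": "cold", "windy": "yes", "label": "play",    "weight": 1.0},
    {"weather": "cloudy", "temperature": "cold", "windy": "no",  "label": "no play", "weight": 1.0},
]
```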
The second step is to select a characteristic attribute for each node (2) as the attribute used when that node is split; with reference to Table 2, for example weather, temperature, or windiness may be set as candidate attributes for splitting the node.
Table 2
The third step is to classify the samples into the child nodes according to the characteristic attribute of the node (3). For example, the three characteristic attributes of the first sample are respectively "Fine", "Cold", and "No", so when weather is selected as the characteristic attribute, the first sample is placed in the weather subset "Fine".
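This partitioning step can be sketched as follows. The patent assigns weights and probabilities to missing-data samples without spelling out the distribution rule, so the sketch assumes the common C4.5-style treatment in which a sample with a missing value is sent to every child with its weight scaled by the observed frequency of each value:

```python
from collections import defaultdict

def partition(samples, attr):
    """Split samples on attr. A sample whose attr value is missing (None) is
    sent to every child with its weight scaled by that value's weighted
    frequency among the samples where attr is observed."""
    children = defaultdict(list)
    known = [s for s in samples if s[attr] is not None]
    for s in known:
        children[s[attr]].append(s)
    total = sum(s["weight"] for s in known)
    if total == 0:
        return dict(children)
    # Weighted fraction of each observed value, used to split missing samples.
    frac = {v: sum(s["weight"] for s in subset) / total
            for v, subset in children.items()}
    for s in samples:
        if s[attr] is None:
            for v in frac:
                children[v].append(dict(s, weight=s["weight"] * frac[v]))
    return dict(children)
```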
The fourth step uses the following information entropy formulas to calculate the information entropy of the characteristic quantity selected for each child node (4).
$$\mathrm{info}(U) = -\sum_{i=1}^{k}\left(\sum_{m=1}^{n(U)}\mathrm{Weight}(m,U)\cdot C_i(m)\,/\,|U|\right)\cdot\log_2\!\big(\mathrm{freq}(C_i,U)/|U|\big)\qquad(1)$$
Formula (1) gives the information entropy of the sample set U.
$$\mathrm{info}(U_{Aj}) = -\sum_{i=1}^{k}\left(\sum_{m=1}^{n(U)}\mathrm{Weight}(m,U)\cdot C_i(m)\cdot O_{Aj}(m)\,/\,|U_{Aj}|\right)\cdot\log_2\!\big(\mathrm{freq}(C_i,U_{Aj})/|U_{Aj}|\big)\qquad(2)$$
Formula (2) gives the information entropy of the subset $U_{Aj}$ for which characteristic quantity $A$ takes the output value $j$.
$$|U_{Aj}| = \sum_{m=1}^{n(U_{Aj})}\mathrm{Weight}(m,U_{Aj})\cdot O_{Aj}(m)\qquad(3)$$
Formula (3) gives the number of samples belonging to $U_{Aj}$.
The meanings of the symbols in the formulas above are as follows:
  • $C_1(m), C_2(m), \ldots, C_{k-1}(m), C_k(m)$: the class membership weights, where $C_i(m)$ is the degree to which sample $m$ belongs to class $C_i$, satisfying the normalization condition $\sum_{i=1}^{k} C_i(m) = 1$;
  • $O_{A1}(m), O_{A2}(m), \ldots, O_{A(n-1)}(m), O_{An}(m)$: the output weights of sample $m$ over the values of attribute $A$;
  • $O_{Aj}'(m_{Ci}')$: the counterpart of $O_{Aj}(m)$ for missing data;
  • $m_{Ci}$: an input sample without missing data;
  • $m_{Ci}'$: an input sample with missing data;
  • $N_{Ci}$: the number of samples belonging to $C_i$;
  • $U_{Aj}$: the subset of $U$ for which characteristic quantity $A$ takes the output value $j$; for example, the characteristic quantity weather has the three values "Fine", "Cloudy", and "Rain", so $U$ is divided into these three subsets;
  • $\mathrm{Weight}(m,U)$: the weight of sample $m$ in $U$;
  • $\mathrm{freq}(C_i,U)$: the number of samples in $U$ classified as $C_i$;
  • $\mathrm{freq}(C_i,U_{Aj})$: the number of samples in $U_{Aj}$ classified as $C_i$;
  • $|U|$: the number of samples in $U$;
  • $|U_{Aj}|$: the number of samples belonging to $U_{Aj}$.
The fifth step is to compute the ratio of the information gain to the split information, i.e., to calculate the information gain ratio of the characteristic quantity and split off the child node (5). If the selected characteristic quantity has the maximum information gain ratio, the node continues to be split into child nodes. If it does not, the method returns to the second step to select a characteristic attribute for each node (2), and the information gain ratio is recomputed until the maximum is reached. The split information entropy of node A is first obtained with formula (4), the information gain is then obtained with formula (5), and finally formulas (4) and (5) are substituted into formula (6) to obtain the information gain ratio.
$$\mathrm{Split\text{-}info}(A) = -\sum_{j=1}^{n(U)}\big(|U_{Aj}|/|U|\big)\cdot\log_2\!\big(|U_{Aj}|/|U|\big)\qquad(4)$$
$$\mathrm{Gain}(A) = \mathrm{info}(U) - \mathrm{info}_A(U_{Aj})\qquad(5)$$
$$\mathrm{Gain\text{-}ratio}(A) = \mathrm{Gain}(A)/\mathrm{Split\text{-}info}(A)\qquad(6)$$
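Formulas (4) to (6) can then be sketched as one function, reusing the hypothetical partition and weighted_entropy helpers above and taking $\mathrm{info}_A$ as the weighted average entropy of the subsets $U_{Aj}$:

```python
def gain_ratio(samples, attr, classes):
    """Gain-ratio(A) per formulas (4)-(6), with info_A taken as the weighted
    average entropy of the subsets U_Aj produced by splitting on attr."""
    total = sum(s["weight"] for s in samples)
    split_info, info_a = 0.0, 0.0
    for subset in partition(samples, attr).values():
        w = sum(s["weight"] for s in subset) / total
        if w > 0:
            split_info -= w * math.log2(w)                 # formula (4)
            info_a += w * weighted_entropy(subset, classes)
    gain = weighted_entropy(samples, classes) - info_a     # formula (5)
    return gain / split_info if split_info > 0 else 0.0    # formula (6)
```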
The sixth step is to determine the leaf nodes according to the sample attributes of the child nodes (6): if a node contains samples of only one class, the node is set as a leaf node, the splitting of that node terminates, and the flow ends (7); if it contains more than one class of samples, the method returns to the second step and continues to select a characteristic attribute for each node (2).
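The recursion of the second through sixth steps, with the stopping rule just stated, can be sketched as follows (the majority-label rule for mixed leaves when no attributes remain is an assumption added to keep the sketch total):

```python
def build_tree(samples, attrs, classes):
    """Grow the tree recursively: a node holding one class only becomes a
    leaf; otherwise split on the attribute with the maximum gain ratio."""
    labels = {s["label"] for s in samples}
    if len(labels) == 1 or not attrs:
        majority = max(classes, key=lambda c: sum(
            s["weight"] for s in samples if s["label"] == c))
        return {"leaf": majority}
    best = max(attrs, key=lambda a: gain_ratio(samples, a, classes))
    rest = [a for a in attrs if a != best]
    return {"attr": best,
            "children": {v: build_tree(subset, rest, classes)
                         for v, subset in partition(samples, best).items()}}
```

For the data of Table 1 this would be invoked as build_tree(samples, ["weather", "temperature", "windy"], ["play", "no play"]).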
With the method of the present invention, the data set is divided according to the proportion of missing data into data with a zero missing rate and data with 5%, 10%, 20%, and 30% missing rates. The accuracy of the invention was compared with the classical C4.5 decision tree classification algorithm; the experimental results are shown in Table 3. Without missing data, the algorithm of the invention and the C4.5 algorithm perform identically, but as the proportion of missing data increases, the superiority of the invention becomes more and more significant. At a missing rate of 30%, the error rate of the traditional C4.5 algorithm reaches 12.09%, while the error rate of the algorithm of the invention is only 7.32%, which is within the acceptable range for real-world applications.
| Missing-data ratio | C4.5 | Present invention |
|---|---|---|
| 0 | 4.7% | 4.7% |
| 5% | 5.69% | 4.7% |
| 10% | 6.16% | 5.33% |
| 20% | 7.7% | 6.0% |
| 30% | 12.09% | 7.32% |

Table 3
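The missing-rate levels of Table 3 can be reproduced on any data set with a masking sketch such as the following (the independent per-cell erasure and the fixed seed are assumptions; the patent does not state how the missing values were introduced):

```python
import random

def inject_missing(samples, attrs, rate, seed=0):
    """Return a copy of the data set in which each observed attribute value
    is independently erased (set to None) with probability rate."""
    rng = random.Random(seed)
    out = []
    for s in samples:
        t = dict(s)
        for a in attrs:
            if t[a] is not None and rng.random() < rate:
                t[a] = None
        out.append(t)
    return out
```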
In contrast to traditional classifier algorithms, which discard samples with missing data, the present invention gives missing-data samples their own weights. During classification each class node is given a probability, so the classifier can classify missing data according to the probabilities with which the data may occur, handling missing data effectively. The information gain ratio is used to select attributes, overcoming the tendency of information-gain-based selection to favour attributes with many values. The output weight values of the samples show clearly which data fields are important, and comprehensible classification rules are generated.
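The classification behaviour described here — following every branch under a missing value and combining the outcomes by probability — can be sketched as follows (the value_probs table of prior value probabilities and the mixture rule are assumptions about the combination step):

```python
def classify(node, sample, value_probs):
    """Return a class distribution for sample. value_probs[attr][value] is
    the prior probability of each attribute value; when the sample's value
    is missing (or unseen), every branch is followed and the resulting
    class distributions are mixed with those probabilities."""
    if "leaf" in node:
        return {node["leaf"]: 1.0}
    value = sample.get(node["attr"])
    if value in node["children"]:
        return classify(node["children"][value], sample, value_probs)
    dist = {}
    for v, child in node["children"].items():
        p = value_probs[node["attr"]].get(v, 0.0)
        for c, q in classify(child, sample, value_probs).items():
            dist[c] = dist.get(c, 0.0) + p * q
    return dist
```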
Taking the above ideal embodiment of the present invention as enlightenment, those skilled in the art can, through the above description, make various changes and amendments without departing from the scope of the technical idea of the invention. The technical scope of the invention is not limited to the content of the specification and must be determined according to the claims.

Claims (2)

1. A novel decision tree classification method for handling missing data, characterized in that it comprises the following steps:
a. initializing the raw sample data set to be processed and assigning initial weight values to the data set;
b. selecting a characteristic attribute by which a node of the data set is split into child nodes;
c. classifying the sample data into the child nodes according to the characteristic attribute of the node;
d. calculating the characteristic quantity of each node using the characteristic attribute, and then splitting the node into child nodes;
e. determining leaf nodes according to the sample attributes of the split-off child nodes;
wherein the node characteristic quantities comprise the information entropy and the information gain ratio of each characteristic quantity;
when a node is split into child nodes according to the information gain ratio of the characteristic quantity, if the characteristic quantity selected for the child node has the maximum information gain ratio, the node continues to be split into child nodes; if it does not, the characteristic attribute used for splitting is reselected until the selected characteristic quantity has the maximum information gain ratio;
when leaf nodes are determined according to the sample attributes of the child nodes, if a child node contains samples of only one class, that node is set as a leaf node and its splitting terminates; if a child node contains samples of more than one class, the characteristic attribute used for splitting is reselected.
2. The novel decision tree classification method for handling missing data according to claim 1, characterized in that the data set comprises both missing data and non-missing data.
CN201110100232.0A 2011-04-21 2011-04-21 A novel decision tree classification method for handling missing data Active CN102750286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110100232.0A CN102750286B (en) 2011-04-21 2011-04-21 A novel decision tree classification method for handling missing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110100232.0A CN102750286B (en) 2011-04-21 2011-04-21 A novel decision tree classification method for handling missing data

Publications (2)

Publication Number Publication Date
CN102750286A CN102750286A (en) 2012-10-24
CN102750286B true CN102750286B (en) 2016-01-20

Family

ID=47030478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110100232.0A Active CN102750286B (en) 2011-04-21 2011-04-21 A novel decision tree classification method for handling missing data

Country Status (1)

Country Link
CN (1) CN102750286B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103248955B (en) * 2013-04-22 2017-07-28 深圳Tcl新技术有限公司 Personal identification method and device based on intelligent remote control system
CN104035779A (en) * 2014-06-25 2014-09-10 中国科学院软件研究所 Method for handling missing values during data stream decision tree classification
RU2720448C2 (en) * 2015-02-12 2020-04-29 Конинклейке Филипс Н.В. Reliable classifier
CN105843924A (en) * 2016-03-25 2016-08-10 南京邮电大学 CART-based decision-making tree construction method in cognitive computation
CN105956935B (en) * 2016-05-06 2017-08-25 泉州亿兴电力有限公司 Power distribution network transformer threshold value update method and system
CN108665293B (en) * 2017-03-29 2021-08-31 华为技术有限公司 Feature importance obtaining method and device
CN107729943B (en) * 2017-10-23 2021-11-30 辽宁大学 Missing data fuzzy clustering algorithm for optimizing estimated value of information feedback extreme learning machine and application thereof
CN108416368A (en) * 2018-02-08 2018-08-17 北京三快在线科技有限公司 The determination method and device of sample characteristics importance, electronic equipment
CN109344255B (en) * 2018-09-26 2023-05-26 平安科技(深圳)有限公司 Label filling method and terminal equipment
CN111401509B (en) * 2019-01-02 2023-03-28 ***通信有限公司研究院 Terminal type identification method and device
CN113139712B (en) * 2021-03-09 2024-02-09 杭州电子科技大学 Machine learning-based extraction method for incomplete rules of activity attributes of process logs

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006155344A (en) * 2004-11-30 2006-06-15 Toshiba Corp Data analyzer, data analysis program, and data analysis method
CN101251851B (en) * 2008-02-29 2010-08-25 吉林大学 Multi-classifier integration method based on incremental naive Bayes network
CN101819604A (en) * 2010-05-24 2010-09-01 天津大学 Probability rough set based decision tree generation method

Also Published As

Publication number Publication date
CN102750286A (en) 2012-10-24

Similar Documents

Publication Publication Date Title
CN102750286B (en) A novel decision tree classification method for handling missing data
CN105760888B (en) Neighborhood rough set ensemble learning method based on hierarchical clustering of attributes
CN109886349B (en) User classification method based on multi-model fusion
CN103106279A (en) Clustering method simultaneously based on node attribute and structural relationship similarity
CN103034691B (en) Expert system knowledge acquisition method based on support vector machines
CN106991447A (en) Embedded multi-class attribute label dynamic feature selection algorithm
CN104881706A (en) Electrical power system short-term load forecasting method based on big data technology
CN109948668A (en) Multi-model fusion method
CN103778227A (en) Method for screening useful images from retrieved images
CN105654196A (en) Adaptive load prediction selection method based on electric power big data
CN112417176B (en) Method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics
CN101196905A (en) Intelligent pattern searching method
CN106779219A (en) Electricity demand forecasting method and system
Khan et al. Evaluating the performance of several data mining methods for predicting irrigation water requirement
CN104239496A (en) Collaborative filtering method based on integration of fuzzy weight similarity measurement and clustering
CN104268629A (en) Complex network community detecting method based on prior information and network inherent information
Nhita A rainfall forecasting using fuzzy system based on genetic algorithm
CN110781940A (en) Fuzzy mathematics-based community discovery information processing method and system
CN105844334B (en) Temperature interpolation method based on radial basis function neural network
CN107871183A (en) Permafrost-area highway distress forecasting method based on uncertainty cloud theory
CN105930531A (en) Method for optimizing cloud dimensions of agricultural domain ontological knowledge on basis of hybrid models
CN103310027B (en) Rules extraction method for map template coupling
Zhu et al. Loan default prediction based on convolutional neural network and LightGBM
CN103020864B (en) Corn fine breed breeding method
CN102004801A (en) Information classification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant