CN102750286B

CN102750286B - A kind of Novel decision tree classifier method processing missing data

Info

Publication number: CN102750286B
Application number: CN201110100232.0A
Authority: CN
Inventors: 吴军
Original assignee: CHANGZHOU LENCITY INFORMATION TECHNOLOGY Co Ltd
Current assignee: CHANGZHOU LENCITY INFORMATION TECHNOLOGY Co Ltd
Priority date: 2011-04-21
Filing date: 2011-04-21
Publication date: 2016-01-20
Anticipated expiration: 2031-04-21
Also published as: CN102750286A

Abstract

The present invention relates to a kind of Novel decision tree classifier method processing missing data, comprise the following steps: pending raw sample data collection is carried out Initialize installation, and described data set is carried out weighted value initialization distribution; The node split of described data centralization is the characteristic attribute of child node by selection one; By sample data according to the Characteristic Attribute Classification of node to each child node; Each child node splits off by the characterizing magnitudes selected by calculating each child node; According to the sample attribute determination leaf node of each child node split off; The present invention effectively can process missing data; Understandable rule can be generated; Calculated amount is not very large comparatively speaking; Continuous and category field can be processed; Which field can be shown clearly important; Select attribute with information gain-ratio, overcome the deficiency of selecting to be partial to during attribute the many attributes of selection value by information gain.

Description

A kind of Novel decision tree classifier method processing missing data

Technical field

The invention belongs to data mining and machine learning field, relate to a kind of method that can process the Novel decision tree classifier of missing data.

Background technology

Along with the high speed development of infotech, people collect, store and the quantity of visit data gets more and more, and to be richly stored with behind effective knowledge in these a large amount of historical datas.How going to find and analyze relation Sum fanction existing between these data is at present a very important problem.Data mining (DM) technology has been arisen at the historic moment under this background, and it has merged database, artificial intelligence, machine learning, the theory and knowledge in multiple field such as statistics.Data Mining Tools can be predicted future trend, can well support the decision-making of people.Wherein conventional method has neural network, genetic algorithm, decision tree, rule-based reasoning, Bayes's classification etc.Wherein traditional decision-tree is easier to be understood by people, and the precision of output is high, therefore comparatively extensive in Data Mining application.But traditional decision-tree also has its shortcoming, such as its is difficult to find rule based on the combination of multiple variable, and the division between different decision tree branches is also unsmooth, and the computation complexity of traditional decision tree algorithm is higher etc.Traditional decision-tree is one of current most widely used induction algorithm, is a kind of method of approaching discrete-valued function, also it can be regarded as a Boolean function.It is the induced learning algorithm based on example, is commonly used to form sorter and forecast model, is conceived to infer from one group of out of order, random example the classifying rules that decision tree represents formation.It adopts top-down recursive fashion, the inside node of decision tree carry out property value relatively and judge, from the downward branch of this node, finally to obtain conclusion at the leaf node of decision tree according to different property values.Therefore the paths from root to leaf node just correspond to a conjunction rule, and whole decision tree just correspond to one group of expression formula rule of extracting.

Up to the present decision tree has a lot of implementation algorithm.Such as: the CLS proposed by people such as Hunt, the C4.5 algorithm proposed for the ID3 and 1993 proposed by Quinlan in 1986 year, and CART, C5.0, FuzzyC4.5, OC1, QUEST, and CAL5 etc.The shortcoming of traditional decision tree algorithm comprises: the existence of (1) missing data is the major reason causing classifier performance to decline, and current sorter effectively can not process the classification problem of missing data mostly.(2) algorithm is often partial to the more attribute of selection value, and the attribute that the attribute that attribute is more is under many circumstances always not optimum.(3) when contributing, each node is only containing a feature, and be a kind of algorithm of monotropic unit, the correlativity between feature is tight not.Although connect together on one tree, contact or loose.(4) more responsive to dry acoustic ratio, be not easy to remove dry sound.Namely eigenwert gets mistake or classification to wrong.(5) when training set increases, ID3 decision tree changes thereupon.In achievement process, the mutual information of each feature can change with the increase of example, and decision tree also changes thereupon, and this is unaccommodated to the study of the data set of change.(6) although algorithm is theoretical clear, its calculating more complicated, in the process of study and training dataset, machine memory usage is larger, compares consumes resources, affects time and cost that data learn.

Summary of the invention

In order to overcome above defect, the technical problem to be solved in the present invention is: propose a kind of possible attribute simultaneously solving estimation missing data, and recursively construct the branch of decision tree, complete the structure of decision tree, the method for the Novel decision tree classifier of the process missing data of meticulous classifying rules.

The technical solution adopted in the present invention is: a kind of Novel decision tree classifier method processing missing data, comprises the following steps:

A, pending raw sample data collection is carried out Initialize installation, and described data set is carried out weighted value initialization distribution;

The node split of described data centralization is the characteristic attribute of child node by b, selection one;

C, by sample data according to the Characteristic Attribute Classification of node to each child node;

Each child node splits off by d, the characterizing magnitudes selected by calculating each child node;

The sample attribute determination leaf node of each child node that e, basis split off.

According to another preferred embodiment, a kind of Novel decision tree classifier method processing missing data comprises described data set further and comprises missing data, non-missing data.

According to another preferred embodiment, a kind of Novel decision tree classifier method processing missing data comprises described node diagnostic value further and comprises the information entropy of characteristic quantity, the information gain-ratio of characteristic quantity.

According to another preferred embodiment, a kind of Novel decision tree classifier method processing missing data comprises the information gain-ratio calculating characteristic quantity when child node being split off further, characteristic quantity selected by child node is exactly maximum information ratio of profit increase, then continuing split vertexes is child node; If the characteristic quantity selected by child node is not maximum information ratio of profit increase, then reselect the characteristic attribute being split into child node, until the characteristic quantity selected by child node is maximum information ratio of profit increase.

According to another preferred embodiment, when a kind of Novel decision tree classifier method processing missing data is included in the sample attribute determination leaf node according to each child node further, if each child node is only containing same class sample, be then leaf node by this Node configuration, terminate the division of this node; If each child node not only containing same class sample, then reselects the characteristic attribute being split into child node.

The invention has the beneficial effects as follows: 1, effectively can process missing data; 2, understandable rule can be generated; 3, calculated amount is not very large comparatively speaking; 4, continuous and category field can be processed; 5, which field can be shown clearly important; 6, select attribute with information gain-ratio, overcome the deficiency of selecting to be partial to during attribute the many attributes of selection value by information gain; 7, relative to the way that traditional classifier algorithm is given up when running into missing data, this algorithm makes sorter also can may occurring that the probability of data is classified to it according to it in the missing data.

Accompanying drawing explanation

Fig. 1 is the process flow diagram of the preferred embodiments of the present invention;

In figure: 1, initialization is carried out to raw data, 2, be each sensor selection problem characteristic attribute, 3, by the Characteristic Attribute Classification of sample evidence node to each child node, 4, the information entropy of the characteristic quantity that each child node is selected is calculated, child node splits off by the information gain-ratio 5, calculating characteristic quantity, 6, according to the sample attribute determination leaf node of each child node, 7, terminate.

Embodiment

The present invention is further detailed explanation with preferred embodiment by reference to the accompanying drawings now.These accompanying drawings are the schematic diagram of simplification, only basic structure of the present invention are described in a schematic way, and therefore it only shows the formation relevant with the present invention.

As shown in Figure 1, a kind of Novel decision tree classifier method processing missing data, comprises the following steps:

A, pending raw sample data collection is carried out Initialize installation 1, and described data set is carried out weighted value initialization distribution, wherein, described data set comprises missing data, non-missing data.

B, the characteristic attribute 2 selecting to be child node by the node split of described data centralization;

C, by sample data according to the Characteristic Attribute Classification of node to each child node 3;

Each child node splits off by d, the characterizing magnitudes selected by calculating each child node, comprises the information entropy 4, the information gain-ratio that calculate the characteristic quantity that each child node is selected;

The sample attribute determination leaf node 6 of each child node that e, basis split off.

The information gain-ratio calculating characteristic quantity child node is split off 5 time, the characteristic quantity selected by child node is exactly maximum information ratio of profit increase, then continuing split vertexes is child node; If the characteristic quantity selected by child node is not maximum information ratio of profit increase, then reselect the characteristic attribute being split into child node, until the characteristic quantity selected by child node is maximum information ratio of profit increase.

When the sample attribute determination leaf node 6 according to each child node, if each child node is only containing same class sample, is then leaf node by this Node configuration, terminates the division of this node; If each child node not only containing same class sample, then reselects the characteristic attribute being split into child node.

Basic thought of the present invention is for missing data and non-missing data sample distribute a weight respectively, information entropy principle is utilized in assorting process, the attribute selecting information gain-ratio maximum is as categorical attribute, give each class node probability, recursively construct the branch of decision tree, complete the structure of decision tree, make sorter also can may occurring that the probability of data is classified to it according to it in the missing data.

As shown in Figure 1, this embodiment first step starts from carrying out initialization 1 to raw data.First the data layout with reference to table 1 represents sample, and each sample has weights.Weights initial value is 1.Weights represent the importance of each sample, if the weights of a sample are 10, then representing the importance of this sample in assorting process, to be equivalent to 10 weights be the sample of 1.

Sample #	Weather	Temperature	There is wind	Classification
					1	Fine	Cold	No	Do not play ball
2	？	Cold	Be	Do not play ball
					3	Rain	Heat	？	Play ball
4	Rain	Temperature	？	Play ball
					5	Fine	？	No	Play ball
6	？	Cold	Be	Play ball
					7	Cloudy	Cold	No	Do not play ball

Table 1

Second step be each sensor selection problem characteristic attribute 2 as attribute when this node split, can with reference to table 2, such as weather, temperature, or possible attribute when having wind to be set to node split.

Table 2

3rd step then by the Characteristic Attribute Classification of sample evidence node to each child node 3.Three characteristic attributes of such as first sample are respectively " fine " " cold " and "No", then, when selecting weather as characteristic attribute, first sample just divides in weather " fine " this subclass.

4th step, for utilizing following information entropy formula, calculates the information entropy 4 of the characteristic quantity that each child node is selected.

info (U) = - Σ_{i = 1}^{k} (Σ_{m = 1}^{n (u)} Weight (m, U) \cdot C_{i} (m) / | U |) \cdot \log_{2} (freq (C_{i}, U) / | U |) - - - (1)

The information entropy of sample set U can be obtained by formula (1).

info (U_{Aj}) = - Σ_{i = 1}^{k} (Σ_{m = 1}^{n (u)} Weight (m, U) \cdot C_{i} (m) \cdot O_{Ai} (m) / | U_{Aj} |) \cdot \log_{2} (freq (C_{i}, U_{Aj}) / | U_{Aj} |) - - - (2)

Can obtain characteristic quantity by formula (2) is A, and output valve is the information entropy of the subset U of j.

| U_{Aj} | = Σ_{m = 1}^{n (UAj)} Weight (m, U_{Aj}) \cdot Q_{Aj} (m) - - - (3)

Can be obtained by formula (3) and belong to U _ajsample size.

Symbol implication wherein in above formula is respectively:

C ₁(m), C ₂(m) ... ... .C _k-1(m), C _k(m). the weights of representative classification, wherein C _im () represents that m sample belongs to classification C _idegree, be satisfied with condition

O _a1(m), O _a2(m) ... ... O _{a (n-1)}(m), O _anm weights that () representative exports;

O _{aj '}(m _{ci '}) represent O _ajmissing data in (m);

M _cirepresent the input amendment that there is not missing data;

M _{ci '}represent the input amendment that there is missing data;

N _cibelong to C _isample size;

U _aj: characteristic quantity is A, output valve be the subset U. of j such as, characteristic quantity weather has three values " fine ", " cloudy " and " rain ", then U is divided into this three subsets;

Weight (m, U): the weights of m sample in U;

Freq (C _i, U): be categorized as C in U _isample size;

Freq (C _i, U _aj): U _ajin be categorized as C _isample size;

| the quantity of sample in U|:U;

| U _aj|: belong to U _ajsample size;

Entering the 5th step is the ratio of computing information gain to carve information amount, is the information gain-ratio calculating characteristic quantity and child node is split off 5.If selected characteristic quantity has maximum information ratio of profit increase exactly, then continuing split vertexes is child node.If selected characteristic quantity is not have maximum information ratio of profit increase, then turning back to second step is each sensor selection problem characteristic attribute 2, and again computing information ratio of profit increase until reach maximal value.The division information entropy of its interior joint A can utilize formula (4) first to obtain, then utilize formula (5) to obtain the gain of information, finally formula (4) and (5) are updated to formula (6) and can information gain-ratio be obtained.

Split - info (A) = - Σ_{j = 1}^{n (u)} (| U_{Aj} | / | U |) \cdot \log_{2} (| U_{Aj} | / | U |) - - - (4)

Gain(A)＝info(U)-info _A(U _Aj)(5)

Ratio of profit increase Gain-ratio (A)=Gain (A)/Split-info (A) (6)

6th step is the sample attribute determination leaf node 6 according to each child node, if each node is only containing same class sample, then minor node is set to leaf node, terminate the division of this node, flow process terminates 7.If not only containing a class sample, then get back to second step and continue as each sensor selection problem characteristic attribute 2.

Adopt method of the present invention, data set is divided into zero miss rate data according to the ratio of missing data, 5% miss rate data, 10% miss rate data, 20% miss rate data, 30% miss rate data.The present invention carries out accuracy with classical C4.5 decision tree classifier algorithm and compares, experimental result is as shown in table 3, and when without missing data, algorithm of the present invention and C4.5 sorting algorithm performance do not differ, but along with the ratio of missing data increases, the superiority of algorithm of the present invention is more and more significant.When shortage of data rate up to 30% time, utilize traditional C4.5 algorithm, error rate is up to 12.09%, and the error rate of algorithm of the present invention is only 7.32%, this error rate accepting in scope in real world applications.

The ratio of missing data	C4.5	The present invention
			0	4.7％	4.7％
5％	5.69％	4.7％
			10％	6.16％	5.33％
20％	7.7％	6.0％
			30％	12.09％	7.32％

Table 3

The way that the present invention is given up when running into missing data relative to traditional classifier algorithm, gives sample weights to missing data.In assorting process, give each class node probability, make sorter also can occurring that the probability of data is classified to it according to it in missing data, can effectively process missing data.Utilize information gain-ratio to select attribute, overcome the deficiency of selecting to be partial to during attribute the many attributes of selection value by information gain.It is important which data field the weighted value that sample exports shows clearly, generates understandable classifying rules.

With above-mentioned according to desirable embodiment of the present invention for enlightenment, by above-mentioned description, relevant staff in the scope not departing from this invention technological thought, can carry out various change and amendment completely.The technical scope of this invention is not limited to the content on instructions, must determine its technical scope according to right.

Claims

1. process a Novel decision tree classifier method for missing data, it is characterized in that: comprise the following steps:

A, pending raw sample data collection is carried out Initialize installation, and described data set is carried out weighted value initialization divide

Join;

D, utilize characteristic attribute to calculate the characterizing magnitudes of each node, then be child node by node split;

The sample attribute determination leaf node of each child node that e, basis split off;

Described node diagnostic amount comprises the information entropy of characteristic quantity, the information gain-ratio of characteristic quantity;

When being child node according to the information gain-ratio of described characteristic quantity by node split, the characteristic quantity selected by child node is exactly maximum information ratio of profit increase, then continuing split vertexes is child node; If the characteristic quantity selected by child node is not maximum information ratio of profit increase, then reselect the characteristic attribute being split into child node, until the characteristic quantity selected by child node is maximum information ratio of profit increase;

When the sample attribute determination leaf node according to each child node, if each child node is only containing same class sample, is then leaf node by this Node configuration, terminates the division of this node; If each child node not only containing same class sample, then reselects the characteristic attribute being split into child node.

2. a kind of Novel decision tree classifier method processing missing data according to claim 1, is characterized in that: described data set comprises missing data, non-missing data.