CN102063457A

CN102063457A - Data classification method and system

Info

Publication number: CN102063457A
Application number: CN 201010293694
Authority: CN
Inventors: 储晨
Original assignee: HEFEI JOYIN INFORMATION TECHNOLOGY Co Ltd
Current assignee: HEFEI JOYIN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2010-09-21
Filing date: 2010-09-21
Publication date: 2011-05-18

Abstract

The invention discloses a data classification method and system. The method comprises: selecting a segmentation variable; segmenting and stratifying an original sample set according to the segmentation variable and a target variable to obtain training subsets and testing subsets; selecting the key variables in each training subset, calculating regression coefficients, and modeling the training subsets one by one with a regression model according to the key variables and regression coefficients, so as to generate models describing the data; and substituting the sample variables of each testing subset into the corresponding model, calculating the probability value of each sample, and classifying the samples according to the probability values. Because the original sample set is segmented according to the segmentation variable before the key variables are selected, the local heterogeneity of the key variables is effectively eliminated, the modeling accuracy is improved, and the sample classification accuracy is improved in turn.

Description

A data classification method and system

Technical field

The present invention relates to the field of data mining technology, and in particular to a data classification method and system.

Background art

A classification system is one of the main systems in data mining. It usually extracts key variables from the original sample set, calculates regression coefficients with existing standard software such as SAS (Statistical Analysis Software) or the simulation software MATLAB, and builds a model from the key variables and regression coefficients using a logistic regression model. The user predicts the future trend of the data from the resulting model so as to take the correct action according to that trend.

Because the key variables are extracted from the whole original sample set, their correlation with the target variable exhibits local heterogeneity. When the whole set is modeled at once, this local heterogeneity causes the estimated regression coefficients of these key variables to show "positive and negative cancellation", so the regression coefficients are estimated inaccurately, which in turn lowers the modeling accuracy and the sample classification accuracy.

Summary of the invention

In view of this, the object of the present invention is to provide a data classification method and system, so as to solve the prior-art problem that local heterogeneity in the correlation between the key variables and the target variable makes the regression coefficient estimates inaccurate, which in turn lowers the modeling accuracy and the sample classification accuracy.

The invention provides a data classification method, comprising:

calculating the correlation coefficient between each sample variable and a preset target variable, and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables;

selecting the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and whose correlation coefficient is the largest, selecting the sample variable X₂ corresponding to X₁, and using X₂ as the segmentation variable;

segmenting and stratifying the original sample set according to the segmentation variable and the target variable to obtain training subsets and test subsets;

selecting the key variables in the training subsets, calculating regression coefficients, and modeling the training subsets one by one with a regression model according to the key variables and regression coefficients, so as to produce models describing the data;

substituting the sample variables in the test subsets into the models, calculating the probability value of each sample, and classifying the samples according to the probability values.

The present invention also provides a data classification system, comprising:

a coefficient calculation module, configured to calculate the correlation coefficient between each sample variable and a preset target variable, and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables;

a segmentation variable selection module connected to the coefficient calculation module, configured to select the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and whose correlation coefficient is the largest, to select the sample variable X₂ corresponding to X₁, and to use X₂ as the segmentation variable;

a sample segmentation and stratification module connected to the segmentation variable selection module, configured to segment and stratify the original sample set according to the segmentation variable and the target variable to obtain training subsets and test subsets;

a modeling module connected to the sample segmentation and stratification module, configured to select the key variables in the training subsets, calculate regression coefficients, and model the training subsets one by one with a regression model according to the key variables and regression coefficients, so as to produce models describing the data;

a classification module connected to the sample segmentation and stratification module and to the modeling module, configured to substitute the sample variables in the test subsets into the models, calculate the probability value of each sample, and classify the samples according to the probability values.

With the above technical solution, the correlation coefficient between each sample variable and the preset target variable, and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables, are calculated; the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and whose correlation coefficient is the largest is selected; the sample variable X₂ corresponding to X₁ is used as the segmentation variable; the original sample set is segmented according to the segmentation variable; and the resulting training subsets are modeled. Because the original sample set is segmented according to the segmentation variable before the key variables are selected, the local heterogeneity of the key variables is effectively eliminated, the modeling accuracy is improved, and the sample classification accuracy is improved in turn.

Brief description of the drawings

To illustrate the embodiments of the invention more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the data classification method provided by an embodiment of the invention;

Fig. 2 is a second flowchart of the data classification method provided by an embodiment of the invention;

Fig. 3 is a third flowchart of the data classification method provided by an embodiment of the invention;

Fig. 4 is a flowchart of step S312 in the classification method shown in Fig. 3;

Fig. 5 is a schematic structural diagram of the data classification system provided by an embodiment of the invention;

Fig. 6 is a second schematic structural diagram of the data classification system provided by an embodiment of the invention;

Fig. 7 is a schematic structural diagram of the model prediction effect determination module in the classification system shown in Fig. 6;

Fig. 8 is a schematic structural diagram of the sample segmentation and stratification module in the classification system shown in Fig. 6.

Embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

The following must first be made clear:

Sample: a 1×n matrix;

Sample set: an m×n matrix, i.e. the sample set contains m samples;

Sample variable: an element of a sample; each sample contains n sample variables.

A classification system is one of the main systems in data mining, and the accuracy of its modeling module directly affects the prediction accuracy. In existing modeling methods the key variables are selected directly from the whole original sample set. Because the key variables exhibit local heterogeneity, the regression coefficients show "positive and negative cancellation", which lowers the modeling accuracy and, in turn, the sample classification accuracy. To solve this problem, an embodiment of the invention provides a data classification method that, before key variables are selected for modeling, segments the original sample set by a segmentation variable to obtain training subsets and test subsets. This effectively eliminates the local heterogeneity of the key variables, improves the modeling accuracy and, in turn, the sample classification accuracy.

Embodiment one:

The flowchart of the data classification method provided by this embodiment is shown in Fig. 1 and comprises:

S101: calculate the correlation coefficient between each sample variable and the preset target variable, and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables;

The correlation coefficient is calculated as:

\varphi_{YX_1} = \frac{\sum_{i=1}^{N} (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_{1i} - \bar{X}_1)^2} \sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}

where Y is the preset target variable, X_1 is a sample variable, N is the number of samples used to calculate the correlation coefficient, and \bar{X}_1 and \bar{Y} are the means of X_1 and Y respectively.

The partial correlation coefficient is calculated as:

\varphi_{YX_1 \mid X_2} = \frac{\varphi_{YX_1} - \varphi_{YX_2} \varphi_{X_1 X_2}}{\sqrt{(1 - \varphi_{YX_2}^2)(1 - \varphi_{X_1 X_2}^2)}}

where X_2 is a sample variable in the sample set; the expression above is the partial correlation coefficient between X_1 and Y conditional on X_2.
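The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pearson` and `partial_corr` and the toy data are hypothetical, and inputs are plain lists of numbers without missing values. The toy data are built so that the correlation and the partial correlation of X_1 with Y have opposite signs, which is the situation the method looks for.

```python
import math

def pearson(x, y):
    """Correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(y, x1, x2):
    """Partial correlation of y and x1, conditional on x2 (first order)."""
    r_y1 = pearson(y, x1)
    r_y2 = pearson(y, x2)
    r_12 = pearson(x1, x2)
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))

# Hypothetical toy data: inside each level of x2, y falls as x1 rises,
# but pooled over both levels y and x1 rise together.
x2 = [0, 0, 0, 1, 1, 1]
x1 = [1, 2, 3, 4, 5, 6]
y = [3, 2, 1, 9, 8, 7]

r = pearson(y, x1)                # positive on the pooled data
pr = partial_corr(y, x1, x2)      # negative once x2 is held fixed
print(r, pr)
```

The sign flip between `r` and `pr` is what the description calls local heterogeneity of X_1 with respect to X_2.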

S102: select the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and whose correlation coefficient is the largest, select the sample variable X₂ corresponding to X₁, and use X₂ as the segmentation variable;

A sample variable is an n-dimensional vector. When X₂ takes only a few distinct values, X₂ is used directly as the segmentation variable. Otherwise X₂ must be discretized: a clustering algorithm for variable discretization, as found in software such as SAS, is applied to X₂ to obtain the discretized variable X₂', and X₂' is used as the segmentation variable.
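One possible reading of step S102 is sketched below in Python. Several points are assumptions rather than statements of the source: the helper names are hypothetical, every ordered pair of variables is tried as (X₁, X₂), and "largest correlation coefficient" is taken as largest in absolute value. Discretization of X₂ is not shown.

```python
import math
from itertools import permutations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(y, x1, x2):
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))

def pick_segmentation_variable(columns, y):
    """Return (name of X1, name of X2), or None if no pair flips sign.

    columns maps a variable name to its list of values.
    """
    best = None
    for n1, n2 in permutations(columns, 2):
        r = pearson(y, columns[n1])
        pr = partial_corr(y, columns[n1], columns[n2])
        # Keep pairs whose correlation and partial correlation disagree in sign;
        # among them prefer the largest |correlation| (an assumption).
        if r * pr < 0 and (best is None or abs(r) > best[0]):
            best = (abs(r), n1, n2)
    return None if best is None else (best[1], best[2])

# Hypothetical toy data with two candidate variables "a" and "b".
columns = {"a": [1, 2, 3, 4, 5, 6], "b": [0, 0, 0, 1, 1, 1]}
y = [3, 2, 1, 9, 8, 7]
print(pick_segmentation_variable(columns, y))
```

On this toy input the pair returned is X₁ = "a" and X₂ = "b", so "b" would serve as the segmentation variable.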

S103: segment and stratify the original sample set according to the segmentation variable and the target variable to obtain training subsets and test subsets;

S103 may comprise the following sub-steps:

S1031: according to the target variable, perform stratified sampling on the original sample set at a 1:1 ratio to obtain a training set and a test set;

Here, stratified sampling means sampling the original sample set at a 1:1 ratio within each value of the target variable. For example, suppose the target variable takes the values 0 and 1. All samples in the original sample set whose target value is 0 are randomly divided into two equal parts (A1, A2); likewise, all samples whose target value is 1 are randomly divided into two equal parts (B1, B2). A1 and B1 are merged into the training set and A2 and B2 into the test set. This guarantees that the ratio of samples with target value 0 to samples with target value 1 in both the training set and the test set is the same as in the original sample set.
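The 1:1 stratified sampling of S1031 can be sketched as follows. The function name, the dictionary-based sample layout and the fixed random seed are assumptions made for illustration; when a class has an odd number of samples, this sketch puts the extra sample in the test set.

```python
import random
from collections import defaultdict

def stratified_half_split(samples, target, seed=0):
    """Split samples 1:1 into (train, test) inside each target value."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s in samples:
        by_class[s[target]].append(s)
    train, test = [], []
    for rows in by_class.values():
        rows = rows[:]              # copy so the caller's order is untouched
        rng.shuffle(rows)
        half = len(rows) // 2
        train.extend(rows[:half])   # A1 (or B1)
        test.extend(rows[half:])    # A2 (or B2)
    return train, test

# Hypothetical data: 40 samples, 10 with target 1 and 30 with target 0.
samples = [{"id": i, "y": 1 if i % 4 == 0 else 0} for i in range(40)]
train, test = stratified_half_split(samples, "y")
print(len(train), sum(s["y"] for s in train))
print(len(test), sum(s["y"] for s in test))
```

Both halves keep the 1:3 ratio of target value 1 to target value 0 found in the original set.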

S1032: segment the training set and the test set respectively according to the segmentation variable to obtain training subsets and test subsets.

When segmenting the sample sets, note that the number of training subsets and test subsets obtained must not be too large, to prevent overfitting. The number of training subsets and test subsets is determined by the number of values of the segmentation variable.

The sample sets are segmented as follows: the training set and the test set are segmented by the value of the segmentation variable. Note that, for the same value of the segmentation variable, the training set and the test set are segmented in the same way, so that in the subsequent process each test subset is a valid check of the model built on the corresponding training subset.

The number of training subsets and the number of test subsets both equal the number of values of the segmentation variable: when the segmentation variable takes only two values, there are two training subsets and two test subsets; when it takes several values, there are several of each. Each training subset has a corresponding test subset, i.e. training subsets and test subsets are in one-to-one correspondence, and this correspondence is determined by the value of the segmentation variable in the training subset and the test subset. For example, if the segmentation variable equals 3 in a certain training subset, then there is one test subset in which the segmentation variable also equals 3.
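The one-to-one pairing of training subsets and test subsets by segmentation value might look like the sketch below. The names `segment` and `paired_subsets` and the row layout are hypothetical.

```python
from collections import defaultdict

def segment(rows, seg_var):
    """Group rows by the value of the segmentation variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[seg_var]].append(row)
    return dict(groups)

def paired_subsets(train, test, seg_var):
    """Map each segmentation value to (training subset, test subset)."""
    tr, te = segment(train, seg_var), segment(test, seg_var)
    # The same rule is applied to both sets, so subsets that share a
    # segmentation value correspond to each other.
    return {value: (tr[value], te.get(value, [])) for value in tr}

# Hypothetical rows: "seg" is the segmentation variable.
train = [{"seg": 1, "id": "t1"}, {"seg": 2, "id": "t2"}, {"seg": 1, "id": "t3"}]
test = [{"seg": 2, "id": "s1"}, {"seg": 1, "id": "s2"}]
pairs = paired_subsets(train, test, "seg")
for value, (train_subset, test_subset) in sorted(pairs.items()):
    print(value, len(train_subset), len(test_subset))
```

Keeping the pair under one key makes it hard to score a test subset with a model fitted on a different segment.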

Of course, step S103 may instead first segment the original sample set by the segmentation variable to obtain several sample subsets, and then stratify each subset at a 1:1 ratio according to the target variable to obtain the training subsets and test subsets. The training and test subsets obtained this way are the same as those obtained by the method above. The difference is that this method takes more computation time, and the more values the segmentation variable has, the more the computation time increases. This embodiment therefore prefers stratifying first and segmenting afterwards.

S104: select the key variables in the training subsets, calculate regression coefficients, and model the training subsets one by one with a regression model according to the key variables and regression coefficients, so as to produce models describing the data; where:

Because the segmentation variable takes several values, several training subsets are obtained. Each training subset is therefore modeled separately, and the number of models describing the data equals the number of training subsets.

The sample set is segmented by the segmentation variable before modeling because, conditional on the segmentation variable, the correlation coefficient between certain key variables needed for modeling and the target variable has the opposite sign to the correlation coefficient obtained without conditioning on the segmentation variable; this is called local heterogeneity. If the whole sample set were modeled, local heterogeneity could not be reflected in the calculation of the regression coefficients and could even cause "positive and negative cancellation" of the regression coefficients, lowering the sample classification accuracy. Therefore, to prevent local heterogeneity from being ignored, the sample set must be segmented by the segmentation variable before modeling, so as to improve the sample classification accuracy.

In practical data classification there are very many candidate key variables. To reach the best balance between goodness of fit and the number of key variables, the invention uses forward-backward stepwise regression to select the key variables. Let the key variables selected from a training subset be X₁, X₂, ..., X_m, where m is the number of key variables, and let Y be the target variable, which follows a binomial distribution with Y ∈ {0, 1}. P(Y=1) denotes the probability that Y=1, P(Y=0) denotes the probability that Y=0, and P(Y=1)+P(Y=0)=1. The model equation is then:

\log \frac{P(Y = 1)}{P(Y = 0)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m = X\beta

X = (1, X_1, \ldots, X_m)

\beta = (\beta_0, \beta_1, \ldots, \beta_m)

where β₀, β₁, ..., β_m are the regression coefficients, which can be calculated with existing standard software such as SAS and MATLAB.
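Solving the model equation for P(Y=1) gives the logistic function of Xβ. The short Python sketch below shows that step for a single sample; the function name and the coefficient values are hypothetical and are not taken from the tables of this document.

```python
import math

def predict_probability(beta, x):
    """P(Y=1) for one sample.

    beta[0] is the intercept and beta[1:] pairs with the key variables
    X_1..X_m held in x.
    """
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))  # X * beta
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and one sample with a single key variable.
beta = [-1.0, 0.5]
p = predict_probability(beta, [2.0])
log_odds = math.log(p / (1.0 - p))   # recovers X * beta, here 0
print(p, log_odds)
```

Taking the log-odds of the returned probability reproduces the left-hand side of the model equation, which is a quick consistency check on any implementation.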

For model selection and key variable selection, the criterion used is the AIC (Akaike Information Criterion): the selected key variables and regression coefficients must minimize the AIC.

AIC = -2 \log L + 2(m + 1)

L = \prod_{i=1}^{N} p_i^{Y_i} (1 - p_i)^{1 - Y_i}

p_i = \frac{e^{X_i \beta}}{1 + e^{X_i \beta}}

where m is the number of regression coefficients in the model excluding the intercept β₀, N is the number of samples in the training subset, and in the last two formulas X_i denotes the row vector (1, X_{i1}, ..., X_{im}) of the i-th sample and Y_i its target value. AIC is used, rather than the traditional Wald test, for selecting the model and the key variables because AIC is faster to evaluate and needs less computation. It also balances making the likelihood function as large as possible against keeping the number of regression coefficients as small as possible, i.e. it fits a better model with as few regression coefficients as possible and avoids overfitting.
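As a hedged illustration of the criterion, the sketch below evaluates the log-likelihood by summing over the samples of one training subset and then forms AIC = -2 log L + 2(m + 1). The function names and the tiny data set are hypothetical; fitting the coefficients and the stepwise search itself are not shown.

```python
import math

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of a logistic model over all samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
        p = 1.0 / (1.0 + math.exp(-z))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def aic(beta, X, y):
    """AIC with m key variables plus one intercept."""
    m = len(beta) - 1
    return -2.0 * log_likelihood(beta, X, y) + 2 * (m + 1)

# Hypothetical check: all-zero coefficients give p = 0.5 for every sample.
X = [[1.0], [2.0]]
y = [0, 1]
value = aic([0.0, 0.0], X, y)
print(value)
```

A stepwise procedure would compute this value for each candidate set of key variables and keep the set with the smallest AIC.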

S105: substitute the sample variables in the test subsets into the models, calculate the probability value of each sample, and classify the samples according to the probability values.

Because test subsets and training subsets are in one-to-one correspondence, each sample of a test subset must be substituted into the model built on the training subset that corresponds to that test subset.

The sample variables of each sample in a test subset are substituted into the model built on the corresponding training subset to obtain the probability value of the sample. After the probability values of all test subsets are merged and sorted, the samples are classified according to preset classification percentages.
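Step S105 could be sketched as below. The data layout, the function names and the coefficient values are hypothetical, and the classification rule (the top fraction of the merged ranking is labeled positive) is only one assumed reading of "preset classification percentages".

```python
import math

def predict_probability(beta, x):
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def score_and_classify(models, test_subsets, top_fraction):
    """Score each test sample with the model of its own segment.

    models:       segmentation value -> coefficient list (intercept first)
    test_subsets: segmentation value -> list of (sample_id, x)
    Returns (sample_id, probability, is_positive) in descending probability.
    """
    scored = []
    for seg_value, rows in test_subsets.items():
        beta = models[seg_value]            # model of the matching training subset
        for sample_id, x in rows:
            scored.append((predict_probability(beta, x), sample_id))
    scored.sort(reverse=True)               # merge, then rank by probability
    cutoff = int(len(scored) * top_fraction)
    return [(sid, p, rank < cutoff) for rank, (p, sid) in enumerate(scored)]

# Hypothetical models for two segments and four test samples.
models = {1: [0.0, 1.0], 2: [0.0, -1.0]}
test_subsets = {1: [("a", [2.0]), ("b", [-1.0])],
                2: [("c", [3.0]), ("d", [0.0])]}
result = score_and_classify(models, test_subsets, top_fraction=0.5)
print(result)
```

Samples "a" and "d" end up in the positive class here because their probabilities rank in the top half of the merged list.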

Embodiment two:

Referring to Fig. 2, which shows the flowchart of embodiment two of the data classification method of the invention: before the segmentation variable is selected, the required sample variables are extracted from the original sample set and filled. Embodiment two comprises the following steps:

S201: calculate the missing ratio of each sample variable in the original sample set, and select the sample variables that satisfy the missing-ratio condition;

S202: calculate the mean of each selected sample variable, and fill the missing values of each selected sample variable with its mean;

The missing-ratio condition is that the missing ratio of a variable is not greater than 30%. This condition is of course not fixed and is determined by the actual missingness of the sample variables. Taking "missing ratio not greater than 30%" as the condition, the following shows how to select the sample variables that satisfy it and fill them with the mean. For example, with 4 samples in total, the values of sample variable A are { , 1, , }, those of B are {1, 3, , }, those of C are {1, 2, 3, } and those of D are {1, 2, 4, 1}, where an empty position is a missing value. The missing ratio of a sample variable is the number of its missing values as a percentage of the total number of samples. For A the missing ratio is 3/4 × 100% = 75%, which is greater than 30%, so A is not selected. Likewise, calculating the missing ratios of B, C and D (50%, 25% and 0% respectively) shows that B is not selected and only C and D are selected.

Calculate the mean of C: (1+2+3)/3 = 2. C is filled with this mean, giving {1, 2, 3, 2}. Because the missing ratio of D is 0, D needs no mean filling.
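The worked example with variables A, B, C and D can be reproduced with the following Python sketch. Representing a missing value as `None` and the function names are assumptions made for illustration.

```python
def missing_ratio(values):
    """Share of missing entries (None) among all samples."""
    return sum(v is None for v in values) / len(values)

def mean_fill(values):
    """Replace missing entries with the mean of the observed ones."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def select_and_fill(variables, max_missing=0.30):
    """Keep variables whose missing ratio is within the limit, then fill them."""
    return {name: mean_fill(values)
            for name, values in variables.items()
            if missing_ratio(values) <= max_missing}

variables = {
    "A": [None, 1, None, None],   # 75% missing, dropped
    "B": [1, 3, None, None],      # 50% missing, dropped
    "C": [1, 2, 3, None],         # 25% missing, kept and filled
    "D": [1, 2, 4, 1],            # nothing missing, kept as is
}
filled = select_and_fill(variables)
print(filled)
```

The 30% limit is a parameter because the description states that the condition is not fixed.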

S203: form a new sample set from the filled sample variables;

S204: obtain the total number of samples in the new sample set;

S205: determine whether the total number of samples in the new sample set exceeds the preset total number of samples; if so, perform S206, otherwise perform S207;

The preset total number of samples is 20,000 to 30,000.

S206: extract the preset total number of samples from the new sample set, then perform S207;

S207-S211: the same as steps S101-S105 in embodiment one.

If sample variables are extracted directly from the original sample set, their missing ratio may be very large, i.e. it may not satisfy the missing-ratio condition, which leaves too little effective information and lowers the accuracy of segmentation variable selection. With the above scheme, before the segmentation variable is selected, the sample variables whose missing ratio satisfies the condition are selected first and then filled with their means. This effectively increases the number of analyzable samples and improves the accuracy of segmentation variable selection.

Embodiment three:

After the training subsets are modeled one by one to produce the models describing the data, the prediction effect of the models also needs to be evaluated to determine whether they achieve the optimal prediction effect. Therefore, after the samples in the test subsets are classified, the method further comprises a process for evaluating the prediction effect of the models, as shown in Fig. 3, comprising:

S301 to S311: the same as steps S201-S211 in embodiment two;

S312: determine whether the model achieves the optimal prediction effect; if so, perform S313, otherwise perform S314;

Specifically, this step comprises the following sub-steps, as shown in Fig. 4:

S3121: from the probability values calculated in step S311, obtain the probability values that the target variable equals 1;

S3122: merge these probability values and sort them in descending order;

For example, if the probability values of test subset 1 are p1 = {10%, 39%, 27%, 50%} and those of test subset 2 are p2 = {8%, 20%, 71%, 43%}, they are first merged into p = {10%, 39%, 27%, 50%, 8%, 20%, 71%, 43%}; after sorting, p = {71%, 50%, 43%, 39%, 27%, 20%, 10%, 8%}.

S3123: sort the test set according to the order of the probability values in S3122, select the samples whose rank after sorting falls within a predetermined range, and calculate the MP (conversion rate) value of these samples. Specifically:

According to the probability values sorted in S1051, the samples of each test subset that correspond to the probability values are merged and sorted in the order of the probability values to form one sample set. This sample set is then divided into several sample subsets, and the subset whose probability values are higher than those of the other subsets is selected. For example, the sample set is divided into 10 equal parts, numbered 1 to 10 in descending order of probability value. If the predetermined range specified in S105 is 10%, the subset numbered 1, i.e. the subset whose probability values are higher than those of the other subsets, is selected and its MP value is calculated.
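A minimal Python sketch of this ranking step follows. It assumes that the conversion rate of a group is the share of its samples whose target variable equals 1; the function name and the outcome values are hypothetical, while the probability values reuse the example of test subsets 1 and 2 given for S3122.

```python
def mp_value(probabilities, outcomes, top_fraction=0.10):
    """Conversion rate among the top-ranked fraction of samples."""
    ranked = sorted(zip(probabilities, outcomes),
                    key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))   # size of the top group
    top = ranked[:k]
    return sum(outcome for _, outcome in top) / k

# Probability values of two test subsets, merged before ranking.
p1 = [0.10, 0.39, 0.27, 0.50]
p2 = [0.08, 0.20, 0.71, 0.43]
# Hypothetical observed outcomes (1 = converted) in the same order.
y1 = [0, 1, 0, 0]
y2 = [0, 0, 1, 0]

merged_p = p1 + p2
merged_y = y1 + y2
print(sorted(merged_p, reverse=True))
print(mp_value(merged_p, merged_y, top_fraction=0.25))
```

With eight samples, a top fraction of 25% keeps the two highest-ranked samples; one of them converted, so the value printed last is 0.5.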

After the sample set is divided into equal parts, the model indices of each part are calculated, namely: the CT value (Cumulative of Total, cumulative sample proportion), the SR value (Success Rate, interval conversion rate), the CR value (Cumulative Rate, cumulative conversion rate), the CTS value (Cumulative of Total Success, the cumulative proportion of converted samples among all converted samples), the LI value (Life Index, lift index) and the MP value. The MP value equals the CR value.

In practical data classification, the sample subsets are numbered consecutively starting from 1. For the subset numbered i, the CT, SR, CR, CTS and LI values are calculated as follows:

where the 0th CTS value is defined as 0.

S3124: compare the MP value with the MP value obtained by the ordinary modeling method;

S3125: determine whether the comparison result exceeds a preset improvement value; if so, perform S313, otherwise perform S314; where:

The comparison result may be the difference between the MP value and the MP value obtained by the ordinary modeling method, or the percentage difference between the two. For example, the MP value is 11.23% and the MP value obtained by the ordinary modeling method is 10.11%. If the comparison result is the difference, it is 11.23% - 10.11% = 1.12%, which exceeds the preset improvement value (here 1%). If the comparison result is the percentage difference, it is ((11.23% - 10.11%) / 10.11%) × 100% ≈ 11%; compared with the ordinary modeling method, the MP value of this method is improved by 11%, which exceeds the preset improvement value (here 10%).
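The two comparison modes can be checked with a few lines of Python. The function name `improvement` is hypothetical; the MP values and thresholds are those of the example above.

```python
def improvement(mp_new, mp_baseline):
    """Return (absolute difference, relative difference) of two MP values."""
    diff = mp_new - mp_baseline
    return diff, diff / mp_baseline

diff, rel = improvement(0.1123, 0.1011)
print(f"difference: {diff:.2%}, percentage difference: {rel:.1%}")

# Thresholds from the example: 1% for the difference, 10% for the ratio.
exceeds_absolute = diff > 0.01
exceeds_relative = rel > 0.10
print(exceeds_absolute, exceeds_relative)
```

Either test alone decides whether the flow continues to S313 or returns through S314, depending on which comparison mode is configured.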

S313: determine that the model achieves the optimal prediction effect;

S314: determine that the model does not achieve the optimal prediction effect, and return to S308.

When returning to S308, either the algorithm used to discretize X₂ must be changed, or another sample variable satisfying the condition that the correlation coefficient and the partial correlation coefficient have opposite signs must be selected.

With the above technical solution, it can be determined whether the prediction effect of the model is optimal; when it is not, the segmentation variable is selected and the samples are segmented again, and the data classification continues.

The invention is further described below by means of a specific example.

The purpose of modeling in this embodiment is to use data mining to obtain a classification model of potential automobile consumers, so as to accurately locate potential car buyers with high purchase intention and provide a basis for production decisions in the automobile industry. At the same time, each customer can be scored with an advertising-strategy response model to determine the most effective communication method for each customer and to select the optimal advertising strategy, providing guiding data for decision-making. The data come from a massive automobile consumer information database provided by a large automobile finance company. The database contains more than 200,000 samples, each of which is a multidimensional variable {X₁, X₂, ..., X_m} whose components represent: the time at which the user viewed a vehicle and inquired about its price, the vehicle requested for purchase, the interval between the planned purchase time and the current time, the degree of match between the user's e-mail address and name, and other user information. The user's purchase state is the target variable, a binary variable {0, 1}, where 0 means the user gave up the purchase and 1 means the user bought a vehicle.

In the data classification method, it is first determined whether the missing ratio of each sample variable in the original sample set satisfies the missing-ratio condition, for example that the missing ratio is not greater than 30%. The sample variables whose missing ratio satisfies the condition are selected and filled with their means, and the filled sample variables form a new sample set. Next, samples are drawn from the new sample set as a new sample subset, and the correlation coefficient between each sample variable in the subset and the target variable, and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables, are calculated, as shown in Table 2. The values of X₁ and X₂ in Table 2 are column numbers in the subset.

Table 2 Correlation coefficients and partial correlation coefficients

From Table 2, the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and for which the difference between the two is the largest is selected, and the corresponding X₂ is used as the segmentation variable, as shown in Table 3. In this embodiment the segmentation variable is the sample variable A9_o in column 10, which indicates on which day of the week the user filled in the data and inquired about the price. Its values are 1-8, where 1-7 denote Monday to Sunday and 8 denotes a holiday (United States).

Table 3 Combinations in which the correlation coefficient and the partial correlation coefficient have opposite signs

After A9_o is selected, the samples in the original sample set are stratified at a 1:1 ratio by the target variable values 0 and 1 to obtain a training set and a test set. The training set and the test set are then divided by A9_o into training subsets and test subsets, and the training subsets are modeled separately to produce the models describing the data. Applying the method of the invention to the automobile consumer information database gives the model parameters shown in Tables 4, 5, 6 and 7. Relative to the model indices shown in Table 1, the model indices of the test subsets based on the invention are shown in Table 8, and Table 9 gives the model indices of the test subsets for the ordinary modeling model of Table 1.

Table 4 Model parameters of the training samples with A9_o = {1, 2, 3, 4, 5}

Variable name	Coefficient estimate	Standard error	Z value	P value
Intercept	-3.74082	0.56770	-6.589	4.42e-11
A1_o	-0.09831	0.10017	-0.981	0.32637
A2_o	0.06968	0.04418	1.577	0.11475
A3_o	0.30078	0.06244	4.817	1.46e-06
A4_o	-0.55164	0.07929	-6.957	3.46e-12
A5_o	0.03596	0.03391	1.061	0.28884
A6_o	-0.18971	0.02984	-6.358	2.05e-10
A7_o	0.08229	0.02502	3.289	0.00101
A8_o	0.07948	0.02417	3.288	0.00101
A9_o	0.08963	0.02724	3.290	0.00100
A10_o	0.05017	0.11554	0.434	0.66413
A11_o	0.04660	0.02062	2.260	0.02382
A12_o	0.03237	0.02608	1.241	0.21443
A13_o	-0.06376	0.02241	-2.845	0.00443
A14_o	0.01859	0.03547	0.524	0.60030
A15_o	0.05970	0.01337	4.467	7.95e-06
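
The Z values in Table 4 are consistent with dividing each coefficient estimate by the value in the third column, and the P values with a two-sided normal test. The following check assumes this Wald-test reading, which the patent does not state; the function name is hypothetical.

```python
from math import erfc, sqrt


def z_and_p(estimate, third_column):
    """Wald z statistic and two-sided normal p-value (assumed reading of the table)."""
    z = estimate / third_column
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# A3_o row of Table 4: estimate 0.30078, third-column value 0.06244.
z, p = z_and_p(0.30078, 0.06244)
print(round(z, 3), f"{p:.2e}")
```

For this row the computed statistic rounds to the tabulated Z value of 4.817, and the p-value agrees with the tabulated 1.46e-06 to the precision shown.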

Table 5 Model parameters of the training samples with A9_o = 6

Variable name	Coefficient estimate	Standard error	Z value	P value
Intercept	-3.528780	0.620232	-5.689	1.27e-08
A1_o	-0.206372	0.122280	-1.688	0.091468
A2_o	0.023088	0.049675	0.465	0.642084
A3_o	0.326030	0.071059	4.588	4.47e-06
A4_o	-0.292864	0.089112	-3.286	0.001015
A5_o	0.190853	0.041459	4.603	4.16e-06
A6_o	-0.110843	0.056505	-1.962	0.049804
A7_o	-0.017115	0.024217	-0.707	0.479738
A8_o	0.025628	0.032563	0.787	0.431280
A10_o	-0.042441	0.125672	-0.338	0.735581
A11_o	0.023507	0.023983	0.980	0.327006
A12_o	-0.050970	0.028066	-1.816	0.069357
A13_o	-0.000916	0.025308	-0.036	0.971128
A14_o	0.069009	0.035291	1.955	0.050530
A15_o	0.048512	0.014735	3.292	0.000994

Table 6 Model parameters of the training samples with A9_o = 7

Variable name	Coefficient estimate	Standard error	Z value	P value
Intercept	-5.35626	0.44631	-12.001	<2e-16
A1_o	-0.08975	0.09012	-0.996	0.319303
A2_o	0.11970	0.03465	3.454	0.000552
A3_o	0.10479	0.04524	2.316	0.020545
A4_o	-0.21511	0.06453	-3.334	0.000857
A5_o	0.05709	0.01914	2.984	0.002849
A6_o	0.09818	0.02857	3.437	0.000589
A7_o	0.02675	0.01419	1.885	0.059387
A8_o	0.07552	0.03977	1.899	0.057557
A10_o	0.22531	0.09185	2.453	0.014169
A11_o	0.06182	0.01638	3.775	0.000160
A12_o	-0.06076	0.01966	-3.090	0.002000
A13_o	-0.01446	0.01735	-0.833	0.404639
A14_o	0.09201	0.02459	3.741	0.000183
A15_o	0.04392	0.01019	4.311	1.63e-05

Table 7 Model parameters of the training samples with A9_o = 8

Variable name	Coefficient estimate	Standard error	Z value	P value
Intercept	-4.153193	0.451350	-9.202	<2e-16
A1_o	-0.150153	0.084253	-1.782	0.074723
A2_o	0.150461	0.035652	4.220	2.44e-05
A3_o	0.196052	0.045571	4.302	1.69e-05
A4_o	-0.181489	0.061732	-2.940	0.003282
A5_o	0.078067	0.022941	3.403	0.000667
A6_o	0.025024	0.022249	1.125	0.260705
A7_o	0.037638	0.018381	2.048	0.040593
A8_o	-0.006805	0.031810	-0.214	0.830604
A10_o	-0.008445	0.087655	-0.096	0.923244
A11_o	0.016991	0.015710	1.082	0.279442
A12_o	-0.005377	0.018253	-0.295	0.768313
A13_o	-0.044740	0.016556	-2.702	0.006886
A14_o	0.060623	0.023895	2.537	0.011177
A15_o	0.031205	0.009577	3.258	0.001121

Table 8 Model indices of the present invention

Table 9 Model indices of the ordinary model

The results show that the model indices of the model of the present invention are improved relative to the ordinary model. Table 10 compares the key evaluation index of the two models. The key evaluation index is the MP value, which is identical to the CR value. As Table 10 shows, the MP value of the model of the present invention is 10% higher than the MP value obtained by ordinary modeling (the 10% is obtained by calculating the percentage difference between the two). This exceeds the preset improvement value, so the model reaches the optimal prediction effect.

Table 10 Comparison of the key evaluation index

In summary, the present invention segments the original sample set according to the segmentation variable before selecting the key variables. This effectively eliminates the local differences of the key variables, improves the accuracy of modeling and of sample classification, and in turn improves prediction precision.

Embodiment four

The present invention also provides a data classification system. As shown in Figure 5, the system comprises: a coefficient calculation module 10, a segmentation variable selection module 11, a sample segmentation and layering module 12, a modeling module 13 and a classification module 14, wherein:

The coefficient calculation module 10 is configured to calculate the correlation coefficient between each sample variable and a preset target variable, and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables;

The segmentation variable selection module 11 is configured to select the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and whose correlation coefficient is the largest, to select the sample variable X₂ corresponding to X₁, and to use X₂ as the segmentation variable;

The sample segmentation and layering module 12 is configured to segment and layer the original sample set according to the segmentation variable and the target variable, to obtain training subsets and test subsets;

The modeling module 13 is configured to select the key variables in the training subsets, calculate the regression coefficients, and apply a regression model according to the key variables and the regression coefficients to model the training subsets one by one, producing a model that describes the data;

The classification module 14 is configured to substitute the sample variables of the test subsets into the model, calculate the probability value of each sample, and classify the sample according to the probability value.
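
The work of the classification module 14 can be sketched as follows, assuming a logistic regression model per subset (the Background names Logistic regression) and a decision threshold of 0.5, which the patent does not specify. The coefficients below are invented for illustration and are not those of Tables 4 to 7; all function names are hypothetical.

```python
from math import exp


def predict_probability(model, sample):
    """Logistic-regression probability that the target variable equals 1."""
    score = model["Intercept"] + sum(
        coef * sample[name] for name, coef in model.items() if name != "Intercept"
    )
    return 1.0 / (1.0 + exp(-score))


def classify(models, sample, seg_key="A9_o", threshold=0.5):
    """Pick the subset model by the segmentation variable, then score and label."""
    model = models[sample[seg_key]]
    p = predict_probability(model, sample)
    return p, int(p >= threshold)


# Hypothetical per-subset models keyed by the value of the segmentation variable.
models = {
    6: {"Intercept": -1.0, "A3_o": 0.5},
    7: {"Intercept": -2.0, "A3_o": 0.25},
}
p, label = classify(models, {"A9_o": 6, "A3_o": 4.0})
```

The same sample values can receive different probabilities depending on the subset, because each subset has its own fitted coefficients.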

The data classification system provided by the present invention further comprises: a sample variable selection module 15, a sample variable filling module 16 connected to the sample variable selection module 15, a new sample composition module 17 connected to the sample variable filling module 16, a sample total number acquisition module 18 connected to the new sample composition module 17, and a sample extraction module 19 connected to the sample total number acquisition module 18, as shown in Figure 6, wherein:

The sample variable selection module 15 is configured to calculate the missing ratio of each sample variable in the original sample set, and to select the sample variables that meet the missing-ratio condition according to the missing ratio;

The sample variable filling module 16 is configured to calculate the mean of each selected sample variable that meets the missing-ratio condition, and to fill each selected sample variable with its mean;

The new sample composition module 17 is configured to form the new sample set from the filled sample variables;

The sample total number acquisition module 18 is configured to obtain the total number of samples in the new sample set;

The sample extraction module 19 is configured to extract the preset total number of samples from the new sample set when the total number exceeds the preset total number of samples.
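
A minimal sketch of the behavior of modules 18 and 19 follows. The text does not say how the samples are extracted; simple random sampling without replacement is assumed here, and the function name and seed are hypothetical.

```python
import random


def cap_sample_count(samples, preset_total, seed=0):
    """Return all samples if within the preset total, else a random subset of that size."""
    if len(samples) <= preset_total:
        return list(samples)
    # Assumption: simple random sampling without replacement.
    return random.Random(seed).sample(samples, preset_total)


subset = cap_sample_count(list(range(10)), preset_total=4)
unchanged = cap_sample_count([1, 2, 3], preset_total=5)
```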

The data classification system provided by the embodiment of the present invention also needs to evaluate the prediction effect of the model. The system therefore further comprises a model prediction effect determination module 20, as shown in Figure 6. The model prediction effect determination module 20 is configured to determine whether the model reaches the optimal prediction effect and, when it does not, to return to the step performed by the segmentation variable selection module 11. Note that when this step is performed again, either the algorithm used to discretize X₂ must be changed, or another sample variable whose correlation coefficient and partial correlation coefficient have opposite signs must be selected.

As shown in Figure 7, the model prediction effect determination module 20 comprises: a probability value acquisition unit 201, a probability value sorting unit 202, a conversion rate calculation unit 203, a conversion rate comparison unit 204 and a comparison result judgment unit 205, wherein:

The probability value acquisition unit 201 is configured to obtain, from the probability values, the probability values for which the target variable value is 1;

The probability value sorting unit 202 is configured to merge the probability values obtained by the probability value acquisition unit 201 and sort them by value in descending order;

The conversion rate calculation unit 203 is configured to sort the test set according to the order of the probability values, select the samples whose number after sorting falls within a predetermined range, and calculate the conversion rate of those samples;

The conversion rate comparison unit 204 is configured to compare this conversion rate with the conversion rate obtained by the ordinary modeling method;

The comparison result judgment unit 205 is configured to determine whether the comparison result exceeds a preset improvement value, and thereby whether the model reaches the optimal prediction effect. When the comparison result exceeds the preset improvement value, the model is judged to reach the optimal prediction effect; otherwise, the model is judged not to reach it.
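
The evaluation performed by units 201 to 205 can be sketched as follows. Two assumptions are made that the text leaves open: the conversion rate is taken as the fraction of selected samples whose actual target value is 1, and the comparison result is taken as the relative improvement over the baseline. Function names, data and thresholds are hypothetical.

```python
def top_conversion_rate(probabilities, outcomes, top_n):
    """Conversion rate among the top_n samples ranked by predicted probability."""
    ranked = sorted(zip(probabilities, outcomes), key=lambda t: t[0], reverse=True)
    top = ranked[:top_n]
    return sum(outcome for _, outcome in top) / len(top)


def reaches_best_effect(rate, baseline_rate, preset_improvement):
    """Compare against the ordinary model; assumption: relative improvement."""
    improvement = (rate - baseline_rate) / baseline_rate
    return improvement > preset_improvement, improvement


# Hypothetical merged probabilities from all test subsets and actual outcomes.
probs = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]
actual = [1, 0, 1, 0, 0, 1]
rate = top_conversion_rate(probs, actual, top_n=3)
ok, gain = reaches_best_effect(rate, baseline_rate=0.5, preset_improvement=0.05)
```

If `ok` were false, the procedure would return to the selection of the segmentation variable, as described for the model prediction effect determination module 20.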

In the embodiment of the present invention, the sample segmentation and layering module 12 comprises a sample layering unit 121 and a sample segmentation unit 122, as shown in Figure 8, wherein:

The sample layering unit 121 is configured to perform stratified sampling on the sample set at a ratio of 1:1 according to the target variable, to obtain a training set and a test set;

The sample segmentation unit 122 is configured to segment the training set and the test set respectively according to the segmentation variable, to obtain training subsets and test subsets.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data classification method, characterized by comprising:

2. The classification method according to claim 1, characterized in that, before calculating the correlation coefficient between each sample variable and the preset target variable, and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables, the method further comprises:

calculating the missing ratio of each sample variable in the original sample set, and selecting the sample variables that meet the missing-ratio condition according to the missing ratio;

calculating the mean of each selected sample variable that meets the missing-ratio condition, and filling each selected sample variable with its mean;

forming a new sample set from the filled sample variables.

3. The classification method according to claim 2, characterized in that, after forming the new sample set from the filled sample variables, and before calculating the correlation coefficient between each sample variable and the preset target variable and the partial correlation coefficient between each sample variable and the target variable conditional on the other sample variables, the method further comprises:

obtaining the total number of samples of the sample variables in the new sample set;

when the total number of samples exceeds a preset total number of samples, extracting the preset total number of samples from the new sample set.

4. The classification method according to claim 3, characterized in that, after substituting the sample variables in the test subsets into the model, calculating the probability value of each sample and classifying the sample according to the probability value, the method further comprises:

determining whether the model reaches the optimal prediction effect;

when the model does not reach the optimal prediction effect, returning to the step of selecting the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and whose correlation coefficient is the largest, selecting the sample variable X₂ corresponding to X₁, and using X₂ as the segmentation variable.

5. The classification method according to claim 4, characterized in that determining whether the model reaches the optimal prediction effect comprises:

obtaining, from the probability values, the probability values for which the target variable value is 1;

merging these probability values and sorting them by value in descending order;

sorting the test set according to the order of the probability values, selecting the samples whose number after sorting falls within a predetermined range, and calculating the conversion rate of those samples;

comparing this conversion rate with the conversion rate obtained by the ordinary modeling method;

determining whether the comparison result exceeds a preset improvement value, and thereby whether the model reaches the optimal prediction effect.

6. The classification method according to claim 5, characterized in that determining whether the comparison result exceeds the preset improvement value, and thereby whether the model reaches the optimal prediction effect, is specifically: when the comparison result exceeds the preset improvement value, judging that the model reaches the optimal prediction effect; otherwise, judging that the model does not reach the optimal prediction effect.

7. The classification method according to claim 6, characterized in that segmenting and layering the original sample set according to the segmentation variable and the target variable to obtain training subsets and test subsets comprises:

performing stratified sampling on the original sample set at a ratio of 1:1 according to the target variable, to obtain a training set and a test set;

segmenting the training set and the test set respectively according to the segmentation variable, to obtain training subsets and test subsets.

8. The classification method according to any one of claims 1 to 7, characterized in that selecting the key variables in the training subsets and calculating the regression coefficients is specifically: using stepwise regression to select the key variables in the training subsets, and calculating the regression coefficients with existing standard software.

9. A data classification system, characterized by comprising:

10. The classification system according to claim 9, characterized by further comprising:

a sample variable selection module, configured to calculate the missing ratio of each sample variable in the original sample set, and to select the sample variables that meet the missing-ratio condition according to the missing ratio;

a sample variable filling module connected to the sample variable selection module, configured to calculate the mean of each selected sample variable that meets the missing-ratio condition, and to fill each selected sample variable with its mean;

a new sample composition module connected to the sample variable filling module, configured to form a new sample set from the filled sample variables;

a sample total number acquisition module, configured to obtain the total number of samples of the sample variables in the new sample set;

a sample extraction module connected to the sample total number acquisition module, configured to extract the preset total number of samples from the new sample set when the total number exceeds the preset total number of samples;

a model prediction effect determination module, configured to determine whether the model reaches the optimal prediction effect and, when the model does not reach the optimal prediction effect, to return to the step of selecting the sample variable X₁ whose correlation coefficient and partial correlation coefficient have opposite signs and whose correlation coefficient is the largest, selecting the sample variable X₂ corresponding to X₁, and using X₂ as the segmentation variable.