CN104102716A

CN104102716A - Imbalance data predicting method based on cluster stratified sampling compensation logic regression

Info

Publication number: CN104102716A
Application number: CN201410341930.3A
Authority: CN
Inventors: 李鹏; 张楷卉
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2014-07-17
Filing date: 2014-07-17
Publication date: 2014-10-15

Abstract

The invention relates to an imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression. It belongs to the field of imbalanced data prediction and addresses the poor performance of traditional prediction models on imbalanced data. The method comprises: first, clustering the sample set to be predicted with the k-means algorithm to obtain K classes of data; second, performing stratified sampling on the K classes of data to extract n data items; third, applying maximum likelihood estimation to the parameters of the stratified-sample logistic regression model to obtain its parameter estimation formula and thereby determine the model; fourth, inputting the n data items into the stratified-sample logistic regression model to determine whether the sample set to be predicted is an imbalanced data set. The method is applicable to fields that require imbalanced data prediction, such as biology, medicine, engineering and computing.

Description

Imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression

Technical field

The invention belongs to the field of imbalanced data prediction.

Background technology

As is well known, decision-making depends on prediction. Prediction is the estimation and inference of future outcomes. To this end, the real world (the object of study) is usually imitated or abstracted, a process called modeling. A "good" model therefore not only describes reality but also accurately reflects, through real data, the laws by which reality develops. A prediction model is thus a prediction expressed in quantitative form.

Prediction on imbalanced data sets is a difficult problem in the natural sciences and has significant practical value in many fields, such as biology, medicine, engineering and computing. Practice has shown that when the data classes are imbalanced, directly applying traditional prediction models does not achieve acceptable prediction results.

Existing stratified sampling techniques mainly include a stratified sampling method for network traffic data, a stratified data sampling method for an IT system application evaluation and expansion platform, and a stratified sampling method for data with high attribute dimensionality. All three target real data from a specific field, and their stratification strategies are designed manually according to the characteristics of the data.

Existing logistic regression prediction techniques have mostly been applied to screening plant embryos by quality with a penalized logistic regression (PLR) model, to predicting the biodegradability of organic chemicals with a logistic regression algorithm, and to detecting artifacts in ICU patient records with multivariate logistic regression. Logistic regression prediction has not been applied to the prediction of imbalanced data sets.

Summary of the invention

The object of the invention is to solve the poor performance of traditional prediction models on imbalanced data. To this end, the invention provides an imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression.

The imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression according to the invention comprises the following steps:

Step 1: cluster the sample set to be predicted with the k-means algorithm to obtain K classes of data;

Step 2: perform stratified sampling on the K classes of data and extract n data items;

Step 3: apply maximum likelihood estimation to the parameters of the stratified-sample logistic regression model, obtain the parameter estimation formula of the model, and thereby determine the stratified-sample logistic regression model;

Step 4: input the n extracted data items into the stratified-sample logistic regression model and determine whether the sample set to be predicted is an imbalanced data set.

The beneficial effects of the invention are as follows. First, the imbalanced data are resampled by cluster stratified sampling, which removes a large amount of noise data that would affect prediction, lowers the imbalance ratio, and reduces the occurrence of data submersion. Second, to account for the change in data distribution after sampling, a parameter-compensated logistic regression prediction model is proposed, which corrects the predicted probability values while improving prediction performance. Experiments confirm that the prediction method of the invention significantly improves prediction accuracy on imbalanced data.

Brief description of the drawings

Fig. 1 is a flow chart of the imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression described in embodiment one.

Fig. 2 is a schematic diagram of the principle of cluster-based stratification in embodiment two.

Embodiment

Embodiment one: This embodiment is described with reference to Fig. 1. The imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression of this embodiment comprises Steps 1 to 4 as set out in the summary of the invention.

Stratified sampling is also called type sampling. The population units are first divided into several types, or strata, according to some important attribute, and sample units are then drawn within each type or stratum by simple random sampling or systematic sampling. Because stratification increases the similarity among the units within each type, a representative sample is easier to draw. Stratified sampling is more accurate than simple random sampling or systematic sampling: it yields more accurate inferences from fewer sampled units, and it often gives satisfactory results when the population is large and its internal structure is complex. In addition to inferences about the population as a whole, stratified sampling also supports inferences about each stratum. The method suits populations that are complex, contain many units, and show large differences between units.

Stratified sampling divides a highly heterogeneous population into subpopulations that are each more homogeneous. The samples drawn from each subpopulation represent that subpopulation, and all the samples together represent the population. Unlike simple random sampling, stratified sampling must first divide the population into strata. In practice, the most important task in stratified sampling is to stratify the data reasonably, so that the samples drawn after stratification express the distribution and characteristics of the population more precisely. Stratification is both the key point and the main difficulty of stratified sampling, so this embodiment stratifies by clustering.

Clustering is one of the most common techniques in data mining and is used to discover unknown data classes in a database. Each group formed by the clustering process is called a class. Before clustering, neither the number nor the types of the classes are known. The division follows the principle that "birds of a feather flock together": the objects under study are divided into groups according to the similarity between individuals or data objects. Clustering groups a set of objects into classes by similarity so that objects in the same class are as similar as possible and objects in different classes are as different as possible. Clustering therefore provides sound theoretical guidance and a feasible method for the stratification step of stratified sampling.

Embodiment two: This embodiment is described with reference to Fig. 2. It further limits the imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression described in embodiment one. In Step 1, clustering the sample set to be predicted with the k-means algorithm to obtain K classes of data comprises:

Step 1.1: randomly select K data items from the sample set to be predicted, each serving as the center of one class;

Step 1.2: assign the remaining data in the sample set to be predicted to the class whose center is nearest;

Step 1.3: for each class, compute the mean attribute values of all data in the class and take this mean as the new center of the class;

Step 1.4: reassign the data in the sample set to be predicted to the class whose new center is nearest, and check whether the resulting classes are the same as those obtained in Step 1.2; if they are the same, stop and output the K classes of data; otherwise, return to Step 1.3.
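Steps 1.1 to 1.4 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the use of Euclidean distance, and the handling of empty classes are all assumptions introduced here.

```python
import numpy as np

def kmeans(data, k, seed=0, max_iter=100):
    """Cluster the rows of `data` into k classes following Steps 1.1 to 1.4."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1.1: pick K distinct data items at random as the initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 1.2 and 1.4: assign every item to its nearest center
        # (Euclidean distance is an assumption; the text does not name a metric).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 1.4: stop once the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 1.3: move each center to the mean of its members.
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # assumption: an empty class keeps its old center
                centers[c] = members.mean(axis=0)
    return labels, centers
```

Calling `kmeans(data, K)` returns one class label per row, and these labels can then serve as the strata for the sampling in Step 2.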

This embodiment applies the k-means clustering algorithm to the stratification step of stratified sampling. Besides being simple and effective, k-means is chosen chiefly because the number of clusters can be set in advance. For stratification, this means the number of strata can be predefined, which allows the sampling process to be controlled effectively.

Embodiment three: This embodiment further limits the imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression described in embodiment one. In Step 3,

The parameter estimation formula of the stratified-sample logistic regression model is

\begin{cases} Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{1} + β^{'} x_{i})}{1 + \exp (α_{1} + β^{'} x_{i})}] = 0 \\ Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{1} + β^{'} x_{i})}{1 + \exp (α_{1} + β^{'} x_{i})}] x_{ij} = 0, \quad j = 1, 2, 3, ..., m \end{cases}

where α₁ and β' are the unknown parameters of the stratified-sample logistic regression model; β' is a 1 × m vector, β' = (β₁, ..., β_m)^T; x_ij is the j-th feature of the i-th extracted data item; m is the number of features of each extracted data item; i = 1, 2, 3, ..., n; y_i is the predicted value of the i-th extracted data item and takes a value in {0, 1};

The logistic regression model is

P = \frac{\exp (α_{1} + β^{'} X)}{1 + \exp (α_{1} + β^{'} X)}

where X = (x₁, x₂, ..., x_m) is the feature vector of each extracted data item and x_m is its m-th feature.

Existing logistic regression prediction models are mostly applied directly to the data subset obtained after sampling. Because different types of data enter the sample with different probabilities, the distribution of the sample points is no longer the same as the population distribution. Under stratified sampling, this inconsistency means that direct maximum likelihood estimation gives biased estimates of the model parameters and probabilities, so the predicted probability values are inaccurate. This embodiment uses a parameter compensation for logistic regression under stratified sampling: it compensates for the bias introduced into the maximum likelihood estimate so that logistic regression prediction adapts to the inconsistent data distributions, and the predicted probability values come closer to the actual probabilities of occurrence.

The logistic regression model is a nonlinear model, so its parameters are usually estimated by maximum likelihood. It can be shown that, under random sampling, the maximum likelihood estimate of the logistic regression model is consistent, asymptotically efficient and asymptotically normal. In many studies, however, sampling is not completely random but stratified, which requires considering parameter estimation for the logistic regression model under stratified sampling.

In logistic regression, the dependent variable Y_i (i = 1, 2, 3, ..., n) follows a Bernoulli distribution: it equals 1 with probability P_i and 0 with probability 1 - P_i, and P_i / (1 - P_i) is the odds that the event occurs. The vector X_i (i = 1, 2, 3, ..., n) is the vector representation of the i-th observed sample, and the constant K is the number of attributes of a sample, that is, the number of feature categories.

Y_{i} ～ Bernoulli(Y_{i} | P_{i}) \quad (1)

\ln \frac{P (Y_{i} = 1)}{1 - P (Y_{i} = 1)} = \ln (odds) = α_{0} + Σ_{k = 1}^{K} β_{k} X_{ik} \quad (2)

The expression above is the log-odds. Taking the exponential of both sides expresses the relation in terms of the odds:

odds = \frac{P (Y_{i} = 1)}{1 - P (Y_{i} = 1)} = \exp (α_{0} + Σ_{k = 1}^{K} β_{k} X_{ik}) \quad (3)

= e^{α_{0} + Σ_{k = 1}^{K} β_{k} X_{ik}} = e^{α_{0}} \cdot Π_{k = 1}^{K} e^{β_{k} X_{ik}} = e^{α_{0}} \cdot Π_{k = 1}^{K} (e^{β_{k}})^{X_{ik}} \quad (4)

For a given application, the logistic regression model can be expressed in several alternative forms. Logistic regression is also relatively easy to compute, many tools are available for estimating its parameters, and it performs well in practice. Note that once the odds or the log-odds are known, the corresponding probability of occurrence is easy to calculate:

P_{x_{i}} = \frac{odds}{1 + odds} = \frac{\exp (α_{0} + β^{'} X_{i})}{1 + \exp (α_{0} + β^{'} X_{i})} \quad (5)

Here the unknown parameter α₀ is a constant and β' is a K × 1 vector with one component for each independent variable. The model parameters are estimated by maximum likelihood:

L (α_{0}, β^{'}) = Π_{i = 1}^{n} P_{x_{i}}^{Y_{i}} (1 - P_{x_{i}})^{1 - Y_{i}} \quad (6)

For a random sample (x_i, y_i), i = 1, 2, ..., n, taking the logarithm of both sides and combining the result with formula (2) simplifies the log-likelihood function to:

\ln (L (α_{0}, β^{'})) = Σ_{i = 1}^{n} [y_{i} (α_{0} + β^{'} x_{i}) - \ln (1 + \exp (α_{0} + β^{'} x_{i}))] \quad (7)

The values of the unknown parameters α₀ and β' are obtained from the following maximum likelihood estimating equations:

\begin{cases} \frac{∂ \ln [L (α_{0}, β^{'})]}{∂ α_{0}} = Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{0} + β^{'} x_{i})}{1 + \exp (α_{0} + β^{'} x_{i})}] = 0 \\ \frac{∂ \ln [L (α_{0}, β^{'})]}{∂ β_{j}} = Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{0} + β^{'} x_{i})}{1 + \exp (α_{0} + β^{'} x_{i})}] x_{ij} = 0, \quad j = 1, 2, 3, ..., m \end{cases} \quad (8)
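The estimating equations (8) have no closed-form solution and are normally solved numerically. The sketch below uses Newton-Raphson iteration, a standard choice that the patent itself does not name; the function `fit_logistic`, its parameters, and the small ridge term added for numerical stability are assumptions made for illustration.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25, ridge=1e-8):
    """Solve the score equations of formula (8) by Newton-Raphson.

    Returns (alpha, beta): the constant term and the feature weights.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the constant term is estimated with beta.
    design = np.hstack([np.ones((len(x), 1)), x])
    theta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ theta))
        # Left-hand sides of formula (8): sum of (y_i - P_i) and (y_i - P_i) x_ij.
        score = design.T @ (y - p)
        weights = p * (1.0 - p)
        hessian = design.T @ (design * weights[:, None])
        # Assumption: a tiny ridge keeps the linear solve well conditioned.
        hessian += ridge * np.eye(len(theta))
        step = np.linalg.solve(hessian, score)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta[0], theta[1:]
```

At convergence both sums in formula (8) are numerically zero. On perfectly separable data the maximum likelihood estimate does not exist, so this sketch is only meaningful when the two classes overlap.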

Under random sampling, the maximum likelihood estimate of the logistic regression model is consistent, asymptotically efficient and asymptotically normal. In some studies, however, sampling is not completely random but stratified. Under random sampling the distribution of the sample points is the same as the population distribution. Under stratified sampling, different types of data enter the sample with different probabilities, so the sample distribution is no longer the same as the population distribution, and direct maximum likelihood estimation gives biased estimates of the model parameters and probabilities. This patent studies the estimation bias of the stratified-sampling logistic regression prediction model for imbalanced data sets and proposes a method of compensating for it.

In a population of N samples, the minority class contains P₀N samples and the majority class contains (1 - P₀)N samples. Stratified sampling draws n₁ samples from the minority class and n₂ samples from the majority class. Let λ₀ be the ratio of the minority-class count to the majority-class count in the population, λ₀ = P₀N / ((1 - P₀)N) = P₀ / (1 - P₀), and let λ₁ be the same ratio in the sample, λ₁ = n₁ / n₂. It can be derived that, for a stratified sample (x_i, y_i), i = 1, 2, ..., n, the log-likelihood function is:

\begin{matrix} \ln [L (α, β)] = Σ_{i = 1}^{n} \{ y_{i} [\ln λ_{1} + \ln P_{x_{i}}] + (1 - y_{i}) [\ln λ_{0} + \ln (1 - P_{x_{i}})] - \ln [λ_{1} P_{x_{i}} + (1 - P_{x_{i}}) λ_{0}] \} \\ = Σ_{i = 1}^{n} y_{i} \ln \frac{λ_{1}}{λ_{0}} + Σ_{i = 1}^{n} y_{i} \ln \frac{P_{x_{i}}}{1 - P_{x_{i}}} - Σ_{i = 1}^{n} \ln [\frac{λ_{1}}{λ_{0}} \frac{P_{x_{i}}}{1 - P_{x_{i}}} + 1] \end{matrix} \quad (9)

Using formula (5) gives

\ln [L (α_{0}, β^{'})] = A + Σ_{i = 1}^{n} \{ y_{i} (α_{0} + β^{'} x_{i}) - \ln [1 + \exp (α_{0} + λ + β^{'} x_{i})] \} \quad (10)

where A = Σ_{i = 1}^{n} y_{i} \ln \frac{λ_{1}}{λ_{0}} is a number that does not depend on the parameters to be estimated, and λ = \ln \frac{λ_{1}}{λ_{0}}. Letting α₁ = α₀ + λ, the maximum likelihood estimates of the parameters α₁ and β' are obtained from the following system of equations:

\begin{cases} \frac{∂ \ln [L (α_{0}, β^{'})]}{∂ α_{0}} = Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{1} + β^{'} x_{i})}{1 + \exp (α_{1} + β^{'} x_{i})}] = 0 \\ \frac{∂ \ln [L (α_{0}, β^{'})]}{∂ β_{j}} = Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{1} + β^{'} x_{i})}{1 + \exp (α_{1} + β^{'} x_{i})}] x_{ij} = 0, \quad j = 1, 2, 3, ..., m \end{cases} \quad (11)

Formula (11) is the parameter estimation formula of the stratified-sample logistic regression model. Under random sampling the sample distribution matches the population distribution, so λ₁ = λ₀, hence λ = 0 and α₁ = α₀, and formula (11) coincides with formula (8). Formula (11) can therefore be regarded as the generalization of formula (8) to stratified sampling.
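The step from formula (9) to formulas (10) and (11) can be written out as follows. This is a sketch that assumes λ = ln(λ₁/λ₀), as defined in formula (12), and uses the identity P/(1 - P) = exp(α₀ + β'x_i) from formulas (3) and (5).

```latex
\begin{aligned}
\ln L &= \lambda \sum_{i=1}^{n} y_i
        + \sum_{i=1}^{n} y_i\,(\alpha_0 + \beta' x_i)
        - \sum_{i=1}^{n} \ln\!\left[1 + \exp(\lambda + \alpha_0 + \beta' x_i)\right] \\
      &= \sum_{i=1}^{n} \left\{ y_i\,(\alpha_1 + \beta' x_i)
        - \ln\!\left[1 + \exp(\alpha_1 + \beta' x_i)\right] \right\},
        \qquad \alpha_1 = \alpha_0 + \lambda .
\end{aligned}
```

The second line has the same form as the random-sampling log-likelihood (7) with α₀ replaced by α₁, which is why differentiating it reproduces the estimating equations (8) with α₁ in place of α₀.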

The estimation bias of the parameters and probabilities is analyzed theoretically below. Comparing formula (11) with formula (8) shows that estimating the model with formula (8) under stratified sampling treats the estimates of α₁ and β' as if they were the estimates of α₀ and β'. This causes:

1) Bias in the estimate of the constant term

Since α₁ = α₀ + λ, the larger λ is, the larger α₁ is. When formula (8) is used to estimate the model from a stratified sample, the estimated constant term is positively correlated with λ, so the estimate depends on the sampling design: the larger λ is in the stratified sample, the larger the estimated constant term.

2) Bias in the probability estimate

Let Z = α₀ + β'X. Since α₁ > α₀, using α₁ and β' in place of α₀ and β' increases Z and therefore increases P_{x_i}. The probability of occurrence of the minority class is thus overestimated, and the larger λ is, the larger the overestimation.
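The effect of the compensation can be illustrated numerically. The sketch below assumes λ = ln(λ₁/λ₀), the definition given in formula (12), and recovers α₀ = α₁ - λ from a constant term fitted on the stratified sample; the helper names `compensate_intercept` and `predict_probability` are hypothetical.

```python
import math

def compensate_intercept(alpha1, lambda1, lambda0):
    """Recover alpha0 = alpha1 - lambda, with lambda = ln(lambda1 / lambda0)."""
    return alpha1 - math.log(lambda1 / lambda0)

def predict_probability(alpha, beta, x):
    """Logistic probability exp(Z) / (1 + exp(Z)) with Z = alpha + beta . x."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical numbers: the population ratio is 1:10 but the sample is balanced 1:1.
alpha1, beta, x = 0.0, [0.0], [0.0]
alpha0 = compensate_intercept(alpha1, lambda1=1.0, lambda0=0.1)
p_uncorrected = predict_probability(alpha1, beta, x)
p_corrected = predict_probability(alpha0, beta, x)
```

With these hypothetical values the uncorrected probability is 0.5, while the corrected probability is 1/11, which illustrates the overestimation described in item 2).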

An imbalanced data set has two intrinsic factors: the imbalance ratio and the lack of information. The imbalance ratio (denoted S) is the ratio of the majority class to the minority class and expresses the degree of imbalance of the data. The number of strata used in stratified sampling is denoted H. For imbalanced data sets, this patent proposes cluster-based stratification with the following sampling strategy: all minority-class samples are kept, and from each stratum a number of majority-class samples equal to the number of minority-class samples is drawn. Combining this strategy with the discussion above gives

\begin{cases} λ_{0} = \frac{P_{0} N}{(1 - P_{0}) N} = \frac{P_{0}}{1 - P_{0}} = \frac{1}{S} \\ λ_{1} = \frac{n_{1}}{n_{2}} = \frac{1}{H} \\ λ = \ln \frac{λ_{1}}{λ_{0}} = \ln \frac{S}{H} \end{cases} \quad (12)

Formula (12) shows that the larger the imbalance ratio S, the larger λ: the more severe the imbalance of the data set, the larger λ and the larger the overestimation bias. Formula (12) also guides the stratification strategy: the more imbalanced the data, the more strata the stratified sampling should use, which reduces the overestimation bias.
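The sampling strategy and formula (12) can be sketched together. The code assumes that the strata are cluster labels assigned to the majority-class rows (for example by k-means) and that a stratum smaller than the minority class is taken whole; both are assumptions, as are the function names `cluster_stratified_sample` and `compensation_lambda`.

```python
import numpy as np

def cluster_stratified_sample(minority, majority, strata_labels, seed=0):
    """Keep all minority rows; draw len(minority) majority rows from each stratum."""
    rng = np.random.default_rng(seed)
    strata_labels = np.asarray(strata_labels)
    picked = []
    for stratum in np.unique(strata_labels):
        idx = np.flatnonzero(strata_labels == stratum)
        # Assumption: a stratum smaller than the minority class is taken whole.
        take = min(len(minority), len(idx))
        picked.append(rng.choice(idx, size=take, replace=False))
    majority_sample = majority[np.concatenate(picked)]
    return minority, majority_sample

def compensation_lambda(s, h):
    """lambda = ln(S / H) from formula (12)."""
    return float(np.log(s / h))
```

When every stratum holds at least as many majority rows as there are minority rows, the sample contains H majority rows per minority row, which is the ratio λ₁ = 1/H used in formula (12).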

The invention has been applied successfully to answer extraction. Answer extraction is a subfield of information extraction research and a core component of a question answering system; it is what distinguishes a question answering system from a conventional text retrieval system. Answer extraction is a typical binary classification problem: a candidate answer either is the answer or is not. In theory, such problems are therefore well suited to logistic regression. In practice, correct answers are far fewer than distracting answers, so the sample data are severely imbalanced. These characteristics suit the prediction method based on cluster stratified sampling and compensated logistic regression proposed in this patent. The method is therefore used to extract correct answers in the answer extraction component of the InsunQA system.

The information retrieval component of the InsunQA system returns 70 relevant paragraphs for each question. These paragraphs may contain the correct answer to the question, together with a large number of distracting answers. Every candidate answer is represented as a vector of the features described above, and each sample comprises 15 feature attribute values. The main purpose of answer extraction is to pick the correct answer from the candidate answers. Logistic-regression-based answer extraction is essentially a ranking of the candidate answers, which requires a ranking formula over these feature attributes, namely the answer extraction formula.

Because the answer extraction data set is a typical imbalanced data set, the cluster stratified sampling method proposed here can be used to draw the sample set for estimating the answer extraction parameters. With SPSS software and the proposed estimation bias compensation method, the values of the parameters α₀ and β' in formula (13) are obtained, where β' is the set of feature weights. Once the values of α₀ and β' are known, the prediction formula is:

\begin{cases} P_{x_{i}} = \frac{e^{Z}}{1 + e^{Z}} \\ Z = α_{1} + β_{1} x_{1} + ... + β_{m} x_{m} \end{cases} \quad (14)

Formula (14) is the resulting answer extraction formula. It predicts the probability that a candidate answer is the correct answer. The candidate answers are ranked by this probability, and the candidate with the highest probability is taken as the final correct answer.
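The ranking described above can be sketched as follows. The function `rank_candidates`, the candidate names and the parameter values are hypothetical and serve only to illustrate formula (14); they are not taken from the InsunQA system.

```python
import math

def rank_candidates(candidates, alpha1, beta):
    """Score each candidate with formula (14) and sort by descending probability.

    `candidates` is a list of (name, feature_vector) pairs.
    """
    scored = []
    for name, features in candidates:
        z = alpha1 + sum(b * x for b, x in zip(beta, features))
        # exp(Z) / (1 + exp(Z)) written in the equivalent form 1 / (1 + exp(-Z)).
        scored.append((name, 1.0 / (1.0 + math.exp(-z))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidates with a single feature each.
ranked = rank_candidates(
    [("a", [0.0]), ("b", [2.0]), ("c", [1.0])], alpha1=-1.0, beta=[1.0]
)
best_name = ranked[0][0]
```

The first entry of the returned list is the candidate that would be selected as the final answer.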

Claims

1. An imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression, characterized in that it comprises the following steps:

2. The imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression according to claim 1, characterized in that, in Step 1, clustering the sample set to be predicted with the k-means algorithm to obtain K classes of data comprises:

3. The imbalanced data prediction method based on cluster stratified sampling and compensated logistic regression according to claim 1, characterized in that, in Step 3,

\begin{cases} Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{1} + β^{'} x_{i})}{1 + \exp (α_{1} + β^{'} x_{i})}] = 0 \\ Σ_{i = 1}^{n} [y_{i} - \frac{\exp (α_{1} + β^{'} x_{i})}{1 + \exp (α_{1} + β^{'} x_{i})}] x_{ij} = 0, \quad j = 1, 2, 3, ..., m \end{cases}

The logistic regression model is