CN104102716A - Imbalance data predicting method based on cluster stratified sampling compensation logic regression - Google Patents

Info

Publication number
CN104102716A
Authority
CN
China
Prior art keywords
data
stratified
sample
class
logic regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410341930.3A
Other languages
Chinese (zh)
Inventor
李鹏
张楷卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201410341930.3A
Publication of CN104102716A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to an imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression. It belongs to the field of imbalanced data prediction and aims to solve the problem that traditional prediction models perform poorly on imbalanced data. The method comprises: first, clustering the sample set to be predicted with the k-means algorithm to obtain K classes of data; second, performing stratified sampling on the K classes of data to extract n data items; third, performing maximum likelihood estimation on the parameters of a stratified-sample logistic regression model to obtain its parameter estimation equations and thereby determine the model; and fourth, inputting the n data items into the stratified-sample logistic regression model to determine whether the sample set to be predicted is an imbalanced data set. The method is applicable to fields that require imbalanced data prediction, such as biology, medicine, engineering, and computing.

Description

Imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression
Technical field
The invention belongs to the field of imbalanced data prediction.
Background
Decision-making depends on prediction. Prediction is the estimation and inference of future outcomes. To this end, the real world (the object of study) is usually imitated or abstracted, a process called modeling. A good model should not only describe reality but also accurately reflect, through real data, the laws by which reality develops. A prediction model is therefore a prediction expressed in quantitative form.
Prediction on imbalanced data sets is a difficult problem in the natural sciences and has important practical value in many fields such as biology, medicine, engineering, and computing. Experience shows that when the class distribution is imbalanced, directly applying a traditional prediction model does not give acceptable prediction results.
Existing stratified sampling techniques mainly include a stratified sampling method for network traffic data, a stratified data sampling method for an IT system application evaluation extension platform, and a stratified sampling method for data with high attribute dimensionality. All three target real data from a specific field, and the stratification strategy that guides the sampling is designed manually according to the characteristics of the data.
Existing logistic regression prediction techniques have mostly been applied in a method that screens plant embryos by quality using a penalized logistic regression (PLR) model, a method that predicts the biodegradability of organic chemicals with a logistic regression algorithm, and a method that detects artifacts in ICU patient records with multivariate logistic regression. Logistic regression prediction has not been applied to the prediction of imbalanced data sets.
Summary of the invention
The object of the invention is to solve the problem that traditional prediction models perform poorly on imbalanced data. To this end, the invention provides an imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression.
The imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression of the invention
comprises the following steps:
Step 1: cluster the sample set to be predicted with the k-means algorithm to obtain K classes of data;
Step 2: perform stratified sampling on the K classes of data to extract n data items;
Step 3: perform maximum likelihood estimation on the parameters of the stratified-sample logistic regression model to obtain the parameter estimation equations of the model, and thereby determine the stratified-sample logistic regression model;
Step 4: input the n extracted data items into the stratified-sample logistic regression model to determine whether the sample set to be predicted is an imbalanced data set.
The beneficial effects of the invention are as follows. First, the invention resamples the imbalanced data by cluster-based stratified sampling, which removes a large amount of noisy data that affects prediction, lowers the imbalance ratio, and reduces the occurrence of data swamping. Second, to account for the change in data distribution after sampling, it proposes a parameter-compensated logistic regression prediction model that corrects the predicted probability values while improving prediction performance. Experiments verify that the prediction method of the invention significantly improves prediction accuracy on imbalanced data.
Brief description of the drawings
Fig. 1 is a flow chart of the imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression described in embodiment one.
Fig. 2 is a schematic diagram of the principle of cluster-based stratum division in embodiment two.
Embodiments
Embodiment one: this embodiment is described with reference to Fig. 1. The imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression of this embodiment comprises the following steps:
Step 1: cluster the sample set to be predicted with the k-means algorithm to obtain K classes of data;
Step 2: perform stratified sampling on the K classes of data to extract n data items;
Step 3: perform maximum likelihood estimation on the parameters of the stratified-sample logistic regression model to obtain the parameter estimation equations of the model, and thereby determine the stratified-sample logistic regression model;
Step 4: input the n extracted data items into the stratified-sample logistic regression model to determine whether the sample set to be predicted is an imbalanced data set.
Stratified sampling, also called type sampling, divides the population units into several types or strata according to some important attribute, and then draws sample units within each type or stratum by simple random sampling or systematic sampling. Because stratification increases the similarity among the units within each type, a representative sample is easier to draw. Stratified sampling is more precise than simple random sampling or systematic sampling: it yields more accurate inferences from fewer sampled units, and it often works well when the population is large and its internal structure is complex. In addition to inference about the population as a whole, it also allows inference about each stratum. The method suits populations that are complex, contain many units, and show large differences between units.
Stratified sampling divides a strongly heterogeneous population into subpopulations that are each more homogeneous. The samples drawn from each subpopulation represent that subpopulation, and all samples together represent the population. Unlike simple random sampling, stratified sampling must first divide the population into strata. In practice, the most important task is to divide the samples into strata in a reasonable way, so that the sample drawn after stratification expresses the distribution and characteristics of the population more precisely. Stratum division is both the key point and the main difficulty of stratified sampling, so this embodiment divides the strata by clustering.
Clustering is one of the most common techniques in data mining and is used to discover unknown data classes in a database. Each group formed by the clustering process is called a class. Before clustering, the number and type of the data classes are unknown. The division follows the principle that like gathers with like: the objects of study are divided into groups according to the similarity between individuals or data objects. Clustering assigns a set of objects to several classes by similarity, so that objects in the same class are as similar as possible and objects in different classes are as different as possible. Clustering therefore provides sound theoretical guidance and a feasible method for the stratum division of stratified sampling.
Embodiment two: this embodiment is described with reference to Fig. 2. It further limits the imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression described in embodiment one. In step 1, clustering the sample set to be predicted with the k-means algorithm to obtain K classes of data comprises:
Step 1.1: randomly select K data items from the sample set to be predicted, each serving as the center of one class;
Step 1.2: assign each of the other data items in the sample set to be predicted to the class whose center is nearest;
Step 1.3: for each class, compute the mean attribute values of all data items in the class and take these means as the new center of the class;
Step 1.4: reassign every data item in the sample set to be predicted to the class whose new center is nearest, and check whether the resulting classes are the same as those of the previous assignment. If they are the same, stop and output the K classes of data; otherwise, return to step 1.3.
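Steps 1.1 to 1.4 can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the patent's implementation: the function names, the use of squared Euclidean distance, the `max_iter` safety cap, and the rule that an emptied class keeps its old center are all choices made here.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point (ties go to the lower index)."""
    return [
        min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
        for p in points
    ]


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of tuples) into k classes following steps 1.1-1.4."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # step 1.1: K random items as centers
    labels = assign(points, centers)      # step 1.2: nearest-center assignment
    for _ in range(max_iter):             # max_iter is a safety cap (assumption)
        for c in range(k):                # step 1.3: mean attributes as new center
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                   # an emptied class keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        new_labels = assign(points, centers)  # step 1.4: reassign and compare
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers
```

Calling `k_means(points, K)` returns one class label per data item together with the final class centers. The labels can then serve as the strata for step 2.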
This embodiment applies the k-means clustering algorithm to the stratum division of stratified sampling. Besides being simple and effective, k-means is chosen above all because the number of clusters can be set in advance. For stratum division, this means the number of strata can be fixed beforehand, which makes the sampling process easy to control.
Embodiment three: this embodiment further limits the imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression described in embodiment one. In step 3,
the parameter estimation equations of the stratified-sample logistic regression model are
$$\sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_1+\beta' x_i)}{1+\exp(\alpha_1+\beta' x_i)}\right]=0,\qquad \sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_1+\beta' x_i)}{1+\exp(\alpha_1+\beta' x_i)}\right]x_{ij}=0,\quad j=1,2,3,\ldots,m,$$
where $\alpha_1$ and $\beta'$ are the unknown parameters of the stratified-sample logistic regression model, $\beta'$ is a 1 × m vector, $\beta'=(\beta_1,\ldots,\beta_m)^T$, $x_{ij}$ is the j-th feature of the i-th extracted data item, m is the number of features of each extracted data item, $i=1,2,3,\ldots,n$, and $y_i$ is the predicted value of the i-th extracted data item, taking values in {0, 1}.
The logistic regression model is
$$P_{x_i}=\frac{\exp(\alpha_1+\beta' X)}{1+\exp(\alpha_1+\beta' X)},$$
where $X=(x_1,x_2,\ldots,x_m)$ is the feature vector of each extracted data item and $x_m$ is the m-th feature of the extracted data item.
Existing logistic regression prediction models are mostly applied directly to the data subset obtained after sampling. Because different types of data enter the sample with different probabilities, the distribution of the sample points no longer matches the population distribution. Under stratified sampling, this mismatch means that direct maximum likelihood estimation gives biased estimates of the model parameters and probabilities, so the predicted probability values are inaccurate. This embodiment applies a parameter compensation to logistic regression under stratified sampling. The compensation corrects the bias introduced into the maximum likelihood estimate, adapts the logistic regression prediction to the changed data distribution, and brings the predicted probability values closer to the true probabilities of occurrence.
The logistic regression model is nonlinear, so its parameters are usually estimated by maximum likelihood. It can be shown that, under random sampling, the maximum likelihood estimator of the logistic regression model is consistent, asymptotically efficient, and asymptotically normal. In many studies, however, sampling is not completely random but stratified, so parameter estimation for the logistic regression model under stratified sampling must be considered.
In logistic regression, the dependent variable $Y_i$ ($i=1,2,3,\ldots,n$) follows a Bernoulli distribution: it equals 1 with probability $P_i$ and 0 with probability $1-P_i$, and $P_i/(1-P_i)$ is the odds of the event. The vector $X_i$ ($i=1,2,3,\ldots,n$) is the vector representation of the i-th observed sample, and the constant K is the number of attributes of a sample.
$$Y_i \sim \mathrm{Bernoulli}(Y_i \mid P_i) \tag{1}$$
$$\ln\frac{P(Y_i=1)}{1-P(Y_i=1)}=\ln(\mathrm{odds})=\alpha_0+\sum_{k=1}^{K}\beta_k X_{ik} \tag{2}$$
The expression above is the log-odds. Exponentiating both sides expresses it as odds:
$$\mathrm{odds}=\frac{P(Y_i=1)}{1-P(Y_i=1)}=\exp\left(\alpha_0+\sum_{k=1}^{K}\beta_k X_{ik}\right) \tag{3}$$
$$=e^{\alpha_0+\sum_{k=1}^{K}\beta_k X_{ik}}=e^{\alpha_0}\prod_{k=1}^{K}e^{\beta_k X_{ik}}=e^{\alpha_0}\prod_{k=1}^{K}\left(e^{\beta_k}\right)^{X_{ik}} \tag{4}$$
For a particular application, a logistic regression model can be expressed in several alternative forms. Logistic regression is also computationally easy, many tools are available to estimate its parameters, and it performs well in practice. Note that once the odds or the log-odds are known, the corresponding probability of occurrence is easy to calculate.
$$P_{x_i}=\frac{\mathrm{odds}}{1+\mathrm{odds}}=\frac{\exp(\alpha_0+\beta' X)}{1+\exp(\alpha_0+\beta' X)} \tag{5}$$
where the unknown parameter $\alpha_0$ is a constant and $\beta'$ is a K × 1 vector with one component per independent variable. The model parameters are estimated by maximum likelihood:
$$L(\alpha_0,\beta')=\prod_{i=1}^{n}P_{x_i}^{Y_i}\left(1-P_{x_i}\right)^{1-Y_i} \tag{6}$$
For a random sample $(x_i,y_i)$, $i=1,2,\ldots,n$, taking the logarithm of both sides and using formula (2), the log-likelihood function simplifies to:
$$\ln(L(\alpha_0,\beta'))=\sum_{i=1}^{n}\left[y_i(\alpha_0+\beta' x_i)-\ln\left(1+\exp(\alpha_0+\beta' x_i)\right)\right] \tag{7}$$
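Formula (7) translates directly into code. The sketch below is illustrative only: the function name and the list-based data layout are assumptions made here, and the logarithm term is evaluated in a numerically stable form that is algebraically equal to ln(1 + exp(z)).

```python
import math


def log_likelihood(alpha0, beta, xs, ys):
    """Log-likelihood of formula (7) for a random sample.

    xs is a list of feature lists, ys a list of 0/1 labels,
    beta a list with one weight per feature.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        z = alpha0 + sum(b * v for b, v in zip(beta, x))
        # ln(1 + exp(z)), rewritten so that exp never overflows
        log_one_plus_exp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y * z - log_one_plus_exp
    return total
```

With all parameters set to zero every sample has probability 0.5, so each term contributes -ln 2, which is a convenient sanity check for an implementation.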
The values of the unknown parameters $\alpha_0$ and $\beta'$ are obtained from the following maximum likelihood equations:
$$\frac{\partial\ln[L(\alpha_0,\beta')]}{\partial\alpha_0}=\sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_0+\beta' x_i)}{1+\exp(\alpha_0+\beta' x_i)}\right]=0,\qquad \frac{\partial\ln[L(\alpha_0,\beta')]}{\partial\beta_j}=\sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_0+\beta' x_i)}{1+\exp(\alpha_0+\beta' x_i)}\right]x_{ij}=0,\quad j=1,2,3,\ldots,m. \tag{8}$$
Under random sampling, the maximum likelihood estimator of the logistic regression model is consistent, asymptotically efficient, and asymptotically normal. In some problems, however, sampling is not completely random but stratified. Under random sampling the distribution of the sample points is the same as the population distribution. Under stratified sampling, different types of data enter the sample with different probabilities, so the distribution of the sample points no longer matches the population distribution, and direct maximum likelihood estimation gives biased estimates of the model parameters and probabilities. This patent studies the estimation bias of the stratified-sampling logistic regression prediction model for imbalanced data sets and proposes a method to compensate for it.
In a population of N samples, the minority class contains $P_0 N$ samples and the majority class contains $(1-P_0)N$ samples. Stratified sampling draws $n_1$ samples from the minority class and $n_2$ samples from the majority class. Let $\lambda_0$ be the ratio of the minority class size to the majority class size in the population, $\lambda_0=P_0N/((1-P_0)N)=P_0/(1-P_0)$, and let $\lambda_1$ be the same ratio in the sample, $\lambda_1=n_1/n_2$. It can be derived that, for a stratified sample $(x_i,y_i)$, $i=1,2,\ldots,n$, the log-likelihood function is:
$$\ln[L(\alpha,\beta)]=\sum_{i=1}^{n}\left\{y_i\left[\ln\lambda_1+\ln P_{x_i}\right]+(1-y_i)\left[\ln\lambda_0+\ln(1-P_{x_i})\right]-\ln\left[\lambda_1P_{x_i}+(1-P_{x_i})\lambda_0\right]\right\}$$
$$=\sum_{i=1}^{n}y_i\ln\frac{\lambda_1}{\lambda_0}+\sum_{i=1}^{n}y_i\ln\frac{P_{x_i}}{1-P_{x_i}}-\sum_{i=1}^{n}\ln\left[\frac{\lambda_1}{\lambda_0}\cdot\frac{P_{x_i}}{1-P_{x_i}}+1\right] \tag{9}$$
Using formula (5), this becomes
$$\ln[L(\alpha_0,\beta')]=A+\sum_{i=1}^{n}\left\{y_i(\alpha_0+\beta' x_i)-\ln\left[1+\exp(\alpha_0+\lambda+\beta' x_i)\right]\right\} \tag{10}$$
where $\lambda=\ln(\lambda_1/\lambda_0)$ and $A=\sum_{i=1}^{n}y_i\ln(\lambda_1/\lambda_0)$ is a quantity that does not depend on the parameters to be estimated. Letting $\alpha_1=\alpha_0+\lambda$, the maximum likelihood estimates of the parameters $\alpha_1$ and $\beta'$ are obtained from the following system of equations:
$$\frac{\partial\ln[L(\alpha_0,\beta')]}{\partial\alpha_0}=\sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_1+\beta' x_i)}{1+\exp(\alpha_1+\beta' x_i)}\right]=0,\qquad \frac{\partial\ln[L(\alpha_0,\beta')]}{\partial\beta_j}=\sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_1+\beta' x_i)}{1+\exp(\alpha_1+\beta' x_i)}\right]x_{ij}=0,\quad j=1,2,3,\ldots,m. \tag{11}$$
Formula (11) gives the parameter estimation equations of the stratified-sample logistic regression model. Under random sampling the sample distribution matches the population distribution, so $\lambda_1=\lambda_0$, hence $\lambda=0$ and $\alpha_1=\alpha_0$, and formula (11) is identical to formula (8). Formula (11) can therefore be regarded as the generalization of formula (8) to stratified sampling.
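The estimating equations (11) have no closed-form solution, so they are solved numerically. The sketch below is one possible way to do so and is not taken from the patent: it uses plain gradient ascent with a fixed step size (an assumption; the patent mentions SPSS), and it then recovers the population intercept as α0 = α1 − λ, which is one reading of the compensation implied by α1 = α0 + λ. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(alpha, beta, xs, ys):
    """Left-hand sides of the estimating equations (11)."""
    g_alpha, g_beta = 0.0, [0.0] * len(beta)
    for x, y in zip(xs, ys):
        residual = y - sigmoid(alpha + sum(b * v for b, v in zip(beta, x)))
        g_alpha += residual
        for j, v in enumerate(x):
            g_beta[j] += residual * v
    return g_alpha, g_beta


def fit_stratified_sample(xs, ys, step=0.5, iterations=5000):
    """Estimate (alpha1, beta) on the stratified sample by gradient ascent.

    The step size and iteration count are illustrative and may need
    tuning for data on a different scale.
    """
    n, m = len(xs), len(xs[0])
    alpha1, beta = 0.0, [0.0] * m
    for _ in range(iterations):
        g_alpha, g_beta = score(alpha1, beta, xs, ys)
        alpha1 += step * g_alpha / n
        beta = [b + step * g / n for b, g in zip(beta, g_beta)]
    return alpha1, beta


def compensate_intercept(alpha1, lam):
    """Population intercept alpha0 = alpha1 - lambda (assumed reading)."""
    return alpha1 - lam
```

At the fitted values, `score` returns numbers close to zero, which is exactly the condition stated by formula (11). The slope estimates are left unchanged by the compensation; only the constant term is shifted.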
The estimation bias of the parameters and probabilities is analyzed theoretically below. Comparing formula (11) with formula (8) shows that, under stratified sampling, estimating the model with formula (8) treats the estimates of $\alpha_1$, $\beta'$ as estimates of $\alpha_0$, $\beta'$. This causes:
1) Bias in the estimate of the constant term
Since $\alpha_1=\alpha_0+\lambda$, the larger $\lambda$ is, the larger $\alpha_1$ is. Estimating the model from a stratified sample with formula (8) therefore gives a constant-term estimate that is positively correlated with $\lambda$. The estimate of the constant term depends on the sampling design: the larger $\lambda$ is in the stratified sample, the larger the estimated constant term.
2) Bias in the probability estimate
Let $Z=\alpha_0+\beta' X$. When $\lambda>0$, $\alpha_1>\alpha_0$, so replacing $\alpha_0$, $\beta'$ with $\alpha_1$, $\beta'$ increases $Z$ and therefore increases $P_{x_i}$. The probability of occurrence of the minority class is thus overestimated, and the larger $\lambda$ is, the larger the overestimate.
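A small worked example shows the size of the effect. The numbers are illustrative and are not taken from the patent: assume a data item with $\alpha_0+\beta' X=0$ and a sampling design with $\lambda=\ln 3$.

```latex
P_{\text{population}} = \frac{e^{0}}{1+e^{0}} = \frac{1}{2} = 0.5,
\qquad
P_{\text{uncompensated}} = \frac{e^{0+\ln 3}}{1+e^{0+\ln 3}} = \frac{3}{1+3} = 0.75 .
```

Under these assumptions the uncompensated model reports 0.75 for an item whose population probability is 0.5, and the gap widens as $\lambda$ grows.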
An imbalanced data set has two intrinsic factors: the imbalance ratio and the lack of information. The imbalance ratio, denoted S, is the ratio of the majority class size to the minority class size and expresses the degree of imbalance. The number of strata used in stratified sampling is denoted H. For imbalanced data sets, this patent proposes cluster-based stratification with the following sampling strategy: all minority class samples are kept, and from each stratum a number of majority class samples equal to the minority class size is drawn. Combining this sampling strategy with the discussion above gives
$$\lambda_0=\frac{P_0N}{(1-P_0)N}=\frac{P_0}{1-P_0}=\frac{1}{S},\qquad \lambda_1=\frac{n_1}{n_2}=\frac{1}{H},\qquad \lambda=\ln\frac{\lambda_1}{\lambda_0}=\ln\frac{S}{H} \tag{12}$$
Formula (12) shows that the larger the imbalance ratio S, the larger $\lambda$. For an imbalanced data set, the more severe the imbalance, the larger $\lambda$ and the larger the overestimation bias. Formula (12) also guides the stratification strategy: the more severe the imbalance, the more strata the stratified sampling should use, which reduces the overestimation bias.
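The sampling strategy and formula (12) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the majority class is assumed to be already split into H strata (for example by k-means), and a stratum smaller than the minority class simply contributes all of its members, a case the patent text does not address.

```python
import math
import random


def stratified_sample(minority, majority_strata, seed=0):
    """Keep every minority item and draw len(minority) items from each stratum."""
    rng = random.Random(seed)
    n1 = len(minority)
    sampled_majority = []
    for stratum in majority_strata:
        take = min(n1, len(stratum))  # assumption: small strata are taken whole
        sampled_majority.extend(rng.sample(stratum, take))
    return list(minority), sampled_majority


def compensation_offset(imbalance_ratio, num_strata):
    """lambda = ln(S / H) from formula (12)."""
    return math.log(imbalance_ratio / num_strata)
```

For instance, with 10 minority items and 90 majority items split into H = 3 strata, S = 9, the sample holds 10 minority and 30 majority items, and λ = ln(9/3) = ln 3. When H equals S, λ is zero and no compensation is needed.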
The invention has been implemented and applied successfully in answer extraction. Answer extraction is a subfield of information extraction research and a core component of a question answering system. It is what distinguishes a question answering system from a text retrieval system in the ordinary sense. Answer extraction is a typical binary classification problem: a candidate answer has only two possible states, it either is the answer or is not. In theory, such problems are well suited to analysis and processing by logistic regression. In practice, correct answers are far fewer than distractor answers, so the sample data are severely imbalanced. These characteristics are exactly what the prediction method based on cluster-based stratified sampling and compensated logistic regression proposed in this patent is designed for. The method is therefore used to extract correct answers in the answer extraction application of the InsunQA system.
The information retrieval part of the InsunQA system returns 70 relevant paragraphs for each question. These paragraphs may contain the correct answer to the question, and they also contain a large number of distractor answers. All candidate answers are represented as vectors of the extracted features, and each sample comprises 15 feature attribute values. The main purpose of answer extraction is to pick the correct answer out of the candidate answers. Answer extraction based on logistic regression is in effect a process of ranking candidate answers, which requires a ranking formula over the above feature attributes, namely the answer extraction formula.
Because the answer extraction data set is a typical imbalanced data set, the cluster-based stratified sampling method proposed here can be used to draw the sample set for estimating the answer extraction parameters. Using SPSS software together with the proposed bias compensation method, the parameter values $\alpha_0$ and $\beta'$ in formula (13) can then be obtained, where $\beta'$ is the set of feature weights. Once the values of $\alpha_0$ and $\beta'$ are known, the prediction formula is:
$$P_{x_i}=\frac{e^{Z}}{1+e^{Z}},\qquad Z=\alpha_1+\beta_1x_1+\ldots+\beta_mx_m \tag{14}$$
Formula (14) is thus the generated answer extraction formula. It predicts the probability that a candidate answer is the correct answer. The candidate answers can then be ranked by probability value, and the candidate with the largest probability is taken as the final correct answer.
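The ranking step described above can be sketched in Python. The function names, the `(name, features)` layout of a candidate, and the example weights are hypothetical. Only the probability formula follows formula (14).

```python
import math


def predicted_probability(intercept, weights, features):
    """P = e^Z / (1 + e^Z) with Z = intercept + sum(weight * feature), as in (14)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    if z >= 0:                      # stable evaluation for large |Z|
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rank_candidates(intercept, weights, candidates):
    """Return (probability, name) pairs sorted from most to least probable.

    `candidates` is a list of (name, feature_list) pairs.
    """
    scored = [
        (predicted_probability(intercept, weights, features), name)
        for name, features in candidates
    ]
    return sorted(scored, reverse=True)
```

The first element of the returned list corresponds to the candidate that would be selected as the final answer.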

Claims (3)

1. An imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression, characterized in that it comprises the following steps:
Step 1: cluster the sample set to be predicted with the k-means algorithm to obtain K classes of data;
Step 2: perform stratified sampling on the K classes of data to extract n data items;
Step 3: perform maximum likelihood estimation on the parameters of the stratified-sample logistic regression model to obtain the parameter estimation equations of the model, and thereby determine the stratified-sample logistic regression model;
Step 4: input the n extracted data items into the stratified-sample logistic regression model to determine whether the sample set to be predicted is an imbalanced data set.
2. The imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression according to claim 1, characterized in that, in step 1, clustering the sample set to be predicted with the k-means algorithm to obtain K classes of data comprises:
Step 1.1: randomly select K data items from the sample set to be predicted, each serving as the center of one class;
Step 1.2: assign each of the other data items in the sample set to be predicted to the class whose center is nearest;
Step 1.3: for each class, compute the mean attribute values of all data items in the class and take these means as the new center of the class;
Step 1.4: reassign every data item in the sample set to be predicted to the class whose new center is nearest, and check whether the resulting classes are the same as those of the previous assignment. If they are the same, stop and output the K classes of data; otherwise, return to step 1.3.
3. The imbalanced data prediction method based on cluster-based stratified sampling and compensated logistic regression according to claim 1, characterized in that, in step 3,
the parameter estimation equations of the stratified-sample logistic regression model are
$$\sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_1+\beta' x_i)}{1+\exp(\alpha_1+\beta' x_i)}\right]=0,\qquad \sum_{i=1}^{n}\left[y_i-\frac{\exp(\alpha_1+\beta' x_i)}{1+\exp(\alpha_1+\beta' x_i)}\right]x_{ij}=0,\quad j=1,2,3,\ldots,m,$$
where $\alpha_1$ and $\beta'$ are the unknown parameters of the stratified-sample logistic regression model, $\beta'$ is a 1 × m vector, $\beta'=(\beta_1,\ldots,\beta_m)^T$, $x_{ij}$ is the j-th feature of the i-th extracted data item, m is the number of features of each extracted data item, $i=1,2,3,\ldots,n$, and $y_i$ is the predicted value of the i-th extracted data item, taking values in {0, 1};
the logistic regression model is
$$P_{x_i}=\frac{\exp(\alpha_1+\beta' X)}{1+\exp(\alpha_1+\beta' X)},$$
where $X=(x_1,x_2,\ldots,x_m)$ is the feature vector of each extracted data item and $x_m$ is the m-th feature of the extracted data item.
CN201410341930.3A 2014-07-17 2014-07-17 Imbalance data predicting method based on cluster stratified sampling compensation logic regression Pending CN104102716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410341930.3A CN104102716A (en) 2014-07-17 2014-07-17 Imbalance data predicting method based on cluster stratified sampling compensation logic regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410341930.3A CN104102716A (en) 2014-07-17 2014-07-17 Imbalance data predicting method based on cluster stratified sampling compensation logic regression

Publications (1)

Publication Number Publication Date
CN104102716A true CN104102716A (en) 2014-10-15

Family

ID=51670870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410341930.3A Pending CN104102716A (en) 2014-07-17 2014-07-17 Imbalance data predicting method based on cluster stratified sampling compensation logic regression

Country Status (1)

Country Link
CN (1) CN104102716A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014060753A1 (en) * 2012-10-16 2014-04-24 Randox Laboratories Ltd. Diagnosis and risk stratification of bladder cancer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
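Zhang Jikai (张吉凯) et al.: "Application of clustering in stratified analysis in epidemiology" (聚类在流行病学分层分析中的应用), 《基础理论与方法》 *
Peng Shoukang (彭寿康): "Logistic regression model under stratified sampling" (分层抽样条件下Logistic回归模型), 《统计研究》 (Statistical Research) *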

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636447A (en) * 2015-01-21 2015-05-20 上海天呈医流科技股份有限公司 Intelligent evaluation method and system for medical instrument B2B website users
CN104636447B (en) * 2015-01-21 2017-12-29 上海天呈医流科技股份有限公司 A kind of intelligent Evaluation method and system towards medicine equipment B2B websites user
CN106407706A (en) * 2016-09-29 2017-02-15 北京理工大学 Boruta algorithm-based multi-level old people physical state quantization level calculation method
CN106982230A (en) * 2017-05-10 2017-07-25 深信服科技股份有限公司 A kind of flow rate testing methods and system
CN110458199A (en) * 2019-07-16 2019-11-15 中国传媒大学 Based on the kohonen neural network clustering methods of sampling

Similar Documents

Publication Publication Date Title
Li et al. An extended cellular automaton using case‐based reasoning for simulating urban development in a large complex region
CN107633265A (en) For optimizing the data processing method and device of credit evaluation model
Wang et al. Identifying dominant factors for the calibration of a land-use cellular automata model using Rough Set Theory
CN109242149A (en) A kind of student performance early warning method and system excavated based on educational data
Bai et al. A forecasting method of forest pests based on the rough set and PSO-BP neural network
CN112417176B (en) Method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics
Verma et al. Fuzzy association rule mining based model to predict students’ performance
CN105354208A (en) Big data information mining method
CN104102716A (en) Imbalance data predicting method based on cluster stratified sampling compensation logic regression
CN107729555A (en) A kind of magnanimity big data Distributed Predictive method and system
Vultureanu-Albişi et al. Improving students’ performance by interpretable explanations using ensemble tree-based approaches
Xu et al. CET-4 score analysis based on data mining technology
Gross et al. Systemic test and evaluation of a hard+ soft information fusion framework: Challenges and current approaches
CN106815320B (en) Investigation big data visual modeling method and system based on expanded three-dimensional histogram
Ntoutsi et al. A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees
bin Othman et al. Neuro fuzzy classification and detection technique for bioinformatics problems
Kumari et al. Analyzing the Factors Influencing the Waiting Time to First Citation and Long-Term Impact of Publications.
Pramudita et al. Optimization Analysis of Neural Network Algorithms Using Bagging Techniques on Classification of Date Fruit Types
Park et al. A new forecasting system using the latent dirichlet allocation (LDA) topic modeling technique
Yun et al. [Retracted] Quality Evaluation and Satisfaction Analysis of Online Learning of College Students Based on Artificial Intelligence
Zanellati et al. Representation of learning in the post-digital: students’ dropout predictive models with artificial intelligence algorithms
Prasad et al. Analysis and prediction of crime against woman using machine learning techniques
Aher et al. Prediction of course selection by student using combination of data mining algorithms in E-learning
Abdulrahman et al. Machine Learning in Nonlinear Material Physics
Truscott et al. Detecting shadow economy sizes with symbolic regression

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141015

RJ01 Rejection of invention patent application after publication