CN102033965A - Method and system for classifying data based on classification model - Google Patents


Info

Publication number
CN102033965A
Authority
CN
China
Prior art keywords
classification
sample data
sample
data
underlined
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110009286
Other languages
Chinese (zh)
Inventor
黄林
黄学柱
杨宏彬
朱香友
刘安舒
夏洪涛
孙曙
张俊
张华�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HAIHUI FINANCE INVESTMENT GROUP Co Ltd
Original Assignee
ANHUI HAIHUI FINANCE INVESTMENT GROUP Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ANHUI HAIHUI FINANCE INVESTMENT GROUP Co Ltd
Priority to CN 201110009286
Publication of CN102033965A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for classifying data based on a classification model. The method comprises the following steps of: receiving target sample data to be analyzed, wherein the target sample data carries values for marking each attribute; extracting the value of an effective attribute of the target sample data, wherein the effective attribute is determined according to a preset classification function; substituting the value of the effective attribute into the classification function to acquire a target sample data classification value; and determining a data category to which the target sample data belongs according to the classification value, wherein the classification function is established in the mode of setting a category mark for unmarked sample data in a first primary sample set according to the category mark of marked sample data in the first primary sample set, forming a second primary sample set by using the marked sample data and the unmarked sample data with the category mark and determining the classification function by using a supervised classification model according to the second primary sample set. Through the scheme provided by the invention, the accuracy of classification of the target sample data to be analyzed can be effectively improved.

Description

A data classification method and system based on a classification model
Technical field
The present invention relates to the field of data mining technology, and in particular to a data classification method and system based on a classification model.
Background art
Data mining is now widely used in fields such as finance, retail, and telecommunications, and the classification model is one of the main models of a data mining system. A classification model reduces the sample data of an original sample set to a classification function, which can then be used to analyze new target sample data and thereby classify them. Put simply, substituting the sample data of the original sample set into the classification model determines the classification function. Once the classification function is determined, substituting the information of the target sample data to be analyzed into it yields the category to which the target sample data belong, so that sample data of different categories can be handled in different ways.
In the prior art, whether the sample data in the original sample set carry category marks decides whether a supervised classification model (for example, a decision tree, a neural network, or logistic regression) or an unsupervised classification model (for example, clustering or principal component analysis) is used to obtain the classification function. A supervised classification model requires every sample in the original sample set to carry a category mark, that is, the data category of every sample has already been determined; an unsupervised classification model requires every sample in the original sample set to be unmarked. In practice, however, an original sample set often contains both marked and unmarked sample data. If only the unmarked sample data are used with an unsupervised classification model and the marked sample data are ignored, the resulting classification function is inaccurate; if only the marked sample data are used with a supervised classification model, the resulting classification function is likewise not accurate enough. Prior-art classification models that do accept both marked and unmarked sample data, such as the semi-supervised K-means clustering model, use the marked sample data only at initialization and then follow the ordinary clustering procedure. They do not make full use of the marked sample data when determining the classification function, which greatly affects classification accuracy.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a data classification method and system based on a classification model, so as to improve the accuracy with which target sample data to be analyzed are classified. The technical scheme is as follows:
A data classification method based on a classification model comprises:
receiving target sample data to be analyzed, the target sample data carrying values that identify each of its attributes;
extracting the values of the effective attributes of the target sample data, the effective attributes being determined according to a preset classification function;
substituting the values of the effective attributes into the classification function to obtain a classification value of the target sample data;
determining, according to the classification value of the target sample data, the data category to which the target sample data belong;
wherein the preset classification function is built as follows:
setting category marks for the unmarked sample data in a first original sample set according to the category marks of the marked sample data in the first original sample set;
taking the marked sample data and the unmarked sample data for which category marks have been set as a second original sample set;
determining the classification function from the second original sample set by using a supervised classification model.
A data classification system based on a classification model comprises:
a receiving module, an extraction module, a computing module, a category determination module, and a classification function building module;
the receiving module is configured to receive target sample data to be analyzed, the target sample data carrying values that identify each of its attributes;
the extraction module is configured to extract the values of the effective attributes of the target sample data received by the receiving module, the effective attributes being determined according to the classification function built in advance by the classification function building module;
the computing module is configured to substitute the values of the effective attributes extracted by the extraction module into the classification function to obtain the classification value of the target sample data;
the category determination module is configured to determine, according to the classification value obtained by the computing module, the data category to which the target sample data belong;
the classification function building module is configured to build the classification function and specifically comprises:
a category mark setting submodule, configured to set category marks for the unmarked sample data in the first original sample set according to the category marks of the marked sample data in the first original sample set;
a sample set determination submodule, configured to take the marked sample data and the unmarked sample data for which category marks have been set as the second original sample set;
a classification function determination submodule, configured to determine the classification function from the second original sample set determined by the sample set determination submodule by using a supervised classification model.
In the technical scheme provided by the embodiments of the present invention, the marked sample data are used to convert the unmarked sample data into marked sample data, so that all sample data in the original sample set become marked; these marked sample data are then used as the input of a supervised classification model to determine the classification function. Because category marks are set for the unmarked sample data according to the category marks of the marked sample data, the classification function built by the supervised classification model makes full use of the marked sample data while effectively incorporating the unmarked sample data, and its accuracy improves. Using this classification function to classify target sample data to be analyzed can effectively improve classification accuracy.
Description of drawings
To illustrate the technical schemes of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of building the classification function according to an embodiment of the present invention;
Fig. 2 is a flowchart of a data classification method based on a classification model according to an embodiment of the present invention;
Fig. 3 is another flowchart of a data classification method based on a classification model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data classification system based on a classification model according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the classification function building module according to an embodiment of the present invention.
Detailed description of the embodiments
For ease of reference and understanding, the semi-supervised K-means clustering model and the logistic model are first described below.
Before introducing these two models, note the following: the attributes of a sample are pieces of information that identify that sample. The value of an attribute is a number assigned to the attribute according to its information content, for convenience of calculation. Attributes that play a decisive role in determining the data category of a sample are the effective attributes of that sample. For example, when an enterprise is treated as one sample, its attributes may include the enterprise's financial information, administrator information, basic enterprise information, and so on. When judging the credibility category of the enterprise, if the financial information and administrator information play the decisive role, then they are the effective attributes of that sample.
1. Semi-supervised K-means clustering model:
The original sample set used by this model contains both marked and unmarked sample data. The basic idea is to generate the initial cluster seeds from the marked sample data and to use the marked sample data to constrain the clustering process. The basic steps are as follows:
1) Center initialization:
The marked sample data in the original sample set determine the cluster centers. Suppose the original sample set contains N marked samples that belong to K data categories (that is, there are K category marks, each representing a different data category), and that every category contains at least one marked sample, so that K clusters (sets) are eventually generated. The initial mean of the center point of each cluster is obtained from the means of all attribute values of the marked samples in that cluster. The mean of one attribute over the marked samples of a cluster can be obtained with the following formula:
$$u_K = \frac{1}{|S_K|} \sum_{x \in S_K} x$$
where $S_K$ denotes the set of marked samples of the K-th category, $|S_K|$ denotes the number of marked samples in that cluster, $x$ denotes the value of a given attribute of a marked sample in that cluster, and $u_K$ denotes the mean of that attribute over all marked samples in the cluster.
Applying the formula above to each attribute gives the mean of every attribute of the marked samples in each cluster, that is, one mean per attribute per cluster. The initial mean of a cluster's center point is the collection of these per-attribute means.
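The center initialization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `initial_centers`, the list-of-lists sample layout, and the toy values are hypothetical and do not come from the patent.

```python
from collections import defaultdict

def initial_centers(marked_samples, marks):
    """Return {category mark: list of per-attribute means} for the marked samples."""
    groups = defaultdict(list)
    for sample, mark in zip(marked_samples, marks):
        groups[mark].append(sample)
    centers = {}
    for mark, members in groups.items():
        count = len(members)  # |S_K|
        # One mean u_K per attribute: column-wise average over S_K
        centers[mark] = [sum(column) / count for column in zip(*members)]
    return centers

# Hypothetical toy data: three marked samples with two attributes each
marked = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
marks = [0, 0, 1]
centers = initial_centers(marked, marks)
print(centers)  # {0: [2.0, 3.0], 1: [10.0, 10.0]}
```

Each center is simply the vector of per-attribute means, which matches the description of the initial mean as a collection of attribute means.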
2) Assigning the unmarked sample data to the K clusters:
Compute the distance from each unmarked sample to each cluster, that is, to the center point of each cluster, and assign the sample to the cluster at the shortest distance. For the distance calculation, a sample can be treated as a point with multiple attribute values, so the Euclidean distance can be used. For two points $X_i$ and $X_j$, the distance $d(i, j)$ between them is computed as follows:
$$d(i, j) = \sqrt{\sum_{t=1}^{m} (X_{it} - X_{jt})^2}$$
where each point has m attributes and t indexes the attributes of a point, t = 1, 2, 3, ..., m.
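As a small illustration of this step, the sketch below computes the Euclidean distance and picks the nearest cluster. The helper names `euclidean` and `nearest_cluster` and the example centers are assumptions made for the illustration only.

```python
import math

def euclidean(xi, xj):
    # d(i, j): square root of the summed squared attribute differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def nearest_cluster(sample, centers):
    # centers maps a cluster id to its center point; return the closest id
    return min(centers, key=lambda k: euclidean(sample, centers[k]))

# Hypothetical centers for two clusters
centers = {0: [2.0, 3.0], 1: [10.0, 10.0]}
distance = euclidean([0.0, 0.0], [3.0, 4.0])
chosen = nearest_cluster([4.0, 4.0], centers)
print(distance, chosen)  # 5.0 0
```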
3) Reassigning all sample data:
Once all unmarked samples have been assigned, both the unmarked and the marked sample data are distributed over the K clusters. The distance from every sample to the center point of each cluster is then recomputed, and each sample is assigned to the cluster at the shortest distance. This step is repeated; after a certain number of iterations, every sample in the original sample set is assigned to some cluster and the mean of the center point of each cluster is finally determined.
With the method above, the sample data in the original sample set are assigned to their final clusters by semi-supervised K-means clustering, and the mean of the center point of each cluster is obtained. When new target sample data arrive, the values of their attributes are used to compute the distance to the center point of each cluster, and the target sample data are assigned to the nearest cluster, which determines the data category to which they belong.
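The three steps can be put together in a short sketch. It is not the patent's implementation: the function names, the stopping rule (stop when assignments no longer change, or after `max_iter` passes), and the toy data are assumptions. In particular, recomputing each center as the mean of its current members is the usual K-means update and is assumed here, since the text only says that the center means are "finally determined".

```python
import math

def dist(a, b):
    # Euclidean distance between two attribute vectors
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean(points):
    # Per-attribute mean of a non-empty list of points
    return [sum(column) / len(points) for column in zip(*points)]

def nearest(sample, centers):
    return min(centers, key=lambda k: dist(sample, centers[k]))

def semi_supervised_kmeans(marked, marks, unmarked, max_iter=20):
    # Step 1: initial centers from the marked samples of each category
    centers = {k: mean([x for x, m in zip(marked, marks) if m == k])
               for k in sorted(set(marks))}
    samples = marked + unmarked
    # Step 2: marked samples start in their own cluster, unmarked go to the nearest
    assignment = list(marks) + [nearest(x, centers) for x in unmarked]
    for _ in range(max_iter):
        # Assumed update: each center becomes the mean of its current members
        for k in centers:
            members = [x for x, a in zip(samples, assignment) if a == k]
            if members:
                centers[k] = mean(members)
        # Step 3: reassign every sample, marked or not, to its nearest center
        new_assignment = [nearest(x, centers) for x in samples]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centers, assignment

# Hypothetical toy data
marked = [[0.0, 0.0], [10.0, 10.0]]
marks = [0, 1]
unmarked = [[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]]
centers, assignment = semi_supervised_kmeans(marked, marks, unmarked)
target_cluster = nearest([8.0, 8.0], centers)  # a new target sample
print(assignment, target_cluster)  # [0, 1, 0, 1, 0] 1
```

The last two lines mirror the final paragraph above: a new target sample is placed in the cluster whose center is closest.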
2. Logistic model:
All sample data in the original sample set required by the logistic model are marked, that is, every sample already has a definite data category. Applying these marked sample data to the logistic model yields the coefficients of the corresponding classification function and the effective attributes of the sample data; together, the coefficients and effective attributes determine a unique classification function. To classify new target sample data, the values of the effective attributes determined by the classification function are extracted from the target sample data and substituted into the classification function; the resulting classification value determines the category to which the target sample data belong.
The calculation principle of the logistic model is briefly as follows:
The coefficients of the classification function of the logistic model are usually estimated by the maximum likelihood (ML) method. Its basic idea is to first set up the likelihood function and the log-likelihood function, and then solve for the coefficient values that maximize the log-likelihood function; the resulting estimates are called the maximum likelihood estimates of the coefficients.
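To make the maximum likelihood idea concrete, the sketch below fits a one-attribute logistic model by plain gradient ascent on the log-likelihood. The optimizer, learning rate, step count, and toy data are all assumptions chosen for the illustration; the patent does not say how the maximization is carried out, and statistical packages typically use other solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(coef, xs, ys):
    # Sum over samples of y*log(p) + (1 - y)*log(1 - p)
    b0, b1 = coef
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, rate=0.1, steps=5000):
    # Gradient ascent: move the coefficients uphill on the log-likelihood
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)
            g0 += error
            g1 += error * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

# Hypothetical marked samples: one attribute value and a 0/1 category mark
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
coef = fit_logistic(xs, ys)
print(log_likelihood((0.0, 0.0), xs, ys), log_likelihood(coef, xs, ys))
```

The second printed value is larger than the first, showing that the fitted coefficients explain the marked samples better than the all-zero starting point.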
In the prior art, neither supervised nor unsupervised classification models are suited to an original sample set that contains both marked and unmarked sample data. Classification models such as semi-supervised K-means clustering can handle such an original sample set, but they use the marked sample data only at the start and then follow the ordinary clustering procedure. Because marked sample data have more mining value than unmarked sample data, failing to make full use of the marked sample data when determining the classification function reduces the accuracy of the classification function and, in turn, the accuracy with which target sample data to be analyzed are classified. The data classification method and system based on a classification model provided by the present invention make full use of the marked sample data in combination with the unmarked sample data, and can therefore effectively improve classification accuracy. The data classification method based on a classification model is introduced first.
A data classification method based on a classification model comprises:
receiving target sample data to be analyzed, the target sample data carrying values that identify each of its attributes;
extracting the values of the effective attributes of the target sample data, the effective attributes being determined according to a preset classification function;
substituting the values of the effective attributes into the classification function to obtain a classification value of the target sample data;
determining, according to the classification value of the target sample data, the data category to which the target sample data belong;
wherein the preset classification function is built as follows:
setting category marks for the unmarked sample data in a first original sample set according to the category marks of the marked sample data in the first original sample set;
taking the marked sample data and the unmarked sample data for which category marks have been set as a second original sample set;
determining the classification function from the second original sample set by using a supervised classification model.
In the above scheme, the marked sample data are used to convert the unmarked sample data into marked sample data, so that all sample data in the original sample set become marked; these marked sample data are then used as the input of a supervised classification model to determine the classification function. Because category marks are set for the unmarked sample data according to the category marks of the marked sample data, the classification function built by the supervised classification model makes full use of the marked sample data while effectively incorporating the unmarked sample data, and its accuracy improves. Using this classification function to classify target sample data to be analyzed can effectively improve classification accuracy.
The data classification method based on a classification model provided by the present invention is described in detail below, taking binary classification as an example (that is, the data categories of the sample data are good and bad, where the category mark of good sample data is 0 and that of bad sample data is 1).
In the data classification method based on a classification model provided by the present invention, the classification function is built as follows:
S101: set category marks for the unmarked sample data in the first original sample set according to the category marks of the marked sample data in the first original sample set;
Semi-supervised K-means clustering can be used to set the category marks for the unmarked sample data in the first original sample set. The detailed process is as follows:
The marked sample data in the first original sample set are assigned to L sets, each containing at least one marked sample. The value of L can be chosen according to actual conditions, or set to 2 according to the number of data categories.
The initial center value of the center point of each set is obtained from the mean of each attribute of the marked sample data; the calculation was described in the introduction to semi-supervised K-means clustering and is not repeated here.
Using the value of each attribute of the unmarked sample data, the distance from each unmarked sample to each of the L sets is computed, and the unmarked sample is assigned to the nearest set;
The distance from every sample to the center point of each set is recomputed, and each sample is reassigned to a set accordingly;
The above steps are repeated to determine the final set to which each sample belongs.
According to the category marks of the marked sample data in each set, category marks are set for the unmarked sample data in that set, so that all sample data in the first original sample set become marked sample data.
Setting the category marks of the unmarked sample data in a set according to the category marks of the marked sample data in that set proceeds as follows:
obtain the proportions of marked sample data in the set whose category mark is 0 and whose category mark is 1;
obtain the category mark that accounts for the largest proportion of the marked sample data in the set;
set the category mark of the unmarked sample data in the set to that category mark.
In other words, if the proportion of good sample data in a set is greater than the proportion of bad sample data, the unmarked sample data in that set are given category mark 0; otherwise, they are given category mark 1.
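The proportion rule above amounts to a majority vote among the marked samples of each set. A minimal sketch follows, assuming the set membership of every sample is already known; the names `majority_mark` and `fill_marks`, the use of `None` for unmarked samples, and the handling of a set with no marked samples are all assumptions.

```python
def majority_mark(marks_in_set):
    # Mark 0 only if good samples strictly outnumber bad ones; otherwise mark 1
    good = marks_in_set.count(0)
    bad = marks_in_set.count(1)
    return 0 if good > bad else 1

def fill_marks(set_ids, marks):
    # marks[i] is 0 or 1 for a marked sample and None for an unmarked one
    filled = list(marks)
    for s in set(set_ids):
        known = [m for sid, m in zip(set_ids, marks) if sid == s and m is not None]
        if not known:
            continue  # assumption: a set without marked samples is left unmarked
        chosen = majority_mark(known)
        for i, (sid, m) in enumerate(zip(set_ids, marks)):
            if sid == s and m is None:
                filled[i] = chosen
    return filled

# Hypothetical final set membership and marks for seven samples
set_ids = [0, 0, 0, 0, 1, 1, 1]
marks = [0, 0, 1, None, 1, None, None]
filled = fill_marks(set_ids, marks)
print(filled)  # [0, 0, 1, 0, 1, 1, 1]
```

Marked samples keep their original marks; only the unmarked ones receive the majority mark of their set.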
It will be understood that a person skilled in the art may set the category marks of the unmarked sample data in the first original sample set in other ways, and is not limited to the semi-supervised K-means clustering provided in this embodiment.
S102: take the marked sample data and the unmarked sample data for which category marks have been set as the second original sample set;
Because category marks have been set for the unmarked sample data in the first original sample set, they are treated as marked sample data, so all sample data in the second original sample set are marked.
S103: determine the classification function from the second original sample set by using a supervised classification model.
Because all sample data in the second original sample set are marked, a supervised classification model can be used to determine the classification function. All sample data in the second original sample set can be substituted directly into the supervised classification model to do so. In another embodiment of the present invention, the logistic model is used as the supervised classification model, and the classification function is determined from part of the sample data of the second original sample set. The detailed process is:
S103a: extract the marked sample data of the second original sample set and a preset proportion of the unmarked sample data for which category marks have been set as the training set, and use the remaining sample data as the validation set;
The conventional approach is to split all sample data of the sample set at random into a training set and a validation set in a preset proportion. However, the category marks of the originally unmarked sample data in the second original sample set were derived from the category marks of the marked sample data, so the marked sample data have the greater mining value. When dividing the sample data, therefore, the specific practice can be: choose all of the marked sample data and a certain proportion of the unmarked sample data for which category marks have been set as the training set, and choose all of the marked sample data and the remaining unmarked sample data for which category marks have been set as the validation set. For example, all of the marked sample data and 70% of the unmarked sample data for which category marks have been set can form the training set, and all of the marked sample data and the other 30% of the unmarked sample data for which category marks have been set can form the validation set. Other proportions can of course be set according to actual conditions.
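The split described in this paragraph can be sketched as follows. The function name `split_second_set`, the fixed random seed, and the tuple-based toy samples are assumptions. The sketch follows the wording of the paragraph literally, so the originally marked samples appear in both the training set and the validation set.

```python
import random

def split_second_set(marked, pseudo_marked, ratio=0.7, seed=0):
    # pseudo_marked: originally unmarked samples that have been given a category mark
    rng = random.Random(seed)
    shuffled = list(pseudo_marked)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * ratio)
    training = list(marked) + shuffled[:cut]     # all marked + 70% of pseudo-marked
    validation = list(marked) + shuffled[cut:]   # all marked + remaining 30%
    return training, validation

# Hypothetical sample identifiers
marked = [("m", i) for i in range(10)]
pseudo = [("p", i) for i in range(30)]
training, validation = split_second_set(marked, pseudo)
print(len(training), len(validation))  # 31 19
```

Only the pseudo-marked samples are shuffled and divided; the marked samples are never held out of training.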
S103b: determine the classification function from the training set obtained from the second original sample set by using the supervised classification model.
The classification function is determined from the training set as follows:
substitute the sample data of the training set drawn from the second original sample set into the logistic model to obtain the coefficient values of the corresponding function and the effective attributes that play a decisive role in the category of the sample data;
determine the classification function from the coefficient values and the effective attributes so obtained.
In the resulting classification function, the effective attributes are the independent variables; that is, substituting the values of a sample's effective attributes into the corresponding independent variables gives the value of the classification function.
Further, the validation set can be used to verify the accuracy of the classification function.
Once the classification function for binary classification has been determined, it can be used to predict the category of new sample data. As shown in Fig. 2, a data classification method based on a classification model provided by the present invention comprises:
S201: receive target sample data to be analyzed, the target sample data carrying values that identify each of its attributes;
The target sample data carry the values of all of their attributes, including the values of the effective attributes that play a decisive role in their category.
S202: extract the values of the effective attributes of the target sample data;
Because the preset classification function has already determined which attributes are effective and which merely serve as identifying information, the values of the effective attributes of the target sample data can be extracted directly as the values of the independent variables of the classification function.
S203: substitute the values of the effective attributes into the classification function as the values of its independent variables to obtain the classification value of the target sample data;
Because the classification function has been determined, the value of each effective attribute can be substituted directly into the corresponding independent variable to obtain the classification value. In this embodiment, this gives the probability that the category mark of the target sample data is 1 and the probability that it is 0.
S204: determine, according to the computed classification value, the data category to which the target sample data belong, and classify them accordingly.
In this embodiment, if the probability that the category mark of the target sample data is 1 is greater than the probability that it is 0, the target sample data are bad sample data; otherwise, they are good sample data.
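Steps S201 to S204 can be illustrated with a short sketch. It assumes the classification function has the standard logistic form, with a linear combination of the effective attributes passed through the logistic function; the attribute names, coefficient values, and intercept below are hypothetical and are not taken from the patent.

```python
import math

def classify(sample, coefficients, intercept):
    """sample: attribute name -> value; coefficients: effective attribute -> coefficient."""
    # S202: only the effective attributes (the keys of `coefficients`) are extracted
    z = intercept + sum(c * sample[name] for name, c in coefficients.items())
    # S203: assumed logistic form gives P(mark = 1); P(mark = 0) is its complement
    p_bad = 1.0 / (1.0 + math.exp(-z))
    p_good = 1.0 - p_bad
    # S204: bad (1) if P(mark = 1) exceeds P(mark = 0), otherwise good (0)
    return (1 if p_bad > p_good else 0), p_bad

# Hypothetical effective attributes and coefficients
coefficients = {"debt_ratio": 4.0, "years_in_business": -0.5}
intercept = -1.0

risky = {"debt_ratio": 0.9, "years_in_business": 2, "registration_id": 77}
sound = {"debt_ratio": 0.1, "years_in_business": 10, "registration_id": 78}
risky_mark, risky_p = classify(risky, coefficients, intercept)
sound_mark, sound_p = classify(sound, coefficients, intercept)
print(risky_mark, sound_mark)  # 1 0
```

The `registration_id` attribute is carried by each sample but ignored, which reflects the distinction between effective attributes and attributes that only identify the sample.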
In the scheme provided by the present invention, the marked sample data are used to convert the unmarked sample data into marked sample data, so that all sample data in the original sample set become marked; these marked sample data are then used as the input of a supervised classification model to determine the classification function. Because category marks are set for the unmarked sample data according to the category marks of the marked sample data, the classification function built by the supervised classification model makes full use of the marked sample data while effectively incorporating the unmarked sample data, and its accuracy improves. Using this classification function to classify target sample data to be analyzed can effectively improve classification accuracy.
The method provided by the present invention is described below with a concrete application example, again using binary classification.
Among the enterprise information data of applications for guarantee received by a guarantee company, some enterprises have a guarantee record with the company. For an enterprise with a guarantee record, if bad behavior such as a guarantee default has occurred, its credit value is defined as 1 (risky); otherwise its credit is considered good and its credit value is defined as 0. Other enterprises have no definite guarantee record. The enterprise information in the database covers basic enterprise information, financial information, administrator information, credit reference information, and other data. After data preprocessing, the first original sample set used by the classification model to determine the classification function is formed.
The first original sample set contains 4000 samples, of which 1000 are marked and 3000 are unmarked. The credit of the marked sample data is divided into two grades, 0 and 1 (0 means good credit, 1 means risky credit); among the marked sample data the ratio of 0 to 1 is about 9:1. Each sample has 100 attributes.
First, the classification function is built from the first original sample set. The detailed process is:
1) Set category marks for the unmarked sample data and form the second original sample set together with the marked sample data:
The marked sample data in the first original sample set are randomly assigned to 8 sets;
The unmarked sample data are assigned to these 8 sets;
After a certain number of iterations, the final set to which each sample data item belongs is determined;
The ratio of 0 to 1 among the marked sample data in each set is calculated, and a category mark is set for each unmarked sample data item accordingly;
The marked sample data and the unmarked sample data that have been given category marks together form the second original sample set;
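Step 1) can be sketched as follows. The text does not name the iterative assignment procedure, so this sketch assumes a k-means-style reassignment to the nearest set center. It also assumes that every sample (not only the marked ones) starts in a random set, that each unmarked sample takes the most frequent mark among the marked samples in its set, and that a set with no marked samples falls back to the overall majority mark. The function name `pseudo_label` and its parameters are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of attribute vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def pseudo_label(marked, marks, unmarked, k=8, iterations=10, seed=0):
    """Return one category mark for each unmarked sample (assumed procedure)."""
    rng = random.Random(seed)
    points = list(marked) + list(unmarked)
    # Assumption: every sample starts in a randomly chosen set.
    assign = [rng.randrange(k) for _ in points]
    for _ in range(iterations):
        centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            # Assumption: an empty set is re-seeded with a random sample.
            centers.append(mean(members) if members else list(rng.choice(points)))
        # Move each sample to the set whose center is nearest.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
    fallback = Counter(marks).most_common(1)[0][0]
    n_marked = len(marked)
    new_marks = []
    for i in range(len(unmarked)):
        c = assign[n_marked + i]
        # Count the marks of the marked samples that ended up in the same set.
        votes = Counter(m for m, a in zip(marks, assign[:n_marked]) if a == c)
        new_marks.append(votes.most_common(1)[0][0] if votes else fallback)
    return new_marks


# Toy data: two well-separated groups, two sets instead of eight.
marked = [(0, 0), (0, 1), (100, 100), (100, 101)]
marks = [0, 0, 1, 1]
unmarked = [(0.5, 0.5), (100.5, 100.5)]
print(pseudo_label(marked, marks, unmarked, k=2))
```

With the roughly 9:1 ratio of 0 to 1 described in the example, a strict majority rule tends to favor mark 0, so a real implementation might weight the ratios differently; the text leaves this open.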
2) Determine the classification function using logistic regression:
All sample data in the second original sample set are divided into a training set and a validation set. All of the marked sample data and 70% of the unmarked sample data that have been given category marks are chosen as the training set, and all of the marked sample data and 30% of the unmarked sample data that have been given category marks are chosen as the validation set. The training set contains 3100 sample data items, and the validation set contains 900 sample data items.
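The training-set arithmetic (1000 marked items plus 70% of the 3000 items that received category marks gives 3100) can be checked with a short sketch. The helper name `build_training_set`, the random shuffle, and the placeholder records are assumptions for illustration; the text does not say how the 70% is selected.

```python
import random


def build_training_set(marked, pseudo_marked, ratio=0.7, seed=0):
    """Combine all marked items with a preset share of the pseudo-marked items."""
    rng = random.Random(seed)
    shuffled = list(pseudo_marked)
    rng.shuffle(shuffled)  # assumption: the preset share is drawn at random
    cut = round(len(shuffled) * ratio)
    training = list(marked) + shuffled[:cut]
    remaining = shuffled[cut:]  # the other 30% of the pseudo-marked items
    return training, remaining


# Placeholder records standing in for the 1000 marked and 3000 unmarked items.
marked = [("marked", i) for i in range(1000)]
pseudo_marked = [("pseudo", i) for i in range(3000)]
training, remaining = build_training_set(marked, pseudo_marked)
print(len(training), len(remaining))  # 3100 900
```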
The logistic model is run on the training set to obtain the coefficients of the classification function corresponding to the logistic model and the effective attributes of the sample data;
Let $Z_1, Z_2, \ldots, Z_m$ be the variables corresponding to the effective attributes of the sample data, where $m$ is the number of effective attributes. Let $Q$ be the class in the binary classification, i.e. $Q$ takes the value 0 or 1. $P(Q=1)$ denotes the probability that $Q=1$, $P(Q=0)$ denotes the probability that $Q=0$, and $P(Q=1)+P(Q=0)=1$. The classification function can then be expressed in the following form:
$$\ln \frac{P(Q=1)}{P(Q=0)} = \alpha_0 + \alpha_1 Z_1 + \alpha_2 Z_2 + \cdots + \alpha_m Z_m$$
where $\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_m$ are the coefficients of the classification function, which the logistic model determines from the sample data of the training set.
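The log-odds form of the classification function can be solved for the two probabilities, which is what the prediction step needs. The short derivation below uses only the symbols already defined, except for the shorthand $z$, which is introduced here for convenience and does not appear in the original text.

```latex
\begin{aligned}
z &= \alpha_0 + \alpha_1 Z_1 + \alpha_2 Z_2 + \cdots + \alpha_m Z_m \\
\ln\frac{P(Q=1)}{1 - P(Q=1)} = z
  \quad&\Longrightarrow\quad
  P(Q=1) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \\
P(Q=0) &= 1 - P(Q=1) = \frac{1}{1 + e^{z}} \\
P(Q=1) > P(Q=0) &\iff z > 0
\end{aligned}
```

Comparing the two probabilities is therefore equivalent to checking the sign of the linear combination of the effective attributes.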
After the classification function has been built, new sample data can be predicted. As shown in Fig. 3, the method comprises:
S301: Receive target sample data to be analyzed, the target sample data carrying the values of its 100 attributes;
S302: Extract the values of the effective attributes of the target sample data and use them as the values of the independent variables $Z_1, Z_2, \ldots, Z_m$;
S303: Substitute the values of the effective attributes into the corresponding independent variables of the determined classification function to obtain the probability that Q=1 and the probability that Q=0;
S304: Determine the category to which the target sample data belongs according to the probability values.
If P(Q=1) > P(Q=0), the credit value of the target sample data is low and the enterprise is risky; otherwise, the credit of this sample data is good.
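Steps S302 to S304 can be sketched in a few lines. The attribute names, coefficient values, and the function name `classify` below are hypothetical; in practice the effective attributes and coefficients would come from the fitted logistic model.

```python
import math


def classify(sample, effective_attrs, coefficients, intercept):
    """Return (category, P(Q=1), P(Q=0)) for one target sample."""
    # S302: keep only the values of the effective attributes.
    z_values = [sample[name] for name in effective_attrs]
    # S303: substitute the values into the classification function (log-odds),
    # then convert the log-odds into the two probabilities.
    log_odds = intercept + sum(a * z for a, z in zip(coefficients, z_values))
    p1 = 1.0 / (1.0 + math.exp(-log_odds))
    p0 = 1.0 - p1
    # S304: category 1 (risky) if P(Q=1) > P(Q=0), otherwise category 0.
    category = 1 if p1 > p0 else 0
    return category, p1, p0


# Hypothetical effective attributes and coefficients, for illustration only.
effective_attrs = ["debt_ratio", "years_in_business"]
coefficients = [3.0, -0.5]
intercept = -1.0

risky = {"debt_ratio": 0.9, "years_in_business": 2, "unused_attribute": 5}
sound = {"debt_ratio": 0.1, "years_in_business": 10, "unused_attribute": 5}
print(classify(risky, effective_attrs, coefficients, intercept)[0])  # 1
print(classify(sound, effective_attrs, coefficients, intercept)[0])  # 0
```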
In this embodiment of the invention, when the classification function is built, category marks are first set for the unmarked sample data according to the category marks of the marked sample data, so that each unmarked sample is marked as sample data with poor credit or sample data with good credit. The classification function is then built by the supervised classification model from the marked sample data and the unmarked sample data that have been given category marks. This makes full use of the marked sample data and improves the accuracy of the classification function. When a new enterprise is taken as target sample data and classified with this classification function, the accuracy of classification is effectively improved.
It should be understood that the technical scheme has been described in detail using binary classification as an example, but the invention is not limited to binary classification; it is equally applicable when there are more than two data categories.
Corresponding to the method embodiment above, an embodiment of the invention also provides a data classification system based on a classification model. As shown in Fig. 4, the system comprises:
a receiving module 410, an extraction module 420, a computing module 430, a category determination module 440, and a classification function construction module 450;
The receiving module 410 is configured to receive target sample data to be analyzed, the target sample data carrying the value of each of its attributes;
The extraction module 420 is configured to extract the values of the effective attributes of the target sample data received by the receiving module 410, the effective attributes being determined according to the classification function preset by the classification function construction module 450;
The computing module 430 is configured to substitute the values of the effective attributes extracted by the extraction module 420 into the classification function to obtain the classification value of the target sample data;
The category determination module 440 is configured to determine, according to the classification value of the target sample data obtained by the computing module 430, the data category to which the target sample data belongs;
The classification function construction module 450 is configured to build the classification function and, as shown in Fig. 5, specifically comprises:
a category mark setting submodule 451, configured to set category marks for the unmarked sample data in the first original sample set according to the category marks of the marked sample data in the first original sample set;
a sample set determination submodule 452, configured to take the marked sample data and the unmarked sample data that have been given category marks as the second original sample set;
a classification function determination submodule 453, configured to determine the classification function using a supervised classification model according to the second original sample set determined by the sample set determination submodule 452.
The category mark setting submodule 451 comprises:
an allocation unit, configured to assign the marked sample data and the unmarked sample data in the first original sample set to different preset sets, the different sets corresponding to different data categories;
a mark setting unit, configured to set category marks for the unmarked sample data in a set according to the category marks of the marked sample data in that set.
The mark setting unit specifically comprises:
a proportion distribution obtaining subunit, configured to obtain the proportion distribution of the marked sample data of the different data categories in the set;
a category mark obtaining subunit, configured to obtain, from the proportion distribution obtained by the proportion distribution obtaining subunit, the category mark of the marked sample data with the largest proportion;
a category mark setting subunit, configured to set the category mark of the unmarked sample data in the set according to the category mark of the marked sample data with the largest proportion obtained by the category mark obtaining subunit.
The classification function determination submodule 453 comprises:
a sample extraction unit, configured to extract the marked sample data of the second original sample set and a preset proportion of the unmarked sample data that have been given category marks as the training set;
a classification function determining unit, configured to determine the classification function using a supervised classification model according to the training set extracted by the sample extraction unit.
Further, the classification function determining unit specifically comprises:
a first classification function determining subunit, configured to substitute the sample data set corresponding to the training set extracted by the sample extraction unit into the supervised classification model to obtain the coefficients of the classification function corresponding to the supervised classification model and the effective attributes of the sample data;
a second classification function determining subunit, configured to determine the classification function according to the coefficients and the effective attributes of the sample data obtained by the first classification function determining subunit.
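The way submodules 451 to 453 feed one another can be sketched as a simple composition of three steps. This is only an illustration of the data flow: the function names are hypothetical, and the two stand-in callables (`majority_marks`, `count_by_mark`) are deliberately trivial placeholders rather than the set-based marking or the supervised model described in the text.

```python
from collections import Counter


def build_classification_function(marked, marks, unmarked, set_marks, fit):
    """Chain the three construction steps; set_marks and fit are supplied by the caller."""
    # Submodule 451: set category marks for the unmarked sample data.
    new_marks = set_marks(marked, marks, unmarked)
    # Submodule 452: form the second original sample set.
    second_set = list(zip(marked, marks)) + list(zip(unmarked, new_marks))
    # Submodule 453: determine the classification function from that set.
    return fit(second_set)


def majority_marks(marked, marks, unmarked):
    """Placeholder for 451: give every unmarked item the overall majority mark."""
    top = Counter(marks).most_common(1)[0][0]
    return [top for _ in unmarked]


def count_by_mark(second_set):
    """Placeholder for 453: report how many items carry each mark."""
    return Counter(mark for _, mark in second_set)


result = build_classification_function(
    [[1], [2], [3]], [0, 0, 1], [[4], [5]], majority_marks, count_by_mark
)
print(result[0], result[1])  # 4 1
```

Passing the marking and fitting steps in as callables mirrors the statement that the division into units is a logical one: either step can be replaced without touching the other.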
As the system embodiment corresponds substantially to the method embodiment, the relevant parts may be understood by reference to the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways without departing from the spirit and scope of the present application. The present embodiments are merely exemplary and should not be taken as limiting, and the specific content given should in no way limit the purpose of the application. For example, the division into units or subunits is only a division by logical function; in actual implementation there may be other divisions, for example a plurality of units or subunits may be combined. In addition, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed.
In addition, the described system, apparatus and method, and the schematic diagrams of the different embodiments, may be combined or integrated with other systems, modules, techniques or methods without departing from the scope of the application. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The above is only a specific embodiment of the present invention. It should be pointed out that those skilled in the art may make improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the invention.

Claims (10)

1. A data classification method based on a classification model, characterized by comprising:
receiving target sample data to be analyzed, the target sample data carrying the value of each of its attributes;
extracting the values of the effective attributes of the target sample data, the effective attributes being determined according to a preset classification function;
substituting the values of the effective attributes into the classification function to obtain a classification value of the target sample data;
determining, according to the classification value of the target sample data, the data category to which the target sample data belongs;
wherein the preset classification function is built by:
setting category marks for the unmarked sample data in a first original sample set according to the category marks of the marked sample data in the first original sample set;
taking the marked sample data and the unmarked sample data that have been given category marks as a second original sample set;
determining the classification function using a supervised classification model according to the second original sample set.
2. The method according to claim 1, characterized in that setting category marks for the unmarked sample data in the first original sample set according to the category marks of the marked sample data in the first original sample set specifically comprises:
assigning the marked sample data and the unmarked sample data in the first original sample set to different preset sets, the different sets corresponding to different data categories;
setting category marks for the unmarked sample data in a set according to the category marks of the marked sample data in that set.
3. The method according to claim 2, characterized in that setting category marks for the unmarked sample data in a set according to the category marks of the marked sample data in that set specifically comprises:
obtaining the proportion distribution of the marked sample data of the different data categories in the set;
obtaining the category mark of the marked sample data with the largest proportion in the set;
setting the category mark of the unmarked sample data in the set according to the category mark of the marked sample data with the largest proportion.
4. The method according to claim 1, characterized in that determining the classification function using a supervised classification model according to the second original sample set specifically comprises:
extracting the marked sample data of the second original sample set and a preset proportion of the unmarked sample data that have been given category marks as a training set;
determining the classification function using the supervised classification model according to the training set.
5. The method according to claim 4, characterized in that determining the classification function using the supervised classification model according to the training set specifically comprises:
substituting the sample data set corresponding to the training set into the supervised classification model to obtain the coefficients of the classification function corresponding to the supervised classification model and the effective attributes of the sample data;
determining the classification function according to the coefficients and the effective attributes of the sample data.
6. A data classification system based on a classification model, characterized by comprising: a receiving module, an extraction module, a computing module, a category determination module, and a classification function construction module;
the receiving module is configured to receive target sample data to be analyzed, the target sample data carrying the value of each of its attributes;
the extraction module is configured to extract the values of the effective attributes of the target sample data received by the receiving module, the effective attributes being determined according to the classification function built in advance by the classification function construction module;
the computing module is configured to substitute the values of the effective attributes extracted by the extraction module into the classification function to obtain a classification value of the target sample data;
the category determination module is configured to determine, according to the classification value of the target sample data obtained by the computing module, the data category to which the target sample data belongs;
the classification function construction module is configured to build the classification function and specifically comprises:
a category mark setting submodule, configured to set category marks for the unmarked sample data in a first original sample set according to the category marks of the marked sample data in the first original sample set;
a sample set determination submodule, configured to take the marked sample data and the unmarked sample data that have been given category marks as a second original sample set;
a classification function determination submodule, configured to determine the classification function using a supervised classification model according to the second original sample set determined by the sample set determination submodule.
7. The system according to claim 6, characterized in that the category mark setting submodule comprises:
an allocation unit, configured to assign the marked sample data and the unmarked sample data in the first original sample set to different preset sets, the different sets corresponding to different data categories;
a mark setting unit, configured to set category marks for the unmarked sample data in a set according to the category marks of the marked sample data in that set.
8. The system according to claim 7, characterized in that the mark setting unit specifically comprises:
a proportion distribution obtaining subunit, configured to obtain the proportion distribution of the marked sample data of the different data categories in the set;
a category mark obtaining subunit, configured to obtain, from the proportion distribution obtained by the proportion distribution obtaining subunit, the category mark of the marked sample data with the largest proportion;
a category mark setting subunit, configured to set the category mark of the unmarked sample data in the set according to the category mark of the marked sample data with the largest proportion obtained by the category mark obtaining subunit.
9. The system according to claim 6, characterized in that the classification function determination submodule comprises:
a sample extraction unit, configured to extract the marked sample data of the second original sample set and a preset proportion of the unmarked sample data that have been given category marks as a training set;
a classification function determining unit, configured to determine the classification function using the supervised classification model according to the training set extracted by the sample extraction unit.
10. The system according to claim 9, characterized in that the classification function determining unit specifically comprises:
a first classification function determining subunit, configured to substitute the sample data set corresponding to the training set extracted by the sample extraction unit into the supervised classification model to obtain the coefficients of the classification function corresponding to the supervised classification model and the effective attributes of the sample data;
a second classification function determining subunit, configured to determine the classification function according to the coefficients and the effective attributes of the sample data obtained by the first classification function determining subunit.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110009286 CN102033965A (en) 2011-01-17 2011-01-17 Method and system for classifying data based on classification model

Publications (1)

Publication Number Publication Date
CN102033965A true CN102033965A (en) 2011-04-27

Family

ID=43886858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110009286 Pending CN102033965A (en) 2011-01-17 2011-01-17 Method and system for classifying data based on classification model

Country Status (1)

Country Link
CN (1) CN102033965A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1873642A (en) * 2006-04-29 2006-12-06 上海世纪互联信息***有限公司 Searching engine with automating sorting function
WO2007059272A1 (en) * 2005-11-15 2007-05-24 Microsoft Corporation Information classification paradigm
CN101216845A (en) * 2008-01-03 2008-07-09 彭智勇 Database automatic classification method
US20080189257A1 (en) * 2007-02-01 2008-08-07 Microsoft Corporation World-wide classified listing search with translation

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102739522A (en) * 2012-06-04 2012-10-17 华为技术有限公司 Method and device for classifying Internet data streams
CN105138527A (en) * 2014-05-30 2015-12-09 华为技术有限公司 Data classification regression method and data classification regression device
CN105138527B (en) * 2014-05-30 2019-02-12 华为技术有限公司 A kind of data classification homing method and device
CN105320677A (en) * 2014-07-10 2016-02-10 香港中文大学深圳研究院 Method and device for training streamed unbalance data
CN105447018A (en) * 2014-08-20 2016-03-30 阿里巴巴集团控股有限公司 Method and apparatus for verifying web page classification model
CN105447018B (en) * 2014-08-20 2019-06-28 阿里巴巴集团控股有限公司 Verify the method and device of Web page classifying model
CN104765726B (en) * 2015-04-27 2018-07-31 湘潭大学 A kind of data classification method based on information density
CN104765726A (en) * 2015-04-27 2015-07-08 湘潭大学 Data classification method based on information density
WO2017041651A1 (en) * 2015-09-09 2017-03-16 阿里巴巴集团控股有限公司 User data classification method and device
CN106529110A (en) * 2015-09-09 2017-03-22 阿里巴巴集团控股有限公司 Classification method and equipment of user data
CN106911591A (en) * 2017-03-09 2017-06-30 广东顺德中山大学卡内基梅隆大学国际联合研究院 The sorting technique and system of network traffics
CN106953766A (en) * 2017-03-31 2017-07-14 北京奇艺世纪科技有限公司 A kind of alarm method and device
CN107203755A (en) * 2017-05-31 2017-09-26 中国科学院遥感与数字地球研究所 It is a kind of to increase new methods, devices and systems automatically for remote sensing images time series marker samples
CN107203755B (en) * 2017-05-31 2021-08-03 中国科学院遥感与数字地球研究所 Method, device and system for automatically adding new time sequence mark samples of remote sensing images
CN110019790B (en) * 2017-10-09 2023-08-22 阿里巴巴集团控股有限公司 Text recognition, text monitoring, data object recognition and data processing method
CN110019790A (en) * 2017-10-09 2019-07-16 阿里巴巴集团控股有限公司 Text identification, text monitoring, data object identification, data processing method
WO2019223582A1 (en) * 2018-05-24 2019-11-28 Beijing Didi Infinity Technology And Development Co., Ltd. Target detection method and system
CN109800139A (en) * 2018-12-18 2019-05-24 东软集团股份有限公司 Server health degree analysis method, device, storage medium and electronic equipment
CN109949181B (en) * 2019-03-22 2021-05-25 华立科技股份有限公司 Power grid type judgment method and device based on KNN proximity algorithm
CN109949181A (en) * 2019-03-22 2019-06-28 华立科技股份有限公司 The power grid type judgement method and device of algorithm are closed on based on KNN
CN109993234A (en) * 2019-04-10 2019-07-09 百度在线网络技术(北京)有限公司 A kind of unmanned training data classification method, device and electronic equipment
CN110363359A (en) * 2019-07-23 2019-10-22 中国联合网络通信集团有限公司 A kind of occupation prediction technique and system
CN112232628A (en) * 2020-09-07 2021-01-15 国网宁夏电力有限公司经济技术研究院 Power transmission and transformation project cost data arrangement and deepening application system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110427