CN107247972A - Crowdsourcing-based classification model training method - Google Patents
Crowdsourcing-based classification model training method Download PDF Info
- Publication number
- CN107247972A CN107247972A CN201710511119.9A CN201710511119A CN107247972A CN 107247972 A CN107247972 A CN 107247972A CN 201710511119 A CN201710511119 A CN 201710511119A CN 107247972 A CN107247972 A CN 107247972A
- Authority
- CN
- China
- Prior art keywords
- sample
- training
- label information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a crowdsourcing-based classification model training method. The level at which each user provides labels is estimated from the crowdsourced label information of a small number of samples; using the observed annotator levels as a prior, the labels used for the training samples are determined; a classification model is trained on the training samples and their labels; the model is used to select the sample that minimizes the expected model error and to predict that sample's class; the selected sample is added to the training set together with the label provided by the user with the highest annotation level on that class. These steps are iterated on the updated training set until the accuracy of the classification model or the number of training samples reaches a preset standard. The effect of the invention is to avoid the adverse influence that low-quality labels from low-level annotators have on classification model training, ensuring that a classification model with high generalization ability can be trained in a crowdsourcing environment.
Description
Technical field
The present invention relates to a classification model training method.
Background technology
At present, under the supervised learning framework in machine learning, training a classification model requires collecting in advance a set of data samples with label information. The quality and quantity of the collected training data directly determine the generalization ability of the classification model. Traditional training data collection requires experts with professional domain knowledge to provide the unique correct label for each data sample, which ensures that the trained classification model has good generalization ability.
The challenge facing this traditional approach is that, in realistic tasks, personnel with professional backgrounds are scarce, so obtaining sample labels is costly and time-consuming. Thus, with the development of network and data storage technology, using crowdsourcing to quickly obtain a large number of cheap labels for training samples has become one of the effective ways to reduce the time and economic cost of label acquisition.
In a crowdsourcing environment, the label acquisition task for training data is not completed by traditional professionals but is outsourced, on a voluntary basis, to an unspecified crowd on the network: non-professional individuals complete the labeling task quickly and cheaply, either independently or cooperatively. Because crowdsourced labels come from multiple online users, it is difficult to guarantee the quality of the collected labels; moreover, lacking the correct labels provided by professionals as a "gold standard", it is also difficult to measure the experience of these users and their accuracy in completing labeling tasks. Directly training a classification model with crowdsourced labels can seriously harm the model's generalization ability.
The content of the invention
It is an object of the present invention to provide a crowdsourcing-based classification model training method that overcomes the influence of low-quality labels on the model training process and ensures that, in a crowdsourcing environment, a classification model with high generalization ability can be learned at minimum labeling cost.
The object of the present invention is achieved as follows:

Given m collected samples and the crowdsourced label information provided by k users, the method proceeds according to the following steps:

Step 1: randomly select n samples and their corresponding crowdsourced label information from the collected samples and their crowdsourced label data.

Step 2: build the training dataset, setting each initial label y_i from the crowdsourced labels of sample x_i: y_i = 1 when the corresponding condition holds, otherwise y_i = 0.

Step 3: on the training dataset, learn a classification model with parameter w.

Step 4: given the group of labels that the j-th user provides on class c, the user's annotation level on that class is

    \theta_j^c = \mathrm{Beta}(\theta_j^c \mid \alpha_j^1, \alpha_j^2), \quad c \in \{0, 1\}

where \alpha_j^1 and \alpha_j^2 denote the numbers of correct and incorrect labels provided by that user, respectively.

Step 5: according to the annotation levels of the users who provided labels for sample x_i, the label used for training is estimated by

    y_i = \arg\max_{y \in \{0,1\}} p(y \mid x_i; w) \cdot \prod_{j=1}^{k} p(y_i^j \mid y; \theta_j)

Step 6: use the classification model to predict the classes of the remaining m - n samples, and compute the expected model error for each sample:

    x^* = \arg\min_{x \in U} \sum_{y \in \{0,1\}} p(y \mid x; w) \cdot I(D, x)

where U denotes the set of remaining samples and I(D, x) denotes the error of the classification model after sample x is added to the training set.

Step 7: select the class with p(y \mid x^*; w) > 0.5, and add the sample x^* to the training dataset together with the label y^* provided by the user with the highest annotation level on that class.

Step 8: repeat Steps 3 to 7 until the generalization accuracy of the classification model or the number of training samples reaches a preset standard.
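Steps 4 and 5 above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the function names, the flat Beta(1, 1) prior behind the smoothing, and the use of a single scalar level per user (rather than one per class) are all assumptions.

```python
def annotator_level(correct, incorrect):
    """Step 4 sketch: posterior mean of Beta(theta | alpha1, alpha2),
    where alpha1/alpha2 count the user's correct/incorrect labels.
    The +1/+2 smoothing corresponds to an assumed flat Beta(1, 1) prior."""
    return (correct + 1.0) / (correct + incorrect + 2.0)


def fuse_label(p_model, user_labels, levels):
    """Step 5 sketch: y_i = argmax_y p(y | x_i; w) * prod_j p(y_i^j | y; theta_j).

    p_model     -- model probability p(y = 1 | x_i; w)
    user_labels -- crowd labels y_i^j in {0, 1} for sample x_i
    levels      -- theta_j: each user's probability of labeling correctly
    """
    score = {1: p_model, 0: 1.0 - p_model}
    for y in (0, 1):
        for y_j, theta in zip(user_labels, levels):
            # p(y_i^j | y; theta_j): theta_j if the crowd label agrees with y
            score[y] *= theta if y_j == y else 1.0 - theta
    return max(score, key=score.get)
```

For example, a sample the model scores at p(y = 1 | x; w) = 0.9, labeled 1 by two reliable users and 0 by a weaker one, is fused to y_i = 1: the high-level users' agreement outweighs the dissenting low-level label.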
The present invention can also include:

1. The method for computing the annotation level \theta_j^c of the j-th user on class c is: on the training samples and their crowdsourced label set, initialize the label y_i of each sample x_i, i.e. y_i = 1 when the corresponding condition on the crowdsourced labels holds, otherwise y_i = 0; according to the labels y_i, initialize the annotation level set θ of the users; using it as the prior distribution, the labels of the training data are estimated as follows:

    C = \sum_{i=1}^{n} \sum_{y_i \in \{0,1\}} p(y_i \mid x_i; w) \cdot \prod_{j=1}^{k} p(y_i^j \mid y_i; \theta)

where p(y_i \mid x_i; w) is the classification model's probabilistic estimate of the class of the training sample. According to the estimated y_i, update the user annotation levels and retrain the classification model; repeat estimating y_i and updating the user annotation levels until the likelihood function C converges, at which point this process terminates.

2. The method for computing the error I(D, x) of the classification model after sample x is added to the training set is:

    I(D, x) = -\sum_{u=1}^{|U'|} \sum_{k} p_u^k \log p_u^k

where U' denotes the sample set obtained by deleting the sample from the remaining sample set; meanwhile, after sample x and its label are added to the training set, p_u^k is the prediction probability that the learned classification model assigns to sample x_u, with x_u ∈ U' and y_u ∈ {0, 1}.
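The error term I(D, x) defined in point 2 is the total prediction entropy over the remaining pool U'. A binary-classification sketch follows, with the caveat that the probabilities must come from the model retrained after tentatively adding (x, y) to the training set; `expected_entropy` and `pool_probs` are illustrative names, not terms from the patent.

```python
import math

def expected_entropy(pool_probs):
    """I(D, x) = -sum_{u=1}^{|U'|} sum_k p_u^k * log(p_u^k).

    pool_probs -- p(y_u = 1 | x_u) for every x_u in U', as predicted by the
    model retrained with the candidate sample x (and a tentative label) added.
    """
    total = 0.0
    for p1 in pool_probs:
        for p in (p1, 1.0 - p1):   # the two classes k in {0, 1}
            if p > 0.0:            # 0 * log(0) is taken as 0
                total -= p * math.log(p)
    return total
```

The candidate x* is then the pool sample minimizing Σ_y p(y | x; w) · I(D, x): confident predictions over U' (probabilities near 0 or 1) give low entropy, so the criterion favors samples whose addition makes the model most certain about the rest of the pool.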
The present invention proposes a crowdsourcing-based classification model training method. The annotation level of each user is estimated from the labels that user provides, and the labels used for training the classification model are determined from the label quality of multiple users, reducing the negative effect of low-quality labels from low-level users on classification model training. A training dataset is built from a selected subset of samples, improving the generalization ability of the classification model and ensuring its effectiveness in practical tasks.

The purpose of the present invention is to use the multiple low-quality labels obtained by crowdsourcing to learn a classification model with high generalization ability at minimum labeling cost. Beneficial effects of the present invention: the invention uses the multiple labels provided by crowdsourcing users for the training samples to estimate each user's annotation level on each class; using the annotation levels of multiple users as a prior, it estimates the sample labels used to train the classification model, overcoming the influence of low-quality labels on the model training process. According to the model's class predictions for the remaining samples, the sample that minimizes the expected model error is selected and added to the training sample set; at the same time, the label provided by the user with the highest annotation level on the predicted class is added to the training label set, overcoming the interference of low-quality samples with classification model training. This ensures that, in a crowdsourcing environment, a classification model with high generalization ability is learned at minimum labeling cost.
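The whole procedure can be summarized as an active-learning loop. The sketch below uses hypothetical helper callables (`train`, `predict_proba`, `expected_error`, `best_user_label`, `accuracy`) standing in for the steps defined earlier; none of these names come from the patent, and the sketch is a structural outline rather than a definitive implementation.

```python
def crowd_train(labeled, pool, helpers, max_size, target_acc):
    """Iterate: train, pick the pool sample minimizing expected error,
    label it with the best user's label for its predicted class, repeat."""
    train, predict_proba, expected_error, best_user_label, accuracy = helpers
    model = train(labeled)
    while len(labeled) < max_size and accuracy(model) < target_acc:
        # x* = argmin_x sum_y p(y | x; w) * I(D, x) over the remaining pool
        x_star = min(pool, key=lambda x: sum(
            predict_proba(model, x, y) * expected_error(labeled, x, y)
            for y in (0, 1)))
        # predicted class: the one with p(y | x*; w) > 0.5
        y_hat = 1 if predict_proba(model, x_star, 1) > 0.5 else 0
        # label from the user with the highest annotation level on that class
        labeled.append((x_star, best_user_label(x_star, y_hat)))
        pool.remove(x_star)
        model = train(labeled)           # retrain on the enlarged set
    return model
```

The loop terminates on whichever bound is hit first, mirroring the patent's stopping rule (model accuracy or training-set size reaching a preset standard).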
Brief description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 compares the precision of the invention against a classification model trained with the correct labels;
Fig. 3 compares the user annotation levels estimated by the invention against the users' true annotation levels.
Embodiment
The present invention is described in more detail below.
According to the flow chart of the crowdsourcing-based classification model training process of the present invention, the concrete steps are as follows:

1) Randomly select n (n < m) samples and the corresponding crowdsourced label information from the collected data.

2) Determine the label y_i of each training sample x_i from its crowdsourced labels: y_i = 1 when the corresponding condition holds, otherwise y_i = 0.

3) Learn a classification model with parameter w using the training samples and their labels.
4) On the training samples and their crowdsourced label set, estimate the annotation level set θ of the users according to the labels y_i, where, given the group of labels that the j-th user provides on class c, the user's annotation level is

    \theta_j^c = \mathrm{Beta}(\theta_j^c \mid \alpha_j^1, \alpha_j^2), \quad c \in \{0, 1\}

where \alpha_j^1 and \alpha_j^2 denote the numbers of correct and incorrect labels provided by that user, respectively. Using θ as the prior distribution, the labels of the training data are estimated as follows:

    C = \sum_{i=1}^{n} \sum_{y_i \in \{0,1\}} p(y_i \mid x_i; w) \cdot \prod_{j=1}^{k} p(y_i^j \mid y_i; \theta)

where p(y_i \mid x_i; w) is the classification model's probabilistic estimate of the class of the training sample. According to the estimated y_i, update the user annotation levels and retrain the classification model; repeat estimating y_i and updating the user annotation levels until the likelihood function C converges, at which point this process terminates.
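The alternation in step 4, estimating labels given annotator levels and re-estimating levels given labels until the likelihood C stops changing, can be sketched as below. For brevity the sketch keeps one level per user instead of one per class, omits the retraining of the classification model between alternations, and the smoothed counting update, initial levels, and tolerance are assumptions.

```python
import math

def em_levels(crowd_labels, model_probs, tol=1e-6, max_iter=100):
    """Alternate label estimation and annotator-level updates until the
    likelihood C = sum_i sum_y p(y|x_i; w) * prod_j p(y_i^j | y; theta_j)
    converges.

    crowd_labels -- crowd_labels[i][j] in {0, 1}: label of user j on sample i
    model_probs  -- model_probs[i] = p(y_i = 1 | x_i; w) from the current model
    """
    n, k = len(crowd_labels), len(crowd_labels[0])
    levels = [0.7] * k                  # initial annotation levels (assumed)
    prev_ll = -math.inf
    labels = []
    for _ in range(max_iter):
        # estimate each training label given the current levels
        labels, ll = [], 0.0
        for i in range(n):
            score = {1: model_probs[i], 0: 1.0 - model_probs[i]}
            for y in (0, 1):
                for j in range(k):
                    agree = crowd_labels[i][j] == y
                    score[y] *= levels[j] if agree else 1.0 - levels[j]
            labels.append(max(score, key=score.get))
            ll += math.log(score[0] + score[1])
        # update each level as the smoothed fraction of agreements with labels
        for j in range(k):
            hits = sum(crowd_labels[i][j] == labels[i] for i in range(n))
            levels[j] = (hits + 1.0) / (n + 2.0)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return labels, levels
```

On a small example with three simulated users, the users who consistently agree with the fused labels end up with high estimated levels, while a user who contradicts them drifts toward 0.5.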
5) According to the annotation levels of the users who provided labels for sample x_i, update the label used for training:

    y_i = \arg\max_{y \in \{0,1\}} p(y \mid x_i; w) \cdot \prod_{j=1}^{k} p(y_i^j \mid y; \theta_j)
6) Use the classification model to predict the classes of the remaining m - n samples, and compute the expected model error of each sample:

    x^* = \arg\min_{x \in U} \sum_{y \in \{0,1\}} p(y \mid x; w) \cdot I(D, x)

where U denotes the set of remaining samples and U' denotes the sample set after deleting x from the remaining sample set. Meanwhile, after sample x and its label are added to the training set, p_u^k is the prediction probability that the learned classification model assigns to sample x_u, with x_u ∈ U' and y_u ∈ {0, 1}.
7) Select the class with p(y \mid x^*; w) > 0.5, and add the sample x^* to the training dataset together with the label y^* provided by the user with the highest annotation level on that class.

8) Repeat steps 3) to 7) until the generalization accuracy of the classification model or the number of training samples reaches a preset standard.
Because, in the training process of the classification model, the crowdsourced label of each selected sample is provided only by the annotator with the highest annotation level on the predicted class, and the annotators' levels are then used as a prior to correct the labels used for the training samples, the quality of the labels used during classification model training is guaranteed, reducing the negative effect of labels provided by low-level annotators on the training process. Secondly, during training the samples are chosen for the classification model itself, ensuring that the model can make full use of the information contained in the samples.
In the simulation of this algorithm, 5 users were used to simulate crowdsourced labels; the training dataset contained 40 training samples, the unlabeled dataset contained 1000 unlabeled samples, and the test dataset contained 1000 samples. Fig. 2 shows the ROC precision of the classification model learned under the proposed crowdsourced labeling conditions compared with learning under correct labels. Fig. 3 shows the comparison between the user levels estimated by the proposed method from the crowdsourced labels and the users' true levels.
Although the crowdsourcing-based classification model training method of the present invention has been described with reference to an embodiment, the invention is not restricted thereto. Various modifications made within the spirit and principles of the present invention shall be included within the scope defined by the claims of the invention.
Claims (3)
1. one kind is based on mass-rent technology classification model training method, it is characterized in that in m collected sample and by k use
Family provide mass-rent markup information beUnder conditions of, carry out in accordance with the following steps:
Step one, n sample and its corresponding mass-rent mark are randomly selected from collected sample and its mass-rent labeled data
Information
Step 2, builds training datasetWherein, whenWhen, yi=1, otherwise, yi=0;
Step 3, in training datasetThe disaggregated model that upper one parameter of study is w;
Step 4, one group of markup information that j-th of user provides on classification cOn mark levelFor
    \theta_j^c = \mathrm{Beta}(\theta_j^c \mid \alpha_j^1, \alpha_j^2), \quad c \in \{0, 1\}
where \alpha_j^1 and \alpha_j^2 denote the numbers of correct and incorrect labels provided by that user, respectively;
Step 5: according to the annotation levels of the users who provided labels for sample x_i, the label used for training is estimated by
    y_i = \arg\max_{y \in \{0,1\}} p(y \mid x_i; w) \cdot \prod_{j=1}^{k} p(y_i^j \mid y; \theta_j);
Step 6: use the classification model to predict the classes of the remaining m - n samples, and compute the expected model error of each sample, as follows:
    x^* = \arg\min_{x \in U} \sum_{y \in \{0,1\}} p(y \mid x; w) \cdot I(D, x)
where U denotes the set of remaining samples, and I(D, x) denotes the error of the classification model after sample x is added to the training set;
Step 7: select the class with p(y \mid x^*; w) > 0.5, and add the sample x^* to the training dataset together with the label y^* provided by the user with the highest annotation level on that class;

Step 8: repeat Steps 3 to 7 until the generalization accuracy of the classification model or the number of training samples reaches a preset standard.
2. The crowdsourcing-based classification model training method according to claim 1, characterized in that the method for computing the annotation level \theta_j^c of the j-th user on class c is: on the training samples and their crowdsourced label set, initialize the label y_i of each sample x_i, i.e. y_i = 1 when the corresponding condition on the crowdsourced labels holds, otherwise y_i = 0; according to the labels y_i, initialize the annotation level set θ of the users; using it as the prior distribution, the labels of the training data are estimated as follows, wherein
    C = \sum_{i=1}^{n} \sum_{y_i \in \{0,1\}} p(y_i \mid x_i; w) \cdot \prod_{j=1}^{k} p(y_i^j \mid y_i; \theta)
p(y_i \mid x_i; w) is the classification model's probabilistic estimate of the class of the training sample; according to the estimated y_i, update the user annotation levels and retrain the classification model; repeat estimating y_i and updating the user annotation levels until the likelihood function C converges, at which point this process terminates.
3. The crowdsourcing-based classification model training method according to claim 1 or 2, characterized in that the method for computing the error I(D, x) of the classification model after sample x is added to the training set is:
    I(D, x) = -\sum_{u=1}^{|U'|} \sum_{k} p_u^k \log p_u^k
where U' denotes the sample set obtained by deleting the sample from the remaining sample set; meanwhile, after sample x and its label are added to the training set, p_u^k is the prediction probability that the learned classification model assigns to the sample, with x_u ∈ U' and y_u ∈ {0, 1}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710511119.9A CN107247972A (en) | 2017-06-29 | 2017-06-29 | Crowdsourcing-based classification model training method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710511119.9A CN107247972A (en) | 2017-06-29 | 2017-06-29 | Crowdsourcing-based classification model training method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107247972A true CN107247972A (en) | 2017-10-13 |
Family
ID=60014859
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710511119.9A Pending CN107247972A (en) | 2017-06-29 | 2017-06-29 | One kind is based on mass-rent technology classification model training method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107247972A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107808661A (en) * | 2017-10-23 | 2018-03-16 | 中央民族大学 | A kind of Tibetan voice corpus labeling method and system based on collaborative batch Active Learning |
CN108764372A (en) * | 2018-06-08 | 2018-11-06 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN108875821A (en) * | 2018-06-08 | 2018-11-23 | Oppo广东移动通信有限公司 | The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing |
CN108921200A (en) * | 2018-06-11 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | Method, apparatus, equipment and medium for classifying to Driving Scene data |
CN109241513A (en) * | 2018-08-27 | 2019-01-18 | 上海宝尊电子商务有限公司 | A kind of method and device based on big data crowdsourcing model data mark |
CN110210294A (en) * | 2019-04-23 | 2019-09-06 | 平安科技(深圳)有限公司 | Evaluation method, device, storage medium and the computer equipment of Optimized model |
CN110321426A (en) * | 2019-07-02 | 2019-10-11 | 腾讯科技(深圳)有限公司 | Abstract abstracting method, device and computer equipment |
WO2020007287A1 (en) * | 2018-07-05 | 2020-01-09 | 第四范式(北京)技术有限公司 | Machine learning process implementation method and apparatus, device, and storage medium |
CN110929807A (en) * | 2019-12-06 | 2020-03-27 | 腾讯科技(深圳)有限公司 | Training method of image classification model, and image classification method and device |
CN111126121A (en) * | 2018-11-01 | 2020-05-08 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for adjusting face recognition model and storage medium |
CN112632179A (en) * | 2019-09-24 | 2021-04-09 | 北京国双科技有限公司 | Model construction method and device, storage medium and equipment |
CN114662659A (en) * | 2022-03-11 | 2022-06-24 | 南京信息工程大学 | Multi-stage transfer learning strategy synthesis-based crowdsourcing text integration method |
- 2017-06-29: CN CN201710511119.9A patent/CN107247972A/en active Pending
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107808661B (en) * | 2017-10-23 | 2020-12-11 | 中央民族大学 | Tibetan language voice corpus labeling method and system based on collaborative batch active learning |
CN107808661A (en) * | 2017-10-23 | 2018-03-16 | 中央民族大学 | A kind of Tibetan voice corpus labeling method and system based on collaborative batch Active Learning |
CN108764372A (en) * | 2018-06-08 | 2018-11-06 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN108875821A (en) * | 2018-06-08 | 2018-11-23 | Oppo广东移动通信有限公司 | The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing |
CN108764372B (en) * | 2018-06-08 | 2019-07-16 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
US11138478B2 (en) | 2018-06-08 | 2021-10-05 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method and apparatus for training, classification model, mobile terminal, and readable storage medium |
CN108921200A (en) * | 2018-06-11 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | Method, apparatus, equipment and medium for classifying to Driving Scene data |
US11783590B2 (en) | 2018-06-11 | 2023-10-10 | Apollo Intelligent Driving Technology (Beijing) Co., Ltd. | Method, apparatus, device and medium for classifying driving scenario data |
CN113642633A (en) * | 2018-06-11 | 2021-11-12 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and medium for classifying driving scene data |
WO2020007287A1 (en) * | 2018-07-05 | 2020-01-09 | 第四范式(北京)技术有限公司 | Machine learning process implementation method and apparatus, device, and storage medium |
CN109241513A (en) * | 2018-08-27 | 2019-01-18 | 上海宝尊电子商务有限公司 | A kind of method and device based on big data crowdsourcing model data mark |
CN111126121A (en) * | 2018-11-01 | 2020-05-08 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for adjusting face recognition model and storage medium |
CN111126121B (en) * | 2018-11-01 | 2023-04-04 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for adjusting face recognition model and storage medium |
CN110210294A (en) * | 2019-04-23 | 2019-09-06 | 平安科技(深圳)有限公司 | Evaluation method, device, storage medium and the computer equipment of Optimized model |
CN110321426A (en) * | 2019-07-02 | 2019-10-11 | 腾讯科技(深圳)有限公司 | Abstract abstracting method, device and computer equipment |
CN110321426B (en) * | 2019-07-02 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Digest extraction method and device and computer equipment |
CN112632179A (en) * | 2019-09-24 | 2021-04-09 | 北京国双科技有限公司 | Model construction method and device, storage medium and equipment |
CN110929807A (en) * | 2019-12-06 | 2020-03-27 | 腾讯科技(深圳)有限公司 | Training method of image classification model, and image classification method and device |
CN114662659A (en) * | 2022-03-11 | 2022-06-24 | 南京信息工程大学 | Multi-stage transfer learning strategy synthesis-based crowdsourcing text integration method |
CN114662659B (en) * | 2022-03-11 | 2022-09-16 | 南京信息工程大学 | Multi-stage transfer learning strategy synthesis-based crowdsourcing text integration method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107247972A (en) | Crowdsourcing-based classification model training method | |
CN104239489B (en) | Utilize the method for similarity searching and improved BP forecast level | |
CN109741332A (en) | A kind of image segmentation and mask method of man-machine coordination | |
CN106528656A (en) | Student history and real-time learning state parameter-based course recommendation realization method and system | |
CN107943784A (en) | Relation extraction method based on generation confrontation network | |
CN106502909B (en) | A kind of aacode defect prediction technique in smart mobile phone application exploitation | |
CN107993140A (en) | A kind of personal credit's methods of risk assessment and system | |
CN105046366B (en) | model training method and device | |
CN103116893B (en) | Digital image labeling method based on multi-exampling multi-marking learning | |
CN102253976B (en) | Metadata processing method and system for spoken language learning | |
KR102203253B1 (en) | Rating augmentation and item recommendation method and system based on generative adversarial networks | |
CN108090520A (en) | Training method, system, device and the readable storage medium storing program for executing of intention assessment model | |
CN104102696A (en) | Content recommendation method and device | |
CN110909125B (en) | Detection method of media rumor of news-level society | |
CN106095973A (en) | The tourism route of a kind of combination short term traffic forecasting recommends method | |
CN106095812A (en) | Intelligent test paper generation method based on similarity measurement | |
CN105841847B (en) | A kind of method for estimating Surface latent heat fluxes | |
CN109165309A (en) | Negative training sample acquisition method, device and model training method, device | |
CN109272003A (en) | A kind of method and apparatus for eliminating unknown error in deep learning model | |
CN109271630A (en) | A kind of intelligent dimension method and device based on natural language processing | |
CN112115993A (en) | Zero sample and small sample evidence photo anomaly detection method based on meta-learning | |
CN107368521A (en) | A kind of Promote knowledge method and system based on big data and deep learning | |
CN105931271A (en) | Behavior locus identification method based on variation BP-HMM | |
CN107943750A (en) | A kind of decomposition convolution method based on WGAN models | |
CN110399547A (en) | For updating the method, apparatus, equipment and storage medium of model parameter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171013 |