CN109086814A

CN109086814A - Data processing method, device, and network equipment

Info

Publication number: CN109086814A
Application number: CN201810813137.7A
Authority: CN
Inventors: 李俊岑
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-07-23
Filing date: 2018-07-23
Publication date: 2018-12-25
Anticipated expiration: 2038-07-23
Also published as: CN109086814B

Abstract

The invention discloses a data processing method, device, and network equipment. The data processing method includes: obtaining a first labeled data set; traversing the labeled data in the first labeled data set and, during the traversal, determining conflicting labeled data using a labeling prediction model; obtaining a second labeled data set, which consists of the labeled data produced by relabeling, according to a preset labeling rule, the conflicting labeled data found during the traversal; determining a third labeled data set according to the first labeled data set and the second labeled data set; and, when the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, taking the third labeled data set as the first labeled data set and repeating the traversal step until the evaluation result of the third labeled data set satisfies the preset evaluation condition. The invention improves the quality of labeled data and saves labor and time.

Description

Data processing method, device, and network equipment

Technical field

The present invention relates to the field of computer technology, and in particular to a data processing method, device, and network equipment.

Background

With the development of computer technology, machine learning techniques are applied in more and more fields. Machine learning usually needs a large amount of labeled data to train a learning model, so the labeling quality of the data is an important factor affecting the accuracy of the learning model.

To improve the labeling quality of data, a common approach is to have multiple labelers label the same piece of data and then take the result given by the majority of labelers as the final labeling result. Alternatively, each labeling result is assessed by sampling; if the accuracy of the sampling assessment is below a preset threshold, the labeler relabels the data, and this repeats until the accuracy of the sampling assessment reaches the preset threshold.

In the course of implementing the present invention, the inventor found that the existing technology has at least the following problem:

In the related art, improving the labeling quality of data, especially of more complex data, still relies mainly on manual work, which consumes considerable human resources and time, and the accuracy of the labeled data still needs further improvement.

Accordingly, a more reliable or more effective scheme is needed that reduces the consumption of time and human resources while guaranteeing the quality of the labeled data.

Summary of the invention

To solve the problems in the prior art, embodiments of the present invention provide a data processing method, device, and network equipment. The technical solution is as follows:

In one aspect, a data processing method is provided, the method comprising:

obtaining a first labeled data set, the first labeled data set being the labeled data obtained by labeling data to be labeled according to a preset labeling rule;

traversing the labeled data in the first labeled data set and, while traversing the labeled data in the first labeled data set, determining conflicting labeled data using a labeling prediction model;

obtaining a second labeled data set, the second labeled data set being the labeled data obtained by relabeling, according to the preset labeling rule, the conflicting labeled data found during the traversal;

determining a third labeled data set according to the first labeled data set and the second labeled data set;

when the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, taking the third labeled data set as the first labeled data set and executing the traversal step until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

In another aspect, a data processing device is provided, the device comprising:

a first obtaining module, configured to obtain a first labeled data set, the first labeled data set being the labeled data obtained by labeling data to be labeled according to a preset labeling rule;

a traversal module, configured to traverse the labeled data in the first labeled data set and, while traversing the labeled data in the first labeled data set, determine conflicting labeled data using a labeling prediction model;

a second obtaining module, configured to obtain a second labeled data set, the second labeled data set being the labeled data obtained by relabeling, according to the preset labeling rule, the conflicting labeled data found during the traversal;

a first determining module, configured to determine a third labeled data set according to the first labeled data set and the second labeled data set;

a loop processing module, configured to, when the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, take the third labeled data set as the first labeled data set and execute the traversal step until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

In another aspect, a network equipment is provided, comprising:

a processor adapted to execute one or more instructions; and

a memory storing one or more instructions, the one or more instructions being adapted to be loaded by the processor to execute the above data processing method.

The technical solution provided in the embodiments of the present invention has the following beneficial effects:

After obtaining a labeled data set, the present invention traverses the labeled data in the set and, during the traversal, determines conflicting labeled data using a labeling prediction model. The conflicting labeled data found during the traversal are relabeled, and the relabeled data are merged with the earlier labeled data set to obtain a new labeled data set. The new labeled data set is then evaluated; when the evaluation result does not satisfy the preset evaluation condition, the above operations are repeated until the new labeled data set satisfies the preset evaluation condition. This improves the labeling quality of the data and, because the model and manual work are effectively combined, saves labor and time.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of a data processing method provided by an embodiment of the present invention;

Fig. 2 is a flow diagram, provided by an embodiment of the present invention, of determining conflicting labeled data using a labeling prediction model while traversing the labeled data in the first labeled data set;

Fig. 3 is a flow diagram, provided by an embodiment of the present invention, of obtaining the evaluation result of the third labeled data set;

Fig. 4 is a flow diagram of another data processing method provided by an embodiment of the present invention;

Fig. 5 is another flow diagram, provided by an embodiment of the present invention, of determining conflicting labeled data using a labeling prediction model while traversing the labeled data in the first labeled data set;

Fig. 6 is a structural diagram of a data processing device provided by an embodiment of the present invention;

Fig. 7 is a structural diagram of a traversal module provided by an embodiment of the present invention;

Fig. 8 is a structural diagram of another data processing device provided by an embodiment of the present invention;

Fig. 9 is a structural diagram of a first determining module provided by an embodiment of the present invention;

Fig. 10 is a structural diagram of another data processing device provided by an embodiment of the present invention;

Fig. 11 is a structural diagram of a network equipment provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.

Refer to Fig. 1, which shows a flow diagram of a data processing method provided by an embodiment of the present invention. This specification provides the method steps as described in the embodiments or flow diagrams, but more or fewer steps may be included on the basis of routine work that involves no creative effort. The order of steps listed in the embodiments is only one of many possible execution orders and is not the only one. When an actual system or product executes the method, the steps may be executed in the order shown in the embodiments or drawings, or in parallel (for example, in a parallel-processor or multi-threaded environment). Specifically, as shown in Fig. 1, the method includes:

S102: Obtain a first labeled data set, the first labeled data set being the labeled data obtained by labeling data to be labeled according to a preset labeling rule.

In the embodiments of this specification, data to be labeled refers to the objects that labeling personnel need to label. Data to be labeled may include, but is not limited to, text, images, audio, and statistical data.

In the embodiments of this specification, the preset labeling rule is information indicating how labeling personnel should label the data to be labeled. Labeled data includes the labeling result corresponding to the data to be labeled, and the labeling result is the data obtained after labeling personnel label the data to be labeled according to the preset labeling rule.

The preset labeling rule may include a labeling format, label classes, and so on. For example, the labeling format may be "slot name=slot value##slot name=slot value", where a slot is an entity word with a particular attribute in the data to be labeled. The label classes may be "song title -> song; singer -> singer; category, style -> tag; film and television, program, work -> tv; language -> language; location information -> place". Under this preset labeling rule, the labeling result for the data to be labeled "Play me the song Cowboy Is Extremely Busy" may be "song=Cowboy Is Extremely Busy"; the labeling result for "What pleasant songs has Zhou Jielun released recently" may be "singer=Zhou Jielun##tag=pleasant"; and the labeling result for "Play Love Is Just A Dream" may be "song=Love Is Just A Dream".
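
By way of illustration only, the labeling format described above can be parsed and checked as in the following minimal Python sketch. The names `parse_annotation`, `follows_rule`, and `ALLOWED_SLOTS` are hypothetical and are not part of the described method; the slot names are taken from the example label classes.

```python
# Hypothetical helper names; the slot names come from the example label classes.
ALLOWED_SLOTS = {"song", "singer", "tag", "tv", "language", "place"}


def parse_annotation(result: str) -> dict[str, str]:
    """Parse 'slot name=slot value##slot name=slot value' into a dict."""
    slots = {}
    for pair in result.split("##"):
        pair = pair.strip()
        if not pair:
            continue
        name, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed slot pair: {pair!r}")
        slots[name.strip()] = value.strip()
    return slots


def follows_rule(result: str) -> bool:
    """True if the result parses and uses only the allowed slot names."""
    try:
        slots = parse_annotation(result)
    except ValueError:
        return False
    return bool(slots) and set(slots) <= ALLOWED_SLOTS


print(parse_annotation("singer=Zhou Jielun##tag=pleasant"))
```

A check of this kind only validates the format and label classes; whether a slot value is actually correct for the sentence still has to be judged by a person or a model.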

It should be noted that the above is only one example of a preset labeling rule; in practical applications, a corresponding preset labeling rule can be set as needed. For example, when the intent of the data to be labeled needs to be labeled, the preset labeling rule may specify that the label class of intent A is 1, the label class of intent B is 2, and so on.

In practical applications, labeling personnel can send a request to the data labeling system through an interactive device. The data labeling system can select data to be labeled from the set of data to be labeled, package the selected data together with the preset labeling rule into a data packet, and send the packet to the labeling personnel. The labeled data completed by the labeling personnel can then be obtained.

S104: Traverse the labeled data in the first labeled data set and, while traversing the labeled data in the first labeled data set, determine conflicting labeled data using a labeling prediction model.

To control the labeling quality of the labeled data, the embodiments of this specification traverse the labeled data in the first labeled data set and, during the traversal, determine conflicting labeled data with the help of a labeling prediction model.

In the embodiments of this specification, data to be labeled is input into the labeling prediction model, which outputs the corresponding predicted labeled data. When the labeled data in the first labeled data set that corresponds to this data to be labeled is inconsistent with the predicted labeled data, that labeled data in the first labeled data set is determined to be conflicting labeled data.

In the embodiments of this specification, the method shown in Fig. 2 can be used to determine conflicting labeled data with the labeling prediction model while traversing the labeled data in the first labeled data set. Fig. 2 shows a flow diagram, provided by an embodiment of the present invention, of determining conflicting labeled data using a labeling prediction model while traversing the labeled data in the first labeled data set. As shown in Fig. 2, the method may include:

S202: Select at least one labeled data item from the first labeled data set as the labeled data to be screened, and use the labeled data remaining in the first labeled data set after the labeled data to be screened is removed as the training labeled data.

In the embodiments of this specification, the labeled data in the first labeled data set is split into labeled data to be screened and training labeled data. The training labeled data is the data set used to build the model by fitting its parameters; that is, the training labeled data is used to train a machine learning model and thereby determine the parameters of that model. The labeled data to be screened is the data set from which conflicting labeled data are screened out.

In the embodiments of this specification, the labeled data to be screened may be a single labeled data item in the first labeled data set or a set of several labeled data items. The labeled data to be screened can be selected from the first labeled data set at random.

S204: Perform machine learning on the training labeled data to generate the labeling prediction model.

In the embodiments of this specification, the class of model used for machine learning can be determined by the content of the labeled data. For example, if the labeled data concerns intent classification, an intent classification model (such as a support vector machine classifier) can be selected for machine learning; if the labeled data concerns slot labeling, a sequence labeling model (such as an LSTM model or a CRF model) can be selected for machine learning.

In the embodiments of this specification, the labeling prediction model can be generated by maximizing the likelihood function of the data set, where x denotes the training labeled data input and y denotes the class label output for the training labeled data. During machine learning, the training labeled data x is first converted into a vector c, and the vector c is then converted into the corresponding output y.

In the embodiments of this specification, when converting the vector c into the corresponding output y, the vector c can be fed into a softmax multinomial classifier to compute the probability of each class label. Specifically, the generation probability of the i-th class label can be expressed as:

where j = 1, ..., K;

and the likelihood of all class labels can then be expressed as:

The parameters of the machine learning model can be determined by finding the maximum of the above likelihood of all class labels, thereby generating the labeling prediction model.
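
The formulas referred to above are not reproduced in this text. As a hedged reading only, a softmax classifier over K class labels and the corresponding data-set likelihood are conventionally written as follows. The weight vectors w_j, the parameter symbol θ, and the sample index n are assumptions introduced for illustration and do not come from the source.

```latex
p(y = i \mid c) = \frac{\exp(w_i^{\top} c)}{\sum_{j=1}^{K} \exp(w_j^{\top} c)}, \qquad j = 1, \ldots, K

\mathcal{L}(\theta) = \prod_{n} p\left(y^{(n)} \mid x^{(n)}; \theta\right)
```

Under this reading, training amounts to choosing θ (including the w_j) to maximize the likelihood, or equivalently its logarithm, over the training labeled data.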

In the embodiments of this specification, the training labeled data x can be converted into the vector c in at least the following two ways:

Mode one: compute the TF-IDF value of each word in the training labeled data and convert the whole input text into a TF-IDF vector. Specifically, the TF-IDF vector can be expressed as:

$$V_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T, \qquad w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D \mid t \in d'\}|}$$

where:

$\mathrm{tf}_{t,d}$ is the frequency with which term t occurs in the input text d;

$\log \frac{|D|}{|\{d' \in D \mid t \in d'\}|}$ is the inverse document frequency;

$|D|$ is the total number of documents in the document set;

$|\{d' \in D \mid t \in d'\}|$ is the number of documents containing term t.
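
As an illustrative sketch only, TF-IDF vectors of the kind described in mode one could be computed as follows. The function name `tfidf_vectors` is hypothetical, the term frequency is taken as the raw count of a term in a text, and the natural logarithm is used; these choices are assumptions rather than details stated in the source.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one TF-IDF vector per doc)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)                                 # |D|
    df = Counter(t for d in docs for t in set(d))      # documents containing t
    vectors = []
    for d in docs:
        tf = Counter(d)                                # raw term counts in this doc
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


vocab, vectors = tfidf_vectors([["play", "song"], ["play", "news"]])
print(vocab)
print(vectors)
```

A term that appears in every document gets weight 0 under this definition, so only terms that distinguish one text from another contribute to the vector.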

Mode two: encode x into a vector c of width K using a recurrent neural network (RNN) encoder. Given an ordered feature sequence of arbitrary length $x_{1:k} = (x_1, \ldots, x_k)$, the RNN encoder returns a fixed-length feature vector $c_k \in \mathbb{R}^{out}$ (where each $x_i$ may be a one-hot representation or a dense low-dimensional feature).

In the embodiments of this specification, the RNN encoder is defined recursively. To capture the sequence information, when describing the partial sequence formed by the first i elements, a hidden state $s_i$ is introduced as the output of the previous hidden state $s_{i-1}$ and the current element $x_i$, that is, $s_i = R(s_{i-1}, x_i)$. The final fixed-length feature vector is obtained by mapping the final state $s_k$ to $c_k$ through the mapping $O(\cdot)$. Specifically:

$$\mathrm{RNN}(x_{1:k}; s_0) = c_{1:k}$$

$$c_i = O(s_i)$$

$$s_i = R(s_{i-1}, x_i)$$
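
The recursion above can be sketched in plain Python as follows. The source does not specify the forms of R and O; the tanh update with weight matrices `W` and `U` (an Elman-style cell) and the identity output mapping used here are assumptions chosen only to make the sketch concrete.

```python
import math


def R(s_prev, x, W, U):
    """State update s_i = R(s_{i-1}, x_i), assumed here to be tanh(W s_{i-1} + U x_i)."""
    return [
        math.tanh(
            sum(w * s for w, s in zip(W[r], s_prev))
            + sum(u * v for u, v in zip(U[r], x))
        )
        for r in range(len(W))
    ]


def O(s):
    """Output mapping c_i = O(s_i); assumed to be the identity in this sketch."""
    return list(s)


def rnn_encode(xs, s0, W, U):
    """Return c_{1:k}; the last element is the fixed-length encoding c_k."""
    s, cs = s0, []
    for x in xs:
        s = R(s, x, W, U)
        cs.append(O(s))
    return cs


# Two one-hot inputs, a 2-dimensional state, and illustrative weights.
W = [[0.0, 0.0], [0.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
cs = rnn_encode([[1, 0], [0, 1]], s0=[0.0, 0.0], W=W, U=U)
print(cs[-1])
```

Whatever the input length, `cs[-1]` has the same dimension as the state, which is what lets a variable-length text be fed to a fixed-size classifier.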

S206: Input the data to be labeled that corresponds to the labeled data to be screened into the labeling prediction model for labeling prediction, obtaining the predicted labeled data corresponding to that data to be labeled.

In the embodiments of this specification, once training has finished and the labeling prediction model has been obtained, the data to be labeled that corresponds to the labeled data to be screened can be input into the labeling prediction model for labeling prediction, which yields the predicted labeled data corresponding to that data to be labeled.

S208: Determine conflicting labeled data according to the labeled data to be screened and the predicted labeled data.

In the embodiments of this specification, after the predicted labeled data is obtained, its labeling result can be compared with the labeling result of the labeled data to be screened. When the two labeling results are inconsistent, the labeled data to be screened may contain a labeling error or a similar problem, and it is therefore determined to be conflicting labeled data.

For example, in the intent labeling shown in Table 1, the labeling result of the labeled data to be screened with serial number 2 differs from the predicted labeling result of the predicted labeled data; therefore, the labeled data to be screened with serial number 2 is determined to be conflicting labeled data.

Table 1

In the embodiments of this specification, the conflicting labeled data determined in each traversal pass can be collected into one data set.

In the embodiments of this specification, conflicting labeled data is data that may contain labeling errors or similar problems. To improve the reliability of the labeling prediction model obtained by machine learning during the traversal, once labeled data to be screened has been determined to be conflicting labeled data, it can be removed from the first labeled data set. In the next traversal pass, the training labeled data then contains no known conflicting labeled data, so the labeling prediction model trained on it is more reliable, which in turn improves the accuracy of the conflicting labeled data that is screened out.
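
Steps S202 to S208, together with the removal of known conflicts from later training passes, can be sketched as follows. This is a minimal illustration under stated assumptions: one record is held out per pass, and `train_token_vote` is a toy stand-in for the labeling prediction model (not the SVM, LSTM, or CRF models mentioned above). All function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_token_vote(records):
    """Toy model: each token votes for the labels it co-occurred with in training."""
    votes = defaultdict(Counter)
    for text, label in records:
        for tok in text.split():
            votes[tok][label] += 1

    def predict(text):
        tally = Counter()
        for tok in text.split():
            tally.update(votes.get(tok, {}))
        return tally.most_common(1)[0][0] if tally else None

    return predict


def traverse(dataset, train=train_token_vote):
    """Hold out one record per pass; train on the rest minus known conflicts."""
    conflicts = []
    for held_out in dataset:
        training = [r for r in dataset if r is not held_out and r not in conflicts]
        predict = train(training)                      # S204
        text, label = held_out
        if predict(text) != label:                     # S206 and S208
            conflicts.append(held_out)                 # excluded from later passes
    return conflicts


data = [
    ("play a song", "music"),
    ("play this song", "music"),
    ("play that song", "music"),
    ("play my song", "weather"),   # deliberately mislabeled
    ("rain today", "weather"),
    ("rain tomorrow", "weather"),
    ("rain forecast today", "weather"),
]
print(traverse(data))
```

A flagged record is only a candidate for relabeling: the model may be wrong, which is why the method sends conflicts back to labeling personnel rather than overwriting the label with the prediction.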

S106: Obtain a second labeled data set, the second labeled data set being the labeled data obtained by relabeling, according to the preset labeling rule, the conflicting labeled data found during the traversal.

In the embodiments of this specification, after conflicting labeled data has been determined from the first labeled data set, it can be sent back to the labeling personnel, who relabel the data to be labeled that corresponds to the conflicting labeled data according to the preset labeling rule and produce new labeling results. The relabeled data can then be obtained as the second labeled data set.

S108: Determine a third labeled data set according to the first labeled data set and the second labeled data set.

In the embodiments of this specification, the conflicting labeled data in the first labeled data set can be replaced with the labeled data in the second labeled data set to obtain the third labeled data set.
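
A minimal sketch of this replacement step, assuming each labeled data set is represented as a mapping from an item identifier to its labeling result (a representation chosen for illustration, not stated in the source). The function name `merge_relabeled` is hypothetical.

```python
def merge_relabeled(first_set, second_set):
    """Return the third set: the first set with relabeled items substituted in."""
    third = dict(first_set)      # copy so the first set is left unchanged
    third.update(second_set)     # relabeled results replace the conflicting ones
    return third


first = {"q1": "song=Love Is Just A Dream", "q2": "tag=Zhou Jielun"}
second = {"q2": "singer=Zhou Jielun"}    # q2 was flagged and relabeled
print(merge_relabeled(first, second))
```

Because the update is keyed by item identifier, the same function works whether or not the conflicting items were already removed from the first set during the traversal.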

S110: When the evaluation result of the third labeled data set does not satisfy the preset evaluation condition, take the third labeled data set as the first labeled data set and execute the traversal step until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

In the embodiments of this specification, after the third labeled data set has been determined, its quality can be evaluated, that is, it can be judged whether the evaluation result of the third labeled data set satisfies the preset evaluation condition. The quality evaluation can be performed manually, to reduce the effect of any inaccurate model predictions in the preceding steps.

Specifically, when the evaluation result of the third labeled data set does not satisfy the preset evaluation condition, the third labeled data set can be taken as the first labeled data set and step S104 executed again, until the evaluation result of the resulting third labeled data set satisfies the preset evaluation condition. At that point the labeling quality of the third labeled data set is acceptable and meets the requirement, and execution can end.
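
The loop formed by steps S104 to S110 can be sketched as follows. The callables `find_conflicts`, `relabel`, and `evaluate` are hypothetical placeholders for the traversal, the manual relabeling, and the quality evaluation; the `max_rounds` guard is an added assumption to keep the sketch from looping forever and is not part of the described method.

```python
def improve_labels(first_set, find_conflicts, relabel, evaluate,
                   threshold=0.95, max_rounds=10):
    """Repeat traverse -> relabel -> merge until the evaluation passes."""
    current = dict(first_set)
    for _ in range(max_rounds):
        conflicts = find_conflicts(current)       # S104: ids of conflicting items
        second = relabel(conflicts)               # S106: {id: new label}
        current = {**current, **second}           # S108: third labeled data set
        if evaluate(current) >= threshold:        # S110: preset evaluation condition
            break
    return current


# Toy stand-ins: a hidden ground truth plays the role of model and annotator.
truth = {1: "a", 2: "b", 3: "a"}
noisy = {1: "a", 2: "a", 3: "a"}
result = improve_labels(
    noisy,
    find_conflicts=lambda cur: [k for k in cur if cur[k] != truth[k]],
    relabel=lambda ids: {k: truth[k] for k in ids},
    evaluate=lambda cur: sum(cur[k] == truth[k] for k in cur) / len(cur),
)
print(result)
```

In a real deployment `relabel` would be a round trip to labeling personnel, so each iteration is expensive; that is the reason the model is used to narrow the relabeling to suspected conflicts only.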

It should be noted that the preset evaluation condition can be set according to the method used to evaluate the third labeled data set. In the embodiments of this specification, the third labeled data set can be evaluated by sampling, in which case its evaluation result can be obtained by the method shown in Fig. 3. Fig. 3 shows a flow diagram, provided by an embodiment of the present invention, of obtaining the evaluation result of the third labeled data set. As shown in Fig. 3, the method may include:

S302: Extract a first quantity of labeled data from the third labeled data set as sample labeled data.

In the embodiments of this specification, the extraction can be random, and the first quantity can be set according to actual needs. In general, the larger the first quantity, the more sample labeled data there is and the more reliable the evaluation result; conversely, the smaller the first quantity, the less sample labeled data there is and the less reliable the evaluation result.

S304: Count a second quantity, which is the number of labeled data items in the sample labeled data that satisfy the preset labeling rule.

In the embodiments of this specification, the sample labeled data can be checked one by one against the preset labeling rule that indicates how labeling personnel should label the data to be labeled. When a sample labeled data item is found to satisfy the preset labeling rule, that item is considered accurately labeled. The number of labeled data items in the sample labeled data that satisfy the preset labeling rule is counted as the second quantity.

S306: Calculate the ratio of the second quantity to the first quantity, and use the ratio as the evaluation result of the third labeled data set.

In the embodiments of this specification, the ratio of the second quantity to the first quantity can be calculated and used as the evaluation result of the third labeled data set.

It should be noted that when the evaluation result of the third labeled data set is the above ratio, the preset evaluation condition can correspondingly be set as a preset ratio, for example 95% or 90%. When the evaluation result is less than the preset ratio, the evaluation result of the third labeled data set does not satisfy the preset evaluation condition; conversely, when the evaluation result is greater than or equal to the preset ratio, the evaluation result of the third labeled data set satisfies the preset evaluation condition.
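
Steps S302 to S306 and the threshold comparison can be sketched as follows. The function names and the `follows_rule` callback (which stands for the per-item check against the preset labeling rule, possibly performed by a person) are hypothetical.

```python
import random


def sample_accuracy(third_set, follows_rule, first_quantity, rng=None):
    """Ratio of sampled items that satisfy the preset labeling rule."""
    rng = rng or random.Random()
    items = list(third_set)
    sample = rng.sample(items, min(first_quantity, len(items)))   # S302
    if not sample:
        raise ValueError("cannot evaluate an empty sample")
    second_quantity = sum(1 for item in sample if follows_rule(item))  # S304
    return second_quantity / len(sample)                          # S306


def meets_condition(ratio, preset_ratio=0.95):
    """The preset evaluation condition: ratio must reach the preset ratio."""
    return ratio >= preset_ratio


# Each toy item carries a flag saying whether it was labeled correctly.
items = [("ok", True)] * 8 + [("bad", False)] * 2
ratio = sample_accuracy(items, lambda it: it[1], first_quantity=10)
print(ratio, meets_condition(ratio))
```

With a sample smaller than the full set, the ratio is only an estimate of the set's accuracy, which is why a larger first quantity gives a more reliable evaluation result.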

It should be noted that the above is only one example of obtaining the evaluation result of the third labeled data set. In practical applications, other methods can also be used to evaluate the labeling quality of the third labeled data set and obtain an evaluation result; for example, the evaluation can be based on the distribution of the labeled data in the third labeled data set. The present invention is not limited in this respect.

In summary, after obtaining a labeled data set, the embodiments of the present invention traverse the labeled data in the set and, during the traversal, determine conflicting labeled data using a labeling prediction model. The conflicting labeled data found during the traversal are relabeled, and the relabeled data are merged with the earlier labeled data set to obtain a new labeled data set. The new labeled data set is then evaluated; when the evaluation result does not satisfy the preset evaluation condition, the above operations are repeated until the new labeled data set satisfies the preset evaluation condition. This improves the labeling quality of the data. Because the model and manual work are effectively combined throughout the quality control of the labeled data, labor and time costs are reduced and the efficiency of quality inspection of the labeled data is improved.

Refer to Fig. 4, which shows a flow diagram of another data processing method provided by an embodiment of the present invention. This specification provides the method steps as described in the embodiments or flow diagrams, but more or fewer steps may be included on the basis of routine work that involves no creative effort. The order of steps listed in the embodiments is only one of many possible execution orders and is not the only one. When an actual system or product executes the method, the steps may be executed in the order shown in the embodiments or drawings, or in parallel (for example, in a parallel-processor or multi-threaded environment). Specifically, as shown in Fig. 4, the method includes:

S402: Obtain a first labeled data set, the first labeled data set being the labeled data obtained by labeling data to be labeled according to a preset labeling rule.

S404: Obtain the data features of the labeled data in the first labeled data set.

In the embodiments of this specification, the data features of labeled data can be features obtained by analyzing the labeling results; for example, they may be slot features of the labeled data or intent features of the labeled data.

S406: Split the first labeled data set into N labeled data subsets, where the data features of the labeled data included in each labeled data subset satisfy a preset distribution rule, N ≥ 2.

In the embodiments of this specification, to improve the efficiency of data processing, the first labeled data set can be split into N (N ≥ 2) labeled data subsets, and the data features of the labeled data included in each labeled data subset need to satisfy a preset distribution rule, to ensure the reliability of the conflicting labeled data that is subsequently screened out.

In a specific embodiment, the preset distribution rule may be that the data features of each labeled data subset follow a Poisson distribution. If a labeled data subset includes m labeled data items, the probability P(x) that data feature x occurs can be expressed as:

$$P(x) = \frac{m^x}{x!} e^{-m}, \qquad P(0) = e^{-m}$$

When the data features of the labeled data subsets satisfy the above Poisson distribution, the consistency of the data across the labeled data subsets can be ensured as far as possible. Then, when the labeling prediction model obtained by machine learning is later used to screen conflicting labeled data from the labeled data to be screened, as much of the conflicting labeled data as possible can be found, which improves the accuracy and reliability of the screening. This not only improves the efficiency of data processing but also helps improve the quality of the labeled data finally obtained.
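
As a small worked example, and assuming the standard Poisson probability mass function with mean m (which is consistent with the stated value P(0) = e^-m), the probabilities can be evaluated as follows. The function name `poisson_pmf` is hypothetical.

```python
import math


def poisson_pmf(x, m):
    """P(x) = m**x * exp(-m) / x! for a Poisson distribution with mean m."""
    return m ** x * math.exp(-m) / math.factorial(x)


m = 3
print([round(poisson_pmf(x, m), 4) for x in range(6)])
```

Comparing observed feature counts in a subset against these probabilities is one way a split could be checked against the preset distribution rule; the source does not describe how that check is performed.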

In the embodiments of this specification, the preset distribution rule can also be set according to the label classes and actual needs. For example, the preset distribution rule may be that the data features in a labeled data subset occur in a preset ratio. When the labeling type of the labeled data is intent, the preset distribution rule may be that intent feature 1 : intent feature 2 ≥ 9:1 in a labeled data subset; that is, the data features of the labeled data in every labeled data subset obtained after splitting must satisfy intent feature 1 : intent feature 2 ≥ 9:1. The present invention is not limited in this respect.
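
One simple way to obtain subsets that all keep roughly the same feature ratio is a stratified round-robin split, sketched below. This is an illustrative technique chosen here, not one named in the source, and the function name `stratified_split` and the `feature` callback are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_split(records, n, feature):
    """Deal records round-robin per feature value into n subsets."""
    by_feature = defaultdict(list)
    for rec in records:
        by_feature[feature(rec)].append(rec)
    subsets = [[] for _ in range(n)]
    for group in by_feature.values():
        for i, rec in enumerate(group):
            subsets[i % n].append(rec)     # spread each feature value evenly
    return subsets


# 18 records with intent feature 1 and 2 with intent feature 2 (ratio 9:1).
records = [(f"q{i}", "intent1") for i in range(18)]
records += [(f"r{i}", "intent2") for i in range(2)]
for subset in stratified_split(records, 2, feature=lambda r: r[1]):
    print(Counter(label for _, label in subset))
```

Each subset in this example ends up with 9 records of intent feature 1 and 1 of intent feature 2, so the 9:1 ratio of the whole set is preserved in every part.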

S408: Traverse the labeled data in the first labeled data set and, while traversing the labeled data in the first labeled data set, determine conflicting labeled data using a labeling prediction model.

In the embodiments of this specification, the method shown in Fig. 5 can be used to determine conflicting labeled data with the labeling prediction model while traversing the labeled data in the first labeled data set. Fig. 5 shows another flow diagram, provided by an embodiment of the present invention, of determining conflicting labeled data using a labeling prediction model while traversing the labeled data in the first labeled data set. As shown in Fig. 5, the method may include:

S502: Select K labeled data subsets from the N labeled data subsets as the labeled data to be screened, and use the remaining (N-K) labeled data subsets as the training labeled data, 1 ≤ K ≤ N/2.

In the embodiments of this specification, K labeled data subsets (1 ≤ K ≤ N/2) can be selected at random from the N labeled data subsets obtained by splitting. For example, 1 labeled data subset can be selected from the N labeled data subsets as the labeled data to be screened, and the remaining (N-1) labeled data subsets are then used as the training labeled data.

S504: Perform machine learning on the training labeled data to generate a mark prediction model.

S506: Input the data to be marked that corresponds to the labeled data to be screened into the mark prediction model for marking prediction, obtaining predicted labeled data corresponding to the data to be marked.

For steps S504 to S506, refer to the foregoing method embodiment shown in Fig. 2; they are not repeated here.

S508: Determine conflict labeled data according to the labeled data to be screened and the predicted labeled data.

In the embodiments of this specification, since the labeled data to be screened consists of one or more labeled data subsets, the conflict labeled data that is determined may be one or more pieces of labeled data within those subsets. Specifically, when the marking result of a piece of labeled data in a labeled data subset serving as labeled data to be screened is inconsistent with the marking result of the corresponding predicted labeled data, that piece of labeled data can be determined as conflict labeled data. For example, as shown in Table 2, the labeled data to be screened includes labeled data subset 1 and labeled data subset 2. The marking result of the labeled data with serial number 2 in labeled data subset 1 is inconsistent with the predicted marking result of the predicted labeled data, so the labeled data with serial number 2 in labeled data subset 1 can be determined as conflict labeled data. Likewise, the marking result of the labeled data with serial number 3 in labeled data subset 2 is inconsistent with the predicted marking result, so the labeled data with serial number 3 in labeled data subset 2 can be determined as conflict labeled data.

Table 2

In the embodiments of this specification, conflict labeled data is data that may contain problems such as marking errors. To improve the reliability of the mark prediction model obtained by machine learning during the traversal, after conflict labeled data has been determined in one pass of the traversal, that conflict labeled data can be removed from the first labeled data set. In this way, during the next pass, the training labeled data contains no conflict labeled data, so the mark prediction model obtained by machine learning on the training labeled data is more reliable.
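One pass of steps S502 to S508, followed by the removal of conflict data, can be sketched as below. This is a hedged illustration only: the record layout (`id`, `text`, `label`), the function names, and the `predict` callable that stands in for the mark prediction model trained in S504 are all assumptions, since the embodiment does not fix a model type or data format.

```python
import random


def choose_subsets(subsets, k, rng=None):
    """S502: pick K of the N subsets to screen; the other N-K are for training."""
    n = len(subsets)
    if not 1 <= k <= n / 2:
        raise ValueError("K must satisfy 1 <= K <= N/2")
    rng = rng or random.Random()
    picked = set(rng.sample(range(n), k))
    to_screen = [item for i in sorted(picked) for item in subsets[i]]
    training = [item for i in range(n) if i not in picked for item in subsets[i]]
    return to_screen, training


def find_conflicts(to_screen, predict):
    """S506-S508: an item conflicts when the predicted label differs from its label."""
    return [item for item in to_screen if predict(item["text"]) != item["label"]]


def reject_conflicts(first_set, conflicts):
    """Remove conflict items from the first labeled data set before the next pass."""
    conflict_ids = {item["id"] for item in conflicts}
    return [item for item in first_set if item["id"] not in conflict_ids]
```

In a full traversal, `predict` would be rebuilt from `training` on every pass, so that each subset is eventually screened by a model that never saw it during training.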

S410: Obtain a second labeled data set, the second labeled data set being labeled data obtained by re-marking, according to the preset marking rule, the conflict labeled data obtained during the traversal.

S412: Determine a third labeled data set according to the first labeled data set and the second labeled data set.

S414: When the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, take the third labeled data set as the first labeled data set and execute the traversal step again, until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

In the embodiments of this specification, when the evaluation result of the third labeled data set satisfies the preset evaluation condition, the marking quality of the third labeled data set meets the requirement, and execution can end.

For details of steps S412 to S414, refer to the foregoing method embodiment shown in Fig. 1; they are not repeated here.

To sum up, after obtaining a labeled data set whose marking is complete, the embodiment of the present invention traverses the labeled data in that set and, during the traversal, determines conflict labeled data with the help of a mark prediction model. The conflict labeled data obtained during the traversal is marked again, and the re-marked data is merged with the previous labeled data set to obtain a new labeled data set. The new labeled data set is then evaluated; when the evaluation result does not satisfy the preset evaluation condition, the above operations are repeated iteratively until the new labeled data set satisfies the preset evaluation condition. This greatly improves the marking quality of the data. Moreover, because the model and human effort are combined effectively throughout the quality control of the labeled data, manpower and time costs are reduced and the efficiency of quality inspection of labeled data is improved.
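The overall iteration summarized above can be sketched as a short loop. Everything named here is hypothetical: the data set is modelled as a dictionary from item id to label, the three callables stand in for conflict screening, manual re-marking and evaluation, and `max_rounds` is an added safety bound that the embodiment does not mention.

```python
def refine(first_set, find_conflicts, relabel, evaluate, threshold, max_rounds=10):
    """Repeat screen -> re-mark -> merge -> evaluate until the threshold is met."""
    current = dict(first_set)                 # first labeled data set: id -> label
    for _ in range(max_rounds):
        conflict_ids = find_conflicts(current)    # traversal with the prediction model
        second_set = relabel(conflict_ids)        # second labeled data set: id -> label
        third_set = {**current, **second_set}     # re-marked labels replace the old ones
        if evaluate(third_set) >= threshold:      # preset evaluation condition
            return third_set
        current = third_set                   # third set becomes the new first set
    return current
```

Here the preset evaluation condition is assumed to be "evaluation result at least `threshold`"; other conditions would only change the comparison line.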

Corresponding to the data processing methods provided by the above embodiments, an embodiment of the present invention further provides a data processing apparatus. Because the data processing apparatus provided by the embodiment of the present invention corresponds to the data processing methods provided by the above embodiments, the implementations of the foregoing data processing method also apply to the data processing apparatus provided in this embodiment and are not described in detail here.

Refer to Fig. 6, which shows a schematic structural diagram of a data processing apparatus provided by an embodiment of the present invention. As shown in Fig. 6, the apparatus may include: a first obtaining module 610, a traversal module 620, a second obtaining module 630, a first determining module 640 and a loop processing module 650.

The first obtaining module 610 can be used to obtain a first labeled data set, the first labeled data set being labeled data obtained by marking data to be marked according to a preset marking rule;

The traversal module 620 can be used to traverse the labeled data in the first labeled data set and, while traversing the labeled data in the first labeled data set, determine conflict labeled data using a mark prediction model;

The second obtaining module 630 can be used to obtain a second labeled data set, the second labeled data set being labeled data obtained by re-marking, according to the preset marking rule, the conflict labeled data obtained during the traversal;

The first determining module 640 can be used to determine a third labeled data set according to the first labeled data set and the second labeled data set;

The loop processing module 650 can be used to, when the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, take the third labeled data set as the first labeled data set and execute the traversal step again, until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

In one example, as shown in Fig. 7, the traversal module 620 may include: a selection module 6210, a generation module 6220, a prediction module 6230 and a second determining module 6240.

The selection module 6210 can be used to choose at least one piece of labeled data from the first labeled data set as the labeled data to be screened, and to use the labeled data remaining in the first labeled data set after the labeled data to be screened is removed as training labeled data;

The generation module 6220 can be used to perform machine learning on the training labeled data to generate a mark prediction model;

The prediction module 6230 can be used to input the data to be marked that corresponds to the labeled data to be screened into the mark prediction model for marking prediction, obtaining predicted labeled data corresponding to the data to be marked;

The second determining module 6240 can be used to determine conflict labeled data according to the labeled data to be screened and the predicted labeled data.

In a specific example, the second determining module 6240 can specifically be used to determine the labeled data to be screened as conflict labeled data when the labeled data to be screened is inconsistent with the predicted labeled data.

In another example, as shown in Fig. 8, the apparatus may include: a first obtaining module 610, a traversal module 620, a second obtaining module 630, a first determining module 640, a loop processing module 650, a third obtaining module 660 and a splitting module 670.

The third obtaining module 660 can be used to obtain the data features of the labeled data in the first labeled data set;

The splitting module 670 can be used to split the first labeled data set into N labeled data subsets, where the data features of the labeled data contained in each labeled data subset satisfy a preset distribution rule and N≥2.

In this example, for the first obtaining module 610, the traversal module 620, the second obtaining module 630, the first determining module 640 and the loop processing module 650, refer to the apparatus embodiment shown in Fig. 6. The traversal module 620 may have the structure shown in Fig. 7, in which the selection module 6210 can specifically be used to choose K labeled data subsets from the N labeled data subsets as the labeled data to be screened, and to use the remaining (N-K) labeled data subsets as training labeled data, where 1≤K≤N/2.

Optionally, as shown in Fig. 7, the traversal module 620 may further include:

A first rejection module 6250, which can be used to remove the conflict labeled data from the first labeled data set.

In a specific example, as shown in Fig. 9, the first determining module 640 may include:

A substitution module 6410, which can be used to replace the conflict labeled data in the first labeled data set with the labeled data in the second labeled data set, obtaining the third labeled data set.
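The substitution performed by such a module can be illustrated as follows. This is a sketch under assumed conventions: records are dictionaries carrying an `id` key that links a re-marked item to the original item, and the function name `substitute` is hypothetical.

```python
def substitute(first_set, second_set):
    """Replace each conflict item in the first set with its re-marked version."""
    remarked = {item["id"]: item for item in second_set}
    # Items without a re-marked counterpart are kept unchanged, and the
    # original ordering of the first set is preserved.
    return [remarked.get(item["id"], item) for item in first_set]
```

Keying on an identifier rather than on position means the second labeled data set may arrive in any order.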

Optionally, as shown in Fig. 10, the apparatus may include: a first obtaining module 610, a traversal module 620, a second obtaining module 630, a first determining module 640, a loop processing module 650, an extraction module 680, a statistics module 690 and a computing module 6010.

The extraction module 680 can be used to extract a first quantity of labeled data from the third labeled data set as sample labeled data;

The statistics module 690 can be used to count a second quantity, namely the number of pieces of labeled data in the sample labeled data that conform to the preset marking rule;

The computing module 6010 can be used to calculate the ratio of the second quantity to the first quantity and to use the ratio as the evaluation result of the third labeled data set.

In this example, for the first obtaining module 610, the traversal module 620, the second obtaining module 630, the first determining module 640 and the loop processing module 650, refer to the apparatus embodiment shown in Fig. 6.
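The sampling-based evaluation carried out by the extraction, statistics and computing modules can be sketched as below. The function name, the `meets_rule` callable (which stands in for a check against the preset marking rule, typically a manual spot check) and the use of uniform random sampling are assumptions made for illustration.

```python
import random


def evaluate_by_sampling(third_set, meets_rule, first_quantity, rng=None):
    """Return second_quantity / first_quantity for a random sample of the data set."""
    if not 1 <= first_quantity <= len(third_set):
        raise ValueError("first_quantity must be between 1 and the data set size")
    rng = rng or random.Random()
    sample = rng.sample(third_set, first_quantity)               # sample labeled data
    second_quantity = sum(1 for item in sample if meets_rule(item))
    return second_quantity / first_quantity                      # evaluation result
```

The returned ratio can then be compared with a threshold that plays the role of the preset evaluation condition.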

To sum up, after obtaining a labeled data set whose marking is complete, the data processing apparatus provided by the embodiment of the present invention traverses the labeled data in that set and, during the traversal, determines conflict labeled data with the help of a mark prediction model. The conflict labeled data obtained during the traversal is marked again, and the re-marked data is merged with the previous labeled data set to obtain a new labeled data set. The new labeled data set is then evaluated; when the evaluation result does not satisfy the preset evaluation condition, the above operations are repeated iteratively until the new labeled data set satisfies the preset evaluation condition. This greatly improves the marking quality of the data. Moreover, because the model and human effort are combined effectively throughout the quality control of the labeled data, manpower and time costs are reduced and the efficiency of quality inspection of labeled data is improved.

It should be noted that, when the apparatus provided by the above embodiment implements its functions, the division into the functional modules described above is only an example. In practical applications, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.

Refer to Fig. 11, which shows a schematic structural diagram of a network device provided by an embodiment of the present invention. The network device is used to implement the data processing method provided in the above embodiments. The network device may be a terminal device such as a PC (Personal Computer), a mobile phone or a PDA (tablet computer), or a service device such as an application server or a cluster server. Referring to Fig. 11, the internal structure of the network device may include, but is not limited to, a processor, a network interface and a memory. The processor, network interface and memory in the network device may be connected by a bus or in other ways; Fig. 11 of this specification embodiment takes connection by a bus as an example.

The processor (or CPU, Central Processing Unit) is the computing core and control core of the network device. The network interface may optionally include a standard wired interface and a wireless interface (such as Wi-Fi or a mobile communication interface). The memory is the storage device in the network device and is used to store programs and data. It can be understood that the memory here may be a high-speed RAM storage device or a non-volatile memory, for example at least one disk storage device; optionally, it may also be at least one storage device located remotely from the aforementioned processor. The memory provides storage space, and this storage space stores the operating system of the network device, which may include but is not limited to a Windows system (an operating system), Linux (an operating system), an Android system (a mobile operating system) and an iOS system (a mobile operating system); the present invention does not limit this. The storage space also stores one or more instructions suitable for being loaded and executed by the processor, and these instructions may be one or more computer programs (including program code). In the embodiments of this specification, the processor loads and executes the one or more instructions stored in the memory to implement the data processing method provided by the above method embodiments.

An embodiment of the present invention further provides a storage medium. The storage medium may be arranged in the network device to store at least one instruction, at least one program, a code set or an instruction set related to implementing the data processing method of the method embodiments. The at least one instruction, the at least one program, the code set or the instruction set can be loaded and executed by the processor of the network device to implement the data processing method provided by the above method embodiments.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that includes the element.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A data processing method, characterized in that the method comprises:

obtaining a first labeled data set, the first labeled data set being labeled data obtained by marking data to be marked according to a preset marking rule;

traversing the labeled data in the first labeled data set and, while traversing the labeled data in the first labeled data set, determining conflict labeled data using a mark prediction model;

obtaining a second labeled data set, the second labeled data set being labeled data obtained by re-marking, according to the preset marking rule, the conflict labeled data obtained during the traversal;

determining a third labeled data set according to the first labeled data set and the second labeled data set;

when the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, taking the third labeled data set as the first labeled data set and executing the traversing step again, until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

2. The data processing method according to claim 1, characterized in that determining conflict labeled data using a mark prediction model while traversing the labeled data in the first labeled data set comprises:

choosing at least one piece of labeled data from the first labeled data set as labeled data to be screened, and using the labeled data remaining in the first labeled data set after the labeled data to be screened is removed as training labeled data;

performing machine learning on the training labeled data to generate a mark prediction model;

inputting the data to be marked that corresponds to the labeled data to be screened into the mark prediction model for marking prediction, obtaining predicted labeled data corresponding to the data to be marked;

determining conflict labeled data according to the labeled data to be screened and the predicted labeled data.

3. The data processing method according to claim 2, characterized in that, before traversing the labeled data in the first labeled data set, the method further comprises:

obtaining the data features of the labeled data in the first labeled data set;

splitting the first labeled data set into N labeled data subsets, where the data features of the labeled data contained in each labeled data subset satisfy a preset distribution rule and N≥2;

wherein choosing at least one piece of labeled data from the first labeled data set as labeled data to be screened, and using the labeled data remaining in the first labeled data set after the labeled data to be screened is removed as training labeled data, comprises:

choosing K labeled data subsets from the N labeled data subsets as the labeled data to be screened, and using the remaining (N-K) labeled data subsets as training labeled data, where 1≤K≤N/2.

4. The data processing method according to claim 2, characterized in that determining conflict labeled data according to the labeled data to be screened and the predicted labeled data comprises:

when the labeled data to be screened is inconsistent with the predicted labeled data, determining the labeled data to be screened as conflict labeled data.

5. The data processing method according to any one of claims 1 to 4, characterized in that determining the third labeled data set according to the first labeled data set and the second labeled data set comprises:

replacing the conflict labeled data in the first labeled data set with the labeled data in the second labeled data set, obtaining the third labeled data set.

6. The data processing method according to claim 5, characterized in that, before taking the third labeled data set as the first labeled data set when the evaluation result of the third labeled data set does not satisfy the preset evaluation condition, the method further comprises:

extracting a first quantity of labeled data from the third labeled data set as sample labeled data;

counting a second quantity, namely the number of pieces of labeled data in the sample labeled data that conform to the preset marking rule;

calculating the ratio of the second quantity to the first quantity and using the ratio as the evaluation result of the third labeled data set.

7. A data processing apparatus, characterized in that the apparatus comprises:

a first obtaining module, configured to obtain a first labeled data set, the first labeled data set being labeled data obtained by marking data to be marked according to a preset marking rule;

a traversal module, configured to traverse the labeled data in the first labeled data set and, while traversing the labeled data in the first labeled data set, determine conflict labeled data using a mark prediction model;

a second obtaining module, configured to obtain a second labeled data set, the second labeled data set being labeled data obtained by re-marking, according to the preset marking rule, the conflict labeled data obtained during the traversal;

a first determining module, configured to determine a third labeled data set according to the first labeled data set and the second labeled data set;

a loop processing module, configured to, when the evaluation result of the third labeled data set does not satisfy a preset evaluation condition, take the third labeled data set as the first labeled data set and execute the traversing step again, until the evaluation result of the third labeled data set satisfies the preset evaluation condition.

8. The data processing apparatus according to claim 7, characterized in that the traversal module comprises:

a selection module, configured to choose at least one piece of labeled data from the first labeled data set as labeled data to be screened, and to use the labeled data remaining in the first labeled data set after the labeled data to be screened is removed as training labeled data;

a generation module, configured to perform machine learning on the training labeled data to generate a mark prediction model;

a prediction module, configured to input the data to be marked that corresponds to the labeled data to be screened into the mark prediction model for marking prediction, obtaining predicted labeled data corresponding to the data to be marked;

a second determining module, configured to determine conflict labeled data according to the labeled data to be screened and the predicted labeled data.

9. The data processing apparatus according to claim 8, characterized in that the apparatus further comprises:

a third obtaining module, configured to obtain the data features of the labeled data in the first labeled data set;

a splitting module, configured to split the first labeled data set into N labeled data subsets, where the data features of the labeled data contained in each labeled data subset satisfy a preset distribution rule and N≥2;

wherein the selection module is specifically configured to choose K labeled data subsets from the N labeled data subsets as the labeled data to be screened, and to use the remaining (N-K) labeled data subsets as training labeled data, where 1≤K≤N/2.

10. The data processing apparatus according to claim 8, characterized in that the second determining module is specifically configured to determine the labeled data to be screened as conflict labeled data when the labeled data to be screened is inconsistent with the predicted labeled data.

11. The data processing apparatus according to any one of claims 7 to 10, characterized in that the first determining module comprises:

a substitution module, configured to replace the conflict labeled data in the first labeled data set with the labeled data in the second labeled data set, obtaining the third labeled data set.

12. The data processing apparatus according to claim 11, characterized in that the apparatus further comprises:

an extraction module, configured to extract a first quantity of labeled data from the third labeled data set as sample labeled data;

a statistics module, configured to count a second quantity, namely the number of pieces of labeled data in the sample labeled data that conform to the preset marking rule;

a computing module, configured to calculate the ratio of the second quantity to the first quantity and to use the ratio as the evaluation result of the third labeled data set.

13. A network device, characterized by comprising:

a processor, adapted to implement one or more instructions; and

a memory, the memory storing one or more instructions, the one or more instructions being adapted to be loaded by the processor to execute the data processing method according to any one of claims 1 to 6.