CN112686301A - Data annotation method based on cross validation and related equipment - Google Patents


Info

Publication number: CN112686301A
Application number: CN202011594265.0A
Authority: CN (China)
Prior art keywords: data, marking, labeling, model, training
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 魏万顺 (Wei Wanshun)
Assignee (current and original): Ping An Puhui Enterprise Management Co Ltd
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority to CN202011594265.0A
Publication of CN112686301A


Abstract

The embodiments of this application belong to the technical field of artificial intelligence and relate to a data annotation method based on cross validation and related equipment. The method comprises the following steps: acquiring a small-sample initially labeled data set; inputting the initially labeled data set into a classification model for cross validation to obtain an initial labeling model; acquiring a large sample data set, inputting the large sample data set into the initial labeling model for pre-labeling, and determining a correction data set according to the pre-labeling result; inputting the correction data set into the initial labeling model for cross validation to obtain a final labeling model; and labeling the received data to be labeled through the final labeling model. The application also relates to blockchain technology: the data to be labeled can be stored in a blockchain. The method improves labeling efficiency to a large extent, detects mislabeled data during the labeling process by cross validation, and avoids repeated labeling of the vast majority of the data.

Description

Data annotation method based on cross validation and related equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a data annotation method based on cross validation and related equipment.
Background
In recent years, computer recognition technology based on deep learning has been widely used across industries. A good deep learning model needs a large amount of high-quality labeled data, and at present such data are almost always produced by manual labeling. Manual labeling is very inefficient, and the accuracy of its results depends to a great extent on the skill of the annotator, so the quality of manually labeled data cannot be effectively guaranteed; labeling quality is therefore a general problem in current data labeling. Two solutions exist today. The first is redundant labeling, in which the same data set is labeled repeatedly by multiple people and the best label is elected by voting; this works well but consumes a large amount of labeling cost. The second, common in enterprise-level projects, is manual spot-checking during labeling review: labeled data are added to the data set once the review passes, but the full set of samples cannot be checked, which lowers labeling accuracy.
Disclosure of Invention
The embodiments of the present application aim to provide a data labeling method, apparatus, computer device and storage medium based on cross validation, to solve the technical problem that the full set of samples is difficult to check efficiently.
In order to solve the above technical problem, an embodiment of the present application provides a data annotation method based on cross validation, which adopts the following technical solutions:
a data annotation method based on cross validation comprises the following steps:
acquiring a small-sample initially labeled data set;
inputting the initially labeled data set into a classification model for cross validation to obtain an initial labeling model;
acquiring a large sample data set, inputting the large sample data set into the initial labeling model for pre-labeling, and determining a correction data set according to the pre-labeling result;
inputting the correction data set into the initial labeling model for cross validation to obtain a final labeling model;
and labeling the received data to be labeled through the final labeling model.
In order to solve the above technical problem, an embodiment of the present application further provides a data labeling device based on cross validation, which adopts the following technical solution:
a cross-validation-based data annotation apparatus comprising:
the acquisition module is used for acquiring a small-sample initially labeled data set;
the first verification module is used for inputting the initially labeled data set into the classification model for cross validation to obtain an initial labeling model;
the pre-labeling module is used for acquiring a large sample data set, inputting the large sample data set into the initial labeling model for pre-labeling, and determining a correction data set according to the pre-labeling result;
the second verification module is used for inputting the correction data set into the initial labeling model for cross validation to obtain a final labeling model;
and the labeling module is used for labeling the received data to be labeled through the final labeling model.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
a computer device comprising at least one memory having stored therein a computer program and at least one processor implementing the steps of the cross-validation based data annotation method as described above when executed by the processor.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the cross-validation-based data annotation method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the method comprises the steps of performing cross validation on a classification model through a labeled initial standard data set to obtain an initial standard data training model, and evaluating the feasibility of a project at the initial labeling stage; then, pre-labeling the large sample data set by using the initial standard data training model, determining a correction data set according to a pre-labeling result, and pre-labeling the large sample data set by using the initial standard data training model, so that the labeling efficiency is improved to a greater extent; and inputting the correction data set into the initial standard data training model for cross validation until the marking accuracy reaches the expectation, determining a final standard data model, and detecting the error standard data in the marking process by adopting a cross validation method, so that the marking and auditing resources are greatly saved, and the repeated marking of most data is avoided.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a flow diagram of one embodiment of a cross-validation based data annotation process according to the present application;
FIG. 2 is a flowchart of one embodiment of step S2 of FIG. 1;
FIG. 3 is a flowchart of one embodiment of step S21 in FIG. 2;
FIG. 4 is a flowchart of one embodiment of step S4 of FIG. 1;
FIG. 5 is a flowchart of one embodiment of step S41 in FIG. 4;
FIG. 6 is a schematic block diagram of one embodiment of a cross-validation based data annotation device according to the present application;
FIG. 7 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals: 2. a data labeling device based on cross validation; 201. an acquisition module; 202. a first verification module; 203. a pre-labeling module; 204. a second verification module; 205. a labeling module; 3. a computer device; 301. a memory; 302. a processor; 303. a network interface.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The system architecture may include a terminal device, a network, and a server. The network serves as a medium for providing a communication link between the terminal device and the server. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use a terminal device to interact with a server over a network to receive or send messages, etc. The terminal device can be provided with various communication client applications, such as a web browser application, a shopping application, a searching application, an instant messaging tool, a mailbox client, social platform software and the like.
The terminal device may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smartphone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like.
The server may be a server providing various services, such as a background server providing support for pages displayed on the terminal device.
It should be noted that the data annotation method based on cross validation provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the data annotation device based on cross validation is generally disposed in the server/terminal device. It should be understood that there may be any number of end devices, networks, and servers, as desired for an implementation.
Referring to FIG. 1, a flow diagram of one embodiment of a cross-validation based data annotation process in accordance with the present application is shown. The data annotation method based on the cross validation comprises the following steps:
step S1, acquiring a small sample initial standard data set;
in this embodiment, the initial standard data set is a small sample initial standard data set, data is randomly screened according to the maximum daily standard value of the annotator, preferably, a week time, specifically, according to the specific analysis of the project bearable duration, and the data is the original distribution data faced by the model in the future in the production environment.
Specifically, partial data is screened from the original distribution data to serve as an initial standard data set, the data in the initial standard data set is labeled, and screening can be performed according to dates during screening.
After the initial labeling is finished, the data category distribution of the labeled initial labeling data set is counted and divided into negative samples and positive samples, the positive samples can be specific text classifications, such as abuse text, sensitive word text and the like, negative samples are defined except the positive samples, and the distribution of the data categories is used as a reference for selecting a subsequent classification model.
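As a minimal sketch of this counting step (the function name and label strings are illustrative assumptions, not taken from the patent), the class distribution can be tallied as follows:

```python
from collections import Counter

def class_distribution(labels):
    """Tally positive vs. negative samples in the initially labeled set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.items()}

# Example: labels = ["positive", "negative", "negative"]
# class_distribution(labels) -> {"negative": (2, 0.666...), "positive": (1, 0.333...)}
```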
Step S2, inputting the initially labeled data set into a classification model for cross validation to obtain an initial labeling model;
in this embodiment, data in the primary standard data set is divided into a preset number of sample sets, one sample set is taken as a verification set, the other sample sets are taken as training sets, cross-validation is performed through a classification model to obtain a prediction result of all data in the verification set, error data is determined according to the prediction result, an error threshold value is adjusted, the filtered error data is re-labeled, the re-labeled data is input into the classification model again for cross-validation, and the primary standard data training model is obtained until the classification model reaches an expected labeling accuracy.
Step S3, acquiring a large sample data set, inputting the large sample data set into the initial labeling model for pre-labeling, and determining a correction data set according to the pre-labeling result;
In this embodiment, a large sample data set is introduced and pre-labeled with the initial labeling model obtained in step S2, yielding pre-labeling information for each item in the large sample data set; the pre-labeling information comprises the pre-labeling result and its probability. The data are corrected according to the pre-labeling result to determine the correction data set. Alternatively, the initially labeled data set and the corrected large sample data set may be combined to serve as the correction data set.
Step S4, inputting the correction data set into the initial labeling model for cross validation to obtain a final labeling model;
In this embodiment, the data in the correction data set are divided into a preset number of sample sets; one sample set is taken as the verification set and the others as training sets, and cross validation is performed with the initial labeling model to obtain predictions for all data in the verification sets. Mislabeled data are determined from the predictions, the error threshold is adjusted, the screened mislabeled data are re-labeled, and the re-labeled data are fed back into the initial labeling model for cross validation, until the initial labeling model reaches the expected labeling accuracy and the final labeling model is determined.
And step S5, labeling the received data to be labeled through the final labeling model.
The data labeling method based on cross validation provided by this embodiment acquires an initially labeled data set and inputs it into a classification model for cross validation, obtaining an initial labeling model; this allows the feasibility of the project to be evaluated at the initial labeling stage. The initial labeling model then pre-labels the large sample data set, a correction data set is determined according to the pre-labeling result, and the correction data set is fed into the initial labeling model for cross validation until the labeling accuracy reaches the expectation, whereupon the final labeling model is determined. Pre-labeling the large sample data set with the initial labeling model improves labeling efficiency to a large extent, and detecting mislabeled data during the labeling process by cross validation greatly saves labeling and review resources and avoids repeated labeling of most data.
As shown in fig. 2, the step S2 specifically includes:
step S21, dividing data in the initial standard data set into K sample sets with consistent number, taking one sample set as a verification set and the other sample sets as training sets, and performing K-round cross verification through a classification model to obtain prediction results of all data in K groups of verification sets;
step S22, determining error mark data according to the prediction result;
step S23, sending the false mark data to a designated marking person for false mark type analysis, and adjusting an error threshold value between a prediction result and a marking result according to a received analysis result;
step S24, screening out new error marking data through the adjusted error threshold value, and sending the new error marking data to an appointed marking person, so that the appointed marking person constructs a new marking rule, and re-marking the screened error marking data;
and step S25, re-inputting the re-labeled data into the classification model for cross validation, and obtaining an initial labeled data training model when the classification model reaches the expected labeling accuracy.
In this embodiment, cross validation means partitioning the raw data set into groups, taking one part as the training set and another as the verification (test) set: the classification model is first trained on the training set and then tested on the verification set, and the result serves as the performance index for evaluating the classifier.
Optionally, the classification model may employ any of several machine learning algorithms, such as naïve Bayes, the nearest-neighbor rule, linear discriminant analysis, a support vector machine, or a decision tree.
Specifically, the data in the initially labeled data set are divided into a preset number of sample sets. After the initial labeling is finished, an algorithm engineer may divide the initially labeled data set into anywhere from 5 to 100 sample sets: when the data set holds fewer than 1,000 items, 100 sample sets are used, and each time the data volume grows tenfold the number of sample sets is halved; for example, at 10,000 items the number may be 50, and once it drops to 5 further reduction is not recommended. The number of sample sets should be neither too large nor too small: more sample sets bring each training set closer to the whole training sample and reduce model bias, but too many slow down computation. In this embodiment, the data in the initially labeled data set are divided into 10 sample sets.
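A minimal sketch of this fold-count heuristic follows; the function name and the rounding behavior between decades are assumptions, since the text only fixes the values below 1,000 items and at 10,000:

```python
import math

def choose_num_folds(n_samples, max_folds=100, min_folds=5):
    """Fold-count heuristic from the text: 100 sample sets below 1,000 items,
    halved for every further tenfold increase in data size, never below 5."""
    if n_samples < 1000:
        return max_folds
    halvings = int(math.log10(n_samples / 1000))  # full decades past 1,000
    return max(max_folds // (2 ** halvings), min_folds)

# choose_num_folds(800)     -> 100
# choose_num_folds(10_000)  -> 50
# choose_num_folds(100_000) -> 25
```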
The mislabeled data are sent to designated annotators for analysis of the cases where prediction and label disagree; the annotators judge whether the classification model's prediction or the manual label is wrong, and the error threshold is adjusted according to the received analysis result. The error threshold is the maximum allowed difference between the prediction and the label. For example, the error threshold of negative samples may be preset to 0.75: a negative sample is labeled 0, so if its prediction is 0.95 the difference between prediction and label is 0.95, which exceeds the error threshold, and the sample is judged to be mislabeled. Positive samples are judged mislabeled in the same way.
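A minimal sketch of this threshold check, assuming binary labels (0 = negative, 1 = positive) and one positive-class probability per sample; all names are illustrative:

```python
import numpy as np

def flag_mislabeled(labels, probs, threshold=0.75):
    """Flag samples whose predicted probability differs from the human
    label by more than the error threshold."""
    diffs = np.abs(np.asarray(probs, dtype=float) - np.asarray(labels, dtype=float))
    return np.where(diffs > threshold)[0]  # indices of suspected mislabels

# labels = [0, 1, 0]; probs = [0.95, 0.88, 0.10]
# flag_mislabeled(labels, probs) -> array([0])   # |0.95 - 0| = 0.95 > 0.75
```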
If the error threshold is too low, the amount of flagged mislabeled data grows and the annotation team's burden increases; if it is too high, mislabeled data are easily missed. The error threshold may therefore be adjusted according to the data re-labeling capacity the annotation team can bear. For example, suppose the error threshold of negative samples is preset to 0.75, 10,000 negative samples have predictions differing from their labels by more than this threshold, and the team can re-label 2,000 items per day: the screened mislabeled data far exceed the team's re-labeling capacity, so the error threshold may be raised, for example to 0.9, which correspondingly reduces the number of negative samples whose prediction-label difference exceeds the threshold and brings the amount of flagged data within the range the annotation team can re-label. Similarly, when the flagged mislabeled data fall well below the team's re-labeling capacity, the error threshold can be adjusted accordingly.
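Continuing the sketch above (same assumptions; the step size and loop form are illustrative, and `flag_mislabeled` is the helper defined earlier), the threshold can be raised until the flagged count fits the team's daily capacity:

```python
def adjust_threshold(labels, probs, daily_capacity, start=0.75, step=0.05):
    """Raise the error threshold until the number of flagged samples fits
    the annotation team's daily re-labeling capacity (e.g. 2,000 items);
    the symmetric case would lower it while far fewer items are flagged."""
    threshold = start
    while threshold + step < 1.0 and len(flag_mislabeled(labels, probs, threshold)) > daily_capacity:
        threshold += step
    return threshold

# With 10,000 items flagged at 0.75 and daily_capacity=2000, the loop keeps
# raising the threshold (toward 0.9) until the flagged count fits.
```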
The mislabeled data screened out by the error threshold are handed to the designated annotators so that they can communicate again and formulate labeling rules; when the classification model's performance still fails to reach an acceptable range after iteration, the classification model is replaced and re-selected, and the screened mislabeled data are re-labeled.
Steps S21 to S24 are repeated, the cross validation covering the full initially labeled data set, until the model performance and the labeling quality reach an acceptable range at the same time. In this embodiment, the model performance reaches the acceptable range when, through iteration or replacement of the classification model, the expected labeling accuracy is reached or the mislabel rate falls below the expected value; the labeling quality reaches the acceptable range when the expected labeling accuracy is reached, or the mislabel rate falls below the expected value, after re-labeling. The mislabel rate is the ratio of the number of mislabeled items screened in step S24 to the total number of predictions; data outside the mislabeled set count as correctly labeled, and the labeling accuracy is the ratio of the number of correctly labeled items to the total number of predictions. At the same time, it is manually assessed whether objective rules exist and whether, all things considered, the current algorithm capability can fit them. Model pre-research can thus be carried out during the cross-validation stage and project feasibility evaluated early in labeling; detecting mislabeled data during the labeling process by cross validation saves a large amount of labeling and review resources and avoids repeated labeling of most data.
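These two ratios, as defined above, amount to the following (a trivial sketch; names are illustrative):

```python
def labeling_metrics(n_mislabeled, n_total):
    """Mislabel rate = screened mislabels / total predictions; items outside
    the mislabeled set count as correct, so accuracy is the complement."""
    mislabel_rate = n_mislabeled / n_total
    return mislabel_rate, 1.0 - mislabel_rate

# labeling_metrics(120, 10_000) -> (0.012, 0.988)
```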
As shown in fig. 3, the step S21 specifically includes:
step S201, randomly dividing K sample sets with consistent number into a verification set and a training set, taking one sample set as the verification set, and taking the other sample sets as the training sets to obtain K groups of different training combinations, wherein K is a positive integer;
step S202, calculating the proportion of negative samples and positive samples in the training set in each training combination, and selecting a corresponding classification model according to the proportion;
step S203, training the selected classification model by alternately utilizing the training set in each training combination;
and S204, predicting the verification set of the current round by using the classification model after each round of training, and recording the prediction result to obtain the prediction result of all data of the K groups of verification sets.
In this embodiment, K-fold cross validation is adopted: the data are randomly divided into K sample sets, one sample set is reserved as the verification set for validating the model, and the other K-1 sample sets are used for training. The selected classification model is trained with the training sets of each training combination, the current round's verification set is then predicted with the model after each round of training, and the predictions are recorded; after K rounds of cross validation, predictions for all data in the K verification sets are obtained. The randomly generated sample sets are reused for training and validation, and each round's result is validated once. This embodiment may adopt 10-fold cross validation: the data are divided into 10 sample sets, 9 of which serve as the training set in turn while 1 serves as the verification set, finally yielding predictions for all data across the K verification sets. Detecting mislabeled data during the labeling process by cross validation greatly saves labeling and review resources and avoids repeated labeling of most data.
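A minimal K-fold sketch in the spirit of steps S201-S204, assuming numeric feature arrays and a scikit-learn-style classifier; the patent selects the classifier from the class ratio (step S202), for which logistic regression here is only a stand-in:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def kfold_predictions(X, y, k=10, seed=42):
    """Each sample set serves as the verification set exactly once while the
    remaining k-1 sets train the model, so every sample receives one
    out-of-fold prediction (its positive-class probability)."""
    X, y = np.asarray(X), np.asarray(y)
    probs = np.zeros(len(y))
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = LogisticRegression(max_iter=1000)  # stand-in for the ratio-selected model
        model.fit(X[train_idx], y[train_idx])
        probs[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    return probs
```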
The step S22 specifically includes:
comparing, for each round, whether the actual label of the data agrees with the prediction, where the actual label is the label given to the data in the initially labeled data set by an annotator;
when the actual label of the current round of data is the same as the prediction, giving the current round's prediction a first assignment; when the actual label of the current round of data differs from the prediction, giving the current round's prediction a second assignment;
calculating the cumulative assignment of each data item;
if a data item's cumulative assignment is larger than the preset assignment, adding it to the correct data set;
if a data item's cumulative assignment is smaller than the preset assignment, adding it to the mislabeled data set;
and outputting the mislabeled data in the mislabeled data set.
Specifically, after N rounds of predictions by the classification model, the results of all N rounds are tallied for every data item; the lower an item's score, the more likely it is mislabeled. For example, the first assignment may be set to 1 and the second to -1: when the actual label of an item equals the round's prediction, that round is assigned 1, otherwise -1. Over N rounds, some rounds are assigned 1 and the rest -1; the N assignments are summed and averaged to give the cumulative assignment. With the preset assignment set to 0.7, an item whose cumulative assignment exceeds 0.7 is added to the correct data set, while an item whose cumulative assignment falls below 0.7 is added to the mislabeled data set, and the mislabeled data in that set are output. Comparing each item's actual label and prediction in every round, computing its cumulative assignment, and judging mislabels by the cumulative assignment can effectively improve the accuracy of mislabel identification.
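A minimal sketch of this ±1 voting scheme, assuming hard per-round predictions stacked as an (N rounds × n samples) array; items whose average lands exactly on the cutoff are left out here, since the text only specifies the strict comparisons:

```python
import numpy as np

def split_by_cumulative_assignment(actual, round_preds, cutoff=0.7):
    """Score +1 when a round's prediction matches the human label and -1
    otherwise, average over the N rounds, and split the data at the cutoff."""
    actual = np.asarray(actual)
    round_preds = np.asarray(round_preds)          # shape (N, n_samples)
    votes = np.where(round_preds == actual, 1, -1)
    cumulative = votes.mean(axis=0)                # average of the N assignments
    correct = np.where(cumulative > cutoff)[0]
    mislabeled = np.where(cumulative < cutoff)[0]
    return correct, mislabeled
```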
The step of inputting the large sample data set into the initial labeling model for pre-labeling and determining the correction data set according to the pre-labeling result specifically comprises:
selecting a first number of data from the large sample data set, wherein the first number of data is greater than the amount of data in the small-sample initially labeled data set;
pre-labeling the first number of data with the initial labeling model to obtain pre-labeling information for the first number of data;
extracting part of the first number of data as a second number of data, and outputting the second number of data and their pre-labeling results to designated annotators;
receiving feedback information on the pre-labeling results of the second number of data, the feedback information comprising correct or mislabeled marks;
and if the feedback information is a mislabel, correcting the data's pre-labeling result to obtain the correction data set.
In this embodiment, the first number of data are pre-labeled with the initial labeling model to obtain their pre-labeling information, where the first number of data exceeds the amount of data in the small-sample initially labeled data set. The first number of data may be all of the large sample data set, or batches larger than the initially labeled data set may be extracted from the large sample data set as the first number of data. Part of these data is then extracted as the second number of data, and the second number of data together with their pre-labeling results are output to designated annotators. The second number of data may be extracted from the first number of data at random or selected according to the pre-labeling result. For example, if the first number of data is 100,000 items, 10,000 items may be randomly extracted as the second number of data; or, after pre-labeling, if 10,000 of the 100,000 items are pre-labeled as positive samples and 90,000 as negative samples, only the 10,000 positive samples (or only the 90,000 negative samples) are selected as the second number of data, that is, data of the same labeled category are selected according to the category of the pre-labeling result.
The annotators receive the second number of data and their pre-labeling results, for example the 10,000 items pre-labeled as positive samples described above. The annotators can check these 10,000 items and correct any whose pre-labeling result is wrong, for example an item that is actually a negative sample rather than a positive one. Feedback information on the pre-labeling results of the second number of data is received, the feedback comprising correct or mislabeled marks: if an annotator deems an item's pre-labeling result correct, the annotator can click the submit option on the data labeling device, and if an annotator deems it mislabeled, the annotator can click the modify option. The final labels of the second number of data are determined from this feedback; if an item's feedback is a mislabel, its pre-labeling result is corrected, yielding the correction data set. Pre-labeling the large sample data set through the initial labeling model, combined with manual spot-checking of the pre-labeling results, greatly saves labeling and review resources.
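A minimal sketch of this pre-label-then-spot-check flow, assuming a scikit-learn-style model with predict_proba and representing the device's correct/mislabeled options as a plain feedback dict; all names are illustrative:

```python
def prelabel(model, X_large):
    """Pre-label the large sample set; each entry keeps both the
    pre-labeling result and its probability."""
    probs = model.predict_proba(X_large)            # shape (n, n_classes)
    return [{"label": int(c), "prob": float(p)}
            for c, p in zip(probs.argmax(axis=1), probs.max(axis=1))]

def apply_feedback(prelabels, feedback):
    """Fold annotator feedback into the pre-labels: entries reported as
    mislabeled get the corrected label; the rest stay as pre-labeled."""
    for idx, corrected in feedback.items():         # {item index: correct label}
        prelabels[idx]["label"] = corrected
        prelabels[idx]["prob"] = 1.0                # human-confirmed
    return prelabels
```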
Before the step of outputting the second number of data and their pre-labeling results to the designated annotators, the method further comprises:
predicting, with the initial labeling model, the probability of the category to which each of the second number of data belongs;
and sorting the second number of data by the category probability.
The second number of data may be sorted by their data scores, where a data score measures the credibility of the corresponding item's pre-labeling result. In this embodiment, an item's data score may be set to the probability of the category it belongs to: for example, if the initial labeling model outputs a probability of 0.7 that an item is a negative sample, the item's data score can be taken as 0.7. The initial labeling model predicts the category of the input data and outputs, at its last (output) layer, the probabilities that the data belong to the different predetermined categories (i.e., a probability distribution). Data pre-labeled as positive samples can be sorted by score from high to low, and likewise for data pre-labeled as negative samples; high-scoring items are unlikely to be pre-labeled wrongly, while low-scoring items are more likely to be. Sorting by probability thus helps annotators check quickly.
Before the step of outputting the second number of data and their pre-labeling results to the designated annotators, the method further comprises:
selecting, from the first number of data, data whose data score is greater than a first score threshold or less than a second score threshold as the second number of data; or selecting the second number of data with the highest data scores from the first number of data.
When selecting the second number of data, items with a score greater than the first score threshold may be chosen, or items with a score less than the second score threshold. Suppose an item's data score is the probability that it belongs to a certain category: for example, data whose probability of being a positive sample is greater than (or less than) 0.7 may be selected from the first number of data and displayed as the second number of data; or the 500 items with the highest (or lowest) probability of being a positive sample may be selected and displayed as the second number of data.
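A minimal sketch of these two selection modes, assuming per-item positive-class probabilities as the data score; thresholds and names are illustrative:

```python
import numpy as np

def select_for_review(scores, first_thresh=None, second_thresh=None, top_n=None):
    """Pick the second number of data out of the first number: items scoring
    above the first threshold or below the second, or the top_n highest."""
    scores = np.asarray(scores)
    if top_n is not None:
        return np.argsort(-scores)[:top_n]          # highest scores first
    mask = np.zeros(len(scores), dtype=bool)
    if first_thresh is not None:
        mask |= scores > first_thresh
    if second_thresh is not None:
        mask |= scores < second_thresh
    return np.where(mask)[0]

# select_for_review(scores, first_thresh=0.7)  # items with score > 0.7
# select_for_review(scores, top_n=500)         # 500 highest-scoring items
```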
As shown in fig. 4, the step S4 specifically includes:
step S41, dividing the data in the correction data set into a preset number of sample sets, taking one sample set as a verification set and the other sample sets as training sets, and performing k-round cross verification through an initial standard data training model to obtain the prediction results of all the data in k groups of verification sets;
step S42, determining error mark data according to the prediction result;
step S43, sending the false mark data to a designated marking person for false mark type analysis, and adjusting an error threshold value between a prediction result and a marking result according to a received analysis result;
step S44, screening out new error marking data through the adjusted error threshold value, and sending the new error marking data to an appointed marking person, so that the appointed marking person constructs a new marking rule, and re-marking the screened error marking data;
and step S45, inputting the re-marked data into the initial standard data training model again for cross validation, and obtaining a final standard data model when the initial standard data training model reaches the expected marking accuracy.
In this embodiment, the mislabeled data are analyzed, that is, the cases where prediction and label disagree: the annotation team judges whether the model prediction or the manual label is wrong and adjusts the error threshold, where the error threshold is the maximum allowed difference between prediction and label. The error threshold can be adjusted according to the data re-labeling capacity the annotation team can bear, so that the amount of flagged mislabeled data stays within the range the team can handle.
The mislabeled data screened out by the error threshold are handed to the designated annotators, who communicate again and formulate labeling rules; when the model's performance still fails to reach an acceptable range after iteration, the classification model is replaced and re-selected, and the possibly mislabeled data are re-labeled.
Steps S41 to S44 are repeated, the cross validation covering the full correction data set, until the model performance and the labeling quality reach an acceptable range at the same time: the model performance reaches the acceptable range when, through iteration or replacement of the model, the expected labeling accuracy is reached or the mislabel rate falls below the expected value, and the labeling quality reaches the acceptable range when the expected labeling accuracy or mislabel rate is achieved after re-labeling. The labeling accuracy and mislabel rate are computed from the mislabeled data screened by the error threshold in step S44 and the overall predictions. At the same time, it is manually assessed whether objective rules exist and whether the current algorithm capability can fit them. Detecting mislabeled data during the labeling process by cross validation greatly saves labeling and review resources and avoids repeated labeling of most data.
As shown in fig. 5, the step S41 specifically includes:
step S401, randomly dividing a preset number of sample sets into a verification set and a training set, taking one sample set as the verification set, and taking the other sample sets as the training sets to obtain k groups of different training combinations, wherein k is a positive integer;
s402, training the selected initial standard data training model by alternately utilizing the training set in each training combination;
and S403, predicting the verification set of the current round by using the initial standard data training model after each round of training, and recording the prediction result to obtain the prediction result of all data of the k groups of verification sets.
In this embodiment, k-fold cross validation is adopted: the data are randomly divided into k sample sets, one sample set is reserved as the verification set for validating the model, and the other k-1 sample sets are used for training. The initial labeling model is trained with the training sets of each training combination, the current round's verification set is then predicted with the model after each round of training, and the predictions are recorded; after k rounds of cross validation, predictions for all data in the k verification sets are obtained. This embodiment may adopt 10-fold cross validation: the data are divided into 10 sample sets, 9 of which serve as the training set in turn while 1 serves as the verification set, finally yielding predictions for all data across the k verification sets. Detecting mislabeled data during the labeling process by cross validation greatly saves labeling and review resources and avoids repeated labeling of most data.
It should be emphasized that, to further ensure the privacy and security of the data to be labeled, the data to be labeled may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, each block containing information on a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order: unless explicitly stated herein, the steps are not performed in a strict order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and not necessarily in sequence, but in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 6, as an implementation of the method shown in fig. 1, the present application provides an embodiment of a data annotation device based on cross validation, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 1, and the device can be applied to various computer devices.
As shown in fig. 6, the data annotation device 2 based on cross validation according to the present embodiment includes:
an acquisition module 201, a first verification module 202, a pre-annotation module 203, a second verification module 204 and an annotation module 205.
The acquisition module 201 is configured to acquire a small-sample initially labeled data set.
In this embodiment, the acquisition module 201 acquires the initially labeled data set, which is a small-sample data set. Data are randomly screened according to the maximum amount the annotators can label per day, preferably enough for about one week of labeling, with the exact amount determined by how long the project can afford for this stage; the data come from the original distribution the model will face in the production environment.
Specifically, part of the original distribution data is screened out to serve as the initially labeled data set and then labeled; the screening may be performed by date.
After the initial labeling is finished, the class distribution of the labeled initially labeled data set is counted and divided into negative and positive samples. Positive samples may be specific text categories, such as abusive text or sensitive-word text, and everything outside the positive samples is defined as a negative sample. The class distribution serves as a reference for selecting the subsequent classification model.
The first verification module 202 is configured to input the initially labeled data set into the classification model for cross validation to obtain the initial labeling model.
In this embodiment, the data in the initially labeled data set are divided into a preset number of sample sets; one sample set is taken as the verification set and the others as training sets, and cross validation is performed with the classification model to obtain predictions for all data in the verification sets. Mislabeled data are determined from the predictions, the error threshold is adjusted, the screened mislabeled data are re-labeled, and the re-labeled data are fed back into the classification model for cross validation, until the classification model reaches the expected labeling accuracy and the initial labeling model is obtained.
The pre-labeling module 203 is configured to acquire a large sample data set, input the large sample data set into the initial labeling model for pre-labeling, and determine a correction data set according to the pre-labeling result.
In this embodiment, a large sample data set is introduced and pre-labeled with the initial labeling model, yielding pre-labeling information for each item in the large sample data set; the pre-labeling information comprises the pre-labeling result and its probability, and the correction data set is determined according to the pre-labeling result.
The second verification module 204 is configured to input the correction data set into the initial labeling model for cross validation to obtain the final labeling model.
In this embodiment, the data in the correction data set are divided into a preset number of sample sets; one sample set is taken as the verification set and the others as training sets, and cross validation is performed with the initial labeling model to obtain predictions for all data in the verification sets. Mislabeled data are determined from the predictions, the error threshold is adjusted, the screened mislabeled data are re-labeled, and the re-labeled data are fed back into the initial labeling model for cross validation, until the initial labeling model reaches the expected labeling accuracy and the final labeling model is determined.
The labeling module 205 is configured to label the received data to be labeled through the final labeling model.
The data labeling device based on cross validation provided by this embodiment acquires a small-sample initially labeled data set and inputs it into a classification model for cross validation, obtaining an initial labeling model and allowing the feasibility of the project to be evaluated at the initial labeling stage. The initial labeling model pre-labels the large sample data set, a correction data set is determined according to the pre-labeling result, and the correction data set is fed into the initial labeling model for cross validation until the labeling accuracy reaches the expectation and the final labeling model is determined. Pre-labeling the large sample data set with the initial labeling model improves labeling efficiency to a large extent, and detecting mislabeled data during the labeling process by cross validation greatly saves labeling and review resources and avoids repeated labeling of most data.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 7, fig. 7 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 3 comprises a memory 301, a processor 302 and a network interface 303 communicatively connected to one another through a system bus; the memory 301 stores a computer program, and the processor 302 implements the steps of the cross-validation based data annotation method described above when executing the computer program. Note that only a computer device 3 with components 301-303 is shown, but it should be understood that not all of the shown components are required and that more or fewer components may be implemented instead. As understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), embedded devices, and the like.
The computer device can be a mobile phone, a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 301 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory 301 may be an internal storage unit of the computer device 3, such as a hard disk or memory of the computer device 3. In other embodiments, the memory 301 may also be an external storage device of the computer device 3, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card provided on the computer device 3. Of course, the memory 301 may also comprise both an internal storage unit of the computer device 3 and an external storage device. In this embodiment, the memory 301 is generally used to store the operating system and the various application software installed on the computer device 3, such as the program code of the cross-validation based data labeling method. The memory 301 may also be used to temporarily store various types of data that have been or are to be output.
The processor 302 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 302 is typically used to control the overall operation of the computer device 3. In this embodiment, the processor 302 is configured to execute the program code stored in the memory 301 or process data, for example, execute the program code of the cross-validation-based data tagging method.
The network interface 303 may comprise a wireless network interface or a wired network interface, and the network interface 303 is typically used for establishing a communication connection between the computer device 3 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing a cross-validation based data annotation program, which is executable by at least one processor 302 to cause the at least one processor 302 to perform the steps of the cross-validation based data annotation method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely some, not all, embodiments of the present application, and that the drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or substitute equivalents for some of their features. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, falls within the protection scope of the present application.

Claims (10)

1. A data annotation method based on cross validation is characterized by comprising the following steps:
acquiring a small-sample initially labeled data set;
inputting the initially labeled data set into a classification model for cross validation to obtain an initial labeling model;
acquiring a large sample data set, inputting the large sample data set into the initial labeling model for pre-labeling, and determining a correction data set according to the pre-labeling result;
inputting the correction data set into the initial labeling model for cross validation to obtain a final labeling model;
and labeling the received data to be labeled through the final labeling model.
2. The cross-validation-based data annotation method of claim 1, wherein the step of inputting the initial labeled data set into a classification model for cross-validation to obtain the initial labeled-data training model specifically comprises:
dividing the data in the initial labeled data set into K equally sized sample subsets, taking one subset as a validation set and the remaining subsets as training sets, and performing K rounds of cross-validation with the classification model to obtain prediction results for all data across the K validation sets;
determining mislabeled data according to the prediction results;
sending the mislabeled data to a designated annotator for error-type analysis, and adjusting an error threshold between prediction results and annotation results according to the received analysis;
screening new mislabeled data with the adjusted error threshold and sending them to the designated annotator, so that the designated annotator constructs a new annotation rule and re-annotates the screened mislabeled data;
and inputting the re-annotated data into the classification model again for cross-validation, and obtaining the initial labeled-data training model when the classification model reaches the expected annotation accuracy.
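A minimal sketch of this screening loop follows, assuming the error threshold is applied to the out-of-fold probability the model assigns to a datum's current label (the claim leaves the threshold semantics open) and that relabel stands in for the designated annotator; out_of_fold_proba is sketched under claim 3 below.

import numpy as np

def clean_labels(X, y, relabel, expected_acc=0.95, threshold=0.5, max_rounds=10):
    # y is assumed to hold integer labels 0..C-1.
    for _ in range(max_rounds):
        proba = out_of_fold_proba(X, y)            # K-fold predictions (claim 3)
        if (proba.argmax(axis=1) == y).mean() >= expected_acc:
            break                                  # expected accuracy reached
        agree = proba[np.arange(len(y)), y]        # confidence in current label
        suspects = np.where(agree < threshold)[0]  # screened mislabeled data
        y[suspects] = relabel(X[suspects], y[suspects])
    return y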
3. The cross-validation-based data annotation method of claim 2, wherein the step of dividing the data in the initial labeled data set into K equally sized sample subsets, taking one subset as a validation set and the remaining subsets as training sets, and performing K rounds of cross-validation with the classification model to obtain prediction results for all data across the K validation sets specifically comprises:
randomly dividing the K equally sized sample subsets into a validation set and training sets, taking one subset as the validation set and the remaining subsets as the training sets, to obtain K different training combinations, K being a positive integer;
calculating the ratio of negative samples to positive samples in the training set of each training combination, and selecting a corresponding classification model according to that ratio;
training the selected classification model on the training set of each training combination in turn; and predicting the current round's validation set with the classification model trained in that round, and recording the predictions, to obtain prediction results for all data across the K validation sets.
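The fold mechanics might look as follows; the claim does not say how a model is "selected according to the ratio", so the balanced-class-weight switch is only an assumed stand-in, and integer labels 0..C-1 with every class present in each training fold are also assumed.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def out_of_fold_proba(X, y, K=5):
    # Collect a prediction for every datum from the K validation folds.
    proba = np.zeros((len(y), len(np.unique(y))))
    for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        y_tr = y[train_idx]
        ratio = (y_tr == 0).sum() / max((y_tr == 1).sum(), 1)  # negative:positive
        # Assumed stand-in for "selecting a corresponding classification model
        # according to the ratio": reweight classes when the set is imbalanced.
        weight = "balanced" if ratio > 2 or ratio < 0.5 else None
        clf = LogisticRegression(max_iter=1000, class_weight=weight)
        clf.fit(X[train_idx], y_tr)
        proba[val_idx] = clf.predict_proba(X[val_idx])  # predict held-out fold
    return proba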
4. The cross-validation-based data annotation method of claim 2, wherein the step of determining mislabeled data according to the prediction results specifically comprises:
comparing whether the actual annotation of each round's data is consistent with the prediction result, the actual annotation being the label assigned by an annotator to the data in the initial labeled data set;
when the actual annotation of the current round's data matches the prediction result, giving the prediction result of the current round's data a first assignment;
when the annotation of the current round's data differs from the prediction result of the current round's data, giving the prediction result of the current round's data a second assignment;
calculating the cumulative assignment of each datum;
if the cumulative assignment of a datum is greater than a preset assignment, adding the datum to a correct data set;
if the cumulative assignment of a datum is smaller than the preset assignment, adding the datum to a mislabeled data set;
and outputting the mislabeled data in the mislabeled data set.
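As a sketch of this scoring, assuming the first assignment is 1, the second is 0, and the preset assignment defaults to half the number of rounds (none of these values is fixed by the claim):

import numpy as np

def split_by_agreement(y_actual, round_preds, first=1, second=0, preset=None):
    # round_preds: one row of predictions per round, shape (rounds, n_samples).
    round_preds = np.asarray(round_preds)
    scores = np.where(round_preds == y_actual, first, second).sum(axis=0)
    if preset is None:
        preset = round_preds.shape[0] // 2        # assumed default threshold
    correct = np.where(scores > preset)[0]        # kept as correctly labeled
    mislabeled = np.where(scores < preset)[0]     # output for re-annotation
    return correct, mislabeled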
5. The cross-validation-based data annotation method of claim 1, wherein the step of inputting the large-sample data set into the initial labeled-data training model for pre-annotation and determining the correction data set according to the pre-annotation results specifically comprises:
selecting a first number of data from the large-sample data set, the first number being greater than the amount of data in the small-sample initial labeled data set;
pre-annotating the first number of data with the initial labeled-data training model to obtain pre-annotation information for the first number of data, the pre-annotation information comprising pre-annotation results;
extracting part of the first number of data as a second number of data, and outputting the second number of data and their pre-annotation results to a designated annotator;
receiving feedback on the pre-annotation results of the second number of data, the feedback indicating whether a label is correct or wrong;
and if the feedback indicates a wrong label, correcting the pre-annotation result of that datum to obtain the correction data set.
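A sketch of this sampling-and-correction flow follows; the review callback stands in for the designated annotator's correct-or-wrong feedback, and all names are illustrative rather than from the patent.

import numpy as np

def build_correction_set(model, X_large, first_n, second_n, review):
    rng = np.random.default_rng(0)
    first = rng.choice(len(X_large), size=first_n, replace=False)
    pre = model.predict(X_large[first])            # pre-annotation results
    labels = pre.copy()
    # The second number of data is shown to the annotator with its pre-labels.
    for i in range(second_n):
        ok, fixed = review(X_large[first[i]], pre[i])  # feedback per datum
        if not ok:
            labels[i] = fixed                      # correct the pre-annotation
    return X_large[first], labels                  # the correction data set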
6. The cross-validation-based data annotation method of claim 5, wherein the step of outputting the second number of data and their pre-annotation results to the designated annotator is preceded by:
predicting, with the initial labeled-data training model, the probability of the category to which each of the second number of data belongs;
and sorting the second number of data by that category probability.
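Ranking least-confident first (as in active learning) is an assumption here, since the claim does not fix the sort order; a minimal sketch:

import numpy as np

def sort_for_review(model, X_second):
    confidence = model.predict_proba(X_second).max(axis=1)
    order = np.argsort(confidence)                 # least confident first
    return X_second[order], confidence[order]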
7. The cross-validation-based data annotation method of claim 1, wherein the step of inputting the correction data set into the initial labeled-data training model for cross-validation to obtain the final labeled-data model specifically comprises:
dividing the data in the correction data set into k equally sized sample subsets, taking one subset as a validation set and the remaining subsets as training sets, and performing k rounds of cross-validation with the initial labeled-data training model to obtain prediction results for all data across the k validation sets;
determining mislabeled data according to the prediction results;
sending the mislabeled data to a designated annotator for error-type analysis, and adjusting the error threshold between prediction results and annotation results according to the received analysis;
screening new mislabeled data with the adjusted error threshold and sending them to the designated annotator, so that the designated annotator constructs a new annotation rule and re-annotates the screened mislabeled data;
and inputting the re-annotated data into the initial labeled-data training model again for cross-validation, and obtaining the final labeled-data model when the initial labeled-data training model reaches the expected annotation accuracy.
8. A data annotation device based on cross validation, characterized by comprising:
an acquisition module, configured to acquire a small-sample initial labeled data set;
a first validation module, configured to input the initial labeled data set into a classification model for cross-validation to obtain an initial labeled-data training model;
a pre-annotation module, configured to acquire a large-sample data set, input the large-sample data set into the initial labeled-data training model for pre-annotation, and determine a correction data set according to the pre-annotation results;
a second validation module, configured to input the correction data set into the initial labeled-data training model for cross-validation to obtain a final labeled-data model;
and an annotation module, configured to annotate received data to be annotated with the final labeled-data model.
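Structurally, the five claimed modules map one-to-one onto the methods of a single class; the skeleton below is only an illustrative decomposition with invented names, not the patented apparatus.

class CrossValidationAnnotationDevice:
    def acquire(self): ...            # acquisition module: small initial labeled set
    def first_validate(self): ...     # first validation module: K-fold cross-validation
    def pre_annotate(self): ...       # pre-annotation module: large-sample pre-annotation
    def second_validate(self): ...    # second validation module: final labeled-data model
    def annotate(self, data): ...     # annotation module: label newly received data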
9. A computer device, comprising at least one memory in which a computer program is stored, and at least one processor, wherein the at least one processor, when executing the computer program, implements the steps of the cross-validation-based data annotation method of any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the cross-validation-based data annotation method of any one of claims 1 to 7.
CN202011594265.0A 2020-12-29 2020-12-29 Data annotation method based on cross validation and related equipment Pending CN112686301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011594265.0A CN112686301A (en) 2020-12-29 2020-12-29 Data annotation method based on cross validation and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011594265.0A CN112686301A (en) 2020-12-29 2020-12-29 Data annotation method based on cross validation and related equipment

Publications (1)

Publication Number Publication Date
CN112686301A true CN112686301A (en) 2021-04-20

Family

ID=75455240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011594265.0A Pending CN112686301A (en) 2020-12-29 2020-12-29 Data annotation method based on cross validation and related equipment

Country Status (1)

Country Link
CN (1) CN112686301A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343695A (en) * 2021-05-27 2021-09-03 镁佳(北京)科技有限公司 Text labeling noise detection method and device, storage medium and electronic equipment
CN114819000A (en) * 2022-06-29 2022-07-29 北京达佳互联信息技术有限公司 Feedback information estimation model training method and device and electronic equipment
CN114819000B (en) * 2022-06-29 2022-10-21 北京达佳互联信息技术有限公司 Feedback information estimation model training method and device and electronic equipment
CN115600601A (en) * 2022-11-08 2023-01-13 税友软件集团股份有限公司(Cn) Method, device, equipment and medium for constructing tax law knowledge base
CN116756576A (en) * 2023-08-17 2023-09-15 阿里巴巴(中国)有限公司 Data processing method, model training method, electronic device and storage medium
CN116756576B (en) * 2023-08-17 2023-12-12 阿里巴巴(中国)有限公司 Data processing method, model training method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN112084383B (en) Knowledge graph-based information recommendation method, device, equipment and storage medium
CN112686301A (en) Data annotation method based on cross validation and related equipment
CN112328909B (en) Information recommendation method and device, computer equipment and medium
CN113326991B (en) Automatic authorization method, device, computer equipment and storage medium
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN112035549B (en) Data mining method, device, computer equipment and storage medium
CN113807728A (en) Performance assessment method, device, equipment and storage medium based on neural network
CN110991538B (en) Sample classification method and device, storage medium and computer equipment
CN117093477A (en) Software quality assessment method and device, computer equipment and storage medium
CN111639360A (en) Intelligent data desensitization method and device, computer equipment and storage medium
CN115936895A (en) Risk assessment method, device and equipment based on artificial intelligence and storage medium
CN115237724A (en) Data monitoring method, device, equipment and storage medium based on artificial intelligence
CN113283222B (en) Automatic report generation method and device, computer equipment and storage medium
CN112887371B (en) Edge calculation method and device, computer equipment and storage medium
CN114036921A (en) Policy information matching method and device
CN116843395A (en) Alarm classification method, device, equipment and storage medium of service system
CN115757075A (en) Task abnormity detection method and device, computer equipment and storage medium
CN112598540B (en) Material reserve recommending method, equipment and storage medium
CN112084408B (en) List data screening method, device, computer equipment and storage medium
CN114925275A (en) Product recommendation method and device, computer equipment and storage medium
CN112069807A (en) Text data theme extraction method and device, computer equipment and storage medium
CN114549053B (en) Data analysis method, device, computer equipment and storage medium
CN114866818B (en) Video recommendation method, device, computer equipment and storage medium
CN114462411B (en) Named entity recognition method, device, equipment and storage medium
CN113902032A (en) Service data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination