CN108960409B - Method and device for generating annotation data and computer-readable storage medium - Google Patents

Method and device for generating annotation data and computer-readable storage medium

Info

Publication number
CN108960409B
Authority
CN
China
Prior art keywords
data
data set
labeled
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810609646.8A
Other languages
Chinese (zh)
Other versions
CN108960409A (en)
Inventor
郑斌
徐晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Black Shark Technology Co Ltd
Original Assignee
Nanchang Black Shark Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Black Shark Technology Co Ltd
Priority to CN201810609646.8A
Publication of CN108960409A
Application granted
Publication of CN108960409B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for generating annotation data, and a computer-readable storage medium. The method for generating the annotation data comprises the following steps: s100: acquiring a data corpus and a labeled data set which is contained in the data corpus and has been labeled; s200: analyzing the data features of the labeled data set, and producing from those features a pseudo data set that conforms to them; s300: expanding the pseudo data set based on a GAN neural network to form an expanded data set; s400: identifying whether the data in the expanded data set need to be labeled, and screening the labeled data to form a training data set; s500: performing neural network training on the training data set to form a training model; s600: cleaning the data in the data corpus outside the labeled data set based on the training model, labeling the data that conform to the training model, and placing them into the labeled data set. In this way, a training data set that matches the sample data closely and has strong randomness can be generated quickly and efficiently from a small amount of data, enlarging the volume of the labeled data.

Description

Method and device for generating annotation data and computer-readable storage medium
Technical Field
The present invention relates to the field of data models, and in particular, to a method and an apparatus for generating labeled data, and a computer-readable storage medium.
Background
With the rapid development of application programs on intelligent terminals and of the artificial intelligence technologies built on them, these technologies have entered people's lives ever more widely. Whether in daily use, games, work, or other fields, learning from original sample data is needed to understand the usage habits of each field and make intelligent judgments.
For learning from the original sample data, deep neural network techniques may be employed. Deep neural networks have developed rapidly in recent years, achieved accuracy beyond expectations in the field of image recognition, and found wide application in many fields. In practical engineering applications, however, many specialized image recognition tasks lack a data set available for training, and the accuracy of a deep neural network model depends heavily on the size and quality of its data set. To address the lack of training data, the prior art generally applies random cropping, rotation, stretching, and flipping to the existing labeled data, but this has the following drawbacks:
1. the original image data corresponding to some models are small in length and width, so the amount of data that random cropping can add is limited;
2. when the original sample data are few, the data obtained by these methods tend to overfit the model because their features are not dispersed enough;
3. some models are sensitive to stretching of the data, and their recognition rate drops markedly after stretching;
4. manually collecting and labeling data consumes a great deal of labor and effort.
Therefore, a new method for generating labeled data is needed, one that can quickly generate a large training data set with strong randomness from only a little labeled sample data and simplify the work of collecting and labeling training data.
Disclosure of Invention
To overcome the above technical drawbacks, an object of the present invention is to provide a method, a device, and a computer-readable storage medium for generating labeled data that can quickly and efficiently generate, from a small amount of data, a training data set that matches the sample data closely and has strong randomness, thereby enlarging the volume of the labeled data.
The invention discloses a method for generating annotation data, which comprises the following steps:
s100: acquiring a data corpus and a labeled data set which is contained in the data corpus and has been labeled;
s200: analyzing the data features of the labeled data set, and producing from those features a pseudo data set that conforms to them;
s300: expanding the pseudo data set based on a GAN neural network to form an expanded data set;
s400: identifying whether the data in the expanded data set need to be labeled, and screening the labeled data to form a training data set;
s500: performing neural network training on the training data set to form a training model;
s600: cleaning the data in the data corpus outside the labeled data set based on the training model, labeling the data that conform to the training model, and placing them into the labeled data set.
Preferably, the annotation data generation method further comprises the following steps:
s700: judging whether the data volume in the labeled data set is greater than or equal to an expected data volume;
s800: when the data volume in the labeled data set is smaller than the expected data volume, taking the union of the training data set and the labeled data set and executing steps S500-S600 again.
Preferably, step S800 is replaced by:
s800': when the data volume in the labeled data set is smaller than the expected data volume, replacing the data in the pseudo data set with the data in the labeled data set and executing steps S300-S600 again.
Preferably, the annotation data generation method further comprises the following steps:
s900: training other data sets besides the data corpus based on the labeled data set and/or the training data set formed in step S600.
Preferably, the step S300 of expanding the pseudo data set based on the GAN neural network to form an expanded data set comprises:
s310: constructing a generation model and a discrimination model;
s320: configuring the discrimination model to output discrimination probability values greater than 0.5 for data in the pseudo data set, and deep-learning, based on those values, the output of discrimination probability values for data outside the pseudo data set;
s330: the generation model generates a data set to be expanded based on the data in the pseudo data set;
s340: the generation model inputs the pseudo data set and the data set to be expanded into the discrimination model;
s350: collecting the data output by the discrimination model with discrimination probability values greater than 0.5 to form the expanded data set.
Preferably, the step S400 of identifying whether the data in the expanded data set need to be labeled and screening the labeled data to form a training data set comprises:
s410: verifying the data in the expanded data set according to the labeled data set and the data features;
s420: extracting the data whose verification result is the identification label, and deleting from the expanded data set the data whose verification result is not the identification label.
Preferably, the step S410 of verifying the data in the expanded data set according to the labeled data set and the data features comprises:
s411: verifying the data in the expanded data set by taking the data in the labeled data set as a model;
s412: when more than half of the levels, or all levels, in the model verify the data consistently, judging that the verification result is the identification label.
Preferably, the data features include one or more of: the background of the data, the unit numbers of the data, the gaps between the numbers of the data, the target of the data, and the noise of the data.
The invention also discloses an annotation data generation device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the annotation data generation method described above is implemented when the processor executes the computer program.
The present invention also discloses a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the annotation data generation method described above.
Compared with the prior art, the adoption of the above technical scheme has the following beneficial effects:
1. even when the amount of sample data is small, a data set containing a large amount of annotation data can be generated quickly;
2. the data have strong randomness, overfitting is less likely to occur, and the quality of the labeled data set is improved;
3. the model generated from the pseudo data set is used to identify and label other data, expanding the size and richness of the labeled data set; this process can iterate forward in a loop, improving the training speed and accuracy of the deep neural network model.
Drawings
FIG. 1 is a flow chart illustrating a method for generating annotation data in accordance with a preferred embodiment of the present invention;
FIG. 2 is data of a pseudo data set in accordance with a preferred embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for generating annotation data in accordance with a further preferred embodiment of the present invention;
FIG. 4 is a schematic flow chart illustrating a method for generating annotation data in accordance with a further preferred embodiment of the present invention;
FIG. 5 is a flowchart illustrating the step S300 of the annotation data generation method according to a preferred embodiment of the invention;
FIG. 6 is a flowchart illustrating the step S400 of the annotation data generation method according to a preferred embodiment of the invention.
Detailed Description
The advantages of the invention are further illustrated in the following description of specific embodiments in conjunction with the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. Depending on context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and limited, it is to be noted that the terms "mounted," "connected," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only to facilitate the explanation of the present invention and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
Fig. 1 is a schematic flow chart of a method for generating annotation data according to a preferred embodiment of the present invention. The method builds and analyzes a data model from the large amounts of data in an application program, such as everyday phenomena, financial services, and social behavior; according to the collected data model it labels whether other data conform to the model's rules, and it can further move from judging data to generating data, serving artificial-intelligence purposes such as advance prediction and data processing. All labeled and unlabeled data in a given field form a data corpus U. The sample data, i.e. the data that have already been labeled, are the data in the corpus U that the user has obtained and whose labeling results are determined; these data form a labeled data set A. In this embodiment, a recognition model is built on the basis of the data corpus U and the labeled data set A, and the unlabeled data of the corpus U are labeled accurately. Specifically, the method comprises the following steps:
s100: acquiring a data corpus U and a labeled data set A which is contained in the data corpus U and has been labeled;
The data corpus U containing all the data may be a data set formed by collecting all the data contained in an application program, such as numbers, letters, and characters, or a data set formed by purchasers and users from collected data such as activity level, active time, online time, and active duration. In some embodiments the data volume is so large that the data contained in the corpus U are substantially equivalent to a data set combining all numbers, letters, characters, pictures, and the like. Whatever the data volume of the corpus U, it contains data already labeled by the user, that is, data whose authenticity has been determined to be true; these labeled data serve as sample data and form a labeled data set A that is contained in the corpus U. Acquiring the data corpus U and the labeled data set A is the basic operation for the subsequent steps.
S200: analyzing the data features of the labeled data set A, and producing from those features a pseudo data set F that conforms to them;
The labeled data set A is then analyzed. Specifically, since the user currently has only the data in the labeled data set A, whose authenticity is determined, the data features of the labeled data set must first be analyzed as the basis for expansion, including but not limited to one or more of: the background of the data in A, the unit numbers of the data, the gaps between numbers, the target of the data, and the noise of the data. As shown in fig. 2, in one embodiment the datum obtained in an application in the labeled data set A is the number "2589"; analysis yields data features including: the digits 2, 5, 8, and 9; the gray scale and brightness of the background picture on which the number is shown; the font of the digits; and the spacing between the digits. After the data are analyzed one by one, each datum in the labeled data set A has its own data features.
Based on the data features obtained from the above analysis, other data are then produced that fit these features but differ from the data in the labeled data set A. The digits used are the digits of the data in A (e.g., any one or more of 0-9); the gaps between digits satisfy the gaps of the data in A (preferably not exactly equal, but set within a ±10%-15% tolerance); and so on. As shown in fig. 2, pseudo data such as "25", "52", and "89" can be produced from the datum "2589" in the labeled data set. Since the pseudo data set is produced according to the data features of the labeled data, it may be a recombination of the units of those data, and when the units are combined the production conditions can be set arbitrarily according to the required amount of pseudo data. It will be appreciated that if a large amount of pseudo data is required, only the data features corresponding to the digits of the labeled data need be extracted while the other data features are chosen freely, and vice versa.
Because each pseudo datum in the pseudo data set F is a combination of units of the labeled data, it may equal none of the data in the labeled data set. For example, the digits 1 and 6 appear among the data extracted from the labeled data set A, but neither the datum 16 nor 61 exists there on its own. The pseudo data set F may nevertheless contain any combination of the digits 1 and 6, such as 16, 61, 116, or 661. Thus, when the data in the labeled data set use the digit units 0-9, the pseudo data set can realize any real number on the basis of the data features, greatly enlarging the data volume of the pseudo data set F.
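To make the recombination concrete, the following minimal Python sketch produces pseudo digit strings from the digit units of a labeled set. The function names and the plain-string representation are illustrative assumptions: the patent works on rendered images and also varies background, font, and digit spacing, which this sketch omits.

```python
# Illustrative sketch of step S200 (assumed names, plain strings instead of
# images): recombine the digit units observed in the labeled set A into new
# pseudo data that differ from every labeled datum.
import random

def extract_digit_units(labeled_strings):
    """Collect the distinct digit units appearing in the labeled data."""
    units = set()
    for s in labeled_strings:
        units.update(ch for ch in s if ch.isdigit())
    return sorted(units)

def make_pseudo_strings(labeled_strings, count, min_len=2, max_len=4):
    """Recombine observed digits into strings absent from the labeled set."""
    units = extract_digit_units(labeled_strings)
    labeled = set(labeled_strings)
    pseudo = set()
    while len(pseudo) < count:
        length = random.randint(min_len, max_len)
        candidate = "".join(random.choices(units, k=length))
        if candidate not in labeled:  # must differ from the data in A
            pseudo.add(candidate)
    return sorted(pseudo)

# From "2589" this can yield pseudo data such as "25", "52", "89", "9852".
print(make_pseudo_strings(["2589"], count=5))
```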
S300: expanding the pseudo data set F based on a GAN neural network to form an expanded data set G;
The pseudo data set F is a lateral extension of the labeled data: when it is formed it still conforms to the original sample data, i.e. to the data features of the labeled data. The data in F, however, may not fully cover, for example, the changes in background, character spacing, and font with which the same number is displayed by the application program on different devices. Therefore, on the basis of the pseudo data set F, the set is expanded with a GAN neural network to ensure the randomness and diversity of the data, and the expanded data form an expanded data set G. For the expansion, new data features can be added to the original ones, and the digits recombined on that enlarged basis.
Using the principle of the GAN neural network, a fake sample set is formed by expanding the pseudo data set F, and this fake sample set is pitted against a module with a discrimination function built on the labeled data set A (the pseudo data set F).
The module with the discrimination function solves a supervised binary classification problem: determining whether an input datum belongs to the labeled data set A (pseudo data set F) or to the expanded data set G. During training, data from the labeled data set A and expanded data produced by the generation network are fed to the discrimination model at random, and the discrimination model judges whether each input is real. Through this competitive machine-learning mechanism the performance of the generation network and of the discrimination network improves continuously, and training ends when the parameters of the two models are stable.
After training, new expanded data can be generated by the now-improved generation network, and these new expanded data are closer to the data in the pseudo data set F and the labeled data set A.
S400: identifying whether the data in the expanded data set G need to be labeled, and screening the labeled data to form a training data set T;
To further ensure that the expanded data in the expanded data set G fit the data in the labeled data set A, the expanded data are identified and labeled by a training model obtained through deep training of the GAN neural network, thereby cleaning the expanded data. Since the cleaning acts on the expanded data set G, the data in the training data set T are obtained as part of G, i.e. the training data set T is contained in the expanded data set G.
S500: performing neural network training on the training data set T to form a training model;
The cleaned data in the training data set T have a certain cleanliness and degree of fit with the data in the labeled data set A. Besides all the data of A, T contains data that conform to A's data features, and its data volume is larger than that of A. It will be appreciated that the data volumes of the expanded data set G and the training data set T grow as more expanded data are produced and screened in steps S300-S400. The more data there are, the fuller the training model formed when T is trained by the neural network: on the one hand the model conforms to the data features of the data in A; on the other hand it enriches features that A lacks and covers the deformations of A's data under different backgrounds, fonts, spacings, and so on. The training model is thus based on the labeled data set A yet superior to a model trained on A alone.
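As one possible reading of this step, the sketch below trains a small classifier on T with PyTorch. The 28x28 single-channel input, the architecture, and the hyperparameters are all assumptions made for illustration; the patent prescribes only "neural network training", not a particular network.

```python
# Minimal sketch of step S500 under assumed 28x28 single-channel inputs.
import torch
import torch.nn as nn

class DigitClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 7 * 7, num_classes)  # 28x28 -> 7x7 maps

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train_model(loader, epochs=5, lr=1e-3):
    """Fit the classifier on the training data set T (served by `loader`)."""
    model = DigitClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
    return model
```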
S600: cleaning the data in the data corpus U outside the labeled data set A based on the training model, labeling the data that conform to the training model, and placing them into the labeled data set A.
With the training model in place, the other data in the corpus U (the data outside the labeled data set A) are analyzed and judged against the criteria the training model learned for labeling, such as whether the data match the digits, digit gaps, digit backgrounds, and digit targets in the data features, or variants of the digits, the digit backgrounds, and so on. Data that do not conform to the training model are cleaned away according to the judgment; data that conform are kept and added to the labeled data set A. In this way, other data matching the original sample, i.e. the data in the labeled data set A, are selected from the range of the corpus U with high accuracy, enlarging the data volume of the original sample. It will also be appreciated that as the data volume of the original sample gradually grows, the amount of labeled data available for training grows with it; after multiple iterations the magnitude of the labeled data and the maturity of the training model increase exponentially, and subsequent discrimination of the same type of data can rest on big data, accurate and efficient.
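A minimal sketch of this cleaning step, continuing the assumptions of the S500 sketch above: the unlabeled part of the corpus U is scored by the training model and only confident predictions are moved into A. The 0.9 confidence threshold is an assumption; the patent says only that the data must conform to the training model.

```python
# Sketch of step S600: keep corpus data the model is confident about,
# clean away the rest. The threshold value is an illustrative assumption.
import torch

@torch.no_grad()
def clean_and_label(model, unlabeled_images, threshold=0.9):
    """Return (image, predicted_label) pairs that conform to the model."""
    model.eval()
    probs = torch.softmax(model(unlabeled_images), dim=1)
    conf, preds = probs.max(dim=1)
    return [(img, int(p)) for img, c, p in zip(unlabeled_images, conf, preds)
            if c >= threshold]
```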
In different embodiments, if the data volume of the new labeled data set A' obtained after step S600 is insufficient and more data must be accumulated through multiple iterations, different steps are executed, as follows.
Embodiment one
Referring to fig. 3, in a preferred embodiment, the method for generating the annotation data further includes the following steps:
s700: judging whether the data volume in the labeled data set A' is greater than or equal to an expected data volume;
First, it must be judged whether the data volume of the currently obtained labeled data set A' and the quality of the training model meet the expected data volume or the expected quality, that is, whether the training model can label the data accurately when analyzing and judging the data in the corpus U. The verification may use an experiment to check how well the labeled data fit the data that are supposed to be labeled under the training model.
S800: when the data volume in the labeled data set A is smaller than the expected data volume, taking the union of the training data set T and the labeled data set A and executing steps S500-S600 again.
When the data volume in the labeled data set A is deemed smaller than the expected volume, it needs to be increased. For this it suffices to increase the data volume of the training data set T obtained in step S400, without redoing the GAN-based quality improvement of the expanded data set G. Therefore the training data set T and the labeled data set A are merged into a new training data set T', i.e. T' = T ∪ A. Based on the new training data set T', the method iterates back to step S500, performs neural network training again to form a new training model, and then executes step S600 again to obtain the post-iteration labeled data set A'. If the conditions for model training are still not met, steps S500-S600 are repeated until the final new labeled data set A' meets them.
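The loop below sketches this embodiment, reusing train_model and clean_and_label from the sketches above. Representing the data sets as Python lists of (image, label) pairs, the batch size, and the stop-when-stalled guard are assumptions added so the sketch terminates; a real implementation would also remove newly labeled items from the unlabeled pool.

```python
# Sketch of the embodiment-one iteration (steps S700-S800).
import torch
from torch.utils.data import DataLoader, TensorDataset

def iterate_until_enough(T, A, unlabeled_images, expected_size):
    while len(A) < expected_size:                         # step S700
        T = T + A                                         # step S800: T' = T ∪ A
        images = torch.stack([img for img, _ in T])
        labels = torch.tensor([lbl for _, lbl in T])
        loader = DataLoader(TensorDataset(images, labels), batch_size=64)
        model = train_model(loader)                       # step S500 again
        newly = clean_and_label(model, unlabeled_images)  # step S600 again
        if not newly:
            break  # nothing new passed the threshold; stop instead of spinning
        A = A + newly
    return A
```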
Embodiment two
In another embodiment, the annotation data generation method further comprises the steps of:
s700: judging whether the data volume in the labeled data set A' is greater than or equal to an expected data volume;
As in embodiment one, it must first be judged whether the data volume of the currently obtained labeled data set A' and the quality of the training model meet the expected data volume or the expected quality, that is, whether the training model can label the data accurately when analyzing and judging the data in the corpus U; the verification may use an experiment to check how well the labeled data fit the data that are supposed to be labeled under the training model.
S800': when the data volume in the labeled data set A is smaller than the expected data volume, replacing the data in the pseudo data set F with the data in the labeled data set A and executing steps S300-S600 again.
Unlike embodiment one, which trusts the result of the GAN neural network training, this embodiment judges that the adversarial experiment of the GAN neural network must be run again. Therefore, after step S600 is completed, when the data volume in the labeled data set A' is deemed smaller than the expected volume, the data in the pseudo data set F are replaced with the data in the labeled data set A, so that the original sample of the GAN neural network is reset to the data in A.
After the data in the pseudo data set F are replaced with the data in the labeled data set A, steps S300-S600 are executed in order to obtain the post-iteration labeled data set A''. If A'' still fails to meet the conditions for model training, the method can iterate back to S500 or S300 as needed, following embodiment one or embodiment two. And since there may be more than one iteration, the iteration styles of the two embodiments can be combined freely according to the user's confidence in the quality of the pseudo data set F and of the training data set T.
Referring to fig. 4, in this embodiment, after step S600, S800, or S800' is executed, the method further comprises:
s900: training other data sets besides the data corpus based on the labeled data set A and/or the training data set T formed in step S600.
In this step, if other data sets exist besides the data corpus U chosen by the user, for example when U does not cover data from all states of an application program or covers only a certain time period, then the labeled data set A and/or the training data set T, which by now hold data at a certain scale, or the training model itself, can train and label those other data sets. It will be appreciated that since the training data set T derives from the labeled data set A, the data in A can be chosen preferentially, with T participating in the subsequent training optionally according to the quality of its data.
Referring to fig. 5, expanding the pseudo data set F based on the GAN neural network specifically comprises the following steps:
s310: constructing a generation model and a discrimination model;
As described above, the module with the generation function, i.e. the generation model, has a generation network, and the module with the discrimination function, i.e. the discrimination model, has a discrimination network. The generation model and the discrimination model undergo adversarial training based on the principle of the GAN neural network, so that the data generation precision of the generation model and the data discrimination accuracy of the discrimination model improve simultaneously.
S320: configuring the discrimination model to output discrimination probability values greater than 0.5 for data in the pseudo data set, and deep-learning, based on those values, the output of discrimination probability values for data outside the pseudo data set;
In the initial training, the discrimination targets fed in are the data in the pseudo data set F, data produced by the user to expand the diversity of the data. The discrimination network judges each input as real (pseudo data set F) or fake (expanded data produced by the generation network), compares the judgment with the ground truth, and feeds the result back into the model parameters; iterating this process, the discrimination network learns the difference between real and fake data. By judging the labeled data, the discrimination model learns the prototype of the data, so that when an input of unknown authenticity arrives, the likelihood that it is real is verified from the authenticity of the preceding data. That is, based on deep learning of the discrimination probability values of the data in the pseudo data set F, the discrimination probability values of data outside F, i.e. of the expanded data, are output, and the authenticity of the data is judged while the data are expanded.
S330: the generation model generates a data set to be expanded based on the data in the pseudo data set F;
On the generation model side, the digits, spacings, fonts, combination forms, and so on in the data of the pseudo data set F are deformed and recombined again, generating the data set to be expanded.
S340: the generation model inputs the pseudo data set and the data set to be expanded into the discrimination model;
As the data set to be expanded takes shape, the labeled data set A has gradually been expanded into the pseudo data set F plus the data set to be expanded. Both are then input into the discrimination model, which discriminates the data and outputs the probability values that they are real.
S350: collecting the data output by the discrimination model with discrimination probability values greater than 0.5 to form the expanded data set.
Because the discrimination model trains its criterion on the data in the pseudo data set F, the data output by the discrimination model with discrimination probability values greater than 0.5 (namely all the data of the pseudo data set F plus the part of the data set to be expanded that meets the discrimination model's criterion) are collected, and the merged set is the expanded data set G.
The expanded data set G obtained through these steps both preserves the feature approximation to the data in the labeled data set A and randomly expands and diversifies the data of the pseudo data set F, enlarging the data volume of the labeled data set A in an orderly way.
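The following PyTorch sketch shows one way steps S310-S350 could fit together. The network sizes, optimizers, and flattened 28x28 image shape are assumptions; only the adversarial setup and the keep-if-score-above-0.5 filter come from the steps above.

```python
# GAN sketch for S310-S350 (assumed shapes and hyperparameters).
import torch
import torch.nn as nn

G = nn.Sequential(                      # S310: generation model
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)
D = nn.Sequential(                      # S310: discrimination model
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),    # discrimination probability in (0, 1)
)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(pseudo_batch):
    """One adversarial round: D learns to score pseudo data above 0.5 (S320),
    G learns to produce candidates that D accepts (S330-S340)."""
    n = pseudo_batch.size(0)
    fake = G(torch.randn(n, 64))
    opt_d.zero_grad()                   # discriminator update
    loss_d = (bce(D(pseudo_batch), torch.ones(n, 1))
              + bce(D(fake.detach()), torch.zeros(n, 1)))
    loss_d.backward()
    opt_d.step()
    opt_g.zero_grad()                   # generator update
    bce(D(fake), torch.ones(n, 1)).backward()
    opt_g.step()

@torch.no_grad()
def expand(pseudo_data, num_candidates=1000):
    """S350: keep generated samples whose discrimination score exceeds 0.5."""
    candidates = G(torch.randn(num_candidates, 64))
    accepted = candidates[D(candidates).squeeze(1) > 0.5]
    return torch.cat([pseudo_data, accepted])   # the expanded data set G
```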
Referring to fig. 6, in this embodiment, step S400 is specifically performed by the following steps:
s410: verifying the data in the expanded data set G according to the data features of the labeled data set A;
The data in the newly generated expanded data set G are identified and labeled by the existing model of the data features of the labeled data set A, verifying whether the data in G should undergo the subsequent identification operation.
The verification may use a voting method, i.e. secondary identification based on the principle that the minority obeys the majority, or that all must agree, respecting the verdict reached by a majority or by all of several discriminators judging by the data features of the labeled data set A. That is, the step S410 of verifying the data in the expanded data set G according to the data features of the labeled data set A comprises: s411: verifying the data in the expanded data set G by taking the data in the labeled data set A as a model; s412: when more than half of the levels, or all levels, in the model verify the data consistently, judging that the verification result is the identification label.
S420: extracting the data whose verification result is the identification label, and deleting from the expanded data set the data whose verification result is not the identification label.
The verification result is either the identification label or not. Data bearing the identification label, i.e. data conforming to the data features of the labeled data set A, are retained in the expanded data set G, while data whose verification result is not the identification label are deleted, cleaning the data.
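A compact sketch of this voting screen follows; the verifier callables, one per data feature of A (digits, background, spacing, and so on), are illustrative stand-ins rather than anything the patent specifies.

```python
# Sketch of the S410-S420 voting verification with assumed verifier callables.
def has_identification_label(candidate, verifiers, require_all=False):
    """True when the candidate should receive the identification label."""
    votes = [verify(candidate) for verify in verifiers]
    if require_all:                        # "all levels" must agree
        return all(votes)
    return sum(votes) > len(votes) / 2     # "more than half of the levels"

def screen_expanded_set(expanded_set, verifiers):
    """Step S420: keep verified data, delete the rest."""
    return [d for d in expanded_set if has_identification_label(d, verifiers)]
```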
The method for generating annotation data in any of the embodiments above can be applied in an annotation data generation device, such as an intelligent terminal, a server, a workstation, a virtual server, or a virtual workstation. The annotation data generation device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, it implements the annotation data generation method according to the preconfigured source language.
The annotation data generation method can also be integrated into a computer-readable storage medium on which a computer program is stored; the computer program can be executed by a processor to implement the annotation data generation method described above. The computer-readable storage medium can take the form of software, a virtual file, and the like.
The intelligent terminal may be implemented in various forms. For example, the terminal described in the present invention may include intelligent terminals such as a mobile phone, a smart phone, a notebook computer, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a navigation device, as well as fixed terminals such as a digital TV and a desktop computer. In the following it is assumed that the terminal is an intelligent terminal; however, those skilled in the art will understand that, apart from elements used particularly for mobile purposes, the configuration according to the embodiments of the present invention can also be applied to fixed terminals.
It should be noted that the embodiments of the present invention are described above by way of preference, not limitation; those skilled in the art can make modifications and variations to the embodiments described above without departing from the spirit of the invention.

Claims (10)

1. A method for generating annotation data, characterized by comprising the following steps:
s100: acquiring a data corpus and a labeled data set which is contained in the data corpus and has been labeled;
s200: analyzing the data features of the labeled data set and producing, according to those features, a pseudo data set that conforms to them, wherein other data conforming to the data features but different from the data in the labeled data set are produced on the basis of the analyzed data features, the digits used in production being the digits of the data in the labeled data set and the gaps between the digits satisfying the gaps between the digits of the data in the labeled data set, so that each pseudo datum in the pseudo data set is a recombination of the units of the labeled data;
s300: expanding the pseudo data set based on a GAN neural network to form an expanded data set;
s400: identifying whether the data in the expanded data set need to be labeled, and screening the labeled data to form a training data set;
s500: performing neural network training on the training data set to form a training model;
s600: cleaning the data in the data corpus outside the labeled data set based on the training model, labeling the data that conform to the training model, and placing them into the labeled data set.
2. The annotation data generation method of claim 1,
the method for generating the annotation data further comprises the following steps:
s700: judging whether the data volume in the labeled data set is larger than or equal to an expected data volume;
s800: and when the data volume in the labeled data set is smaller than the expected data volume, taking the union of the training data set and the labeled data set, and executing the steps S500-S600 again.
3. The annotation data generation method of claim 2,
step S800 is replaced with:
s800': and when the data volume in the labeled data set is smaller than the expected data volume, replacing the data in the pseudo data set with the data in the labeled data set, and executing the steps S300-S600 again.
4. The annotation data generation method of claim 1,
the method for generating the annotation data further comprises the following steps:
s900: training the other data sets except the data corpus based on the labeled data set and/or the training data set formed in step S600.
5. The annotation data generation method of claim 1,
the step S300 of expanding the pseudo data set based on the GAN neural network to form an expanded data set comprises:
s310: constructing a generation model and a discrimination model;
s320: configuring the discrimination model to output discrimination probability values greater than 0.5 for data in the pseudo data set, and deep-learning, based on those values, the output of discrimination probability values for data outside the pseudo data set;
s330: the generation model generates a data set to be expanded based on the data in the pseudo data set;
s340: the generation model inputs the pseudo data set and the data set to be expanded into the discrimination model;
s350: collecting the data output by the discrimination model with discrimination probability values greater than 0.5 to form the expanded data set.
6. The annotation data generation method of claim 1,
the step S400 of identifying whether the data in the expanded data set need to be labeled and screening the labeled data to form a training data set comprises:
s410: verifying the data in the expanded data set according to the labeled data set and the data features;
s420: extracting the data whose verification result is the identification label, and deleting from the expanded data set the data whose verification result is not the identification label.
7. The annotation data generation method of claim 6,
the step S410 of verifying the data in the expanded data set according to the labeled data set and the data features comprises:
s411: verifying the data in the expanded data set by taking the data in the labeled data set as a model;
s412: when more than half of the levels, or all levels, in the model verify the data consistently, judging that the verification result is the identification label.
8. The annotation data generation method of claim 1,
the data features include one or more of: the background of the data, the unit numbers of the data, the gaps between the numbers of the data, the target of the data, and the noise of the data.
9. An annotation data generation apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the annotation data generation method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the annotation data generation method according to any one of claims 1 to 8.
CN201810609646.8A 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium Active CN108960409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810609646.8A CN108960409B (en) 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810609646.8A CN108960409B (en) 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN108960409A CN108960409A (en) 2018-12-07
CN108960409B true CN108960409B (en) 2021-08-03

Family

ID=64488602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810609646.8A Active CN108960409B (en) 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN108960409B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816019A (en) * 2019-01-25 2019-05-28 上海小萌科技有限公司 A kind of image data automation auxiliary mask method
CN109978029B (en) * 2019-03-13 2021-02-09 北京邮电大学 Invalid image sample screening method based on convolutional neural network
CN110189351A (en) * 2019-04-16 2019-08-30 浙江大学城市学院 A kind of scratch image data amplification method based on production confrontation network
CN110569379A (en) * 2019-08-05 2019-12-13 广州市巴图鲁信息科技有限公司 Method for manufacturing picture data set of automobile parts
CN110874484A (en) * 2019-10-16 2020-03-10 众安信息技术服务有限公司 Data processing method and system based on neural network and federal learning
US11651276B2 (en) 2019-10-31 2023-05-16 International Business Machines Corporation Artificial intelligence transparency
CN111143617A (en) * 2019-12-12 2020-05-12 浙江大学 Automatic generation method and system for picture or video text description
CN111177132A (en) * 2019-12-20 2020-05-19 中国平安人寿保险股份有限公司 Label cleaning method, device, equipment and storage medium for relational data
CN111382785B (en) * 2020-03-04 2023-09-01 武汉精立电子技术有限公司 GAN network model and method for realizing automatic cleaning and auxiliary marking of samples
CN111476324B (en) * 2020-06-28 2020-10-02 平安国际智慧城市科技股份有限公司 Traffic data labeling method, device, equipment and medium based on artificial intelligence
CN111741018B (en) * 2020-07-24 2020-12-01 中国航空油料集团有限公司 Industrial control data attack sample generation method and system, electronic device and storage medium
CN112308167A (en) * 2020-11-09 2021-02-02 上海风秩科技有限公司 Data generation method and device, storage medium and electronic equipment
CN112508000B (en) * 2020-11-26 2023-04-07 上海展湾信息科技有限公司 Method and equipment for generating OCR image recognition model training data
CN112580310B (en) * 2020-12-28 2023-04-18 河北省讯飞人工智能研究院 Missing character/word completion method and electronic equipment
CN113239205B (en) * 2021-06-10 2023-09-01 阳光保险集团股份有限公司 Data labeling method, device, electronic equipment and computer readable storage medium
CN114926709A (en) * 2022-05-26 2022-08-19 成都极米科技股份有限公司 Data labeling method and device and electronic equipment
CN116451087B (en) * 2022-12-20 2023-12-26 石家庄七彩联创光电科技有限公司 Character matching method, device, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392125A (en) * 2017-07-11 2017-11-24 中国科学院上海高等研究院 Training method/system, computer-readable recording medium and the terminal of model of mind
CN107622056A (en) * 2016-07-13 2018-01-23 百度在线网络技术(北京)有限公司 The generation method and device of training sample
CN107644235A (en) * 2017-10-24 2018-01-30 广西师范大学 Image automatic annotation method based on semi-supervised learning
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622056A (en) * 2016-07-13 2018-01-23 百度在线网络技术(北京)有限公司 The generation method and device of training sample
CN107392125A (en) * 2017-07-11 2017-11-24 中国科学院上海高等研究院 Training method/system, computer-readable recording medium and the terminal of model of mind
CN107644235A (en) * 2017-10-24 2018-01-30 广西师范大学 Image automatic annotation method based on semi-supervised learning
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"生成式对抗网络GAN 的研究进展与展望";王坤峰;《自动化学报》;20170331;第43卷(第3期);第1-4节 *

Also Published As

Publication number Publication date
CN108960409A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108960409B (en) Method and device for generating annotation data and computer-readable storage medium
CN110097094B (en) Multiple semantic fusion few-sample classification method for character interaction
CN109993102B (en) Similar face retrieval method, device and storage medium
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN109919252B (en) Method for generating classifier by using few labeled images
CN109299258A (en) A kind of public sentiment event detecting method, device and equipment
CN109948735B (en) Multi-label classification method, system, device and storage medium
CN110232373A (en) Face cluster method, apparatus, equipment and storage medium
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN108229588B (en) Machine learning identification method based on deep learning
CN109993057A (en) Method for recognizing semantics, device, equipment and computer readable storage medium
CN112016601B (en) Network model construction method based on knowledge graph enhanced small sample visual classification
WO2018196718A1 (en) Image disambiguation method and device, storage medium, and electronic device
CN109446333A (en) A kind of method that realizing Chinese Text Categorization and relevant device
CN108959474B (en) Entity relation extraction method
CN110610193A (en) Method and device for processing labeled data
CN109829065B (en) Image retrieval method, device, equipment and computer readable storage medium
CN107871314A (en) A kind of sensitive image discrimination method and device
CN112733602B (en) Relation-guided pedestrian attribute identification method
CN112784921A (en) Task attention guided small sample image complementary learning classification algorithm
CN111325237A (en) Image identification method based on attention interaction mechanism
CN115272692A (en) Small sample image classification method and system based on feature pyramid and feature fusion
CN112148994B (en) Information push effect evaluation method and device, electronic equipment and storage medium
CN108229692B (en) Machine learning identification method based on dual contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant