CN115146716B - Labeling method, labeling device, labeling apparatus, labeling storage medium and labeling program product

Labeling method, labeling device, labeling apparatus, labeling storage medium and labeling program product

Info

Publication number
CN115146716B
Authority
CN
China
Prior art keywords
raw data
labeling
target
data
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210713931.0A
Other languages
Chinese (zh)
Other versions
CN115146716A (en)
Inventor
袁松岭
王子璇
文心杰
王晓利
郭伟东
刘雅良
孟祥磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210713931.0A priority Critical patent/CN115146716B/en
Publication of CN115146716A publication Critical patent/CN115146716A/en
Application granted granted Critical
Publication of CN115146716B publication Critical patent/CN115146716B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a labeling method, device, equipment, storage medium, and program product, belonging to the technical field of artificial intelligence and comprising the following steps: acquiring a first raw data set, where the raw data are unlabeled data; determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, where the first and second target raw data are data whose degree of containing target information meets a preset requirement; acquiring truth data corresponding to each of the plurality of first target raw data, where the plurality of first target raw data comprise a data set for learning labeling and a data set for verifying the learning effect; generating a plurality of labeling cases according to each first target raw data and its corresponding truth data; and acquiring a labeling result for at least one second target raw data based on the plurality of labeling cases. The embodiment of the application enables fast labeling while ensuring labeling accuracy.

Description

Labeling method, labeling device, labeling apparatus, labeling storage medium and labeling program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a labeling method, apparatus, device, storage medium, and program product.
Background
Acquiring annotation data is a prerequisite for advancing artificial intelligence technology and realizing machine learning: the training of all kinds of intelligent models cannot proceed without the support of annotation data. Yet the acquisition of annotation data depends heavily on manual work by annotators, who usually rely on written annotation rules. Written rules have limited expressive power and often suffer from problems such as incomplete coverage and inaccurate expression; studying them is time-consuming and labor-intensive for annotators and requires extensive communication with the people who drafted the documents. As a result, acquiring annotation data takes a long time and is inefficient.
Disclosure of Invention
The embodiments of the application provide a labeling method, apparatus, device, storage medium, and program product, which can improve labeling accuracy and reduce the time spent on labeling.
According to an aspect of an embodiment of the present application, there is provided a labeling method, the method including:
Acquiring a first raw data set, wherein raw data in the first raw data set are unlabeled data;
Determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, where the first target raw data and the second target raw data are data whose degree of containing target information meets a preset requirement, and the target information characterizes the information contained in the first raw data set;
acquiring true value data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprises a data set for learning annotation and a data set for verifying learning effect;
Generating a plurality of labeling cases according to each first target raw data and its corresponding truth data; specifically, according to the data set for learning labeling and its corresponding truth data, and according to the data set for verifying the learning effect and its corresponding truth data;
and acquiring a labeling result for at least one second target raw data based on the plurality of labeling cases.
According to an aspect of an embodiment of the present application, there is provided a labeling apparatus, the apparatus including:
the first raw data set acquisition module, used for acquiring a first raw data set, where the raw data in the first raw data set are unlabeled data;
the data screening module is used for determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are data with the degree of containing target information meeting the preset requirement, and the target information represents information in the first raw data set;
The true value acquisition module is used for acquiring true value data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprises a data set for learning annotation and a data set for verifying learning effect;
The case generation module is used for generating a plurality of labeling cases according to each first target raw data and its corresponding truth data; specifically, according to the data set for learning labeling and its corresponding truth data, and according to the data set for verifying the learning effect and its corresponding truth data;
And the labeling module, used for acquiring a labeling result for at least one second target raw data based on the plurality of labeling cases.
According to an aspect of an embodiment of the present application, there is provided a computer apparatus including a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the labeling method described above.
According to an aspect of an embodiment of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which are loaded and executed by a processor to implement the above-mentioned labeling method.
According to one aspect of an embodiment of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the labeling method described above.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
The embodiment of the application provides a labeling method in which a number of highly representative and typical raw data are selected from a large amount of raw data and labeled to produce labeling cases. The labeling cases are provided for annotators to study on their own; having learned the cases, an annotator can label the remaining raw data. Because the information-carrying capacity of a labeling case far exceeds that of a written rule, and learning from cases makes full use of the multidimensional learning ability of the human brain, both labeling speed and labeling accuracy are improved, while the communication and study time demanded by written rules is saved, significantly improving labeling efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application runtime environment provided by one embodiment of the present application;
FIG. 2 is a flow chart of a labeling method provided by one embodiment of the present application;
FIG. 3 illustrates a raw data screening schematic;
FIG. 4 illustrates a core set screening schematic;
FIG. 5 is a schematic diagram of a complete flow of labeling provided by one embodiment of the application;
FIG. 6 is a schematic diagram of a visualization result of a labeling platform according to an embodiment of the present application;
FIG. 7 illustrates a schematic diagram of a practice problem;
FIG. 8 illustrates a schematic diagram of an examination question;
FIG. 9 illustrates a labeling case diagram;
FIG. 10 illustrates a block diagram of an annotation device;
FIG. 11 is a block diagram of a computer device according to one embodiment of the present application.
Detailed Description
Before describing the method embodiments of the present application, related terms or nouns that may be involved in the method embodiments of the present application are briefly described, so as to be understood by those skilled in the art of the present application.
BERT (Bidirectional Encoder Representation from Transformers): a large-scale text pre-training model that, with a 12-layer Transformer encoder, substantially improved the benchmark performance of natural language processing tasks. Compared with word2vec (word vectors), BERT, trained on massive amounts of text, can introduce more transferable knowledge into a classification algorithm and provide more accurate text features.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. AI software technologies mainly comprise computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Deep learning: the concept of deep learning is derived from the study of artificial neural networks. The multi-layer sensor with multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representation attribute categories or features by combining low-level features to discover distributed feature representations of data.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize and measure targets, and further performs graphics processing so that the result becomes an image more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and speech is one of the most promising modes of human-computer interaction.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
CNN (Convolutional Neural Network): a feedforward neural network (Feedforward Neural Network) that involves convolutional computation and has a deep structure. It is one of the representative algorithms of deep learning; it has feature-learning capability and can perform shift-invariant classification of input information according to its hierarchical structure.
Raw data: the original waiting standard data of the result is not marked yet.
Contrastive learning (Contrastive Learning): a method of teaching a machine learning model which things are similar and which are dissimilar. With this approach, machine learning models can be trained to distinguish between similar and dissimilar data samples.
Active learning (Active Learning): active learning can select, through a certain algorithm, the most representative, most confusing, and most informative samples in the raw data.
Annotator: a person who provides labeling results on the labeling platform.
Demand-side person: a person who publishes a labeling task on the labeling platform and requests labeling results.
Case labeling rules: typical samples are selected by an algorithm, the demand-side person gives sample labeling answers and notes, and annotators learn the labeling rules from the example data carrying those answers and notes.
Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, network and other series resources in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems require large amounts of computing and storage resources, for example video websites, picture websites, and portal websites. With the rapid development and application of the internet industry, every article may in the future carry its own identification mark, which must be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong system backing, which can only be realized through cloud computing.
In the related art, in fields such as artificial intelligence, machine learning, and neural networks, annotation data must first be acquired in order to train a model. In the labeling process, a user first provides the data to be labeled, lists detailed rules for labeling it, and runs a labeling trial. For example, when a labeled data set needs to be acquired, corresponding subtask descriptions must be generated according to the labeling rules for the various data in the set, and annotators label according to those rules; this process is usually accompanied by multiple rounds of communication between the demand-side person and the annotators to correct and refine the labeling rules, finally yielding complete labeling rules and example data. The demand-side person and the annotators need a run-in period to converge on a shared understanding of the rules, and this run-in may take 2-9 days to complete. In most cases, running in the rules takes more time than the annotators spend actually labeling the data; otherwise, the delivered data would not pass acceptance.
For the labeling of small-sample data in particular, drafting labeling rules is complex and cumbersome; with a small labeling volume, the preparatory work takes a long time relative to the labeling itself, so labeling efficiency is low and overall labeling efficiency is also affected.
In the actual labeling process, annotators usually encounter data samples that the rules do not cover or that are ambiguous, which adds further rounds of labeling communication. Depending on the labeling difficulty, this usually takes multiple rounds; for some relatively complex labeling tasks, differences in background knowledge and modes of expression between annotators and demand-side personnel cause deviations in understanding the written rules, which affects communication efficiency and lengthens the whole labeling cycle. According to statistics from some labeling platforms, the rule run-in time is typically 2-9 days, and most data labeling can be completed within 7 days after the run-in succeeds. In addition, some demand-side personnel provide a small number of labeling cases alongside the rules, but manually selecting typical cases is time-consuming and insufficiently comprehensive, and can hardly eliminate the time cost of communication.
In view of the above, the embodiment of the application provides a labeling method that selects a number of highly representative and typical raw data from a large amount of raw data and labels them to obtain labeling cases. The labeling cases are provided for annotators to study on their own; having learned the cases, an annotator can label the remaining raw data. Because the information-carrying capacity of a labeling case far exceeds that of a written rule, and learning from cases makes full use of the multidimensional learning ability of the human brain, both labeling speed and labeling accuracy are improved, while the communication and study time demanded by written rules is saved, significantly improving labeling efficiency.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application running environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, a game console, an e-book reader, a multimedia playback device, a wearable device, and the like. A client of an application program can be installed in the terminal 10.
In the embodiment of the present application, the application may be any application capable of labeling services. Optionally, a client of the above application program is running in the terminal 10. The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a background server of the application program described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like. Alternatively, the server 20 provides background services for applications in a plurality of terminals 10 at the same time.
Alternatively, the terminal 10 and the server 20 may communicate with each other via the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Referring to fig. 2, a flowchart of a labeling method according to an embodiment of the application is shown. The method can be applied to a computer device, i.e., an electronic device with data computing and processing capabilities, and the execution subject of each step can be the server 20 in the application running environment shown in fig. 1. The method may include the following steps:
S101: acquiring a first raw data set, where the raw data in the first raw data set are unlabeled data.
The first raw data set is not limited in this embodiment of the application; it may be regarded as the original set of raw data for labeling. For example, if puppies in pictures need to be labeled, the raw data are various pictures that may contain a puppy. Alternatively, the original set of raw data for labeling may be filtered to obtain a number of representative raw data, which then form the first raw data set.
Specifically, a second raw data set may be obtained, where the raw data in the second raw data set are the original unlabeled data. Features are extracted from each raw data in the second raw data set to obtain the corresponding feature information, and the first raw data set is determined within the second raw data set according to the feature information. The second raw data set may be regarded as the original set of raw data for labeling.
The embodiment of the application does not limit the specific method of feature extraction or of determining the first raw data set within the second raw data set. For example, contrastive learning (Contrastive Learning) and active learning (Active Learning) may be combined to pick representative and typical raw data out of the second raw data set to form the first raw data set. Specifically, a first model may be obtained by contrastive learning training and a second model by active learning training, where the first model is used to extract features from each raw data in the second raw data set, and the second model is used to determine the first raw data set within the second raw data set according to the feature information.
In order to obtain a better semantic representation (feature extraction) of the unlabeled raw data, the embodiment of the application may introduce contrastive learning. Contrastive learning is widely applied to unsupervised representation learning in the image and text fields; it mainly constructs positive and negative sample sets, e.g., via data augmentation operations such as rotation and cropping in the image field, and via back-translation, character insertion and deletion, and similar methods in the text field. By pulling similar samples together and pushing dissimilar samples apart, a good semantic representation space is learned from the samples, improving the accuracy of feature extraction.
In order to obtain representative and typical data, the embodiment of the application may also introduce active learning. Active learning recognizes that not all data are equally valuable and can discover which data in the second raw data set are more valuable and carry more information, so that the first raw data set is obtained by screening. Labeling the typical data in the first raw data set then yields a better labeling result.
Active learning is mostly applied where a small amount of labeled data already exists, and the actively trained model is used to select the more valuable data. However, in a cold start with no labeled data at all, active learning algorithms often struggle to perform well. The embodiment of the application can combine contrastive learning and active learning: good semantic representations (feature extraction) are obtained through unsupervised contrastive training, and representative data are then selected through active learning, solving the cold-start problem.
In one embodiment, please refer to fig. 3, which illustrates the raw data screening. For unlabeled raw data, the BERT or ResNet model can be fine-tuned by contrastive learning; vector representations of the texts or pictures are then obtained from the trained model, realizing feature extraction, after which typical raw data are selected from the feature-extraction results by an active learning method. Contrastive learning and active learning are described in detail below:
(1) Contrastive learning
For raw data of text type, unsupervised SimCSE may be used to train the BERT model and obtain text vector representations. Unsupervised SimCSE is an unsupervised contrastive learning model whose main idea is to use dropout noise as data augmentation (Dropout Noise Data Augmentation) for unsupervised contrastive training: the same text is input to the encoder twice, and two different vector representations are obtained under different dropout masks (denoted z and z' respectively). Mathematically, $h_i = f_\theta(x_i, z)$ and $h_i' = f_\theta(x_i, z')$, where $f_\theta$ denotes the encoder mapping and $h_i$, $h_i'$ are the two vector representations. When the encoder is trained in batches, the training objective is

$$\ell_i = -\log \frac{e^{\operatorname{sim}(h_i,\, h_i')/\tau}}{\sum_{j=1}^{N} e^{\operatorname{sim}(h_i,\, h_j')/\tau}}$$

where $\operatorname{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature hyper-parameter (e.g., 0.05), $e$ is the base of the natural logarithm, and $N$ is the number of training samples in the batch.
In some embodiments, the training process of the contrastive model likewise aims to make the representations of similar text content as similar as possible and those of dissimilar content as dissimilar as possible. The contrastive model obtained by fine-tuning BERT with the unsupervised SimCSE algorithm yields better semantic representations of the unlabeled raw data and lays the foundation for active learning to select typical raw data.
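To make the objective concrete, the following is a minimal PyTorch sketch of the unsupervised SimCSE loss above; it assumes `encode` maps a batch of texts to sentence embeddings with dropout active (train mode), so that two calls yield the two views, and all names are illustrative rather than taken from the patent:

import torch
import torch.nn.functional as F

def simcse_loss(encode, batch, tau=0.05):
    h1 = encode(batch)  # first pass, dropout mask z
    h2 = encode(batch)  # second pass, dropout mask z'
    # cosine similarity between every pair of views, scaled by temperature
    sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / tau
    # row i's positive sits on the diagonal (same text, different dropout);
    # the other N-1 entries of the row act as in-batch negatives
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)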
For raw data of image type, a typical neural network, the ResNet model, can be trained with the SimCLR visual-representation contrastive learning framework, and image vector representations of the raw data are obtained from the trained model, completing feature extraction. Given an input picture $x$, two random image augmentations produce pictures $x_i$ and $x_j$ respectively; then $h_i = f(x_i) = \mathrm{ResNet}(x_i)$ and $z_i = g(h_i)$, where $g(\cdot)$ is a multi-layer perceptron projection head with a single hidden layer. In the case of batch training, the training objective is

$$\ell_{i,j} = -\log \frac{e^{\operatorname{sim}(z_i,\, z_j)/\tau}}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\, e^{\operatorname{sim}(z_i,\, z_k)/\tau}}$$

where $\mathbb{1}_{[k \neq i]} \in \{0,1\}$ is an indicator function whose value is 1 if $k \neq i$, $\tau$ is a temperature hyper-parameter, and $N$ is the number of training samples in the batch.
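Correspondingly, here is a minimal PyTorch sketch of the NT-Xent objective above, where z_i and z_j are the (N, d) projections g(f(x)) of the two augmented views; the function name and temperature value are illustrative assumptions:

import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, tau=0.5):
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / tau  # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))  # realizes the 1[k != i] indicator
    n = z_i.size(0)
    # the positive for each view is the other augmentation of the same picture
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)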
(2) Active learning
After better semantic representations are obtained through contrastive learning, the embodiment of the application integrates several active learning methods, such as clustering and the core set method CoreSet, to complete the screening of the raw data. The clustering algorithm and the CoreSet algorithm are briefly introduced as examples:
The clustering algorithm is an iteratively solved cluster analysis algorithm that divides the data into K groups: K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is computed, and each object is assigned to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. After each assignment, the cluster center is recomputed from the objects currently in the cluster. This process repeats until a termination condition is met; the termination condition may be that no (or a minimal number of) objects are reassigned to different clusters, that no (or a minimal number of) cluster centers change again, or that the sum of squared errors reaches a local minimum. Finally, the data corresponding to the K cluster centers are obtained and used as the selected typical data.
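As an illustration of this step, here is a minimal scikit-learn sketch that selects K typical samples by taking, for each cluster, the raw-data point closest to the cluster center; the function name is ours, not the patent's:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def typical_by_clustering(features, k):
    # partition the feature vectors into k clusters, then return the index
    # of the sample nearest to each cluster center as the typical data
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    nearest, _ = pairwise_distances_argmin_min(km.cluster_centers_, features)
    return nearest  # k indices into `features`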
The CoreSet algorithm screens data by constructing a core set, with the following goals: the selected subset contains the least redundant information, and it covers as much of the information of the unselected data as possible; that is, the selected core-set data represent the full data as well as possible. As shown in fig. 4, each center point defines a core set representing the data information within radius $\delta_s$, and the distance between points represents similarity. Selecting the core-set data can be cast as choosing $b$ center points so as to minimize the maximum distance between any data point and its nearest center:

$$\min_{s^{1}:\, |s^{1}| \le b}\ \max_{i}\ \min_{j \in s^{1} \cup s^{0}} \Delta(x_i, x_j)$$

where $s^0$ is the already-selected data set and $s^1$ the newly selected subset. The solving process can be summarized as: compute the minimum distance from each sample in the unselected data set U to the selected data set L, and move the samples with the largest such distance into L, iterating. The pseudocode of the algorithm is as follows:
Input: a selected data set L_0, an unselected data set U, and the number b of data points to be newly selected;
Initialize: L = L_0;
Loop:
u = argmax_{i ∈ U} min_{j ∈ L} Δ(x_i, x_j), where Δ(·,·) denotes distance;
L = L ∪ {u}, U = U \ {u};
Until: |L| = b + |L_0|;
Return: the selected data set L minus the initial selected data set L_0.
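As an illustration only, a minimal Python sketch of this greedy selection, assuming the features are Euclidean vectors in a NumPy array; names are ours, not the patent's:

import numpy as np

def coreset_greedy(features, selected, b):
    # greedy k-center: repeatedly move into L the point of U that is
    # farthest (in minimum distance) from the current selected set L
    centers = features[np.asarray(selected)]  # the initial set L0
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1).min(axis=1)
    new = []
    for _ in range(b):
        u = int(np.argmax(d))  # u = argmax_{i in U} min_{j in L} delta(x_i, x_j)
        new.append(u)
        # adding u to L can only shrink each point's distance to L
        d = np.minimum(d, np.linalg.norm(features - features[u], axis=1))
    return new  # L \ L0: the b newly selected indices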
Through the CoreSet algorithm a subset is selected from the full data that removes redundant data, so that the subset is as close to, and as representative of, the entire data set as possible. The data selected by the different active learning algorithms, such as clustering and CoreSet, are then combined by voting (Ensemble); the embodiment of the application does not limit the voting method, and related techniques can be used to select the final, most representative and typical data.
S102: and determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are data which contain target information and have the degree meeting the preset requirement, and the target information represents information in the first raw data set.
In the embodiment of the present application, the first target raw data and the second target raw data are data whose degree of containing the target information satisfies a preset requirement; that the target information characterizes the information contained in the first raw data set means that the first and second target raw data are representative data selected from the first raw data set. The method for selecting the first and second target raw data from the first raw data set and the foregoing method for selecting the first raw data set from the second raw data set may be based on the same inventive concept, which is not repeated here. For example, the foregoing method may be used to select a plurality of raw data from the first raw data set, which are then divided into two groups, by manual division or random assignment, one group constituting the first target raw data and the other the second target raw data. Of course, in some embodiments it may be stipulated that the degree to which any first target raw data contains the target information is higher than the degree to which any second target raw data contains it.
In order to verify the effectiveness of this screening approach, a large number of comparison experiments were carried out on different business data and public data sets; compared with randomly selected data, the screened data show clearly improved quality, as shown in Table 1.
Table 1
This table verifies that the data-selection algorithm is effective: model performance on the data selected by the algorithm is better than on randomly selected data. When 10% of the data is added at each step, the screening method of the embodiment of the application outperforms random selection.
S103: and acquiring true value data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprises a data set for learning annotation and a data set for verifying learning effect.
The truth data can be given by the labeling demand side, i.e., the demand side provides truth data according to its own requirements. For example, if the demand side needs bolts, nuts, and the like in images to be annotated, it can provide labeling information for the bolts and nuts in the first target raw data of image type; that labeling information is the truth data.
S104: according to the first target raw data and the corresponding true value data, a plurality of labeling cases are generated according to the data set for learning labeling and the corresponding true value data, and according to the data set for verifying learning effects and the corresponding true value data.
The labeling cases are for annotators to learn from, so that they fully understand the demand side's labeling requirements and can then label other raw data, such as the second target raw data.
In an embodiment, generating the plurality of labeling cases according to each first target raw data and its corresponding truth data (specifically, according to the data set for learning labeling and its truth data, and according to the data set for verifying the learning effect and its truth data) includes: classifying the plurality of first target raw data to obtain a first-class first raw data set and a second-class first raw data set, where the first-class first raw data set is the data set for learning labeling and the second-class first raw data set is the data set for verifying the learning effect.
The labeling cases generated from the first target raw data belonging to the first-class first raw data set and their corresponding truth data are classified into a first-class labeling case set; the labeling cases generated from the first target raw data belonging to the second-class first raw data set and their corresponding truth data are classified into a second-class labeling case set.
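As an illustration only, a minimal sketch of such a partition, assuming a simple random split (the patent does not fix how the two case sets are divided):

import random

def split_cases(cases, n_practice, seed=0):
    # randomly partition the labeling cases into a first-class (practice)
    # set and a second-class (exam) set
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_practice], shuffled[n_practice:]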
S105: and acquiring a labeling result aiming at least one second target raw data based on the plurality of labeling cases.
Acquiring a labeling result for at least one second target raw data based on the plurality of labeling cases includes: acquiring the labeling result for the at least one second target raw data based on the first-class labeling case set and the second-class labeling case set.
For example, 10 cases can be formed from the truth data given by the demand side: five belong to the first-class labeling case set and the other five to the second-class labeling case set. An annotator first studies the first-class labeling case set; the raw data from the second-class labeling case set are then shown to the annotator for labeling, and comparing the labeling results with the truth data in the second-class labeling case set allows the annotator's learning to be judged. If the annotator has learned well, he or she can proceed to label the relevant raw data.
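A minimal sketch of this verification step; the exact-match comparison and the pass threshold (0.9 here, echoing the 90% acceptance criterion mentioned further below) are illustrative assumptions:

def exam_passed(answers, truth, threshold=0.9):
    # grade an annotator's exam answers against the truth data of the
    # second-class case set; only annotators at or above the threshold
    # participate in formal labeling
    correct = sum(a == t for a, t in zip(answers, truth))
    return correct / len(truth) >= threshold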
In one embodiment, the at least one second target raw data may be displayed when a target message is acquired, where the target message indicates that learning of the labeling cases is complete; in response to detecting a labeling operation on the at least one second target raw data, the labeling result for it is acquired.
In one embodiment, before the at least one second target raw data is displayed upon acquisition of the target message, the method includes: displaying the first-class labeling case set; in response to acquiring a first message, displaying at least one first target raw data in the second-class first raw data set; in response to obtaining a to-be-verified labeling result for the at least one first target raw data in the second-class first raw data set, verifying that result against the second-class labeling cases; and acquiring the target message when the verification passes.
The embodiment of the application aims to remove the pain points of rule drafting and rule run-in during data labeling by generating case labeling rules to replace traditional written labeling rules. Annotators label after learning the case labeling rules, which avoids the process in which the demand side drafts complex labeling rules and communicates over multiple rounds, improving overall efficiency.
Specifically, the embodiment of the application can combine contrastive learning and active learning to select typical, representative raw data from the unlabeled raw data (the first raw data set), obtaining the first target raw data and the second target raw data. First, the demand-side person labels the selected typical data (the first target raw data) and gives the labeling rationale; once labeling is complete, labeling cases are generated from this batch of data. The labeling cases can be divided into two classes: first-class labeling cases serve as the annotators' practice questions and answers, and second-class labeling cases serve as examination questions and answers. Annotators complete the practice questions, learning and fitting the demand side's labeling reasoning from the labeling cases; finally, the examination questions screen out the annotators who meet the requirements to participate in formal labeling. On the one hand, screening representative data by machine learning and submitting it to the demand side for trial labeling yields labeling cases directly, sparing the demand side from drafting complex labeling rules and from the rule-communication link, which greatly improves overall labeling efficiency, especially for small samples. On the other hand, the labeling cases can serve as examination questions to screen annotators, selecting those who understand the labeling requirements to participate in the labeling process, which improves final delivery quality to a certain extent.
In a specific embodiment, please refer to fig. 5, which shows the complete flow of labeling, specifically including the following steps:
Step one: data submission. The demand-side person first submits unlabeled raw data on a visual page; this data can directly form the first raw data set, or the first raw data set can be screened out of it.
Step two: typical data selection. After the submission succeeds, typical raw data can be selected by the contrastive learning and active learning algorithms; the screened raw data are divided into first target raw data and second target raw data, and part of the first target raw data can be used to form the practice and examination questions and to generate a question-answering page.
Step three: the demand-side person answers the questions, giving the labeling answers and notes, from which the labeling cases are generated.
Step four: labeling task creation. The labeling platform provides the labeling cases, practice questions, and examination questions; annotators can study on the platform and, after finishing their study, label the second target data displayed on it.
Step five: acceptance. After labeling is complete, acceptance is performed manually or automatically.
Referring to fig. 6, a visualization result of a labeling platform is shown. Taking a picture labeling task as an example, the annotator must judge whether a given label is present in the picture. The example diagram on the left is a labeling case automatically generated by the system after the demand-side person gave standard answers and notes for the selected typical practice questions; on the right is the data to be labeled, which the annotator labels with reference to the example labeling rules on the left.
In order to verify the effectiveness of the present embodiments, the present embodiments provide two sets of comparative experimental data, the detailed experimental data are as follows:
The following table shows the effect of obtaining annotation data in two specific scenarios: a battle-guidance demand scenario and a click-guidance demand scenario.
As the table shows, compared against the two historical requirements of similar difficulty, the embodiment of the application omits the former rule-drafting and communication links by generating practice questions, examination questions, and labeling cases. The small-sample labeling efficiency gain at the acceptance rate is very pronounced, saving 80%-90% of the labeling time per unit (per 1000 data items). Labeling by case rules improves the development efficiency of the whole AI model and reduces development cost.
Taking a text-class labeling requirement as an example, the task is to judge whether a comment is a spam ("water-filling") comment. The detailed flow is as follows:
the model selects 50 practice and examination questions;
the demand-side person gives the answers and labeling rationales for the practice and examination questions (rationales for examination questions are optional); the practice and examination questions are shown in fig. 7 and fig. 8 respectively;
5-10 example labeling rules, i.e., labeling cases, are generated from the answers and rationales of the practice and examination questions, as shown in fig. 9;
annotators complete the practice questions with reference to the labeling cases; annotators who finish the practice take the examination, in which 15-20 questions are randomly drawn from the examination questions, and those who pass may participate in formal task labeling.
In this example, the spam-comment requirement was labeled with a final acceptance accuracy of 94% (the acceptance criterion being 90%). The critical path of the whole process took about 4 hours, of which the model side took about 12 minutes to pick data and labeling took 120 minutes; the time per requirement unit (per 1000 data items labeled) was about 18 hours. Compared with the data platform's ("data kitchen") current AI data labeling tasks, where the P90 statistic, i.e., the 90th-percentile historical requirement (without case labeling), has a unit time consumption (per 1000 data items labeled) of about 30 hours, the effectiveness of the case labeling scheme is very evident. Details are given in the following table:
The following are apparatus embodiments of the application, which may be used to perform the method embodiments of the application. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the application.
Referring to fig. 10, a block diagram of a labeling apparatus according to an embodiment of the application is shown. The apparatus has the function of implementing the labeling method above; the function may be realized by hardware or by hardware executing corresponding software. The apparatus may be a computer device or may be provided in a computer device. The apparatus may include:
A first raw data set obtaining module 101, configured to obtain a first raw data set, where raw data in the first raw data set is unlabeled data;
The data screening module 102 is configured to determine a plurality of first target raw data and a plurality of second target raw data in the first raw data set, where the first target raw data and the second target raw data are data whose degree of containing the target information satisfies a preset requirement, and the target information characterizes the information contained in the first raw data set;
The truth value obtaining module 103 is configured to obtain truth value data corresponding to each of the plurality of first target raw data, where the plurality of first target raw data includes a data set for learning annotation and a data set for verifying learning effect;
The case generation module 104 is configured to generate a plurality of labeling cases according to each first target raw data and its corresponding truth data; specifically, according to the data set for learning labeling and its corresponding truth data, and according to the data set for verifying the learning effect and its corresponding truth data;
The labeling module 105 is configured to obtain a labeling result for at least one second target raw data based on the plurality of labeling cases.
In an exemplary embodiment, the labeling module 105 is configured to display the at least one second target raw data when a target message is acquired, where the target message indicates that learning of the labeling cases is complete;
and, in response to detecting a labeling operation on the at least one second target raw data, to acquire the labeling result for the at least one second target raw data.
In an exemplary embodiment, the case generation module 104 is configured to:
classifying the plurality of first target raw data to obtain a first-class first raw data set and a second-class first raw data set, wherein the first-class first raw data set is a data set for learning labeling, and the second-class first raw data set is a data set for verifying learning effects;
Classifying the labeling cases generated by the first target raw data belonging to the first raw data set and the corresponding true value data into a first labeling case set;
Classifying the labeling cases generated by the first target raw data belonging to the second class first raw data set and the corresponding true value data into a second class labeling case set;
And acquiring a labeling result aiming at least one second target raw data based on the first labeling case set and the second labeling case set.
In an exemplary embodiment, the labeling module 105 is configured to:
displaying a first type labeling case set;
in response to the first message being acquired, displaying at least one first target raw data in the second-class first raw data set;
responding to the condition of obtaining a labeling result to be verified aiming at least one first target raw data in the second-class first raw data set, and verifying the labeling result to be verified according to the second-class labeling case;
and acquiring the target message when the verification is passed.
In an exemplary embodiment, the first target raw data includes target information to a higher degree than the second target raw data includes target information.
In an exemplary embodiment, the first set of raw data acquisition module 101 is configured to:
Acquiring a second raw data set, wherein raw data in the second raw data set are original unlabeled data;
extracting features of each piece of raw data in the second raw data set to obtain corresponding feature information;
and determining the first raw data set in the second raw data set according to the characteristic information.
In an exemplary embodiment, the first set of raw data acquisition module 101 is configured to:
obtain a first model through contrastive learning training and a second model through active learning training, where the first model is used to extract features from each raw data in the second raw data set, and the second model is used to determine the first raw data set within the second raw data set according to the feature information.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the above functional modules is only an example; in practice, the functions may be assigned to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments above belong to the same concept; the specific implementation of the apparatus is detailed in the method embodiments and is not repeated here.
Referring to fig. 11, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a server for performing the labeling method described above. Specifically:
The computer device 1600 includes a central processing unit (Central Processing Unit, CPU) 1601, a system memory 1604 including a random access memory (Random Access Memory, RAM) 1602 and a read-only memory (Read-Only Memory, ROM) 1603, and a system bus 1605 connecting the system memory 1604 and the central processing unit 1601. The computer device 1600 also includes a basic input/output system (I/O system) 1606, which facilitates the transfer of information between devices within the computer, and a mass storage device 1607 for storing an operating system 1613, application programs 1614, and other program modules 1615.
The basic input/output system 1606 includes a display 1608 for displaying information and an input device 1609, such as a mouse or keyboard, for the user to input information. The display 1608 and the input device 1609 are both connected to the central processing unit 1601 through an input/output controller 1610 connected to the system bus 1605. The basic input/output system 1606 may also include the input/output controller 1610 for receiving and processing input from a keyboard, mouse, electronic stylus, or other devices. Similarly, the input/output controller 1610 also provides output to a display screen, printer, or other type of output device.
The mass storage device 1607 is connected to the central processing unit 1601 by a mass storage controller (not shown) connected to the system bus 1605. The mass storage device 1607 and its associated computer-readable media provide non-volatile storage for the computer device 1600. That is, mass storage device 1607 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.
Computer-readable media may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state memory technology, CD-ROM (Compact Disc Read-Only Memory), DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1604 and the mass storage device 1607 may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1600 may also operate through a remote computer connected to a network, such as the Internet. That is, the computer device 1600 may be connected to the network 1612 through a network interface unit 1611 coupled to the system bus 1605, or the network interface unit 1611 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes a computer program that is stored in the memory and configured to be executed by one or more processors to implement the labeling method described above.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which at least one instruction, at least one program, a code set, or an instruction set is stored; the at least one instruction, the at least one program, the code set, or the instruction set, when executed by a processor, implements the labeling method described above.
Alternatively, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory), SSD (Solid State Drive), an optical disc, or the like. The random access memory may include ReRAM (Resistive Random Access Memory) and DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the labeling method described above.
The labeling method at least comprises the following steps:
Acquiring a first raw data set, wherein raw data in the first raw data set are unlabeled data;
determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are data whose degree of containing target information meets a preset requirement, and the target information characterizes information in the first raw data set;
Acquiring true value data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprises a data set for learning labeling and a data set for verifying learning effects;
generating a plurality of labeling cases according to each first target raw data and its corresponding truth value data, specifically according to the data set for learning labeling and its corresponding truth value data, and according to the data set for verifying learning effects and its corresponding truth value data;
and acquiring a labeling result for at least one second target raw data based on the plurality of labeling cases, as sketched below.
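To make the flow concrete, the following is a minimal, self-contained Python sketch of the five steps above. All names (contains_target_info, labeling_pipeline, the "positive" truth labels, the two-sample cut) are illustrative assumptions for exposition, not identifiers or values from this application.

def contains_target_info(sample: str) -> float:
    # Toy stand-in for scoring how strongly a sample carries target information.
    return len(set(sample)) / max(len(sample), 1)

def labeling_pipeline(first_raw_dataset: list) -> dict:
    # Steps 1-2: rank samples by target-information content; the top ones
    # become first target raw data (case material), the rest become
    # second target raw data (the data actually to be labeled).
    ranked = sorted(first_raw_dataset, key=contains_target_info, reverse=True)
    first_targets, second_targets = ranked[:2], ranked[2:]

    # Step 3: obtain truth value data for the first target raw data
    # (supplied by the demand side in practice; hard-coded here).
    truth = {sample: "positive" for sample in first_targets}

    # Step 4: each labeling case pairs a sample with its truth value.
    labeling_cases = [(sample, truth[sample]) for sample in first_targets]

    # Step 5: after annotators learn the cases, collect labels for the
    # remaining data (simulated with a constant label here).
    assert labeling_cases, "need at least one case before labeling starts"
    return {sample: "positive" for sample in second_targets}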
In an embodiment, the obtaining, based on the plurality of labeling cases, a labeling result for at least one second target raw data includes:
displaying the at least one second target raw data in a case that a target message is acquired, wherein the target message characterizes that learning of the labeling cases is complete;
and in response to detecting a labeling operation for the at least one second target raw data, acquiring the labeling result for the at least one second target raw data.
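A minimal sketch of this gating, under the assumption that a boolean flag stands in for receipt of the target message and Python's built-in input() stands in for the labeling interface:

def collect_labels(second_targets: list, target_message_received: bool) -> list:
    if not target_message_received:
        return []  # nothing is displayed before learning is complete
    results = []
    for item in second_targets:
        label = input(f"Label for {item!r}: ")  # the detected labeling operation
        results.append((item, label))
    return results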
In one embodiment, before the generating a plurality of labeling cases according to each first target raw data and its corresponding truth value data (that is, according to the data set for learning labeling and its corresponding truth value data, and according to the data set for verifying learning effects and its corresponding truth value data), the method includes:
classifying the plurality of first target raw data to obtain a first-class first raw data set and a second-class first raw data set, wherein the first-class first raw data set is a data set for learning labeling, and the second-class first raw data set is a data set for verifying learning effects;
After the generating a plurality of labeling cases according to each first target raw data and its corresponding truth value data (that is, according to the data set for learning labeling and its corresponding truth value data, and according to the data set for verifying learning effects and its corresponding truth value data), the method further includes:
classifying the labeling cases generated from the first target raw data belonging to the first-class first raw data set and their corresponding truth value data into a first-class labeling case set;
classifying the labeling cases generated from the first target raw data belonging to the second-class first raw data set and their corresponding truth value data into a second-class labeling case set.
The acquiring, based on the plurality of labeling cases, a labeling result for at least one second target raw data then includes:
acquiring a labeling result for at least one second target raw data based on the first-class labeling case set and the second-class labeling case set.
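The classification step can be pictured with the short sketch below; the shuffle and the 80/20 cut between the learning set and the verification set are assumptions, since the application does not fix a split ratio.

import random

def split_first_targets(first_targets: list, seed: int = 0) -> tuple:
    # First-class set: cases the annotator learns from.
    # Second-class set: cases used to verify the learning effect.
    shuffled = list(first_targets)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(0.8 * len(shuffled)))
    first_class, second_class = shuffled[:cut], shuffled[cut:]
    return first_class, second_class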
In one embodiment, before displaying the at least one second target raw data when the target message is acquired, the method includes:
displaying the first-class labeling case set;
in response to the first message being acquired, displaying at least one first target raw data in the second-class first raw data set;
in response to obtaining a labeling result to be verified for at least one first target raw data in the second-class first raw data set, verifying the labeling result to be verified according to the second-class labeling case set;
and acquiring the target message when the verification is passed.
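A compact sketch of this verification loop follows; the 0.9 pass threshold is an assumption, and returning True plays the role of acquiring the target message.

def verify_learning(to_be_verified: dict, second_class_cases: list,
                    threshold: float = 0.9) -> bool:
    # second_class_cases: (sample, truth value) pairs from the
    # second-class labeling case set.
    if not second_class_cases:
        return False
    correct = sum(1 for sample, truth in second_class_cases
                  if to_be_verified.get(sample) == truth)
    passed = correct / len(second_class_cases) >= threshold
    return passed  # True == verification passed, target message acquired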
In one embodiment, the degree to which the first target raw data contains target information is higher than the degree to which the second target raw data contains target information.
In one embodiment, the acquiring the first raw data set includes:
Acquiring a second raw data set, wherein raw data in the second raw data set are original unlabeled data;
extracting features of each piece of raw data in the second raw data set to obtain corresponding feature information;
and determining the first raw data set in the second raw data set according to the characteristic information.
In one embodiment, the method further comprises:
obtaining a first model through contrastive learning training and a second model through active learning training, wherein the first model is used for extracting features of each piece of raw data in the second raw data set, and the second model is used for determining the first raw data set in the second raw data set according to the feature information.
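As a rough illustration of how the two models could be realized, the sketch below assumes PyTorch, trains the first model with an InfoNCE-style contrastive loss, and uses uncertainty sampling as a stand-in for the active-learning second model; the architecture, layer sizes, and temperature are illustrative, not taken from this application.

import torch
import torch.nn.functional as F

class Encoder(torch.nn.Module):  # first model: feature extraction
    def __init__(self, dim_in: int = 128, dim_out: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim_in, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07):
    # Contrastive objective: two augmented views of the same sample are
    # positives; every other sample in the batch is a negative.
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def select_uncertain(features: torch.Tensor, classifier: torch.nn.Module,
                     k: int) -> torch.Tensor:
    # Second-model stand-in: keep the k samples the classifier is least
    # confident about, i.e. those most worth labeling.
    with torch.no_grad():
        probs = classifier(features).softmax(dim=-1)
    confidence = probs.max(dim=-1).values
    return torch.topk(-confidence, k).indices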
It should be understood that "a plurality" herein refers to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it. In addition, the step numbers described herein merely exemplify one possible execution order of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example, two differently numbered steps may be executed simultaneously, or in an order opposite to that shown, which is not limited by the embodiments of the present application.
In addition, the embodiments of the present application involve related data such as content consumption object information. When the above embodiments are applied to specific products or technologies, the permission or consent of the content consumption object needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The foregoing description of the exemplary embodiments of the application is not intended to limit the application to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.

Claims (5)

1. A method of labeling, the method comprising:
acquiring a second raw data set, wherein raw data in the second raw data set are original unlabeled data, and the original unlabeled data comprise texts or pictures;
obtaining a first model through contrastive learning training and obtaining a second model through active learning training, wherein the first model is used for extracting features of each piece of raw data in the second raw data set, and the second model is used for determining a first raw data set in the second raw data set according to the feature extraction result;
determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are data whose degree of containing target information meets a preset requirement, the target information characterizes information in the first raw data set, and the degree to which any first target raw data contains the target information is higher than the degree to which any second target raw data contains the target information;
Acquiring true value data corresponding to each of the plurality of first target raw data; classifying the plurality of first target raw data to obtain a first-class first raw data set and a second-class first raw data set;
displaying a first-class labeling case set for enabling an annotator to fit the labeling idea of the demand side, wherein the first-class labeling case set comprises practice questions for the annotator and corresponding reference answers, and consists of labeling cases generated from the first target raw data belonging to the first-class first raw data set and the corresponding truth value data;
in response to obtaining a to-be-verified labeling result given by the annotator for an examination question, verifying the to-be-verified labeling result according to the corresponding second-class labeling case set, wherein the second-class labeling case set comprises examination questions and corresponding reference answers, and consists of labeling cases generated from the first target raw data belonging to the second-class first raw data set and the corresponding truth value data;
and in a case that the verification result characterizes that all the labeling cases have been learned, acquiring a labeling result for at least one second target raw data.
2. A labeling device, the device comprising:
a first raw data set acquisition module, which is used for acquiring a second raw data set, wherein raw data in the second raw data set are original unlabeled data, and the original unlabeled data comprise texts or pictures; and for obtaining a first model through contrastive learning training and a second model through active learning training, wherein the first model is used for extracting features of each piece of raw data in the second raw data set, and the second model is used for determining a first raw data set in the second raw data set according to the feature extraction result;
a data screening module, which is used for determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are data whose degree of containing target information meets a preset requirement, the target information characterizes information in the first raw data set, and the degree to which any first target raw data contains the target information is higher than the degree to which any second target raw data contains the target information;
a truth value acquisition module, which is used for acquiring truth value data corresponding to each of the plurality of first target raw data, and for classifying the plurality of first target raw data to obtain a first-class first raw data set and a second-class first raw data set;
and a labeling module, which is used for displaying a first-class labeling case set for enabling an annotator to fit the labeling idea of the demand side, wherein the first-class labeling case set comprises practice questions for the annotator and corresponding reference answers, and consists of labeling cases generated from the first target raw data belonging to the first-class first raw data set and the corresponding truth value data;
for, in response to obtaining a to-be-verified labeling result given by the annotator for an examination question, verifying the to-be-verified labeling result according to the corresponding second-class labeling case set, wherein the second-class labeling case set comprises examination questions and corresponding reference answers, and consists of labeling cases generated from the first target raw data belonging to the second-class first raw data set and the corresponding truth value data;
and for, in a case that the verification result characterizes that all the labeling cases have been learned, acquiring a labeling result for at least one second target raw data.
3. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set, or instruction set that is loaded and executed by the processor to implement the labeling method of claim 1.
4. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the labeling method of claim 1.
5. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the labeling method of claim 1.
CN202210713931.0A 2022-06-22 2022-06-22 Labeling method, labeling device, labeling apparatus, labeling storage medium and labeling program product Active CN115146716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210713931.0A CN115146716B (en) 2022-06-22 2022-06-22 Labeling method, labeling device, labeling apparatus, labeling storage medium and labeling program product


Publications (2)

Publication Number Publication Date
CN115146716A CN115146716A (en) 2022-10-04
CN115146716B true CN115146716B (en) 2024-06-14

Family

ID=83409249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210713931.0A Active CN115146716B (en) 2022-06-22 2022-06-22 Labeling method, labeling device, labeling apparatus, labeling storage medium and labeling program product

Country Status (1)

Country Link
CN (1) CN115146716B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314205A (en) * 2021-05-28 2021-08-27 北京航空航天大学 Efficient medical image labeling and learning system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11914674B2 (en) * 2011-09-24 2024-02-27 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform
US20210089896A1 (en) * 2019-08-19 2021-03-25 Savitude, Inc. Automated Image Processing System for Garment Targeting and Generation
JP6832410B1 (en) * 2019-11-11 2021-02-24 株式会社Z会 Learning effect estimation device, learning effect estimation method, program
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN112633002A (en) * 2020-12-29 2021-04-09 上海明略人工智能(集团)有限公司 Sample labeling method, model training method, named entity recognition method and device
CN112817544A (en) * 2021-03-05 2021-05-18 北京星网锐捷网络技术有限公司 Data processing method, storage system and storage device
CN113469291B (en) * 2021-09-01 2021-11-30 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN113835639B (en) * 2021-09-26 2024-03-19 深圳大普微电子科技有限公司 I/O request processing method, device, equipment and readable storage medium


Also Published As

Publication number Publication date
CN115146716A (en) 2022-10-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40077110)
GR01 Patent grant