CN118228047A - Data labeling method, device and equipment - Google Patents

Data labeling method, device and equipment

Info

Publication number
CN118228047A
Authority
CN
China
Prior art keywords
model
labeling
data
target data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410256899.7A
Other languages
Chinese (zh)
Inventor
赵楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202410256899.7A
Publication of CN118228047A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification disclose a data labeling method, apparatus, and device. The method includes: obtaining prompt information for target data to be labeled; inputting the prompt information into a pre-trained labeling model and labeling the target data with the labeling model to obtain the label information corresponding to the target data, where the labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model measuring the difference between generated label information and the label information expected by the annotator; and finally, training a business model in the business corresponding to the target data based on the label information corresponding to the target data.

Description

Data labeling method, device and equipment
Technical Field
The present document relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for labeling data.
Background
The ability of artificial intelligence technology to process unstructured data (e.g., image, text, and voice data) keeps improving, and this ability is inseparable from vast amounts of labeled data (i.e., data carrying label information). Data labeling is the process of tagging data content so that it can be recognized by computer vision or natural language processing; through labeling, unstructured data such as images, text, and voice become easier to characterize, learn, and understand.
In general, data labeling is carried out mainly through large-scale manual work by many data annotators, labeling quality inspectors, and so on. However, this labeling process is time-consuming and requires manual participation, so it consumes a great deal of time and human resources, and for large-scale data sets the labeling time and manpower multiply. Moreover, as people attach more importance to private data, manual handling of data increases the probability of data leakage. In addition, labeling results are highly subjective: owing to the subjective factors of different annotators, the same data may receive different labels, which makes the labeling results inconsistent and unreliable, and a model trained on them may behave unstably on new data. It is therefore necessary to provide a data labeling method that reduces time and human-resource consumption, improves labeling quality, and improves the consistency and reliability of labeling results.
Disclosure of Invention
The embodiments of this specification aim to provide a data labeling approach that not only reduces the consumption of time and human resources but also improves the quality of data labeling and the consistency and reliability of labeling results.
In order to achieve the above purpose, the embodiments of the present specification are implemented as follows:
The embodiments of this specification provide a data labeling method, which includes: obtaining prompt information for target data to be labeled; inputting the prompt information into a pre-trained labeling model and labeling the target data with the labeling model to obtain the label information corresponding to the target data, where the labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model being used to measure the difference between generated label information and the label information expected by the annotator; and training a business model in the business corresponding to the target data based on the label information corresponding to the target data.

The embodiments of this specification provide a data labeling apparatus, which includes: a prompt information acquisition module that obtains prompt information for target data to be labeled; a labeling module that inputs the prompt information into a pre-trained labeling model and labels the target data with the labeling model to obtain the label information corresponding to the target data, where the labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model being used to measure the difference between generated label information and the label information expected by the annotator; and an application module that trains a business model in the business corresponding to the target data based on the label information corresponding to the target data.

The embodiments of this specification provide a data labeling device, which includes: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to: obtain prompt information for target data to be labeled; input the prompt information into a pre-trained labeling model and label the target data with the labeling model to obtain the label information corresponding to the target data, where the labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model being used to measure the difference between generated label information and the label information expected by the annotator; and train a business model in the business corresponding to the target data based on the label information corresponding to the target data.

The present description also provides a storage medium for storing computer-executable instructions that, when executed by a processor, implement the following process: obtaining prompt information for target data to be labeled; inputting the prompt information into a pre-trained labeling model and labeling the target data with the labeling model to obtain the label information corresponding to the target data, where the labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model being used to measure the difference between generated label information and the label information expected by the annotator; and training a business model in the business corresponding to the target data based on the label information corresponding to the target data.
Drawings
In order to describe the embodiments of the present specification or the solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some of the embodiments described in this specification, and a person skilled in the art may obtain other drawings from them without inventive effort:
FIG. 1 is a schematic diagram of a process for labeling data according to the present disclosure;
FIG. 2 is a diagram illustrating an embodiment of a method for labeling data according to the present disclosure;
FIG. 3 is a schematic diagram of a model training process for a reward model of the present disclosure;
FIG. 4 is a schematic diagram of a model training process for labeling models according to the present disclosure;
FIG. 5 is a schematic diagram of another model training process for labeling models according to the present disclosure;
FIG. 6 is a diagram of another embodiment of a method for labeling data according to the present disclosure;
FIG. 7 is an embodiment of a device for labeling data according to the present disclosure;
FIG. 8 is an embodiment of data labeling equipment according to the present disclosure.
Detailed Description
The embodiment of the specification provides a method, a device and equipment for labeling data.
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The embodiments of this specification provide a data labeling mechanism that fuses a large language model with reinforcement learning. The ability of artificial intelligence technology to process unstructured data (such as image, text, and voice data) keeps improving, and this ability is inseparable from massive amounts of labeled data (i.e., data carrying label information). Data labeling is the process of tagging data content so that it can be recognized by computer vision or natural language processing; through labeling, unstructured data such as images, text, and voice become easier to characterize, learn, and understand.
In general, data labeling is carried out mainly through large-scale manual work by many data annotators, labeling quality inspectors, and so on. As shown in Fig. 1, a project manager first issues a labeling task, including the data source to be labeled and the type of the labeling task (such as image data classification or text word segmentation). Data annotators then accept and carry out the labeling task; one labeling task is usually split into different subtask packages that are distributed to different annotators to be completed separately. After the annotators finish labeling, the results are submitted to an auditor for a first round of review; if a labeling result is wrong, the review is rejected and the annotator labels the data a second time until the review passes. The labeling results are then handed to a quality inspector for quality evaluation; if a result is wrong, the quality inspection is rejected and the data is labeled again until the inspection passes. Finally, the labeling results are handed to an acceptance party (which may be the party that requested the labeling) for final acceptance against the original requirements. However, this labeling process is time-consuming and requires manual participation, so it consumes a great deal of time and human resources; for large-scale data sets, the labeling time and manpower multiply. The labeling results are also highly subjective: owing to the subjective factors of different annotators, the same data may receive different labels, which makes the results inconsistent and unreliable, and a model trained on them may behave unstably on new data. Finally, the manual quality inspection after labeling is a "second-pass" manual intervention and cannot fundamentally solve the problem of whether the labeling results are trustworthy.

In order to further improve labeling quality and reduce the resource consumption of manual labeling, this specification provides a data labeling approach that fuses a large language model with reinforcement learning: a reward model is introduced to convert the difference between the annotator's view of the labeling result and the model's prediction into explicit data, and the labeling model is then optimized directly through reinforcement learning. The reinforcement-learning training process keeps aligning the labeling model with complex human values, which improves the accuracy and robustness of the model so that it can output more accurate label information. This not only reduces time and human-resource consumption but also improves the consistency and reliability of labeling results. The specific processing can be seen in the details of the following embodiments.
As shown in Fig. 2, an embodiment of the present disclosure provides a data labeling method. The execution subject of the method may be a terminal device or a server: the terminal device may be a mobile terminal such as a mobile phone or tablet computer, a computer device such as a notebook or desktop computer, or an IoT device (such as a smart watch or a vehicle-mounted device); the server may be a standalone server or a server cluster composed of multiple servers, and it may be the background server of a business such as a financial or online-shopping business, or the background server of a certain application. In this embodiment, the execution subject is taken to be a server for the detailed description; for the case where the execution subject is a terminal device, the processing for the server described below may be referred to and is not repeated here. The method specifically includes the following steps:
In step S202, prompt information for target data to be labeled is obtained.
The target data may include various types of data. For example, the target data may be an image; the image may differ across businesses. Specifically, in an identity-recognition business the image may be a facial image or a fingerprint image of a user, and in an image-repair business the image may be any image that needs repair. The target data may also be other types of data; for example, it may be text data. Specifically, in a risk-identification business the text data may include transaction data between different users, user information, historical user data, and the like, and in an information-recommendation business the text data may include the user's preference information, historical recommendation information, feedback information, and the like. This may be set according to the actual situation, and the embodiments of this specification do not limit it. The prompt information may be a Prompt for a large model and may be used to describe the target data. For example, in a risk prevention and control business, the prompt information may include detection-task information for a certain risk and information about the detection rules for that risk; the rule information may describe how to detect or determine that a certain user or certain data carries a preset risk. For example, the prompt information may include detection-task information for the K risk, i.e., "please analyze whether user A has the K risk", and may also include the detection rule for the K risk, e.g., "if the number of transactions is greater than 30 and the total transaction amount is greater than 100,000, it is determined that the K risk exists". In practical applications, various kinds of prompt information may be included; this may be set according to the actual situation and is not limited by the embodiments of this specification.
In implementation, when a model that performs certain processing exists in a certain business and needs model training, the data required to train the model (i.e., the target data) can be obtained. For example, if a risk prevention and control model in a certain financial business needs training, data of one or more modalities in that business can be obtained, such as image data, text data, and audio data, and the data of these modalities can be used as the target data to be labeled. The target data can then be analyzed in combination with the above model to determine the prompt information corresponding to the target data. Taking the risk prevention and control model as an example, the model may be used to detect specified risks (one risk or multiple different risks), and corresponding detection rules can be set for the specified risks. On this basis, corresponding prompt information can be set for the target data; the prompt information may include detection-task information for the specified risk, information about the detection rules for the specified risk, and so on, and can also be set according to the actual situation, as sketched below.
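As a concrete illustration of step S202, the following minimal sketch assembles a Prompt for a hypothetical K-risk detection task; the function name, field names, rule text, and thresholds are illustrative assumptions rather than part of the disclosed method.

```python
# Sketch of step S202: building prompt information for target data to be labeled.
# The task description, detection rule, and data fields are illustrative assumptions.
def build_prompt(user_id: str, transactions: list) -> str:
    task = f"Please analyze whether user {user_id} has the K risk."
    rule = ("Detection rule for the K risk: the number of transactions is greater than 30 "
            "and the total transaction amount is greater than 100,000.")
    stats = (f"Observed data: {len(transactions)} transactions, "
             f"total amount {sum(t['amount'] for t in transactions)}.")
    return "\n".join([task, rule, stats])

prompt = build_prompt("user_A", [{"amount": 5000}] * 35)
print(prompt)
```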
In step S204, the prompt information is input into a pre-trained labeling model, and the target data is labeled by the labeling model to obtain the label information corresponding to the target data. The labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model; the reward model is used to measure the difference between generated label information and the label information expected by the annotator.
The labeling model may be used to set corresponding label information for specified data, so that a specified model can then be trained in a supervised manner with the data and its label information. The labeling model may be built from a pre-trained large model. The large model may be of many kinds; for example, it may be a model composed of Transformer modules, or a model built from other algorithms (such as classification or clustering algorithms) and/or networks (such as recurrent or convolutional neural networks). This can be set according to the actual situation and is not limited by the embodiments of this specification.

In reinforcement learning, an agent generates a corresponding action according to the known environment and its state; the environment then changes state as a result of the action and outputs a corresponding reward signal. After receiving the new state and the reward signal, the agent adjusts its next action, either immediately or with a delay, according to the actual situation, so that the decision policy in reinforcement learning is continuously optimized during learning. The agent may consist of one or more of a decision policy, a value function, and a model. The decision policy may be a mapping from states to actions; the agent selects actions according to the decision policy, and the final goal of reinforcement learning is to learn a decision policy that controls how the agent acts. The value function may be used to evaluate how "good" or "bad" the current state is. The model is the agent's model of the environment, i.e., a model of the environment's dynamics. The state may be the existing information that determines what happens in the future (or at some future moment). The environment is where the agent in reinforcement learning operates; for example, when a risk prevention and control model in the financial-transaction field (specifically, payments or transfers) is optimized with a reinforcement learning algorithm, everything related to or interacting with the risk prevention and control model other than the model itself may constitute the environment, such as the transaction amount, the account information of both transaction parties, and the risk status information of both parties.

The reward model may be built from various algorithms or networks; for example, it may be built from a discriminative algorithm (specifically, a classification algorithm), or from a convolutional neural network. This can be set according to the actual situation and is not limited by the embodiments of this specification.
In implementation, a corresponding algorithm may be obtained and a reward model built from it. The input data of the reward model may be the generated label information and the label information expected by the annotator, and the output data may be information measuring the difference between the two (which may be a numerical value, text information, etc.). Training samples for the reward model (i.e., generated label information and the corresponding label information expected by the annotator, etc.) can then be obtained and used to train the reward model; during training, an objective function may be preset and the model parameters in the reward model optimized based on this objective function, finally yielding the trained reward model.
In addition, a pre-trained large model may be obtained and its model parameters further fine-tuned. The input data of the large model may be the prompt information of specified data and the label information of that data, and the output data may be the label information predicted for the data. Training samples for the large model (i.e., prompt information of specified data, label information of the specified data, etc.) can then be obtained and used in a reinforcement learning manner: the large model first generates a corresponding action according to the known environment and its state; a reward function is constructed from the output data of the pre-trained reward model; because the environment changes state as a result of the action, the reward function outputs a corresponding reward signal; and after receiving the new state and the reward signal, the large model adjusts its next action, immediately or with a delay, according to the actual situation. The model parameters of the large model are thus continuously optimized during learning, and the final labeling model is obtained. The labeling model obtained in this way can label the specified data.
After the labeling model is obtained in the above manner, the prompt information can be input into the pre-trained labeling model, and the target data is labeled by the labeling model to obtain the label information corresponding to the target data (i.e., the label information predicted for the target data).
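As an illustration of step S204, the sketch below feeds the prompt information into a labeling model served as a causal language model and decodes the generated label information. It assumes a Hugging Face Transformers-style interface; the model path, decoding settings, and output format are placeholders, not the actual model of this disclosure.

```python
# Sketch of step S204: the prompt information is input into the pre-trained labeling model,
# which generates the label information for the target data. Paths and settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/labeling-model")   # placeholder path
model = AutoModelForCausalLM.from_pretrained("path/to/labeling-model")

prompt = ("Please analyze whether user A has the K risk. Detection rule: more than 30 "
          "transactions and a total amount greater than 100,000.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
label_info = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(label_info)  # e.g. "K risk: present" (illustrative output)
```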
In step S206, model training is performed on the business model in the business corresponding to the target data based on the label information corresponding to the target data.
The business models may be of many types. Different business models may be set in different businesses, and multiple business models with different functions may be set within the same business. For example, an information-recommendation model may be set in an information-recommendation business, a risk prevention and control model may be set in a financial business, and a financial business may simultaneously contain a user identity-recognition model, a risk prevention and control model for financial transactions, and so on; this can be set according to the actual situation.
In implementation, after the corresponding label information has been generated for the target data in the above manner, the available data include the content of the target data, the label information of the target data, and the prompt information of the target data. The required data can be obtained from these and used to train the business model in the business corresponding to the target data. For example, the target data and its label information can be used to train a business model built from a deep neural network, or the prompt information of the target data and the label information of the target data can be used to train a business model built from a pre-trained large model; this can be set according to the actual situation.
The embodiments of this specification provide a data labeling method: prompt information for the target data to be labeled is obtained and input into a pre-trained labeling model, and the target data is labeled by the labeling model to obtain the label information corresponding to the target data. The labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model measuring the difference between generated label information and the label information expected by the annotator. In addition, the business model in the business corresponding to the target data can be trained based on the label information corresponding to the target data; because the obtained label information is more accurate, the accuracy of the business model and of its prediction results can be improved.
The specific processing of step S206 may vary; an optional manner is provided below, which may specifically include the following: model training is performed on the business model in the business corresponding to the target data based on the target data and the label information corresponding to the target data; or the business model in the business corresponding to the target data is trained based on the prompt information of the target data and the label information corresponding to the target data; or the business model in the business corresponding to the target data is trained based on the target data, the prompt information of the target data, and the label information corresponding to the target data.

In implementation, the business model trained with the target data and the label information corresponding to the target data may be of many kinds, for example a business model built from a specified algorithm (such as a classification or clustering algorithm) or from a specified network (such as a convolutional or recurrent neural network), which can be set according to the actual situation.

The business model trained with the prompt information of the target data and the label information corresponding to the target data may include business models built from a pre-trained large model (such as a pre-trained large language model or a pre-trained discriminative large model), for example the labeling model described above.

The business model trained with the target data, the prompt information of the target data, and the label information corresponding to the target data may also be of many kinds, for example a business model built from a pre-trained large model, which can be set according to the actual situation.
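As a sketch of step S206 under the first option above (training on the target data and its label information), the example below trains a small classification network; the feature encoding, network size, and label format are illustrative assumptions, not the disclosed business model.

```python
# Sketch of step S206: supervised training of a business model (here a small risk-classification
# network) on target data and the labels produced by the labeling model. Dimensions are assumptions.
import torch
import torch.nn as nn

features = torch.randn(256, 16)              # target data encoded as 16-dimensional features
labels = torch.randint(0, 2, (256,))         # label information produced by the labeling model

business_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(business_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(business_model(features), labels)
    loss.backward()
    optimizer.step()
```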
In practical applications, the reward model may be trained in a plurality of different manners, and an alternative processing manner is provided below, that is, the training sample is used to directly perform model training on the individual reward model, which may specifically include the following processing of step A2 and step A4.
In step A2, label sample data for the sample data, predicted label information generated for the sample data, and reward information corresponding to the label sample data are acquired.
The label sample data may be the real label information of the sample data and may be set through manual annotation (e.g., based on expert experience). The predicted label information may be label information generated for the sample data in a specified manner, such as by the labeling model described above, or by a specified network (e.g., a deep neural network) or algorithm (e.g., a classification or clustering algorithm).
In step A4, model training is performed on the reward model based on the label sample data, the predicted label information, and the reward information corresponding to the label sample data, to obtain a trained reward model.
In implementation, the label sample data and the predicted label information may be input into a reward model to obtain corresponding reward prediction information, loss information between the reward prediction information and the reward information may be calculated through a preset loss function, model parameters of the reward model are adjusted based on the loss information to perform model training on the reward model, and if the loss function is not converged at this time, model training on the reward model is continuously performed in the above manner until the loss function converges, so as to obtain a trained reward model.
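A minimal sketch of steps A2 and A4, assuming the label sample data and the predicted label information have already been encoded into fixed-size vectors by some encoder; the reward model regresses the annotator-provided reward information with an MSE loss. The dimensions, the encoder, and the architecture are assumptions for illustration.

```python
# Sketch of steps A2/A4: training the reward model alone against annotator reward information.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Takes the encoded (label sample data, predicted label) pair and outputs a scalar reward.
        self.net = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, true_label_vec, pred_label_vec):
        return self.net(torch.cat([true_label_vec, pred_label_vec], dim=-1)).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

true_vec, pred_vec = torch.randn(32, 64), torch.randn(32, 64)   # stand-ins for encoded labels
human_reward = torch.rand(32)                                    # reward information from annotators

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(reward_model(true_vec, pred_vec), human_reward)
    loss.backward()
    optimizer.step()
```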
It should be noted that the reward model may be used to evaluate how good the predicted label information appears to humans, converting it into a scalar that characterizes the quality of the result so as to quantify the gap between the predicted label information and the label information expected by the annotator. Specifically, a worker may rank the predicted label information generated for the sample data from best to worst, and a reward model is then trained to predict the score of the generated predicted label information; with the generated predicted label information and the ranking scores, the reward model builds a mathematical representation of human preference.
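When annotators provide rankings rather than reward values, a common way (assumed here, not stated explicitly in this disclosure) to turn the ranking into a training signal is a pairwise preference loss that pushes the reward of the higher-ranked output above that of the lower-ranked one:

```python
# Sketch of a pairwise (ranking-based) reward-model loss; an illustrative assumption.
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    # -log(sigmoid(r_preferred - r_rejected)), averaged over the pairs
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

r_good = torch.randn(8, requires_grad=True)   # reward_model(...) on the higher-ranked outputs
r_bad = torch.randn(8, requires_grad=True)    # reward_model(...) on the lower-ranked outputs
loss = pairwise_ranking_loss(r_good, r_bad)
loss.backward()
```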
In practical application, the reward model can be trained by performing combined training with the labeling model, and the method specifically comprises the following steps B2 to B8.
In step B2, a prompt information sample set is obtained.
In the implementation, part of data can be obtained from the database, the part of data is taken as a sample, the part of data can be marked to obtain sample label information of the part of data, and in addition, corresponding prompt information can be set for each data in the database. The prompt information corresponding to the obtained partial data can be used as a prompt information sample to construct the prompt information sample set.
In step B4, the prompt information samples in the prompt information sample set are input into the labeling model, and prediction label information corresponding to each prompt information sample is obtained.
In implementation, as shown in Fig. 3, the prompt information samples in the prompt information sample set may each be input into the labeling model, and the label information of the data corresponding to each prompt information sample is predicted by the labeling model to obtain the predicted label information corresponding to each prompt information sample, i.e., the text information recorded in Fig. 3.
In step B6, the prediction label information corresponding to each prompt information sample and the sample label information corresponding to the corresponding prompt information sample are input into the reward model, so as to obtain the reward information corresponding to each prompt information sample.
In step B8, based on the reward information corresponding to each prompt information sample and the sample reward information corresponding to the corresponding prompt information sample, the labeling model and the reward model are jointly trained through a preset loss function to obtain a trained reward model.
In implementation, as shown in Fig. 3, a worker may evaluate the reward of the predicted label information corresponding to each prompt information sample and determine the sample reward information corresponding to each prompt information sample. Then, through a preset loss function, the loss between the reward information corresponding to each prompt information sample and the sample reward information corresponding to that sample can be calculated, and the model parameters of the labeling model and the reward model can be adjusted based on this loss so as to train the two models jointly. If the loss function has not converged, joint training continues in the above manner until the loss function converges, yielding the trained reward model.
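A minimal sketch of the joint training in steps B2-B8: the labeling model maps an encoded prompt information sample to a predicted-label representation, the reward model scores it against the sample label information, and the loss against the annotator's sample reward information is back-propagated through both models. All encodings, shapes, and architectures are illustrative assumptions.

```python
# Sketch of steps B2-B8: joint training of the labeling model and the reward model.
import torch
import torch.nn as nn

dim = 64
labeling_model = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
reward_model = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    list(labeling_model.parameters()) + list(reward_model.parameters()), lr=1e-4)

prompts = torch.randn(16, dim)        # encoded prompt information samples
sample_labels = torch.randn(16, dim)  # encoded sample label information
sample_reward = torch.rand(16)        # sample reward information given by annotators

for step in range(50):
    optimizer.zero_grad()
    pred_labels = labeling_model(prompts)                                    # predicted label info
    pred_reward = reward_model(torch.cat([pred_labels, sample_labels], -1)).squeeze(-1)
    loss = nn.functional.mse_loss(pred_reward, sample_reward)                # preset loss function
    loss.backward()
    optimizer.step()
```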
It should be noted that, by introducing the reward model, whether the output produced during labeling-model training appears better to humans is evaluated, and that output is converted into a scalar r_θ describing the quality of the result so as to quantify the gap between the predicted label information and the sample label information expected by the annotator.
In practical application, based on the related content of the joint training, the labeling model can be optimized based on reinforcement learning, and the following can be specifically referred to: and taking a function constructed by the output data of the trained reward model as a reward function in reinforcement learning, taking the labeling model as a strategy in reinforcement learning, taking the arrangement combination of preset words at the output position as an action space, and carrying out optimization treatment on the labeling model through reinforcement learning to obtain the optimized labeling model.
In implementation, the optimization task of the labeling model can be converted into a reinforcement learning problem, in the reinforcement learning problem, three basic elements such as a Policy (Policy), an Action Space (Action Space), a reward function (Reward Function) and the like can be defined, specifically, the Policy can be a labeling model, the labeling model uses prompt information as input data, and output data is a labeling task result (namely label information), such as a labeling result of small animals such as cats and dogs contained in a certain image. The action space can be the arrangement and combination of the word Token in the word list corresponding to the prompt information at the output position. The reward function may be constructed based on the output data of the trained reward model. The observation space in reinforcement learning may be a possible sequence of input words Token (i.e. hint information), and may be an arrangement combination of words Token in the vocabulary at the input position.
Based on the elements defined above, a complete reinforcement training mechanism can be formed, and the labeling model can be optimized on this basis. The labeling model learns through interaction with the environment: it performs actions that affect the environment it is in, the environment then transitions to a new state and returns reward information, and this reward information is the feedback signal that drives the labeling model to adjust. During training, the labeling model adjusts its own policy and takes a series of actions to maximize its return, finally yielding the optimized labeling model.
In practical applications, based on the related content of the joint training described above, the labeling model can also be optimized through reinforcement learning with human feedback; see the processing of step C2 and step C4 below.
In step C2, feedback information provided by the annotator for the predictive tag information is obtained.
In step C4, the function constructed from the output data of the trained reward model and the feedback information provided by the annotator is used as the reward function in reinforcement learning, the labeling model is used as the policy in reinforcement learning, and the permutation and combination of the preset words at the output position is used as the action space; the labeling model is optimized through reinforcement learning to obtain the optimized labeling model.
In implementation, among the predicted label information obtained by label prediction on the data corresponding to the prompt information samples through the labeling model, the data corresponding to the prompt information samples may be submitted to an annotator for verification and correction. The annotator can check the predictions output by the labeling model and provide correctly labeled sample label information as feedback. The annotator's task is to rank and annotate the results generated by the initial labeling model rather than score them, because scoring preferences differ greatly between annotators; for example, for the accuracy of a generated text, one annotator may give 0.98 while another gives 1. Ranking eliminates the differences between scoring results and reduces the large amount of noisy data caused by differences in human cognition. The feedback information provided by the annotator can be obtained in this way.
As shown in Fig. 4, the output data of the trained reward model can be combined with the feedback information provided by the annotator, and a reward function in reinforcement learning can be constructed from the combined information. The labeling model is then used as the policy in reinforcement learning, and the permutation and combination of the preset words at the output position is used as the action space; the details of this process are described above and are not repeated here. Based on the elements defined above, a complete reinforcement training mechanism can be formed. As shown in Fig. 5, the labeling model can be optimized with this mechanism: in each training batch, one or more prompt information samples obtained from the prompt information sample set are input into the initial large model and the currently fine-tuned large model, and each generates corresponding predicted label information. The predicted label information generated by the two models is passed to the reward model, which provides a value r_θ evaluating consistency with human preference (i.e., the sample label information expected by the annotator) as a penalty term for the difference. In addition, a scaled KL divergence between the predicted label information generated by the two models can be added; this term penalizes the reinforcement learning policy for deviating far from the initial model in each training batch, so as to ensure that the model outputs reasonably coherent text. A new pre-training gradient can also be added through the PPO (Proximal Policy Optimization) algorithm, and the reward function can be expected to continue to evolve as RLHF progresses. According to the PPO algorithm, the reward metric of the current batch of data is optimized and the labeling model is updated. This process can be repeated, iteratively updating the reward model and the labeling model, so that the reward model depicts the quality of the labeling model's output more and more accurately, and the labeling model's output moves further from the initial large model and becomes more and more consistent with what people expect. Finally, the labeling model whose predicted label information scores higher in the reward model is produced, i.e., the optimized labeling model.
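The core reward signal described above can be sketched as the reward model's score r_θ minus a scaled KL term between the fine-tuned labeling model and the frozen initial model, which penalizes drifting too far from the initial model; a full PPO update would then optimize this reward. The coefficient β and the tensors below are illustrative assumptions.

```python
# Sketch of the KL-penalized reward used in RLHF-style optimization of the labeling model.
import torch

def rlhf_reward(r_theta, logprobs_tuned, logprobs_init, beta=0.02):
    # Per-sequence KL penalty approximated from token log-probabilities under the two models.
    kl = (logprobs_tuned - logprobs_init).sum(dim=-1)
    return r_theta - beta * kl

r_theta = torch.tensor([0.8, 0.3])           # reward model scores for two generated labelings
logp_tuned = torch.log(torch.rand(2, 10))    # token log-probs under the fine-tuned labeling model
logp_init = torch.log(torch.rand(2, 10))     # token log-probs under the frozen initial model
print(rlhf_reward(r_theta, logp_tuned, logp_init))
```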
In practical applications, based on the related content of the joint training described above, the optimization of the labeling model through reinforcement learning with human feedback can also be realized in another way; see the processing of steps D2 to D6 below.
In step D2, a preset number of predictive label information is selected from the predictive label information corresponding to the prompt information sample included in the prompt information sample set.
The preset number may be set according to actual situations, and in this embodiment, the preset number may be set with the goal of ensuring diversity and representativeness of the manual feedback.
In step D4, the selected preset number of predicted tag information is sent to the terminal device of the annotator, and feedback information provided by the annotator for the preset number of predicted tag information and sent by the terminal device is received.
In step D6, the function constructed by the output data of the trained reward model is used as a reward function in reinforcement learning or the function constructed by the output data of the trained reward model and the feedback information is used as a reward function in reinforcement learning, the labeling model is used as a strategy in reinforcement learning, the permutation and combination of the preset words at the output position is used as an action space, and the labeling model is optimized through reinforcement learning, so as to obtain the optimized labeling model.
In implementation, among the predicted label information obtained by label prediction on the data corresponding to the prompt information samples through the labeling model, a portion (the preset number) of the data corresponding to the prompt information samples can be submitted to an annotator for verification and correction. The annotator checks the predictions output by the labeling model and provides correctly labeled sample label information as feedback. The feedback requires sampling a certain amount of data corresponding to the prompt information samples so as to ensure the diversity and representativeness of the human feedback, and this also improves the optimization efficiency of the model. The annotator's task is to rank and annotate the results generated by the initial labeling model rather than score them, which eliminates the differences between scoring results and reduces the large amount of noisy data caused by differences in human cognition. In this way, the feedback information provided by the annotator for the preset number of predicted label information can be obtained, as sketched below.
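A small sketch of steps D2 and D4: a preset number of predicted labels is sampled for annotator review, and the annotator returns a ranking rather than scores. The sample size, data structure, and feedback format are illustrative assumptions.

```python
# Sketch of steps D2/D4: sampling a preset number of predicted labels for human feedback.
import random

predicted = [{"prompt_id": i, "predicted_label": f"label_{i}"} for i in range(1000)]
preset_number = 50                     # chosen to keep the feedback diverse and representative
batch_for_review = random.sample(predicted, preset_number)

# The annotator returns an ordering of the reviewed items (best first) rather than scores,
# which avoids the scoring-preference differences between different annotators.
feedback = {"ranking": [item["prompt_id"] for item in batch_for_review]}
```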
For the prompt information sample for which the feedback information provided by the annotator is not obtained, the function constructed by the output data of the trained reward model can be used as a reward function in reinforcement learning, the annotation model can be used as a strategy in reinforcement learning, the arrangement combination of the preset words at the output position can be used as an action space, the annotation model is optimized through reinforcement learning, and the specific processing process can be referred to the related content and is not repeated here.
For the prompt information sample of the feedback information provided by the annotator, the output data of the trained reward model and the function constructed by the feedback information can be used as a reward function in reinforcement learning, the annotation model is used as a strategy in reinforcement learning, the arrangement combination of the preset words at the output position is used as an action space, the annotation model is optimized through reinforcement learning, and the specific processing process can be referred to the related content and is not repeated here. The processing process can be repeated, the reward model and the labeling model can be iteratively updated, so that the reward model can more accurately describe the quality of the output data of the labeling model, the output data of the labeling model can be pulled away from the initial large model, the output data of the labeling model becomes more and more consistent with the requirements of people, and finally, the labeling model corresponding to the predictive label information with higher score in the reward model can be created, and the optimized labeling model is obtained.
In practical applications, the target data may be unstructured data, which may include one or more of image data, text data, voice data, and video data.
In practical application, the labeling model may be a model constructed by a large language model, input data of the large language model may be Prompt information, output data may be text data, and in this embodiment, the output data may be text data including tag information.
The embodiments of this specification provide a data labeling method: prompt information for the target data to be labeled is obtained and input into a pre-trained labeling model, and the target data is labeled by the labeling model to obtain the label information corresponding to the target data. The labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model measuring the difference between generated label information and the label information expected by the annotator. In addition, the business model in the business corresponding to the target data can be trained based on the label information corresponding to the target data; because the obtained label information is more accurate, the accuracy of the business model and of its prediction results can be improved.
In addition, a reward model training process is added, which explicitly converts the gap between the annotator's view of the labeling result and the label information predicted during labeling-model training into a data scalar; continuous iteration of the model minimizes this scalar, updating and optimizing the effect of the labeling model. Furthermore, the labeling model is optimized directly through reinforcement learning with human feedback; through the RLHF training process, the large language model trained on the initial labeled corpus is continuously aligned with complex human values, improving the accuracy and robustness of the model.
The data labeling method of the embodiments of this specification is described in detail below in connection with a specific application scenario. In this scenario, the labeling model may be a model built from a pre-trained large language model. The data corresponding to the prompt information samples in the prompt information sample set may be historical transaction data between different users (referred to below as the first historical transaction information for convenience), which may specifically include user behavior information, account information of both transaction parties, the transaction amount, the transaction time, the transaction location, and so on. The target data may be transaction data between different users (referred to below as the second historical transaction information), the business model may be a risk prevention and control model, the business corresponding to the target data may be a financial business or a financial transaction business, and the reward information may be a reward value.
As shown in Fig. 6, an embodiment of the present disclosure provides a data labeling method. The execution subject of the method may be a terminal device or a server: the terminal device may be a mobile terminal such as a mobile phone or tablet computer, a computer device such as a notebook or desktop computer, or an IoT device (such as a smart watch or a vehicle-mounted device); the server may be a standalone server or a server cluster composed of multiple servers, and it may be the background server of a business such as a financial or online-shopping business, or the background server of a certain application. In this embodiment, the execution subject is taken to be a server for the detailed description; for the case where the execution subject is a terminal device, the processing for the server described below may be referred to and is not repeated here. The method specifically includes the following steps:
In step S602, prompt information for the first historical transaction information is obtained as prompt information samples, and a prompt information sample set is constructed from these prompt information samples.
Wherein the first historical transaction information may include unstructured data, which may include one or more of image data, text data, voice data, and video data.
In step S604, the prompt information samples in the prompt information sample set are input into the pre-trained large language model to obtain the predicted label information corresponding to each prompt information sample.
In step S606, the predicted label information corresponding to each prompt information sample and the sample label information corresponding to the corresponding prompt information sample are input into a reward model to obtain the reward value corresponding to each prompt information sample, where the reward model is used to measure the difference between generated label information and the label information expected by the annotator.
In step S608, based on the reward value corresponding to each prompt information sample and the sample reward value corresponding to the corresponding prompt information sample, the pre-trained large language model and the reward model are jointly trained through a preset loss function, so as to obtain a trained reward model.
In step S610, a preset number of predicted label information items is selected from the predicted label information corresponding to the prompt information samples included in the prompt information sample set.
In step S612, the selected preset number of predicted tag information is sent to the terminal device of the annotator, and feedback information provided by the annotator for the preset number of predicted tag information sent by the terminal device is received.
In step S614, the function constructed from the output data of the trained reward model, or the function constructed from the output data of the trained reward model and the feedback information, is used as the reward function in reinforcement learning; the labeling model is used as the policy in reinforcement learning, and the permutation and combination of the preset words at the output position is used as the action space; the labeling model is optimized through reinforcement learning to obtain the optimized labeling model.
In step S616, prompt information for the second historical transaction information to be annotated is acquired.
Wherein the second historical transaction information may include unstructured data, which may include one or more of image data, text data, voice data, and video data.
In step S618, the prompt information is input into the labeling model, and the labeling model is used to label the second historical transaction information, so as to obtain the label information corresponding to the second historical transaction information.
In step S620, model training is performed on the risk prevention and control model in the financial business based on the second historical transaction information and the tag information corresponding to the second historical transaction information.
The embodiments of this specification provide a data labeling method: prompt information for the target data to be labeled is obtained and input into a pre-trained labeling model, and the target data is labeled by the labeling model to obtain the label information corresponding to the target data. The labeling model is built from a pre-trained large model and is obtained by reinforcement learning of the labeling model with a reward function constructed from the output data of a pre-trained reward model, the reward model measuring the difference between generated label information and the label information expected by the annotator. In addition, the business model in the business corresponding to the target data can be trained based on the label information corresponding to the target data; because the obtained label information is more accurate, the accuracy of the business model and of its prediction results can be improved.
Based on the same concept as the above method for labeling data provided in the embodiment of the present disclosure, a device for labeling data is further provided, as shown in fig. 7.
The labeling device of the data comprises: a prompt information acquisition module 701, a labeling module 702 and an application module 703, wherein:
the prompt information acquisition module 701 acquires prompt information for target data to be marked;
The labeling module 702 inputs the prompt information into a pre-trained labeling model and labels the target data through the labeling model to obtain label information corresponding to the target data, wherein the labeling model is constructed through a pre-trained large model, the labeling model is a model obtained by performing reinforcement learning on the labeling model through a reward function constructed from the output data of a pre-trained reward model, and the reward model is used for measuring the difference between the generated label information and the label information expected by a labeling person;
The application module 703 performs model training on a service model in the service corresponding to the target data based on the tag information corresponding to the target data.
In this embodiment of the present disclosure, the application module 703 performs model training on a service model in the service corresponding to the target data based on the target data and the label information corresponding to the target data; or performs model training on the service model in the service corresponding to the target data based on the prompt information of the target data and the label information corresponding to the target data; or performs model training on the service model in the service corresponding to the target data based on the target data, the prompt information of the target data and the label information corresponding to the target data.
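The three alternatives handled by the application module 703 differ only in which fields are assembled into a training sample for the service model; a minimal sketch of that branching (the field names and mode values are assumptions for illustration) could look as follows:

```python
def build_training_sample(target_data, prompt, label, mode: str = "data_only") -> dict:
    """Assemble one training sample for the service model according to the
    variant used: target data + label, prompt + label, or all three."""
    if mode == "data_only":
        return {"input": target_data, "label": label}
    if mode == "prompt_only":
        return {"input": prompt, "label": label}
    if mode == "data_and_prompt":
        return {"input": {"data": target_data, "prompt": prompt}, "label": label}
    raise ValueError(f"unknown mode: {mode}")
```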
In an embodiment of the present disclosure, the apparatus further includes:
The data acquisition module acquires label sample data aiming at sample data, predicted label information generated for the sample data, and reward information corresponding to the label sample data;
and the model training module is used for carrying out model training on the reward model based on the label sample data, the predicted label information and the reward information corresponding to the label sample data to obtain a trained reward model.
In an embodiment of the present disclosure, the apparatus further includes:
the sample set acquisition module acquires a prompt information sample set;
the label prediction module is used for inputting the prompt information samples in the prompt information sample set into the labeling model to obtain predicted label information corresponding to each prompt information sample;
The reward determining module is used for inputting the predicted label information corresponding to each prompt information sample and the sample label information corresponding to the corresponding prompt information sample into the reward model to obtain reward information corresponding to each prompt information sample;
And the joint training module is used for carrying out joint training on the labeling model and the reward model through a preset loss function based on the reward information corresponding to each prompt information sample and the sample reward information corresponding to the corresponding prompt information sample, so as to obtain a trained reward model.
In an embodiment of the present disclosure, the apparatus further includes:
The first optimization module takes a function constructed from the output data of the trained reward model as a reward function in reinforcement learning, takes the annotation model as a strategy in reinforcement learning, takes the arrangement combination of preset words at the output position as an action space, and optimizes the annotation model through reinforcement learning to obtain an optimized annotation model.
In an embodiment of the present disclosure, the apparatus further includes:
The first information acquisition module is used for acquiring feedback information provided by a labeling person aiming at the predictive label information;
And the second optimization module takes a function constructed from the output data of the trained reward model and the feedback information as a reward function in reinforcement learning, the labeling model as a strategy in reinforcement learning and the arrangement combination of preset words at the output position as an action space, and optimizes the labeling model through reinforcement learning to obtain the optimized labeling model.
In an embodiment of the present disclosure, the apparatus further includes:
The selecting module is used for selecting a preset number of predictive label information from the predictive label information corresponding to the prompt information samples contained in the prompt information sample set;
The second information acquisition module is used for transmitting the selected preset number of predicted tag information to the terminal equipment of the annotator and receiving feedback information which is transmitted by the terminal equipment and provided by the annotator for the preset number of predicted tag information;
And the third optimization module takes a function constructed from the output data of the trained reward model as a reward function in reinforcement learning, or takes a function constructed from the output data of the trained reward model and the feedback information as a reward function in reinforcement learning, takes the annotation model as a strategy in reinforcement learning and the arrangement combination of preset words at the output position as an action space, and optimizes the annotation model through reinforcement learning to obtain the optimized annotation model.
In this embodiment of the present disclosure, the target data is unstructured data, and the unstructured data includes one or more of image data, text data, voice data, and video data.
In the embodiment of the present specification, the labeling model is a model constructed by a large language model.
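For illustration, a labeling model constructed from a pre-trained large language model could be wrapped as in the sketch below, which uses the Hugging Face transformers API; the checkpoint name is a placeholder and the generate_label helper is a hypothetical convenience method, not something defined in this specification.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class LabelingModel:
    """Minimal wrapper that turns a pre-trained large language model into a
    labeling model: given prompt information, it generates label information."""
    def __init__(self, checkpoint: str = "your-pretrained-llm"):  # placeholder checkpoint name
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.model = AutoModelForCausalLM.from_pretrained(checkpoint)

    def generate_label(self, prompt: str, max_new_tokens: int = 32) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Keep only the newly generated tokens as the label information.
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)
```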
The embodiment of the specification provides a data labeling device: prompt information for target data to be labeled is acquired and input into a pre-trained labeling model, and the target data is labeled through the labeling model to obtain label information corresponding to the target data. The labeling model is constructed through a pre-trained large model and is obtained by performing reinforcement learning on the labeling model with a reward function constructed from the output data of a pre-trained reward model, where the reward model is used for measuring the difference between the generated label information and the label information expected by the annotator. In this way, a data labeling mode integrating a pre-trained large model with reinforcement learning is provided: by introducing the reward model, the difference between the annotator's understanding of the labeling result and the prediction of the labeling model is converted into explicit data, and the labeling model is directly optimized through reinforcement learning. Through the reinforcement learning training process, the labeling model can be continuously aligned with complex human requirements, which improves the accuracy and robustness of the labeling model and enables it to output more accurate label information; accordingly, not only are the time and human resources consumed by manual labeling reduced, but the quality, reliability and consistency of the labels are also improved. In addition, model training can be performed on the service model in the service corresponding to the target data based on the label information corresponding to the target data; since the obtained label information is more accurate, the accuracy of the service model and of its prediction results can also be improved.
Based on the same concept as the above device for labeling data provided in the embodiment of the present disclosure, a labeling device of data is further provided, as shown in fig. 8.
The labeling device of the data may be the terminal device or the server described in the above embodiments.
The labeling device of the data may vary considerably depending on configuration or performance, and may include one or more processors 801 and a memory 802, where one or more applications or data may be stored in the memory 802. The memory 802 may be transient storage or persistent storage. The application program stored in the memory 802 may include one or more modules (not shown), and each module may include a series of computer-executable instructions for the labeling device of the data. Still further, the processor 801 may be configured to communicate with the memory 802 and execute the series of computer-executable instructions in the memory 802 on the labeling device of the data. The labeling device of the data may also include one or more power supplies 803, one or more wired or wireless network interfaces 804, one or more input/output interfaces 805, and one or more keyboards 806.
In particular, in this embodiment, the labeling device for data includes a memory, and one or more programs, where the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer executable instructions in the labeling device for data, and configured to be executed by the one or more processors, the one or more programs including computer executable instructions for:
Acquiring prompt information aiming at target data to be marked;
Inputting the prompt information into a pre-trained labeling model, labeling the target data through the labeling model to obtain label information corresponding to the target data, wherein the labeling model is constructed through a pre-trained large model, the labeling model is a model obtained by reinforcement learning of the labeling model through a reward function constructed through output data of a pre-trained reward model, and the reward model is used for measuring the difference between the generated label information and label information expected by a labeling person;
and carrying out model training on a service model in the service corresponding to the target data based on the label information corresponding to the target data.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the labeling device embodiment of the data, since it is substantially similar to the method embodiment, the description is relatively simple, and reference is made to the partial description of the method embodiment for relevant points.
The embodiment of the specification provides a labeling device for data: prompt information for target data to be labeled is acquired and input into a pre-trained labeling model, and the target data is labeled through the labeling model to obtain label information corresponding to the target data. The labeling model is constructed through a pre-trained large model and is obtained by performing reinforcement learning on the labeling model with a reward function constructed from the output data of a pre-trained reward model, where the reward model is used for measuring the difference between the generated label information and the label information expected by the annotator. In this way, a data labeling mode integrating a pre-trained large model with reinforcement learning is provided: by introducing the reward model, the difference between the annotator's understanding of the labeling result and the prediction of the labeling model is converted into explicit data, and the labeling model is directly optimized through reinforcement learning. Through the reinforcement learning training process, the labeling model can be continuously aligned with complex human requirements, which improves the accuracy and robustness of the labeling model and enables it to output more accurate label information; accordingly, not only are the time and human resources consumed by manual labeling reduced, but the quality, reliability and consistency of the labels are also improved. In addition, model training can be performed on the service model in the service corresponding to the target data based on the label information corresponding to the target data; since the obtained label information is more accurate, the accuracy of the service model and of its prediction results can also be improved.
Further, based on the methods shown in fig. 2 to 6, one or more embodiments of the present disclosure further provide a storage medium for storing computer-executable instruction information. In a specific embodiment, the storage medium may be a USB flash disk, an optical disc, a hard disk, or the like, and the computer-executable instruction information stored in the storage medium, when executed by a processor, can implement the following flow:
Acquiring prompt information aiming at target data to be marked;
Inputting the prompt information into a pre-trained labeling model, labeling the target data through the labeling model to obtain label information corresponding to the target data, wherein the labeling model is constructed through a pre-trained large model, the labeling model is a model obtained by reinforcement learning of the labeling model through a reward function constructed through output data of a pre-trained reward model, and the reward model is used for measuring the difference between the generated label information and label information expected by a labeling person;
and carrying out model training on a service model in the service corresponding to the target data based on the label information corresponding to the target data.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for one of the above-described storage medium embodiments, since it is substantially similar to the method embodiment, the description is relatively simple, and reference is made to the description of the method embodiment for relevant points.
The embodiment of the specification provides a storage medium: prompt information for target data to be labeled is acquired and input into a pre-trained labeling model, and the target data is labeled through the labeling model to obtain label information corresponding to the target data. The labeling model is constructed through a pre-trained large model and is obtained by performing reinforcement learning on the labeling model with a reward function constructed from the output data of a pre-trained reward model, where the reward model is used for measuring the difference between the generated label information and the label information expected by the annotator. In this way, a data labeling mode integrating a pre-trained large model with reinforcement learning is provided: by introducing the reward model, the difference between the annotator's understanding of the labeling result and the prediction of the labeling model is converted into explicit data, and the labeling model is directly optimized through reinforcement learning. Through the reinforcement learning training process, the labeling model can be continuously aligned with complex human requirements, which improves the accuracy and robustness of the labeling model and enables it to output more accurate label information; accordingly, not only are the time and human resources consumed by manual labeling reduced, but the quality, reliability and consistency of the labels are also improved. In addition, model training can be performed on the service model in the service corresponding to the target data based on the label information corresponding to the target data; since the obtained label information is more accurate, the accuracy of the service model and of its prediction results can also be improved.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (for example, an improvement to circuit structures such as diodes, transistors and switches) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows can now be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without asking the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using one of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in pure computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing one or more embodiments of the present description.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory, random access memory (RAM) and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the present disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (10)

1. A method of labeling data, the method comprising:
Acquiring prompt information aiming at target data to be marked;
Inputting the prompt information into a pre-trained labeling model, labeling the target data through the labeling model to obtain label information corresponding to the target data, wherein the labeling model is constructed through a pre-trained large model, the labeling model is a model obtained by reinforcement learning of the labeling model through a reward function constructed through output data of a pre-trained reward model, and the reward model is used for measuring the difference between the generated label information and label information expected by a labeling person;
and carrying out model training on a service model in the service corresponding to the target data based on the label information corresponding to the target data.
2. The method of claim 1, wherein the model training the service model in the service corresponding to the target data based on the tag information corresponding to the target data comprises:
Performing model training on a service model in a service corresponding to the target data based on the target data and label information corresponding to the target data; or alternatively
Performing model training on a service model in a service corresponding to the target data based on the prompt information of the target data and the label information corresponding to the target data; or alternatively
And training a service model in a service corresponding to the target data based on the target data, the prompt information of the target data and the label information corresponding to the target data.
3. The method of claim 2, the method further comprising:
Acquiring label sample data aiming at sample data, prediction label information generated for the sample data and rewarding information corresponding to the label sample data;
and carrying out model training on the reward model based on the label sample data, the predicted label information and the reward information corresponding to the label sample data to obtain a trained reward model.
4. The method of claim 2, the method further comprising:
Acquiring a prompt information sample set;
inputting the prompt information samples in the prompt information sample set into the annotation model to obtain prediction tag information corresponding to each prompt information sample;
inputting the prediction label information corresponding to each prompt information sample and the sample label information corresponding to the corresponding prompt information sample into the reward model to obtain reward information corresponding to each prompt information sample;
based on the rewarding information corresponding to each prompt information sample and the sample rewarding information corresponding to the corresponding prompt information sample, the labeling model and the rewarding model are jointly trained through a preset loss function, and a trained rewarding model is obtained.
5. The method of claim 4, the method further comprising:
And taking a function constructed by the output data of the trained reward model as a reward function in reinforcement learning, taking the annotation model as a strategy in reinforcement learning, taking the arrangement combination of preset words at the output position as an action space, and carrying out optimization treatment on the annotation model through reinforcement learning to obtain an optimized annotation model.
6. The method of claim 4, the method further comprising:
acquiring feedback information provided by a label maker aiming at the predictive label information;
And taking a function constructed from the output data of the trained reward model and the feedback information as a reward function in reinforcement learning, taking the labeling model as a strategy in reinforcement learning, taking the arrangement combination of preset words at the output position as an action space, and carrying out optimization treatment on the labeling model through reinforcement learning to obtain the optimized labeling model.
7. The method of claim 4, the method further comprising:
Selecting a preset number of predictive label information from predictive label information corresponding to the prompt information samples contained in the prompt information sample set;
Sending the selected preset number of predictive label information to terminal equipment of the annotator, and receiving feedback information, sent by the terminal equipment, that the annotator provides for the preset number of predictive label information;
And taking a function constructed by the output data of the trained reward model as a reward function in reinforcement learning or taking a function constructed by the output data of the trained reward model and the feedback information as a reward function in reinforcement learning, taking the annotation model as a strategy in reinforcement learning, taking the arrangement combination of preset words at the output position as an action space, and carrying out optimization treatment on the annotation model through reinforcement learning to obtain the optimized annotation model.
8. The method of any of claims 1-7, the target data being unstructured data comprising one or more of image data, text data, speech data, and video data, the annotation model being a model constructed by a large language model.
9. A device for labeling data, the device comprising:
The prompt information acquisition module is used for acquiring prompt information aiming at target data to be marked;
the labeling module is used for inputting the prompt information into a pre-trained labeling model, labeling the target data through the labeling model to obtain label information corresponding to the target data, wherein the labeling model is constructed through a pre-trained large model, the labeling model is a model obtained by performing reinforcement learning on the labeling model through a reward function constructed through output data of a pre-trained reward model, and the reward model is used for measuring the difference between the generated label information and label information expected by a labeling person;
and the application module is used for carrying out model training on a service model in the service corresponding to the target data based on the label information corresponding to the target data.
10. A labeling apparatus for data, the labeling apparatus for data comprising:
A processor; and
A memory arranged to store computer executable instructions that, when executed, cause the processor to:
Acquiring prompt information aiming at target data to be marked;
Inputting the prompt information into a pre-trained labeling model, labeling the target data through the labeling model to obtain label information corresponding to the target data, wherein the labeling model is constructed through a pre-trained large model, the labeling model is a model obtained by reinforcement learning of the labeling model through a reward function constructed through output data of a pre-trained reward model, and the reward model is used for measuring the difference between the generated label information and label information expected by a labeling person;
and carrying out model training on a service model in the service corresponding to the target data based on the label information corresponding to the target data.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination