WO2022001232A1 - Question-answer data augmentation method and apparatus, computer device, and storage medium - Google Patents

Question-answer data augmentation method and apparatus, computer device, and storage medium

Info

Publication number
WO2022001232A1
Authority
WO
WIPO (PCT)
Prior art keywords
labeled
labeling
data set
data
question
Prior art date
Application number
PCT/CN2021/082936
Other languages
English (en)
French (fr)
Inventor
谯轶轩
陈浩
高鹏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022001232A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A question-answer data augmentation method and apparatus, a computer device, and a storage medium, relating to artificial-intelligence technology and applied specifically in deep learning. The method includes: acquiring a question-answer data set, the question-answer data set including a plurality of data points and the true label corresponding to each data point (S1); performing, on the basis of a pre-trained prediction model and the true labels, first soft-label prediction on each data point to obtain the first soft label corresponding to each data point (S2); constructing each data point and its corresponding first soft label into a soft-label data set, and generating a labeling model from the soft-label data set and the prediction model using knowledge-distillation technology (S3); and acquiring a data set to be labeled, inputting the data set to be labeled into the labeling model for pre-labeling, and screening the data set to be labeled according to the labeling results to obtain a labeled sample set (S4). The method also involves blockchain technology: the data of the labeled sample set and of the data set to be labeled are stored in a blockchain. The method improves the efficiency and quality of labeling.

Description

QUESTION-ANSWER DATA AUGMENTATION METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
This application claims priority to Chinese patent application No. 202011192632.4, entitled "一种问答数据增强方法、装置、计算机设备及存储介质" (Question-answer data augmentation method and apparatus, computer device, and storage medium) and filed with the Chinese Patent Office on October 30, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of artificial-intelligence technology, and in particular to a question-answer data augmentation method and apparatus, a computer device, and a storage medium.
BACKGROUND
Multimodal learning has been a research hotspot in deep learning in the past two years: cross-modal deep-learning models can be built across any two or more modalities, such as structured data, images, video, speech, and text.
In single-modality domains containing only images or only text, there are large numbers of manually labeled, domain-specific data sets, for example data sets for classification, segmentation, and detection in the image domain, and data sets for sentiment analysis, named-entity recognition, and question answering in the text domain. The prior art mainly starts from an image data set already labeled for a specific task and generates the text corresponding to each label. The inventors realized that the data sets generated by such prior-art schemes cannot cover the full picture of the multimodal data distribution under study.
SUMMARY
This application provides a question-answer data augmentation method and apparatus, a computer device, and a storage medium, so as to solve the prior-art problem that the data set cannot cover the full picture of the multimodal data distribution under study.
To solve the above problem, this application provides a question-answer data augmentation method, including:
acquiring a question-answer data set, the question-answer data set including a plurality of data points and a true label corresponding to each data point;
performing, on the basis of a pre-trained prediction model and the true labels, first soft-label prediction on each data point in the question-answer data set to obtain the first soft label corresponding to each data point in the question-answer data set;
constructing each data point in the question-answer data set and its corresponding first soft label into a soft-label data set, and generating a labeling model from the soft-label data set and the prediction model using knowledge-distillation technology;
acquiring a data set to be labeled, inputting the data set to be labeled into the labeling model for pre-labeling, and screening the data set to be labeled according to the labeling results to obtain a labeled sample set.
To solve the above problem, this application further provides a question-answer data augmentation apparatus, the apparatus including:
an acquisition module, configured to acquire a question-answer data set, the question-answer data set including a plurality of data points and a true label corresponding to each data point;
a prediction module, configured to perform, on the basis of a pre-trained prediction model and the true labels, first soft-label prediction on each data point in the question-answer data set to obtain the first soft label corresponding to each data point in the question-answer data set;
a generation module, configured to construct each data point in the question-answer data set and its corresponding first soft label into a soft-label data set, and to generate a labeling model from the soft-label data set and the prediction model using knowledge-distillation technology;
a screening-prediction module, configured to acquire a data set to be labeled, input the data set to be labeled into the labeling model for pre-labeling, and screen the data set to be labeled according to the labeling results to obtain a labeled sample set.
To solve the above problem, an embodiment of this application further provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
acquiring a question-answer data set, the question-answer data set including a plurality of data points and a true label corresponding to each data point;
performing, on the basis of a pre-trained prediction model and the true labels, first soft-label prediction on each data point in the question-answer data set to obtain the first soft label corresponding to each data point in the question-answer data set;
constructing each data point in the question-answer data set and its corresponding first soft label into a soft-label data set, and generating a labeling model from the soft-label data set and the prediction model using knowledge-distillation technology;
acquiring a data set to be labeled, inputting the data set to be labeled into the labeling model for pre-labeling, and screening the data set to be labeled according to the labeling results to obtain a labeled sample set.
To solve the above problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, cause the processor to perform the following steps:
acquiring a question-answer data set, the question-answer data set including a plurality of data points and a true label corresponding to each data point;
performing, on the basis of a pre-trained prediction model and the true labels, first soft-label prediction on each data point in the question-answer data set to obtain the first soft label corresponding to each data point in the question-answer data set;
constructing each data point in the question-answer data set and its corresponding first soft label into a soft-label data set, and generating a labeling model from the soft-label data set and the prediction model using knowledge-distillation technology;
acquiring a data set to be labeled, inputting the data set to be labeled into the labeling model for pre-labeling, and screening the data set to be labeled according to the labeling results to obtain a labeled sample set.
Compared with the prior art, the question-answer data augmentation method and apparatus, computer device, and storage medium provided by the embodiments of this application have at least the following beneficial effects:
A question-answer data set with pre-set labels is acquired, and a pre-trained prediction model is used to perform first soft-label prediction on each data point in the question-answer data set to obtain the corresponding first soft label; compared with the pre-set true labels, the soft labels generalize better. The data points and their corresponding first soft labels are constructed into a soft-label data set, and a labeling model is generated from the soft-label data set and the prediction model by knowledge distillation; the labeling model is then used to label the data set to be labeled, which is screened according to the labeling results to finally obtain the labeled sample set. The sample set generated by the above steps can cover the full picture of the multimodal data distribution under study and enables comprehensive labeling of unlabeled data sets, improving both the efficiency and the quality of labeling.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the solutions of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is a schematic flowchart of a question-answer data augmentation method provided by an embodiment of this application;
FIG. 2 illustrates the effect of using the prediction model provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of another question-answer data augmentation method provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of a question-answer data augmentation apparatus provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of this application.
DETAILED DESCRIPTION
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit this application. The terms "including" and "having", and any variants thereof, in the specification, claims, and drawing descriptions are intended to cover non-exclusive inclusion. Terms such as "first" and "second" in the specification, claims, or drawings are used to distinguish different objects rather than to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
This application provides a question-answer data augmentation method. Referring to FIG. 1, which is a schematic flowchart of a question-answer data augmentation method provided by an embodiment of this application, in this embodiment the method includes:
S1. Acquire a question-answer data set, the question-answer data set including a plurality of data points and a true label corresponding to each data point.
Specifically, each data point represents a picture and a question, and the true label corresponding to the data point is the label annotated on the picture for that picture and question; the true label is obtained by manually annotating the picture.
The acquired question-answer data set is a data set, published on the official VQA (Visual Question Answering) website, in which labels have already been set for the pictures and questions.
S2. On the basis of a pre-trained prediction model and the true labels, perform first soft-label prediction on each data point in the question-answer data set to obtain the first soft label corresponding to each data point in the question-answer data set.
Specifically, the pre-trained prediction model and the true labels are used to predict a first soft label for each data point, yielding the first soft label corresponding to each data point in the data set. Compared with the true label, a soft label generalizes better: it carries more information, such as information across different categories, which highlights how a label differs from the others.
To some extent the soft label acts as a regularization term, preventing the model from overfitting and stabilizing it.
For example, for data whose true label is [1, 0, 0], one pass of prediction through model T yields the soft label [0.9, 0.05, 0.05]; after several passes of prediction, a soft label with stronger generalization, [0.7, 0.27, 0.03], is obtained.
FIG. 2 shows how the prediction model turns a true label into a soft label.
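For illustration only, the following toy Python snippet shows one way such smoothing can arise: re-applying a temperature-scaled softmax to a label distribution spreads probability mass off the argmax, loosely mirroring the [1, 0, 0] to [0.9, 0.05, 0.05] to [0.7, 0.27, 0.03] progression above. The mechanism is an assumption for demonstration, not the prediction model itself.

```python
import numpy as np

def soften(p, T=2.0):
    """Toy label smoothing: treat log-probabilities as logits and re-apply
    softmax at temperature T. Illustrative only; not the patent's model T."""
    logits = np.log(np.asarray(p, dtype=float) + 1e-8)
    z = np.exp(logits / T)
    return z / z.sum()

p = np.array([1.0, 0.0, 0.0])  # true label
for step in range(3):
    p = soften(p)
    print(step + 1, np.round(p, 3))  # mass gradually spreads off the argmax
```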
Further, the true label corresponding to each data point is input into the prediction model for a first round of first-soft-label prediction, yielding the first-round prediction results;
with each previous round's prediction result as input, the prediction model then performs m rounds of first-soft-label prediction on each data point of the question-answer data set to obtain the first soft label, where m > 1.
Specifically, the true labels corresponding to the data points of a given data set are input into the prediction model for the first round of first-soft-label prediction, yielding the first-round prediction results;
the first-round results are then used as input for a second round of prediction on each data point of that data set, yielding the second-round results; the second-round results are in turn used as input for a third round of prediction, and so on. That is, the prediction model performs multiple rounds of prediction on the true labels, and from the second round onward the input of each round is the prediction result of the previous round, so as to obtain a first soft label with stronger generalization.
The data set mentioned above may be the question-answer data set, or any other data set containing data points and their corresponding true labels; this application uses the question-answer data set.
Through the above steps, multiple rounds of prediction are performed on the true labels corresponding to the data points, producing soft labels with strong generalization.
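A minimal Python sketch of the m-round loop just described follows, assuming a `model(points, labels)` interface that maps data points plus a label distribution to a new label distribution; the exact interface is not fixed by this application, and the toy stand-in model is purely illustrative.

```python
import torch

def multi_round_soft_labels(model, points, labels, m=5):
    """m-round soft-label prediction: round 1 conditions on the true labels,
    and each later round feeds back the previous round's output."""
    preds = model(points, labels)          # round 1
    for _ in range(m - 1):                 # rounds 2..m
        preds = model(points, preds)
    return preds                           # first soft labels

# Toy stand-in for the prediction model: nudges labels toward uniform.
toy_model = lambda pts, lab: 0.8 * lab + 0.2 / lab.shape[-1]
y = torch.tensor([[1.0, 0.0, 0.0]])
print(multi_round_soft_labels(toy_model, None, y, m=3))  # drifts toward uniform each round
```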
S3. Construct each data point in the question-answer data set and its corresponding first soft label into a soft-label data set, and use knowledge-distillation technology to generate a labeling model from the soft-label data set and the prediction model.
Specifically, each data point and its corresponding first soft label are constructed into a soft-label data set, and knowledge distillation is used to distill the prediction model, over the soft-label data set, into a labeling model.
Knowledge distillation transfers the knowledge learned by one complex model, or by several models, into another lightweight model, making the model lightweight with as little performance loss as possible, i.e. easy to deploy and fast at inference. In other words, the labeling model has fewer parameters while its labeling efficiency is improved.
Further, the TextBrewer knowledge-distillation tool is used to generate the labeling model from the soft-label data set and the prediction model.
The advantage of using TextBrewer is that it provides a simple workflow that makes it easy to set up distillation experiments quickly, and it can be flexibly configured and extended as needed.
TextBrewer is a knowledge-distillation tool built on the PyTorch framework by the Harbin Institute of Technology, and performs well for knowledge distillation. The training set (i.e. the soft-label data set of this application) and the weights generated by the prediction model are provided as input, the prediction model and a preset labeling model are initialized, and the labeling model is obtained through TextBrewer; while having fewer parameters, its performance is consistent with that of the prediction model.
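The following is a minimal sketch of this step following TextBrewer's documented GeneralDistiller workflow. The tiny networks stand in for the prediction model (teacher) and the preset labeling model (student), and the synthetic data loader stands in for the soft-label data set; none of these stand-ins come from the patent itself.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

class TinyNet(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.fc, self.out = nn.Linear(16, hidden), nn.Linear(hidden, 3)
    def forward(self, x):
        return self.out(torch.relu(self.fc(x)))  # returns logits

teacher = TinyNet(64)   # stands in for the (pre-trained) prediction model
student = TinyNet(16)   # stands in for the preset, smaller labeling model

loader = DataLoader(TensorDataset(torch.randn(256, 16)), batch_size=32)

def adaptor(batch, model_outputs):
    # Map raw model outputs to the dictionary TextBrewer expects.
    return {"logits": model_outputs}

distiller = GeneralDistiller(
    train_config=TrainingConfig(device="cpu"),
    distill_config=DistillationConfig(temperature=4),
    model_T=teacher, model_S=student,
    adaptor_T=adaptor, adaptor_S=adaptor)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
with distiller:
    distiller.train(optimizer, loader, num_epochs=1, callback=None)
```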
S4. Acquire a data set to be labeled, input the data set to be labeled into the labeling model for pre-labeling, and screen the data set to be labeled according to the labeling results to obtain a labeled sample set.
Specifically, after the labeling model is obtained, the data to be labeled is acquired and pre-labeled with the labeling model, and the data set to be labeled is screened according to the labeling results, finally yielding the labeled sample set.
The data set to be labeled contains only data points; pre-labeling the data points means using them to generate their corresponding soft labels. It differs from the question-answer data set in that the question-answer data set contains data points together with manually annotated true labels, whereas the data set to be labeled contains no labels at all.
Further, acquiring the data set to be labeled includes:
sending a call request to a database, the call request carrying a signature-verification token;
receiving the verification result returned by the database and, when the verification passes, calling the data set to be labeled in the database.
Specifically, for data security, a signature-verification step must be performed by the database when the data set to be labeled is called.
Thus, to obtain the data set to be labeled from the database, a call request carrying a signature-verification token is sent to the database; the database performs the verification step on the token and returns the verification result, and only when verification passes can the data set to be labeled in the database be called.
The database may be a distributed database, i.e. a blockchain.
The above steps ensure the security of the data.
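As a sketch only, the token-verified call could look like the following; the endpoint path, header name, and response shape are illustrative assumptions, since the application does not fix a concrete database interface.

```python
import requests  # hypothetical REST-style access to the database

def fetch_unlabeled_dataset(db_url: str, token: str):
    """Send a call request carrying a signature-verification token and return
    the data set to be labeled only if verification passes."""
    resp = requests.post(f"{db_url}/datasets/unlabeled",
                         headers={"X-Signature-Token": token}, timeout=10)
    if resp.status_code != 200:
        raise PermissionError("signature verification failed; data set not released")
    return resp.json()  # the data points awaiting labels
```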
Screening the data set to be labeled according to the labeling results means setting screening conditions on the labeling results as required; all labeling results that meet the conditions, together with their corresponding data points, finally form the labeled sample set.
It should be emphasized that, to further guarantee the privacy and security of the data, all data of the data set to be labeled and of the labeled sample set may also be stored in the nodes of a blockchain.
Further, S4 specifically includes:
inputting the to-be-labeled data points of the data set to be labeled into the labeling model for pre-labeling to obtain labeling results, and computing the confidence of each labeling result;
comparing the confidence of each labeling result with a first preset value, deleting the labeling results, and their to-be-labeled data points, whose confidence is less than or equal to the first preset value, and forming the labeled sample set from the to-be-labeled data points remaining in the data set to be labeled together with their corresponding labeling results.
Specifically, when the labeling model labels the data to be labeled, it outputs, along with each labeling result, the confidence of that result, and the confidences sum to 1. Pre-labeling with the labeling model yields multiple labeling results, i.e. multiple soft labels, and the model simultaneously outputs the confidence of each; this application directly outputs the labeling result with the highest confidence, together with its confidence.
The confidence of each labeling result is compared with the first preset value; labeling results whose confidence is less than or equal to the first preset value, along with their corresponding to-be-labeled data points, are deleted, and the remaining to-be-labeled data points and their labeling results form the labeled sample set.
The first preset value can be set freely as required; in this application, all labeling results with a confidence greater than 0.9 are retained.
Setting a relatively high preset value in this way ensures the relative reliability of the labeling, reasonably controls the number of model-labeled samples, and facilitates subsequent iterative labeling together with the original samples.
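A minimal sketch of this screening step follows, assuming the labeling model returns one logit vector per data point (an assumption; the application does not fix the output format):

```python
import torch
from torch import nn

@torch.no_grad()
def pre_label_and_screen(label_model, points, threshold=0.9):
    """Pre-label, take the highest-confidence result per point (confidences
    over classes sum to 1), and keep only points whose confidence exceeds
    the first preset value (0.9 in this application)."""
    probs = torch.softmax(label_model(points), dim=-1)
    conf, labels = probs.max(dim=-1)          # top labeling result + confidence
    keep = conf > threshold                   # delete results with conf <= threshold
    return points[keep], labels[keep], conf[keep]

# Toy usage with a random linear "labeling model"; real inputs would be
# picture/question features.
pts = torch.randn(8, 16)
kept, labs, confs = pre_label_and_screen(nn.Linear(16, 3), pts)
```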
Further, after the data set to be labeled is acquired, pre-labeled by the labeling model, and screened according to the labeling results to obtain the labeled sample set, the method further includes:
computing the ratio of the number of data points in the labeled sample set to the number of to-be-labeled data points in the data set to be labeled;
if the ratio is less than a second preset value, combining the labeled sample set with the question-answer data set and retraining the prediction model, until the ratio is greater than or equal to the second preset value.
Specifically, after screening, this ratio is computed to judge how well the labeling model has labeled the data set to be labeled. If the ratio is below the second preset value, the labeled sample set and the question-answer data set are combined and the prediction model is retrained; that is, when the ratio of labeled samples to to-be-labeled data points fails to meet the preset requirement, the prediction model is retrained, soft-label prediction is performed on the data set formed by combining the question-answer data set and the labeled sample set to obtain the soft labels corresponding to the data points and form a soft-label data set, the soft-label data set and the prediction model are again turned into a labeling model by knowledge distillation, the data set to be labeled is labeled and screened once more, and a labeled sample set is obtained again. The ratio of the number of data points in the labeled sample set to the number of to-be-labeled data points is then recomputed, until the ratio exceeds the second preset value.
In other words, the combination of the labeled sample set and the question-answer data set replaces the initial data set, and the above steps are repeated until the ratio of the number of data points in the final labeled sample set to the number of data points in the data set to be labeled is greater than or equal to the second preset value.
The second preset value can be set freely as required; in this application it is 90%.
The above steps guarantee the overall quality of the labeling model's labeling of the data to be labeled.
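The retraining loop can be summarized as the following skeleton, with the three stages passed in as callables; the function names and list-based data handling are placeholders, not part of the original disclosure.

```python
def augment_until_coverage(train_fn, distill_fn, label_fn,
                           qa_set, unlabeled, ratio_min=0.90):
    """Sketch of the iterative retraining described above. `train_fn` pre-trains
    the prediction model, `distill_fn` distills it into a labeling model over
    the soft-label data set, and `label_fn` pre-labels and screens (conf > 0.9)."""
    base = list(qa_set)
    while True:
        predictor = train_fn(base)                      # (re)train prediction model
        labeler = distill_fn(predictor, base)           # soft labels -> labeling model
        labeled = label_fn(labeler, unlabeled)          # pre-label + screen
        if len(labeled) / len(unlabeled) >= ratio_min:  # second preset value: 90%
            return labeled
        base = list(qa_set) + list(labeled)             # combine and iterate
```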
Still further, performing m rounds of first-soft-label prediction on each data point of the question-answer data set with the prediction model to obtain the first soft label specifically includes:
computing a cross-entropy loss from the prediction results of round m and round m-1;
when the loss is less than a third preset value, stopping prediction and outputting the prediction result of round m as the first soft label, where m >= 2.
Specifically, m rounds of prediction are performed on the true labels as described above. As prediction proceeds, the prediction results of each pair of adjacent rounds are used to compute a cross-entropy loss, and when the loss falls below the third preset value, prediction stops and the later of the two adjacent rounds' results is output as the first soft label.
The third preset value is set according to one's own needs. For example, when a first soft label with strong generalization is needed, the third preset value can be set to 0.1, so that labels with high confidence are obtained directly when the data to be labeled is annotated in the subsequent steps; when a first soft label with somewhat weaker generalization suffices, the third preset value can be set to 1, so that the subsequent steps directly yield labels with slightly lower confidence. The third preset value can therefore be set freely as required.
The above steps control the number of prediction rounds over the true labels; the number of predictions can be controlled indirectly as needed, avoiding redundancy in the overall process.
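Extending the earlier multi-round sketch, the stopping rule can be expressed as below; the `model(points, labels)` interface remains an assumption.

```python
import torch

def soft_labels_with_early_stop(model, points, labels, eps=0.1, max_rounds=50):
    """m-round prediction with the stopping rule: halt once the cross-entropy
    between adjacent rounds falls below the third preset value (eps), and
    output the later round's result as the first soft label."""
    prev = model(points, labels)                              # round 1
    for _ in range(max_rounds - 1):
        cur = model(points, prev)                             # round m
        ce = -(prev * torch.log(cur + 1e-8)).sum(-1).mean()   # H(round m-1, round m)
        prev = cur
        if ce < eps:                                          # third preset value
            return cur
    return prev

toy = lambda p, lab: 0.8 * lab + 0.2 / lab.shape[-1]          # illustrative model
print(soft_labels_with_early_stop(toy, None, torch.tensor([[1.0, 0.0, 0.0]]), eps=0.8))
```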
By acquiring a question-answer data set with pre-set labels and using a pre-trained prediction model to perform first soft-label prediction on each of its data points, first soft labels that generalize better than the pre-set true labels are obtained; the data points and their corresponding first soft labels are constructed into a soft-label data set, from which, together with the prediction model, a labeling model is generated by knowledge distillation; the labeling model then labels the data set to be labeled, which is screened according to the labeling results to finally obtain the labeled sample set. The sample set generated by these steps can cover the full picture of the multimodal data distribution under study, i.e. unlabeled data sets can be labeled comprehensively, improving both the efficiency and the quality of labeling.
As shown in FIG. 3, before step S2 the method further includes:
vectorizing the plurality of data points;
obtaining new vector representations of the vectorized data points through interaction processing;
passing the results obtained by linearly transforming the new vector representations through a classification network to obtain second soft labels;
computing a cross-entropy loss from the true label and the second soft label corresponding to each data point, and adjusting the weight parameters of each layer of an initial prediction model on the basis of the cross-entropy loss to obtain the pre-trained prediction model.
Specifically, a data point is a picture or a question, and the true label is the true label corresponding to that picture or question. A picture is turned into its vector representation by the open-source Faster-RCNN model; a question is first embedded with the GloVe word vectors published by Stanford, and its vector representation is then obtained through an LSTM network.
The vector representations of the picture and the question are processed interactively to obtain new vector representations.
The new picture vector representation and the new question vector representation are linearly transformed to obtain h_image and h_question. h_image and h_question are still vector representations of the picture and the question, but differ from the preceding representations; that is, linearly transforming the new picture and question vector representations yields h_image and h_question, which remain vector representations, merely expressed differently.
h_image and h_question are processed by a classification network to finally obtain the soft label y_soft, each dimension of which represents the probability of belonging to the corresponding category:
y_soft = softmax(h_image + h_question),
where h_image + h_question denotes element-wise addition of the vectors.
A cross-entropy loss is computed from the soft label y_soft and the true label y corresponding to the question itself, and the weight parameters of each layer of the initial prediction model are adjusted on the basis of this loss to obtain the pre-trained prediction model:
loss = -Σ_k y_ori(k) · log(y_soft(k)),
where k denotes the k-th dimension of the vector and y_ori is the one-hot encoding of the original label y.
Through the above steps, pre-training of the prediction model is achieved, and the prediction model obtained by these steps labels efficiently and with good quality.
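A minimal sketch of this pre-training forward pass follows. Feature extraction (Faster-RCNN for pictures, GloVe plus LSTM for questions) and the interaction step are stubbed with plain layers, and all dimensions are illustrative assumptions; only the y_soft formula and the cross-entropy objective come from the text above.

```python
import torch
from torch import nn

class PredictionModelSketch(nn.Module):
    def __init__(self, img_dim=2048, q_dim=300, hidden=512, n_classes=1000):
        super().__init__()
        self.lstm = nn.LSTM(q_dim, hidden, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_dim, hidden)             # picture features
        self.interact = nn.Bilinear(hidden, hidden, hidden)    # stand-in interaction
        self.to_h_image = nn.Linear(hidden, n_classes)         # linear transforms
        self.to_h_question = nn.Linear(hidden, n_classes)

    def forward(self, img_feats, q_embeds):
        v = self.img_proj(img_feats)              # picture vector representation
        _, (q, _) = self.lstm(q_embeds)           # question vector representation
        fused = torch.relu(self.interact(v, q.squeeze(0)))  # new vector representation
        h_image, h_question = self.to_h_image(fused), self.to_h_question(fused)
        return torch.softmax(h_image + h_question, dim=-1)  # y_soft

model = PredictionModelSketch()
img = torch.randn(4, 2048)     # e.g. pooled Faster-RCNN features (assumed shape)
qst = torch.randn(4, 12, 300)  # e.g. GloVe embeddings of 12-token questions
y = torch.randint(0, 1000, (4,))
y_soft = model(img, qst)
# Cross-entropy against the one-hot true label: -Σ_k y_ori(k)·log(y_soft(k))
loss = nn.functional.nll_loss(torch.log(y_soft + 1e-8), y)
loss.backward()
```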
To solve the above technical problem, an embodiment of this application further provides a question-answer data augmentation apparatus 100.
As shown in FIG. 4, the question-answer data augmentation apparatus 100 of this application may be installed in an electronic device. Depending on the implemented functions, the apparatus 100 may include an acquisition module 101, a prediction module 102, a generation module 103, and a screening-prediction module 104. A module in this application, which may also be called a unit, refers to a series of computer-readable instruction segments that can be executed by the processor of the electronic device, can perform a fixed function, and are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows:
The acquisition module 101 is configured to acquire a question-answer data set, the question-answer data set including a plurality of data points and a true label corresponding to each data point.
Specifically, each data point represents a picture and a question, and the true label corresponding to the data point is the label annotated on the picture for that picture and question; the true label is obtained by manually annotating the picture.
What the acquisition module 101 acquires is a data set, published on the official VQA (Visual Question Answering) website, in which labels have already been set for the pictures and questions.
The prediction module 102 is configured to perform, on the basis of a pre-trained prediction model and the true labels, first soft-label prediction on each data point in the question-answer data set to obtain the first soft label corresponding to each data point in the question-answer data set.
Specifically, the prediction module 102 uses the pre-trained prediction model and the true labels to predict a first soft label for each data point, yielding the first soft label corresponding to each data point in the data set. Compared with the true label, the soft label generalizes better: it carries more information, such as information across different categories, which highlights how a label differs from the others.
To some extent the soft label acts as a regularization term, preventing the model from overfitting and stabilizing it.
For example, for data whose true label is [1, 0, 0], one pass of prediction through model T yields the soft label [0.9, 0.05, 0.05], and several passes yield the better-generalizing soft label [0.7, 0.27, 0.03].
Further, the prediction module 102 includes a first-round prediction submodule and a multi-round prediction submodule.
The first-round prediction submodule is configured to input the true labels corresponding to the data points into the prediction model for the first round of first-soft-label prediction, yielding the first-round prediction results;
the multi-round prediction submodule is configured to take each previous round's prediction result as input and use the prediction model to perform m rounds of first-soft-label prediction on each data point of the question-answer data set to obtain the first soft label, where m > 1.
Specifically, the first-round prediction submodule inputs the true labels corresponding to the data points of a given data set into the prediction model for the first round of first-soft-label prediction, yielding the first-round prediction results;
the multi-round prediction submodule then takes the first-round results as input for a second round of prediction on each data point of that data set, yielding the second-round results, which in turn serve as input for a third round of prediction, and so on. That is, the prediction model performs multiple rounds of prediction on the true labels, and from the second round onward the input of each round is the prediction result of the previous round, so as to obtain a first soft label with stronger generalization.
In other words, the prediction model predicts on the true labels for multiple rounds, each round's input being the previous round's prediction result, yielding soft labels with stronger generalization.
The data set mentioned above may be the question-answer data set, or any other data set containing data points and their corresponding true labels; this application uses the question-answer data set.
Through the first-round prediction submodule and the multi-round prediction submodule, multiple rounds of prediction are performed on the true labels corresponding to the data points, producing soft labels with strong generalization.
Still further, the multi-round prediction submodule includes a judgment unit and a soft-label output unit.
The judgment unit computes a cross-entropy loss from the prediction results of round m and round m-1;
the soft-label output unit is configured to stop prediction when the loss is less than the third preset value and to output the round-m result as the first soft label, where m >= 2.
Specifically, the judgment unit takes the prediction results of each pair of adjacent rounds to compute a cross-entropy loss, and the soft-label output unit stops prediction when the loss falls below the third preset value and outputs the later of the two adjacent rounds' results as the first soft label.
The third preset value is set according to one's own needs: for example, 0.1 when a first soft label with strong generalization is needed, so that high-confidence labels are obtained directly in the subsequent labeling steps, or 1 when somewhat weaker generalization suffices, so that labels with slightly lower confidence are obtained directly. The third preset value can therefore be set freely as required.
Through the judgment unit and the soft-label output unit, the number of prediction rounds over the true labels is controlled; the number of predictions can be controlled indirectly as needed, avoiding redundancy in the overall process.
The generation module 103 is configured to construct each data point in the question-answer data set and its corresponding first soft label into a soft-label data set, and to generate a labeling model from the soft-label data set and the prediction model using knowledge-distillation technology.
Specifically, the generation module 103 constructs each data point and its corresponding first soft label into a soft-label data set, and uses knowledge distillation to distill the prediction model, over the soft-label data set, into a labeling model.
Further, the generation module 103 includes a TextBrewer submodule.
Specifically, the TextBrewer submodule uses the TextBrewer knowledge-distillation tool to generate the labeling model from the soft-label data set and the prediction model.
The advantage of the TextBrewer submodule is that it provides a simple workflow that makes it easy to set up distillation experiments quickly, and it can be flexibly configured and extended as needed.
The TextBrewer submodule relies on the knowledge-distillation tool built on the PyTorch framework by the Harbin Institute of Technology, which performs well for knowledge distillation. By inputting the training set (i.e. the soft-label data set of this application) and the weights generated by the prediction model, initializing the prediction model, and initializing a preset labeling model, the labeling model is obtained through TextBrewer; while having fewer parameters, its performance is consistent with that of the prediction model.
筛选预测模块104,用于获取待标签数据集,将所述待标签数据集输入到所述标注模型进行预标注,并根据标注结果对所述待标签数据集进行筛选,得到标注样本集。
具体的,筛选预测模块104在得到标注模型后,将获取待标签数据并利用所述标注模型来对所述待标签数据进行预标注,并根据标注结果对待标签数据集进行筛选,最终得到标注样本集。
所述根据标注结果对待标签数据集进行筛选为根据需求对标注结果设定筛选条件,对所有符合筛选条件的标注结果及其对应的数据点,来最终组成标注样本集。
所述待标签数据集为其只包含了数据点,对数据点进行预标注即利用数据点来生成其对应的软标签;与所述问答数据集的不同在于,问答数据集包含了数据点以及针对数据点进行了人工标注的真实标签,而待标签数据集不含任何标签。
需要强调的是,为了进一步保证数据的私密性和安全性,待标签数据集和标注样本集的所有数据还可以存储于一区块链的节点中。
进一步的,筛选预测模块104包括获取子模块;
获取子模块向数据库发送调用请求,所述调用请求携带验签令牌;
接收所述数据库返回的验签结果,并在验签结果为通过时,调用所述数据库中的所述待标签数据集。
具体的,获取子模块向数据库发送调用请求,其中调用请求中其携带有验签令牌;数据库将对令牌进行验签步骤,并返回验签结果,只有在验签结果通过时,才能调用所述数据库中的待标签数据集。
进一步的,筛选预测模块104包括置信度输出子模块和置信度判断子模块;
置信度输出子模块将所述待标签数据集中的待标签数据点输入到所述标注模型进行预标注得到标注结果,并计算每一个所述标注结果的置信度大小;
置信度判断子模块将所述标注结果的置信度大小与第一预设数值进行比较,删除置信度小于等于第一预设数值的所述标注结果和所述待标签数据点,并将所述待标签数据集中剩余的所述待标签数据点及其对应的所述标注结果组成所述标注样本集
具体的,置信度输出子模块在标注模型在对所述待标签数据进行标注时,在输出标注结果的同时,还将输出该标注结果对应的置信度大小,总的置信度大小的和为1,利用标注模型进行预标注,将会得到多个标注结果,即多个软标签,并且标注模型还会同时输出多个标注结果对应的,置信度大小,并且本申请是直接输出的置信度最大的标注结果及其对应的置信度大小。
置信度判断子模块将所述标注结果的置信度大小与第一预设数值进行比较,并删除置信度小于等于第一预设数值的所述标注结果和待标签数据点,并将待标签数据集中剩余的所述所述待标签数据点和对应标注结果组成所述标注样本集。
所述第一预设数值也将根据需要可以自由设定,在本申请中保留的都是置信度大于0.9的标注结果。
通过置信度输出子模块和置信度判断子模块配合,设定的较高预设数值确保了标注的相对可靠性,合理控制了利用模型标注样本的数量,并且便于后续同原始样本进行迭代标注。
Further, the filtering prediction module 104 includes a ratio calculation sub-module and a ratio judgment sub-module.
Specifically, after the filtering, the ratio calculation sub-module calculates the ratio of the number of data points in the labeled sample set to the number of to-be-labeled data points in the to-be-labeled data set, and the ratio judgment sub-module uses this ratio to judge the quality of the labeling model's labeling of the to-be-labeled data set. If the ratio is less than a second preset value, the labeled sample set and the question and answer data set are combined and the prediction model is retrained. That is, when the ratio does not meet the preset requirement, the prediction model is retrained: soft label prediction is performed on the combined question and answer data set and labeled sample set to obtain the soft labels of the data points, which form a new soft label data set; knowledge distillation then turns the soft label data set and the prediction model into a new labeling model; the to-be-labeled data set is labeled and filtered again, producing a new labeled sample set. The ratio of the number of data points in the labeled sample set to the number of to-be-labeled data points is then recomputed, until the ratio is greater than or equal to the second preset value.
In other words, the ratio judgment sub-module substitutes the combination of the labeled sample set and the question and answer data set for the initial data set and repeats the above steps, until the ratio of the number of data points in the resulting labeled sample set to the number of data points in the to-be-labeled data set is greater than or equal to the second preset value.
The second preset value can be set freely as required; in the present application, the second preset value is 90%.
The ratio calculation sub-module and the ratio judgment sub-module guarantee the overall quality of the labeling model's labeling of the to-be-labeled data.
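Abstracted to its control flow, the iteration can be sketched as below. Here `train_labeling_model` and `pre_label_and_filter` are placeholders standing for the full prediction-plus-distillation pipeline and the confidence filter sketched above, not functions defined by the application:

```python
def iterate_until_quality(qa_set, unlabeled_set, train_labeling_model,
                          pre_label_and_filter, second_preset=0.90):
    """Repeat train -> distill -> pre-label -> filter until the labeled sample
    set reaches the preset fraction of the to-be-labeled data set."""
    labeling_model = train_labeling_model(qa_set)
    labeled_sample_set = pre_label_and_filter(labeling_model, unlabeled_set)
    while len(labeled_sample_set) / len(unlabeled_set) < second_preset:
        combined = qa_set + labeled_sample_set        # combine and retrain
        labeling_model = train_labeling_model(combined)
        labeled_sample_set = pre_label_and_filter(labeling_model, unlabeled_set)
    return labeled_sample_set
```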
By employing the above apparatus, in which the acquisition module, prediction module, generation module and filtering prediction module work in concert, the full picture of the multimodal data distribution to be studied is covered; that is, an unlabeled data set can be comprehensively labeled, improving both the efficiency and the quality of labeling.
The apparatus further includes a pre-training module.
The pre-training module is configured to vectorize the plurality of data points;
obtain new vector representations from the vectorized data points through interaction processing;
pass the result of a linear transformation of the new vector representations through a classification network to obtain a second soft label;
and calculate a cross-entropy loss function from the true label corresponding to each data point and the second soft label, adjusting the weight parameters of each layer of the initial prediction model based on the cross-entropy loss function to obtain the pre-trained prediction model.
Specifically, a data point consists of an image and a question, and the true label is the true label corresponding to that image-question pair. The pre-training module obtains the vector representation of the image via the open-source Faster-RCNN model; the question is first embedded using Stanford's publicly released GloVe word vectors and then passed through an LSTM network to obtain its vector representation.
The vector representations of the image and the question are combined through interaction processing to obtain new vector representations.
The new image vector representation and the new question vector representation are linearly transformed to obtain h_image and h_question; these are still vector representations of the image and the question, only different from the preceding ones.
Passing h_image and h_question through the classification network finally yields the soft label y_soft, in which each dimension represents the probability of belonging to the corresponding class:
y_soft = softmax(h_image + h_question),
where h_image + h_question denotes element-wise addition of the vectors.
The cross-entropy loss function is then calculated from the soft label y_soft and the true label y of the question itself, and the weight parameters of each layer of the initial prediction model are adjusted based on this loss to obtain the pre-trained prediction model:
L = -∑_k y_ori[k] · log(y_soft[k]),
where k denotes the k-th dimension of the vector and y_ori is the one-hot encoded vector of the original label y.
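The forward pass and loss can be sketched as follows. The random stand-in vectors replace the Faster-RCNN image features and the GloVe+LSTM question features, and the gated form of the interaction step is an assumption, since the application does not spell that step out:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 10, 512
# Stand-ins for the Faster-RCNN image vector and the GloVe+LSTM question vector.
v_image, v_question = torch.randn(1, dim), torch.randn(1, dim)

# Interaction processing (assumed gated form): each modality modulates the other.
new_image = v_image * torch.sigmoid(v_question)
new_question = v_question * torch.sigmoid(v_image)

# Linear transforms giving h_image and h_question.
lin_image, lin_question = nn.Linear(dim, num_classes), nn.Linear(dim, num_classes)
h_image, h_question = lin_image(new_image), lin_question(new_question)

# Classification: y_soft = softmax(h_image + h_question), element-wise sum.
y_soft = F.softmax(h_image + h_question, dim=-1)

# Cross-entropy against the one-hot encoded original label y (class 3 as an example).
y_ori = F.one_hot(torch.tensor([3]), num_classes).float()
loss = -(y_ori * torch.log(y_soft + 1e-12)).sum()
loss.backward()   # gradients used to adjust each layer's weight parameters
```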
Through the pre-training module, the prediction model is pre-trained, and the prediction model obtained by the above steps labels data efficiently and with good quality.
To solve the above technical problem, an embodiment of the present application further provides a computer device. Refer to FIG. 5, which is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 includes a memory 41, a processor 42 and a network interface 43 communicatively connected to one another through a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The computer device may interact with the user through a keyboard, a mouse, a remote control, a touchpad, a voice-control device or other means of human-machine interaction.
The memory 41 includes at least one type of readable storage medium, the readable storage medium including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card or flash card provided on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as the computer-readable instructions of the question and answer data enhancement method. In addition, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is used to run the computer-readable instructions or process data stored in the memory 41, for example to run the computer-readable instructions of the question and answer data enhancement method.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.
In this embodiment, when the processor executes the computer-readable instructions stored in the memory, the steps of the question and answer data enhancement method of the above embodiments are implemented: a labeled question and answer data set is acquired, and a pre-trained prediction model is used to perform first soft label prediction on each data point of the question and answer data set to obtain the corresponding first soft labels, which generalize better than the preset true labels; the data points and their corresponding first soft labels are constructed into a soft label data set, and the soft label data set and the prediction model are turned into a labeling model through knowledge distillation; the labeling model is then used to label the to-be-labeled data set, which is filtered according to the labeling results to finally obtain the labeled sample set. The sample set generated by these steps can cover the full picture of the multimodal data distribution to be studied; that is, an unlabeled data set can be comprehensively labeled, improving both the efficiency and the quality of labeling.
The present application further provides another implementation, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so as to cause the at least one processor to perform the steps of the question and answer data enhancement method described above, with the same flow and the same effects as set out in the preceding paragraph. The computer-readable storage medium may be non-volatile or volatile.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device or the like) to execute the methods described in the various embodiments of the present application.
The blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association by cryptographic methods, each data block containing the information of a batch of network transactions and being used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Obviously, the embodiments described above are only some, not all, of the embodiments of the present application; the accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be implemented in many different forms; rather, these embodiments are provided so that the disclosure of the present application will be understood more thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing specific implementations or make equivalent substitutions for some of the technical features therein. Any equivalent structure made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. A question and answer data enhancement method, the method comprising:
    acquiring a question and answer data set, the question and answer data set comprising a plurality of data points and a true label corresponding to each data point;
    performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, to obtain a first soft label corresponding to each data point in the question and answer data set;
    constructing each data point in the question and answer data set and its corresponding first soft label into a soft label data set, and generating a labeling model from the soft label data set and the prediction model by using a knowledge distillation technique;
    acquiring a to-be-labeled data set, inputting the to-be-labeled data set into the labeling model for pre-labeling, and filtering the to-be-labeled data set according to the labeling results, to obtain a labeled sample set.
  2. The question and answer data enhancement method according to claim 1, wherein the acquiring a to-be-labeled data set comprises:
    sending a call request to a database, the call request carrying a signature verification token;
    receiving a signature verification result returned by the database, and calling the to-be-labeled data set in the database when the signature verification result is passed.
  3. The question and answer data enhancement method according to claim 1, wherein the inputting the to-be-labeled data set into the labeling model for pre-labeling and filtering the to-be-labeled data set according to the labeling results to obtain a labeled sample set comprises:
    inputting the to-be-labeled data points in the to-be-labeled data set into the labeling model for pre-labeling to obtain labeling results, and calculating a confidence of each of the labeling results;
    comparing the confidence of each labeling result with a first preset value, deleting the labeling results whose confidence is less than or equal to the first preset value together with the corresponding to-be-labeled data points, and composing the labeled sample set from the to-be-labeled data points remaining in the to-be-labeled data set and their corresponding labeling results.
  4. The question and answer data enhancement method according to claim 1, wherein, after the acquiring a to-be-labeled data set, inputting the to-be-labeled data set into the labeling model for pre-labeling, and filtering the to-be-labeled data set according to the labeling results to obtain a labeled sample set, the method further comprises:
    calculating a ratio of the number of data points in the labeled sample set to the number of to-be-labeled data points in the to-be-labeled data set;
    if the ratio is less than a second preset value, combining the labeled sample set and the question and answer data set and retraining the prediction model, until the ratio is greater than or equal to the second preset value.
  5. The question and answer data enhancement method according to claim 1, wherein the performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, to obtain a first soft label corresponding to each data point in the question and answer data set comprises:
    inputting the true label corresponding to each data point into the prediction model for a first round of prediction of the first soft label, to obtain a first-round prediction result;
    taking the prediction result of the previous round as input, performing m rounds of prediction of the first soft label on each data point of the question and answer data set by using the prediction model, to obtain the first soft label, where m>1.
  6. The question and answer data enhancement method according to claim 5, wherein the performing m rounds of prediction of the first soft label on each data point of the question and answer data set by using the prediction model, to obtain the first soft label comprises:
    calculating a cross-entropy loss function according to the prediction results of the m-th round and the (m-1)-th round;
    when the loss function is less than a third preset value, stopping the prediction and outputting the prediction result of the m-th round as the first soft label, where m≥2.
  7. The question and answer data enhancement method according to any one of claims 1 to 6, wherein, before the performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, the method further comprises:
    vectorizing the plurality of data points;
    obtaining new vector representations from the vectorized data points through interaction processing;
    passing the result obtained by linearly transforming the new vector representations through a classification network to obtain a second soft label;
    calculating a cross-entropy loss function according to the true label corresponding to each data point and the second soft label, and adjusting the weight parameters of each layer of an initial prediction model based on the cross-entropy loss function, to obtain the pre-trained prediction model.
  8. A question and answer data enhancement apparatus, comprising:
    an acquisition module, configured to acquire a question and answer data set, the question and answer data set comprising a plurality of data points and a true label corresponding to each data point;
    a prediction module, configured to perform, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, to obtain a first soft label corresponding to each data point in the question and answer data set;
    a generation module, configured to construct each data point in the question and answer data set and its corresponding first soft label into a soft label data set, and to generate a labeling model from the soft label data set and the prediction model by using a knowledge distillation technique;
    a filtering prediction module, configured to acquire a to-be-labeled data set, input the to-be-labeled data set into the labeling model for pre-labeling, and filter the to-be-labeled data set according to the labeling results, to obtain a labeled sample set.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring a question and answer data set, the question and answer data set comprising a plurality of data points and a true label corresponding to each data point;
    performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, to obtain a first soft label corresponding to each data point in the question and answer data set;
    constructing each data point in the question and answer data set and its corresponding first soft label into a soft label data set, and generating a labeling model from the soft label data set and the prediction model by using a knowledge distillation technique;
    acquiring a to-be-labeled data set, inputting the to-be-labeled data set into the labeling model for pre-labeling, and filtering the to-be-labeled data set according to the labeling results, to obtain a labeled sample set.
  10. The computer device according to claim 9, wherein the inputting the to-be-labeled data set into the labeling model for pre-labeling and filtering the to-be-labeled data set according to the labeling results to obtain a labeled sample set comprises:
    inputting the to-be-labeled data points in the to-be-labeled data set into the labeling model for pre-labeling to obtain labeling results, and calculating a confidence of each of the labeling results;
    comparing the confidence of each labeling result with a first preset value, deleting the labeling results whose confidence is less than or equal to the first preset value together with the corresponding to-be-labeled data points, and composing the labeled sample set from the to-be-labeled data points remaining in the to-be-labeled data set and their corresponding labeling results.
  11. The computer device according to claim 9, wherein, after the acquiring a to-be-labeled data set, inputting the to-be-labeled data set into the labeling model for pre-labeling, and filtering the to-be-labeled data set according to the labeling results to obtain a labeled sample set, the following steps are further implemented:
    calculating a ratio of the number of data points in the labeled sample set to the number of to-be-labeled data points in the to-be-labeled data set;
    if the ratio is less than a second preset value, combining the labeled sample set and the question and answer data set and retraining the prediction model, until the ratio is greater than or equal to the second preset value.
  12. The computer device according to claim 9, wherein the performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, to obtain a first soft label corresponding to each data point in the question and answer data set comprises:
    inputting the true label corresponding to each data point into the prediction model for a first round of prediction of the first soft label, to obtain a first-round prediction result;
    taking the prediction result of the previous round as input, performing m rounds of prediction of the first soft label on each data point of the question and answer data set by using the prediction model, to obtain the first soft label, where m>1.
  13. The computer device according to claim 12, wherein the performing m rounds of prediction of the first soft label on each data point of the question and answer data set by using the prediction model, to obtain the first soft label comprises:
    calculating a cross-entropy loss function according to the prediction results of the m-th round and the (m-1)-th round;
    when the loss function is less than a third preset value, stopping the prediction and outputting the prediction result of the m-th round as the first soft label, where m≥2.
  14. The computer device according to any one of claims 9 to 13, wherein, before the performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, the following steps are further implemented:
    vectorizing the plurality of data points;
    obtaining new vector representations from the vectorized data points through interaction processing;
    passing the result obtained by linearly transforming the new vector representations through a classification network to obtain a second soft label;
    calculating a cross-entropy loss function according to the true label corresponding to each data point and the second soft label, and adjusting the weight parameters of each layer of an initial prediction model based on the cross-entropy loss function, to obtain the pre-trained prediction model.
  15. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, cause the processor to perform the following steps:
    acquiring a question and answer data set, the question and answer data set comprising a plurality of data points and a true label corresponding to each data point;
    performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, to obtain a first soft label corresponding to each data point in the question and answer data set;
    constructing each data point in the question and answer data set and its corresponding first soft label into a soft label data set, and generating a labeling model from the soft label data set and the prediction model by using a knowledge distillation technique;
    acquiring a to-be-labeled data set, inputting the to-be-labeled data set into the labeling model for pre-labeling, and filtering the to-be-labeled data set according to the labeling results, to obtain a labeled sample set.
  16. The computer-readable storage medium according to claim 15, wherein the inputting the to-be-labeled data set into the labeling model for pre-labeling and filtering the to-be-labeled data set according to the labeling results to obtain a labeled sample set comprises:
    inputting the to-be-labeled data points in the to-be-labeled data set into the labeling model for pre-labeling to obtain labeling results, and calculating a confidence of each of the labeling results;
    comparing the confidence of each labeling result with a first preset value, deleting the labeling results whose confidence is less than or equal to the first preset value together with the corresponding to-be-labeled data points, and composing the labeled sample set from the to-be-labeled data points remaining in the to-be-labeled data set and their corresponding labeling results.
  17. The computer-readable storage medium according to claim 15, wherein, after the acquiring a to-be-labeled data set, inputting the to-be-labeled data set into the labeling model for pre-labeling, and filtering the to-be-labeled data set according to the labeling results to obtain a labeled sample set, the following steps are further performed:
    calculating a ratio of the number of data points in the labeled sample set to the number of to-be-labeled data points in the to-be-labeled data set;
    if the ratio is less than a second preset value, combining the labeled sample set and the question and answer data set and retraining the prediction model, until the ratio is greater than or equal to the second preset value.
  18. The computer-readable storage medium according to claim 15, wherein the performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, to obtain a first soft label corresponding to each data point in the question and answer data set comprises:
    inputting the true label corresponding to each data point into the prediction model for a first round of prediction of the first soft label, to obtain a first-round prediction result;
    taking the prediction result of the previous round as input, performing m rounds of prediction of the first soft label on each data point of the question and answer data set by using the prediction model, to obtain the first soft label, where m>1.
  19. The computer-readable storage medium according to claim 18, wherein the performing m rounds of prediction of the first soft label on each data point of the question and answer data set by using the prediction model, to obtain the first soft label comprises:
    calculating a cross-entropy loss function according to the prediction results of the m-th round and the (m-1)-th round;
    when the loss function is less than a third preset value, stopping the prediction and outputting the prediction result of the m-th round as the first soft label, where m≥2.
  20. The computer-readable storage medium according to any one of claims 15 to 19, wherein, before the performing, based on a pre-trained prediction model and the true labels, first soft label prediction on each data point in the question and answer data set, the following steps are further performed:
    vectorizing the plurality of data points;
    obtaining new vector representations from the vectorized data points through interaction processing;
    passing the result obtained by linearly transforming the new vector representations through a classification network to obtain a second soft label;
    calculating a cross-entropy loss function according to the true label corresponding to each data point and the second soft label, and adjusting the weight parameters of each layer of an initial prediction model based on the cross-entropy loss function, to obtain the pre-trained prediction model.
Kind code of ref document: A1