CN113626566B - Knowledge dialogue cross-domain learning method based on synthetic data - Google Patents

Knowledge dialogue cross-domain learning method based on synthetic data

Info

Publication number
CN113626566B
CN113626566B (Application CN202110763112.2A)
Authority
CN
China
Prior art keywords
dialogue
knowledge
cross
corpus
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110763112.2A
Other languages
Chinese (zh)
Other versions
CN113626566A (en)
Inventor
魏凯敏
林健成
张继连
刘志全
冯丙文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN202110763112.2A
Publication of CN113626566A
Application granted
Publication of CN113626566B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a knowledge dialogue cross-domain learning method based on synthetic data. To address the shortage of data resources in cross-domain learning for knowledge dialogue systems, the method provides the following strategies: for question answering, synthetic data are constructed jointly by templates and a multi-turn dialogue generation model; for catastrophic forgetting, synthetic data are constructed by a knowledge-retention and template method; and to exploit mismatched dialogue corpora, synthetic data are constructed by retrieval, filtering, ranking, and similar methods. A model trained on the synthetic data approaches the performance of a model trained on manually annotated data, effectively relieving the dependence of knowledge dialogue cross-domain learning on data resources.

Description

Knowledge dialogue cross-domain learning method based on synthetic data
Technical Field
The invention relates to the technical field of natural language processing, in particular to a knowledge dialogue cross-domain learning method based on synthetic data.
Background
In the field of dialogue systems, knowledge dialogue is widely used to generate replies that are more informative and more convincing, and it is applicable both to dialogue robots that address users' emotional needs and to dialogue robots that accomplish specific tasks. However, current knowledge dialogue systems share a common problem: because the corpora they are trained on age quickly, they perform poorly when facing new domains. In a new domain, training data are expensive to collect, and often only a little data, or even no data at all, is available. This makes cross-domain learning very difficult for a deployed knowledge dialogue system. According to how knowledge is organized, it is divided into structured and unstructured knowledge; structured knowledge usually exists in the form of knowledge-graph triples, and the invention concerns knowledge dialogue systems that use structured knowledge.
Current research on knowledge dialogue systems focuses on how to better exploit knowledge for dialogue generation within limited domains. How a knowledge dialogue system can be updated online after deployment has received little study, so current applications of knowledge dialogue systems remain rather limited.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a knowledge dialogue cross-domain learning method based on synthetic data.
The aim of the invention can be achieved by adopting the following technical scheme:
the method realizes the cross-domain learning of the structured knowledge dialogue system by constructing the synthetic data aiming at four scenes including question-answering, boring, disastrous forgetting and only having mismatched dialogue corpus, the method aims at the realization process of different scenes as follows,
(1) The steps of cross-domain learning for question and answer scenes are as follows:
s11, manually presetting a template;
s12, for any new domain knowledge triplet (Entity, attr, value), wherein the Entity represents a certain Entity in the real world, such as a specific movie name, attr represents an attribute of the Entity, such as a director, the movie Entity has an actor attribute, and Value is a specific Value of the attribute, and the attribute is substituted into the position of the template to obtain a piece of synthesized data;
s13, repeating the step S11 and the step S12 until the knowledge in the new field is completely constructed into corresponding synthetic data;
(2) The steps of cross-domain learning for the chit-chat scenario are as follows:
S21, pre-train a DialoGPT dialogue model on a large-scale dialogue corpus;
S22, aggregate the new-domain knowledge triples {Entity, Attr, Value} by Entity to obtain a number of groups G;
S23, for each group, use "Why don't you talk with me about {Entity}?" "Sure." as the opening of a multi-turn dialogue;
S24, generate a random number p with 0 ≤ p ≤ 1;
S25, if the random number p generated in step S24 is greater than 0.5, continue the multi-turn dialogue opened in step S23 using the DialoGPT dialogue model;
S26, if the random number p generated in step S24 is less than 0.5, continue the multi-turn dialogue using templates;
S27, repeat steps S24-S26 until every knowledge triple {Entity, Attr, Value} in the group has been covered by the dialogue, generating the corresponding synthetic data;
S28, repeat steps S22-S27 until all knowledge in the new domain is covered;
(3) The steps of cross-domain learning for the catastrophic-forgetting scenario are as follows:
S31, for the N pieces of new-domain knowledge to be learned, construct N pieces of synthetic dialogue data using templates;
S32, randomly sample N pieces of knowledge, together with their dialogue corpus, from the old domain;
S33, for the N pieces of knowledge sampled in step S32, construct N pieces of synthetic dialogue data using templates;
S34, mix the synthetic data constructed in steps S31 and S33 with the dialogue corpus sampled in step S32 to form a new data set;
(4) Synthetic data for the scenario where only mismatched dialogue corpora exist are constructed as follows:
S41, label the dialogue corpus collected from social networks, taking the dialogue context as context and the corresponding dialogue reply as response; then segment the response part into words and build an inverted index with a tool, obtaining a database D;
S42, pre-train a BERT model on the data set constructed in step S41;
S43, segment the N pieces of new-domain knowledge to be learned and the corresponding short sentences with the jieba word segmentation tool; jieba is a Chinese word segmentation tool that is simple to install and can be downloaded from https://github.com/fxsjy/jieba;
S44, search database D using the Value of each knowledge triple (Entity, Attr, Value) in the segmentation result as the keyword, and return the top-50 dialogue contexts by relevance score;
S45, filter the retrieved dialogue contexts;
S46, score the remaining contexts with the BERT model trained in step S42, and combine the highest-scoring context with the corresponding short sentence from step S43 to form the synthetic data.
Further, when the scenario is question answering, the manually preset template is "Do you know the {Attr} of {Entity}?" "Yes, it is {Value}."; here {Entity}, {Attr} and {Value} are the filling positions corresponding to the knowledge triple (Entity, Attr, Value).
Further, when the scenario is chit-chat, the DialoGPT model is trained on the public data set LCCC;
The DialoGPT model comprises 12 Transformer layers, each with 12 attention heads and 768-dimensional hidden states;
The loss function of the DialoGPT model is:
-log P(y_n | y_0, y_1, ..., y_{n-1}, C)
where C is the context in the dialogue corpus, y_0, ..., y_{n-1} are the characters already generated in the reply, and y_n is the character currently to be generated.
Further, for the scenario where only mismatched dialogue corpora exist, the BERT model comprises 12 Transformer layers, each with 12 attention heads and 768-dimensional hidden states;
The loss function of the BERT model is:
-log P(1 | c, r^+) - log P(0 | c, r^-)
where c is the context in the dialogue corpus, r^+ is a positive sample (the original response paired with the context in the corpus), and r^- is a negative sample (a sentence randomly drawn from the dialogue corpus).
Further, when only mismatched dialogue corpora exist, the filtering of the retrieved contexts in step S45 includes sensitive-word filtering and person-name filtering.
Compared with the prior art, the invention has the following advantages and effects:
1. since the method of using synthetic data construction supports online operation, the dialog model can update parameters online. Therefore, the invention can lead the learning dialogue system to still learn the new knowledge after deployment, and offline training and redeployment are not needed.
2. In the dialogue field, traditional data augmentation can only replace some characters in an existing dialogue corpus to diversify its expressions; when the new domain has zero dialogue corpus, it cannot be used at all. Because the synthetic data proposed here depend only on the domain knowledge, not on how much dialogue corpus is already available, the invention suits stricter data environments than traditional data augmentation, for example when the new domain has no knowledge dialogue corpus at all, or only text in non-dialogue form.
3. The invention supports synthesizing the data online while new knowledge is being learned, and the synthetic data are used to update the dialogue model online. They therefore need not be stored after learning, which reduces storage pressure.
Drawings
FIG. 1 is a flow chart of a method of knowledge dialogue cross-domain learning based on synthetic data as disclosed in the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in FIG. 1, learning is first performed on the existing dialogue corpus to obtain an initial version of the knowledge dialogue system. While the system interacts with users online, the real world keeps producing new knowledge, such as newly released movies and freshly published news; the synthetic-data strategy of this embodiment constructs the corresponding synthetic data online and updates the model with it. This embodiment provides a knowledge dialogue cross-domain learning method based on synthetic data that targets four scenarios (question answering, chit-chat, catastrophic forgetting, and the case where only mismatched dialogue corpora are available) and realizes cross-domain learning of a structured knowledge dialogue system by constructing synthetic data.
In this embodiment the model needs no offline update: by constructing synthetic data with the four strategies, it can learn new-domain knowledge online. After learning finishes, the model that has absorbed the new-domain knowledge is saved, and the synthetic data constructed along the way can be discarded, so storage pressure does not grow with the number of new domains. The specific steps are as follows:
(1) The steps of cross-domain learning for the question-answering scenario are as follows:
S11, manually preset a template;
The manually preset template is "Do you know the {Attr} of {Entity}?" "Yes, it is {Value}."; here {Entity}, {Attr} and {Value} are the filling positions corresponding to the knowledge triple (Entity, Attr, Value).
S12, substitute each new-domain knowledge triple (Entity, Attr, Value) into the template slots to obtain one piece of synthetic data;
S13, repeat steps S11 and S12 until all new-domain knowledge has been turned into corresponding synthetic data.
(2) The steps of cross-domain learning for the chit-chat scenario are as follows:
S21, use a Transformer-based dialogue generation model, DialoGPT, pre-trained on a large-scale dialogue corpus;
S22, aggregate the new-domain knowledge triples {Entity, Attr, Value} by Entity to obtain a number of groups G;
S23, for each group, use "Why don't you talk with me about {Entity}?" "Sure." as the opening of a multi-turn dialogue;
S24, generate a random number p with 0 ≤ p ≤ 1;
S25, if the random number p generated in step S24 is greater than 0.5, continue the multi-turn dialogue using the DialoGPT model mentioned in step S21;
S26, if the random number p generated in step S24 is less than 0.5, continue the multi-turn dialogue using templates;
S27, repeat steps S24-S26 until every knowledge triple {Entity, Attr, Value} in the group has been covered by the dialogue;
S28, repeat steps S22-S27 until all knowledge in the new domain is covered.
In this embodiment, the DialoGPT model is trained on the public data set LCCC;
The DialoGPT model comprises 12 Transformer layers, each with 12 attention heads and 768-dimensional hidden states. A head is a component of the Transformer: each head produces an attention distribution over the context that indicates how important each character is to the task, and the hidden state is the output of each Transformer layer;
The loss function of the DialoGPT model mentioned in step S21 is:
-log P(y_n | y_0, y_1, ..., y_{n-1}, C)
where C is the context in the dialogue corpus, y_0, ..., y_{n-1} are the characters already generated in the reply, and y_n is the character currently to be generated.
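This is the standard autoregressive negative log-likelihood. A PyTorch sketch is below; the shift-by-one alignment between logits and targets is the usual language-modeling convention and an assumption about how the reply is encoded:

```python
import torch.nn.functional as F

def dialogpt_loss(logits, reply_ids):
    """Sum of -log P(y_n | y_0..y_{n-1}, C) over the reply tokens.

    logits:    (seq_len, vocab_size), where position t predicts token t+1
               given the dialogue context C and the reply prefix y_0..y_t.
    reply_ids: (seq_len,) integer ids of the reply tokens.
    """
    return F.cross_entropy(logits[:-1], reply_ids[1:], reduction="sum")
```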
(3) The steps of cross-domain learning for the catastrophic-forgetting scenario are as follows:
S31, for the N pieces of new-domain knowledge to be learned, construct N pieces of synthetic dialogue data using templates;
S32, randomly sample N pieces of knowledge, together with their dialogue corpus, from the old domain;
S33, for the N pieces of knowledge sampled in step S32, construct N pieces of synthetic dialogue data using templates;
S34, mix the synthetic data constructed in steps S31 and S33 with the dialogue corpus sampled in step S32 to form a new data set;
(4) The steps of cross-domain learning for the scenario where only mismatched dialogue corpora exist are as follows:
S41, label the dialogue corpus collected from social networks, taking the dialogue context as context and the corresponding dialogue reply as response; then segment the response part into words and build an inverted index with a tool, obtaining a database D;
S42, pre-train a BERT model on the data set constructed in step S41;
S43, segment the N pieces of new-domain knowledge to be learned and the corresponding short sentences with the jieba word segmentation tool (https://github.com/fxsjy/jieba);
S44, search database D using the Value of each knowledge triple (Entity, Attr, Value) in the segmentation result as the keyword, and return the top-50 dialogue contexts by relevance score;
S45, filter the retrieved dialogue contexts;
S46, score the remaining contexts with the BERT model trained in step S42, and combine the highest-scoring context with the corresponding short sentence from step S43 to form the synthetic data.
In this embodiment, the filtering of the retrieved contexts in step S45 includes sensitive-word filtering, person-name filtering, and the like.
The BERT model comprises 12 Transformer layers, each with 12 attention heads and 768-dimensional hidden states. A head is a component of the Transformer: each head produces an attention distribution over the context that indicates how important each character is to the task, and the hidden state is the output of each Transformer layer;
The loss function of the BERT model is:
-log P(1 | c, r^+) - log P(0 | c, r^-)
where c is the context in the dialogue corpus, r^+ is a positive sample (the original response paired with the context in the corpus), and r^- is a negative sample (a sentence randomly drawn from the dialogue corpus).
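The loss treats response selection as binary classification over (context, response) pairs. A PyTorch sketch with an assumed cross-encoder scoring head:

```python
import torch
import torch.nn.functional as F

def bert_selection_loss(score_fn, context, pos_response, neg_response):
    """-log P(1 | c, r+) - log P(0 | c, r-).

    score_fn(c, r) is assumed to return a scalar logit tensor for the
    claim that r is the true response to c (e.g. a linear head on
    BERT's [CLS] vector).
    """
    pos = score_fn(context, pos_response)
    neg = score_fn(context, neg_response)
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```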
After the synthetic data are obtained for the four scenarios, the knowledge dialogue model is trained on them with the Adam optimizer, a learning rate of 0.0001, and a batch size of 1. Once all the synthetic data have been learned, the final model is obtained.
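A sketch of that update schedule; only the optimizer, learning rate, and batch size of 1 come from the text, and `compute_loss` is an assumed task-specific callback:

```python
import torch

def finetune_on_synthetic(model, synthetic_data, compute_loss):
    """One pass over the synthetic data: Adam, lr=0.0001, batch size 1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for example in synthetic_data:       # batch size = 1: one step each
        optimizer.zero_grad()
        loss = compute_loss(model, example)
        loss.backward()
        optimizer.step()
    return model
```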
In this example, all models used are neural networks based on the Transformer structure, which handles long-range dependencies in sequences better than RNN and LSTM models. The Transformer uses residual connections, which better alleviate the vanishing-gradient problem during training.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall fall within the protection scope of the present invention.

Claims (5)

1. A knowledge dialogue cross-domain learning method based on synthetic data, which realizes cross-domain learning of a structured knowledge dialogue system by constructing synthetic data for four scenarios: question answering, chit-chat, catastrophic forgetting, and the case where only mismatched dialogue corpora are available; the method is characterized in that the implementation for each scenario is as follows,
(1) The steps of cross-domain learning for the question-answering scenario are as follows:
S11, manually preset a template;
S12, substitute each new-domain knowledge triple (Entity, Attr, Value) into the template slots to obtain one piece of synthetic data, where Entity denotes a concrete real-world entity, Attr denotes an attribute of the entity, and Value is the concrete value of that attribute;
S13, repeat steps S11 and S12 until all new-domain knowledge has been turned into corresponding synthetic data;
(2) The steps of cross-domain learning for the chit-chat scenario are as follows:
S21, pre-train a DialoGPT dialogue model on a large-scale dialogue corpus;
S22, aggregate the new-domain knowledge triples {Entity, Attr, Value} by Entity to obtain a number of groups G;
S23, for each group, use "Why don't you talk with me about {Entity}?" "Sure." as the opening of a multi-turn dialogue;
S24, generate a random number p with 0 ≤ p ≤ 1;
S25, if the random number p generated in step S24 is greater than 0.5, continue the multi-turn dialogue opened in step S23 using the DialoGPT dialogue model;
S26, if the random number p generated in step S24 is less than 0.5, continue the multi-turn dialogue using templates;
S27, repeat steps S24-S26 until every knowledge triple {Entity, Attr, Value} in the group has been covered by the dialogue, generating the corresponding synthetic data;
S28, repeat steps S22-S27 until all knowledge in the new domain is covered;
(3) The steps of cross-domain learning for the catastrophic-forgetting scenario are as follows:
S31, for the N pieces of new-domain knowledge to be learned, construct N pieces of synthetic dialogue data using templates;
S32, randomly sample N pieces of knowledge, together with their dialogue corpus, from the old domain;
S33, for the N pieces of knowledge sampled in step S32, construct N pieces of synthetic dialogue data using templates;
S34, mix the synthetic data constructed in steps S31 and S33 with the dialogue corpus sampled in step S32 to form a new data set;
(4) Synthetic data for the scenario where only mismatched dialogue corpora exist are constructed as follows:
S41, label the dialogue corpus collected from social networks, taking the dialogue context as context and the corresponding dialogue reply as response; then segment the response part into words and build an inverted index with a tool, obtaining a database D;
S42, pre-train a BERT model on the data set constructed in step S41;
S43, segment the N pieces of new-domain knowledge to be learned and the corresponding short sentences with the jieba word segmentation tool;
S44, search database D using the Value of each knowledge triple (Entity, Attr, Value) in the segmentation result as the keyword, and return the top-50 dialogue contexts by relevance score;
S45, filter the retrieved dialogue contexts;
S46, score the remaining contexts with the BERT model trained in step S42, and combine the highest-scoring context with the corresponding short sentence from step S43 to form the synthetic data.
2. The knowledge dialogue cross-domain learning method based on synthetic data according to claim 1, wherein when the scenario is question answering, the manually preset template is "Do you know the {Attr} of {Entity}?" "Yes, it is {Value}."; here {Entity}, {Attr} and {Value} are the filling positions corresponding to the knowledge triple (Entity, Attr, Value).
3. The knowledge dialogue cross-domain learning method based on synthetic data according to claim 1, wherein when the scenario is chit-chat, the DialoGPT model is trained on the public data set LCCC;
The DialoGPT model comprises 12 Transformer layers, each with 12 attention heads and 768-dimensional hidden states;
The loss function of the DialoGPT model is:
-log P(y_n | y_0, y_1, ..., y_{n-1}, C)
where C is the context in the dialogue corpus, y_0, ..., y_{n-1} are the characters already generated in the reply, and y_n is the character currently to be generated.
4. The knowledge dialogue cross-domain learning method based on synthetic data according to claim 1, wherein, for the scenario where only mismatched dialogue corpora exist, the BERT model comprises 12 Transformer layers, each with 12 attention heads and 768-dimensional hidden states;
The loss function of the BERT model is:
-log P(1 | c, r^+) - log P(0 | c, r^-)
where c is the context in the dialogue corpus, r^+ is a positive sample (the original response paired with the context in the corpus), and r^- is a negative sample (a sentence randomly drawn from the dialogue corpus).
5. The knowledge dialogue cross-domain learning method based on synthetic data according to claim 1, wherein, when only mismatched dialogue corpora exist, the filtering of the retrieved contexts in step S45 includes sensitive-word filtering and person-name filtering.
CN202110763112.2A 2021-07-06 2021-07-06 Knowledge dialogue cross-domain learning method based on synthetic data Active CN113626566B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110763112.2A CN113626566B (en) 2021-07-06 2021-07-06 Knowledge dialogue cross-domain learning method based on synthetic data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110763112.2A CN113626566B (en) 2021-07-06 2021-07-06 Knowledge dialogue cross-domain learning method based on synthetic data

Publications (2)

Publication Number Publication Date
CN113626566A CN113626566A (en) 2021-11-09
CN113626566B (en) 2023-07-18

Family

ID=78379156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110763112.2A Active CN113626566B (en) 2021-07-06 2021-07-06 Knowledge dialogue cross-domain learning method based on synthetic data

Country Status (1)

Country Link
CN (1) CN113626566B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428055A (en) * 2020-04-20 2020-07-17 神思电子技术股份有限公司 Industry-oriented context omission question-answering method
CN111914074A (en) * 2020-07-16 2020-11-10 华中师范大学 Method and system for generating limited field conversation based on deep learning and knowledge graph
CN112287090A (en) * 2020-11-23 2021-01-29 深圳季连科技有限公司 Financial question asking back method and system based on knowledge graph


Also Published As

Publication number Publication date
CN113626566A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN108959627B (en) Question-answer interaction method and system based on intelligent robot
CN115238101B (en) Multi-engine intelligent question-answering system oriented to multi-type knowledge base
CN107943998B (en) Man-machine conversation control system and method based on knowledge graph
CN110825881B (en) Method for establishing electric power knowledge graph
CN107944027B (en) Method and system for creating semantic key index
CN102262634B (en) Automatic questioning and answering method and system
CN112800170A (en) Question matching method and device and question reply method and device
CN108710647B (en) Data processing method and device for chat robot
CN108984778A (en) A kind of intelligent interaction automatically request-answering system and self-teaching method
CN109271459B (en) Chat robot based on Lucene and grammar network and implementation method thereof
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN117370580A (en) Knowledge-graph-based large language model enhanced dual-carbon field service method
CN114153955B (en) Construction method of multi-skill task type dialogue system fusing chatting and common knowledge
CN112506945A (en) Self-adaptive learning guiding method and system based on knowledge graph
CN111428104A (en) Epilepsy auxiliary medical intelligent question-answering method based on viewpoint type reading understanding
CN113065324A (en) Text generation method and device based on structured triples and anchor templates
CN113626566B (en) Knowledge dialogue cross-domain learning method based on synthetic data
Mihaylov et al. A Space Conversational Agent for Retrieving Lessons-learned and Expert Training
CN118093841B (en) Model training method and question-answering method for question-answering system
CN117035064B (en) Combined training method for retrieving enhanced language model and storage medium
Kumar et al. Building conversational Question Answer Machine and comparison of BERT and its different variants
CN116127051B (en) Dialogue generation method based on deep learning, electronic equipment and storage medium
CN112818090B (en) Method and system for generating answer questions and questions based on harmonic words
Szymanski et al. Semantic memory knowledge acquisition through active dialogues
CN112818108B (en) Text semantic misinterpretation chat robot based on shape and near words and data processing method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant