CN110929476B - Task type multi-round dialogue model construction method based on mixed granularity attention mechanism - Google Patents

Task type multi-round dialogue model construction method based on mixed granularity attention mechanism

Info

Publication number
CN110929476B
Authority
CN
China
Prior art keywords
word
decoder
layer
granularity
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910929777.9A
Other languages
Chinese (zh)
Other versions
CN110929476A (en)
Inventor
仇婕
王鹏
马婷婷
窦海波
高玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unit 63626 of PLA
Original Assignee
Unit 63626 of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unit 63626 of PLA
Priority to CN201910929777.9A
Publication of CN110929476A
Application granted
Publication of CN110929476B
Active legal status
Anticipated expiration of legal status

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a task-oriented multi-turn dialogue model construction method based on a mixed-granularity attention mechanism, comprising the following steps. S1: preprocess the text by word segmentation, stop-word removal, word-vector encoding and the like; S2: use an input encoder to encode the converted high-dimensional vectors into sentence vectors that memorize the details of the dialogue; S3: encode the sentence vectors with a context encoder; S4: combine the output of the context encoding layer with sentence-granularity attention to obtain the context encoding; S5: feed the output of step S4 into the first layer of the output decoder, which performs the decoding; S6: calculate the word-granularity attention value; S7: combine the output of the first decoder layer with the word-granularity attention value calculated in step S6, map the output generated by the decoder to the dimension of the vocabulary size, and output the result. On real data sets the method greatly improves the accuracy of reply generation in multi-turn dialogue tasks.

Description

Task type multi-round dialogue model construction method based on mixed granularity attention mechanism
Technical Field
The invention relates to the field of natural language processing, in particular to a task type multi-round dialogue model construction method based on a mixed granularity attention mechanism.
Background
A task-oriented multi-turn dialogue system in natural language processing fulfils user requirements in specific domains, with typical applications such as helping users navigate, find products, check the weather, schedule appointments and book flights; it is an important way of realizing human-computer interaction. Such a system can markedly reduce labor costs and improve service efficiency, offers a more convenient and natural way for people to obtain information, and therefore has clear practical value and important application prospects.
Research on task-oriented multi-turn dialogue systems in recent years falls into two categories: traditional pipeline-based systems and end-to-end systems. A traditional pipeline system solves each module, from input to output, with a different method and model. This modular approach works well in practice, but it faces several challenges: first, because the modules are independent of one another, information from lower layers is difficult to feed back to upper modules; second, traditional pipeline systems scale poorly to new domains; third, most of them require large amounts of manually annotated, task-specific corpora. In recent years many researchers have therefore applied sequence-to-sequence models to task-oriented multi-turn dialogue to obtain end-to-end solutions. Although end-to-end dialogue systems are more extensible and more resistant to error propagation than traditional pipeline systems, the standard sequence-to-sequence model they adopt cannot model historical dialogue information well, and in a dialogue system constructing context information is critical. To make the sentences generated by end-to-end systems better fit the characteristics of multi-turn dialogue, a number of solutions have been proposed from different perspectives. Some researchers introduce a knowledge base into the dialogue system: Wen et al. (2014) built an end-to-end trainable dialogue system on this idea. Although this reduces manual intervention to some extent, it requires structured knowledge-base data for the relevant professional domain, which is hard to obtain and usually needs analysis by domain experts. Mou et al. (2016) use statistical methods to compute the topic words that should appear in the response, which reduces meaningless replies, but for a multi-turn dialogue system a single topic word is clearly insufficient at the semantic level. Serban et al. (2017) proposed the Hierarchical Recurrent Encoder-Decoder (HRED) model, whose core idea is to add a context encoder on top of the standard sequence-to-sequence model to encode the dialogue history. For modeling multi-turn dialogue this hierarchical structure has advantages over conventional sequence-to-sequence models: during back-propagation, the context vector of a flat sequence-to-sequence structure is gradually diluted by the information of new utterances, whereas HRED, with its separate context encoder that models the dialogue history from a global perspective, captures semantic information better. Serban et al. (2017) further proposed the VHRED model, which introduces Gaussian random variables into the context information on top of HRED to increase the diversity of replies.
Disclosure of Invention
The invention provides a task-oriented multi-turn dialogue model construction method based on a mixed-granularity attention mechanism. Based on the hierarchical structure of multi-turn dialogue, which consists of sentences and words, the invention designs a mixed attention mechanism over the sentence granularity and the word granularity of the model. Sentence-granularity attention focuses on global information such as the overall context and intention of the multi-turn dialogue, while word-granularity attention focuses on details; combining the two extracts more effective context information at different levels, so that the generated replies are more meaningful. In addition, the invention draws on five existing key modeling and training techniques. To make the model more accurate, a multi-layer network structure is adopted, which in turn raises the risk of vanishing or exploding gradients, so the model introduces a residual connection in every sub-layer. Because the data set is of limited size, a complex model tends to overfit; a dropout mechanism and label smoothing are combined to control overfitting from different angles. To make training of the deep network more stable, layer normalization is applied to the input of every layer of the model. Considering the diversity of responses in a multi-turn dialogue system, beam search is used to find the most probable response sequence. These five techniques are combined organically and further tuned to obtain a new hybrid model, greatly improving the accuracy of reply generation of the end-to-end task-oriented dialogue model.
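For illustration only, a minimal PyTorch-style sketch of one such sub-layer wrapper is given below; it combines the residual connection, layer normalization and dropout described above. It is not the patented implementation, and the module name, argument names and default dropout probability are assumptions made for the example.

import torch
import torch.nn as nn

class ResidualSubLayer(nn.Module):
    """Wraps an arbitrary sub-layer with layer norm, dropout and a residual connection."""
    def __init__(self, hidden_size: int, dropout: float = 0.2):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)   # layer normalization applied to the sub-layer input
        self.dropout = nn.Dropout(dropout)      # dropout applied to the sub-layer output

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        # residual connection: output = x + Dropout(SubLayer(LayerNorm(x)))
        return x + self.dropout(sublayer(self.norm(x)))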
A task type multi-round dialogue model construction method based on a mixed granularity attention mechanism comprises the following steps:
S1: the input natural text X_1, X_2, ..., X_N is put through a series of natural language processing steps such as stop-word removal and word embedding, so that each word in a sentence is converted into a fixed-length vector representation;
S2: an input encoder encodes the converted high-dimensional vectors into sentence vectors E_1, ..., E_N that memorize the details of the dialogue, where M denotes the number of layers of the encoder and decoder;
S3: the sentence vectors serve as the input at each time step of the context encoder, which encodes them: (h_1, ..., h_t) = RNNContextEncoder(E_1, ..., E_N);
S4: the output of the context encoding layer is combined with a sentence-granularity attention mechanism to obtain the context encoding;
S5: the output of step S4 serves as the input to the first layer of the output decoder, which decodes it: D_1 = RNNDecoder_1(v);
S6: calculating an attention value of word granularity;
S7: the output of the first decoder layer is combined with the word-granularity attention value calculated in step S6, decoding proceeds step by step until the decoder generates the terminator, the output generated by the decoder is mapped to the dimension of the vocabulary size, and the result is output.
Further, the specific process of step S4 is:
S4.1: introduce a sentence vector u_s, initialized by random assignment; apply a nonlinear transformation to h_i to obtain u_i:
u_i = tanh(W_s h_i + b_s)
S4.2: compute the similarity between u_i and u_s as a weight, and obtain the normalized weight α_i after softmax:
α_i = exp(u_i^T u_s) / Σ_j exp(u_j^T u_s)
S4.3: take the weighted average of the h_i to obtain the final context vector v:
v = Σ_i α_i h_i
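As an illustration of steps S4.1-S4.3, a minimal PyTorch-style sketch of the sentence-granularity attention follows. It assumes the context-encoder states h_1, ..., h_N are stacked into a single tensor; the class and parameter names are assumptions for the example, not taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceAttention(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)     # W_s and b_s of step S4.1
        self.u_s = nn.Parameter(torch.randn(hidden_size))   # randomly initialized sentence vector u_s

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, hidden) stacked context-encoder states h_1..h_N
        u = torch.tanh(self.proj(h))                         # S4.1: u_i = tanh(W_s h_i + b_s)
        alpha = F.softmax(u @ self.u_s, dim=0)               # S4.2: normalized weights alpha_i
        return (alpha.unsqueeze(-1) * h).sum(dim=0)          # S4.3: v = sum_i alpha_i * h_i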
Further, the specific process of step S6 is:
S6.1: to obtain information from different subspaces, D_1 and E_N are first passed through different linear transformations to obtain (Query_1, Value_1, Key_1), ..., (Query_N, Value_N, Key_N). With the three trainable weight matrices W_i^Q, W_i^K and W_i^V of the i-th linear transformation, the i-th set of values is computed as:
Query_i = D_1 · W_i^Q
Key_i = E_N · W_i^K
Value_i = E_N · W_i^V
S6.2: the scaled dot product is then computed, where dim is the dimension of Key_i; the i-th value is:
Att_i = softmax(Query_i · Key_i^T / √dim) · Value_i
S6.3: finally, the N values computed in step S6.2 are concatenated, and a simple linear transformation yields the multi-head attention value of the desired dimension:
mulAttention = concat(Att_1, ..., Att_N) · W_out
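A minimal PyTorch-style sketch of the word-granularity multi-head attention of step S6 follows, assuming the first decoder layer output D_1 supplies the queries and the encoder word states supply the keys and values; the head count, tensor shapes and module names are illustrative assumptions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGranularityAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.dim = hidden_size // num_heads
        self.num_heads = num_heads
        # S6.1: per-head linear transformations for queries, keys and values
        self.w_q = nn.Linear(hidden_size, hidden_size)
        self.w_k = nn.Linear(hidden_size, hidden_size)
        self.w_v = nn.Linear(hidden_size, hidden_size)
        self.w_out = nn.Linear(hidden_size, hidden_size)     # W_out of S6.3

    def forward(self, d1: torch.Tensor, e_n: torch.Tensor) -> torch.Tensor:
        # d1: (T_dec, hidden) first decoder layer output; e_n: (T_enc, hidden) encoder word states
        def split(x):
            return x.view(x.size(0), self.num_heads, self.dim).transpose(0, 1)
        q, k, v = split(self.w_q(d1)), split(self.w_k(e_n)), split(self.w_v(e_n))
        # S6.2: scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.dim)   # (heads, T_dec, T_enc)
        att = F.softmax(scores, dim=-1) @ v                      # (heads, T_dec, dim)
        # S6.3: concatenate the heads and apply a final linear map
        att = att.transpose(0, 1).contiguous().view(d1.size(0), -1)
        return self.w_out(att)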
Further, the specific process of step S7 is:
S7.1: initialize the decoding input d_0 = d_initial;
S7.2: decoding proceeds step by step through the second to last layers of the decoder, where L_maxsize denotes the maximum length of the generated reply and the decoding step j satisfies 0 < j ≤ L_maxsize (the layer-by-layer decoding formula appears only as an image in the original);
S7.3: map the output generated by the decoder to the dimension of the vocabulary size through a linear transformation;
S7.4: obtain the distribution of the step-j output over the vocabulary through softmax normalization;
S7.5: find the vocabulary ID corresponding to the word with the highest probability at each step;
S7.6: convert the word IDs into a readable character string;
S7.7: when the decoder generates the terminator, decoding stops and the concatenated words form the reply of round N+1:
Y = join(y_1, y_2, ..., y_end)
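A minimal sketch of the output stage of step S7 follows: the decoder output is projected to the vocabulary size, normalized, the most probable word ID is taken at each step, and the IDs are joined into the reply once the terminator appears. The projection layer, vocabulary mapping and end-of-sequence ID are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_to_reply(decoder_outputs: torch.Tensor,
                    out_proj: nn.Linear,
                    id2word: dict,
                    eos_id: int) -> str:
    logits = out_proj(decoder_outputs)        # S7.3: (T, hidden) -> (T, |V|)
    probs = F.softmax(logits, dim=-1)         # S7.4: distribution over the vocabulary at each step
    ids = probs.argmax(dim=-1).tolist()       # S7.5: most probable word ID per step
    words = []
    for i in ids:                             # S7.6-S7.7: convert IDs and stop at the terminator
        if i == eos_id:
            break
        words.append(id2word[i])
    return "".join(words)                     # "".join for Chinese text; " ".join would suit English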
The end-to-end task oriented dialogue model construction method based on the mixed attention mechanism has the following advantages:
1. The invention uses a mixed-granularity attention mechanism: sentence-granularity attention focuses on global information such as the overall context and intention of the multi-turn dialogue, while word-granularity attention attends to finer details; combining the two extracts more effective context information at different levels, so that the generated replies are more meaningful.
2. The end-to-end task-oriented model provided by the invention needs only raw text data and can automatically assemble a relatively large training set through simple preprocessing, ensuring the data-scaling capability of the system. This effectively alleviates the problems that task-oriented multi-turn dialogue systems lack data and that most training sets require expensive manual annotation.
3. The invention is evaluated on two real data sets, the Jingdong (JD) customer-service data set and the Ubuntu Dialogue Corpus, and the experimental results are superior to previous end-to-end dialogue system models, demonstrating the effectiveness of the model in practical application scenarios. In addition, thanks to its end-to-end structure and the absence of any manual-annotation requirement, the model is easy to deploy and highly transferable, overcoming the shortcomings of traditional task-oriented multi-turn dialogue models.
Drawings
FIG. 1 is a schematic diagram of a task-based multi-turn dialogue model construction process based on a mixed-granularity attention mechanism according to the present invention;
FIG. 2 is a schematic overall structure diagram of an embodiment of the present invention;
FIG. 3 is a block diagram of the model computation in a three-turn dialogue scenario in accordance with an embodiment of the present invention;
FIG. 4 compares the experimental results of seq2seq, HRED, VHRED and the model of the present invention on the JD customer-service dataset and the Ubuntu Dialogue Corpus dataset, respectively;
FIG. 5 shows sample multi-turn dialogue replies used to qualitatively analyze and compare the characteristics of the replies generated by the various methods in a multi-turn dialogue scenario;
FIG. 6 illustrates the effect of various optimization techniques on model performance;
FIG. 7 shows the effect of beam size on model performance.
Detailed Description
Specific embodiments of the present invention are described below in conjunction with the accompanying drawings so that those skilled in the art can better understand the present invention.
An end-to-end task oriented dialogue model construction method based on a mixed attention mechanism comprises the following steps:
S1: the Jingdong dialogue data set and the Ubuntu Dialogue Corpus multi-turn dialogue data set are taken as the data sets of this example. The input natural text X_1, X_2, ..., X_N is preprocessed by stop-word removal, word embedding and the like (the Jingdong dialogue data set additionally requires jieba word segmentation), and each word of the input text is converted into a high-dimensional vector. A dictionary with a vocabulary of 21,000 entries is built. The word vectors for the Jingdong customer-service data set are initialized with wiki-encyclopedia pre-trained word vectors, those for the Ubuntu Dialogue Corpus with Google News training data, and the word-vector dimension is set to 300. Each word in a sentence is thus converted into a fixed-length vector representation.
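A minimal sketch of this preprocessing for the Chinese data set follows, assuming jieba segmentation, a stop-word list and a pre-built word-to-ID vocabulary of about 21,000 entries; the function and variable names are illustrative.

import jieba

def preprocess(utterance: str, stopwords: set, word2id: dict, unk_id: int = 0) -> list:
    # word segmentation followed by stop-word removal
    tokens = [t for t in jieba.cut(utterance) if t not in stopwords]
    # map each remaining word to its vocabulary ID
    return [word2id.get(t, unk_id) for t in tokens]

# The resulting ID sequence indexes an embedding table, e.g. nn.Embedding(21000, 300),
# to obtain the fixed-length 300-dimensional word vectors described above.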
S2: the converted high-dimensional vectors are encoded into sentence vectors E_1, ..., E_N by the input encoder, which memorizes the details of the conversation. The encoder and decoder adopt a six-layer structure, and the context encoder has a single layer. The encoder, the context encoder and the decoder all use Gated Recurrent Units (GRUs).
S3: the sentence vectors serve as the input at each time step of the context encoder, which encodes them:
(h_1, ..., h_t) = GRUContextEncoder(E_1, ..., E_N)
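A minimal PyTorch-style sketch of steps S2-S3 follows: a GRU utterance encoder produces one sentence vector per utterance, and a single-layer GRU context encoder consumes the sequence of sentence vectors. The hidden size and the exact way the sentence vector is read out are assumptions for the example.

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, embed_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.utt_encoder = nn.GRU(embed_dim, hidden, num_layers=6, batch_first=True)  # S2: six-layer GRU input encoder
        self.ctx_encoder = nn.GRU(hidden, hidden, num_layers=1, batch_first=True)     # S3: single-layer GRU context encoder

    def forward(self, utterances: list) -> torch.Tensor:
        # utterances: list of N tensors, each of shape (1, T_n, embed_dim)
        sentence_vecs = []
        for u in utterances:
            _, h_n = self.utt_encoder(u)        # encode one utterance
            sentence_vecs.append(h_n[-1])       # final state of the last layer as the sentence vector E_n
        e = torch.stack(sentence_vecs, dim=1)   # (1, N, hidden)
        h, _ = self.ctx_encoder(e)              # dialogue-history states h_1..h_t
        return h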
S4: the output of the context encoding layer is combined with the sentence-granularity attention mechanism to obtain the context encoding;
the specific steps of step S4 are:
S4.1: introduce a sentence vector u_s, initialized by random assignment; apply a nonlinear transformation to h_i to obtain u_i, with the formula:
u_i = tanh(W_s h_i + b_s)
S4.2: compute the similarity between u_i and u_s as a weight, and obtain the normalized weight α_i after softmax:
α_i = exp(u_i^T u_s) / Σ_j exp(u_j^T u_s)
S4.3: take the weighted average of the h_i to obtain the final context vector v:
v = Σ_i α_i h_i
S5: the output of step S4 is used as the input of the first layer of the output decoder, which carries out the decoding:
D_1 = GRUDecoder_1(v)
S6: calculate the word-granularity attention value;
the specific step of step S6 is:
S6.1: to obtain information from different subspaces, D_1 and E_N are first passed through different linear transformations to obtain (Query_1, Value_1, Key_1), ..., (Query_N, Value_N, Key_N). Three different weight matrices are trained for the i-th linear transformation, and the formulas for the i-th computation are:
Query_i = D_1 · W_i^Q
Key_i = E_N · W_i^K
Value_i = E_N · W_i^V
S6.2: the scaled dot product is then computed, where dim is the dimension of Key_i; the i-th value is:
Att_i = softmax(Query_i · Key_i^T / √dim) · Value_i
S6.3: finally, the N values computed in step S6.2 are concatenated, and a simple linear transformation yields the multi-head attention value of the desired dimension:
mulAttention = concat(Att_1, ..., Att_N) · W_out
S7: the output of the first decoder layer is combined with the word-granularity attention value calculated in step S6, decoding proceeds step by step until the decoder generates the terminator, the output generated by the decoder is mapped to the dimension of the vocabulary size, and the result is output.
The specific step of step S7 is:
S7.1: initialize the decoding input d_0 = d_initial;
S7.2: decoding proceeds step by step through the second to sixth layers of the decoder, where L_maxsize denotes the maximum length of the generated reply and the decoding step j satisfies 0 < j ≤ L_maxsize (the layer-by-layer decoding formula appears only as an image in the original);
S7.3: map the output generated by the decoder to the dimension of the vocabulary size through a linear transformation;
S7.4: obtain the distribution of the step-j output over the vocabulary through softmax normalization;
S7.5: find the vocabulary ID corresponding to the word with the highest probability at each step;
S7.6: convert the word IDs into a readable character string;
S7.7: when the decoder generates the terminator, decoding stops and the concatenated words form the reply of round N+1:
Y = join(y_1, y_2, ..., y_end)
Under this framework the five key modeling and training techniques are applied: the model is optimized with Adam at a learning rate of 0.0001; a dropout probability of 0.2 is applied to the output of each sub-layer; residual connections and layer normalization are introduced; the beam size is 4; and uniform label smoothing with uncertainty ε = 0.1 is applied to the cross-entropy computed on the output.
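A minimal PyTorch-style sketch of this training configuration follows, using Adam with learning rate 1e-4 and label-smoothed cross-entropy with ε = 0.1; the placeholder model and vocabulary size are illustrative assumptions.

import torch
import torch.nn as nn

vocab_size = 21000
model = nn.Linear(512, vocab_size)                       # placeholder standing in for the full dialogue model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)     # uniform label smoothing, epsilon = 0.1

def train_step(hidden_states: torch.Tensor, target_ids: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = model(hidden_states)                         # (T, |V|) logits over the vocabulary
    loss = criterion(logits, target_ids)                  # label-smoothed cross-entropy
    loss.backward()
    optimizer.step()
    return loss.item()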
To verify the effectiveness of the invention, comparative experiments were performed on two data sets: the Jingdong multi-turn dialogue data set and the Ubuntu Dialogue Corpus multi-turn dialogue data set. The Jingdong dialogue data set, provided by the 2018 JD Dialog Challenge, is in Chinese and contains real dialogues between JD customers and JD human customer-service agents; it comprises 110,000 multi-turn dialogues with an average of 13 turns per dialogue. The Ubuntu Dialogue Corpus is a public English data set of multi-turn technical-support dialogues about Ubuntu-related problems; it comprises roughly one million dialogues with an average of 8 turns per dialogue.
This embodiment is compared on the test set with advanced end-to-end approaches published in recent years. The task is to generate a reply for a given dialogue segment. The deltaBLEU metric is adopted to evaluate the quality of the generated replies. The baseline models are a sequence-to-sequence model (seq2seq), the Hierarchical Recurrent Encoder-Decoder (HRED), and its variant VHRED.
FIG. 4 compares the experimental results of seq2seq, HRED, VHRED and the model of the invention on the JD customer-service data set and the Ubuntu Dialogue Corpus. For multi-turn dialogue, the model of the invention achieves the best results on both data sets; for single-turn dialogue the gap to the other methods narrows. The proposed model is therefore most advantageous in multi-turn dialogue scenarios.
Through the multi-turn dialogue reply samples in FIG. 5, the characteristics of the replies generated by the various methods in a multi-turn dialogue scenario are analyzed and compared qualitatively. Overall, given the context of a multi-turn conversation, the model of the invention creates more consistent and accurate responses and reduces the generation of meaningless replies. The context encoders of the baseline HRED and VHRED models can also exploit context information to some extent, but not as effectively as the present model. FIGS. 6 and 7 explore the effect of the various optimization techniques on model performance and training. Each technique is deleted individually, the resulting model is compared against the original model, and the results are observed; in FIG. 6, "-A" denotes deleting technique A, and "-" denotes that the training run was unstable. The comparison for the mixed attention mechanism shows that it greatly improves the quality of the replies generated by the model. The results for label smoothing, dropout and beam search show that all three have a positive influence on the model and slightly raise deltaBLEU; for beam search, a larger beam size gives better performance. The comparison for layer normalization shows that it is crucial for stabilizing training: without it, training is not stable enough and the training parameters would have to be re-tuned, so its influence cannot be quantified and compared. The comparison for residual connections shows that when the depth reaches six layers, the model without residual connections is far worse than even a single-layer model, because the multi-layer network suffers from vanishing and exploding gradients, which residual connections largely resolve. The experimental results show that these key modeling and training techniques all benefit the model to varying degrees.
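For reference, a minimal sketch of beam search as used for reply generation follows, assuming a step function that returns log-probabilities over the vocabulary for the next word given a partial hypothesis; the beam width of 4 matches the setting above, and everything else is an assumption for the example.

def beam_search(step_fn, bos_id: int, eos_id: int, beam_size: int = 4, max_len: int = 50):
    # each hypothesis is a pair (token_ids, cumulative log-probability)
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            if ids[-1] == eos_id:                  # finished hypotheses are carried over unchanged
                candidates.append((ids, score))
                continue
            log_probs = step_fn(ids)               # mapping: next token id -> log-probability
            for tok, lp in log_probs.items():
                candidates.append((ids + [tok], score + lp))
        # keep only the beam_size best partial replies
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(ids[-1] == eos_id for ids, _ in beams):
            break
    return beams[0][0]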
The above quantitative and qualitative analyses show that the model proposed by the invention makes better use of context information, is well suited to multi-turn dialogue scenarios, and outperforms previous end-to-end models.

Claims (4)

1. A task-type multi-round dialogue model construction method based on a mixed-granularity attention mechanism is characterized by comprising the following steps of:
S1: the input natural text X_1, X_2, ..., X_N is put through a series of natural language processing steps such as stop-word removal and word embedding, so that each word in a sentence is converted into a fixed-length vector representation;
S2: an input encoder encodes the converted high-dimensional vectors into sentence vectors E_1, ..., E_N that memorize the details of the dialogue, where M denotes the number of layers of the encoder and decoder;
S3: the sentence vectors serve as the input at each time step of the context encoder, which encodes them: (h_1, ..., h_t) = RNNContextEncoder(E_1, ..., E_N);
S4: the output of the context encoding layer is combined with a sentence-granularity attention mechanism to obtain the context encoding;
S5: the output of step S4 serves as the input to the first layer of the output decoder, which decodes it: D_1 = RNNDecoder_1(v);
S6: calculating an attention value of word granularity;
S7: the output of the first decoder layer is combined with the word-granularity attention value calculated in step S6, decoding proceeds step by step until the decoder generates the terminator, the output generated by the decoder is mapped to the dimension of the vocabulary size, and the result is output.
2. The task-based multi-turn dialogue model construction method based on the mixed-granularity attention mechanism according to claim 1, wherein the specific process of step S4 is as follows:
S4.1: introduce a sentence vector u_s, initialized by random assignment; apply a nonlinear transformation to h_i to obtain u_i:
u_i = tanh(W_s h_i + b_s);
S4.2: compute the similarity between u_i and u_s as a weight, and obtain the normalized weight α_i after softmax:
α_i = exp(u_i^T u_s) / Σ_j exp(u_j^T u_s);
S4.3: take the weighted average of the h_i to obtain the final context vector v,
v = Σ_i α_i h_i
3. the task-based multi-turn dialogue model construction method based on the mixed-granularity attention mechanism according to claim 2, wherein the specific process of step S6 is as follows:
S6.1: to obtain information from different subspaces, D_1 and E_N are first passed through different linear transformations to obtain (Query_1, Value_1, Key_1), ..., (Query_N, Value_N, Key_N); with the trainable weight matrices W_i^Q, W_i^K and W_i^V of the i-th linear transformation, the i-th set of values is computed as:
Query_i = D_1 · W_i^Q
Key_i = E_N · W_i^K
Value_i = E_N · W_i^V
S6.2: the scaled dot product is then computed, where dim is the dimension of Key_i; the i-th value is:
Att_i = softmax(Query_i · Key_i^T / √dim) · Value_i
S6.3: finally, the N values computed in step S6.2 are concatenated, and a simple linear transformation yields the multi-head attention value of the desired dimension,
mulAttention = concat(Att_1, ..., Att_N) · W_out
4. the method according to claim 3, wherein the specific process of step S7 is as follows:
S7.1: initialize the decoding input d_0 = d_initial;
S7.2: decoding proceeds step by step through the second to last layers of the decoder, where L_maxsize denotes the maximum length of the generated reply and the decoding step j satisfies 0 < j ≤ L_maxsize (the layer-by-layer decoding formula appears only as an image in the original);
S7.3: map the output generated by the decoder to the dimension of the vocabulary size through a linear transformation;
S7.4: obtain the distribution of the step-j output over the vocabulary through normalization;
S7.5: find the vocabulary ID corresponding to the word with the highest probability at each step;
S7.6: convert the word IDs into a readable character string;
S7.7: when the decoder generates the terminator, decoding stops and the concatenated words form the reply of round N+1, Y = join(y_1, y_2, ..., y_end).
CN201910929777.9A 2019-09-27 2019-09-27 Task type multi-round dialogue model construction method based on mixed granularity attention mechanism Active CN110929476B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910929777.9A CN110929476B (en) 2019-09-27 2019-09-27 Task type multi-round dialogue model construction method based on mixed granularity attention mechanism

Publications (2)

Publication Number Publication Date
CN110929476A CN110929476A (en) 2020-03-27
CN110929476B true CN110929476B (en) 2022-09-30

Family

ID=69849047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910929777.9A Active CN110929476B (en) 2019-09-27 2019-09-27 Task type multi-round dialogue model construction method based on mixed granularity attention mechanism

Country Status (1)

Country Link
CN (1) CN110929476B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163080A (en) * 2020-10-12 2021-01-01 辽宁工程技术大学 Generation type dialogue system based on multi-round emotion analysis
CN112417125B (en) * 2020-12-01 2023-03-24 南开大学 Open domain dialogue reply method and system based on deep reinforcement learning
CN113868395A (en) * 2021-10-11 2021-12-31 北京明略软件系统有限公司 Multi-round dialogue generation type model establishing method and system, electronic equipment and medium
CN114357129B (en) * 2021-12-07 2023-02-14 华南理工大学 High-concurrency multi-round chat robot system and data processing method thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763504A (en) * 2018-05-30 2018-11-06 浙江大学 It is a kind of that generation method and system are replied based on the dialogue for strengthening binary channels Sequence Learning
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于分层编码的深度增强学习对话生成 (Dialogue generation via deep reinforcement learning based on hierarchical encoding); 赵宇晴 et al.; 《计算机应用》 (Journal of Computer Applications); 2017-10-10 (No. 10); full text *

Also Published As

Publication number Publication date
CN110929476A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN110929476B (en) Task type multi-round dialogue model construction method based on mixed granularity attention mechanism
US11194972B1 (en) Semantic sentiment analysis method fusing in-depth features and time sequence models
CN108153913B (en) Training method of reply information generation model, reply information generation method and device
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN113158665A (en) Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN112989796B (en) Text naming entity information identification method based on syntactic guidance
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN114116994A (en) Welcome robot dialogue method
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN113536804B (en) Natural language feature extraction method based on keyword enhancement GRU and Kronecker
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
Huang et al. End-to-end sequence labeling via convolutional recurrent neural network with a connectionist temporal classification layer
CN113392265A (en) Multimedia processing method, device and equipment
CN113515619A (en) Keyword generation method based on significance information gating mechanism
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN115545033A (en) Chinese field text named entity recognition method fusing vocabulary category representation
CN115512195A (en) Image description method based on multi-interaction information fusion
CN114387537A (en) Video question-answering method based on description text
CN117332789A (en) Semantic analysis method and system for dialogue scene
CN113239663B (en) Multi-meaning word Chinese entity relation identification method based on Hopkinson
CN117235261A (en) Multi-modal aspect-level emotion analysis method, device, equipment and storage medium
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN110738989A (en) method for solving automatic recognition task of location-based voice by using end-to-end network learning of multiple language models
CN114896969A (en) Method for extracting aspect words based on deep learning
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant