CN109308316B - Adaptive dialog generation system based on topic clustering - Google Patents

Adaptive dialog generation system based on topic clustering

Info

Publication number
CN109308316B
CN109308316B (application CN201810823424.6A)
Authority
CN
China
Prior art keywords
module
clustering
dialogue data
seq2seq
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810823424.6A
Other languages
Chinese (zh)
Other versions
CN109308316A (en)
Inventor
蔡毅
任达
闵华清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810823424.6A priority Critical patent/CN109308316B/en
Publication of CN109308316A publication Critical patent/CN109308316A/en
Application granted granted Critical
Publication of CN109308316B publication Critical patent/CN109308316B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an adaptive dialogue generation system based on topic clustering, which comprises a dialogue data module, a vectorization module, a clustering module and a Seq2Seq module. The dialogue data module is used for constructing a dialogue data set before training; the vectorization module is used for vectorizing the dialogue data set before clustering, so that the vectorized dialogue data set serves as the input of the clustering model and becomes the clustering basis; the clustering module is used for clustering the vectorized dialogue data set into a plurality of clusters; and the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies for the dialogue data in the clusters obtained by the clustering module. The system can cluster the dialogue data according to topic and train the dialogue data of each category with a dedicated Seq2Seq model. A classical Seq2Seq model tends to generate meaningless replies; the model provided by the invention enables the dialogue system to generate replies that better conform to the topic and are more meaningful. Such replies make the user more willing to communicate with the dialogue system and improve the user experience.

Description

Adaptive dialog generation system based on topic clustering
Technical Field
The invention relates to the field of dialog generation, and in particular to an adaptive dialog generation system based on topic clustering.
Background
Currently, with the development of artificial intelligence, dialog systems are receiving more and more attention. In the Turing test, the dialog system serves as an index for judging whether a computer is intelligent. Products with dialog systems, such as Apple's Siri and Microsoft's Cortana, have already entered people's lives. The application and popularity of dialog systems allow people to interact with computers through natural language, making human-computer communication more natural.
With the development of deep learning, many key technologies, such as image recognition, speech recognition and machine translation, have made considerable progress. In solving speech recognition and machine translation problems, the Sequence-to-Sequence (Seq2Seq) model is commonly used. Unlike other models, the Seq2Seq model can accept a sequence of indefinite length as input and output a sequence of indefinite length. This property is important because in many problems, such as machine translation and speech recognition, the lengths of the input and output cannot be known in advance. In recent years, researchers have also attempted to apply the Seq2Seq model to dialog generation tasks, since dialog generation can likewise be viewed as a sequence-to-sequence task. On many data sets, the Seq2Seq model achieves better results than dialog generation with the traditional N-Gram model.
However, applying the Seq2Seq model directly to dialog tends to make the model generate meaningless replies such as "good" or "no". Such replies occur in large numbers in dialog data sets, so the model is prone to producing them during learning. Although such a reply is relatively safe, it makes the dialog between the computer and the user very brief, and it is difficult for the user to continue communicating with the computer on the basis of these responses. Many researchers have proposed improved models to try to alleviate this problem. However, existing Seq2Seq-based dialog generation systems usually train all the dialog data with a single Seq2Seq model. Different conversations contain different topics and therefore have different characteristics, which are often difficult for a single Seq2Seq model to capture well.
Disclosure of Invention
The invention aims to provide an adaptive dialog generation system based on topic clustering, which can better capture the characteristics of different topics and generate replies related to those topics.
The purpose of the invention can be realized by the following technical scheme:
a self-adaptive dialog generation system based on topic clustering comprises a dialog data module, a vectorization module, a clustering module and a Seq2Seq module;
the dialogue data module is used for constructing a dialogue data set before training;
the vectorization module is used for vectorizing the dialogue data set before clustering and taking the vectorized dialogue data set as the input of a clustering model to become a clustering basis;
the clustering module is used for clustering the vectorized dialogue data set into clusters;
and the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies to the dialogue data sets in the clusters obtained by the clustering module.
In the Seq2Seq module, the dialogue data is divided into a plurality of clusters according to the clustering module, a Seq2Seq model is constructed for each cluster to train the input dialogue data, and a dialogue reply is generated.
Specifically, in the vectorization module, a bag-of-words model is constructed using tf-idf as the weight calculation scheme, and vectorization is performed on the dialogue data set output by the dialogue data module.
Before clustering with the clustering module, the dialogue data set needs to be vectorized, and the bag-of-words model is a relatively common way of vectorizing text in natural language processing. When the dialogue data set is converted with the bag-of-words model, each word is uniquely fixed at a certain position in the vector; for a given dialogue sample, if a word appears, the corresponding position is set to a non-zero number, and if it does not appear, the corresponding position is set to 0. How to choose the non-zero number then becomes very important. A simpler way is to set these numbers to 1, but the importance of different words is then not distinguished, so tf-idf is used as the weighting scheme in the clustering module.
As a weighting scheme, tf-idf is used to evaluate how important a word is to a document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. Tf is the term frequency and idf is the inverse document frequency. Specifically for idf: if fewer documents contain the term t, that is, n is smaller, then idf is larger, which indicates that the term t has good category-distinguishing ability.
Further, in the present system, the conversion from the dialogue data set to tf-idf vectors is implemented with the packaged routines in scikit-learn.
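As an illustration of this step only, the following is a minimal sketch of converting a dialogue data set to tf-idf vectors with scikit-learn; the variable names and example sentences are assumptions for the sketch and are not part of the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical dialogue data set: each element is one (already tokenized) utterance.
dialogue_corpus = [
    "how is the weather today",
    "the weather is nice today",
    "what movie do you recommend",
]

# Build the bag-of-words model with tf-idf weights.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(dialogue_corpus)  # sparse (n_dialogues, vocab_size)

print(tfidf_matrix.shape)
print(vectorizer.get_feature_names_out())
```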
After the dialogue data set has been vectorized, a clustering operation can be carried out on it. In this system, the clustering algorithm is the k-means algorithm. It is a hard clustering algorithm and a typical prototype-based objective-function clustering method: the distance from a data point to the prototype is taken as the objective function to be optimized, and the adjustment rules of the iterative operation are obtained by finding the extremum of this function. The algorithm uses the Euclidean distance as the similarity measure and seeks the optimal classification corresponding to an initial cluster-center vector V, so that the evaluation index J is minimized. The sum-of-squared-errors criterion function is used as the clustering criterion. The choice of the k initial cluster centers has a large influence on the clustering result, because in the first step of the algorithm k objects are selected at random as the centers of the initial clusters, each initially representing one cluster. In each iteration, the algorithm reassigns every remaining object in the data set to the nearest cluster according to its distance to each cluster center. After all data objects have been examined, one iteration is completed and new cluster centers are computed. If the value of J does not change between consecutive iterations, the algorithm has converged.
Further, in the system, the specific working steps of the k-means algorithm in the clustering module are as follows (a minimal sketch is given after the steps):
1. randomly select k samples from the N dialogue data samples as initial centroids;
2. for each of the remaining dialogue data samples, measure its distance to each centroid and assign it to the nearest centroid;
3. recalculate the centroid of each resulting class;
4. repeat steps 2 and 3 until the new centroids equal the previous centroids or the change is smaller than a specified threshold.
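The following is a minimal sketch of this clustering step, assuming the tf-idf matrix from the previous sketch; scikit-learn's KMeans performs the iteration described above, and the number of topic clusters K is an assumed value for illustration.

```python
from sklearn.cluster import KMeans

K = 5  # assumed number of topic clusters
kmeans = KMeans(n_clusters=K, init="random", n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(tfidf_matrix)  # one cluster id per dialogue sample

# cluster_ids[i] indicates which topic-specific Seq2Seq model dialogue i is routed to.
```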
Further, the Seq2Seq model includes two sub-modules, an Encoder and a Decoder. The Encoder receives an input of indefinite length and converts it into a vector of fixed length, and the Decoder generates an output sequence from the vector produced by the Encoder.
The goal of the Seq2Seq model is to estimate the conditional probability p(y_1, y_2, \dots, y_T \mid x_1, x_2, \dots, x_{T'}), where x_1, x_2, \dots, x_{T'} is the input sequence and y_1, y_2, \dots, y_T is the output sequence. The length T' of the input sequence and the length T of the output sequence may differ. This probability can be expressed by the following equation:

p(y_1, \dots, y_T \mid x_1, \dots, x_{T'}) = \prod_{t=1}^{T} p(y_t \mid v, y_1, \dots, y_{t-1})

where v is the fixed-length vector generated by the Encoder submodule.
Further, to implement the Seq2Seq model, a special end-of-sequence symbol is appended to each sequence so that the model can recognize where a sequence stops.
Further, the Encoder submodule and the Decoder submodule are each implemented as a recurrent neural network (RNN).
RNNs are a class of artificial neural networks. Unlike conventional neural networks, data are fed in at different time steps, and the internal state of an RNN allows it to retain information from previous inputs.
Specifically, the output of an RNN is calculated as:

o_t = f(s_t)

where o_t and s_t are respectively the output and the hidden state at time t of the RNN. The function f may differ for different tasks; in the present system, what is wanted is the probability of each word at the current time, so the softmax function is used. The hidden state of the RNN is updated by the following equation:

s_t = g(s_{t-1}, x_t)

where g(·) is a non-linear activation function, which may be a sigmoid function or another more complex activation function, x_t is the input at time step t, and s_{t-1} is the hidden state of the RNN at time step t-1.
The internal memory mechanism of an RNN allows it to capture information from previous inputs, which is important for natural language processing tasks, since in such tasks the inputs and outputs should not be treated as independent. However, when the input sequence is too long, standard RNNs suffer from the long-term dependency problem. Therefore, the GRU (gated recurrent unit) is used as the building block of the RNNs.
The GRU is a simplified variant of the LSTM (long short-term memory). Unlike the neuron structure in classical RNNs, the GRU introduces a reset gate and an update gate. The reset gate is calculated as follows:

r_t = \sigma(W_r x_t + U_r h_{t-1})

where \sigma is the sigmoid function, x_t is the input, and h_{t-1} is the hidden state of the GRU at time t-1. When a GRU is used, h_t corresponds to the s_t described above for RNNs; two different symbols are used here for distinction. W_r and U_r are weight matrices to be learned.
The update gate z is calculated as follows:

z = \sigma(W_z x_t + U_z h_{t-1})

where W_z and U_z are weight matrices to be learned. The hidden state h_t is then

h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t

where

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))

in which \odot denotes element-wise multiplication, W and U are weight matrices to be learned, and tanh is the hyperbolic tangent function.
It can be seen from the above equations that if the reset gate is 0, the candidate state is affected only by the current input. Through this mechanism, the hidden state can forget information that will not be important later, so that the model is better able to express the data. And from h_t it can be seen that the update gate z determines how much influence the previous hidden state has on the current hidden state.
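For illustration only, the following is a minimal NumPy sketch of one GRU step following the equations above; the matrix shapes and random initialization are assumptions for the sketch, not values from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One GRU update following the r_t, z, h~_t and h_t equations above."""
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde       # new hidden state

# Assumed sizes for the demonstration.
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=d_in), np.zeros(d_h)
Wr, Wz, W = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Ur, Uz, U = (rng.normal(size=(d_h, d_h)) for _ in range(3))
print(gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, W, U))
```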
When a sentence is very long, taking only the hidden state of the last step of the Encoder easily loses the information in the first half of the sentence. To solve this problem, an attention mechanism is adopted: instead of representing the whole sentence with the hidden state of the last moment, a vector c_i is used under the attention mechanism, where c_i is calculated as:

c_i = \sum_{j=1}^{T'} a_{ij} h'_j

where a_{ij} is calculated by the following formulas:

a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T'} \exp(e_{ik})}, \quad e_{ij} = \eta(s'_{i-1}, h'_j)

where \eta is typically implemented with a multilayer perceptron, s'_{i-1} is the hidden state of the Decoder at time i-1, h'_j is the hidden state of the Encoder at time j, and exp is the exponential function, whose base may be the natural constant or another constant. By introducing the attention mechanism, the model can more fully consider the input information at every moment and can assign more weight to the information it considers important.
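A minimal sketch of the attention computation above, assuming the decoder state and encoder states are already available as tensors; the one-layer alignment network standing in for \eta, and all sizes, are assumptions about its exact form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_h, src_len = 8, 5                      # assumed hidden size and source length
eta = nn.Linear(2 * d_h, 1)              # a one-layer perceptron standing in for eta

s_prev = torch.randn(d_h)                # decoder hidden state s'_{i-1}
enc_states = torch.randn(src_len, d_h)   # encoder hidden states h'_1 .. h'_{T'}

# e_{ij} = eta(s'_{i-1}, h'_j) for every encoder position j
e = eta(torch.cat([s_prev.expand(src_len, -1), enc_states], dim=1)).squeeze(1)
a = F.softmax(e, dim=0)                  # a_{ij} = exp(e_{ij}) / sum_k exp(e_{ik})
c_i = (a.unsqueeze(1) * enc_states).sum(dim=0)  # context vector c_i
```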
Further, the specific process of training and generating the reply of the Seq2Seq model is as follows:
during the training process, the Encoder receives each word in the above as an input, and the input word needs to be converted into word embedding. After the word embedding of each word in the above information is sequentially input into the Encoder, the Decoder integrates the output of each time by adopting an attention mechanism and generates the output of each time by combining the input. The output at each time in the Decoder is the probability of each word in the lexicon occurring at the current time.
Further, word embedding is a technique of representing each word by a vector of fixed length. The word embeddings in the Seq2Seq model can be initialized directly with pre-trained values released by Google, or by randomly initializing a vector for each word. The word embedding of each word can be kept fixed or fine-tuned by the Seq2Seq model during training.
During generation, the word embedding of each word in the preceding context is fed into the Seq2Seq model. The Decoder integrates the Encoder outputs at every time step through the attention mechanism and, combined with its inputs, produces an output at each time step. The output of the Decoder at each time step is the probability of each word in the vocabulary occurring at the current time step.
Further, when selecting the generated reply, the most probable word may be chosen at each time step, or the several most probable partial outputs may be retained at each time step. The latter option sometimes yields better replies.
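As an illustration of the reply-selection step, a greedy decoding loop is sketched below under the assumption of a decoder that returns a vocabulary distribution per step; keeping the top-k partial sequences instead of the single best one at each step turns this into the second (beam-search-like) option above. The decoder interface and token ids are assumptions, not the patent's implementation.

```python
import torch

def greedy_decode(decoder, context_vector, sos_id, eos_id, max_len=30):
    """Pick the most probable word at every step (the greedy option above)."""
    tokens = [sos_id]
    hidden = None
    for _ in range(max_len):
        # Assumed interface: returns log-probabilities over the vocabulary and the new state.
        log_probs, hidden = decoder(torch.tensor([tokens[-1]]), hidden, context_vector)
        next_id = int(log_probs.argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start symbol
```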
Specifically, when the Seq2Seq model is built and implemented in the Seq2Seq module, implementing everything from scratch, starting from individual neurons, would involve a large amount of work. The model is therefore implemented with the help of a deep learning framework, which encapsulates common operations and brings considerable convenience when building the Seq2Seq model.
Further, PyTorch is selected as the deep learning framework. The previously converted vectors and the scikit-learn package called for the k-means algorithm are likewise used from Python, so the two parts can be combined relatively well. In addition, PyTorch supports tensor operations and the construction of dynamic networks. In terms of speed, PyTorch supports GPU acceleration, which further reduces the time the model needs for large-scale training. Using the modules packaged in PyTorch, each part of the Seq2Seq model can be implemented relatively easily.
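The following is a minimal sketch of how the GRU-based Encoder and Decoder described above might be assembled from PyTorch modules; the layer sizes, names and the simplified (non-attention) decoder step are assumptions for illustration, not the exact implementation of the patent.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        emb = self.embedding(src_ids)            # word embeddings
        outputs, hidden = self.gru(emb)          # outputs feed attention, hidden acts as summary v
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_ids, hidden):          # one step or a whole target sequence
        emb = self.embedding(tgt_ids)
        output, hidden = self.gru(emb, hidden)
        return self.out(output), hidden          # per-step vocabulary scores

# One Encoder/Decoder pair per topic cluster, as the system requires, e.g.
# models = {cid: (Encoder(vocab_size), Decoder(vocab_size)) for cid in range(K)}
```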
Compared with the prior art, the invention has the following beneficial effects:
1. In this system, the dialogue data are clustered; the dialogue data within each category are considered to share one topic, and the dialogue data of each topic are trained by a dedicated Seq2Seq model. During application, each sentence spoken by the user is first assigned to a category and then processed by the corresponding Seq2Seq model, so this adaptive process based on topic clustering enables the system to capture characteristics specific to each topic. The system can therefore generate topic-sensitive, more meaningful replies, improving the user's experience with the dialog system.
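To make this adaptive routing concrete, the following is a minimal sketch of how an incoming user sentence might be dispatched at inference time, reusing the fitted vectorizer and k-means model from the earlier sketches; the models dictionary and the generate_reply helper are hypothetical names for the per-cluster Seq2Seq models, not identifiers from the patent.

```python
def reply_to(user_sentence, vectorizer, kmeans, models):
    """Route a user sentence to the Seq2Seq model of its topic cluster."""
    vec = vectorizer.transform([user_sentence])          # tf-idf vector of the sentence
    cluster_id = int(kmeans.predict(vec)[0])              # category judgment
    seq2seq_model = models[cluster_id]                     # topic-specific Seq2Seq model
    return seq2seq_model.generate_reply(user_sentence)    # hypothetical generation call
```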
Drawings
FIG. 1 is a general schematic diagram of an adaptive dialog generation system based on topic clustering according to the present invention;
FIG. 2 is a schematic diagram of the Seq2Seq model constructed in the Seq2Seq module.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
Fig. 1 is a schematic diagram of an adaptive dialog generation system based on topic clustering, where the system includes a dialog data module, a vectorization module, a clustering module, and a Seq2Seq module;
the dialogue data module is used for constructing a dialogue data set before training;
the vectorization module is used for vectorizing the dialogue data set before clustering and taking the vectorized dialogue data set as the input of a clustering model to become a clustering basis;
the clustering module is used for clustering the vectorized dialogue data set into clusters;
and the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies to the dialogue data sets in the clusters obtained by the clustering module.
In the Seq2Seq module, the dialogue data is divided into a plurality of clusters according to the clustering module, a Seq2Seq model is constructed for each cluster to train the input dialogue data, and a dialogue reply is generated.
Specifically, in the vectorization module, a bag-of-words model is constructed using tf-idf as the weight calculation scheme, and vectorization is performed on the dialogue data set output by the dialogue data module.
Before clustering with the clustering module, the dialogue data set needs to be vectorized, and the bag-of-words model is a relatively common way of vectorizing text in natural language processing. When the dialogue data set is converted with the bag-of-words model, each word is uniquely fixed at a certain position in the vector; for a given dialogue sample, if a word appears, the corresponding position is set to a non-zero number, and if it does not appear, the corresponding position is set to 0. How to choose the non-zero number then becomes very important. A simpler way is to set these numbers to 1, but the importance of different words is then not distinguished, so tf-idf is used as the weighting scheme in the clustering module.
As a weighting scheme, tf-idf is used to evaluate how important a word is to a document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus, because the more documents a word appears in, the weaker its ability to distinguish documents. Tf is the term frequency and idf is the inverse document frequency. Specifically for idf: if fewer documents contain the term t, that is, n is smaller, then idf is larger, which indicates that the term t has good category-distinguishing ability.
Further, in the present system, the conversion from the dialogue data set to tf-idf vectors is implemented with the packaged routines in scikit-learn.
After the dialogue data set has been vectorized, a clustering operation can be carried out on it. In this system, the clustering algorithm is the k-means algorithm. It is a hard clustering algorithm and a typical prototype-based objective-function clustering method: the distance from a data point to the prototype is taken as the objective function to be optimized, and the adjustment rules of the iterative operation are obtained by finding the extremum of this function. The algorithm uses the Euclidean distance as the similarity measure and seeks the optimal classification corresponding to an initial cluster-center vector V, so that the evaluation index J is minimized. The sum-of-squared-errors criterion function is used as the clustering criterion. The choice of the k initial cluster centers has a large influence on the clustering result, because in the first step of the algorithm k objects are selected at random as the centers of the initial clusters, each initially representing one cluster. In each iteration, the algorithm reassigns every remaining object in the data set to the nearest cluster according to its distance to each cluster center. After all data objects have been examined, one iteration is completed and new cluster centers are computed. If the value of J does not change between consecutive iterations, the algorithm has converged.
Further, in the system, the specific working steps of the k-means algorithm in the clustering module are as follows:
1. randomly select k samples from the N dialogue data samples as initial centroids;
2. for each of the remaining dialogue data samples, measure its distance to each centroid and assign it to the nearest centroid;
3. recalculate the centroid of each resulting class;
4. repeat steps 2 and 3 until the new centroids equal the previous centroids or the change is smaller than a specified threshold.
Further, each Seq2Seq model shown in Fig. 2 includes two sub-modules, an Encoder and a Decoder. The Encoder receives an input of indefinite length and converts it into a vector of fixed length, and the Decoder generates an output sequence from the vector produced by the Encoder.
The goal of the Seq2Seq model is to estimate the conditional probability p(y_1, y_2, \dots, y_T \mid x_1, x_2, \dots, x_{T'}), where x_1, x_2, \dots, x_{T'} is the input sequence and y_1, y_2, \dots, y_T is the output sequence. The length T' of the input sequence and the length T of the output sequence may differ. This probability can be expressed by the following equation:

p(y_1, \dots, y_T \mid x_1, \dots, x_{T'}) = \prod_{t=1}^{T} p(y_t \mid v, y_1, \dots, y_{t-1})

where v is the fixed-length vector generated by the Encoder submodule.
Further, to implement the Seq2Seq model, a special end-of-sequence symbol is appended to each sequence so that the model can recognize where a sequence stops.
Further, the Encoder submodule and the Decoder submodule are each implemented as a recurrent neural network (RNN).
RNNs are a class of artificial neural networks. Unlike conventional neural networks, data are fed in at different time steps, and the internal state of an RNN allows it to retain information from previous inputs.
Specifically, the output of an RNN is calculated as:

o_t = f(s_t)

where o_t and s_t are respectively the output and the hidden state at time t of the RNN. The function f may differ for different tasks; in the present system, what is wanted is the probability of each word at the current time, so the softmax function is used. The hidden state of the RNN is updated by the following equation:

s_t = g(s_{t-1}, x_t)

where g(·) is a non-linear activation function, which may be a sigmoid function or another more complex activation function, x_t is the input at time step t, and s_{t-1} is the hidden state of the RNN at time step t-1.
The internal memory mechanism of an RNN allows it to capture information from previous inputs, which is important for natural language processing tasks, since in such tasks the inputs and outputs should not be treated as independent. However, when the input sequence is too long, standard RNNs suffer from the long-term dependency problem. Therefore, the GRU (gated recurrent unit) is used as the building block of the RNNs.
The GRU is a simplified variant of the LSTM (long short-term memory). Unlike the neuron structure in classical RNNs, the GRU introduces a reset gate and an update gate. The reset gate is calculated as follows:

r_t = \sigma(W_r x_t + U_r h_{t-1})

where \sigma is the sigmoid function, x_t is the input, and h_{t-1} is the hidden state of the GRU at time t-1. When a GRU is used, h_t corresponds to the s_t described above for RNNs; two different symbols are used here for distinction. W_r and U_r are weight matrices to be learned.
The update gate z is calculated as follows:

z = \sigma(W_z x_t + U_z h_{t-1})

where W_z and U_z are weight matrices to be learned. The hidden state h_t is updated as

h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t

where

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))

in which \odot denotes element-wise multiplication, W and U are weight matrices to be learned, and tanh is the hyperbolic tangent function.
It can be seen from the above equations that if the reset gate is 0, the candidate state is affected only by the current input. Through this mechanism, the hidden state can forget information that will not be important later, so that the model is better able to express the data. And from h_t it can be seen that the update gate z determines how much influence the previous hidden state has on the current hidden state.
When a sentence is very long, taking only the hidden state of the last step of the Encoder easily loses the information in the first half of the sentence. To solve this problem, an attention mechanism is adopted: instead of representing the whole sentence with the hidden state of the last moment, a vector c_i is used under the attention mechanism, where c_i is calculated as:

c_i = \sum_{j=1}^{T'} a_{ij} h'_j

where a_{ij} is calculated by the following formulas:

a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T'} \exp(e_{ik})}, \quad e_{ij} = \eta(s'_{i-1}, h'_j)

where \eta is typically implemented with a multilayer perceptron, s'_{i-1} is the hidden state of the Decoder at time i-1, h'_j is the hidden state of the Encoder at time j, and exp is the exponential function, whose base may be the natural constant or another constant. By introducing the attention mechanism, the model can more fully consider the input information at every moment and can assign more weight to the information it considers important.
Further, the specific process of training and generating the reply of the Seq2Seq model is as follows:
during the training process, the Encoder receives each word in the above as an input, and the input word needs to be converted into word embedding. After the word embedding of each word in the above information is sequentially input into the Encoder, the Decoder integrates the output of each time by adopting an attention mechanism and generates the output of each time by combining the input. The output at each time in the Decoder is the probability of each word in the lexicon occurring at the current time.
Further, word embedding is a technique of representing each word by a vector of fixed length. The word embeddings in the Seq2Seq model can be initialized directly with pre-trained values released by Google, or by randomly initializing a vector for each word. The word embedding of each word can be kept fixed or fine-tuned by the Seq2Seq model during training.
During generation, the word embedding of each word in the preceding context is fed into the Seq2Seq model. The Decoder integrates the Encoder outputs at every time step through the attention mechanism and, combined with its inputs, produces an output at each time step. The output of the Decoder at each time step is the probability of each word in the vocabulary occurring at the current time step.
Further, when selecting the generated reply, the most probable word may be chosen at each time step, or the several most probable partial outputs may be retained at each time step. The latter option sometimes yields better replies.
Specifically, when the Seq2Seq model is built and implemented in the Seq2Seq module, implementing everything from scratch, starting from individual neurons, would involve a large amount of work. The model is therefore implemented with the help of a deep learning framework, which encapsulates common operations and brings considerable convenience when building the Seq2Seq model.
Further, PyTorch is selected as the deep learning framework. The previously converted vectors and the scikit-learn package called for the k-means algorithm are likewise used from Python, so the two parts can be combined relatively well. In addition, PyTorch supports tensor operations and the construction of dynamic networks. In terms of speed, PyTorch supports GPU acceleration, which further reduces the time the model needs for large-scale training. Using the modules packaged in PyTorch, each part of the Seq2Seq model can be implemented relatively easily.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. An adaptive dialog generation system based on topic clustering, characterized in that the dialog data of each topic in the system are trained by a dedicated Seq2Seq model; when the system is applied, the category of each sentence spoken by the user is judged and the sentence is processed by the corresponding Seq2Seq model; the system comprises a dialogue data module, a vectorization module, a clustering module and a Seq2Seq module;
the dialogue data module is used for constructing a dialogue data set before training;
the vectorization module is used for vectorizing the dialogue data set before clustering and taking the vectorized dialogue data set as the input of a clustering model to become a clustering basis;
the clustering module is used for clustering the vectorized dialogue data set into clusters;
the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies to the dialogue data sets in the clusters obtained by the clustering module;
in the Seq2Seq module, dividing the dialogue data into a plurality of clusters according to a clustering module, constructing a Seq2Seq model for each cluster to train the input dialogue data, and generating a dialogue reply;
in the Seq2Seq module, a Seq2Seq model is constructed to train the input dialogue data and generate dialogue replies; the Seq2Seq model comprises two sub-modules, an Encoder and a Decoder; the Encoder receives an input of indefinite length and converts it into a vector of fixed length; the Decoder generates an output sequence from the vector produced by the Encoder; the Encoder submodule and the Decoder submodule are each composed of an RNN;
an attention mechanism is used so that, instead of the hidden state at the last time step of the Encoder submodule, the whole sentence is represented under the attention mechanism by a vector c_i, where c_i is calculated as:

c_i = \sum_{j=1}^{T'} a_{ij} h'_j

wherein a_{ij} is calculated by the following formulas:

a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T'} \exp(e_{ik})}

e_{ij} = \eta(s'_{i-1}, h'_j)

where \eta is typically implemented with a multilayer perceptron, s'_{i-1} is the hidden state of the Decoder at time i-1, and h'_j is the hidden state of the Encoder at time j.
2. The adaptive dialog generation system based on topic clustering according to claim 1, characterized in that in the vectorization module, tf-idf is used as a weight calculation method to construct a bag-of-words model, and the dialog data set output by the dialog data module is vectorized.
3. The adaptive dialog generation system based on topic clustering according to claim 2, characterized in that in the present system, the conversion from the dialog data set to tf-idf vectors is performed by means of the packaged routines in scikit-learn.
4. The adaptive dialog generation system based on topic clustering according to claim 2, characterized in that in the system, the specific working steps of the k-means algorithm in the clustering module are:
1. randomly select k samples from the N dialogue data samples as initial centroids;
2. for each of the remaining dialogue data samples, measure its distance to each centroid and assign it to the nearest centroid;
3. recalculate the centroid of each resulting class;
4. repeat steps 2 and 3 until the new centroids equal the previous centroids or the change is smaller than a specified threshold.
5. The adaptive dialog generation system based on topic clustering according to claim 1, characterized in that when implementing the Seq2Seq model, in order to make the model recognize the stop sign of the sequence, a special symbol is added after each sequence to indicate the stop of the sequence.
6. The adaptive dialog generation system based on topic clustering according to claim 1 characterized by taking GRUs as building blocks in RNNs; unlike the neuron structure in classical RNNs, the GRU incorporates a reset gate and an update gate mechanism, and the calculation formula of the reset gate is as follows:
r_t = \sigma(W_r x_t + U_r h_{t-1})

where \sigma is the sigmoid function, x_t is the input and h_{t-1} is the hidden state of the GRU at time t-1; W_r and U_r are weight matrices to be learned;

the update gate z is calculated as follows:

z = \sigma(W_z x_t + U_z h_{t-1})

where W_z and U_z are weight matrices to be learned;

and the hidden state h_t is

h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t

where

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))

in which \odot denotes element-wise multiplication, W and U are weight matrices to be learned, and tanh is the hyperbolic tangent function.
7. The adaptive dialog generation system based on topic clustering according to claim 1, characterized in that when constructing and implementing the Seq2Seq model in the Seq2Seq module, PyTorch is used as the deep learning framework.
CN201810823424.6A 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering Active CN109308316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810823424.6A CN109308316B (en) 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810823424.6A CN109308316B (en) 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering

Publications (2)

Publication Number Publication Date
CN109308316A CN109308316A (en) 2019-02-05
CN109308316B true CN109308316B (en) 2021-05-14

Family

ID=65225979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810823424.6A Active CN109308316B (en) 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering

Country Status (1)

Country Link
CN (1) CN109308316B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297909B (en) * 2019-07-05 2021-07-02 中国工商银行股份有限公司 Method and device for classifying unlabeled corpora
CN113836275B (en) * 2020-06-08 2023-09-05 菜鸟智能物流控股有限公司 Dialogue model establishment method and device, nonvolatile storage medium and electronic device
CN111751714A (en) * 2020-06-11 2020-10-09 西安电子科技大学 Radio frequency analog circuit fault diagnosis method based on SVM and HMM
CN112115687B (en) * 2020-08-26 2024-04-26 华南理工大学 Method for generating problem by combining triplet and entity type in knowledge base

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133224A (en) * 2017-04-25 2017-09-05 中国人民大学 A kind of language generation method based on descriptor
US9830315B1 (en) * 2016-07-13 2017-11-28 Xerox Corporation Sequence-based structured prediction for semantic parsing
CN108319668A (en) * 2018-01-23 2018-07-24 义语智能科技(上海)有限公司 Generate the method and apparatus of text snippet

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830315B1 (en) * 2016-07-13 2017-11-28 Xerox Corporation Sequence-based structured prediction for semantic parsing
CN107133224A (en) * 2017-04-25 2017-09-05 中国人民大学 A kind of language generation method based on descriptor
CN108319668A (en) * 2018-01-23 2018-07-24 义语智能科技(上海)有限公司 Generate the method and apparatus of text snippet

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Research and Analysis of Chinese Word Segmentation Based on Bidirectional LSTMN Neural Networks; 黄积杨; China Master's Theses Full-text Database, Information Science and Technology Series; 2016-10-15 (No. 10); p. 34 *
Dialogue Generation Algorithm for Open-Domain Chatbots Based on Reinforcement Learning; 曹东岩; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-02-15 (No. 2); full text *
Research on Intelligent Chatbots Based on Deep Learning; 梁苗苗; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-06-15 (No. 6); full text *
Research on Event-Oriented Automatic Summarization of Social Media Text; 官宸宇; China Master's Theses Full-text Database, Information Science and Technology Series; 2017 (No. 8); pp. 12, 16-18, 25, 28, 33, 36-37 *
Research on Event-Oriented Automatic Summarization of Social Media Text; 官宸宇; China Master's Theses Full-text Database, Information Science and Technology Series; 2017-08-15 (No. 8); pp. 12, 16-18, 25, 28, 33, 36-37 *

Also Published As

Publication number Publication date
CN109308316A (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN110188358B (en) Training method and device for natural language processing model
CN109308316B (en) Adaptive dialog generation system based on topic clustering
CN111858931B (en) Text generation method based on deep learning
CN110427461B (en) Intelligent question and answer information processing method, electronic equipment and computer readable storage medium
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN110377916B (en) Word prediction method, word prediction device, computer equipment and storage medium
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
WO2022217849A1 (en) Methods and systems for training neural network model for mixed domain and multi-domain tasks
CN112417894B (en) Conversation intention identification method and system based on multi-task learning
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111581970B (en) Text recognition method, device and storage medium for network context
CN114676234A (en) Model training method and related equipment
CN114596844B (en) Training method of acoustic model, voice recognition method and related equipment
CN111046157B (en) Universal English man-machine conversation generation method and system based on balanced distribution
KR20240089276A (en) Joint unsupervised and supervised training for multilingual automatic speech recognition.
CN111899766A (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
Yang et al. Sequence-to-sequence prediction of personal computer software by recurrent neural network
WO2023159759A1 (en) Model training method and apparatus, emotion message generation method and apparatus, device and medium
CN116150334A (en) Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism
CN111274359B (en) Query recommendation method and system based on improved VHRED and reinforcement learning
CN113901820A (en) Chinese triplet extraction method based on BERT model
Tascini Al-Chatbot: elderly aid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant