CN109308316B - Adaptive dialog generation system based on topic clustering - Google Patents

Adaptive dialog generation system based on topic clustering

Info

Publication number
CN109308316B
CN109308316B (application CN201810823424.6A)
Authority
CN
China
Prior art keywords
module
clustering
dialogue data
seq2seq
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810823424.6A
Other languages
Chinese (zh)
Other versions
CN109308316A (en)
Inventor
蔡毅
任达
闵华清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810823424.6A priority Critical patent/CN109308316B/en
Publication of CN109308316A publication Critical patent/CN109308316A/en
Application granted granted Critical
Publication of CN109308316B publication Critical patent/CN109308316B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an adaptive dialogue generation system based on topic clustering, which comprises a dialogue data module, a vectorization module, a clustering module and a Seq2Seq module. The dialogue data module is used for constructing a dialogue data set before training; the vectorization module is used for vectorizing the dialogue data set before clustering, so that the vectorized dialogue data set serves as the input of the clustering model and becomes the clustering basis; the clustering module is used for clustering the vectorized dialogue data set into a plurality of clusters; and the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies for the dialogue data in the clusters obtained by the clustering module. The system can cluster the dialogue data according to topic and train the dialogue data of each category with a dedicated Seq2Seq model. A classical Seq2Seq model tends to generate meaningless replies; the model provided by the invention enables the dialogue system to generate replies that better conform to the topic and are more meaningful. Such replies make the user more willing to communicate with the dialogue system and improve the user experience.

Description

Adaptive dialog generation system based on topic clustering
Technical Field
The invention relates to the field of dialog generation, and in particular to an adaptive dialog generation system based on topic clustering.
Background
Currently, with the development of artificial intelligence, dialog systems are receiving more and more attention. In the Turing test, the dialog system serves as an index for judging whether a computer is intelligent. Products with dialog systems, such as Apple's Siri and Microsoft's Cortana, have already entered people's lives. The application and popularity of dialog systems allow people to interact with computers through natural language, making human-computer communication more natural.
With the development of deep learning, many key technologies, such as image recognition, speech recognition and machine translation, have made considerable progress. In solving speech recognition and machine translation problems, the Sequence-to-Sequence (Seq2Seq) model is commonly used. Unlike other models, the Seq2Seq model can accept a sequence of indefinite length as input and output a sequence of indefinite length. This property is important because in many problems, such as machine translation and speech recognition, the lengths of the input and output cannot be known in advance. In recent years, researchers have also attempted to apply the Seq2Seq model to dialog generation tasks, since dialog generation can likewise be viewed as a sequence-to-sequence task. On many data sets, the Seq2Seq model achieves better results than dialog generation with the traditional N-Gram model.
However, applying the Seq2Seq model directly to dialog tends to make the model generate meaningless replies such as "good" or "no". Such replies occur in large numbers in dialog data sets, so the model is prone to producing them during learning. Although such a reply is relatively safe, it makes the dialog between the computer and the user very brief, and it is difficult for the user to continue communicating with the computer on the basis of these responses. Many researchers have proposed improved models to try to alleviate this problem. However, existing Seq2Seq-based dialog generation systems usually train all the dialog data with a single Seq2Seq model. Different conversations contain different topics and therefore have different characteristics, which are often difficult for a single Seq2Seq model to capture well.
Disclosure of Invention
The invention aims to provide an adaptive dialog generation system based on topic clustering, which can better capture the characteristics of different topics and generate replies related to those topics.
The purpose of the invention can be realized by the following technical scheme:
a self-adaptive dialog generation system based on topic clustering comprises a dialog data module, a vectorization module, a clustering module and a Seq2Seq module;
the dialogue data module is used for constructing a dialogue data set before training;
the vectorization module is used for vectorizing the dialogue data set before clustering and taking the vectorized dialogue data set as the input of a clustering model to become a clustering basis;
the clustering module is used for clustering the vectorized dialogue data set into clusters;
and the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies to the dialogue data sets in the clusters obtained by the clustering module.
In the Seq2Seq module, the dialogue data is divided into a plurality of clusters according to the clustering module, a Seq2Seq model is constructed for each cluster to train the input dialogue data, and a dialogue reply is generated.
Specifically, in the vectorization module, a bag-of-words model is constructed using tf-idf as the weight calculation scheme, and vectorization is performed on the dialogue data set output by the dialogue data module.
Before clustering with the clustering module, the dialogue data set needs to be vectorized, and the bag-of-words model is a relatively common way of vectorizing text in natural language processing. When the dialogue data set is converted with the bag-of-words model, each word is uniquely fixed at a certain position in the vector; for a given dialogue sample, if a word appears, the corresponding position is set to a non-zero number, and if it does not appear, the corresponding position is set to 0. How to choose the non-zero number then becomes very important. A simpler way is to set these numbers to 1, but the importance of different words is then not distinguished, so tf-idf is used as the weighting scheme in the clustering module.
As a weighting scheme, tf-idf is used to evaluate how important a word is to a document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. Tf is the term frequency and idf is the inverse document frequency. Specifically for idf: if fewer documents contain the term t, that is, n is smaller, then idf is larger, which indicates that the term t has good category-distinguishing ability.
Further, in the present system, the conversion from the dialogue data set to tf-idf vectors is implemented with the packaged routines in scikit-learn.
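As an illustration of this step only, the following is a minimal sketch of converting a dialogue data set to tf-idf vectors with scikit-learn; the variable names and example sentences are assumptions for the sketch and are not part of the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical dialogue data set: each element is one (already tokenized) utterance.
dialogue_corpus = [
    "how is the weather today",
    "the weather is nice today",
    "what movie do you recommend",
]

# Build the bag-of-words model with tf-idf weights.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(dialogue_corpus)  # sparse (n_dialogues, vocab_size)

print(tfidf_matrix.shape)
print(vectorizer.get_feature_names_out())
```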
After the dialogue data set has been vectorized, a clustering operation can be carried out on it. In this system, the clustering algorithm is the k-means algorithm. It is a hard clustering algorithm and a typical prototype-based objective-function clustering method: the distance from a data point to the prototype is taken as the objective function to be optimized, and the adjustment rules of the iterative operation are obtained by finding the extremum of this function. The algorithm uses the Euclidean distance as the similarity measure and seeks the optimal classification corresponding to an initial cluster-center vector V, so that the evaluation index J is minimized. The sum-of-squared-errors criterion function is used as the clustering criterion. The choice of the k initial cluster centers has a large influence on the clustering result, because in the first step of the algorithm k objects are selected at random as the centers of the initial clusters, each initially representing one cluster. In each iteration, the algorithm reassigns every remaining object in the data set to the nearest cluster according to its distance to each cluster center. After all data objects have been examined, one iteration is completed and new cluster centers are computed. If the value of J does not change between consecutive iterations, the algorithm has converged.
Further, in the system, the specific working steps of the k-means algorithm in the clustering module are as follows (a minimal sketch is given after the steps):
1. randomly select k samples from the N dialogue data samples as initial centroids;
2. for each of the remaining dialogue data samples, measure its distance to each centroid and assign it to the nearest centroid;
3. recalculate the centroid of each resulting class;
4. repeat steps 2 and 3 until the new centroids equal the previous centroids or the change is smaller than a specified threshold.
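The following is a minimal sketch of this clustering step, assuming the tf-idf matrix from the previous sketch; scikit-learn's KMeans performs the iteration described above, and the number of topic clusters K is an assumed value for illustration.

```python
from sklearn.cluster import KMeans

K = 5  # assumed number of topic clusters
kmeans = KMeans(n_clusters=K, init="random", n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(tfidf_matrix)  # one cluster id per dialogue sample

# cluster_ids[i] indicates which topic-specific Seq2Seq model dialogue i is routed to.
```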
Further, the Seq2Seq model includes two sub-modules, an Encoder and a Decoder. The Encoder receives an input of indefinite length and converts it into a vector of fixed length, and the Decoder generates an output sequence from the vector produced by the Encoder.
The goal of the Seq2Seq model is to estimate the conditional probability p(y_1, y_2, \dots, y_T \mid x_1, x_2, \dots, x_{T'}), where x_1, x_2, \dots, x_{T'} is the input sequence and y_1, y_2, \dots, y_T is the output sequence. The length T' of the input sequence and the length T of the output sequence may differ. This probability can be expressed by the following equation:

p(y_1, \dots, y_T \mid x_1, \dots, x_{T'}) = \prod_{t=1}^{T} p(y_t \mid v, y_1, \dots, y_{t-1})

where v is the fixed-length vector generated by the Encoder submodule.
Further, to implement the Seq2Seq model, a special end-of-sequence symbol is appended to each sequence so that the model can recognize where a sequence stops.
Further, the Encoder submodule and the Decoder submodule are each implemented as a recurrent neural network (RNN).
RNNs are a class of artificial neural networks. Unlike conventional neural networks, data are fed in at different time steps, and the internal state of an RNN allows it to retain information from previous inputs.
Specifically, the output of an RNN is calculated as:

o_t = f(s_t)

where o_t and s_t are respectively the output and the hidden state at time t of the RNN. The function f may differ for different tasks; in the present system, what is wanted is the probability of each word at the current time, so the softmax function is used. The hidden state of the RNN is updated by the following equation:

s_t = g(s_{t-1}, x_t)

where g(·) is a non-linear activation function, which may be a sigmoid function or another more complex activation function, x_t is the input at time step t, and s_{t-1} is the hidden state of the RNN at time step t-1.
The internal memory mechanism of an RNN allows it to capture information from previous inputs, which is important for natural language processing tasks, since in such tasks the inputs and outputs should not be treated as independent. However, when the input sequence is too long, standard RNNs suffer from the long-term dependency problem. Therefore, the GRU (gated recurrent unit) is used as the building block of the RNNs.
The GRU is a simplified variant of the LSTM (long short-term memory). Unlike the neuron structure in classical RNNs, the GRU introduces a reset gate and an update gate. The reset gate is calculated as follows:

r_t = \sigma(W_r x_t + U_r h_{t-1})

where \sigma is the sigmoid function, x_t is the input, and h_{t-1} is the hidden state of the GRU at time t-1. When a GRU is used, h_t corresponds to the s_t described above for RNNs; two different symbols are used here for distinction. W_r and U_r are weight matrices to be learned.
The update gate z is calculated as follows:

z = \sigma(W_z x_t + U_z h_{t-1})

where W_z and U_z are weight matrices to be learned. The hidden state h_t is then

h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t

where

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))

in which \odot denotes element-wise multiplication, W and U are weight matrices to be learned, and tanh is the hyperbolic tangent function.
It can be seen from the above equations that if the reset gate is 0, the candidate state is affected only by the current input. Through this mechanism, the hidden state can forget information that will not be important later, so that the model is better able to express the data. And from h_t it can be seen that the update gate z determines how much influence the previous hidden state has on the current hidden state.
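For illustration only, the following is a minimal NumPy sketch of one GRU step following the equations above; the matrix shapes and random initialization are assumptions for the sketch, not values from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One GRU update following the r_t, z, h~_t and h_t equations above."""
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde       # new hidden state

# Assumed sizes for the demonstration.
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=d_in), np.zeros(d_h)
Wr, Wz, W = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Ur, Uz, U = (rng.normal(size=(d_h, d_h)) for _ in range(3))
print(gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, W, U))
```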
When a sentence is very long, taking only the hidden state of the last step of the Encoder easily loses the information in the first half of the sentence. To solve this problem, an attention mechanism is adopted: instead of representing the whole sentence with the hidden state of the last moment, a vector c_i is used under the attention mechanism, where c_i is calculated as:

c_i = \sum_{j=1}^{T'} a_{ij} h'_j

where a_{ij} is calculated by the following formulas:

a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T'} \exp(e_{ik})}, \quad e_{ij} = \eta(s'_{i-1}, h'_j)

where \eta is typically implemented with a multilayer perceptron, s'_{i-1} is the hidden state of the Decoder at time i-1, h'_j is the hidden state of the Encoder at time j, and exp is the exponential function, whose base may be the natural constant or another constant. By introducing the attention mechanism, the model can more fully consider the input information at every moment and can assign more weight to the information it considers important.
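A minimal sketch of the attention computation above, assuming the decoder state and encoder states are already available as tensors; the one-layer alignment network standing in for \eta, and all sizes, are assumptions about its exact form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_h, src_len = 8, 5                      # assumed hidden size and source length
eta = nn.Linear(2 * d_h, 1)              # a one-layer perceptron standing in for eta

s_prev = torch.randn(d_h)                # decoder hidden state s'_{i-1}
enc_states = torch.randn(src_len, d_h)   # encoder hidden states h'_1 .. h'_{T'}

# e_{ij} = eta(s'_{i-1}, h'_j) for every encoder position j
e = eta(torch.cat([s_prev.expand(src_len, -1), enc_states], dim=1)).squeeze(1)
a = F.softmax(e, dim=0)                  # a_{ij} = exp(e_{ij}) / sum_k exp(e_{ik})
c_i = (a.unsqueeze(1) * enc_states).sum(dim=0)  # context vector c_i
```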
Further, the specific process of training and generating the reply of the Seq2Seq model is as follows:
during the training process, the Encoder receives each word in the above as an input, and the input word needs to be converted into word embedding. After the word embedding of each word in the above information is sequentially input into the Encoder, the Decoder integrates the output of each time by adopting an attention mechanism and generates the output of each time by combining the input. The output at each time in the Decoder is the probability of each word in the lexicon occurring at the current time.
Further, word embedding is a technique of representing each word by a vector of fixed length. The word embeddings in the Seq2Seq model can be initialized directly with pre-trained values released by Google, or by randomly initializing a vector for each word. The word embedding of each word can be kept fixed or fine-tuned by the Seq2Seq model during training.
During generation, the word embedding of each word in the preceding context is fed into the Seq2Seq model. The Decoder integrates the Encoder outputs at every time step through the attention mechanism and, combined with its inputs, produces an output at each time step. The output of the Decoder at each time step is the probability of each word in the vocabulary occurring at the current time step.
Further, when selecting the generated reply, the most probable word may be chosen at each time step, or the several most probable partial outputs may be retained at each time step. The latter option sometimes yields better replies.
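As an illustration of the reply-selection step, a greedy decoding loop is sketched below under the assumption of a decoder that returns a vocabulary distribution per step; keeping the top-k partial sequences instead of the single best one at each step turns this into the second (beam-search-like) option above. The decoder interface and token ids are assumptions, not the patent's implementation.

```python
import torch

def greedy_decode(decoder, context_vector, sos_id, eos_id, max_len=30):
    """Pick the most probable word at every step (the greedy option above)."""
    tokens = [sos_id]
    hidden = None
    for _ in range(max_len):
        # Assumed interface: returns log-probabilities over the vocabulary and the new state.
        log_probs, hidden = decoder(torch.tensor([tokens[-1]]), hidden, context_vector)
        next_id = int(log_probs.argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start symbol
```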
Specifically, when the Seq2Seq model is built and implemented in the Seq2Seq module, implementing everything from scratch, starting from individual neurons, would involve a large amount of work. The model is therefore implemented with the help of a deep learning framework, which encapsulates common operations and brings considerable convenience when building the Seq2Seq model.
Further, PyTorch is selected as the deep learning framework. The previously converted vectors and the scikit-learn package called for the k-means algorithm are likewise used from Python, so the two parts can be combined relatively well. In addition, PyTorch supports tensor operations and the construction of dynamic networks. In terms of speed, PyTorch supports GPU acceleration, which further reduces the time the model needs for large-scale training. Using the modules packaged in PyTorch, each part of the Seq2Seq model can be implemented relatively easily.
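The following is a minimal sketch of how the GRU-based Encoder and Decoder described above might be assembled from PyTorch modules; the layer sizes, names and the simplified (non-attention) decoder step are assumptions for illustration, not the exact implementation of the patent.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        emb = self.embedding(src_ids)            # word embeddings
        outputs, hidden = self.gru(emb)          # outputs feed attention, hidden acts as summary v
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_ids, hidden):          # one step or a whole target sequence
        emb = self.embedding(tgt_ids)
        output, hidden = self.gru(emb, hidden)
        return self.out(output), hidden          # per-step vocabulary scores

# One Encoder/Decoder pair per topic cluster, as the system requires, e.g.
# models = {cid: (Encoder(vocab_size), Decoder(vocab_size)) for cid in range(K)}
```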
Compared with the prior art, the invention has the following beneficial effects:
1. In this system, the dialogue data are clustered; the dialogue data within each category are considered to share one topic, and the dialogue data of each topic are trained by a dedicated Seq2Seq model. During application, each sentence spoken by the user is first assigned to a category and then processed by the corresponding Seq2Seq model, so this adaptive process based on topic clustering enables the system to capture characteristics specific to each topic. The system can therefore generate topic-sensitive, more meaningful replies, improving the user's experience with the dialog system.
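To make this adaptive routing concrete, the following is a minimal sketch of how an incoming user sentence might be dispatched at inference time, reusing the fitted vectorizer and k-means model from the earlier sketches; the models dictionary and the generate_reply helper are hypothetical names for the per-cluster Seq2Seq models, not identifiers from the patent.

```python
def reply_to(user_sentence, vectorizer, kmeans, models):
    """Route a user sentence to the Seq2Seq model of its topic cluster."""
    vec = vectorizer.transform([user_sentence])          # tf-idf vector of the sentence
    cluster_id = int(kmeans.predict(vec)[0])              # category judgment
    seq2seq_model = models[cluster_id]                     # topic-specific Seq2Seq model
    return seq2seq_model.generate_reply(user_sentence)    # hypothetical generation call
```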
Drawings
FIG. 1 is a general schematic diagram of an adaptive dialog generation system based on topic clustering according to the present invention;
FIG. 2 is a schematic diagram of the Seq2Seq model constructed in the Seq2Seq module.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
Fig. 1 is a schematic diagram of an adaptive dialog generation system based on topic clustering, where the system includes a dialog data module, a vectorization module, a clustering module, and a Seq2Seq module;
the dialogue data module is used for constructing a dialogue data set before training;
the vectorization module is used for vectorizing the dialogue data set before clustering and taking the vectorized dialogue data set as the input of a clustering model to become a clustering basis;
the clustering module is used for clustering the vectorized dialogue data set into clusters;
and the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies to the dialogue data sets in the clusters obtained by the clustering module.
In the Seq2Seq module, the dialogue data is divided into a plurality of clusters according to the clustering module, a Seq2Seq model is constructed for each cluster to train the input dialogue data, and a dialogue reply is generated.
Specifically, in the vectorization module, a bag-of-words model is constructed using tf-idf as the weight calculation scheme, and vectorization is performed on the dialogue data set output by the dialogue data module.
Before clustering with the clustering module, the dialogue data set needs to be vectorized, and the bag-of-words model is a relatively common way of vectorizing text in natural language processing. When the dialogue data set is converted with the bag-of-words model, each word is uniquely fixed at a certain position in the vector; for a given dialogue sample, if a word appears, the corresponding position is set to a non-zero number, and if it does not appear, the corresponding position is set to 0. How to choose the non-zero number then becomes very important. A simpler way is to set these numbers to 1, but the importance of different words is then not distinguished, so tf-idf is used as the weighting scheme in the clustering module.
As a weighting scheme, tf-idf is used to evaluate how important a word is to a document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus, because the more documents a word appears in, the weaker its ability to distinguish documents. Tf is the term frequency and idf is the inverse document frequency. Specifically for idf: if fewer documents contain the term t, that is, n is smaller, then idf is larger, which indicates that the term t has good category-distinguishing ability.
Further, in the present system, the conversion from the dialogue data set to tf-idf vectors is implemented with the packaged routines in scikit-learn.
After the dialogue data set has been vectorized, a clustering operation can be carried out on it. In this system, the clustering algorithm is the k-means algorithm. It is a hard clustering algorithm and a typical prototype-based objective-function clustering method: the distance from a data point to the prototype is taken as the objective function to be optimized, and the adjustment rules of the iterative operation are obtained by finding the extremum of this function. The algorithm uses the Euclidean distance as the similarity measure and seeks the optimal classification corresponding to an initial cluster-center vector V, so that the evaluation index J is minimized. The sum-of-squared-errors criterion function is used as the clustering criterion. The choice of the k initial cluster centers has a large influence on the clustering result, because in the first step of the algorithm k objects are selected at random as the centers of the initial clusters, each initially representing one cluster. In each iteration, the algorithm reassigns every remaining object in the data set to the nearest cluster according to its distance to each cluster center. After all data objects have been examined, one iteration is completed and new cluster centers are computed. If the value of J does not change between consecutive iterations, the algorithm has converged.
Further, in the system, the specific working steps of the k-means algorithm in the clustering module are as follows:
1. randomly select k samples from the N dialogue data samples as initial centroids;
2. for each of the remaining dialogue data samples, measure its distance to each centroid and assign it to the nearest centroid;
3. recalculate the centroid of each resulting class;
4. repeat steps 2 and 3 until the new centroids equal the previous centroids or the change is smaller than a specified threshold.
Further, each Seq2Seq model shown in Fig. 2 includes two sub-modules, an Encoder and a Decoder. The Encoder receives an input of indefinite length and converts it into a vector of fixed length, and the Decoder generates an output sequence from the vector produced by the Encoder.
The goal of the Seq2Seq model is to estimate the conditional probability p(y_1, y_2, \dots, y_T \mid x_1, x_2, \dots, x_{T'}), where x_1, x_2, \dots, x_{T'} is the input sequence and y_1, y_2, \dots, y_T is the output sequence. The length T' of the input sequence and the length T of the output sequence may differ. This probability can be expressed by the following equation:

p(y_1, \dots, y_T \mid x_1, \dots, x_{T'}) = \prod_{t=1}^{T} p(y_t \mid v, y_1, \dots, y_{t-1})

where v is the fixed-length vector generated by the Encoder submodule.
Further, to implement the Seq2Seq model, a special end-of-sequence symbol is appended to each sequence so that the model can recognize where a sequence stops.
Further, the Encoder submodule and the Decoder submodule are each implemented as a recurrent neural network (RNN).
RNNs are a class of artificial neural networks. Unlike conventional neural networks, data are fed in at different time steps, and the internal state of an RNN allows it to retain information from previous inputs.
Specifically, the output of an RNN is calculated as:

o_t = f(s_t)

where o_t and s_t are respectively the output and the hidden state at time t of the RNN. The function f may differ for different tasks; in the present system, what is wanted is the probability of each word at the current time, so the softmax function is used. The hidden state of the RNN is updated by the following equation:

s_t = g(s_{t-1}, x_t)

where g(·) is a non-linear activation function, which may be a sigmoid function or another more complex activation function, x_t is the input at time step t, and s_{t-1} is the hidden state of the RNN at time step t-1.
The internal memory mechanism of an RNN allows it to capture information from previous inputs, which is important for natural language processing tasks, since in such tasks the inputs and outputs should not be treated as independent. However, when the input sequence is too long, standard RNNs suffer from the long-term dependency problem. Therefore, the GRU (gated recurrent unit) is used as the building block of the RNNs.
The GRU is a simplified variant of the LSTM (long short-term memory). Unlike the neuron structure in classical RNNs, the GRU introduces a reset gate and an update gate. The reset gate is calculated as follows:

r_t = \sigma(W_r x_t + U_r h_{t-1})

where \sigma is the sigmoid function, x_t is the input, and h_{t-1} is the hidden state of the GRU at time t-1. When a GRU is used, h_t corresponds to the s_t described above for RNNs; two different symbols are used here for distinction. W_r and U_r are weight matrices to be learned.
The update gate z is calculated as follows:

z = \sigma(W_z x_t + U_z h_{t-1})

where W_z and U_z are weight matrices to be learned. The hidden state h_t is updated as

h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t

where

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))

in which \odot denotes element-wise multiplication, W and U are weight matrices to be learned, and tanh is the hyperbolic tangent function.
It can be seen from the above equations that if the reset gate is 0, the candidate state is affected only by the current input. Through this mechanism, the hidden state can forget information that will not be important later, so that the model is better able to express the data. And from h_t it can be seen that the update gate z determines how much influence the previous hidden state has on the current hidden state.
When a sentence is very long, taking only the hidden state of the last step of the Encoder easily loses the information in the first half of the sentence. To solve this problem, an attention mechanism is adopted: instead of representing the whole sentence with the hidden state of the last moment, a vector c_i is used under the attention mechanism, where c_i is calculated as:

c_i = \sum_{j=1}^{T'} a_{ij} h'_j

where a_{ij} is calculated by the following formulas:

a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T'} \exp(e_{ik})}, \quad e_{ij} = \eta(s'_{i-1}, h'_j)

where \eta is typically implemented with a multilayer perceptron, s'_{i-1} is the hidden state of the Decoder at time i-1, h'_j is the hidden state of the Encoder at time j, and exp is the exponential function, whose base may be the natural constant or another constant. By introducing the attention mechanism, the model can more fully consider the input information at every moment and can assign more weight to the information it considers important.
Further, the specific process of training and generating the reply of the Seq2Seq model is as follows:
during the training process, the Encoder receives each word in the above as an input, and the input word needs to be converted into word embedding. After the word embedding of each word in the above information is sequentially input into the Encoder, the Decoder integrates the output of each time by adopting an attention mechanism and generates the output of each time by combining the input. The output at each time in the Decoder is the probability of each word in the lexicon occurring at the current time.
Further, word embedding is a technique of representing each word by a vector of fixed length. The word embeddings in the Seq2Seq model can be initialized directly with pre-trained values released by Google, or by randomly initializing a vector for each word. The word embedding of each word can be kept fixed or fine-tuned by the Seq2Seq model during training.
During generation, the word embedding of each word in the preceding context is fed into the Seq2Seq model. The Decoder integrates the Encoder outputs at every time step through the attention mechanism and, combined with its inputs, produces an output at each time step. The output of the Decoder at each time step is the probability of each word in the vocabulary occurring at the current time step.
Further, when selecting the generated reply, the most probable word may be chosen at each time step, or the several most probable partial outputs may be retained at each time step. The latter option sometimes yields better replies.
Specifically, when the Seq2Seq model is built and implemented in the Seq2Seq module, implementing everything from scratch, starting from individual neurons, would involve a large amount of work. The model is therefore implemented with the help of a deep learning framework, which encapsulates common operations and brings considerable convenience when building the Seq2Seq model.
Further, PyTorch is selected as the deep learning framework. The previously converted vectors and the scikit-learn package called for the k-means algorithm are likewise used from Python, so the two parts can be combined relatively well. In addition, PyTorch supports tensor operations and the construction of dynamic networks. In terms of speed, PyTorch supports GPU acceleration, which further reduces the time the model needs for large-scale training. Using the modules packaged in PyTorch, each part of the Seq2Seq model can be implemented relatively easily.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. An adaptive dialog generation system based on topic clustering, characterized in that the dialog data of each topic in the system are trained by a dedicated Seq2Seq model; when the system is applied, the category of each sentence spoken by the user is judged and the sentence is processed by the corresponding Seq2Seq model; the system comprises a dialogue data module, a vectorization module, a clustering module and a Seq2Seq module;
the dialogue data module is used for constructing a dialogue data set before training;
the vectorization module is used for vectorizing the dialogue data set before clustering and taking the vectorized dialogue data set as the input of a clustering model to become a clustering basis;
the clustering module is used for clustering the vectorized dialogue data set into clusters;
the Seq2Seq module is used for constructing a Seq2Seq model and generating corresponding replies to the dialogue data sets in the clusters obtained by the clustering module;
in the Seq2Seq module, dividing the dialogue data into a plurality of clusters according to a clustering module, constructing a Seq2Seq model for each cluster to train the input dialogue data, and generating a dialogue reply;
in the Seq2Seq module, a Seq2Seq model is constructed to train the input dialogue data and generate dialogue replies; the Seq2Seq model comprises two sub-modules, an Encoder and a Decoder; the Encoder receives an input of indefinite length and converts it into a vector of fixed length; the Decoder generates an output sequence from the vector produced by the Encoder; the Encoder submodule and the Decoder submodule are each composed of an RNN;
an attention mechanism is used so that, instead of the hidden state at the last time step of the Encoder submodule, the whole sentence is represented under the attention mechanism by a vector c_i, where c_i is calculated as:

c_i = \sum_{j=1}^{T'} a_{ij} h'_j

wherein a_{ij} is calculated by the following formulas:

a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T'} \exp(e_{ik})}

e_{ij} = \eta(s'_{i-1}, h'_j)

where \eta is typically implemented with a multilayer perceptron, s'_{i-1} is the hidden state of the Decoder at time i-1, and h'_j is the hidden state of the Encoder at time j.
2. The adaptive dialog generation system based on topic clustering according to claim 1, characterized in that in the vectorization module, tf-idf is used as a weight calculation method to construct a bag-of-words model, and the dialog data set output by the dialog data module is vectorized.
3. The adaptive dialog generation system based on topic clustering according to claim 2, characterized in that in the present system, the conversion from the dialog data set to tf-idf vectors is performed by means of the packaged routines in scikit-learn.
4. The adaptive dialog generation system based on topic clustering according to claim 2, characterized in that in the system, the specific working steps of the k-means algorithm in the clustering module are:
1. randomly select k samples from the N dialogue data samples as initial centroids;
2. for each of the remaining dialogue data samples, measure its distance to each centroid and assign it to the nearest centroid;
3. recalculate the centroid of each resulting class;
4. repeat steps 2 and 3 until the new centroids equal the previous centroids or the change is smaller than a specified threshold.
5. The adaptive dialog generation system based on topic clustering according to claim 1, characterized in that when implementing the Seq2Seq model, in order to make the model recognize the stop sign of the sequence, a special symbol is added after each sequence to indicate the stop of the sequence.
6. The adaptive dialog generation system based on topic clustering according to claim 1 characterized by taking GRUs as building blocks in RNNs; unlike the neuron structure in classical RNNs, the GRU incorporates a reset gate and an update gate mechanism, and the calculation formula of the reset gate is as follows:
r_t = \sigma(W_r x_t + U_r h_{t-1})

where \sigma is the sigmoid function, x_t is the input and h_{t-1} is the hidden state of the GRU at time t-1; W_r and U_r are weight matrices to be learned;

the update gate z is calculated as follows:

z = \sigma(W_z x_t + U_z h_{t-1})

where W_z and U_z are weight matrices to be learned;

and the hidden state h_t is

h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t

where

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))

in which \odot denotes element-wise multiplication, W and U are weight matrices to be learned, and tanh is the hyperbolic tangent function.
7. The adaptive dialog generation system based on topic clustering according to claim 1, characterized in that when constructing and implementing the Seq2Seq model in the Seq2Seq module, PyTorch is used as the deep learning framework.
CN201810823424.6A 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering Active CN109308316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810823424.6A CN109308316B (en) 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810823424.6A CN109308316B (en) 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering

Publications (2)

Publication Number Publication Date
CN109308316A CN109308316A (en) 2019-02-05
CN109308316B true CN109308316B (en) 2021-05-14

Family

ID=65225979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810823424.6A Active CN109308316B (en) 2018-07-25 2018-07-25 Adaptive dialog generation system based on topic clustering

Country Status (1)

Country Link
CN (1) CN109308316B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297909B (en) * 2019-07-05 2021-07-02 中国工商银行股份有限公司 Method and device for classifying unlabeled corpora
CN113836275B (en) * 2020-06-08 2023-09-05 菜鸟智能物流控股有限公司 Dialogue model establishment method and device, nonvolatile storage medium and electronic device
CN111751714A (en) * 2020-06-11 2020-10-09 西安电子科技大学 Radio frequency analog circuit fault diagnosis method based on SVM and HMM
CN112115687B (en) * 2020-08-26 2024-04-26 华南理工大学 Method for generating problem by combining triplet and entity type in knowledge base

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133224A (en) * 2017-04-25 2017-09-05 中国人民大学 A kind of language generation method based on descriptor
US9830315B1 (en) * 2016-07-13 2017-11-28 Xerox Corporation Sequence-based structured prediction for semantic parsing
CN108319668A (en) * 2018-01-23 2018-07-24 义语智能科技(上海)有限公司 Generate the method and apparatus of text snippet

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830315B1 (en) * 2016-07-13 2017-11-28 Xerox Corporation Sequence-based structured prediction for semantic parsing
CN107133224A (en) * 2017-04-25 2017-09-05 中国人民大学 A kind of language generation method based on descriptor
CN108319668A (en) * 2018-01-23 2018-07-24 义语智能科技(上海)有限公司 Generate the method and apparatus of text snippet

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Research and Analysis of Chinese Word Segmentation Based on Bidirectional LSTMN Neural Networks; 黄积杨; China Master's Theses Full-text Database, Information Science and Technology Series; 2016-10-15 (No. 10); p. 34 *
Dialogue Generation Algorithm for Open-Domain Chatbots Based on Reinforcement Learning; 曹东岩; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-02-15 (No. 2); full text *
Research on Intelligent Chatbots Based on Deep Learning; 梁苗苗; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-06-15 (No. 6); full text *
Research on Event-Oriented Automatic Summarization of Social Media Text; 官宸宇; China Master's Theses Full-text Database, Information Science and Technology Series; 2017 (No. 8); pp. 12, 16-18, 25, 28, 33, 36-37 *
Research on Event-Oriented Automatic Summarization of Social Media Text; 官宸宇; China Master's Theses Full-text Database, Information Science and Technology Series; 2017-08-15 (No. 8); pp. 12, 16-18, 25, 28, 33, 36-37 *

Also Published As

Publication number Publication date
CN109308316A (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN110188358B (en) Training method and device for natural language processing model
CN109308316B (en) Adaptive dialog generation system based on topic clustering
CN111858931B (en) Text generation method based on deep learning
CN110427461B (en) Intelligent question and answer information processing method, electronic equipment and computer readable storage medium
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN110377916B (en) Word prediction method, word prediction device, computer equipment and storage medium
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
WO2022217849A1 (en) Methods and systems for training neural network model for mixed domain and multi-domain tasks
CN112417894B (en) Conversation intention identification method and system based on multi-task learning
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111581970B (en) Text recognition method, device and storage medium for network context
CN114676234A (en) Model training method and related equipment
CN114596844B (en) Training method of acoustic model, voice recognition method and related equipment
CN111046157B (en) Universal English man-machine conversation generation method and system based on balanced distribution
KR20240089276A (en) Joint unsupervised and supervised training for multilingual automatic speech recognition.
CN111899766A (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
Yang et al. Sequence-to-sequence prediction of personal computer software by recurrent neural network
WO2023159759A1 (en) Model training method and apparatus, emotion message generation method and apparatus, device and medium
CN116150334A (en) Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism
CN111274359B (en) Query recommendation method and system based on improved VHRED and reinforcement learning
CN113901820A (en) Chinese triplet extraction method based on BERT model
Tascini Al-Chatbot: elderly aid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant