CN110765270A

CN110765270A - Training method and system of text classification model for spoken language interaction

Info

Publication number: CN110765270A
Application number: CN201911066202.5A
Authority: CN
Inventors: 方艳; 徐华; 初敏
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2019-11-04
Filing date: 2019-11-04
Publication date: 2020-02-07
Anticipated expiration: 2039-11-04
Also published as: CN110765270B

Abstract

The embodiment of the invention provides a training method of a text classification model for spoken language interaction. The method comprises the following steps: acquiring a spoken language text corpus training set and dialogue historical context information; performing corpus expansion on the spoken language text corpus training set through the historical context information of the conversation to enrich the spoken language text corpus training set; and establishing a text classification model based on a bidirectional long-and-short time memory network, and training the text classification model through conversation historical context information and a spoken language text corpus training set after corpus expansion, so that the text classification model learns the field classification of the spoken language text through the conversation historical context information. The embodiment of the invention also provides a training system of the text classification model for spoken language interaction. The embodiment of the invention determines the historical context information of the conversation, constructs a large amount of virtual conversation texts and makes up for the shortage of linguistic data; the dialogue history contextual information is used as part of the input of the training model, and helps the model to improve the accuracy of the domain classification.

Description

Training method and system of text classification model for spoken language interaction

Technical Field

The invention relates to the field of intelligent voice conversation, in particular to a training method and a training system of a text classification model for spoken language interaction.

Background

In the text classification of spoken language interaction, a large amount of manually marked corpora are usually used for training a deep learning model, the model can automatically acquire text characteristics, and after a result is output by the model, the final field output needs to be selected by combining with the previous set of speech state design rules.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

the text classification method based on the feature engineering needs to consume manpower to design text features, the final performance of the model is limited by the quality of the feature design, and the features used by the method often have the problems of sparsity and dimension explosion, so that the final classification performance is relatively low.

No matter the method is based on the characteristic engineering or the deep learning method, the input of the classification model is the text of the current user, and the effect of the dialogue historical information on the model classification is not considered. The output field of the model is only an intermediate state, corresponding field selection rules are designed by combining conversation history, and one or more fields are screened out from the alternative fields given by the model to serve as final output. The whole process of the method is relatively complicated and is not simple and convenient; the rules of artificial design are often not flexible enough and not accurate enough.

Disclosure of Invention

In order to at least solve the problems that the manual design of text features in the prior art is time-consuming and labor-consuming; after the classification model obtains the result, the final field needs to be judged by a manual design rule, so that the time and the labor are consumed, and the flexibility is not high; the dialogue information can help the model to judge the field and improve the accuracy of field classification, but the existing method model has no problem of adding dialogue information.

In a first aspect, an embodiment of the present invention provides a method for training a text classification model for spoken language interaction, including:

acquiring a spoken language text corpus training set and dialogue historical context information;

performing corpus expansion on the spoken language text corpus training set through the dialogue historical contextual information to enrich the spoken language text corpus training set;

and establishing a text classification model based on a bidirectional long-and-short time memory network, and training the text classification model through the conversation historical context information and the spoken language text corpus training set after corpus expansion, so that the text classification model learns the field classification of the spoken language text through the conversation historical context information.

In a second aspect, an embodiment of the present invention provides a training system for a text classification model for spoken language interaction, including:

the information acquisition program module is used for acquiring a spoken language text corpus training set and dialogue historical context information;

the corpus expansion program module is used for performing corpus expansion on the spoken language text corpus training set through the dialogue historical contextual information to enrich the spoken language text corpus training set;

and the model training program module is used for establishing a text classification model based on a bidirectional long-time and short-time memory network, and training the text classification model through the conversation history context information and the spoken language text corpus training set after corpus expansion, so that the text classification model learns the field classification of the spoken language text through the conversation history context information.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a text classification model for spoken language interaction of any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the method for training a text classification model for spoken language interaction according to any embodiment of the present invention.

The embodiment of the invention has the beneficial effects that: extracting key factors influencing the field of the next pair of dialogues, determining historical context information of the dialogues, and constructing a large number of virtual dialog texts through the historical context information of the dialogues, so that the situation that the linguistic data of a spoken language text corpus training set is insufficient is made up; and the historical context information of the dialogue is used as a part of the input of the training model, and the output of the model is the final domain result which is in line with the current dialogue scene. The whole system does not have a complicated process of manually judging the field, time and labor are saved, and the historical context information can be conversed to help the model to improve the accuracy of field classification.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flowchart of a method for training a text classification model for spoken language interaction according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating the structure of a method for training a text classification model for spoken language interaction according to an embodiment of the present invention;

FIG. 3 is a diagram of a dialogue history context information adding BLSTM model of a training method for a text classification model for spoken language interaction according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating comparison of performance of a training method for a text classification model for spoken language interaction according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a training system for a text classification model for spoken language interaction according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a training method of a text classification model for spoken language interaction according to an embodiment of the present invention, including the following steps:

s11: acquiring a spoken language text corpus training set and dialogue historical context information;

s12: performing corpus expansion on the spoken language text corpus training set through the dialogue historical contextual information to enrich the spoken language text corpus training set;

s13: and establishing a text classification model based on a bidirectional long-and-short time memory network, and training the text classification model through the conversation historical context information and the spoken language text corpus training set after corpus expansion, so that the text classification model learns the field classification of the spoken language text through the conversation historical context information.

In the present embodiment, spoken text without dialogue information is relatively easily available, but annotation data with dialogue context is relatively rare and time-consuming and labor-consuming to annotate. The history information of the conversation has many contents, including the domain where the previous conversation was, the answer of the machine, and the like. How to effectively utilize such information is not easily conceivable.

For step S11, in training the spoken language interactive text classification model, not only the spoken language text corpus training set but also additional dialogue history context information are required, and the text classification model is trained by considering data of multiple dimensions.

As an embodiment, the acquiring a spoken language text corpus training set and dialogue historical context information includes:

extracting related domain-intentions in the domain set and the intention set based on the domain set and the intention set of the spoken language interaction and a reply template set for feeding back the intention;

extracting reply templates matched with the domain-intention from the reply template set, and determining a domain-intention-dialogue template;

the domain-intent-dialog template is obtained and determined as dialog history context information.

In the present embodiment, there are many contents in the historical contextual dialog information, but three of the domain where the previous dialog is located (pre _ domain), the intention of the user of the previous dialog (pre _ intent), and the reply of the dialog system of the previous dialog (pre _ system reply) are key factors that affect the domain where the next dialog is located. The range of pre _ domain is limited, i.e. the defined set of domains; under a specific domain, the scope of pre _ intent is also limited, i.e., the set of intents defined; the templates of pre _ system reply are also limited and defined set of reply templates, given the particular domain and intent. The system selects a suitable template from the limited templates, and then replaces variables in the template with specific values, thereby generating the final system reply. Therefore, when the dialog corpus is insufficient, a virtual dialog text with context information can be artificially constructed. Pre _ domain, pre _ intent, pre _ system reply (domain-intent-dialog template) are taken together as dialog history context information (dialog _ context for short) of a sentence, for example: "music-playing song-playing { song name }" for you is a complete dialog _ context, "music" indicates that the last round of dialog domain is a music domain, "playing song" indicates the intention of the user of the last round, and "playing { song name }" for you indicates the reply template of the last round of system.

In this embodiment, the set of domains, the set of intentions, and the set of reply templates for feedback intentions for spoken language interaction are pre-configured before the dialog history context information is obtained. For example, a manually predefined set of domains, intents, and reply templates may be obtained in other ways.

With respect to step S12, since the true dialog markup text with dialog _ context is not easily available, and the plain text corpus without dialog _ context is more common, the dialog text needs to be constructed in case of insufficient dialog corpus. The construction method comprises the following steps: randomly selecting a domain from a domain set as pre _ domain of a text corpus, selecting one domain from an intention set supported by the pre _ domain as pre _ intent, and randomly selecting one domain as pre _ systemresply according to a reply template supported by the pre _ intent. The selected pre _ domain, pre _ intent and pre _ systemresply are used as dialog _ context of a sentence, and finally, an original label is modified into a new field label according to the current dialog _ context, and the new label result is in line with the field result in the current dialog scene. By using the corpus construction method, one sentence can construct a plurality of sentences with dialog _ context, thereby enriching the spoken language text corpus training set.

For step S13, the method uses a bidirectional long-short memory network (BLSTM) for modeling. One drawback with conventional LSTM is that it can only utilize previous content from the forward sequence. In text category analysis, future content coming in reverse sequence also plays a crucial role in the judgment of classification. Structured knowledge is extracted by processing the forward and reverse sequences so that complementary information from the past and future can be integrated together for reasoning. Bi-directional LSTM processes data from 2 directions, forward and reverse, with 2 independent hidden layers to achieve the above purpose, and then takes both the hidden layer outputs of the forward and reverse sequences as inputs to the output layer. As shown in fig. 2, dialog _ context and text are simultaneously used as input of the BLSTM model, the model has information of the dialog field, and after the model outputs the field, it is not necessary to make judgment of field selection according to the dialog history information, and the output result of the model is the optimal field classification result according with the current context.

The field classification under the spoken language interaction scene is carried out on the trained text classification model, namely all possible fields are divided for the sentences spoken by the current user according to the above dialogue state. The task is characterized in that the conversation history information at the previous moment has an important influence on the judgment of the field to which the next wheel belongs, and the field classification results of texts are different under different conversation histories. For example, the sentence "in rush year of playing" is divided into fields, and the sentence can belong to the fields of "music" and "movie" because the sentence "in rush year" is the name of both a song and a movie. If dialog _ context is "music-play song-is playing { song name } for you", then there is a greater likelihood that the field of the sentence is "music"; if dialog _ context is "movie-find movie-has found { quantity } movie name } resources for you, then there is a greater likelihood that the sentence belongs to" movie ". This allows the text classification model to learn the domain classification of spoken text from the dialog history context information.

According to the embodiment, key factors influencing the field of the next pair of dialogs are extracted, the historical context information of the dialogs is determined, a large number of virtual dialog texts are constructed through the historical context information of the dialogs, and the situation that the linguistic data of a spoken language text corpus training set is insufficient is made up; and the historical context information of the dialogue is used as a part of the input of the training model, and the output of the model is the final domain result which is in line with the current dialogue scene. The whole system does not have a complicated process of manually judging the field, time and labor are saved, and the historical context information can be conversed to help the model to improve the accuracy of field classification.

As an implementation manner, in this embodiment, the training the text classification model through the dialog history context information and the corpus-augmented spoken language text corpus training set includes:

training the dialogue historical context information as an input layer of a text classification model of the bidirectional long-time and short-time memory network; or

Training the dialogue historical context information as an output layer of a text classification model of the bidirectional long-time and short-time memory network; or

And simultaneously training the conversation history context information as an input layer and an output layer of the text classification model of the bidirectional long-and-short time memory network.

In this embodiment, the input to the BLSTM model is embedding for each word or word. The output layer is a linear classifier, and the input of the linear classifier is the splicing of hidden layers at the two ends of the last moment of the BLSTM. The difference between the invention and the traditional BLSTM model is two points: (1) the dialog history context information is used as part of the model input. There are various ways to add the context information of the dialog history to the model, i.e. it can be added to the input layer of BLSTM, or it can be used as input to the output layer, or it can be added to both the input layer and the output layer, as shown in fig. 3. (2) The output of the model is a "1" and "-1" representation for each domain, where a "1" represents belonging to the domain and a "-1" represents not belonging to the domain, and the output of the overall system, i.e., all domains in the model whose output is "1", are sorted by probability score.

The method uses two test sets to evaluate the classification performance of the system, namely a correct text of the audio by manual transcription and a recognized text recognized by a voice system, wherein the correct text comprises 2 ten thousand sentences, the recognized text comprises 15 ten thousand sentences, and the performance is shown in figure 4. The basic system is a classification result of a traditional method, namely, conversation historical context information is not added as input of a model, and after the model outputs a field, a final field needs to be selected according to the conversation historical information. It can be seen from the figure that the system uses three methods of adding the historical context information of the conversation, and the performance of the system is improved relative to that of the basic system.

Fig. 5 is a schematic structural diagram of a training system for a text classification model for spoken language interaction according to an embodiment of the present invention, which can execute the training method for a text classification model for spoken language interaction according to any of the above embodiments and is configured in a terminal.

The training system of the text classification model for spoken language interaction provided by the embodiment comprises: an information acquisition program module 11, a corpus expansion program module 12 and a model training program module 13.

The information acquisition program module 11 is configured to acquire a spoken language text corpus training set and dialogue history context information; the corpus expansion program module 12 is configured to perform corpus expansion on the spoken language text corpus training set through the dialog history context information to enrich the spoken language text corpus training set; the model training program module 13 is configured to establish a text classification model based on a bidirectional long-and-short-term memory network, and train the text classification model through the dialog history context information and the spoken language text corpus training set after corpus expansion, so that the text classification model learns the domain classification of the spoken language text through the dialog history context information.

Further, the information acquisition program module is configured to:

Further, the set of domains, the set of intentions, and the set of reply templates for feedback intentions of the spoken language interaction are pre-configured before the dialog history context information is obtained.

Further, the model training program module is to:

The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the training method of the text classification model for spoken language interaction in any method embodiment;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

As a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the methods in embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium, which when executed by a processor, perform a method of training a text classification model for spoken language interaction in any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a text classification model for spoken language interaction of any embodiment of the present invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.

(4) Other electronic devices with data processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for training a text classification model for spoken language interaction, comprising:

2. The method of claim 1, wherein said obtaining a training set of spoken text corpora and dialog history context information comprises:

3. The method of claim 2, wherein the set of domains for spoken interaction, the set of intentions, and the set of reply templates for feedback intentions are pre-configured prior to obtaining the dialog history context information.

4. The method of claim 1, wherein the training of the text classification model with the dialog history context information and the corpus-augmented spoken-language-text corpus training set comprises:

5. A training system for a text classification model for spoken language interaction, comprising:

6. The system of claim 5, wherein the information acquisition program module is to:

7. The system of claim 6, wherein the set of domains for spoken interaction, the set of intentions, and the set of reply templates for feedback intentions are pre-configured prior to obtaining the dialog history context information.

8. The system of claim 5, wherein the model training program module is to:

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.

10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.