CN111462734A - Semantic slot filling model training method and system - Google Patents

Semantic slot filling model training method and system

Info

Publication number
CN111462734A
CN111462734A
Authority
CN
China
Prior art keywords
training
semantic slot
semantic
value pair
filling model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010248117.7A
Other languages
Chinese (zh)
Other versions
CN111462734B (en)
Inventor
俞凯
刘辰
朱苏
陈露
曹瑞升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN202010248117.7A priority Critical patent/CN111462734B/en
Publication of CN111462734A publication Critical patent/CN111462734A/en
Application granted granted Critical
Publication of CN111462734B publication Critical patent/CN111462734B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a semantic slot filling model training method. The method comprises the following steps: training a first training data set with labels to generate a first semantic slot filling model; inputting a second training data set of automatic speech recognition hypotheses into the first semantic slot filling model, and determining a first semantic slot value pair; correcting the first semantic slot value pair with a rule-based error correction module to determine a second semantic slot value pair, wherein the error correction module corrects the first semantic slot value pair based on preset rules; and performing policy gradient training on the first semantic slot filling model based on the second semantic slot value pair, and determining a trained second semantic slot filling model. The embodiment of the invention also provides a semantic slot filling model training system. Through reinforcement learning, the embodiments introduce rule-based error correction directly into the training method for the slot filling task in spoken language semantic understanding, thereby improving the robustness of semantic understanding to speech recognition errors.

Description

Semantic slot filling model training method and system
Technical Field
The invention relates to the field of intelligent speech, and in particular to a semantic slot filling model training method and system.
Background
Spoken language semantic understanding is a technique for converting the output of automatic speech recognition into a structured semantic representation, and it is therefore very sensitive to speech recognition errors. Semantic slot filling is typically used in semantic understanding. To improve the robustness of semantic understanding to speech recognition errors, the slot values predicted by semantic slot filling are corrected with a rule-based correction model, thereby ensuring the accuracy of spoken language semantic understanding.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
the drawback of these methods is that the slot filling model and the rule-based error correction model are independent of each other: because the two models are trained separately, the quality of the corrected result is largely limited by the rule-based error correction model. Error correction, however, should remain a post-processing module and should not overly dominate spoken language semantic understanding. As a result, spoken language semantic understanding is not robust to speech recognition errors.
Disclosure of Invention
The method aims to at least solve the problem in the prior art that the slot filling model and the rule-based error correction model in spoken language semantic understanding are independent of each other, so that spoken language understanding is not robust to speech recognition errors.
In a first aspect, an embodiment of the present invention provides a semantic slot filling model training method, including:
training a first training data set with labels to generate a first semantic slot filling model;
inputting a second training data set of automatic speech recognition into the first semantic slot filling model, and determining a first semantic slot value pair;
correcting the first semantic slot value pair by an error correction module based on rules to determine a second semantic slot value pair, wherein the error correction module corrects the first semantic slot value pair based on preset rules;
and performing policy gradient training on the first semantic slot filling model based on the second semantic slot value pair, and determining a trained second semantic slot filling model.
In a second aspect, an embodiment of the present invention provides a semantic slot filling model training system, including:
the data training program module is used for training a first training data set with labels to generate a first semantic slot filling model;
a semantic slot value pair determining program module, configured to input a second training data set for automatic speech recognition to the first semantic slot filling model, and determine a first semantic slot value pair;
a correcting program module, configured to correct the first semantic slot value pair by using a rule-based error correcting module, and determine a second semantic slot value pair, where the error correcting module corrects the first semantic slot value pair based on a preset rule;
and the semantic slot filling model training program module is used for performing policy gradient training on the first semantic slot filling model based on the second semantic slot value pair to determine a trained second semantic slot filling model.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the semantic slot filling model training method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the semantic slot filling model training method according to any embodiment of the present invention.
The embodiment of the invention has the following beneficial effects: rule-based error correction is introduced directly into the training method through reinforcement learning and used for the slot filling task in spoken language semantic understanding. On the one hand, domain knowledge is utilized; on the other hand, the slot filling and error correction modules are connected, thereby improving the robustness of semantic understanding to speech recognition errors.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart of a semantic slot filling model training method according to an embodiment of the present invention;
FIG. 2 is a model architecture diagram of a semantic slot filling model training method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating results of a test set of a semantic slot filling model training method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an example of a semantic slot filling model training method according to an embodiment of the present invention;
FIG. 5 is a performance diagram of a semantic slot filling model training method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a semantic slot filling model training system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a semantic slot filling model training method according to an embodiment of the present invention, which includes the following steps:
S11: training a first training data set with labels to generate a first semantic slot filling model;
S12: inputting a second training data set of automatic speech recognition into the first semantic slot filling model, and determining a first semantic slot value pair;
S13: correcting the first semantic slot value pair by a rule-based error correction module to determine a second semantic slot value pair, wherein the error correction module corrects the first semantic slot value pair based on preset rules;
S14: performing policy gradient training on the first semantic slot filling model based on the second semantic slot value pair, and determining a trained second semantic slot filling model.
In this embodiment, to overcome the defects of the prior art, an error correction module is additionally introduced into the training process of the slot filling model. Since the correction process is rule-based and non-differentiable, training is performed with a policy gradient method from reinforcement learning. Because the correction module is taken into account during training, the output of the slot filling model is better suited to the correction module, which improves the robustness of semantic understanding to speech recognition. The method thus comprises two modules: a semantic slot filling model and a rule-based error correction module.
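For illustration only, the following minimal Python sketch mirrors steps S11-S14 with toy stand-ins. Every name in it (pretrain, rule_based_correct, policy_gradient_step, the one-rule table, and the last-two-words "tagger") is a hypothetical simplification for exposition, not the patent's actual implementation:

```python
def pretrain(model, labeled_data):
    # S11: supervised training on the labeled first training data set (stub).
    model["bias"] = 0.0

def rule_based_correct(slot_value):
    # S13: one illustrative preset rule mapping a known ASR confusion.
    rules = {"quiet bay": "quiet zone"}
    slot, value = slot_value
    return (slot, rules.get(value, value))

def policy_gradient_step(model, reward, lr=0.1):
    # S14: nudge the model toward outputs that score well AFTER correction (stub).
    model["bias"] += lr * reward

def train_pipeline(labeled_data, asr_data, epochs=3):
    model = {"bias": 0.0}                    # first semantic slot filling model
    pretrain(model, labeled_data)            # S11
    for _ in range(epochs):
        for utterance, gold in asr_data:
            # S12: stub tagger predicting the last two words as the slot value.
            predicted = ("endpoint", " ".join(utterance.split()[-2:]))
            corrected = rule_based_correct(predicted)    # S13
            reward = 1.0 if corrected == gold else -1.0  # scored after correction
            policy_gradient_step(model, reward)          # S14
    return model                             # trained second model

trained = train_pipeline(
    labeled_data=[("i want to go to the quiet zone", ("endpoint", "quiet zone"))],
    asr_data=[("i want to go to the quiet bay", ("endpoint", "quiet zone"))],
)
```

The essential point the sketch preserves is that the reward is computed on the corrected output, so the error correction module participates in training rather than remaining an isolated post-processor.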
For step S11, appropriate data needs to be prepared for training the semantic slot filling model, including manually labeled real text and the text of speech recognition hypotheses; both are used during the training phase. The first kind of data is manually labeled real text in which every word is explicitly labeled, so the slot filling task can be treated as a sequence labeling task for training the semantic slot filling model.
As an embodiment, the first training data set with labels is trained via a bidirectional long short-term memory (BLSTM) network.
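As a non-authoritative sketch, such a BLSTM tagger could be written in PyTorch as follows. The layer sizes, the tag inventory, and the plain linear output layer are assumptions chosen for brevity (the decoder actually described later in this patent is an attentive LSTM):

```python
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Minimal BLSTM sequence tagger producing per-token BIO slot labels."""
    def __init__(self, vocab_size, num_tags, emb_dim=200, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):                 # (batch, seq_len)
        h, _ = self.blstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
        return self.out(h)                        # per-token tag logits

# Supervised training on the labeled data (cross-entropy = negative
# log-likelihood over the gold BIO tags):
model = BLSTMTagger(vocab_size=1000, num_tags=7)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, 1000, (2, 6))           # toy batch
tags = torch.randint(0, 7, (2, 6))
loss = nn.CrossEntropyLoss()(model(tokens).reshape(-1, 7), tags.reshape(-1))
loss.backward()
opt.step()
```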
For step S12, the text of the speech recognition hypotheses, i.e., the automatic speech recognition output, is input into the semantic slot filling model trained in step S11. As shown in the model architecture diagram of FIG. 2, the user says "I want to go to the quiet zone", but because of a speech recognition error this is misrecognized as "I want to go to the quiet bay". An erroneous first semantic slot value pair is thus obtained; for convenience of representation it is written as a semantic triple, yielding inform(endpoint=quiet bay). Because the slot value pair is wrong, "I want to go to the quiet bay" cannot receive correct alignment labels.
For step S13, the error correction module consists of a number of rules that are gradually enriched by continuously collecting speech recognition errors from daily use. The triple inform(endpoint=quiet bay) is corrected by the error correction module, recovering the user's original meaning "I want to go to the quiet zone" and giving the corrected slot value pair inform(endpoint=quiet zone). The real text is then aligned and labeled, for example with the "BIO" tags O, O, B-inform-endpoint, I-inform-endpoint, I-inform-endpoint.
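A toy version of such a rule table, reusing the example above, might look like the following sketch; the rule contents are purely illustrative:

```python
# Hypothetical preset rules collected from observed ASR confusions.
CORRECTION_RULES = {
    ("endpoint", "quiet bay"): "quiet zone",   # known misrecognition
}

def correct(slot_value_pair):
    """Return the corrected (slot, value) pair if a preset rule matches,
    otherwise return the pair unchanged (EC is a post-processing lookup)."""
    slot, value = slot_value_pair
    return (slot, CORRECTION_RULES.get((slot, value), value))

assert correct(("endpoint", "quiet bay")) == ("endpoint", "quiet zone")
assert correct(("endpoint", "museum")) == ("endpoint", "museum")
```

In practice the module described further below also matches corrupted values against the domain ontology by n-gram similarity rather than by exact lookup alone.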
For step S14, the correction process is rule-based and therefore non-differentiable, so it is trained with a policy gradient method from reinforcement learning, which comprises a pre-training stage and an RL (reinforcement learning) training stage.
It can be seen from this embodiment that rule-based error correction is introduced directly into the training method through reinforcement learning for the slot filling task in spoken language semantic understanding. On the one hand, domain knowledge is utilized; on the other hand, the slot filling and error correction modules are connected, thereby improving the robustness of semantic understanding to speech recognition errors.
As an implementation manner, in this embodiment, after determining the trained second semantic slot filling model, the method further includes:
receiving a test data set;
inputting the test data set into the second semantic slot filling model, and determining slot value pairs before correction;
and inputting the slot value pair before correction into the error correction module to obtain a final slot value pair.
In this embodiment, to verify the effect of the trained second semantic slot filling model, a prepared test data set (for example, hypothesis texts from speech recognition) is input into the second semantic slot filling model to obtain the slot value pairs before correction. The slot value pairs before correction are then input into the error correction module to obtain the final slot value pairs.
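A hypothetical test loop matching this procedure might look as follows; model, error_correction, and the data format are assumed interfaces, not the patent's API:

```python
def evaluate(model, error_correction, test_set):
    """Predict slot value pairs with the trained second model, pass them
    through the EC module, then score sentence-level accuracy."""
    num_correct = 0
    for utterance, gold_pairs in test_set:
        raw_pairs = model.predict(utterance)           # pairs before correction
        final_pairs = {error_correction(p) for p in raw_pairs}
        num_correct += int(final_pairs == set(gold_pairs))
    return num_correct / len(test_set)
```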
It can be seen from this embodiment that the robustness of semantic understanding to speech recognition errors is further improved by test checking.
To address this issue, the method proposes a policy-gradient-based reinforcement learning (RL) method to optimize the SLU (Spoken Language Understanding) model so as to take into account the final performance after error correction.
The method is now fully described after defining some symbols used hereinafter. Let r = (r_1, ..., r_|r|) and u = (u_1, ..., u_|u|) denote the ASR (Automatic Speech Recognition) best hypothesis text and the real text, respectively; let y = (y_1, ..., y_|y|) denote the sentence-level semantic labels in the form of act(slot=value) triples; and let o = (o_1, ..., o_|u|) denote the word-level labels on u in the "BIO" scheme (B-begin, I-inside, O-outside).
The BLSTM (Bidirectional Long Short-Term Memory) encoder reads the input sequence x (u or r) and generates a hidden state h_t = [→h_t; ←h_t] at the t-th time step, where →h_t and ←h_t are the forward and backward hidden states. The LSTM decoder recursively updates its hidden state at the t-th time step by s_t = LSTM(s_{t-1}, ψ(o_{t-1}), c_t), where ψ(·) is a label embedding function and c_t comes from the focus mechanism, i.e., only the aligned hidden state h_t is considered. s_0 is initialized from the final hidden state of the encoder. Then the slot tag o_t is generated by P(o_t | o_{<t}; x) = g(s_t), where g denotes a linear layer followed by a softmax function for classification.
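A simplified PyTorch sketch of this decoder follows. The sizes are illustrative, s_0 is zero-initialized here instead of from the encoder state, and greedy decoding stands in for the beam search used later; none of this is the patent's exact implementation:

```python
import torch
import torch.nn as nn

class FocusDecoder(nn.Module):
    """LSTM decoder whose step-t input is the embedded previous tag
    psi(o_{t-1}) and the aligned encoder state c_t = h_t (focus mechanism),
    followed by a linear layer + softmax tag classifier g."""
    def __init__(self, num_tags, enc_dim=512, tag_emb=32, hidden=256):
        super().__init__()
        self.psi = nn.Embedding(num_tags + 1, tag_emb)     # +1 for <bos>
        self.cell = nn.LSTMCell(tag_emb + enc_dim, hidden)
        self.g = nn.Linear(hidden, num_tags)

    def forward(self, enc_states):                 # (seq_len, enc_dim)
        s = torch.zeros(1, self.cell.hidden_size)  # s_0 (simplified init)
        mem = torch.zeros_like(s)
        prev_tag = torch.tensor([self.psi.num_embeddings - 1])  # <bos>
        tags = []
        for t in range(enc_states.size(0)):
            c_t = enc_states[t:t + 1]              # focus: aligned state h_t
            x = torch.cat([self.psi(prev_tag), c_t], dim=-1)
            s, mem = self.cell(x, (s, mem))        # s_t = LSTM(s_{t-1}, ., c_t)
            prev_tag = self.g(s).softmax(-1).argmax(-1)  # P(o_t | o_<t; x)
            tags.append(int(prev_tag))
        return tags

tags = FocusDecoder(num_tags=7)(torch.randn(6, 512))   # one tag per token
```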
Assume there is a predicted act triple a(s=v). Denote the candidate value set for the corresponding act-slot pair in the current domain ontology as V = (V_1, ..., V_|V|). Based on the ontology, an n-gram vocabulary G_n is first constructed. Each value is treated as a word sequence v = (v_1, ..., v_M) with n-gram set v_n = {(v_i, ..., v_{i+n-1}) | i = 1, ..., M-n+1}. Then a binary feature vector d = (d_1, ..., d_{|G_n|}) is built for v, where d_i indicates whether the i-th n-gram of G_n occurs in v, and d is normalized by its L2 norm. Similarly, the candidate value set V can be represented as a feature matrix (after normalization) D, whose k-th column is the feature vector of V_k. The best candidate value can therefore be found in a manner similar to cosine similarity; the index of the best value is k* = argmax_k d^T D_{:,k}. Since some slots have many possible values in the ontology, computing this as a single matrix multiplication greatly improves efficiency. In practice, n ranges from 1 to 2, so the vocabulary size equals |G_1| + |G_2|. A threshold (here 0.5) is set to reject bad selections.
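A minimal NumPy sketch of this n-gram matching, with a toy three-value ontology (the values and names are illustrative):

```python
import numpy as np

def ngram_set(words, n):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def feature_vector(words, vocab):
    """Binary n-gram indicator vector d for a value, L2-normalized."""
    grams = ngram_set(words, 1) | ngram_set(words, 2)     # n in {1, 2}
    d = np.array([1.0 if g in grams else 0.0 for g in vocab])
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d

candidates = [["quiet", "zone"], ["people", "square"], ["west", "station"]]
vocab = sorted({g for v in candidates for n in (1, 2) for g in ngram_set(v, n)})
D = np.stack([feature_vector(v, vocab) for v in candidates], axis=1)

def best_value(predicted_words, threshold=0.5):
    """One matrix product scores all candidates at once (cosine similarity,
    since every column is unit-norm); reject matches below the threshold."""
    d = feature_vector(predicted_words, vocab)
    scores = d @ D
    k = int(np.argmax(scores))
    return candidates[k] if scores[k] >= threshold else None

print(best_value(["quiet", "bay"]))   # -> ['quiet', 'zone'] (score ~0.58)
```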
To prune the large search space, the model is pre-trained on labeled real text to bootstrap the RL training.
Let D_tscp = {(u, o)} denote the real text with alignment labels. The slot filling model is supervised with the negative log-likelihood loss L(θ) = -Σ_{(u,o)∈D_tscp} Σ_t log P(o_t | o_{<t}; u), where θ denotes the model parameters.
In the RL training phase, automatic speech recognition hypotheses without alignment labels are used, denoted D_hyp = {(r, y)}. The slot filling model samples K tag sequences via beam search, which are then converted into act(slot=value) triples. Finally, a set of semantic tuples {ŷ^(1), ..., ŷ^(K)} is produced after the EC (error correction) module. For each input utterance r, the reward considers both the triple level and the sentence level: R(ŷ, y) = -(FP + FN) + 1(ŷ = y), where the first term penalizes false positives (FP) and false negatives (FN) at the triple level, and the second term is a binary value indicating whether the entire sentence is predicted correctly. The model is optimized by maximizing the expected cumulative reward with policy gradient methods. The policy gradient can be estimated as ∇_θ J(θ) ≈ (1/K) Σ_{k=1}^{K} (R(ŷ^(k), y) - b) ∇_θ log P(ŷ^(k) | r; θ), where b is a baseline for reducing the variance of the gradient estimate, obtained by averaging the rewards inside the beam.
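The gradient estimator above is REINFORCE with a beam-mean baseline. A minimal PyTorch sketch, with toy sequence log-probabilities standing in for the decoder's beam search output:

```python
import torch

def policy_gradient_loss(log_probs, rewards):
    """Loss whose gradient matches the estimator above:
    -(1/K) * sum_k (R_k - b) * log P(y_k | r), with b = mean beam reward."""
    rewards = torch.as_tensor(rewards, dtype=torch.float)
    baseline = rewards.mean()            # reduces variance of the estimate
    advantages = rewards - baseline
    return -(advantages * log_probs).mean()

# Toy beam of K = 3 hypotheses; requires_grad stands in for the decoder graph.
log_probs = torch.tensor([-1.2, -2.5, -3.1], requires_grad=True)
rewards = [1.0, -1.0, -2.0]              # -(FP + FN) + sentence-level bonus
loss = policy_gradient_loss(log_probs, rewards)
loss.backward()                          # raises log P of high-reward beams
```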
To stabilize the training process, it is beneficial to train alternately on D_tscp and D_hyp.
Experiments were conducted on the first Chinese Audio-Textual Spoken Language Understanding challenge (CATSLU) dataset, which contains four dialogue domains (map, music, video, weather).
The 200-dimensional character embeddings are initialized by pre-training an LSTM-based bidirectional language model (biLM) on the zhwiki corpus. The LSTM is a single layer with 256 hidden units. During training, parameters are uniformly sampled in the range (-0.2, 0.2), and dropout with probability 0.5 is applied to the non-recurrent layers. Adam is selected as the optimizer; the learning rate is set to 0.001 in pre-training and 0.0005 in RL training, and is fixed during training. The beam size is set to 5 in the decoding phase. The best model is selected based on performance on the validation set, and the F-score and sentence-level accuracy of act(slot=value) triples are then evaluated.
The main results are compared with different baselines. In the evaluation phase, error correction is applied to all experiments. The following baselines were studied:
HD: only unaligned data is employed.
Focus: trained on the annotated real text and evaluated on the ASR hypotheses.
UA: changes the slot filling model from BLSTM to Focus.
DA: a data augmentation method in which the training data is augmented with pseudo-aligned ASR hypotheses in two ways: (1) generated by a pre-trained tagging model (Gen); (2) aligned with the real text by minimum edit distance (Align).
FIG. 3 shows the overall results on the test set. The results show that models trained in an end-to-end fashion on unaligned data ("HD") are less effective than models trained on labeled data ("Focus"). The "UA" method transfers from real text to ASR hypotheses and obtains results comparable to Focus. No improvement was found with the "UA" and "DA" methods, possibly due to the noisy dataset. Compared with the "Focus" and "DA" baselines, the proposed model achieves significant improvements except in the music domain (at the 95% significance level in the video and weather domains and 90% in the map domain).
FIG. 4 gives an example of how RL training benefits slot filling. The baseline model identifies two value chunks, "company" and "Ganhezi town", separated by the special word "is", which would generate erroneous slot value pairs. Through RL, the SLU model learns to produce outputs that are more suitable for correction.
The effectiveness of each sub-module in the model was studied through ablation experiments. As can be seen from the upper half of the performance diagram in FIG. 5, if training uses only the ASR hypotheses D_hyp (i.e., without "Tscp"), performance degrades due to the lack of the strong supervisory signal from the real text. Without pre-training ("PT"), system performance also decreases (by 0.47% in F-score and 0.72% in joint accuracy), indicating the importance of pre-training. Furthermore, without any real-text supervision, average performance drops dramatically, because searching in a large space is difficult.
Fig. 6 is a schematic structural diagram of a semantic slot filling model training system according to an embodiment of the present invention, which can execute the semantic slot filling model training method according to any of the above embodiments and is configured in a terminal.
The semantic slot filling model training system provided by the embodiment comprises: data training program module 11, semantic slot value pair determination program module 12, correction program module 13, and semantic slot filling model training program module 14.
The data training program module 11 is configured to train a first training data set with labels to generate a first semantic slot filling model; the semantic slot value pair determining program module 12 is configured to input the second training data set for automatic speech recognition to the first semantic slot filling model, and determine a first semantic slot value pair; the correcting program module 13 is configured to correct the first semantic slot value pair by using a rule-based error correcting module, and determine a second semantic slot value pair, where the error correcting module corrects the first semantic slot value pair based on a preset rule; the semantic slot filling model training program module 14 is configured to perform policy gradient training on the first semantic slot filling model based on the second semantic slot value pair, and determine a trained second semantic slot filling model.
Further, the system includes a test program module for:
receiving a test data set;
inputting the test data set into the second semantic slot filling model, and determining slot value pairs before correction;
and inputting the slot value pair before correction into the error correction module to obtain a final slot value pair.
Further, the data training program module is to:
and training the first training data set with the labels through a bidirectional long-time memory network.
Further, the semantic slot value pair comprises a semantic triple.
Further, the policy gradient training comprises a pre-training stage and an RL (reinforcement learning) training stage.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the semantic slot filling model training method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
training a first training data set with labels to generate a first semantic slot filling model;
inputting a second training data set of automatic speech recognition into the first semantic slot filling model, and determining a first semantic slot value pair;
correcting the first semantic slot value pair by an error correction module based on rules to determine a second semantic slot value pair, wherein the error correction module corrects the first semantic slot value pair based on preset rules;
and performing policy gradient training on the first semantic slot filling model based on the second semantic slot value pair, and determining a trained second semantic slot filling model.
As a non-volatile computer-readable storage medium, it may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the semantic slot filling model training method of any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the apparatus includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the semantic slot filling model training method of any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: such devices can display and play multimedia content, and include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A semantic slot filling model training method comprises the following steps:
training a first training data set with labels to generate a first semantic slot filling model;
inputting a second training data set of automatic speech recognition into the first semantic slot filling model, and determining a first semantic slot value pair;
correcting the first semantic slot value pair by an error correction module based on rules to determine a second semantic slot value pair, wherein the error correction module corrects the first semantic slot value pair based on preset rules;
and performing policy gradient training on the first semantic slot filling model based on the second semantic slot value pair, and determining a trained second semantic slot filling model.
2. The method of claim 1, wherein after determining the trained second semantic slot filling model, the method further comprises:
receiving a test data set;
inputting the test data set into the second semantic slot filling model, and determining slot value pairs before correction;
and inputting the slot value pair before correction into the error correction module to obtain a final slot value pair.
3. The method of claim 1, wherein the training the labeled first training data set comprises:
and training the first training data set with the labels through a bidirectional long-time memory network.
4. The method of claim 1, wherein the semantic slot value pairs comprise semantic triples.
5. The method of claim 1, wherein the policy gradient training comprises a pre-training stage and an RL (reinforcement learning) training stage.
6. A semantic slot filling model training system, comprising:
the data training program module is used for training a first training data set with labels to generate a first semantic slot filling model;
a semantic slot value pair determining program module, configured to input a second training data set for automatic speech recognition to the first semantic slot filling model, and determine a first semantic slot value pair;
a correcting program module, configured to correct the first semantic slot value pair by using a rule-based error correcting module, and determine a second semantic slot value pair, where the error correcting module corrects the first semantic slot value pair based on a preset rule;
and the semantic slot filling model training program module is used for performing policy gradient training on the first semantic slot filling model based on the second semantic slot value pair to determine a trained second semantic slot filling model.
7. The system of claim 6, wherein the system further comprises a test program module to:
receiving a test data set;
inputting the test data set into the second semantic slot filling model, and determining slot value pairs before correction;
and inputting the slot value pair before correction into the error correction module to obtain a final slot value pair.
8. The system of claim 6, wherein the data training program module is to:
and training the first training data set with the labels through a bidirectional long-time memory network.
9. The system of claim 6, wherein the semantic slot value pairs comprise semantic triples.
10. The system of claim 6, wherein the policy gradient training comprises a pre-training stage and an RL (reinforcement learning) training stage.
CN202010248117.7A 2020-03-31 2020-03-31 Semantic slot filling model training method and system Active CN111462734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010248117.7A CN111462734B (en) 2020-03-31 2020-03-31 Semantic slot filling model training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010248117.7A CN111462734B (en) 2020-03-31 2020-03-31 Semantic slot filling model training method and system

Publications (2)

Publication Number Publication Date
CN111462734A (en) 2020-07-28
CN111462734B CN111462734B (en) 2022-07-26

Family

ID=71684351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010248117.7A Active CN111462734B (en) 2020-03-31 2020-03-31 Semantic slot filling model training method and system

Country Status (1)

Country Link
CN (1) CN111462734B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951789A (en) * 2020-08-14 2020-11-17 北京达佳互联信息技术有限公司 Training of speech recognition model, speech recognition method, apparatus, device and medium
CN112380327A (en) * 2020-11-09 2021-02-19 天翼爱音乐文化科技有限公司 Cold-start slot filling method, system, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110144986A1 (en) * 2009-12-10 2011-06-16 Microsoft Corporation Confidence calibration in automatic speech recognition systems
CN107240398A (en) * 2017-07-04 2017-10-10 科大讯飞股份有限公司 Intelligent sound exchange method and device
CN108417205A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Semantic understanding training method and system
CN108628830A (en) * 2018-04-24 2018-10-09 北京京东尚科信息技术有限公司 A kind of method and apparatus of semantics recognition
CN108920497A (en) * 2018-05-23 2018-11-30 北京奇艺世纪科技有限公司 A kind of man-machine interaction method and device
CN108962224A (en) * 2018-07-19 2018-12-07 苏州思必驰信息科技有限公司 Speech understanding and language model joint modeling method, dialogue method and system
CN110929875A (en) * 2019-10-12 2020-03-27 平安国际智慧城市科技股份有限公司 Intelligent language learning method, system, device and medium based on machine learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110144986A1 (en) * 2009-12-10 2011-06-16 Microsoft Corporation Confidence calibration in automatic speech recognition systems
CN107240398A (en) * 2017-07-04 2017-10-10 科大讯飞股份有限公司 Intelligent sound exchange method and device
CN108417205A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Semantic understanding training method and system
CN108628830A (en) * 2018-04-24 2018-10-09 北京京东尚科信息技术有限公司 A kind of method and apparatus of semantics recognition
CN108920497A (en) * 2018-05-23 2018-11-30 北京奇艺世纪科技有限公司 A kind of man-machine interaction method and device
CN108962224A (en) * 2018-07-19 2018-12-07 苏州思必驰信息科技有限公司 Speech understanding and language model joint modeling method, dialogue method and system
CN110929875A (en) * 2019-10-12 2020-03-27 平安国际智慧城市科技股份有限公司 Intelligent language learning method, system, device and medium based on machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HOU Lixian (侯丽仙) et al.: "Joint recognition of intent and semantic slot filling incorporating multiple constraints", Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951789A (en) * 2020-08-14 2020-11-17 北京达佳互联信息技术有限公司 Training of speech recognition model, speech recognition method, apparatus, device and medium
CN111951789B (en) * 2020-08-14 2021-08-17 北京达佳互联信息技术有限公司 Training of speech recognition model, speech recognition method, apparatus, device and medium
CN112380327A (en) * 2020-11-09 2021-02-19 天翼爱音乐文化科技有限公司 Cold-start slot filling method, system, device and storage medium
CN112380327B (en) * 2020-11-09 2022-03-04 天翼爱音乐文化科技有限公司 Cold-start slot filling method, system, device and storage medium

Also Published As

Publication number Publication date
CN111462734B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
US11238845B2 (en) Multi-dialect and multilingual speech recognition
US11586930B2 (en) Conditional teacher-student learning for model training
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN110556100B (en) Training method and system of end-to-end speech recognition model
CN107844481B (en) Text recognition error detection method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111382231B (en) Intention recognition system and method
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN111462734B (en) Semantic slot filling model training method and system
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
CN110992943B (en) Semantic understanding method and system based on word confusion network
CN115017890A (en) Text error correction method and device based on character pronunciation and character font similarity
CN113571045B (en) Method, system, equipment and medium for identifying Minnan language voice
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN113705207A (en) Grammar error recognition method and device
CN113160801B (en) Speech recognition method, device and computer readable storage medium
CN115525749A (en) Voice question-answering method, device, electronic equipment and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN115713082A (en) Named entity identification method, device, equipment and storage medium
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN112560431A (en) Method, apparatus, device, storage medium, and computer program product for generating test question tutoring information
CN113096646A (en) Audio recognition method and device, electronic equipment and storage medium
CN112735380B (en) Scoring method and voice recognition method for re-scoring language model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant
GR01 Patent grant