CN113160803A - End-to-end voice recognition model based on multilevel identification and modeling method - Google Patents

End-to-end voice recognition model based on multilevel identification and modeling method

Info

Publication number
CN113160803A
CN113160803A (application number CN202110642751.3A)
Authority
CN
China
Prior art keywords
output
module
character
sequence
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110642751.3A
Other languages
Chinese (zh)
Inventor
唐健 (Tang Jian)
胡宇晨 (Hu Yuchen)
戴礼荣 (Dai Lirong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110642751.3A
Publication of CN113160803A
Legal status: Pending

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an end-to-end speech recognition modeling method based on multi-level identification, which includes decoding inference. The decoding inference adopts a post-inference algorithm comprising the following steps: the model generates the posterior-probability output sequence $\hat{y}^i$ corresponding to the fine-grained text sequence; each unit $\hat{y}^i_t$ of the output sequence uniquely corresponds to a coarse-grained subsequence $\delta^j_t$; the model computes the log-likelihood of generating the coarse-grained subsequence $\delta^j_t$ and uses this value as a cross-validation of the existing predicted output sequence $\hat{y}^i$; finally, the existing decoding paths are pruned according to the likelihood scores obtained in the two preceding steps, ensuring that the search paths remain within the beam width.

Description

End-to-end voice recognition model based on multilevel identification and modeling method
Technical Field
The invention relates to the technical field of speech recognition, and in particular to an end-to-end speech recognition model based on multi-level identification and a modeling method.
Background
End-to-end (E2E) automatic speech recognition (ASR) based on an encoder-decoder framework directly models the sequence mapping relationship between an input audio sequence and the output text. The advantages of a simple framework and no need for linguistic background knowledge have made this structure increasingly popular in both academia and industry.
In end-to-end ASR, an input speech sequence may be mapped to text sequences at different hierarchy levels, so the mapping relationship between the speech sequence and the text sequence is one-to-many. In Chinese ASR, the text sequence may consist of Pinyin or Chinese characters; in English ASR, the text sequence may be composed of words or characters.
In general, modeling with word-level text sequences is the most desirable choice in end-to-end speech recognition: the model output needs no further conversion through a dictionary, realizing end-to-end modeling in the full sense. However, if word-level text is adopted for modeling, the required model capacity and parameter count are large. Character-level text sequences are an alternative: they effectively control model size and parameter count, but their ability to capture long-range context in the speech signal is insufficient, and prior research shows that character-level text sequences perform poorly on large-vocabulary continuous speech recognition tasks.
In recent years, with the development of deep learning (DL), automatic speech recognition (ASR) has made great progress. The traditional deep-learning-based ASR framework is a hybrid architecture consisting of several independent components trained under conditional-independence approximations. Newer ASR research instead focuses on end-to-end approaches that model the mapping between the input audio and the target text sequence, such as Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), the Segmental Conditional Random Field (SCRF), attention-based encoder-decoder (AED) models, and Transformer models. Compared with the traditional hybrid architecture, end-to-end ASR reduces the dependence on linguistic information and simplifies the system structure.
An end-to-end sequence mapping method maps an input audio sequence to target text, and the target text sequence may be composed of text at different levels; for example, English text may be composed of words, sub-words, or characters. Identifications at different levels have corresponding advantages and disadvantages.
Word-level text representation is the most common form of text representation in practice. As the target sequence of end-to-end speech recognition it is the most ideal choice and also conforms to the application assumption of end-to-end speech recognition. It has a further advantage: the output of a word-level model is consistent with the performance evaluation metric, avoiding a mismatch between the model's optimization target and the evaluation metric. When the required corpus of text-labeled training words is sufficient, word-level text is the ideal choice for end-to-end speech recognition modeling; its drawbacks are the large amount of training data required and the uneven distribution of samples. To avoid the problems of directly using word-level text, researchers have attempted to model with characters. Character-level text sequences have fewer text units, so the number of output units and the model size can be better controlled, reducing the required amount of training data. However, the construction of character-level text units does not consider the influence between adjacent units in the output text sequence, nor phenomena such as co-articulation and non-pronunciation in speech. Considering the modeling difficulty of word units and the performance deficiencies of character units, prior work has used sub-words for modeling, aiming to find a balance between modeling difficulty and model performance.
Another research direction for using multi-level identification information is to combine multiple text sequences in an ASR system rather than picking one of them. The content of the output sequence is represented jointly by the multiple text sequences, providing rich, multi-level output information for the model and enhancing the information content of the target text. In end-to-end speech recognition modeling, researchers have adopted several multi-level labeled modeling approaches; the existing methods can be divided into three categories: multi-task learning strategies (MTL), pre-training methods, and score fusion.
Disclosure of Invention
In view of the above, the present invention provides an end-to-end speech recognition model based on multi-level identification and a modeling method thereof, so as to partially solve at least one of the above technical problems.
In order to achieve the above object, as one aspect of the present invention, there is provided an end-to-end speech recognition modeling method based on multi-level identifiers, including decoding inference, where the decoding inference employs a post-inference algorithm, and the post-inference algorithm includes:
the model generates the posterior-probability output sequence $\hat{y}^i$ corresponding to the fine-grained text sequence, where each unit $\hat{y}^i_t$ of the output sequence uniquely corresponds to a coarse-grained subsequence $\delta^j_t$;
the model computes the log-likelihood of generating the coarse-grained subsequence $\delta^j_t$ and uses this value as a cross-validation of the existing predicted output sequence $\hat{y}^i$;
the existing decoding paths are pruned according to the likelihood scores obtained in the two preceding steps, ensuring that the search paths remain within the beam width.
The core of the post-inference algorithm is that the inter-sequence alignment mapping information is used in the decoding inference stage.
In the cross-validation process no new decoding path is generated; the scores of the output results on the existing paths are re-ranked from another perspective.
The score increment of each decoding path consists of one fine-grained log-likelihood probability score and several coarse-grained log-likelihood probability scores.
As another aspect of the present invention, an end-to-end speech recognition model obtained by the above modeling method is provided. The speech recognition model includes an interactive decoder, and the interactive decoder includes a character module, an interaction module, a sub-word hidden layer module and a sub-word classification module; wherein,
the character module is used for modeling the output prediction of the character subsequences $\hat{y}^c$ and provides the character history state $\bar{s}^c$ for subsequent operations;
the interaction module is used for fusing the character state and the sub-word state, and the fused interaction state is used in the computation of the interactive attention module.
The character module comprises a character attention module, a recurrent neural network layer and a fully connected layer; the inputs of the character module are the information representation of the character history output and the encoder output sequence $h^E$.
The interaction module comprises an interactive attention mechanism and a recurrent neural network layer; the inputs of the interaction module are the character history state, the sub-word state and the encoder output sequence $h^E$.
The inputs of the sub-word hidden layer module are the information representation of the sub-word history output and the encoder output sequence $h^E$; the computation of the sub-word attention vector and the updating of the sub-word state are realized by the sub-word attention module and the recurrent neural network layer, respectively.
The inputs of the sub-word classification module are the interaction state and the sub-word state, each of which is passed through a fully connected layer to realize an output prediction of sub-words; the two outputs are called the sub-word output and the auxiliary sub-word output, respectively.
The interactive decoder generates three types of outputs: the character output, the sub-word output and the auxiliary sub-word output. The three types of outputs correspond to three cross-entropy losses, which together form the loss function for model training.
Based on the above technical scheme, compared with the prior art, the end-to-end speech recognition model based on multi-level identification and the modeling method have at least one of the following beneficial effects:
(1) With the post-inference algorithm and the interactive decoder provided by the invention, the end-to-end speech recognition model achieves higher recognition accuracy than existing recognition models.
(2) The application of the post-inference algorithm proposed by the present invention is not limited by the end-to-end architecture.
Drawings
Fig. 1 shows the alignment mapping relationship between multi-level identifiers according to an embodiment of the present invention (sub-words and characters are taken as examples);
Fig. 2 shows the MTL-based multi-level identification modeling method and the end-to-end multi-level identification sequence alignment method according to an embodiment of the present invention;
Fig. 3 shows the graphical model corresponding to the joint conditional probability of the multi-level labeled end-to-end model provided by an embodiment of the present invention;
Fig. 4 shows the application of the alignment mapping relationship in the multi-level labeled end-to-end decoding process: the joint decoding algorithm ($y^i$ and $y^j$ are instantiated as the sub-word sequence $y^b$ and the character sequence $y^c$, respectively);
Fig. 5 describes the experimental configurations provided by an embodiment of the present invention;
Fig. 6 is a block diagram of the sequence-to-sequence speech recognition acoustic model with bi-level autoregressive decoding according to an embodiment of the present invention;
Fig. 7 is a block diagram of the interactive decoder in the sequence-to-sequence speech recognition acoustic model according to an embodiment of the present invention;
Fig. 8 shows the use of multi-granularity target information provided by an embodiment of the present invention, where (a) is the interactive decoder and (b) is the joint decoding algorithm.
Detailed Description
Selecting a single item from the multi-level text sequences for end-to-end speech recognition modeling is neither the only choice nor necessarily the optimal one. The multiple text sequences selected in end-to-end speech recognition modeling are denoted multi-level identification (multi-hierarchical target sequences). Considering that several text sequences can be used jointly for end-to-end speech recognition modeling to achieve a better effect, the invention proposes the Multi-Granularity Sequence Alignment method (MGSA).
An end-to-end ASR system can be split into two stages: model training (training stage) and decoding inference (inference stage). The MGSA method proposed in this patent optimizes the ASR system with multi-level identification information in both stages. First, the decoder module of the end-to-end ASR sequentially generates multi-level text sequences with a model structure that takes the interaction between identifiers at different levels into account. In addition, in the end-to-end output inference stage, the method exploits the implicit alignment mapping relationship between identifiers at different levels to further improve recognition performance: the proposed Post-Inference Algorithm can further calibrate the posterior probability scores of the output sequences using multi-level identification information. Experimental results on the WSJ-80hrs and Switchboard-300hrs data sets indicate that the method has significant advantages over traditional multi-task methods and single-identifier baseline systems.
The MGSA method provided by the invention aims to make full use of multi-granularity information and improve the performance of an end-to-end speech recognition system as much as possible without increasing the overall amount of input information. Moreover, the multi-level information plays, to some extent, part of the role of a language model, which can relieve the dependence of the end-to-end model on an external language model. Through the alignment mapping relationship between units of different granularities, MGSA exploits their interaction information so that the model can learn the semantic information they carry, further improving model performance.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
1. Sequence alignment mapping relation between multi-level identifiers
In end-to-end speech recognition modeling there are three types of text units: character units, sub-word units and word-level units. Among these, units of the former type (e.g., characters) can be clustered to form the latter (e.g., sub-words), so that each unit of the latter corresponds to one or more units of the former. For example, in Fig. 1 the word unit "COURSE" corresponds to the sub-word substring "_C OUR SE", and the sub-word unit "OUR" uniquely maps to the character subsequence "O U R". The implicit alignment mapping relationship between the text sequences can be obtained by querying a dictionary; it is strict, definite and easily obtained. We denote this implicit, unique correspondence between multi-level texts (shown by the solid lines in the middle of Fig. 1) as the alignment mapping. The invention introduces MGSA, a method for introducing the alignment mapping into end-to-end speech recognition modeling.
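To make the alignment mapping concrete, the short Python sketch below expands sub-word units into their character substrings by rule; the function name and the boundary-marker convention are illustrative assumptions, not part of the patent.

```python
# A minimal sketch of the alignment mapping between hierarchy levels.
# In practice the mapping is read off a segmentation dictionary; here it
# is recovered by rule for alphabetic sub-words.

def subwords_to_characters(subword_seq):
    """Expand each sub-word unit into its (unique) character substring."""
    expanded = []
    for sw in subword_seq:
        # "_" marks a word boundary in the sub-word inventory and carries
        # no characters of its own (an assumed convention).
        expanded.append([c for c in sw if c != "_"])
    return expanded

subword_seq = ["_C", "OUR", "SE"]              # sub-word identification of "COURSE"
print(subwords_to_characters(subword_seq))     # [['C'], ['O', 'U', 'R'], ['S', 'E']]
# The sub-word unit "OUR" maps uniquely to the character substring "O U R",
# matching the alignment shown in Fig. 1.
```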
End-to-end ASR can be divided into the two stages of model training and decoding inference, and the MGSA method uses the alignment mapping in both. The general framework of the method is shown in Fig. 2(b). Compared with the conventional MTL-based approach in Fig. 2(a), there are three main differences.
First, MGSA takes the joint conditional probability of the multi-level identifiers as the target of model optimization, and the use of the alignment mapping relationship between sequences is fully considered in the optimization process. Second, a novel decoder module is proposed, which embodies the information transfer between the multi-level identifiers at the structural level; interaction and fusion between the multi-level identifiers are realized inside the model through this structure (dashed lines in Fig. 2). In addition, in the output decoding stage of the end-to-end ASR, the post-inference algorithm proposed in this patent checks and corrects the recognition results of the model through the correspondence between the multi-level identification outputs (dotted lines in Fig. 2).
2. Encoder-decoder structure for multi-level identification
2.1 formula derivation of optimization objectives
For any two text representations $y^i$ and $y^j$, let $y^i$ be the fine-grained text sequence and $y^j$ the coarse-grained text sequence, where each text unit in $y^i$ maps uniquely to a subsequence of one or more text units in $y^j$. The text substring of $y^j$ corresponding to the $t$-th text unit $y^i_t$ of $y^i$ is denoted $\delta^j_t$, which contains $k_t$ text units; further, the $u$-th text unit in $\delta^j_t$ is denoted $\delta^j_{t,u}$. With this representation, $y^j$ can be rewritten in the following form:

$$y^j = \big(\delta^j_1, \delta^j_2, \dots, \delta^j_T\big), \qquad \delta^j_t = \big(\delta^j_{t,1}, \dots, \delta^j_{t,k_t}\big) \tag{1}$$
The $\delta$ in formula (1) embodies the sequence alignment mapping relationship between the text sequences $y^j$ and $y^i$ in a more intuitive and explicit way. On this basis, the invention proposes the Multi-Granularity Sequence Alignment (MGSA) method. Before introducing the details of the method, the optimization formula of MGSA is obtained by formula derivation. The derivation of the end-to-end speech recognition objective function is carried out with the fine-grained text sequence $y^i$ and the coarse-grained text sequence $y^j$ as an example; only the case of two text representation sequences is discussed here, and the joint conditional probability of three or more text sequences can be derived by analogy. Given an input speech feature sequence $x$, the goal of the multi-level identification end-to-end speech recognition model is to model the joint conditional probability $P_\theta(y^i; y^j \mid x)$.
Applying formula (1) to the coarse-grained text sequence $y^j$, and denoting the text pair formed by $y^i_t$ and its corresponding text subsequence $\delta^j_t$ as $(y^i_t, \delta^j_t)$, the joint conditional probability of the model can be expressed as

$$P_\theta\big(y^i; y^j \mid x\big) = \prod_{t=1}^{T} P_\theta\Big(\big(y^i_t, \delta^j_t\big) \,\Big|\, \big(y^i, \delta^j\big)_{1:t-1}, x\Big) \tag{2}$$
A multi-level identification is a representation of the same text at different granularities: the identifiers take different forms but correspond to the same text meaning, and each unit is associated with the other units. Fig. 3(a) presents the graphical model representation corresponding to the modeling objective (equation (2)).
Considering the sequential causal properties between the multi-level identifiers, the interaction between the two types of text sequences in Fig. 3(a) is not entirely reasonable. First, a unit in the text sequence of one granularity should not affect the prediction of units with an earlier order in the text sequence of the other granularity; for example, $\delta^j_t$ should not affect the prediction of $y^i_{t'}$ for $t' < t$. Second, the members of a text pair $(y^i_t, \delta^j_t)$ should not affect each other. For the text pair in Fig. 3, $y^i_t$ and $\delta^j_t$ are expressions of the same text "OUR" at different granularities; allowing interaction between them would mean computing an output with its true identity already known, making the computation meaningless. Combining the above two temporal-causality considerations, the joint conditional probability $P_\theta(y^i; y^j \mid x)$ (abbreviated $P_\theta$ below) is further expressed as

$$P_\theta = \prod_{t=1}^{T} P_\theta\Big(y^i_t \,\Big|\, \big(y^i, \delta^j\big)_{1:t-1}, x\Big)\; P_\theta\Big(\delta^j_t \,\Big|\, \big(y^i, \delta^j\big)_{1:t-1}, x\Big) \tag{3}$$
Using the temporal causal properties between the text representation sequences, the graphical model of the joint conditional probability at this point can be simplified as shown in Fig. 3(b). Comparing with formula (3), the corresponding graphical model can be simplified further: assuming that the variables of the text sequences obey a first-order Markov assumption, the joint conditional probability becomes

$$P_\theta = \prod_{t=1}^{T} P_\theta\big(y^i_t \,\big|\, y^i_{t-1}, \delta^j_{t-1}, x\big)\; P_\theta\big(\delta^j_t \,\big|\, y^i_{t-1}, \delta^j_{t-1}, x\big) \tag{4}$$
Fig. 3(c) corresponds to formula (4). In the joint conditional probability, the text subsequence $\delta^j_t$ is a subsequence of length $k_t$. Substituting this relation into the joint conditional probability, the computation of the coarse-grained text substring $\delta^j_t$ can be further expanded by the chain rule:

$$P_\theta\big(\delta^j_t \,\big|\, y^i_{t-1}, \delta^j_{t-1}, x\big) = \prod_{u=1}^{k_t} P_\theta\big(\delta^j_{t,u} \,\big|\, \delta^j_{t,1:u-1}, y^i_{t-1}, \delta^j_{t-1}, x\big) \tag{5}$$

The resulting expression for the joint conditional probability is

$$P_\theta = \prod_{t=1}^{T} \Big[ P_\theta\big(y^i_t \,\big|\, y^i_{t-1}, \delta^j_{t-1}, x\big) \prod_{u=1}^{k_t} P_\theta\big(\delta^j_{t,u} \,\big|\, \delta^j_{t,1:u-1}, y^i_{t-1}, \delta^j_{t-1}, x\big) \Big] \tag{6}$$

Equation (6) corresponds to Fig. 3(d). The formula shows that in the joint optimization of the two text sequences, the generation of each model prediction must consider the historical information of both hierarchies at the corresponding moment.
The derivation process yields the basic principles that the model construction must satisfy:
1. Alignment mapping relationship between sequences: there is a strict correspondence between the fine-grained and coarse-grained expressions of the same text content, i.e., each text unit in the fine-grained text sequence corresponds to one or more text units in the coarse-grained text sequence. This strict one-to-many mapping between the two kinds of text units is the foundation of multi-level identification end-to-end speech recognition modeling; the interactions between the multi-level identifiers considered later must be established on the premise of this mapping relationship.
2. Mutually independent history information: the history information of the two text sequences, $y^i_{1:t-1}$ and $\delta^j_{1:t-1}$, must not have a direct effect on each other. Each state variable retains its own historical sequence modeling capability and avoids the influence of the historical output of the other text sequence.
3. Interaction acting directly on classification: the interaction of the multi-level identifiers directly affects the classification process of text units. The end-to-end modeling process based on recursive expansion can be split into historical text sequence modeling and estimation of the model prediction; given the independence of the history information mentioned above, the interaction of the multi-level identification information must be reflected in the classification process.
In the above derivation, the interaction between the multi-level identifiers is bidirectional, but some simplification can be made in practical use. Ignoring the influence of the fine-grained text sequence $y^i$ on the coarse-grained text sequence $y^j$, formula (6) is further simplified as

$$P_\theta \approx \prod_{t=1}^{T} \Big[ P_\theta\big(y^i_t \,\big|\, y^i_{t-1}, \delta^j_{t-1}, x\big) \prod_{u=1}^{k_t} P_\theta\big(\delta^j_{t,u} \,\big|\, \delta^j_{t,1:u-1}, \delta^j_{t-1}, x\big) \Big] \tag{7}$$
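For concreteness, instantiating $y^i$ as the sub-word sequence $y^b$ and $y^j$ as the character sequence $y^c$ (the pairing used in Fig. 4) turns the reconstructed formula (7) into the following form; this worked instance is added for illustration only and follows the notation reconstructed above.

```latex
P_\theta\big(y^b;\, y^c \,\big|\, x\big) \;\approx\; \prod_{t=1}^{T}
  \underbrace{P_\theta\big(y^b_t \,\big|\, y^b_{t-1},\, \delta^c_{t-1},\, x\big)}_{\text{sub-word prediction}}
  \;\prod_{u=1}^{k_t}
  \underbrace{P_\theta\big(\delta^c_{t,u} \,\big|\, \delta^c_{t,1:u-1},\, \delta^c_{t-1},\, x\big)}_{\text{character expansion of } y^b_t}
```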
2.2 Encoder-decoder structure for multi-level identification
The proposed model structure consists of two parts, an encoder and a decoder. The encoder has the same structure as a traditional encoder; for the decoder part, the invention proposes an interactive decoder structure comprising a character module, an interaction module, a sub-word hidden layer module and a sub-word classification module. In addition, a total of three loss functions are used to guide model training.
Encoder module: its input is the feature sequence $x$ of an utterance. The encoder module acts as a feature extractor that enhances the correlation of the input sequence along the time dimension and generates the encoder output sequence $h^E$. Specifically, the context information expression at each time step is obtained by fusing the feature sequence encodings with convolutional neural networks (CNNs) and bidirectional long short-term memory networks (Bi-LSTM).
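A minimal PyTorch sketch of such a CNN + Bi-LSTM encoder follows, with sizes taken from the experiment section (3×3 filters, 32 channels, 6 Bi-LSTM layers of 800 cells); the stride-2 downsampling and the exact reshaping are assumptions, not details given by the patent.

```python
# A minimal sketch of the CNN + Bi-LSTM encoder, written in PyTorch.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, feat_dim=80, channels=32, hidden=800, layers=6):
        super().__init__()
        # Two convolutional layers; stride 2 in both time and frequency
        # is an assumption for the downsampling mentioned in the text.
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        conv_out = channels * ((feat_dim + 3) // 4)  # frequency axis after two stride-2 convs
        self.blstm = nn.LSTM(conv_out, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        x = self.conv(x.unsqueeze(1))          # (batch, ch, time/4, feat/4)
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)
        h, _ = self.blstm(x)                   # h^E: (batch, time/4, 2*hidden)
        return h

enc = Encoder()
h_E = enc(torch.randn(2, 100, 80))             # two utterances, 100 frames each
print(h_E.shape)                               # torch.Size([2, 25, 1600])
```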
Decoder module: it comprises the character module, the sub-word hidden layer module, the interaction module and the sub-word classification module. Its input is the encoder output sequence $h^E$, and its output is the sub-word output, predicted at the current time step based on the context information produced by the encoder.
1. Character module. Its inputs are the information representation of the character history output and the encoder output sequence $h^E$. The module is composed of a character attention module, a recurrent neural network (RNN) layer and a fully connected (FC) layer; it models the output prediction of the character subsequences $\hat{y}^c$ and provides the character history state $\bar{s}^c$ for the subsequent operations.
2. Sub-word hidden layer module. Its inputs are the information representation of the sub-word history output and the encoder output sequence $h^E$; the computation of the sub-word attention vector and the updating of the sub-word state are realized by the sub-word attention module and the RNN layer, respectively.
3. Interaction module. Its inputs are the character history state, the sub-word state and the encoder output sequence $h^E$. The module consists of an interactive attention mechanism and an RNN layer; it fuses the character state and the sub-word state, and the fused interaction state is used in the computation of the interactive attention module. This process mainly reflects the influence of the character state on the sub-word state.
4. Sub-word classification module. Its inputs are the interaction state and the sub-word state, each of which is passed through a fully connected layer to realize an output prediction of the sub-words. The two outputs are referred to as the sub-word output and the auxiliary sub-word output, respectively.
The interactive decoder mainly generates three types of outputs: the character output, the sub-word output and the auxiliary sub-word output. The three outputs correspond to three cross-entropy losses, which together form the loss function for model training. The first two ensure the training and convergence of the character module, the interaction module and the sub-word classification module; the last assists the training of the sub-word attention module inside the sub-word hidden layer module.
3. Post inference algorithm
The use of the inter-sequence alignment mapping information is not limited to the model structure; it can also be used in the decoding stage.
Take the fine-grained text sequence $y^i$ and the coarse-grained text sequence $y^j$ as an example. When the model generates a candidate output $\hat{y}^i_t$ at the $t$-th decoding step, the corresponding subsequence $\delta^j_t$ can be obtained through the alignment mapping relationship between the sequences. For example, when the sub-word candidate output produced by the model is "SE", its corresponding character subsequence "S E" can be derived synchronously. In the decoding stage, the invention investigates how to use $\delta^j_t$ to verify the predicted output $\hat{y}^i_t$.
3.1 formula derivation of optimization objectives
The decoding stage of end-to-end ASR uses a beam search algorithm to select decoding paths under a given beam width, and the log-likelihood probability of an existing decoding path is typically used as the current path score. The decoding stage is formulated as

$$\hat{y}^i = \operatorname*{arg\,max}_{y^i \in \Omega^i} \log P_\theta\big(y^i \,\big|\, x\big) \tag{8}$$

where $\Omega^i$ is the dictionary corresponding to the sequence $y^i$. The outcome of the argmax function is determined by relative values, so multiplying by 2 does not change the result:

$$\hat{y}^i = \operatorname*{arg\,max}_{y^i \in \Omega^i} \sum_{t} \Big[ \log P_\theta\big(y^i_t \,\big|\, y^i_{1:t-1}, x\big) + \log P_\theta\big(y^i_t \,\big|\, y^i_{1:t-1}, x\big) \Big] \tag{9}$$

$$\approx \operatorname*{arg\,max}_{y^i \in \Omega^i} \sum_{t} \Big[ \log P_\theta\big(y^i_t \,\big|\, y^i_{1:t-1}, x\big) + \log P_\theta\big(\delta^j_t \,\big|\, \delta^j_{1:t-1}, x\big) \Big] \tag{10}$$

$$= \operatorname*{arg\,max}_{y^i \in \Omega^i} \sum_{t} \Big[ \log P_\theta\big(y^i_t \,\big|\, y^i_{1:t-1}, x\big) + \sum_{u=1}^{k_t} \log P_\theta\big(\delta^j_{t,u} \,\big|\, \delta^j_{t,1:u-1}, \delta^j_{1:t-1}, x\big) \Big] \tag{11}$$
The second term in equation (9), $\log P_\theta(y^i_t \mid y^i_{1:t-1}, x)$, is the likelihood probability of generating $\hat{y}^i_t$ at time $t$. Through the alignment mapping it is replaced by the likelihood of the corresponding coarse-grained subsequence $\delta^j_t$ (corresponding to equation (10)), at which point $\delta^j_t$ serves as a cross-validation of the fine-grained prediction output $\hat{y}^i_t$. Expanding the coarse-grained subsequence further gives the final joint decoding expression (corresponding to equation (11)). Based on this formula inference over the decoding process, a new end-to-end model decoding algorithm, called the joint decoding algorithm, is proposed. The derivation is based on the expansion of two text sequences; the derivation for three or more text sequences can be obtained by analogy.
3.2 Introduction to the post-inference algorithm (joint decoding algorithm)
The implementation details of the joint decoding algorithm are as follows. The joint decoding process can be divided into three steps: prediction, check and pruning. Fig. 4 shows the end-to-end speech recognition decoding process at time $t$. Prediction: the model generates the posterior probability output $\hat{y}^i_t$ for the fine-grained text sequence. Check: because the output $\hat{y}^i_t$ uniquely corresponds to a coarse-grained substring $\delta^j_t$, the model computes the log-likelihood of generating the subsequence $\delta^j_t$ and uses it as a cross-validation of the existing prediction $\hat{y}^i_t$; in this process no new decoding path is generated, and the scores of the output results on the existing paths are re-ranked (re-scored) from another perspective, hence the name "check". Finally, pruning: the existing decoding paths are pruned according to the likelihood scores obtained in the two preceding steps, ensuring that the search paths remain within the beam width. In this process, the score increment of each decoding path consists of one fine-grained log-likelihood score and several coarse-grained log-likelihood scores. Overall, compared with the traditional beam search algorithm, the joint decoding algorithm adds the check step.
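The sketch below illustrates the predict-check-prune loop in Python. `subword_logp`, `char_logp` and `expand` are assumed interfaces standing in for the model's fine-grained scorer, coarse-grained scorer and alignment mapping; they are placeholders, not the patent's actual APIs.

```python
import heapq

# A minimal sketch of one step of the joint decoding (post-inference) algorithm.
# Assumed interfaces:
#   subword_logp(hyp)       -> {subword: log P(subword | hyp, x)}
#   char_logp(char_hist, c) -> log P(c | char_hist, x)
#   expand(subword)         -> list of characters (alignment mapping)
def joint_decode_step(beams, subword_logp, char_logp, expand, beam_width):
    candidates = []
    for hyp, char_hist, score in beams:
        for sw, lp_fine in subword_logp(hyp).items():      # 1. predict
            lp_coarse, hist = 0.0, list(char_hist)
            for ch in expand(sw):                           # 2. check: re-score
                lp_coarse += char_logp(hist, ch)            #    the aligned
                hist.append(ch)                             #    character substring
            # score increment = one fine-grained score + several coarse-grained scores
            candidates.append((hyp + [sw], hist, score + lp_fine + lp_coarse))
    # 3. prune: keep only the best `beam_width` paths
    return heapq.nlargest(beam_width, candidates, key=lambda b: b[2])
```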
As shown in Fig. 6, the input to the model encoder is the speech feature sequence, whose feature representation along the time dimension is extracted by CNN and BLSTM layers to give $h^E$. The inputs to the model decoder are the encoder output $h^E$, the sub-word output of the previous step and the character sequence output of the previous step; the decoder produces the predicted sub-word output of the current step as well as the character-level prediction outputs.
Given a multi-level labeled training sample $[x; (y^b, y^c)]$: the speech sequence is converted by feature extraction into the audio feature sequence $x$, and the corresponding multi-level identifiers are the sub-word text sequence $y^b$ and the character text sequence $y^c$.
The encoder module of the model acts as a feature extractor that enhances the correlation of the input audio feature sequence along the time dimension and generates the encoder output sequence $h^E$. At decoding time $t$, the interactive decoder module extracts from $h^E$ the information related to the output at the current time and, combined with the model's historical outputs, generates the sub-word prediction output $\hat{y}^b_t$.
Next, a specific structure of the interactive decoder in the model will be described by taking the subword prediction process at the t-th time as an example.
Before the model produces the predicted sub-word output at time $t$, the prediction of the character subsequence $\delta^c_{t-1}$ corresponding to the sub-word at time $t-1$ must be completed first.
Fig. 7 is a block diagram of the interactive decoder in the sequence-to-sequence speech recognition acoustic model; the details of each part are described below.
(1) Character module
The prediction of the $u$-th character in the character subsequence $\delta^c_{t-1}$ proceeds as follows. First, the model performs the character decoder state update and attention vector computation, as in a traditional decoder structure. The state vector $s^c_u$ is updated from the output at the previous character step $\hat{y}^c_{u-1}$:

$$s^c_u = \mathrm{RNN}\big(s^c_{u-1}, \hat{y}^c_{u-1}\big) \tag{12}$$

where RNN denotes a single-layer recurrent neural network. The character decoder state $s^c_u$ and the previous attention weights $\alpha^c_{u-1}$ are loaded as feedback information into the character attention module, which generates the character attention vector $\alpha^c_u$ and the context vector $g^c_u$. The input and output sequences in speech recognition have a monotonic alignment mapping relationship, so additive attention with convolutional features is used [1]. The attention vector computation corresponds to

$$\big(\alpha^c_u, g^c_u\big) = \mathrm{Attend}\big(s^c_u, \alpha^c_{u-1}, h^E\big) \tag{13}$$
where Attend denotes a general attention module. On this basis the character output is predicted: the decoder state is further updated from $g^c_u$,

$$\tilde{s}^c_u = \mathrm{RNN}\big(s^c_u, g^c_u\big) \tag{14}$$

and $\tilde{s}^c_u$ and $g^c_u$ act together on the output prediction of $\hat{y}^c_u$. The character prediction output process at this point is

$$P\big(y^c_u \,\big|\, \cdot\big) = \mathrm{Softmax}\big(W_c\big[\tilde{s}^c_u;\, g^c_u;\, s^b_{t-1}\big] + b_c\big) \tag{15}$$

When the influence of the sub-word state $s^b_{t-1}$ on the character prediction output is ignored, the character prediction process can be further simplified to

$$P\big(y^c_u \,\big|\, \cdot\big) = \mathrm{Softmax}\big(W_c\big[\tilde{s}^c_u;\, g^c_u\big] + b_c\big) \tag{16}$$
where $W_c$ and $b_c$ are the trainable matrix and bias vector parameters, respectively. The above formulas constitute the prediction of the $u$-th unit in the character subsequence $\delta^c_{t-1}$; the process is repeated until the prediction of the whole character subsequence $\delta^c_{t-1}$ is completed. After completion, the character decoder state vector $\tilde{s}^c$ at this moment is denoted $\bar{s}^c_{t-1}$; this vector contains the character history information necessary for generating the predicted sub-word output at time $t$.
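The following PyTorch sketch traces one character-module step through the reconstructed equations (12)-(14) and (16). The `Attend` stand-in below is plain content-based attention, not the location-aware additive attention of [1], and all layer types and sizes are assumptions for illustration.

```python
# A minimal sketch of one character-module step, equations (12)-(14) and (16).
import torch
import torch.nn as nn

class Attend(nn.Module):
    """Simplified attention returning weights alpha and context vector g.
    attn_prev is accepted but unused; the location-aware variant would
    derive convolutional features from it."""
    def __init__(self, state_dim, enc_dim):
        super().__init__()
        self.proj = nn.Linear(state_dim, enc_dim)

    def forward(self, state, attn_prev, h_E):          # h_E: (B, T, enc_dim)
        scores = torch.bmm(h_E, self.proj(state).unsqueeze(-1)).squeeze(-1)
        alpha = scores.softmax(dim=-1)                  # (B, T)
        g = torch.bmm(alpha.unsqueeze(1), h_E).squeeze(1)
        return alpha, g

class CharacterModule(nn.Module):
    def __init__(self, vocab, hidden=800, enc_dim=1600):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn_state = nn.GRUCell(hidden, hidden)     # eq (12)
        self.attend = Attend(hidden, enc_dim)           # eq (13)
        self.rnn_update = nn.GRUCell(enc_dim, hidden)   # eq (14)
        self.fc = nn.Linear(hidden + enc_dim, vocab)    # W_c, b_c in eq (16)

    def step(self, s_prev, y_prev, attn_prev, h_E):
        s = self.rnn_state(self.embed(y_prev), s_prev)      # s^c_u
        alpha, g = self.attend(s, attn_prev, h_E)           # alpha^c_u, g^c_u
        s_tilde = self.rnn_update(g, s)                     # updated state
        logp = self.fc(torch.cat([s_tilde, g], -1)).log_softmax(-1)
        return s_tilde, alpha, logp                         # eq (16) scores
```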
(2) Sub-word hidden layer module
After the preparation work of the character part is completed, the sub-word output prediction at the $t$-th time step is performed. The decoding state update and attention vector computation are carried out in the same way. First, the predicted output of the previous time step $\hat{y}^b_{t-1}$ is used to update the decoder state $s^b_t$:

$$s^b_t = \mathrm{RNN}\big(s^b_{t-1}, \hat{y}^b_{t-1}\big) \tag{17}$$

The updated state information $s^b_t$ then serves as the input of the attention module to generate the corresponding attention vector and context vector:

$$\big(\alpha^b_t, g^b_t\big) = \mathrm{Attend}\big(s^b_t, \alpha^b_{t-1}, h^E\big) \tag{18}$$
The overall structure of the sub-word hidden layer module is shown in Fig. 7. The following computation differs from a conventional encoder-decoder model: in predicting the sub-word output $\hat{y}^b_t$, the influence of the additional character decoder state $\bar{s}^c_{t-1}$ on the sub-word output is introduced. The invention realizes this influence of the characters on the prediction process by adding the interaction module.
(3) Interaction module
The module is composed of an attention module and two RNN layers; the corresponding computation is shown in the middle area of Fig. 7. The structure of the interaction module is as follows. A single-layer RNN fuses $s^b_t$ and $\bar{s}^c_{t-1}$ to obtain the interaction state vector $s^I_t$ at time $t$:

$$s^I_t = \mathrm{RNN}\big(s^I_{t-1}, \big[s^b_t;\, \bar{s}^c_{t-1}\big]\big) \tag{19}$$

$s^I_t$ contains the historical output information satisfying the alignment mapping relationship between the sequences, and is used for the attention vector computation of the interaction module:

$$\big(\alpha^I_t, g^I_t\big) = \mathrm{Attend}\big(s^I_t, \alpha^I_{t-1}, h^E\big) \tag{20}$$

Including this computation, a total of three Attend operations are contained in the interactive decoder structure; to distinguish them, the invention refers to these attention computations as the character attention module, the sub-word attention module and the interactive attention module, respectively. The interactive attention module generates the interactive context vector $g^I_t$, which takes both sub-word and character information into account and can serve as an information supplement to $g^b_t$.
Completing the above process simultaneously yields the sub-word state $z^b_t$ and the character state $z^c_t$: $z^b_t$ comprises the two sets of information given by the sub-word history state $s^b_t$ and the sub-word context vector $g^b_t$, while $z^c_t$ is the combination of the character history state $\bar{s}^c_{t-1}$ and the interactive context vector $g^I_t$. Since the two states share the same time step and a similar composition, further information fusion can be performed between them. With reference to the GLU activation unit [2], the fused vector $f_t$ is computed as

$$f_t = z^b_t \otimes \sigma\big(\mathrm{FC}\big(z^c_t\big)\big) \tag{21}$$

where $\sigma(\cdot)$ and FC refer to the sigmoid activation function and a fully connected layer, respectively. After obtaining the fused vector $f_t$, the interactive decoder state $\tilde{s}^I_t$ is updated by a single-layer RNN and $f_t$:

$$\tilde{s}^I_t = \mathrm{RNN}\big(s^I_t, f_t\big) \tag{22}$$
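A sketch of the fusion step of the interaction module follows, assuming the reconstructed GLU-style gate $f_t = z^b_t \otimes \sigma(\mathrm{FC}(z^c_t))$; the gating direction and the dimensions are assumptions, since the original formula images are not reproduced.

```python
# A minimal sketch of the interaction module's fusion step (eqs (21)-(22)).
import torch
import torch.nn as nn

class InteractionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)        # FC in the fusion formula
        self.rnn = nn.GRUCell(dim, dim)      # single-layer RNN update

    def forward(self, z_b, z_c, s_I):
        f = z_b * torch.sigmoid(self.fc(z_c))   # fused vector f_t (GLU-style gate)
        s_I_tilde = self.rnn(f, s_I)             # updated interactive state
        return s_I_tilde
```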
(4) Sub-word classification module
Finally, the state of the interaction module $\tilde{s}^I_t$ is used to predict the main sub-word output $\hat{y}^b_t$ at the current time $t$:

$$P\big(y^b_t \,\big|\, \cdot\big) = \mathrm{Softmax}\big(W_i\, \tilde{s}^I_t + b_i\big) \tag{23}$$

Meanwhile, the sub-word state $z^b_t$ serves as the input vector to generate the auxiliary sub-word output $\tilde{y}^b_t$:

$$P\big(\tilde{y}^b_t \,\big|\, \cdot\big) = \mathrm{Softmax}\big(W_b\, z^b_t + b_b\big) \tag{24}$$

In these two formulas, $W_i$ and $W_b$ are trainable matrix parameters, and $b_i$ and $b_b$ are bias vectors. The sub-word classification module as a whole corresponds to the lower-left region of Fig. 7.
(5) Model loss function
In the above computation, the model generates three types of outputs: the output of the character subsequences $\hat{y}^c$, the output of the sub-word units $\hat{y}^b$, and the auxiliary sub-word output $\tilde{y}^b$. The three types of outputs each correspond to one part of the loss function. After the prediction of the sub-word output sequence of length $T$ is completed, the loss function of the model is

$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{CE}}\big(\hat{y}^c, y^c\big) + (1-\lambda)\, \mathcal{L}_{\mathrm{CE}}\big(\hat{y}^b, y^b\big) + \mathcal{L}_{\mathrm{CE}}\big(\tilde{y}^b, y^b\big) \tag{25}$$

where $\lambda \in [0, 1]$ is a hyper-parameter preset for model training, and cross entropy (CE) is selected as the objective function. The first and second terms correspond to the cross-entropy losses of the character output and the sub-word output, respectively, and the third term is the cross-entropy loss of the auxiliary sub-word output $\tilde{y}^b$, used to assist the training of the sub-word attention module in the model.
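A sketch of the reconstructed training loss follows, under the assumption that $\lambda$ weights the character term against the sub-word term and that the auxiliary term enters unweighted; the exact weighting of the auxiliary term is not recoverable from the text.

```python
# A minimal sketch of the combined loss, reconstructed as eq (25).
import torch.nn.functional as F

def mgsa_loss(char_logits, char_tgt, sub_logits, sub_tgt, aux_logits, lam=0.5):
    loss_char = F.cross_entropy(char_logits, char_tgt)   # character output
    loss_sub = F.cross_entropy(sub_logits, sub_tgt)      # sub-word output
    loss_aux = F.cross_entropy(aux_logits, sub_tgt)      # auxiliary sub-word output
    return lam * loss_char + (1.0 - lam) * loss_sub + loss_aux
```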
(6) Information usage differences
Both the post-inference algorithm and the interactive decoder module use the alignment mapping information, but at different stages. Fig. 8 illustrates the difference in how the character context of a learned sub-word is used. For the post-inference algorithm, the subsequence $\delta^c_t$ is further applied to verify and correct the prediction output in Fig. 8(a), whereas the interactive decoder module shown in Fig. 8(b) uses the historical output characters corresponding to time step $t-1$. Clearly, the alignment mapping information is exploited at different time steps; the proposed MGSA end-to-end model can therefore make full use of the alignment mapping information at both the current and the historical time steps by applying the post-inference algorithm in the decoding stage.
Experimental verification
To verify the effectiveness of the proposed interactive decoder module and post-inference algorithm, the ASR performance of various systems was evaluated in terms of word error rate (WER) on the Switchboard-300hrs data set. Switchboard consists of a large number of English telephone conversations; the 300-hour subset LDC97S62 was selected for training, with 10% held out for cross-validation. Hub5 eval2000 (i.e., LDC2002S09) was selected for performance evaluation; it consists of two subsets: 1) Switchboard (similar to the training set) and 2) CallHome, collected from conversations between friends and within families. The complete Hub5 eval2000 and its subsets Switchboard and CallHome are denoted "Full", "SWB" and "CHE", respectively. For completeness, ASR performance was also evaluated on the RT03 Switchboard test set (i.e., LDC2007S10).
The encoder of the model has two convolutional layers that downsample the time sequence using 3×3 filters and 32 channels, followed by 6 layers of bidirectional long short-term memory (LSTM) with a cell size of 800. The default decoder is a 2-layer unidirectional LSTM with 800 cells. 80-dimensional log-mel filter-bank coefficients and three pitch coefficients, with mean and variance normalization, are used as input features. The character target in the experiments is a set of 46 characters containing English letters, digits, punctuation marks and special transcription marks; for the sub-word target, SentencePiece segmentation based on the BPE algorithm is used, and following the default settings in ESPnet, a vocabulary of approximately 2000 units is used for Switchboard.
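As a usage note, a sub-word inventory of this kind can be reproduced with the SentencePiece toolkit; the file names below are placeholders and the printed segmentation is only indicative.

```python
# A minimal sketch of BPE sub-word segmentation with SentencePiece,
# matching the ~2000-unit vocabulary described for Switchboard.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_text.txt",        # placeholder transcript file
    model_prefix="swbd_bpe",
    vocab_size=2000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="swbd_bpe.model")
print(sp.encode("COURSE", out_type=str))   # e.g. ['▁C', 'OUR', 'SE']
```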
The experimental configurations are shown in Fig. 5, where Baseline is the baseline; Baseline+ adds one BLSTM layer to the encoder module of the former in order to exclude the effect of model size; MultiTask is a multi-task learning scheme; and MGSA_bi and MGSA_uni are the MGSA schemes proposed by the invention, where the former considers the bidirectional interaction information between sub-words and characters and the latter considers only the information contribution of characters to sub-words.
The experimental results are shown in Tables I and II.
[Table I: Switchboard data set test results]
[Table II: Experimental results of the post-inference algorithm]
1. Experiment one: joint decoding algorithm
To analyze the effect of model structure on performance, we first consider the traditional bundle search algorithm of all methods in the decoding phase. Table I lists the WERs implemented on both verification sets of the Switchboard. Obviously, compared with MultiTask and Baseline based on eval2000 dataset, the MGSA proposed by the inventionuniThe WER is reduced by 1.4 percent and 1.9 percent respectively; for RT03, MGSA compared to MultiTask and BaselineuniThe WER of (A) is reduced by 1.0% and 1.7% respectively; and MGSAbiIs inferior to MGSAuni. In fact, MGSAuniAnother advantage of (a) is that predictions for all character sequences can be computed simultaneously and all reference characters that need to be provided for the corresponding subword can be extracted at once.
2. Experiment two: interactive decoder
Since the multi-granularity target affects not only the model structure but also the decoding, we evaluated the impact of applying the proposed post-inference algorithm experimentally in the decoding stage. For the sake of brevity, MGSA will be used separately belowuni+, MultiTask + for MGSAuniAnd MultiTask plus post-reasoning algorithms.
The experimental results on the Switchboard data set are shown in Table II. Compared with MGSA_uni, the proposed MGSA_uni+ method further reduces the WER on eval2000 by 0.7% and on RT03 by 0.8%; it is also a clear improvement over the MultiTask method.
Table II also reports the MultiTask+ performance on the Switchboard data set, since the application of the proposed post-inference algorithm is not limited to any particular end-to-end architecture. Owing to the post-inference algorithm, the WER of MultiTask+ on Switchboard is reduced by 1.2% compared with the original MultiTask approach; it can therefore be concluded that the proposed post-inference algorithm further improves ASR performance. Notably, the improvement of the algorithm on the MGSA_uni model is larger than on MultiTask, because the former takes into account the alignment mapping information contained in the multiple granularities while the latter does not. Since the performance gain of MGSA_uni+ over MGSA_uni is slightly smaller than that of MultiTask+ over MultiTask, it can be concluded that the performance gains obtained from the interactive decoder and the post-inference algorithm, respectively, may be partially complementary.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An end-to-end speech recognition modeling method based on multi-level identification, characterized by comprising decoding inference, wherein the decoding inference adopts a post-inference algorithm, and the post-inference algorithm comprises the following steps:
the model generates the posterior-probability output sequence $\hat{y}^i$ corresponding to the fine-grained text sequence, wherein each unit $\hat{y}^i_t$ of the output sequence uniquely corresponds to a coarse-grained subsequence $\delta^j_t$;
the model computes the log-likelihood of generating the coarse-grained subsequence $\delta^j_t$ and uses this value as a cross-validation of the existing predicted output sequence $\hat{y}^i$;
and the existing decoding paths are pruned according to the likelihood scores obtained by the computation of the two preceding steps, ensuring that the search paths are kept within the beam width.
2. The modeling method of claim 1, wherein the core of the post-inference algorithm is to use the inter-sequence alignment mapping information in the decoding inference stage.
3. The modeling method of claim 1, wherein no new decoding path is generated during the cross-validation process, and the scores of the output results on the existing paths are re-ranked from another perspective.
4. The modeling method of claim 1, wherein the score increment for each decoding path consists of one fine-grained log-likelihood probability score and a plurality of coarse-grained log-likelihood probability scores.
5. An end-to-end speech recognition model obtained by the modeling method of any one of claims 1-4, wherein the speech recognition model comprises an interactive decoder, and the interactive decoder comprises a character module, an interaction module, a sub-word hidden layer module and a sub-word classification module; wherein the character module is used for modeling the output prediction of the character subsequences $\hat{y}^c$ and provides the character history state $\bar{s}^c$ for subsequent operations; and the interaction module is used for fusing the character state and the sub-word state, the fused interaction state being used in the computation of the interactive attention module.
6. The speech recognition model of claim 5, wherein the character module comprises a character attention module, a recurrent neural network layer and a fully connected layer; the inputs of the character module are the information representation of the character history output and the encoder output sequence $h^E$.
7. The speech recognition model of claim 5, wherein the interaction module comprises an interactive attention mechanism and a recurrent neural network layer; the inputs of the interaction module are the character history state, the sub-word state and the encoder output sequence $h^E$.
8. The speech recognition model of claim 5, wherein the inputs of the sub-word hidden layer module are the information representation of the sub-word history output and the encoder output sequence $h^E$, and the computation of the sub-word attention vector and the updating of the sub-word state are realized by the sub-word attention module and the recurrent neural network layer, respectively.
9. The speech recognition model of claim 5, wherein the inputs of the sub-word classification module are the interaction state and the sub-word state, each of which is passed through a fully connected layer to realize the output prediction of sub-words, and the two outputs are called the sub-word output and the auxiliary sub-word output, respectively.
10. The speech recognition model of claim 5, wherein the interactive decoder generates three types of outputs, comprising the character output, the sub-word output and the auxiliary sub-word output, wherein the three types of outputs correspond to three cross-entropy losses and form the loss function of model training.
CN202110642751.3A 2021-06-09 2021-06-09 End-to-end voice recognition model based on multilevel identification and modeling method Pending CN113160803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110642751.3A CN113160803A (en) 2021-06-09 2021-06-09 End-to-end voice recognition model based on multilevel identification and modeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110642751.3A CN113160803A (en) 2021-06-09 2021-06-09 End-to-end voice recognition model based on multilevel identification and modeling method

Publications (1)

Publication Number Publication Date
CN113160803A true CN113160803A (en) 2021-07-23

Family

ID=76875905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110642751.3A Pending CN113160803A (en) 2021-06-09 2021-06-09 End-to-end voice recognition model based on multilevel identification and modeling method

Country Status (1)

Country Link
CN (1) CN113160803A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628630A (en) * 2021-08-12 2021-11-09 科大讯飞股份有限公司 Information conversion method and device and electronic equipment
CN114495114A (en) * 2022-04-18 2022-05-13 华南理工大学 Text sequence identification model calibration method based on CTC decoder

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111480197A (en) * 2017-12-15 2020-07-31 三菱电机株式会社 Speech recognition system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111480197A (en) * 2017-12-15 2020-07-31 三菱电机株式会社 Speech recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
唐健: "深度学习语音识别***中的若干建模问题研究", CNKI博士学位论文全文库, no. 1, pages 25 - 100 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628630A (en) * 2021-08-12 2021-11-09 科大讯飞股份有限公司 Information conversion method and device and electronic equipment
CN113628630B (en) * 2021-08-12 2023-12-01 科大讯飞股份有限公司 Information conversion method and device based on coding and decoding network and electronic equipment
CN114495114A (en) * 2022-04-18 2022-05-13 华南理工大学 Text sequence identification model calibration method based on CTC decoder

Similar Documents

Publication Publication Date Title
Lipton et al. A critical review of recurrent neural networks for sequence learning
Doetsch et al. Fast and robust training of recurrent neural networks for offline handwriting recognition
Gao et al. RNN-transducer based Chinese sign language recognition
CN111557029A (en) Method and system for training a multilingual speech recognition network and speech recognition system for performing multilingual speech recognition
CN113516968B (en) End-to-end long-term speech recognition method
Woellmer et al. Keyword spotting exploiting long short-term memory
CN114023316A (en) TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN113160803A (en) End-to-end voice recognition model based on multilevel identification and modeling method
JP2019159654A (en) Time-series information learning system, method, and neural network model
Gandhe et al. Audio-attention discriminative language model for asr rescoring
Tassopoulou et al. Enhancing handwritten text recognition with n-gram sequence decomposition and multitask learning
Mai et al. Pronounce differently, mean differently: a multi-tagging-scheme learning method for Chinese NER integrated with lexicon and phonetic features
Liu et al. Multimodal emotion recognition based on cascaded multichannel and hierarchical fusion
Soltau et al. Reducing the computational complexity for whole word models
Tian et al. Integrating lattice-free MMI into end-to-end speech recognition
CN112967720B (en) End-to-end voice-to-text model optimization method under small amount of accent data
CN116362242A (en) Small sample slot value extraction method, device, equipment and storage medium
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium
JP2019078857A (en) Method of learning acoustic model, and computer program
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
Weng et al. Named entity recognition based on bert-bilstm-span in low resource scenarios
Liu et al. Investigating for punctuation prediction in Chinese speech transcriptions
CN112364668A (en) Mongolian Chinese machine translation method based on model independent element learning strategy and differentiable neural machine
Bijwadia et al. Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
CN117787224B (en) Controllable story generation method based on multi-source heterogeneous feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination