CN113160803A

CN113160803A - End-to-end voice recognition model based on multilevel identification and modeling method

Info

Publication number: CN113160803A
Application number: CN202110642751.3A
Authority: CN
Inventors: 唐健; 胡宇晨; 戴礼荣
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2021-06-09
Filing date: 2021-06-09
Publication date: 2021-07-23

Abstract

The invention provides an end-to-end voice recognition modeling method based on multi-level identification, which comprises decoding inference, wherein the decoding inference adopts a post-inference algorithm, and the post-inference algorithm comprises the following steps: model generation posterior probability output sequence corresponding to fine-grained text sequence

The output sequence

Can uniquely correspond to coarse-grained subsequences

Generating the coarse-grained subsequence by a calculation model

And using the log-likelihood value as the existing prediction output sequence

Cross validation of (3); and (4) cutting the existing decoding path according to the likelihood probability score obtained by the calculation of the two steps, and ensuring that the search path is controlled within the beam width range.

Description

End-to-end voice recognition model based on multilevel identification and modeling method

Technical Field

The invention relates to the technical field of voice recognition, in particular to an end-to-end voice recognition model based on multi-level identification and a modeling method.

Background

End-to-End (E2E) Automatic Speech Recognition (ASR) based on an encoding-decoding framework directly models a sequence mapping relationship between an input audio sequence and an output text. The advantages of a simple framework and no need of linguistic background knowledge make the structure gradually pursued by the academic and industrial fields.

In end-to-end ASR, an input speech sequence may be mapped to a different hierarchical text sequence. The mapping relationship between the speech sequence and the text sequence is one-to-many. In Chinese ASR, the text sequence may consist of Pinyin (pinyin), Chinese characters; the english-chinese text sequence may be composed of words (words) and characters (characters).

In general, in end-to-end speech recognition, modeling with word-level text sequences is the most desirable choice. The model output does not need to be further converted through a dictionary, and end-to-end modeling in a complete sense is realized. However, if the word-level text is adopted for modeling, the capacity of the model and the required model parameters are large; on the other hand, character-level (character) text sequences are also an alternative. The character text sequence can effectively control the size and parameter of the model, but the capability of capturing the context correlation of the long time sequence in the voice signal is insufficient, and the character-level text sequence has poor performance on the task of large-vocabulary continuous voice recognition from the prior research work.

In recent years, with the development of Deep Learning (DL), Automatic Speech Recognition (ASR) has made a great progress. The traditional deep learning-based ASR framework is based on a hybrid architecture, which consists of several independent components trained based on conditional independent approximations. On the other hand, new research in ASR focuses on using an end-to-end approach to model the mapping between input audio to sequences of target text. Such as a connection-aware Temporal Classification (CTC), a Recurrent Neural Network converter (RNN-T), a Segmented Conditional Random Field (SCRF), an Attention-based codec (AED) model, and a Transformer model. Compared with the traditional hybrid architecture, the ASR of the end-to-end framework reduces the dependence on linguistic information and simplifies the system structure.

An end-to-end sequence mapping method maps an input audio sequence to a target text. The target text sequence may be composed of different levels of text. For example, english text may be composed of words (words), subwords (subwords), or characters (characters). The different levels of identification have their corresponding advantages and disadvantages.

Word-level text representation is the most common way of text representation in reality. The method is the most ideal selection as a target sequence of end-to-end speech recognition and also conforms to the application assumption of end-to-end speech. It has another advantage that: the output of the word-level model is consistent with the performance evaluation index, and the mismatching of the model optimization target and the evaluation index is avoided. In the case of a sufficient corpus of required text-tagged training words, word-level text is the most ideal choice for end-to-end speech recognition modeling. The defects of the method are that the required training data is large in quantity and the samples are not distributed uniformly. To avoid the problems of directly using word-level text, researchers have attempted to model using characters. The character-level text sequence has fewer text units, and the number of output units and the size of the model can be better controlled, so that the requirement on training data volume is reduced. However, the influence between adjacent units in the output text sequence is not considered in the construction of the character-level text unit, and the problems of the cooperative pronunciation, the non-pronunciation and the like of the voice cannot be considered. Consider the modeling difficulty of the word modeling unit and the performance deficiencies of the character text unit. There have been work using subwords for modeling, aiming at finding a balance point between the modeling difficulty and the model performance.

Another direction of research for the use of multi-level identification information is to combine multiple text sequences in an ASR system rather than picking one of them. The content of the output sequence is represented by the multiple text sequences together, so that rich and multi-level output information can be provided for the model, and the information content of the target text is enhanced. In end-to-end speech recognition modeling, researchers have adopted several multi-level labeled end-to-end speech recognition modeling approaches. The existing multi-level identification end-to-end voice recognition modeling method can be divided into three categories, namely a multi-task learning strategy (MTL), a pre-training method (pre-training) and score fusion (score fusion).

Disclosure of Invention

In view of the above, the present invention provides an end-to-end speech recognition model based on multi-level identification and a modeling method thereof, so as to partially solve at least one of the above technical problems.

In order to achieve the above object, as an aspect of the present invention, there is provided an end-to-end speech recognition modeling method based on multi-level identifiers, including decoding inference, where the decoding inference employs a post-inference algorithm, and the post-inference algorithm includes:

model generation posterior probability output sequence corresponding to fine-grained text sequence

The output sequence

Can uniquely correspond to coarse-grained subsequences

Generating the coarse-grained subsequence by a calculation model

And using the log-likelihood value as the existing prediction output sequence

Cross validation of (3);

and (4) cutting the existing decoding path according to the likelihood probability score obtained by the calculation of the two steps, and ensuring that the search path is controlled within the beam width range.

The core of the post-inference algorithm is that the inter-sequence alignment mapping information is used in the decoding and inference stage.

Wherein no new decoding path is generated in the cross-validation process, and the scores of the output results of the existing paths are re-ordered from another angle.

The score increment of each decoding path is composed of a fine-granularity log-likelihood probability score and a plurality of coarse-granularity log-likelihood probability scores.

As another aspect of the present invention, an end-to-end speech recognition model obtained by the modeling method is provided, where the speech recognition model includes an interactive decoder, and the interactive decoder includes a character module, an interactive module, a sub-word hidden layer module, and a sub-word classification module; wherein,

the character module is used for modeling output prediction of character subsequences

And provides character history status for subsequent operation process

The interactive module is used for fusing the character state and the sub-word state and using the fused interactive state as the calculation of the interactive attention module.

Wherein the character module comprises a character attention module calculation layer, a recurrent neural network layer and a full connection layer; the input of the character module is the information representation of character history output and the output sequence of the encoder

Wherein the interaction module comprises an interaction attention mechanism and a recurrent neural network layer; the interaction moduleThe inputs of are character history state, sub-word state and encoder output sequence

Wherein, the input of the sub-word hidden layer module is the information representation of the history output of the sub-words and the output sequence of the encoder

And respectively realizing the calculation of the sub-word attention vector and the updating of the state of the sub-word through the sub-word attention module and the recurrent neural network layer structure.

The input of the sub-word classification module is an interaction state and a sub-word state, the interaction state and the sub-word state are respectively used for realizing output prediction of sub-words through a full connection layer, and the two outputs are respectively called sub-word output and auxiliary sub-word output.

Wherein the interactive decoder generates three types of outputs: the method comprises character output, sub-word output and auxiliary sub-word output, wherein the three types of output correspond to three cross entropy losses, and the three types of output form a loss function of model training.

Based on the technical scheme, compared with the prior art, the end-to-end voice recognition model based on the multilevel identification and the modeling method have at least one of the following beneficial effects:

(1) the application of the post-inference algorithm and the interactive decoder provided by the invention ensures that the accuracy of the voice recognition result is improved compared with the existing recognition model by the end-to-end voice recognition model.

(2) The application of the post-inference algorithm proposed by the present invention is not limited by the end-to-end architecture.

Drawings

Fig. 1 is an alignment mapping relationship between multilevel identifiers according to an embodiment of the present invention (here, subwords and characters are taken as examples);

fig. 2 is a multi-level identification modeling method based on MTL and an end-to-end multi-level identification sequence alignment method according to an embodiment of the present invention;

FIG. 3 is a graphical model corresponding to joint conditional probabilities of a multi-level labeled end-to-end model provided by an embodiment of the present invention;

fig. 4 is an application of the alignment mapping relationship provided in the embodiment of the present invention in a multi-level label end-to-end decoding process: joint decoding algorithm (y)ⁱAnd y^jRespectively with sub-word y^bAnd the character y^cFor example);

FIG. 5 is a depiction of various experimental configurations provided by an embodiment of the present invention;

FIG. 6 is a block diagram of a sequence-to-sequence speech synthesis acoustic model for bi-level autoregressive decoding according to an embodiment of the present invention;

FIG. 7 is a block diagram of an interactive decoder in a sequence-to-sequence speech synthesis acoustic model according to an embodiment of the present invention;

FIG. 8 is a diagram of the use of multi-granular target information provided by an embodiment of the present invention; wherein (a) the interactive decoder; (b) a joint decoding algorithm;

Detailed Description

The selection of one item from a multi-level text sequence for end-to-end speech recognition modeling is not the only choice, and is less than the optimal choice. A plurality of text sequences selected in the speech recognition end-to-end modeling are recorded as multi-level identification (Multiple-hierarchical Target Sequence). The invention provides a Multi-Granularity Sequence Alignment Method (MGSA) by considering that a plurality of text sequences are selected to be commonly used for end-to-end voice recognition modeling to realize better effect.

The end-to-end ASR system can be integrally split into two parts, a model training stage (training stage) and a decoding inference stage (inference stage). The MGSA method proposed by this patent will optimize the ASR system using multi-level identification information in these two phases, respectively. First, the decoder module of the end-to-end ASR will sequentially generate a multi-level text sequence in a model structure that will take into account the interaction between different level identifiers. In addition, in an end-to-end output inference stage, the method explores and utilizes a recessive alignment mapping relation among different level marks to further improve the recognition performance. The proposed Post Inference Algorithm (Post Inference Algorithm) can further calibrate the posterior probability scores of the output sequences using multi-level identification information. Experimental results on the WSJ-80hrs and Switchboard 300hrs data sets indicate that the method has significant advantages over traditional multitasking methods and single event baseline systems.

The MGSA method provided by the invention aims to fully utilize multi-granularity information and improve the performance of an end-to-end voice recognition system as much as possible under the condition of not increasing the whole input information quantity. On the other hand, the multi-level information is put forward to play a part of the function of the language model from a certain angle, and the dependence of the end-to-end model on the external language model can be relieved. The MGSA utilizes the interactive information of the MGSA through the alignment mapping relation among different granularity units, so that the model can learn the semantic information in the MGSA, and further the performance of the model is improved.

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

1. Sequence alignment mapping relation between multi-level identifiers

In end-to-end speech recognition modeling, there are three types of text elements-character elements, sub-word elements, and word-level elements. Of these three types of text units, a previous text unit (e.g., a character) may be clustered to form the latter (e.g., a subword) that corresponds to one or more of the text units in the former. For example, in FIG. 1 the word unit "CURSE" corresponds to the sub-word sub-string "_ C OUR SE", which also uniquely maps to the character sub-sequence "O U R". The implicit alignment mapping relationship between the text sequences can be obtained by querying a dictionary. The alignment mapping relationship between text sequences is overall strict, definite and easily obtained. We denote the implicit, unique correspondence that exists between such multi-level texts (as shown by the middle solid line in fig. 1) as an alignment mapping. The invention will introduce a method MGSA of how to introduce alignment mapping in end-to-end speech recognition modeling.

End-to-end ASR can be divided into two phases of model training and decoding inference. The MGSA method takes into account the use of alignment mapping at both stages. The general framework of the method is shown in fig. 2 (b). There are three main differences compared to the conventional MTL-based approach commonly used in fig. 2 (a).

Firstly, the MGSA is used as a target of model optimization based on joint conditional probability of multi-level identification, and the use of alignment mapping relation among sequences is fully considered in the optimization process. Secondly, a novel decoder module is provided, which embodies the information transmission between the multi-level identifiers on the structural level. Interaction and fusion between the multi-level identifiers through the structure is achieved inside the model (dashed lines in fig. 2); in addition, in the output decoding stage of the end-to-end ASR, the post-inference algorithm proposed in this patent checks and corrects the recognition result of the model (dotted line in fig. 2) by identifying the correspondence between the outputs through multiple levels.

2. Coding and decoding structure of multi-level identification

2.1 formula derivation of optimization objectives

For any two text representations

And

setting yⁱFor fine-grained text sequences, y^jIs a coarse-grained text sequence. At yⁱEach text unit in (1) maps uniquely to y^jBy one or more sub-sequences of text units in the text sequence of (a). Here will be y^jIs in with yⁱThe tth text unit

The corresponding text sub-string is marked as

In that

In which contains k_tA text unit of. Further, will

The u-th text unit in

By this representation, y can be expressed^jThe following form is rewritten.

Delta in formula (1) will be the text sequence y^jAnd yⁱThe sequence alignment mapping relationship between the two is embodied in a more intuitive and explicit way. Based on the method, the invention provides a Multi-hierarchical identification Sequence Alignment (MGSA) method. Before the details of the method are introduced, an optimization formula of the MGSA is obtained by adopting a formula derivation mode. With fine-grained text sequence yⁱAnd a coarse-grained text sequence y^jThe derivation of the end-to-end speech recognition objective function is performed for example. The derivation of joint conditional probabilities for three and more text sequences can be obtained by analogy, only in the case of two text representation sequences being discussed here. Given an input speech feature sequence x, the goal of the multi-level identification end-to-end speech recognition model is to model a joint conditional probability P_θ(yⁱ；y^j|x)。

Applying to coarse-grained text sequence y in equation (1)^jAnd on the basis thereof will be expressed by

And its corresponding text sub-sequence

The text pairs formed together are marked as

The joint conditional probability of the model at this time can be expressed as

A multi-level identification is a representation of the same text at different granularities. The identifiers are represented in different forms, but correspond to the same text meaning. Wherein each unit is associated with other units. In fig. 3(a), we present a graphical model representation corresponding to the modeling objective (equation (2)).

The interaction between the two types of text sequences in FIG. 3(a) is not reasonable in view of the sequential causal properties between the multiple levels of identification. First, a unit in a certain granular text sequence should be independent of a relatively earlier-ordered unit in another granular text sequence. For example,

should not affect

The prediction of (2). Second, in a set of text pairs

Should not affect each other. A set of text pairs as in fig. 3

And

they are expressions for a unit of text that are at different granularities for the same text "OUR". Allowing interaction between them means that the output is calculated with the true identity known, so that the calculation process is meaningless. Combining the above two time sequence cause and effect considerations, further combining the conditional probability P_θ(yⁱ；y^j| x) is expressed as (the lower side will take the simplified form P_θRepresenting joint conditional probability P_θ(yⁱ；y^j|x))

The graphical model of joint conditional probabilities at this time can be simplified by using text to represent temporal causal properties between sequences as shown in FIG. 3 (b).

And comparing with the formula (3), and continuously simplifying the corresponding graph model. On this basis, it is assumed that the variables of the text sequence obey a first order markov assumption. Thereby further simplifying the joint conditional probability with the result that

Fig. 3(c) corresponds to formula (4). In joint conditional probabilities, text subsequences

Is a length of k_tA subsequence of (2). Substituting the relation into joint conditional probability, and calculating course of coarse-grained text substring

Further expansion may be by the chain rule. The formula for obtaining the joint conditional probability is expressed as follows

Equation (6) corresponds to fig. 3 (d). The formula shows that in the joint optimization of two text sequences, historical information of corresponding moments of two hierarchies is required to be considered in the generation process of model prediction output.

The basic principles to be satisfied when constructing the model can be obtained from the derivation process:

1. alignment mapping relationship between sequences: there is a strict correspondence between the fine-grained and coarse-grained text sequence expressions of the same text content, i.e., each text unit in the fine-grained text sequence corresponds to one or more text units in the coarse-grained text sequence. There is a strict one-to-many mapping relationship between the two text units, which is the root of performing multi-level identification end-to-end speech recognition modeling. The mutual influence between the multi-level identifiers to be considered later needs to be established on the premise of the mapping relation.

2. The history information is independent of each other: to guarantee the history information of two text sequences

And

without having a direct effect on each other. For each state variable, the historical time sequence modeling capability of the state variable is guaranteed, and the influence of historical output of other text sequences is avoided.

3. Directly acting on classification: the interaction of multi-level identification directly affects the classification process of text units. The end-to-end modeling process based on recursive formal expansion can be split into: historical text sequence modeling and estimation of model predictions. Considering the independence of the history information mentioned above, the interaction process of the multi-level identification information needs to be reflected in the classification process.

In the above derivation, the interaction between the multi-level identifiers is bi-directional. But some simplification can be made in practical use. Ignoring fine-grained text sequences yⁱTo coarse-grained text sequences y^jThe formula (6) is further simplified as:

2.2 codec Structure description of Multi-level identification

The proposed model structure consists of two parts, an encoder and a decoder. The structure of the encoder is the same as that of the traditional encoder; in the decoder part, the invention proposes an interactive decoder structure. The structure comprises a character module, an interaction module, a sub-word hidden layer module and a sub-word classification module; in addition, a total of three loss functions are used to guide model training.

An encoder module, the input of which is a feature sequence x of a sentence of speech, the encoder module functioning as a feature extractor for enhancing the correlation of the input sequence in the time dimension and generating an encoder output sequence

Specifically, the context information expression at each time is obtained by fusing feature sequence codes by using a Convolutional Neural Network (CNNs) and a bidirectional long-short-term memory (Bi-LSTM).

A decoder module. The module comprises a character module, a sub-word hidden layer module, an interactive module and a sub-word classification module, and the input of the module is the output sequence of an encoder

The output is a word output, which is predicted at the current time based on the context information output by the encoder.

1. And a character module. The input of the module is the information representation of character history output and the output sequence of the encoder

The module is composed of a character attention module calculation, Recurrent Neural Network (RNN) layer and Full Connectivity (FC), and the part is used for modeling output prediction of character subsequences

And provides character history status for subsequent operation process

2. Sub-wordAnd (5) a hidden layer module. The input of the module is information representation of the history output of the sub-words and the output sequence of the encoder

And respectively realizing the calculation of the sub-word attention vector and the updating of the state of the sub-word through a sub-word attention module and an RNN layer structure.

3. And an interaction module. The inputs to the module are the character history status, the subword status and the encoder output sequence

The module consists of an interactive attention mechanism and an RNN layer. The module is used for fusing the character state and the sub-word state and using the fused interactive state as the calculation of the interactive attention module. The process mainly reflects the influence of the character state on the sub-word state.

4. And a sub-word classification module. The input of the module is an interactive state and a sub-word state, and the interactive state and the sub-word state are respectively used for realizing output prediction of the sub-words through a full connection layer. The two outputs are referred to herein as a subword output and an auxiliary subword output, respectively.

Three types of outputs are mainly generated in an interactive decoder: the method comprises character output, sub-word output and auxiliary sub-word output, wherein three types of output correspond to three cross entropy losses, and the three types of output form a loss function of model training. The first two are used for ensuring model training and convergence of the character module, the interaction module and the sub-word classification module; and training a sub-word attention module in a sub-word hidden layer module in the last auxiliary model.

3. Post inference algorithm

The use of the inter-sequence alignment mapping information is not limited to the model structure, and can be used in the decoding stage.

With fine-grained text sequence yⁱAnd a coarse-grained text sequence y^jThe description is given for the sake of example. When the model generates the candidate output result at the t-th time in the decoding process

When the temperature of the water is higher than the set temperature,we can obtain the corresponding sub-sequence by the alignment mapping relationship between the sequences

For example, when the model-derived subword candidate output is "SE", its corresponding character subsequence "S E" can be synchronously derived. In the decoding phase, the invention will investigate how to use

For the predicted output

And (6) carrying out verification.

3.1 formula derivation of optimization objectives

The decoding stage of end-to-end ASR uses a beam search algorithm to pick a decoding path under a defined beam width. The log-likelihood probability of the existing decoding path is typically used as the current path score value. The decoding stage is formulated as follows

In the formula

Is a sequence of

A corresponding dictionary. The outcome of the argmax function is determined by the relative value of the function, and multiplying by 2 on the basis of the above formula does not change the outcome.

Second term in equation (9)

Is generated at time t

Likelihood probability of (d); using coarse-grained subsequences corresponding thereto

The alignment is replaced (corresponding to equation (10)), at which point

As a fine grained prediction output

Cross validation of (3); and further expanding the coarse-grained subsequence to obtain a final joint decoding algorithm expression (corresponding to the formula (11)). A new end-to-end model decoding algorithm is proposed based on formula inference of the decoding process, and is called as a joint decoding algorithm. The derivation process is based on two text sequence expansion, and the derivation process of three or more text sequences can be obtained by analogy.

3.2 introduction of concept of post-reasoning algorithm (Joint decoding algorithm)

Details of the implementation of the joint decoding algorithm are described in detail. The joint decoding process can be divided into three steps of prediction, check and clipping as a whole. Fig. 4 shows the end-to-end speech recognition decoding process at time t. Wherein, predicting: model-generated posterior probability output for fine-grained text sequences

Checking: because of the output sequence

Can uniquely correspond to a coarse-grained substring

Computing model generation subsequences

And the log-likelihood value of (2) is used as the existing prediction output

Cross-validation of (3). In this process, no new decoding path is generated, and the score of the output result of the existing path is re-ordered (re-ordering) from another point of view, so that the result is called a check. And finally, cutting, namely cutting the existing decoding path according to the likelihood probability score obtained by the two steps of calculation, and ensuring that the searching path is controlled within the beam width range. In the process, the score increment of each decoding path is composed of a fine-granularity log-likelihood probability score and a plurality of coarse-granularity log-likelihood probability scores. Overall, the joint decoding algorithm adds this step of verification compared to the traditional beam search algorithm.

As shown in FIG. 6, the input to the model coder is a speech feature sequence whose feature representation in the time dimension is extracted by CNN and BLSTM

The input to the model decoder is the encoder output

The word output corresponding to the previous frame and the character sequence output of the previous frame are the predicted word output of the current frame, and the character-level predicted output and the like are also provided.

Given a set of multi-level labeled training samples [ x; (y)^b，y^c)](ii) a The voice sequence is converted into an audio characteristic sequence x through characteristic extraction operation, and corresponding multi-level identifications are subword (subword) text sequences y^bAnd character text sequence y^c。

The encoder module of the model functions as a feature extractor for enhancing the correlation of the input audio feature sequence in the time dimension to generate an encoder output sequence

The interactive decoder module is at decoding time t from

Extracts information related to current time output, and generates sub-word prediction output by combining model historical output

Next, a specific structure of the interactive decoder in the model will be described by taking the subword prediction process at the t-th time as an example.

Before the model carries out the predicted output of the sub-words at the time t, the character sub-sequence corresponding to the sub-words at the time t-1 needs to be completed firstly

The prediction of (2).

FIG. 7 is a block diagram of an interactive decoder in the sequence-to-sequence speech synthesis acoustic model, and the details of the parts are described in detail below.

(1) Character module

Character subsequence

The prediction output process of the u-th character in (1) is as follows, firstly, the model carries out character decoder state updating and attention vector calculation as the structure of a traditional decoder. State vector

Output according to previous character time

Updating character decoder states

Where RNN represents a single-layer Recurrent Neural Network (RNN).

Character decoder state

And

loaded as feedback information into a character attention module for generating a character attention vector

And context vector

Input and output sequences in speech recognition have a monotone alignment mapping relationship, so additive attention calculation using the convolution feature [1]. The attention vector calculation corresponds to the following formula

Wherein Attend denotes a general attention module; on the basis of which the output of the character is predicted, based on

Further updating decoder states

Will be provided with

And

act together on

The output prediction process of (1). The process of character prediction output at this time is as follows

When the status of sub-word is ignored

For the influence of character prediction output, the character prediction output process can be further simplified into

W in the formula^cAnd b^cThe parameters of the discipline matrix and the deviation vector (bias vector) are respectively. The above formula forms a pair character subsequence

The prediction output of the u-th unit in (1). Repeating the above process until the character subsequence is completed

To output a prediction. After completion, the state vector of the character decoder at the moment

Is marked as

The vector contains character history information necessary for generating a predicted output of the subword at time t.

(2) Sub-word hidden layer module

And after the preparation work of the character part is completed, performing output prediction of the sub-words at the t-th moment. Is carried out in the same wayUpdating of decoding state and calculation of attention vector. First, the predicted output of historical time is adopted

Updating decoder states

Update procedures such as

And then updated state information

Generating corresponding attention vector and context vector as input of attention module

The overall structure of the sub-word hidden layer module is shown in fig. 7. The following calculation process is different from the conventional encoding-decoding model. In-subword prediction

The extra character decoder state is considered to be introduced in the process

Influence on the subword output. The invention realizes the influence of the characters on the prediction process by adding the interaction module.

(3) Interaction module

The module is composed of an attention module and two layers of RNNs, and the corresponding calculation process is shown in the middle area of FIG. 7. The structure of the interaction module is further described: by single-layer RNN fusion

And

obtaining an interaction state vector at time t

The method comprises the steps of including historical output information meeting the alignment mapping relation among sequences, and using the historical output information to calculate the attention vector of an interaction module

Including this calculation process, a total of three attends are included in the interactive decoder structure. To distinguish them, the present invention refers to the attentive calculation processes described above as a character attention module, a sub-word attention module, and an interactive attention module, respectively. The interactive attention module can generate an interactive context vector giving consideration to sub-word and character information

Can be used as

The information of (1) is supplemented.

The sub-word state can be obtained at the same time by completing the above process

And character status

Including the history state of the sub-word

And subword history output

Two sets of information;

is the character history state

And

in combination with (1). Further information fusion can be performed with the two states at the same timing and with similar composition. Reference GLU activation Unit [2]Computing a fused vector

σ (.) and FC in the formula refer to sigmoid activation function and full connectivity layer, respectively. Obtaining a fused vector

Then, by a single layer of RNN and

updating interactive decoder states

Calculation procedures as

(4) Sub-word classification module

Last using the state of the interactive module

Predicting the main subword output at the current time t

The calculation process is

Calculation of but

Status of subword

While generating auxiliary sub-word outputs as input vectors

The calculation process is

In these two formulae, WⁱAnd W^bFor the traineable matrix parameters, bⁱAnd b^bIs an offset vector. The subword classification module corresponds to the lower left corner region of fig. 7 as a whole.

(5) Model loss function

In the above calculation process, the model generates three types of outputs: output of character sub-sequences

Output of subword units

And assist in subword output

The three types of outputs correspond to a portion of the respective penalty functions. After the prediction process of outputting the sub-word with the length of T is completed, the loss function corresponding to the model is as follows

In the formula, lambda belongs to [0, 1 ]]Is a pre-set hyper-parameter in the model training process. In model training, we select Cross Entropy (CE) loss function as the objective function. The first term and the second term in the formula respectively correspond to the cross entropy loss of the character output and the sub-word output, and the third term is auxiliary sub-word output

And corresponding cross entropy loss is used for assisting the training of the sub-word attention module in the model.

(6) Information usage differences

The post-inference algorithm and the interactive decoder module both use the alignment mapping information, but at different stages. The learned words are illustratively explained in FIG. 8

The difference in context used. For post-inference algorithms, subsequences

It can be further applied to verify and correct the prediction output in fig. 8(a), while the historical output characters corresponding to time step (t-1) are used in the interactive decoder module shown in fig. 8 (b). It is clear that the alignment mapping information is utilized at different time steps. Therefore, the proposed MGSA end-to-end model can make full use of the alignment mapping information in the current and historical time steps by using a post-inference algorithm in the decoding phase.

Experimental verification

To verify the effectiveness of the proposed inter-frame decoder module and post-inference algorithm, its ASR performance was evaluated for various systems based on the bit error rate (WER) on the Switchboard-300hrs data set. The Switchboard consists of a large number of english phonetics, and a 300 hour subset of LDCs 97S62 was selected for training, leaving 10% for cross validation. And selected Hub5 eval2000 (i.e., LDC2002S09) for performance evaluation, which consists of two subsets: 1) switchboard (similar to training set) and 2) CallHome, which is collected from conversations between friends and inside the family. The complete Hub5 eval2000, subsets Switchboard and calhome are denoted as "Full", "SWD" and "CHE", respectively. For completeness, ASR performance was also evaluated for RT03 Switchboard test set (i.e., LDC2007S 10).

The encoder of the model has two convolutional layers that down sample the time series using a 3 x 3 filter and 32 channels, followed by 6 layers of bidirectional long-short-time memory (LSTM), with a cell size of 800. The default decoder is a 2-layer unidirectional LSTM with 800 cells. The 80-dimensional log-mel filter bank coefficients, three pitch coefficients and normalized mean and variance are used as input features. The char target in the experiment is a set of 46 characters containing english letters, numbers, punctuation marks and special transcription marks; for the subword target, segmenting by using a SentecPiece based on a BPE algorithm; based on the default settings in ESPnet, a vocabulary of approximately 2000 in size is used for Switchboard.

The different experimental configurations used are shown in fig. 5, where Baseline is Baseline, Baseline + is added with a layer of BLSTM in the former encoder module in order to exclude the effect of model size, MultiTask is a multi-task learning scheme, MGSA_biAnd MGSA_uniThe MGSA scheme proposed for the present invention considers the interactive information between words and characters in the former scheme, and only the information contribution of characters to words in the latter scheme.

The results of the experiment are shown in tables I and II:

TABLE I switchboard data set test results

TABLE II Experimental results of post-inference Algorithm

1. Experiment one: joint decoding algorithm

To analyze the effect of model structure on performance, we first consider the traditional bundle search algorithm of all methods in the decoding phase. Table I lists the WERs implemented on both verification sets of the Switchboard. Obviously, compared with MultiTask and Baseline based on eval2000 dataset, the MGSA proposed by the invention_uniThe WER is reduced by 1.4 percent and 1.9 percent respectively; for RT03, MGSA compared to MultiTask and Baseline_uniThe WER of (A) is reduced by 1.0% and 1.7% respectively; and MGSA_biIs inferior to MGSA_uni. In fact, MGSA_uniAnother advantage of (a) is that predictions for all character sequences can be computed simultaneously and all reference characters that need to be provided for the corresponding subword can be extracted at once.

2. Experiment two: interactive decoder

Since the multi-granularity target affects not only the model structure but also the decoding, we evaluated the impact of applying the proposed post-inference algorithm experimentally in the decoding stage. For the sake of brevity, MGSA will be used separately below_uni+, MultiTask + for MGSA_uniAnd MultiTask plus post-reasoning algorithms.

The results of the experiments on the Switchboard data set are shown in table II. Compared with MGSA_uniThe present invention provides MGSA_uniThe + method also reduced the WER of eval2000 by 0.7%, while the WER of RT03 by 0.8%. Compared with the MultiTask method, the method is also obviously improved.

The MultiTask + performance is demonstrated on the Switchboard dataset in table II, since the application of the proposed post-inference algorithm is not limited by the end-to-end architecture. Due to the use of the post-inference algorithm, the WER of the MultiTask on the Switchboard is reduced by 1.2% compared to the original MultiTask approach. Therefore, it was concluded that the proposed post-inference algorithm can further improve ASR performance. Notably, the algorithm is at MGSA_uniThe improvement on the model is higher than MultiTask. This is due to the fact that in the formerThe alignment mapping information contained in the multiple granularities is considered, while the latter is not. Due to MGSA_uni+ relative to MGSA_uniThe performance boost is achieved and is slightly less than the MultiTask + boost to MultiTask, it can be concluded that the performance boost obtained using the interframe decoder and post-inference algorithm, respectively, may be partially complementary.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An end-to-end speech recognition modeling method based on multilevel identification is characterized by comprising decoding inference, wherein the decoding inference adopts a post-inference algorithm, and the post-inference algorithm comprises the following steps:

The output sequence

Can uniquely correspond to coarse-grained subsequences

Generating the coarse-grained subsequence by a calculation model

And using the log-likelihood value as the existing prediction output sequence

Cross validation of (3);

2. The modeling method of claim 1, wherein the core of the post-inference algorithm is to use inter-sequence alignment mapping information in the decoding inference phase.

3. The modeling method of claim 1, wherein no new decoding path is generated during the cross-validation process, and the scores are re-ordered with respect to the existing path output results from another perspective.

4. The modeling method of claim 1, wherein the score increment for each decoding path consists of one fine-grained log-likelihood probability score and a plurality of coarse-grained log-likelihood probability scores.

5. An end-to-end speech recognition model obtained by the modeling method of any one of claims 1-4, wherein the speech recognition model comprises an interactive decoder comprising a character module, an interactive module, a subword hidden layer module, and a subword classification module; wherein,

And provides character history status for subsequent operation process

6. Speech according to claim 5The recognition model is characterized in that the character module comprises a character attention module calculation layer, a recurrent neural network layer and a full connection layer; the input of the character module is the information representation of character history output and the output sequence of the encoder

7. The speech recognition model of claim 5, wherein the interaction module comprises an interaction attention mechanism and recurrent neural network layer; the input of the interaction module is character history state, sub-word state and encoder output sequence

8. The speech recognition model of claim 5, wherein the input to the subword hiding module is an information representation of a subword history output and a sequence of encoder outputs

9. The speech recognition model of claim 5, wherein the input of the subword classification module is an interaction state and a subword state, the interaction state and the subword state are respectively used for realizing output prediction of the subwords through a full connection layer, and the two outputs are respectively called a subword output and an auxiliary subword output.

10. The speech recognition model of claim 5, wherein the interactive decoder generates three types of outputs: the method comprises character output, sub-word output and auxiliary sub-word output, wherein the three types of output correspond to three cross entropy losses, and the three types of output form a loss function of model training.