CN118098216B - Method for improving performance of speech recognition system by using non-parallel corpus - Google Patents

Method for improving performance of speech recognition system by using non-parallel corpus

Info

Publication number
CN118098216B
CN118098216B CN202410495685.5A
Authority
CN
China
Prior art keywords
decoder
training
noise
model
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410495685.5A
Other languages
Chinese (zh)
Other versions
CN118098216A (en)
Inventor
严宇平
阮伟聪
林嘉鑫
林浩
邵彦宁
卫潮冰
陈泽鸿
胡波
吴文远
吴石松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202410495685.5A priority Critical patent/CN118098216B/en
Publication of CN118098216A publication Critical patent/CN118098216A/en
Application granted granted Critical
Publication of CN118098216B publication Critical patent/CN118098216B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for improving the performance of a speech recognition system by using non-parallel corpora, relating to the technical field of speech recognition. The method comprises the following steps: collecting a non-parallel corpus containing a large amount of speech and text from sources such as the Internet, social media and broadcast programs, and building a non-parallel resource library; pre-training an encoder on unlabeled speech; training a speech recognition decoder on the non-parallel text library; fusing the trained models to realize joint fine-tuning of the encoder and decoder, in which noise of a certain energy is added at the decoder input and a proportion of language-model soft labels is mixed into the labels at the decoder output, with both the noise and the soft-label proportions decaying as training iterations increase, so that the decoder gradually turns into a speech recognizer conditioned on a given audio representation; and applying the resulting model to a speech recognition system, thereby improving its performance.

Description

Method for improving performance of speech recognition system by using non-parallel corpus
Technical Field
The invention relates to the field of voice recognition, in particular to a method for improving the performance of a voice recognition system by using non-parallel corpus.
Background
Speech recognition technology converts human speech into text, allowing a computer to recognize and understand spoken language; in recent years it has been widely applied in many practical scenarios.
A speech recognition system usually requires a large amount of paired speech-text training data, and such labeled data is expensive to obtain. If the recognition model is exposed to too little speech data, its ability to extract speech features is limited and classification accuracy drops; if it is exposed to too little text data, its text generation ability is insufficient and its recognition results may violate human grammar rules.
In recent years, self-supervised learning has developed rapidly. In contrast to supervised learning, which requires large amounts of labeled data, it aims to learn informative representations from unlabeled data. Self-supervised pre-training on massive unlabeled data followed by supervised fine-tuning on a small amount of labeled data has proven very effective in the field of speech recognition.
In order to solve the problems in the field of speech recognition, the patent provides a novel method by utilizing a self-supervision learning technology and improves the performance of a speech recognition system by utilizing non-parallel corpus.
Disclosure of Invention
The application provides a method for improving the performance of a speech recognition system by using non-parallel corpora, aiming to improve recognition accuracy. A speech recognition encoder is pre-trained on unlabeled speech, exploiting the autocorrelation of the speech signal to strengthen the feature extraction ability of the recognition model; a speech recognition decoder is pre-trained on unpaired text, using the prior distribution of text to strengthen the text modeling ability of the model, together with an encoder-decoder joint fine-tuning technique. By fully exploiting non-parallel corpora and improving recognition accuracy, the application offers a more reliable and efficient solution for speech recognition applications. To achieve the above purpose, the present application adopts the following technical scheme:
a method for improving the recognition performance of a voice system by using non-parallel corpus comprises the following steps:
S1: collecting a non-parallel corpus containing a large amount of speech and text, with sources including the Internet, social media and broadcast programs, and building a non-parallel resource library;
S2: pre-training an encoder on unlabeled speech;
S3: training a speech recognition decoder on the non-parallel text library;
S4: fusing the models obtained in steps S2 and S3 to realize joint fine-tuning of the encoder and decoder: noise of a certain energy is added at the decoder input, and a proportion of language-model soft labels is mixed into the labels at the decoder output; both proportions decay gradually with the number of training iterations, so that the decoder gradually turns into a speech recognizer conditioned on a given audio representation;
S5: applying the model obtained in step S4 to a speech recognition system, thereby improving its performance.
Preferably, the encoder in step S2 is trained on the non-parallel resource library, and the training comprises the following steps:
firstly, features are extracted from the original speech in the non-parallel corpus using one-dimensional convolution, and fbank feature extraction is applied to the original speech signal to obtain a feature matrix X;
context features are then extracted by a Transformer model.
Preferably, nonlinear features are introduced into the extracted feature matrix by applying a nonlinear activation function.
Preferably, position information is added to the sequence before training the Transformer model, and a masking operation is performed on the context feature H; the masked context feature H̃ is then sent into a quantization module Q to obtain a quantized feature matrix Z.
Preferably, the temporal relation between the context features and the quantized features is also learned when training the Transformer model: the similarity between features at the same time step is maximized while the similarity between features at different time steps is minimized, balancing the two objectives, and the loss function L is defined as follows:
L = −Σ_{t=1}^{T} [ sim(H_t, Z_t) − λ · Σ_{t′≠t} sim(H_t, Z_{t′}) ]
where t indexes the frames (T in total), sim(H_t, Z_t) is a measure of the similarity between context feature H and quantized feature Z, and λ is a trade-off factor.
Preferably, in step S3, a channel model of the speech recognition system is first constructed, and the decoder then recovers the text data Y from the audio feature H; a noise-conditional language model is proposed that abandons the simulation of the channel p and directly replaces the speech feature H with noise.
Preferably, when fusing the encoder and decoder models, energy noise is added at the decoder input and a proportion of language-model soft labels is mixed into the labels at the decoder output; the noise and soft-label proportions decay gradually with the number of training iterations, so that the decoder gradually turns into a speech recognizer conditioned on a given audio representation, and a simulated annealing algorithm is used to optimize the speech recognition system model and avoid getting stuck in a local optimum.
Compared with the prior art, the invention has the following advantages:
1. A large number of non-parallel corpora are collected; the encoder is pre-trained on unlabeled speech and the decoder is trained on the non-parallel text library, improving the performance and generalization ability of the speech recognition system.
2. Through joint fine-tuning of the encoder and decoder, the proportions of noise and soft labels are gradually reduced during training, balancing the training process and improving the accuracy and robustness of the model.
3. Nonlinear features are introduced during feature extraction; nonlinear activation functions enhance model capacity and feature diversity, improving the accuracy of speech recognition.
4. By learning the temporal relation between the context features and the quantized features, the mapping between audio representations and text data is better captured, improving the learning and generalization ability of the model.
5. The model is optimized with a simulated annealing algorithm; energy noise and language-model soft labels are introduced during decoder training, and their proportions are gradually reduced to keep the model from getting stuck in a local optimum, improving the stability and performance of the model.
Drawings
FIG. 1 is a block diagram of a method for improving performance of a speech recognition system using non-parallel corpus in accordance with the present invention;
FIG. 2 is a block diagram of the architecture of a speech recognition encoder of the present invention utilizing unlabeled speech;
FIG. 3 is a block diagram of the architecture of a speech recognition decoder of the present invention utilizing unpaired text pre-training.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings; the described embodiments are evidently only some, not all, embodiments of the present invention.
Referring to figs. 1-3, the method for improving the performance of a speech recognition system by using non-parallel corpora provided by the invention comprises the following steps:
S1: non-parallel corpus containing a large amount of voices and texts is collected, sources for obtaining the corpus comprise Internet, social media, broadcast programs and the like, and a non-parallel resource library is built.
S2: pre-training an encoder based on unlabeled speech;
I: extracting features of original voices in a non-parallel corpus by using one-dimensional convolution, and extracting fbank features of an original voice signal to obtain a feature matrix X;
Specifically, when performing feature extraction on the original speech signal with one-dimensional convolution, the signal, expressed as s(t), is divided into n frames, and a one-dimensional convolution operation is applied to each frame, yielding frame signals x_i(t), i = 1, 2, …, n, where t represents time. After the frame signals are obtained, a one-dimensional convolution is applied to each frame signal to obtain the corresponding feature vectors, and the fbank feature extraction method then gives the feature matrix X, in which each row is the feature vector of a specific time window;
① Dividing the original speech signal into frame signals:
x_i(t) = s(t) · w(t − i·Δ)
where w(·) represents the window function and Δ the frame shift; here the frame length is 30 ms and adjacent frames overlap by 15 ms;
Optionally, a high-pass filter is used to reduce the energy of the low-frequency part of the speech signal and improve its signal-to-noise ratio. This is realized by applying a first-order high-pass filter to the speech signal of each frame, as follows:
y(t) = s(t) − α · s(t − 1)
where y(t) is the pre-emphasized signal, α is the pre-emphasis coefficient, here taken as 0.95, and s(t) is a sample of the original speech signal. Computing the difference between the current sample and the previous one reduces the correlation between them, removing redundant information in the speech signal and making the high-frequency part more prominent in the subsequent feature extraction; the resulting high-frequency-enhanced signal benefits subsequent feature extraction and speech recognition performance;
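The first-order pre-emphasis filter described above can be sketched as a few lines of numpy (a minimal illustration; the constant frame is a toy input, not from the patent):

```python
import numpy as np

def pre_emphasis(frame: np.ndarray, alpha: float = 0.95) -> np.ndarray:
    """First-order high-pass (pre-emphasis) filter: y[t] = s[t] - alpha * s[t-1]."""
    out = np.empty_like(frame, dtype=float)
    out[0] = frame[0]                      # first sample has no predecessor
    out[1:] = frame[1:] - alpha * frame[:-1]
    return out

# A constant (DC, i.e. purely low-frequency) signal is almost entirely
# suppressed -- every sample after the first becomes roughly 1 - 0.95 = 0.05 --
# which is exactly the high-pass behaviour the text describes.
dc = pre_emphasis(np.ones(8))
```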
② Applying a one-dimensional convolution operation to each frame signal:
f_i = k * y_i
where k represents the one-dimensional convolution kernel, * denotes convolution, and f_i is the feature vector of frame y_i (or of x_i when pre-emphasis is not applied);
③ Extracting the feature matrix X with fbank:
X = [FBank(f_1); FBank(f_2); …; FBank(f_n)]
Optionally, a nonlinear activation function is applied to introduce nonlinear features, converting the linear combination of the convolution outputs into a nonlinear feature representation. The activation function includes but is not limited to ReLU, sigmoid and tanh; here ReLU is used, specifically:
ReLU(x) = max(0, x)
Applying the ReLU activation function gives the nonlinear feature matrix X:
X = ReLU(X)
II: extracting context characteristics through a transducer model;
Before the features are sent to a transducer, in order to preserve timing information in the sequence, position coding is added to a feature matrix X, and in order to enhance the robustness of the model, masking operation M is carried out on the context features H, and the deletion of part of the features is simulated, in particular;
① Adding position information:
X' = X + P
where P is the position encoding matrix and P_t is the position encoding vector of time step t; adding the position encoding matrix P to the feature matrix X gives the position-encoded feature matrix X'.
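The patent does not specify the form of the position encoding matrix P; as one plausible concrete choice, the standard sinusoidal encoding can be sketched as follows (the sinusoidal form and the 10000 base are assumptions, not from the source):

```python
import numpy as np

def sinusoidal_pe(T: int, d: int) -> np.ndarray:
    """Sinusoidal position encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(T)[:, None]                       # time steps 0..T-1
    i = np.arange(d)[None, :]                         # feature dimensions
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# X' = X + P: add the encoding to a (toy, all-zero) feature matrix X.
X = np.zeros((6, 8))
Xp = X + sinusoidal_pe(6, 8)
```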
② Performing the masking operation on X':
A masking matrix M with the same shape as X' is defined, whose elements are denoted m_ij, where i and j are the row and column indices respectively:
m_ij = 1 if j ≤ i, and m_ij = 0 if j > i
The masking matrix makes the model, in the self-attention mechanism, attend only to positions at or before the current position in the sequence and ignore later positions, preventing information leakage; after the softmax operation, the attention weights of the masked positions become close to zero, realizing the masking effect.
The masked context feature H̃ is then obtained as:
H̃ = M ⊙ X'
where ⊙ denotes element-wise multiplication;
③ The masked context feature H̃ is sent into the quantization module Q to obtain the quantized feature matrix Z: the continuous feature vectors are converted into discrete representations using the k-means algorithm, denoted Z = Q(H̃);
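Steps ② and ③ can be sketched together in numpy (a minimal illustration: the features, the number of codewords k = 3, and the random stand-in centroids are assumptions — in practice the k-means centroids would be fitted on the training corpus):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "context features": 10 time steps, 4 dimensions each.
H = rng.normal(size=(10, 4))

# Causal 0/1 masking matrix M: position i may only see positions j <= i.
T = H.shape[0]
M = np.tril(np.ones((T, T)))

# Quantization module Q: assign each frame to its nearest k-means centroid.
centroids = rng.normal(size=(3, 4))                      # k = 3 codewords
dists = np.linalg.norm(H[:, None, :] - centroids[None, :, :], axis=-1)
codes = dists.argmin(axis=1)                             # discrete code per frame
Z = centroids[codes]                                     # quantized feature matrix
```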
④ The position-encoded feature matrix is used as input, and context features are extracted by a Transformer encoder, which contains multiple layers of self-attention modules and feed-forward network modules and can capture the dependencies within the sequence;
in each layer of the Transformer encoder, the multi-head self-attention mechanism can focus on different positions in the input sequence, helping to extract context information.
Specifically:
A: Multi-head self-attention mechanism.
Queries, keys and values are computed by applying separate linear transformations to the input feature matrix X':
Q = X'·W^Q,  K = X'·W^K,  V = X'·W^V
where W^Q, W^K and W^V are the weight matrices of the query Q, key K and value V respectively, learned during training; they are continually tuned by the optimization algorithm so that the model can better capture the correlations and patterns in the sequence.
The attention weight between each position and the others is obtained by computing the dot product of the query Q and the key K, giving the raw attention score matrix S, and then applying the softmax operation:
S = Q·K^T
A = softmax(S)
yielding the attention weight matrix A, which represents the attention distribution of each position over the others.
The output SH of the self-attention mechanism is then computed as:
SH = A·V
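The single-head case of the mechanism above can be sketched in numpy (a minimal illustration; the dimensions are arbitrary, and the 1/√d_k scaling is the standard stabilization trick, assumed here rather than stated in the source):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (sketch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                 # raw attention scores
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)      # softmax -> attention weights
    return A @ V, A                            # output SH and weight matrix A

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 time steps, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
SH, A = self_attention(X, Wq, Wk, Wv)
```

Each row of A sums to 1, so every output row of SH is a convex combination of the value vectors.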
B: First residual connection and layer normalization.
Further, after each self-attention module and feed-forward module, residual connections and layer normalization are added, which help alleviate vanishing gradients and accelerate convergence.
The residual connection outputs a connection matrix L, which avoids the vanishing-gradient problem and lets information propagate more easily to deeper layers:
L = X' + SH
After the residual connection, layer normalization is applied to the output so that the input of each layer has a similar mean and variance, which helps accelerate training and improves the generalization ability of the model.
Specifically, the mean and variance over the feature dimensions are computed first:
μ = (1/d) · Σ_i L_i
σ² = (1/d) · Σ_i (L_i − μ)²
where i indexes the dimensions of the connection matrix L (d in total);
each feature L_i of each sample is then normalized:
L̂_i = (L_i − μ) / √(σ² + ε)
where ε is a very small positive number, used only to prevent division by a zero variance;
the normalized result is scaled and shifted by a scale parameter γ and a shift parameter β:
LN(L)_i = γ · L̂_i + β
where γ and β are both parameters to be learned.
After layer normalization, each feature of each sample is normalized to a distribution with mean 0 and variance 1 and then scaled and shifted to give the final output; this accelerates network training and improves the generalization ability of the model;
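The layer normalization above can be sketched directly from the formulas (a minimal numpy illustration with γ = 1, β = 0 and an arbitrary toy input):

```python
import numpy as np

def layer_norm(L, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize each row to zero mean / unit variance, then scale and shift."""
    mu = L.mean(axis=-1, keepdims=True)
    var = L.var(axis=-1, keepdims=True)
    return gamma * (L - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
L = rng.normal(loc=5.0, scale=3.0, size=(4, 16))   # rows far from mean 0 / var 1
out = layer_norm(L)                                # each row now ~ N(0, 1) stats
```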
C: Feed-forward network module.
Steps A and B yield the output matrix Y = LN(L).
The residual-connection and layer-normalization output Y is mapped to a new feature space by a linear transformation, a nonlinear transformation is introduced through an activation function, and the activated result is mapped to the final output space by another linear transformation:
FFN(Y) = W_2 · ReLU(W_1 · Y + b_1) + b_2
where W_1 and W_2 are weight matrices to be learned, and b_1 and b_2 are bias vectors.
The feed-forward module applies a nonlinear transformation to the residual-connection and layer-normalization output of the previous layer, improving the expressive power and learning capacity of the model;
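The position-wise feed-forward module can be sketched as follows (a minimal numpy illustration; the hidden size of 32 and the random weights are assumptions for demonstration):

```python
import numpy as np

def feed_forward(Y, W1, b1, W2, b2):
    """Position-wise FFN: linear -> ReLU -> linear, applied row by row."""
    return np.maximum(0.0, Y @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 8))                        # 5 time steps, d_model = 8
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)    # expand to hidden size
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)     # project back to d_model
out = feed_forward(Y, W1, b1, W2, b2)
```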
D: Second residual connection and layer normalization.
In the same way as step B, the input Y + FFN(Y) undergoes layer normalization, finally outputting the matrix LN(Y + FFN(Y));
steps A, B, C and D are repeated, the output of each layer serving as the input of the next; after processing by a Transformer encoder with 6 or more layers, the output of the last layer is taken as the context representation of the whole sequence, finally giving the context feature H used for subsequent speech recognition;
III: loss control;
The model needs to judge the time sequence relation between the context characteristic and the quantization characteristic during training, and specifically, a loss function is defined
;
Where t is the frame length,Representation context feature H and quantization featureA measure of similarity between, here cosine similarity,Is a trade-off factor for adjusting the contextual characteristic H and the quantitative characteristicThe importance between these two targets is determined by minimizing this loss function during trainingThe model automatically learns the timing relationship between the contextual features and the quantized features, i.e., minimizes the similarity between other temporal features while maximizing the similarity of the same temporal features.
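The loss described above can be sketched as follows (a minimal numpy illustration of a contrastive objective with cosine similarity; the trade-off factor λ = 0.1 and the toy features are assumptions — the source does not give concrete values):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(H, Z, lam=0.1):
    """Reward same-time similarity, penalize cross-time similarity."""
    T = H.shape[0]
    loss = 0.0
    for t in range(T):
        pos = cos(H[t], Z[t])                                  # same time step
        neg = sum(cos(H[t], Z[s]) for s in range(T) if s != t) # other time steps
        loss += -(pos - lam * neg)
    return loss / T

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))
aligned = contrastive_loss(H, H)          # quantized features perfectly aligned
shuffled = contrastive_loss(H, H[::-1])   # temporally misaligned features
```

Aligned features give a strictly lower loss than misaligned ones, which is exactly what drives the model to learn the timing relation.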
S3: training a speech recognition decoder based on the non-parallel text library;
The prior distribution of text is used to improve the text modeling ability of the recognition model, and a decoder pre-training method on non-parallel text is constructed: the text serves as the target output of the whole system, and text data are used at the decoder end to compute a loss function that guides the model to learn to generate audio features, producing audio features as close as possible to those matching the target text;
I: firstly, constructing a channel model of a voice recognition system, regarding a voice recognition task as an information transmission process, and converting text data Y into audio features H through communication channels such as speaker speaking, voice transmission, feature extraction, an encoder and the like;
II: the decoder recovers the text data Y based on the audio feature H, and before the decoder receives H, the uncertainty of the decoder for Y uses the entropy of the information To indicate that after H is received, its uncertainty over Y is reduced to become
Wherein the method comprises the steps ofFor the mutual information quantity, determined by the transmission channel p, the channel p is unknown for text data lacking corresponding speech, and the process needs to be simulated, and an artificial channel q is constructed to realize the conversion from Y to H:
Under the condition of insufficient training data, the difference between q and p is larger, so that the problem of dependence of the model on q in a self-supervision learning stage is caused;
Further, a noise-conditional language model is proposed: the simulation of p is abandoned, and noise is used directly in place of the speech feature H. The model is constructed in the following steps;
A: preparing text data for training a noise condition language model, wherein the text data comprises text data used in a voice recognition task and data in a non-parallel resource library;
B: generating noise related to the text data, wherein the noise can be generated randomly or according to a certain rule so as to simulate missing voice characteristics;
C: pairing the generated noise with the text data to form training samples, each comprising one piece of text data and the corresponding noise feature;
D: constructing the conditional language model and estimating its parameters using maximum likelihood estimation;
Given a training set D = {(Y_i, N_i)}, i = 1, 2, …, m, where m is the number of samples:
θ_MLE = argmax_θ Σ_{i=1}^{m} log P(N_i | Y_i; θ)
where Y_i is the text data, N_i is the corresponding noise feature, and θ_MLE are the model parameters maximizing the likelihood function;
a noise feature N is generated with a Gaussian noise model using the predefined noise model P(N | θ_MLE), and the conditional probability is then computed using the Bayesian formula:
P(Y | N) = P(N | Y) · P(Y) / P(N)
The conditional probability is computed in order to train the noise-conditional language model so that it can generate noise features N related to the text data. In the speech recognition task, text data cannot be directly converted into audio features because the corresponding speech features are missing; therefore the noise model is used to generate the noise features N via the conditional probability, and N is then taken as input so that the decoder can generate the corresponding text data.
By computing the conditional probability, the noise-conditional language model can be trained to learn the correspondence between text data and noise features; in the decoder pre-training stage, the text data are used for training while the generated noise features guide the decoder to learn to generate audio features, improving the decoder's performance on the text reconstruction task;
At this point, since H and Y are independent of each other, the decoder's uncertainty about Y is not reduced; that is, the decoder must still generate recognition results according to the prior probability of the text rather than making purely random predictions.
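The pairing of text with noise stand-ins for the missing speech (steps A-C above) can be sketched as follows (a minimal illustration: the token-id corpus, the feature dimension of 16 and the two-frames-per-token rate are all hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-parallel text corpus, as token-id sequences.
texts = [[3, 17, 5], [8, 2], [11, 4, 9, 6]]

def make_training_sample(token_ids, feat_dim=16, frames_per_token=2):
    """Pair a text with Gaussian pseudo-features standing in for the missing
    speech: the noise plays the role of the audio feature H at the decoder."""
    n_frames = len(token_ids) * frames_per_token
    noise = rng.normal(size=(n_frames, feat_dim))
    return noise, token_ids

# Each sample = (noise feature N, text data Y), as in the patent's step C.
samples = [make_training_sample(t) for t in texts]
```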
S4: The models trained in S2 and S3 are fused to realize joint fine-tuning of the encoder and decoder. Noise of a certain energy is added at the decoder input, and a proportion of language-model soft labels is mixed into the labels at the decoder output; both proportions decay gradually with the number of training iterations, so that the decoder gradually turns into a speech recognizer conditioned on a given audio representation. A simulated annealing algorithm is used to optimize the speech recognition system model and avoid getting stuck in a local optimum. Specifically:
an initial noise proportion and soft-label proportion are set, together with a strategy for gradually decreasing them with the number of training iterations;
S401: In each training iteration, the decoder input and output are adjusted according to the current noise and soft-label proportions, gradually converting the noise-conditional language model into a speech recognizer conditioned on a given audio representation:
α_t = α_0 · e^{−λ_α·t},  β_t = β_0 · e^{−λ_β·t}
where α_t and β_t represent the noise proportion and the soft-label proportion at iteration t, α_0 and β_0 represent the initial noise and soft-label proportions, and λ_α and λ_β are the decay rates;
S402: In each training iteration, the decoder input and output are adjusted according to the current noise and soft-label proportions:
X̂ = X + α_t · Z,  Ŷ = (1 − β_t) · Y + β_t · R
where X̂ and Ŷ represent the adjusted input and output of the decoder respectively, X represents the original input of the decoder, Y represents the original output of the decoder, Z represents the added noise, and R represents the added soft label;
S403: The decoder is trained to gradually learn the mapping from a given audio representation to the corresponding text data; as training iterates, the proportions of noise and soft labels decrease, and the decoder gradually shifts from the noise-and-soft-label mode to the normal speech recognition mode.
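The annealing scheme of S401-S403 can be sketched as follows (a minimal numpy illustration; the exponential decay form, all constants, and the uniform soft labels are assumptions standing in for the trained language model's distribution):

```python
import numpy as np

def schedules(t, a0=1.0, b0=1.0, la=0.05, lb=0.05):
    """Exponentially decaying noise ratio alpha_t and soft-label ratio beta_t."""
    return a0 * np.exp(-la * t), b0 * np.exp(-lb * t)

def mix(X, Y_true, Y_lm, noise, t):
    """Blend the decoder input with noise and the target with LM soft labels."""
    a, b = schedules(t)
    X_in = X + a * noise                    # noisy decoder input
    Y_out = (1 - b) * Y_true + b * Y_lm     # label smoothed toward the LM
    return X_in, Y_out

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
noise = rng.normal(size=(4, 8))
Y_true = np.eye(4)                          # one-hot reference labels
Y_lm = np.full((4, 4), 0.25)                # stand-in LM soft labels (uniform)
early = mix(X, Y_true, Y_lm, noise, t=0)    # pure noise-conditional LM regime
late = mix(X, Y_true, Y_lm, noise, t=200)   # near-normal recognition regime
```

Early in training the targets are entirely the soft labels; after many iterations both ratios vanish and the decoder sees clean inputs with the true labels.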
S5: Applying the model obtained in step S4 to a speech recognition system, finally improving the performance of the speech recognition system.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.

Claims (4)

1. A method for improving the recognition performance of a voice system by using non-parallel corpus is characterized by comprising the following steps:
S1: collecting non-parallel corpus containing a large amount of voices and texts, wherein sources for obtaining the corpus comprise Internet, social media and broadcast programs, and establishing a non-parallel resource library;
s2: pre-training an encoder based on unlabeled speech;
s3: training a speech recognition decoder based on the non-parallel text library;
S4: fusing the models obtained in steps S2 and S3 to realize joint fine-tuning of the encoder and decoder: noise of a certain energy is added at the decoder input, and a proportion of language-model soft labels is mixed into the labels at the decoder output; both proportions decay gradually with the number of training iterations, so that the decoder gradually turns into a speech recognizer conditioned on a given audio representation;
S5: applying the model obtained in the step S4 to a voice recognition system;
The encoder in step S2 is trained on the non-parallel resource library, the training comprising: firstly, extracting features from the original speech in the non-parallel corpus using one-dimensional convolution, and applying fbank feature extraction to the original speech signal to obtain a feature matrix X; then extracting context features through a Transformer model; nonlinear features are introduced into the extracted feature matrix X by applying a nonlinear activation function;
the temporal relation between the context features and the quantized features is also learned when training the Transformer model: the similarity of features at the same time step is maximized while the similarity to features at other time steps is minimized, balancing the similarity of same-time features against the difference of different-time features, and the loss function L is defined as follows:
L = −Σ_{t=1}^{T} [ sim(H_t, Z_t) − λ · Σ_{t′≠t} sim(H_t, Z_{t′}) ]
where t indexes the frames (T in total), sim(H_t, Z_t) is a measure of the similarity between context feature H and quantized feature Z, and λ is a trade-off factor;
The Transformer encoder comprises multiple layers of self-attention modules and feed-forward neural network modules, which capture the dependency relationships within the sequence; in each layer of the Transformer encoder, a multi-head self-attention mechanism attends to different positions in the input sequence, further extracting context information;
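A minimal sketch of the scaled dot-product self-attention at the core of each encoder layer. Identity query/key/value projections and a single head are simplifying assumptions for brevity; the patent's encoder uses multi-head attention stacked over multiple layers.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def self_attention(seq):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(seq[0])
    out = []
    for q in seq:                     # each position attends to every position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        w = softmax(scores)           # attention weights over the sequence
        out.append([sum(w[j] * seq[j][i] for j in range(len(seq)))
                    for i in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = self_attention(seq)             # context features, one per input position
```

Each output row is a convex combination of the input rows, weighted by how strongly that position attends to the others.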
When fusing the encoder and decoder models, noise of a certain energy is added at the decoder input, and soft labels from a language model are mixed, in a certain proportion, into the labels at the decoder output; the proportions of the noise and the soft labels decrease gradually with the number of training iterations, so that the decoder is gradually converted into a speech recognizer conditioned on a given audio representation; a simulated annealing algorithm is used to optimize the speech recognition system model.
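The simulated annealing optimization mentioned above can be sketched on a toy one-dimensional objective. The cooling schedule, proposal width, and quadratic loss are illustrative assumptions, not values from the patent.

```python
import math
import random

def simulated_annealing(loss, x0, t0=1.0, cooling=0.95, steps=200, seed=0):
    """Minimize loss(x) by accepting worse moves with probability e^(-delta/T)."""
    rng = random.Random(seed)
    x, best = x0, x0
    temp = t0
    for _ in range(steps):
        cand = x + rng.uniform(-0.5, 0.5)      # random local proposal
        delta = loss(cand) - loss(x)
        # Always accept improvements; accept worse moves while the temperature
        # is high, which lets the search escape local minima.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = cand
            if loss(x) < loss(best):
                best = x
        temp *= cooling                        # geometric cooling
    return best

# Toy loss with its minimum at x = 2.0, starting far away at x = -3.0.
best = simulated_annealing(lambda v: (v - 2.0) ** 2, x0=-3.0)
```

In the patent's setting the "state" would be the model's tunable quantities rather than a scalar, but the accept/cool loop is the same.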
2. The method for improving the performance of a speech recognition system using non-parallel corpus according to claim 1, wherein position information is added to the sequence before training the Transformer model; a masking operation is performed to obtain the masked contextual feature H, and the feature matrix is sent into a quantization module Q to obtain the quantized feature matrix Q̂.
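The masking and quantization steps of claim 2 might look like the following sketch. The mask probability, mask value, and the fixed toy codebook are assumptions; practical systems typically learn the codebook.

```python
import random

def mask_features(X, mask_prob=0.5, mask_value=0.0, seed=0):
    """Replace a random subset of time steps with a constant mask vector."""
    rng = random.Random(seed)
    masked, positions = [], []
    for t, frame in enumerate(X):
        if rng.random() < mask_prob:
            masked.append([mask_value] * len(frame))
            positions.append(t)               # remember which steps were masked
        else:
            masked.append(list(frame))
    return masked, positions

def quantize(X, codebook):
    """Map each frame to its nearest codebook vector (squared L2 distance)."""
    def nearest(frame):
        return min(codebook,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(frame, c)))
    return [nearest(f) for f in X]

X = [[0.1, 0.2], [0.8, 0.9], [0.4, 0.5]]      # toy feature matrix
codebook = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]

masked_X, pos = mask_features(X)              # input to the Transformer
Q_hat = quantize(X, codebook)                 # quantized targets Q-hat
```

The masked sequence feeds the Transformer while the unmasked frames are quantized, giving the two streams compared by the loss of claim 1.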
3. The method for improving the performance of a speech recognition system using non-parallel corpus according to claim 1, wherein in step S3, a channel model of the speech recognition system is first constructed, and the decoder then recovers the text data Y from the audio features Hs; a noise-conditioned language model is proposed that abandons simulating the transmission channel p and directly replaces the audio features Hs with noise.
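The substitution of noise for the audio features Hs during text-only decoder training can be sketched as follows. Gaussian noise, the feature dimension, and the helper name `decoder_input` are illustrative assumptions.

```python
import random

def decoder_input(text_batch, audio_features=None, dim=4, seed=0):
    """Pair each text with audio features, or with noise when no audio exists.

    With non-parallel text there is no aligned audio, so a noise vector
    stands in for the audio feature Hs that would normally condition
    the decoder.
    """
    rng = random.Random(seed)
    if audio_features is None:
        audio_features = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                          for _ in text_batch]
    return list(zip(audio_features, text_batch))

# Text-only batch from the non-parallel text library: noise conditions the decoder.
pairs = decoder_input(["hello", "world"])
```

Training on such noise-conditioned pairs lets the decoder learn language-model behavior from text alone before audio features are reintroduced in step S4.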
4. The method for improving the performance of a speech recognition system using non-parallel corpus according to claim 1, wherein in step S4, the initial noise and soft-label proportions and a strategy by which they gradually decrease with the number of training iterations are set, comprising the following steps:
S401: in each training iteration, computing the current noise and soft-label proportions so that the noise-conditioned language model is gradually converted into a speech recognizer conditioned on a given audio representation:
α_n = α_0 · e^(−k_α·n),  β_n = β_0 · e^(−k_β·n);
wherein α_n and β_n represent the noise proportion and the soft-label proportion at training iteration n, α_0 and β_0 represent the initial noise proportion and the initial soft-label proportion, and k_α and k_β are the decay rates;
S402: in each training iteration, adjusting the input and output of the decoder according to the current noise and soft-label proportions:
X̂ = X + α_n · Z,  Ŷ = (1 − β_n) · Y + β_n · R;
wherein X̂ and Ŷ represent the adjusted input and output of the decoder, X represents the original input of the decoder, Y represents the original output of the decoder, Z represents the added noise, and R represents the added soft labels;
S403: training the decoder so that it gradually learns the mapping from a given audio representation to the corresponding text data; as the noise and soft-label proportions decrease over the training iterations, the decoder is gradually converted from the mode using noise and soft labels into the normal speech recognition mode.
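Steps S401 to S403 can be sketched together as follows. The exponential-decay form of the schedule and every constant here are illustrative assumptions reconstructed from the surrounding description.

```python
import math

def schedules(alpha0=1.0, beta0=1.0, k_a=0.05, k_b=0.05, iters=100):
    """S401: exponentially decaying noise / soft-label proportions per iteration."""
    return [(alpha0 * math.exp(-k_a * n), beta0 * math.exp(-k_b * n))
            for n in range(iters)]

def mix(x, y, z, r, alpha, beta):
    """S402: adjusted decoder input and output for one training iteration."""
    x_hat = [xi + alpha * zi for xi, zi in zip(x, z)]          # input + noise
    y_hat = [(1 - beta) * yi + beta * ri for yi, ri in zip(y, r)]  # label blend
    return x_hat, y_hat

sched = schedules()
x, y = [0.5, 0.5], [1.0, 0.0]        # audio representation, one-hot label
z, r = [0.1, -0.1], [0.6, 0.4]       # noise sample, language-model soft label

first = mix(x, y, z, r, *sched[0])   # early training: heavy noise, soft labels
last = mix(x, y, z, r, *sched[-1])   # late training: near-clean recognition mode
```

Early iterations train mostly on noise and language-model soft labels; as the proportions decay the decoder converges to the normal speech recognition mode of S403.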
CN202410495685.5A 2024-04-24 2024-04-24 Method for improving performance of speech recognition system by using non-parallel corpus Active CN118098216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410495685.5A CN118098216B (en) 2024-04-24 2024-04-24 Method for improving performance of speech recognition system by using non-parallel corpus

Publications (2)

Publication Number Publication Date
CN118098216A (en) 2024-05-28
CN118098216B (en) 2024-07-09

Family

ID=91144307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410495685.5A Active CN118098216B (en) 2024-04-24 2024-04-24 Method for improving performance of speech recognition system by using non-parallel corpus

Country Status (1)

Country Link
CN (1) CN118098216B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230090763A1 (en) * 2020-02-13 2023-03-23 Muhammed Zahid Ozturk Method, apparatus, and system for voice activity detection based on radio signals
CN111428867B (en) * 2020-06-15 2020-09-18 深圳市友杰智新科技有限公司 Model training method and device based on reversible separation convolution and computer equipment
US11829721B2 (en) * 2020-10-23 2023-11-28 Salesforce.Com, Inc. Systems and methods for unsupervised paraphrase generation
CN112765358B (en) * 2021-02-23 2023-04-07 西安交通大学 Taxpayer industry classification method based on noise label learning
CN115375014A (en) * 2022-08-08 2022-11-22 国电南瑞科技股份有限公司 Source-load combination probability prediction method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A consistency self-supervised learning method for robust automatic speech recognition; Gao Changfeng et al.; Acta Acustica (《声学学报》); 2023-05-31; Vol. 48, No. 3; pp. 578-587 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant