CN114360502A - Processing method of voice recognition model, voice recognition method and device


Info

Publication number: CN114360502A
Application number: CN202111292319.2A
Authority: CN (China)
Prior art keywords: speech, voice, character sequence, semantic, sequence
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 邓克琦, 曹松军, 马龙
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by: Tencent Technology Shenzhen Co Ltd
Priority to: CN202111292319.2A
Publication of: CN114360502A

Landscapes

  • Machine Translation (AREA)

Abstract

The application relates to a processing method of a speech recognition model, a speech recognition method, and a speech recognition apparatus, and concerns speech recognition technology in the field of artificial intelligence. The method includes the following steps: obtaining, through a speech recognition model, a speech feature corresponding to a sample signal, and outputting a first predicted character sequence based on the speech feature; inputting a forward character sequence corresponding to the labeled character sequence into a decoder, the forward character sequence being generated from the previous character of each character in the labeled character sequence; in the decoder, decoding the speech feature according to the semantic feature corresponding to the forward character sequence to obtain a speech-semantic joint feature, and obtaining a second predicted character sequence based on the speech-semantic joint feature; and jointly training the speech recognition model and the decoder based on a speech recognition loss calculated from the labeled character sequence and the first predicted character sequence, and a semantic recognition loss calculated from the labeled character sequence and the second predicted character sequence. By adopting this method, the accuracy of speech recognition can be improved.

Description

Processing method of voice recognition model, voice recognition method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method for processing a speech recognition model, a method and an apparatus for speech recognition.
Background
With the development of computer technology and artificial intelligence technology, speech recognition is required in many scenarios, such as virtual robot interaction scenarios, intelligent device control scenarios, machine translation scenarios, text conversion scenarios of voice messages, and the like. For example, the terminal receives a voice signal input by a user through a virtual robot program installed on the terminal, performs voice recognition on the voice signal to obtain a voice recognition result, and performs corresponding operations based on the voice recognition result. For another example, a voice control client is installed on the intelligent device, the intelligent device receives a voice signal input by a user through the voice control client, performs voice recognition on the voice signal to obtain a voice recognition result, obtains a control instruction based on the voice recognition result, and then executes a corresponding operation.
At present, non-autoregressive speech recognition models are widely used because of advantages such as fast recognition speed. However, non-autoregressive speech recognition models suffer from relatively low recognition accuracy, because they use only information from the speech signal at the acoustic level.
Disclosure of Invention
In view of the above, it is desirable to provide a speech recognition model processing method, a speech recognition method, and a speech recognition apparatus that can improve the accuracy of speech recognition.
A method of processing a speech recognition model, the method comprising:
acquiring a sample signal and a corresponding marked character sequence;
inputting the sample signal into a speech recognition model to obtain a speech feature corresponding to the sample signal and a first predicted character sequence output based on the speech feature;
inputting a forward character sequence corresponding to the tagged character sequence into a decoder, wherein the forward character sequence is generated based on a previous character of each character in the tagged character sequence;
in the decoder, decoding the voice features according to the semantic features corresponding to the forward character sequence to obtain voice semantic combined features corresponding to the sample signals, and predicting based on the voice semantic combined features to obtain a second predicted character sequence corresponding to the sample signals;
and jointly training the voice recognition model and the decoder based on the voice recognition loss calculated according to the marked character sequence and the first predicted character sequence and the semantic recognition loss calculated according to the marked character sequence and the second predicted character sequence.
An apparatus for processing a speech recognition model, the apparatus comprising:
the acquisition module is used for acquiring the sample signal and the corresponding marked character sequence;
the coding module is used for inputting the sample signal into a voice recognition model to obtain a voice feature corresponding to the sample signal and a first prediction character sequence output based on the voice feature;
the input module is used for inputting the forward character sequence corresponding to the label character sequence into a decoder, and the forward character sequence is generated based on the previous character of each character in the label character sequence;
a decoding module, configured to decode, in the decoder, the speech feature according to the semantic feature corresponding to the forward character sequence, to obtain a speech-semantic combined feature corresponding to the sample signal, and perform prediction based on the speech-semantic combined feature, to obtain a second predicted character sequence corresponding to the sample signal;
and the training module is used for jointly training the voice recognition model and the decoder based on the voice recognition loss calculated according to the marked character sequence and the first predicted character sequence and the semantic recognition loss calculated according to the marked character sequence and the second predicted character sequence.
In one embodiment, the encoding module is further configured to: inputting the sample signal into the speech recognition model; outputting the voice characteristics corresponding to the sample signals through an encoder of the voice recognition model; outputting, by a classifier coupled to the encoder in the speech recognition model, the first predicted character sequence based on the speech feature.
In one embodiment, the encoder includes a feature extraction network and a self-attention based speech context network; the encoding module is further configured to: inputting the sample signal into the encoder to obtain a speech vector sequence which is output by a feature extraction network in the encoder and corresponds to the sample signal; carrying out random covering processing on the voice vectors in the voice vector sequence; and inputting the voice vector sequence after the masking processing into the voice context network to obtain the context voice feature output by the voice context network as the voice feature corresponding to the sample signal.
In one embodiment, the decoder includes a vectorization layer, a self-attention based semantic context network and a cross-attention based speech semantic context network; the decoding module is further configured to: converting the forward character sequence into a corresponding forward character vector sequence through a vectorization layer of the decoder, and inputting the forward character vector sequence into the semantic context network; calculating context semantic features corresponding to the forward character sequence based on the forward character vector sequence through the semantic context network, wherein the context semantic features are used as semantic features corresponding to the forward character sequence; and calculating to obtain the voice semantic combined feature corresponding to the sample signal based on the semantic feature corresponding to the forward character sequence and the voice feature through the voice semantic context network.
In one embodiment, the decoding module is further configured to: inputting the speech semantic joint features into a classifier of the decoder; and outputting a second predicted character sequence corresponding to the sample signal based on the speech semantic joint feature through the classifier.
In one embodiment, the speech recognition model includes an encoder and a classifier coupled to the encoder; the encoder is a pre-trained encoder obtained by self-supervised training using unlabeled sample signals; the training module is further configured to: perform supervised training on the decoder and the classifier of the speech recognition model according to the speech recognition loss and the semantic recognition loss; and, when a stop condition of this supervised-training stage is met, perform supervised training on the decoder and the speech recognition model according to the speech recognition loss and the semantic recognition loss.
In one embodiment, the encoder is a pre-trained encoder obtained by performing an auto-supervised training using an unlabeled sample signal; the speech recognition model further comprises a pre-training module configured to: acquiring the label-free sample signal; inputting the label-free sample signal into an initial encoder to obtain a voice vector sequence which is output by a feature extraction network in the initial encoder and corresponds to the label-free sample signal; performing quantization operation on the voice vector sequence to obtain a voice quantization vector sequence; randomly masking the voice vectors in the voice vector sequence, and then determining masked voice vectors; inputting the voice vector sequence after the covering processing into a voice context network of the initial encoder to obtain a predicted voice vector which is output by the voice context network and corresponds to the covering voice vector; constructing an unsupervised training loss based on a difference between a speech quantization vector in the sequence of speech quantization vectors corresponding to the masked speech vector and the predicted speech vector; and after updating the network parameters of the initial encoder according to the self-supervision training loss, returning to the step of obtaining the label-free sample signal to continue training until the training is finished, and obtaining the pre-trained encoder.
In one embodiment, the training module is further to: constructing the speech recognition loss based on a difference between the annotated character sequence and the first predicted character sequence; constructing a semantic recognition loss based on a difference between the annotated character sequence and the second predicted character sequence; weighting and summing the voice recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain a target loss; and jointly training the speech recognition model and the decoder according to the target loss.
In one embodiment, the processing means of the speech recognition model further comprises a speech recognition module for: acquiring a signal to be identified; inputting the signal to be recognized into a trained voice recognition model to obtain voice features output by an encoder in the voice recognition model, and outputting a voice recognition result based on the voice features by a classifier in the voice recognition model.
A computer device includes a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above processing method of the speech recognition model when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method of processing a speech recognition model.
A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read by a processor of a computer device from the computer readable storage medium, the computer instructions being executed by the processor to cause the computer device to perform the steps of the method of processing a speech recognition model as described above.
In the processing method and apparatus of the speech recognition model, the computer device, and the storage medium, the sample signal is input into the speech recognition model to obtain the speech feature corresponding to the sample signal and the first predicted character sequence output based on the speech feature, and the forward character sequence corresponding to the labeled character sequence is input into the decoder. In the decoder, the speech feature is decoded according to the semantic feature corresponding to the forward character sequence to obtain the speech-semantic joint feature corresponding to the sample signal. Because the forward character sequence is generated from the previous character of each character in the labeled character sequence, the speech-semantic joint feature obtained by decoding and re-encoding the speech feature output by the encoder according to the semantic feature of the forward character sequence carries semantic-level context information. A second predicted character sequence corresponding to the sample signal is then predicted based on the speech-semantic joint feature, and the semantic recognition loss constructed from the second predicted character sequence and the labeled character sequence assists in training the speech recognition model, distilling semantic-level context information into the speech recognition model and thereby improving its recognition accuracy.
A method of speech recognition, the method comprising:
acquiring a signal to be identified;
inputting the signal to be recognized into a trained voice recognition model to obtain voice features output by an encoder in the voice recognition model and voice recognition results output by a classifier in the voice recognition model based on the voice features;
wherein the speech recognition model and a decoder are obtained by joint training based on a speech recognition loss and a semantic recognition loss: the speech recognition loss is calculated from a first predicted character sequence and a labeled character sequence corresponding to a sample signal, and the semantic recognition loss is calculated from a second predicted character sequence and the labeled character sequence; the first predicted character sequence is obtained by classification based on the speech features output by the encoder, and the second predicted character sequence is predicted from the speech-semantic joint features obtained by the decoder decoding the speech features with the semantic features of the forward character sequence corresponding to the labeled character sequence, the forward character sequence being generated from the previous character of each character in the labeled character sequence.
A speech recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a signal to be identified;
the speech recognition module is used for inputting the signal to be recognized into a trained speech recognition model to obtain speech features output by an encoder in the speech recognition model and speech recognition results output by a classifier in the speech recognition model based on the speech features;
wherein the speech recognition model and a decoder are obtained by joint training based on a speech recognition loss and a semantic recognition loss: the speech recognition loss is calculated from a first predicted character sequence and a labeled character sequence corresponding to a sample signal, and the semantic recognition loss is calculated from a second predicted character sequence and the labeled character sequence; the first predicted character sequence is obtained by classification based on the speech features output by the encoder, and the second predicted character sequence is predicted from the speech-semantic joint features obtained by the decoder decoding the speech features with the semantic features of the forward character sequence corresponding to the labeled character sequence, the forward character sequence being generated from the previous character of each character in the labeled character sequence.
A computer device includes a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above speech recognition method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned speech recognition method.
A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read by a processor of a computer device from the computer readable storage medium, the computer instructions being executed by the processor to cause the computer device to perform the steps of the speech recognition method described above.
According to the speech recognition method and apparatus, the computer device, and the storage medium, the signal to be recognized is input into the trained speech recognition model, the speech features output by the encoder in the speech recognition model are obtained, and the classifier in the speech recognition model outputs the speech recognition result based on those speech features.
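For illustration only (not part of the original disclosure), the following Python sketch shows one common way a CTC-style classifier output can be turned into a recognition result at inference time, using greedy (best-path) decoding; the vocabulary, tensor shapes, and function name are assumptions.

```python
import torch

# Hypothetical vocabulary; index 0 is assumed to be the CTC blank symbol.
VOCAB = ["<blank>", "今", "天", "气", "好"]

def greedy_ctc_decode(logits: torch.Tensor) -> str:
    """Collapse per-frame classifier outputs (T x V) into a character string:
    take the argmax per frame, merge repeated symbols, then drop blanks."""
    frame_ids = logits.argmax(dim=-1).tolist()
    chars, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != 0:      # skip repeats and the blank symbol
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars)

# Toy usage: 6 frames of fake classifier scores over the 5-symbol vocabulary.
fake_logits = torch.randn(6, len(VOCAB))
print(greedy_ctc_decode(fake_logits))
```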
Drawings
FIG. 1 is a diagram of an exemplary implementation of a method for processing speech recognition models;
FIG. 2 is a diagram of a speech recognition scenario in one embodiment;
FIG. 3 is a block flow diagram of a method for processing a speech recognition model in one embodiment;
FIG. 4 is a diagram of training a speech recognition model with assistance from a decoder in one embodiment;
FIG. 5 is a diagram illustrating an embodiment of obtaining speech characteristics corresponding to a sample signal by an encoder;
FIG. 6 is a diagram of an embodiment of an unsupervised pre-training of an initial encoder;
FIG. 7 is a diagram of training a speech recognition model with the assistance of a decoder in another embodiment;
FIG. 8 is a block flow diagram of a method for processing a speech recognition model in one embodiment;
FIG. 9 is a diagram of a speech recognition model training aided by a decoder in yet another embodiment;
FIG. 10 is a graphical representation of test results in one embodiment;
FIG. 11 is a block flow diagram of a speech recognition method in one embodiment;
FIG. 12 is a block diagram showing the configuration of a processing means of a speech recognition model in one embodiment;
FIG. 13 is a block diagram showing the structure of a speech recognition apparatus according to an embodiment;
FIG. 14 is a diagram showing an internal structure of a computer device in one embodiment;
fig. 15 is an internal structural view of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The processing method of the speech recognition model and the speech recognition method provided by the embodiments of the application relate to Artificial Intelligence (AI) technology. AI is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The processing method of the speech recognition model provided by the embodiments of the application mainly relates to Machine Learning (ML) technology in artificial intelligence. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
For example, in the embodiment of the present application, the speech recognition model and the decoder are trained based on the speech recognition loss and the semantic recognition loss, and finally the speech recognition model for recognizing the speech signal is obtained.
The speech recognition method provided by the embodiments of the application mainly relates to Speech Technology in artificial intelligence. The key speech technologies are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
For example, in the embodiment of the present application, a speech feature corresponding to a signal to be recognized is output through an encoder in a trained speech recognition model, and a speech recognition result is output based on the speech feature through a classifier in the trained speech recognition model.
The processing method of the speech recognition model and the speech recognition method provided by the embodiments of the application may also involve blockchain technology. Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
For example, in the embodiments of the application, the server may be a node in a blockchain network, the trained speech recognition model may be stored on the blockchain, and the signal to be recognized may be uploaded to a data block of the blockchain so that speech recognition can be performed on it.
The processing method and the speech recognition method of the speech recognition model provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, various smart phones, tablet computers, notebook computers, desktop computers, portable wearable devices, smart speakers, in-vehicle devices, and the like. The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, a cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), and a big data and artificial intelligence platform.
In one embodiment, the terminal 102 obtains a sample signal and a corresponding tagged character sequence, sends the sample signal and the corresponding tagged character sequence to the server 104, and the server 104 inputs the sample signal into the speech recognition model to obtain a speech feature corresponding to the sample signal and a first predicted character sequence output based on the speech feature; inputting a forward character sequence corresponding to the label character sequence into a decoder, wherein the forward character sequence is generated based on a previous character of each character in the label character sequence; in a decoder, decoding the voice features according to the semantic features corresponding to the forward character sequence to obtain voice semantic combined features corresponding to the sample signals, and predicting based on the voice semantic combined features to obtain a second predicted character sequence corresponding to the sample signals; and jointly training the speech recognition model and the decoder based on the speech recognition loss calculated according to the marked character sequence and the first predicted character sequence and the semantic recognition loss calculated according to the marked character sequence and the second predicted character sequence.
In the processing method of the speech recognition model provided in the embodiment of the present application, the execution main body may be the processing apparatus of the speech recognition model provided in the embodiment of the present application, or a computer device integrated with the processing apparatus of the speech recognition model, where the processing apparatus of the speech recognition model may be implemented in a hardware or software manner. The computer device may be the terminal 102 or the server 104 shown in fig. 1.
In one embodiment, the terminal 102 obtains a signal to be recognized and sends it to the server 104. The server 104 inputs the signal to be recognized into the trained speech recognition model to obtain the speech features output by the encoder in the speech recognition model and the speech recognition result output, based on the speech features, by the classifier in the speech recognition model. The speech recognition model and a decoder are obtained by joint training based on a speech recognition loss and a semantic recognition loss: the speech recognition loss is calculated from a first predicted character sequence and a labeled character sequence corresponding to a sample signal, and the semantic recognition loss is calculated from a second predicted character sequence and the labeled character sequence. The first predicted character sequence is obtained by classification based on the speech features output by the encoder; the second predicted character sequence is predicted from the speech-semantic joint features obtained by the decoder decoding the speech features with the semantic features of the forward character sequence corresponding to the labeled character sequence, the forward character sequence being generated from the previous character of each character in the labeled character sequence.
In the speech recognition method provided by the embodiment of the present application, the execution subject may be the speech recognition apparatus provided by the embodiment of the present application, or a computer device integrated with the speech recognition apparatus, where the speech recognition apparatus may be implemented in a hardware or software manner. The computer device may be the terminal 102 or the server 104 shown in fig. 1.
The voice recognition method provided by the embodiment of the application can be applied to voice interaction scenes, such as virtual robot interaction scenes, intelligent equipment control scenes, machine translation scenes, text conversion scenes of voice messages and the like. In a voice interaction scenario, a voice recognition technology and a semantic recognition technology are generally involved, the voice recognition technology can convert a voice signal into characters, and the semantic recognition technology can recognize the intention of the characters converted from the voice signal. The speech recognition model obtained by training is particularly applied to the speech recognition technology.
For example, a virtual robot program is installed on the terminal, and a background server of the virtual robot program stores the speech recognition model obtained through training in the application. The terminal receives a voice signal input by a user through a virtual robot program, a voice recognition model stored in the background server recognizes a text corresponding to the voice signal, and the terminal can execute corresponding operation based on the text or a semantic recognition result of the text.
Taking a vehicle-mounted robot as an example, the vehicle-mounted robot is a social robot applied to a scene of a vehicle-mounted intelligent cabin, and belongs to a service robot. The vehicle-mounted robot can respond to the input voice of the user in the vehicle and provide corresponding services, such as playing music/radio station/news/electronic book, navigating, inquiring weather/surrounding food, making a call, interactive chatting and the like.
Referring to fig. 2, the speech recognition system of the in-vehicle robot may include an acoustic front-end module, an offline speech recognition module, a cloud speech recognition module, an offline/cloud semantic recognition module, and the like. The acoustic front-end module provides functions such as speech noise reduction, sound source localization, and echo cancellation. The offline speech recognition module provides functions such as wake-up by fixed wake words, wake-up by customized wake words, and offline speech recognition. The cloud speech recognition module may include a speech recognition model for recognizing speech signals as text; optionally, the speech recognition model may be split into an acoustic model, a language model, a dictionary, and a decoder, where the acoustic model recognizes speech signals as phonemes, the language model and the dictionary convert the phonemes into text, and the decoder combines the acoustic model, the language model, and the dictionary to perform the whole search process from speech signal to text. The offline/cloud semantic recognition module recognizes the intention of the text converted from the speech signal. The speech recognition model trained in this application can be applied to the cloud speech recognition module of the in-vehicle robot, improving the accuracy of the in-vehicle robot's speech recognition.
For another example, a voice control client is installed on the intelligent device, and a background server of the voice control client stores the voice recognition model obtained through the training of the application. The intelligent device receives a voice signal input by a user through the voice control client, the voice recognition model stored in the background server recognizes a text corresponding to the voice signal, and the intelligent device can obtain a control instruction based on the text or a semantic recognition result of the text, so as to execute corresponding operation. Smart devices include, but are not limited to, smart home devices and the like.
For example, a translation client is installed on the terminal, and a background server of the translation client stores the speech recognition model obtained through the training of the application. The terminal receives a voice signal input by a user through a translation client, a voice recognition model stored in the background server recognizes a text corresponding to the voice signal, the text or a semantic recognition result of the text is translated to obtain a translation result, and the terminal outputs the translation result corresponding to the voice signal.
For another example, a session client is installed on the terminal, and a background server of the session client stores the speech recognition model obtained by the training of the application. The terminal receives the voice message input by the user through the session client, responds to the voice message conversion instruction, the voice recognition model stored in the background server recognizes the text corresponding to the voice message, and the terminal can display the text message corresponding to the voice message based on the text or the semantic recognition result of the text.
In an embodiment, as shown in fig. 3, a method for processing a speech recognition model is provided, and this embodiment is mainly illustrated by applying the method to the computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
step S302, a sample signal and a corresponding labeled character sequence are obtained.
Wherein the sample signal is a speech signal used for training the speech recognition model and has time-sequence characteristics. The sample signal may be an original analog sound signal or a digital signal obtained by processing the original analog sound signal. The speech recognition model is an acoustic model that has speech recognition capability after training; specifically, it may be a model, trained with sample signals as training data, for performing phoneme or character recognition on a speech signal. Each sample signal has a corresponding labeled character sequence, which may be a phoneme sequence or a text sequence. For example, for a sample signal meaning "the weather in Beijing is good", the annotation data may be the phoneme (pinyin) sequence "bei3 jing1 tian1 qi4 hao3", or the text sequence "北京天气好".
In one embodiment, the speech recognition model may be a non-autoregressive model based on CTC (Connectionist Temporal Classification). The CTC algorithm is used to solve the problem of labeling time-series data. In conventional acoustic model training, the labeled character corresponding to each frame of the sample signal must be known for effective training, so the sample signal has to be aligned before training, which is a time-consuming task. Training with a CTC loss function requires only the sample signal and its corresponding labeled character sequence, without performing any alignment of the sample signal. An autoregressive (AR) model needs to predict the next character from the characters already generated during speech recognition, so it has high recognition accuracy but a low recognition speed; a non-autoregressive speech recognition model, by contrast, can generate predicted characters simultaneously within a certain number of iterations, so it has a high recognition speed but accuracy inferior to an autoregressive model. In this application, a decoder is introduced when the speech recognition model is trained, and the speech recognition model and the decoder are trained jointly to help the speech recognition model learn semantic-level context information; during speech recognition the decoder does not participate in the recognition process, so the recognition accuracy of the speech recognition model is improved without affecting its recognition speed.
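For illustration only (not part of the original disclosure), the sketch below shows how a CTC loss can be computed in PyTorch from unaligned pairs of a sample signal and its labeled character sequence; all tensor sizes and the vocabulary are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Assumed sizes: 50 frames, batch of 2, vocabulary of 28 symbols (index 0 = blank).
T, N, V = 50, 2, 28
log_probs = F.log_softmax(torch.randn(T, N, V), dim=-1)   # per-frame classifier outputs

# Only label sequences are provided; no frame-level alignment is needed.
targets = torch.tensor([[5, 12, 7, 3, 0],
                        [9, 2, 14, 0, 0]])                 # padded; padding is ignored via target_lengths
target_lengths = torch.tensor([4, 3])
input_lengths = torch.full((N,), T, dtype=torch.long)

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
print(loss.item())
```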
In one embodiment, the computer device obtains a sample signal and a corresponding tagged character sequence, and trains a speech recognition model using the sample signal and the corresponding tagged character sequence.
Step S304, inputting the sample signal into a speech recognition model to obtain a speech feature corresponding to the sample signal and a first predicted character sequence output based on the speech feature.
Wherein the speech features are data describing the characteristics of the sample signal at the speech (acoustic) level. The speech features may be in the form of vectors; for example, a speech signal may be converted into a numeric vector such as "[0.24 0.30 0.10 0.80 0.70 0.72 0.15 0.20 …]". The first predicted character sequence is the prediction result obtained by the speech recognition model performing speech recognition on the sample signal based on the speech features, and may be a phoneme sequence or a text sequence.
In one embodiment, a computer device inputs a sample signal into a speech recognition model; outputting a voice characteristic corresponding to the sample signal through an encoder of the voice recognition model; a first predicted character sequence is output based on the speech features by a classifier coupled to the encoder in the speech recognition model.
In one embodiment, the speech recognition model may include an encoder for encoding the sample signal to obtain speech features corresponding to the sample signal, and a classifier for recognizing characters corresponding to each time interval signal in the sample signal based on the speech features and outputting a first predicted character sequence corresponding to the sample signal.
For example, referring to fig. 4, fig. 4 is a diagram illustrating assisted training of a speech recognition model by a decoder in one embodiment. The computer device inputs the sample signal into the speech recognition model, outputs the speech features [c1 c2 c3 c4 c5] corresponding to the sample signal through the encoder of the speech recognition model, and outputs the first predicted character sequence "w1w2w3w4w5" based on the speech features [c1 c2 c3 c4 c5] through the classifier of the speech recognition model.
In one embodiment, the encoder may employ a general encoder structure, such as CNN (Convolutional Neural Networks), RNN (Recurrent Neural Networks), and the like. The classifier may also adopt a general classifier structure, such as a linear classifier or the like.
Step S306, inputting the forward character sequence corresponding to the label character sequence into the decoder, wherein the forward character sequence is generated based on the previous character of each character in the label character sequence.
Wherein the forward character sequence is generated based on the character preceding each character in the labeled character sequence. For example, if the labeled character sequence L is "今天天气好" ("the weather is good today"), then from the previous character of each character in L, the forward character sequence corresponding to L is "/今天天气". Specifically, since the first character "今" in L has no preceding character, the preceding character of "今" is represented by "/", giving the first character of the forward character sequence corresponding to L. Similarly, the second character in L is "天", whose preceding character is "今", so the second character of the forward character sequence corresponding to L is "今". By analogy, the forward character sequence corresponding to L is obtained as "/今天天气".
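For illustration only (not part of the original disclosure), this is a minimal Python sketch of the shift operation described above, using a "/" start marker as in the example; the function name is hypothetical.

```python
def forward_sequence(labels: str, start_symbol: str = "/") -> str:
    """Build the forward character sequence: position i holds the character that
    precedes position i in the labeled sequence; the first position, which has
    no predecessor, is filled with the start symbol."""
    return start_symbol + labels[:-1]

assert forward_sequence("今天天气好") == "/今天天气"
```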
Step S308, in the decoder, the voice features are decoded according to the semantic features corresponding to the forward character sequence to obtain the voice semantic combined features corresponding to the sample signals, and prediction is carried out based on the voice semantic combined features to obtain a second predicted character sequence corresponding to the sample signals.
The second predicted character sequence is a predicted result obtained by the decoder performing speech recognition based on the speech semantic union characteristics, and may be a phoneme sequence or a character sequence. The speech semantic union feature is a feature obtained by decoding and re-encoding the speech feature by using the above semantic information of the labeled character sequence embodied by the forward character sequence. As the name suggests, the voice and semantic combined feature considers the feature of the voice signal on the voice level and also considers the information of the labeled character sequence corresponding to the voice information on the semantic level.
In one embodiment, the decoder decodes and re-encodes the speech features output by the encoder according to the semantic features corresponding to the forward character sequence to obtain speech-semantic joint features that carry semantic-level context information, and predicts the second predicted character sequence corresponding to the sample signal based on these speech-semantic joint features. The semantic recognition loss constructed from the second predicted character sequence and the labeled character sequence assists in training the speech recognition model, so that semantic-level context information can be distilled into the speech recognition model; this helps the speech recognition model alleviate the drawbacks of the conditional-independence assumption and of being unable to use semantic-level context information, thereby improving its recognition accuracy.
In one embodiment, the computer device inputs the forward character sequence corresponding to the tagged character sequence into a decoder, obtains semantic features corresponding to the forward character sequence in the decoder, decodes and re-encodes the voice features according to the semantic features corresponding to the forward character sequence to obtain voice and semantic combined features, and performs prediction based on the voice and semantic combined features to obtain a second predicted character sequence corresponding to the sample signal.
In one embodiment, the decoder may include a vectorization layer and a cross-attention based speech semantic context network. The vectorization layer is used for acquiring semantic features corresponding to the forward character sequence. The feature dimension of semantic features corresponding to the forward character sequence is consistent with the feature dimension of the voice features. The cross attention-based voice semantic context network is used for decoding voice features and coding the voice features by using semantic features corresponding to a forward character sequence, so that the obtained voice semantic combined features carry context information of semantic levels.
For example, with continued reference to fig. 4, the computer device inputs the forward character sequence "/x2x3x4x5" corresponding to the labeled character sequence into the decoder 402, obtains the semantic features [e1 e2 e3 e4 e5] corresponding to the forward character sequence through the vectorization layer of the decoder 402, and inputs the semantic features [e1 e2 e3 e4 e5] and the speech features [c1 c2 c3 c4 c5] extracted by the encoder into the speech-semantic context network of the decoder 402 to obtain the speech-semantic joint features [r1 r2 r3 r4 r5]. Prediction is then performed based on the speech-semantic joint features [r1 r2 r3 r4 r5] to obtain the second predicted character sequence "y1y2y3y4y5" corresponding to the sample signal.
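For illustration only (not part of the original disclosure), the following PyTorch sketch mirrors such a decoder under stated assumptions: an embedding layer stands in for the vectorization layer, and a standard Transformer decoder layer provides both the self-attention over the forward character sequence and the cross-attention onto the encoder's speech features. All dimensions, names, and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class SemanticDecoder(nn.Module):
    def __init__(self, vocab_size: int = 5000, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)           # vectorization layer
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.context = nn.TransformerDecoder(layer, num_layers=2)  # self- + cross-attention
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, forward_chars: torch.Tensor, speech_feats: torch.Tensor):
        # forward_chars: (batch, L) token ids of the forward character sequence
        # speech_feats:  (batch, T, d_model) speech features from the encoder
        sem = self.embed(forward_chars)                          # semantic features
        joint = self.context(tgt=sem, memory=speech_feats)       # speech-semantic joint features
        return self.classifier(joint)                            # logits of the second predicted character sequence

# Toy usage with made-up shapes.
dec = SemanticDecoder()
logits = dec(torch.randint(0, 5000, (2, 5)), torch.randn(2, 50, 256))
print(logits.shape)  # (2, 5, 5000)
```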
Step S310, based on the speech recognition loss calculated according to the marked character sequence and the first predicted character sequence and the semantic recognition loss calculated according to the marked character sequence and the second predicted character sequence, a speech recognition model and a decoder are jointly trained.
It can be understood that the general loss function satisfies the requirements of the embodiments of the present application for the speech recognition loss and the semantic recognition loss, so that the computer device can use the general loss function to construct the speech recognition loss and the semantic recognition loss. General loss functions such as cross entropy loss function, cosine similarity loss function, etc.
For example, with continued reference to FIG. 4, the computer device obtains the second predicted character sequence "y1y2y3y4y5" corresponding to the sample signal via the decoder 402, and the classifier of the speech recognition model outputs the first predicted character sequence "w1w2w3w4w5" corresponding to the sample signal based on the speech features [c1 c2 c3 c4 c5]. Thus, the computer device may calculate the speech recognition loss based on the labeled character sequence "x1x2x3x4x5" and the first predicted character sequence "w1w2w3w4w5", calculate the semantic recognition loss based on the labeled character sequence "x1x2x3x4x5" and the second predicted character sequence "y1y2y3y4y5", and jointly train the speech recognition model and the decoder based on the speech recognition loss and the semantic recognition loss.
In one embodiment, the computer device performs weighted summation on the voice recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain a target loss; and jointly training the speech recognition model and the decoder according to the target loss.
In one embodiment, the target loss is a composite loss function that is a combination of the speech recognition loss and the semantic recognition loss. The target loss can be expressed by the following formula:
Lt = λ1·Lv + λ2·Ls
where Lt denotes the target loss; Lv denotes the speech recognition loss and λ1 the loss weighting coefficient corresponding to the speech recognition loss (for example, λ1 may be 0.3); Ls denotes the semantic recognition loss and λ2 the loss weighting coefficient corresponding to the semantic recognition loss (for example, λ2 may be 0.7).
In one embodiment, the computer device obtains the gradient for the current training iteration based on a gradient descent algorithm in the direction that minimizes the target loss, and updates the network parameters of the speech recognition model and the decoder according to the gradient. The gradient descent algorithm may be stochastic gradient descent, or an algorithm optimized on the basis of stochastic gradient descent, such as stochastic gradient descent with a momentum term.
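For illustration only (not part of the original disclosure), the sketch below shows one way the weighted target loss and joint update could be realized, assuming a CTC loss for the speech recognition branch and a cross-entropy loss for the decoder branch; the 0.3/0.7 weights follow the example above, and all shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

lambda_v, lambda_s = 0.3, 0.7   # loss weighting coefficients from the example above

def target_loss(ctc_log_probs, input_lengths, dec_logits, labels, label_lengths):
    # Speech recognition loss Lv: CTC between encoder/classifier outputs and the labels.
    Lv = F.ctc_loss(ctc_log_probs, labels, input_lengths, label_lengths, blank=0)
    # Semantic recognition loss Ls: cross entropy between decoder outputs and the labels.
    Ls = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)), labels.reshape(-1))
    return lambda_v * Lv + lambda_s * Ls    # Lt = λ1·Lv + λ2·Ls

# Toy usage with made-up shapes: 50 frames, batch 2, vocab 28, label length 4.
T, N, V, S = 50, 2, 28, 4
raw = torch.randn(T, N, V, requires_grad=True)
ctc_log_probs = F.log_softmax(raw, dim=-1)
dec_logits = torch.randn(N, S, V, requires_grad=True)
labels = torch.randint(1, V, (N, S))
loss = target_loss(ctc_log_probs, torch.full((N,), T), dec_logits, labels, torch.full((N,), S))
loss.backward()   # gradients flow to both branches; the optimizer step is omitted here
print(loss.item())
```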
In the processing method of the speech recognition model, the sample signal is input into the speech recognition model to obtain the speech features corresponding to the sample signal and the first predicted character sequence output based on the speech features, and the forward character sequence corresponding to the labeled character sequence is input into the decoder. In the decoder, the speech features are decoded according to the semantic features corresponding to the forward character sequence to obtain the speech-semantic joint features corresponding to the sample signal. Because the forward character sequence is generated from the previous character of each character in the labeled character sequence, the speech-semantic joint features obtained by decoding and re-encoding the speech features output by the encoder according to the semantic features of the forward character sequence carry semantic-level context information. The second predicted character sequence corresponding to the sample signal is then predicted based on the speech-semantic joint features, and the semantic recognition loss constructed from the second predicted character sequence and the labeled character sequence assists in training the speech recognition model, distilling semantic-level context information into the speech recognition model and thereby improving its recognition accuracy.
In one embodiment, an encoder includes a feature extraction network and a self-attention based speech context network; outputting the speech characteristics corresponding to the sample signal through an encoder of the speech recognition model, wherein the speech characteristics comprise: inputting the sample signal into an encoder to obtain a voice vector sequence which is output by a feature extraction network in the encoder and corresponds to the sample signal; carrying out random masking treatment on the voice vectors in the voice vector sequence; and inputting the voice vector sequence after the masking processing into a voice context network to obtain the context voice characteristics output by the voice context network as the voice characteristics corresponding to the sample signal.
The speech vector sequence is a sequence formed by speech vectors, and the speech vectors refer to results obtained by mapping speech signals to a high-dimensional vector space.
In one embodiment, the computer device inputs the sample signal into the encoder, and obtains a speech vector sequence output by a feature extraction network in the encoder and corresponding to the sample signal, wherein each speech vector in the speech vector sequence is a speech vector corresponding to a speech signal of each time interval in the sample signal. For example, the computer device divides the sample signal into speech signals of t1 time period to t5 time period, inputs the sample signal into the encoder, and obtains a speech vector sequence [ z1 z2 z3 z4 z5] output by the feature extraction network in the encoder, wherein the speech vector z1 is a speech vector corresponding to the speech signal of t1 time period. It can be understood that the duration of each time period can be set according to practical application, and the application is not particularly limited.
In one embodiment, the encoder may include a feature extraction network and a self-attention-based speech context network. The feature extraction network performs feature extraction on the sample signal to obtain the speech vector sequence corresponding to the sample signal, and the self-attention-based speech context network encodes the speech vector sequence to obtain the contextual speech features corresponding to the sample signal. The self-attention-based speech context network can encode the speech vector sequence using context information, while the self-attention mechanism ensures efficient parallelism and direct connections to long-range information, thereby improving the representational capability of the speech features.
In one embodiment, the feature extraction network may adopt a general feature extraction network structure, such as CNN (Convolutional Neural Networks) or RNN (Recurrent Neural Networks). The self-attention-based speech context network may adopt a general self-attention model, such as a Transformer model or a Conformer model.
In one embodiment, the computer device performs random masking on the speech vectors in the speech vector sequence. It can be understood that common masking approaches satisfy the requirements of the masking processing in the embodiments of the application, so the speech vectors in the speech vector sequence can be masked in a common way. Optionally, the computer device may perform the masking processing on the speech vectors in the speech vector sequence by means of GELU (Gaussian Error Linear Units).
In one embodiment, the computer device inputs the voice vector sequence after the masking processing into a voice context network, and respectively calculates the self-attention corresponding to each voice vector in the voice vector sequence after the masking processing through the voice context network, wherein the self-attention can reflect the importance degree of each voice vector in the voice vector sequence after the masking processing; and outputting context speech features based on each speech vector and the corresponding self-attention thereof through a feedforward neural network.
In one embodiment, the computer device calculates the similarity between each speech vector in the speech vector sequence after the masking processing and the speech vector sequence after the masking processing through a self-attention network in the speech context network, and performs normalization processing on each similarity to obtain the self-attention corresponding to each speech vector in the speech vector sequence after the masking processing. Optionally, the computer device calculates a sum of the similarity corresponding to each speech vector in the speech vector sequence after the masking processing, and respectively calculates a ratio of the similarity corresponding to each speech vector in the speech vector sequence after the masking processing to the sum of the similarities as the self-attention corresponding to each speech vector in the speech vector sequence after the masking processing.
For example, referring to fig. 5, fig. 5 is a schematic diagram of obtaining the speech features corresponding to a sample signal through the encoder in one embodiment. The computer device divides the sample signal into the speech signals of periods t1 to t5, inputs the sample signal into the encoder 502, and obtains the speech vector sequence [z1 z2 z3 z4 z5] output by the feature extraction network in the encoder 502. The computer device performs random masking on the speech vectors in the speech vector sequence [z1 z2 z3 z4 z5] to obtain a masked speech vector sequence [* z2 * z4 *], where * denotes a masked speech vector. The computer device inputs the masked speech vector sequence [* z2 * z4 *] into the self-attention-based speech context network 504, calculates, through the self-attention network in the speech context network 504, the similarities s1, s2, s3, s4, s5 between each speech vector in the masked speech vector sequence and the masked speech vector sequence, and normalizes the similarities s1, s2, s3, s4, s5 to obtain the self-attention weights p1, p2, p3, p4, p5. The computer device inputs the self-attention weights p1, p2, p3, p4, p5 and each speech vector in the masked speech vector sequence [* z2 * z4 *] into the feedforward neural network in the speech context network 504 for encoding, and obtains the contextual speech features [c1 c2 c3 c4 c5] output by the feedforward neural network.
In this embodiment, the encoder includes a self-attention-based speech context network, which can use context information to encode the speech vector sequence output by the feature extraction network; the self-attention mechanism ensures efficient parallel computation and direct access to long-range information, thereby improving the representation capability of the speech features.
In one embodiment, the encoder is a pre-trained encoder obtained by performing an auto-supervised training using an unlabeled sample signal; the method further comprises the following steps: acquiring a label-free sample signal; inputting the label-free sample signal into an initial encoder to obtain a voice vector sequence which is output by a feature extraction network in the initial encoder and corresponds to the label-free sample signal; performing quantization operation on the voice vector sequence to obtain a voice quantization vector sequence; randomly masking the voice vectors in the voice vector sequence, and then determining masked voice vectors; inputting the voice vector sequence after the covering processing into a voice context network of an initial encoder to obtain a predicted voice vector which is output by the voice context network and corresponds to the covering voice vector; constructing a self-supervision training loss based on the difference between the voice quantization vector corresponding to the masked voice vector in the voice quantization vector sequence and the predicted voice vector; and after updating the network parameters of the initial encoder according to the self-supervision training loss, returning to the step of obtaining the label-free sample signal to continue training until the training is finished, and obtaining the pre-trained encoder.
Wherein the unlabeled sample signal is a speech signal used for performing the self-supervised pre-training of the encoder. The unlabeled sample signal has no corresponding labeled data. The initial encoder is the encoder to be subjected to the self-supervised pre-training.
In one embodiment, the computer device performs a quantization operation on the speech vector sequence to obtain a speech quantization vector sequence. The quantization operation may be a discretization process such as product quantization, in which the discrete space is formed as a Cartesian product of codebooks. The quantization operation collapses an infinite feature space into a finite discrete space, which enhances the robustness of the features and improves their representation capability.
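A minimal sketch of product quantization as described above, assuming hard nearest-codeword assignment within each sub-vector group (the embodiment may instead use a differentiable quantizer); the discrete output space is the Cartesian product of the per-group codebooks:

```python
import numpy as np

def product_quantize(z, codebooks):
    """z: (D,) speech vector; codebooks: list of G arrays, each of shape (V, D // G).
    Returns the discretized vector assembled from one codeword per group."""
    groups = np.split(z, len(codebooks))                       # split the vector into G sub-vectors
    quantized = []
    for sub, book in zip(groups, codebooks):
        idx = np.argmin(np.linalg.norm(book - sub, axis=1))    # nearest codeword in this codebook
        quantized.append(book[idx])
    return np.concatenate(quantized)                           # one element of the Cartesian product

rng = np.random.default_rng(0)
books = [rng.normal(size=(4, 4)) for _ in range(2)]            # 2 groups x 4 codewords -> 16 combinations
q = product_quantize(rng.normal(size=8), books)
```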
In one embodiment, each speech vector in the speech quantization vector sequence includes a first speech vector corresponding to the speech signal of each time interval in the sample signal, and the speech feature corresponding to the unlabeled sample signal also includes a second speech vector corresponding to the speech signal of each time interval in the sample signal. The computer device constructs the voice vector prediction loss corresponding to the voice signal in the period based on the difference between the first voice vector and the second voice vector corresponding to the voice signal in the same period, and fuses the voice vector prediction loss corresponding to the voice signal in each period to obtain the self-supervision training loss.
In one embodiment, the speech vector prediction loss corresponding to the speech signal for the t period can be expressed by the following formula:
$$L_m=-\log\frac{\exp\big(\mathrm{sim}(c_t,\,q_t)\big)}{\exp\big(\mathrm{sim}(c_t,\,q_t)\big)+\sum_{\tilde{q}\,\in\,Q_t\setminus\{q_t\}}\exp\big(\mathrm{sim}(c_t,\,\tilde{q})\big)}$$

wherein $L_m$ represents the speech vector prediction loss corresponding to the speech signal of the t-th period; $q_t$ represents the first speech vector corresponding to the speech signal of the t-th period; $c_t$ represents the second speech vector corresponding to the speech signal of the t-th period; $Q_t$ represents a set of candidate speech vectors, comprising $q_t$ and k erroneous speech vectors; $\tilde{q}$ represents any erroneous speech vector in $Q_t$; $\mathrm{sim}(c_t, q_t)$ represents the correlation between $c_t$ and $q_t$; and $\mathrm{sim}(c_t, \tilde{q})$ represents the correlation between $c_t$ and $\tilde{q}$.
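A hedged sketch of this contrastive prediction loss, assuming cosine similarity as the correlation sim(·,·); the true quantized vector q_t is placed first among the candidates, so the loss is the negative log of the share assigned to it:

```python
import torch
import torch.nn.functional as F

def speech_vector_prediction_loss(c_t, q_t, negatives):
    """c_t: (D,) predicted speech vector; q_t: (D,) true quantized vector;
    negatives: (K, D) erroneous candidate vectors. Cosine similarity is an assumption."""
    candidates = torch.cat([q_t.unsqueeze(0), negatives], dim=0)        # the candidate set Q_t
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)    # sim(c_t, q) for each candidate
    return -F.log_softmax(sims, dim=0)[0]                               # -log of q_t's normalized score

loss = speech_vector_prediction_loss(torch.randn(16), torch.randn(16), torch.randn(8, 16))
```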
In one embodiment, the computer device obtains a preset loss weighting coefficient, and performs weighted summation on the speech vector prediction loss corresponding to the speech signal in each time interval according to the preset loss weighting coefficient to obtain the self-supervision training loss.
In one embodiment, the training ends when the number of training iterations reaches a preset number, or when the loss value calculated from the self-supervised training loss is smaller than a preset value.
For example, referring to fig. 6, fig. 6 is a schematic diagram of performing self-supervised pre-training on an initial encoder in one embodiment. The computer device divides the unlabeled sample signal into the speech signals of periods t1 to t5, inputs the unlabeled sample signal into the initial encoder, and obtains the speech vector sequence [z1 z2 z3 z4 z5] output by the feature extraction network in the initial encoder. The computer device performs a quantization operation on the speech vector sequence [z1 z2 z3 z4 z5] to obtain the speech quantization vector sequence [q1 q2 q3 q4 q5]. After the computer device performs random masking processing on the speech vectors in the speech vector sequence [z1 z2 z3 z4 z5], the masked speech vectors z1, z3 and z5 are determined. The computer device inputs the masked speech vector sequence [* z2 * z4 *] into the self-attention-based speech context network, calculates, through the self-attention network in the speech context network, the similarities s1, s2, s3, s4 and s5 between each speech vector in the masked sequence and the masked sequence as a whole, and normalizes the similarities s1, s2, s3, s4 and s5 to obtain the self-attentions p1, p2, p3, p4 and p5. The computer device predicts, through the feedforward neural network in the self-attention-based speech context network, the predicted speech vectors c1, c3 and c5 corresponding to the masked speech vectors z1, z3 and z5 based on the self-attentions p1, p3 and p5. The computer device trains the initial encoder based on the speech vector prediction loss constructed from the difference between c1 and q1, the speech vector prediction loss constructed from the difference between c3 and q3, and the speech vector prediction loss constructed from the difference between c5 and q5.
In this embodiment, performing self-supervised pre-training on the encoder can improve the representation capability of the speech features output by the encoder, thereby improving the efficiency and effectiveness of subsequent training.
In one embodiment, a speech recognition model includes an encoder and a classifier coupled to the encoder; the encoder is a pre-trained encoder obtained by performing self-supervision training by using a label-free sample signal; jointly training a speech recognition model and a decoder based on speech recognition losses calculated from the annotated character sequence and the first predicted character sequence and semantic recognition losses calculated from the annotated character sequence and the second predicted character sequence, comprising: according to the voice recognition loss and the semantic recognition loss, performing supervision training on a decoder and a classifier of a voice recognition model; and when the supervised training stopping condition is met, carrying out supervised training on the decoder and the voice recognition model according to the voice recognition loss and the semantic recognition loss.
In one embodiment, the computer device performs self-supervised pre-training on the encoder in advance. After obtaining the pre-trained encoder, it fixes the network parameters of the encoder and updates the network parameters of the decoder and of the classifier of the speech recognition model according to the speech recognition loss and the semantic recognition loss; when the stop condition of this supervised stage is met, it then updates the network parameters of the decoder and of the speech recognition model according to the speech recognition loss and the semantic recognition loss.
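The two-stage schedule described above can be sketched as follows; module and function names (encoder, decoder, classifier, loss_fn) are placeholders, and the step counts and learning rates are assumptions:

```python
import itertools
import torch

def joint_train(encoder, decoder, classifier, batches, loss_fn, stage1_steps=10000):
    """batches yields training batches; loss_fn(batch) returns the combined
    speech recognition + semantic recognition loss for that batch."""
    # Stage 1: keep the pre-trained encoder fixed; supervise decoder and classifier only.
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(itertools.chain(decoder.parameters(), classifier.parameters()), lr=1e-4)
    for step, batch in enumerate(batches):
        opt.zero_grad()
        loss_fn(batch).backward()
        opt.step()
        if step + 1 == stage1_steps:                     # first-stage stop condition met
            break

    # Stage 2: unfreeze the encoder and fine-tune the decoder together with the whole model.
    for p in encoder.parameters():
        p.requires_grad_(True)
    opt = torch.optim.Adam(itertools.chain(encoder.parameters(), decoder.parameters(),
                                           classifier.parameters()), lr=1e-5)
    for batch in batches:
        opt.zero_grad()
        loss_fn(batch).backward()
        opt.step()
```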
In one embodiment, a decoder includes a vectorization layer, a self-attention based semantic context network and a cross-attention based speech semantic context network; decoding the voice features according to the semantic features corresponding to the forward character sequence to obtain the voice semantic combined features corresponding to the sample signals, wherein the method comprises the following steps: converting the forward character sequence into a corresponding forward character vector sequence through a vectorization layer of a decoder, and inputting the forward character vector sequence into a semantic context network; calculating context semantic features corresponding to the forward character sequence through a semantic context network based on the forward character vector sequence, wherein the context semantic features are used as semantic features corresponding to the forward character sequence; and calculating to obtain the voice semantic combined feature corresponding to the sample signal based on the semantic feature and the voice feature corresponding to the forward character sequence through a voice semantic context network.
In one embodiment, the decoder may include a vectorization layer, a self-attention-based semantic context network and a cross-attention-based speech semantic context network. The vectorization layer is used for converting the forward character sequence into a vector form, namely the forward character vector sequence. The self-attention-based semantic context network is used to determine the attention of each forward character vector within the forward character vector sequence, i.e. the importance degree of each forward character vector in the sequence. The cross-attention-based speech semantic context network is used to determine the attention contribution of the previous character to the prediction of the next character, i.e. how much attention needs to be paid to the previous character in order to predict the next character.
In one embodiment, the computer device inputs the forward character vector sequence into a self-attention-based semantic context network of the decoder, respectively calculates the similarity between each forward character vector and the forward character vector sequence through the semantic context network, and performs normalization processing on each similarity to obtain the self-attention of each forward character vector in the forward character vector sequence as the context semantic feature of the forward character vector sequence. Optionally, the computer device calculates the sum of the similarities, and respectively calculates the ratio of each similarity to the sum of the similarities as the self-attention of each forward character vector in the sequence of forward character vectors.
In one embodiment, the computer device inputs the self-attentions and the speech features extracted by the encoder into the cross-attention-based speech semantic context network of the decoder, calculates, through the cross-attention network in the speech semantic context network, the similarity between the self-attention corresponding to each forward character vector and the speech features, and normalizes the similarities to obtain the cross-attention of the self-attention corresponding to each forward character vector over the speech features; the speech semantic joint features are then obtained based on these cross-attentions. Optionally, the computer device calculates the sum of the similarities and takes the ratio of each similarity to that sum as the cross-attention of the self-attention corresponding to each forward character vector over the speech features.
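A minimal sketch of the cross-attention computation just described, assuming scaled dot-product similarity and softmax normalization; in the embodiment the attended result is further passed through the feedforward neural network to obtain the speech semantic joint features:

```python
import torch
import torch.nn.functional as F

def cross_attention(char_states, speech_feats):
    """char_states: (L, D) self-attention outputs for the forward character vectors;
    speech_feats: (T, D) speech features from the encoder. Returns (L, D)."""
    sims = char_states @ speech_feats.T / speech_feats.shape[-1] ** 0.5   # similarity to each speech frame
    weights = F.softmax(sims, dim=-1)          # each similarity normalized over the sum of similarities
    return weights @ speech_feats              # cross-attended combination of the speech features

attended = cross_attention(torch.randn(5, 16), torch.randn(5, 16))        # fed to the feed-forward network next
```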
In one embodiment, the computer device inputs the cross-attention corresponding to each forward character vector into the feedforward neural network in the speech semantic context network for encoding, and obtains the speech semantic joint features output by the feedforward neural network. For example, referring to fig. 7, fig. 7 is a schematic diagram of assisting the training of a speech recognition model through a decoder in one embodiment. The computer device inputs the forward character sequence "/x2x3x4x5" into the decoder 702, and converts the forward character sequence "/x2x3x4x5" into the corresponding forward character vector sequence [e1 e2 e3 e4 e5] through the vectorization layer of the decoder 702. The computer device inputs the forward character vector sequence [e1 e2 e3 e4 e5] into the self-attention-based semantic context network of the decoder 702, calculates, through the semantic context network, the similarities s1, s2, s3, s4 and s5 between the forward character vectors e1, e2, e3, e4, e5 and the forward character vector sequence [e1 e2 e3 e4 e5], and normalizes the similarities s1, s2, s3, s4 and s5 to obtain the self-attentions o1, o2, o3, o4 and o5 of the forward character vectors in the forward character vector sequence as the context semantic features of the forward character vector sequence [e1 e2 e3 e4 e5]. The computer device inputs the self-attentions o1, o2, o3, o4, o5 and the speech features [c1 c2 c3 c4 c5] extracted by the encoder into the cross-attention-based speech semantic context network 704 of the decoder 702, calculates, through the cross-attention network in the speech semantic context network 704, the similarities s1, s2, s3, s4 and s5 between each self-attention and the speech features, and normalizes the similarities s1, s2, s3, s4 and s5 to obtain the cross-attentions u1, u2, u3, u4 and u5 of the self-attentions o1, o2, o3, o4 and o5 over the speech features [c1 c2 c3 c4 c5]. The computer device inputs the cross-attentions u1, u2, u3, u4 and u5 into the feedforward neural network in the speech semantic context network 704 for encoding, and obtains the speech semantic joint features [r1 r2 r3 r4 r5] output by the feedforward neural network.
The cross-attentions u1, u2, u3, u4 and u5 are used to represent the contribution of each forward character vector to predicting the character that follows it. For example, if the annotated character sequence is "今天天气好" ("the weather is good today") and the forward character sequence is "/今天天气", the cross-attention u2 corresponding to the forward character vector of the forward character "今" indicates the importance of "今" for predicting the next character "天".
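Under one common reading of the definition above (prepend a start marker "/" and shift the annotated characters right by one), the forward character sequence can be constructed as follows; this construction is an assumption consistent with the example just given:

```python
def forward_sequence(annotated_chars, start_symbol="/"):
    """Build the decoder input: position i holds the character preceding the
    i-th annotated character, with the start symbol marking the beginning."""
    return [start_symbol] + list(annotated_chars)[:-1]

print(forward_sequence(["x1", "x2", "x3", "x4", "x5"]))   # ['/', 'x1', 'x2', 'x3', 'x4']
```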
In one embodiment, the predicting based on the speech semantic union feature to obtain the second predicted character sequence corresponding to the sample signal includes: inputting the speech semantic joint characteristics into a classifier of a decoder; and outputting a second predicted character sequence corresponding to the sample signal based on the speech semantic union characteristics through the classifier.
In one embodiment, the decoder may include a vectorization layer, a self-attention based semantic context network, a cross-attention based speech semantic context network, and a classifier for recognizing, based on the speech semantic joint features, the characters corresponding to the signal of each period in the sample signal and outputting a second predicted character sequence corresponding to the sample signal. For example, with continued reference to fig. 7, the computer device inputs the speech semantic joint features [r1 r2 r3 r4 r5] into the classifier of the decoder 702, and the classifier outputs a second predicted character sequence "y1y2y3y4y5" corresponding to the sample signal based on the speech semantic joint features. The classifier of the speech recognition model outputs a first predicted character sequence "w1w2w3w4w5" corresponding to the sample signal based on the speech features [c1 c2 c3 c4 c5]. Thus, the computer device may jointly train the speech recognition model and the decoder 702 based on the speech recognition loss calculated from the annotated character sequence "x1x2x3x4x5" and the first predicted character sequence "w1w2w3w4w5" and the semantic recognition loss calculated from the annotated character sequence "x1x2x3x4x5" and the second predicted character sequence "y1y2y3y4y5".
In this embodiment, the decoder includes a cross-attention-based speech semantic context network. This network uses the speech-level features output by the encoder together with the context information of the forward character vector sequence input to the decoder to assist the training of the speech recognition model, distilling semantic-level context information into the speech recognition model. This helps the speech recognition model alleviate the independence assumption and its inability to exploit semantic-level context information, thereby further improving speech recognition accuracy.
In one embodiment, the method further comprises: acquiring a signal to be identified; inputting the signal to be recognized into the trained speech recognition model to obtain the speech features output by an encoder in the speech recognition model, and outputting the speech recognition result based on the speech features by a classifier in the speech recognition model.
The signal to be recognized is a speech signal to be subjected to speech recognition by the method provided by the embodiment of the application. The signal to be recognized may be a voice signal received in a voice interaction scenario, such as a virtual robot interaction scenario, an intelligent device control scenario, a machine translation scenario, a text conversion scenario of a voice message, and the like.
In one embodiment, a computer device obtains a signal to be recognized, inputs the signal to be recognized into a trained speech recognition model, obtains speech features output by an encoder in the speech recognition model, and outputs a speech recognition result based on the speech features by a classifier in the speech recognition model, wherein the speech recognition result can be a phoneme or a character corresponding to the signal to be recognized.
In this embodiment, the trained speech recognition model can perform speech recognition by using context information of semantic hierarchy, so that the accuracy of speech recognition can be improved.
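A minimal inference sketch consistent with this description, assuming the trained model exposes its encoder and classifier as attributes; the decoder is only an auxiliary training component and is not needed at recognition time:

```python
import torch

def recognize(model, signal):
    """model: trained speech recognition model with .encoder and .classifier (assumed attributes);
    signal: (1, num_samples) waveform tensor. Returns per-frame character/phoneme indices."""
    model.eval()
    with torch.no_grad():
        speech_features = model.encoder(signal)        # speech features from the encoder
        logits = model.classifier(speech_features)     # (1, T, vocab) classifier scores
    return logits.argmax(dim=-1)                       # speech recognition result
```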
In one embodiment, referring to fig. 8, there is provided a method for processing a speech recognition model, comprising the steps of:
step S802, obtaining a non-labeled sample signal; inputting the label-free sample signal into an initial encoder to obtain a voice vector sequence which is output by a feature extraction network in the initial encoder and corresponds to the label-free sample signal; performing quantization operation on the voice vector sequence to obtain a voice quantization vector sequence; starting from the first voice vector of the voice vector sequence, sequentially carrying out covering processing on the voice vectors in the voice vector sequence; sequentially inputting the voice vector sequence after the covering processing into a voice context network of an initial encoder to obtain context voice features output by the voice context network, wherein the context voice features are used as voice features corresponding to the unmarked sample signals; constructing an automatic supervision training loss based on the difference between the voice quantization vector sequence and the voice characteristics corresponding to the label-free sample signal; and after updating the network parameters of the initial encoder according to the self-supervision training loss, returning to the step of obtaining the label-free sample signal to continue training until the training is finished, and obtaining the pre-trained encoder.
Step 804, obtaining a sample signal and a corresponding marked character sequence; inputting the sample signal into a pre-trained coder in a speech recognition model to obtain a speech vector sequence which is output by a feature extraction network in the coder and corresponds to the sample signal; carrying out random masking treatment on the voice vectors in the voice vector sequence; inputting the voice vector sequence after the covering processing into a voice context network to obtain context voice characteristics output by the voice context network as voice characteristics corresponding to the sample signal; a first predicted character sequence is output based on the speech features by a classifier coupled to the encoder in the speech recognition model.
Step 806, inputting a forward character sequence corresponding to the tagged character sequence into a decoder, wherein the forward character sequence is generated based on a previous character of each character in the tagged character sequence; in a decoder, converting a forward character sequence into a corresponding forward character vector sequence through a vectorization layer of the decoder, and inputting the forward character vector sequence into a semantic context network; calculating context semantic features corresponding to the forward character sequence through a semantic context network based on the forward character vector sequence, wherein the context semantic features are used as semantic features corresponding to the forward character sequence; calculating to obtain a speech semantic combined feature corresponding to the sample signal based on semantic features and speech features corresponding to the forward character sequence through a speech semantic context network; inputting the speech semantic joint characteristics into a classifier of a decoder; and outputting a second predicted character sequence corresponding to the sample signal based on the speech semantic union characteristics through the classifier.
Step S808, constructing a speech recognition loss based on the difference between the annotated character sequence and the first predicted character sequence; constructing a semantic recognition loss based on the difference between the annotated character sequence and the second predicted character sequence; weighting and summing the speech recognition loss and the semantic recognition loss according to preset loss weighting coefficients to obtain a target loss; performing supervised training on the decoder and the classifier of the speech recognition model according to the target loss; and, when the supervised-training stop condition is met, performing supervised training on the decoder and the speech recognition model according to the target loss.
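A hedged sketch of the target-loss construction in step S808, assuming cross-entropy as the form of both losses (the embodiments do not fix the loss form) and using the 0.3/0.7 weighting reported in the experiment below:

```python
import torch.nn.functional as F

def target_loss(first_logits, second_logits, labels, w_speech=0.3, w_semantic=0.7):
    """first_logits: (T, vocab) output of the recognition model's classifier;
    second_logits: (T, vocab) output of the decoder's classifier;
    labels: (T,) annotated character indices."""
    speech_loss = F.cross_entropy(first_logits, labels)       # speech recognition loss
    semantic_loss = F.cross_entropy(second_logits, labels)    # semantic recognition loss
    return w_speech * speech_loss + w_semantic * semantic_loss
```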
For example, referring to fig. 9, fig. 9 is a schematic diagram of assisting the training of a speech recognition model through a decoder in one embodiment. The computer device divides the sample signal into the speech signals of periods t1 to t5, inputs the sample signal into the encoder, and obtains the speech vector sequence [z1 z2 z3 z4 z5] output by the feature extraction network in the encoder. The computer device performs random masking processing on the speech vectors in the speech vector sequence [z1 z2 z3 z4 z5] to obtain the masked speech vector sequence [* z2 * z4 *]. The computer device inputs the masked speech vector sequence [* z2 * z4 *] into the self-attention-based speech context network, calculates, through the self-attention network in the speech context network, the similarities s1, s2, s3, s4 and s5 between each speech vector in the masked sequence and the masked sequence as a whole, and normalizes the similarities s1, s2, s3, s4 and s5 to obtain the self-attentions p1, p2, p3, p4 and p5. The computer device inputs the self-attentions p1, p2, p3, p4, p5 and each speech vector in the masked speech vector sequence [* z2 * z4 *] into the feedforward neural network in the speech context network for encoding, and obtains the contextual speech features [c1 c2 c3 c4 c5] output by the feedforward neural network. The computer device inputs the forward character sequence "/x2x3x4x5" into the decoder, and converts the forward character sequence "/x2x3x4x5" into the corresponding forward character vector sequence [e1 e2 e3 e4 e5] through the vectorization layer of the decoder. The computer device inputs the forward character vector sequence [e1 e2 e3 e4 e5] into the self-attention-based semantic context network of the decoder, calculates, through the semantic context network, the similarities s1, s2, s3, s4 and s5 between each forward character vector e1, e2, e3, e4, e5 and the forward character vector sequence [e1 e2 e3 e4 e5], and normalizes the similarities s1, s2, s3, s4 and s5 to obtain the self-attentions o1, o2, o3, o4 and o5 of the forward character vectors in the forward character vector sequence as the semantic features of the forward character vector sequence [e1 e2 e3 e4 e5]. The computer device inputs the self-attentions o1, o2, o3, o4, o5 and the speech features [c1 c2 c3 c4 c5] extracted by the encoder into the cross-attention-based speech semantic context network of the decoder, calculates, through the cross-attention network in the speech semantic context network, the similarities s1, s2, s3, s4 and s5 between each self-attention and the speech features, and normalizes the similarities s1, s2, s3, s4 and s5 to obtain the cross-attentions u1, u2, u3, u4 and u5 of the self-attentions o1, o2, o3, o4 and o5 over the speech features [c1 c2 c3 c4 c5]. The computer device inputs the cross-attentions u1, u2, u3, u4 and u5 into the feedforward neural network in the speech semantic context network for encoding, and obtains the speech semantic joint features [r1 r2 r3 r4 r5] output by the feedforward neural network.
The computer device inputs the speech semantic joint features [r1 r2 r3 r4 r5] into the classifier of the decoder, and the classifier outputs a second predicted character sequence "y1y2y3y4y5" corresponding to the sample signal based on the speech semantic joint features. The classifier of the speech recognition model outputs a first predicted character sequence "w1w2w3w4w5" corresponding to the sample signal based on the speech features [c1 c2 c3 c4 c5]. Thus, the computer device may jointly train the speech recognition model and the decoder based on the speech recognition loss calculated from the annotated character sequence "x1x2x3x4x5" and the first predicted character sequence "w1w2w3w4w5" and the semantic recognition loss calculated from the annotated character sequence "x1x2x3x4x5" and the second predicted character sequence "y1y2y3y4y5".
In one embodiment, the self-attention based voice context network may have M layers, each layer having a structure comprising, in order: multi-head Self Attention, Add (summation operation), Norm (normalization operation), Feed Forward (Feed Forward neural network), Add (summation operation), Norm (normalization operation). M may take the value 12.
In one embodiment, the decoder may specifically include an Embedding Layer serving as the vectorization layer, and N intermediate coding layers connected to the vectorization layer, where each intermediate coding layer may include, in order, a self-attention-based semantic context network and a cross-attention-based speech semantic context network. The decoder may further include a classifier connected to the N intermediate coding layers. The specific structure of the self-attention-based semantic context network comprises, in order, Multi-head Self Attention, Add (summation operation) and Norm (normalization operation). The specific structure of the cross-attention-based speech semantic context network comprises, in order, Multi-head Cross Attention, Add (summation operation), Norm (normalization operation), Feed Forward (feedforward neural network), Add (summation operation) and Norm (normalization operation). N may take the value of 6.
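A sketch of one such intermediate coding layer, assuming PyTorch modules; the head count and feed-forward width are assumptions, while the 768-dimensional features and the Attention/Add/Norm ordering follow the description above:

```python
import torch
import torch.nn as nn

class IntermediateCodingLayer(nn.Module):
    """Self-attention + Add + Norm, cross-attention + Add + Norm, feed-forward + Add + Norm."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feed_forward = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, char_vectors, speech_features):
        # Multi-head Self Attention over the forward character vectors -> Add -> Norm
        x = self.norm1(char_vectors + self.self_attn(char_vectors, char_vectors, char_vectors)[0])
        # Multi-head Cross Attention over the encoder's speech features -> Add -> Norm
        x = self.norm2(x + self.cross_attn(x, speech_features, speech_features)[0])
        # Feed Forward -> Add -> Norm
        return self.norm3(x + self.feed_forward(x))

layer = IntermediateCodingLayer()
out = layer(torch.randn(1, 5, 768), torch.randn(1, 5, 768))
```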
In the above processing method of the speech recognition model, the sample signal is input into the speech recognition model to obtain the speech features corresponding to the sample signal and a first predicted character sequence output based on the speech features. The forward character sequence corresponding to the annotated character sequence is input into the decoder, and in the decoder the speech features are decoded according to the semantic features corresponding to the forward character sequence to obtain the speech semantic joint features corresponding to the sample signal. Since the forward character sequence is generated based on the previous character of each character in the annotated character sequence, the speech semantic joint features obtained by decoding the speech features output by the encoder according to the semantic features of the forward character sequence carry semantic-level context information. A second predicted character sequence corresponding to the sample signal is then predicted based on the speech semantic joint features, and the semantic recognition loss constructed from the second predicted character sequence and the annotated character sequence assists the training of the speech recognition model, distilling semantic-level context information into the speech recognition model and thereby improving its recognition accuracy.
In order to verify the effect produced by the scheme provided in the embodiments of the present application, a comparative experiment was performed. Two training modes were adopted for the speech recognition model: jointly training the speech recognition model and the decoder (hereinafter referred to as the joint training mode), and training the speech recognition model alone (hereinafter referred to as the individual training mode). Specific implementations of the two training modes are described below.
For the joint training mode, the computer device performs the self-supervised pre-training on the encoder of the speech recognition model, and the pre-training step of the encoder refers to the step S802, which is not described herein again. After obtaining the pre-trained encoder, the computer equipment obtains a sample signal and a corresponding labeled character sequence, inputs the sample signal into the pre-trained encoder in the speech recognition model, and obtains a speech vector sequence which is output by a feature extraction network in the encoder and corresponds to the sample signal; randomly masking the voice vectors in the voice vector sequence, inputting the masked voice vector sequence into a voice context network to obtain context voice features output by the voice context network as voice features corresponding to the sample signals; a first predicted character sequence is output based on the speech features by a classifier coupled to the encoder in the speech recognition model. The computer equipment inputs a forward character sequence corresponding to the label character sequence into a decoder, wherein the forward character sequence is generated based on a previous character of each character in the label character sequence; in a decoder, converting a forward character sequence into a corresponding forward character vector sequence through a vectorization layer of the decoder, inputting the forward character vector sequence into a semantic context network, and calculating context semantic features corresponding to the forward character sequence as semantic features corresponding to the forward character sequence on the basis of the forward character vector sequence through the semantic context network; calculating to obtain a speech semantic combined feature corresponding to the sample signal based on semantic features and speech features corresponding to the forward character sequence through a speech semantic context network; inputting the speech semantic joint characteristics into a classifier of a decoder; outputting a second predicted character sequence corresponding to the sample signal based on the speech semantic union characteristics through a classifier; the computer equipment constructs voice recognition loss based on the difference between the marked character sequence and the first predicted character sequence, constructs semantic recognition loss based on the difference between the marked character sequence and the second predicted character sequence, and performs weighted summation on the voice recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain target loss; and carrying out supervised training on the classifier of the decoder and the speech recognition model according to the target loss, and carrying out supervised training on the decoder and the speech recognition model according to the target loss when the supervised training stopping condition is met.
For the single training mode, the computer device performs the self-supervised pre-training on the encoder of the speech recognition model, and the pre-training step of the encoder refers to the step S802, which is not described herein again. After obtaining the pre-trained encoder, the computer equipment obtains a sample signal and a corresponding labeled character sequence, inputs the sample signal into the pre-trained encoder in the speech recognition model, and obtains a speech vector sequence which is output by a feature extraction network in the encoder and corresponds to the sample signal; randomly masking the voice vectors in the voice vector sequence, inputting the masked voice vector sequence into a voice context network to obtain context voice features output by the voice context network as voice features corresponding to the sample signals; outputting a first predicted character sequence based on the speech features through a classifier connected with an encoder in the speech recognition model; constructing a speech recognition loss based on a difference between the annotated character sequence and the first predicted character sequence; and carrying out supervised training on the classifier of the voice recognition model according to the voice recognition loss, and carrying out supervised training on the encoder and the classifier of the voice recognition model according to the voice recognition loss when the supervised training stopping condition is met.
For both training modes, the self-supervised training data is 960 hours of LibriSpeech data, and the supervised training data is the open-source Chinese speech recognition dataset Aishell-1. The Aishell-1 dataset comprises a training set, a validation set and a test set; the Aishell-1 training set contains 120098 utterances, the Aishell-1 validation set contains 14326 utterances, and the Aishell-1 test set contains 7176 utterances. The feature dimension of both the decoder and the encoder is 768. For the joint training mode, the loss weighting coefficient of the speech recognition loss is 0.3 and the loss weighting coefficient of the semantic recognition loss is 0.7. The number of layers M of the self-attention-based speech context network is 12, and the number of layers N of the decoder is 6.
The speech recognition models obtained with the joint training mode and the individual training mode were tested, and the test results are shown in fig. 10. It can be seen that, compared with the speech recognition model trained in the individual training mode, the speech recognition model trained in the joint training mode has a significantly reduced word error rate; that is, the joint training mode can significantly improve the performance of the model.
In an embodiment, as shown in fig. 11, a speech recognition method is provided, and this embodiment is mainly illustrated by applying the method to the computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
step S1102, a signal to be identified is acquired.
The signal to be recognized is a speech signal to be subjected to speech recognition by the method provided by the embodiment of the application. The signal to be recognized may be a voice signal received in a voice interaction scenario, such as a virtual robot interaction scenario, an intelligent device control scenario, a machine translation scenario, a text conversion scenario of a voice message, and the like.
Step S1104, inputting the signal to be recognized into the trained speech recognition model, and obtaining the speech features output by the encoder in the speech recognition model and the speech recognition result output by the classifier in the speech recognition model based on the speech features; the voice recognition model and the decoder are obtained through joint training based on voice recognition loss and semantic recognition loss, the voice recognition loss is obtained through calculation according to a first prediction character sequence and a labeled character sequence corresponding to a sample signal, the semantic recognition loss is obtained through calculation according to a second prediction character sequence and the labeled character sequence, the first prediction character sequence is obtained after classification based on voice features output by an encoder, the second prediction character sequence is obtained through prediction of voice semantic combined features obtained through decoding of the voice features through the decoder by using semantic features corresponding to a forward character sequence corresponding to the labeled character sequence, and the forward character sequence is generated based on a previous character of each character in the labeled character sequence.
In one embodiment, a computer device obtains a signal to be recognized, inputs the signal to be recognized into a trained speech recognition model, obtains speech features output by an encoder in the speech recognition model, and outputs a speech recognition result based on the speech features by a classifier in the speech recognition model, wherein the speech recognition result can be a phoneme or a character corresponding to the signal to be recognized.
For the training of the speech recognition model, reference may be made to the above embodiments, which are not described herein again.
In the speech recognition method, the signal to be recognized is input into the trained speech recognition model to obtain the speech features output by the encoder in the speech recognition model and the speech recognition result output by the classifier in the speech recognition model based on the speech features. Since the speech recognition model is obtained through joint training with the decoder, semantic-level context information has been distilled into it, so the accuracy of speech recognition can be improved.
It should be understood that, although the steps in the flowcharts of fig. 3, fig. 8 and fig. 11 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in fig. 3, fig. 8 and fig. 11 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 12, there is provided a speech recognition model processing apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an acquisition module 1202, an encoding module 1204, an input module 1206, a decoding module 1208, and a training module 1210, wherein:
an obtaining module 1202, configured to obtain a sample signal and a corresponding labeled character sequence;
the encoding module 1204 is configured to input the sample signal into a speech recognition model, obtain a speech feature corresponding to the sample signal, and output a first predicted character sequence based on the speech feature;
an input module 1206, configured to input a forward character sequence corresponding to the tagged character sequence into the decoder, where the forward character sequence is generated based on a previous character of each character in the tagged character sequence;
a decoding module 1208, configured to decode, in a decoder, the speech feature according to the semantic feature corresponding to the forward character sequence to obtain a speech-semantic combined feature corresponding to the sample signal, and perform prediction based on the speech-semantic combined feature to obtain a second predicted character sequence corresponding to the sample signal;
a training module 1210 for jointly training the speech recognition model and the decoder based on the speech recognition loss calculated from the annotated character sequence and the first predicted character sequence and the semantic recognition loss calculated from the annotated character sequence and the second predicted character sequence.
In one embodiment, the encoding module 1204 is further configured to: inputting the sample signal into a speech recognition model; outputting a voice characteristic corresponding to the sample signal through an encoder of the voice recognition model; a first predicted character sequence is output based on the speech features by a classifier coupled to the encoder in the speech recognition model.
In one embodiment, an encoder includes a feature extraction network and a self-attention based speech context network; the encoding module 1204 is further configured to: inputting the sample signal into an encoder to obtain a voice vector sequence which is output by a feature extraction network in the encoder and corresponds to the sample signal; carrying out random masking treatment on the voice vectors in the voice vector sequence; and inputting the voice vector sequence after the masking processing into a voice context network to obtain the context voice characteristics output by the voice context network as the voice characteristics corresponding to the sample signal.
In one embodiment, a decoder includes a vectorization layer, a self-attention based semantic context network and a cross-attention based speech semantic context network; the decoding module 1208 is further configured to: converting the forward character sequence into a corresponding forward character vector sequence through a vectorization layer of a decoder, and inputting the forward character vector sequence into a semantic context network; calculating context semantic features corresponding to the forward character sequence through a semantic context network based on the forward character vector sequence, wherein the context semantic features are used as semantic features corresponding to the forward character sequence; and calculating to obtain the voice semantic combined feature corresponding to the sample signal based on the semantic feature and the voice feature corresponding to the forward character sequence through a voice semantic context network.
In one embodiment, the decoding module 1208 is further configured to: inputting the speech semantic joint characteristics into a classifier of a decoder; and outputting a second predicted character sequence corresponding to the sample signal based on the speech semantic union characteristics through the classifier.
In one embodiment, a speech recognition model includes an encoder and a classifier coupled to the encoder; the encoder is a pre-trained encoder obtained by performing self-supervision training by using a label-free sample signal; the training module 1210 is further configured to: according to the voice recognition loss and the semantic recognition loss, performing supervision training on a decoder and a classifier of a voice recognition model; and when the supervised training stopping condition is met, carrying out supervised training on the decoder and the voice recognition model according to the voice recognition loss and the semantic recognition loss.
In one embodiment, the encoder is a pre-trained encoder obtained by performing self-supervised training using an unlabeled sample signal; the processing apparatus of the speech recognition model further comprises a pre-training module, the pre-training module being configured to: acquire an unlabeled sample signal; input the unlabeled sample signal into an initial encoder to obtain a speech vector sequence which is output by the feature extraction network in the initial encoder and corresponds to the unlabeled sample signal; perform a quantization operation on the speech vector sequence to obtain a speech quantization vector sequence; randomly mask the speech vectors in the speech vector sequence and then determine the masked speech vectors; input the masked speech vector sequence into the speech context network of the initial encoder to obtain the predicted speech vectors, output by the speech context network, corresponding to the masked speech vectors; construct a self-supervised training loss based on the difference between the speech quantization vectors corresponding to the masked speech vectors in the speech quantization vector sequence and the predicted speech vectors; and, after updating the network parameters of the initial encoder according to the self-supervised training loss, return to the step of acquiring the unlabeled sample signal to continue training until the training is finished, thereby obtaining the pre-trained encoder.
In one embodiment, training module 1210 is further configured to: constructing a speech recognition loss based on a difference between the annotated character sequence and the first predicted character sequence; constructing a semantic recognition loss based on a difference between the annotated character sequence and the second predicted character sequence; weighting and summing the voice recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain a target loss; and jointly training the speech recognition model and the decoder according to the target loss.
In one embodiment, the processing means of the speech recognition model further comprises a speech recognition module for: acquiring a signal to be identified; inputting the signal to be recognized into the trained speech recognition model to obtain the speech features output by an encoder in the speech recognition model, and outputting the speech recognition result based on the speech features by a classifier in the speech recognition model.
For the specific definition of the processing means of the speech recognition model, reference may be made to the above definition of the processing method of the speech recognition model, which is not described herein again. The respective modules in the processing means of the above-described speech recognition model may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In the above processing apparatus of the speech recognition model, the sample signal is input into the speech recognition model to obtain the speech features corresponding to the sample signal and a first predicted character sequence output based on the speech features. The forward character sequence corresponding to the annotated character sequence is input into the decoder, and in the decoder the speech features are decoded according to the semantic features corresponding to the forward character sequence to obtain the speech semantic joint features corresponding to the sample signal. Since the forward character sequence is generated based on the previous character of each character in the annotated character sequence, the speech semantic joint features obtained by decoding the speech features output by the encoder according to the semantic features of the forward character sequence carry semantic-level context information. A second predicted character sequence corresponding to the sample signal is predicted based on the speech semantic joint features, and the semantic recognition loss constructed from the second predicted character sequence and the annotated character sequence assists the training of the speech recognition model, distilling semantic-level context information into the speech recognition model and thereby improving its recognition accuracy.
In one embodiment, as shown in fig. 13, there is provided a speech recognition apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an acquisition module 1302 and a speech recognition module 1304, wherein:
an obtaining module 1302, configured to obtain a signal to be identified;
a speech recognition module 1304, configured to input a signal to be recognized into a trained speech recognition model, to obtain speech features output by an encoder in the speech recognition model, and a speech recognition result output by a classifier in the speech recognition model based on the speech features;
the voice recognition model and the decoder are obtained through joint training based on voice recognition loss and semantic recognition loss, the voice recognition loss is obtained through calculation according to a first prediction character sequence and a labeled character sequence corresponding to a sample signal, the semantic recognition loss is obtained through calculation according to a second prediction character sequence and the labeled character sequence, the first prediction character sequence is obtained after classification based on voice features output by an encoder, the second prediction character sequence is obtained through prediction of voice semantic combined features obtained through decoding of the voice features through the decoder by using semantic features corresponding to a forward character sequence corresponding to the labeled character sequence, and the forward character sequence is generated based on a previous character of each character in the labeled character sequence.
For the specific limitations of the speech recognition device, reference may be made to the above limitations of the speech recognition method, which are not described herein again. The respective modules in the above-described speech recognition apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In the speech recognition apparatus, the signal to be recognized is input into the trained speech recognition model to obtain the speech features output by the encoder in the speech recognition model and the speech recognition result output by the classifier in the speech recognition model based on the speech features. Since the speech recognition model is obtained through joint training with the decoder, semantic-level context information has been distilled into it, so the accuracy of speech recognition can be improved.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 14. The computer device includes a processor, a memory and a network interface connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data involved in the processing of the speech recognition model and/or in speech recognition. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a processing method of a speech recognition model and/or a speech recognition method.
In one embodiment, a computer device is provided, which may be a terminal or a speech acquisition device, and its internal structure diagram may be as shown in fig. 15. The computer device includes a processor, a memory, a communication interface and a voice acquisition apparatus connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a processing method of a speech recognition model and/or a speech recognition method.
It will be appreciated by those skilled in the art that the structures shown in fig. 14 and fig. 15 are merely block diagrams of partial structures relevant to the solution of the present application and do not constitute a limitation on the computer devices to which the solution of the present application is applied; a particular computer device may include more or fewer components than those shown in the figures, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as the combinations do not contradict one another, they should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method of processing a speech recognition model, the method comprising:
acquiring a sample signal and a corresponding annotated character sequence;
inputting the sample signal into a speech recognition model to obtain a speech feature corresponding to the sample signal and a first predicted character sequence output based on the speech feature;
inputting a forward character sequence corresponding to the annotated character sequence into a decoder, wherein the forward character sequence is generated based on the previous character of each character in the annotated character sequence;
in the decoder, decoding the speech feature according to a semantic feature corresponding to the forward character sequence to obtain a speech-semantic joint feature corresponding to the sample signal, and performing prediction based on the speech-semantic joint feature to obtain a second predicted character sequence corresponding to the sample signal;
and jointly training the speech recognition model and the decoder based on a speech recognition loss calculated according to the annotated character sequence and the first predicted character sequence and a semantic recognition loss calculated according to the annotated character sequence and the second predicted character sequence.
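The following Python sketch (using PyTorch) illustrates one way the joint objective of claim 1 could be wired together. The toy module structure, the tensor sizes, the use of a CTC loss for the first predicted character sequence and a cross-entropy loss for the second, and the start-of-sequence id 0 used to build the forward character sequence are all illustrative assumptions, not the claimed implementation.

import torch
import torch.nn as nn

VOCAB, DIM = 32, 64

class ToySpeechRecognitionModel(nn.Module):
    # encoder + classifier: sample signal -> speech feature -> first predicted character sequence
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv1d(1, DIM, kernel_size=10, stride=5), nn.GELU())
        self.classifier = nn.Linear(DIM, VOCAB)

    def forward(self, wav):                                        # wav: (batch, samples)
        feats = self.encoder(wav.unsqueeze(1)).transpose(1, 2)     # (batch, frames, DIM)
        return feats, self.classifier(feats)                       # speech feature, frame logits

class ToyDecoder(nn.Module):
    # decodes the speech feature conditioned on the forward character sequence
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.cross_attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(DIM, VOCAB)

    def forward(self, forward_chars, speech_feats):
        sem = self.embed(forward_chars)                              # semantic feature
        joint, _ = self.cross_attn(sem, speech_feats, speech_feats)  # speech-semantic joint feature
        return self.classifier(joint)                                # second predicted character logits

model, decoder = ToySpeechRecognitionModel(), ToyDecoder()
optim = torch.optim.Adam(list(model.parameters()) + list(decoder.parameters()), lr=1e-4)

wav = torch.randn(2, 1600)                      # sample signals
labels = torch.randint(1, VOCAB, (2, 8))        # annotated character sequences (0 reserved for blank/BOS)
forward_chars = torch.cat([torch.zeros(2, 1, dtype=torch.long), labels[:, :-1]], dim=1)

speech_feats, logits1 = model(wav)              # first predicted character sequence (frame level)
logits2 = decoder(forward_chars, speech_feats)  # second predicted character sequence

# speech recognition loss: CTC between frame-level predictions and the annotation (assumption)
ctc = nn.CTCLoss(blank=0)
speech_loss = ctc(logits1.log_softmax(-1).transpose(0, 1), labels,
                  torch.full((2,), logits1.size(1), dtype=torch.long),
                  torch.full((2,), labels.size(1), dtype=torch.long))

# semantic recognition loss: cross-entropy between the decoder predictions and the annotation
semantic_loss = nn.CrossEntropyLoss()(logits2.reshape(-1, VOCAB), labels.reshape(-1))

(speech_loss + semantic_loss).backward()        # joint training of the model and the decoder
optim.step()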
2. The method of claim 1, wherein the inputting the sample signal into the speech recognition model to obtain the speech feature corresponding to the sample signal and the first predicted character sequence output based on the speech feature comprises:
inputting the sample signal into the speech recognition model;
outputting the speech feature corresponding to the sample signal through an encoder of the speech recognition model;
outputting, by a classifier coupled to the encoder in the speech recognition model, the first predicted character sequence based on the speech feature.
3. The method of claim 2, wherein the encoder comprises a feature extraction network and a self-attention based speech context network;
the outputting, by the encoder of the speech recognition model, the speech feature corresponding to the sample signal comprises:
inputting the sample signal into the encoder to obtain a speech vector sequence output by the feature extraction network in the encoder and corresponding to the sample signal;
performing random masking on the speech vectors in the speech vector sequence;
and inputting the masked speech vector sequence into the speech context network to obtain a contextual speech feature output by the speech context network as the speech feature corresponding to the sample signal.
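A minimal sketch of the encoder structure described in claim 3, again in PyTorch: a convolutional feature extraction network, random masking of the resulting speech vectors, and a self-attention speech context network. The masking probability, the learned mask embedding, and all dimensions are assumptions made only for illustration.

import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, feat_dim=64, mask_prob=0.15):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.GELU())
        self.mask_embedding = nn.Parameter(torch.randn(feat_dim))    # assumed learned mask vector
        self.context_network = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True), num_layers=2)
        self.mask_prob = mask_prob

    def forward(self, wav):                                           # wav: (batch, samples)
        z = self.feature_extractor(wav.unsqueeze(1)).transpose(1, 2)  # speech vector sequence
        mask = torch.rand(z.shape[:2], device=z.device) < self.mask_prob
        z_masked = z.clone()
        z_masked[mask] = self.mask_embedding                          # random masking of speech vectors
        c = self.context_network(z_masked)                            # contextual speech feature
        return c, z, mask

encoder = ToyEncoder()
context_feats, speech_vectors, mask = encoder(torch.randn(2, 1600))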
4. The method of claim 1, wherein the decoder comprises a vectorization layer, a self-attention based semantic context network, and a cross-attention based speech-semantic context network;
the decoding the speech feature according to the semantic feature corresponding to the forward character sequence to obtain the speech-semantic joint feature corresponding to the sample signal comprises:
converting the forward character sequence into a corresponding forward character vector sequence through the vectorization layer of the decoder, and inputting the forward character vector sequence into the semantic context network;
calculating, through the semantic context network, a contextual semantic feature corresponding to the forward character sequence based on the forward character vector sequence, as the semantic feature corresponding to the forward character sequence;
and calculating, through the speech-semantic context network, the speech-semantic joint feature corresponding to the sample signal based on the semantic feature corresponding to the forward character sequence and the speech feature.
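The decoder structure of claim 4 can be pictured with the following sketch, which refines the toy decoder from the claim-1 example by adding the self-attention semantic context network. The causal mask over the forward character vectors and all layer sizes are assumptions, not something the claim prescribes.

import torch
import torch.nn as nn

class ToyAttentionDecoder(nn.Module):
    def __init__(self, vocab_size=32, dim=64, heads=4):
        super().__init__()
        self.vectorize = nn.Embedding(vocab_size, dim)                # vectorization layer
        self.semantic_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, forward_chars, speech_feats):
        x = self.vectorize(forward_chars)                             # forward character vector sequence
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        # self-attention over the forward character vectors -> contextual semantic feature
        sem, _ = self.semantic_attn(x, x, x, attn_mask=causal)
        # cross-attention: the semantic feature attends to the speech feature
        # -> speech-semantic joint feature
        joint, _ = self.cross_attn(sem, speech_feats, speech_feats)
        return joint

dec = ToyAttentionDecoder()
joint_feature = dec(torch.randint(0, 32, (2, 8)), torch.randn(2, 100, 64))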
5. The method according to claim 4, wherein the performing prediction based on the speech-semantic joint feature to obtain the second predicted character sequence corresponding to the sample signal comprises:
inputting the speech-semantic joint feature into a classifier of the decoder;
and outputting, through the classifier, the second predicted character sequence corresponding to the sample signal based on the speech-semantic joint feature.
6. The method of claim 1, wherein the speech recognition model comprises an encoder and a classifier coupled to the encoder; the encoder is a pre-trained encoder obtained by performing self-supervised training using an unlabeled sample signal;
the jointly training the speech recognition model and the decoder based on the speech recognition loss calculated from the annotated character sequence and the first predicted character sequence and the semantic recognition loss calculated from the annotated character sequence and the second predicted character sequence comprises:
performing supervised training on the decoder and the classifier of the speech recognition model according to the speech recognition loss and the semantic recognition loss;
and when a supervised training stop condition is met, performing supervised training on the decoder and the speech recognition model according to the speech recognition loss and the semantic recognition loss.
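One possible reading of the two-stage supervised training in claim 6, continuing the claim-1 sketch: first train only the decoder and the classifier with the pre-trained encoder frozen, then, once the stop condition is met, train the decoder together with the whole speech recognition model. The freezing mechanism and the learning rates are assumptions.

# stage 1: encoder frozen, decoder and classifier trained with the joint loss
for p in model.encoder.parameters():
    p.requires_grad = False
stage1_optim = torch.optim.Adam(
    list(decoder.parameters()) + list(model.classifier.parameters()), lr=1e-4)
# ... run stage-1 training steps until the supervised training stop condition is met ...

# stage 2: encoder unfrozen, decoder and full speech recognition model trained jointly
for p in model.encoder.parameters():
    p.requires_grad = True
stage2_optim = torch.optim.Adam(
    list(decoder.parameters()) + list(model.parameters()), lr=1e-5)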
7. The method of claim 1, wherein the encoder is a pre-trained encoder obtained by performing self-supervised training using an unlabeled sample signal;
the method further comprises:
acquiring the unlabeled sample signal;
inputting the unlabeled sample signal into an initial encoder to obtain a speech vector sequence output by a feature extraction network in the initial encoder and corresponding to the unlabeled sample signal;
performing a quantization operation on the speech vector sequence to obtain a quantized speech vector sequence;
randomly masking speech vectors in the speech vector sequence, and determining the masked speech vectors;
inputting the masked speech vector sequence into a speech context network of the initial encoder to obtain a predicted speech vector output by the speech context network and corresponding to each masked speech vector;
constructing a self-supervised training loss based on a difference between the quantized speech vector corresponding to the masked speech vector in the quantized speech vector sequence and the predicted speech vector;
and after updating network parameters of the initial encoder according to the self-supervised training loss, returning to the step of acquiring the unlabeled sample signal to continue training until training ends, so as to obtain the pre-trained encoder.
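The pre-training loop of claim 7 resembles masked self-supervised speech pre-training. The sketch below reuses the ToyEncoder from the claim-3 example; the rounding "quantizer" and the mean-squared-error loss between the predicted vector and the quantized vector at masked positions are stand-ins chosen only so the example runs, since the claim does not fix these details.

import torch
import torch.nn.functional as F

def pretrain_step(encoder, optimizer, unlabeled_wav):
    context, speech_vectors, mask = encoder(unlabeled_wav)    # masked contextual prediction
    quantized = torch.round(speech_vectors * 4) / 4           # toy quantization of the speech vectors
    # self-supervised loss: predicted vectors vs. quantized targets at the masked positions
    loss = F.mse_loss(context[mask], quantized.detach()[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch: repeat over unlabeled sample signals until training ends
# enc = ToyEncoder()
# opt = torch.optim.Adam(enc.parameters(), lr=1e-4)
# for _ in range(100):
#     pretrain_step(enc, opt, torch.randn(2, 1600))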
8. The method of claim 1, wherein jointly training the speech recognition model and the decoder based on the speech recognition loss calculated from the annotated character sequence and the first predicted character sequence and the semantic recognition loss calculated from the annotated character sequence and the second predicted character sequence comprises:
constructing the speech recognition loss based on a difference between the annotated character sequence and the first predicted character sequence;
constructing a semantic recognition loss based on a difference between the annotated character sequence and the second predicted character sequence;
weighting and summing the speech recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain a target loss;
and jointly training the speech recognition model and the decoder according to the target loss.
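In code, claim 8 simply replaces the unweighted sum used at the end of the claim-1 sketch with a weighted combination; the weight value below is an assumed hyperparameter.

loss_weight = 0.7   # assumed weighting coefficient for the speech recognition loss
target_loss = loss_weight * speech_loss + (1.0 - loss_weight) * semantic_loss
target_loss.backward()   # jointly trains the speech recognition model and the decoder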
9. The method of claim 1, further comprising:
acquiring a signal to be recognized;
inputting the signal to be recognized into the trained speech recognition model to obtain a speech feature output by an encoder in the speech recognition model and a speech recognition result output by a classifier in the speech recognition model based on the speech feature.
10. A method of speech recognition, the method comprising:
acquiring a signal to be recognized;
inputting the signal to be recognized into a trained speech recognition model to obtain a speech feature output by an encoder in the speech recognition model and a speech recognition result output by a classifier in the speech recognition model based on the speech feature;
wherein the speech recognition model and a decoder are obtained by joint training based on a speech recognition loss and a semantic recognition loss; the speech recognition loss is calculated according to a first predicted character sequence and an annotated character sequence corresponding to a sample signal; the semantic recognition loss is calculated according to a second predicted character sequence and the annotated character sequence; the first predicted character sequence is obtained by classification based on a speech feature output by the encoder; the second predicted character sequence is obtained by prediction from a speech-semantic joint feature obtained by the decoder decoding the speech feature using a semantic feature corresponding to a forward character sequence corresponding to the annotated character sequence; and the forward character sequence is generated based on the previous character of each character in the annotated character sequence.
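At inference time only the trained speech recognition model is needed; the decoder is used during training. A minimal sketch reusing the toy model from the claim-1 example, with greedy CTC-style decoding (collapse repeats, drop blanks) as an assumed decoding strategy:

import torch

def recognize(model, wav, blank=0):
    model.eval()
    with torch.no_grad():
        _, logits = model(wav)                 # classifier output over characters per frame
    ids = logits.argmax(-1)[0].tolist()        # best character id per frame
    result, prev = [], blank
    for i in ids:                              # collapse repeats and remove blanks
        if i != prev and i != blank:
            result.append(i)
        prev = i
    return result                              # speech recognition result as character ids

# character_ids = recognize(model, torch.randn(1, 1600))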
11. An apparatus for processing a speech recognition model, the apparatus comprising:
the acquisition module is used for acquiring a sample signal and a corresponding annotated character sequence;
the coding module is used for inputting the sample signal into a speech recognition model to obtain a speech feature corresponding to the sample signal and a first predicted character sequence output based on the speech feature;
the input module is used for inputting a forward character sequence corresponding to the annotated character sequence into a decoder, wherein the forward character sequence is generated based on the previous character of each character in the annotated character sequence;
the decoding module is used for decoding, in the decoder, the speech feature according to a semantic feature corresponding to the forward character sequence to obtain a speech-semantic joint feature corresponding to the sample signal, and performing prediction based on the speech-semantic joint feature to obtain a second predicted character sequence corresponding to the sample signal;
and the training module is used for jointly training the speech recognition model and the decoder based on a speech recognition loss calculated according to the annotated character sequence and the first predicted character sequence and a semantic recognition loss calculated according to the annotated character sequence and the second predicted character sequence.
12. A speech recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a signal to be recognized;
the speech recognition module is used for inputting the signal to be recognized into a trained speech recognition model to obtain a speech feature output by an encoder in the speech recognition model and a speech recognition result output by a classifier in the speech recognition model based on the speech feature;
wherein the speech recognition model and a decoder are obtained by joint training based on a speech recognition loss and a semantic recognition loss; the speech recognition loss is calculated according to a first predicted character sequence and an annotated character sequence corresponding to a sample signal; the semantic recognition loss is calculated according to a second predicted character sequence and the annotated character sequence; the first predicted character sequence is obtained by classification based on a speech feature output by the encoder; the second predicted character sequence is obtained by prediction from a speech-semantic joint feature obtained by the decoder decoding the speech feature using a semantic feature corresponding to a forward character sequence corresponding to the annotated character sequence; and the forward character sequence is generated based on the previous character of each character in the annotated character sequence.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
15. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 10 when executed by a processor.
CN202111292319.2A 2021-11-03 2021-11-03 Processing method of voice recognition model, voice recognition method and device Pending CN114360502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111292319.2A CN114360502A (en) 2021-11-03 2021-11-03 Processing method of voice recognition model, voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111292319.2A CN114360502A (en) 2021-11-03 2021-11-03 Processing method of voice recognition model, voice recognition method and device

Publications (1)

Publication Number Publication Date
CN114360502A true CN114360502A (en) 2022-04-15

Family

ID=81096284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111292319.2A Pending CN114360502A (en) 2021-11-03 2021-11-03 Processing method of voice recognition model, voice recognition method and device

Country Status (1)

Country Link
CN (1) CN114360502A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115691476A (en) * 2022-06-06 2023-02-03 腾讯科技(深圳)有限公司 Training method of voice recognition model, voice recognition method, device and equipment
CN115691476B (en) * 2022-06-06 2023-07-04 腾讯科技(深圳)有限公司 Training method of voice recognition model, voice recognition method, device and equipment
CN116524521A (en) * 2023-06-30 2023-08-01 武汉纺织大学 English character recognition method and system based on deep learning
CN116524521B (en) * 2023-06-30 2023-09-15 武汉纺织大学 English character recognition method and system based on deep learning
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium
CN116705058B (en) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
Ariav et al. An end-to-end multimodal voice activity detection using wavenet encoder and residual networks
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN104541324A (en) A speech recognition system and a method of using dynamic bayesian network models
CN109344242B (en) Dialogue question-answering method, device, equipment and storage medium
Chi et al. Speaker role contextual modeling for language understanding and dialogue policy learning
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN114676234A (en) Model training method and related equipment
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN111859954A (en) Target object identification method, device, equipment and computer readable storage medium
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN115376495A (en) Speech recognition model training method, speech recognition method and device
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN113178200B (en) Voice conversion method, device, server and storage medium
CN114445832A (en) Character image recognition method and device based on global semantics and computer equipment
CN116564270A (en) Singing synthesis method, device and medium based on denoising diffusion probability model
CN116978370A (en) Speech processing method, device, computer equipment and storage medium
CN115712739A (en) Dance action generation method, computer device and storage medium
CN114743539A (en) Speech synthesis method, apparatus, device and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination