CN113362811A - Model training method, speech recognition method, device, medium and equipment - Google Patents


Info

Publication number
CN113362811A
CN113362811A (application CN202110735675.0A)
Authority
CN
China
Prior art keywords
symbol
character
training
target
text
Prior art date
Legal status
Granted
Application number
CN202110735675.0A
Other languages
Chinese (zh)
Other versions
CN113362811B (en)
Inventor
董林昊
梁镇麟
蔡猛
马泽君
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110735675.0A priority Critical patent/CN113362811B/en
Publication of CN113362811A publication Critical patent/CN113362811A/en
Priority to PCT/CN2022/091481 priority patent/WO2023273612A1/en
Application granted granted Critical
Publication of CN113362811B publication Critical patent/CN113362811B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/26 — Speech to text systems
    • G10L 2015/0635 — Training: updating or merging of old and new templates; mean values; weighting
    • G10L 2015/0636 — Threshold criteria for the updating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a model training method, a speech recognition method, an apparatus, a medium, and a device. The speech recognition model includes an encoder, a predictor model, a language decoder, and a symbol decoder. The method includes: obtaining character acoustic vectors corresponding to training speech data according to the encoder and the predictor model; obtaining an output text and language feature vectors corresponding to the predicted characters according to the character acoustic vectors and the language decoder; determining a symbol probability distribution corresponding to each predicted character according to the character acoustic vector, the language feature vector, and the symbol decoder; determining a target loss according to the symbol probability distribution and the target symbol features corresponding to the predicted characters in the training speech data; and updating the model parameters of the speech recognition model according to the target loss when an update condition is satisfied. In this way, both the accuracy and the real-time performance of speech recognition and punctuation prediction can be improved.

Description

Model training method, speech recognition method, device, medium and equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a model training method, a speech recognition method, an apparatus, a medium, and a device.
Background
With the rise of deep learning, end-to-end modeling methods relying entirely on neural networks have gradually become the mainstream of Automatic Speech Recognition (ASR) technology. Automatic speech recognition converts raw speech data directly into a corresponding text result. To improve the readability of the text result, punctuation marks usually need to be added to it. In the related art, an offline punctuation post-processing approach can be adopted: after speech recognition finishes, the obtained recognition result is fed into a post-processing model that adds punctuation. Because this approach must wait for recognition to complete before post-processing can begin, the final recognition result arrives with a large delay.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method for training a speech recognition model, the speech recognition model including an encoder, a predictor model, a language decoder, and a symbol decoder, the method comprising:
obtaining character acoustic vectors corresponding to training voice data according to the encoder and the predictor model, wherein each character acoustic vector corresponds to a predicted character of the training voice data;
obtaining an output text and a language feature vector corresponding to the predicted character according to the character acoustic vector and the language decoder;
determining symbol probability distribution corresponding to the predicted character according to the character acoustic vector, the language feature vector and the symbol decoder;
determining target loss according to the symbol probability distribution and target symbol characteristics corresponding to the predicted characters in the training voice data;
and under the condition that an updating condition is met, updating the model parameters of the speech recognition model according to the target loss.
Optionally, the determining a target loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character in the training speech data includes:
determining cross entropy loss according to the symbol probability distribution and target symbol characteristics corresponding to the predicted characters;
and taking the determined cross entropy loss as the target loss.
Optionally, the updating the model parameters of the speech recognition model according to the target loss includes:
and updating the model parameters of the symbol decoder in the speech recognition model according to the target loss.
Optionally, the determining a target loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character in the training speech data includes:
determining a first cross entropy loss according to the symbol probability distribution and a target symbol characteristic corresponding to the predicted character;
determining a second cross entropy loss corresponding to the language decoder according to the target text corresponding to the predicted character and the output text;
and determining the target loss according to the first cross entropy loss and the second cross entropy loss.
Optionally, the symbolic feature corresponding to each character in the text corresponding to the training speech data is determined by:
acquiring text data which comprises punctuations and corresponds to the training voice data;
aiming at each character in the text data, if a punctuation mark is adjacent behind the character, taking the punctuation mark as the label symbol corresponding to the character; if no punctuation mark is adjacent behind the character, taking a preset symbol as the label symbol corresponding to the character;
and aiming at each character in the text data, generating a symbol characteristic corresponding to the character according to the label symbol and the symbol space corresponding to the character.
In a second aspect, the present disclosure provides a speech recognition method, the method comprising:
receiving voice data to be recognized;
determining a symbol to be added and a recognition text corresponding to the voice data according to the voice data and a speech recognition model, wherein the speech recognition model is obtained by training based on the training method of the speech recognition model in the first aspect;
and under the condition that the symbol to be added is a target symbol, generating a text result corresponding to the voice data to be recognized according to the symbol to be added and the recognition text.
In a third aspect, the present disclosure provides an apparatus for training a speech recognition model, the speech recognition model including an encoder, a predictor model, a language decoder, and a symbol decoder, the apparatus comprising:
the first processing module is used for obtaining character acoustic vectors corresponding to training voice data according to the encoder and the predictor model, wherein each character acoustic vector corresponds to a predicted character of the training voice data;
the second processing module is used for obtaining an output text and a language feature vector corresponding to the predicted character according to the character acoustic vector and the language decoder;
a first determining module, configured to determine, according to the character acoustic vector, the language feature vector, and the symbol decoder, a symbol probability distribution corresponding to the predicted character;
a second determining module, configured to determine a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data;
and the updating module is used for updating the model parameters of the speech recognition model according to the target loss under the condition that the updating condition is met.
In a fourth aspect, the present disclosure provides a speech recognition apparatus, the apparatus comprising:
the receiving module is used for receiving voice data to be recognized;
a third determining module, configured to determine, according to the voice data and a speech recognition model, a symbol to be added and a recognition text corresponding to the voice data, where the speech recognition model is obtained by training based on the training method of the speech recognition model according to the first aspect;
and the generating module is used for generating a text result corresponding to the voice data to be recognized according to the symbol to be added and the recognition text under the condition that the symbol to be added is a target symbol.
In a fifth aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first or second aspect.
In a sixth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first or second aspect.
In the above technical solution, character acoustic vectors corresponding to the training speech data are obtained from the encoder and the predictor model, so that the output text and the language feature vectors corresponding to the predicted characters can be obtained from the character acoustic vectors and the language decoder; the symbol probability distribution corresponding to each predicted character is determined from the character acoustic vector, the language feature vector, and the symbol decoder; a target loss is then determined from the symbol probability distribution and the target symbol features corresponding to the predicted characters in the training speech data, and the model parameters of the speech recognition model are updated according to the target loss when an update condition is satisfied. By processing the acoustic vector of each predicted character in the speech data and performing speech recognition and punctuation prediction per predicted character in combination with the symbol decoder, this scheme improves the processing accuracy of the speech recognition model on the one hand, and on the other hand effectively reduces the delay in determining punctuation marks during speech recognition, so that the trained model can be applied to streaming data processing and supports the user experience of streaming recognition products. In addition, during decoding, the symbol decoder can reuse the character acoustic vectors from the predictor model and the language feature vectors from the language decoder, so that acoustic and linguistic information are jointly exploited for punctuation prediction, further improving the accuracy of the trained speech recognition model and the user experience.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram of a method of training a speech recognition model provided in accordance with one embodiment of the present disclosure;
FIG. 2 is a flow diagram of a speech recognition method provided in accordance with one embodiment of the present disclosure;
FIG. 3 is a block diagram of a training apparatus for a speech recognition model provided according to one embodiment of the present disclosure;
FIG. 4 is a block diagram of a speech recognition device provided in accordance with one embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart illustrating a method for training a speech recognition model according to an embodiment of the present disclosure, where the speech recognition model includes an encoder, a predictor model, a language decoder, and a symbol decoder. As shown in fig. 1, the method includes:
in step 11, character acoustic vectors corresponding to training speech data are obtained according to an encoder and a predictor model, wherein each character acoustic vector corresponds to a predicted character of the training speech data.
In this embodiment, the training speech data may be encoded by the encoder, obtaining acoustic vectors for the audio frames of the training speech data.
In general, each second of voice data may be divided into a plurality of audio frames, so that data processing is performed on a per-frame basis; for example, each second of voice data may be divided into 100 audio frames. Accordingly, by encoding the audio frames of the training speech data with the encoder, the obtained acoustic vector sequence H can be represented as:
H = {H_1, H_2, ..., H_U}, where U denotes the number of audio frames in the input training speech data, i.e., the length of the acoustic vector sequence.
And then, obtaining a character acoustic vector corresponding to the training voice data according to the acoustic vector and the predictor model.
For example, the acoustic vectors may be input into the predictor model, which performs information amount prediction on them to obtain the information amount corresponding to each audio frame. The acoustic vectors of the audio frames are then combined according to the information amounts of the plurality of audio frames to obtain the character acoustic vector sequence.
For example, the predictor model may be a CIF (Continuous Integrate-and-Fire) model. In an embodiment of the present disclosure, each predicted character is assumed by default to carry the same amount of information, so the information amounts of the audio frames can be accumulated from left to right; when the accumulated amount reaches a preset threshold, the audio frames whose information was accumulated are considered to form one predicted character, and one predicted character corresponds to one or more audio frames. The preset threshold may be set according to the practical application scenario and experience, for example to 1, which is not limited in this disclosure.
In one possible embodiment, the acoustic vectors of the audio frames may be combined according to the information amounts of the plurality of audio frames as follows:
The information amount W_i of each audio frame i is acquired sequentially, in order.
If adding W_i makes the accumulated value exceed the preset threshold β, a character boundary is considered to occur at this point: part of the currently traversed audio frame belongs to the current predicted character, and the other part belongs to the next predicted character.
For example, if W_1 + W_2 > β, a character boundary occurs at this point, i.e., the 1st and 2nd audio frames correspond to one predicted character whose boundary lies in the 2nd audio frame. The information amount of the 2nd audio frame is then divided into two parts: one part belongs to the current predicted character and the remaining part belongs to the next predicted character.
Accordingly, the part W_21 of the 2nd audio frame's information amount W_2 that belongs to the current predicted character can be expressed as W_21 = β − W_1, and the part W_22 that belongs to the next predicted character can be expressed as W_22 = W_2 − W_21.
The traversal of the audio frames' information amounts then continues, with accumulation restarting from the remaining information of the 2nd audio frame: the amount W_22 in the 2nd audio frame is accumulated with the amount W_3 in the 3rd audio frame, and so on, until the sum reaches the preset threshold β, which yields the audio frames corresponding to the next predicted character. The information amounts of subsequent audio frames are combined in the same way to obtain each predicted character corresponding to the plurality of audio frames.
On this basis, after the correspondence between predicted characters and audio frames in the speech data is determined, the acoustic vector of each predicted character can be computed as the weighted sum of the acoustic vectors of the audio frames corresponding to it, where the weight of each audio frame's acoustic vector is the information amount that frame contributes to the predicted character: if the whole audio frame belongs to the predicted character, the weight is the frame's full information amount; if only part of the audio frame belongs to the predicted character, the weight is the information amount of that part.
Following the example above, the first predicted character contains the 1st audio frame and part of the 2nd audio frame, so its acoustic vector C_1 can be expressed as:
C_1 = W_1 * H_1 + W_21 * H_2
As another example, the second predicted character contains the remaining part of the 2nd audio frame and the 3rd audio frame, so its acoustic vector C_2 can be expressed as:
C_2 = W_22 * H_2 + W_3 * H_3
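The accumulate-and-fire procedure above can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the patented implementation: the function and variable names, the numpy representation, and the default threshold β = 1.0 are all choices of this sketch, and a trailing accumulation that never reaches β is simply dropped here.

```python
import numpy as np

def cif_integrate(frames: np.ndarray, infos: np.ndarray, beta: float = 1.0):
    """frames: (U, d) per-frame acoustic vectors H_1..H_U.
    infos:  (U,) per-frame information amounts W_1..W_U from the predictor.
    Returns the list of fired character acoustic vectors C_k."""
    chars = []                                  # fired character acoustic vectors
    acc_info = 0.0                              # information accumulated so far
    acc_vec = np.zeros(frames.shape[1])
    for h, w in zip(frames, infos):
        while acc_info + w >= beta:             # a character boundary falls in this frame
            take = beta - acc_info              # portion of w that completes the character
            chars.append(acc_vec + take * h)    # fire: weighted sum of contributing frames
            w -= take                           # remainder belongs to the next character
            acc_info, acc_vec = 0.0, np.zeros(frames.shape[1])
        acc_info += w                           # frame (or its remainder) fully absorbed
        acc_vec += w * h
    return chars
```

With W_1 = 0.6, W_2 = 0.7 and β = 1.0, this reproduces the worked example: the first fire emits 0.6·H_1 + 0.4·H_2 (W_21 = β − W_1 = 0.4), and accumulation restarts from W_22 = 0.3.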
turning back to fig. 1, in step 12, an output text and a language feature vector corresponding to the predicted character are obtained according to the character acoustic vector and the speech decoder.
The language decoder performs feature extraction on the acoustic vector of each input predicted character and applies softmax processing to the extracted features, obtaining the probability that the predicted character corresponds to each real character, i.e., the text probability distribution corresponding to the predicted character. Based on this text probability distribution, the recognized character corresponding to each predicted character can be determined by a Greedy Search or Beam Search algorithm to obtain the output text.
In this embodiment, the language feature vector may be a feature vector output by an intermediate feature layer of the language decoder, for example the features on which the softmax processing is performed.
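As a hedged illustration of the greedy-search option mentioned above (beam search would instead keep the top-k partial hypotheses at each step), the following sketch picks the highest-probability vocabulary entry per predicted character; `probs` and `vocab` are illustrative names, not an API from the disclosure.

```python
import numpy as np

def greedy_decode(probs: np.ndarray, vocab: list) -> str:
    """probs: (num_chars, vocab_size) rows of softmax outputs from the language decoder.
    vocab: list of real characters, indexed consistently with the probability rows."""
    return "".join(vocab[i] for i in probs.argmax(axis=-1))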
In step 13, a symbol probability distribution corresponding to the predicted character is determined based on the character acoustic vector, the language feature vector, and the symbol decoder.
In this step, when symbol decoding is performed, the corresponding input includes both the character acoustic vector and the language feature vector of the predicted character, so that the punctuation mark can be predicted by combining acoustic features with linguistic features, ensuring the accuracy of the determined symbol probability distribution.
In step 14, a target loss is determined based on the symbol probability distribution and corresponding target symbol characteristics in the training speech data.
The training speech data can be labeled in advance, so that the target text corresponding to the training speech data and the corresponding symbol features can be obtained, where the symbol features represent, for each character in the punctuated text corresponding to the training speech data, the punctuation mark (if any) immediately following that character.
In step 15, if the update condition is satisfied, the model parameters of the speech recognition model are updated according to the target loss.
As an example, the update condition may be that the target loss is greater than a preset loss threshold, indicating that the recognition accuracy of the speech recognition model is still insufficient. As another example, the update condition may be that the number of iterations is less than a preset count threshold, in which case the model is considered under-trained. Accordingly, the model parameters of the speech recognition model are updated according to the target loss while an update condition is satisfied. The parameters may be updated based on the determined loss in any manner common in the art, such as gradient descent, which is not repeated here.
When the update condition is no longer met, the recognition accuracy of the speech recognition model can be considered to meet the training requirement; the training process can then be stopped, yielding the trained speech recognition model.
In the above technical solution, character acoustic vectors corresponding to the training speech data are obtained from the encoder and the predictor model, so that the output text and the language feature vectors corresponding to the predicted characters can be obtained from the character acoustic vectors and the language decoder; the symbol probability distribution corresponding to each predicted character is determined from the character acoustic vector, the language feature vector, and the symbol decoder; a target loss is then determined from the symbol probability distribution and the target symbol features corresponding to the predicted characters in the training speech data, and the model parameters of the speech recognition model are updated according to the target loss when an update condition is satisfied. By processing the acoustic vector of each predicted character in the speech data and performing speech recognition and punctuation prediction per predicted character in combination with the symbol decoder, this scheme improves the processing accuracy of the speech recognition model on the one hand, and on the other hand effectively reduces the delay in determining punctuation marks during speech recognition, so that the trained model can be applied to streaming data processing and supports the user experience of streaming recognition products. In addition, during decoding, the symbol decoder can reuse the character acoustic vectors from the predictor model and the language feature vectors from the language decoder, so that acoustic and linguistic information are jointly exploited for punctuation prediction, further improving the accuracy of the trained speech recognition model and the user experience.
In one possible embodiment, the symbolic feature corresponding to each character in the text corresponding to the training speech data may be determined by:
and acquiring text data which comprises punctuations and corresponds to the training voice data.
For each character in the text data, if a punctuation mark is adjacent behind the character, the punctuation mark is taken as the label symbol corresponding to that character; if no punctuation mark follows the character, a preset symbol is taken as its label symbol.
By way of example, suppose the text data is "你好，初次见面，请多关照。" (roughly, "Hello, nice to meet you, please take care of me."). In this text, punctuation marks are adjacent behind the characters "好", "面", and "照": the "，" following "好" is taken as its label symbol, the "，" following "面" as its label symbol, and the "。" following "照" as its label symbol. For every other character in the text data, a preset symbol such as "*" is used as the label symbol, so this text data yields the label sequence "*，***，***。".
And aiming at each character in the text data, generating a symbol characteristic corresponding to the character according to the label symbol and the symbol space corresponding to the character.
The symbol space may include common punctuation marks and the preset symbol; after the label symbol corresponding to a character is determined, it can be mapped into the symbol space by one-hot encoding to obtain the symbol feature.
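A minimal sketch of this labeling and encoding step, assuming a small illustrative symbol space of the preset symbol "*" plus a few punctuation marks (the disclosure does not fix the inventory, and the function name is hypothetical):

```python
import numpy as np

SYMBOL_SPACE = ["*", "，", "。", "？", "！"]   # preset symbol + common punctuation (assumed)
PUNCT = set(SYMBOL_SPACE[1:])

def symbol_features(text: str):
    """Label each non-punctuation character with the punctuation mark that
    immediately follows it (or '*' if none), then one-hot encode the labels."""
    chars, labels = [], []
    for i, ch in enumerate(text):
        if ch in PUNCT:
            continue                       # punctuation occupies no character slot
        nxt = text[i + 1] if i + 1 < len(text) else ""
        chars.append(ch)
        labels.append(nxt if nxt in PUNCT else "*")
    feats = np.zeros((len(labels), len(SYMBOL_SPACE)))
    for row, lab in enumerate(labels):
        feats[row, SYMBOL_SPACE.index(lab)] = 1.0   # one-hot over the symbol space
    return chars, labels, feats

# For "你好，初次见面，请多关照。" the labels are
# ['*', '，', '*', '*', '*', '，', '*', '*', '*', '。'].
```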
Therefore, with this technical solution, the punctuation marks in punctuated text data can be encoded separately, yielding training speech data that supports joint training of punctuation prediction and speech recognition; the data thus matches the training process of the speech recognition model and provides data support for its joint training.
In a possible embodiment, the target symbol feature is a one-hot coded feature, and in step 14, an exemplary implementation manner of determining the target loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character in the training speech data is as follows, and the step may include:
and determining cross entropy loss according to the symbol probability distribution and the target symbol characteristics corresponding to the predicted characters.
Decoding by the symbol decoder yields the probability that the symbol adjacent behind the current predicted character is each punctuation mark. The target symbol feature corresponding to the predicted character is a one-hot encoded feature: the dimension corresponding to the punctuation mark labeled for that predicted character in the training speech data is 1 and all other dimensions are 0, which is equivalent to a degenerate probability distribution concentrated on the label symbol.
Thus, in this embodiment, the cross-entropy loss can be calculated between the symbol probability distribution corresponding to the predicted character and this degenerate distribution corresponding to the label symbol. Cross-entropy loss measures the difference between two probability distributions: the smaller it is, the closer the predicted distribution is to the true distribution, i.e., the better the prediction effect of the trained speech recognition model.
And then, taking the determined cross entropy loss as the target loss.
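The following sketch computes this cross entropy directly; because the target rows are one-hot, the loss reduces to the negative log-probability the symbol decoder assigns to the labeled symbol. The array shapes, the epsilon floor, and the mean reduction are assumptions of this illustration.

```python
import numpy as np

def symbol_cross_entropy(pred: np.ndarray, target: np.ndarray) -> float:
    """pred: (num_chars, symbol_space) softmax rows from the symbol decoder.
    target: one-hot rows of the same shape (the label-symbol distribution)."""
    eps = 1e-12                                      # numerical floor to avoid log(0)
    return float(-(target * np.log(pred + eps)).sum(axis=-1).mean())
```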
Therefore, with this technical solution, the corresponding target loss can be determined directly from the symbol probability distribution output by the symbol decoder, so that the cross-entropy loss corresponding to the symbol decoder can be optimized directly. This effectively improves the accuracy of the symbol decoder and thus of the punctuation prediction of the trained speech recognition model, which in turn improves the readability of the text results it outputs and the user experience.
In a possible embodiment, in the case of using the cross-entropy loss determined based on the output of the symbol decoder as the target loss, an exemplary implementation manner of updating the model parameters of the speech recognition model according to the target loss in step 15 is as follows, and the step may include:
and updating the model parameters of the symbol decoder in the voice recognition model according to the target loss.
As indicated above, in this embodiment the target loss is determined from the symbol probability distribution output by the symbol decoder; accordingly, during model optimization only the model parameters of the symbol decoder in the speech recognition model are optimized, while the parameters of the other parts of the speech recognition model remain unchanged.
Therefore, with this scheme, a symbol decoder can be added to an existing trained model, which is then trained on the training speech data to obtain the speech recognition model; symbol recognition capability is added while the existing text recognition capability is guaranteed to remain unchanged, which broadens the applicability and improves the efficiency of the training method.
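A sketch of this partial update, assuming a PyTorch-style model in which the symbol decoder is an addressable submodule; the attribute name `symbol_decoder` and the optimizer choice are illustrative, not fixed by the disclosure.

```python
import torch

def make_symbol_only_optimizer(model, lr: float = 1e-4):
    """Register only the symbol decoder's parameters with the optimizer;
    encoder, predictor model, and language decoder keep their trained weights."""
    for p in model.parameters():
        p.requires_grad = False              # freeze everything first
    for p in model.symbol_decoder.parameters():
        p.requires_grad = True               # then unfreeze the symbol decoder
    return torch.optim.Adam(model.symbol_decoder.parameters(), lr=lr)
```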
In a possible embodiment, the target symbol feature is a one-hot coded feature, and in step 14, an exemplary implementation manner of determining the target loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character in the training speech data is as follows, and the step may include:
and determining a first cross entropy loss according to the symbol probability distribution and the target symbol characteristics corresponding to the predicted characters. The determination method of the first cross entropy loss is described in detail above, and is not described herein again.
And determining a second cross entropy loss corresponding to the language decoder according to the target text corresponding to the predicted character and the output text.
The speech recognition model includes a language decoder for converting speech into text, so optimizing the second cross-entropy loss improves the speech-to-text recognition accuracy of the speech recognition model.
And determining the target loss according to the first cross entropy loss and the second cross entropy loss.
The target loss may be determined by performing weighted summation on the first cross entropy loss and the second cross entropy loss, where weights corresponding to the first cross entropy loss and the second cross entropy loss may be set according to an actual application scenario, which is not limited in this disclosure.
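A one-line sketch of this weighted combination; the default weights are illustrative hyperparameters — the disclosure only says they are set per application scenario.

```python
def target_loss(first_ce: float, second_ce: float,
                w_symbol: float = 0.5, w_text: float = 0.5) -> float:
    # Weighted sum of the symbol-decoder and language-decoder cross entropies.
    return w_symbol * first_ce + w_text * second_ce
```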
Therefore, by the technical scheme, when the target loss of the voice recognition model is determined, the target loss can be determined based on the cross entropy losses respectively corresponding to the language decoder and the symbol decoder, and the language decoder and the symbol decoder are jointly trained by constructing a multi-task learning mode, so that the accuracy of texts and symbols recognized by the trained voice recognition model can be improved to a certain extent, and the user experience is improved.
The present disclosure also provides a speech recognition method, as shown in fig. 2, the method may include:
in step 21, speech data to be recognized is received.
In step 22, the symbol to be added and the recognition text corresponding to the voice data are determined according to the voice data and the speech recognition model, where the speech recognition model is obtained by training based on any one of the above-described training methods for a speech recognition model.
As described above, the speech recognition model can process input speech data in a streaming fashion, i.e., symbol probability distributions and recognition text are determined at the character level. Each time the predictor model determines the character acoustic vector of a predicted character, that vector can be decoded by the language decoder and the symbol decoder respectively, yielding the symbol to be added and the recognition text. In this embodiment, for each character determined in the voice data, the symbol with the highest probability in the symbol probability distribution corresponding to that character can be taken as the symbol to be added.
In step 23, in the case that the symbol to be added is the target symbol, a text result corresponding to the speech data to be recognized is generated according to the symbol to be added and the recognition text, and the text result is a text containing punctuation information.
In the symbol decoder of the speech recognition model, the decoded symbol may be the preset symbol "*", which indicates that no punctuation mark follows the text corresponding to the predicted character. When the symbol to be added is determined to be a target symbol, a punctuation mark is adjacent behind the text corresponding to the predicted character, and the symbol to be added can be appended to the recognition text after that text, thereby obtaining the text result corresponding to the speech data to be recognized.
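A sketch of the streaming emission step: for each newly fired character, pick the most probable symbol, and append it only if it is not the preset placeholder. The symbol inventory and function name follow the earlier illustrative assumptions.

```python
import numpy as np

SYMBOL_SPACE = ["*", "，", "。", "？", "！"]   # same illustrative inventory as above

def emit(char: str, symbol_probs: np.ndarray) -> str:
    """Append the most probable symbol after the character unless it is the
    preset '*', which means no punctuation follows."""
    to_add = SYMBOL_SPACE[int(symbol_probs.argmax())]
    return char if to_add == "*" else char + to_add
```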
Therefore, according to this technical scheme, the speech recognition model can be applied to streaming data processing: the input speech data are processed at the character level, so whether a punctuation mark needs to be added after each character can be determined in real time. This improves the accuracy and real-time performance of punctuation prediction while also improving the accuracy of the resulting text, allows the model to be applied to online speech recognition scenarios, effectively reduces the delay in displaying text results, and improves the user experience.
Fig. 3 is a block diagram illustrating a training apparatus for a speech recognition model according to an embodiment of the present disclosure. The speech recognition model includes an encoder, a predictor model, a language decoder, and a symbol decoder, and the apparatus 30 includes:
a first processing module 31, configured to obtain, according to the encoder and the predictor model, character acoustic vectors corresponding to training speech data, where each character acoustic vector corresponds to a predicted character of the training speech data;
a second processing module 32, configured to obtain, according to the character acoustic vector and the language decoder, an output text and a language feature vector corresponding to the predicted character;
a first determining module 33, configured to determine a symbol probability distribution corresponding to the predicted character according to the character acoustic vector, the language feature vector, and the symbol decoder;
a second determining module 34, configured to determine a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data;
and the updating module 35 is configured to update the model parameters of the speech recognition model according to the target loss when an update condition is satisfied.
Optionally, the target symbol feature is a one-hot coded feature, and the second determining module includes:
and the first determining submodule is used for determining cross entropy loss according to the symbol probability distribution and the target symbol characteristics corresponding to the predicted characters, and taking the determined cross entropy loss as the target loss.
Optionally, the update module includes:
and the updating submodule is used for updating the model parameters of the symbol decoder in the speech recognition model according to the target loss.
Optionally, the target symbol feature is a one-hot coded feature, and the second determining module includes:
the second determining submodule is used for determining first cross entropy loss according to the symbol probability distribution and the target symbol characteristics corresponding to the predicted characters;
the second determining submodule is used for determining a second cross entropy loss corresponding to the language decoder according to the target text corresponding to the predicted character and the output text;
a third determining submodule, configured to determine the target loss according to the first cross entropy loss and the second cross entropy loss.
Optionally, the symbolic feature corresponding to each character in the text corresponding to the training speech data is determined by:
acquiring text data which comprises punctuations and corresponds to the training voice data;
aiming at each character in the text data, if a punctuation mark is adjacent behind the character, taking the punctuation mark as the label symbol corresponding to the character; if no punctuation mark is adjacent behind the character, taking a preset symbol as the label symbol corresponding to the character;
and aiming at each character in the text data, generating a symbol characteristic corresponding to the character according to the label symbol and the symbol space corresponding to the character.
Fig. 4 is a block diagram illustrating a speech recognition apparatus according to an embodiment of the present disclosure. The apparatus 40 includes:
a receiving module 41, configured to receive voice data to be recognized;
a third determining module 42, configured to determine, according to the speech data and a speech recognition model, a symbol to be added and a recognition text corresponding to the speech data, where the speech recognition model is obtained by training based on any one of the above training methods of the speech recognition model;
a generating module 43, configured to generate a text result corresponding to the to-be-recognized voice data according to the to-be-added symbol and the recognition text when the to-be-added symbol is a target symbol.
Referring now to FIG. 5, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform a method of training a speech recognition model, the speech recognition model comprising an encoder, a predictor model, a language decoder, and a symbol decoder, the method comprising: obtaining character acoustic vectors corresponding to training speech data according to the encoder and the predictor model, wherein each character acoustic vector corresponds to a predicted character of the training speech data; obtaining an output text and a language feature vector corresponding to the predicted character according to the character acoustic vector and the language decoder; determining a symbol probability distribution corresponding to the predicted character according to the character acoustic vector, the language feature vector, and the symbol decoder; determining a target loss according to the symbol probability distribution and the target symbol features corresponding to the predicted characters in the training speech data; and updating the model parameters of the speech recognition model according to the target loss when an update condition is satisfied.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform a speech recognition method, the method comprising: receiving voice data to be recognized; determining a symbol to be added and a recognition text corresponding to the voice data according to the voice data and a speech recognition model, wherein the speech recognition model is obtained by training based on any one of the above training methods of the speech recognition model; and under the condition that the symbol to be added is a target symbol, generating a text result corresponding to the voice data to be recognized according to the symbol to be added and the recognition text.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases form a limitation on the module itself, for example, the first processing module may also be described as a "module for obtaining a character acoustic vector corresponding to training speech data according to the encoder and the predictor model".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a method of training a speech recognition model including an encoder, a predictor model, a language decoder, and a symbol decoder, the method including:
obtaining character acoustic vectors corresponding to training voice data according to the encoder and the predictor model, wherein each character acoustic vector corresponds to a predicted character of the training voice data;
obtaining an output text and a language feature vector corresponding to the predicted character according to the character acoustic vector and the language decoder;
determining symbol probability distribution corresponding to the predicted character according to the character acoustic vector, the language feature vector and the symbol decoder;
determining target loss according to the symbol probability distribution and target symbol characteristics corresponding to the predicted characters in the training voice data;
and under the condition that an updating condition is met, updating the model parameters of the speech recognition model according to the target loss.
Example 2 provides the method of example 1, wherein the target symbol feature is a one-hot encoded feature, and determining a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data includes:
determining a cross-entropy loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character;
and taking the determined cross-entropy loss as the target loss.
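A minimal sketch of the cross-entropy computation of example 2, assuming the one-hot target symbol features are stored equivalently as integer class indices (torch.nn.functional.cross_entropy expects class indices and applies log-softmax to the raw logits internally):

    import torch.nn.functional as F

    def symbol_target_loss(symbol_logits, target_symbol_ids):
        # symbol_logits: (num_predicted_chars, symbol_vocab_size) raw scores
        # target_symbol_ids: (num_predicted_chars,) index of each one-hot target feature
        return F.cross_entropy(symbol_logits, target_symbol_ids)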
Example 3 provides the method of example 2, wherein updating the model parameters of the speech recognition model according to the target loss includes:
updating the model parameters of the symbol decoder in the speech recognition model according to the target loss.
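One way to realize the restricted update of example 3 is to construct the optimizer over the symbol decoder's parameters alone, so that only that component is changed by the target loss. The variable names reuse the hypothetical sketches above; the optimizer choice and learning rate are assumptions.

    import torch

    optimizer = torch.optim.Adam(model.symbol_decoder.parameters(), lr=1e-4)

    text_logits, symbol_logits = model(speech_features)
    loss = symbol_target_loss(symbol_logits, target_symbol_ids)

    optimizer.zero_grad()
    loss.backward()    # gradients are computed through the whole model,
    optimizer.step()   # but only the symbol decoder's parameters are updated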
Example 4 provides the method of example 1, wherein the target symbol feature is a one-hot encoded feature, and determining a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data includes:
determining a first cross-entropy loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character;
determining a second cross-entropy loss corresponding to the language decoder according to a target text corresponding to the predicted character and the output text;
and determining the target loss according to the first cross-entropy loss and the second cross-entropy loss.
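A sketch of the joint objective of example 4, continuing the sketches above; the equal weighting of the two terms is an assumption, since the disclosure only states that the target loss is determined from both cross-entropy losses.

    import torch.nn.functional as F

    # text_logits: (num_predicted_chars, text_vocab_size) from the language decoder
    # target_text_ids: (num_predicted_chars,) indices of the target text characters
    first_loss = F.cross_entropy(symbol_logits, target_symbol_ids)  # symbol decoder branch
    second_loss = F.cross_entropy(text_logits, target_text_ids)     # language decoder branch
    target_loss = first_loss + second_loss                          # assumed 1:1 weighting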
Example 5 provides the method of any one of examples 1-4, wherein the symbol feature corresponding to each character in the text corresponding to the training speech data is determined by:
acquiring text data that includes punctuation marks and corresponds to the training speech data;
for each character in the text data, if a punctuation mark immediately follows the character, taking that punctuation mark as the label symbol corresponding to the character, and if no punctuation mark immediately follows the character, taking a preset symbol as the label symbol corresponding to the character;
and, for each character in the text data, generating the symbol feature corresponding to the character according to the label symbol corresponding to the character and the symbol space.
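The labelling rule of example 5 can be sketched as follows; the punctuation set, the placeholder tag "<none>" standing in for the preset symbol, and the function names are illustrative assumptions.

    PUNCTUATION = set("，。？！、,.?!")  # assumed symbol space of punctuation marks
    NO_SYMBOL = "<none>"                # assumed preset symbol: no punctuation follows

    def label_characters(text):
        # Pair each non-punctuation character with the punctuation mark that
        # immediately follows it, or with the preset symbol when none does.
        pairs = []
        for i, ch in enumerate(text):
            if ch in PUNCTUATION:
                continue
            follower = text[i + 1] if i + 1 < len(text) else ""
            pairs.append((ch, follower if follower in PUNCTUATION else NO_SYMBOL))
        return pairs

    SYMBOL_SPACE = sorted(PUNCTUATION) + [NO_SYMBOL]

    def symbol_feature(label_symbol):
        # One-hot symbol feature over the symbol space.
        feature = [0] * len(SYMBOL_SPACE)
        feature[SYMBOL_SPACE.index(label_symbol)] = 1
        return feature

For instance, label_characters("你好，世界。") yields [("你", "<none>"), ("好", "，"), ("世", "<none>"), ("界", "。")].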
Example 6 provides, according to one or more embodiments of the present disclosure, a speech recognition method, the method including:
receiving speech data to be recognized;
determining a symbol to be added and a recognition text corresponding to the speech data according to the speech data and a speech recognition model, wherein the speech recognition model is obtained by training based on the method of training a speech recognition model according to any one of examples 1-5;
and, under the condition that the symbol to be added is a target symbol, generating a text result corresponding to the speech data to be recognized according to the symbol to be added and the recognition text.
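A sketch of the inference flow of example 6, reusing the hypothetical model and symbol space above; greedy argmax decoding and the id_to_char vocabulary lookup are assumptions for illustration, not the decoding strategy prescribed by the disclosure.

    def recognize(model, speech_features, id_to_char):
        text_logits, symbol_logits = model(speech_features)
        char_ids = text_logits.argmax(dim=-1)      # recognition text, decoded greedily
        symbol_ids = symbol_logits.argmax(dim=-1)  # one symbol decision per character
        pieces = []
        for cid, sid in zip(char_ids.tolist(), symbol_ids.tolist()):
            pieces.append(id_to_char[cid])
            symbol = SYMBOL_SPACE[sid]
            if symbol != NO_SYMBOL:                # only target symbols are added to the text
                pieces.append(symbol)
        return "".join(pieces)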
Example 7 provides an apparatus for training a speech recognition model, the speech recognition model including an encoder, a predictor model, a language decoder, and a symbol decoder, the apparatus including:
a first processing module, configured to obtain character acoustic vectors corresponding to training speech data according to the encoder and the predictor model, wherein each character acoustic vector corresponds to one predicted character of the training speech data;
a second processing module, configured to obtain an output text and a language feature vector corresponding to the predicted character according to the character acoustic vector and the language decoder;
a first determining module, configured to determine, according to the character acoustic vector, the language feature vector, and the symbol decoder, a symbol probability distribution corresponding to the predicted character;
a second determining module, configured to determine a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data;
and an updating module, configured to update, under the condition that an update condition is met, the model parameters of the speech recognition model according to the target loss.
Example 8 provides, in accordance with one or more embodiments of the present disclosure, a speech recognition apparatus, the apparatus including:
a receiving module, configured to receive speech data to be recognized;
a third determining module, configured to determine, according to the speech data and a speech recognition model, a symbol to be added and a recognition text corresponding to the speech data, wherein the speech recognition model is obtained by training based on the method of training a speech recognition model described in any one of examples 1-5;
and a generating module, configured to generate, under the condition that the symbol to be added is a target symbol, a text result corresponding to the speech data to be recognized according to the symbol to be added and the recognition text.
Example 9 provides, in accordance with one or more embodiments of the present disclosure, a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-6.
Example 10 provides, in accordance with one or more embodiments of the present disclosure, an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of any of examples 1-6.
The foregoing description is merely illustrative of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (10)

1. A method of training a speech recognition model, the speech recognition model comprising an encoder, a predictor model, a language decoder, and a symbol decoder, the method comprising:
obtaining character acoustic vectors corresponding to training speech data according to the encoder and the predictor model, wherein each character acoustic vector corresponds to one predicted character of the training speech data;
obtaining an output text and a language feature vector corresponding to the predicted character according to the character acoustic vector and the language decoder;
determining a symbol probability distribution corresponding to the predicted character according to the character acoustic vector, the language feature vector, and the symbol decoder;
determining a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data;
and, under the condition that an update condition is met, updating the model parameters of the speech recognition model according to the target loss.
2. The method of claim 1, wherein the target symbol feature is a one-hot encoded feature, and wherein determining a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data comprises:
determining a cross-entropy loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character;
and taking the determined cross-entropy loss as the target loss.
3. The method of claim 2, wherein updating model parameters of the speech recognition model based on the target loss comprises:
updating the model parameters of the symbol decoder in the speech recognition model according to the target loss.
4. The method of claim 1, wherein the target symbol feature is a one-hot encoded feature, and wherein determining a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data comprises:
determining a first cross-entropy loss according to the symbol probability distribution and the target symbol feature corresponding to the predicted character;
determining a second cross-entropy loss corresponding to the language decoder according to a target text corresponding to the predicted character and the output text;
and determining the target loss according to the first cross-entropy loss and the second cross-entropy loss.
5. The method according to any one of claims 1-4, wherein the symbol feature corresponding to each character in the text corresponding to the training speech data is determined by:
acquiring text data that includes punctuation marks and corresponds to the training speech data;
for each character in the text data, if a punctuation mark immediately follows the character, taking that punctuation mark as the label symbol corresponding to the character, and if no punctuation mark immediately follows the character, taking a preset symbol as the label symbol corresponding to the character;
and, for each character in the text data, generating the symbol feature corresponding to the character according to the label symbol corresponding to the character and the symbol space.
6. A method of speech recognition, the method comprising:
receiving speech data to be recognized;
determining a symbol to be added and a recognition text corresponding to the speech data according to the speech data and a speech recognition model, wherein the speech recognition model is obtained by training based on the method of training a speech recognition model according to any one of claims 1-5;
and, under the condition that the symbol to be added is a target symbol, generating a text result corresponding to the speech data to be recognized according to the symbol to be added and the recognition text.
7. An apparatus for training a speech recognition model, the speech recognition model comprising an encoder, a predictor model, a language decoder, and a symbol decoder, the apparatus comprising:
a first processing module, configured to obtain character acoustic vectors corresponding to training speech data according to the encoder and the predictor model, wherein each character acoustic vector corresponds to one predicted character of the training speech data;
a second processing module, configured to obtain an output text and a language feature vector corresponding to the predicted character according to the character acoustic vector and the language decoder;
a first determining module, configured to determine, according to the character acoustic vector, the language feature vector, and the symbol decoder, a symbol probability distribution corresponding to the predicted character;
a second determining module, configured to determine a target loss according to the symbol probability distribution and a target symbol feature corresponding to the predicted character in the training speech data;
and an updating module, configured to update, under the condition that an update condition is met, the model parameters of the speech recognition model according to the target loss.
8. A speech recognition apparatus, characterized in that the apparatus comprises:
a receiving module, configured to receive speech data to be recognized;
a third determining module, configured to determine, according to the speech data and a speech recognition model, a symbol to be added and a recognition text corresponding to the speech data, wherein the speech recognition model is obtained by training based on the method of training a speech recognition model according to any one of claims 1-5;
and a generating module, configured to generate, under the condition that the symbol to be added is a target symbol, a text result corresponding to the speech data to be recognized according to the symbol to be added and the recognition text.
9. A computer-readable medium, on which a computer program is stored, characterized in that the program, when executed by processing means, carries out the steps of the method of any one of claims 1 to 6.
10. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 6.

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
CN202110735675.0A (granted as CN113362811B) | 2021-06-30 | 2021-06-30 | Training method of voice recognition model, voice recognition method and device
PCT/CN2022/091481 (published as WO2023273612A1) | 2021-06-30 | 2022-05-07 | Training method and apparatus for speech recognition model, speech recognition method and apparatus, medium, and device

Publications (2)

Publication Number | Publication Date
CN113362811A (application publication) | 2021-09-07
CN113362811B (granted publication) | 2023-03-24

Family ID: 77537438



Cited By (2)

* Cited by examiner, † Cited by third party

Publication | Priority Date | Publication Date | Assignee | Title
CN114049885A * | 2022-01-12 | 2022-02-15 | 阿里巴巴达摩院(杭州)科技有限公司 | Punctuation mark recognition model construction method and punctuation mark recognition model construction device
WO2023273612A1 * | 2021-06-30 | 2023-01-05 | 北京有竹居网络技术有限公司 | Training method and apparatus for speech recognition model, speech recognition method and apparatus, medium, and device

Families Citing this family (2)

* Cited by examiner, † Cited by third party

Publication | Priority Date | Publication Date | Assignee | Title
CN116137149B * | 2023-04-18 | 2023-07-14 | 杰创智能科技股份有限公司 | Speech recognition method, device and equipment
CN116781417B * | 2023-08-15 | 2023-11-17 | 北京中电慧声科技有限公司 | Anti-cracking voice interaction method and system based on voice recognition

Citations (8)

* Cited by examiner, † Cited by third party

Publication | Priority Date | Publication Date | Assignee | Title
CN106782518A * | 2016-11-25 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of audio recognition method based on layered circulation neutral net language model
CN109410949A * | 2018-10-11 | 2019-03-01 | 厦门大学 | Content of text based on weighted finite state converter adds punctuate method
US20190163747A1 * | 2017-11-30 | 2019-05-30 | Samsung Electronics Co., Ltd. | Language model translation and training method and apparatus
CN110751945A * | 2019-10-17 | 2020-02-04 | 成都三零凯天通信实业有限公司 | End-to-end voice recognition method
CN110942763A * | 2018-09-20 | 2020-03-31 | 阿里巴巴集团控股有限公司 | Voice recognition method and device
CN111816165A * | 2020-07-07 | 2020-10-23 | 北京声智科技有限公司 | Voice recognition method and device and electronic equipment
CN112634876A * | 2021-01-04 | 2021-04-09 | 北京有竹居网络技术有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment
WO2021118604A1 * | 2019-12-13 | 2021-06-17 | Google Llc | Training speech synthesis to generate distinct speech sounds

Family Cites Families (8)

* Cited by examiner, † Cited by third party

Publication | Priority Date | Publication Date | Assignee | Title
KR20180071029A * | 2016-12-19 | 2018-06-27 | 삼성전자주식회사 | Method and apparatus for speech recognition
KR20220007160A * | 2019-05-28 | 2022-01-18 | 구글 엘엘씨 | Massive Multilingual Speech Recognition Using a Streaming End-to-End Model
CN111199727B * | 2020-01-09 | 2022-12-06 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium
CN111916067A * | 2020-07-27 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Training method and device of voice recognition model, electronic equipment and storage medium
CN112712804B * | 2020-12-23 | 2022-08-26 | 哈尔滨工业大学(威海) | Speech recognition method, system, medium, computer device, terminal and application
CN112927679B * | 2021-02-07 | 2023-08-15 | 虫洞创新平台(深圳)有限公司 | Method for adding punctuation marks in voice recognition and voice recognition device
CN113012701B * | 2021-03-16 | 2024-03-22 | 联想(北京)有限公司 | Identification method, identification device, electronic equipment and storage medium
CN113362811B * | 2021-06-30 | 2023-03-24 | 北京有竹居网络技术有限公司 | Training method of voice recognition model, voice recognition method and device





Legal Events

Code | Event
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant