CN111951785B - Voice recognition method and device and terminal equipment - Google Patents


Info

Publication number
CN111951785B
Authority
CN
China
Legal status
Active
Application number
CN201910407618.2A
Other languages
Chinese (zh)
Other versions
CN111951785A
Inventor
陈明
Current Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention is applicable to the technical field of voice recognition, and provides a voice recognition method, a voice recognition device and terminal equipment, wherein the method comprises the following steps: calculating a first conditional probability of the sentence according to the pre-trained language model; adjusting a first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function; training the speech recognition model with the second loss function, and performing speech recognition using the trained speech recognition model. The invention can improve the accuracy of voice recognition.

Description

Voice recognition method and device and terminal equipment
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voice recognition method, a voice recognition device and terminal equipment.
Background
Voice recognition technology aims to recognize an input voice signal and output machine-readable text; it can be applied to smart homes, smart vehicles, intelligent customer service robots, and the like. With the development of deep learning, speech recognition has shifted from the traditional Gaussian mixture model and hidden Markov model (Gaussian Mixture Model-Hidden Markov Model, GMM-HMM) approach to techniques based on deep neural networks (Deep Neural Networks, DNN). DNN-based speech recognition falls into two types: one replaces the original GMM part with a DNN, i.e., the deep neural network and hidden Markov model (Deep Neural Networks-Hidden Markov Model, DNN-HMM); the other is end-to-end speech recognition based on deep neural networks.
Because end-to-end speech recognition (End-To-End Automatic Speech Recognition) based on deep neural networks goes directly from voice input to decoded output, it requires neither complex alignment work nor the construction of a pronunciation dictionary, saving a large amount of preparation time, and is therefore widely applied. At present, existing end-to-end techniques (such as connectionist temporal classification (CTC), the deep feedforward sequential memory network (DFSMN), and attention-based sequence-to-sequence networks (Seq2Seq-Attention)) cannot learn a complex language model; they often recognize input speech from the voice waveform alone, so the recognized words have poor logical coherence. Consequently, when a trained voice recognition model encounters more complex speech, its recognition accuracy is low.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a method, an apparatus, and a terminal device for voice recognition, so as to solve the problem in the prior art that the recognition accuracy of a trained voice recognition model is low when encountering complex voice.
A first aspect of an embodiment of the present invention provides a speech recognition method, including:
calculating a first conditional probability of the sentence according to the pre-trained language model;
adjusting a first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function;
training the speech recognition model with the second loss function, and performing speech recognition using the trained speech recognition model.
A second aspect of an embodiment of the present invention provides a voice recognition apparatus, including:
a first conditional probability calculation module for calculating a first conditional probability of a sentence according to the pre-trained language model;
the adjusting module is used for adjusting the first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function;
and the voice recognition module is used for training the voice recognition model by utilizing the second loss function and performing voice recognition by using the trained voice recognition model.
A third aspect of an embodiment of the present invention provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect described above when the computer program is executed.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described in the first aspect above.
In the embodiment of the invention, the first conditional probability of a sentence is calculated using the pre-trained language model, and the original first loss function of the speech recognition model is corrected to obtain a second loss function; the speech recognition model is then trained with the second loss function, which optimizes the loss function of the speech recognition model and introduces the characteristics of the pre-trained language model. Because the first conditional probability from the pre-trained language model is used to optimize the first loss function, the pre-trained language model is effectively embedded into the voice recognition model, and the trained voice recognition model achieves higher recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a specific implementation process for adjusting the first loss function according to the second conditional probability and the influence coefficient according to the embodiment of the present invention;
fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the present invention, which is described in detail below:
s101: a first conditional probability of the sentence is calculated based on the pre-trained language model.
It should be noted that a language model can capture the internal relations between words from a large amount of text, reduce the word recognition error rate, and make the recognition result more logical. Common language models include n-gram language models and neural-network-based language models.
The pre-trained language model in the embodiment of the invention can be trained with the language model training tool SRILM as an n-gram language model, where the parameter n indicates that the probability of the current word is conditioned on the previous n-1 words. In the embodiment of the invention, a trigram language model (i.e., a language model with n = 3) is trained, so the probability of the current word depends on the previous 2 words. A sentence refers to a sentence that the speech recognition model predicts to generate from the input samples (speech data).
Further, the calculating the first conditional probability of the sentence according to the pre-trained language model includes:
for each sentence, calculating its first conditional probability according to the following equation:

P(S) = \prod_i \frac{C(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{C(w_{i-(n-1)}, \ldots, w_{i-1})}    (1)

In the above formula (1), P(S) represents the first conditional probability of the sentence S; C(w_{i-(n-1)}, …, w_{i-1}, w_i) represents the number of times the word w_i occurs after the words w_{i-(n-1)}, …, w_{i-1}; C(w_{i-(n-1)}, …, w_{i-1}) represents the number of times the word w_{i-1} occurs after the words w_{i-(n-1)}, …, w_{i-2}; m represents the number of samples; n represents a positive integer greater than 1; and i represents the i-th word.
Since the n-gram language model assumes that the probability of the current word depends only on the previous n-1 words, the first conditional probability P(S) of a sentence S can be expressed as:

P(S) = \prod_i P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})    (2)

In the above formula (2), P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) represents the probability that the word w_i occurs given that the words w_{i-(n-1)}, …, w_{i-1} have occurred; estimating this probability by the maximum likelihood method yields formula (1).
S102: and adjusting the first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function.
The loss function is the difference between the predicted value and the true value, reflects the deviation degree of the predicted value and the true value, and the lower the deviation degree of the predicted value and the true value is, the more accurate the predicted result is, so that the smaller the loss function is, the better the quality of the finally trained model is, namely the higher the accuracy of voice recognition is.
The first loss function refers to an original loss function of the voice recognition model. The original loss function of the voice recognition model is adjusted by utilizing the first conditional probability, so that the characteristics of the pre-trained language model are introduced, and the accuracy of the trained voice recognition model can be improved.
Specifically, the adjusting the first loss function of the speech recognition model according to the first conditional probability includes:
calculating a second conditional probability using the first conditional probability;
and adjusting the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model.
After the first conditional probability P (S) is obtained through calculation, the first conditional probability P (S) is transformed to obtain a second conditional probability T, and then the first loss function is adjusted by using the T and the influence coefficient r of the pre-trained language model.
Further, the calculating a second conditional probability using the first conditional probability includes:
using the first conditional probability, and calculating according to the following formula:
in the above formula (3), T represents the calculated second conditional probability, P (S) represents the first conditional probability, and length represents the length of the sentence S, i.e., the number of words contained in S.
As shown in fig. 2, fig. 2 is a schematic flow chart of a specific implementation process of adjusting the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model, which includes the following steps S201 to S203:
s201: acquiring a plurality of predicted sentences, and calculating a second conditional probability of each sentence;
obtaining multiple predicted sentences from speech recognition models, provided that k predicted sentences are obtained, i.e. y_pred 1 ,y_pred 2 ,…,y_pred k . Calculating the T value of each sentence by using the formula (1) and the formula (3) to obtain T 1 ,T 2 ,…,T k
S202: according to the second conditional probabilities of all sentences and the influence coefficients, calculating average conditional probabilities;
based on the T value obtained in the above step S201 and the influence coefficient r of the pre-trained language model, an average conditional probability T is calculated according to the following formula i
In the above formula (4), T i Represents the calculated average conditional probability, r represents the influence coefficient, k represents the number of sentences, j represents the jth sentence, T j A second conditional probability representing a jth sentence.
S203: and adjusting the first loss function by using the average conditional probability.
The method for adjusting the first loss function by using the average conditional probability comprises the following steps: and adding the average conditional probability on the basis of the original loss function to obtain a second loss function, namely the adjusted loss function.
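The adjustment just described, adding the influence-weighted average conditional probability to the original loss, can be sketched as follows. The function name and the additive form are illustrative assumptions, since the patent text does not spell out the exact arithmetic.

```python
def second_loss(first_loss, t_values, r):
    """Second (adjusted) loss: the original first loss plus the average of
    the second conditional probabilities T_j over the k predicted
    sentences, weighted by the influence coefficient r."""
    k = len(t_values)
    avg_t = r * sum(t_values) / k   # average conditional probability
    return first_loss + avg_t
```

For example, second_loss(2.0, [0.5, 1.5], 0.1) gives 2.1, since the weighted average term contributes 0.1 * 1.0.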
It should be noted that, because different influence coefficients r yield different recognition accuracies for the finally trained speech recognition model, different r values are adopted for different sample data.
In a preferred implementation manner of the embodiment of the present invention, the influence coefficient is an optimal influence coefficient, and the method for obtaining the optimal influence coefficient is:
training the voice recognition model by adopting a plurality of influence coefficients, and determining the influence coefficient with the highest recognition accuracy of the voice recognition model according to a training result, namely the optimal influence coefficient;
said adjusting said first loss function according to said second conditional probability and an influence coefficient of said pre-trained language model, comprising:
and adjusting the first loss function according to the second conditional probability and the optimal influence coefficient.
Typically, the influence coefficient r takes a value in the range 0-1. In the embodiment of the invention, practical training leads to the following conclusion: when the influence coefficient r lies in the interval [0.1, 0.5], the converged voice recognition model has better recognition accuracy. However, for voice data of different sizes and different domains, different influence coefficients r should be selected; that is, the choice of r is related to the size and domain of the input voice data, and in practice the optimal influence coefficient can be selected as needed.
Optionally, the training the speech recognition model using a plurality of influence coefficients includes:
presetting a value interval for the influence coefficient, adjusting the value of the influence coefficient according to a preset step length, and respectively training the voice recognition model by utilizing each influence coefficient.
In the training process of the voice recognition model, a value interval can be preset for r, the value of r is automatically adjusted according to the step length of 0.1 on the assumption that the value interval is [0.1,0.5], the voice recognition model is trained by the value, and r which enables the converged voice recognition model to have the highest recognition precision is determined according to the training result, namely the optimal influence coefficient.
After the optimal influence coefficient is determined, the loss function is adjusted according to the optimal influence coefficient and the first conditional probability, and the speech recognition model is trained according to the adjusted first loss function.
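The coefficient sweep described above can be sketched as a simple grid search. Here train_fn is a hypothetical callback assumed to train the speech recognition model to convergence for a given r and return its recognition accuracy.

```python
def best_influence_coefficient(train_fn, lo=0.1, hi=0.5, step=0.1):
    """Train once per candidate r in [lo, hi] and keep the value whose
    converged model reaches the highest recognition accuracy."""
    best_r, best_acc = None, float("-inf")
    n_steps = int(round((hi - lo) / step))
    for k in range(n_steps + 1):
        r = round(lo + k * step, 10)   # round away floating-point drift
        acc = train_fn(r)
        if acc > best_acc:
            best_r, best_acc = r, acc
    return best_r, best_acc
```

The fixed-step candidate list mirrors the patent's example of stepping r by 0.1 over [0.1, 0.5]; the rounding keeps candidates like 0.3 exact despite float accumulation.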
S103: training the speech recognition model with the second loss function, and performing speech recognition using the trained speech recognition model.
It should be noted that, the training process of the speech recognition model is as follows: inputting sample data with labels to a voice recognition model, wherein the sample data are voice data and texts corresponding to the voice data; and extracting the characteristics of the sample data to obtain a characteristic sequence, encoding the characteristic sequence, decoding to obtain a predicted value, making a difference value between the predicted value and a true value to obtain a loss function, and training the model according to the loss function until the model converges to obtain the trained voice recognition model.
The second loss function refers to the difference between the true value and the predicted value, the value of the second loss function is obtained, the value of the second loss function is utilized to carry out parameter adjustment on the voice recognition model, and finally the voice recognition model with optimal parameters is obtained, namely the trained voice recognition model.
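The training procedure above can be sketched schematically. The scalar toy model, the squared-error stand-in for the first loss, and the update rule are all illustrative assumptions; a real system would train a deep network with its usual optimizer.

```python
def train_with_second_loss(samples, labels, lm_term, lr=0.1, epochs=50):
    """Minimal training loop: predict, evaluate the second loss (base
    discrepancy plus a constant language-model term), update the
    parameter, and record the per-epoch average loss."""
    w = 0.0                                      # toy stand-in for model weights
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(samples, labels):
            pred = w * x
            total += (pred - y) ** 2 + lm_term   # second (adjusted) loss
            w += lr * (y - pred) * x             # gradient-style parameter step
        history.append(total / len(samples))
    return history
```

The recorded losses should decrease toward the constant language-model term as the toy model converges.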
When the trained voice recognition model is used for voice recognition, the audio data to be recognized is input into the trained voice recognition model, and the trained voice recognition model outputs the text corresponding to the audio to be recognized, so that voice recognition can be realized.
In the embodiment of the invention, the first conditional probability of a sentence is calculated using the pre-trained language model, and the original first loss function of the speech recognition model is corrected to obtain a second loss function; the speech recognition model is then trained with the second loss function, which optimizes the loss function of the speech recognition model and introduces the characteristics of the pre-trained language model. Because the first conditional probability from the pre-trained language model is used to optimize the first loss function, the pre-trained language model is effectively embedded into the voice recognition model, and the trained voice recognition model achieves higher recognition accuracy.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention, where the device includes: a first conditional probability calculation module 31, an adjustment module 32 and a speech recognition module 33. Wherein:
a first conditional probability calculation module 31 for calculating a first conditional probability of a sentence according to a pre-trained language model.
Further, the first conditional probability calculation module 31 is specifically configured to: for each sentence, calculate its first conditional probability according to the following equation:

P(S) = \prod_i \frac{C(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{C(w_{i-(n-1)}, \ldots, w_{i-1})}    (1)

In the above formula, P(S) represents the first conditional probability of the sentence S; C(w_{i-(n-1)}, …, w_{i-1}, w_i) represents the number of times the word w_i occurs after the words w_{i-(n-1)}, …, w_{i-1}; C(w_{i-(n-1)}, …, w_{i-1}) represents the number of times the word w_{i-1} occurs after the words w_{i-(n-1)}, …, w_{i-2}; m represents the number of samples; n represents a positive integer greater than 1; and i represents the i-th word.
And the adjusting module 32 is configured to adjust the first loss function of the speech recognition model according to the first conditional probability, so as to obtain a second loss function.
Further, the adjustment module 32 includes: a second conditional probability calculation unit 321, an adjustment unit 322, wherein:
the second conditional probability calculating unit 321 is configured to calculate a second conditional probability using the first conditional probability.
Further, the second conditional probability calculating unit 321 is specifically configured to:
using the first conditional probability, and calculating according to the following formula:
in the above expression (3), T represents the calculated second conditional probability, P (S) represents the first conditional probability, and length represents the length of the sentence S.
The adjusting unit 322 is configured to adjust the first loss function according to the second conditional probability and an influence coefficient of the pre-trained language model.
Still further, the adjusting unit 322 includes:
a first calculation subunit 3221 configured to obtain a plurality of predicted sentences, and calculate a second conditional probability of each sentence;
a second calculation subunit 3222, configured to calculate an average conditional probability according to the second conditional probabilities of all sentences and the influence coefficients;
an adjustment subunit 3223 is configured to adjust the first loss function by using the average conditional probability.
A speech recognition module 33 for training the speech recognition model with the second loss function and performing speech recognition using the trained speech recognition model.
Preferably, the influence coefficient is an optimal influence coefficient, and the device further includes an optimal influence coefficient obtaining module 34, configured to train the speech recognition model with a plurality of influence coefficients, and determine, according to a training result, an influence coefficient that makes the recognition accuracy of the speech recognition model highest, that is, the optimal influence coefficient;
preferably, the adjusting unit 322 is configured to adjust the first loss function according to the second conditional probability and the optimal influence coefficient.
Further, the optimal influence coefficient obtaining module 34 is specifically configured to: presetting a value interval for the influence coefficient, adjusting the value of the influence coefficient according to a preset step length, and respectively training the voice recognition model by utilizing each influence coefficient.
Fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 4, the terminal device 4 of this embodiment includes: a processor 40, a memory 41 and a computer program 42, such as a speech recognition program, stored in the memory 41 and executable on the processor 40. The processor 40, when executing the computer program 42, implements the steps of the various speech recognition method embodiments described above, such as steps S101 to S103 shown in fig. 1. Alternatively, the processor 40, when executing the computer program 42, performs the functions of the modules/units of the apparatus embodiments described above, such as the functions of the modules 31 to 33 shown in fig. 3.
Illustratively, the computer program 42 may be partitioned into one or more modules/units that are stored in the memory 41 and executed by the processor 40 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 42 in the terminal device 4. For example, the computer program 42 may be divided into a first conditional probability calculation module, an adjustment module, and a speech recognition module, each of which specifically functions as follows:
a first conditional probability calculation module for calculating a first conditional probability of a sentence according to the pre-trained language model;
the adjusting module is used for adjusting the first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function;
and the voice recognition module is used for training the voice recognition model by utilizing the second loss function and performing voice recognition by using the trained voice recognition model.
The terminal device 4 may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The terminal device may include, but is not limited to, a processor 40, a memory 41. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the terminal device 4 and does not constitute a limitation of the terminal device 4, and may include more or less components than illustrated, or may combine certain components, or different components, e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 40 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 41 may be an internal storage unit of the terminal device 4, such as a hard disk or a memory of the terminal device 4. The memory 41 may be an external storage device of the terminal device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 4. Further, the memory 41 may also include both an internal storage unit and an external storage device of the terminal device 4. The memory 41 is used for storing the computer program as well as other programs and data required by the terminal device. The memory 41 may also be used for temporarily storing data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (8)

1. A method of speech recognition, comprising:
calculating a first conditional probability of the sentence according to the pre-trained language model;
adjusting a first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function;
training the voice recognition model by utilizing the second loss function, and performing voice recognition by using the trained voice recognition model;
the adjusting the first loss function of the speech recognition model according to the first conditional probability includes:
calculating a second conditional probability using the first conditional probability;
adjusting the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model;
the calculating a second conditional probability using the first conditional probability includes:
using the first conditional probability, and calculating according to the following formula:
in the above formula, T represents the calculated second conditional probability, P(S) represents the first conditional probability, and length represents the length of the sentence S.
2. The method of claim 1, wherein calculating a first conditional probability of a sentence from a pre-trained language model comprises:
for each sentence, a first conditional probability thereof is calculated according to the following equation:

P(S) = \prod_i \frac{C(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{C(w_{i-(n-1)}, \ldots, w_{i-1})}

in the above formula, P(S) represents the first conditional probability of the sentence S; C(w_{i-(n-1)}, …, w_{i-1}, w_i) represents the number of times the word w_i occurs after the words w_{i-(n-1)}, …, w_{i-1}; C(w_{i-(n-1)}, …, w_{i-1}) represents the number of times the word w_{i-1} occurs after the words w_{i-(n-1)}, …, w_{i-2}; m represents the number of samples; n represents a positive integer greater than 1; and i represents the i-th word.
3. The method of claim 1, wherein said adjusting the first loss function based on the second conditional probability and the coefficient of influence of the pre-trained language model comprises:
acquiring a plurality of predicted sentences, and calculating the second conditional probability of each sentence;
calculating an average conditional probability according to the second conditional probabilities of all the sentences and the influence coefficient; and
adjusting the first loss function using the average conditional probability.
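One plausible reading of claim 3 — not the patent's verbatim formula — is to subtract the influence-coefficient-weighted average of the second conditional probabilities from the base loss, so that predicted sentences the language model deems likely reduce the loss:

```python
def adjusted_loss(base_loss, second_probs, alpha):
    """Adjust the base (first) loss with a language-model term.

    `alpha` is the influence coefficient of the pre-trained language model.
    Subtracting alpha times the average second conditional probability is an
    assumed combination for illustration: a higher LM probability for the
    predicted sentences yields a lower adjusted loss.
    """
    avg = sum(second_probs) / len(second_probs)
    return base_loss - alpha * avg
```

The coefficient alpha controls how strongly the language model's fluency judgment pulls against the acoustic objective.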
4. The method of claim 3, wherein the influence coefficient is an optimal influence coefficient, obtained as follows:
training the speech recognition model with a plurality of influence coefficients, and determining, from the training results, the influence coefficient that yields the highest recognition accuracy of the speech recognition model as the optimal influence coefficient;
the adjusting the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model includes:
adjusting the first loss function according to the second conditional probability and the optimal influence coefficient.
5. The method of claim 4, wherein training the speech recognition model with a plurality of influence coefficients respectively includes:
presetting a value interval for the influence coefficient, stepping the value of the influence coefficient through the interval at a preset step size, and training the speech recognition model with each resulting influence coefficient.
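The coefficient sweep of claim 5 amounts to a grid search over a preset interval. In the sketch below, `train_and_eval` is a hypothetical callback (not named in the patent) that trains the model with a given coefficient and returns its recognition accuracy:

```python
def best_influence_coefficient(train_and_eval, lo=0.0, hi=1.0, step=0.1):
    """Sweep the influence coefficient over [lo, hi] at a fixed step size.

    `train_and_eval(alpha)` is a hypothetical callback: it trains the speech
    recognition model with coefficient `alpha` and returns recognition
    accuracy. The coefficient with the highest accuracy is returned.
    """
    best_alpha, best_acc = lo, float("-inf")
    alpha = lo
    while alpha <= hi + 1e-9:  # tolerate floating-point drift at the endpoint
        acc = train_and_eval(alpha)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
        alpha += step
    return best_alpha, best_acc
```

Because each grid point requires a full training run, coarse steps (e.g. 0.1) are typically swept first, optionally refined around the best value.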
6. A speech recognition apparatus, comprising:
a first conditional probability calculation module, configured to calculate a first conditional probability of a sentence according to a pre-trained language model;
an adjusting module, configured to adjust a first loss function of the speech recognition model according to the first conditional probability to obtain a second loss function; and
a speech recognition module, configured to train the speech recognition model using the second loss function and perform speech recognition with the trained speech recognition model;
the adjustment module comprises:
a second conditional probability calculation unit configured to calculate a second conditional probability using the first conditional probability;
an adjusting unit, configured to adjust the first loss function according to the second conditional probability and an influence coefficient of the pre-trained language model;
the second conditional probability calculation unit is specifically configured to:
calculating, from the first conditional probability, according to the following formula:
in the above formula, T represents the calculated second conditional probability, P (S) represents the first conditional probability, and length represents the length of the sentence S.
7. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 5.
CN201910407618.2A 2019-05-16 2019-05-16 Voice recognition method and device and terminal equipment Active CN111951785B (en)


Publications (2)

Publication Number Publication Date
CN111951785A CN111951785A (en) 2020-11-17
CN111951785B true CN111951785B (en) 2024-03-15


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223504B (en) * 2021-04-30 2023-12-26 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of acoustic model
CN113327581B (en) * 2021-05-04 2022-05-24 西安博达软件股份有限公司 Recognition model optimization method and system for improving speech recognition accuracy

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5839106A (en) * 1996-12-17 1998-11-17 Apple Computer, Inc. Large-vocabulary speech recognition using an integrated syntactic and semantic statistical language model
KR20050011441A (en) * 2003-07-23 2005-01-29 주식회사 팬택 Method for modificating hmm
JP2010078877A (en) * 2008-09-25 2010-04-08 Pioneer Electronic Corp Speech recognition device, speech recognition method, and speech recognition program
CN102999533A (en) * 2011-09-19 2013-03-27 腾讯科技(深圳)有限公司 Textspeak identification method and system
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 A kind of domain-adaptive audio recognition method and device
CN107480144A (en) * 2017-08-03 2017-12-15 中国人民大学 Possess the image natural language description generation method and device across language learning ability
CN108962223A (en) * 2018-06-25 2018-12-07 厦门快商通信息技术有限公司 A kind of voice gender identification method, equipment and medium based on deep learning
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks
CN109410914A (en) * 2018-08-28 2019-03-01 江西师范大学 A kind of Jiangxi dialect phonetic and dialect point recognition methods

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10186255B2 (en) * 2016-01-16 2019-01-22 Genesys Telecommunications Laboratories, Inc. Language model customization in speech recognition for speech analytics
US10176799B2 (en) * 2016-02-02 2019-01-08 Mitsubishi Electric Research Laboratories, Inc. Method and system for training language models to reduce recognition errors




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant