CN113643706B - Speech recognition method, device, electronic equipment and storage medium


Info

Publication number: CN113643706B
Authority: CN (China)
Prior art keywords: loss, recognition result, voice recognition, voice, output
Legal status: Active (granted)
Application number: CN202110796768.4A
Other languages: Chinese (zh)
Other versions: CN113643706A
Inventors: 李亚桐, 张伟彬, 陈东鹏
Assignee (original and current): Voiceai Technologies Co., Ltd.
Priority: CN202110796768.4A
Related publications: CN113643706A (application), CN113643706B (grant)

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/26 — Speech to text systems
    • G10L15/01 — Assessment or evaluation of speech recognition systems

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Telephonic Communication Services

Abstract

The embodiment of the application discloses a voice recognition method, a voice recognition device, electronic equipment and a storage medium. The method comprises the following steps: acquiring voice data to be recognized; recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result; obtaining keywords from the first voice recognition result; adjusting, based on the keywords, the loss corresponding to the first voice recognition result to obtain a first voice recognition result after the loss is adjusted; and acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss is adjusted. In this method, the loss of the first voice recognition result is adjusted according to the keywords, and the second voice recognition result corresponding to the voice data to be recognized is then acquired from the first voice recognition result after the loss is adjusted, thereby improving the accuracy of recognizing the voice data to be recognized.

Description

Speech recognition method, device, electronic equipment and storage medium
Technical Field
The application belongs to the field of voice recognition, and particularly relates to a voice recognition method, a voice recognition device, electronic equipment and a storage medium.
Background
Speech recognition is a technique that uses machines to simulate human recognition and understanding, converting human speech signals into corresponding text or commands. The fundamental purpose of speech recognition is to give a machine an auditory function, so that it can directly receive human speech and understand human intent. With the development of artificial intelligence technology, speech recognition has made great progress and has entered fields such as home appliances, communications, automobiles and medical treatment, but the accuracy of related speech recognition methods still needs to be improved.
Disclosure of Invention
In view of the above, the present application proposes a voice recognition method, apparatus, electronic device, and storage medium to address the above problem.
In a first aspect, an embodiment of the present application provides a voice recognition method, where the method includes: acquiring voice data to be recognized; recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result; obtaining keywords from the first voice recognition result; adjusting, based on the keywords, the loss corresponding to the first voice recognition result to obtain a first voice recognition result after the loss is adjusted; and acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss is adjusted.
In a second aspect, an embodiment of the present application provides a voice recognition apparatus, including: a data acquisition unit, configured to acquire voice data to be recognized; a first result acquisition unit, configured to recognize the voice data to be recognized and acquire a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result; a keyword obtaining unit, configured to obtain keywords from the first voice recognition result; a loss adjusting unit, configured to adjust, based on the keywords, the loss corresponding to the first voice recognition result to obtain a first voice recognition result after the loss is adjusted; and a second result acquisition unit, configured to acquire a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss is adjusted.
In a third aspect, an embodiment of the present application provides an electronic device, including one or more processors and a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having program code stored therein, wherein the above-described method is performed when the program code is run.
The embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium. First, voice data to be recognized is acquired and recognized, and a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result are acquired. Keywords are then obtained from the first voice recognition result, and the loss of the first voice recognition result is adjusted based on the keywords to obtain a first voice recognition result after the loss is adjusted. Finally, a second voice recognition result corresponding to the voice data to be recognized is acquired from the first voice recognition result after the loss is adjusted. By this method, the keywords are acquired automatically from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, and the second voice recognition result corresponding to the voice data to be recognized is acquired from the first voice recognition result after the loss is adjusted, thereby improving the accuracy of recognizing the voice data to be recognized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech recognition method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a word graph according to another embodiment of the present application;
FIG. 4 is a flow chart of a speech recognition method according to still another embodiment of the present application;
FIG. 5 is a flow chart of a speech recognition method according to yet another embodiment of the present application;
FIG. 6 is a block diagram showing a voice recognition apparatus according to an embodiment of the present application;
FIG. 7 shows a block diagram of an electronic device for performing a speech recognition method according to an embodiment of the present application;
fig. 8 shows a storage unit for storing or carrying program code implementing a speech recognition method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Speech recognition is a key technology for achieving human-computer interaction: a machine recognizes the user's voice commands, which significantly improves how human-computer interaction is accomplished and allows a user to speak commands while completing other tasks. Speech recognition is achieved by a speech recognition engine (or system) trained online or offline. The speech recognition process can generally be divided into a training phase and a recognition phase. In the training phase, an Acoustic Model (AM) and a vocabulary (lexicon) are statistically derived from the training data based on the mathematical model on which the speech recognition engine (or system) is built. In the recognition phase, the speech recognition engine (or system) uses the acoustic model and the vocabulary to process the input speech and obtain a speech recognition result. For example, features are extracted from the waveform of the input sound to obtain feature vectors, a sequence of phonemes (e.g., [i], [o], etc.) is then obtained according to the acoustic model, and finally words, or even sentences, that best match the phoneme sequence are located in the vocabulary.
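As an illustration only, the recognition phase described above can be sketched in a few lines of Python; the feature extractor, acoustic model and vocabulary below are toy stand-ins and make no claim about the engine of this application:

```python
# Toy sketch of the recognition phase: waveform -> features -> phonemes
# -> vocabulary lookup. All components are illustrative assumptions.

def extract_features(waveform):
    # Toy "feature": mean absolute amplitude per fixed-size frame.
    frame = 4
    return [sum(abs(s) for s in waveform[i:i + frame]) / frame
            for i in range(0, len(waveform), frame)]

def acoustic_decode(features):
    # Toy acoustic model: threshold each feature into one of two phonemes.
    return ["i" if f > 0.5 else "o" for f in features]

VOCABULARY = {("i", "o"): "hello"}  # assumed one-entry lexicon

def recognize(waveform):
    phonemes = tuple(acoustic_decode(extract_features(waveform)))
    return VOCABULARY.get(phonemes, "<unk>")

print(recognize([0.9, 0.8, 0.7, 0.9, 0.1, 0.2, 0.1, 0.0]))  # -> hello
```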
In studying related speech recognition methods, the inventors found that, in the speech recognition result output by a speech recognition engine (or system), the recognition accuracy of the keywords that occur frequently in the input speech greatly affects the readability of the overall speech recognition result.
Therefore, the inventors propose a voice recognition method, apparatus, electronic device and storage medium. The method first acquires voice data to be recognized, recognizes it, and acquires a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result. Keywords are then obtained from the first voice recognition result, and the loss of the first voice recognition result is adjusted based on the keywords to obtain a first voice recognition result after the loss is adjusted. Finally, a second voice recognition result corresponding to the voice data to be recognized is acquired from the first voice recognition result after the loss is adjusted. Because the keywords are acquired automatically from the first voice recognition result, the loss is adjusted according to those keywords, and the second voice recognition result is taken from the loss-adjusted first result, the accuracy of recognizing the voice data to be recognized is improved.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, a voice recognition method provided by an embodiment of the present application is applied to an electronic device, and the method includes:
step S110: and acquiring voice data to be recognized.
In the embodiment of the application, the voice data to be recognized can be voice data of different users in different application scenarios. For example, during an interview, the voice data to be recognized may be the voice data of the interviewer or of the interview subject; for another example, in a video-watching scenario, the voice data to be recognized may be the voice data of the user watching the video, of the user playing the video, or of a person appearing in the video.
As one way, the voice data to be recognized may be voice data sent by a voice acquisition device, where the voice acquisition device is an intelligent device that establishes a wireless communication connection with an electronic device. Specifically, after the voice data is collected, the voice collection device may send the collected voice data to the electronic device, so that the electronic device may perform voice recognition on the voice data.
Specifically, when voice data is collected through the voice acquisition device, voice data of different users in different scenarios can be collected. Optionally, if voice data of multiple users is needed, the voice data of different users can be collected through different acquisition channels so that the users can be distinguished. For example, so that the voice data of the interviewer and of the interview subject can each later be recognized separately as voice to be recognized, and to avoid a drop in the accuracy and speed of recognizing them in a complex environment, the voice data of the interviewer and the interview subject can be acquired respectively through a first microphone and a second microphone arranged on the voice acquisition device. Specifically, since the interviewer and the interview subject are not the same person, the first microphone arranged on the voice acquisition device can acquire the voice data of the interviewer, the second microphone can acquire the voice data of the interview subject, and the two microphones can point in different directions. Further, the first microphone on the voice acquisition device may be configured to point at the interviewer, and the second microphone at the interview subject. The voice data of the different users can thus be collected through the first and second microphones, and the voice acquisition device can send the collected voice data of the interviewer and of the interview subject to the electronic device as the voice data to be recognized.
When the voice acquisition device sends the acquired voice data of the interviewer and of the interview subject to the electronic device, the electronic device can receive the voice data of the interviewer, acquired by the first microphone, through a first channel of a multichannel signal receiver, and the voice data of the interview subject, acquired by the second microphone, through a second channel of the multichannel signal receiver, so that the electronic device can recognize the voice data of the interviewer and of the interview subject separately.
As another way, the voice data to be recognized may also be voice data obtained from the cloud server after the electronic device receives the voice recognition instruction, which is not limited herein.
It should be noted that the electronic device is a voice recognition terminal. In the embodiment of the present application, the voice recognition terminal may be a terminal such as a mobile phone, a personal computer or a tablet computer, or may be a server; the application does not limit what kind of device the voice recognition terminal is, as long as it can separately perform voice recognition on the voice to be recognized received by each channel of the multichannel signal receiver.
Step S120: and recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and loss corresponding to the first voice recognition result.
In the embodiment of the present application, the loss corresponding to the first speech recognition result represents the difference between the first speech recognition result and a preset speech recognition result, where the loss may include an acoustic loss and a language loss. The preset speech recognition result may be the text content corresponding to the actually input voice data. In the embodiment of the application, the acoustic loss is the loss output by a pre-trained acoustic model for the voice data to be recognized, and the language loss is the loss output by a pre-trained language model for the voice data to be recognized; the language loss characterizes how likely the words are to follow one another: the smaller the language loss, the more likely the connection between the words; the greater the language loss, the less likely the connection. For example, suppose that recognizing the voice data to be recognized outputs a first speech recognition result containing the candidates "hello, tomorrow" and "hello, computer", where the language loss corresponding to "hello, tomorrow" is 8.5 and the language loss corresponding to "hello, computer" is 19.2. Since the language loss of "hello, tomorrow" is smaller, the connection of "hello" and "tomorrow" is more likely than the connection of "hello" and "computer".
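As a minimal sketch of how such losses would be compared (the numbers are taken from the example above; the dictionary structure is an assumption):

```python
# Candidate results and their language losses from the example above;
# a smaller loss means the words are more likely to follow one another.
candidates = {"hello, tomorrow": 8.5, "hello, computer": 19.2}
best = min(candidates, key=candidates.get)
print(best)  # -> hello, tomorrow
```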
After the voice data to be recognized is obtained, voice recognition is carried out on the voice data to be recognized, so that a voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the voice recognition result can be obtained.
The first voice recognition result may include the n best speech recognition results and a word graph. The word graph is a weighted finite-state machine: a graph formed by all the possible words in a sentence. If word B may follow word A, an edge E(A, B) exists between A and B. A word may have several successors and several predecessors, and the graph they form is called a word graph.
In the embodiment of the application, the word graph is a graph formed by all output words and the sequence of the output words in the voice data to be recognized.
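A word graph of this kind can be represented as a weighted directed graph whose edges carry an output word and its loss. The sketch below is an assumed minimal representation in the style of fig. 3, not its exact content:

```python
# Minimal word-graph (lattice): node -> list of (next_node, word, loss)
# edges, where each loss is the acoustic loss plus the language loss on
# that jump. Nodes, words and numbers are illustrative assumptions.
word_graph = {
    0: [(1, "hi", 80.76), (2, "hey", 95.10)],
    1: [(3, "this", 16.80)],
    2: [(3, "this", 17.40)],
    3: [],  # terminal node
}

def successors(node):
    """All (next_node, word, loss) edges leaving a node."""
    return word_graph.get(node, [])

for nxt, word, loss in successors(0):
    print(nxt, word, loss)
```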
Step S130: and acquiring keywords from the first voice recognition result.
In the embodiment of the application, the keywords can be obtained from the first voice recognition result according to a preset rule, or vocabulary pre-designated by the user can be obtained from the first voice recognition result and used as keywords. The preset rule is a rule, set in advance, by which keywords can be determined, such as the relative word frequency method or the absolute word frequency method; the keywords so determined are output words that occur frequently in the first speech recognition result.
As one way, multiple keywords may be obtained from the first speech recognition result; when multiple keywords are obtained, a keyword list may be established to store them. Optionally, when the keywords are stored, the importance corresponding to each keyword may also be stored, and when the loss corresponding to the first speech recognition result is adjusted, the adjustment can be made according to the importance of the keywords. The importance of a keyword can be obtained through the speech recognition model or calculated according to a specified calculation rule. Specifically, the importance of a keyword may be determined from the frequency with which it occurs in the first speech recognition result, output directly by the speech recognition model, or preset by the user.
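A keyword list with per-keyword importance might be built as below; taking importance to be the word's relative frequency is only one of the options named above and is an assumption here:

```python
from collections import Counter

# Assumed: output words collected from the first recognition result.
output_words = ["number", "hi", "number", "this", "number", "hi"]

counts = Counter(output_words)
total = sum(counts.values())

# Keyword list: word -> importance, with importance taken here as the
# word's relative frequency (an assumed choice; it could also be model
# output or a user-set value).
keyword_list = {w: c / total for w, c in counts.items() if c >= 2}
print(keyword_list)  # -> {'number': 0.5, 'hi': 0.333...}
```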
Step S140: and adjusting the loss corresponding to the first voice recognition result based on the keyword to obtain a first voice recognition result after the loss is adjusted.
When the keywords have been determined by the above method and stored in the keyword list, the keyword list can be searched to determine whether the n best speech recognition results or the word graph included in the first voice recognition result contain any keyword stored in the keyword list; if so, the loss of the first voice recognition result is adjusted accordingly.
Furthermore, when the keywords are stored in the keyword list together with the importance of each keyword, the loss corresponding to the first voice recognition result can be adjusted according to the importance of the matched keyword. Specifically, the higher the importance of the matched keyword, the smaller the loss corresponding to the first speech recognition result is adjusted to be; that is, the higher the importance of the keyword, the larger the adjustment made to the loss corresponding to the first speech recognition result.
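A sketch of the importance-scaled adjustment; the proportional rule below is an assumed concrete choice, since the embodiment only requires that higher importance produce a larger reduction:

```python
def adjust_loss(loss, importance):
    # Assumed rule: a keyword of importance p removes a share p of the
    # loss, so more important keywords lower the loss further.
    return loss * (1.0 - importance)

print(adjust_loss(100.0, 0.50))  # -> 50.0
print(adjust_loss(100.0, 0.75))  # -> 25.0 (higher importance, lower loss)
```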
Step S150: And acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss is adjusted.
After the loss corresponding to the first voice recognition result has been adjusted by the above method, the first voice recognition result after the loss is adjusted is obtained, and a second voice recognition result with a smaller loss can then be acquired from it. Specifically, among the multiple speech recognition results included in the loss-adjusted first result, the one with the smallest loss may be used as the second speech recognition result; or, after the losses of the multiple results are sorted in ascending order, the result whose loss ranks at a preset position may be used as the second speech recognition result. The smaller the loss of a result within the first speech recognition result, the more likely it is to be used as the second speech recognition result.
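Selecting the second recognition result can then be as simple as taking the minimum-loss candidate; the candidates and losses below are assumed:

```python
# Loss-adjusted first recognition result: candidate text -> loss.
adjusted = {
    "hi this is my number": 150.3,
    "hi this is my lumber": 195.2,
    "high this is my number": 170.8,
}

second_result = min(adjusted, key=adjusted.get)
print(second_result)  # -> hi this is my number
```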
The application provides a voice recognition method: first, voice data to be recognized is acquired and recognized, and a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result are acquired; keywords are then obtained from the first voice recognition result, and the loss of the first voice recognition result is adjusted based on the keywords to obtain a first voice recognition result after the loss is adjusted; finally, a second voice recognition result corresponding to the voice data to be recognized is acquired from the first voice recognition result after the loss is adjusted. In this way, the keywords are acquired automatically from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, and the second voice recognition result corresponding to the voice data to be recognized is acquired from the first voice recognition result after the loss is adjusted, thereby improving the accuracy of recognizing the voice data to be recognized.
Referring to fig. 2, a voice recognition method provided by an embodiment of the present application is applied to an electronic device, and the method includes:
step S210: and acquiring voice data to be recognized.
Step S210 may be specifically explained with reference to the above embodiments, so that details are not repeated in this embodiment.
Step S220: and recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and loss corresponding to the first voice recognition result.
In the embodiment of the application, the first voice recognition result comprises a word graph, wherein the word graph comprises m output words.
If the first voice recognition result is a word graph, the loss corresponding to the first voice recognition result is the loss corresponding to each output word in the word graph. A concrete form of the word graph and its losses may be as shown in fig. 3: each output word and its loss are placed on the edge connecting adjacent nodes, where, for example, "hi" is an output word and "80.76" is the loss corresponding to "hi". The loss may include a language loss and an acoustic loss; 80.76 is the sum of the language loss corresponding to "hi", output by the pre-trained language model, and the acoustic loss corresponding to "hi", output by the pre-trained acoustic model.
Step S230: and obtaining the occurrence frequency of each output word in the m output words.
In the embodiment of the application, the number of output words included in the word graph is counted first, and then the number of occurrences of each output word is counted. Each time an output word appears in the word graph, its occurrence count is increased by 1.
Step S240: and determining keywords from the m output words based on the occurrence times of each output word.
In the embodiment of the present application, determining keywords from the m output words based on the number of occurrences of each output word includes: taking, as keywords, the output words among the m output words whose number of occurrences is greater than a first preset number of times.

Alternatively, the total number of occurrences of the m output words is obtained, and the output words among the m output words whose occurrence probability is greater than or equal to a first preset probability are taken as keywords, where the occurrence probability is the ratio of the number of occurrences of each output word to the total number of occurrences.

In the embodiment of the present application, the first preset number of times is a preset number of occurrences above which an output word is determined to be a keyword; the first preset probability is a preset occurrence probability above which an output word is determined to be a keyword.
As one way, keywords may be determined by absolute word frequency methods. Specifically, as can be seen from the above, the occurrence frequency of each of the m output words is obtained respectively, then it is determined whether the occurrence frequency of each output word is greater than a first preset frequency, if it is determined that the occurrence frequency of the output word in the m output words is greater than the first preset frequency, the output word whose occurrence frequency is greater than the first preset frequency is determined as a keyword, and then the determined keyword can be stored in the keyword list.
Alternatively, the keywords may be determined by a relative word frequency method. Specifically, as can be seen from the above, the occurrence frequency of each output word in the m output words is obtained respectively, and then the occurrence frequency of each output word is added to obtain the total occurrence frequency of the m output words, so that the occurrence probability of each output word can be determined by the ratio of the occurrence frequency of each output word to the total occurrence frequency of the m output words, and further, the occurrence probability of each output word can be compared with a first preset probability, and the output word with the occurrence probability greater than or equal to the first preset probability is determined as a keyword, and further, the determined keyword can be stored in the keyword list.
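Both selection rules reduce to a few lines; the sketch below implements the absolute method (count threshold) and the relative method (probability threshold) side by side, with the thresholds as assumed values:

```python
from collections import Counter

def keywords_absolute(words, first_preset_times=2):
    # Absolute word frequency: keep words occurring more than the
    # first preset number of times.
    counts = Counter(words)
    return {w for w, c in counts.items() if c > first_preset_times}

def keywords_relative(words, first_preset_probability=0.3):
    # Relative word frequency: keep words whose occurrence probability
    # (count / total count) reaches the first preset probability.
    counts = Counter(words)
    total = sum(counts.values())
    return {w for w, c in counts.items()
            if c / total >= first_preset_probability}

words = ["number", "hi", "number", "this", "number", "hi"]
print(keywords_absolute(words))  # -> {'number'}
print(keywords_relative(words))  # -> {'number', 'hi'} (order may vary)
```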
Similarly, when the determined keywords are stored, the importance of each keyword may be stored, and further, the loss of the word graph may be adjusted according to the importance of the keywords.
Step S250: And if the keyword is included in the word graph, adjusting the losses of the jumps before and after the keyword included in the word graph to a first loss value, so as to obtain the word graph after the loss is adjusted.
In the embodiment of the present application, the first loss value is a preset adjusted loss value, which may be a specific loss value. The loss of the jumps before and after the keyword can be understood as the losses corresponding to the two output words immediately preceding and following the keyword.

When it is determined by the above method that the keyword is included in the word graph, the losses corresponding to the two output words adjacent to the keyword are reduced, i.e., adjusted to the first loss value. The higher the importance of the keyword, the smaller the adjusted loss values of the two adjacent output words. Further, since the losses corresponding to the two output words adjacent to the keyword may differ, adjusting them to the first loss value can be understood as adjusting both losses by the same proportion. For example, the losses corresponding to the two output words adjacent to the keyword are each reduced by 20%.
Alternatively, in the embodiment of the present application, what is adjusted is the language loss of the output word.
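A sketch of this edge adjustment over a small word graph; the graph, the keyword set and the 20% reduction are assumptions, and for brevity each edge stores one combined loss, although per the note above the adjustment may target only the language component:

```python
# Reduce by 20% the loss on the edges immediately before and after any
# edge whose output word is a keyword. Edges are mutable lists
# [next_node, word, loss]; all values are illustrative assumptions.
word_graph = {
    0: [[1, "hi", 80.76]],
    1: [[2, "number", 34.56]],  # keyword edge
    2: [[3, "please", 40.00]],
}
keywords = {"number"}
REDUCTION = 0.20

for node, edges in word_graph.items():
    for nxt, word, _ in edges:
        if word in keywords:
            # Edge(s) before the keyword: every edge ending at `node`.
            for src_edges in word_graph.values():
                for e in src_edges:
                    if e[0] == node:
                        e[2] *= 1 - REDUCTION
            # Edge(s) after the keyword: every edge leaving `nxt`.
            for e in word_graph.get(nxt, []):
                e[2] *= 1 - REDUCTION

print(word_graph[0])  # "hi" loss 80.76 -> ~64.61
print(word_graph[2])  # "please" loss 40.0 -> 32.0
```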
Step S260: And acquiring a second voice recognition result corresponding to the voice data to be recognized from the word graph after the loss is adjusted.
When the losses of the jumps before and after the keywords in the word graph have been adjusted to the first loss value by the above method, the loss of each path in the word graph can be recalculated, and the one or more paths with the smallest loss can then be used as the final speech recognition result, i.e., the second speech recognition result. In the embodiment of the application, the loss of a path is calculated by adding up the losses corresponding to the output words along the path. For example, as shown in fig. 3, the loss corresponding to the path "0-1-4-10-15-27" is calculated by adding up the loss 80.76 corresponding to "hi", the loss 16.8 corresponding to "this", the loss 22.36 corresponding to "is", the loss 63.09 corresponding to "my" and the loss 34.56 corresponding to "number", i.e., "80.76+16.8+22.36+63.09+34.56 = 217.57", so the loss corresponding to the path "0-1-4-10-15-27" is 217.57.
After the losses of the paths in the word graph are recalculated, if the losses of the paths are the same and the losses are all the minimum, any one of the paths can be used as the final voice recognition result.
Alternatively, after the loss of each path in the word graph is calculated, the losses corresponding to each path are sorted according to the order from small to large, and then one or more paths with the losses arranged in front can be selected as the final speech recognition result.
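As a sketch of the recomputation, the paths of a small word graph can be enumerated and their edge losses summed, after which the minimum-loss path is taken as the final result (graph and numbers assumed):

```python
# Enumerate all paths through a word graph, summing edge losses; the
# smallest total becomes the second speech recognition result.
word_graph = {
    0: [(1, "hi", 80.76), (1, "high", 90.10)],
    1: [(2, "number", 34.56)],
    2: [],  # terminal node
}

def all_paths(node, words=(), loss=0.0):
    edges = word_graph.get(node, [])
    if not edges:
        yield words, loss
        return
    for nxt, word, edge_loss in edges:
        yield from all_paths(nxt, words + (word,), loss + edge_loss)

best_words, best_loss = min(all_paths(0), key=lambda p: p[1])
print(" ".join(best_words), best_loss)  # -> hi number ~115.32
```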
The application provides a voice recognition method: first, voice data to be recognized is acquired and recognized, and a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result are acquired; the number of occurrences of each of the m output words is then obtained, keywords are determined from the m output words based on the occurrences of each output word, and the losses of the jumps before and after the keywords included in the word graph are adjusted to a first loss value, giving the word graph after the loss is adjusted; finally, a second voice recognition result corresponding to the voice data to be recognized is acquired from the word graph after the loss is adjusted. In this way, the keywords are acquired automatically from the word graph, the loss of the word graph is adjusted according to the keywords, and the second voice recognition result corresponding to the voice data to be recognized is acquired from the word graph after the loss is adjusted, thereby improving the accuracy of recognizing the voice data to be recognized.
Referring to fig. 4, a voice recognition method provided by an embodiment of the present application is applied to an electronic device, and the method includes:
step S310: and acquiring voice data to be recognized.
Step S310 may be specifically explained with reference to the above embodiments, so that details are not repeated in this embodiment.
Step S320: and recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and loss corresponding to the first voice recognition result.
In the embodiment of the application, the first recognition result comprises n voice recognition results, and the n voice recognition results include p output words. The p output words are all the distinct output words appearing in the n voice recognition results.

Specifically, when the first speech recognition result consists of n speech recognition results, the n results may include one or more optimal speech recognition results, where the optimal results are those ranked first after the n results are sorted by output loss in ascending order (equivalently, those ranked last when sorted in descending order).
As one way, the loss corresponding to the first speech recognition result includes the loss corresponding to each of the n speech recognition results. That is, each speech recognition result corresponds to one loss.
Step S330: and obtaining the occurrence frequency of each output word in the p output words.
In the embodiment of the application, the occurrence times of the output words included in each of the n voice recognition results are counted in turn, and then the occurrence times of the output words included in each of the n voice recognition results are added up, so that the occurrence times of each of the p output words can be obtained.
Step S340: and determining keywords from the p output words based on the occurrence times of each output word.
In the embodiment of the present application, determining keywords from the p output words based on the number of occurrences of each output word includes: taking, as keywords, the output words among the p output words whose number of occurrences is greater than a second preset number of times.

Alternatively, the total number of occurrences of the p output words is obtained, and the output words among the p output words whose occurrence probability is greater than or equal to a second preset probability are taken as keywords, where the occurrence probability is the ratio of the number of occurrences of each output word to the total number of occurrences.

In the embodiment of the present application, the second preset number of times is a preset number of occurrences above which an output word is determined to be a keyword, and it may be the same as or different from the first preset number of times. The second preset probability is a preset occurrence probability above which an output word is determined to be a keyword; it may be the same as or different from the first preset probability and can be set according to actual requirements.
As one way, keywords may be determined by the absolute word frequency method. Specifically, as described above, the number of occurrences of each of the p output words is obtained; it is then determined whether the number of occurrences of each output word is greater than a second preset number of times, and the output words whose occurrences exceed that number are determined to be keywords and can be stored in the keyword list. In the embodiment of the application, each time an output word appears in the n speech recognition results, its occurrence count is increased by 1. For example, suppose the n speech recognition results comprise 3 results containing 5 distinct output words in total and, after counting, output word 1 occurs 5 times, output word 2 occurs 3 times, output word 3 occurs 7 times, output word 4 occurs 3 times and output word 5 occurs 6 times. With a second preset number of times of 5, the counts of output words 1 through 5 are compared with 5 in turn, and output word 3 and output word 5 are determined to be keywords.
Alternatively, the keywords may be determined by a relative word frequency method. Specifically, as can be seen from the above, the occurrence frequency of each output word in the p output words is obtained respectively, then the occurrence frequency of each output word is added to obtain the total occurrence frequency of the p output words, and further the occurrence probability of each output word can be determined by the ratio of the occurrence frequency of each output word to the total occurrence frequency of the p output words, and further the occurrence probability of each output word can be compared with a second preset probability, and the output word with the occurrence probability greater than or equal to the second preset probability is determined as a keyword, and further the determined keyword can be stored in the keyword list.
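The worked example above maps directly onto a few lines; the word names are placeholders:

```python
from collections import Counter

# Occurrence counts of the five output words across the n recognition
# results, matching the worked example above.
counts = Counter({"word1": 5, "word2": 3, "word3": 7, "word4": 3, "word5": 6})
second_preset_times = 5

keywords = {w for w, c in counts.items() if c > second_preset_times}
print(keywords)  # -> {'word3', 'word5'} (set order may vary)
```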
When the keywords are determined by the above method and stored in the keyword list, the importance of each keyword may be stored in the keyword list at the same time.
Step S350: and adjusting the loss of the voice recognition results including the keywords in the n voice recognition results to a second loss value to obtain n voice recognition results after the loss is adjusted.
In the embodiment of the present application, the second loss value is a preset adjusted loss value, which may be a specific loss value or a loss interval. After the keywords are determined in the above manner, whether each of the n speech recognition results includes a keyword can be determined by searching the keyword list. If a speech recognition result among the n results is determined to include a keyword, its loss is reduced, which raises the probability of that result being selected as the optimal recognition result; that is, its loss value is adjusted to a second loss value that is smaller than its original loss value.
As one way, when the loss corresponding to a speech recognition result including keywords among the n speech recognition results is adjusted, the loss value may be adjusted according to the importance of the keywords that the result includes: the higher the importance of the keyword, the smaller the loss value the result is adjusted to. For example, suppose a speech recognition result includes keyword 1 and the loss corresponding to that result is 80.4. If the importance of keyword 1 is 50%, the loss corresponding to the speech recognition result may be adjusted to 60.4; if the importance of keyword 1 is 75%, the loss may be adjusted to 30.4.

As another way, when the loss corresponding to a speech recognition result including keywords among the n speech recognition results is adjusted, the loss may be adjusted according to the number of keywords the result includes: the more keywords a speech recognition result includes, the smaller the loss value it is adjusted to.

Furthermore, the number of keywords and the importance of the keywords may be combined to adjust the loss of the speech recognition result. Specifically, a correspondence between the number of keywords, the importance of the keywords and the second loss value may be preset, and the magnitude of the adjustment to the loss corresponding to the speech recognition result can then be determined through this correspondence.
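One assumed way to combine the two signals, in which every matched keyword contributes a reduction proportional to its importance, so that both more keywords and higher importance lower the loss further:

```python
# Adjust an n-best hypothesis loss from the keywords it contains.
# The combination rule (summing matched importances, capped) is an
# illustrative assumption, as are the keywords and numbers.
keyword_importance = {"number": 0.75, "hi": 0.30}

def adjusted_loss(hypothesis, loss):
    matched = [keyword_importance[w]
               for w in hypothesis.split() if w in keyword_importance]
    reduction = min(sum(matched), 0.9)  # cap so the loss stays positive
    return loss * (1 - reduction)

print(adjusted_loss("hi this is my number", 80.4))  # two keywords -> ~8.04
print(adjusted_loss("this is my lumber", 80.4))     # no keyword -> 80.4
```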
Step S360: And acquiring a second voice recognition result corresponding to the voice data to be recognized from the n voice recognition results after the loss is adjusted.
After the losses corresponding to the speech recognition results that include keywords among the n speech recognition results have been adjusted, the order of the n speech recognition results can be rearranged according to the loss corresponding to each result, giving the n speech recognition results after the loss is adjusted; the speech recognition result whose loss value is smaller than a preset loss value among these n results is then used as the final output, i.e., the second voice recognition result. The preset loss value is a preset threshold below which a speech recognition result can be determined to be the final speech recognition result of the voice data to be recognized.
The application provides a voice recognition method: first, voice data to be recognized is acquired and recognized, and a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result are acquired; the number of occurrences of each of the p output words is then obtained, keywords are determined from the p output words based on the occurrences of each output word, and the losses of the speech recognition results that include keywords among the n speech recognition results are adjusted to a second loss value, giving the n speech recognition results after the loss is adjusted; finally, a second voice recognition result corresponding to the voice data to be recognized is acquired from the n speech recognition results after the loss is adjusted. In this way, the keywords are acquired automatically from the n speech recognition results, the losses of the n speech recognition results are adjusted according to the keywords, and the second voice recognition result corresponding to the voice data to be recognized is acquired from the n speech recognition results after the loss is adjusted, thereby improving the accuracy of recognizing the voice data to be recognized.
Referring to fig. 5, a voice recognition method provided by an embodiment of the present application is applied to an electronic device, and the method includes:
step S410: and acquiring voice data to be recognized.
In the embodiment of the present application, the voice data to be recognized may be voice data to be recognized, which is acquired in real time, or may be voice data to be recognized, which is acquired in advance from an external device. The external device may be an electronic device storing voice data, or may be an electronic device capable of generating voice data in real time.
In the embodiment of the application, the voice data to be recognized can be stored in advance in a storage area of the electronic device according to a certain rule, for example in a file named according to a specified rule, so that when the voice data to be recognized needs to be obtained, it can be retrieved from the storage area of the electronic device by its file name.

Of course, the voice data to be recognized may also be voice data transmitted by an external device. Specifically, when the electronic device needs to acquire the voice data to be recognized, it may send a data acquisition instruction to the external device, and the external device returns voice data to be recognized to the electronic device after receiving the instruction. Optionally, the voice data returned by the external device may be designated voice data or arbitrary voice data, depending on whether the data acquisition instruction received by the external device includes an identifier of the voice data (the identifier may be a serial number of the voice data to be recognized). If the data acquisition instruction includes an identifier, the external device returns the voice data corresponding to that identifier to the electronic device as the voice data to be recognized; if not, the external device returns arbitrary voice data to the electronic device as the voice data to be recognized.

When the external device returns voice data to be recognized to the electronic device, it can send the voice data whose generation time is earliest, according to the chronological order in which the voice data were generated. In this way, the problem that the earliest-generated voice data is never recognized because too much voice data is stored on the external device can be avoided.
Optionally, the voice data to be recognized is voice data in WAV format.
Step S420: inputting the voice data to be recognized into a voice recognition model, acquiring a voice recognition result output by the voice recognition model, and taking the voice recognition result as a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
In the embodiment of the application, the voice recognition model can be composed of an acoustic model and a language model, which correspond respectively to computing speech-to-syllable probabilities and syllable-to-word probabilities.
Alternatively, the speech recognition model may be composed of three parts, namely an acoustic model, a dictionary and a language model. Furthermore, the speech recognition model can be an end-to-end model, and the speech data can be converted into text data through the end-to-end model, so that the sequence conversion operation is simplified, and the training process is simplified. The sequence may include text, voice, image or video sequence data, among others.
After the voice data to be recognized is obtained, the voice data to be recognized is input into a voice recognition model, and then a voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the voice recognition result can be output through the voice recognition model. In the embodiment of the application, the loss can be calculated through a loss function in a voice recognition model.
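An assumed interface for such a model, returning each candidate with its two loss components; the hypotheses and numbers are illustrative only:

```python
# Assumed interface: the speech recognition model yields candidates with
# separate acoustic and language losses; the method uses their sum.
def speech_recognition_model(voice_data):
    # Stand-in for a trained acoustic model + language model.
    return [
        {"text": "hello, tomorrow", "acoustic_loss": 41.2, "language_loss": 8.5},
        {"text": "hello, computer", "acoustic_loss": 40.8, "language_loss": 19.2},
    ]

first_result = [(h["text"], h["acoustic_loss"] + h["language_loss"])
                for h in speech_recognition_model(b"wav-bytes")]
print(first_result)
```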
Step S430: and acquiring keywords from the first voice recognition result.
Step S440: and adjusting the loss corresponding to the first voice recognition result based on the keyword to obtain a first voice recognition result after the loss is adjusted.
Step S450: And acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss is adjusted.
The details of step S430, step S440 and step S450 can be specifically explained with reference to the above embodiments, so that the details are not repeated in this embodiment.
The application provides a voice recognition method: first, voice data to be recognized is acquired and input into a speech recognition model, and the speech recognition result output by the model is taken as the first voice recognition result corresponding to the voice data to be recognized, together with the loss corresponding to the first voice recognition result; keywords are then obtained from the first voice recognition result, and the loss corresponding to the first voice recognition result is adjusted based on the keywords to obtain a first voice recognition result after the loss is adjusted; a second voice recognition result corresponding to the voice data to be recognized is then acquired from the first voice recognition result after the loss is adjusted. In this way, the keywords are acquired automatically from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, and the second voice recognition result corresponding to the voice data to be recognized is acquired from the first voice recognition result after the loss is adjusted, thereby improving the accuracy of recognizing the voice data to be recognized.
Referring to fig. 6, a voice recognition apparatus 500 according to an embodiment of the present application is provided, where the apparatus 500 includes:
a data acquisition unit 510, configured to acquire voice data to be recognized.
The first result obtaining unit 520 is configured to recognize the voice data to be recognized, and obtain a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
As one way, the first result obtaining unit 520 is configured to input the voice data to be recognized into a voice recognition model, obtain a voice recognition result output by the voice recognition model, and use the voice recognition result as a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
A keyword obtaining unit 530, configured to obtain a keyword from the first speech recognition result.
As one way, the keyword obtaining unit 530 is configured to obtain the occurrence number of each of the m output words; and determining keywords from the m output words based on the occurrence times of each output word.
Specifically, the keyword obtaining unit 530 is configured to take, as keywords, the output words among the m output words whose number of occurrences is greater than a first preset number of times; or, to obtain the total number of occurrences of the m output words and take, as keywords, the output words among the m output words whose occurrence probability is greater than or equal to a first preset probability, where the occurrence probability is the ratio of the number of occurrences of each output word to the total number of occurrences.
As another way, the keyword obtaining unit 530 is further configured to obtain the occurrence number of each of the p output words; and determining keywords from the p output words based on the occurrence times of each output word.
Specifically, the keyword obtaining unit 530 is configured to take, as keywords, the output words among the p output words whose number of occurrences is greater than a second preset number of times; or, to obtain the total number of occurrences of the p output words and take, as keywords, the output words among the p output words whose occurrence probability is greater than or equal to a second preset probability, where the occurrence probability is the ratio of the number of occurrences of each output word to the total number of occurrences.
The loss adjustment unit 540 is configured to adjust a loss corresponding to the first speech recognition result based on the keyword, so as to obtain a first speech recognition result after the loss is adjusted.
As one way, the loss adjustment unit 540 is configured to adjust the losses of the jumps before and after the keywords included in the word graph to a first loss value, so as to obtain the word graph after the loss is adjusted.
Alternatively, the loss adjustment unit 540 is configured to adjust the loss of the speech recognition results including the keyword in the n speech recognition results to a second loss value, so as to obtain n speech recognition results after the loss is adjusted.
And a second result obtaining unit 550, configured to obtain a second speech recognition result corresponding to the speech data to be recognized from the first speech recognition result after the loss is adjusted.

As one way, the second result obtaining unit 550 is configured to obtain the second speech recognition result corresponding to the speech data to be recognized from the word graph after the loss is adjusted.

Alternatively, the second result obtaining unit 550 is configured to obtain the second speech recognition result corresponding to the speech data to be recognized from the n speech recognition results after the loss is adjusted.
It should be noted that, in the present application, the device embodiment and the foregoing method embodiment correspond to each other, and specific principles in the device embodiment may refer to the content in the foregoing method embodiment, which is not described herein again.
An electronic device according to the present application will be described with reference to fig. 7.
Referring to fig. 7, based on the above voice recognition method and apparatus, an embodiment of the present application further provides an electronic device 800 capable of performing the above voice recognition method. The electronic device 800 includes one or more processors 802 (only one is shown), a memory 804, and a network module 806 coupled to one another. The memory 804 stores a program capable of executing the content of the foregoing embodiments, and the processor 802 can execute the program stored in the memory 804.
The processor 802 may include one or more processing cores. The processor 802 connects the various parts of the electronic device 800 through various interfaces and lines, and performs the various functions of the electronic device 800 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 804 and invoking the data stored in the memory 804. Optionally, the processor 802 may be implemented in hardware in at least one of the forms of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 802 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, and application programs; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 802 and may instead be implemented by a separate communication chip.
The memory 804 may include random access memory (RAM) or read-only memory (ROM), and may be used to store instructions, programs, code, code sets, or instruction sets. The memory 804 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image displaying function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 800 in use (such as a phonebook, audio and video data, and chat records).
The network module 806 is configured to receive and transmit electromagnetic waves and to convert between electromagnetic waves and electrical signals, so as to communicate with a communication network or with other devices, such as an audio playback device. The network module 806 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a subscriber identity module (SIM) card, and memory. The network module 806 may communicate with various networks, such as the internet, an intranet, or a wireless network, or may communicate with other devices over a wireless network. The wireless network may be a cellular telephone network, a wireless local area network, or a metropolitan area network. For example, the network module 806 may exchange information with a base station.
Referring to fig. 8, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 900 stores program code that can be invoked by a processor to perform the methods described in the foregoing method embodiments.
The computer readable storage medium 900 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 900 is a non-transitory computer readable storage medium. The computer readable storage medium 900 has storage space for program code 910 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products, and the program code 910 may be compressed, for example, in a suitable form.
According to the voice recognition method, device, electronic equipment, and storage medium described above, voice data to be recognized is first acquired and recognized, yielding a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to that result. Keywords are then obtained from the first voice recognition result, and the loss of the first voice recognition result is adjusted based on the keywords to obtain the first voice recognition result after the loss adjustment. Finally, a second voice recognition result corresponding to the voice data to be recognized is acquired from the loss-adjusted first voice recognition result. In this way, the keywords are acquired automatically from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, and the second voice recognition result is acquired from the loss-adjusted first voice recognition result, thereby improving the accuracy of recognizing the voice data to be recognized.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, which are merely illustrative and not restrictive. Enlightened by the present invention, those of ordinary skill in the art may derive many further forms without departing from the spirit of the present invention and the scope of the claims, all of which fall within the protection of the present invention.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized;
recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result;
obtaining keywords from the first voice recognition result;
based on the keywords, adjusting the loss corresponding to the first voice recognition result to obtain a first voice recognition result after the loss is adjusted;
wherein the first voice recognition result includes a word graph, and the adjusting the loss corresponding to the first voice recognition result based on the keywords to obtain the first voice recognition result after the loss adjustment includes:
adjusting the losses of the front and back jumps of a keyword included in the word graph to a first loss value, so as to obtain the word graph after the loss adjustment, wherein the losses of the front and back jumps of the keyword are the losses corresponding to the two output words adjacent to the keyword before and after it, and the higher the importance of the keyword, the smaller the loss values to which the losses corresponding to those two adjacent output words are adjusted;
acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss adjustment;
wherein the acquiring the second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss adjustment includes:
recalculating the loss of each path in the word graph after the loss adjustment, and taking the one or more paths with the smallest loss as the second voice recognition result.
2. The method of claim 1, wherein the word graph includes m output words, and the obtaining keywords from the first voice recognition result includes:
acquiring the number of occurrences of each of the m output words;
and determining the keywords from the m output words based on the number of occurrences of each output word.
3. The method of claim 2, wherein the determining the keywords from the m output words based on the number of occurrences of each output word comprises:
taking, as keywords, the output words among the m output words whose number of occurrences is greater than a first preset number; or,
acquiring the total number of occurrences of the m output words;
and taking, as keywords, the output words among the m output words whose occurrence probability is greater than or equal to a first preset probability, wherein the occurrence probability is the ratio of the number of occurrences of each output word to the total number of occurrences.
4. The method of claim 1, wherein the first voice recognition result comprises n voice recognition results, and the adjusting the loss corresponding to the first voice recognition result based on the keywords to obtain the first voice recognition result after the loss adjustment includes:
adjusting the loss of each of the n voice recognition results that includes a keyword to a second loss value, so as to obtain the n voice recognition results after the loss adjustment, wherein the second loss value is smaller than the original loss value of the voice recognition result; the higher the importance of the keywords included in a voice recognition result, the smaller the loss value to which its loss is adjusted; and the more keywords a voice recognition result includes, the smaller the loss value to which its loss is adjusted;
wherein the acquiring the second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss adjustment includes:
taking, as the second voice recognition result, the voice recognition results among the n loss-adjusted voice recognition results whose loss values are smaller than a preset loss value.
5. The method of claim 4, wherein the n voice recognition results include p output words, and the obtaining keywords from the first voice recognition result comprises:
acquiring the number of occurrences of each of the p output words;
and determining the keywords from the p output words based on the number of occurrences of each output word.
6. The method of claim 5, wherein the determining the keywords from the p output words based on the number of occurrences of each output word comprises:
taking, as keywords, the output words among the p output words whose number of occurrences is greater than a second preset number; or,
acquiring the total number of occurrences of the p output words;
and taking, as keywords, the output words among the p output words whose occurrence probability is greater than or equal to a second preset probability, wherein the occurrence probability is the ratio of the number of occurrences of each output word to the total number of occurrences.
7. The method of claim 1, wherein the recognizing the voice data to be recognized, and acquiring the first voice recognition result corresponding to the voice data to be recognized and the loss corresponding to the first voice recognition result, comprises:
inputting the voice data to be recognized into a voice recognition model, and acquiring, from the output of the voice recognition model, the first voice recognition result corresponding to the voice data to be recognized and the loss corresponding to the first voice recognition result.
8. A speech recognition device, the device comprising:
the data acquisition unit is used for acquiring voice data to be recognized;
the first result acquisition unit is used for recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result;
a keyword obtaining unit, configured to obtain a keyword from the first speech recognition result;
the loss adjusting unit is used for adjusting the loss corresponding to the first voice recognition result based on the keywords, so as to obtain the first voice recognition result after the loss adjustment; the first voice recognition result includes a word graph, and the adjusting the loss corresponding to the first voice recognition result based on the keywords to obtain the first voice recognition result after the loss adjustment includes: adjusting the losses of the front and back jumps of a keyword included in the word graph to a first loss value, so as to obtain the word graph after the loss adjustment, wherein the losses of the front and back jumps of the keyword are the losses corresponding to the two output words adjacent to the keyword before and after it, and the higher the importance of the keyword, the smaller the loss values to which the losses corresponding to those two adjacent output words are adjusted;
a second result obtaining unit, used for acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss adjustment; the acquiring the second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss adjustment includes: recalculating the loss of each path in the word graph after the loss adjustment, and taking the one or more paths with the smallest loss as the second voice recognition result.
9. An electronic device, comprising one or more processors and a memory, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the method of any one of claims 1-7.
10. A computer readable storage medium, wherein the computer readable storage medium stores program code which, when executed by a processor, performs the method of any one of claims 1-7.
CN202110796768.4A 2021-07-14 2021-07-14 Speech recognition method, device, electronic equipment and storage medium Active CN113643706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110796768.4A CN113643706B (en) 2021-07-14 2021-07-14 Speech recognition method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113643706A CN113643706A (en) 2021-11-12
CN113643706B true CN113643706B (en) 2023-09-26

Family

ID=78417429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110796768.4A Active CN113643706B (en) 2021-07-14 2021-07-14 Speech recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113643706B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399005B (en) * 2022-03-10 2022-07-12 深圳市声扬科技有限公司 Training method, device, equipment and storage medium of living body detection model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106856092A (en) * 2015-12-09 2017-06-16 中国科学院声学研究所 Chinese speech keyword retrieval method based on feedforward neural network language model
CN110808032A (en) * 2019-09-20 2020-02-18 平安科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN111862943A (en) * 2019-04-30 2020-10-30 北京地平线机器人技术研发有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN112687266A (en) * 2020-12-22 2021-04-20 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN112925945A (en) * 2021-04-12 2021-06-08 平安科技(深圳)有限公司 Conference summary generation method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11170761B2 (en) * 2018-12-04 2021-11-09 Sorenson Ip Holdings, Llc Training of speech recognition systems




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant