CN115798462A - Voice recognition method and device, electronic equipment and chip - Google Patents

Voice recognition method and device, electronic equipment and chip

Info

Publication number
CN115798462A
CN115798462A (application CN202211395190.2A)
Authority
CN
China
Prior art keywords
characters
module
decoding
phoneme sequence
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211395190.2A
Other languages
Chinese (zh)
Inventor
姚人天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisoc Chongqing Technology Co Ltd
Original Assignee
Unisoc Chongqing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisoc Chongqing Technology Co Ltd
Priority to CN202211395190.2A
Publication of CN115798462A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a voice recognition method and apparatus, an electronic device, and a chip. The voice recognition method includes: extracting speech features from the speech to be recognized; sending the speech features to an acoustic model to output a primary phoneme sequence, where the acoustic model is trained with a complete pronunciation dictionary module in which each character and the phoneme sequence corresponding to each character are recorded; performing context transcription on the phonemes in the primary phoneme sequence to obtain a secondary phoneme sequence; performing pronunciation dictionary modeling on the secondary phoneme sequence with a small decoding dictionary submodule to obtain decoded characters, where the small decoding dictionary submodule is obtained from the complete pronunciation dictionary module by converting the non-keyword characters in the complete pronunciation dictionary into a designated character; and sending the decoded characters to the language model so that the language model processes them and outputs a recognition result. The method reduces the size of the language model while preserving the modeling precision of the acoustic model.

Description

Voice recognition method and device, electronic equipment and chip
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a chip.
Background
In phoneme-level keyword speech recognition, training the acoustic model with a complete pronunciation dictionary improves the modeling precision of the acoustic model but makes the language model used during decoding far too large. For example, in speech recognition with GMM and HMM-DNN systems (Gaussian mixture model and hidden Markov model-deep neural network systems), a DNN acoustic model models the complete phoneme set; the phoneme set in the complete pronunciation dictionary module consists of the phonemes in the phoneme sequences corresponding to the keyword characters and to a large number of non-keyword characters, and the language model is designed around both the keywords of the complete pronunciation dictionary module and the other characters. Although this scheme lets the acoustic model clearly model the audio features corresponding to each phoneme, with context transcription applied after modeling and the recognition result finally output through the language model, the large number of non-keyword characters in the language model makes it excessively bulky.
An alternative is to map all non-keyword characters to an "unknown" token: for example, in speech recognition with GMM and HMM-DNN systems, the acoustic model models only the phonemes corresponding to the keywords plus the "unknown" phonemes, and the language model is then designed around the keyword characters and "unknown". Although this reduces the size of the language model, it forces the pronunciation features of many different non-keyword characters to correspond to the same "unknown" phonemes. Because the pronunciation differences between characters are large, fitting the phonemes of all non-keyword characters to the "unknown" phonemes increases the difficulty of pronunciation dictionary modeling and lowers the modeling precision of the acoustic model.
Therefore, the tension between the modeling precision of the acoustic model and the size of the language model needs to be resolved.
Disclosure of Invention
To solve the above problems, the voice recognition method and apparatus, electronic device, and chip provided by the invention train the acoustic model with a complete pronunciation dictionary module and unify the non-keyword characters in a small decoding dictionary submodule. This removes the need to design language-model decoding paths for the non-keyword characters and thereby reduces the size of the language model while preserving the modeling precision of the acoustic model.
In a first aspect, the present invention provides a speech recognition method, including:
extracting voice features in the voice to be recognized;
sending the speech features to an acoustic model to output a primary phoneme sequence through the acoustic model, where the acoustic model is trained with a complete pronunciation dictionary module in which each character and the phoneme sequence corresponding to each character are recorded;
carrying out context transcription on phonemes in the primary phoneme sequence to obtain a secondary phoneme sequence;
the method further includes: performing pronunciation dictionary modeling on the secondary phoneme sequence with a small decoding dictionary submodule to obtain decoded characters, where the small decoding dictionary submodule is obtained from the complete pronunciation dictionary module by converting the non-keyword characters in the complete pronunciation dictionary into designated characters, the designated characters are fixed characters different from the keyword characters, and the decoded characters are designated characters and/or keyword characters;
and sending the decoded characters to a language model so that the language model processes the decoded characters and outputs a recognition result.
Optionally, the step of performing pronunciation dictionary modeling on the secondary phoneme sequence with the small decoding dictionary submodule includes:
querying, according to the secondary phoneme sequence, the phoneme sequence recorded in the small decoding dictionary submodule that matches the secondary phoneme sequence, and determining the characters in the small decoding dictionary submodule that correspond to the matched phoneme sequence as the decoded characters.
In a second aspect, the present invention provides a speech recognition apparatus comprising:
the extraction module is used for extracting the voice features in the voice to be recognized;
the acoustic module is used for outputting a primary phoneme sequence according to the speech features; the acoustic module includes an acoustic model, the acoustic model is trained with a complete pronunciation dictionary module, the acoustic model is used for modeling the speech features, and the complete pronunciation dictionary module records the phoneme sequence corresponding to each character;
the phoneme processing module is used for carrying out context transcription on phonemes in the primary phoneme sequence to obtain a secondary phoneme sequence;
the speech recognition apparatus further includes: a pronunciation dictionary conversion module, configured to perform pronunciation dictionary modeling on the secondary phoneme sequence with a small decoding dictionary submodule to obtain decoded characters, where the small decoding dictionary submodule is obtained from the complete pronunciation dictionary module by converting the non-keyword characters in the complete pronunciation dictionary into designated characters, the designated characters are fixed characters different from the keyword characters, and the decoded characters are designated characters and/or keyword characters;
the language model is used for processing the decoded characters and outputting a recognition result;
the output end of the extraction module is connected with the input end of the acoustic module, the output end of the acoustic module is connected with the input end of the phoneme processing module, the output end of the phoneme processing module is connected with the input end of the pronunciation dictionary conversion module, and the output end of the pronunciation dictionary conversion module is connected with the input end of the language model.
Optionally, the pronunciation dictionary conversion module includes: a decoding submodule and a small decoding dictionary submodule;
the decoding submodule is used for querying the phoneme sequence recorded in the small decoding dictionary submodule that matches the secondary phoneme sequence and determining the characters in the small decoding dictionary submodule that correspond to the secondary phoneme sequence as the decoded characters;
the small decoding dictionary submodule is connected with the decoding submodule, the input end of the decoding submodule is connected with the output end of the phoneme processing module, and the output end of the decoding submodule is connected with the input end of the language model.
In a third aspect, the present invention provides an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of the above.
In a fourth aspect, the present invention provides a chip, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above.
In a fifth aspect, the present invention provides a computer readable storage medium, wherein the computer readable storage medium stores computer instructions which, when executed by a processor, implement the method of any one of the above.
According to the voice recognition method and apparatus, electronic device, and chip provided by the embodiments of the invention, the acoustic model is trained with the complete pronunciation dictionary module and the non-keyword characters in the small decoding dictionary submodule are unified, which removes the need to design language-model decoding paths for the non-keyword characters and reduces the size of the language model while preserving the modeling precision of the acoustic model.
Drawings
To illustrate the technical solutions in the embodiments of the present application or in the conventional technology more clearly, the drawings used in the description of the embodiments or the conventional technology are briefly introduced below. The drawings described below are obviously only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a speech recognition system according to an embodiment of the present application;
fig. 3 is a schematic flow chart of a speech recognition method according to an embodiment of the present application.
Reference numerals
200. a voice recognition device; 210. an extraction module; 220. an acoustic module; 230. a phoneme processing module; 231. a hidden Markov submodule; 232. a context transcription submodule; 240. a pronunciation dictionary conversion module; 241. a decoding submodule; 242. a small decoding dictionary submodule; 250. a language model.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Embodiments of the present application are set forth in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Spatially relative terms, such as "under," "below," "beneath," "over," and the like, may be used herein to describe one element or feature's relationship to another element or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements or features described as "below" or "beneath" other elements or features would then be oriented "above" the other elements or features. Thus, the exemplary terms "under" and "beneath" can encompass both an orientation of above and below. In addition, the device may be otherwise oriented (e.g., rotated 90 degrees or at other orientations), and the spatially relative descriptors used herein should be interpreted accordingly.
It will be understood that when an element is referred to as being "fixedly attached" to another element, it can be directly attached to the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. In contrast, when an element is referred to as being "directly on" another element, there are no intervening elements present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
As used herein, the singular forms "a", "an" and "the" may include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises/comprising," "includes" or "including," etc., specify the presence of stated features, integers, steps, operations, components, parts, or combinations thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, components, parts, or combinations thereof.
In a first aspect, the present embodiment provides a speech recognition method, referring to fig. 1, the method includes steps S101 to S105:
step S101: and extracting the voice features in the voice to be recognized.
Specifically, when extracting the speech features, the speech signal formed by the speech to be recognized is first processed by DC removal, denoising, and similar steps; MFCCs (Mel-frequency cepstral coefficients), Fbank features (Mel filter-bank energies), LPCs (linear prediction coefficients), and the like are then extracted from it as the speech features input to the acoustic model.
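The front end of step S101 can be sketched as follows. This is a minimal illustration assuming the Python librosa library; the function name, the 16 kHz sampling rate, and the 25 ms window with a 10 ms hop are illustrative choices, not values fixed by this embodiment.

```python
# Minimal sketch of the feature-extraction front end (step S101),
# assuming librosa is available; all names here are illustrative.
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)  # load the speech to be recognized
    y = y - np.mean(y)                        # remove the DC component
    # (a denoising step would go here; omitted for brevity)
    # 25 ms windows with a 10 ms hop, conventional for ASR front ends
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T                             # one feature vector per frame
```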
Step S102: and sending the speech features to an acoustic model so that the acoustic module obtains a primary phoneme sequence through the acoustic model.
The acoustic model is trained with the complete pronunciation dictionary module, a large number of speech labels, and a large corpus. The complete pronunciation dictionary module is a standard file in which each character and the phoneme sequence corresponding to each character are recorded. A speech label is the character sequence corresponding to a speech segment; the character sequence is an arrangement of several characters, and the characters comprise keyword characters and non-keyword characters.
In this embodiment, the characters recorded in the complete pronunciation dictionary module are the commonly used characters of the corresponding language; rarely used characters need not be included in the complete pronunciation dictionary module, although they are not excluded. The acoustic model is a neural network model. In addition, the speech features sent to the acoustic model are time-ordered, and for each input set of speech features the acoustic model outputs multiple classification results; because there are many phonemes, each phoneme is also associated with a different probability.
Step S102 further includes: obtaining the primary phoneme sequence after the speech features are processed by the acoustic model and hidden Markov modeling is carried out by a hidden Markov submodule. After processing the speech features, the acoustic model provides a set of HMM observation probabilities; the hidden Markov submodule then performs hidden Markov modeling on the speech features according to these observation probabilities to output the primary phoneme sequence.
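A toy sketch of step S102 under strong simplifying assumptions: a stand-in "acoustic model" (a random linear layer plus softmax) produces per-frame phoneme posteriors, which play the role of the HMM observation probabilities, and a greedy frame-collapse stands in for the hidden Markov submodule. The phoneme set, shapes, and names are illustrative only.

```python
# Toy stand-in for the acoustic model + hidden Markov submodule (step S102).
import itertools
import numpy as np

PHONES = ["q", "ing", "d", "a", "k", "ai"]   # tiny example phoneme set

def acoustic_model(features):
    """Stand-in DNN: map (frames, dims) features to (frames, |PHONES|) posteriors."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((features.shape[1], len(PHONES)))
    logits = features @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # softmax: HMM observation probs

def primary_phoneme_sequence(features):
    posteriors = acoustic_model(features)
    frame_best = posteriors.argmax(axis=1)   # greedy stand-in for HMM decoding
    # collapse runs of identical frame labels into one phoneme each
    return [PHONES[k] for k, _ in itertools.groupby(frame_best)]
```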
Step S103: and carrying out context transcription on the phonemes in the primary phoneme sequence to obtain a secondary phoneme sequence.
Step S104: and performing pronunciation dictionary modeling on the secondary phoneme sequence by using a small decoding dictionary submodule to obtain decoded characters. The decoding small dictionary submodule is obtained by converting a non-keyword character in the complete pronunciation dictionary into a designated character by the complete pronunciation dictionary module, and the designated character is a fixed character different from a keyword character. It should be noted that, both the keyword characters and the non-keyword characters are manually set according to the application scenario, and this embodiment is not specifically limited; the characters in the complete pronunciation dictionary module belong to either keyword characters or non-keyword characters. Characters in the small decoding dictionary sub-module are not replaced by designated characters, namely the characters belong to keyword characters, and other non-keyword characters are replaced by the same designated characters.
The decoded characters are designated characters and/or keyword characters; that is, one phoneme sequence in the small decoding dictionary submodule may correspond to at least one keyword character, to at least one keyword character and at least one designated character, or to at least one designated character.
Specifically, a phoneme sequence in the complete pronunciation dictionary module does not necessarily correspond to only one character; for example, the characters corresponding to "d a" include "打", "大", and "搭". Suppose that in the small decoding dictionary submodule the characters corresponding to "d a" are "打" and the designated character, and only "打" is defined as a keyword character; the decoded characters corresponding to "d a" are then "打" and the designated character. If a phoneme sequence corresponds only to keyword characters in the small decoding dictionary, the resulting decoded characters contain only keyword characters; if it corresponds only to the designated character, the resulting decoded characters contain only the designated character. In addition, a segment of speech features corresponds to multiple characters, so the decoding performed by the small decoding dictionary submodule can produce a decoded result containing several characters.
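How the small decoding dictionary submodule can be derived from the complete pronunciation dictionary module might look as follows; this is a minimal sketch in which the dictionary contents, keyword set, and designated character are taken from the worked example that follows and are purely illustrative.

```python
# Sketch of deriving the small decoding dictionary from the complete
# pronunciation dictionary (step S104); contents are illustrative.
COMPLETE_DICT = {                 # phoneme sequence -> candidate characters
    ("q", "ing"): ["请"],
    ("d", "a"):   ["打", "大", "搭"],
    ("k", "ai"):  ["开"],
    ("zh", "i"):  ["智"],
    ("n", "eng"): ["能"],
    ("y", "in"):  ["音"],
    ("x", "i", "ang"): ["箱"],
}
KEYWORDS = {"打", "开", "音", "箱"}
DESIGNATED = "不"                 # fixed character distinct from all keywords

def build_small_dict(complete, keywords, designated):
    """Replace every non-keyword character with the designated character."""
    small = {}
    for phones, chars in complete.items():
        replaced = [c if c in keywords else designated for c in chars]
        # deduplicate while preserving order: non-keyword entries collapse
        small[phones] = list(dict.fromkeys(replaced))
    return small

SMALL_DICT = build_small_dict(COMPLETE_DICT, KEYWORDS, DESIGNATED)
# SMALL_DICT[("d", "a")] == ["打", "不"]
```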
In this embodiment, for ease of explanation, it is assumed that one phoneme sequence corresponds to one character. Further, the step of performing pronunciation dictionary modeling on the secondary phoneme sequence with the small decoding dictionary submodule includes: querying, according to the secondary phoneme sequence, the phoneme sequence recorded in the small decoding dictionary submodule that matches the secondary phoneme sequence, and determining the characters in the small decoding dictionary submodule that correspond to the matched phoneme sequence as the decoded characters.
For example, the complete pronunciation dictionary module records that the character corresponding to the phoneme sequence "q ing" is "请" (please); the character corresponding to "d a" is "打" (hit); the character corresponding to "k ai" is "开" (open); the character corresponding to "zh i" is "智" (smart); the character corresponding to "n eng" is "能" (can); the character corresponding to "y in" is "音" (sound); and the character corresponding to "x i ang" is "箱" (box). The characters "请", "智", and "能" are defined as non-keyword characters in the complete pronunciation dictionary module, and in this embodiment the designated character is "不" (not).
In the small decoding dictionary submodule, the character recorded for the phoneme sequence "q ing" is therefore "不"; the character corresponding to "d a" is "打"; the character corresponding to "k ai" is "开"; the character corresponding to "zh i" is "不"; the character corresponding to "n eng" is "不"; the character corresponding to "y in" is "音"; and the character corresponding to "x i ang" is "箱". The phoneme sequence recorded in the small decoding dictionary submodule that matches the secondary phoneme sequence can thus be queried, and the character corresponding to the matched phoneme sequence is determined as the decoded character.
For example, after pronunciation dictionary modeling, the secondary phoneme sequences "q ing", "d a", "k ai", "zh i", "n eng", "y in", and "x i ang" yield the corresponding decoded characters "不", "打", "开", "不", "不", "音", and "箱".
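Continuing the sketch above, pronunciation dictionary modeling then reduces to a lookup of each secondary phoneme sequence in SMALL_DICT, under the same one-character-per-phoneme-sequence assumption the text makes here:

```python
# Lookup-based pronunciation dictionary modeling over SMALL_DICT.
def decode(secondary_sequences, small_dict):
    decoded = []
    for phones in secondary_sequences:
        chars = small_dict.get(tuple(phones))
        if chars:                     # a matching entry is recorded
            decoded.append(chars[0])  # single-character assumption
    return decoded

utterance = [["q", "ing"], ["d", "a"], ["k", "ai"], ["zh", "i"],
             ["n", "eng"], ["y", "in"], ["x", "i", "ang"]]
print(decode(utterance, SMALL_DICT))  # ['不', '打', '开', '不', '不', '音', '箱']
```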
Step S105: and sending the decoded character to a language model so that the language model can understand the code character and output a recognition result.
Specifically, if the decoded characters sent to the language model are "不", "打", "开", "不", "不", "音", and "箱", then after matching, probability statistics, and other related steps, the language model outputs "不打开不不音箱" as the recognition result.
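A toy sketch of the language model stage (step S105) under strong assumptions: decoding paths exist only for the keyword characters and the designated character, and the "language model" is a hand-written bigram table used to score candidate character sequences. All probability values are invented for illustration.

```python
# Toy bigram "language model" over keyword characters and the designated
# character; probabilities are illustrative, not trained values.
import math

BIGRAM = {                       # P(next | prev), invented for illustration
    ("<s>", "不"): 0.5, ("不", "打"): 0.4, ("打", "开"): 0.9,
    ("开", "不"): 0.5, ("不", "不"): 0.3, ("不", "音"): 0.4,
    ("音", "箱"): 0.9,
}

def score(chars):
    logp, prev = 0.0, "<s>"
    for c in chars:
        logp += math.log(BIGRAM.get((prev, c), 1e-4))  # back-off floor
        prev = c
    return logp

# two candidate readings: "d a" decoded as the keyword "打" or as "不"
candidates = [list("不打开不不音箱"), list("不不开不不音箱")]
best = max(candidates, key=score)
print("".join(best))             # prints 不打开不不音箱
```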
In an optional embodiment, the operations performed before step S104 in the decoding process may be implemented with the speech recognition of GMM and HMM-DNN systems, which is not described in detail in this embodiment.
In this voice recognition method, the acoustic model is trained against the speech labels of the training corpus, and the complete pronunciation dictionary module built for training contains, as far as possible, every character appearing in the speech labels; it is therefore a large dictionary. Each speech segment thus has its own phoneme sequence and its own label sequence. During training, the acoustic model can fit the characteristics of each speech segment well and model each phoneme distinctly, which guarantees the modeling precision of the acoustic model. In the decoding stage, however, the method does not use the complete pronunciation dictionary module built for training directly; instead, all non-keyword characters in the complete pronunciation dictionary module are replaced with the same designated character, such as "不", i.e., the complete pronunciation dictionary module is modified to obtain the small decoding dictionary submodule. The number of distinct characters in the small decoding dictionary submodule then becomes extremely small, hence a small dictionary. After the decoding process is modeled with this small dictionary, only the keyword characters and the designated character remain, yet the many occurrences of the designated character in the small dictionary still each correspond to an independent phoneme sequence, so viewed over the whole method the modeling precision of the acoustic model is preserved and even improved to a certain extent. Moreover, in the language model designed for the decoding stage, decoding paths are designed only for the keyword characters and the designated character. The language model, which originally needed decoding paths for a great many characters, now needs paths for only a very small number of characters, namely the keyword characters and the designated character, so its size is greatly reduced.
The following illustrates how speech features flow through acoustic model modeling, hidden Markov modeling, context transcription, pronunciation dictionary modeling, and language model processing. Assume that a set of audio features is input to the acoustic module every 10 ms. From the first set of audio features, the acoustic module outputs the phonemes A1, B1, C1, and D1, which are parallel alternatives, each with its own probability value. By analogy, from the second set of audio features the acoustic module outputs the phonemes A2, B2, C2, and D2, and from the third set the phonemes A3, B3, C3, and D3. After 30 ms there are therefore 64 candidate primary phoneme sequences, such as A1, A2, A3; A1, A2, B3; A1, B2, C3; and B1, A2, B3. During context transcription, probabilities are superposed on these 64 primary phoneme sequences: if the probability that A2 follows A1 is high, the weights of the A1-A2 portions of sequences such as A1, A2, A3 and A1, A2, B3 are increased; if the probability that B3 follows A2 is low, the weight of B3 in the sequence A1, A2, B3 is reduced. According to the weight of each sequence, the sequence with the highest total weight, say A1, A2, A3, is selected as the secondary phoneme sequence. Pronunciation dictionary modeling is then performed on A1, A2, A3: the small decoding dictionary submodule is traversed with A1, A2, A3, and if a matching phoneme sequence is recorded there, the character corresponding to that phoneme sequence is output as the decoded character. After the language model receives the decoded characters output by the small decoding dictionary submodule, it matches them against the decoded characters output in the previous rounds of the time sequence and selects the character combination with the highest probability as the voice recognition result. The probabilities referenced in this process comprise: the observation probabilities output by the acoustic model, the transition probabilities given in hidden Markov modeling, the context probabilities given during context transcription, the pronunciation dictionary probabilities given during pronunciation dictionary modeling, and the language model probabilities given during language model processing. The context probabilities, pronunciation dictionary probabilities, and language model probabilities are all prior probabilities and are not described in detail here.
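The candidate-selection portion of this walk-through can be sketched as follows: with four candidate phonemes per 10 ms frame over three frames there are 4^3 = 64 candidate primary phoneme sequences, and context transcription re-weights them so that the highest-scoring one becomes the secondary phoneme sequence. All probability values are invented for illustration.

```python
# Toy re-weighting of the 64 candidate primary phoneme sequences.
import itertools
import math

FRAME_CANDS = [                        # (phoneme, acoustic probability) per frame
    [("A1", 0.6), ("B1", 0.2), ("C1", 0.1), ("D1", 0.1)],
    [("A2", 0.5), ("B2", 0.3), ("C2", 0.1), ("D2", 0.1)],
    [("A3", 0.4), ("B3", 0.3), ("C3", 0.2), ("D3", 0.1)],
]
CONTEXT = {("A1", "A2"): 0.8, ("A2", "A3"): 0.7, ("A2", "B3"): 0.1}

def sequence_score(seq_with_probs):
    logp = 0.0
    for i, (ph, p) in enumerate(seq_with_probs):
        logp += math.log(p)                            # acoustic weight
        if i > 0:                                      # context-transcription weight
            prev = seq_with_probs[i - 1][0]
            logp += math.log(CONTEXT.get((prev, ph), 0.05))
    return logp

candidates = list(itertools.product(*FRAME_CANDS))     # all 64 sequences
best = max(candidates, key=sequence_score)
print([ph for ph, _ in best])                          # prints ['A1', 'A2', 'A3']
```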
In addition, it should be noted that the acoustic model, hidden Markov modeling, context transcription, and language model processing involved in the invention are all existing techniques, and their specific processing procedures are not described in detail in this embodiment.
In a second aspect, the present embodiment provides a speech recognition apparatus 200, and referring to fig. 2, the speech recognition apparatus 200 includes:
an extraction module 210, configured to extract the speech features from the speech to be recognized;
an acoustic module 220, configured to output multiple groups of primary phoneme sequences according to the speech features; the acoustic module 220 includes an acoustic model, the acoustic model is trained with a complete pronunciation dictionary module, the acoustic model is used for modeling the speech features to provide HMM observation probabilities, and the complete pronunciation dictionary module records the phoneme sequence corresponding to each character;
a phoneme processing module 230, configured to perform context transcription on the phonemes in the primary phoneme sequences to obtain multiple groups of secondary phoneme sequences;
a pronunciation dictionary conversion module 240, configured to perform pronunciation dictionary modeling on the multiple groups of secondary phoneme sequences; in the pronunciation dictionary modeling, the non-keyword characters in the primary character sequence corresponding to each secondary phoneme sequence are replaced with a unique designated character different from the keyword characters, and a secondary character sequence is obtained after the pronunciation dictionary modeling;
and the language model 250 is used for processing the secondary character sequence and outputting a recognition result.
The output end of the extraction module 210 is connected to the input end of the acoustic module 220, the output end of the acoustic module 220 is connected to the input end of the phoneme processing module 230, the output end of the phoneme processing module 230 is connected to the input end of the pronunciation dictionary conversion module 240, and the output end of the pronunciation dictionary conversion module 240 is connected to the input end of the language model 250.
In an alternative embodiment, the acoustic module 220 further comprises: a hidden Markov submodule;
the hidden Markov submodule is used for performing hidden Markov modeling on the speech features according to the HMM observation probabilities output by the acoustic model to obtain the primary phoneme sequence.
The hidden Markov submodule is connected with the acoustic model.
In an alternative embodiment, the pronunciation dictionary conversion module 240 includes: a decoding submodule 241 and a small decoding dictionary submodule 242;
the decoding submodule 241 is configured to query the phoneme sequence recorded in the small decoding dictionary submodule 242 that matches the secondary phoneme sequence, and to determine the character corresponding to the matched phoneme sequence as the decoded character;
the small decoding dictionary submodule 242 is connected to the decoding submodule 241, the input end of the decoding submodule 241 is connected to the output end of the phoneme processing module 230, and the output end of the decoding submodule 241 is connected to the input end of the language model 250.
In a third aspect, the present embodiment provides a speech recognition method, referring to fig. 3, in combination with the speech recognition method of the first aspect and the speech recognition apparatus 200 of the second aspect, the speech recognition method of the third aspect includes steps S301 to S305:
step S301: the speech features in the speech to be recognized are extracted by the extraction module 210.
Step S302: the speech features are sent to the acoustic module 220 to obtain a sequence of primary phonemes through the acoustic module 220.
Step S303: the primary phoneme sequence is sent to the phoneme processing module 230 so that the phoneme processing module 230 outputs a secondary phoneme sequence.
Step S304: the secondary phoneme sequence is sent to a pronunciation dictionary conversion module 240, so that the decoding sub-module 241 replaces the non-keyword characters in the primary character sequence corresponding to the secondary phoneme sequence with unique designated characters different from the keyword characters according to a small decoding dictionary sub-module 242, and decoded characters are obtained.
Step S305: the decoded character is sent to the language model 250 so that the code character is understood at the language model 250 and the recognition result is output.
In a fourth aspect, the present embodiment provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of the above.
In a fifth aspect, the present embodiment provides a chip, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above.
In a sixth aspect, the present embodiment provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions, when executed by a processor, implement the method of any one of the above.
In the description herein, references to the description of "some embodiments," "other embodiments," "desired embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic depictions of the above terms do not necessarily refer to the same embodiment or example.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described, but as long as a combination of these technical features contains no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A speech recognition method comprising:
extracting voice features in the voice to be recognized;
sending the speech features to an acoustic model to obtain a primary phoneme sequence through the acoustic model;
carrying out context transcription on phonemes in the primary phoneme sequence to obtain a secondary phoneme sequence;
wherein the acoustic model is trained with a complete pronunciation dictionary module, and each character and the phoneme sequence corresponding to each character are recorded in the complete pronunciation dictionary module;
the method further comprises: performing pronunciation dictionary modeling on the secondary phoneme sequence with a small decoding dictionary submodule to obtain decoded characters, wherein the small decoding dictionary submodule is obtained from the complete pronunciation dictionary module by converting the non-keyword characters in the complete pronunciation dictionary into designated characters, the designated characters are fixed characters different from the keyword characters, and the decoded characters are designated characters and/or keyword characters;
and sending the decoded characters to a language model so that the language model processes the decoded characters and outputs a recognition result.
2. The method of claim 1, wherein the step of performing pronunciation dictionary modeling on the secondary phoneme sequence with the small decoding dictionary submodule comprises:
querying, according to the secondary phoneme sequence, the phoneme sequence recorded in the small decoding dictionary submodule that matches the secondary phoneme sequence, and determining the characters in the small decoding dictionary submodule that correspond to the matched phoneme sequence as the decoded characters.
3. A speech recognition apparatus comprising:
the extraction module is used for extracting the voice features in the voice to be recognized;
the acoustic module is used for outputting a primary phoneme sequence according to the speech features;
the phoneme processing module is used for carrying out context transcription on phonemes in the primary phoneme sequence to obtain a secondary phoneme sequence;
the method is characterized in that the acoustic module comprises an acoustic model, the acoustic model is obtained by training through a complete pronunciation dictionary module, the acoustic model is used for modeling voice features, and the complete pronunciation dictionary module is used for recording phoneme sequences corresponding to all characters;
the speech recognition apparatus further includes: a pronunciation dictionary conversion module, configured to perform pronunciation dictionary modeling on the secondary phoneme sequence by using a small decoding dictionary submodule to obtain decoded characters, where the small decoding dictionary submodule is obtained by converting, by the complete pronunciation dictionary module, non-keyword characters in the complete pronunciation dictionary by using designated characters, the designated characters are fixed characters different from the keyword characters, and the decoded characters are designated characters and/or keyword characters;
the language model is used for processing the decoded characters and outputting a recognition result;
the output end of the extraction module is connected with the input end of the acoustic module, the output end of the acoustic module is connected with the input end of the phoneme processing module, the output end of the phoneme processing module is connected with the input end of the pronunciation dictionary conversion module, and the output end of the pronunciation dictionary conversion module is connected with the input end of the language model.
4. The apparatus of claim 3, wherein the pronunciation dictionary conversion module comprises: a decoding submodule and a small decoding dictionary submodule;
the decoding submodule is used for querying the phoneme sequence recorded in the small decoding dictionary submodule that matches the secondary phoneme sequence and determining the characters corresponding to the matched phoneme sequence as the decoded characters;
the small decoding dictionary submodule is connected with the decoding submodule, the input end of the decoding submodule is connected with the output end of the phoneme processing module, and the output end of the decoding submodule is connected with the input end of the language model.
5. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 2.
6. A chip, wherein the chip comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 2.
7. A computer readable storage medium, wherein the computer readable storage medium stores computer instructions which, when executed by a processor, implement the method of any of claims 1-2.
CN202211395190.2A 2022-11-08 Voice recognition method and device, electronic equipment and chip (Pending) CN115798462A (en)

Priority Applications (1)

Application Number: CN202211395190.2A; Priority date: 2022-11-08; Filing date: 2022-11-08; Title: Voice recognition method and device, electronic equipment and chip

Publications (1)

Publication Number: CN115798462A; Publication Date: 2023-03-14

Family

ID=85436225

Family Applications (1)

Application Number: CN202211395190.2A; Status: Pending; Title: Voice recognition method and device, electronic equipment and chip

Country Status (1)

Country: CN; Document: CN (1) CN115798462A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination