CN112542162A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents

Voice recognition method and device, electronic equipment and readable storage medium

Info

Publication number
CN112542162A
Authority
CN
China
Prior art keywords
probability
candidate
candidate sentence
sentence
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011402934.XA
Other languages
Chinese (zh)
Other versions
CN112542162B (en)
Inventor
赖勇铨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Citic Bank Corp Ltd
Original Assignee
China Citic Bank Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Citic Bank Corp Ltd filed Critical China Citic Bank Corp Ltd
Priority to CN202011402934.XA priority Critical patent/CN112542162B/en
Publication of CN112542162A publication Critical patent/CN112542162A/en
Application granted granted Critical
Publication of CN112542162B publication Critical patent/CN112542162B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 - Probabilistic grammars, e.g. word n-grams
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 2015/081 - Search algorithms, e.g. Baum-Welch or Viterbi
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech recognition method and apparatus, an electronic device and a readable storage medium, applied to the technical field of speech recognition. In the method, a pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence produced by beam search is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular to a speech recognition method and apparatus, an electronic device, and a readable storage medium.
Background
Beam search is a breadth-first heuristic search algorithm used in path search. Suppose there are three nodes, each of which can take one of the values a, b or c; the possible paths then include aaa, aab, aac and so on, 27 in total. For the sake of efficiency and storage space, the beam search algorithm expands breadth-first while maintaining a candidate list whose capacity is at most w; w is also called the beam width.
For the above problem, let w = 2, i.e., after each search step the list keeps only the two most probable paths. A complete beam search then proceeds as follows. The first step considers a, b and c, selects the two with the highest probability (say, b and c), ranks them from high to low, and writes them into the list. The second step considers the six extensions ba, bb, bc, ca, cb and cc, selects the two with the highest probability (say, bc and ca), ranks them from high to low, and updates the list. The third step considers the six extensions bca, bcb, bcc, caa, cab and cac, and again keeps the two most probable (say, caa and cac). The search then ends, and caa and cac are output as the final beam search result.
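The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation; the step_prob function, which supplies the probability of extending a prefix with a symbol, is a hypothetical stand-in for the model scores.

    def beam_search(symbols, steps, step_prob, beam_width=2):
        """Breadth-first beam search: keep only the beam_width most
        probable prefixes after each expansion step."""
        beams = [("", 1.0)]  # (prefix, probability) pairs, best first
        for _ in range(steps):
            # Expand every surviving prefix by every symbol.
            candidates = [(prefix + s, p * step_prob(prefix, s))
                          for prefix, p in beams
                          for s in symbols]
            # Keep the beam_width most probable, ordered high to low.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

With symbols "abc", steps=3 and beam_width=2, and step probabilities matching the example above, this returns the two paths caa and cac.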
The probabilities of the combinations involved in the above calculation can be obtained from an n-gram language model. Taking a 2-gram model as an example, the frequency of each character combination of order 2 or less, counted from a large corpus, is used to represent the probability of that combination. Suppose there are three characters in total, a, b and c; by counting a large text corpus, the 2-gram model then stores probability values for the following combinations: a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc. Probability calculation during the search is performed by table lookup; for example, the probability of the combination abc is decomposed into the values for ab and bc, which are looked up and multiplied.
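The table lookup can be sketched as follows; the probability values in the tables are made-up numbers for illustration only.

    # Hypothetical 2-gram tables counted from a corpus.
    unigram = {"a": 0.3, "b": 0.4, "c": 0.3}  # P(first character)
    bigram = {"ab": 0.4, "bc": 0.5}           # P(second character | first)

    def sentence_prob_2gram(sentence):
        """P(abc) decomposes into P(a) * P(b|a) * P(c|b); each factor
        is obtained by table lookup."""
        p = unigram[sentence[0]]
        for i in range(1, len(sentence)):
            p *= bigram[sentence[i - 1] + sentence[i]]
        return p

    print(sentence_prob_2gram("abc"))  # 0.3 * 0.4 * 0.5 = 0.06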
Beam search improves the speech recognition result by means of an n-gram language model, and the n-gram is implemented by table lookup. In practical applications, for an audio input the beam search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences in the list are sorted by probability from high to low, where each probability value is obtained by weighting the probabilities from the acoustic model and the n-gram language model. The n-gram language model is a local model; its advantage is high efficiency, but it cannot achieve long-context understanding. Its disadvantage is that it is difficult for an n-gram model to model a long sentence: it cannot use the context of the whole sentence, and its understanding of context is not accurate enough.
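The patent does not spell out the weighting formula; a common choice, assumed here purely for illustration, is a log-linear combination of the two model scores.

    import math

    def combined_score(p_acoustic, p_lm, lm_weight=0.5):
        """Weighted fusion of acoustic-model and language-model
        probabilities; the weight value is an assumption, not taken
        from the patent."""
        return math.log(p_acoustic) + lm_weight * math.log(p_lm)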
Disclosure of Invention
The application provides a speech recognition method and apparatus, an electronic device and a readable storage medium, which break through the limitation of the n-gram model and make use of the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence produced by beam search is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
The technical scheme adopted by the application is as follows:
In a first aspect, a speech recognition method is provided, which includes:
acquiring a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list, wherein determining the probability of each candidate sentence in the candidate sentence list comprises: determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in it;
and reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list.
Optionally, determining the probability of occurrence of each character in any candidate sentence based on the pre-trained mask-based neural network model includes:
determining character types and probabilities of all positions based on a pre-trained mask-based neural network model;
determining the probability of each character in any candidate sentence based on the character category and the probability of each position;
determining the probability of any candidate sentence based on the probability of each character in any candidate sentence, comprising:
and taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, determining the character category and probability for each position based on the pre-trained mask-based neural network model includes:
erasing the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased;
and inputting the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
Optionally, the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
Optionally, the method further comprises:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a second aspect, a speech recognition apparatus is provided, including:
the acquisition module is used for acquiring a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
the determining module is used for determining the probability of each candidate sentence in the candidate sentence list; the determining module is specifically used for determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in it;
and the reordering module is used for reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence, to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
the first determining unit is used for determining the character type and probability of each position based on a pre-trained mask-based neural network model;
the second determining unit is used for determining the probability of each character in any candidate sentence based on the character type and the probability of each position;
and a unit for taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, the first determining unit is specifically configured to erase the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased, and to input the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
Optionally, the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
Optionally, the apparatus further comprises a module for taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when run on a computer, cause the computer to perform the speech recognition method shown in the first aspect.
Compared with the prior art, in which speech recognition via an n-gram language model in beam search has difficulty modeling long sentences, cannot use the context of the whole sentence, and is usually not accurate enough in understanding context, the present application acquires a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model and determining the probability of that candidate sentence from the probabilities of its characters; and reorders each candidate sentence in the candidate sentence list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an example of probability determination according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
An embodiment of the present application provides a speech recognition method, as shown in fig. 1, the method may include the following steps:
step S101, acquiring a candidate sentence list obtained by performing voice recognition on a target audio based on a cluster searching method, wherein the candidate sentence list comprises a plurality of candidate sentences;
in particular, in practical applications, for a target audio input, the bundle search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences of the list are sorted from high to low according to the probability, and the probability value is obtained by weighting the probability of each acoustic model and the probability of each n-gram language model. The n-gram language model belongs to a local model, and has the advantages of high efficiency and incapability of realizing long context understanding. The method has the disadvantages that the n-gram model is difficult to model a long sentence, cannot utilize context information of the whole sentence, and is not accurate enough for understanding the context.
For example, for the utterance "It suddenly rained today, but I forgot to take an umbrella when I went out", suppose the last characters were not recorded clearly enough because of noise or other reasons; the beam search algorithm may then produce the following output list, whose entries differ only in characters that sound alike in the original Chinese:
"It rained suddenly today, but I didn't bring three"
"It rained suddenly today, but I didn't stay away"
"It rained suddenly today, but I didn't have an umbrella"
The output sentences in the list differ only in their last characters. Because "take an umbrella" and "it rains" are six characters apart in the original Chinese, an n-gram language model that wanted to exploit this contextual relationship would need a context length of at least 10 (the six intervening characters plus the four characters of "it rains" and "take an umbrella"), i.e., a 10-order language model, which is clearly unrealistic (an ordinary n-gram language model reaches order 5 at most). How to break through this limitation of the n-gram model thus becomes the problem. It is precisely the pre-trained mask-based neural network language model that is used to reorder, and essentially correct, the beam search output list described above.
Step S102, determining the probability of each candidate sentence in the candidate sentence list, which comprises: determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in it.
Determining the probability of each character in a candidate sentence through the pre-trained mask-based neural network model breaks through the limitation of the n-gram model, which cannot use the context of the whole sentence and whose understanding of context is not accurate enough.
Specifically, the pre-trained mask-based neural network model of the present application may be a masked language model (Mask LM) or any other model that implements the functions of the present application. Training a Mask LM is similar to a fill-in-the-blank exercise on text: an input sentence I containing blanks is processed by a deep network, which outputs a sentence P, and the goal is to make P as close as possible to the true sentence T. For example:
I: If so, you only need to right-click each [ ] letter and select "[ ]"
P: If so, you only need to right-click each drive letter and select "format"
T: If so, you only need to right-click each drive letter and select "format"
Through training, the Mask LM can predict the character at any position. The network implementing the Mask LM is not limited to time-series models such as RNN/GRU/LSTM or their attention-based improved versions; Transformer network structures (BERT, GPT) may also be used.
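As a concrete illustration of mask prediction, the sketch below assumes the HuggingFace transformers library and a Chinese BERT checkpoint; neither is prescribed by the patent.

    from transformers import pipeline

    # Fill-mask pipeline with a masked language model; the checkpoint
    # name is an example choice, not specified by the patent.
    fill_mask = pipeline("fill-mask", model="bert-base-chinese")

    # Predict the erased character and its probability for each candidate.
    for prediction in fill_mask("我爱[MASK]国"):
        print(prediction["token_str"], prediction["score"])

A well-trained model should rank "中" near the top here, mirroring the Fig. 3 example in which the probability of "中" in "我爱中国" is high.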
Step S103, based on the determined probability of each candidate sentence in the candidate sentence list, reordering each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list.
Illustratively, the probabilities of the sentences "It rained suddenly today, but I didn't bring three", "It rained suddenly today, but I didn't stay away" and "It rained suddenly today, but I didn't have an umbrella" are obtained respectively; the sentences can then be reordered according to the magnitudes of these probability values, so that the resulting target candidate sentence list is more accurate.
The embodiment of the present application provides a possible implementation. Specifically, determining the probability of occurrence of each character in any candidate sentence based on the pre-trained mask-based neural network model includes:
determining the character category and probability for each position based on the pre-trained mask-based neural network model. Specifically, the character at any position is erased by means of a mask, to obtain the candidate sentence with the character at that position erased; the candidate sentence with the erased character is then input into the pre-trained mask-based neural network model, to obtain the character category and probability for that position. Specifically, the last layer of the pre-trained mask-based neural network model is a softmax activation function, which classifies the character corresponding to the masked (erased) position. Illustratively, for "我爱-国" ("I love - country") in Fig. 3, the character corresponding to the "-" position may be "中", "美", "大", and so on.
Determining the probability of each character in any candidate sentence based on the character category and the probability of each position;
determining the probability of any candidate sentence based on the probability of each character in any candidate sentence, comprising:
and taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
The embodiment of the application provides a possible implementation manner, and specifically, the pre-trained mask-based neural network model is a time sequence-based neural network model.
Specifically, each character in the sentence is erased in turn, i.e., replaced with a "-" (or another placeholder symbol) by way of a mask. The Mask LM then predicts all possible characters for that position (including the actual character) and their probability values, and the probability corresponding to the actual character at that position is taken as the score of that position in the sentence. The final total score is the product of the score values (probability values) over all positions.
For example, to calculate the probability of the sentence "it still looks very stylish when worn" (an eight-character sentence in the original Chinese), the following decomposition is performed, masking one position at a time and scoring the actual character at that position:
position 1 masked: 0.1
position 2 masked: 0.2
position 3 masked: 0.3
position 4 masked: 0.6
position 5 masked: 0.7
position 6 masked: 0.2
position 7 masked: 0.3
position 8 masked: 0.5
For each position of the sentence, the probability value of the actual character is calculated (the number after the colon), and the probability value of the sentence is finally obtained by multiplying all of these values together. P(s) is used to denote the probability value of the sentence:
P(s) = P_1 × P_2 × ... × P_n = ∏_{i=1}^{n} P_i
where n is the length of the sentence, i is a position, and P_i is the probability value of the character at position i; the value of P_i is predicted by the language model from the characters at the other positions.
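Plugging in the per-position values from the worked example above gives the sentence probability; this is a minimal sketch, with log-space used only for numerical stability.

    import math

    # Per-position probabilities P_i from the worked example above.
    position_probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.2, 0.3, 0.5]

    # P(s) is the product of P_i over all positions.
    log_p = sum(math.log(p) for p in position_probs)
    print(math.exp(log_p))  # approximately 7.56e-05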
Fig. 3 shows how P_i is calculated for the sentence "我爱中国" ("I love China"): the probability of the character "中" in this sentence is 0.99, i.e., the probability value output by the final softmax layer. If the sentence were "我爱你国" ("I love you country"), the probability value of "你" would be 0.0001.
To reorder the sentences, the probabilities P(S1), P(S2), ..., P(Sm) are calculated respectively for the sentence list S1, S2, S3, ..., Sm obtained by the beam search.
Finally, the sentences are sorted by probability value from largest to smallest to obtain the reordered sentence list, as sketched below.
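The reranking step then reduces to a sort by the masked-LM sentence probability; in this minimal sketch, sentence_prob stands for the P(s) computation described above.

    def rerank(candidates, sentence_prob):
        """Sort the beam-search candidate sentences by masked-LM
        probability, from largest to smallest."""
        return sorted(candidates, key=sentence_prob, reverse=True)

    # The first element of the reranked list is then taken as the final
    # speech recognition result for the target audio, for example:
    # best = rerank(candidate_list, sentence_prob)[0]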
The embodiment of the present application provides a possible implementation manner, and specifically, the method further includes:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
Compared with the prior art, in which speech recognition via an n-gram language model in beam search has difficulty modeling long sentences, cannot use the context of the whole sentence, and is not accurate enough in understanding context, the method of the present application acquires a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model and determining the probability of that candidate sentence from the probabilities of its characters; and reorders each candidate sentence in the candidate sentence list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
Example two
Fig. 2 is a speech recognition apparatus according to an embodiment of the present application, where the apparatus 20 includes: an acquisition module 201, a determination module 202, a reordering module 203, wherein,
an obtaining module 201, configured to obtain a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, where the candidate sentence list includes a plurality of candidate sentences;
a determining module 202, configured to determine a probability of each candidate sentence in the candidate sentence list; the determining module is specifically used for determining the probability of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of any candidate sentence based on the probability of each character in any candidate sentence;
and a reordering module 203, configured to reorder each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence, to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
the first determining unit is used for determining the character type and probability of each position based on a pre-trained mask-based neural network model;
the second determining unit is used for determining the probability of each character in any candidate sentence based on the character type and the probability of each position;
and a unit for taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, the first determining unit is specifically configured to erase the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased, and to input the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
Optionally, the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
Optionally, the apparatus 20 further includes a module configured to use the target candidate sentence with the highest probability value in the reordered target candidate sentence list as a speech recognition result of the target audio.
Compared with the prior art, in which speech recognition via an n-gram language model in beam search has difficulty modeling long sentences, cannot use the context of the whole sentence, and is not accurate enough in understanding context, the speech recognition apparatus of the present application acquires a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model and determining the probability of that candidate sentence from the probabilities of its characters; and reorders each candidate sentence in the candidate sentence list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
The apparatus of the embodiment of the present application can execute the method shown in the first embodiment of the present application, and the implementation effect is similar, which is not described herein again.
EXAMPLE III
An embodiment of the present application provides an electronic device. As shown in fig. 4, the electronic device 40 includes: a processor 401 and a memory 403, where the processor 401 is coupled to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that the number of transceivers 404 is not limited to one in practical applications, and the structure of the electronic device 40 does not limit the embodiments of the present application. The processor 401 is used in the embodiment of the present application to implement the functions of the modules shown in fig. 2. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 402 may include a path that transfers information between the above components. The bus 402 may be a PCI bus or an EISA bus, etc. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 401. The processor 401 is configured to execute application program code stored in the memory 403 to implement the functions of the apparatus provided by the embodiment shown in fig. 2.
The embodiment of the present application provides an electronic device suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
Example four
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the speech recognition method shown in the above embodiment.
The embodiment of the present application provides a computer-readable storage medium suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict ordering restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list, wherein the determining the probability of each candidate sentence in the candidate sentence list comprises: determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in the candidate sentence;
and reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list, to obtain a reordered target candidate sentence list.
2. The method of claim 1, wherein the determining the probability of occurrence of each character in any candidate sentence based on the pre-trained mask-based neural network model comprises:
determining the character category and probability for each position based on the pre-trained mask-based neural network model;
and determining the probability of occurrence of each character in the candidate sentence based on the character category and probability for each position;
and wherein the determining the probability of any candidate sentence based on the probability of occurrence of each character in the candidate sentence comprises:
taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
3. The method of claim 2, wherein the determining the character category and probability for each position based on the pre-trained mask-based neural network model comprises:
erasing the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased;
and inputting the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
4. The method of claim 3, wherein the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
5. The method of any of claims 1-4, wherein the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
6. The method of claim 5, further comprising:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
7. A speech recognition apparatus, comprising:
an acquisition module, configured to acquire a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
a determining module, configured to determine the probability of each candidate sentence in the candidate sentence list, wherein the determining module is specifically configured to determine, based on a pre-trained mask-based neural network model, the probability of occurrence of each character in any candidate sentence, and determine, based on the probability of occurrence of each character in the candidate sentence, the probability of the candidate sentence;
and a reordering module, configured to reorder each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence, to obtain a reordered target candidate sentence list.
8. The apparatus of claim 7, wherein the determining module comprises:
a first determining unit, configured to determine the character category and probability for each position based on a pre-trained mask-based neural network model;
a second determining unit, configured to determine, based on the character category and probability for each position, the probability of occurrence of each character in any candidate sentence;
and a unit configured to take the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method of any one of claims 1 to 6.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the speech recognition method of any of claims 1 to 6.
CN202011402934.XA 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium Active CN112542162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402934.XA CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402934.XA CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112542162A true CN112542162A (en) 2021-03-23
CN112542162B CN112542162B (en) 2023-07-21

Family

ID=75015789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402934.XA Active CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112542162B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011075602A (en) * 2009-09-29 2011-04-14 Brother Industries Ltd Device, method and program for speech recognition
US20190139540A1 (en) * 2016-06-09 2019-05-09 National Institute Of Information And Communications Technology Speech recognition device and computer program
CN110517693A (en) * 2019-08-01 2019-11-29 出门问问(苏州)信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111933129A (en) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment
CN112017645A (en) * 2020-08-31 2020-12-01 广州市百果园信息技术有限公司 Voice recognition method and device

Also Published As

Publication number Publication date
CN112542162B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN111145718B (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN102640089B (en) The text input system of electronic equipment and text entry method
US7412093B2 (en) Hybrid apparatus for recognizing answer type
EP1619620A1 (en) Adaptation of Exponential Models
US20130041857A1 (en) System and method for inputting text into electronic devices
US20170061958A1 (en) Method and apparatus for improving a neural network language model, and speech recognition method and apparatus
EP1696422A2 (en) Method for converting phonemes to written text and corresponding computer system and computer program
CN103854643A (en) Method and apparatus for speech synthesis
US5553284A (en) Method for indexing and searching handwritten documents in a database
US8019593B2 (en) Method and apparatus for generating features through logical and functional operations
CN114780691B (en) Model pre-training and natural language processing method, device, equipment and storage medium
JP2002082689A (en) Recognition system using lexical tree
CN111444719A (en) Entity identification method and device and computing equipment
CN109284358B (en) Chinese address noun hierarchical method and device
CN112185361B (en) Voice recognition model training method and device, electronic equipment and storage medium
CN113238797A (en) Code feature extraction method and system based on hierarchical comparison learning
CN117709355B (en) Method, device and medium for improving training effect of large language model
CN111428487B (en) Model training method, lyric generation method, device, electronic equipment and medium
KR100542757B1 (en) Automatic expansion Method and Device for Foreign language transliteration
JP6605997B2 (en) Learning device, learning method and program
CN111026848B (en) Chinese word vector generation method based on similar context and reinforcement learning
CN117112916A (en) Financial information query method, device and storage medium based on Internet of vehicles
CN112632956A (en) Text matching method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant