CN115394297A - Voice recognition method and device, electronic equipment and storage medium

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN115394297A
Authority
CN
China
Prior art keywords
text data
alignment
acoustic model
acoustic
preset
Prior art date
Legal status
Pending
Application number
CN202210882990.0A
Other languages
Chinese (zh)
Inventor
郑晓明
李健
陈明
武卫东
Current Assignee
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202210882990.0A priority Critical patent/CN115394297A/en
Publication of CN115394297A publication Critical patent/CN115394297A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of voice recognition and provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium. The method comprises the following steps: acquiring target audio data; classifying and recognizing the target audio data to obtain first text data; aligning the first text data through a preset acoustic model to obtain an alignment result, wherein the alignment result comprises timestamp information; and generating target text data according to the alignment result. By adding an acoustic model for alignment, that is, by aligning the recognized result with a small conventional acoustic model, accurate time points can be obtained in the text data produced by the alignment processing, compared with existing voice recognition models.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
Automatic Speech Recognition (ASR) is a technology that studies how to convert human speech into text, and can be applied to services such as voice dialing, voice navigation, indoor device control, voice document retrieval and simple dictation data entry. To perform speech recognition, the prior art generally uses Connectionist Temporal Classification (CTC). However, CTC is a sequence recognition method: although a representative frame of each recognized word can be used as the time point of that word, an accurate time point cannot be given, so an accurate timestamp cannot be output together with the text data of speech recognition in practice.
Disclosure of Invention
To overcome the problems in the related art, the present invention provides a speech recognition method, apparatus, electronic device and storage medium.
According to a first aspect of embodiments of the present invention, there is provided a speech recognition method, the method including:
acquiring target audio data;
classifying and identifying the target audio data to obtain first text data;
aligning the first text data through a preset acoustic model to obtain an alignment result, wherein the alignment result comprises timestamp information;
and generating target text data according to the alignment result.
Optionally, the aligning the first text data through a preset acoustic model to obtain an alignment result includes:
acquiring acoustic characteristic information corresponding to the first text data;
and inputting the first text data and the acoustic characteristic information into a preset acoustic model to obtain the alignment result.
Optionally, the inputting the first text data and the acoustic feature information into an acoustic model includes:
and inputting the first text data and the acoustic characteristic information into a preset acoustic model for hard alignment processing.
Optionally, the classifying and identifying the target audio data to obtain first text data includes:
and inputting the target audio data into a preset voice recognition model for classification recognition, and outputting first text data.
Optionally, the inputting the first text data and the acoustic feature information into a preset acoustic model for hard alignment processing includes:
inputting the first text data and the acoustic feature information into an acoustic model for feature comparison processing to generate timestamp information;
and carrying out one-to-one correspondence on the timestamp information and each text data in the first text data.
According to a second aspect of embodiments of the present invention, there is provided a speech recognition apparatus, the apparatus including:
the acquisition module is used for acquiring target audio data;
the identification module is used for carrying out classification identification on the target audio data to obtain first text data;
the alignment module is used for aligning the first text data through a preset acoustic model to obtain an alignment result, and the alignment result comprises timestamp information;
and the output module is used for generating target text data according to the alignment result.
Optionally, the alignment module includes:
the acquisition unit is used for acquiring acoustic characteristic information corresponding to the first text data;
and the alignment unit is used for inputting the first text data and the acoustic characteristic information into an acoustic model to obtain the alignment result.
The alignment unit includes:
and the alignment subunit is used for inputting the first text data and the acoustic feature information into a preset acoustic model to perform hard alignment processing.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition method according to the first aspect of the embodiment of the present application.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech recognition method according to the first aspect of the embodiments of the present application.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the invention can obtain the target audio data; classifying and identifying the target audio data to obtain first text data; aligning the first text data through a preset acoustic model to obtain an alignment result; and generating target text data according to the alignment result. According to the method, the acoustic model is added for alignment, namely, the recognized result is aligned by using a small traditional acoustic model, and compared with the existing voice recognition model, the accurate time point can be obtained in the text data obtained by performing alignment processing through the preset processing model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a method of speech recognition provided by an embodiment of the present application;
FIG. 2 is a diagram illustrating pronunciation waveforms in the method of speech recognition provided by the embodiment of the present application shown in FIG. 1;
FIG. 3 is a diagram illustrating predicted pronunciation waveforms in the method of speech recognition provided by the embodiment of the present application shown in FIG. 1;
FIG. 4 is a second flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 5 is a third flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for speech recognition provided by an embodiment of the present application;
fig. 7 is a block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
It should be noted that CTC is used to solve the problem that input sequences and output sequences are difficult to put into one-to-one correspondence; it aims to learn sequence data directly, without labeling the mapping relationship between input sequences and output sequences in the training data in advance, so as to obtain better results in sequence learning tasks such as speech recognition. In general CTC recognition, although a representative frame of each recognized word can be used as the time point of the word, CTC is a sequence recognition method and cannot give an accurate time point, so the result is often not very accurate in practice: the representative frame is only the frame that best represents a phoneme, and is not necessarily the start time of the phoneme. It should also be noted that a phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme. Phonemes are classified into two categories, vowels and consonants. For example, in Chinese, the syllable "o" of the character meaning "oh" has only one phoneme, the syllable "ai" of the character meaning "love" has two phonemes, and the syllable "dai" of the character meaning "generation" has three phonemes.
Therefore, in some application scenarios, in addition to outputting the voice recognition result, it is also necessary to output timestamp information for each word in the recognition result, i.e., the start and end times of each word. The voice recognition method in the embodiment of the present application can output accurate time point information when performing voice recognition. Fig. 1 is a flowchart of a voice recognition method according to an exemplary embodiment; as shown in fig. 1, the method includes the following steps:
Step 101, target audio data is acquired.
It should be noted that speech recognition starts from collected audio data; sound may be collected by any recording device or sound collection device, but for a computer the collected sound data must first be converted to obtain the target audio data.
Specifically, the collected voice data may be subjected to front-end processing (preprocessing): after the voice to be recognized is input, some optimization processing needs to be performed on it. For example, if the audio contains silence, the silent portions need to be cut off so that recognition can proceed more readily; a Voice Activity Detection (VAD) technique can be used to detect the audio segments that contain sound information and cut off the silent portions. It should be noted that silence detection may set a silence duration threshold, and whether a segment counts as silence, and where to cut, is determined according to that duration. After preprocessing is finished, acoustic feature parameters need to be extracted from the audio. Feature extraction mainly captures the characteristics of the audio in parameter form and turns them into speech feature vectors that a computer can process, so that the computer can conveniently understand, record and compare them. The feature parameters of different audio recordings are essentially different, while the features of the same speech in different timbres are likely to be close. Commonly used feature extraction parameters can be obtained by a variety of methods, which this application does not specifically limit. In addition, some audio contains noise, and noise reduction processing is required so that subsequent task flows can proceed better.
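For illustration, a minimal preprocessing sketch in Python is given below, assuming the librosa library for audio loading, silence trimming and MFCC extraction; the sampling rate, silence threshold and frame parameters are illustrative assumptions, not values prescribed by this application.

```python
# Illustrative preprocessing sketch (assumes librosa; parameter values are examples).
import numpy as np
import librosa

def preprocess(path, sr=16000, top_db=30):
    """Load audio, drop silent regions (simple VAD), and extract frame-level features."""
    audio, sr = librosa.load(path, sr=sr)

    # Energy-based voice activity detection: keep only the voiced intervals.
    intervals = librosa.effects.split(audio, top_db=top_db)
    voiced = np.concatenate([audio[s:e] for s, e in intervals]) if len(intervals) else audio

    # Turn the audio into feature vectors a computer can process, e.g. MFCCs
    # (13 coefficients per frame, 25 ms window, 10 ms shift).
    feats = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=13,
                                 n_fft=int(0.025 * sr),
                                 hop_length=int(0.010 * sr))
    return feats.T  # shape: (num_frames, 13)
```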
Step 102, classifying and identifying the target audio data to obtain first text data.
After the target audio data is acquired in step 101, it is subjected to classification recognition processing to obtain the first text data; specifically, CTC speech recognition is performed on the target audio data. For example, if the pronunciation of a sentence is "ni hao" ("hello"), the result of CTC speech recognition is also "ni hao". It should be noted that, in conventional acoustic model training for speech recognition, the label corresponding to each frame of data must be known for effective training, so alignment preprocessing of the speech is required before training. The alignment process itself needs to be iterated many times to ensure that the alignment is sufficiently accurate, which is time-consuming. Fig. 2 is a schematic diagram of the pronunciation waveform in the speech recognition method of the embodiment shown in fig. 1; as shown in fig. 2, it is a schematic diagram of the sound waveform of the word "ni hao", where each box represents one frame of data, and the conventional method needs to know which pronunciation phoneme each frame of data corresponds to. For example, frames 1, 2, 3 and 4 correspond to the phoneme n, frames 5, 6 and 7 to the phoneme i, frames 8 and 9 to the phoneme h, frames 10 and 11 to the phoneme a, and frame 12 to the phoneme o. Compared with conventional acoustic model training, acoustic model training that uses CTC as the loss function is completely end-to-end: the data do not need to be aligned in advance, and only one input sequence and one output sequence are needed for training. Therefore, no data alignment or one-to-one labeling is required; CTC directly outputs the probability of the predicted sequence, and no external post-processing is needed.
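Because CTC training needs only an (input sequence, output sequence) pair and no pre-computed frame alignment, the loss can be computed directly on unaligned data. A minimal training-step sketch with PyTorch's CTCLoss is shown below; the model is a hypothetical stand-in that outputs per-frame class logits, and the blank index of 0 is an assumption.

```python
# Sketch of CTC training on unaligned (features, transcript) pairs.
# Assumption: `model` maps features of shape (T, N, feat_dim) to per-frame
# class logits of shape (T, N, num_classes); class 0 is the CTC blank.
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(model, feats, feat_lens, targets, target_lens, optimizer):
    # targets holds the concatenated label ids of the batch; no frame labels needed.
    log_probs = model(feats).log_softmax(dim=-1)          # (T, N, num_classes)
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```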
CTC maps an input sequence to an output sequence. The CTC model does not care whether each result in the predicted output sequence is exactly aligned with the input sequence at a particular time point; it is mainly concerned with whether the predicted output sequence is close to (identical with) the true sequence. Fig. 3 is a schematic diagram of the predicted pronunciation waveform in the speech recognition method of the embodiment shown in fig. 1; as shown in fig. 3, it is a schematic diagram of a CTC prediction result, where blank (a frame with no predicted value) is introduced by CTC: each predicted class corresponds to a spike (peak) in the whole utterance, and the other, non-peak positions are regarded as blank. For a piece of speech, the final output of CTC is a sequence of spikes, and CTC does not care how long each phoneme lasts. As shown in fig. 3, for the "ni hao" example, the sequence predicted by CTC may be slightly delayed in time relative to the actual pronunciation, and the other times are marked as blank. Therefore, after CTC speech recognition is performed on the target audio data, a predicted sequence result, i.e., the first text data, is output; in order to obtain the time points corresponding to the recognized text more accurately and to reduce delay, the operation in step 103 is performed.
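A minimal sketch of how the spike sequence of fig. 3 is turned into text is given below: take the best class for each frame, collapse consecutive repeats, and drop blanks. The blank index is an assumption; note that the surviving frame index of each token is only a representative "spike" frame, not its true start time, which is exactly why step 103 re-aligns the result.

```python
# Greedy CTC decoding: per-frame argmax -> collapse repeats -> drop blanks.
import numpy as np

def ctc_greedy_decode(frame_log_probs, blank=0):
    """frame_log_probs: (num_frames, num_classes) log-probabilities."""
    best = np.argmax(frame_log_probs, axis=-1)
    tokens, peak_frames = [], []
    prev = blank
    for t, c in enumerate(best):
        if c != blank and c != prev:
            tokens.append(int(c))
            peak_frames.append(t)   # spike frame, not a precise onset time
        prev = c
    return tokens, peak_frames
```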
Step 103, aligning the first text data through a preset acoustic model to obtain an alignment result, wherein the alignment result comprises timestamp information.
In step 103, the recognition result from step 102 is aligned through a preset acoustic model to obtain the alignment result. Specifically, for example, the recognition result "ni hao" and the corresponding acoustic features are sent to the acoustic model for hard alignment; by adding a conventional DNN-HMM alignment module combined with a hard alignment processing algorithm, accurate time point information, i.e., an alignment result that includes timestamp information, can be obtained at the cost of only a small amount of extra computation.
Step 104, generating target text data according to the alignment result.
In step 104, the final target text data after speech recognition is generated according to the alignment result; in this case, the target text data may include, according to the timestamp information, the time point information of each word, i.e., the start and end times of each word.
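A minimal sketch of this step is given below, under the assumption that the alignment result is a per-frame word (or phoneme) label sequence and that the frame shift is fixed at 10 ms; both are illustrative assumptions about the alignment output format, which this application does not limit.

```python
# Turn a per-frame alignment (one label per frame, None for silence/blank)
# into (word, start_seconds, end_seconds) entries.
def alignment_to_timestamps(frame_labels, frame_shift=0.010):
    spans, current, start = [], None, 0
    for t, label in enumerate(list(frame_labels) + [None]):  # sentinel flushes the last span
        if label != current:
            if current is not None:
                spans.append((current, start * frame_shift, t * frame_shift))
            current, start = label, t
    return spans

# Example: alignment_to_timestamps(["ni"] * 7 + ["hao"] * 5)
#   -> [("ni", 0.00, 0.07), ("hao", 0.07, 0.12)]
```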
The invention can acquire target audio data; classify and recognize the target audio data to obtain first text data; align the first text data through a preset acoustic model to obtain an alignment result; and generate target text data according to the alignment result. By adding an acoustic model for alignment, that is, by aligning the recognized result with a small conventional acoustic model, accurate time points can be obtained in the text data produced by the alignment processing, compared with existing voice recognition models.
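Putting steps 101-104 together, an end-to-end sketch of the described flow is given below; `ctc_recognizer` and `dnn_hmm_aligner` are hypothetical stand-ins for the preset speech recognition model and the preset acoustic model (not a concrete API), and the `preprocess` and `alignment_to_timestamps` helpers are the sketches given above.

```python
# End-to-end sketch of the described method (all model objects are stand-ins).
def recognize_with_timestamps(audio_path, ctc_recognizer, dnn_hmm_aligner):
    feats = preprocess(audio_path)                                   # step 101: target audio data
    first_text = ctc_recognizer.recognize(feats)                     # step 102: CTC classification recognition
    frame_alignment = dnn_hmm_aligner.hard_align(first_text, feats)  # step 103: hard alignment
    return alignment_to_timestamps(frame_alignment)                  # step 104: text with timestamps
```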
Fig. 4 is a second flowchart of a speech recognition method according to an embodiment of the present application, as shown in fig. 4, including the following steps:
step 101, target audio data is acquired.
Step 102, classifying and identifying the target audio data to obtain first text data.
Steps 101-102 above have been discussed in the foregoing description and will not be described again here.
Step 1031, obtaining acoustic feature information corresponding to the first text data.
Step 1032, inputting the first text data and the acoustic feature information into a preset acoustic model to obtain an alignment result, wherein the alignment result includes timestamp information.
Further, in step 1032, inputting the first text data and the acoustic feature information into the acoustic model includes: and inputting the first text data and the acoustic feature information into a preset acoustic model for hard alignment processing.
Specifically, the hard alignment process includes: inputting the first text data and the acoustic feature information into an acoustic model for feature comparison processing to generate timestamp information; and carrying out one-to-one correspondence on the timestamp information and each text data in the first text data.
It should be noted that, in the embodiment of the present application, the first text data is aligned through the preset acoustic model to obtain the alignment result, where the alignment result includes timestamp information. The specific process is as follows: the acoustic feature information corresponding to the first text data obtained by CTC speech recognition, for example the recognition result "ni hao" and the corresponding acoustic features, is sent to the acoustic model for Viterbi alignment. The Viterbi algorithm performs hard alignment: so-called hard alignment means an attribution of only 0 or 1, that is, a frame belongs to exactly one state. HMM decoding has two methods, namely the Viterbi algorithm and an approximate algorithm; in the embodiment of the present application, the Viterbi algorithm may be selected to perform hard alignment. The Viterbi algorithm is a dynamic programming algorithm that can obtain the trace-back path with the highest probability. It proceeds in multiple steps, and at each step solves an optimal selection problem over multiple choices: for every possible choice of the current step, it keeps the minimum total cost (or maximum value) over all previous steps leading to that choice, together with the choice of the previous step that achieves it. After all steps have been computed in sequence, the choice of each previous step is found by back-tracking, and the complete optimal selection path is obtained.
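For illustration, a simplified Viterbi forced-alignment sketch over a left-to-right state sequence is given below: given per-frame log-scores for each state of the known text (in order), dynamic programming keeps, for every frame and state, the best score of reaching that state together with a back-pointer, and the complete path is then recovered by back-tracking. Transition probabilities are omitted for brevity, so this illustrates the hard-alignment idea rather than the full HMM decoder.

```python
# Forced (hard) Viterbi alignment: each frame is attributed to exactly one
# state of the known transcript (an attribution of only 0 or 1).
import numpy as np

def viterbi_force_align(emission_logp):
    """emission_logp: (T, S) log-score of each of the S transcript states
    (in left-to-right order) at each of the T frames. Returns one state
    index per frame."""
    T, S = emission_logp.shape
    assert T >= S, "need at least one frame per state"
    NEG = -1e30
    score = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = emission_logp[0, 0]            # the path must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else NEG
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + emission_logp[t, s]
    # Back-track the highest-probability path from the last state at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                            # frame-level state alignment
```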
It should be noted that, in the embodiment of the present application, the preset acoustic model may perform alignment and recognition with a DNN-HMM. The DNN-HMM alignment method is similar to the GMM-HMM method: specifically, the DNN replaces the GMM of the GMM-HMM for recognition, with the DNN providing the emission probabilities while the transition matrix and the initial state probability matrix are still derived from the HMM. The specific processing steps of the DNN-HMM acoustic model include: frame segmentation and feature extraction, where feature extraction may use the Mel-Frequency Cepstral Coefficient (MFCC) method; Viterbi alignment through a GMM-HMM acoustic model; classifying each frame over the phoneme set to obtain the probability that the frame belongs to each phoneme; and decoding search through the HMM to obtain the optimal phoneme sequence for each frame. Given the phoneme sequence, forced alignment is then performed, iterating GMM-HMM -> DNN-HMM -> DNN-HMM according to the GMM likelihood value of each frame, to obtain the alignment result.
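As a side note on the hybrid described above, the DNN outputs per-frame state posteriors, and a common practice is to divide them by the state priors (estimated from alignment counts) to obtain scaled likelihoods before HMM Viterbi decoding; the short sketch below illustrates this in the log domain. This prior-scaling step is standard hybrid practice rather than something stated in this application, so it should be read as an assumption.

```python
# Convert DNN state posteriors to scaled log-likelihoods for HMM decoding:
# log p(x|s) is proportional to log p(s|x) - log p(s).
import numpy as np

def posteriors_to_loglikes(log_posteriors, state_counts, floor=1e-8):
    priors = state_counts / state_counts.sum()          # p(s) from alignment counts
    return log_posteriors - np.log(np.maximum(priors, floor))
```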
Step 104, generating target text data according to the alignment result.
Step 104 above has been discussed in the foregoing description and is not repeated here.
The invention can acquire target audio data; classify and recognize the target audio data to obtain first text data; align the first text data through a preset acoustic model to obtain an alignment result; and generate target text data according to the alignment result. By adding an acoustic model for alignment, that is, by aligning the recognized result with a small conventional acoustic model, accurate time points can be obtained in the text data produced by the alignment processing, compared with existing voice recognition models.
Fig. 5 is a third flowchart of a speech recognition method according to an embodiment of the present application, and as shown in fig. 5, the method includes the following steps:
Step 101, target audio data is acquired.
Step 101 above has been discussed in the foregoing description and is not repeated here.
Step 1021, inputting the target audio data into a preset speech recognition model for classification recognition, and outputting the first text data.
It should be noted that, as discussed above with reference to fig. 3, CTC maps an input sequence to an output sequence and is mainly concerned with whether the predicted output sequence is close to (identical with) the true sequence rather than with exact alignment at each time point; for a piece of speech, the final output of CTC is a sequence of spikes. Therefore, after CTC speech recognition is performed on the target audio data, a predicted sequence result, i.e., the first text data, is output, and in order to obtain the time points corresponding to the recognized text more accurately and to reduce delay, the operation in step 103 is performed.
Step 103, aligning the first text data through a preset acoustic model to obtain an alignment result, wherein the alignment result comprises timestamp information.
Step 104, generating target text data according to the alignment result.
Steps 103-104 above have been discussed in the foregoing description and are not repeated here.
The invention can acquire target audio data; classify and recognize the target audio data to obtain first text data; align the first text data through a preset acoustic model to obtain an alignment result; and generate target text data according to the alignment result. By adding an acoustic model for alignment, that is, by aligning the recognized result with a small conventional acoustic model, accurate time points can be obtained in the text data produced by the alignment processing, compared with existing voice recognition models.
Fig. 6 is a block diagram of a speech recognition apparatus according to an exemplary embodiment, which includes an obtaining module 601, a recognition module 602, an alignment module 603, and an output module 604.
An obtaining module 601, configured to obtain target audio data;
the identification module 602 is configured to perform classification identification on the target audio data to obtain first text data;
an alignment module 603, configured to perform alignment processing on the first text data through a preset acoustic model to obtain an alignment result, where the alignment result includes timestamp information;
and an output module 604, configured to generate target text data according to the alignment result.
Further, the alignment module 603 includes:
the acquisition unit is used for acquiring acoustic characteristic information corresponding to the first text data;
and the alignment unit is used for inputting the first text data and the acoustic characteristic information into an acoustic model to obtain the alignment result.
Further, the alignment unit includes:
and the alignment subunit is used for inputting the first text data and the acoustic feature information into a preset acoustic model to perform hard alignment processing.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 7 is a block diagram illustrating an electronic device 400 in accordance with an exemplary embodiment. For example, the electronic device 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 7, electronic device 400 may include one or more of the following components: processing components 402, memory 404, power components 406, multimedia components 408, audio components 410, input/output interfaces 412, sensor components 414, and communication components 416.
The processing component 402 generally controls the overall operation of the device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the electronic device 400. Examples of such data include instructions for any application or method operating on the device, contact data, phonebook data, messages, pictures, videos, and the like. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply components 406 provide power to the various components of the electronic device 400. Power components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 400.
The multimedia component 408 comprises a screen providing an output interface between the electronic device 400 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 400 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
The input/output interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the electronic device 400. For example, the sensor assembly 414 may detect an open/closed state of the electronic device 400, the relative positioning of components, such as a display and keypad of the electronic device 400, the sensor assembly 414 may also detect a change in the position of the electronic device 400 or a component of the electronic device 400, the presence or absence of user contact with the electronic device 400, orientation or acceleration/deceleration of the electronic device 400, and a change in the temperature of the electronic device 400. The sensor assembly 414 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the electronic device 400 and other devices. The electronic device 400 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 404 comprising instructions, executable by the processor 420 of the electronic device 400 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring target audio data;
classifying and identifying the target audio data to obtain first text data;
aligning the first text data through a preset acoustic model to obtain an alignment result, wherein the alignment result comprises timestamp information;
and generating target text data according to the alignment result.
2. The method according to claim 1, wherein the aligning the first text data through a preset acoustic model to obtain an alignment result comprises:
acquiring acoustic characteristic information corresponding to the first text data;
and inputting the first text data and the acoustic feature information into a preset acoustic model to obtain the alignment result.
3. The method of claim 2, wherein inputting the first text data and the acoustic feature information into an acoustic model comprises:
and inputting the first text data and the acoustic characteristic information into a preset acoustic model for hard alignment processing.
4. The method of claim 1, wherein the classifying and identifying the target audio data to obtain first text data comprises:
and inputting the target audio data into a preset voice recognition model for classification recognition, and outputting first text data.
5. The method according to claim 3, wherein the inputting the first text data and the acoustic feature information into a preset acoustic model for hard alignment processing comprises:
inputting the first text data and the acoustic feature information into an acoustic model for feature comparison processing to generate timestamp information;
and carrying out one-to-one correspondence on the timestamp information and each text data in the first text data.
6. A speech recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring target audio data;
the identification module is used for carrying out classification identification on the target audio data to obtain first text data;
the alignment module is used for aligning the first text data through a preset acoustic model to obtain an alignment result, and the alignment result comprises timestamp information;
and the output module is used for generating target text data according to the alignment result.
7. The apparatus of claim 6, wherein the alignment module comprises:
the acquisition unit is used for acquiring acoustic characteristic information corresponding to the first text data;
and the alignment unit is used for inputting the first text data and the acoustic characteristic information into an acoustic model to obtain the alignment result.
8. The apparatus of claim 7, wherein the alignment unit comprises:
and the alignment subunit is used for inputting the first text data and the acoustic feature information into a preset acoustic model to perform hard alignment processing.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a speech recognition method as claimed in any one of claims 1 to 5.
10. A computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a speech recognition method according to any one of claims 1 to 5.
CN202210882990.0A 2022-07-26 2022-07-26 Voice recognition method and device, electronic equipment and storage medium Pending CN115394297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210882990.0A CN115394297A (en) 2022-07-26 2022-07-26 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210882990.0A CN115394297A (en) 2022-07-26 2022-07-26 Voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115394297A true CN115394297A (en) 2022-11-25

Family

ID=84117672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210882990.0A Pending CN115394297A (en) 2022-07-26 2022-07-26 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115394297A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination