CN111583939A - Method and device for specific target wake-up by voice recognition - Google Patents


Publication number
CN111583939A
Authority
CN
China
Prior art keywords
target
module
voice
detected
acoustic model
Prior art date
Legal status
Withdrawn
Application number
CN201910124945.7A
Other languages
Chinese (zh)
Inventor
李政
吴国扬
陈心章
Current Assignee
Foxlink Electronics Dongguan Co Ltd
Cheng Uei Precision Industry Co Ltd
Original Assignee
Foxlink Electronics Dongguan Co Ltd
Cheng Uei Precision Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Foxlink Electronics Dongguan Co Ltd, Cheng Uei Precision Industry Co Ltd filed Critical Foxlink Electronics Dongguan Co Ltd
Priority to CN201910124945.7A
Publication of CN111583939A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L17/24: Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method and a device for waking up a specific target by voice recognition. The method comprises the following steps: receiving a voice message of a specific target and extracting the voice features in the voice message; using the voice features of the specific target as input data of a discriminatively trained hidden vector state (HVS) model, training to obtain a specific-target acoustic model, and storing the specific-target acoustic model; receiving a voice message of a target to be detected and extracting the voice features in the voice message; using the voice features of the target to be detected as input data of a discriminatively trained hidden vector state model, and training to obtain an acoustic model of the target to be detected; and comparing the acoustic model of the target to be detected with the specific-target acoustic model, and if the two models are related, performing language decoding on the voice features of the target to be detected with a language model and deciding whether to wake up according to the language decoding result. By using a discriminatively trained HVS model as the acoustic model, the invention can judge the target accurately and quickly and thereby achieve the wake-up function.

Description

Method and device for specific target wake-up by voice recognition
Technical Field
The present invention relates to the field of speech recognition, and in particular to a method and an apparatus for waking up a specific target by speech recognition.
Background
In recent years, smart speakers have gradually changed the way people live. Acting as a voice assistant, a smart speaker can help users carry out everyday tasks such as hailing a car, shopping, setting reminders, and recording information.
A conventional smart speaker usually employs a voice wake-up scheme that automatically detects a number of pre-registered voice commands (wake-up words) in continuous speech to wake the speaker for subsequent tasks. Traditionally, Hidden Markov Model (HMM) techniques compare the features of phonemes and syllables to find the most probable word; later, a Gaussian Mixture Model (GMM) was combined with the HMM to form the classic GMM-HMM model. The conventional GMM-HMM model is usually trained by Maximum Likelihood estimation, but under certain conditions this method can make the probability of a competing answer greater than that of the correct answer, which lowers the recognition accuracy, so there is still room for improvement.
Disclosure of Invention
The invention aims to provide a method for waking up a specific target by voice recognition that overcomes the defects and shortcomings of the prior art. The method monitors the identity of a specific target by combining the wake-up words of the specific target with a discriminatively trained Hidden Vector State model (HVS model for short), thereby achieving voice wake-up by the specific target only.
In order to achieve the above object, one aspect of the embodiments of the present invention provides a method for waking up a specific target by voice recognition, comprising the following steps:
s1: receiving a voice message of a specific target, preprocessing the voice message of the specific target, and extracting a voice feature of the specific target;
s2: taking the voice features of the specific target as input data of a discriminatively trained Hidden Vector State Model (HVS Model), training to obtain a specific-target acoustic model, and storing the specific-target acoustic model;
s3: receiving a voice message of a target to be detected, preprocessing the voice message of the target to be detected, and extracting a voice feature of the target to be detected;
s4: taking the voice features of the target to be detected as input data of a discriminatively trained hidden vector state model, and training to obtain an acoustic model of the target to be detected;
s5: and comparing the relevance between the acoustic model of the target to be detected and the specific-target acoustic model; if the two models are related, performing language decoding on the voice features of the target to be detected by using at least one language model, and deciding whether to wake up according to the language decoding result.
Specifically, the voice message of the specific target and the voice message of the target to be detected each include at least one wake-up word.
Specifically, the preprocessing comprises: performing noise suppression and echo cancellation on the voice message.
In particular, the speech features are obtained by means of mel-frequency cepstral coefficients (MFCCs).
Specifically, the discriminative training is performed using the Maximum Mutual Information (MMI) criterion.
Specifically, the language model includes a lexicon model or a grammar model or a combination thereof.
Specifically, the step of deciding whether to wake up according to the language decoding result comprises: performing language decoding on the voice features of the target to be detected; judging whether the voice message of the target to be detected contains the wake-up word; if it contains the wake-up word, voice-recognition wake-up is started, and if not, voice-recognition wake-up is not started.
Another aspect of an embodiment of the present invention provides a device for waking up a specific target by using voice recognition, including:
an acquisition module, comprising a plurality of microphone arrays, for receiving voice messages of a specific target and of a target to be detected, the voice messages comprising a wake-up word;
the extracting module is connected with the collecting module and is used for extracting MFCC voice characteristics in the voice messages of the specific target and the target to be detected;
the training module is connected with the extracting module and used for taking MFCC voice characteristics in the voice messages of the specific target and the target to be detected as input data of a hidden vector state model trained by a maximum mutual information method and acquiring an acoustic model of the trained specific target and an acoustic model of the target to be detected;
the storage module is connected with the training module and used for storing the trained acoustic model of the specific target;
the decoding module is connected with the extracting module and is used for carrying out language decoding on the voice message of the target to be detected; and
the processor module is connected with the training module, the storage module and the decoding module, and is used for comparing the acoustic model of the specific target in the storage module with the acoustic model of the target to be detected, deciding according to the comparison result whether to start the decoding module to perform language decoding on the voice message of the target to be detected, and confirming whether the language-decoded voice message of the target to be detected contains the wake-up word so as to wake up the device.
Specifically, the device further comprises a registration module connected with the acquisition module and the storage module, the registration module being used to initiate storing the acoustic model of the specific target in the storage module.
Specifically, the device further comprises a wireless communication module, wherein the wireless communication module is used for external communication connection.
Compared with the prior art, the method and the device for waking up a specific target by voice recognition adopt a discriminatively trained hidden vector state model as the acoustic model. Discriminative training maximizes the probability of the correct answer, reduces the probability of competing answers, and increases the discrimination between them, so the device can quickly and accurately judge whether the target to be detected is the specific target and thus achieve the wake-up function.
Drawings
Fig. 1 is a flowchart illustrating a method for waking up a specific target by speech recognition according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an apparatus for waking up a specific target by speech recognition according to an embodiment of the present invention.
The reference numerals in the figures are explained below:
100 speech recognition device 11 acquisition module
12 extraction module and 13 training module
14 storage module 15 decoding module
16 processor module 17 registration module
18 wireless communication module
S101 to S105.
Detailed Description
To explain the technical content, structural features, and achieved objects and effects of the present invention in detail, the following embodiments are exemplified and the detailed description is given in conjunction with the drawings.
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for waking up a specific target by speech recognition according to an embodiment of the present invention, including the following steps:
step S101: receiving a voice message of a specific target, preprocessing the voice message of the specific target, and extracting a voice feature of the specific target;
specifically, the specific target in this step refers to a registered user who achieves the awakening condition in the voice recognition, the voice message is a text prepared in advance, the text content includes a preset awakening word, and the specific target reads the text content first and collects the voice message of the specific target through an acquisition module 11 of the voice recognition apparatus 100 according to an embodiment of the present invention.
Specifically, the voice message collected in this step is an analog voice signal, and the subsequent voice recognition processing can be performed only by converting the analog voice signal into a digital voice signal. In addition, other environmental noises may be included in the voice message, so that it is also necessary to perform pre-processing on the voice message, including noise suppression processing and echo cancellation processing on the digital voice signal, to filter out unwanted environmental noises and obtain a valid voice signal.
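The patent names noise suppression as part of the preprocessing but does not specify the algorithm. One common choice is spectral subtraction over per-frame magnitude spectra; the sketch below is only an illustration of that idea (the function names and the floor parameter `beta` are assumptions, not taken from the patent):

```python
def spectral_subtraction(frame_mag, noise_mag, beta=0.02):
    """Subtract a noise magnitude estimate from a frame's magnitude
    spectrum, clamping each bin to a small spectral floor so that
    magnitudes never go negative."""
    return [max(s - n, beta * s) for s, n in zip(frame_mag, noise_mag)]

# noise_mag would normally be averaged over known speech-free frames
noise_estimate = [0.5, 0.5, 0.5, 0.5]
speech_frame = [2.0, 0.4, 3.0, 0.6]
cleaned = spectral_subtraction(speech_frame, noise_estimate)
```

Bins well above the noise estimate keep most of their energy, while bins at or below it collapse to the floor, which approximates the "valid voice signal" the text describes.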
Specifically, in the embodiment of the present invention, Mel-Frequency Cepstral Coefficients (MFCC) are used to extract the voice features of the specific target: the preprocessed voice signal is cut into a plurality of frames (frame blocking), pre-emphasis is applied to the parts of the voice signal that need to be emphasized, a window function is applied (windowing), and so on, so as to obtain a clearer and more distinct set of voice features.
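The three front-end steps named above (pre-emphasis, frame blocking, windowing) can be sketched as below; the later MFCC stages (FFT, Mel filter bank, DCT) are omitted. The coefficient `alpha=0.97` and the Hamming window are conventional choices, not values fixed by the patent:

```python
import math

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: boosts high frequencies before framing
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_blocking(signal, frame_len, hop):
    # split the signal into overlapping frames; a trailing partial frame is dropped
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming_window(frame):
    # taper the frame edges to reduce spectral leakage in the later FFT step
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]
```

A typical pipeline applies `pre_emphasis` to the whole signal, then `frame_blocking` with, say, 25 ms frames and a 10 ms hop, then `hamming_window` on each frame.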
Step S102: taking the voice features of the specific target as input data of a Hidden Vector State Model (HVS Model for short) trained in an identification mode, training to obtain a specific target acoustic Model, and storing the specific target acoustic Model;
specifically, in this step, the speech features of the specific target are used as input data to train the acoustic model, in the embodiment of the present invention, the latent vector state model is adopted and the discriminant training mode is used for training, the discriminant training does not aim at maximizing the similarity of the trained acoustic corpus but aims at minimizing the classification (or identification) errors, and the identification rate is improved.
The discriminative training is based on the Maximum Mutual Information (MMI) criterion, which maximizes the probability of the correct answer, effectively reduces the probability of competing answers, and increases the discrimination between the correct answer and its competitors.
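The MMI criterion for one utterance can be written as the joint log-score of the reference hypothesis minus the log-sum of the joint scores of all hypotheses. The sketch below evaluates only this objective on toy scores; the actual training machinery (lattice generation, gradient updates of the HVS model parameters) is far beyond a sketch and is not described in the patent:

```python
import math

def mmi_objective(acoustic_logp, lm_logp, correct_idx):
    """Maximum Mutual Information criterion for one utterance:
    log P(O|W_ref)P(W_ref) - log sum_W P(O|W)P(W).
    Maximizing it pushes the correct hypothesis up and competitors down."""
    joint = [a + l for a, l in zip(acoustic_logp, lm_logp)]
    m = max(joint)                                    # stabilized log-sum-exp
    denom = m + math.log(sum(math.exp(j - m) for j in joint))
    return joint[correct_idx] - denom                 # always <= 0

# toy example: hypothesis 0 is correct and scores well above its competitors
score = mmi_objective([-10.0, -25.0, -30.0], [-2.0, -2.0, -2.0], 0)
```

When the correct hypothesis dominates, the objective approaches 0; when a competitor outscores it, the objective drops sharply, which is exactly the separation the text attributes to MMI.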
Specifically, the step of storing the specific target acoustic model refers to storing the specific target acoustic model in a storage module 14 of the speech recognition apparatus 100 according to the embodiment of the invention.
Step S103: receiving a voice message of a target to be detected, preprocessing the voice message of the target to be detected, and extracting a voice feature of the target to be detected;
specifically, the target to be detected in this step refers to a user who wants to perform voice recognition and comparison, and the target to be detected outputs a section of voice message, and the voice message of the target to be detected is collected by an acquisition module 11 of the voice recognition apparatus 100 according to the embodiment of the present invention.
Specifically, the step of preprocessing the voice message of the target to be detected and extracting the voice feature of the target to be detected is the same as the above-mentioned process of preprocessing the voice message of the specific target and extracting the voice feature of the specific target.
Step S104: taking the voice characteristics of the target to be detected as input data of a hidden vector state model trained in an identification mode, and training to obtain an acoustic model of the target to be detected;
specifically, in this step, the speech characteristics of the target to be measured are used as input data to train the acoustic model, in the embodiment of the present invention, a hidden vector state model is adopted and a discriminant training mode is used for training, and the discriminant training is performed on the basis of Maximum Mutual Information (MMI).
Step S105: and comparing the relevance between the acoustic model of the target to be detected and the acoustic model of the specific target, if the acoustic model of the target to be detected and the acoustic model of the specific target are related, performing language decoding on the voice characteristics of the target to be detected by using at least one language model, and judging whether to awaken or not according to a language decoding result.
Specifically, in this step language decoding is performed when the acoustic model of the target to be detected matches the acoustic model of the specific target; if the two models do not match, no action is taken. The language decoding takes the voice features of the target to be detected as input data to the language models.
When the acoustic model of the target to be detected is judged to match the acoustic model of the specific target, the target to be detected is the specific target, so language decoding is performed to confirm whether its voice message contains the wake-up word. The voice features of the target to be detected are decoded with a lexicon model and a grammar model to obtain the content of the voice message, and it is then judged whether that content contains a wake-up word. If it does, voice-recognition wake-up is started; if not, it is not started.
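The final decision of step S105 reduces to a containment check on the decoded text. A minimal sketch, assuming the decoder returns plain text and the registered wake-up words are known (the phrase below is purely hypothetical; the patent leaves the actual wake-up word to the registrant):

```python
def should_wake(decoded_text, wake_words):
    """Wake only if the language-decoded message of an already-verified
    speaker contains at least one registered wake-up word."""
    text = decoded_text.lower()
    return any(w.lower() in text for w in wake_words)

# hypothetical wake-up phrase for illustration only
WAKE_WORDS = ("hello speaker",)
```

Note that this check runs only after the acoustic-model comparison has passed, so a non-registered speaker uttering the wake-up word never reaches it.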
Referring to fig. 2, an embodiment of the present invention provides a device for voice recognition for specific target wake-up. A speech recognition device 100 includes an acquisition module 11, an extraction module 12, a training module 13, a storage module 14, a decoding module 15, a processor module 16, a registration module 17, and a wireless communication module 18.
The acquisition module 11 is connected with the extraction module 12 and the registration module 17. The acquisition module 11 is provided with a plurality of microphone arrays for receiving the voice messages of the specific target and the target to be detected. The collected voice messages are analog voice signals that need to be converted into digital voice signals; noise suppression and echo cancellation are applied to the digital voice signals, and the processed digital voice messages are then transmitted to the extraction module 12.
Herein, a specific target is defined as the object for which specific-target wake-up by voice recognition is intended, and a target to be detected is defined as the object currently being recognized by the voice recognition apparatus 100.
The voice message of the specific target comprises a preset awakening word.
The extraction module 12 is connected with the acquisition module 11, the training module 13 and the decoding module 15, and the extraction module 12 is used for receiving the voice message processed by the acquisition module 11, extracting the voice characteristics of a specific target and a target to be detected, and transmitting the voice characteristics to the training module 13 for acoustic model training or transmitting the voice characteristics to the decoding module 15 for decoding.
The voice features of the specific target and the target to be detected are extracted from the voice messages by means of Mel-Frequency Cepstral Coefficients (MFCC for short).
The training module 13 is connected with the extraction module 12, the storage module 14 and the processor module 16. The training module 13 is configured to receive the speech features of the specific target and the target to be tested extracted by the extraction module 12, use the speech features of the specific target and the target to be tested as input data of a hidden vector state model trained by using a maximum mutual information method, finally obtain an acoustic model after training, and perform different steps according to the specific target and the target to be tested. If the target is the specific target, the acoustic model of the specific target is transmitted to the storage module 14, and if the target is the object to be measured, the acoustic model of the object to be measured is transmitted to the processor module 16.
The storage module 14 is connected with the training module 13, the processor module 16 and the registration module 17. The storage module 14 is configured to store the acoustic model of the specific target trained by the training module 13. In the embodiment of the present invention, when the specific target performs the operation of the registration module 17, the acoustic model of the specific target trained by the training module 13 is transmitted to the storage module 14 for storage. In addition, when the processor module 16 performs comparison between the object to be measured and the acoustic model of the specific object, the storage module 14 transmits the stored acoustic model of the specific object to the processor module 16.
The decoding module 15 is connected with the extraction module 12 and the processor module 16. The decoding module 15 is used to perform language decoding on the voice message of the target to be detected; more specifically, it takes the voice features of the target to be detected extracted by the extraction module 12 as input data of the lexicon model and the grammar model, and transmits the result to the processor module 16.
The processor module 16 is connected with the training module 13, the storage module 14, the decoding module 15 and the wireless communication module 18. The processor module 16 is used to compare the acoustic model of the specific target with the acoustic model of the target to be detected, and to decide according to the comparison result whether to start the decoding module 15 for language decoding. More specifically, when the training module 13 transmits the acoustic model of the target to be detected, the processor module 16 simultaneously obtains the acoustic model of the specific target from the storage module 14 and compares the two acoustic models.
When the acoustic model of the target to be detected is judged to be related to the acoustic model of the specific target, i.e., the target to be detected is the specific target, the processor module 16 starts the decoding module 15 to perform language decoding on the voice message of the target to be detected, so as to determine whether it contains the wake-up word.
The decoding module 15 obtains the voice features of the target to be detected from the extraction module 12 and returns the language decoding result to the processor module 16; from the acoustic model of the target to be detected and the decoding result, the processor module 16 can judge whether the voice message of the target to be detected contains the wake-up word.
When the processor module 16 finds that the voice message of the target to be detected contains the wake-up word, the voice recognition device 100 is woken up; otherwise, nothing is executed.
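The patent says the processor module compares the two acoustic models for "relevance" but never specifies the similarity measure or the acceptance threshold. As a stand-in for illustration only, one could compare flattened model parameter vectors with cosine similarity against a tuned threshold; every name and value below is an assumption:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two parameter vectors, in [-1, 1]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def models_related(params_enrolled, params_test, threshold=0.9):
    # threshold is illustrative; a real system would tune it on enrollment
    # data to trade off false accepts against false rejects
    return cosine_similarity(params_enrolled, params_test) >= threshold
```

Whatever measure is used, the decision gate is the same: only when `models_related` passes does the processor module start the decoding module for language decoding.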
The registration module 17 is connected with the acquisition module 11 and the storage module 14. The registration module 17 allows a specific target to register with the speech recognition device 100, and includes a start element and a display element. When the specific target touches the start element, the storage module 14 is simultaneously activated, indicating that the acoustic model trained by the training module 13 from the voice message currently collected by the acquisition module 11 is to be stored in the storage module 14. In addition, when the specific target touches the start element, the display element is activated so that the specific target can confirm that the device is currently in the registration stage.
In an embodiment of the present invention, the activating element is a button, and the display element is a light emitting diode.
The wireless communication module 18 is connected with the processor module 16 and is configured to communicate with external devices after the processor module 16 determines that the voice recognition device 100 has been successfully woken up.
In the embodiment of the present invention, the wireless communication module 18 includes a Wi-Fi module or a bluetooth module.
As described above, the method and apparatus for waking up a specific target by speech recognition of the present invention use a discriminatively trained hidden vector state model as the acoustic model. Discriminative training with the maximum mutual information criterion not only maximizes the probability of the correct answer but also reduces the probability of competing answers and increases the discrimination between them, so the system can quickly and accurately determine whether the target to be detected is the specific target, thereby achieving the wake-up function.

Claims (10)

1. A method for target-specific wake up by speech recognition, comprising the steps of:
s1: receiving a voice message of a specific target, preprocessing the voice message of the specific target, and extracting a voice feature of the specific target;
s2: taking the voice features of the specific target as input data of a discriminatively trained Hidden Vector State Model (HVS Model), training to obtain a specific-target acoustic model, and storing the specific-target acoustic model;
s3: receiving a voice message of a target to be detected, preprocessing the voice message of the target to be detected, and extracting a voice feature of the target to be detected;
s4: taking the voice features of the target to be detected as input data of a discriminatively trained hidden vector state model, and training to obtain an acoustic model of the target to be detected;
s5: and comparing the relevance between the acoustic model of the target to be detected and the specific-target acoustic model; if the two models are related, performing language decoding on the voice features of the target to be detected by using at least one language model, and deciding whether to wake up according to the language decoding result.
2. The method of claim 1, wherein the voice message of the specific target and the voice message of the target to be detected each include at least one wake-up word.
3. The method of claim 1, wherein the pre-processing comprises: the voice message is processed with noise suppression and echo cancellation.
4. Method for target-specific wake up by speech recognition according to claim 1, characterized in that the speech features are derived by means of Mel Frequency Cepstral Coefficients (MFCCs).
5. The method of claim 1, wherein the discriminant training is performed using a Maximum Mutual Information (MMI) method.
6. The method of claim 1, wherein the language model comprises a lexicon model or a grammar model or a combination thereof.
7. The method of claim 2, wherein the step of deciding whether to wake up according to the language decoding result comprises:
performing language decoding on the voice characteristics of the target to be detected;
judging whether the voice message of the target to be detected contains the wake-up word;
if the wake-up word is contained, voice-recognition wake-up is started, and if not, voice-recognition wake-up is not started.
8. An apparatus for target-specific wake up by speech recognition, the apparatus comprising:
an acquisition module, comprising a plurality of microphone arrays, for receiving voice messages of a specific target and of a target to be detected, the voice messages comprising a wake-up word;
the extracting module is connected with the collecting module and is used for extracting MFCC voice characteristics in the voice messages of the specific target and the target to be detected;
the training module is connected with the extracting module and used for taking MFCC voice characteristics in the voice messages of the specific target and the target to be detected as input data of a hidden vector state model trained by a maximum mutual information method and acquiring an acoustic model of the trained specific target and an acoustic model of the target to be detected;
the storage module is connected with the training module and used for storing the trained acoustic model of the specific target;
the decoding module is connected with the extracting module and is used for carrying out language decoding on the voice message of the target to be detected; and
the processor module is connected with the training module, the storage module and the decoding module, and is used for comparing the acoustic model of the specific target in the storage module with the acoustic model of the target to be detected, deciding according to the comparison result whether to start the decoding module to perform language decoding on the voice message of the target to be detected, and confirming whether the language-decoded voice message of the target to be detected contains the wake-up word so as to wake up the device.
9. The apparatus for specific-target wake-up by voice recognition according to claim 8, further comprising a registration module, wherein the registration module is connected to the acquisition module and the storage module and is configured to initiate storage of the acoustic model of the specific target in the storage module.
10. The apparatus for specific-target wake-up by voice recognition according to claim 8, further comprising a wireless communication module, wherein the wireless communication module is configured to establish an external communication link.
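The pipeline recited in claim 8 (compare the stored specific-target acoustic model against the acoustic model of the target to be detected, and only then check the decoded utterance for the wake-up word) can be illustrated with a minimal sketch. The patent itself trains hidden vector state models with a maximum mutual information criterion; the cosine-similarity comparison of mean MFCC vectors, the threshold value, and the example wake-up word below are simplified stand-ins for illustration, not the claimed method.

```python
import numpy as np

WAKE_WORD = "hello device"   # hypothetical wake-up word
MATCH_THRESHOLD = 0.85       # hypothetical similarity threshold

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_specific_target(enrolled_model: np.ndarray,
                       test_features: np.ndarray,
                       threshold: float = MATCH_THRESHOLD) -> bool:
    # Stand-in for comparing the stored specific-target acoustic model
    # with the acoustic model of the target to be detected: here the
    # "model" is just the mean MFCC vector over the utterance frames.
    return cosine_similarity(enrolled_model, test_features.mean(axis=0)) >= threshold

def try_wake(enrolled_model: np.ndarray,
             test_features: np.ndarray,
             decoded_text: str) -> bool:
    # Language decoding is only consulted when the speaker matches
    # (mirroring the processor module's gating of the decoding module);
    # the device wakes only when the wake-up word is present.
    if not is_specific_target(enrolled_model, test_features):
        return False
    return WAKE_WORD in decoded_text.lower()
```

In this sketch the speaker check gates the text check, so an utterance containing the wake-up word from a non-enrolled speaker still does not wake the device.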
CN201910124945.7A 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition Withdrawn CN111583939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910124945.7A CN111583939A (en) 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910124945.7A CN111583939A (en) 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition

Publications (1)

Publication Number Publication Date
CN111583939A true CN111583939A (en) 2020-08-25

Family

ID=72122523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910124945.7A Withdrawn CN111583939A (en) 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition

Country Status (1)

Country Link
CN (1) CN111583939A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971678A (en) * 2013-01-29 2014-08-06 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN109155132A (en) * 2016-03-21 2019-01-04 亚马逊技术公司 Speaker verification method and system
CN109243446A (en) * 2018-10-01 2019-01-18 厦门快商通信息技术有限公司 A kind of voice awakening method based on RNN network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deyu Zhou and Yulan He, "Discriminative Training of the Hidden Vector State Model for Semantic Parsing", IEEE Transactions on Knowledge and Data Engineering *

Similar Documents

Publication Publication Date Title
CN110428810B (en) Voice wake-up recognition method and device and electronic equipment
EP3210205B1 (en) Sound sample verification for generating sound detection model
US9646610B2 (en) Method and apparatus for activating a particular wireless communication device to accept speech and/or voice commands using identification data consisting of speech, voice, image recognition
CN110047481B (en) Method and apparatus for speech recognition
WO2017071182A1 (en) Voice wakeup method, apparatus and system
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
US6618702B1 (en) Method of and device for phone-based speaker recognition
US20120209609A1 (en) User-specific confidence thresholds for speech recognition
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
US20130197911A1 (en) Method and System For Endpoint Automatic Detection of Audio Record
CN109272991B (en) Voice interaction method, device, equipment and computer-readable storage medium
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
US20160111090A1 (en) Hybridized automatic speech recognition
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
US11626104B2 (en) User speech profile management
CN111145763A (en) GRU-based voice recognition method and system in audio
JP2003330485A (en) Voice recognition device, voice recognition system, and method for voice recognition
US20230206924A1 (en) Voice wakeup method and voice wakeup device
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN109065026B (en) Recording control method and device
CN117198338B (en) Interphone voiceprint recognition method and system based on artificial intelligence
TW202029181A (en) Method and apparatus for specific user to wake up by speech recognition
CN110808050B (en) Speech recognition method and intelligent device
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
CN115691478A (en) Voice wake-up method and device, man-machine interaction equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200825