CN113611294A - Voice wake-up method, apparatus, device and medium

Voice wake-up method, apparatus, device and medium

Info

Publication number
CN113611294A
CN113611294A
Authority
CN
China
Prior art keywords
wake-up
combined
word
wake-up word
wake
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110739659.9A
Other languages
Chinese (zh)
Inventor
戚萌
董斐
张维城
姜双双
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN202110739659.9A priority Critical patent/CN113611294A/en
Publication of CN113611294A publication Critical patent/CN113611294A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides a voice wake-up method, apparatus, device and medium. The method comprises: configuring a plurality of combined wake-up words, wherein each combined wake-up word comprises at least two wake-up word segments; and detecting whether the wake-up word recognition results of at least two speech segments match the wake-up word segments of one of the combined wake-up words in one-to-one correspondence, and if so, performing a wake-up operation. When multiple wake-up words share common words, the method reduces the difficulty of collecting training corpora to a certain extent, and simultaneously supports preset wake-up words, user-defined wake-up words and multiple wake-up words without increasing the computation load or the model size.

Description

Voice wake-up method, apparatus, device and medium
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to a voice wake-up method, apparatus, device and medium.
Background
With the development of science and technology, more and more devices have voice functions, such as mobile phones, smart speakers, wearable devices, smart-home appliances and vehicle-mounted devices. All of these devices need an entrance for human-machine interaction, and opening this entrance better and more accurately is very important; the mechanism for opening it is called voice wake-up (voice trigger). Academically, voice wake-up is defined as detecting a specific segment in a speech stream in real time, with the aim of switching a device from a sleep state to a working state. In this process, the user can operate the device directly by voice without touching it; at the same time, with a voice wake-up mechanism the device does not need to stay in a working state at all times, which saves energy. As voice wake-up reaches more application scenarios in daily life, a single wake-up word gradually fails to meet customers' requirements, and the demand for multiple wake-up words and user-defined wake-up words is increasing.
At present, voice wake-up mainly adopts the following three methods. (1) The traditional keyword spotting (KWS) method based on template matching: its training and testing procedures are simple, its demand for data is low and its model is relatively small, but its accuracy is relatively low and its robustness is poor. (2) The keyword spotting method based on a GMM (Gaussian mixture model) and an HMM (hidden Markov model): the GMM is used for acoustic model modeling, and the HMM fits the temporal variation of speech. (3) The keyword spotting method based on a neural network: for a single wake-up word its model is larger than those of methods (1) and (2), but when the number of wake-up words increases, the model size and computation of methods (1) and (2) grow multiplicatively, whereas the growth for method (3) is limited and its accuracy is unchanged. The advantage of method (3) is therefore obvious in scenarios with multiple wake-up words.
After the specific wake-up words are determined, a data set for each wake-up word needs to be collected for model training. With multiple wake-up words, a data set must be collected for each of them. If the wake-up words share common words, for example the instructions "open music" and "pause music" both contain "music", and "open music" and "open video" both contain "open", then collecting a separate corpus for each instruction makes corpus collection difficult.
Disclosure of Invention
To solve the prior-art problem that corpus collection is difficult when multiple wake-up words share common words and a separate corpus is collected for each wake-up word, the present invention provides a voice wake-up method, apparatus, device and medium that reduce the difficulty of collecting training corpora to a certain extent.
The invention solves the technical problems through the following technical scheme:
In a first aspect, the present invention provides a voice wake-up method, comprising:
configuring a plurality of combined wake-up words, wherein each combined wake-up word comprises at least two wake-up word segments; and
detecting whether the wake-up word recognition results of at least two speech segments match the wake-up word segments of one of the combined wake-up words in one-to-one correspondence, and if so, performing a wake-up operation.
Preferably, the method further comprises:
configuring at least one non-combined wake-up word; and
when it is detected that the wake-up word recognition result of the current speech segment matches one of the non-combined wake-up words, performing a wake-up operation.
Preferably, adjacent speech segments among the at least two speech segments satisfy a predetermined interval requirement.
Preferably, the wake-up word recognition result of a speech segment is obtained by processing the speech segment with a pre-trained wake-up word recognition model.
Preferably, the speech segment is processed by the wake-up word recognition model as follows:
performing feature extraction on the speech segment to obtain the speech features of the speech segment;
processing the speech features with the wake-up word recognition model to obtain the probabilities that the speech segment contains different keywords; and
when the probability of the most probable keyword in the speech segment is greater than a preset threshold, taking that keyword as the wake-up word recognition result of the speech segment.
Preferably, when it is detected that the wake-up word recognition result of the current speech segment matches neither any wake-up word segment of the combined wake-up words nor any non-combined wake-up word, and the frame count of the speech segment reaches a preset frame-count threshold, the starting point of the speech segment is moved backward by a preset number of interval frames to serve as the starting point of the next speech segment.
Preferably, the preset number of interval frames is not greater than the preset frame-count threshold.
Preferably, when one of the combined wake-up words comprises two wake-up word segments, whether the wake-up word recognition results of two speech segments match the wake-up word segments of that combined wake-up word in one-to-one correspondence is detected through the following steps:
detecting whether the wake-up word recognition result of the current speech segment matches either wake-up word segment of the combined wake-up word;
when the wake-up word recognition result of the current speech segment matches the first wake-up word segment of the combined wake-up word, marking the first wake-up word segment as successfully matched, taking the next speech segment as the new current speech segment, and returning to the detecting step;
when the wake-up word recognition result of the current speech segment matches neither of the two wake-up word segments, setting k = k + 1, where the initial value of k is zero, taking the next speech segment as the new current speech segment, and returning to the detecting step; and
when the wake-up word recognition result of the current speech segment matches the second wake-up word segment, judging whether the first wake-up word segment is marked as successfully matched and whether k is smaller than a preset interval threshold, and if so, judging that the wake-up word recognition results of the two speech segments match the wake-up word segments of the combined wake-up word in one-to-one correspondence.
Preferably, when one of the combined wake-up words comprises three or more wake-up word segments, whether the wake-up word recognition results of three or more speech segments match the wake-up word segments of that combined wake-up word in one-to-one correspondence is detected through the following steps:
detecting whether the wake-up word recognition result of the current speech segment matches any wake-up word segment of the combined wake-up word;
when the wake-up word recognition result of the current speech segment matches the first wake-up word segment of the combined wake-up word, marking the first wake-up word segment as successfully matched, taking the next speech segment as the new current speech segment, and returning to the detecting step;
when the wake-up word recognition result of the current speech segment matches none of the wake-up word segments, setting k = k + 1, where the initial value of k is zero, taking the next speech segment as the new current speech segment, and returning to the detecting step;
when the wake-up word recognition result of the current speech segment matches a middle wake-up word segment of the combined wake-up word, judging whether the preceding wake-up word segment is marked as successfully matched and whether k is smaller than a preset interval threshold, and if so, marking that middle wake-up word segment as successfully matched, resetting k to zero, taking the next speech segment as the new current speech segment, and returning to the detecting step; and
when the wake-up word recognition result of the current speech segment matches the last wake-up word segment, judging whether the penultimate wake-up word segment is marked as successfully matched and whether k is smaller than the preset interval threshold, and if so, judging that the wake-up word recognition results of the three or more speech segments match the wake-up word segments of the combined wake-up word in one-to-one correspondence.
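The sequential matching procedure above can be sketched as a single loop over the stream of per-segment recognition results. This is a hypothetical illustration rather than the patented implementation; the segment names and the value of the interval threshold `K_MAX` are assumptions.

```python
K_MAX = 3  # preset interval threshold between matched word segments (assumed value)

def detect_combined(results, combined):
    """Match an N-segment combined wake-up word (N >= 2) against a stream of
    per-speech-segment recognition results, allowing at most K_MAX
    non-matching speech segments between consecutive matches."""
    idx = 0  # index of the next wake-up word segment to match
    k = 0    # speech segments elapsed since the previous successful match
    for result in results:
        if result == combined[idx] and (idx == 0 or k < K_MAX):
            idx += 1   # this word segment matched; advance to the next one
            k = 0      # reset the interval counter
            if idx == len(combined):
                return True  # every word segment matched in order: wake up
        else:
            k += 1
    return False
```

For example, with combined = ("open", "music"), the stream ["please", "open", "the", "music"] still wakes the device because the gap between matches stays below K_MAX, while ["music", "open"] does not, since the order is wrong.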
In a second aspect, the present invention provides a voice wake-up apparatus, comprising:
a configuration module, configured to configure a plurality of combined wake-up words, wherein each combined wake-up word comprises at least two wake-up word segments; and
a combined wake-up word detection module, configured to detect whether the wake-up word recognition results of at least two speech segments match the wake-up word segments of one of the combined wake-up words in one-to-one correspondence, and if so, to invoke a wake-up module to perform a wake-up operation.
Preferably, the configuration module is further configured to configure at least one non-combined wake-up word;
and the apparatus further comprises a non-combined wake-up word detection module, configured to detect whether the wake-up word recognition result of the current speech segment matches any of the non-combined wake-up words, and if it matches one of the non-combined wake-up words, to invoke the wake-up module to perform a wake-up operation.
Preferably, adjacent speech segments among the at least two speech segments satisfy a predetermined interval requirement.
Preferably, the wake-up word recognition result of a speech segment is obtained by processing the speech segment with a pre-trained wake-up word recognition model.
Preferably, the apparatus further comprises a wake-up word recognition module for processing the speech segment with the wake-up word recognition model, the wake-up word recognition module comprising:
a feature extraction unit, configured to perform feature extraction on the speech segment to obtain the speech features of the speech segment;
a model processing unit, configured to process the speech features with the wake-up word recognition model to obtain the probabilities that the speech segment contains different keywords; and
a recognition result determining unit, configured to take the most probable keyword in the speech segment as the wake-up word recognition result of the speech segment when its probability is greater than a preset threshold.
Preferably, the apparatus further comprises a speech segment starting point updating module, configured to move the starting point of the speech segment backward by a preset number of interval frames to serve as the starting point of the next speech segment when it is detected that the wake-up word recognition result of the current speech segment matches neither any wake-up word segment of the combined wake-up words nor any non-combined wake-up word and the frame count of the speech segment reaches a preset frame-count threshold.
Preferably, when one of the combined wake-up words comprises two wake-up word segments, the combined wake-up word detection module is specifically configured to:
detect whether the wake-up word recognition result of the current speech segment matches either wake-up word segment of the combined wake-up word;
when the wake-up word recognition result of the current speech segment matches the first wake-up word segment of the combined wake-up word, mark the first wake-up word segment as successfully matched, take the next speech segment as the new current speech segment, and return to the detecting step;
when the wake-up word recognition result of the current speech segment matches neither of the two wake-up word segments, set k = k + 1, where the initial value of k is zero, take the next speech segment as the new current speech segment, and return to the detecting step; and
when the wake-up word recognition result of the current speech segment matches the second wake-up word segment, judge whether the first wake-up word segment is marked as successfully matched and whether k is smaller than a preset interval threshold, and if so, judge that the wake-up word recognition results of the two speech segments match the wake-up word segments of the combined wake-up word in one-to-one correspondence.
Preferably, when one of the combined wake-up words comprises three or more wake-up word segments, the combined wake-up word detection module is specifically configured to:
detect whether the wake-up word recognition result of the current speech segment matches any wake-up word segment of the combined wake-up word;
when the wake-up word recognition result of the current speech segment matches the first wake-up word segment of the combined wake-up word, mark the first wake-up word segment as successfully matched, take the next speech segment as the new current speech segment, and return to the detecting step;
when the wake-up word recognition result of the current speech segment matches none of the wake-up word segments, set k = k + 1, where the initial value of k is zero, take the next speech segment as the new current speech segment, and return to the detecting step;
when the wake-up word recognition result of the current speech segment matches a middle wake-up word segment of the combined wake-up word, judge whether the preceding wake-up word segment is marked as successfully matched and whether k is smaller than a preset interval threshold, and if so, mark that middle wake-up word segment as successfully matched, reset k to zero, take the next speech segment as the new current speech segment, and return to the detecting step; and
when the wake-up word recognition result of the current speech segment matches the last wake-up word segment, judge whether the penultimate wake-up word segment is marked as successfully matched and whether k is smaller than the preset interval threshold, and if so, judge that the wake-up word recognition results of the three or more speech segments match the wake-up word segments of the combined wake-up word in one-to-one correspondence.
In a third aspect, the present invention provides a voice wake-up device comprising the voice wake-up apparatus described in any one of the above.
In a fourth aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements any one of the above methods when executing the computer program.
In a fifth aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, carries out the steps of any one of the above methods.
By adopting the above technical solution, the invention has the following beneficial effects:
The invention pre-configures a plurality of combined wake-up words, each comprising at least two wake-up word segments, and then detects whether the wake-up word recognition results of at least two speech segments match the wake-up word segments of one of the combined wake-up words in one-to-one correspondence; if so, a wake-up operation is performed. Because the wake-up word recognition result of each speech segment can be obtained by processing the segment with a pre-trained wake-up word recognition model, when the combined wake-up words share common words the invention can collect corpora for the common and non-common words separately in order to train the model. Compared with collecting a separate corpus for each combined wake-up word, this reduces the difficulty of corpus collection to a certain extent, and at the same time supports preset wake-up words, user-defined wake-up words and multiple wake-up words without increasing the computation load or the model size.
Drawings
Fig. 1 is a schematic flowchart of a voice wake-up method according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a software implementation process of a voice wake-up method according to embodiment 2 of the present invention;
fig. 3 is a block diagram of a voice wake-up apparatus according to embodiment 3 of the present invention;
fig. 4 is a hardware architecture diagram of an electronic device according to embodiment 5 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Example 1
This embodiment provides a voice wake-up method. As shown in fig. 1, the method specifically comprises the following steps:
S101, configuring a plurality of combined wake-up words, wherein each combined wake-up word comprises at least two wake-up word segments.
In this embodiment, the combined wake-up words share a common wake-up word segment. For example, the combined wake-up words are: open + music, pause + music, and so on. It should be understood that, besides two wake-up word segments, a combined wake-up word may also consist of three, four or more wake-up word segments as needed; this embodiment places no specific limitation on this.
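As an illustrative sketch of this configuration, each combined wake-up word can be represented as an ordered tuple of wake-up word segments, so that a shared segment such as "music" is a single model keyword regardless of how many commands use it. The data structure and segment names below are assumptions, not part of the patent.

```python
# Hypothetical configuration: each combined wake-up word is an ordered
# tuple of wake-up word segments; segments may be shared across commands.
COMBINED_WAKE_WORDS = [
    ("open", "music"),
    ("pause", "music"),
    ("open", "video"),
]

def model_keywords(combined_wake_words):
    """Distinct word segments the recognition model must output, i.e. the
    classes for which training corpora need to be collected."""
    return sorted({seg for word in combined_wake_words for seg in word})
```

For the three commands above, only four corpora ("open", "pause", "music", "video") are needed instead of one per command, which is the corpus-collection saving the embodiment describes.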
S103, detecting whether the wake-up word recognition results of at least two speech segments match the wake-up word segments of one of the combined wake-up words in one-to-one correspondence; if so, performing step S105, otherwise acquiring the next speech segment. Adjacent speech segments among the at least two speech segments satisfy a predetermined interval requirement.
In this embodiment, the wake-up word recognition result of each speech segment is obtained by processing the segment with a pre-trained wake-up word recognition model.
In this embodiment, one-to-one correspondence matching means that the recognition results of the at least two speech segments match the wake-up word segments of the combined wake-up word in both content and order.
For example, assume the wake-up word recognition result of the first speech segment is "open", the recognition result of the second speech segment is "music", and a preset combined wake-up word is "open music". The recognition results of the two speech segments then match the wake-up word segments of the combined wake-up word "open music" in one-to-one correspondence, indicating that a wake-up command has been detected and the wake-up operation can be performed.
S105, performing a wake-up operation to wake up the voice wake-up terminal.
With the above steps, when the combined wake-up words share common words, this embodiment can collect corpora for the common and non-common words separately to train the wake-up word recognition model. Compared with collecting a separate corpus for each wake-up word, this reduces the difficulty of corpus collection to a certain extent, and simultaneously supports preset wake-up words, user-defined wake-up words and multiple wake-up words without increasing the computation load or the model size.
Preferably, the voice wake-up method provided by this embodiment further comprises the following steps:
S102, configuring at least one non-combined wake-up word.
In this embodiment, a non-combined wake-up word is a single wake-up word, such as "power on" or "power off".
S104, when it is detected that the wake-up word recognition result obtained by processing the current speech segment with the wake-up word recognition model matches one of the non-combined wake-up words, performing step S105.
For example, when the wake-up word recognition result obtained by processing the current speech segment is "power on" and a non-combined wake-up word "power on" has been pre-configured, the recognition result of the current speech segment is considered to match one of the non-combined wake-up words, indicating that a wake-up command has been detected, and the wake-up operation can be performed.
Through the above steps S102 and S104, the method of this embodiment supports not only wake-up by combined wake-up words but also wake-up by non-combined wake-up words.
Preferably, the voice wake-up method provided by this embodiment further comprises the following step:
if the wake-up word recognition result of the current speech segment matches neither any wake-up word segment of the combined wake-up words nor any non-combined wake-up word, and the frame count of the speech segment reaches a preset frame-count threshold, moving the starting point of the speech segment backward by a preset number of interval frames to serve as the starting point of the next speech segment, so that the starting point of the next speech segment is separated from the starting point of the current speech segment by a preset duration. The preset number of interval frames is not greater than the preset frame-count threshold. This step reduces invalid speech input.
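This sliding of the segment start can be sketched as follows; the threshold and hop values are illustrative assumptions, not figures from the patent.

```python
FRAME_THRESHOLD = 100  # preset frame-count threshold (assumed value)
HOP_FRAMES = 40        # preset interval frame count; must not exceed FRAME_THRESHOLD

def next_segment_start(start, matched, frame_count):
    """When nothing matched and the current speech segment has reached the
    frame-count threshold, slide the segment start forward by HOP_FRAMES;
    otherwise keep accumulating frames from the current start."""
    if not matched and frame_count >= FRAME_THRESHOLD:
        return start + HOP_FRAMES
    return start
```

Because HOP_FRAMES is at most FRAME_THRESHOLD, consecutive windows overlap, so a wake-up word straddling the old segment boundary can still be caught in the next window.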
Example 2
This embodiment further refines embodiment 1.
In this embodiment, the speech segment is processed with the wake-up word recognition model to obtain the wake-up word recognition result as follows:
first, performing feature extraction on the speech segment to obtain the speech features of the speech segment;
then, processing the speech features with the wake-up word recognition model to obtain the probabilities that the speech segment contains different keywords;
and finally, when the probability of the most probable keyword in the speech segment is greater than a preset threshold, taking that keyword as the wake-up word recognition result of the speech segment.
In this embodiment, training corpora are collected in advance for the common and non-common words of each combined wake-up word and for each non-combined wake-up word, and are used to train the wake-up word recognition model, so that the model can output the probability that a speech segment belongs to each of these words. Preferably, the wake-up word recognition model is a neural network.
With the above steps, the wake-up word recognition result of each speech segment can be obtained accurately.
In this embodiment, when one of the combined awakening words includes two awakening participles, step S103 detects, through the following steps, whether the awakening word recognition results of two voice segments match the awakening participles in the one combined awakening word in a one-to-one correspondence:
s10311, detecting whether the awakening word recognition result of the current voice segment is matched with each awakening participle in the one combined awakening word;
s10312, when detecting that the awakening word recognition result of the current voice segment is matched with the first awakening word segmentation in the one combined awakening word, marking that the first awakening word segmentation in the one combined awakening word is successfully matched, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment is matched with each awakening word segmentation in the one combined awakening word;
S10313, when it is detected that the awakening word recognition result of the current voice segment matches neither of the two awakening participles in the one combined awakening word, setting k to k+1, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word, where the initial value of k is zero;
and S10314, when it is detected that the awakening word recognition result of the current voice segment is matched with the second awakening word segmentation in the one combined awakening word, judging whether the first awakening word segmentation in the one combined awakening word is marked as successful in matching and whether k is smaller than a preset interval threshold value, and if yes, judging that the awakening word recognition results of the two voice segments are correspondingly matched with the awakening words in the one combined awakening word one by one.
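The two-participle matching logic of steps S10311 to S10314 can be sketched as a small state machine. The class, participle names, and default threshold below are illustrative assumptions rather than the disclosure's implementation:

```python
class TwoSegmentDetector:
    """Sketch of S10311-S10314 for a combined wake word with two participles."""

    def __init__(self, first, second, interval_threshold=3):
        self.first, self.second = first, second
        self.n = interval_threshold   # assumed preset interval threshold
        self.first_matched = False
        self.k = 0                    # segments seen since `first` matched

    def feed(self, result):
        """Feed one voice segment's recognition result; return True on wake-up."""
        if result == self.first:                                    # S10312
            self.first_matched, self.k = True, 0
        elif result == self.second and self.first_matched and self.k < self.n:
            self.first_matched, self.k = False, 0                   # S10314: wake
            return True
        elif self.first_matched:                                    # S10313
            self.k += 1
        return False
```

For example, with an assumed combined wake word "hello" + "device", feeding "hello", one unrelated segment, then "device" triggers a wake-up, while a gap of `interval_threshold` or more non-matching segments does not.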
In this embodiment, when the one combined wake word includes three or more wake word segments, step S103 includes:
s10321, detecting whether the identification result of the awakening word of the current voice segment is matched with each awakening participle in the one combined awakening word;
s10322, when it is detected that the recognition result of the wakeup word of the current speech segment matches the first wakeup word segment in the one of the combined wakeup words, marking that the matching of the first wakeup word segment in the one of the combined wakeup words is successful, and obtaining the next speech segment as a new current speech segment, and returning to the step of detecting whether the recognition result of the wakeup word of the current speech segment matches each wakeup word segment in the one of the combined wakeup words;
S10323, when it is detected that the awakening word recognition result of the current voice segment does not match any awakening participle in the one combined awakening word, setting k to k+1, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word, where the initial value of k is zero;
s10324, when it is detected that the awakening word recognition result of the current voice segment matches with a certain middle position awakening word in one of the combined awakening words, judging whether a previous awakening word of the middle position awakening word is marked as successfully matched and k is smaller than a preset interval threshold value, if so, marking that the certain middle position awakening word in one of the combined awakening words is successfully matched, clearing k, acquiring a next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches with each awakening word in one of the combined awakening words;
s10325, when it is detected that the awakening word recognition result of the current speech segment matches the last awakening participle in the one of the combined awakening words, determining whether a penultimate awakening participle in the one of the combined awakening words has been marked as successfully matched and k is smaller than a predetermined interval threshold, and if yes, determining that the awakening word recognition results of three or more speech segments match the awakening participles in the one of the combined awakening words in a one-to-one correspondence manner.
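Generalized to three or more participles, steps S10321 to S10325 amount to tracking the index of the next expected participle and the gap since the last match. The class below is a hedged sketch under assumed names, not the disclosure's implementation:

```python
class CombinedWakeWordDetector:
    """Sketch of S10321-S10325 for a combined wake word with N participles."""

    def __init__(self, participles, interval_threshold=3):
        self.participles = list(participles)  # ordered awakening participles
        self.n = interval_threshold           # assumed preset interval threshold
        self.next_idx = 0                     # index of the participle expected next
        self.k = 0                            # non-matching segments since last match

    def feed(self, result):
        """Feed one recognition result; True when all participles match in order."""
        if result == self.participles[0]:                       # S10322
            self.next_idx, self.k = 1, 0
        elif self.next_idx > 0 and result == self.participles[self.next_idx]:
            if self.k < self.n:                                 # S10324 / S10325
                self.next_idx, self.k = self.next_idx + 1, 0
                if self.next_idx == len(self.participles):
                    self.next_idx = 0
                    return True
            else:
                self.next_idx, self.k = 0, 0  # gap too long: start over
        elif self.next_idx > 0:                                 # S10323
            self.k += 1
        return False
```

A middle-position match resets k to zero (mirroring the "clearing k" in S10324), and the wake-up fires only when the last participle arrives within the interval threshold of the previous one.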
Through the steps, the detection of the combined awakening words of the input voice can be realized.
As shown in fig. 2, taking the detection of a single combined wake-up word 1 consisting of word1 and word2 as an example, a specific software implementation of the voice wake-up method of this embodiment is described as follows:
s201, confirming the starting point of the voice segment, namely, the start _ frame, and setting the start _ frame to zero when the first execution is carried out.
S202, initializing, which mainly includes: setting the frame number count j to zero, initializing parameters related to the neural network (i.e. the awakening word recognition model), such as the frame number threshold width of the voice segment required by the network, and initializing the voice features; the specific initialization mode matches the network and is not limited herein.
S203, inputting the j-th frame of voice.
S204, acquiring the voice features of the j-th frame of voice and updating the voice features.
And S205, processing the updated features by adopting a pre-trained neural network to obtain the probability that the voice features belong to different keywords.
And S206, performing confidence judgment, namely judging whether the probability of the keyword with the maximum probability is greater than a preset threshold value, and if so, determining that the keyword is the awakening word recognition result of the voice section.
S207, a combined wake-up word determination is performed, that is, whether the wake-up word recognition result of the current speech segment is a word in a combined wake-up word is detected, if yes, step S208 is executed, and if not, step S209 is executed.
S208, further determining which awakening participle of combined wake-up word 1 the awakening word recognition result of the current voice segment is: if it is word1, executing step S218; if it is word2, executing step S220; otherwise, executing step S212.
S209, set Mode to zero, where Mode is a flag bit of the combined wake-up word, and when Mode is equal to 1, it indicates that the wake-up word recognition result of the current speech segment belongs to the combined wake-up word, and when Mode is equal to 0, it indicates that the wake-up word recognition result of the current speech segment does not belong to the combined wake-up word.
S210, after the Mode is set to zero, further determining whether the recognition result of the wakeup word of the current speech segment is a non-combined wakeup word, if yes, performing step S211, and if not, performing step S212.
And S211, executing the awakening operation.
S212, determining whether j is greater than or equal to width; if yes, executing step S214; if no, executing step S213.
S213, setting j = j + 1 and returning to step S203, so as to add one frame of voice to the previous voice segment to obtain a new voice segment. It should be appreciated that since the length of a voice segment is predetermined, the first frame of voice in the previous voice segment is removed while the new voice frame is added.
S214 determines whether Mode is 1, and if it is 1, step S215 is executed, and if it is not 1, step S217 is executed.
S215, judging whether word1_1 is 1, if so, executing step S216, otherwise, executing step S217. When word1_1 is 1, it indicates that word1 in the combined awakening word1 is marked as successful in matching, and when word1_1 is 0, it indicates that word1 in the combined awakening word1 is not marked as successful in matching.
S216, setting k = k + 1, where k counts the voice segments whose awakening word recognition result is not an awakening word after word1 has been marked as successfully matched.
S217, updating the starting point of the voice segment, specifically by setting start_frame = start_frame + j − a × width, where a is an adjustable coefficient that can be tuned according to computing power and performance requirements and takes a value between 0 and 1. Through this step, the starting point of the voice segment is pushed backwards by a preset interval frame number to serve as the starting point of the next voice segment.
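The start-point update in S217 can be written as a one-line helper; the default coefficient value here is an illustrative assumption:

```python
def update_start_frame(start_frame, j, width, a=0.5):
    """Push the window start back by roughly a*width frames (S217).

    a is the adjustable coefficient between 0 and 1; a smaller a gives more
    overlap between consecutive voice segments (higher coverage, more compute).
    """
    assert 0 < a <= 1, "a must lie between 0 and 1"
    return start_frame + j - int(round(a * width))
```

Since this branch only runs once j has reached width (checked in S212), the start point always advances by at least (1 − a) × width frames.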
S218, if the awakening word recognition result of the current voice segment is word1 in combined wake-up word 1, word1_1 is set to 1, which indicates that the first participle in combined wake-up word 1 has been successfully matched, and then step S219 is executed.
S219, updating the starting point of the voice segment, specifically by setting start_frame = start_frame + j, i.e. using the end point of the current voice segment as the starting point of the next voice segment.
S220, judging whether word1_1 is 1 and k is smaller than N, if yes, executing step S221, otherwise, executing step S219. Where N represents a preset interval threshold.
S221, let k be 0 and word1_1 be 0.
S222, performing the wake-up operation, and then executing step S219.
The software flow above realizes the detection of only a single combined wake-up word 1 consisting of word1 and word2; the detection of a plurality of pre-configured combined wake-up words (such as word3 + word4 and word5 + word6) can be realized by executing this software flow in parallel for each of them. It is to be understood that the software flow shown in fig. 2 is merely exemplary and not limiting, and that the software flow may be varied without departing from the principles and spirit of the invention.
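Running the flow in parallel for several combined wake words can be sketched as feeding each recognition result to one detector per configured word. The closure form and the participle names (word3 to word6, as in the example above) are illustrative assumptions:

```python
def make_detector(first, second, interval_threshold=3):
    """Build a feed(result) callable for one two-participle combined wake word."""
    state = {"matched": False, "k": 0}

    def feed(result):
        if result == first:
            state["matched"], state["k"] = True, 0
        elif result == second and state["matched"] and state["k"] < interval_threshold:
            state["matched"], state["k"] = False, 0
            return True
        elif state["matched"]:
            state["k"] += 1
        return False

    return feed

def any_wake(detectors, result):
    """Feed one result to every detector (no short-circuit); True if any fires."""
    fired = [feed(result) for feed in detectors]
    return any(fired)
```

Note that every detector is fed each result before checking for a wake-up, so the internal state of all detectors stays consistent even when one of them fires.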
Example 3
In this embodiment, as shown in fig. 3, the voice wake-up apparatus includes a configuration module 11, a combined awakening word detection module 12, a non-combined awakening word detection module 13, and a wake-up module 14.
The above modules are described in detail below:
the configuration module 11 is configured to configure a plurality of combined wake-up words, where each of the combined wake-up words includes at least two wake-up participles;
the combined awakening word detection module 12 is configured to detect whether an awakening word recognition result of at least two voice segments matches each awakening participle in one of the combined awakening words in a one-to-one correspondence manner, and if so, invoke the awakening module 14 to perform an awakening operation, where adjacent voice segments in the at least two voice segments meet a predetermined interval requirement;
and the awakening word recognition result of each voice segment is obtained by processing each voice segment based on a pre-trained awakening word recognition model.
In addition, the configuration module 11 is further configured to configure at least one non-combined wake word;
the non-combined wake-up word detection module 13 is configured to detect whether the awakening word recognition result obtained by processing the current voice segment based on the awakening word recognition model matches each of the non-combined awakening words, and if it matches one of the non-combined awakening words, invoke the wake-up module 14 to perform the wake-up operation.
Preferably, the apparatus further comprises a wake word recognition module for processing the speech segment based on the wake word recognition model, the wake word recognition module comprising:
the feature extraction unit is used for performing feature extraction processing on the voice sections to obtain voice features of the voice sections;
the model processing unit is used for processing the voice characteristics through the awakening word recognition model to obtain the probability that the voice section contains different keywords;
and the recognition result determining unit is used for determining the keyword as the awakening word recognition result of the voice segment when the probability of the keyword with the highest probability in the voice segment is greater than a preset threshold value.
Preferably, the apparatus further comprises: a voice segment starting point updating module, configured to push the starting point of the voice segment backwards by a preset interval frame number to serve as the starting point of the next voice segment when it is detected that the awakening word recognition result of the current voice segment does not match any awakening participle in each combined awakening word or any non-combined awakening word and the frame number of the voice segment reaches a preset frame number threshold.
Preferably, the preset interval frame number is not greater than the preset frame number threshold.
preferably, when one of the combined wake-up words includes two wake-up participles, the combined wake-up word detection module 12 is specifically configured to:
detecting whether the identification result of the awakening word of the current voice segment is matched with each awakening participle in the one combined awakening word;
when the awakening word recognition result of the current voice segment is detected to be matched with the first awakening word segmentation in the one combined awakening word, marking that the first awakening word segmentation in the one combined awakening word is successfully matched, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment is matched with each awakening word segmentation in the one combined awakening word;
when it is detected that the awakening word recognition result of the current voice segment matches neither of the two awakening participles in the one combined awakening word, setting k to k+1, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word, where the initial value of k is zero;
and when the awakening word recognition result of the current voice segment is detected to be matched with the second awakening participle in the one combined awakening word, judging whether the first awakening participle in the one combined awakening word is marked as successfully matched and whether k is smaller than a preset interval threshold value, and if yes, judging that the awakening word recognition results of the two voice segments are correspondingly matched with the awakening participles in the one combined awakening word one by one.
Preferably, when one of the combined wake-up words includes three or more wake-up participles, the combined wake-up word detection module 12 is specifically configured to:
detecting whether the awakening word recognition result of the current voice segment is matched with each awakening participle in the one combined awakening word;
when the awakening word recognition result of the current voice segment is detected to be matched with the first awakening word segmentation in the one combined awakening word, marking that the first awakening word segmentation in the one combined awakening word is successfully matched, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment is matched with each awakening word segmentation in the one combined awakening word;
when it is detected that the awakening word recognition result of the current voice segment does not match any awakening participle in the one combined awakening word, setting k to k+1, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word, where the initial value of k is zero;
when it is detected that the awakening word recognition result of the current voice segment matches a certain middle-position awakening participle in the one combined awakening word, judging whether the previous awakening participle of the middle-position awakening participle has been marked as successfully matched and whether k is smaller than a preset interval threshold; if yes, marking that the middle-position awakening participle in the one combined awakening word is successfully matched, clearing k, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word;
and when the awakening word recognition result of the current voice segment is matched with the last awakening word segmentation in the one combined awakening word, judging whether the penultimate awakening word segmentation in the one combined awakening word is marked as successful in matching and whether k is smaller than a preset interval threshold, and if yes, judging that the awakening word recognition results of three or more voice segments are correspondingly matched with all the awakening words in the one combined awakening word one by one.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant points. The above-described device embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate, and may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the invention, which can be understood and implemented by those of ordinary skill in the art without inventive effort.
Example 4
The present embodiment provides a voice wake-up apparatus, which includes a voice collecting device and the voice wake-up device according to embodiment 3.
Example 5
Fig. 4 is a schematic diagram of an electronic device according to an exemplary embodiment of the present invention, and illustrates a block diagram of an exemplary electronic device 60 suitable for implementing embodiments of the present invention. The electronic device 60 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 4, the electronic device 60 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 60 may include, but are not limited to: the at least one processor 61, the at least one memory 62, and a bus 63 connecting the various system components (including the memory 62 and the processor 61).
The bus 63 includes a data bus, an address bus, and a control bus.
The memory 62 may include volatile memory, such as random access memory (RAM) 621 and/or cache memory 622, and may further include read-only memory (ROM) 623.
The memory 62 may also include a program tool 625 (or utility tool) having a set (at least one) of program modules 624, such program modules 624 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 61 executes various functional applications and data processing, such as the methods provided by any of the above embodiments, by running a computer program stored in the memory 62.
The electronic device 60 may also communicate with one or more external devices 64 (e.g., a keyboard, a pointing device, etc.). Such communication may be through an input/output (I/O) interface 65. Also, the electronic device 60 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 66. As shown, the network adapter 66 communicates with the other modules of the electronic device 60 via the bus 63. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 60, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, etc.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the electronic device are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Example 6
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method provided by any of the above embodiments.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps of implementing the method as provided by any of the embodiments described above, when said program product is run on the terminal device.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on a remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (20)

1. A voice wake-up method, comprising:
configuring a plurality of combined awakening words, wherein each combined awakening word comprises at least two awakening participles respectively;
and detecting whether the identification result of the awakening words of at least two voice sections is correspondingly matched with each awakening participle in one combined awakening word one by one, and if so, executing awakening operation.
2. The voice wake-up method according to claim 1, characterized in that the method further comprises:
configuring at least one non-combined wake-up word;
and when the fact that the identification result of the awakening word of the current voice segment is matched with one of the non-combined awakening words is detected, executing awakening operation.
3. A voice wake-up method according to claim 1, wherein a predetermined interval requirement is satisfied between adjacent speech segments of the at least two speech segments.
4. The voice wake-up method according to claim 1, wherein the wake-up word recognition result of the voice segment is obtained by processing the voice segment based on a pre-trained wake-up word recognition model.
5. The voice wake-up method according to claim 4, wherein the processing of the speech segments based on the wake-up word recognition model is as follows:
carrying out feature extraction processing on the voice sections to obtain voice features of the voice sections;
processing the voice characteristics through the awakening word recognition model to obtain the probability that the voice section contains different keywords;
and when the probability of the keyword with the maximum probability in the voice section is greater than a preset threshold value, determining the keyword as an awakening word recognition result of the voice section.
6. The voice wake-up method according to claim 2, wherein when it is detected that the awakening word recognition result of the current voice segment does not match any awakening participle in each combined awakening word or any non-combined awakening word and the frame number of the voice segment reaches a preset frame number threshold, the starting point of the voice segment is pushed backwards by a preset interval frame number to serve as the starting point of the next voice segment.
7. The voice wake-up method according to claim 6, wherein the preset interval frame number is not greater than the preset frame number threshold.
8. The voice wake-up method according to claim 1, wherein when the one of the combined wake-up words includes two wake-up participles, it is detected whether there are wake-up word recognition results of two voice segments matching with respective wake-up participles in the one of the combined wake-up words in a one-to-one correspondence manner by:
detecting whether the identification result of the awakening word of the current voice segment is matched with each awakening participle in the one combined awakening word;
when the awakening word recognition result of the current voice segment is detected to be matched with the first awakening word segmentation in the one combined awakening word, marking that the first awakening word segmentation in the one combined awakening word is successfully matched, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment is matched with each awakening word segmentation in the one combined awakening word;
when it is detected that the awakening word recognition result of the current voice segment matches neither of the two awakening participles in the one combined awakening word, setting k to k+1, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word, where the initial value of k is zero;
and when the awakening word recognition result of the current voice segment is detected to be matched with the second awakening participle in the one combined awakening word, judging whether the first awakening participle in the one combined awakening word is marked as successfully matched and whether k is smaller than a preset interval threshold value, and if yes, judging that the awakening word recognition results of the two voice segments are correspondingly matched with the awakening participles in the one combined awakening word one by one.
9. The voice wake-up method according to claim 1, wherein when the one of the combined wake-up words includes three or more wake-up segments, detecting whether the wake-up word recognition results of three or more voice segments are matched with the wake-up segments in the one of the combined wake-up words in a one-to-one correspondence includes:
detecting whether the identification result of the awakening word of the current voice segment is matched with each awakening participle in the one combined awakening word;
when the awakening word recognition result of the current voice segment is detected to be matched with the first awakening word segmentation in the one combined awakening word, marking that the first awakening word segmentation in the one combined awakening word is successfully matched, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment is matched with each awakening word segmentation in the one combined awakening word;
when it is detected that the awakening word recognition result of the current voice segment does not match any awakening participle in the one combined awakening word, setting k to k+1, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word, where the initial value of k is zero;
when it is detected that the awakening word recognition result of the current voice segment matches a certain middle-position awakening participle in the one combined awakening word, judging whether the previous awakening participle of the middle-position awakening participle has been marked as successfully matched and whether k is smaller than a preset interval threshold; if yes, marking that the middle-position awakening participle in the one combined awakening word is successfully matched, clearing k, acquiring the next voice segment as a new current voice segment, and returning to the step of detecting whether the awakening word recognition result of the current voice segment matches each awakening participle in the one combined awakening word;
and when the awakening word recognition result of the current voice segment is matched with the last awakening participle in the one combined awakening word, judging whether the penultimate awakening participle in the one combined awakening word is marked as successful in matching and whether k is smaller than a preset interval threshold value, if so, judging that the awakening word recognition results of three or more voice segments are correspondingly matched with all awakening participles in the one combined awakening word one by one.
10. A voice wake-up apparatus, comprising:
the device comprises a configuration module, a display module and a display module, wherein the configuration module is used for configuring a plurality of combined awakening words, and each combined awakening word comprises at least two awakening participles;
and the combined awakening word detection module is used for detecting whether the awakening word recognition results of at least two voice sections are matched with the awakening participles in one of the combined awakening words in a one-to-one correspondence manner, and if so, calling the awakening module to execute awakening operation.
11. The voice wake-up apparatus according to claim 10, wherein the configuration module is further configured to configure at least one non-combined wake-up word;
the apparatus further comprises: a non-combined wake-up word detection module, configured to detect whether the wake-up word recognition result of the current speech segment matches any of the non-combined wake-up words, and if so, to invoke the wake-up module to perform the wake-up operation.
12. The voice wake-up apparatus according to claim 10, wherein adjacent speech segments of the at least two speech segments satisfy a preset interval requirement.
13. The voice wake-up apparatus according to claim 10, wherein the wake-up word recognition result of a speech segment is obtained by processing the speech segment with a pre-trained wake-up word recognition model.
14. The voice wake-up apparatus according to claim 10, further comprising a wake-up word recognition module configured to process a speech segment with a pre-trained wake-up word recognition model, the wake-up word recognition module comprising:
a feature extraction unit, configured to perform feature extraction on the speech segment to obtain speech features of the speech segment;
a model processing unit, configured to process the speech features with the wake-up word recognition model to obtain the probabilities that the speech segment contains different keywords;
and a recognition result determination unit, configured to take the keyword with the highest probability as the wake-up word recognition result of the speech segment when that probability is greater than a preset threshold.
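The three units of the recognition module in claim 14 can be sketched as a single function. This is an illustrative sketch only: the patent does not specify a model architecture or feature type, so the model is passed in as an opaque callable and all names (`recognize_wake_word`, `model_probs_fn`, the threshold default) are hypothetical.

```python
def recognize_wake_word(features, model_probs_fn, keywords, threshold=0.8):
    """Return the recognised keyword for a speech segment, or None.

    features:       speech features produced by the feature extraction unit
                    (opaque here; the patent does not fix a feature type).
    model_probs_fn: callable mapping features to a list of per-keyword
                    probabilities (stands in for the trained model).
    """
    # model processing unit: per-keyword probabilities for this segment
    probs = model_probs_fn(features)
    # recognition result determination unit: pick the most probable keyword,
    # but accept it only if its probability exceeds the preset threshold
    best = max(range(len(keywords)), key=lambda i: probs[i])
    return keywords[best] if probs[best] > threshold else None


# Toy usage with a fixed "model" (purely illustrative probabilities):
fake_model = lambda feats: [0.9, 0.1]
recognize_wake_word(None, fake_model, ["hello", "assistant"])  # -> "hello"
```

The threshold gate is what distinguishes "most probable keyword" from "confident detection": a segment whose best keyword still scores below the threshold yields no recognition result at all.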
15. The voice wake-up apparatus according to claim 11, further comprising: a speech segment start point update module, configured to, when it is detected that the wake-up word recognition result of the current speech segment matches neither any wake-up sub-word in the combined wake-up words nor any non-combined wake-up word, and the frame count of the speech segment has reached a preset frame count threshold, shift the start point of the speech segment backwards by a preset number of interval frames to serve as the start point of the next speech segment.
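The start-point update in claim 15 is a sliding-window advance. A minimal sketch, with all parameter names illustrative (the patent fixes neither the frame-count threshold nor the interval-frame count):

```python
def next_segment_start(current_start, frame_count, frame_threshold, interval_frames):
    """Sliding-window update from the start-point update module: when a
    segment has produced no wake-up word match and has grown to the preset
    frame-count threshold, push its start forward by a preset number of
    interval frames; otherwise the window keeps growing from the same start.
    """
    if frame_count >= frame_threshold:
        return current_start + interval_frames
    return current_start
```

Advancing by a fixed stride rather than restarting at the segment end keeps overlapping audio in view, so a wake-up word straddling two windows is not lost.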
16. The voice wake-up apparatus according to claim 10, wherein when the one combined wake-up word comprises two wake-up sub-words, the combined wake-up word detection module is specifically configured to:
detect whether the wake-up word recognition result of the current speech segment matches each wake-up sub-word in the one combined wake-up word;
when the wake-up word recognition result of the current speech segment is detected to match the first wake-up sub-word in the one combined wake-up word, mark the first wake-up sub-word in the one combined wake-up word as successfully matched, acquire the next speech segment as the new current speech segment, and return to the step of detecting whether the wake-up word recognition result of the current speech segment matches each wake-up sub-word in the one combined wake-up word;
when the wake-up word recognition result of the current speech segment is detected to match neither of the two wake-up sub-words in the one combined wake-up word, set k = k + 1, acquire the next speech segment as the new current speech segment, and return to the step of detecting whether the wake-up word recognition result of the current speech segment matches each wake-up sub-word in the one combined wake-up word, wherein the initial value of k is zero;
and when the wake-up word recognition result of the current speech segment is detected to match the second wake-up sub-word in the one combined wake-up word, judge whether the first wake-up sub-word in the one combined wake-up word has been marked as successfully matched and whether k is smaller than a preset interval threshold, and if so, determine that the wake-up word recognition results of the two speech segments match, in one-to-one correspondence, the wake-up sub-words in the one combined wake-up word.
17. The voice wake-up apparatus according to claim 10, wherein when the one combined wake-up word comprises three or more wake-up sub-words, the combined wake-up word detection module is specifically configured to:
detect whether the wake-up word recognition result of the current speech segment matches each wake-up sub-word in the one combined wake-up word;
when the wake-up word recognition result of the current speech segment is detected to match the first wake-up sub-word in the one combined wake-up word, mark the first wake-up sub-word in the one combined wake-up word as successfully matched, acquire the next speech segment as the new current speech segment, and return to the step of detecting whether the wake-up word recognition result of the current speech segment matches each wake-up sub-word in the one combined wake-up word;
when the wake-up word recognition result of the current speech segment is detected to match none of the wake-up sub-words in the one combined wake-up word, set k = k + 1, acquire the next speech segment as the new current speech segment, and return to the step of detecting whether the wake-up word recognition result of the current speech segment matches each wake-up sub-word in the one combined wake-up word, wherein the initial value of k is zero;
when the wake-up word recognition result of the current speech segment is detected to match an intermediate wake-up sub-word in the one combined wake-up word, judge whether the preceding wake-up sub-word of that intermediate wake-up sub-word has been marked as successfully matched and whether k is smaller than a preset interval threshold; if so, mark the intermediate wake-up sub-word in the one combined wake-up word as successfully matched, reset k to zero, acquire the next speech segment as the new current speech segment, and return to the step of detecting whether the wake-up word recognition result of the current speech segment matches each wake-up sub-word in the one combined wake-up word;
and when the wake-up word recognition result of the current speech segment matches the last wake-up sub-word in the one combined wake-up word, judge whether the penultimate wake-up sub-word in the one combined wake-up word has been marked as successfully matched and whether k is smaller than the preset interval threshold, and if so, determine that the wake-up word recognition results of three or more speech segments match, in one-to-one correspondence, all the wake-up sub-words in the one combined wake-up word.
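The matching procedure described in claims 16 and 17 amounts to a small state machine: advance an index over the sub-word list on each match, count intervening unmatched segments in k, and accept only while k stays below the preset interval threshold. A minimal sketch under one stated assumption: the claims spell out only the success branch, so the behaviour on a threshold violation (restarting the match from the first sub-word) is an interpretation, not something the patent specifies.

```python
def match_combined_wake_word(results, sub_words, interval_threshold=2):
    """Return True if the per-segment recognition results match the
    sub-words of a combined wake-up word in order, with fewer than
    interval_threshold unmatched segments between consecutive matches.

    results: iterable of recognised keywords (or None) per speech segment.
    """
    idx = 0  # index of the next sub-word expected
    k = 0    # segments elapsed since the previous sub-word matched
    for r in results:
        if r == sub_words[idx]:
            if idx > 0 and k >= interval_threshold:
                # Gap too large: treat the earlier partial match as stale
                # and restart (assumption; the claims only give the
                # "if so" success branch).
                idx, k = (1, 0) if r == sub_words[0] else (0, 0)
                continue
            idx += 1  # mark this sub-word as successfully matched
            k = 0     # reset the interval counter, as in claim 17
            if idx == len(sub_words):
                return True  # all sub-words matched in order -> wake up
        else:
            k += 1  # claims 16/17: k = k + 1 on a mismatched segment
    return False
```

For example, with sub-words `["hi", "assistant"]` and threshold 2, the result stream `["hi", None, "assistant"]` wakes the device (one intervening segment), while `["hi", None, None, "assistant"]` does not, and `["assistant", "hi"]` fails because order matters.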
18. A voice wake-up device, characterized in that it comprises the voice wake-up apparatus according to any one of claims 10 to 17.
19. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 9 when executing the computer program.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202110739659.9A 2021-06-30 2021-06-30 Voice wake-up method, apparatus, device and medium Pending CN113611294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110739659.9A CN113611294A (en) 2021-06-30 2021-06-30 Voice wake-up method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN113611294A true CN113611294A (en) 2021-11-05

Family

ID=78337062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110739659.9A Pending CN113611294A (en) 2021-06-30 2021-06-30 Voice wake-up method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN113611294A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584876A (en) * 2018-12-26 2019-04-05 珠海格力电器股份有限公司 Processing method, device and the voice air conditioner of voice data
CN110047481A (en) * 2019-04-23 2019-07-23 百度在线网络技术(北京)有限公司 Method for voice recognition and device
CN110246490A (en) * 2019-06-26 2019-09-17 合肥讯飞数码科技有限公司 Voice keyword detection method and relevant apparatus
CN110687815A (en) * 2019-10-29 2020-01-14 北京小米智能科技有限公司 Device control method, device, terminal device and storage medium
CN110770820A (en) * 2018-08-30 2020-02-07 深圳市大疆创新科技有限公司 Speech recognition method, apparatus, photographing system, and computer-readable storage medium
CN112420044A (en) * 2020-12-03 2021-02-26 深圳市欧瑞博科技股份有限公司 Voice recognition method, voice recognition device and electronic equipment
CN112951229A (en) * 2021-02-07 2021-06-11 深圳市今视通数码科技有限公司 Voice wake-up method, system and storage medium for physical therapy robot


Similar Documents

Publication Publication Date Title
CN107134279B (en) Voice awakening method, device, terminal and storage medium
US10943582B2 (en) Method and apparatus of training acoustic feature extracting model, device and computer storage medium
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
US10515627B2 (en) Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus
CN108170749B (en) Dialog method, device and computer readable medium based on artificial intelligence
CN104143327B (en) A kind of acoustic training model method and apparatus
CN108831439B (en) Voice recognition method, device, equipment and system
CN110033760B (en) Modeling method, device and equipment for speech recognition
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN108899013B (en) Voice search method and device and voice recognition system
CA2899532C (en) Method and device for acoustic language model training
CN110675870A (en) Voice recognition method and device, electronic equipment and storage medium
CN111880856B (en) Voice wakeup method and device, electronic equipment and storage medium
US11107461B2 (en) Low-power automatic speech recognition device
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN111125317A (en) Model training, classification, system, device and medium for conversational text classification
CN110070859B (en) Voice recognition method and device
CN108055617B (en) Microphone awakening method and device, terminal equipment and storage medium
CN112259089A (en) Voice recognition method and device
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN113326702A (en) Semantic recognition method and device, electronic equipment and storage medium
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN112417875A (en) Configuration information updating method and device, computer equipment and medium
CN115331658B (en) Voice recognition method
CN113611294A (en) Voice wake-up method, apparatus, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination