CN114155857A - Voice wake-up method, electronic device and storage medium

Info

Publication number: CN114155857A
Application number: CN202111570928.XA
Authority: CN
Inventors: 邓建凯; 陈家欢; 甘津瑞; 俞凯
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2022-03-08

Abstract

The invention discloses a voice wake-up method, an electronic device and a storage medium. The voice wake-up method comprises: continuously caching a user audio stream and determining whether the user audio stream triggers a wake-up; in response to the user audio stream triggering a wake-up, sending a rollback audio stream, obtained by rolling back a first preset time interval from the time point at which the wake-up was triggered, to a voice activity detection module for voice activity detection; while the voice activity detection module performs detection, sending the rollback audio stream to a server in real time for recognition to obtain a first recognition result; determining whether the first recognition result contains speech other than the wake-up word; and, if the first recognition result contains speech other than the wake-up word, entering oneshot mode. Because the rollback audio stream is sent to the server for recognition in real time while the voice activity detection module performs detection, a first recognition result is obtained from which it can be accurately determined whether to enter oneshot mode.

Description

Voice wake-up method, electronic device and storage medium

Technical Field

The invention belongs to the technical field of voice data processing, and particularly relates to a voice wake-up method, an electronic device and a storage medium.

Background

Most existing oneshot schemes are offline schemes. An offline scheme determines whether an interaction is in oneshot mode (an interaction mode in which a wake-up word and a command are spoken together and responded to together) by using Wakeup (which monitors audio data in real time and outputs whether a keyword has been hit) together with Vad (Voice Activity Detection) to judge whether human speech is present. Existing schemes all use Wakeup + Vad.

The steps of the existing offline scheme are: Vad caches the current audio in real time; when a person speaks the wake-up word, Wakeup triggers a wake-up; the audio after the wake-up is sent to Vad, which checks whether a Vad start (speech onset) is triggered within a specified time; if a Vad start is triggered within the specified time, the mode is determined to be oneshot mode; otherwise, the mode is determined to be non-oneshot mode.

In the course of implementing the present application, the inventors found that the existing offline scheme cannot handle two problems: the wake-up triggered by the spoken wake-up word may fire too late, or it may fire too early. When the trigger fires late, the audio sent to Vad after the wake-up cannot trigger a Vad start, so the user actually spoke a oneshot utterance but oneshot mode is not hit. When the trigger fires early, the audio sent to Vad after the wake-up triggers a Vad start prematurely, so the user actually spoke a non-oneshot utterance but oneshot mode is hit.

Disclosure of Invention

An embodiment of the present invention provides a voice wake-up method and apparatus, which are used to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a voice wake-up method, including: continuously caching a user audio stream, and determining whether the user audio stream triggers a wake-up; in response to the user audio stream triggering a wake-up, sending a rollback audio stream, obtained by rolling back a first preset time interval from the time point at which the wake-up was triggered, to a voice activity detection module for voice activity detection, wherein the voice activity detection module ends detection after detecting non-speech audio lasting a second preset time interval; while the voice activity detection module performs detection, sending the rollback audio stream to a server in real time for recognition to obtain a first recognition result; determining whether the first recognition result contains speech other than the wake-up word; and if the first recognition result contains speech other than the wake-up word, entering oneshot mode, wherein oneshot mode is a mode of responding to a wake-up word and a command word that are spoken together.

In a second aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the voice wake-up method of any of the embodiments of the present invention.

In a third aspect, the present invention also provides a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes the steps of the voice wake-up method according to any embodiment of the present invention.

With the method, the electronic device and the storage medium, the rollback audio stream, obtained by rolling back the first preset time interval from the time point at which the wake-up was triggered, is sent to the voice activity detection module for voice activity detection, so that voice activity detection is reliably triggered. Further, while the voice activity detection module performs detection, the rollback audio stream is sent to the server in real time for recognition to obtain a first recognition result, so that whether to enter oneshot mode can be accurately determined.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a voice wake-up method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another voice wake-up method according to an embodiment of the present invention;

Fig. 3 is a diagram of a prior art scheme of a specific example of a voice wake-up method according to an embodiment of the present invention;

Fig. 4 is a voice wake-up flowchart of a specific example of the voice wake-up method according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a flowchart of an embodiment of a voice wake-up method according to the present application is shown, where the voice wake-up method according to the present embodiment may be applied to a terminal with a real-time voice conversation function, such as a smart speaker, a vehicle-mounted terminal, a smart phone, a tablet, a computer, and the like.

As shown in Fig. 1, in step 101, a user audio stream is continuously cached, and it is determined whether the user audio stream triggers a wake-up;

in step 102, in response to the user audio stream triggering a wake-up, a rollback audio stream, obtained by rolling back a first preset time interval from the time point at which the wake-up was triggered, is sent to a voice activity detection module for voice activity detection, where the voice activity detection module ends detection after detecting non-speech audio lasting a second preset time interval;

in step 103, while the voice activity detection module performs detection, the rollback audio stream is sent to a server in real time for recognition to obtain a first recognition result;

in step 104, it is determined whether the first recognition result contains speech other than the wake-up word;

in step 105, if the first recognition result contains speech other than the wake-up word, oneshot mode is entered, where oneshot mode is a mode of responding to a wake-up word and a command word that are spoken together.

In this embodiment, for step 101, the voice wake-up apparatus continuously caches the user audio stream, determines whether the user audio stream contains the wake-up word, and thereby determines whether to trigger a wake-up.

For step 102, in response to the user audio stream triggering a wake-up, the voice wake-up apparatus sends the rollback audio stream, obtained by rolling back the first preset time interval from the time point at which the wake-up was triggered, to the voice activity detection module for voice activity detection, where the voice activity detection module ends detection after detecting non-speech audio lasting the second preset time interval. For example, if the user audio stream is "hello little chi, play music", then regardless of the time point at which the wake-up is triggered, voice activity detection is performed on the audio starting a preset time before the trigger.

For step 103, while the voice activity detection module performs detection, the voice wake-up apparatus sends the rollback audio stream to the server in real time for recognition to obtain the first recognition result. During recognition, the server may filter out the wake-up word and return only the recognition result of the command word.

For step 104, the voice wake-up apparatus determines whether the first recognition result contains speech other than the wake-up word, where the other speech includes the command word.

For step 105, if the first recognition result contains speech other than the wake-up word, oneshot mode is entered, where oneshot mode is a mode of responding to a wake-up word and a command word that are spoken together. For example, if the recognition result does not contain a command word, the normal wake-up flow is followed and a greeting or query prompt is broadcast; if the user audio stream is "hello little chi, play music", the recognition result contains the command word "play music", so the music is played and the instruction feedback is completed directly, without broadcasting a greeting or query prompt.

In the method of this embodiment, the rollback audio stream, obtained by rolling back the first preset time interval from the time point at which the wake-up was triggered, is sent to the voice activity detection module for voice activity detection, so that voice activity detection is reliably triggered. Further, while the voice activity detection module performs detection, the rollback audio stream is sent to the server in real time for recognition to obtain the first recognition result, so that whether to enter oneshot mode can be accurately determined.
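The rollback in step 102 can be illustrated with a small rolling cache. This is only a sketch under stated assumptions: the class name `AudioCache`, the millisecond timestamps, the 100 ms frame size and the 5-second retention window are all hypothetical and do not come from the patent, which only specifies that audio is cached continuously and rolled back by a first preset time interval.

```python
from collections import deque
from typing import Deque, List, Tuple


class AudioCache:
    """Rolling cache of (timestamp_ms, frame) pairs (hypothetical helper)."""

    def __init__(self, max_ms: int) -> None:
        self.max_ms = max_ms
        self._frames: Deque[Tuple[int, bytes]] = deque()

    def push(self, timestamp_ms: int, frame: bytes) -> None:
        """Cache one frame and evict frames older than the retention window."""
        self._frames.append((timestamp_ms, frame))
        while self._frames and timestamp_ms - self._frames[0][0] > self.max_ms:
            self._frames.popleft()

    def rollback(self, trigger_ms: int, interval_ms: int) -> List[bytes]:
        """Return the cached frames starting interval_ms before the trigger."""
        start_ms = trigger_ms - interval_ms
        return [frame for ts, frame in self._frames if ts >= start_ms]


cache = AudioCache(max_ms=5000)
for i in range(40):  # 40 frames of 100 ms each, i.e. 4 s of audio
    cache.push(i * 100, f"frame{i}".encode())

# Wake-up triggered at 3000 ms, rolled back by an assumed 2500 ms.
rolled_back = cache.rollback(trigger_ms=3000, interval_ms=2500)
```

Here the rollback audio stream begins at the frame stamped 500 ms, so audio spoken before the trigger is still delivered to voice activity detection.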

In the method according to the foregoing embodiment, after determining whether the first recognition result contains speech other than the wake-up word, the method further includes:

if the first recognition result does not contain speech other than the wake-up word, entering non-oneshot mode, where non-oneshot mode is the normal wake-up mode, in which a greeting, a query prompt or the like is broadcast after the wake-up is triggered.

The method of this embodiment can thus decide whether to enter oneshot mode by determining whether the first recognition result contains speech other than the wake-up word.

In some optional embodiments, the first recognition result returned by the server does not include the wake-up word, and determining whether the first recognition result contains speech other than the wake-up word includes:

the voice wake-up apparatus determining whether the first recognition result is empty, where an empty first recognition result indicates that the user spoke only the wake-up word and did not speak a command word.

In some optional embodiments, entering oneshot mode if the first recognition result contains speech other than the wake-up word includes:

if the first recognition result is not empty, entering oneshot mode.

In some optional embodiments, after determining whether the first recognition result is empty, the method further includes:

if the first recognition result is empty, entering non-oneshot mode.
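The empty-result check described in these optional embodiments reduces to a very small decision function. The sketch below assumes the server has already filtered out the wake-up word; the function name `select_mode` and the choice to treat `None` and whitespace-only strings as empty are illustrative assumptions, not details given in the patent.

```python
from typing import Optional


def select_mode(first_recognition_result: Optional[str]) -> str:
    """Choose the mode from a result that already excludes the wake-up word."""
    # Assumption: None or whitespace-only text counts as an empty result.
    text = (first_recognition_result or "").strip()
    return "oneshot" if text else "non-oneshot"


mode_for_wake_only = select_mode("")
mode_for_command = select_mode("play music")
```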

In some optional embodiments, after entering oneshot mode if the first recognition result contains speech other than the wake-up word, the method further includes:

acquiring the current oneshot interaction mode, where the oneshot interaction mode is either continued monitoring or broadcasting a greeting, and is used to handle the case in which the user speaks the command word after the wake-up has been triggered.

Further referring to fig. 2, a flow chart of another voice wake-up method provided in an embodiment of the present application is shown. The flowchart is mainly a flowchart of steps further defined for the flow after "obtaining the current oneshot interaction mode" in the above embodiment.

As shown in Fig. 2, in step 201, if the oneshot interaction mode is continued monitoring, the subsequent audio stream in the user audio stream that has not yet undergone voice activity detection is sent on to the voice activity detection module for detection, and the subsequent audio stream is simultaneously sent to the server for recognition to obtain a second recognition result;

in step 202, the user audio stream is responded to based on the second recognition result.

In this embodiment, for step 201, if the oneshot interaction mode is continued monitoring, the subsequent audio stream in the user audio stream that has not yet undergone voice activity detection is sent on to the voice activity detection module for detection, and the subsequent audio stream is simultaneously sent to the server for recognition to obtain a second recognition result. Taking "hello little chi, play music" as an example, if the wake-up is triggered before "play music" is spoken, the "play music" audio after the trigger point undergoes voice activity detection and is simultaneously sent to the server for recognition.

For step 202, the voice wake-up apparatus responds to the user audio stream based on the second recognition result. For example, if the returned recognition result contains a command word, the voice wake-up apparatus performs the corresponding operation based on the command word; if the returned recognition result does not contain a command word, the voice wake-up apparatus plays the greeting.

In the method of this embodiment, the subsequent audio stream in the user audio stream that has not yet undergone voice activity detection is sent on to the voice activity detection module for detection and is simultaneously sent to the server for recognition to obtain the second recognition result, so that a command spoken after the wake-up trigger is still recognized and responded to accurately.

In some optional embodiments, after the obtaining the current oneshot interaction mode, the method further comprises:

if the oneshot interaction mode is playing the greeting, playing the greeting.
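The branching on the oneshot interaction mode can be sketched as a dispatcher. Everything named here is an assumption made for illustration: the enum `OneshotInteraction`, the callback parameters `recognize`, `execute` and `play_greeting`, and the returned labels. Voice activity detection on the subsequent audio is omitted to keep the sketch short.

```python
from enum import Enum
from typing import Callable, List


class OneshotInteraction(Enum):
    KEEP_LISTENING = "keep_listening"  # hypothetical value names
    PLAY_GREETING = "play_greeting"


def handle_oneshot(
    mode: OneshotInteraction,
    subsequent_audio: List[bytes],
    recognize: Callable[[List[bytes]], str],
    execute: Callable[[str], None],
    play_greeting: Callable[[], None],
) -> str:
    """Respond according to the oneshot interaction mode; returns what was done."""
    if mode is OneshotInteraction.PLAY_GREETING:
        play_greeting()
        return "greeting"
    # Continued monitoring: recognize the audio not yet processed.
    second_result = recognize(subsequent_audio)
    if second_result:
        execute(second_result)  # a command word was returned
        return "command"
    play_greeting()  # no command word in the second result
    return "greeting"


log: List[str] = []
outcome = handle_oneshot(
    OneshotInteraction.KEEP_LISTENING,
    [b"frame"],
    recognize=lambda audio: "play music",  # stub standing in for the server
    execute=log.append,
    play_greeting=lambda: log.append("<greeting>"),
)
```

Passing the recognizer and the actions as callbacks keeps the dispatcher independent of any particular speech service or playback layer.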

It should be noted that the above method steps are not intended to limit the execution order of the steps, and in fact, some steps may be executed simultaneously or in the reverse order of the steps, which is not limited herein.

The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.

The inventor finds that the defects in the prior art are mainly caused by the following reasons in the process of implementing the application:

Because Wakeup cannot emit the wake-up message at exactly the moment the wake-up word finishes, the audio subsequently sent to Vad may include part of the wake-up word, or may miss the actual speech that follows the wake-up word.

To address these defects, practitioners usually invest more in research on the wake-up algorithm so that the time point of the wake-up message matches the wake-up audio as closely as possible; that approach starts from modifying and improving the existing scheme.

The inventors also found that the previous oneshot scheme is a purely offline approach that requires no server cooperation, whereas the scheme of the present application cannot be completed without server cooperation.

The scheme of the application is mainly designed and optimized from the following aspects:

Referring to Fig. 3, a diagram of a prior art scheme of a specific example of a voice wake-up method according to an embodiment of the present invention is shown.

As shown in Fig. 3, after a wake-up is triggered, the audio after the wake-up is sent to Vad for speech detection; if a Vad start is triggered within 500 ms, the round is considered to be oneshot; otherwise, the round is considered not to be oneshot.

Ideally the wake-up is triggered at point B, but the prior art cannot reliably trigger exactly at point B.

The existing oneshot scheme has two drawbacks:

If the wake-up is triggered at point A, the audio sent to Vad is the audio after point A, which still contains part of the wake-up word. In this case, even if the user says only the wake-up word "hello little chi", the existing scheme considers it a oneshot utterance.

If the wake-up is triggered at point C, the audio sent to Vad is the audio after point C. In this case, even if the user says "hello little chi, exit", the existing scheme considers it not to be a oneshot utterance.
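The two drawbacks can be reproduced with a toy timeline. The sketch below models the prior scheme as "oneshot if any speech overlaps the 500 ms after the trigger"; the speech segment times and the positions chosen for points A, B and C are invented for illustration only and are not taken from Fig. 3.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms) of speech in the audio


def prior_scheme_is_oneshot(
    speech: List[Segment], trigger_ms: int, window_ms: int = 500
) -> bool:
    """Prior scheme: oneshot if speech occurs within window_ms after the trigger."""
    return any(
        start < trigger_ms + window_ms and end > trigger_ms
        for start, end in speech
    )


# Assumed timeline: wake-up word spoken from 0 to 1000 ms, command from 1100 to 1500 ms.
WAKE_ONLY = [(0, 1000)]
WAKE_AND_COMMAND = [(0, 1000), (1100, 1500)]
A, B, C = 600, 1000, 1600  # early, ideal and late trigger points (assumed)

early_false_positive = prior_scheme_is_oneshot(WAKE_ONLY, A)        # True
late_false_negative = prior_scheme_is_oneshot(WAKE_AND_COMMAND, C)  # False
ideal_wake_only = prior_scheme_is_oneshot(WAKE_ONLY, B)             # False
ideal_with_command = prior_scheme_is_oneshot(WAKE_AND_COMMAND, B)   # True
```

Only the trigger at point B classifies both utterances correctly, which is why the prior scheme depends so heavily on wake-up timing.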

In the scheme of the present application, the audio is cached from the very beginning. Regardless of whether the wake-up is triggered at point A, B or C, the audio starting 2.5 seconds before the wake-up is sent to Vad, and once a Vad start is triggered the audio is sent to the cloud for recognition. The server filters out the wake-up word and returns only the command recognition result. If the recognition result is empty, the user said only the wake-up word and gave no command, so the round is considered to be in non-oneshot mode. If the recognition returns "exit", the user said a command word after the wake-up word, so the round is considered to be in oneshot mode.
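A toy model shows why the rollback makes the decision independent of the trigger point. It assumes a hypothetical word-level timeline and a stub standing in for the server that simply drops the wake-up word; the timings, the trigger points and the function name `first_recognition_result` are illustrative, while the 2.5-second rollback is the value given in this example.

```python
from typing import List, Tuple

Word = Tuple[int, int, str]  # (start_ms, end_ms, text)
WAKE_WORD = "hello little chi"


def first_recognition_result(
    words: List[Word], trigger_ms: int, rollback_ms: int = 2500
) -> str:
    """Stub recognizer: text heard in the rollback stream, minus the wake-up word."""
    window_start = trigger_ms - rollback_ms
    # The stream runs from window_start until Vad ends, so later words are included.
    heard = [text for _start, end, text in words if end > window_start]
    return " ".join(text for text in heard if text != WAKE_WORD)


wake_only = [(0, 1000, WAKE_WORD)]
wake_and_exit = [(0, 1000, WAKE_WORD), (1100, 1500, "exit")]

# Assumed trigger points A, B and C at 600 ms, 1000 ms and 1600 ms.
results_wake_only = [first_recognition_result(wake_only, t) for t in (600, 1000, 1600)]
results_with_exit = [first_recognition_result(wake_and_exit, t) for t in (600, 1000, 1600)]
```

In this model the result is empty for the wake-word-only utterance and "exit" for the combined utterance at all three trigger points, so the mode decision no longer depends on when the wake-up fires.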

The scheme of the present application fundamentally solves the problem of inaccurate wake-up positioning: it uses the cloud recognition result to determine accurately whether the user is in oneshot mode, and thereby resolves both of the existing defects.

Please refer to Fig. 4, which shows a voice wake-up flowchart of a specific example of the voice wake-up method according to an embodiment of the present invention. In the figure, greeting denotes the greeting played to the user; keeListening denotes continued monitoring; oneshType denotes the oneshot interaction mode, or oneshot type; and startVad denotes starting Vad.

As shown in Fig. 4, the flow proceeds as follows.

Step 1: The user speaks the wake-up word "hello little chi", and a wake-up result is triggered;

Step 2: The Vad engine is started internally, and the dialogue engine is started;

Step 3: The cached audio is sent to Vad; the audio sent to Vad contains "hello little chi", so Vad.begin can be triggered;

Step 4: The audio returned by the kernel after Vad.begin is sent to the cloud for recognition;

Step 5: The audio that has passed through Vad is sent to the cloud for recognition;

Step 6: Vad.end (end of Vad) is triggered, after which audio caching starts;

Step 7: The recognition result is awaited; if the recognition result is empty, the current round is judged to be in non-oneshot mode and the normal wake-up flow is followed;

Step 8: If the recognition result is not empty, the current round is judged to be in oneshot mode;

Step 9: If the current round is in oneshot mode, the current oneshot interaction mode is checked: if it is continued monitoring, the audio cached in step 6 is sent on to Vad for subsequent recognition; if it is playing the greeting, the normal playback flow is followed.
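Steps 3 to 6 amount to routing frames according to the Vad.begin and Vad.end events. The sketch below is a simplified illustration, not the actual Vad engine: `is_speech` is a hypothetical per-frame classifier, `end_silence_frames` stands in for the second preset time interval expressed in frames, and dropping the frames before Vad.begin is an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def route_frames(
    frames: Sequence[str],
    is_speech: Callable[[str], bool],
    end_silence_frames: int,
) -> Tuple[List[str], List[str]]:
    """Split frames into those sent to the cloud and those cached after Vad.end."""
    to_cloud: List[str] = []
    cached: List[str] = []
    started = ended = False
    silence_run = 0
    for frame in frames:
        if ended:
            cached.append(frame)  # step 6: cache audio once Vad.end has fired
            continue
        speech = is_speech(frame)
        if not started:
            if not speech:
                continue  # before Vad.begin nothing is sent (assumption)
            started = True  # Vad.begin
        to_cloud.append(frame)  # steps 4 and 5: forward audio to the cloud
        silence_run = 0 if speech else silence_run + 1
        if silence_run >= end_silence_frames:
            ended = True  # Vad.end after enough consecutive non-speech frames
    return to_cloud, cached


# "." marks a non-speech frame; any other character marks a speech frame.
sent, kept = route_frames(list("..HHH.CC...xx"), lambda f: f != ".", end_silence_frames=3)
```

With this input, the frames from the first speech frame through the three trailing non-speech frames go to the cloud, and the two frames arriving after Vad.end are cached for step 9.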

In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may execute the voice wakeup method in any of the above method embodiments;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

continuously cache a user audio stream, and determine whether the user audio stream triggers a wake-up;

in response to the user audio stream triggering a wake-up, send a rollback audio stream, obtained by rolling back a first preset time interval from the time point at which the wake-up was triggered, to a voice activity detection module for voice activity detection, wherein the voice activity detection module ends detection after detecting non-speech audio lasting a second preset time interval;

while the voice activity detection module performs detection, send the rollback audio stream to a server in real time for recognition to obtain a first recognition result;

determine whether the first recognition result contains speech other than the wake-up word;

and if the first recognition result contains speech other than the wake-up word, enter oneshot mode, wherein oneshot mode is a mode of responding to a wake-up word and a command word that are spoken together.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the voice wake-up apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice wake up device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Embodiments of the present invention also provide a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes any of the above voice wake-up methods.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 5, the electronic device includes one or more processors 510 and a memory 520, with one processor 510 taken as an example in Fig. 5. The device for the voice wake-up method may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530 and the output device 540 may be connected by a bus or by other means; a bus connection is taken as an example in Fig. 5. The memory 520 is the non-volatile computer-readable storage medium described above. The processor 510 executes the various functional applications and data processing of the device by running the non-volatile software programs, instructions and modules stored in the memory 520, thereby implementing the voice wake-up method of the above method embodiments. The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice wake-up apparatus. The output device 540 may include a display device such as a display screen.

The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.

As an embodiment, the electronic device is applied to a voice wake-up apparatus on the client side, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the voice wake-up method of any of the above method embodiments.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones, among others.

(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., the iPod), handheld game consoles, e-book readers, as well as smart toys and portable car navigation devices.

(4) A server: the architecture of a server is similar to that of a general-purpose computer, but because a server must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, manageability and the like.

(5) Other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

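1. A voice wake-up method, comprising:

continuously caching a user audio stream, and determining whether the user audio stream triggers a wake-up;

in response to the user audio stream triggering a wake-up, sending a rollback audio stream, obtained by rolling back a first preset time interval from the time point at which the wake-up was triggered, to a voice activity detection module for voice activity detection, wherein the voice activity detection module ends detection after detecting non-speech audio lasting a second preset time interval;

while the voice activity detection module performs detection, sending the rollback audio stream to a server in real time for recognition to obtain a first recognition result;

determining whether the first recognition result contains speech other than the wake-up word;

and if the first recognition result contains speech other than the wake-up word, entering oneshot mode, wherein oneshot mode is a mode of responding to a wake-up word and a command word that are spoken together.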

2. The method according to claim 1, wherein after the determining whether the first recognition result contains speech other than the wake-up word, the method further comprises:

if the first recognition result does not contain speech other than the wake-up word, entering non-oneshot mode.

3. The method according to claim 1, wherein the first recognition result returned by the server does not include the wake-up word, and the determining whether the first recognition result contains speech other than the wake-up word comprises:

determining whether the first recognition result is empty.

4. The method of claim 3, wherein the entering oneshot mode if the first recognition result contains speech other than the wake-up word comprises:

if the first recognition result is not empty, entering oneshot mode.

5. The method of claim 3, wherein after the determining whether the first recognition result is empty, the method further comprises:

if the first recognition result is empty, entering non-oneshot mode.

6. The method according to any one of claims 1-3, wherein after the entering oneshot mode if the first recognition result contains speech other than the wake-up word, the method further comprises:

acquiring a current oneshot interaction mode, wherein the oneshot interaction mode is either continued monitoring or broadcasting a greeting.

7. The method of claim 6, wherein after the acquiring a current oneshot interaction mode, the method further comprises:

if the oneshot interaction mode is continued monitoring, sending a subsequent audio stream in the user audio stream that has not yet undergone voice activity detection on to the voice activity detection module for detection, and simultaneously sending the subsequent audio stream to the server for recognition to obtain a second recognition result;

responding to the user audio stream based on the second recognition result.

8. The method of claim 6, wherein after the acquiring a current oneshot interaction mode, the method further comprises:

if the oneshot interaction mode is playing the greeting, playing the greeting.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 8.

10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 8.