CN110910884A - Wake-up detection method, device and medium - Google Patents

Wake-up detection method, device and medium

Info

Publication number
CN110910884A
Authority
CN
China
Prior art keywords
voice
voice stream
state information
current window
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911230226.XA
Other languages
Chinese (zh)
Other versions
CN110910884B (en)
Inventor
朱紫薇
唐文琦
刘忠亮
解传栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201911230226.XA
Publication of CN110910884A
Application granted
Publication of CN110910884B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present invention provides a wake-up detection method, a wake-up detection apparatus, and a medium. The method includes: performing voice activation detection on audio to obtain a voice stream in the audio; detecting duration information of the voice stream; if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream by using a data model according to a first mode; and if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream by using the data model according to a second mode, re-timing the voice stream, and returning to perform the wake-up word detection on the voice stream. Embodiments of the present invention can improve the accuracy of voice wake-up.

Description

Wake-up detection method, device and medium
Technical Field
The present invention relates to the field of electronic device technologies, and in particular, to a wake-up detection method, a wake-up detection apparatus, and a machine-readable medium.
Background
With the development of electronic technology, many electronic devices introduce voice interaction technology. To save power consumption for voice interaction, electronic devices introduce voice wake-up techniques. The voice wake-up technology wakes up the electronic device via voice to control the electronic device to switch from a non-operating state to an operating state, wherein the electronic device can recognize and feed back the voice of a user in the operating state.
The voice wake-up method in the related art can detect whether the voice contains a wake-up word, and if so, wake up the electronic device.
In the course of implementing the embodiments of the present invention, the inventors found that voice wake-up accuracy is low in noisy environments; specifically, the recognition rate of the wake-up word is low, or the false wake-up rate is high.
Disclosure of Invention
Embodiments of the present invention provide a wake-up detection method, a wake-up detection apparatus, a device for wake-up detection, and a machine-readable medium, which can improve accuracy of voice wake-up.
In order to solve the above problem, an embodiment of the present invention discloses a wake-up detection method, including:
performing voice activation detection on audio to obtain a voice stream in the audio;
detecting the duration information of the voice stream;
if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream by using a data model according to a first mode;
if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream by using the data model according to a second mode, re-timing the voice stream, and returning to perform the wake-up word detection on the voice stream;
wherein the data model comprises: an encoder and a decoder, wherein the decoder determines attention information corresponding to the speech frames of the voice stream in a current window according to state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises: after the encoder outputs the state information, performing moving windowing on the voice stream, and determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window;
the processing procedure corresponding to the second mode comprises: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder.
On the other hand, the embodiment of the invention discloses a wake-up detection device, which comprises:
the voice activation detection module is used for carrying out voice activation detection on audio to obtain a voice stream in the audio;
the judging module is used for detecting the duration information of the voice stream;
a first processing module, configured to perform wake-up word detection on the voice stream by using a data model according to a first mode when the duration information of the voice stream does not reach preset duration information;
a second processing module, configured to perform wake-up word detection on the voice stream by using the data model according to a second mode when the duration information of the voice stream reaches the preset duration information, re-time the voice stream, and return to perform the detection of the duration information of the voice stream;
wherein the data model comprises: an encoder and a decoder, wherein the decoder determines the attention information corresponding to the speech frames of the voice stream in a current window according to the state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises: after the encoder outputs the state information, performing moving windowing on the voice stream, and determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window;
the processing procedure corresponding to the second mode comprises: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder.
In yet another aspect, an embodiment of the present invention discloses an apparatus for wake-up detection, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
performing voice activation detection on audio to obtain a voice stream in the audio;
detecting the duration information of the voice stream;
if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream by using a data model according to a first mode;
if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream by using the data model according to a second mode, re-timing the voice stream, and returning to perform the wake-up word detection on the voice stream;
wherein the data model comprises: an encoder and a decoder, wherein the decoder determines the attention information corresponding to the speech frames of the voice stream in a current window according to the state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises: after the encoder outputs the state information, performing moving windowing on the voice stream, and determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window;
the processing procedure corresponding to the second mode comprises: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder.
In yet another aspect, embodiments of the present invention disclose one or more machine-readable media having instructions stored thereon, which, when executed by one or more processors, cause an apparatus to perform one or more of the aforementioned wake-up detection methods.
The embodiment of the invention has the following advantages:
the embodiment of the invention judges whether the voice stream is in a noise environment according to whether the duration information of the voice stream reaches the preset duration information, and if the voice stream is not in the noise environment, a first mode is adopted. The processing procedure corresponding to the first mode includes: after the decoder outputs the state information, the voice stream is subjected to moving windowing, and the state information corresponding to the voice frame in the current window is determined according to the state information of the voice frame in the previous window. The first mode determines the state information corresponding to the speech frame in the current window according to the state information of the speech frame in the previous window, so that the long-time memory information can be considered, and the accuracy of speech awakening in a quiet environment can be improved.
If in a noisy environment, the second mode is adopted. The processing procedure corresponding to the second mode includes: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder. Since the second mode determines the state information corresponding to the speech frames in the current window via the encoder, the influence of noise on the accuracy of voice wake-up can be reduced. In addition, the current window corresponding to the second mode can serve as the previous window corresponding to the first mode, so the second mode can provide memory information for the first mode; this improves the accuracy of voice wake-up in a noisy environment and balances the accuracy of voice wake-up between noisy and quiet environments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the steps of a wake-up detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a first mode of processing according to an embodiment of the invention;
FIG. 4 is a diagram illustrating a second mode of processing according to an embodiment of the present invention;
FIG. 5 is a block diagram of another wake-up detection apparatus according to another embodiment of the present invention;
FIG. 6 is a block diagram of an apparatus 900 for wake-up detection of the present invention; and
fig. 7 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a wake-up detection scheme, which can improve the accuracy of voice wake-up in a noise environment and reduce the power consumption of the voice wake-up.
The scheme specifically includes the following steps: performing VAD (Voice Activity Detection) on the audio to obtain a voice stream in the audio; detecting duration information of the voice stream; if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream by using a data model according to a first mode; and if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream by using the data model according to a second mode, re-timing the voice stream, and returning to perform the detection of the duration information of the voice stream.
The data model may include: an encoder and a decoder, wherein the decoder determines the attention information corresponding to the speech frames of the voice stream in the current window according to the state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information.
Referring to fig. 1, a schematic diagram of a data processing method according to an embodiment of the present invention is shown. The audio, which may be audio collected via an electronic device, may be subjected to VAD. For example, the electronic device may capture the audio in the non-operating state.
VAD, also known as voice endpoint detection or voice boundary detection, aims to identify and eliminate long periods of silence from a voice signal stream. In an embodiment of the present invention, the VAD result may include: voice streams or non-voice streams, where the voice streams may be further processed and the non-voice streams may be discarded.
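For illustration only (the embodiments do not prescribe a particular VAD algorithm), a minimal energy-based VAD sketch might look as follows; the frame length, hop, and threshold are assumptions of the sketch, not values from this disclosure.

```python
import numpy as np

def simple_energy_vad(samples: np.ndarray, frame_len: int = 400,
                      hop: int = 160, threshold: float = 1e-3) -> np.ndarray:
    """Per-frame speech/non-speech decision via short-time energy.

    A deliberately simple stand-in for a real VAD; frame_len and hop
    assume 16 kHz audio (25 ms frames, 10 ms shift).
    """
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    decisions = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = samples[i * hop: i * hop + frame_len]
        decisions[i] = np.mean(frame ** 2) > threshold
    return decisions
```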
Embodiments of the present invention may input the voice stream into a data model. A mathematical model is a scientific or engineering model constructed using mathematical logic and mathematical language: a mathematical structure that expresses, generally or approximately, the characteristics or quantitative dependencies of a target system, described by means of mathematical symbols. A mathematical model may be one or a set of algebraic, differential, integral, or statistical equations, or a combination thereof, by which the interrelationships or causal relationships between the variables of the system are described quantitatively or qualitatively. Besides models described by equations, there are also models described by other mathematical tools, such as algebra, geometry, topology, and mathematical logic. A mathematical model describes the behavior and characteristics of a system rather than its actual structure. The mathematical model can be trained with machine learning or deep learning methods; machine learning methods may include linear regression, decision trees, random forests, and so on, while deep learning methods may include Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and so on.
The data model may include: an encoder and a decoder. The encoder may perform feature extraction on the voice stream to obtain the state information corresponding to the voice stream. The features corresponding to the state information may be speech features, which may include, but are not limited to: Fbank (filter bank) features, MFCC (Mel-Frequency Cepstral Coefficient) features, and the like.
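As a hedged sketch of such feature extraction, Fbank-style log-Mel features can be computed with an off-the-shelf library such as librosa; the sampling rate, FFT size, hop length, and filter count below are illustrative assumptions.

```python
import numpy as np
import librosa

def fbank_features(samples: np.ndarray, sr: int = 16000, n_mels: int = 40):
    """Log-Mel filter-bank (Fbank) features, shaped (frames, n_mels).

    The 25 ms / 10 ms framing (n_fft=400, hop_length=160 at 16 kHz) and
    40 Mel bands are common defaults, assumed here for illustration.
    """
    mel = librosa.feature.melspectrogram(
        y=samples, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T
```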
The decoder may have an attention mechanism, which may determine the attention information corresponding to the speech frames of the voice stream in the current window according to the state information output by the encoder, and determine the probability that the current window corresponds to the wake-up word according to the attention information and the state information. The attention information may characterize the importance of the speech frames in the current window, specifically their importance to the detection result. Embodiments of the present invention can assign higher attention to speech frames that are more important to the detection result, thereby improving the accuracy of the detection result.
Optionally, the decoder may include: an attention processing module, configured to determine the attention information corresponding to the speech frames of the voice stream in the current window according to the state information output by the encoder; and a linear processing module, configured to determine the probability that the current window corresponds to the wake-up word according to the attention information and the state information. Optionally, the linear processing module may perform normalization according to the attention information to obtain the probability that the current window corresponds to the wake-up word. It is understood that the embodiments of the present invention do not limit the specific processing procedure of the linear processing module.
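A minimal NumPy sketch of this decoder structure is given below: attention scores over the encoder states are normalized into attention information, a context vector is formed, and a linear step maps it to the wake-word probability. The dot-product scoring function, the weight shapes, and the sigmoid output are assumptions; the disclosure only specifies attention processing followed by linear processing with normalization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode_window(H, w_att, w_out, b_out):
    """Attention + linear decoding for one window (shapes are assumptions).

    H:     (T, d) state vectors output by the encoder.
    w_att: (d,)   attention scoring weights (dot-product scoring assumed).
    w_out: (d,)   weights of the linear processing module; b_out: its bias.
    Returns p(y): probability that the current window contains the wake-up word.
    """
    scores = H @ w_att              # one scalar score per speech frame
    a = softmax(scores)             # attention information a_1 .. a_T
    c = a @ H                       # context vector: weighted sum of states
    logit = float(c @ w_out) + b_out
    return 1.0 / (1.0 + np.exp(-logit))   # normalize to a probability
```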
In the course of implementing the embodiments of the present invention, the inventors found that VAD affects the mathematical model's learning of history information. In particular, in a noisy environment, noise easily disturbs the transfer of memory information, so the accuracy of voice wake-up is low.
According to the embodiments of the present invention, whether the voice stream is in a noisy environment is determined by whether the duration information of the voice stream reaches the preset duration information; if not in a noisy environment, the first mode is adopted. The processing procedure corresponding to the first mode includes: after the encoder outputs the state information, performing moving windowing on the voice stream, and determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window. Since the first mode determines the state information corresponding to the speech frames in the current window from the state information of the speech frames in the previous window, long-term memory information can be taken into account, and the accuracy of voice wake-up in a quiet environment can be improved.
If in a noisy environment, the second mode is adopted. The processing procedure corresponding to the second mode includes: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder. Since the second mode determines the state information corresponding to the speech frames in the current window via the encoder, the influence of noise on the accuracy of voice wake-up can be reduced. In addition, the current window corresponding to the second mode can serve as the previous window corresponding to the first mode, so the second mode can provide memory information for the first mode; this improves the accuracy of voice wake-up in a noisy environment and balances the accuracy of voice wake-up between quiet and noisy environments.
The embodiments of the present invention can be applied to voice wake-up scenarios. In a voice wake-up scenario, the embodiments of the present invention can detect whether the audio includes the wake-up word and, if so, wake up the electronic device.
The wake-up detection method provided by the embodiments of the present invention can be applied to an application environment comprising a client and a server, where the client and the server are located in a wired or wireless network and interact with each other through that network.
Optionally, the client may run on a terminal, where the terminal specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like. Alternatively, the client may correspond to any application program, such as a voice interaction program.
Method embodiment
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a wake-up detection method according to the present invention is shown, which may specifically include:
step 201, performing voice activation detection on audio to obtain a voice stream in the audio;
step 202, detecting the duration information of the voice stream;
step 203, if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream by using a data model according to a first mode;
step 204, if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream by using the data model according to a second mode, re-timing the voice stream, and returning to perform the detection of the duration information of the voice stream;
wherein the data model specifically includes: an encoder and a decoder, wherein the decoder determines the attention information corresponding to the speech frames of the voice stream in a current window according to the state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode specifically includes: after the encoder outputs the state information, performing moving windowing on the voice stream, and determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window;
the processing procedure corresponding to the second mode specifically includes: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder.
Although the method embodiment shown in fig. 2 may be executed by a client or by a server, the embodiments of the present invention do not limit the specific execution subject of the method embodiment.
In step 201, the audio corresponding to the electronic device may be collected, and VAD may be performed on the audio. The VAD result may include: a voice stream or a non-voice stream; embodiments of the present invention may perform subsequent processing on the voice stream.
In step 202, the duration information of the voice stream is detected, and whether the duration information of the voice stream reaches the preset duration information can be determined, so as to determine whether the voice stream is in a noisy environment. Embodiments of the present invention may time the voice stream; when the duration information of the voice stream reaches the preset duration information, the voice stream obtained by VAD is considered continuous, and the voice stream can therefore be considered to be in a noisy environment.
The skilled person can determine the preset duration information according to the actual application requirement.
Optionally, the preset duration information is obtained according to information corresponding to the wake-up word. The wake-up word may characterize a voice password used to wake up the electronic device, and the wake-up word may be set by the system or by the user.
For example, the preset duration information may be determined according to the number of characters included in the wake-up word and the number of speech frames corresponding to one character. For example, if one character corresponds to 20 to 30 speech frames, then for a wake-up word containing 5 characters the preset duration information may correspond to 100 to 150 speech frames.
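This sizing rule can be written out directly; the numbers below are the illustrative values from the example above, not fixed parameters.

```python
# Illustrative sizing of the preset duration (values from the example above,
# not fixed by this disclosure).
FRAMES_PER_CHAR = (20, 30)       # assumed speech frames per character
NUM_CHARS = 5                    # characters in the wake-up word

min_frames = NUM_CHARS * FRAMES_PER_CHAR[0]   # 100 speech frames
max_frames = NUM_CHARS * FRAMES_PER_CHAR[1]   # 150 speech frames
print(min_frames, max_frames)                 # -> 100 150
```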
The judgment result obtained in step 202 may include: yes or no. If the determination result is no, step 203 is executed, and if the determination result is yes, step 204 is executed.
In step 203, wake-up word detection may be performed on the voice stream by using the data model according to the first mode, where wake-up word detection detects whether the voice stream includes the wake-up word. Since the first mode determines the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window, long-term memory information can be taken into account, and the accuracy of voice wake-up in a quiet environment can be improved.
In the embodiments of the present invention, optionally, corpora used for training the data model may be collected. The corpora may include: a positive corpus and a negative corpus. Optionally, the positive and negative corpora may be pre-aligned. Optionally, the time information occupied by the wake-up word in the aligned positive corpus may be averaged, and the length of the window may be obtained from the average.
In the course of training, corresponding positive speech is intercepted from the positive corpus and corresponding negative speech is intercepted from the negative corpus according to the length of the window. The positive speech and the negative speech can be learned to obtain the distribution of the speech features corresponding to the positive speech and the negative speech, so that the mathematical model acquires the capability of wake-up detection.
Optionally, the ratio between the positive corpus and the negative corpus may range from 1:20 to 1:40; such a range makes the distribution of the speech features corresponding to the positive speech and the negative speech more representative. Of course, the embodiments of the present invention do not limit the range of the ratio between the positive corpus and the negative corpus.
Referring to fig. 3, a schematic diagram of the processing procedure of the first mode according to an embodiment of the present invention is shown. The voice stream after VAD processing is input to the encoder, and is processed by the encoder to obtain the state information corresponding to the speech frames in the voice stream.
Then, the state information output by the encoder may be subjected to moving windowing. The window length may be T, and the window shift may be determined by those skilled in the art according to actual application requirements. For example, when the window length is 100 frames, the window shift may be 10 frames.
In the embodiments of the present invention, optionally, the length of the window may be obtained from the average of the time information corresponding to the wake-up word in the positive corpus. The corpus may be used to train the mathematical model. The corpus can be aligned, the time information occupied by the wake-up word in the corpus can be counted, and the length of the window can be obtained from the average of that time information, so that the window covers the corresponding wake-up word. It is to be understood that the length of the window may be equal to the average of the time information, or greater than it. Of course, the length of the window is not limited by the embodiments of the present invention.
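A sketch of this sizing step, assuming a hypothetical list of aligned (start_frame, end_frame) spans for the wake-up word in the positive corpus:

```python
def window_length_from_corpus(wake_word_spans, margin: int = 0) -> int:
    """Derive the window length from aligned positive-corpus spans.

    wake_word_spans: hypothetical list of (start_frame, end_frame)
    alignments of the wake-up word; the margin parameter is an assumption
    (the text only requires length >= the average duration).
    """
    durations = [end - start for start, end in wake_word_spans]
    average = sum(durations) / len(durations)
    return int(average + 0.5) + margin   # round so the window covers the word

# e.g. window_length_from_corpus([(0, 98), (5, 107), (3, 103)]) -> 100
```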
Moving windowing yields a continuously updated window. In the embodiments of the present invention, the current window may refer to the window being processed, and a previous window may refer to a window located before the current window; the previous window may include the immediately preceding window, and so on.
In one example of the present invention, assuming that the window length is 100 frames and the window shift is 10 frames, the 1st window corresponds to frames 1-100, the 2nd window to frames 11-110, the 3rd window to frames 21-120, and so on.
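The moving-windowing index arithmetic of this example can be sketched as follows (1-indexed frame ranges to match the text):

```python
def window_frame_ranges(n_frames: int, win: int = 100, shift: int = 10):
    """Yield (first_frame, last_frame) per window, 1-indexed as in the text."""
    start = 1
    while start + win - 1 <= n_frames:
        yield start, start + win - 1
        start += shift

# list(window_frame_ranges(120)) -> [(1, 100), (11, 110), (21, 120)]
```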
The state information after moving windowing may include: h_{s+1}, h_{s+2}, …, h_{s+T-1}, h_{s+T}, and the like, where s represents the number of frames before the current window in the current timing pass of the voice stream, and s may be a numerical value. For example, s is 0 for the 1st window, 10 for the 2nd window, 20 for the 3rd window, and so on.
The attention processing procedure corresponding to the first mode may include: determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window, and determining the attention information according to the state information corresponding to the speech frames in the current window. In FIG. 3, a_t denotes the attention information corresponding to time t, where t may be a natural number. The attention information corresponding to h_{s+1}, h_{s+2}, …, h_{s+T-1}, h_{s+T} can be written as: a_1, a_2, a_3, …, a_{T-1}, a_T.
In the embodiments of the present invention, optionally, the speech frames in the current window may include: a first speech frame;
the processing procedure corresponding to the first mode may include: determining the state information corresponding to the first speech frame in the current window according to the state information of the first speech frame in the previous window.
The first speech frame may characterize the speech frames common to the previous window and the current window. For example, the speech frames common to the 1st window and the 2nd window are frames 11-100, so the state information corresponding to frames 11-100 in the 2nd window can be determined according to the state information corresponding to frames 11-100 in the 1st window. Since the computation of the state information corresponding to the first speech frame is saved, power consumption can be further reduced.
In the embodiments of the present invention, optionally, the speech frames in the current window may further include: a second speech frame; the processing procedure corresponding to the first mode further includes: determining, via the encoder, the state information corresponding to the second speech frame in the current window.
Taking the 2nd window as an example, the second speech frame may be frames 101-110, and the encoder may compute the state information corresponding to the second speech frame.
In the embodiments of the present invention, optionally, the context vector c corresponding to the current window may be determined according to the attention information; the context vector c may be obtained from the state information and the attention information at different times. It is to be understood that the embodiments of the present invention do not limit the specific determination process of the context vector c.
Linear processing may be performed on the context vector c to obtain the probability p(y) that the current window corresponds to the wake-up word. The linear processing may include: normalization and the like. It is understood that the embodiments of the present invention do not impose any limitation on the specific linear processing.
When the first mode is adopted, the state information corresponding to the first speech frame in the current window can be determined according to the state information of the first speech frame in the previous window, so the computation of the state information corresponding to the first speech frame can be saved, and power consumption can be reduced accordingly.
In step 204, the processing procedure corresponding to the second mode includes: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder. Since the second mode determines the state information corresponding to the speech frames in the current window via the encoder, the influence of noise on the accuracy of voice wake-up can be reduced. In addition, the current window corresponding to the second mode can serve as the previous window corresponding to the first mode, so the second mode can provide memory information for the first mode; this improves the accuracy of voice wake-up in noisy conditions and balances the accuracy of voice wake-up between quiet and noisy conditions.
Referring to fig. 4, a schematic diagram of a second mode of processing according to an embodiment of the present invention is shown, in which a voice stream after VAD processing may be subjected to moving windowing, and the window length may be T.
Moving windowing yields a continuously updated window. In the embodiments of the present invention, the current window may refer to the window being processed, and a previous window may refer to a window located before the current window; the previous window may include the immediately preceding window, and so on.
The speech frames of the current window can be input into the encoder and processed by the encoder to obtain the state information corresponding to the speech frames of the current window, such as h_1, h_2, …, h_{T-1}, h_T.
Then, in the second mode, the state information h_1 corresponding to the first speech frame in the current window may be a preset value, which may be 0, or the like. The state information corresponding to the next speech frame can be obtained from the state information corresponding to the current speech frame: for example, h_2 is obtained from h_1, h_3 from h_2, …, and h_T from h_{T-1}, so that the state information corresponding to the speech frames in the current window can be obtained via the encoder.
In both the first mode and the second mode, h_t can be obtained from h_{t-1} and x_t, where x_t characterizes the features of the speech frame, such as Fbank features.
One difference between the first mode and the second mode lies in h_0.
h_0 may characterize the initial state information of a window, and h_0 may be used to determine the state information of the first speech frame in the window. In the first mode, h_0 can be obtained from the previous window: for example, if the speech frames of the previous window are frames 1-100 and the speech frames of the current window are frames 11-110, the h_0 of the current window can be the state information of frame 10, which can be obtained from the previous window.
In the second mode, h_0 may be a preset value; for example, the preset value may be 0.
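The role of h_0 in the two modes can be made concrete with a toy recurrent encoder: each state h_t is computed from h_{t-1} and the frame features x_t, the first mode passes in h_0 carried over from the previous window, and the second mode starts from the preset value (zeros here). The tanh cell is an illustrative stand-in for whatever recurrent unit (e.g., LSTM or GRU) the encoder actually uses.

```python
import numpy as np

def encode_window(X, W, U, h0=None):
    """Toy recurrent encoder for one window (a sketch, not the patented model).

    X:  (T, n_feat) frame features x_1 .. x_T (e.g. Fbank vectors).
    W:  (d, d) recurrent weights; U: (d, n_feat) input weights.
    h0: initial state. First mode: pass the state carried over from the
        previous window. Second mode: leave as None -> preset zeros.
    Returns H: (T, d) state vectors h_1 .. h_T with h_t = f(h_{t-1}, x_t).
    """
    d = W.shape[0]
    h = np.zeros(d) if h0 is None else h0
    H = np.empty((X.shape[0], d))
    for t, x_t in enumerate(X):
        h = np.tanh(W @ h + U @ x_t)   # tanh cell stands in for LSTM/GRU
        H[t] = h
    return H
```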
Then, the attention information can be determined according to the state information corresponding to the speech frames in the current window. In FIG. 4, a_t denotes the attention information corresponding to time t, where t may be a natural number. The attention information corresponding to h_1, h_2, …, h_{T-1}, h_T can be written as: a_1, a_2, a_3, …, a_{T-1}, a_T.
In the embodiments of the present invention, optionally, the context vector c corresponding to the current window may be determined according to the attention information; the context vector c may be obtained from the state information and the attention information at different times. It is to be understood that the embodiments of the present invention do not limit the specific determination process of the context vector c.
Linear processing may be performed on the context vector c to obtain the probability p(y) that the current window corresponds to the wake-up word. The linear processing may include: normalization and the like. It is understood that the embodiments of the present invention do not impose any limitation on the specific linear processing.
In step 204, when the duration information of the voice stream reaches the preset duration information, the voice stream may be re-timed, i.e., timing restarts from frame 0, and the processing of the second mode may be performed once. After the processing of the second mode has been performed once, the duration determination result becomes no, so step 203 is triggered, and the current window processed by the second mode serves as the previous window of the first mode in step 203. Since the second mode can provide memory information for the first mode (i.e., it supplies h_0 to the first mode: for example, if the window of the second mode is window i and the window of the first mode is window i+1, the h_0 of window i+1 can be provided by window i), the accuracy of voice wake-up can be improved.
In the embodiments of the present invention, the voice stream obtained by VAD is timed for the i-th time, where i may be a natural number greater than 0. At the initial stage of the i-th timing, the duration of the voice stream is short and the determination result is no, so step 203 can be executed, improving the accuracy of voice wake-up in a quiet environment while saving power consumption. As the duration of the voice stream grows during the i-th timing, the determination result changes to yes, which indicates a noisy environment; in this case step 204 can be executed, performing the processing of the second mode once to improve the accuracy of voice wake-up in a noisy environment, and the next timing pass of the voice stream begins. The processing of the second mode can provide memory information for the first mode in the next timing pass, so the accuracy of voice wake-up can be improved.
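Putting the timing logic together, a minimal control loop might look like the following sketch. The function names, the callable interfaces, and the default threshold are hypothetical; `detect_first_mode` and `detect_second_mode` stand for the first-mode and second-mode window processing described above.

```python
def wake_up_detection(windows, detect_first_mode, detect_second_mode,
                      preset_len: int = 120, shift: int = 10,
                      threshold: float = 0.5) -> bool:
    """Mode-switching control loop (a sketch; names and defaults are assumptions).

    windows: iterable of speech-frame windows produced by moving windowing.
    detect_first_mode(window, h0)  -> (probability, h0_for_next_window)
    detect_second_mode(window)     -> (probability, h0_for_next_window)
    """
    elapsed = 0   # frames timed so far in the current timing pass
    h0 = None     # memory the second mode hands over to the first mode
    for window in windows:
        if elapsed < preset_len:
            # duration below the preset: quiet environment assumed -> first mode
            p, h0 = detect_first_mode(window, h0)
            elapsed += shift          # each new window advances the timer
        else:
            # preset reached: noise assumed -> second mode, then re-time
            p, h0 = detect_second_mode(window)
            elapsed = 0               # re-time the voice stream
        if p > threshold:             # probability exceeds the threshold
            return True               # wake-up word detected
    return False
```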
In this embodiment of the present invention, the detection results obtained in step 203 and step 204 may include: yes or no. Optionally, the detection result may be determined according to the probability that the current window corresponds to the wakeup word. For example, if the probability exceeds a threshold, the detection result is yes; or, if the probability does not exceed the threshold, the detection result is no.
Optionally, in a case that the detection result is yes, the electronic device may be waken up, and the wake-up detection process according to the embodiment of the present invention is ended. Optionally, in a case that the detection result is negative, the wake-up detection process according to the embodiment of the present invention may be continued, that is, the step 201 may be returned.
To sum up, the wake-up detection method according to the embodiments of the present invention determines whether the voice stream is in a noisy environment by whether the duration information of the voice stream reaches the preset duration information; if not in a noisy environment, the first mode is adopted. The processing procedure corresponding to the first mode includes: after the encoder outputs the state information, performing moving windowing on the voice stream, and determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window. Since the first mode determines the state information corresponding to the speech frames in the current window from the state information of the speech frames in the previous window, long-term memory information can be taken into account, and the accuracy of voice wake-up in a quiet environment can be improved.
If in a noisy environment, the second mode is adopted. The processing procedure corresponding to the second mode includes: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder. Since the second mode determines the state information corresponding to the speech frames in the current window via the encoder, the influence of noise on the accuracy of voice wake-up can be reduced. In addition, the current window corresponding to the second mode can serve as the previous window corresponding to the first mode, so the second mode can provide memory information for the first mode; this improves the accuracy of voice wake-up in noisy conditions and balances the accuracy of voice wake-up between quiet and noisy conditions.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described action sequences, because some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily required by every embodiment of the invention.
Device embodiment
Referring to fig. 5, a block diagram of a wake-up detection apparatus according to an embodiment of the present invention is shown, which may specifically include:
a voice activation detection module 501, configured to perform voice activation detection on audio to obtain a voice stream in the audio;
a determining module 502, configured to detect duration information of the voice stream;
a first processing module 503, configured to perform wake-up word detection on the voice stream by using a data model according to a first mode when the duration information of the voice stream does not reach preset duration information;
a second processing module 504, configured to perform wake-up word detection on the voice stream by using the data model according to a second mode when the duration information of the voice stream reaches the preset duration information, re-time the voice stream, and return to perform the detection of the duration information of the voice stream;
wherein the data model may include: an encoder and a decoder, wherein the decoder determines the attention information corresponding to the speech frames of the voice stream in a current window according to the state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode may include: after the encoder outputs the state information, performing moving windowing on the voice stream, and determining the state information corresponding to the speech frames in the current window according to the state information of the speech frames in the previous window;
the processing procedure corresponding to the second mode may include: performing moving windowing on the voice stream, inputting the speech frames of the voice stream in the current window into the data model, and determining the state information corresponding to the speech frames in the current window via the encoder.
Optionally, the speech frames in the current window may include: a first speech frame;
the processing procedure corresponding to the first mode may include: determining the state information corresponding to the first speech frame in the current window according to the state information of the first speech frame in the previous window.
Optionally, the speech frames in the current window may further include: a second speech frame;
the processing procedure corresponding to the first mode may further include: determining, via the encoder, the state information corresponding to the second speech frame in the current window.
Optionally, the preset duration information is obtained according to information corresponding to the wakeup word.
Optionally, the determining, according to the attention information and the state information, the probability that the current window corresponds to the wake-up word may include:
determining a context vector corresponding to the current window according to the attention information and the state information;
and determining the probability that the current window corresponds to the wake-up word according to the context vector.
Optionally, the length of the window is obtained from the average of the time information corresponding to the wake-up word in the positive corpus.
Optionally, the ratio between the positive corpus and the negative corpus of the data model ranges from 1:20 to 1:40.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention also provides an apparatus for wake-up detection, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: performing voice activation detection on audio to obtain a voice stream in the audio; detecting duration information of the voice stream; if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream by using a data model according to a first mode; and if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream by using the data model according to a second mode, re-timing the voice stream, and returning to perform the wake-up word detection on the voice stream.
Fig. 6 is a block diagram illustrating an apparatus 900 for wake-up detection as a terminal according to an exemplary embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, apparatus 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the apparatus 900. For example, the sensor component 914 may detect an open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the apparatus 900. The sensor component 914 may also detect a change in the position of the apparatus 900 or of a component of the apparatus 900, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and a change in the temperature of the apparatus 900. The sensor component 914 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the apparatus 900 and other devices. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the apparatus 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 7 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer-readable storage medium is also provided, in which instructions, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a wake-up detection method, the method comprising: performing voice activation detection on audio to obtain a voice stream in the audio; detecting the duration information of the voice stream; if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream using a data model in a first mode; and if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream using a data model in a second mode, restarting the timing of the voice stream, and returning to the wake-up word detection on the voice stream.
An embodiment of the invention discloses A1, a wake-up detection method, comprising the following steps:
performing voice activation detection on audio to obtain a voice stream in the audio;
detecting the duration information of the voice stream;
if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream using a data model in a first mode;
if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream using a data model in a second mode, restarting the timing of the voice stream, and returning to the wake-up word detection on the voice stream;
wherein the data model comprises an encoder and a decoder; the decoder determines attention information corresponding to the voice frames of the voice stream in a current window according to state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises the following steps: after the decoder outputs the state information, performing sliding windowing on the voice stream, and determining the state information corresponding to the voice frames in the current window according to the state information of the voice frames in the previous window;
the processing procedure corresponding to the second mode comprises the following steps: performing sliding windowing on the voice stream, inputting the voice frames of the voice stream in the current window into the data model, and determining the state information corresponding to the voice frames in the current window through the encoder.
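To make the two-mode flow of A1 concrete, the following is a minimal Python sketch of the detection loop. The 10 ms frame step, 84-frame window, 2000 ms preset duration, detection threshold, and the scoring callables score_first_mode and score_second_mode are all illustrative assumptions; the patent does not specify an implementation.

FRAME_MS = 10          # assumed frame step of the voice stream
WINDOW_FRAMES = 84     # assumed window length in frames (cf. A6 below)
PRESET_MS = 2000       # assumed value of the "preset duration information"

def sliding_windows(frames, length=WINDOW_FRAMES, step=1):
    """Yield successive windows of `length` frames, advancing by `step`."""
    for start in range(0, len(frames) - length + 1, step):
        yield frames[start:start + length]

def detect_wake_word(voice_frames, score_first_mode, score_second_mode,
                     threshold=0.5):
    """Two-mode wake-up word detection over a voice stream of feature frames."""
    elapsed_ms = 0                              # duration information so far
    for window in sliding_windows(voice_frames):
        if elapsed_ms < PRESET_MS:
            prob = score_first_mode(window)     # first mode: reuse cached state
        else:
            prob = score_second_mode(window)    # second mode: re-encode window
            elapsed_ms = 0                      # restart the timing
        if prob > threshold:
            return True                         # wake-up word detected
        elapsed_ms += FRAME_MS                  # window advanced by one frame
    return False

The role of the preset duration is visible in the else branch: once the stream has run long enough, the window is scored from freshly encoded state and the timer restarts, which bounds how long the cached state of the first mode is allowed to accumulate.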
A2, according to the method of A1, the speech frames in the current window include: a first speech frame;
the processing procedure corresponding to the first mode comprises the following steps: determining the state information corresponding to the first speech frame in the current window according to the state information of the first speech frame in the previous window.
A3, according to the method of A2, the speech frames in the current window further comprising: a second speech frame;
the processing procedure corresponding to the first mode further includes: determining, via the encoder, the state information corresponding to the second speech frame in the current window.
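A2 and A3 describe incremental encoding: when the window slides, encoder states for the frames shared with the previous window (the first speech frames) are reused, and only the newly entered frames (the second speech frames) pass through the encoder. A minimal sketch, assuming a per-frame encoder callable encode_frame and list-valued state caches, both hypothetical:

def window_states(window, prev_states, encode_frame, step=1):
    """First-mode state computation (sketch of A2/A3).

    The current window overlaps the previous window in all but its last
    `step` frames; states for those overlapping first speech frames are
    copied from `prev_states`, and only the `step` new second speech
    frames are pushed through the per-frame encoder.
    """
    reused = prev_states[step:]                        # shared with previous window
    fresh = [encode_frame(f) for f in window[-step:]]  # encode only the new frames
    return reused + fresh

This presumes frame-local encoder states; a recurrent encoder would instead need its running hidden state carried across windows, which is one plausible reading of the second mode's periodic re-encoding.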
A4, according to the method of A1, the preset duration information is obtained according to the information corresponding to the wake-up word.
A5, according to the method of A1, the determining the probability that the current window corresponds to the wake-up word according to the attention information and the state information includes:
determining a context vector corresponding to the current window according to the attention information and the state information;
and determining the probability of the current window corresponding to the wake-up word according to the context vector.
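The A5 steps amount to attention pooling over the window's state information followed by a binary classifier. The NumPy sketch below uses a dot-product attention score and a sigmoid output; the query vector and the classifier weights w_out, b_out are hypothetical placeholders, as the patent does not fix the score function or the classifier form:

import numpy as np

def window_probability(states, query, w_out, b_out=0.0):
    """Sketch of A5: attention information -> context vector -> probability.

    states: (T, d) array of state information for the T frames in the window
    query:  (d,)   decoder query vector
    w_out:  (d,)   output-layer weights
    """
    scores = states @ query                  # attention information, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the window
    context = weights @ states               # context vector, shape (d,)
    logit = float(context @ w_out) + b_out   # linear classifier on the context
    return 1.0 / (1.0 + np.exp(-logit))      # probability of the wake-up word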
A6, according to the method of any one of A1 to A5, the length of the window is obtained according to the average value of the time information corresponding to the wake-up word in the positive corpus (a sketch follows A7 below).
A7, according to the method of any one of A1 to A5, the ratio of the positive corpus to the negative corpus of the data model ranges from 1:20 to 1:40.
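For A6, a window length in frames can be derived by averaging annotated wake-word durations over the positive corpus; the durations and the 10 ms frame step below are made-up values for illustration. A7's 1:20 to 1:40 positive-to-negative ratio would then govern how many non-wake-word samples are kept per positive sample when assembling training data.

# Toy A6 computation: window length from the mean wake-word duration.
durations_ms = [820, 910, 760, 880]   # assumed wake-word durations in positives
frame_step_ms = 10                    # assumed frame step
window_frames = round(sum(durations_ms) / len(durations_ms) / frame_step_ms)
print(window_frames)                  # -> 84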
An embodiment of the invention discloses B8, a wake-up detection apparatus, the apparatus comprising:
the voice activation detection module is used for carrying out voice activation detection on audio to obtain a voice stream in the audio;
the judging module is used for detecting the duration information of the voice stream;
the first processing module is used for performing wake-up word detection on the voice stream using a data model in a first mode when the duration information of the voice stream does not reach preset duration information;
the second processing module is used for performing wake-up word detection on the voice stream using a data model in a second mode when the duration information of the voice stream reaches the preset duration information, restarting the timing of the voice stream, and returning to the detection of the duration information of the voice stream;
wherein the data model comprises an encoder and a decoder; the decoder determines attention information corresponding to the voice frames of the voice stream in a current window according to state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises the following steps: after the decoder outputs the state information, performing sliding windowing on the voice stream, and determining the state information corresponding to the voice frames in the current window according to the state information of the voice frames in the previous window;
the processing procedure corresponding to the second mode comprises the following steps: performing sliding windowing on the voice stream, inputting the voice frames of the voice stream in the current window into the data model, and determining the state information corresponding to the voice frames in the current window through the encoder.
B9, the apparatus of B8, the speech frames in the current window comprising: a first speech frame;
the processing procedure corresponding to the first mode comprises the following steps: determining the state information corresponding to the first speech frame in the current window according to the state information of the first speech frame in the previous window.
B10, the apparatus of B9, the speech frames in the current window further comprising: a second speech frame;
the processing procedure corresponding to the first mode further includes: determining, via the encoder, the state information corresponding to the second speech frame in the current window.
B11, according to the apparatus of B8, the preset duration information is obtained according to the information corresponding to the wake-up word.
B12, according to the apparatus of B8, the determining the probability that the current window corresponds to the wake-up word according to the attention information and the state information includes:
determining a context vector corresponding to the current window according to the attention information and the state information;
and determining the probability of the current window corresponding to the wake-up word according to the context vector.
B13, according to the apparatus of any one of B8 to B12, the length of the window is obtained according to the average value of the time information corresponding to the wake-up word in the positive corpus.
B14, according to the apparatus of any one of B8 to B12, the ratio of the positive corpus to the negative corpus corresponding to the data model ranges from 1:20 to 1:40.
An embodiment of the invention discloses C15, an apparatus for wake-up detection, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
performing voice activation detection on audio to obtain a voice stream in the audio;
detecting the duration information of the voice stream;
if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream using a data model in a first mode;
if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream using a data model in a second mode, restarting the timing of the voice stream, and returning to the detection of the duration information of the voice stream;
wherein the data model comprises an encoder and a decoder; the decoder determines attention information corresponding to the voice frames of the voice stream in a current window according to state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises the following steps: after the decoder outputs the state information, performing sliding windowing on the voice stream, and determining the state information corresponding to the voice frames in the current window according to the state information of the voice frames in the previous window;
the processing procedure corresponding to the second mode comprises the following steps: performing sliding windowing on the voice stream, inputting the voice frames of the voice stream in the current window into the data model, and determining the state information corresponding to the voice frames in the current window through the encoder.
C16, the apparatus of C15, the speech frames in the current window comprising: a first speech frame;
the processing procedure corresponding to the first mode comprises the following steps: determining the state information corresponding to the first speech frame in the current window according to the state information of the first speech frame in the previous window.
C17, the apparatus according to C16, the speech frames in the current window further comprising: a second speech frame;
the processing procedure corresponding to the first mode further includes: determining, via the encoder, the state information corresponding to the second speech frame in the current window.
C18, according to the apparatus of C15, the preset duration information is obtained according to the information corresponding to the wake-up word.
C19, according to the apparatus of C15, the determining the probability that the current window corresponds to the wake-up word according to the attention information and the state information includes:
determining a context vector corresponding to the current window according to the attention information and the state information;
and determining the probability of the current window corresponding to the wake-up word according to the context vector.
C20, according to the apparatus of any one of C15 to C19, the length of the window is obtained according to the average value of the time information corresponding to the wake-up word in the positive corpus.
C21, according to the apparatus of any one of C15 to C19, the ratio of the positive corpus to the negative corpus corresponding to the data model ranges from 1:20 to 1:40.
An embodiment of the present invention discloses D22, one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a wake-up detection method as described in one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The wake-up detection method, the wake-up detection apparatus, and the machine-readable medium provided by the present invention are described in detail above. Specific examples are applied herein to explain the principles and embodiments of the present invention, and the descriptions of the above embodiments are only intended to help in understanding the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A wake-up detection method, the method comprising:
performing voice activation detection on audio to obtain a voice stream in the audio;
detecting the duration information of the voice stream;
if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream using a data model in a first mode;
if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream using a data model in a second mode, restarting the timing of the voice stream, and returning to the wake-up word detection on the voice stream;
wherein the data model comprises an encoder and a decoder; the decoder determines attention information corresponding to the voice frames of the voice stream in a current window according to state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises the following steps: after the decoder outputs the state information, performing sliding windowing on the voice stream, and determining the state information corresponding to the voice frames in the current window according to the state information of the voice frames in the previous window;
the processing procedure corresponding to the second mode comprises the following steps: performing sliding windowing on the voice stream, inputting the voice frames of the voice stream in the current window into the data model, and determining the state information corresponding to the voice frames in the current window through the encoder.
2. The method of claim 1, wherein the speech frames in the current window comprise: a first speech frame;
the processing procedure corresponding to the first mode comprises the following steps: determining the state information corresponding to the first speech frame in the current window according to the state information of the first speech frame in the previous window.
3. The method of claim 2, wherein the speech frames in the current window further comprise: a second speech frame;
the processing procedure corresponding to the first mode further includes: determining, via the encoder, the state information corresponding to the second speech frame in the current window.
4. The method according to claim 1, wherein the preset duration information is obtained according to information corresponding to the wake-up word.
5. The method of claim 1, wherein the determining the probability that the current window corresponds to the wake-up word according to the attention information and the state information comprises:
determining a context vector corresponding to the current window according to the attention information and the state information;
and determining the probability of the current window corresponding to the wake-up word according to the context vector.
6. The method according to any one of claims 1 to 5, wherein the length of the window is obtained according to the average value of the time information corresponding to the wake-up word in the positive corpus.
7. The method according to any one of claims 1 to 5, wherein the ratio of the positive corpus to the negative corpus corresponding to the data model ranges from 1:20 to 1:40.
8. A wake-up detection apparatus, the apparatus comprising:
the voice activation detection module is used for carrying out voice activation detection on audio to obtain a voice stream in the audio;
the judging module is used for detecting the duration information of the voice stream;
the first processing module is used for performing wake-up word detection on the voice stream using a data model in a first mode when the duration information of the voice stream does not reach preset duration information;
the second processing module is used for performing wake-up word detection on the voice stream using a data model in a second mode when the duration information of the voice stream reaches the preset duration information, restarting the timing of the voice stream, and returning to the detection of the duration information of the voice stream;
wherein the data model comprises an encoder and a decoder; the decoder determines attention information corresponding to the voice frames of the voice stream in a current window according to state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises the following steps: after the decoder outputs the state information, performing sliding windowing on the voice stream, and determining the state information corresponding to the voice frames in the current window according to the state information of the voice frames in the previous window;
the processing procedure corresponding to the second mode comprises the following steps: performing sliding windowing on the voice stream, inputting the voice frames of the voice stream in the current window into the data model, and determining the state information corresponding to the voice frames in the current window through the encoder.
9. An apparatus for wake-up detection, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
performing voice activation detection on audio to obtain a voice stream in the audio;
detecting the duration information of the voice stream;
if the duration information of the voice stream does not reach preset duration information, performing wake-up word detection on the voice stream using a data model in a first mode;
if the duration information of the voice stream reaches the preset duration information, performing wake-up word detection on the voice stream using a data model in a second mode, restarting the timing of the voice stream, and returning to the detection of the duration information of the voice stream;
wherein the data model comprises an encoder and a decoder; the decoder determines attention information corresponding to the voice frames of the voice stream in a current window according to state information output by the encoder, and determines the probability that the current window corresponds to the wake-up word according to the attention information and the state information;
the processing procedure corresponding to the first mode comprises the following steps: after the decoder outputs the state information, performing sliding windowing on the voice stream, and determining the state information corresponding to the voice frames in the current window according to the state information of the voice frames in the previous window;
the processing procedure corresponding to the second mode comprises the following steps: performing sliding windowing on the voice stream, inputting the voice frames of the voice stream in the current window into the data model, and determining the state information corresponding to the voice frames in the current window through the encoder.
10. One or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the wake-up detection method as recited in one or more of claims 1 to 7.
CN201911230226.XA 2019-12-04 2019-12-04 Wake-up detection method, device and medium Active CN110910884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911230226.XA CN110910884B (en) 2019-12-04 2019-12-04 Wake-up detection method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911230226.XA CN110910884B (en) 2019-12-04 2019-12-04 Wake-up detection method, device and medium

Publications (2)

Publication Number Publication Date
CN110910884A true CN110910884A (en) 2020-03-24
CN110910884B CN110910884B (en) 2022-03-22

Family

ID=69822475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911230226.XA Active CN110910884B (en) 2019-12-04 2019-12-04 Wake-up detection method, device and medium

Country Status (1)

Country Link
CN (1) CN110910884B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210406993A1 (en) * 2020-06-29 2021-12-30 Dell Products L.P. Automated generation of titles and descriptions for electronic commerce products

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system
CN109509470A (en) * 2018-12-11 2019-03-22 平安科技(深圳)有限公司 Voice interactive method, device, computer readable storage medium and terminal device
CN109545211A (en) * 2018-12-07 2019-03-29 苏州思必驰信息科技有限公司 Voice interactive method and system
US10332508B1 (en) * 2016-03-31 2019-06-25 Amazon Technologies, Inc. Confidence checking for speech processing and query answering
CN110534099A (en) * 2019-09-03 2019-12-03 腾讯科技(深圳)有限公司 Voice wakes up processing method, device, storage medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10332508B1 (en) * 2016-03-31 2019-06-25 Amazon Technologies, Inc. Confidence checking for speech processing and query answering
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system
CN109545211A (en) * 2018-12-07 2019-03-29 苏州思必驰信息科技有限公司 Voice interactive method and system
CN109509470A (en) * 2018-12-11 2019-03-22 平安科技(深圳)有限公司 Voice interactive method, device, computer readable storage medium and terminal device
CN110534099A (en) * 2019-09-03 2019-12-03 腾讯科技(深圳)有限公司 Voice wakes up processing method, device, storage medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IVRY, AMIR ET AL.: "Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets", IEEE Journal of Selected Topics in Signal Processing *
LIU, KAI: "Research on Voice Wake-up Based on Deep Learning and Its Application", China Masters' Theses Full-text Database (Electronic Journal) *
ZHANG, SHUILI ET AL.: "Design of an Intelligent Household Wake-up System with Voice Capability", Microcomputer Applications *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833902A (en) * 2020-07-07 2020-10-27 Oppo广东移动通信有限公司 Awakening model training method, awakening word recognition device and electronic equipment
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110910884B (en) 2022-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant