WO2023029615A1 - Voice wake-up method, apparatus, device, storage medium and program product - Google Patents

Voice wake-up method, apparatus, device, storage medium and program product

Info

Publication number
WO2023029615A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
posterior probability
bone conduction
wake
word
Prior art date
Application number
PCT/CN2022/095443
Other languages
English (en)
French (fr)
Inventor
李晓建
盛晋珲
郎玥
江巍
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP22862757.6A priority Critical patent/EP4379712A1/en
Publication of WO2023029615A1 publication Critical patent/WO2023029615A1/zh
Priority to US18/591,853 priority patent/US20240203408A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/183Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • the embodiments of the present application relate to the technical field of voice recognition, and in particular, to a voice wake-up method, apparatus, device, storage medium, and program product.
  • a smart device first needs to be woken up by a wake-up word spoken by the user before it can receive instructions and complete tasks.
  • a large number of bone conduction microphones are applied to wearable devices, such as wireless earphones, smart glasses, and smart watches, so that smart devices can be woken up through the wearable devices.
  • the sensor in a bone conduction microphone is a non-acoustic sensor: it collects the vibration signal of the vocal cords when a person speaks and converts that vibration signal into an electrical signal, which is called a bone conduction signal.
  • a bone conduction microphone and an air microphone are installed on a wearable device.
  • the air microphone is in a dormant state until the smart device is woken up. Because the bone conduction microphone consumes little power, it can be used to collect bone conduction signals and perform voice detection based on them (such as voice activation detection, VAD), thereby controlling the switching of the air microphone to reduce power consumption.
  • however, because voice detection has a detection delay, the voice head of the input command word will be truncated; that is, the collected air conduction signal may lose its head and therefore not contain the complete information of the command word input by the sound source. As a result, the recognition accuracy of the wake-up word is low, and the accuracy of voice wake-up is low.
  • Embodiments of the present application provide a voice wake-up method, apparatus, device, storage medium, and program product, which can improve the accuracy of voice wake-up. The technical solution is as follows:
  • in a first aspect, a voice wake-up method is provided, which includes: performing voice detection according to a bone conduction signal collected by a bone conduction microphone, where the bone conduction signal contains the command word information input by a sound source; when a voice input is detected, detecting the wake-up word based on the bone conduction signal; and when it is detected that the command word includes the wake-up word, performing voice wake-up on the device to be woken up.
  • in this solution, a bone conduction microphone is used to collect the bone conduction signal for voice detection, which ensures low power consumption.
  • although the collected air conduction signal may lose its head due to the delay of voice detection, and thus may not contain the complete information of the command word input by the sound source, the bone conduction signal collected by the bone conduction microphone does contain the command word information input by the sound source; that is, the bone conduction signal has not lost its head. This solution therefore detects the wake-up word based on the bone conduction signal, so the recognition accuracy of the wake-up word is higher and the accuracy of voice wake-up is higher.
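As a minimal illustration of this flow, the following Python sketch (illustrative only, not part of the patent text; all function names and thresholds are placeholder assumptions) keeps the bone conduction microphone listening, runs voice detection on its signal, and only then runs wake-up word detection on the buffered bone conduction signal, which still contains the head of the command word:

```python
import numpy as np

def vad(frame, energy_threshold=1e-3):
    """Toy energy-based voice detection on one bone conduction frame."""
    return float(np.mean(frame ** 2)) > energy_threshold

def detect_wake_word(signal):
    """Placeholder: a real detector runs the acoustic model and decoding
    described later in this document on the buffered signal."""
    return False

def voice_wakeup_loop(frame_source, wake_device):
    """frame_source yields bone conduction frames (the bone conduction mic
    is always on, since it consumes little power); wake_device is called
    when the command word is found to include the wake-up word."""
    buffered = []
    for frame in frame_source:
        buffered.append(frame)
        if vad(frame):  # voice input detected in the bone conduction signal
            # the buffer keeps the head of the command word, so wake-up
            # word detection runs on it rather than on a truncated signal
            if detect_wake_word(np.concatenate(buffered)):
                wake_device()
            buffered.clear()
```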
  • detecting the wake-up word based on the bone conduction signal includes: determining a fusion signal based on the bone conduction signal; and detecting the wake-up word on the fusion signal.
  • the fusion signal determined based on the bone conduction signal also includes command word information input by the sound source.
  • in some embodiments, before determining the fusion signal based on the bone conduction signal, the method further includes: turning on the air microphone and collecting the air conduction signal through the air microphone. Determining the fusion signal based on the bone conduction signal then includes: fusing the initial part of the bone conduction signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or generating an enhanced initial signal based on the initial part of the bone conduction signal and fusing the enhanced initial signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or directly fusing the bone conduction signal and the air conduction signal to obtain the fusion signal.
  • that is, the embodiment of the present application provides three ways of using the bone conduction signal to compensate for the head loss of the air conduction signal, performing the head-loss compensation on the air conduction signal directly through explicit signal fusion.
  • in some embodiments, signal fusion is performed by signal splicing, as sketched below.
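For instance, the splicing variant can be written as follows in Python (a hedged illustration, assuming a known detection delay and a common sampling rate; in practice the bone conduction signal may additionally need downsampling and gain adjustment, as in Figs. 8 and 9):

```python
import numpy as np

def splice_head(bone_signal, air_signal, detection_delay_s, fs=16000):
    """Head-loss compensation by signal splicing: prepend the initial part
    of the bone conduction signal (whose length matches the voice-detection
    delay) to the air conduction signal."""
    head_len = int(detection_delay_s * fs)
    head = bone_signal[:head_len]   # initial part set by the detection delay
    return np.concatenate([head, air_signal])

# Example: 200 ms detection delay at 16 kHz (synthetic signals)
bone = np.random.randn(16000)       # 1 s of bone conduction samples
air = np.random.randn(12800)        # air mic started 0.2 s late
fused = splice_head(bone, air, detection_delay_s=0.2)
```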
  • determining the fusion signal based on the bone conduction signal includes: determining the bone conduction signal as the fusion signal. That is, the embodiment of the present application may also directly use the bone conduction signal to detect the wake-up word.
  • in some embodiments, detecting the wake-up word on the fusion signal includes: inputting the multiple audio frames included in the fusion signal into the first acoustic model to obtain multiple posterior probability vectors output by the first acoustic model, where the multiple posterior probability vectors correspond one-to-one to the multiple audio frames, and the first posterior probability vector among them indicates the probability that the phoneme of the first audio frame belongs to each of a plurality of specified phonemes; and detecting the wake-up word based on the multiple posterior probability vectors.
  • that is, the fusion signal is first processed by the first acoustic model to obtain the posterior probability vectors corresponding to the audio frames included in the fusion signal, and then wake-up word detection is performed based on these posterior probability vectors, for example by decoding them.
  • in some embodiments, before detecting the wake-up word based on the bone conduction signal, the method further includes: turning on the air microphone and collecting the air conduction signal through the air microphone. Detecting the wake-up word based on the bone conduction signal then includes: determining a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal, where the plurality of posterior probability vectors correspond to the multiple audio frames included in the bone conduction signal and the air conduction signal, and the first posterior probability vector among them indicates the probability that the phoneme of the first audio frame belongs to each of a plurality of specified phonemes; and detecting the wake-up word based on the plurality of posterior probability vectors.
  • that is, signal fusion may be skipped, and the posterior probability vectors corresponding to each audio frame are determined directly from the bone conduction signal and the air conduction signal, so that the resulting posterior probability vectors implicitly include, in the form of phoneme probabilities, the command word information input by the sound source; in other words, the bone conduction signal is implicitly used to compensate the air conduction signal for head loss.
  • in some embodiments, determining a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal includes: inputting the initial part of the bone conduction signal and the air conduction signal into the second acoustic model to obtain a first number of bone conduction posterior probability vectors and a second number of air conduction posterior probability vectors, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection, the first number of bone conduction posterior probability vectors correspond one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors correspond one-to-one to the audio frames included in the air conduction signal; and fusing the first bone conduction posterior probability vector and the first air conduction posterior probability vector to obtain a second posterior probability vector, where the first bone conduction posterior probability vector corresponds to the last audio frame of the initial part of the bone conduction signal, whose duration is less than the frame duration, and the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, whose duration is also less than the frame duration.
  • that is, the initial part of the bone conduction signal and the air conduction signal can each be processed by the second acoustic model to obtain the corresponding bone conduction and air conduction posterior probability vectors; by then fusing the first bone conduction posterior probability vector with the first air conduction posterior probability vector, the bone conduction signal is implicitly used to perform head-loss compensation on the air conduction signal. A sketch of one plausible fusion follows below.
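One plausible fusion of the two boundary posterior vectors is a duration-weighted average followed by renormalization, sketched below in Python; the weighting rule is an assumption for illustration, since the text does not fix the exact fusion formula:

```python
import numpy as np

def fuse_posteriors(bone_vec, air_vec, bone_dur, air_dur):
    """Fuse the posterior vector of the last (partial) frame of the bone
    conduction head with that of the first (partial) air conduction frame,
    weighting each by its sub-frame duration."""
    w = bone_dur / (bone_dur + air_dur)
    fused = w * bone_vec + (1.0 - w) * air_vec
    return fused / fused.sum()   # keep it a probability distribution

bone_vec = np.array([0.7, 0.2, 0.1])   # P(phoneme | last head frame)
air_vec = np.array([0.5, 0.3, 0.2])    # P(phoneme | first air frame)
second_posterior = fuse_posteriors(bone_vec, air_vec,
                                   bone_dur=0.010, air_dur=0.015)
```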
  • in some embodiments, determining a plurality of posterior probability vectors includes: inputting the initial part of the bone conduction signal and the air conduction signal into the third acoustic model to obtain a plurality of posterior probability vectors output by the third acoustic model, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or inputting the bone conduction signal and the air conduction signal into the third acoustic model to obtain a plurality of posterior probability vectors output by the third acoustic model.
  • that is, the initial part of the bone conduction signal and the air conduction signal may be input into the third acoustic model, and the posterior probability vectors are obtained directly from the third acoustic model. In the course of the third acoustic model processing these two signals, they are implicitly fused; in other words, the bone conduction signal is implicitly used to perform head-loss compensation on the air conduction signal.
  • in some embodiments, detecting the wake-up word based on the multiple posterior probability vectors includes: determining, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and when the confidence exceeds the confidence threshold, determining that the command word includes the wake-up word.
  • that is, the confidence is obtained by decoding the multiple posterior probability vectors, and the confidence threshold is then used to judge whether the command word includes the wake-up word; when the confidence condition is satisfied, it is determined that the command word contains the wake-up word.
  • in other embodiments, detecting the wake-up word based on the multiple posterior probability vectors includes: determining, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and when the confidence exceeds the confidence threshold and the distance condition between the multiple posterior probability vectors and the multiple template vectors is satisfied, determining that the command word includes the wake-up word, where the multiple template vectors indicate the probability that the phonemes of a voice signal containing the complete information of the wake-up word belong to the plurality of specified phonemes. That is, only when the confidence condition is met and the template matches is it determined that the command word contains the wake-up word, so as to avoid false wake-ups as much as possible.
  • in some embodiments, the distance condition includes: the average of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold. That is, whether the template matches can be judged from the average distance between the vectors, as sketched below.
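A minimal Python sketch of this distance condition follows; the Euclidean distance and the assumption that the posterior sequence has already been aligned one-to-one with the template sequence are illustrative choices, not fixed by the text:

```python
import numpy as np

def template_match(posteriors, templates, distance_threshold):
    """Distance condition: the mean distance between each posterior
    probability vector and its corresponding template vector must be
    below the threshold (sequences assumed aligned and equal length)."""
    posteriors = np.asarray(posteriors)
    templates = np.asarray(templates)
    dists = np.linalg.norm(posteriors - templates, axis=1)
    return float(dists.mean()) < distance_threshold
```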
  • in some embodiments, the method further includes: acquiring a bone conduction registration signal that includes the complete information of the wake-up word; and determining the confidence threshold and the multiple template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word. That is, during the registration process of the wake-up word, the embodiment of the present application can determine the confidence threshold and the template vectors from a bone conduction registration signal containing the complete information of the wake-up word, and then use them to detect the wake-up word in the subsequent voice wake-up process, which improves the accuracy of wake-up word detection and thereby reduces false wake-ups.
  • in some embodiments, determining the confidence threshold and the multiple template vectors includes: determining a fusion registration signal based on the bone conduction registration signal; and determining the confidence threshold and the multiple template vectors based on the fusion registration signal and the phoneme sequence corresponding to the wake-up word. That is, during the registration of the wake-up word, the fusion registration signal, which contains the complete information of the command word input by the sound source, can first be obtained through signal fusion, and the confidence threshold and the template vectors are then determined from it.
  • in some embodiments, determining the confidence threshold and the multiple template vectors includes: inputting the multiple registration audio frames included in the fusion registration signal into the first acoustic model to obtain multiple registration posterior probability vectors output by the first acoustic model, where the registration posterior probability vectors correspond one-to-one to the registration audio frames, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame belongs to each of the plurality of specified phonemes; determining the multiple registration posterior probability vectors as the multiple template vectors; and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
  • that is, the fusion registration signal can likewise be processed by the first acoustic model to obtain the registration posterior probability vectors corresponding to the registration audio frames it includes. The registration posterior probability vectors are decoded to determine the confidence threshold, and they may also be determined directly as the multiple template vectors.
  • in some embodiments, determining the confidence threshold includes: determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, where the registration posterior probability vectors correspond one-to-one to the registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame belongs to each of the plurality of specified phonemes; and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word. That is, during the registration of the wake-up word, signal fusion may be skipped and the registration posterior probability vectors determined directly from the two registration signals.
  • in a second aspect, a voice wake-up device is provided, which has the function of implementing the voice wake-up method of the first aspect above. The voice wake-up device includes one or more modules, which are used to implement the voice wake-up method provided in the first aspect above.
  • in one implementation, the voice wake-up device includes:
  • the voice detection module is used to perform voice detection according to the bone conduction signal collected by the bone conduction microphone, and the bone conduction signal contains command word information input by the sound source;
  • the wake-up word detection module is used to detect the wake-up word based on the bone conduction signal when a voice input is detected;
  • the voice wake-up module is configured to perform voice wake-up on the device to be woken up when it is detected that the command word includes a wake-up word.
  • the wake-up word detection module includes:
  • the first determining submodule is used to determine the fusion signal based on the bone conduction signal
  • the wake-up word detection submodule is used to detect the wake-up word on the fusion signal.
  • the device also includes:
  • the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
  • the first determining submodule is used for: fusing the initial part of the bone conduction signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or
  • generating an enhanced initial signal based on the initial part of the bone conduction signal and fusing the enhanced initial signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or
  • directly fusing the bone conduction signal and the air conduction signal to obtain the fusion signal.
  • the wake-up word detection submodule is used for: inputting the multiple audio frames included in the fusion signal into the first acoustic model to obtain the multiple posterior probability vectors output by the first acoustic model,
  • where the multiple posterior probability vectors correspond one-to-one to the multiple audio frames,
  • and the first posterior probability vector among the multiple posterior probability vectors indicates the probability that the phoneme of the first audio frame belongs to each of the plurality of specified phonemes;
  • and performing wake-up word detection based on the multiple posterior probability vectors.
  • the device also includes:
  • the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
  • the wake word detection module includes:
  • the second determining submodule is used to determine a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal, where the plurality of posterior probability vectors correspond one-to-one to the multiple audio frames included in the bone conduction signal and the air conduction signal,
  • and the first posterior probability vector among the plurality of posterior probability vectors indicates the probability that the phoneme of the first audio frame belongs to each of the plurality of specified phonemes;
  • the wake-up word detection submodule is configured to detect wake-up words based on the plurality of posterior probability vectors.
  • the second determining submodule is used for:
  • inputting the initial part of the bone conduction signal and the air conduction signal into the second acoustic model to obtain the first number of bone conduction posterior probability vectors and the second number of air conduction posterior probability vectors output by the second acoustic model, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection, the first number of bone conduction posterior probability vectors correspond one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors correspond one-to-one to the audio frames included in the air conduction signal;
  • fusing the first bone conduction posterior probability vector and the first air conduction posterior probability vector to obtain the second posterior probability vector, where the first bone conduction posterior probability vector corresponds to the last audio frame of the initial part of the bone conduction signal, whose duration is less than the frame duration, and the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, whose duration is less than the frame duration;
  • the plurality of posterior probability vectors then include the second posterior probability vector, the vectors among the first number of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and the vectors among the second number of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
  • the second determining submodule is used for: inputting the initial part of the bone conduction signal and the air conduction signal into the third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or
  • inputting the bone conduction signal and the air conduction signal into the third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model.
  • the wake-up word detection submodule is used for: determining, based on the plurality of posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and when the confidence exceeds the confidence threshold, determining that the command word includes the wake-up word.
  • alternatively, the wake-up word detection submodule is used for: determining the confidence as above; and when the confidence exceeds the confidence threshold and the distance condition between the plurality of posterior probability vectors and the plurality of template vectors is satisfied, determining that the command word includes the wake-up word, where the plurality of template vectors indicate the probability that the phonemes of a voice signal containing the complete information of the wake-up word belong to the plurality of specified phonemes.
  • the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold.
  • the device also includes:
  • the obtaining module is used to obtain the bone conduction registration signal, and the bone conduction registration signal includes complete information of the wake-up word;
  • the determination module is configured to determine a confidence threshold and a plurality of template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
  • the determination module includes:
  • the third determining submodule is used to determine the fusion registration signal based on the bone conduction registration signal
  • the fourth determining submodule is configured to determine a confidence threshold and a plurality of template vectors based on the fused registration signal and the phoneme sequence corresponding to the wake-up word.
  • the fourth determining submodule is used for: inputting the multiple registration audio frames included in the fusion registration signal into the first acoustic model to obtain the multiple registration posterior probability vectors output by the first acoustic model,
  • where the first registration posterior probability vector among the multiple registration posterior probability vectors indicates the probability that the phoneme of the first registration audio frame belongs to each of the plurality of specified phonemes; determining the multiple registration posterior probability vectors as the multiple template vectors;
  • and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
  • the device also includes:
  • the acquisition module is further configured to acquire an air conduction registration signal;
  • the determination module includes:
  • the fifth determining submodule is used to determine a plurality of registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, where the plurality of registration posterior probability vectors correspond one-to-one to the multiple registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame belongs to each of the plurality of specified phonemes;
  • the sixth determining submodule is configured to determine a confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
  • in a third aspect, an electronic device is provided, which includes a processor and a memory. The memory is used to store a program for executing the voice wake-up method provided in the first aspect above, and to store the data involved in implementing the voice wake-up method provided in the first aspect above.
  • the processor is configured to execute programs stored in the memory.
  • the electronic device may further include a communication bus for establishing a connection between the processor and the memory.
  • in a fourth aspect, a computer-readable storage medium is provided, in which instructions are stored; when the instructions are run on a computer, the computer is made to execute the voice wake-up method described in the first aspect above.
  • in a fifth aspect, a computer program product containing instructions is provided; when it runs on a computer, it causes the computer to execute the voice wake-up method described in the first aspect above.
  • Fig. 1 is a schematic structural diagram of an acoustic model provided by an embodiment of the present application
  • FIG. 2 is a system architecture diagram involved in a voice wake-up method provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • FIG. 4 is a flow chart of a voice wake-up method provided in an embodiment of the present application.
  • Fig. 5 is a schematic diagram of the principle of bone conduction signal and air conduction signal generation provided by the embodiment of the present application;
  • FIG. 6 is a signal timing diagram provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a signal splicing method provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of downsampling a bone conduction signal provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of gain adjustment for bone conduction signals provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a method for training and generating a network model provided by an embodiment of the present application.
  • Fig. 11 is a schematic structural diagram of another acoustic model provided by the embodiment of the present application.
  • Fig. 12 is a schematic structural diagram of another acoustic model provided by the embodiment of the present application.
  • Fig. 13 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • FIG. 14 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • Fig. 15 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • Fig. 16 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • FIG. 17 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • FIG. 18 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • FIG. 19 is a flow chart of a method for registering a wake-up word provided in an embodiment of the present application.
  • FIG. 20 is a flow chart of another wake-up word registration method provided by the embodiment of the present application.
  • Fig. 21 is a flow chart of another wake-up word registration method provided by the embodiment of the present application.
  • Fig. 22 is a flow chart of another wake-up word registration method provided by the embodiment of the present application.
  • Fig. 23 is a flow chart of another wake-up word registration method provided by the embodiment of the present application.
  • Fig. 24 is a flow chart of another wake-up word registration method provided by the embodiment of the present application.
  • Fig. 25 is a schematic diagram of a method for training a first acoustic model provided by an embodiment of the present application.
  • Fig. 26 is a schematic diagram of another method for training the first acoustic model provided by the embodiment of the present application.
  • Fig. 27 is a schematic diagram of another method for training the first acoustic model provided by the embodiment of the present application.
  • Fig. 28 is a schematic diagram of another method for training the first acoustic model provided by the embodiment of the present application.
  • Fig. 29 is a schematic diagram of a method for training a second acoustic model provided by an embodiment of the present application.
  • Fig. 30 is a schematic diagram of a method for training a third acoustic model provided by an embodiment of the present application.
  • Fig. 31 is a schematic structural diagram of a voice wake-up device provided by an embodiment of the present application.
  • Speech recognition, also known as automatic speech recognition (ASR), refers to a computer recognizing the vocabulary content contained in a speech signal.
  • Voice wake-up, also known as keyword spotting (KWS), wake-up word detection, or wake-up word recognition, refers to detecting the wake-up word in a continuous voice stream in real time and waking up the smart device when it detects that the command word input by the sound source is the wake-up word.
  • Deep learning (DL): a class of machine learning algorithms based on learning representations of data.
  • Voice activation detection (voice activate detector, VAD)
  • VAD is used to judge when there is voice input and when the signal is silent, and to intercept the valid segments that contain voice input. The subsequent speech recognition operations are performed on the valid segments intercepted by the VAD, which reduces the noise misrecognition rate of the speech recognition system as well as system power consumption.
  • VAD is an implementation manner of voice detection. This embodiment of the present application uses VAD as an example to introduce voice detection. In other embodiments, voice detection may also be performed in other ways.
  • the first step is to detect whether there is speech input, i.e., voice activation detection (VAD).
  • the recognition system mainly includes feature extraction, recognition modeling and decoding to obtain recognition results.
  • model training includes acoustic model training, language model training, and the like.
  • Speech recognition is essentially the process of converting an audio sequence to a text sequence, that is, to find the text sequence with the highest probability given a speech input.
  • the speech recognition problem can be decomposed into the conditional probability of the occurrence of this speech in a given text sequence and the prior probability of the occurrence of the text sequence.
  • the model obtained by modeling the conditional probability is the acoustic model.
  • the model obtained by modeling the prior probability of the text sequence is the language model.
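In the standard notation, with X the audio sequence and W a candidate text sequence, this decomposition reads:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```

The denominator P(X) is independent of W and can be dropped from the maximization.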
  • Phoneme: the pronunciation of a word is composed of phonemes, which are basic units of sound.
  • An English phoneme set, i.e., a pronunciation dictionary, is one example. Some phoneme sets include more phonemes; for example, a phoneme set may include 100 phonemes.
  • State: a unit of speech more fine-grained than a phoneme; a phoneme is usually divided into three states. In the embodiment of the present application, one audio frame corresponds to one phoneme, and several phonemes form a word.
  • the probability that the phoneme corresponding to each audio frame is each phoneme in the phoneme set can be obtained through the acoustic model; that is, the posterior probability vector corresponding to each audio frame can be obtained.
  • a decoding map (also called a state network, search space, etc.) is constructed based on the language model, pronunciation dictionary, etc.
  • the posterior probability vectors are used as the input of the decoding graph, and the optimal path, i.e., the path on which the probability of the phonemes corresponding to the speech is largest, is searched in the decoding graph.
  • the phoneme corresponding to each audio frame can then be known; that is, the best word string for speech recognition can be obtained.
  • the process of searching the optimal path in the state network to obtain the word string can be regarded as a kind of decoding, and the decoding is to determine what the word string corresponding to the speech signal is.
  • in the embodiment of the present application, the probabilities of each phoneme on the decoding path are looked up in the decoding graph and added together to obtain a path score, where the decoding path refers to the phoneme sequence corresponding to the wake-up word.
  • if the path score is large enough, it is considered that the command word is detected to include the wake-up word. That is, decoding in the embodiment of the present application judges, based on the decoding graph, whether the word string corresponding to the voice signal is the wake-up word; a toy illustration follows below.
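The following Python sketch illustrates this path scoring; splitting the frames evenly across the wake-up word's phonemes stands in for the real graph search and is an illustrative assumption, not the patent's alignment method:

```python
import numpy as np

def path_score(posteriors, wakeword_phonemes):
    """Toy path score for the wake-word decoding path: look up the
    probability of each wake-word phoneme in the posterior vectors of
    its (naively assigned) frames and add the per-phoneme probabilities.
    `wakeword_phonemes` holds indices into the phoneme set."""
    n = len(posteriors) // len(wakeword_phonemes)
    score = 0.0
    for i, ph in enumerate(wakeword_phonemes):
        segment = posteriors[i * n:(i + 1) * n]
        score += max(frame_vec[ph] for frame_vec in segment)
    return score   # compared against a confidence threshold

# 6 frames of posteriors over a 3-phoneme set; wake word = phonemes [0, 2]
posteriors = np.random.dirichlet(np.ones(3), size=6)
print(path_score(posteriors, wakeword_phonemes=[0, 2]))
```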
  • the acoustic model involved in the embodiment of the present application is introduced first.
  • the acoustic model is a model capable of recognizing a single phoneme, and can be modeled using a hidden Markov model (HMM).
  • the acoustic model is a trained model, and the acoustic model can be trained by using the acoustic features of the sound signal and corresponding labels.
  • the corresponding probability distribution between the acoustic signal and the modeling unit is established.
  • the modeling unit may be an HMM state, a phoneme, a syllable, a word, etc., and can also be called the pronunciation unit.
  • typical acoustic model structures include GMM-HMM, DNN-HMM, DNN-CTC, etc., where GMM means Gaussian mixture model, DNN means deep neural network, and CTC means connectionist temporal classification.
  • in the embodiment of the present application, the modeling unit is a phoneme, and a DNN-HMM acoustic model is used as an example for the introduction.
  • the acoustic model can process the audio frame by frame, and output the probability that the phoneme of each audio frame belongs to multiple specified phonemes, and the multiple specified phonemes can be determined according to the pronunciation dictionary.
  • for example, the pronunciation dictionary includes 100 phonemes, and the multiple specified phonemes are those 100 phonemes.
  • Fig. 1 is a schematic structural diagram of an acoustic model provided by an embodiment of the present application.
  • the acoustic model is a DNN-HMM model
  • in this example, the dimension of the input layer of the acoustic model is 3, the dimension of each of the two hidden layers is 5, and the dimension of the output layer is 3. The dimension of the input layer represents the feature dimension of the input signal, and the dimension of the output layer represents three state dimensions, where each state dimension includes the probabilities corresponding to the multiple specified phonemes; a toy forward pass with these dimensions is sketched below.
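A toy forward pass with exactly these dimensions can be written as follows (random, untrained weights purely for illustration; a real acoustic model is trained as described later in this document):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Dimensions as in Fig. 1: input 3, two hidden layers of 5, output 3
W1, b1 = rng.standard_normal((5, 3)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 5)), np.zeros(5)
W3, b3 = rng.standard_normal((3, 5)), np.zeros(3)

def acoustic_model(features):
    """Map one frame's 3-dim acoustic features to a 3-dim posterior vector."""
    h = relu(W1 @ features + b1)
    h = relu(W2 @ h + b2)
    return softmax(W3 @ h + b3)   # posterior probabilities over output units

posterior = acoustic_model(rng.standard_normal(3))
```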
  • Decoding in speech recognition can be divided into dynamic decoding and static decoding.
  • dynamic decoding the language score is dynamically searched in the language model centered on the dictionary tree.
  • Static decoding means that the language model is statically compiled into the decoding graph in advance, and decoding efficiency is improved through a series of optimization operations such as determinization, weight pushing, and minimization.
  • static decoding is adopted in the embodiment of the present application, using a weighted finite state transducer (WFST); redundant information is eliminated in static decoding based on the HCLG network.
  • generating the HCLG network requires the language model (G), the pronunciation dictionary (L), the context dependency (C), and the acoustic model (H) to be expressed in the corresponding FST format and then compiled into one large decoding graph through operations such as composition, determinization, and minimization, for example HCLG = asl(min(rds(det(H' ∘ min(det(C ∘ min(det(L ∘ G)))))))), where asl means adding self-loops, min means minimization, rds means removing disambiguation symbols, det means determinization, H' means the HMM with self-loops removed, and ∘ means composition.
  • the Viterbi (viterbi) algorithm is used to find the optimal path in the decoding graph, and there will be no two identical paths in the decoding graph.
  • Accumulated beam pruning is adopted in the decoding process, that is, the beam value is subtracted from the current maximum probability path score as a threshold, and the paths smaller than the threshold are pruned.
  • in the frame-synchronous decoding algorithm, the start node of the decoding graph is found and a token is created for it; the token of the start node is first expanded along the empty arcs (that is, arcs whose input does not correspond to a real modeling unit), each reachable node is bound to a corresponding token, and pruning keeps the active tokens.
  • then a token is taken from the currently active tokens and its node is expanded along the subsequent non-empty arcs (that is, arcs whose input corresponds to a real physical modeling unit); all active tokens are traversed, and pruning keeps the active tokens of the current frame. These steps are repeated until all audio frames have been expanded; the token with the highest score is then found, and the final recognition result is obtained by backtracking, as sketched in the toy code below.
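The following Python sketch reduces this procedure to its core, keeping only non-empty-arc expansion and accumulated beam pruning; epsilon-arc expansion and the backtracking of the full decoder are omitted for brevity:

```python
import math

def beam_prune(tokens, beam):
    """Accumulated beam pruning: subtract the beam value from the current
    best path score to get a threshold and drop tokens scoring below it."""
    best = max(tokens.values())
    return {node: s for node, s in tokens.items() if s >= best - beam}

def token_passing(frame_logprobs, arcs, start_node, beam=10.0):
    """Toy frame-synchronous token passing. `arcs` maps node -> list of
    (next_node, label) non-empty arcs; `frame_logprobs` is one dict per
    audio frame mapping label -> log-probability."""
    tokens = {start_node: 0.0}
    for logp in frame_logprobs:                 # one expansion per frame
        new_tokens = {}
        for node, score in tokens.items():
            for nxt, label in arcs.get(node, []):
                s = score + logp.get(label, -math.inf)
                if s > new_tokens.get(nxt, -math.inf):
                    new_tokens[nxt] = s         # keep the best path per node
        tokens = beam_prune(new_tokens, beam)
    return max(tokens.items(), key=lambda kv: kv[1])  # highest-scoring token
```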
  • the network model refers to the above-mentioned acoustic model.
  • network models include the hidden Markov model (HMM), Gaussian mixture model (GMM), deep neural network (DNN), deep belief networks-hidden Markov model (DBN-HMM), recurrent neural network (RNN), long short-term memory (LSTM) network, convolutional neural network (CNN), etc.
  • CNN and HMM are used in the embodiment of this application.
  • the hidden Markov model is a statistical model, which is currently mostly used in the field of speech signal processing.
  • in this model, whether a state in the Markov chain transitions to another state depends on the state transition probability, and the observation produced by a given state depends on the state generation probability.
  • when performing speech recognition, an HMM first establishes a vocalization model for each recognition unit, obtains a state transition probability matrix and an output probability matrix through long-term training, and during recognition makes a judgment based on the maximum probability along the state transition process.
  • the basic structure of a convolutional neural network includes two parts. One part is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and local features are extracted.
  • the other part is the feature map layer.
  • Each calculation layer of the network is composed of multiple feature maps. Each feature map is a plane, and the weights of all neurons on the plane are equal.
  • the feature map structure uses a function with a small influence-function kernel (such as the sigmoid) as the activation function of the convolutional network, so that the feature maps have displacement invariance.
  • because the neurons on a mapping plane share weights, the number of free parameters of the network is reduced.
  • each convolutional layer in the convolutional neural network can be followed by a calculation layer for local averaging and secondary extraction; this two-stage feature extraction structure reduces the feature resolution.
  • the loss function is the iterative basis for the network model in training.
  • the loss function is used to evaluate the degree of difference between the predicted value of the network model and the real value, and the choice of the loss function affects the performance of the network model.
  • the loss functions used by different network models are generally different. Loss functions can be divided into empirical risk loss functions and structural risk loss functions.
  • the empirical risk loss function refers to the difference between the predicted result and the actual result, and the structural risk loss function refers to the empirical risk loss function plus a regular term.
  • the embodiment of the present application uses a cross-entropy loss function (cross-entropy loss function), that is, a CE loss function.
  • the cross-entropy loss function is essentially a log-likelihood function that can be used in binary and multi-classification tasks.
  • the cross-entropy loss function is often used instead of the mean-square-error loss function because it avoids the problem that weight updates under the squared loss become too slow: with the cross-entropy loss, when the error is large the weights update quickly, and when the error is small they update slowly, which is a desirable property. The definition is recalled below.
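For reference, for a one-hot label vector y and predicted distribution ŷ, the cross-entropy loss is:

```latex
\mathcal{L}_{\mathrm{CE}} = -\sum_{i} y_i \log \hat{y}_i
```

With a softmax output layer, the gradient of this loss with respect to the logits is simply ŷ − y, so the update magnitude scales with the error; this is why the updates do not stall the way squared error does when a sigmoid saturates.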
  • the error is backpropagated, and the network parameters are adjusted using the loss function and the gradient descent method.
  • the gradient descent method is an optimization algorithm, the central idea is to update the parameter values along the direction of the gradient of the objective function in order to achieve the minimum (or maximum) of the objective function.
  • Gradient descent is a commonly used optimization algorithm in deep learning. Gradient descent is aimed at the loss function, and the purpose is to find the weight and bias corresponding to the minimum value of the loss function as soon as possible.
  • the core of the backpropagation algorithm is to define a special variable, the neuron error. Starting from the output layer, the neuron error is backpropagated layer by layer, and the partial derivatives of the weights and biases are then computed from it by formula. Gradient descent is a way to solve the minimization problem, while backpropagation is a way to compute the gradients; the update rule is recalled below.
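The gradient descent update referred to above is, in standard form (with learning rate η and parameters θ):

```latex
\theta \leftarrow \theta - \eta\,\nabla_{\theta}\,\mathcal{L}(\theta)
```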
  • FIG. 2 is a system architecture diagram involved in a voice wake-up method provided by an embodiment of the present application.
  • the system architecture includes a wearable device 201 and a smart device 202 .
  • the wearable device 201 and the smart device 202 are connected in a wired or wireless manner for communication.
  • the smart device 202 is the device to be woken up in the embodiment of the present application.
  • the wearable device 201 is configured to receive a voice signal, and send an instruction to the smart device 202 based on the received voice signal.
  • the smart device 202 is configured to receive instructions sent by the wearable device 201, and perform corresponding operations based on the received instructions.
  • the wearable device 201 is used to collect voice signals, detect command words included in the collected voice signals, and send a wake-up instruction to the smart device 202 to wake up the smart device 202 if it is detected that the command words include a wake-up word.
  • the smart device 202 is configured to enter a working state from a sleep state after receiving a wake-up instruction.
  • the wearable device 201 is equipped with a bone conduction microphone, and due to the low power consumption of the bone conduction microphone, the bone conduction microphone can always work.
  • the bone conduction microphone is used to collect bone conduction signals in the working state.
  • the processor in the wearable device 201 performs voice activation detection based on the bone conduction signal, so as to detect whether there is voice input.
  • the processor detects the wake-up word based on the bone conduction signal, so as to detect whether the command word input by the sound source includes the wake-up word.
  • voice wake-up is performed, that is, the wearable device 201 sends a wake-up instruction to the smart device 202 .
  • the wearable device 201 is such as a wireless earphone, smart glasses, a smart watch, a smart bracelet, and the like.
  • the smart device 202 (that is, the device to be woken up) includes a smart speaker, a smart home appliance, a smart toy, a smart robot, and the like.
  • in some embodiments, the wearable device 201 and the smart device 202 are the same device.
  • FIG. 3 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
  • the electronic device is the wearable device 201 shown in FIG. 2 .
  • the electronic device includes one or more processors 301 , a communication bus 302 , a memory 303 , one or more communication interfaces 304 , a bone conduction microphone 308 and an air microphone 309 .
  • the processor 301 is a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits for realizing the solution of the present application, for example an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. Optionally, the above-mentioned PLD is a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the communication bus 302 is used to transfer information between the aforementioned components.
  • optionally, the communication bus 302 is divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is drawn in the figure, but this does not mean that there is only one bus or one type of bus.
  • the memory 303 is a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), an optical disc (including a compact disc read-only memory (CD-ROM), a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 303 exists independently and is connected to the processor 301 through the communication bus 302 , or, the memory 303 and the processor 301 are integrated together.
  • the communication interface 304 uses any transceiver-like device for communicating with other devices or a communication network.
  • the communication interface 304 includes a wired communication interface, and optionally, also includes a wireless communication interface.
  • the wired communication interface is, for example, an Ethernet interface.
  • the Ethernet interface is an optical interface, an electrical interface or a combination thereof.
  • the wireless communication interface is a wireless local area network (wireless local area networks, WLAN) interface, a cellular network communication interface, or a combination thereof.
  • in some embodiments, the electronic device includes multiple processors, such as the processor 301 and the processor 305 shown in FIG. 3.
  • Each of these processors is a single-core processor, or a multi-core processor.
  • a processor here refers to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the electronic device further includes an output device 306 and an input device 307 .
  • Output device 306 is in communication with processor 301 and can display information in a variety of ways.
  • the output device 306 is a liquid crystal display (liquid crystal display, LCD), a light emitting diode (light emitting diode, LED) display device, a cathode ray tube (cathode ray tube, CRT) display device, or a projector (projector).
  • the input device 307 communicates with the processor 301 and can receive user input in various ways.
  • the input device 307 includes one or more of a mouse, a keyboard, a touch screen device, or a sensing device.
  • the input device 307 includes a bone conduction microphone 308 and an air microphone 309, and the bone conduction microphone 308 and the air microphone 309 are used to collect bone conduction signals and air conduction signals respectively.
  • the processor 301 is configured to wake up the smart device based on the bone conduction signal or based on the bone conduction signal and the air conduction signal through the voice wake-up method provided in the embodiment of the present application.
  • the processor 301 is further configured to control the smart device to perform tasks based on the bone conduction signal, or the air conduction signal, or both the bone conduction signal and the air conduction signal.
  • the memory 303 is used to store the program code 310 for implementing the solution of the present application, and the processor 301 can execute the program code 310 stored in the memory 303 .
  • the program code 310 includes one or more software modules, and the electronic device can implement the voice wake-up method provided in the embodiment of FIG. 4 below through the processor 301 and the program code 310 in the memory 303 .
  • Fig. 4 is a flow chart of a voice wake-up method provided by an embodiment of the present application, and the method is applied to a wearable device. Please refer to FIG. 4 , the method includes the following steps.
  • Step 401: Perform voice detection according to the bone conduction signal collected by the bone conduction microphone, where the bone conduction signal includes the command word information input by the sound source.
  • in the embodiment of the present application, the bone conduction microphone can be used to collect the bone conduction signal, and voice detection (such as voice activation detection, VAD) is performed on the bone conduction signal to detect whether there is voice input.
  • before a voice input is detected, the components of the wearable device other than the bone conduction microphone can be kept in a sleep state to reduce power consumption, and the other components of the wearable device are turned on once a voice input is detected.
  • when the wearable device is equipped with an air microphone, the switching of the air microphone is controlled because the air microphone is a high-power-consumption device: for portable wearable devices, to reduce power consumption, the air microphone is turned on for the sound pickup operation (that is, air conduction signal collection) only when a voice input is detected (for example, the user is speaking). In other words, before the smart device is woken up, the air microphone is in a dormant state to reduce power consumption, and it is turned on when a voice input is detected.
  • the wearable device may perform voice activation detection according to the bone conduction signal collected by the bone conduction microphone, which is not limited in this embodiment of the present application.
  • voice activation detection is mainly used to detect whether there is a human voice signal in the current input signal.
  • voice activation detection distinguishes speech segments from non-speech segments (such as segments with only various background noise signals) by judging the input signal, so that different processing methods can be adopted for each segment of the signal.
  • Voice activation detection detects whether there is a voice input by extracting features of the input signal. For example, the short-time energy (short time energy, STE) and short-time zero-crossing rate (zero cross counter, ZCC) features of each input frame are extracted to detect whether there is a speech input; that is, voice activation detection is performed based on energy features.
  • The short-time energy refers to the energy of one frame of the signal, and the zero-crossing rate refers to the number of times that one frame of the time-domain signal crosses zero (the time axis).
  • some high-precision VADs will extract multiple features such as energy-based features, frequency-domain features, cepstrum features, harmonic features, and long-term features for comprehensive detection.
  • Threshold comparison, statistical methods, or machine learning methods may also be combined to determine whether a frame of the input signal is a speech signal or a non-speech signal.
  • Next, energy-based features, frequency-domain features, cepstral features, harmonic features, and long-term features are briefly introduced.
  • Energy-based features: VAD is performed based on the two features STE and ZCC. In scenarios with a high signal-noise ratio (SNR), the STE of a speech segment is relatively large and its ZCC is relatively small, while the STE of a non-speech segment is relatively small and its ZCC is relatively large.
  • the speech signal with human voice usually has high energy, and most of the energy is contained in the low frequency band, while the noise signal usually has low energy and contains more information in the high frequency band. Therefore, the speech signal and the non-speech signal can be distinguished by extracting these two features of the input signal.
  • The STE may be calculated as the sum of the squared magnitudes of each frame of the input signal, for example via the spectrogram.
  • The short-time zero-crossing rate may be calculated as the number of zero crossings of each frame of the input signal in the time domain. For example, all sampling points in the frame are shifted to the left or right by one point, and the amplitude of each shifted sampling point is multiplied at the corresponding position by the amplitude of the unshifted sampling point; if the product for a pair of sampling points is negative, the signal crosses zero at that point, and counting the number of negative products in the frame yields the short-time zero-crossing rate, as sketched below.
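  • The following minimal sketch (Python with NumPy) implements the STE/ZCC computation and the energy-based VAD decision described above; the frame length, hop size, and thresholds are illustrative assumptions, not values from this application:

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """STE: sum of squared sample amplitudes of each frame."""
    return np.sum(frames.astype(np.float64) ** 2, axis=1)

def zero_crossing_count(frames):
    """ZCC: multiply each sample by its shifted neighbor and count negative products."""
    products = frames[:, 1:] * frames[:, :-1]
    return np.sum(products < 0, axis=1)

def simple_vad(x, ste_thresh, zcc_thresh):
    """Flag a frame as speech when its STE is large and its ZCC is small."""
    frames = frame_signal(x)
    return (short_time_energy(frames) > ste_thresh) & (zero_crossing_count(frames) < zcc_thresh)
```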
  • Frequency-domain features: through the short-time Fourier transform or another time-frequency transform, the time-domain input signal is converted into a frequency-domain signal to obtain a spectrogram, and frequency-domain features, such as the envelope features of each frequency band, are extracted from the spectrogram. In some experiments, even at an SNR of 0 dB, the long-term envelopes of some frequency bands can distinguish speech segments from noise segments.
  • Cepstral features: for example, the energy cepstrum peak, which determines the pitch of the speech signal. Mel-frequency cepstral coefficients (MFCC) may also be used as cepstral features.
  • Harmonic-based features: an obvious characteristic of a speech signal is that it contains a fundamental frequency and its multiple harmonics; this harmonic structure persists even in strong noise. The fundamental frequency of the speech signal can be found using autocorrelation.
  • Long-term features: speech signals are non-stationary. At a normal speech rate, a speaker usually produces 10 to 15 phonemes per second, and the spectral distributions of different phonemes differ, so the statistical characteristics of speech change over time. By contrast, most everyday noise is stationary, that is, it changes relatively slowly, such as white noise. Based on this, long-term features can be extracted to judge whether the input signal is a speech signal or a non-speech signal.
  • In this embodiment, the input signal used for voice activation detection is the bone conduction signal collected by the bone conduction microphone, and voice activation detection is performed on each frame of the received bone conduction signal to detect whether there is a voice input.
  • Since the bone conduction microphone is always in the working state, the bone conduction signal it continuously collects contains the complete information of the command word input by the sound source; that is, the bone conduction signal does not suffer head loss.
  • the sampling rate of the bone conduction signal is 32 kHz (kilohertz), 48 kHz, etc., which is not limited in this embodiment of the present application.
  • the sensor in the bone conduction microphone is a non-acoustic sensor, which can shield the influence of ambient noise and has strong anti-noise performance.
  • Step 402: When a voice input is detected, detect the wake-up word based on the bone conduction signal.
  • When a voice input is detected, the wearable device detects the wake-up word based on the bone conduction signal, so as to determine whether the command word includes the wake-up word. It should be noted that there are many ways for the wearable device to detect the wake-up word based on the bone conduction signal; two of them are introduced next.
  • In a first implementation, the wearable device detects the wake-up word based on the bone conduction signal by determining a fusion signal based on the bone conduction signal and detecting the wake-up word on the fusion signal.
  • It should be noted that there are many ways for the wearable device to determine the fusion signal based on the bone conduction signal; four of them are introduced next.
  • Method 1 of determining the fusion signal based on the bone conduction signal: before determining the fusion signal, turn on the air microphone and collect the air conduction signal through it. For example, when a voice input is detected, the air microphone is turned on, and the air conduction signal is collected through the air microphone.
  • the wearable device fuses the initial part of the bone conduction signal with the air conduction signal to obtain a fusion signal.
  • the initial part of the bone conduction signal is determined according to the detection time delay of voice detection (such as VAD).
  • the wearable device collects the bone conduction signal and the air conduction signal, and uses the initial part of the bone conduction signal to compensate for head loss on the air conduction signal, so that the obtained fusion signal also includes the command word information input by the sound source.
  • the length of the fused signal is relatively short, which can reduce the amount of data processing to a certain extent.
  • In the embodiments of the present application, signal fusion is performed by signal splicing; in some embodiments, signal fusion may also be performed by signal superposition. In the following embodiments, signal fusion by signal splicing is used as an example.
  • the bone conduction signal and the air conduction signal are signals generated by the same sound source, and the transmission paths of the bone conduction signal and the air conduction signal are different.
  • the bone conduction signal is a signal formed by the vibration signal (excitation signal) transmitted through the bones and tissues inside the human body, and the air conduction signal is formed by the sound wave transmitted through the air.
  • FIG. 6 is a signal timing diagram provided by an embodiment of the present application.
  • the signal timing diagram shows the timing relationship of the bone conduction signal, the air conduction signal, the VAD control signal and the user's voice signal.
  • When the sound source emits a voice signal, the bone conduction signal immediately becomes a high-level signal. After the detection delay, the VAD determines that there is a voice input and a VAD control signal is generated. The VAD control signal controls the air microphone to turn on and collect the air conduction signal; that is, the air conduction signal becomes a high-level signal only at this time.
  • The bone conduction signal changes synchronously with the user's voice signal, while the air conduction signal lags the bone conduction signal by Δt, which is caused by the detection delay of the VAD. Δt represents the detection delay of voice activation detection, that is, the time difference between the moment a voice input is detected and the moment the user actually starts to speak.
  • the VAD can detect the speech segment and the non-speech segment in the bone conduction signal
  • the endpoint detection can detect the speech segment and the non-speech segment in the air conduction signal.
  • If the time range of the speech segment extracted from the bone conduction signal is [0, t], then the time range of the initial part of the bone conduction signal (that is, the intercepted initial part of the speech segment) is [0, Δt], and the time range of the speech segment intercepted from the air conduction signal is [Δt, t]. Here Δt represents the detection delay of voice activation detection, and t represents the total duration of the actual voice input. After the initial part of the bone conduction signal (the speech segment from 0 to Δt) is spliced with the air conduction signal (the speech segment from Δt to t), the duration of the resulting fusion signal is t, as sketched below.
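  • As a concrete illustration of this splicing, a minimal sketch follows (assuming both signals are NumPy arrays at a common sampling rate and that Δt is known from the VAD delay; the function name and arguments are hypothetical):

```python
import numpy as np

def splice_head_loss(bone, air, delta_t, sr):
    """Compensate the head loss of the air conduction signal with the
    initial part of the bone conduction signal.

    bone    : bone conduction signal covering [0, t] (1-D array)
    air     : air conduction signal covering [delta_t, t] (1-D array)
    delta_t : VAD detection delay in seconds
    sr      : common sampling rate in Hz
    """
    head_len = int(round(delta_t * sr))
    head = bone[:head_len]               # speech segment [0, delta_t] of the bone signal
    return np.concatenate([head, air])   # fusion signal covering [0, t]
```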
  • the wearable device performs preprocessing on the air conduction signal, and the preprocessing includes front-end enhancement.
  • the front-end enhancement can eliminate part of the noise and the influence of different sound sources, etc., so that the air conduction signal after the front-end enhancement can better reflect the essential characteristics of the voice, so as to improve the accuracy of voice wake-up.
  • Front-end enhancement of the air conduction signal includes, for example, endpoint detection and speech enhancement; speech enhancement includes, for example, echo cancellation, beamforming, noise cancellation, automatic gain control, de-reverberation, and so on.
  • the endpoint detection can distinguish the speech segment and the non-speech segment of the air conduction signal, that is, accurately determine the starting point of the speech segment.
  • After endpoint detection, only the speech segment of the air conduction signal needs to be processed subsequently, which can improve the accuracy and recall rate of speech recognition.
  • Speech enhancement is to remove the influence of environmental noise on speech clips.
  • Echo cancellation uses an effective echo cancellation algorithm to suppress the interference of far-end signals, mainly including double-talk detection and delay estimation. For example, the current speech mode (such as near-talk mode, far-talk mode, or double-talk mode) is judged, a corresponding strategy is used to adjust a filter, the far-end interference in the air conduction signal is then filtered out by the filter, and on this basis the residual noise interference is eliminated through a post-filtering algorithm.
  • The automatic gain algorithm is used to quickly bring the signal to an appropriate volume. In this solution, rigid gain processing may multiply all sampling points of the air conduction signal by a corresponding gain factor in the time domain, or multiply each frequency bin by a corresponding gain factor in the frequency domain.
  • Alternatively, the frequencies of the air conduction signal may be weighted according to an equal-loudness curve: the loudness gain factor is mapped onto the equal-loudness curve, thereby determining the gain factor for each frequency.
  • the wearable device before the initial part of the bone conduction signal is fused with the air conduction signal, the wearable device performs preprocessing on the bone conduction signal, and the preprocessing includes down-sampling and/or gain adjustment.
  • downsampling can reduce the data volume of the bone conduction signal and improve the efficiency of data processing.
  • Gain adjustment is used to enhance the energy of the bone conduction signal, for example so that the average energy of the adjusted bone conduction signal is the same as that of the air conduction signal.
  • downsampling refers to reducing the sampling frequency (also referred to as sampling rate) of a signal, and is a way of resampling a signal.
  • the sampling frequency refers to the number of samples of the sound wave amplitude extracted per second after the analog sound waveform is digitized.
  • For M-fold downsampling, one sampling point is kept after every M−1 sampling points to obtain the downsampled signal y[m].
  • Downsampling may cause spectral aliasing of the signal, so the signal can be processed with a low-pass de-aliasing filter before downsampling; that is, anti-aliasing filtering is performed to reduce the spectral aliasing introduced by the subsequent downsampling.
  • the gain adjustment refers to adjusting the amplitude value of the sampling point of the bone conduction signal through the gain factor, or adjusting the energy value of the frequency point of the bone conduction signal.
  • the gain factor may be determined according to a gain function, or may be determined according to statistical information of the air conduction signal and the bone conduction signal, which is not limited in this embodiment of the present application.
  • FIG. 8 is a schematic diagram of downsampling a bone conduction signal provided by an embodiment of the present application.
  • the sampling rate of the bone conduction signal is 48kHz
  • the collected bone conduction signal x[n] is first sent to the anti-aliasing filter H(z) to prevent signal aliasing.
  • v[n] represents the bone conduction signal after the anti-aliasing filter, and the sampling rate remains unchanged.
  • Threefold downsampling is then performed on v[n] to obtain the bone conduction signal y[m], whose sampling rate is reduced to 16 kHz. A minimal sketch of this pipeline follows.
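  • A minimal sketch of this downsampling pipeline (Python with SciPy; scipy.signal.decimate applies an anti-aliasing low-pass filter, playing the role of H(z), before keeping every third sample):

```python
import numpy as np
from scipy import signal

def downsample_bone(x, factor=3):
    """Threefold downsampling with built-in anti-aliasing filtering,
    e.g. 48 kHz -> 16 kHz as in FIG. 8."""
    return signal.decimate(x, factor, zero_phase=True)

# Example: one second of a 48 kHz signal becomes 16000 samples.
x = np.random.randn(48000)
y = downsample_bone(x)
assert len(y) == 16000
```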
  • FIG. 9 is a schematic diagram of gain adjustment for a bone conduction signal provided by an embodiment of the present application, where x[n] represents the bone conduction signal and f(g) represents the gain function; one possible choice of f(g) is sketched below.
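  • The gain function f(g) is not specified in detail here; one simple choice consistent with the description above (scaling the bone conduction signal so that its average energy matches that of the air conduction signal, which is an assumption of this sketch) might look like:

```python
import numpy as np

def gain_adjust(bone, air, eps=1e-12):
    """Scale the bone conduction signal so that its average energy
    matches that of the air conduction signal."""
    bone = bone.astype(np.float64)
    air = air.astype(np.float64)
    g = np.sqrt((np.mean(air ** 2) + eps) / (np.mean(bone ** 2) + eps))
    return g * bone
```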
  • Method 2 of determining the fusion signal based on the bone conduction signal: before determining the fusion signal, turn on the air microphone and collect the air conduction signal through it.
  • the wearable device generates an enhanced initial signal based on the initial portion of the bone conduction signal, and fuses the enhanced initial signal with the air conduction signal to obtain a fusion signal.
  • The initial part of the bone conduction signal is determined according to the detection time delay of the voice detection. That is, the wearable device uses the initial part of the bone conduction signal to generate an enhanced initial signal and uses the enhanced initial signal to perform head loss compensation on the collected air conduction signal, so that the obtained fusion signal also includes the command word information input by the sound source.
  • the length of the fusion signal is short, which can reduce the amount of data processing to a certain extent.
  • The difference from method 1 above is that, in method 2, the initial part of the bone conduction signal is used to generate an enhanced initial signal, and it is the enhanced initial signal, rather than the initial part of the bone conduction signal itself, that is fused with the air conduction signal.
  • the other contents introduced in the above method 1 are applicable to this method 2.
  • The bone conduction signal and the air conduction signal can also undergo speech segment detection to extract the speech segments, and the splicing can then be based on the intercepted speech segments, thereby reducing the amount of data processing.
  • the wearable device can also perform preprocessing on the bone conduction signal and the air conduction signal, such as performing down-sampling and/or gain adjustment on the bone conduction signal, and performing speech enhancement on the air conduction signal.
  • the wearable device may input the initial part of the bone conduction signal into the generation network model, so as to obtain the enhanced initial signal output by the generation network model.
  • the generative network model is a model trained based on a deep learning algorithm, and the generative network model can be regarded as a signal generator, which can generate a speech signal that contains information about the input signal and is close to real speech based on the input signal.
  • the enhanced initial signal includes signal information of the initial part of the bone conduction signal, and the enhanced initial signal is close to the real speech signal. It should be noted that the embodiment of the present application does not limit the network structure, training method, training equipment, etc. for generating the network model. Next, a training method for generating a network model is exemplarily introduced.
  • the computer device obtains a first training data set, and the first training data set includes a plurality of first sample signal pairs.
  • the computer device inputs the initial part of the bone conduction sample signal in the plurality of first sample signal pairs into the initial generation network model to obtain a plurality of enhanced initial sample signals output by the initial generation network model.
  • the computer device inputs the plurality of enhanced initial sample signals and the initial part of the air conduction sample signal in the plurality of first sample signal pairs into the initial decision network model to obtain a decision result output by the initial decision network model.
  • the computer device adjusts network parameters of the initial generated network model based on the decision result to obtain a trained generated network model.
  • A first sample signal pair includes the initial part of a bone conduction sample signal and the initial part of an air conduction sample signal, and each first sample signal pair corresponds to a command word; the bone conduction sample signal and the air conduction sample signal contain the complete information of the corresponding command word.
  • the first sample signal pair acquired by the computer device includes a bone conduction sample signal and an air conduction sample signal
  • the computer device intercepts the initial part of the bone conduction sample signal and the initial part of the air conduction sample signal to obtain the input data of the initial generation network model and the initial decision network model. That is, the computer device first obtains the complete speech signals and then intercepts their initial parts to obtain the training data.
  • the first sample signal pair acquired by the computer device only includes the initial part of the bone conduction sample signal and the initial part of the air conduction sample signal.
  • the first training data set includes directly collected voice data, public voice data and/or voice data purchased from a third party.
  • The computer device can preprocess the acquired first training data set to obtain a preprocessed first training data set; the preprocessing simulates real voice data so as to be closer to speech in real scenes and to increase the diversity of the training samples.
  • In some embodiments, the first training data set is backed up, that is, an additional copy of the data is made, and the backup is preprocessed.
  • The backup data is divided into multiple parts, and one preprocessing operation is performed on each part; the preprocessing applied to each part can differ, which doubles the total amount of training data and ensures the comprehensiveness of the data.
  • The preprocessing of each piece of data may include one or more of noise addition, volume enhancement, adding reverberation (add reverb), time shifting, pitch shifting (changing pitch), time stretching, and the like.
  • adding noise refers to mixing one or more types of background noise into the speech signal, so that the training data can cover more types of noise.
  • Background noise such as office environment noise, canteen environment noise, and street environment noise may be mixed in.
  • When adding noise, the SNR can be drawn according to a normal distribution around a chosen average value; for example, the average can be 10 dB or 20 dB, and the SNR can range from 10 dB to 30 dB.
  • the volume enhancement refers to increasing or decreasing the volume of the speech signal according to the variation coefficient of the volume, and the value range of the variation coefficient of the volume may be 0.5 to 1.5, or other value ranges.
  • Adding reverberation refers to adding reverberation processing to the speech signal, and the reverberation is caused by the reflection of the sound signal by the space environment.
  • Pitch shifting refers to changing the pitch of the voice, for example through pitch correction, without affecting the speech rate.
  • Time stretching refers to changing the speed or duration of the speech signal without affecting the pitch, that is, changing the speech rate, so that the training data can cover different speech rates; the speech rate coefficient can vary from 0.9 to 1.1 or within another range. A sketch of these augmentations follows.
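  • A minimal augmentation sketch (Python with NumPy and librosa; the 20 dB mean SNR and the ±1 semitone pitch range are illustrative assumptions, while the 10–30 dB SNR range, the 0.5–1.5 volume coefficient, and the 0.9–1.1 speech rate follow the values above):

```python
import numpy as np
import librosa

def add_noise(speech, noise, snr_db):
    """Mix background noise into the speech signal at a target SNR (dB)."""
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def augment(speech, sr, noise):
    """Apply one random combination of the augmentations listed above."""
    snr_db = float(np.clip(np.random.normal(20, 5), 10, 30))   # SNR kept within 10-30 dB
    out = add_noise(speech.astype(np.float32), noise, snr_db)
    out = out * np.random.uniform(0.5, 1.5)                    # volume coefficient 0.5-1.5
    out = librosa.effects.time_stretch(out, rate=np.random.uniform(0.9, 1.1))
    out = librosa.effects.pitch_shift(out, sr=sr, n_steps=np.random.uniform(-1, 1))
    return out
```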
  • FIG. 10 is a schematic diagram of a method for training and generating a network model provided by an embodiment of the present application.
  • The generator (that is, the initial generation network model) is a network for generating speech signals. The initial part of a bone conduction sample signal in the first training data set is input into the generator; optionally, random noise is superimposed on the bone conduction sample signal before it is input into the generator.
  • the generator processes the input bone conduction sample signal to generate an enhanced initial sample signal.
  • The decider (that is, the initial decision network model) is a decision network for judging whether the input signal is a real speech signal, and the decision result it outputs indicates whether the input is real speech.
  • If the output decision result is 1, the decider determines that the input signal is a real speech signal; if the output decision result is 0, the decider determines that the input signal is not a real speech signal. The parameters of the generator and the decider are adjusted according to whether the decision results are accurate, so as to train the generator and the decider.
  • The goal of the generator is to generate fake speech signals that fool the decider, and the goal of the decider is to distinguish whether the input signal is real or generated. The generator and the decider essentially play a game through the training data, and the capabilities of both improve during the game; under ideal conditions, the accuracy of the trained decider approaches 0.5. A minimal training-loop sketch follows.
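  • A minimal adversarial training sketch (PyTorch; the network bodies, the 160-dimensional features, the noise scale, and the optimizer settings are placeholders, not the structures used in this application):

```python
import torch
import torch.nn as nn

# Placeholder networks: the real model structures are not specified here.
G = nn.Sequential(nn.Linear(160, 256), nn.ReLU(), nn.Linear(256, 160))  # generator
D = nn.Sequential(nn.Linear(160, 256), nn.ReLU(), nn.Linear(256, 1))    # decider

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(bone_head, air_head):
    """bone_head / air_head: batches of initial-part features, shape (B, 160)."""
    # Optionally superimpose random noise on the bone conduction input.
    fake = G(bone_head + 0.01 * torch.randn_like(bone_head))

    # 1) Train the decider: real air-conduction heads -> 1, generated heads -> 0.
    d_loss = (bce(D(air_head), torch.ones(air_head.size(0), 1))
              + bce(D(fake.detach()), torch.zeros(fake.size(0), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator: try to make the decider output 1 for generated heads.
    g_loss = bce(D(fake), torch.ones(fake.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```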
  • the computer device may also use other methods to generate the enhanced initial signal based on the initial signal of the bone conduction signal, which is not limited in this embodiment of the present application.
  • Method 3 of determining the fusion signal based on the bone conduction signal: before determining the fusion signal, turn on the air microphone and collect the air conduction signal through it. The wearable device directly fuses the bone conduction signal and the air conduction signal to obtain the fusion signal. In this way, the obtained fusion signal also contains the command word information input by the sound source. In addition, the fusion signal contains both the complete speech information of the bone conduction signal and the complete speech information of the air conduction signal, so the speech features contained in the fusion signal are richer, which improves the accuracy of speech recognition to a certain extent.
  • Except that the wearable device directly fuses the bone conduction signal and the air conduction signal, the other content introduced in method 1 above is applicable to method 3 and is not repeated here.
  • the bone conduction signal and the air conduction signal may also be detected for speech segments to extract the speech segments, and the intercepted speech segments may be fused, thereby reducing the amount of data processing. It is also possible to perform preprocessing on the bone conduction signal and the air conduction signal, such as performing down-sampling and/or gain adjustment on the bone conduction signal, and performing endpoint detection and speech enhancement on the air conduction signal.
  • x1[n] represents the bone conduction signal
  • x2[n] represents the air conduction signal
  • f(x): b[n]_(0~2t−Δt) = concat[x1[n]_(0~t), x2[n]_(Δt~t)]. That is, the bone conduction signal (the speech segment from 0 to t) and the air conduction signal (the speech segment from Δt to t) are spliced by f(x) to obtain the fusion signal b[n].
  • Method 4 of determining the fusion signal based on the bone conduction signal: the wearable device determines the bone conduction signal itself as the fusion signal. That is, it is also possible to detect the wake-up word using only the bone conduction signal.
  • the difference from the above method 1 of determining the fusion signal based on the bone conduction signal is that in the method 4 of determining the fusion signal based on the bone conduction signal, the bone conduction signal is directly used as the fusion signal.
  • the other content introduced in the above-mentioned method 1 is applicable to the method 4, and will not be described in detail in the method 4 one by one.
  • the bone conduction signal may also be detected for the speech segment to extract the speech segment, and the intercepted speech segment is used as the fusion signal, thereby reducing the amount of data processing. Preprocessing may also be performed on the bone conduction signal, for example, performing down-sampling and/or gain adjustment on the bone conduction signal.
  • the wearable device inputs the multiple audio frames included in the fusion signal into the first acoustic model, so as to obtain multiple posterior probability vectors output by the first acoustic model.
  • the wearable device detects the wake-up word based on the plurality of posterior probability vectors.
  • The plurality of posterior probability vectors correspond one-to-one to the plurality of audio frames included in the fusion signal; that is, one posterior probability vector corresponds to one audio frame. The first posterior probability vector among them indicates the probability that the phoneme of the first audio frame among the plurality of audio frames belongs to each of a plurality of designated phonemes. In other words, the wearable device processes the fusion signal through the first acoustic model to obtain the phoneme information contained in the fusion signal, so as to detect the wake-up word based on that phoneme information.
  • the first acoustic model may be the network model as described above, or a model of other structures.
  • the first acoustic model processes each audio frame included in the fusion signal to obtain the posterior probability vector corresponding to each audio frame output by the first acoustic model .
  • After the wearable device obtains the multiple posterior probability vectors output by the first acoustic model, it determines, based on these posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence of the command word input by the sound source includes the phoneme sequence of the wake-up word. If the confidence exceeds the confidence threshold, it is determined that the command word includes the wake-up word. That is, the wearable device decodes the plurality of posterior probability vectors to determine the confidence.
  • the phoneme sequence corresponding to the wake-up word is called a decoding path
  • the determined confidence level may be called a path score
  • the confidence level threshold may be called a wake-up threshold.
  • In the constructed decoding graph, the probability of each phoneme on the decoding path is looked up, and the probabilities found for the phonemes are accumulated to obtain the confidence.
  • the decoding path refers to the phoneme sequence corresponding to the wake-up word. If the confidence level is greater than the confidence level threshold, it is determined that the command word is detected to include a wake-up word.
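  • A minimal decoding sketch (Python with NumPy; the uniform time slots are a naive alignment assumption — a real decoder searches a decoding graph, e.g. with Viterbi decoding):

```python
import numpy as np

def path_score(posteriors, path, phoneme_index):
    """Accumulate the posterior probability of each phoneme on the
    decoding path to obtain a confidence (path score).

    posteriors    : (T, P) matrix, one posterior probability vector per frame
    path          : phoneme sequence of the wake-up word
    phoneme_index : maps a phoneme to its column in the posterior vector
    """
    score = 0.0
    frame = 0
    frames_per_phoneme = max(1, posteriors.shape[0] // len(path))  # naive alignment
    for ph in path:
        col = phoneme_index[ph]
        # take the best frame for this phoneme inside its naive time slot
        slot = posteriors[frame:frame + frames_per_phoneme, col]
        score += float(np.max(slot))
        frame += frames_per_phoneme
    return score / len(path)   # normalized confidence in [0, 1]
```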
  • In some embodiments, when the confidence exceeds the confidence threshold and the distance condition between the multiple posterior probability vectors and multiple template vectors is satisfied, the wearable device determines that the command word input by the sound source includes the wake-up word.
  • the multiple template vectors indicate probabilities that the phonemes of the speech signal including the complete information of the wake-up word belong to multiple specified phonemes. That is, the current input speech not only needs to satisfy the confidence condition, but also needs to match the template.
  • the confidence threshold can be preset, for example, based on experience, or determined according to the bone conduction registration signal and/or air conduction registration signal containing complete information of the wake-up word when registering the wake-up word. The method is described below.
  • the multiple template vectors are registration posterior probability vectors determined according to the bone conduction registration signal and/or the air conduction registration signal, and the specific implementation will be introduced below.
  • the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold.
  • When the duration of the voice currently input by the sound source is consistent with the duration of the voice input when the wake-up word was registered, the multiple posterior probability vectors may correspond one-to-one with the multiple template vectors, and the wearable device can directly calculate the distance between each posterior probability vector and its corresponding template vector and take the average value.
  • Otherwise, the wearable device can adopt a dynamic time warping (DTW) method to establish the mapping relationship between the multiple posterior probability vectors and the multiple template vectors, and then calculate the distances between the mapped pairs of vectors, as sketched below.
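  • A minimal DTW sketch (Python with NumPy; the Euclidean local cost and the normalization by the longer sequence length are illustrative choices):

```python
import numpy as np

def dtw_average_distance(post, templ):
    """Average distance between posterior vectors and template vectors
    under a DTW alignment.

    post  : (T1, P) posterior probability vectors of the current input
    templ : (T2, P) template vectors saved at registration time
    """
    t1, t2 = len(post), len(templ)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(post[i - 1] - templ[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # normalize by the longer sequence length as a simple average
    return cost[t1, t2] / max(t1, t2)

# The wake-up decision then also requires dtw_average_distance(...) < distance_threshold.
```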
  • In summary, in the first implementation, the wearable device first determines the fusion signal based on the bone conduction signal (using one of the four methods above), and then processes the fusion signal through the acoustic model to obtain the posterior probability vectors. The wearable device then decodes the obtained posterior probability vectors along the decoding path corresponding to the wake-up word to obtain the confidence corresponding to the command word currently input by the sound source. If the confidence is greater than the confidence threshold (or, in some embodiments, if the confidence is greater than the threshold and the distance condition is also satisfied), the wearable device determines that the command word includes the wake-up word.
  • Next, the second implementation of detecting the wake-up word based on the bone conduction signal is introduced.
  • Before the wearable device detects the wake-up word based on the bone conduction signal, the air microphone is turned on and the air conduction signal is collected through it. For example, when a voice input is detected, the air microphone is turned on, and the air conduction signal is collected through the air microphone.
  • the wearable device determines multiple posterior probability vectors based on the bone conduction signal and the air conduction signal, and detects the wake-up word based on the multiple posterior probability vectors.
  • The plurality of posterior probability vectors correspond one-to-one to the plurality of audio frames included in the bone conduction signal and the air conduction signal, and the first posterior probability vector among them indicates the probability that the phoneme of the first audio frame among the plurality of audio frames belongs to multiple specified phonemes.
  • The multiple audio frames include the audio frames of the bone conduction signal and the audio frames of the air conduction signal. That is, each posterior probability vector corresponds to an audio frame of the bone conduction signal or the air conduction signal, and one posterior probability vector indicates the probability that the phoneme of the corresponding audio frame belongs to the multiple specified phonemes.
  • For the bone conduction signal and the air conduction signal, refer to the content in the first implementation above, including their generation principles and preprocessing, which are not repeated here.
  • There are many ways for the wearable device to determine the multiple posterior probability vectors based on the bone conduction signal and the air conduction signal; three of them are introduced next.
  • Method 1 of determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal: the wearable device inputs the initial part of the bone conduction signal and the air conduction signal into the second acoustic model to obtain a first number of bone conduction posterior probability vectors and a second number of air conduction posterior probability vectors output by the second acoustic model.
  • the initial part of the bone conduction signal is determined according to the detection time delay of voice detection
  • The first number of bone conduction posterior probability vectors correspond one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors correspond one-to-one to the audio frames included in the air conduction signal.
  • the wearable device fuses the first bone conduction posterior probability vector and the first air conduction posterior probability vector to obtain a second posterior probability vector.
  • The first bone conduction posterior probability vector corresponds to the last audio frame of the initial part of the bone conduction signal, whose duration is less than the frame duration, and the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, whose duration is also less than the frame duration.
  • The multiple posterior probability vectors finally determined by the wearable device include the second posterior probability vector, the vectors among the first number of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and the vectors among the second number of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
  • the first quantity and the second quantity may be the same or different.
  • The last audio frame of the initial part of the bone conduction signal may not be a complete audio frame, that is, its duration is less than the frame duration; for example, the initial part of the bone conduction signal may end with an audio frame of half the frame duration.
  • Similarly, the first audio frame of the air conduction signal may not be a complete audio frame, that is, its duration is less than the frame duration; for example, it may be an audio frame of half the frame duration.
  • the sum of the duration of the last audio frame of the initial part of the bone conduction signal and the duration of the first audio frame of the air conduction signal may be equal to the frame duration.
  • Due to the detection delay of voice detection (such as VAD), the last audio frame of the initial part of the bone conduction signal and the first audio frame of the air conduction signal may be incomplete; combined, they represent the information of one complete audio frame. It should be noted that this complete audio frame is a notional frame, not an actually collected frame.
  • The wearable device adds the first bone conduction posterior probability vector and the first air conduction posterior probability vector to obtain the second posterior probability vector, which indicates the probability that the phoneme of this complete audio frame belongs to the multiple specified phonemes.
  • the wearable device needs to fuse (such as add) the second bone conduction posterior probability vector and the second air conduction posterior probability vector to obtain multiple posterior probability vectors.
  • If the duration of the last audio frame of the initial part of the bone conduction signal is equal to the frame duration and the duration of the first audio frame of the air conduction signal is equal to the frame duration, that is, both boundary frames are complete, the wearable device directly uses the obtained first number of bone conduction posterior probability vectors and second number of air conduction posterior probability vectors as the plurality of posterior probability vectors for subsequent processing, as sketched below.
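  • A minimal sketch of this posterior fusion (Python with NumPy; the array names are hypothetical):

```python
import numpy as np

def fuse_posteriors(bone_post, air_post, boundary_partial=True):
    """Fuse bone conduction and air conduction posterior probability vectors.

    bone_post : (N1, P) posteriors for the initial part of the bone signal
    air_post  : (N2, P) posteriors for the air conduction signal
    If the boundary frames are partial, the last bone vector and the first
    air vector are added to represent the one complete (notional) frame.
    """
    if not boundary_partial:
        return np.vstack([bone_post, air_post])
    merged = bone_post[-1] + air_post[0]   # fuse the two partial-frame vectors
    return np.vstack([bone_post[:-1], merged[None, :], air_post[1:]])
```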
  • Fig. 11 is a schematic structural diagram of another acoustic model provided by an embodiment of the present application.
  • the acoustic model shown in FIG. 11 is the second acoustic model in the embodiment of the present application.
  • the second acoustic model in the embodiment of the present application includes two input layers (not shown), one shared network layer and two output layers.
  • the two input layers are used to respectively input the initial part of the bone conduction signal and the air conduction signal.
  • the shared network layer is used to process the input data of the two input layers separately, so as to extract the initial part of the bone conduction signal and the characteristics of the air conduction signal respectively.
  • These two output layers receive the two outputs of the shared network layer respectively and process them to output the first number of bone conduction posterior probability vectors corresponding to the initial part of the bone conduction signal and the second number of air conduction posterior probability vectors corresponding to the air conduction signal. That is, the wearable device processes the initial part of the bone conduction signal and the air conduction signal through the second acoustic model to obtain two sets of posterior probability vectors, with the shared network layer allowing the two signals to share some network parameters.
  • The wearable device then fuses the obtained first bone conduction posterior probability vector with the first air conduction posterior probability vector to obtain the second posterior probability vector; in this way, the bone conduction posterior probability vectors and the air conduction posterior probability vectors are fused into the plurality of posterior probability vectors. That is, the wearable device fuses the posterior probabilities of the two parts of the signal, so that the obtained posterior probability vectors contain the command word information input by the sound source.
  • The scheme of processing the initial part of the bone conduction signal and the air conduction signal based on the second acoustic model can be regarded as a multi-task scheme: the initial part of the bone conduction signal and the air conduction signal are treated as two tasks, and shared network parameters are used to determine the corresponding posterior probability vectors, thereby implicitly fusing the initial part of the bone conduction signal with the air conduction signal. A structural sketch follows.
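  • A structural sketch of such a two-input / shared-layer / two-output acoustic model (PyTorch; the layer types and sizes are illustrative assumptions, not the structure disclosed here):

```python
import torch
import torch.nn as nn

class SharedDualHeadModel(nn.Module):
    """Two input layers, one shared network layer, two output layers."""
    def __init__(self, feat_dim=40, hidden=128, num_phonemes=64):
        super().__init__()
        self.in_bone = nn.Linear(feat_dim, hidden)   # input layer for bone features
        self.in_air = nn.Linear(feat_dim, hidden)    # input layer for air features
        self.shared = nn.GRU(hidden, hidden, batch_first=True)  # shared network layer
        self.out_bone = nn.Linear(hidden, num_phonemes)
        self.out_air = nn.Linear(hidden, num_phonemes)

    def forward(self, bone_feats, air_feats):
        # bone_feats: (B, N1, feat_dim), air_feats: (B, N2, feat_dim)
        h_bone, _ = self.shared(torch.relu(self.in_bone(bone_feats)))
        h_air, _ = self.shared(torch.relu(self.in_air(air_feats)))
        # softmax over phonemes yields per-frame posterior probability vectors
        return (torch.softmax(self.out_bone(h_bone), dim=-1),
                torch.softmax(self.out_air(h_air), dim=-1))
```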
  • Method 2 of determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal: the wearable device inputs the initial part of the bone conduction signal and the air conduction signal into the third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model.
  • the initial part of the bone conduction signal is determined according to the detection time delay of the voice detection. It should be noted that for the relevant introduction about the initial part of the bone conduction signal, reference may also be made to the content in the aforementioned first implementation manner, which will not be repeated here.
  • the third acoustic model includes two input layers (such as an input layer including DNN and CNN layers), a splicing layer (concat layer), and a network parameter layer (such as including RNN, etc.) and an output layer (such as including softmax, etc.).
  • The two input layers are used to input the bone conduction signal and the air conduction signal respectively, the splicing layer is used to splice the output data of the two input layers, the network parameter layer is used to process the output data of the splicing layer, and the output layer is used to output a set of posterior probability vectors.
  • The wearable device inputs the initial part of the bone conduction signal and the air conduction signal into the third acoustic model simultaneously; the splicing layer in the third acoustic model implicitly fuses the initial part of the bone conduction signal with the air conduction signal, and a single set of posterior probability vectors is obtained, so that the obtained posterior probability vectors contain the command word information input by the sound source. This can also be regarded as head loss compensation of the air conduction signal based on the bone conduction signal, except that the compensation is not performed by directly fusing the signals. A structural sketch follows.
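  • A structural sketch of such a two-input / splicing / RNN / softmax model (PyTorch; splicing along the time axis and the layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ConcatAcousticModel(nn.Module):
    """Two input layers, a splicing (concat) layer, a network parameter
    layer (RNN), and a softmax output layer."""
    def __init__(self, feat_dim=40, hidden=128, num_phonemes=64):
        super().__init__()
        self.in_bone = nn.Linear(feat_dim, hidden)   # input layer for bone features
        self.in_air = nn.Linear(feat_dim, hidden)    # input layer for air features
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)  # network parameter layer
        self.out = nn.Linear(hidden, num_phonemes)   # softmax output layer

    def forward(self, bone_feats, air_feats):
        h_bone = torch.relu(self.in_bone(bone_feats))   # (B, N1, hidden)
        h_air = torch.relu(self.in_air(air_feats))      # (B, N2, hidden)
        h = torch.cat([h_bone, h_air], dim=1)           # splicing layer: concat in time
        h, _ = self.rnn(h)
        return torch.softmax(self.out(h), dim=-1)       # one set of posterior vectors
```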
  • Method 3 of determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal: the wearable device inputs the bone conduction signal and the air conduction signal into the third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model. That is, the wearable device directly inputs the two signals into the third acoustic model at the same time, and the third acoustic model outputs one set of posterior probability vectors, so that the obtained posterior probability vectors contain the command word information input by the sound source.
  • This can also be regarded as head loss compensation of the air conduction signal based on the bone conduction signal, except that the compensation is not performed by directly fusing the signals.
  • Afterwards, based on the plurality of posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the wearable device determines the confidence that the phoneme sequence corresponding to the command word input by the sound source includes the phoneme sequence corresponding to the wake-up word.
  • If the confidence exceeds the confidence threshold, it is determined that the wake-up word is detected.
  • In some embodiments, the wearable device determines that the command word is detected to include the wake-up word when the confidence exceeds the confidence threshold and the distance condition between the plurality of posterior probability vectors and the plurality of template vectors is satisfied.
  • the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold.
  • Step 403: When it is detected that the command word includes the wake-up word, wake up the device to be woken up by voice.
  • When it is detected that the command word input by the sound source includes the wake-up word, the wearable device performs voice wake-up. For example, the wearable device sends a wake-up instruction to the smart device (that is, the device to be woken up) so as to wake it up. Or, in the case that the wearable device itself is the smart device, the wearable device wakes up the components or modules other than the bone conduction microphone; that is, the wearable device as a whole enters the working state.
  • The voice wake-up method provided by the embodiments of the present application has multiple implementations, such as the first implementation and the second implementation described above, and these two implementations each include multiple specific implementation ways. Next, referring to FIG. 13 to FIG. 18, the several specific implementations introduced above are explained again.
  • FIG. 13 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • FIG. 13 corresponds to manner 1 in the above-mentioned first implementation manner.
  • The wearable device collects the bone conduction signal through the bone conduction microphone and performs VAD on it through the VAD control module. When a voice input is detected, the VAD control module outputs a high-level VAD control signal; otherwise, it outputs a low-level VAD control signal.
  • the VAD control module sends the VAD control signal to the air microphone control module, the front-end enhancement module and the recognition engine respectively.
  • the VAD control signal is used to control the switch of the air microphone control module, the front-end enhancement module and the recognition engine.
  • the air microphone control module controls the air microphone to be turned on to collect the air conduction signal
  • the front-end enhancement module is turned on to perform front-end enhancement on the air conduction signal
  • the recognition engine is turned on to perform wake-up word detection based on the bone conduction signal and the air conduction signal.
  • the fusion module performs preprocessing such as downsampling and/or gain adjustment on the bone conduction signal, and uses the initial part of the preprocessed bone conduction signal to perform head loss compensation on the front-end enhanced air conduction signal to obtain the fusion signal.
  • the fusion module sends the fusion signal to the recognition engine, and the recognition engine recognizes the fusion signal through the first acoustic model to obtain the detection result of the wake word.
  • the recognition engine sends the obtained detection result to the processor (such as the illustrated micro-controller unit (MCU)), and the processor determines whether to wake up the smart device based on the detection result. If the detection result indicates that it is detected that the command word input by the sound source includes a wake-up word, the processor wakes up the smart device by voice. If the detection result indicates that no wake word is detected, the processor does not wake up the smart device.
  • FIG. 14 to FIG. 16 are flowcharts of three other voice wake-up methods provided by the embodiment of the present application.
  • In one of these methods, the fusion module generates an enhanced initial signal based on the initial part of the preprocessed bone conduction signal, and uses the enhanced initial signal to perform head loss compensation on the front-end enhanced air conduction signal to obtain the fusion signal.
  • the fusion module directly splices the preprocessed bone conduction signal and the front-end enhanced air conduction signal to perform head loss compensation on the air conduction signal, thereby obtaining a fusion signal.
  • In the method where only the bone conduction signal is used, the VAD control signal does not need to be sent to the air microphone control module, so there is no need to collect the air conduction signal; the recognition engine directly determines the preprocessed bone conduction signal as the fusion signal.
  • FIG. 17 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • the recognition engine inputs the initial part of the preprocessed bone conduction signal and the front-end enhanced air conduction signal into the second acoustic model, whose two output layers respectively output the bone conduction posterior probability vectors and the air conduction posterior probability vectors; that is, posterior probability pairs are obtained.
  • the recognition engine fuses the bone conduction posterior probability vector and the air conduction posterior probability vector to obtain multiple posterior probability vectors, and decodes the multiple posterior probability vectors to obtain the detection result of the wake word.
  • Fig. 18 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
  • The difference between FIG. 18 and FIG. 17 is that, in the method shown in FIG. 18, the recognition engine inputs the initial part of the preprocessed bone conduction signal and the front-end enhanced air conduction signal into the third acoustic model, or inputs the preprocessed bone conduction signal and the front-end enhanced air conduction signal into the third acoustic model, so as to obtain the multiple posterior probability vectors output by the single output layer of the third acoustic model.
  • In summary, the bone conduction microphone is used to collect the bone conduction signal for voice detection, which ensures low power consumption. At the same time, it is considered that, due to the delay of voice detection, the collected air conduction signal may suffer head loss and thus not contain the complete information of the command word input by the sound source, whereas the bone conduction signal collected by the bone conduction microphone contains the complete command word information; that is, the bone conduction signal is not truncated. This solution therefore detects the wake-up word based on the bone conduction signal, so the recognition accuracy of the wake-up word is higher and voice wake-up is more accurate.
  • head loss compensation may be performed directly or implicitly on the air conduction signal based on the bone conduction signal, or the wake-up word detection may be performed directly based on the bone conduction signal.
  • the wake-up word can also be registered in the wearable device.
  • The confidence threshold in the above embodiments, as well as the multiple template vectors, can also be determined while registering the wake-up word. Next, the registration process of the wake-up word is introduced.
  • During registration, the wearable device first determines the phoneme sequence corresponding to the wake-up word. Afterwards, the wearable device obtains the bone conduction registration signal, which contains the complete information of the wake-up word. The wearable device determines the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word. Optionally, the wearable device may also determine the multiple template vectors based on the bone conduction registration signal.
  • the wearable device obtains the input wake-up word, and determines the phoneme sequence corresponding to the wake-up word according to the pronunciation dictionary. Taking the user inputting the wake-up word text to the wearable device as an example, the wearable device obtains the wake-up word text input by the user, and determines the phoneme sequence corresponding to the wake-up word according to the pronunciation dictionary.
  • After the user enters the wake-up word text, the wearable device can also detect whether the input wake-up word text meets the text registration conditions; if so, the phoneme sequence corresponding to the wake-up word text is determined according to the pronunciation dictionary.
  • The text registration conditions include a text input times requirement, character requirements, and so on. Taking the text input times requirement as requiring the user to input the wake-up word text one or more times as an example: every time the wearable device detects a wake-up word text entered by the user, it performs text verification and analysis on the input to verify whether the currently entered wake-up word text meets the character requirements. If the wake-up word text entered by the user does not meet the character requirements, the wearable device prompts the user with text or sound, indicating the reason, and asks for re-input. If the wake-up word texts input by the user one or more times all meet the character requirements and are identical, the wearable device determines the phoneme sequence corresponding to the wake-up word text according to the pronunciation dictionary.
  • the wearable device detects whether the currently input wake word text meets character requirements through text verification.
  • The character requirements include one or more of the following: the text must be Chinese (non-Chinese characters do not meet the requirement); it must contain 4 to 6 characters (fewer than 4 or more than 6 do not meet the requirement); it must contain no modal particles; it must not contain more than 3 repeated characters with the same pronunciation; it must differ from existing command words; the overlap ratio of its phonemes with those of an existing command word must not exceed 70% (more than 70% does not meet the requirement, which is used to prevent accidental triggering); and the corresponding phonemes must belong to the pronunciation dictionary (out-of-dictionary phonemes are an exceptional case and do not meet the requirement). A validation sketch follows.
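  • A validation sketch of these character requirements (Python; `g2p`, `lexicon`, and the modal-particle list are hypothetical helpers, and the repeated-character check is a simplification of the same-pronunciation rule):

```python
import re
from collections import Counter

def check_wake_word_text(text, existing_words, lexicon, g2p):
    """Validate a candidate wake-up word text against the character requirements."""
    if not re.fullmatch(r"[\u4e00-\u9fff]{4,6}", text):
        return False                              # Chinese only, 4 to 6 characters
    if any(ch in "啊呀吧呢嘛哦" for ch in text):
        return False                              # no modal particles
    if max(Counter(text).values()) > 3:
        return False                              # limit repeated characters
    phones = g2p(text)                            # text -> phoneme list
    if any(p not in lexicon for p in phones):
        return False                              # phonemes must be in the dictionary
    for word in existing_words:
        overlap = len(set(phones) & set(g2p(word))) / max(1, len(set(phones)))
        if text == word or overlap > 0.7:
            return False                          # distinct from existing command words
    return True
```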
  • the text registration can determine the phoneme sequence corresponding to the wake-up word.
  • the phoneme sequence can subsequently be used as a decoding path for the wake-up word, and the decoding path is used to detect the wake-up word during the voice wake-up process.
  • Voice registration is required in addition to text registration.
  • after the text registration is completed, the wearable device also needs to acquire the bone conduction registration signal, which contains the complete information of the wake-up word.
  • the wearable device also acquires the air conduction registration signal while acquiring the bone conduction registration signal.
  • after obtaining the input bone conduction registration signal and air conduction registration signal, the wearable device verifies whether the bone conduction registration signal and the air conduction registration signal meet the voice registration conditions; if the voice registration conditions are met, the wearable device performs subsequent processing to determine the confidence threshold.
  • the voice registration conditions include a voice input times requirement, a signal-to-noise ratio requirement, a path score requirement, and the like.
  • taking the voice input times requirement as requiring the user to input the wake-up word voice three times (each input including a bone conduction registration signal and an air conduction registration signal) as an example, every time the wearable device detects a wake-up word voice input by the user, it performs pronunciation verification and analysis on the input to check whether the currently input wake-up word voice meets the signal-to-noise ratio requirement and the path score requirement. If the input does not meet the requirements, the wearable device prompts the user, by text or sound, with the reason for the non-compliance and requires re-input. If the wake-up word voices input by the user three times all meet the SNR requirement and the path score requirement, the wearable device determines that the wake-up word voices input by the user meet the voice registration conditions and performs subsequent processing.
  • the wearable device can first detect whether the input wake-up word speech meets the SNR requirement, and then detect whether the input wake-up word speech meets the path score requirement after determining that the input wake-up word speech meets the SNR requirement.
  • the signal-to-noise ratio requirement includes that the signal-to-noise ratio is not lower than a signal-to-noise ratio threshold (below the threshold, the requirement is not met); for example, the signal-to-noise ratio of the bone conduction registration signal is required to be not lower than a first signal-to-noise ratio threshold, and/or the signal-to-noise ratio of the air conduction registration signal is required to be not lower than a second signal-to-noise ratio threshold.
  • the first SNR threshold is greater than the second SNR threshold. If the voice of the wake-up word input by the user does not meet the SNR requirements, the wearable device will prompt the user that the current environment is noisy and not suitable for registration, and the user needs to find a quiet environment to re-enter the voice of the wake-up word.
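  • a minimal sketch of such an SNR check follows, assuming a noise-only segment (for example, samples before the detected voice) is available for estimation; the two threshold values are illustrative, not those of the embodiment:

```python
import numpy as np

def snr_db(speech, noise):
    """Estimate the signal-to-noise ratio in dB from a speech segment and
    a noise-only segment."""
    p_speech = np.mean(np.square(speech, dtype=np.float64))
    p_noise = np.mean(np.square(noise, dtype=np.float64)) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def meets_snr_requirement(bone, bone_noise, air, air_noise,
                          first_threshold_db=20.0, second_threshold_db=10.0):
    # The first (bone conduction) threshold is greater than the second.
    return (snr_db(bone, bone_noise) >= first_threshold_db
            and snr_db(air, air_noise) >= second_threshold_db)
```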
  • the path score requirement includes that the path score obtained from each input wake-up word voice is not less than a calibration threshold, that the average of the three path scores obtained from the three input wake-up word voices is not less than the calibration threshold, and that the path scores obtained from any two input wake-up word voices differ by no more than 100 points (or another value); these conditions can be checked as in the sketch below. The implementation process of obtaining a path score from a wake-up word voice is introduced later; it is essentially similar to the process of obtaining the confidence level based on the bone conduction signal in the aforementioned voice wake-up process.
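  • a sketch of this check; the calibration threshold and the 100-point spread come from the description above, while the function itself is illustrative:

```python
def meets_path_score_requirement(scores, calibration_threshold, max_spread=100.0):
    """scores: the path scores of the three registered wake-up word voices."""
    # Each individual path score must reach the calibration threshold.
    if any(s < calibration_threshold for s in scores):
        return False
    # The average of the three path scores must reach the calibration threshold.
    if sum(scores) / len(scores) < calibration_threshold:
        return False
    # Any two path scores must differ by no more than 100 points (or another value).
    if max(scores) - min(scores) > max_spread:
        return False
    return True
```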
  • the wearable device then determines the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word. Similar to obtaining the confidence level based on the bone conduction signal in the aforementioned voice wake-up process, the wearable device can determine the confidence threshold in a variety of implementation ways, two of which are introduced next.
  • the wearable device determines a fusion registration signal based on the bone conduction registration signal, and determines a confidence threshold and a plurality of template vectors based on the fusion registration signal and the phoneme sequence corresponding to the wake-up word.
  • the wearable device determines the fusion registration signal based on the bone conduction registration signal. It should be noted that there are multiple ways for the wearable device to determine the fusion registration signal based on the bone conduction registration signal, four of which will be introduced next.
  • Method 1 of determining the fusion registration signal based on the bone conduction registration signal: before determining the fusion registration signal, the air conduction registration signal is acquired. The wearable device fuses the initial part of the bone conduction registration signal with the air conduction registration signal to obtain the fusion registration signal, where the initial part of the bone conduction registration signal is determined according to the detection time delay of the voice detection. Optionally, in the embodiments of this application, signal fusion is performed through signal splicing (see the sketch below).
  • the wearable device can also detect the voice segment of the bone conduction registration signal and the air conduction registration signal to intercept the voice segment, and perform signal splicing based on the intercepted voice segment, thereby reducing the amount of data processing. It is also possible to perform preprocessing on the bone conduction registration signal and the air conduction registration signal, such as performing down-sampling and/or gain adjustment on the bone conduction registration signal, performing speech enhancement on the air conduction signal, and the like.
  • the specific implementation manner is similar to the principle of the relevant content in the foregoing embodiments, please refer to the foregoing embodiments, and no detailed description will be given here.
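  • a sketch of this splicing, assuming both signals share a sample rate after the bone conduction signal has been down-sampled and that the detection time delay is known:

```python
import numpy as np

def fuse_by_splicing(bone_signal, air_signal, detection_delay_s, sample_rate=16000):
    """Splice the initial part of the bone conduction registration signal
    (covering the head lost by the air microphone due to the VAD delay)
    in front of the air conduction registration signal."""
    head_len = int(round(detection_delay_s * sample_rate))
    head = bone_signal[:head_len]  # initial part, set by the detection time delay
    return np.concatenate([head, air_signal])
```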
  • Method 2 of determining the fusion registration signal based on the bone conduction registration signal: before determining the fusion registration signal, the air conduction registration signal is acquired.
  • the wearable device generates an enhanced initial registration signal based on the initial part of the bone conduction registration signal, and fuses the enhanced initial registration signal and the air conduction registration signal to obtain a fused registration signal.
  • the initial part of the bone conduction registration signal is determined according to the detection time delay of the voice detection.
  • that is, the wearable device uses the initial part of the bone conduction registration signal to generate an enhanced initial registration signal, and instead of fusing the initial part of the bone conduction registration signal with the air conduction signal, fuses the enhanced initial registration signal with the air conduction registration signal.
  • the wearable device can also detect the voice segment of the bone conduction registration signal and the air conduction registration signal to intercept the voice segment, and perform signal fusion based on the intercepted voice segment, thereby reducing the amount of data processing.
  • the wearable device can also perform preprocessing on the bone conduction registration signal and the air conduction registration signal, such as performing down-sampling and/or gain adjustment on the bone conduction registration signal, performing voice enhancement on the air conduction signal, and the like.
  • the wearable device may input the initial part of the bone conduction registration signal into the generating network model, so as to obtain the enhanced initial registration signal output by the generating network model.
  • the generated network model may be the same as the generated network model described above, or may be another generated network model, which is not limited in this embodiment of the present application.
  • the embodiment of the present application also does not limit the network structure, training method, training equipment, etc. of the generated network model.
  • Method 3 of determining the fusion registration signal based on the bone conduction registration signal: before determining the fusion registration signal, the air conduction registration signal is acquired.
  • the wearable device directly fuses the bone conduction registration signal and the air conduction registration signal to obtain a fusion registration signal.
  • the wearable device can also detect the voice segment of the bone conduction registration signal and the air conduction registration signal to intercept the voice segment, and perform signal fusion based on the intercepted voice segment, thereby reducing the amount of data processing.
  • the wearable device can also perform preprocessing on the bone conduction registration signal and the air conduction registration signal, such as performing down-sampling and/or gain adjustment on the bone conduction registration signal, performing voice enhancement on the air conduction signal, and the like.
  • Method 4 of determining the fusion registration signal based on the bone conduction registration signal: the wearable device determines the bone conduction registration signal as the fusion registration signal.
  • the wearable device directly uses the bone conduction registration signal as the fused registration signal.
  • the wearable device can also detect the voice segment of the bone conduction registration signal to extract the voice segment, and perform subsequent processing based on the extracted voice segment, thereby reducing the amount of data processing.
  • the wearable device may also perform preprocessing on the bone conduction registration signal, for example, perform down-sampling and/or gain adjustment on the bone conduction registration signal.
  • the wearable device inputs the multiple registration audio frames included in the fused registration signal into the first acoustic model, so as to obtain multiple registration posterior probability vectors output by the first acoustic model.
  • the plurality of registration posterior probability vectors correspond one-to-one to the plurality of registration audio frames, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame among the plurality of registration audio frames belongs to a plurality of specified phonemes. That is, each registration posterior probability vector corresponds to one registration audio frame included in the fused registration signal, and a registration posterior probability vector indicates the probability that the phoneme of the corresponding registration audio frame belongs to the plurality of specified phonemes.
  • the wearable device determines the plurality of enrollment posterior probability vectors as a plurality of template vectors.
  • the wearable device determines the confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word. That is, the wearable device processes the fused registration signal through the first acoustic model to obtain the information of the phonemes contained in the fused signal, that is, the registration posterior probability vectors, uses the registration posterior probability vectors as template vectors, and stores the template vectors.
  • the wearable device also decodes the registration posterior probability vectors based on the phoneme sequence (that is, the decoding path) corresponding to the wake-up word to determine a path score, uses the path score as the confidence threshold, and stores the confidence threshold. A decoding sketch is given below.
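  • as an illustrative stand-in for the decoder, a simple best-path dynamic program over the frame-level posteriors can be used; real systems typically use an HMM/Viterbi or keyword-filler decoder, so the following is only a sketch:

```python
import numpy as np

def path_score(posteriors, phoneme_path):
    """Best monotonic alignment of the frame-level posteriors (T, P) to the
    wake-up word's phoneme index sequence; returns the alignment log-score."""
    T, n = posteriors.shape[0], len(phoneme_path)
    logp = np.log(posteriors + 1e-12)
    dp = np.full((T, n), -np.inf)
    dp[0, 0] = logp[0, phoneme_path[0]]
    for t in range(1, T):
        for i in range(n):
            best_prev = dp[t - 1, i]                          # stay on the same phoneme
            if i > 0:
                best_prev = max(best_prev, dp[t - 1, i - 1])  # advance to the next one
            dp[t, i] = best_prev + logp[t, phoneme_path[i]]
    return dp[-1, -1]

# During registration, the score of the fused registration signal is stored
# as the confidence threshold; during wake-up, new scores are compared to it.
```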
  • in summary, in the first implementation, the wearable device first fuses the registration signals (in one of the four ways above), and then processes the fused registration signal through an acoustic model to obtain registration posterior probability vectors. The wearable device decodes the obtained registration posterior probability vectors based on the decoding path corresponding to the wake-up word to obtain the confidence threshold, and stores the obtained registration posterior probability vectors as template vectors.
  • next, the second implementation of determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word is introduced.
  • the wearable device obtains the air conduction registration signal before determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
  • the wearable device determines multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal.
  • the multiple registration posterior probability vectors correspond one-to-one to the multiple registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame among the multiple registration audio frames belongs to multiple specified phonemes.
  • the multiple registration audio frames include registration audio frames included in the bone conduction registration signal and registration audio frames included in the air conduction registration signal.
  • each of the plurality of registration posterior probability vectors corresponds to one registration audio frame included in the bone conduction registration signal or the air conduction registration signal, and a registration posterior probability vector indicates the probability that the phoneme of the corresponding registration audio frame belongs to a plurality of specified phonemes.
  • the wearable device determines a confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
  • the wearable device determines the multiple enrollment posterior probability vectors as multiple template vectors.
  • for the bone conduction registration signal and the air conduction registration signal, reference can be made to the content in the first implementation above, including the generation principle of the bone conduction registration signal and the air conduction registration signal and the preprocessing of the two signals, which will not be described here again.
  • the wearable device determines multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal. It should be noted that there are multiple ways for the wearable device to determine multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, and three of them will be introduced next.
  • Method 1 of determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal: the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the second acoustic model to obtain a third quantity of bone conduction registration posterior probability vectors and a fourth quantity of air conduction registration posterior probability vectors output by the second acoustic model. The wearable device then fuses the first bone conduction registration posterior probability vector with the first air conduction registration posterior probability vector to obtain a second registration posterior probability vector.
  • the initial part of the bone conduction registration signal is determined according to the detection time delay of voice detection
  • the third quantity of bone conduction registration posterior probability vectors correspond one-to-one to the registration audio frames included in the initial part of the bone conduction registration signal, and the fourth quantity of air conduction registration posterior probability vectors correspond one-to-one to the registration audio frames included in the air conduction registration signal. That is, one bone conduction registration posterior probability vector corresponds to one registration audio frame included in the initial part of the bone conduction registration signal, and one air conduction registration posterior probability vector corresponds to one registration audio frame included in the air conduction registration signal.
  • the first bone conduction registration posterior probability vector corresponds to the last registration audio frame of the initial part of the bone conduction registration signal, whose duration is less than the frame duration; the first air conduction registration posterior probability vector corresponds to the first registration audio frame of the air conduction registration signal, whose duration is also less than the frame duration.
  • the multiple registration posterior probability vectors finally determined by the wearable device include the second registration posterior probability vector, the vectors among the third quantity of bone conduction registration posterior probability vectors other than the first bone conduction registration posterior probability vector, and the vectors among the fourth quantity of air conduction registration posterior probability vectors other than the first air conduction registration posterior probability vector.
  • the third quantity and the fourth quantity may be the same or different
  • the third quantity may be the same or different from the aforementioned first quantity
  • the fourth quantity may be the same or different from the aforementioned second quantity.
  • the wearable device adds the first bone conduction registration posterior probability vector and the first air conduction registration posterior probability vector to obtain the second registration posterior probability vector.
  • the principle by which the wearable device obtains the third quantity of bone conduction registration posterior probability vectors and the fourth quantity of air conduction registration posterior probability vectors through the second acoustic model is the same as the principle of obtaining the first quantity of bone conduction posterior probability vectors and the second quantity of air conduction posterior probability vectors through the second acoustic model in the foregoing embodiment, and will not be described in detail here. A sketch of the boundary fusion follows.
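  • one plausible reading of this fusion, sketched below, is an element-wise addition of the two partial-frame posterior vectors followed by renormalization; the renormalization step is an assumption, not stated by the description:

```python
import numpy as np

def fuse_boundary_vectors(first_bone_vec, first_air_vec):
    """Fuse the posterior vector of the last (partial) frame of the bone
    conduction initial part with that of the first (partial) frame of the
    air conduction signal by element-wise addition, then renormalize so
    the result is again a probability distribution."""
    fused = np.asarray(first_bone_vec) + np.asarray(first_air_vec)
    return fused / fused.sum()
```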
  • Method 2 of determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal: the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the third acoustic model to obtain multiple registration posterior probability vectors output by the third acoustic model.
  • the initial part of the bone conduction registration signal is determined according to the detection time delay of the voice detection.
  • Method 3 of determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal: the wearable device inputs the bone conduction registration signal and the air conduction registration signal into the third acoustic model to obtain multiple registration posterior probability vectors. That is, the wearable device directly inputs the bone conduction registration signal and the air conduction registration signal into the third acoustic model at the same time, and the third acoustic model outputs a set of registration posterior probability vectors, so that the obtained multiple registration posterior probability vectors contain the complete information of the wake-up word input by the sound source.
  • after the wearable device determines the multiple registration posterior probability vectors, it determines the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word. The principle is similar to that of the wearable device determining the confidence level based on multiple posterior probability vectors and the phoneme sequence of the wake-up word described above.
  • FIG. 19 to FIG. 24 are flow charts of six wake-up word registration methods provided by the embodiments of the present application. Next, the registration process of the wake-up word in the embodiment of the present application will be explained again with reference to FIG. 19 to FIG. 24 .
  • FIG. 19 corresponds to mode 1 in the first implementation mode of the above-mentioned wake-up word registration, and the registration process of the wake-up word includes text registration and voice registration.
  • the wearable device first performs text registration.
  • the text registration module of the wearable device obtains the user-defined wake-up word text, performs text verification and text analysis on the input wake-up word text, and determines the phoneme sequence corresponding to the wake-up word text that meets the text registration requirements according to the pronunciation dictionary.
  • the phoneme sequence is determined as a decoding path, and the text registration module sends the decoding path to the recognition engine.
  • the recognition engine stores the decoding path.
  • the wearable device then performs voice registration.
  • the voice registration module of the wearable device acquires voice registration signals, including bone conduction registration signals and air conduction registration signals.
  • the wearable device acquires the bone conduction registration signal and the air conduction registration signal through the VAD, and may also perform preprocessing on the acquired bone conduction registration signal and the air conduction registration signal.
  • the voice registration module performs pronunciation verification on the bone conduction registration signal and the air conduction registration signal
  • the fusion module fuses the bone conduction registration signal and the air conduction registration signal that meet the voice registration requirements after verification to obtain a fusion registration signal.
  • the fused registration signal in FIG. 19 is referred to as fused registration signal 1 here.
  • the voice registration module processes the fused registration signal 1 through the first acoustic model to obtain multiple registration posterior probability vectors, determines a path score by decoding the multiple registration posterior probability vectors, and sends the path score to the recognition engine as the wake-up threshold (that is, the confidence threshold). The recognition engine stores the wake-up threshold, which is used for first-level false wake-up suppression in the user's subsequent voice wake-up.
  • the speech registration module sends the obtained multiple registration posterior probability vectors to the recognition engine as multiple template vectors, and the recognition engine stores the multiple template vectors, which are used for second-level false wake-up suppression in the user's subsequent voice wake-up.
  • FIG. 20 to FIG. 22 respectively correspond to mode 2, mode 3 and mode 4 in the first implementation mode of the above-mentioned wake-up word registration.
  • in FIG. 20, the voice registration module of the wearable device generates an enhanced initial registration signal based on the initial part of the bone conduction registration signal, and fuses the enhanced initial registration signal with the air conduction registration signal to obtain a fused registration signal.
  • the fused registration signal in FIG. 20 is referred to as fused registration signal 2 here.
  • in FIG. 21, the voice registration module directly fuses the bone conduction registration signal and the air conduction registration signal to obtain a fused registration signal.
  • the fused registration signal in FIG. 21 is referred to as fused registration signal 3 here.
  • in FIG. 22, the voice registration module may directly determine the bone conduction registration signal as the fused registration signal without acquiring the air conduction registration signal.
  • the fused registration signal in FIG. 22 is referred to as fused registration signal 4 here.
  • FIG. 23 corresponds to mode 1 in the second implementation mode of the above-mentioned wake-up word registration.
  • the voice registration module of the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the second acoustic model respectively, to obtain the third quantity of bone conduction registration posterior probability vectors and the fourth quantity of air conduction registration posterior probability vectors respectively output by the second acoustic model.
  • the speech registration module fuses the third number of bone conduction registration posterior probability vectors and the fourth number of air conduction registration posterior probability vectors to obtain multiple registration posterior probability vectors.
  • FIG. 24 corresponds to mode 2 and mode 3 in the second implementation mode of the above-mentioned wake-up word registration.
  • the voice registration module of the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the third acoustic model respectively, or inputs the bone conduction registration signal and the air conduction registration signal into the third acoustic model, to obtain multiple registration posterior probability vectors output by the third acoustic model.
  • the processing flow of the bone conduction registration signal and the air conduction registration signal in the wake-up word registration process is similar to the processing flow of the bone conduction signal and the air conduction signal in the voice wake-up process, except that the wake-up word registration process obtains the wake-up threshold and the template vectors, while the voice wake-up process detects the wake-up word. The template vectors can improve the accuracy and robustness of this scheme.
  • this solution uses the bone conduction signal to directly or implicitly perform head-loss compensation on the air conduction signal, or directly detects the wake-up word based on the bone conduction signal. Since the bone conduction signal contains the command word information input by the sound source, that is, the bone conduction signal does not lose its head, the recognition accuracy of the wake-up word is higher, and the accuracy of voice wake-up is higher.
  • the voice wake-up process and the wake-up word registration process have been introduced above. The acoustic models in the embodiments of this application, that is, the first acoustic model, the second acoustic model, and the third acoustic model, all need to be pre-trained. Taking a computer device training the acoustic models as an example, the training process of the acoustic models is introduced next.
  • the computer device first obtains a second training data set; the second training data set includes a plurality of second sample signal pairs, one second sample signal pair includes a bone conduction sample signal and an air conduction sample signal, and one second sample signal pair corresponds to one command word.
  • the second training data set includes directly collected voice data, public voice data and/or voice data purchased from a third party.
  • the computer device can preprocess the acquired second training data set to obtain a preprocessed second training data set; the preprocessing can simulate real speech data so as to be closer to the speech of real scenes and increase the diversity of training samples. For example, the second training data set is backed up, that is, an additional copy of the data is made, the backed-up data is divided into multiple parts, and one kind of preprocessing is performed on each part of the data. The preprocessing for each part can be different, which can double the total training data and ensure the comprehensiveness of the data while achieving a balance between performance and training overhead, so that the accuracy and robustness of speech recognition can be improved to a certain extent.
  • the method for preprocessing each part of the data may include one or more of adding noise, increasing volume, adding reverberation, time shifting, changing pitch, and time stretching, as in the sketch below.
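  • the listed transforms might look as follows; the parameters are illustrative, and pitch change and reverberation are omitted because they need a proper DSP implementation:

```python
import numpy as np

def add_noise(x, snr_db=10.0):
    """Mix in white noise at a target SNR."""
    noise = np.random.randn(len(x))
    scale = np.sqrt(np.mean(x ** 2) / (10 ** (snr_db / 10.0) * np.mean(noise ** 2)))
    return x + scale * noise

def increase_volume(x, gain=1.5):
    return np.clip(x * gain, -1.0, 1.0)

def time_shift(x, shift=160):
    return np.roll(x, shift)

def time_stretch(x, rate=1.1):
    """Naive resampling-based stretch; production systems would use a
    phase vocoder (and comparable methods for pitch change and reverb)."""
    idx = np.arange(0, len(x) - 1, rate)
    return np.interp(idx, np.arange(len(x)), x)

# Each backed-up part of the training set receives a different one of these
# transforms, roughly doubling the total amount of training data.
```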
  • the computer device determines a plurality of fusion sample signals based on the second training data set, and there are four ways in total. It should be noted that these four ways correspond one-to-one to the four ways in which the wearable device determines the fusion signal based on the bone conduction signal in the recognition process (that is, the voice wake-up process) in the above-mentioned embodiment.
  • corresponding to the way in which the wearable device fuses the initial part of the bone conduction signal with the air conduction signal to obtain a fusion signal, the computer device fuses, for each second sample signal pair among the plurality of second sample signal pairs, the initial part of the included bone conduction sample signal with the included air conduction sample signal, to obtain a plurality of fused sample signals.
  • if, during the recognition process, the wearable device generates an enhanced initial signal based on the initial part of the bone conduction signal and fuses the enhanced initial signal with the air conduction signal to obtain a fusion signal, then during the training process, the computer device generates an enhanced initial sample signal from the initial part of the bone conduction sample signal included in each second sample signal pair, and fuses each enhanced initial sample signal with the corresponding air conduction sample signal, to obtain a plurality of fused sample signals.
  • corresponding to the way in which the wearable device directly fuses the bone conduction signal and the air conduction signal to obtain a fusion signal, the computer device directly fuses the bone conduction sample signal and the air conduction sample signal included in each second sample signal pair among the plurality of second sample signal pairs, to obtain a plurality of fused sample signals.
  • corresponding to the way in which the wearable device determines the bone conduction signal as the fusion signal, the computer device determines the bone conduction sample signals included in the plurality of second sample signal pairs as a plurality of fused sample signals.
  • the initial part of the bone conduction sample signal is determined according to the detection time delay of the voice detection or set according to experience.
  • the computer device trains the first initial acoustic model by using the plurality of fused sample signals, so as to obtain the first acoustic model in the embodiment of the present application.
  • the network structure of the first initial acoustic model is the same as the network structure of the first acoustic model.
  • before the computer device determines the plurality of fused sample signals based on the second training data set, it preprocesses the bone conduction sample signals and air conduction sample signals included in the second training data set; for example, front-end enhancement is performed on the air conduction sample signals, and down-sampling and gain adjustment are performed on the bone conduction sample signals.
  • the computer device inputs the initial part of the bone conduction sample signal included in each of the plurality of second sample signal pairs into the generating network model, to obtain the enhanced initial sample signal output by the generating network model. The generating network model may be the same as the generating network model in the foregoing embodiments, or may be a different model, which is not limited in this embodiment of this application.
  • FIG. 25 to FIG. 28 are four schematic diagrams of the first acoustic model respectively trained based on the above four methods provided by the embodiment of the present application.
  • the second training data set acquired by the computer device includes bone conduction data (bone conduction sample signals) and air conduction data (air conduction sample signals); the computer device performs down-sampling and/or gain adjustment on the bone conduction data through the fusion module, and performs front-end enhancement on the air conduction data through the front-end enhancement module.
  • Fig. 25 to Fig. 27 correspond to the first three of the four methods, and the fusion module uses the corresponding method to perform head loss compensation on the air conduction signal through the bone conduction data, so as to obtain the training input data.
  • Fig. 28 corresponds to the fourth method, in which air conduction data is not needed, and the fusion module directly uses the bone conduction data as the training input data. Then, the computer device trains the network model (namely the first initial acoustic model) with the training input data, and adjusts the network model through the loss function, gradient descent algorithm, and error backpropagation, so as to obtain the trained first acoustic model; a training-loop sketch follows.
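  • the adjustment through a loss function, gradient descent, and error backpropagation corresponds to a standard supervised training loop; the following PyTorch-style sketch assumes frame-level phoneme targets and a `model` defined elsewhere, and is not the embodiment's actual training code:

```python
import torch
import torch.nn as nn

def train_acoustic_model(model, loader, epochs=10, lr=1e-3):
    """loader yields (features, targets): features of shape (B, T, F) and
    frame-level phoneme targets of shape (B, T)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # gradient descent variant
    loss_fn = nn.CrossEntropyLoss()                          # the loss function
    for _ in range(epochs):
        for features, targets in loader:
            logits = model(features)                 # (B, T, num_phonemes)
            loss = loss_fn(logits.flatten(0, 1), targets.flatten())
            optimizer.zero_grad()
            loss.backward()                          # error backpropagation
            optimizer.step()                         # parameter update
    return model
```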
  • taking the training of the second acoustic model as an example: corresponding to method 1 in which the wearable device determines multiple posterior probability vectors based on the bone conduction signal and the air conduction signal during the voice wake-up process, during training the computer device uses the initial part of the bone conduction sample signal and the air conduction sample signal included in each second sample signal pair as the input of the second initial acoustic model, to train the second initial acoustic model and obtain the second acoustic model. The network structure of the second initial acoustic model is the same as that of the second acoustic model; that is, the second initial acoustic model also includes two input layers, one shared network layer, and two output layers.
  • Fig. 29 is a schematic diagram of a second acoustic model obtained through training provided by an embodiment of the present application.
  • the second training data set acquired by the computer device includes bone conduction data (bone conduction sample signals) and air conduction data (air conduction sample signals); the computer device performs down-sampling and/or gain adjustment on the bone conduction data, and performs front-end enhancement on the air conduction data.
  • the computer device uses the bone conduction data as training input data 1 and the air conduction data as training input data 2.
  • the computer device trains the network model (that is, the second initial acoustic model) with training input data 1 and training input data 2, and adjusts the network model through the loss function, gradient descent algorithm, and error backpropagation, thereby obtaining the trained second acoustic model; a structural sketch is given below.
  • the training input data 1 and the training input data 2 may correspond to the same loss function or different loss functions, which is not limited in this embodiment of the present application.
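  • the structure just described could be sketched in PyTorch as follows; all layer types and sizes are assumptions, since the embodiment does not fix them:

```python
import torch
import torch.nn as nn

class SecondAcousticModelSketch(nn.Module):
    """Two input layers, one shared network layer, and two output layers:
    one branch for bone conduction frames, one for air conduction frames."""
    def __init__(self, feat_dim=40, hidden=128, num_phonemes=100):
        super().__init__()
        self.bone_in = nn.Linear(feat_dim, hidden)               # input layer 1
        self.air_in = nn.Linear(feat_dim, hidden)                # input layer 2
        self.shared = nn.GRU(hidden, hidden, batch_first=True)   # shared network layer
        self.bone_out = nn.Linear(hidden, num_phonemes)          # output layer 1
        self.air_out = nn.Linear(hidden, num_phonemes)           # output layer 2

    def forward(self, bone_feats, air_feats):
        hb, _ = self.shared(torch.relu(self.bone_in(bone_feats)))
        ha, _ = self.shared(torch.relu(self.air_in(air_feats)))
        # Softmax over phonemes yields per-frame posterior probability vectors.
        return self.bone_out(hb).softmax(-1), self.air_out(ha).softmax(-1)
```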
  • taking the training of the third acoustic model as an example: corresponding to method 2 in which the wearable device determines multiple posterior probability vectors based on the bone conduction signal and the air conduction signal during the voice wake-up process, during training the computer device uses the initial part of the bone conduction sample signal and the air conduction sample signal included in each second sample signal pair as the input of the third initial acoustic model, to train the third initial acoustic model and obtain the third acoustic model.
  • corresponding to method 3, the computer device uses the bone conduction sample signal and the air conduction sample signal included in each second sample signal pair among the plurality of second sample signal pairs as the input of the third initial acoustic model, to train the third initial acoustic model and obtain the third acoustic model.
  • the network structure of the third initial acoustic model is the same as the network structure of the third acoustic model. That is, the third initial acoustic model also includes two input layers, a concatenation layer, a network parameter layer and an output layer.
  • FIG. 30 is a schematic diagram of a third acoustic model obtained through training according to an embodiment of the present application.
  • the second training data set acquired by the computer device includes bone conduction data (bone conduction sample signals) and air conduction data (air conduction sample signals); the computer device performs down-sampling and/or gain adjustment on the bone conduction data, and performs front-end enhancement on the air conduction data.
  • the computer device uses the bone conduction data or the initial part of the bone conduction data as training input data 1, and uses the air conduction data as training input data 2.
  • the computer equipment trains the network model (that is, the third initial acoustic model) through the training input data 1 and the training input data 2, and adjusts the network model through the loss function, gradient descent algorithm and error backpropagation, thereby obtaining the trained third acoustic model.
  • that is, in the training process, the bone conduction sample signal is likewise used to directly or implicitly perform head-loss compensation on the air conduction sample signal, so as to construct the training input data with which the initial acoustic models are trained to obtain the trained acoustic models.
  • in voice wake-up, the bone conduction signal is used to directly or implicitly perform head-loss compensation on the air conduction signal in the same way. Since the bone conduction signal contains the command word information input by the sound source, that is, the bone conduction signal does not lose its head, the recognition accuracy of wake-up word detection based on the bone conduction signal is higher, the accuracy of voice wake-up is higher, and robustness is also improved.
  • Fig. 31 is a schematic structural diagram of a voice wake-up device 3100 provided by an embodiment of the present application.
  • the voice wake-up apparatus 3100 can be implemented by software, hardware, or a combination of the two as part or all of an electronic device, and the electronic device can be the wearable device shown in FIG. 2.
  • the device 3100 includes: a voice detection module 3101 , a wake-up word detection module 3102 and a voice wake-up module 3103 .
  • the voice detection module 3101 is used to perform voice detection according to the bone conduction signal collected by the bone conduction microphone, and the bone conduction signal includes command word information input by the sound source;
  • a wake-up word detection module 3102 configured to detect a wake-up word based on a bone conduction signal when a voice input is detected
  • the voice wake-up module 3103 is configured to wake up the device to be woken up by voice when it is detected that the command word includes a wake-up word.
  • the wake-up word detection module 3102 includes:
  • the first determining submodule is used to determine the fusion signal based on the bone conduction signal
  • the wake-up word detection submodule is used to detect the wake-up word on the fusion signal.
  • the device 3100 also includes:
  • the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
  • the first determining submodule is used to:
  • fuse the initial part of the bone conduction signal with the air conduction signal to obtain a fusion signal, the initial part of the bone conduction signal being determined according to the detection delay of the voice detection; or,
  • generate an enhanced initial signal based on the initial part of the bone conduction signal, and fuse the enhanced initial signal with the air conduction signal to obtain a fusion signal, the initial part of the bone conduction signal being determined according to the detection delay of the voice detection; or,
  • directly fuse the bone conduction signal with the air conduction signal to obtain a fusion signal.
  • the wake-up word detection submodule is used for:
  • input the multiple audio frames included in the fusion signal into the first acoustic model to obtain multiple posterior probability vectors output by the first acoustic model, the multiple posterior probability vectors corresponding one-to-one to the multiple audio frames;
  • the first posterior probability vector among the multiple posterior probability vectors is used to indicate the probability that the phoneme of the first audio frame among the multiple audio frames belongs to multiple specified phonemes;
  • perform wake-up word detection based on the multiple posterior probability vectors.
  • the device 3100 also includes:
  • the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
  • Wake-up word detection module 3102 includes:
  • the second determining submodule is used to determine multiple posterior probability vectors based on the bone conduction signal and the air conduction signal, the multiple posterior probability vectors corresponding one-to-one to the multiple audio frames included in the bone conduction signal and the air conduction signal, and the first posterior probability vector among the multiple posterior probability vectors being used to indicate the probability that the phoneme of the first audio frame among the multiple audio frames belongs to multiple specified phonemes;
  • the wake-up word detection submodule is configured to detect wake-up words based on the plurality of posterior probability vectors.
  • the second determination submodule is used for:
  • input the initial part of the bone conduction signal and the air conduction signal into the second acoustic model to obtain a first number of bone conduction posterior probability vectors and a second number of air conduction posterior probability vectors output by the second acoustic model, the initial part of the bone conduction signal being determined according to the detection delay of the voice detection, the first number of bone conduction posterior probability vectors corresponding one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors corresponding one-to-one to the audio frames included in the air conduction signal;
  • fuse the first bone conduction posterior probability vector with the first air conduction posterior probability vector to obtain a second posterior probability vector, the first bone conduction posterior probability vector corresponding to the last audio frame of the initial part of the bone conduction signal, the duration of the last audio frame being less than the frame duration, and the first air conduction posterior probability vector corresponding to the first audio frame of the air conduction signal, the duration of the first audio frame being less than the frame duration;
  • the multiple posterior probability vectors include the second posterior probability vector, the vectors among the first number of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and the vectors among the second number of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
  • the second determination submodule is used for:
  • input the initial part of the bone conduction signal and the air conduction signal into the third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model, the initial part of the bone conduction signal being determined according to the detection delay of the voice detection; or,
  • input the bone conduction signal and the air conduction signal into the third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model.
  • the wake-up word detection submodule is used for:
  • determine, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word;
  • in the case that the confidence exceeds the confidence threshold, determine that it is detected that the command word includes the wake-up word.
  • the wake-up word detection submodule is used for:
  • determine, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word;
  • in the case that the confidence exceeds the confidence threshold and the distance condition is satisfied between the multiple posterior probability vectors and the multiple template vectors, determine that it is detected that the command word includes the wake-up word, the multiple template vectors indicating the probability that the phonemes of a voice signal containing the complete information of the wake-up word belong to the multiple specified phonemes.
  • in the case that the multiple posterior probability vectors correspond one-to-one to the multiple template vectors, the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than a distance threshold, as in the sketch below.
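  • assuming Euclidean distance (the metric is not fixed by the description), the distance condition can be sketched as:

```python
import numpy as np

def satisfies_distance_condition(posteriors, templates, distance_threshold):
    """posteriors, templates: arrays of shape (N, P) corresponding row by row.
    The average per-frame distance must be below the threshold (used here
    as second-level false wake-up suppression)."""
    distances = np.linalg.norm(np.asarray(posteriors) - np.asarray(templates), axis=1)
    return float(distances.mean()) < distance_threshold
```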
  • the device 3100 also includes:
  • the obtaining module is used to obtain the bone conduction registration signal, and the bone conduction registration signal includes complete information of the wake-up word;
  • the determination module is configured to determine a confidence threshold and a plurality of template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
  • the determination module includes:
  • the third determining submodule is used to determine the fusion registration signal based on the bone conduction registration signal
  • the fourth determining submodule is configured to determine a confidence threshold and a plurality of template vectors based on the fused registration signal and the phoneme sequence corresponding to the wake-up word.
  • the fourth determining submodule is used for:
  • input the multiple registration audio frames included in the fused registration signal into the first acoustic model to obtain multiple registration posterior probability vectors output by the first acoustic model, the multiple registration posterior probability vectors corresponding one-to-one to the multiple registration audio frames, and the first registration posterior probability vector among them indicating the probability that the phoneme of the first registration audio frame among the multiple registration audio frames belongs to multiple specified phonemes;
  • determine the multiple registration posterior probability vectors as multiple template vectors;
  • determine a confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
  • the device 3100 also includes:
  • An acquisition module configured to acquire an air conduction registration signal
  • the determining module includes:
  • the fifth determining submodule is used to determine multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, the multiple registration posterior probability vectors corresponding one-to-one to the multiple registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and the first registration posterior probability vector among the multiple registration posterior probability vectors indicating the probability that the phoneme of the first registration audio frame among the multiple registration audio frames belongs to multiple specified phonemes;
  • the sixth determining submodule is configured to determine a confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
  • a bone conduction microphone is used to collect bone conduction signals for voice detection, which can ensure low power consumption.
  • since the acquired air conduction signal may lose its head due to the delay of voice detection, it may not contain the complete information of the command word input by the sound source, while the bone conduction signal collected by the bone conduction microphone contains the command word information input by the sound source, that is, the bone conduction signal does not lose its head; therefore, this solution detects the wake-up word based on the bone conduction signal. In this way, the recognition accuracy of the wake-up word is higher, and the accuracy of voice wake-up is higher.
  • when the voice wake-up apparatus provided by the above embodiment performs voice wake-up, the division of the above functional modules is used only as an example for illustration. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above.
  • the voice wake-up device and the voice wake-up method embodiments provided in the above embodiments belong to the same idea, and the specific implementation process thereof is detailed in the method embodiments, and will not be repeated here.
  • all or part of the above embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or may be a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • it should be noted that the information (including but not limited to user equipment information, user personal information, and the like), data (including but not limited to data used for analysis, stored data, displayed data, and the like), and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the voice signals involved in the embodiments of this application are all obtained with sufficient authorization.


Abstract

A voice wake-up method, apparatus, device, storage medium, and program product, belonging to the technical field of speech recognition. The method includes: performing voice detection according to a bone conduction signal collected by a bone conduction microphone, the bone conduction signal containing command word information input by a sound source (401); in the case that voice input is detected, performing wake-up word detection based on the bone conduction signal (402); and when it is detected that the command word includes the wake-up word, performing voice wake-up on a device to be woken up (403). Collecting the bone conduction signal through the bone conduction microphone for voice detection can ensure low power consumption, and the recognition accuracy of the wake-up word is high.

Description

Voice wake-up method, apparatus, device, storage medium, and program product

This application claims priority to Chinese Patent Application No. 202111005443.6, filed on August 30, 2021 and entitled "Voice wake-up method, apparatus, device, storage medium, and program product", the entire contents of which are incorporated herein by reference.

Technical Field

Embodiments of this application relate to the technical field of speech recognition, and in particular to a voice wake-up method, apparatus, device, storage medium, and program product.

Background

At present, more and more smart devices complete tasks through voice control. Usually, a smart device needs to be woken up by the user speaking a wake-up word before it can receive instructions to complete tasks. In addition, with the development of bone conduction devices, a large number of bone conduction microphones are used in wearable devices, and smart devices are woken up through wearable devices such as wireless earphones, smart glasses, and smart watches. The sensor in a bone conduction microphone is a non-acoustic sensor that collects the vibration signal of the vocal cords when people speak and converts the vibration signal into an electrical signal, which is called a bone conduction signal.

In the related art, a wearable device is equipped with a bone conduction microphone and an air microphone. To achieve low power consumption of the wearable device, the air microphone is in a dormant state before the smart device is woken up. Since the power consumption of the bone conduction microphone is low, the bone conduction microphone can be used to collect bone conduction signals, and voice detection (such as voice activity detection (voice activate detector, VAD)) is performed based on the bone conduction signals, so as to control the switching of the air microphone to reduce power consumption. When voice detection confirms that there is currently voice input, the air microphone is turned on, an air conduction signal is collected through the air microphone, and the wake-up word is recognized based on the air conduction signal, that is, voice wake-up is performed.

However, since voice detection has an algorithmic delay, the beginning of the voice of the input command word may be truncated, that is, the collected air conduction signal may lose its head and thus not contain the complete information of the command word input by the sound source, resulting in low recognition accuracy of the wake-up word and low accuracy of voice wake-up.
Summary

Embodiments of this application provide a voice wake-up method, apparatus, device, storage medium, and program product, which can improve the accuracy of voice wake-up. The technical solutions are as follows:

In a first aspect, a voice wake-up method is provided, and the method includes:

performing voice detection according to a bone conduction signal collected by a bone conduction microphone, the bone conduction signal containing command word information input by a sound source; in the case that voice input is detected, performing wake-up word detection based on the bone conduction signal; and when it is detected that the command word includes a wake-up word, performing voice wake-up on a device to be woken up.

In the embodiments of this application, collecting the bone conduction signal through the bone conduction microphone for voice detection can ensure low power consumption. In addition, considering that the delay of voice detection may cause the collected air conduction signal to lose its head and thus not contain the complete information of the command word input by the sound source, while the bone conduction signal collected by the bone conduction microphone contains the command word information input by the sound source, that is, the bone conduction signal does not lose its head, this solution performs wake-up word detection based on the bone conduction signal. In this way, the recognition accuracy of the wake-up word is higher, and the accuracy of voice wake-up is higher.

Optionally, performing wake-up word detection based on the bone conduction signal includes: determining a fusion signal based on the bone conduction signal, and performing wake-up word detection on the fusion signal. It should be noted that the fusion signal determined based on the bone conduction signal also contains the command word information input by the sound source.

Optionally, before determining the fusion signal based on the bone conduction signal, the method further includes: turning on an air microphone and collecting an air conduction signal through the air microphone. Determining the fusion signal based on the bone conduction signal includes: fusing the initial part of the bone conduction signal with the air conduction signal to obtain the fusion signal, the initial part of the bone conduction signal being determined according to the detection time delay of the voice detection; or generating an enhanced initial signal based on the initial part of the bone conduction signal and fusing the enhanced initial signal with the air conduction signal to obtain the fusion signal, the initial part of the bone conduction signal being determined according to the detection time delay of the voice detection; or directly fusing the bone conduction signal with the air conduction signal to obtain the fusion signal. That is, the embodiments of this application provide three methods of performing head-loss compensation on the air conduction signal with the bone conduction signal, namely, performing head-loss compensation on the air conduction signal directly through explicit signal fusion. Optionally, signal fusion is performed through signal splicing.

Optionally, determining the fusion signal based on the bone conduction signal includes: determining the bone conduction signal as the fusion signal. That is, in the embodiments of this application, wake-up word detection may also be performed directly with the bone conduction signal.

Optionally, performing wake-up word detection on the fusion signal includes: inputting multiple audio frames included in the fusion signal into a first acoustic model to obtain multiple posterior probability vectors output by the first acoustic model, the multiple posterior probability vectors corresponding one-to-one to the multiple audio frames, and a first posterior probability vector among the multiple posterior probability vectors being used to indicate the probability that the phoneme of a first audio frame among the multiple audio frames belongs to multiple specified phonemes; and performing wake-up word detection based on the multiple posterior probability vectors. That is, the fusion signal is first processed through the first acoustic model to obtain the multiple posterior probability vectors respectively corresponding to the multiple audio frames included in the fusion signal, and wake-up word detection is then performed based on the multiple posterior probability vectors, for example by decoding the multiple posterior probability vectors.

Optionally, before performing wake-up word detection based on the bone conduction signal, the method further includes: turning on the air microphone and collecting the air conduction signal through the air microphone. Performing wake-up word detection based on the bone conduction signal includes: determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal, the multiple posterior probability vectors corresponding one-to-one to the multiple audio frames included in the bone conduction signal and the air conduction signal, and a first posterior probability vector among the multiple posterior probability vectors being used to indicate the probability that the phoneme of a first audio frame among the multiple audio frames belongs to multiple specified phonemes; and performing wake-up word detection based on the multiple posterior probability vectors. That is, in the embodiments of this application, signal fusion may be skipped, and the posterior probability vectors respectively corresponding to the audio frames may be determined directly based on the bone conduction signal and the air conduction signal, so that the obtained multiple posterior probability vectors implicitly contain, in the form of phoneme probabilities, the command word information input by the sound source; that is, head-loss compensation is implicitly performed on the air conduction signal with the bone conduction signal.

Optionally, determining the multiple posterior probability vectors based on the bone conduction signal and the air conduction signal includes: inputting the initial part of the bone conduction signal and the air conduction signal into a second acoustic model to obtain a first number of bone conduction posterior probability vectors and a second number of air conduction posterior probability vectors output by the second acoustic model, the initial part of the bone conduction signal being determined according to the detection time delay of the voice detection, the first number of bone conduction posterior probability vectors corresponding one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors corresponding one-to-one to the audio frames included in the air conduction signal; and fusing a first bone conduction posterior probability vector with a first air conduction posterior probability vector to obtain a second posterior probability vector, the first bone conduction posterior probability vector corresponding to the last audio frame of the initial part of the bone conduction signal, the duration of the last audio frame being less than the frame duration, the first air conduction posterior probability vector corresponding to the first audio frame of the air conduction signal, the duration of the first audio frame being less than the frame duration, and the multiple posterior probability vectors including the second posterior probability vector, the vectors among the first number of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and the vectors among the second number of air conduction posterior probability vectors other than the first air conduction posterior probability vector. That is, in the embodiments of this application, the initial part of the bone conduction signal and the air conduction signal may be processed separately through the second acoustic model to obtain the corresponding bone conduction posterior probability vectors and air conduction posterior probability vectors, and head-loss compensation is implicitly performed on the air conduction signal with the bone conduction signal by fusing the first bone conduction posterior probability vector with the first air conduction posterior probability vector.

Optionally, determining the multiple posterior probability vectors based on the bone conduction signal and the air conduction signal includes: inputting the initial part of the bone conduction signal and the air conduction signal into a third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model, the initial part of the bone conduction signal being determined according to the detection time delay of the voice detection; or inputting the bone conduction signal and the air conduction signal into the third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model. That is, in the embodiments of this application, the initial part of the bone conduction signal and the air conduction signal may be separately input into the third acoustic model, and the multiple posterior probability vectors are obtained directly through the third acoustic model; the two signals are implicitly fused while the third acoustic model processes them, which implicitly performs head-loss compensation on the air conduction signal with the bone conduction signal.

Optionally, performing wake-up word detection based on the multiple posterior probability vectors includes: determining, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and in the case that the confidence exceeds a confidence threshold, determining that it is detected that the command word includes the wake-up word. For example, the confidence is obtained by decoding the multiple posterior probability vectors, and the confidence threshold is then used to judge whether the command word includes the wake-up word; that is, in the case that the confidence condition is satisfied, it is determined that it is detected that the command word includes the wake-up word.

Optionally, performing wake-up word detection based on the multiple posterior probability vectors includes: determining, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and in the case that the confidence exceeds the confidence threshold and a distance condition is satisfied between the multiple posterior probability vectors and multiple template vectors, determining that it is detected that the command word includes the wake-up word, the multiple template vectors indicating the probability that the phonemes of a voice signal containing the complete information of the wake-up word belong to the multiple specified phonemes. That is, in the case that the confidence condition is satisfied and the templates match, it is determined that it is detected that the command word includes the wake-up word, so as to avoid false wake-up as much as possible.

Optionally, in the case that the multiple posterior probability vectors correspond one-to-one to the multiple template vectors, the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than a distance threshold. That is, whether the templates match can be judged by the average distance between the vectors.

Optionally, the method further includes: obtaining a bone conduction registration signal, the bone conduction registration signal containing the complete information of the wake-up word; and determining the confidence threshold and the multiple template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word. That is, in the embodiments of this application, the confidence threshold and the multiple template vectors may also be determined, during wake-up word registration, based on the bone conduction registration signal containing the complete information of the wake-up word; using the confidence threshold and the multiple template vectors obtained in this way for wake-up word detection in the subsequent voice wake-up process can improve the accuracy of wake-up word detection and thus reduce false wake-ups.

Optionally, determining the confidence threshold and the multiple template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word includes: determining a fused registration signal based on the bone conduction registration signal; and determining the confidence threshold and the multiple template vectors based on the fused registration signal and the phoneme sequence corresponding to the wake-up word. That is, during wake-up word registration, the fused registration signal may first be obtained through signal fusion; the obtained fused registration signal contains the information of the command word input by the sound source, and the confidence threshold and the multiple template vectors are then determined based on the fused registration signal.

Optionally, determining the confidence threshold and the multiple template vectors based on the fused registration signal and the phoneme sequence corresponding to the wake-up word includes: inputting multiple registration audio frames included in the fused registration signal into the first acoustic model to obtain multiple registration posterior probability vectors output by the first acoustic model, the multiple registration posterior probability vectors corresponding one-to-one to the multiple registration audio frames, and a first registration posterior probability vector among the multiple registration posterior probability vectors indicating the probability that the phoneme of a first registration audio frame among the multiple registration audio frames belongs to multiple specified phonemes; determining the multiple registration posterior probability vectors as the multiple template vectors; and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word. That is, similarly to the processing of the fusion signal in the voice wake-up process, during wake-up word registration the fused registration signal may first be processed through the first acoustic model to obtain the multiple registration posterior probability vectors respectively corresponding to the multiple registration audio frames included in the fused registration signal, and the confidence threshold is then determined based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word, for example by decoding the multiple registration posterior probability vectors. In addition, the multiple registration posterior probability vectors may be determined as the multiple template vectors.

Optionally, before determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word, the method further includes: obtaining an air conduction registration signal. Determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word includes: determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, the multiple registration posterior probability vectors corresponding one-to-one to the multiple registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and a first registration posterior probability vector among the multiple registration posterior probability vectors indicating the probability that the phoneme of a first registration audio frame among the multiple registration audio frames belongs to multiple specified phonemes; and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word. That is, during wake-up word registration, signal fusion may also be skipped, and the registration posterior probability vectors respectively corresponding to the registration audio frames may be determined directly based on the bone conduction registration signal and the air conduction registration signal.
第二方面,提供了一种语音唤醒的装置,所述语音唤醒的装置具有实现上述第一方面中语音唤醒的方法行为的功能。所述语音唤醒的装置包括一个或多个模块,该一个或多个模块用于实现上述第一方面所提供的语音唤醒的方法。
也即是,提供了一种语音唤醒的装置,该装置包括:
语音检测模块,用于根据骨导麦克采集的骨导信号进行语音检测,该骨导信号包含声源输入的命令词信息;
唤醒词检测模块,用于在检测到有语音输入的情况下,基于骨导信号进行唤醒词的检测;
语音唤醒模块,用于在检测到该命令词包括唤醒词时,对待唤醒设备进行语音唤醒。
可选地,唤醒词检测模块包括:
第一确定子模块,用于基于骨导信号确定融合信号;
唤醒词检测子模块,用于对该融合信号进行唤醒词的检测。
可选地,该装置还包括:
处理模块,用于开启空气麦克,通过空气麦克采集气导信号;
第一确定子模块用于:
将骨导信号的起始部分和气导信号进行融合,以得到融合信号,该骨导信号的起始部分根据语音检测的检测时延确定;或者,
基于骨导信号的起始部分生成增强起始信号,将增强起始信号和气导信号进行融合,以得到融合信号,该骨导信号的起始部分根据语音检测的检测时延确定;或者,
将骨导信号和气导信号直接进行融合,以得到融合信号。
可选地,唤醒词检测子模块用于:
将该融合信号包括的多个音频帧输入第一声学模型,以得到第一声学模型输出的多个后验概率向量,该多个后验概率向量与该多个音频帧一一对应,该多个后验概率向量中的第一后验概率向量用于指示该多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
基于该多个后验概率向量进行唤醒词的检测。
可选地,该装置还包括:
处理模块,用于开启空气麦克,通过空气麦克采集气导信号;
唤醒词检测模块包括:
第二确定子模块,用于基于骨导信号和气导信号,确定多个后验概率向量,该多个后验概率向量与骨导信号和气导信号包括的多个音频帧一一对应,该多个后验概率向量中的第一后验概率向量用于指示该多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
唤醒词检测子模块,用于基于该多个后验概率向量进行唤醒词的检测。
可选地,第二确定子模块用于:
将骨导信号的起始部分和气导信号输入第二声学模型,以得到第二声学模型输出的第一数量个骨导后验概率向量和第二数量个气导后验概率向量,该骨导信号的起始部分根据语音检测的检测时延确定,第一数量个骨导后验概率向量与骨导信号的起始部分所包括的音频帧一一对应,第二数量个气导后验概率向量与气导信号所包括的音频帧一一对应;
将第一骨导后验概率向量和第一气导后验概率向量进行融合,以得到第二后验概率向量,第一骨导后验概率向量对应骨导信号的起始部分的最后一个音频帧,该最后一个音频帧的时长小于帧时长,第一气导后验概率向量对应气导信号的第一个音频帧,该第一个音频帧的时长小于帧时长,该多个后验概率向量包括第二后验概率向量、第一数量个骨导后验概率向量中除第一骨导后验概率向量之外的向量,以及第二数量个气导后验概率向量中除第一气导后验概率向量之外的向量。
可选地,第二确定子模块用于:
将骨导信号的起始部分和气导信号输入第三声学模型,以得到第三声学模型输出的多个后验概率向量,该骨导信号的起始部分根据语音检测的检测时延确定;或者,
将骨导信号和气导信号输入第三声学模型,以得到第三声学模型输出的多个后验概率向量。
可选地,唤醒词检测子模块用于:
基于该多个后验概率向量和唤醒词对应的音素序列,确定该命令词对应的音素序列包括唤醒词对应的音素序列的置信度;
在该置信度超过置信度阈值的情况下,确定检测到该命令词包括唤醒词。
可选地,唤醒词检测子模块用于:
基于该多个后验概率向量和唤醒词对应的音素序列,确定该命令词对应的音素序列包括唤醒词对应的音素序列的置信度;
在该置信度超过置信度阈值,且该多个后验概率向量与多个模板向量之间满足距离条件的情况下,确定检测到该命令词包括唤醒词,该多个模板向量指示包含唤醒词的完整信息的语音信号的音素属于多个指定音素的概率。
可选地,在该多个后验概率向量与该多个模板向量一一对应的情况下,该距离条件包括:该多个后验概率向量与对应的模板向量之间的距离的均值小于距离阈值。
可选地,该装置还包括:
获取模块,用于获取骨导注册信号,该骨导注册信号包含唤醒词的完整信息;
确定模块,用于基于骨导注册信号和唤醒词对应的音素序列,确定置信度阈值和多个模板向量。
可选地,确定模块包括:
第三确定子模块,用于基于骨导注册信号确定融合注册信号;
第四确定子模块,用于基于该融合注册信号和唤醒词对应的音素序列,确定置信度阈值和多个模板向量。
可选地,第四确定子模块用于:
将该融合注册信号包括的多个注册音频帧输入第一声学模型,以得到第一声学模型输出的多个注册后验概率向量,该多个注册后验概率向量与该多个注册音频帧一一对应,该多个注册后验概率向量中的第一注册后验概率向量指示该多个注册音频帧中的第一注册音频帧的音素属于多个指定音素的概率;
将该多个注册后验概率向量确定为多个模板向量;
基于该多个注册后验概率向量和唤醒词对应的音素序列确定置信度阈值。
可选地,该装置还包括:
获取模块,用于获取气导注册信号;
确定模块包括:
第五确定子模块,用于基于骨导注册信号和气导注册信号,确定多个注册后验概率向量,该多个注册后验概率向量与骨导注册信号和气导注册信号包括的多个注册音频帧一一对应,该多个注册后验概率向量中的第一注册后验概率向量指示该多个注册音频帧中的第一注册音频帧的音素属于多个指定音素的概率;
第六确定子模块,用于基于该多个注册后验概率向量和唤醒词对应的音素序列确定置信度阈值。
第三方面,提供了一种电子设备,所述电子设备包括处理器和存储器,所述存储器用于存储执行上述第一方面所提供的语音唤醒的方法的程序,以及存储用于实现上述第一方面所提供的语音唤醒的方法所涉及的数据。所述处理器被配置为用于执行所述存储器中存储的程序。所述电子设备还可以包括通信总线,该通信总线用于在该处理器与存储器之间建立连接。
第四方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第一方面所述的语音唤醒的方法。
第五方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的语音唤醒的方法。
上述第二方面、第三方面、第四方面和第五方面所获得的技术效果与第一方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
本申请实施例提供的技术方案至少能够带来以下有益效果:
在本申请实施例中,通过骨导麦克采集骨导信号进行语音检测,能够保证低功耗。另外,考虑到由于语音检测的延迟可能会导致采集的气导信号丢头,从而未包含声源输入的命令词的完整信息,而骨导麦克采集的骨导信号包含声源输入的命令词信息,即骨导信号未丢头,因此本方案基于骨导信号进行唤醒词的检测。这样,唤醒词的识别准确率较高,语音唤醒的准确度较高。
附图说明
图1是本申请实施例提供的一种声学模型的结构示意图;
图2是本申请实施例提供的一种语音唤醒方法所涉及的系统架构图;
图3是本申请实施例提供的一种电子设备的结构示意图;
图4是本申请实施例提供的一种语音唤醒的方法流程图;
图5是本申请实施例提供的一种骨导信号和气导信号产生的原理示意图;
图6是本申请实施例提供的一种信号时序图;
图7是本申请实施例提供的一种信号拼接的方法示意图;
图8是本申请实施例提供的一种对骨导信号进行下采样的示意图;
图9是本申请实施例提供的一种对骨导信号进行增益调整的示意图;
图10是本申请实施例提供的一种训练生成网络模型的方法示意图;
图11是本申请实施例提供的另一种声学模型的结构示意图;
图12是本申请实施例提供的又一种声学模型的结构示意图;
图13是本申请实施例提供的另一种语音唤醒的方法流程图;
图14是本申请实施例提供的又一种语音唤醒的方法流程图;
图15是本申请实施例提供的又一种语音唤醒的方法流程图;
图16是本申请实施例提供的又一种语音唤醒的方法流程图;
图17是本申请实施例提供的又一种语音唤醒的方法流程图;
图18是本申请实施例提供的又一种语音唤醒的方法流程图;
图19是本申请实施例提供的一种唤醒词注册的方法流程图;
图20是本申请实施例提供的另一种唤醒词注册的方法流程图;
图21是本申请实施例提供的又一种唤醒词注册的方法流程图;
图22是本申请实施例提供的又一种唤醒词注册的方法流程图;
图23是本申请实施例提供的又一种唤醒词注册的方法流程图;
图24是本申请实施例提供的又一种唤醒词注册的方法流程图;
图25是本申请实施例提供的一种训练第一声学模型的方法示意图;
图26是本申请实施例提供的另一种训练第一声学模型的方法示意图;
图27是本申请实施例提供的又一种训练第一声学模型的方法示意图;
图28是本申请实施例提供的又一种训练第一声学模型的方法示意图;
图29是本申请实施例提供的一种训练第二声学模型的方法示意图;
图30是本申请实施例提供的一种训练第三声学模型的方法示意图;
图31是本申请实施例提供的一种语音唤醒的装置的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
为了便于理解,首先对本申请实施例中的一些名称或术语进行解释。
语音识别:也称为自动语音识别(automatic speech recognition,ASR)。语音识别是指通过计算机识别语音信号中包含的词汇内容。
语音唤醒:也称为关键词识别(keyword spotting,KWS)、唤醒词检测、唤醒词识别等。语音唤醒是指在连续语音流中实时检测唤醒词,当检测到声源输入的命令词为唤醒词时,唤醒智能设备。
深度学习(deep learning,DL):是机器学习中基于对数据进行表征的一种学习算法。
接下来对本申请实施例中关于语音识别所涉及的一些相关知识进行介绍。
语音激活检测(voice activity detection,VAD)
VAD用于判断什么时候有语音输入,什么时候是静音状态,还用于将有语音输入的有效片段截取出来。语音识别后续的操作都是在VAD截取出来的有效片段上进行,从而能够减小语音识别系统噪声误识别率及系统功耗。在近场环境下,由于语音信号衰减有限,信噪比(signal-noise ratio,SNR)比较高,只需要简单的方式(比如过零率、信号能量)来做语音激活检测。但是在远场环境中,由于语音信号传输距离比较远,衰减比较严重,因而导致麦克风采集数据的SNR很低,这种情况下,简单的语音激活检测方法效果较差。使用深度神经网络(deep neural networks,DNN)做语音激活检测是基于深度学习的语音识别系统中常用的方法。VAD是语音检测的一种实现方式,本申请实施例以通过VAD来进行语音检测为例进行介绍,在其他实施例中也可以通过其他方式进行语音检测。
语音识别
对于语音识别系统而言,第一步要检测是否有语音输入,即,语音激活检测(VAD)。在低功耗设计中,相比于语音识别的其它部分,VAD采用始终开启(always on)的工作机制。当VAD检测到有语音输入之后,VAD便会唤醒后续的识别系统。识别系统主要包括特征提取、识别建模及解码得到识别结果等。其中,模型训练包括声学模型训练、语言模型训练等。语音识别本质上是音频序列到文字序列转化的过程,即在给定语音输入的情况下,找到概率最大的文字序列。基于贝叶斯原理,可以把语音识别问题分解为给定文字序列出现这条语音的条件概率以及出现该条文字序列的先验概率,对条件概率建模所得模型即为声学模型,对出现该条文字序列的先验概率建模所得模型是语言模型。
需要说明的是,要对语音信号进行分析和识别,就需要对语音信号分帧,也就是把语音信号切开成多个小段,每小段称为一帧。分帧操作一般不是简单的切开,而是使用窗函数来实现。分帧后相邻帧之间一般是有交叠的。本申请实施例中的一帧音频即通过分帧得到的音频帧,分帧是为了声学模型能够分析声音信号。例如,使用窗函数对语音信号进行分帧,假设窗函数指示以帧长25ms(毫秒)、帧移10ms进行分帧,那么分帧后每帧音频的长度为25ms,相邻两帧之间有25-10=15ms的交叠。
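为便于理解分帧操作,下面给出一个按帧长25ms、帧移10ms对信号分帧并加窗的Python示意(采样率16kHz、汉明窗等均为示例性假设,并非本申请的限定实现):

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=25, hop_ms=10):
    """按帧长25ms、帧移10ms对一维信号x分帧,并对每帧加汉明窗。"""
    frame_len = int(sr * frame_ms / 1000)   # 每帧采样点数:16000*0.025=400
    hop_len = int(sr * hop_ms / 1000)       # 帧移采样点数:160
    assert len(x) >= frame_len, "信号长度需不小于一帧"
    n_frames = 1 + (len(x) - frame_len) // hop_len
    window = np.hamming(frame_len)
    return np.stack([x[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])   # 形状为(帧数, 帧长),相邻帧交叠15ms
```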
这里再解释两个概念。音素:单词的发音由音素构成,音素是一种发音单元。英文的音素集(即发音词典)如卡内基梅隆大学的一套由39个音素构成的音素集。汉语的音素集如直接用全部声母和韵母作为音素集,或者,还分有声调和无声调的话,音素集包括更多的音素。例如,在本申请实施例中,音素集包括100个音素。状态:可视为比音素更细致的语音单位,通常把一个音素划分成3个状态。在本申请实施例中,一帧音频对应一个音素,若干 音素组成一个单词(字)。那么,只要知道每帧音频对应哪个音素,语音识别的结果也就出来了。在一些实施例中,若干帧音频对应一个状态,每三个状态组合成一个音素,若干个音素组合成一个单词。那么,只要知道每帧音频对应哪个状态,语音识别的结果也就出来了。
声学模型、解码以及语音唤醒
在语音识别中,以一帧音频对应一个音素为例,通过声学模型就能够知道各音频帧对应的音素为音素集中各个音素的概率,即音频对应的后验概率向量。通俗地讲,声学模型里存有一大堆参数,通过这些参数,就能够知道各音频帧对应的后验概率向量。通过训练声学模型就能够得到这些参数,声学模型的训练需要使用巨大数量的语音数据。通过声学模型得到各音频帧对应的后验概率向量之后,基于语言模型、发音词典等构建解码图(也可称为状态网络、搜索空间等),将声学模型输出的连续多帧音频对应的后验概率向量作为该解码图的输入,在该解码图中搜索最优路径,语音对应音素在这条路径上的概率最大。搜索到最优路径之后,即能够知道各音频帧对应的音素,也即得出语音识别到的最佳词串了。其中,在状态网络中搜索最优路径从而得到词串的过程可认为是一种解码,该解码是为了确定语音信号对应的词串是什么。
而在本申请实施例中语音唤醒的解码中,是在解码图中寻找处于解码路径上的各个音素的概率,将寻找到的各个音素的概率相加,以得到一个路径得分。其中,解码路径是指唤醒词对应的音素序列。如果该路径得分较大,则认为检测到该命令词包括唤醒词。也即是,本申请实施例中的解码是基于解码图判断语音信号对应的词串是不是唤醒词。
为了解释本申请实施例,这里先对本申请实施例中涉及的声学模型进行进一步介绍。声学模型是能够识别单个音素的模型,可以采用隐马尔科夫模型(hidden Markov model,HMM)进行建模。声学模型是经训练的模型,可以利用声音信号的声学特征和对应的标签训练声学模型。声学模型中建立了声学信号和建模单元之间对应的概率分布,建模单元如HMM状态、音素、音节、字等,建模单元也可称为发音单元,声学模型的结构如GMM-HMM、DNN-HMM、DNN-CTC等。其中,GMM(gaussian mixed model)表示高斯混合模型,DNN表示深度神经网络,CTC(connectionist temporal classification)表示基于神经网络的时序类分类。在本申请实施例中,以建模单元为音素,声学模型为DNN-HMM模型为例进行介绍。需要说明的是,在本申请实施例中,声学模型可以逐帧音频进行处理,输出各音频帧的音素属于多个指定音素的概率,该多个指定音素可以根据发音词典确定。例如发音词典中包括100个音素,那该多个指定音素即这100个音素。
图1是本申请实施例提供的一种声学模型的结构示意图。该声学模型为DNN-HMM模型,声学模型的输入层的维度为3,两个隐藏层的维度为5,输出层的维度为3。其中,输入层的维度表示输入信号的特征维度,输出层的维度表示三个状态维度,每个状态维度包括多个指定音素对应的概率。
然后对解码进行进一步介绍。语音识别中的解码可以分为动态解码和静态解码。在动态解码的过程中,以词典树为中心,在语言模型中动态查找语言得分。而静态解码是指语言模型提前静态编进解码图,通过确定化、权重前移、最小化等一系列优化操作,提高解码效率。示例性地,本申请实施例中采用静态解码,如加权有限状态转换器(weighted finite state transducer,WFST),基于HCLG网络的静态解码消除冗余信息。本申请实施例中HCLG网络的生成需要将语言模型、发音词典、声学模型表示成对应的FST格式,后通过组合、确定化、最小化等操作编译成一个大的解码图,HCLG网络构建流程为:HCLG=ASL(min(RDS(det(H' o min(det(C o min(det(L o G))))))))。其中,ASL表示加自环,min表示最小化,RDS表示去消歧符,det表示确定化,H'表示去掉自环的HMM,o表示组合。
在解码过程中,使用维特比(viterbi)算法在解码图中寻求最优路径,解码图中不会有相同的两条路径。在解码过程中采用累积beam剪枝,即,从当前概率最大路径得分减去beam值作为阈值,小于阈值的路径被裁剪。同时采用帧同步解码算法,找到解码图的开始节点,创建对应节点的令牌,从开始节点对应的令牌做空边(即输入不对应真实的建模单元)扩展,每一个可达的节点都绑定对应的令牌,剪枝并保留活跃令牌。每输入一帧音频,从当前活跃令牌中取出一个令牌,对应节点开始扩展后续非空边(即输入对应真实物理建模单元),遍历完所有活跃令牌,剪枝并保留当前帧活跃令牌。重复执行以上步骤,直到所有的音频帧都扩展结束,即找到得分最大令牌,回溯得到最后的识别结果。
网络模型
在本申请实施例中,网络模型是指上述声学模型。采用网络模型对语音信号进行识别,网络模型如隐马尔科夫模型HMM、高斯混合模型GMM、深度神经网络DNN、深度置信网络-隐马尔科夫模型(deep belief networks HMM,DBN-HMM)、循环神经网络(recurrent neural network,RNN)、长短时记忆(long short-term memory,LSTM)网络、卷积神经网络(convolutional neural network,CNN)等。本申请实施例中采用的是CNN和HMM。
其中,隐马尔科夫模型是一种统计模型,目前多应用于语音信号处理领域。在该模型中,马尔科夫链中的一个状态是否转移到另一个状态取决于状态转移概率,而某一状态产生的观察值取决于状态生成概率。在进行语音识别时,HMM首先为每个识别单元建立发声模型,通过长时间训练得到状态转移概率矩阵和输出概率矩阵,在识别时根据状态转移过程中的最大概率进行判决。
卷积神经网络的基本结构包括两部分,一部分是特征提取层,每个神经元的输入与前一层的局部接受域相连,并提取该局部的特征。另一部分是特征映射层,网络的每个计算层由多个特征映射组成,每个特征映射是一个平面,平面上所有神经元的权值相等。特征映射结构采用影响函数核小的函数(如sigmoid)作为卷积网络的激活函数,使得特征映射具有位移不变性。此外,由于一个映射面上的神经元共享权值,因而减少了网络自由参数的个数。卷积神经网络中的每一个卷积层都可紧跟着一个用来求局部平均与二次提取的计算层,这种特有的两次特征提取结构减小了特征分辨率。
损失函数是网络模型在训练中的迭代依据。损失函数用来评价网络模型的预测值和真实值不一样的程度,损失函数的选择影响了网络模型的性能。不同的网络模型使用的损失函数一般也不一样。损失函数可分为经验风险损失函数和结构风险损失函数。经验风险损失函数指预测结果和实际结果的差别,结构风险损失函数是指经验风险损失函数加上正则项。本申请实施例中使用的是交叉熵损失函数(cross-entropy loss function),即CE损失函数。交叉熵损失函数本质上也是一种对数似然函数,可用于二分类和多分类任务中。当使用sigmoid作为激活函数的时候,常用交叉熵损失函数而不用均方误差损失函数,因为交叉熵损失函数可以完美解决平方损失函数权重更新过慢的问题,具有误差大的时候,权重更新快,误差小的时候,权重更新慢的良好性质。
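下面用一个极简的Python片段示意交叉熵损失的计算方式(示例中的音素个数与概率取值均为假设):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """多分类交叉熵:-sum(y*log(y_hat))。y_true为one-hot标签,y_pred为模型输出的概率分布。"""
    return -np.sum(y_true * np.log(y_pred + eps))

# 假设音素集只有3个音素,某帧音频的真实音素为第2个
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.1, 0.7, 0.2])      # 声学模型输出的后验概率
print(cross_entropy(y, y_hat))         # -log(0.7) ≈ 0.357,预测越准损失越小
```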
网络模型将误差反向传播,使用损失函数并采用梯度下降法调整网络参数。梯度下降法是一种优化算法,中心思想是沿着目标函数梯度的方向更新参数值以希望达到目标函数最小(或最大)。梯度下降法是深度学习中常用的优化算法。梯度下降是针对损失函数的,目的是为了尽快找到损失函数的最小值所对应的权重和偏置。反向传播算法的核心是定义神经元误差这个特殊变量,从输出层开始将神经元误差逐层反向传播,再通过公式利用神经元误差计算出权重和偏置的偏导数。梯度下降是解决最小值问题的一种方式,而反向传播是解决梯度计算的一种方式。
图2是本申请实施例提供的一种语音唤醒方法所涉及的系统架构图。参见图2,该系统架构包括可穿戴设备201和智能设备202。可穿戴设备201与智能设备202之间通过有线或无线的方式连接以进行通信。其中,智能设备202为本申请实施例中的待唤醒设备。
在本申请实施例中,可穿戴设备201用于接收语音信号,基于接收到的语音信号向智能设备202发送指令。智能设备202用于接收可穿戴设备201发送的指令,基于接收到的指令执行相应操作。例如,可穿戴设备201用于采集语音信号,检测采集到的语音信号包含的命令词,若检测到该命令词包括唤醒词,则向智能设备202发送唤醒指令,以唤醒智能设备202。智能设备202用于接收到唤醒指令,之后从休眠状态进入工作状态。
其中,可穿戴设备201安装有骨导麦克,由于骨导麦克的低功耗,骨导麦克可以一直工作状态。骨导麦克用于在工作状态下采集骨导信号。可穿戴设备201中的处理器基于骨导信号进行语音激活检测,以检测是否有语音输入。在检测到有语音输入的情况下,处理器基于骨导信号进行唤醒词的检测,以检测声源输入的命令词是否包括唤醒词。在检测到命令词包括唤醒词时,进行语音唤醒,即可穿戴设备201向智能设备202发送唤醒指令。
在本申请实施例中,可穿戴设备201如无线耳机、智能眼镜、智能手表、智能手环等。智能设备202(即待唤醒设备)如智能音箱、智能家电、智能玩具、智能机器人等。可选地,在一些实施例中,可穿戴设备201与智能设备202为同一个设备。
需要说明的是,本申请实施例描述的系统架构以及业务场景是为了更加清楚地说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域普通技术人员可知,随着系统架构的演变和新业务场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
请参考图3,图3是本申请实施例示出的一种电子设备的结构示意图。可选地,该电子设备为图2中所示的可穿戴设备201。该电子设备包括一个或多个处理器301、通信总线302、存储器303、一个或多个通信接口304、骨导麦克308以及空气麦克309。
处理器301为一个通用中央处理器(central processing unit,CPU)、网络处理器(network processing,NP)、微处理器、或者为一个或多个用于实现本申请方案的集成电路,例如,专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。可选地,上述PLD为复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。
通信总线302用于在上述组件之间传送信息。可选地,通信总线302分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
可选地,存储器303为只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、光盘(包括只读光盘(compact disc read-only memory,CD-ROM)、压缩光盘、激光盘、数字通用光盘、蓝光光盘等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。存储器303独立存在,并通过通信总线302与处理器301相连接,或者,存储器303与处理器301集成在一起。
通信接口304使用任何收发器一类的装置,用于与其它设备或通信网络通信。通信接口304包括有线通信接口,可选地,还包括无线通信接口。其中,有线通信接口例如以太网接口等。可选地,以太网接口为光接口、电接口或其组合。无线通信接口为无线局域网(wireless local area networks,WLAN)接口、蜂窝网络通信接口或其组合等。
可选地,在一些实施例中,该电子设备包括多个处理器,如图3中所示的处理器301和处理器305。这些处理器中的每一个为一个单核处理器,或者一个多核处理器。可选地,这里的处理器指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。
在一些实施例中,该电子设备还包括输出设备306和输入设备307。输出设备306和处理器301通信,能够以多种方式来显示信息。例如,输出设备306为液晶显示器(liquid crystal display,LCD)、发光二级管(light emitting diode,LED)显示设备、阴极射线管(cathode ray tube,CRT)显示设备或投影仪(projector)等。输入设备307和处理器301通信,能够以多种方式接收用户的输入。例如,输入设备307包括鼠标、键盘、触摸屏设备或传感设备等中的一种或多种。
在本申请实施例中,输入设备307包括骨导麦克308和空气麦克309,骨导麦克308和空气麦克309分别用于采集骨导信号和气导信号。处理器301用于基于骨导信号或者基于骨导信号和气导信号,通过本申请实施例提供的语音唤醒的方法来唤醒智能设备。可选地,在唤醒智能设备之后,处理器301还用于基于骨导信号、或气导信号、或骨导信号和气导信号来控制智能设备执行任务。
在一些实施例中,存储器303用于存储执行本申请方案的程序代码310,处理器301能够执行存储器303中存储的程序代码310。该程序代码310中包括一个或多个软件模块,该电子设备能够通过处理器301以及存储器303中的程序代码310,来实现下文图4实施例提供的语音唤醒的方法。
图4是本申请实施例提供的一种语音唤醒的方法的流程图,该方法应用于可穿戴设备。请参考图4,该方法包括如下步骤。
步骤401:根据骨导麦克采集的骨导信号进行语音检测,骨导信号包含声源输入的命令词信息。
由前述可知,为了实现可穿戴设备的低功耗,在智能设备(即待唤醒设备)被唤醒之前,由于骨导麦克的功耗较低,因此可以利用骨导麦克采集骨导信号,基于骨导信号进行语音检测(如语音激活检测VAD),以检测是否有语音输入。在未检测到有语音输入的情况下可穿戴设备中除骨导麦克之外的部件可以处于休眠状态从而降低功耗,而在检测到有语音输入的 情况下再控制可穿戴设备的其他部件开启。例如,在可穿戴设备还安装有空气麦克的情况下,由于空气麦克是一个功耗较高的器件,对于便携式的可穿戴设备来说,为了降低功耗,会对空气麦克进行开启和关闭的控制,当检测到有语音输入的时候(如用户在说话),才会开启空气麦克进行拾音操作(即采集气导信号),这样就可以降低可穿戴设备的功耗。也即是,在智能设备被唤醒之前,空气麦克处于休眠状态以减低功耗,在检测到有语音输入的情况下,开启空气麦克。
其中,可穿戴设备根据骨导麦克采集的骨导信号进行语音激活检测的实现方式可以有多种,本申请实施例对此不作限定。接下来示例性地介绍一些语音激活检测的实现方式。需要说明的是,语音激活检测主要是用于检测当前输入信号中是否存在人的语音信号。其中,语音激活检测通过对输入信号进行判断,以将语音片段与非语音片段(如只有各种背景噪声信号的片段)区分出来,使得能够分别对各段信号采取不同的处理方法。
可选地,语音激活检测通过提取输入信号的特征来检测是否有语音输入。例如,通过提取各帧输入信号的短时能量(short time energy,STE)和短时过零率(zero cross counter,ZCC)的特征来检测是否有语音输入,即基于能量的特征进行语音激活检测。其中,短时能量指一帧信号的能量,过零率指一帧时域信号穿过0(时间轴)的次数。又如,一些精确度较高的VAD会提取基于能量的特征、频域特征、倒谱特征、谐波特征、长时特征等多个特征进行综合检测。可选地,除了提取特征之外,还可以再结合阈值比较,或者结合统计的方法或机器学习的方法,来判断一帧输入信号是语音信号还是非语音信号。接下来对基于能量的特征、频域特征、倒谱特征、谐波特征、长时特征等特征分别进行简单介绍。
基于能量的特征:即基于STE和ZCC两个特征来进行VAD。在信噪比(signal-noise ratio,SNR)较大的情况下,语音片段的STE相对较大而ZCC相对较小,非语音片段的STE相对较小而ZCC相对较大。因为有人声的语音信号通常能量较大,且绝大部分能量包含在低频带内,而噪音信号通常能量较小,且含有较多高频段的信息。因此,可以通过提取输入信号的这两个特征,从而判别语音信号与非语音信号。其中,计算STE的方法可以为,通过频谱图计算每一帧输入信号的能量的平方和。计算短时过零率的方法可以为,计算每一帧输入信号在时域上对应的过零数量,例如,在时域上将帧内所有采样点向左或向右平移一个点,平移后的各个采样点和平移前的各个采样点的幅度值在对应点做乘积,若对应的两个采样点所得积的符号为负,则说明对应采样点处过零,将帧内为负数的积的个数求出即得到短时过零率。
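作为示意,下面给出一个基于短时能量(STE)和短时过零率(ZCC)判断语音帧的Python示例(其中的阈值为假设值,实际需结合信噪比标定):

```python
import numpy as np

def simple_vad(frame, ste_thresh=0.01, zcc_thresh=0.3):
    """基于短时能量和短时过零率判断一帧信号是否为语音帧。"""
    ste = np.mean(frame ** 2)                    # 短时能量
    signs = np.sign(frame)
    zcc = np.mean(signs[1:] != signs[:-1])       # 过零率:相邻采样点符号变化的比例
    # 有人声的帧通常能量较大且过零率较小,噪声帧相反
    return ste > ste_thresh and zcc < zcc_thresh
```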
频域特征:通过短时傅里叶变换或其他时频变换方法,将输入信号的时域信号变成频域信号,以得到频谱图,基于频谱图得到频域特征。如基于频谱图提取频带的包络特征。在一些实验中,在SNR为0dB时,一些频带的长时包络可以区分语音片段和噪声片段。
倒谱特征:如包括能量倒谱峰值。对于VAD来说,能量倒谱峰值确定了语音信号的基频(pitch)。在一些实施例中,将梅尔频率倒谱系数(Mel-frequency cepstral coefficients,MFCC)作为倒谱特征。
基于谐波的特征:语音信号的一个明显特征是包含了基频及其多个谐波频率,即使在强噪声场景,谐波这一特征也是存在的。可以使用自相关的方法找到语音信号的基频。
长时特征:语音信号是非稳态信号,普通语速通常每秒发出10~15个音素,音素之间的谱分布是不一样的,这就导致了随着时间变化语音统计特性也是变化的。而日常的绝大多数噪声是稳态的,即变化比较慢,如白噪声等。基于此,可以提取长时特征来判断输入信号是语音信号还是非语音信号。
需要说明的是,在本申请实施例中,用于语音激活检测的输入信号为骨导麦克采集的骨导信号,对于接收到骨导信号的每一帧进行语音激活检测,以检测是否有语音输入。其中,由于骨导麦克一直处于工作状态,因此,骨导麦克持续采集的骨导信号包含声源输入的命令词的完整信息,即骨导信号不会丢头。
可选地,骨导信号的采样率为32kHz(千赫兹)、48kHz等,本申请实施例对此不作限定。骨导麦克中的传感器是一种非声传感器,可以屏蔽周围环境噪声的影响,具有很强的抗噪性能。
步骤402:在检测到有语音输入的情况下,基于骨导信号进行唤醒词的检测。
在本申请实施例中,在检测到有语音输入的情况下,可穿戴设备基于骨导信号进行唤醒词的检测,以检测该命令词是否包括唤醒词。需要说明的是,可穿戴设备基于骨导信号进行唤醒词的检测的实现方式有多种,接下来介绍其中的两种实现方式。
第一种实现方式
在本申请实施例中,可穿戴设备基于骨导信号进行唤醒词的检测的实现方式为:基于骨导信号确定融合信号,对该融合信号进行唤醒词的检测。
首先介绍可穿戴设备基于骨导信号确定融合信号的实现方式。需要说明的是,可穿戴设备基于骨导信号确定融合信号的方式有多种,接下来介绍其中的四种方式。
基于骨导信号确定融合信号的方式1:基于骨导信号确定融合信号之前,开启空气麦克,通过空气麦克采集气导信号。例如在检测到有语音输入的情况下,开启空气麦克,通过空气麦克采集气导信号。可穿戴设备将骨导信号的起始部分和气导信号进行融合,以得到融合信号。其中,骨导信号的起始部分根据语音检测(如VAD)的检测时延确定。也即是,可穿戴设备采集骨导信号以及气导信号,利用骨导信号的起始部分对气导信号进行丢头补偿,以使得到的融合信号也包含声源输入的命令词信息。另外,该融合信号的长度较短,在一定程度上能够减少数据处理量。可选地,在本申请实施例中通过信号拼接来进行信号融合,在一些实施例中也可以通过信号叠加等方式进行信号融合,下述实施例中均以通过信号拼接来进行信号融合为例进行介绍。
需要说明的是,骨导信号和气导信号是由同一声源产生的信号,骨导信号和气导信号的传输路径不同。如图5所示,骨导信号是振动信号(激励信号)经过人体内部骨头、组织等路径传输形成的信号,气导信号是声波经过空气传输形成的信号。
图6是本申请实施例提供的一种信号时序图。该信号时序图中示出了骨导信号、气导信号、VAD控制信号和用户语音信号的时序关系。当声源发出语音信号时,骨导信号立即变为高电平的信号,经过△t时间后,VAD确定检测到有语音输入,此时产生VAD控制信号,VAD控制信号控制空气麦克开启,采集气导信号,也即此时气导信号变为高电平的信号。可以看出,骨导信号与用户语音信号是同步变化的,气导信号会相比于骨导信号有△t时间的延迟,该延迟是由VAD的检测时延导致的。其中,△t表示语音激活检测的检测时延,也即检测到有语音输入的时刻与用户实际输入语音的时间差。
需要说明的是,在本申请实施例中,VAD能够检测出骨导信号中的语音片段和非语音片段,端点检测能够检测出气导信号中的语音片段和非语音片段。可穿戴设备在将骨导信号的起始部分和气导信号进行融合之前,基于VAD的检测结果将骨导信号中的语音片段截取出来,基于端点检测的检测结果将气导信号中的语音片段截取出来,将截取出来的骨导信号的语音片段的起始部分与截取出来的气导信号的语音片段进行融合,以得到融合信号。以图6为例,从骨导信号中截取出来的语音片段的时间范围为[0,t],骨导信号的起始部分(即截取出来的语音片段的起始部分)的时间范围为[0,△t],从气导信号中截取出来的语音片段的时间范围为[△t,t],得到的融合信号的时长为t。其中,△t表示语音激活检测的检测时延,t表示实际有语音输入的总时长。
图7是本申请实施例提供的一种信号融合的方法示意图。以通过信号拼接来进行信号融合为例,参见图7,x1[n]表示骨导信号的起始部分,x2[n]表示气导信号,f(x)表示拼接函数,f(x):b[n]_{0,t}=concat[x1[n]_{0,△t}+x2[n]_{0,△t}, x2[n]_{△t,t}],其中x2[n]_{0,△t}为零。即通过f(x)将骨导信号的起始部分(即0至△t的语音片段)与气导信号(即△t至t的语音片段)进行拼接,得到融合信号b[n]。
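按照上述拼接函数f(x)的思路,下面给出一个用骨导信号起始部分补偿气导信号丢头的Python示意(假设两路信号采样率一致,且△t已由检测时延换算为采样点数):

```python
import numpy as np

def fuse_by_concat(bone, air, delay_samples):
    """将骨导信号的起始部分(0至△t)与气导信号(△t至t)拼接,得到融合信号b[n]。"""
    head = bone[:delay_samples]   # 骨导信号的起始部分,对应VAD的检测时延△t
    return np.concatenate([head, air])
```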
可选地,在将骨导信号的起始部分和气导信号进行融合之前,可穿戴设备对气导信号进行预处理,预处理包括前端增强。前端增强能够消除部分噪声和不同声源带来的影响等,使得前端增强后的气导信号更加能反映语音的本质特征,以提升语音唤醒的准确率。需要说明的是,对气导信号进行前端增强的方法有很多,例如,端点检测和语音增强,语音增强如回波消除、波束形成算法、噪音消除、自动增益控制、去混响等。其中,端点检测能够将气导信号的语音片段和非语音片段区分开,即准确地确定出语音片段的起始点。经端点检测之后,后续就可以只对气导信号的语音片段进行处理,这样能够提高语音识别的准确率和召回率。语音增强是为了消除环境噪声对语音片段的影响。例如,回声消除是用有效的回声消除算法来抑制远端信号的干扰,主要包括双讲检测和延时估计,如通过判断当前的讲话模式(如近讲模式、远讲模式、双讲模式等),基于当前的讲话模式采用对应的策略调整滤波器,进而通过滤波器滤除气导信号中的远端干扰,在此基础上通过后置滤波算法消除残留噪声的干扰。又如,自动增益算法用于将信号快速增益到合适的音量,本方案可以通过硬性增益处理对气导信号的所有采样点乘上对应的增益因子,在频域每个频率都同时乘上对应的增益因子。其中,可以按照等响度曲线对气导信号的频率进行加权,把响度增益因子映射到等响度曲线上,从而确定各频率的增益因子。
可选地,在将骨导信号的起始部分和气导信号进行融合之前,可穿戴设备对骨导信号进行预处理,预处理包括下采样和/或增益调整。其中,下采样能够使骨导信号的数据量减小,提高数据处理的效率,增益调整用于使调整后的骨导信号的能量增强,例如增益调整使骨导信号的平均能量与气导信号的平均能量一致。需要说明的是,对骨导信号进行下采样和/或增益调整的方法有很多,本申请实施例对此不作限定。其中,下采样是指降低信号的采样频率(也称为采样率),是信号重采样的一种方式。采样频率是指将模拟声音波形数字化后每秒钟所抽取的声波幅度的样本次数。对采样频率为Fs、包括N个采样点的骨导信号x[n]进行下采样的过程中,每隔M-1个采样点抽取一个采样点,得到包括N/M个采样点的骨导信号y[m]。根据奈奎斯特采样定理,下采样可能会造成信号的频谱混淆,因此下采样之前可以用低通去混淆滤波器对骨导信号进行处理,即进行抗混叠滤波,以减轻后续下采样带来的频谱混淆。增益调整是指通过增益因子对骨导信号的采样点的幅度值进行调整,或者对骨导信号的频点的能量值进行调整。其中,增益因子可以根据增益函数确定,也可以根据气导信号与骨导信号的统计信息确定,本申请实施例对此不作限定。
图8是本申请实施例提供的一种对骨导信号进行下采样的示意图。参见图8,假设骨导信号的采样率为48kHz,先将采集的骨导信号x[n]送入抗混叠滤波器H(z),以防止信号混叠。v[n]表示经过抗混叠滤波器之后的骨导信号,采样率未变。对v[n]进行三倍下采样,得到三倍下采样后的骨导信号y[m],采样率下降为16kHz。
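下面给出对48kHz骨导信号先抗混叠滤波再三倍下采样到16kHz的Python示意(借助scipy的decimate,其内部先做低通滤波再抽取,仅为一种可能的实现方式):

```python
from scipy.signal import decimate

def downsample_bone_48k_to_16k(x_48k):
    """对48kHz骨导信号做三倍下采样:decimate先做抗混叠低通滤波,再每3个采样点取1个。"""
    return decimate(x_48k, q=3)   # 输出采样率为16kHz
```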
图9是本申请实施例提供的一种对骨导信号进行增益调整的示意图。参见图9,x[n]表示骨导信号,f(g)表示增益函数,f(g):y[n]=G*x[n],即通过f(g)所确定的增益因子G对x[n]进行增益调整,得到经增益调整的骨导信号y[n]。
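下面给出一种按平均能量一致原则确定增益因子G并对骨导信号做增益调整的Python示意(该增益函数仅为一种假设的选择):

```python
import numpy as np

def gain_adjust(bone, air, eps=1e-12):
    """按骨导信号与气导信号平均能量一致计算增益因子G,返回y[n]=G*x[n]。"""
    g = np.sqrt((np.mean(air ** 2) + eps) / (np.mean(bone ** 2) + eps))
    return g * bone
```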
基于骨导信号确定融合信号的方式2:基于骨导信号确定融合信号之前,开启空气麦克,通过空气麦克采集气导信号。可穿戴设备基于骨导信号的起始部分生成增强起始信号,将该增强起始信号和气导信号进行融合,以得到融合信号。其中,骨导信号的起始部分根据语音检测的检测时延确定。也即是,可穿戴设备利用骨导信号的起始部分生成增强起始信号,利用该增强起始信号对采集的气导信号进行丢头补偿,以使得到的融合信号也包含声源输入的命令词信息。另外,融合信号的长度较短,在一定程度上能够减少数据处理量。
需要说明的是,与上述基于骨导信号确定融合信号的方式1不同的地方在于,在基于骨导信号确定融合信号的方式2中,是利用骨导信号的起始部分生成增强起始信号,将该增强起始信号与气导信号进行融合,而非将骨导信号的起始部分与气导信号进行融合,除此之外,上述方式1中介绍的其他内容均适用于该方式2,在方式2中不再一一详细介绍。例如在该方式2中,也可以对骨导信号和气导信号进行语音片段的检测,以截取出语音片段,基于截取出的语音片段进行信号拼接,从而减少数据处理量。可穿戴设备还可以对骨导信号和气导信号进行预处理,例如对骨导信号进行下采样和/或增益调整等,对气导信号进行语音增强等。
在本申请实施例中,可穿戴设备可以将骨导信号的起始部分输入生成网络模型,以得到生成网络模型输出的增强起始信号。其中,生成网络模型为基于深度学习算法训练得到的模型,生成网络模型可视为一种信号生成器,能够基于输入信号生成包含输入信号的信息且接近真实语音的语音信号。在本申请实施例中,增强起始信号包含了骨导信号的起始部分的信号信息,且增强起始信号接近于真实语音信号。需要说明的是,本申请实施例不限定生成网络模型的网络结构、训练方式、训练设备等。接下来示例性地介绍一种生成网络模型的训练方法。
在本申请实施例中,以在计算机设备上训练得到生成网络模型为例,计算机设备获取第一训练数据集,第一训练数据集包括多个第一样本信号对。计算机设备将该多个第一样本信号对中的骨导样本信号的起始部分输入初始生成网络模型,以得到初始生成网络模型输出的多个增强起始样本信号。计算机设备将该多个增强起始样本信号和该多个第一样本信号对中的气导样本信号的起始部分输入初始判决网络模型,以得到初始判决网络模型输出的判决结果。计算机设备基于该判决结果调整初始生成网络模型的网络参数,以得到经训练的生成网络模型。其中,一个第一样本信号对包括一个骨导样本信号的起始部分和一个气导样本信号的起始部分,一个第一样本信号对对应一个命令词,骨导样本信号和气导样本信号包含对应的命令词的完整信息。
可选地,计算机设备获取的第一样本信号对包含骨导样本信号和气导样本信号,计算机设备截取骨导样本信号的起始部分和气导样本信号的起始部分,以得到初始生成网络模型和初始判决网络模型的输入数据。也即是,计算机设备先获取完整的语音信号,再截取出起始 部分,以得到训练数据。或者,计算机设备获取的第一样本信号对仅包含骨导样本信号的起始部分和气导样本信号的起始部分。
可选地,第一训练数据集包括直接采集的语音数据、公开语音数据和/或从第三方购买的语音数据。可选地,在训练之前,计算机设备可以对获取的第一训练数据集进行预处理,以得到经预处理的第一训练数据集,经预处理的第一训练数据集能够模拟真实语音数据的分布,以便更接近于真实场景的语音,增加训练样本的多样性。示例性地,对第一训练数据集进行备份,即额外增加一份数据,对备份的数据进行预处理。可选地,将备份的数据分为多份,对每份数据进行一种预处理,对各份数据所做的预处理可以不同,这样能够使总的训练数据加倍,且保证数据的全面性,在性能和训练开销上达到平衡,使得在一定程度上提高语音识别的准确率和鲁棒性。其中,对每份数据进行预处理的方法可以包括增加噪音(noise addition)、音量增强、增加混响(add reverb)、时移(time shifting)、改变音调(pitch shifting)、时间拉伸(time stretching)等中的一种或多种。
示例性地,增加噪音是指将一种或多种背景噪声混入语音信号中,使得训练数据能够覆盖更多种类的噪声。例如办公室环境噪声、食堂环境噪声、街道环境噪声等背景噪声。还可以混入不同信噪比的噪声,例如信噪比可以按照正态分布的方式选择,使得信噪比的均值较优,均值可以为10dB、20dB等,信噪比可以从10dB到30dB等。其中,计算机设备可以基于信号能量S和信噪比SNR通过公式SNR=10*log10(S²/N²)来计算出噪声能量N。音量增强是指根据音量的变动系数将语音信号的音量增强或减弱,音量的变动系数的取值范围可以为0.5至1.5,或者为其他的取值范围。增加混响是指对语音信号加混响处理,混响是由于空间环境对声音信号的反射产生的。改变音调如高音修正,以改变语音信号的音高而不影响语速。时间拉伸是指在不影响音高的情况下改变语音信号的速度或持续时间,也即改变语速,使得训练数据能够覆盖不同的语速,语速的变动范围可以在0.9至1.1之间或者在其他范围内。
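结合上述公式SNR=10*log10(S²/N²),下面给出按目标信噪比向语音中混入噪声的Python示意(目标信噪比的均值与范围取自文中举例,均为假设值):

```python
import numpy as np

def add_noise(speech, noise, target_snr_db):
    """由SNR=10*log10(S^2/N^2)反推噪声缩放系数,按目标信噪比混入噪声。假设noise不短于speech。"""
    s_pow = np.mean(speech ** 2)
    n_pow = np.mean(noise ** 2) + 1e-12
    target_n_pow = s_pow / (10 ** (target_snr_db / 10))   # 目标噪声能量N^2
    return speech + noise[:len(speech)] * np.sqrt(target_n_pow / n_pow)

snr_db = np.random.normal(loc=20, scale=5)   # 信噪比按正态分布选择,均值假设为20dB
```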
图10是本申请实施例提供的一种训练生成网络模型的方法示意图。生成器(即初始生成网络模型)是用于生成语音信号的网络,将第一训练数据集中的骨导样本信号的起始部分输入生成器,可选地,在输入生成器之前,在骨导样本信号中叠加一个随机噪声。通过生成器对输入的骨导样本信号进行处理,以生成增强起始样本信号。判决器(即初始判决网络模型)是一个判决网络,用于判断输入的信号是不是真实的语音信号,判决器输出的判决结果指示输入信号是否为真实语音,如果输出的判决结果为1,表示判决器判定输入信号为真实的语音信号,如果输出的判决结果为0,表示判决器判定输入信号不是真实的语音信号。通过判断判决结果是否准确来调整生成器和判决器中的参数,以训练生成器和判决器。在训练的过程中,生成器的目标就是生成伪造的语音信号去骗过判决器,而判决器的目标就是能够分辨出输入信号是真实的还是生成的。可以看出,生成器和判决器实质上是通过训练数据在进行博弈,在博弈的过程中生成器和判决器的能力均得到提高,在理想情况下,训练后的判决器的准确率接近0.5。
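下面用PyTorch给出上述生成器与判决器对抗训练的一个极简示意(网络结构、特征维度与优化器均为假设,与实际采用的模型无关):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(160, 256), nn.ReLU(), nn.Linear(256, 160))             # 生成器:骨导起始段 -> 增强起始段
D = nn.Sequential(nn.Linear(160, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())   # 判决器:输出为真实语音的概率
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(bone_head, air_head):
    """bone_head/air_head: (batch, 160)的骨导/气导起始段特征,均为假设的输入形式。"""
    fake = G(bone_head + 0.01 * torch.randn_like(bone_head))   # 输入前叠加随机噪声
    ones = torch.ones(air_head.size(0), 1)
    zeros = torch.zeros(air_head.size(0), 1)
    # 训练判决器:真实气导起始段判为1,生成信号判为0
    loss_d = bce(D(air_head), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # 训练生成器:希望生成信号被判决器判为真实语音
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```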
在训练完成后,将训练得到的生成网络模型部署到可穿戴设备中,可穿戴设备将采集的骨导信号的起始信号输入该生成网络模型,以得到该生成网络模型输出的增强起始信号。需要说明的是,除了以上介绍的增强起始信号的生成方法之外,计算机设备也可以采用其他方法基于骨导信号的起始信号生成增强起始信号,本申请实施例对此不作限定。
基于骨导信号确定融合信号的方式3:基于骨导信号确定融合信号之前,开启空气麦克, 通过空气麦克采集气导信号。可穿戴设备将骨导信号和气导信号直接进行融合,以得到融合信号。这样,得到的融合信号也包含了声源输入的命令词信息,另外,融合信号不仅包含骨导信号中完整的语音信息,也包含了气导信号中完整的语音信息,使得融合信号所包含的语音特征更加丰富,在一定程度上提高语音识别的准确率。
需要说明的是,与上述基于骨导信号确定融合信号的方式1不同的地方在于,在基于骨导信号确定融合信号的方式3中,可穿戴设备是直接将骨导信号与气导信号进行融合,除此之外,上述方式1中介绍的其他内容均适用于该方式3,在该方式3中不再一一详细介绍。例如在该方式3中,也可以对骨导信号和气导信号进行语音片段的检测,以截取出语音片段,对截取出的语音片段进行融合,从而减少数据处理量。还可以对骨导信号和气导信号进行预处理,例如对骨导信号进行下采样和/或增益调整,对气导信号进行端点检测和语音增强。
示例性地,假设通过信号拼接来进行信号融合,以x1[n]表示骨导信号,x2[n]表示气导信号,f(x)表示拼接函数为例,假设f(x):b[n]_{0,2t}=concat[x1[n]_{0,t}, x2[n]_{0,t}],其中x2[n]_{0,△t}为零。即通过f(x)将骨导信号(0至t的语音片段)与气导信号(0至t的信号片段)进行拼接,得到融合信号b[n]。或者,f(x):b[n]_{0,2t-△t}=concat[x1[n]_{0,t}, x2[n]_{△t,t}]。即通过f(x)将骨导信号(0至t的语音片段)与气导信号(△t至t的信号片段)进行拼接,得到融合信号b[n]。
基于骨导信号确定融合信号的方式4:可穿戴设备将骨导信号确定为融合信号。也即是,也可以仅利用骨导信号进行唤醒词的检测。
需要说明的是,与上述基于骨导信号确定融合信号的方式1不同的地方在于,在基于骨导信号确定融合信号的方式4中,是直接将骨导信号作为融合信号,除此之外,上述方式1中介绍的其他内容均适用于该方式4,在该方式4中不再一一详细介绍。例如在该方式4中,也可以对骨导信号进行语音片段的检测,以截取出语音片段,将截取出了语音片段作为融合信号,从而减少数据处理量。还可以对骨导信号进行预处理,例如对骨导信号进行下采样和/或增益调整。
接下来对可穿戴设备对融合信号进行识别,以进行唤醒词的检测的实现方式进行介绍。
在本申请实施例中,可穿戴设备将该融合信号包括的多个音频帧输入第一声学模型,以得到第一声学模型输出的多个后验概率向量。可穿戴设备基于该多个后验概率向量进行唤醒词的检测。其中,该多个后验概率向量与该融合信号所包括的多个音频帧一一对应,即一个后验概率向量对应于该融合信号所包括的一个音频帧,该多个后验概率向量中的第一后验概率向量用于指示该多个音频帧中的第一音频帧的音素属于多个指定音素的概率,即一个后验概率向量指示相应一个音频帧的音素属于多个指定音素的概率。也即是,可穿戴设备通过第一声学模型对融合信号进行处理,以得到融合信号所包含音素的信息,从而基于音素的信息进行唤醒词的检测。可选地,在本申请实施例中,第一声学模型可以为如前述介绍的网络模型,或者为其他结构的模型。可穿戴设备将该融合信号输入第一声学模型之后,经过第一声学模型对融合信号包括的各个音频帧的处理,得到第一声学模型输出的各个音频帧分别对应的后验概率向量。
在本申请实施例中,可穿戴设备得到第一声学模型输出的多个后验概率向量之后,基于该多个后验概率向量和该唤醒词对应的音素序列,确定声源输入的命令词对应的音素序列包括唤醒词对应的音素序列的置信度。在该置信度超过置信度阈值的情况下,确定检测到该命令词包括该唤醒词。也即是,可穿戴设备对该多个后验概率向量进行解码,以确定一个置信度。其中,该唤醒词对应的音素序列称为解码路径,所确定的置信度可称为路径得分,置信度阈值可称为唤醒门限。
示例性地,在本申请实施例中,通过第一声学模型得到各个音频帧对应的后验概率向量之后,将连续多个音频帧对应的多个后验概率向量输入基于语言模型和发音词典构建的解码图(也称为状态网络),在解码图中寻找处于解码路径上的各个音素的概率,将寻找到的各个音素的概率相加,以得到一个置信度。其中,解码路径是指唤醒词对应的音素序列。如果该置信度大于置信度阈值,则确定检测到该命令词包括唤醒词。
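作为示意,下面给出在后验概率向量序列上沿唤醒词音素序列累加路径得分(置信度)的极简Python示例(对齐方式做了简化假设,真实实现是在解码图中搜索最优路径):

```python
def path_score(posteriors, path_phoneme_ids):
    """posteriors: 形如(帧数, 音素数)的后验概率矩阵;path_phoneme_ids: 唤醒词音素序列的索引。
    简化假设:每帧取解码路径上各音素概率的最大值,逐帧累加作为置信度。"""
    score = 0.0
    for frame_post in posteriors:
        score += max(frame_post[p] for p in path_phoneme_ids)
    return score

# 若 path_score(post_matrix, wake_phonemes) > 置信度阈值,则判定命令词包括唤醒词
```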
可选地,为了降低误唤醒率,在该置信度超过置信度阈值,且该多个后验概率向量与多个模板向量之间满足距离条件的情况下,可穿戴设备确定检测到声源输入的命令词包括唤醒词。其中,该多个模板向量指示包含该唤醒词的完整信息的语音信号的音素属于多个指定音素的概率。也即是,当前输入语音不仅需要满足置信度条件,还要与模板匹配。在本申请实施例中,置信度阈值可以预先设定,例如基于经验设定,或者在注册唤醒词时根据包含唤醒词的完整信息的骨导注册信号和/或气导注册信号确定,具体实现方式在下文进行介绍。该多个模板向量是根据骨导注册信号和/或气导注册信号确定的注册后验概率向量,具体实现方式在下文进行介绍。
可选地,在该多个后验概率向量与该多个模板向量一一对应的情况下,该距离条件包括:该多个后验概率向量与对应的模板向量之间的距离的均值小于距离阈值。需要说明的是,若该多个后验概率向量与该多个模板向量一一对应,那么可穿戴设备可以直接计算该多个后验概率向量与对应的模板向量之间的距离并求均值。例如,当前声源输入语音的时长与唤醒词注册时用户输入语音的时长一致的情况下,该多个后验概率向量与该多个模板向量可能一一对应。而若当前声源输入语音的时长与唤醒词注册时用户输入语音的时长不一致,该多个后验概率向量与该多个模板向量可能不会一一对应,那么在这种情况下,可穿戴设备可以采用动态时间规整(dynamic time warping,DTW)的方法建立该多个后验概率向量与该多个模板向量之间的映射关系,从而使得可穿戴设备能够计算该多个后验概率向量与对应的模板向量之间的距离。也即是,可穿戴设备可以通过DTW解决数据长短不一情况下的模板匹配问题。
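下面给出经典DTW的一个Python实现示意,用于在后验概率向量与模板向量数量不一致时建立映射并得到可与距离阈值比较的平均距离(距离度量取欧氏距离,为假设选择):

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """动态时间规整:seq_a、seq_b为两组向量序列,返回最优对齐路径上归一化的累计欧氏距离。"""
    n, m = len(seq_a), len(seq_b)
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            dp[i, j] = cost + min(dp[i - 1, j], dp[i, j - 1], dp[i - 1, j - 1])
    return dp[n, m] / (n + m)   # 归一化后可与距离阈值比较
```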
上述介绍了可穿戴设备基于骨导信号进行唤醒词的检测的第一种实现方式,在第一种实现方式中,可穿戴设备先是基于骨导信号确定融合信号(包括四种方式),再通过声学模型对融合信号进行处理,以得到后验概率向量。然后,可穿戴设备基于唤醒词对应的解码路径对得到的后验概率向量进行解码,以得到声源当前输入的命令词对应的置信度。在该置信度大于置信度阈值的情况下,可穿戴设备确定检测到该命令词包括唤醒词。或者,在该置信度大于置信度阈值,且得到的后验概率向量与模板向量匹配的情况下,可穿戴设备确定检测到该命令词包括唤醒词。接下来介绍可穿戴设备基于骨导信号进行唤醒词的检测的第二种实现方式。
第二种实现方式
在本申请实施例中,可穿戴设备基于骨导信号进行唤醒词的检测之前,开启空气麦克,通过空气麦克采集气导信号。例如,在检测到有语音输入的情况下,开启空气麦克,通过空气麦克采集气导信号。可穿戴设备基于骨导信号和气导信号,确定多个后验概率向量,基于该多个后验概率向量进行唤醒词的检测。其中,该多个后验概率向量与骨导信号和气导信号包括的多个音频帧一一对应,该多个后验概率向量中的第一后验概率向量用于指示该多个音 频帧中的第一音频帧的音素属于多个指定音素的概率。需要说明的是,该多个音频帧包括骨导信号所包括的音频帧以及气导信号所包括的音频帧。也即是,该多个后验概率向量中的每个后验概率向量对应于骨导信号或气导信号所包括的一个音频帧,一个后验概率向量指示相应一个音频帧的音素属于多个指定音素的概率。
需要说明的是,关于骨导信号和气导信号的相关介绍可以参照前述第一种实现方式中的内容,包括骨导信号和气导信号的产生原理、对骨导信号和气导信号的预处理等等,这里不再一一赘述。
接下来首先介绍可穿戴设备基于骨导信号和气导信号,确定多个后验概率向量的实现方式。需要说明的是,可穿戴设备基于骨导信号和气导信号,确定多个后验概率向量的方式有多种,接下来介绍其中的三种方式。
基于骨导信号和气导信号,确定多个后验概率向量的方式1:可穿戴设备将骨导信号的起始部分和气导信号输入第二声学模型,以得到第二声学模型输出的第一数量个骨导后验概率向量和第二数量个气导后验概率向量。其中,骨导信号的起始部分根据语音检测的检测时延确定,第一数量个骨导后验概率向量与骨导信号的起始部分所包括的音频帧一一对应,第二数量个气导后验概率向量与气导信号所包括的音频帧一一对应。可穿戴设备将第一骨导后验概率向量和第一气导后验概率向量进行融合,以得到第二后验概率向量。其中,第一骨导后验概率向量对应骨导信号的起始部分的最后一个音频帧,该最后一个音频帧的时长小于帧时长,第一气导后验概率向量对应气导信号的第一个音频帧,该第一个音频帧的时长小于帧时长。可穿戴设备最终确定的多个后验概率向量包括第二后验概率向量、第一数量个骨导后验概率向量中除第一骨导后验概率向量之外的向量,以及第二数量个气导后验概率向量中除第一气导后验概率向量之外的向量。其中,第一数量和第二数量可以相同或不同。
需要说明的是,关于骨导信号的起始部分的相关介绍也可以参照前述第一种实现方式中的内容,这里不再赘述。在本申请实施例中,骨导信号的起始部分的最后一个音频帧可能不是完整的音频帧,即该最后一个音频帧的时长小于帧时长,例如骨导信号的起始部分包括半个帧时长的音频帧。由于气导信号丢头,而导致气导信号的第一个音频帧可能不是完整的音频帧,即该第一个音频帧的时长小于帧时长,例如气导信号的第一个音频帧包括半个帧时长的音频帧。另外,骨导信号的起始部分的最后一个音频帧的时长与气导信号的第一个音频帧的时长相加可以等于帧时长。简单来说,由于语音检测(如VAD)导致骨导信号的起始部分和气导信号的第一帧会存在不完整的情况,骨导信号的起始部分和气导信号的第一帧合起来表征了一个完整的音频帧的信息。需要说明的是,这个完整的音频帧是潜在的一帧音频,并不是实际的一帧。可选地,可穿戴设备将第一骨导后验概率向量和第一气导后验概率向量进行相加,以得到第二后验概率向量,可穿戴设备所得到的第二后验概率向量指示上述这个完整的音频帧的音素属于多个指定音素的概率。
也即是,若语音检测的检测时延不是帧时长的整数倍,那么骨导信号的起始部分的最后一个音频帧的时长小于帧时长,气导信号的第一个音频帧的时长小于帧时长,可穿戴设备需要将第一骨导后验概率向量和第一气导后验概率向量进行融合(如相加),从而得到多个后验概率向量。可选地,若语音检测的检测时延是帧时长的整数倍,那么骨导信号的起始部分的最后一个音频帧的时长等于帧时长,气导信号的第一个音频帧的时长等于帧时长,可穿戴设备将得到的第一数量个骨导后验概率向量和第二数量个气导后验概率向量作为多个后验概率向量,并进行后续的处理即可。
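下面给出按上述规则融合骨导与气导后验概率向量的Python示意(相加后是否需要重新归一化由具体实现决定,此处按文中描述直接相加):

```python
import numpy as np

def fuse_posteriors(bone_posts, air_posts, delay_is_frame_multiple):
    """bone_posts: 骨导起始部分的后验概率向量序列(二维数组);air_posts: 气导信号的后验概率向量序列。"""
    if delay_is_frame_multiple:
        # 检测时延为帧时长整数倍:两组后验概率向量直接拼接
        return np.vstack([bone_posts, air_posts])
    # 否则骨导最后一帧与气导第一帧共同表征一个完整音频帧:对应向量相加融合
    fused = bone_posts[-1] + air_posts[0]
    return np.vstack([bone_posts[:-1], fused, air_posts[1:]])
```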
图11是本申请实施例提供的另一种声学模型的结构示意图。图11所示的声学模型为在本申请实施例中的第二声学模型。可以看出,本申请实施例中的第二声学模型包括两个输入层(未示出)、一个共享网络层和两个输出层。其中,这两个输入层用于分别输入骨导信号的起始部分和气导信号。共享网络层用于分别对这两个输入层的输入数据进行处理,以分别提取骨导信号的起始部分和气导信号的特征。这两个输出层用于分别接收共享网络层的两个输出数据,并分别对这两个输出数据进行处理,以输出骨导信号的起始部分对应的第一数量个骨导后验概率向量,以及气导信号对应的第二数量个气导后验概率向量。也即是,可穿戴设备通过第二声学模型对骨导信号的起始部分和气导信号这两部分信号分别进行处理,得到这两部分信号对应的两组后验概率向量。只不过在声学模型中存在共享网络层,以供这两部分信号共享部分网络参数。
在本申请实施例中,可穿戴设备将得到的第一骨导后验概率向量和第一气导后验概率向量进行融合,以得到第二后验概率向量,从而使该多个骨导后验概率向量和多个气导后验概率向量得以融合,进而得到多个后验概率向量,即,可穿戴设备将两部分信号的后验概率进行了融合,从而使得到的多个后验概率向量包含了声源输入的命令词信息,这也可以视为基于骨导信号对气导信号进行丢头补偿的一种方法,只不过不是通过直接融合(如拼接)信号来进行补偿而已。另外,基于第二声学模型对骨导信号的起始部分和气导信号进行处理的方案,可以认为是一种多任务(multi-task)方案,即将骨导信号的起始部分和气导信号作为两个任务,采用共享网络参数的方法分别确定对应的后验概率向量,以将骨导信号的起始部分隐式地与气导信号进行融合。
基于骨导信号和气导信号,确定多个后验概率向量的方式2:可穿戴设备将骨导信号的起始部分和气导信号输入第三声学模型,以得到第三声学模型输出的多个后验概率向量。其中,骨导信号的起始部分根据语音检测的检测时延确定。需要说明的是,关于骨导信号的起始部分的相关介绍也可以参照前述第一种实现方式中的内容,这里不再赘述。
在本申请实施例中,如图12所示,第三声学模型包括两个输入层(如一个输入层包括DNN和CNN等层)、一个拼接层(concat层)、一个网络参数层(如包括RNN等层)和一个输出层(如包括softmax等层)。其中,这两个输入层用于分别输入骨导信号和气导信号,拼接层用于拼接两个输入层的输出数据,网络参数层用于对拼接层的输出数据进行处理,输出层用于输出一组后验概率向量。也即是,可穿戴设备将骨导信号的起始部分和气导信号同时输入第三声学模型,通过第三声学模型中的拼接层将骨导信号的起始部分和气导信号隐式地融合在一起,进而得到一组后验概率向量,从而使得到的多个后验概率向量包含了声源输入的命令词信息,这也可以视为基于骨导信号对气导信号进行丢头补偿的一种方法,只不过不是通过直接融合信号来进行补偿而已。
基于骨导信号和气导信号,确定多个后验概率向量的方式3:可穿戴设备将骨导信号和气导信号输入第三声学模型,以得到第三声学模型输出的多个后验概率向量。也即是,可穿戴设备直接将骨导信号和气导信号同时输入第三声学模型,通过第三声学模型输出一组后验概率向量,从而使得到的多个后验概率向量包含了声源输入的命令词信息,这也可以视为基于骨导信号对气导信号进行丢头补偿的一种方法,只不过不是通过直接融合信号来进行补偿而已。
接下来对可穿戴设备基于该多个后验概率向量进行唤醒词的检测的实现方式进行介绍。
在本申请实施例中,可穿戴设备基于该多个后验概率向量和该唤醒词对应的音素序列,确定声源输入的命令词对应的音素序列包括唤醒词对应的音素序列的置信度。在该置信度超过置信度阈值的情况下,确定检测到该唤醒词。具体实现方式参照前述第一种实现方式中的相关介绍,这里不再赘述。
可选地,为了降低误唤醒率,在该置信度超过置信度阈值,且该多个后验概率向量与多个模板向量之间满足距离条件的情况下,可穿戴设备确定检测到该命令词包括唤醒词。可选地,在该多个后验概率向量与该多个模板向量一一对应的情况下,该距离条件包括:该多个后验概率向量与对应的模板向量之间的距离的均值小于距离阈值。具体实现方式参照前述第一种实现方式中的相关介绍,这里不再赘述。
步骤403:在检测到该命令词包括唤醒词时,对待唤醒设备进行语音唤醒。
在本申请实施例中,在检测到声源输入的命令词包括唤醒词时,可穿戴设备进行语音唤醒。例如,可穿戴设备向智能设备(即待唤醒设备)发送唤醒指令,以唤醒智能设备。或者,在可穿戴设备本身即为智能设备的情况下,可穿戴设备唤醒除骨导麦克之外的其他部件或模块,即可穿戴设备整体进入工作状态。
由上述可知,本申请实施例提供的语音唤醒的方法有多种实现方式,如上述介绍的第一种实现方式和第二种实现方式,在这两种实现方式中又分别包括多种具体实现方式。接下来请参照图13至18对上述介绍的几种具体实现再次进行解释说明。
图13是本申请实施例提供的另一种语音唤醒的方法流程图。图13对应于上述第一种实现方式中的方式1。以通过可穿戴设备中的多个模块来进行语音唤醒为例,可穿戴设备通过骨导麦克采集骨导信号,通过VAD控制模块对骨导信号进行VAD,在检测到有语音输入时VAD控制模块输出高电平的VAD控制信号。在未检测到有语音输入的情况下VAD控制模块输出低电平的VAD控制信号。VAD控制模块将VAD控制信号分别发送到空气麦克控制模块、前端增强模块和识别引擎。VAD控制信号用于控制空气麦克控制模块、前端增强模块和识别引擎的开关。在VAD控制信号为高电平的情况下,空气麦克控制模块控制空气麦克开启,以采集气导信号,前端增强模块开启以对气导信号进行前端增强,识别引擎开启以基于骨导信号和气导信号进行唤醒词的检测。其中,融合模块对骨导信号进行下采样和/或增益调整等预处理,用预处理后的骨导信号的起始部分对前端增强后的气导信号进行丢头补偿,以得到融合信号。融合模块将融合信号发送给识别引擎,识别引擎通过第一声学模型对融合信号进行识别,以得到唤醒词的检测结果。识别引擎将得到的检测结果发送给处理器(如图示的微控制单元(micro-controller unit,MCU)),处理器基于检测结果确定是否唤醒智能设备。若检测结果指示检测到声源输入的命令词包括唤醒词,则处理器对智能设备进行语音唤醒。若检测结果指示未检测到唤醒词,则处理器不唤醒智能设备。
图14至图16是本申请实施例提供的又三种语音唤醒的方法流程图。图14、图15、图16分别与图13的区别在于,在图14所示的方法中,融合模块基于预处理后的骨导信号的起始部分生成增强起始信号,用增强起始信号对前端增强后的气导信号进行丢头补偿,以得到融合信号。在图15所示的方法中,融合模块将预处理后的骨导信号和前端增强后的气导信号直接拼接,以对气导信号进行丢头补偿,从而得到融合信号。在图16所示的方法中,VAD控制信号无需发送给空气麦克控制模块,也就无需采集气导信号,另外,识别引擎直接将预 处理后的骨导信号确定为融合信号。
图17是本申请实施例提供的又一种语音唤醒的方法流程图。图17与图13的区别在于,在图17所示的方法中,识别引擎将预处理的骨导信号的起始部分和前端增强后的气导信号分别输入第二声学模型,得到第二声学模型的两个输出层分别输出的骨导后验概率向量和气导后验概率向量,即得到后验概率对。识别引擎将骨导后验概率向量和气导后验概率向量进行融合,以得到多个后验概率向量,并通过解码该多个后验概率向量以得到唤醒词的检测结果。
图18是本申请实施例提供的又一种语音唤醒的方法流程图。图18与图17的区别在于,在图18所示的方法中,识别引擎将预处理的骨导信号的起始部分和前端增强后的气导信号分别输入第三声学模型,或将预处理的骨导信号和前端增强后的气导信号分别输入第三声学模型,得到第三声学模型的一个输出层分别输出的多个后验概率向量。
由上述可知,本申请实施例中,通过骨导麦克采集骨导信号进行语音检测,能够保证低功耗。在保证低功耗的同时,考虑到由于语音检测的延迟可能会导致采集的气导信号丢头,从而未包含声源输入的命令词的完整信息,而骨导麦克采集的骨导信号包含声源输入的命令词信息,即骨导信号未丢头,因此本方案基于骨导信号进行唤醒词的检测。这样,唤醒词的识别准确率较高,语音唤醒的准确度较高。在具体实现中,可以基于骨导信号对气导信号直接或隐式地进行丢头补偿,或者直接基于骨导信号进行唤醒词的检测。
以上介绍了可穿戴设备基于骨导信号进行语音唤醒的实现过程。在本申请实施例中,还可以在可穿戴设备中注册唤醒词,可选地,还可以在注册唤醒词的同时确定上述实施例中的置信度阈值,还可以确定上述实施例中的多个模板向量。接下来将对唤醒词的注册过程进行介绍。
在本申请实施例中,可穿戴设备首先确定唤醒词对应的音素序列。之后,可穿戴设备获取骨导注册信号,该骨导注册信号包含唤醒词的完整信息。可穿戴设备基于该骨导注册信号和该唤醒词对应的音素序列,确定置信度阈值。可选地,可穿戴设备还可以基于该骨导信号确定多个模板向量。
可选地,可穿戴设备获取输入的唤醒词,按照发音词典确定唤醒词对应的音素序列。以用户向可穿戴设备输入唤醒词文本为例,可穿戴设备获取用户输入的唤醒词文本,按照发音词典确定唤醒词对应的音素序列。可选地,在注册唤醒词的过程中,可穿戴设备还可以在用户输入唤醒词文本之后,检测输入的唤醒词文本是否符合文本注册条件,在符合文本注册条件的情况下,可穿戴设备按照发音词典确定唤醒词文本对应的音素序列。
示例性地,文本注册条件包括文本输入次数要求和字符要求等。以文本输入次数要求为需要用户输入一次或多次唤醒词文本为例,可穿戴设备每检测到一次用户输入的唤醒词文本,就对输入的唤醒词文本进行文本校验和分析,以检验当前输入的唤醒词文本是否符合字符要求。如果用户输入的唤醒词文本不符合字符要求,则可穿戴设备通过文字或声音的方式来提示用户不符合要求的原因并要求重新输入。若用户一次或多次输入的唤醒词文本均符合字符要求且相同,则可穿戴设备按照发音词典确定唤醒词文本对应的音素序列。
可选地,可穿戴设备通过文本校验来检测当前输入的唤醒词文本是否符合字符要求。示例性地,字符要求包括以下一种或多种要求:要求中文(非中文即不符合字符要求)、4至6个字(少于4个字或多于6个字即不符合字符要求)、不存在语气助词(存在即不符合字符要 求)、不存在3个以上读音相同的重复字(存在即不符合字符要求)、与已有命令词均不同(存在相同则不符合字符要求)、与已有命令词的音素重叠的比例不超过70%(超过70%即不符合字符要求,用于防止误闯)、对应音素属于发音词典中的音素(不属于即不符合字符要求,是一种异常情况)。
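作为示意,下面给出按上述部分字符要求做文本校验的Python片段(仅覆盖其中几条规则,语气助词列表等均为假设):

```python
import re

def check_wakeup_text(text, existing_words, particles=("吗", "呢", "吧", "啊")):
    """校验唤醒词文本是否符合部分字符要求(规则子集,阈值与列表为示例假设)。"""
    if not re.fullmatch(r"[\u4e00-\u9fa5]+", text):
        return False, "要求中文"
    if not 4 <= len(text) <= 6:
        return False, "要求4至6个字"
    if any(p in text for p in particles):
        return False, "不能包含语气助词"
    if text in existing_words:
        return False, "与已有命令词相同"
    return True, "校验通过"
```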
以上为文本注册的过程,文本注册能够确定唤醒词对应的音素序列。可穿戴设备确定唤醒词对应的音素序列之后,后续可以将该音素序列作为唤醒词的解码路径,解码路径用于在语音唤醒的过程中进行唤醒词的检测。
除了文本注册之外,还需要语音注册。在本申请实施例中,在文本注册完成后,可穿戴设备还需获取骨导注册信号,该骨导注册信号包含唤醒词的完整信息。可选地,可穿戴设备在获取骨导注册信号的同时还获取气导注册信号。可选地,在语音注册的过程中,以可穿戴设备获取用户输入的骨导注册信号和气导注册信号为例,可穿戴设备在获取输入的骨导注册信号和气导注册信号之后,校验骨导注册信号和气导注册信号是否符合语音注册条件,在符合语音注册条件的情况下,可穿戴设备进行后续的处理,以确定置信度阈值。
示例性地,语音注册条件包括语音输入次数要求、信噪比要求和路径得分要求等。以语音输入次数要求为需要用户输入三次唤醒词语音(包括骨导注册信号和气导注册信号)为例,可穿戴设备每检测到一次用户输入的唤醒词语音,就对输入的唤醒词语音进行发音校验和分析,以检验当前输入的唤醒词语音是否符合信噪比要求和路径得分要求。如果用户输入的唤醒词语音不符合信噪比要求或路径得分要求,则可穿戴设备通过文字或声音的方式提示用户不符合要求的原因并要求重新输入。若用户三次输入的唤醒词语音均符合信噪比要求和路径得分要求,则可穿戴设备确定用户输入的唤醒词语音符合语音注册条件,可穿戴设备进行后续的处理。
可选地,可穿戴设备可以先检测输入的唤醒词语音是否符合信噪比要求,在确定输入的唤醒词语音符合信噪比要求之后,再检测输入的唤醒词语音是否符合路径得分要求。示例性地,信噪比要求包括要求信噪比不低于信噪比阈值(低于则不符合信噪比要求),例如,要求骨导注册信号的信噪比不低于第一信噪比阈值,和/或,要求气导注册信号的信噪比不低于第二信噪比阈值。可选地,第一信噪比阈值大于第二信噪比阈值。若用户输入的唤醒词语音不符合信噪比要求,则可穿戴设备提示用户当前环境噪声较大不适合注册,需要用户找一个安静的环境重新输入唤醒词语音。路径得分要求包括基于每次输入的唤醒词语音得到的路径得分不小于校准阈值、基于三次输入的唤醒词语音得到的三个路径得分的均值不小于校准阈值、基于任意两次输入的唤醒词语音得到的两个路径得分之间相差不超过100分(或其他值)。其中,基于唤醒词语音得到路径得分的实现过程将在下文进行介绍,实质与前述语音唤醒的过程中基于骨导信号得到置信度的过程相类似。
接下来介绍可穿戴设备基于该骨导注册信号和该唤醒词对应的音素序列,确定置信度阈值的实现方式。与前述语音唤醒的过程中基于骨导信号得到置信度相类似,可穿戴设备可以通过多种实现方式来确定置信度阈值,接下来介绍其中的两种实现方式。
第一种实现方式
可穿戴设备基于骨导注册信号确定融合注册信号,基于该融合注册信号和该唤醒词对应的音素序列,确定置信度阈值和多个模板向量。
首先介绍可穿戴设备基于骨导注册信号确定融合注册信号的实现方式。需要说明的是,可穿戴设备基于骨导注册信号确定融合注册信号的方式有多种,接下来介绍其中的四种方式。
基于骨导注册信号确定融合注册信号的方式1:基于骨导注册信号确定融合注册信号之前,获取气导注册信号。可穿戴设备将骨导注册信号的起始部分和气导注册信号进行融合,以得到融合注册信号。其中,骨导注册信号的起始部分根据语音检测的检测时延确定。可选地,在本申请实施例中通过信号拼接来进行信号融合。
需要说明的是,可穿戴设备将骨导注册信号的起始部分和气导注册信号进行融合的实现方式与前述实施例中基于骨导信号确定融合信号的方式1的原理类似,这里不再详细介绍。另外,可穿戴设备也可以对骨导注册信号和气导注册信号进行语音片段的检测,以截取出语音片段,基于截取出的语音片段进行信号拼接,从而减少数据处理量。还可以对骨导注册信号和气导注册信号进行预处理,例如对骨导注册信号进行下采样和/或增益调整,对气导信号进行语音增强等。具体实现方式与前述实施例中的相关内容的原理类似,请参照前述实施例,这里不再详细介绍。
基于骨导注册信号确定融合注册信号的方式2:基于骨导注册信号确定融合注册信号之前,获取气导注册信号。可穿戴设备基于骨导注册信号的起始部分生成增强起始注册信号,将增强起始注册信号和气导注册信号进行融合,以得到融合注册信号。其中,骨导注册信号的起始部分根据语音检测的检测时延确定。
需要说明的是,与上述基于骨导注册信号确定融合注册信号的方式1不同的地方在于,在该方式2中,可穿戴设备是利用骨导注册信号的起始部分生成增强起始注册信号,将增强起始注册信号与气导注册信号进行融合,而非将骨导注册信号的起始部分与气导信号进行融合。另外,在该方式2中,可穿戴设备也可以对骨导注册信号和气导注册信号进行语音片段的检测,以截取出语音片段,基于截取出的语音片段进行信号融合,从而减少数据处理量。可穿戴设备还可以对骨导注册信号和气导注册信号进行预处理,例如对骨导注册信号进行下采样和/或增益调整,对气导信号进行语音增强等。具体实现方式与前述实施例中的相关内容的原理类似,请参照前述实施例,这里不再详细介绍。
在本申请实施例中,可穿戴设备可以将骨导注册信号的起始部分输入生成网络模型,以得到生成网络模型输出的增强起始注册信号。其中,该生成网络模型可以与前述介绍的生成网络模型为同一个,也可以为另外的一个生成网络模型,本申请实施例对此不作限定。本申请实施例也不限定该生成网络模型的网络结构、训练方式、训练设备等。
基于骨导注册信号确定融合注册信号的方式3:基于骨导注册信号确定融合注册信号之前,获取气导注册信号。可穿戴设备将骨导注册信号和气导注册信号直接进行融合,以得到融合注册信号。
需要说明的是,与上述基于骨导注册信号确定融合注册信号的方式1不同的地方在于,在该方式3中,可穿戴设备是直接将骨导注册信号和气导注册信号进行融合,以得到融合注册信号。另外,在该方式3中,可穿戴设备也可以对骨导注册信号和气导注册信号进行语音片段的检测,以截取出语音片段,基于截取出的语音片段进行信号融合,从而减少数据处理量。可穿戴设备还可以对骨导注册信号和气导注册信号进行预处理,例如对骨导注册信号进行下采样和/或增益调整,对气导信号进行语音增强等。具体实现方式与前述实施例中的相关内容的原理类似,请参照前述实施例,这里不再详细介绍。
基于骨导注册信号确定融合注册信号的方式4:可穿戴设备将骨导注册信号确定为融合注册信号。
需要说明的是,与上述基于骨导注册信号确定融合注册信号的方式1不同的地方在于,在该方式4中,可穿戴设备是直接将骨导注册信号作为融合注册信号。另外,在该方式4中,可穿戴设备也可以对骨导注册信号进行语音片段的检测,以截取出语音片段,基于截取出的语音片段进行后续的处理,从而减少数据处理量。可穿戴设备还可以对骨导注册信号进行预处理,例如对骨导注册信号进行下采样和/或增益调整等。具体实现方式与前述实施例中的相关内容的原理类似,请参照前述实施例,这里不再详细介绍。
接下来对可穿戴设备基于该融合注册信号和该唤醒词对应的音素序列,确定置信度阈值和多个模板向量的实现方式进行介绍。
可选地,可穿戴设备将该融合注册信号包括的多个注册音频帧输入第一声学模型,以得到第一声学模型输出的多个注册后验概率向量。其中,该多个注册后验概率向量与该多个注册音频帧一一对应,该多个注册后验概率向量中的第一注册后验概率向量指示该多个注册音频帧中的第一注册音频帧的音素属于多个指定音素的概率。即,该多个注册后验概率向量中的每个注册后验概率向量对应于融合注册信号所包括的一个注册音频帧,一个注册后验概率向量指示相应一个注册音频帧的音素属于多个指定音素的概率。可穿戴设备将该多个注册后验概率向量确定为多个模板向量。可穿戴设备基于该多个注册后验概率向量和该唤醒词对应的音素序列确定置信度阈值。也即是,可穿戴设备通过第一声学模型对融合注册信号进行处理,以得到融合信号所包含音素的信息,即得到注册后验概率向量,将注册后验概率向量作为模板向量,并存储模板向量。可穿戴设备还基于该唤醒词对应的音素序列(即解码路径),对注册后验概率向量进行解码,以确定一个路径得分,将该路径得分作为置信度阈值,并存储置信度阈值。其中,第一声学模型的相关介绍请参照前述实施例,这里不再赘述。
上述介绍了可穿戴设备基于该骨导注册信号和该唤醒词对应的音素序列,确定置信度阈值的第一种实现方式,在第一种实现方式中,可穿戴设备先是基于骨导注册信号确定融合注册信号(包括四种方式),再通过声学模型对融合注册信号进行处理,以得到注册后验概率向量。然后,可穿戴设备基于唤醒词对应的解码路径对得到的注册后验概率向量进行解码,以得到置信度阈值。可选地,可穿戴设备将得到的注册后验概率向量作为模板向量进行存储。接下来介绍可穿戴设备基于该骨导注册信号和该唤醒词对应的音素序列,确定置信度阈值的第二种实现方式。
第二种实现方式
在本申请实施例中,可穿戴设备基于骨导注册信号和唤醒词对应的音素序列,确定置信度阈值之前,获取气导注册信号。可穿戴设备基于骨导注册信号和气导注册信号,确定多个注册后验概率向量。其中,该多个注册后验概率向量与骨导注册信号和气导注册信号包括的多个注册音频帧一一对应,该多个注册后验概率向量中的第一注册后验概率向量指示该多个注册音频帧中的第一注册音频帧的音素属于多个指定音素的概率。需要说明的是,该多个注册音频帧包括骨导注册信号所包括的注册音频帧以及气导注册信号所包括的注册音频帧。也即是,该多个注册后验概率向量中的每个注册后验概率向量对应于骨导注册信号或气导注册信号所包括的一个注册音频帧,一个注册后验概率向量指示相应一个注册音频帧的音素属于多个指定音素的概率。可穿戴设备基于该多个注册后验概率向量和唤醒词对应的音素序列确定置信度阈值。可选地,可穿戴设备将该多个注册后验概率向量确定为多个模板向量。
需要说明的是,关于骨导注册信号和气导注册信号的相关介绍可以参照前述第一种实现 方式中的内容,包括骨导注册信号和气导注册信号的产生原理、对骨导注册信号和气导注册信号的预处理等等,这里不再一一赘述。
接下来首先介绍可穿戴设备基于骨导注册信号和气导注册信号,确定多个注册后验概率向量的实现方式。需要说明的是,可穿戴设备基于骨导注册信号和气导注册信号,确定多个注册后验概率向量的方式有多种,接下来介绍其中的三种方式。
基于骨导注册信号和气导注册信号,确定多个注册后验概率向量的方式1:可穿戴设备将骨导注册信号的起始部分和气导注册信号输入第二声学模型,以得到第二声学模型输出的第三数量个骨导注册后验概率向量和第四数量个气导注册后验概率向量。可穿戴设备将第一骨导注册后验概率向量和第一气导注册后验概率向量进行融合,以得到第二注册后验概率向量。其中,骨导注册信号的起始部分根据语音检测的检测时延确定,第三数量个骨导注册后验概率向量与骨导注册信号的起始部分所包括的注册音频帧一一对应,第四数量个气导注册后验概率向量与气导注册信号所包括的注册音频帧一一对应。即,一个骨导注册后验概率向量对应骨导注册信号的起始部分所包括的一个注册音频帧,一个气导注册后验概率向量对应气导注册信号所包括的一个注册音频帧。第一骨导注册后验概率向量对应骨导注册信号的起始部分的最后一个注册音频帧,该最后一个注册音频帧的时长小于帧时长,第一气导注册后验概率向量对应气导注册信号的第一个注册音频帧,该第一个注册音频帧的时长小于帧时长。可穿戴设备最终所确定的多个注册后验概率向量包括第二注册后验概率向量、第三数量个骨导注册后验概率向量中除第一骨导注册后验概率向量之外的向量,以及第四数量个气导注册后验概率向量中除第一气导注册后验概率向量之外的向量。其中,第三数量和第四数量可以相同或不同,第三数量和前述第一数量可以相同或不同,第四数量和前述第二数量可以相同或不同。
可选地,可穿戴设备将第一骨导注册后验概率向量和第一气导注册后验概率向量进行相加,以得到第二注册后验概率向量。
需要说明的是,关于骨导注册信号的起始部分的相关介绍也可以参照前述第一种实现方式中的内容,这里不再赘述。另外,关于第二声学模型的相关介绍请参照前述实施例的相关内容,这里不再赘述。可穿戴设备通过第二声学模型得到第三数量个骨导注册后验概率向量和第四数量个气导注册后验概率向量的原理,与前述实施例中通过第二声学模型得到第一数量个骨导后验概率向量和第二数量个气导后验概率向量的原理一致,这里不再详细介绍。
基于骨导注册信号和气导注册信号,确定多个注册后验概率向量的方式2:可穿戴设备将骨导注册信号的起始部分和气导注册信号输入第三声学模型,以得到第三声学模型输出的多个注册后验概率向量。其中,骨导注册信号的起始部分根据语音检测的检测时延确定。
需要说明的是,关于骨导注册信号的起始部分的相关介绍也可以参照前述第一种实现方式中的内容,这里不再赘述。另外,关于第三声学模型的相关介绍请参照前述实施例的相关内容,这里不再赘述。可穿戴设备通过第三声学模型得到多个注册后验概率向量的原理,与前述实施例中通过第三声学模型得到多个后验概率向量的原理一致,这里不再详细介绍。
基于骨导注册信号和气导注册信号,确定多个注册后验概率向量的方式3:可穿戴设备将骨导注册信号和气导注册信号输入第三声学模型,以得到第三声学模型输出的多个注册后验概率向量。也即是,可穿戴设备直接将骨导注册信号和气导注册信号同时输入第三声学模型,通过第三声学模型输出一组注册后验概率向量,从而使得到的多个注册后验概率向量包含了声源输入的唤醒词的完整信息。
在本申请实施例中,可穿戴设备确定该多个注册后验概率向量之后,基于该多个注册后验概率向量和唤醒词对应的音素序列确定置信度阈值,原理与前述介绍的可穿戴设备基于多个后验概率向量和唤醒词的音素序列确定置信度的原理相类似,具体实现方式请参照前述相关介绍,这里不再详细介绍。
图19至图24是本申请实施例提供的六种唤醒词注册的方法流程图。接下来将参照图19至图24对本申请实施例中唤醒词的注册过程再次进行解释说明。
图19对应于上述唤醒词注册的第一种实现方式中的方式1,唤醒词的注册过程包括文本注册和语音注册。以可穿戴设备通过多个模块进行唤醒词注册为例,可穿戴设备首先进行文本注册。可穿戴设备的文本注册模块获取用户自定义输入的唤醒词文本,对输入的唤醒词文本进行文本校验和文本分析,并按照发音词典确定符合文本注册要求的唤醒词文本对应的音素序列,将该音素序列确定为解码路径,文本注册模块将解码路径发送给识别引擎。识别引擎存储该解码路径。可穿戴设备再进行语音注册。可穿戴设备的语音注册模块获取语音注册信号,包括骨导注册信号和气导注册信号。可选地,可穿戴设备通过VAD来获取骨导注册信号和气导注册信号,还可以对获取的骨导注册信号和气导注册信号进行预处理。之后,语音注册模块对骨导注册信号和气导注册信号进行发音校验,融合模块将校验后符合语音注册要求的骨导注册信号和气导注册信号进行融合,以得到一个融合注册信号。为了区分图19至图22,这里将图19中的融合注册信号称为融合注册信号1。语音注册模块通过第一声学模型对该融合注册信号1进行处理,以得到多个注册后验概率向量,并通过解码多个注册后验概率向量,以确定一个路径得分,将该路径得分作为唤醒门限(即置信度阈值)发送给识别引擎,识别引擎存储该唤醒门限,唤醒门限用于后续语音唤醒中的一级误唤醒压制。可选地,语音注册模块将得到的多个注册后验概率向量作为多个模板向量发送给识别引擎,识别引擎存储该多个模板向量,该多个模板向量用于后续语音唤醒中的二级误唤醒压制。
图20至图22分别对应于上述唤醒词注册的第一种实现方式中的方式2、方式3和方式4。与图19的区别在于,在图20所示的方法中,可穿戴设备的语音注册模块基于骨导注册信号的起始部分生成增强起始注册信号,将增强起始注册信号与气导注册信号进行融合,以得到一个融合注册信号。这里将图20中的融合注册信号称为融合注册信号2。在图21所示的方法中,语音注册模块直接将骨导注册信号和气导注册信号进行融合,以得到一个融合注册信号。这里将图21中的融合注册信号称为融合注册信号3。在图22所示的方法中,语音注册模块可以无需获取气导注册信号,直接将骨导注册信号确定为融合注册信号。这里将图22中的融合注册信号称为融合注册信号4。
图23对应于上述唤醒词注册的第二种实现方式中的方式1。与图19的区别在于,在图23所示的方法中,可穿戴设备的语音注册模块将骨导注册信号的起始部分和气导注册信号分别输入第二声学模型中,以得到第二声学模型分别输出的第三数量个骨导注册后验概率向量和第四数量个气导注册后验概率向量。语音注册模块将第三数量个骨导注册后验概率向量和第四数量个气导注册后验概率向量进行融合,以得到多个注册后验概率向量。
图24对应于上述唤醒词注册的第二种实现方式中的方式2和方式3。与图23的区别在于,在图24所示的方法中,可穿戴设备的语音注册模块将骨导注册信号的起始部分和气导注册信号分别输入第三声学模型中,或者将骨导注册信号和气导注册信号输入第三声学模型中, 以得到第三声学模型输出的多个注册后验概率向量。
由上述可知,在唤醒词的注册过程中对骨导注册信号和气导注册信号的处理流程,与语音唤醒的过程中对骨导信号和气导信号的处理流程相类似,只不过在唤醒词的注册过程中,是为了得到唤醒门限和模板向量,在语音唤醒的过程中,是为了对唤醒词进行检测。其中,模板向量能够提升本方***性和鲁棒性。本方案通过骨导信号对气导信号直接或隐式地进行丢头补偿,或者直接基于骨导信号进行唤醒词的检测,由于骨导信号包含声源输入的命令词信息,即骨导信号未丢头,因此,唤醒词的识别准确率较高,语音唤醒的准确率较高。
以上实施例中介绍了语音唤醒的过程以及唤醒词的注册过程。由前述可知,在本申请实施例中的声学模型需要预先训练得到,如第一声学模型、第二声学模型和第三声学模型等均需要预先训练得到,接下来以计算机设备训练声学模型为例对声学模型的训练过程进行介绍。
在本申请实施例中,计算机设备首先获取第二训练数据集,第二训练数据集包括多个第二样本信号对,一个第二样本信号对包括一个骨导样本信号和一个气导样本信号,一个第二样本信号对对应一个命令词。可选地,第二训练数据集包括直接采集的语音数据、公开语音数据和/或从第三方购买的语音数据。可选地,在训练之前,计算机设备可以对获取的第二训练数据集进行预处理,以得到经预处理的第二训练数据集,经预处理的第二训练数据集能够模拟真实语音数据的分布,以便更接近于真实场景的语音,增加训练样本的多样性。示例性地,对第二训练数据集进行备份,即额外增加一份数据,对备份的数据进行预处理。可选地,将备份的数据分为多份,对每份数据进行一种预处理,对各份数据所做的预处理可以不同,这样能够使总的训练数据加倍,且保证数据的全面性,在性能和训练开销上达到平衡,使得在一定程度上提高语音识别的准确率和鲁棒性。其中,对每份数据进行预处理的方法可以包括增加噪音、音量增强、增加混响、时移、改变音调、时间拉伸等中的一种或多种。
以训练得到第一声学模型为例,计算机设备基于第二训练数据集确定多个融合样本信号,共有四种方式。需要说明的是,这四种方式与上述实施例中识别过程(即语音唤醒的过程)中可穿戴设备基于骨导信号确定融合信号的四种方式一一对应。即,若在识别过程中,可穿戴设备将骨导信号的起始部分和气导信号进行融合,以得到融合信号,那么在训练过程中,计算机设备将该多个第二样本信号对中每个第二样本信号对包括的骨导样本信号的起始部分和气导样本信号进行融合,以得到多个融合样本信号。若在识别过程中,可穿戴设备基于骨导信号的起始部分生成增强起始信号,将增强起始信号和气导信号进行融合,以得到融合信号,那么在训练过程中,计算机设备基于该多个第二样本信号对中每个第二样本信号对包括的骨导样本信号的起始部分生成增强起始样本信号,将各个增强起始样本信号与对应的气导样本信号进行融合,以得到多个融合样本信号。若在识别过程中,计算机设备将骨导信号和气导信号直接进行融合,以得到融合信号,那么在训练过程中,计算机设备将该多个第二样本信号对中每个第二样本信号对包括的骨导样本信号和气导样本信号直接进行融合,以得到多个融合样本信号。若在识别过程中,可穿戴设备将骨导信号确定为融合信号,那么在训练过程中,计算机设备将该多个第二样本信号对包括的骨导样本信号确定为多个融合样本信号。其中,骨导样本信号的起始部分根据语音检测的检测时延确定或根据经验设定。之后,计算机设备通过该多个融合样本信号训练第一初始声学模型,以得到本申请实施例中的第一声学模型。其中,第一初始声学模型的网络结构与第一声学模型的网络结构相同。
可选地,计算机设备在基于第二训练数据集确定多个融合样本信号之前,对第二训练数据集包括的骨导样本信号和气导样本信号进行预处理,例如对气导样本信号进行前端增强,对骨导样本信号进行下采样和增益调整等。可选地,计算机设备将该多个第二样本信号对中每个第二样本信号对包括的骨导样本信号的起始部分输入生成网络模型,得到生成网络模型输出的增强起始样本信号,该生成网络模型可以与前述实施例中的生成网络模型为同一个模型,也可以为不同的模型,本申请实施例对此不作限定。
示例性地,图25至图28是本申请实施例提供的基于上述四种方式分别训练得到第一声学模型的四个示意图。参见图25至图28,计算机设备获取的第二训练数据集包括骨导数据(骨导样本信号)和气导数据(气导样本信号),计算机设备通过融合模块对骨导数据进行下采样和/或增益调整,通过前端增强模块对气导数据进行前端增强。图25至图27对应于这四种方式的前三种方式,融合模块分别采用对应的方式通过骨导数据对气导信号进行丢头补偿,以得到训练输入数据。图28对应于这四种方式中的第四种方式,无需气导数据,融合模块直接将骨导数据作为训练输入数据。然后,计算机设备通过训练输入数据训练网络模型(即第一初始声学模型),并通过损失函数、以及梯度下降算法和误差反向传播来调整网络模型,从而得到经训练的第一声学模型。
以训练第二声学模型为例,与上述语音唤醒的过程中可穿戴设备基于骨导信号和气导信号确定多个后验概率向量的方式1相对应地,在训练过程中,计算机设备将该多个第二样本信号对中各个第二样本信号对包括的骨导样本信号的起始部分和气导样本信号作为第二初始声学模型的输入,以训练第二初始声学模型,得到第二声学模型。其中,第二初始声学模型的网络结构与第二声学模型的网络结构相同。即,第二初始声学模型也包括两个输入层、一个共享网络层和两个输出层。
图29是本申请实施例提供的一种训练得到第二声学模型的示意图。参见图29,计算机设备获取的第二训练数据集包括骨导数据(骨导样本信号)和气导数据(气导样本信号),计算机设备对骨导数据进行下采样和/或增益调整,对气导数据进行前端增强。计算机设备将骨导数据作为训练输入数据1,将气导数据作为训练输入数据2。计算机设备通过训练输入数据1和训练输入数据2训练网络模型(即第二初始声学模型),并通过损失函数、以及梯度下降算法和误差反向传播来调整网络模型,从而得到经训练的第二声学模型。其中,训练输入数据1和训练输入数据2可以对应同一个损失函数或不同的损失函数,本申请实施例对此不作限定。
以训练第三声学模型为例,与上述语音唤醒的过程中可穿戴设备基于骨导信号和气导信号确定多个后验概率向量的方式2相对应地,在训练过程中,计算机设备将该多个第二样本信号对中各个第二样本信号对包括的骨导样本信号的起始部分和气导样本信号作为第三初始声学模型的输入,以训练第三初始声学模型,得到第三声学模型。或者,与上述语音唤醒的过程中可穿戴设备基于骨导信号和气导信号确定多个后验概率向量的方式3相对应地,在训练过程中,计算机设备将该多个第二样本信号对中各个第二样本信号对包括的骨导样本信号和气导样本信号作为第三初始声学模型的输入,以训练第三初始声学模型,得到第三声学模型。其中,第三初始声学模型的网络结构与第三声学模型的网络结构相同。即,第三初始声学模型也包括两个输入层、一个拼接层、一个网络参数层和一个输出层。
示例性地,图30是本申请实施例提供的一种训练得到第三声学模型的示意图。参见图 30,计算机设备获取的第二训练数据集包括骨导数据(骨导样本信号)和气导数据(气导样本信号),计算机设备对骨导数据进行下采样和/或增益调整,对气导数据进行前端增强。计算机设备将骨导数据或骨导数据中的起始部分作为训练输入数据1,将气导数据作为训练输入数据2。计算机设备通过训练输入数据1和训练输入数据2训练网络模型(即第三初始声学模型),并通过损失函数、以及梯度下降算法和误差反向传播来调整网络模型,从而得到经训练的第三声学模型。
综上所述,在本申请实施例中,在训练过程中,也通过骨导样本信号对气导样本信号直接或者隐式地进行丢头补偿,从而构造训练输入数据来训练初始声学模型,得到经训练的声学模型。在语音唤醒的过程中用同样的方式通过骨导信号对气导信号直接地或隐式地进行丢头补偿,由于骨导信号包含声源输入的命令词信息,即骨导信号未丢头,因此本方案基于骨导信号进行唤醒词的检测的识别准确率较高,语音唤醒的准确率较高,且鲁棒性也得到提高。
图31是本申请实施例提供的一种语音唤醒的装置3100的结构示意图,该语音唤醒的装置3100可以由软件、硬件或者两者的结合实现成为电子设备的部分或者全部,该电子设备可以为图2所示的可穿戴设备。参见图31,该装置3100包括:语音检测模块3101、唤醒词检测模块3102和语音唤醒模块3103。
语音检测模块3101,用于根据骨导麦克采集的骨导信号进行语音检测,该骨导信号包含声源输入的命令词信息;
唤醒词检测模块3102,用于在检测到有语音输入的情况下,基于骨导信号进行唤醒词的检测;
语音唤醒模块3103,用于在检测到该命令词包括唤醒词时,对待唤醒设备进行语音唤醒。
可选地,唤醒词检测模块3102包括:
第一确定子模块,用于基于骨导信号确定融合信号;
唤醒词检测子模块,用于对该融合信号进行唤醒词的检测。
可选地,该装置3100还包括:
处理模块,用于开启空气麦克,通过空气麦克采集气导信号;
第一确定子模块用于:
将骨导信号的起始部分和气导信号进行融合,以得到融合信号,该骨导信号的起始部分根据语音检测的检测时延确定;或者,
基于骨导信号的起始部分生成增强起始信号,将增强起始信号和气导信号进行融合,以得到融合信号,该骨导信号的起始部分根据语音检测的检测时延确定;或者,
将骨导信号和气导信号直接进行融合,以得到融合信号。
可选地,唤醒词检测子模块用于:
将该融合信号包括的多个音频帧输入第一声学模型,以得到第一声学模型输出的多个后验概率向量,该多个后验概率向量与该多个音频帧一一对应,该多个后验概率向量中的第一后验概率向量用于指示该多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
基于该多个后验概率向量进行唤醒词的检测。
可选地,该装置3100还包括:
处理模块,用于开启空气麦克,通过空气麦克采集气导信号;
唤醒词检测模块3102包括:
第二确定子模块,用于基于骨导信号和气导信号,确定多个后验概率向量,该多个后验概率向量与骨导信号和气导信号包括的多个音频帧一一对应,该多个后验概率向量中的第一后验概率向量用于指示该多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
唤醒词检测子模块,用于基于该多个后验概率向量进行唤醒词的检测。
可选地,第二确定子模块用于:
将骨导信号的起始部分和气导信号输入第二声学模型,以得到第二声学模型输出的第一数量个骨导后验概率向量和第二数量个气导后验概率向量,该骨导信号的起始部分根据语音检测的检测时延确定,第一数量个骨导后验概率向量与骨导信号的起始部分所包括的音频帧一一对应,第二数量个气导后验概率向量与气导信号所包括的音频帧一一对应;
将第一骨导后验概率向量和第一气导后验概率向量进行融合,以得到第二后验概率向量,第一骨导后验概率向量对应骨导信号的起始部分的最后一个音频帧,该最后一个音频帧的时长小于帧时长,第一气导后验概率向量对应气导信号的第一个音频帧,该第一个音频帧的时长小于帧时长,该多个后验概率向量包括第二后验概率向量、第一数量个骨导后验概率向量中除第一骨导后验概率向量之外的向量,以及第二数量个气导后验概率向量中除第一气导后验概率向量之外的向量。
可选地,第二确定子模块用于:
将骨导信号的起始部分和气导信号输入第三声学模型,以得到第三声学模型输出的多个后验概率向量,该骨导信号的起始部分根据语音检测的检测时延确定;或者,
将骨导信号和气导信号输入第三声学模型,以得到第三声学模型输出的多个后验概率向量。
可选地,唤醒词检测子模块用于:
基于该多个后验概率向量和唤醒词对应的音素序列,确定该命令词对应的音素序列包括唤醒词对应的音素序列的置信度;
在该置信度超过置信度阈值的情况下,确定检测到该命令词包括唤醒词。
可选地,唤醒词检测子模块用于:
基于该多个后验概率向量和唤醒词对应的音素序列,确定该命令词对应的音素序列包括唤醒词对应的音素序列的置信度;
在该置信度超过置信度阈值,且该多个后验概率向量与多个模板向量之间满足距离条件的情况下,确定检测到该命令词包括唤醒词,该多个模板向量指示包含唤醒词的完整信息的语音信号的音素属于多个指定音素的概率。
可选地,在该多个后验概率向量与该多个模板向量一一对应的情况下,该距离条件包括:该多个后验概率向量与对应的模板向量之间的距离的均值小于距离阈值。
可选地,该装置3100还包括:
获取模块,用于获取骨导注册信号,该骨导注册信号包含唤醒词的完整信息;
确定模块,用于基于骨导注册信号和唤醒词对应的音素序列,确定置信度阈值和多个模板向量。
可选地,确定模块包括:
第三确定子模块,用于基于骨导注册信号确定融合注册信号;
第四确定子模块,用于基于该融合注册信号和唤醒词对应的音素序列,确定置信度阈值和多个模板向量。
可选地,第四确定子模块用于:
将该融合注册信号包括的多个注册音频帧输入第一声学模型,以得到第一声学模型输出的多个注册后验概率向量,该多个注册后验概率向量与该多个注册音频帧一一对应,该多个注册后验概率向量中的第一注册后验概率向量指示该多个注册音频帧中的第一注册音频帧的音素属于多个指定音素的概率;
将该多个注册后验概率向量确定为多个模板向量;
基于该多个注册后验概率向量和唤醒词对应的音素序列确定置信度阈值。
可选地,该装置3100还包括:
获取模块,用于获取气导注册信号;
确定模块包括:
第五确定子模块,用于基于骨导注册信号和气导注册信号,确定多个注册后验概率向量,该多个注册后验概率向量与骨导注册信号和气导注册信号包括的多个注册音频帧一一对应,该多个注册后验概率向量中的第一注册后验概率向量指示该多个注册音频帧中的第一注册音频帧的音素属于多个指定音素的概率;
第六确定子模块,用于基于该多个注册后验概率向量和唤醒词对应的音素序列确定置信度阈值。
在本申请实施例中,通过骨导麦克采集骨导信号进行语音检测,能够保证低功耗。另外,考虑到由于语音检测的延迟可能会导致采集的气导信号丢头,从而未包含声源输入的命令词的完整信息,而骨导麦克采集的骨导信号包含声源输入的命令词信息,即骨导信号未丢头,因此本方案基于骨导信号进行唤醒词的检测。这样,唤醒词的识别准确率较高,语音唤醒的准确度较高。
需要说明的是:上述实施例提供的语音唤醒的装置在进行语音唤醒时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的语音唤醒的装置与语音唤醒的方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络或其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(digital subscriber line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质,或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可 用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(digital versatile disc,DVD))或半导体介质(例如:固态硬盘(solid state disk,SSD))等。值得注意的是,本申请实施例提到的计算机可读存储介质可以为非易失性存储介质,换句话说,可以是非瞬时性存储介质。
应当理解的是,本文提及的“至少一个”是指一个或多个,“多个”是指两个或两个以上。在本申请实施例的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。
需要说明的是,本申请实施例所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。例如,本申请实施例中涉及到的语音信号都是在充分授权的情况下获取的。
以上所述为本申请提供的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (31)

  1. 一种语音唤醒的方法,其特征在于,所述方法包括:
    根据骨导麦克采集的骨导信号进行语音检测,所述骨导信号包含声源输入的命令词信息;
    在检测到有语音输入的情况下,基于所述骨导信号进行唤醒词的检测;
    在检测到所述命令词包括所述唤醒词时,对待唤醒设备进行语音唤醒。
  2. 如权利要求1所述的方法,其特征在于,所述基于所述骨导信号进行唤醒词的检测,包括:
    基于所述骨导信号确定融合信号;
    对所述融合信号进行所述唤醒词的检测。
  3. 如权利要求2所述的方法,其特征在于,所述基于所述骨导信号确定融合信号之前,还包括:开启空气麦克,通过所述空气麦克采集气导信号;
    所述基于所述骨导信号确定融合信号,包括:
    将所述骨导信号的起始部分和所述气导信号进行融合,以得到所述融合信号,所述骨导信号的起始部分根据所述语音检测的检测时延确定;或者,
    基于所述骨导信号的起始部分生成增强起始信号,将所述增强起始信号和所述气导信号进行融合,以得到所述融合信号,所述骨导信号的起始部分根据所述语音检测的检测时延确定;或者,
    将所述骨导信号和所述气导信号直接进行融合,以得到所述融合信号。
  4. 如权利要求2或3所述的方法,其特征在于,所述对所述融合信号进行所述唤醒词的检测,包括:
    将所述融合信号包括的多个音频帧输入第一声学模型,以得到所述第一声学模型输出的多个后验概率向量,所述多个后验概率向量与所述多个音频帧一一对应,所述多个后验概率向量中的第一后验概率向量用于指示所述多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
    基于所述多个后验概率向量进行所述唤醒词的检测。
  5. 如权利要求1所述的方法,其特征在于,所述基于所述骨导信号进行唤醒词的检测之前,还包括:开启空气麦克,通过所述空气麦克采集气导信号;
    所述基于所述骨导信号进行唤醒词的检测,包括:
    基于所述骨导信号和所述气导信号,确定多个后验概率向量,所述多个后验概率向量与所述骨导信号和所述气导信号包括的多个音频帧一一对应,所述多个后验概率向量中的第一后验概率向量用于指示所述多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
    基于所述多个后验概率向量进行所述唤醒词的检测。
  6. 如权利要求5所述的方法,其特征在于,所述基于所述骨导信号和所述气导信号,确定多个后验概率向量,包括:
    将所述骨导信号的起始部分和所述气导信号输入第二声学模型,以得到所述第二声学模型输出的第一数量个骨导后验概率向量和第二数量个气导后验概率向量,所述骨导信号的起始部分根据所述语音检测的检测时延确定,所述第一数量个骨导后验概率向量与所述骨导信号的起始部分所包括的音频帧一一对应,所述第二数量个气导后验概率向量与所述气导信号所包括的音频帧一一对应;
    将第一骨导后验概率向量和第一气导后验概率向量进行融合,以得到第二后验概率向量,所述第一骨导后验概率向量对应所述骨导信号的起始部分的最后一个音频帧,所述最后一个音频帧的时长小于帧时长,所述第一气导后验概率向量对应所述气导信号的第一个音频帧,所述第一个音频帧的时长小于所述帧时长,所述多个后验概率向量包括所述第二后验概率向量、所述第一数量个骨导后验概率向量中除所述第一骨导后验概率向量之外的向量,以及所述第二数量个气导后验概率向量中除所述第一气导后验概率向量之外的向量。
  7. 如权利要求5所述的方法,其特征在于,所述基于所述骨导信号和所述气导信号,确定多个后验概率向量,包括:
    将所述骨导信号的起始部分和所述气导信号输入第三声学模型,以得到所述第三声学模型输出的所述多个后验概率向量,所述骨导信号的起始部分根据所述语音检测的检测时延确定;或者,
    将所述骨导信号和所述气导信号输入所述第三声学模型,以得到所述第三声学模型输出的所述多个后验概率向量。
  8. 如权利要求4-7任一所述的方法,其特征在于,所述基于所述多个后验概率向量进行所述唤醒词的检测,包括:
    基于所述多个后验概率向量和所述唤醒词对应的音素序列,确定所述命令词对应的音素序列包括所述唤醒词对应的音素序列的置信度;
    在所述置信度超过置信度阈值的情况下,确定检测到所述命令词包括所述唤醒词。
  9. 如权利要求4-7任一所述的方法,其特征在于,所述基于所述多个后验概率向量进行所述唤醒词的检测,包括:
    基于所述多个后验概率向量和所述唤醒词对应的音素序列,确定所述命令词对应的音素序列包括所述唤醒词对应的音素序列的置信度;
    在所述置信度超过置信度阈值,且所述多个后验概率向量与多个模板向量之间满足距离条件的情况下,确定检测到所述命令词包括所述唤醒词,所述多个模板向量指示包含所述唤醒词的完整信息的语音信号的音素属于所述多个指定音素的概率。
  10. 如权利要求9所述的方法,其特征在于,在所述多个后验概率向量与所述多个模板向量一一对应的情况下,所述距离条件包括:所述多个后验概率向量与对应的模板向量之间的距离的均值小于距离阈值。
  11. 如权利要求9或10所述的方法,其特征在于,所述方法还包括:
    获取骨导注册信号,所述骨导注册信号包含所述唤醒词的完整信息;
    基于所述骨导注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值和所述多个模板向量。
  12. 如权利要求11所述的方法,其特征在于,所述基于所述骨导注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值和所述多个模板向量,包括:
    基于所述骨导注册信号确定融合注册信号;
    基于所述融合注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值和所述多个模板向量。
  13. 如权利要求12所述的方法,其特征在于,所述基于所述融合注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值和所述多个模板向量,包括:
    将所述融合注册信号包括的多个注册音频帧输入第一声学模型,以得到所述第一声学模型输出的多个注册后验概率向量,所述多个注册后验概率向量与所述多个注册音频帧一一对应,所述多个注册后验概率向量中的第一注册后验概率向量指示所述多个注册音频帧中的第一注册音频帧的音素属于所述多个指定音素的概率;
    将所述多个注册后验概率向量确定为所述多个模板向量;
    基于所述多个注册后验概率向量和所述唤醒词对应的音素序列确定所述置信度阈值。
  14. 如权利要求11所述的方法,其特征在于,所述基于所述骨导注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值之前,还包括:获取气导注册信号;
    所述基于所述骨导注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值,包括:
    基于所述骨导注册信号和所述气导注册信号,确定多个注册后验概率向量,所述多个注册后验概率向量与所述骨导注册信号和所述气导注册信号包括的多个注册音频帧一一对应,所述多个注册后验概率向量中的第一注册后验概率向量指示所述多个注册音频帧中的第一注册音频帧的音素属于所述多个指定音素的概率;
    基于所述多个注册后验概率向量和所述唤醒词对应的音素序列确定所述置信度阈值。
  15. 一种语音唤醒的装置,其特征在于,所述装置包括:
    语音检测模块,用于根据骨导麦克采集的骨导信号进行语音检测,所述骨导信号包含声源输入的命令词信息;
    唤醒词检测模块,用于在检测到有语音输入的情况下,基于所述骨导信号进行唤醒词的检测;
    语音唤醒模块,用于在检测到所述命令词包括所述唤醒词时,对待唤醒设备进行语音唤醒。
  16. 如权利要求15所述的装置,其特征在于,所述唤醒词检测模块包括:
    第一确定子模块,用于基于所述骨导信号确定融合信号;
    唤醒词检测子模块,用于对所述融合信号进行所述唤醒词的检测。
  17. 如权利要求16所述的装置,其特征在于,所述装置还包括:
    处理模块,用于开启空气麦克,通过所述空气麦克采集气导信号;
    所述第一确定子模块用于:
    将所述骨导信号的起始部分和所述气导信号进行融合,以得到所述融合信号,所述骨导信号的起始部分根据所述语音检测的检测时延确定;或者,
    基于所述骨导信号的起始部分生成增强起始信号,将所述增强起始信号和所述气导信号进行融合,以得到所述融合信号,所述骨导信号的起始部分根据所述语音检测的检测时延确定;或者,
    将所述骨导信号和所述气导信号直接进行融合,以得到所述融合信号。
  18. 如权利要求16或17所述的装置,其特征在于,所述唤醒词检测子模块用于:
    将所述融合信号包括的多个音频帧输入第一声学模型,以得到所述第一声学模型输出的多个后验概率向量,所述多个后验概率向量与所述多个音频帧一一对应,所述多个后验概率向量中的第一后验概率向量用于指示所述多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
    基于所述多个后验概率向量进行所述唤醒词的检测。
  19. 如权利要求15所述的装置,其特征在于,所述装置还包括:
    处理模块,用于开启空气麦克,通过所述空气麦克采集气导信号;
    所述唤醒词检测模块包括:
    第二确定子模块,用于基于所述骨导信号和所述气导信号,确定多个后验概率向量,所述多个后验概率向量与所述骨导信号和所述气导信号包括的多个音频帧一一对应,所述多个后验概率向量中的第一后验概率向量用于指示所述多个音频帧中的第一音频帧的音素属于多个指定音素的概率;
    唤醒词检测子模块,用于基于所述多个后验概率向量进行所述唤醒词的检测。
  20. 如权利要求19所述的装置,其特征在于,所述第二确定子模块用于:
    将所述骨导信号的起始部分和所述气导信号输入第二声学模型,以得到所述第二声学模型输出的第一数量个骨导后验概率向量和第二数量个气导后验概率向量,所述骨导信号的起始部分根据所述语音检测的检测时延确定,所述第一数量个骨导后验概率向量与所述骨导信号的起始部分所包括的音频帧一一对应,所述第二数量个气导后验概率向量与所述气导信号所包括的音频帧一一对应;
    将第一骨导后验概率向量和第一气导后验概率向量进行融合,以得到第二后验概率向量,所述第一骨导后验概率向量对应所述骨导信号的起始部分的最后一个音频帧,所述最后一个音频帧的时长小于帧时长,所述第一气导后验概率向量对应所述气导信号的第一个音频帧,所述第一个音频帧的时长小于所述帧时长,所述多个后验概率向量包括所述第二后验概率向 量、所述第一数量个骨导后验概率向量中除所述第一骨导后验概率向量之外的向量,以及所述第二数量个气导后验概率向量中除所述第一气导后验概率向量之外的向量。
  21. 如权利要求19所述的装置,其特征在于,所述第二确定子模块用于:
    将所述骨导信号的起始部分和所述气导信号输入第三声学模型,以得到所述第三声学模型输出的所述多个后验概率向量,所述骨导信号的起始部分根据所述语音检测的检测时延确定;或者,
    将所述骨导信号和所述气导信号输入所述第三声学模型,以得到所述第三声学模型输出的所述多个后验概率向量。
  22. 如权利要求18-21任一所述的装置,其特征在于,所述唤醒词检测子模块用于:
    基于所述多个后验概率向量和所述唤醒词对应的音素序列,确定所述命令词对应的音素序列包括所述唤醒词对应的音素序列的置信度;
    在所述置信度超过置信度阈值的情况下,确定检测到所述命令词包括所述唤醒词。
  23. 如权利要求18-21任一所述的装置,其特征在于,所述唤醒词检测子模块用于:
    基于所述多个后验概率向量和所述唤醒词对应的音素序列,确定所述命令词对应的音素序列包括所述唤醒词对应的音素序列的置信度;
    在所述置信度超过置信度阈值,且所述多个后验概率向量与多个模板向量之间满足距离条件的情况下,确定检测到所述命令词包括所述唤醒词,所述多个模板向量指示包含所述唤醒词的完整信息的语音信号的音素属于所述多个指定音素的概率。
  24. 如权利要求23所述的装置,其特征在于,在所述多个后验概率向量与所述多个模板向量一一对应的情况下,所述距离条件包括:所述多个后验概率向量与对应的模板向量之间的距离的均值小于距离阈值。
  25. 如权利要求23或24所述的装置,其特征在于,所述装置还包括:
    获取模块,用于获取骨导注册信号,所述骨导注册信号包含所述唤醒词的完整信息;
    确定模块,用于基于所述骨导注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值和所述多个模板向量。
  26. 如权利要求25所述的装置,其特征在于,所述确定模块包括:
    第三确定子模块,用于基于所述骨导注册信号确定融合注册信号;
    第四确定子模块,用于基于所述融合注册信号和所述唤醒词对应的音素序列,确定所述置信度阈值和所述多个模板向量。
  27. 如权利要求26所述的装置,其特征在于,所述第四确定子模块用于:
    将所述融合注册信号包括的多个注册音频帧输入第一声学模型,以得到所述第一声学模型输出的多个注册后验概率向量,所述多个注册后验概率向量与所述多个注册音频帧一一对 应,所述多个注册后验概率向量中的第一注册后验概率向量指示所述多个注册音频帧中的第一注册音频帧的音素属于所述多个指定音素的概率;
    将所述多个注册后验概率向量确定为所述多个模板向量;
    基于所述多个注册后验概率向量和所述唤醒词对应的音素序列确定所述置信度阈值。
  28. 如权利要求25所述的装置,其特征在于,所述装置还包括:
    获取模块,用于获取气导注册信号;
    所述确定模块包括:
    第五确定子模块,用于基于所述骨导注册信号和所述气导注册信号,确定多个注册后验概率向量,所述多个注册后验概率向量与所述骨导注册信号和所述气导注册信号包括的多个注册音频帧一一对应,所述多个注册后验概率向量中的第一注册后验概率向量指示所述多个注册音频帧中的第一注册音频帧的音素属于所述多个指定音素的概率;
    第六确定子模块,用于基于所述多个注册后验概率向量和所述唤醒词对应的音素序列确定所述置信度阈值。
  29. 一种电子设备,其特征在于,所述电子设备包括:存储器和处理器;
    所述存储器,用于存储计算机程序;
    所述处理器,用于执行所述计算机程序,以实现权利要求1-14任一所述的方法的步骤。
  30. 一种计算机可读存储介质,其特征在于,所述存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现权利要求1-14任一所述的方法的步骤。
  31. 一种计算机程序产品,包括计算机指令,其特征在于,所述计算机指令被处理器执行时实现权利要求1-14任一所述的方法的步骤。
