WO2021062705A1 - A robust single-channel method for real-time speech keyword detection - Google Patents

A robust single-channel method for real-time speech keyword detection

Info

Publication number
WO2021062705A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
keyword
speech
real
time
Prior art date
Application number
PCT/CN2019/109603
Other languages
English (en)
French (fr)
Inventor
胡鹏
闫永杰
Original Assignee
大象声科(深圳)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大象声科(深圳)科技有限公司
Priority to PCT/CN2019/109603
Publication of WO2021062705A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Definitions

  • The invention relates to the technical field of noise reduction for electronic communication and, more specifically, to a robust single-channel method for real-time speech keyword detection.
  • The technical problem to be solved by the present invention is how to use a robust single-channel real-time speech keyword detection method to overcome the poor noise robustness and high false alarm rate of prior-art methods.
  • The technical solution adopted by the present invention is a robust single-channel real-time speech keyword detection method that performs keyword detection on speech in electronic format captured by a single microphone. Compared with keyword detection using a beamforming microphone array, the method maintains a high wake-up rate without using spatial location information and needs only one microphone, so it suits a wider range of application scenarios.
  • The method uses supervised learning for keyword detection and achieves robustness in noisy scenes by combining two training targets, noise reduction and keyword detection. It maintains a high wake-up rate in noisy environments, is more widely applicable, and greatly reduces the false alarm rate of the neural network.
  • The robust single-channel real-time speech keyword detection method includes the following steps:
  • S1: receive a noisy speech signal in electronic format, containing human speech and non-speech background noise;
  • S2: convert the time-domain noisy speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform;
  • S3: process the frequency-domain signal with a Mel filter to obtain Mel features, which are used as acoustic features;
  • S4: the neural network comprises a convolutional neural network, a unidirectional long short-term memory (LSTM) recurrent neural network and a feedforward deep neural network; the Mel features pass through these three networks frame by frame and are then processed by the normalized exponential function to obtain confidence information for each keyword;
  • S5: when the confidence of any keyword exceeds a predefined threshold, the current frame is spliced with several preceding frames and used as the output signal of the neural network;
  • S6: the output signal of the neural network passes through the attention mechanism and the feedforward deep neural network in turn and is processed by the normalized exponential function to obtain sentence-level confidence information for each keyword.
  • When the confidence value exceeds the predefined threshold, the keyword is considered detected; otherwise no keyword is considered detected.
  • The Mel feature is formed by splicing the Mel feature of the current frame with those of several future frames.
  • The unidirectional LSTM recurrent neural network includes multiple stacked unidirectional layers, each with 64 neurons.
  • The neural network is trained on a large noisy data set, in which the noisy speech is a mixture of multiple noises and the speech of multiple speakers;
  • the noisy speech is a mixture of thousands of different types of noise and the speech of more than 500 speakers.
  • The convolutional neural network is formed by stacking several single convolutional layers;
  • the single convolutional layers of the convolutional neural network are connected through activation function layers.
  • The feedforward deep neural network is formed by stacking multiple single linear layers;
  • the linear layers of the feedforward deep neural network are connected through activation function layers.
  • The attention mechanism is a soft attention mechanism.
  • The input of the attention mechanism comes from the output of the convolutional neural network layers.
  • The input of the attention mechanism is obtained by combining the convolutional-layer outputs for the current frame and several past frames.
  • The size of the neural network's output vector is the number of keywords used in training plus one.
  • The beneficial effect is that the present invention provides a robust single-channel method for real-time speech keyword detection.
  • The method performs well and can effectively reduce call noise in close-range conversation scenarios. Compared with the prior art, it is more practical and does not depend on the noise or the speaker.
  • Fig. 1 is a schematic block diagram of the robust single-channel real-time speech keyword detection method of the present invention;
  • Fig. 2 is a table mapping sample keywords to their serial numbers for the method of the present invention;
  • Fig. 3 is a graph of the downward trend of the cross-entropy loss in the method of the present invention;
  • Fig. 4 shows the trend of the mean square error during an actual training run of the method of the present invention;
  • Fig. 5 is a schematic diagram of how the soft attention mechanism of the method of the present invention is implemented;
  • Fig. 6 is a flow chart of the robust single-channel real-time speech keyword detection method of the present invention.
  • As shown in Fig. 1, a robust single-channel method for real-time speech keyword detection includes the following steps. S1: receive a noisy speech signal in electronic format, containing human speech and non-speech background noise. S2: convert the time-domain noisy speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform. S3: process the frequency-domain signal with a Mel filter to obtain Mel features, which are used as acoustic features. S4: the neural network comprises a convolutional neural network, a unidirectional LSTM recurrent neural network and a feedforward deep neural network; the Mel features pass through these networks frame by frame and are then processed by the normalized exponential function (softmax) to obtain confidence information for each keyword. S5: when the confidence of any keyword exceeds a predefined threshold, the current frame is spliced with several preceding frames and used as the output signal of the neural network. S6: the output signal of the neural network passes through the attention mechanism and the feedforward deep neural network in turn and is processed by the normalized exponential function to obtain sentence-level confidence information for each keyword; when the confidence value exceeds the predefined threshold, the keyword is considered detected, and otherwise no keyword is considered detected.
  • Confidence is also called reliability, confidence level or confidence coefficient. When a population parameter is estimated from a sample, the conclusion is always uncertain because the sample is random. A probabilistic statement is therefore used, namely interval estimation in mathematical statistics: the probability that the estimate lies within a given allowable error of the population parameter is called the confidence.
  • The short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sinusoidal components in a local region of a time-varying signal.
  • The normalized exponential function, or softmax function, is a generalization of the logistic function used in mathematics, especially in probability theory and related fields. It "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) such that every element lies in the range (0,1) and all elements sum to 1. The function is mostly used in multi-class classification problems.
  • In step S2, the time-domain digital signal is converted into a frequency-domain signal by the fast Fourier transform (FFT). After the conversion, the frequency components of the signal are easy to analyze, and processing in the frequency domain makes it possible to implement many signal processing algorithms that cannot be carried out in the time domain.
  • FFT: fast Fourier transform.
  • During training, the noisy speech in step S1 is a mixture of clean speech and noise at different signal-to-noise ratios;
  • real recorded speech is used during inference.
  • For the feature extraction in step S2, the noisy speech is framed and windowed; each frame is 20 milliseconds long, and adjacent frames overlap by 10 milliseconds.
  • The fast Fourier transform is used to extract the spectral magnitude vector of each frame, which is then filtered with a Mel filter to obtain the acoustic features of each frame. The speech signal is strongly correlated along the time dimension, and this correlation greatly helps the keyword detection task.
  • To exploit this correlation, the method splices the current frame with several preceding and following frames into a higher-dimensional vector that is used as the input feature.
  • The method is executed by a computer program that extracts acoustic features from noisy speech and processes them with a deep neural network to determine whether the original noisy speech contains a keyword.
  • The method comprises one or more program modules, which may be executed by any system or hardware device with executable computer programming instructions.
  • The present invention does not splice historical frames into the input; the main reason is that it uses an LSTM recurrent neural network as a component of the network.
  • The LSTM recurrent neural network retains part of the important historical input information. Historical frames therefore need not be spliced and only future frames are relied on, which reduces the computation of the neural network and the power consumption of the hardware it runs on.
  • The present invention supports one or more keywords. Different numbers of wake-up words correspond to different network output dimensions.
  • The network output includes a first-level, frame-level output and a second-level, sentence-level output.
  • The network output dimension equals the number of keywords plus one. For example, with one keyword the output dimension is 2, with two keywords it is 3, and so on.
  • Before training, each command word should be numbered. For example, the six keywords "turn on the light", "turn on the TV", "turn on the air conditioner", "turn off the light", "turn off the TV" and "turn off the air conditioner" can be numbered as shown in Fig. 2.
  • Non-keywords are usually numbered 0, and the keywords are numbered sequentially from 1; these numbers serve as the labels of the speech used for training.
  • During training, the labels should be one-hot encoded, and the encoded result is combined with the neural network output to compute the cross-entropy loss.
  • Cross entropy is an important concept in Shannon's information theory, used mainly to measure the difference between two probability distributions.
  • The performance of a language model is usually measured by cross entropy and perplexity.
  • Cross entropy expresses how difficult it is for the model to recognize a text or, from a compression point of view, how many bits are needed on average to encode each word.
  • Perplexity expresses the average number of branches the model assigns to the text; its reciprocal can be regarded as the average probability of each word.
  • Generalization ability refers to how well the method performs in scenarios that were not part of training.
  • To generalize, the method of the present invention mixes clean human speech with noise collected from about 10,000 different scenes to produce noisy speech at different signal-to-noise ratios (SNRs) and loudness levels, and then addresses generalization through large-scale training. Because the recurrent neural network can model long-term dependencies in the signal, the proposed model generalizes well to new noise and speaker scenarios, which is very important for practical applications.
  • To obtain better performance, the present invention uses an RNN model that relies on future frames, so that the network can obtain information from both the past and the future.
  • In Fig. 6, the input is a noisy speech signal and the output is the label of the keyword.
  • "1" in the figure marks steps involved during training,
  • "2" marks steps of the inference or prediction phase, and
  • "3" marks steps shared by training and prediction.
  • IRM: ideal ratio mask, used as an auxiliary training target and obtained by comparing the Mel features of noisy speech with those of clean speech.
  • The present invention uses a 7-layer deconvolutional neural network (De-CNN) that mirrors the 7-layer convolutional neural network (CNN) to estimate the ideal ratio mask of each noisy input, and then computes the mean square error (MSE) between the ideal ratio mask and the estimated ratio mask.
  • The loss function of the neural network includes both the cross-entropy loss and the mean square error.
  • The neural network minimizes the loss function over the entire training set through repeated iterations. After the training phase, the prediction phase begins, in which the deconvolution part is not used at all. Training several targets at the same time in this way is generally called joint training.
  • The output of the deconvolutional neural network part is the predicted ideal ratio mask.
  • The ideal ratio mask and the Mel features of noisy speech can together be used to restore the Mel features of clean speech. Adding the predicted ideal ratio mask as a training target therefore lets the convolutional neural network part retain more clean-speech features and filter out noise features unrelated to keyword detection. This gives the whole neural network better noise reduction performance, so the present invention remains robust and still achieves high accuracy in strongly noisy environments.
  • MSE: mean square error, a measure of the degree of difference between an estimator and the quantity being estimated.
  • Robustness means that the wake-up system can still be woken up correctly and keep a low false wake-up rate under noise interference or changes in the target voice.
  • The mean square error (MSE) falls sharply from roughly iteration 1200 to roughly iteration 1400.
  • Over the same interval the cross-entropy loss also falls quickly: as the noise reduction task improves, the keyword detection task improves as well, which indicates that the two tasks reinforce each other.
  • The training process of the present invention is divided into two stages.
  • The first stage is the joint training of the mean square error (MSE) and the cross-entropy loss described above.
  • The second stage mainly trains sentence-level keyword detection.
  • The 7-layer convolutional neural network trained in the first stage already reduces noise well, so in the second stage its weights are frozen and are not updated.
  • In the second stage, the current frame is spliced with 179 historical frames, 180 frames in total.
  • The convolutional neural network outputs for these frames are used as the input of the attention mechanism. The 180 frames correspond to 1.8 seconds of speech.
  • The attention mechanism originates from the study of human vision. In cognitive science, because of the bottleneck in information processing, humans selectively attend to part of the available information and ignore the rest. Here the attention mechanism extracts the information relevant to keyword detection from a longer time series and ignores other, useless information. A second-level wake-up with an attention mechanism effectively avoids the poor performance caused by the high false alarm rate of a first-level wake-up alone.
  • The attention mechanism used in the present invention is a soft attention mechanism.
  • T is the length of the time sequence used by the attention mechanism,
  • h_t is the input of the attention mechanism at time t,
  • α_t is the weight corresponding to h_t at time t, and
  • c_t is the output of the attention mechanism over the previous T frames.
  • The attention mechanism first computes the hidden-state output e_t for each of the previous T frames, then applies the normalized exponential function to these values to obtain the weight of each frame.
  • Finally, the weighted sum of the inputs of all frames in the time sequence is the output of the attention mechanism.
  • The Mel feature is formed by concatenating the Mel feature of the current frame with those of several future frames.
  • The fast Fourier transform is used to extract the spectral magnitude vector of each frame, which is then filtered with a Mel filter to obtain the acoustic feature of each frame, that is, the Mel feature.
  • The unidirectional LSTM recurrent neural network includes multiple stacked unidirectional layers, each with 64 neurons.
  • The neural network is trained on a large noisy data set, in which the noisy speech is a mixture of multiple noises and the speech of multiple speakers;
  • the noisy speech is a mixture of thousands of different types of noise and the speech of more than 500 speakers.
  • The convolutional neural network is formed by stacking several single convolutional layers; these layers are connected through activation function layers.
  • A convolutional neural network is a deep feedforward artificial neural network whose artificial neurons respond to surrounding units.
  • The convolutional neural network includes convolutional layers and pooling layers.
  • The feedforward deep neural network is formed by stacking multiple single linear layers; these layers are connected through activation function layers.
  • A feedforward deep neural network (DNN) is a feedforward neural network with at least one hidden layer that uses activation functions to introduce non-linearity, uses cross entropy as the loss function, and learns by a back-propagation optimization algorithm (stochastic gradient descent or batch gradient descent) that adjusts and updates the weights between neurons.
  • The attention mechanism is a soft attention mechanism.
  • The input of the attention mechanism comes from the output of the convolutional neural network layers.
  • The input of the attention mechanism is obtained by combining the convolutional-layer outputs for the current frame and several past frames.
  • The attention mechanism extends the capabilities of neural networks and allows them to approximate more complex functions. Put more intuitively, it can focus on specific parts of the input. It has helped improve benchmark performance in natural language processing and has enabled new capabilities such as image captioning, memory network addressing and neural programmers.
  • The size of the neural network's output vector is the number of keywords used in training plus one.
  • Robust single-channel keyword detection in the present invention operates on the signal captured by a single microphone. Compared with keyword detection using a beamforming microphone array, single-channel keyword detection is more widely applicable.
  • The present invention uses supervised learning for keyword detection and detects keywords with a convolutional neural network, an LSTM recurrent neural network and a feedforward deep neural network.
  • The present invention uses a sentence-level attention mechanism and a feedforward deep neural network as the second-level neural network. After the output confidence of the first level, which consists of the convolutional neural network, the LSTM recurrent neural network and the feedforward deep neural network, exceeds the threshold, the second-level network is used for confirmation.
  • LSTM: long short-term memory recurrent neural network.
  • RNN: recurrent neural network.
  • The robustness of the wake-up word detection of the present invention shows in its ability to keep a high wake-up rate in noisy environments: under noisy conditions with a signal-to-noise ratio of 0 dB it maintains a correct wake-up rate above 90%.
  • The signal-to-noise ratio (SNR or S/N) is the ratio of signal to noise in an electronic device or electronic system.
  • SNR: signal-to-noise ratio.
  • The signal here is the electronic signal from outside the device that the device needs to process;
  • the noise is the irregular extra signal (or information) that is not present in the original signal and is generated after passing through the device. This noise does not change as the original signal changes.
  • Robust single-channel keyword detection means keyword detection on speech in electronic format captured by a single microphone.
  • The present invention maintains a high wake-up rate without using spatial location information and needs only one microphone, so it suits a wider range of application scenarios.
  • The invention uses supervised learning for keyword detection and achieves robustness in noisy scenes by combining two training targets, noise reduction and keyword detection.
  • The present invention introduces a second-level network to address the excessively high false alarm rate of keyword detection; the attention mechanism in the second-level network can extract keyword-related information from a longer time series.
  • During inference, the second-level network runs only after the output of the first-level network exceeds the threshold, which saves part of the computation cost.
  • The false alarm rate, also called the false alarm probability, is the probability that, when threshold detection is used in radar detection, a target is declared present although none actually exists.
  • The present invention provides a robust single-channel method for real-time speech keyword detection that works on the signal captured by a single microphone and, compared with keyword detection using a beamforming microphone array, is more widely applicable. It uses a second-level network for confirmation which, at little cost to performance, greatly reduces the false alarm rate of the neural network while still maintaining a high wake-up rate in noisy environments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)

Abstract

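A robust single-channel method for real-time speech keyword detection comprises the following steps: receiving noisy speech in electronic format; converting the time-domain speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform; processing the frequency-domain signal with a Mel filter and using the resulting Mel features as acoustic features; passing the Mel features frame by frame through a neural network and then through a normalized exponential function to obtain confidence information for each keyword; when the confidence of any keyword exceeds a predefined threshold, splicing the current frame with several preceding frames as the output of the neural network, passing it in turn through an attention mechanism and a feedforward deep neural network, and applying the normalized exponential function to obtain sentence-level confidence information for each keyword; and, when that confidence value exceeds a predefined threshold, deciding that a keyword has been detected, and otherwise that no keyword has been detected. The method maintains a high wake-up rate in noisy environments, is more widely applicable, greatly reduces the false alarm rate of the neural network, and improves keyword detection performance.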

Description

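A robust single-channel method for real-time speech keyword detection
Technical Field
The present invention relates to the technical field of noise reduction for electronic communication and, more specifically, to a robust single-channel method for real-time speech keyword detection.
Background Art
With the rise of applications such as smart assistants and smart speakers, speech keyword detection, an important link in human-computer interaction, is receiving increasing attention from industry. Filler models based on hidden Markov models were the earliest approach applied to keyword detection. Robust single-channel keyword detection is a very challenging topic, because it relies only on the recording from one microphone and cannot exploit the spatial information commonly used with microphone arrays. On the other hand, compared with microphone-array keyword detection based on beamforming (spatial filtering through a suitably configured sensor array), robust single-microphone keyword detection applies to a wider range of acoustic scenarios. Because only one microphone is used, robust single-channel keyword detection is both cheaper and more convenient to use. A recent major breakthrough in keyword detection is the use of deep neural networks in place of hidden Markov models; this approach has a small memory footprint, requires no decoding search, and is highly accurate. The previous state-of-the-art method used a feedforward deep neural network (DNN) trained on a large amount of data with frame-level labels. Although that method achieves keyword detection, its robustness to noise is poor. This can be improved by adding noise of different types and different signal-to-noise ratios to the input speech during training, but the false alarm rate remains high.
Existing solutions have the following drawbacks:
1. Although they achieve keyword detection, their robustness to noise is poor;
2. Their false alarm rate is high.
Summary of the Invention
The technical problem to be solved by the present invention is how to use a robust single-channel real-time speech keyword detection method to overcome the poor noise robustness and high false alarm rate of prior-art methods.
The technical solution adopted by the present invention is a robust single-channel real-time speech keyword detection method that performs keyword detection on speech in electronic format captured by a single microphone. Compared with keyword detection using a beamforming microphone array, the method maintains a high wake-up rate without using spatial location information and needs only one microphone, so it suits a wider range of application scenarios. The method uses supervised learning for keyword detection and achieves robustness in noisy scenes by combining two training targets, noise reduction and keyword detection. It performs well, maintains a high wake-up rate in noisy environments, is more widely applicable, and greatly reduces the false alarm rate of the neural network.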
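The robust single-channel real-time speech keyword detection method of the present invention comprises the following steps:
S1: receive a noisy speech signal in electronic format, containing human speech and non-speech background noise;
S2: convert the time-domain noisy speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform;
S3: process the frequency-domain signal with a Mel filter to obtain Mel features, which are used as acoustic features;
S4: the neural network comprises a convolutional neural network, a unidirectional long short-term memory (LSTM) recurrent neural network and a feedforward deep neural network;
the Mel features pass frame by frame through the convolutional neural network, the unidirectional LSTM recurrent neural network and the feedforward deep neural network, and are then processed by the normalized exponential function to obtain confidence information for each keyword;
S5: when the confidence of any keyword exceeds a predefined threshold, the current frame is spliced with several preceding frames and used as the output signal of the neural network;
S6: the output signal of the neural network passes through the attention mechanism and the feedforward deep neural network in turn and is processed by the normalized exponential function to obtain sentence-level confidence information for each keyword; when the confidence value exceeds the predefined threshold, the keyword is considered detected, and otherwise no keyword is considered detected.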
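The two-stage decision in steps S4 to S6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `first_stage` and `second_stage` are hypothetical callables standing in for the trained networks, both thresholds of 0.5 are assumed values, and the 180-frame history and the use of index 0 for the non-keyword class follow the numbers given later in the description.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


class TwoStageDetector:
    """Frame-level detector followed by a sentence-level confirmation stage."""

    def __init__(
        self,
        first_stage: Callable[[Vector], Tuple[Vector, Vector]],
        second_stage: Callable[[List[Vector]], Vector],
        history: int = 180,               # window length in frames
        frame_threshold: float = 0.5,     # assumed value
        sentence_threshold: float = 0.5,  # assumed value
    ) -> None:
        self.first_stage = first_stage
        self.second_stage = second_stage
        self.frame_threshold = frame_threshold
        self.sentence_threshold = sentence_threshold
        # Holds the current frame plus the most recent past frames (step S5).
        self.buffer = deque(maxlen=history)

    def push(self, mel_frame: Vector) -> Optional[int]:
        """Process one frame; return a keyword index (>= 1) or None."""
        frame_probs, stage_output = self.first_stage(mel_frame)
        self.buffer.append(stage_output)
        best = argmax(frame_probs)
        # Class 0 is assumed to mean "no keyword"; stage two runs only on a hit.
        if best == 0 or frame_probs[best] <= self.frame_threshold:
            return None
        sentence_probs = self.second_stage(list(self.buffer))
        confirmed = argmax(sentence_probs)
        if confirmed != 0 and sentence_probs[confirmed] > self.sentence_threshold:
            return confirmed
        return None
```

Because the second stage is called only when the frame-level confidence exceeds its threshold, most frames cost a single first-stage evaluation.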
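In the method of the present invention, the Mel feature is formed by splicing the Mel feature of the current frame with those of several future frames.
In the method of the present invention, the unidirectional LSTM recurrent neural network includes multiple stacked unidirectional layers, each with 64 neurons.
In the method of the present invention, the neural network is trained on a large noisy data set, in which the noisy speech is a mixture of multiple noises and the speech of multiple speakers;
the noisy speech is a mixture of thousands of different types of noise and the speech of more than 500 speakers.
In the method of the present invention, the convolutional neural network is formed by stacking several single convolutional layers;
the single convolutional layers of the convolutional neural network are connected through activation function layers.
In the method of the present invention, the feedforward deep neural network is formed by stacking multiple single linear layers;
the linear layers of the feedforward deep neural network are connected through activation function layers.
In the method of the present invention, the attention mechanism is a soft attention mechanism.
In the method of the present invention, the input of the attention mechanism comes from the output of the convolutional neural network layers.
In the method of the present invention, the input of the attention mechanism is obtained by combining the convolutional-layer outputs for the current frame and several past frames.
In the method of the present invention, the size of the neural network's output vector is the number of keywords used in training plus one.
The beneficial effect of the present invention according to the above scheme is that it provides a robust single-channel method for real-time speech keyword detection. The method performs well and can effectively reduce call noise in close-range conversation scenarios; compared with the prior art it is more practical and does not depend on the noise or the speaker.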
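Brief Description of the Drawings
The present invention is further described below with reference to the drawings and embodiments. In the drawings:
Fig. 1 is a schematic block diagram of the robust single-channel real-time speech keyword detection method of the present invention;
Fig. 2 is a table mapping sample keywords to their serial numbers for the method of the present invention;
Fig. 3 is a graph of the downward trend of the cross-entropy loss in the method of the present invention;
Fig. 4 shows the trend of the mean square error during an actual training run of the method of the present invention;
Fig. 5 is a schematic diagram of how the soft attention mechanism of the method of the present invention is implemented;
Fig. 6 is a flow chart of the robust single-channel real-time speech keyword detection method of the present invention.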
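Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the present invention and do not limit it.
As shown in Fig. 1, a robust single-channel method for real-time speech keyword detection includes the following steps. S1: receive a noisy speech signal in electronic format, containing human speech and non-speech background noise. S2: convert the time-domain noisy speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform. S3: process the frequency-domain signal with a Mel filter to obtain Mel features, which are used as acoustic features. S4: the neural network comprises a convolutional neural network, a unidirectional long short-term memory (LSTM) recurrent neural network and a feedforward deep neural network; the Mel features pass through these networks frame by frame and are then processed by the normalized exponential function (softmax) to obtain confidence information for each keyword. S5: when the confidence of any keyword exceeds a predefined threshold, the current frame is spliced with several preceding frames and used as the output signal of the neural network. S6: the output signal of the neural network passes through the attention mechanism and the feedforward deep neural network in turn and is processed by the normalized exponential function to obtain sentence-level confidence information for each keyword; when the confidence value exceeds the predefined threshold, the keyword is considered detected, and otherwise no keyword is considered detected.
Further, confidence is also called reliability, confidence level or confidence coefficient. When a population parameter is estimated from a sample, the conclusion is always uncertain because the sample is random. A probabilistic statement is therefore used, namely interval estimation in mathematical statistics: the probability that the estimate lies within a given allowable error of the population parameter is called the confidence.
Further, the short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sinusoidal components in a local region of a time-varying signal.
Further, the normalized exponential function, or softmax function, is a generalization of the logistic function used in mathematics, especially in probability theory and related fields. It "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) such that every element lies in the range (0,1) and all elements sum to 1. The function is mostly used in multi-class classification problems.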
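As a small illustration of the normalized exponential function, the sketch below computes σ(z)_i = exp(z_i) / Σ_j exp(z_j) in plain Python. Subtracting the maximum before exponentiating is a common numerical-stability measure added here; it is not something the description specifies.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Map K real numbers to K values in (0, 1) that sum to 1."""
    m = max(z)  # shift by the maximum so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

For a keyword detector with K keywords, `z` would hold K + 1 raw network outputs and the result would be read as per-class confidences.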
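Further, in step S2, the time-domain digital signal is converted into a frequency-domain signal by the fast Fourier transform (FFT). After the conversion, the frequency components of the signal are easy to analyze, and processing in the frequency domain makes it possible to implement many signal processing algorithms that cannot be carried out in the time domain.
Further still, the noisy speech in step S1 is, during training, a mixture of clean speech and noise at different signal-to-noise ratios, while real recorded speech is used during inference. For the feature extraction in step S2, the noisy speech is framed and windowed; each frame is 20 milliseconds long, and adjacent frames overlap by 10 milliseconds. The fast Fourier transform is used to extract the spectral magnitude vector of each frame, which is then filtered with a Mel filter to obtain the acoustic features of each frame. The speech signal is strongly correlated along the time dimension, and this correlation greatly helps the keyword detection task. To exploit it and improve keyword detection performance, the method splices the current frame with several preceding and following frames into a higher-dimensional vector that is used as the input feature. The method is executed by a computer program that extracts acoustic features from noisy speech and processes them with a deep neural network to determine whether the original noisy speech contains a keyword. The method comprises one or more program modules, which may be executed by any system or hardware device with executable computer programming instructions.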
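The feature extraction described above (20 ms frames, 10 ms overlap, spectral magnitude, Mel filtering) might look roughly like the following pure-Python sketch. Several details are assumptions that the description does not state: the 16 kHz sample rate, the Hann window, the 40 triangular Mel filters and the common 2595·log10(1 + f/700) Mel formula. A direct DFT stands in for the FFT to keep the example dependency-free; it gives the same magnitudes, only more slowly.

```python
import cmath
import math
from typing import List, Sequence


def frame_signal(x: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split x into overlapping frames (an incomplete tail is dropped)."""
    return [list(x[s:s + frame_len]) for s in range(0, len(x) - frame_len + 1, hop)]


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Hann-window the frame and return |DFT| for bins 0 .. n/2."""
    n = len(frame)
    windowed = [v * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, v in enumerate(frame)]
    return [abs(sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_bins: int, sample_rate: int) -> List[List[float]]:
    """Triangular filters spaced evenly on the Mel scale from 0 Hz to Nyquist."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def mel_features(x: Sequence[float], sample_rate: int = 16000,
                 n_mels: int = 40) -> List[List[float]]:
    """Return one Mel feature vector per 20 ms frame with a 10 ms hop."""
    frame_len = sample_rate * 20 // 1000
    hop = sample_rate * 10 // 1000
    bank = mel_filterbank(n_mels, frame_len // 2 + 1, sample_rate)
    feats = []
    for frame in frame_signal(x, frame_len, hop):
        mag = magnitude_spectrum(frame)
        feats.append([sum(w * a for w, a in zip(row, mag)) for row in bank])
    return feats
```

At the assumed 16 kHz rate each frame holds 320 samples and consecutive frames start 160 samples apart.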
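Further still, for applications that require real-time processing, such as noise reduction in mobile hearing aids, using information from future frames is unacceptable because it introduces delay. For keyword detection, a certain amount of delay is acceptable, so some real-time performance can be traded for better accuracy. Specifically, the current frame can be spliced with 10 future frames as the input of the present invention, which adds only 100 milliseconds of delay and improves keyword detection accuracy; the number of future frames can be increased to 20 to improve performance further. Note that the present invention does not splice historical frames into the input. The main reason is that it uses an LSTM recurrent neural network as a component of the network; the LSTM retains part of the important historical input information, so historical frames need not be spliced and only future frames are relied on, which reduces the computation of the neural network and the power consumption of the hardware it runs on.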
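A minimal sketch of the future-frame splicing follows. With a 10 ms frame shift, looking ahead `n_future` frames delays the output by `n_future` × 10 ms, which gives the 100 ms quoted for 10 frames. Repeating the last frame to pad the end of the signal is an assumption made for this example; the description does not say how the tail is handled.

```python
from typing import List, Sequence


def splice_future(frames: Sequence[Sequence[float]],
                  n_future: int = 10) -> List[List[float]]:
    """Concatenate each frame with its next n_future frames.

    The tail is padded by repeating the last frame (an assumption).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        vector: List[float] = []
        for k in range(n_future + 1):
            vector.extend(frames[min(t + k, last)])
        spliced.append(vector)
    return spliced
```

Each output vector has (`n_future` + 1) times the dimension of a single frame, so 40-dimensional Mel frames with 10 future frames would give 440-dimensional inputs.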
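Further still, the present invention supports one or more keywords, and different numbers of wake-up words correspond to different network output dimensions. The network output includes a first-level, frame-level output and a second-level, sentence-level output. The output dimension equals the number of keywords plus one: with one keyword the output dimension is 2, with two keywords it is 3, and so on. Before training, each command word should be numbered. For example, the six keywords "turn on the light", "turn on the TV", "turn on the air conditioner", "turn off the light", "turn off the TV" and "turn off the air conditioner" can be numbered as shown in Fig. 2. Note that, besides the keywords, non-keywords must also be numbered: non-keywords are usually numbered 0, and the keywords are numbered sequentially from 1; these numbers serve as the labels of the speech used for training. During training, the labels should be one-hot encoded, and the encoded result is combined with the neural network output to compute the cross-entropy loss.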
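The labelling scheme and loss can be illustrated with a short Python sketch. The six English keyword strings are translations of the example above, and numbering them in the order listed is an assumption, since the actual assignment is in Fig. 2, which is not reproduced here. The name `<non-keyword>` and the small constant inside the logarithm are also illustrative choices.

```python
import math
from typing import Dict, List, Sequence

KEYWORDS = [
    "turn on the light",
    "turn on the TV",
    "turn on the air conditioner",
    "turn off the light",
    "turn off the TV",
    "turn off the air conditioner",
]


def build_label_map(keywords: Sequence[str]) -> Dict[str, int]:
    """Non-keywords get label 0; keywords are numbered from 1 (assumed order)."""
    labels = {"<non-keyword>": 0}
    for index, keyword in enumerate(keywords, start=1):
        labels[keyword] = index
    return labels


def one_hot(index: int, size: int) -> List[float]:
    """One-hot vector of the given size with a 1 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cross_entropy(probs: Sequence[float], target: Sequence[float]) -> float:
    """Cross-entropy between softmax outputs and a one-hot target."""
    return -sum(t * math.log(p + 1e-12) for p, t in zip(probs, target))
```

With six keywords the label map has seven entries, matching the rule that the output dimension is the number of keywords plus one.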
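Further still, cross entropy is an important concept in Shannon's information theory, used mainly to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy expresses how difficult it is for the model to recognize a text or, from a compression point of view, how many bits are needed on average to encode each word. Perplexity expresses the average number of branches the model assigns to the text; its reciprocal can be regarded as the average probability of each word.
Further, generalization ability is very important for any supervised learning method. It refers to how well the method performs in scenarios that were not part of training. To generalize, the method of the present invention mixes clean human speech with noise collected from about 10,000 different scenes to produce noisy speech at different signal-to-noise ratios (SNRs) and loudness levels, and then addresses generalization through large-scale training. Because the recurrent neural network can model long-term dependencies in the signal, the proposed model generalizes well to new noise and speaker scenarios, which is very important for practical applications. Finally, to obtain better performance, the present invention uses an RNN model that relies on future frames, so that the network can obtain information from both the past and the future.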
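Mixing clean speech with noise at a chosen signal-to-noise ratio could be done along the following lines. This is an illustrative sketch only: the description does not give the mixing procedure, and defining SNR from the mean power of the two signals (with the noise scaled to hit the target) is the assumption made here.

```python
import math
from typing import List, Sequence


def mean_power(x: Sequence[float]) -> float:
    """Average squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Scale the noise so that speech power / noise power matches snr_db, then add."""
    target_ratio = 10.0 ** (snr_db / 10.0)
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 0 dB the scaled noise carries the same power as the speech, which is the condition under which the description reports a correct wake-up rate above 90%.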
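Fig. 6 illustrates the whole process of the present invention in detail, setting out the keyword detection procedure. The input is a noisy speech signal and the output is the label of the keyword. In the figure, "1" marks steps involved during training, "2" marks steps of the inference or prediction phase, and "3" marks steps shared by training and prediction. To keep the present invention robust in strongly noisy environments, the training phase uses the ideal ratio mask (IRM) as an auxiliary training target. The IRM is obtained by comparing the Mel features of noisy speech with those of clean speech. As shown in Fig. 1, the present invention uses a 7-layer deconvolutional neural network (De-CNN) that mirrors the 7-layer convolutional neural network (CNN) to estimate the ideal ratio mask of each noisy input, and then computes the mean square error (MSE) between the ideal ratio mask and the estimated ratio mask. The loss function of the neural network contains both the cross-entropy loss and the mean square error, and the neural network minimizes the loss function over the entire training set through repeated iterations. After the training phase, the prediction phase begins, in which the deconvolution part is not used at all. Training several targets at the same time in this way is generally called joint training. In the present invention, the output of the deconvolutional part is the predicted ideal ratio mask. The ideal ratio mask and the Mel features of noisy speech can together be used to restore the Mel features of clean speech. Adding the predicted ideal ratio mask as a training target therefore lets the convolutional neural network part retain more clean-speech features and filter out noise features unrelated to keyword detection. This gives the whole neural network better noise reduction performance, so the present invention remains robust and still achieves high accuracy in strongly noisy environments.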
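The auxiliary target and the joint loss can be illustrated as follows. The description says only that the IRM comes from comparing noisy and clean Mel features and that the loss contains both cross entropy and MSE. The clipped per-band ratio used for the mask and the equal weighting of the two terms (`mse_weight=1.0`) are therefore assumptions of this sketch.

```python
import math
from typing import List, Sequence


def ideal_ratio_mask(clean_mel: Sequence[float], noisy_mel: Sequence[float],
                     eps: float = 1e-8) -> List[float]:
    """Per-band ratio of clean to noisy Mel energy, clipped to [0, 1] (assumed form)."""
    return [min(1.0, max(0.0, c / (n + eps))) for c, n in zip(clean_mel, noisy_mel)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean square error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(keyword_probs: Sequence[float], label: int,
               predicted_mask: Sequence[float], target_mask: Sequence[float],
               mse_weight: float = 1.0) -> float:
    """Cross-entropy on the keyword output plus weighted MSE on the mask."""
    cross_entropy = -math.log(keyword_probs[label] + 1e-12)
    return cross_entropy + mse_weight * mse(predicted_mask, target_mask)
```

At prediction time only the keyword branch would be evaluated, since the mask estimate exists solely to shape the convolutional features during training.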
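Further still, the mean square error (MSE) is a measure of the degree of difference between an estimator and the quantity being estimated.
Further still, robustness means that the wake-up system can still be woken up correctly and keep a low false wake-up rate under noise interference or changes in the target voice.
As shown in Figs. 3 and 4, the mean square error (MSE) falls sharply from roughly iteration 1200 to roughly iteration 1400, and the cross-entropy loss falls quickly over the same interval. As the noise reduction task improves, the keyword detection task improves as well, which indicates that the two tasks reinforce each other.
Further, the training process of the present invention is divided into two stages. The first stage is the joint training of the mean square error (MSE) and the cross-entropy loss described above; the second stage mainly trains sentence-level keyword detection. The 7-layer convolutional neural network trained in the first stage already reduces noise well, so in the second stage its weights are frozen and are not updated. In the second stage, the current frame is spliced with 179 historical frames, 180 frames in total, and the convolutional neural network outputs for these frames are used as the input of the attention mechanism. The 180 frames correspond to 1.8 seconds of speech; since the vast majority of keyword utterances are shorter than 1.8 seconds, these 180 frames cover almost the entire wake-up word. The attention mechanism originates from the study of human vision. In cognitive science, because of the bottleneck in information processing, humans selectively attend to part of the available information and ignore the rest. Here the attention mechanism extracts the information relevant to keyword detection from a longer time series and ignores other, useless information. A second-level wake-up with an attention mechanism effectively avoids the poor performance caused by the high false alarm rate of a first-level wake-up alone.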
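Furthermore, the attention mechanism used in the present invention is a soft attention mechanism.
As shown in FIG. 5, the principle of soft attention can be described by the following formulas: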
$$e_t = \upsilon^{\mathsf{T}} \tanh(W h_t + b)$$
$$\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}$$
$$c_t = \sum_{j=1}^{T} \alpha_j h_j$$
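where T is the length of the time series used by the attention mechanism, h_t is the input of the attention mechanism at time t, α_t is the weight assigned to h_t, and c_t is the output of the attention mechanism over the preceding T frames. As the formulas show, the attention mechanism first computes the score e_t for each of the preceding T frames, then applies the normalized exponential function (softmax) to these scores to obtain the weight of each frame, and finally takes the weighted sum of the per-frame inputs over the whole time series as the output of the attention mechanism.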
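The three steps just described (score, softmax, weighted sum) can be sketched in NumPy as follows. The array shapes and the function name `soft_attention` are assumptions chosen for illustration; the parameters W, b and the vector υ (written `v` in the code) would be learned during the second training stage.

```python
import numpy as np


def soft_attention(h, W, b, v):
    """Soft attention over T frames.

    h: (T, D) attention inputs, one row per frame
    W: (A, D) projection matrix, b: (A,) bias, v: (A,) scoring vector
    Returns the context vector c of shape (D,) and the weights alpha of shape (T,).
    """
    # e_t = v^T tanh(W h_t + b), computed for all frames at once.
    e = np.tanh(h @ W.T + b) @ v
    # Softmax over time, shifted by the maximum for numerical stability.
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    # Weighted sum of the per-frame inputs.
    c = alpha @ h
    return c, alpha
```

If all scores are equal, the weights are uniform and the output is simply the mean of the inputs; training moves the weights toward the frames that actually contain the keyword.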
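Further, the mel feature is formed by concatenating the mel feature of the current frame with those of several future frames.
Furthermore, the spectral magnitude vector of each frame is extracted with the fast Fourier transform and then filtered with a mel filter bank to obtain the acoustic feature of each frame, namely the mel feature.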
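The feature extraction can be sketched with NumPy alone. All numeric settings below (16 kHz sampling, 25 ms frames, 10 ms hop, 40 triangular filters, 2 future frames, Hann window) are assumptions chosen for illustration; the original only fixes the order of operations: FFT magnitude, mel filtering, then concatenation with future frames.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (filters, bins)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.fft.rfftfreq(fft_size, d=1.0 / sample_rate)
    bank = np.zeros((num_filters, bin_freqs.size))
    for i in range(num_filters):
        left, center, right = hz_points[i:i + 3]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mel_features(signal, sample_rate=16000, frame_len=400, hop=160,
                 num_filters=40, future=2):
    """Per-frame mel features, each concatenated with `future` following frames."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    magnitude = np.abs(np.fft.rfft(frames, n=frame_len, axis=1))
    mel = magnitude @ mel_filterbank(num_filters, frame_len, sample_rate).T
    # Row t holds frames t, t+1, ..., t+future side by side.
    return np.concatenate([mel[k:num_frames - future + k] for k in range(future + 1)],
                          axis=1)
```

Because each output row needs `future` later frames, the detector lags the audio by that many hops, which is the price of letting the network see future context.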
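Further, the unidirectional long short-term memory (LSTM) recurrent neural network comprises multiple stacked unidirectional layers, each having sixty-four neurons.
Further, the neural network is trained on a large noisy data set, in which the noisy speech is a mixture of multiple kinds of noise and the speech of multiple speakers;
the noisy speech is a mixture of thousands of different types of noise and the speech of more than five hundred speakers.
Further, the convolutional neural network is formed by stacking several single convolutional layers; the single convolutional layers of the convolutional neural network are connected through activation function layers.
Furthermore, a convolutional neural network is a deep feedforward artificial neural network whose artificial neurons can respond to surrounding units; a convolutional neural network includes convolutional layers and pooling layers.
Further, the feedforward deep neural network is formed by stacking multiple single linear layers; the linear layers of the feedforward deep neural network are connected through activation function layers.
Furthermore, a feedforward deep neural network (Deep Neural Network, DNN) is a feedforward neural network that has at least one hidden layer, uses activation functions to introduce non-linearity, uses cross entropy as the loss function, and is trained (that is, the weights between neurons are adjusted and updated) with back-propagation optimization algorithms (stochastic gradient descent, batch gradient descent).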
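The forward pass of such a stack of linear layers can be sketched as follows. The choice of ReLU as the activation function and the names `dnn_forward` and `layers` are assumptions; the original does not name the activation it uses.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def dnn_forward(x, layers):
    """Forward pass through stacked linear layers.

    layers: list of (W, b) pairs with W of shape (out_dim, in_dim).
    Hidden layers are followed by an activation; the last layer feeds a softmax
    whose size is the number of keywords plus one.
    """
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return softmax(W @ x + b)
```

The output vector sums to one, so each entry can be read directly as the confidence of the corresponding label.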
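Further, the attention mechanism is a soft attention mechanism (soft-attention).
Further, the input of the attention mechanism comes from the output of the convolutional neural network layers.
Further, the input of the attention mechanism is obtained by combining the convolutional-layer outputs of the current frame and several past frames.
Furthermore, an attention mechanism extends the capability of a neural network and allows it to approximate more complex functions; put more intuitively, it can focus on specific parts of the input. It helps improve benchmark performance in natural language processing and has also enabled new capabilities such as image captioning, memory-network addressing and neural programmers.
Further, the size of the output vector of the neural network is the number of keywords used in training plus one.
Furthermore, the single-channel robust keyword detection of the present invention refers to keyword detection on the signal captured by a single microphone; compared with keyword detection based on a beamforming microphone array, single-channel keyword detection has wider applicability.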
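Further, the present invention uses a supervised learning method for keyword detection and detects keywords with a convolutional neural network, a long short-term memory recurrent neural network and a feedforward deep neural network. A sentence-level attention mechanism and a feedforward deep neural network serve as the second-stage neural network. After the confidence output by the first stage, which consists of the convolutional neural network, the long short-term memory recurrent neural network and the feedforward deep neural network, exceeds a threshold, the second-stage network is used for confirmation. If the confidence output by the second-stage network also exceeds the threshold, a keyword is considered detected; otherwise no keyword is considered detected. Confirming with the second-stage network greatly reduces the false alarm rate of the neural network at a small cost to performance and improves the performance of the present invention.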
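The two-stage decision logic can be sketched as below. The threshold values and the function names are hypothetical, and passing the second stage as a callable is a design choice made here so that the more expensive sentence-level network only runs when the first stage fires.

```python
def best_keyword(probs):
    """Return (index, confidence) of the most likely keyword, skipping class 0."""
    index = max(range(1, len(probs)), key=lambda i: probs[i])
    return index, probs[index]


def detect_keyword(frame_probs, second_stage,
                   stage1_threshold=0.5, stage2_threshold=0.5):
    """Two-stage keyword detection.

    frame_probs: first-stage softmax output for the current frame
                 (index 0 is the non-keyword class).
    second_stage: zero-argument callable returning the sentence-level softmax
                  output; it is invoked only if the first stage passes.
    Returns the detected keyword label, or None if nothing is detected.
    """
    _, confidence1 = best_keyword(frame_probs)
    if confidence1 <= stage1_threshold:
        return None  # first stage did not fire; skip the second stage entirely
    label2, confidence2 = best_keyword(second_stage())
    return label2 if confidence2 > stage2_threshold else None
```

A first-stage false alarm is thus rejected whenever the sentence-level confidence stays below its threshold, which is how the cascade lowers the overall false alarm rate.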
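Further, a long short-term memory (LSTM) network is a type of recurrent neural network specially designed to solve the long-term dependency problem of ordinary RNNs (recurrent neural networks); all RNNs have the form of a chain of repeating neural network modules.
Furthermore, the robustness of the single-channel robust wake-up word detection of the present invention is reflected in its ability to maintain a high wake-up rate in noisy environments: under noisy conditions with a signal-to-noise ratio of 0 dB, it maintains a correct wake-up rate above 90%.
Further, the signal-to-noise ratio (SNR, S/N) is the ratio of signal to noise in an electronic device or electronic system. Here the signal is the electronic signal from outside the device that needs to be processed by the device, and the noise is the irregular additional signal (or information) that is not present in the original signal and is produced after passing through the device; this noise does not vary with the original signal.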
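Expressed in decibels, the usual power-based definition is given below; the symbols P_signal and P_noise are notation introduced here for the average signal and noise power.

$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}$$

An SNR of 0 dB therefore means that the noise carries as much power as the speech, which indicates how demanding the 0 dB condition mentioned above is.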
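Furthermore, robust single-channel keyword detection refers to keyword detection on speech in electronic format captured by a single microphone. Compared with keyword detection based on a beamforming microphone array, the present invention maintains a high wake-up rate without using spatial position information and needs only one microphone, so it has a wider range of application scenarios. The present invention uses a supervised learning method for keyword detection and, by combining the two training targets of noise reduction and keyword detection, achieves a keyword detection method that is robust in noisy scenes.
Furthermore, the present invention introduces a second-stage network to address the excessively high false alarm rate of keyword detection; the attention mechanism in the second-stage network can extract keyword-related information from a long time series. In the inference stage, the second-stage network executes its logic only after the output of the first-stage network exceeds the threshold, which saves part of the computational cost.
Further, the false alarm rate, also called the false alarm probability, is a term from radar detection: when threshold detection is used, it is the probability that a target is judged to be present when none actually exists, owing to the ubiquity and fluctuation of noise.
The present invention provides a single-channel robust real-time speech keyword detection method. It performs single-channel robust keyword detection on the signal captured by a single microphone and therefore has wider applicability than keyword detection based on a beamforming microphone array. It also uses a second-stage network for confirmation, which greatly reduces the false alarm rate of the neural network at a small cost to performance, improves performance, and maintains a high wake-up rate in noisy environments.
Although the present invention has been disclosed through the above embodiments, its scope of protection is not limited thereto; modifications, replacements and the like made to the above components without departing from the concept of the present invention all fall within the scope of the claims of the present invention.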

Claims (10)

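  1. A single-channel robust real-time speech keyword detection method, characterized by comprising the following steps:
    S1: receiving a noisy speech signal in electronic format, the signal containing human speech and non-speech background noise;
    S2: converting the time-domain noisy speech signal into a frequency-domain signal frame by frame using the short-time Fourier transform;
    S3: processing the frequency-domain signal with a mel filter to obtain mel features, which are used as acoustic features;
    S4: providing a neural network comprising a convolutional neural network, a unidirectional long short-term memory recurrent neural network and a feedforward deep neural network;
    passing the mel features frame by frame through the convolutional neural network, the unidirectional long short-term memory recurrent neural network and the feedforward deep neural network, and then applying a normalized exponential function to obtain confidence information for each keyword;
    S5: when the confidence of any one of the keywords exceeds a predefined threshold, concatenating the current frame with several preceding frames and taking the result as the output signal of the neural network;
    S6: passing the output signal of the neural network through an attention mechanism and the feedforward deep neural network in sequence, and applying the normalized exponential function to obtain sentence-level confidence information for each of the keywords; when the confidence value exceeds a predefined threshold, the keyword is considered detected, and otherwise the keyword is considered not detected.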
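  2. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the mel feature is formed by concatenating the mel feature of the current frame with those of several future frames.
  3. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the unidirectional long short-term memory recurrent neural network comprises multiple stacked unidirectional layers, each of the unidirectional layers having sixty-four neurons.
  4. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the neural network is trained on a large noisy data set, in which the noisy speech is a mixture of multiple kinds of noise and the speech of multiple speakers;
    the noisy speech is a mixture of thousands of different types of noise and the speech of more than five hundred speakers.
  5. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the convolutional neural network is formed by stacking several single convolutional layers;
    the single convolutional layers of the convolutional neural network are connected through activation function layers.
  6. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the feedforward deep neural network is formed by stacking multiple single linear layers;
    the linear layers of the feedforward deep neural network are connected through activation function layers.
  7. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the attention mechanism is a soft attention mechanism.
  8. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the input of the attention mechanism comes from the output of the convolutional neural network layers.
  9. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the input of the attention mechanism is obtained by combining the convolutional-layer outputs of the current frame and several past frames.
  10. The single-channel robust real-time speech keyword detection method according to claim 1, characterized in that the size of the output vector of the neural network is the number of the keywords used in training plus one.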
PCT/CN2019/109603 2019-09-30 2019-09-30 Single-channel robust real-time speech keyword detection method WO2021062705A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/109603 WO2021062705A1 (zh) 2019-09-30 2019-09-30 Single-channel robust real-time speech keyword detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/109603 WO2021062705A1 (zh) 2019-09-30 2019-09-30 Single-channel robust real-time speech keyword detection method

Publications (1)

Publication Number Publication Date
WO2021062705A1 (zh) 2021-04-08

Family

ID=75337648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/109603 WO2021062705A1 (zh) 2019-09-30 2019-09-30 Single-channel robust real-time speech keyword detection method

Country Status (1)

Country Link
WO (1) WO2021062705A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060190259A1 (en) * 2005-02-18 2006-08-24 Samsung Electronics Co., Ltd. Method and apparatus for recognizing speech by measuring confidence levels of respective frames
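CN103559881A (zh) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-independent keyword recognition method and system
CN108615526A (zh) * 2018-05-08 2018-10-02 腾讯科技(深圳)有限公司 Method, apparatus, terminal and storage medium for detecting keywords in a speech signal
CN110097870A (zh) * 2018-01-30 2019-08-06 阿里巴巴集团控股有限公司 Speech processing method, apparatus, device and storage medium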

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUMAR RAJATH, YERUVA VAISHNAVI, GANAPATHY SRIRAM: "On Convolutional LSTM Modeling for Joint Wake-Word Detection and Text Dependent Speaker Verification", INTERSPEECH 2018, ISCA, ISCA, 1 January 2018 (2018-01-01), ISCA, pages 1121 - 1125, XP055797254, DOI: 10.21437/Interspeech.2018-1759 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3127839A1 (fr) * 2021-10-05 2023-04-07 Centre National De La Recherche Scientifique Method for analysing a noisy sound signal for the recognition of control keywords and of a speaker of the analysed noisy sound signal
WO2023057384A1 (fr) * 2021-10-05 2023-04-13 Centre National De La Recherche Scientifique Method for analysing a noisy sound signal for the recognition of control keywords and of a speaker of the analysed noisy sound signal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19948015; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19948015; Country of ref document: EP; Kind code of ref document: A1)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.10.2022))
122 Ep: pct application non-entry in european phase (Ref document number: 19948015; Country of ref document: EP; Kind code of ref document: A1)