WO2020177371A1 - Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids - Google Patents

Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids Download PDF

Info

Publication number
WO2020177371A1
WO2020177371A1 · PCT/CN2019/117075
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
noise reduction
noise
frame
scene recognition
Prior art date
Application number
PCT/CN2019/117075
Other languages
English (en)
French (fr)
Inventor
张禄
王明江
张啟权
轩晓光
张馨
孙凤娇
Original Assignee
哈尔滨工业大学(深圳) (Harbin Institute of Technology, Shenzhen)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳))
Publication of WO2020177371A1

Classifications

    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L25/24 — Speech or voice analysis; the extracted parameters being the cepstrum
    • G10L25/30 — Speech or voice analysis; the analysis technique using neural networks
    • H04R25/00 — Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; electric tinnitus maskers providing an auditory perception

Definitions

  • the present invention relates to the field of software technology, in particular to an environment adaptive neural network noise reduction method, system and storage medium for digital hearing aids.
  • High-performance digital hearing aids ship with built-in noise reduction algorithms that remove background noise from the environment to meet the listener's comfort requirements. Because digital hearing aids must process speech in real time, the built-in algorithms are mostly low-complexity methods such as spectral subtraction and Wiener filtering. These algorithms can only cope with simple, stationary noise interference; in complex noise environments, such as low signal-to-noise ratio or transient noise, their performance is poor and the wearing experience for hearing-impaired patients is unsatisfactory.
  • The invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids, which exploits the powerful mapping ability of deep neural networks and combines it with an environment-adaptive strategy to realize a high-performance noise reduction algorithm for complex noise environments.
  • The present invention provides an environment-adaptive neural network noise reduction method for digital hearing aids, which comprises the following steps:
  • Preprocessing step: receive the noisy speech signal, sample and frame it, and pass it to the acoustic scene recognition module;
  • Scene recognition step: use the acoustic scene recognition module to identify the current acoustic scene; the module then autonomously selects the corresponding neural network model in the neural network noise reduction module and forwards the signal to it;
  • Neural network noise reduction step: the neural network noise reduction model receives the classification result sent by the acoustic scene recognition module and applies noise reduction targeted to the noise of each scene.
  • the acoustic scene recognition module adopts an LSTM neural network structure with a memory function for time series.
  • the specific steps are as follows:
  • S1: extract Mel cepstral coefficient features of a set dimension from each frame;
  • S2: the LSTM neural network reads in one frame of Mel cepstral coefficient features at a time for processing and outputs the classification result once a set number of frames has been reached.
  • the LSTM neural network structure includes an input layer, a hidden layer, and an output layer.
  • the neural units of the output layer correspond to different scene categories.
  • the LSTM neural network not only processes the current input but also combines it with the previously retained output, realizing a memory function; once the set number of frames of memory has accumulated, the classification result is output.
  • the LSTM neural network structure memory update principle is as follows:
  • the LSTM neural network structure combines the feature t_n input at the current frame with the previously retained output h_{n-1}, and also feeds in the state C_{n-1} of the previous frame for judgment, producing an output h_n and an output state C_n for the current frame; this iterates until the memory condition of the required number of frames is satisfied, after which a softmax transform of the final output h gives the predicted probabilities of the output layer.
  • the scene recognition step also includes computing the loss function during LSTM neural network training, given by the cross-entropy
  • E = -Σ_i y_i log(ŷ_i)
  • the noise reduction models in different scenarios all adopt a fully connected neural network structure, but the number of layers of the fully connected neural network structure and the number of neurons in each layer are different;
  • the noise reduction model of the fully connected neural network structure includes the following steps:
  • Training data set step: select clean speech data as the training set, then randomly mix noise data with the clean speech to obtain the required noisy training data;
  • Model parameter tuning step: use the minimum mean squared error as the cost function, then tune the model parameters according to the training-set and validation-set loss values to obtain the required neural network structure;
  • the validation set is built by selecting clean speech data for validation and mixing it with noise data to obtain noisy validation speech;
  • the minimum mean squared error is computed as MSE = (1/N) Σ_{n=1}^{N} (ŷ_n − y_n)²
  • each hidden layer uses dropout regularization with a discard rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001; during training, the Adam optimization algorithm performs back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect.
  • the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points; the sampling rate is 16000 Hz, so each frame is 16 ms;
  • in step S1, 39-dimensional Mel cepstral coefficient features are extracted from each frame;
  • in step S2, the LSTM neural network reads in one frame of Mel cepstral coefficient features at a time and outputs the classification result once 100 frames have been reached.
  • the present invention also discloses an environment-adaptive neural network noise reduction system for digital hearing aids, comprising: a memory, a processor, and a computer program stored in the memory, the computer program being configured to implement the steps of the claimed method when invoked by the processor.
  • the present invention also discloses a computer-readable storage medium storing a computer program, and the computer program is configured to implement the steps of the method described in the claims when called by a processor.
  • the beneficial effects of the present invention are: 1. real-time speech processing is guaranteed, since only the forward pass of the neural network is performed and the computational load is low; 2. the current acoustic scene can be recognized and a matching neural network model selected autonomously, so the noise in each scene receives targeted noise reduction, ensuring better speech quality and speech intelligibility; 3. transient noise is suppressed effectively; 4. good noise reduction is achieved even in low signal-to-noise-ratio environments.
  • Figure 1 is a block diagram of the environmental adaptive noise reduction algorithm of the present invention
  • Figure 2 is a diagram of the LSTM network structure of the present invention.
  • Figure 3 is a diagram of the operation mechanism of the LSTM unit of the present invention.
  • Figure 4 is a block diagram of the noise reduction model of the fully connected neural network of the present invention.
  • Figure 5 is a graph of the PESQ evaluation results of the present invention.
  • Figure 6 is a graph of the STOI evaluation results of the present invention.
  • the invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids.
  • the method uses a scene recognition module as the decision-driving module and selects the corresponding neural network noise reduction model for each acoustic scene, so as to suppress different noise types.
  • the entire algorithm system of the present invention includes two parts, one is a scene recognition module, and the other is a neural network noise reduction module, as shown in FIG. 1.
  • Fig. 1 is an algorithm block diagram of the entire neural network noise reduction system of the present invention, which is composed of an acoustic scene recognition module and multiple noise reduction models in different scenes. After the noisy speech signal is sampled and divided into frames, it is first sent to the scene recognition module to determine the current scene type, and then sent to the corresponding neural network noise reduction model to realize the noise reduction process.
  • the core part of the whole algorithm system is the recognition module and the noise reduction module, which will be introduced in detail below:
  • the acoustic scene recognition module is designed around an LSTM (Long Short-Term Memory) neural network with memory over time series. First, the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points; the sampling rate is 16000 Hz, so each frame is 16 ms. Next, 39-dimensional Mel Frequency Cepstrum Coefficient (MFCC) features are extracted from each frame. The LSTM network reads in one frame of MFCC features at a time, but only outputs a classification result once 100 frames have accumulated; that is, the current environment classification is updated every 1.6 s.
  • the structure of the LSTM neural network is shown in Figure 2.
  • the number of neural units in the input layer is 39
  • the number of neural units in the recursive hidden layer is 512
  • the number of neural units in the output layer is 9 (corresponding to 9 scene categories: factory, street, subway station, railway station, restaurant, sports field, airplane cabin, car interior, and indoor scenes)
  • the corresponding training data were downloaded from the freesound website [1], about 2 hours of audio per scene
  • the LSTM network not only processes the current input but also combines it with the previously retained output, realizing a memory function; once 100 frames of memory have accumulated, the classification result is output.
  • the memory update mechanism of the LSTM unit is shown in Figure 3, where C_{n-1} denotes the state retained from the previous frame, f_n the output of the current frame's forget gate, u_n the output of the current frame's update gate, O_n the output of the current frame's output gate, C_n the retained state of the current frame, and h_n the output of the current frame.
  • the LSTM unit combines the feature t_n input at the current frame with the previously retained output h_{n-1}, also feeding in the state C_{n-1} of the previous frame for judgment, and produces an output h_n and an output state C_n for the current frame; this iterates until the 100-frame memory condition is satisfied, after which a Softmax (normalized exponential function) transform of the final output h gives the predicted probabilities of the output layer.
  • the loss function during training of the LSTM network is calculated by cross-entropy.
  • the calculation formula is shown in equation (11), where y_i and ŷ_i are, respectively, the correct classification label and the classification result predicted by the output layer of the LSTM network: E = -Σ_i y_i log(ŷ_i)
  • the input signal with noise will be sent to different noise reduction models for frame-by-frame processing.
  • the noise reduction models in different scenarios all use a fully connected neural network structure, as shown in Figure 4.
  • the number of layers of the neural network and the number of neurons per layer differ, depending on the nature of the scene noise; for example, factory noise needs 3 hidden layers to achieve good noise reduction performance, whereas car interior noise needs only 2 layers for the same effect.
  • the following will take the network structure in the factory scenario as an example for detailed introduction.
  • the model is tuned according to the training-set and validation-set loss values. The final choice for the factory noise scene is a 129-1024-1024-1024-129 network structure; except for the linear output layer, all hidden units use the ReLU activation function. In addition, to improve the generalization ability of the network, each hidden layer uses dropout regularization with a discard rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001.
  • the noise reduction effects and metrics were all measured on the test set: another 400 sentences from the Aishell dataset that do not overlap the training set (2 male and 2 female speakers, 100 sentences each), mixed with the last 20% of the factory noise in NOISEX-92 at five noise pollution levels: -5 dB, 0 dB, 5 dB, 10 dB, and 15 dB.
  • transient noise such as the knocking of machines in the factory was suppressed well, and almost no residual noise could be heard.
  • the beneficial effects of the present invention are: 1. real-time speech processing is guaranteed, since only the forward pass of the neural network is performed and the computational load is low; 2. the current acoustic scene can be recognized and a matching neural network model selected autonomously, applying targeted noise reduction to the noise of each scene, which ensures better speech quality and speech intelligibility; 3. transient noise is suppressed effectively; 4. good noise reduction is achieved even in low signal-to-noise-ratio environments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Otolaryngology (AREA)
  • Neurosurgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

An environment-adaptive neural network noise reduction method for digital hearing aids, comprising the following steps performed in sequence. Preprocessing step: receive a noisy speech signal, sample and frame it, and pass it to an acoustic scene recognition module. Scene recognition step: use the acoustic scene recognition module to identify the current acoustic scene; the module then autonomously selects the corresponding neural network model in the neural network noise reduction module and forwards the signal to it. Neural network noise reduction step. The beneficial effects of the method are: 1. real-time speech processing is guaranteed, since only the forward pass of the neural network is performed and the computational load is low; 2. the current acoustic scene can be recognized and a matching neural network model selected autonomously, so the noise in each scene receives targeted noise reduction, ensuring better speech quality and speech intelligibility; 3. transient noise can be suppressed effectively.

Description

Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids — Technical field
The present invention relates to the field of software technology, and in particular to an environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids.
Background art
High-performance digital hearing aids currently on the market ship with built-in noise reduction algorithms that remove background noise from the environment to meet the listener's comfort requirements. Because digital hearing aids must process speech in real time, the built-in algorithms are mostly low-complexity methods such as spectral subtraction and Wiener filtering. These algorithms can only cope with simple, stationary noise interference; in complex noise environments, such as low signal-to-noise ratio or transient noise, their performance is poor and the wearing experience for hearing-impaired patients is unsatisfactory.
Summary of the invention
The invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids, which exploits the powerful mapping ability of deep neural networks and combines it with an environment-adaptive strategy to realize a high-performance noise reduction algorithm for complex noise environments.
The invention provides an environment-adaptive neural network noise reduction method for digital hearing aids, comprising the following steps performed in sequence:
Preprocessing step: receive the noisy speech signal, sample and frame it, and pass it to the acoustic scene recognition module;
Scene recognition step: use the acoustic scene recognition module to identify the current acoustic scene; the module then autonomously selects the corresponding neural network model in the neural network noise reduction module and forwards the signal to it;
Neural network noise reduction step: the neural network noise reduction model receives the classification result sent by the acoustic scene recognition module and applies noise reduction targeted to the noise of each scene.
As a further improvement of the invention, in the scene recognition step the acoustic scene recognition module uses an LSTM neural network structure with memory over time series, with the following specific steps:
S1: extract Mel cepstral coefficient features of a set dimension from each frame;
S2: the LSTM neural network reads in one frame of Mel cepstral coefficient features at a time for processing and outputs the classification result once a set number of frames has been reached.
As a further improvement of the invention, the LSTM neural network structure comprises an input layer, a hidden layer, and an output layer; the neural units of the output layer correspond to the different scene categories. The LSTM not only processes the current input but also combines it with the previously retained output, realizing a memory function; once the set number of frames of memory has accumulated, the classification result is output.
As a further improvement of the invention, the memory update principle of the LSTM neural network structure is as follows:
the LSTM combines the feature t_n input at the current frame with the previously retained output h_{n-1}, and also feeds in the state C_{n-1} of the previous frame for judgment, producing an output h_n and an output state C_n for the current frame; this iterates until the memory condition of the required number of frames is satisfied, after which a softmax transform of the final output h gives the predicted probabilities of the output layer.
As a further improvement of the invention, the scene recognition step further comprises computing the loss function during LSTM neural network training, by the formula

E = -Σ_i y_i log(ŷ_i)

where y_i and ŷ_i are, respectively, the correct classification label and the classification result predicted by the output layer of the LSTM network.
As a further improvement of the invention, the noise reduction models for the different scenes all use a fully connected neural network structure, but the number of layers and the number of neurons per layer differ;
the noise reduction model of the fully connected structure is built by the following steps:
Training data set step: select clean speech data as the training set, then randomly mix noise data with the clean speech to obtain the required noisy training data;
Model parameter tuning step: use the minimum mean squared error as the cost function, then tune the model parameters according to the training-set and validation-set loss values to obtain the required neural network structure;
during training, the back-propagation algorithm is iterated repeatedly to achieve a good noise suppression effect;
the validation set is built by selecting clean speech data for validation and mixing it with noise data to obtain noisy validation speech.
The minimum mean squared error is computed as

MSE = (1/N) Σ_{n=1}^{N} (ŷ_n − y_n)²

where MSE is the mean squared error.
As a further improvement of the invention, except for the output layer, which is a linear layer, all hidden layer units use the ReLU activation function; in addition, to improve the generalization ability of the network, each hidden layer uses dropout regularization with a discard rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001. During training, the Adam optimization algorithm performs back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect.
As a further improvement of the invention, in the preprocessing step the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points; the sampling rate is 16000 Hz, so each frame is 16 ms;
in step S1, 39-dimensional Mel cepstral coefficient features are extracted from each frame;
in step S2, the LSTM neural network reads in one frame of Mel cepstral coefficient features at a time and outputs the classification result once 100 frames have been reached.
The invention also discloses an environment-adaptive neural network noise reduction system for digital hearing aids, comprising: a memory, a processor, and a computer program stored in the memory, the computer program being configured to implement the steps of the claimed method when invoked by the processor.
The invention also discloses a computer-readable storage medium storing a computer program configured to implement the steps of the claimed method when invoked by a processor.
The beneficial effects of the invention are: 1. real-time speech processing is guaranteed, since only the forward pass of the neural network is performed and the computational load is low; 2. the current acoustic scene can be recognized and a matching neural network model selected autonomously, so the noise in each scene receives targeted noise reduction, ensuring better speech quality and speech intelligibility; 3. transient noise is suppressed effectively; 4. good noise reduction is achieved even in low signal-to-noise-ratio environments.
Brief description of the drawings
Figure 1 is a block diagram of the environment-adaptive noise reduction algorithm of the invention;
Figure 2 shows the LSTM network structure of the invention;
Figure 3 shows the operating mechanism of the LSTM unit of the invention;
Figure 4 is a block diagram of the fully connected neural network noise reduction model of the invention;
Figure 5 is a graph of the PESQ evaluation results of the invention;
Figure 6 is a graph of the STOI evaluation results of the invention.
Detailed description of the embodiments
The invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids. The method uses a scene recognition module as the decision-driving module and selects the corresponding neural network noise reduction model for each acoustic scene, so as to suppress different noise types. The whole algorithm system of the invention comprises two parts, a scene recognition module and a neural network noise reduction module, as shown in Figure 1.
Figure 1 is the algorithm block diagram of the whole neural network noise reduction system, composed of the acoustic scene recognition module and multiple noise reduction models for different scenes. After sampling and framing, the noisy speech signal is first sent to the scene recognition module to determine the current scene type, and then to the corresponding neural network noise reduction model, which carries out the noise reduction process. The core of the whole system lies in the recognition module and the noise reduction module, each introduced in detail below.
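The decision-driven flow just described — classify the scene, then route every frame to that scene's model — can be sketched as a small dispatch function. The recognizer and per-scene models below are toy stand-ins with hypothetical names and attenuation factors, not the patent's trained networks:

```python
def denoise(frames, recognize_scene, noise_models):
    """Environment-adaptive pipeline: classify the acoustic scene,
    then route every frame to the scene-specific noise-reduction model."""
    scene = recognize_scene(frames)          # e.g. "factory"
    model = noise_models[scene]              # scene-specific network
    return [model(f) for f in frames]

# toy stand-ins for the trained modules (illustrative only)
recognize = lambda frames: "factory"
models = {"factory": lambda f: f * 0.5,      # pretend attenuation
          "street":  lambda f: f * 0.8}

print(denoise([2.0, 4.0], recognize, models))  # [1.0, 2.0]
```

Swapping the entry in `models` is all it takes to change the noise type being suppressed, which is the point of the environment-adaptive strategy.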
The acoustic scene recognition module is designed around an LSTM (Long Short-Term Memory) neural network with memory over time series. First, the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points; the sampling rate is 16000 Hz, so each frame is 16 ms. Next, 39-dimensional Mel Frequency Cepstrum Coefficient (MFCC) features are extracted from each frame. The LSTM network reads in one frame of MFCC features at a time, but only outputs a classification result once 100 frames have accumulated; that is, the current environment classification is updated every 1.6 s.
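The preprocessing numbers are easy to check in code: at 16000 Hz, a 256-point frame spans 16 ms, and 100 frames make the 1.6 s classification interval. A minimal framing sketch, assuming non-overlapping frames since the text does not specify a hop size:

```python
SAMPLE_RATE = 16000      # Hz, as specified
FRAME_LEN = 256          # samples per frame

def frame_signal(samples):
    """Split a sampled signal into non-overlapping 256-point frames,
    dropping any incomplete tail frame."""
    n = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(n)]

frames = frame_signal([0.0] * SAMPLE_RATE)    # one second of audio
print(len(frames))                            # 62 full frames
print(FRAME_LEN / SAMPLE_RATE * 1000)         # frame duration in ms
print(100 * FRAME_LEN / SAMPLE_RATE)          # seconds between updates
```

The 39-dimensional MFCC extraction itself would normally come from a signal-processing library and is omitted here.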
The structure of the LSTM neural network is shown in Figure 2: 39 neural units in the input layer, 512 units in the recurrent hidden layer, and 9 units in the output layer (corresponding to 9 scene categories: factory, street, subway station, railway station, restaurant, sports field, airplane cabin, car interior, and indoor scenes). The corresponding training data were downloaded from the freesound website [1], about 2 hours of audio per scene. The LSTM network not only processes the current input but also combines it with the previously retained output, realizing a memory function; once 100 frames of memory have accumulated, the classification result is output.
The memory update mechanism of the LSTM unit is shown in Figure 3, where C_{n-1} denotes the state retained from the previous frame, f_n the output of the current frame's forget gate, u_n the output of the current frame's update gate, O_n the output of the current frame's output gate, C_n the retained state of the current frame, and h_n the output of the current frame. The LSTM unit combines the feature t_n input at the current frame with the previously retained output h_{n-1}, also feeding in the state C_{n-1} of the previous frame for judgment, and produces an output h_n and an output state C_n for the current frame; this iterates until the 100-frame memory condition is satisfied, after which a Softmax (normalized exponential function) transform of the final output h gives the predicted probabilities of the output layer.
The gates and outputs are computed as follows, where δ(·) and tanh(·) denote the sigmoid and hyperbolic tangent activation functions, respectively:

C̃_n = tanh(W_c[h_{n-1}, x_n] + b_c)    (5)

f_n = δ(W_f[h_{n-1}, x_n] + b_f)    (6)

u_n = δ(W_u[h_{n-1}, x_n] + b_u)    (7)

O_n = δ(W_o[h_{n-1}, x_n] + b_o)    (8)

C_n = u_n * C̃_n + f_n * C_{n-1}    (9)

h_n = O_n * tanh(C_n)    (10)
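Equations (5)-(10) can be traced with a scalar toy version of the unit. The weights below are arbitrary illustrative values, not trained parameters, and real LSTM layers use matrices rather than scalars:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_n, h_prev, c_prev, w):
    """One memory update of the LSTM unit, scalar form of eqs. (5)-(10).
    w maps gate name -> (weight on h_{n-1}, weight on x_n, bias)."""
    lin = lambda g: w[g][0] * h_prev + w[g][1] * x_n + w[g][2]
    c_cand = math.tanh(lin("c"))            # candidate state, eq. (5)
    f_n = sigmoid(lin("f"))                 # forget gate, eq. (6)
    u_n = sigmoid(lin("u"))                 # update gate, eq. (7)
    o_n = sigmoid(lin("o"))                 # output gate, eq. (8)
    c_n = u_n * c_cand + f_n * c_prev       # retained state, eq. (9)
    h_n = o_n * math.tanh(c_n)              # frame output, eq. (10)
    return h_n, c_n

w = {g: (0.5, 0.5, 0.0) for g in "cfuo"}    # arbitrary toy weights
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:                  # iterate frame by frame
    h, c = lstm_step(x, h, c, w)
print(h, c)
```

Note how `c_prev` carries information across frames: this is the "memory" that accumulates over the 100-frame window before classification.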
The loss function for training the LSTM network is the cross-entropy, computed as in equation (11), where y_i and ŷ_i are, respectively, the correct classification label and the classification result predicted by the output layer of the LSTM network:

E = -Σ_i y_i log(ŷ_i)    (11)
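A small sketch of equation (11), with a softmax front end producing the output-layer probabilities over the 9 scene classes; the logits below are arbitrary illustrative values:

```python
import math

def softmax(logits):
    """Normalized exponential over the output units."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred):
    """E = -sum_i y_i * log(y_hat_i), equation (11); y_true is one-hot."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y)

y_true = [1] + [0] * 8                  # one-hot label, e.g. "factory"
probs = softmax([2.0] + [0.0] * 8)      # toy logits (illustrative)
loss = cross_entropy(y_true, probs)
print(loss)
```

With a one-hot label the sum collapses to the negative log-probability of the correct class, so the loss shrinks as the network grows more confident in the right scene.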
According to the classification result of the acoustic scene classification module, the input noisy speech signal is sent to the appropriate noise reduction model for frame-by-frame processing. The noise reduction models for the different scenes all use a fully connected neural network structure, as shown in Figure 4, but the number of layers and the number of neurons per layer differ, depending on the nature of the scene noise: factory noise, for example, needs 3 hidden layers to achieve good noise reduction performance, whereas car interior noise needs only 2 layers for the same effect. The network structure for the factory scene is introduced in detail below.
To train the fully connected noise reduction model shown in Figure 4, a sufficiently large training data set must first be prepared, which is also important for improving the generalization ability of the network. We therefore selected 1200 sentences from the Aishell Chinese corpus [2] (6 male and 6 female speakers, 100 sentences each) as the clean training speech, and randomly mixed it with the factory noise (first 60%) from the NOISEX-92 noise library [3]; the mixing signal-to-noise ratio follows a uniform distribution over [-5, 20] dB, yielding about 25 hours of noisy training data in total. To tune the model parameters, a validation set is needed: another 400 sentences from the Aishell dataset (2 male and 2 female speakers, 100 sentences each) were selected as clean validation speech and uniformly mixed with the middle 20% of the NOISEX-92 factory noise, giving about 8 hours of noisy validation data.
The minimum mean squared error (MMSE)

MSE = (1/N) Σ_{n=1}^{N} (ŷ_n − y_n)²    (12)

is used as the cost function, and the model parameters are tuned according to the training-set and validation-set loss values. The final choice for the factory noise scene is a 129-1024-1024-1024-129 network structure; except for the linear output layer, all hidden units use the ReLU activation function. In addition, to improve the generalization ability of the network, each hidden layer uses dropout regularization with a discard rate of 0.8, and the L2 regularization coefficient is set to 0.00001. During training, the Adam optimization algorithm performs back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect. Once the model is trained, only the forward pass runs inside the hearing aid; the computational load is low and meets real-time processing requirements. The PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility) evaluation results after noise reduction are shown in Figures 5 and 6; the effects and metrics were all measured on the test set, another 400 Aishell sentences not overlapping the training set (2 male and 2 female speakers, 100 sentences each), mixed with the last 20% of the NOISEX-92 factory noise at five pollution levels: -5 dB, 0 dB, 5 dB, 10 dB, and 15 dB. In subjective listening, transient noise such as machine knocking in the factory was suppressed very well, with almost no residual noise audible.
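The shape of the factory model (129-1024-1024-1024-129, ReLU hidden layers, linear output) can be illustrated with a scaled-down forward pass. The 8-unit hidden layers and random weights below are stand-ins chosen only to keep the pure-Python sketch fast, not the trained 1024-unit layers:

```python
import random

random.seed(1)

def make_layer(n_in, n_out):
    """Random weight matrix and zero bias for one dense layer."""
    w = [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def dense(x, w, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

# the patent's factory model is 129-1024-1024-1024-129; shrink the
# hidden layers to 8 units so this sketch runs instantly
sizes = [129, 8, 8, 8, 129]
layers = [make_layer(a, b) for a, b in zip(sizes, sizes[1:])]

x = [0.5] * 129                       # one frame of 129 spectral features
for i, (w, b) in enumerate(layers):
    x = dense(x, w, b)
    if i < len(layers) - 1:           # hidden layers: ReLU activation
        x = [max(0.0, v) for v in x]  # output layer stays linear
print(len(x))                         # 129
```

The matching 129-point input and output sizes reflect the frame-in, frame-out mapping the noise reduction model performs; dropout and L2 act only at training time and are omitted from this inference sketch.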
The beneficial effects of the invention are: 1. real-time speech processing is guaranteed, since only the forward pass of the neural network is performed and the computational load is low; 2. the current acoustic scene can be recognized and a matching neural network model selected autonomously, so the noise in each scene receives targeted noise reduction, ensuring better speech quality and speech intelligibility; 3. transient noise is suppressed effectively; 4. good noise reduction is achieved even in low signal-to-noise-ratio environments.
The above is a further detailed description of the invention in connection with specific preferred embodiments; the concrete implementation of the invention is not limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make several simple deductions or substitutions without departing from the concept of the invention, and all of these shall be deemed to fall within the scope of protection of the invention.

Claims (10)

  1. An environment-adaptive neural network noise reduction method for digital hearing aids, characterized by comprising the following steps performed in sequence:
    Preprocessing step: receiving a noisy speech signal, which is sampled and framed and then passed to an acoustic scene recognition module;
    Scene recognition step: using the acoustic scene recognition module to identify the current acoustic scene, the module then autonomously selecting the corresponding neural network model in the neural network noise reduction module and forwarding the signal to it;
    Neural network noise reduction step: the neural network noise reduction model receiving the classification result sent by the acoustic scene recognition module and applying noise reduction targeted to the noise of each scene.
  2. The environment-adaptive neural network noise reduction method according to claim 1, characterized in that in the scene recognition step the acoustic scene recognition module uses an LSTM neural network structure with memory over time series, with the following specific steps:
    S1: extracting Mel cepstral coefficient features of a set dimension from each frame;
    S2: the LSTM neural network reading in one frame of Mel cepstral coefficient features at a time for processing and outputting the classification result once a set number of frames has been reached.
  3. The environment-adaptive neural network noise reduction method according to claim 2, characterized in that the LSTM neural network structure comprises an input layer, a hidden layer and an output layer, the neural units of the output layer corresponding to different scene categories; the LSTM not only processes the current input but also combines it with the previously retained output, realizing a memory function, and outputs the classification result once the set number of frames of memory has accumulated.
  4. The environment-adaptive neural network noise reduction method according to claim 3, characterized in that the memory update principle of the LSTM neural network structure is as follows:
    the LSTM combines the feature t_n input at the current frame with the previously retained output h_{n-1}, also feeding in the state C_{n-1} of the previous frame for judgment, and produces an output h_n and an output state C_n for the current frame; this iterates until the memory condition of the required number of frames is satisfied, after which a softmax transform of the final output h gives the predicted probabilities of the output layer.
  5. The environment-adaptive neural network noise reduction method according to claim 4, characterized in that the scene recognition step further comprises computing the loss function during LSTM neural network training, by the formula
    E = -Σ_i y_i log(ŷ_i)
    where y_i and ŷ_i are, respectively, the correct classification label and the classification result predicted by the output layer of the LSTM network.
  6. The environment-adaptive neural network noise reduction method according to claim 1, characterized in that the noise reduction models for different scenes all use a fully connected neural network structure, but the number of layers and the number of neurons per layer differ;
    the noise reduction model of the fully connected structure is built by the following steps:
    Training data set step: selecting clean speech data as the training set, then randomly mixing noise data with the clean speech to obtain the required noisy training data;
    Model parameter tuning step: using the minimum mean squared error as the cost function, then tuning the model parameters according to the training-set and validation-set loss values to obtain the required neural network structure;
    during training, the back-propagation algorithm being iterated repeatedly to achieve a good noise suppression effect;
    the validation set being clean speech data selected for validation and mixed with noise data to obtain noisy validation speech;
    the minimum mean squared error being computed as
    MSE = (1/N) Σ_{n=1}^{N} (ŷ_n − y_n)²
    where MSE is the mean squared error.
  7. The environment-adaptive neural network noise reduction method according to claim 6, characterized in that, except for the output layer, which is a linear layer, all hidden layer units use the ReLU activation function; in addition, to improve the generalization ability of the network, each hidden layer uses dropout regularization with a discard rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001; during training, the Adam optimization algorithm performs back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect.
  8. The environment-adaptive neural network noise reduction method according to claim 2, characterized in that in the preprocessing step the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points, with a sampling rate of 16000 Hz, each frame being 16 ms;
    in step S1, 39-dimensional Mel cepstral coefficient features are extracted from each frame;
    in step S2, the LSTM neural network reads in one frame of Mel cepstral coefficient features at a time for processing and outputs the classification result once 100 frames have been reached.
  9. An environment-adaptive neural network noise reduction system for digital hearing aids, characterized by comprising: a memory, a processor, and a computer program stored in the memory, the computer program being configured to implement the steps of the method of any one of claims 1-8 when invoked by the processor.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program configured to implement the steps of the method of any one of claims 1-8 when invoked by a processor.
PCT/CN2019/117075 2019-03-06 2019-11-11 Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids WO2020177371A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910168122.4 2019-03-06
CN201910168122.4A CN109859767B (zh) 2019-03-06 2019-03-06 Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids

Publications (1)

Publication Number Publication Date
WO2020177371A1 (zh) 2020-09-10

Family

ID=66899968

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117075 WO2020177371A1 (zh) 2019-03-06 2019-11-11 Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids

Country Status (2)

Country Link
CN (1) CN109859767B (zh)
WO (1) WO2020177371A1 (zh)


Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859767B (zh) 2019-03-06 2020-10-13 Harbin Institute of Technology (Shenzhen) Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids
CN110379412B (zh) * 2019-09-05 2022-06-17 Tencent Technology (Shenzhen) Co., Ltd. Speech processing method and apparatus, electronic device, and computer-readable storage medium
DE102019213809B3 (de) * 2019-09-11 2020-11-26 Sivantos Pte. Ltd. Method for operating a hearing aid, and hearing aid
CN110996208B (zh) * 2019-12-13 2021-07-30 Hengxuan Technology (Shanghai) Co., Ltd. Wireless earphone and noise reduction method therefor
IT201900024454A1 (it) 2019-12-18 2021-06-18 Storti Gianampellio Low-power audio apparatus for noisy environments
CN113129876B (zh) * 2019-12-30 2024-05-14 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Network search method and apparatus, electronic device, and storage medium
CN111312221B (zh) * 2020-01-20 2022-07-22 Ningbo Shunyun Electronics Co., Ltd. Intelligent range hood based on voice control
CN111491245B (zh) * 2020-03-13 2022-03-04 Tianjin University Digital hearing aid sound field recognition algorithm based on recurrent neural networks and implementation method
CN111508509A (zh) * 2020-04-02 2020-08-07 Guangdong Jiulian Technology Co., Ltd. Deep-learning-based sound quality processing system and method
CN112565997B (zh) * 2020-12-04 2022-03-22 Kefu Medical Technology Co., Ltd. Adaptive noise reduction method and apparatus for a hearing aid, hearing aid, and storage medium
CN113160789A (zh) * 2021-03-05 2021-07-23 Nanjing Meishen Intelligent Technology Co., Ltd. Active noise reduction apparatus and method
CN113160844A (zh) * 2021-04-27 2021-07-23 Shandong Computer Science Center (National Supercomputer Center in Jinan) Speech enhancement method and system based on noise background classification
CN113259824B (zh) * 2021-05-14 2021-11-30 Guxin (Guangzhou) Technology Co., Ltd. Real-time multi-channel digital hearing aid noise reduction method and system
CN113266933A (zh) * 2021-05-24 2021-08-17 Qingdao Haier Air Conditioner General Corp., Ltd. Voice control method for an air conditioner, and air conditioner
CN113724726A (zh) * 2021-08-18 2021-11-30 China Yangtze Power Co., Ltd. Processing method for suppressing unit operating noise based on a fully connected neural network
CN114245280B (zh) * 2021-12-20 2023-06-23 Tsinghua Shenzhen International Graduate School Scene-adaptive hearing aid audio enhancement system based on neural networks
CN114640938B (zh) * 2022-05-18 2022-08-23 Shenzhen Tingduoduo Technology Co., Ltd. Hearing aid function implementation method based on a Bluetooth earphone chip, and Bluetooth earphone
CN114640937B (zh) 2022-05-18 2022-09-02 Shenzhen Tingduoduo Technology Co., Ltd. Hearing aid function implementation method based on a wearable device system, and wearable device
CN116367063B (zh) * 2023-04-23 2023-11-14 Zhengzhou University Embedded bone-conduction hearing aid device and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6453284B1 (en) * 1999-07-26 2002-09-17 Texas Tech University Health Sciences Center Multiple voice tracking system and method
CN101529929A (zh) * 2006-09-05 2009-09-09 GN ReSound A/S Hearing aid with histogram based sound environment classification
CN104952448A (zh) * 2015-05-04 2015-09-30 Zhang Aiying Feature enhancement method and system using a bidirectional long short-term memory recurrent neural network
CN105611477A (zh) * 2015-12-27 2016-05-25 Beijing University of Technology Speech enhancement algorithm combining deep and wide neural networks in digital hearing aids
CN108073856A (zh) * 2016-11-14 2018-05-25 Huawei Technologies Co., Ltd. Noise signal identification method and apparatus
CN108877823A (zh) * 2018-07-27 2018-11-23 Samsung Electronics (China) R&D Center Speech enhancement method and apparatus
CN108962278A (zh) * 2018-06-26 2018-12-07 Changzhou Institute of Technology Hearing aid sound scene classification method
CN109410976A (zh) * 2018-11-01 2019-03-01 Beijing University of Technology Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aids
CN109859767A (zh) * 2019-03-06 2019-06-07 Harbin Institute of Technology (Shenzhen) Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019014890A1 (zh) * 2017-07-20 2019-01-24 Elevoc (Shenzhen) Technology Co., Ltd. Universal single-channel real-time noise reduction method
CN109378010A (zh) * 2018-10-29 2019-02-22 Gree Electric Appliances, Inc. of Zhuhai Neural network model training method, and speech denoising method and apparatus


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220256294A1 (en) * 2019-05-09 2022-08-11 Sonova Ag Hearing Device System And Method For Processing Audio Signals
US11832058B2 (en) * 2019-05-09 2023-11-28 Sonova Ag Hearing device system and method for processing audio signals
CN112447183A (zh) * 2020-11-16 2021-03-05 Beijing Dajia Internet Information Technology Co., Ltd. Audio processing model training, and audio denoising method and apparatus, and electronic device
CN113314136A (zh) * 2021-05-27 2021-08-27 Xidian University Speech optimization method based on directional noise reduction and dry-sound extraction techniques
CN113345464A (zh) * 2021-05-31 2021-09-03 Ping An Technology (Shenzhen) Co., Ltd. Speech extraction method, system, device and storage medium
CN113707159A (zh) * 2021-08-02 2021-11-26 Nanchang University Method for identifying bird species involved in power-grid bird-related faults based on Mel spectrograms and deep learning
CN113707159B (zh) * 2021-08-02 2024-05-03 Nanchang University Method for identifying bird species involved in power-grid bird-related faults based on Mel spectrograms and deep learning
CN113823322A (zh) * 2021-10-26 2021-12-21 Wuhan Xinchang Technology Co., Ltd. Speech recognition method based on a simplified and improved Transformer model
CN114626412B (zh) * 2022-02-28 2024-04-02 Changsha Rongchuang Zhisheng Electronic Technology Co., Ltd. Multi-class target recognition method and system for unattended sensor systems
CN114626412A (zh) * 2022-02-28 2022-06-14 Changsha Rongchuang Zhisheng Electronic Technology Co., Ltd. Multi-class target recognition method and system for unattended sensor systems
CN114869224A (zh) * 2022-03-28 2022-08-09 Zhejiang University Pulmonary disease classification and detection method based on collaborative deep learning and lung auscultation sounds
CN117290669B (zh) * 2023-11-24 2024-02-06 Zhejiang Lab Deep-learning-based noise reduction method, apparatus and medium for optical-fiber temperature sensing signals
CN117290669A (zh) * 2023-11-24 2023-12-26 Zhejiang Lab Deep-learning-based noise reduction method, apparatus and medium for optical-fiber temperature sensing signals

Also Published As

Publication number Publication date
CN109859767A (zh) 2019-06-07
CN109859767B (zh) 2020-10-13

Similar Documents

Publication Publication Date Title
WO2020177371A1 (zh) Environment-adaptive neural network noise reduction method, system and storage medium for digital hearing aids
CN109841226B (zh) Single-channel real-time noise reduction method based on a convolutional recurrent neural network
Zhao et al. Perceptually guided speech enhancement using deep neural networks
CN112735456B (zh) Speech enhancement method based on a DNN-CLSTM network
CN107393550B (zh) Speech processing method and apparatus
CN110867181B (zh) Multi-target speech enhancement method based on joint SCNN and TCNN estimation
CN111583954B (zh) Speaker-independent single-channel speech separation method
CN109841206A (zh) Echo cancellation method based on deep learning
CN110428849B (zh) Speech enhancement method based on generative adversarial networks
CN106782497B (zh) Intelligent speech noise reduction algorithm based on a portable smart terminal
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
CN113744749B (zh) Speech enhancement method and system based on a psychoacoustic-domain weighted loss function
CN107360497B (zh) Calculation method and apparatus for estimating the reverberation component
Dionelis et al. Modulation-domain Kalman filtering for monaural blind speech denoising and dereverberation
CN111739562B (zh) Voice activity detection method based on data selectivity and Gaussian mixture models
CN112259117A (zh) Method for locking onto and extracting a target sound source
CN103971697B (zh) Speech enhancement method based on non-local means filtering
CN116453547A (zh) Hearing aid speech quality self-evaluation method based on hearing-loss classification
TWI749547B (zh) Speech enhancement system applying deep learning
Chen et al. Leveraging heteroscedastic uncertainty in learning complex spectral mapping for single-channel speech enhancement
Wang Research progress in speech enhancement technology
CN108573698B (zh) Speech noise reduction method based on gender fusion information
CN113744725B (zh) Training method for a speech endpoint detection model, and speech noise reduction method
CN107393559B (zh) Method and apparatus for checking and calibrating speech detection results
Ni et al. Multi-channel dictionary learning speech enhancement based on power spectrum

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19917642

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19917642

Country of ref document: EP

Kind code of ref document: A1