WO2020177371A1 - Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium - Google Patents

Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium Download PDF

Info

Publication number
WO2020177371A1
WO2020177371A1 (PCT/CN2019/117075)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
noise reduction
noise
frame
scene recognition
Prior art date
Application number
PCT/CN2019/117075
Other languages
French (fr)
Chinese (zh)
Inventor
张禄
王明江
张啟权
轩晓光
张馨
孙凤娇
Original Assignee
哈尔滨工业大学(深圳)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 哈尔滨工业大学(深圳)
Publication of WO2020177371A1 publication Critical patent/WO2020177371A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00 - Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception

Definitions

  • the present invention relates to the field of software technology, in particular to an environment adaptive neural network noise reduction method, system and storage medium for digital hearing aids.
  • at present, high-performance digital hearing aids on the market have built-in noise reduction algorithms to eliminate background-noise interference in the environment and meet the requirements of listening comfort. Because digital hearing aids must process speech in real time, the built-in noise reduction algorithms mostly use low-complexity methods such as spectral subtraction and Wiener filtering. These algorithms can only cope with simple, stationary noise interference; in complex noise environments such as low signal-to-noise ratio and transient noise their performance is poor, and the wearing experience for hearing-loss patients is unsatisfactory.
  • the invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids, which uses the powerful mapping ability of deep neural networks, combined with an environment-adaptive strategy, to realize a high-performance noise reduction algorithm for complex noise environments.
  • the present invention provides an environment-adaptive neural network noise reduction method for digital hearing aids, comprising executing the following steps in sequence:
  • Preprocessing step: a noisy speech signal is received, sampled, divided into frames, and transmitted to the acoustic scene recognition module;
  • Scene recognition step: the acoustic scene recognition module identifies the current acoustic scene, then autonomously selects the corresponding neural network model in the neural network noise reduction module and forwards the signal to it;
  • Neural network noise reduction step: the neural network noise reduction model receives the classification result sent by the acoustic scene recognition module and performs targeted noise reduction on the noise of the corresponding scene.
  • the acoustic scene recognition module adopts an LSTM neural network structure with a memory function for time series.
  • the specific steps are as follows:
  • S1: Mel cepstrum coefficient features of a set dimension are extracted for each frame; S2: the LSTM neural network reads in one frame of Mel cepstrum coefficient features at a time for processing, and outputs the classification result once a set number of frames has been read.
  • the LSTM neural network structure includes an input layer, a hidden layer, and an output layer.
  • the neural units of the output layer correspond to different scene categories.
  • the LSTM neural network not only processes the current input but also combines it with the previously retained output, realizing a memory function; once the set number of frames has been accumulated, the classification result is output.
  • the LSTM neural network structure memory update principle is as follows:
  • the LSTM neural network structure combines the current frame's input feature t_n with the previously retained output h_{n-1}, and also feeds in the previous frame's state C_{n-1} for judgment, producing the current frame's output h_n and output state C_n; this iterates until the memory condition of the required number of frames is satisfied, after which a softmax transform of the final output h yields the predicted probabilities of the output layer.
  • the scene recognition step also includes calculating the loss function during training of the LSTM neural network, computed as the cross-entropy $\mathrm{Loss} = -\sum_i y_i \log \hat{y}_i$ between the correct label $y_i$ and the predicted result $\hat{y}_i$.
  • the noise reduction models in different scenarios all adopt a fully connected neural network structure, but the number of layers of the fully connected neural network structure and the number of neurons in each layer are different;
  • the noise reduction model of the fully connected neural network structure includes the following steps:
  • Training data set step: select clean speech data as the training set, then randomly mix noise data with the clean speech to obtain the required noisy training data;
  • Model parameter tuning step: use the minimum mean squared error as the cost function, then tune the model parameters according to the training-set loss and validation-set loss to obtain the required neural network structure;
  • the validation set is built by selecting clean speech data for validation and mixing it with noise data to obtain noisy validation speech;
  • the minimum mean squared error is calculated as $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(\hat{x}_n - x_n)^2$, where MSE denotes the mean squared error;
  • each hidden layer adopts a regularization method with a dropout rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001; during training, the Adam optimization algorithm is used for back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect.
  • the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points; the sampling rate is 16000 Hz and each frame is 16 ms;
  • in step S1, a 39-dimensional Mel cepstrum coefficient feature is extracted for each frame;
  • in step S2, the LSTM neural network reads in one frame of Mel cepstrum coefficient features at a time and outputs the classification result once 100 frames have been read.
  • the present invention also discloses an environment-adaptive neural network noise reduction system for digital hearing aids, comprising a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the claimed method when invoked by the processor.
  • the present invention also discloses a computer-readable storage medium storing a computer program configured to implement the steps of the claimed method when invoked by a processor.
  • the beneficial effects of the present invention are: 1. real-time speech processing is guaranteed, since only the forward propagation of the neural network is performed and the computational load is modest; 2. the current acoustic scene is recognized and a matching neural network model is autonomously selected, so noise in different scenes receives targeted noise reduction, ensuring better speech quality and intelligibility; 3. transient noise is effectively suppressed; 4. a better noise reduction effect is achieved in low signal-to-noise-ratio environments.
  • Figure 1 is a block diagram of the environmental adaptive noise reduction algorithm of the present invention
  • Figure 2 is a diagram of the LSTM network structure of the present invention.
  • Figure 3 is a diagram of the operation mechanism of the LSTM unit of the present invention.
  • Figure 4 is a block diagram of the noise reduction model of the fully connected neural network of the present invention.
  • Fig. 5 is a graph of evaluation results of PESQ indicators of the present invention.
  • Fig. 6 is a graph of evaluation results of STOI indicators of the present invention.
  • the invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids.
  • the method uses a scene recognition module as a decision-driving module and selects the corresponding neural network noise reduction model according to the acoustic scene, realizing suppression of different noise types.
  • the entire algorithm system of the present invention includes two parts, one is a scene recognition module, and the other is a neural network noise reduction module, as shown in FIG. 1.
  • Fig. 1 is an algorithm block diagram of the entire neural network noise reduction system of the present invention, which is composed of an acoustic scene recognition module and multiple noise reduction models in different scenes. After the noisy speech signal is sampled and divided into frames, it is first sent to the scene recognition module to determine the current scene type, and then sent to the corresponding neural network noise reduction model to realize the noise reduction process.
  • the core part of the whole algorithm system is the recognition module and the noise reduction module, which will be introduced in detail below:
  • the acoustic scene recognition module is designed around an LSTM (Long Short-Term Memory) neural network, which has a memory effect on time series; first, the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points, with a sampling rate of 16000 Hz and 16 ms per frame; next, 39-dimensional Mel Frequency Cepstrum Coefficient (MFCC) features are extracted for each frame; the LSTM network reads in one frame of MFCC features at a time but outputs a classification result only once 100 frames have accumulated, i.e. the current environment classification is updated every 1.6 s.
  • the structure of the LSTM neural network is shown in Figure 2.
  • the number of neural units in the input layer is 39, in the recursive hidden layer 512, and in the output layer 9, corresponding to 9 scene categories: factory, street, subway station, railway station, restaurant, sports field, airplane cabin, car interior, and indoor scenes
  • the corresponding training data was downloaded from the freesound website [1], about 2 hours of audio per scene
  • the LSTM network not only processes the current input but also combines it with the previously retained output to realize memory; once 100 frames of memory have accumulated, the classification result is output.
  • the memory update mechanism of the LSTM unit is shown in Figure 3, where C_{n-1} denotes the state retained from the previous frame, f_n the output of the current frame's forget gate, u_n the output of the current frame's update gate, o_n the output of the current frame's output gate, C_n the retained state of the current frame, and h_n the output of the current frame.
  • the LSTM unit combines the current frame's input feature t_n with the previously retained output h_{n-1}, and also feeds in the previous frame's state C_{n-1} for judgment, producing the current frame's output h_n and output state C_n; this iterates until the 100-frame memory condition is satisfied, after which the final output h undergoes a Softmax (normalized exponential function) transform to obtain the predicted probabilities of the output layer.
  • the loss function for training the LSTM network is the cross-entropy of formula (11), where y_i and ŷ_i are the correct classification label and the classification result predicted by the output layer of the LSTM network, respectively.
  • the input signal with noise will be sent to different noise reduction models for frame-by-frame processing.
  • the noise reduction models in different scenarios all use a fully connected neural network structure, as shown in Figure 4.
  • the number of layers and the number of neurons per layer differ between models, depending on the noise characteristics of each scene; for example, factory noise requires 3 hidden layers for good noise reduction performance, while car-interior noise needs only 2 layers to achieve the same effect.
  • the following will take the network structure in the factory scenario as an example for detailed introduction.
  • the model is tuned according to the training-set loss and validation-set loss; for the factory noise scene, a 129-1024-1024-1024-129 network structure was finally selected, in which all hidden units use the ReLU activation function and only the output layer is linear; in addition, to improve the generalization ability of the network, each hidden layer uses a regularization method with a dropout rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001.
  • the noise reduction effect and metrics were all measured on the test set: another 400 sentences from the Aishell data set not duplicated in the training set (2 males and 2 females, 100 sentences each), mixed with the last 20% of the factory noise in NOISEX-92 at five noise pollution levels: -5 dB, 0 dB, 5 dB, 10 dB, and 15 dB.
  • transient noise such as machine knocking in the factory was well suppressed, with almost no audible residual noise.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Otolaryngology (AREA)
  • Neurosurgery (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

Disclosed is an environment adaptive neural network noise reduction method for digital hearing aids. The method comprises executing the following steps in sequence: a preprocessing step, in which a noisy speech signal is received, sampled, divided into frames, and transmitted to an acoustic scene recognition module; a scene recognition step, in which the acoustic scene recognition module recognizes the current acoustic scene and autonomously selects the corresponding neural network model in a neural network noise reduction module; and a neural network noise reduction step. The beneficial effects of the method are: 1. real-time speech processing is guaranteed, with a modest computational load since only the forward propagation of the neural network is performed; 2. the current acoustic scene is recognized and different neural network models are autonomously selected to perform targeted noise reduction for different scenes, guaranteeing better speech quality and intelligibility; and 3. transient noise is effectively suppressed.

Description

Environment adaptive neural network noise reduction method, system and storage medium for digital hearing aids

Technical Field

The present invention relates to the field of software technology, and in particular to an environment adaptive neural network noise reduction method, system and storage medium for digital hearing aids.

Background Art

At present, high-performance digital hearing aids on the market have built-in noise reduction algorithms to eliminate background-noise interference in the environment and meet the requirements of listening comfort. Because digital hearing aids must process speech in real time, the built-in noise reduction algorithms mostly use low-complexity methods such as spectral subtraction and Wiener filtering. These algorithms can only cope with simple, stationary noise interference; in complex noise environments such as low signal-to-noise ratio and transient noise their performance is poor, and the wearing experience for hearing-loss patients is unsatisfactory.
Summary of the Invention

The invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids, which uses the powerful mapping ability of deep neural networks, combined with an environment-adaptive strategy, to realize a high-performance noise reduction algorithm for complex noise environments.

The present invention provides an environment-adaptive neural network noise reduction method for digital hearing aids, comprising executing the following steps in sequence:

Preprocessing step: a noisy speech signal is received, sampled, divided into frames, and transmitted to the acoustic scene recognition module;

Scene recognition step: the acoustic scene recognition module identifies the current acoustic scene, then autonomously selects the corresponding neural network model in the neural network noise reduction module and forwards the signal to it;

Neural network noise reduction step: the neural network noise reduction model receives the classification result sent by the acoustic scene recognition module and performs targeted noise reduction on the noise of the corresponding scene.
As a further improvement of the present invention, in the scene recognition step the acoustic scene recognition module adopts an LSTM neural network structure with a memory effect on time series. The specific steps are as follows:

S1: Mel cepstrum coefficient features of a set dimension are extracted for each frame;

S2: the LSTM neural network reads in one frame of Mel cepstrum coefficient features at a time for processing, and outputs the classification result once a set number of frames has been read.

As a further improvement of the present invention, the LSTM neural network structure includes an input layer, a hidden layer, and an output layer; the neural units of the output layer correspond to different scene categories. The LSTM neural network not only processes the current input but also combines it with the previously retained output, realizing memory; once the set number of frames has been accumulated, the classification result is output.

As a further improvement of the present invention, the memory update principle of the LSTM neural network structure is as follows:

The LSTM neural network combines the current frame's input feature t_n with the previously retained output h_{n-1}, and also feeds in the previous frame's state C_{n-1} for judgment, producing the current frame's output h_n and output state C_n; this iterates until the memory condition of the required number of frames is satisfied, after which a softmax transform of the final output h yields the predicted probabilities of the output layer.

As a further improvement of the present invention, the scene recognition step also includes calculating the loss function during training of the LSTM neural network, according to the following formula:
$$\mathrm{Loss} = -\sum_i y_i \log \hat{y}_i$$

where $y_i$ and $\hat{y}_i$ are the correct classification label and the classification result predicted by the output layer of the LSTM network, respectively.
As a further improvement of the present invention, the noise reduction models for the different scenes all adopt a fully connected neural network structure, but differ in the number of layers and the number of neurons per layer;

The noise reduction model of the fully connected neural network structure is built by executing the following steps:

Training data set step: select clean speech data as the training set, then randomly mix noise data with the clean speech to obtain the required noisy training data;

Model parameter tuning step: use the minimum mean squared error as the cost function, then tune the model parameters according to the training-set loss and validation-set loss to obtain the required neural network structure;

During training, repeated iterations of the back-propagation algorithm achieve a good noise suppression effect;

The validation set is built by selecting clean speech data for validation and mixing it with noise data to obtain noisy validation speech;

The minimum mean squared error is calculated as follows:
$$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{x}_n - x_n\right)^2$$

where MSE is the mean squared error between the network output $\hat{x}_n$ and the clean target $x_n$.
As a further improvement of the present invention, all hidden layer units use the ReLU activation function and only the output layer is linear; in addition, to improve the generalization ability of the network, each hidden layer uses a regularization method with a dropout rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001. During training, the Adam optimization algorithm is used for back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect.

As a further improvement of the present invention, in the preprocessing step the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points; the sampling rate is 16000 Hz and each frame is 16 ms;

In step S1, a 39-dimensional Mel cepstrum coefficient feature is extracted for each frame;

In step S2, the LSTM neural network reads in one frame of Mel cepstrum coefficient features at a time for processing, and outputs the classification result once 100 frames have been read.

The present invention also discloses an environment-adaptive neural network noise reduction system for digital hearing aids, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the claimed method when invoked by the processor.

The present invention also discloses a computer-readable storage medium storing a computer program configured to implement the steps of the claimed method when invoked by a processor.

The beneficial effects of the present invention are: 1. real-time speech processing is guaranteed, since only the forward propagation of the neural network is performed and the computational load is modest; 2. the current acoustic scene is recognized and a matching neural network model is autonomously selected, so noise in different scenes receives targeted noise reduction, ensuring better speech quality and intelligibility; 3. transient noise is effectively suppressed; 4. a better noise reduction effect is achieved in low signal-to-noise-ratio environments.
Description of the Drawings

Figure 1 is a block diagram of the environment-adaptive noise reduction algorithm of the present invention;

Figure 2 is a diagram of the LSTM network structure of the present invention;

Figure 3 is a diagram of the operating mechanism of the LSTM unit of the present invention;

Figure 4 is a block diagram of the fully connected neural network noise reduction model of the present invention;

Figure 5 is a graph of the PESQ evaluation results of the present invention;

Figure 6 is a graph of the STOI evaluation results of the present invention.

Detailed Description
The invention discloses an environment-adaptive neural network noise reduction method for digital hearing aids. The method uses a scene recognition module as a decision-driving module and selects the corresponding neural network noise reduction model according to the acoustic scene, realizing suppression of different noise types. The entire algorithm system of the present invention consists of two parts: a scene recognition module and a neural network noise reduction module, as shown in Figure 1.
Figure 1 is the algorithm block diagram of the entire neural network noise reduction system of the present invention, composed of an acoustic scene recognition module and multiple noise reduction models for different scenes. After the noisy speech signal is sampled and divided into frames, it is first sent to the scene recognition module to determine the current scene type and then forwarded to the corresponding neural network noise reduction model to carry out the noise reduction process. The core of the system lies in the recognition module and the noise reduction module, which are introduced in detail below.
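To make the decision-driven flow of Figure 1 concrete, here is a minimal dispatch-loop sketch in Python. It is illustrative only: the names (denoise_stream, classify, denoisers) and the buffering logic are assumptions of this sketch, not taken from the patent; only the 100-frame (1.6 s) decision interval and the frame-by-frame denoising come from the text.

```python
SCENES = ["factory", "street", "subway_station", "railway_station", "restaurant",
          "sports_field", "airplane_cabin", "car_interior", "indoor"]

def denoise_stream(frames, classify, denoisers, window=100):
    """frames: iterable of per-frame features; classify: maps `window` buffered
    frames to a scene label; denoisers: dict mapping scene label -> model."""
    buffer, scene, output = [], "indoor", []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == window:      # new scene decision every 100 frames (1.6 s)
            scene = classify(buffer)   # returns one of SCENES
            buffer = []
        output.append(denoisers[scene](frame))  # frame-by-frame noise reduction
    return output
```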
The acoustic scene recognition module is designed around an LSTM (Long Short-Term Memory) neural network, which has a memory effect on time series. First, the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points, with a sampling rate of 16000 Hz and 16 ms per frame. Next, 39-dimensional Mel Frequency Cepstrum Coefficient (MFCC) features are extracted for each frame. The LSTM network reads in one frame of MFCC features at a time, but outputs a classification result only once 100 frames have accumulated; that is, the current environment classification is updated every 1.6 s.
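As a rough illustration of this front end, the sketch below assumes the librosa library; the split of the 39 dimensions into 13 static MFCCs plus delta and delta-delta coefficients, and the non-overlapping hop (hop length equal to the 256-point frame length), are common conventions assumed here rather than details given in the patent.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    # 16 kHz sampling; 256-point frames, i.e. 16 ms per frame, as described above.
    y, sr = librosa.load(wav_path, sr=16000)
    # 13 static MFCCs per frame (assumed split of the 39 dimensions).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=256, hop_length=256)
    # Delta and delta-delta coefficients bring the feature to 39 dimensions.
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T   # shape: (num_frames, 39)
```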
The structure of the LSTM neural network is shown in Figure 2. The input layer has 39 neural units, the recursive hidden layer has 512, and the output layer has 9, corresponding to 9 scene categories: factory, street, subway station, railway station, restaurant, sports field, airplane cabin, car interior, and indoor scenes. The corresponding training data was downloaded from the freesound website [1], about 2 hours of audio per scene. The LSTM network not only processes the current input but also combines it with the previously retained output to realize memory; once 100 frames of memory have accumulated, the classification result is output.
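A sketch of this 39-512-9 classifier in PyTorch; the layer sizes follow the text and Figure 2, while the batching details and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SceneClassifier(nn.Module):
    """Input layer of 39 units, recurrent hidden layer of 512, 9-way output."""
    def __init__(self, n_feats=39, hidden=512, n_scenes=9):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_feats, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_scenes)

    def forward(self, x):
        # x: (batch, 100, 39) -- one decision per 100 accumulated frames (1.6 s).
        _, (h_n, _) = self.lstm(x)   # final hidden state after the 100th frame
        return self.out(h_n[-1])     # logits; softmax is applied in the loss
```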
The memory update mechanism of the LSTM unit is shown in Figure 3, where C_{n-1} denotes the state retained from the previous frame, f_n the output of the current frame's forget gate, u_n the output of the current frame's update gate, o_n the output of the current frame's output gate, C_n the retained state of the current frame, and h_n the output of the current frame. The LSTM unit combines the current frame's input feature t_n with the previously retained output h_{n-1}, and also feeds in the previous frame's state C_{n-1} for judgment, producing the current frame's output h_n and output state C_n; this iterates until the 100-frame memory condition is satisfied, after which the final output h undergoes a Softmax (normalized exponential function) transform to obtain the predicted probabilities of the output layer.
The gates and outputs are computed as follows, where $\sigma(\cdot)$ and $\tanh(\cdot)$ denote the sigmoid and hyperbolic tangent activation functions, respectively:

$$\tilde{C}_n = \tanh(W_c[h_{n-1}, x_n] + b_c) \tag{5}$$

$$f_n = \sigma(W_f[h_{n-1}, x_n] + b_f) \tag{6}$$

$$u_n = \sigma(W_u[h_{n-1}, x_n] + b_u) \tag{7}$$

$$o_n = \sigma(W_o[h_{n-1}, x_n] + b_o) \tag{8}$$

$$C_n = u_n \ast \tilde{C}_n + f_n \ast C_{n-1} \tag{9}$$

$$h_n = o_n \ast \tanh(C_n) \tag{10}$$
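Read as code, equations (5)-(10) form a single update step. The numpy sketch below mirrors them directly; the weight matrices W and biases b (keyed 'c', 'f', 'u', 'o') are placeholders standing in for learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_n, h_prev, C_prev, W, b):
    """One LSTM memory update following equations (5)-(10)."""
    z = np.concatenate([h_prev, x_n])        # [h_{n-1}, x_n]
    C_tilde = np.tanh(W['c'] @ z + b['c'])   # (5) candidate state
    f_n = sigmoid(W['f'] @ z + b['f'])       # (6) forget gate
    u_n = sigmoid(W['u'] @ z + b['u'])       # (7) update gate
    o_n = sigmoid(W['o'] @ z + b['o'])       # (8) output gate
    C_n = u_n * C_tilde + f_n * C_prev       # (9) retained state of the current frame
    h_n = o_n * np.tanh(C_n)                 # (10) output of the current frame
    return h_n, C_n
```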
The loss function for training the LSTM network is the cross-entropy, computed as in formula (11), where $y_i$ and $\hat{y}_i$ are the correct classification label and the classification result predicted by the output layer of the LSTM network, respectively:

$$\mathrm{Loss} = -\sum_i y_i \log \hat{y}_i \tag{11}$$
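The softmax transform of the final output and the cross-entropy of formula (11), as a small numpy sketch; the epsilon term is a numerical-stability addition of this sketch, not part of the formula.

```python
import numpy as np

def softmax_cross_entropy(h, y):
    """h: final LSTM output (length 9); y: one-hot scene label (length 9)."""
    p = np.exp(h - h.max())
    p /= p.sum()                             # softmax -> predicted probabilities
    return -np.sum(y * np.log(p + 1e-12))    # cross-entropy, formula (11)
```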
According to the classification result of the acoustic scene classification module, the input noisy audio signal is sent to the corresponding noise reduction model for frame-by-frame processing. The noise reduction models for the different scenes all use a fully connected neural network structure, as shown in Figure 4, but the number of layers and the number of neurons per layer differ, depending on the noise characteristics of each scene; for example, factory noise requires 3 hidden layers for good noise reduction performance, while car-interior noise needs only 2 layers to achieve the same effect. The network structure for the factory scene is introduced in detail below as an example.
To train the noise reduction model of the fully connected neural network (Figure 4), a sufficiently large training data set must first be prepared, which is also an important aspect of improving the network's generalization ability. We therefore selected 1200 sentences from the Aishell Chinese data set [2] (6 males and 6 females, 100 sentences each) as the clean speech data of the training set, then used the factory noise from the NOISEX-92 [3] noise database (the first 60%) as noise data and mixed it randomly with the clean speech, with the mixing signal-to-noise ratio uniformly distributed over the interval [-5, 20] dB, yielding about 25 hours of noisy training data in total. To tune the model parameters, a validation set is needed: another 400 sentences from the Aishell data set (2 males and 2 females, 100 sentences each) were selected as clean validation speech and uniformly mixed with the middle 20% of the NOISEX-92 factory noise, giving about 8 hours of noisy validation data.
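The random mixing step can be sketched as follows, assuming numpy; scaling the noise so the speech-to-noise power ratio hits a target drawn uniformly from [-5, 20] dB is standard practice, not a procedure spelled out in the text.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix noise into clean speech at the given SNR in dB."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]       # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Training-set mixing as described: SNR drawn uniformly from [-5, 20] dB.
# noisy = mix_at_snr(clean, factory_noise, np.random.uniform(-5.0, 20.0))
```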
$$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{x}_n - x_n\right)^2 \tag{12}$$
Using the minimum mean squared error (MMSE) of equation (12) as the cost function, the model parameters were tuned according to the training-set loss and validation-set loss, finally settling on the following for the factory noise scene: a 129-1024-1024-1024-129 network structure, in which all hidden units use the ReLU activation function and only the output layer is linear; in addition, to improve the generalization ability of the network, each hidden layer uses a regularization method with a dropout rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001. During training, the Adam optimization algorithm (adaptive moment estimation, an efficient gradient-based optimizer) is used for back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect. Once the model is trained, only forward propagation is needed in the hearing aid; the computational load is modest and meets real-time processing requirements. The PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility) results after noise reduction are shown in Figures 5 and 6. The noise reduction effect and metrics were all measured on the test set: another 400 sentences from the Aishell data set not duplicated in the training set (2 males and 2 females, 100 sentences each), mixed with the last 20% of the NOISEX-92 factory noise at five noise pollution levels: -5 dB, 0 dB, 5 dB, 10 dB, and 15 dB. In addition, subjective listening found that transient noise such as machine knocking in the factory was well suppressed, with almost no audible residual noise.
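The factory-scene model and training recipe as a hedged PyTorch sketch: the 129-1024-1024-1024-129 topology, ReLU hidden units, linear output, 0.8 discard rate, L2 coefficient 0.00001, and Adam at learning rate 0.0001 for 200 iterations all come from the text; interpreting the discard rate as nn.Dropout(p=0.8) and mapping the L2 term onto Adam's weight_decay are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FactoryDenoiser(nn.Module):
    """129-1024-1024-1024-129 fully connected model for the factory scene."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(129, 1024), nn.ReLU(), nn.Dropout(p=0.8),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.8),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.8),
            nn.Linear(1024, 129),   # linear output layer
        )

    def forward(self, x):
        return self.net(x)

model = FactoryDenoiser()
criterion = nn.MSELoss()   # minimum mean squared error cost, equation (12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
# for epoch in range(200): ...   # 200 iterations at learning rate 0.0001
```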
The beneficial effects of the present invention are: 1. real-time speech processing is guaranteed, since only the forward propagation of the neural network is performed and the computational load is modest; 2. the current acoustic scene is recognized and a matching neural network model is autonomously selected, so noise in different scenes receives targeted noise reduction, ensuring better speech quality and intelligibility; 3. transient noise is effectively suppressed; 4. a better noise reduction effect is achieved in low signal-to-noise-ratio environments.
The above further describes the present invention in detail in conjunction with specific preferred embodiments, but the specific implementation of the present invention should not be considered limited to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions may be made without departing from the concept of the present invention, and these should all be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. An environment-adaptive neural network noise reduction method for digital hearing aids, characterized in that it comprises executing the following steps in sequence:
    Preprocessing step: a noisy speech signal is received, sampled, divided into frames, and transmitted to the acoustic scene recognition module;
    Scene recognition step: the acoustic scene recognition module identifies the current acoustic scene, then autonomously selects the corresponding neural network model in the neural network noise reduction module and forwards the signal to it;
    Neural network noise reduction step: the neural network noise reduction model receives the classification result sent by the acoustic scene recognition module and performs targeted noise reduction on the noise of the corresponding scene.
  2. The environment-adaptive neural network noise reduction method according to claim 1, characterized in that, in the scene recognition step, the acoustic scene recognition module adopts an LSTM neural network structure with a memory effect on time series, with the following specific steps:
    S1: Mel cepstrum coefficient features of a set dimension are extracted for each frame;
    S2: the LSTM neural network reads in one frame of Mel cepstrum coefficient features at a time for processing, and outputs the classification result once a set number of frames has been read.
  3. The environment-adaptive neural network noise reduction method according to claim 2, characterized in that the LSTM neural network structure includes an input layer, a hidden layer, and an output layer; the neural units of the output layer correspond to different scene categories; the LSTM neural network not only processes the current input but also combines it with the previously retained output, realizing memory; once the accumulated memory reaches the set number of frames, the classification result is output.
  4. The environment-adaptive neural network noise reduction method according to claim 3, characterized in that the memory update principle of the LSTM neural network structure is as follows:
    The LSTM neural network combines the current frame's input feature t_n with the previously retained output h_{n-1}, and also feeds in the previous frame's state C_{n-1} for judgment, producing the current frame's output h_n and output state C_n; this iterates until the memory condition of the required number of frames is satisfied, after which a softmax transform of the final output h yields the predicted probabilities of the output layer.
  5. The environment-adaptive neural network noise reduction method according to claim 4, characterized in that the scene recognition step further comprises calculating the loss function during training of the LSTM neural network, according to the following formula:

    $$\mathrm{Loss} = -\sum_i y_i \log \hat{y}_i$$

    where $y_i$ and $\hat{y}_i$ are the correct classification label and the classification result predicted by the output layer of the LSTM network, respectively.
  6. The environment-adaptive neural network noise reduction method according to claim 1, characterized in that the noise reduction models for the different scenes all adopt a fully connected neural network structure, but differ in the number of layers and the number of neurons per layer;
    The noise reduction model of the fully connected neural network structure is built by executing the following steps:
    Training data set step: select clean speech data as the training set, then randomly mix noise data with the clean speech to obtain the required noisy training data;
    Model parameter tuning step: use the minimum mean squared error as the cost function, then tune the model parameters according to the training-set loss and validation-set loss to obtain the required neural network structure;
    During training, repeated iterations of the back-propagation algorithm achieve a good noise suppression effect;
    The validation set is built by selecting clean speech data for validation and mixing it with noise data to obtain noisy validation speech;
    The minimum mean squared error is calculated as follows:

    $$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{x}_n - x_n\right)^2$$

    where MSE is the mean squared error.
  7. The environment-adaptive neural network noise reduction method according to claim 6, characterized in that all hidden layer units use the ReLU activation function and only the output layer is linear; in addition, to improve the generalization ability of the network, each hidden layer uses a regularization method with a dropout rate of 0.8, and the coefficient of the L2 regularization term is set to 0.00001; during training, the Adam optimization algorithm is used for back-propagation, iterating 200 times at a learning rate of 0.0001 to achieve a good noise suppression effect.
  8. The environment-adaptive neural network noise reduction method according to claim 2, characterized in that, in the preprocessing step, the speech signal received by the microphone is sampled and divided into time-domain frames of 256 points; the sampling rate is 16000 Hz and each frame is 16 ms;
    In step S1, a 39-dimensional Mel cepstrum coefficient feature is extracted for each frame;
    In step S2, the LSTM neural network reads in one frame of Mel cepstrum coefficient features at a time for processing, and outputs the classification result once 100 frames have been read.
  9. An environment-adaptive neural network noise reduction system for digital hearing aids, characterized in that it comprises: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of any one of claims 1-8 when invoked by the processor.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program configured to implement the steps of the method of any one of claims 1-8 when invoked by a processor.
PCT/CN2019/117075 2019-03-06 2019-11-11 Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium WO2020177371A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910168122.4 2019-03-06
CN201910168122.4A CN109859767B (en) 2019-03-06 2019-03-06 Environment self-adaptive neural network noise reduction method, system and storage medium for digital hearing aid

Publications (1)

Publication Number Publication Date
WO2020177371A1

Family

ID=66899968

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117075 WO2020177371A1 (en) 2019-03-06 2019-11-11 Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium

Country Status (2)

Country Link
CN (1) CN109859767B (en)
WO (1) WO2020177371A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447183A (en) * 2020-11-16 2021-03-05 北京达佳互联信息技术有限公司 Training method and device for audio processing model, audio denoising method and device, and electronic equipment
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113345464A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Voice extraction method, system, device and storage medium
CN113707159A (en) * 2021-08-02 2021-11-26 南昌大学 Electric network bird-involved fault bird species identification method based on Mel language graph and deep learning
CN113823322A (en) * 2021-10-26 2021-12-21 武汉芯昌科技有限公司 Simplified and improved Transformer model-based voice recognition method
CN114626412A (en) * 2022-02-28 2022-06-14 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN114869224A (en) * 2022-03-28 2022-08-09 浙江大学 Lung disease classification detection method based on cooperative deep learning and lung auscultation sound
US20220256294A1 (en) * 2019-05-09 2022-08-11 Sonova Ag Hearing Device System And Method For Processing Audio Signals
CN117290669A (en) * 2023-11-24 2023-12-26 之江实验室 Optical fiber temperature sensing signal noise reduction method, device and medium based on deep learning

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859767B (en) * 2019-03-06 2020-10-13 哈尔滨工业大学(深圳) Environment self-adaptive neural network noise reduction method, system and storage medium for digital hearing aid
CN110379412B (en) 2019-09-05 2022-06-17 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
DE102019213809B3 (en) 2019-09-11 2020-11-26 Sivantos Pte. Ltd. Method for operating a hearing aid and hearing aid
CN110996208B (en) * 2019-12-13 2021-07-30 恒玄科技(上海)股份有限公司 Wireless earphone and noise reduction method thereof
IT201900024454A1 (en) 2019-12-18 2021-06-18 Storti Gianampellio LOW POWER SOUND DEVICE FOR NOISY ENVIRONMENTS
CN113129876B (en) * 2019-12-30 2024-05-14 Oppo广东移动通信有限公司 Network searching method, device, electronic equipment and storage medium
CN111312221B (en) * 2020-01-20 2022-07-22 宁波舜韵电子有限公司 Intelligent range hood based on voice control
CN111491245B (en) * 2020-03-13 2022-03-04 天津大学 Digital hearing aid sound field identification algorithm based on cyclic neural network and implementation method
CN111508509A (en) * 2020-04-02 2020-08-07 广东九联科技股份有限公司 Sound quality processing system and method based on deep learning
CN112565997B (en) * 2020-12-04 2022-03-22 可孚医疗科技股份有限公司 Adaptive noise reduction method and device for hearing aid, hearing aid and storage medium
CN113160789A (en) * 2021-03-05 2021-07-23 南京每深智能科技有限责任公司 Active noise reduction device and method
CN113160844A (en) * 2021-04-27 2021-07-23 山东省计算中心(国家超级计算济南中心) Speech enhancement method and system based on noise background classification
CN113259824B (en) * 2021-05-14 2021-11-30 谷芯(广州)技术有限公司 Real-time multi-channel digital hearing aid noise reduction method and system
CN113266933A (en) * 2021-05-24 2021-08-17 青岛海尔空调器有限总公司 Voice control method of air conditioner and air conditioner
CN113724726A (en) * 2021-08-18 2021-11-30 中国长江电力股份有限公司 Unit operation noise suppression processing method based on full-connection neural network
CN114245280B (en) * 2021-12-20 2023-06-23 清华大学深圳国际研究生院 Scene self-adaptive hearing aid audio enhancement system based on neural network
CN114640937B (en) * 2022-05-18 2022-09-02 深圳市听多多科技有限公司 Hearing aid function implementation method based on wearable device system and wearable device
CN114640938B (en) 2022-05-18 2022-08-23 深圳市听多多科技有限公司 Hearing aid function implementation method based on Bluetooth headset chip and Bluetooth headset
CN116367063B (en) * 2023-04-23 2023-11-14 郑州大学 Bone conduction hearing aid equipment and system based on embedded

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6453284B1 (en) * 1999-07-26 2002-09-17 Texas Tech University Health Sciences Center Multiple voice tracking system and method
CN101529929A (en) * 2006-09-05 2009-09-09 Gn瑞声达A/S A hearing aid with histogram based sound environment classification
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
CN105611477A (en) * 2015-12-27 2016-05-25 北京工业大学 Depth and breadth neural network combined speech enhancement algorithm of digital hearing aid
CN108073856A (en) * 2016-11-14 2018-05-25 华为技术有限公司 The recognition methods of noise signal and device
CN108877823A (en) * 2018-07-27 2018-11-23 三星电子(中国)研发中心 Sound enhancement method and device
CN108962278A (en) * 2018-06-26 2018-12-07 常州工学院 A kind of hearing aid sound scene classification method
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN109859767A (en) * 2019-03-06 2019-06-07 哈尔滨工业大学(深圳) A kind of environment self-adaption neural network noise-reduction method, system and storage medium for digital deaf-aid

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019014890A1 (en) * 2017-07-20 2019-01-24 大象声科(深圳)科技有限公司 Universal single channel real-time noise-reduction method
CN109378010A (en) * 2018-10-29 2019-02-22 珠海格力电器股份有限公司 Training method, the speech de-noising method and device of neural network model

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6453284B1 (en) * 1999-07-26 2002-09-17 Texas Tech University Health Sciences Center Multiple voice tracking system and method
CN101529929A (en) * 2006-09-05 2009-09-09 Gn瑞声达A/S A hearing aid with histogram based sound environment classification
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
CN105611477A (en) * 2015-12-27 2016-05-25 北京工业大学 Depth and breadth neural network combined speech enhancement algorithm of digital hearing aid
CN108073856A (en) * 2016-11-14 2018-05-25 华为技术有限公司 The recognition methods of noise signal and device
CN108962278A (en) * 2018-06-26 2018-12-07 常州工学院 A kind of hearing aid sound scene classification method
CN108877823A (en) * 2018-07-27 2018-11-23 三星电子(中国)研发中心 Sound enhancement method and device
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN109859767A (en) * 2019-03-06 2019-06-07 哈尔滨工业大学(深圳) A kind of environment self-adaption neural network noise-reduction method, system and storage medium for digital deaf-aid

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220256294A1 (en) * 2019-05-09 2022-08-11 Sonova Ag Hearing Device System And Method For Processing Audio Signals
US11832058B2 (en) * 2019-05-09 2023-11-28 Sonova Ag Hearing device system and method for processing audio signals
CN112447183A (en) * 2020-11-16 2021-03-05 北京达佳互联信息技术有限公司 Training method and device for audio processing model, audio denoising method and device, and electronic equipment
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113345464A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Voice extraction method, system, device and storage medium
CN113707159A (en) * 2021-08-02 2021-11-26 南昌大学 Electric network bird-involved fault bird species identification method based on Mel language graph and deep learning
CN113707159B (en) * 2021-08-02 2024-05-03 南昌大学 Power grid bird-involved fault bird species identification method based on Mel language graph and deep learning
CN113823322A (en) * 2021-10-26 2021-12-21 武汉芯昌科技有限公司 Simplified and improved Transformer model-based voice recognition method
CN114626412B (en) * 2022-02-28 2024-04-02 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN114626412A (en) * 2022-02-28 2022-06-14 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN114869224A (en) * 2022-03-28 2022-08-09 浙江大学 Lung disease classification detection method based on cooperative deep learning and lung auscultation sound
CN117290669B (en) * 2023-11-24 2024-02-06 之江实验室 Optical fiber temperature sensing signal noise reduction method, device and medium based on deep learning
CN117290669A (en) * 2023-11-24 2023-12-26 之江实验室 Optical fiber temperature sensing signal noise reduction method, device and medium based on deep learning

Also Published As

Publication number Publication date
CN109859767B (en) 2020-10-13
CN109859767A (en) 2019-06-07

Similar Documents

Publication Publication Date Title
WO2020177371A1 (en) Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
Zhao et al. Perceptually guided speech enhancement using deep neural networks
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN107393550B (en) Voice processing method and device
CN110867181B (en) Multi-target speech enhancement method based on SCNN and TCNN joint estimation
CN111583954B (en) Speaker independent single-channel voice separation method
CN109841206A (en) A kind of echo cancel method based on deep learning
CN110428849B (en) Voice enhancement method based on generation countermeasure network
CN106782497B (en) Intelligent voice noise reduction algorithm based on portable intelligent terminal
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
CN113744749B (en) Speech enhancement method and system based on psychoacoustic domain weighting loss function
CN107360497B (en) Calculation method and device for estimating reverberation component
Dionelis et al. Modulation-domain Kalman filtering for monaural blind speech denoising and dereverberation
CN112259117A (en) Method for locking and extracting target sound source
CN111739562A (en) Voice activity detection method based on data selectivity and Gaussian mixture model
CN103971697B (en) Sound enhancement method based on non-local mean filtering
TWI749547B (en) Speech enhancement system based on deep learning
Kim et al. iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement.
Chen et al. Leveraging heteroscedastic uncertainty in learning complex spectral mapping for single-channel speech enhancement
Wang Research progress in speech enhancement technology
CN108573698B (en) Voice noise reduction method based on gender fusion information
CN113744725B (en) Training method of voice endpoint detection model and voice noise reduction method
CN107393559B (en) Method and device for checking voice detection result
Ni et al. Multi-channel dictionary learning speech enhancement based on power spectrum

Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19917642; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: PCT application non-entry in European phase (Ref document number: 19917642; Country of ref document: EP; Kind code of ref document: A1)