CN112346013A - Binaural sound source positioning method based on deep learning - Google Patents

Binaural sound source positioning method based on deep learning

Info

Publication number
CN112346013A
CN112346013A (application CN202011173630.0A)
Authority
CN
China
Prior art keywords
binaural
sound source
neural network
positioning
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011173630.0A
Other languages
Chinese (zh)
Other versions
CN112346013B (en)
Inventor
张雯
郗经纬
杨懿晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202011173630.0A priority Critical patent/CN112346013B/en
Publication of CN112346013A publication Critical patent/CN112346013A/en
Application granted granted Critical
Publication of CN112346013B publication Critical patent/CN112346013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20 Position of source determined by a plurality of spaced direction-finders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Stereophonic System (AREA)

Abstract

The invention relates to a binaural sound source localization method based on deep learning, in which a convolutional neural network (CNN) processes the binaural received signals and a multitask neural network (MNN) estimates the sound source azimuth and pitch simultaneously. The parameters that determine the three-dimensional DOA estimate of the target under different environments (noise and reverberation) are learned by the neural network and then used for three-dimensional DOA estimation. The same trained model can estimate the sound source direction in different environments, so no environment-specific processing is required, and the azimuth and pitch angles of the target can be estimated accurately under a wide range of environmental conditions. The algorithm also achieves high localization accuracy and outperforms existing binaural sound source localization algorithms in a variety of complex environments. The method effectively mitigates the effect of environmental interference on localization that limits traditional methods, has broad application prospects, and can be put directly into use.

Description

Binaural sound source positioning method based on deep learning
Technical Field
The invention belongs to the technical fields of human-computer interaction, deep learning and binaural localization, and relates to a binaural sound source localization method based on deep learning, in particular a binaural-hearing localization method based on a deep neural network.
Background
Binaural sound source localization is intended to achieve the same capabilities as human listening localization, i.e. by simulating the binaural auditory principle, using two acoustic sensors to identify the spatial position of the sound source. The main advantages of a dual sensor array are small size, fast response time and easy calibration compared to many positioning systems deployed in audio, radar and sonar applications.
Binaural localization cues can be divided into binaural cues and monaural cues. Binaural cues refer to the interaural phase and level differences between the left- and right-ear signals and are commonly used to determine the lateral direction (left, front, right, i.e. within the front half of the horizontal plane); monaural cues refer to spectral cues caused by scattering and diffraction of sound waves at the pinna and around the body, and are mainly used for elevation localization and front-back disambiguation. The Head-Related Transfer Function (HRTF) is defined as the frequency-domain transfer function describing the propagation of an acoustic signal from the sound source to the two ears in a free-field environment. Binaural localization cues can be extracted from the HRTFs.
At present, the more traditional binaural sound source localization algorithms are based on cross-correlation techniques, in which binaural cues are estimated from the two microphone signals and the sound source direction is obtained by comparing them against a binaural cue data set; see: M. Raspaud, H. Viste, and G. Evangelista, "Binaural source localization by joint estimation of ILD and ITD," IEEE Trans. Audio, Speech, Language Process., vol. 18, pp. 68-77, 2010, and R. Parisi, F. Camoes, and A. Uncini, "Cepstrum prediction for binaural source localization in reverberant environments," IEEE Signal Processing Letters, vol. 19, pp. 99-102, 2012. Model-based algorithms perform sound source localization by maximum likelihood estimation over the statistics of a probabilistic model; see: J. Woodruff and D. Wang, "Binaural localization of multiple sources in reverberant and noisy environments," IEEE Trans. Audio, Speech, Language Process., vol. 20, pp. 1503-1512, 2012. Elevation can also be estimated from spectral differences, i.e. by comparing the spectral difference between the received binaural signal and the HRTF data; see: R. R. Hammond and P. J. Jackson, "Robust full-sphere binaural sound source localization using interaural and spectral cues," in ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
With the rise of machine learning, neural-network-based methods have been widely used to solve the binaural auditory localization problem, with the localization problem converted into a classification problem using a convolutional neural network (CNN). Experimental results show that in simple sound source localization tasks such networks achieve localization performance comparable to that of human listeners; see: N. Ma, T. May, and G. J. Brown, "Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 12, pp. 2444-2453, 2017, and F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza, "Localizing speakers in multiple rooms by using deep neural networks," Computer Speech & Language, vol. 49, pp. 83-106, 2018. End-to-end systems have also been used for binaural SSL; see: P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, "End-to-end binaural sound localisation from the raw waveform," in ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 451-455, Brighton, United Kingdom, May 2019. However, the accuracy of binaural sound source localization still faces the challenges of noisy and reverberant environments, as well as the simultaneous localization of azimuth and pitch angles.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides a binaural sound source positioning method based on deep learning, which estimates the azimuth and pitch angles of a target speech signal simultaneously on a binaural platform.
Technical scheme
A binaural sound source localization method based on deep learning, characterized in that the receiving platform is a binaural platform consisting of 2 array elements and that the method comprises the following steps:
Step 1: divide the azimuth of the received signal into N_θ candidate directions θ_1, θ_2, …, θ_{N_θ}, and divide the pitch angle into N_φ candidate directions φ_1, φ_2, …, φ_{N_φ};
Target speech from different directions and pitch angles is received using both ears:
Y_l(t,f) = S(t,f)·B_l(f,Θ) + N_l(t,f)
Y_r(t,f) = S(t,f)·B_r(f,Θ) + N_r(t,f)
where t, f denote the time and frequency indices; Y(t,f), S(t,f) and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source and the superimposed noise, respectively; and the subscripts l, r denote the left and right ears. B_l(f,Θ) and B_r(f,Θ) denote the generated binaural room transfer functions;
preprocessing the received signals: extract the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f) of the binaural signals;
Step 2, building a convolutional neural network for extracting binaural positioning characteristics:
the input branch of the magnitude spectrum is followed by 16 convolution kernels of size 3×2, extracting binaural and monaural features simultaneously; the IPD input branch is followed by 16 convolution kernels of size 3×1 to extract the binaural features;
in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which 4 convolutional layers are used to search for features suitable for localization; the first 2 use 64 convolution kernels of size 3×1 with a max-pooling layer of size 2×1 added, and the last 2 use 128 convolution kernels of size 3×1; all convolutional layers are activated by rectified linear units (ReLU) and processed by batch normalization;
the outputs of the two branches are flattened and concatenated, i.e. the magnitude features and the IPD features are merged; the merged features are then passed in turn through two fully connected layers of sizes 8192 and 4096, the latter of which forms the shared features prepared for the subsequent sound source localization;
Step 3: the output of the convolutional neural network is the shared feature layer, which is connected to the multitask neural network and serves as its input; the multitask neural network comprises two branches representing the estimates of azimuth and elevation, respectively; each branch has five fully connected layers, and the two parallel output layers use softmax activation;
Step 4: perform multi-environment training on the networks of Steps 2 and 3: divide the data extracted in Step 1 into training data and validation data, train the networks of Steps 2 and 3 on the training data under multiple environments, and validate the trained networks on the validation data to obtain the multi-environment-trained network;
Step 5: preprocess the speech signal received by the receiving platform to obtain the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f), which are used as the input of the multi-environment-trained network; the output of the network is the angle information for binaural sound source localization.
Advantageous effects
The invention provides a binaural sound source localization method based on deep learning, in which a convolutional neural network processes the binaural received signals and a multitask neural network (MNN) simultaneously estimates the sound source azimuth and pitch. The parameters that determine the three-dimensional DOA estimate of the target under different environments (noise and reverberation) are learned by the neural network and then used for three-dimensional DOA estimation. The method estimates the azimuth and elevation of the target well under different environments, achieves higher localization accuracy, greatly simplifies the computation compared with earlier algorithms, and overcomes their shortcomings.
The invention has the beneficial effects that:
The same trained model can be used to estimate the sound source direction in different environments, avoiding environment-specific processing, and the target azimuth and pitch angles can be estimated accurately under a wide range of environmental conditions. The algorithm also achieves high localization accuracy and outperforms existing binaural sound source localization algorithms in a variety of complex environments. The method effectively mitigates the effect of environmental interference on localization that limits traditional methods, has broad application prospects, and can be put directly into use.
Drawings
FIG. 1: structure of the deep neural network of the binaural sound source localization system of this patent. The system consists of a preprocessing stage, a convolutional neural network stage and a multitask neural network stage. The preprocessing stage extracts the magnitude spectra E_l(t,f), E_r(t,f) and the interaural phase difference IPD(t,f) from the binaural signals Y_l(t,f), Y_r(t,f) and feeds them into the CNN separately. The CNN extracts features from the two kinds of input data and outputs them to the shared feature layer. Through the shared feature layer the CNN is connected to the multitask neural network stage, which performs the two subtasks of azimuth localization and pitch angle localization.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the technical scheme adopted by the invention for solving the technical problems is as follows: a solution algorithm for binaural received signal processing using a convolutional Neural Network (MNN) and simultaneous estimation of the azimuth-elevation of the sound source using a multi task Neural Network (MNN), consisting essentially of the steps of:
1) constructing a data set for training a neural network;
the receiving platform is assumed to be a binaural platform, i.e. the array consists of 2 array elements. Dividing azimuth into NθA is prepared from
Figure RE-GDA0002862721790000051
Divide pitch angle into
Figure RE-GDA0002862721790000052
A is prepared from
Figure RE-GDA0002862721790000053
Target voices from different directions and pitch angles are received with both ears, the received signals are preprocessed, the binaural magnitude spectra and phase differences are extracted, and the data set required for network training is constructed;
2) build a convolutional neural network for extracting the binaural localization features;
3) construct a multitask neural network for simultaneously localizing the azimuth and pitch angles;
4) perform multi-environment training on the networks of steps 2) and 3);
5) estimate the direction of the target speech using the trained convolutional neural network and multitask neural network.
The basic idea of the invention is to train a convolutional neural network for extracting three-dimensional localization features and a multitask neural network for simultaneously estimating azimuth and pitch, and then to estimate the azimuth and pitch angles of the sound source from the received speech signal using the trained networks.
Setting simulation environment parameters:
  • Room size: length 5 m, width 5 m, height 3 m;
  • Head position coordinates: 2.5 m (length), 2.5 m (width), 1.5 m (height);
  • Distance between the sound source and the center of the head: 1 m;
  • Angle estimation categories: the azimuth angles are divided into 25 classes, [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°]; the pitch angles are divided into 50 classes, uniformly distributed from -45° to 230.625° with a step of 5.625°. The 25 azimuth positions and 50 pitch positions together form 1250 spatial positions (enumerated in the sketch after this list).
  • Reverberation conditions: the reflection coefficient of the room walls is adjusted with the image-source model, and the binaural room transfer functions (BRTFs) are generated using the head-related impulse responses (HRIRs) provided by the CIPIC database. There are 8 reverberation levels, with reverberation times uniformly distributed from 150 ms to 500 ms in steps of 50 ms.
  • Noise conditions: there are 7 noise levels, with signal-to-noise ratios ranging from 5 dB to 35 dB in steps of 5 dB.
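As a minimal illustrative sketch (not part of the patent text), the angle grid above can be enumerated as follows; the variable names are assumptions made for illustration only:

```python
import numpy as np

# 25 azimuth classes: [-80, -65, -55] + [-45 : 5 : 45] + [55, 65, 80] degrees
azimuths = np.concatenate((
    [-80.0, -65.0, -55.0],
    np.arange(-45.0, 45.0 + 1e-6, 5.0),   # 19 values from -45 to 45 in 5-degree steps
    [55.0, 65.0, 80.0],
))

# 50 pitch classes: uniformly spaced from -45 to 230.625 degrees, step 5.625
pitches = -45.0 + 5.625 * np.arange(50)

assert azimuths.size == 25 and np.isclose(pitches[-1], 230.625)

# 25 x 50 = 1250 candidate spatial positions
positions = [(az, el) for az in azimuths for el in pitches]
print(len(positions))   # 1250
```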
Step one: a data set for training the neural network is constructed.
This patent assumes that, in a noisy and reverberant environment, a single sound-source signal is captured by the left- and right-ear microphones of a binaural system. The signal captured in each time-frequency unit of the short-time Fourier transform domain is written as
Y_l(t,f) = S(t,f)·B_l(f,Θ) + N_l(t,f)
Y_r(t,f) = S(t,f)·B_r(f,Θ) + N_r(t,f)
where t, f denote the time and frequency indices; Y(t,f), S(t,f) and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source and the superimposed noise, respectively; and the subscripts l, r denote the left and right ears. B_l(f,Θ) and B_r(f,Θ) denote the generated binaural room transfer functions.
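Training mixtures consistent with this signal model can be generated in the time domain before the STFT is taken. The following is a minimal sketch only, not the patent's implementation: it assumes a mono source array and a pair of binaural room impulse responses `brir_l`, `brir_r` are available, and that the additive noise is white:

```python
import numpy as np

def synthesize_binaural(source, brir_l, brir_r, snr_db, rng=None):
    """Convolve a mono source with left/right binaural room impulse responses
    and add white noise at the requested SNR (illustrative only)."""
    rng = np.random.default_rng(0) if rng is None else rng
    y_l = np.convolve(source, brir_l)[: len(source)]
    y_r = np.convolve(source, brir_r)[: len(source)]
    # scale the noise so that 10*log10(signal power / noise power) = snr_db
    p_sig = 0.5 * (np.mean(y_l ** 2) + np.mean(y_r ** 2))
    sigma = np.sqrt(p_sig / 10.0 ** (snr_db / 10.0))
    return (y_l + rng.normal(0.0, sigma, y_l.shape),
            y_r + rng.normal(0.0, sigma, y_r.shape))
```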
The magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals are extracted as follows:
E_l(t,f) = 20 log10 |Y_l(t,f)|
E_r(t,f) = 20 log10 |Y_r(t,f)|
Next, the interaural phase difference IPD(t,f) of the binaural signals is extracted as the phase difference between the left- and right-ear signals:
IPD(t,f) = ∠Y_l(t,f) - ∠Y_r(t,f)
in the method, the binaural amplitude spectrum and the binaural phase difference are used as the input of the network. The output data of the network, i.e. the tags, are then constructed.
Because the network is required to output the azimuth angle and the pitch angle at the same time, the output of the network is set to be 2 one-hot labels, namely, for the azimuth angle, the output is a 25-dimensional vector, and the elements in the vector are 0 except the value of the corresponding sound source azimuth angle, namely 1; for the pitch angle, the output is a 50-dimensional vector, and the elements in the vector are all 0 except the value corresponding to the pitch angle of the sound source, which is 1. Where the dimensions of the vector correspond to the number of classes in the space.
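A small sketch of this label construction (the index arguments are the positions of the true azimuth and pitch classes in the grids defined earlier; the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def one_hot_labels(az_index, el_index, n_az=25, n_el=50):
    """One-hot target vectors for the azimuth (25-dim) and pitch (50-dim) outputs."""
    az_label = np.zeros(n_az, dtype=np.float32)
    el_label = np.zeros(n_el, dtype=np.float32)
    az_label[az_index] = 1.0
    el_label[el_index] = 1.0
    return az_label, el_label
```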
Step two: and building a convolutional neural network for extracting the binaural positioning characteristics.
A schematic diagram of the convolutional neural network is shown in fig. 1.
Here, two independent CNNs are used to learn the localization features from the magnitude spectrum and IPD, respectively, for binaural sound source localization.
First, the input branch of the magnitude spectrum is followed by 32 convolution kernels of size 3×2, which can extract binaural and monaural features simultaneously. The IPD input branch is followed by 32 convolution kernels of size 3×1 to extract binaural features.
Second, in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which 4 convolutional layers are used to search for features suitable for localization. The first 2 use 64 convolution kernels of size 3×1 with a max-pooling layer of size 2×1 added, and the last 2 use 128 convolution kernels of size 3×1. All convolutional layers are activated by rectified linear units (ReLU) and processed with batch normalization.
Finally, the outputs of the two branches are flattened and concatenated, i.e. the magnitude features and the IPD features are merged, and the merged features are passed through two fully connected layers of sizes 8192 and 4096, respectively, to form the shared features (Shared Feature) in preparation for the subsequent sound source localization.
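The following PyTorch sketch mirrors the layer counts just described (32/64/128 kernels, 2×1 pooling, ReLU plus batch normalization, and the 8192/4096 fully connected layers). The input tensor layout, the number of frequency bins, and the exact pooling placement are assumptions, since the patent text does not fix them:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, ksize):
    return nn.Sequential(nn.Conv2d(c_in, c_out, ksize),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU())

class FeatureCNN(nn.Module):
    """Two-branch feature extractor. Assumed input layout per sample:
    magnitude branch (1, F, 2), i.e. F frequency bins by 2 ears;
    IPD branch (1, F, 1). F itself is not specified by the patent text."""

    def __init__(self):
        super().__init__()
        # magnitude branch: 32 kernels of 3x2, then pooling and 4 further conv layers
        self.mag = nn.Sequential(
            conv_block(1, 32, (3, 2)), nn.MaxPool2d((2, 1)),
            conv_block(32, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 128, (3, 1)), conv_block(128, 128, (3, 1)),
        )
        # IPD branch: 32 kernels of 3x1, then the same pooling/conv structure
        self.ipd = nn.Sequential(
            conv_block(1, 32, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(32, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 128, (3, 1)), conv_block(128, 128, (3, 1)),
        )
        # flatten, concatenate, then fully connected layers of sizes 8192 and 4096
        self.fc = nn.Sequential(nn.LazyLinear(8192), nn.ReLU(),
                                nn.Linear(8192, 4096), nn.ReLU())

    def forward(self, mag, ipd):
        z = torch.cat([self.mag(mag).flatten(1), self.ipd(ipd).flatten(1)], dim=1)
        return self.fc(z)     # shared 4096-dimensional feature
```

For example, with 257 frequency bins, `FeatureCNN()(torch.randn(4, 1, 257, 2), torch.randn(4, 1, 257, 1))` returns a batch of 4 shared feature vectors of dimension 4096.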
Step three: and constructing a multitask neural network for simultaneously positioning the azimuth angle and the pitch angle.
In neural-network-based learning, a typical approach is to build a single model for a specific task and optimize the parameters of the model according to specific criteria. However, if the network is optimized for only a single task, it cannot reach the optimum when multiple related tasks need to be completed simultaneously. One suitable solution is to share features between several related tasks, so that multiple tasks can be trained together and each task attains good performance; this is known as multi-task learning.
As shown in Fig. 1, the shared feature layer is followed by the multitask neural network, which comprises two branches representing the estimates of azimuth and elevation, respectively. Each branch has five fully connected layers, and the two parallel output layers use softmax activation.
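A matching PyTorch sketch of this multitask stage follows: two branches consume the 4096-dimensional shared feature from the CNN sketch above, each ending in a softmax output over its angle classes (25 for azimuth, 50 for pitch). The hidden-layer width is an assumption:

```python
import torch
import torch.nn as nn

def branch(n_classes, in_dim=4096, width=1024, n_hidden=5):
    """One task branch: fully connected layers ending in a softmax over angle classes."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers += [nn.Linear(d, n_classes), nn.Softmax(dim=-1)]
    return nn.Sequential(*layers)

class MultiTaskHeads(nn.Module):
    """Azimuth (25-class) and pitch (50-class) branches fed by the shared feature."""
    def __init__(self):
        super().__init__()
        self.azimuth = branch(25)
        self.pitch = branch(50)

    def forward(self, shared):
        return self.azimuth(shared), self.pitch(shared)
```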
Step four: and (4) multi-environment training.
In order to improve the robustness of the algorithm under various noise and reverberation environments, training data are constructed under different signal-to-noise ratios and reverberation times, and multi-environment training is carried out to improve the generalization ability of the network across environments.
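A minimal training-loop sketch for this multi-environment stage is given below. It reuses the `FeatureCNN` and `MultiTaskHeads` sketches above and assumes a data loader that yields feature tensors and class indices for samples pooled across all SNR and reverberation conditions; the optimizer, learning rate, and equal loss weighting are assumptions, not specified by the patent:

```python
import torch
import torch.nn.functional as F

def train_multi_environment(cnn, heads, loader, epochs=50, lr=1e-4):
    """loader yields (mag, ipd, az_idx, el_idx) batches drawn from every
    SNR / reverberation condition, so a single model sees all environments."""
    opt = torch.optim.Adam(list(cnn.parameters()) + list(heads.parameters()), lr=lr)
    for _ in range(epochs):
        for mag, ipd, az_idx, el_idx in loader:
            p_az, p_el = heads(cnn(mag, ipd))            # softmax probabilities
            # sum of the two tasks' cross-entropy losses (equal weighting assumed)
            loss = (F.nll_loss(torch.log(p_az + 1e-12), az_idx)
                    + F.nll_loss(torch.log(p_el + 1e-12), el_idx))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return cnn, heads
```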
Step five: and performing direction estimation on the target voice by using the trained convolutional neural network and the multi-task neural network.
The scheme of this patent is compared with two existing methods under different noise and reverberation conditions. Baseline scheme 1 performs localization by analyzing composite feature vectors of selected binaural cues using mutual information; see: X. Wu, D. S. Talagala, W. Zhang, and T. D. Abhayapala, "Individualized interaural feature learning and personalized binaural localization model," Applied Sciences, vol. 9, no. 13, p. 2682, 2019. Baseline scheme 2 feeds the interaural phase difference and interaural level difference directly into a convolutional neural network for localization; see: C. Pang, H. Liu, and X. Li, "Multitask learning of time-frequency CNN for sound source localization," IEEE Access, vol. 7, pp. 40725-40737, 2019.
Compared with these two schemes, the proposed scheme obtains the best performance under most conditions, especially at low signal-to-noise ratios (SNR ≤ 25 dB) and under strong reverberation (T60 ≥ 200 ms). Because monaural magnitude information is a key cue for human elevation localization, the proposed scheme retains the monaural spectral cues, and its gain in elevation localization over the baseline schemes is particularly pronounced. (A "-" entry indicates that the corresponding reference does not provide results for that condition.)
Positioning results under different signal-to-noise ratio environments:

Table 1. Azimuth localization accuracy (%) under different SNR environments.

SNR                 25 dB    20 dB    15 dB    10 dB    5 dB
Baseline scheme 1     -      97.20      -      94.40     -
Baseline scheme 2   96.88    95.57    93.05    88.48   79.87
Proposed scheme     98.10    98.09    98.07    97.94   96.95
Table 2. Pitch angle localization accuracy (%) under different SNR environments.

SNR                 25 dB    20 dB    15 dB    10 dB    5 dB
Baseline scheme 1     -      72.64      -      37.04     -
Baseline scheme 2   92.42    86.93    78.37    65.77   48.47
Proposed scheme     98.28    97.59    96.17    93.06   85.25
Positioning results under different reverberation environments:

Table 3. Azimuth localization accuracy (%) at different reverberation times.

Reverberation time  300 ms   350 ms   400 ms   450 ms   500 ms
Baseline scheme 1    91.44     -       89.44     -       78.88
Baseline scheme 2    91.60   92.12     87.44   90.40     83.64
Proposed scheme      94.23   95.77     92.57   94.98     90.02
Table 4. Pitch angle localization accuracy (%) at different reverberation times.

Reverberation time  300 ms   350 ms   400 ms   450 ms   500 ms
Baseline scheme 1    68.48     -       55.52     -       42.64
Baseline scheme 2    91.76   91.73     86.93   89.57     81.70
Proposed scheme      93.09   95.08     91.13   94.43     87.47
The innovation of this scheme lies in performing feature selection on the binaural magnitude spectra (which contain both binaural and monaural cues) and the IPD fed into the CNN. This allows more accurate sound source localization, especially under noisy and reverberant conditions, whereas the best existing methods perform well only at high SNR and low reverberation. The experimental results demonstrate that, in complex environments, using the binaural magnitude information preserves the localization cues more accurately than using interaural level difference information alone.

Claims (1)

1. A binaural sound source localization method based on deep learning, characterized in that the receiving platform is a binaural platform consisting of 2 array elements and that the method comprises the following steps:
Step 1: divide the azimuth of the received signal into N_θ candidate directions θ_1, θ_2, …, θ_{N_θ}, and divide the pitch angle into N_φ candidate directions φ_1, φ_2, …, φ_{N_φ};
Target speech from different directions and pitch angles is received using both ears:
Y_l(t,f) = S(t,f)·B_l(f,Θ) + N_l(t,f)
Y_r(t,f) = S(t,f)·B_r(f,Θ) + N_r(t,f)
where t, f denote the time and frequency indices; Y(t,f), S(t,f) and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source and the superimposed noise, respectively; and the subscripts l, r denote the left and right ears. B_l(f,Θ) and B_r(f,Θ) denote the generated binaural room transfer functions;
preprocessing the received signals: extract the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f) of the binaural signals;
Step 2, building a convolutional neural network for extracting binaural positioning characteristics:
the input branch of the magnitude spectrum is followed by 16 convolution kernels of size 3×2, extracting binaural and monaural features simultaneously; the IPD input branch is followed by 16 convolution kernels of size 3×1 to extract the binaural features;
in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which 4 convolutional layers are used to search for features suitable for localization; the first 2 use 64 convolution kernels of size 3×1 with a max-pooling layer of size 2×1 added, and the last 2 use 128 convolution kernels of size 3×1; all convolutional layers are activated by rectified linear units (ReLU) and processed by batch normalization;
the outputs of the two branches are flattened and concatenated, i.e. the magnitude features and the IPD features are merged; the merged features are then passed in turn through two fully connected layers of sizes 8192 and 4096, the latter of which forms the shared features prepared for the subsequent sound source localization;
Step 3: the output of the convolutional neural network is the shared feature layer, which is connected to the multitask neural network and serves as its input; the multitask neural network comprises two branches representing the estimates of azimuth and elevation, respectively; each branch has five fully connected layers, and the two parallel output layers use softmax activation;
Step 4: perform multi-environment training on the networks of Steps 2 and 3: divide the data extracted in Step 1 into training data and validation data, train the networks of Steps 2 and 3 on the training data under multiple environments, and validate the trained networks on the validation data to obtain the multi-environment-trained network;
Step 5: preprocess the speech signal received by the receiving platform to obtain the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f), which are used as the input of the multi-environment-trained network; the output of the network is the angle information for binaural sound source localization.
CN202011173630.0A 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning Active CN112346013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011173630.0A CN112346013B (en) 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011173630.0A CN112346013B (en) 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning

Publications (2)

Publication Number Publication Date
CN112346013A true CN112346013A (en) 2021-02-09
CN112346013B CN112346013B (en) 2023-06-30

Family

ID=74358963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011173630.0A Active CN112346013B (en) 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning

Country Status (1)

Country Link
CN (1) CN112346013B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107942290A (en) * 2017-11-16 2018-04-20 东南大学 Binaural sound sources localization method based on BP neural network
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN110501673A (en) * 2019-08-29 2019-11-26 北京大学深圳研究生院 A kind of binaural sound source direction in space estimation method and system based on multitask time-frequency convolutional neural networks
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
US20200037097A1 (en) * 2018-04-04 2020-01-30 Bose Corporation Systems and methods for sound source virtualization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107942290A (en) * 2017-11-16 2018-04-20 东南大学 Binaural sound sources localization method based on BP neural network
US20200037097A1 (en) * 2018-04-04 2020-01-30 Bose Corporation Systems and methods for sound source virtualization
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN110501673A (en) * 2019-08-29 2019-11-26 北京大学深圳研究生院 A kind of binaural sound source direction in space estimation method and system based on multitask time-frequency convolutional neural networks
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG PANG et al.: "Multitask Learning of Time-Frequency CNN for Sound Source Localization", IEEE Access *
TAN Yawen (谈雅文) et al.: "Binaural sound source localization algorithm based on BP neural network" (基于BP神经网络的双耳声源定位算法), Audio Engineering (电声技术) *

Also Published As

Publication number Publication date
CN112346013B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN107102296B (en) Sound source positioning system based on distributed microphone array
CN105068048B (en) Distributed microphone array sound localization method based on spatial sparsity
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
US20210219053A1 (en) Multiple-source tracking and voice activity detections for planar microphone arrays
Pang et al. Multitask learning of time-frequency CNN for sound source localization
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN111429939B (en) Sound signal separation method of double sound sources and pickup
CN110610718B (en) Method and device for extracting expected sound source voice signal
Di Carlo et al. Mirage: 2d source localization using microphone pair augmentation with echoes
CN107167770A (en) A kind of microphone array sound source locating device under the conditions of reverberation
CN103901400B (en) A kind of based on delay compensation and ears conforming binaural sound source of sound localization method
Gelderblom et al. Synthetic data for dnn-based doa estimation of indoor speech
CN113514801A (en) Microphone array sound source positioning method and sound source identification method based on deep learning
Yang et al. Full-sphere binaural sound source localization using multi-task neural network
Rascon et al. Lightweight multi-DOA tracking of mobile speech sources
CN111948609B (en) Binaural sound source positioning method based on Soft-argmax regression device
Goli et al. Deep learning-based speech specific source localization by using binaural and monaural microphone arrays in hearing aids
CN111707990B (en) Binaural sound source positioning method based on dense convolutional network
KR20090128221A (en) Method for sound source localization and system thereof
Parisi et al. Source localization in reverberant environments by consistent peak selection
Zhou et al. Binaural Sound Source Localization Based on Convolutional Neural Network.
CN112346013B (en) Binaural sound source positioning method based on deep learning
Nakano et al. Automatic estimation of position and orientation of an acoustic source by a microphone array network
Pertilä Acoustic source localization in a room environment and at moderate distances
CN112731291B (en) Binaural sound source localization method and system for collaborative two-channel time-frequency mask estimation task learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant