CN112346013A - Binaural sound source positioning method based on deep learning - Google Patents

Binaural sound source positioning method based on deep learning

Info

Publication number
CN112346013A
CN112346013A (application CN202011173630.0A)
Authority
CN
China
Prior art keywords
binaural
sound source
neural network
positioning
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011173630.0A
Other languages
Chinese (zh)
Other versions
CN112346013B (en)
Inventor
张雯
郗经纬
杨懿晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202011173630.0A priority Critical patent/CN112346013B/en
Publication of CN112346013A publication Critical patent/CN112346013A/en
Application granted granted Critical
Publication of CN112346013B publication Critical patent/CN112346013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20 Position of source determined by a plurality of spaced direction-finders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Stereophonic System (AREA)

Abstract

The invention relates to a binaural sound source localization method based on deep learning, in which a convolutional neural network (CNN) processes the binaural received signals and a multitask neural network (MNN) estimates the sound source azimuth and pitch simultaneously. The parameters that determine the three-dimensional DOA estimate of the target under different environments (noise and reverberation) are learned by the neural network and then used for three-dimensional DOA estimation. The same trained model can estimate the sound source direction in different environments, so no environment-specific processing is required, and the azimuth and pitch angles of the target can be estimated accurately under a wide range of environmental conditions. The algorithm also achieves high localization accuracy and outperforms existing binaural sound source localization algorithms in a variety of complex environments. The method effectively mitigates the effect of environmental interference on localization that limits traditional methods, has broad application prospects, and can be put directly into use.

Description

Binaural sound source positioning method based on deep learning
Technical Field
The invention belongs to the technical fields of human-computer interaction, deep learning and binaural localization, and relates to a binaural sound source localization method based on deep learning, in particular a binaural-hearing localization method based on a deep neural network.
Background
Binaural sound source localization is intended to achieve the same capabilities as human listening localization, i.e. by simulating the binaural auditory principle, using two acoustic sensors to identify the spatial position of the sound source. The main advantages of a dual sensor array are small size, fast response time and easy calibration compared to many positioning systems deployed in audio, radar and sonar applications.
Binaural localization cues can be divided into binaural cues and monaural cues. Binaural cues refer to the interaural phase and level differences between the left- and right-ear signals and are commonly used to determine the lateral direction (left, front, right, i.e. within the front half of the horizontal plane); monaural cues refer to spectral cues caused by scattering and diffraction of sound waves at the pinna and around the body, and are mainly used for elevation localization and front-back disambiguation. The Head-Related Transfer Function (HRTF) is defined as the frequency-domain transfer function describing the propagation of an acoustic signal from the sound source to the two ears in a free-field environment. Binaural localization cues can be extracted from the HRTFs.
At present, the more traditional binaural sound source localization algorithms are based on cross-correlation techniques, in which binaural cues are estimated from the two microphone signals and the sound source direction is obtained by comparing them against a binaural cue data set; see: M. Raspaud, H. Viste, and G. Evangelista, "Binaural source localization by joint estimation of ILD and ITD," IEEE Trans. Audio, Speech, Language Process., vol. 18, pp. 68-77, 2010, and R. Parisi, F. Camoes, and A. Uncini, "Cepstrum prediction for binaural source localization in reverberant environments," IEEE Signal Processing Letters, vol. 19, pp. 99-102, 2012. Model-based algorithms perform sound source localization by maximum likelihood estimation over the statistics of a probabilistic model; see: J. Woodruff and D. Wang, "Binaural localization of multiple sources in reverberant and noisy environments," IEEE Trans. Audio, Speech, Language Process., vol. 20, pp. 1503-1512, 2012. Elevation can also be estimated from spectral differences, i.e. by comparing the spectral difference between the received binaural signal and the HRTF data; see: R. R. Hammond and P. J. Jackson, "Robust full-sphere binaural sound source localization using interaural and spectral cues," in ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
With the rise of machine learning, neural-network-based methods have been widely used to solve the binaural auditory localization problem, with the localization problem converted into a classification problem using a convolutional neural network (CNN). Experimental results show that in simple sound source localization tasks such networks achieve localization performance comparable to that of human listeners; see: N. Ma, T. May, and G. J. Brown, "Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 12, pp. 2444-2453, 2017, and F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza, "Localizing speakers in multiple rooms by using deep neural networks," Computer Speech & Language, vol. 49, pp. 83-106, 2018. End-to-end systems have also been used for binaural SSL; see: P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, "End-to-end binaural sound localisation from the raw waveform," in ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 451-455, Brighton, United Kingdom, May 2019. However, the accuracy of binaural sound source localization still faces the challenges of noisy and reverberant environments, as well as the simultaneous localization of azimuth and pitch angles.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides a binaural sound source positioning method based on deep learning, which estimates the azimuth and pitch angles of a target speech signal simultaneously on a binaural platform.
Technical scheme
A binaural sound source localization method based on deep learning, characterized in that the receiving platform is a binaural platform consisting of 2 array elements and that the method comprises the following steps:
Step 1: divide the azimuth of the received signal into N_θ candidate directions θ_1, θ_2, …, θ_{N_θ}, and divide the pitch angle into N_φ candidate directions φ_1, φ_2, …, φ_{N_φ};
Target speech from different directions and pitch angles is received using both ears:
Y_l(t,f) = S(t,f)·B_l(f,Θ) + N_l(t,f)
Y_r(t,f) = S(t,f)·B_r(f,Θ) + N_r(t,f)
where t, f denote the time and frequency indices; Y(t,f), S(t,f) and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source and the superimposed noise, respectively; and the subscripts l, r denote the left and right ears. B_l(f,Θ) and B_r(f,Θ) denote the generated binaural room transfer functions;
preprocessing the received signals: extract the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f) of the binaural signals;
Step 2, building a convolutional neural network for extracting binaural positioning characteristics:
the input branch of the magnitude spectrum is followed by 16 convolution kernels of size 3×2, extracting binaural and monaural features simultaneously; the IPD input branch is followed by 16 convolution kernels of size 3×1 to extract the binaural features;
in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which 4 convolutional layers are used to search for features suitable for localization; the first 2 use 64 convolution kernels of size 3×1 with a max-pooling layer of size 2×1 added, and the last 2 use 128 convolution kernels of size 3×1; all convolutional layers are activated by rectified linear units (ReLU) and processed by batch normalization;
the outputs of the two branches are flattened and concatenated, i.e. the magnitude features and the IPD features are merged; the merged features are then passed in turn through two fully connected layers of sizes 8192 and 4096, the latter of which forms the shared features prepared for the subsequent sound source localization;
Step 3: the output of the convolutional neural network is the shared feature layer, which is connected to the multitask neural network and serves as its input; the multitask neural network comprises two branches representing the estimates of azimuth and elevation, respectively; each branch has five fully connected layers, and the two parallel output layers use softmax activation;
Step 4: perform multi-environment training on the networks of Steps 2 and 3: divide the data extracted in Step 1 into training data and validation data, train the networks of Steps 2 and 3 on the training data under multiple environments, and validate the trained networks on the validation data to obtain the multi-environment-trained network;
Step 5: preprocess the speech signal received by the receiving platform to obtain the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f), which are used as the input of the multi-environment-trained network; the output of the network is the angle information for binaural sound source localization.
Advantageous effects
The invention provides a binaural sound source localization method based on deep learning, in which a convolutional neural network processes the binaural received signals and a multitask neural network (MNN) simultaneously estimates the sound source azimuth and pitch. The parameters that determine the three-dimensional DOA estimate of the target under different environments (noise and reverberation) are learned by the neural network and then used for three-dimensional DOA estimation. The method estimates the azimuth and elevation of the target well under different environments, achieves higher localization accuracy, greatly simplifies the computation compared with earlier algorithms, and overcomes their shortcomings.
The invention has the beneficial effects that:
The same trained model can be used to estimate the sound source direction in different environments, avoiding environment-specific processing, and the target azimuth and pitch angles can be estimated accurately under a wide range of environmental conditions. The algorithm also achieves high localization accuracy and outperforms existing binaural sound source localization algorithms in a variety of complex environments. The method effectively mitigates the effect of environmental interference on localization that limits traditional methods, has broad application prospects, and can be put directly into use.
Drawings
FIG. 1: structure of the deep neural network of the binaural sound source localization system of this patent. The system consists of a preprocessing stage, a convolutional neural network stage and a multitask neural network stage. The preprocessing stage extracts the magnitude spectra E_l(t,f), E_r(t,f) and the interaural phase difference IPD(t,f) from the binaural signals Y_l(t,f), Y_r(t,f) and feeds them into the CNN separately. The CNN extracts features from the two kinds of input data and outputs them to the shared feature layer. Through the shared feature layer the CNN is connected to the multitask neural network stage, which performs the two subtasks of azimuth localization and pitch angle localization.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the technical scheme adopted by the invention for solving the technical problems is as follows: a solution algorithm for binaural received signal processing using a convolutional Neural Network (MNN) and simultaneous estimation of the azimuth-elevation of the sound source using a multi task Neural Network (MNN), consisting essentially of the steps of:
1) constructing a data set for training a neural network;
the receiving platform is assumed to be a binaural platform, i.e. the array consists of 2 array elements. Dividing azimuth into NθA is prepared from
Figure RE-GDA0002862721790000051
Divide pitch angle into
Figure RE-GDA0002862721790000052
A is prepared from
Figure RE-GDA0002862721790000053
Target voices from different directions and pitch angles are received with both ears, the received signals are preprocessed, the binaural magnitude spectra and phase differences are extracted, and the data set required for network training is constructed;
2) build a convolutional neural network for extracting the binaural localization features;
3) construct a multitask neural network for simultaneously localizing the azimuth and pitch angles;
4) perform multi-environment training on the networks of steps 2) and 3);
5) estimate the direction of the target speech using the trained convolutional neural network and multitask neural network.
The basic idea of the invention is to train a convolutional neural network for extracting three-dimensional localization features and a multitask neural network for simultaneously estimating azimuth and pitch, and then to estimate the azimuth and pitch angles of the sound source from the received speech signal using the trained networks.
Setting simulation environment parameters:
  • Room size: length 5 m, width 5 m, height 3 m;
  • Head position coordinates: 2.5 m (length), 2.5 m (width), 1.5 m (height);
  • Distance between the sound source and the center of the head: 1 m;
  • Angle estimation categories: the azimuth angles are divided into 25 classes, [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°]; the pitch angles are divided into 50 classes, uniformly distributed from -45° to 230.625° with a step of 5.625°. The 25 azimuth positions and 50 pitch positions together form 1250 spatial positions (enumerated in the sketch after this list).
  • Reverberation conditions: the reflection coefficient of the room walls is adjusted with the image-source model, and the binaural room transfer functions (BRTFs) are generated using the head-related impulse responses (HRIRs) provided by the CIPIC database. There are 8 reverberation levels, with reverberation times uniformly distributed from 150 ms to 500 ms in steps of 50 ms.
  • Noise conditions: there are 7 noise levels, with signal-to-noise ratios ranging from 5 dB to 35 dB in steps of 5 dB.
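As a minimal illustrative sketch (not part of the patent text), the angle grid above can be enumerated as follows; the variable names are assumptions made for illustration only:

```python
import numpy as np

# 25 azimuth classes: [-80, -65, -55] + [-45 : 5 : 45] + [55, 65, 80] degrees
azimuths = np.concatenate((
    [-80.0, -65.0, -55.0],
    np.arange(-45.0, 45.0 + 1e-6, 5.0),   # 19 values from -45 to 45 in 5-degree steps
    [55.0, 65.0, 80.0],
))

# 50 pitch classes: uniformly spaced from -45 to 230.625 degrees, step 5.625
pitches = -45.0 + 5.625 * np.arange(50)

assert azimuths.size == 25 and np.isclose(pitches[-1], 230.625)

# 25 x 50 = 1250 candidate spatial positions
positions = [(az, el) for az in azimuths for el in pitches]
print(len(positions))   # 1250
```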
Step one: a data set for training the neural network is constructed.
This patent assumes that, in a noisy and reverberant environment, a single sound-source signal is captured by the left- and right-ear microphones of a binaural system. The signal captured in each time-frequency unit of the short-time Fourier transform domain is written as
Y_l(t,f) = S(t,f)·B_l(f,Θ) + N_l(t,f)
Y_r(t,f) = S(t,f)·B_r(f,Θ) + N_r(t,f)
where t, f denote the time and frequency indices; Y(t,f), S(t,f) and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source and the superimposed noise, respectively; and the subscripts l, r denote the left and right ears. B_l(f,Θ) and B_r(f,Θ) denote the generated binaural room transfer functions.
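Training mixtures consistent with this signal model can be generated in the time domain before the STFT is taken. The following is a minimal sketch only, not the patent's implementation: it assumes a mono source array and a pair of binaural room impulse responses `brir_l`, `brir_r` are available, and that the additive noise is white:

```python
import numpy as np

def synthesize_binaural(source, brir_l, brir_r, snr_db, rng=None):
    """Convolve a mono source with left/right binaural room impulse responses
    and add white noise at the requested SNR (illustrative only)."""
    rng = np.random.default_rng(0) if rng is None else rng
    y_l = np.convolve(source, brir_l)[: len(source)]
    y_r = np.convolve(source, brir_r)[: len(source)]
    # scale the noise so that 10*log10(signal power / noise power) = snr_db
    p_sig = 0.5 * (np.mean(y_l ** 2) + np.mean(y_r ** 2))
    sigma = np.sqrt(p_sig / 10.0 ** (snr_db / 10.0))
    return (y_l + rng.normal(0.0, sigma, y_l.shape),
            y_r + rng.normal(0.0, sigma, y_r.shape))
```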
The magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals are extracted as follows:
E_l(t,f) = 20 log10 |Y_l(t,f)|
E_r(t,f) = 20 log10 |Y_r(t,f)|
Next, the interaural phase difference IPD(t,f) of the binaural signals is extracted as the phase difference between the left- and right-ear signals:
IPD(t,f) = ∠Y_l(t,f) - ∠Y_r(t,f)
in the method, the binaural amplitude spectrum and the binaural phase difference are used as the input of the network. The output data of the network, i.e. the tags, are then constructed.
Because the network is required to output the azimuth angle and the pitch angle at the same time, the output of the network is set to be 2 one-hot labels, namely, for the azimuth angle, the output is a 25-dimensional vector, and the elements in the vector are 0 except the value of the corresponding sound source azimuth angle, namely 1; for the pitch angle, the output is a 50-dimensional vector, and the elements in the vector are all 0 except the value corresponding to the pitch angle of the sound source, which is 1. Where the dimensions of the vector correspond to the number of classes in the space.
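A small sketch of this label construction (the index arguments are the positions of the true azimuth and pitch classes in the grids defined earlier; the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def one_hot_labels(az_index, el_index, n_az=25, n_el=50):
    """One-hot target vectors for the azimuth (25-dim) and pitch (50-dim) outputs."""
    az_label = np.zeros(n_az, dtype=np.float32)
    el_label = np.zeros(n_el, dtype=np.float32)
    az_label[az_index] = 1.0
    el_label[el_index] = 1.0
    return az_label, el_label
```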
Step two: and building a convolutional neural network for extracting the binaural positioning characteristics.
A schematic diagram of the convolutional neural network is shown in fig. 1.
Here, two independent CNNs are used to learn the localization features from the magnitude spectrum and IPD, respectively, for binaural sound source localization.
First, the input branch of the magnitude spectrum is followed by 32 convolution kernels of size 3×2, which can extract binaural and monaural features simultaneously. The IPD input branch is followed by 32 convolution kernels of size 3×1 to extract binaural features.
Second, in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which 4 convolutional layers are used to search for features suitable for localization. The first 2 use 64 convolution kernels of size 3×1 with a max-pooling layer of size 2×1 added, and the last 2 use 128 convolution kernels of size 3×1. All convolutional layers are activated by rectified linear units (ReLU) and processed with batch normalization.
Finally, the outputs of the two branches are flattened and concatenated, i.e. the magnitude features and the IPD features are merged, and the merged features are passed through two fully connected layers of sizes 8192 and 4096, respectively, to form the shared features (Shared Feature) in preparation for the subsequent sound source localization.
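The following PyTorch sketch mirrors the layer counts just described (32/64/128 kernels, 2×1 pooling, ReLU plus batch normalization, and the 8192/4096 fully connected layers). The input tensor layout, the number of frequency bins, and the exact pooling placement are assumptions, since the patent text does not fix them:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, ksize):
    return nn.Sequential(nn.Conv2d(c_in, c_out, ksize),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU())

class FeatureCNN(nn.Module):
    """Two-branch feature extractor. Assumed input layout per sample:
    magnitude branch (1, F, 2), i.e. F frequency bins by 2 ears;
    IPD branch (1, F, 1). F itself is not specified by the patent text."""

    def __init__(self):
        super().__init__()
        # magnitude branch: 32 kernels of 3x2, then pooling and 4 further conv layers
        self.mag = nn.Sequential(
            conv_block(1, 32, (3, 2)), nn.MaxPool2d((2, 1)),
            conv_block(32, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 128, (3, 1)), conv_block(128, 128, (3, 1)),
        )
        # IPD branch: 32 kernels of 3x1, then the same pooling/conv structure
        self.ipd = nn.Sequential(
            conv_block(1, 32, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(32, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 64, (3, 1)), nn.MaxPool2d((2, 1)),
            conv_block(64, 128, (3, 1)), conv_block(128, 128, (3, 1)),
        )
        # flatten, concatenate, then fully connected layers of sizes 8192 and 4096
        self.fc = nn.Sequential(nn.LazyLinear(8192), nn.ReLU(),
                                nn.Linear(8192, 4096), nn.ReLU())

    def forward(self, mag, ipd):
        z = torch.cat([self.mag(mag).flatten(1), self.ipd(ipd).flatten(1)], dim=1)
        return self.fc(z)     # shared 4096-dimensional feature
```

For example, with 257 frequency bins, `FeatureCNN()(torch.randn(4, 1, 257, 2), torch.randn(4, 1, 257, 1))` returns a batch of 4 shared feature vectors of dimension 4096.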
Step three: and constructing a multitask neural network for simultaneously positioning the azimuth angle and the pitch angle.
In neural-network-based learning, a typical approach is to build a single model for a specific task and optimize the parameters of the model according to specific criteria. However, if the network is optimized for only a single task, it cannot reach the optimum when multiple related tasks need to be completed simultaneously. One suitable solution is to share features between several related tasks, so that multiple tasks can be trained together and each task attains good performance; this is known as multi-task learning.
As shown in Fig. 1, the shared feature layer is followed by the multitask neural network, which comprises two branches representing the estimates of azimuth and elevation, respectively. Each branch has five fully connected layers, and the two parallel output layers use softmax activation.
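A matching PyTorch sketch of this multitask stage follows: two branches consume the 4096-dimensional shared feature from the CNN sketch above, each ending in a softmax output over its angle classes (25 for azimuth, 50 for pitch). The hidden-layer width is an assumption:

```python
import torch
import torch.nn as nn

def branch(n_classes, in_dim=4096, width=1024, n_hidden=5):
    """One task branch: fully connected layers ending in a softmax over angle classes."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers += [nn.Linear(d, n_classes), nn.Softmax(dim=-1)]
    return nn.Sequential(*layers)

class MultiTaskHeads(nn.Module):
    """Azimuth (25-class) and pitch (50-class) branches fed by the shared feature."""
    def __init__(self):
        super().__init__()
        self.azimuth = branch(25)
        self.pitch = branch(50)

    def forward(self, shared):
        return self.azimuth(shared), self.pitch(shared)
```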
Step four: and (4) multi-environment training.
In order to improve the robustness of the algorithm under various noise and reverberation environments, training data are constructed under different signal-to-noise ratios and reverberation times, and multi-environment training is carried out to improve the generalization ability of the network across environments.
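A minimal training-loop sketch for this multi-environment stage is given below. It reuses the `FeatureCNN` and `MultiTaskHeads` sketches above and assumes a data loader that yields feature tensors and class indices for samples pooled across all SNR and reverberation conditions; the optimizer, learning rate, and equal loss weighting are assumptions, not specified by the patent:

```python
import torch
import torch.nn.functional as F

def train_multi_environment(cnn, heads, loader, epochs=50, lr=1e-4):
    """loader yields (mag, ipd, az_idx, el_idx) batches drawn from every
    SNR / reverberation condition, so a single model sees all environments."""
    opt = torch.optim.Adam(list(cnn.parameters()) + list(heads.parameters()), lr=lr)
    for _ in range(epochs):
        for mag, ipd, az_idx, el_idx in loader:
            p_az, p_el = heads(cnn(mag, ipd))            # softmax probabilities
            # sum of the two tasks' cross-entropy losses (equal weighting assumed)
            loss = (F.nll_loss(torch.log(p_az + 1e-12), az_idx)
                    + F.nll_loss(torch.log(p_el + 1e-12), el_idx))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return cnn, heads
```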
Step five: and performing direction estimation on the target voice by using the trained convolutional neural network and the multi-task neural network.
The scheme of this patent is compared with two existing methods under different noise and reverberation conditions. Baseline scheme 1 performs localization by analyzing composite feature vectors of selected binaural cues using mutual information; see: X. Wu, D. S. Talagala, W. Zhang, and T. D. Abhayapala, "Individualized interaural feature learning and personalized binaural localization model," Applied Sciences, vol. 9, no. 13, p. 2682, 2019. Baseline scheme 2 feeds the interaural phase difference and interaural level difference directly into a convolutional neural network for localization; see: C. Pang, H. Liu, and X. Li, "Multitask learning of time-frequency CNN for sound source localization," IEEE Access, vol. 7, pp. 40725-40737, 2019.
Compared with these two schemes, the proposed scheme obtains the best performance under most conditions, especially at low signal-to-noise ratios (SNR ≤ 25 dB) and under strong reverberation (T60 ≥ 200 ms). Because monaural magnitude information is a key cue for human elevation localization, the proposed scheme retains the monaural spectral cues, and its gain in elevation localization over the baseline schemes is particularly pronounced. (A "-" entry indicates that the corresponding reference does not provide results for that condition.)
Positioning results under different signal-to-noise ratio environments:

Table 1. Azimuth localization accuracy (%) under different SNR environments.

SNR                 25 dB    20 dB    15 dB    10 dB    5 dB
Baseline scheme 1     -      97.20      -      94.40     -
Baseline scheme 2   96.88    95.57    93.05    88.48   79.87
Proposed scheme     98.10    98.09    98.07    97.94   96.95
Table 2. Pitch angle localization accuracy (%) under different SNR environments.

SNR                 25 dB    20 dB    15 dB    10 dB    5 dB
Baseline scheme 1     -      72.64      -      37.04     -
Baseline scheme 2   92.42    86.93    78.37    65.77   48.47
Proposed scheme     98.28    97.59    96.17    93.06   85.25
Positioning results under different reverberation environments:

Table 3. Azimuth localization accuracy (%) at different reverberation times.

Reverberation time  300 ms   350 ms   400 ms   450 ms   500 ms
Baseline scheme 1    91.44     -       89.44     -       78.88
Baseline scheme 2    91.60   92.12     87.44   90.40     83.64
Proposed scheme      94.23   95.77     92.57   94.98     90.02
Table 4. Pitch angle localization accuracy (%) at different reverberation times.

Reverberation time  300 ms   350 ms   400 ms   450 ms   500 ms
Baseline scheme 1    68.48     -       55.52     -       42.64
Baseline scheme 2    91.76   91.73     86.93   89.57     81.70
Proposed scheme      93.09   95.08     91.13   94.43     87.47
The innovation of this scheme lies in performing feature selection on the binaural magnitude spectra (which contain both binaural and monaural cues) and the IPD fed into the CNN. This allows more accurate sound source localization, especially under noisy and reverberant conditions, whereas the best existing methods perform well only at high SNR and low reverberation. The experimental results demonstrate that, in complex environments, using the binaural magnitude information preserves the localization cues more accurately than using interaural level difference information alone.

Claims (1)

1. A binaural sound source localization method based on deep learning, characterized in that the receiving platform is a binaural platform consisting of 2 array elements and that the method comprises the following steps:
Step 1: divide the azimuth of the received signal into N_θ candidate directions θ_1, θ_2, …, θ_{N_θ}, and divide the pitch angle into N_φ candidate directions φ_1, φ_2, …, φ_{N_φ};
Target speech from different directions and pitch angles is received using both ears:
Y_l(t,f) = S(t,f)·B_l(f,Θ) + N_l(t,f)
Y_r(t,f) = S(t,f)·B_r(f,Θ) + N_r(t,f)
where t, f denote the time and frequency indices; Y(t,f), S(t,f) and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source and the superimposed noise, respectively; and the subscripts l, r denote the left and right ears. B_l(f,Θ) and B_r(f,Θ) denote the generated binaural room transfer functions;
preprocessing the received signals: extract the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f) of the binaural signals;
Step 2, building a convolutional neural network for extracting binaural positioning characteristics:
the input branch of the magnitude spectrum is followed by 16 convolution kernels of size 3×2, extracting binaural and monaural features simultaneously; the IPD input branch is followed by 16 convolution kernels of size 3×1 to extract the binaural features;
in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which 4 convolutional layers are used to search for features suitable for localization; the first 2 use 64 convolution kernels of size 3×1 with a max-pooling layer of size 2×1 added, and the last 2 use 128 convolution kernels of size 3×1; all convolutional layers are activated by rectified linear units (ReLU) and processed by batch normalization;
the outputs of the two branches are flattened and concatenated, i.e. the magnitude features and the IPD features are merged; the merged features are then passed in turn through two fully connected layers of sizes 8192 and 4096, the latter of which forms the shared features prepared for the subsequent sound source localization;
Step 3: the output of the convolutional neural network is the shared feature layer, which is connected to the multitask neural network and serves as its input; the multitask neural network comprises two branches representing the estimates of azimuth and elevation, respectively; each branch has five fully connected layers, and the two parallel output layers use softmax activation;
Step 4: perform multi-environment training on the networks of Steps 2 and 3: divide the data extracted in Step 1 into training data and validation data, train the networks of Steps 2 and 3 on the training data under multiple environments, and validate the trained networks on the validation data to obtain the multi-environment-trained network;
Step 5: preprocess the speech signal received by the receiving platform to obtain the magnitude spectra E_l(t,f), E_r(t,f) of the binaural signals and the interaural phase difference IPD(t,f), which are used as the input of the multi-environment-trained network; the output of the network is the angle information for binaural sound source localization.
CN202011173630.0A 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning Active CN112346013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011173630.0A CN112346013B (en) 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011173630.0A CN112346013B (en) 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning

Publications (2)

Publication Number Publication Date
CN112346013A true CN112346013A (en) 2021-02-09
CN112346013B CN112346013B (en) 2023-06-30

Family

ID=74358963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011173630.0A Active CN112346013B (en) 2020-10-28 2020-10-28 Binaural sound source positioning method based on deep learning

Country Status (1)

Country Link
CN (1) CN112346013B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107942290A (en) * 2017-11-16 2018-04-20 东南大学 Binaural sound sources localization method based on BP neural network
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN110501673A (en) * 2019-08-29 2019-11-26 北京大学深圳研究生院 A kind of binaural sound source direction in space estimation method and system based on multitask time-frequency convolutional neural networks
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
US20200037097A1 (en) * 2018-04-04 2020-01-30 Bose Corporation Systems and methods for sound source virtualization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107942290A (en) * 2017-11-16 2018-04-20 东南大学 Binaural sound sources localization method based on BP neural network
US20200037097A1 (en) * 2018-04-04 2020-01-30 Bose Corporation Systems and methods for sound source virtualization
CN109164415A (en) * 2018-09-07 2019-01-08 东南大学 A kind of binaural sound sources localization method based on convolutional neural networks
CN110501673A (en) * 2019-08-29 2019-11-26 北京大学深圳研究生院 A kind of binaural sound source direction in space estimation method and system based on multitask time-frequency convolutional neural networks
CN110517705A (en) * 2019-08-29 2019-11-29 北京大学深圳研究生院 A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG PANG et al.: "Multitask Learning of Time-Frequency CNN for Sound Source Localization", IEEE Access *
TAN Yawen (谈雅文) et al.: "Binaural sound source localization algorithm based on BP neural network" (基于BP神经网络的双耳声源定位算法), Audio Engineering (电声技术) *

Also Published As

Publication number Publication date
CN112346013B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN107102296B (en) Sound source positioning system based on distributed microphone array
CN105068048B (en) Distributed microphone array sound localization method based on spatial sparsity
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
US20210219053A1 (en) Multiple-source tracking and voice activity detections for planar microphone arrays
Pang et al. Multitask learning of time-frequency CNN for sound source localization
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN111429939B (en) Sound signal separation method of double sound sources and pickup
CN110610718B (en) Method and device for extracting expected sound source voice signal
Di Carlo et al. Mirage: 2d source localization using microphone pair augmentation with echoes
CN107167770A (en) A kind of microphone array sound source locating device under the conditions of reverberation
CN103901400B (en) A kind of based on delay compensation and ears conforming binaural sound source of sound localization method
Gelderblom et al. Synthetic data for dnn-based doa estimation of indoor speech
CN113514801A (en) Microphone array sound source positioning method and sound source identification method based on deep learning
Yang et al. Full-sphere binaural sound source localization using multi-task neural network
Rascon et al. Lightweight multi-DOA tracking of mobile speech sources
CN111948609B (en) Binaural sound source positioning method based on Soft-argmax regression device
Goli et al. Deep learning-based speech specific source localization by using binaural and monaural microphone arrays in hearing aids
CN111707990B (en) Binaural sound source positioning method based on dense convolutional network
KR20090128221A (en) Method for sound source localization and system thereof
Parisi et al. Source localization in reverberant environments by consistent peak selection
Zhou et al. Binaural Sound Source Localization Based on Convolutional Neural Network.
CN112346013B (en) Binaural sound source positioning method based on deep learning
Nakano et al. Automatic estimation of position and orientation of an acoustic source by a microphone array network
Pertilä Acoustic source localization in a room environment and at moderate distances
CN112731291B (en) Binaural sound source localization method and system for collaborative two-channel time-frequency mask estimation task learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant