CN111239686B

CN111239686B - Dual-channel sound source positioning method based on deep learning

Info

Publication number: CN111239686B
Application number: CN202010099231.8A
Authority: CN
Inventors: 李军锋; 程龙彪; 夏日升; 颜永红
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2020-02-18
Filing date: 2020-02-18
Publication date: 2021-12-21
Anticipated expiration: 2040-02-18
Also published as: CN111239686A

Abstract

The invention discloses a two-channel sound source positioning method based on deep learning, which comprises the following steps: respectively performing framing, windowing and Fourier transformation on the microphone pickup data of the left channel and the right channel to obtain time-frequency domain pickup signals of the first channel and the second channel; estimating phase sensitive masking from the time-frequency domain picked-up signals and the time-frequency domain direct sound signals corresponding to the time-frequency domain picked-up signals by utilizing deep learning, guiding estimation of sound source direction information by utilizing the phase sensitive masking, calculating accuracy of direction information estimation by utilizing the phase sensitive masking, obtaining a direction information enhancement value from the estimated direction information and the direction information estimation accuracy by utilizing the deep learning, constructing a weighted histogram by utilizing the enhanced direction information and the accuracy of the direction information estimation, and finally selecting the direction corresponding to the peak value of the histogram as the sound source direction. The method estimates the direction of the sound source from the data picked up by the dual-channel microphone, fully utilizes the generalization capability of the neural network, and has better robustness to the noise reverberation environment.

Description

Dual-channel sound source positioning method based on deep learning

Technical Field

The invention relates to the technical field of sound source positioning, in particular to a dual-channel sound source positioning method based on deep learning.

Background

Currently, the sound source localization technology mainly estimates the azimuth of a sound source from data containing background noise and reverberation picked up by a microphone array, so as to obtain better performance in the aspects of sound source separation, sound source tracking and the like. In the sound source localization technique using azimuth as output, the azimuth of the sound source can be estimated by using the orthogonality of the signal space and the noise space, but the performance of such algorithms is obviously reduced when reverberation exists. By utilizing deep learning, the robustness of the algorithm in the presence of noise and reverberation can be better improved. Most sound source localization algorithms based on deep learning regard sound source localization as a classification problem, and utilize neural networks to estimate the location of sound sources from partitioned areas. The positioning accuracy of the algorithm is related to region division, and when the requirement of positioning accuracy changes, the neural network needs to be retrained.

Disclosure of Invention

The invention aims to solve the defects of the existing sound source positioning technology.

In order to achieve the purpose, the invention discloses a dual-channel sound source positioning method based on deep learning, which comprises the following steps:

respectively performing framing, windowing and Fourier transform on the microphone picked data of each channel to obtain a time-frequency domain picked signal of each channel; the double-channel time-frequency domain signal comprises the information of the position of the sound source;

combining the logarithmic power spectrum of the time-frequency domain pickup signal of the first channel and the phase difference between the channels to obtain the input characteristic of the first channel; combining the logarithmic power spectrum of the time-frequency domain pickup signal of the second channel and the phase difference between the channels to obtain the input characteristic of the second channel;

calculating to obtain a phase sensitivity masking estimation value of the first channel by using the time-frequency domain pickup signal of the first channel and the time-frequency domain direct sound signal corresponding to the time-frequency domain pickup signal; calculating to obtain a phase sensitivity masking estimation value of the second channel by using the time-frequency domain pickup signal of the second channel and the time-frequency domain direct sound signal corresponding to the time-frequency domain pickup signal;

training a neural network by using the input characteristics of each channel and the corresponding theoretical phase sensitivity masking to obtain an estimation model of the phase sensitivity masking;

taking the input characteristics of the first channel as the input of the estimation model, and outputting the phase sensitivity masking estimation value of the first channel; taking the input characteristics of the second channel as the input of the probability estimation model, and outputting the estimated value of the phase sensitive masking of the second channel;

calculating a voice covariance matrix by using the picked-up signal of each channel time-frequency domain and the phase sensitivity masking estimation value of each channel time-frequency domain;

carrying out eigenvalue decomposition on the voice covariance matrix to obtain a main eigenvector of the voice covariance matrix as a guide vector of a sound source;

taking the phase angle difference of two elements of the guide vector as direction information;

calculating the estimation accuracy of the direction information of each time frequency point by using the two-channel phase sensitive masking estimation value;

calculating an ideal phase difference of data picked up by two microphones by using the time difference of sound sources reaching the microphones as target direction information;

training a neural network by using the direction information, the direction information estimation accuracy and the target direction information to obtain a direction information enhancement model;

the direction information and the direction information estimation accuracy are used as the input of a direction information enhancement model, and the direction information estimation accuracy are output as enhanced direction information;

calculating a sound source direction at each time-frequency point using the enhanced direction information;

and constructing a weighted statistical histogram by using the estimation accuracy of the direction characteristics and the direction information at all the time-frequency points.

And selecting the direction with the largest statistical result as the sound source direction by using the weighted histogram.

Preferably, the specific steps of framing, windowing and fourier transforming the microphone picked data of each channel respectively are as follows:

taking 512 sampling points for each channel as a frame signal, and supplementing 512 points if the length is insufficient; then windowing each frame of signal, wherein the windowing function adopts a Blackman window; and finally, carrying out Fourier transform on each frame of signal.

Preferably, the per-channel input characteristics are:

where n is the number of the data frame, m is the number of the channel,

is the log-amplitude spectrum of the time-frequency domain signal of the mth channel,

is the phase difference of the time-frequency domain signal of the mth channel.

Preferably, the per-channel phase sensitive mask is:

where f is the number of the frequency band, theta is the phase of the time-frequency domain signal of the data picked up by the microphone,

is the phase of the time-frequency domain signal of the direct sound data,

is the time-frequency domain signal of the direct sound, and X is the time-frequency domain signal of the microphone pickup data.

Preferably, the step of training the neural network by using the input features of each channel and the theoretical phase sensitivity masking corresponding thereto to obtain the estimation model of the phase sensitivity masking includes:

the neural network is a three-layer long-term memory network, and each layer is provided with 512 nodes. And taking the phase sensitive masking theoretical value as a training target of the neural network, and continuously reducing the mean square error of the phase sensitive masking estimated value and the phase sensitive masking theoretical value through iteration.

Preferably, the estimates of the per-channel phase sensitive masking are:

preferably, the speech covariance matrix is:

preferably, the eigenvalue decomposition is performed on the speech covariance matrix, and the acquisition of the principal eigenvector thereof as the steering vector of the sound source is:

preferably, the direction information is:

preferably, the accuracy of the direction information estimation is:

preferably, the ideal phase difference is:

wherein the content of the first and second substances,

and

is the time taken for the sound source to reach the 1 st and 2 nd microphones, f_sIs the sampling rate of the pick-up signal.

Preferably, the direction information estimation accuracy and the target direction information are used for training the neural network to obtain a direction information enhancement model, specifically:

the neural network is a fully-connected neural network with three layers, and each layer is provided with 2048 nodes. The input characteristics of the neural network are a splicing vector of sine value and cosine value of the direction information and estimation accuracy of the direction information, and specifically the method comprises the following steps:

I_n＝[sinθ_n,0,…,sinθ_n,F-1,cosθ_n,0,…,cosθ_n,F-1,W_n,0,…,W_n,F-1]

the estimation target of the neural network is target direction information, specifically:

the mean square error of the enhanced direction information and the target direction information is continuously reduced through iteration.

Preferably, the enhanced direction information is:

wherein the content of the first and second substances,

is the output value of the enhancement model.

Preferably, the sound source direction calculated at each time-frequency point is:

where c is the sound propagation velocity and d is the microphone pitch.

Preferably, the weighted histogram is constructed such that each time-frequency point has a weight of W_n,f。

Preferably, the direction of the largest statistical result is:

the invention has the advantages that: 1) phase sensitive masking is estimated through spatial information and spectral information, so that more accurate direction information estimation is obtained; 2) the neural network is utilized to enhance the estimated direction information, so that the performance of the positioning method in a noise reverberation environment is improved; 3) by estimating the final sound source orientation using the weighted histogram, the influence of the silence segments on the sound source localization accuracy can be reduced. By containing enough noise types and orientations in the training data, the generalization capability of the deep neural network can be fully utilized, the robustness of the model is improved, and the purpose of sound source positioning in a noise reverberation environment is achieved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a flowchart of a dual-channel sound source localization method based on deep learning.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a dual-channel sound source localization method based on deep learning. As shown in fig. 1, the method includes:

step S101: and respectively performing framing, windowing and Fourier transformation on the microphone picked data of the left channel and the right channel to obtain a time-frequency domain picked signal of each channel. The dual-channel time-frequency domain signal contains information of the position of the sound source.

In one embodiment, 512 sampling points are taken as a frame signal for each channel, and if the length is insufficient, 512 sampling points are supplemented; then windowing each frame of signal, wherein the windowing function adopts a Blackman window; and finally, carrying out Fourier transform on each frame of signal to obtain a time-frequency domain pickup signal of each channel.

Step S102: combining the logarithmic power spectrum of the time-frequency domain pickup signal of the left channel and the phase difference between the channels to obtain the input characteristic of the first channel; and combining the logarithmic power spectrum of the time-frequency domain pickup signal of the right channel and the phase difference between the channels to obtain the input characteristic of the second channel.

Specifically, the per-channel input characteristics are:

where n is the number of the data frame, m is the number of the channel,

is the phase difference of the time-frequency domain signal of the mth channel.

Step S103: calculating to obtain a phase sensitivity masking estimation value of the first channel by using the time-frequency domain pickup signal of the first channel and the time-frequency domain direct sound signal corresponding to the time-frequency domain pickup signal; and calculating to obtain a phase sensitivity masking estimation value of the second channel by using the time-frequency domain pickup signal of the second channel and the time-frequency domain direct sound signal corresponding to the time-frequency domain pickup signal.

Specifically, the per-channel phase sensitive mask is:

is direct sound dataThe phase of the time-frequency domain signal,

Step S104: and training the neural network by using the input characteristics of each channel and the corresponding theoretical phase sensitivity masking to obtain an estimation model of the phase sensitivity masking.

In one embodiment, the neural network is a three-layer long-term memory network with 512 nodes in each layer. And taking the phase sensitive masking theoretical value as a training target of the neural network, and continuously reducing the mean square error of the phase sensitive masking estimated value and the phase sensitive masking theoretical value through iteration.

Step S105: taking the input characteristics of the first channel as the input of the estimation model, and outputting the phase sensitivity masking estimation value of the first channel; and taking the input characteristics of the second channel as the input of the probability estimation model, and outputting the estimated value of the phase sensitive masking of the second channel.

Specifically, the estimates of per-channel phase sensitive masking are:

step S106: a speech covariance matrix is calculated using each channel time-frequency domain picked-up signal and each channel time-frequency domain phase sensitive masking estimate together.

Specifically, the speech covariance matrix is:

step S107: and carrying out eigenvalue decomposition on the voice covariance matrix to obtain a main eigenvector of the voice covariance matrix as a guide vector of the sound source.

Specifically, the steering vector is:

step S108: and taking the phase angle difference of the two elements of the guide vector as direction information.

Specifically, the direction information is:

step S109: and calculating the estimation accuracy of the direction information of each time frequency point by using the two-channel phase sensitive masking estimation value.

Specifically, the accuracy of the direction information estimation is:

step S110: the ideal phase difference of the picked-up data of the two microphones is calculated as target direction information by using the time difference of the sound source reaching the microphones.

Specifically, the target direction information is:

wherein the content of the first and second substances,

and

Step S111: and training the neural network by using the direction information, the direction information estimation accuracy and the target direction information to obtain a direction information enhancement model.

In one embodiment, the neural network is a fully-connected neural network with three layers, each layer having 2048 nodes.

Specifically, the input features of the neural network are a splicing vector of a sine value and a cosine value of the direction information and estimation accuracy of the direction information:

I_n＝[sinθ_n,0,…,sinθ_n,F-1,cosθ_n,0,…,cosθ_n,F-1,W_n,0,…,W_n,F-1]

specifically, the estimated target of the neural network is target direction information:

Step S112: and taking the direction information and the direction information estimation accuracy as the input of a direction information enhancement model, and outputting the direction information and the direction information estimation accuracy as enhanced direction information.

Specifically, the enhanced direction information is:

wherein the content of the first and second substances,

is the output value of the enhancement model.

Step S113: the sound source direction is calculated at each time-frequency point using the enhanced direction information.

Specifically, the sound source direction calculated at each time-frequency point is:

where c is the sound propagation velocity and d is the microphone pitch.

Step S114: and constructing a weighted statistical histogram by using the estimation accuracy of the direction characteristics and the direction information at all the time-frequency points.

Specifically, when the weighted histogram is constructed, the weight of each time-frequency point is W_n,f。

Step S115: and selecting the direction with the largest statistical result as the sound source direction by using the weighted histogram.

Specifically, the direction of the maximum statistical result is:

the embodiment of the invention provides a two-channel sound source positioning method based on deep learning, which estimates phase-sensitive masking by simultaneously utilizing spatial information and spectral information, estimates direction information by taking the phase-sensitive masking as guidance, enhances the direction information through a neural network, and finally determines the final sound source position through a weighted statistical histogram. By containing enough noise types and orientations in the training data, the generalization capability of the deep neural network can be fully utilized, the robustness of the model is improved, and the purpose of estimating the orientation of the sound source in the noise reverberation environment is achieved.

The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, it should be understood that the above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A dual-channel sound source positioning method based on deep learning is characterized by comprising the following steps:

respectively performing framing, windowing and Fourier transformation on the microphone pickup data of the left channel and the right channel to obtain time-frequency domain pickup signals of the first channel and the second channel; the double-channel time-frequency domain signal comprises information of the position of a sound source;

taking the input characteristics of the first channel as the input of an estimation model, and outputting a phase sensitive masking estimation value of the first channel; taking the input characteristics of the second channel as the input of a probability estimation model, and outputting an estimated value of the phase sensitive masking of the second channel;

calculating a voice covariance matrix by using the picked-up signal of each channel and the phase sensitivity masking estimated value of each channel;

performing eigenvalue decomposition on the voice covariance matrix to obtain a main eigenvector of the voice covariance matrix as a guide vector of a sound source;

taking the phase angle difference of the two elements of the guide vector as direction information;

constructing a weighted statistical histogram by using the estimation accuracy of the direction characteristics and the direction information at all time-frequency points;

and selecting the direction with the largest statistical result as the sound source direction by utilizing the weighted histogram.

2. The method of claim 1, wherein the step of performing framing, windowing and fourier transform on the microphone picked data of each channel respectively comprises:

3. The method of claim 1, wherein the per-channel input features are:

where n is the number of the data frame, m is the number of the channel,

is the phase difference of the time-frequency domain signal of the mth channel;

the per-channel phase sensitive mask is:

is the phase of the time-frequency domain signal of the direct sound data,

4. The method according to claim 1, wherein the step of training the neural network using the input features of each channel and the theoretical phase-sensitive mask corresponding thereto to obtain the estimation model of the phase-sensitive mask comprises:

the neural network is a three-layer long-time memory network, and each layer is provided with 512 nodes; taking the phase sensitive masking theoretical value as a training target of the neural network, and continuously reducing the mean square error of the phase sensitive masking estimated value and the phase sensitive masking theoretical value through iteration; the estimated values of the per-channel phase sensitive masking are:

5. the method of claim 1, wherein the weighted histogram is constructed such that each time-frequency point has a weight of W_n，f。

6. The method of claim 1, wherein the direction of the largest statistical result is: