CN111239687A - Sound source positioning method and system based on deep neural network - Google Patents

Sound source positioning method and system based on deep neural network

Info

Publication number
CN111239687A
Authority
CN
China
Prior art keywords
neural network
deep neural
sound source
microphone
vector
Prior art date
Legal status
Granted
Application number
CN202010050760.9A
Other languages
Chinese (zh)
Other versions
CN111239687B (en)
Inventor
Zhang Qiaoling (张巧灵)
Tang Roubing (唐柔冰)
Ma Han (马晗)
Current Assignee
Zhejiang Sci-Tech University (ZSTU)
Zhejiang University of Science and Technology (ZUST)
Original Assignee
Zhejiang Sci-Tech University (ZSTU)
Priority date
Filing date
Publication date
Application filed by Zhejiang Sci-Tech University (ZSTU)
Priority to CN202010050760.9A
Publication of CN111239687A
Application granted
Publication of CN111239687B
Status: Active


Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22: Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a positioning method, which comprises the following steps: S1, acquiring the voice signals received by the microphones and generating a voice data set; S2, preprocessing the voice signals in the voice data set; S3, calculating the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the voice signals; S4, acquiring the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay; S5, combining the TDOA observations and the amplitudes as the input vector, taking the three-dimensional spatial position coordinates corresponding to the sound source signal as the output vector, and combining the input vector and the output vector to generate the feature vector; S6, preprocessing the feature vectors; S7, setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network; and S8, feeding the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial coordinates of the sound source signal.

Description

Sound source positioning method and system based on deep neural network
Technical Field
The invention relates to the technical field of indoor sound source positioning, in particular to a sound source positioning method and system based on a deep neural network.
Background
In recent years, intelligent service products (such as smart speakers and smart-home devices) have been widely used in daily life, and to deliver a good user experience, the human-computer interaction capability of these products has attracted increasing attention. In human-computer interaction, voice communication is indispensable: users can issue voice commands directly, and the machine recognizes them and provides the corresponding service without manual operation. At present, in near-field speech recognition scenarios (such as on a mobile phone), the quality of the speech signal received by the microphone is high and the recognition rate meets practical requirements. However, in far-field scenarios such as smart homes, the speech captured by the microphone is of poor quality, the recognition rate is low, and practical requirements cannot be met. Solving the far-field speech recognition problem has therefore become a research hotspot of institutions at home and abroad in recent years. Estimating the sound source position with a localization algorithm at the front end of speech recognition, enhancing the signal from the estimated direction, and attenuating interference from other directions improves both speech quality and recognition rate, and can effectively support the practical deployment of far-field speech recognition applications. In particular, effective sound source localization prior to speech recognition is of great practical significance.
Classical localization algorithms are mainly two-dimensional sound source localization algorithms, and they fall into three categories. The first is algorithms based on Time Difference of Arrival (TDOA). A time delay estimation algorithm, also called a TDOA algorithm, determines the position of a sound source from the difference in the times at which two microphones at different positions receive the same source signal. The delay corresponding to the maximum peak of the Generalized Cross-Correlation (GCC) function of the signals received by two microphones is taken as the delay estimate, and the geometric constraints of the microphone array then yield the source position estimate. This method is easily affected by environmental noise and indoor reverberation: when noise is strong or reverberation is severe, many spurious peaks appear in the GCC function, so a wrong TDOA value is easily estimated and a wrong source position results. The second is algorithms based on spatial spectrum estimation, whose basic idea is to determine the direction angle and position of the source from the spatial spectrum. Because estimating a spatial signal is analogous to frequency estimation of a time-domain signal, spatial spectrum estimation can be generalized from time-domain nonlinear spectral methods; however, these algorithms assume continuously distributed signal sources and a stationary field, which greatly limits their application. A typical family of spatial spectrum algorithms is the eigen-subspace algorithms, which divide into subspace decomposition methods, chiefly Multiple Signal Classification (MUSIC) and Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT), and subspace fitting methods, chiefly Maximum Likelihood (ML) and Weighted Subspace Fitting (WSF). The third is algorithms based on steerable beam response, which search globally over the microphone array for the location with the largest energy, i.e., the source location. Typically, the speech signals collected by the microphones are filtered, weighted, and summed to form a beam, and the point maximizing the beam output power is taken as the source position. These algorithms divide into delay-and-sum beamforming and adaptive beamforming: delay-and-sum introduces little signal distortion and has a small computational cost, but its interference resistance is weak and it is easily affected by noise, whereas the adaptive approach is computationally heavy and introduces some signal distortion but resists interference well.
Multi-modal fusion algorithms are currently used for sound source localization in three-dimensional space; a representative one is the audio-visual fusion algorithm. The sound source position is usually estimated jointly from face position information collected by a camera and direction-of-arrival (DOA) estimates obtained from the microphones. Such algorithms avoid the limitations of traditional image tracking (the number of cameras and the illumination intensity) as well as those of traditional sound source tracking (background noise and indoor reverberation), greatly reducing the influence of environmental factors. However, multi-modal fusion still requires many parameters to be set, and when the environment changes, the robustness of the algorithm degrades.
In recent years, sound source localization with neural networks has been a popular research direction, especially since the rise of deep learning. Such studies usually extract feature vectors from the speech signals and then feed them into a neural network for training. The common speech feature vector consists of the TDOAs of multiple microphone pairs and does not use the amplitude information associated with those TDOAs, even though the amplitude at a TDOA peak reflects, to some extent, the reliability of that TDOA.
In general, sound source positioning based on deep neural networks is a research hotspot within the indoor sound source localization problem, and this research is of great significance for the practical deployment of many current audio applications, such as intelligent voice interaction. However, deep-neural-network-based sound source positioning has not yet been researched thoroughly, and existing results are deficient in one respect or another.
Disclosure of Invention
The invention aims to provide a sound source positioning method and system based on a deep neural network that address the defects of the prior art. The method takes the estimated time delays τ̂_m together with their corresponding amplitudes R_m(τ̂_m) as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector of the deep neural network, so it is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a sound source positioning method based on a deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network, and comprises the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input feature vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
Further, in step S1, the set of microphone nodes is V = {1, 2, …, M}; each microphone node m comprises two microphones, where m ∈ V; M denotes the total number of microphone nodes.
Further, the step S2 is specifically to perform a first preprocessing on the speech signals received by the two microphones in the microphone node m, where the first preprocessing includes framing, windowing, and pre-emphasis.
Further, the step S3 is specifically to calculate the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone voice signals in the preprocessed microphone node m, expressed as:

R_m(τ) = ∫ [X_m1(ω) X_m2*(ω) / |X_m1(ω) X_m2*(ω)|] e^(jωτ) dω

where m ∈ V; X_m1(ω) and X_m2(ω) denote the frequency-domain representations of the time-domain microphone signals x_m1(t) and x_m2(t) at node m; the symbol * denotes complex conjugation.
Further, the step S4 obtains the time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ), expressed as:

τ̂_m = argmax_τ R_m(τ)

and obtains the amplitude R_m(τ̂_m) corresponding to the time delay τ̂_m.
Further, the step S5 is specifically:

combining the time delays τ̂_m of all nodes and their corresponding amplitudes R_m(τ̂_m) as the input vector I of the deep neural network:

I = [τ̂_1, R_1(τ̂_1), τ̂_2, R_2(τ̂_2), …, τ̂_M, R_M(τ̂_M)]^T

taking the three-dimensional space position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:

Q = [q_x, q_y, q_z]^T

and combining the input vector I and the output vector Q to generate the feature vector G:

G = (I, Q)^T
further, the second preprocessing in step S6 includes data cleaning, data disordering, and data normalization.
Further, the cross-validation employed in step S8 includes leave-one-out validation.
Correspondingly, a sound source positioning system based on a deep neural network is also provided, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module, used for setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network.
Further, the system also comprises:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
Compared with the prior art, the method takes the estimated time delays τ̂_m and their corresponding amplitudes R_m(τ̂_m) as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector of the deep neural network, so it is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
Drawings
FIG. 1 is a flowchart of a sound source localization method based on a deep neural network according to an embodiment;
FIG. 2 is a schematic top view of a simulation environment provided by an embodiment, wherein a circle represents a position of a microphone;
FIG. 3 is a flow chart of a training phase of the deep neural network provided in one embodiment;
FIG. 4 is a flowchart illustrating a testing phase of the deep neural network according to an embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
The invention aims to provide a sound source positioning method and system based on a deep neural network, aiming at the defects of the prior art.
Example one
The embodiment provides a sound source localization method based on a deep neural network, which includes a training phase of the deep neural network and a testing phase of the deep neural network, as shown in fig. 1-2, and includes the steps of:
s11, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s12, performing first preprocessing on the voice signals in the generated voice data set;
s13, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s14, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s15, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s16, performing second preprocessing on the generated feature vectors;
s17, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S18, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
In the present embodiment, a distributed microphone array is specifically described:
the specific simulation settings are as follows: the simulated environment is a typical conference room of size 4.1m x 3.1m x 3m with a total of L-12 randomly distributed microphones. The distance between two microphones in each microphone node is Dm-0.6 m. For simplicity, the microphone is positioned in a plane having a height of 1.75 m. The sound propagation speed is c 343 m/s. In this embodiment, the original non-reverberant speech signal is a single-channel pure male english pronunciation with a sampling frequency of 16kHz, and the frame length of the speech signal is 120 ms. The room reverberation time T60 is 0.1s, the SNR is 20dB, and the number of monte carlo experiments is 50. The distributed microphone array has M microphone nodes in total, i.e. the set V of microphone nodes is {1,2, …, M }. Each microphone node m contains two microphones, where m ∈ V.
In step S11, the voice signals received by the microphones are acquired and made into a voice data set, where the voice data set comprises a training data set and a testing data set.
In the present embodiment, the sound source positions are set in a horizontal layer between heights of 1.5 m and 1.7 m, and 24000 position samples are uniformly acquired as the data set of the neural network. In the MATLAB simulation environment, the Image model is first used to simulate the room impulse response; the original non-reverberant speech signal is then convolved with the room impulse response and Gaussian white noise is added, finally simulating the signals received by the microphones.
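As a concrete illustration of this data-generation step, the sketch below reproduces the described setup (4.1 m × 3.1 m × 3 m room, T60 = 0.1 s, 16 kHz sampling, 12 microphones at a height of 1.75 m, 20 dB SNR) using the pyroomacoustics package in place of the MATLAB Image-model code. The node placement, the random seed, and the white-noise signal standing in for the dry speech recording are illustrative assumptions, not part of the patent.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000                       # sampling frequency (Hz)
room_dim = [4.1, 3.1, 3.0]       # conference-room size (m)
t60 = 0.1                        # reverberation time T60 (s)

# Derive wall absorption and image-source order from the target T60.
e_absorption, max_order = pra.inverse_sabine(t60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

# 12 microphones: 6 nodes of 2 mics each, 0.6 m apart, in the z = 1.75 m plane.
rng = np.random.default_rng(0)   # fixed seed, illustrative only
mics = []
for _ in range(6):
    cx, cy = rng.uniform([0.5, 0.5], [3.6, 2.6])
    mics.append([cx - 0.3, cy, 1.75])
    mics.append([cx + 0.3, cy, 1.75])
room.add_microphone_array(pra.MicrophoneArray(np.array(mics).T, room.fs))

# One source-position sample (heights between 1.5 m and 1.7 m in the patent);
# white noise stands in for the dry single-channel male English speech.
speech = rng.standard_normal(fs)
room.add_source([2.0, 1.5, 1.6], signal=speech)

room.simulate(snr=20)                 # convolve with RIRs, add noise at 20 dB SNR
mic_signals = room.mic_array.signals  # array of shape (12, n_samples)
```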
In step S12, a first pre-processing is performed on the speech signal within the generated speech data set.
Specifically, the method comprises the steps of performing first preprocessing on voice signals received by two microphones in a microphone node m, wherein the first preprocessing comprises framing, windowing and pre-emphasis.
A rectangular window is used to window the speech signal; the window function ω(n) of the rectangular window is:

ω(n) = 1 for 0 ≤ n ≤ N − 1, and ω(n) = 0 otherwise

where N represents the length of the window function.
The formula for pre-emphasis is:

H(z) = 1 − αz^(−1)

where α denotes the pre-emphasis coefficient, in the range 0.9 < α < 1.0. In the present embodiment, the length of the window function equals the frame length, and the pre-emphasis coefficient is α = 0.97.
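A minimal sketch of this first preprocessing, assuming the values given in this embodiment (120 ms frames at 16 kHz, rectangular window, α = 0.97); the function name and the simple non-overlapping framing are illustrative assumptions.

```python
import numpy as np

def first_preprocess(x, fs=16000, frame_ms=120, alpha=0.97):
    # Pre-emphasis y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^-1.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Non-overlapping frames of N = fs * 0.12 = 1920 samples each.
    n = int(fs * frame_ms / 1000)
    frames = y[:(len(y) // n) * n].reshape(-1, n)
    # Rectangular window: all ones, so windowing leaves the frames unchanged.
    return frames * np.ones(n)
```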
In step S13, a phase weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signal is calculated.
Specifically, the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone voice signals in the preprocessed microphone node m is calculated as:

R_m(τ) = ∫ [X_m1(ω) X_m2*(ω) / |X_m1(ω) X_m2*(ω)|] e^(jωτ) dω

where m ∈ V; X_m1(ω) and X_m2(ω) denote the frequency-domain representations of the time-domain microphone signals x_m1(t) and x_m2(t) at node m; the symbol * denotes complex conjugation. In the present embodiment, M = 6.
In step S14, acquiring delay information corresponding to the peak of the phase-weighted generalized cross-correlation function, and taking the acquired delay information as a TDOA observation of the arrival of the sound source signal at the microphone; and obtaining the amplitude corresponding to the time delay information.
The time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ) is obtained and taken as the TDOA observation of the arrival of the sound source signal S at microphone node m, expressed as:

τ̂_m = argmax_τ R_m(τ), τ ∈ [−τ_max, τ_max]

where τ_max represents the theoretical maximum time delay (TDOA) of the arrival of the sound source signal S at microphone node m, i.e., τ_max = ‖u_m1 − u_m2‖/c = D_m/c, with u_m1 and u_m2 denoting the positions of the microphone pair contained at node m, so that the true delay (‖S − u_m1‖ − ‖S − u_m2‖)/c, defined by the distances from the source S to that microphone pair, always lies within this interval; c represents the sound propagation speed, and ‖·‖ represents the Euclidean norm. The amplitude R_m(τ̂_m) corresponding to the time delay τ̂_m (i.e., the TDOA observation) is then obtained.
TDOA localization is a method of localization using time differences. By measuring the arrival time of the signal at a monitoring station, the distance to the signal source can be determined, and the source could in principle be located from its distances to several monitoring stations (drawing a circle centered at each station with that distance as radius). In practice, however, the absolute arrival time is difficult to measure; instead, comparing the differences in arrival time between monitoring stations defines hyperbolas whose foci are the stations and whose constant distance difference fixes the transverse axis, and the intersection of these hyperbolas gives the position of the source.
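The following sketch implements steps S13-S14 for one microphone node under the definitions above: GCC-PHAT computed via the FFT, the peak searched only within the feasible lag range ±τ_max = D_m/c, and both the TDOA observation and its peak amplitude returned. The FFT length, the small regularizer in the denominator, and the function name are assumptions of this sketch.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs=16000, d=0.6, c=343.0):
    # PHAT-weighted generalized cross-correlation of one microphone pair.
    nfft = 2 * max(len(x1), len(x2))           # zero-pad against circular wrap
    X1 = np.fft.rfft(x1, nfft)
    X2 = np.fft.rfft(x2, nfft)
    cross = X1 * np.conj(X2)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), nfft)
    r = np.concatenate((r[-nfft // 2:], r[:nfft // 2]))  # lag 0 at index nfft//2

    # Restrict the peak search to the physically feasible lags |tau| <= d / c.
    max_lag = int(d / c * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    window = r[nfft // 2 + lags]
    k = int(np.argmax(window))
    return lags[k] / fs, float(window[k])      # (TDOA observation, amplitude)
```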
In step S15, the TDOA observations are combined with the amplitudes as input vectors for a deep neural network, the three-dimensional spatial location coordinates corresponding to the acoustic source signal are used as output vectors for the neural network, and the input vectors and the output vectors are combined to generate feature vectors.
The method specifically comprises the following steps: the time delays τ̂_m (i.e., the TDOA observations) and their corresponding amplitudes R_m(τ̂_m) are combined as the input vector I of the deep neural network:

I = [τ̂_1, R_1(τ̂_1), τ̂_2, R_2(τ̂_2), …, τ̂_M, R_M(τ̂_M)]^T

the three-dimensional space position coordinates Q corresponding to the sound source signal S are taken as the output vector of the neural network:

Q = [q_x, q_y, q_z]^T

and the input vector I and the output vector Q are combined to generate the feature vector G:

G = (I, Q)^T
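Building on the gcc_phat_tdoa sketch above (assumed to be in scope), step S15 can be illustrated as follows; the pairing of adjacent channels into nodes and the ordering of elements inside I are assumptions consistent with the 12-input, 3-output network described later.

```python
import numpy as np

def build_feature_vector(mic_signals, source_pos, fs=16000):
    # mic_signals: array of shape (12, n_samples); channels 2m and 2m+1
    # are assumed to form microphone node m (an illustrative convention).
    pairs = [gcc_phat_tdoa(mic_signals[2 * m], mic_signals[2 * m + 1], fs)
             for m in range(6)]
    # Input vector I: the M = 6 TDOA observations interleaved with amplitudes.
    I = np.array([v for tau, amp in pairs for v in (tau, amp)])
    Q = np.asarray(source_pos, dtype=float)   # output vector [qx, qy, qz]
    return np.concatenate((I, Q))             # feature vector G = (I, Q)
```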
in step S16, a second preprocessing is performed on the generated feature vector. And the second preprocessing comprises data cleaning, data disordering and data normalization.
The normalization adopts the min-max method; the transformation function is:

g̃ = (g − g_min) / (g_max − g_min)

where g_min and g_max respectively represent the minimum and maximum values in the sample feature vector G, and g̃ denotes the normalized sample data. After the neural network is trained, its output must be inverse-normalized to recover the physical data values, which in this embodiment are the three-dimensional spatial position of the sound source point.

The transformation function of the inverse normalization is:

g = g̃ (g_max − g_min) + g_min

where g_min and g_max respectively represent the minimum and maximum values in the sample feature vector G, g̃ is the normalized sample data, and g is the recovered value.
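A sketch of the min-max normalization and its inverse as defined above; applying the statistics per feature dimension over the training set is an assumption, since the patent does not state whether g_min and g_max are global or per-dimension.

```python
import numpy as np

def minmax_fit(G):                  # G: array of shape (n_samples, n_features)
    return G.min(axis=0), G.max(axis=0)

def minmax_apply(G, g_min, g_max):
    # Forward transform: g~ = (g - g_min) / (g_max - g_min).
    return (G - g_min) / (g_max - g_min)

def minmax_invert(G_norm, g_min, g_max):
    # Inverse transform g = g~ * (g_max - g_min) + g_min, used to recover the
    # physical three-dimensional coordinates from the network output.
    return G_norm * (g_max - g_min) + g_min
```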
In step S17, in the training phase of the deep neural network, parameters related to the deep neural network are set, and the deep neural network is trained by using the feature vectors of the training set, so as to obtain a trained deep neural network.
In this embodiment, the number of input-layer neurons of the deep neural network (DNN) is set to 12 and the number of output-layer neurons to 3. Three hidden layers are used: the first hidden layer has 12 neurons with the tanh activation function, the second has 15 neurons with the tanh activation function, and the third has 3 neurons with the tanh activation function.
In this embodiment, the loss function of the neural network is set to the mean squared error (MSE) between the true spatial position vector Q and the predicted estimate vector P of the neural network, expressed as:

Loss = (1/U) Σ_{u=1}^{U} ‖Q_u − P_u‖²

where U is the total number of samples in the current iteration of the neural network over the data set.
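A hedged sketch of this network (12 inputs; hidden layers of 12, 15, and 3 tanh units; 3 outputs; MSE loss), written in PyTorch; the linear output layer, the Adam optimizer, and the learning rate are assumptions, since the patent fixes only the topology, the activations, and the loss.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(12, 12), nn.Tanh(),   # hidden layer 1: 12 tanh units
    nn.Linear(12, 15), nn.Tanh(),   # hidden layer 2: 15 tanh units
    nn.Linear(15, 3), nn.Tanh(),    # hidden layer 3: 3 tanh units
    nn.Linear(3, 3),                # output layer: predicted [px, py, pz]
)
loss_fn = nn.MSELoss()              # MSE between true Q and predicted P
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(I_batch, Q_batch):
    # One gradient step on a batch of (input vector, true position) pairs.
    optimizer.zero_grad()
    loss = loss_fn(model(I_batch), Q_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```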
In step S18, in the testing stage of the deep neural network, the input vectors of the test set are transmitted into the trained deep neural network for prediction, so as to obtain the three-dimensional spatial position coordinates of the sound source signal, and the performance of the deep neural network model is evaluated by using cross validation.
The input vectors of the test set are fed into the trained deep neural network, which predicts the three-dimensional spatial position coordinates P = [p_x, p_y, p_z]^T of the sound source signal, and the performance of the deep neural network model is evaluated with cross-validation.
In this embodiment, the total number of samples in the data set is 24000, and the performance of the neural network is tested with the hold-out cross-validation scheme described above as leave-one-out over folds: 4000 sample points are held out as the test set and the remaining 20000 samples form the training set; the data just tested becomes part of the training set in the next round, and the process is repeated until no new sample data remains to be predicted (in effect, six-fold cross-validation).
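The fold scheme just described can be sketched as follows: 24000 samples split into six folds of 4000, each fold serving once as the test set while the remaining 20000 train the network; shuffling with a fixed seed and training a fresh model per fold are assumptions of this sketch.

```python
import numpy as np

def six_fold_splits(n_samples=24000, fold_size=4000, seed=0):
    # Each of the 6 folds of 4000 samples serves once as the test set;
    # the remaining 20000 samples form the training set for that round.
    idx = np.random.default_rng(seed).permutation(n_samples)
    for k in range(n_samples // fold_size):
        test = idx[k * fold_size:(k + 1) * fold_size]
        train = np.concatenate((idx[:k * fold_size], idx[(k + 1) * fold_size:]))
        yield train, test
```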
The sound source positioning method based on the deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network.
As shown in FIG. 3, in the training phase of the deep neural network, steps S11-S17 are included.
As shown in FIG. 4, in the testing stage of the deep neural network, steps S11-S16, S18 are included.
It should be noted that, in the testing phase of this embodiment, a trained deep neural network is obtained based on the training phase, and then test positioning is performed.
Compared with the prior art, this embodiment takes the estimated time delays τ̂_m and the amplitudes R_m(τ̂_m) corresponding to the maximum peaks of R_m(τ) as the input vector of the deep neural network, and the three-dimensional space coordinates as the output vector of the deep neural network, so it is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
Example two
The embodiment provides a sound source positioning system based on a deep neural network, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module, used for setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network.
Further, the system also comprises:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
It should be noted that the sound source localization system based on a deep neural network in this embodiment is similar to Embodiment One and will not be described again here.
Compared with the prior art, this embodiment takes the estimated time delays τ̂_m and the amplitudes R_m(τ̂_m) corresponding to the maximum peaks of R_m(τ) as the input vector of the deep neural network, and the three-dimensional space coordinates as the output vector of the deep neural network, so it is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A sound source positioning method based on a deep neural network is characterized by comprising a training stage of the deep neural network and a testing stage of the deep neural network, and comprises the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
2. The method as claimed in claim 1, wherein the set of microphone nodes in step S1 is V = {1, 2, …, M}; each microphone node m comprises two microphones, wherein m ∈ V; M denotes the total number of microphone nodes.
3. The method for sound source localization based on deep neural network as claimed in claim 2, wherein the step S2 is specifically to perform a first pre-processing on the speech signals received by two microphones in the microphone node m, and the first pre-processing includes framing, windowing and pre-emphasis.
4. The method for sound source localization according to claim 2, wherein the step S3 is specifically to calculate the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone voice signals within the preprocessed microphone node m, expressed as:

R_m(τ) = ∫ [X_m1(ω) X_m2*(ω) / |X_m1(ω) X_m2*(ω)|] e^(jωτ) dω

wherein m ∈ V; X_m1(ω) and X_m2(ω) denote the frequency-domain representations of the time-domain microphone signals x_m1(t) and x_m2(t) at node m; the symbol * denotes complex conjugation.
5. The sound source localization method based on the deep neural network as claimed in claim 4, wherein the step S4 obtains the time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ), expressed as:

τ̂_m = argmax_τ R_m(τ)

and obtains the amplitude R_m(τ̂_m) corresponding to the time delay τ̂_m.
6. The sound source localization method based on the deep neural network as claimed in claim 5, wherein the step S5 specifically comprises:

combining the time delays τ̂_m and their corresponding amplitudes R_m(τ̂_m) as the input vector I of the deep neural network:

I = [τ̂_1, R_1(τ̂_1), τ̂_2, R_2(τ̂_2), …, τ̂_M, R_M(τ̂_M)]^T

taking the three-dimensional space position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:

Q = [q_x, q_y, q_z]^T

and combining the input vector I and the output vector Q to generate the feature vector G:

G = (I, Q)^T
7. The method for sound source localization based on deep neural network of claim 6, wherein the second preprocessing in step S6 includes data cleaning, data shuffling, and data normalization.
8. The method for sound source localization based on deep neural network of claim 7, wherein the cross-validation employed in step S8 comprises leave-one-out validation.
9. A sound source localization system based on a deep neural network, comprising:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
and the training module is used for setting parameters related to the deep neural network and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network.
10. The deep neural network-based sound source localization system according to claim 9, further comprising:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
CN202010050760.9A 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network Active CN111239687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Publications (2)

Publication Number Publication Date
CN111239687A (en) 2020-06-05
CN111239687B (en) 2021-12-14

Family

ID=70872716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010050760.9A Active CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Country Status (1)

Country Link
CN (1) CN111239687B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN111965600A (en) * 2020-08-14 2020-11-20 长安大学 Indoor positioning method based on sound fingerprints in strong shielding environment
CN111981644A (en) * 2020-08-26 2020-11-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN112180318A (en) * 2020-09-28 2021-01-05 深圳大学 Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method
CN113111765A (en) * 2021-04-08 2021-07-13 浙江大学 Multi-voice source counting and positioning method based on deep learning
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114545332A (en) * 2022-02-18 2022-05-27 桂林电子科技大学 Arbitrary array sound source positioning method based on cross-correlation sequence and neural network
CN115267671A (en) * 2022-06-29 2022-11-01 金茂云科技服务(北京)有限公司 Distributed voice interaction terminal equipment and sound source positioning method and device thereof
WO2022263710A1 (en) * 2021-06-17 2022-12-22 Nokia Technologies Oy Apparatus, methods and computer programs for obtaining spatial metadata
CN115980668A (en) * 2023-01-29 2023-04-18 桂林电子科技大学 Sound source localization method based on generalized cross correlation of wide neural network
CN116304639A (en) * 2023-05-05 2023-06-23 上海玫克生储能科技有限公司 Identification model generation method, identification system, identification device and identification medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103576126A (en) * 2012-07-27 2014-02-12 姜楠 Four-channel array sound source positioning system based on neural network
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sound source direction estimation method based on time-frequency masking and deep neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103576126A (en) * 2012-07-27 2014-02-12 姜楠 Four-channel array sound source positioning system based on neural network
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sound source direction estimation method based on time-frequency masking and deep neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHARATH ADAVANNE et al.: "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks", IEEE Journal of Selected Topics in Signal Processing *
王义圆 (Wang Yiyuan): "Research on target detection and signal enhancement technology based on microphone arrays", China Excellent Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology *
祖丽楠 (Zu Linan) et al.: "Design of a generalized cross-correlation time delay estimation method based on neural network filtering", Control and Instruments in Chemical Industry *
黎长江 (Li Changjiang) et al.: "Research on phoneme recognition based on recurrent neural networks", Microelectronics & Computer *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN111965600A (en) * 2020-08-14 2020-11-20 长安大学 Indoor positioning method based on sound fingerprints in strong shielding environment
CN111981644A (en) * 2020-08-26 2020-11-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN111981644B (en) * 2020-08-26 2021-09-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN112180318A (en) * 2020-09-28 2021-01-05 深圳大学 Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method
CN112180318B (en) * 2020-09-28 2023-06-27 深圳大学 Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN113111765A (en) * 2021-04-08 2021-07-13 浙江大学 Multi-voice source counting and positioning method based on deep learning
WO2022263710A1 (en) * 2021-06-17 2022-12-22 Nokia Technologies Oy Apparatus, methods and computer programs for obtaining spatial metadata
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114545332A (en) * 2022-02-18 2022-05-27 桂林电子科技大学 Arbitrary array sound source positioning method based on cross-correlation sequence and neural network
CN114545332B (en) * 2022-02-18 2024-05-03 桂林电子科技大学 Random array sound source positioning method based on cross-correlation sequence and neural network
CN115267671A (en) * 2022-06-29 2022-11-01 金茂云科技服务(北京)有限公司 Distributed voice interaction terminal equipment and sound source positioning method and device thereof
CN115980668A (en) * 2023-01-29 2023-04-18 桂林电子科技大学 Sound source localization method based on generalized cross correlation of wide neural network
CN116304639A (en) * 2023-05-05 2023-06-23 上海玫克生储能科技有限公司 Identification model generation method, identification system, identification device and identification medium

Also Published As

Publication number Publication date
CN111239687B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN111239687B (en) Sound source positioning method and system based on deep neural network
CN111025233B (en) Sound source direction positioning method and device, voice equipment and system
Evers et al. The LOCATA challenge: Acoustic source localization and tracking
Salvati et al. Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions
Zhang et al. Why does PHAT work well in low noise, reverberative environments?
Aarabi et al. Robust sound localization using multi-source audiovisual information fusion
CN107102296A (en) A kind of sonic location system based on distributed microphone array
Nakadai et al. Improvement of recognition of simultaneous speech signals using av integration and scattering theory for humanoid robots
Liu et al. Continuous sound source localization based on microphone array for mobile robots
Vesperini et al. Localizing speakers in multiple rooms by using deep neural networks
Raykar et al. Speaker localization using excitation source information in speech
Hu et al. Unsupervised multiple source localization using relative harmonic coefficients
WO2020024816A1 (en) Audio signal processing method and apparatus, device, and storage medium
CN114171041A (en) Voice noise reduction method, device and equipment based on environment detection and storage medium
CN113514801A (en) Microphone array sound source positioning method and sound source identification method based on deep learning
Zhu et al. Gaussian filter for TDOA based sound source localization in multimedia surveillance
Zhang et al. AcousticFusion: Fusing sound source localization to visual SLAM in dynamic environments
Yang et al. Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization
Rascon et al. Lightweight multi-DOA tracking of mobile speech sources
Zhao et al. Accelerated steered response power method for sound source localization via clustering search
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN112712818A (en) Voice enhancement method, device and equipment
Liu et al. Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System
Abutalebi et al. Performance improvement of TDOA-based speaker localization in joint noisy and reverberant conditions
Pertilä et al. Time Difference of Arrival Estimation with Deep Learning–From Acoustic Simulations to Recorded Data

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant