CN114895245A - Microphone array sound source positioning method and device and storage medium - Google Patents

Microphone array sound source positioning method and device and storage medium

Info

Publication number
CN114895245A
CN114895245A
Authority
CN
China
Prior art keywords
frame
signal
sound source
microphone
microphone array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210427289.XA
Other languages
Chinese (zh)
Inventor
王子怡
赵小燕
戎洪军
童莹
芮雄丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202210427289.XA priority Critical patent/CN114895245A/en
Publication of CN114895245A publication Critical patent/CN114895245A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a microphone array sound source positioning method, a microphone array sound source positioning device and a storage medium. The method comprises: obtaining a test signal; preprocessing the test signal to obtain a single-frame test signal; extracting a spatial localization cue from the single-frame test signal and using it as a test sample; and inputting the test sample into a pre-constructed and trained CRN model for testing, obtaining the probability that the test signal belongs to each azimuth angle, with the azimuth of maximum probability taken as the azimuth angle estimate for that frame.

Description

Microphone array sound source positioning method and device and storage medium
Technical Field
The invention relates to a microphone array sound source positioning method, a microphone array sound source positioning device and a storage medium, and belongs to the technical field of sound source positioning.
Background
The sound source localization technology based on microphone arrays has wide application prospects and potential economic value in speech recognition, front-end processing for speaker recognition systems, video conferencing, intelligent robots, smart homes, and the like. Localization algorithms based on time-delay differences and localization algorithms based on SRP-PHAT (Steered Response Power with Phase Transform) are two typical traditional localization methods; although both are easy to implement, they have low robustness to reverberation and noise.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a microphone array sound source positioning method, device, and storage medium that markedly improve the positioning performance and have better generalization capability to unknown noise and reverberation environments.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a microphone array sound source localization method, including:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
Further, the method for constructing and training the CRN model includes:
convolving the pure voice signal with the room impulse response of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
preprocessing the microphone array signals to obtain single frame signals;
extracting spatial positioning clues of a plurality of single-frame signals, using the spatial positioning clues as training samples of a CRN (Convolutional Residual Network) model, marking the corresponding direction of each sample, and using it as the class label of the sample;
and constructing a CRN model, and training by taking the training sample and the class label as a training data set of the CRN model.
Further, the pure voice signal is convolved with the room impulse responses of different azimuth angles, and noise and reverberation of different degrees are added to generate a plurality of microphone array signals, and the formula is as follows:
x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M
where x_m(t) denotes the speech signal from the specified direction received by the m-th microphone, m is the index of the microphone element, m = 1, 2, ..., M, M is the number of microphone elements, s(t) is the clean speech, h_m(t) is the room impulse response from the specified sound source direction to the m-th microphone, which depends on the sound source direction and the room reverberation, and v_m(t) denotes noise.
Further, the preprocessing the plurality of microphone array signals to obtain a plurality of single frame signals includes:
the pre-processing includes framing and windowing, wherein:
the framing method comprises: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame with 0 ≤ n < N, and N is the frame length;
the windowing method comprises: x_m(i, n) = w_H(n) x_m(iN + n)
where x_m(i, n) is the windowed signal of the i-th frame of the m-th array element and w_H(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N, is a Hamming window.
Further, the extracting the spatial localization cues of the plurality of single-frame signals comprises:
performing discrete Fourier transform on each single-frame signal, and converting a time domain signal into a frequency domain signal;
the discrete Fourier transform is computed as:
X_m(i, k) = DFT{x_m(i, n)} = Σ_{n=0}^{N−1} x_m(i, n) e^{−j2πkn/K}
where X_m(i, k) is the discrete Fourier transform of x_m(i, n) and denotes the frequency-domain signal of the i-th frame of the m-th array element, k is the frequency bin index, K is the DFT length with K = 2N, and DFT{·} denotes the discrete Fourier transform;
designing the Gammatone filter bank: g_j(t) is the impulse response of the j-th Gammatone filter, expressed as
g_j(t) = c t^(a−1) e^(−2π b_j t) cos(2π f_j t + φ), t ≥ 0
where j is the filter index, c is the filter gain, t is continuous time, a is the filter order, φ is the phase, f_j is the center frequency of the j-th filter, and b_j is the attenuation factor of the filter, calculated as:
b_j = 1.109 ERB(f_j)
ERB(f_j) = 24.7 (4.37 f_j / 1000 + 1)
each Gammatone filter is transformed by a discrete Fourier transform to obtain its frequency-domain expression G_j(k) = DFT{g_j(n)};
calculating the sub-band generalized cross-correlation function of each frame signal as:
R_mn(i, j, τ) = Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{j2πkτ/K}
where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band time delay difference of each frame signal as:
T_mn(i, j) = argmax_τ R_mn(i, j, τ)
where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band SRP-PHAT function of each frame signal as:
P(i, j, r) = Σ_{m=1}^{M} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))
where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the array beam is steered toward r; τ_mn(r) is the time difference with which the sound wave propagating from the beam direction r reaches the m-th and n-th microphones, calculated as:
τ_mn(r) = f_s (‖r − r_m‖ − ‖r − r_n‖) / c
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air, and f_s is the signal sampling rate;
when the sound source and the microphone array are in the same horizontal plane and the sound source is in the far field of the array, τ_mn(r) can equivalently be calculated as:
τ_mn(r) = f_s ξ^T (r_n − r_m) / c
where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r; τ_mn(r) is independent of the received signal and can therefore be calculated off-line and stored in memory;
the sub-band SRP-PHAT function is then normalized, yielding the normalized sub-band spatial spectrum P̂(i, j, r);
forming a feature matrix from the time delay differences of all sub-bands in the same frame together with the sub-band SRP-PHAT spectra, to obtain the mixed-feature spatial cue y_train(i), which stacks T_mn(i, j) over all microphone pairs and P̂(i, j, r) over all beam directions for every sub-band j = 1, ..., J, where y_train(i) is the training feature of the i-th frame and J is the number of sub-bands.
Further, the method for preprocessing the test signal to obtain a single frame test signal is the same as the method for preprocessing the plurality of microphone array signals to obtain a plurality of single frame signals;
the method for extracting the spatial localization cues of the single-frame test signal is the same as the method for extracting the spatial localization cues of the plurality of single-frame signals.
Further, the CRN model includes an input layer, two residual blocks, a pooling layer, two fully-connected layers, and a final output layer, wherein the residual blocks are composed of a plurality of convolution layers and a batch normalization layer, each residual block structure includes two batch normalization layers and two convolution layers, and the inputs are sequentially processed according to an order of the batch normalization layer first, the ReLU next, and the convolution layer last, wherein the output layer employs Softmax, and the loss function is a cross entropy function.
In a second aspect, the present invention provides a microphone array sound source localization apparatus, including:
an acquisition unit for acquiring a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
In a third aspect, the present invention provides a microphone array sound source localization apparatus, including a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of the preceding claims.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the preceding claims.
Compared with the prior art, the invention has the following beneficial effects:
the invention adopts the sub-band time delay difference and the sub-band SRP-PHAT space spectrum as the space positioning clue, and the characteristic clue has stronger robustness and space information representation capability; the method adopts the convolution residual error network to construct the mapping relation between the space positioning clue and the sound source direction, and the positioning model can accelerate the characteristics in the circulation network, reduce the characteristic loss and reduce the training difficulty; the invention can complete the training process of the CRN network of the positioning model off line, the trained network is stored in the memory, and the real-time sound source positioning can be realized only by one frame of signal during the test.
Drawings
Fig. 1 is a flowchart of a sound source positioning method of a microphone array according to an embodiment of the present invention;
FIG. 2 is a block diagram of a CRN model structure provided by an embodiment of the present invention;
FIG. 3 is a block diagram of a residual block structure provided by an embodiment of the present invention;
fig. 4 and 5 are positioning effect graphs of various algorithms when the testing environment and the training environment provided by the embodiment of the invention are consistent;
fig. 6 and fig. 7 are graphs of positioning results in a non-training noise environment according to an embodiment of the present invention;
fig. 8 and 9 are graphs of positioning results in a non-training reverberation environment according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
The embodiment introduces a microphone array sound source localization method, which includes:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability of the test signal belonging to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
The application process of the microphone array sound source positioning method provided by the embodiment specifically relates to the following steps:
step one, convolving the pure voice signal with the room impulse response of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional voice signals of different azimuth angles:
x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M
where x_m(t) denotes the speech signal from the specified direction received by the m-th microphone, m is the index of the microphone element, m = 1, 2, ..., M, M is the number of microphone elements, s(t) is the clean speech, h_m(t) is the room impulse response from the specified sound source direction to the m-th microphone, which depends on the sound source direction and the room reverberation, and v_m(t) denotes noise.
In this embodiment, the microphone array is a uniform circular array of 6 omnidirectional microphones with a radius of 0.1 m. The sound source and the microphone array are in the same horizontal plane and the sound source lies in the far field of the array; the direction straight ahead in the horizontal plane is defined as 90°, the sound source azimuth ranges over [0°, 360°) at 10° intervals, so the number of training azimuths is 36. The reverberation times of the training data are 0.5 s and 0.8 s, and the Image method is used to generate the room impulse responses h_m(t) for the different azimuth angles and reverberation times. v_m(t) is white Gaussian noise, and the signal-to-noise ratios of the training data are 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB.
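For illustration only, the following Python sketch (not part of the patent) builds such array signals: it convolves a clean speech vector with per-microphone room impulse responses and adds white Gaussian noise at a chosen SNR. The function name simulate_array_signals and the stand-in data are assumptions; in practice the RIRs would come from an Image-method simulator as described above.

```python
import numpy as np

def simulate_array_signals(s, rirs, snr_db, rng=None):
    """Build x_m(t) = h_m(t) * s(t) + v_m(t) for every microphone.

    s      : (T,) clean speech samples
    rirs   : (M, L) room impulse responses, one row per microphone
             (assumed to be produced elsewhere, e.g. by an Image-method simulator)
    snr_db : desired per-channel signal-to-noise ratio in dB
    """
    rng = np.random.default_rng() if rng is None else rng
    M = rirs.shape[0]
    # Reverberant speech at each microphone: convolution with the RIR.
    reverberant = np.stack([np.convolve(s, rirs[m])[: len(s)] for m in range(M)])
    # Scale white Gaussian noise so each channel reaches the requested SNR.
    sig_power = np.mean(reverberant ** 2, axis=1, keepdims=True)
    noise = rng.standard_normal(reverberant.shape)
    noise *= np.sqrt(sig_power / (10 ** (snr_db / 10.0))
                     / np.mean(noise ** 2, axis=1, keepdims=True))
    return reverberant + noise

# Placeholder data: 6 microphones, 1 s of "speech" at 16 kHz, random stand-in RIRs.
fs = 16000
s = np.random.randn(fs)
rirs = 0.01 * np.random.randn(6, 4096)
x = simulate_array_signals(s, rirs, snr_db=10)
print(x.shape)  # (6, 16000)
```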
And step two, preprocessing the microphone array signal obtained in the step one to obtain a single-frame signal.
The pre-processing includes framing and windowing, wherein:
The framing method comprises: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame with 0 ≤ n < N, and N is the frame length. In this embodiment the sampling rate f_s of the speech signal is 16 kHz, the frame length N is 512 samples (32 ms), and the frames do not overlap (the frame shift equals the frame length).
The windowing method comprises: x_m(i, n) = w_H(n) x_m(iN + n)
where x_m(i, n) is the windowed signal of the i-th frame of the m-th array element and w_H(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N, is a Hamming window.
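A minimal sketch of the framing and Hamming-windowing step under the stated settings (16 kHz sampling, 512-sample frames, no overlap); the helper name frame_and_window is illustrative, not from the patent.

```python
import numpy as np

def frame_and_window(x, frame_len=512):
    """Split one channel into non-overlapping frames and apply a Hamming window.

    x : (T,) time-domain signal of a single array element
    Returns an (num_frames, frame_len) array of windowed frames x_m(i, n).
    """
    num_frames = len(x) // frame_len
    frames = x[: num_frames * frame_len].reshape(num_frames, frame_len)
    w = np.hamming(frame_len)   # w_H(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))
    return frames * w

x = np.random.randn(16000)      # stand-in for one microphone channel (1 s at 16 kHz)
frames = frame_and_window(x)
print(frames.shape)             # (31, 512)
```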
And step three, extracting the spatial positioning clue of the array signal. The method specifically comprises the following steps:
and (3-1) performing discrete Fourier transform on each single-frame signal obtained in the step two, and converting the time domain signal into a frequency domain signal.
The discrete Fourier transform is computed as:
X_m(i, k) = DFT{x_m(i, n)} = Σ_{n=0}^{N−1} x_m(i, n) e^{−j2πkn/K}
where X_m(i, k) is the discrete Fourier transform of x_m(i, n) and denotes the frequency-domain signal of the i-th frame of the m-th array element, k is the frequency bin index, K is the DFT length with K = 2N, and DFT{·} denotes the discrete Fourier transform. In this embodiment the Fourier transform length is set to 1024.
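The zero-padded DFT of each windowed frame can then be taken as sketched below (K = 1024 = 2N); the function name is again illustrative.

```python
import numpy as np

def frames_to_spectra(frames, fft_len=1024):
    """Zero-pad each windowed frame to K = 2N and take the DFT, giving X_m(i, k)."""
    return np.fft.fft(frames, n=fft_len, axis=-1)   # shape (num_frames, fft_len)

spectra = frames_to_spectra(np.random.randn(31, 512))
print(spectra.shape)  # (31, 1024)
```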
(3-2) Designing the Gammatone filter bank. g_j(t) is the impulse response of the j-th Gammatone filter, expressed as
g_j(t) = c t^(a−1) e^(−2π b_j t) cos(2π f_j t + φ), t ≥ 0
where j is the filter index, c is the filter gain, t is continuous time, a is the filter order, φ is the phase, f_j is the center frequency of the j-th filter, and b_j is the attenuation factor of the filter, calculated as:
b_j = 1.109 ERB(f_j)
ERB(f_j) = 24.7 (4.37 f_j / 1000 + 1)
In this embodiment the order a is 4, the phase φ is set to 0, and the number of sub-band filters is 32, i.e. j = 1, 2, ..., 32, with filter center frequencies f_j in the range [200 Hz, 8000 Hz].
Each Gammatone filter is transformed by a discrete Fourier transform to obtain its frequency-domain expression G_j(k) = DFT{g_j(n)}.
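A rough sketch of such a Gammatone filter-bank design in Python. The ERB-scale spacing of the 32 center frequencies between 200 Hz and 8000 Hz is an assumption (the patent only gives the range), as are the 50 ms impulse-response length and the crude gain normalization; the attenuation factor follows the patent's b_j = 1.109·ERB(f_j).

```python
import numpy as np

def gammatone_bank_freq(fs=16000, fft_len=1024, n_filters=32,
                        f_lo=200.0, f_hi=8000.0, order=4):
    """Frequency responses G_j(k) of a Gammatone filter bank.

    Returns (bank, centers) where bank is an (n_filters, fft_len) complex array.
    """
    def hz_to_erb_rate(f):
        return 21.4 * np.log10(4.37e-3 * f + 1.0)

    def erb_rate_to_hz(e):
        return (10 ** (e / 21.4) - 1.0) / 4.37e-3

    # Center frequencies spaced on the ERB-rate scale (an assumption).
    centers = erb_rate_to_hz(np.linspace(hz_to_erb_rate(f_lo),
                                         hz_to_erb_rate(f_hi), n_filters))
    t = np.arange(int(0.05 * fs)) / fs            # 50 ms impulse responses
    bank = np.zeros((n_filters, fft_len), dtype=complex)
    for j, fc in enumerate(centers):
        b = 1.109 * 24.7 * (4.37 * fc / 1000.0 + 1.0)   # b_j = 1.109 * ERB(f_j)
        g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        g /= np.max(np.abs(g)) + 1e-12            # crude gain normalization
        bank[j] = np.fft.fft(g, n=fft_len)        # G_j(k) = DFT{g_j(n)}
    return bank, centers

G, centers = gammatone_bank_freq()
print(G.shape)  # (32, 1024)
```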
(3-3) The sub-band generalized cross-correlation function of each frame signal is calculated as:
R_mn(i, j, τ) = Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{j2πkτ/K}
where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame.
(3-4) The sub-band time delay difference of each frame signal is calculated as:
T_mn(i, j) = argmax_τ R_mn(i, j, τ)
where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame.
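The sub-band GCC and delay estimation might be sketched as follows for one frame. The weighting of the PHAT-normalized cross-spectrum by |G_j(k)|² and the restricted lag search are assumptions consistent with the formulas above; function names are illustrative, and only pairs with n > m are filled.

```python
import numpy as np

def subband_gcc_and_tdoa(Xi, G, max_lag=10):
    """Sub-band GCC R_mn(i, j, tau) and delays T_mn(i, j) for one frame.

    Xi : (M, K) frame spectra X_m(i, k)
    G  : (J, K) Gammatone filter frequency responses G_j(k)
    Returns (R, T) with shapes (M, M, J, 2*max_lag+1) and (M, M, J).
    """
    M, K = Xi.shape
    J = G.shape[0]
    lags = np.arange(-max_lag, max_lag + 1)
    k = np.arange(K)
    basis = np.exp(2j * np.pi * np.outer(k, lags) / K)     # e^{j2*pi*k*tau/K}
    R = np.zeros((M, M, J, lags.size))
    for m in range(M):
        for n in range(m + 1, M):
            cross = Xi[m] * np.conj(Xi[n])
            phat = cross / (np.abs(cross) + 1e-12)          # phase transform weighting
            for j in range(J):
                weighted = (np.abs(G[j]) ** 2) * phat       # sub-band weighting
                R[m, n, j] = np.real(weighted @ basis) / K
    T = lags[np.argmax(R, axis=-1)]                         # T_mn = argmax_tau R_mn
    return R, T

# Stand-in data: 6 channels, 1024-point spectra, 32 sub-band responses.
Xi = np.fft.fft(np.random.randn(6, 512), n=1024, axis=-1)
G = np.fft.fft(np.random.randn(32, 128), n=1024, axis=-1)
R, T = subband_gcc_and_tdoa(Xi, G)
print(T.shape)  # (6, 6, 32)
```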
(3-5) The sub-band SRP-PHAT function of each frame signal is calculated as:
P(i, j, r) = Σ_{m=1}^{M} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))
where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the array beam is steered toward r; τ_mn(r) is the time difference with which the sound wave propagating from the beam direction r reaches the m-th and n-th microphones, calculated as:
τ_mn(r) = f_s (‖r − r_m‖ − ‖r − r_n‖) / c
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air (about 342 m/s at room temperature), and f_s is the signal sampling rate.
In this embodiment, if the sound source and the microphone array are set to be in the same horizontal plane, and the sound source is located in the far field of the array, then τ is mn The equivalent calculation formula of (r) is:
Figure BDA0003610145030000103
where ξ is [ cos θ, sin θ ]] T And θ is the azimuth of the beam direction r. Tau. mn (r) is independent of the received signal and can therefore be calculated off-line and stored in memory.
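A sketch of the far-field steering delays and the sub-band SRP-PHAT spectrum. The microphone geometry matches the embodiment (6-element circular array of radius 0.1 m, 36 steering azimuths at 10° spacing); the row-wise normalization at the end is one plausible choice and not necessarily the patent's exact normalization formula, and the random R is only a stand-in for the GCC output of the previous sketch.

```python
import numpy as np

def farfield_delays(mic_xy, azimuths_deg, fs=16000, c=342.0):
    """tau_mn(r) in samples for far-field sources in the array plane."""
    az = np.deg2rad(np.asarray(azimuths_deg))
    xi = np.stack([np.cos(az), np.sin(az)], axis=1)          # (L, 2) unit vectors
    proj = xi @ mic_xy.T                                      # (L, M) projections xi . r_m
    return fs * (proj[:, None, :] - proj[:, :, None]) / c    # (L, M, M): xi.(r_n - r_m)

def subband_srp_phat(R, delays, max_lag=10):
    """SRP-PHAT power P(i, j, r): sum of pairwise GCC values at tau_mn(r).

    R      : (M, M, J, 2*max_lag+1) sub-band GCC of one frame
    delays : (L, M, M) steering delays in samples
    """
    M, _, J, _ = R.shape
    L = delays.shape[0]
    idx = np.clip(np.round(delays).astype(int) + max_lag, 0, 2 * max_lag)
    P = np.zeros((J, L))
    for l in range(L):
        for m in range(M):
            for n in range(m + 1, M):
                P[:, l] += R[m, n, :, idx[l, m, n]]
    # Normalize each sub-band spectrum over the beam directions (one plausible choice).
    return P / (P.sum(axis=1, keepdims=True) + 1e-12)

angles = np.linspace(0, 2 * np.pi, 6, endpoint=False)
mic_xy = 0.1 * np.stack([np.cos(angles), np.sin(angles)], axis=1)   # 6-mic circular array
delays = farfield_delays(mic_xy, np.arange(0, 360, 10))
R = np.random.rand(6, 6, 32, 21)                                    # stand-in GCC values
P = subband_srp_phat(R, delays)
print(delays.shape, P.shape)  # (36, 6, 6) (32, 36)
```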
The sub-band SRP-PHAT function is then normalized, yielding the normalized sub-band spatial spectrum P̂(i, j, r).
(3-6) The time delay differences of all sub-bands in a frame and the corresponding sub-band SRP-PHAT spectra are stacked into a feature matrix, giving the mixed-feature spatial cue y_train(i), the training feature of the i-th frame, where J is the number of sub-bands (J = 32 in this embodiment). The beam directions cover the azimuth range [0°, 360°) with 10° spacing, so L = 36 in this embodiment. The number of microphones M is 6, so the number of microphone pairs is M(M − 1)/2 = 15. The dimension of the localization cue on each sub-band of each frame is thus 15 + 36 = 51, and the dimension of the feature matrix is 51 × 32.
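Assembling the 51 × 32 feature matrix from the per-sub-band delay differences and SRP-PHAT spectra could look like the sketch below; the exact row ordering inside the matrix is an assumption.

```python
import numpy as np

def build_feature_matrix(T, P):
    """Stack per-sub-band TDOAs and SRP-PHAT spectra into a (M(M-1)/2 + L) x J matrix.

    T : (M, M, J) sub-band delay differences (only the upper-triangular pairs are used)
    P : (J, L) normalized sub-band SRP-PHAT spectra
    """
    M, _, J = T.shape
    iu = np.triu_indices(M, k=1)      # the 15 microphone pairs for M = 6
    tdoa = T[iu]                      # (15, J)
    return np.vstack([tdoa, P.T])     # (15 + L, J) -> 51 x 32 in this embodiment

feat = build_feature_matrix(np.random.randn(6, 6, 32), np.random.rand(32, 36))
print(feat.shape)  # (51, 32)
```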
Step four, preparing a training set: according to the first step to the third step, the spatial characteristic parameters of directional voice data in all training environments (the implementation and the setting of the training environments are detailed in the first step) are extracted and used as the training samples of the CRN, and meanwhile, the corresponding direction of each sample is marked and used as the class label of the sample.
And step five, constructing a CRN model and training it with the training samples and class labels obtained in step four as the training data set of the CRN, to obtain the trained CRN model. The method specifically comprises the following steps:
(5-1) Setting the CRN model structure.
The CRN model structure adopted by the invention is shown in Fig. 2 and comprises an input layer, two residual blocks, a pooling layer, two fully connected layers, and a final output layer. The structure of each residual block is shown in Fig. 3: a residual block is composed of convolutional layers and batch normalization (BN) layers; each residual block contains two BN layers and two convolutional layers, and the input is processed in the order BN, then ReLU, then convolution.
The input to the input layer is the feature matrix y_train(i) of dimension (M(M − 1)/2 + L) × J; in the present embodiment J = 32, L = 36, and M = 6. The convolution kernels in the convolutional layers of the residual blocks are of size 3 × 3 with stride 1 × 1, and zero padding keeps the dimensions of the feature maps unchanged before and after convolution. The first fully connected layer has 128 hidden units and the second fully connected layer has 36, equal to the number of output azimuth angles. The output layer uses Softmax, and the loss function is the cross-entropy function.
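A PyTorch sketch of a network with this structure (pre-activation residual blocks, pooling, two fully connected layers, 36 outputs). The stem convolution, the channel count of 16, and the adaptive-pooling output size are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice, plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels, eps=1e-3)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels, eps=1e-3)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out

class CRN(nn.Module):
    """Sketch of the localization network: two residual blocks, pooling, two FC layers."""
    def __init__(self, n_classes=36, channels=16):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)  # channel count assumed
        self.blocks = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.pool = nn.AdaptiveAvgPool2d((6, 4))
        self.fc1 = nn.Linear(channels * 6 * 4, 128)
        self.fc2 = nn.Linear(128, n_classes)

    def forward(self, x):                # x: (batch, 1, 51, 32) feature matrices
        h = self.pool(self.blocks(self.stem(x)))
        h = torch.relu(self.fc1(h.flatten(1)))
        return self.fc2(h)               # logits; softmax is applied inside the loss

model = CRN()
logits = model(torch.randn(4, 1, 51, 32))
print(logits.shape)  # torch.Size([4, 36])
```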
And (5-2) training network parameters of the CRN model.
The method uses an Adam optimizer to progressively reduce the loss function during model training. During CRN training, information propagates forward, errors propagate backward, and the model parameters are updated accordingly. The model parameters are initialized with Xavier initialization: if the numbers of input and output units of a weight layer W are n_j and n_{j+1} respectively, Xavier initialization draws
W ~ U[−√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1})]
where U denotes a uniform distribution and the symbol ~ indicates that the weight layer W follows the distribution U.
During training, the initial learning rate is set to 0.001, the batch size to 200, the value of ε in the BN layers to 0.001, and the decay coefficient to 0.999. A cross-validation scheme is used: in each iteration the training data are randomly split into two parts, 70% as the training set and 30% as the validation set. The quality of the model is measured with the cross-entropy loss over repeated cross-validation runs, which completes the training stage of the model.
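A minimal training-loop sketch reusing the CRN class from the previous sketch, with Adam at learning rate 0.001, batch size 200, Xavier initialization, cross-entropy loss, and a random 70/30 train/validation split per iteration; the number of epochs and the placeholder data are illustrative.

```python
import torch
import torch.nn as nn

def train_crn(model, features, labels, epochs=20, lr=1e-3, batch_size=200):
    """Minimal training loop matching the stated settings."""
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)   # W ~ U[-sqrt(6)/sqrt(n_j+n_{j+1}), +sqrt(6)/sqrt(n_j+n_{j+1})]
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()      # softmax + cross-entropy on the logits
    n = features.shape[0]
    for epoch in range(epochs):
        perm = torch.randperm(n)
        split = int(0.7 * n)
        train_idx, val_idx = perm[:split], perm[split:]
        model.train()
        for start in range(0, split, batch_size):
            idx = train_idx[start:start + batch_size]
            opt.zero_grad()
            loss = loss_fn(model(features[idx]), labels[idx])
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(features[val_idx]), labels[val_idx])
        print(f"epoch {epoch}: validation loss {val_loss.item():.4f}")

# Placeholder training data: 1000 frames of 51 x 32 cues with random azimuth labels.
feats = torch.randn(1000, 1, 51, 32)
labs = torch.randint(0, 36, (1000,))
train_crn(CRN(), feats, labs, epochs=2)
```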
Step six, processing the test signal according to step two and step three to obtain the spatial localization cue y_test(i) of the single-frame test signal, which is used as the test sample.
And step seven, feeding the test sample as the input feature to the CRN model trained in step five; the CRN outputs the probability that the test signal belongs to each azimuth angle, and the azimuth with the highest probability is taken as the azimuth angle estimate of the frame signal.
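Single-frame inference could then be sketched as below (again reusing the CRN class from the sketch above): the softmax posterior over the 36 azimuth classes is computed and the azimuth with the highest probability is returned.

```python
import torch

def localize_frame(model, feature_matrix, azimuths_deg):
    """Return the azimuth with the highest posterior probability for one frame."""
    model.eval()
    with torch.no_grad():
        logits = model(feature_matrix.view(1, 1, *feature_matrix.shape))
        probs = torch.softmax(logits, dim=-1).squeeze(0)   # P(theta_l | frame)
    return azimuths_deg[int(probs.argmax())], probs

azimuths = list(range(0, 360, 10))
est_az, probs = localize_frame(CRN(), torch.randn(51, 32), azimuths)
print(est_az)
```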
The invention adopts mixed features (sub-band time delay differences and sub-band SRP-PHAT spatial spectra) as spatial localization cues; these feature cues are robust and have strong capability to represent spatial information. A convolutional residual network is used to construct the mapping between the spatial localization cues and the sound source direction; the residual structure speeds up the flow of features through the network, reduces feature loss, and lowers the training difficulty. The training of the CRN localization model can be completed off-line, the trained network is stored in memory, and real-time sound source localization can be achieved from only a single frame of signal at test time. Compared with the traditional SRP-PHAT algorithm and localization algorithms based on deep neural networks, the proposed algorithm markedly improves the localization performance in complex acoustic environments and generalizes better to the sound source spatial structure, reverberation, and noise.
Fig. 4 and 5 show the positioning effect of various algorithms when the testing environment is consistent with the training environment, and it can be seen from the figure that the success rate of positioning of the algorithm of the present invention is higher than that of the conventional SRP-PHAT and the positioning algorithm based on the deep neural network. Fig. 6, 7, 8 and 9 show the positioning effect of various algorithms when the test environment and the training environment are not consistent. Fig. 6 and 7 show the localization results in the non-training noise environment, and fig. 8 and 9 show the localization results in the non-training reverberation environment. As can be seen from the figure, even under the non-training environment, the success rate of the algorithm of the invention is still higher than that of the traditional SRP-PHAT algorithm and the positioning algorithm based on the deep neural network, which shows that the method of the invention has better robustness and generalization capability to the unknown environment.
Example 2
The present embodiment provides a microphone array sound source localization apparatus, including:
an acquisition unit for acquiring a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
Further, the test unit includes a module for constructing and training a CRN model, and the module for constructing and training a CRN model includes:
the microphone array signal generating module is used for convolving the pure voice signals with the room impulse responses at different azimuth angles and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
the preprocessing module is used for preprocessing the microphone array signals to obtain a plurality of single-frame signals;
the extraction module is used for extracting spatial positioning clues of a plurality of single-frame signals, using the spatial positioning clues as training samples of the CRN model, marking the corresponding direction of each sample, and using the corresponding direction as a category label of the sample;
and the construction and training module is used for constructing the CRN model and training the training samples and the class labels as a training data set of the CRN model.
The CRN model comprises an input layer, two residual blocks, a pooling layer, two full-connection layers and a final output layer, wherein the residual blocks are composed of a plurality of convolution layers and batch normalization layers, each residual block structure comprises two batch normalization layers and two convolution layers, and input is sequentially processed according to the sequence of the batch normalization layer, the ReLU and the convolution layer, wherein the output layer adopts Softmax, and a loss function is a cross entropy function.
Example 3
The embodiment provides a microphone array sound source positioning device, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any of embodiment 1.
Example 4
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method of any of the embodiments 1.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A microphone array sound source localization method, comprising:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
2. The microphone array sound source localization method of claim 1, wherein the CRN model construction and training method comprises:
convolving the pure voice signals with the room impulse responses at different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
preprocessing the microphone array signals to obtain single frame signals;
extracting spatial positioning clues of a plurality of single-frame signals, using the spatial positioning clues as training samples of a CRN (Convolutional Residual Network) model, marking the corresponding direction of each sample, and using it as the class label of the sample;
and constructing a CRN model, and training by taking the training sample and the class label as a training data set of the CRN model.
3. The microphone array sound source localization method according to claim 2, wherein the plurality of microphone array signals are generated by convolving the clean speech signal with room impulse responses at different azimuth angles and adding different degrees of noise and reverberation, and the formula is as follows:
x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M
where x_m(t) denotes the speech signal from the specified direction received by the m-th microphone, m is the index of the microphone element, m = 1, 2, ..., M, M is the number of microphone elements, s(t) is the clean speech, h_m(t) is the room impulse response from the specified sound source direction to the m-th microphone, which depends on the sound source direction and the room reverberation, and v_m(t) denotes noise.
4. The microphone array sound source localization method of claim 2, wherein the pre-processing the plurality of microphone array signals to obtain a plurality of single frame signals comprises:
the pre-processing includes framing and windowing, wherein:
the framing method comprises: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame with 0 ≤ n < N, and N is the frame length;
the windowing method comprises: x_m(i, n) = w_H(n) x_m(iN + n)
where x_m(i, n) is the windowed signal of the i-th frame of the m-th array element and w_H(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N, is a Hamming window.
5. The microphone array sound source localization method according to claim 2, wherein the extracting spatial localization cues for a plurality of single frame signals comprises:
performing discrete Fourier transform on each single-frame signal, and converting a time domain signal into a frequency domain signal;
the discrete Fourier transform is computed as:
X_m(i, k) = DFT{x_m(i, n)} = Σ_{n=0}^{N−1} x_m(i, n) e^{−j2πkn/K}
where X_m(i, k) is the discrete Fourier transform of x_m(i, n) and denotes the frequency-domain signal of the i-th frame of the m-th array element, k is the frequency bin index, K is the DFT length with K = 2N, and DFT{·} denotes the discrete Fourier transform;
designing the Gammatone filter bank: g_j(t) is the impulse response of the j-th Gammatone filter, expressed as
g_j(t) = c t^(a−1) e^(−2π b_j t) cos(2π f_j t + φ), t ≥ 0
where j is the filter index, c is the filter gain, t is continuous time, a is the filter order, φ is the phase, f_j is the center frequency of the j-th filter, and b_j is the attenuation factor of the filter, calculated as:
b_j = 1.109 ERB(f_j)
ERB(f_j) = 24.7 (4.37 f_j / 1000 + 1)
each Gammatone filter is transformed by a discrete Fourier transform to obtain its frequency-domain expression G_j(k) = DFT{g_j(n)};
calculating the sub-band generalized cross-correlation function of each frame signal as:
R_mn(i, j, τ) = Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{j2πkτ/K}
where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band time delay difference of each frame signal as:
T_mn(i, j) = argmax_τ R_mn(i, j, τ)
where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band SRP-PHAT function of each frame signal as:
P(i, j, r) = Σ_{m=1}^{M} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))
where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the array beam is steered toward r; τ_mn(r) is the time difference with which the sound wave propagating from the beam direction r reaches the m-th and n-th microphones, calculated as:
τ_mn(r) = f_s (‖r − r_m‖ − ‖r − r_n‖) / c
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air, and f_s is the signal sampling rate;
when the sound source and the microphone array are in the same horizontal plane and the sound source is in the far field of the array, τ_mn(r) can equivalently be calculated as:
τ_mn(r) = f_s ξ^T (r_n − r_m) / c
where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r; τ_mn(r) is independent of the received signal and can therefore be calculated off-line and stored in memory;
the sub-band SRP-PHAT function is then normalized, yielding the normalized sub-band spatial spectrum P̂(i, j, r);
forming a feature matrix from the time delay differences of all sub-bands in the same frame together with the sub-band SRP-PHAT spectra, to obtain the mixed-feature spatial cue y_train(i), which stacks T_mn(i, j) over all microphone pairs and P̂(i, j, r) over all beam directions for every sub-band j = 1, ..., J, where y_train(i) is the feature of the i-th frame and J is the number of sub-bands.
6. The microphone array sound source localization method according to claim 5, characterized in that: the method for preprocessing the test signal to obtain a single-frame test signal is the same as the method for preprocessing the microphone array signals to obtain a plurality of single-frame signals;
the method for extracting the spatial localization cues of the single-frame test signal is the same as the method for extracting the spatial localization cues of the plurality of single-frame signals.
7. The microphone array sound source localization method according to claim 2, characterized in that: the CRN model comprises an input layer, two residual blocks, a pooling layer, two full-connection layers and a final output layer, wherein the residual blocks are composed of a plurality of convolution layers and batch normalization layers, each residual block structure comprises two batch normalization layers and two convolution layers, and input is sequentially processed according to the sequence of the batch normalization layer, the ReLU and the convolution layer, wherein the output layer adopts Softmax, and a loss function is a cross entropy function.
8. A microphone array sound source localization apparatus, characterized by comprising:
an acquisition unit configured to acquire a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
9. A microphone array sound source localization apparatus characterized in that: comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the program when executed by a processor implements the steps of the method of any one of claims 1 to 7.
CN202210427289.XA 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium Pending CN114895245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210427289.XA CN114895245A (en) 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210427289.XA CN114895245A (en) 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium

Publications (1)

Publication Number Publication Date
CN114895245A true CN114895245A (en) 2022-08-12

Family

ID=82718420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210427289.XA Pending CN114895245A (en) 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium

Country Status (1)

Country Link
CN (1) CN114895245A (en)

Similar Documents

Publication Publication Date Title
CN107703486B (en) Sound source positioning method based on convolutional neural network CNN
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN109490822B (en) Voice DOA estimation method based on ResNet
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN107452389A (en) A kind of general monophonic real-time noise-reducing method
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
CN110068795A (en) A kind of indoor microphone array sound localization method based on convolutional neural networks
CN110223708B (en) Speech enhancement method based on speech processing and related equipment
CN109164415B (en) Binaural sound source positioning method based on convolutional neural network
Kumatani et al. Beamforming with a maximum negentropy criterion
CN110544490A (en) sound source positioning method based on Gaussian mixture model and spatial power spectrum characteristics
Aroudi et al. Dbnet: Doa-driven beamforming network for end-to-end reverberant sound source separation
CN113936681A (en) Voice enhancement method based on mask mapping and mixed hole convolution network
Pujol et al. Source localization in reverberant rooms using Deep Learning and microphone arrays
Zhao et al. Sound source localization based on srp-phat spatial spectrum and deep neural network
Salvati et al. Two-microphone end-to-end speaker joint identification and localization via convolutional neural networks
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
Salvati et al. End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks.
CN113111765A (en) Multi-voice source counting and positioning method based on deep learning
CN111123202B (en) Indoor early reflected sound positioning method and system
CN110838303B (en) Voice sound source positioning method using microphone array
Salvati et al. Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features.
Aroudi et al. DBNET: DOA-driven beamforming network for end-to-end farfield sound source separation
CN115713943A (en) Beam forming voice separation method based on complex space angular center Gaussian mixture clustering model and bidirectional long-short-term memory network
CN114895245A (en) Microphone array sound source positioning method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination