CN114895245A - Microphone array sound source positioning method and device and storage medium - Google Patents
- Publication number
- CN114895245A CN114895245A CN202210427289.XA CN202210427289A CN114895245A CN 114895245 A CN114895245 A CN 114895245A CN 202210427289 A CN202210427289 A CN 202210427289A CN 114895245 A CN114895245 A CN 114895245A
- Authority
- CN
- China
- Prior art keywords
- frame
- signal
- sound source
- microphone
- microphone array
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention discloses a microphone array sound source positioning method, apparatus, and storage medium. The method comprises: acquiring a test signal; preprocessing the test signal to obtain a single-frame test signal; extracting a spatial positioning clue from the single-frame test signal and using it as a test sample; and inputting the test sample into a pre-constructed, trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, where the azimuth with the maximum probability is taken as the azimuth angle estimate of the frame signal.
Description
Technical Field
The invention relates to a microphone array sound source positioning method, a microphone array sound source positioning device and a storage medium, and belongs to the technical field of sound source positioning.
Background
Microphone-array-based sound source positioning technology has broad application prospects and potential economic value in speech recognition, front-end processing for speaker recognition systems, video conferencing, intelligent robots, smart homes, and the like. Positioning algorithms based on time delay differences and positioning algorithms based on SRP-PHAT (Steered Response Power with Phase Transform) are two typical traditional positioning methods; although both are easy to implement, they have low robustness to reverberation and noise.
Disclosure of Invention
The invention aims to overcome the defects in the prior art by providing a microphone array sound source positioning method, device, and storage medium that significantly improve positioning performance and generalize better to unknown noise and reverberation environments.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a microphone array sound source localization method, including:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
Further, the method for constructing and training the CRN model includes:
convolving the pure voice signal with the room impulse response of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
preprocessing the microphone array signals to obtain single frame signals;
extracting spatial positioning clues from a plurality of single-frame signals and using them as training samples of the CRN (convolutional residual network) model, and marking the direction corresponding to each sample as its class label;
and constructing a CRN model, and training by taking the training sample and the class label as a training data set of the CRN model.
Further, the pure voice signal is convolved with the room impulse responses of different azimuth angles, and noise and reverberation of different degrees are added to generate a plurality of microphone array signals, according to the formula:

x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M

where x_m(t) is the speech signal of the specified direction received by the m-th microphone; m is the index of the microphone element, m = 1, 2, ..., M; M is the number of microphone elements; s(t) is the pure speech; h_m(t) is the room impulse response from the specified sound source bearing to the m-th microphone, which depends on the sound source bearing and the room reverberation; and v_m(t) is noise.
Further, preprocessing the plurality of microphone array signals to obtain a plurality of single-frame signals includes:
the pre-processing includes framing and windowing, wherein:
The framing method is as follows: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame (0 ≤ n < N), and N is the frame length;
The windowing method is: x_m(i, n) = w_H(n) · x_m(iN + n)

where x_m(i, n) is the windowed signal of the i-th frame of the m-th array element, and w_H(n) is the window function.
Further, extracting the spatial localization cues of the plurality of single-frame signals comprises:
performing discrete Fourier transform on each single-frame signal, and converting a time domain signal into a frequency domain signal;
The discrete Fourier transform is computed as:

X_m(i, k) = DFT[x_m(i, n)] = Σ_{n=0}^{N-1} x_m(i, n) e^{-j2πnk/K}, k = 0, 1, ..., K-1

where X_m(i, k) is the discrete Fourier transform of x_m(i, n), i.e. the frequency-domain signal of the i-th frame of the m-th array element; k is the frequency bin; K is the length of the discrete Fourier transform, K = 2N; and DFT(·) denotes the discrete Fourier transform;
Design the Gammatone filter bank: g_j(t) is the impulse response of the j-th Gammatone filter, expressed as

g_j(t) = c · t^(a-1) · e^(-2π b_j t) · cos(2π f_j t + φ), t ≥ 0

where j is the index of the filter; c is the filter gain; t denotes continuous time; a is the order of the filter; φ is the phase; f_j is the center frequency of the j-th filter; and b_j is the attenuation factor of the filter, computed as:
b_j = 1.109 · ERB(f_j)
ERB(f_j) = 24.7 · (4.37 f_j / 1000 + 1)
Applying the discrete Fourier transform to each Gammatone filter gives the frequency-domain expression:

G_j(k) = DFT[g_j(n)]
Calculate the sub-band generalized cross-correlation function of each frame signal as:

R_mn(i, j, τ) = Σ_{k=0}^{K-1} |G_j(k)|² · (X_m(i, k) X_n*(i, k)) / |X_m(i, k) X_n*(i, k)| · e^{j2πkτ/K}

where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
Calculate the sub-band time delay difference of each frame signal as:

T_mn(i, j) = argmax_τ R_mn(i, j, τ)

where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
Calculate the sub-band SRP-PHAT function of each frame signal as:

P(i, j, r) = Σ_{m=1}^{M-1} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))

where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the beam direction of the array is r, and τ_mn(r) is the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, computed as:

τ_mn(r) = f_s · (||r - r_m|| - ||r - r_n||) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air, and f_s is the signal sampling rate;
With the sound source and the microphone array in the same horizontal plane and the sound source in the far field of the array, τ_mn(r) can be computed equivalently as:

τ_mn(r) = f_s · ξ^T (r_n - r_m) / c

where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r. Since τ_mn(r) is independent of the received signal, it can be computed offline and stored in memory;
Normalize the sub-band SRP-PHAT function as:

P̃(i, j, r) = P(i, j, r) / max_r P(i, j, r)
Form a feature matrix from the time delay differences and normalized SRP-PHAT values of all sub-bands in the same frame to obtain a mixed-feature spatial cue:

y_train(i) = [T(i, 1), ..., T(i, J); P̃(i, 1), ..., P̃(i, J)]

where y_train(i) is the feature matrix of the i-th frame, T(i, j) collects the time delay differences of all microphone pairs in sub-band j, P̃(i, j) collects the normalized SRP-PHAT values over all beam directions in sub-band j, and J is the number of sub-bands.
Further, the method for preprocessing the test signal to obtain a single frame test signal is the same as the method for preprocessing the plurality of microphone array signals to obtain a plurality of single frame signals;
the method for extracting the spatial localization cues of the single-frame test signal is the same as the method for extracting the spatial localization cues of the plurality of single-frame signals.
Further, the CRN model includes an input layer, two residual blocks, a pooling layer, two fully-connected layers, and a final output layer. Each residual block is composed of convolutional layers and batch normalization layers; each contains two batch normalization layers and two convolutional layers, and the input is processed in the order batch normalization → ReLU → convolution. The output layer uses Softmax, and the loss function is the cross-entropy function.
In a second aspect, the present invention provides a microphone array sound source localization apparatus, including:
an acquisition unit for acquiring a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
In a third aspect, the present invention provides a microphone array sound source localization apparatus, including a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of the first aspect.

In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
the invention adopts the sub-band time delay difference and the sub-band SRP-PHAT space spectrum as the space positioning clue, and the characteristic clue has stronger robustness and space information representation capability; the method adopts the convolution residual error network to construct the mapping relation between the space positioning clue and the sound source direction, and the positioning model can accelerate the characteristics in the circulation network, reduce the characteristic loss and reduce the training difficulty; the invention can complete the training process of the CRN network of the positioning model off line, the trained network is stored in the memory, and the real-time sound source positioning can be realized only by one frame of signal during the test.
Drawings
Fig. 1 is a flowchart of a sound source positioning method of a microphone array according to an embodiment of the present invention;
FIG. 2 is a block diagram of a CRN model structure provided by an embodiment of the present invention;
FIG. 3 is a block diagram of a residual block structure provided by an embodiment of the present invention;
fig. 4 and 5 are positioning effect graphs of various algorithms when the testing environment and the training environment provided by the embodiment of the invention are consistent;
fig. 6 and fig. 7 are graphs of positioning results in a non-training noise environment according to an embodiment of the present invention;
fig. 8 and 9 are graphs of positioning results in a non-training reverberation environment according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
The embodiment introduces a microphone array sound source localization method, which includes:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability of the test signal belonging to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
The application process of the microphone array sound source positioning method provided by the embodiment specifically relates to the following steps:
step one, convolving the pure voice signal with the room impulse response of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional voice signals of different azimuth angles:
x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M

where x_m(t) is the speech signal of the specified direction received by the m-th microphone; m is the index of the microphone element, m = 1, 2, ..., M; M is the number of microphone elements; s(t) is the pure speech; h_m(t) is the room impulse response from the specified sound source bearing to the m-th microphone, which depends on the sound source bearing and the room reverberation; and v_m(t) is noise.
In this embodiment, the microphone array is a uniform circular array of 6 omnidirectional microphones with a radius of 0.1 m. The sound source and the microphone array are in the same horizontal plane and the sound source is in the far field of the array; with straight ahead in the horizontal plane defined as 90°, the sound source azimuth ranges over [0°, 360°) at 10° intervals, giving 36 training azimuths. The reverberation times of the training data are 0.5 s and 0.8 s, and the Image algorithm generates the room impulse responses h_m(t) for the different azimuth angles and reverberation times. v_m(t) is white Gaussian noise, and the signal-to-noise ratios of the training data are 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB.
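The signal-generation step above can be sketched as follows; the function name, the use of NumPy, and the noise-scaling details are illustrative assumptions, not part of the patent:

```python
import numpy as np

def make_array_signal(s, rirs, snr_db, rng=None):
    """Simulate x_m(t) = h_m(t) * s(t) + v_m(t) for every microphone."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Convolve the clean speech with each microphone's room impulse response.
    x = np.stack([np.convolve(s, h)[: len(s)] for h in rirs])
    # Add white Gaussian noise scaled to the requested signal-to-noise ratio.
    p_sig = np.mean(x ** 2)
    p_noise = p_sig / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
```

In the embodiment, `rirs` would hold the M = 6 image-method room impulse responses for one azimuth and reverberation time; here it may be any iterable of 1-D arrays.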
And step two, preprocessing the microphone array signal obtained in the step one to obtain a single-frame signal.
The pre-processing includes framing and windowing, wherein:
The framing method is as follows: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame (0 ≤ n < N), and N is the frame length. In this embodiment the sampling rate f_s of the speech signal is 16 kHz, the frame length N is 512 (32 ms), and consecutive frames do not overlap.
The windowing method comprises the following steps: x is the number of m (i,n)=w H (n)x m (iN+n)
Wherein x m (i, n) is the signal of the ith frame of the m array element after the windowing processing,
And step three, extracting the spatial positioning clue of the array signal. The method specifically comprises the following steps:
and (3-1) performing discrete Fourier transform on each single-frame signal obtained in the step two, and converting the time domain signal into a frequency domain signal.
The discrete fourier transform calculation formula is:
wherein X m (i, k) is x m The discrete fourier transform of (i, N) represents the frequency domain signal of the ith frame of the mth array element, K is the frequency point, K is the length of the discrete fourier transform, K is 2N, and DFT (·) represents the discrete fourier transform. In this embodiment, the fourier transform length is set to 1024.
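The framing, windowing, and DFT steps can be sketched as below, using the embodiment values N = 512 and K = 2N = 1024; the choice of a Hamming window for w_H(n) and the function name are assumptions:

```python
import numpy as np

def frame_window_dft(x, frame_len=512, K=1024):
    """Split a signal into non-overlapping frames, window, and take a
    zero-padded DFT of length K = 2N per frame, giving X_m(i, k)."""
    num_frames = len(x) // frame_len
    w = np.hamming(frame_len)  # w_H(n); a Hamming window is assumed here
    frames = x[: num_frames * frame_len].reshape(num_frames, frame_len) * w
    return np.fft.fft(frames, n=K, axis=1)
```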
(3-2) Design the Gammatone filter bank. g_j(t) is the impulse response of the j-th Gammatone filter, expressed as

g_j(t) = c · t^(a-1) · e^(-2π b_j t) · cos(2π f_j t + φ), t ≥ 0

where j is the index of the filter; c is the filter gain; t denotes continuous time; a is the order of the filter; φ is the phase; f_j is the center frequency of the j-th filter; and b_j is the attenuation factor of the filter, computed as:

b_j = 1.109 · ERB(f_j)
ERB(f_j) = 24.7 · (4.37 f_j / 1000 + 1)

In this embodiment, the order a is 4, the phase φ is set to 0, the number of sub-band filters is 32 (i.e. j = 1, 2, ..., 32), and the filter center frequencies f_j lie in the range [200 Hz, 8000 Hz].
Applying the discrete Fourier transform to each Gammatone filter gives the frequency-domain expression:

G_j(k) = DFT[g_j(n)]
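A sketch of the filter-bank construction, using b_j = 1.109·ERB(f_j) from the text; the ERB-rate spacing of the center frequencies and the per-filter peak normalization of the gain c are assumptions, since the patent only gives the frequency range:

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth, ERB(f) = 24.7(4.37 f/1000 + 1)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_bank(num_filters=32, f_lo=200.0, f_hi=8000.0,
                   fs=16000, n_taps=512, order=4):
    # Center frequencies equally spaced on the ERB-rate scale (an assumption).
    ear_q, min_bw = 9.26449, 24.7
    lo, hi = np.log(f_lo + ear_q * min_bw), np.log(f_hi + ear_q * min_bw)
    cf = np.exp(np.linspace(lo, hi, num_filters)) - ear_q * min_bw
    t = np.arange(n_taps) / fs
    bank = np.zeros((num_filters, n_taps))
    for j, fj in enumerate(cf):
        bj = 1.109 * erb(fj)  # attenuation factor b_j from the text
        # g_j(t) = c t^(a-1) e^(-2 pi b_j t) cos(2 pi f_j t), phase = 0
        g = t ** (order - 1) * np.exp(-2 * np.pi * bj * t) * np.cos(2 * np.pi * fj * t)
        bank[j] = g / np.max(np.abs(g))  # gain c chosen as peak normalization
    return cf, bank
```

The frequency-domain responses G_j(k) then follow from `np.fft.fft(bank, n=1024, axis=1)`.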
(3-3) Calculate the sub-band generalized cross-correlation function of each frame signal as:

R_mn(i, j, τ) = Σ_{k=0}^{K-1} |G_j(k)|² · (X_m(i, k) X_n*(i, k)) / |X_m(i, k) X_n*(i, k)| · e^{j2πkτ/K}

where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame.
(3-4) Calculate the sub-band time delay difference of each frame signal as:

T_mn(i, j) = argmax_τ R_mn(i, j, τ)

where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame.
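Steps (3-3) and (3-4) can be sketched as below; evaluating the PHAT-weighted cross-spectrum via an inverse FFT over a limited lag range, and the sign convention of the recovered delay, are implementation choices:

```python
import numpy as np

def subband_gcc_phat(Xm, Xn, Gj, max_lag):
    """Sub-band GCC with PHAT weighting from one-sided (rfft) spectra."""
    cross = Xm * np.conj(Xn)
    # PHAT weighting, shaped by the sub-band filter response |G_j|^2.
    cross = np.abs(Gj) ** 2 * cross / np.maximum(np.abs(cross), 1e-12)
    r = np.fft.irfft(cross)  # R(tau) on the circular lag axis
    r = np.concatenate((r[-max_lag:], r[: max_lag + 1]))  # lags -max_lag..max_lag
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, r

def subband_tdoa(Xm, Xn, Gj, max_lag):
    """T_mn: the lag maximizing the sub-band cross-correlation."""
    lags, r = subband_gcc_phat(Xm, Xn, Gj, max_lag)
    return int(lags[np.argmax(r)])
```

With this convention, a signal at the second microphone that lags the first by d samples yields a TDOA of -d.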
(3-5) Calculate the sub-band SRP-PHAT function of each frame signal as:

P(i, j, r) = Σ_{m=1}^{M-1} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))

where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the beam direction of the array is r, and τ_mn(r) is the time difference of the sound wave propagating from beam direction r to the m-th and n-th microphones, computed as:

τ_mn(r) = f_s · (||r - r_m|| - ||r - r_n||) / c

where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air (about 342 m/s at room temperature), and f_s is the signal sampling rate.
In this embodiment, with the sound source and the microphone array in the same horizontal plane and the sound source in the far field of the array, τ_mn(r) can be computed equivalently as:

τ_mn(r) = f_s · ξ^T (r_n - r_m) / c

where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r. Since τ_mn(r) is independent of the received signal, it can be computed offline and stored in memory.
Normalize the sub-band SRP-PHAT function as:

P̃(i, j, r) = P(i, j, r) / max_r P(i, j, r)
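The offline precomputation of τ_mn(r) for the embodiment's array (6 microphones, radius 0.1 m, 36 beam directions) can be sketched as below; the microphone placement angles and the sign convention are assumptions:

```python
import numpy as np

def far_field_delays(num_mics=6, radius=0.1, num_angles=36, fs=16000, c=342.0):
    """Precompute tau_mn(theta) in samples for a uniform circular array,
    far-field source in the same horizontal plane as the array."""
    mic_angles = 2 * np.pi * np.arange(num_mics) / num_mics
    r_m = radius * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=1)
    thetas = np.deg2rad(np.arange(0, 360, 360 // num_angles))
    pairs = [(m, n) for m in range(num_mics) for n in range(m + 1, num_mics)]
    tau = np.zeros((len(pairs), num_angles))
    for a, th in enumerate(thetas):
        xi = np.array([np.cos(th), np.sin(th)])  # unit vector of beam direction
        for p, (m, n) in enumerate(pairs):
            # tau_mn(r) = f_s * xi^T (r_n - r_m) / c (sign convention assumed)
            tau[p, a] = fs * xi @ (r_m[n] - r_m[m]) / c
    return pairs, tau
```

Because the table depends only on the geometry, it is computed once and stored, matching the offline-precomputation remark in the text.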
(3-6) Form a feature matrix from the time delay differences and normalized SRP-PHAT values of all sub-bands in the same frame to obtain a mixed-feature spatial cue:

y_train(i) = [T(i, 1), ..., T(i, J); P̃(i, 1), ..., P̃(i, J)]

where y_train(i) is the feature matrix of the i-th frame and J is the number of sub-bands; J = 32 in this embodiment. The beam directions cover the azimuth range [0°, 360°) at 10° intervals, so L = 36 in this embodiment. The number of microphones M is 6, giving M(M-1)/2 = 15 microphone pairs. The dimension of the localization cue on each sub-band of each frame is thus 15 + 36 = 51, and the dimension of the feature matrix is 51 × 32.
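Step (3-6) can be sketched as below; normalizing the SRP-PHAT values by their per-sub-band maximum is one reading of the normalization formula:

```python
import numpy as np

def build_feature_matrix(T, P):
    """Stack per-sub-band delay differences and normalized SRP-PHAT values.

    T: (num_pairs, J) sub-band delay differences, e.g. (15, 32)
    P: (L, J) sub-band SRP-PHAT spatial spectra, e.g. (36, 32)
    Returns the (num_pairs + L) x J feature matrix, e.g. 51 x 32.
    """
    P_norm = P / np.max(P, axis=0, keepdims=True)  # max over beam directions r
    return np.vstack([T, P_norm])
```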
Step four, preparing a training set: according to the first step to the third step, the spatial characteristic parameters of directional voice data in all training environments (the implementation and the setting of the training environments are detailed in the first step) are extracted and used as the training samples of the CRN, and meanwhile, the corresponding direction of each sample is marked and used as the class label of the sample.
Step five: construct the CRN model and train it using the training samples and class labels obtained in step four as the training data set, obtaining the trained CRN model. The method specifically comprises the following steps:

(5-1) Set the CRN model structure.
The CRN model structure adopted by the invention is shown in fig. 2; it includes an input layer, two residual blocks, a pooling layer, two fully-connected layers, and a final output layer. The structure of each residual block is shown in fig. 3. A residual block is composed of convolutional layers and Batch Normalization (BN) layers; each residual block contains two BN layers and two convolutional layers, and the input is processed in the order BN → ReLU → convolution.
The input to the input layer is the 51 × 32 feature matrix; in this embodiment, J = 32, L = 36, and M = 6. The convolution kernels in the convolutional layers of the residual blocks are 3 × 3 with stride 1 × 1, and zero padding keeps the dimensions of the feature parameters unchanged before and after convolution. The first fully-connected layer has 128 hidden units and the second has 36, which is also the number of output azimuth angles. The output layer uses Softmax, and the loss function is the cross-entropy function.
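A minimal NumPy sketch of the pre-activation residual block (BN → ReLU → 3×3 convolution with stride 1 and zero padding, twice, plus an identity skip); a real implementation would use a deep-learning framework with learned BN scale and shift parameters:

```python
import numpy as np

def conv3x3(x, w):
    """x: (C_in, H, W); w: (C_out, C_in, 3, 3); stride 1, zero padding."""
    C_out = w.shape[0]
    H, W = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        for c in range(x.shape[0]):
            for dy in range(3):
                for dx in range(3):
                    out[o] += w[o, c, dy, dx] * xp[c, dy:dy + H, dx:dx + W]
    return out

def batch_norm(x, eps=1e-3):
    """Per-channel normalization over the spatial dimensions (inference sketch)."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, w1, w2):
    """Pre-activation order from the text: BN -> ReLU -> conv, twice, plus skip."""
    h = conv3x3(np.maximum(batch_norm(x), 0), w1)
    h = conv3x3(np.maximum(batch_norm(h), 0), w2)
    return x + h
```

Zero padding and unit stride keep the 51 × 32 feature dimensions unchanged through the block, as stated in the text.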
And (5-2) training network parameters of the CRN model.
An Adam optimizer is used to continually reduce the loss function during model training. In the CRN training process, information propagates forward, errors propagate backward, and the model parameters are updated accordingly. Model parameters are initialized with Xavier initialization: assuming the numbers of input and output units of a weight layer W are n_j and n_{j+1} respectively, Xavier initialization draws

W ~ U[-√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1})]

where U denotes the uniform distribution, and the symbol ~ denotes that the weight layer W follows this distribution.
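The Xavier initialization can be sketched as:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Draw a weight matrix W ~ U[-sqrt(6)/sqrt(n_in + n_out),
    sqrt(6)/sqrt(n_in + n_out)] of shape (n_in, n_out)."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```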
During training, the initial learning rate is set to 0.001, the batch data volume is set to 200, the value of epsilon in the BN layer is 0.001, and the attenuation coefficient is 0.999. The invention adopts a cross validation method, and randomly divides training data into two parts in each iteration: 70% of the training set, 30% of the validation set. And measuring the quality of the model by using a cross entropy loss function through repeated cross validation, and finally completing the training stage of the model.
Step six: process the test signal according to steps two and three to obtain the spatial positioning clue y_test(i) of the single-frame test signal, and use it as a test sample.
And step seven, taking the test sample as the input characteristic of the CRN model trained in the step five, outputting the probability that the test signal belongs to each azimuth angle by the CRN, and taking the azimuth with the highest probability as the azimuth angle estimated value of the frame signal.
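Step seven reduces to a softmax over the CRN output followed by an argmax over the 36 candidate azimuths; a sketch in which the CRN forward pass is represented only by its output logits:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the azimuth classes."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def estimate_azimuth(logits, azimuths):
    """Return the azimuth with the highest class probability, and the
    full probability vector over all candidate azimuths."""
    p = softmax(logits)
    return azimuths[int(np.argmax(p))], p
```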
The invention adopts mixed features (sub-band time delay differences and the sub-band SRP-PHAT spatial spectrum) as the spatial positioning clue; this feature clue has strong robustness and spatial-information representation capability. The invention adopts a convolutional residual network to construct the mapping between the spatial positioning clue and the sound source position; this positioning model speeds up feature propagation through the network, reduces feature loss, and lowers training difficulty. The training of the CRN positioning model can be completed offline, the trained network is stored in memory, and only one frame of signal is needed at test time to achieve real-time sound source positioning. Compared with the traditional SRP-PHAT algorithm and positioning algorithms based on deep neural networks, the proposed algorithm significantly improves positioning performance in complex acoustic environments and generalizes better to the sound source spatial structure, reverberation, and noise.
Fig. 4 and 5 show the positioning effect of various algorithms when the testing environment is consistent with the training environment, and it can be seen from the figure that the success rate of positioning of the algorithm of the present invention is higher than that of the conventional SRP-PHAT and the positioning algorithm based on the deep neural network. Fig. 6, 7, 8 and 9 show the positioning effect of various algorithms when the test environment and the training environment are not consistent. Fig. 6 and 7 show the localization results in the non-training noise environment, and fig. 8 and 9 show the localization results in the non-training reverberation environment. As can be seen from the figure, even under the non-training environment, the success rate of the algorithm of the invention is still higher than that of the traditional SRP-PHAT algorithm and the positioning algorithm based on the deep neural network, which shows that the method of the invention has better robustness and generalization capability to the unknown environment.
Example 2
The present embodiment provides a microphone array sound source localization apparatus, including:
an acquisition unit for acquiring a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
Further, the test unit includes a module for constructing and training a CRN model, and the module for constructing and training a CRN model includes:
the microphone array signal generating module is used for convolving the pure voice signals with the room impulse responses at different azimuth angles and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
the preprocessing module is used for preprocessing the microphone array signals to obtain a plurality of single-frame signals;
the extraction module is used for extracting spatial positioning clues of a plurality of single-frame signals, using the spatial positioning clues as training samples of the CRN model, marking the corresponding direction of each sample, and using the corresponding direction as a category label of the sample;
and the construction and training module is used for constructing the CRN model and training the training samples and the class labels as a training data set of the CRN model.
The CRN model comprises an input layer, two residual blocks, a pooling layer, two fully-connected layers, and a final output layer. Each residual block is composed of convolutional layers and batch normalization layers; each contains two batch normalization layers and two convolutional layers, and the input is processed in the order batch normalization → ReLU → convolution. The output layer uses Softmax, and the loss function is the cross-entropy function.
Example 3
The embodiment provides a microphone array sound source positioning device, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of Embodiment 1.
Example 4
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of Embodiment 1.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (10)
1. A microphone array sound source localization method, comprising:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimate for the frame signal.
2. The microphone array sound source localization method of claim 1, wherein the CRN model construction and training method comprises:
convolving the clean speech signal with room impulse responses at different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
preprocessing the microphone array signals to obtain single frame signals;
extracting spatial localization cues from a plurality of single-frame signals, using the spatial localization cues as training samples of the CRN model, and labeling each sample with its corresponding azimuth as the class label of the sample;
and constructing a CRN model, and training by taking the training sample and the class label as a training data set of the CRN model.
3. The microphone array sound source localization method according to claim 2, wherein the plurality of microphone array signals are generated by convolving the clean speech signal with room impulse responses at different azimuth angles and adding different degrees of noise and reverberation, according to the formula:
x_m(t) = h_m(t) * s(t) + v_m(t),  m = 1, 2, …, M
wherein x_m(t) represents the speech signal from the specified direction received by the mth microphone, m is the index of the microphone element, M is the number of microphone elements, s(t) is the clean speech, h_m(t) represents the room impulse response from the specified sound source direction to the mth microphone (dependent on the sound source direction and room reverberation), and v_m(t) represents the noise.
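As an illustrative sketch of this signal model, the numpy snippet below convolves a clean signal with one impulse response per microphone and adds white noise at a chosen SNR. The impulse responses here are random exponentially decaying stand-ins (real ones would come from measurements or an image-source simulator), so this is a toy, not the patent's actual data pipeline.

```python
import numpy as np

def make_array_signal(s, rirs, snr_db, rng=None):
    # x_m(t) = h_m(t) * s(t) + v_m(t): convolve the clean signal with each
    # microphone's room impulse response, then add white noise at snr_db.
    rng = np.random.default_rng() if rng is None else rng
    x = np.stack([np.convolve(s, h)[:len(s)] for h in rirs])   # (M, T)
    noise_pow = np.mean(x ** 2) / (10 ** (snr_db / 10))
    return x + rng.standard_normal(x.shape) * np.sqrt(noise_pow)

# Toy example: 4 microphones, 0.2 s of noise-like "speech" at 16 kHz,
# random exponentially decaying stand-ins for real RIRs.
rng = np.random.default_rng(1)
s = rng.standard_normal(3200)
rirs = rng.standard_normal((4, 256)) * np.exp(-np.arange(256) / 40.0)
x = make_array_signal(s, rirs, snr_db=20, rng=rng)
```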
4. The microphone array sound source localization method of claim 2, wherein preprocessing the plurality of microphone array signals to obtain a plurality of single-frame signals comprises framing and windowing, wherein:
the framing method comprises: using a preset frame length and frame shift, dividing the time-domain signal x_m(t) of the mth array element into a plurality of single-frame signals x_m(iN + n), wherein i is the frame index, n is the sample index within a frame with 0 ≤ n < N, and N is the frame length;
the windowing method comprises: x_m(i, n) = w_H(n)·x_m(iN + n)
wherein x_m(i, n) is the signal of the ith frame of the mth array element after windowing and w_H(n) is the window function.
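A minimal numpy sketch of the framing and windowing step follows. The Hamming window is an assumption (the claim only names a window function w_H(n)), and the demo values of frame length 256 and frame shift 128 are arbitrary illustrative choices.

```python
import numpy as np

def frame_and_window(x, frame_len, frame_shift, window=None):
    # Split x_m(t) into frames x_m(i*shift + n), n = 0..N-1, then apply
    # the window w_H(n). Hamming is assumed; the claim only names w_H(n).
    if window is None:
        window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    return np.stack([
        x[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])

# Illustrative values: 1000 samples, frame length 256, frame shift 128.
x = np.arange(1000, dtype=float)
frames = frame_and_window(x, frame_len=256, frame_shift=128)
```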
5. The microphone array sound source localization method according to claim 2, wherein extracting spatial localization cues from a plurality of single-frame signals comprises:
performing a discrete Fourier transform on each single-frame signal to convert the time-domain signal into a frequency-domain signal:
X_m(i, k) = DFT(x_m(i, n))
wherein X_m(i, k) is the frequency-domain signal of the ith frame of the mth array element, k is the frequency bin index, K is the length of the discrete Fourier transform with K = 2N, and DFT(·) denotes the discrete Fourier transform;
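The per-frame DFT with K = 2N zero-padding can be sketched as below; numpy's FFT stands in for the generic DFT(·) of the claim.

```python
import numpy as np

def frame_spectra(frames, K=None):
    # X_m(i, k) = DFT of the windowed frame x_m(i, n), zero-padded to
    # K = 2N as specified in the claim.
    N = frames.shape[-1]
    if K is None:
        K = 2 * N
    return np.fft.fft(frames, n=K, axis=-1)

frames = np.random.default_rng(3).standard_normal((6, 256))
X = frame_spectra(frames)
```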
designing the Gammatone filter bank, wherein g_j(t) is the impulse response function of the jth Gammatone filter, expressed as:
g_j(t) = c·t^(a−1)·exp(−2π b_j t)·cos(2π f_j t + φ), t ≥ 0
wherein j is the index of the filter; c is the filter gain; t represents continuous time; a is the order of the filter; φ represents the phase; f_j represents the center frequency of the jth filter; and b_j represents the attenuation factor of the filter, calculated as:
b_j = 1.109 ERB(f_j)
ERB(f_j) = 24.7(4.37 f_j / 1000 + 1)
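A sketch of a Gammatone impulse response using the ERB and attenuation formulas above. The gain c = 1, order a = 4, zero phase and 50 ms duration are illustrative defaults, and the cosine-times-gamma envelope is the standard gammatone form rather than a formula reproduced from the patent's figure.

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth, as given above.
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(f_j, fs, dur=0.05, c=1.0, a=4, phase=0.0):
    # g_j(t) = c * t^(a-1) * exp(-2*pi*b_j*t) * cos(2*pi*f_j*t + phase),
    # sampled at fs; c=1, a=4, phase=0 are illustrative defaults.
    b_j = 1.109 * erb(f_j)
    t = np.arange(int(dur * fs)) / fs
    return (c * t ** (a - 1) * np.exp(-2 * np.pi * b_j * t)
            * np.cos(2 * np.pi * f_j * t + phase))

g = gammatone_ir(f_j=1000.0, fs=16000)
b_1k = 1.109 * erb(1000.0)   # attenuation factor at 1 kHz
```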
performing a discrete Fourier transform on each Gammatone filter to obtain its frequency-domain expression G_j(k);
calculating the sub-band generalized cross-correlation function of each frame signal:
R_mn(i, j, τ) = Σ_k |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · exp(j2πkτ/K)
wherein R_mn(i, j, τ) represents the generalized cross-correlation function of the mth array element and the nth array element in the jth sub-band of the ith frame, G_j(k) is the frequency-domain expression of the jth Gammatone filter, and X_n*(i, k) denotes the complex conjugate of X_n(i, k);
and calculating the sub-band time delay difference of each frame signal:
T_mn(i, j) = argmax_τ R_mn(i, j, τ)
wherein T_mn(i, j) represents the time delay difference of the mth array element and the nth array element in the jth sub-band of the ith frame;
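The generalized cross-correlation and TDOA steps can be sketched full-band as follows; the patent's subband variant would additionally weight the cross-spectrum with the Gammatone filter's response before the inverse transform, which is omitted here for brevity. Under this sign convention the peak lag is negative when the second channel is delayed relative to the first.

```python
import numpy as np

def gcc_phat(xm, xn, K=None):
    # PHAT-weighted generalized cross-correlation of one frame pair
    # (full-band sketch; a subband version would multiply `cross` by
    # the Gammatone filter's magnitude response first).
    if K is None:
        K = 2 * len(xm)
    Xm = np.fft.rfft(xm, K)
    Xn = np.fft.rfft(xn, K)
    cross = Xm * np.conj(Xn)
    cross /= np.maximum(np.abs(cross), 1e-12)        # PHAT weighting
    r = np.fft.irfft(cross, K)
    r = np.concatenate((r[-(K // 2):], r[:K // 2]))  # put lag 0 in the middle
    lags = np.arange(-(K // 2), K // 2)
    return r, lags

# T_mn = argmax_tau R_mn(tau): recover a known 5-sample delay.
rng = np.random.default_rng(2)
xm = rng.standard_normal(256)
xn = np.roll(xm, 5)              # xn is (circularly) delayed by 5 samples
r, lags = gcc_phat(xm, xn)
t_mn = lags[np.argmax(r)]        # peak near lag -5 under this convention
```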
calculating the sub-band SRP-PHAT function of each frame signal:
P(i, j, r) = Σ_{m=1}^{M−1} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))
wherein P(i, j, r) represents the SRP-PHAT power value of the jth sub-band of the ith frame signal when the beam direction of the array is r, and τ_mn(r) represents the time difference of the sound wave propagating from beam direction r to the mth microphone and the nth microphone, calculated as:
τ_mn(r) = f_s (|r − r_m| − |r − r_n|) / c
wherein r denotes the coordinates of the beam direction, r_m and r_n denote the position coordinates of the mth and nth microphones, c is the speed of sound in air, and f_s is the signal sampling rate;
setting the sound source and the microphone array in the same horizontal plane, with the sound source in the far field of the array, τ_mn(r) can be equivalently calculated as:
τ_mn(r) = f_s ξ^T (r_n − r_m) / c
wherein ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r; since τ_mn(r) is independent of the received signal, it can be calculated off-line and stored in memory;
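Because the far-field delay does not depend on the received signal, a lookup table over candidate azimuths can be computed once and reused at run time, as the claim notes. A sketch follows; the microphone coordinates, sampling rate, speed of sound and 1-degree azimuth grid are illustrative choices.

```python
import numpy as np

def tau_table(mic_pos, azimuths_deg, fs=16000.0, c=343.0):
    # Precompute tau_mn(r) = fs * xi^T (r_n - r_m) / c for every microphone
    # pair and every candidate azimuth (far field, same horizontal plane).
    th = np.deg2rad(azimuths_deg)
    xi = np.stack([np.cos(th), np.sin(th)])   # xi = [cos(theta), sin(theta)]^T
    taus = {}
    M = len(mic_pos)
    for m in range(M):
        for n in range(m + 1, M):
            taus[(m, n)] = fs * ((mic_pos[n] - mic_pos[m]) @ xi) / c
    return taus

# Two microphones 10 cm apart on the x-axis, azimuth grid of 1 degree.
mics = np.array([[-0.05, 0.0], [0.05, 0.0]])
taus = tau_table(mics, np.arange(360))
```

At broadside (90 degrees) the delay is zero, and at endfire (0 degrees) it equals fs·d/c for spacing d, which gives a quick sanity check on the geometry.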
performing normalization processing on the sub-band SRP-PHAT function;
and forming the time delay differences and the normalized SRP-PHAT functions of all the sub-bands in the same frame into a feature matrix to obtain the mixed-feature spatial cue y_train(i), wherein y_train(i) is the spatial cue of the ith frame and J is the number of sub-bands.
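A sketch of assembling the mixed-feature spatial cue for one frame. Per-subband max-normalisation of the SRP-PHAT rows is an assumption here, as the patent's exact normalisation formula is not reproduced in this text; the toy sizes are likewise arbitrary.

```python
import numpy as np

def spatial_cue_matrix(tdoas, srp, eps=1e-12):
    # Stack per-subband TDOAs T_mn(i, j) and per-subband SRP-PHAT rows
    # P(i, j, r) into one mixed-feature matrix y_train(i). The SRP rows
    # are max-normalised per sub-band (an assumed normalisation).
    srp_norm = srp / (np.abs(srp).max(axis=-1, keepdims=True) + eps)
    return np.concatenate([tdoas, srp_norm], axis=-1)

# Toy sizes: J = 32 sub-bands, 6 mic pairs, 72 candidate azimuths.
J, n_pairs, n_angles = 32, 6, 72
rng = np.random.default_rng(4)
y_i = spatial_cue_matrix(rng.standard_normal((J, n_pairs)),
                         rng.random((J, n_angles)))
```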
6. The microphone array sound source localization method according to claim 5, wherein the method of preprocessing the test signal to obtain a single-frame test signal is the same as the method of preprocessing the microphone array signals to obtain a plurality of single-frame signals;
and the method of extracting the spatial localization cue of the single-frame test signal is the same as the method of extracting the spatial localization cues of the plurality of single-frame signals.
7. The microphone array sound source localization method according to claim 2, wherein the CRN model comprises an input layer, two residual blocks, a pooling layer, two fully connected layers and a final output layer; each residual block comprises two batch normalization layers and two convolution layers, and processes its input in the order batch normalization, ReLU, convolution; the output layer adopts Softmax, and the loss function is the cross-entropy function.
8. A microphone array sound source localization apparatus, characterized by comprising:
an acquisition unit configured to acquire a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial localization cue of the single-frame test signal and taking it as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimate for the frame signal.
9. A microphone array sound source localization apparatus, characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210427289.XA CN114895245A (en) | 2022-04-22 | 2022-04-22 | Microphone array sound source positioning method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114895245A true CN114895245A (en) | 2022-08-12 |
Family
ID=82718420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210427289.XA Pending CN114895245A (en) | 2022-04-22 | 2022-04-22 | Microphone array sound source positioning method and device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114895245A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||