CN114895245A - Microphone array sound source positioning method and device and storage medium - Google Patents

Microphone array sound source positioning method and device and storage medium

Info

Publication number
CN114895245A
CN114895245A
Authority
CN
China
Prior art keywords
frame
signal
sound source
microphone
microphone array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210427289.XA
Other languages
Chinese (zh)
Inventor
王子怡
赵小燕
戎洪军
童莹
芮雄丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202210427289.XA priority Critical patent/CN114895245A/en
Publication of CN114895245A publication Critical patent/CN114895245A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a microphone array sound source positioning method, a microphone array sound source positioning device and a storage medium. The method comprises: obtaining a test signal; preprocessing the test signal to obtain a single-frame test signal; extracting a spatial localization cue from the single-frame test signal and using it as a test sample; and inputting the test sample into a pre-constructed and trained CRN model for testing, obtaining the probability that the test signal belongs to each azimuth angle, with the azimuth of maximum probability taken as the azimuth angle estimate for that frame.

Description

Microphone array sound source positioning method and device and storage medium
Technical Field
The invention relates to a microphone array sound source positioning method, a microphone array sound source positioning device and a storage medium, and belongs to the technical field of sound source positioning.
Background
The sound source localization technology based on microphone arrays has wide application prospects and potential economic value in speech recognition, front-end processing for speaker recognition systems, video conferencing, intelligent robots, smart homes, and the like. Localization algorithms based on time-delay differences and localization algorithms based on SRP-PHAT (Steered Response Power with Phase Transform) are two typical traditional localization methods; although both are easy to implement, they have low robustness to reverberation and noise.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a microphone array sound source positioning method, device, and storage medium that markedly improve the positioning performance and have better generalization capability to unknown noise and reverberation environments.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a microphone array sound source localization method, including:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
Further, the method for constructing and training the CRN model includes:
convolving the pure voice signal with the room impulse response of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
preprocessing the microphone array signals to obtain single frame signals;
extracting spatial positioning clues of a plurality of single-frame signals, using the spatial positioning clues as training samples of a CRN (Convolutional Residual Network) model, marking the corresponding direction of each sample, and using it as the class label of the sample;
and constructing a CRN model, and training by taking the training sample and the class label as a training data set of the CRN model.
Further, the pure voice signal is convolved with the room impulse responses of different azimuth angles, and noise and reverberation of different degrees are added to generate a plurality of microphone array signals, and the formula is as follows:
x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M
where x_m(t) denotes the speech signal from the specified direction received by the m-th microphone, m is the index of the microphone element, m = 1, 2, ..., M, M is the number of microphone elements, s(t) is the clean speech, h_m(t) is the room impulse response from the specified sound source direction to the m-th microphone, which depends on the sound source direction and the room reverberation, and v_m(t) denotes noise.
Further, the preprocessing the plurality of microphone array signals to obtain a plurality of single frame signals includes:
the pre-processing includes framing and windowing, wherein:
the framing method comprises: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame with 0 ≤ n < N, and N is the frame length;
the windowing method comprises: x_m(i, n) = w_H(n) x_m(iN + n)
where x_m(i, n) is the windowed signal of the i-th frame of the m-th array element and w_H(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N, is a Hamming window.
Further, the extracting the spatial localization cues of the plurality of single-frame signals comprises:
performing discrete Fourier transform on each single-frame signal, and converting a time domain signal into a frequency domain signal;
the discrete Fourier transform is computed as:
X_m(i, k) = DFT{x_m(i, n)} = Σ_{n=0}^{N−1} x_m(i, n) e^{−j2πkn/K}
where X_m(i, k) is the discrete Fourier transform of x_m(i, n) and denotes the frequency-domain signal of the i-th frame of the m-th array element, k is the frequency bin index, K is the DFT length with K = 2N, and DFT{·} denotes the discrete Fourier transform;
designing the Gammatone filter bank: g_j(t) is the impulse response of the j-th Gammatone filter, expressed as
g_j(t) = c t^(a−1) e^(−2π b_j t) cos(2π f_j t + φ), t ≥ 0
where j is the filter index, c is the filter gain, t is continuous time, a is the filter order, φ is the phase, f_j is the center frequency of the j-th filter, and b_j is the attenuation factor of the filter, calculated as:
b_j = 1.109 ERB(f_j)
ERB(f_j) = 24.7 (4.37 f_j / 1000 + 1)
each Gammatone filter is transformed by a discrete Fourier transform to obtain its frequency-domain expression G_j(k) = DFT{g_j(n)};
calculating the sub-band generalized cross-correlation function of each frame signal as:
R_mn(i, j, τ) = Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{j2πkτ/K}
where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band time delay difference of each frame signal as:
T_mn(i, j) = argmax_τ R_mn(i, j, τ)
where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band SRP-PHAT function of each frame signal as:
P(i, j, r) = Σ_{m=1}^{M} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))
where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the array beam is steered toward r; τ_mn(r) is the time difference with which the sound wave propagating from the beam direction r reaches the m-th and n-th microphones, calculated as:
τ_mn(r) = f_s (‖r − r_m‖ − ‖r − r_n‖) / c
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air, and f_s is the signal sampling rate;
when the sound source and the microphone array are in the same horizontal plane and the sound source is in the far field of the array, τ_mn(r) can equivalently be calculated as:
τ_mn(r) = f_s ξ^T (r_n − r_m) / c
where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r; τ_mn(r) is independent of the received signal and can therefore be calculated off-line and stored in memory;
the sub-band SRP-PHAT function is then normalized, yielding the normalized sub-band spatial spectrum P̂(i, j, r);
forming a feature matrix from the time delay differences of all sub-bands in the same frame together with the sub-band SRP-PHAT spectra, to obtain the mixed-feature spatial cue y_train(i), which stacks T_mn(i, j) over all microphone pairs and P̂(i, j, r) over all beam directions for every sub-band j = 1, ..., J, where y_train(i) is the training feature of the i-th frame and J is the number of sub-bands.
Further, the method for preprocessing the test signal to obtain a single frame test signal is the same as the method for preprocessing the plurality of microphone array signals to obtain a plurality of single frame signals;
the method for extracting the spatial localization cues of the single-frame test signal is the same as the method for extracting the spatial localization cues of the plurality of single-frame signals.
Further, the CRN model includes an input layer, two residual blocks, a pooling layer, two fully-connected layers, and a final output layer, wherein the residual blocks are composed of a plurality of convolution layers and a batch normalization layer, each residual block structure includes two batch normalization layers and two convolution layers, and the inputs are sequentially processed according to an order of the batch normalization layer first, the ReLU next, and the convolution layer last, wherein the output layer employs Softmax, and the loss function is a cross entropy function.
In a second aspect, the present invention provides a microphone array sound source localization apparatus, including:
an acquisition unit for acquiring a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
In a third aspect, the present invention provides a microphone array sound source localization apparatus, including a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of the preceding claims.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the preceding claims.
Compared with the prior art, the invention has the following beneficial effects:
the invention adopts the sub-band time delay difference and the sub-band SRP-PHAT space spectrum as the space positioning clue, and the characteristic clue has stronger robustness and space information representation capability; the method adopts the convolution residual error network to construct the mapping relation between the space positioning clue and the sound source direction, and the positioning model can accelerate the characteristics in the circulation network, reduce the characteristic loss and reduce the training difficulty; the invention can complete the training process of the CRN network of the positioning model off line, the trained network is stored in the memory, and the real-time sound source positioning can be realized only by one frame of signal during the test.
Drawings
Fig. 1 is a flowchart of a sound source positioning method of a microphone array according to an embodiment of the present invention;
FIG. 2 is a block diagram of a CRN model structure provided by an embodiment of the present invention;
FIG. 3 is a block diagram of a residual block structure provided by an embodiment of the present invention;
fig. 4 and 5 are positioning effect graphs of various algorithms when the testing environment and the training environment provided by the embodiment of the invention are consistent;
fig. 6 and fig. 7 are graphs of positioning results in a non-training noise environment according to an embodiment of the present invention;
fig. 8 and 9 are graphs of positioning results in a non-training reverberation environment according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
The embodiment introduces a microphone array sound source localization method, which includes:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability of the test signal belonging to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
The application process of the microphone array sound source positioning method provided by the embodiment specifically relates to the following steps:
step one, convolving the pure voice signal with the room impulse response of different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of directional voice signals of different azimuth angles:
x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M
where x_m(t) denotes the speech signal from the specified direction received by the m-th microphone, m is the index of the microphone element, m = 1, 2, ..., M, M is the number of microphone elements, s(t) is the clean speech, h_m(t) is the room impulse response from the specified sound source direction to the m-th microphone, which depends on the sound source direction and the room reverberation, and v_m(t) denotes noise.
In this embodiment, the microphone array is a uniform circular array of 6 omnidirectional microphones with a radius of 0.1 m. The sound source and the microphone array are in the same horizontal plane and the sound source lies in the far field of the array; the direction straight ahead in the horizontal plane is defined as 90°, the sound source azimuth ranges over [0°, 360°) at 10° intervals, so the number of training azimuths is 36. The reverberation times of the training data are 0.5 s and 0.8 s, and the Image method is used to generate the room impulse responses h_m(t) for the different azimuth angles and reverberation times. v_m(t) is white Gaussian noise, and the signal-to-noise ratios of the training data are 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB.
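For illustration only, the following Python sketch (not part of the patent) builds such array signals: it convolves a clean speech vector with per-microphone room impulse responses and adds white Gaussian noise at a chosen SNR. The function name simulate_array_signals and the stand-in data are assumptions; in practice the RIRs would come from an Image-method simulator as described above.

```python
import numpy as np

def simulate_array_signals(s, rirs, snr_db, rng=None):
    """Build x_m(t) = h_m(t) * s(t) + v_m(t) for every microphone.

    s      : (T,) clean speech samples
    rirs   : (M, L) room impulse responses, one row per microphone
             (assumed to be produced elsewhere, e.g. by an Image-method simulator)
    snr_db : desired per-channel signal-to-noise ratio in dB
    """
    rng = np.random.default_rng() if rng is None else rng
    M = rirs.shape[0]
    # Reverberant speech at each microphone: convolution with the RIR.
    reverberant = np.stack([np.convolve(s, rirs[m])[: len(s)] for m in range(M)])
    # Scale white Gaussian noise so each channel reaches the requested SNR.
    sig_power = np.mean(reverberant ** 2, axis=1, keepdims=True)
    noise = rng.standard_normal(reverberant.shape)
    noise *= np.sqrt(sig_power / (10 ** (snr_db / 10.0))
                     / np.mean(noise ** 2, axis=1, keepdims=True))
    return reverberant + noise

# Placeholder data: 6 microphones, 1 s of "speech" at 16 kHz, random stand-in RIRs.
fs = 16000
s = np.random.randn(fs)
rirs = 0.01 * np.random.randn(6, 4096)
x = simulate_array_signals(s, rirs, snr_db=10)
print(x.shape)  # (6, 16000)
```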
And step two, preprocessing the microphone array signal obtained in the step one to obtain a single-frame signal.
The pre-processing includes framing and windowing, wherein:
The framing method comprises: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame with 0 ≤ n < N, and N is the frame length. In this embodiment the sampling rate f_s of the speech signal is 16 kHz, the frame length N is 512 samples (32 ms), and the frames do not overlap (the frame shift equals the frame length).
The windowing method comprises: x_m(i, n) = w_H(n) x_m(iN + n)
where x_m(i, n) is the windowed signal of the i-th frame of the m-th array element and w_H(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N, is a Hamming window.
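A minimal sketch of the framing and Hamming-windowing step under the stated settings (16 kHz sampling, 512-sample frames, no overlap); the helper name frame_and_window is illustrative, not from the patent.

```python
import numpy as np

def frame_and_window(x, frame_len=512):
    """Split one channel into non-overlapping frames and apply a Hamming window.

    x : (T,) time-domain signal of a single array element
    Returns an (num_frames, frame_len) array of windowed frames x_m(i, n).
    """
    num_frames = len(x) // frame_len
    frames = x[: num_frames * frame_len].reshape(num_frames, frame_len)
    w = np.hamming(frame_len)   # w_H(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))
    return frames * w

x = np.random.randn(16000)      # stand-in for one microphone channel (1 s at 16 kHz)
frames = frame_and_window(x)
print(frames.shape)             # (31, 512)
```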
And step three, extracting the spatial positioning clue of the array signal. The method specifically comprises the following steps:
and (3-1) performing discrete Fourier transform on each single-frame signal obtained in the step two, and converting the time domain signal into a frequency domain signal.
The discrete Fourier transform is computed as:
X_m(i, k) = DFT{x_m(i, n)} = Σ_{n=0}^{N−1} x_m(i, n) e^{−j2πkn/K}
where X_m(i, k) is the discrete Fourier transform of x_m(i, n) and denotes the frequency-domain signal of the i-th frame of the m-th array element, k is the frequency bin index, K is the DFT length with K = 2N, and DFT{·} denotes the discrete Fourier transform. In this embodiment the Fourier transform length is set to 1024.
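The zero-padded DFT of each windowed frame can then be taken as sketched below (K = 1024 = 2N); the function name is again illustrative.

```python
import numpy as np

def frames_to_spectra(frames, fft_len=1024):
    """Zero-pad each windowed frame to K = 2N and take the DFT, giving X_m(i, k)."""
    return np.fft.fft(frames, n=fft_len, axis=-1)   # shape (num_frames, fft_len)

spectra = frames_to_spectra(np.random.randn(31, 512))
print(spectra.shape)  # (31, 1024)
```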
(3-2) Designing the Gammatone filter bank. g_j(t) is the impulse response of the j-th Gammatone filter, expressed as
g_j(t) = c t^(a−1) e^(−2π b_j t) cos(2π f_j t + φ), t ≥ 0
where j is the filter index, c is the filter gain, t is continuous time, a is the filter order, φ is the phase, f_j is the center frequency of the j-th filter, and b_j is the attenuation factor of the filter, calculated as:
b_j = 1.109 ERB(f_j)
ERB(f_j) = 24.7 (4.37 f_j / 1000 + 1)
In this embodiment the order a is 4, the phase φ is set to 0, and the number of sub-band filters is 32, i.e. j = 1, 2, ..., 32, with filter center frequencies f_j in the range [200 Hz, 8000 Hz].
Each Gammatone filter is transformed by a discrete Fourier transform to obtain its frequency-domain expression G_j(k) = DFT{g_j(n)}.
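A rough sketch of such a Gammatone filter-bank design in Python. The ERB-scale spacing of the 32 center frequencies between 200 Hz and 8000 Hz is an assumption (the patent only gives the range), as are the 50 ms impulse-response length and the crude gain normalization; the attenuation factor follows the patent's b_j = 1.109·ERB(f_j).

```python
import numpy as np

def gammatone_bank_freq(fs=16000, fft_len=1024, n_filters=32,
                        f_lo=200.0, f_hi=8000.0, order=4):
    """Frequency responses G_j(k) of a Gammatone filter bank.

    Returns (bank, centers) where bank is an (n_filters, fft_len) complex array.
    """
    def hz_to_erb_rate(f):
        return 21.4 * np.log10(4.37e-3 * f + 1.0)

    def erb_rate_to_hz(e):
        return (10 ** (e / 21.4) - 1.0) / 4.37e-3

    # Center frequencies spaced on the ERB-rate scale (an assumption).
    centers = erb_rate_to_hz(np.linspace(hz_to_erb_rate(f_lo),
                                         hz_to_erb_rate(f_hi), n_filters))
    t = np.arange(int(0.05 * fs)) / fs            # 50 ms impulse responses
    bank = np.zeros((n_filters, fft_len), dtype=complex)
    for j, fc in enumerate(centers):
        b = 1.109 * 24.7 * (4.37 * fc / 1000.0 + 1.0)   # b_j = 1.109 * ERB(f_j)
        g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        g /= np.max(np.abs(g)) + 1e-12            # crude gain normalization
        bank[j] = np.fft.fft(g, n=fft_len)        # G_j(k) = DFT{g_j(n)}
    return bank, centers

G, centers = gammatone_bank_freq()
print(G.shape)  # (32, 1024)
```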
(3-3) The sub-band generalized cross-correlation function of each frame signal is calculated as:
R_mn(i, j, τ) = Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{j2πkτ/K}
where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame.
(3-4) The sub-band time delay difference of each frame signal is calculated as:
T_mn(i, j) = argmax_τ R_mn(i, j, τ)
where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame.
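The sub-band GCC and delay estimation might be sketched as follows for one frame. The weighting of the PHAT-normalized cross-spectrum by |G_j(k)|² and the restricted lag search are assumptions consistent with the formulas above; function names are illustrative, and only pairs with n > m are filled.

```python
import numpy as np

def subband_gcc_and_tdoa(Xi, G, max_lag=10):
    """Sub-band GCC R_mn(i, j, tau) and delays T_mn(i, j) for one frame.

    Xi : (M, K) frame spectra X_m(i, k)
    G  : (J, K) Gammatone filter frequency responses G_j(k)
    Returns (R, T) with shapes (M, M, J, 2*max_lag+1) and (M, M, J).
    """
    M, K = Xi.shape
    J = G.shape[0]
    lags = np.arange(-max_lag, max_lag + 1)
    k = np.arange(K)
    basis = np.exp(2j * np.pi * np.outer(k, lags) / K)     # e^{j2*pi*k*tau/K}
    R = np.zeros((M, M, J, lags.size))
    for m in range(M):
        for n in range(m + 1, M):
            cross = Xi[m] * np.conj(Xi[n])
            phat = cross / (np.abs(cross) + 1e-12)          # phase transform weighting
            for j in range(J):
                weighted = (np.abs(G[j]) ** 2) * phat       # sub-band weighting
                R[m, n, j] = np.real(weighted @ basis) / K
    T = lags[np.argmax(R, axis=-1)]                         # T_mn = argmax_tau R_mn
    return R, T

# Stand-in data: 6 channels, 1024-point spectra, 32 sub-band responses.
Xi = np.fft.fft(np.random.randn(6, 512), n=1024, axis=-1)
G = np.fft.fft(np.random.randn(32, 128), n=1024, axis=-1)
R, T = subband_gcc_and_tdoa(Xi, G)
print(T.shape)  # (6, 6, 32)
```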
(3-5) The sub-band SRP-PHAT function of each frame signal is calculated as:
P(i, j, r) = Σ_{m=1}^{M} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))
where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the array beam is steered toward r; τ_mn(r) is the time difference with which the sound wave propagating from the beam direction r reaches the m-th and n-th microphones, calculated as:
τ_mn(r) = f_s (‖r − r_m‖ − ‖r − r_n‖) / c
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air (about 342 m/s at room temperature), and f_s is the signal sampling rate.
In this embodiment, if the sound source and the microphone array are set to be in the same horizontal plane, and the sound source is located in the far field of the array, then τ is mn The equivalent calculation formula of (r) is:
Figure BDA0003610145030000103
where ξ is [ cos θ, sin θ ]] T And θ is the azimuth of the beam direction r. Tau. mn (r) is independent of the received signal and can therefore be calculated off-line and stored in memory.
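A sketch of the far-field steering delays and the sub-band SRP-PHAT spectrum. The microphone geometry matches the embodiment (6-element circular array of radius 0.1 m, 36 steering azimuths at 10° spacing); the row-wise normalization at the end is one plausible choice and not necessarily the patent's exact normalization formula, and the random R is only a stand-in for the GCC output of the previous sketch.

```python
import numpy as np

def farfield_delays(mic_xy, azimuths_deg, fs=16000, c=342.0):
    """tau_mn(r) in samples for far-field sources in the array plane."""
    az = np.deg2rad(np.asarray(azimuths_deg))
    xi = np.stack([np.cos(az), np.sin(az)], axis=1)          # (L, 2) unit vectors
    proj = xi @ mic_xy.T                                      # (L, M) projections xi . r_m
    return fs * (proj[:, None, :] - proj[:, :, None]) / c    # (L, M, M): xi.(r_n - r_m)

def subband_srp_phat(R, delays, max_lag=10):
    """SRP-PHAT power P(i, j, r): sum of pairwise GCC values at tau_mn(r).

    R      : (M, M, J, 2*max_lag+1) sub-band GCC of one frame
    delays : (L, M, M) steering delays in samples
    """
    M, _, J, _ = R.shape
    L = delays.shape[0]
    idx = np.clip(np.round(delays).astype(int) + max_lag, 0, 2 * max_lag)
    P = np.zeros((J, L))
    for l in range(L):
        for m in range(M):
            for n in range(m + 1, M):
                P[:, l] += R[m, n, :, idx[l, m, n]]
    # Normalize each sub-band spectrum over the beam directions (one plausible choice).
    return P / (P.sum(axis=1, keepdims=True) + 1e-12)

angles = np.linspace(0, 2 * np.pi, 6, endpoint=False)
mic_xy = 0.1 * np.stack([np.cos(angles), np.sin(angles)], axis=1)   # 6-mic circular array
delays = farfield_delays(mic_xy, np.arange(0, 360, 10))
R = np.random.rand(6, 6, 32, 21)                                    # stand-in GCC values
P = subband_srp_phat(R, delays)
print(delays.shape, P.shape)  # (36, 6, 6) (32, 36)
```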
The sub-band SRP-PHAT function is then normalized, yielding the normalized sub-band spatial spectrum P̂(i, j, r).
(3-6) The time delay differences of all sub-bands in a frame and the corresponding sub-band SRP-PHAT spectra are stacked into a feature matrix, giving the mixed-feature spatial cue y_train(i), the training feature of the i-th frame, where J is the number of sub-bands (J = 32 in this embodiment). The beam directions cover the azimuth range [0°, 360°) with 10° spacing, so L = 36 in this embodiment. The number of microphones M is 6, so the number of microphone pairs is M(M − 1)/2 = 15. The dimension of the localization cue on each sub-band of each frame is thus 15 + 36 = 51, and the dimension of the feature matrix is 51 × 32.
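Assembling the 51 × 32 feature matrix from the per-sub-band delay differences and SRP-PHAT spectra could look like the sketch below; the exact row ordering inside the matrix is an assumption.

```python
import numpy as np

def build_feature_matrix(T, P):
    """Stack per-sub-band TDOAs and SRP-PHAT spectra into a (M(M-1)/2 + L) x J matrix.

    T : (M, M, J) sub-band delay differences (only the upper-triangular pairs are used)
    P : (J, L) normalized sub-band SRP-PHAT spectra
    """
    M, _, J = T.shape
    iu = np.triu_indices(M, k=1)      # the 15 microphone pairs for M = 6
    tdoa = T[iu]                      # (15, J)
    return np.vstack([tdoa, P.T])     # (15 + L, J) -> 51 x 32 in this embodiment

feat = build_feature_matrix(np.random.randn(6, 6, 32), np.random.rand(32, 36))
print(feat.shape)  # (51, 32)
```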
Step four, preparing a training set: according to the first step to the third step, the spatial characteristic parameters of directional voice data in all training environments (the implementation and the setting of the training environments are detailed in the first step) are extracted and used as the training samples of the CRN, and meanwhile, the corresponding direction of each sample is marked and used as the class label of the sample.
And step five, constructing a CRN model and training it with the training samples and class labels obtained in step four as the training data set of the CRN, to obtain the trained CRN model. The method specifically comprises the following steps:
(5-1) Setting the CRN model structure.
The CRN model structure adopted by the invention is shown in Fig. 2 and comprises an input layer, two residual blocks, a pooling layer, two fully connected layers, and a final output layer. The structure of each residual block is shown in Fig. 3: a residual block is composed of convolutional layers and batch normalization (BN) layers; each residual block contains two BN layers and two convolutional layers, and the input is processed in the order BN, then ReLU, then convolution.
The input to the input layer is the feature matrix y_train(i) of dimension (M(M − 1)/2 + L) × J; in the present embodiment J = 32, L = 36, and M = 6. The convolution kernels in the convolutional layers of the residual blocks are of size 3 × 3 with stride 1 × 1, and zero padding keeps the dimensions of the feature maps unchanged before and after convolution. The first fully connected layer has 128 hidden units and the second fully connected layer has 36, equal to the number of output azimuth angles. The output layer uses Softmax, and the loss function is the cross-entropy function.
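A PyTorch sketch of a network with this structure (pre-activation residual blocks, pooling, two fully connected layers, 36 outputs). The stem convolution, the channel count of 16, and the adaptive-pooling output size are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice, plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels, eps=1e-3)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels, eps=1e-3)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out

class CRN(nn.Module):
    """Sketch of the localization network: two residual blocks, pooling, two FC layers."""
    def __init__(self, n_classes=36, channels=16):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)  # channel count assumed
        self.blocks = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.pool = nn.AdaptiveAvgPool2d((6, 4))
        self.fc1 = nn.Linear(channels * 6 * 4, 128)
        self.fc2 = nn.Linear(128, n_classes)

    def forward(self, x):                # x: (batch, 1, 51, 32) feature matrices
        h = self.pool(self.blocks(self.stem(x)))
        h = torch.relu(self.fc1(h.flatten(1)))
        return self.fc2(h)               # logits; softmax is applied inside the loss

model = CRN()
logits = model(torch.randn(4, 1, 51, 32))
print(logits.shape)  # torch.Size([4, 36])
```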
And (5-2) training network parameters of the CRN model.
The method uses an Adam optimizer to progressively reduce the loss function during model training. During CRN training, information propagates forward, errors propagate backward, and the model parameters are updated accordingly. The model parameters are initialized with Xavier initialization: if the numbers of input and output units of a weight layer W are n_j and n_{j+1} respectively, Xavier initialization draws
W ~ U[−√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1})]
where U denotes a uniform distribution and the symbol ~ indicates that the weight layer W follows the distribution U.
During training, the initial learning rate is set to 0.001, the batch size to 200, the value of ε in the BN layers to 0.001, and the decay coefficient to 0.999. A cross-validation scheme is used: in each iteration the training data are randomly split into two parts, 70% as the training set and 30% as the validation set. The quality of the model is measured with the cross-entropy loss over repeated cross-validation runs, which completes the training stage of the model.
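A minimal training-loop sketch reusing the CRN class from the previous sketch, with Adam at learning rate 0.001, batch size 200, Xavier initialization, cross-entropy loss, and a random 70/30 train/validation split per iteration; the number of epochs and the placeholder data are illustrative.

```python
import torch
import torch.nn as nn

def train_crn(model, features, labels, epochs=20, lr=1e-3, batch_size=200):
    """Minimal training loop matching the stated settings."""
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)   # W ~ U[-sqrt(6)/sqrt(n_j+n_{j+1}), +sqrt(6)/sqrt(n_j+n_{j+1})]
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()      # softmax + cross-entropy on the logits
    n = features.shape[0]
    for epoch in range(epochs):
        perm = torch.randperm(n)
        split = int(0.7 * n)
        train_idx, val_idx = perm[:split], perm[split:]
        model.train()
        for start in range(0, split, batch_size):
            idx = train_idx[start:start + batch_size]
            opt.zero_grad()
            loss = loss_fn(model(features[idx]), labels[idx])
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(features[val_idx]), labels[val_idx])
        print(f"epoch {epoch}: validation loss {val_loss.item():.4f}")

# Placeholder training data: 1000 frames of 51 x 32 cues with random azimuth labels.
feats = torch.randn(1000, 1, 51, 32)
labs = torch.randint(0, 36, (1000,))
train_crn(CRN(), feats, labs, epochs=2)
```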
Step six, processing the test signal according to step two and step three to obtain the spatial localization cue y_test(i) of the single-frame test signal, which is used as the test sample.
And step seven, feeding the test sample as the input feature to the CRN model trained in step five; the CRN outputs the probability that the test signal belongs to each azimuth angle, and the azimuth with the highest probability is taken as the azimuth angle estimate of the frame signal.
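Single-frame inference could then be sketched as below (again reusing the CRN class from the sketch above): the softmax posterior over the 36 azimuth classes is computed and the azimuth with the highest probability is returned.

```python
import torch

def localize_frame(model, feature_matrix, azimuths_deg):
    """Return the azimuth with the highest posterior probability for one frame."""
    model.eval()
    with torch.no_grad():
        logits = model(feature_matrix.view(1, 1, *feature_matrix.shape))
        probs = torch.softmax(logits, dim=-1).squeeze(0)   # P(theta_l | frame)
    return azimuths_deg[int(probs.argmax())], probs

azimuths = list(range(0, 360, 10))
est_az, probs = localize_frame(CRN(), torch.randn(51, 32), azimuths)
print(est_az)
```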
The invention adopts mixed features (sub-band time delay differences and sub-band SRP-PHAT spatial spectra) as spatial localization cues; these feature cues are robust and have strong capability to represent spatial information. A convolutional residual network is used to construct the mapping between the spatial localization cues and the sound source direction; the residual structure speeds up the flow of features through the network, reduces feature loss, and lowers the training difficulty. The training of the CRN localization model can be completed off-line, the trained network is stored in memory, and real-time sound source localization can be achieved from only a single frame of signal at test time. Compared with the traditional SRP-PHAT algorithm and localization algorithms based on deep neural networks, the proposed algorithm markedly improves the localization performance in complex acoustic environments and generalizes better to the sound source spatial structure, reverberation, and noise.
Fig. 4 and 5 show the positioning effect of various algorithms when the testing environment is consistent with the training environment, and it can be seen from the figure that the success rate of positioning of the algorithm of the present invention is higher than that of the conventional SRP-PHAT and the positioning algorithm based on the deep neural network. Fig. 6, 7, 8 and 9 show the positioning effect of various algorithms when the test environment and the training environment are not consistent. Fig. 6 and 7 show the localization results in the non-training noise environment, and fig. 8 and 9 show the localization results in the non-training reverberation environment. As can be seen from the figure, even under the non-training environment, the success rate of the algorithm of the invention is still higher than that of the traditional SRP-PHAT algorithm and the positioning algorithm based on the deep neural network, which shows that the method of the invention has better robustness and generalization capability to the unknown environment.
Example 2
The present embodiment provides a microphone array sound source localization apparatus, including:
an acquisition unit for acquiring a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
Further, the test unit includes a module for constructing and training a CRN model, and the module for constructing and training a CRN model includes:
the microphone array signal generating module is used for convolving the pure voice signals with the room impulse responses at different azimuth angles and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
the preprocessing module is used for preprocessing the microphone array signals to obtain a plurality of single-frame signals;
the extraction module is used for extracting spatial positioning clues of a plurality of single-frame signals, using the spatial positioning clues as training samples of the CRN model, marking the corresponding direction of each sample, and using the corresponding direction as a category label of the sample;
and the construction and training module is used for constructing the CRN model and training the training samples and the class labels as a training data set of the CRN model.
The CRN model comprises an input layer, two residual blocks, a pooling layer, two full-connection layers and a final output layer, wherein the residual blocks are composed of a plurality of convolution layers and batch normalization layers, each residual block structure comprises two batch normalization layers and two convolution layers, and input is sequentially processed according to the sequence of the batch normalization layer, the ReLU and the convolution layer, wherein the output layer adopts Softmax, and a loss function is a cross entropy function.
Example 3
The embodiment provides a microphone array sound source positioning device, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any of embodiment 1.
Example 4
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method of any of the embodiments 1.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A microphone array sound source localization method, comprising:
acquiring a test signal;
preprocessing the test signal to obtain a single-frame test signal;
extracting a spatial positioning clue of the single-frame test signal, and taking the spatial positioning clue as a test sample;
and inputting the test sample into a pre-constructed and trained CRN model for testing, and acquiring the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
2. The microphone array sound source localization method of claim 1, wherein the CRN model construction and training method comprises:
convolving the pure voice signals with the room impulse responses at different azimuth angles, and adding different degrees of noise and reverberation to generate a plurality of microphone array signals;
preprocessing the microphone array signals to obtain single frame signals;
extracting spatial positioning clues of a plurality of single-frame signals, using the spatial positioning clues as training samples of a CRN (Convolutional Residual Network) model, marking the corresponding direction of each sample, and using it as the class label of the sample;
and constructing a CRN model, and training by taking the training sample and the class label as a training data set of the CRN model.
3. The microphone array sound source localization method according to claim 2, wherein the plurality of microphone array signals are generated by convolving the clean speech signal with room impulse responses at different azimuth angles and adding different degrees of noise and reverberation, and the formula is as follows:
x_m(t) = h_m(t) * s(t) + v_m(t), m = 1, 2, ..., M
where x_m(t) denotes the speech signal from the specified direction received by the m-th microphone, m is the index of the microphone element, m = 1, 2, ..., M, M is the number of microphone elements, s(t) is the clean speech, h_m(t) is the room impulse response from the specified sound source direction to the m-th microphone, which depends on the sound source direction and the room reverberation, and v_m(t) denotes noise.
4. The microphone array sound source localization method of claim 2, wherein the pre-processing the plurality of microphone array signals to obtain a plurality of single frame signals comprises:
the pre-processing includes framing and windowing, wherein:
the framing method comprises: using a preset frame length and frame shift, the time-domain signal x_m(t) of the m-th array element is divided into a plurality of single-frame signals x_m(iN + n), where i is the frame index, n is the sample index within a frame with 0 ≤ n < N, and N is the frame length;
the windowing method comprises: x_m(i, n) = w_H(n) x_m(iN + n)
where x_m(i, n) is the windowed signal of the i-th frame of the m-th array element and w_H(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N, is a Hamming window.
5. The microphone array sound source localization method according to claim 2, wherein the extracting spatial localization cues for a plurality of single frame signals comprises:
performing discrete Fourier transform on each single-frame signal, and converting a time domain signal into a frequency domain signal;
the discrete Fourier transform is computed as:
X_m(i, k) = DFT{x_m(i, n)} = Σ_{n=0}^{N−1} x_m(i, n) e^{−j2πkn/K}
where X_m(i, k) is the discrete Fourier transform of x_m(i, n) and denotes the frequency-domain signal of the i-th frame of the m-th array element, k is the frequency bin index, K is the DFT length with K = 2N, and DFT{·} denotes the discrete Fourier transform;
designing the Gammatone filter bank: g_j(t) is the impulse response of the j-th Gammatone filter, expressed as
g_j(t) = c t^(a−1) e^(−2π b_j t) cos(2π f_j t + φ), t ≥ 0
where j is the filter index, c is the filter gain, t is continuous time, a is the filter order, φ is the phase, f_j is the center frequency of the j-th filter, and b_j is the attenuation factor of the filter, calculated as:
b_j = 1.109 ERB(f_j)
ERB(f_j) = 24.7 (4.37 f_j / 1000 + 1)
each Gammatone filter is transformed by a discrete Fourier transform to obtain its frequency-domain expression G_j(k) = DFT{g_j(n)};
calculating the sub-band generalized cross-correlation function of each frame signal as:
R_mn(i, j, τ) = Σ_{k=0}^{K−1} |G_j(k)|² · [X_m(i, k) X_n*(i, k) / |X_m(i, k) X_n*(i, k)|] · e^{j2πkτ/K}
where R_mn(i, j, τ) is the generalized cross-correlation function of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band time delay difference of each frame signal as:
T_mn(i, j) = argmax_τ R_mn(i, j, τ)
where T_mn(i, j) is the time delay difference of the m-th and n-th array elements in the j-th sub-band of the i-th frame;
calculating the sub-band SRP-PHAT function of each frame signal as:
P(i, j, r) = Σ_{m=1}^{M} Σ_{n=m+1}^{M} R_mn(i, j, τ_mn(r))
where P(i, j, r) is the SRP-PHAT power value of the j-th sub-band of the i-th frame signal when the array beam is steered toward r; τ_mn(r) is the time difference with which the sound wave propagating from the beam direction r reaches the m-th and n-th microphones, calculated as:
τ_mn(r) = f_s (‖r − r_m‖ − ‖r − r_n‖) / c
where r denotes the coordinates of the beam direction, r_m denotes the position coordinates of the m-th microphone, c is the speed of sound in air, and f_s is the signal sampling rate;
when the sound source and the microphone array are in the same horizontal plane and the sound source is in the far field of the array, τ_mn(r) can equivalently be calculated as:
τ_mn(r) = f_s ξ^T (r_n − r_m) / c
where ξ = [cos θ, sin θ]^T and θ is the azimuth angle of the beam direction r; τ_mn(r) is independent of the received signal and can therefore be calculated off-line and stored in memory;
the sub-band SRP-PHAT function is then normalized, yielding the normalized sub-band spatial spectrum P̂(i, j, r);
forming a feature matrix from the time delay differences of all sub-bands in the same frame together with the sub-band SRP-PHAT spectra, to obtain the mixed-feature spatial cue y_train(i), which stacks T_mn(i, j) over all microphone pairs and P̂(i, j, r) over all beam directions for every sub-band j = 1, ..., J, where y_train(i) is the feature of the i-th frame and J is the number of sub-bands.
6. The microphone array sound source localization method according to claim 5, characterized in that: the method for preprocessing the test signal to obtain a single-frame test signal is the same as the method for preprocessing the microphone array signals to obtain a plurality of single-frame signals;
the method for extracting the spatial localization cues of the single-frame test signal is the same as the method for extracting the spatial localization cues of the plurality of single-frame signals.
7. The microphone array sound source localization method according to claim 2, characterized in that: the CRN model comprises an input layer, two residual blocks, a pooling layer, two full-connection layers and a final output layer, wherein the residual blocks are composed of a plurality of convolution layers and batch normalization layers, each residual block structure comprises two batch normalization layers and two convolution layers, and input is sequentially processed according to the sequence of the batch normalization layer, the ReLU and the convolution layer, wherein the output layer adopts Softmax, and a loss function is a cross entropy function.
8. A microphone array sound source localization apparatus, characterized by comprising:
an acquisition unit configured to acquire a test signal;
the preprocessing unit is used for preprocessing the test signal to obtain a single-frame test signal;
the extraction unit is used for extracting the spatial positioning clue of the single-frame test signal and taking the spatial positioning clue as a test sample;
and the testing unit is used for inputting the test sample into a pre-constructed and trained CRN model for testing to obtain the probability that the test signal belongs to each azimuth angle, wherein the azimuth with the maximum probability is taken as the azimuth angle estimated value of the frame signal.
9. A microphone array sound source localization apparatus characterized in that: comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the program when executed by a processor implements the steps of the method of any one of claims 1 to 7.
CN202210427289.XA 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium Pending CN114895245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210427289.XA CN114895245A (en) 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210427289.XA CN114895245A (en) 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium

Publications (1)

Publication Number Publication Date
CN114895245A true CN114895245A (en) 2022-08-12

Family

ID=82718420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210427289.XA Pending CN114895245A (en) 2022-04-22 2022-04-22 Microphone array sound source positioning method and device and storage medium

Country Status (1)

Country Link
CN (1) CN114895245A (en)

Similar Documents

Publication Publication Date Title
CN107703486B (en) Sound source positioning method based on convolutional neural network CNN
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN109490822B (en) Voice DOA estimation method based on ResNet
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN107452389A (en) A kind of general monophonic real-time noise-reducing method
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
CN110068795A (en) A kind of indoor microphone array sound localization method based on convolutional neural networks
CN110223708B (en) Speech enhancement method based on speech processing and related equipment
CN109164415B (en) Binaural sound source positioning method based on convolutional neural network
Kumatani et al. Beamforming with a maximum negentropy criterion
CN110544490A (en) sound source positioning method based on Gaussian mixture model and spatial power spectrum characteristics
Aroudi et al. Dbnet: Doa-driven beamforming network for end-to-end reverberant sound source separation
CN113936681A (en) Voice enhancement method based on mask mapping and mixed hole convolution network
Pujol et al. Source localization in reverberant rooms using Deep Learning and microphone arrays
Zhao et al. Sound source localization based on srp-phat spatial spectrum and deep neural network
Salvati et al. Two-microphone end-to-end speaker joint identification and localization via convolutional neural networks
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
Salvati et al. End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks.
CN113111765A (en) Multi-voice source counting and positioning method based on deep learning
CN111123202B (en) Indoor early reflected sound positioning method and system
CN110838303B (en) Voice sound source positioning method using microphone array
Salvati et al. Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features.
Aroudi et al. DBNET: DOA-driven beamforming network for end-to-end farfield sound source separation
CN115713943A (en) Beam forming voice separation method based on complex space angular center Gaussian mixture clustering model and bidirectional long-short-term memory network
CN114895245A (en) Microphone array sound source positioning method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination