CN110007276B

CN110007276B - Sound source positioning method and system

Info

Publication number: CN110007276B
Application number: CN201910312565.6A
Authority: CN
Inventors: 黄丽霞; 张雪英; 王杰; 李凤莲; 陈桂军
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2019-04-18
Filing date: 2019-04-18
Publication date: 2021-01-12
Anticipated expiration: 2039-04-18
Also published as: CN110007276A

Abstract

The invention discloses a sound source positioning method and a sound source positioning system. The sound source positioning method firstly carries out windowing and framing on sound source voice signals obtained by the quaternary microphone array, then detects signal effective frame signals, and calculates and fuses quadratic correlation generalized spectrum subtraction correction phase transformation functions on the screened effective frame signals. In order to further improve the time delay precision, the average generalized spectrum fused with quadratic correlation is adopted to subtract the modified phase transformation function to calculate the time delay value. And finally, estimating the sound source direction according to the geometric position of the microphone array and the calculated time delay value, thereby improving the precision of sound source positioning.

Description

Sound source positioning method and system

Technical Field

The present invention relates to the field of sound source localization, and in particular, to a sound source localization method and system.

Background

Sound source localization has become a research hotspot in the field of voice signal processing, and has wide application in the fields of video conferences, intelligent robots, intelligent video monitoring systems and the like. In the traditional positioning algorithm, the positioning accuracy is sharply reduced in a severe environment with low signal-to-noise ratio and high reverberation time.

Disclosure of Invention

The invention aims to provide a sound source positioning method and a sound source positioning system so as to improve the accuracy of sound source positioning.

In order to achieve the purpose, the invention provides the following scheme:

the invention provides a sound source positioning method, which comprises the following steps:

acquiring four sound source voice signals by adopting a quaternary microphone array; the quaternary microphone array comprises four microphones, and each microphone collects one path of sound source voice signals;

performing synchronous framing processing on four paths of sound source voice signals to obtain a frame signal set, wherein each frame signal in the signal frame set comprises four paths of frame signals which are a first path of frame signal, a second path of frame signal, a third path of frame signal and a fourth path of frame signal respectively;

judging the validity of each frame signal in the frame signal set to obtain a valid frame signal subset;

acquiring an average generalized spectrum subtraction correction phase transformation function fused with secondary correlation of any two paths of effective frame signals according to the effective frame signal subset;

acquiring a time point corresponding to a maximum peak value of a modified phase transformation function fused with a quadratic correlation average generalized spectrum in any two paths of effective frame signals to obtain a time delay value of any two paths of microphone sound source signals;

and determining the direction position of the sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals.

Optionally, the synchronous framing processing is performed on the four sound source voice signals to obtain a frame signal set, which specifically includes:

using window functions

Carrying out synchronous windowing and framing processing on the four sound source voice signals to obtain a frame signal x_ij(N), N denotes the nth sampling point, N is 1,2_ij(n) a j-th signal indicating an ith frame signal, where j is 1,2,3, 4;

all the frame signals are combined into a set of frame signals.

Optionally, the determining the validity of each frame signal in the frame signal set to obtain a valid frame signal subset specifically includes:

using formulas

Calculating short-time frame energy of a jth path of frame signals of the ith frame signal; wherein E is_ijShort-time frame energy of a jth frame signal of an ith frame signal is represented, N represents an nth sampling point, and N is 1, 2.

Judging whether the short-time frame energy of the jth path of frame signal of the ith frame signal is greater than a first preset threshold value or not to obtain a first judgment result;

if it is the firstIf the judgment result shows that the short-time frame energy is not greater than the first preset threshold, increasing the value of i by 1, and returning to the step of utilizing the formula

Calculating short-time frame energy of a jth frame signal of the ith frame signal;

if the first judgment result shows that the short-time array energy is larger than the first preset threshold, setting the ith frame signal as a starting point, and increasing the value of i by 1;

using formulas

Calculating the zero crossing rate of the jth frame signal of the ith frame signal; wherein the content of the first and second substances,

judging whether the zero crossing rate is greater than a second preset threshold value or not to obtain a second judgment result;

if the second judgment result shows that the zero crossing rate is greater than the second preset threshold, marking T of the jth path of frame signal of the ith frame signal_ijIs set to 1;

if the obtained judgment result shows that the zero crossing rate is not greater than the second preset threshold value, marking T of the jth path of frame signal of the ith frame signal_ijSet to 0;

using the formula SS (i) ═ T_i1&&T_i2&&T_i3&&T_i4Calculating the total state value SS (i) of the marks of the four paths of frame signals of the ith frame signal; wherein, T_i1、T_i2、T_i3And T_i4Marks respectively representing the 1 st path, the 2 nd path, the 3 rd path and the 4 th path of the ith frame signal;

judging whether the total state value SS (i) is equal to 1 or not to obtain a third judgment result;

if the third judgment result indicates that SS (i) is equal to 1, setting the ith signal frame as an effective signal frame;

judging whether the short-time frame energy of the jth path of frame signal of the ith frame signal is smaller than a third preset threshold value or not to obtain a fourth judgment result;

if the fourth judgment result shows that the short-time frame energy of the jth frame signal of the ith frame signal is smaller than the third preset threshold, setting the ith signal frame as the termination point of the voice signal to obtain an effective frame signal subset;

if the fourth judgment result shows that the short-time frame energy of the jth frame signal of the ith frame signal is not less than the third preset threshold, increasing the value of i by 1, and returning to the step of using the formula

And calculating the zero crossing rate of the jth frame signal of the ith frame signal.

Optionally, the obtaining, according to the effective frame signal subset, an average generalized spectrum subtraction correction phase transformation function with which any two effective frame signals are fused with quadratic correlation specifically includes:

calculating the quadratic correlation of any two paths of frame signals of each effective frame signal according to the effective frame signal subset;

calculating the power spectrum of each path of frame signal of each effective frame signal according to the effective frame signal subset;

acquiring a noise masking function of each path of frame signal of each effective frame signal according to the power spectrum of each path of frame signal:

wherein z is_pq(ω) noise masking function of the q-th frame signal representing the p-th effective frame signal, X_pq(ω) represents a power spectrum of a q-th frame signal of the p-th effective frame signal, q is 1,2,3,4, N (ω) noise power spectrum, α represents a first coefficient, and β represents a second coefficient;

acquiring generalized spectrum subtraction correction phase transformation function of any two paths of frame signals of each effective frame signal fused with quadratic correlation according to the noise masking function of each path of frame signals of each effective frame signal and the quadratic correlation of any two paths of frame signals:

wherein phi is_{ls_p}(ω) a generalized spectrally-subtracted modified phase transform function in which the l-th frame signal and the s-th frame signal, which represent the p-th valid frame signal, are fused together in a quadratic correlation, where l is 1,2,3,4, s is 1,2,3,4, l is not equal to s,

X_pl(omega) and X_ps(ω) represents a power spectrum of the l-th frame signal and a power spectrum of the s-th frame signal of the p-th effective frame signal, respectively, and ρ represents a third coefficient;

according to the generalized spectrum subtraction correction phase transformation function of the fusion quadratic correlation of any two paths of frame signals of each effective frame signal, obtaining the average generalized spectrum subtraction correction phase transformation function of the fusion quadratic correlation of any two paths of effective frame signals:

wherein the content of the first and second substances,

and the average generalized spectrum subtraction correction phase transformation function which represents that the l-th path of effective frame signal and the s-th path of effective frame signal are fused with secondary correlation is represented, and P represents the number of effective frame signals in the effective frame signal subset.

Optionally, the determining the direction position of the sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals specifically includes:

according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals, a formula is utilized

Calculating azimuth angle theta of sound source to coordinate origin

Calculating azimuth angle and pitch angle of sound source to origin of coordinates

Where c is the sound velocity, d is the distance from the microphone element to the origin of coordinates, τ₁₂Representing the time delay value, tau, of the 1 st and 2 nd microphone source signals₁₃Representing the time delay value, tau, of the 1 st path microphone source signal and the 3 rd path microphone source signal₁₄Which represents the delay values of the 1 st path microphone source signal and the 4 th path microphone source signal.

Optionally, the synchronous framing processing is performed on the four sound source voice signals to obtain a frame signal set, and the method further includes:

carrying out voice enhancement processing on each path of sound source voice signal to obtain a signal subjected to voice enhancement processing;

performing band-pass filtering processing on the signal subjected to the voice enhancement processing to obtain a signal subjected to the band-pass filtering processing;

and denoising the signal subjected to the band-pass filtering by using a wavelet threshold to obtain a preprocessed sound source voice signal.

A sound source localization system, comprising:

the sound source voice signal acquisition module is used for acquiring four paths of sound source voice signals by adopting a quaternary microphone array; the quaternary microphone array comprises four microphones, and each microphone collects one path of sound source voice signals;

a framing module, configured to perform synchronous framing processing on four channels of the sound source voice signals to obtain a frame signal set, where each frame signal in the signal frame set includes four channels of frame signals, namely a first channel of frame signal, a second channel of frame signal, a third channel of frame signal, and a fourth channel of frame signal;

the effective frame signal subset acquisition module is used for judging the effectiveness of each frame signal in the frame signal set to obtain an effective frame signal subset;

the fusion quadratic correlation average generalized spectrum subtraction correction phase transformation function acquisition module is used for acquiring the fusion quadratic correlation average generalized spectrum subtraction correction phase transformation function of any two paths of effective frame signals according to the effective frame signal subsets;

the time delay value calculation module is used for acquiring a time point corresponding to a maximum peak value of a modified phase transformation function fused with a quadratic correlation average generalized spectrum in any two paths of effective frame signals to obtain time delay values of any two paths of microphone sound source signals;

and the direction position determining module is used for determining the direction position of the sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals.

Optionally, the framing module specifically includes:

a framing sub-module for applying a window function

and the synthesis submodule is used for synthesizing all the frame signals into a frame signal set.

Optionally, the valid frame signal subset obtaining module specifically includes:

short time frame energy calculation submodule for utilizing formula

Calculating the short-time frame energy of the jth frame signal of the ith frame signal(ii) a Wherein E is_ijShort-time frame energy of a jth frame signal of an ith frame signal is represented, N represents an nth sampling point, and N is 1, 2.

The first judgment submodule is used for judging whether the short-time frame energy of the jth path of frame signal of the ith frame signal is greater than a first preset threshold value or not to obtain a first judgment result;

a first judgment result processing submodule for increasing the value of i by 1, calling the short-time frame energy calculation submodule and executing the step of utilizing the formula

Calculating short-time frame energy of a jth frame signal of the ith frame signal; if the first judgment result shows that the short-time array energy is larger than the first preset threshold, setting the ith frame signal as a starting point, and increasing the value of i by 1;

a zero-crossing rate calculation submodule for utilizing the formula

the second judgment submodule is used for judging whether the zero crossing rate is greater than a second preset threshold value or not to obtain a second judgment result;

a second judgment result processing submodule, configured to, if the second judgment result indicates that the zero crossing rate is greater than the second preset threshold, mark T of a jth frame signal of an ith frame signal_ijIs set to 1; if the obtained judgment result shows that the zero crossing rate is not greater than the second preset threshold value, marking T of the jth path of frame signal of the ith frame signal_ijSet to 0;

a total state value ss (i) calculation submodule for using the formula ss (i) ═ T_i1&&T_i2&&T_i3&&T_i4Calculating the total state value SS (i) of the marks of the four paths of frame signals of the ith frame signal; wherein, T_i1、T_i2、T_i3And T_i4Marks respectively representing the 1 st path, the 2 nd path, the 3 rd path and the 4 th path of the ith frame signal;

a third judging submodule, configured to judge whether the total state value ss (i) is equal to 1, to obtain a third judgment result;

a third result processing sub-module, configured to set an ith signal frame as an effective signal frame if the third determination result indicates that ss (i) is equal to 1;

the fourth judgment submodule is used for judging whether the short-time frame energy of the jth path of frame signal of the ith frame signal is smaller than a third preset threshold value or not to obtain a fourth judgment result;

a fourth judgment result processing submodule, configured to set the ith signal frame as a termination point of the voice signal to obtain an effective frame signal subset if the fourth judgment result indicates that the short-time frame energy of the jth frame signal of the ith frame signal is smaller than the third preset threshold; if the fourth judgment result shows that the short-time frame energy of the jth path frame signal of the ith frame signal is not less than the third preset threshold, increasing the value of i by 1, calling a zero crossing rate calculation submodule, and executing the step of utilizing a formula

Optionally, the module for obtaining the fused quadratic correlation average generalized spectrum subtraction correction phase transformation function specifically includes:

the secondary correlation calculation submodule is used for calculating the secondary correlation of any two paths of frame signals of each effective frame signal according to the effective frame signal subset;

the power spectrum calculation submodule is used for calculating the power spectrum of each path of frame signal of each effective frame signal according to the effective signal subset;

the noise masking function obtaining submodule is used for obtaining the noise masking function of each path of frame signal of each effective frame signal according to the power spectrum of each path of frame signal:

the generalized spectrum subtraction correction phase transformation function fused secondary correlation obtaining sub-module is used for obtaining the generalized spectrum subtraction correction phase transformation function fused secondary correlation of any two paths of frame signals of each effective frame signal according to the noise masking function of each path of frame signals of each effective frame signal and the secondary correlation of any two paths of frame signals:

the quadratic correlation fused average generalized spectrum subtraction correction phase transformation function obtaining sub-module is used for fusing quadratic correlation generalized spectrum subtraction correction phase transformation functions according to any two paths of frame signals of each effective frame signal to obtain quadratic correlation fused average generalized spectrum subtraction correction phase transformation functions of any two paths of effective frame signals:

wherein the content of the first and second substances,

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention discloses a sound source positioning method and a sound source positioning system. The sound source positioning method firstly obtains color sound source voice signal windowing framing for a quaternary microphone array, then detects signal effective frame signals, and calculates and fuses quadratic correlation generalized spectrum subtraction correction phase transformation functions for the screened effective frame signals. In order to further improve the time delay precision, the average generalized spectrum fused with quadratic correlation is adopted to subtract the modified phase transformation function to calculate the time delay value. And finally, estimating the sound source direction according to the geometric position of the microphone array and the calculated time delay value, thereby improving the precision of sound source positioning.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.

Fig. 1 is a flowchart of a sound source positioning method according to the present invention;

FIG. 2 is a diagram of a model of a quad microphone alignment provided by the present invention;

FIG. 3 is a graph of accuracy versus context for delay estimation of different algorithms at each frame under-5 dB noise environment provided by the present invention;

FIG. 4 is a graph comparing accuracy of delay estimation for different algorithms at each frame in a 5dB noise environment according to the present invention;

FIG. 5 is a graph comparing accuracy of delay estimation for different algorithms at each frame under an environment with a reverberation time of 750ms and a noise of 5dB provided by the present invention;

FIG. 6 is a diagram of an acquisition card according to the present invention;

FIG. 7 is a pictorial view of a microphone provided in accordance with the present invention;

fig. 8 is a physical diagram of a quaternary microphone array provided by the present invention;

fig. 9 is a block diagram of a sound source localization system according to the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Example 1

The embodiment 1 of the invention provides a sound source positioning method.

As shown in fig. 1, the sound source localization method includes the steps of:

step 101, acquiring four sound source voice signals by adopting a quaternary microphone array; the quaternary microphone array comprises four microphones, and each microphone collects one path of sound source voice signals; 102, synchronously framing four sound source voice signals to obtain a frame signal set, wherein each frame signal in the signal frame set comprises four frame signals which are a first frame signal, a second frame signal, a third frame signal and a fourth frame signal respectively; 103, judging the validity of each frame signal in the frame signal set to obtain a valid frame signal subset; 104, acquiring an average generalized spectrum subtraction correction phase transformation function fused with quadratic correlation of any two paths of effective frame signals according to the effective frame signal subsets; step 105, acquiring a time point corresponding to a maximum peak value of any two paths of microphone sound source signals by fusing any two paths of effective frame signals with a quadratic correlation average generalized spectrum subtraction correction phase transformation function; and 106, determining the direction position of the sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals.

Example 2

Example 2 of the present invention provides a preferred embodiment of a sound source localization method, but the implementation of the present invention is not limited to the embodiment defined in example 2 of the present invention.

The quaternary microphone array in step 101 is shown in fig. 2, where the coordinate of the quaternary microphone is m₁(d,0,0)，m₂(0,d,0)，m₃(-d,0,0)，m₄(0, -d,0), d being the microphone element to origin distance.

After four sound source voice signals are obtained, carrying out voice enhancement processing on each sound source voice signal to obtain a signal subjected to voice enhancement processing; performing band-pass filtering processing on the signal subjected to the voice enhancement processing to obtain a signal subjected to the band-pass filtering processing; and denoising the signal subjected to the band-pass filtering by using a wavelet threshold to obtain a preprocessed sound source voice signal.

Step 102, performing synchronous framing processing on the four sound source voice signals to obtain a frame signal set, specifically including: using window functions

Carrying out synchronous windowing and framing processing on the four sound source voice signals to obtain a frame signal x_ij(N), N denotes the nth sampling point, N is 1,2_ij(n) a j-th signal indicating an ith frame signal, where j is 1,2,3, 4; all the frame signals are combined into a set of frame signals.

Step 103, judging the validity of each frame signal in the frame signal set to obtain a valid frame signal subset, specifically including: using formulas

Calculating the short time of the jth frame signal of the ith frame signalFrame energy; wherein E is_ijShort-time frame energy of a jth frame signal of an ith frame signal is represented, N represents an nth sampling point, and N is 1, 2. Judging whether the short-time frame energy of the jth path of frame signal of the ith frame signal is greater than a first preset threshold value or not to obtain a first judgment result; if the first judgment result shows that the short-time frame energy is not greater than the first preset threshold, increasing the value of i by 1, and returning to the step of utilizing the formula

Calculating short-time frame energy of a jth frame signal of the ith frame signal; if the first judgment result shows that the short-time array energy is larger than the first preset threshold, setting the ith frame signal as a starting point, and increasing the value of i by 1; using formulas

judging whether the zero crossing rate is greater than a second preset threshold value or not to obtain a second judgment result; if the second judgment result shows that the zero crossing rate is greater than the second preset threshold, marking T of the jth path of frame signal of the ith frame signal_ijIs set to 1; if the obtained judgment result shows that the zero crossing rate is not greater than the second preset threshold value, marking T of the jth path of frame signal of the ith frame signal_ijSet to 0; using the formula SS (i) ═ T_i1&&T_i2&&T_i3&&T_i4Calculating the total state value SS (i) of the marks of the four paths of frame signals of the ith frame signal; wherein, T_i1、T_i2、T_i3And T_i4Marks respectively representing the 1 st path, the 2 nd path, the 3 rd path and the 4 th path of the ith frame signal; judging whether the total state value SS (i) is equal to 1 or not to obtain a third judgment result; if the third judgment result indicates that SS (i) is equal to 1, setting the ith signal frame as an effective signal frame; judging the jth frame of the ith frame signalWhether the short-time frame energy of the signal is smaller than a third preset threshold value or not is judged to obtain a fourth judgment result; if the fourth judgment result shows that the short-time frame energy of the jth frame signal of the ith frame signal is smaller than the third preset threshold, setting the ith signal frame as the termination point of the voice signal to obtain an effective frame signal subset; if the fourth judgment result shows that the short-time frame energy of the jth frame signal of the ith frame signal is not less than the third preset threshold, increasing the value of i by 1, and returning to the step of using the formula

Step 104, obtaining an average generalized spectrum subtraction correction phase transformation function fused with quadratic correlation of any two effective frame signals according to the effective frame signal subset, specifically including: calculating the quadratic correlation of any two paths of frame signals of each effective frame signal according to the effective frame signal subset; calculating the power spectrum of each path of frame signal of each effective frame signal according to the effective frame signal subset; acquiring a noise masking function of each path of frame signal of each effective frame signal according to the power spectrum of each path of frame signal:

wherein z is_pq(ω) noise masking function of the q-th frame signal representing the p-th effective frame signal, X_pq(ω) represents a power spectrum of a q-th frame signal of the p-th effective frame signal, q is 1,2,3,4, N (ω) noise power spectrum, α represents a first coefficient, and β represents a second coefficient; acquiring generalized spectrum subtraction correction phase transformation function of any two paths of frame signals of each effective frame signal fused with quadratic correlation according to the noise masking function of each path of frame signals of each effective frame signal and the quadratic correlation of any two paths of frame signals:

X_pl(omega) and X_ps(ω) represents a power spectrum of the l-th frame signal and a power spectrum of the s-th frame signal of the p-th effective frame signal, respectively, and ρ represents a third coefficient; according to the generalized spectrum subtraction correction phase transformation function of the fusion quadratic correlation of any two paths of frame signals of each effective frame signal, obtaining the average generalized spectrum subtraction correction phase transformation function of the fusion quadratic correlation of any two paths of effective frame signals:

wherein the content of the first and second substances,

Step 105, determining the direction position of the sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals, specifically comprising:

Calculating azimuth angle theta of sound source to coordinate origin

Where c is the sound velocity, d is the distance from the microphone element to the origin of coordinates, τ₁₂Representing the time delay value, tau, of the 1 st and 2 nd microphone source signals₁₃Representing the time delay value, tau, of the 1 st path microphone source signal and the 3 rd path microphone source signal₁₄Which represents the delay values of the 1 st path microphone source signal and the 4 th path microphone source signal. Specifically, the calculation formula x of the spherical coordinates is calculated according to the geometrical position relationship of the array of the four-element microphones (the coordinates of the four-element array microphone are m1(d,0,0), m2(0, d,0), m3(-d,0,0), m4(0, -d,0))²+y²+z²＝r²Formula for calculating distance between two points

And formula of velocity

Solving for azimuth

And a pitch angle

In order to illustrate the effect of the sound source localization method of the present invention, the present invention performs analog simulation comparison under different Signal-to-noise ratios and reverberation environments, and as can be seen from fig. 3 and 4, under a medium noise environment (SNB (Signal-to-noise ratio) ═ 5dB), the accuracy performance of the time delay value estimated by the Phase transformation (PHAT) algorithm is far inferior to the improved Cross Power Phase algorithm (MCPSP) and the Generalized spectral subtraction Modified Cross correlation function (GCC-APHAT) method, while APHAT is significantly better than the MCPSP method; under a strong noise environment (SNB ═ 5dB), the PHAT performance is reduced sharply, and only the MCPSP and the APHAT still maintain good performance. As can be seen from fig. 5, under the environmental conditions where both strong reverberation and strong noise exist (T60-750 ms and SNB-5 dB), the APHAT algorithm has better delay accuracy than the PHAT and MCPSP algorithms. The analysis and comparison in the simultaneous way can verify that the APHAT algorithm has better robustness to noise and reverberation.

In order to further explain the effect of the sound source positioning method, the method is used for building and carrying out real environment experiments, a multi-channel data acquisition card Q801 of Beijing Acoustic science technology Limited (SKC) is used for recording sound source signals, as shown in fig. 6, an array support and a microphone MP40 in a quaternary microphone array are both products of SKC manufacturers, as shown in fig. 7 and 8.

The experiments were all completed in a 7.2m x 6m x 3.2m room with both doors and windows closed. Given that certain background noise and reverberation exist in a room, including reflection of sound of a computer host fan, tables and chairs and other artificial interference, a sound source is a girl speaking voice (I goes to Beijing), and the sound source is a section of voice recorded in an actual environment. The sampling rate of the signal is 8kHz, the frame length of the framing is 256, the frame shift is 128, and a Hamming window is added. The coordinates of the quaternary microphone array are respectively: m is₁(25cm,0,0)，m₂(0,25cm,0)，m₃(-25cm,0,0)，m₄(0, -25cm,0), and the microphone array is arranged at a height of 70cm from the ground. In the comparison of experimental data in an actual environment, a plurality of sound source positions in the following table 1 are respectively selected in an experiment, and 10 groups of data are collected at each sound source position. The invention takes PHAT and MCPSP algorithms as reference, and compares the performance of the APHAT algorithm and other two algorithms through experimental analysis. Table 1 shows the result of comparing the experimental data with the actual sound source localization results of the improved algorithm APHAT, PHAT and MCPSP in the actual environment, the localization error result is shown in table 2, and the root mean square error of localization is shown in table 3: TABLE 1

Serial number	S(x,y,z)	(r,θ,φ)	PHAT	MCPSP	APHAT
						1	(1,1,0.76)	(1.6,45°,61.6°)	(45°,58.6°)	(45°,58.6°)	(45°,57.3°)
2	(2,1,0.76)	(2.36,26.6°,71.2°)	(26.6°,74.6°)	(26.6°,74.6°)	(26.6°,71.9°)
						3	(2,2,076)	(2.93,45°,75°)	(45°,77.4°)	(45°,77.4°)	(45°,74.1°)
4	(-2,1,0.76)	(2.36,-26.6°,71.2°)	(-26.6°,74.6°)	(-26.6°,74.6°)	(-26.6°,71.9°)
						5	(-2,2,0.76)	(2.93,-45°,75°)	(-45°,77.4°)	(-45°,77.4°)	(-45°,74.1°)
6	(1.2,0.6,076)	(1.54,26.6°,60.4°)	(37.9°,79.5°)	(29.1°,62.6°)	(29.1°,61.1°)
						7	(-2.4,2.4,0.76)	(3.48,-45°,77.4°)	(-45°,77.4°)	(-45°,77.4°)	(-45°,74.1°)
8	(1.5,1.2,0.76)	(2.07,38.7°,68.5°)	(41.2°,66.5°)	(37.9°,79.5°)	(41.2°,64.6°)
						9	(1.8,1.2,076)	(2.29,33.7°,70.6°)	(0,234°)	(37.9°,79.5°)	(37.9°,75.7°)
10	(2,1.2,0.76)	(2.45,31°,71.9°)	(-36.9°,59.6°)	(30.1°,90.2°)	(31°,82.4°)
						11	(1.2,0,0.76)	(1.42,0°,57.7°)	(0,59.6°)	(0,59.6°)	(0,58.2°)
12	(0,1.8,0.76)	(1.95,90°,67.1°)	(0,234°)	(-90°,71.6°)	(90°,69.2°)
						13	(1.2,2.4,076	(2.79,63.4°,74.2°)	(63.4°,74.6°)	(63.4°,74.6°)	(63.4°,71.9°)
14	(0,1.2,0.76)	(1.42,90°,57.7°)	(-90°,59.6°)	(-90°,59.6°)	(90°,58.2°)
						15	(-1.2,0,0.76)	(1.42,180°,57.7°)	(180,234°)	(180,59.6°)	(180°,58.7°)
16	(0,-1.2,0.76)	(1.42,-90°,57.7°)	(90°,59.6°)	(90°,59.6°)	(-90°,58.2°)
						17	(-0.6,-1.2,0.76)	(1.54,63.4°,60.4°)	(60.9°,62.6°)	(60.9°,62.6°)	(60.9°，61.1°)
18	(0.6,-1.2,0.76)	(1.54,-63.4°,60.4°)	(-63.4°,74.6°)	(-68.2°,68.3°)	(-68.2°，66.3°)

TABLE 2

TABLE 3

	PHAT	MCPSP	APHAT
				Azimuth angle theta_RMSE	59.4	68.6	1.6
Angle of pitch phi_RMSE	63.2	4.6	2.7

From the comparative analysis of the experimental results in tables 1 and 2, it can be seen that: under the actual environment, the performance of estimating the azimuth angle and the pitch angle by the PHAT algorithm is unstable and has larger error. Estimating the positioning direction opposite errors of the sound source when the sound source is positioned in the X axis and the Y axis of the system coordinate by the MCPSP algorithm azimuth angle; the positioning performance of the APHAT algorithm is stable, the positioning progress is high, and the root mean square error REMS of the azimuth angle of the APHAT algorithm is 1.6 and the root mean square error REMS of the pitch angle is 2.7 as can be seen from the table 3; the angular deviation error of the APHAT direction of the algorithm is basically within an acceptable range, and the accuracy is relatively high. This also verifies the efficient performance of the algorithm proposed herein.

Example 3

Embodiment 3 the present invention provides a sound source localization system.

As shown in fig. 9, the present invention provides a sound source localization system including: a sound source voice signal acquisition module 901, configured to acquire four paths of sound source voice signals by using a quaternary microphone array; the quaternary microphone array comprises four microphones, and each microphone collects one path of sound source voice signals; a framing module 902, configured to perform synchronous framing processing on four channels of the sound source voice signals to obtain a frame signal set, where each frame signal in the signal frame set includes four channels of frame signals, which are a first channel of frame signal, a second channel of frame signal, a third channel of frame signal, and a fourth channel of frame signal; an effective frame signal subset obtaining module 903, configured to determine validity of each frame signal in the frame signal set, so as to obtain an effective frame signal subset; a fused secondary correlation average generalized spectrum subtraction correction phase transformation function obtaining module 904, configured to obtain a fused secondary correlation average generalized spectrum subtraction correction phase transformation function of any two paths of effective frame signals according to the effective frame signal subset; the time delay value calculation module 905 is used for acquiring a time point corresponding to a maximum peak value of the modified phase transformation function fused with the quadratic correlation average generalized spectrum subtraction correction of any two paths of effective frame signals to obtain time delay values of any two paths of microphone sound source signals; and a direction position determining module 906, configured to determine a direction position of the sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals.

Example 4

Example 4 of the present invention provides a preferred implementation of a sound source localization system.

The framing module 902 specifically includes: a framing sub-module for applying a window function

Carrying out synchronous windowing and framing processing on the four sound source voice signals to obtain a frame signal x_ij(N), N denotes the nth sampling point, N is 1,2_ij(n) a j-th signal indicating an ith frame signal, where j is 1,2,3, 4; and the synthesis submodule is used for synthesizing all the frame signals into a frame signal set.

The valid frame signal subset obtaining module 903 specifically includes: short time frame energy calculation submodule for utilizing formula

Calculating short-time frame energy of a jth path of frame signals of the ith frame signal; wherein E is_ijShort-time frame energy of a jth frame signal of an ith frame signal is represented, N represents an nth sampling point, and N is 1, 2. The first judgment submodule is used for judging whether the short-time frame energy of the jth path of frame signal of the ith frame signal is greater than a first preset threshold value or not to obtain a first judgment result; a first judgment result processing submodule for increasing the value of i by 1, calling the short-time frame energy calculation submodule and executing the step of utilizing the formula

Calculating short-time frame energy of a jth frame signal of the ith frame signal; if the first judgment result shows that the short-time array energy is larger than the first preset threshold, setting the ith frame signal as a starting point, and increasing the value of i by 1; a zero-crossing rate calculation submodule for utilizing the formula

Calculating the zero crossing rate of the jth frame signal of the ith frame signal(ii) a Wherein the content of the first and second substances,

the second judgment submodule is used for judging whether the zero crossing rate is greater than a second preset threshold value or not to obtain a second judgment result; a second judgment result processing submodule, configured to, if the second judgment result indicates that the zero crossing rate is greater than the second preset threshold, mark T of a jth frame signal of an ith frame signal_ijIs set to 1; if the obtained judgment result shows that the zero crossing rate is not greater than the second preset threshold value, marking T of the jth path of frame signal of the ith frame signal_ijSet to 0;

a total state value ss (i) calculation submodule for using the formula ss (i) ═ T_i1&&T_i2&&T_i3&&T_i4Calculating the total state value SS (i) of the marks of the four paths of frame signals of the ith frame signal; wherein, T_i1、T_i2、T_i3And T_i4Marks respectively representing the 1 st path, the 2 nd path, the 3 rd path and the 4 th path of the ith frame signal; a third judging submodule, configured to judge whether the total state value ss (i) is equal to 1, to obtain a third judgment result; a third result processing sub-module, configured to set an ith signal frame as an effective signal frame if the third determination result indicates that ss (i) is equal to 1; the fourth judgment submodule is used for judging whether the short-time frame energy of the jth path of frame signal of the ith frame signal is smaller than a third preset threshold value or not to obtain a fourth judgment result; a fourth judgment result processing submodule, configured to set the ith signal frame as a termination point of the voice signal to obtain an effective frame signal subset if the fourth judgment result indicates that the short-time frame energy of the jth frame signal of the ith frame signal is smaller than the third preset threshold; if the fourth judgment result shows that the short-time frame energy of the jth path frame signal of the ith frame signal is not less than the third preset threshold, increasing the value of i by 1, calling a zero crossing rate calculation submodule, and executing the step of utilizing a formula

Calculating the jth frame signal of the ith frame signalZero crossing rate of ".

The module 904 for obtaining the mean generalized spectrum subtraction correction phase transformation function with fused quadratic correlations specifically includes: the secondary correlation calculation submodule is used for calculating the secondary correlation of any two paths of frame signals of each effective frame signal according to the effective frame signal subset; the power spectrum calculation submodule is used for calculating the power spectrum of each path of frame signal of each effective frame signal according to the effective signal subset; the noise masking function obtaining submodule is used for obtaining the noise masking function of each path of frame signal of each effective frame signal according to the power spectrum of each path of frame signal:

wherein z is_pq(ω) noise masking function of the q-th frame signal representing the p-th effective frame signal, X_pq(ω) represents a power spectrum of a q-th frame signal of the p-th effective frame signal, q is 1,2,3,4, N (ω) noise power spectrum, α represents a first coefficient, and β represents a second coefficient; the generalized spectrum subtraction correction phase transformation function fused secondary correlation obtaining sub-module is used for obtaining the generalized spectrum subtraction correction phase transformation function fused secondary correlation of any two paths of frame signals of each effective frame signal according to the noise masking function of each path of frame signals of each effective frame signal and the secondary correlation of any two paths of frame signals:

X_pl(omega) and X_ps(ω) the power spectrum of the l-th frame signal and the power spectrum of the p-th effective frame signal, respectivelyThe power spectrum of the s-channel frame signals, and rho represents a third coefficient; the quadratic correlation fused average generalized spectrum subtraction correction phase transformation function obtaining sub-module is used for fusing quadratic correlation generalized spectrum subtraction correction phase transformation functions according to any two paths of frame signals of each effective frame signal to obtain quadratic correlation fused average generalized spectrum subtraction correction phase transformation functions of any two paths of effective frame signals:

wherein the content of the first and second substances,

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principle and the implementation manner of the present invention are explained by applying specific examples, the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof, the described embodiments are only a part of the embodiments of the present invention, not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.

Claims

1. A sound source localization method, characterized by comprising the steps of:

acquiring sample points corresponding to maximum peak values of any two paths of microphone sound source signals by fusing any two paths of effective frame signals with secondary correlation average generalized spectrum subtraction correction phase transformation functions;

determining the direction position of a sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals;

the obtaining of the average generalized spectrum subtraction correction phase transformation function fused with quadratic correlation of any two effective frame signals according to the effective frame signal subset specifically includes:

according to the effective frame signal subset, combining autocorrelation and cross correlation, and calculating the quadratic correlation of any two paths of frame signals of each effective frame signal;

wherein the content of the first and second substances,

2. The method according to claim 1, wherein the step of synchronously framing the four sound source voice signals to obtain a frame signal set comprises:

using window functions

all the frame signals are combined into a set of frame signals.

3. The method according to claim 1, wherein the determining validity of each frame signal in the frame signal set to obtain a valid frame signal subset comprises:

using formulas

if the first judgment result shows that the short-time frame energy is not greater than the first preset threshold, increasing the value of i by 1, and returning to the step of utilizing the formula

using formulas

if the second judgment result shows that the zero crossing rate is not greater than the second preset threshold, marking T of the jth path of frame signal of the ith frame signal_ijSet to 0;

using the formula SS (i) ═ T_i1&&T_i2&&T_i3&&T_i4Calculating the total state value SS (i) of the marks of the four paths of frame signals of the ith frame signal; wherein, T_i1、T_i2、T_i3And T_i4Individual watchMarks showing the 1 st, 2 nd, 3 rd and 4 th path frame signals of the ith frame signal;

4. The sound source localization method according to claim 1, wherein the determining a directional position of a sound source according to the geometric position of the quaternary microphone array and the time delay values of any two microphone sound source signals specifically comprises:

Calculating azimuth angle theta of sound source to coordinate origin

5. The sound source localization method according to claim 1, wherein the synchronous framing of the four sound source voice signals to obtain a frame signal set further comprises:

6. A sound source localization system, comprising:

the direction position determining module is used for determining the direction position of a sound source according to the geometric position of the quaternary microphone array and the time delay values of any two paths of microphone sound source signals;

the module for obtaining the average generalized spectrum subtraction correction phase transformation function fused with the quadratic correlation specifically comprises:

the secondary correlation calculation submodule is used for combining the autocorrelation and the cross correlation according to the effective frame signal subset and calculating the secondary correlation of any two paths of frame signals of each effective frame signal;

wherein the content of the first and second substances,

7. The sound source positioning system according to claim 6, wherein the framing module specifically includes:

a framing sub-module for applying a window function

8. The sound source localization system according to claim 6, wherein the valid frame signal subset acquisition module specifically comprises:

short time frame energy calculation submodule for utilizing formula

Calculating short-time frame energy of a jth frame signal of the ith frame signal;if the first judgment result shows that the short-time array energy is larger than the first preset threshold, setting the ith frame signal as a starting point, and increasing the value of i by 1;

a zero-crossing rate calculation submodule for utilizing the formula

a second judgment result processing submodule, configured to, if the second judgment result indicates that the zero crossing rate is greater than the second preset threshold, mark T of a jth frame signal of an ith frame signal_ijIs set to 1; if the second judgment result shows that the zero crossing rate is not greater than the second preset threshold, marking T of the jth path of frame signal of the ith frame signal_ijSet to 0;