CN113053406B - Voice signal identification method and device - Google Patents

Voice signal identification method and device

Info

Publication number
CN113053406B
Authority
CN
China
Prior art keywords
sound source
signal
data
matrix
observed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110502126.9A
Other languages
Chinese (zh)
Other versions
CN113053406A
Inventor
侯海宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110502126.9A
Publication of CN113053406A
Application granted
Publication of CN113053406B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 - Processing in the time domain
    • G10L21/0232 - Processing in the frequency domain
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to a voice signal recognition method and device in the field of intelligent voice interaction, and addresses the problem of low accuracy in recognizing a sound source signal in scenes with strong interference and a low signal-to-noise ratio. The method comprises the following steps: acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively; performing first-stage noise reduction processing on the original observation data to obtain observed signal estimation data; obtaining positioning information and observed signal data of each sound source according to the observed signal estimation data; performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam-enhanced output signal; and obtaining a time-domain sound source signal with enhanced signal-to-noise ratio for each sound source according to the beam-enhanced output signal. The technical scheme provided by the disclosure is suitable for voice interaction equipment and realizes high-quality, low-noise voice signal recognition.

Description

Voice signal identification method and device
Technical Field
The disclosure relates to intelligent voice interaction technology, and in particular relates to a voice signal recognition method and device.
Background
In the Internet of Things and AI era, intelligent voice, as one of the core artificial intelligence technologies, enriches the modes of human-computer interaction and greatly improves the convenience of intelligent products.
Intelligent products pick up sound with a microphone array formed by a plurality of microphones, and apply microphone beamforming or blind source separation technology to suppress environmental interference, thereby improving the quality of voice signal processing and the voice recognition rate in real environments.
In addition, to endow devices with stronger intelligence and perceptibility, intelligent devices are generally provided with an indicator light. When a user interacts with the device, the indicator light should accurately point toward the user rather than toward an interfering source, so that the user feels as if in a face-to-face conversation with the device, enhancing the interaction experience. Based on this, in an environment where interfering sound sources exist, it is important to accurately estimate the direction of the user (i.e., the sound source).
Sound source direction-finding algorithms generally use the data collected by the microphones directly and perform direction estimation with algorithms such as steered response power with phase transform (Steered Response Power - Phase Transform, SRP-PHAT). However, such algorithms depend on the signal-to-noise ratio of the signal: under a low signal-to-noise ratio, the accuracy is not high enough, the estimated direction easily points toward interfering sound sources, and the effective sound source cannot be accurately located.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a method and apparatus for recognizing a sound signal.
According to a first aspect of an embodiment of the present disclosure, there is provided a sound signal recognition method, including:
Acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
Performing first-stage noise reduction processing on the original observed data to obtain observed signal estimated data;
Obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
Performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal;
And obtaining the time domain sound source signals with enhanced signal to noise ratio of each sound source according to the beam enhanced output signals.
Further, the step of performing a first-stage noise reduction process on the original observed data to obtain observed signal estimated data includes:
Initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and columns of the separation matrix are the number of the sound sources;
solving time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
according to the separation matrix of the previous frame and the observation signal matrix, solving the prior frequency domain estimation of each sound source of the current frame;
updating the weighted covariance matrix according to the prior frequency domain estimation;
Updating the separation matrix according to the updated weighted covariance matrix;
deblurring the updated separation matrix;
and separating the original observed data according to the deblurred separation matrix, and taking the posterior domain estimated data obtained by separation as the observed signal estimated data.
Further, the step of obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises the following steps:
And separating the observation signal matrix according to the separation matrix of the previous frame to obtain prior frequency domain estimation of each sound source of the current frame.
Further, the step of updating the weighted covariance matrix based on the prior frequency-domain estimate comprises:
And updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
Further, the step of updating the separation matrix according to the updated weighted covariance matrix includes:
Respectively updating the separation matrix of each sound source according to the weighted covariance matrix of each sound source;
and updating the separation matrix into a conjugate transpose matrix of the combined separation matrix of each sound source.
Further, the step of deblurring the updated separation matrix includes:
and performing amplitude deblurring on the separation matrix by adopting a minimum distortion criterion.
Further, the step of obtaining the positioning information of each sound source and the observed signal data according to the observed signal estimation data includes:
Obtaining the observed signal data of each sound source at each acquisition point according to the observed signal estimation data;
and respectively estimating the azimuth of each sound source according to the observed signal data of each sound source at each acquisition point to obtain the positioning information of each sound source.
Further, according to the observed signal data of each sound source at each acquisition point, estimating the azimuth of each sound source, and obtaining the positioning information of each sound source includes:
the following estimation is carried out on each sound source to obtain the azimuth of each sound source:
And using the observed signal data of the same sound source at different acquisition points to form the observed data of the acquisition points, and positioning the sound sources through a direction finding algorithm to obtain positioning information of each sound source.
Further, the positioning information of the sound source includes azimuth coordinates of the sound source, and the step of performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam-enhanced output signal includes:
according to the azimuth coordinates of each sound source and the azimuth coordinates of each acquisition point, respectively calculating the propagation delay difference value of each sound source, wherein the propagation delay difference value is the time difference value of sound transmitted by the sound source to each acquisition point;
and performing second-stage noise reduction on each sound source through delay-and-sum beamforming processing using the observed signal data of each sound source, to obtain the beam-enhanced output signal of each sound source.
Further, the step of obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal comprises:
And performing the inverse short-time Fourier transform on the beam-enhanced output signal of each sound source, followed by overlap-add, to obtain the time-domain sound source signal with enhanced signal-to-noise ratio of each sound source.
According to a second aspect of embodiments of the present disclosure, there is provided a sound signal recognition apparatus including:
The original data acquisition module is used for acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
the first noise reduction module is used for carrying out first-stage noise reduction processing on the original observed data to obtain observed signal estimated data;
the positioning module is used for obtaining positioning information of each sound source and observation signal data according to the observation signal estimation data;
The second noise reduction module is used for carrying out second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal;
And the enhanced signal output module is used for obtaining the time domain sound source signals with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signals.
Further, the first noise reduction module includes:
the matrix initialization sub-module is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and the number of columns of the separation matrix are the number of the sound sources;
The frequency domain data acquisition sub-module is used for solving time domain signals at each acquisition point and constructing an observation signal matrix according to the frequency domain signals corresponding to the time domain signals;
The priori frequency domain estimation sub-module is used for solving the priori frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix updating sub-module for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating sub-module, configured to update the separation matrix according to the updated weighted covariance matrix;
A deblurring sub-module configured to deblur the updated separation matrix;
and the posterior domain estimation sub-module is used for separating the original observation data according to the deblurred separation matrix, and taking the posterior domain estimation data obtained by separation as the observation signal estimation data.
Further, the priori frequency domain estimation sub-module is configured to separate the observation signal matrix according to a separation matrix of a previous frame, so as to obtain a priori frequency domain estimation of each sound source of the current frame.
Further, the covariance matrix updating sub-module is configured to update the weighted covariance matrix according to the observation signal matrix and a conjugate transpose matrix of the observation signal matrix.
Further, the split matrix updating submodule includes:
the first updating sub-module is used for respectively updating the separation matrixes of the sound sources according to the weighted covariance matrixes of the sound sources;
And the second updating sub-module is used for updating the separation matrix into a conjugate transpose matrix after the separation matrix of each sound source is combined.
Further, the deblurring submodule is used for carrying out amplitude deblurring processing on the separation matrix by adopting a minimum distortion criterion.
Further, the positioning module includes:
the sound source data estimation sub-module is used for obtaining the observed signal data of each sound source at each acquisition point according to the observed signal estimation data;
And the first positioning sub-module is used for respectively estimating the azimuth of each sound source according to the observed signal data of each sound source at each acquisition point to obtain the positioning information of each sound source.
Further, the first positioning sub-module is configured to perform the following estimation on each sound source, and obtain the azimuth of each sound source:
And using the observed signal data of the same sound source at different acquisition points to form the observed data of the acquisition points, and positioning the sound sources through a direction finding algorithm to obtain positioning information of each sound source.
Further, the positioning information of the sound source includes azimuth coordinates of the sound source, and the second noise reduction module includes:
The time delay calculation sub-module is used for respectively calculating the propagation delay difference value of each sound source according to the azimuth coordinates of each sound source and the azimuth coordinates of each acquisition point, wherein the propagation delay difference value is the time difference value of the sound sent by the sound source to each acquisition point;
A beam summation sub-module, configured to perform second-stage noise reduction on each sound source through delay-and-sum beamforming processing using the observed signal data of each sound source, to obtain the beam-enhanced output signal of each sound source.
Further, the enhanced signal output module is configured to perform short-time inverse fourier transform and overlap-add on the beam enhanced output signals of each sound source, so as to obtain time domain sound source signals with enhanced signal-to-noise ratio of each sound source.
According to a third aspect of embodiments of the present disclosure, there is provided a computer apparatus comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to:
Acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
Performing first-stage noise reduction processing on the original observed data to obtain observed signal estimated data;
Obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
Performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal;
And obtaining the time domain sound source signals with enhanced signal to noise ratio of each sound source according to the beam enhanced output signals.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon instructions which, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a sound signal recognition method, the method comprising:
Acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
Performing first-stage noise reduction processing on the original observed data to obtain observed signal estimated data;
Obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
Performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal;
And obtaining the time domain sound source signals with enhanced signal to noise ratio of each sound source according to the beam enhanced output signals.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: original observation data acquired by at least two acquisition points for at least two sound sources are obtained; first-stage noise reduction processing is performed on the original observation data to obtain observed signal estimation data; positioning information and observed signal data of each sound source are then obtained according to the observed signal estimation data; second-stage noise reduction processing is performed on the observed signal data according to the positioning information to obtain beam-enhanced output signals; and time-domain sound source signals with enhanced signal-to-noise ratio are obtained for each sound source according to the beam-enhanced output signals. After noise reduction and sound source localization are applied to the original observation data, beam enhancement further improves the signal-to-noise ratio to highlight the signals. This solves the problems of low sound source localization accuracy and poor voice recognition quality in scenes with strong interference and a low signal-to-noise ratio, and realizes efficient, highly interference-resistant voice signal recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart illustrating a voice signal recognition method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating yet another voice signal recognition method according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating yet another voice signal recognition method according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating yet another voice signal recognition method according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating yet another voice signal recognition method according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a voice signal recognition apparatus according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a structure of a first noise reduction module 602 according to an exemplary embodiment.
Fig. 8 is a schematic diagram showing the structure of a split matrix updating sub-module 705 according to an exemplary embodiment.
Fig. 9 is a schematic structural diagram of a first noise reduction module 602, according to an example embodiment.
Fig. 10 is a schematic diagram illustrating a structure of a second noise reduction module 604 according to an exemplary embodiment.
Fig. 11 is a block diagram of an apparatus (general structure of a mobile terminal) according to an exemplary embodiment.
Fig. 12 is a block diagram (general structure of a server) of an apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Sound source direction-finding algorithms generally use the data collected by the microphones directly and perform direction estimation with a microphone-array sound source localization algorithm such as SRP-PHAT. However, such algorithms depend on the signal-to-noise ratio of the signal: under a low signal-to-noise ratio, the accuracy is not high enough, the estimated direction easily points toward interfering sound sources, and the effective sound source cannot be accurately located.
To solve the above problems, embodiments of the present disclosure provide a sound signal recognition method and apparatus. The collected data is first subjected to noise reduction and then to direction finding and localization; the signal-to-noise ratio is further improved by applying a second stage of noise reduction to the localized signals, after which the final time-domain sound source signal is obtained. This eliminates the influence of interfering sound sources, solves the problem of low sound source localization accuracy in scenes with strong interference and a low signal-to-noise ratio, and realizes efficient sound source localization with strong anti-interference capability.
An exemplary embodiment of the present disclosure provides a sound signal recognition method, with which a flow for completing sound source localization is shown in fig. 1, including:
step 101, acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively.
In this embodiment, the collection point may be a microphone. For example, it may be a plurality of microphones disposed on the same device, the plurality of microphones constituting a microphone array.
In this step, data is collected at each collection point, and the collected data sources may be a plurality of sound sources. The plurality of sound sources may include a target effective sound source and may also include an interfering sound source.
The acquisition point acquires the original observation data of at least two sound sources.
Step 102, performing first-stage noise reduction processing on the original observed data to obtain observed signal estimated data.
In the step, the first-stage noise reduction processing is performed on the acquired original observation data so as to eliminate noise influence generated by an interference sound source and the like.
The original observation data may be separated and processed with a minimum distortion criterion (minimal distortion principle, MDP) to recover an estimate of the observation data of each sound source at each acquisition point.
After the noise reduction processing, the observation signal estimation data subjected to noise reduction is obtained.
And 103, according to the observation signal estimation data, positioning information and observation signal data of each sound source are obtained.
In the step, after the observation signal estimation data which is close to the real sound source data and eliminates the influence of noise is obtained, the observation signal data of each sound source at each acquisition point can be obtained.
And further, the sound source is positioned, and positioning information of each sound source is obtained, wherein the positioning information can comprise the azimuth of the sound source, for example, the azimuth can be a three-dimensional coordinate value in a three-dimensional coordinate system. The SRP-PHAT algorithm can be used for estimating the azimuth of each sound source according to the observation signal estimation data of each sound source, so that the positioning of each sound source is completed.
And 104, performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal.
In this step, residual noise interference remains in the observed signal data obtained in step 103. To further improve the signal quality, a delay-and-sum beamforming technique is used to perform the second-stage noise reduction processing. This enhances the sound source signal and suppresses signals from other directions (signals that may interfere with the sound source signal), thereby further improving the signal-to-noise ratio of the sound source signal, so that subsequent sound source localization and recognition on this basis yields more accurate results.
And 105, obtaining the time domain sound source signals with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signals.
In this step, the beam-enhanced output signal is converted by the inverse short-time Fourier transform (ISTFT) and overlap-add into the time-domain sound source signal with enhanced signal-to-noise ratio after separation and beam processing. Compared with the observed signal data, this time-domain sound source signal contains less noise and reflects the sound emitted by the sound source more truly and accurately, realizing accurate and efficient sound signal recognition.
An exemplary embodiment of the present disclosure further provides a method for identifying a sound signal, which performs noise reduction processing on original observation data based on blind source separation to obtain observation signal estimation data, where a specific flow is shown in fig. 2, and includes:
step 201, initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point.
In this step, the number of rows and columns of the separation matrix is the number of sound sources, and the weighted covariance matrix is initialized as a zero matrix.
In the present embodiment, a scene with two microphones as the acquisition points is taken as an example. As shown in fig. 3, the smart speaker A has two microphones, mic1 and mic2, and there are two sound sources, s1 and s2, in the space around the smart speaker A. The signals from both sources are picked up by both microphones, so the signals of the two sound sources are mixed together in each microphone. The following coordinate system is established:
The microphone coordinates of smart speaker A are denoted (x_i^mic, y_i^mic, z_i^mic), where x, y, and z are the three axes of a three-dimensional coordinate system: x_i^mic is the x-axis coordinate of the i-th microphone, y_i^mic its y-axis coordinate, and z_i^mic its z-axis coordinate, with i = 1, ..., M. In this example, M = 2.
Let x_i(m, τ) denote the m-th sample of the τ-th frame of the time-domain signal at the i-th microphone, i = 1, 2; m = 1, ..., Nfft, where Nfft is the frame length of each sub-frame in the sound pickup system of smart speaker A. After windowing a frame of Nfft samples, the corresponding frequency-domain signal X_i(k, τ) is obtained by the Fourier transform (FFT).
For the convolution blind separation problem, the frequency domain model is:
X(k,τ)=H(k,τ)s(k,τ)
Y(k,τ)=W(k,τ)X(k,τ)
wherein X(k, τ) = [X_1(k, τ), X_2(k, τ), ..., X_M(k, τ)]^T is the microphone observation data,
s(k, τ) = [s_1(k, τ), s_2(k, τ), ..., s_M(k, τ)]^T is the sound source signal vector,
Y(k, τ) = [Y_1(k, τ), Y_2(k, τ), ..., Y_M(k, τ)]^T is the separated signal vector, H(k, τ) is the mixing matrix of dimension M×M, W(k, τ) is the separation matrix of dimension M×M, k is the frequency bin index, τ is the frame index, and (·)^T denotes the vector (or matrix) transpose. s_i(k, τ) is the frequency-domain data of sound source i.
The separation matrix is expressed as:
W(k, τ) = [w_1(k, τ), w_2(k, τ), ..., w_N(k, τ)]^H
wherein (·)^H denotes the conjugate transpose of a vector (or matrix).
Specifically, for the scene shown in fig. 3:
The mixing matrix is defined as:
H(k, τ) = [h_11(k, τ), h_12(k, τ); h_21(k, τ), h_22(k, τ)]
where h_ij is the transfer function from sound source i to mic j.
The separation matrix is defined as:
W(k, τ) = [w_11(k, τ), w_12(k, τ); w_21(k, τ), w_22(k, τ)]
Let the frame length of each sub-frame in the sound system be Nfft, so that K = Nfft/2 + 1.
In this step, the separation matrix of each frequency point is initialized according to expression (1):
W(k, 0) = I (1)
i.e., the separation matrix is initialized as the identity matrix, k = 1, ..., K, where k denotes the k-th frequency bin.
The weighted covariance matrix V_i(k, τ) of each sound source at each frequency point is initialized as a zero matrix according to expression (2):
V_i(k, 0) = 0 (2)
wherein k = 1, ..., K denotes the k-th frequency bin, and i = 1, 2.
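As a minimal sketch of this initialization (expressions (1) and (2)), assuming a Python/NumPy implementation with M = 2 sources/microphones and an illustrative frame length; the variable names are not from the patent:

```python
import numpy as np

M = 2                 # number of microphones / sound sources (scene of fig. 3)
NFFT = 512            # frame length of each sub-frame (illustrative value)
K = NFFT // 2 + 1     # number of frequency bins, K = Nfft/2 + 1

# Expression (1): one M x M separation matrix per frequency bin, initialized to identity.
W = np.tile(np.eye(M, dtype=complex), (K, 1, 1))       # shape (K, M, M)

# Expression (2): one M x M weighted covariance matrix per source and per bin, all zeros.
V = np.zeros((M, K, M, M), dtype=complex)              # V[i, k] for source i, bin k
```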
Step 202, obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals.
In this step, let x_i(m, τ) denote the m-th sample of the τ-th frame of the time-domain signal at the i-th microphone, i = 1, 2; m = 1, ..., Nfft. According to expression (3), the corresponding frequency-domain signal X_i(k, τ) is obtained by windowing the frame and performing an Nfft-point FFT.
The observation signal matrix is:
X(k, τ) = [X_1(k, τ), X_2(k, τ)]^T
where k = 1, ..., K.
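A sketch of how the observation signal matrix X(k, τ) of one frame could be formed; the Hann analysis window and the use of NumPy's real FFT are assumptions, since expression (3) is not reproduced above:

```python
import numpy as np

NFFT = 512
K = NFFT // 2 + 1

def frame_to_freq(frames_time):
    """frames_time: array (M, NFFT), one time-domain frame per microphone.
    Returns X of shape (K, M): X[k] = [X_1(k, tau), X_2(k, tau)]^T for this frame."""
    window = np.hanning(NFFT)                                   # assumed analysis window
    spectra = np.fft.rfft(frames_time * window, n=NFFT, axis=1) # (M, K)
    return spectra.T                                            # (K, M), one observation vector per bin

# usage with a random two-microphone frame
X = frame_to_freq(np.random.randn(2, NFFT))
```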
And 203, according to the separation matrix of the previous frame and the observation signal matrix, solving the prior frequency domain estimation of each sound source of the current frame.
In the step, firstly, the observed signal matrix is separated according to the separation matrix of the previous frame, and the prior frequency domain estimation of each sound source of the current frame is obtained. For the application scenario shown in fig. 3, the prior frequency domain estimate Y (k, τ) of the two sound source signals in the current frame is found using W (k) of the previous frame.
For example, let Y(k, τ) = [Y_1(k, τ), Y_2(k, τ)]^T, k = 1, ..., K, where Y_1(k, τ) and Y_2(k, τ) are the estimates of sound sources s1 and s2 at the time-frequency point (k, τ), respectively. According to expression (4), the observation matrix X(k, τ) is separated using the separation matrix W(k, τ):
Y(k, τ) = W(k, τ)X(k, τ), k = 1, ..., K (4)
Then the frequency-domain estimate of the i-th sound source at the τ-th frame is, according to expression (5):
Y_i(τ) = [Y_i(1, τ), ..., Y_i(K, τ)]^T (5)
where i = 1, 2.
And step 204, updating the weighted covariance matrix according to the prior frequency domain estimation.
In this step, the weighted covariance matrix is updated according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
For the application scenario shown in fig. 3, the weighted covariance matrix V_i(k, τ) is updated.
For example, the update of the weighted covariance matrix is performed according to expression (6), which accumulates the weighted outer product of the observation vector with its conjugate transpose.
The contrast function is defined as:
G_R(r_i(τ)) = r_i(τ)
and the weighting coefficient is defined from the contrast function and the auxiliary variable r_i(τ).
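Because the exact forms of expression (6) and the weighting coefficient are not reproduced above, the sketch below assumes the standard online auxiliary-function (AuxIVA-style) choices: r_i(τ) = sqrt(Σ_k |Y_i(k, τ)|²), a weighting coefficient of 1/r_i(τ) for G_R(r) = r, and a recursive smoothing factor α. All of these, and the NumPy variable names, are assumptions rather than the patent's exact formulas:

```python
import numpy as np

def update_weighted_covariance(W_prev, V, X, alpha=0.96):
    """One online update of the weighted covariance matrices V_i(k, tau).

    W_prev : (K, M, M) separation matrices of the previous frame
    V      : (M, K, M, M) weighted covariance matrices, updated in place
    X      : (K, M) observation vectors X(k, tau) of the current frame
    alpha  : smoothing factor (assumed; the patent's expression (6) is not shown)
    Returns the prior frequency-domain estimates Y of shape (K, M), per expression (4).
    """
    K, M, _ = W_prev.shape
    # Expression (4): prior estimate of the current frame with the previous separation matrix.
    Y = np.einsum('kij,kj->ki', W_prev, X)
    # Auxiliary variable r_i(tau) and weighting coefficient for G_R(r) = r (assumed form).
    r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + 1e-12   # (M,)
    phi = 1.0 / r
    # Assumed form of expression (6): recursive weighting of the outer product X X^H.
    XXh = np.einsum('ki,kj->kij', X, X.conj())             # (K, M, M)
    for i in range(M):
        V[i] = alpha * V[i] + (1.0 - alpha) * phi[i] * XXh
    return Y
```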
step 205, updating the separation matrix according to the updated weighted covariance matrix.
In this step, firstly, the separation matrix of each sound source is updated according to the weighted covariance matrix of each sound source, and then the separation matrix is updated into the conjugate transpose matrix after the separation matrix of each sound source is combined. For the application scenario shown in fig. 3, the separation matrix W (k, τ) is updated.
For example, the separation matrix W(k, τ) is updated according to expressions (7), (8), and (9):
w_i(k, τ) = (W(k, τ-1)V_i(k, τ))^{-1} e_i (7)
W(k, τ) = [w_1(k, τ), w_2(k, τ)]^H (9)
where i = 1, 2 and e_i denotes the i-th column of the identity matrix.
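A sketch of the update of expressions (7)-(9); expression (8) is not reproduced above, so the usual auxiliary-function normalization w_i ← w_i / sqrt(w_i^H V_i w_i) is assumed for it:

```python
import numpy as np

def update_separation_matrix(W_prev, V):
    """Update W(k, tau) per expressions (7)-(9).

    W_prev : (K, M, M) separation matrices of the previous frame
    V      : (M, K, M, M) updated weighted covariance matrices
    """
    K, M, _ = W_prev.shape
    W_new = np.empty_like(W_prev)
    for k in range(K):
        w_cols = []
        for i in range(M):
            e_i = np.eye(M)[:, i]
            # Expression (7): w_i(k) = (W(k, tau-1) V_i(k, tau))^-1 e_i
            w_i = np.linalg.solve(W_prev[k] @ V[i, k], e_i)
            # Assumed expression (8): normalization by the weighted quadratic form.
            denom = np.sqrt(np.real(w_i.conj() @ V[i, k] @ w_i)) + 1e-12
            w_cols.append(w_i / denom)
        # Expression (9): stack the column vectors and take the conjugate transpose.
        W_new[k] = np.stack(w_cols, axis=1).conj().T
    return W_new
```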
Step 206, deblurring the updated separation matrix.
In this step, the separation matrix may be subjected to an amplitude deblurring process using MDP. For the application scenario shown in fig. 3, the amplitude deblurring process is performed on W (k, τ) using the MDP algorithm.
For example, MDP amplitude deblurring processing is performed according to expression (10):
W(k,τ)=diag(invW(k,τ))·W(k,τ) (10)
Wherein invW (k, τ) is the inverse of W (k, τ). diag (invW (k, τ)) means that the non-principal diagonal element of invW (k, τ) is set to 0.
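A minimal NumPy sketch of expression (10), assuming W is stored as one M×M matrix per frequency bin (variable names are illustrative):

```python
import numpy as np

def mdp_deblur(W):
    """Amplitude deblurring per expression (10): W(k) <- diag(inv(W(k))) . W(k).

    W : (K, M, M) separation matrices; a rescaled copy is returned.
    """
    W_fixed = np.empty_like(W)
    for k in range(W.shape[0]):
        invW = np.linalg.inv(W[k])
        # diag(invW): keep only the principal diagonal of inv(W), zero elsewhere.
        W_fixed[k] = np.diag(np.diag(invW)) @ W[k]
    return W_fixed
```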
And 207, separating the original observation data according to the deblurred separation matrix, and taking the posterior domain estimation data obtained by separation as the observation signal estimation data.
In this step, for the application scenario shown in fig. 3, the amplitude deblurred W (k, τ) is used to separate the original microphone signal to obtain the posterior frequency domain estimation data Y (k, τ) of the sound source signal, specifically as shown in expression (11):
Y(k,τ)=[Y1(k,τ),Y2(k,τ)]T=W(k,τ)X(k,τ) (11)
after the posterior frequency domain estimation data with low signal-to-noise ratio is obtained, the posterior frequency domain estimation data is used as observation signal estimation data, the observation signal estimation data of each sound source at each acquisition point is further determined, and a high-quality data basis is provided for the direction finding of each sound source.
An exemplary embodiment of the present disclosure further provides a sound signal recognition method, where the method is used to obtain positioning information of each sound source and flow of observation signal data according to the observation signal estimation data, as shown in fig. 4, and the method includes:
And 401, obtaining the observed signal data of each sound source at each acquisition point according to the observed signal estimated data.
In this step, the observed signal data of each sound source at each acquisition point is acquired based on the observed signal estimation data. For the application scenario shown in fig. 3, in this step, the superposition of each sound source at each microphone is estimated to obtain an observation signal, so as to estimate the observation signal data of each sound source at each microphone.
For example, the original observation data is separated using the MDP-processed W(k, τ) to obtain Y(k, τ) = [Y_1(k, τ), Y_2(k, τ)]^T. According to the principle of the MDP algorithm, the recovered Y(k, τ) is exactly the estimate of the observed signal of each sound source at the corresponding microphone, namely:
The estimate of the observed signal data of sound source s1 at mic1 is given by expression (12):
Y_1(k, τ) = h_11 s_1(k, τ), denoted
Y_11(k, τ) = Y_1(k, τ) (12)
The estimate of the observed signal data of sound source s2 at mic2 is given by expression (13):
Y_2(k, τ) = h_22 s_2(k, τ), denoted
Y_22(k, τ) = Y_2(k, τ) (13)
Since the observed signal at each microphone is a superposition of the two sound sources' observed signal data, the estimate of the observed data of sound source s2 at mic1 is given by expression (14):
Y_12(k, τ) = X_1(k, τ) - Y_11(k, τ) (14)
and the estimate of the observed data of sound source s1 at mic2 is given by expression (15):
Y_21(k, τ) = X_2(k, τ) - Y_22(k, τ) (15)
Therefore, based on the MDP algorithm, the observed signal data of each sound source at each microphone is completely recovered, and the original phase information is preserved. On this basis, the azimuth of each sound source can be further estimated from these observed signal data.
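A small sketch of expressions (12)-(15), following the text's convention that the first subscript is the microphone and the second the sound source; the NumPy layout matches the earlier sketches and is an assumption:

```python
import numpy as np

def recover_per_mic_observations(Y, X):
    """Expressions (12)-(15): observed signal of each source at each microphone.

    Y : (K, 2) posterior frequency-domain estimates [Y_1(k,tau), Y_2(k,tau)]
    X : (K, 2) original observations [X_1(k,tau), X_2(k,tau)]
    Returns Y11, Y12, Y21, Y22 (microphone index first, sound source index second).
    """
    Y11 = Y[:, 0]            # (12) source s1 as observed at mic1
    Y22 = Y[:, 1]            # (13) source s2 as observed at mic2
    Y12 = X[:, 0] - Y11      # (14) source s2 at mic1: remainder of mic1's observation
    Y21 = X[:, 1] - Y22      # (15) source s1 at mic2: remainder of mic2's observation
    return Y11, Y12, Y21, Y22
```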
And step 402, estimating the azimuth of each sound source according to the observed signal data of each sound source at each acquisition point, and obtaining the positioning information of each sound source.
In this step, the following estimation is performed for each sound source to obtain its azimuth:
the observed signal data of the same sound source at the different acquisition points are used as the observation data of those acquisition points, and the sound source is located by a direction-finding algorithm to obtain its positioning information.
For the application scenario shown in fig. 3, the azimuth of each sound source is estimated with the SRP-PHAT algorithm, using the observed signal data of each sound source at each microphone.
The SRP-PHAT algorithm principle is as follows:
Traverse the microphone pairs:
For i = 1 : M-1
  For j = i+1 : M
    compute the phase-transform-weighted cross-power spectrum of microphones i and j
  End
End
where X_i(τ) = [X_i(1, τ), ..., X_i(K, τ)]^T is the frequency-domain data of the τ-th frame of the i-th microphone, K = Nfft. X_j(τ) is defined similarly, and .* denotes element-wise multiplication of the corresponding entries of two vectors.
The coordinates of any point s on the unit sphere are (s_x, s_y, s_z), satisfying s_x^2 + s_y^2 + s_z^2 = 1. The delay difference between the point s and any two microphones is calculated:
For i = 1 : M-1
  For j = i+1 : M
    compute the delay difference of point s with respect to microphones i and j
  End
End
where fs is the system sampling rate and c is the speed of sound.
From these delay differences, the corresponding steered response power (Steered Response Power, SRP) is computed by accumulating the phase-transformed cross-spectra of all microphone pairs evaluated at those delays.
All points s on the unit sphere are traversed, and the point with the maximum SRP value is taken as the estimated sound source direction.
Taking the scenario of fig. 3 as an example, in this step Y_11(k, τ) and Y_21(k, τ) may be substituted for X(k, τ) = [X_1(k, τ), X_2(k, τ)]^T and fed to the SRP-PHAT algorithm to estimate the azimuth of sound source s1; similarly, Y_22(k, τ) and Y_12(k, τ) are substituted for X(k, τ) = [X_1(k, τ), X_2(k, τ)]^T to estimate the azimuth of sound source s2.
Because the signal-to-noise ratios of Y_11(k, τ) and Y_21(k, τ), and of Y_22(k, τ) and Y_12(k, τ), have been greatly improved by the separation, the azimuth estimation is more stable and accurate.
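A sketch of this direction estimation for one source from its two noise-reduced per-microphone spectra (e.g. Y_11 and Y_21 for s1). The candidate grid, the delay model, and the exact SRP expression below are assumptions, since the corresponding formulas are not reproduced above:

```python
import numpy as np

def srp_phat_direction(Y_mic1, Y_mic2, mic_pos, fs=16000.0, c=343.0, nfft=512, n_grid=2000):
    """SRP-PHAT direction estimate for one source from its two per-microphone spectra.

    Y_mic1, Y_mic2 : (K,) frequency-domain data of the same source at mic1 / mic2
    mic_pos        : (2, 3) microphone coordinates
    Returns the unit-sphere point (s_x, s_y, s_z) with the largest SRP value.
    """
    K = Y_mic1.shape[0]
    freqs = np.arange(K) * fs / nfft                      # bin center frequencies

    # PHAT-weighted cross-power spectrum of the microphone pair.
    cross = Y_mic1 * np.conj(Y_mic2)
    cross /= np.abs(cross) + 1e-12

    # Quasi-uniform grid of candidate points on the unit sphere (Fibonacci sphere).
    idx = np.arange(n_grid) + 0.5
    incl = np.arccos(1.0 - 2.0 * idx / n_grid)
    azim = np.pi * (1.0 + 5 ** 0.5) * idx
    grid = np.stack([np.sin(incl) * np.cos(azim),
                     np.sin(incl) * np.sin(azim),
                     np.cos(incl)], axis=1)               # (n_grid, 3)

    # Delay difference (seconds) from each candidate point to the two microphones.
    d1 = np.linalg.norm(grid - mic_pos[0], axis=1)
    d2 = np.linalg.norm(grid - mic_pos[1], axis=1)
    tau = (d1 - d2) / c                                   # (n_grid,)

    # Steered response power: coherent sum of the phase-aligned cross spectrum.
    steering = np.exp(2j * np.pi * np.outer(tau, freqs))  # (n_grid, K)
    srp = np.real(steering @ cross)
    return grid[np.argmax(srp)]

# illustrative usage with random spectra and an assumed 6 cm microphone spacing
mics = np.array([[-0.03, 0.0, 0.0], [0.03, 0.0, 0.0]])
Y11 = np.random.randn(257) + 1j * np.random.randn(257)
Y21 = np.random.randn(257) + 1j * np.random.randn(257)
print(srp_phat_direction(Y11, Y21, mics, nfft=512))
```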
An exemplary embodiment of the present disclosure further provides a method for identifying a sound signal, where the method is used to perform a second stage of noise reduction processing on the positioning information, and a flow of obtaining a beam enhanced output signal is shown in fig. 5, and includes:
Step 501, calculating propagation delay difference values of the sound sources according to the azimuth coordinates of the sound sources and the azimuth coordinates of the acquisition points.
In this embodiment, the propagation delay difference is a time difference between transmission of sound from the sound source to each acquisition point.
In this step, the positioning information of the sound source contains the azimuth coordinates of the sound source. Still taking the application scenario shown in fig. 3 as an example, in the three-dimensional coordinate system the azimuth of sound source s1 is (x_1^s, y_1^s, z_1^s) and the azimuth of sound source s2 is (x_2^s, y_2^s, z_2^s). The acquisition points may be microphones, the two microphones being located at (x_1^mic, y_1^mic, z_1^mic) and (x_2^mic, y_2^mic, z_2^mic) respectively.
First, the delay difference τ_1 of sound source s1 to the two microphones is calculated according to expressions (16) and (17); then the delay difference τ_2 of sound source s2 to the two microphones is calculated according to expressions (18) and (19).
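Since expressions (16)-(19) are not reproduced above, the following sketch assumes the propagation delay difference is simply the difference of the Euclidean source-to-microphone distances divided by the speed of sound:

```python
import numpy as np

def propagation_delay_difference(src_pos, mic_pos, c=343.0):
    """Delay difference (seconds) of one sound source to the two microphones.

    Assumed form of expressions (16)-(19): difference of the Euclidean
    source-to-microphone distances divided by the speed of sound c.
    src_pos : (3,) azimuth coordinates of the sound source
    mic_pos : (2, 3) azimuth coordinates of the two acquisition points
    """
    d = np.linalg.norm(mic_pos - src_pos, axis=1)     # distance to mic1 and mic2
    return (d[0] - d[1]) / c

# illustrative usage with assumed positions
mics = np.array([[-0.03, 0.0, 0.0], [0.03, 0.0, 0.0]])
tau1 = propagation_delay_difference(np.array([1.0, 0.5, 0.2]), mics)
tau2 = propagation_delay_difference(np.array([-0.8, 1.0, 0.0]), mics)
```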
And 502, performing second-stage noise reduction on each sound source through beam delay summation beamforming processing by using the observed signal data of each sound source, so as to obtain the beam enhanced output signals of each sound source.
In this step, taking the scenario shown in fig. 3 as an example, the second-stage noise reduction is performed on each sound source through delay-and-sum beamforming processing according to expressions (20) and (21), respectively:
Delay-and-sum beamforming of sound source s1 using Y_11(k, τ) and Y_21(k, τ) yields the enhanced output signal YE_1(k, τ) of sound source s1 (expression (20));
Delay-and-sum beamforming of sound source s2 using Y_12(k, τ) and Y_22(k, τ) yields the enhanced output signal YE_2(k, τ) of sound source s2 (expression (21)).
Where k=1, …, K.
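Expressions (20) and (21) are not reproduced above; the sketch below assumes the usual delay-and-sum form in which the second microphone's per-source spectrum is phase-aligned by the delay difference and averaged with the first (the sign of the phase term depends on the delay convention):

```python
import numpy as np

def delay_and_sum(Y_mic1, Y_mic2, tau, fs=16000.0, nfft=512):
    """Delay-and-sum beam enhancement of one source from its two per-mic spectra.

    Y_mic1, Y_mic2 : (K,) spectra of the same source at mic1 / mic2
    tau            : propagation delay difference (seconds) of this source
    """
    K = Y_mic1.shape[0]
    freqs = np.arange(K) * fs / nfft
    # Phase-align the mic2 observation; the sign follows the assumed delay convention.
    aligned = Y_mic2 * np.exp(2j * np.pi * freqs * tau)
    return 0.5 * (Y_mic1 + aligned)

# YE_1(k, tau) for source s1 from its noise-reduced observations Y11 and Y21:
# YE1 = delay_and_sum(Y11, Y21, tau1)
```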
An exemplary embodiment of the present disclosure further provides a sound signal recognition method capable of obtaining the time-domain sound source signal with enhanced signal-to-noise ratio of each sound source from the beam-enhanced output signals: the beam-enhanced output signal of each sound source is processed by the inverse short-time Fourier transform (ISTFT) and overlap-add to obtain the time-domain sound source signal with enhanced signal-to-noise ratio of that sound source.
Still taking the application scenario shown in fig. 3 as an example, ISTFT and overlap-add are applied, according to expression (22), to YE_1(τ) = [YE_1(1, τ), ..., YE_1(K, τ)] and YE_2(τ) = [YE_2(1, τ), ..., YE_2(K, τ)], k = 1, ..., K, to obtain the time-domain frames of each sound source,
where m = 1, ..., Nfft and i = 1, 2.
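A sketch of expression (22) as an inverse STFT with overlap-add; the synthesis window and the 50% hop are assumptions, as expression (22) itself is not reproduced above:

```python
import numpy as np

def istft_overlap_add(frames_freq, nfft=512, hop=256):
    """Inverse STFT with overlap-add.

    frames_freq : (T, K) beam-enhanced spectra YE_i(k, tau), one row per frame
    Returns the time-domain sound source signal with enhanced signal-to-noise ratio.
    """
    T = frames_freq.shape[0]
    window = np.hanning(nfft)                       # assumed synthesis window
    out = np.zeros(hop * (T - 1) + nfft)
    for t in range(T):
        frame = np.fft.irfft(frames_freq[t], n=nfft) * window
        out[t * hop:t * hop + nfft] += frame        # overlap-add of successive frames
    return out

# time-domain signal of source s1 from its enhanced spectra over all frames:
# ye1 = istft_overlap_add(np.stack(all_frames_YE1))
```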
Because the microphone observation data is noisy and the direction-finding algorithm strongly depends on the signal-to-noise ratio, direction finding is inaccurate when the signal-to-noise ratio is low, which affects the accuracy of the voice recognition result. In the embodiments of the present disclosure, after blind source separation, the delay-and-sum beamforming technique is used to further remove the noise influence from the observed signal data and improve the signal-to-noise ratio, thereby avoiding the inaccurate recognition results that occur when the sound source azimuth is estimated directly from the original microphone observation data X(k, τ) = [X_1(k, τ), X_2(k, τ)]^T.
An exemplary embodiment of the present disclosure further provides a voice signal recognition apparatus, the structure of which is shown in fig. 6, including:
The original data acquisition module 601 is configured to acquire original observation data acquired by at least two acquisition points for at least two sound sources respectively;
the first noise reduction module 602 is configured to perform a first level noise reduction process on the original observed data to obtain observed signal estimated data;
A positioning module 603, configured to obtain positioning information and observation signal data of each sound source according to the observation signal estimation data;
A second noise reduction module 604, configured to perform second-stage noise reduction processing on the observed signal data according to the positioning information, so as to obtain a beam-enhanced output signal;
and the enhanced signal output module 605 is configured to obtain a time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal.
The first noise reduction module 602 is shown in fig. 7, and includes:
the matrix initialization submodule 701 is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and the number of columns of the separation matrix are the number of the sound sources;
the frequency domain data acquisition sub-module 702 is configured to obtain time domain signals at each acquisition point, and construct an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
a priori frequency domain estimation sub-module 703, configured to calculate a priori frequency domain estimation of each sound source in the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix update sub-module 704 configured to update the weighted covariance matrix according to the prior frequency domain estimate;
A separation matrix updating sub-module 705, configured to update the separation matrix according to the updated weighted covariance matrix;
a deblurring sub-module 706, configured to deblur the updated separation matrix;
And the posterior domain estimation sub-module 707 is configured to separate the original observation data according to the deblurred separation matrix, and use the posterior domain estimation data obtained by separation as the observation signal estimation data.
Further, the priori frequency domain estimation sub-module 703 is configured to separate the observation signal matrix according to the separation matrix of the previous frame, so as to obtain a priori frequency domain estimation of each sound source of the current frame.
Further, the covariance matrix updating sub-module 704 is configured to update the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
Further, the structure of the split matrix updating sub-module 705 is shown in fig. 8, and includes:
A first updating sub-module 801, configured to update the separation matrix of each sound source according to the weighted covariance matrix of each sound source;
A second updating sub-module 802, configured to update the separation matrix into a conjugate transpose matrix after the separation matrices of the respective sound sources are combined.
Further, the deblurring submodule 706 is configured to perform amplitude deblurring on the separation matrix using a minimum distortion criterion. The deblurring process may be performed using MDP.
The positioning module 603 is shown in fig. 9, and includes:
a sound source data estimation sub-module 901, configured to obtain observed signal data of each sound source at each acquisition point according to the observed signal estimation data;
And the first positioning sub-module 902 is configured to estimate the azimuth of each sound source according to the observed signal data of each sound source at each acquisition point, so as to obtain positioning information of each sound source.
Further, the first positioning sub-module 902 is configured to perform the following estimation on each sound source to obtain the azimuth of each sound source:
And using the observed signal data of the same sound source at different acquisition points to form the observed data of the acquisition points, and positioning the sound sources through a direction finding algorithm to obtain positioning information of each sound source.
The positioning information of the sound source includes the azimuth coordinates of the sound source, and the second noise reduction module 604 is configured as shown in fig. 10, and includes:
The time delay calculation sub-module 1001 is configured to calculate a propagation delay difference value of each sound source according to the azimuth coordinate of each sound source and the azimuth coordinate of each acquisition point, where the propagation delay difference value is a time difference value of transmission of sound sent by the sound source to each acquisition point;
A beam summation sub-module 1002, configured to perform second-stage noise reduction on each sound source through delay-and-sum beamforming processing using the observed signal data of each sound source, to obtain the beam-enhanced output signal of each sound source.
Further, the enhanced signal output module 605 is configured to perform the inverse short-time Fourier transform and overlap-add on the beam-enhanced output signals of the respective sound sources, so as to obtain the time-domain sound source signals with enhanced signal-to-noise ratio of the respective sound sources.
The device can be integrated in intelligent terminal equipment or a remote operation processing platform, or part of functional modules can be integrated in the intelligent terminal equipment and part of functional modules are integrated in the remote operation processing platform, and corresponding functions are realized by the intelligent terminal equipment and/or the remote operation processing platform.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 11 is a block diagram illustrating an apparatus 1100 for sound source localization according to an exemplary embodiment. For example, apparatus 1100 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 11, apparatus 1100 may include one or more of the following components: a processing component 1102, a memory 1104, a power component 1106, a multimedia component 1108, an audio component 1110, an input/output (I/O) interface 1112, a sensor component 1114, and a communication component 1116.
The processing component 1102 generally controls overall operation of the apparatus 1100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1102 may include one or more processors 1120 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1102 can include one or more modules that facilitate interactions between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
Memory 1104 is configured to store various types of data to support operations at device 1100. Examples of such data include instructions for any application or method operating on the device 1100, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1104 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 1106 provides power to the various components of the device 1100. The power components 1106 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 1100.
Multimedia component 1108 includes a screen between the device 1100 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, multimedia component 1108 includes a front camera and/or a rear camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 1100 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1110 is configured to output and/or input an audio signal. For example, the audio component 1110 includes a Microphone (MIC) configured to receive external audio signals when the device 1100 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 1104 or transmitted via the communication component 1116. In some embodiments, the audio component 1110 further comprises a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 1114 includes one or more sensors for providing status assessment of various aspects of the apparatus 1100. For example, the sensor assembly 1114 may detect the on/off state of the apparatus 1100 and the relative positioning of components such as the display and keypad of the apparatus 1100; it may also detect a change in position of the apparatus 1100 or of a component of the apparatus 1100, the presence or absence of user contact with the apparatus 1100, the orientation or acceleration/deceleration of the apparatus 1100, and a change in temperature of the apparatus 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the device 1100 and other devices. The device 1100 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1116 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 1116 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 1100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for performing the methods described above.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 1104 including instructions, executable by the processor 1120 of the device 1100 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Also provided is a non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a voice signal recognition method, the method comprising:
acquiring original observed data collected by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observed data to obtain observed signal estimation data;
obtaining positioning information and observed signal data of each sound source according to the observed signal estimation data;
performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal; and
obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal.
Fig. 12 is a block diagram illustrating an apparatus 1200 for sound source localization according to an exemplary embodiment. For example, the apparatus 1200 may be provided as a server. Referring to Fig. 12, the apparatus 1200 includes a processing component 1222, which further includes one or more processors, and memory resources, represented by a memory 1232, for storing instructions executable by the processing component 1222, such as application programs. The application programs stored in the memory 1232 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1222 is configured to execute the instructions to perform the above-described methods.
The apparatus 1200 may also include a power component 1226 configured to perform power management of the apparatus 1200, a wired or wireless network interface 1250 configured to connect the apparatus 1200 to a network, and an input/output (I/O) interface 1258. The apparatus 1200 may operate based on an operating system stored in the memory 1232, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The embodiments of the present disclosure provide a voice signal identification method and device. Original observed data collected by at least two acquisition points for at least two sound sources are acquired; first-stage noise reduction processing is performed on the original observed data to obtain observed signal estimation data; positioning information and observed signal data of each sound source are then obtained according to the observed signal estimation data; second-stage noise reduction processing is performed on the observed signal data according to the positioning information to obtain a beam enhanced output signal; and a time domain sound source signal with enhanced signal-to-noise ratio of each sound source is obtained according to the beam enhanced output signal. Because the original observed data are first denoised and the sound sources are localized, and the signal-to-noise ratio is then further improved through beam enhancement to highlight the target signals, the problems of low sound source localization accuracy and poor voice recognition quality in strongly interfered, low signal-to-noise-ratio scenes are alleviated, and efficient, highly interference-resistant voice signal recognition is achieved.
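For illustration only, the data flow summarized above can be written compactly in frequency-domain notation (the notation below is added here and does not appear in the patent text):

$$\hat{\mathbf{y}}(f,t) = \mathbf{W}(f)\,\mathbf{x}(f,t), \qquad Y_k^{\mathrm{beam}}(f,t) = \frac{1}{M}\sum_{m=1}^{M} X_{k,m}(f,t)\,e^{\,\mathrm{j}2\pi f \tau_{k,m}},$$

where $\mathbf{x}(f,t)$ collects the observed data of the $M$ acquisition points at frequency bin $f$ and frame $t$, $\mathbf{W}(f)$ is the first-stage separation matrix for that bin, $X_{k,m}$ is the observed signal data of sound source $k$ at acquisition point $m$, $\tau_{k,m}$ is the propagation delay difference derived from the localization information, and the time domain sound source signal is recovered from $Y_k^{\mathrm{beam}}$ by inverse short-time Fourier transform and overlap-add.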
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations that follow the general principles of the application and include such departures from the present disclosure as come within known or customary practice in the art to which the application pertains. The specification and examples are to be considered exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (18)

1. A voice signal recognition method, comprising:
acquiring original observed data collected by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observed data to obtain observed signal estimation data;
obtaining positioning information and observed signal data of each sound source according to the observed signal estimation data;
performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal; and
obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal;
wherein the step of performing first-stage noise reduction processing on the original observed data to obtain the observed signal estimation data comprises:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the numbers of rows and columns of the separation matrix are both the number of the sound sources;
obtaining a time domain signal at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
obtaining a prior frequency domain estimate of each sound source of a current frame according to the separation matrix of a previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimate;
updating the separation matrix according to the updated weighted covariance matrix;
deblurring the updated separation matrix; and
separating the original observed data according to the deblurred separation matrix, and taking posterior domain estimation data obtained by the separation as the observed signal estimation data;
and wherein the positioning information of each sound source comprises azimuth coordinates of the sound source, and the step of performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain the beam enhanced output signal comprises:
calculating a propagation delay difference of each sound source according to the azimuth coordinates of the sound source and azimuth coordinates of each acquisition point, wherein the propagation delay difference is the difference between the times taken for the sound emitted by the sound source to reach the respective acquisition points; and
performing second-stage noise reduction on each sound source through delay-and-sum beamforming processing using the observed signal data of the sound source, to obtain the beam enhanced output signal of each sound source.
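For illustration only: claim 1 fixes the structure of the first-stage processing (per-frequency separation matrices, frame-wise prior estimates, weighted covariance updates) but does not name the exact weighting or update rule. The Python sketch below assumes an auxiliary-function (AuxIVA-style) update with a spherical source model and recursive covariance smoothing; the smoothing factor, the contrast choice, the random stand-in data, and all variable names are assumptions, and the amplitude deblurring step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_freq, n_frames = 2, 257, 50      # number of sources == number of acquisition points
alpha, eps = 0.96, 1e-6                   # smoothing factor and regularizer (assumed values)

# Observation signal matrix: the STFT of the acquisition-point signals would go here;
# random complex data stands in for it in this sketch.
X = (rng.standard_normal((n_freq, n_frames, n_src))
     + 1j * rng.standard_normal((n_freq, n_frames, n_src)))

# Initialization: one identity separation matrix per frequency point (rows = columns
# = number of sources) and one identity weighted covariance matrix per source per bin.
W = np.tile(np.eye(n_src, dtype=complex), (n_freq, 1, 1))          # (F, M, M)
V = np.tile(np.eye(n_src, dtype=complex), (n_freq, n_src, 1, 1))   # (F, M, M, M)

Y = np.zeros_like(X)                      # posterior frequency-domain estimates
for t in range(n_frames):
    x = X[:, t, :]                                             # (F, M)
    # Prior frequency domain estimate using the previous frame's separation matrix.
    y_prior = np.einsum('fij,fj->fi', W, x)                    # (F, M)
    # Full-band source magnitudes used to weight the covariance update.
    r = np.sqrt(np.sum(np.abs(y_prior) ** 2, axis=0)) + eps    # (M,)
    xxH = np.einsum('fi,fj->fij', x, x.conj())                 # (F, M, M)
    for k in range(n_src):
        # Update the weighted covariance matrix of source k.
        V[:, k] = alpha * V[:, k] + (1 - alpha) * xxH / r[k]
        # Update the k-th row of the separation matrix from the weighted covariance.
        e = np.zeros((n_freq, n_src, 1), dtype=complex)
        e[:, k, 0] = 1.0
        w_k = np.linalg.solve(W @ V[:, k], e)[..., 0]          # (F, M)
        scale = np.einsum('fi,fij,fj->f', w_k.conj(), V[:, k], w_k).real
        w_k /= np.sqrt(np.maximum(scale, eps))[:, None]
        W[:, k, :] = w_k.conj()
    # Posterior estimate: separate the current frame with the updated matrix
    # (the deblurring recited in claim 1 is not shown in this sketch).
    Y[:, t, :] = np.einsum('fij,fj->fi', W, x)
```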
2. The voice signal recognition method of claim 1, wherein the step of obtaining the prior frequency domain estimate of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:
separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimate of each sound source of the current frame.
3. The voice signal recognition method of claim 1, wherein the step of updating the weighted covariance matrix according to the prior frequency domain estimate comprises:
updating the weighted covariance matrix according to the observation signal matrix and a conjugate transpose matrix of the observation signal matrix.
4. The voice signal recognition method of claim 1, wherein the step of updating the separation matrix according to the updated weighted covariance matrix comprises:
updating the separation matrix of each sound source according to the weighted covariance matrix of the sound source; and
updating the separation matrix to a conjugate transpose matrix of a matrix obtained by combining the separation matrices of the respective sound sources.
5. The voice signal recognition method of claim 1, wherein the step of deblurring the updated separation matrix comprises:
performing amplitude deblurring processing on the separation matrix by using a Markov decision process.
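For illustration only: claims 5 and 13 do not spell out the deblurring computation. One widely used way of resolving the per-frequency scaling (amplitude) ambiguity of a separation matrix in blind source separation is the minimal distortion principle, which rescales each bin's matrix by the diagonal of its inverse. The sketch below shows that rescaling purely as an assumption; it does not claim to be the exact processing recited in the claims.

```python
import numpy as np

def minimal_distortion_rescale(W):
    """Rescale each per-frequency separation matrix W[f] (shape (M, M)) so that the
    separated outputs are expressed on the scale of the observed signals, following
    the minimal distortion principle: W[f] <- diag(inv(W[f])) @ W[f]."""
    W_fixed = np.empty_like(W)
    for f in range(W.shape[0]):
        A = np.linalg.inv(W[f])                  # estimated mixing matrix at bin f
        W_fixed[f] = np.diag(np.diag(A)) @ W[f]  # keep only the direct-path scale
    return W_fixed

# Toy usage on random invertible matrices standing in for a converged separator.
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 2, 2)) + 1j * rng.standard_normal((4, 2, 2))
W_deblurred = minimal_distortion_rescale(W)
```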
6. The voice signal recognition method of claim 1, wherein the step of obtaining the positioning information and the observed signal data of each sound source according to the observed signal estimation data comprises:
obtaining the observed signal data of each sound source at each acquisition point according to the observed signal estimation data; and
estimating an azimuth of each sound source according to the observed signal data of the sound source at each acquisition point, to obtain the positioning information of each sound source.
7. The voice signal recognition method of claim 6, wherein the step of estimating the azimuth of each sound source according to the observed signal data of the sound source at each acquisition point to obtain the positioning information of each sound source comprises:
performing the following estimation for each sound source to obtain the azimuth of the sound source:
using the observed signal data of the same sound source at different acquisition points to form observation data of the acquisition points, and localizing the sound source through a direction finding algorithm to obtain the positioning information of the sound source.
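For illustration only: the claims leave the direction finding algorithm unspecified. A common choice for a pair of acquisition points is to estimate the time difference of arrival with GCC-PHAT and convert it to an azimuth under a far-field assumption; the Python sketch below shows that approach as an assumption, with illustrative geometry and synthetic data.

```python
import numpy as np

def gcc_phat_tdoa(sig_a, sig_b, fs, max_tau=None):
    """Estimate the delay of sig_b relative to sig_a (seconds) with GCC-PHAT."""
    n = sig_a.shape[0] + sig_b.shape[0]
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    R = np.conj(A) * B                                 # cross-spectrum
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)    # PHAT-weighted correlation
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def azimuth_from_tdoa(tau, spacing, c=343.0):
    """Far-field azimuth (radians) for a pair of acquisition points."""
    return np.arccos(np.clip(c * tau / spacing, -1.0, 1.0))

# Toy usage: a broadband source arriving 2 samples later at the second point.
fs, spacing = 16000, 0.05
rng = np.random.default_rng(0)
sig_a = rng.standard_normal(1600)
sig_b = np.roll(sig_a, 2)
tau_hat = gcc_phat_tdoa(sig_a, sig_b, fs, max_tau=spacing / 343.0)
theta = azimuth_from_tdoa(tau_hat, spacing)            # roughly 0.54 rad here
```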
8. The voice signal recognition method of claim 6, wherein the step of obtaining the time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal comprises:
performing inverse short-time Fourier transform on the beam enhanced output signal of each sound source and then performing overlap-add, to obtain the time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
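For illustration only: the reconstruction recited in claim 8 (inverse short-time Fourier transform followed by overlap-add) can be sketched with SciPy, whose `istft` performs the overlap-add internally. The frame length, overlap, and the sine wave standing in for a beam enhanced output signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
beam_time = np.sin(2 * np.pi * 440 * t)        # stand-in for an enhanced source signal

# Forward STFT standing in for the beam enhanced output signal of one sound source.
_, _, Z = stft(beam_time, fs=fs, nperseg=512, noverlap=384)

# Inverse STFT + overlap-add reconstruction of the time domain sound source signal.
_, y_time = istft(Z, fs=fs, nperseg=512, noverlap=384)

# With a COLA-compliant window and overlap, the reconstruction error is negligible.
n = min(beam_time.size, y_time.size)
err = np.max(np.abs(y_time[:n] - beam_time[:n]))
```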
9. A voice signal recognition apparatus, comprising:
an original data acquisition module, configured to acquire original observed data collected by at least two acquisition points for at least two sound sources respectively;
a first noise reduction module, configured to perform first-stage noise reduction processing on the original observed data to obtain observed signal estimation data;
a localization module, configured to obtain positioning information and observed signal data of each sound source according to the observed signal estimation data;
a second noise reduction module, configured to perform second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal; and
an enhanced signal output module, configured to obtain a time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal;
wherein the first noise reduction module comprises:
a matrix initialization sub-module, configured to initialize a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the numbers of rows and columns of the separation matrix are both the number of the sound sources;
a frequency domain data acquisition sub-module, configured to obtain a time domain signal at each acquisition point and construct an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
a prior frequency domain estimation sub-module, configured to obtain a prior frequency domain estimate of each sound source of a current frame according to the separation matrix of a previous frame and the observation signal matrix;
a covariance matrix updating sub-module, configured to update the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating sub-module, configured to update the separation matrix according to the updated weighted covariance matrix;
a deblurring sub-module, configured to deblur the updated separation matrix; and
a posterior domain estimation sub-module, configured to separate the original observed data according to the deblurred separation matrix and take posterior domain estimation data obtained by the separation as the observed signal estimation data;
and wherein the positioning information of each sound source comprises azimuth coordinates of the sound source, and the second noise reduction module comprises:
a time delay calculation sub-module, configured to calculate a propagation delay difference of each sound source according to the azimuth coordinates of the sound source and azimuth coordinates of each acquisition point, wherein the propagation delay difference is the difference between the times taken for the sound emitted by the sound source to reach the respective acquisition points; and
a beam summation sub-module, configured to perform second-stage noise reduction on each sound source through delay-and-sum beamforming processing using the observed signal data of the sound source, to obtain the beam enhanced output signal of each sound source.
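For illustration only: the time delay calculation and beam summation sub-modules above (and the corresponding steps of claim 1) amount to computing per-source propagation delay differences from the azimuth coordinates and then applying delay-and-sum beamforming. The Python sketch below shows one frequency-domain formulation under a far-field assumption; the 2-D coordinates, sign conventions, and random stand-in data are assumptions, not part of the disclosure.

```python
import numpy as np

def far_field_delays(azimuth, mic_positions, c=343.0):
    """Propagation delay differences (seconds) of a far-field source at the given
    azimuth, relative to the array origin, for 2-D acquisition point coordinates."""
    direction = np.array([np.cos(azimuth), np.sin(azimuth)])
    return -(mic_positions @ direction) / c                    # (num_mics,)

def delay_and_sum(X, freqs, delays):
    """Frequency-domain delay-and-sum beamforming.
    X: observed signal data, shape (num_mics, num_freqs, num_frames).
    Each channel is phase-aligned by exp(+j*2*pi*f*tau) and the channels averaged."""
    steering = np.exp(2j * np.pi * np.outer(delays, freqs))    # (num_mics, num_freqs)
    return np.mean(steering[:, :, None] * X, axis=0)           # (num_freqs, num_frames)

# Toy usage: two acquisition points 5 cm apart and an assumed azimuth of 60 degrees.
fs, nfft = 16000, 512
freqs = np.fft.rfftfreq(nfft, d=1 / fs)
mic_positions = np.array([[0.0, 0.0], [0.05, 0.0]])
rng = np.random.default_rng(0)
X = (rng.standard_normal((2, freqs.size, 10))
     + 1j * rng.standard_normal((2, freqs.size, 10)))          # stand-in observed data
taus = far_field_delays(np.deg2rad(60.0), mic_positions)
beam = delay_and_sum(X, freqs, taus)                           # beam enhanced output
```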
10. The voice signal recognition apparatus of claim 9, wherein
the prior frequency domain estimation sub-module is configured to separate the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimate of each sound source of the current frame.
11. The voice signal recognition apparatus of claim 9, wherein
the covariance matrix updating sub-module is configured to update the weighted covariance matrix according to the observation signal matrix and a conjugate transpose matrix of the observation signal matrix.
12. The voice signal recognition apparatus of claim 9, wherein the separation matrix updating sub-module comprises:
a first updating sub-module, configured to update the separation matrix of each sound source according to the weighted covariance matrix of the sound source; and
a second updating sub-module, configured to update the separation matrix to a conjugate transpose matrix of a matrix obtained by combining the separation matrices of the respective sound sources.
13. The voice signal recognition apparatus of claim 9, wherein
the deblurring sub-module is configured to perform amplitude deblurring processing on the separation matrix by using a Markov decision process.
14. The voice signal recognition apparatus of claim 9, wherein the localization module comprises:
a sound source data estimation sub-module, configured to obtain the observed signal data of each sound source at each acquisition point according to the observed signal estimation data; and
a first positioning sub-module, configured to estimate an azimuth of each sound source according to the observed signal data of the sound source at each acquisition point, to obtain the positioning information of each sound source.
15. The voice signal recognition apparatus of claim 14, wherein
the first positioning sub-module is configured to perform the following estimation for each sound source to obtain the azimuth of the sound source:
using the observed signal data of the same sound source at different acquisition points to form observation data of the acquisition points, and localizing the sound source through a direction finding algorithm to obtain the positioning information of the sound source.
16. The voice signal recognition apparatus of claim 14, wherein
the enhanced signal output module is configured to perform inverse short-time Fourier transform and overlap-add on the beam enhanced output signal of each sound source to obtain the time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
17. A computer apparatus, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform:
acquiring original observed data collected by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observed data to obtain observed signal estimation data;
obtaining positioning information and observed signal data of each sound source according to the observed signal estimation data;
performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal; and
obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal;
wherein the step of performing first-stage noise reduction processing on the original observed data to obtain the observed signal estimation data comprises:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the numbers of rows and columns of the separation matrix are both the number of the sound sources;
obtaining a time domain signal at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
obtaining a prior frequency domain estimate of each sound source of a current frame according to the separation matrix of a previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimate;
updating the separation matrix according to the updated weighted covariance matrix;
deblurring the updated separation matrix; and
separating the original observed data according to the deblurred separation matrix, and taking posterior domain estimation data obtained by the separation as the observed signal estimation data;
and wherein the positioning information of each sound source comprises azimuth coordinates of the sound source, and the step of performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain the beam enhanced output signal comprises:
calculating a propagation delay difference of each sound source according to the azimuth coordinates of the sound source and azimuth coordinates of each acquisition point, wherein the propagation delay difference is the difference between the times taken for the sound emitted by the sound source to reach the respective acquisition points; and
performing second-stage noise reduction on each sound source through delay-and-sum beamforming processing using the observed signal data of the sound source, to obtain the beam enhanced output signal of each sound source.
18. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a voice signal recognition method, the method comprising:
acquiring original observed data collected by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observed data to obtain observed signal estimation data;
obtaining positioning information and observed signal data of each sound source according to the observed signal estimation data;
performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain a beam enhanced output signal; and
obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhanced output signal;
wherein the step of performing first-stage noise reduction processing on the original observed data to obtain the observed signal estimation data comprises:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the numbers of rows and columns of the separation matrix are both the number of the sound sources;
obtaining a time domain signal at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
obtaining a prior frequency domain estimate of each sound source of a current frame according to the separation matrix of a previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimate;
updating the separation matrix according to the updated weighted covariance matrix;
deblurring the updated separation matrix; and
separating the original observed data according to the deblurred separation matrix, and taking posterior domain estimation data obtained by the separation as the observed signal estimation data;
and wherein the positioning information of each sound source comprises azimuth coordinates of the sound source, and the step of performing second-stage noise reduction processing on the observed signal data according to the positioning information to obtain the beam enhanced output signal comprises:
calculating a propagation delay difference of each sound source according to the azimuth coordinates of the sound source and azimuth coordinates of each acquisition point, wherein the propagation delay difference is the difference between the times taken for the sound emitted by the sound source to reach the respective acquisition points; and
performing second-stage noise reduction on each sound source through delay-and-sum beamforming processing using the observed signal data of the sound source, to obtain the beam enhanced output signal of each sound source.
CN202110502126.9A 2021-05-08 2021-05-08 Voice signal identification method and device Active CN113053406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110502126.9A CN113053406B (en) 2021-05-08 2021-05-08 Voice signal identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110502126.9A CN113053406B (en) 2021-05-08 2021-05-08 Voice signal identification method and device

Publications (2)

Publication Number Publication Date
CN113053406A CN113053406A (en) 2021-06-29
CN113053406B true CN113053406B (en) 2024-06-18

Family

ID=76518218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110502126.9A Active CN113053406B (en) 2021-05-08 2021-05-08 Voice signal identification method and device

Country Status (1)

Country Link
CN (1) CN113053406B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506582A (en) * 2021-05-25 2021-10-15 北京小米移动软件有限公司 Sound signal identification method, device and system
CN113782047B (en) * 2021-09-06 2024-03-08 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616667A (en) * 2014-12-02 2015-05-13 清华大学 Active noise reduction method for automobile

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8098842B2 (en) * 2007-03-29 2012-01-17 Microsoft Corp. Enhanced beamforming for arrays of directional microphones
CN105244036A (en) * 2014-06-27 2016-01-13 中兴通讯股份有限公司 Microphone speech enhancement method and microphone speech enhancement device
CN106504763A (en) * 2015-12-22 2017-03-15 电子科技大学 Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
CN107464564B (en) * 2017-08-21 2023-05-26 腾讯科技(深圳)有限公司 Voice interaction method, device and equipment
CN107742522B (en) * 2017-10-23 2022-01-14 科大讯飞股份有限公司 Target voice obtaining method and device based on microphone array
WO2019112468A1 (en) * 2017-12-08 2019-06-13 Huawei Technologies Co., Ltd. Multi-microphone noise reduction method, apparatus and terminal device
CN108538305A (en) * 2018-04-20 2018-09-14 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN109920442B (en) * 2019-03-15 2021-07-30 厦门大学 Method and system for speech enhancement of microphone array
CN209880151U (en) * 2019-03-15 2019-12-31 厦门大学 Microphone array speech enhancement device
CN112216298B (en) * 2019-07-12 2024-04-26 大众问问(北京)信息科技有限公司 Dual-microphone array sound source orientation method, device and equipment
CN110265020B (en) * 2019-07-12 2021-07-06 大象声科(深圳)科技有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN110364176A (en) * 2019-08-21 2019-10-22 百度在线网络技术(北京)有限公司 Audio signal processing method and device
CN110827846B (en) * 2019-11-14 2022-05-10 深圳市友杰智新科技有限公司 Speech noise reduction method and device adopting weighted superposition synthesis beam
CN111009256B (en) * 2019-12-17 2022-12-27 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111009257B (en) * 2019-12-17 2022-12-27 北京小米智能科技有限公司 Audio signal processing method, device, terminal and storage medium
CN111128221B (en) * 2019-12-17 2022-09-02 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111161751A (en) * 2019-12-25 2020-05-15 声耕智能科技(西安)研究院有限公司 Distributed microphone pickup system and method under complex scene
CN111044973B (en) * 2019-12-31 2021-06-01 山东大学 MVDR target sound source directional pickup method for microphone matrix
CN111179960B (en) * 2020-03-06 2022-10-18 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN111402917B (en) * 2020-03-13 2023-08-04 北京小米松果电子有限公司 Audio signal processing method and device and storage medium

Also Published As

Publication number Publication date
CN113053406A (en) 2021-06-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant