CN111929638A - Voice direction of arrival estimation method and device - Google Patents


Info

Publication number
CN111929638A
Authority
CN
China
Prior art keywords
arrival
voice
grid
frequency
signal
Prior art date
Legal status
Pending
Application number
CN202011011975.6A
Other languages
Chinese (zh)
Inventor
谭祚
何云鹏
许兵
Current Assignee
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202011011975.6A priority Critical patent/CN111929638A/en
Publication of CN111929638A publication Critical patent/CN111929638A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802Systems for determining direction or deviation from predetermined direction
    • G01S3/8027By vectorial composition of signals received by plural, differently-oriented transducers


Abstract

A method for estimating the direction of arrival of speech includes the following steps. S1: split a broadband voice signal received by a microphone array into a plurality of narrowband voice signals, and divide the spatial domain into a plurality of initial spatial grids. S2: calculate the steering vector and covariance matrix of each frequency point in each narrowband voice signal, separate the signal subspace from the noise subspace, solve the spatial spectrum energy in each grid direction, and take the grid point corresponding to the spatial spectrum energy peak as the estimated direction of arrival. S3: using the direction of arrival obtained in step S2, add grid points in the grid interval where the target may exist, and repeat step S2 on the refined grid to correct the direction of arrival until the upper limit on grid additions is reached. With this method, grid points are added adaptively during estimation, improving the estimation precision and resolution of voice-signal direction-of-arrival estimation.

Description

Voice direction of arrival estimation method and device
Technical Field
The invention belongs to the technical field of intelligent voice recognition, relates to voice front-end signal processing, and particularly relates to a method and a device for estimating a voice direction of arrival.
Background
The direction-of-arrival estimation algorithm is mainly used to estimate the angle between the sound source target and the microphone array; the estimated angle data is input into a voice enhancement system to effectively enhance the voice signal from that direction and suppress noise signals from other directions. Currently, direction-of-arrival estimation for speech signals usually relies on time-difference-of-arrival methods. However, when the signal-to-noise ratio of the input signal is low or reverberation is severe, the performance of such algorithms drops significantly. Moreover, when the number of target sound sources increases, such methods cannot distinguish the correct number of target sound sources, causing performance failure.
The prior art has the following disadvantages:
1. the reverberation resistance is weak;
2. the noise immunity is weak;
3. the estimated resolution is low;
4. the estimation accuracy is insufficient.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention discloses a method and a device for estimating the direction of arrival of voice.
The invention relates to a method for estimating the direction of arrival of voice, which comprises the following steps:
S1: splitting a broadband voice signal received by a microphone array into a plurality of narrowband voice signals, and dividing the 0-180-degree spatial domain into a plurality of initial spatial grids;
S2: calculating the steering vector and covariance matrix of each frequency point in each narrowband voice signal, separating the signal subspace and the noise subspace, and solving the spatial spectrum energy at each grid point of the spatial grid; the grid point corresponding to the spatial spectrum energy peak gives the estimated direction of arrival;
S3: adding grid points in the grid interval where the target may exist, based on the direction of arrival obtained in step S2; step S2 is repeated on the refined grid after the grid points are added, to correct the direction of arrival, until the upper limit on grid additions is reached.
Preferably, the specific steps of splitting the wideband speech signal into a plurality of narrowband speech signals in step S1 are: windowing and framing the voice signal, converting each frame of the time-domain audio signal into the frequency domain, and performing frequency-division processing on the spectrum of the frequency-domain signal, thereby dividing the broadband voice signal into a plurality of narrowband voice signals.
Preferably, in step S2, for each narrow band, the steering vector of the i-th frequency point in the band is expressed as:
a(θ_k, f_i) = [1, e^(−j2π f_i d cos θ_k / c), …, e^(−j2π f_i (M−1) d cos θ_k / c)]^T
where θ_k represents the grid point angle of a single initial spatial grid, the subscripts 1, 2, …, k denoting the different grid points; f_i is the frequency of the i-th frequency point; d is the distance between two adjacent microphones in the microphone array; M is the number of microphones; c is the speed of sound; e is the natural constant; and j denotes the imaginary unit.
Preferably, the covariance matrix R_yy of the frequency points in step S2 is expressed as:
R_yy = E[Y(f) Y^H(f)] = A(θ, f) R_s A^H(θ, f) + σ_n² I
where Y(f) represents the frequency-domain received audio signal, A(θ, f) is the array manifold matrix, σ_n² represents the noise variance, I is the identity matrix, R_yy represents the covariance matrix of the signals received by the microphone array, R_s represents the covariance matrix of the voice signal, the superscript H denotes the conjugate transpose operation, θ is the direction of arrival of the voice signal, f is the frequency of the frequency point, and E denotes the expectation operator.
Further, separating the signal subspace and the noise subspace in step S2 specifically comprises: decomposing the covariance matrix R_yy of the signals received by the microphone array into a signal subspace and a noise subspace by eigenvalue decomposition, expressed as:
R_yy = U_X Σ_X U_X^H + U_N Σ_N U_N^H
where U_X represents the signal subspace, U_N represents the noise subspace, and Σ_X and Σ_N are the diagonal matrices of eigenvalues of the signal subspace and the noise subspace, respectively.
Further, in step S2, the spatial spectrum energy in each grid direction is estimated using the orthogonality between the steering vector and the noise subspace; the spatial spectrum energy P(θ, f) in each grid direction is expressed as:
P(θ, f) = 1 / (a^H(θ, f) U_N U_N^H a(θ, f))
The grid point angle θ_k corresponding to the peak of P(θ, f) is the estimated direction of arrival; a(θ, f) denotes the steering vector, the superscript H denotes the conjugate transpose operation, U_N represents the noise subspace, θ is the direction of arrival of the voice signal, and f is the frequency of the frequency point.
Preferably, in step S3, grid points are added symmetrically on both sides of the grid point where the estimated direction of arrival is located, and the added grid points should lie within the most recently divided minimum grid.
The invention also discloses a voice direction-of-arrival estimation device, which comprises an array module, a control module, an input module, an estimation module, an optimization module and an output module connected in sequence:
the array module is a microphone array formed by a plurality of microphones arranged on a horizontal plane;
the control module controls the working state of the voice direction-of-arrival estimation algorithm through recognition of a wake-up word;
the input module processes the voice signal and converts it into frequency point data of different frequencies;
the estimation module selects a frequency band range with distinct voice characteristics and performs a preliminary direction-of-arrival estimate on each frequency point in that range;
the optimization module optimizes the algorithm by adaptively adding spatial grid points;
and the output module transmits the estimated direction of arrival of the voice to the voice enhancement system, so that a subsequent system can enhance the voice.
By adopting this method and device for estimating the direction of arrival of voice, grid points are added adaptively during the estimation process, which can improve the estimation precision and resolution of voice-signal direction-of-arrival estimation.
Drawings
FIG. 1 is a schematic diagram of an embodiment of the estimation method according to the present invention;
FIG. 2 is a schematic view of an embodiment of the apparatus according to the present invention;
fig. 3 is a schematic diagram of an embodiment of the present invention, in which grid points are added, and the coordinate axes in fig. 3 represent angles.
Detailed Description
The following provides a more detailed description of the present invention.
The method for estimating the direction of arrival of voice, as shown in fig. 1, includes the following steps:
S1: splitting a broadband voice signal received by a microphone array into a plurality of narrowband voice signals, and dividing the 0-180-degree spatial domain into a plurality of initial spatial grids;
S2: calculating the steering vector and covariance matrix of each frequency point in each narrowband voice signal, separating the signal subspace and the noise subspace, and solving the spatial spectrum energy at each grid point of the spatial grid; the grid point corresponding to the spatial spectrum energy peak gives the estimated direction of arrival;
S3: adding grid points in the grid interval where the target may exist, based on the direction of arrival obtained in step S2; step S2 is repeated on the refined grid after the grid points are added, to correct the direction of arrival, until the upper limit on grid additions is reached.
Specifically, splitting the wideband speech signal into a plurality of narrowband speech signals in step S1 comprises: windowing and framing the voice signal, converting each frame of the time-domain audio signal into the frequency domain, and performing frequency-division processing on the spectrum of the frequency-domain signal, thereby dividing the broadband voice signal into a plurality of narrowband voice signals.
In one specific implementation, each frame of the time-domain audio signal is converted into the frequency domain by a 512-point fast Fourier transform, yielding a plurality of discrete frequency points. For example, applying frequency-division processing to each frequency point in the 1 kHz-3 kHz band at a 16 kHz sampling rate, the broadband voice signal can be divided into 71 narrow bands, each narrow band representing one discrete frequency point.
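The narrowband split just described can be sketched as follows. The 512-point FFT, the 1 kHz-3 kHz band, and the 16 kHz sampling rate come from the text; the Hann window and 50% overlap are assumptions for illustration, and the exact number of narrow bands depends on how the band edges are rounded.

```python
import numpy as np

FS = 16000    # sampling rate (Hz), per the example above
NFFT = 512    # FFT length, per the example above
FRAME = 512   # frame length in samples (assumed equal to the FFT length)
HOP = 256     # hop size, an assumed 50% overlap

def split_to_narrowbands(x, f_lo=1000.0, f_hi=3000.0):
    """Window and frame the signal, FFT each frame, and keep the bins
    inside [f_lo, f_hi]; each kept bin is one narrowband signal."""
    win = np.hanning(FRAME)
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=NFFT, axis=1)   # shape (frames, NFFT/2 + 1)
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)    # bin spacing FS/NFFT
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return spec[:, band], freqs[band]            # one column per narrow band

# toy input: 0.5 s of a 2 kHz tone
t = np.arange(int(0.5 * FS)) / FS
spec, freqs = split_to_narrowbands(np.sin(2 * np.pi * 2000.0 * t))
```

With a 31.25 Hz bin spacing, this particular rounding of the band edges keeps 65 bins; the patent's count of 71 implies slightly different edges.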
The 0-180-degree spatial domain is then divided into a plurality of initial spatial grids, and the steering vector of each frequency point is calculated at the grid points.
For example, with a 20-degree grid spacing, the 0-180-degree range is divided into 9 initial spatial grid intervals, giving 10 grid points: θ_1 = 0 degrees, θ_2 = 20 degrees, …, θ_10 = 180 degrees.
In step S2, the steering vector and covariance matrix of each frequency point in each narrowband speech signal are calculated, the signal subspace and the noise subspace are separated using the steering vector and the covariance matrix, and the spatial spectrum energy in each grid direction is solved. This specifically comprises the following steps:
S21: for each narrow band, the steering vector of the i-th frequency point in the band can be represented as:
a(θ_k, f_i) = [1, e^(−j2π f_i d cos θ_k / c), …, e^(−j2π f_i (M−1) d cos θ_k / c)]^T
where θ_k represents the grid point angle of a single initial spatial grid, the subscripts 1, 2, …, k denoting the different grid points; f_i is the frequency of the i-th frequency point; d is the distance between two adjacent microphones in the microphone array; M is the number of microphones; c is the speed of sound; e is the natural constant; and j denotes the imaginary unit.
S22: the covariance matrix R_yy of each frequency point of the input signal is estimated; it can be expressed as:
R_yy = E[Y(f) Y^H(f)] = A(θ, f) R_s A^H(θ, f) + σ_n² I
where Y(f) represents the frequency-domain received audio signal, A(θ, f) is the array manifold matrix, σ_n² represents the noise variance, I is the identity matrix, R_yy represents the covariance matrix of the signals received by the microphone array, R_s represents the covariance matrix of the voice signal, the superscript H denotes the conjugate transpose operation, θ is the direction of arrival of the voice signal, f is the frequency of the frequency point, and E denotes the expectation operator.
The estimated covariance matrix is then decomposed into a signal subspace and a noise subspace by eigenvalue decomposition, which can be expressed as:
R_yy = U_X Σ_X U_X^H + U_N Σ_N U_N^H
where U_X represents the signal subspace, U_N represents the noise subspace, and Σ_X and Σ_N are the diagonal matrices of eigenvalues of the signal subspace and the noise subspace, respectively.
Finally, the spatial spectrum energy in each grid direction is estimated using the orthogonality between the steering vector and the noise subspace; the spatial spectrum energy P(θ, f) in each grid direction can be expressed as:
P(θ, f) = 1 / (a^H(θ, f) U_N U_N^H a(θ, f))
The grid point angle θ_k corresponding to the peak of P(θ, f) is the estimated direction of arrival; a(θ, f) denotes the steering vector, the superscript H denotes the conjugate transpose operation, U_N represents the noise subspace, θ is the direction of arrival of the voice signal, and f is the frequency of the frequency point.
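Steps S21-S22 together with the spectrum search follow the standard MUSIC recipe. The sketch below simulates one frequency bin for a 4-element uniform linear array; the array geometry, mic spacing, and snapshot model are assumptions for illustration.

```python
import numpy as np

C, D, M = 343.0, 0.05, 4   # sound speed, assumed mic spacing (m), mic count

def steer(theta_deg, f):
    """ULA steering vector at angle theta (degrees) and frequency f (Hz)."""
    tau = np.arange(M) * D * np.cos(np.deg2rad(theta_deg)) / C
    return np.exp(-1j * 2 * np.pi * f * tau)

def music_spectrum(Y, f, grid_deg, n_src=1):
    """Y: (M, snapshots) for one frequency bin. Estimates R_yy, splits it
    into signal/noise subspaces by eigendecomposition, and evaluates
    P(theta, f) = 1 / (a^H U_N U_N^H a) on the given angle grid."""
    R = Y @ Y.conj().T / Y.shape[1]       # sample covariance R_yy
    _, U = np.linalg.eigh(R)              # eigenvalues in ascending order
    Un = U[:, :M - n_src]                 # noise subspace U_N
    P = np.empty(len(grid_deg))
    for k, th in enumerate(grid_deg):
        a = steer(th, f)
        P[k] = 1.0 / np.real(a.conj() @ Un @ Un.conj().T @ a)
    return P

# simulate 200 snapshots from a single source at 60 degrees plus weak noise
rng = np.random.default_rng(0)
f = 2000.0
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
noise = 0.01 * (rng.standard_normal((M, 200)) + 1j * rng.standard_normal((M, 200)))
Y = np.outer(steer(60.0, f), s) + noise

grid = np.arange(0.0, 181.0, 20.0)        # the coarse 10-point grid above
est = grid[int(np.argmax(music_spectrum(Y, f, grid)))]
# the spectrum peaks at the 60-degree grid point, the true direction
```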
After the first estimated direction of arrival is obtained on the initially divided grid, the grid spacing is still large and the true direction of arrival lies near this first estimate; the range of the direction of arrival is therefore narrowed by continually adding grid points.
S3: grid points are added symmetrically on both sides of the grid point where the estimated direction of arrival is located; the added grid points should lie within the most recently divided minimum grid. For example, the distance between a new grid point and the center grid point is usually a fixed fraction of the previous grid width.
After the grid points are added, steps S21-S22 are repeated over the interval spanned by the added pair of grid points, giving the grid point of the updated estimated direction of arrival.
Step S3 is repeated until the number of added grid points reaches the grid-addition upper limit, which is usually set in view of the hardware conditions, the computation time required, and so on.
As shown in fig. 3, if the first estimate places the direction of arrival at the 80-degree grid point, the first pair of added grid points W1 is placed at 70 and 90 degrees, to the left and right of the 80-degree point. Repeating steps S21-S22 may then yield a new estimated direction of arrival at the 70-degree grid point, so the second pair of added grid points W2 is placed at 65 and 75 degrees; if the new estimate falls at 75 degrees, the third pair of added grid points W3 is placed on either side of 75 degrees. In the same way, grid points are added and the grid spacing reduced until the number of added grid points reaches the upper limit.
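The coarse-to-fine search in the example above can be sketched as a loop. The halving of the grid spacing and the three-round limit are assumptions, and `spectrum_fn` is a hypothetical stand-in for evaluating the spatial spectrum energy at one angle.

```python
import numpy as np

def refine_doa(spectrum_fn, lo=0.0, hi=180.0, step=20.0, shrink=0.5, n_rounds=3):
    """Coarse-to-fine DOA search in the spirit of step S3: scan a coarse
    grid, then repeatedly add a symmetric pair of grid points around the
    current peak with a shrinking spacing, up to a fixed round limit."""
    grid = np.arange(lo, hi + step, step)
    best = grid[int(np.argmax([spectrum_fn(g) for g in grid]))]
    for _ in range(n_rounds):              # the grid-addition upper limit
        step *= shrink                     # assumed: spacing halves each round
        pair = [best - step, best + step]  # symmetric new grid points
        cands = [best] + [p for p in pair if lo <= p <= hi]
        best = max(cands, key=spectrum_fn)
    return best

# toy single-peak spectrum centred at 73 degrees
peak = lambda th: 1.0 / (1e-3 + (th - 73.0) ** 2)
est = refine_doa(peak)
# the search moves 80 -> 70 -> 75 -> 72.5, mirroring the figure's W1/W2/W3
```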
The spacing of the spatial grid directly affects the estimation accuracy and resolution of the algorithm, but a densely distributed spatial grid increases the computational load and is unfavorable for engineering applications. The invention therefore starts with a spatial grid of larger spacing, collects the prior directions of arrival estimated from the first several frequency points, and then adds new grid points within the grid intervals of those prior directions, so that the number of grid points grows and becomes non-uniformly distributed: the grid becomes dense in the spatial intervals where a target may exist and stays sparse elsewhere, which improves the estimation precision and resolution of the direction-of-arrival estimate.
The estimation method of the present invention can be implemented in a device, as shown in fig. 2, which comprises an array module, a control module, an input module, an estimation module, an optimization module and an output module connected in sequence. The functions of the modules are as follows.
The array module is a microphone array formed by a plurality of microphones arranged on a horizontal plane to receive voice signals. The array structure, such as a circular array or a linear array, can be chosen according to application requirements.
The control module controls the working state of the voice direction-of-arrival estimation algorithm through recognition of a wake-up word; when the wake-up word is recognized, the system starts to run the voice direction-of-arrival estimation algorithm.
The input module first windows and frames the voice signal, converting the non-stationary voice signal into short-time stationary signals; each frame of the voice signal is then converted into frequency point data of different frequencies by a fast Fourier transform, and the direction of arrival of the voice signal is estimated from the frequency point data.
The estimation module selects a frequency band range with distinct voice characteristics and performs a preliminary direction-of-arrival estimate on each frequency point in that range. The signal space of each frequency point is decomposed into a signal subspace and a noise subspace by eigenvalue decomposition; spatial grids of equal spacing are then set, the spatial spectrum energy at each grid point is calculated using the orthogonality between the steering vector and the noise subspace, and the direction of arrival of the voice signal is determined by searching for the peak of the spatial spectrum energy.
The optimization module uses a spatial grid of large spacing when estimation begins, collects the prior directions of arrival estimated from the first several frequency points, and adds new grid points within the grid intervals of those prior directions, so that the number of grid points grows and becomes non-uniformly distributed. In this way the optimization module adaptively adds spatial grid points to optimize the performance of the algorithm.
The output module transmits the estimated direction of arrival of the voice to the voice enhancement system, so that a subsequent system can enhance the voice.
The foregoing is directed to preferred embodiments of the present invention. The preferred embodiments may be combined with one another wherever they are not clearly contradictory. The specific parameters in the embodiments and examples are intended only to illustrate the inventor's verification process and do not limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the present invention are likewise within the scope of the invention.

Claims (8)

1. A method for estimating the direction of arrival of speech, comprising the steps of:
S1: splitting a broadband voice signal received by a microphone array into a plurality of narrowband voice signals, and dividing the 0-180-degree spatial domain into a plurality of initial spatial grids;
S2: calculating the steering vector and covariance matrix of each frequency point in each narrowband voice signal, separating the signal subspace and the noise subspace, and solving the spatial spectrum energy at each grid point of the spatial grid; the grid point corresponding to the spatial spectrum energy peak gives the estimated direction of arrival;
S3: adding grid points in the grid interval where the target may exist, based on the direction of arrival obtained in step S2, and repeating step S2 on the refined grid to correct the direction of arrival, until the upper limit on grid additions is reached.
2. The method for estimating the direction of arrival of speech according to claim 1, wherein the step of splitting the wideband speech signal into a plurality of narrowband speech signals in step S1 comprises: windowing and framing the voice signal, converting each frame of the time-domain audio signal into the frequency domain, and performing frequency-division processing on the spectrum of the frequency-domain signal, thereby dividing the broadband voice signal into a plurality of narrowband voice signals.
3. The method for estimating the direction of arrival of speech according to claim 1, wherein in step S2, for each narrow band, the steering vector of the i-th frequency point in the band is expressed as:
a(θ_k, f_i) = [1, e^(−j2π f_i d cos θ_k / c), …, e^(−j2π f_i (M−1) d cos θ_k / c)]^T
where θ_k represents the grid point angle of a single initial spatial grid, the subscripts 1, 2, …, k denoting the different grid points; f_i is the frequency of the i-th frequency point; d is the distance between two adjacent microphones in the microphone array; M is the number of microphones; c is the speed of sound; e is the natural constant; and j denotes the imaginary unit.
4. The method according to claim 1, wherein the covariance matrix R_yy of the frequency points in step S2 is expressed as:
R_yy = E[Y(f) Y^H(f)] = A(θ, f) R_s A^H(θ, f) + σ_n² I
where Y(f) represents the frequency-domain received audio signal, A(θ, f) is the array manifold matrix, σ_n² represents the noise variance, I is the identity matrix, R_yy represents the covariance matrix of the signals received by the microphone array, R_s represents the covariance matrix of the voice signal, the superscript H denotes the conjugate transpose operation, θ is the direction of arrival of the voice signal, f is the frequency of the frequency point, and E denotes the expectation operator.
5. The method according to claim 4, wherein separating the signal subspace and the noise subspace in step S2 specifically comprises: decomposing the covariance matrix R_yy of the signals received by the microphone array into a signal subspace and a noise subspace by eigenvalue decomposition, expressed as:
R_yy = U_X Σ_X U_X^H + U_N Σ_N U_N^H
where U_X represents the signal subspace, U_N represents the noise subspace, and Σ_X and Σ_N are the diagonal matrices of eigenvalues of the signal subspace and the noise subspace, respectively.
6. The speech direction-of-arrival estimation method according to claim 4, wherein in step S2 the spatial spectrum energy in each grid direction is estimated using the orthogonality between the steering vector and the noise subspace, and the spatial spectrum energy P(θ, f) in each grid direction is expressed as:
P(θ, f) = 1 / (a^H(θ, f) U_N U_N^H a(θ, f))
The grid point angle θ_k corresponding to the peak of P(θ, f) is the estimated direction of arrival; a(θ, f) denotes the steering vector, the superscript H denotes the conjugate transpose operation, U_N represents the noise subspace, θ is the direction of arrival of the voice signal, and f is the frequency of the frequency point.
7. The method for estimating the direction of arrival of speech according to claim 1, wherein in step S3, grid points are added symmetrically on both sides of the grid point where the estimated direction of arrival is located, and the added grid points should lie within the most recently divided minimum grid.
8. A voice direction-of-arrival estimation device, characterized by comprising an array module, a control module, an input module, an estimation module, an optimization module and an output module connected in sequence:
the array module is a microphone array formed by a plurality of microphones arranged on a horizontal plane;
the control module controls the working state of the voice direction-of-arrival estimation algorithm through recognition of a wake-up word;
the input module processes the voice signal and converts it into frequency point data of different frequencies;
the estimation module selects a frequency band range with distinct voice characteristics and performs a preliminary direction-of-arrival estimate on each frequency point in that range;
the optimization module optimizes the algorithm by adaptively adding spatial grid points;
and the output module transmits the estimated direction of arrival of the voice to the voice enhancement system, so that a subsequent system can enhance the voice.
CN202011011975.6A 2020-09-24 2020-09-24 Voice direction of arrival estimation method and device Pending CN111929638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011011975.6A CN111929638A (en) 2020-09-24 2020-09-24 Voice direction of arrival estimation method and device

Publications (1)

Publication Number Publication Date
CN111929638A true CN111929638A (en) 2020-11-13

Family

ID=73335026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011011975.6A Pending CN111929638A (en) 2020-09-24 2020-09-24 Voice direction of arrival estimation method and device

Country Status (1)

Country Link
CN (1) CN111929638A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112485761A (en) * 2021-02-03 2021-03-12 成都启英泰伦科技有限公司 Sound source positioning method based on double microphones
CN113203987A (en) * 2021-07-05 2021-08-03 成都启英泰伦科技有限公司 Multi-sound-source direction estimation method based on K-means clustering

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101426862B1 (en) * 2013-03-19 2014-08-07 국방과학연구소 3 Dimension Array Antenna System and Altitude Angle Estimation Method thereof
CN108802667A (en) * 2018-05-25 2018-11-13 哈尔滨工程大学 Wave arrival direction estimating method based on generalized orthogonal match tracing
CN108872926A (en) * 2018-07-11 2018-11-23 哈尔滨工程大学 A kind of amplitude and phase error correction and DOA estimation method based on convex optimization
CN109407045A (en) * 2018-10-10 2019-03-01 苏州大学 A kind of non-homogeneous sensor array broadband signal Wave arrival direction estimating method
CN109783960A (en) * 2019-01-23 2019-05-21 桂林电子科技大学 A kind of Wave arrival direction estimating method based on meshing refinement


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU Peng et al., "Two-dimensional DOA estimation based on grid-particle swarm optimization", Journal of Chengdu Technological University *


Similar Documents

Publication Publication Date Title
CN109830245B (en) Multi-speaker voice separation method and system based on beam forming
US10901063B2 (en) Localization algorithm for sound sources with known statistics
CN108986838B (en) Self-adaptive voice separation method based on sound source positioning
CN109272989B (en) Voice wake-up method, apparatus and computer readable storage medium
Araki et al. Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors
CN109490822B (en) Voice DOA estimation method based on ResNet
US10127922B2 (en) Sound source identification apparatus and sound source identification method
CN110223708B (en) Speech enhancement method based on speech processing and related equipment
CN104076331A (en) Sound source positioning method for seven-element microphone array
US8521477B2 (en) Method for separating blind signal and apparatus for performing the same
CN103854660B (en) A kind of four Mike's sound enhancement methods based on independent component analysis
CN108091345B (en) Double-ear voice separation method based on support vector machine
Saruwatari et al. Blind source separation for speech based on fast-convergence algorithm with ICA and beamforming
CN111929638A (en) Voice direction of arrival estimation method and device
CN112034418A (en) Beam scanning method based on frequency domain Bark sub-band and sound source orientation device
CN108538306B (en) Method and device for improving DOA estimation of voice equipment
CN107219512A (en) A kind of sound localization method based on acoustic transfer function
CN110709929A (en) Processing sound data to separate sound sources in a multi-channel signal
CN114245266A (en) Area pickup method and system for small microphone array device
Yin et al. Multi-talker Speech Separation Based on Permutation Invariant Training and Beamforming.
CN116559778B (en) Vehicle whistle positioning method and system based on deep learning
Gao et al. A modified frequency weighted MUSIC algorithm for multiple sound sources localization
CN111060867A (en) Directional microphone microarray direction of arrival estimation method
CN114639398B (en) Broadband DOA estimation method based on microphone array
CN116153324A (en) Virtual array expansion beam forming method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20201113