CN104599679A - Speech signal based focus covariance matrix construction method and device - Google Patents


Info

Publication number
CN104599679A
Authority
CN
China
Prior art keywords
matrix
covariance matrix
focusing
voice signal
sampling frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510052368.7A
Other languages
Chinese (zh)
Inventor
陈喆
殷福亮
张梦晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201510052368.7A priority Critical patent/CN104599679A/en
Publication of CN104599679A publication Critical patent/CN104599679A/en
Priority to PCT/CN2015/082571 priority patent/WO2016119388A1/en
Pending legal-status Critical Current

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)

Abstract

The invention discloses a speech-signal-based method and device for constructing a focused covariance matrix. The method includes: determining the sampling frequency points used by a microphone array when collecting a speech signal; for any one of the determined sampling frequency points, calculating the first covariance matrix of the speech signal collected at that frequency point, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix, and taking the product of the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix as the focused covariance matrix of the speech signal collected at that frequency point; and taking the sum of the focused covariance matrices calculated at all the sampling frequency points as the focused covariance matrix of the speech signal. Because this construction does not require predicting the incidence angle of the sound source, and such predictions inevitably contain errors, the accuracy of the constructed focused covariance matrix is improved.

Description

Method and device for constructing a focused covariance matrix based on a speech signal
Technical field
The present invention relates to the field of speech processing technology, and in particular to a method and device for constructing a focused covariance matrix based on a speech signal.
Background technology
Compared with a single microphone, a microphone array can exploit not only the time-domain and frequency-domain information of a sound source but also its spatial information. It therefore offers strong interference resistance and flexible deployment, performs well on problems such as sound source localization, speech enhancement, and speech recognition, and is now widely used in audio/video conferencing systems, vehicle-mounted systems, hearing aids, human-computer interaction systems, robotics, security monitoring, military reconnaissance, and other fields.
In microphone-array-based speech processing, the number of sound sources often needs to be known in order to achieve good performance; if the number of sources is unknown, or the assumed number is too large or too small, the accuracy of the processing results for the speech captured by the array degrades.
To improve that accuracy, methods for estimating the number of sound sources have been proposed. Estimating the source count requires constructing a focused covariance matrix. Existing constructions, however, must first predict the incidence angle of each sound source, build the focused covariance matrix from the predicted angles, and then estimate the number of sources; if the predicted incidence angle has a large error, the accuracy of the resulting focused covariance matrix is low.
Summary of the invention
Embodiments of the present invention provide a method and device for constructing a focused covariance matrix based on a speech signal, to overcome the low accuracy of focused covariance matrices constructed by prior-art methods.
The specific technical solutions provided by the embodiments of the present invention are as follows:
In a first aspect, a method for constructing a focused covariance matrix based on a speech signal is provided, comprising:
determining the sampling frequency points used by a microphone array when collecting a speech signal;
for any one of the determined sampling frequency points, calculating the first covariance matrix of the speech signal collected at that sampling frequency point, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix, and taking the product of the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix as the focused covariance matrix of the speech signal collected at that sampling frequency point;
taking the sum of the focused covariance matrices calculated at all the sampling frequency points as the focused covariance matrix of the speech signal collected by the microphone array.
With reference to the first aspect, in a first possible implementation, calculating the first covariance matrix specifically comprises:
calculating the first covariance matrix as follows:

R̂(k) = (1/P) · Σ_{i=1}^{P} X_i(k) X_i^H(k),  k = 0, 1, …, N−1

where R̂(k) denotes the first covariance matrix; k denotes the sampling frequency point; P denotes the number of frames of the speech signal collected by the microphone array; X_i(k) denotes the discrete Fourier transform (DFT) value of the microphone array at frame i and sampling frequency point k; X_i^H(k) denotes the conjugate transpose of X_i(k); and N denotes the number of sampling frequency points contained in a frame, which is the same for all frames.
With reference to the first aspect or its first possible implementation, in a second possible implementation, before calculating the focusing transform matrix, the method further comprises:
determining a focusing frequency point among the sampling frequency points used by the microphone array when collecting the speech signal;
calculating the second covariance matrix of the speech signal collected by the microphone array at the focusing frequency point;
and calculating the focusing transform matrix specifically comprises:
performing eigenvalue decomposition on the first covariance matrix to obtain the first eigenvector matrix, and taking the conjugate transpose of the first eigenvector matrix;
performing eigenvalue decomposition on the second covariance matrix to obtain the second eigenvector matrix;
taking the product of the conjugate transpose of the first eigenvector matrix and the second eigenvector matrix as the focusing transform matrix.
With reference to the second possible implementation of the first aspect, in a third possible implementation, calculating the second covariance matrix specifically comprises:
calculating the second covariance matrix as follows:

R̂(k₀) = (1/P) · Σ_{i=1}^{P} X_i(k₀) X_i^H(k₀)

where R̂(k₀) denotes the second covariance matrix; k₀ denotes the focusing frequency point; P denotes the number of frames of the speech signal collected by the microphone array; X_i(k₀) denotes the DFT value of the microphone array at frame i and the focusing frequency point; and X_i^H(k₀) denotes the conjugate transpose of X_i(k₀).
With reference to the second or third possible implementation of the first aspect, in a fourth possible implementation, performing eigenvalue decomposition on the first covariance matrix specifically comprises:
performing the eigenvalue decomposition as follows:

R̂(k) = U(k) Λ U^H(k)

where R̂(k) denotes the first covariance matrix; U(k) denotes the first eigenvector matrix; Λ denotes the diagonal matrix formed by arranging the eigenvalues of R̂(k) in descending order; and U^H(k) denotes the conjugate transpose of U(k).
With reference to the second to fourth possible implementations of the first aspect, in a fifth possible implementation, performing eigenvalue decomposition on the second covariance matrix specifically comprises:
performing the eigenvalue decomposition as follows:

R̂(k₀) = U(k₀) Λ₀ U^H(k₀)

where R̂(k₀) denotes the second covariance matrix; U(k₀) denotes the second eigenvector matrix; Λ₀ denotes the diagonal matrix formed by arranging the eigenvalues of R̂(k₀) in descending order; and U^H(k₀) denotes the conjugate transpose of U(k₀).
With reference to the first to fifth possible implementations of the first aspect, in a sixth possible implementation, X_i(k) has the following form:

X_i(k) = [X_{i1}(k), X_{i2}(k), …, X_{iL}(k)]^T,  i = 0, 1, 2, …, P−1

where X_{i1}(k) denotes the DFT value of the 1st array element of the microphone array at frame i and sampling frequency point k; X_{i2}(k) denotes the DFT value of the 2nd array element at frame i and sampling frequency point k; X_{iL}(k) denotes the DFT value of the L-th array element at frame i and sampling frequency point k; and L is the number of array elements in the microphone array.
In a second aspect, a device for constructing a focused covariance matrix based on a speech signal is provided, comprising:
a determining unit, configured to determine the sampling frequency points used by a microphone array when collecting a speech signal;
a first computing unit, configured to: for any one of the determined sampling frequency points, calculate the first covariance matrix of the speech signal collected at that sampling frequency point, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix, and take the product of the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix as the focused covariance matrix of the speech signal collected at that sampling frequency point;
a second computing unit, configured to take the sum of the focused covariance matrices calculated at all the sampling frequency points as the focused covariance matrix of the speech signal collected by the microphone array.
With reference to the second aspect, in a first possible implementation, when calculating the first covariance matrix, the first computing unit is specifically configured to calculate:

R̂(k) = (1/P) · Σ_{i=1}^{P} X_i(k) X_i^H(k),  k = 0, 1, …, N−1

where R̂(k) denotes the first covariance matrix; k denotes the sampling frequency point; P denotes the number of frames of the speech signal collected by the microphone array; X_i(k) denotes the discrete Fourier transform (DFT) value of the microphone array at frame i and sampling frequency point k; X_i^H(k) denotes the conjugate transpose of X_i(k); and N denotes the number of sampling frequency points contained in a frame, which is the same for all frames.
With reference to the second aspect or its first possible implementation, in a second possible implementation, the determining unit is further configured to determine a focusing frequency point among the sampling frequency points used by the microphone array when collecting the speech signal;
the first computing unit is further configured to calculate the second covariance matrix of the speech signal collected by the microphone array at the focusing frequency point;
and when calculating the focusing transform matrix, the first computing unit is specifically configured to:
perform eigenvalue decomposition on the first covariance matrix to obtain the first eigenvector matrix, and take the conjugate transpose of the first eigenvector matrix;
perform eigenvalue decomposition on the second covariance matrix to obtain the second eigenvector matrix;
take the product of the conjugate transpose of the first eigenvector matrix and the second eigenvector matrix as the focusing transform matrix.
With reference to the second possible implementation of the second aspect, in a third possible implementation, when calculating the second covariance matrix, the first computing unit is specifically configured to calculate:

R̂(k₀) = (1/P) · Σ_{i=1}^{P} X_i(k₀) X_i^H(k₀)

where R̂(k₀) denotes the second covariance matrix; k₀ denotes the focusing frequency point; P denotes the number of frames of the speech signal collected by the microphone array; X_i(k₀) denotes the DFT value of the microphone array at frame i and the focusing frequency point; and X_i^H(k₀) denotes the conjugate transpose of X_i(k₀).
With reference to the second or third possible implementation of the second aspect, in a fourth possible implementation, when performing eigenvalue decomposition on the first covariance matrix, the first computing unit is specifically configured to decompose:

R̂(k) = U(k) Λ U^H(k)

where R̂(k) denotes the first covariance matrix; U(k) denotes the first eigenvector matrix; Λ denotes the diagonal matrix formed by arranging the eigenvalues of R̂(k) in descending order; and U^H(k) denotes the conjugate transpose of U(k).
With reference to the second to fourth possible implementations of the second aspect, in a fifth possible implementation, when performing eigenvalue decomposition on the second covariance matrix, the first computing unit is specifically configured to decompose:

R̂(k₀) = U(k₀) Λ₀ U^H(k₀)

where R̂(k₀) denotes the second covariance matrix; U(k₀) denotes the second eigenvector matrix; Λ₀ denotes the diagonal matrix formed by arranging the eigenvalues of R̂(k₀) in descending order; and U^H(k₀) denotes the conjugate transpose of U(k₀).
With reference to the first to fifth possible implementations of the second aspect, in a sixth possible implementation, X_i(k) has the following form:

X_i(k) = [X_{i1}(k), X_{i2}(k), …, X_{iL}(k)]^T,  i = 0, 1, 2, …, P−1

where X_{i1}(k) denotes the DFT value of the 1st array element of the microphone array at frame i and sampling frequency point k; X_{i2}(k) denotes the DFT value of the 2nd array element at frame i and sampling frequency point k; X_{iL}(k) denotes the DFT value of the L-th array element at frame i and sampling frequency point k; and L is the number of array elements in the microphone array.
The beneficial effects of the present invention are as follows:
The main idea of constructing a focused covariance matrix based on a speech signal in the embodiments of the present invention is: determine the sampling frequency points used by the microphone array when collecting the speech signal; for any one of those sampling frequency points, calculate the first covariance matrix of the speech signal collected at that frequency point, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix, and take the product of the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix as the focused covariance matrix at that frequency point; then take the sum of the focused covariance matrices calculated at all sampling frequency points as the focused covariance matrix of the speech signal. In this scheme, constructing the focused covariance matrix does not require predicting the incidence angle of the sound source, and such predictions always carry errors; the scheme provided by the embodiments therefore improves the accuracy of the constructed focused covariance matrix.
Brief Description of the Drawings
Fig. 1A is a flowchart of constructing a focused covariance matrix based on a speech signal in an embodiment of the present invention;
Fig. 1B is a schematic diagram of the frame shift in an embodiment of the present invention;
Fig. 1C is one comparison between the source-count estimation provided by an embodiment of the present invention and that of CSM-GDE;
Fig. 1D is another comparison between the source-count estimation provided by an embodiment of the present invention and that of CSM-GDE;
Fig. 2 is an embodiment of constructing a focused covariance matrix based on a speech signal in an embodiment of the present invention;
Fig. 3A is a schematic structural diagram of a device for constructing a focused covariance matrix based on a speech signal in an embodiment of the present invention;
Fig. 3B is another schematic structural diagram of a device for constructing a focused covariance matrix based on a speech signal in an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are only used to illustrate and explain the present invention and are not intended to limit it; moreover, the embodiments of the application and the features of the embodiments may be combined with each other provided they do not conflict.
As shown in Fig. 1A, the flow of constructing a focused covariance matrix based on a speech signal in an embodiment of the present invention is as follows:
Step 100: determine the sampling frequency points used by the microphone array when collecting the speech signal;
Step 110: for any one of the determined sampling frequency points, calculate the first covariance matrix of the speech signal collected at that frequency point, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix, and take the product of the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix as the focused covariance matrix of the speech signal collected at that frequency point;
Step 120: take the sum of the focused covariance matrices calculated at all sampling frequency points as the focused covariance matrix of the speech signal collected by the microphone array.
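The three steps above can be sketched end-to-end with NumPy. This is a hedged illustration, not the patent's reference implementation: the array sizes and the focusing bin k0 are arbitrary choices, the data is random, and the multiplication order T(k) = U(k0) U^H(k) is one reading of the focusing-transform product described later in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
L, P, N = 4, 8, 16          # mics, frames, frequency bins (illustrative sizes)
# X[p, k, :] stands in for the DFT snapshot X_p(k) of the array at frame p, bin k.
X = rng.standard_normal((P, N, L)) + 1j * rng.standard_normal((P, N, L))
k0 = N // 2                  # assumed focusing frequency bin

# Step 110a: first covariance matrix R(k) = (1/P) sum_p X_p(k) X_p(k)^H per bin.
R = np.einsum('pki,pkj->kij', X, X.conj()) / P        # shape (N, L, L)

def eigvecs_desc(M):
    _, U = np.linalg.eigh(M)          # eigh returns ascending eigenvalues
    return U[:, ::-1]                 # reorder columns to descending order

# Steps 110b and 120: focusing transform per bin, then sum the focused matrices.
U0 = eigvecs_desc(R[k0])
R_foc = np.zeros((L, L), dtype=complex)
for k in range(N):
    T = U0 @ eigvecs_desc(R[k]).conj().T   # T(k) = U(k0) U^H(k), unitary
    R_foc += T @ R[k] @ T.conj().T

print(np.allclose(R_foc, R_foc.conj().T))  # True: the focused matrix is Hermitian
```

Because each T(k) is unitary, each term T(k) R(k) T^H(k) keeps the eigenvalues of R(k), which is the point of focusing before summing across bins.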
In an embodiment of the present invention, to improve the accuracy of the constructed focused covariance matrix, after the speech signal collected by the microphone array at a sampling frequency point is obtained, and before the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix are calculated, the following operation is also performed:
pre-emphasis is applied to the collected speech signal.
In that case, calculating the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix can optionally proceed as follows:
apply pre-emphasis to the speech signal collected at the sampling frequency point;
calculate the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix from the pre-emphasized speech signal.
In an embodiment of the present invention, pre-emphasis can optionally be applied to the speech signal as follows:

x̂(k) = x(k) − a·x(k−1),  k = 0, 1, 2, …, N−1   (formula one)

where x̂(k) is the pre-emphasized value of the speech signal at the k-th sample; x(k) is the speech signal collected at the k-th sample; x(k−1) is the speech signal collected at the (k−1)-th sample; N is the number of samples; and a is the pre-emphasis factor, optionally a = 0.9375.
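Formula one can be sketched in a few lines of NumPy. One assumption not stated in the text: the sample before the first, x(−1), is taken as 0 so the first output sample is unchanged.

```python
import numpy as np

def preemphasize(x, a=0.9375):
    """Apply x_hat(k) = x(k) - a * x(k-1), with x(-1) assumed to be 0."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    out[1:] -= a * x[:-1]     # subtract the scaled previous sample
    return out

x = np.array([1.0, 1.0, 1.0, 1.0])
print(preemphasize(x))        # [1.     0.0625 0.0625 0.0625]
```

With the suggested a = 0.9375, a constant (low-frequency) signal is almost cancelled, which is the intended high-frequency boost of pre-emphasis.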
Optionally, the form of X_i(k) is as shown in formula two:

X_i(k) = [X_{i1}(k), X_{i2}(k), …, X_{iL}(k)]^T,  i = 0, 1, 2, …, P−1   (formula two)

where X_{i1}(k) denotes the DFT value of the 1st array element of the microphone array at frame i and sampling frequency point k; X_{i2}(k) denotes the DFT value of the 2nd array element at frame i and sampling frequency point k; …; X_{iL}(k) denotes the DFT value of the L-th array element at frame i and sampling frequency point k; L is the number of array elements in the microphone array; and P is the number of frames of the speech signal collected by the microphone array.
In an embodiment of the present invention, to improve the accuracy of the constructed focused covariance matrix, after the speech signal collected by the microphone array at a sampling frequency point is obtained, and before the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix are calculated, the following operation is also performed:
the collected speech signal is divided into frames.
When calculating the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix, one can optionally:
divide the speech signal collected at the sampling frequency point into frames;
calculate the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix from the framed speech signal.
In an embodiment of the present invention, framing is performed with overlap, that is, adjacent frames overlap; the overlapping part is called the frame shift. Optionally, the frame shift is chosen to be half the frame length. Overlapped framing is shown in Fig. 1B.
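The overlapped framing described above, with the frame shift equal to half the frame length, can be sketched as follows; the frame length is a free parameter, and any trailing samples that do not fill a whole frame are simply dropped (an assumption, since the text does not say how the tail is handled).

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split x into overlapping frames; frame shift = half the frame length."""
    shift = frame_len // 2
    n_frames = 1 + (len(x) - frame_len) // shift
    return np.stack([x[i * shift : i * shift + frame_len] for i in range(n_frames)])

frames = frame_signal(np.arange(10), frame_len=4)
print(frames)
# [[0 1 2 3]
#  [2 3 4 5]
#  [4 5 6 7]
#  [6 7 8 9]]
```

Each sample (away from the edges) appears in exactly two frames, which is the 50% overlap shown in Fig. 1B.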
In an embodiment of the present invention, to further improve the accuracy of the constructed focused covariance matrix, after the received speech signal has been divided into frames, a window is applied to the framed speech signal.
Windowing the framed speech signal can be done as follows:
multiply the framed speech signal by a Hamming window function w(k). Optionally, the Hamming window function w(k) is as shown in formula three:

w(k) = 0.54 − 0.46·cos(π(2k+1)/N),  k = 0, 1, …, N−1   (formula three)

where k is the sample index within a frame and N is the number of samples in any frame; every frame contains the same number of samples.
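The windowing step can be sketched as below. Note the argument π(2k+1)/N follows our reconstruction of formula three from the garbled original, which differs slightly from the more common Hamming form 0.54 − 0.46·cos(2πk/(N−1)); both give a symmetric taper.

```python
import numpy as np

def hamming_window(N):
    """Window of formula three: w(k) = 0.54 - 0.46 * cos(pi * (2k + 1) / N)."""
    k = np.arange(N)
    return 0.54 - 0.46 * np.cos(np.pi * (2 * k + 1) / N)

def window_frame(frame):
    """Multiply one frame point-wise by the window."""
    return frame * hamming_window(len(frame))

w = hamming_window(8)
print(np.allclose(w, w[::-1]))   # True: the window is symmetric
```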
In practice, part of the signal collected by the microphone array may be speech from the target object and part may be signals from non-target objects. For example, in a meeting, the noise present before the speaker starts talking comes from non-target objects; once the speaker starts talking, the microphone array collects the speech signal of the target object. A focused covariance matrix constructed from the speech of the target object is more accurate. Therefore, in an embodiment of the present invention, after the speech signal collected by the microphone array is obtained, and before the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix are calculated, the following operations are also performed:
calculate the energy of the speech signal collected in each frame at the sampling frequency point;
determine the frames whose energy reaches a preset energy threshold;
then, when calculating the first covariance matrix, the focusing transform matrix, and the conjugate transpose of the focusing transform matrix, one can optionally:
calculate them from the speech signal collected at the sampling frequency point within the determined frames.
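The frame-selection step above can be sketched as follows: keep only frames whose energy reaches a preset threshold, so the covariance is built from frames that actually contain the target speech. The energy measure (sum of squared magnitudes) and the threshold value are assumptions; the text leaves both open.

```python
import numpy as np

def select_frames(frames, energy_threshold):
    """Keep frames whose energy reaches the preset threshold."""
    energies = np.sum(np.abs(frames) ** 2, axis=1)    # per-frame energy
    return frames[energies >= energy_threshold]

frames = np.array([[0.01, 0.02, 0.01],    # low-energy (noise-only) frame
                   [0.80, -0.90, 0.70]])  # high-energy (speech) frame
kept = select_frames(frames, energy_threshold=0.5)
print(kept.shape)   # (1, 3): only the speech frame survives
```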
In an embodiment of the present invention, there are several ways to calculate the first covariance matrix; optionally, it is calculated as follows:

R̂(k) = (1/P) · Σ_{i=1}^{P} X_i(k) X_i^H(k),  k = 0, 1, …, N−1   (formula four)

where R̂(k) denotes the first covariance matrix; k denotes the sampling frequency point; P denotes the number of frames of the speech signal collected by the microphone array; X_i(k) denotes the DFT (Discrete Fourier Transform) value of the microphone array at frame i and sampling frequency point k; X_i^H(k) denotes the conjugate transpose of X_i(k); and N denotes the number of sampling frequency points contained in a frame, which is the same for all frames.
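A direct transcription of formula four for a single frequency bin, with the P frame snapshots stacked as rows of a (P, L) array, might look like this:

```python
import numpy as np

def covariance_at_bin(X_k):
    """Formula four at one bin: R(k) = (1/P) sum_i X_i(k) X_i(k)^H.

    X_k: complex array of shape (P, L), row i holding the snapshot X_i(k).
    """
    P = X_k.shape[0]
    return X_k.T @ X_k.conj() / P     # (L, L); entry (m, n) = mean of X_im conj(X_in)

# Single-frame check: the result reduces to the outer product x x^H.
x = np.array([[1 + 1j, 2 + 0j]])
R = covariance_at_bin(x)
print(R)   # [[2.+0.j 2.+2.j]
           #  [2.-2.j 4.+0.j]]
```

The result is Hermitian and positive semidefinite by construction, as a covariance estimate should be.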
In an embodiment of the present invention, before the focusing transform matrix is calculated, the following operations are also performed:
determine a focusing frequency point among the sampling frequency points used by the microphone array when collecting the speech signal;
calculate the second covariance matrix of the speech signal collected by the microphone array at the focusing frequency point.
The focusing transform matrix can then optionally be calculated as follows:
perform eigenvalue decomposition on the first covariance matrix to obtain the first eigenvector matrix, and take its conjugate transpose;
perform eigenvalue decomposition on the second covariance matrix to obtain the second eigenvector matrix;
take the product of the conjugate transpose of the first eigenvector matrix and the second eigenvector matrix as the focusing transform matrix.
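The focusing-transform steps above can be sketched as follows. The multiplication order T(k) = U(k0) U^H(k) is our reading of "the product of the conjugate transpose of the first eigenvector matrix and the second eigenvector matrix"; the text lists the factors without fixing the order.

```python
import numpy as np

def eigvecs_descending(R):
    """Eigenvector matrix of a Hermitian R, columns in descending eigenvalue order."""
    _, U = np.linalg.eigh(R)    # eigh yields ascending eigenvalues
    return U[:, ::-1]

def focusing_transform(R_k, R_k0):
    """T(k) = U(k0) U^H(k) from the first (R_k) and second (R_k0) covariances."""
    return eigvecs_descending(R_k0) @ eigvecs_descending(R_k).conj().T

# Hermitian positive-semidefinite test matrices standing in for the two covariances.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
T = focusing_transform(A @ A.conj().T, B @ B.conj().T)
print(np.allclose(T @ T.conj().T, np.eye(3)))   # True: T is unitary
```

Since T(k) is a product of two unitary eigenvector matrices, it is itself unitary, so the focused term T(k) R̂(k) T^H(k) preserves the eigenvalues of R̂(k).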
In an embodiment of the present invention, the second covariance matrix can optionally be calculated as follows:

R̂(k₀) = (1/P) · Σ_{i=1}^{P} X_i(k₀) X_i^H(k₀)   (formula five)

where R̂(k₀) denotes the second covariance matrix; k₀ denotes the focusing frequency point; P denotes the number of frames of the speech signal collected by the microphone array; X_i(k₀) denotes the DFT value of the microphone array at frame i and the focusing frequency point; and X_i^H(k₀) denotes the conjugate transpose of X_i(k₀).
In an embodiment of the present invention, the eigenvalue decomposition of the first covariance matrix can optionally proceed as follows:

R̂(k) = U(k) Λ U^H(k)   (formula six)

where R̂(k) denotes the first covariance matrix; U(k) denotes the first eigenvector matrix; Λ denotes the diagonal matrix formed by arranging the eigenvalues of R̂(k) in descending order; and U^H(k) denotes the conjugate transpose of U(k).
In the embodiment of the present invention, when performing eigenvalue decomposition on the second covariance matrix, optionally, the following way may be used:
Perform eigenvalue decomposition on the second covariance matrix as:
\hat{R}(k_0) = U(k_0) \Lambda_0 U^H(k_0) (formula seven)
Wherein, \hat{R}(k_0) denotes the second covariance matrix, U(k_0) denotes the second eigenvector matrix, \Lambda_0 denotes the diagonal matrix formed by arranging the eigenvalues in descending order, and U^H(k_0) denotes the conjugate transpose matrix of U(k_0).
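Formulas six and seven require the eigenvalues arranged in descending order. One detail worth a sketch: NumPy's Hermitian eigensolver returns them in ascending order, so both the eigenvalues and the corresponding eigenvector columns must be reversed (names here are illustrative):

```python
import numpy as np

def eig_descending(R):
    """Eigendecomposition R = U diag(w) U^H with w sorted in descending order.

    R must be Hermitian, which a covariance matrix is. numpy.linalg.eigh
    returns ascending eigenvalues, so both the eigenvalues and the matching
    eigenvector columns are reversed here.
    """
    w, U = np.linalg.eigh(R)
    return w[::-1], U[:, ::-1]
```

For a covariance matrix, `eigh` is both faster and numerically safer than the general `eig`; its fixed ascending order is the only reason for the reversal step.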
In the embodiment of the present invention, optionally, the form of X_i(k) is as shown in formula two. In the embodiment of the present invention, after the focusing covariance matrix is calculated, the number of sound sources can be calculated according to the obtained focusing covariance matrix; when doing so, optionally, the following way may be used:
The Gerschgorin disk criterion is adopted to calculate the number of sound sources according to the obtained focusing covariance matrix. For example: in an indoor environment, the room size is 10m × 10m × 3m, and the eight vertex coordinates are (0,0,0), (0,10,0), (0,10,2.5), (0,0,2.5), (10,0,0), (10,10,0), (10,10,2.5) and (10,0,2.5). A uniform linear array composed of 10 microphones is laid out between the points (2,4,1.3) and (2,4.9,1.3); the inter-element spacing is 0.1m, and each array element is an isotropic omnidirectional microphone. The positions of the 6 speakers are (8,1,1.3), (8,2.6,1.3), (8,4.2,1.3), (8,5.8,1.3), (8,7.4,1.3) and (8,9,1.3), and the background noise is assumed to be white Gaussian noise. The image model is used to simulate the propagation of the speakers' speech to the microphone array, the voice signal is sampled at an 8kHz sampling frequency, and the signal received by the microphone array is obtained. The wall reflection coefficient is γ = 0.8 and the number of iterations is 20. The speakers' voice signals last long enough that a different data segment is taken in each experiment; 50 tests are carried out, and the detection probability is as follows:
detection probability = (number of tests in which the number of sound sources is correctly detected) / (total number of tests) (formula eight)
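The Gerschgorin disk estimation step is only named in the text, not spelled out. Below is a sketch under the assumption that it follows the common GDE formulation (eigendecompose the leading principal submatrix of the focusing covariance matrix, take the transformed last column as the Gerschgorin disk radii, and compare each radius against a D-weighted mean radius); the function and variable names are illustrative, and the exact criterion used by the patent may differ:

```python
import numpy as np

def gde_source_count(R, D=0.7):
    """Sketch of a Gerschgorin disk estimator (GDE) for the source number,
    applied to an M x M focusing covariance matrix R.

    Assumed formulation: diagonalize the leading (M-1) x (M-1) block of R;
    the magnitudes of the transformed last column are the Gerschgorin
    radii, and the source count is the number of disks whose radius
    exceeds the D-weighted average radius.
    """
    M = R.shape[0]
    R1 = R[:-1, :-1]                  # leading principal submatrix
    r = R[:-1, -1]                    # last column without the corner entry
    _, U1 = np.linalg.eigh(R1)
    radii = np.abs(U1.conj().T @ r)   # Gerschgorin radii of transformed R
    radii = np.sort(radii)[::-1]      # largest disks first
    avg = D * radii.sum() / (M - 1)
    # GDE criterion: disks with radius above the weighted average
    # correspond to signal sources
    return int(np.sum(radii - avg > 0))
```

The parameter D plays the role of D(K) = 0.7 in the experiments below: a larger D shrinks all disks relative to the threshold and makes the detector more conservative.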
If the actual number of speakers is 2, any frame contains 128 sampling frequencies, the number of frames is 100, the parameter in the Gerschgorin disk criterion is D(K) = 0.7, and the signal-to-noise ratio (SNR) changes from −5dB to 5dB with a step of 1dB, then the detection probability versus SNR of the method of constructing the focusing covariance matrix provided by the embodiment of the present invention and of the existing CSM (Coherent Signal Subspace Method)-GDE (Gerschgorin Disk Estimator) method is contrasted as shown in Figure 1C. It can be seen from Figure 1C that the detection probability of the CSM-GDE method reaches 0.9 when the SNR is 0dB and reaches 1 when the SNR is 4dB. When the SNR is less than 0dB, the correct detection probability of the scheme provided by the present invention is distinctly higher than that of the CSM-GDE method; the detection probability reaches 0.9 when the SNR is −3dB, and the correct detection probability reaches 1 at −3dB.
If the actual number of speakers is 2, the SNR is 10dB, any frame contains 128 sampling frequencies, and the number of frames changes from 5 to 70 with a step of 5, then the detection probability versus the number of frames of the method of constructing the focusing covariance matrix provided by the embodiment of the present invention and of the existing CSM-GDE method is contrasted as shown in Figure 1D. It can be seen from Figure 1D that the detection probability of the CSM-GDE method reaches 0.9 when the number of frames is 40 and reaches 1 when the number of frames is 65. When the number of frames is less than 50, the detection probability of the present scheme is distinctly higher than that of the CSM-GDE method; the detection probability reaches 0.9 when the number of frames is 25 and reaches 1 when the number of frames is 50.
Table 1 gives a performance comparison, for different numbers of speakers, between the method of calculating the number of sound sources from the focusing covariance matrix constructed according to the present scheme and the CSM-GDE method. In this experiment, the SNR is 10dB, the subframe length is 128 points, and the number of frames is 100. As shown in Table 1, when the actual number of speakers is 2 or 3, both methods reach a detection probability of 1; when the actual number of speakers is greater than 3, the detection probability gradually declines as the number of speakers increases, and for the same number of speakers the method based on the focusing covariance matrix constructed according to the present scheme has a higher detection probability than the CSM-GDE method.
Table 1 Detection probability versus the actual number of speakers

Actual number of speakers    2      3      4      5      6
CSM-GDE                      1      1      0.94   0.84   0.66
The present scheme           1      1      0.98   0.90   0.72
In the embodiment of the present invention, adopting the Gerschgorin disk criterion to calculate the number of sound sources according to the obtained focusing covariance matrix is a relatively common way in the art, and is not described in detail here.
In order to understand the embodiment of the present invention better, a specific application scenario is given below, and the process of constructing the focusing covariance matrix based on the voice signal is described in further detail, as shown in Figure 2:
Step 200: determine that the sampling frequencies adopted when the microphone array collects the voice signal are 100 in number: sampling frequency 0, sampling frequency 1, sampling frequency 2, ..., sampling frequency 99;
Step 210: for sampling frequency 0, calculate the first covariance matrix for sampling frequency 0;
Step 220: determine the focusing frequency among the 100 sampling frequencies;
Step 230: calculate the second covariance matrix of the voice signal collected by the microphone array at the focusing frequency;
Step 240: perform eigenvalue decomposition on the first covariance matrix to obtain the first eigenvector matrix, and perform conjugate transposition on the first eigenvector matrix to obtain the conjugate transpose matrix of the first eigenvector matrix;
Step 250: perform eigenvalue decomposition on the second covariance matrix to obtain the second eigenvector matrix;
Step 260: take the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as the focusing transform matrix, and perform conjugate transposition on the focusing transform matrix to obtain the conjugate transpose matrix of the focusing transform matrix;
Step 270: take the product of the focusing transform matrix, the first covariance matrix and the conjugate transpose matrix of the focusing transform matrix as the focusing covariance matrix of the voice signal collected at sampling frequency 0;
Step 280: calculate the focusing covariance matrices for the other sampling frequencies in the same way as the focusing covariance matrix for sampling frequency 0 is calculated, and take the sum of the focusing covariance matrices for all sampling frequencies as the focusing covariance matrix of the voice signal collected by the microphone array.
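The steps above can be sketched end-to-end in NumPy. This is a non-authoritative illustration: the (frames, microphones, bins) array layout and the factor ordering T(k) = U(k0) U^H(k) for the focusing transform are assumptions (the text lists the two factors of the product without fixing the notation):

```python
import numpy as np

def focusing_covariance(X, k0):
    """Construct the focusing covariance matrix following the step 200-280 flow.

    X : complex array (P frames, L mics, N bins) of per-frame DFT values.
    k0: index of the chosen focusing bin.
    Returns sum_k T(k) R(k) T(k)^H with T(k) = U(k0) U(k)^H, where U(.)
    are the eigenvector matrices of the per-bin and focusing covariances.
    """
    P, L, N = X.shape

    def cov(k):                        # per-bin covariance estimate
        S = X[:, :, k]
        return S.T @ S.conj() / P

    def eigvecs(R):                    # eigenvectors, eigenvalues descending
        _, U = np.linalg.eigh(R)
        return U[:, ::-1]

    U0 = eigvecs(cov(k0))              # second eigenvector matrix, U(k0)
    R_foc = np.zeros((L, L), dtype=complex)
    for k in range(N):
        Rk = cov(k)                    # first covariance matrix at bin k
        T = U0 @ eigvecs(Rk).conj().T  # focusing transform matrix T(k)
        R_foc += T @ Rk @ T.conj().T   # step 270, summed over bins (step 280)
    return R_foc
```

Because each T(k) is unitary, every focused term keeps the eigenvalue mass of its R(k); the summation then aligns all bins into one matrix suitable for the source-number estimation described earlier.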
Based on the technical scheme of the above related method, and referring to Figure 3A, the embodiment of the present invention provides a device for constructing a focusing covariance matrix based on a voice signal. The device comprises a determining unit 30, a first computing unit 31 and a second computing unit 32, wherein:
the determining unit 30 is configured to determine the sampling frequencies adopted by the microphone array when collecting the voice signal;
the first computing unit 31 is configured to, for any one of the determined sampling frequencies, calculate the first covariance matrix and the focusing transform matrix of the voice signal collected at that sampling frequency, as well as the conjugate transpose matrix of the focusing transform matrix, and take the product of the focusing transform matrix, the first covariance matrix and the conjugate transpose matrix of the focusing transform matrix as the focusing covariance matrix of the voice signal collected at that sampling frequency;
the second computing unit 32 is configured to take the sum of the calculated focusing covariance matrices of the voice signal collected at each sampling frequency as the focusing covariance matrix of the voice signal collected by the microphone array.
Optionally, when calculating the first covariance matrix, the first computing unit 31 is specifically configured to:
calculate the first covariance matrix as:
\hat{R}(k) = \frac{1}{P} \sum_{i=1}^{P} X_i(k) X_i^H(k), \quad k = 0, 1, \ldots, N-1
Wherein, \hat{R}(k) denotes the first covariance matrix, k denotes any sampling frequency, P denotes the number of frames of the voice signal collected by the microphone array, X_i(k) denotes the discrete Fourier transform (DFT) value of the microphone array at any frame and any sampling frequency, X_i^H(k) denotes the conjugate transpose matrix of X_i(k), and N denotes the number of sampling frequencies contained in any frame; any two different frames contain the same number of sampling frequencies.
Further, the determining unit 30 is also configured to determine the focusing frequency among the sampling frequencies adopted by the microphone array when collecting the voice signal;
the first computing unit 31 is also configured to calculate the second covariance matrix of the voice signal collected by the microphone array at the focusing frequency;
when calculating the focusing transform matrix, the first computing unit 31 is specifically configured to:
perform eigenvalue decomposition on the first covariance matrix to obtain the first eigenvector matrix, and perform conjugate transposition on the first eigenvector matrix to obtain the conjugate transpose matrix of the first eigenvector matrix;
perform eigenvalue decomposition on the second covariance matrix to obtain the second eigenvector matrix;
take the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as the focusing transform matrix.
Optionally, when calculating the second covariance matrix, the first computing unit 31 is specifically configured to:
calculate the second covariance matrix as:
\hat{R}(k_0) = \frac{1}{P} \sum_{i=1}^{P} X_i(k_0) X_i^H(k_0)
Wherein, \hat{R}(k_0) denotes the second covariance matrix, k_0 denotes the focusing frequency, P denotes the number of frames of the voice signal collected by the microphone array, X_i(k_0) denotes the DFT value of the microphone array at any frame and the focusing frequency, and X_i^H(k_0) denotes the conjugate transpose matrix of X_i(k_0).
Optionally, when performing eigenvalue decomposition on the first covariance matrix, the first computing unit 31 is specifically configured to:
perform eigenvalue decomposition on the first covariance matrix as:
\hat{R}(k) = U(k) \Lambda U^H(k)
Wherein, \hat{R}(k) denotes the first covariance matrix, U(k) denotes the first eigenvector matrix, \Lambda denotes the diagonal matrix formed by arranging the eigenvalues in descending order, and U^H(k) denotes the conjugate transpose matrix of U(k).
Optionally, when performing eigenvalue decomposition on the second covariance matrix, the first computing unit 31 is specifically configured to:
perform eigenvalue decomposition on the second covariance matrix as:
\hat{R}(k_0) = U(k_0) \Lambda_0 U^H(k_0)
Wherein, \hat{R}(k_0) denotes the second covariance matrix, U(k_0) denotes the second eigenvector matrix, \Lambda_0 denotes the diagonal matrix formed by arranging the eigenvalues in descending order, and U^H(k_0) denotes the conjugate transpose matrix of U(k_0).
Optionally, the form of X_i(k) is as follows:
X_i(k) = [X_{i1}(k), X_{i2}(k), ..., X_{iL}(k)]^T, i = 0, 1, 2, ..., P-1
Wherein, X_{i1}(k) denotes the DFT value of the 1st array element of the microphone array at the i-th frame and the k-th sampling frequency, X_{i2}(k) denotes the DFT value of the 2nd array element of the microphone array at the i-th frame and the k-th sampling frequency, ..., X_{iL}(k) denotes the DFT value of the L-th array element of the microphone array at the i-th frame and the k-th sampling frequency, and L is the number of array elements that the microphone array comprises.
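The snapshot vector X_i(k) is simply the stack of per-microphone DFT values for frame i. A sketch of building the full set of snapshot vectors from multichannel time-domain frames; the (P, L, N) layout and the function name are illustrative assumptions:

```python
import numpy as np

def dft_snapshots(frames):
    """Build the X_i(k) snapshot vectors from time-domain frames.

    frames: real array (P, L, N) -- P frames, L microphones, N samples
            per frame (N is also the number of frequency bins).
    Returns a complex array (P, L, N) whose [i, :, k] slice is
    X_i(k) = [X_i1(k), ..., X_iL(k)]^T.
    """
    # N-point DFT along the time axis of every microphone channel
    return np.fft.fft(frames, axis=-1)
```

For the scenario above (100 frames, 10 microphones, 128-point subframes), the result has shape (100, 10, 128), and `X[i, :, k]` feeds directly into the covariance estimates of formulas one and five.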
As shown in Figure 3B, another structural schematic of the device for constructing a focusing covariance matrix based on a voice signal provided by the embodiment of the present invention comprises at least one processor 301, a communication bus 302, a memory 303 and at least one communication interface 304.
The communication bus 302 is used to realize the connection and communication between the above components, and the communication interface 304 is used to connect and communicate with external devices.
The memory 303 is used to store executable program code; by executing this program code, the processor 301 is used to:
determine the sampling frequencies adopted by the microphone array when collecting the voice signal;
for any one of the determined sampling frequencies, calculate the first covariance matrix and the focusing transform matrix of the voice signal collected at that sampling frequency, as well as the conjugate transpose matrix of the focusing transform matrix, and take the product of the focusing transform matrix, the first covariance matrix and the conjugate transpose matrix of the focusing transform matrix as the focusing covariance matrix of the voice signal collected at that sampling frequency;
take the sum of the calculated focusing covariance matrices of the voice signal collected at each sampling frequency as the focusing covariance matrix of the voice signal collected by the microphone array.
Optionally, when calculating the first covariance matrix, the processor 301 is specifically configured to:
calculate the first covariance matrix as:
\hat{R}(k) = \frac{1}{P} \sum_{i=1}^{P} X_i(k) X_i^H(k), \quad k = 0, 1, \ldots, N-1
Wherein, \hat{R}(k) denotes the first covariance matrix, k denotes any sampling frequency, P denotes the number of frames of the voice signal collected by the microphone array, X_i(k) denotes the discrete Fourier transform (DFT) value of the microphone array at any frame and any sampling frequency, X_i^H(k) denotes the conjugate transpose matrix of X_i(k), and N denotes the number of sampling frequencies contained in any frame; any two different frames contain the same number of sampling frequencies.
Further, before calculating the focusing transform matrix, the processor 301 also:
determines the focusing frequency among the sampling frequencies adopted by the microphone array when collecting the voice signal;
calculates the second covariance matrix of the voice signal collected by the microphone array at the focusing frequency;
and calculating the focusing transform matrix specifically comprises:
performing eigenvalue decomposition on the first covariance matrix to obtain the first eigenvector matrix, and performing conjugate transposition on the first eigenvector matrix to obtain the conjugate transpose matrix of the first eigenvector matrix;
performing eigenvalue decomposition on the second covariance matrix to obtain the second eigenvector matrix;
taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as the focusing transform matrix.
Optionally, when calculating the second covariance matrix, the processor 301 is specifically configured to:
calculate the second covariance matrix as:
\hat{R}(k_0) = \frac{1}{P} \sum_{i=1}^{P} X_i(k_0) X_i^H(k_0)
Wherein, \hat{R}(k_0) denotes the second covariance matrix, k_0 denotes the focusing frequency, P denotes the number of frames of the voice signal collected by the microphone array, X_i(k_0) denotes the DFT value of the microphone array at any frame and the focusing frequency, and X_i^H(k_0) denotes the conjugate transpose matrix of X_i(k_0).
Optionally, when performing eigenvalue decomposition on the first covariance matrix, the processor 301 is specifically configured to:
perform eigenvalue decomposition on the first covariance matrix as:
\hat{R}(k) = U(k) \Lambda U^H(k)
Wherein, \hat{R}(k) denotes the first covariance matrix, U(k) denotes the first eigenvector matrix, \Lambda denotes the diagonal matrix formed by arranging the eigenvalues in descending order, and U^H(k) denotes the conjugate transpose matrix of U(k).
Optionally, when performing eigenvalue decomposition on the second covariance matrix, the processor 301 is specifically configured to:
perform eigenvalue decomposition on the second covariance matrix as:
\hat{R}(k_0) = U(k_0) \Lambda_0 U^H(k_0)
Wherein, \hat{R}(k_0) denotes the second covariance matrix, U(k_0) denotes the second eigenvector matrix, \Lambda_0 denotes the diagonal matrix formed by arranging the eigenvalues in descending order, and U^H(k_0) denotes the conjugate transpose matrix of U(k_0).
In the embodiment of the present invention, optionally, the form of X_i(k) is as follows:
X_i(k) = [X_{i1}(k), X_{i2}(k), ..., X_{iL}(k)]^T, i = 0, 1, 2, ..., P-1
Wherein, X_{i1}(k) denotes the DFT value of the 1st array element of the microphone array at the i-th frame and the k-th sampling frequency, X_{i2}(k) denotes the DFT value of the 2nd array element of the microphone array at the i-th frame and the k-th sampling frequency, ..., X_{iL}(k) denotes the DFT value of the L-th array element of the microphone array at the i-th frame and the k-th sampling frequency, and L is the number of array elements that the microphone array comprises.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or other programmable data processing device to work in a specific way, so that the instructions stored in the computer-readable memory produce a manufacture comprising an instruction device, the instruction device realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a sequence of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the preferred embodiments of the present invention have been described, those skilled in the art, once they learn of the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these changes and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to cover these changes and variations.

Claims (14)

1. A method for constructing a focusing covariance matrix based on a voice signal, characterized by comprising:
determining the sampling frequencies adopted by a microphone array when collecting the voice signal;
for any one of the determined sampling frequencies, calculating a first covariance matrix and a focusing transform matrix of the voice signal collected at said sampling frequency, as well as a conjugate transpose matrix of said focusing transform matrix, and taking the product of said focusing transform matrix, said first covariance matrix and the conjugate transpose matrix of said focusing transform matrix as the focusing covariance matrix of the voice signal collected at said sampling frequency;
taking the sum of the calculated focusing covariance matrices of the voice signal collected at each sampling frequency as the focusing covariance matrix of the voice signal collected by said microphone array.
2. The method according to claim 1, characterized in that calculating said first covariance matrix specifically comprises:
calculating said first covariance matrix as:
\hat{R}(k) = \frac{1}{P} \sum_{i=1}^{P} X_i(k) X_i^H(k), \quad k = 0, 1, \ldots, N-1
wherein said \hat{R}(k) denotes said first covariance matrix, said k denotes said sampling frequency, said P denotes the number of frames of said voice signal collected by said microphone array, said X_i(k) denotes the discrete Fourier transform (DFT) value of said microphone array at any frame and said sampling frequency, said X_i^H(k) denotes the conjugate transpose matrix of said X_i(k), and said N denotes the number of sampling frequencies contained in any frame, the number of sampling frequencies contained in any two different frames being identical.
3. The method according to claim 1 or 2, characterized in that, before calculating said focusing transform matrix, the method further comprises:
determining the focusing frequency among the sampling frequencies adopted by said microphone array when collecting the voice signal;
calculating a second covariance matrix of the voice signal collected by said microphone array at said focusing frequency;
and that calculating said focusing transform matrix specifically comprises:
performing eigenvalue decomposition on said first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transposition on said first eigenvector matrix to obtain the conjugate transpose matrix of said first eigenvector matrix;
performing eigenvalue decomposition on said second covariance matrix to obtain a second eigenvector matrix;
taking the product of the conjugate transpose matrix of said first eigenvector matrix and said second eigenvector matrix as said focusing transform matrix.
4. The method according to claim 3, characterized in that calculating said second covariance matrix specifically comprises:
calculating said second covariance matrix as:
\hat{R}(k_0) = \frac{1}{P} \sum_{i=1}^{P} X_i(k_0) X_i^H(k_0)
wherein said \hat{R}(k_0) denotes said second covariance matrix, said k_0 denotes said focusing frequency, said P denotes the number of frames of said voice signal collected by said microphone array, said X_i(k_0) denotes the DFT value of said microphone array at any frame and said focusing frequency, and said X_i^H(k_0) denotes the conjugate transpose matrix of said X_i(k_0).
5. The method according to claim 3 or 4, characterized in that performing eigenvalue decomposition on said first covariance matrix specifically comprises:
performing eigenvalue decomposition on said first covariance matrix as:
\hat{R}(k) = U(k) \Lambda U^H(k)
wherein said \hat{R}(k) denotes said first covariance matrix, said U(k) denotes said first eigenvector matrix, said \Lambda denotes the diagonal matrix formed by arranging said eigenvalues in descending order, and said U^H(k) denotes the conjugate transpose matrix of said U(k).
6. The method according to any one of claims 3-5, characterized in that performing eigenvalue decomposition on said second covariance matrix specifically comprises:
performing eigenvalue decomposition on said second covariance matrix as:
\hat{R}(k_0) = U(k_0) \Lambda_0 U^H(k_0)
wherein said \hat{R}(k_0) denotes said second covariance matrix, said U(k_0) denotes said second eigenvector matrix, said \Lambda_0 denotes the diagonal matrix formed by arranging said eigenvalues in descending order, and said U^H(k_0) denotes the conjugate transpose matrix of said U(k_0).
7. The method according to any one of claims 2-6, characterized in that the form of said X_i(k) is as follows:
X_i(k) = [X_{i1}(k), X_{i2}(k), ..., X_{iL}(k)]^T, i = 0, 1, 2, ..., P-1
wherein X_{i1}(k) denotes the DFT value of the 1st array element of said microphone array at the i-th frame and the k-th sampling frequency, X_{i2}(k) denotes the DFT value of the 2nd array element of said microphone array at the i-th frame and the k-th sampling frequency, ..., X_{iL}(k) denotes the DFT value of the L-th array element of said microphone array at the i-th frame and the k-th sampling frequency, and said L is the number of array elements comprised in said microphone array.
8. A device for constructing a focusing covariance matrix based on a voice signal, characterized by comprising:
a determining unit, configured to determine the sampling frequencies adopted by a microphone array when collecting the voice signal;
a first computing unit, configured to, for any one of the determined sampling frequencies, calculate a first covariance matrix and a focusing transform matrix of the voice signal collected at said sampling frequency, as well as a conjugate transpose matrix of said focusing transform matrix, and take the product of said focusing transform matrix, said first covariance matrix and the conjugate transpose matrix of said focusing transform matrix as the focusing covariance matrix of the voice signal collected at said sampling frequency;
a second computing unit, configured to take the sum of the calculated focusing covariance matrices of the voice signal collected at each sampling frequency as the focusing covariance matrix of the voice signal collected by said microphone array.
9. The device according to claim 8, characterized in that, when calculating said first covariance matrix, said first computing unit is specifically configured to:
calculate said first covariance matrix as:
\hat{R}(k) = \frac{1}{P} \sum_{i=1}^{P} X_i(k) X_i^H(k), \quad k = 0, 1, \ldots, N-1
wherein said \hat{R}(k) denotes said first covariance matrix, said k denotes said sampling frequency, said P denotes the number of frames of said voice signal collected by said microphone array, said X_i(k) denotes the discrete Fourier transform (DFT) value of said microphone array at any frame and said sampling frequency, said X_i^H(k) denotes the conjugate transpose matrix of said X_i(k), and said N denotes the number of sampling frequencies contained in any frame, the number of sampling frequencies contained in any two different frames being identical.
10. The device according to claim 8 or 9, characterized in that said determining unit is also configured to determine the focusing frequency among the sampling frequencies adopted by said microphone array when collecting the voice signal;
said first computing unit is also configured to calculate a second covariance matrix of the voice signal collected by said microphone array at said focusing frequency;
and, when calculating said focusing transform matrix, said first computing unit is specifically configured to:
perform eigenvalue decomposition on said first covariance matrix to obtain a first eigenvector matrix, and perform conjugate transposition on said first eigenvector matrix to obtain the conjugate transpose matrix of said first eigenvector matrix;
perform eigenvalue decomposition on said second covariance matrix to obtain a second eigenvector matrix;
take the product of the conjugate transpose matrix of said first eigenvector matrix and said second eigenvector matrix as said focusing transform matrix.
11. The device according to claim 10, characterized in that, when calculating said second covariance matrix, said first computing unit is specifically configured to:
calculate said second covariance matrix as:
\hat{R}(k_0) = \frac{1}{P} \sum_{i=1}^{P} X_i(k_0) X_i^H(k_0)
wherein said \hat{R}(k_0) denotes said second covariance matrix, said k_0 denotes said focusing frequency, said P denotes the number of frames of said voice signal collected by said microphone array, said X_i(k_0) denotes the DFT value of said microphone array at any frame and said focusing frequency, and said X_i^H(k_0) denotes the conjugate transpose matrix of said X_i(k_0).
12. The device according to claim 10 or 11, characterized in that, when performing eigenvalue decomposition on said first covariance matrix, said first computing unit is specifically configured to:
perform eigenvalue decomposition on said first covariance matrix as:
\hat{R}(k) = U(k) \Lambda U^H(k)
wherein said \hat{R}(k) denotes said first covariance matrix, said U(k) denotes said first eigenvector matrix, said \Lambda denotes the diagonal matrix formed by arranging said eigenvalues in descending order, and said U^H(k) denotes the conjugate transpose matrix of said U(k).
13. The device according to any one of claims 10-12, characterized in that, when performing eigenvalue decomposition on said second covariance matrix, said first computing unit is specifically configured to:
perform eigenvalue decomposition on said second covariance matrix as:
\hat{R}(k_0) = U(k_0) \Lambda_0 U^H(k_0)
wherein said \hat{R}(k_0) denotes said second covariance matrix, said U(k_0) denotes said second eigenvector matrix, said \Lambda_0 denotes the diagonal matrix formed by arranging said eigenvalues in descending order, and said U^H(k_0) denotes the conjugate transpose matrix of said U(k_0).
14. The device according to any one of claims 9-13, characterized in that the form of said X_i(k) is as follows:
X_i(k) = [X_{i1}(k), X_{i2}(k), ..., X_{iL}(k)]^T, i = 0, 1, 2, ..., P-1
wherein X_{i1}(k) denotes the DFT value of the 1st array element of said microphone array at the i-th frame and the k-th sampling frequency, X_{i2}(k) denotes the DFT value of the 2nd array element of said microphone array at the i-th frame and the k-th sampling frequency, ..., X_{iL}(k) denotes the DFT value of the L-th array element of said microphone array at the i-th frame and the k-th sampling frequency, and said L is the number of array elements comprised in said microphone array.
CN201510052368.7A 2015-01-30 2015-01-30 Speech signal based focus covariance matrix construction method and device Pending CN104599679A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510052368.7A CN104599679A (en) 2015-01-30 2015-01-30 Speech signal based focus covariance matrix construction method and device
PCT/CN2015/082571 WO2016119388A1 (en) 2015-01-30 2015-06-26 Method and device for constructing focus covariance matrix on the basis of voice signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510052368.7A CN104599679A (en) 2015-01-30 2015-01-30 Speech signal based focus covariance matrix construction method and device

Publications (1)

Publication Number Publication Date
CN104599679A true CN104599679A (en) 2015-05-06

Family

ID=53125412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510052368.7A Pending CN104599679A (en) 2015-01-30 2015-01-30 Speech signal based focus covariance matrix construction method and device

Country Status (2)

Country Link
CN (1) CN104599679A (en)
WO (1) WO2016119388A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016119388A1 (en) * 2015-01-30 2016-08-04 华为技术有限公司 Method and device for constructing focus covariance matrix on the basis of voice signal
CN108538306A (en) * 2017-12-29 2018-09-14 北京声智科技有限公司 Improve the method and device of speech ciphering equipment DOA estimations
CN110992977A (en) * 2019-12-03 2020-04-10 北京声智科技有限公司 Method and device for extracting target sound source

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN110501727B (en) * 2019-08-13 2023-10-20 中国航空工业集团公司西安飞行自动控制研究所 Satellite navigation anti-interference method based on space-frequency adaptive filtering
CN111696570B (en) * 2020-08-17 2020-11-24 北京声智科技有限公司 Voice signal processing method, device, equipment and storage medium
CN113409804A (en) * 2020-12-22 2021-09-17 声耕智能科技(西安)研究院有限公司 Multichannel frequency domain speech enhancement algorithm based on variable-span generalized subspace

Citations (4)

Publication number Priority date Publication date Assignee Title
US20040220800A1 (en) * 2003-05-02 2004-11-04 Samsung Electronics Co., Ltd Microphone array method and system, and speech recognition method and system using the same
CN102568493A (en) * 2012-02-24 2012-07-11 大连理工大学 Underdetermined blind source separation (UBSS) method based on maximum matrix diagonal rate
CN102664666A (en) * 2012-04-09 2012-09-12 电子科技大学 Efficient robust self-adapting beam forming method of broadband
CN104166120A (en) * 2014-07-04 2014-11-26 哈尔滨工程大学 Acoustic vector circular matrix steady broadband MVDR orientation estimation method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN102621527B (en) * 2012-03-20 2014-06-11 哈尔滨工程大学 Broad band coherent source azimuth estimating method based on data reconstruction
CN104599679A (en) * 2015-01-30 2015-05-06 华为技术有限公司 Speech signal based focus covariance matrix construction method and device

Also Published As

Publication number Publication date
WO2016119388A1 (en) 2016-08-04

Similar Documents

Publication Publication Date Title
CN104599679A (en) Speech signal based focus covariance matrix construction method and device
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN103308889B (en) Passive sound source two-dimensional DOA (direction of arrival) estimation method under complex environment
US20180262832A1 (en) Sound Signal Processing Apparatus and Method for Enhancing a Sound Signal
CN101430882B (en) Method and apparatus for restraining wind noise
CN103346845B (en) Based on blind frequency spectrum sensing method and the device of fast Fourier transform
Dorfan et al. Tree-based recursive expectation-maximization algorithm for localization of acoustic sources
CN105301563B (en) A kind of double sound source localization method that least square method is converted based on consistent focusing
CN102707262A (en) Sound localization system based on microphone array
CN103871420B (en) The signal processing method of microphone array and device
CN111856402B (en) Signal processing method and device, storage medium and electronic device
CN113593548B (en) Method and device for waking up intelligent equipment, storage medium and electronic device
CN105609112A (en) Sound source positioning method and apparatus and time delay estimation method and apparatus
CN112802486B (en) Noise suppression method and device and electronic equipment
CN115267671A (en) Distributed voice interaction terminal equipment and sound source positioning method and device thereof
JPWO2018003158A1 (en) Correlation function generation device, correlation function generation method, correlation function generation program and wave source direction estimation device
CN102568473A (en) Method and device for recording voice signals
CN104410762A (en) Steady echo cancellation method in hand free cell phone conversation system
CN104424954B (en) noise estimation method and device
CN109282819B (en) Ultra-wideband positioning method based on distributed hybrid filtering
CN116631438A (en) Width learning and secondary correlation sound source positioning method based on minimum p norm
CN105676167B (en) A kind of robust monolingual sound source DOA method of estimation converted based on acoustics vector sensor and bispectrum
CN113948101A (en) Noise suppression method and device based on spatial discrimination detection
CN113035174A (en) Voice recognition processing method, device, equipment and system
CN111354341A (en) Voice awakening method and device, processor, sound box and television

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150506