CN110858485A - Voice enhancement method, device, equipment and storage medium - Google Patents


Info

Publication number
CN110858485A
CN110858485A (application CN201810967670.9A; granted as CN110858485B)
Authority
CN
China
Prior art keywords
voice
value
noise
microphones
order difference
Prior art date
Legal status
Granted
Application number
CN201810967670.9A
Other languages
Chinese (zh)
Other versions
CN110858485B (en)
Inventor
刘章
余涛
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810967670.9A
Publication of CN110858485A
Application granted
Publication of CN110858485B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming


Abstract

The disclosure provides a speech enhancement method, apparatus, device, and storage medium. The outputs of two microphones in a microphone array are subtracted to obtain a first-order difference output; the first-order difference output is compared with a predetermined threshold; a mask value of each frequency point is determined based on the comparison result, the mask value characterizing the masking of speech by noise in the noisy speech; and speech enhancement is performed based on the mask values. The speech enhancement scheme based on the differential mask has almost no delay, is not affected by directional speech interference, and can effectively improve the speech recognition success rate in noisy scenes such as subway ticket vending machines.

Description

Voice enhancement method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech enhancement, and in particular, to a speech enhancement method, apparatus, device, and storage medium.
Background
With the development of artificial-intelligence voice technology, the demand for human-computer voice interaction on many traditional devices is increasing, for example subway ticket vending machines. However, successful application in subway ticket-buying venues requires coping with a highly noisy environment. The noises include: babble noise from crowds speaking, interference from speakers around the ticket buyer, noise generated by crowd movement, mechanical noise of moving subway trains, interference from loudspeaker announcements, and the like. This high level of noise poses a great challenge to speech recognition: because existing acoustic-model techniques cannot effectively overcome the influence of babble noise and human-voice interference, speech recognition performance degrades sharply in highly noisy environments.
Therefore, there is a need for a speech enhancement scheme for noisy scenes.
Disclosure of Invention
An object of the present disclosure is to provide a speech enhancement scheme capable of improving a speech enhancement effect.
According to a first aspect of the present disclosure, a speech enhancement method is proposed, comprising: subtracting the outputs of two microphones in the microphone array to obtain a first-order difference output; comparing the first order difference output to a predetermined threshold; determining a hidden value of each frequency point based on the comparison result, wherein the hidden value is used for representing the voice shielding condition of noise in the voice with noise; and speech enhancement based on the masked values.
Optionally, the step of determining the masked values of the frequency points includes: the masked value of the bin when the first order difference output is less than the predetermined threshold value is determined to be 1, and the masked value of the bin when the first order difference output is greater than or equal to the predetermined threshold value is determined to be 0.
Optionally, the step of determining the masked values of the frequency points includes: determining a hidden value estimation result of each first-order difference output based on a result of comparing the plurality of first-order difference outputs with a predetermined threshold value, respectively; and determining the final hidden value of the frequency point based on the hidden values corresponding to the same frequency point in a plurality of hidden value estimation results.
Optionally, the step of determining the final masked value of the frequency point includes: and taking the product of the hidden values corresponding to the same frequency point in a plurality of hidden value estimation results as the final hidden value of the frequency point.
Optionally, the first order difference output is equal to the product of the filter coefficients and a matrix of time-frequency domain data for the two microphones.
Optionally, the filter coefficients are

    h(ω) = (1 / (jωτ₀(1 - α))) · [1 - jωτ₀α, -1]^T

where h(ω) is the filter coefficient vector, τ₀ is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null.
Optionally, the speech enhancement method further comprises calculating relative angles of the two microphones to the speaker based on the speaker's sound source location information, and determining α in the filter coefficients based on the relative angles.
Optionally, the step of calculating the relative angles of the two microphones to the speaker comprises: determining a first direction vector from the centers of the two microphones to the speaker; determining a second directional vector from one of the two microphones to the other microphone; based on the first direction vector and the second direction vector, a relative angle is calculated.
Optionally, the step of performing speech enhancement based on the masked value comprises: calculating a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise based on the masked values; and performing voice enhancement by using a beam forming algorithm based on the first correlation matrix and the second correlation matrix.
Optionally, the first correlation matrix is a covariance matrix of corresponding speech portions extracted from the time-frequency domain data output by the microphone array based on the concealment values, and the second correlation matrix is a covariance matrix of corresponding noise portions extracted from the time-frequency domain data output by the microphone array based on the concealment values.
According to a second aspect of the present disclosure, there is also provided a speech enhancement apparatus comprising: the difference module is used for subtracting the outputs of two microphones in the microphone array to obtain a first-order difference output; a comparison module for comparing the first order difference output with a predetermined threshold; the determining module is used for determining the hidden value of each frequency point based on the comparison result, wherein the hidden value is used for representing the voice shielding condition of noise in the voice with noise; and a speech enhancement module for performing speech enhancement based on the masked value.
Optionally, the determining module determines the masked value of the frequency bin when the first order difference output is smaller than a predetermined threshold value as 1, and determines the masked value of the frequency bin when the first order difference output is greater than or equal to the predetermined threshold value as 0.
Optionally, the determining module determines a hidden value estimation result of each first-order difference output based on a result of comparing each of the plurality of first-order difference outputs with a predetermined threshold, and determines a final hidden value of the frequency point based on a hidden value corresponding to the same frequency point in the plurality of hidden value estimation results.
Optionally, the determining module uses a product of the concealment values corresponding to the same frequency point in the plurality of concealment value estimation results as a final concealment value of the frequency point.
Optionally, the first order difference output is equal to the product of the filter coefficients and a matrix of time-frequency domain data for the two microphones.
Optionally, the filter coefficients are

    h(ω) = (1 / (jωτ₀(1 - α))) · [1 - jωτ₀α, -1]^T

where h(ω) is the filter coefficient vector, τ₀ is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null.
Optionally, the speech enhancement apparatus further includes an angle calculation module for calculating the relative angle of the two microphones to the speaker based on the sound source position information of the speaker, and a coefficient determination module for determining α in the filter coefficients based on the relative angle.
Optionally, the angle calculation module comprises: the first direction vector determining module is used for determining a first direction vector from the centers of the two microphones to the speaker; a second direction vector determination module for determining a second direction vector from one of the two microphones to the other microphone; and the calculation submodule is used for calculating the relative angle based on the first direction vector and the second direction vector.
Optionally, the speech enhancement module includes a matrix calculation module for calculating a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise based on the masked values; and a beam forming module for performing voice enhancement by using a beam forming algorithm based on the first correlation matrix and the second correlation matrix.
Optionally, the first correlation matrix is a covariance matrix of corresponding speech portions extracted from the time-frequency domain data output by the microphone array based on the concealment values, and the second correlation matrix is a covariance matrix of corresponding noise portions extracted from the time-frequency domain data output by the microphone array based on the concealment values.
According to a third aspect of the present disclosure, there is also provided an apparatus for supporting a voice interaction function, including: a microphone array for receiving a sound input; and the terminal processor is used for subtracting the outputs of the two microphones in the microphone array to obtain a first-order difference output, comparing the first-order difference output with a preset threshold value, determining the hidden value of each frequency point based on the comparison result, and performing voice enhancement based on the hidden value, wherein the hidden value is used for representing the shielding condition of noise in the voice with noise to the voice.
Optionally, the device further includes a communication module, configured to send the voice-enhanced voice data to a server.
Optionally, the device is any one of: a ticket purchasing machine; an intelligent sound box; a robot; an automobile.
According to a fourth aspect of the present disclosure, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform a method as set forth in the first aspect of the disclosure.
According to a fifth aspect of the present disclosure, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as set forth in the first aspect of the present disclosure.
The method performs mask estimation based on the first-order difference output of the microphone array, depends only on sound source localization information, and can achieve zero or very small delay. It is also unaffected by directional human-voice interference. Therefore, the differential-mask-based speech enhancement scheme can realize real-time speech enhancement and effectively improve the speech recognition success rate in noisy scenes such as subway ticket vending machines.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 is a schematic diagram showing a microphone array composed of 4 microphones.
FIG. 2 is a schematic flow chart diagram illustrating a method of speech enhancement according to an embodiment of the present disclosure.
Fig. 3 is a block diagram illustrating a structure of an apparatus for supporting a voice interaction function according to an embodiment of the present disclosure.
FIG. 4 is an overall flow diagram illustrating a speech enhancement scheme according to an embodiment of the present disclosure.
Fig. 5 is a schematic block diagram illustrating the structure of a speech enhancement apparatus according to an embodiment of the present disclosure.
Fig. 6 is a schematic block diagram showing the structure of functional modules that the speech enhancement module in fig. 5 may have.
FIG. 7 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
[ term interpretation ]
(1) Speech enhancement
Speech enhancement is a technique for extracting a useful speech signal from a noise background to suppress and reduce noise interference when the speech signal is interfered or even submerged by various noises.
(2) Microphone array
The microphone array is an array formed by arranging a group of omnidirectional microphones at different spatial positions according to a certain shape rule, and is a device for carrying out spatial sampling on a spatial propagation sound signal, and the acquired signal comprises spatial position information of the spatial propagation sound signal. The array can be divided into a near-field model and a far-field model according to the distance between the sound source and the microphone array. According to the topology of the microphone array, the microphone array can be divided into a linear array, a planar array, a volume array, and the like.
(3) Zero trap
The lowest gain point in the beam pattern.
(4)MVDR
MVDR (Minimum Variance Distortionless Response) is an adaptive beamforming algorithm based on the maximum signal-to-noise ratio (SNR) criterion. The MVDR algorithm can adaptively minimize the power of the array output in the desired direction while maximizing the signal-to-noise ratio.
(5)mask
A mask (masking value, also rendered herein as a hidden or concealment value) characterizes the masking of speech by noise in noisy speech. In general, masks are mainly classified into the Ideal Binary Mask (IBM) and the Ideal Ratio Mask (IRM). The mask referred to in this disclosure may be considered an IBM.
The IBM divides the audio signal into different sub-bands according to auditory perception characteristics and, based on the signal-to-noise ratio of each time-frequency unit, sets the mask of that unit to 0 (when noise dominates, suppressing its energy) or to 1 (when the target speech dominates, keeping its energy as it is).
The IRM is also calculated for each time-frequency unit, but unlike the IBM's "either 0 or 1", the IRM computes the energy ratio between the speech signal and the noise to obtain a number between 0 and 1, and then scales the energy of the time-frequency unit accordingly. The IRM is an evolution of the IBM; it reflects the degree of noise suppression in each time-frequency unit and can further improve the quality and intelligibility of the separated speech.
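As an illustrative sketch (not part of the patent text), the IBM and IRM definitions above can be written directly from per-unit speech and noise energies; the function name and the 0 dB IBM threshold are assumptions for illustration:

```python
import numpy as np

def ideal_masks(speech_power, noise_power, snr_threshold_db=0.0):
    """Illustrative IBM/IRM from per-time-frequency-unit energies.

    IBM: 1 where local SNR exceeds the threshold (target-speech dominated),
         0 otherwise (noise dominated).
    IRM: speech-to-total energy ratio, a value between 0 and 1.
    """
    eps = 1e-12  # guard against division by zero and log of zero
    snr_db = 10.0 * np.log10(speech_power / (noise_power + eps) + eps)
    ibm = (snr_db > snr_threshold_db).astype(float)
    irm = speech_power / (speech_power + noise_power + eps)
    return ibm, irm
```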
[ scheme overview ]
The speech enhancement technology based on the microphone array signal processing can greatly improve the signal-to-noise ratio and the speech recognition performance, so that the speech recognition rate can be improved by installing the microphone array on a speech interaction device (such as a ticket buying machine) and by an effective enhancement scheme.
The current leading speech enhancement schemes adopt a mask estimation framework: first the mask of the target frequency points is estimated, and then beamforming is used for spatial filtering to realize speech enhancement. Common mask estimation techniques are the clustering-based CGMM (Complex Gaussian Mixture Model) and the neural-network-based nn-mask, but both schemes have the drawback that the mask cannot be estimated in real time and directional human-voice interference cannot be handled.
In view of the above, the present disclosure provides a mask estimation scheme based on a differential method, where reliable mask estimation values can be obtained by combining one or more sets of differential microphones. Based on the obtained mask estimation value, a correlation matrix corresponding to the voice and a correlation matrix corresponding to the noise can be calculated, and then spatial filtering is performed by using beam forming methods such as MVDR or GEV (generalized eigenvector) to realize voice enhancement.
The following further describes aspects of the present disclosure.
[ mask estimation scheme based on differential principle ]
A. First order differential microphone principle
Using the difference of spatial sound pressure, the outputs of two omnidirectional microphones are subtracted to obtain a first-order difference output. The filter coefficients of the first-order difference microphone can be expressed as follows:

    h(ω) = (1 / (jωτ₀(1 - α))) · [1 - jωτ₀α, -1]^T

where τ₀ is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null. The resulting beam pattern can be described by

    B(θ) = (cos θ - α) / (1 - α)
where θ is the relative angle of the two microphones to the speaker and α is used to adjust the direction of the differential null. From the definition of the null, cos θ = α gives the null angle. Therefore, the first-order differential filtering proposed by the present disclosure places the null in a specified direction (θ, i.e., the direction of the speaker relative to the two microphones) in order to obtain the differential mask.
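A minimal numeric sketch of this beam pattern (function name assumed; not from the patent): choosing α = cos θ_null yields unit gain toward θ = 0° and a null at θ_null:

```python
import numpy as np

def beam_pattern(theta_rad, alpha):
    """First-order differential beam pattern B(θ) = (cos θ - α) / (1 - α)."""
    return (np.cos(theta_rad) - alpha) / (1.0 - alpha)

# Place the null at 60 degrees: distortionless toward θ = 0, zero response at θ = 60°.
alpha = np.cos(np.deg2rad(60.0))
```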
The derivation of the formula for calculating the coefficients of the first order difference filter is described below, and details related to the calculation formula and known in the art are not repeated in the present disclosure.
The difference array represents the difference of spatial sound pressure: the first-order difference of the sound pressure can be obtained by subtracting the outputs of two nearby omnidirectional microphones, and similarly N microphones can yield at most the (N-1)-order difference of the sound pressure. In designing a differential array, an important condition is that the microphone spacing be small, much smaller than the wavelength of the sound signal, so that the finite difference of the microphone outputs approximates the derivative of the actual sound pressure field.
In a first-order difference array, two microphones are required, with two constraints: 1. no distortion in the target direction, i.e., a gain of 1 in the speaker direction (the θ mentioned above, the direction of the speaker relative to the two microphones); 2. the null lies in the interval 0° < θ < 180°.
These two constraints are expressed mathematically as follows:

    d^T(ω, cos 0°) h(ω) = d^T(ω, 1) h(ω) = 1
    d^T(ω, α₁,₁) h(ω) = β₁,₁

where h(ω) is the filter coefficient vector, d(ω, ·) is the steering vector under the condition in parentheses (so d^T(·) h(ω) is the array response in that direction), ω is the angular frequency, and α₁,₁ is a parameter for adjusting the direction of the null; α₁,₁ = cos θ₁,₁ places the null in the specified direction, and therefore β₁,₁ = 0.
The above formula can be expressed in matrix form as follows:

    [ d^T(ω, 1)    ]           [  1   ]
    [ d^T(ω, α₁,₁) ] h(ω)  =   [ β₁,₁ ]

This formula may be further expressed with a Vandermonde matrix,

    [ 1   e^(-jωτ₀)      ]           [ 1 ]
    [ 1   e^(-jωτ₀·α₁,₁) ] h(ω)  =   [ 0 ]

where τ₀ is the distance between the two microphones divided by the speed of sound. Solving this matrix equation yields the first-order difference filter

    h(ω) = (1 / (e^(-jωτ₀·α₁,₁) - e^(-jωτ₀))) · [ e^(-jωτ₀·α₁,₁), -1 ]^T
when the microphone separation is much smaller than the signal wavelength, the following mathematical assumption can be made: e.g. of the typexThe value is approximately equal to 1+ x, the mathematical assumption is applied to the formula for simplification, and a calculation formula of the filter coefficient of the first-order difference microphone can be obtained
Figure BDA0001775372290000083
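Under the small-spacing assumption, the coefficient formula can be evaluated numerically; the following sketch (function name assumed) also checks the two design constraints approximately: unit response toward the target and near-zero response at the null.

```python
import numpy as np

def first_order_diff_filter(omega, tau0, alpha):
    """Approximate first-order difference filter
    h(ω) = (1 / (jωτ0(1 - α))) · [1 - jωτ0·α, -1]^T,
    valid when ω·τ0 (spacing relative to wavelength) is small."""
    jwt = 1j * omega * tau0
    return (1.0 / (jwt * (1.0 - alpha))) * np.array([1.0 - jwt * alpha, -1.0])
```

The response in a direction with cosine c is [1, e^(-jωτ0·c)] · h(ω); for small ωτ0 it is approximately 1 at c = 1 (target) and approximately 0 at c = α (null).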
B. Differential mask acquisition
When the null of the differential microphone is aligned in a given direction, speech in that direction may be masked. By using the principle, the time-frequency point in the null direction can be found on the time-frequency diagram, and the method for acquiring the differential mask is described below by taking a linear microphone array composed of 4 microphones shown in fig. 1 as an example. It should be appreciated that the differential mask acquisition method of the present disclosure may also be applicable to other numbers or forms of microphone arrays.
1. First, the relative angle between the mic1/mic2 pair and the speaker can be calculated from the sound source position information. Let the direction vector from the center of mic1 and mic2 to the speaker be (x_s, y_s), and the direction vector from mic2 to mic1 be (x₁₂, y₁₂). Then

    cos(θ₁) = (x_s·x₁₂ + y_s·y₁₂) / (√(x_s² + y_s²) · √(x₁₂² + y₁₂²))
Substituting cos(θ₁) directly for α in the formula in A yields the filter coefficients H₁₂(ω) of mic1 and mic2. Similarly, the filter coefficients H₃₄(ω) of mic3 and mic4 can be obtained.
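The cosine of the relative angle is simply the normalized dot product of the two direction vectors; a small sketch (names assumed, not from the patent):

```python
import numpy as np

def cos_relative_angle(v_to_speaker, v_mic_axis):
    """cos(θ) between the speaker direction (x_s, y_s) and the mic-pair axis (x12, y12)."""
    a = np.asarray(v_to_speaker, dtype=float)
    b = np.asarray(v_mic_axis, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The returned value is then substituted for α when computing the filter coefficients of that microphone pair.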
2. Let the time-frequency domain data of mic1, mic2, mic3, and mic4 be X₁(t,ω), X₂(t,ω), X₃(t,ω), and X₄(t,ω). The differential output of mic1 and mic2 is calculated as

    X₁₂(t,ω) = H₁₂^T(ω) · [X₁(t,ω), X₂(t,ω)]^T

and the differential output X₃₄(t,ω) of mic3 and mic4 is calculated in the same way.
3. If the frequency point (t,ω) comes from the differential null of mic1 and mic2, the modulus of X₁₂(t,ω) is theoretically zero and will be much smaller than the moduli of the original X₁(t,ω) and X₂(t,ω). A threshold e can therefore be set: when

    |X₁₂(t,ω)| < e

the frequency point (t,ω) can be considered to come from the target speaker, i.e., the mask at (t,ω) is M₁₂(t,ω) = 1; otherwise M₁₂(t,ω) = 0. The specific value of the threshold e can be determined empirically.
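Steps 2 and 3 amount to filtering each time-frequency bin with the pair's coefficients and thresholding the magnitude; a hedged sketch (array shapes and names assumed):

```python
import numpy as np

def differential_mask(X1, X2, h12, eps):
    """Binary mask from a first-order difference output.

    X1, X2: (T, F) complex time-frequency data of the two microphones.
    h12:    (2, F) per-frequency filter coefficients of the pair.
    eps:    empirically chosen threshold.
    Returns a (T, F) mask: 1 where the bin is attributed to the target speaker.
    """
    X12 = h12[0] * X1 + h12[1] * X2  # first-order difference output per bin
    return (np.abs(X12) < eps).astype(float)
```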
As shown in the beam pattern curve of fig. 1, when the null is close to one end of the array axis, the response in directions near that end is also small, which easily leads to estimation errors. Multiple sets of first-order difference microphones may be used to overcome this drawback.
Specifically, M₃₄(t,ω), M₁₃(t,ω), M₂₃(t,ω), and M₂₄(t,ω) can be estimated in the same way, and the final mask output may be the product of the masks of the several sets of first-order difference microphones. For example, the final mask output may be expressed as

    M(t,ω) = M₁₂(t,ω) · M₃₄(t,ω) · M₂₃(t,ω) · M₁₃(t,ω) · M₂₄(t,ω)

where M(t,ω) is the final mask output, M₁₂(t,ω) is the differential mask estimate (i.e., the concealment value estimation result described below) obtained from mic1 and mic2, M₃₄(t,ω) that from mic3 and mic4, M₂₃(t,ω) that from mic2 and mic3, M₁₃(t,ω) that from mic1 and mic3, and M₂₄(t,ω) that from mic2 and mic4.
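Combining the pairwise estimates is then an elementwise product (illustrative sketch, names assumed):

```python
import numpy as np

def combine_masks(masks):
    """Final mask M(t, ω) as the elementwise product of pairwise differential masks."""
    out = np.ones_like(masks[0])
    for m in masks:
        out = out * m  # a bin survives only if every pair attributes it to the speaker
    return out
```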
[ Speech enhancement ]
After the speech mask information is obtained, speech enhancement can be performed based on it, for example by spatial filtering using beamforming. Speech enhancement can also be implemented based on a subspace approach, whose basic idea is to compute an autocorrelation or covariance matrix of the signal, divide the noisy speech signal into a useful-signal subspace and a noise subspace, and reconstruct the signal using the useful-signal subspace, thereby obtaining the enhanced signal.
It can be seen that the subspace approach requires the construction of a covariance matrix using noisy speech signals, which is then decomposed to obtain a signal subspace and a noise subspace. In the present disclosure, a correlation matrix (e.g., a covariance matrix) of a voice and a correlation matrix (e.g., a covariance matrix) of a noise may be quickly calculated according to mask information obtained based on a differential mask obtaining manner, where the correlation matrix of the voice may represent a signal subspace corresponding to the voice, and the correlation matrix of the noise may represent a noise subspace corresponding to the noise. Thus, the calculated correlation matrix can be applied in a beamforming algorithm to achieve speech enhancement.
As an example of the present disclosure, the correlation matrix of the speech may be the covariance matrix of the speech part extracted, based on the acquired mask information, from the time-frequency domain data output by the microphone array, and the correlation matrix of the noise may be the covariance matrix of the noise part extracted in the same way. For example, the correlation matrices of noise and speech can be calculated by the following formulas, respectively:

    R_NN = E_t((1 - M(t,ω)) · (X(t,ω) X(t,ω)^H))
    R_SS = E_t(M(t,ω) · (X(t,ω) X(t,ω)^H))

where R_NN is the correlation matrix of the noise, R_SS is the correlation matrix of the speech, E_t denotes the expectation (mean) over time, M(t,ω) is the finally calculated mask value of each frequency point (t,ω), X(t,ω) is the time-frequency domain data output by the microphone array, and X(t,ω)^H denotes the conjugate transpose of X(t,ω).
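A sketch of these two covariance estimates (array shapes and names assumed; E_t is realized as a mean over frames):

```python
import numpy as np

def masked_correlation_matrices(X, M):
    """Speech and noise correlation (covariance) matrices from masked TF data.

    X: (T, F, C) complex multichannel time-frequency data.
    M: (T, F) mask values in [0, 1].
    Returns (R_ss, R_nn), each of shape (F, C, C).
    """
    outer = np.einsum('tfc,tfd->tfcd', X, X.conj())        # X(t,ω) X(t,ω)^H per bin
    R_ss = np.einsum('tf,tfcd->fcd', M, outer) / X.shape[0]
    R_nn = np.einsum('tf,tfcd->fcd', 1.0 - M, outer) / X.shape[0]
    return R_ss, R_nn
```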
After obtaining the correlation matrix of voice and noise, the voice may be enhanced based on a plurality of beamforming algorithms, for example, the above-mentioned beamforming algorithms such as MVDR, GEV, etc. may be used to implement voice enhancement. The specific implementation process of the beamforming algorithm is not described herein.
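As one hedged example of the beamforming step (a sketch under assumptions, not the patent's prescribed implementation), MVDR weights can be formed per frequency as w = R_nn⁻¹ d / (d^H R_nn⁻¹ d), taking the steering vector d from the principal eigenvector of the speech correlation matrix:

```python
import numpy as np

def mvdr_weights(R_nn, R_ss, diag_load=1e-6):
    """MVDR beamformer weights for one frequency bin.

    R_nn, R_ss: (C, C) noise / speech correlation matrices.
    diag_load:  small diagonal loading for numerical stability.
    """
    C = R_nn.shape[-1]
    Rn = R_nn + diag_load * np.eye(C)
    # Steering-vector estimate: principal eigenvector of the speech covariance
    # (eigh returns eigenvalues in ascending order, so the last column is principal).
    _, vecs = np.linalg.eigh(R_ss)
    d = vecs[:, -1]
    num = np.linalg.solve(Rn, d)
    return num / (d.conj() @ num)
```

The enhanced spectrum for that bin is then w^H X(t, ω).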
So far, the implementation principles of the mask estimation scheme based on the differential principle and the speech enhancement scheme based on the mask of the present disclosure are briefly explained.
[ Speech enhancement method ]
The present disclosure may be implemented as a speech enhancement method. FIG. 2 is a schematic flow chart diagram illustrating a method of speech enhancement according to an embodiment of the present disclosure.
Referring to fig. 2, in step S210, the outputs of two microphones in the microphone array are subtracted to obtain a first order difference output.
The microphone array may be mounted on a voice interaction device, such as a ticket purchaser. The difference of spatial sound pressures can be used to subtract the outputs of two microphones in the microphone array to obtain a first order difference output. It should be noted that, here, the outputs of any pair of microphones in the microphone array may be subtracted to obtain one first-order differential output, or the outputs of each pair of microphones may be subtracted to obtain a plurality of first-order differential outputs.
In this embodiment, the first-order difference output may be equal to the product of the filter coefficients and a matrix of the time-frequency domain data of the corresponding two microphones. Taking the two microphones mic1 and mic2 as an example, the first-order difference output of mic1 and mic2 is
X12(t, ω) = H12(ω) · [X1(t, ω), X2(t, ω)]^T
where H12(ω) is the filter-coefficient vector for mic1 and mic2, X1(t, ω) is the time-frequency data of mic1, and X2(t, ω) is the time-frequency data of mic2. As described above, the filter coefficients may be expressed as
h(ω) = [1, −e^(−jωτ0α)]
where h(ω) is the filter-coefficient vector, τ0 is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null.
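As an illustrative sketch, not part of the original disclosure, the first-order difference output described above could be computed over STFT data as follows; the function name, array shapes, and sign conventions are assumptions:

```python
import numpy as np

def first_order_difference(X1, X2, d, alpha, fs, c=343.0):
    """First-order differential output of one microphone pair.

    X1, X2 : complex STFT matrices of shape (n_frames, n_bins)
    d      : spacing between the two microphones in meters
    alpha  : null-steering parameter (cosine of the null angle)
    fs     : sampling rate in Hz
    c      : speed of sound in m/s
    """
    n_bins = X1.shape[1]
    tau0 = d / c  # inter-microphone propagation delay
    # angular frequency of each STFT bin (one-sided spectrum assumed)
    omega = 2.0 * np.pi * np.linspace(0.0, fs / 2.0, n_bins)
    # apply h(w) = [1, -exp(-j*w*tau0*alpha)] per frequency bin
    phase = np.exp(-1j * omega * tau0 * alpha)
    return X1 - phase[None, :] * X2
```

Under this convention, a plane wave arriving from the direction with cos θ = α cancels exactly in the differential output.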
In this embodiment, the relative angle between the two microphones and the speaker may also be calculated based on the sound source position information of the speaker, and the parameter α of the filter coefficients may be determined based on that relative angle.
It should be emphasized that the beam pattern can be described as
B(θ, ω) = 1 − e^(−jωτ0(α − cos θ))
where θ satisfying cos θ = α is the null angle, and cos θ is the cosine of the relative angle between the two microphones and the speaker corresponding to the filter coefficients.
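The beam pattern above can be evaluated numerically. The following sketch (function name and conventions assumed, not taken from the disclosure) confirms that the magnitude response vanishes at the null angle cos θ = α and is non-negligible elsewhere:

```python
import numpy as np

def beam_pattern(theta, omega, tau0, alpha):
    """Magnitude response |B(theta, w)| = |1 - exp(-j*w*tau0*(alpha - cos(theta)))|.

    theta : arrival angle in radians
    omega : angular frequency in rad/s
    tau0  : microphone spacing divided by the speed of sound
    alpha : null-steering parameter (cosine of the null angle)
    """
    return np.abs(1.0 - np.exp(-1j * omega * tau0 * (alpha - np.cos(theta))))
```

For example, with a 5 cm pair at 1 kHz and alpha = 0.5, the response at θ = arccos(0.5) is zero while the response from the opposite direction is clearly larger.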
In step S220, the first order difference output is compared with a predetermined threshold.
In step S230, the mask value of each frequency point is determined based on the comparison result.
The mask value is used to characterize how the noise in the noisy speech masks the speech. In the present disclosure, the mask value may refer to an Ideal Binary Mask (IBM); for a description of the IBM, refer to the explanation above, which is not repeated here.
As can be seen from the beam-pattern formula, when the null of the differential microphone pair is aimed at a specified direction, speech from that direction is suppressed. If a frequency point (t, ω) comes from the null direction of mic1 and mic2, the modulus of X12(t, ω) is theoretically zero and will be much smaller than the moduli of the original X1(t, ω) and X2(t, ω). Therefore, a predetermined threshold may be set, and each first-order difference output compared against it: the mask value of a frequency point (t, ω) is 1 when the first-order difference output is smaller than the threshold, and 0 when it is greater than or equal to the threshold.
For example, a threshold ε may be set; when
|X12(t, ω)| / max(|X1(t, ω)|, |X2(t, ω)|) < ε
the frequency point (t, ω) is considered to come from the target speaker, i.e., the mask at the frequency point (t, ω) is M12(t, ω) = 1; otherwise M12(t, ω) = 0. The specific value of the threshold ε can be determined empirically.
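A possible form of this threshold test is sketched below. The normalized ratio criterion and the name `estimate_mask` are assumptions for illustration, not necessarily the exact criterion of the disclosure:

```python
import numpy as np

def estimate_mask(X12, X1, X2, eps=0.1):
    """Binary mask from one differential output (assumed ratio test).

    A bin is attributed to the target (mask = 1) when the differential
    output is small relative to the stronger of the two raw channels.
    X12, X1, X2 : complex arrays of shape (n_frames, n_bins)
    eps         : empirically chosen threshold
    """
    ref = np.maximum(np.abs(X1), np.abs(X2)) + 1e-12  # avoid division by zero
    return (np.abs(X12) / ref < eps).astype(float)
```

In practice the threshold eps would be tuned on representative noisy recordings.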
As shown in the beam-pattern curve of FIG. 1, when the null is close to one end of the array axis, the response in directions near that end is also small, which easily causes mask estimation errors. The present disclosure therefore proposes determining a mask estimation result for each first-order difference output by comparing a plurality of first-order difference outputs with the predetermined threshold, and then determining the final mask value of a frequency point from the mask values corresponding to that frequency point across the multiple estimation results, for example by taking the product of the mask values corresponding to the same frequency point as the final mask value of that frequency point.
For example, assuming a microphone array composed of the four microphones mic1, mic2, mic3, and mic4, the mask estimates M12(t, ω), M34(t, ω), M13(t, ω), M23(t, ω), and M24(t, ω) of multiple first-order difference microphone pairs can be obtained using the above method. The product of the mask values corresponding to the same frequency point can then be taken as the mask value of that frequency point, i.e., the final mask output may be written as
M(t, ω) = M12(t, ω) · M34(t, ω) · M23(t, ω) · M13(t, ω) · M24(t, ω)
In this way, the influence of a null lying close to one end can be overcome.
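The elementwise product of the per-pair mask estimates described above can be sketched as follows (function name assumed):

```python
import numpy as np

def combine_masks(masks):
    """Final mask = elementwise product of per-pair mask estimates.

    masks : list of real arrays, each of shape (n_frames, n_bins)
    """
    return np.prod(np.stack(masks, axis=0), axis=0)
```

A frequency point thus keeps a mask of 1 only if every microphone pair attributed it to the target direction.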
In step S240, speech enhancement is performed based on the masked value.
Based on the determined mask value, spatial filtering can be performed using a beamforming algorithm such as MVDR or GEV to achieve speech enhancement. The specific calculation process is well known in the art and is not described here.
Briefly, a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise may first be calculated based on the mask values. The first correlation matrix is the covariance matrix of the speech portion extracted, based on the mask values, from the time-frequency domain data output by the microphone array, and the second correlation matrix is the covariance matrix of the noise portion extracted, based on the mask values, from the same data. Then, based on the first correlation matrix and the second correlation matrix, speech enhancement is performed using a beamforming algorithm.
As an example, the first correlation matrix is R_SS = E_t[M(t, ω) · (X(t, ω)X(t, ω)^H)] and the second correlation matrix is R_NN = E_t[(1 − M(t, ω)) · (X(t, ω)X(t, ω)^H)], where M(t, ω) denotes the mask value at each time-frequency point (t, ω), X(t, ω) denotes the time-frequency domain data output by the microphone array, E_t denotes the mathematical expectation over time, and X(t, ω)^H denotes the conjugate transpose of X(t, ω). X(t, ω) may include the time-frequency domain data of one or more microphones of the microphone array.
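The two correlation matrices can be accumulated per frequency bin as sketched below; the array layout and function name are assumptions made for illustration:

```python
import numpy as np

def correlation_matrices(X, M):
    """Per-bin spatial covariance of the speech and noise parts.

    X : complex STFT data, shape (n_mics, n_frames, n_bins)
    M : real mask, shape (n_frames, n_bins)
    Returns (R_ss, R_nn), each of shape (n_bins, n_mics, n_mics).
    """
    n_mics, n_frames, n_bins = X.shape
    R_ss = np.zeros((n_bins, n_mics, n_mics), dtype=complex)
    R_nn = np.zeros((n_bins, n_mics, n_mics), dtype=complex)
    for w in range(n_bins):
        Xw = X[:, :, w]                                  # (n_mics, n_frames)
        outer = np.einsum('it,jt->tij', Xw, Xw.conj())   # X(t,w) X(t,w)^H per frame
        R_ss[w] = np.mean(M[:, w, None, None] * outer, axis=0)          # E_t[M * X X^H]
        R_nn[w] = np.mean((1.0 - M[:, w, None, None]) * outer, axis=0)  # E_t[(1-M) * X X^H]
    return R_ss, R_nn
```

The time average here stands in for the mathematical expectation E_t of the formulas above.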
[ application scenarios and application examples ]
The speech enhancement scheme of the present disclosure is applicable to devices supporting voice interaction functions in noisy environments, in particular devices that are far from the sound source (the speaker, i.e., the user issuing voice instructions), such as Echo-type smart speakers, robots, automobiles, ticket vending machines, and the like. A noisy environment here refers to an environment in which various noise influences are present. Taking a subway ticket vending machine as an example: such machines are often placed at busy entrances of subway stations, and successfully applying speech recognition technology to them requires meeting the challenge of a highly noisy environment, with noise including but not limited to: babble noise from crowds, interference from speakers near the ticket buyer, noise generated by the movement of people, mechanical noise of moving subway trains, interference from public-address loudspeakers, and the like.
The speech enhancement scheme of the present disclosure can be deployed on such devices supporting voice interaction in noisy environments to enhance the target speech and thereby improve speech recognition performance.
Fig. 3 is a block diagram illustrating a structure of an apparatus for supporting a voice interaction function according to an embodiment of the present disclosure. The device shown in fig. 3 may be a voice interaction device applied in a noisy environment, and may be, but is not limited to, a smart speaker, a robot, an automobile, a ticket vending machine, and the like.
As shown in fig. 3, device 300 includes a microphone array 310 and a terminal processor 320.
The microphone array 310 is configured to receive a sound input, and the sound input received by the microphone array 310 may include both the speaker's voice and ambient noise.
For the sound signals received by the microphone array 310, the terminal processor 320 may first perform analog-to-digital conversion to obtain sound data, and may then perform speech enhancement using the speech enhancement scheme of the present disclosure. Briefly, the terminal processor 320 may subtract the outputs of two microphones in the microphone array to obtain a first-order difference output, compare the first-order difference output with a predetermined threshold, determine the mask value of each frequency point based on the comparison result, and perform speech enhancement based on the mask value, where the mask value is used to characterize how the noise in the noisy speech masks the speech. For details of the speech enhancement scheme executed by the terminal processor 320, see the description above; they are not repeated here.
Additionally, the device 300 may also include a communication module 330. The voice data after the voice enhancement performed by the terminal processor 320 can be sent to the server through the communication module 330, and the server performs subsequent operations such as voice recognition and instruction issuing.
Fig. 4 is an overall flow diagram illustrating a speech enhancement scheme that may be performed by the device shown in fig. 3. The VAD judgment, the differential mask estimation, the statistic calculation, and the beam forming shown in fig. 4 may be performed by the terminal processor in fig. 3.
As shown in fig. 4, the solid circle on the left side represents a microphone array, and after the microphone array collects the original sound wave signal and obtains the sound data in digital form through ADC (analog-to-digital conversion), VAD judgment can be performed according to the voice activity detection result (i.e., VAD input). The voice activity detection result may be result information obtained by detection based on an existing voice activity detection mode, and the principle and implementation details about the voice activity detection mode are not the key points of the present disclosure and are not described herein again.
According to the VAD judgment result, whether the sound data are transmitted to the differential mask estimation module or directly transmitted to the statistic calculation module can be determined. For example, the analog-to-digital converted voice data may be passed into the differential mask estimation module in the case where the VAD input is that there is voice activity, and the analog-to-digital converted digital signal may be passed into the statistic calculation module in the case where the VAD input is that there is no voice activity.
The differential mask estimation module can receive the input of the sound source position information and execute the mask estimation scheme to obtain the time frequency point of the corresponding target voice in the voice with noise. The sound source position information may be position information of a target speaker determined based on any known positioning manner, for example, the sound source position information may be determined by a Multiple signal classification algorithm (MUSIC), and a specific determination manner about the sound source position information is not a focus of the present disclosure, so a positioning process about the sound source position information is not described in detail in the present disclosure.
The statistic calculation module can receive the original audio data and the mask information of each frequency point estimated by the difference mask estimation module, and calculate the correlation matrix of the corresponding voice and the correlation matrix of the noise. The beam forming module can calculate the coefficient of the spatial filter by inputting the correlation matrix of the voice and the noise, and performs beam forming on the original audio to output the finally enhanced voice.
Compared with clustering-based CGMM and neural-network-based mask estimation schemes, the first-order-difference-based mask estimation scheme of the present disclosure introduces almost no delay (depending on the timeliness of the sound source localization information, no or very little delay can be achieved). Moreover, clustering-based CGMM and neural-network-based nn-mask methods cannot handle directional speech interference, whereas the first-order-difference-based mask estimation scheme is unaffected by it. Therefore, the real-time speech enhancement scheme based on the differential mask can effectively improve the speech recognition success rate in noisy scenarios such as subway ticket vending machines.
[ Speech enhancement apparatus ]
Fig. 5 is a schematic block diagram illustrating the structure of a speech enhancement apparatus according to an embodiment of the present disclosure. The functional modules of the speech enhancement apparatus may be implemented by hardware, software, or a combination of both that implements the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional modules described in fig. 5 may be combined or divided into sub-modules to implement the principles described above. Thus, the description herein supports any possible combination, division, or further definition of the functional modules described herein.
In the following, brief descriptions are given to functional modules that the speech enhancement device can have and operations that each functional module can perform, and details related thereto may be referred to the description above in conjunction with fig. 2, and are not repeated here.
Referring to fig. 5, the speech enhancement apparatus 500 includes a difference module 510, a comparison module 520, a determination module 530, and a speech enhancement module 540.
The difference module 510 is used to subtract the outputs of two microphones in the microphone array to obtain a first-order difference output. The comparison module 520 is used to compare the first-order difference output with a predetermined threshold. The determining module 530 is configured to determine the mask value of each frequency point based on the comparison result, and the speech enhancement module 540 is configured to perform speech enhancement based on the mask value. The mask value is used to characterize how the noise in the noisy speech masks the speech. The determining module 530 may determine the mask value of a frequency point as 1 when the first-order difference output is smaller than the predetermined threshold, and as 0 when the first-order difference output is greater than or equal to the predetermined threshold.
As an example of the disclosure, the determining module 530 may determine a mask value estimation result for each first-order difference output based on the results of comparing a plurality of first-order difference outputs with the predetermined threshold, and determine the final mask value of a frequency point based on the mask values corresponding to the same frequency point in the plurality of mask value estimation results.
In this embodiment, the first-order difference output may be equal to the product of the filter coefficients and a matrix of the time-frequency domain data of the two microphones, where the filter coefficients are
h(ω) = [1, −e^(−jωτ0α)]
where h(ω) is the filter-coefficient vector, τ0 is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null.
As one example of the present disclosure, the speech enhancement apparatus 500 may further include an angle calculation module for calculating the relative angle of the two microphones with respect to the speaker based on the sound source position information of the speaker, and a coefficient determination module (not shown in the figure) for determining the parameter α of the filter coefficients based on the relative angle.
Optionally, the angle calculation module may include: the first direction vector determining module is used for determining a first direction vector from the centers of the two microphones to the speaker; a second direction vector determination module for determining a second direction vector from one of the two microphones to the other microphone; and the calculation submodule is used for calculating the relative angle based on the first direction vector and the second direction vector.
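The relative angle obtained from the two direction vectors above can be sketched as follows; positions are given as NumPy vectors, and the function name is an assumption:

```python
import numpy as np

def relative_angle(p_mic1, p_mic2, p_speaker):
    """Angle between the microphone-pair axis and the direction to the speaker.

    p_mic1, p_mic2, p_speaker : position vectors (e.g., 2-D or 3-D)
    """
    v1 = p_speaker - (p_mic1 + p_mic2) / 2.0  # first vector: pair center -> speaker
    v2 = p_mic2 - p_mic1                       # second vector: mic1 -> mic2
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos_t, -1.0, 1.0))
```

The cosine of this angle could then serve as the null-steering parameter α of the filter coefficients.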
As shown in fig. 6, the speech enhancement module 540 includes a matrix calculation module 541 and a beam forming module 542.
The matrix calculation module 541 is configured to calculate a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise based on the mask values. The first correlation matrix is the covariance matrix of the speech portion extracted, based on the mask values, from the time-frequency domain data output by the microphone array, and the second correlation matrix is the covariance matrix of the noise portion extracted, based on the mask values, from the same data. For example, the first correlation matrix is R_SS = E_t[M(t, ω) · (X(t, ω)X(t, ω)^H)] and the second correlation matrix is R_NN = E_t[(1 − M(t, ω)) · (X(t, ω)X(t, ω)^H)], where M(t, ω) denotes the mask value at each time-frequency point (t, ω), X(t, ω) denotes the time-frequency domain data output by the microphone array, E_t denotes the mathematical expectation over time, and X(t, ω)^H denotes the conjugate transpose of X(t, ω).
The beamforming module 542 is configured to perform speech enhancement using a beamforming algorithm based on the first correlation matrix and the second correlation matrix. For example, spatial filtering may be performed using a beamforming method such as MVDR or GEV to achieve speech enhancement.
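One common realization of GEV beamforming from the two correlation matrices (an assumption here; the disclosure does not fix the algorithm) takes, per frequency bin, the principal eigenvector of R_NN^{-1} R_SS as the spatial filter:

```python
import numpy as np

def gev_weights(R_ss, R_nn, diag_load=1e-9):
    """Per-bin GEV weights: principal eigenvector of R_nn^{-1} R_ss.

    R_ss, R_nn : (n_bins, n_mics, n_mics) speech / noise covariances
    Returns W of shape (n_bins, n_mics).
    """
    n_bins, n_mics, _ = R_ss.shape
    W = np.zeros((n_bins, n_mics), dtype=complex)
    eye = np.eye(n_mics)
    for w in range(n_bins):
        # small diagonal loading keeps R_nn invertible
        A = np.linalg.solve(R_nn[w] + diag_load * eye, R_ss[w])
        vals, vecs = np.linalg.eig(A)
        W[w] = vecs[:, np.argmax(vals.real)]  # direction of largest SNR gain
    return W

def apply_beamformer(W, X):
    """Y(t, w) = W(w)^H X(t, w), with X of shape (n_mics, n_frames, n_bins)."""
    return np.einsum('wi,itw->tw', W.conj(), X)
```

The GEV weight is defined only up to a per-bin scale and phase; practical systems typically add a normalization step, which is omitted in this sketch.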
[ calculating device ]
FIG. 7 shows a schematic structural diagram of a computing device that can be used to implement the speech enhancement method described above according to an embodiment of the present disclosure.
Referring to fig. 7, computing device 700 includes memory 710 and processor 720.
Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a Graphics Processing Unit (GPU) or a Digital Signal Processor (DSP). In some embodiments, processor 720 may be implemented using custom circuits, such as an Application-Specific Integrated Circuit (ASIC) or a Field-Programmable Gate Array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by processor 720 or other modules of the computer. The permanent storage may be a readable and writable, non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the permanent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, Mini SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or over wires.
The memory 710 has stored thereon executable code that, when executed by the processor 720, causes the processor 720 to perform the speech enhancement methods described above.
The speech enhancement method, apparatus and device according to the present disclosure have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above-mentioned steps defined in the above-mentioned method of the present disclosure.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the various steps of the above-described method according to the present disclosure.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (16)

1. A speech enhancement method, comprising:
subtracting the outputs of two microphones in a microphone array to obtain a first-order difference output;
comparing the first-order difference output with a predetermined threshold;
determining a mask value of each frequency point based on the comparison result, wherein the mask value is used to characterize how noise in the noisy speech masks the speech; and
performing speech enhancement based on the mask value.
2. The speech enhancement method of claim 1, wherein the step of determining the mask value of each frequency point comprises:
determining the mask value of a frequency point as 1 when the first-order difference output is less than the predetermined threshold, and determining the mask value of a frequency point as 0 when the first-order difference output is greater than or equal to the predetermined threshold.
3. The speech enhancement method of claim 1, wherein the step of determining the mask value of each frequency point comprises:
determining a mask value estimation result for each first-order difference output based on the results of comparing a plurality of first-order difference outputs with the predetermined threshold, respectively; and
determining a final mask value of the frequency point based on the mask values corresponding to the same frequency point in the plurality of mask value estimation results.
4. The speech enhancement method of claim 3, wherein the step of determining the final mask value of the frequency point comprises:
taking the product of the mask values corresponding to the same frequency point in the plurality of mask value estimation results as the final mask value of the frequency point.
5. The speech enhancement method of claim 1, wherein
the first-order difference output is equal to the product of the filter coefficients and a matrix of the time-frequency domain data of the two microphones.
6. The speech enhancement method of claim 5 wherein the filter coefficients are
h(ω) = [1, −e^(−jωτ0α)]
where h(ω) is the filter coefficient, τ0 is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null.
7. The speech enhancement method of claim 6, further comprising:
calculating the relative angle between the two microphones and the speaker based on sound source position information of the speaker; and
determining the parameter α of the filter coefficients based on the relative angle.
8. The speech enhancement method of claim 7 wherein the step of calculating the relative angles of the two microphones to the speaker comprises:
determining a first direction vector from the centers of the two microphones to the speaker;
determining a second directional vector from one microphone to the other of the two microphones;
calculating the relative angle based on the first direction vector and the second direction vector.
9. The speech enhancement method of claim 1, wherein said performing speech enhancement based on the mask value comprises:
calculating a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise based on the mask values; and
performing speech enhancement using a beamforming algorithm based on the first correlation matrix and the second correlation matrix.
10. The speech enhancement method of claim 9, wherein
the first correlation matrix is the covariance matrix of the speech portion extracted, based on the mask values, from the time-frequency domain data output by the microphone array, and
the second correlation matrix is the covariance matrix of the noise portion extracted, based on the mask values, from the time-frequency domain data output by the microphone array.
11. A speech enhancement apparatus, comprising:
a difference module for subtracting the outputs of two microphones in a microphone array to obtain a first-order difference output;
a comparison module for comparing the first-order difference output with a predetermined threshold;
a determining module for determining a mask value of each frequency point based on the comparison result, wherein the mask value is used to characterize how noise in the noisy speech masks the speech; and
a speech enhancement module for performing speech enhancement based on the mask value.
12. An apparatus for supporting a voice interaction function, comprising:
a microphone array for receiving a sound input; and
a terminal processor for subtracting the outputs of two microphones in the microphone array to obtain a first-order difference output, comparing the first-order difference output with a predetermined threshold, determining a mask value of each frequency point based on the comparison result, and performing speech enhancement based on the mask value, wherein the mask value is used to characterize how noise in the noisy speech masks the speech.
13. The apparatus of claim 12, further comprising:
and the communication module is used for sending the voice data after the voice enhancement to the server.
14. The apparatus of claim 12, wherein the apparatus is any one of:
a ticket purchasing machine;
an intelligent sound box;
a robot;
an automobile.
15. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-10.
16. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-10.
CN201810967670.9A 2018-08-23 2018-08-23 Voice enhancement method, device, equipment and storage medium Active CN110858485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810967670.9A CN110858485B (en) 2018-08-23 2018-08-23 Voice enhancement method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810967670.9A CN110858485B (en) 2018-08-23 2018-08-23 Voice enhancement method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110858485A true CN110858485A (en) 2020-03-03
CN110858485B CN110858485B (en) 2023-06-30

Family

ID=69635198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810967670.9A Active CN110858485B (en) 2018-08-23 2018-08-23 Voice enhancement method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110858485B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223552A (en) * 2021-04-28 2021-08-06 锐迪科微电子(上海)有限公司 Speech enhancement method, speech enhancement device, speech enhancement apparatus, storage medium, and program

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101778322A (en) * 2009-12-07 2010-07-14 中国科学院自动化研究所 Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic
CN101777349A (en) * 2009-12-08 2010-07-14 中国科学院自动化研究所 Auditory perception property-based signal subspace microphone array voice enhancement method
CN102265334A (en) * 2008-10-23 2011-11-30 大陆汽车***公司 Variable noise masking during periods of substantial silence
JP2012063614A (en) * 2010-09-16 2012-03-29 Yamaha Corp Masking sound generation device
CN102456351A (en) * 2010-10-14 2012-05-16 清华大学 Voice enhancement system
JP2012129652A (en) * 2010-12-13 2012-07-05 Canon Inc Sound processing device and method, and imaging apparatus
CN105575403A (en) * 2015-12-25 2016-05-11 重庆邮电大学 Cross-correlation sound source positioning method with combination of auditory masking and double-ear signal frames

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102265334A (en) * 2008-10-23 2011-11-30 大陆汽车***公司 Variable noise masking during periods of substantial silence
CN101778322A (en) * 2009-12-07 2010-07-14 中国科学院自动化研究所 Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic
CN101777349A (en) * 2009-12-08 2010-07-14 中国科学院自动化研究所 Auditory perception property-based signal subspace microphone array voice enhancement method
JP2012063614A (en) * 2010-09-16 2012-03-29 Yamaha Corp Masking sound generation device
CN102456351A (en) * 2010-10-14 2012-05-16 清华大学 Voice enhancement system
JP2012129652A (en) * 2010-12-13 2012-07-05 Canon Inc Sound processing device and method, and imaging apparatus
CN102568492A (en) * 2010-12-13 2012-07-11 佳能株式会社 Audio processing apparatus, audio processing method, and image capturing apparatus
CN105575403A (en) * 2015-12-25 2016-05-11 Chongqing University of Posts and Telecommunications Cross-correlation sound source positioning method with combination of auditory masking and double-ear signal frames

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gong Qin et al.: "Close-range dual-microphone speech enhancement algorithm based on beamforming and maximum likelihood estimation", Journal of Tsinghua University (Science and Technology), no. 06, 15 June 2018 (2018-06-15) *
Xu Na et al.: "Dual-microphone speech enhancement algorithm combining differential array and magnitude spectral subtraction", Journal of Signal Processing, no. 07, 25 July 2018 (2018-07-25) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223552A (en) * 2021-04-28 2021-08-06 RDA Microelectronics (Shanghai) Co., Ltd. Speech enhancement method, device, apparatus, storage medium, and program
CN113223552B (en) * 2021-04-28 2023-06-13 RDA Microelectronics (Shanghai) Co., Ltd. Speech enhancement method, device, apparatus, storage medium, and program

Also Published As

Publication number Publication date
CN110858485B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN107221336B (en) Device and method for enhancing target voice
US9100734B2 (en) Systems, methods, apparatus, and computer-readable media for far-field multi-source tracking and separation
JP5587396B2 (en) System, method and apparatus for signal separation
Hadad et al. The binaural LCMV beamformer and its performance analysis
CN104854878B (en) Apparatus, method, and computer media for spatial interference suppression using a two-microphone array
JP5323995B2 (en) System, method, apparatus and computer readable medium for dereverberation of multi-channel signals
JP4376902B2 (en) Voice input system
US8787587B1 (en) Selection of system parameters based on non-acoustic sensor information
JP2007523514A (en) Adaptive beamformer, sidelobe canceller, method, apparatus, and computer program
BR112015014380B1 Filter and method for informed spatial filtering using multiple instantaneous direction-of-arrival estimates
Wang et al. Noise power spectral density estimation using MaxNSR blocking matrix
US20210312936A1 (en) Method, Device, Computer Readable Storage Medium and Electronic Apparatus for Speech Signal Processing
CN107369460B (en) Voice enhancement device and method based on acoustic vector sensor space sharpening technology
CN111681665A (en) Omnidirectional noise reduction method, equipment and storage medium
Janský et al. Auxiliary function-based algorithm for blind extraction of a moving speaker
Dwivedi et al. Joint DOA estimation in spherical harmonics domain using low-complexity CNN
CN113870893A (en) Multi-channel double-speaker separation method and system
Hosseini et al. Time difference of arrival estimation of sound source using cross correlation and modified maximum likelihood weighting function
CN110858485B (en) Voice enhancement method, device, equipment and storage medium
McDonough et al. Microphone arrays
CN113223552B (en) Speech enhancement method, device, apparatus, storage medium, and program
Delcroix et al. Multichannel speech enhancement approaches to DNN-based far-field speech recognition
Bagekar et al. Dual channel coherence based speech enhancement with wavelet denoising
Ogawa et al. Speech enhancement using a square microphone array in the presence of directional and diffuse noise
Delikaris-Manias et al. Cross spectral density based spatial filter employing maximum directivity beam patterns

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40024925

Country of ref document: HK

GR01 Patent grant