CN110085246A - Speech enhancement method, apparatus, device and storage medium - Google Patents
- Publication number
- CN110085246A (application CN201910233376.XA)
- Authority
- CN
- China
- Prior art keywords
- noise
- time
- current frame
- noisy frame signal
- frequency mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
- G10L21/0232—Processing in the frequency domain
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The present invention relates to speech signal processing and provides a speech enhancement method, apparatus, device and storage medium, intended to solve the problem that existing speech enhancement methods are computationally expensive and fail to meet real-time requirements. The speech enhancement method includes: obtaining the current noisy frame captured by a microphone array, the current noisy frame containing at least the speech signals emitted by a target speech source and by other sound sources; determining, from the current noisy frame, the time-frequency mask corresponding to that frame; determining, from the time-frequency mask, the filter coefficients corresponding to that frame; and performing speech enhancement processing on the noisy signal using the filter coefficients. Because computing the time-frequency mask only requires processing a single noisy frame, the method has a small computational load and meets real-time requirements.
Description
Technical field
The present invention relates to speech signal processing technologies, and in particular to a speech enhancement method, apparatus, device and storage medium.
Background art
In noisy environments, the performance of many speech processing systems degrades sharply. Speech enhancement, as an effective preprocessing technique against noise pollution, has long been a focus of the speech signal processing field. The purpose of speech enhancement is to extract a speech signal that is as clean as possible from the noisy signal, raising the signal-to-noise ratio and improving speech quality.
In the prior art, the general principle of speech enhancement is as follows: first, filter coefficients are applied to the noisy signal after a Fourier transform or short-time Fourier transform, yielding an enhanced frequency-domain signal; then an inverse Fourier transform is applied to the enhanced frequency-domain signal, yielding the enhanced time-domain signal for output. For determining the filter coefficients, several methods exist in the prior art. In conventional methods the filter coefficients are fixed values; since noise itself generally varies over time, fixed filter coefficients naturally fail to match this behaviour, so speech enhancement with such coefficients is only suited to stationary noise fields and adapts poorly. To overcome this problem, another existing algorithm caches a long segment of the noisy signal and uses an EM algorithm to first compute the time-frequency mask for that segment, and then computes the corresponding filter coefficients from the mask. Although this method can compute the filter coefficients accurately and thus improve the enhancement quality, it must cache a large amount of data over a long period (for example, ten minutes of data), so it is both computationally expensive and unable to meet real-time requirements, and cannot be applied to speech enhancement tasks with real-time constraints.
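The general pipeline just described (Fourier transform, apply filter coefficients, inverse transform) can be sketched as follows. This is only an illustrative sketch of the prior-art principle with fixed coefficients: the Hann window, the 50% hop, and the shape of `coeffs` (one complex gain per frequency bin) are assumptions, not details from the patent.

```python
import numpy as np

def enhance_fixed_filter(noisy, coeffs, frame_len=320, hop=160):
    """Conventional pipeline sketch: filter the short-time spectrum of the
    noisy signal with fixed per-bin coefficients, then invert and overlap-add.
    `coeffs` is one complex gain per rfft bin (a hypothetical layout)."""
    window = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)                   # Fourier transform of one frame
        enhanced = coeffs * spec                    # apply the filter coefficients
        rec = np.fft.irfft(enhanced, n=frame_len)   # inverse Fourier transform
        out[start:start + frame_len] += rec * window
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)             # undo the window weighting
```

With `coeffs` all ones the pipeline reconstructs the input in the fully overlapped region, which is a quick sanity check that the transform/inverse pair is wired correctly.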
Summary of the invention
In view of this, the purpose of the present invention is to provide a speech enhancement method, apparatus, device and storage medium, intended to solve the problem that existing speech enhancement methods are computationally expensive and fail to meet real-time requirements.
In a first aspect, an embodiment of the invention provides a speech enhancement method, comprising:
obtaining the current noisy frame captured by a microphone array, the current noisy frame containing at least the speech signals emitted by a target speech source and by other sound sources;
determining, from the current noisy frame, the time-frequency mask corresponding to that frame;
determining, from the time-frequency mask, the filter coefficients corresponding to that frame; and
performing speech enhancement processing on the noisy signal using the filter coefficients.
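The per-frame flow of the first aspect can be sketched as below. All three helpers are illustrative stand-ins invented for this sketch (a magnitude-threshold mask and a simple spectral gain); the patent's own mask and coefficient estimators are described later in the embodiments.

```python
import numpy as np

def estimate_tf_mask(frame_spec):
    # Stand-in: mark low-magnitude bins as noise-dominated (hypothetical rule).
    mag = np.abs(frame_spec)
    return (mag < mag.mean()).astype(float)

def filter_coeffs_from_mask(mask):
    # Stand-in: derive a simple attenuating gain from the mask.
    return 1.0 - 0.9 * mask

def enhance_frame(noisy_frame):
    """One pass of the claimed per-frame pipeline: mask from the current
    frame only -> filter coefficients -> enhanced frame."""
    spec = np.fft.rfft(noisy_frame)
    mask = estimate_tf_mask(spec)
    w = filter_coeffs_from_mask(mask)
    return np.fft.irfft(w * spec, n=len(noisy_frame))
```

The point of the structure is that nothing outside the current frame is consumed, which is what keeps the per-frame computational load small.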
In a second aspect, an embodiment of the invention provides a speech enhancement apparatus, comprising:
an obtaining module, for obtaining the current noisy frame captured by a microphone array, the current noisy frame containing at least the speech signals emitted by a target speech source and by other sound sources;
a time-frequency mask determining module, for determining, from the current noisy frame, the corresponding time-frequency mask;
a filter coefficient determining module, for determining, from the time-frequency mask, the corresponding filter coefficients; and
a speech enhancement module, for performing speech enhancement processing on the noisy signal using the filter coefficients.
In a third aspect, an embodiment of the invention provides a speech enhancement device, comprising: a microphone array, a processor, a memory, and a computer program stored on the memory and runnable on the processor, the processor being connected to the microphone array; when the processor executes the computer program, the speech enhancement method of any embodiment of the invention is realized.
In a fourth aspect, an embodiment of the invention provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the speech enhancement method of any embodiment of the invention is realized.
Compared with the prior art, the invention has the following advantages:
In the speech enhancement method provided by the invention, the current noisy frame captured by a microphone array is obtained; from the current noisy frame, its corresponding time-frequency mask is determined; from the time-frequency mask, the corresponding filter coefficients are determined; and speech enhancement processing is performed on the noisy signal using the filter coefficients, where the noisy signal may be the current noisy frame itself, or the noisy frames immediately before or after it. On the one hand, because the invention computes a time-frequency mask for each noisy frame, the mask is not a fixed value but changes with the specific conditions of every frame; correspondingly, the filter coefficients derived from it also change with the specific conditions of every frame, so performing the final speech enhancement with these coefficients improves the enhancement quality. On the other hand, because computing the time-frequency mask only requires processing a single noisy frame, the computational load of the invention is small and real-time requirements are met.
Brief description of the drawings
Fig. 1 shows a flow diagram of the speech enhancement method provided in the embodiments;
Fig. 2 shows a structural block diagram of the speech enhancement system described in the embodiments;
Fig. 3 shows a flow diagram of the method for determining the time-frequency mask described in the embodiments;
Fig. 4 shows a structural block diagram of the speech enhancement apparatus provided in the embodiments;
Fig. 5 shows a structural block diagram of another speech enhancement apparatus provided in the embodiments.
Specific embodiment
Specific embodiments of the invention are described below. The embodiments are illustrative: they are intended to disclose the specific working process of the invention and should not be understood as further limiting the scope of protection of the claims.
Referring to Fig. 1, the embodiment provides a speech enhancement method. The method can serve as a preprocessing stage for speech-signal applications such as speech recognition and speech coding, and can be applied in a speech enhancement system. As shown in Fig. 2, the speech enhancement system mainly comprises a microphone array 10, an audio decoder 20, a digital signal processor 30, and a D/A converter 40, connected in sequence. The microphone array 10 captures sound and converts the captured sound into an analog noisy signal; the audio decoder 20 performs digital sampling conversion on the noisy signal and feeds the converted data into the digital signal processor 30; the digital signal processor 30 performs speech enhancement processing on the data and sends the processed data to the D/A converter 40; the D/A converter 40 converts the received data into an analog signal and outputs it. In the speech enhancement system shown in Fig. 2, the microphone array 10 contains at least two array elements, each element being one microphone; when the microphone array 10 is a digital microphone array, the audio decoder 20 in the system of Fig. 2 can be omitted.
In the related art, when performing speech enhancement, the digital signal processor 30 first filters the Fourier-transformed or short-time-Fourier-transformed noisy signal with filter coefficients, obtaining an enhanced frequency-domain signal, and then applies an inverse Fourier transform to obtain the enhanced time-domain signal for output. For determining the filter coefficients, the prior art mainly includes the following three methods:
The first existing method uses the covariance matrix of the noisy signal as an approximation of the noise covariance matrix, and computes the filter coefficients from that approximation. Its drawback is that the noisy-signal covariance is not an accurate estimate of the noise covariance, which leads to insufficient noise suppression while also damaging the speaker's speech.
The second existing method assumes the noise is a diffuse noise field and uses a preset diffuse-noise covariance matrix in place of the true noise covariance matrix. When the noise really is diffuse, this scheme suppresses it well; however, practical scenes contain various coherent noises and interfering sounds, in which case the preset matrix deviates substantially from the true noise covariance, again leading to insufficient noise suppression and poor enhancement.
The third existing method clusters time-frequency points with an EM algorithm to compute a time-frequency mask, and then uses the mask to compute the noise covariance matrix. Its benefit is a more accurate estimate of the noise covariance; its drawback is that the EM algorithm is computationally heavy and must cache a sufficiently large amount of data before it can estimate accurately, making it hard to meet real-time computing demands.
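The role the time-frequency mask plays in the third method — weighting time-frequency points when estimating the noise covariance matrix — can be illustrated as follows. The array layout and the convention that larger mask values mark noise-dominated points are assumptions, and the EM clustering that would produce the mask is not shown.

```python
import numpy as np

def noise_covariance_from_mask(Y, mask, eps=1e-6):
    """Estimate a per-bin noise covariance matrix from a time-frequency mask.
    Y: (mics, frames, bins) complex STFT of the noisy signal;
    mask: (frames, bins) values in [0, 1], larger where noise dominates."""
    M, T, F = Y.shape
    Rn = np.zeros((F, M, M), dtype=complex)
    for f in range(F):
        Yf = Y[:, :, f]                  # all mics, all frames, one bin
        w = mask[:, f]                   # per-frame noise weight for this bin
        # Mask-weighted outer-product average: sum_t w_t y_t y_t^H / sum_t w_t
        Rn[f] = (Yf * w) @ Yf.conj().T / (w.sum() + eps)
    return Rn
```

Filter coefficients (e.g. a beamformer) would then be computed per bin from `Rn`; that step is deliberately omitted here.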
Because the speech enhancement method provided in this embodiment has a small computational load and can satisfy real-time requirements, it can be applied to speech-signal processing tasks with real-time constraints. The method is described in detail below with reference to Fig. 1.
Step 101: obtain the current noisy frame captured by the microphone array, the current noisy frame containing at least the speech signals emitted by a target speech source and by other sound sources.
Specifically, when the sound emitted by each source is captured by the elements of the microphone array, each element of the array produces a noisy signal. As an example, the noisy signal of every element in the microphone array is obtained. Because an element generates its noisy signal after receiving the sounds emitted by both the target speech source and the other sources, the noisy signal contains the speech signal emitted by the target speech source as well as the signals emitted by the other sources. It should be appreciated that the target speech source is the source whose speech needs to be enhanced, the other sources are the sources other than the target speech source, and the number of other sources can be one or more. As an example, the microphone array of a mobile phone receives speech from user A and also receives noise emitted by an environmental noise source; the phone needs to enhance the user's speech and transmit the enhanced speech signal in real time to another user B elsewhere. Here, user A is the target speech source and the environmental noise source is the other source.
As an example, the microphone array can be the array on a mobile phone, tablet computer, PDA, laptop, desktop computer, or smart speaker (such as a Tmall Genie or Xiaomi device). Correspondingly, the subject that obtains the noisy signal and performs the subsequent processing on it can be the processor of that phone, tablet, PDA, laptop, desktop computer, or smart speaker. Specifically, the microphone array comprises at least two elements, each element being one microphone; as an example, the array may comprise 6 elements.
As an example, the length of the current noisy frame can be 10 ms, 20 ms, 30 ms, etc. It should be appreciated that the frame length of the noisy signal is not limited to these examples; the invention places no limitation on it. It should be appreciated that precisely because each frame of the noisy signal is processed promptly and a set of filter coefficients is obtained for each frame, more accurate filter coefficients can be computed per frame, which in turn improves the enhancement quality. Moreover, because each noisy frame is short, generally within 100 ms, the amount of signal data is small and the data volume the processor must handle is small; and because processing does not have to wait for a long stretch of signal data to be cached, the real-time behaviour is better.
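Cutting the current noisy frame out of each element's stream, as in step 101, might look like this sketch; a 16 kHz sampling rate, 20 ms frames, and the 6-element array are assumptions chosen from the examples in the text.

```python
import numpy as np

def current_frame(channels, t, frame_len):
    """Cut the current noisy frame from every array element's stream.
    channels: (num_mics, num_samples); returns (num_mics, frame_len)."""
    return channels[:, t:t + frame_len]

# e.g. a 6-element array at 16 kHz with 20 ms frames
fs, frame_ms = 16000, 20
frame_len = fs * frame_ms // 1000      # 320 samples per frame
```

Each call consumes only one short frame per element, which is the property that keeps both the buffered data volume and the latency small.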
Step 102: using the current noisy frame, determine the time-frequency mask corresponding to the current noisy frame.
Because the time-frequency mask corresponding to the current noisy frame is obtained from the current noisy frame itself, the value of the mask is tied to the actual conditions of that frame; the more accurate the mask's value, the better the enhancement once it is applied. As one embodiment, the time-frequency mask can be obtained through steps 201 and 202 below, as shown in Fig. 3.
Step 201: from the current noisy frame, determine the estimated bearing of the target speech source relative to the microphone array.
As an example, any estimation method can be chosen for locating the target speech source; the invention does not limit this. For example, when a spherical microphone array is used, the array can capture the sound-pressure information of the higher-order sound field, the sound field can be decomposed with spherical harmonics to build a signal model, and the bearing of the target speech source can be estimated with the MUSIC algorithm. As another example, an existing TDOA algorithm can be chosen to estimate the bearing of the target speech source. Since the position of the target speech source can be estimated with prior-art techniques, the invention does not describe the specific estimation methods further. It should be appreciated that, to further reduce the computational load, an estimation method with a smaller computational cost can be chosen for estimating the position of the target speech source.
In this step, there is an error between the estimated bearing of the target speech source relative to the microphone array and its true bearing, and the size of that error is influenced by the noise sources (i.e. the other sound sources). For instance, when the sound emitted by a noise source strongly affects the sound emitted by the target speech source, the error is larger.
Step 202: according to the relative position between the estimated bearing and a target area, determine the time-frequency mask corresponding to the current noisy frame, where the target area is the physical location region in which the target speech source lies.
Here, the physical location region is an angular interval. For example, if the true bearing of the target speech source relative to the microphone array is 15°, the physical location region is [15 − a, 15 + a]. As an example, the specific size of a can be preset; if a is set to 30°, the physical location region is [−15°, 45°]. As an example, the size of a can also be adjusted and optimized automatically according to the enhancement results: when the enhancement effect is weak, i.e. the output speech signal still contains considerable noise, a can be reduced automatically; when the original target speech signal (i.e. the speech signal corresponding to the target speech source) in the output has been damaged after enhancement, a can be enlarged automatically.
As an example, when the target speech source has a fixed position, e.g. its true bearing relative to the microphone array is 15° and stays constant, the physical location region is always [15 − a, 15 + a]. As an example, when the position of the target speech source is not fixed, the physical location region is [b − a, b + a], where b is the true bearing of the target speech source; b can be determined by non-speech processing methods such as binocular-camera tracking or infrared tracking and positioning. It should be appreciated that any reasonable method can be chosen to determine a physical location region, i.e. the target area, for the target speech source; the examples above do not limit the invention.
In the method comprising steps 201 and 202 above, because the time-frequency mask of the current noisy frame is determined from the relative position between the estimated bearing and the target area, it is in essence determined from the size of the error between the estimated bearing and the true position of the target speech source. As stated above, that error is itself influenced by the noise sources, so the size of the time-frequency mask is mainly decided by the noise sources themselves: when the sound emitted by a noise source strongly affects the sound emitted by the target speech source, the time-frequency mask is larger; when the influence is weaker, the time-frequency mask is smaller.
For the specific implementation of step 202, i.e. how to determine the time-frequency mask of the current noisy frame from the relative position between the estimated bearing and the target area, the embodiment provides the following two examples.
In the first embodiment, if the estimated bearing lies inside the target area, the time-frequency mask is determined to be a preset fixed value T1; if the estimated bearing lies outside the target area, the time-frequency mask is determined to be a preset fixed value T2, where 0 ≤ T1 < T2 ≤ 1. It should be appreciated that the first embodiment determines the time-frequency mask by a hard decision.
In the first embodiment, the preset fixed value T1 corresponds to the case where the estimated bearing lies inside the target area: the noise sources influence the target speech source only slightly, and the error between the estimated bearing and the true position of the target speech source is small. The preset fixed value T2 corresponds to the case where the estimated bearing lies outside the target area: the noise sources influence the target speech source strongly, and the error between the estimated bearing and the true position is large. Because T1 and T2 satisfy 0 ≤ T1 < T2 ≤ 1, the mask is larger when the noise influence on the target speech source is stronger, and smaller when the influence is weaker.
As an example, T1 preferably takes the value 0 and T2 preferably takes the value 1. In that case, when the estimated bearing lies inside the target area the time-frequency mask is 0, and the noise sources can be disregarded; when the estimated bearing lies outside the target area the mask is 1, and the noise sources must be considered. The benefit of taking T1 = 0 and T2 = 1 is that values of 0 or 1 further simplify the computation and thus further reduce the computational load. It should be appreciated that the specific values of T1 and T2 are not limited to these examples: T1 could also be 0.05, 0.1, 0.2, etc., and T2 could also be 0.95, 0.9, 0.8, etc.; the invention places no limitation on the specific values of T1 and T2.
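The hard-decision rule of the first embodiment, with the preferred values T1 = 0 and T2 = 1, can be written directly; the target region [center − a, center + a] with a = 30° is the example from the text, and treating the boundary as inside is an implementation choice.

```python
def hard_tf_mask(est_bearing, target_center, half_width=30.0, t1=0.0, t2=1.0):
    """Hard-decision mask of the first embodiment: T1 inside the target
    region [center - a, center + a], T2 outside, with 0 <= T1 < T2 <= 1."""
    lo, hi = target_center - half_width, target_center + half_width
    return t1 if lo <= est_bearing <= hi else t2
```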
In the second embodiment, if the estimated bearing lies outside the target area, the time-frequency mask is determined to be a preset fixed value T3, where 0 < T3 ≤ 1; if the estimated bearing lies inside the target area, the time-frequency mask is determined to be T4 according to the specific relative position of the estimated bearing within the target area, where 0 ≤ T4 < T3. It should be appreciated that the second embodiment determines the time-frequency mask by a soft decision.
In the second embodiment, the preset fixed value T3 corresponds to the case where the estimated bearing lies outside the target area: the noise sources influence the target speech source strongly, and the error between the estimated bearing and the true position of the target speech source is large. T4 corresponds to the case where the estimated bearing lies inside the target area: the noise influence on the target speech source is weaker, and the error between the estimated bearing and the true position is smaller. Because T3 and T4 satisfy 0 ≤ T4 < T3, the mask is again larger when the noise influence on the target speech source is stronger, and smaller when the influence is weaker.
As an example, T3 preferably takes the value 1; the benefit of taking T3 = 1 is that a value of 1 further simplifies the computation and thus further reduces the computational load. It should be appreciated that the specific value of T3 is not limited to this example: T3 could also be 0.95, 0.9, 0.8, etc.; the invention places no limitation on the specific value of T3.
In the second embodiment, when the time-frequency mask T4 is determined from the specific relative position of the estimated bearing within the target area, as an example, the value of T4 satisfies the following relationship: the closer the estimated bearing is to the center of the target area, the closer T4 is to 0; the closer the estimated bearing is to the edge of the target area, the closer T4 is to T3.
In this example, when the estimated bearing is closer to the center of the target area, the noise influence on the target speech source is smaller, the error between the estimated bearing and the true position of the target speech source is smaller, and the corresponding mask is smaller; when the estimated bearing is closer to the edge of the target area, the noise influence is stronger, the error is larger, and the corresponding mask is larger.
For example, T4 can be computed from a preset mapping function. The mapping function can be set by experience or obtained by machine statistical learning. As an example, the mapping between the specific relative position of the estimated bearing within the target area and T4 can be set as a linear function. Suppose the true bearing of the target speech source relative to the microphone array is 15°, the physical location region is [−15°, 45°], and the preset fixed value T3 is set to 1; then the linear mapping can be set as T4 = R/30 − 0.5 for 15 ≤ R ≤ 45 and T4 = −R/30 + 0.5 for −15 ≤ R ≤ 15, where R is the estimated bearing. As an example, the mapping between the relative position and T4 can also be set as a nonlinear function, with the estimated bearing as the independent variable and the time-frequency mask T4 as the dependent variable; the nonlinear function can likewise be set by experience or obtained by machine statistical learning. It should be appreciated that the invention places no limitation on how the mapping function is set.
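The linear mapping of this example (region [−15°, 45°], center 15°, T3 = 1) can be written compactly with the distance from the region center, which reproduces both branches of the piecewise formula above: 0 at the center, T3 at or beyond the edges.

```python
def soft_tf_mask(est_bearing, center=15.0, half_width=30.0, t3=1.0):
    """Soft-decision mask of the second embodiment (linear mapping example).
    Equals R/30 - 0.5 for 15 <= R <= 45 and -R/30 + 0.5 for -15 <= R <= 15
    when center = 15, half_width = 30, T3 = 1."""
    if abs(est_bearing - center) >= half_width:   # outside [center-a, center+a]
        return t3
    return t3 * abs(est_bearing - center) / half_width
```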
Step 103: using the time-frequency mask, determine the filter coefficients corresponding to the current noisy frame.
Because the time-frequency mask is obtained from the actual conditions of the noisy frame itself, and specifically from the corresponding noise-source conditions, the filter coefficients determined from the mask are the filter coefficients corresponding to the current noisy frame, and the accuracy of these filter coefficients is high.
As an embodiment, the filter coefficients may specifically be determined by the method of steps 1 to 3 below.
Step 1: perform a Fourier transform on the current noisy frame signal to obtain the Fourier transform spectrum of the current noisy frame signal.
Specifically, since the noisy signal output by the microphone array is continuous in time, the present embodiment extracts from the continuous noisy signal the frame corresponding to the current moment, names this frame the current noisy frame signal, and then performs a Fourier transform on it. It should be appreciated that, for a long continuous noisy signal, the present embodiment extracts data frame by frame for Fourier transformation; as time passes, the continuous noisy signal is cut into multiple frames that are transformed in turn. For the continuous signal as a whole this amounts to a short-time Fourier transform, each frame corresponding to the data within one time window.
Specifically, one noisy frame may be extracted from the noisy signal output by each element of the microphone array, as that element's current noisy frame signal; a Fourier transform is then applied to each element's current noisy frame signal, yielding the per-element Fourier transform spectra y_1, y_2, y_3, ..., y_m, where the subscript m is the number of array elements; finally the per-element spectra are combined into the Fourier transform spectrum of the current noisy frame signal, y = [y_1, y_2, y_3, ..., y_m]^T, so that the Fourier transform spectrum is characterized by the matrix y.
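A minimal sketch of assembling y from the per-element frames follows. The naive DFT here is a dependency-free stand-in for the FFT an implementation would actually use.

```python
import cmath

def dft(frame):
    """Naive DFT of one real-valued frame (illustrative stand-in for an FFT)."""
    N = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * n / N)
                for n, x in enumerate(frame))
            for k in range(N)]

def stack_spectra(element_frames):
    """Given the current frame from each of the m array elements, return
    y[f] = [y_1(f), ..., y_m(f)]^T: one channel vector per frequency bin."""
    spectra = [dft(fr) for fr in element_frames]   # m spectra of length N
    N = len(spectra[0])
    return [[s[k] for s in spectra] for k in range(N)]
```

Each entry `y[f]` is the column vector the noise covariance computation below operates on, one per frequency bin.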
Step 2: calculate the noise covariance matrix according to the following formula:
Φ̂_NN(f) = Σ_t T(t,f) y(t,f) y^H(t,f) / Σ_t T(t,f)
where Φ̂_NN is the noise covariance matrix, T is the time-frequency mask, y(t,f) is the matrix characterizing the Fourier transform spectrum, and y^H(t,f) is the conjugate transpose of y(t,f).
Specifically, y(t,f) carries array-element, time, and frequency attributes, and y^H(t,f) is obtained by taking the conjugate transpose of y(t,f).
Introducing the time-frequency mask T into the above formula embodies the improvement of the present invention. The prior art also uses a noise covariance matrix when computing filter coefficients. In the first existing method described above, the covariance matrix of the noisy signal is used as an approximation of the noise covariance matrix, which is equivalent to fixing the time-frequency mask T at 1. In the second existing method described above, the noise is assumed to be a diffuse noise field, and a preset diffuse-noise covariance matrix is used in place of the true noise covariance matrix, which is equivalent to fixing T at some value between 0 and 1, for example 0.6.
In both existing methods, the time-frequency mask corresponding to the current noisy frame signal cannot be determined from that frame's noise-source situation; the mask is therefore not accurate enough, the filter coefficients are less accurate, and the final speech enhancement is poor. In the present invention, the time-frequency mask corresponding to the current noisy frame signal is determined from that frame's noise-source situation, so the mask is more accurate, the filter coefficients are more accurate, and the speech enhancement effect is better.
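The mask-weighted covariance of step 2 can be sketched per frequency bin as below. Normalizing by the mask sum follows the formula as reconstructed above; note that the MVDR weights of step 3 are invariant to this overall scale in any case.

```python
def noise_covariance(Y, T):
    """Mask-weighted noise covariance for one frequency bin f.
    Y: list over time frames t of channel vectors y(t,f) (complex lists).
    T: list over t of mask values T(t,f).
    Returns sum_t T * y * y^H / sum_t T as an m-by-m nested list."""
    m = len(Y[0])
    phi = [[0j] * m for _ in range(m)]
    for y, t in zip(Y, T):
        for i in range(m):
            for j in range(m):
                phi[i][j] += t * y[i] * y[j].conjugate()
    s = sum(T)
    return [[phi[i][j] / s for j in range(m)] for i in range(m)]
```

With T fixed at 1 this reduces to the plain covariance of the noisy signal, which is exactly the first prior-art approximation discussed above.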
Step 3: calculate the filter coefficients according to the following formula:
w(f) = Φ̂_NN(f)^(-1) r̂(f) / (r̂^H(f) Φ̂_NN(f)^(-1) r̂(f))
where w(f) is the filter coefficient vector, Φ̂_NN is the noise covariance matrix, r̂(f) is the estimated steering vector corresponding to the current noisy frame signal, and r̂^H(f) is the conjugate transpose of r̂(f).
The steering vector r̂ may be estimated by existing estimation methods; for example, in the prior art it can be estimated from the phase of the audio signals across array elements. It should be appreciated that the present invention places no limitation on how the steering vector r̂ is estimated.
In this step, since the noise covariance matrix Φ̂_NN is determined according to the concrete situation of the noise source in the current noisy frame signal, the filter coefficients determined from it also match the current noisy frame signal; the accuracy of the filter coefficients is higher, and using them for speech enhancement processing improves the speech enhancement effect.
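Step 3's formula can be sketched for a two-element array, where the 2x2 matrix inverse can be written out explicitly to stay dependency-free. The steering vector r is assumed to be supplied by whatever estimator is used.

```python
def mvdr_weights(phi, r):
    """Filter w(f) = Phi^-1 r / (r^H Phi^-1 r) for one frequency bin,
    written out for a 2-element array (2x2 Phi)."""
    (a, b), (c, d) = phi
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]        # Phi^-1
    num = [inv[0][0] * r[0] + inv[0][1] * r[1],
           inv[1][0] * r[0] + inv[1][1] * r[1]]             # Phi^-1 r
    den = r[0].conjugate() * num[0] + r[1].conjugate() * num[1]  # r^H Phi^-1 r
    return [num[0] / den, num[1] / den]
```

The denominator normalizes the weights so that w^H(f) r̂(f) = 1, i.e. the signal arriving from the estimated steering direction passes through undistorted.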
Step 104: using the filter coefficients, perform speech enhancement processing on the noisy signal.
As an example, that noisy signal may be the current noisy frame signal, or one or a few noisy frames before or after it. For instance, the digital signal processor may use the filter coefficients at the next moment to enhance the frame or frames corresponding to that moment. As another instance, the digital signal processor may use the filter coefficients to enhance the frame or frames of a previous moment and then output the enhanced signal of that moment; although this introduces an output delay equal to the time difference between the previous moment and the current moment, that difference corresponds to only one or a few frames, and since the duration of each noisy frame is very short, the output delay is very small.
As an example, the present embodiment may also adopt the general speech-enhancement principle described in the Background to enhance the current noisy frame signal. Specifically, the filter coefficients are used to filter the noisy signal after its Fourier transform or short-time Fourier transform, yielding the enhanced frequency-domain signal; an inverse Fourier transform of that signal then yields the enhanced time-domain signal for output. More specifically, the enhanced frequency-domain signal is first computed as
Ŝ(t,f) = w^H(f) y(t,f)
and the enhanced time-domain signal is then computed as
ŝ(t) = IFFT(Ŝ(t,f))
where Ŝ(t,f) is the enhanced frequency-domain signal, w^H(f) is the conjugate transpose of the filter coefficient vector, y(t,f) is the matrix characterizing the Fourier transform spectrum, ŝ(t) is the enhanced time-domain signal, and IFFT denotes the inverse Fourier transform. It should be understood that the above example is illustrative only and is not intended to limit the present invention.
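The two output equations can be sketched as follows; again a naive inverse DFT stands in for the IFFT, and overlap-add reconstruction across frames is omitted for brevity.

```python
import cmath

def apply_filter(W, Y):
    """S_hat(t,f) = w(f)^H y(t,f) for each frequency bin of one frame.
    W: per-bin filter weight vectors; Y: per-bin channel vectors."""
    return [sum(wc.conjugate() * yc for wc, yc in zip(w, y))
            for w, y in zip(W, Y)]

def idft(S):
    """Naive inverse DFT returning the enhanced time-domain frame
    (illustrative stand-in for an IFFT)."""
    N = len(S)
    return [sum(s * cmath.exp(2j * cmath.pi * k * n / N)
                for k, s in enumerate(S)) / N
            for n in range(N)]
```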
With the speech enhancement method comprising steps 101 to 104 above, the current noisy frame signal acquired by the microphone array is obtained; the corresponding time-frequency mask is determined using the current noisy frame signal; the corresponding filter coefficients are determined using the time-frequency mask; and speech enhancement processing is performed on the current noisy frame signal using the filter coefficients. On the one hand, since the above method computes a time-frequency mask for each noisy frame, the mask is not a fixed value but changes with the concrete situation of every frame, and the resulting filter coefficients change accordingly, so performing the final speech enhancement with these filter coefficients improves the speech enhancement effect. On the other hand, since computing the time-frequency mask requires processing only a single frame, the computation load of the invention is small and the real-time requirement is met.
Referring to Fig. 4, an embodiment provides a speech enhancement device, the speech enhancement device comprising:
an obtaining module 501, configured to obtain the current noisy frame signal acquired by a microphone array, the current noisy frame signal including at least the speech signals respectively emitted by a target speech sound source and other sound sources;
a time-frequency mask determining module 502, configured to determine, using the current noisy frame signal, the time-frequency mask corresponding to the current noisy frame signal;
a filter coefficient determining module 503, configured to determine, using the time-frequency mask, the filter coefficients corresponding to the current noisy frame signal; and
a speech enhancement module 504, configured to perform, using the filter coefficients, speech enhancement processing on the current noisy frame signal.
Optionally, referring to Fig. 5, on the basis of Fig. 4 above, the time-frequency mask determining module 502 in the speech enhancement device includes:
an estimated-azimuth determining submodule 5021, configured to determine, according to the current noisy frame signal, the estimated azimuth of the target speech sound source relative to the microphone array; and
a time-frequency mask determining submodule 5022, configured to determine the time-frequency mask corresponding to the current noisy frame signal according to the relative positional relationship between the estimated azimuth and a target area, wherein the target area is the actual position area where the target speech sound source is located.
Optionally, on the basis of Fig. 5 above, the time-frequency mask determining submodule 5022 may specifically be configured to: if the estimated azimuth lies within the target area, determine that the time-frequency mask is a predetermined fixed value T1; if the estimated azimuth lies outside the target area, determine that the time-frequency mask is a predetermined fixed value T2; wherein 0 ≤ T1 < T2 ≤ 1.
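This first variant of the submodule reduces to a simple two-valued decision; a sketch follows. The area bounds and the default values T1 = 0 and T2 = 1 are illustrative choices of mine; the method only requires 0 ≤ T1 < T2 ≤ 1.

```python
def binary_mask(R, area=(-15.0, 45.0), T1=0.0, T2=1.0):
    """First mask variant: a small fixed value T1 when the estimated azimuth R
    lies inside the target area (the frame is speech-dominated and should
    contribute little to the noise estimate), and a larger fixed value T2
    outside it (the frame is noise-dominated)."""
    lo, hi = area
    return T1 if lo <= R <= hi else T2
```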
Alternatively, on the basis of Fig. 5 above, the time-frequency mask determining submodule 5022 may specifically be configured to: if the estimated azimuth lies outside the target area, determine that the time-frequency mask is a predetermined fixed value T3, wherein 0 < T3 ≤ 1; if the estimated azimuth lies within the target area, determine that the time-frequency mask is T4 according to the specific relative position of the estimated azimuth within the target area, wherein 0 ≤ T4 < T3. The closer the estimated azimuth is to the center of the target area, the closer the value of T4 is to 0; the closer the estimated azimuth is to the edge of the target area, the closer the value of T4 is to T3.
Optionally, on the basis of Fig. 4 above, the filter coefficient determining module 503 in the speech enhancement device includes:
a Fourier transform submodule, configured to perform a Fourier transform on the current noisy frame signal to obtain the Fourier transform spectrum of the current noisy frame signal;
a noise covariance matrix computing submodule, configured to compute the noise covariance matrix according to the following formula: Φ̂_NN(f) = Σ_t T(t,f) y(t,f) y^H(t,f) / Σ_t T(t,f); and
a filter coefficient computing submodule, configured to compute the filter coefficients according to the following formula: w(f) = Φ̂_NN(f)^(-1) r̂(f) / (r̂^H(f) Φ̂_NN(f)^(-1) r̂(f));
wherein Φ̂_NN is the noise covariance matrix, T is the time-frequency mask, y(t,f) is the matrix characterizing the Fourier transform spectrum, y^H(t,f) is the conjugate transpose of y(t,f), w(f) is the filter coefficient vector, r̂(f) is the estimated steering vector corresponding to the current noisy frame signal, and r̂^H(f) is the conjugate transpose of r̂(f).
In addition, an embodiment further provides speech enhancement equipment, the speech enhancement equipment comprising: a microphone array, a processor, a memory, and a computer program stored in the memory and runnable on the processor, the processor being connected to the microphone array; when the processor executes the computer program, the speech enhancement method of any of the above method embodiments is realized.
In the above speech enhancement equipment, as an example, the microphone array may be a digital microphone array and the processor may be a digital signal processor. As another example, the microphone array may be a non-digital microphone array while the processor is still a digital signal processor; in that case the microphone array and the processor may be connected through an audio decoder, which performs digitized sampling conversion on the noisy signal generated by the microphone array and feeds the converted data into the digital signal processor.
In addition, an embodiment further provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the speech enhancement method of any of the above method embodiments is realized.
Numerous specific details are set forth in the description provided here. It is to be appreciated, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments above. However, the disclosed method is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are hereby expressly incorporated into that description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or apparatus so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any ordering; these words may be interpreted as names.
Claims (10)
1. A speech enhancement method, characterized by comprising:
obtaining a current noisy frame signal acquired by a microphone array, the current noisy frame signal including at least the speech signals respectively emitted by a target speech sound source and other sound sources;
determining, using the current noisy frame signal, a time-frequency mask corresponding to the current noisy frame signal;
determining, using the time-frequency mask, filter coefficients corresponding to the current noisy frame signal; and
performing, using the filter coefficients, speech enhancement processing on a noisy signal.
2. The speech enhancement method according to claim 1, characterized in that determining, using the current noisy frame signal, the time-frequency mask corresponding to the current noisy frame signal comprises:
determining, according to the current noisy frame signal, an estimated azimuth of the target speech sound source relative to the microphone array; and
determining the time-frequency mask corresponding to the current noisy frame signal according to the relative positional relationship between the estimated azimuth and a target area, wherein the target area is the actual position area where the target speech sound source is located.
3. The speech enhancement method according to claim 2, characterized in that determining the time-frequency mask corresponding to the current noisy frame signal according to the relative positional relationship between the estimated azimuth and the target area comprises:
if the estimated azimuth lies within the target area, determining that the time-frequency mask is a predetermined fixed value T1;
if the estimated azimuth lies outside the target area, determining that the time-frequency mask is a predetermined fixed value T2;
wherein 0 ≤ T1 < T2 ≤ 1.
4. The speech enhancement method according to claim 2, characterized in that determining the time-frequency mask corresponding to the current noisy frame signal according to the relative positional relationship between the estimated azimuth and the target area comprises:
if the estimated azimuth lies outside the target area, determining that the time-frequency mask is a predetermined fixed value T3, wherein 0 < T3 ≤ 1;
if the estimated azimuth lies within the target area, determining that the time-frequency mask is T4 according to the specific relative position of the estimated azimuth within the target area;
wherein 0 ≤ T4 < T3.
5. The speech enhancement method according to claim 4, characterized in that the value of T4 satisfies the following relationship:
the closer the estimated azimuth is to the center of the target area, the closer the value of T4 is to 0; the closer the estimated azimuth is to the edge of the target area, the closer the value of T4 is to T3.
6. The speech enhancement method according to claim 1, characterized in that determining, using the time-frequency mask, the filter coefficients corresponding to the current noisy frame signal comprises:
performing a Fourier transform on the current noisy frame signal to obtain a Fourier transform spectrum of the current noisy frame signal;
calculating a noise covariance matrix according to the following formula:
Φ̂_NN(f) = Σ_t T(t,f) y(t,f) y^H(t,f) / Σ_t T(t,f);
calculating the filter coefficients according to the following formula:
w(f) = Φ̂_NN(f)^(-1) r̂(f) / (r̂^H(f) Φ̂_NN(f)^(-1) r̂(f));
wherein Φ̂_NN is the noise covariance matrix, T is the time-frequency mask, y(t,f) is the matrix characterizing the Fourier transform spectrum, y^H(t,f) is the conjugate transpose of y(t,f), w(f) is the filter coefficient vector, r̂(f) is the estimated steering vector corresponding to the current noisy frame signal, and r̂^H(f) is the conjugate transpose of r̂(f).
7. A speech enhancement device, characterized by comprising:
an obtaining module, configured to obtain a current noisy frame signal acquired by a microphone array, the current noisy frame signal including at least the speech signals respectively emitted by a target speech sound source and other sound sources;
a time-frequency mask determining module, configured to determine, using the current noisy frame signal, a time-frequency mask corresponding to the current noisy frame signal;
a filter coefficient determining module, configured to determine, using the time-frequency mask, filter coefficients corresponding to the current noisy frame signal; and
a speech enhancement module, configured to perform, using the filter coefficients, speech enhancement processing on a noisy signal.
8. The speech enhancement device according to claim 7, characterized in that the time-frequency mask determining module comprises:
an estimated-azimuth determining submodule, configured to determine, according to the current noisy frame signal, an estimated azimuth of the target speech sound source relative to the microphone array; and
a time-frequency mask determining submodule, configured to determine the time-frequency mask corresponding to the current noisy frame signal according to the relative positional relationship between the estimated azimuth and a target area, wherein the target area is the actual position area where the target speech sound source is located.
9. Speech enhancement equipment, comprising: a microphone array, a processor, a memory, and a computer program stored in the memory and runnable on the processor, the processor being connected to the microphone array, characterized in that, when the processor executes the computer program, the speech enhancement method according to any one of claims 1 to 6 is realized.
10. A storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the speech enhancement method according to any one of claims 1 to 6 is realized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910233376.XA CN110085246A (en) | 2019-03-26 | 2019-03-26 | Sound enhancement method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910233376.XA CN110085246A (en) | 2019-03-26 | 2019-03-26 | Sound enhancement method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110085246A true CN110085246A (en) | 2019-08-02 |
Family
ID=67413667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910233376.XA Pending CN110085246A (en) | 2019-03-26 | 2019-03-26 | Sound enhancement method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110085246A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110600050A (en) * | 2019-09-12 | 2019-12-20 | 深圳市华创技术有限公司 | Microphone array voice enhancement method and system based on deep neural network |
CN111128221A (en) * | 2019-12-17 | 2020-05-08 | 北京小米智能科技有限公司 | Audio signal processing method and device, terminal and storage medium |
CN111276150A (en) * | 2020-01-20 | 2020-06-12 | 杭州耳青聪科技有限公司 | Intelligent voice-to-character and simultaneous interpretation system based on microphone array |
CN111429934A (en) * | 2020-03-13 | 2020-07-17 | 北京松果电子有限公司 | Audio signal processing method and device and storage medium |
CN111862989A (en) * | 2020-06-01 | 2020-10-30 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN112533120A (en) * | 2020-11-23 | 2021-03-19 | 北京声加科技有限公司 | Beam forming method and device based on dynamic compression of noisy speech signal magnitude spectrum |
CN112785997A (en) * | 2020-12-29 | 2021-05-11 | 紫光展锐(重庆)科技有限公司 | Noise estimation method and device, electronic equipment and readable storage medium |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
TWI818493B (en) * | 2021-04-01 | 2023-10-11 | 大陸商深圳市韶音科技有限公司 | Methods, systems, and devices for speech enhancement |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101976565A (en) * | 2010-07-09 | 2011-02-16 | 瑞声声学科技(深圳)有限公司 | Dual-microphone-based speech enhancement device and method |
CN103594093A (en) * | 2012-08-15 | 2014-02-19 | 王景芳 | Method for enhancing voice based on signal to noise ratio soft masking |
CN104103277A (en) * | 2013-04-15 | 2014-10-15 | 北京大学深圳研究生院 | Time frequency mask-based single acoustic vector sensor (AVS) target voice enhancement method |
CN105788607A (en) * | 2016-05-20 | 2016-07-20 | 中国科学技术大学 | Speech enhancement method applied to dual-microphone array |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN109036411A (en) * | 2018-09-05 | 2018-12-18 | 深圳市友杰智新科技有限公司 | A kind of intelligent terminal interactive voice control method and device |
CN109308904A (en) * | 2018-10-22 | 2019-02-05 | 上海声瀚信息科技有限公司 | A kind of array voice enhancement algorithm |
2019-03-26: Application CN201910233376.XA filed (publication CN110085246A); legal status: Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101976565A (en) * | 2010-07-09 | 2011-02-16 | 瑞声声学科技(深圳)有限公司 | Dual-microphone-based speech enhancement device and method |
CN103594093A (en) * | 2012-08-15 | 2014-02-19 | 王景芳 | Method for enhancing voice based on signal to noise ratio soft masking |
CN104103277A (en) * | 2013-04-15 | 2014-10-15 | 北京大学深圳研究生院 | Time frequency mask-based single acoustic vector sensor (AVS) target voice enhancement method |
CN105788607A (en) * | 2016-05-20 | 2016-07-20 | 中国科学技术大学 | Speech enhancement method applied to dual-microphone array |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN109036411A (en) * | 2018-09-05 | 2018-12-18 | 深圳市友杰智新科技有限公司 | A kind of intelligent terminal interactive voice control method and device |
CN109308904A (en) * | 2018-10-22 | 2019-02-05 | 上海声瀚信息科技有限公司 | A kind of array voice enhancement algorithm |
Non-Patent Citations (3)
Title |
---|
JEONG S Y: ""Dominant speech enhancement based on SNR-adaptive soft mask filtering"", 《IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS》 * |
T. HIGUCHI: ""robust MVDR beamforming using time-frequency masks for online/offline ASR in noise"", 《ICASSP》 * |
王智国: ""基于掩码迭代估计的多通道语音识别算法"", 《信息技术与标准化》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110600050A (en) * | 2019-09-12 | 2019-12-20 | 深圳市华创技术有限公司 | Microphone array voice enhancement method and system based on deep neural network |
CN110600050B (en) * | 2019-09-12 | 2022-04-15 | 深圳市华创技术有限公司 | Microphone array voice enhancement method and system based on deep neural network |
CN111128221A (en) * | 2019-12-17 | 2020-05-08 | 北京小米智能科技有限公司 | Audio signal processing method and device, terminal and storage medium |
CN111128221B (en) * | 2019-12-17 | 2022-09-02 | 北京小米智能科技有限公司 | Audio signal processing method and device, terminal and storage medium |
CN111276150A (en) * | 2020-01-20 | 2020-06-12 | 杭州耳青聪科技有限公司 | Intelligent voice-to-character and simultaneous interpretation system based on microphone array |
CN111429934A (en) * | 2020-03-13 | 2020-07-17 | 北京松果电子有限公司 | Audio signal processing method and device and storage medium |
CN111429934B (en) * | 2020-03-13 | 2023-02-28 | 北京小米松果电子有限公司 | Audio signal processing method and device and storage medium |
CN111862989A (en) * | 2020-06-01 | 2020-10-30 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN111862989B (en) * | 2020-06-01 | 2024-03-08 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN112533120B (en) * | 2020-11-23 | 2022-04-22 | 北京声加科技有限公司 | Beam forming method and device based on dynamic compression of noisy speech signal magnitude spectrum |
CN112533120A (en) * | 2020-11-23 | 2021-03-19 | 北京声加科技有限公司 | Beam forming method and device based on dynamic compression of noisy speech signal magnitude spectrum |
CN112785997B (en) * | 2020-12-29 | 2022-11-01 | 紫光展锐(重庆)科技有限公司 | Noise estimation method and device, electronic equipment and readable storage medium |
CN112785997A (en) * | 2020-12-29 | 2021-05-11 | 紫光展锐(重庆)科技有限公司 | Noise estimation method and device, electronic equipment and readable storage medium |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
TWI818493B (en) * | 2021-04-01 | 2023-10-11 | 大陸商深圳市韶音科技有限公司 | Methods, systems, and devices for speech enhancement |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085246A (en) | Sound enhancement method, device, equipment and storage medium | |
JP7434137B2 (en) | Speech recognition method, device, equipment and computer readable storage medium | |
CN109643554A (en) | Adaptive voice Enhancement Method and electronic equipment | |
CN104158990B (en) | Method and audio receiving circuit for processing audio signal | |
Nakashima et al. | Frequency domain binaural model based on interaural phase and level differences | |
CN103632677B (en) | Noisy Speech Signal processing method, device and server | |
US20090281804A1 (en) | Processing unit, speech recognition apparatus, speech recognition system, speech recognition method, storage medium storing speech recognition program | |
JP2019191558A (en) | Method and apparatus for amplifying speech | |
CN109979476A (en) | A kind of method and device of speech dereverbcration | |
CN110148420A (en) | A kind of audio recognition method suitable under noise circumstance | |
CN107408394A (en) | It is determined that the noise power between main channel and reference channel is differential and sound power stage is poor | |
US20100316228A1 (en) | Methods and systems for blind dereverberation | |
CN102881289A (en) | Hearing perception characteristic-based objective voice quality evaluation method | |
CN111429932A (en) | Voice noise reduction method, device, equipment and medium | |
CN113077806B (en) | Audio processing method and device, model training method and device, medium and equipment | |
CN104778948B (en) | A kind of anti-noise audio recognition method based on bending cepstrum feature | |
CN112820315A (en) | Audio signal processing method, audio signal processing device, computer equipment and storage medium | |
CN105702262A (en) | Headset double-microphone voice enhancement method | |
CN111883154B (en) | Echo cancellation method and device, computer-readable storage medium, and electronic device | |
WO2022256577A1 (en) | A method of speech enhancement and a mobile computing device implementing the method | |
Wake et al. | Enhancing listening capability of humanoid robot by reduction of stationary ego‐noise | |
US7646912B2 (en) | Method and device for ascertaining feature vectors from a signal | |
CN109215635B (en) | Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement | |
CN110875037A (en) | Voice data processing method and device and electronic equipment | |
JP6815956B2 (en) | Filter coefficient calculator, its method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190802 |