CN109523999A - Front-end processing method and system for improving far-field speech recognition - Google Patents
Front-end processing method and system for improving far-field speech recognition
- Publication number
- CN109523999A (application CN201811602419.9A)
- Authority
- CN
- China
- Prior art keywords
- signal
- time domain
- energy
- early stage
- frequency domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0272—Voice signal separating
- G10L2021/02082—Noise filtering, the noise being echo or reverberation of the speech
Abstract
This application provides a front-end processing method and system for improving far-field speech recognition. The method comprises: computing, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncating the response to the direct-sound signal and the early reverberation signal; convolving the direct-sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain the time-domain target signal; computing the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the target signal, and obtaining the ideal ratio mask from these two energies; and, after converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal. The invention thereby separates the target signal from speech mixed under noisy, reverberant conditions by means of ideal ratio masking.
Description
Technical field
The present invention relates to the field of audio signal processing, and in particular to a front-end processing method and system for improving far-field speech recognition.
Background technique
With the continuous development of speech technology, voice interaction is now used very widely, from national defense down to personal household applications. Applications based on speech recognition, such as smart homes and service robots, are increasingly common. In real voice-interaction scenes, however, environmental noise and room reverberation interfere with the propagation of speech; these interferences not only degrade speech quality and intelligibility but are also very harmful to speech recognition. Separating the speech from these interferences is therefore particularly important for speech recognition.
Building on research into auditory masking, the ideal binary mask (Ideal Binary Mask, IBM) was proposed to separate target speech from noisy speech. The main idea of the IBM is to retain, via a local threshold, the time-frequency units in which the target signal is stronger than the noisy signal, and to discard the other units. Many studies have shown that the IBM can improve both speech intelligibility and speech quality. The ideal ratio mask (Ideal Ratio Mask, IRM), a soft-decision counterpart of the IBM, retains more of the speech information and performs better for speech recognition. In a noise-only environment, the IRM is computed from the ratio of the clean-speech energy to the noisy-speech energy. When the scene changes to a reverberant, noisy environment, current practice still applies the noise-only method. Yet noise is additive while reverberation is multiplicative, consisting of the direct sound, early reflections, and late reverberation; treating reverberation with the method above is clearly unreasonable.
The reverberation characteristics of a room are generally described by its room impulse response (Room Impulse Response, RIR). Research shows that the direct sound and early reflections of the room impulse response are the components beneficial to human hearing, and some recent studies take the direct sound plus the first 50 ms of early reverberation as the target speech. Experimental results show that this separation effectively improves speech intelligibility and quality under noisy, reverberant conditions. But reverberation varies with the acoustic characteristics of the room environment, and early reflections of different lengths affect intelligibility differently; truncating at a fixed 50 ms for every reverberation time is not a good approach.
Summary of the invention
To solve the above problems, the invention proposes a front-end processing method and system for improving far-field speech recognition.
To achieve the above object, the embodiments of this application adopt the following technical solutions.
In a first aspect, this application provides a front-end processing method for improving far-field speech recognition, comprising: computing, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncating the response to the direct-sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct-sound signal, the early reverberation signal, and the late reverberation signal; convolving the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal; computing the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtaining the ideal ratio mask from the target band energy and the other-signal energy, the time-domain mixture being obtained by convolving the room impulse response signal with speech from the corpus in the time domain and then mixing with a noise signal; and converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask, and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal.
In another possible implementation, computing the split time between the early reverberation signal and the late reverberation signal from the room impulse response signal comprises: determining the split time by computing the normalized echo density of the room impulse response signal, defined as
NED(l) = (1 / erfc(1/√2)) · Σ_n ω(n) · 1{ |h(n)| > δ },
where erfc(1/√2) ≈ 0.3173 is the fraction of samples of a Gaussian distribution expected to lie more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(n) is a weighting function, h(n) is the room impulse response, and δ is the standard deviation of the room impulse response within the current window. As the reverberation evolves from early reflections to late reverberation, NED rises from near 0 toward 1; the split time between the early and late reverberation signals is defined as the instant at which NED becomes arbitrarily close to 1.
In another possible implementation, computing the split time between the early reverberation signal and the late reverberation signal from the room impulse response signal comprises: under the diffuse-field assumption for the late reverberation, computing the split time by means of the kurtosis, the fourth standardized moment of the process, defined as
γ4 = E[(x - μ)^4] / δ^4 - 3,
where E denotes the expectation over the impulse response x, μ is its mean, and δ its standard deviation. The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In another possible implementation, computing the split time between the early reverberation signal and the late reverberation signal from the room impulse response signal comprises: computing the split time from the room's characteristics, the time t being defined as a function of V and S, respectively the volume of the room and the surface area of the room.
In another possible implementation, computing the target band energy and the other-signal energy from the time-domain target signal and from the components of the time-domain mixture other than the time-domain target signal, and obtaining the ideal ratio mask from them, specifically comprises: applying a Fourier transform to the time-domain target signal and to the other signals respectively, yielding the target band energy and the other-signal energy; and substituting the target band energy and the other-signal energy into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
In another possible implementation, the time-domain mixture is obtained by convolving the room impulse response signal with speech from the corpus in the time domain and then mixing with a noise signal, generated as
m(t) = s(t) * h(t) + n(t),
where * denotes convolution, s(t) is the clean speech signal, h(t) the room impulse response signal, n(t) the noise signal, and t the time index.
In another possible implementation, converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask, and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal specifically comprises:
applying a short-time Fourier transform to the time-domain mixture to obtain the frequency-domain mixture;
multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask and reusing the phase of the frequency-domain mixture, the reconstructed signal s'(t) being computed as
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] },
where istft denotes the inverse short-time Fourier transform, M(k, l) is the frequency-domain mixture, ∠M(k, l) its phase, k the band index, and l the frame index.
In a second aspect, this application provides a front-end processing system for improving far-field speech recognition, comprising: a truncation unit for computing, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncating the response to the direct-sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct-sound signal, the early reverberation signal, and the late reverberation signal; a first generation unit for convolving the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal; a second generation unit for computing the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtaining the ideal ratio mask from the target band energy and the other-signal energy, the time-domain mixture being obtained by convolving the room impulse response signal with speech from the corpus in the time domain and then mixing with a noise signal; and a third generation unit for converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask, and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal.
In another possible implementation, the second generation unit is specifically configured to apply a Fourier transform to the time-domain target signal and to the other signals respectively, yielding the target band energy and the other-signal energy, and to substitute the target band energy and the other-signal energy into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
In another possible implementation, the third generation unit is specifically configured to apply a short-time Fourier transform to the time-domain mixture to obtain the frequency-domain mixture, and then to multiply the magnitude of the frequency-domain mixture by the ideal ratio mask and reuse the phase of the frequency-domain mixture, the reconstructed signal s'(t) being computed as
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] },
where istft denotes the inverse short-time Fourier transform, M(k, l) is the frequency-domain mixture, ∠M(k, l) its phase, k the band index, and l the frame index.
The present invention computes the split point of room impulse response signals of differing acoustic characteristics, truncates the early reverberation signal, combines it with ideal ratio masking, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from speech mixed under noisy, reverberant conditions by ideal ratio masking.
Brief description of the drawings
The drawings required for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition provided by an embodiment of this application;
Fig. 2 is a schematic diagram of the composition of a room impulse response signal provided by an embodiment of this application;
Fig. 3 is a structural block diagram of a front-end processing system for improving far-field speech recognition provided by an embodiment of this application.
Specific embodiment
The embodiments of this application are described in detail below, with examples shown in the accompanying drawings, in which identical or similar labels throughout denote identical or similar units, or units with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain this application and should not be understood as limiting it.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition provided by an embodiment of this application. The method shown in Fig. 1 is implemented in the following steps.
Step S102: compute, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncate the response to the direct-sound signal and the early reverberation signal.
As shown in Fig. 2, a room impulse response signal actually consists of the direct sound, the early reverberation, and the late reverberation, of which the direct sound and early reverberation are the components beneficial to human hearing. This application mainly obtains the split time between the early and late reverberation and then truncates the response to the direct sound and the early reverberation for processing.
Specifically, after upsampling the room impulse response signal to a chosen frequency, the split time between its early and late reverberation signals is computed, and the response is truncated to the direct-sound signal and the early reverberation signal. The room impulse response signal is upsampled to make it convenient to locate the truncation time of the early reverberation.
Preferably, upsampling the room impulse response signal to 48 kHz works best in this application, although certain other frequencies are also feasible.
In one embodiment, the split time between the early and late reverberation signals is determined by computing the normalized echo density of the room impulse response signal, defined as
NED(l) = (1 / erfc(1/√2)) · Σ_n ω(n) · 1{ |h(n)| > δ },
where erfc(1/√2) ≈ 0.3173 is the fraction of samples of a Gaussian distribution expected to lie more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(n) is a weighting function, h(n) is the room impulse response, and δ is the standard deviation of the room impulse response within the current window.
As the reverberation evolves from early to late, NED rises from near 0 toward 1; the split time between the early and late reverberation signals is defined as the instant at which NED becomes arbitrarily close to 1.
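As a concrete illustration, the NED profile and the resulting split point can be sketched in a few lines. This is a minimal pure-Python sketch, not the patent's implementation; the window length and the threshold used to decide that NED has "approached 1" are illustrative choices.

```python
import math

def normalized_echo_density(h, win=64):
    """NED profile of an impulse response h (a list of floats).

    For each sliding window, count the fraction of samples whose magnitude
    exceeds the window's standard deviation, and normalize by erfc(1/sqrt(2))
    (~0.3173), the fraction expected for Gaussian data. NED stays near 0 over
    sparse early reflections and rises toward 1 over diffuse late reverberation.
    """
    gauss_frac = math.erfc(1.0 / math.sqrt(2.0))
    profile = []
    for start in range(len(h) - win + 1):
        frame = h[start:start + win]
        mean = sum(frame) / win
        std = math.sqrt(sum((x - mean) ** 2 for x in frame) / win)
        if std == 0.0:
            profile.append(0.0)
            continue
        outside = sum(1 for x in frame if abs(x) > std)
        profile.append((outside / win) / gauss_frac)
    return profile

def ned_split_point(profile, threshold=0.9):
    """First window index at which NED has (approximately) approached 1."""
    for i, value in enumerate(profile):
        if value >= threshold:
            return i
    return None
```

On a synthetic response whose first half is a few isolated reflections and whose second half is Gaussian noise, the profile stays low early, approaches 1 late, and the split point lands near the transition.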
In one embodiment, under the diffuse-field assumption for the late reverberation, the split time between the early and late reverberation signals is computed by means of the kurtosis, the fourth standardized moment of the process, defined as
γ4 = E[(x - μ)^4] / δ^4 - 3,
where E denotes the expectation over the impulse response x, μ is its mean, and δ its standard deviation.
The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
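The kurtosis-based split can be sketched similarly. The tolerance used to decide that the sliding-window kurtosis has "reached zero" is an illustrative choice: the sample kurtosis of a finite Gaussian window fluctuates around zero rather than hitting it exactly.

```python
def excess_kurtosis(frame):
    """Fourth standardized moment minus 3 of a window (a list of floats).

    Sparse, spiky early reflections are heavy-tailed (large positive values);
    diffuse late reverberation is approximately Gaussian (values near zero).
    """
    n = len(frame)
    mu = sum(frame) / n
    var = sum((x - mu) ** 2 for x in frame) / n
    if var == 0.0:
        return 0.0
    m4 = sum((x - mu) ** 4 for x in frame) / n
    return m4 / (var * var) - 3.0

def kurtosis_split_point(h, win=256, hop=64, tol=0.5):
    """First window start at which the sliding kurtosis has fallen to ~0."""
    for start in range(0, len(h) - win + 1, hop):
        if excess_kurtosis(h[start:start + win]) <= tol:
            return start
    return None
```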
In one embodiment, the split time between the early and late reverberation signals is computed from the room's characteristics, the time t being defined as a function of V and S, respectively the volume of the room and the surface area of the room.
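The geometry-based variant can be illustrated as follows. The patent's own formula for t is not reproduced in this text; the linear model used below, t = 20·V/S + 12 ms, is an assumption borrowed from the perceptual mixing-time literature and serves only to show a split time computed from the same two quantities V and S.

```python
def mixing_time_ms(volume_m3, surface_m2):
    """Geometry-based estimate of the early/late split time in milliseconds.

    Assumption: the linear perceptual mixing-time model t = 20*V/S + 12,
    where V is the room volume and S its total surface area. Larger rooms
    with relatively less boundary surface mix later.
    """
    return 20.0 * volume_m3 / surface_m2 + 12.0
```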
Step S104: convolve the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal.
Preferably, the speech corpus used in this application is Hub5, a corpus of English telephone recordings: recruited speakers were connected through a robot operator and, at the start of each call, chatted freely on everyday topics announced by the operator. The sampling frequency of this corpus is 8000 Hz. A clean speech signal means a faithful recording of speech without any further processing.
Specifically, the direct-sound signal and the early reverberation signal are downsampled to the sampling frequency of the speech signal and then convolved in the time domain with a clean speech signal from the corpus, yielding the time-domain target signal.
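This step can be sketched as truncating the impulse response at the split index and convolving the retained part (direct sound plus early reverberation) with the clean utterance. Direct-form convolution is used for clarity; a real implementation would use FFT-based convolution, and the resampling described above is omitted here.

```python
def convolve(x, h):
    """Direct-form time-domain convolution y = x * h (lists of floats)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def time_domain_target(clean, rir, split_idx):
    """Time-domain target signal: clean speech convolved with the room
    impulse response truncated at the early/late split index, i.e. with
    only the direct sound and early reverberation retained."""
    return convolve(clean, rir[:split_idx])
```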
Step S106: compute the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtain the ideal ratio mask from the target band energy and the other-signal energy.
Preferably, the time-domain mixture is obtained by convolving the room impulse response signal with whole utterances from the corpus in the time domain and then mixing with a noise signal, generated as
m(t) = s(t) * h(t) + n(t),
where * denotes convolution, s(t) is the clean speech signal, h(t) the room impulse response signal, n(t) the noise signal, and t the time index.
The noise signal here refers to the environmental noise of a real voice-interaction scene. Together with room reverberation, it interferes with the propagation of speech; such interference not only degrades speech quality and intelligibility but also affects speech recognition.
Specifically, a Fourier transform is applied to the time-domain target signal and to the components of the time-domain mixture other than the time-domain target signal, yielding the target band energy D(k, l) and the other-signal energy R(k, l); these are then substituted into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
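With the two energy maps in hand, the mask itself is an elementwise ratio. A small sketch follows; the epsilon guarding silent units is an implementation assumption, and some IRM formulations additionally take a square root of this ratio, which is omitted here.

```python
def ideal_ratio_mask(target_energy, other_energy):
    """IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)) per time-frequency unit.

    target_energy and other_energy are lists of frames, each frame a list
    of nonnegative band energies. Values near 1 keep a unit dominated by
    the target; values near 0 suppress units dominated by noise and late
    reverberation.
    """
    eps = 1e-12  # guard against 0/0 in silent units
    return [[d / (d + r + eps) for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(target_energy, other_energy)]
```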
Step S108: convert the time-domain mixture into the frequency-domain mixture, multiply the magnitude of the frequency-domain mixture by the ideal ratio mask, and reuse the phase of the frequency-domain mixture to obtain the reconstructed signal.
Specifically, a short-time Fourier transform is applied to the time-domain mixture to obtain the frequency-domain mixture; the magnitude of the frequency-domain mixture is then multiplied by the ideal ratio mask, and the phase of the frequency-domain mixture is reused, yielding the reconstructed signal
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] },
where istft denotes the inverse short-time Fourier transform, M(k, l) is the frequency-domain mixture, ∠M(k, l) its phase, k the band index, and l the frame index.
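For a single frame, the magnitude-mask-plus-mixture-phase reconstruction can be sketched with a plain DFT standing in for the STFT. This assumes a rectangular window and one frame; a real implementation would use windowed, overlapping frames with overlap-add synthesis.

```python
import cmath

def dft(frame):
    """Naive DFT of one real frame (O(n^2), for illustration only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each time sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def reconstruct_frame(mixture_frame, mask):
    """One frame of s'(t) = istft{ |M| * IRM * exp(j*angle(M)) }:
    keep the mixture's phase, scale its magnitude by the mask, invert."""
    spectrum = dft(mixture_frame)
    masked = [abs(m) * g * cmath.exp(1j * cmath.phase(m))
              for m, g in zip(spectrum, mask)]
    return idft(masked)
```

With an all-ones mask the frame is returned unchanged, since the mask is a pure magnitude scaling; this makes a convenient sanity check.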
The present invention computes the split point of room impulse response signals of differing acoustic characteristics, truncates the early reverberation signal, combines it with ideal ratio masking, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from speech mixed under noisy, reverberant conditions by ideal ratio masking.
Fig. 3 is a structural block diagram of a front-end processing system for improving far-field speech recognition provided by an embodiment of this application. The system shown in Fig. 3 comprises: a truncation unit 301, a first generation unit 302, a second generation unit 303, and a third generation unit 304.
The truncation unit 301 computes, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncates the response to the direct-sound signal and the early reverberation signal.
Specifically, after upsampling the room impulse response signal to a chosen frequency, the split time between its early and late reverberation signals is computed, and the response is then truncated to the direct-sound signal and the early reverberation signal.
In one embodiment, the split time between the early and late reverberation signals is determined by computing the normalized echo density of the room impulse response signal, defined as
NED(l) = (1 / erfc(1/√2)) · Σ_n ω(n) · 1{ |h(n)| > δ },
where erfc(1/√2) ≈ 0.3173 is the fraction of samples of a Gaussian distribution expected to lie more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(n) is a weighting function, h(n) is the room impulse response, and δ is the standard deviation of the room impulse response within the current window.
As the reverberation evolves from early to late, NED rises from near 0 toward 1; the split time between the early and late reverberation signals is defined as the instant at which NED becomes arbitrarily close to 1.
In one embodiment, under the diffuse-field assumption for the late reverberation, the split time between the early and late reverberation signals is computed by means of the kurtosis, the fourth standardized moment of the process, defined as
γ4 = E[(x - μ)^4] / δ^4 - 3,
where E denotes the expectation over the impulse response x, μ is its mean, and δ its standard deviation.
The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In one embodiment, the split time between the early and late reverberation signals is computed from the room's characteristics, the time t being defined as a function of V and S, respectively the volume of the room and the surface area of the room.
The first generation unit 302 convolves the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal. Specifically, the direct-sound signal and the early reverberation signal are downsampled to the sampling frequency of the speech signal and then convolved in the time domain with a clean speech signal from the corpus, yielding the time-domain target signal.
The second generation unit 303 computes the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtains the ideal ratio mask from the target band energy and the other-signal energy.
Preferably, the time-domain mixture is obtained by convolving the room impulse response signal with whole utterances from the corpus in the time domain and then mixing with a noise signal, generated as
m(t) = s(t) * h(t) + n(t),
where * denotes convolution, s(t) is the clean speech signal, h(t) the room impulse response signal, n(t) the noise signal, and t the time index.
Specifically, a Fourier transform is applied to the time-domain target signal and to the components of the time-domain mixture other than the time-domain target signal, yielding the target band energy D(k, l) and the other-signal energy R(k, l); these are then substituted into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
The third generation unit 304 is configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then reapply the phase of the frequency-domain mixed signal to obtain the reconstructed signal.
Specifically, a short-time Fourier transform of the time-domain mixed signal yields the frequency-domain mixed signal; its magnitude is multiplied by the ideal ratio mask and recombined with the phase of the frequency-domain mixed signal to obtain the reconstructed signal. The reconstructed signal s'(t) is calculated as:
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] }
where istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M(k, l) denotes the phase of the frequency-domain mixed signal, k denotes the sub-band index, and l denotes the frame index.
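The mask-and-reconstruct step can be sketched as follows (a minimal illustration assuming SciPy's STFT/ISTFT pair with matching parameters; the function name and defaults are mine):

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(mixture, mask, fs=16000, nperseg=512):
    """s'(t) = istft{ |M(k,l)| x IRM(k,l) x exp(j angle M(k,l)) }:
    apply the mask to the mixture magnitude, keep the mixture phase."""
    _, _, M = stft(mixture, fs=fs, nperseg=nperseg)
    masked = np.abs(M) * mask * np.exp(1j * np.angle(M))
    _, s_hat = istft(masked, fs=fs, nperseg=nperseg)
    return s_hat
```

With an all-ones mask this reduces to the STFT round trip, so the mixture is recovered; intermediate mask values attenuate time-frequency cells dominated by noise and late reverberation.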
The present invention computes room impulse response signals with different acoustic characteristics, intercepts the early reverberation signal, combines the early reverberation signal with the ideal ratio mask, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from speech mixed under noisy, reverberant conditions by means of ideal ratio masking.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present application and do not limit it. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not depart the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A front-end processing method for improving far-field speech recognition, characterized by comprising:
calculating a room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal; the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal;
convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal;
computing, separately, the time-domain target signal and the components of a time-domain mixed signal other than the time-domain target signal to obtain a target sub-band energy and an other-signal energy, and obtaining an ideal ratio mask from the target sub-band energy and the other-signal energy; wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing with a noise signal;
converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and reapplying the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
2. The method according to claim 1, wherein calculating the room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal comprises:
determining the splitting time point between the early reverberation signal and the late reverberation signal of the room impulse response signal by calculating the echo density function of the room impulse response signal, the echo density function NED being defined as:
NED(l) = [1 / erfc(1/√2)] × Σ_τ ω(τ) × 1{|h(τ)| > δ}
wherein erfc(1/√2) is the expected fraction of samples of a Gaussian distribution lying more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weighting function, and δ is the standard deviation of the room impulse response signal within the current window;
as the reverberation passes from early reverberation to late reverberation, NED rises from 0 towards 1, and the splitting time between the early reverberation signal and the late reverberation signal is defined as the instant at which NED becomes infinitely close to 1, i.e. when the samples within the window follow the Gaussian statistics of the late reverberation signal.
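The echo-density criterion of claim 2 can be sketched as follows (an illustrative sliding-window implementation with a rectangular weighting; the window length and the Gaussian normalization constant erfc(1/√2) are standard choices from the echo-density literature, not fixed by the claim):

```python
import numpy as np
from scipy.special import erfc

def normalized_echo_density(h, win=1024):
    """NED(l): fraction of RIR samples in a sliding window lying more than
    one standard deviation from zero, normalized by the Gaussian expectation
    erfc(1/sqrt(2)) so that NED approaches 1 for a Gaussian (late) tail."""
    ned = np.zeros(len(h) - win)
    for l in range(len(ned)):
        frame = h[l:l + win]
        sigma = np.std(frame) + 1e-12
        ned[l] = np.mean(np.abs(frame) > sigma) / erfc(1.0 / np.sqrt(2.0))
    return ned
```

For sparse early reflections NED stays near 0; for Gaussian late reverberation it hovers near 1, so the first index where NED crosses a threshold close to 1 serves as the splitting time point.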
3. The method according to claim 1, wherein calculating the room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal comprises:
calculating the splitting time point between the early reverberation signal and the late reverberation signal from the kurtosis, based on the diffuse-field assumption for the late reverberation; the kurtosis is the fourth-order moment of a random process, the kurtosis γ4 being defined as:
γ4 = E[(x − μ)^4] / δ^4 − 3
wherein E denotes the expectation taken over the impulse response x, μ is the mean, and δ is the standard deviation;
the splitting time is defined as the instant at which the kurtosis calculated in a sliding window reaches zero.
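The kurtosis criterion of claim 3 can be sketched with the excess kurtosis, which is zero for Gaussian data and therefore matches the "reaches zero" condition for the diffuse late field (the window length and the stabilizing constant are my choices):

```python
import numpy as np

def sliding_excess_kurtosis(h, win=1024):
    """gamma4(l) = E[(x - mu)^4] / sigma^4 - 3 over a sliding window;
    it decays toward 0 as the RIR becomes Gaussian (diffuse late field)."""
    out = np.zeros(len(h) - win)
    for l in range(len(out)):
        frame = h[l:l + win]
        mu, sigma = frame.mean(), frame.std() + 1e-12
        out[l] = np.mean((frame - mu) ** 4) / sigma ** 4 - 3.0
    return out
```

Early reflections are spiky (heavy-tailed), so the windowed excess kurtosis starts large and positive; the first zero crossing marks the candidate splitting time point.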
4. The method according to claim 1, wherein calculating the room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal comprises:
calculating the splitting time point between the early reverberation signal and the late reverberation signal from the room characteristics, the time t being defined as:
wherein V and S are respectively the volume of the room and the surface area of the room.
5. The method according to claim 1, wherein computing, separately, the time-domain target signal and the components of the time-domain mixed signal other than the time-domain target signal to obtain the target sub-band energy and the other-signal energy, and obtaining the ideal ratio mask from the target sub-band energy and the other-signal energy, specifically comprises:
applying a Fourier transform to the time-domain target signal and to the other signals respectively, to calculate the target sub-band energy and the other-signal energy;
substituting the target sub-band energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask; the ideal ratio mask formula IRM(k, l) being:
IRM(k, l) = D(k, l) / [D(k, l) + R(k, l)]
wherein D(k, l) denotes the target sub-band energy, R(k, l) denotes the energy of the mixed signal excluding the target sub-band energy, k denotes the sub-band index, and l denotes the frame index.
6. The method according to claim 1, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing with a noise signal, the time-domain mixed signal being generated as:
m(t) = s(t) * h(t) + n(t)
wherein s(t) denotes the clean speech signal, h(t) denotes the room impulse response signal, n(t) denotes the noise signal, t denotes the time index, and * denotes convolution.
7. The method according to claim 1, wherein converting the time-domain mixed signal into the frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and reapplying the phase of the frequency-domain mixed signal to obtain the reconstructed signal, specifically comprises:
applying a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal;
multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and reapplying the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s'(t) being calculated as:
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] }
wherein istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M(k, l) denotes the phase of the frequency-domain mixed signal, k denotes the sub-band index, and l denotes the frame index.
8. A front-end processing system for improving far-field speech recognition, comprising:
an interception unit, configured to calculate a room impulse response signal, obtain the splitting time point between the early reverberation signal and the late reverberation signal, and intercept the direct sound signal and the early reverberation signal; the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal;
a first generation unit, configured to convolve the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal;
a second generation unit, configured to compute, separately, the time-domain target signal and the components of a time-domain mixed signal other than the time-domain target signal to obtain a target sub-band energy and an other-signal energy, and to obtain an ideal ratio mask from the target sub-band energy and the other-signal energy; wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing with a noise signal;
a third generation unit, configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and reapply the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
9. The system according to claim 8, characterized in that the second generation unit is specifically configured to:
apply a Fourier transform to the time-domain target signal and to the other signals respectively, to calculate the target sub-band energy and the other-signal energy;
substitute the target sub-band energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask; the ideal ratio mask formula IRM(k, l) being:
IRM(k, l) = D(k, l) / [D(k, l) + R(k, l)]
wherein D(k, l) denotes the target sub-band energy, R(k, l) denotes the energy of the mixed signal excluding the target sub-band energy, k denotes the sub-band index, and l denotes the frame index.
10. The system according to claim 8, characterized in that the third generation unit is specifically configured to:
apply a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal;
multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask and reapply the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s'(t) being calculated as:
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] }
wherein istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M(k, l) denotes the phase of the frequency-domain mixed signal, k denotes the sub-band index, and l denotes the frame index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811602419.9A CN109523999B (en) | 2018-12-26 | 2018-12-26 | Front-end processing method and system for improving far-field speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109523999A true CN109523999A (en) | 2019-03-26 |
CN109523999B CN109523999B (en) | 2021-03-23 |
Family
ID=65797174
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811602419.9A Active CN109523999B (en) | 2018-12-26 | 2018-12-26 | Front-end processing method and system for improving far-field speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109523999B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090122999A1 (en) * | 2007-11-13 | 2009-05-14 | Samsung Electronics Co., Ltd | Method of improving acoustic properties in music reproduction apparatus and recording medium and music reproduction apparatus suitable for the method |
CN105427859A (en) * | 2016-01-07 | 2016-03-23 | 深圳市音加密科技有限公司 | Front voice enhancement method for identifying speaker |
CN105427860A (en) * | 2015-11-11 | 2016-03-23 | 百度在线网络技术(北京)有限公司 | Far field voice recognition method and device |
CN107481731A (en) * | 2017-08-01 | 2017-12-15 | 百度在线网络技术(北京)有限公司 | A kind of speech data Enhancement Method and system |
CN108389586A (en) * | 2017-05-17 | 2018-08-10 | 宁波桑德纳电子科技有限公司 | A kind of long-range audio collecting device, monitoring device and long-range collection sound method |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110428852A (en) * | 2019-08-09 | 2019-11-08 | 南京人工智能高等研究院有限公司 | Speech separating method, device, medium and equipment |
CN110428852B (en) * | 2019-08-09 | 2021-07-16 | 南京人工智能高等研究院有限公司 | Voice separation method, device, medium and equipment |
CN111312273A (en) * | 2020-05-11 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Reverberation elimination method, apparatus, computer device and storage medium |
CN111768796B (en) * | 2020-07-14 | 2024-05-03 | 中国科学院声学研究所 | Acoustic echo cancellation and dereverberation method and device |
CN111768796A (en) * | 2020-07-14 | 2020-10-13 | 中国科学院声学研究所 | Acoustic echo cancellation and dereverberation method and device |
CN112201262A (en) * | 2020-09-30 | 2021-01-08 | 珠海格力电器股份有限公司 | Sound processing method and device |
CN112201262B (en) * | 2020-09-30 | 2024-05-31 | 珠海格力电器股份有限公司 | Sound processing method and device |
CN112201229A (en) * | 2020-10-09 | 2021-01-08 | 百果园技术(新加坡)有限公司 | Voice processing method, device and system |
CN112201229B (en) * | 2020-10-09 | 2024-05-28 | 百果园技术(新加坡)有限公司 | Voice processing method, device and system |
CN112652290A (en) * | 2020-12-14 | 2021-04-13 | 北京达佳互联信息技术有限公司 | Method for generating reverberation audio signal and training method of audio processing model |
CN112735461A (en) * | 2020-12-29 | 2021-04-30 | 西安讯飞超脑信息科技有限公司 | Sound pickup method, related device and equipment |
CN112735461B (en) * | 2020-12-29 | 2024-06-07 | 西安讯飞超脑信息科技有限公司 | Pickup method, and related device and equipment |
CN113643714B (en) * | 2021-10-14 | 2022-02-18 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio processing method, device, storage medium and computer program |
CN113643714A (en) * | 2021-10-14 | 2021-11-12 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio processing method, device, storage medium and computer program |
WO2023093477A1 (en) * | 2021-11-25 | 2023-06-01 | 广州视源电子科技股份有限公司 | Speech enhancement model training method and apparatus, storage medium, and device |
Also Published As
Publication number | Publication date |
---|---|
CN109523999B (en) | 2021-03-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109523999A (en) | A kind of front end processing method and system promoting far field speech recognition | |
Pedersen et al. | Two-microphone separation of speech mixtures | |
CN104835503A (en) | Improved GSC self-adaptive speech enhancement method | |
CN110211602B (en) | Intelligent voice enhanced communication method and device | |
CN110808057A (en) | Voice enhancement method for generating confrontation network based on constraint naive | |
CN114694670A (en) | Multi-task network-based microphone array speech enhancement system and method | |
EP0997003A2 (en) | A method of noise reduction in speech signals and an apparatus for performing the method | |
CN113593590A (en) | Method for suppressing transient noise in voice | |
CN112820312B (en) | Voice separation method and device and electronic equipment | |
Mesgarani et al. | Speech enhancement based on filtering the spectrotemporal modulations | |
CN110838303B (en) | Voice sound source positioning method using microphone array | |
CN114884780B (en) | Underwater sound communication signal modulation identification method and device based on passive time reversal mirror | |
CN114401168B (en) | Voice enhancement method applicable to short wave Morse signal under complex strong noise environment | |
JP2002064617A (en) | Echo suppression method and echo suppression equipment | |
CN110444222A (en) | A kind of speech noise-reduction method based on comentropy weighting | |
Gbadamosi et al. | Development of non-parametric noise reduction algorithm for GSM voice signal | |
Wang et al. | Speech endpoint detection in fixed differential beamforming combined with modulation domain | |
Ou et al. | Speech enhancement employing modified a priori SNR estimation | |
Peng et al. | Research on Underwater Transient Signal Denoising Algorithm Under low SNR | |
YADAV et al. | Distortionless acoustic beamforming with enhanced sparsity based on reweighted ℓ1-norm minimization | |
CN106997766A (en) | A kind of homomorphic filtering sound enhancement method based on broadband noise | |
Jung et al. | Noise Reduction after RIR removal for Speech De-reverberation and De-noising | |
Wang et al. | Time-Frequency Thresholding: A new algorithm in wavelet package speech enhancement | |
Jeong et al. | Kepstrum approach to real-time speech enhancement methods using two microphones | |
CN117435880A (en) | Signal enhancement method in underwater acoustic interference environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||