CN109523999A - Front-end processing method and system for improving far-field speech recognition - Google Patents

Front-end processing method and system for improving far-field speech recognition

Info

Publication number
CN109523999A
Authority
CN
China
Prior art keywords
signal
time domain
energy
early stage
frequency domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811602419.9A
Other languages
Chinese (zh)
Other versions
CN109523999B (en)
Inventor
李军锋
高飞
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS
Priority to CN201811602419.9A
Publication of CN109523999A
Application granted
Publication of CN109523999B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

This application provides a front-end processing method and system for improving far-field speech recognition. The method comprises: computing on a room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal, and truncating out the direct sound signal and the early reverberation signal; convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal; computing separately on the time-domain target signal and on the other signals in the time-domain mixed signal apart from the time-domain target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from these two energies; and, after converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the amplitude of the frequency-domain mixed signal by the ideal ratio mask and reusing the phase of the frequency-domain mixed signal to obtain a reconstructed signal. The invention thereby separates the target signal from mixed speech under noisy, reverberant conditions by means of ideal amplitude masking.

Description

Front-end processing method and system for improving far-field speech recognition
Technical field
The present invention relates to the field of audio signal processing, and in particular to a front-end processing method and system for improving far-field speech recognition.
Background
With the continuous development of speech technology, voice interaction is applied very widely, from national military use down to household and personal use. Applications based on speech recognition, such as smart homes and service robots, are becoming more and more common. In real voice-interaction scenarios, however, ambient noise and room reverberation interfere with the propagation of speech. This interference not only degrades speech quality and intelligibility but also severely harms speech recognition. Separating speech from such interference is therefore particularly important for speech recognition.
Based on research into the auditory masking phenomenon, the ideal binary mask (Ideal Binary Mask, IBM) was proposed for separating target speech from noisy speech. The main idea of the IBM is to retain, according to a local threshold, the time-frequency units in which the target signal energy exceeds the noise energy, and to remove the other time-frequency units. Many studies show that the IBM can improve speech intelligibility and speech quality. The ideal ratio mask (Ideal Ratio Mask, IRM), a soft-decision version of the IBM, retains more speech information and performs better in speech recognition. In a noisy environment, the IRM is computed from the ratio of the clean speech energy to the noisy speech energy. When the scenario changes to a noisy, reverberant environment, current practice still applies the noise-only method. Noise, however, is additive, whereas reverberation is multiplicative and consists of direct sound, early reflections, and late reverberation, so the above method is clearly unsuitable for handling reverberation.
The room impulse response (Room Impulse Response, RIR) is generally used to describe the reverberation characteristics of a room. Research shows that the direct sound and the early reflections in the room impulse response are the parts beneficial to human hearing. Some recent studies take the direct sound and the first 50 ms of early reverberation as the target speech, and their experimental results show that such a mask can effectively improve speech intelligibility and speech quality under noisy, reverberant conditions. However, reverberation varies with the acoustic characteristics of the room environment, and early reflections of different lengths affect speech intelligibility differently, so truncating the first 50 ms regardless of the reverberation time is not a good approach.
Summary of the invention
To solve the above problems, the invention proposes a front-end processing method and system for improving far-field speech recognition.
To achieve the above object, the embodiments of this application adopt the following technical solutions:
In a first aspect, the application provides a front-end processing method for improving far-field speech recognition, comprising: computing on a room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal, and truncating out the direct sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal, and the late reverberation signal; convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal; computing separately on the time-domain target signal and on the other signals in the time-domain mixed signal apart from the time-domain target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from the target signal energy and the other-signal energy, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing the result with a noise signal; and, after converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the amplitude of the frequency-domain mixed signal by the ideal ratio mask and reusing the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
In another possible implementation, computing on the room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal comprises: determining the split time point by computing the echo density function of the room impulse response signal, the echo density function NED being defined as:
$$\mathrm{NED}(t) = \frac{1}{\operatorname{erfc}(1/\sqrt{2})} \sum_{l} \omega(l)\, \mathbf{1}\{|h(l)| > \delta\}$$
where $\operatorname{erfc}(1/\sqrt{2})$ is the expected fraction of samples lying more than one standard deviation from the mean of a Gaussian distribution, $\mathbf{1}\{\cdot\}$ is an indicator function that returns 1 when its argument is true and 0 otherwise, $\omega(l)$ is a weighting function, $h(l)$ is the room impulse response signal, the sum runs over the samples $l$ of the current window, and $\delta$ is the standard deviation of the room impulse response signal within the current window. As the reverberation changes from early reverberation to late reverberation, the NED rises from 0 toward 1, and the split time between the early reverberation signal and the late reverberation signal is defined as the instant at which the NED becomes arbitrarily close to 1.
In another possible implementation, computing on the room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal comprises: computing the split time point by means of kurtosis, based on the diffuse-field assumption for the late reverberation. The kurtosis is the fourth-order moment of a statistical process, and the kurtosis $\gamma_4$ is defined as:
$$\gamma_4 = \frac{E\{(x-\mu)^4\}}{\delta^4} - 3$$
where $E$ denotes the expectation taken over the impulse response $x$, $\mu$ is the mean, and $\delta$ is the standard deviation. The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In another possible implementation, computing on the room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal comprises: computing the split time point from the room characteristics, the time t being defined as:
where V and S are the volume of the room and the surface area of the room, respectively.
In another possible implementation, computing separately on the time-domain target signal and on the other signals in the time-domain mixed signal apart from the time-domain target signal to obtain the target signal energy and the other-signal energy, and obtaining the ideal ratio mask from the target signal energy and the other-signal energy, specifically comprises: applying a Fourier transform to the time-domain target signal and to the other signals respectively, and computing the target signal energy and the other-signal energy; and substituting the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask formula IRM(k, l) is:
where D(k, l) denotes the target signal energy, R(k, l) denotes the energy of the other signals that remains after the target signal energy is removed from the mixed signal energy, k is the frequency band index, and l is the frame index.
In another possible implementation, the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing the result with a noise signal. The time-domain mixed signal is generated as:
$$M(t) = s(t) * h(t) + n(t)$$
where $*$ denotes convolution, $s(t)$ is the clean speech signal, $h(t)$ is the room impulse response signal, $n(t)$ is the noise signal, and $t$ is the time index.
In another possible implementation, converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the amplitude of the frequency-domain mixed signal by the ideal ratio mask, and reusing the phase of the frequency-domain mixed signal to obtain a reconstructed signal specifically comprises:
applying a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal; and
multiplying the amplitude of the frequency-domain mixed signal by the ideal ratio mask and reusing the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal $s'(t)$ being computed as:
$$s'(t) = \mathrm{istft}\{\, |M_f(k,l)| \times \mathrm{IRM}(k,l) \times \exp[\,j \angle M_f(k,l)\,] \,\}$$
where istft denotes the inverse short-time Fourier transform, $M_f(k,l)$ is the frequency-domain mixed signal, $|M_f(k,l)|$ is its amplitude, $\angle M_f(k,l)$ is its phase, $k$ is the frequency band index, and $l$ is the frame index.
In a second aspect, the application provides a front-end processing system for improving far-field speech recognition, comprising: a truncation unit configured to compute on a room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal, and to truncate out the direct sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal, and the late reverberation signal; a first generation unit configured to convolve the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal; a second generation unit configured to compute separately on the time-domain target signal and on the other signals in the time-domain mixed signal apart from the time-domain target signal to obtain the target signal energy and the other-signal energy, and to obtain an ideal ratio mask from the target signal energy and the other-signal energy, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing the result with a noise signal; and a third generation unit configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the amplitude of the frequency-domain mixed signal by the ideal ratio mask, and reuse the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
In another possible implementation, the second generation unit is specifically configured to: apply a Fourier transform to the time-domain target signal and to the other signals respectively, and compute the target signal energy and the other-signal energy; and substitute the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask formula IRM(k, l) is:
where D(k, l) denotes the target signal energy, R(k, l) denotes the energy of the other signals that remains after the target signal energy is removed from the mixed signal energy, k is the frequency band index, and l is the frame index.
In another possible implementation, the third generation unit is specifically configured to:
apply a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal; and
multiply the amplitude of the frequency-domain mixed signal by the ideal ratio mask and reuse the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal $s'(t)$ being computed as:
$$s'(t) = \mathrm{istft}\{\, |M_f(k,l)| \times \mathrm{IRM}(k,l) \times \exp[\,j \angle M_f(k,l)\,] \,\}$$
where istft denotes the inverse short-time Fourier transform, $M_f(k,l)$ is the frequency-domain mixed signal, $|M_f(k,l)|$ is its amplitude, $\angle M_f(k,l)$ is its phase, $k$ is the frequency band index, and $l$ is the frame index.
By computing on room impulse response signals with different acoustic characteristics, the invention truncates out the early reverberation signal, combines the early reverberation signal with ideal ratio masking, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from mixed speech under noisy, reverberant conditions by means of ideal amplitude masking.
Brief description of the drawings
The drawings needed for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition provided by an embodiment of this application;
Fig. 2 is a schematic diagram of the composition of a room impulse response signal provided by an embodiment of this application;
Fig. 3 is a structural block diagram of a front-end processing system for improving far-field speech recognition provided by an embodiment of this application.
Detailed description
Embodiments of this application are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar units or units having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the application and should not be construed as limiting it.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition provided by an embodiment of this application. As shown in Fig. 1, the method is implemented by the following steps:
Step S102: compute on the room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal, and truncate out the direct sound signal and the early reverberation signal.
Preferably, as shown in Fig. 2, the room impulse response signal in this application consists of direct sound, early reverberation, and late reverberation. Because the direct sound and the early reverberation are the parts beneficial to human hearing, the application obtains the split time point between the early reverberation and the late reverberation and then truncates out the direct sound and the early reverberation for processing.
Specifically, after the room impulse response signal is upsampled to a given sampling rate, the split time point between its early reverberation signal and its late reverberation signal is computed, and the direct sound signal and the early reverberation signal are then truncated out.
The room impulse response signal is upsampled to a given sampling rate to make it easier to locate the truncation time of the early reverberation.
Preferably, the application upsamples the room impulse response signal to 48 kHz, which works best, although other sampling rates can of course also be used.
In one embodiment, the split time point between the early reverberation signal and the late reverberation signal of the room impulse response signal is determined by computing the echo density function of the room impulse response signal, the echo density function NED being defined as:
$$\mathrm{NED}(t) = \frac{1}{\operatorname{erfc}(1/\sqrt{2})} \sum_{l} \omega(l)\, \mathbf{1}\{|h(l)| > \delta\}$$
where $\operatorname{erfc}(1/\sqrt{2})$ is the expected fraction of samples lying more than one standard deviation from the mean of a Gaussian distribution, $\mathbf{1}\{\cdot\}$ is an indicator function that returns 1 when its argument is true and 0 otherwise, $\omega(l)$ is a weighting function, $h(l)$ is the room impulse response signal, the sum runs over the samples $l$ of the current window, and $\delta$ is the standard deviation of the room impulse response signal within the current window.
As the reverberation changes from early reverberation to late reverberation, the NED rises from 0 toward 1, and the split time between the early reverberation signal and the late reverberation signal is defined as the instant at which the NED becomes arbitrarily close to 1.
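The echo-density criterion above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the Hann weighting, the window length `win_len`, the zero padding at the edges, the use of the weighted RMS as the window's standard deviation, and the function names are all hypothetical choices, and `threshold` stands in for "arbitrarily close to 1".

```python
import math
import numpy as np

def normalized_echo_density(h, win_len=1025):
    """Weighted fraction of window samples whose magnitude exceeds the
    window's standard deviation, normalized by the Gaussian expectation."""
    h = np.asarray(h, dtype=float)
    w = np.hanning(win_len)              # assumed weighting function
    w /= w.sum()                         # weights sum to 1
    half = win_len // 2
    hp = np.pad(h, half)                 # zero-pad so each sample has a window
    gauss_fraction = math.erfc(1.0 / math.sqrt(2.0))  # about 0.3173
    ned = np.empty(len(h))
    for t in range(len(h)):
        seg = hp[t:t + win_len]
        # Weighted RMS, used as the standard deviation (zero-mean assumption).
        sigma = math.sqrt(np.sum(w * seg ** 2))
        ned[t] = np.sum(w * (np.abs(seg) > sigma)) / gauss_fraction
    return ned

def split_index_from_ned(h, win_len=1025, threshold=1.0):
    """First sample index at which the NED reaches `threshold`, else None."""
    ned = normalized_echo_density(h, win_len)
    hits = np.nonzero(ned >= threshold)[0]
    return int(hits[0]) if hits.size else None
```

Dividing the returned index by the sampling rate of the (upsampled) impulse response gives the split time in seconds. A sparse early part yields values near 0, while a Gaussian-like tail yields values near 1.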
In one embodiment, based on the diffuse-field assumption for the late reverberation, the split time point between the early reverberation signal and the late reverberation signal is computed by means of kurtosis. The kurtosis is the fourth-order moment of a statistical process, and the kurtosis $\gamma_4$ is defined as:
$$\gamma_4 = \frac{E\{(x-\mu)^4\}}{\delta^4} - 3$$
where $E$ denotes the expectation taken over the impulse response $x$, $\mu$ is the mean, and $\delta$ is the standard deviation;
The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
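As a rough illustration of the sliding-window kurtosis criterion, the sketch below computes the excess kurtosis (fourth standardized moment minus 3, so that a Gaussian window gives about 0) and returns the first window in which it drops to a tolerance. The window length, the tolerance `tol`, the treatment of constant windows, and the function names are assumptions made for this example only.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a Gaussian process)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sd = x.std()
    if sd == 0.0:
        # Constant window: treated here as "not yet diffuse" (assumption).
        return float("inf")
    return float(np.mean((x - mu) ** 4) / sd ** 4 - 3.0)

def split_index_from_kurtosis(h, win_len=512, tol=0.0):
    """Start index of the first sliding window whose excess kurtosis
    has dropped to `tol` or below, or None if that never happens."""
    h = np.asarray(h, dtype=float)
    for start in range(len(h) - win_len + 1):
        if excess_kurtosis(h[start:start + win_len]) <= tol:
            return start
    return None
```

Sparse early reflections produce a strongly positive kurtosis, so the first window that reaches zero marks the point where the response starts to look like a diffuse, Gaussian-like tail.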
In one embodiment, the split time point between the early reverberation signal and the late reverberation signal is computed from the room characteristics, the time t being defined as:
where V and S are the volume of the room and the surface area of the room, respectively.
Step S104: convolve the direct sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal.
Preferably, the speech corpus used in this application, Hub5, consists of telephone recordings of English speech. The recruited speakers are connected by a robot operator and talk casually about an everyday topic that the robot operator announces at the start of the call. The sampling frequency of the corpus is 8000 Hz. A clean speech signal here means faithfully recorded speech that has not undergone any processing.
Specifically, the direct sound signal and the early reverberation signal are downsampled to the sampling frequency of the speech signal and then convolved in the time domain with a clean speech signal from the speech corpus to obtain the time-domain target signal.
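A minimal sketch of this step follows, under a simplifying assumption: the room impulse response passed in is already at the speech sampling rate `fs`, so the split time (in seconds) is simply converted to a sample count instead of resampling a 48 kHz response. The function name `early_target` and the choice to trim the output to the length of the speech are illustrative, not taken from the source.

```python
import numpy as np

def early_target(speech, rir, split_time_s, fs):
    """Convolve clean speech with the direct sound plus early reverberation,
    i.e. the room impulse response truncated at the split time."""
    n_split = max(int(round(split_time_s * fs)), 1)   # keep at least one tap
    h_early = np.asarray(rir, dtype=float)[:n_split]
    # Full convolution trimmed to the speech length (an assumed convention).
    return np.convolve(np.asarray(speech, dtype=float), h_early)[:len(speech)]
```

For instance, with `fs = 8000` and a split time of 0.05 s, the first 400 taps of the impulse response are kept as the direct sound and early reverberation.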
Step S106: compute separately on the time-domain target signal and on the other signals in the time-domain mixed signal apart from the time-domain target signal to obtain the target signal energy and the other-signal energy, and obtain the ideal ratio mask from the target signal energy and the other-signal energy.
Preferably, the time-domain mixed signal is obtained by convolving the room impulse response signal with all of the speech in the speech corpus in the time domain and then mixing the result with a noise signal. The time-domain mixed signal is generated as:
$$M(t) = s(t) * h(t) + n(t)$$
where $*$ denotes convolution, $s(t)$ is the clean speech signal, $h(t)$ is the room impulse response signal, $n(t)$ is the noise signal, and $t$ is the time index.
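The mixing formula can be illustrated as follows. The sketch assumes that the noise is at least as long as the speech and that the reverberant speech is trimmed to the speech length; no noise scaling (for example to a target SNR) is applied, because the text does not specify one. The function name `make_mixture` is hypothetical.

```python
import numpy as np

def make_mixture(speech, rir, noise):
    """Reverberant speech (speech convolved with the full room impulse
    response) plus additive noise, trimmed to the speech length."""
    speech = np.asarray(speech, dtype=float)
    reverberant = np.convolve(speech, np.asarray(rir, dtype=float))[:len(speech)]
    return reverberant + np.asarray(noise, dtype=float)[:len(speech)]
```

Subtracting the time-domain target signal from this mixture gives the "other signals" (late reverberation plus noise) whose energy enters the ideal ratio mask.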
The noise signal here refers to the ambient noise in real voice-interaction scenarios. Together with room reverberation it interferes with the propagation of speech; this interference not only degrades speech quality and intelligibility but also affects speech recognition.
Specifically, a Fourier transform is applied to the time-domain target signal and to the other signals in the time-domain mixed signal apart from the time-domain target signal, and the target signal energy D(k, l) and the other-signal energy R(k, l) are computed. The target signal energy D(k, l) and the other-signal energy R(k, l) are then substituted into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask formula IRM(k, l) is:
where D(k, l) denotes the target signal energy, R(k, l) denotes the energy of the other signals that remains after the target signal energy is removed from the mixed signal energy, k is the frequency band index, and l is the frame index.
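Because the mask formula itself is not reproduced in this text, the sketch below uses a commonly seen form of the ideal ratio mask as an assumption: the ratio of target energy to total energy, raised to an exponent `beta`. With `beta=1.0` it is the plain energy ratio; `beta=0.5` is another frequently used choice. The exponent, the small constant `eps`, and the function name are not taken from the source.

```python
import numpy as np

def ideal_ratio_mask(target_energy, other_energy, beta=1.0, eps=1e-12):
    """Assumed IRM form: (D / (D + R)) ** beta, evaluated per time-frequency
    unit. `eps` avoids division by zero where both energies vanish."""
    d = np.asarray(target_energy, dtype=float)   # D(k, l)
    r = np.asarray(other_energy, dtype=float)    # R(k, l)
    return (d / (d + r + eps)) ** beta
```

The mask lies between 0 and 1: units dominated by the target approach 1, and units dominated by late reverberation and noise approach 0.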
Step S108: after converting the time-domain mixed signal into a frequency-domain mixed signal, multiply the amplitude of the frequency-domain mixed signal by the ideal ratio mask and reuse the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
Specifically, a short-time Fourier transform is applied to the time-domain mixed signal to obtain the frequency-domain mixed signal; the amplitude of the frequency-domain mixed signal is then multiplied by the ideal ratio mask, and the phase of the frequency-domain mixed signal is reused to obtain the reconstructed signal. The reconstructed signal $s'(t)$ is computed as:
$$s'(t) = \mathrm{istft}\{\, |M_f(k,l)| \times \mathrm{IRM}(k,l) \times \exp[\,j \angle M_f(k,l)\,] \,\}$$
where istft denotes the inverse short-time Fourier transform, $M_f(k,l)$ is the frequency-domain mixed signal, $|M_f(k,l)|$ is its amplitude, $\angle M_f(k,l)$ is its phase, $k$ is the frequency band index, and $l$ is the frame index.
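The reconstruction step can be sketched with a small NumPy-only STFT. The analysis settings here (square-root Hann window, 50% overlap, `n_fft=256`, zero padding at both ends) and all function names are assumptions chosen so that the transform pair reconstructs its input exactly; the source does not specify them. Note that the arrays are laid out as (frame, frequency bin), i.e. indexed as [l, k].

```python
import numpy as np

def _sqrt_hann(n_fft):
    # Periodic Hann window; its square root is used for analysis and synthesis
    # so that overlapping squared windows sum to 1 at 50% overlap.
    return np.sqrt(np.hanning(n_fft + 1)[:-1])

def stft(x, n_fft=256):
    """Short-time Fourier transform, output shape (frames, n_fft // 2 + 1)."""
    hop = n_fft // 2
    x = np.asarray(x, dtype=float)
    xp = np.pad(x, (hop, hop + (-len(x)) % hop))   # cover every sample twice
    win = _sqrt_hann(n_fft)
    n_frames = (len(xp) - n_fft) // hop + 1
    frames = np.stack([xp[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=256):
    """Inverse of `stft` by windowed overlap-add, trimmed to `length`."""
    hop = n_fft // 2
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * _sqrt_hann(n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out[hop:hop + length]

def reconstruct(mixture, irm, n_fft=256):
    """Masked magnitude combined with the mixture phase, then inverted."""
    spec = stft(mixture, n_fft)
    masked = np.abs(spec) * irm * np.exp(1j * np.angle(spec))
    return istft(masked, len(mixture), n_fft)
```

A mask of all ones returns the mixture unchanged, which is a convenient sanity check before applying a real mask of the same shape as `stft(mixture)`.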
By computing on room impulse response signals with different acoustic characteristics, the invention truncates out the early reverberation signal, combines the early reverberation signal with ideal ratio masking, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from mixed speech under noisy, reverberant conditions by means of ideal amplitude masking.
Fig. 3 is a structural block diagram of a front-end processing system for improving far-field speech recognition provided by an embodiment of this application. As shown in Fig. 3, the system comprises a truncation unit 301, a first generation unit 302, a second generation unit 303, and a third generation unit 304.
The truncation unit 301 is configured to compute on the room impulse response signal to obtain the split time point between the early reverberation signal and the late reverberation signal, and to truncate out the direct sound signal and the early reverberation signal.
Here, after the room impulse response signal is upsampled to a given sampling rate, the split time point between its early reverberation signal and its late reverberation signal is computed, and the direct sound signal and the early reverberation signal are then truncated out.
In one embodiment, the split time point between the early reverberation signal and the late reverberation signal of the room impulse response signal is determined by computing the echo density function of the room impulse response signal, the echo density function NED being defined as:
$$\mathrm{NED}(t) = \frac{1}{\operatorname{erfc}(1/\sqrt{2})} \sum_{l} \omega(l)\, \mathbf{1}\{|h(l)| > \delta\}$$
where $\operatorname{erfc}(1/\sqrt{2})$ is the expected fraction of samples lying more than one standard deviation from the mean of a Gaussian distribution, $\mathbf{1}\{\cdot\}$ is an indicator function that returns 1 when its argument is true and 0 otherwise, $\omega(l)$ is a weighting function, $h(l)$ is the room impulse response signal, the sum runs over the samples $l$ of the current window, and $\delta$ is the standard deviation of the room impulse response signal within the current window.
As the reverberation changes from early reverberation to late reverberation, the NED rises from 0 toward 1, and the split time between the early reverberation signal and the late reverberation signal is defined as the instant at which the NED becomes arbitrarily close to 1.
In one embodiment, based on the diffuse-field assumption for the late reverberation, the split time point between the early reverberation signal and the late reverberation signal is computed by means of kurtosis. The kurtosis is the fourth-order moment of a statistical process, and the kurtosis $\gamma_4$ is defined as:
$$\gamma_4 = \frac{E\{(x-\mu)^4\}}{\delta^4} - 3$$
where $E$ denotes the expectation taken over the impulse response $x$, $\mu$ is the mean, and $\delta$ is the standard deviation;
The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In one embodiment, the split time point between the early reverberation signal and the late reverberation signal is computed from the room characteristics, the time t being defined as:
where V and S are the volume of the room and the surface area of the room, respectively.
The first generation unit 302 is configured to convolve the direct sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal.
Here, the direct sound signal and the early reverberation signal are downsampled to the sampling frequency of the speech signal and then convolved in the time domain with a clean speech signal from the speech corpus to obtain the time-domain target signal.
The second generation unit 303 is configured to compute separately on the time-domain target signal and on the other signals in the time-domain mixed signal apart from the time-domain target signal to obtain the target signal energy and the other-signal energy, and to obtain the ideal ratio mask from the target signal energy and the other-signal energy.
Preferably, the time-domain mixed signal is obtained by convolving the room impulse response signal with all of the speech in the speech corpus in the time domain and then mixing the result with a noise signal. The time-domain mixed signal is generated as:
$$M(t) = s(t) * h(t) + n(t)$$
where $*$ denotes convolution, $s(t)$ is the clean speech signal, $h(t)$ is the room impulse response signal, $n(t)$ is the noise signal, and $t$ is the time index.
Here, a Fourier transform is applied to the time-domain target signal and to the other signals in the time-domain mixed signal apart from the time-domain target signal, and the target signal energy D(k, l) and the other-signal energy R(k, l) are computed. The target signal energy D(k, l) and the other-signal energy R(k, l) are then substituted into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask formula IRM(k, l) is:
where D(k, l) denotes the target signal energy, R(k, l) denotes the energy of the other signals that remains after the target signal energy is removed from the mixed signal energy, k is the frequency band index, and l is the frame index.
The third generation unit 304 is configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the amplitude of the frequency-domain mixed signal by the ideal ratio mask, and reuse the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
Here, a short-time Fourier transform is applied to the time-domain mixed signal to obtain the frequency-domain mixed signal; the amplitude of the frequency-domain mixed signal is then multiplied by the ideal ratio mask, and the phase of the frequency-domain mixed signal is reused to obtain the reconstructed signal. The reconstructed signal $s'(t)$ is computed as:
$$s'(t) = \mathrm{istft}\{\, |M_f(k,l)| \times \mathrm{IRM}(k,l) \times \exp[\,j \angle M_f(k,l)\,] \,\}$$
where istft denotes the inverse short-time Fourier transform, $M_f(k,l)$ is the frequency-domain mixed signal, $|M_f(k,l)|$ is its amplitude, $\angle M_f(k,l)$ is its phase, $k$ is the frequency band index, and $l$ is the frame index.
The present invention analyzes room impulse response signals with different acoustic characteristics, extracts the early reverberation signal, and then combines the early reverberation signal with ideal ratio masking and applies the result to the mixed speech signal to obtain a reconstructed signal, so that the target signal is separated from mixed speech under noisy, reverberant conditions by ideal amplitude masking.
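How the target signal and the remaining signals might be derived from a truncated impulse response can be sketched as follows. The split index `split_idx` is assumed to be already known, the energies are taken as squared STFT magnitudes, and the function name and the `fs` and `nperseg` defaults are hypothetical; none of these details are fixed by the text.

```python
import numpy as np
from scipy.signal import stft

def target_and_other_energy(s, h, n, split_idx, fs=16000, nperseg=512):
    """Return (D, R): target and other-signal energy per band and frame.

    split_idx is the sample index separating early from late reverberation
    and is assumed to have been estimated beforehand.
    """
    h_early = h[:split_idx]                        # direct sound + early reverberation
    d = np.convolve(s, h_early)[: len(s)]          # time-domain target signal
    m = np.convolve(s, h)[: len(s)] + n[: len(s)]  # time-domain mixed signal
    r = m - d                                      # mixture minus the target
    _, _, Df = stft(d, fs=fs, nperseg=nperseg)
    _, _, Rf = stft(r, fs=fs, nperseg=nperseg)
    return np.abs(Df) ** 2, np.abs(Rf) ** 2

rng = np.random.default_rng(2)
s = rng.standard_normal(4096)
h = rng.standard_normal(800) * np.exp(-np.arange(800) / 100.0)  # toy decaying response
n = np.zeros(4096)
D_full, R_full = target_and_other_energy(s, h, n, split_idx=len(h))
D_early, R_early = target_and_other_energy(s, h, n, split_idx=200)
```

When the whole response is kept and no noise is added, the other-signal energy is zero; truncating the response moves the late reverberation into R.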
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A front-end processing method for improving far-field speech recognition, characterized by comprising:
performing calculations on a room impulse response signal to obtain the segmentation time point between the early reverberation signal and the late reverberation signal, and extracting the direct sound signal and the early reverberation signal, wherein the room impulse response signal consists, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal;
convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal;
performing calculations separately on the time-domain target signal and on the other signals in a time-domain mixed signal apart from the time-domain target signal to obtain a target signal energy and an other-signal energy, and obtaining an ideal ratio mask from the target signal energy and the other-signal energy, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing the result with a noise signal; and
converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the amplitude of the frequency-domain mixed signal by the ideal ratio mask, and reusing the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
2. The method according to claim 1, characterized in that performing calculations on the room impulse response signal to obtain the segmentation time point between the early reverberation signal and the late reverberation signal comprises:
determining the segmentation time point between the early reverberation signal and the late reverberation signal of the room impulse response signal by calculating the echo density function of the room impulse response signal, the echo density function NED being defined as:
where erfc(1/√2) is the expected fraction of samples lying more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weighting function, and δ is the standard deviation of the room impulse response signal in the current window;
as the reverberation changes from early reverberation to late reverberation, the NED rises from 0 towards 1, and the segmentation time between the early reverberation signal and the late reverberation signal is defined as the moment at which the standard deviation of the late reverberation signal is arbitrarily close to 1.
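A windowed echo density profile of this kind can be sketched as below. The NED expression is not shown in the claim, so the code follows the commonly published definition: the weighted fraction of samples whose magnitude exceeds the local standard deviation, divided by erfc(1/√2) ≈ 0.3173. The Hann weighting, the window length, and the function name are assumptions made for illustration.

```python
import math
import numpy as np

def normalized_echo_density(h, win_len=512):
    """Sliding-window echo density profile of an impulse response h.

    Samples closer than win_len // 2 to either end are left at 0.
    The Hann weighting and win_len are illustrative assumptions.
    """
    h = np.asarray(h, dtype=float)
    w = np.hanning(win_len)
    w = w / w.sum()                              # weights sum to 1
    expected = math.erfc(1.0 / math.sqrt(2.0))   # Gaussian fraction beyond one std
    half = win_len // 2
    ned = np.zeros(len(h))
    for t in range(half, len(h) - half):
        seg = h[t - half : t - half + win_len]
        sigma = math.sqrt(np.sum(w * seg ** 2))  # weighted standard deviation
        ned[t] = np.sum(w * (np.abs(seg) > sigma)) / expected
    return ned

rng = np.random.default_rng(3)
gaussian_tail = rng.standard_normal(4096)   # stands in for late reverberation
single_echo = np.zeros(4096)
single_echo[1000] = 1.0                     # stands in for a sparse early part
ned_gauss = normalized_echo_density(gaussian_tail)
ned_sparse = normalized_echo_density(single_echo)
```

For Gaussian-like late reverberation the profile hovers around 1, while an isolated reflection yields values near 0. A segmentation point could then be taken as the first time the profile reaches a chosen threshold near 1, with the threshold itself being a design choice.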
3. The method according to claim 1, characterized in that performing calculations on the room impulse response signal to obtain the segmentation time point between the early reverberation signal and the late reverberation signal comprises:
calculating, based on the diffuse sound field assumption for the late reverberation, the segmentation time point between the early reverberation signal and the late reverberation signal by means of kurtosis, the kurtosis being the fourth-order moment of a statistical process, and the kurtosis γ4 being defined as:
where E is the expectation operator applied to the impulse response x, μ is the mean, and δ is the standard deviation;
the segmentation time is defined as the time instant at which the kurtosis calculated in a sliding window reaches zero.
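A sliding-window kurtosis can be sketched as follows. The kurtosis expression is not shown in the claim, so the code assumes the excess kurtosis E[(x − μ)^4] / δ^4 − 3, which is zero for a Gaussian process and is therefore consistent with the statement that the segmentation time is where the kurtosis reaches zero. The window length, hop size, and function name are hypothetical.

```python
import numpy as np

def sliding_excess_kurtosis(x, win_len=4096, hop=1024):
    """Excess kurtosis of x in successive windows (assumed definition).

    Returns one value per window; windows with zero variance yield NaN.
    """
    x = np.asarray(x, dtype=float)
    values = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = x[start : start + win_len]
        mu = seg.mean()
        std = seg.std()
        if std == 0.0:
            values.append(np.nan)
            continue
        values.append(np.mean((seg - mu) ** 4) / std ** 4 - 3.0)
    return np.array(values)

rng = np.random.default_rng(4)
k_gauss = sliding_excess_kurtosis(rng.standard_normal(20000))
sparse = np.zeros(4096)
sparse[::512] = 1.0                          # a few isolated reflections
k_sparse = sliding_excess_kurtosis(sparse)
```

Sparse early reflections produce a large positive kurtosis, and Gaussian-like late reverberation produces values close to zero, so the first window whose value falls to approximately zero marks a candidate segmentation time.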
4. The method according to claim 1, characterized in that performing calculations on the room impulse response signal to obtain the segmentation time point between the early reverberation signal and the late reverberation signal comprises:
calculating the segmentation time point between the early reverberation signal and the late reverberation signal from the room characteristics, the time t being defined as:
where V and S are the volume of the room and the surface area of the room, respectively.
5. The method according to claim 1, characterized in that performing calculations separately on the time-domain target signal and on the other signals in the time-domain mixed signal apart from the time-domain target signal to obtain the target signal energy and the other-signal energy, and obtaining the ideal ratio mask from the target signal energy and the other-signal energy, specifically comprises:
applying a Fourier transform to the time-domain target signal and to the other signals separately, and calculating the target signal energy and the other-signal energy; and
substituting the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask, the ideal ratio mask formula IRM(k, l) being:
where D(k, l) denotes the target signal energy, R(k, l) denotes the other-signal energy, that is, the mixed signal energy with the target signal energy removed, k is the frequency band index, and l is the frame index.
6. The method according to claim 1, characterized in that the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing the result with a noise signal, the time-domain mixed signal being generated as:
$$m(t) = s(t) * h(t) + n(t)$$
where m(t) denotes the time-domain mixed signal, s(t) the clean speech signal, h(t) the room impulse response signal, n(t) the noise signal, t the time index, and * time-domain convolution.
7. The method according to claim 1, characterized in that converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the amplitude of the frequency-domain mixed signal by the ideal ratio mask, and reusing the phase of the frequency-domain mixed signal to obtain a reconstructed signal, specifically comprises:
applying a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal; and
multiplying the amplitude of the frequency-domain mixed signal by the ideal ratio mask and reusing the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s'(t) being calculated as:
$$s'(t) = \mathrm{istft}\{ M(k,l) \times \mathrm{IRM}(k,l) \times \exp[\, j \angle M_f(k,l) \,] \}$$
where istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M_f(k, l) denotes the phase of the frequency-domain mixed signal, k is the frequency band index, and l is the frame index.
8. A front-end processing system for improving far-field speech recognition, comprising:
an interception unit, configured to perform calculations on a room impulse response signal to obtain the segmentation time point between the early reverberation signal and the late reverberation signal, and to extract the direct sound signal and the early reverberation signal, wherein the room impulse response signal consists, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal;
a first generation unit, configured to convolve the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal;
a second generation unit, configured to perform calculations separately on the time-domain target signal and on the other signals in a time-domain mixed signal apart from the time-domain target signal to obtain a target signal energy and an other-signal energy, and to obtain an ideal ratio mask from the target signal energy and the other-signal energy, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing the result with a noise signal; and
a third generation unit, configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the amplitude of the frequency-domain mixed signal by the ideal ratio mask, and reuse the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
9. The system according to claim 8, characterized in that the second generation unit is specifically configured to:
apply a Fourier transform to the time-domain target signal and to the other signals separately, and calculate the target signal energy and the other-signal energy; and
substitute the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask, the ideal ratio mask formula IRM(k, l) being:
where D(k, l) denotes the target signal energy, R(k, l) denotes the other-signal energy, that is, the mixed signal energy with the target signal energy removed, k is the frequency band index, and l is the frame index.
10. The system according to claim 8, characterized in that the third generation unit is specifically configured to:
apply a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal; and
multiply the amplitude of the frequency-domain mixed signal by the ideal ratio mask and reuse the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s'(t) being calculated as:
$$s'(t) = \mathrm{istft}\{ M(k,l) \times \mathrm{IRM}(k,l) \times \exp[\, j \angle M_f(k,l) \,] \}$$
where istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M_f(k, l) denotes the phase of the frequency-domain mixed signal, k is the frequency band index, and l is the frame index.
CN201811602419.9A 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition Active CN109523999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811602419.9A CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Publications (2)

Publication Number Publication Date
CN109523999A true CN109523999A (en) 2019-03-26
CN109523999B CN109523999B (en) 2021-03-23

Family

ID=65797174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811602419.9A Active CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Country Status (1)

Country Link
CN (1) CN109523999B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090122999A1 (en) * 2007-11-13 2009-05-14 Samsung Electronics Co., Ltd Method of improving acoustic properties in music reproduction apparatus and recording medium and music reproduction apparatus suitable for the method
CN105427859A (en) * 2016-01-07 2016-03-23 深圳市音加密科技有限公司 Front voice enhancement method for identifying speaker
CN105427860A (en) * 2015-11-11 2016-03-23 百度在线网络技术(北京)有限公司 Far field voice recognition method and device
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system
CN108389586A (en) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 A kind of long-range audio collecting device, monitoring device and long-range collection sound method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428852A (en) * 2019-08-09 2019-11-08 南京人工智能高等研究院有限公司 Speech separating method, device, medium and equipment
CN110428852B (en) * 2019-08-09 2021-07-16 南京人工智能高等研究院有限公司 Voice separation method, device, medium and equipment
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN111768796B (en) * 2020-07-14 2024-05-03 中国科学院声学研究所 Acoustic echo cancellation and dereverberation method and device
CN111768796A (en) * 2020-07-14 2020-10-13 中国科学院声学研究所 Acoustic echo cancellation and dereverberation method and device
CN112201262A (en) * 2020-09-30 2021-01-08 珠海格力电器股份有限公司 Sound processing method and device
CN112201262B (en) * 2020-09-30 2024-05-31 珠海格力电器股份有限公司 Sound processing method and device
CN112201229A (en) * 2020-10-09 2021-01-08 百果园技术(新加坡)有限公司 Voice processing method, device and system
CN112201229B (en) * 2020-10-09 2024-05-28 百果园技术(新加坡)有限公司 Voice processing method, device and system
CN112652290A (en) * 2020-12-14 2021-04-13 北京达佳互联信息技术有限公司 Method for generating reverberation audio signal and training method of audio processing model
CN112735461A (en) * 2020-12-29 2021-04-30 西安讯飞超脑信息科技有限公司 Sound pickup method, related device and equipment
CN112735461B (en) * 2020-12-29 2024-06-07 西安讯飞超脑信息科技有限公司 Pickup method, and related device and equipment
CN113643714B (en) * 2021-10-14 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN113643714A (en) * 2021-10-14 2021-11-12 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
WO2023093477A1 (en) * 2021-11-25 2023-06-01 广州视源电子科技股份有限公司 Speech enhancement model training method and apparatus, storage medium, and device

Also Published As

Publication number Publication date
CN109523999B (en) 2021-03-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant