CN109523999A - Front-end processing method and system for improving far-field speech recognition - Google Patents
Front-end processing method and system for improving far-field speech recognition
- Publication number
- CN109523999A (application CN201811602419.9A)
- Authority
- CN
- China
- Prior art keywords
- signal
- time domain
- energy
- early stage
- frequency domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0272—Voice signal separating
- G10L2021/02082—Noise filtering, the noise being echo or reverberation of the speech
Abstract
This application provides a front-end processing method and system for improving far-field speech recognition. The method comprises: computing, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncating the response to the direct-sound signal and the early reverberation signal; convolving the direct-sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain the time-domain target signal; computing the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the target signal, and obtaining the ideal ratio mask from these two energies; and, after converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal. The invention thereby separates the target signal from speech mixed under noisy, reverberant conditions by means of ideal ratio masking.
Description
Technical field
The present invention relates to the field of audio signal processing, and in particular to a front-end processing method and system for improving far-field speech recognition.
Background technique
With the continuous development of speech technology, voice interaction is now used very widely, from national defense down to personal household applications. Applications based on speech recognition, such as smart homes and service robots, are increasingly common. In real voice-interaction scenes, however, environmental noise and room reverberation interfere with the propagation of speech; these interferences not only degrade speech quality and intelligibility but are also very harmful to speech recognition. Separating the speech from these interferences is therefore particularly important for speech recognition.
Building on research into auditory masking, the ideal binary mask (Ideal Binary Mask, IBM) was proposed to separate target speech from noisy speech. The main idea of the IBM is to retain, via a local threshold, the time-frequency units in which the target signal is stronger than the noisy signal, and to discard the other units. Many studies have shown that the IBM can improve both speech intelligibility and speech quality. The ideal ratio mask (Ideal Ratio Mask, IRM), a soft-decision counterpart of the IBM, retains more of the speech information and performs better for speech recognition. In a noise-only environment, the IRM is computed from the ratio of the clean-speech energy to the noisy-speech energy. When the scene changes to a reverberant, noisy environment, current practice still applies the noise-only method. Yet noise is additive while reverberation is multiplicative, consisting of the direct sound, early reflections, and late reverberation; treating reverberation with the method above is clearly unreasonable.
The reverberation characteristics of a room are generally described by its room impulse response (Room Impulse Response, RIR). Research shows that the direct sound and early reflections of the room impulse response are the components beneficial to human hearing, and some recent studies take the direct sound plus the first 50 ms of early reverberation as the target speech. Experimental results show that this separation effectively improves speech intelligibility and quality under noisy, reverberant conditions. But reverberation varies with the acoustic characteristics of the room environment, and early reflections of different lengths affect intelligibility differently; truncating at a fixed 50 ms for every reverberation time is not a good approach.
Summary of the invention
To solve the above problems, the invention proposes a front-end processing method and system for improving far-field speech recognition.
To achieve the above object, the embodiments of this application adopt the following technical solutions.
In a first aspect, this application provides a front-end processing method for improving far-field speech recognition, comprising: computing, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncating the response to the direct-sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct-sound signal, the early reverberation signal, and the late reverberation signal; convolving the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal; computing the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtaining the ideal ratio mask from the target band energy and the other-signal energy, the time-domain mixture being obtained by convolving the room impulse response signal with speech from the corpus in the time domain and then mixing with a noise signal; and converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask, and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal.
In another possible implementation, computing the split time between the early reverberation signal and the late reverberation signal from the room impulse response signal comprises: determining the split time by computing the normalized echo density of the room impulse response signal, defined as
NED(l) = (1 / erfc(1/√2)) · Σ_n ω(n) · 1{ |h(n)| > δ },
where erfc(1/√2) ≈ 0.3173 is the fraction of samples of a Gaussian distribution expected to lie more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(n) is a weighting function, h(n) is the room impulse response, and δ is the standard deviation of the room impulse response within the current window. As the reverberation evolves from early reflections to late reverberation, NED rises from near 0 toward 1; the split time between the early and late reverberation signals is defined as the instant at which NED becomes arbitrarily close to 1.
In another possible implementation, computing the split time between the early reverberation signal and the late reverberation signal from the room impulse response signal comprises: under the diffuse-field assumption for the late reverberation, computing the split time by means of the kurtosis, the fourth standardized moment of the process, defined as
γ4 = E[(x - μ)^4] / δ^4 - 3,
where E denotes the expectation over the impulse response x, μ is its mean, and δ its standard deviation. The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In another possible implementation, computing the split time between the early reverberation signal and the late reverberation signal from the room impulse response signal comprises: computing the split time from the room's characteristics, the time t being defined as a function of V and S, respectively the volume of the room and the surface area of the room.
In another possible implementation, computing the target band energy and the other-signal energy from the time-domain target signal and from the components of the time-domain mixture other than the time-domain target signal, and obtaining the ideal ratio mask from them, specifically comprises: applying a Fourier transform to the time-domain target signal and to the other signals respectively, yielding the target band energy and the other-signal energy; and substituting the target band energy and the other-signal energy into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
In another possible implementation, the time-domain mixture is obtained by convolving the room impulse response signal with speech from the corpus in the time domain and then mixing with a noise signal, generated as
m(t) = s(t) * h(t) + n(t),
where * denotes convolution, s(t) is the clean speech signal, h(t) the room impulse response signal, n(t) the noise signal, and t the time index.
In another possible implementation, converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask, and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal specifically comprises:
applying a short-time Fourier transform to the time-domain mixture to obtain the frequency-domain mixture;
multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask and reusing the phase of the frequency-domain mixture, the reconstructed signal s'(t) being computed as
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] },
where istft denotes the inverse short-time Fourier transform, M(k, l) is the frequency-domain mixture, ∠M(k, l) its phase, k the band index, and l the frame index.
In a second aspect, this application provides a front-end processing system for improving far-field speech recognition, comprising: a truncation unit for computing, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncating the response to the direct-sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct-sound signal, the early reverberation signal, and the late reverberation signal; a first generation unit for convolving the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal; a second generation unit for computing the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtaining the ideal ratio mask from the target band energy and the other-signal energy, the time-domain mixture being obtained by convolving the room impulse response signal with speech from the corpus in the time domain and then mixing with a noise signal; and a third generation unit for converting the time-domain mixture into the frequency-domain mixture, multiplying the magnitude of the frequency-domain mixture by the ideal ratio mask, and reusing the phase of the frequency-domain mixture to obtain the reconstructed signal.
In another possible implementation, the second generation unit is specifically configured to apply a Fourier transform to the time-domain target signal and to the other signals respectively, yielding the target band energy and the other-signal energy, and to substitute the target band energy and the other-signal energy into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
In another possible implementation, the third generation unit is specifically configured to apply a short-time Fourier transform to the time-domain mixture to obtain the frequency-domain mixture, and then to multiply the magnitude of the frequency-domain mixture by the ideal ratio mask and reuse the phase of the frequency-domain mixture, the reconstructed signal s'(t) being computed as
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] },
where istft denotes the inverse short-time Fourier transform, M(k, l) is the frequency-domain mixture, ∠M(k, l) its phase, k the band index, and l the frame index.
The present invention computes the split point of room impulse response signals of differing acoustic characteristics, truncates the early reverberation signal, combines it with ideal ratio masking, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from speech mixed under noisy, reverberant conditions by ideal ratio masking.
Brief description of the drawings
The drawings required for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition provided by an embodiment of this application;
Fig. 2 is a schematic diagram of the composition of a room impulse response signal provided by an embodiment of this application;
Fig. 3 is a structural block diagram of a front-end processing system for improving far-field speech recognition provided by an embodiment of this application.
Specific embodiment
The embodiments of this application are described in detail below, with examples shown in the accompanying drawings, in which identical or similar labels throughout denote identical or similar units, or units with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain this application and should not be understood as limiting it.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition provided by an embodiment of this application. The method shown in Fig. 1 is implemented in the following steps.
Step S102: compute, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncate the response to the direct-sound signal and the early reverberation signal.
As shown in Fig. 2, a room impulse response signal actually consists of the direct sound, the early reverberation, and the late reverberation, of which the direct sound and early reverberation are the components beneficial to human hearing. This application mainly obtains the split time between the early and late reverberation and then truncates the response to the direct sound and the early reverberation for processing.
Specifically, after upsampling the room impulse response signal to a chosen frequency, the split time between its early and late reverberation signals is computed, and the response is truncated to the direct-sound signal and the early reverberation signal. The room impulse response signal is upsampled to make it convenient to locate the truncation time of the early reverberation.
Preferably, upsampling the room impulse response signal to 48 kHz works best in this application, although certain other frequencies are also feasible.
In one embodiment, the split time between the early and late reverberation signals is determined by computing the normalized echo density of the room impulse response signal, defined as
NED(l) = (1 / erfc(1/√2)) · Σ_n ω(n) · 1{ |h(n)| > δ },
where erfc(1/√2) ≈ 0.3173 is the fraction of samples of a Gaussian distribution expected to lie more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(n) is a weighting function, h(n) is the room impulse response, and δ is the standard deviation of the room impulse response within the current window.
As the reverberation evolves from early to late, NED rises from near 0 toward 1; the split time between the early and late reverberation signals is defined as the instant at which NED becomes arbitrarily close to 1.
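As a concrete illustration, the NED profile and the resulting split point can be sketched in a few lines. This is a minimal pure-Python sketch, not the patent's implementation; the window length and the threshold used to decide that NED has "approached 1" are illustrative choices.

```python
import math

def normalized_echo_density(h, win=64):
    """NED profile of an impulse response h (a list of floats).

    For each sliding window, count the fraction of samples whose magnitude
    exceeds the window's standard deviation, and normalize by erfc(1/sqrt(2))
    (~0.3173), the fraction expected for Gaussian data. NED stays near 0 over
    sparse early reflections and rises toward 1 over diffuse late reverberation.
    """
    gauss_frac = math.erfc(1.0 / math.sqrt(2.0))
    profile = []
    for start in range(len(h) - win + 1):
        frame = h[start:start + win]
        mean = sum(frame) / win
        std = math.sqrt(sum((x - mean) ** 2 for x in frame) / win)
        if std == 0.0:
            profile.append(0.0)
            continue
        outside = sum(1 for x in frame if abs(x) > std)
        profile.append((outside / win) / gauss_frac)
    return profile

def ned_split_point(profile, threshold=0.9):
    """First window index at which NED has (approximately) approached 1."""
    for i, value in enumerate(profile):
        if value >= threshold:
            return i
    return None
```

On a synthetic response whose first half is a few isolated reflections and whose second half is Gaussian noise, the profile stays low early, approaches 1 late, and the split point lands near the transition.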
In one embodiment, under the diffuse-field assumption for the late reverberation, the split time between the early and late reverberation signals is computed by means of the kurtosis, the fourth standardized moment of the process, defined as
γ4 = E[(x - μ)^4] / δ^4 - 3,
where E denotes the expectation over the impulse response x, μ is its mean, and δ its standard deviation.
The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
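The kurtosis-based split can be sketched similarly. The tolerance used to decide that the sliding-window kurtosis has "reached zero" is an illustrative choice: the sample kurtosis of a finite Gaussian window fluctuates around zero rather than hitting it exactly.

```python
def excess_kurtosis(frame):
    """Fourth standardized moment minus 3 of a window (a list of floats).

    Sparse, spiky early reflections are heavy-tailed (large positive values);
    diffuse late reverberation is approximately Gaussian (values near zero).
    """
    n = len(frame)
    mu = sum(frame) / n
    var = sum((x - mu) ** 2 for x in frame) / n
    if var == 0.0:
        return 0.0
    m4 = sum((x - mu) ** 4 for x in frame) / n
    return m4 / (var * var) - 3.0

def kurtosis_split_point(h, win=256, hop=64, tol=0.5):
    """First window start at which the sliding kurtosis has fallen to ~0."""
    for start in range(0, len(h) - win + 1, hop):
        if excess_kurtosis(h[start:start + win]) <= tol:
            return start
    return None
```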
In one embodiment, the split time between the early and late reverberation signals is computed from the room's characteristics, the time t being defined as a function of V and S, respectively the volume of the room and the surface area of the room.
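The geometry-based variant can be illustrated as follows. The patent's own formula for t is not reproduced in this text; the linear model used below, t = 20·V/S + 12 ms, is an assumption borrowed from the perceptual mixing-time literature and serves only to show a split time computed from the same two quantities V and S.

```python
def mixing_time_ms(volume_m3, surface_m2):
    """Geometry-based estimate of the early/late split time in milliseconds.

    Assumption: the linear perceptual mixing-time model t = 20*V/S + 12,
    where V is the room volume and S its total surface area. Larger rooms
    with relatively less boundary surface mix later.
    """
    return 20.0 * volume_m3 / surface_m2 + 12.0
```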
Step S104: convolve the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal.
Preferably, the speech corpus used in this application is Hub5, a corpus of English telephone recordings: recruited speakers were connected through a robot operator and, at the start of each call, chatted freely on everyday topics announced by the operator. The sampling frequency of this corpus is 8000 Hz. A clean speech signal means a faithful recording of speech without any further processing.
Specifically, the direct-sound signal and the early reverberation signal are downsampled to the sampling frequency of the speech signal and then convolved in the time domain with a clean speech signal from the corpus, yielding the time-domain target signal.
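This step can be sketched as truncating the impulse response at the split index and convolving the retained part (direct sound plus early reverberation) with the clean utterance. Direct-form convolution is used for clarity; a real implementation would use FFT-based convolution, and the resampling described above is omitted here.

```python
def convolve(x, h):
    """Direct-form time-domain convolution y = x * h (lists of floats)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def time_domain_target(clean, rir, split_idx):
    """Time-domain target signal: clean speech convolved with the room
    impulse response truncated at the early/late split index, i.e. with
    only the direct sound and early reverberation retained."""
    return convolve(clean, rir[:split_idx])
```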
Step S106: compute the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtain the ideal ratio mask from the target band energy and the other-signal energy.
Preferably, the time-domain mixture is obtained by convolving the room impulse response signal with whole utterances from the corpus in the time domain and then mixing with a noise signal, generated as
m(t) = s(t) * h(t) + n(t),
where * denotes convolution, s(t) is the clean speech signal, h(t) the room impulse response signal, n(t) the noise signal, and t the time index.
The noise signal here refers to the environmental noise of a real voice-interaction scene. Together with room reverberation, it interferes with the propagation of speech; such interference not only degrades speech quality and intelligibility but also affects speech recognition.
Specifically, a Fourier transform is applied to the time-domain target signal and to the components of the time-domain mixture other than the time-domain target signal, yielding the target band energy D(k, l) and the other-signal energy R(k, l); these are then substituted into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
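With the two energy maps in hand, the mask itself is an elementwise ratio. A small sketch follows; the epsilon guarding silent units is an implementation assumption, and some IRM formulations additionally take a square root of this ratio, which is omitted here.

```python
def ideal_ratio_mask(target_energy, other_energy):
    """IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)) per time-frequency unit.

    target_energy and other_energy are lists of frames, each frame a list
    of nonnegative band energies. Values near 1 keep a unit dominated by
    the target; values near 0 suppress units dominated by noise and late
    reverberation.
    """
    eps = 1e-12  # guard against 0/0 in silent units
    return [[d / (d + r + eps) for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(target_energy, other_energy)]
```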
Step S108: convert the time-domain mixture into the frequency-domain mixture, multiply the magnitude of the frequency-domain mixture by the ideal ratio mask, and reuse the phase of the frequency-domain mixture to obtain the reconstructed signal.
Specifically, a short-time Fourier transform is applied to the time-domain mixture to obtain the frequency-domain mixture; the magnitude of the frequency-domain mixture is then multiplied by the ideal ratio mask, and the phase of the frequency-domain mixture is reused, yielding the reconstructed signal
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] },
where istft denotes the inverse short-time Fourier transform, M(k, l) is the frequency-domain mixture, ∠M(k, l) its phase, k the band index, and l the frame index.
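For a single frame, the magnitude-mask-plus-mixture-phase reconstruction can be sketched with a plain DFT standing in for the STFT. This assumes a rectangular window and one frame; a real implementation would use windowed, overlapping frames with overlap-add synthesis.

```python
import cmath

def dft(frame):
    """Naive DFT of one real frame (O(n^2), for illustration only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each time sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def reconstruct_frame(mixture_frame, mask):
    """One frame of s'(t) = istft{ |M| * IRM * exp(j*angle(M)) }:
    keep the mixture's phase, scale its magnitude by the mask, invert."""
    spectrum = dft(mixture_frame)
    masked = [abs(m) * g * cmath.exp(1j * cmath.phase(m))
              for m, g in zip(spectrum, mask)]
    return idft(masked)
```

With an all-ones mask the frame is returned unchanged, since the mask is a pure magnitude scaling; this makes a convenient sanity check.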
The present invention computes the split point of room impulse response signals of differing acoustic characteristics, truncates the early reverberation signal, combines it with ideal ratio masking, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from speech mixed under noisy, reverberant conditions by ideal ratio masking.
Fig. 3 is a structural block diagram of a front-end processing system for improving far-field speech recognition provided by an embodiment of this application. The system shown in Fig. 3 comprises: a truncation unit 301, a first generation unit 302, a second generation unit 303, and a third generation unit 304.
The truncation unit 301 computes, from the room impulse response signal, the split time between the early reverberation signal and the late reverberation signal, and truncates the response to the direct-sound signal and the early reverberation signal.
Specifically, after upsampling the room impulse response signal to a chosen frequency, the split time between its early and late reverberation signals is computed, and the response is then truncated to the direct-sound signal and the early reverberation signal.
In one embodiment, the split time between the early and late reverberation signals is determined by computing the normalized echo density of the room impulse response signal, defined as
NED(l) = (1 / erfc(1/√2)) · Σ_n ω(n) · 1{ |h(n)| > δ },
where erfc(1/√2) ≈ 0.3173 is the fraction of samples of a Gaussian distribution expected to lie more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(n) is a weighting function, h(n) is the room impulse response, and δ is the standard deviation of the room impulse response within the current window.
As the reverberation evolves from early to late, NED rises from near 0 toward 1; the split time between the early and late reverberation signals is defined as the instant at which NED becomes arbitrarily close to 1.
In one embodiment, under the diffuse-field assumption for the late reverberation, the split time between the early and late reverberation signals is computed by means of the kurtosis, the fourth standardized moment of the process, defined as
γ4 = E[(x - μ)^4] / δ^4 - 3,
where E denotes the expectation over the impulse response x, μ is its mean, and δ its standard deviation.
The split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In one embodiment, the split time between the early and late reverberation signals is computed from the room's characteristics, the time t being defined as a function of V and S, respectively the volume of the room and the surface area of the room.
The first generation unit 302 convolves the direct-sound signal and the early reverberation signal with a clean speech signal from the speech corpus in the time domain to obtain the time-domain target signal. Specifically, the direct-sound signal and the early reverberation signal are downsampled to the sampling frequency of the speech signal and then convolved in the time domain with a clean speech signal from the corpus, yielding the time-domain target signal.
The second generation unit 303 computes the target band energy from the time-domain target signal and the other-signal energy from the components of the time-domain mixture other than the time-domain target signal, and obtains the ideal ratio mask from the target band energy and the other-signal energy.
Preferably, the time-domain mixture is obtained by convolving the room impulse response signal with whole utterances from the corpus in the time domain and then mixing with a noise signal, generated as
m(t) = s(t) * h(t) + n(t),
where * denotes convolution, s(t) is the clean speech signal, h(t) the room impulse response signal, n(t) the noise signal, and t the time index.
Specifically, a Fourier transform is applied to the time-domain target signal and to the components of the time-domain mixture other than the time-domain target signal, yielding the target band energy D(k, l) and the other-signal energy R(k, l); these are then substituted into the ideal-ratio-mask formula
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)),
where D(k, l) is the target band energy, R(k, l) is the energy of the mixture excluding the target-signal energy, k is the band index, and l is the frame index.
The third generation unit 304 is configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then reapply the phase of the frequency-domain mixed signal to obtain the reconstructed signal.
Specifically, a short-time Fourier transform of the time-domain mixed signal yields the frequency-domain mixed signal; its magnitude is multiplied by the ideal ratio mask and recombined with the phase of the frequency-domain mixed signal to obtain the reconstructed signal. The reconstructed signal s'(t) is calculated as:
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] }
where istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M(k, l) denotes the phase of the frequency-domain mixed signal, k denotes the sub-band index, and l denotes the frame index.
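The mask-and-reconstruct step can be sketched as follows (a minimal illustration assuming SciPy's STFT/ISTFT pair with matching parameters; the function name and defaults are mine):

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(mixture, mask, fs=16000, nperseg=512):
    """s'(t) = istft{ |M(k,l)| x IRM(k,l) x exp(j angle M(k,l)) }:
    apply the mask to the mixture magnitude, keep the mixture phase."""
    _, _, M = stft(mixture, fs=fs, nperseg=nperseg)
    masked = np.abs(M) * mask * np.exp(1j * np.angle(M))
    _, s_hat = istft(masked, fs=fs, nperseg=nperseg)
    return s_hat
```

With an all-ones mask this reduces to the STFT round trip, so the mixture is recovered; intermediate mask values attenuate time-frequency cells dominated by noise and late reverberation.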
The present invention computes room impulse response signals with different acoustic characteristics, intercepts the early reverberation signal, combines the early reverberation signal with the ideal ratio mask, and applies the result to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from speech mixed under noisy, reverberant conditions by means of ideal ratio masking.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present application and do not limit it. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not depart the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A front-end processing method for improving far-field speech recognition, characterized by comprising:
calculating a room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal; the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal;
convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal;
computing, separately, the time-domain target signal and the components of a time-domain mixed signal other than the time-domain target signal to obtain a target sub-band energy and an other-signal energy, and obtaining an ideal ratio mask from the target sub-band energy and the other-signal energy; wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing with a noise signal;
converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and reapplying the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
2. The method according to claim 1, wherein calculating the room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal comprises:
determining the splitting time point between the early reverberation signal and the late reverberation signal of the room impulse response signal by calculating the echo density function of the room impulse response signal, the echo density function NED being defined as:
NED(l) = [1 / erfc(1/√2)] × Σ_τ ω(τ) × 1{|h(τ)| > δ}
wherein erfc(1/√2) is the expected fraction of samples of a Gaussian distribution lying more than one standard deviation from the mean, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weighting function, and δ is the standard deviation of the room impulse response signal within the current window;
as the reverberation passes from early reverberation to late reverberation, NED rises from 0 towards 1, and the splitting time between the early reverberation signal and the late reverberation signal is defined as the instant at which NED becomes infinitely close to 1, i.e. when the samples within the window follow the Gaussian statistics of the late reverberation signal.
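The echo-density criterion of claim 2 can be sketched as follows (an illustrative sliding-window implementation with a rectangular weighting; the window length and the Gaussian normalization constant erfc(1/√2) are standard choices from the echo-density literature, not fixed by the claim):

```python
import numpy as np
from scipy.special import erfc

def normalized_echo_density(h, win=1024):
    """NED(l): fraction of RIR samples in a sliding window lying more than
    one standard deviation from zero, normalized by the Gaussian expectation
    erfc(1/sqrt(2)) so that NED approaches 1 for a Gaussian (late) tail."""
    ned = np.zeros(len(h) - win)
    for l in range(len(ned)):
        frame = h[l:l + win]
        sigma = np.std(frame) + 1e-12
        ned[l] = np.mean(np.abs(frame) > sigma) / erfc(1.0 / np.sqrt(2.0))
    return ned
```

For sparse early reflections NED stays near 0; for Gaussian late reverberation it hovers near 1, so the first index where NED crosses a threshold close to 1 serves as the splitting time point.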
3. The method according to claim 1, wherein calculating the room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal comprises:
calculating the splitting time point between the early reverberation signal and the late reverberation signal from the kurtosis, based on the diffuse-field assumption for the late reverberation; the kurtosis is the fourth-order moment of a random process, the kurtosis γ4 being defined as:
γ4 = E[(x − μ)^4] / δ^4 − 3
wherein E denotes the expectation taken over the impulse response x, μ is the mean, and δ is the standard deviation;
the splitting time is defined as the instant at which the kurtosis calculated in a sliding window reaches zero.
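The kurtosis criterion of claim 3 can be sketched with the excess kurtosis, which is zero for Gaussian data and therefore matches the "reaches zero" condition for the diffuse late field (the window length and the stabilizing constant are my choices):

```python
import numpy as np

def sliding_excess_kurtosis(h, win=1024):
    """gamma4(l) = E[(x - mu)^4] / sigma^4 - 3 over a sliding window;
    it decays toward 0 as the RIR becomes Gaussian (diffuse late field)."""
    out = np.zeros(len(h) - win)
    for l in range(len(out)):
        frame = h[l:l + win]
        mu, sigma = frame.mean(), frame.std() + 1e-12
        out[l] = np.mean((frame - mu) ** 4) / sigma ** 4 - 3.0
    return out
```

Early reflections are spiky (heavy-tailed), so the windowed excess kurtosis starts large and positive; the first zero crossing marks the candidate splitting time point.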
4. The method according to claim 1, wherein calculating the room impulse response signal to obtain the splitting time point between the early reverberation signal and the late reverberation signal comprises:
calculating the splitting time point between the early reverberation signal and the late reverberation signal from the room characteristics, the time t being defined as:
wherein V and S are respectively the volume of the room and the surface area of the room.
5. The method according to claim 1, wherein computing, separately, the time-domain target signal and the components of the time-domain mixed signal other than the time-domain target signal to obtain the target sub-band energy and the other-signal energy, and obtaining the ideal ratio mask from the target sub-band energy and the other-signal energy, specifically comprises:
applying a Fourier transform to the time-domain target signal and to the other signals respectively, to calculate the target sub-band energy and the other-signal energy;
substituting the target sub-band energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask; the ideal ratio mask formula IRM(k, l) being:
IRM(k, l) = D(k, l) / [D(k, l) + R(k, l)]
wherein D(k, l) denotes the target sub-band energy, R(k, l) denotes the energy of the mixed signal excluding the target sub-band energy, k denotes the sub-band index, and l denotes the frame index.
6. The method according to claim 1, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing with a noise signal, the time-domain mixed signal being generated as:
m(t) = s(t) * h(t) + n(t)
wherein s(t) denotes the clean speech signal, h(t) denotes the room impulse response signal, n(t) denotes the noise signal, t denotes the time index, and * denotes convolution.
7. The method according to claim 1, wherein converting the time-domain mixed signal into the frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and reapplying the phase of the frequency-domain mixed signal to obtain the reconstructed signal, specifically comprises:
applying a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal;
multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and reapplying the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s'(t) being calculated as:
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] }
wherein istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M(k, l) denotes the phase of the frequency-domain mixed signal, k denotes the sub-band index, and l denotes the frame index.
8. A front-end processing system for improving far-field speech recognition, comprising:
an interception unit, configured to calculate a room impulse response signal, obtain the splitting time point between the early reverberation signal and the late reverberation signal, and intercept the direct sound signal and the early reverberation signal; the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal;
a first generation unit, configured to convolve the direct sound signal and the early reverberation signal with a clean speech signal from a speech corpus in the time domain to obtain a time-domain target signal;
a second generation unit, configured to compute, separately, the time-domain target signal and the components of a time-domain mixed signal other than the time-domain target signal to obtain a target sub-band energy and an other-signal energy, and to obtain an ideal ratio mask from the target sub-band energy and the other-signal energy; wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with speech from the speech corpus in the time domain and then mixing with a noise signal;
a third generation unit, configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and reapply the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
9. The system according to claim 8, characterized in that the second generation unit is specifically configured to:
apply a Fourier transform to the time-domain target signal and to the other signals respectively, to calculate the target sub-band energy and the other-signal energy;
substitute the target sub-band energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask; the ideal ratio mask formula IRM(k, l) being:
IRM(k, l) = D(k, l) / [D(k, l) + R(k, l)]
wherein D(k, l) denotes the target sub-band energy, R(k, l) denotes the energy of the mixed signal excluding the target sub-band energy, k denotes the sub-band index, and l denotes the frame index.
10. The system according to claim 8, characterized in that the third generation unit is specifically configured to:
apply a short-time Fourier transform to the time-domain mixed signal to obtain the frequency-domain mixed signal;
multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask and reapply the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s'(t) being calculated as:
s'(t) = istft{ |M(k, l)| × IRM(k, l) × exp[j∠M(k, l)] }
wherein istft denotes the inverse short-time Fourier transform, M(k, l) denotes the frequency-domain mixed signal, ∠M(k, l) denotes the phase of the frequency-domain mixed signal, k denotes the sub-band index, and l denotes the frame index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811602419.9A CN109523999B (en) | 2018-12-26 | 2018-12-26 | Front-end processing method and system for improving far-field speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109523999A true CN109523999A (en) | 2019-03-26 |
CN109523999B CN109523999B (en) | 2021-03-23 |
Family
ID=65797174
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811602419.9A Active CN109523999B (en) | 2018-12-26 | 2018-12-26 | Front-end processing method and system for improving far-field speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109523999B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090122999A1 (en) * | 2007-11-13 | 2009-05-14 | Samsung Electronics Co., Ltd | Method of improving acoustic properties in music reproduction apparatus and recording medium and music reproduction apparatus suitable for the method |
CN105427859A (en) * | 2016-01-07 | 2016-03-23 | 深圳市音加密科技有限公司 | Front voice enhancement method for identifying speaker |
CN105427860A (en) * | 2015-11-11 | 2016-03-23 | 百度在线网络技术(北京)有限公司 | Far field voice recognition method and device |
CN107481731A (en) * | 2017-08-01 | 2017-12-15 | 百度在线网络技术(北京)有限公司 | A kind of speech data Enhancement Method and system |
CN108389586A (en) * | 2017-05-17 | 2018-08-10 | 宁波桑德纳电子科技有限公司 | A kind of long-range audio collecting device, monitoring device and long-range collection sound method |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110428852A (en) * | 2019-08-09 | 2019-11-08 | 南京人工智能高等研究院有限公司 | Speech separating method, device, medium and equipment |
CN110428852B (en) * | 2019-08-09 | 2021-07-16 | 南京人工智能高等研究院有限公司 | Voice separation method, device, medium and equipment |
CN111312273A (en) * | 2020-05-11 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Reverberation elimination method, apparatus, computer device and storage medium |
CN111768796B (en) * | 2020-07-14 | 2024-05-03 | 中国科学院声学研究所 | Acoustic echo cancellation and dereverberation method and device |
CN111768796A (en) * | 2020-07-14 | 2020-10-13 | 中国科学院声学研究所 | Acoustic echo cancellation and dereverberation method and device |
CN112201262A (en) * | 2020-09-30 | 2021-01-08 | 珠海格力电器股份有限公司 | Sound processing method and device |
CN112201262B (en) * | 2020-09-30 | 2024-05-31 | 珠海格力电器股份有限公司 | Sound processing method and device |
CN112201229A (en) * | 2020-10-09 | 2021-01-08 | 百果园技术(新加坡)有限公司 | Voice processing method, device and system |
CN112201229B (en) * | 2020-10-09 | 2024-05-28 | 百果园技术(新加坡)有限公司 | Voice processing method, device and system |
CN112652290A (en) * | 2020-12-14 | 2021-04-13 | 北京达佳互联信息技术有限公司 | Method for generating reverberation audio signal and training method of audio processing model |
CN112735461A (en) * | 2020-12-29 | 2021-04-30 | 西安讯飞超脑信息科技有限公司 | Sound pickup method, related device and equipment |
CN112735461B (en) * | 2020-12-29 | 2024-06-07 | 西安讯飞超脑信息科技有限公司 | Pickup method, and related device and equipment |
CN113643714B (en) * | 2021-10-14 | 2022-02-18 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio processing method, device, storage medium and computer program |
CN113643714A (en) * | 2021-10-14 | 2021-11-12 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio processing method, device, storage medium and computer program |
WO2023093477A1 (en) * | 2021-11-25 | 2023-06-01 | 广州视源电子科技股份有限公司 | Speech enhancement model training method and apparatus, storage medium, and device |
Also Published As
Publication number | Publication date |
---|---|
CN109523999B (en) | 2021-03-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109523999A (en) | A kind of front end processing method and system promoting far field speech recognition | |
Pedersen et al. | Two-microphone separation of speech mixtures | |
CN104835503A (en) | Improved GSC self-adaptive speech enhancement method | |
CN110211602B (en) | Intelligent voice enhanced communication method and device | |
CN110808057A (en) | Voice enhancement method for generating confrontation network based on constraint naive | |
CN114694670A (en) | Multi-task network-based microphone array speech enhancement system and method | |
EP0997003A2 (en) | A method of noise reduction in speech signals and an apparatus for performing the method | |
CN113593590A (en) | Method for suppressing transient noise in voice | |
CN112820312B (en) | Voice separation method and device and electronic equipment | |
Mesgarani et al. | Speech enhancement based on filtering the spectrotemporal modulations | |
CN110838303B (en) | Voice sound source positioning method using microphone array | |
CN114884780B (en) | Underwater sound communication signal modulation identification method and device based on passive time reversal mirror | |
CN114401168B (en) | Voice enhancement method applicable to short wave Morse signal under complex strong noise environment | |
JP2002064617A (en) | Echo suppression method and echo suppression equipment | |
CN110444222A (en) | A kind of speech noise-reduction method based on comentropy weighting | |
Gbadamosi et al. | Development of non-parametric noise reduction algorithm for GSM voice signal | |
Wang et al. | Speech endpoint detection in fixed differential beamforming combined with modulation domain | |
Ou et al. | Speech enhancement employing modified a priori SNR estimation | |
Peng et al. | Research on Underwater Transient Signal Denoising Algorithm Under low SNR | |
YADAV et al. | Distortionless acoustic beamforming with enhanced sparsity based on reweighted ℓ1-norm minimization | |
CN106997766A (en) | A kind of homomorphic filtering sound enhancement method based on broadband noise | |
Jung et al. | Noise Reduction after RIR removal for Speech De-reverberation and De-noising | |
Wang et al. | Time-Frequency Thresholding: A new algorithm in wavelet package speech enhancement | |
Jeong et al. | Kepstrum approach to real-time speech enhancement methods using two microphones | |
CN117435880A (en) | Signal enhancement method in underwater acoustic interference environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||