CN112634929A - Voice enhancement method, device and storage medium - Google Patents
Voice enhancement method, device and storage medium
- Publication number
- CN112634929A, CN202011501035.5A
- Authority
- CN
- China
- Prior art keywords
- signal
- energy
- processed
- spectrum energy
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention provides a voice enhancement method, a device and a storage medium, wherein the method comprises the following steps: acquiring a frequency domain signal of a voice signal to be processed to obtain the frequency domain signal to be processed; calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the voice signal; dividing a frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy; acquiring noise spectrum energy corresponding to the voice signal spectrum energy; estimating the energy of a pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy; calculating to obtain the actual pure speech signal spectrum energy of each frequency band according to the speech signal spectrum energy, the pre-estimated pure speech signal spectrum energy and the noise spectrum energy; and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment. The embodiment of the invention can reduce the situation of voice distortion.
Description
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a speech enhancement method, apparatus, and storage medium.
Background
Speech enhancement is an important branch of speech signal processing and is widely used in speech coding. With the continuous development of DSP chips, real-time speech enhancement has become practical and is applied in many settings. In practice, the main goals of speech enhancement are to extract the original speech, as clean as possible, from noisy speech; to improve speech quality and suppress background noise so that listening is comfortable and not fatiguing; and to improve intelligibility.
Among traditional speech enhancement algorithms, spectral subtraction is the most common and most widely applied. Traditional spectral subtraction relies on additive noise being uncorrelated with speech. Assuming the noise is statistically stationary, the noise spectrum measured during speech-free gaps is used as an estimate of the noise spectrum during speech, and the estimate of the pure speech spectrum is obtained by subtracting this noise spectrum from the noisy speech spectrum. Traditional spectral subtraction is simple to implement and has a low computational requirement, but the estimated noise is usually not accurate enough: subtracting too much loses speech information, and subtracting too little leaves excessive residual noise, so speech distortion often results.
Disclosure of Invention
The invention aims to provide a voice enhancement method, a voice enhancement device and a storage medium, which are used for solving the technical problem that the conventional spectral subtraction method often has voice distortion.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a speech enhancement method, including:
acquiring a frequency domain signal of a voice signal to be processed to obtain the frequency domain signal to be processed;
calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the voice signal;
dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
acquiring noise spectrum energy corresponding to the voice signal spectrum energy;
estimating the energy of a pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy;
calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
Further, the obtaining of the frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed specifically includes:
acquiring a voice signal to be processed;
performing framing processing on the voice signal to be processed to obtain a time domain signal to be processed;
and calculating to obtain a frequency domain signal to be processed according to the time domain signal to be processed.
Further, before the calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the speech signal, the method further includes:
and smoothing the frequency domain signal to be processed.
Further, estimating the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectrum energy and the noise spectrum energy, to obtain the estimated pure speech signal spectrum energy, specifically is: subtracting the noise spectrum energy from the speech signal spectrum energy to obtain the estimated pure speech signal spectrum energy.
Further, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy specifically includes:
calculating, according to a formula [formula image missing in source], the actual pure speech signal spectrum energy X(ω) of each frequency segment, wherein Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0,1], and a further symbol [missing in source] represents the estimated pure speech signal spectrum energy.
Further, the calculating according to the spectrum energy of the actual pure speech signal of each frequency segment to obtain the actual pure speech signal corresponding to the speech signal to be processed specifically includes:
calculating to obtain an actual pure voice frequency domain signal of each frequency segment according to the actual pure voice signal spectrum energy of each frequency segment;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice frequency domain signal of each frequency segment.
In a second aspect, an embodiment of the present invention provides a speech enhancement apparatus, including:
the frequency domain signal acquisition module is used for acquiring a frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed;
the frequency spectrum energy calculation module is used for calculating the frequency spectrum energy of each frame of the frequency domain signal to be processed to obtain the frequency spectrum energy of the voice signal;
the frequency domain signal dividing module is used for dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
the noise spectrum energy acquisition module is used for acquiring noise spectrum energy corresponding to the voice signal spectrum energy;
the pre-estimation module is used for pre-estimating the energy of the pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain pre-estimated pure voice signal spectrum energy;
the actual pure voice signal spectrum energy calculation module is used for calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy;
and the actual pure voice signal calculation module is used for calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
Further, the frequency domain signal acquisition module specifically includes:
a voice signal acquisition unit, configured to acquire a voice signal to be processed;
the framing unit is used for framing the voice signal to be processed to obtain a time domain signal to be processed;
and the calculating unit is used for calculating to-be-processed frequency domain signals according to the to-be-processed time domain signals.
Further, the speech enhancement apparatus further includes a smoothing module, used for smoothing the frequency domain signal to be processed before the spectral energy of each frame of the frequency domain signal to be processed is calculated to obtain the spectral energy of the voice signal.
Further, estimating the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectrum energy and the noise spectrum energy, to obtain the estimated pure speech signal spectrum energy, specifically is: subtracting the noise spectrum energy from the speech signal spectrum energy to obtain the estimated pure speech signal spectrum energy.
Further, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy specifically includes:
calculating, according to a formula [formula image missing in source], the actual pure speech signal spectrum energy X(ω) of each frequency band, wherein Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0,1], and a further symbol [missing in source] represents the estimated pure speech signal spectrum energy.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where, when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to perform the speech enhancement method as described above.
Compared with the prior art, the frequency domain signal to be processed is divided into a plurality of frequency segments with the same sum of the spectrum energy according to the spectrum energy of the voice signal; acquiring noise spectrum energy corresponding to the voice signal spectrum energy; estimating the energy of a pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy; calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy; and calculating to obtain an actual pure speech signal corresponding to the speech signal to be processed according to the actual pure speech signal spectrum energy of each frequency band, so that the fluctuation of spectrum estimation is reduced, greater noise attenuation is provided, lower residual noise is brought, and the situation of speech distortion is reduced. In addition, the embodiment of the invention has simple calculation and lower calculation force requirement.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the invention provides a voice enhancement method, and the execution main body of the method can be a terminal or a server. The terminal can be a smart phone, a tablet computer, a smart speaker or another device that can acquire voice signals and has processing capability.
Referring to fig. 1, a speech enhancement method according to an embodiment of the present invention includes:
S1, acquiring the frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed.
In the embodiment of the present invention, it should be understood that the speech signal to be processed is a noisy speech signal, and may be, for example, a piece of audio collected by a microphone array.
S2, calculating the spectrum energy of each frame of the frequency domain signal to be processed to obtain the spectrum energy of the voice signal.
S3, dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of spectral energy according to the spectral energy of the voice signal.
It should be noted that the number of frequency segments can be set according to the calculation power and the final effect of the device, and therefore, the number of frequency segments is not limited herein, and may be, for example, 5.
It should be appreciated that, although conventional spectral subtraction attenuates the noise in the original noisy speech, it can introduce "musical noise", because the noise is colored and does not affect the speech signal uniformly across the whole spectrum: some frequency segments are affected more than others. The frequency domain signal to be processed is therefore divided into a plurality of frequency segments with the same sum of spectral energy, and each segment is processed separately, which reduces speech distortion.
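As an illustration only, the division into segments of equal summed spectral energy could be sketched as follows. The function name `equal_energy_bands`, the use of a cumulative sum, and the default of 5 segments are assumptions for this sketch and are not taken from the embodiment; with discrete frequency bins the segment sums are only approximately equal, and edges can coincide when the energy is concentrated in a few bins.

```python
import numpy as np

def equal_energy_bands(energy, n_bands=5):
    """Return n_bands + 1 bin edges so that each band holds roughly equal energy.

    energy: 1-D array of per-bin spectral energy for one frame (hypothetical input).
    """
    energy = np.asarray(energy, dtype=float)
    cumulative = np.cumsum(energy)
    # Energy level at which each interior band boundary should fall.
    targets = cumulative[-1] * np.arange(1, n_bands) / n_bands
    # First bin whose cumulative energy reaches each target; the edge sits just after it.
    inner = np.searchsorted(cumulative, targets, side="left") + 1
    return np.concatenate(([0], inner, [len(energy)]))

edges = equal_energy_bands(np.ones(10), n_bands=5)
# Band k covers bins edges[k] .. edges[k + 1] - 1.
```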
S4, acquiring noise spectrum energy corresponding to the voice signal spectrum energy.
Conventional spectral subtraction assumes that the noise is statistically stationary, uses the noise spectrum estimated during speech-free gaps in place of the noise spectrum during speech, and subtracts it from the noisy speech spectrum to obtain an estimate of the speech spectrum.
In the embodiment of the present invention, it should be understood that the noise spectrum is the noise spectrum calculated from the speech-free gaps of the speech signal (the noisy speech signal).
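One simple way to obtain such a noise estimate, shown here only as a sketch, is to average the spectral energy of frames known to contain no speech. The assumption that the first `n_noise_frames` frames are speech-free, and the function name itself, are hypothetical; the embodiment does not say how the gaps are located.

```python
import numpy as np

def estimate_noise_energy(frame_energy, n_noise_frames=10):
    """Average per-bin energy over leading frames assumed to be noise only.

    frame_energy: array of shape (n_frames, n_bins).
    """
    frame_energy = np.asarray(frame_energy, dtype=float)
    return frame_energy[:n_noise_frames].mean(axis=0)
```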
S5, estimating the energy of the pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy.
S6, calculating the actual pure speech signal spectrum energy of each frequency band according to the speech signal spectrum energy, the pre-estimated pure speech signal spectrum energy and the noise spectrum energy.
It should be understood that conventional spectral subtraction, in which the pure speech signal spectral energy is estimated from the noisy speech signal spectral energy and the noise spectral energy alone, is a linear spectral subtraction. Because the noise estimate is rarely accurate, subtracting too much loses speech information and subtracting too little leaves excessive residual noise, so speech distortion often occurs. The speech enhancement method proposed by the embodiment of the present invention, in which the pure speech signal spectral energy is calculated from the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy, is a nonlinear spectral subtraction. Compared with linear spectral subtraction, it provides greater noise attenuation and lower residual noise, especially for low-energy speech segments, thereby reducing speech distortion.
S7, calculating an actual pure speech signal corresponding to the speech signal to be processed according to the actual pure speech signal spectrum energy of each frequency segment.
In summary, in the speech enhancement method provided in the embodiment of the present invention, the frequency domain signal to be processed is divided into a plurality of frequency segments with the same sum of spectral energy according to the spectral energy of the speech signal; noise spectrum energy corresponding to the voice signal spectrum energy is acquired; the energy of the pure voice signal corresponding to the frequency domain signal to be processed is estimated according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy; the actual pure voice signal spectrum energy of each frequency band is calculated according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy; and the actual pure speech signal corresponding to the speech signal to be processed is calculated according to the actual pure speech signal spectrum energy of each frequency band. As a result, the fluctuation of the spectrum estimate is reduced, greater noise attenuation and lower residual noise are obtained, and speech distortion is reduced. In addition, the embodiment of the invention is simple to compute and has a low computational requirement.
As an example of the embodiment of the present invention, the obtaining of the frequency domain signal of the to-be-processed speech signal to obtain the to-be-processed frequency domain signal specifically includes S11-S13:
S11, acquiring the voice signal to be processed.
In an embodiment of the present invention, the to-be-processed voice signal is collected by an audio collecting device, where the audio collecting device includes, but is not limited to, a device with a microphone array, such as a smart phone, a tablet computer, a smart speaker, a handheld microphone, a desktop microphone, a headset microphone, and the like.
S12, performing framing processing on the voice signal to be processed to obtain a time domain signal to be processed.
In this embodiment, framing is performed according to a preset time length; for example, 25 ms may be one frame. To obtain better correlation between frames after the spectral energy calculation, adjacent frames overlap by 1/4 of a frame.
S13, calculating the frequency domain signal to be processed from the time domain signal to be processed.
Specifically, a Fourier transform or fast Fourier transform is performed on the time domain signal to be processed to obtain the frequency domain signal to be processed.
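Steps S11 to S13 can be sketched as below, as one possible reading rather than the embodiment's implementation. The 25 ms frame length and 1/4 overlap come from the text; the function name, the Hann analysis window, and the use of a real-input FFT are assumptions made for this sketch.

```python
import numpy as np

def frame_and_fft(signal, sample_rate, frame_ms=25.0, overlap=0.25):
    """Split a 1-D signal into overlapping frames and return one spectrum per frame."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # Adjacent frames share `overlap` of a frame, so the hop is the remainder.
    hop = int(round(frame_len * (1.0 - overlap)))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    window = np.hanning(frame_len)  # assumption: the source does not name a window
    return np.fft.rfft(frames * window, axis=1)

# One second of audio at 16 kHz gives 400-sample frames with a 300-sample hop.
spectra = frame_and_fft(np.zeros(16000), sample_rate=16000)
```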
As an example of the embodiment of the present invention, estimating the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectrum energy and the noise spectrum energy, to obtain the estimated pure speech signal spectrum energy, specifically is: subtracting the noise spectrum energy from the speech signal spectrum energy to obtain the estimated pure speech signal spectrum energy.
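A minimal sketch of this pre-estimation step follows. Clamping negative differences to zero is an assumption added here so that the result remains a valid energy; the text only specifies the subtraction, and the function name is hypothetical.

```python
import numpy as np

def pre_estimate_clean_energy(speech_energy, noise_energy):
    """Estimated clean-speech energy: noisy energy minus noise energy, per bin."""
    diff = np.asarray(speech_energy, dtype=float) - np.asarray(noise_energy, dtype=float)
    # Assumption: energies cannot be negative, so over-subtracted bins are set to 0.
    return np.maximum(diff, 0.0)
```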
As an example of the embodiment of the present invention, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy, and the noise spectral energy specifically includes:
calculating, according to a formula [formula image missing in source], the actual pure speech signal spectrum energy X(ω) of each frequency band, wherein Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0,1], and a further symbol [missing in source] represents the estimated pure speech signal spectrum energy.
In the embodiment of the present invention, the value of p is a positive integer, and is usually 2.
δ is an empirical correction constant; the larger p is, the closer δ is to 1, and when p is 2, δ is 0.5 in most cases.
For ease of understanding, the derivation of the above formula is given below:
Let the form of the nonlinear spectral subtraction be:
$|X(\omega)|^p = \alpha(\omega)\,|Y(\omega)|^p - \beta(\omega)\,|D(\omega)|^p$
wherein X(ω) represents the actual pure speech signal spectral energy, Y(ω) represents the speech signal (noisy speech signal) spectral energy, D(ω) represents the noise spectral energy, a further symbol [missing in source] represents the estimated pure speech signal spectral energy, p and δ are constants, δ ∈ [0,1], and usually p is 2.
From the above derivation process, the embodiment of the present invention selects the subtraction parameters optimally in the sense of the mean square error, and has the constraint term, so that greater noise attenuation can be provided, and especially for low-energy speech segments, lower residual noise can be brought about.
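The general form $|X(\omega)|^p = \alpha(\omega)|Y(\omega)|^p - \beta(\omega)|D(\omega)|^p$ can be evaluated as in the sketch below. The expressions for α(ω) and β(ω) are not reproduced in this text, so they are taken as inputs here; the function name, the clamp at zero, and treating the inputs as magnitudes are assumptions of the sketch.

```python
import numpy as np

def band_subtraction(Y, D, alpha, beta, p=2):
    """Solve |X|^p = alpha * |Y|^p - beta * |D|^p for |X| within one band.

    alpha and beta may be scalars or per-bin arrays; how they are derived
    is outside this sketch.
    """
    Y = np.abs(np.asarray(Y, dtype=float))
    D = np.abs(np.asarray(D, dtype=float))
    x_pow = alpha * Y**p - beta * D**p
    # Assumption: negative results are clamped to 0 before taking the p-th root.
    return np.maximum(x_pow, 0.0) ** (1.0 / p)
```

With alpha = beta = 1 and p = 2 this reduces to ordinary power spectral subtraction, which is a convenient sanity check.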
It should be understood that, in practice, after the frequency domain signal to be processed is divided into a plurality of frequency segments with the same sum of spectral energy according to the spectral energy of the speech signal, a first parameter α(ω) and a second parameter β(ω) corresponding to each frequency segment are calculated according to two formulas [formula images missing in source], and the actual pure speech signal spectral energy of each frequency segment is then calculated separately according to the formula $|X(\omega)|^p = \alpha(\omega)\,|Y(\omega)|^p - \beta(\omega)\,|D(\omega)|^p$. For example, if the frequency domain signal is divided into three frequency segments, this calculation yields the actual pure speech signal spectral energy X1(ω) of the first segment, X2(ω) of the second segment, and X3(ω) of the third segment. Finally, the actual pure speech signal spectral energies of the three frequency segments are added and the root is taken [formula image missing in source] to obtain the frequency domain signal of the actual pure speech signal corresponding to the speech signal to be processed, and an inverse Fourier transform is applied to that frequency domain signal to obtain the pure speech signal corresponding to the speech signal to be processed.
As an example of the embodiment of the present invention, the calculating, according to the spectrum energy of the actual pure speech signal of each frequency segment, to obtain the actual pure speech signal corresponding to the speech signal to be processed specifically includes:
calculating to obtain an actual pure voice frequency domain signal of each frequency segment according to the actual pure voice signal spectrum energy of each frequency segment;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice frequency domain signal of each frequency segment.
Specifically, the actual pure speech signal spectrum energy corresponding to the speech signal to be processed is obtained by adding the actual pure speech signal spectrum energies of the frequency bands; the frequency domain signal of the actual pure speech signal is obtained by taking the root of that spectrum energy; and an inverse Fourier transform of this frequency domain signal gives the pure speech signal corresponding to the speech signal to be processed.
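For a single frame, the add, take-root and inverse-transform sequence might look like the following sketch. The text does not say how phase is obtained, so reusing the phase of the noisy spectrum is an assumption, as are the function name, the square root (p = 2), and the representation of each band as a full-length energy array that is zero outside the band. Overlap-add across frames is not shown.

```python
import numpy as np

def reconstruct_frame(band_energies, noisy_spectrum, frame_len):
    """Rebuild one time-domain frame from per-band clean-speech energies.

    band_energies: sequence of arrays, each of length n_bins, zero outside its band.
    noisy_spectrum: complex rfft of the noisy frame, used only for its phase.
    """
    total_energy = np.sum(band_energies, axis=0)  # add the bands together
    magnitude = np.sqrt(total_energy)             # take the root (p = 2 assumed)
    phase = np.angle(noisy_spectrum)              # assumption: keep the noisy phase
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=frame_len)
```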
As an improvement of the above scheme, before the calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the speech signal, the method further includes:
and smoothing the frequency domain signal to be processed.
Specifically, the frequency domain signal to be processed is smoothed by low-pass filtering or windowing.
If the frequency domain signal to be processed is smoothed by windowing, and the window length L is assumed to be 3, the calculation method is as follows:
X′(n)=(X(n-1)+X(n)+X(n+1))/3
where X(n) is the frequency domain signal corresponding to one frame of the speech signal.
By smoothing the frequency domain signal to be processed, the embodiment of the present invention reduces the fluctuation of the spectrum estimate, which reduces residual noise and, in turn, speech distortion.
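A minimal sketch of the window-length-3 smoothing is given below. It assumes that n indexes frames, so that each frame is averaged with its two neighbours, and that the first and last frames are handled by repeating the edge frame; neither detail is stated in the text, and the function name `smooth_frames` is hypothetical.

```python
import numpy as np

def smooth_frames(spectra):
    """Apply X'(n) = (X(n-1) + X(n) + X(n+1)) / 3 along the frame axis.

    spectra: array of shape (num_frames, num_bins).
    Edge frames are padded by replication (an assumption).
    """
    padded = np.pad(spectra, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
```

For a single-bin sequence of frames 0, 3, 6, 3 this yields 1, 3, 4, 4.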
Example 2:
Referring to Fig. 2, an embodiment of the present invention provides a speech enhancement apparatus, including:
a frequency domain signal acquisition module 1, configured to acquire a frequency domain signal of the speech signal to be processed to obtain the frequency domain signal to be processed.
In the embodiment of the present invention, the speech signal to be processed is a noisy speech signal, and may be, for example, a segment of audio collected by a microphone array.
A spectral energy calculation module 2, configured to calculate the spectral energy of each frame of the frequency domain signal to be processed to obtain the speech signal spectral energy.
A frequency domain signal dividing module 3, configured to divide the frequency domain signal to be processed into a plurality of frequency segments with the same sum of spectral energy according to the speech signal spectral energy.
It should be noted that the number of frequency segments can be set according to the calculation power and the final effect of the device, and therefore, the number of frequency segments is not limited herein, and may be, for example, 5.
It should be appreciated that, although conventional spectral subtraction can attenuate the noise in the original noisy speech, it can introduce "musical noise", because the noise is colored and does not affect the speech signal uniformly across the entire spectrum: some frequency bands are affected more than others. Dividing the frequency domain signal to be processed into a plurality of frequency bands with the same sum of spectral energy and processing each band separately therefore reduces speech distortion.
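One possible way to divide the bins of a frame into frequency bands with approximately equal energy sums is sketched below. The cumulative-sum approach and the name `equal_energy_band_edges` are assumptions made for illustration; the text does not specify how the division is computed. Because bins are discrete, the band sums are only approximately equal in general, and a single dominant bin can produce empty bands.

```python
import numpy as np

def equal_energy_band_edges(energy, num_bands):
    """Return bin indices [e0, e1, ..., eK] so that band k is energy[ek:ek+1]
    and all bands carry roughly the same total energy.

    energy:    non-negative spectral energy per frequency bin (1-D array).
    num_bands: number of frequency bands K.
    """
    cumulative = np.cumsum(energy)
    # Cumulative energy targets at 1/K, 2/K, ..., (K-1)/K of the total.
    targets = cumulative[-1] * np.arange(1, num_bands) / num_bands
    # First bin at which the cumulative energy reaches each target.
    inner = np.searchsorted(cumulative, targets, side="left") + 1
    return np.concatenate(([0], inner, [len(energy)]))
```

For example, with bin energies 4, 1, 1, 1, 1, 4 and three bands, the edges are 0, 1, 5, 6, so each band sums to 4.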
A noise spectral energy acquisition module 4, configured to acquire the noise spectral energy corresponding to the speech signal spectral energy.
Conventional spectral subtraction assumes that the noise is statistically stationary. Under this assumption, the noise spectrum estimate calculated during speech-free gaps is used in place of the noise spectrum during speech periods and is subtracted from the noisy speech spectrum to obtain an estimate of the speech spectrum.
In the embodiment of the present invention, it should be understood that the noise spectrum is the noise spectrum calculated from the speech-free gaps of the speech signal (the noisy speech signal).
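As a hedged illustration of the noise estimate, the sketch below averages the spectral energy over frames marked as speech-free. Averaging is an assumed choice, and the boolean mask `is_speech` is a hypothetical input: the text does not describe how speech-free gaps are detected.

```python
import numpy as np

def estimate_noise_energy(frame_energy, is_speech):
    """Average spectral energy over speech-free frames.

    frame_energy: array of shape (num_frames, num_bins).
    is_speech:    boolean array of shape (num_frames,), True where a frame
                  contains speech (hypothetical; supplied by the caller).
    """
    noise_frames = frame_energy[~np.asarray(is_speech, dtype=bool)]
    if noise_frames.shape[0] == 0:
        raise ValueError("no speech-free frames available for the noise estimate")
    return noise_frames.mean(axis=0)
```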
A pre-estimation module 5, configured to estimate the energy of the clean speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectral energy and the noise spectral energy to obtain the estimated clean speech signal spectral energy.
An actual clean speech signal spectral energy calculation module 6, configured to calculate the actual clean speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated clean speech signal spectral energy and the noise spectral energy.
It should be understood that conventional spectral subtraction, in which the clean speech signal spectral energy is estimated from the noisy speech signal spectral energy and the noise spectral energy, is a linear spectral subtraction. With linear spectral subtraction the noise is difficult to estimate accurately: if too much is subtracted, speech information is lost, and if too little is subtracted, excessive interference noise remains, so speech distortion often occurs. The speech enhancement method proposed by the embodiment of the present invention, in which the clean speech signal spectral energy is calculated from the speech signal spectral energy, the estimated clean speech signal spectral energy and the noise spectral energy, is a nonlinear spectral subtraction. Compared with linear spectral subtraction, it provides greater noise attenuation and leaves lower residual noise, especially for low-energy speech segments, thereby reducing speech distortion.
An actual clean speech signal calculation module 7, configured to calculate the actual clean speech signal corresponding to the speech signal to be processed according to the actual clean speech signal spectral energy of each frequency segment.
In summary, in the speech enhancement method provided in the embodiment of the present invention, the frequency domain signal to be processed is divided into a plurality of frequency segments with the same sum of spectral energy according to the speech signal spectral energy; the noise spectral energy corresponding to the speech signal spectral energy is acquired; the energy of the clean speech signal corresponding to the frequency domain signal to be processed is estimated according to the speech signal spectral energy and the noise spectral energy to obtain the estimated clean speech signal spectral energy; the actual clean speech signal spectral energy of each frequency segment is calculated according to the speech signal spectral energy, the estimated clean speech signal spectral energy and the noise spectral energy; and the actual clean speech signal corresponding to the speech signal to be processed is calculated according to the actual clean speech signal spectral energy of each frequency segment. In this way, the fluctuation of the spectrum estimate is reduced, greater noise attenuation is provided, residual noise is lowered, and speech distortion is reduced. In addition, the embodiment of the present invention is computationally simple and has a low computing power requirement.
As an example of the embodiment of the present invention, the frequency domain signal obtaining module includes:
a speech signal acquisition unit, configured to acquire the speech signal to be processed.
In an embodiment of the present invention, the speech signal to be processed is collected by an audio collecting device, which includes, but is not limited to, a device with a microphone array, such as a smartphone, a tablet computer, a smart speaker, a handheld microphone, a desktop microphone, or a headset microphone.
A framing unit, configured to frame the speech signal to be processed to obtain the time domain signal to be processed.
In this embodiment, framing is performed according to a preset frame length; for example, each frame may be 25 ms long. To obtain better correlation between frames after the spectral energy calculation, adjacent windows overlap by 1/4 of a frame.
A calculating unit, configured to calculate the frequency domain signal to be processed from the time domain signal to be processed.
Specifically, a Fourier transform or fast Fourier transform is performed on the time domain signal to be processed to obtain the frequency domain signal to be processed.
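The framing and transform steps can be sketched as follows. The 16 kHz sample rate, the absence of a tapering window, the dropping of trailing samples that do not fill a frame, and the name `frame_and_transform` are assumptions for illustration; only the 25 ms frame length and the 1/4-frame overlap come from the text.

```python
import numpy as np

def frame_and_transform(signal, sample_rate=16000, frame_ms=25, overlap=0.25):
    """Split a signal into overlapping frames and take the FFT of each frame.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000))
    # A 1/4-frame overlap means consecutive frames start 3/4 of a frame apart.
    hop = int(round(frame_len * (1 - overlap)))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(num_frames)]
    )
    # One-sided FFT of each real-valued frame.
    return np.fft.rfft(frames, axis=1)
```

With these defaults a frame is 400 samples long and successive frames start 300 samples apart, so a 1000-sample signal yields 3 frames of 201 frequency bins each.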
As an example of the embodiment of the present invention, the estimating, according to the speech signal spectrum energy and the noise spectrum energy, the energy of the pure speech signal corresponding to the frequency domain signal to be processed to obtain estimated pure speech signal spectrum energy specifically includes: and subtracting the noise spectrum energy from the voice signal spectrum energy to obtain the estimated pure voice signal spectrum energy.
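The estimation step is a plain subtraction of energies and might look like the sketch below. Clipping negative results to a floor of zero is an added assumption (the text only says to subtract), and the function name `estimate_clean_energy` is hypothetical.

```python
import numpy as np

def estimate_clean_energy(speech_energy, noise_energy, floor=0.0):
    """Estimated clean speech energy = noisy speech energy - noise energy.

    Values below `floor` are clipped (assumption, to avoid negative energy).
    """
    return np.maximum(np.asarray(speech_energy) - np.asarray(noise_energy), floor)
```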
As an example of the embodiment of the present invention, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy, and the noise spectral energy specifically includes:
according to the formula, calculating the actual clean speech signal spectral energy X(ω) of each frequency band, wherein Y(ω) represents the speech signal spectral energy, D(ω) represents the noise spectral energy, p and δ are constants, δ ∈ [0, 1], and the spectral energy of the estimated clean speech signal is represented by its own symbol in the formula.
In the embodiment of the present invention, p is a positive integer, usually 2.
δ is an empirical correction constant that is closer to 1 the larger p is; in most cases, p = 2 and δ = 0.5.
For ease of understanding, the derivation of the above formula is given below:
Let the form of the nonlinear spectral subtraction be:
|X(ω)|^p = α(ω)|Y(ω)|^p - β(ω)|D(ω)|^p
where X(ω) represents the spectral energy of the actual clean speech signal, Y(ω) represents the spectral energy of the speech signal (the noisy speech signal), D(ω) represents the spectral energy of the noise, the spectral energy of the estimated clean speech signal is represented by its own symbol, p and δ are constants, δ ∈ [0, 1], and usually p = 2.
From the above derivation, it can be seen that the embodiment of the present invention selects subtraction parameters that are optimal in the mean-square-error sense and includes a constraint term, so that it provides greater noise attenuation and, especially for low-energy speech segments, lower residual noise.
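The general form |X(ω)|^p = α(ω)|Y(ω)|^p - β(ω)|D(ω)|^p can be illustrated with the sketch below. The parameters `alpha` and `beta` are taken as inputs because their formulas are not reproduced here; the clipping of negative results to zero and the function name `nonlinear_subtract` are assumptions, not part of the described method.

```python
import numpy as np

def nonlinear_subtract(y_mag, d_mag, alpha, beta, p=2.0):
    """Evaluate |X|^p = alpha * |Y|^p - beta * |D|^p and return |X|.

    y_mag, d_mag: magnitudes of the noisy speech and noise spectra in one band.
    alpha, beta:  subtraction parameters for that band (supplied by the caller).
    p:            exponent, 2 in the usual case.
    """
    x_pow = alpha * np.asarray(y_mag) ** p - beta * np.asarray(d_mag) ** p
    # Assumption: negative results are clipped to zero before taking the root.
    return np.maximum(x_pow, 0.0) ** (1.0 / p)
```

With alpha = beta = 1 and p = 2 this reduces to ordinary power spectral subtraction, which makes the role of the two parameters easy to see.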
It should be understood that, in practice, after the frequency domain signal to be processed is divided into a plurality of frequency segments with the same sum of spectral energy according to the speech signal spectral energy, the first parameter α(ω) and the second parameter β(ω) corresponding to each frequency segment are calculated according to their respective formulas, and the actual clean speech signal spectral energy of each frequency segment is then calculated separately according to the formula |X(ω)|^p = α(ω)|Y(ω)|^p - β(ω)|D(ω)|^p. For example, if the frequency domain signal is divided into three frequency segments, the above calculation yields the actual clean speech signal spectral energy X1(ω) of the first frequency segment, X2(ω) of the second frequency segment, and X3(ω) of the third frequency segment. Finally, the actual clean speech signal spectral energies of the three frequency segments are added and the square root of the sum is taken to obtain the frequency domain signal of the actual clean speech signal corresponding to the speech signal to be processed, and an inverse Fourier transform is performed on this frequency domain signal to obtain the clean speech signal corresponding to the speech signal to be processed.
As an example of the embodiment of the present invention, the actual pure speech signal calculation module includes:
the actual pure voice frequency domain signal calculating unit is used for calculating to obtain an actual pure voice frequency domain signal of each frequency segment according to the actual pure voice signal spectrum energy of each frequency segment;
and the actual pure voice signal calculation unit is used for calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice frequency domain signal of each frequency segment.
Specifically, the actual clean speech signal spectral energy corresponding to the speech signal to be processed is obtained by adding the actual clean speech signal spectral energies of all frequency segments; taking the square root of this sum gives the frequency domain signal of the actual clean speech signal; and performing an inverse Fourier transform on that frequency domain signal gives the clean speech signal corresponding to the speech signal to be processed.
As an improvement of the above solution, the speech enhancement apparatus further includes: and the smoothing module is used for smoothing the frequency domain signal to be processed before the spectral energy of each frame of the frequency domain signal to be processed is calculated to obtain the spectral energy of the voice signal.
Specifically, the frequency domain signal to be processed is smoothed by low-pass filtering or windowing.
If the frequency domain signal to be processed is smoothed by windowing, and the window length L is assumed to be 3, the calculation method is as follows:
X′(n)=(X(n-1)+X(n)+X(n+1))/3
where X(n) is the frequency domain signal corresponding to one frame of the speech signal.
By smoothing the frequency domain signal to be processed, the embodiment of the present invention reduces the fluctuation of the spectrum estimate, which reduces residual noise and, in turn, speech distortion.
Example 3:
the present invention also provides a computer-readable storage medium, which specifically includes a stored computer program, where when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the speech enhancement method according to any of the above embodiments.
It should be noted that all or part of the flow in the method according to the above embodiments of the present invention may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be further noted that the content of the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
Claims (10)
1. A method of speech enhancement, comprising:
acquiring a frequency domain signal of a voice signal to be processed to obtain the frequency domain signal to be processed;
calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the voice signal;
dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
acquiring noise spectrum energy corresponding to the voice signal spectrum energy;
estimating the energy of a pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy;
calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
2. The speech enhancement method according to claim 1, wherein the obtaining of the frequency domain signal of the speech signal to be processed to obtain the frequency domain signal to be processed specifically comprises:
acquiring a voice signal to be processed;
performing framing processing on the voice signal to be processed to obtain a time domain signal to be processed;
and calculating to obtain a frequency domain signal to be processed according to the time domain signal to be processed.
3. The method of claim 1, wherein before calculating the spectral energy of each frame of the frequency-domain signal to be processed to obtain the spectral energy of the speech signal, the method further comprises:
and smoothing the frequency domain signal to be processed.
4. The speech enhancement method according to claim 1, wherein the estimating of the energy of the clean speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectral energy and the noise spectral energy to obtain an estimated clean speech signal spectral energy is specifically: and subtracting the noise spectrum energy from the voice signal spectrum energy to obtain the estimated pure voice signal spectrum energy.
5. The speech enhancement method according to claim 1, wherein the calculating the actual clean speech signal spectral energy for each frequency bin according to the speech signal spectral energy, the estimated clean speech signal spectral energy and the noise spectral energy comprises:
according to the formula, calculating the actual clean speech signal spectral energy X(ω) of each frequency band, wherein Y(ω) represents the speech signal spectral energy, D(ω) represents the noise spectral energy, p and δ are constants, δ ∈ [0, 1], and the spectral energy of the estimated clean speech signal is represented by its own symbol in the formula.
6. The speech enhancement method according to claim 1, wherein the calculating an actual clean speech signal corresponding to the speech signal to be processed according to the spectral energy of the actual clean speech signal in each frequency band specifically comprises:
calculating to obtain an actual pure voice frequency domain signal of each frequency segment according to the actual pure voice signal spectrum energy of each frequency segment;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice frequency domain signal of each frequency segment.
7. A speech enhancement apparatus, comprising:
the frequency domain signal acquisition module is used for acquiring a frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed;
the frequency spectrum energy calculation module is used for calculating the frequency spectrum energy of each frame of the frequency domain signal to be processed to obtain the frequency spectrum energy of the voice signal;
the frequency domain signal dividing module is used for dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
the noise spectrum energy acquisition module is used for acquiring noise spectrum energy corresponding to the voice signal spectrum energy;
the pre-estimation module is used for pre-estimating the energy of the pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain pre-estimated pure voice signal spectrum energy;
the actual pure voice signal spectrum energy calculation module is used for calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy;
and the actual pure voice signal calculation module is used for calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
8. The speech enhancement device of claim 7, further comprising: and the smoothing module is used for smoothing the frequency domain signal to be processed before the spectral energy of each frame of the frequency domain signal to be processed is calculated to obtain the spectral energy of the voice signal.
9. The speech enhancement device of claim 7, wherein the calculating the actual clean speech signal spectral energy for each frequency bin according to the speech signal spectral energy, the estimated clean speech signal spectral energy and the noise spectral energy comprises:
according to the formula, calculating the actual clean speech signal spectral energy X(ω) of each frequency segment, wherein Y(ω) represents the speech signal spectral energy, D(ω) represents the noise spectral energy, p and δ are constants, δ ∈ [0, 1], and the spectral energy of the estimated clean speech signal is represented by its own symbol in the formula.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the speech enhancement method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011501035.5A CN112634929A (en) | 2020-12-16 | 2020-12-16 | Voice enhancement method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112634929A true CN112634929A (en) | 2021-04-09 |
Family
ID=75317398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011501035.5A Pending CN112634929A (en) | 2020-12-16 | 2020-12-16 | Voice enhancement method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112634929A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5450522A (en) * | 1991-08-19 | 1995-09-12 | U S West Advanced Technologies, Inc. | Auditory model for parametrization of speech |
US20050143989A1 (en) * | 2003-12-29 | 2005-06-30 | Nokia Corporation | Method and device for speech enhancement in the presence of background noise |
CN101320566A (en) * | 2008-06-30 | 2008-12-10 | 中国人民解放军第四军医大学 | Non-air conduction speech reinforcement method based on multi-band spectrum subtraction |
US20120136655A1 (en) * | 2010-11-30 | 2012-05-31 | JVC KENWOOD Corporation a corporation of Japan | Speech processing apparatus and speech processing method |
CN108831500A (en) * | 2018-05-29 | 2018-11-16 | 平安科技(深圳)有限公司 | Sound enhancement method, device, computer equipment and storage medium |
CN110120225A (en) * | 2019-04-01 | 2019-08-13 | 西安电子科技大学 | A kind of audio defeat system and method for the structure based on GRU network |
CN110310656A (en) * | 2019-05-27 | 2019-10-08 | 重庆高开清芯科技产业发展有限公司 | A kind of sound enhancement method |
Non-Patent Citations (2)
Title |
---|
NAVNEET UPADHYAY,等: "An Improved Multi-Band Spectral Subtraction Algorithm for Enhancing Speech in Various Noise Environments", 《PROCEDIA ENGINEERING》, 31 December 2013 (2013-12-31) * |
孙博凯: "改进语音增强多频带谱减算法研究", 《电子设计工程》, 5 April 2012 (2012-04-05) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||