CN112634929A - Voice enhancement method, device and storage medium - Google Patents

Info

Publication number
CN112634929A
Authority
CN
China
Prior art keywords
signal
energy
processed
spectrum energy
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011501035.5A
Other languages
Chinese (zh)
Inventor
何维祯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pulian International Co ltd
Original Assignee
Pulian International Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a voice enhancement method, a device and a storage medium. The method comprises the following steps: acquiring a frequency domain signal of a voice signal to be processed to obtain the frequency domain signal to be processed; calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the voice signal spectrum energy; dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of spectral energy according to the voice signal spectrum energy; acquiring the noise spectrum energy corresponding to the voice signal spectrum energy; estimating the energy of the pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain the estimated pure voice signal spectrum energy; calculating the actual pure voice signal spectrum energy of each frequency segment according to the voice signal spectrum energy, the estimated pure voice signal spectrum energy and the noise spectrum energy; and calculating the actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment. The embodiment of the invention can reduce voice distortion.

Description

Voice enhancement method, device and storage medium
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a speech enhancement method, apparatus, and storage medium.
Background
Speech enhancement is an important branch of speech signal processing and is widely used in speech coding. With the continuous development of DSP chips, real-time speech enhancement has become practical and is applied in many settings. The main goals of speech enhancement in practice are to extract the original speech, as clean as possible, from noisy speech; to improve speech quality and eliminate background noise so that listening is comfortable and not tiring; and to improve intelligibility.
Among traditional speech enhancement algorithms, spectral subtraction is the most common and most widely applied. Traditional spectral subtraction exploits the fact that additive noise is uncorrelated with speech. Assuming the noise is statistically stationary, the noise spectrum measured during speech-free gaps is used as an estimate of the noise spectrum during speech periods, and the estimate of the pure speech spectrum is obtained by subtracting this noise spectrum from the noisy speech spectrum. Traditional spectral subtraction is simple to implement and has a low computational requirement, but because the estimated noise is usually not accurate enough, subtracting too much loses speech information and subtracting too little leaves excessive interference noise, so speech distortion often results.
Disclosure of Invention
The invention aims to provide a voice enhancement method, a voice enhancement device and a storage medium, which are used for solving the technical problem that the conventional spectral subtraction method often has voice distortion.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a speech enhancement method, including:
acquiring a frequency domain signal of a voice signal to be processed to obtain the frequency domain signal to be processed;
calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the voice signal;
dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
acquiring noise spectrum energy corresponding to the voice signal spectrum energy;
estimating the energy of a pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy;
calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
Further, the obtaining of the frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed specifically includes:
acquiring a voice signal to be processed;
performing framing processing on the voice signal to be processed to obtain a time domain signal to be processed;
and calculating to obtain a frequency domain signal to be processed according to the time domain signal to be processed.
Further, before the calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the speech signal, the method further includes:
and smoothing the frequency domain signal to be processed.
Further, estimating the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectrum energy and the noise spectrum energy to obtain the estimated pure speech signal spectrum energy specifically includes: subtracting the noise spectrum energy from the speech signal spectrum energy to obtain the estimated pure speech signal spectrum energy.
Further, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy specifically includes:
according to the formula
Figure BDA0002839687780000021
calculating the actual pure speech signal spectrum energy X(ω) of each frequency segment, where Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0, 1],
Figure BDA0002839687780000031
Figure BDA0002839687780000032
represents the estimated pure speech signal spectrum energy.
Further, the calculating according to the spectrum energy of the actual pure speech signal of each frequency segment to obtain the actual pure speech signal corresponding to the speech signal to be processed specifically includes:
calculating to obtain an actual pure voice frequency domain signal of each frequency segment according to the actual pure voice signal spectrum energy of each frequency segment;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice frequency domain signal of each frequency segment.
In a second aspect, an embodiment of the present invention provides a speech enhancement apparatus, including:
the frequency domain signal acquisition module is used for acquiring a frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed;
the frequency spectrum energy calculation module is used for calculating the frequency spectrum energy of each frame of the frequency domain signal to be processed to obtain the frequency spectrum energy of the voice signal;
the frequency domain signal dividing module is used for dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
the noise spectrum energy acquisition module is used for acquiring noise spectrum energy corresponding to the voice signal spectrum energy;
the pre-estimation module is used for pre-estimating the energy of the pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain pre-estimated pure voice signal spectrum energy;
the actual pure voice signal spectrum energy calculation module is used for calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy;
and the actual pure voice signal calculation module is used for calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
Further, the frequency domain signal acquisition module specifically includes:
a voice signal acquisition unit, configured to acquire a voice signal to be processed;
the framing unit is used for framing the voice signal to be processed to obtain a time domain signal to be processed;
and the calculating unit is used for calculating to-be-processed frequency domain signals according to the to-be-processed time domain signals.
Further, the speech enhancement apparatus further includes: and the smoothing module is used for smoothing the frequency domain signal to be processed before the spectral energy of each frame of the frequency domain signal to be processed is calculated to obtain the spectral energy of the voice signal.
Further, estimating the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectrum energy and the noise spectrum energy to obtain the estimated pure speech signal spectrum energy specifically includes: subtracting the noise spectrum energy from the speech signal spectrum energy to obtain the estimated pure speech signal spectrum energy.
Further, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy specifically includes:
according to the formula
Figure BDA0002839687780000041
calculating the actual pure speech signal spectrum energy X(ω) of each frequency segment, where Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0, 1],
Figure BDA0002839687780000042
Figure BDA0002839687780000043
represents the estimated pure speech signal spectrum energy.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where, when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to perform the speech enhancement method as described above.
Compared with the prior art, the frequency domain signal to be processed is divided into a plurality of frequency segments with the same sum of the spectrum energy according to the spectrum energy of the voice signal; acquiring noise spectrum energy corresponding to the voice signal spectrum energy; estimating the energy of a pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy; calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy; and calculating to obtain an actual pure speech signal corresponding to the speech signal to be processed according to the actual pure speech signal spectrum energy of each frequency band, so that the fluctuation of spectrum estimation is reduced, greater noise attenuation is provided, lower residual noise is brought, and the situation of speech distortion is reduced. In addition, the embodiment of the invention has simple calculation and lower calculation force requirement.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the invention provides a voice enhancement method, which can be executed by a terminal or a server. The terminal can be a smart phone, a tablet computer, a smart speaker, or another device that can acquire voice signals and has processing capability.
Referring to fig. 1, a speech enhancement method according to an embodiment of the present invention includes:
S1, acquiring the frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed.
In the embodiment of the present invention, it should be understood that the speech signal to be processed is a noisy speech signal, and may be, for example, a piece of audio collected by a microphone array.
S2, calculating the spectrum energy of each frame of the frequency domain signal to be processed to obtain the spectrum energy of the voice signal.
S3, dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of the spectral energy according to the spectral energy of the voice signal.
It should be noted that the number of frequency segments can be set according to the calculation power and the final effect of the device, and therefore, the number of frequency segments is not limited herein, and may be, for example, 5.
It should be appreciated that although conventional spectral subtraction attenuates the noise in the original noisy speech, it can introduce "musical noise", because the noise is colored and does not affect the speech signal uniformly across the whole spectrum: some frequency bands are affected more than others. Dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of spectral energy and processing each segment separately therefore reduces speech distortion.
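The division in step S3 can be illustrated with a minimal sketch. The patent does not say how the segment boundaries are found, so the cumulative-sum rule below, the function name `equal_energy_bands`, and the handling of bins that cannot be split are all assumptions made for illustration.

```python
def equal_energy_bands(energy, num_bands):
    """Split frequency bins into contiguous (start, end) ranges of roughly equal energy.

    `energy` holds the spectral energy of each frequency bin of one frame.
    Bins cannot be split, so the sums are equal only up to bin granularity,
    and fewer than `num_bands` ranges may be returned (assumption).
    """
    total = sum(energy)
    if num_bands < 1 or total <= 0:
        raise ValueError("need at least one band and positive total energy")
    bands = []
    start = 0
    running = 0.0
    for i, e in enumerate(energy):
        running += e
        # Cumulative energy at which the band currently being built should end.
        target = total * (len(bands) + 1) / num_bands
        if len(bands) < num_bands - 1 and running >= target:
            bands.append((start, i + 1))
            start = i + 1
    if start < len(energy):
        bands.append((start, len(energy)))
    return bands


energy = [2.0, 2.0, 1.0, 3.0, 4.0]
bands = equal_energy_bands(energy, 3)
band_sums = [sum(energy[a:b]) for a, b in bands]
```

In this toy input the three ranges each hold one third of the total energy; with real spectra the sums will only be approximately equal.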
S4, acquiring noise spectrum energy corresponding to the voice signal spectrum energy.
Conventional spectral subtraction assumes that the noise is statistically stationary. The noise spectrum estimate measured during speech-free gaps is used in place of the noise spectrum during speech periods and is subtracted from the noisy speech spectrum to obtain an estimate of the speech spectrum.
In the embodiment of the present invention, it should be understood that the noise spectrum is the noise spectrum measured during the speech-free gaps of the speech signal (the noisy speech signal).
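A minimal sketch of step S4 follows. The patent does not describe how speech-free gaps are detected, so the per-frame `is_speech` flags are assumed to come from some external voice activity decision, and averaging over the speech-free frames is likewise an assumption.

```python
def estimate_noise_energy(frame_energies, is_speech):
    """Average the per-bin spectral energy over frames flagged as speech-free.

    `frame_energies` is a list of frames, each a list of per-bin energies.
    `is_speech` is a hypothetical list of booleans, one per frame.
    """
    noise_frames = [e for e, speech in zip(frame_energies, is_speech) if not speech]
    if not noise_frames:
        raise ValueError("no speech-free frames available for the noise estimate")
    num_bins = len(noise_frames[0])
    return [
        sum(frame[k] for frame in noise_frames) / len(noise_frames)
        for k in range(num_bins)
    ]


frame_energies = [[1.0, 3.0], [10.0, 20.0], [3.0, 5.0]]
noise_energy = estimate_noise_energy(frame_energies, [False, True, False])
```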
S5, estimating the energy of the pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy.
S6, calculating the actual pure speech signal spectrum energy of each frequency band according to the speech signal spectrum energy, the pre-estimated pure speech signal spectrum energy and the noise spectrum energy.
It should be understood that conventional spectral subtraction, which estimates the pure speech signal spectral energy from the noisy speech signal spectral energy and the noise spectral energy, is a linear spectral subtraction. Because the noise is difficult to estimate accurately, subtracting too much loses speech information and subtracting too little leaves excessive interference noise, so speech distortion often occurs. The speech enhancement method proposed by the embodiment of the present invention, which calculates the pure speech signal spectral energy from the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy, is a nonlinear spectral subtraction. Compared with linear spectral subtraction it provides greater noise attenuation and leaves lower residual noise, especially for low-energy speech segments, thereby reducing speech distortion.
S7, calculating to obtain an actual pure speech signal corresponding to the speech signal to be processed according to the actual pure speech signal spectrum energy of each frequency segment.
In summary, in the speech enhancement method provided in the embodiment of the present invention, the frequency domain signal to be processed is divided into a plurality of frequency segments with the same sum of spectral energy according to the spectral energy of the speech signal; the noise spectrum energy corresponding to the voice signal spectrum energy is acquired; the energy of the pure voice signal corresponding to the frequency domain signal to be processed is estimated according to the voice signal spectrum energy and the noise spectrum energy to obtain the estimated pure voice signal spectrum energy; the actual pure voice signal spectrum energy of each frequency segment is calculated according to the voice signal spectrum energy, the estimated pure voice signal spectrum energy and the noise spectrum energy; and the actual pure speech signal corresponding to the speech signal to be processed is calculated according to the actual pure speech signal spectrum energy of each frequency segment. As a result, the fluctuation of the spectrum estimate is reduced, greater noise attenuation is provided, residual noise is lower, and speech distortion is reduced. In addition, the embodiment of the invention is simple to compute and has a low computational requirement.
As an example of the embodiment of the present invention, the obtaining of the frequency domain signal of the to-be-processed speech signal to obtain the to-be-processed frequency domain signal specifically includes S11-S13:
S11, acquiring the voice signal to be processed.
In an embodiment of the present invention, the voice signal to be processed is collected by an audio collecting device. Audio collecting devices include, but are not limited to, devices with a microphone array such as a smart phone, a tablet computer, a smart speaker, a handheld microphone, a desktop microphone, or an earphone microphone.
S12, performing framing processing on the voice signal to be processed to obtain a time domain signal to be processed.
In this embodiment, framing is performed according to a preset time length; for example, one frame may be 25 ms. To obtain better correlation between frames after the spectral energy calculation, adjacent frames overlap by 1/4 of a frame.
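The framing in step S12 can be sketched as follows, using the 25 ms frame length and 1/4-frame overlap given as examples in the text. The function name `frame_signal`, the 8 kHz sample rate, and the choice to drop trailing samples that do not fill a frame are assumptions for illustration only.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, overlap=0.25):
    """Split a sample sequence into fixed-length frames that overlap by `overlap` of a frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length and hop must be positive")
    frames = []
    start = 0
    # Trailing samples that do not fill a whole frame are dropped (assumption).
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


# 1000 samples at an assumed 8 kHz rate: 200-sample frames, 150-sample hop.
frames = frame_signal(list(range(1000)), sample_rate=8000)
```

A window function is often applied to each frame before the transform; the patent does not mention one, so none is applied here.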
S13, calculating to obtain a frequency domain signal to be processed according to the time domain signal to be processed.
Specifically, a Fourier transform or fast Fourier transform is performed on the time domain signal to be processed to obtain the frequency domain signal to be processed.
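As a dependency-free sketch of steps S13 and S2, the block below transforms one frame with a naive discrete Fourier transform and then takes the squared magnitude of each bin as its spectral energy. Taking the energy as |Y(ω)|^2 corresponds to the usual choice p = 2 and is an assumption here; a real implementation would normally use a fast Fourier transform routine instead of this O(N^2) loop.

```python
import cmath


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one time-domain frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def spectral_energy(spectrum):
    """Per-bin spectral energy, taken here as the squared magnitude (assumes p = 2)."""
    return [abs(c) ** 2 for c in spectrum]


frame = [1.0, 0.0, -1.0, 0.0]  # one period of a cosine sampled four times
spectrum = dft(frame)
energy = spectral_energy(spectrum)
```

For this frame the energy is concentrated in bins 1 and 3, the positive and negative frequency of the cosine.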
As an example of the embodiment of the present invention, estimating the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectrum energy and the noise spectrum energy to obtain the estimated pure speech signal spectrum energy specifically includes: subtracting the noise spectrum energy from the speech signal spectrum energy to obtain the estimated pure speech signal spectrum energy.
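The estimate in step S5 is a per-bin subtraction, sketched below. The patent only states the subtraction; clamping negative results to a floor of zero is an added assumption, included because energy cannot be negative.

```python
def estimate_clean_energy(speech_energy, noise_energy, floor=0.0):
    """Subtract the noise energy from the noisy speech energy bin by bin.

    Results below `floor` are clamped to it (assumption, not stated in the patent).
    """
    return [max(y - d, floor) for y, d in zip(speech_energy, noise_energy)]


estimated = estimate_clean_energy([5.0, 1.0, 3.0], [2.0, 2.0, 3.0])
```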
As an example of the embodiment of the present invention, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy, and the noise spectral energy specifically includes:
according to the formula
Figure BDA0002839687780000081
calculating the actual pure speech signal spectrum energy X(ω) of each frequency segment, where Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0, 1],
Figure BDA0002839687780000082
Figure BDA0002839687780000083
represents the estimated pure speech signal spectrum energy.
In the embodiment of the present invention, the value of p is a positive integer, and is usually 2.
δ is an empirical correction constant, and δ is closer to 1 as p is larger, and δ is 0.5 in most cases when p is 2.
For ease of understanding, the derivation of the above formula is given below:
Let the form of the nonlinear spectral subtraction be:
|X(ω)|^p = α(ω)|Y(ω)|^p - β(ω)|D(ω)|^p
where α(ω) is a first parameter, β(ω) is a second parameter,
Figure BDA0002839687780000091
Figure BDA0002839687780000092
From the above formula, one can deduce:
Figure BDA0002839687780000093
where X(ω) represents the actual pure speech signal spectrum energy, Y(ω) represents the spectrum energy of the speech signal (the noisy speech signal), D(ω) represents the noise spectrum energy,
Figure BDA0002839687780000094
represents the estimated pure speech signal spectrum energy, p and δ are constants, δ ∈ [0, 1], and usually p is 2.
From the above derivation, the embodiment of the present invention chooses the subtraction parameters to be optimal in the mean-square-error sense and subject to a constraint term, so it can provide greater noise attenuation and, especially for low-energy speech segments, lower residual noise.
It should be understood that, in practice, after the frequency domain signal to be processed has been divided into a plurality of frequency segments with the same sum of spectral energy according to the spectral energy of the speech signal, the formulas
Figure BDA0002839687780000095
Figure BDA0002839687780000096
and
Figure BDA0002839687780000097
are used to calculate the first parameter α(ω) and the second parameter β(ω) corresponding to each frequency segment, and the formula |X(ω)|^p = α(ω)|Y(ω)|^p - β(ω)|D(ω)|^p is then used to calculate the actual pure speech signal spectrum energy of each frequency segment separately. For example, if the frequency domain signal is divided into three frequency segments, this calculation yields the actual pure speech signal spectrum energy X1(ω) of the first segment, X2(ω) of the second segment, and X3(ω) of the third segment. Finally, the actual pure speech signal spectrum energies of the three segments are added and the root is taken, i.e.
Figure BDA0002839687780000098
to obtain the frequency domain signal of the actual pure speech signal corresponding to the speech signal to be processed; an inverse Fourier transform of that frequency domain signal then gives the pure speech signal corresponding to the speech signal to be processed.
As an example of the embodiment of the present invention, the calculating, according to the spectrum energy of the actual pure speech signal of each frequency segment, to obtain the actual pure speech signal corresponding to the speech signal to be processed specifically includes:
calculating to obtain an actual pure voice frequency domain signal of each frequency segment according to the actual pure voice signal spectrum energy of each frequency segment;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice frequency domain signal of each frequency segment.
Specifically, the actual pure speech signal spectrum energy corresponding to the speech signal to be processed is obtained by adding the actual pure speech signal spectrum energy of each frequency segment; the frequency domain signal of the actual pure speech signal is obtained by taking the root of that spectrum energy; and an inverse Fourier transform of this frequency domain signal gives the pure speech signal corresponding to the speech signal to be processed.
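The per-segment calculation and reconstruction can be sketched as below, under several stated assumptions. The formulas for α(ω) and β(ω) are given only as images in the source, so the sketch takes one hypothetical `alphas` and `betas` value per segment as input rather than computing them. Because the segments cover disjoint frequency bins, "adding" their results is implemented as filling each segment's bins in one spectrum. Clamping negative values to zero and reusing the phase of the noisy spectrum are also assumptions; the patent does not say how the phase is obtained.

```python
import cmath


def enhance_frame(noisy_spectrum, noise_mag, bands, alphas, betas, p=2):
    """Apply |X|^p = alpha*|Y|^p - beta*|D|^p segment by segment, keeping the noisy phase."""
    clean = [0j] * len(noisy_spectrum)
    for (start, end), alpha, beta in zip(bands, alphas, betas):
        for k in range(start, end):
            y_mag = abs(noisy_spectrum[k])
            x_pow = alpha * y_mag ** p - beta * noise_mag[k] ** p
            # Negative results are clamped to zero before taking the p-th root (assumption).
            x_mag = max(x_pow, 0.0) ** (1.0 / p)
            clean[k] = cmath.rect(x_mag, cmath.phase(noisy_spectrum[k]))
    return clean


def idft(spectrum):
    """Naive inverse DFT; keeping the real part assumes a conjugate-symmetric spectrum."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


noisy = [0j, 2 + 0j, 0j, 2 + 0j]  # spectrum of the frame [1, 0, -1, 0]
clean = enhance_frame(
    noisy,
    noise_mag=[0.0, 1.0, 0.0, 1.0],
    bands=[(0, 2), (2, 4)],
    alphas=[1.0, 1.0],  # hypothetical values; the source formulas are not reproduced
    betas=[1.0, 1.0],
)
samples = idft(clean)
```

With these inputs each active bin drops from magnitude 2 to the square root of 3, so the recovered frame is the original cosine scaled by about 0.866.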
As an improvement of the above scheme, before the calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the speech signal, the method further includes:
and smoothing the frequency domain signal to be processed.
Specifically, the frequency domain signal to be processed is smoothed by low-pass filtering or windowing.
If the frequency domain signal to be processed is smoothed by windowing, and the window length L is assumed to be 3, the calculation method is as follows:
X′(n)=(X(n-1)+X(n)+X(n+1))/3
where X(n) is the frequency domain signal corresponding to one frame of the speech signal.
By smoothing the frequency domain signal to be processed, the embodiment of the present invention reduces the fluctuation of the spectrum estimate, which reduces the residual noise and in turn the speech distortion.
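The window-length-3 smoothing above can be sketched as a moving average over the frame index n. The edge handling (repeating the first and last frame so that every frame has two neighbours) is an assumption, since the text does not say how boundary frames are treated; the name `smooth_frames` is hypothetical.

```python
import numpy as np

def smooth_frames(frames):
    """Three-point moving average X'(n) = (X(n-1) + X(n) + X(n+1)) / 3.

    frames: array of shape (num_frames, num_bins), one row per frame.
    """
    frames = np.asarray(frames)
    # Assumption: pad by repeating the first and last frame at the edges.
    padded = np.concatenate([frames[:1], frames, frames[-1:]], axis=0)
    # Average each frame with its previous and next neighbour.
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
```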
Example 2:
Referring to FIG. 2, an embodiment of the present invention provides a speech enhancement apparatus, including:
A frequency domain signal acquisition module 1, configured to acquire a frequency domain signal of the speech signal to be processed, obtaining the frequency domain signal to be processed.
In the embodiment of the present invention, the speech signal to be processed is a noisy speech signal, and may be, for example, a segment of audio collected by a microphone array.
A spectral energy calculation module 2, configured to calculate the spectral energy of each frame of the frequency domain signal to be processed, obtaining the speech signal spectral energy.
A frequency domain signal dividing module 3, configured to divide the frequency domain signal to be processed into a plurality of frequency bands with the same sum of spectral energy according to the speech signal spectral energy.
It should be noted that the number of frequency bands can be set according to the computing power of the device and the desired final effect, so it is not limited here; it may be, for example, 5.
It should be appreciated that while conventional spectral subtraction can attenuate noise in the original noisy speech, it can introduce "musical noise", because the noise is colored and does not affect the speech signal uniformly across the entire spectrum. Since noise affects some frequency bands more strongly than others, the frequency domain signal to be processed is divided into a plurality of frequency bands with the same sum of spectral energy and each band is processed separately, which reduces speech distortion.
A noise spectral energy acquisition module 4, configured to acquire the noise spectral energy corresponding to the speech signal spectral energy.
Conventional spectral subtraction assumes that the noise is statistically stationary: the noise spectrum estimate computed during gaps without speech stands in for the noise spectrum during speech periods, and is subtracted from the noisy speech spectrum to obtain an estimate of the speech spectrum.
In the embodiment of the present invention, it should be understood that the noise spectrum is the noise spectrum calculated from the gaps without speech in the speech signal (the noisy speech signal).
A pre-estimation module 5, configured to estimate the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectral energy and the noise spectral energy, obtaining the estimated pure speech signal spectral energy.
An actual pure speech signal spectral energy calculation module 6, configured to calculate the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy.
It should be understood that conventional spectral subtraction, in which the pure speech signal spectral energy is estimated from the noisy speech signal spectral energy and the noise spectral energy, is a linear spectral subtraction. Linear spectral subtraction has difficulty estimating the noise accurately: if too much is subtracted, speech information is lost, and if too little is subtracted, excessive interfering noise remains, so speech distortion often occurs. In the speech enhancement method proposed by the embodiment of the present invention, the pure speech signal spectral energy is calculated from the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy. This is a nonlinear spectral subtraction, which provides greater noise attenuation and leaves lower residual noise than linear spectral subtraction, especially for low-energy speech segments, thereby reducing speech distortion.
An actual pure speech signal calculation module 7, configured to calculate the actual pure speech signal corresponding to the speech signal to be processed according to the actual pure speech signal spectral energy of each frequency band.
In summary, the embodiment of the present invention divides the frequency domain signal to be processed into a plurality of frequency bands with the same sum of spectral energy according to the speech signal spectral energy; acquires the noise spectral energy corresponding to the speech signal spectral energy; estimates the energy of the pure speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectral energy and the noise spectral energy, obtaining the estimated pure speech signal spectral energy; calculates the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy and the noise spectral energy; and calculates the actual pure speech signal corresponding to the speech signal to be processed according to the actual pure speech signal spectral energy of each frequency band. This reduces the fluctuation of the spectrum estimate, provides greater noise attenuation and leaves lower residual noise, thereby reducing speech distortion. In addition, the embodiment of the invention is simple to compute and has a low computing power requirement.
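The division into frequency bands with the same sum of spectral energy, as performed by the frequency domain signal dividing module, can be sketched with a cumulative sum. This is a minimal sketch, not the patented procedure: the text does not say how band edges are placed between discrete frequency bins, so the sums are only approximately equal in general, and a strongly peaked spectrum can yield an empty band. The name `equal_energy_bands` is hypothetical.

```python
import numpy as np

def equal_energy_bands(energy, num_bands):
    """Return (start, stop) bin-index pairs with roughly equal energy sums.

    energy: 1-D array of spectral energy per frequency bin for one frame.
    num_bands: desired number of frequency bands.
    """
    energy = np.asarray(energy, dtype=float)
    cumulative = np.cumsum(energy)
    # Energy levels at which each band (except the last) should end.
    targets = cumulative[-1] * np.arange(1, num_bands) / num_bands
    # First bin whose cumulative energy reaches each target closes the band.
    cuts = np.searchsorted(cumulative, targets, side="left") + 1
    edges = np.concatenate([[0], cuts, [len(energy)]])
    return [(int(a), int(b)) for a, b in zip(edges[:-1], edges[1:])]
```

For a flat spectrum the bands come out equally wide; for a spectrum with one dominant bin, that bin forms a narrow band of its own while the weaker bins are grouped into wider bands.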
As an example of the embodiment of the present invention, the frequency domain signal acquisition module includes:
A speech signal acquisition unit, configured to acquire the speech signal to be processed.
In an embodiment of the present invention, the speech signal to be processed is collected by an audio collecting device, where the audio collecting device includes, but is not limited to, an audio collecting device with a microphone array, such as a smart phone, a tablet computer, a smart speaker, a handheld microphone, a desktop microphone, an earphone microphone, and the like.
A framing unit, configured to frame the speech signal to be processed, obtaining a time domain signal to be processed.
In this embodiment, framing is performed according to a preset time length; for example, one frame may be 25 ms. To obtain better correlation between frames after the spectral energy calculation, adjacent windows overlap by 1/4 of a frame.
A calculating unit, configured to calculate the frequency domain signal to be processed according to the time domain signal to be processed.
Specifically, a Fourier transform or fast Fourier transform is performed on the time domain signal to be processed to obtain the frequency domain signal to be processed.
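The framing and transform steps can be sketched as below, using the 25 ms frame length and 1/4-frame overlap mentioned above. Several details are assumptions, because the text does not fix them: no tapering window is applied (each frame is a plain rectangular cut), trailing samples that do not fill a whole frame are dropped, and a one-sided FFT is used because the input is real-valued. The name `frame_and_transform` is hypothetical.

```python
import numpy as np

def frame_and_transform(signal, sample_rate, frame_ms=25.0, overlap=0.25):
    """Split a signal into overlapping frames and FFT each frame.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # Adjacent frames share `overlap` of a frame, so the hop is the remainder.
    hop = frame_len - int(round(frame_len * overlap))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Assumption: samples after the last complete frame are discarded.
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(num_frames)]
    )
    # One-sided FFT of each frame (time domain to frequency domain).
    return np.fft.rfft(frames, axis=1)
```

At a 16 kHz sample rate this gives 400-sample frames with a 300-sample hop.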
As an example of the embodiment of the present invention, the estimating, according to the speech signal spectrum energy and the noise spectrum energy, the energy of the pure speech signal corresponding to the frequency domain signal to be processed to obtain estimated pure speech signal spectrum energy specifically includes: subtracting the noise spectrum energy from the speech signal spectrum energy to obtain the estimated pure speech signal spectrum energy.
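The pre-estimation step, together with a simple noise estimate, can be sketched as follows. The sketch assumes that a known number of leading frames contain no speech and averages them to obtain the noise spectral energy; the text only says the noise is measured in gaps without speech and does not describe how those gaps are found. Clipping negative differences to zero is also an assumption. The names `estimate_clean_energy` and `noise_frame_count` are hypothetical.

```python
import numpy as np

def estimate_clean_energy(speech_energy, noise_frame_count):
    """Linear spectral subtraction on per-frame spectral energy.

    speech_energy: array of shape (num_frames, num_bins), noisy energy.
    noise_frame_count: number of leading frames assumed to be speech-free.
    Returns (estimated_clean_energy, noise_energy).
    """
    speech_energy = np.asarray(speech_energy, dtype=float)
    # Assumption: the first frames are noise only; average them per bin.
    noise_energy = speech_energy[:noise_frame_count].mean(axis=0)
    # Subtract the noise energy; clip at zero so energy stays non-negative.
    estimated = np.maximum(speech_energy - noise_energy, 0.0)
    return estimated, noise_energy
```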
As an example of the embodiment of the present invention, the calculating the actual pure speech signal spectral energy of each frequency band according to the speech signal spectral energy, the estimated pure speech signal spectral energy, and the noise spectral energy specifically includes:
according to the formula
Figure BDA0002839687780000131
the actual pure speech signal spectrum energy X(ω) of each frequency band is calculated, where Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0, 1],
Figure BDA0002839687780000132
Figure BDA0002839687780000133
Representing the spectral energy of the estimated clean speech signal.
In the embodiment of the present invention, p is a positive integer, usually 2.
δ is an empirical correction constant that approaches 1 as p increases; in most cases, p = 2 and δ = 0.5.
For ease of understanding, the derivation of the above formula is given below:
let the form of the nonlinear spectral subtraction be:
$|X(\omega)|^p = \alpha(\omega)|Y(\omega)|^p - \beta(\omega)|D(\omega)|^p$
where α(ω) is the first parameter and β(ω) is the second parameter,
Figure BDA0002839687780000141
Figure BDA0002839687780000142
from the above formula, one can deduce:
Figure BDA0002839687780000143
where X(ω) represents the actual pure speech signal spectral energy, Y(ω) represents the spectral energy of the speech signal (the noisy speech signal), D(ω) represents the noise spectral energy,
Figure BDA0002839687780000144
represents the estimated pure speech signal spectral energy, p and δ are constants, δ ∈ [0, 1], and usually p = 2.
As the above derivation shows, the embodiment of the present invention selects the subtraction parameters optimally in the mean-square-error sense and includes a constraint term, so it can provide greater noise attenuation and, especially for low-energy speech segments, lower residual noise.
It should be understood that, in practice, after the frequency domain signal to be processed is divided into a plurality of frequency bands with the same sum of spectral energy according to the speech signal spectral energy, the first parameter α(ω) and the second parameter β(ω) corresponding to each frequency band are calculated according to the formulas
Figure BDA0002839687780000145
Figure BDA0002839687780000146
and
Figure BDA0002839687780000147
Then, according to the formula $|X(\omega)|^p = \alpha(\omega)|Y(\omega)|^p - \beta(\omega)|D(\omega)|^p$, the actual pure speech signal spectral energy of each frequency band is calculated separately. For example, if the frequency domain signal is divided into three frequency bands, the above calculation yields the actual pure speech signal spectral energy X1(ω) of the first frequency band, X2(ω) of the second frequency band, and X3(ω) of the third frequency band. Finally, the actual pure speech signal spectral energies of the three frequency bands are added and the square root is taken, i.e.
Figure BDA0002839687780000148
This yields the frequency domain signal of the actual pure speech signal corresponding to the speech signal to be processed; an inverse Fourier transform of this frequency domain signal then gives the pure speech signal corresponding to the speech signal to be processed.
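The band-by-band application of the nonlinear subtraction formula can be sketched as below. The formulas for α(ω) and β(ω) are given only as figures that are not reproduced in this text, so the sketch takes them as caller-supplied inputs rather than computing them. It also assumes that the inputs are magnitudes |Y(ω)| and |D(ω)| and that negative results are clipped to zero; neither detail is stated in the text. The name `bandwise_subtraction` is hypothetical.

```python
import numpy as np

def bandwise_subtraction(speech_mag, noise_mag, bands, alphas, betas, p=2.0):
    """Evaluate |X|^p = alpha * |Y|^p - beta * |D|^p separately per band.

    speech_mag, noise_mag: 1-D magnitude spectra |Y(w)| and |D(w)|.
    bands: list of (start, stop) bin-index pairs, one per frequency band.
    alphas, betas: one value (or per-bin array) per band, supplied by the
        caller because their formulas are not recoverable from the text.
    Returns |X(w)|^p for every bin covered by the bands.
    """
    speech_mag = np.asarray(speech_mag, dtype=float)
    noise_mag = np.asarray(noise_mag, dtype=float)
    clean_power = np.zeros_like(speech_mag)
    for (start, stop), alpha, beta in zip(bands, alphas, betas):
        y = speech_mag[start:stop] ** p
        d = noise_mag[start:stop] ** p
        # Assumption: clip negative results to zero.
        clean_power[start:stop] = np.maximum(alpha * y - beta * d, 0.0)
    return clean_power
```

With α = 1 and β = 1 in every band and p = 2, the sketch reduces to the linear power spectral subtraction used for the pre-estimate.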
As an example of the embodiment of the present invention, the actual pure speech signal calculation module includes:
an actual pure speech frequency domain signal calculation unit, configured to calculate the actual pure speech frequency domain signal of each frequency band according to the actual pure speech signal spectral energy of each frequency band;
and an actual pure speech signal calculation unit, configured to calculate the actual pure speech signal corresponding to the speech signal to be processed according to the actual pure speech frequency domain signal of each frequency band.
Specifically, the actual pure speech signal spectral energy corresponding to the speech signal to be processed is obtained by adding the actual pure speech signal spectral energies of all frequency bands; taking the square root of this sum gives the frequency domain signal of the actual pure speech signal corresponding to the speech signal to be processed, and an inverse Fourier transform of that frequency domain signal gives the pure speech signal corresponding to the speech signal to be processed.
As an improvement of the above solution, the speech enhancement apparatus further includes a smoothing module, configured to smooth the frequency domain signal to be processed before the spectral energy of each frame of the frequency domain signal to be processed is calculated to obtain the speech signal spectral energy.
Specifically, the frequency domain signal to be processed is smoothed by low-pass filtering or windowing.
If the frequency domain signal to be processed is smoothed by windowing, and the window length L is assumed to be 3, the calculation method is as follows:
X′(n)=(X(n-1)+X(n)+X(n+1))/3
where X(n) is the frequency domain signal corresponding to one frame of the speech signal.
By smoothing the frequency domain signal to be processed, the embodiment of the present invention reduces the fluctuation of the spectrum estimate, which reduces the residual noise and in turn the speech distortion.
Example 3:
An embodiment of the present invention also provides a computer-readable storage medium comprising a stored computer program, where, when the computer program runs, the device in which the computer-readable storage medium is located is controlled to execute the speech enhancement method according to any of the above embodiments.
It should be noted that all or part of the flow of the method according to the above embodiments of the present invention may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be further noted that the content of the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method of speech enhancement, comprising:
acquiring a frequency domain signal of a voice signal to be processed to obtain the frequency domain signal to be processed;
calculating the spectral energy of each frame of the frequency domain signal to be processed to obtain the spectral energy of the voice signal;
dividing the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
acquiring noise spectrum energy corresponding to the voice signal spectrum energy;
estimating the energy of a pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy;
calculating actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the pre-estimated pure voice signal spectrum energy and the noise spectrum energy;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
2. The speech enhancement method according to claim 1, wherein the obtaining of the frequency domain signal of the speech signal to be processed to obtain the frequency domain signal to be processed specifically comprises:
acquiring a voice signal to be processed;
performing framing processing on the voice signal to be processed to obtain a time domain signal to be processed;
and calculating to obtain a frequency domain signal to be processed according to the time domain signal to be processed.
3. The method of claim 1, wherein before calculating the spectral energy of each frame of the frequency-domain signal to be processed to obtain the spectral energy of the speech signal, the method further comprises:
and smoothing the frequency domain signal to be processed.
4. The speech enhancement method according to claim 1, wherein the estimating of the energy of the clean speech signal corresponding to the frequency domain signal to be processed according to the speech signal spectral energy and the noise spectral energy to obtain an estimated clean speech signal spectral energy is specifically: subtracting the noise spectrum energy from the voice signal spectrum energy to obtain the estimated pure voice signal spectrum energy.
5. The speech enhancement method according to claim 1, wherein the calculating the actual clean speech signal spectral energy for each frequency bin according to the speech signal spectral energy, the estimated clean speech signal spectral energy and the noise spectral energy comprises:
according to the formula
Figure FDA0002839687770000021
calculating the actual pure speech signal spectrum energy X(ω) of each frequency band, wherein Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0, 1],
Figure FDA0002839687770000022
Figure FDA0002839687770000023
Representing the spectral energy of the estimated clean speech signal.
6. The speech enhancement method according to claim 1, wherein the calculating an actual clean speech signal corresponding to the speech signal to be processed according to the spectral energy of the actual clean speech signal in each frequency band specifically comprises:
calculating to obtain an actual pure voice frequency domain signal of each frequency segment according to the actual pure voice signal spectrum energy of each frequency segment;
and calculating to obtain an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice frequency domain signal of each frequency segment.
7. A speech enhancement apparatus, comprising:
a frequency domain signal acquisition module, configured to acquire a frequency domain signal of the voice signal to be processed to obtain the frequency domain signal to be processed;
a frequency spectrum energy calculation module, configured to calculate the frequency spectrum energy of each frame of the frequency domain signal to be processed to obtain the voice signal spectrum energy;
a frequency domain signal dividing module, configured to divide the frequency domain signal to be processed into a plurality of frequency segments with the same sum of frequency spectrum energy according to the voice signal spectrum energy;
a noise spectrum energy acquisition module, configured to acquire noise spectrum energy corresponding to the voice signal spectrum energy;
a pre-estimation module, configured to estimate the energy of the pure voice signal corresponding to the frequency domain signal to be processed according to the voice signal spectrum energy and the noise spectrum energy to obtain estimated pure voice signal spectrum energy;
an actual pure voice signal spectrum energy calculation module, configured to calculate actual pure voice signal spectrum energy of each frequency band according to the voice signal spectrum energy, the estimated pure voice signal spectrum energy and the noise spectrum energy;
and an actual pure voice signal calculation module, configured to calculate an actual pure voice signal corresponding to the voice signal to be processed according to the actual pure voice signal spectrum energy of each frequency segment.
8. The speech enhancement device of claim 7, further comprising: a smoothing module, configured to smooth the frequency domain signal to be processed before the spectral energy of each frame of the frequency domain signal to be processed is calculated to obtain the spectral energy of the voice signal.
9. The speech enhancement device of claim 7, wherein the calculating the actual clean speech signal spectral energy for each frequency bin according to the speech signal spectral energy, the estimated clean speech signal spectral energy and the noise spectral energy comprises:
according to the formula
Figure FDA0002839687770000031
calculating the actual pure speech signal spectrum energy X(ω) of each frequency segment, wherein Y(ω) represents the speech signal spectrum energy, D(ω) represents the noise spectrum energy, p and δ are constants, δ ∈ [0, 1],
Figure FDA0002839687770000032
Figure FDA0002839687770000033
Representing the spectral energy of the estimated clean speech signal.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the speech enhancement method according to any one of claims 1 to 6.
CN202011501035.5A 2020-12-16 2020-12-16 Voice enhancement method, device and storage medium Pending CN112634929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011501035.5A CN112634929A (en) 2020-12-16 2020-12-16 Voice enhancement method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011501035.5A CN112634929A (en) 2020-12-16 2020-12-16 Voice enhancement method, device and storage medium

Publications (1)

Publication Number Publication Date
CN112634929A true CN112634929A (en) 2021-04-09

Family

ID=75317398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011501035.5A Pending CN112634929A (en) 2020-12-16 2020-12-16 Voice enhancement method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112634929A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5450522A (en) * 1991-08-19 1995-09-12 U S West Advanced Technologies, Inc. Auditory model for parametrization of speech
US20050143989A1 (en) * 2003-12-29 2005-06-30 Nokia Corporation Method and device for speech enhancement in the presence of background noise
CN101320566A (en) * 2008-06-30 2008-12-10 中国人民解放军第四军医大学 Non-air conduction speech reinforcement method based on multi-band spectrum subtraction
US20120136655A1 (en) * 2010-11-30 2012-05-31 JVC KENWOOD Corporation a corporation of Japan Speech processing apparatus and speech processing method
CN108831500A (en) * 2018-05-29 2018-11-16 平安科技(深圳)有限公司 Sound enhancement method, device, computer equipment and storage medium
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
CN110310656A (en) * 2019-05-27 2019-10-08 重庆高开清芯科技产业发展有限公司 A kind of sound enhancement method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NAVNEET UPADHYAY,等: "An Improved Multi-Band Spectral Subtraction Algorithm for Enhancing Speech in Various Noise Environments", 《PROCEDIA ENGINEERING》, 31 December 2013 (2013-12-31) *
孙博凯: "改进语音增强多频带谱减算法研究", 《电子设计工程》, 5 April 2012 (2012-04-05) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination