CN113963710A - Voice enhancement method and device, electronic equipment and storage medium - Google Patents

Publication number
CN113963710A
Authority
CN
China
Prior art keywords
power spectrum
enhanced
voice signal
signal
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111216471.2A
Other languages
Chinese (zh)
Inventor
秦永红
付贤会
刘武钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rongxun Technology Co ltd
Original Assignee
Beijing Rongxun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rongxun Technology Co ltd
Priority to CN202111216471.2A
Publication of CN113963710A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiment discloses a voice enhancement method, a voice enhancement device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal; determining a power spectrum estimation value of the pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal; determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal; determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced; and determining the enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal. The embodiment of the invention obtains the enhanced voice signal by determining the masking threshold, can enhance the noise suppression result and improve the voice recognition effect.

Description

Voice enhancement method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of signal processing, in particular to a voice enhancement method, a voice enhancement device, electronic equipment and a storage medium.
Background
With the rapid development of signal processing and speech recognition technologies, speech enhancement in front-end preprocessing is becoming increasingly important. When a device plays sound, noise is usually heard along with the voice; the noise interferes with the voice and can even impair how the human ear perceives it. A speech enhancement method is therefore generally used to process a speech signal containing noise.
At present, speech enhancement methods mainly include spectral subtraction, wavelet transform, Wiener filtering, and the like. Spectral subtraction suppresses noise well when the signal-to-noise ratio of the input signal is high. When the signal-to-noise ratio is low, however, more noise remains, and the half-wave rectification applied to the negative values produced by the subtraction gives rise to musical noise, which seriously degrades speech recognition. In addition, in a non-stationary environment, many speech enhancement methods suffer from tracking delay and large errors.
Therefore, how to suppress noise and enhance the voice effect in a non-stationary environment is a technical problem to be solved urgently by those skilled in the art.
Disclosure of Invention
Embodiments of the present invention provide a method and an apparatus for speech enhancement, an electronic device, and a storage medium, which can enhance a noise suppression result and improve a speech recognition effect.
In a first aspect, an embodiment of the present invention provides a speech enhancement method, including:
acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal;
determining a power spectrum estimation value of a pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal;
determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal;
determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal.
In a second aspect, an embodiment of the present invention further provides a speech enhancement apparatus, including:
the parameter acquisition module is used for acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal;
the pure voice power spectrum estimation value determining module is used for determining a power spectrum estimation value of the pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal;
the masking threshold determining module is used for determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal;
the pure voice enhanced power spectrum value determining module is used for determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and the enhanced voice signal determining module is used for determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement a speech enhancement method according to any of the embodiments of the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the speech enhancement method according to any embodiment of the present invention.
The embodiment of the invention discloses a voice enhancement method. The method comprises the following steps: acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal; determining a power spectrum estimation value of the pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal; determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal; determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced; and determining the enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal. The embodiment of the invention obtains the enhanced voice signal by determining the masking threshold values of different frequency bands of the voice signal to be enhanced, can enhance the noise suppression result, and improves the signal-to-noise ratio of the voice signal to be enhanced and the voice recognition effect.
Drawings
Fig. 1 is a flowchart of a speech enhancement method according to a first embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a speech enhancement apparatus according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Embodiment One
Fig. 1 is a flowchart of a speech enhancement method according to a first embodiment of the present invention. This embodiment is applicable to enhancing a speech signal to be enhanced. The method may be performed by a speech enhancement apparatus, which may be implemented in software and/or hardware and configured in an electronic device. The electronic device may be, for example, a device with communication and computing capabilities, such as a background server. As shown in Fig. 1, the method specifically includes:
S110, acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal.
The voice signal to be enhanced is acquired by at least one voice acquisition device at a voice acquisition site. The voice acquisition site may be a communication site in a noisy environment such as a conference room, a broadcasting room, or a railway station, and may also be a military communication site, a voice recognition site, and the like. For example, when a broadcaster broadcasts news, various sounds may be present in the broadcasting room, such as traffic noise from vehicles passing outside the building, or noise from the air conditioning system, the light control system, the cameras, and staff moving back and forth inside the building. In this case, the voice signals in the broadcasting room need to be collected, and voice enhancement needs to be performed on the broadcaster's voice signal.
The voice acquisition device may be a microphone or a wave detector. The number of voice acquisition devices is not limited and may be one or more. When there are two or more voice acquisition devices, they may be placed so as to collect voice signals at different positions, and their arrangement is not limited. For example, the voice acquisition devices may be arranged circumferentially around the source of the pure voice signal in the voice signal to be enhanced. In addition, because noise interference in the voice signal to be enhanced is uncertain and random, the voice acquisition device may acquire the voice signal to be enhanced continuously, or intermittently at short intervals.
Further, to better perform speech enhancement, the collected speech signal to be enhanced needs to be converted into a frequency-domain signal, for example by a Fourier transform. The power spectrum is the power of the speech signal to be enhanced per unit frequency band, and the phase spectrum is its phase per unit frequency band. Both can be obtained by applying a Fast Fourier Transform (FFT) to the collected speech signal to be enhanced.
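As a minimal sketch of this step, the following Python computes the power spectrum and phase spectrum of a single frame. To stay self-contained it uses a naive discrete Fourier transform from the standard library in place of an FFT; the function names `dft` and `power_and_phase` and the 8-sample test frame are illustrative assumptions and are not taken from the embodiment.

```python
import cmath
import math


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_and_phase(frame):
    """Return (power spectrum, phase spectrum) of one frame."""
    spectrum = dft(frame)
    power = [abs(c) ** 2 for c in spectrum]      # squared magnitude per bin
    phase = [cmath.phase(c) for c in spectrum]   # angle in radians per bin
    return power, phase


# Usage: one cycle of a cosine over 8 samples puts its energy in bins 1 and 7.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
power, phase = power_and_phase(frame)
```

In practice an FFT routine would replace `dft`; the power and phase definitions stay the same.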
Further, the power spectrum estimation value of the noise signal in the voice signal to be enhanced is obtained by estimating the noise signal from the collected voice signal to be enhanced and applying FFT processing. The noise signal may be estimated by, for example, any of the following three methods: a recursive average noise estimation algorithm, a minimum tracking algorithm, or a histogram noise estimation algorithm.
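The text does not spell out any of the three estimators. As a hedged illustration of the first one only, the sketch below shows the simplest form of recursive averaging, in which the noise estimate of each frequency bin is a first-order smoothed version of the frame power. The fixed smoothing factor, the function name `update_noise_estimate`, and the sample values are assumptions; practical variants usually adapt the smoothing factor to the estimated speech presence.

```python
def update_noise_estimate(prev_noise, frame_power, smoothing=0.9):
    """One recursive-averaging step per bin: N_t = a * N_(t-1) + (1 - a) * P_t."""
    return [
        smoothing * n + (1.0 - smoothing) * p
        for n, p in zip(prev_noise, frame_power)
    ]


# Usage: initialise from the first frame, then fold in later frames.
frames = [[1.0, 4.0], [3.0, 4.0], [1.0, 4.0]]  # hypothetical per-bin powers
noise = frames[0]
for frame_power in frames[1:]:
    noise = update_noise_estimate(noise, frame_power, smoothing=0.5)
```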
It should be noted that the speech signal to be enhanced includes a clean speech signal and a noise signal. The clean speech signal is the desired speech signal, and the noise signal is every interfering signal other than the desired speech signal. For example, when a broadcaster broadcasts news indoors, the broadcaster's voice is the clean speech signal, while the sounds of the indoor air conditioning system, the light control system, the cameras, and staff walking back and forth, together with the sound of vehicles passing outdoors, are noise signals.
S120, determining the power spectrum estimation value of the pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal.
The power spectrum estimation value of the pure voice signal can be obtained by spectral subtraction. Spectral subtraction assumes that the noise signal and the clean voice signal are uncorrelated and that the noise signal is stationary; the estimate is obtained by subtracting the power spectrum estimation value of the noise signal from the power spectrum of the voice signal to be enhanced.
For example, the noise signal may be assumed to be stationary, i.e., the expected value of its power spectrum is the same in periods with and without the clean speech signal. The power spectrum of the noise signal in periods with the clean speech signal is then replaced by the power spectrum of the noise signal measured in periods without it. Finally, the power spectrum estimation value of the noise signal is subtracted from the power spectrum of the voice signal to be enhanced to obtain the power spectrum estimation value of the pure voice signal. When the difference is negative, it is set to zero.
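The subtraction and zeroing described above can be sketched per frequency bin as follows. The function name `spectral_subtraction` and the sample values are illustrative assumptions.

```python
def spectral_subtraction(noisy_power, noise_power):
    """Subtract the noise power estimate bin by bin, zeroing negative results."""
    return [max(p - n, 0.0) for p, n in zip(noisy_power, noise_power)]


# Usage: the second bin would go negative and is therefore set to zero.
noisy_power = [9.0, 2.0, 5.0]
noise_power = [4.0, 3.0, 5.0]
clean_power = spectral_subtraction(noisy_power, noise_power)
```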
S130, determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal.
It should be noted that, in a short time (e.g., 10ms to 30ms), the shape of the vocal cords and vocal tract of the human is relatively stable, and thus the short-time spectrum of the acquired human voice signal has relative stability. The voice acquisition site may be in a non-stationary environment, and in order to ensure that the noise signal in the voice signal to be enhanced can be effectively suppressed in the non-stationary environment, the voice signal to be enhanced needs to be divided into a plurality of frequency bands according to the frequency domain of the voice signal to be enhanced. It is understood that the clean speech signal (e.g., the human voice signal) in each of the divided frequency bands is quasi-stationary.
The frequency band division may be based on the Bark scale or on the Mel scale. The Bark scale is a unit of perceptual frequency: the frequency of the voice signal to be enhanced, in Hertz, is mapped to 24 psychoacoustic critical bands, each one Bark wide. When the frequency bands are divided on the Bark scale, physical frequency needs to be converted into psychoacoustic frequency. The Mel scale is a frequency band division approach closer to the human auditory system.
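The text does not state which Hertz-to-Bark conversion is used. As one possible sketch, the code below uses the widely cited Zwicker and Terhardt approximation and takes the integer part of the Bark value as the critical band index; the formula choice, the clamping to 24 bands, and the function names are assumptions made for illustration.

```python
import math


def hz_to_bark(freq_hz):
    """Approximate Bark value of a frequency in Hz (Zwicker-Terhardt form)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def bark_band_index(freq_hz, num_bands=24):
    """Zero-based critical band index, clamped to the last band."""
    return min(int(hz_to_bark(freq_hz)), num_bands - 1)


# Usage: map a few frequencies to their critical bands.
bands = [bark_band_index(f) for f in (0.0, 1000.0, 20000.0)]
```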
For example, the way of dividing the speech signal to be enhanced into frequency bands may be selected for each dedicated device according to its application scenario. When a broadcaster broadcasts news in a broadcasting room, for instance, the collected sound signal of the broadcaster and the noise signal in the broadcasting room are relatively stable, so the frequency domain of the speech signal to be enhanced may be divided into 26 frequency bands according to the Mel scale.
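A division into 26 Mel-scale bands could look like the sketch below, which spaces the band edges equally on the Mel axis and converts them back to Hertz. The common 2595·log10(1 + f/700) Mel formula, the 8 kHz upper limit, and the helper names are assumptions; the text specifies only the number of bands.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to Mel using the common 2595 * log10(1 + f / 700) form."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(max_hz, num_bands=26):
    """Return num_bands + 1 edge frequencies in Hz, equally spaced in Mel."""
    max_mel = hz_to_mel(max_hz)
    return [mel_to_hz(max_mel * i / num_bands) for i in range(num_bands + 1)]


def band_of(freq_hz, edges):
    """Index of the band containing freq_hz; the top edge maps to the last band."""
    for band in range(len(edges) - 1):
        if freq_hz < edges[band + 1]:
            return band
    return len(edges) - 2


# Usage: 26 bands up to an assumed 8 kHz limit (16 kHz sampling rate).
edges = mel_band_edges(8000.0)
```

Because the edges are equally spaced in Mel, the bands are narrow at low frequencies and widen toward high frequencies.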
In this embodiment, optionally, the determining a masking threshold according to the power spectrum estimation value of the clean speech signal includes: obtaining values of the power spectrum estimation value of the pure voice signal in two adjacent iterations in the iterative computation; determining a parameter value of a frequency band according to values of two adjacent iterations of the power spectrum estimation value of the pure voice signal; and determining a masking threshold according to the parameter value of the frequency band and the power spectrum estimation value of the noise signal.
Iterative computation repeats a group of instructions or a step, evaluating an iterative relation under a limiting iteration condition or number of iterations to obtain an iteration variable. The values of the power spectrum estimation value of the pure voice signal in two adjacent iterations can be taken from the iterative calculation according to the period of the power spectrum estimation value of the pure voice signal. The parameter value of a frequency band is an estimate expressing the proportion of the pure voice signal in the noisy voice signal; it may be an empirical value obtained from a number of tests, or may be calculated from a relational expression using the values of the power spectrum estimation value of the pure voice signal in two adjacent iterations. It will be appreciated that when a plurality of tones are present simultaneously in the noisy speech signal, a masking effect occurs when the high tones completely mask the low tones. In addition, the masking threshold may be determined in various ways: for example, it may be output by a neural network model, or calculated from a relational expression.
In this embodiment, optionally, determining the parameter value of the frequency band according to the values of two adjacent iterations of the power spectrum estimation value of the clean speech signal includes:
determining the parameter values of the frequency bands using the following formula:
Figure BDA0003310981400000081
where j is the iteration index, i is the frequency band index,
Figure BDA0003310981400000082
is the power spectrum estimate of the clean speech signal in the i-th frequency band in the j-th iteration,
Figure BDA0003310981400000083
is the power spectrum estimate of the clean speech signal in the i-th frequency band in the (j-1)-th iteration, and $\alpha_j(i)$ is the parameter value for the i-th frequency band in the j-th iteration.
It can be understood that the parameter value is calculated from the values of the power spectrum estimation value of the pure voice signal in two adjacent iterations. The advantage of this arrangement is that parameter values $\alpha_j(i)$ can be obtained for the different frequency bands and used to calculate the masking threshold of each frequency band separately, making the resulting masking thresholds more accurate.
In this embodiment, optionally, determining the masking threshold according to the parameter value of the frequency band and the estimated power spectrum value of the noise signal includes:
the masking threshold is determined using the following equation:
Figure BDA0003310981400000084
where $N_j(i)$ is the power spectrum estimate of the noise signal in the i-th frequency band in the j-th iteration, $N_{j-1}(i)$ is the power spectrum estimate of the noise signal in the i-th frequency band in the (j-1)-th iteration, and $T_j(i)$ is the masking threshold for the i-th frequency band in the j-th iteration.
It is understood that, in the above formula, the initial value of the masking threshold $T_j(i)$ is set to
Figure BDA0003310981400000091
Different frequency bands are divided by using human auditory effect, and then masking thresholds of different frequency bands are calculated, so that noise signals of various frequency bands can be inhibited according to the masking thresholds of different frequency bands, and the signal-to-noise ratio of the voice signal to be enhanced is improved.
S140, determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced.
The enhanced power spectrum estimation value of the pure voice signal is the power spectrum estimation value obtained after the noise signal in the voice signal to be enhanced has been effectively suppressed. It can be determined in more than one way, for example by spectral subtraction, by Wiener filtering, or by means of a masking threshold.
In this embodiment, optionally, the determining an enhanced power spectrum estimation value of the clean speech signal according to the masking threshold and the power spectrum of the speech signal to be enhanced includes:
determining a ratio parameter according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and determining an enhanced power spectrum estimation value of the pure voice signal according to the ratio parameter and the power spectrum of the voice signal to be enhanced.
The ratio parameter is the proportion of the voice signal to be enhanced that is accounted for by the enhanced power spectrum estimation value of the pure voice signal.
In this embodiment, optionally, the following formula is used to determine the enhanced power spectrum estimation value of the clean speech signal:
Figure BDA0003310981400000092
where
Figure BDA0003310981400000101
is the enhanced power spectrum estimate of the clean speech signal, $P_y(\omega_k)$ is the power of the speech signal to be enhanced at the k-th frequency point, $\alpha(\omega_k)$ is the parameter value at the k-th frequency point, $\mu(\omega_k)$ is the adjustment value at the k-th frequency point, and
Figure BDA0003310981400000102
is the ratio parameter.
It can be understood that the parameter value $\alpha(\omega_k)$ of the k-th frequency point is an adaptive adjustment parameter for the enhanced power spectrum estimation value of the pure voice signal and can be obtained from an empirical value or a calculation relation. The adjustment value $\mu(\omega_k)$ of the k-th frequency point is a parameter that adjusts the output enhanced power spectrum estimate
Figure BDA0003310981400000103
of the clean speech signal as required. Adjusting the speech signal to be enhanced with a Wiener-type function solves the problem of spectrum loss caused when conventional algorithms such as spectral subtraction and Wiener filtering over-suppress the speech signal to be enhanced. It can strengthen the noise suppression result and improve the signal-to-noise ratio of the speech signal to be enhanced and the speech recognition effect, which benefits both subjective listening and applications such as subsequent speech recognition.
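The embodiment's own formula is contained in a figure that is not reproduced in this text, so the following is not that formula. It is a sketch of a generic parametric Wiener-type gain, G = (P_s / (P_s + μ·N))^α with P_s = max(P_y − N, 0), given only to illustrate how a per-bin exponent α and adjustment value μ can shape a ratio that scales the noisy power. The noise power input, the function names, and the sample values are all assumptions.

```python
def wiener_type_gain(noisy_power, noise_power, alpha=1.0, mu=1.0):
    """Generic parametric Wiener-type gain for one frequency point.

    Assumed form (not the patent's formula): (Ps / (Ps + mu * N)) ** alpha,
    where Ps is the spectral-subtraction estimate of the clean power.
    """
    clean = max(noisy_power - noise_power, 0.0)
    denom = clean + mu * noise_power
    if denom <= 0.0:
        return 0.0  # no signal and no noise: nothing to keep
    return (clean / denom) ** alpha


def enhance_power(noisy_power, noise_power, alpha, mu):
    """Scale each bin of the noisy power spectrum by its gain."""
    return [
        wiener_type_gain(p, n, a, m) * p
        for p, n, a, m in zip(noisy_power, noise_power, alpha, mu)
    ]


# Usage: a larger mu suppresses the second bin more strongly than the first.
enhanced = enhance_power([4.0, 4.0], [1.0, 1.0], alpha=[1.0, 1.0], mu=[1.0, 3.0])
```

With α = μ = 1 the gain reduces to the classical Wiener ratio; raising μ or α increases suppression.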
In this embodiment, optionally, the parameter value of the k-th frequency point in the iterative process is determined by using the following formula:
Figure BDA0003310981400000104
The parameter value of each frequency point is obtained using the human auditory masking threshold, so that the enhanced power spectrum estimation value of the pure voice signal output according to the parameter value is more accurate.
S150, determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal.
The enhanced voice signal is a voice signal with a high signal-to-noise ratio obtained by applying signal processing to the voice signal to be enhanced. It may be obtained by performing an IFFT (Inverse Fast Fourier Transform) on the phase spectrum of the speech signal to be enhanced and the enhanced power spectrum estimation value of the clean speech signal.
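A minimal sketch of this reconstruction is shown below: the magnitude of each bin is the square root of the enhanced power, it is combined with the phase of the noisy signal, and the result is transformed back to the time domain. A naive inverse DFT from the standard library stands in for the IFFT, and the function name `reconstruct_frame` and the sample spectrum are illustrative assumptions.

```python
import cmath
import math


def reconstruct_frame(enhanced_power, phase):
    """Rebuild a time-domain frame from an enhanced power spectrum and a phase spectrum."""
    n = len(enhanced_power)
    # Magnitude is the square root of power; the phase is reused from the noisy signal.
    spectrum = [cmath.rect(math.sqrt(p), ph) for p, ph in zip(enhanced_power, phase)]
    # Naive inverse DFT; the real part is kept because the frame is real-valued.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# Usage: power 16 in bins 1 and 7 with zero phase gives one cosine cycle over 8 samples.
enhanced_power = [0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0]
phase = [0.0] * 8
frame = reconstruct_frame(enhanced_power, phase)
```

When frames overlap, the reconstructed frames would additionally need to be windowed and overlap-added, which this sketch omits.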
According to the technical scheme of the embodiment of the invention, the power spectrum and the phase spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal in the voice signal to be enhanced are obtained, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal; determining a power spectrum estimation value of a pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal; determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal; determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced; and determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal. The embodiment of the invention obtains the enhanced voice signal by determining the masking threshold values of different frequency bands of the voice signal to be enhanced, can enhance the noise suppression result, and improves the signal-to-noise ratio of the voice signal to be enhanced and the voice recognition effect.
Embodiment Two
Fig. 2 is a schematic structural diagram of a speech enhancement apparatus according to a second embodiment of the present invention, which is applicable to speech enhancement and speech recognition. As shown in fig. 2, the apparatus includes:
a parameter obtaining module 210, configured to obtain a power spectrum and a phase spectrum of a speech signal to be enhanced, and a power spectrum estimation value of a noise signal in the speech signal to be enhanced, where the speech signal to be enhanced includes a clean speech signal and a noise signal;
a pure voice power spectrum estimation value determining module 220, configured to determine a power spectrum estimation value of the pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal;
a masking threshold determining module 230, configured to determine masking thresholds in different frequency bands according to the power spectrum estimation value of the clean speech signal;
a pure voice enhanced power spectrum value determining module 240, configured to determine an enhanced power spectrum estimation value of the pure speech signal according to the masking threshold and the power spectrum of the speech signal to be enhanced;
and an enhanced speech signal determining module 250, configured to determine an enhanced speech signal according to the phase spectrum of the speech signal to be enhanced and the enhanced power spectrum estimation value of the clean speech signal.
The voice enhancement device provided by the embodiment of the invention firstly obtains a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal; determining a power spectrum estimation value of the pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal; determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal; determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced; and finally, determining the enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal. The embodiment of the invention obtains the enhanced voice signal by determining the masking threshold values of different frequency bands of the voice signal to be enhanced, can enhance the noise suppression result, and improves the signal-to-noise ratio of the voice signal to be enhanced and the voice recognition effect.
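As a rough illustration of the inputs the second module works with, the sketch below estimates a noise power spectrum and subtracts it from the power spectrum of the speech signal to be enhanced. The text above does not say how either estimate is computed, so both choices here are assumptions: averaging frames presumed to contain only noise, and clamped power spectral subtraction. The names `estimate_noise_power`, `estimate_clean_power`, and `floor` are hypothetical.

```python
def estimate_noise_power(noise_only_frames):
    """Average the power spectra of frames assumed to contain only noise.

    noise_only_frames: list of per-frame power spectra (equal length).
    This averaging rule is an assumption, not taken from the patent text.
    """
    if not noise_only_frames:
        raise ValueError("at least one noise-only frame is required")
    count = len(noise_only_frames)
    bins = len(noise_only_frames[0])
    return [sum(frame[k] for frame in noise_only_frames) / count for k in range(bins)]


def estimate_clean_power(noisy_power, noise_power, floor=0.0):
    """Per-bin power spectral subtraction, clamped at `floor`.

    Subtraction is one common way to obtain a clean-speech power estimate;
    the patent text does not confirm that it uses this rule.
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    return [max(p_y - p_n, floor) for p_y, p_n in zip(noisy_power, noise_power)]


# Example: two noise-only frames, then one noisy speech frame.
noise = estimate_noise_power([[1.0, 2.0], [3.0, 4.0]])
clean = estimate_clean_power([5.0, 1.0], noise)
```

The clamp matters because subtraction can produce negative power in bins where the noise estimate exceeds the observed power.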
Further, the masking threshold determining module 230 includes:
an iteration value acquisition unit, configured to obtain the values of the power spectrum estimation value of the clean speech signal in two adjacent iterations of the iterative computation;
a parameter value determining unit, configured to determine the parameter value of each frequency band according to the values of the power spectrum estimation value of the clean speech signal in the two adjacent iterations;
and a masking threshold determining unit, configured to determine the masking threshold according to the parameter value of the frequency band and the power spectrum estimation value of the noise signal.
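The data flow through these three units can be sketched as below. The patent gives the actual formulas only as figures that are not reproduced in this text, so the sketch takes them as caller-supplied functions: `param_fn` and `threshold_fn` are hypothetical placeholders, and the two stand-in lambdas in the example are purely illustrative and are not the patented formulas.

```python
def masking_thresholds(clean_prev, clean_curr, noise_prev, noise_curr,
                       param_fn, threshold_fn):
    """Compute one masking threshold per frequency band.

    clean_prev, clean_curr: per-band clean-speech power estimates from
        iterations j-1 and j.
    noise_prev, noise_curr: per-band noise power estimates from
        iterations j-1 and j.
    param_fn(prev, curr) -> band parameter value (placeholder formula).
    threshold_fn(alpha, n_prev, n_curr) -> masking threshold (placeholder).
    """
    lengths = {len(clean_prev), len(clean_curr), len(noise_prev), len(noise_curr)}
    if len(lengths) != 1:
        raise ValueError("all inputs must cover the same number of bands")
    thresholds = []
    for i in range(len(clean_curr)):
        alpha = param_fn(clean_prev[i], clean_curr[i])
        thresholds.append(threshold_fn(alpha, noise_prev[i], noise_curr[i]))
    return thresholds


# Illustrative stand-ins only; the real formulas are in the patent figures.
example_param = lambda prev, curr: curr / prev if prev > 0 else 1.0
example_threshold = lambda alpha, n_prev, n_curr: alpha * n_curr

example = masking_thresholds(
    [2.0, 4.0], [1.0, 4.0],   # clean-speech estimates, iterations j-1 and j
    [1.0, 1.0], [0.5, 2.0],   # noise estimates, iterations j-1 and j
    example_param, example_threshold,
)
```

Passing the formulas in as functions keeps the control flow separate from the formulas themselves, so the real expressions can be substituted without touching the loop.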
Further, the parameter value determining unit is specifically configured to:
determining the parameter values of the frequency bands using the following formula:
Figure BDA0003310981400000131
where j is the iteration index, i is the frequency band index,
Figure BDA0003310981400000132
is the power spectrum estimate of the clean speech signal in the i-th frequency band in the j-th iteration,
Figure BDA0003310981400000133
is the power spectrum estimate of the clean speech signal in the i-th frequency band in the (j-1)-th iteration, and $\alpha_j(i)$ is the parameter value for the i-th frequency band in the j-th iteration.
Further, the masking threshold determining unit is specifically configured to:
the masking threshold is determined using the following equation:
Figure BDA0003310981400000134
where $N_j(i)$ is the power spectrum estimate of the noise signal in the i-th frequency band in the j-th iteration, $N_{j-1}(i)$ is the power spectrum estimate of the noise signal in the i-th frequency band in the (j-1)-th iteration, and $T_j(i)$ is the masking threshold for the i-th frequency band in the j-th iteration.
Further, the pure tone enhanced power spectrum value determining module 240 includes:
a ratio parameter determining unit, configured to determine a ratio parameter according to the masking threshold and the power spectrum of the speech signal to be enhanced;
and a pure tone enhanced power spectrum value determining unit, configured to determine an enhanced power spectrum estimation value of the clean speech signal according to the ratio parameter and the power spectrum of the speech signal to be enhanced.
Further, the pure tone enhanced power spectrum value determining unit is specifically configured to:
determining an enhanced power spectrum estimate for the clean speech signal using the following equation:
Figure BDA0003310981400000135
where
Figure BDA0003310981400000136
is the enhanced power spectrum estimate of the clean speech signal, $P_y(\omega_k)$ is the power of the speech signal to be enhanced at the k-th frequency point, $\alpha(\omega_k)$ is the parameter value at the k-th frequency point, $\mu(\omega_k)$ is the adjustment value at the k-th frequency point, and
Figure BDA0003310981400000137
is the ratio parameter.
Further, the parameter value of the kth frequency point in the iteration process is determined by adopting the following formula:
Figure BDA0003310981400000141
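The text indexes some quantities by frequency band (i) and others by frequency point (k), so an implementation needs a mapping between the two. The sketch below shows one simple way to do that; the band layout is an assumption, since the patent text does not state how bins are grouped into bands. `band_edges`, `bins_to_bands`, and `bands_to_bins` are hypothetical names.

```python
def bins_to_bands(bin_power, band_edges):
    """Sum per-bin power into bands.

    band_edges is an increasing list of bin indices; band i covers bins
    band_edges[i] up to (but not including) band_edges[i + 1].
    """
    return [sum(bin_power[lo:hi]) for lo, hi in zip(band_edges, band_edges[1:])]


def bands_to_bins(band_values, band_edges):
    """Give every frequency point the value of the band that contains it."""
    if len(band_values) != len(band_edges) - 1:
        raise ValueError("need exactly one value per band")
    bin_values = []
    for value, lo, hi in zip(band_values, band_edges, band_edges[1:]):
        bin_values.extend([value] * (hi - lo))
    return bin_values


# Example: five frequency points grouped into two bands (bins 0-1 and 2-4).
edges = [0, 2, 5]
band_power = bins_to_bands([1.0, 2.0, 3.0, 4.0, 5.0], edges)
per_bin = bands_to_bins([0.1, 0.2], edges)
```

With such a mapping, a per-band masking threshold can be expanded to every frequency point before a per-point parameter is computed.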
The speech enhancement apparatus provided by this embodiment can execute the speech enhancement method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method.
Example three
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic device includes a processor 310, a memory 320, an input device 330, and an output device 340. The device may include one or more processors 310; fig. 3 takes one processor 310 as an example. The processor 310, the memory 320, the input device 330, and the output device 340 may be connected by a bus or by other means; fig. 3 takes a bus connection as an example.
The memory in the apparatus, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the speech enhancement method in embodiments of the present invention (e.g., parameter acquisition module 210, pure-tone power spectrum estimate determination module 220, masking threshold determination module 230, pure-tone enhanced power spectrum value determination module 240, and enhanced speech signal determination module 250). The processor 310 implements the above-described voice enhancement method by executing software programs, instructions, and modules stored in the memory 320 to perform various functional applications of the device and data processing.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus. The output device 340 may include a display device such as a display screen.
And, when the one or more programs included in the above electronic device are executed by the one or more processors 310, the programs perform the following operations:
acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal;
determining a power spectrum estimation value of a pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal;
determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal;
determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal.
The speech enhancement apparatus, the electronic device, and the storage medium provided in the above embodiments can execute the speech enhancement method provided in any embodiment of the present application, and have the functional modules and beneficial effects corresponding to that method. For technical details not described in the above embodiments, reference may be made to the speech enhancement method provided in any embodiment of the present application.
Example four
A fourth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the speech enhancement method according to any embodiment of the present invention, the method including:
acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal;
determining a power spectrum estimation value of a pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal;
determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal;
determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal.
Optionally, the program, when executed by a processor, may be further adapted to perform a speech enhancement method as provided by any of the embodiments of the invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-only Memory (ROM), an Erasable Programmable Read-only Memory (EPROM), a flash Memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take a variety of forms, including, but not limited to: an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that, in the embodiment of the speech enhancement apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method of speech enhancement, comprising:
acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal;
determining a power spectrum estimation value of a pure voice signal according to the power spectrum of the voice signal to be enhanced and the power spectrum estimation value of the noise signal;
determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal;
determining an enhanced power spectrum estimation value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal.
2. The method of claim 1, wherein determining masking thresholds at different frequency bands from the power spectrum estimate of the clean speech signal comprises:
obtaining values of the power spectrum estimation value of the pure voice signal in two adjacent iterations in the iterative computation;
determining a parameter value of a frequency band according to values of two adjacent iterations of the power spectrum estimation value of the pure voice signal;
and determining a masking threshold according to the parameter value of the frequency band and the power spectrum estimation value of the noise signal.
3. The method of claim 2, wherein determining the parameter value of the frequency band according to the values of two adjacent iterations of the power spectrum estimation value of the clean speech signal comprises:
determining the parameter values of the frequency bands using the following formula:
Figure FDA0003310981390000011
where j is the iteration index, i is the frequency band index,
Figure FDA0003310981390000012
is the power spectrum estimate of the clean speech signal in the i-th frequency band in the j-th iteration,
Figure FDA0003310981390000021
is the power spectrum estimate of the clean speech signal in the i-th frequency band in the (j-1)-th iteration, and $\alpha_j(i)$ is the parameter value for the i-th frequency band in the j-th iteration.
4. The method of claim 3, wherein determining the masking threshold based on the parameter values for the frequency band and the power spectrum estimate of the noise signal comprises:
the masking threshold is determined using the following equation:
Figure FDA0003310981390000022
where $N_j(i)$ is the power spectrum estimate of the noise signal in the i-th frequency band in the j-th iteration, $N_{j-1}(i)$ is the power spectrum estimate of the noise signal in the i-th frequency band in the (j-1)-th iteration, and $T_j(i)$ is the masking threshold for the i-th frequency band in the j-th iteration.
5. The method of claim 1, wherein determining an enhanced power spectrum estimate for a clean speech signal based on the masking threshold and the power spectrum of the speech signal to be enhanced comprises:
determining a ratio parameter according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and determining an enhanced power spectrum estimation value of the pure voice signal according to the ratio parameter and the power spectrum of the voice signal to be enhanced.
6. The method of claim 5, wherein the enhanced power spectrum estimate of the clean speech signal is determined using the following equation:
Figure FDA0003310981390000023
where
Figure FDA0003310981390000024
is the enhanced power spectrum estimate of the clean speech signal,
Figure FDA0003310981390000025
is the power of the speech signal to be enhanced at the k-th frequency point, $\alpha(\omega_k)$ is the parameter value at the k-th frequency point, $\mu(\omega_k)$ is the adjustment value at the k-th frequency point, and
Figure FDA0003310981390000026
is the ratio parameter.
7. The method according to claim 6, wherein the parameter value of the k-th frequency point is determined by using the following formula:
Figure FDA0003310981390000031
where $N(i)$ is the power spectrum estimate of the noise signal in the i-th frequency band, and $T_j(i)$ is the masking threshold for the i-th frequency band in the j-th iteration.
8. A speech enhancement apparatus, comprising:
the parameter acquisition module is used for acquiring a power spectrum and a phase spectrum of a voice signal to be enhanced and a power spectrum estimation value of a noise signal in the voice signal to be enhanced, wherein the voice signal to be enhanced comprises a pure voice signal and a noise signal;
the pure tone power spectrum estimation value determining module is used for determining a power spectrum estimation value of the clean speech signal according to the power spectrum of the speech signal to be enhanced and the power spectrum estimation value of the noise signal;
the masking threshold determining module is used for determining masking thresholds under different frequency bands according to the power spectrum estimation value of the pure voice signal;
the pure tone enhanced power spectrum value determining module is used for determining an enhanced power spectrum estimated value of the pure voice signal according to the masking threshold and the power spectrum of the voice signal to be enhanced;
and the enhanced voice signal determining module is used for determining an enhanced voice signal according to the phase spectrum of the voice signal to be enhanced and the enhanced power spectrum estimation value of the pure voice signal.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech enhancement method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the speech enhancement method according to any one of claims 1-7.
CN202111216471.2A 2021-10-19 2021-10-19 Voice enhancement method and device, electronic equipment and storage medium Pending CN113963710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111216471.2A CN113963710A (en) 2021-10-19 2021-10-19 Voice enhancement method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111216471.2A CN113963710A (en) 2021-10-19 2021-10-19 Voice enhancement method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113963710A true CN113963710A (en) 2022-01-21

Family

ID=79464845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111216471.2A Pending CN113963710A (en) 2021-10-19 2021-10-19 Voice enhancement method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113963710A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination