CN117995215B

CN117995215B - Voice signal processing method and device, computer equipment and storage medium

Info

Publication number: CN117995215B
Application number: CN202410399812.1A
Authority: CN
Inventors: 苑亚南; 刘福平
Original assignee: Shenzhen Aitushi Innovation Technology Co ltd
Current assignee: Shenzhen Aitushi Innovation Technology Co ltd
Priority date: 2024-04-03
Filing date: 2024-04-03
Publication date: 2024-06-18
Anticipated expiration: 2044-04-03
Also published as: CN117995215A

Abstract

The application discloses a processing method, a device, a computer device and a storage medium of a voice signal, wherein the method comprises the following steps: acquiring a first frequency spectrum of a target voice signal frame; taking the preset noise spectrum as a first noise spectrum of the first frequency spectrum; filtering the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum; determining whether voice information exists in the second frequency spectrum; if the second frequency spectrum does not have voice information, performing noise spectrum estimation processing on the second frequency spectrum to obtain a second noise spectrum, taking the second noise spectrum as a first noise spectrum, taking the second frequency spectrum as a first frequency spectrum, and returning to execute filtering processing on the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum; and if the voice information exists in the second frequency spectrum, taking the second frequency spectrum as a target frequency spectrum of the target voice signal frame, and determining the processed signal frame of the target voice signal frame based on the target frequency spectrum. The embodiment of the application can strengthen the suppression of non-stationary noise and improve the effect of voice enhancement.

Description

Voice signal processing method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of speech signals, and in particular, to a method and apparatus for processing a speech signal, a computer device, and a storage medium.

Background

Speech enhancement refers to the process of improving the quality, clarity and intelligibility of speech signals by means of signal processing techniques and algorithms. Its goal is to reduce the impact of interference factors such as noise and reverberation on the speech signal, making the speech easier to understand and recognize in noisy environments. Speech enhancement is widely applied in fields such as communication systems, speech recognition, audio recording, voice control and voice assistive devices, and plays an important role in improving the audio experience of users.

A related speech enhancement algorithm is spectral subtraction, which is simple to implement and computationally efficient. However, at low signal-to-noise ratios spectral subtraction may introduce distortion, and it suppresses non-stationary noise poorly, so the speech enhancement effect is not ideal.

Disclosure of Invention

The embodiment of the application provides a processing method, a processing device, computer equipment and a storage medium for voice signals, which aim to strengthen the suppression of non-stationary noise and improve the voice enhancement effect.

In a first aspect, an embodiment of the present application provides a method for processing a voice signal, where the method includes:

acquiring a first frequency spectrum of a target voice signal frame to be processed;

taking a preset noise spectrum as a first noise spectrum of the first frequency spectrum;

filtering the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

determining whether voice information exists in the second frequency spectrum;

if no voice information exists in the second frequency spectrum, performing noise spectrum estimation processing on the second frequency spectrum to obtain a second noise spectrum, taking the second noise spectrum as the first noise spectrum, taking the second frequency spectrum as the first frequency spectrum, and returning to execute the filtering processing on the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

and if voice information exists in the second frequency spectrum, taking the second frequency spectrum as a target frequency spectrum of the target voice signal frame, and determining the processed signal frame of the target voice signal frame based on the target frequency spectrum.

In a second aspect, the present application also provides a device for processing a speech signal, the device comprising:

an acquisition unit, configured to acquire a first spectrum of a target speech signal frame to be processed, and to take a preset noise spectrum as a first noise spectrum of the first spectrum;

a filtering unit, configured to filter the first spectrum based on the first noise spectrum to obtain a second spectrum;

a determining unit, configured to determine whether voice information exists in the second spectrum;

a first execution unit, configured to, if no voice information exists in the second spectrum, perform noise spectrum estimation processing on the second spectrum to obtain a second noise spectrum, take the second noise spectrum as the first noise spectrum, take the second spectrum as the first spectrum, and return to the filtering of the first spectrum based on the first noise spectrum to obtain a second spectrum;

and a second execution unit, configured to, if voice information exists in the second spectrum, take the second spectrum as a target spectrum of the target speech signal frame and determine the processed signal frame of the target speech signal frame based on the target spectrum.

In a third aspect, the present application also provides a computer device comprising:

one or more processors;

a memory; and

one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the method of processing a speech signal of any of the above.

In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program to be loaded by a processor for performing the steps of the method of processing a speech signal of any of the above.

The beneficial effects of the application are as follows: according to the method and device for processing a speech signal, the computer equipment and the storage medium, when no voice information exists in the second frequency spectrum, the second noise spectrum is taken as the first noise spectrum, the second frequency spectrum is taken as the first frequency spectrum, and the filtering of the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum is executed again, thereby realizing cyclic filtering of the frequency spectrum, strengthening the suppression of non-stationary noise and improving the speech enhancement effect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of an embodiment of a method for processing a speech signal according to an embodiment of the present application;

FIG. 2 is a flowchart of another embodiment of a method for processing a voice signal according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a graph of global masking thresholds provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an embodiment of a processing apparatus for voice signals according to the present application;

FIG. 5 is a schematic structural diagram of an embodiment of a computer device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more features. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

In the present application, the term "exemplary" is used to mean "serving as an example, instance, or illustration." Any embodiment described as "exemplary" in this disclosure is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the application. In the following description, details are set forth for purposes of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes have not been described in detail so as not to obscure the description of the application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The embodiment of the application provides a voice signal processing method, a voice signal processing device, computer equipment and a storage medium. The following will describe in detail.

As shown in FIG. 1, which is a flowchart of an embodiment of a method for processing a speech signal according to an embodiment of the present application, the method may include:

11. Acquiring a first frequency spectrum of a target voice signal frame to be processed;

In an embodiment of the present application, the target speech signal frame to be processed may be a speech signal frame that contains noise. The frame duration of a speech signal frame is generally a first preset duration, which may be, for example, 25 milliseconds. To generate the target speech signal frame, the complete target speech signal to be processed is segmented using the first preset duration as the frame length and a second preset duration as the frame shift, yielding multiple speech signal frames ordered in speech sequence (for example, x1, x2, ..., xn); at least one of these frames is taken as the target speech signal frame to be processed. The complete target speech signal to be processed is generally a single-channel speech signal; the second preset duration is shorter than the first preset duration and may be, for example, 10 milliseconds.
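The segmentation described above can be sketched as follows. This is only an illustrative sketch: the function name `split_into_frames`, the 16 kHz sample rate and the choice to drop trailing samples that do not fill a whole frame are assumptions, not details given in the application.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a single-channel signal into overlapping frames.

    frame_ms is the first preset duration (frame length) and shift_ms is the
    second preset duration (frame shift); the defaults are the example values
    from the text. Dropping the incomplete trailing frame is an assumption.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if shift <= 0 or shift >= frame_len:
        raise ValueError("frame shift must be positive and shorter than the frame")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# Illustrative input: one second of silence at an assumed 16 kHz sample rate.
signal = [0.0] * 16000
frames = split_into_frames(signal, 16000)
```

With these assumed values each frame holds 400 samples and consecutive frames start 160 samples apart.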

The spectrum is short for the frequency spectral density; the first spectrum is the graph formed by arranging the amplitudes in the target speech signal frame by frequency. In some embodiments of the present application, acquiring the first spectrum of the target speech signal frame to be processed may include: applying a Hamming window to the target speech signal frame and then performing an FFT (fast Fourier transform) to obtain the first spectrum.
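A minimal sketch of the windowing and transform step is given below. To stay self-contained it uses a naive discrete Fourier transform, which yields the same values as an FFT but more slowly; the function names and the common 0.54/0.46 Hamming coefficients are assumptions of this sketch.

```python
import cmath
import math


def hamming_window(n):
    """Hamming coefficients w[i] = 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x):
    """Naive O(n^2) discrete Fourier transform (an FFT gives the same result)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def first_spectrum(frame):
    """Apply a Hamming window to the frame, then transform it."""
    window = hamming_window(len(frame))
    return dft([s * w for s, w in zip(frame, window)])
```

For frames of realistic length a library FFT routine would normally replace `dft`.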

12. Taking the preset noise spectrum as a first noise spectrum of the first frequency spectrum;

In the embodiment of the application, the noise spectrum refers to a noise signal included in a frequency spectrum, and since the first noise spectrum of the first frequency spectrum cannot be determined immediately after the first frequency spectrum of the target speech signal frame to be processed is acquired, the preset noise spectrum is first used as the first noise spectrum of the first frequency spectrum to execute the subsequent steps.

The preset noise spectrum may be set in advance according to actual conditions; for example, it may be extracted from the complete target speech signal to be processed. Specifically, before taking the preset noise spectrum as the first noise spectrum of the first spectrum, the method may further include: acquiring the first preset number of frames, as a plurality of target frames, from the multi-frame speech signal frames that include the target speech signal frame and are ordered in speech sequence, wherein the multi-frame speech signal frames are obtained by segmenting the complete target speech signal, the preset number may take a value in the range [4,7], and the plurality of target frames do not include the target speech signal frame; acquiring first spectra of the plurality of target frames, for example by applying a Hamming window to each of the target frames and then performing an FFT; and averaging the first spectra of the plurality of target frames to obtain the preset noise spectrum, where the averaging may be, for example, averaging the values of the frequency points at the same position in the first spectra of the plurality of target frames. It will be appreciated that, since the initial stage of the complete target speech signal generally does not include speech, the preset noise spectrum obtained from the first preset number of target frames generally includes only noise and no speech; taking the preset noise spectrum as the first noise spectrum of the first spectrum therefore makes the first noise spectrum conform more closely to the real noise.
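The averaging over the leading frames might be sketched as below. The text does not state whether complex values, magnitudes or powers are averaged; this sketch averages the per-bin power so that the result can be used directly as the noise term in the signal-to-noise formulas, which is an assumption, as is the function name.

```python
def preset_noise_spectrum(leading_spectra):
    """Average the per-bin power of the leading frames.

    leading_spectra holds the first spectra of the leading target frames,
    which are assumed to contain noise only. Averaging power rather than
    magnitude is an assumption of this sketch.
    """
    if not leading_spectra:
        raise ValueError("at least one leading frame is required")
    num_bins = len(leading_spectra[0])
    count = len(leading_spectra)
    return [sum(abs(spec[k]) ** 2 for spec in leading_spectra) / count
            for k in range(num_bins)]
```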

13. Filtering the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

In the embodiment of the application, since the first noise spectrum is preliminarily determined, the first spectrum is subjected to filtering processing based on the first noise spectrum, and a part including the first noise spectrum in the first spectrum can be filtered out to obtain a second spectrum with less noise.

In some embodiments of the present application, filtering the first spectrum based on the first noise spectrum to obtain a second spectrum may include: acquiring a target noise spectrum of the previous frame of the target speech signal frame and a target spectrum of the previous frame; determining a posterior signal-to-noise ratio of the first spectrum based on the target noise spectrum of the previous frame; determining a prior signal-to-noise ratio of the first spectrum based on the target spectrum of the previous frame, the posterior signal-to-noise ratio and the first noise spectrum; determining a first signal gain function of the first spectrum according to the posterior signal-to-noise ratio and the prior signal-to-noise ratio; and filtering the first spectrum based on the first signal gain function to obtain the second spectrum. Here, the target spectrum of the previous frame is the second spectrum of the previous frame in which speech information is present, and is determined in the same way as the target spectrum of the target speech signal frame described below; in particular, if the previous frame is one of the first preset number of target frames described above, the first spectrum obtained by applying a Hamming window and an FFT to the previous frame is used directly as its target spectrum. Likewise, the target noise spectrum of the previous frame is the first noise spectrum of the previous frame at the time speech information is present in its second spectrum, and is determined in the same way as the target noise spectrum of the target speech signal frame described below; in particular, if the previous frame is one of the first preset number of target frames, the preset noise spectrum is used as its target noise spectrum.

The formula of the posterior signal-to-noise ratio of the first spectrum is as follows:

$$\gamma(k,m)=\frac{|X(k,m)|^{2}}{\lambda_d(k,m-1)}$$

where $\gamma(k,m)$ is the posterior signal-to-noise ratio of the first spectrum, $X(k,m)$ is the first spectrum of the target speech signal frame, $k$ is the frequency point in the first spectrum of the target speech signal frame, $m$ is the frame number of the target speech signal frame, and $\lambda_d(k,m-1)$ is the target noise spectrum of the previous frame.
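As a small worked sketch, the posterior signal-to-noise ratio can be evaluated per frequency point as follows. The function name and the small `floor` that guards against division by zero are assumptions added for illustration.

```python
def posterior_snr(spectrum, prev_noise, floor=1e-12):
    """gamma(k, m) = |X(k, m)|^2 / lambda_d(k, m - 1) for every bin k.

    spectrum is the first spectrum of the current frame and prev_noise is the
    target noise spectrum of the previous frame. The floor is an assumption
    that avoids dividing by zero.
    """
    return [abs(x) ** 2 / max(n, floor) for x, n in zip(spectrum, prev_noise)]
```

For example, a bin with value 3 + 4j and previous noise power 5 gives a posterior signal-to-noise ratio of 25 / 5 = 5.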

The formula for the a priori signal to noise ratio of the first spectrum is illustrated as follows:

$$\zeta_{DD}(k,m)=\beta(m)\frac{|S(k,m-1)|^{2}}{\lambda_d(k,m)}+\bigl(1-\beta(m)\bigr)\max\bigl[\gamma(k,m)-1,\,0\bigr]$$

where $\zeta_{DD}(k,m)$ is the prior signal-to-noise ratio of the first spectrum, $\beta(m)$ is a preset smoothing constant, $S(k,m-1)$ is the estimate of the clean speech spectrum in the previous frame, and $\lambda_d(k,m)$ is the first noise spectrum of the target speech signal frame.
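The prior signal-to-noise ratio formula above can be sketched per frequency point as follows. The default `beta=0.98` is a commonly used value and is an assumption here, since the text only says the constant is preset; the function name and the `floor` guard are likewise illustrative.

```python
def prior_snr(prev_clean, noise, gamma, beta=0.98, floor=1e-12):
    """Decision-directed prior SNR for every bin.

    prev_clean is S(k, m - 1), noise is lambda_d(k, m) and gamma is the
    posterior SNR. beta=0.98 and the floor are assumptions of this sketch.
    """
    return [beta * abs(s) ** 2 / max(n, floor) + (1.0 - beta) * max(g - 1.0, 0.0)
            for s, n, g in zip(prev_clean, noise, gamma)]
```

With `beta=0.5`, a previous clean-speech value of 2, noise power 1 and posterior signal-to-noise ratio 3, the result is 0.5 * 4 + 0.5 * 2 = 3.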

$S(k,m-1)$ may be calculated from the target spectrum of the previous frame by means of a sliding window and a smoothing factor, and the formula is exemplified as follows:

$$S(k,m-1)=\alpha_p S_f(k,m-1)+(1-\alpha_p)S_f(k,m)$$

where $\alpha_p$ is a preset smoothing constant; in the sliding-window computation of $S_f$, $b$ is a Hanning window and $W$ takes a value of 1.

The first signal gain function $GH(k,m)$ of the first spectrum may be the gain of the log-MMSE algorithm, whose formula is exemplified as follows:

$$GH(k,m)=\frac{\zeta_{DD}(k,m)}{1+\zeta_{DD}(k,m)}\exp\left(\frac{1}{2}\int_{v(k,m)}^{\infty}\frac{e^{-t}}{t}\,dt\right)$$

$$v(k,m)=\frac{\gamma(k,m)\,\zeta_{DD}(k,m)}{1+\zeta_{DD}(k,m)}$$

The formula for filtering the first spectrum based on the first signal gain function to obtain the second spectrum $X_{new}(k,m)$ is exemplified as follows:

$$X_{new}(k,m)=GH(k,m)\,X(k,m)$$
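The log-MMSE gain and the filtering step can be sketched as below, reading the integral in the gain as the exponential integral from v(k,m) to infinity of e^(-t)/t, as in the standard log-MMSE estimator. The integral is evaluated with a power series for small arguments and a continued fraction otherwise; the function names, the tolerances and the lower bound `v_floor` are assumptions of this sketch.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x, eps=1e-12, max_iter=200):
    """E1(x) = integral from x to infinity of exp(-t) / t dt, for x > 0."""
    if x <= 0:
        raise ValueError("E1 is evaluated here only for x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{n>=1} (-x)^n / (n * n!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for n in range(1, max_iter):
            term *= -x / n               # term now equals (-x)^n / n!
            total -= term / n
            if abs(term / n) < eps:
                break
        return total
    # Continued fraction, evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x)


def log_mmse_gain(xi, gamma, v_floor=1e-8):
    """GH = xi / (1 + xi) * exp(0.5 * E1(v)), with v = gamma * xi / (1 + xi)."""
    v = max(gamma * xi / (1.0 + xi), v_floor)   # v_floor is an assumption
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))


def filter_spectrum(spectrum, xis, gammas):
    """X_new(k, m) = GH(k, m) * X(k, m) for every bin."""
    return [log_mmse_gain(xi, g) * x for x, xi, g in zip(spectrum, xis, gammas)]
```

Where a numerical library is available, a library routine for the exponential integral could replace `exp_integral_e1`.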

14. determining whether voice information exists in the second frequency spectrum;

In the embodiment of the application, since the change rule of the signal of the voice and the change rule of the noise are different, whether the voice information exists in the second frequency spectrum can be identified. For example, determining whether speech information is present in the second spectrum may include: determining the probability of the presence of speech in the second frequency spectrum based on the posterior signal-to-noise ratio and the prior signal-to-noise ratio; based on the probabilities, it is determined whether speech information is present in the second spectrum.

The formula of the probability $P(H_1|X_{new}(k,m))$ that speech exists in the second spectrum is exemplified as follows:

$$P(H_1|X_{new}(k,m))=\left\{1+(1+\zeta_{opt})\exp\left(-\frac{\zeta_{opt}\,\gamma(k,m)}{1+\zeta_{opt}}\right)\right\}^{-1}$$

where $\zeta_{opt}$ is determined by the value of the prior signal-to-noise ratio $\zeta_{DD}(k,m)$. For example, when $\zeta_{DD}(k,m)>18$ dB, the signal power of the second spectrum is large, so $\zeta_{opt}$ takes the value $\zeta_{DD}(k,m)$; when $\zeta_{DD}(k,m)\le 18$ dB, the signal power of the second spectrum is small, so $\zeta_{opt}$ takes the value 18 dB, which avoids the errors that occur when the signal power of the second spectrum is small.

Wherein determining whether speech information is present in the second spectrum based on the probability may include: when the probability is larger than a preset probability threshold, determining that voice information exists in the second frequency spectrum; and when the probability is smaller than or equal to a preset probability threshold value, determining that no voice information exists in the second frequency spectrum. The predetermined probability threshold may take a value of 0.5, for example.
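The probability and threshold test can be sketched per frequency point as follows. The sketch assumes the probability has the form 1 / (1 + (1 + ζ_opt) · exp(−ζ_opt · γ / (1 + ζ_opt))) and that 18 dB is converted to a linear power ratio as 10^(18/10); both are assumptions, as are the function names. How per-point results are combined into one decision for the whole second spectrum is not specified in the text and is left open here.

```python
import math

XI_FLOOR_DB = 18.0  # example value from the text


def speech_presence_probability(xi_dd, gamma):
    """Probability that speech is present in one frequency bin.

    xi_dd is the prior SNR and gamma the posterior SNR, both as linear
    ratios. Converting 18 dB with 10 ** (dB / 10) is an assumption.
    """
    xi_floor = 10.0 ** (XI_FLOOR_DB / 10.0)
    xi_opt = max(xi_dd, xi_floor)
    exponent = -xi_opt * gamma / (1.0 + xi_opt)
    return 1.0 / (1.0 + (1.0 + xi_opt) * math.exp(exponent))


def has_speech(xi_dd, gamma, threshold=0.5):
    """Compare the probability against the preset threshold (0.5 in the text)."""
    return speech_presence_probability(xi_dd, gamma) > threshold
```

Under these assumptions a bin with a small prior signal-to-noise ratio is flagged as speech once its posterior signal-to-noise ratio exceeds roughly 4.2.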

15. If the second frequency spectrum does not have voice information, performing noise spectrum estimation processing on the second frequency spectrum to obtain a second noise spectrum, taking the second noise spectrum as a first noise spectrum, taking the second frequency spectrum as a first frequency spectrum, and returning to execute filtering processing on the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

In the embodiment of the application, if no speech information is present in the second spectrum, the filtering applied to the first spectrum was not sufficient to filter out the noise. Noise spectrum estimation is therefore performed on the second spectrum to obtain a second noise spectrum, the second noise spectrum is taken as the new first noise spectrum, the second spectrum is taken as the new first spectrum, and the filtering is performed again, realizing cyclic filtering of the spectrum.

In some embodiments of the application, the noise spectrum estimation process may be performed based on the probability that speech is present. Specifically, performing noise spectrum estimation processing on the second spectrum to obtain a second noise spectrum may include: determining the probability of the presence of speech in the second frequency spectrum based on the posterior signal-to-noise ratio and the prior signal-to-noise ratio; and carrying out noise spectrum estimation processing on the second frequency spectrum based on the probability and the target noise spectrum of the previous frame to obtain the second noise spectrum.

The formula for performing noise spectrum estimation processing on the second spectrum based on the probability and the target noise spectrum of the previous frame to obtain the second noise spectrum is exemplified as follows:

$$\lambda_d(k,m)=P(H_1|X_{new}(k,m))\,\lambda_d(k,m-1)+\bigl(1-P(H_1|X_{new}(k,m))\bigr)\,|X(k,m)|^{2}$$
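A per-bin sketch of this noise update is shown below; the function name is hypothetical. The update keeps the previous noise estimate where speech is likely and moves toward the current power where speech is unlikely.

```python
def update_noise_spectrum(prob_speech, prev_noise, spectrum):
    """lambda_d(k, m) = P * lambda_d(k, m - 1) + (1 - P) * |X(k, m)|^2.

    prob_speech holds the per-bin speech presence probability, prev_noise the
    target noise spectrum of the previous frame, and spectrum X(k, m).
    """
    return [p * n + (1.0 - p) * abs(x) ** 2
            for p, n, x in zip(prob_speech, prev_noise, spectrum)]
```

For instance, with probability 1 the previous noise value is kept unchanged, and with probability 0 the estimate becomes the current power of that bin.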

16. And if the voice information exists in the second frequency spectrum, taking the second frequency spectrum as a target frequency spectrum of the target voice signal frame, and determining the processed signal frame of the target voice signal frame based on the target frequency spectrum.

In the embodiment of the application, if speech information is present in the second spectrum, the filtering applied to the first spectrum was sufficient to filter out the noise. The second spectrum is therefore taken as the target spectrum of the target speech signal frame and the first noise spectrum is taken as the target noise spectrum of the target speech signal frame, which ends the cyclic filtering of the spectrum and strikes a good compromise between strengthening noise suppression and weakening musical noise. The processed signal frame of the target speech signal frame may then be determined based on the target spectrum; for example, the target spectrum may be further filtered based on a masking threshold to obtain a processed signal frame with a better speech enhancement effect, as described in detail in the embodiment shown in FIG. 2 below.

According to the method for processing a speech signal provided by the embodiment of the application, when no speech information is present in the second spectrum, the second noise spectrum is taken as the first noise spectrum, the second spectrum is taken as the first spectrum, and the step of filtering the first spectrum based on the first noise spectrum to obtain a second spectrum is executed again. This realizes cyclic filtering of the spectrum, which strengthens the suppression of non-stationary noise, improves the speech enhancement effect, and improves the clarity of the enhanced speech.
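The control flow of steps 13 to 16 can be sketched as a loop. The three processing steps are passed in as functions so that the sketch shows only the cycle itself; the function name `cyclic_filter` and the `max_iterations` cap are assumptions, the cap being added because the text does not state a bound on the number of cycles.

```python
def cyclic_filter(first_spectrum, preset_noise, filter_fn, has_speech_fn,
                  estimate_noise_fn, max_iterations=10):
    """Repeat filtering until speech is detected in the filtered spectrum.

    Returns (target_spectrum, target_noise_spectrum). The iteration cap is an
    assumption of this sketch; when it is reached, the last filtered spectrum
    and the noise spectrum used to produce it are returned.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    spectrum, noise = first_spectrum, preset_noise
    for iteration in range(max_iterations):
        second = filter_fn(spectrum, noise)             # step 13
        if has_speech_fn(second) or iteration == max_iterations - 1:
            return second, noise                        # step 16
        noise = estimate_noise_fn(second)               # step 15: new noise
        spectrum = second                               # step 15: new input


# Toy stand-ins for the real steps, used only to exercise the loop.
halve = lambda spec, noise: [0.5 * x for x in spec]
small_enough = lambda spec: abs(spec[0]) <= 1.0
power = lambda spec: [abs(x) ** 2 for x in spec]
target, target_noise = cyclic_filter([8.0], [1.0], halve, small_enough, power)
```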

As shown in fig. 2, which is a flowchart of another embodiment of a processing method of a speech signal provided in an embodiment of the present application, determining a processed signal frame of a target speech signal frame based on a target spectrum may include:

21. Determining masking threshold values of all frequency points in a target frequency spectrum;

In the embodiment of the application, based on the auditory masking characteristic of the human ear, a signal of a certain frequency point in a target frequency spectrum can generate some masking thresholds at the frequency point where the signal is positioned and the surrounding frequency points, and if the sound pressure level of noise is lower than the thresholds, the noise is masked by voice so that the human ear cannot hear the noise; conversely, if the sound pressure level of the noise is above these thresholds, the noise is not masked and is perceived by the human ear. It is therefore necessary to determine masking thresholds for each bin in the target spectrum.

In some embodiments of the present application, determining the masking threshold for each frequency point in the target spectrum may include: converting the amplitude of each frequency point in the target spectrum into a sound pressure level, and then calculating the masking threshold of the sound pressure level of each frequency point in the target spectrum. The masking threshold may be calculated with the algorithm proposed by Johnston in 1988: a rough estimate of the clean speech signal is first calculated by spectral subtraction and is then substituted into Johnston's algorithm to calculate the masking threshold. The specific calculation process is not repeated here.

22. Determining a probability model of masking each frequency point in the target frequency spectrum by the target noise spectrum based on the masking threshold;

In an embodiment of the present application, the probability model that each frequency point in the target spectrum is masked by the target noise spectrum may be the probability model provided by Johnston. Because the hearing range of the human ear is limited, this probability model may be determined according to the absolute hearing threshold of the human ear and the masking threshold; the absolute hearing threshold of the human ear is shown, for example, by the dotted threshold curve in FIG. 3. Specifically, determining the probability model that each frequency point in the target spectrum is masked by the target noise spectrum according to the absolute hearing threshold of the human ear and the masking threshold may include: mapping the masking threshold to the Bark frequency domain of psychoacoustics, so that the masking of human hearing is considered from the psychoacoustic dimension; normalizing the masking threshold in the Bark frequency domain using a preset psychoacoustic spreading function SF(i, j) to obtain a normalized masking threshold, which is shown by the solid masking-threshold curve in FIG. 3; determining a global masking threshold based on the normalized masking threshold and the absolute hearing threshold of the human ear; and determining, based on the global masking threshold, the probability model that each frequency point in the target spectrum is masked by the target noise spectrum.

The formula of the global masking threshold $T(k,m)$ is exemplified as follows:

$$T(k,m)=\max\bigl(T_{abs}(k),\,10\log_{10}T'(k,m)\bigr)$$

where $T_{abs}(k)$ is the absolute hearing threshold of the human ear and $T'(k,m)$ is the normalized masking threshold.

The formula of the probability model $p(k,m)$ that each frequency point in the target spectrum is masked by the target noise spectrum is exemplified as follows:

$$p(k,m)=1-\exp\left(-\frac{T(k,m)}{\lambda_d(k,m)}\right)$$

where $\lambda_d(k,m)$ is the target noise spectrum of the target speech signal frame.
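The global masking threshold and the masking probability can be sketched per frequency point as follows. The function names, the small floors that guard the logarithm and the division, and the clamping of the probability to the range [0, 1] are assumptions added for numerical safety; they are not stated in the text.

```python
import math


def global_masking_threshold(abs_threshold, normalized_threshold, floor=1e-12):
    """T(k, m) = max(T_abs(k), 10 * log10(T'(k, m))) for every bin."""
    return [max(t_abs, 10.0 * math.log10(max(t_norm, floor)))
            for t_abs, t_norm in zip(abs_threshold, normalized_threshold)]


def masking_probability(global_threshold, noise, floor=1e-12):
    """p(k, m) = 1 - exp(-T(k, m) / lambda_d(k, m)), clamped to [0, 1].

    The clamp is an assumption: it keeps p a valid probability when the
    threshold value is negative.
    """
    probs = []
    for t, n in zip(global_threshold, noise):
        p = 1.0 - math.exp(-t / max(n, floor))
        probs.append(min(max(p, 0.0), 1.0))
    return probs
```

For example, a normalized masking threshold of 100 corresponds to 20 on the logarithmic scale, so the global threshold is 20 unless the absolute hearing threshold at that point is higher.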

23. Determining a second signal gain function of the target spectrum based on the probability model;

In some embodiments of the present application, the probability model may be used to update the first signal gain function to obtain a more accurate second signal gain function. The formula for the second signal gain function is exemplified as follows:

$$GH_{new}(k,m)=\sqrt{GH^{2}(k,m)+\bigl(1-GH^{2}(k,m)\bigr)\,p(k,m)}$$

where $GH_{new}(k,m)$ is the second signal gain function and $GH(k,m)$ is the first signal gain function.

24. Filtering the target frequency spectrum based on the second signal gain function to obtain a third frequency spectrum;

In some embodiments of the application, the formula for the third spectrum $X'(k,m)$ is exemplified as follows:

$$X'(k,m)=GH_{new}(k,m)\,X_{new}(k,m)$$

where $X_{new}(k,m)$ is the target spectrum of the target speech signal frame.
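Steps 23 and 24 can be sketched together as below; the function names are hypothetical. The updated gain moves from the first gain toward 1 as the masking probability rises, so points whose noise is likely masked are attenuated less.

```python
import math


def second_gain(first_gain, mask_prob):
    """GH_new = sqrt(GH^2 + (1 - GH^2) * p) for one frequency bin."""
    return math.sqrt(first_gain ** 2 + (1.0 - first_gain ** 2) * mask_prob)


def third_spectrum(target_spectrum, first_gains, mask_probs):
    """X'(k, m) = GH_new(k, m) * X_new(k, m) for every bin."""
    return [second_gain(g, p) * x
            for x, g, p in zip(target_spectrum, first_gains, mask_probs)]
```

With a masking probability of 0 the second gain equals the first gain, and with a masking probability of 1 it equals 1.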

25. And determining the processed signal frame of the target voice signal frame according to the third frequency spectrum.

In some embodiments of the present application, the processed signal frame of the target speech signal frame may be obtained by superimposing the third spectrum onto the target speech signal frame. In further embodiments of the present application, determining the processed signal frame of the target speech signal frame according to the third spectrum may include: acquiring the phase information $\angle X(k,m)$ of the target speech signal frame; performing phase compensation processing on the third spectrum based on the phase information to obtain a fourth spectrum; and determining the processed signal frame of the target speech signal frame according to the fourth spectrum, for example by performing an inverse fast Fourier transform on the fourth spectrum and then overlap-adding the result to obtain the processed signal frame.

The formula for performing phase compensation processing on the third spectrum based on the phase information to obtain the fourth spectrum $Y(k,m)$ is exemplified as follows:

$$Y(k,m)=X'(k,m)\exp\bigl(j\,\angle X(k,m)\bigr)$$

By the phase compensation processing, the phase offset in the processing process of the target voice signal frame can be corrected, so that the accuracy of the processed signal frame is ensured.
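The phase compensation and the reconstruction of the time-domain signal can be sketched as follows. The sketch treats the third spectrum as a magnitude, which is an assumption made so that the original phase is applied only once; it also uses a naive inverse transform in place of an inverse FFT, and all function names are hypothetical.

```python
import cmath


def phase_compensate(third_spectrum, original_spectrum):
    """Y(k, m) = |X'(k, m)| * exp(j * angle(X(k, m))) for every bin.

    Taking the magnitude of X' is an assumption of this sketch.
    """
    return [abs(xp) * cmath.exp(1j * cmath.phase(x))
            for xp, x in zip(third_spectrum, original_spectrum)]


def inverse_dft(spectrum):
    """Naive inverse transform; keeps the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def overlap_add(frames, shift):
    """Sum equally long frames whose starts are spaced by the frame shift."""
    out = [0.0] * (shift * (len(frames) - 1) + len(frames[0]))
    for i, frame in enumerate(frames):
        for t, value in enumerate(frame):
            out[i * shift + t] += value
    return out
```

A complete reconstruction would usually also account for the analysis window when overlap-adding, which this sketch leaves out.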

According to the processing method of the voice signal, provided by the embodiment of the application, the masking threshold value and the masking probability model of each frequency point in the target frequency spectrum are determined, then the second signal gain function of the target frequency spectrum is determined based on the probability model, and the target frequency spectrum is subjected to filtering processing, so that the voice enhancement also considers the auditory masking characteristic of human ears, the auditory distortion of voice is reduced, and the voice enhancement effect is further improved.

In order to better implement the method for processing a voice signal in the embodiment of the present application, on the basis of the method for processing a voice signal, the embodiment of the present application further provides a device for processing a voice signal, as shown in fig. 4, which is a schematic structural diagram of an embodiment of the device for processing a voice signal provided in the embodiment of the present application, and may include:

an obtaining unit 401, configured to obtain a first spectrum of a target speech signal frame to be processed, and to take a preset noise spectrum as a first noise spectrum of the first spectrum;

a filtering unit 402, configured to perform filtering processing on the first spectrum based on the first noise spectrum, to obtain a second spectrum;

A determining unit 403, configured to determine whether voice information exists in the second spectrum;

A first execution unit 404, configured to perform noise spectrum estimation processing on the second spectrum if no voice information exists in the second spectrum, obtain a second noise spectrum, use the second noise spectrum as a first noise spectrum, use the second spectrum as a first spectrum, and return to perform filtering processing on the first spectrum based on the first noise spectrum, so as to obtain a second spectrum;

the second execution unit 405 is configured to determine, based on the target spectrum, a processed signal frame of the target speech signal frame, where the second spectrum is used as the target spectrum of the target speech signal frame if speech information exists in the second spectrum.

According to the voice signal processing device provided by the embodiments of the present application, when no voice information exists in the second spectrum, the second noise spectrum is taken as the first noise spectrum and the second spectrum as the first spectrum, and the first spectrum is filtered again based on the first noise spectrum to obtain a new second spectrum. Repeating the noise estimation and filtering in this way strengthens the suppression of non-stationary noise and improves the speech enhancement effect.
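The control flow described above can be sketched as a loop. This is only an illustration under stated assumptions: the three callables `filter_fn`, `has_speech_fn` and `estimate_noise_fn` are hypothetical placeholders for the filtering, speech detection and noise spectrum estimation steps, and the `max_iters` guard is an addition of this sketch, not something the application specifies.

```python
def process_frame(first_spectrum, preset_noise, filter_fn, has_speech_fn,
                  estimate_noise_fn, max_iters=10):
    """Iterate filtering and noise re-estimation until speech is detected.

    Returns (target_spectrum, target_noise_spectrum).
    All callables are hypothetical stand-ins supplied by the caller.
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    spectrum, noise = first_spectrum, preset_noise
    for _ in range(max_iters):
        # Filter the first spectrum with the first noise spectrum.
        second = filter_fn(spectrum, noise)
        if has_speech_fn(second):
            # Speech found: the second spectrum becomes the target spectrum and
            # the noise spectrum used for this pass becomes the target noise.
            return second, noise
        # No speech: re-estimate the noise from the second spectrum and
        # feed both back in as the new "first" quantities.
        noise = estimate_noise_fn(second)
        spectrum = second
    # Safety exit added by this sketch so the loop always terminates.
    return second, noise
```

Passing the steps in as callables keeps the loop independent of any particular gain function or detector, so each step can be swapped out when experimenting.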

In some embodiments of the present application, the filtering unit 402 is specifically configured to:

acquiring a target noise spectrum of a previous frame of a target voice signal frame and a target spectrum of the previous frame;

determining a posterior signal-to-noise ratio of the first spectrum based on the target noise spectrum of the previous frame;

determining a prior signal-to-noise ratio of the first spectrum based on the target spectrum of the previous frame, the posterior signal-to-noise ratio, and the first noise spectrum;

determining a first signal gain function of the first spectrum according to the posterior signal-to-noise ratio and the prior signal-to-noise ratio;

and filtering the first frequency spectrum based on the first signal gain function to obtain a second frequency spectrum.
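This section does not restate the formulas behind these steps, so the following is only a sketch under assumptions: it uses the widely known decision-directed rule for the prior signal-to-noise ratio and a Wiener-type gain as stand-ins. The smoothing factor `alpha`, the use of power spectra, and all names are assumptions of this sketch. Note that a Wiener gain depends on the posterior signal-to-noise ratio only through the prior one, whereas the gain function in the application is determined from both.

```python
EPS = 1e-12  # guards against division by zero


def filter_first_spectrum(power, noise_prev, noise_cur, target_prev, alpha=0.98):
    """Filter one frame given per-bin power values.

    power:       power of the first spectrum in each bin.
    noise_prev:  target noise power spectrum of the previous frame.
    noise_cur:   first noise power spectrum of the current frame.
    target_prev: target (enhanced) power spectrum of the previous frame.
    Returns (posterior_snr, prior_snr, gain, second_power) as lists.
    """
    posterior, prior, gain, second = [], [], [], []
    for p, n_prev, n_cur, s_prev in zip(power, noise_prev, noise_cur, target_prev):
        # Posterior SNR from the previous frame's target noise spectrum.
        gamma = p / max(n_prev, EPS)
        # Decision-directed prior SNR (assumed rule, not taken from the text).
        xi = alpha * s_prev / max(n_cur, EPS) + (1.0 - alpha) * max(gamma - 1.0, 0.0)
        # Wiener-type gain used as a simple stand-in for the first gain function.
        g = xi / (1.0 + xi)
        posterior.append(gamma)
        prior.append(xi)
        gain.append(g)
        second.append(g * g * p)  # the gain scales amplitude, so power scales by g^2
    return posterior, prior, gain, second
```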

In some embodiments of the present application, the determining unit 403 is specifically configured to:

determining the probability of the presence of speech in the second frequency spectrum based on the posterior signal-to-noise ratio and the prior signal-to-noise ratio;

and determining, based on the probability, whether voice information exists in the second spectrum.
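As an illustration, one common way to turn the two signal-to-noise ratios into a speech presence probability is p = 1 / (1 + q/(1 − q) · (1 + ξ) · exp(−v)) with v = γξ/(1 + ξ), where γ is the posterior and ξ the prior signal-to-noise ratio. The application may define the probability differently; the prior absence probability `q`, the frame-level averaging and the decision `threshold` below are assumptions of this sketch.

```python
import math


def speech_presence_probability(gamma, xi, q=0.5):
    """Per-bin speech presence probability from posterior (gamma) and prior (xi) SNR.

    q is an assumed prior probability of speech absence, with 0 < q < 1.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * math.exp(-v))


def has_speech(posterior, prior, q=0.5, threshold=0.5):
    """Decide whether a frame contains speech by averaging per-bin probabilities.

    Returns (decision, probabilities). The averaging rule and threshold are
    hypothetical choices made for this example.
    """
    if not posterior or len(posterior) != len(prior):
        raise ValueError("posterior and prior must be non-empty and equally long")
    probs = [speech_presence_probability(g, x, q) for g, x in zip(posterior, prior)]
    return sum(probs) / len(probs) > threshold, probs
```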

In some embodiments of the present application, the first execution unit 404 is specifically configured to:

and carrying out noise spectrum estimation processing on the second frequency spectrum based on the probability and the target noise spectrum of the previous frame to obtain the second noise spectrum.
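One way to realise such a probability-controlled update, shown here only as a sketch, is recursive averaging in which bins that likely contain speech keep the previous noise estimate. The smoothing constant `alpha_d` and the use of power spectra are assumptions; the application may use a different update rule.

```python
def estimate_noise_spectrum(second_power, noise_prev, speech_prob, alpha_d=0.85):
    """Update the noise power spectrum bin by bin.

    second_power: per-bin power of the second spectrum.
    noise_prev:   target noise power spectrum of the previous frame.
    speech_prob:  per-bin speech presence probability in [0, 1].
    alpha_d:      assumed base smoothing constant in [0, 1).
    """
    noise = []
    for p, n_prev, prob in zip(second_power, noise_prev, speech_prob):
        # The effective smoothing factor approaches 1 as speech becomes more
        # likely, which freezes the noise estimate in speech-dominated bins.
        a = alpha_d + (1.0 - alpha_d) * prob
        noise.append(a * n_prev + (1.0 - a) * p)
    return noise
```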

In some embodiments of the present application, the obtaining unit 401 is further configured to:

acquiring a preset number of target frames from a plurality of frames that include the target speech signal frame and are ordered in speech sequence;

acquiring the first spectra of the target frames;

and averaging the first spectra of the target frames to obtain the preset noise spectrum.
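The averaging step can be sketched as follows. The sketch assumes each first spectrum is a list of real per-bin values (for example magnitudes) and that the target frames are the first `preset_count` frames of the sequence; which frames are chosen is an assumption here, since the text only fixes their number.

```python
def preset_noise_spectrum(frame_spectra, preset_count):
    """Average the first spectra of the first `preset_count` frames, bin by bin.

    frame_spectra: list of per-frame spectra in speech order (lists of floats).
    preset_count:  number of target frames to average (hypothetical parameter name).
    """
    if preset_count < 1 or preset_count > len(frame_spectra):
        raise ValueError("preset_count must be between 1 and the number of frames")
    targets = frame_spectra[:preset_count]
    n_bins = len(targets[0])
    if any(len(spectrum) != n_bins for spectrum in targets):
        raise ValueError("all target frames must have the same number of bins")
    return [sum(spectrum[k] for spectrum in targets) / preset_count
            for k in range(n_bins)]
```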

In some embodiments of the present application, the second execution unit 405 is specifically configured to:

determining masking threshold values of all frequency points in a target frequency spectrum;

determining a probability model of masking each frequency point in the target frequency spectrum by the target noise spectrum based on the masking threshold;

determining a second signal gain function of the target spectrum based on the probability model;

filtering the target frequency spectrum based on the second signal gain function to obtain a third frequency spectrum;

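and determining the processed signal frame of the target speech signal frame according to the third spectrum.

In some embodiments of the present application, when determining the processed signal frame according to the third spectrum, the second execution unit 405 is specifically configured to: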

acquiring phase information in a target voice signal frame;

performing phase compensation processing on the third frequency spectrum based on the phase information to obtain a fourth frequency spectrum;

And determining the processed signal frame of the target voice signal frame according to the fourth frequency spectrum.

In addition to the above method and apparatus for processing a voice signal, an embodiment of the present application further provides a computer device, where the computer device includes:

One or more processors;

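A memory; and

One or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the steps of the voice signal processing method of any of the above embodiments.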

Fig. 5 is a schematic structural diagram of an embodiment of the computer device provided by the embodiments of the present application. Specifically:

The computer device may include a processor 501 with one or more processing cores, a memory 502 comprising one or more computer-readable storage media, a power supply 503, an input unit 504, and other components. Those skilled in the art will appreciate that the structure shown in Fig. 5 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in the embodiment of the present application, the processor 501 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 executes the application programs stored in the memory 502, so as to implement various functions, as follows:

Acquiring a first frequency spectrum of a target voice signal frame to be processed; taking the preset noise spectrum as a first noise spectrum of the first frequency spectrum; filtering the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum; determining whether voice information exists in the second frequency spectrum; if the second frequency spectrum does not have voice information, performing noise spectrum estimation processing on the second frequency spectrum to obtain a second noise spectrum, taking the second noise spectrum as a first noise spectrum, taking the second frequency spectrum as a first frequency spectrum, and returning to execute filtering processing on the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum; and if the voice information exists in the second frequency spectrum, taking the second frequency spectrum as a target frequency spectrum of the target voice signal frame, and determining the processed signal frame of the target voice signal frame based on the target frequency spectrum.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application provide a computer-readable storage medium, which may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like. The computer-readable storage medium stores a plurality of instructions that can be loaded by a processor to perform the steps of any of the voice signal processing methods provided by the embodiments of the present application. For example, the instructions may perform the steps described above for the processor 501, from acquiring the first spectrum of the target speech signal frame to determining the processed signal frame based on the target spectrum.

In the foregoing embodiments, each embodiment is described with its own emphasis. For portions not described in detail in one embodiment, reference may be made to the detailed descriptions of the other embodiments above, which are not repeated here.

In the implementation, each unit or structure may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit or structure may be referred to the foregoing method embodiments and will not be repeated herein.

The method, device, computer device and storage medium for processing a voice signal provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is intended only to aid understanding of the method and core idea of the present application. Meanwhile, those skilled in the art may, following the ideas of the present application, make changes to the specific implementations and the scope of application; in summary, the contents of this specification should not be construed as limiting the present application.

Claims

1. A method of processing a speech signal, the method comprising:

acquiring a first frequency spectrum of a target voice signal frame to be processed;

taking a preset noise spectrum as a first noise spectrum of the first frequency spectrum;

filtering the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

determining whether voice information exists in the second frequency spectrum;

if no voice information exists in the second frequency spectrum, performing noise spectrum estimation processing on the second frequency spectrum to obtain a second noise spectrum, taking the second noise spectrum as the first noise spectrum, taking the second frequency spectrum as the first frequency spectrum, and returning to the step of filtering the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

if voice information exists in the second frequency spectrum, taking the second frequency spectrum as a target frequency spectrum of the target voice signal frame, and determining the processed signal frame of the target voice signal frame based on the target frequency spectrum;

if voice information exists in the second frequency spectrum, taking the first noise spectrum as a target noise spectrum of the target voice signal frame;

wherein the filtering of the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum comprises: acquiring a target noise spectrum of a previous frame of the target voice signal frame and a target spectrum of the previous frame; determining a posterior signal-to-noise ratio of the first spectrum based on the target noise spectrum of the previous frame; determining a prior signal-to-noise ratio of the first spectrum based on the target spectrum of the previous frame, the posterior signal-to-noise ratio, and the first noise spectrum; determining a first signal gain function of the first frequency spectrum according to the posterior signal-to-noise ratio and the prior signal-to-noise ratio; and filtering the first frequency spectrum based on the first signal gain function to obtain the second frequency spectrum;

wherein, before the preset noise spectrum is taken as the first noise spectrum of the first frequency spectrum, the method further comprises: acquiring a preset number of target frames from a plurality of frames that include the target voice signal frame and are ordered in speech sequence; acquiring the first frequency spectra of the target frames; and averaging the first frequency spectra of the target frames to obtain the preset noise spectrum;

The determining, based on the target spectrum, a processed signal frame of the target speech signal frame includes: determining a masking threshold of each frequency point in the target frequency spectrum; determining a probability model of masking each frequency point in the target frequency spectrum by the target noise spectrum based on the masking threshold; determining a second signal gain function of the target spectrum based on the probability model; filtering the target frequency spectrum based on the second signal gain function to obtain a third frequency spectrum; and determining the processed signal frame of the target voice signal frame according to the third frequency spectrum.

2. The method for processing a voice signal according to claim 1, wherein the determining whether voice information exists in the second spectrum comprises:

determining a probability of speech being present in the second frequency spectrum based on the posterior signal-to-noise ratio and the prior signal-to-noise ratio;

based on the probability, it is determined whether speech information is present in the second spectrum.

3. The method for processing a speech signal according to claim 1, wherein said performing noise spectrum estimation processing on said second spectrum to obtain a second noise spectrum comprises:

and carrying out noise spectrum estimation processing on the second frequency spectrum based on the probability and the target noise spectrum of the previous frame to obtain a second noise spectrum.

4. The method for processing a speech signal according to claim 1, wherein said determining a processed signal frame of said target speech signal frame based on said third spectrum comprises:

acquiring phase information in the target voice signal frame;

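based on the phase information, performing phase compensation processing on the third frequency spectrum to obtain a fourth frequency spectrum;

and determining the processed signal frame of the target voice signal frame according to the fourth frequency spectrum.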

5. A device for processing a speech signal, the device comprising:

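an obtaining unit, configured to obtain a first frequency spectrum of a target voice signal frame to be processed, and to take a preset noise spectrum as a first noise spectrum of the first frequency spectrum;

a filtering unit, configured to filter the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

a determining unit, configured to determine whether voice information exists in the second frequency spectrum;

a first execution unit, configured to, if no voice information exists in the second frequency spectrum, perform noise spectrum estimation processing on the second frequency spectrum to obtain a second noise spectrum, take the second noise spectrum as the first noise spectrum, take the second frequency spectrum as the first frequency spectrum, and return to the filtering of the first frequency spectrum based on the first noise spectrum to obtain a second frequency spectrum;

a second execution unit, configured to, if voice information exists in the second frequency spectrum, take the second frequency spectrum as a target frequency spectrum of the target voice signal frame, and determine a processed signal frame of the target voice signal frame based on the target frequency spectrum;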

6. A computer device, the computer device comprising:

One or more processors;

A memory; and

One or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the method of processing a speech signal of any of claims 1 to 4.

7. A computer-readable storage medium, on which a computer program is stored, the computer program being loaded by a processor to perform the steps of the method of processing a speech signal according to any one of claims 1 to 4.