CN113160842B - MCLP-based voice dereverberation method and system

Info

Publication number: CN113160842B
Application number: CN202110247855.4A
Authority: CN
Inventors: 冯子成; 马鸿飞
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-03-06
Filing date: 2021-03-06
Publication date: 2024-04-09
Anticipated expiration: 2041-03-06
Also published as: CN113160842A

Abstract

The invention relates to the technical field of speech signal processing, and in particular to a speech dereverberation method and system based on multi-channel linear prediction (MCLP). The method comprises the following steps: performing frame-by-frame processing on reverberant speech collected in a reverberant environment to obtain the desired signal of the current frame; obtaining a speech reverberation energy ratio and a signal-to-noise estimate of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, thereby obtaining a first power spectral density of the desired signal, wherein the speech reverberation energy ratio is positively correlated with a first energy ratio, namely the energy ratio of the reverberant speech to the reverberant component, and the signal-to-noise estimate is positively correlated with a second energy ratio, namely the energy ratio of the desired speech to the reverberant component; obtaining the dereverberated speech signal from the first power spectral density; and storing the first power spectral density of the current frame as the historical first power spectral density for the next frame, and updating the first power spectral density of the next frame, until all dereverberated speech signals are obtained. Embodiments of the invention can obtain higher-quality dereverberated speech.

Description

MCLP-based voice dereverberation method and system

Technical Field

The invention relates to the technical field of voice signal processing, in particular to a voice dereverberation method and a voice dereverberation system based on MCLP.

Background

In daily life, indoor recording is increasingly common, for example in indoor meetings, auditorium lectures, live streaming and intelligent voice assistants. In these scenarios, the speech signal collected by a microphone is often mixed with a strong reverberation component. Reverberation is an acoustic phenomenon that arises in enclosed spaces: because sound propagates along multiple paths and is reflected by walls and object surfaces, delayed copies of the signal reach the microphone, blurring the collected speech and seriously degrading the clarity of its spectrum. Research has shown that early reflections arriving within 50 milliseconds help to improve speech intelligibility and fullness, but excessive late reverberation can severely degrade speech quality.

The inventors have found in practice that the above prior art has the following drawbacks:

In multi-channel linear prediction (Multi-Channel Linear Prediction, MCLP) algorithms for speech dereverberation, the clean speech signal is modeled as a time-varying Gaussian model, so the performance of the algorithm depends heavily on how accurately the power spectral density (Power Spectral Density, PSD) of the clean speech signal is estimated. The original online MCLP algorithm estimates the PSD directly from the observed reverberant signal instead of the clean speech, which is inaccurate and degrades the dereverberation result. Some improved versions of the algorithm first estimate the PSD of the late reverberation component and then subtract it from the reverberant PSD by spectral subtraction to obtain an estimate of the clean speech PSD. However, the reverberant PSD estimate is itself inaccurate; when its amplitude is too large, direct spectral subtraction over-subtracts, producing zeros in the spectrum and causing spectral distortion and musical noise.

Disclosure of Invention

In order to solve the technical problems, the invention aims to provide a voice dereverberation method and a voice dereverberation system based on MCLP, and the adopted technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides an MCLP-based speech dereverberation method comprising the steps of:

performing frame-by-frame processing on reverberant speech collected in a reverberant environment to obtain the desired signal of the current frame;

obtaining a speech reverberation energy ratio and a signal-to-noise estimate of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, thereby obtaining a first power spectral density of the desired signal; the speech reverberation energy ratio is positively correlated with a first energy ratio, and the signal-to-noise estimate is positively correlated with a second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberant component; the second energy ratio is the energy ratio of the desired speech to the reverberant component;

obtaining the dereverberated speech signal from the first power spectral density;

and storing the first power spectral density of the current frame as the historical first power spectral density for the next frame, and updating the first power spectral density of the next frame, until all dereverberated speech signals are obtained.

Preferably, the step of acquiring the desired signal includes:

calculating a prediction coefficient through mathematical representation of the reverberation signal in a time-frequency domain;

and obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberant voice subjected to framing treatment.

Preferably, the method for calculating the reverberation energy ratio of the voice comprises the following steps:

and obtaining the voice reverberation energy ratio of the current frame by carrying out smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.

Preferably, the signal-to-noise estimation value calculating method comprises the following steps:

where $R_{d/r}$ denotes the signal-to-noise estimate; the second energy ratio is the ratio of the desired-signal energy to the second power spectral density of the reverberant component; $d'_{t,l}$ denotes the estimated desired-signal frequency-bin amplitude; $|d'_{t,l}|^2$ denotes the energy of the desired signal; $\beta_2$ denotes the second smoothing factor; and $R_{x/r}$ denotes the speech reverberation energy ratio.

Preferably, the step of obtaining the dereverberated speech signal includes:

obtaining expected signal frequency points at each channel of the current frame by using a weighted recursive least square formula according to the first power spectral density;

and carrying out short-time Fourier inverse transformation on the expected signal frequency points to obtain the dereverberated voice signal.

In a second aspect, another embodiment of the present invention provides an MCLP-based speech dereverberation system comprising the following modules:

a reverberant speech preprocessing module, configured to perform frame-by-frame processing on reverberant speech collected in a reverberant environment to obtain the desired signal of the current frame;

a first power spectral density acquisition module, configured to obtain a speech reverberation energy ratio and a signal-to-noise estimate of the desired signal, and to substitute them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, thereby obtaining a first power spectral density of the desired signal; the speech reverberation energy ratio is positively correlated with a first energy ratio, and the signal-to-noise estimate is positively correlated with a second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberant component; the second energy ratio is the energy ratio of the desired speech to the reverberant component;

a speech dereverberation module, configured to obtain the dereverberated speech signal from the first power spectral density;

and a first power spectral density updating module, configured to store the first power spectral density of the current frame as the historical first power spectral density for the next frame, and to update the first power spectral density of the next frame, until all dereverberated speech signals are obtained.

Preferably, the reverberation voice preprocessing module includes:

the prediction coefficient calculation module is used for calculating a prediction coefficient through mathematical representation of the reverberation signal in a time-frequency domain;

and the expected signal calculation module is used for obtaining a first prediction coefficient matrix according to the prediction coefficient, and calculating the expected signal by using the first prediction coefficient matrix and the reverberant voice subjected to framing treatment.

Preferably, the first power spectral density acquisition module includes:

and the voice reverberation energy ratio acquisition module is used for obtaining the voice reverberation energy ratio of the current frame by carrying out smooth calculation on the first energy ratio and the historical voice reverberation energy ratio.

Preferably, the first power spectral density acquisition module includes:

the signal-to-noise estimation value calculation module is used for calculating the signal-to-noise estimation value:

Preferably, the speech dereverberation module comprises:

the expected signal frequency point acquisition module is used for acquiring expected signal frequency points at all channels of the current frame by utilizing a weighted recursive least square formula according to the first power spectral density;

and the dereverberation voice signal calculation module is used for carrying out short-time Fourier inverse transformation on the expected signal frequency points to obtain the dereverberation voice signal.

The embodiment of the invention has the following beneficial effects:

By combining geometric spectral subtraction with the MCLP algorithm, the over-subtraction problem caused by conventional spectral subtraction is avoided, the dereverberation performance of the MCLP algorithm is improved, and higher-quality dereverberated speech can be obtained.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a MCLP-based speech dereverberation method according to an embodiment of the present invention;

FIG. 2 is a diagram showing a time domain waveform of an original speech with a reverberation time of 0.8s and a channel number of 4 according to an embodiment of the present invention;

FIG. 3 is a time domain waveform diagram of voice processed by the MCLP algorithm according to an embodiment of the present invention when the reverberation time is 0.8s and the channel number is 4;

FIG. 4 is a time domain waveform diagram of a voice processed by the MCLP-based voice dereverberation method according to an embodiment of the present invention when the reverberation time is 0.8s and the channel number is 4;

FIG. 5 is a graph of a speech spectrum of an original speech with a reverberation time of 0.8s and a channel number of 4 according to an embodiment of the present invention;

FIG. 6 is a graph of a voice spectrum of the MCLP algorithm processed voice with a reverberation time of 0.8s and a channel number of 4 according to one embodiment of the present invention;

FIG. 7 is a graph of a voice spectrum of a voice processed by the MCLP-based voice dereverberation method according to an embodiment of the present invention, when the reverberation time is 0.8s and the channel number is 4;

FIG. 8 is a line chart comparing the Perceptual Evaluation of Speech Quality (PESQ) scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method at different reverberation times according to an embodiment of the present invention;

FIG. 9 is a line chart comparing the speech-to-reverberation modulation energy ratio (SRMR) scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method at different reverberation times according to an embodiment of the present invention;

FIG. 10 is a line chart comparing the frequency-weighted segmental signal-to-noise ratio (FWsegSNR) scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method at different reverberation times according to an embodiment of the present invention;

FIG. 11 is a line chart comparing the cepstrum distance (CD) scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method at different reverberation times according to an embodiment of the present invention;

FIG. 12 is a line chart comparing the PESQ scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method for different numbers of speech channels according to an embodiment of the present invention;

FIG. 13 is a line chart comparing the SRMR scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method for different numbers of speech channels according to an embodiment of the present invention;

FIG. 14 is a line chart comparing the FWsegSNR scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method for different numbers of speech channels according to an embodiment of the present invention;

FIG. 15 is a line chart comparing the CD scores of the original reverberant speech, speech processed by the MCLP algorithm, and speech processed by the MCLP-based speech dereverberation method for different numbers of speech channels according to an embodiment of the present invention;

FIG. 16 is a block diagram of an MCLP-based speech dereverberation system according to an embodiment of the present invention.

Detailed Description

In order to further describe the technical means and effects of the present invention for achieving the intended purpose, the following detailed description refers to the specific implementation, structure, features and effects of an MCLP-based speech dereverberation method and system according to the present invention, with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The following specifically describes a specific scheme of a voice dereverberation method and a system based on MCLP provided by the invention with reference to the accompanying drawings.

Referring to fig. 1, a flowchart of a voice dereverberation method based on MCLP according to an embodiment of the present invention is shown, the method comprising the steps of:

Step S001, performing frame-by-frame processing on the reverberant speech collected in the reverberant environment to obtain the desired signal of the current frame.

The specific steps are as follows:

1) Calculate the prediction coefficients from a mathematical representation of the reverberant signal in the time-frequency domain.

In an enclosed acoustic space, consider a single speech source and a microphone array of M omnidirectional microphones; no particular array geometry is required. The multi-channel speech signals received by the array are windowed and split into frames of length L, and an L-point short-time Fourier transform (Short Time Fourier Transform, STFT) is applied frame by frame. Because reverberant speech is the convolution of the room impulse response with the speech in the time domain, which corresponds to a multiplication in the frequency domain, the reverberant signal received by the m-th microphone channel can be expressed as:

where $t$ is the time-domain index of the speech frame; $l$ is the frequency-bin index within each frame, $l \in \{1, 2, \dots, L\}$; $\tau$ is the linear prediction delay; $x_{t,l}$, taken at the $m$-th channel, is the frequency-bin component of the reverberant speech at the $l$-th bin of the $t$-th frame; $s_{t,l}$ is the frequency-bin component of the clean speech at the $l$-th bin of the $t$-th frame; the prediction coefficient relates the $m$-th microphone to the signal received by the $n$-th microphone, and may also be regarded as the room impulse response from the source to the $m$-th microphone; the length of the prediction coefficients of each channel is set to a constant $K$; and $k$ is the index of the prediction coefficients, $k \in \{1, 2, \dots, K\}$.

It should be noted that the prediction delay $\tau$ is usually a non-negative integer from 0 to 3, and the prediction coefficient length $K$ is usually a positive integer from 5 to 20; $x$, $s$ and $\mu$ are all complex-valued.

2) Obtain a first prediction coefficient matrix from the prediction coefficients, and compute the desired signal from the first prediction coefficient matrix and the framed reverberant speech.

The above formula (1) is rewritten in matrix form:

where:

the prediction coefficient matrix of the $m$-th microphone is applied to $x_{t-\tau,l}$, the sequence of signal observations needed to predict the late reverberation in the current frame. The embodiment of the invention assumes that the desired signal $s_{t,l}$ follows a zero-mean time-varying Gaussian model and is independent of the late reverberation component. Once the prediction coefficients have been obtained, the desired signal of the current frame is:

In the embodiment of the present invention, the method is evaluated by computer simulation, set up as follows:

The simulation environment is an enclosed room of size 7.0 × 3.5 × 2.4 (m) containing a uniform linear array of eight omnidirectional microphones, that is, M = 8, with a spacing of 10 cm between adjacent microphones. The microphone coordinates are [6.0, 1.35-2.05, 1.0] and the source coordinates are [1.0, 1.7, 1.0]. Multi-channel reverberant speech is generated for different reverberation times using the image source model method, with a duration of 8 s and a sampling frequency $f_s = 16000$ Hz. For windowing and framing, the frame length is set to L = 512 samples, the window function is a Hamming window of length 512, the prediction coefficient length is K = 10, and the prediction delay is τ = 3.
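The windowing and framing described above can be illustrated with a minimal NumPy sketch. It is not the patent's implementation: it assumes a frame length of 512, a 512-point Hamming window, and a frame shift of 128 samples (the value the embodiment gives for R). The function name `stft_frames` and the random one-second test signal are hypothetical placeholders.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=128):
    """Split a multi-channel signal x of shape (M, num_samples) into
    Hamming-windowed frames and apply an L-point FFT to each frame.

    Returns a complex array of shape (M, num_frames, frame_len).
    """
    x = np.asarray(x, dtype=float)
    num_ch, num_samples = x.shape
    num_frames = 1 + (num_samples - frame_len) // hop
    window = np.hamming(frame_len)
    spec = np.empty((num_ch, num_frames, frame_len), dtype=complex)
    for t in range(num_frames):
        seg = x[:, t * hop : t * hop + frame_len] * window
        spec[:, t, :] = np.fft.fft(seg, n=frame_len, axis=1)
    return spec

fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal((8, fs))   # placeholder: 1 s of 8-channel noise
X = stft_frames(x)
print(X.shape)                     # (channels, frames, bins)
```

Each `X[m, t, l]` then plays the role of the frequency-bin component of channel m at frame t and bin l, and all later per-bin processing operates on these values.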

Step S002, obtaining the speech reverberation energy ratio and the signal-to-noise estimate of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, thereby obtaining the first power spectral density of the desired signal; the speech reverberation energy ratio is positively correlated with the first energy ratio, and the signal-to-noise estimate is positively correlated with the second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberant component; the second energy ratio is the energy ratio of the desired speech to the reverberant component.

The specific steps are as follows:

1) A second power spectral density of the late reverberation component is estimated.

The late reverberation is modeled as an exponential decay based on the reverberation time and is estimated frame by frame with recursive smoothing. The second power spectral density of the late reverberation is:

where $R$ is the frame shift of the speech frames in the time domain, in samples, typically set to one half or one quarter of the frame length $L$; in the embodiment of the present invention, $R = 128$ samples. $\epsilon$ is a constant giving the minimum value of the estimated second power spectral density, typically 0.0001. The third power spectral density of the reverberant speech signal at frame $t-\tau$ is obtained, in the embodiment of the present invention, by averaging the preceding $\delta$ frames of the signals received by all microphone channels:

where $\tau$ is the number of prediction delay frames, meaning that the $\tau$ frames before frame $t$ do not take part in the prediction; $\delta$ is the number of frames around frame $t-\tau$ that are included in the calculation, and is a constant from 6 to 10 that is generally required to be greater than or equal to $2\tau$.

As an example, in an embodiment of the present invention, $\delta$ is set to 10.
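The averaging used for the third power spectral density can be sketched as follows. This is an illustration under stated assumptions rather than the patent's formula: the averaging window is assumed to be the δ frames ending at frame t-τ, and the energy is assumed to be the squared magnitude averaged over all channels. The function name `reverberant_psd` is hypothetical.

```python
import numpy as np

def reverberant_psd(X_bin, t, tau=3, delta=10):
    """Average |x|^2 over all channels and over the (up to) delta frames
    that end at frame t - tau, for a single frequency bin.

    X_bin : complex array of shape (M, num_frames) for one frequency bin.
    """
    end = t - tau
    if end < 0:
        raise ValueError("frame t - tau is not available yet")
    start = max(0, end - delta + 1)        # fewer frames near the beginning
    return float(np.mean(np.abs(X_bin[:, start : end + 1]) ** 2))

# Example: a bin whose magnitude is 2 in every channel and frame has PSD 4.
X_bin = np.full((4, 20), 2.0 + 0.0j)
psd_x = reverberant_psd(X_bin, t=15)
print(psd_x)
```

With τ = 3 and δ = 10 the condition δ ≥ 2τ stated above is satisfied.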

$\alpha(t, l)$ is defined as a variable related to the reverberation time:

where $f_s$ is the speech sampling rate in Hz, and $RT_{60}(t, l)$ is the estimated reverberation time in seconds at the current speech frame and frequency bin, which can be obtained by any of various reverberation time estimation algorithms.
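A minimal sketch of how a reverberation-time-dependent decay variable can drive the late reverberation estimate is given below. It assumes the form commonly used in statistical reverberation models, α = 3 ln(10) / (RT60 · f_s) per sample, so that energy falls by 60 dB over RT60 seconds; whether the patent's definition of α(t, l) matches this exactly cannot be confirmed from the text. The function names, the attenuation over τ frames of R samples, and the floor value are illustrative assumptions.

```python
import math

def decay_constant(rt60, fs):
    """Per-sample amplitude decay constant (assumed form):
    energy decays as exp(-2 * alpha * n) after n samples."""
    return 3.0 * math.log(10.0) / (rt60 * fs)

def late_reverb_psd(psd_delayed, rt60, fs=16000, hop=128, tau=3, floor=1e-4):
    """Attenuate the reverberant PSD observed tau frames earlier by the
    exponential decay over tau * hop samples, with a lower bound."""
    alpha = decay_constant(rt60, fs)
    attenuation = math.exp(-2.0 * alpha * hop * tau)
    return max(attenuation * psd_delayed, floor)

alpha = decay_constant(rt60=0.8, fs=16000)
# Sanity check of the assumed form: after RT60 seconds the energy is 1e-6 (-60 dB).
print(math.exp(-2.0 * alpha * 0.8 * 16000))
print(late_reverb_psd(psd_delayed=4.0, rt60=0.8))
```

The floor plays the role of the minimum value of the estimated second power spectral density mentioned earlier.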

As an example, in the embodiment of the invention, the reverberation time $RT_{60}$ is calculated by maximum likelihood estimation:

where the constant $\rho$ is the decay rate of the sound wave, which can be solved from a likelihood function by the maximum likelihood criterion. In the likelihood function, $L$ is the frame length, and $a$ and $d(i)$ are respectively:

where $A_r$ is the original amplitude of the current speech signal; $v(i)$ is the value at the $i$-th sample of a discrete normal distribution with mean 0 and variance 1, $i \in \{0, \dots, N-1\}$; and $r_t(i)$ is a preset search sequence of reverberation times, $r_t = [0.1, 0.2, \dots, 1.2]$.
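To make the grid search over candidate reverberation times concrete, the following sketch applies a maximum likelihood criterion to a simple decay model, y(i) = A · a^i · v(i) with v(i) drawn from a standard normal distribution. This model and the profile likelihood below are assumptions chosen for illustration and are not claimed to reproduce the patent's likelihood function; the function names and the synthetic test signal are hypothetical.

```python
import math
import random

def profile_log_likelihood(y, a):
    """Log-likelihood of y(i) = A * a**i * v(i), v(i) ~ N(0, 1),
    with A^2 replaced by its maximum likelihood estimate for this a."""
    n = len(y)
    log_a = math.log(a)
    # ML estimate of A^2: mean of the decay-compensated squared samples.
    var = sum((yi * math.exp(-i * log_a)) ** 2 for i, yi in enumerate(y)) / n
    return (-0.5 * n * math.log(2.0 * math.pi * var)
            - 0.5 * n * (n - 1) * log_a
            - 0.5 * n)

def estimate_rt60(y, fs, candidates):
    """Return the candidate reverberation time with the highest likelihood."""
    def decay_per_sample(rt60):
        return math.exp(-3.0 * math.log(10.0) / (rt60 * fs))
    return max(candidates,
               key=lambda rt: profile_log_likelihood(y, decay_per_sample(rt)))

# Synthetic decaying noise with a known reverberation time of 0.5 s.
random.seed(0)
fs = 16000
a_true = math.exp(-3.0 * math.log(10.0) / (0.5 * fs))
y = [a_true ** i * random.gauss(0.0, 1.0) for i in range(8000)]
candidates = [round(0.1 * k, 1) for k in range(1, 13)]   # 0.1, 0.2, ..., 1.2 s
rt60_hat = estimate_rt60(y, fs, candidates)
print(rt60_hat)
```

The candidate list mirrors the search sequence from 0.1 to 1.2 given above; a finer grid would trade computation for resolution.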

2) A first power spectral density of the desired signal is estimated using geometric spectral subtraction.

The specific steps are as follows:

a) A speech-to-reverberation energy ratio is calculated.

The speech reverberation energy ratio of the current frame is obtained by performing a smoothing calculation on the first energy ratio and the historical speech reverberation energy ratio.

The specific calculation formula is as follows:

where $R_{x/r}$ is the speech reverberation energy ratio; $\beta_1$ is the first smoothing factor, $0 < \beta_1 < 1$; and the first energy ratio is the energy ratio of the reverberant speech to the reverberant component.

As an example, in an embodiment of the present invention, $\beta_1$ is set to 0.9.
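The smoothing of the first energy ratio with the historical speech reverberation energy ratio can be sketched as a first-order recursion. The recursion form, the function name `update_speech_reverb_ratio`, and the small constant guarding the division are assumptions made for illustration; only the smoothing factor of 0.9 comes from the embodiment.

```python
def update_speech_reverb_ratio(prev_ratio, x_energy, reverb_psd,
                               beta1=0.9, eps=1e-12):
    """Blend the historical ratio with the instantaneous first energy ratio
    |x|^2 / (reverberant-component PSD). Assumed first-order form."""
    first_energy_ratio = x_energy / max(reverb_psd, eps)
    return beta1 * prev_ratio + (1.0 - beta1) * first_energy_ratio

# Example: history 1.0, |x|^2 = 4.0, reverberant PSD = 2.0.
# The instantaneous ratio is 2.0, so the result is 0.9 * 1.0 + 0.1 * 2.0.
ratio = update_speech_reverb_ratio(prev_ratio=1.0, x_energy=4.0, reverb_psd=2.0)
print(ratio)
```

A smoothing factor close to 1 makes the ratio change slowly from frame to frame, which limits fluctuations in the subsequent spectral subtraction.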

b) Calculate the signal-to-noise estimate.

The specific calculation formula is as follows:

where $R_{d/r}$ is the signal-to-noise estimate; the second energy ratio is the ratio of the desired-signal energy to the second power spectral density of the reverberant component; $d'_{t,l}$ is the estimated desired-signal frequency-bin amplitude; $|d'_{t,l}|^2$ is the energy of the desired signal; and $\beta_2$ is the second smoothing factor, $0 < \beta_2 < 1$.

After $d'_{t,l}$ is obtained, it is substituted into formula (2) to calculate $R_{d/r}$ for the next frame. When the first frame is processed, $|x_{t,l}|$ is used in place of $d'_{t,l}$, and $R_{x/r}$ is initialized to 1.0.

As an example, in the embodiment of the present invention, $\beta_2$ is set to 0.9.

c) Obtain the first power spectral density of the desired signal from the estimated desired-signal frequency-bin amplitude.

where $d'_{t,l}$ is the estimated desired-signal frequency-bin amplitude and $\beta_3$ is the third smoothing factor, $0 < \beta_3 < 1$. When the first frame is processed, the calculation is performed with a substitute term.

As an example, in an embodiment of the present invention, $\beta_3$ is set to 0.9.
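As a hedged illustration of how the two ratios can yield an amplitude estimate and a smoothed power spectral density, the sketch below uses the gain function widely associated with the geometric approach to spectral subtraction, with the speech reverberation energy ratio in the role of the a posteriori ratio and the signal-to-noise estimate in the role of the a priori ratio. It is an assumption that the patent's geometric spectral subtraction formula has this form. The clamping of the gain, the first-order PSD recursion, and all function names are illustrative choices.

```python
import math

def geometric_gain(r_xr, r_dr, eps=1e-12):
    """Gain of the geometric approach to spectral subtraction (assumed form).
    r_xr : speech reverberation energy ratio (a posteriori role)
    r_dr : signal-to-noise estimate (a priori role)
    """
    r_xr = max(r_xr, eps)
    r_dr = max(r_dr, eps)
    num = 1.0 - (r_xr + 1.0 - r_dr) ** 2 / (4.0 * r_xr)
    den = 1.0 - (r_xr - 1.0 - r_dr) ** 2 / (4.0 * r_dr)
    # Safeguard (illustrative): keep the gain real and within [0, 1].
    num = max(num, 0.0)
    den = max(den, eps)
    return min(math.sqrt(num / den), 1.0)

def update_desired_psd(prev_psd, d_amp, beta3=0.9):
    """Smooth the desired-signal PSD with the new amplitude estimate."""
    return beta3 * prev_psd + (1.0 - beta3) * d_amp ** 2

gain = geometric_gain(r_xr=2.0, r_dr=1.0)
d_amp = gain * 2.0                      # |x| = 2.0 for this bin
psd_d = update_desired_psd(prev_psd=1.0, d_amp=d_amp)
print(gain, d_amp, psd_d)
```

When the speech reverberation energy ratio equals the signal-to-noise estimate plus one, this gain reduces to the square root of r_dr / (r_dr + 1), the same value as power spectral subtraction; otherwise the geometric term adjusts the gain instead of subtracting directly, which is the property that avoids over-subtraction.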

Step S003, obtaining the dereverberated speech signal according to the first power spectral density.

The specific steps are as follows:

1) Obtain the desired-signal frequency bins at each channel of the current frame from the first power spectral density, using a weighted recursive least squares formula.

$d_{t,l} = x_{t,l} - G_l(t-1)^{H} x_{t-\tau,l}$

where:

where $d_{t,l}$ is the desired-signal frequency bin at each channel of the current frame; $G_l(t)$ is the second prediction coefficient matrix; $k_l(t)$ is the gain vector used to update the prediction coefficients, of size $(MK \times 1)$; $\Phi_l(t)$ stores the inverse of the spatial correlation matrix, of size $(MK \times MK)$; and $\alpha$ is a constant, the fourth smoothing factor.

As an example, in an embodiment of the present invention, $\alpha$ is set to 0.9999.

Before the first frame is processed, $G_l(t)$ is initialized to an all-zero matrix and $\Phi_l(t)$ is initialized to an identity matrix.
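The weighted recursive least squares update for one frequency bin can be sketched as below. The prediction-error line follows the expression for the desired signal given above; the gain, inverse-correlation and coefficient updates use the standard weighted RLS recursion, which is an assumption here because the patent's own update formulas are not reproduced in the text. The function name `rls_step` and the synthetic test data are hypothetical.

```python
import numpy as np

def rls_step(x, x_delayed, G, Phi, psd, alpha=0.9999):
    """One weighted RLS update for a single frequency bin.

    x         : (M,)        current multi-channel observation
    x_delayed : (M*K,)      stacked delayed observations
    G         : (M*K, M)    prediction coefficient matrix from the previous frame
    Phi       : (M*K, M*K)  inverse spatial correlation matrix from the previous frame
    psd       : float       first power spectral density of the desired signal
    """
    d = x - G.conj().T @ x_delayed                 # desired-signal estimate
    Phi_x = Phi @ x_delayed
    k = Phi_x / (alpha * psd + np.vdot(x_delayed, Phi_x))   # gain vector
    Phi_new = (Phi - np.outer(k, x_delayed.conj() @ Phi)) / alpha
    G_new = G + np.outer(k, d.conj())
    return d, G_new, Phi_new

# Synthetic check: observations that are exactly predictable from x_delayed.
M, K = 2, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((M * K, M)) + 1j * rng.standard_normal((M * K, M))
G = np.zeros((M * K, M), dtype=complex)    # all-zero initialization
Phi = np.eye(M * K, dtype=complex)         # identity initialization
for _ in range(200):
    x_delayed = rng.standard_normal(M * K) + 1j * rng.standard_normal(M * K)
    x = W.conj().T @ x_delayed
    d, G, Phi = rls_step(x, x_delayed, G, Phi, psd=1.0)
print(np.max(np.abs(G - W)))               # small once the filter has converged
```

In the synthetic check the coefficient matrix approaches the generating matrix `W`, so the prediction error shrinks; in the method itself the error is the dereverberated frequency bin and `psd` would be the first power spectral density from the preceding step.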

2) Apply an inverse short-time Fourier transform to the desired-signal frequency bins to obtain the dereverberated speech signal.

After an inverse short-time Fourier transform is applied to $d_{t,l}$, the result is output by the algorithm as a frame of the dereverberated speech signal.
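A minimal overlap-add sketch of the inverse transform is shown below, assuming Hamming-windowed analysis frames of length 512 with a shift of 128 samples. Normalizing by the accumulated window is one common reconstruction choice and is an assumption here, as the text does not specify the synthesis procedure; the function name `istft_overlap_add` is hypothetical.

```python
import numpy as np

def istft_overlap_add(frames_spec, hop=128):
    """Reconstruct a time signal from per-frame spectra of Hamming-windowed
    frames by overlap-add, normalized by the summed analysis window.

    frames_spec : complex array of shape (num_frames, frame_len)
    """
    num_frames, frame_len = frames_spec.shape
    window = np.hamming(frame_len)
    out = np.zeros((num_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for t in range(num_frames):
        seg = np.fft.ifft(frames_spec[t]).real
        out[t * hop : t * hop + frame_len] += seg
        norm[t * hop : t * hop + frame_len] += window
    return out / np.maximum(norm, 1e-8)

# Round trip on a random signal: analysis with the same window, then synthesis.
rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
win = np.hamming(512)
spec = np.array([np.fft.fft(x[t * 128 : t * 128 + 512] * win)
                 for t in range(1 + (2048 - 512) // 128)])
y = istft_overlap_add(spec)
print(np.max(np.abs(y - x)))
```

When the spectra are left unmodified the round trip returns the input up to floating-point error, which makes this a convenient check before inserting the dereverberation step between analysis and synthesis.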

Step S004, the first power spectrum density of the current frame is stored and used as the historical first power spectrum density of the next frame, and the first power spectrum density of the next frame is updated until all dereverberated voice signals are obtained.

The specific steps are as follows:

Since the desired signal is modeled as a zero-mean time-varying Gaussian model with the first power spectral density as its variance, the first power spectral density obtained for the current speech frame is stored and substituted into formula (3) for the next frame, correcting the estimation of the first power spectral density:

Determine whether all speech frames have been processed; if not, continue with the dereverberation calculation for the next frame until all speech frames have been processed.

In summary, according to the embodiment of the invention, the frame data processing is performed on the collected reverberation voice of the reverberation environment to obtain the expected signal of the current frame; obtaining a voice reverberation energy ratio and a signal to noise estimated value of a desired signal, substituting a geometric spectrum subtraction formula to carry out spectrum subtraction on the reverberation voice to obtain a first power spectrum density of the desired signal; the energy ratio of the voice reverberation is in positive correlation with the first energy ratio of the reverberation voice and the reverberation component, and the signal-to-noise estimated value is in positive correlation with the second energy ratio of the expected voice and the reverberation component; obtaining a dereverberated voice signal according to the first power spectral density; and storing the first power spectral density of the current frame, taking the first power spectral density as the historical first power spectral density of the next frame, and updating the first power spectral density of the next frame until all dereverberated voice signals are obtained.

The embodiment of the invention evaluates the performance of the MCLP-based speech dereverberation method by computer simulation, as shown in FIGS. 2-15, where "improved MCLP algorithm" in the figures refers to the MCLP-based speech dereverberation method provided by the embodiment of the invention. The time-domain waveforms of FIGS. 2-4 and the spectrograms of FIGS. 5-7 show that, compared with speech processed by the MCLP algorithm, speech processed by the embodiment has a clearer and cleaner waveform envelope and spectrogram texture, with less smearing from reverberant tails. In particular, at the onset of the speech, the time-domain and frequency-domain waveforms are markedly clearer than those of the MCLP algorithm, without swelling or blurring, indicating that the reverberation components are removed more thoroughly and that the overall stability of the algorithm is higher.

Four speech quality metrics are used. For Perceptual Evaluation of Speech Quality (PESQ), the speech-to-reverberation modulation energy ratio (Speech-to-Reverberation Modulation Energy Ratio, SRMR) and the frequency-weighted segmental signal-to-noise ratio (Frequency Weighted SNRseg, FWsegSNR), a higher score indicates better speech quality; for the cepstrum distance (Cepstrum Distance, CD), a lower score indicates better speech quality. The line charts of FIGS. 8-11 show that, for reverberation times from 0.2 s to 1.2 s, the scores on all four metrics are clearly better than those of the MCLP algorithm and the improvement is stable, which demonstrates the advantage of the embodiment of the invention. The line charts of FIGS. 12-15 show that, with 2, 4, 6 and 8 speech channels, all four metrics improve clearly over the MCLP algorithm, and the improvement grows as the number of speech channels increases.

The comparison shows that the voice quality processed by the voice dereverberation method based on the MCLP is obviously superior to that of the original MCLP algorithm, and the embodiment of the invention can further improve the dereverberation performance to a certain extent.

Based on the same inventive concept as the above method, another embodiment of the present invention provides an MCLP-based speech dereverberation system, referring to fig. 16, the system includes the following modules:

a reverberant speech preprocessing module 1001, a first power spectral density acquisition module 1002, a speech dereverberation module 1003, and a first power spectral density update module 1004.

The reverberation voice preprocessing module 1001 is configured to obtain a desired signal of a current frame by performing frame data processing on collected reverberation voices of a reverberation environment; the first power spectral density obtaining module 1002 is configured to obtain a speech reverberation energy ratio and a signal-to-noise estimated value of a desired signal, and apply the speech reverberation energy ratio and the signal-to-noise estimated value to a geometric spectrum subtraction formula to perform spectrum subtraction on the reverberated speech to obtain a first power spectral density of the desired signal; the voice reverberation energy ratio and the first energy ratio are in positive correlation, and the signal-to-noise estimated value and the second energy ratio are in positive correlation; the first energy ratio is an energy ratio of the reverberated speech and the reverberated component; the second energy ratio is an energy ratio of the desired speech and reverberant components; the voice dereverberation module 1003 is configured to obtain a voice signal after dereverberation according to the first power spectral density; the first power spectral density updating module 1004 is configured to store the first power spectral density of the current frame and update the first power spectral density of the next frame with the first power spectral density as the historical first power spectral density of the next frame until all the dereverberated speech signals are obtained.

Preferably, the reverberant speech preprocessing module includes:

a desired signal calculation module, configured to obtain a first prediction coefficient matrix from the prediction coefficients and to calculate the desired signal from the first prediction coefficient matrix and the framed reverberant speech.
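As a rough illustration of this module, the sketch below computes one desired-signal frequency bin in the usual multichannel linear prediction manner: the late reverberation is predicted from delayed past frames of all microphones and subtracted from the reference channel. The function name `desired_bin`, the flattened coefficient layout, and the conjugation convention are assumptions made for illustration; the patent's own prediction coefficient matrix is not reproduced in this text.

```python
def desired_bin(x_ref, past_bins, coeffs):
    """Return one desired-signal bin: reference bin minus predicted reverberation.

    x_ref     -- complex bin of the reference microphone at the current frame
    past_bins -- delayed past bins from all microphones, flattened (assumed layout)
    coeffs    -- complex prediction coefficients, same length as past_bins
    """
    if len(past_bins) != len(coeffs):
        raise ValueError("past_bins and coeffs must have the same length")
    # Linear prediction of the late reverberation (assumed convention: g^H x).
    predicted = sum(g.conjugate() * x for g, x in zip(coeffs, past_bins))
    return x_ref - predicted


# Toy example: two past bins and two coefficients for a single frequency bin.
d = desired_bin(1 + 1j, [2 + 0j, 0 + 2j], [0.25 + 0j, 0 + 0.25j])
print(d)
```

In a full system the same operation would be repeated for every frequency bin of every frame, with the coefficients re-estimated as new frames arrive.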

Preferably, the first power spectral density acquisition module includes:

a speech reverberation energy ratio acquisition module, configured to obtain the speech reverberation energy ratio of the current frame by smoothing the first energy ratio with the historical speech reverberation energy ratio.
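The smoothing step can be sketched as a first-order recursive average. This is a minimal sketch that assumes the smoothing takes the common form "beta1 times history plus (1 - beta1) times the new value"; the exact formula is not reproduced in this text, and the function and argument names are hypothetical.

```python
def smooth_energy_ratio(prev_ratio, psd_reverberant_speech, psd_reverb, beta1):
    """Recursively smooth the first energy ratio (assumed first-order form).

    prev_ratio             -- speech reverberation energy ratio of the previous frame
    psd_reverberant_speech -- power spectral density of the reverberant speech
    psd_reverb             -- power spectral density of the reverberant component
    beta1                  -- first smoothing factor, 0 < beta1 < 1
    """
    if not 0.0 < beta1 < 1.0:
        raise ValueError("beta1 must lie strictly between 0 and 1")
    if psd_reverb <= 0.0:
        raise ValueError("psd_reverb must be positive")
    first_energy_ratio = psd_reverberant_speech / psd_reverb
    return beta1 * prev_ratio + (1.0 - beta1) * first_energy_ratio


# One update step for a single frequency bin.
ratio = smooth_energy_ratio(prev_ratio=2.0, psd_reverberant_speech=6.0,
                            psd_reverb=2.0, beta1=0.5)
print(ratio)
```

Because the weight on the first energy ratio is positive, the smoothed value rises and falls with it, which matches the positive correlation stated for the system.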

Preferably, the first power spectral density acquisition module includes:

a signal-to-noise estimate calculation module, configured to calculate the signal-to-noise estimate:

wherein $R_{d/r}$ represents the signal-to-noise estimate; $d'_{t,l}$ represents the estimated desired-signal frequency-bin amplitude; $|d'_{t,l}|^2$ represents the energy of the desired signal; $\beta_2$ represents the second smoothing factor; $R_{x/r}$ represents the speech reverberation energy ratio; and the remaining terms represent the second energy ratio and the second power spectral density of the reverberant component.
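Since the formula itself is not reproduced in this text, the sketch below substitutes the classical decision-directed estimator from the speech enhancement literature, which combines the same ingredients (the previous desired-signal energy, the reverberant-component power spectral density, a smoothing factor, and the speech reverberation energy ratio). It is an assumption for illustration, not the patent's confirmed formula, and all names are hypothetical.

```python
def snr_estimate(prev_desired_bin, psd_reverb, r_xr, beta2):
    """Decision-directed style estimate of R_d/r (assumed form, see lead-in).

    prev_desired_bin -- estimated desired-signal bin of the previous frame (complex)
    psd_reverb       -- second power spectral density (reverberant component)
    r_xr             -- speech reverberation energy ratio of the current frame
    beta2            -- second smoothing factor, 0 < beta2 < 1
    """
    if not 0.0 < beta2 < 1.0:
        raise ValueError("beta2 must lie strictly between 0 and 1")
    if psd_reverb <= 0.0:
        raise ValueError("psd_reverb must be positive")
    # Second energy ratio: desired-signal energy over reverberant-component PSD.
    second_energy_ratio = abs(prev_desired_bin) ** 2 / psd_reverb
    # Instantaneous term of the classical estimator, floored at zero.
    instantaneous = max(r_xr - 1.0, 0.0)
    return beta2 * second_energy_ratio + (1.0 - beta2) * instantaneous


r_dr = snr_estimate(prev_desired_bin=3 + 4j, psd_reverb=5.0, r_xr=3.0, beta2=0.5)
print(r_dr)
```

With a positive weight on the second energy ratio, the estimate is positively correlated with that ratio, as the system description requires.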

Preferably, the speech dereverberation module comprises:

a dereverberated speech signal calculation module, configured to apply an inverse short-time Fourier transform to the desired-signal frequency bins to obtain the dereverberated speech signal.
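The inverse short-time Fourier transform can be sketched with plain overlap-add. The example below assumes a periodic Hann analysis window with 50% overlap, for which the shifted windows sum to one, so adding the inverse DFT of each frame restores the interior samples. The window, frame length, and hop size are assumptions; the text does not specify them.

```python
import cmath
import math


def periodic_hann(n_fft):
    # Periodic Hann window: copies shifted by n_fft / 2 sum to exactly 1.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def dft(frame):
    n_fft = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)) for k in range(n_fft)]


def idft(bins):
    n_fft = len(bins)
    return [(sum(bins[k] * cmath.exp(2j * math.pi * k * n / n_fft)
                 for k in range(n_fft)) / n_fft).real for n in range(n_fft)]


def stft(signal, n_fft, hop):
    window = periodic_hann(n_fft)
    return [dft([signal[start + n] * window[n] for n in range(n_fft)])
            for start in range(0, len(signal) - n_fft + 1, hop)]


def istft(frames, n_fft, hop):
    # Overlap-add the inverse DFT of every frame at its original position.
    out = [0.0] * (hop * (len(frames) - 1) + n_fft)
    for t, bins in enumerate(frames):
        for n, sample in enumerate(idft(bins)):
            out[t * hop + n] += sample
    return out


signal = [math.sin(0.7 * n) for n in range(16)]
frames = stft(signal, n_fft=8, hop=4)
restored = istft(frames, n_fft=8, hop=4)
```

In the dereverberation system the bins passed to `istft` would be the estimated desired-signal bins rather than the unmodified analysis frames used in this round trip.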

In summary, in the embodiment of the present invention, the reverberant speech preprocessing module 1001 processes, frame by frame, the reverberant speech collected in a reverberant environment to obtain the desired signal of the current frame; the first power spectral density acquisition module 1002 obtains the speech reverberation energy ratio and the signal-to-noise estimate of the desired signal and substitutes them into the geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, yielding the first power spectral density of the desired signal; the speech dereverberation module 1003 obtains the dereverberated speech signal from the first power spectral density; and the first power spectral density update module 1004 stores the first power spectral density of the current frame as the historical first power spectral density for the next frame and updates the first power spectral density of the next frame, until all dereverberated speech signals are obtained. The embodiment of the invention can further improve the dereverberation performance of the MCLP algorithm and obtain higher-quality dereverberated speech.
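The frame-to-frame bookkeeping performed by the update module 1004 can be illustrated as follows. This sketch assumes the first power spectral density is a recursive average of the desired-signal energy with a third smoothing factor; that form, the handling of the first frame, and the name `track_desired_psd` are assumptions rather than the patent's stated formula.

```python
def track_desired_psd(desired_frames, beta3):
    """Yield the per-bin PSD of each frame, using the previous frame's PSD as history.

    desired_frames -- iterable of frames, each a list of complex desired-signal bins
    beta3          -- third smoothing factor, 0 < beta3 < 1
    """
    if not 0.0 < beta3 < 1.0:
        raise ValueError("beta3 must lie strictly between 0 and 1")
    history = None  # stored PSD of the previous frame
    for frame in desired_frames:
        energy = [abs(b) ** 2 for b in frame]
        if history is None:
            psd = energy  # assumed initialisation: no history for the first frame
        else:
            psd = [beta3 * h + (1.0 - beta3) * e for h, e in zip(history, energy)]
        history = psd  # becomes the historical PSD for the next frame
        yield psd


frames_in = [[1 + 0j, 2j], [3 + 0j, 0j]]
psd_per_frame = list(track_desired_psd(frames_in, beta3=0.5))
print(psd_per_frame)
```

Writing the loop as a generator keeps only one frame of history in memory, which suits frame-by-frame processing of a speech stream.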

It should be noted that the order of the embodiments above is for description only and does not indicate their relative merits. The foregoing describes specific embodiments of this specification; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in a different order from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In this specification, the embodiments are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.

The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A method for MCLP-based speech dereverberation, the method comprising the steps of:

obtaining a speech reverberation energy ratio and a signal-to-noise estimate of the desired signal, and substituting them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, so as to obtain a first power spectral density of the desired signal; wherein the speech reverberation energy ratio is positively correlated with a first energy ratio, and the signal-to-noise estimate is positively correlated with a second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberant component; the second energy ratio is the energy ratio of the desired speech to the reverberant component;

storing the first power spectral density of the current frame as the historical first power spectral density for the next frame, and updating the first power spectral density of the next frame, until all dereverberated speech signals are obtained;

wherein substituting into the geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech to obtain the first power spectral density of the desired signal comprises:

calculating the speech reverberation energy ratio, the speech reverberation energy ratio of the current frame being obtained by smoothing the first energy ratio with the historical speech reverberation energy ratio according to the following formula:

wherein $R_{x/r}$ represents the speech reverberation energy ratio; $\beta_1$ represents the first smoothing factor, $0 < \beta_1 < 1$; and the remaining terms represent the first energy ratio, the third power spectral density of the reverberant speech signal, and the second power spectral density of the reverberant component;

calculating the signal-to-noise estimate according to the following formula:

wherein $R_{d/r}$ represents the signal-to-noise estimate; $\hat{d}'_{t,l}$ represents the estimated desired-signal frequency-bin amplitude; $|\hat{d}'_{t,l}|^2$ represents the energy of the desired signal; $\beta_2$ represents the second smoothing factor, $0 < \beta_2 < 1$; and the remaining term represents the second energy ratio;

wherein the reverberant-speech term represents the frequency-bin component of the reverberant speech received by the m-th microphone channel at the l-th frequency bin of frame t, and M represents the number of microphones;

after $\hat{d}'_{t,l}$ is obtained, it is substituted into the formula for the signal-to-noise estimate to calculate $R_{d/r}$ of the next frame;

the first power spectral density of the desired signal is obtained from the desired-signal frequency-bin amplitude according to the following formula:

wherein the formula yields the first power spectral density of the desired signal, and $\beta_3$ is the third smoothing factor, $0 < \beta_3 < 1$;

The step of obtaining the dereverberated voice signal comprises the following steps:

2. The method of claim 1, wherein the step of acquiring the desired signal comprises:

3. An MCLP-based speech dereverberation system, the system comprising:

a first power spectral density acquisition module, configured to obtain a speech reverberation energy ratio and a signal-to-noise estimate of the desired signal, and to substitute them into a geometric spectral subtraction formula to perform spectral subtraction on the reverberant speech, so as to obtain a first power spectral density of the desired signal; wherein the speech reverberation energy ratio is positively correlated with a first energy ratio, and the signal-to-noise estimate is positively correlated with a second energy ratio; the first energy ratio is the energy ratio of the reverberant speech to the reverberant component; the second energy ratio is the energy ratio of the desired speech to the reverberant component;

a first power spectral density update module, configured to store the first power spectral density of the current frame as the historical first power spectral density for the next frame, and to update the first power spectral density of the next frame, until all dereverberated speech signals are obtained;

The speech dereverberation module comprises:

4. The system of claim 3, wherein the reverberant speech preprocessing module includes: