CN112309421A - Speech enhancement method and system fusing signal-to-noise ratio and intelligibility dual targets - Google Patents


Info

Publication number
CN112309421A
CN112309421A (application CN201910689178.4A)
Authority
CN
China
Prior art keywords
frequency domain
characteristic
signal
ith
effective
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910689178.4A
Other languages
Chinese (zh)
Other versions
CN112309421B (en)
Inventor
Zhang Pengyuan (张鹏远)
Zhan Ge (战鸽)
Yan Yonghong (颜永红)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201910689178.4A
Publication of CN112309421A
Application granted
Publication of CN112309421B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L21/0232: Processing in the frequency domain
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention belongs to the technical field of speech enhancement signal processing, and in particular relates to a speech enhancement method fusing the dual targets of signal-to-noise ratio and intelligibility, comprising the following steps: converting an original speech signal into an original time-frequency domain feature; inputting the original time-frequency domain feature into a pre-established first neural network model to obtain a first effective feature optimized for signal-to-noise ratio; inputting the original time-frequency domain feature into a pre-established second neural network model to obtain a second effective feature optimized for intelligibility; processing the first and second effective features to obtain a weight matrix; selecting, column by column according to a preset correlation weight threshold, the elements of the weight matrix that are highly correlated with the first effective feature; extracting the values of the second effective feature at those positions and substituting them for the values at the corresponding positions in the first effective feature; taking the first effective feature after substitution as the speech-enhanced time-frequency domain feature; and converting the speech-enhanced time-frequency domain feature into an enhanced speech signal.

Description

Speech enhancement method and system fusing signal-to-noise ratio and intelligibility dual targets
Technical Field
The invention belongs to the technical field of speech enhancement signal processing, and in particular relates to a speech enhancement method and system fusing the dual targets of signal-to-noise ratio and intelligibility.
Background
When a speech signal is corrupted by noise, its quality and intelligibility degrade, which harms the user experience of speech recognition and speech perception processing built on that signal. Current speech enhancement methods commonly separate the spectral components of the speech signal from the noise by estimating a mask for the speech signal. Such methods generally estimate the mask under the minimum mean square error criterion, classify the time-frequency components of the noisy signal, discard the components masked by noise, and retain the components in which the speech signal energy is dominant. The separated speech components carry the important speech information and are typically used for subsequent speech recognition and speech perception processing. However, the minimum mean square error criterion is not directly related to the human perceptual mechanism for speech, and it does not distinguish between noise and speech distributed over different segments of the noisy signal; it is therefore not optimal either for suppressing residual noise or for improving the listening quality and intelligibility of the speech signal. Consequently, speech enhancement methods that directly target the two objectives of suppressing residual noise and improving listening quality and intelligibility are of unique importance in both research and application.
Present-stage speech enhancement technology mainly generates a mask optimized under the minimum mean square error criterion from the time-frequency characteristics of the speech signal, and obtains the speech components by combining the mask with those time-frequency characteristics. Such enhancement can only strike a balance between suppressing residual noise and improving listening quality and intelligibility, and its inability to express the speech signal accurately in segments where speech components are present hinders any improvement in intelligibility. At the same time, the mean square error mixes the error on the speech components with the error caused by residual noise and cannot express the latter accurately, so the enhanced speech signal is not optimal in the signal-to-noise-ratio sense.
With the wide application of deep neural networks across signal processing fields such as image and speech, training criteria other than minimum mean square error are attracting increasing attention. For the various optimization targets faced by existing speech enhancement methods, no single training criterion can comprehensively capture the errors under all of them; in general only a balance between noise suppression and improvement of auditory quality and intelligibility can be reached, and the enhancement result is optimal in neither respect.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing speech enhancement methods, and provides a speech enhancement method fusing the dual targets of signal-to-noise ratio and intelligibility: different neural networks are trained under different training criteria, and the results obtained under the multiple optimization targets are then fused into a new enhanced speech signal. The method overcomes the limited signal-to-noise-ratio and intelligibility gains of existing speech enhancement methods.
In order to achieve the above object, the present invention provides a speech enhancement method for fusing dual targets of signal-to-noise ratio and intelligibility, the method comprising:
extracting original time-frequency domain characteristics from an original voice signal;
inputting the original time-frequency domain feature into a pre-established first neural network model to obtain a first effective feature optimized for signal-to-noise ratio, i.e. having the advantage of a high signal-to-noise ratio;
inputting the original time-frequency domain feature into a pre-established second neural network model to obtain a second effective feature optimized for intelligibility, i.e. having the advantage of high intelligibility;
processing the first effective feature and the second effective feature to obtain a weight matrix, selecting, column by column according to a preset correlation weight threshold, the elements of the weight matrix that are highly correlated with the first effective feature, extracting the values of the second effective feature at those positions, substituting them for the values at the corresponding positions in the first effective feature, and taking the first effective feature after substitution as the speech-enhanced time-frequency domain feature.
As one improvement of the above technical solution, the original speech signal is converted into an original time-frequency domain characteristic; the method specifically comprises the following steps:
performing framing and windowing processing on an original voice signal to obtain a processed voice signal;
performing Fourier transform on the processed voice signal to obtain a Fourier transform coefficient matrix with the height of H and the width of T1;
and taking the absolute value of the obtained Fourier coefficient matrix to obtain the original time-frequency domain feature corresponding to the original speech signal.
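The three steps above (framing and windowing, Fourier transform, magnitude) can be sketched in Python with NumPy. The frame length, hop size, and Hann window below are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Frame and window the signal, apply the Fourier transform, and take
    absolute values.  frame_len, hop, and the Hann window are illustrative
    choices, not values specified by the patent."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft over each frame; transpose so height H = frame_len // 2 + 1
    # (frequency bins) and width T1 = n_frames, as in the text.
    return np.abs(np.fft.rfft(frames, axis=1)).T

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s synthetic tone
feat = stft_magnitude(x)
print(feat.shape)  # (257, 61)
```

The resulting H x T1 magnitude matrix plays the role of the original time-frequency domain feature in the following steps.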
As one improvement of the above technical solution, the original time-frequency domain features are input into a pre-established first neural network model, and a first effective feature with a signal-to-noise ratio is obtained; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established first neural network model, and acquiring a first floating value mask through forward calculation of the first neural network;
and multiplying the obtained first floating-value mask and the original time-frequency domain feature point-to-point to obtain the first effective feature optimized for signal-to-noise ratio.
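The mask application can be sketched as follows. The mask here is random rather than produced by a trained network, purely to illustrate the point-to-point product; the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T1 = 257, 100                                  # illustrative feature dimensions
feat = np.abs(rng.standard_normal((H, T1)))       # original time-frequency feature

# Stand-in for the network's forward computation: a floating-value
# mask in (0, 1), one value per time-frequency bin.
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal((H, T1))))

first_effective = mask * feat                     # point-to-point product
print(first_effective.shape)  # (257, 100)
```

Because every mask value lies in (0, 1), each bin of the first effective feature is attenuated relative to the original feature.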
As an improvement of the above technical solution, the pre-established first neural network model specifically includes:
step S11) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S12), according to the ith original time-frequency domain characteristic, obtaining an ith first effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S13), calculating the mean square error according to the ith first effective characteristic and the ith reference time-frequency domain characteristic to obtain the ith mean square error;
step S14), squaring the ith reference time-frequency domain feature and averaging, then taking the ratio of this average to the ith mean square error, to obtain the ith signal-to-noise ratio for this training round together with the optimal weight coefficients of each layer after training;
step S15), according to the optimal weight coefficients, calculating the signal-to-noise-ratio-based error between the output value of the deep neural network and the reference time-frequency domain feature, obtaining the signal-to-noise ratio error;
step S16), judging whether the obtained signal-to-noise ratio error is smaller than a preset threshold; if so, continuing to the next step; otherwise, returning to step S12);
step S17), determining the current model to be the first neural network model.
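The signal-to-noise ratio criterion of steps S13)-S14), i.e. the mean squared reference energy divided by the mean squared error, can be sketched as a loss computation. The gradient-based weight update itself is omitted, and the array shapes and values are illustrative:

```python
import numpy as np

def snr_for_training(estimate, reference):
    """Steps S13)-S14): the ith mean square error, and the ith
    signal-to-noise ratio as mean squared reference energy over MSE.
    A real implementation would maximise this ratio (e.g. minimise its
    negative logarithm) by backpropagation; this is a sketch only."""
    mse = np.mean((estimate - reference) ** 2)    # step S13)
    snr = np.mean(reference ** 2) / mse           # step S14)
    return mse, snr

ref = np.ones((257, 61))      # stand-in reference time-frequency feature
est = ref + 0.1               # network output with a constant 0.1 error
mse, snr = snr_for_training(est, ref)
print(round(mse, 4), round(snr, 1))  # 0.01 100.0
```

Shrinking the error on the speech components directly raises this ratio, which is what makes the first effective feature optimal in the signal-to-noise-ratio sense.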
As one improvement of the technical scheme, the original time-frequency domain characteristics are input into a pre-established second neural network model to obtain second effective characteristics with intelligibility; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established second neural network model, and obtaining a second floating value mask through forward calculation of the second neural network;
and multiplying the second floating-value mask and the original time-frequency domain feature point-to-point to obtain the second effective feature optimized for intelligibility.
As an improvement of the above technical solution, the pre-established second neural network model specifically includes:
step S21) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S22), according to the ith original time-frequency domain characteristic, obtaining an ith second effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S23), calculating a norm column by column for the ith reference time-frequency domain feature and, according to a preset norm threshold, obtaining the indices of the columns whose norm exceeds the threshold;
step S24), according to the obtained column indices, extracting the corresponding columns from the ith second effective feature and from the reference time-frequency domain feature, obtaining the ith second effective speech component feature and the reference time-frequency domain speech component feature respectively, forming two matrices of height H and width T2;
step S25), for each of the H rows, sliding a window of preset time length m over the ith second effective speech component feature and the reference time-frequency domain speech component feature, calculating the correlation coefficient between the elements of the two windows at corresponding positions, and obtaining a first correlation coefficient matrix of height H and width T2-m+1;
step S26), for each of the T2 columns, sliding a window of preset frequency length n over the ith second effective speech component feature and the reference time-frequency domain speech component feature, calculating the correlation coefficient between the elements of the two windows at corresponding positions, and obtaining a second correlation coefficient matrix of height H-n+1 and width T2;
step S27), averaging the first correlation coefficient matrix and the second correlation coefficient matrix, obtaining the ith intelligibility for this training round together with the optimal weight coefficients of each layer after training;
step S28), according to the optimal weight coefficients, calculating the intelligibility-based error between the output value of the deep neural network and the reference time-frequency domain feature;
step S29), judging whether the intelligibility error is smaller than a preset threshold; if so, continuing to the next step; otherwise, returning to step S22);
step S30), determining the current model to be the second neural network model.
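The intelligibility criterion of steps S23)-S27) can be sketched as follows: keep the speech-active columns, correlate sliding windows along time and along frequency, then average. The window lengths m and n and the norm threshold are illustrative assumptions, and the gradient update is omitted:

```python
import numpy as np

def corrcoef(a, b):
    """Correlation coefficient between two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12)

def intelligibility(estimate, reference, m=4, n=4, norm_threshold=1.0):
    """Simplified sketch of steps S23)-S27); m, n, and the norm
    threshold are illustrative, not values fixed by the patent."""
    active = np.linalg.norm(reference, axis=0) > norm_threshold   # step S23)
    est, ref = estimate[:, active], reference[:, active]          # step S24)
    H, T2 = ref.shape
    time_corr = np.array([[corrcoef(est[h, t:t + m], ref[h, t:t + m])
                           for t in range(T2 - m + 1)]
                          for h in range(H)])                     # step S25)
    freq_corr = np.array([[corrcoef(est[h:h + n, t], ref[h:h + n, t])
                           for t in range(T2)]
                          for h in range(H - n + 1)])             # step S26)
    return 0.5 * (time_corr.mean() + freq_corr.mean())            # step S27)

rng = np.random.default_rng(0)
ref = np.abs(rng.standard_normal((16, 20))) + 0.5   # reference speech feature
score = intelligibility(ref, ref)                   # perfect estimate
print(round(score, 3))  # 1.0
```

A perfect estimate scores 1; distortion of the speech components lowers the window-level correlations and hence the score, which is why this criterion favours intelligibility over raw noise suppression.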
As an improvement of the foregoing technical solution, the acquiring a speech signal after speech enhancement specifically includes:
sliding a preset window of height H and width 2n+1 (2n+1 < T1) over the first effective feature and the second effective feature column by column, covering columns t-n to t+n centred on the t-th column, to obtain a first local feature of height H and width 2n+1 and a second local feature of height H and width 2n+1;
calculating a mean and a standard deviation for the first local feature and the second local feature respectively, and normalising each column of the two local features, obtaining a first normalised feature and a second normalised feature;
multiplying the first normalised feature and the second normalised feature point-to-point to obtain the correlation weight coefficients, of height H and width 2n+1, corresponding to the current window in the weight matrix; the correlation weight coefficients indicate the elements of the second effective feature that are highly correlated with the first effective feature;
selecting, row by row according to a preset correlation weight threshold, the positions in the (n+1)-th column of the correlation weight coefficients whose values exceed the threshold, obtaining a position index;
and according to the obtained position index, extracting the values at the corresponding positions in the t-th column of the second effective feature, substituting them for the values at the corresponding positions in the t-th column of the first effective feature, taking the first effective feature after substitution as the speech-enhanced time-frequency domain feature, and converting it into an enhanced speech signal.
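The fusion procedure above can be sketched as follows. Normalising each window column to zero mean and unit variance is one plausible reading of the claimed column-by-column normalisation, and the half-width n and weight threshold are illustrative assumptions:

```python
import numpy as np

def fuse(first, second, n=2, weight_threshold=0.5):
    """Sketch of the fusion step: for each column t, take a local window
    of width 2n+1 from both features, normalise each column, multiply
    point-to-point, and where the centre column's weight exceeds the
    threshold, substitute the second feature's value into the first.
    n and the threshold are illustrative, not values from the patent."""
    H, T = first.shape
    fused = first.copy()
    for t in range(n, T - n):
        w1 = first[:, t - n:t + n + 1]
        w2 = second[:, t - n:t + n + 1]
        z1 = (w1 - w1.mean(axis=0)) / (w1.std(axis=0) + 1e-12)
        z2 = (w2 - w2.mean(axis=0)) / (w2.std(axis=0) + 1e-12)
        weight = z1 * z2                            # H x (2n+1) weights
        replace = weight[:, n] > weight_threshold   # centre ((n+1)-th) column
        fused[replace, t] = second[replace, t]
    return fused

rng = np.random.default_rng(0)
first = np.abs(rng.standard_normal((8, 12)))    # SNR-optimal feature
second = np.abs(rng.standard_normal((8, 12)))   # intelligibility-optimal feature
out = fuse(first, second)
print(out.shape)  # (8, 12)
```

The fused matrix keeps the low-noise bins of the first effective feature and only imports bins from the second effective feature where the two agree strongly, matching the stated intent of preserving the signal-to-noise ratio while recovering intelligibility.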
The invention also provides a speech enhancement system fusing the dual targets of signal-to-noise ratio and intelligibility, which specifically comprises:
the feature conversion module is used for converting the original voice signal into the original time-frequency domain feature;
the first acquisition module, configured to input the original time-frequency domain feature into a pre-established first neural network model to obtain the first effective feature optimized for signal-to-noise ratio, i.e. having the advantage of a high signal-to-noise ratio;
the second acquisition module, configured to input the original time-frequency domain feature into a pre-established second neural network model to obtain the second effective feature optimized for intelligibility, i.e. having the advantage of high intelligibility; and
the enhancement module, configured to process the first effective feature and the second effective feature to obtain a weight matrix, select, column by column according to a preset correlation weight threshold, the elements highly correlated with the first effective feature, extract the values of the second effective feature at those positions, substitute them for the values at the corresponding positions in the first effective feature, take the first effective feature after substitution as the speech-enhanced time-frequency domain feature, and convert it into an enhanced speech signal.
Compared with the prior art, the invention has the beneficial effects that:
the method of the invention trains different deep neural network models respectively according to the optimal criterion of the signal-to-noise ratio and the optimal criterion of the intelligibility, fuses the double optimization targets of the signal-to-noise ratio and the intelligibility, and achieves the effect of voice enhancement; in the time-frequency domain characteristics after the voice enhancement, on one hand, the optimal attribute of the signal-to-noise ratio of the first effective characteristic is kept, and the noise residue is suppressed; and on the other hand, the optimal intelligibility attribute of the second effective characteristic is fused, and the intelligibility of the corresponding voice signal is improved.
Drawings
FIG. 1 is a schematic flow chart of a speech enhancement method for merging dual targets of SNR and intelligibility according to the present invention;
FIG. 2 is a schematic flow chart of a speech enhancement method for merging dual targets of SNR and intelligibility according to the present invention;
FIG. 3 is a schematic flow chart of a first neural network model training method of a speech enhancement method with a target combining signal-to-noise ratio and intelligibility according to the present invention;
FIG. 4 is a schematic flow chart of a second neural network model training method of the speech enhancement method with the fusion of the SNR and the intelligibility targets of the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
The invention provides a speech enhancement method fusing the dual targets of signal-to-noise ratio and intelligibility. The original time-frequency domain feature is enhanced by two pre-established neural network models, yielding time-frequency domain speech components that are optimal in the signal-to-noise-ratio sense and in the intelligibility sense, which form the first effective feature and the second effective feature respectively. The first and second effective features are normalised column by column and multiplied point-to-point to obtain a weight matrix. Positions whose value in the weight matrix exceeds a preset weight threshold are selected, and the values at the corresponding positions in the second effective feature matrix replace those in the first effective feature matrix, giving the final enhanced time-frequency domain feature. In the first effective feature the noise component is suppressed to a minimum, the signal-to-noise ratio of the time-frequency domain speech is optimal, and the speech components carry some distortion; in the second effective feature the noise component is higher than in the first, the intelligibility of the time-frequency domain speech is optimal, and the speech components are more accurate.
As shown in fig. 1 and 2, the method includes:
step 110) extracting original time-frequency domain characteristics from an original voice signal;
specifically, the form of the original time-frequency domain feature is optional, and the original speech signal is converted into the original time-frequency domain feature by using the magnitude spectrum, which may specifically adopt the following steps:
step 1101) framing and windowing the original voice signal to obtain a processed voice signal;
step 1102) performing Fourier transform on the processed voice signal to obtain a Fourier transform coefficient matrix with the height of H and the width of T1;
step 1103) taking an absolute value of the obtained Fourier coefficient matrix, and obtaining original time-frequency domain characteristics corresponding to the original voice signals.
Step 120) inputting the original time-frequency domain feature into the pre-established first neural network model to obtain the first effective feature optimized for signal-to-noise ratio, i.e. having the advantage of a high signal-to-noise ratio. Specifically:
step 1201) inputting the original time-frequency domain feature into the pre-established first neural network model, and obtaining a first floating-value mask through forward computation of the first neural network;
step 1202) multiplying the obtained first floating-value mask and the original time-frequency domain feature point-to-point to obtain the first effective feature optimized for signal-to-noise ratio.
As shown in fig. 3, the ith sample speech signal is taken as the reference signal, a noise signal is added to construct the ith original speech signal corresponding to it, and the ith sample speech signal and the ith original speech signal are converted into the ith reference time-frequency domain feature and the ith original time-frequency domain feature respectively; a deep neural network model is then trained under the criterion of optimal signal-to-noise ratio to obtain the first neural network model. The first neural network model reduces the noise proportion in the original time-frequency domain feature and thereby improves signal quality; training it requires a large amount of sample data in advance. Obtaining the first neural network model specifically comprises:
step 1203) setting initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step 1204) according to the ith original time-frequency domain characteristic, obtaining an ith first effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step 1205) calculating a mean square error according to the ith first effective characteristic and the ith reference time-frequency domain characteristic to obtain an ith mean square error;
step 1206) squaring and averaging the ith reference time-frequency domain characteristic, and taking a ratio of the squared and averaged characteristics to the obtained ith mean square error to obtain the ith signal-to-noise ratio for the training and the optimal weight coefficient of each layer after the training;
step 1207) calculating the error between the output value of the deep neural network and the reference time-frequency domain characteristic based on the signal-to-noise ratio according to the optimal weight coefficient, and acquiring the signal-to-noise ratio error;
step 1208) judging whether the signal-to-noise ratio error obtained in step 1207) is smaller than a preset threshold; if so, continuing to the next step; otherwise, returning to step 1204);
step 1209) determining the current model to be the first neural network model.
The deep neural network is trained repeatedly on the basis of the ith signal-to-noise ratio of each round and the optimal per-layer weight coefficients after training, until the weight coefficients that make the signal-to-noise ratio of the output speech signal optimal are obtained; once the signal-to-noise-ratio-based error between the network output and the reference time-frequency domain feature, computed with these weight coefficients, is smaller than the preset threshold, the current model is determined to be the first neural network model.
Step 130) inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility; the second effective characteristics improve intelligibility, i.e. they have the advantage of high intelligibility. Specifically:
step 1301) inputting the original time-frequency domain characteristics into a pre-established second neural network model by taking the original time-frequency domain characteristics as input, and acquiring a second floating value mask through forward calculation of the second neural network;
step 1302) multiplying the second floating-value mask by the original time-frequency domain feature point-to-point to obtain a second effective feature with intelligibility;
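The two steps above can be sketched as follows; this is a minimal illustration in NumPy, in which `network` and the toy stand-in `toy_network` are hypothetical placeholders, not names from the specification:

```python
import numpy as np

def mask_enhance(network, original_tf):
    """Sketch of steps 1301)-1302): a forward pass through an already
    trained network yields a floating-value mask of the same shape as
    the original time-frequency feature, which is then multiplied with
    that feature point to point.  `network` is any callable mapping the
    feature to a mask in [0, 1]."""
    mask = network(original_tf)        # step 1301): forward calculation
    return mask * original_tf          # step 1302): point-to-point product

# toy stand-in for a trained network: pass 70 % of every time-frequency bin
toy_network = lambda x: np.full_like(x, 0.7)
```

The same pattern applies to the first neural network model of step 120), only the training criterion differs.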
as shown in fig. 4, the ith sample voice signal is used as a reference signal and a noise signal is added to construct the ith original voice signal corresponding to it; the ith sample voice signal and the ith original voice signal are converted into the ith reference time-frequency domain feature and the ith original time-frequency domain feature respectively, and the deep neural network is trained with optimal intelligibility as the criterion, yielding the second deep neural network model. The second deep neural network model raises the proportion of speech in the original time-frequency domain features and thereby improves intelligibility; a large amount of sample data must be prepared in advance for this training. Obtaining the second neural network model specifically comprises:
step 1303) setting initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step 1304) according to the ith original time-frequency domain characteristic, obtaining an ith second effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step 1305) calculating the norm of each column of the ith reference time-frequency domain characteristic, and, according to a preset norm threshold, acquiring the column indexes whose norms are higher than the threshold;
step 1306) obtaining specific columns in the ith second effective characteristic and the reference time-frequency domain characteristic according to the obtained column indexes, and respectively obtaining the ith second effective voice component characteristic and the reference time-frequency domain voice component characteristic to form two matrixes with the height of H and the width of T2;
step 1307) according to the preset time window length m, sliding windows point by point along each of the H rows of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a first correlation coefficient matrix with the height of H and the width of T2-m+1;
step 1308) according to the preset frequency window length n, sliding windows point by point along each of the T2 columns of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a second correlation coefficient matrix with the height of H-n+1 and the width of T2;
step 1309) calculating the average of the first correlation coefficient matrix and the second correlation coefficient matrix respectively, and obtaining the i-th intelligibility for the training and the optimal weight coefficient of each layer after the training;
step 1310) calculating an error based on intelligibility between an output value of the deep neural network and the reference time-frequency domain characteristic according to the optimal weight coefficient;
step 1311) determining whether the intelligibility error is less than a preset threshold; if the intelligibility error is less than a preset threshold value, continuing to execute downwards; if the intelligibility error is not less than the preset threshold, returning to the step 1304) and continuing to execute;
step 1312) determines the current model to be a second neural network model.
The deep neural network is trained repeatedly on the basis of the ith intelligibility obtained in each round and the weight coefficients of each layer after training, until the weight coefficients that make the intelligibility of the output voice signal optimal are obtained; with these weight coefficients, the error between the output value of the deep neural network and the reference time-frequency domain characteristic, calculated on the basis of intelligibility, is smaller than the preset threshold value, and the current model is determined to be the second neural network model.
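For illustration, the row-wise part of the intelligibility measure (steps 1305)-1307) and 1309)) can be sketched as below. The median-based choice of the norm threshold and the function name are assumptions of this sketch (the patent only specifies a preset threshold), and the column-wise pass of step 1308) is analogous with the window sliding along columns instead of rows:

```python
import numpy as np

def windowed_row_correlation(est, ref, m):
    """Keep the columns of `ref` whose L2 norm exceeds a threshold
    (speech-dominant frames, steps 1305-1306), then slide a length-m
    window along each of the H rows of both matrices and average the
    Pearson correlation of corresponding windowed segments (steps
    1307 and 1309).  Yields a STOI-style score in [-1, 1]."""
    norms = np.linalg.norm(ref, axis=0)          # step 1305): column-by-column norms
    keep = norms > np.median(norms)              # assumed threshold: the median norm
    est_s, ref_s = est[:, keep], ref[:, keep]    # step 1306): speech-component columns
    H, T2 = ref_s.shape
    coeffs = []
    for h in range(H):                           # step 1307): row-wise sliding windows
        for t in range(T2 - m + 1):
            a = est_s[h, t:t + m] - est_s[h, t:t + m].mean()
            b = ref_s[h, t:t + m] - ref_s[h, t:t + m].mean()
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            coeffs.append(a @ b / denom if denom > 0 else 0.0)
    return float(np.mean(coeffs))                # step 1309): average of coefficients
```

When the estimate equals the reference, every windowed correlation is 1 and the score is 1, its maximum.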
Step 140) processing the first effective feature and the second effective feature to obtain a weight matrix; selecting from the weight matrix, column by column and according to a preset correlation weight threshold, the elements having high correlation with the first effective feature; extracting the values of those elements and replacing the values at the corresponding positions in the first effective feature; taking the replaced first effective feature as the time-frequency domain feature after speech enhancement, which achieves an enhancement effect fusing the dual optimization targets of signal-to-noise ratio and intelligibility; and converting the time-frequency domain feature after speech enhancement into an enhanced speech signal to complete the speech enhancement. Specifically:
step 1401) sliding a preset window with the height of H and the width of 2n+1 (2n+1 < T1) over the first effective feature and the second effective feature column by column, the window being centred on the t-th column and covering the (t-n)-th to the (t+n)-th columns, to respectively obtain a first local feature with the height of H and the width of 2n+1 and a second local feature with the height of H and the width of 2n+1;
step 1402) calculating a mean value and a standard deviation of the first local feature and of the second local feature respectively, and normalizing each of them column by column to zero mean and unit deviation, obtaining a first normalized feature and a second normalized feature;
step 1403) multiplying the first normalized feature and the second normalized feature point to point, and obtaining the correlation weight coefficients with the height of H and the width of 2n+1 corresponding to the current window in the weight matrix; the correlation weight coefficients identify the elements of the second effective feature that have high correlation with the first effective feature;
step 1404) selecting, row by row according to the preset correlation weight threshold, the positions in the (n+1)-th column of the correlation weight coefficients whose values are higher than the threshold, and obtaining position indexes;
step 1405) extracting, according to the obtained position indexes, the values at the corresponding positions in the t-th column of the second effective feature and replacing the values at the corresponding positions in the t-th column of the first effective feature; taking the replaced first effective feature as the time-frequency domain feature after voice enhancement, and converting the time-frequency domain feature after voice enhancement into an enhanced voice signal to complete the voice enhancement.
In the time-frequency domain features after the voice enhancement, on one hand, the optimal attribute of the signal-to-noise ratio of the first effective feature is kept, and the noise residue is suppressed; and on the other hand, the optimal intelligibility attribute of the second effective characteristic is fused, and the intelligibility of the corresponding voice signal is improved.
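A minimal sketch of the fusion in steps 1401)-1405), assuming NumPy arrays and a small constant added to the standard deviation for numerical safety; the handling of boundary frames (the first and last n columns are simply left unchanged here) is a choice of this sketch that the patent does not specify:

```python
import numpy as np

def fuse_features(first, second, n, weight_thresh):
    """For each frame t, take a local window of width 2n+1 around t in
    both features (step 1401), normalize each window column-wise to zero
    mean and unit deviation (step 1402), multiply point to point to get
    correlation weights (step 1403), and where the weight in the centre
    column exceeds `weight_thresh`, copy the corresponding element of the
    second (intelligibility-optimal) feature into the first
    (SNR-optimal) one (steps 1404-1405)."""
    H, T1 = first.shape
    fused = first.copy()
    for t in range(n, T1 - n):
        w1 = first[:, t - n:t + n + 1]                    # step 1401): local windows
        w2 = second[:, t - n:t + n + 1]
        z1 = (w1 - w1.mean(0)) / (w1.std(0) + 1e-12)      # step 1402): normalize
        z2 = (w2 - w2.mean(0)) / (w2.std(0) + 1e-12)
        weights = z1 * z2                                 # step 1403): correlation weights
        idx = weights[:, n] > weight_thresh               # step 1404): centre column
        fused[idx, t] = second[idx, t]                    # step 1405): replace elements
    return fused
```

When the two inputs coincide, every replacement copies an identical value, so the output equals the first feature; in general the output keeps the SNR-optimal background of the first feature while importing highly correlated speech elements from the second.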
The invention also provides a speech enhancement system fusing the dual targets of signal-to-noise ratio and intelligibility, which is realized by the method, and specifically comprises the following steps:
the feature conversion module is used for converting the original voice signal into the original time-frequency domain feature;
the first acquisition module is used for inputting the original time-frequency domain characteristics into a pre-established first neural network model to acquire first effective characteristics with signal-to-noise ratio;
the second obtaining module is used for inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility; and
the enhancement module is used for processing the first effective characteristic and the second effective characteristic to obtain a weight matrix, selecting elements with high correlation with the first effective characteristic from the weight matrix column by column according to a preset correlation weight threshold, extracting the values of the elements, replacing the values at the corresponding positions in the first effective characteristic, taking the replaced first effective characteristic as the time-frequency domain characteristic after speech enhancement, thereby achieving an enhancement effect fusing the dual optimization targets of signal-to-noise ratio and intelligibility, and converting the time-frequency domain characteristic after speech enhancement into an enhanced speech signal.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method of speech enhancement incorporating signal-to-noise ratio and intelligibility, the method comprising:
converting an original voice signal into an original time-frequency domain characteristic;
inputting the original time-frequency domain characteristics into a pre-established first neural network model to obtain first effective characteristics with signal-to-noise ratio;
inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility;
processing the first effective characteristic and the second effective characteristic to obtain a weight matrix, selecting elements with high correlation with the first effective characteristic from the weight matrix column by column according to a preset correlation weight threshold, extracting the values of the elements, replacing the values at the corresponding positions in the first effective characteristic with the elements, taking the replaced first effective characteristic as the time-frequency domain characteristic after voice enhancement, and converting the time-frequency domain characteristic after voice enhancement into an enhanced voice signal.
2. The method of claim 1, wherein the converting the original speech signal into original time-frequency domain features; the method specifically comprises the following steps:
performing framing and windowing processing on an original voice signal to obtain a processed voice signal;
performing Fourier transform on the processed voice signal to obtain a Fourier transform coefficient matrix with the height of H and the width of T1;
and taking an absolute value of the obtained Fourier coefficient matrix, and obtaining the original time-frequency domain characteristics corresponding to the original voice signals.
3. The method of claim 1, wherein the raw time-frequency domain features are input into a pre-established first neural network model to obtain a first valid feature with a signal-to-noise ratio; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established first neural network model, and acquiring a first floating value mask through forward calculation of the first neural network;
and multiplying the obtained first floating value mask and the original time-frequency domain characteristic point to obtain a first effective characteristic with a signal-to-noise ratio.
4. The method according to claim 3, wherein the pre-established first neural network model specifically comprises:
step S11) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S12), according to the ith original time-frequency domain characteristic, obtaining an ith first effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S13), calculating the mean square error according to the ith first effective characteristic and the ith reference time-frequency domain characteristic to obtain the ith mean square error;
step S14), squaring and averaging the elements of the ith reference time-frequency domain characteristic, and taking the ratio of the resulting average power to the obtained ith mean square error, so as to obtain the ith signal-to-noise ratio for the training and the optimal weight coefficient of each layer after the training;
step S15), according to the optimal weight coefficient, calculating the error between the output value of the deep neural network and the reference time-frequency domain characteristic based on the signal-to-noise ratio, and acquiring the signal-to-noise ratio error;
step S16), judging whether the obtained signal-to-noise ratio error is smaller than a preset threshold value; if the signal-to-noise ratio error is smaller than a preset threshold value, continuing to execute downwards; if the signal-to-noise ratio error is not less than the preset threshold value, returning to the step S12) and continuing to execute;
step S17) determines the current model to be the first deep neural network model.
5. The method of claim 1, wherein the original time-frequency domain features are input into a pre-established second neural network model to obtain a second valid feature with intelligibility; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established second neural network model, and obtaining a second floating value mask through forward calculation of the second neural network;
and multiplying the second floating value mask and the original time-frequency domain feature point to obtain a second effective feature with intelligibility.
6. The method of claim 5, wherein the pre-established second neural network model specifically comprises:
step S21) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S22), according to the ith original time-frequency domain characteristic, obtaining an ith second effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S23), calculating the norm column by column according to the ith reference time-frequency domain characteristic, and acquiring a column index of which the norm is higher than the norm threshold according to the norm column by column and a preset norm threshold;
step S24), according to the obtained column index, obtaining the ith second effective characteristic and a specific column in the reference time-frequency domain characteristic, respectively obtaining the ith second effective voice component characteristic and the reference time-frequency domain voice component characteristic, and forming two matrixes with the height of H and the width of T2;
step S25), according to the preset time window length m, sliding windows point by point along each of the H rows of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a first correlation coefficient matrix with the height of H and the width of T2-m+1;
step S26), according to the preset frequency window length n, sliding windows point by point along each of the T2 columns of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a second correlation coefficient matrix with the height of H-n+1 and the width of T2;
step S27), calculating the average of the first correlation coefficient matrix and the second correlation coefficient matrix respectively, and obtaining the optimal weight coefficient of each layer after the i-th intelligibility of the training and the training;
step S28), calculating an error based on intelligibility between the output value of the deep neural network and the reference time-frequency domain characteristic according to the optimal weight coefficient;
step S29) judging whether the intelligibility error is less than a preset threshold value; if the intelligibility error is less than a preset threshold value, continuing to execute downwards; if the intelligibility error is not less than the preset threshold, returning to the step S22), and continuing to execute;
step S30) determines the current model to be a second neural network model.
7. The method according to claim 1, wherein the obtaining the speech signal after speech enhancement specifically comprises:
sliding a preset sliding window with the height of H and the width of 2n+1 over the first effective feature and the second effective feature column by column, the window being centred on the t-th column and covering the (t-n)-th to the (t+n)-th columns, and respectively obtaining a first local feature with the height of H and the width of 2n+1 and a second local feature with the height of H and the width of 2n+1;
respectively calculating a mean value and a standard deviation according to the first local feature and the second local feature, and respectively carrying out unit column-by-column normalization on the first local feature and the second local feature to obtain a first normalized feature and a second normalized feature;
point-to-point multiplication is carried out according to the first normalization characteristic and the second normalization characteristic, and a correlation weight coefficient with the height of H and the width of 2n +1 corresponding to the current sliding window in the weight matrix is obtained; wherein, the correlation weight coefficient is an element of the second effective characteristic which has high correlation with the first effective characteristic;
selecting, row by row according to the preset correlation weight threshold, the positions in the (n+1)-th column of the correlation weight coefficients whose values are higher than the threshold, to obtain position indexes;
and according to the acquired position indexes, extracting the values at the corresponding positions in the t-th column of the second effective characteristic and replacing the values at the corresponding positions in the t-th column of the first effective characteristic, taking the replaced first effective characteristic as the time-frequency domain characteristic after voice enhancement, and converting the time-frequency domain characteristic after voice enhancement into an enhanced voice signal.
8. A speech enhancement system that combines the dual goals of signal-to-noise ratio and intelligibility, the system comprising:
the feature conversion module is used for converting the original voice signal into the original time-frequency domain feature;
the first acquisition module is used for inputting the original time-frequency domain characteristics into a pre-established first neural network model to acquire first effective characteristics with signal-to-noise ratio;
the second obtaining module is used for inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility; and
and the enhancement module is used for processing the first effective characteristic and the second effective characteristic to obtain a weight matrix, selecting elements with high correlation with the first effective characteristic from the weight matrix column by column according to a preset correlation weight threshold, extracting the correlation weight threshold of the elements, replacing the threshold at the corresponding position in the first effective characteristic with the elements, taking the replaced first effective characteristic as the time-frequency domain characteristic after voice enhancement, and converting the time-frequency domain characteristic after voice enhancement into an enhanced voice signal.
CN201910689178.4A 2019-07-29 2019-07-29 Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets Active CN112309421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910689178.4A CN112309421B (en) 2019-07-29 2019-07-29 Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets

Publications (2)

Publication Number Publication Date
CN112309421A true CN112309421A (en) 2021-02-02
CN112309421B CN112309421B (en) 2024-03-19

Family

ID=74330190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910689178.4A Active CN112309421B (en) 2019-07-29 2019-07-29 Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets

Country Status (1)

Country Link
CN (1) CN112309421B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112802489A (en) * 2021-04-09 2021-05-14 广州健抿科技有限公司 Automatic call voice adjusting system and method
CN113035174A (en) * 2021-03-25 2021-06-25 联想(北京)有限公司 Voice recognition processing method, device, equipment and system
CN113808607A (en) * 2021-03-05 2021-12-17 北京沃东天骏信息技术有限公司 Voice enhancement method and device based on neural network and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
US9741360B1 (en) * 2016-10-09 2017-08-22 Spectimbre Inc. Speech enhancement for target speakers

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
US9741360B1 (en) * 2016-10-09 2017-08-22 Spectimbre Inc. Speech enhancement for target speakers
CN107919133A (en) * 2016-10-09 2018-04-17 赛谛听股份有限公司 For the speech-enhancement system and sound enhancement method of destination object

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jalal Taghia et al., "Objective Intelligibility Measures Based on Mutual Information for Speech Subjected to Speech Enhancement Processing", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 6-16.

Also Published As

Publication number Publication date
CN112309421B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN112309421B (en) Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets
DE69831288T2 (en) Sound processing adapted to ambient noise
CN101790752B (en) Multiple microphone voice activity detector
KR100304666B1 (en) Speech enhancement method
CN110767244B (en) Speech enhancement method
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
CN112634926B (en) Short wave channel voice anti-fading auxiliary enhancement method based on convolutional neural network
He et al. Multiplicative update of auto-regressive gains for codebook-based speech enhancement
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN105702262A (en) Headset double-microphone voice enhancement method
US9875748B2 (en) Audio signal noise attenuation
Fang et al. Integrating statistical uncertainty into neural network-based speech enhancement
CN116798434A (en) Communication enhancement method, system and storage medium based on voice characteristics
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN114613384B (en) Deep learning-based multi-input voice signal beam forming information complementation method
JP2002278586A (en) Speech recognition method
Li et al. MDNet: Learning monaural speech enhancement from deep prior gradient
Popović et al. Speech Enhancement Using Augmented SSL CycleGAN
CN113012711A (en) Voice processing method, device and equipment
Shen et al. A priori SNR estimator based on a convex combination of two DD approaches for speech enhancement
CN115730642A (en) Main and auxiliary network voice enhancement system integrating attention mechanism
Wan et al. Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement
CN116778970B (en) Voice detection model training method in strong noise environment
CN114842864B (en) Short wave channel signal diversity combining method based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant