CN112309421A - Speech enhancement method and system fusing signal-to-noise ratio and intelligibility dual targets - Google Patents


Info

Publication number
CN112309421A
CN112309421A (application CN201910689178.4A)
Authority
CN
China
Prior art keywords
frequency domain
characteristic
signal
ith
effective
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910689178.4A
Other languages
Chinese (zh)
Other versions
CN112309421B (en)
Inventor
Zhang Pengyuan (张鹏远)
Zhan Ge (战鸽)
Yan Yonghong (颜永红)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201910689178.4A
Publication of CN112309421A
Application granted
Publication of CN112309421B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L21/0232: Processing in the frequency domain
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention belongs to the technical field of speech enhancement signal processing, and in particular relates to a speech enhancement method fusing the dual targets of signal-to-noise ratio and intelligibility, comprising the following steps: converting an original speech signal into an original time-frequency domain feature; inputting the original time-frequency domain feature into a pre-established first neural network model to obtain a first effective feature optimized for signal-to-noise ratio; inputting the original time-frequency domain feature into a pre-established second neural network model to obtain a second effective feature optimized for intelligibility; processing the first and second effective features to obtain a weight matrix; selecting, column by column according to a preset correlation weight threshold, the elements of the weight matrix that are highly correlated with the first effective feature; extracting the values of the second effective feature at those positions and substituting them for the values at the corresponding positions in the first effective feature; taking the first effective feature after substitution as the speech-enhanced time-frequency domain feature; and converting the speech-enhanced time-frequency domain feature into an enhanced speech signal.

Description

Speech enhancement method and system fusing signal-to-noise ratio and intelligibility dual targets
Technical Field
The invention belongs to the technical field of speech enhancement signal processing, and in particular relates to a speech enhancement method and system fusing the dual targets of signal-to-noise ratio and intelligibility.
Background
When a speech signal is corrupted by noise, its quality and intelligibility degrade, which harms the user experience of speech recognition and speech perception processing built on that signal. Current speech enhancement methods commonly separate the spectral components of the speech signal from the noise by estimating a mask for the speech signal. Such methods generally estimate the mask under the minimum mean square error criterion, classify the time-frequency components of the noisy signal, discard the components masked by noise, and retain the components in which the speech signal energy is dominant. The separated speech components carry the important speech information and are typically used for subsequent speech recognition and speech perception processing. However, the minimum mean square error criterion is not directly related to the human perceptual mechanism for speech, and it does not distinguish between noise and speech distributed over different segments of the noisy signal; it is therefore not optimal either for suppressing residual noise or for improving the listening quality and intelligibility of the speech signal. Consequently, speech enhancement methods that directly target the two objectives of suppressing residual noise and improving listening quality and intelligibility are of unique importance in both research and application.
Present-stage speech enhancement technology mainly generates a mask optimized under the minimum mean square error criterion from the time-frequency characteristics of the speech signal, and obtains the speech components by combining the mask with those time-frequency characteristics. Such enhancement can only strike a balance between suppressing residual noise and improving listening quality and intelligibility, and its inability to express the speech signal accurately in segments where speech components are present hinders any improvement in intelligibility. At the same time, the mean square error mixes the error on the speech components with the error caused by residual noise and cannot express the latter accurately, so the enhanced speech signal is not optimal in the signal-to-noise-ratio sense.
With the wide application of deep neural networks across signal processing fields such as image and speech, training criteria other than minimum mean square error are attracting increasing attention. For the various optimization targets faced by existing speech enhancement methods, no single training criterion can comprehensively capture the errors under all of them; in general only a balance between noise suppression and improvement of auditory quality and intelligibility can be reached, and the enhancement result is optimal in neither respect.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing speech enhancement methods, and provides a speech enhancement method fusing the dual targets of signal-to-noise ratio and intelligibility: different neural networks are trained under different training criteria, and the results obtained under the multiple optimization targets are then fused into a new enhanced speech signal. The method overcomes the limited signal-to-noise-ratio and intelligibility gains of existing speech enhancement methods.
In order to achieve the above object, the present invention provides a speech enhancement method for fusing dual targets of signal-to-noise ratio and intelligibility, the method comprising:
extracting original time-frequency domain characteristics from an original voice signal;
inputting the original time-frequency domain feature into a pre-established first neural network model to obtain a first effective feature optimized for signal-to-noise ratio, i.e. having the advantage of a high signal-to-noise ratio;
inputting the original time-frequency domain feature into a pre-established second neural network model to obtain a second effective feature optimized for intelligibility, i.e. having the advantage of high intelligibility;
processing the first effective feature and the second effective feature to obtain a weight matrix, selecting, column by column according to a preset correlation weight threshold, the elements of the weight matrix that are highly correlated with the first effective feature, extracting the values of the second effective feature at those positions, substituting them for the values at the corresponding positions in the first effective feature, and taking the first effective feature after substitution as the speech-enhanced time-frequency domain feature.
As one improvement of the above technical solution, the original speech signal is converted into an original time-frequency domain characteristic; the method specifically comprises the following steps:
performing framing and windowing processing on an original voice signal to obtain a processed voice signal;
performing Fourier transform on the processed voice signal to obtain a Fourier transform coefficient matrix with the height of H and the width of T1;
and taking the absolute value of the obtained Fourier coefficient matrix to obtain the original time-frequency domain feature corresponding to the original speech signal.
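The three steps above (framing and windowing, Fourier transform, magnitude) can be sketched in Python with NumPy. The frame length, hop size, and Hann window below are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Frame and window the signal, apply the Fourier transform, and take
    absolute values.  frame_len, hop, and the Hann window are illustrative
    choices, not values specified by the patent."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft over each frame; transpose so height H = frame_len // 2 + 1
    # (frequency bins) and width T1 = n_frames, as in the text.
    return np.abs(np.fft.rfft(frames, axis=1)).T

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s synthetic tone
feat = stft_magnitude(x)
print(feat.shape)  # (257, 61)
```

The resulting H x T1 magnitude matrix plays the role of the original time-frequency domain feature in the following steps.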
As one improvement of the above technical solution, the original time-frequency domain features are input into a pre-established first neural network model, and a first effective feature with a signal-to-noise ratio is obtained; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established first neural network model, and acquiring a first floating value mask through forward calculation of the first neural network;
and multiplying the obtained first floating-value mask and the original time-frequency domain feature point-to-point to obtain the first effective feature optimized for signal-to-noise ratio.
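The mask application can be sketched as follows. The mask here is random rather than produced by a trained network, purely to illustrate the point-to-point product; the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T1 = 257, 100                                  # illustrative feature dimensions
feat = np.abs(rng.standard_normal((H, T1)))       # original time-frequency feature

# Stand-in for the network's forward computation: a floating-value
# mask in (0, 1), one value per time-frequency bin.
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal((H, T1))))

first_effective = mask * feat                     # point-to-point product
print(first_effective.shape)  # (257, 100)
```

Because every mask value lies in (0, 1), each bin of the first effective feature is attenuated relative to the original feature.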
As an improvement of the above technical solution, the pre-established first neural network model specifically includes:
step S11) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S12), according to the ith original time-frequency domain characteristic, obtaining an ith first effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S13), calculating the mean square error according to the ith first effective characteristic and the ith reference time-frequency domain characteristic to obtain the ith mean square error;
step S14), squaring the ith reference time-frequency domain feature and averaging, then taking the ratio of this average to the ith mean square error, to obtain the ith signal-to-noise ratio for this training round together with the optimal weight coefficients of each layer after training;
step S15), according to the optimal weight coefficients, calculating the signal-to-noise-ratio-based error between the output value of the deep neural network and the reference time-frequency domain feature, obtaining the signal-to-noise ratio error;
step S16), judging whether the obtained signal-to-noise ratio error is smaller than a preset threshold; if so, continuing to the next step; otherwise, returning to step S12);
step S17), determining the current model to be the first neural network model.
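The signal-to-noise ratio criterion of steps S13)-S14), i.e. the mean squared reference energy divided by the mean squared error, can be sketched as a loss computation. The gradient-based weight update itself is omitted, and the array shapes and values are illustrative:

```python
import numpy as np

def snr_for_training(estimate, reference):
    """Steps S13)-S14): the ith mean square error, and the ith
    signal-to-noise ratio as mean squared reference energy over MSE.
    A real implementation would maximise this ratio (e.g. minimise its
    negative logarithm) by backpropagation; this is a sketch only."""
    mse = np.mean((estimate - reference) ** 2)    # step S13)
    snr = np.mean(reference ** 2) / mse           # step S14)
    return mse, snr

ref = np.ones((257, 61))      # stand-in reference time-frequency feature
est = ref + 0.1               # network output with a constant 0.1 error
mse, snr = snr_for_training(est, ref)
print(round(mse, 4), round(snr, 1))  # 0.01 100.0
```

Shrinking the error on the speech components directly raises this ratio, which is what makes the first effective feature optimal in the signal-to-noise-ratio sense.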
As one improvement of the technical scheme, the original time-frequency domain characteristics are input into a pre-established second neural network model to obtain second effective characteristics with intelligibility; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established second neural network model, and obtaining a second floating value mask through forward calculation of the second neural network;
and multiplying the second floating-value mask and the original time-frequency domain feature point-to-point to obtain the second effective feature optimized for intelligibility.
As an improvement of the above technical solution, the pre-established second neural network model specifically includes:
step S21) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S22), according to the ith original time-frequency domain characteristic, obtaining an ith second effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S23), calculating a norm column by column for the ith reference time-frequency domain feature and, according to a preset norm threshold, obtaining the indices of the columns whose norm exceeds the threshold;
step S24), according to the obtained column indices, extracting the corresponding columns from the ith second effective feature and from the reference time-frequency domain feature, obtaining the ith second effective speech component feature and the reference time-frequency domain speech component feature respectively, forming two matrices of height H and width T2;
step S25), for each of the H rows, sliding a window of preset time length m over the ith second effective speech component feature and the reference time-frequency domain speech component feature, calculating the correlation coefficient between the elements of the two windows at corresponding positions, and obtaining a first correlation coefficient matrix of height H and width T2-m+1;
step S26), for each of the T2 columns, sliding a window of preset frequency length n over the ith second effective speech component feature and the reference time-frequency domain speech component feature, calculating the correlation coefficient between the elements of the two windows at corresponding positions, and obtaining a second correlation coefficient matrix of height H-n+1 and width T2;
step S27), averaging the first correlation coefficient matrix and the second correlation coefficient matrix, obtaining the ith intelligibility for this training round together with the optimal weight coefficients of each layer after training;
step S28), according to the optimal weight coefficients, calculating the intelligibility-based error between the output value of the deep neural network and the reference time-frequency domain feature;
step S29), judging whether the intelligibility error is smaller than a preset threshold; if so, continuing to the next step; otherwise, returning to step S22);
step S30), determining the current model to be the second neural network model.
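The intelligibility criterion of steps S23)-S27) can be sketched as follows: keep the speech-active columns, correlate sliding windows along time and along frequency, then average. The window lengths m and n and the norm threshold are illustrative assumptions, and the gradient update is omitted:

```python
import numpy as np

def corrcoef(a, b):
    """Correlation coefficient between two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12)

def intelligibility(estimate, reference, m=4, n=4, norm_threshold=1.0):
    """Simplified sketch of steps S23)-S27); m, n, and the norm
    threshold are illustrative, not values fixed by the patent."""
    active = np.linalg.norm(reference, axis=0) > norm_threshold   # step S23)
    est, ref = estimate[:, active], reference[:, active]          # step S24)
    H, T2 = ref.shape
    time_corr = np.array([[corrcoef(est[h, t:t + m], ref[h, t:t + m])
                           for t in range(T2 - m + 1)]
                          for h in range(H)])                     # step S25)
    freq_corr = np.array([[corrcoef(est[h:h + n, t], ref[h:h + n, t])
                           for t in range(T2)]
                          for h in range(H - n + 1)])             # step S26)
    return 0.5 * (time_corr.mean() + freq_corr.mean())            # step S27)

rng = np.random.default_rng(0)
ref = np.abs(rng.standard_normal((16, 20))) + 0.5   # reference speech feature
score = intelligibility(ref, ref)                   # perfect estimate
print(round(score, 3))  # 1.0
```

A perfect estimate scores 1; distortion of the speech components lowers the window-level correlations and hence the score, which is why this criterion favours intelligibility over raw noise suppression.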
As an improvement of the foregoing technical solution, the acquiring a speech signal after speech enhancement specifically includes:
sliding a preset window of height H and width 2n+1 (2n+1 < T1) over the first effective feature and the second effective feature column by column, covering columns t-n to t+n centred on the t-th column, to obtain a first local feature of height H and width 2n+1 and a second local feature of height H and width 2n+1;
calculating a mean and a standard deviation for the first local feature and the second local feature respectively, and normalising each column of the two local features, obtaining a first normalised feature and a second normalised feature;
multiplying the first normalised feature and the second normalised feature point-to-point to obtain the correlation weight coefficients, of height H and width 2n+1, corresponding to the current window in the weight matrix; the correlation weight coefficients indicate the elements of the second effective feature that are highly correlated with the first effective feature;
selecting, row by row according to a preset correlation weight threshold, the positions in the (n+1)-th column of the correlation weight coefficients whose values exceed the threshold, obtaining a position index;
and according to the obtained position index, extracting the values at the corresponding positions in the t-th column of the second effective feature, substituting them for the values at the corresponding positions in the t-th column of the first effective feature, taking the first effective feature after substitution as the speech-enhanced time-frequency domain feature, and converting it into an enhanced speech signal.
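The fusion procedure above can be sketched as follows. Normalising each window column to zero mean and unit variance is one plausible reading of the claimed column-by-column normalisation, and the half-width n and weight threshold are illustrative assumptions:

```python
import numpy as np

def fuse(first, second, n=2, weight_threshold=0.5):
    """Sketch of the fusion step: for each column t, take a local window
    of width 2n+1 from both features, normalise each column, multiply
    point-to-point, and where the centre column's weight exceeds the
    threshold, substitute the second feature's value into the first.
    n and the threshold are illustrative, not values from the patent."""
    H, T = first.shape
    fused = first.copy()
    for t in range(n, T - n):
        w1 = first[:, t - n:t + n + 1]
        w2 = second[:, t - n:t + n + 1]
        z1 = (w1 - w1.mean(axis=0)) / (w1.std(axis=0) + 1e-12)
        z2 = (w2 - w2.mean(axis=0)) / (w2.std(axis=0) + 1e-12)
        weight = z1 * z2                            # H x (2n+1) weights
        replace = weight[:, n] > weight_threshold   # centre ((n+1)-th) column
        fused[replace, t] = second[replace, t]
    return fused

rng = np.random.default_rng(0)
first = np.abs(rng.standard_normal((8, 12)))    # SNR-optimal feature
second = np.abs(rng.standard_normal((8, 12)))   # intelligibility-optimal feature
out = fuse(first, second)
print(out.shape)  # (8, 12)
```

The fused matrix keeps the low-noise bins of the first effective feature and only imports bins from the second effective feature where the two agree strongly, matching the stated intent of preserving the signal-to-noise ratio while recovering intelligibility.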
The invention also provides a speech enhancement system fusing the dual targets of signal-to-noise ratio and intelligibility, which specifically comprises:
the feature conversion module is used for converting the original voice signal into the original time-frequency domain feature;
the first acquisition module, configured to input the original time-frequency domain feature into a pre-established first neural network model to obtain the first effective feature optimized for signal-to-noise ratio, i.e. having the advantage of a high signal-to-noise ratio;
the second acquisition module, configured to input the original time-frequency domain feature into a pre-established second neural network model to obtain the second effective feature optimized for intelligibility, i.e. having the advantage of high intelligibility; and
the enhancement module, configured to process the first effective feature and the second effective feature to obtain a weight matrix, select, column by column according to a preset correlation weight threshold, the elements highly correlated with the first effective feature, extract the values of the second effective feature at those positions, substitute them for the values at the corresponding positions in the first effective feature, take the first effective feature after substitution as the speech-enhanced time-frequency domain feature, and convert it into an enhanced speech signal.
Compared with the prior art, the invention has the beneficial effects that:
the method of the invention trains different deep neural network models respectively according to the optimal criterion of the signal-to-noise ratio and the optimal criterion of the intelligibility, fuses the double optimization targets of the signal-to-noise ratio and the intelligibility, and achieves the effect of voice enhancement; in the time-frequency domain characteristics after the voice enhancement, on one hand, the optimal attribute of the signal-to-noise ratio of the first effective characteristic is kept, and the noise residue is suppressed; and on the other hand, the optimal intelligibility attribute of the second effective characteristic is fused, and the intelligibility of the corresponding voice signal is improved.
Drawings
FIG. 1 is a schematic flow chart of a speech enhancement method for merging dual targets of SNR and intelligibility according to the present invention;
FIG. 2 is a schematic flow chart of a speech enhancement method for merging dual targets of SNR and intelligibility according to the present invention;
FIG. 3 is a schematic flow chart of a first neural network model training method of a speech enhancement method with a target combining signal-to-noise ratio and intelligibility according to the present invention;
FIG. 4 is a schematic flow chart of a second neural network model training method of the speech enhancement method with the fusion of the SNR and the intelligibility targets of the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
The invention provides a speech enhancement method fusing the dual targets of signal-to-noise ratio and intelligibility. The original time-frequency domain feature is enhanced by two pre-established neural network models, yielding time-frequency domain speech components that are optimal in the signal-to-noise-ratio sense and in the intelligibility sense, which form the first effective feature and the second effective feature respectively. The first and second effective features are normalised column by column and multiplied point-to-point to obtain a weight matrix. Positions whose value in the weight matrix exceeds a preset weight threshold are selected, and the values at the corresponding positions in the second effective feature matrix replace those in the first effective feature matrix, giving the final enhanced time-frequency domain feature. In the first effective feature the noise component is suppressed to a minimum, the signal-to-noise ratio of the time-frequency domain speech is optimal, and the speech components carry some distortion; in the second effective feature the noise component is higher than in the first, the intelligibility of the time-frequency domain speech is optimal, and the speech components are more accurate.
As shown in fig. 1 and 2, the method includes:
step 110) extracting original time-frequency domain characteristics from an original voice signal;
specifically, the form of the original time-frequency domain feature is optional, and the original speech signal is converted into the original time-frequency domain feature by using the magnitude spectrum, which may specifically adopt the following steps:
step 1101) framing and windowing the original voice signal to obtain a processed voice signal;
step 1102) performing Fourier transform on the processed voice signal to obtain a Fourier transform coefficient matrix with the height of H and the width of T1;
step 1103) taking an absolute value of the obtained Fourier coefficient matrix, and obtaining original time-frequency domain characteristics corresponding to the original voice signals.
Step 120) inputting the original time-frequency domain feature into the pre-established first neural network model to obtain the first effective feature optimized for signal-to-noise ratio, i.e. having the advantage of a high signal-to-noise ratio. Specifically:
step 1201) inputting the original time-frequency domain feature into the pre-established first neural network model, and obtaining a first floating-value mask through forward computation of the first neural network;
step 1202) multiplying the obtained first floating-value mask and the original time-frequency domain feature point-to-point to obtain the first effective feature optimized for signal-to-noise ratio.
As shown in fig. 3, the ith sample speech signal is taken as the reference signal, a noise signal is added to construct the ith original speech signal corresponding to it, and the ith sample speech signal and the ith original speech signal are converted into the ith reference time-frequency domain feature and the ith original time-frequency domain feature respectively; a deep neural network model is then trained under the criterion of optimal signal-to-noise ratio to obtain the first neural network model. The first neural network model reduces the noise proportion in the original time-frequency domain feature and thereby improves signal quality; training it requires a large amount of sample data in advance. Obtaining the first neural network model specifically comprises:
step 1203) setting initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step 1204) according to the ith original time-frequency domain characteristic, obtaining an ith first effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step 1205) calculating a mean square error according to the ith first effective characteristic and the ith reference time-frequency domain characteristic to obtain an ith mean square error;
step 1206) squaring and averaging the ith reference time-frequency domain characteristic, and taking a ratio of the squared and averaged characteristics to the obtained ith mean square error to obtain the ith signal-to-noise ratio for the training and the optimal weight coefficient of each layer after the training;
step 1207) calculating the error between the output value of the deep neural network and the reference time-frequency domain characteristic based on the signal-to-noise ratio according to the optimal weight coefficient, and acquiring the signal-to-noise ratio error;
step 1208) judging whether the signal-to-noise ratio error obtained in step 1207) is smaller than a preset threshold; if so, continuing to the next step; otherwise, returning to step 1204);
step 1209) determining the current model to be the first neural network model.
The deep neural network is trained repeatedly on the basis of the ith signal-to-noise ratio of each round and the optimal per-layer weight coefficients after training, until the weight coefficients that make the signal-to-noise ratio of the output speech signal optimal are obtained; once the signal-to-noise-ratio-based error between the network output and the reference time-frequency domain feature, computed with these weight coefficients, is smaller than the preset threshold, the current model is determined to be the first neural network model.
Step 130) inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility; the second effective characteristics improve intelligibility, i.e. they have the advantage of high intelligibility. Specifically:
step 1301) inputting the original time-frequency domain characteristics into a pre-established second neural network model by taking the original time-frequency domain characteristics as input, and acquiring a second floating value mask through forward calculation of the second neural network;
step 1302) multiplying the second floating-value mask by the original time-frequency domain feature point-to-point to obtain a second effective feature with intelligibility;
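The two steps above can be sketched as follows; this is a minimal illustration in NumPy, in which `network` and the toy stand-in `toy_network` are hypothetical placeholders, not names from the specification:

```python
import numpy as np

def mask_enhance(network, original_tf):
    """Sketch of steps 1301)-1302): a forward pass through an already
    trained network yields a floating-value mask of the same shape as
    the original time-frequency feature, which is then multiplied with
    that feature point to point.  `network` is any callable mapping the
    feature to a mask in [0, 1]."""
    mask = network(original_tf)        # step 1301): forward calculation
    return mask * original_tf          # step 1302): point-to-point product

# toy stand-in for a trained network: pass 70 % of every time-frequency bin
toy_network = lambda x: np.full_like(x, 0.7)
```

The same pattern applies to the first neural network model of step 120), only the training criterion differs.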
as shown in fig. 4, the ith sample voice signal is used as a reference signal and a noise signal is added to construct the ith original voice signal corresponding to it; the ith sample voice signal and the ith original voice signal are converted into the ith reference time-frequency domain feature and the ith original time-frequency domain feature respectively, and the deep neural network is trained with optimal intelligibility as the criterion, yielding the second deep neural network model. The second deep neural network model raises the proportion of speech in the original time-frequency domain features and thereby improves intelligibility; a large amount of sample data must be prepared in advance for this training. Obtaining the second neural network model specifically comprises:
step 1303) setting initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step 1304) according to the ith original time-frequency domain characteristic, obtaining an ith second effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step 1305) calculating the norm of each column of the ith reference time-frequency domain characteristic, and, according to a preset norm threshold, acquiring the column indexes whose norms are higher than the threshold;
step 1306) obtaining specific columns in the ith second effective characteristic and the reference time-frequency domain characteristic according to the obtained column indexes, and respectively obtaining the ith second effective voice component characteristic and the reference time-frequency domain voice component characteristic to form two matrixes with the height of H and the width of T2;
step 1307) according to the preset time window length m, sliding windows point by point along each of the H rows of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a first correlation coefficient matrix with the height of H and the width of T2-m+1;
step 1308) according to the preset frequency window length n, sliding windows point by point along each of the T2 columns of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a second correlation coefficient matrix with the height of H-n+1 and the width of T2;
step 1309) calculating the average of the first correlation coefficient matrix and the second correlation coefficient matrix respectively, and obtaining the i-th intelligibility for the training and the optimal weight coefficient of each layer after the training;
step 1310) calculating an error based on intelligibility between an output value of the deep neural network and the reference time-frequency domain characteristic according to the optimal weight coefficient;
step 1311) determining whether the intelligibility error is less than a preset threshold; if the intelligibility error is less than a preset threshold value, continuing to execute downwards; if the intelligibility error is not less than the preset threshold, returning to the step 1304) and continuing to execute;
step 1312) determines the current model to be a second neural network model.
The deep neural network is trained repeatedly on the basis of the ith intelligibility obtained in each round and the weight coefficients of each layer after training, until the weight coefficients that make the intelligibility of the output voice signal optimal are obtained; with these weight coefficients, the error between the output value of the deep neural network and the reference time-frequency domain characteristic, calculated on the basis of intelligibility, is smaller than the preset threshold value, and the current model is determined to be the second neural network model.
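For illustration, the row-wise part of the intelligibility measure (steps 1305)-1307) and 1309)) can be sketched as below. The median-based choice of the norm threshold and the function name are assumptions of this sketch (the patent only specifies a preset threshold), and the column-wise pass of step 1308) is analogous with the window sliding along columns instead of rows:

```python
import numpy as np

def windowed_row_correlation(est, ref, m):
    """Keep the columns of `ref` whose L2 norm exceeds a threshold
    (speech-dominant frames, steps 1305-1306), then slide a length-m
    window along each of the H rows of both matrices and average the
    Pearson correlation of corresponding windowed segments (steps
    1307 and 1309).  Yields a STOI-style score in [-1, 1]."""
    norms = np.linalg.norm(ref, axis=0)          # step 1305): column-by-column norms
    keep = norms > np.median(norms)              # assumed threshold: the median norm
    est_s, ref_s = est[:, keep], ref[:, keep]    # step 1306): speech-component columns
    H, T2 = ref_s.shape
    coeffs = []
    for h in range(H):                           # step 1307): row-wise sliding windows
        for t in range(T2 - m + 1):
            a = est_s[h, t:t + m] - est_s[h, t:t + m].mean()
            b = ref_s[h, t:t + m] - ref_s[h, t:t + m].mean()
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            coeffs.append(a @ b / denom if denom > 0 else 0.0)
    return float(np.mean(coeffs))                # step 1309): average of coefficients
```

When the estimate equals the reference, every windowed correlation is 1 and the score is 1, its maximum.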
Step 140) processing the first effective feature and the second effective feature to obtain a weight matrix; selecting from the weight matrix, column by column and according to a preset correlation weight threshold, the elements having high correlation with the first effective feature; extracting the values of those elements and replacing the values at the corresponding positions in the first effective feature; taking the replaced first effective feature as the time-frequency domain feature after speech enhancement, which achieves an enhancement effect fusing the dual optimization targets of signal-to-noise ratio and intelligibility; and converting the time-frequency domain feature after speech enhancement into an enhanced speech signal to complete the speech enhancement. Specifically:
step 1401) sliding a preset window with the height of H and the width of 2n+1 (2n+1 < T1) over the first effective feature and the second effective feature column by column, the window being centred on the t-th column and covering the (t-n)-th to the (t+n)-th columns, to respectively obtain a first local feature with the height of H and the width of 2n+1 and a second local feature with the height of H and the width of 2n+1;
step 1402) calculating a mean value and a standard deviation of the first local feature and of the second local feature respectively, and normalizing each of them column by column to zero mean and unit deviation, obtaining a first normalized feature and a second normalized feature;
step 1403) multiplying the first normalized feature and the second normalized feature point to point, and obtaining the correlation weight coefficients with the height of H and the width of 2n+1 corresponding to the current window in the weight matrix; the correlation weight coefficients identify the elements of the second effective feature that have high correlation with the first effective feature;
step 1404) selecting, row by row according to the preset correlation weight threshold, the positions in the (n+1)-th column of the correlation weight coefficients whose values are higher than the threshold, and obtaining position indexes;
step 1405) extracting, according to the obtained position indexes, the values at the corresponding positions in the t-th column of the second effective feature and replacing the values at the corresponding positions in the t-th column of the first effective feature; taking the replaced first effective feature as the time-frequency domain feature after voice enhancement, and converting the time-frequency domain feature after voice enhancement into an enhanced voice signal to complete the voice enhancement.
In the time-frequency domain features after the voice enhancement, on one hand, the optimal attribute of the signal-to-noise ratio of the first effective feature is kept, and the noise residue is suppressed; and on the other hand, the optimal intelligibility attribute of the second effective characteristic is fused, and the intelligibility of the corresponding voice signal is improved.
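A minimal sketch of the fusion in steps 1401)-1405), assuming NumPy arrays and a small constant added to the standard deviation for numerical safety; the handling of boundary frames (the first and last n columns are simply left unchanged here) is a choice of this sketch that the patent does not specify:

```python
import numpy as np

def fuse_features(first, second, n, weight_thresh):
    """For each frame t, take a local window of width 2n+1 around t in
    both features (step 1401), normalize each window column-wise to zero
    mean and unit deviation (step 1402), multiply point to point to get
    correlation weights (step 1403), and where the weight in the centre
    column exceeds `weight_thresh`, copy the corresponding element of the
    second (intelligibility-optimal) feature into the first
    (SNR-optimal) one (steps 1404-1405)."""
    H, T1 = first.shape
    fused = first.copy()
    for t in range(n, T1 - n):
        w1 = first[:, t - n:t + n + 1]                    # step 1401): local windows
        w2 = second[:, t - n:t + n + 1]
        z1 = (w1 - w1.mean(0)) / (w1.std(0) + 1e-12)      # step 1402): normalize
        z2 = (w2 - w2.mean(0)) / (w2.std(0) + 1e-12)
        weights = z1 * z2                                 # step 1403): correlation weights
        idx = weights[:, n] > weight_thresh               # step 1404): centre column
        fused[idx, t] = second[idx, t]                    # step 1405): replace elements
    return fused
```

When the two inputs coincide, every replacement copies an identical value, so the output equals the first feature; in general the output keeps the SNR-optimal background of the first feature while importing highly correlated speech elements from the second.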
The invention also provides a speech enhancement system fusing the dual targets of signal-to-noise ratio and intelligibility, which is realized by the method, and specifically comprises the following steps:
the feature conversion module is used for converting the original voice signal into the original time-frequency domain feature;
the first acquisition module is used for inputting the original time-frequency domain characteristics into a pre-established first neural network model to acquire first effective characteristics with signal-to-noise ratio;
the second obtaining module is used for inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility; and
the enhancement module is used for processing the first effective characteristic and the second effective characteristic to obtain a weight matrix, selecting elements with high correlation with the first effective characteristic from the weight matrix column by column according to a preset correlation weight threshold, extracting the values of the elements, replacing the values at the corresponding positions in the first effective characteristic, taking the replaced first effective characteristic as the time-frequency domain characteristic after speech enhancement, thereby achieving an enhancement effect fusing the dual optimization targets of signal-to-noise ratio and intelligibility, and converting the time-frequency domain characteristic after speech enhancement into an enhanced speech signal.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method of speech enhancement incorporating signal-to-noise ratio and intelligibility, the method comprising:
converting an original voice signal into an original time-frequency domain characteristic;
inputting the original time-frequency domain characteristics into a pre-established first neural network model to obtain first effective characteristics with signal-to-noise ratio;
inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility;
processing the first effective characteristic and the second effective characteristic to obtain a weight matrix, selecting elements with high correlation with the first effective characteristic from the weight matrix column by column according to a preset correlation weight threshold, extracting the values of the elements, replacing the values at the corresponding positions in the first effective characteristic with the elements, taking the replaced first effective characteristic as the time-frequency domain characteristic after voice enhancement, and converting the time-frequency domain characteristic after voice enhancement into an enhanced voice signal.
2. The method of claim 1, wherein the converting the original speech signal into original time-frequency domain features; the method specifically comprises the following steps:
performing framing and windowing processing on an original voice signal to obtain a processed voice signal;
performing Fourier transform on the processed voice signal to obtain a Fourier transform coefficient matrix with the height of H and the width of T1;
and taking an absolute value of the obtained Fourier coefficient matrix, and obtaining the original time-frequency domain characteristics corresponding to the original voice signals.
3. The method of claim 1, wherein the raw time-frequency domain features are input into a pre-established first neural network model to obtain a first valid feature with a signal-to-noise ratio; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established first neural network model, and acquiring a first floating value mask through forward calculation of the first neural network;
and multiplying the obtained first floating value mask and the original time-frequency domain characteristic point to obtain a first effective characteristic with a signal-to-noise ratio.
4. The method according to claim 3, wherein the pre-established first neural network model specifically comprises:
step S11) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S12), according to the ith original time-frequency domain characteristic, obtaining an ith first effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S13), calculating the mean square error according to the ith first effective characteristic and the ith reference time-frequency domain characteristic to obtain the ith mean square error;
step S14), squaring and averaging the elements of the ith reference time-frequency domain characteristic, and taking the ratio of the resulting average power to the obtained ith mean square error, so as to obtain the ith signal-to-noise ratio for the training and the optimal weight coefficient of each layer after the training;
step S15), according to the optimal weight coefficient, calculating the error between the output value of the deep neural network and the reference time-frequency domain characteristic based on the signal-to-noise ratio, and acquiring the signal-to-noise ratio error;
step S16), judging whether the obtained signal-to-noise ratio error is smaller than a preset threshold value; if the signal-to-noise ratio error is smaller than a preset threshold value, continuing to execute downwards; if the signal-to-noise ratio error is not less than the preset threshold value, returning to the step S12) and continuing to execute;
step S17) determines the current model to be the first deep neural network model.
5. The method of claim 1, wherein the original time-frequency domain features are input into a pre-established second neural network model to obtain a second valid feature with intelligibility; the method specifically comprises the following steps:
inputting the original time-frequency domain characteristics into a pre-established second neural network model, and obtaining a second floating value mask through forward calculation of the second neural network;
and multiplying the second floating value mask and the original time-frequency domain feature point to obtain a second effective feature with intelligibility.
6. The method of claim 5, wherein the pre-established second neural network model specifically comprises:
step S21) setting an initial weight of the deep neural network; taking the ith sample voice signal as a reference signal, adding a noise signal to construct an ith original voice signal corresponding to the ith sample voice signal, and respectively converting the ith sample voice signal and the ith original voice signal into an ith reference time-frequency domain characteristic and an ith original time-frequency domain characteristic;
step S22), according to the ith original time-frequency domain characteristic, obtaining an ith second effective characteristic corresponding to the ith original time-frequency domain characteristic through forward calculation of a deep neural network;
step S23), calculating the norm column by column according to the ith reference time-frequency domain characteristic, and acquiring a column index of which the norm is higher than the norm threshold according to the norm column by column and a preset norm threshold;
step S24), according to the obtained column index, obtaining the ith second effective characteristic and a specific column in the reference time-frequency domain characteristic, respectively obtaining the ith second effective voice component characteristic and the reference time-frequency domain voice component characteristic, and forming two matrixes with the height of H and the width of T2;
step S25), according to the preset time window length m, sliding windows point by point along each of the H rows of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a first correlation coefficient matrix with the height of H and the width of T2-m+1;
step S26), according to the preset frequency window length n, sliding windows point by point along each of the T2 columns of the ith second effective voice component characteristic and of the reference time-frequency domain voice component characteristic respectively, calculating the correlation coefficients of the elements in the two sliding windows at corresponding positions, and acquiring a second correlation coefficient matrix with the height of H-n+1 and the width of T2;
step S27), calculating the average of the first correlation coefficient matrix and the second correlation coefficient matrix respectively, and obtaining the optimal weight coefficient of each layer after the i-th intelligibility of the training and the training;
step S28), calculating an error based on intelligibility between the output value of the deep neural network and the reference time-frequency domain characteristic according to the optimal weight coefficient;
step S29) judging whether the intelligibility error is less than a preset threshold value; if the intelligibility error is less than a preset threshold value, continuing to execute downwards; if the intelligibility error is not less than the preset threshold, returning to the step S22), and continuing to execute;
step S30) determines the current model to be a second neural network model.
7. The method according to claim 1, wherein the obtaining the speech signal after speech enhancement specifically comprises:
sliding a preset sliding window with the height of H and the width of 2n+1 over the first effective feature and the second effective feature column by column, the window being centred on the t-th column and covering the (t-n)-th to the (t+n)-th columns, and respectively obtaining a first local feature with the height of H and the width of 2n+1 and a second local feature with the height of H and the width of 2n+1;
respectively calculating a mean value and a standard deviation according to the first local feature and the second local feature, and respectively carrying out unit column-by-column normalization on the first local feature and the second local feature to obtain a first normalized feature and a second normalized feature;
point-to-point multiplication is carried out according to the first normalization characteristic and the second normalization characteristic, and a correlation weight coefficient with the height of H and the width of 2n +1 corresponding to the current sliding window in the weight matrix is obtained; wherein, the correlation weight coefficient is an element of the second effective characteristic which has high correlation with the first effective characteristic;
selecting, row by row according to the preset correlation weight threshold, the positions in the (n+1)-th column of the correlation weight coefficients whose values are higher than the threshold, to obtain position indexes;
and according to the acquired position indexes, extracting the values at the corresponding positions in the t-th column of the second effective characteristic and replacing the values at the corresponding positions in the t-th column of the first effective characteristic, taking the replaced first effective characteristic as the time-frequency domain characteristic after voice enhancement, and converting the time-frequency domain characteristic after voice enhancement into an enhanced voice signal.
8. A speech enhancement system that combines the dual goals of signal-to-noise ratio and intelligibility, the system comprising:
the feature conversion module is used for converting the original voice signal into the original time-frequency domain feature;
the first acquisition module is used for inputting the original time-frequency domain characteristics into a pre-established first neural network model to acquire first effective characteristics with signal-to-noise ratio;
the second obtaining module is used for inputting the original time-frequency domain characteristics into a pre-established second neural network model to obtain second effective characteristics with intelligibility; and
and the enhancement module is used for processing the first effective characteristic and the second effective characteristic to obtain a weight matrix, selecting elements with high correlation with the first effective characteristic from the weight matrix column by column according to a preset correlation weight threshold, extracting the correlation weight threshold of the elements, replacing the threshold at the corresponding position in the first effective characteristic with the elements, taking the replaced first effective characteristic as the time-frequency domain characteristic after voice enhancement, and converting the time-frequency domain characteristic after voice enhancement into an enhanced voice signal.
CN201910689178.4A 2019-07-29 2019-07-29 Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets Active CN112309421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910689178.4A CN112309421B (en) 2019-07-29 2019-07-29 Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets

Publications (2)

Publication Number Publication Date
CN112309421A true CN112309421A (en) 2021-02-02
CN112309421B CN112309421B (en) 2024-03-19

Family

ID=74330190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910689178.4A Active CN112309421B (en) 2019-07-29 2019-07-29 Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets

Country Status (1)

Country Link
CN (1) CN112309421B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112802489A (en) * 2021-04-09 2021-05-14 广州健抿科技有限公司 Automatic call voice adjusting system and method
CN113035174A (en) * 2021-03-25 2021-06-25 联想(北京)有限公司 Voice recognition processing method, device, equipment and system
CN113808607A (en) * 2021-03-05 2021-12-17 北京沃东天骏信息技术有限公司 Voice enhancement method and device based on neural network and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
US9741360B1 (en) * 2016-10-09 2017-08-22 Spectimbre Inc. Speech enhancement for target speakers

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
US9741360B1 (en) * 2016-10-09 2017-08-22 Spectimbre Inc. Speech enhancement for target speakers
CN107919133A (en) * 2016-10-09 2018-04-17 赛谛听股份有限公司 For the speech-enhancement system and sound enhancement method of destination object

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jalal Taghia et al., "Objective Intelligibility Measures Based on Mutual Information for Speech Subjected to Speech Enhancement Processing", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 6-16.

Also Published As

Publication number Publication date
CN112309421B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN112309421B (en) Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets
DE69831288T2 (en) Sound processing adapted to ambient noise
CN101790752B (en) Multiple microphone voice activity detector
KR100304666B1 (en) Speech enhancement method
CN110767244B (en) Speech enhancement method
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
CN112634926B (en) Short wave channel voice anti-fading auxiliary enhancement method based on convolutional neural network
He et al. Multiplicative update of auto-regressive gains for codebook-based speech enhancement
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN105702262A (en) Headset double-microphone voice enhancement method
US9875748B2 (en) Audio signal noise attenuation
Fang et al. Integrating statistical uncertainty into neural network-based speech enhancement
CN116798434A (en) Communication enhancement method, system and storage medium based on voice characteristics
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN114613384B (en) Deep learning-based multi-input voice signal beam forming information complementation method
JP2002278586A (en) Speech recognition method
Li et al. MDNet: Learning monaural speech enhancement from deep prior gradient
Popović et al. Speech Enhancement Using Augmented SSL CycleGAN
CN113012711A (en) Voice processing method, device and equipment
Shen et al. A priori SNR estimator based on a convex combination of two DD approaches for speech enhancement
CN115730642A (en) Main and auxiliary network voice enhancement system integrating attention mechanism
Wan et al. Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement
CN116778970B (en) Voice detection model training method in strong noise environment
CN114842864B (en) Short wave channel signal diversity combining method based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant