CN111105809A - Noise reduction method and device - Google Patents

Noise reduction method and device

Info

Publication number
CN111105809A
CN111105809A (application CN201911413911.6A)
Authority
CN
China
Prior art keywords
current
voice
noise
preset
masking value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911413911.6A
Other languages
Chinese (zh)
Other versions
CN111105809B (en)
Inventor
李庆龙 (Li Qinglong)
关海欣 (Guan Haixin)
Current Assignee
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd
Priority to CN201911413911.6A
Publication of CN111105809A
Application granted
Publication of CN111105809B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks


Abstract

The invention discloses a noise reduction method and device comprising the following steps: generating a first cepstral feature of preset speech from the noisy preset speech; obtaining a predicted masking value based on the first cepstral feature; training a preset neural network with the predicted masking value to generate a trained neural network; acquiring the current noisy speech and inputting it into the trained neural network to obtain its current masking value; and performing noise reduction on the current noisy speech based on the current masking value and outputting the denoised current speech. Because the method does not denoise high and low frequencies separately, it has no energy-imbalance problem, and the noise-reduction result is excellent, stable, and efficient; the problem that noise cannot be separated effectively because of the large energy difference between high and low frequencies is solved. Compared with a preset neural network trained on the other features required by deep-learning noise-reduction methods, a network trained on the extracted cepstral features is more complete and reduces noise better.

Description

Noise reduction method and device
Technical Field
The invention relates to the technical field of voice data processing, in particular to a noise reduction method and device.
Background
Voice calls, recording, and music playback are common functions of mobile terminals. When these functions are used in a noisy environment, the large ambient noise degrades the user's call, recording, or playback quality. To obtain a better call, recording, or playback effect, a noise reduction method based on deep learning is generally used to remove the noise component of the sound: voice noise reduction separates the noise and the human voice in mixed speech, preserving the voice as completely as possible while removing as much noise as possible. This effectively improves the quality of voice communication and voice interaction, so that people or machines can hear clear, clean speech even in a noisy environment.
The mainstream deep-learning noise-reduction methods in the prior art obtain the amplitude spectrum of the sound and combine it with a trained model to reduce noise, and have the following defect: the amplitude spectrum suffers from energy imbalance; because high- and low-frequency energies differ greatly, high and low frequencies are denoised inconsistently, the result falls short of expectations, and the noise cannot be separated effectively.
Disclosure of Invention
To address the problems above, the invention trains a neural network using the predicted masking value and the actual masking value obtained from noisy preset speech, extracts the masking value of the current noisy speech with the trained network, and performs noise reduction using that current masking value.
A method of noise reduction comprising the steps of:
generating a first cepstrum feature of a preset voice by using the noisy preset voice;
obtaining a predictive masking value based on the first cepstral feature;
training a preset neural network according to the predicted masking value to generate a trained neural network;
acquiring current voice with noise, and inputting the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice.
Preferably, the generating a first cepstrum feature of the preset speech by using the noisy preset speech includes:
acquiring a plurality of preset voices with noises;
extracting the first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
the obtaining a predictive masking value based on the first cepstral feature comprises:
inputting the first cepstral feature into the pre-set neural network to calculate the predictive masking value.
Preferably, the training a preset neural network according to the predicted masking value to generate a trained neural network includes:
acquiring a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
the actual masking value is calculated using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feed-forward algorithm and the difference value to generate a trained neural network.
Preferably, the obtaining the current voice with noise and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise includes:
extracting a second cepstrum feature of the current noisy speech;
inputting the second cepstral feature into the trained neural network;
and outputting the current masking value.
Preferably, the performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice includes:
carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and outputting the current pure voice.
A noise reducing device, the device comprising:
the generating module is used for generating a first cepstrum feature of a preset voice by using the noisy preset voice;
a first obtaining sub-module, configured to obtain a predicted masking value based on the first cepstral feature;
the training module is used for training a preset neural network according to the prediction masking value so as to generate a trained neural network;
the second acquisition module is used for acquiring the current voice with noise, and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise;
and the output module is used for carrying out noise reduction processing on the current voice with noise based on the current masking value and outputting the current voice after noise reduction.
Preferably, the generating module includes:
the first obtaining submodule is used for obtaining a plurality of preset voices with noises;
a first extraction submodule for extracting the first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
the first obtaining module includes:
a first computation submodule configured to input the first cepstral feature into the preset neural network to compute the predictive masking value.
Preferably, the training module includes:
the second obtaining submodule is used for obtaining a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
a second calculation submodule for calculating the actual masking value using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
a third computation submodule for computing a difference between said actual masking value and said predicted masking value;
and the training submodule is used for training the preset neural network through a feedforward algorithm and the difference value so as to generate a trained neural network.
Preferably, the second obtaining module includes:
the second extraction submodule is used for extracting a second cepstrum feature of the current voice with noise;
an input sub-module, configured to input the second cepstral feature into the trained neural network;
and the first output submodule is used for outputting the current masking value.
Preferably, the output module includes:
the first conversion submodule is used for carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
the second conversion submodule is used for multiplying the current masking value and the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and the second output submodule is used for outputting the current pure voice.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart illustrating a method for noise reduction according to the present invention;
FIG. 2 is another flowchart of a noise reduction method according to the present invention;
FIG. 3 is a screenshot of a workflow of a noise reduction method according to the present invention;
FIG. 4 is a block diagram of a noise reducer according to the present invention;
fig. 5 is another structural diagram of a noise reduction device provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
At present, a voice call function, a recording function, a music playing function, and the like are commonly used functions in a mobile terminal at present, and if the functions are used in a noisy environment, because environmental noise is large, a call effect, a recording effect, or a music playing effect of a user is affected. In order to achieve a better recording effect of a call effect or a music playing effect, noise components in sound are usually removed by a noise reduction method, wherein voice noise reduction is to separate noise and human voice in mixed voice, and remove noise parts as much as possible while completely preserving the human voice part as possible. The method can effectively improve the quality of voice communication or voice interaction, so that people or machines can hear clear and clean voice in a noisy environment.
The mainstream deep-learning noise-reduction methods in the prior art obtain the amplitude spectrum of the sound and combine it with a trained model to reduce noise, and have the following defects: 1. the amplitude spectrum suffers from energy imbalance: because high- and low-frequency energies differ greatly, high and low frequencies are denoised inconsistently, the result falls short of expectations, and the noise cannot be separated effectively; 2. a large number of features must be extracted as training data, and the completeness and reliability of those features cannot be guaranteed, which lowers the noise-reduction effect. To solve these problems, this embodiment discloses a method that trains a neural network using the predicted and actual masking values of noisy preset speech, extracts the masking value of the current noisy speech with the trained network, and performs noise reduction using that current masking value.
A method of noise reduction, as shown in fig. 1, comprising the steps of:
s101, generating a first cepstrum feature of preset voice by using the preset voice with noise;
step S102, obtaining a prediction masking value based on the first cepstrum characteristic;
step S103, training a preset neural network according to the prediction masking value to generate a trained neural network;
step S104, acquiring the current voice with noise, and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise;
and S105, performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the current voice after noise reduction.
The working principle of the technical scheme is as follows: a predicted masking value is first generated from noisy preset speech, and a preset neural network is trained with it to produce a trained network; the trained network then yields the current masking value of the current noisy speech, the current noisy speech is denoised according to that value, and the denoised current speech is output.
The beneficial effects of the above technical scheme are: because noise reduction is performed through the masking value, the method does not treat high and low frequencies separately, so the energy-imbalance problem does not arise, and the noise-reduction result is excellent, stable, and efficient. This solves the prior-art problem that the large energy difference between high and low frequencies lowers the noise-reduction effect and prevents effective separation of the noise.
In one embodiment, generating a first cepstral feature of the preset speech using the noisy preset speech includes:
acquiring a plurality of preset voices with noises;
extracting a first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
obtaining a predictive masking value based on the first cepstral feature, comprising:
the first cepstral features are input into a preset neural network to calculate a predicted masking value.
The beneficial effects of the above technical scheme are: acquiring multiple noisy preset voices yields a variety of masking values covering different conditions, and training the preset neural network on these values makes the training model more complete, avoiding the situation where the current noisy speech contains masking values the network cannot recognize and therefore cannot be denoised effectively. Moreover, compared with a preset neural network trained on the other features required by deep-learning noise-reduction methods, a network trained on the extracted cepstral features is more complete and reduces noise better.
In one embodiment, training a preset neural network according to the predicted masking value to generate a trained neural network includes:
acquiring a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
the actual masking value is calculated using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feedforward algorithm and the difference value to generate the trained neural network.
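A sketch of this training target follows, assuming the standard phase-sensitive mask formula and a clipping range of [0, 1] (the patent reproduces its formula only as an image, so both details are stated assumptions, as are the sampling rate and frame length):

```python
import numpy as np
from scipy.signal import stft

def actual_masking_value(pure, mixture, fs=16000, nperseg=512):
    """Phase-sensitive mask: |pure|/|mixture| * cos(phase difference)."""
    _, _, S = stft(pure, fs=fs, nperseg=nperseg)     # clean preset speech
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)  # noisy preset speech
    psm = (np.abs(S) / (np.abs(Y) + 1e-8)) * np.cos(np.angle(Y) - np.angle(S))
    return np.clip(psm, 0.0, 1.0)                    # clipping range assumed

# Paired clean/noisy example; the clean part is a 440 Hz tone
rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
actual = actual_masking_value(clean, noisy)

# The difference between a predicted mask and this actual mask is
# what the feedforward training step minimizes (MSE loss)
predicted = np.zeros_like(actual)
mse = float(np.mean((actual - predicted) ** 2))
print(actual.shape, mse > 0)
```

The mask is near 1 in time-frequency bins dominated by speech and near 0 in noise-dominated bins, which is why multiplying it with the noisy spectrum suppresses the noise.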
The beneficial effects of the above technical scheme are: computing the difference between the predicted and actual masking values of matching noisy and clean speech, and then optimizing the preset neural network through a feedforward algorithm, lets the network cover more masking values, so the trained network is more complete and achieves a better noise-reduction effect on the current noisy speech.
In one embodiment, as shown in fig. 2, obtaining a current noisy speech, and inputting the current noisy speech into a trained neural network to obtain a current masking value of the current noisy speech includes:
step S201, extracting a second cepstrum feature of the current voice with noise;
step S202, inputting the second cepstrum characteristic into the trained neural network;
and step S203, outputting the current masking value.
The beneficial effects of the above technical scheme are: the current noisy speech is fed into the trained neural network, and the current estimated masking value is obtained from the network. Because cepstral features are extracted, the envelope and harmonic characteristics of the speech signal are captured at the same time, and the current noisy speech is then denoised according to the estimated masking value. This solves the prior-art problem that envelope and harmonic characteristics cannot be obtained simultaneously, which makes the noise-reduction effect unsatisfactory.
In one embodiment, denoising a current noisy speech based on a current masking value, and outputting the denoised current speech, includes:
carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and outputting the current pure voice.
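The three reconstruction steps above can be sketched as follows, assuming SciPy's STFT/ISTFT pair (the window and frame parameters are illustrative, not from the patent):

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(current_noisy, current_mask, fs=16000, nperseg=512):
    """Multiply the current masking value with the first STFT, then invert
    the masked (second) STFT to recover the current clean speech."""
    _, _, first_stft = stft(current_noisy, fs=fs, nperseg=nperseg)
    second_stft = current_mask * first_stft
    _, current_pure = istft(second_stft, fs=fs, nperseg=nperseg)
    return current_pure

# Sanity check: an all-ones mask passes the signal through unchanged,
# because SciPy's STFT/ISTFT pair reconstructs perfectly under COLA
rng = np.random.default_rng(2)
x = rng.standard_normal(16000)
_, _, Z = stft(x, fs=16000, nperseg=512)
y = denoise(x, np.ones(Z.shape))
print(np.allclose(y[:16000], x, atol=1e-6))
```

In practice `current_mask` would be the network's predicted masking value rather than all ones.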
The beneficial effects of the technical scheme are: the noise component of the current noisy speech is removed by computing the second STFT of the current clean speech from the current masking value and the first STFT of the current noisy speech, and the speech signal is then recovered by the inverse short-time Fourier transform to obtain the current clean speech. Compared with the prior art, noise reduction here requires only the trained neural network together with the Fourier transform and its inverse, avoiding the complicated operations that lower noise-reduction efficiency.
In one embodiment, as shown in FIG. 3, the method includes:
(1) Extract the cepstral features of the noisy speech mixture using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
where STFT and ISTFT are the short-time Fourier transform and its inverse, respectively.
(2) Compute the PSM (phase-sensitive mask) relating the mixture to the corresponding clean speech pure:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
where |·| denotes magnitude and θ denotes phase.
(3) Train the neural network with a feedforward algorithm using MSE (mean squared error) as the loss function, and save the trained network.
(4) Input the features of the noisy speech into the trained model to obtain the predicted PSM, multiply the PSM by the spectrum of the noisy speech, and apply the inverse Fourier transform to obtain the enhanced speech. In use, one only needs to feed noisy speech into the model to obtain the enhanced speech.
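The training in step (3) can be sketched as a one-hidden-layer feedforward network trained by gradient descent on the MSE loss. Everything here (layer sizes, learning rate, random data standing in for cepstral features and PSM targets) is an illustrative assumption, not a value from the patent:

```python
import numpy as np

rng = np.random.default_rng(3)
n_frames, n_cep, n_freq, n_hidden = 200, 64, 32, 128

X = rng.standard_normal((n_frames, n_cep))  # stand-in cepstral features
M = rng.random((n_frames, n_freq))          # stand-in PSM targets in [0, 1]

# One hidden layer with tanh; the output predicts a masking value per bin
W1 = 0.1 * rng.standard_normal((n_cep, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_freq)); b2 = np.zeros(n_freq)

lr, losses = 0.1, []
for _ in range(200):
    H = np.tanh(X @ W1 + b1)
    P = H @ W2 + b2                            # predicted masking values
    diff = P - M                               # difference to actual masks
    losses.append(float(np.mean(diff ** 2)))   # MSE loss
    # Backpropagate the MSE gradient and take a gradient-descent step
    gP = 2.0 * diff / diff.size
    gW2, gb2 = H.T @ gP, gP.sum(axis=0)
    gH = (gP @ W2.T) * (1.0 - H ** 2)
    gW1, gb1 = X.T @ gH, gH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(losses[0], "->", losses[-1])  # training reduces the MSE loss
```

A real implementation would use a deep-learning framework and batched real feature/mask pairs, but the loss and update structure would be the same.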
The working principle and the beneficial effects of the technical scheme are as follows: speech essentially exists as harmonics plus an envelope; the envelope can be described by low-order cepstral coefficients and the harmonics by high-order ones, so cepstral features directly decouple the excitation and the vocal tract. Compared with ordinary frequency-domain energy-spectrum features, they avoid the energy-imbalance problem and contain harmonic information that features such as MFCC/GFCC lack. In our experiments, with the same network and data, cepstral features outperformed the other features under various speech-quality evaluation methods, and the subjective listening quality was clearly superior.
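The envelope/harmonic decoupling described above can be demonstrated by liftering: zeroing the high-quefrency cepstral coefficients keeps the envelope, and the remainder keeps the harmonic (excitation) structure. The 200 Hz test tone and the cutoff of 30 coefficients are illustrative assumptions:

```python
import numpy as np

fs, n = 16000, 512
t = np.arange(n) / fs
# A harmonic-rich frame: 200 Hz fundamental plus four harmonics
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 6))

spec = np.fft.rfft(frame * np.hanning(n))
log_mag = np.log(np.abs(spec) + 1e-8)
cep = np.fft.irfft(log_mag)              # real cepstrum of the frame

cutoff = 30                              # illustrative quefrency cutoff
envelope_part = cep.copy()
envelope_part[cutoff:-cutoff] = 0.0      # low-order coeffs: spectral envelope
harmonic_part = cep - envelope_part      # high-order coeffs: harmonics

# The harmonic part typically peaks near the pitch quefrency fs/200 = 80
peak = int(np.argmax(np.abs(harmonic_part[: n // 2])))
print(peak)
```

The two parts sum exactly back to the full cepstrum, which is the sense in which the feature "decouples" vocal tract and excitation without discarding either.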
The present embodiment also provides a noise reduction apparatus, as shown in fig. 4, the apparatus including:
a generating module 401, configured to generate a first cepstrum feature of a preset voice by using a noisy preset voice;
a first obtaining module 402, configured to obtain a predicted masking value based on the first cepstral feature;
a training module 403, configured to train a preset neural network according to the predicted masking value to generate a trained neural network;
a second obtaining module 404, configured to obtain a current voice with noise, and input the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and an output module 405, configured to perform noise reduction processing on the current voice with noise based on the current masking value, and output the current voice after noise reduction.
In one embodiment, a generation module includes:
the first obtaining submodule is used for obtaining a plurality of preset voices with noises;
a first extraction submodule for extracting a first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
a first acquisition submodule comprising:
and the first calculation sub-module is used for inputting the first cepstrum feature into a preset neural network to calculate a prediction masking value.
In one embodiment, a training module, comprising:
the second obtaining submodule is used for obtaining a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
a second calculation submodule for calculating the actual masking value using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
a third computation submodule for computing a difference between said actual masking value and said predicted masking value;
and the training submodule is used for training the preset neural network through a feedforward algorithm and the difference value so as to generate the trained neural network.
In one embodiment, as shown in fig. 5, the second obtaining module includes:
the second extraction submodule 4041 is used for extracting a second cepstrum feature of the current voice with noise;
an input sub-module 4042, configured to input the second cepstrum feature into the trained neural network;
a first output submodule 4043, configured to output the current masking value.
In one embodiment, an output module includes:
the first conversion submodule is used for carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
the second conversion submodule is used for multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and the second output submodule is used for outputting the current pure voice.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention merely distinguish different stages of the application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of noise reduction, comprising the steps of:
generating a first cepstrum feature of a preset voice by using the noisy preset voice;
obtaining a predictive masking value based on the first cepstral feature;
training a preset neural network according to the predicted masking value to generate a trained neural network;
acquiring current voice with noise, and inputting the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice.
2. The noise reduction method according to claim 1, wherein the generating a first cepstral feature of preset speech by using noisy preset speech comprises:
acquiring a plurality of noisy preset speech samples;
extracting the first cepstral feature using the following formula:
cepstral = ISTFT(log(STFT(mixture)));
wherein STFT() denotes the short-time Fourier transform, ISTFT() denotes the inverse short-time Fourier transform, and mixture denotes the noisy preset speech;
and wherein the obtaining a predicted masking value based on the first cepstral feature comprises:
inputting the first cepstral feature into the preset neural network to calculate the predicted masking value.
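The feature-extraction formula of claim 2 can be sketched as follows. The sampling rate, frame length, and the small epsilon added before the logarithm are assumptions (the claim fixes none of them), and the logarithm is applied to the magnitude spectrum so the result stays real-valued, which is a common reading of the formula:

```python
import numpy as np
from scipy.signal import stft, istft

def cepstral_feature(mixture, fs=16000, nperseg=512):
    """Sketch of the claimed feature: ISTFT(log(STFT(mixture))).

    fs and nperseg are illustrative choices; the epsilon guards
    against log(0) on silent frames.
    """
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)     # complex spectrogram
    log_mag = np.log(np.abs(Z) + 1e-8)                  # real-valued log-magnitude
    _, cep = istft(log_mag, fs=fs, nperseg=nperseg)     # back to the time domain
    return cep
```

The resulting vector plays the role of the "first cepstral feature" fed to the preset network.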
3. The noise reduction method according to claim 2, wherein the training a preset neural network according to the predicted masking value to generate a trained neural network comprises:
acquiring a plurality of clean preset speech samples, the plurality of clean preset speech samples corresponding to the plurality of noisy preset speech samples;
calculating the actual masking value using the following formula:
[formula supplied as an image in the original publication and not reproduced here]
wherein the formula involves the clean preset speech, θ is a phase, and |·| denotes a magnitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feed-forward algorithm and the difference to generate the trained neural network.
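The actual masking value in claim 3 is given only as an embedded formula image. Since the surrounding text mentions the clean preset speech, a phase θ, and a magnitude |·|, a phase-sensitive mask is one common construction built from exactly those ingredients; the sketch below uses it purely as an illustrative stand-in, not as the patented expression, together with a mean-squared difference as the training loss:

```python
import numpy as np
from scipy.signal import stft

def actual_mask(clean, mixture, fs=16000, nperseg=512):
    """Illustrative phase-sensitive mask: |S|/|Y| * cos(theta_S - theta_Y).

    Assumption: the patent's image formula is not available, so this is
    a stand-in with the same ingredients (clean speech, phase, magnitude).
    """
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)
    theta = np.angle(S) - np.angle(Y)                      # phase difference θ
    mask = (np.abs(S) / (np.abs(Y) + 1e-8)) * np.cos(theta)
    return np.clip(mask, 0.0, 1.0)                         # keep mask in [0, 1]

def mask_loss(predicted, actual):
    """Difference between predicted and actual masks used to train the network."""
    return float(np.mean((predicted - actual) ** 2))
```

Training then reduces to minimizing this difference over the paired clean/noisy preset speech samples.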
4. The noise reduction method according to claim 3, wherein the acquiring current noisy speech and inputting the current noisy speech into the trained neural network to obtain a current masking value of the current noisy speech comprises:
extracting a second cepstral feature of the current noisy speech;
inputting the second cepstral feature into the trained neural network;
and outputting the current masking value.
5. The noise reduction method according to claim 4, wherein the performing noise reduction on the current noisy speech based on the current masking value and outputting the noise-reduced current speech comprises:
performing a short-time Fourier transform on the current noisy speech to obtain a first STFT of the current noisy speech;
multiplying the current masking value by the first STFT to obtain a second STFT of the current clean speech, and performing an inverse short-time Fourier transform on the second STFT to obtain the current clean speech;
and outputting the current clean speech.
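The noise-reduction step of claim 5 (first STFT, multiply by the current masking value, inverse STFT) can be sketched directly; the frame parameters are assumptions, and the mask is expected to have the same shape as the spectrogram:

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(noisy, mask, fs=16000, nperseg=512):
    """Claim 5 sketch: clean STFT = mask * noisy STFT, then invert."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)   # first STFT (noisy speech)
    Z = mask * Y                                    # second STFT (estimated clean speech)
    _, clean = istft(Z, fs=fs, nperseg=nperseg)     # inverse short-time Fourier transform
    return clean
```

With an all-ones mask this round trip reconstructs the input, which is a quick sanity check that the analysis/synthesis parameters match.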
6. A noise reduction device, characterized in that the device comprises:
a generating module, configured to generate a first cepstral feature of preset speech by using noisy preset speech;
a first obtaining module, configured to obtain a predicted masking value based on the first cepstral feature;
a training module, configured to train a preset neural network according to the predicted masking value to generate a trained neural network;
a second obtaining module, configured to acquire current noisy speech and input the current noisy speech into the trained neural network to obtain a current masking value of the current noisy speech;
and an output module, configured to perform noise reduction on the current noisy speech based on the current masking value and output the noise-reduced current speech.
7. The noise reduction device according to claim 6, wherein the generating module comprises:
a first acquiring submodule, configured to acquire a plurality of noisy preset speech samples;
a first extraction submodule, configured to extract the first cepstral feature using the following formula:
cepstral = ISTFT(log(STFT(mixture)));
wherein STFT() denotes the short-time Fourier transform, ISTFT() denotes the inverse short-time Fourier transform, and mixture denotes the noisy preset speech;
and wherein the first obtaining module comprises:
a first calculation submodule, configured to input the first cepstral feature into the preset neural network to calculate the predicted masking value.
8. The noise reduction device according to claim 7, wherein the training module comprises:
a second acquiring submodule, configured to acquire a plurality of clean preset speech samples, the plurality of clean preset speech samples corresponding to the plurality of noisy preset speech samples;
a second calculation submodule, configured to calculate the actual masking value using the following formula:
[formula supplied as an image in the original publication and not reproduced here]
wherein the formula involves the clean preset speech, θ is a phase, and |·| denotes a magnitude;
a third calculation submodule, configured to calculate a difference between the actual masking value and the predicted masking value;
and a training submodule, configured to train the preset neural network through a feed-forward algorithm and the difference to generate the trained neural network.
9. The noise reduction device according to claim 8, wherein the second obtaining module comprises:
a second extraction submodule, configured to extract a second cepstral feature of the current noisy speech;
an input submodule, configured to input the second cepstral feature into the trained neural network;
and a first output submodule, configured to output the current masking value.
10. The noise reduction device according to claim 9, wherein the output module comprises:
a first transform submodule, configured to perform a short-time Fourier transform on the current noisy speech to obtain a first STFT of the current noisy speech;
a second transform submodule, configured to multiply the current masking value by the first STFT to obtain a second STFT of the current clean speech, and to perform an inverse short-time Fourier transform on the second STFT to obtain the current clean speech;
and a second output submodule, configured to output the current clean speech.
CN201911413911.6A 2019-12-31 2019-12-31 Noise reduction method and device Active CN111105809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911413911.6A CN111105809B (en) 2019-12-31 2019-12-31 Noise reduction method and device

Publications (2)

Publication Number Publication Date
CN111105809A true CN111105809A (en) 2020-05-05
CN111105809B CN111105809B (en) 2022-03-22

Family

ID=70425717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911413911.6A Active CN111105809B (en) 2019-12-31 2019-12-31 Noise reduction method and device

Country Status (1)

Country Link
CN (1) CN111105809B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933172A (en) * 2020-08-10 2020-11-13 广州九四智能科技有限公司 Method and device for separating and extracting human voice, computer equipment and storage medium
CN113921022A (en) * 2021-12-13 2022-01-11 北京世纪好未来教育科技有限公司 Audio signal separation method, device, storage medium and electronic equipment
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105679312A (en) * 2016-03-04 2016-06-15 重庆邮电大学 Phonetic feature processing method of voiceprint identification in noise environment
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108447495A (en) * 2018-03-28 2018-08-24 天津大学 A kind of deep learning sound enhancement method based on comprehensive characteristics collection
CN108986798A (en) * 2018-06-27 2018-12-11 百度在线网络技术(北京)有限公司 Processing method, device and the equipment of voice data
CN109036460A (en) * 2018-08-28 2018-12-18 百度在线网络技术(北京)有限公司 Method of speech processing and device based on multi-model neural network
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
CN110136737A (en) * 2019-06-18 2019-08-16 北京拙河科技有限公司 A kind of voice de-noising method and device


Similar Documents

Publication Publication Date Title
CN106486131B (en) A kind of method and device of speech de-noising
CN111105809B (en) Noise reduction method and device
Xu et al. A regression approach to speech enhancement based on deep neural networks
Xu et al. Dynamic noise aware training for speech enhancement based on deep neural networks.
Narayanan et al. Joint noise adaptive training for robust automatic speech recognition
US9570072B2 (en) System and method for noise reduction in processing speech signals by targeting speech and disregarding noise
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
CN111128213B (en) Noise suppression method and system for processing in different frequency bands
Ganapathy et al. Robust feature extraction using modulation filtering of autoregressive models
US10614827B1 (en) System and method for speech enhancement using dynamic noise profile estimation
Yu et al. Robust speech recognition using a cepstral minimum-mean-square-error-motivated noise suppressor
Shahnaz et al. Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme
Dionelis et al. Phase-aware single-channel speech enhancement with modulation-domain Kalman filtering
US9536537B2 (en) Systems and methods for speech restoration
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
Wolfe et al. Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement
Rao et al. Robust speaker recognition on mobile devices
US20150162014A1 (en) Systems and methods for enhancing an audio signal
Nair et al. Mfcc based noise reduction in asr using kalman filtering
Wang et al. Task-aware warping factors in mask-based speech enhancement
CN111028858B (en) Method and device for detecting voice start-stop time
CN109272996A (en) A kind of noise-reduction method and system
Garg et al. Deep convolutional neural network-based speech signal enhancement using extensive speech features
Shankar et al. Noise dependent super gaussian-coherence based dual microphone speech enhancement for hearing aid application using smartphone
Mallidi et al. Robust speaker recognition using spectro-temporal autoregressive models.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant