CN111105809A - Noise reduction method and device - Google Patents

Noise reduction method and device

Info

Publication number
CN111105809A
CN111105809A (application CN201911413911.6A)
Authority
CN
China
Prior art keywords
current
voice
noise
preset
masking value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911413911.6A
Other languages
Chinese (zh)
Other versions
CN111105809B (en)
Inventor
李庆龙 (Li Qinglong)
关海欣 (Guan Haixin)
Current Assignee
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd
Priority to CN201911413911.6A
Publication of CN111105809A
Application granted
Publication of CN111105809B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks


Abstract

The invention discloses a noise reduction method and device comprising the following steps: generating a first cepstral feature of preset speech from the noisy preset speech; obtaining a predicted masking value based on the first cepstral feature; training a preset neural network with the predicted masking value to generate a trained neural network; acquiring the current noisy speech and inputting it into the trained neural network to obtain its current masking value; and performing noise reduction on the current noisy speech based on the current masking value and outputting the denoised current speech. Because the method does not denoise high and low frequencies separately, it has no energy-imbalance problem, and the noise-reduction result is excellent, stable, and efficient; the problem that noise cannot be separated effectively because of the large energy difference between high and low frequencies is solved. Compared with a preset neural network trained on the other features required by deep-learning noise-reduction methods, a network trained on the extracted cepstral features is more complete and reduces noise better.

Description

Noise reduction method and device
Technical Field
The invention relates to the technical field of voice data processing, in particular to a noise reduction method and device.
Background
Voice calls, recording, and music playback are common functions of mobile terminals. When these functions are used in a noisy environment, the large ambient noise degrades the user's call, recording, or playback quality. To obtain a better call, recording, or playback effect, a noise reduction method based on deep learning is generally used to remove the noise component of the sound: voice noise reduction separates the noise and the human voice in mixed speech, preserving the voice as completely as possible while removing as much noise as possible. This effectively improves the quality of voice communication and voice interaction, so that people or machines can hear clear, clean speech even in a noisy environment.
The mainstream deep-learning noise-reduction methods in the prior art obtain the amplitude spectrum of the sound and combine it with a trained model to reduce noise, and have the following defect: the amplitude spectrum suffers from energy imbalance; because high- and low-frequency energies differ greatly, high and low frequencies are denoised inconsistently, the result falls short of expectations, and the noise cannot be separated effectively.
Disclosure of Invention
To address the problems above, the invention trains a neural network using the predicted masking value and the actual masking value obtained from noisy preset speech, extracts the masking value of the current noisy speech with the trained network, and performs noise reduction using that current masking value.
A method of noise reduction comprising the steps of:
generating a first cepstrum feature of a preset voice by using the noisy preset voice;
obtaining a predictive masking value based on the first cepstral feature;
training a preset neural network according to the predicted masking value to generate a trained neural network;
acquiring current voice with noise, and inputting the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice.
Preferably, the generating a first cepstrum feature of the preset speech by using the noisy preset speech includes:
acquiring a plurality of preset voices with noises;
extracting the first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
the obtaining a predictive masking value based on the first cepstral feature comprises:
inputting the first cepstral feature into the pre-set neural network to calculate the predictive masking value.
Preferably, the training a preset neural network according to the predicted masking value to generate a trained neural network includes:
acquiring a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
the actual masking value is calculated using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feed-forward algorithm and the difference value to generate a trained neural network.
Preferably, the obtaining the current voice with noise and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise includes:
extracting a second cepstrum feature of the current noisy speech;
inputting the second cepstral feature into the trained neural network;
and outputting the current masking value.
Preferably, the performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice includes:
carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and outputting the current pure voice.
A noise reducing device, the device comprising:
the generating module is used for generating a first cepstrum feature of a preset voice by using the noisy preset voice;
a first obtaining sub-module, configured to obtain a predicted masking value based on the first cepstral feature;
the training module is used for training a preset neural network according to the prediction masking value so as to generate a trained neural network;
the second acquisition module is used for acquiring the current voice with noise, and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise;
and the output module is used for carrying out noise reduction processing on the current voice with noise based on the current masking value and outputting the current voice after noise reduction.
Preferably, the generating module includes:
the first obtaining submodule is used for obtaining a plurality of preset voices with noises;
a first extraction submodule for extracting the first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
the first obtaining module includes:
a first computation submodule configured to input the first cepstral feature into the preset neural network to compute the predictive masking value.
Preferably, the training module includes:
the second obtaining submodule is used for obtaining a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
a second calculation submodule for calculating the actual masking value using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
a third computation submodule for computing a difference between said actual masking value and said predicted masking value;
and the training submodule is used for training the preset neural network through a feedforward algorithm and the difference value so as to generate a trained neural network.
Preferably, the second obtaining module includes:
the second extraction submodule is used for extracting a second cepstrum feature of the current voice with noise;
an input sub-module, configured to input the second cepstral feature into the trained neural network;
and the first output submodule is used for outputting the current masking value.
Preferably, the output module includes:
the first conversion submodule is used for carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
the second conversion submodule is used for multiplying the current masking value and the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and the second output submodule is used for outputting the current pure voice.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart illustrating a method for noise reduction according to the present invention;
FIG. 2 is another flowchart of a noise reduction method according to the present invention;
FIG. 3 is a screenshot of a workflow of a noise reduction method according to the present invention;
FIG. 4 is a block diagram of a noise reducer according to the present invention;
fig. 5 is another structural diagram of a noise reduction device provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
At present, a voice call function, a recording function, a music playing function, and the like are commonly used functions in a mobile terminal at present, and if the functions are used in a noisy environment, because environmental noise is large, a call effect, a recording effect, or a music playing effect of a user is affected. In order to achieve a better recording effect of a call effect or a music playing effect, noise components in sound are usually removed by a noise reduction method, wherein voice noise reduction is to separate noise and human voice in mixed voice, and remove noise parts as much as possible while completely preserving the human voice part as possible. The method can effectively improve the quality of voice communication or voice interaction, so that people or machines can hear clear and clean voice in a noisy environment.
The mainstream deep-learning noise-reduction methods in the prior art obtain the amplitude spectrum of the sound and combine it with a trained model to reduce noise, and have the following defects: 1. the amplitude spectrum suffers from energy imbalance: because high- and low-frequency energies differ greatly, high and low frequencies are denoised inconsistently, the result falls short of expectations, and the noise cannot be separated effectively; 2. a large number of features must be extracted as training data, and the completeness and reliability of those features cannot be guaranteed, which lowers the noise-reduction effect. To solve these problems, this embodiment discloses a method that trains a neural network using the predicted and actual masking values of noisy preset speech, extracts the masking value of the current noisy speech with the trained network, and performs noise reduction using that current masking value.
A method of noise reduction, as shown in fig. 1, comprising the steps of:
s101, generating a first cepstrum feature of preset voice by using the preset voice with noise;
step S102, obtaining a prediction masking value based on the first cepstrum characteristic;
step S103, training a preset neural network according to the prediction masking value to generate a trained neural network;
step S104, acquiring the current voice with noise, and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise;
and S105, performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the current voice after noise reduction.
The working principle of the technical scheme is as follows: a predicted masking value is first generated from noisy preset speech, and a preset neural network is trained with it to produce a trained network; the trained network then yields the current masking value of the current noisy speech, the current noisy speech is denoised according to that value, and the denoised current speech is output.
The beneficial effects of the above technical scheme are: because noise reduction is performed through the masking value, the method does not treat high and low frequencies separately, so the energy-imbalance problem does not arise, and the noise-reduction result is excellent, stable, and efficient. This solves the prior-art problem that the large energy difference between high and low frequencies lowers the noise-reduction effect and prevents effective separation of the noise.
In one embodiment, generating a first cepstral feature of the preset speech using the noisy preset speech includes:
acquiring a plurality of preset voices with noises;
extracting a first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
obtaining a predictive masking value based on the first cepstral feature, comprising:
the first cepstral features are input into a preset neural network to calculate a predicted masking value.
The beneficial effects of the above technical scheme are: acquiring multiple noisy preset voices yields a variety of masking values covering different conditions, and training the preset neural network on these values makes the training model more complete, avoiding the situation where the current noisy speech contains masking values the network cannot recognize and therefore cannot be denoised effectively. Moreover, compared with a preset neural network trained on the other features required by deep-learning noise-reduction methods, a network trained on the extracted cepstral features is more complete and reduces noise better.
In one embodiment, training a preset neural network according to the predicted masking value to generate a trained neural network includes:
acquiring a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
the actual masking value is calculated using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feedforward algorithm and the difference value to generate the trained neural network.
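A sketch of this training target follows, assuming the standard phase-sensitive mask formula and a clipping range of [0, 1] (the patent reproduces its formula only as an image, so both details are stated assumptions, as are the sampling rate and frame length):

```python
import numpy as np
from scipy.signal import stft

def actual_masking_value(pure, mixture, fs=16000, nperseg=512):
    """Phase-sensitive mask: |pure|/|mixture| * cos(phase difference)."""
    _, _, S = stft(pure, fs=fs, nperseg=nperseg)     # clean preset speech
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)  # noisy preset speech
    psm = (np.abs(S) / (np.abs(Y) + 1e-8)) * np.cos(np.angle(Y) - np.angle(S))
    return np.clip(psm, 0.0, 1.0)                    # clipping range assumed

# Paired clean/noisy example; the clean part is a 440 Hz tone
rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
actual = actual_masking_value(clean, noisy)

# The difference between a predicted mask and this actual mask is
# what the feedforward training step minimizes (MSE loss)
predicted = np.zeros_like(actual)
mse = float(np.mean((actual - predicted) ** 2))
print(actual.shape, mse > 0)
```

The mask is near 1 in time-frequency bins dominated by speech and near 0 in noise-dominated bins, which is why multiplying it with the noisy spectrum suppresses the noise.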
The beneficial effects of the above technical scheme are: computing the difference between the predicted and actual masking values of matching noisy and clean speech, and then optimizing the preset neural network through a feedforward algorithm, lets the network cover more masking values, so the trained network is more complete and achieves a better noise-reduction effect on the current noisy speech.
In one embodiment, as shown in fig. 2, obtaining a current noisy speech, and inputting the current noisy speech into a trained neural network to obtain a current masking value of the current noisy speech includes:
step S201, extracting a second cepstrum feature of the current voice with noise;
step S202, inputting the second cepstrum characteristic into the trained neural network;
and step S203, outputting the current masking value.
The beneficial effects of the above technical scheme are: the current noisy speech is fed into the trained neural network, and the current estimated masking value is obtained from the network. Because cepstral features are extracted, the envelope and harmonic characteristics of the speech signal are captured at the same time, and the current noisy speech is then denoised according to the estimated masking value. This solves the prior-art problem that envelope and harmonic characteristics cannot be obtained simultaneously, which makes the noise-reduction effect unsatisfactory.
In one embodiment, denoising a current noisy speech based on a current masking value, and outputting the denoised current speech, includes:
carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and outputting the current pure voice.
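The three reconstruction steps above can be sketched as follows, assuming SciPy's STFT/ISTFT pair (the window and frame parameters are illustrative, not from the patent):

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(current_noisy, current_mask, fs=16000, nperseg=512):
    """Multiply the current masking value with the first STFT, then invert
    the masked (second) STFT to recover the current clean speech."""
    _, _, first_stft = stft(current_noisy, fs=fs, nperseg=nperseg)
    second_stft = current_mask * first_stft
    _, current_pure = istft(second_stft, fs=fs, nperseg=nperseg)
    return current_pure

# Sanity check: an all-ones mask passes the signal through unchanged,
# because SciPy's STFT/ISTFT pair reconstructs perfectly under COLA
rng = np.random.default_rng(2)
x = rng.standard_normal(16000)
_, _, Z = stft(x, fs=16000, nperseg=512)
y = denoise(x, np.ones(Z.shape))
print(np.allclose(y[:16000], x, atol=1e-6))
```

In practice `current_mask` would be the network's predicted masking value rather than all ones.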
The beneficial effects of the technical scheme are: the noise component of the current noisy speech is removed by computing the second STFT of the current clean speech from the current masking value and the first STFT of the current noisy speech, and the speech signal is then recovered by the inverse short-time Fourier transform to obtain the current clean speech. Compared with the prior art, noise reduction here requires only the trained neural network together with the Fourier transform and its inverse, avoiding the complicated operations that lower noise-reduction efficiency.
In one embodiment, as shown in FIG. 3, the method includes:
(1) Extract the cepstral features of the noisy speech mixture using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
where STFT and ISTFT are the short-time Fourier transform and its inverse, respectively.
(2) Compute the PSM (phase-sensitive mask) relating the mixture to the corresponding clean speech pure:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
where |·| denotes magnitude and θ denotes phase.
(3) Train the neural network with a feedforward algorithm using MSE (mean squared error) as the loss function, and save the trained network.
(4) Input the features of the noisy speech into the trained model to obtain the predicted PSM, multiply the PSM by the spectrum of the noisy speech, and apply the inverse Fourier transform to obtain the enhanced speech. In use, one only needs to feed noisy speech into the model to obtain the enhanced speech.
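The training in step (3) can be sketched as a one-hidden-layer feedforward network trained by gradient descent on the MSE loss. Everything here (layer sizes, learning rate, random data standing in for cepstral features and PSM targets) is an illustrative assumption, not a value from the patent:

```python
import numpy as np

rng = np.random.default_rng(3)
n_frames, n_cep, n_freq, n_hidden = 200, 64, 32, 128

X = rng.standard_normal((n_frames, n_cep))  # stand-in cepstral features
M = rng.random((n_frames, n_freq))          # stand-in PSM targets in [0, 1]

# One hidden layer with tanh; the output predicts a masking value per bin
W1 = 0.1 * rng.standard_normal((n_cep, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_freq)); b2 = np.zeros(n_freq)

lr, losses = 0.1, []
for _ in range(200):
    H = np.tanh(X @ W1 + b1)
    P = H @ W2 + b2                            # predicted masking values
    diff = P - M                               # difference to actual masks
    losses.append(float(np.mean(diff ** 2)))   # MSE loss
    # Backpropagate the MSE gradient and take a gradient-descent step
    gP = 2.0 * diff / diff.size
    gW2, gb2 = H.T @ gP, gP.sum(axis=0)
    gH = (gP @ W2.T) * (1.0 - H ** 2)
    gW1, gb1 = X.T @ gH, gH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(losses[0], "->", losses[-1])  # training reduces the MSE loss
```

A real implementation would use a deep-learning framework and batched real feature/mask pairs, but the loss and update structure would be the same.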
The working principle and the beneficial effects of the technical scheme are as follows: speech essentially exists as harmonics plus an envelope; the envelope can be described by low-order cepstral coefficients and the harmonics by high-order ones, so cepstral features directly decouple the excitation and the vocal tract. Compared with ordinary frequency-domain energy-spectrum features, they avoid the energy-imbalance problem and contain harmonic information that features such as MFCC/GFCC lack. In our experiments, with the same network and data, cepstral features outperformed the other features under various speech-quality evaluation methods, and the subjective listening quality was clearly superior.
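The envelope/harmonic decoupling described above can be demonstrated by liftering: zeroing the high-quefrency cepstral coefficients keeps the envelope, and the remainder keeps the harmonic (excitation) structure. The 200 Hz test tone and the cutoff of 30 coefficients are illustrative assumptions:

```python
import numpy as np

fs, n = 16000, 512
t = np.arange(n) / fs
# A harmonic-rich frame: 200 Hz fundamental plus four harmonics
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 6))

spec = np.fft.rfft(frame * np.hanning(n))
log_mag = np.log(np.abs(spec) + 1e-8)
cep = np.fft.irfft(log_mag)              # real cepstrum of the frame

cutoff = 30                              # illustrative quefrency cutoff
envelope_part = cep.copy()
envelope_part[cutoff:-cutoff] = 0.0      # low-order coeffs: spectral envelope
harmonic_part = cep - envelope_part      # high-order coeffs: harmonics

# The harmonic part typically peaks near the pitch quefrency fs/200 = 80
peak = int(np.argmax(np.abs(harmonic_part[: n // 2])))
print(peak)
```

The two parts sum exactly back to the full cepstrum, which is the sense in which the feature "decouples" vocal tract and excitation without discarding either.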
The present embodiment also provides a noise reduction apparatus, as shown in fig. 4, the apparatus including:
a generating module 401, configured to generate a first cepstrum feature of a preset voice by using a noisy preset voice;
a first obtaining module 402, configured to obtain a predicted masking value based on the first cepstral feature;
a training module 403, configured to train a preset neural network according to the predicted masking value to generate a trained neural network;
a second obtaining module 404, configured to obtain a current voice with noise, and input the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and an output module 405, configured to perform noise reduction processing on the current voice with noise based on the current masking value, and output the current voice after noise reduction.
In one embodiment, a generation module includes:
the first obtaining submodule is used for obtaining a plurality of preset voices with noises;
a first extraction submodule for extracting a first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
a first acquisition submodule comprising:
and the first calculation sub-module is used for inputting the first cepstrum feature into a preset neural network to calculate a prediction masking value.
In one embodiment, a training module, comprising:
the second obtaining submodule is used for obtaining a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
a second calculation submodule for calculating the actual masking value using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_mixture − θ_pure);
wherein pure is the clean preset speech, mixture is the noisy preset speech, θ is the phase, and |·| is the magnitude;
a third computation submodule for computing a difference between said actual masking value and said predicted masking value;
and the training submodule is used for training the preset neural network through a feedforward algorithm and the difference value so as to generate the trained neural network.
In one embodiment, as shown in fig. 5, the second obtaining module includes:
the second extraction submodule 4041 is used for extracting a second cepstrum feature of the current voice with noise;
an input sub-module 4042, configured to input the second cepstrum feature into the trained neural network;
a first output submodule 4043, configured to output the current masking value.
In one embodiment, an output module includes:
the first conversion submodule is used for carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
the second conversion submodule is used for multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and the second output submodule is used for outputting the current pure voice.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention merely distinguish different stages of the application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of noise reduction, comprising the steps of:
generating a first cepstrum feature of a preset voice by using the noisy preset voice;
obtaining a predictive masking value based on the first cepstral feature;
training a preset neural network according to the predicted masking value to generate a trained neural network;
acquiring current voice with noise, and inputting the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice.
2. The noise reduction method according to claim 1, wherein the generating a first cepstral feature of preset speech by using noisy preset speech comprises:
acquiring a plurality of noisy preset speech samples;
extracting the first cepstral feature using the following formula:
cepstral = ISTFT(log(STFT(mixture)));
wherein STFT() denotes the short-time Fourier transform, ISTFT() denotes the inverse short-time Fourier transform, and mixture denotes the noisy preset speech;
and wherein the obtaining a predicted masking value based on the first cepstral feature comprises:
inputting the first cepstral feature into the preset neural network to calculate the predicted masking value.
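The feature-extraction formula of claim 2 can be sketched as follows. The sampling rate, frame length, and the small epsilon added before the logarithm are assumptions (the claim fixes none of them), and the logarithm is applied to the magnitude spectrum so the result stays real-valued, which is a common reading of the formula:

```python
import numpy as np
from scipy.signal import stft, istft

def cepstral_feature(mixture, fs=16000, nperseg=512):
    """Sketch of the claimed feature: ISTFT(log(STFT(mixture))).

    fs and nperseg are illustrative choices; the epsilon guards
    against log(0) on silent frames.
    """
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)     # complex spectrogram
    log_mag = np.log(np.abs(Z) + 1e-8)                  # real-valued log-magnitude
    _, cep = istft(log_mag, fs=fs, nperseg=nperseg)     # back to the time domain
    return cep
```

The resulting vector plays the role of the "first cepstral feature" fed to the preset network.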
3. The noise reduction method according to claim 2, wherein the training a preset neural network according to the predicted masking value to generate a trained neural network comprises:
acquiring a plurality of clean preset speech samples, the plurality of clean preset speech samples corresponding to the plurality of noisy preset speech samples;
calculating the actual masking value using the following formula:
[formula supplied as an image in the original publication and not reproduced here]
wherein the formula involves the clean preset speech, θ is a phase, and |·| denotes a magnitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feed-forward algorithm and the difference to generate the trained neural network.
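The actual masking value in claim 3 is given only as an embedded formula image. Since the surrounding text mentions the clean preset speech, a phase θ, and a magnitude |·|, a phase-sensitive mask is one common construction built from exactly those ingredients; the sketch below uses it purely as an illustrative stand-in, not as the patented expression, together with a mean-squared difference as the training loss:

```python
import numpy as np
from scipy.signal import stft

def actual_mask(clean, mixture, fs=16000, nperseg=512):
    """Illustrative phase-sensitive mask: |S|/|Y| * cos(theta_S - theta_Y).

    Assumption: the patent's image formula is not available, so this is
    a stand-in with the same ingredients (clean speech, phase, magnitude).
    """
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)
    theta = np.angle(S) - np.angle(Y)                      # phase difference θ
    mask = (np.abs(S) / (np.abs(Y) + 1e-8)) * np.cos(theta)
    return np.clip(mask, 0.0, 1.0)                         # keep mask in [0, 1]

def mask_loss(predicted, actual):
    """Difference between predicted and actual masks used to train the network."""
    return float(np.mean((predicted - actual) ** 2))
```

Training then reduces to minimizing this difference over the paired clean/noisy preset speech samples.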
4. The noise reduction method according to claim 3, wherein the acquiring current noisy speech and inputting the current noisy speech into the trained neural network to obtain a current masking value of the current noisy speech comprises:
extracting a second cepstral feature of the current noisy speech;
inputting the second cepstral feature into the trained neural network;
and outputting the current masking value.
5. The noise reduction method according to claim 4, wherein the performing noise reduction on the current noisy speech based on the current masking value and outputting the noise-reduced current speech comprises:
performing a short-time Fourier transform on the current noisy speech to obtain a first STFT of the current noisy speech;
multiplying the current masking value by the first STFT to obtain a second STFT of the current clean speech, and performing an inverse short-time Fourier transform on the second STFT to obtain the current clean speech;
and outputting the current clean speech.
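The noise-reduction step of claim 5 (first STFT, multiply by the current masking value, inverse STFT) can be sketched directly; the frame parameters are assumptions, and the mask is expected to have the same shape as the spectrogram:

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(noisy, mask, fs=16000, nperseg=512):
    """Claim 5 sketch: clean STFT = mask * noisy STFT, then invert."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)   # first STFT (noisy speech)
    Z = mask * Y                                    # second STFT (estimated clean speech)
    _, clean = istft(Z, fs=fs, nperseg=nperseg)     # inverse short-time Fourier transform
    return clean
```

With an all-ones mask this round trip reconstructs the input, which is a quick sanity check that the analysis/synthesis parameters match.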
6. A noise reduction device, characterized in that the device comprises:
a generating module, configured to generate a first cepstral feature of preset speech by using noisy preset speech;
a first obtaining module, configured to obtain a predicted masking value based on the first cepstral feature;
a training module, configured to train a preset neural network according to the predicted masking value to generate a trained neural network;
a second obtaining module, configured to acquire current noisy speech and input the current noisy speech into the trained neural network to obtain a current masking value of the current noisy speech;
and an output module, configured to perform noise reduction on the current noisy speech based on the current masking value and output the noise-reduced current speech.
7. The noise reduction device according to claim 6, wherein the generating module comprises:
a first acquiring submodule, configured to acquire a plurality of noisy preset speech samples;
a first extraction submodule, configured to extract the first cepstral feature using the following formula:
cepstral = ISTFT(log(STFT(mixture)));
wherein STFT() denotes the short-time Fourier transform, ISTFT() denotes the inverse short-time Fourier transform, and mixture denotes the noisy preset speech;
and wherein the first obtaining module comprises:
a first calculation submodule, configured to input the first cepstral feature into the preset neural network to calculate the predicted masking value.
8. The noise reduction device according to claim 7, wherein the training module comprises:
a second acquiring submodule, configured to acquire a plurality of clean preset speech samples, the plurality of clean preset speech samples corresponding to the plurality of noisy preset speech samples;
a second calculation submodule, configured to calculate the actual masking value using the following formula:
[formula supplied as an image in the original publication and not reproduced here]
wherein the formula involves the clean preset speech, θ is a phase, and |·| denotes a magnitude;
a third calculation submodule, configured to calculate a difference between the actual masking value and the predicted masking value;
and a training submodule, configured to train the preset neural network through a feed-forward algorithm and the difference to generate the trained neural network.
9. The noise reduction device according to claim 8, wherein the second obtaining module comprises:
a second extraction submodule, configured to extract a second cepstral feature of the current noisy speech;
an input submodule, configured to input the second cepstral feature into the trained neural network;
and a first output submodule, configured to output the current masking value.
10. The noise reduction device according to claim 9, wherein the output module comprises:
a first transform submodule, configured to perform a short-time Fourier transform on the current noisy speech to obtain a first STFT of the current noisy speech;
a second transform submodule, configured to multiply the current masking value by the first STFT to obtain a second STFT of the current clean speech, and to perform an inverse short-time Fourier transform on the second STFT to obtain the current clean speech;
and a second output submodule, configured to output the current clean speech.
CN201911413911.6A 2019-12-31 2019-12-31 Noise reduction method and device Active CN111105809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911413911.6A CN111105809B (en) 2019-12-31 2019-12-31 Noise reduction method and device

Publications (2)

Publication Number Publication Date
CN111105809A true CN111105809A (en) 2020-05-05
CN111105809B CN111105809B (en) 2022-03-22

Family

ID=70425717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911413911.6A Active CN111105809B (en) 2019-12-31 2019-12-31 Noise reduction method and device

Country Status (1)

Country Link
CN (1) CN111105809B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933172A (en) * 2020-08-10 2020-11-13 广州九四智能科技有限公司 Method and device for separating and extracting human voice, computer equipment and storage medium
CN113921022A (en) * 2021-12-13 2022-01-11 北京世纪好未来教育科技有限公司 Audio signal separation method, device, storage medium and electronic equipment
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105679312A (en) * 2016-03-04 2016-06-15 重庆邮电大学 Phonetic feature processing method of voiceprint identification in noise environment
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108447495A (en) * 2018-03-28 2018-08-24 天津大学 A kind of deep learning sound enhancement method based on comprehensive characteristics collection
CN108986798A (en) * 2018-06-27 2018-12-11 百度在线网络技术(北京)有限公司 Processing method, device and the equipment of voice data
CN109036460A (en) * 2018-08-28 2018-12-18 百度在线网络技术(北京)有限公司 Method of speech processing and device based on multi-model neural network
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
CN110136737A (en) * 2019-06-18 2019-08-16 北京拙河科技有限公司 A kind of voice de-noising method and device


Similar Documents

Publication Publication Date Title
CN106486131B (en) A kind of method and device of speech de-noising
CN111105809B (en) Noise reduction method and device
Xu et al. A regression approach to speech enhancement based on deep neural networks
Xu et al. Dynamic noise aware training for speech enhancement based on deep neural networks.
Narayanan et al. Joint noise adaptive training for robust automatic speech recognition
US9570072B2 (en) System and method for noise reduction in processing speech signals by targeting speech and disregarding noise
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
CN111128213B (en) Noise suppression method and system for processing in different frequency bands
Ganapathy et al. Robust feature extraction using modulation filtering of autoregressive models
US10614827B1 (en) System and method for speech enhancement using dynamic noise profile estimation
Yu et al. Robust speech recognition using a cepstral minimum-mean-square-error-motivated noise suppressor
Shahnaz et al. Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme
Dionelis et al. Phase-aware single-channel speech enhancement with modulation-domain Kalman filtering
US9536537B2 (en) Systems and methods for speech restoration
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
Wolfe et al. Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement
Rao et al. Robust speaker recognition on mobile devices
US20150162014A1 (en) Systems and methods for enhancing an audio signal
Nair et al. Mfcc based noise reduction in asr using kalman filtering
Wang et al. Task-aware warping factors in mask-based speech enhancement
CN111028858B (en) Method and device for detecting voice start-stop time
CN109272996A (en) A kind of noise-reduction method and system
Garg et al. Deep convolutional neural network-based speech signal enhancement using extensive speech features
Shankar et al. Noise dependent super gaussian-coherence based dual microphone speech enhancement for hearing aid application using smartphone
Mallidi et al. Robust speaker recognition using spectro-temporal autoregressive models.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant