CN111105809A - Noise reduction method and device - Google Patents
Classifications
- G10L21/0208 — Speech enhancement: noise filtering
- G10L15/063 — Creation of reference templates; training of speech recognition systems
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse (noisy) environments
- G10L25/24 — Speech or voice analysis in which the extracted parameters are the cepstrum
- G10L25/30 — Speech or voice analysis using neural networks
Abstract
The invention discloses a noise reduction method and device, comprising the following steps: generating a first cepstrum feature of a preset voice by using the preset voice with noise; obtaining a predicted masking value based on the first cepstrum feature; training a preset neural network according to the predicted masking value to generate a trained neural network; acquiring a current voice with noise and inputting it into the trained neural network to obtain a current masking value of the current voice with noise; and performing noise reduction on the current voice with noise based on the current masking value and outputting the noise-reduced current voice. Because the invention does not reduce noise separately in high and low frequency bands, no energy-imbalance problem arises, and the noise reduction result is accurate, stable and efficient; the problem that noise cannot be effectively separated because of the large energy difference between high and low frequencies is solved. Compared with a preset neural network trained on the other features required by deep-learning-based noise reduction methods, the preset neural network trained on the extracted cepstrum features is more complete, and the noise reduction effect is better.
Description
Technical Field
The invention relates to the technical field of voice data processing, in particular to a noise reduction method and device.
Background
At present, voice calls, recording, and music playback are common functions of mobile terminals. If these functions are used in a noisy environment, the large environmental noise degrades the user's call, recording, or music playback quality. To achieve a better call, recording, or playback effect, a noise reduction method based on deep learning is generally used to remove the noise components in the sound: voice noise reduction separates the noise from the human voice in mixed speech, removing as much of the noise as possible while preserving the human voice as completely as possible. This method can effectively improve the quality of voice communication and voice interaction, so that people or machines can hear clear, clean speech even in a noisy environment.
In the mainstream noise reduction methods based on deep learning in the prior art, noise reduction is achieved by acquiring the amplitude spectrum of the sound and combining it with a trained model, which has the following defect: the amplitude spectrum of the sound has an energy-imbalance problem that is difficult to overcome; because the energy difference between high and low frequencies is large, the noise reduction effect at high and low frequencies is inconsistent, the noise reduction result does not meet expectations, and the noise cannot be effectively separated.
Disclosure of Invention
In view of the problems described above, the present method trains a neural network based on the predicted masking value and the actual masking value obtained from the preset voice with noise, extracts the masking value of the current voice with noise using the trained neural network, and uses that current masking value for noise reduction.
A method of noise reduction comprising the steps of:
generating a first cepstrum feature of a preset voice by using the noisy preset voice;
obtaining a predictive masking value based on the first cepstral feature;
training a preset neural network according to the predicted masking value to generate a trained neural network;
acquiring current voice with noise, and inputting the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice.
Preferably, the generating a first cepstrum feature of the preset speech by using the noisy preset speech includes:
acquiring a plurality of preset voices with noises;
extracting the first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT() is the short-time Fourier transform, ISTFT() is the inverse short-time Fourier transform, and mixture is the preset voice with noise;
the obtaining a predictive masking value based on the first cepstral feature comprises:
inputting the first cepstral feature into the pre-set neural network to calculate the predictive masking value.
Preferably, the training a preset neural network according to the predicted masking value to generate a trained neural network includes:
acquiring a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
the actual masking value is calculated using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture);
wherein pure is the pure preset voice, mixture is the preset voice with noise, θ is the phase, and |·| is the amplitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feed-forward algorithm and the difference value to generate a trained neural network.
Preferably, the obtaining the current voice with noise and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise includes:
extracting a second cepstrum feature of the current noisy speech;
inputting the second cepstral feature into the trained neural network;
and outputting the current masking value.
Preferably, the performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the noise-reduced current voice includes:
carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and outputting the current pure voice.
A noise reducing device, the device comprising:
the generating module is used for generating a first cepstrum feature of a preset voice by using the noisy preset voice;
a first obtaining sub-module, configured to obtain a predicted masking value based on the first cepstral feature;
the training module is used for training a preset neural network according to the prediction masking value so as to generate a trained neural network;
the second acquisition module is used for acquiring the current voice with noise, and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise;
and the output module is used for carrying out noise reduction processing on the current voice with noise based on the current masking value and outputting the current voice after noise reduction.
Preferably, the generating module includes:
the first obtaining submodule is used for obtaining a plurality of preset voices with noises;
a first extraction submodule for extracting the first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT() is the short-time Fourier transform, ISTFT() is the inverse short-time Fourier transform, and mixture is the preset voice with noise;
the first obtaining module includes:
a first computation submodule configured to input the first cepstral feature into the preset neural network to compute the predictive masking value.
Preferably, the training module includes:
the second obtaining submodule is used for obtaining a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
a second calculation submodule for calculating the actual masking value using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture);
wherein pure is the pure preset voice, mixture is the preset voice with noise, θ is the phase, and |·| is the amplitude;
a third computation submodule for computing a difference between said actual masking value and said predicted masking value;
and the training submodule is used for training the preset neural network through a feedforward algorithm and the difference value so as to generate a trained neural network.
Preferably, the second obtaining module includes:
the second extraction submodule is used for extracting a second cepstrum feature of the current voice with noise;
an input sub-module, configured to input the second cepstral feature into the trained neural network;
and the first output submodule is used for outputting the current masking value.
Preferably, the output module includes:
the first conversion submodule is used for carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
the second conversion submodule is used for multiplying the current masking value and the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and the second output submodule is used for outputting the current pure voice.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart of a noise reduction method provided by the present invention;
FIG. 2 is another flowchart of a noise reduction method provided by the present invention;
FIG. 3 is a working flowchart of a noise reduction method provided by the present invention;
FIG. 4 is a structural diagram of a noise reduction device provided by the present invention;
FIG. 5 is another structural diagram of a noise reduction device provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
At present, voice calls, recording, and music playback are common functions of mobile terminals. If these functions are used in a noisy environment, the large environmental noise degrades the user's call, recording, or music playback quality. To achieve a better call, recording, or playback effect, the noise components in the sound are usually removed by a noise reduction method: voice noise reduction separates the noise from the human voice in mixed speech, removing as much of the noise as possible while preserving the human voice as completely as possible. This method can effectively improve the quality of voice communication and voice interaction, so that people or machines can hear clear, clean speech even in a noisy environment.
In the mainstream noise reduction methods based on deep learning in the prior art, noise reduction is achieved by acquiring the amplitude spectrum of the sound and combining it with a trained model, which has the following defects: 1. the amplitude spectrum of the sound has an energy-imbalance problem that is difficult to overcome; because the energy difference between high and low frequencies is large, the noise reduction effect at high and low frequencies is inconsistent, the noise reduction result does not meet expectations, and the noise cannot be effectively separated; 2. a large number of features must be extracted as training data, and the completeness and reliability of the extracted features cannot be guaranteed, so the noise reduction effect is degraded. To solve the above problems, this embodiment discloses a method that trains a neural network based on the predicted masking value and the actual masking value obtained from a preset voice with noise, extracts the masking value of the current voice with noise using the trained neural network, and performs noise reduction using that current masking value.
A method of noise reduction, as shown in fig. 1, comprising the steps of:
s101, generating a first cepstrum feature of preset voice by using the preset voice with noise;
step S102, obtaining a prediction masking value based on the first cepstrum characteristic;
step S103, training a preset neural network according to the prediction masking value to generate a trained neural network;
step S104, acquiring the current voice with noise, and inputting the current voice with noise into the trained neural network to obtain the current masking value of the current voice with noise;
and S105, performing noise reduction processing on the current voice with noise based on the current masking value, and outputting the current voice after noise reduction.
The working principle of the above technical solution is as follows: a predicted masking value is generated in advance from the preset voice with noise, and a preset neural network is trained according to the predicted masking value to generate a trained neural network; the current masking value of the current voice with noise is then obtained from the trained neural network, the current voice with noise is denoised according to the current masking value, and the noise-reduced current voice is output.
The beneficial effects of the above technical scheme are: by utilizing the masking value to reduce noise, the noise reduction method provided by the invention does not relate to noise reduction by utilizing high and low frequencies, namely, the problem of energy conservation does not exist, and the noise reduction result is excellent, stable and high in efficiency. The problem of among the prior art because the big noise reduction effect that produces of high low frequency energy difference is low can't effectual separation noise is solved.
In one embodiment, generating the first cepstrum feature of the preset speech by using the noisy preset speech includes:
acquiring a plurality of preset voices with noises;
extracting a first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT() is the short-time Fourier transform, ISTFT() is the inverse short-time Fourier transform, and mixture is the preset voice with noise;
obtaining a predictive masking value based on the first cepstral feature, comprising:
the first cepstral features are input into a preset neural network to calculate a predicted masking value.
The beneficial effects of the above technical scheme are: the method comprises the steps of obtaining a plurality of preset voices with noises to obtain various masking values to deal with different conditions, and then training a preset neural network according to the various masking values, so that a training model is more complete, and the problem that the current voices with noises cannot be effectively denoised because the current voices with noises contain the masking values which can not be recognized by the preset neural network is solved. Meanwhile, the preset neural network trained by the acquired cepstrum features is more perfect and the noise reduction effect is better than the preset neural network trained by other features required by the noise reduction method based on the deep learning technology.
In one embodiment, training a preset neural network according to the predicted masking value to generate a trained neural network includes:
acquiring a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
the actual masking value is calculated using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture);
wherein pure is the pure preset voice, mixture is the preset voice with noise, θ is the phase, and |·| is the amplitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feedforward algorithm and the difference value to generate the trained neural network.
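The actual-mask computation and the difference-driven update above can be sketched as follows. The PSM formula used is the standard phase-sensitive mask, chosen because it matches the patent's symbols (the original formula image is not in the text); the linear model and plain gradient descent on the MSE are illustrative simplifications of the unspecified network and training algorithm:

```python
import numpy as np

def phase_sensitive_mask(stft_pure, stft_mixture, eps=1e-8):
    """Actual masking value (standard PSM, assumed formula):
    PSM = (|pure| / |mixture|) * cos(theta_pure - theta_mixture)."""
    ratio = np.abs(stft_pure) / (np.abs(stft_mixture) + eps)
    return ratio * np.cos(np.angle(stft_pure) - np.angle(stft_mixture))

def train_step(w, feats, actual_mask, lr=0.1):
    """One update of a toy linear 'preset network': the difference
    between predicted and actual masking values drives a gradient
    step on the MSE loss. Returns updated weights and current loss."""
    predicted = feats @ w                          # forward pass
    difference = predicted - actual_mask           # predicted vs actual
    grad = 2.0 * feats.T @ difference / len(feats)
    return w - lr * grad, np.mean(difference ** 2)
```

Iterating train_step shrinks the difference, which is the loop the patent describes before saving the trained network.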
The beneficial effects of the above technical scheme are: the difference value calculation is carried out on the predicted masking value and the actual masking value of the same type of voice with noise and pure voice, and then the preset neural network is optimized through a feedforward algorithm, so that the preset neural network is used for containing more masking values, the trained neural network is more perfect, and the better noise reduction effect of the current voice with noise can be achieved.
In one embodiment, as shown in fig. 2, obtaining a current noisy speech, and inputting the current noisy speech into a trained neural network to obtain a current masking value of the current noisy speech includes:
step S201, extracting a second cepstrum feature of the current voice with noise;
step S202, inputting the second cepstrum characteristic into the trained neural network;
and step S203, outputting the current masking value.
The beneficial effects of the above technical scheme are: the method comprises the steps of inputting the current voice with noise into a trained neural network, obtaining a current estimated masking value according to the masking value in the neural network, extracting cepstrum characteristics, obtaining envelope and harmonic characteristics in a voice signal at the same time, and then reducing the noise of the current voice with noise according to the current estimated masking value, so that the problem that the noise reduction effect is not ideal due to the fact that the envelope characteristics and the harmonic characteristics cannot be obtained at the same time in the prior art is solved.
In one embodiment, denoising a current noisy speech based on a current masking value, and outputting the denoised current speech, includes:
carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and outputting the current pure voice.
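The mask-application steps above can be sketched with SciPy's STFT/ISTFT; the sample rate and segment length are assumed values, and current_mask must match the STFT's (frequency, time) shape:

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(noisy, current_mask, fs=16000, nperseg=512):
    """Apply a precomputed masking value to the current noisy voice:
    take the first STFT of the noisy speech, multiply by the mask to
    get the second (clean-speech) STFT, then inverse-transform.
    fs and nperseg are assumptions, not specified by the patent."""
    _, _, first_stft = stft(noisy, fs=fs, nperseg=nperseg)
    second_stft = current_mask * first_stft       # masked clean STFT
    _, current_pure = istft(second_stft, fs=fs, nperseg=nperseg)
    return current_pure
```

With an all-ones mask the signal is reconstructed unchanged, which is a quick sanity check that the STFT/ISTFT pair is consistent.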
The beneficial effects of the above technical solution are: the noise part of the current voice with noise is removed by computing the second STFT of the current pure voice from the current masking value and the first STFT of the current voice with noise, and the voice signal is then recovered through the inverse short-time Fourier transform to obtain the current pure voice. Compared with the prior art, in which noise reduction can only be realized by a neural network after combining the Fourier transform and its inverse in training on the current voice with noise, this avoids the problems of complex operation and reduced noise reduction efficiency.
In one embodiment, as shown in FIG. 3, the method includes:
(1) Extract the cepstrum feature of the noisy speech mixture; the formula is as follows:
cepstral=ISTFT(log(STFT(mixture)));
STFT and ISTFT are the short-time fourier transform and its inverse, respectively.
(2) Calculate the PSM (phase-sensitive mask) between the noisy mixture and the corresponding pure speech pure; the formula is as follows:
PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture);
where |·| represents amplitude and θ represents phase.
(3) The neural network is trained by a feedforward algorithm using MSE (mean square error) as a loss function, and the trained network is saved.
(4) Input the features of the voice with noise into the trained model to obtain the predicted PSM, multiply the PSM with the spectrum of the voice with noise, and perform the inverse Fourier transform to obtain the enhanced voice. In use, the user only needs to input the voice with noise into the model to obtain the enhanced voice.
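Steps (1)-(4) compose into the following inference sketch. Here model_predict is a hypothetical callable standing in for the trained model, and the log-magnitude feature is a simplified stand-in for the patent's cepstrum feature; fs and nperseg are assumed values:

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, model_predict, fs=16000, nperseg=512):
    """End-to-end inference sketch: feature extraction -> predicted
    PSM -> multiply with the noisy spectrum -> inverse transform.
    model_predict (features -> mask of the same shape) stands in
    for the trained network saved in step (3)."""
    _, _, spec = stft(noisy, fs=fs, nperseg=nperseg)
    feats = np.log(np.abs(spec) + 1e-8)   # simplified feature stand-in
    mask = model_predict(feats)           # predicted PSM
    _, enhanced = istft(mask * spec, fs=fs, nperseg=nperseg)
    return enhanced
```

A user-facing system would wrap exactly this call: noisy audio in, enhanced audio out.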
The working principle and beneficial effects of the above technical solution are as follows: because voice essentially exists in the form of harmonics plus an envelope, the envelope can be described by low-order cepstrum coefficients and the harmonics by high-order cepstrum coefficients; that is, the cepstrum directly decouples the excitation and the vocal tract of the voice. Compared with the common frequency-domain energy-spectrum features, the selected features have no energy-balance problem and contain harmonic information that features such as MFCC/GFCC do not. In our experiments, with the same network and data, the cepstrum features outperformed other features under various speech quality evaluation methods, and the subjective listening quality was clearly better.
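The envelope/harmonic decoupling described above can be observed numerically: for a harmonic-rich frame, the log-magnitude spectrum ripples with period f0, so the real cepstrum shows a sharp peak at the pitch period (a high quefrency) while the envelope occupies the low quefrencies. All signal parameters below are illustrative:

```python
import numpy as np

fs = 8000
f0 = 125                       # pitch; period = fs / f0 = 64 samples
n = 1024                       # exactly 16 pitch periods fit the frame
t = np.arange(n) / fs
# harmonic-rich frame: 10 harmonics of f0
frame = sum(np.cos(2 * np.pi * f0 * k * t) for k in range(1, 11))
log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-8)
cep = np.fft.irfft(log_mag)    # real cepstrum of the frame
# the harmonic structure shows up as a peak at the pitch period
pitch_quefrency = np.argmax(cep[40:100]) + 40
```

The peak lands at quefrency 64 (the 125 Hz pitch period at 8 kHz), illustrating why high-order cepstrum coefficients carry the harmonic information that magnitude-spectrum features discard.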
The present embodiment also provides a noise reduction apparatus, as shown in fig. 4, the apparatus including:
a generating module 401, configured to generate a first cepstrum feature of a preset voice by using a noisy preset voice;
a first obtaining module 402, configured to obtain a predicted masking value based on the first cepstral feature;
a training module 403, configured to train a preset neural network according to the predicted masking value to generate a trained neural network;
a second obtaining module 404, configured to obtain a current voice with noise, and input the current voice with noise into the trained neural network to obtain a current masking value of the current voice with noise;
and an output module 405, configured to perform noise reduction processing on the current voice with noise based on the current masking value, and output the current voice after noise reduction.
In one embodiment, a generation module includes:
the first obtaining submodule is used for obtaining a plurality of preset voices with noises;
a first extraction submodule for extracting a first cepstral feature using the following formula:
cepstral=ISTFT(log(STFT(mixture)));
wherein STFT() is the short-time Fourier transform, ISTFT() is the inverse short-time Fourier transform, and mixture is the preset voice with noise;
the first obtaining module includes:
and the first calculation sub-module is used for inputting the first cepstrum feature into a preset neural network to calculate a prediction masking value.
In one embodiment, a training module, comprising:
the second obtaining submodule is used for obtaining a plurality of pure preset voices; the plurality of pure preset voices correspond to the plurality of preset voices with noises;
a second calculation submodule for calculating the actual masking value using the following formula:
PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture);
wherein pure is the pure preset voice, mixture is the preset voice with noise, θ is the phase, and |·| is the amplitude;
a third computation submodule for computing a difference between said actual masking value and said predicted masking value;
and the training submodule is used for training the preset neural network through a feedforward algorithm and the difference value so as to generate the trained neural network.
In one embodiment, as shown in fig. 5, the second obtaining module includes:
the second extraction submodule 4041 is used for extracting a second cepstrum feature of the current voice with noise;
an input sub-module 4042, configured to input the second cepstrum feature into the trained neural network;
a first output submodule 4043, configured to output the current masking value.
In one embodiment, an output module includes:
the first conversion submodule is used for carrying out short-time Fourier transform on the current voice with noise to obtain a first STFT of the current voice with noise;
the second conversion submodule is used for multiplying the current masking value by the first STFT to obtain a second STFT of the current pure voice, and performing short-time inverse Fourier transform on the second STFT to obtain the current pure voice;
and the second output submodule is used for outputting the current pure voice.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention merely refer to different stages of the application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A noise reduction method, comprising the steps of:
generating a first cepstral feature of preset speech from noisy preset speech;
obtaining a predicted masking value based on the first cepstral feature;
training a preset neural network according to the predicted masking value to generate a trained neural network;
acquiring current noisy speech, and inputting the current noisy speech into the trained neural network to obtain a current masking value of the current noisy speech;
and performing noise reduction on the current noisy speech based on the current masking value, and outputting the noise-reduced current speech.
2. The noise reduction method according to claim 1, wherein the generating a first cepstral feature of preset speech from noisy preset speech comprises:
acquiring a plurality of noisy preset speech signals;
extracting the first cepstral feature using the following formula:
cepstral = ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
and the obtaining a predicted masking value based on the first cepstral feature comprises:
inputting the first cepstral feature into the preset neural network to calculate the predicted masking value.
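The feature-extraction formula cepstral = ISTFT(log(STFT(mixture))) can be sketched per frame as below. One assumption: the log is applied to the STFT magnitude (the claim does not say whether the magnitude or the complex spectrum is logged), which makes each frame's feature the classical real cepstrum.

```python
import numpy as np

def cepstral_features(mixture, n_fft=512, hop=256, eps=1e-8):
    """Per-frame cepstral = ISTFT(log(STFT(mixture))); the log is applied to
    the magnitude here (an assumption - the claim does not specify)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(mixture) - n_fft) // hop
    feats = []
    for i in range(n_frames):
        frame = mixture[i * hop:i * hop + n_fft] * win
        log_spec = np.log(np.abs(np.fft.rfft(frame)) + eps)  # log(STFT(mixture))
        feats.append(np.fft.irfft(log_spec, n=n_fft))        # inverse transform per frame
    return np.stack(feats)                                   # (frames, n_fft) features

mixture = np.random.randn(4096)        # stand-in noisy preset speech
cepstral = cepstral_features(mixture)  # first cepstral feature fed to the network
```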
3. The noise reduction method according to claim 2, wherein the training a preset neural network according to the predicted masking value to generate a trained neural network comprises:
acquiring a plurality of pure preset speech signals, the plurality of pure preset speech signals corresponding to the plurality of noisy preset speech signals;
calculating an actual masking value using the following formula:
[formula not legible in the source text]
wherein the pure preset speech is used, θ is a phase, and |·| is an amplitude;
calculating a difference between the actual masking value and the predicted masking value;
and training the preset neural network through a feedforward algorithm and the difference value to generate the trained neural network.
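The training step of this claim, comparing the predicted masking value with an actual masking value and updating the network from the difference, can be sketched with a toy one-layer network. Everything here is a stand-in: the actual-mask formula in the claim is not legible, so an ideal-ratio-mask-style target is used purely for illustration, and plain gradient descent replaces whatever the "feedforward algorithm" denotes in the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: cepstral features and per-frame spectral magnitudes.
feats = rng.standard_normal((100, 32))              # first cepstral features
clean_mag = np.abs(rng.standard_normal((100, 16)))  # pure preset speech magnitude
noise_mag = np.abs(rng.standard_normal((100, 16)))
mix_mag = clean_mag + noise_mag                     # noisy preset speech magnitude

# Assumed "actual masking value": an ideal-ratio-mask-style target.
# The patent's own formula (involving a phase term and amplitudes) is not
# legible in the source, so this target is illustrative only.
actual_mask = clean_mag / (mix_mag + 1e-8)

# Minimal one-layer "preset neural network" with a sigmoid output.
W = rng.standard_normal((32, 16)) * 0.1
b = np.zeros(16)
losses = []

for step in range(200):
    pred = 1.0 / (1.0 + np.exp(-(feats @ W + b)))  # predicted masking value
    diff = pred - actual_mask                      # difference used for training
    losses.append(np.mean(diff ** 2))              # MSE between actual and predicted
    grad = pred * (1.0 - pred) * diff              # gradient through the sigmoid
    W -= 0.1 * feats.T @ grad / len(feats)         # plain gradient-descent update
    b -= 0.1 * grad.mean(axis=0)
```

The loss decreases over the iterations, mimicking how the difference between predicted and actual masks drives the network toward the trained state.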
4. The method according to claim 3, wherein the acquiring current noisy speech and inputting the current noisy speech into the trained neural network to obtain a current masking value of the current noisy speech comprises:
extracting a second cepstral feature of the current noisy speech;
inputting the second cepstral feature into the trained neural network;
and outputting the current masking value.
5. The method according to claim 4, wherein the performing noise reduction on the current noisy speech based on the current masking value and outputting the noise-reduced current speech comprises:
performing a short-time Fourier transform on the current noisy speech to obtain a first STFT of the current noisy speech;
multiplying the current masking value by the first STFT to obtain a second STFT of the current pure speech, and performing an inverse short-time Fourier transform on the second STFT to obtain the current pure speech;
and outputting the current pure speech.
6. A noise reduction device, comprising:
a generating module, configured to generate a first cepstral feature of preset speech from noisy preset speech;
a first obtaining module, configured to obtain a predicted masking value based on the first cepstral feature;
a training module, configured to train a preset neural network according to the predicted masking value to generate a trained neural network;
a second obtaining module, configured to acquire current noisy speech and input the current noisy speech into the trained neural network to obtain a current masking value of the current noisy speech;
and an output module, configured to perform noise reduction on the current noisy speech based on the current masking value and output the noise-reduced current speech.
7. The noise reduction device according to claim 6, wherein the generating module comprises:
a first obtaining submodule, configured to acquire a plurality of noisy preset speech signals;
a first extraction submodule, configured to extract the first cepstral feature using the following formula:
cepstral = ISTFT(log(STFT(mixture)));
wherein STFT(·) is the short-time Fourier transform, ISTFT(·) is the inverse short-time Fourier transform, and mixture is the noisy preset speech;
and the first obtaining module comprises:
a first calculation submodule, configured to input the first cepstral feature into the preset neural network to calculate the predicted masking value.
8. The noise reduction device according to claim 7, wherein the training module comprises:
a second obtaining submodule, configured to acquire a plurality of pure preset speech signals, the plurality of pure preset speech signals corresponding to the plurality of noisy preset speech signals;
a second calculation submodule, configured to calculate the actual masking value using the following formula:
[formula not legible in the source text]
wherein the pure preset speech is used, θ is a phase, and |·| is an amplitude;
a third calculation submodule, configured to calculate a difference between the actual masking value and the predicted masking value;
and a training submodule, configured to train the preset neural network through a feedforward algorithm and the difference value to generate the trained neural network.
9. The noise reduction device according to claim 8, wherein the second obtaining module comprises:
a second extraction submodule, configured to extract a second cepstral feature of the current noisy speech;
an input submodule, configured to input the second cepstral feature into the trained neural network;
and a first output submodule, configured to output the current masking value.
10. The noise reduction device according to claim 9, wherein the output module comprises:
a first conversion submodule, configured to perform a short-time Fourier transform on the current noisy speech to obtain a first STFT of the current noisy speech;
a second conversion submodule, configured to multiply the current masking value by the first STFT to obtain a second STFT of the current pure speech, and perform an inverse short-time Fourier transform on the second STFT to obtain the current pure speech;
and a second output submodule, configured to output the current pure speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911413911.6A CN111105809B (en) | 2019-12-31 | 2019-12-31 | Noise reduction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111105809A true CN111105809A (en) | 2020-05-05 |
CN111105809B CN111105809B (en) | 2022-03-22 |
Family
ID=70425717
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911413911.6A Active CN111105809B (en) | 2019-12-31 | 2019-12-31 | Noise reduction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111105809B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111933172A (en) * | 2020-08-10 | 2020-11-13 | 广州九四智能科技有限公司 | Method and device for separating and extracting human voice, computer equipment and storage medium |
CN113921022A (en) * | 2021-12-13 | 2022-01-11 | 北京世纪好未来教育科技有限公司 | Audio signal separation method, device, storage medium and electronic equipment |
CN114220448A (en) * | 2021-12-16 | 2022-03-22 | 游密科技(深圳)有限公司 | Voice signal generation method and device, computer equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105679312A (en) * | 2016-03-04 | 2016-06-15 | 重庆邮电大学 | Phonetic feature processing method of voiceprint identification in noise environment |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN108447495A (en) * | 2018-03-28 | 2018-08-24 | 天津大学 | A kind of deep learning sound enhancement method based on comprehensive characteristics collection |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
CN109036460A (en) * | 2018-08-28 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Method of speech processing and device based on multi-model neural network |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
CN110136737A (en) * | 2019-06-18 | 2019-08-16 | 北京拙河科技有限公司 | A kind of voice de-noising method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106486131B (en) | A kind of method and device of speech de-noising | |
CN111105809B (en) | Noise reduction method and device | |
Xu et al. | A regression approach to speech enhancement based on deep neural networks | |
Xu et al. | Dynamic noise aware training for speech enhancement based on deep neural networks. | |
Narayanan et al. | Joint noise adaptive training for robust automatic speech recognition | |
US9570072B2 (en) | System and method for noise reduction in processing speech signals by targeting speech and disregarding noise | |
Xiao et al. | Normalization of the speech modulation spectra for robust speech recognition | |
CN111128213B (en) | Noise suppression method and system for processing in different frequency bands | |
Ganapathy et al. | Robust feature extraction using modulation filtering of autoregressive models | |
US10614827B1 (en) | System and method for speech enhancement using dynamic noise profile estimation | |
Yu et al. | Robust speech recognition using a cepstral minimum-mean-square-error-motivated noise suppressor | |
Shahnaz et al. | Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme | |
Dionelis et al. | Phase-aware single-channel speech enhancement with modulation-domain Kalman filtering | |
US9536537B2 (en) | Systems and methods for speech restoration | |
CN111883154B (en) | Echo cancellation method and device, computer-readable storage medium, and electronic device | |
Wolfe et al. | Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement | |
Rao et al. | Robust speaker recognition on mobile devices | |
US20150162014A1 (en) | Systems and methods for enhancing an audio signal | |
Nair et al. | Mfcc based noise reduction in asr using kalman filtering | |
Wang et al. | Task-aware warping factors in mask-based speech enhancement | |
CN111028858B (en) | Method and device for detecting voice start-stop time | |
CN109272996A (en) | A kind of noise-reduction method and system | |
Garg et al. | Deep convolutional neural network-based speech signal enhancement using extensive speech features | |
Shankar et al. | Noise dependent super gaussian-coherence based dual microphone speech enhancement for hearing aid application using smartphone | |
Mallidi et al. | Robust speaker recognition using spectro-temporal autoregressive models. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||