CN117153183A - Voice enhancement method, equipment and storage medium based on neural network - Google Patents
- Publication number
- CN117153183A (application number CN202311130448.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- frequency domain
- neural network
- noise
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
- G10L21/0232—Processing in the frequency domain
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention relates to the field of voice enhancement, and discloses a voice enhancement method, device, and storage medium based on a neural network. The method comprises the following steps: receiving time domain noise voice data; converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data; performing a complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain magnitude spectrum; performing a mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask; performing dot multiplication on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data; and performing an inverse transform on the frequency domain enhanced voice data according to a preset inverse Fourier transform algorithm to obtain time domain enhanced voice data. The embodiments of the invention solve the technical problem that noise filtering algorithms whose neural networks can run on a small singlechip (microcontroller) cannot filter and suppress sudden noise.
Description
Technical Field
The present invention relates to the field of speech enhancement, and in particular, to a method, apparatus, and storage medium for speech enhancement based on a neural network.
Background
Conventional noise reduction algorithms rely on energy decisions and threshold settings to form the noise estimate, which can make the estimate unstable, especially when the device moves from a low-noise environment to a high-noise environment. Conventional noise reduction also tends to introduce musical noise and has difficulty suppressing sudden noise (such as the sounds of keyboard strikes, a pen falling to the floor, or objects colliding).
The existing open-source RNNoise noise reduction algorithm is small and of low complexity, and can be ported to run on a singlechip; it combines conventional noise reduction with a neural network, but the model involves VAD and pitch algorithms, so its feature extraction is relatively complex. Other mainstream open-source neural network algorithms such as DNN, CNN, and RNN models are comparatively large, and their huge parameter counts make them unsuitable for porting to a singlechip.
Therefore, there remains the technical problem that a noise filtering algorithm whose neural network runs on a small singlechip cannot cope with sudden noise and filter and suppress it.
Disclosure of Invention
The invention mainly aims to solve the technical problem that a noise filtering algorithm whose neural network runs on a small singlechip cannot filter and suppress sudden noise.
The first aspect of the present invention provides a neural network-based speech enhancement method, which includes:
receiving time domain noise voice data;
converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data;
performing a complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain magnitude spectrum;
performing mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask;
performing dot multiplication processing on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data;
and performing inverse transform processing on the frequency domain enhanced voice data according to a preset inverse Fourier transform algorithm to obtain time domain enhanced voice data.
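The six steps above can be sketched end to end. The following is a minimal illustration, assuming a scipy-based STFT/ISTFT with a 512-sample window and 75% overlap (consistent with the training description later in this document) and a placeholder `mask_fn` standing in for the preset neural network:

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, mask_fn, fs=16000, n_fft=512):
    """Sketch of the claimed pipeline: STFT -> magnitude -> mask -> dot-multiply -> ISTFT.
    `mask_fn` is a stand-in for the preset neural network: it maps a magnitude
    spectrogram to a complement mask of the same shape with values in [0, 1]."""
    n_overlap = n_fft * 3 // 4                        # 75% frame overlap
    # Step 2: short-time Fourier transform -> frequency domain noise voice data
    _, _, S = stft(noisy, fs=fs, nperseg=n_fft, noverlap=n_overlap)
    # Step 3: complex absolute value -> frequency domain magnitude spectrum
    mag = np.abs(S)
    # Step 4: mask estimation (placeholder for the neural network)
    W = mask_fn(mag)
    # Step 5: dot multiplication of the complex spectrum and the complement mask
    S_enh = S * W
    # Step 6: inverse short-time Fourier transform -> time domain enhanced voice data
    _, enhanced = istft(S_enh, fs=fs, nperseg=n_fft, noverlap=n_overlap)
    return enhanced

# With an all-ones mask the pipeline should pass the signal through unchanged
# (up to floating-point reconstruction error).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
y = enhance(x, mask_fn=np.ones_like)
```

The all-ones-mask round trip is a useful sanity check that the STFT parameters satisfy the overlap-add reconstruction condition before any trained mask is applied.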
Optionally, in a first implementation manner of the first aspect of the present invention, before performing a mask estimation operation on the frequency domain magnitude spectrum according to the preset neural network to obtain a complement mask, the method further includes:
establishing an input layer for collecting data;
connecting a hidden layer to the information output end of the input layer, and connecting an output layer to the information output end of the hidden layer;
and determining the combination of the input layer, the hidden layer and the output layer as a neural network.
Optionally, in a second implementation manner of the first aspect of the present invention, the hidden layer includes a first fully-connected layer, a first GRU connection layer, a second GRU connection layer, a second fully-connected layer, and a third fully-connected layer, and performing the mask estimation operation on the frequency domain magnitude spectrum according to the preset neural network to obtain the complement mask comprises the following steps:
the input layer acquires amplitude data from the frequency domain amplitude spectrum, performs input connection processing on the amplitude data to obtain connection input data, and inputs the connection input data to the first full-connection layer;
the first full connection layer receives the connection input data, performs linear processing on the connection input data and all data of the first full connection layer, performs nonlinear processing on the connection input data and all data of the first full connection layer through a ReLU activation function, generates first conduction data, and transmits the first conduction data to the first GRU connection layer;
the first GRU connection layer receives the first conduction data, performs linear processing on the first conduction data to obtain second conduction data, and transmits the second conduction data to the second GRU connection layer;
the second GRU connection layer receives the second conduction data, performs linear processing on the second conduction data, performs nonlinear processing on the second conduction data through a ReLU activation function, generates third conduction data, and transmits the third conduction data to the second full connection layer;
the second full connection layer receives the third conduction data, performs linear processing on the third conduction data, performs nonlinear processing on the third conduction data through a ReLU activation function, generates fourth conduction data, and transmits the fourth conduction data to the third full connection layer;
the third full connection layer receives the fourth conduction data, performs linear processing on the fourth conduction data, performs nonlinear processing through a Sigmoid activation function, generates fifth conduction data, and transmits the fifth conduction data to the output layer;
the output layer receives the fifth conductive data and outputs a complement mask.
Optionally, in a third implementation manner of the first aspect of the present invention, the acquiring, by the input layer, of amplitude data from the frequency domain magnitude spectrum includes:
and the input layer collects the first 257 points of the frequency domain magnitude spectrum to obtain amplitude data of 257 data points.
Optionally, in a fourth implementation manner of the first aspect of the present invention, before performing a mask estimation operation on the frequency domain magnitude spectrum according to the preset neural network to obtain a complement mask, the method further includes:
receiving clean voice data and noise data;
combining the pure voice data and the noise data to obtain combined acoustic data;
converting the combined acoustic data according to a preset short-time Fourier transform algorithm to obtain frequency domain acoustic data;
performing complex absolute value square operation on the frequency domain acoustic data to obtain predicted acoustic data;
performing optimal mask operation on the clean voice data and the noise data according to a preset wiener filter to obtain optimal mask data;
performing mask estimation operation on the predicted acoustic data according to a preset neural network to obtain a predicted mask;
performing minimum mean square error operation on the optimal mask data and the prediction mask to obtain a variance value;
and adjusting parameters of the neural network according to the variance value to generate a trained neural network.
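Steps 401-408 amount to supervised mask regression. A minimal sketch, assuming PyTorch and a toy stand-in network (the actual architecture is described later in this document); the names and shapes here are illustrative assumptions, not the patent's code:

```python
import torch
import torch.nn as nn

def train_step(net, optimizer, noisy_mag, optimal_mask):
    """One iteration of the described training: predict a mask from the noisy
    magnitude spectrum, compare it against the optimal (wiener) mask with a
    mean square error, and adjust the network parameters from that loss."""
    pred_mask = net(noisy_mag)                               # mask estimation operation
    loss = nn.functional.mse_loss(pred_mask, optimal_mask)   # minimum mean square error operation
    optimizer.zero_grad()
    loss.backward()                                          # adjust parameters from the variance value
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(257, 257), nn.Sigmoid())       # toy stand-in for the real network
opt = torch.optim.Adam(net.parameters(), lr=1e-4)            # Adam with initial learning rate 0.0001
noisy_mag = torch.rand(8, 100, 257)                          # (batch, frames, 257) magnitudes
optimal_mask = torch.rand(8, 100, 257)                       # optimal mask data, same shape
losses = [train_step(net, opt, noisy_mag, optimal_mask) for _ in range(3)]
```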
Optionally, in a fifth implementation manner of the first aspect of the present invention, performing, according to a preset wiener filter, an optimal mask operation on the clean speech data and the noise data, to obtain optimal mask data includes:
according to a preset short-time Fourier transform algorithm, converting the clean voice data to obtain frequency domain clean voice data;
according to a preset short-time Fourier transform algorithm, converting the noise data to obtain frequency domain noise data;
substituting the frequency domain noise data and the frequency domain pure voice data into a preset wiener filter to generate optimal mask data.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the wiener filter includes:
W = P_x / (P_x + P_n)
where W is the optimal mask data, P_x is the power spectrum of the frequency domain clean voice data, P_n is the power spectrum of the frequency domain noise data, x is the clean voice data, and n is the noise data.
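As a sketch of the optimal mask operation, assuming the power spectra are computed from complex STFTs (the small `eps` guard is an added assumption to avoid division by zero in silent bins):

```python
import numpy as np

def wiener_mask(S_clean, S_noise, eps=1e-12):
    """Optimal (wiener) mask W = P_x / (P_x + P_n), built from the complex STFTs
    of the clean voice data and the noise data; `eps` avoids division by zero."""
    P_x = np.abs(S_clean) ** 2     # power spectrum of frequency domain clean voice data
    P_n = np.abs(S_noise) ** 2     # power spectrum of frequency domain noise data
    return P_x / (P_x + P_n + eps)

W_no_noise = wiener_mask(np.array([[3.0 + 4.0j]]), np.array([[0.0 + 0.0j]]))  # no noise -> mask near 1
W_equal = wiener_mask(np.array([[1.0 + 0.0j]]), np.array([[1.0 + 0.0j]]))     # equal power -> mask 0.5
```

The two sample values show the expected behaviour: the mask approaches 1 where the bin is dominated by speech and 0.5 where speech and noise power are equal.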
Optionally, in a seventh implementation manner of the first aspect of the present invention, the receiving time domain noise voice data includes:
and receiving a URL address, and capturing time domain noise voice data from the URL address.
A second aspect of the present invention provides a voice enhancement device based on a neural network, comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the neural network-based speech enhancement device to perform the neural network-based speech enhancement method described above.
A third aspect of the present invention provides a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to perform the above-described neural network-based speech enhancement method.
In the embodiment of the invention, a short-time Fourier transform is applied to all audio subject to sudden noise interference to obtain frequency domain data; a neural network deployed on a singlechip then distinguishes noise from clean sound in the frequency domain and produces a complement mask for the corresponding acoustic data; the complement mask is used to modify the acoustic data, eliminating the noise, and the result is converted back to time domain audio data. This generates noise-reduced, enhanced audio data, and solves the technical problem that a noise filtering algorithm whose neural network runs on a small singlechip cannot filter and suppress sudden noise.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a neural network-based speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of a neural network-based speech enhancement method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a neural network according to an embodiment of the present invention;
FIG. 4 is a first embodiment of a neural network training method in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of data processing at steps 401-404 according to an embodiment of the present invention;
FIG. 6 is a second embodiment of a neural network training method in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of data processing at step 405 in an embodiment of the invention;
FIG. 8 is a schematic diagram of an embodiment of a neural network-based speech enhancement device in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a voice enhancement method, equipment and storage medium based on a neural network.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit its scope.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where an embodiment of a neural network-based speech enhancement method in an embodiment of the present invention includes:
101. receiving time domain noise voice data;
102. converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data;
103. performing complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain amplitude spectrum;
in steps 101-103, the earphone directly receives external time domain noise voice data s, performs an STFT (short-time Fourier transform) on it to obtain frequency domain noise voice data S, and then extracts the magnitude to obtain the frequency domain magnitude spectrum. In detail, the processing follows this mathematical logic:
S = STFT(s), |S| = abs(S), where STFT() is the short-time Fourier transform function, S is the frequency domain noise voice data, s is the time domain noise voice data, | | is the complex absolute value operator taking the magnitude, and abs() is the absolute value operation function.
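The 257-point input used below follows directly from the transform size: a 512-point STFT of a real signal yields 512/2 + 1 = 257 non-redundant frequency bins. A small illustration (the sampling rate and the test tone are arbitrary assumptions):

```python
import numpy as np
from scipy.signal import stft

fs = 16000
s = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)       # 1 second of a 440 Hz tone as stand-in audio
_, _, S = stft(s, fs=fs, nperseg=512, noverlap=384)    # S: complex frequency domain data, 257 bins
mag = np.abs(S)                                        # |S|: frequency domain magnitude spectrum
```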
Further, at 101 the following steps may be performed:
1011. and receiving a URL address, and capturing time domain noise voice data from the URL address.
In step 1011, the time domain noise voice data may be transmitted remotely using an internet URL address, for example to download audio data in a format such as MP3.
104. Performing mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask;
in the present embodiment, the neural network performs a mask estimation operation on the frequency domain magnitude spectrum |S| to generate a complement mask W̃. The loss function of the neural network is the mask mean square error (Mean Square Error), the optimizer is Adam, and the initial learning rate is 0.0001, halved every 5 rounds. The loss function is expressed as follows:
L = (1/M) Σ_{i=1}^{M} (1/N) Σ_{n=1}^{N} (W̃_i(n) − W_i(n))²
where W̃ is the predicted mask produced by training, W is the input target mask, N is the mask length, M is the number of samples to be computed, and i is the index value of the sample.
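Read as plain code, the loss averages the squared mask error over the N bins of each sample and then over the M samples; a minimal numpy rendering (symbol names follow the description):

```python
import numpy as np

def mask_mse(pred_mask, target_mask):
    """Mask mean square error: (1/M) sum over samples of (1/N) sum over bins of
    the squared difference between the predicted and target masks."""
    per_sample = np.mean((pred_mask - target_mask) ** 2, axis=-1)  # (1/N) * sum over the N bins
    return float(np.mean(per_sample))                              # (1/M) * sum over the M samples

loss_max = mask_mse(np.ones((4, 257)), np.zeros((4, 257)))    # all-wrong masks
loss_zero = mask_mse(np.zeros((2, 257)), np.zeros((2, 257)))  # perfect masks
```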
Further, before 104, the following steps may also be performed:
1041. establishing an input layer for collecting data;
1042. the information output end of the input layer is provided with a hidden layer, and the information output end of the hidden layer is provided with a connection output layer;
1043. and determining the combination of the input layer, the hidden layer and the output layer as a neural network.
In steps 1041-1043, a neural network is constructed, formed by connecting an input layer, a hidden layer, and an output layer in sequence: the 257 data points acquired by the input layer are passed linearly into the hidden layer for multi-layer processing, and the output layer outputs 257 data points.
Referring to fig. 2, fig. 2 is a second embodiment of a voice enhancement method based on a neural network according to an embodiment of the present invention, where the hidden layer includes: the first fully-connected layer, the first GRU connection layer, the second GRU connection layer, the second fully-connected layer, and the third fully-connected layer. In step 104, the following steps may be performed:
1044. the input layer acquires amplitude data from the frequency domain amplitude spectrum, performs input connection processing on the amplitude data to obtain connection input data, and inputs the connection input data to the first full-connection layer;
1045. the first full connection layer receives the connection input data, performs linear processing on the connection input data and all data of the first full connection layer, performs nonlinear processing on the connection input data and all data of the first full connection layer through a ReLU activation function, generates first conduction data, and transmits the first conduction data to the first GRU connection layer;
1046. the first GRU connection layer receives the first conduction data, performs linear processing on the first conduction data to obtain second conduction data, and transmits the second conduction data to the second GRU connection layer;
1047. the second GRU connection layer receives the second conduction data, performs linear processing on the second conduction data, performs nonlinear processing on the second conduction data through a ReLU activation function, generates third conduction data, and transmits the third conduction data to the second full connection layer;
1048. the second full connection layer receives the third conduction data, performs linear processing on the third conduction data, performs nonlinear processing on the third conduction data through a ReLU activation function, generates fourth conduction data, and transmits the fourth conduction data to the third full connection layer;
1049. the third full connection layer receives the fourth conduction data, performs linear processing on the fourth conduction data, performs nonlinear processing through a Sigmoid activation function, generates fifth conduction data, and transmits the fifth conduction data to the output layer;
10410. the output layer receives the fifth conductive data and outputs a complement mask.
In steps 1044-10410, referring to fig. 3, the structure of the neural network in the embodiment of the present invention is shown schematically in fig. 3: the input layer takes the first 257 points of the magnitude |S|; the first hidden layer is an FC (Fully Connected) layer of 257 neurons with ReLU (Rectified Linear Unit) activation; the second layer is a GRU (Gated Recurrent Unit) layer of 128 neurons; the third layer is a GRU layer of 128 neurons followed by ReLU activation; the fourth layer is an FC layer of 257 neurons with ReLU activation; the fifth layer is an FC layer of 257 neurons with Sigmoid activation; and the output layer outputs a mask W of 257 points.
In the embodiment of steps 1044-10410, a voice enhancement RNN model of small size (5 layers: 3 fully-connected layers and 2 GRU layers) and low complexity (413K floating-point parameters in total) is used. Compared with RNNoise, it omits conventional algorithms such as VAD and pitch estimation and has lower complexity than other open-source models; GRU neurons are used instead of LSTM neurons to further reduce the amount of computation, so the model can be ported to run on a singlechip.
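The layer structure described above can be sketched in PyTorch as follows. The layer widths and activations follow the description (257-257 FC + ReLU, two 128-unit GRU layers with ReLU after the second, 128-257 FC + ReLU, 257-257 FC + Sigmoid); everything else, such as the batch-first tensor layout, is an assumption. Counting the parameters of this sketch lands close to the 413K figure stated in the description:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Sketch of the 5-layer mask estimation network: 3 fully-connected layers
    and 2 GRU layers, producing a 257-point complement mask per frame."""
    def __init__(self, n_bins=257, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(n_bins, n_bins)                  # FC, 257 neurons, ReLU
        self.gru1 = nn.GRU(n_bins, hidden, batch_first=True)  # GRU, 128 neurons
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)  # GRU, 128 neurons, then ReLU
        self.fc2 = nn.Linear(hidden, n_bins)                  # FC, 257 neurons, ReLU
        self.fc3 = nn.Linear(n_bins, n_bins)                  # FC, 257 neurons, Sigmoid

    def forward(self, mag):                                   # mag: (batch, frames, 257)
        x = torch.relu(self.fc1(mag))
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        x = torch.relu(x)
        x = torch.relu(self.fc2(x))
        return torch.sigmoid(self.fc3(x))                     # complement mask in (0, 1)

net = MaskNet()
n_params = sum(p.numel() for p in net.parameters())           # roughly 413K parameters
mask = net(torch.rand(1, 10, 257))
```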
Specifically, in step 1044, the following steps may be specifically performed:
10441. and the input layer collects the top 257 points of the frequency domain amplitude spectrum to obtain 257 data points of amplitude data.
In step 10441, the first 257 data points of the frequency domain magnitude spectrum |S| are collected as the data for training and analysis.
105. Performing dot multiplication processing on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data;
in the present embodiment, the frequency domain noise voice data S and the complement mask W̃ are dot-multiplied. The specific processing formula is as follows:
Ŝ = S ⊙ W̃
where Ŝ is the frequency domain enhanced voice data, S is the frequency domain noise voice data, and W̃ is the complement mask.
106. And performing inverse conversion processing on the frequency domain enhanced voice data according to a preset Fourier inverse conversion algorithm to obtain time domain enhanced voice data.
In this embodiment, the inverse operation can be carried out on the frequency domain enhanced voice data Ŝ by directly calling an ISTFT() function, converting it into the corresponding time domain enhanced voice data ŝ.
Referring to fig. 4, fig. 4 is a first embodiment of a training method of a neural network according to an embodiment of the present invention, before 104, the following training method may be performed:
401. receiving clean voice data and noise data;
402. combining the pure voice data and the noise data to obtain combined acoustic data;
403. converting the combined acoustic data according to a preset short-time Fourier transform algorithm to obtain frequency domain acoustic data;
404. performing complex absolute value operation on the frequency domain acoustic data to obtain predicted acoustic data;
405. performing optimal mask operation on the clean voice data and the noise data according to a preset wiener filter to obtain optimal mask data;
406. performing mask estimation operation on the predicted acoustic data according to a preset neural network to obtain a predicted mask;
407. performing minimum mean square error operation on the optimal mask data and the prediction mask to obtain a variance value;
408. and adjusting parameters of the neural network according to the variance value to generate a trained neural network.
In steps 401-408, after the clean voice data and the noise data are received, they are added to obtain combined acoustic data (noisy voice data): the collected noise data and clean voice data are mixed into noisy voice at a random signal-to-noise ratio and used as the training set. The dataset is The Microsoft Scalable Noisy Speech Dataset (MS-SNSD). The clean voice data comprises 23075 clean utterances in clean_train, with a total duration of about 19 hours; 128 noise clips in the noise data set noise_train are randomly combined with the clean voice at signal-to-noise ratios of 0, 5, 10, 15, and 20 to form the data S to be predicted (noisy voice), and the target data W (ideal ratio mask) is calculated. The time domain voice signal is windowed (window length 512) with a frame length of 512 samples and 75% frame overlap, and the frequency domain features (magnitude values) are extracted by STFT (short-time Fourier transform). With the frequency domain magnitude of the prediction data S obtained and the power spectrum mask of the target data W calculated, training of the neural network can begin. The goal of training is to learn the noise reduction function of the wiener filter: suppress the frequency bins of the noisy voice according to the IRM (Ideal Ratio Mask) and recover the enhanced clean voice. A variance operation is performed on the prediction mask and the optimal mask to generate a variance value, i.e., the loss value is computed from the loss function, and the parameters of the neural network are adjusted using this loss value until the loss of the neural network falls below a termination threshold, yielding the trained neural network.
W is the theoretical optimal learning target; the neural network is trained to predict W, yielding the network's predicted mask W̃, which is the data the trained neural network ultimately produces.
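The random-SNR mixing step can be sketched as follows. The description names the SNR levels (0, 5, 10, 15, 20 dB) but not the mixing code itself, so the scaling below is the usual construction and an assumption:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean/noise power ratio equals the requested
    SNR (in dB), then add it to the clean voice to form noisy training data."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=0)   # at 0 dB the noise power equals the clean power
```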
Referring to fig. 5, fig. 5 is a schematic diagram illustrating data processing of steps 401 to 404 according to an embodiment of the present invention.
Referring to fig. 6, fig. 6 is a second embodiment of a neural network training method according to an embodiment of the present invention, in step 405, the following steps may be performed:
4051. according to a preset short-time Fourier transform algorithm, converting the clean voice data to obtain frequency domain clean voice data;
4052. according to a preset short-time Fourier transform algorithm, converting the noise data to obtain frequency domain noise data;
4053. substituting the frequency domain noise data and the frequency domain pure voice data into a preset wiener filter to generate optimal mask data.
In steps 4051-4053, feature extraction is performed using the clean voice data and the training set to obtain the feature data of the noisy voice and the W mask data of the optimal wiener filter; for the specific processing steps, refer to fig. 7, which is a schematic diagram of the data processing in step 405 in the embodiment of the invention. In detail, the wiener filter includes:
W = P_x / (P_x + P_n)
where W is the optimal mask data, P_x is the power spectrum of the frequency domain clean voice data, P_n is the power spectrum of the frequency domain noise data, x is the clean voice data, and n is the noise data. The power spectrum P_x is obtained by squaring the complex absolute value of the frequency domain clean voice data, and the power spectrum P_n is obtained by squaring the complex absolute value of the frequency domain noise data.
In the embodiment of the invention, to counter sudden noise interference, a short-time Fourier transform is performed on all the audio to obtain frequency domain data; a neural network arranged on a single-chip microcomputer is then used to identify noise and clean sound in the frequency domain and to output a complement mask corresponding to the relevant acoustic data; the complement mask is used to modify the acoustic data, and after the noise is eliminated the data is converted back to time domain audio data. This realizes the generation of noise-reduced, enhanced audio data, and solves the technical problem that conventional noise-filtering algorithms cannot filter and suppress sudden noise, by running the neural network on a small single-chip microcomputer.
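The inference path just described (STFT, magnitude, network-predicted mask, point-wise multiply, inverse transform) can be outlined as below. This is a sketch under stated assumptions: the naive O(N^2) DFT stands in for the patent's windowed STFT, and `predict_mask` is a placeholder for the trained FC/GRU network, not the actual model.

```python
import cmath

def dft(frame):
    """Naive discrete Fourier transform (stand-in for the STFT of one frame)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spec):
    """Inverse DFT, returning the real time-domain samples."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def predict_mask(magnitudes):
    # Placeholder for the neural network: map the magnitude spectrum to a
    # per-bin gain in [0, 1]. A real system would run the trained
    # fully-connected/GRU stack here.
    return [1.0 if m > 0.5 else 0.0 for m in magnitudes]

def enhance_frame(frame):
    spec = dft(frame)                                # time -> frequency domain
    mags = [abs(s) for s in spec]                    # complex absolute value
    mask = predict_mask(mags)                        # mask from the network
    enhanced = [s * w for s, w in zip(spec, mask)]   # point-wise multiply
    return idft(enhanced)                            # back to the time domain

frame = [0.5, 0.1, -0.5, -0.1] * 4  # toy 16-sample frame
out = enhance_frame(frame)
```

A deployed version would additionally apply the 512-sample window with 75% overlap and overlap-add the enhanced frames on reconstruction.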
Fig. 8 is a schematic structural diagram of a voice enhancement device based on a neural network according to an embodiment of the present invention, where the voice enhancement device 800 based on the neural network may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 810 (e.g., one or more processors) and a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 833 or data 832. Wherein memory 820 and storage medium 830 can be transitory or persistent. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations in the neural network-based speech enhancement device 800. Still further, the processor 810 may be arranged to communicate with the storage medium 830, executing a series of instruction operations in the storage medium 830 on the neural network-based speech enhancement device 800.
The neural network-based speech enhancement device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the neural network-based speech enhancement device structure illustrated in fig. 8 does not constitute a limitation of the neural network-based speech enhancement device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and may also be a volatile computer readable storage medium, in which instructions are stored which, when executed on a computer, cause the computer to perform the steps of the neural network based speech enhancement method.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
Claims (10)
1. A method for voice enhancement based on a neural network, comprising the steps of:
receiving time domain noise voice data;
converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data;
performing complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain amplitude spectrum;
performing mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask;
performing dot multiplication processing on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data;
and performing inverse conversion processing on the frequency domain enhanced voice data according to a preset Fourier inverse conversion algorithm to obtain time domain enhanced voice data.
2. The voice enhancement method based on a neural network according to claim 1, wherein before performing a mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask, the method further comprises:
establishing an input layer for collecting data;
the information output end of the input layer is provided with a hidden layer, and the information output end of the hidden layer is provided with a connection output layer;
and determining the combination of the input layer, the hidden layer and the output layer as a neural network.
3. The neural network-based speech enhancement method of claim 2, wherein the hidden layer comprises a first full connection layer, a first GRU connection layer, a second GRU connection layer, a second full connection layer and a third full connection layer, and performing a mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask comprises the following steps:
the input layer acquires amplitude data from the frequency domain amplitude spectrum, performs input connection processing on the amplitude data to obtain connection input data, and inputs the connection input data to the first full-connection layer;
the first full connection layer receives the connection input data, performs linear processing on the connection input data and all data of the first full connection layer, performs nonlinear processing on the connection input data and all data of the first full connection layer through a ReLU activation function, generates first conduction data, and transmits the first conduction data to the first GRU connection layer;
the first GRU connection layer receives the first conduction data, performs linear processing on the first conduction data to obtain second conduction data, and transmits the second conduction data to the second GRU connection layer;
the second GRU connection layer receives the second conduction data, performs linear processing on the second conduction data, performs nonlinear processing on the second conduction data through a ReLU activation function, generates third conduction data, and transmits the third conduction data to the second full connection layer;
the second full connection layer receives the third conduction data, performs linear processing on the third conduction data, performs nonlinear processing on the third conduction data through a ReLU activation function, generates fourth conduction data, and transmits the fourth conduction data to the third full connection layer;
the third full connection layer receives the fourth conduction data, performs linear processing on the fourth conduction data, performs nonlinear processing through a Sigmoid activation function, generates fifth conduction data, and transmits the fifth conduction data to the output layer;
the output layer receives the fifth conductive data and outputs a complement mask.
4. The neural network-based speech enhancement method of claim 3, wherein the input layer acquiring amplitude data for the frequency domain amplitude spectrum comprises:
and the input layer collects the first 257 points of the frequency domain amplitude spectrum to obtain amplitude data of 257 data points.
5. The voice enhancement method based on a neural network according to claim 1, wherein before performing a mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask, the method further comprises:
receiving clean voice data and noise data;
combining the pure voice data and the noise data to obtain combined acoustic data;
converting the combined acoustic data according to a preset short-time Fourier transform algorithm to obtain frequency domain acoustic data;
performing complex absolute value operation on the frequency domain acoustic data to obtain predicted acoustic data;
performing optimal mask operation on the clean voice data and the noise data according to a preset wiener filter to obtain optimal mask data;
performing mask estimation operation on the predicted acoustic data according to a preset neural network to obtain a predicted mask;
performing minimum mean square error operation on the optimal mask data and the prediction mask to obtain a variance value;
and adjusting parameters of the neural network according to the variance value to generate a trained neural network.
6. The neural network-based speech enhancement method of claim 5, wherein performing an optimal masking operation on the clean speech data and the noise data according to a preset wiener filter to obtain optimal masking data comprises:
according to a preset short-time Fourier transform algorithm, converting the clean voice data to obtain frequency domain clean voice data;
according to a preset short-time Fourier transform algorithm, converting the noise data to obtain frequency domain noise data;
substituting the frequency domain noise data and the frequency domain pure voice data into a preset wiener filter to generate optimal mask data.
7. The neural network-based speech enhancement method of claim 6, wherein the wiener filter comprises:
w = P_x / (P_x + P_n)

where w is the optimal mask data, P_x is the power spectrum of the frequency domain clean speech data, P_n is the power spectrum of the frequency domain noise data, x is the clean speech data, and n is the noise data.
8. The neural network-based speech enhancement method of claim 1, wherein the receiving time domain noise speech data comprises:
and receiving a URL address, and capturing time domain noise voice data from the URL address.
9. A neural network-based speech enhancement device, the neural network-based speech enhancement device comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the neural network-based speech enhancement device to perform the neural network-based speech enhancement method of any of claims 1-8.
10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the neural network based speech enhancement method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311130448.0A CN117153183A (en) | 2023-08-31 | 2023-08-31 | Voice enhancement method, equipment and storage medium based on neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117153183A true CN117153183A (en) | 2023-12-01 |
Family
ID=88900404
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117153183A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||