CN117153183A - Voice enhancement method, equipment and storage medium based on neural network - Google Patents


Info

Publication number
CN117153183A
Authority
CN
China
Prior art keywords
data
frequency domain
neural network
noise
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311130448.0A
Other languages
Chinese (zh)
Inventor
邓刚
赵宏亮
欧阳梓俊
Current Assignee
Shenzhen Changfeng Imaging Equipment Co ltd
Original Assignee
Shenzhen Changfeng Imaging Equipment Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Changfeng Imaging Equipment Co ltd
Priority to CN202311130448.0A
Publication of CN117153183A
Legal status: Pending


Classifications

    • G PHYSICS — G10 MUSICAL INSTRUMENTS; ACOUSTICS — G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to the field of voice enhancement, and discloses a voice enhancement method, equipment and a storage medium based on a neural network. The method comprises the following steps: receiving time domain noise voice data; converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data; performing a complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain amplitude spectrum; performing a mask estimation operation on the frequency domain amplitude spectrum according to a preset neural network to obtain a complement mask; performing dot multiplication on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data; and performing inverse conversion on the frequency domain enhanced voice data according to a preset inverse Fourier transform algorithm to obtain time domain enhanced voice data. The embodiment of the invention solves the technical problem that a noise filtering algorithm cannot filter and suppress sudden noise when the neural network runs on a small single-chip microcomputer.

Description

Voice enhancement method, equipment and storage medium based on neural network
Technical Field
The present invention relates to the field of speech enhancement, and in particular, to a method, apparatus, and storage medium for speech enhancement based on a neural network.
Background
Conventional noise reduction algorithms require energy decisions and threshold settings to form the noise estimate, which may lead to unstable noise estimates, especially while a device is moving or switching from a low-noise environment to a high-noise one. Meanwhile, conventional noise reduction algorithms easily introduce musical noise, and sudden noise (such as keyboard tapping, a pen falling to the floor, or objects colliding) is difficult to suppress.
The existing open-source RNNoise noise reduction algorithm is small in size and low in complexity and can be ported to run on a single-chip microcomputer; it combines traditional noise reduction with a neural network, and its model involves VAD and pitch algorithms, so its feature extraction is relatively complex. Other mainstream open-source neural network algorithms such as DNN, CNN and RNN are relatively large, and their huge parameter counts make them unsuitable for porting to a single-chip microcomputer.
Therefore, the technical problem to be solved is that, when a neural network runs on a small single-chip microcomputer, the noise filtering algorithm cannot cope with sudden noise to filter and suppress it.
Disclosure of Invention
The invention mainly aims to solve the technical problem that a noise filtering algorithm cannot cope with, filter and suppress sudden noise when a neural network runs on a small single-chip microcomputer.
The first aspect of the present invention provides a neural network-based speech enhancement method, which includes:
receiving time domain noise voice data;
converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data;
performing complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain amplitude spectrum;
performing mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask;
performing dot multiplication processing on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data;
and performing inverse conversion processing on the frequency domain enhanced voice data according to a preset Fourier inverse conversion algorithm to obtain time domain enhanced voice data.
Optionally, in a first implementation manner of the first aspect of the present invention, before performing a mask estimation operation on the frequency domain magnitude spectrum according to the preset neural network to obtain a complement mask, the method further includes:
establishing an input layer for collecting data;
a hidden layer is connected to the information output end of the input layer, and an output layer is connected to the information output end of the hidden layer;
and determining the combination of the input layer, the hidden layer and the output layer as a neural network.
Optionally, in a second implementation manner of the first aspect of the present invention, the hidden layer includes a first fully connected layer, a first GRU connection layer, a second GRU connection layer, a second fully connected layer and a third fully connected layer, and performing the mask estimation operation on the frequency domain amplitude spectrum according to the preset neural network to obtain the complement mask comprises the following steps:
the input layer acquires amplitude data from the frequency domain amplitude spectrum, performs input connection processing on the amplitude data to obtain connection input data, and inputs the connection input data to the first full-connection layer;
the first full connection layer receives the connection input data, performs linear processing on the connection input data and all data of the first full connection layer, performs nonlinear processing on the connection input data and all data of the first full connection layer through a ReLU activation function, generates first conduction data, and transmits the first conduction data to the first GRU connection layer;
the first GRU connection layer receives the first conduction data, performs linear processing on the first conduction data to obtain second conduction data, and transmits the second conduction data to the second GRU connection layer;
the second GRU connection layer receives the second conduction data, performs linear processing on the second conduction data, performs nonlinear processing on the second conduction data through a ReLU activation function, generates third conduction data, and transmits the third conduction data to the second full connection layer;
the second full connection layer receives the third conduction data, performs linear processing on the third conduction data, performs nonlinear processing on the third conduction data through a ReLU activation function, generates fourth conduction data, and transmits the fourth conduction data to the third full connection layer;
the third full connection layer receives the fourth conduction data, performs linear processing on the fourth conduction data, performs nonlinear processing through a Sigmoid activation function, generates fifth conduction data, and transmits the fifth conduction data to the output layer;
the output layer receives the fifth conductive data and outputs a complement mask.
Optionally, in a third implementation manner of the first aspect of the present invention, the acquiring, by the input layer, of amplitude data from the frequency domain amplitude spectrum includes:
and the input layer collects the first 257 points of the frequency domain amplitude spectrum to obtain amplitude data of 257 data points.
Optionally, in a fourth implementation manner of the first aspect of the present invention, before performing a mask estimation operation on the frequency domain magnitude spectrum according to the preset neural network to obtain a complement mask, the method further includes:
receiving clean voice data and noise data;
combining the pure voice data and the noise data to obtain combined acoustic data;
converting the combined acoustic data according to a preset short-time Fourier transform algorithm to obtain frequency domain acoustic data;
performing complex absolute value square operation on the frequency domain acoustic data to obtain predicted acoustic data;
performing optimal mask operation on the clean voice data and the noise data according to a preset wiener filter to obtain optimal mask data;
performing mask estimation operation on the predicted acoustic data according to a preset neural network to obtain a predicted mask;
performing minimum mean square error operation on the optimal mask data and the prediction mask to obtain a variance value;
and adjusting parameters of the neural network according to the variance value to generate a trained neural network.
Optionally, in a fifth implementation manner of the first aspect of the present invention, performing, according to a preset wiener filter, an optimal mask operation on the clean speech data and the noise data, to obtain optimal mask data includes:
according to a preset short-time Fourier transform algorithm, converting the clean voice data to obtain frequency domain clean voice data;
according to a preset short-time Fourier transform algorithm, converting the noise data to obtain frequency domain noise data;
substituting the frequency domain noise data and the frequency domain pure voice data into a preset wiener filter to generate optimal mask data.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the wiener filter includes:
w = P_x / (P_x + P_n)
where w is the optimal mask data, P_x is the power spectrum of the frequency domain clean speech data, P_n is the power spectrum of the frequency domain noise data, x is the clean voice data, and n is the noise data.
Optionally, in a seventh implementation manner of the first aspect of the present invention, the receiving time domain noise voice data includes:
and receiving a URL address, and capturing time domain noise voice data from the URL address.
A second aspect of the present invention provides a voice enhancement device based on a neural network, comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the neural network-based speech enhancement device to perform the neural network-based speech enhancement method described above.
A third aspect of the present invention provides a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to perform the above-described neural network-based speech enhancement method.
In the embodiment of the invention, a short-time Fourier transform is applied to all audio subject to sudden noise interference to obtain frequency domain data; a neural network deployed on the single-chip microcomputer then distinguishes noise from clean sound in the frequency domain and produces a complement mask corresponding to the relevant acoustic data; the complement mask is used to modify the acoustic data, eliminating the noise, after which the data are converted back to time domain audio. This generates noise-reduced, enhanced audio data and solves the technical problem that a noise filtering algorithm running a neural network on a small single-chip microcomputer cannot filter and suppress sudden noise.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a neural network-based speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of a neural network-based speech enhancement method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a neural network according to an embodiment of the present invention;
FIG. 4 is a first embodiment of a neural network training method in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of data processing at steps 401-404 according to an embodiment of the present invention;
FIG. 6 is a second embodiment of a neural network training method in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of data processing at step 405 in an embodiment of the invention;
FIG. 8 is a schematic diagram of an embodiment of a neural network-based speech enhancement device in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a voice enhancement method, equipment and storage medium based on a neural network.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are illustrated in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, which are instead provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and examples of the present disclosure are for illustrative purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where an embodiment of a neural network-based speech enhancement method in an embodiment of the present invention includes:
101. receiving time domain noise voice data;
102. converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data;
103. performing complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain amplitude spectrum;
In steps 101-103, the earphone directly receives external time domain noise voice data s, performs an STFT short-time Fourier transform on it to obtain frequency domain noise voice data S, and then extracts the amplitude values to obtain the frequency domain amplitude spectrum. In detail, reference may be made to the following mathematical processing logic:
S = STFT(s), |S| = abs(S), where STFT() is the short-time Fourier transform function, S is the frequency domain noise voice data, s is the time domain noise voice data, | | is the complex absolute value operator taking the magnitude, and abs() is the absolute value operation function.
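The processing logic above can be sketched in NumPy. The Hann window and the 128-sample hop (75% overlap of a 512-sample frame, matching the training setup described later in this document) are assumptions, not specified for this step; a 512-point real FFT yields the 257 frequency bins consumed by the network input:

```python
import numpy as np

def stft_magnitude(s, frame_len=512, hop=128):
    """Frame the signal, apply a Hann window (an assumption; the
    patent does not name the window), and take a real FFT per frame.
    Returns the complex spectrogram S and its magnitude |S|."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    S = np.stack([
        np.fft.rfft(window * s[i * hop : i * hop + frame_len])
        for i in range(n_frames)
    ])
    return S, np.abs(S)

# A 512-point real FFT yields 512 // 2 + 1 = 257 frequency bins,
# matching the 257 amplitude points taken by the input layer.
s = np.random.default_rng(0).standard_normal(2048)
S, mag = stft_magnitude(s)
```
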
Further, at 101 the following steps may be performed:
1011. and receiving a URL address, and capturing time domain noise voice data from the URL address.
In step 1011, the time domain noise voice data may be transmitted remotely via an internet URL address, e.g. by downloading audio data in a format such as MP3.
104. Performing mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask;
In the present embodiment, the neural network performs a mask estimation operation on the frequency domain amplitude spectrum |S| to generate a complement mask Ŵ. The loss function of the neural network is the mask mean square error (Mean Square Error), the optimizer is Adam, and the initial learning rate is 0.0001, halved every 5 rounds. The loss function is expressed as follows:
Loss = (1 / (M·N)) · Σ_{i=1}^{M} ‖Ŵ_i − W_i‖²
where Ŵ is the trained prediction mask, W is the input target mask, N is the mask length, M is the number of samples to be computed, and i is the index value of the sample.
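A minimal NumPy sketch of this mask mean square error (batching and the Adam optimizer are omitted; the sample count M=4 is a hypothetical value):

```python
import numpy as np

def mask_mse(W_pred, W_target):
    """Mean square error over M samples and N mask points."""
    M, N = W_target.shape
    return np.sum((W_pred - W_target) ** 2) / (M * N)

# M=4 hypothetical samples of an N=257-point mask
W_target = np.full((4, 257), 0.5)
W_pred = np.full((4, 257), 0.5)
```
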
Further, before 104, the following steps may also be performed:
1041. establishing an input layer for collecting data;
1042. the information output end of the input layer is provided with a hidden layer, and the information output end of the hidden layer is provided with a connection output layer;
1043. and determining the combination of the input layer, the hidden layer and the output layer as a neural network.
In steps 1041-1043, a neural network is constructed by sequentially connecting an input layer, a hidden layer and an output layer. The 257 points of data acquired by the input layer are transmitted linearly into the hidden layer for multi-layer processing, and the output layer outputs 257 points of data.
Referring to fig. 2, fig. 2 is a second embodiment of a voice enhancement method based on a neural network according to an embodiment of the present invention, where the hidden layer includes: the first fully-connected layer, the first GRU connection layer, the second GRU connection layer, the second fully-connected layer, and the third fully-connected layer. In step 104, the following steps may be performed:
1044. the input layer acquires amplitude data from the frequency domain amplitude spectrum, performs input connection processing on the amplitude data to obtain connection input data, and inputs the connection input data to the first full-connection layer;
1045. the first full connection layer receives the connection input data, performs linear processing on the connection input data and all data of the first full connection layer, performs nonlinear processing on the connection input data and all data of the first full connection layer through a ReLU activation function, generates first conduction data, and transmits the first conduction data to the first GRU connection layer;
1046. the first GRU connection layer receives the first conduction data, performs linear processing on the first conduction data to obtain second conduction data, and transmits the second conduction data to the second GRU connection layer;
1047. the second GRU connection layer receives the second conduction data, performs linear processing on the second conduction data, performs nonlinear processing on the second conduction data through a ReLU activation function, generates third conduction data, and transmits the third conduction data to the second full connection layer;
1048. the second full connection layer receives the third conduction data, performs linear processing on the third conduction data, performs nonlinear processing on the third conduction data through a ReLU activation function, generates fourth conduction data, and transmits the fourth conduction data to the third full connection layer;
1049. the third full connection layer receives the fourth conduction data, performs linear processing on the fourth conduction data, performs nonlinear processing through a Sigmoid activation function, generates fifth conduction data, and transmits the fifth conduction data to the output layer;
10410. the output layer receives the fifth conductive data and outputs a complement mask.
In steps 1044-10410, referring to fig. 3, the structure of the neural network in the embodiment of the present invention is shown schematically in fig. 3. The input layer takes the first 257 points of the |S| amplitude data for training. The first layer of the hidden layer is an FC (Fully Connected) layer (257 neurons) activated by a ReLU (Rectified Linear Unit) function; the second layer is a GRU (Gated Recurrent Unit) layer (128 neurons); the third layer is a GRU layer (128 neurons) activated by a ReLU function; the fourth layer is an FC layer (257 neurons) activated by a ReLU function; the fifth layer is an FC layer (257 neurons) whose output passes through a Sigmoid activation function; and the output layer outputs a mask Ŵ of 257 points.
In the embodiment of steps 1044-10410, a voice enhancement RNN model of small size (5 layers: 3 fully connected layers and 2 GRU layers) and low complexity (413K floating point parameters in total) is used. Compared with RNNoise, it omits traditional algorithms such as VAD and pitch, and it has lower complexity than other open source models; GRU neurons replace LSTM neurons to further reduce the amount of computation, so the model can be ported to run on a single-chip microcomputer.
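Under stated assumptions, the five-layer structure described above can be sketched as a NumPy forward pass for a single time step. The weights are random stand-ins for trained parameters, biases are omitted, and the standard GRU update equations are assumed; this illustrates shapes and activations only, not the patented trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, p):
    """Standard GRU update (assumed form; biases omitted)."""
    Wz, Uz, Wr, Ur, Wh, Uh = p
    z = sigmoid(x @ Wz + h @ Uz)            # update gate
    r = sigmoid(x @ Wr + h @ Ur)            # reset gate
    h_new = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1.0 - z) * h + z * h_new

def init_gru(n_in, n_hid):
    shapes = [(n_in, n_hid), (n_hid, n_hid)] * 3
    return tuple(rng.standard_normal(s) * 0.1 for s in shapes)

# Random weights stand in for trained parameters.
W1 = rng.standard_normal((257, 257)) * 0.1   # first FC layer
g1 = init_gru(257, 128)                      # first GRU layer
g2 = init_gru(128, 128)                      # second GRU layer
W4 = rng.standard_normal((128, 257)) * 0.1   # second FC layer
W5 = rng.standard_normal((257, 257)) * 0.1   # third FC layer

def forward(mag, h1, h2):
    """One time step: 257 magnitude points in, 257-point mask out."""
    x = relu(mag @ W1)            # FC (257 neurons) + ReLU
    h1 = gru_cell(x, h1, g1)      # GRU (128 neurons)
    h2 = gru_cell(h1, h2, g2)     # GRU (128 neurons)
    x = relu(h2)                  # ReLU on second GRU output
    x = relu(x @ W4)              # FC (257 neurons) + ReLU
    mask = sigmoid(x @ W5)        # FC (257 neurons) + Sigmoid
    return mask, h1, h2

mag = np.abs(rng.standard_normal(257))
mask, h1, h2 = forward(mag, np.zeros(128), np.zeros(128))
```

The Sigmoid on the final layer keeps every mask value in (0, 1), consistent with the mask's role as a per-bin suppression gain.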
Specifically, in step 1044, the following steps may be specifically performed:
10441. and the input layer collects the top 257 points of the frequency domain amplitude spectrum to obtain 257 data points of amplitude data.
In step 10441, the first 257 data points of the frequency domain amplitude spectrum |S| are collected as the data for training analysis.
105. Performing dot multiplication processing on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data;
In the present embodiment, the frequency domain noise voice data S and the complement mask Ŵ are point-multiplied. The specific processing formula is as follows:
Ŝ = S ⊙ Ŵ
where Ŝ is the frequency domain enhanced voice data, S is the frequency domain noise voice data, and Ŵ is the complement mask.
106. And performing inverse conversion processing on the frequency domain enhanced voice data according to a preset Fourier inverse conversion algorithm to obtain time domain enhanced voice data.
In this embodiment, the frequency domain enhanced voice data Ŝ can be inverse transformed by directly calling an ISTFT() function, converting it into the corresponding time domain enhanced voice data ŝ.
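A single-frame NumPy sketch of steps 105-106 (a full ISTFT would overlap-add successive frames, which is omitted here for brevity). With an all-pass complement mask the inverse transform recovers the original frame exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)      # one time domain frame
S = np.fft.rfft(frame)                # frequency domain noise voice data
W_hat = np.ones(257)                  # complement mask (all-pass here)
S_enh = S * W_hat                     # dot (point-wise) multiplication
s_enh = np.fft.irfft(S_enh, n=512)    # inverse transform to time domain
```
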
Referring to fig. 4, fig. 4 is a first embodiment of a training method of a neural network according to an embodiment of the present invention, before 104, the following training method may be performed:
401. receiving clean voice data and noise data;
402. combining the pure voice data and the noise data to obtain combined acoustic data;
403. converting the combined acoustic data according to a preset short-time Fourier transform algorithm to obtain frequency domain acoustic data;
404. performing complex absolute value operation on the frequency domain acoustic data to obtain predicted acoustic data;
405. performing optimal mask operation on the clean voice data and the noise data according to a preset wiener filter to obtain optimal mask data;
406. performing mask estimation operation on the predicted acoustic data according to a preset neural network to obtain a predicted mask;
407. performing minimum mean square error operation on the optimal mask data and the prediction mask to obtain a variance value;
408. and adjusting parameters of the neural network according to the variance value to generate a trained neural network.
In steps 401-408, after receiving the clean voice data and the noise data, the two are added to obtain combined acoustic data (noisy voice data): the collected noise data and clean voice data are mixed into noisy speech at random signal-to-noise ratios and used as the training set. The dataset is Microsoft's MS-SNSD (The Microsoft Scalable Noisy Speech Dataset). Clean voice data: 23075 clean utterances in clean_train, with a total duration of about 19 hours. The 128 noise recordings in the noise dataset noise_train are randomly combined with the clean speech at signal-to-noise ratios of 0, 5, 10, 15 and 20 to form the data S to be predicted (noisy speech) and the calculated target data W (ideal ratio mask). The time domain speech signal is windowed (window length 512) with a frame length of 512 and 75% frame overlap, and the frequency domain features (amplitude values) are extracted by STFT (short-time Fourier transform). The frequency domain feature amplitude of the predicted data S is obtained, the power spectrum mask of the target data W is calculated, and training of the neural network begins. The goal of training is to learn the noise reduction function of the wiener filter: frequency points of the noisy speech are suppressed according to the IRM (Ideal Ratio Mask), recovering the enhanced clean speech. A variance operation on the prediction mask and the optimal mask generates a variance value, i.e., a loss value is calculated from the loss function, and the parameters of the neural network are adjusted using the loss value until the loss value is smaller than a termination threshold, yielding the trained neural network.
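The random-SNR mixing step can be sketched as follows. This is a NumPy sketch under assumptions: the scaling convention (adjusting noise power against clean-speech power) is one common choice, and the signals are synthetic stand-ins for the MS-SNSD recordings:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean-to-noise power ratio equals
    snr_db, then add it to the clean speech."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(4096)   # stands in for a clean utterance
noise = rng.standard_normal(4096)   # stands in for a noise recording
noisy = mix_at_snr(clean, noise, snr_db=10)
```
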
W is the theoretical optimal learning target. The neural network is used to predict and learn W, obtaining the network's prediction mask Ŵ; this prediction mask Ŵ is the final result of training the neural network.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating data processing of steps 401 to 404 according to an embodiment of the present invention.
Referring to fig. 6, fig. 6 is a second embodiment of a neural network training method according to an embodiment of the present invention, in step 405, the following steps may be performed:
4051. according to a preset short-time Fourier transform algorithm, converting the clean voice data to obtain frequency domain clean voice data;
4052. according to a preset short-time Fourier transform algorithm, converting the noise data to obtain frequency domain noise data;
4053. substituting the frequency domain noise data and the frequency domain pure voice data into a preset wiener filter to generate optimal mask data.
In steps 4051-4053, feature extraction is performed on the clean voice data and the training set to obtain the feature data of the noisy speech and the W mask data of the optimal wiener filtering. For the specific processing steps, reference may be made to fig. 7, which is a schematic diagram of the data processing in step 405 in the embodiment of the invention. In detail, the wiener filter includes:
w = P_x / (P_x + P_n)
where w is the optimal mask data, P_x is the power spectrum of the frequency domain clean speech data, P_n is the power spectrum of the frequency domain noise data, x is the clean voice data, and n is the noise data. The power spectrum P_x is obtained by a complex absolute value square operation on the frequency domain clean speech data, and the power spectrum P_n is obtained by a complex absolute value square operation on the frequency domain noise data.
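A NumPy sketch of this optimal (wiener/IRM) mask computation; the small epsilon guarding against division by zero is an added assumption, not part of the patent's formula:

```python
import numpy as np

def wiener_mask(X, N):
    """w = P_x / (P_x + P_n), with each power spectrum the squared
    complex magnitude of the frequency domain data. The epsilon
    guarding 0/0 is an added assumption."""
    P_x = np.abs(X) ** 2
    P_n = np.abs(N) ** 2
    return P_x / (P_x + P_n + 1e-12)

rng = np.random.default_rng(42)
X = np.fft.rfft(rng.standard_normal(512))        # frequency domain clean speech
N = np.fft.rfft(0.1 * rng.standard_normal(512))  # frequency domain noise
w = wiener_mask(X, N)
```

By construction the mask lies in [0, 1]: it approaches 1 where speech dominates the bin and 0 where noise dominates.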
In the embodiment of the invention, a short-time Fourier transform is applied to all audio subject to sudden noise interference to obtain frequency domain data; a neural network deployed on the single-chip microcomputer then distinguishes noise from clean sound in the frequency domain and produces a complement mask corresponding to the relevant acoustic data; the complement mask is used to modify the acoustic data, eliminating the noise, after which the data are converted back to time domain audio. This generates noise-reduced, enhanced audio data and solves the technical problem that a noise filtering algorithm running a neural network on a small single-chip microcomputer cannot filter and suppress sudden noise.
Fig. 8 is a schematic structural diagram of a neural network-based speech enhancement device according to an embodiment of the present invention. The neural network-based speech enhancement device 800 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 810 (e.g., one or more processors), a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 833 or data 832, where the memory 820 and the storage medium 830 may be transitory or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on the neural network-based speech enhancement device 800. Still further, the processor 810 may be arranged to communicate with the storage medium 830 and to execute the series of instruction operations in the storage medium 830 on the neural network-based speech enhancement device 800.
The neural network-based speech enhancement device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the structure illustrated in fig. 8 does not constitute a limitation of the neural network-based speech enhancement device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium or a volatile computer readable storage medium, in which instructions are stored which, when executed on a computer, cause the computer to perform the steps of the neural network-based speech enhancement method.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the discussion above, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A method for voice enhancement based on a neural network, comprising the steps of:
receiving time domain noise voice data;
converting the time domain noise voice data according to a preset short-time Fourier transform algorithm to obtain frequency domain noise voice data;
performing complex absolute value operation on the frequency domain noise voice data to obtain a frequency domain amplitude spectrum;
performing mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask;
performing dot multiplication processing on the frequency domain noise voice data and the complement mask to obtain frequency domain enhanced voice data;
and performing inverse transform processing on the frequency domain enhanced voice data according to a preset inverse Fourier transform algorithm to obtain time domain enhanced voice data.
2. The voice enhancement method based on a neural network according to claim 1, wherein before performing a mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask, the method further comprises:
establishing an input layer for collecting data;
arranging a hidden layer at the information output end of the input layer, and connecting an output layer at the information output end of the hidden layer;
and determining the combination of the input layer, the hidden layer and the output layer as a neural network.
3. The neural network-based speech enhancement method of claim 2, wherein the hidden layer comprises: a first full connection layer, a first GRU connection layer, a second GRU connection layer, a second full connection layer and a third full connection layer; and wherein performing a mask estimation operation on the frequency domain amplitude spectrum according to a preset neural network to obtain a complement mask comprises the following steps:
the input layer acquires amplitude data from the frequency domain amplitude spectrum, performs input connection processing on the amplitude data to obtain connection input data, and inputs the connection input data to the first full-connection layer;
the first full connection layer receives the connection input data, performs linear processing on the connection input data with the weight data of the first full connection layer, performs nonlinear processing through a ReLU activation function, generates first conduction data, and transmits the first conduction data to the first GRU connection layer;
the first GRU connection layer receives the first conduction data, performs linear processing on the first conduction data to obtain second conduction data, and transmits the second conduction data to the second GRU connection layer;
the second GRU connection layer receives the second conduction data, performs linear processing on the second conduction data, performs nonlinear processing on the second conduction data through a ReLU activation function, generates third conduction data, and transmits the third conduction data to the second full connection layer;
the second full connection layer receives the third conduction data, performs linear processing on the third conduction data, performs nonlinear processing on the third conduction data through a ReLU activation function, generates fourth conduction data, and transmits the fourth conduction data to the third full connection layer;
the third full connection layer receives the fourth conduction data, performs linear processing on the fourth conduction data, performs nonlinear processing through a sigmoid activation function, generates fifth conduction data, and transmits the fifth conduction data to the output layer;
the output layer receives the fifth conduction data and outputs a complement mask.
4. The neural network-based speech enhancement method of claim 3, wherein the input layer acquiring amplitude data for the frequency domain amplitude spectrum comprises:
and the input layer collects the first 257 points of the frequency domain amplitude spectrum to obtain amplitude data of 257 data points.
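The figure 257 is consistent with the one-sided spectrum of a 512-sample frame, since 512 // 2 + 1 = 257; the 512-point frame length is an inference, not a value stated in the claim:

```python
import numpy as np

# A real-valued 512-sample frame has 257 non-redundant frequency bins,
# the likely origin of the 257 amplitude data points in claim 4
# (the 512-point frame length itself is an assumption).
frame = np.random.default_rng(0).standard_normal(512)
spectrum = np.fft.rfft(frame)
print(len(spectrum))  # 257
```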
5. The voice enhancement method based on a neural network according to claim 1, wherein before performing a mask estimation operation on the frequency domain magnitude spectrum according to a preset neural network to obtain a complement mask, the method further comprises:
receiving clean voice data and noise data;
combining the pure voice data and the noise data to obtain combined acoustic data;
converting the combined acoustic data according to a preset short-time Fourier transform algorithm to obtain frequency domain acoustic data;
performing complex absolute value operation on the frequency domain acoustic data to obtain predicted acoustic data;
performing optimal mask operation on the clean voice data and the noise data according to a preset wiener filter to obtain optimal mask data;
performing mask estimation operation on the predicted acoustic data according to a preset neural network to obtain a predicted mask;
performing minimum mean square error operation on the optimal mask data and the prediction mask to obtain a variance value;
and adjusting parameters of the neural network according to the variance value to generate a trained neural network.
6. The neural network-based speech enhancement method of claim 5, wherein performing an optimal masking operation on the clean speech data and the noise data according to a preset wiener filter to obtain optimal masking data comprises:
according to a preset short-time Fourier transform algorithm, converting the clean voice data to obtain frequency domain clean voice data;
according to a preset short-time Fourier transform algorithm, converting the noise data to obtain frequency domain noise data;
substituting the frequency domain noise data and the frequency domain pure voice data into a preset wiener filter to generate optimal mask data.
7. The neural network-based speech enhancement method of claim 6, wherein the wiener filter comprises:
w = Px / (Px + Pn)

where w is the optimal mask data, Px is the power spectrum of the frequency domain clean speech data, Pn is the power spectrum of the frequency domain noise data, x is the clean speech data, and n is the noise data.
8. The neural network-based speech enhancement method of claim 1, wherein the receiving time domain noise speech data comprises:
and receiving a URL address, and capturing time domain noise voice data from the URL address.
9. A neural network-based speech enhancement device, the neural network-based speech enhancement device comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the neural network-based speech enhancement device to perform the neural network-based speech enhancement method of any of claims 1-8.
10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the neural network based speech enhancement method of any of claims 1-8.
CN202311130448.0A 2023-08-31 2023-08-31 Voice enhancement method, equipment and storage medium based on neural network Pending CN117153183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311130448.0A CN117153183A (en) 2023-08-31 2023-08-31 Voice enhancement method, equipment and storage medium based on neural network


Publications (1)

Publication Number Publication Date
CN117153183A true CN117153183A (en) 2023-12-01

Family

ID=88900404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311130448.0A Pending CN117153183A (en) 2023-08-31 2023-08-31 Voice enhancement method, equipment and storage medium based on neural network

Country Status (1)

Country Link
CN (1) CN117153183A (en)

Similar Documents

Publication Publication Date Title
Li et al. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
US11024324B2 (en) Methods and devices for RNN-based noise reduction in real-time conferences
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
CN113436643B (en) Training and application method, device and equipment of voice enhancement model and storage medium
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
CN113077806B (en) Audio processing method and device, model training method and device, medium and equipment
CN110265065B (en) Method for constructing voice endpoint detection model and voice endpoint detection system
CN112767959B (en) Voice enhancement method, device, equipment and medium
CN113345460B (en) Audio signal processing method, device, equipment and storage medium
CN113823264A (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
CN113782044B (en) Voice enhancement method and device
CN116013344A (en) Speech enhancement method under multiple noise environments
CN113571080A (en) Voice enhancement method, device, equipment and storage medium
CN115223583A (en) Voice enhancement method, device, equipment and medium
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
CN111341331B (en) Voice enhancement method, device and medium based on local attention mechanism
CN114333893A (en) Voice processing method and device, electronic equipment and readable medium
CN113035216B (en) Microphone array voice enhancement method and related equipment
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN116682444A (en) Single-channel voice enhancement method based on waveform spectrum fusion network
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN117153183A (en) Voice enhancement method, equipment and storage medium based on neural network
TWI749547B (en) Speech enhancement system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination