WO2019233362A1 - Deep learning-based speech quality enhancing method, device, and system

Info

Publication number
WO2019233362A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
sample
data
neural network
Application number
PCT/CN2019/089759
Other languages
French (fr)
Chinese (zh)
Inventor
秦宇
姚青山
喻浩文
卢峰
Original Assignee
安克创新科技股份有限公司 (Anker Innovations Technology Co., Ltd.)
Application filed by 安克创新科技股份有限公司 (Anker Innovations Technology Co., Ltd.)
Publication of WO2019233362A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0316 - Speech enhancement by changing the amplitude
    • G10L 21/0324 - Details of processing therefor
    • G10L 21/034 - Automatic adjustment
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques using neural networks
    • G10L 2021/02082 - Noise filtering, the noise being echo or reverberation of the speech

Definitions

  • the present invention relates to the technical field of sound quality optimization, and more particularly, to a method, a device, and a system for enhancing sound quality based on deep learning.
  • Enhanced Voice Services (EVS) coding technology can reach a 48 kHz sampling frequency and a 128 kbps bit rate.
  • However, this does not mean that all users can enjoy the experience of high-definition voice communication.
  • For example, when the operator of the calling user supports a 4G network but the operator of the receiving user supports only a 3G network, the two parties may have to select an adaptive multi-rate narrowband (amr-nb) coding method for speech coding, rather than an adaptive multi-rate wideband (amr-wb) coding method with, for example, a 16 kHz sampling frequency. Because of such scenarios, in which low-quality voice must be adopted due to hardware conditions, not everyone can enjoy the benefits of high-definition voice communication.
  • the present invention has been made to solve at least one of the problems described above.
  • the present invention proposes a solution for enhancing the sound quality of speech based on deep learning, which enhances the sound quality of low-quality speech with a deep learning method: the sound quality of low-quality speech is reconstructed by a deep neural network to reach the sound quality of high-quality speech, thereby achieving a sound quality improvement effect that traditional methods cannot achieve.
  • a method for enhancing the sound quality of speech based on deep learning includes: acquiring to-be-processed voice data, and performing feature extraction on the to-be-processed voice data to obtain the characteristics of the to-be-processed voice data; and, based on the characteristics of the to-be-processed voice data, using a trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data, wherein the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain the features of the first speech sample and the features of the second speech sample, respectively; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectral phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, wherein the feature extraction is performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, wherein the framing is performed on the time-domain waveform data obtained after decoding.
  • using the trained speech reconstruction neural network to reconstruct the speech data to be processed into output speech data includes: using the characteristics of the speech data to be processed as the input of the trained speech reconstruction neural network, and outputting reconstructed speech features from the trained speech reconstruction neural network; and generating a time-domain speech waveform based on the reconstructed speech features as the output speech data.
  • a deep learning-based voice sound quality enhancement device includes: a feature extraction module for obtaining to-be-processed voice data and performing feature extraction on the to-be-processed voice data to obtain the characteristics of the to-be-processed voice data; and a speech reconstruction module configured to use a trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data based on the characteristics of the to-be-processed voice data extracted by the feature extraction module, wherein the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain the features of the first speech sample and the features of the second speech sample, respectively; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectral phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, wherein the feature extraction is performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, wherein the framing is performed on the time-domain waveform data obtained after decoding.
  • the speech reconstruction module further includes: a reconstruction module configured to use the characteristics of the speech data to be processed as the input of the trained speech reconstruction neural network, such that the trained speech reconstruction neural network outputs reconstructed speech features; and a generation module for generating a time-domain speech waveform based on the reconstructed speech features output by the reconstruction module as the output speech data.
  • a deep learning-based voice sound quality enhancement system includes a storage device and a processor.
  • the storage device stores a computer program run by the processor.
  • the computer program, when executed by the processor, executes the deep learning-based speech sound quality enhancement method according to any one of the above.
  • a storage medium stores a computer program, and the computer program executes the deep learning-based voice sound quality enhancement method according to any one of the foregoing when running.
  • a computer program is provided, which is used by a computer or a processor to execute the deep learning-based voice sound quality enhancement method according to any one of the above, and which is further used to implement each module in the deep learning-based voice sound quality enhancement device according to any one of the above.
  • the method, device, and system for enhancing speech sound quality based on deep learning according to the embodiments of the present invention enhance low-quality speech sound quality with a deep learning method, so that low-quality speech is reconstructed by a deep neural network to reach the sound quality of high-quality speech, thereby achieving a sound quality improvement effect that traditional methods cannot achieve.
  • the method, device, and system for enhancing voice quality based on deep learning according to the embodiments of the present invention can be conveniently deployed on the server side or the client side, and can effectively enhance voice quality.
  • FIG. 1 shows a schematic block diagram of an example electronic device for implementing a deep learning-based method, apparatus, and system for voice sound quality enhancement according to an embodiment of the present invention
  • FIG. 2 shows a schematic flowchart of a deep learning-based voice sound quality enhancement method according to an embodiment of the present invention
  • FIG. 3 shows a training schematic diagram of a speech reconstruction neural network according to an embodiment of the present invention
  • FIGS. 4A, 4B, and 4C respectively show spectrograms of high-quality speech, of low-quality speech, and of speech obtained by reconstructing the low-quality speech using a deep learning-based speech sound quality enhancement method according to an embodiment of the present invention;
  • FIG. 5 shows a schematic block diagram of a deep learning-based voice sound quality enhancement device according to an embodiment of the present invention.
  • FIG. 6 shows a schematic block diagram of a deep learning-based speech sound quality enhancement system according to an embodiment of the present invention.
  • an example electronic device 100 for implementing a method, an apparatus, and a system for improving the sound quality of a voice based on deep learning according to an embodiment of the present invention is described with reference to FIG. 1.
  • the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, and an output device 108, which are interconnected through a bus system 110 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structures of the electronic device 100 shown in FIG. 1 are only exemplary and not restrictive, and the electronic device may have other components and structures as needed.
  • the processor 102 may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 100 to perform a desired function.
  • the storage device 104 may include one or more computer program products, and the computer program product may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, a random access memory (RAM) and / or a cache memory.
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present invention described below and/or other desired functions.
  • Various application programs and various data, such as data used and/or generated by the application programs, can also be stored in the computer-readable storage medium.
  • the input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. In addition, the input device 106 may also be any interface for receiving information.
  • the output device 108 may output various information (such as images or sounds) to the outside (such as a user), and may include one or more of a display, a speaker, and the like. In addition, the output device 108 may be any other device having an output function.
  • an example electronic device for implementing a method, a device, and a system for enhancing the sound quality of a voice based on deep learning may be implemented as a terminal such as a smart phone, a tablet computer, or the like.
  • a deep learning-based speech sound quality enhancement method 200 may include the following steps:
  • in step S210, the speech data to be processed is acquired, and feature extraction is performed on the speech data to be processed to obtain the characteristics of the speech data to be processed.
  • the to-be-processed voice data obtained in step S210 may be low-quality voice data that is received, stored, or played in a voice communication terminal or voice storage/playback device and requires sound quality enhancement, such as voice data with a low bit rate or a low sampling frequency.
  • the to-be-processed voice data may include, but is not limited to, a data stream of a wireless voice call, a voice in a list being played by a user, or a voice file stored in the cloud or on a client.
  • the to-be-processed voice data obtained in step S210 may also be any data that requires sound quality enhancement, such as voice data included in video data.
  • the to-be-processed voice data obtained in step S210 may come from a file stored offline, or from a file played online.
  • a manner of performing feature extraction on the acquired to-be-processed voice data may include, but is not limited to, a short-time Fourier transform (STFT).
  • the features of the to-be-processed voice data obtained by performing feature extraction on the acquired to-be-processed voice data may include frequency-domain amplitude and/or energy information.
  • the features of the to-be-processed voice data obtained by performing feature extraction on the acquired to-be-processed voice data may further include spectral phase information.
  • the features of the to-be-processed voice data obtained by performing feature extraction on the obtained to-be-processed voice data may also be time-domain features.
  • the features of the to-be-processed voice data obtained by performing feature extraction on the obtained to-be-processed voice data may further include any other feature that can characterize the to-be-processed voice data.
  • before performing feature extraction on the speech data to be processed, frame processing may be performed on it, and the aforementioned feature extraction is performed frame by frame on the speech data obtained after framing.
  • This situation may be applicable when the to-be-processed voice data obtained in step S210 comes from a file stored offline, or from a complete file from any source.
  • when the to-be-processed voice data obtained in step S210 comes from a file played online, one or more frames of to-be-processed voice data may be buffered before feature extraction.
  • a part of the data can be selected from each frame of the framed or buffered voice data for feature extraction, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the voice data to be processed may be decoded, and the aforementioned frame processing may be performed on the time-domain waveform data obtained after decoding. This is because the acquired speech data to be processed is generally in an encoded form, and in order to obtain its complete speech time-domain information, it may be decoded first.
  • before performing feature extraction on the voice data to be processed, the voice data to be processed may be pre-processed, and the aforementioned feature extraction may be performed on the voice data obtained after pre-processing.
  • the pre-processing of the speech data to be processed may include, but is not limited to, denoising, echo suppression, automatic gain control, and the like.
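As a rough illustration of where such pre-processing sits in the chain, the sketch below implements a deliberately simplistic automatic gain control in Python. It is only a stand-in under stated assumptions: real systems use dedicated denoising and echo-suppression modules, and the function name and target level here are hypothetical, not taken from the patent.

```python
# A simplistic stand-in for the pre-processing step (automatic gain control).
# Real deployments would also apply denoising and echo suppression here.
import numpy as np

def simple_agc(x: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Scale the waveform so its RMS level matches a target level."""
    rms = np.sqrt(np.mean(x ** 2)) + 1e-12   # current signal level (avoid /0)
    return x * (target_rms / rms)            # gain toward the target level
```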
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the acquired to-be-processed voice data may be sequentially decoded, pre-processed, framed, and feature extracted in order to efficiently extract well-represented features.
  • the aforementioned pre-processing operation may also be performed before the feature extraction after framing.
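To make the decode, pre-process, frame, and extract order concrete, the following sketch performs frame-wise STFT feature extraction on an already-decoded waveform. The file name, frame length, and log compression are illustrative assumptions, not prescribed by the patent.

```python
# A minimal sketch of the framing + STFT feature-extraction step, assuming the
# compressed speech has already been decoded to a WAV file. File names and
# frame parameters are illustrative.
import numpy as np
import soundfile as sf
from scipy import signal

waveform, fs = sf.read("decoded_speech.wav")      # time-domain waveform data

# Frame the waveform and apply a short-time Fourier transform (STFT);
# 32 ms frames with 50% overlap are a common choice for speech.
nperseg = int(0.032 * fs)
_, _, stft = signal.stft(waveform, fs=fs, nperseg=nperseg,
                         noverlap=nperseg // 2)

magnitude = np.abs(stft)              # frequency-domain amplitude information
phase = np.angle(stft)                # spectral phase (kept for resynthesis)
features = np.log1p(magnitude).T      # one feature vector per frame
```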
  • in step S220, based on the characteristics of the voice data to be processed, the trained speech reconstruction neural network is used to reconstruct the voice data to be processed into output voice data, where the voice quality of the output voice data is higher than the voice quality of the voice data to be processed.
  • the features of the speech data to be processed extracted in step S210 are input to the trained speech reconstruction neural network, the speech reconstruction neural network reconstructs the input features to obtain reconstructed speech features, and the reconstructed speech features can be used to generate output voice data with higher voice quality than the acquired to-be-processed voice data. Therefore, the speech sound quality enhancement method of the present invention can accurately supplement, based on deep learning, the speech information lost in low-quality speech, which not only effectively achieves a great improvement in the sound quality of low-quality speech but also does not affect communication bandwidth (because what is transmitted is still low-quality voice data with a small amount of data, while the low-quality voice data is reconstructed into high-quality voice data at the receiving end).
  • training of the speech reconstruction neural network according to the embodiment of the present invention is described below with reference to FIG. 3.
  • training of a speech reconstruction neural network according to an embodiment of the present invention may include the following process:
  • a first voice sample and a second voice sample are obtained, wherein the voice quality of the second voice sample is lower than the voice quality of the first voice sample, and the second voice sample is obtained by transcoding the first voice sample.
  • the first speech sample may be a high-quality speech sample and the second speech sample may be a low-quality speech sample.
  • the first speech sample may be a set of speech samples with a high bit rate and a high sampling frequency, including but not limited to speech data with a sampling frequency of 16 kHz, 24 kHz, and 32 kHz.
  • a first speech sample may be transcoded to obtain a second speech sample.
  • for example, an amr-wb speech sample with a 16 kHz sampling frequency and a 23.85 kbps bit rate can be used as the first speech sample, and the second speech sample can be obtained by transcoding it into amr-nb speech with an 8 kHz sampling frequency and a 12.2 kbps bit rate.
  • the second speech sample can be obtained by converting the first speech sample in the FLAC format to the MP3 format without reducing the bit rate and the sampling frequency. That is, the code rate of the first voice sample may be higher than or equal to the code rate of the second voice sample; the sampling frequency of the first voice sample may be higher than or equal to the sampling frequency of the second voice sample.
  • the transcoding of the first speech sample (that is, the high-quality speech sample) into the second speech sample may also take other forms, which can be adapted to the actual application scenario.
  • the first voice sample and the second voice sample to be used can be determined based on the reconstruction requirements for the to-be-processed voice data obtained in step S210; that is, the reconstruction requirements determine which first voice sample should be selected and which transcoding method should be used to transcode it into the second voice sample (a transcoding sketch is given below).
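The sketch below builds such a training pair following the amr-wb to amr-nb example above. It assumes an ffmpeg build with the libopencore_amrnb encoder available; the file paths and the decode-back step are illustrative assumptions rather than the patent's prescribed tooling.

```python
# A sketch of building (first, second) training pairs by transcoding.
# Assumes ffmpeg with libopencore_amrnb; paths are illustrative.
import subprocess

def make_low_quality(first_wav: str, amr_path: str, second_wav: str) -> None:
    # Transcode the high-quality sample to 8 kHz mono AMR-NB at 12.2 kbps.
    subprocess.run(
        ["ffmpeg", "-y", "-i", first_wav, "-ar", "8000", "-ac", "1",
         "-c:a", "libopencore_amrnb", "-b:a", "12.2k", amr_path],
        check=True,
    )
    # Decode back to a 16 kHz waveform so both samples can be framed
    # and feature-extracted on a common time base.
    subprocess.run(
        ["ffmpeg", "-y", "-i", amr_path, "-ar", "16000", second_wav],
        check=True,
    )

make_low_quality("first_sample.wav", "second_sample.amr", "second_sample.wav")
```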
  • feature extraction is performed on the first voice sample and the second voice sample to obtain the features of the first voice sample and the features of the second voice sample, respectively.
  • the manner of performing feature extraction on each of the first speech sample and the second speech sample may include, but is not limited to, a short-time Fourier transform.
  • the features obtained by performing feature extraction on each of the first speech sample and the second speech sample may include their respective frequency-domain amplitude and/or energy information.
  • the features obtained by performing feature extraction on the first speech sample and the second speech sample may further include their respective spectral phase information.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may also be their respective time-domain features.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may further include any other features that can characterize their respective features.
  • before feature extraction, frame processing may first be performed on the first voice sample and the second voice sample separately, and the aforementioned feature extraction may be performed frame by frame on the respective speech samples obtained after the first and second speech samples are framed.
  • part of the data can be selected for feature extraction for each frame of voice samples, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the first speech sample and the second speech sample may each be decoded, and the foregoing frame processing may be performed on the respective time-domain waveform data obtained after the first and second speech samples are decoded.
  • each of the first voice sample and the second voice sample may be pre-processed, and the aforementioned feature extraction may be performed on the speech samples obtained after pre-processing.
  • the pre-processing performed on each of the first speech sample and the second speech sample may include, but is not limited to, denoising, echo suppression, and automatic gain control.
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the first speech sample and the second speech sample may be sequentially decoded, preprocessed, framed, and feature extracted in order to efficiently extract features with good representativeness.
  • the foregoing pre-processing operation may also be performed before the feature extraction is performed after the first speech sample and the second speech sample are respectively framed.
  • the obtained features of the second speech sample are used as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample are used as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • for example, the features of one or more frames of the second speech sample may be used as the input of the input layer of the speech reconstruction neural network, and the features of the corresponding one or more frames of the first speech sample may be used as the target of the output layer, thereby training a neural network regressor as the speech reconstruction neural network employed in step S220 (a training sketch is given below).
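The following sketch shows one way such a regressor could be trained in PyTorch, assuming per-frame log-magnitude features like those in the extraction sketch above. The layer sizes, feature dimensions, and hyperparameters are illustrative assumptions, not values from the patent.

```python
# A minimal sketch of training the speech reconstruction neural network as a
# per-frame regressor. Dimensions and hyperparameters are illustrative.
import torch
from torch import nn

n_in, n_out = 129, 257           # e.g. narrowband frames in, wideband frames out
model = nn.Sequential(            # a simple fully connected regressor
    nn.Linear(n_in, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, n_out),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(second_feats: torch.Tensor, first_feats: torch.Tensor) -> float:
    """second_feats: (batch, n_in) low-quality frame features (network input);
    first_feats: (batch, n_out) high-quality frame features (output target)."""
    optimizer.zero_grad()
    pred = model(second_feats)          # reconstructed speech features
    loss = loss_fn(pred, first_feats)   # regress toward high-quality features
    loss.backward()
    optimizer.step()
    return loss.item()
```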
  • in step S220, based on the trained speech reconstruction neural network, the features of the speech data to be processed can be reconstructed into reconstructed speech features; since the reconstructed speech features are frequency-domain features, a time-domain voice waveform output can then be generated based on the reconstructed voice features.
  • the time-domain speech waveform can be obtained by transforming the reconstructed speech feature by inverse Fourier transform.
  • the output voice waveform can be stored or buffered for playback, providing users with an improved voice sound quality experience.
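As a sketch of this resynthesis step, the function below inverts the STFT using the phase saved during feature extraction, assuming the reconstructed magnitude and the available phase have matching shapes. Reusing the input phase is one common choice (the patent leaves the phase handling open), and the names and parameters are illustrative.

```python
# A sketch of generating the time-domain output waveform from reconstructed
# features via the inverse short-time Fourier transform.
import numpy as np
from scipy import signal

def features_to_waveform(recon_feats: np.ndarray, phase: np.ndarray,
                         fs: int, nperseg: int) -> np.ndarray:
    magnitude = np.expm1(recon_feats.T)          # undo the log1p compression
    stft = magnitude * np.exp(1j * phase)        # reattach spectral phase
    _, waveform = signal.istft(stft, fs=fs, nperseg=nperseg,
                               noverlap=nperseg // 2)
    return waveform                              # store or buffer for playback
```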
  • the voice sound quality enhancement effect of the deep learning-based voice sound quality enhancement method according to the embodiment can be seen with reference to FIG. 4A to FIG. 4C.
  • FIGS. 4A, 4B, and 4C respectively show the respective spectrograms of high-quality speech, low-quality speech, and speech obtained by reconstructing low-quality speech using a deep learning-based speech sound quality enhancement method according to an embodiment of the present invention.
  • FIG. 4A shows a spectrogram 400 of high-quality speech, taking PCM format, a 16 kHz sampling frequency, and 16-bit quantization as an example;
  • FIG. 4B shows the spectrogram of low-quality speech in MP3 format with an 8 kHz sampling frequency, obtained by transcoding the high-quality speech;
  • FIG. 4C shows the spectrogram 402 of the reconstructed speech with a 16 kHz sampling frequency, obtained by reconstructing the low-quality speech using the deep learning-based speech quality enhancement method according to an embodiment of the present invention. It is obvious from FIGS. 4A to 4C that, compared with the high-quality speech spectrogram shown in FIG. 4A, the low-quality speech spectrogram shown in FIG. 4B lacks many high-frequency components; after reconstruction by the deep learning-based speech sound quality enhancement method of the embodiment of the present invention, the reconstructed speech spectrogram shown in FIG. 4C restores these high-frequency components, achieving super-resolution of narrow-band speech and improving the sound quality of the low-quality speech.
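A comparison like FIGS. 4A-4C can be reproduced with the sketch below, assuming three decoded waveforms are on disk. The file names are hypothetical, and the rendering choices (dB scale, shared time axis) are merely one option.

```python
# A sketch of plotting spectrograms like FIGS. 4A-4C for visual comparison.
# File names are illustrative placeholders.
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf
from scipy import signal

fig, axes = plt.subplots(3, 1, sharex=True)
panels = [("high-quality", "high.wav"),
          ("low-quality", "low.wav"),
          ("reconstructed", "recon.wav")]
for ax, (title, path) in zip(axes, panels):
    x, fs = sf.read(path)
    f, t, sxx = signal.spectrogram(x, fs=fs)
    ax.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12))   # power in dB
    ax.set_ylabel(f"{title}\n(Hz)")
axes[-1].set_xlabel("time (s)")
plt.show()
```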
  • the deep learning-based voice sound quality enhancement method enhances low-quality voice sound quality with a deep learning method, so that the low-quality voice is reconstructed by the deep neural network to reach high-quality voice sound quality, thereby achieving a sound quality improvement effect that traditional methods cannot achieve.
  • the deep learning-based voice sound quality enhancement method may be implemented in a device, an apparatus, or a system having a memory and a processor.
  • the method for enhancing the sound quality of a voice based on deep learning can be conveniently deployed on a mobile device such as a smart phone, a tablet computer, a personal computer, a headset, or a speaker.
  • the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention may also be deployed on a server side (or cloud).
  • the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention may also be deployed on the server (or cloud) and personal terminals in a distributed manner.
  • FIG. 5 shows a schematic block diagram of a deep learning-based voice sound quality enhancement apparatus 500 according to an embodiment of the present invention.
  • a deep learning-based voice sound quality enhancement device 500 includes a feature extraction module 510 and a voice reconstruction module 520.
  • Each of the modules may perform each step / function of the deep learning-based speech sound quality enhancement method described above in conjunction with FIG. 2.
  • only the main functions of each module of the deep learning-based voice sound quality enhancement device 500 are described, and details that have been described above are omitted.
  • the feature extraction module 510 is configured to obtain voice data to be processed, and perform feature extraction on the voice data to be processed to obtain characteristics of the voice data to be processed.
  • the speech reconstruction module 520 is configured to reconstruct the speech data to be processed into output speech data using the trained speech reconstruction neural network, based on the features of the speech data to be processed extracted by the feature extraction module, wherein the voice quality of the output voice data is higher than the voice quality of the voice data to be processed.
  • Both the feature extraction module 510 and the voice reconstruction module 520 can be implemented by the processor 102 in the electronic device shown in FIG. 1 running program instructions stored in the storage device 104.
  • the to-be-processed voice data obtained by the feature extraction module 510 may be low-quality voice data that is received, stored, or played in a voice communication terminal or voice storage/playback device and requires sound quality enhancement, such as voice data with a low bit rate or a low sampling frequency.
  • the to-be-processed voice data may include, but is not limited to, a data stream of a wireless voice call, a voice in a list being played by a user, or a voice file stored in the cloud or on a client.
  • the to-be-processed voice data obtained by the feature extraction module 510 may also be any data that requires sound quality enhancement, such as voice data included in video data.
  • the to-be-processed voice data obtained by the feature extraction module 510 may come from files stored offline, or from files played online.
  • the manner in which the feature extraction module 510 performs feature extraction on the acquired speech data to be processed may include, but is not limited to, a short-time Fourier transform (STFT).
  • the features of the to-be-processed voice data obtained by the feature extraction module 510 may include frequency-domain amplitude and/or energy information.
  • the features of the to-be-processed voice data obtained by the feature extraction module 510 may further include spectral phase information.
  • the features of the to-be-processed voice data obtained by the feature extraction module 510 may also be time-domain features. In other examples, the features may further include any other feature that can characterize the to-be-processed voice data.
  • before performing feature extraction on the voice data to be processed, the feature extraction module 510 may perform frame processing on it, and the aforementioned feature extraction is performed frame by frame on the speech data obtained after framing.
  • This situation may be applicable when the to-be-processed voice data obtained by the feature extraction module 510 is from a file stored offline or a complete file from any source.
  • when the to-be-processed voice data obtained by the feature extraction module 510 comes from a file played online, one or more frames of to-be-processed voice data may be buffered before feature extraction.
  • the feature extraction module 510 may select a part of the data from each frame of the framed or buffered voice data to perform feature extraction, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the voice data to be processed may be decoded, and the aforementioned frame processing may be performed on the time-domain waveform data obtained after decoding. This is because the acquired speech data to be processed is generally in an encoded form, and in order to obtain its complete speech time-domain information, it may be decoded first.
  • the voice data to be processed may be pre-processed, and the aforementioned feature extraction may be performed on the voice data obtained after pre-processing.
  • the pre-processing of the speech data to be processed by the feature extraction module 510 may include, but is not limited to, denoising, echo suppression, automatic gain control, and the like.
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the feature extraction module 510 may sequentially decode, pre-process, frame, and feature extract the acquired speech data to be processed in order to efficiently extract features with good representativeness.
  • the aforementioned pre-processing operation may also be performed before the feature extraction after framing.
  • the speech reconstruction module 520 may use the trained speech reconstruction neural network to reconstruct the speech data to be processed into output speech data.
  • the voice reconstruction module 520 may further include a reconstruction module (not shown in FIG. 5) and a generation module (not shown in FIG. 5).
  • the reconstruction module may include the trained speech reconstruction neural network, which takes as input the features of the speech data to be processed extracted by the feature extraction module 510, and reconstructs the input features to obtain reconstructed speech features.
  • the generating module generates output voice data with higher voice quality than the acquired to-be-processed voice data based on the reconstructed voice features output by the reconstruction module.
  • the voice sound quality enhancement device of the present invention can accurately supplement, based on deep learning, the voice information lost in low-quality speech, which not only effectively achieves a great improvement in the sound quality of low-quality speech but also does not affect communication bandwidth (because what is transmitted is still low-quality voice data with a small amount of data, while the low-quality voice data is reconstructed into high-quality voice data at the receiving end).
  • the training of the speech reconstruction neural network used by the speech reconstruction module 520 may include: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain the features of the first speech sample and the features of the second speech sample; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample may be a high-quality speech sample and the second speech sample may be a low-quality speech sample.
  • the first speech sample may be a set of speech samples with a high bit rate and a high sampling frequency, including but not limited to speech data with a sampling frequency of 16 kHz, 24 kHz, and 32 kHz.
  • a first speech sample may be transcoded to obtain a second speech sample.
  • for example, an amr-wb voice sample with a 16 kHz sampling frequency and a 23.85 kbps bit rate can be used as the first voice sample, and the second voice sample can be obtained by transcoding it into amr-nb voice with an 8 kHz sampling frequency and a 12.2 kbps bit rate.
  • the second speech sample can be obtained by converting the first speech sample in the FLAC format to the MP3 format without reducing the bit rate and the sampling frequency. That is, the code rate of the first voice sample may be higher than or equal to the code rate of the second voice sample; the sampling frequency of the first voice sample may be higher than or equal to the sampling frequency of the second voice sample.
  • the transcoding of the first speech sample (that is, the high-quality speech sample) into the second speech sample (that is, the low-quality speech sample) may also take other forms, which can be adapted to the actual application scenario.
  • the first voice sample and the second voice sample to be used can be determined based on the reconstruction needs of the to-be-processed voice data obtained by the feature extraction module 510; that is, the first voice sample to be selected, and the transcoding method used to transcode it into the second voice sample, can be determined based on the above-mentioned reconstruction requirements.
  • a manner of performing feature extraction on each of the first speech sample and the second speech sample may include, but is not limited to, a short-time Fourier transform.
  • the features obtained by performing feature extraction on each of the first speech sample and the second speech sample may include their respective frequency-domain amplitude and/or energy information.
  • the features obtained by performing feature extraction on the first speech sample and the second speech sample may further include their respective spectral phase information.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may also be their respective time-domain features.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may further include any other features that can characterize their respective features.
  • frame processing may be performed on each of the first voice sample and the second voice sample, and the aforementioned feature extraction may be performed frame by frame on the respective speech samples obtained after the first and second speech samples are framed.
  • part of the data can be selected for feature extraction for each frame of voice samples, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the first speech sample and the second speech sample may each be decoded, and the foregoing frame processing may be performed on the respective time-domain waveform data obtained after the first and second speech samples are decoded.
  • each of the first voice sample and the second voice sample may be pre-processed, and the aforementioned feature extraction may be performed on the speech samples obtained after pre-processing.
  • the pre-processing performed on each of the first speech sample and the second speech sample may include, but is not limited to, denoising, echo suppression, and automatic gain control.
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the first speech sample and the second speech sample may be sequentially decoded, preprocessed, framed, and feature extracted in order to efficiently extract features with good representativeness.
  • the foregoing pre-processing operation may also be performed before the feature extraction is performed after the first speech sample and the second speech sample are respectively framed.
  • the features of one or more frames of the second speech sample may be used as the input of the input layer of the speech reconstruction neural network, and the features of the corresponding one or more frames of the first speech sample may be used as the target of the output layer, thereby training a neural network regressor as the speech reconstruction neural network used in the speech reconstruction module 520.
  • the reconstruction module of the speech reconstruction module 520 can reconstruct the features of the speech data to be processed into reconstructed speech features; since the reconstructed speech features are frequency-domain features, the generating module of the speech reconstruction module 520 may generate a time-domain speech waveform output based on them. Exemplarily, the generating module may obtain the time-domain speech waveform by applying an inverse Fourier transform to the reconstructed speech features.
  • the output voice waveform can be stored or buffered for playback, providing users with a better improved voice sound quality experience.
  • the voice sound quality enhancement effect of the deep learning-based voice sound quality enhancement device according to the embodiment may be understood with reference to the foregoing description of FIGS. 4A-4C; for brevity, it is not repeated here.
  • the deep learning-based voice sound quality enhancement device enhances low-quality voice sound quality with a deep learning method, so that the low-quality voice is reconstructed by the deep neural network to reach high-quality voice sound quality, thereby achieving a sound quality improvement effect that traditional methods cannot achieve.
  • the deep learning-based device can be conveniently deployed on the server side or the client side, and can effectively enhance voice quality.
  • FIG. 6 shows a schematic block diagram of a deep learning-based speech sound quality enhancement system 600 according to an embodiment of the present invention.
  • the deep learning-based speech sound quality enhancement system 600 includes a storage device 610 and a processor 620.
  • the storage device 610 stores a program for implementing the corresponding steps in the method for enhancing the sound quality of a voice based on deep learning according to an embodiment of the present invention.
  • the processor 620 is configured to run the program stored in the storage device 610 to execute the corresponding steps of the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention, and to implement the corresponding modules in the deep learning-based voice sound quality enhancement device according to an embodiment of the present invention.
  • when the program is executed by the processor 620, the deep learning-based voice sound quality enhancement system 600 performs the following steps: obtaining to-be-processed voice data, and performing feature extraction on the to-be-processed voice data to obtain the characteristics of the to-be-processed voice data; and, based on the characteristics of the to-be-processed voice data, using the trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data, wherein the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample respectively to obtain the features of the first speech sample and the features of the second speech sample; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectral phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, wherein the feature extraction is performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, wherein the framing is performed on the time-domain waveform data obtained after decoding.
  • when the program is run by the processor 620, the deep learning-based speech quality enhancement system 600 uses the trained speech reconstruction neural network to reconstruct the to-be-processed speech data into output speech data by: using the features of the speech data to be processed as the input of the trained speech reconstruction neural network, outputting reconstructed speech features from the trained speech reconstruction neural network, and generating a time-domain speech waveform based on the reconstructed speech features as the output speech data.
  • a storage medium on which program instructions are stored is provided; when the program instructions are run by a computer or a processor, they are used to execute the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention.
  • the storage medium may include, for example, a memory card of a smart phone, a storage part of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk read-only memory (CD-ROM), USB memory, or any combination of the above storage media.
  • the computer-readable storage medium may be any combination of one or more computer-readable storage media.
  • the computer program instructions, when run by a computer, may implement the various functional modules of the deep learning-based voice sound quality enhancement device according to an embodiment of the present invention, and/or may execute the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention.
  • the computer program instructions, when executed by a computer or a processor, cause the computer or the processor to perform the following steps: obtaining to-be-processed voice data, and performing feature extraction on the to-be-processed voice data to obtain the characteristics of the to-be-processed voice data; and, based on the characteristics of the to-be-processed voice data, using the trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data, where the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample respectively to obtain the features of the first speech sample and the features of the second speech sample; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectral phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, wherein the feature extraction is performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, wherein the framing is performed on the time-domain waveform data obtained after decoding.
  • the computer program instructions, when executed by a computer or processor, cause the computer or processor to reconstruct the speech data to be processed into the output speech data using the trained speech reconstruction neural network by: using the features of the speech data to be processed as the input of the trained speech reconstruction neural network, and outputting reconstructed speech features from the trained speech reconstruction neural network; and generating a time-domain speech waveform based on the reconstructed speech features as the output speech data.
  • Each module in the deep learning-based voice sound quality enhancement device according to an embodiment of the present invention may be implemented by a processor of an electronic device for deep learning-based voice sound quality enhancement running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer-readable storage medium of a computer program product according to an embodiment of the present invention are run by a computer.
  • a computer program is also provided, which may be stored on a cloud or local storage medium. When run by a computer or a processor, the computer program is used to execute the corresponding steps of the deep learning-based voice sound quality enhancement method of the embodiment of the present invention, and to implement the corresponding modules in the deep learning-based voice sound quality enhancement device according to the embodiment of the present invention.
  • the method, device, system, storage medium, and computer program for deep learning-based voice sound quality enhancement enhance low-quality voice sound quality with a deep learning method, so that the low-quality voice is reconstructed by a deep neural network to reach the sound quality of high-quality speech, thereby achieving a sound quality improvement effect that traditional methods cannot achieve.
  • the method, device, system, storage medium, and computer program for deep learning-based voice sound quality enhancement according to the embodiments of the present invention can be conveniently deployed on the server side or the client side, and can effectively enhance voice sound quality.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used to implement some or all functions of some modules according to embodiments of the present invention.
  • the invention may also be implemented as a device program (e.g., a computer program or a computer program product) for performing part or all of the method described herein.
  • a program that implements the present invention may be stored on a computer-readable medium or may have the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephone Function (AREA)

Abstract

A deep learning-based speech quality enhancing method, a device, and a system. The method comprises: acquiring speech data to be processed, and performing feature extraction on the speech data, so as to obtain a feature thereof (S210); and reconstructing, on the basis of the feature of the speech data and by means of a trained speech reconstruction neural network, the speech data as output speech data, wherein the output speech data has higher voice quality than the voice quality of the speech data to be processed (S220). The invention enhances the voice quality of low-quality speech data on the basis of a deep learning method, such that the voice quality of the low-quality speech data is enhanced to attain the voice quality of high-quality speech data by means of deep neural network-based reconstruction, thereby realizing a voice quality enhancing effect that conventional methods cannot achieve.

Description

Method, device and system for enhancing speech sound quality based on deep learning
Description
Technical Field
The present invention relates to the technical field of sound quality optimization, and more particularly to a deep learning-based method, device, and system for enhancing speech sound quality.
Background Art
In recent years, wireless voice communication has developed rapidly and is now widely used in civilian and industrial fields. Wireless communication is bandwidth-limited, so speech must be encoded and compressed, lowering its sampling frequency and bit rate as much as possible. Although speech coding reduces speech quality, it greatly saves resources. Early digital voice codecs, such as the Global System for Mobile Communications Half Rate codec (GSM-HR), run at a bit rate of about 6.5 kbps with an 8 kHz sampling frequency; the effective bandwidth is below 4 kHz, so much high-frequency information is lost, leaving the human voice short on distinctiveness and meeting only basic voice communication needs.
As demand for sound quality grows, low-quality speech can no longer satisfy users. With increasing network bandwidth, higher-quality voice communication has also become possible; for example, the Enhanced Voice Services (EVS) codec can reach a 48 kHz sampling frequency and a 128 kbps bit rate. This does not mean every user enjoys high-definition voice communication, however. Consider a call in which the caller's carrier supports a 4G network while the receiver's carrier supports only 3G: the two parties may have to fall back to the adaptive multi-rate narrowband (amr-nb) codec rather than, say, the 16 kHz adaptive multi-rate wideband (amr-wb) codec. Because such hardware-constrained scenarios force the use of low-quality speech, not everyone can enjoy the benefits of high-definition voice communication.
On the other hand, lowering the coding bit rate as far as possible while preserving sound quality is also a main research direction in voice communication. Under limited storage and bandwidth resources, reconstructing low-quality speech by digital signal processing so that its sound quality approaches that of high-quality speech is therefore a valuable research direction. At present, however, there is no feasible software solution for low-quality speech reconstruction: it is usually attempted by filling in or interpolating data, an approach too coarse to restore the sound quality of high-quality speech.
Summary of the Invention
The present invention has been made to solve at least one of the problems described above. It proposes a scheme for deep learning-based speech sound quality enhancement that enhances low-quality speech so that, through deep neural network reconstruction, it reaches the sound quality of high-quality speech, achieving an improvement that traditional methods cannot attain. The scheme is outlined briefly below; more details are described in the detailed embodiments with reference to the drawings.
According to one aspect of the present invention, a deep learning-based speech sound quality enhancement method is provided. The method includes: acquiring speech data to be processed and performing feature extraction on it to obtain its features; and, based on those features, reconstructing the speech data to be processed into output speech data using a trained speech reconstruction neural network, where the voice quality of the output speech data is higher than that of the speech data to be processed.
In an embodiment of the present invention, training the speech reconstruction neural network includes: acquiring a first speech sample and a second speech sample, where the voice quality of the second speech sample is lower than that of the first speech sample and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain their respective features; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained features of the first speech sample as the target of its output layer, so as to train the network.
In an embodiment of the present invention, the first speech sample has a first bit rate, the second speech sample has a second bit rate, and the first bit rate is higher than or equal to the second bit rate.
In an embodiment of the present invention, the first speech sample has a first sampling frequency, the second speech sample has a second sampling frequency, and the first sampling frequency is higher than or equal to the second sampling frequency.
In an embodiment of the present invention, the features obtained by feature extraction include frequency-domain amplitude and/or energy information.
In an embodiment of the present invention, the features obtained by feature extraction further include spectral phase information.
In an embodiment of the present invention, the feature extraction manner includes a short-time Fourier transform.
In an embodiment of the present invention, training the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing each of them separately, the feature extraction being performed frame by frame on the framed speech samples.
In an embodiment of the present invention, training the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding each of them into time-domain waveform data, the framing being performed on the decoded time-domain waveform data.
In an embodiment of the present invention, reconstructing the speech data to be processed into output speech data using the trained speech reconstruction neural network includes: using the features of the speech data to be processed as the input of the trained network and obtaining reconstructed speech features from its output; and generating a time-domain speech waveform from the reconstructed speech features as the output speech data.
According to another aspect of the present invention, a deep learning-based speech sound quality enhancement device is provided. The device includes: a feature extraction module for acquiring speech data to be processed and performing feature extraction on it to obtain its features; and a speech reconstruction module for reconstructing, based on the features extracted by the feature extraction module, the speech data to be processed into output speech data using a trained speech reconstruction neural network, where the voice quality of the output speech data is higher than that of the speech data to be processed.
In an embodiment of the present invention, training the speech reconstruction neural network includes: acquiring a first speech sample and a second speech sample, where the voice quality of the second speech sample is lower than that of the first speech sample and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain their respective features; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained features of the first speech sample as the target of its output layer, so as to train the network.
In an embodiment of the present invention, the first speech sample has a first bit rate, the second speech sample has a second bit rate, and the first bit rate is higher than or equal to the second bit rate.
In an embodiment of the present invention, the first speech sample has a first sampling frequency, the second speech sample has a second sampling frequency, and the first sampling frequency is higher than or equal to the second sampling frequency.
In an embodiment of the present invention, the features obtained by feature extraction include frequency-domain amplitude and/or energy information.
In an embodiment of the present invention, the features obtained by feature extraction further include spectral phase information.
In an embodiment of the present invention, the feature extraction manner includes a short-time Fourier transform.
In an embodiment of the present invention, training the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing each of them separately, the feature extraction being performed frame by frame on the framed speech samples.
In an embodiment of the present invention, training the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding each of them into time-domain waveform data, the framing being performed on the decoded time-domain waveform data.
In an embodiment of the present invention, the speech reconstruction module further includes: a reconstruction module for using the features of the speech data to be processed as the input of the trained speech reconstruction neural network and obtaining reconstructed speech features from its output; and a generation module for generating a time-domain speech waveform, as the output speech data, from the reconstructed speech features output by the reconstruction module.
According to yet another aspect of the present invention, a deep learning-based speech sound quality enhancement system is provided. The system includes a storage device and a processor; the storage device stores a computer program run by the processor, and the computer program, when run by the processor, executes any of the deep learning-based speech sound quality enhancement methods described above.
According to still another aspect of the present invention, a storage medium is provided. The storage medium stores a computer program that, when run, executes any of the deep learning-based speech sound quality enhancement methods described above.
According to yet another aspect of the present invention, a computer program is provided. When run by a computer or processor, the computer program executes any of the deep learning-based speech sound quality enhancement methods described above and implements the modules of any of the deep learning-based speech sound quality enhancement devices described above.
The deep learning-based speech sound quality enhancement method, device, and system according to embodiments of the present invention enhance low-quality speech based on a deep learning method, so that through deep neural network reconstruction it reaches the sound quality of high-quality speech, achieving an improvement that traditional methods cannot attain. In addition, they can be conveniently deployed on the server side or the user side and can efficiently enhance speech sound quality.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present invention will become more apparent from the following more detailed description of embodiments of the present invention with reference to the accompanying drawings. The drawings provide a further understanding of the embodiments, constitute part of the specification, and serve together with the embodiments to explain the invention without limiting it. In the drawings, like reference numerals generally denote like components or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing the deep learning-based speech sound quality enhancement method, device, and system according to embodiments of the present invention;
FIG. 2 shows a schematic flowchart of a deep learning-based speech sound quality enhancement method according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of training a speech reconstruction neural network according to an embodiment of the present invention;
FIGS. 4A, 4B, and 4C show spectrograms of, respectively, high-quality speech, low-quality speech, and speech obtained by reconstructing the low-quality speech with the deep learning-based speech sound quality enhancement method according to an embodiment of the present invention;
FIG. 5 shows a schematic block diagram of a deep learning-based speech sound quality enhancement device according to an embodiment of the present invention; and
FIG. 6 shows a schematic block diagram of a deep learning-based speech sound quality enhancement system according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, example embodiments according to the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. All other embodiments obtained by those skilled in the art based on the embodiments described herein without creative effort shall fall within the protection scope of the present invention.
First, an example electronic device 100 for implementing the deep learning-based speech sound quality enhancement method, device, and system of embodiments of the present invention is described with reference to FIG. 1.
As shown in FIG. 1, the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, and an output device 108, interconnected through a bus system 110 and/or another form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in FIG. 1 are only exemplary, not restrictive; the electronic device may have other components and structures as needed.
The processor 102 may be a central processing unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components of the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random-access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) of the embodiments of the present invention described below and/or other desired functions. Various applications and various data, such as the data used and/or produced by those applications, may also be stored on the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. The input device 106 may also be any interface for receiving information.
The output device 108 may output various information (such as images or sounds) to the outside (for example, a user) and may include one or more of a display, a speaker, and the like. The output device 108 may also be any other device having an output function.
Exemplarily, the example electronic device for implementing the deep learning-based speech sound quality enhancement method, device, and system according to embodiments of the present invention may be implemented as a terminal such as a smartphone or a tablet computer.
Hereinafter, a deep learning-based speech sound quality enhancement method 200 according to an embodiment of the present invention is described with reference to FIG. 2. As shown in FIG. 2, the method 200 may include the following steps:
In step S210, speech data to be processed is acquired, and feature extraction is performed on it to obtain its features.
In one embodiment, the speech data to be processed acquired in step S210 may be low-quality speech data requiring sound quality enhancement that is received, stored, or played in a voice communication terminal or a voice storage/playback device, for example speech data of low bit rate or low sampling frequency. Exemplarily, the speech data to be processed may include, but is not limited to, the data stream of a wireless voice call, speech in a playlist the user is playing, or speech files stored in the cloud or on a client. In other examples, it may be any data requiring sound quality enhancement, such as speech data contained in video data. In addition, the speech data to be processed acquired in step S210 may come from a file stored offline or from a file played online.
In one embodiment, the manner of feature extraction performed on the acquired speech data to be processed may include, but is not limited to, the short-time Fourier transform (STFT). Exemplarily, the resulting features may include frequency-domain amplitude and/or energy information, and may further include spectral phase information. The features may also be time-domain features, or any other features that can characterize the speech data to be processed.
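As an illustration of this step, the following is a minimal sketch of STFT-based feature extraction, assuming a PCM waveform as input; the function name and frame parameters are illustrative assumptions, not values from the patent:

```python
import numpy as np
from scipy.signal import stft

def extract_features(waveform, sample_rate, nperseg=512, hop=256):
    """Short-time Fourier transform features: per-frame magnitude and phase."""
    _, _, Z = stft(waveform, fs=sample_rate, nperseg=nperseg, noverlap=nperseg - hop)
    magnitude = np.abs(Z).T   # frequency-domain amplitude information, (frames, bins)
    phase = np.angle(Z).T     # spectral phase information, (frames, bins)
    return magnitude, phase
```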
In one embodiment, before feature extraction, the speech data to be processed may first be framed, with the aforementioned feature extraction performed frame by frame on the framed speech data. This applies when the speech data acquired in step S210 comes from a file stored offline or a complete file from any source. In another embodiment, if the speech data acquired in step S210 comes from a file played online, one or more frames may be buffered before feature extraction. Exemplarily, for each frame obtained after framing or buffering, a portion of the data may be selected for feature extraction, which effectively reduces the data volume and improves processing efficiency.
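A sketch of the framing described above, assuming an already-decoded 1-D waveform at least one frame long; for online playback, one would instead buffer incoming samples and emit frames as enough data arrives:

```python
import numpy as np

def frame_signal(waveform, frame_len, hop_len):
    """Split a 1-D waveform into overlapping frames for frame-by-frame processing."""
    frames = [waveform[start:start + frame_len]
              for start in range(0, len(waveform) - frame_len + 1, hop_len)]
    return np.stack(frames)  # shape: (num_frames, frame_len)
```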
In yet another embodiment, before the aforementioned framing, the speech data to be processed may first be decoded, with the framing performed on the decoded time-domain waveform data. This is because the acquired speech data is generally in encoded form; to obtain its complete time-domain speech information, it can first be decoded.
In yet another embodiment, before feature extraction, the speech data to be processed may also be preprocessed, with the aforementioned feature extraction performed on the preprocessed speech data. Exemplarily, the preprocessing may include, but is not limited to, denoising, echo suppression, and automatic gain control, and may be performed after the aforementioned decoding. Thus, in one example, the acquired speech data may be decoded, preprocessed, framed, and feature-extracted in sequence, so as to efficiently extract well-representative features. In other examples, the preprocessing may instead be performed after framing and before feature extraction.
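Putting the stages above in order, a hedged sketch of the decode → preprocess → feature-extraction pipeline; `soundfile` is an assumed decoder (it reads WAV/FLAC/OGG, while AMR or MP3 input would need a codec-specific front-end), and the preprocessing stage is left as a placeholder:

```python
import soundfile as sf  # assumed decoder; AMR/MP3 would need a codec-specific front-end

def prepare_speech(path):
    """Decode -> (optional) preprocess -> frame-wise STFT features."""
    waveform, sr = sf.read(path)  # decode to time-domain waveform data
    # Optional preprocessing (denoising, echo suppression, AGC) would be applied here.
    magnitude, phase = extract_features(waveform, sr)  # from the sketch above
    return magnitude, phase, sr
```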
Continuing with FIG. 2, the subsequent steps of the deep learning-based speech sound quality enhancement method 200 according to an embodiment of the present invention are described.
In step S220, based on the features of the speech data to be processed, the trained speech reconstruction neural network is used to reconstruct the speech data to be processed into output speech data, where the voice quality of the output speech data is higher than that of the speech data to be processed.
In the embodiments of the present invention, the features of the speech data to be processed extracted in step S210 are input to the trained speech reconstruction neural network, which reconstructs the input features into reconstructed speech features; these can then be used to generate output speech data of higher voice quality than the acquired speech data. The speech sound quality enhancement method of the present invention can therefore precisely supplement, based on deep learning, the speech information lost in low-quality speech, achieving a large improvement in low-quality speech sound quality without sacrificing communication bandwidth (what is transmitted is still the smaller low-quality speech data, which can be reconstructed into high-quality speech data at the receiving end).
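A minimal inference sketch of this step in PyTorch, assuming `model` is the trained speech reconstruction network operating on per-frame features; the function name is illustrative:

```python
import torch

@torch.no_grad()
def reconstruct_features(model, low_quality_features):
    """Low-quality frame features in, reconstructed (enhanced) features out."""
    x = torch.as_tensor(low_quality_features, dtype=torch.float32)  # (frames, bins)
    return model(x).cpu().numpy()  # reconstructed speech features
```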
The training process of the above speech reconstruction neural network according to an embodiment of the present invention is described below with reference to FIG. 3. As shown in FIG. 3, the training may include the following process:
In S310, a first speech sample and a second speech sample are acquired, where the voice quality of the second speech sample is lower than that of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample.
In one example, the first speech sample may be a high-quality speech sample and the second a low-quality one. Exemplarily, the first speech sample may be a set of speech samples of high bit rate and high sampling frequency, including but not limited to speech data sampled at 16 kHz, 24 kHz, or 32 kHz. In one example, the first speech sample may be transcoded to obtain the second. For instance, an amr-wb speech sample at a 16 kHz sampling frequency and a 23.85 kbps bit rate may serve as the first speech sample and be transcoded into amr-nb speech at an 8 kHz sampling frequency and a 12.2 kbps bit rate to obtain the second. As another instance, the second speech sample may be obtained by converting a first speech sample in FLAC format into MP3 format without lowering the bit rate or sampling frequency. That is, the bit rate of the first speech sample may be higher than or equal to that of the second, and likewise for the sampling frequency. Of course, this is only exemplary: other transcodings of the first (high-quality) sample into the second (low-quality) sample are possible and can be adapted to the actual application scenario. Specifically, the first speech sample to select and the transcoding used to derive the second can be determined from the reconstruction requirements for the speech data to be processed acquired in step S210.
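For illustration only, the sketch below derives a low-quality training sample from a high-quality one by simple downsampling; an actual pipeline, as the patent describes, would instead round-trip the first sample through a real codec (e.g., encode to amr-nb at 8 kHz / 12.2 kbps and decode back), which this resampling only approximates:

```python
from scipy.signal import resample_poly

def make_second_sample(first_sample_16k):
    """Approximate 'transcoding' a 16 kHz first sample into an 8 kHz second sample."""
    return resample_poly(first_sample_16k, up=1, down=2)  # 16 kHz -> 8 kHz
```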
Continuing with FIG. 3, in S320, feature extraction is performed on the first speech sample and the second speech sample to obtain the features of each.
Similarly to what was described for step S210, in one embodiment the manner of feature extraction for the first and second speech samples may include, but is not limited to, the short-time Fourier transform. Exemplarily, the extracted features may include each sample's frequency-domain amplitude and/or energy information and may further include its spectral phase information; they may also be time-domain features, or any other features that can characterize the respective samples.
Also as described for step S210, in one embodiment the first and second speech samples may each be framed before feature extraction, with the feature extraction performed frame by frame on the framed samples. Exemplarily, a portion of the data may be selected from each frame for feature extraction, effectively reducing the data volume and improving processing efficiency.
In yet another embodiment, before the aforementioned framing, the first and second speech samples may each be decoded, with the framing performed on the respective decoded time-domain waveform data.
In yet another embodiment, before feature extraction, the first and second speech samples may each be preprocessed, with the feature extraction performed on the preprocessed samples. Exemplarily, the preprocessing may include, but is not limited to, denoising, echo suppression, and automatic gain control, and may be performed after decoding. Thus, in one example, each sample may be decoded, preprocessed, framed, and feature-extracted in sequence, so as to efficiently extract well-representative features. In other examples, the preprocessing may instead be performed after framing and before feature extraction.
In S330, the obtained features of the second speech sample are used as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of its output layer, so as to train the speech reconstruction neural network.
In one embodiment, the features of one or more frames of the second speech sample may be used as the input of the network's input layer and the features of one or more frames of the first speech sample as the target of its output layer, thereby training a neural network regressor to serve as the speech reconstruction neural network employed in step S220.
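A hedged sketch of such a neural network regressor in PyTorch; the layer sizes and feature dimensions (129 bins for an 8 kHz input grid, 257 for a 16 kHz target grid) are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

# Regressor: second-sample (low-quality) frame features in,
# first-sample (high-quality) frame features as the training target.
model = nn.Sequential(
    nn.Linear(129, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 257),  # wider-band target has more frequency bins
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(low_feats, high_feats):
    """One regression step; both arguments are float32 tensors of shape (batch, bins)."""
    optimizer.zero_grad()
    loss = loss_fn(model(low_feats), high_feats)
    loss.backward()
    optimizer.step()
    return loss.item()
```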
The training process of the speech reconstruction neural network according to an embodiment of the present invention has been exemplarily described above with reference to FIG. 3. Returning to FIG. 2: as noted, in step S220 the trained network reconstructs the features of the speech data to be processed into reconstructed speech features; since these are frequency-domain features, a time-domain speech waveform can be generated from them as output. Exemplarily, the reconstructed speech features may be transformed into the time-domain waveform by an inverse Fourier transform. The output waveform may be stored or buffered for playback, giving the user a better, enhanced listening experience. The speech sound quality enhancement effect of the method according to the embodiments can be appreciated with reference to FIGS. 4A-4C below.
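A sketch of this final waveform generation via the inverse transform, assuming reconstructed magnitude and a phase estimate on the same time-frequency grid (a simple system may reuse or extend the input phase; this is an assumption, not a detail from the patent):

```python
import numpy as np
from scipy.signal import istft

def features_to_waveform(magnitude, phase, sample_rate, nperseg=512, hop=256):
    """Combine magnitude and phase, then inverse-STFT back to a time-domain waveform."""
    Z = (magnitude * np.exp(1j * phase)).T  # (bins, frames)
    _, waveform = istft(Z, fs=sample_rate, nperseg=nperseg, noverlap=nperseg - hop)
    return waveform
```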
FIGS. 4A, 4B, and 4C show spectrograms of, respectively, high-quality speech, low-quality speech, and speech obtained by reconstructing the low-quality speech with the deep learning-based speech sound quality enhancement method according to an embodiment of the present invention. FIG. 4A shows the spectrogram 400 of high-quality speech, taking PCM format, a 16 kHz sampling frequency, and 16-bit quantization as an example; FIG. 4B shows the spectrogram 401 of the low-quality speech obtained by transcoding it to MP3 format at an 8 kHz sampling frequency and an 8 kbps bit rate; and FIG. 4C shows the spectrogram 402 of the 16 kHz reconstructed speech obtained from the low-quality speech by the method. As is evident from FIGS. 4A-4C, the low-quality spectrogram of FIG. 4B lacks many of the high-frequency components present in FIG. 4A, while the reconstruction of FIG. 4C recovers them, achieving super-resolution of the narrowband speech and a clear improvement in the low-quality speech's sound quality.
Based on the above description, the deep learning-based speech sound quality enhancement method according to the embodiments of the present invention enhances low-quality speech based on a deep learning method, so that through deep neural network reconstruction it reaches the sound quality of high-quality speech, achieving an improvement that traditional methods cannot attain.
The deep learning-based speech sound quality enhancement method according to the embodiments of the present invention has been exemplarily described above. Exemplarily, it may be implemented in a device, apparatus, or system having a memory and a processor.
In addition, the method can be conveniently deployed on mobile devices such as smartphones, tablet computers, personal computers, earphones, and speakers. Alternatively, it may be deployed on the server side (or cloud), or deployed in a distributed manner across the server side (or cloud) and personal terminals.
A deep learning-based speech sound quality enhancement device provided by another aspect of the present invention is described below with reference to FIG. 5, which shows a schematic block diagram of such a device 500 according to an embodiment of the present invention.
As shown in FIG. 5, the deep learning-based speech sound quality enhancement device 500 according to an embodiment of the present invention includes a feature extraction module 510 and a speech reconstruction module 520. These modules may respectively perform the steps/functions of the deep learning-based speech sound quality enhancement method described above in conjunction with FIG. 2. Only the main functions of the modules of the device 500 are described below; details already covered are omitted.
The feature extraction module 510 acquires speech data to be processed and performs feature extraction on it to obtain its features. The speech reconstruction module 520 reconstructs, based on the features extracted by the feature extraction module, the speech data to be processed into output speech data using a trained speech reconstruction neural network, where the voice quality of the output speech data is higher than that of the speech data to be processed. Both modules may be implemented by the processor 102 of the electronic device shown in FIG. 1 running program instructions stored in the storage device 104.
In one embodiment, the speech data to be processed acquired by the feature extraction module 510 may be low-quality speech data requiring sound quality enhancement that is received, stored, or played in a voice communication terminal or a voice storage/playback device, for example speech data of low bit rate or low sampling frequency. Exemplarily, it may include, but is not limited to, the data stream of a wireless voice call, speech in a playlist the user is playing, or speech files stored in the cloud or on a client. In other examples, it may be any data requiring sound quality enhancement, such as speech data contained in video data. It may come from a file stored offline or from a file played online.
In one embodiment, the manner in which the feature extraction module 510 performs feature extraction may include, but is not limited to, the short-time Fourier transform (STFT). Exemplarily, the resulting features may include frequency-domain amplitude and/or energy information, may further include spectral phase information, may be time-domain features, or may include any other features that can characterize the speech data to be processed.
In one embodiment, before the feature extraction module 510 performs feature extraction, the speech data to be processed may first be framed, with the feature extraction performed frame by frame on the framed data. This applies when the acquired speech data comes from a file stored offline or a complete file from any source. In another embodiment, if the acquired speech data comes from a file played online, one or more frames may be buffered before feature extraction. Exemplarily, the module may select a portion of each framed or buffered frame for feature extraction, effectively reducing the data volume and improving processing efficiency.
In yet another embodiment, before the feature extraction module 510 performs the aforementioned framing, the speech data to be processed may first be decoded, with the framing performed on the decoded time-domain waveform data. This is because the acquired speech data is generally in encoded form; to obtain its complete time-domain speech information, it can first be decoded.
In yet another embodiment, before the feature extraction module 510 performs feature extraction, the speech data to be processed may also be preprocessed, with the feature extraction performed on the preprocessed data. Exemplarily, the preprocessing may include, but is not limited to, denoising, echo suppression, and automatic gain control, and may be performed after decoding. Thus, in one example, the module may decode, preprocess, frame, and feature-extract the acquired speech data in sequence, so as to efficiently extract well-representative features. In other examples, the preprocessing may instead be performed after framing and before feature extraction.
Based on the features of the speech data to be processed extracted by the feature extraction module 510, the speech reconstruction module 520 may use the trained speech reconstruction neural network to reconstruct the speech data into output speech data.
In the embodiments of the present invention, the speech reconstruction module 520 may further include a reconstruction module (not shown in FIG. 5) and a generation module (not shown in FIG. 5). The reconstruction module may include the trained speech reconstruction neural network, which takes as input the features extracted by the feature extraction module 510 and reconstructs them into reconstructed speech features. The generation module generates, from the reconstructed speech features output by the reconstruction module, output speech data of higher voice quality than the acquired speech data. The speech sound quality enhancement device of the present invention can therefore precisely supplement, based on deep learning, the speech information lost in low-quality speech, achieving a large improvement in low-quality speech sound quality without sacrificing communication bandwidth (what is transmitted is still the smaller low-quality speech data, which can be reconstructed into high-quality speech data at the receiving end).
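The module structure might be composed as in the sketch below, reusing the earlier helper functions; the class name is hypothetical, and the network's input and output features are assumed to live on the same time-frequency grid (e.g., the low-quality input resampled to the target rate before feature extraction):

```python
class SpeechQualityEnhancer:
    """Feature extraction module feeding a speech reconstruction module."""

    def __init__(self, model, sample_rate=16000, nperseg=512, hop=256):
        self.model, self.sr = model, sample_rate
        self.nperseg, self.hop = nperseg, hop

    def process(self, waveform):
        magnitude, phase = extract_features(waveform, self.sr,       # feature extraction module
                                            self.nperseg, self.hop)
        reconstructed = reconstruct_features(self.model, magnitude)  # reconstruction module
        return features_to_waveform(reconstructed, phase, self.sr,   # generation module
                                    self.nperseg, self.hop)
```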
In the embodiments of the present invention, training the speech reconstruction neural network used by the speech reconstruction module 520 may include: acquiring a first speech sample and a second speech sample, where the voice quality of the second speech sample is lower than that of the first speech sample and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first and second speech samples to obtain their respective features; and using the obtained features of the second speech sample as the input of the network's input layer and the obtained features of the first speech sample as the target of its output layer, so as to train the network. The training process of the speech reconstruction neural network used by the speech reconstruction module 520 of the device 500 can be understood with reference to the description of FIG. 3 above; for brevity, the details are not repeated here.
In one example, the first speech sample may be a high-quality speech sample and the second speech sample a low-quality speech sample. Exemplarily, the first speech sample may be a set of speech samples with a high bit rate and a high sampling frequency, including but not limited to speech data sampled at 16 kHz, 24 kHz, or 32 kHz. In one example, the first speech sample may be transcoded to obtain the second speech sample. For example, an amr-wb speech sample with a 16 kHz sampling frequency and a 23.85 kbps bit rate may serve as the first speech sample, and the second speech sample may be obtained by transcoding it into amr-nb speech with an 8 kHz sampling frequency and a 12.2 kbps bit rate. As another example, the second speech sample may be obtained by converting a first speech sample in FLAC format to MP3 format without lowering the bit rate or the sampling frequency. That is, the bit rate of the first speech sample may be higher than or equal to that of the second speech sample, and the sampling frequency of the first speech sample may be higher than or equal to that of the second speech sample. Of course, this is merely exemplary: other ways of transcoding the first speech sample (i.e., the high-quality sample) into the second speech sample (i.e., the low-quality sample) are possible and can be adapted to the actual application scenario. Specifically, the first and second speech samples to be used can be determined based on the reconstruction requirements of the to-be-processed speech data acquired by the feature extraction module 510; that is, those requirements determine which first speech samples to select and which transcoding scheme to apply to derive the second speech samples.
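As a rough illustration of how such a training pair might be derived, the following sketch degrades a wideband sample by resampling it to a lower sampling frequency and back. This only mimics the bandwidth loss; actual codec transcoding (e.g., amr-wb to amr-nb) would be used in practice, and the file name and the soundfile/scipy libraries are assumptions, not part of this disclosure.

```python
# Simplified stand-in for the transcoding step: only the bandwidth loss of the
# amr-wb (16 kHz) -> amr-nb (8 kHz) example is mimicked, via resampling.
import soundfile as sf
from scipy.signal import resample_poly

first_sample, fs = sf.read("high_quality_16khz.wav")     # hypothetical file
assert fs == 16000
narrowband = resample_poly(first_sample, up=1, down=2)   # 16 kHz -> 8 kHz
second_sample = resample_poly(narrowband, up=2, down=1)  # back onto 16 kHz grid
```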
In one embodiment, the feature extraction performed on the first and second speech samples may include, but is not limited to, the short-time Fourier transform. Exemplarily, the features obtained by performing feature extraction on the first and second speech samples may include their respective frequency-domain amplitude and/or energy information. Exemplarily, the extracted features may further include their respective spectral phase information. Exemplarily, the extracted features may also be their respective time-domain features. In other examples, the extracted features may further include any other features that characterize the respective samples.
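The following sketch shows one way the short-time Fourier transform could yield the amplitude/energy and phase features mentioned above; the window length, hop size, and log compression are illustrative assumptions, as the disclosure does not specify them.

```python
# Sketch of STFT-based feature extraction; 512/256 window/hop and the log1p
# compression are assumptions made for illustration.
import numpy as np
from scipy.signal import stft

def extract_features(x, fs=16000):
    """Return per-frame log-magnitude (amplitude/energy) and phase features."""
    _, _, Z = stft(x, fs=fs, nperseg=512, noverlap=256)  # Z: (freq bins, frames)
    return np.log1p(np.abs(Z)), np.angle(Z)
```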
In one embodiment, before feature extraction is performed on the first and second speech samples, each of them may be split into frames, and the aforementioned feature extraction may be performed frame by frame on the resulting frames. Exemplarily, only part of the data in each frame may be selected for feature extraction, which effectively reduces the amount of data and improves processing efficiency.
In yet another embodiment, before the aforementioned framing of the first and second speech samples, each of them may first be decoded, and the framing may be performed on their respective decoded time-domain waveform data.
In yet another embodiment, before feature extraction is performed on the first and second speech samples, each of them may first be pre-processed, and the feature extraction may be performed on the pre-processed speech samples. Exemplarily, the pre-processing performed on the first and second speech samples may include, but is not limited to, denoising, echo suppression, and automatic gain control. Exemplarily, the pre-processing may be performed after the aforementioned decoding. Thus, in one example, the first and second speech samples may each be sequentially decoded, pre-processed, framed, and feature-extracted, so as to efficiently extract highly representative features. In other examples, the pre-processing may instead be performed after the first and second speech samples are framed and before feature extraction.
In one embodiment, the features of one or more frames of the second speech sample may be used as the input of the input layer of the speech reconstruction neural network, and the features of one or more frames of the first speech sample as the target of its output layer, thereby training a neural network regressor to serve as the speech reconstruction neural network employed in the speech reconstruction module 520.
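A minimal sketch of such a regressor follows, assuming a simple fully connected network trained with a mean-squared-error loss; the layer sizes, the Adam optimizer, and the loss are assumptions, since the disclosure does not fix the architecture or the training procedure.

```python
# Minimal regressor sketch: low-quality frame features in, high-quality frame
# features as the target. Layer sizes, Adam, and MSE are assumptions.
import torch
import torch.nn as nn

n_bins = 257                              # bins of a 512-point STFT frame
model = nn.Sequential(
    nn.Linear(n_bins, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_bins),              # regress the high-quality features
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

low_feats = torch.randn(64, n_bins)       # stand-in: second-sample features
high_feats = torch.randn(64, n_bins)      # stand-in: first-sample features
for _ in range(10):                       # one-batch loop, for brevity
    optimizer.zero_grad()
    loss = loss_fn(model(low_feats), high_feats)
    loss.backward()
    optimizer.step()
```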
Based on the trained speech reconstruction neural network, the reconstruction module of the speech reconstruction module 520 can reconstruct the features of the to-be-processed speech data into reconstructed speech features. Since these reconstructed speech features are frequency-domain features, the generation module of the speech reconstruction module 520 can generate a time-domain speech waveform output from them. Exemplarily, the generation module may transform the reconstructed speech features into a time-domain speech waveform via the inverse Fourier transform. The output speech waveform can be stored or buffered for playback, providing users with a noticeably improved listening experience. The speech quality enhancement effect of the deep learning-based speech quality enhancement apparatus according to the embodiments can be appreciated with reference to FIGS. 4A-4C and their description above. For brevity, the details are not repeated here.
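Continuing the illustrative sketches above, the generation step could invert the reconstructed magnitude features with an inverse STFT. Reusing the phase of the input signal is an assumption made here for simplicity; the disclosure only states that an inverse Fourier transform is applied.

```python
# Generation-step sketch: invert reconstructed magnitude features back to a
# time-domain waveform. Reusing the input phase is an assumption.
import numpy as np
from scipy.signal import istft

def generate_waveform(log_magnitude, phase, fs=16000):
    magnitude = np.expm1(log_magnitude)       # undo the log1p feature scaling
    Z = magnitude * np.exp(1j * phase)        # recombine magnitude and phase
    _, waveform = istft(Z, fs=fs, nperseg=512, noverlap=256)
    return waveform
```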
Based on the above description, the deep learning-based speech quality enhancement apparatus according to the embodiments of the present invention enhances low-quality speech with a deep learning method, raising its quality to that of high-quality speech through deep neural network reconstruction and thereby achieving a quality improvement that traditional methods cannot. In addition, the apparatus according to the embodiments of the present invention can be conveniently deployed on the server side or the user side and can enhance speech quality efficiently.
FIG. 6 shows a schematic block diagram of a deep learning-based speech quality enhancement system 600 according to an embodiment of the present invention. The deep learning-based speech quality enhancement system 600 includes a storage device 610 and a processor 620.
The storage device 610 stores a program for implementing the corresponding steps of the deep learning-based speech quality enhancement method according to an embodiment of the present invention. The processor 620 is configured to run the program stored in the storage device 610 to perform the corresponding steps of the deep learning-based speech quality enhancement method according to an embodiment of the present invention and to implement the corresponding modules of the deep learning-based speech quality enhancement apparatus according to an embodiment of the present invention.
In one embodiment, when the program is run by the processor 620, the deep learning-based speech quality enhancement system 600 performs the following steps: acquiring to-be-processed speech data and performing feature extraction on it to obtain its features; and, based on those features, using the trained speech reconstruction neural network to reconstruct the to-be-processed speech data into output speech data, where the speech quality of the output speech data is higher than that of the to-be-processed speech data.
In one embodiment, the training of the speech reconstruction neural network includes: acquiring a first speech sample and a second speech sample, where the speech quality of the second speech sample is lower than that of the first speech sample and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first and second speech samples respectively to obtain their respective features; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained features of the first speech sample as the target of its output layer, so as to train the speech reconstruction neural network.
In one embodiment, the first speech sample has a first bit rate and the second speech sample has a second bit rate, the first bit rate being higher than or equal to the second bit rate.
In one embodiment, the first speech sample has a first sampling frequency and the second speech sample has a second sampling frequency, the first sampling frequency being higher than or equal to the second sampling frequency.
In one embodiment, the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
In one embodiment, the features obtained by the feature extraction further include spectral phase information.
In one embodiment, the feature extraction includes the short-time Fourier transform.
In one embodiment, the training of the speech reconstruction neural network further includes: before feature extraction is performed on the first and second speech samples, framing each of them, the feature extraction being performed frame by frame on the frames thus obtained.
In one embodiment, the training of the speech reconstruction neural network further includes: before the first and second speech samples are framed, decoding each of them into time-domain waveform data, the framing being performed on the decoded time-domain waveform data.
In one embodiment, the reconstructing of the to-be-processed speech data into output speech data using the trained speech reconstruction neural network, performed by the deep learning-based speech quality enhancement system 600 when the program is run by the processor 620, includes: feeding the features of the to-be-processed speech data to the trained speech reconstruction neural network, which outputs reconstructed speech features; and generating a time-domain speech waveform from the reconstructed speech features as the output speech data.
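Tying the illustrative helpers above together, an end-to-end inference pass might look as follows; extract_features, model, and generate_waveform are the hypothetical sketches defined earlier, not the concrete implementation of this disclosure.

```python
# End-to-end inference sketch built from the hypothetical helpers above.
import torch

def enhance(waveform, fs=16000):
    log_mag, phase = extract_features(waveform, fs)   # (bins, frames) each
    frames = torch.from_numpy(log_mag.T).float()      # (frames, bins)
    with torch.no_grad():
        reconstructed = model(frames).numpy().T       # back to (bins, frames)
    return generate_waveform(reconstructed, phase, fs)
```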
In addition, according to an embodiment of the present invention, a storage medium is further provided, on which program instructions are stored; when the program instructions are run by a computer or a processor, they perform the corresponding steps of the deep learning-based speech quality enhancement method according to the embodiments of the present invention and implement the corresponding modules of the deep learning-based speech quality enhancement apparatus according to the embodiments. The storage medium may include, for example, a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
In one embodiment, the computer program instructions, when run by a computer, can implement the functional modules of the deep learning-based speech quality enhancement apparatus according to the embodiments of the present invention and/or can perform the deep learning-based speech quality enhancement method according to the embodiments.
In one embodiment, the computer program instructions, when run by a computer or a processor, cause the computer or processor to perform the following steps: acquiring to-be-processed speech data and performing feature extraction on it to obtain its features; and, based on those features, using the trained speech reconstruction neural network to reconstruct the to-be-processed speech data into output speech data, where the speech quality of the output speech data is higher than that of the to-be-processed speech data.
In one embodiment, the training of the speech reconstruction neural network includes: acquiring a first speech sample and a second speech sample, where the speech quality of the second speech sample is lower than that of the first speech sample and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first and second speech samples respectively to obtain their respective features; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained features of the first speech sample as the target of its output layer, so as to train the speech reconstruction neural network.
In one embodiment, the first speech sample has a first bit rate and the second speech sample has a second bit rate, the first bit rate being higher than or equal to the second bit rate.
In one embodiment, the first speech sample has a first sampling frequency and the second speech sample has a second sampling frequency, the first sampling frequency being higher than or equal to the second sampling frequency.
In one embodiment, the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
In one embodiment, the features obtained by the feature extraction further include spectral phase information.
In one embodiment, the feature extraction includes the short-time Fourier transform.
In one embodiment, the training of the speech reconstruction neural network further includes: before feature extraction is performed on the first and second speech samples, framing each of them, the feature extraction being performed frame by frame on the frames thus obtained.
In one embodiment, the training of the speech reconstruction neural network further includes: before the first and second speech samples are framed, decoding each of them into time-domain waveform data, the framing being performed on the decoded time-domain waveform data.
In one embodiment, the reconstructing of the to-be-processed speech data into output speech data using the trained speech reconstruction neural network, performed by the computer or processor when the computer program instructions are run, includes: feeding the features of the to-be-processed speech data to the trained speech reconstruction neural network, which outputs reconstructed speech features; and generating a time-domain speech waveform from the reconstructed speech features as the output speech data.
The modules of the deep learning-based speech quality enhancement apparatus according to the embodiments of the present invention may be implemented by the processor of an electronic device for deep learning-based speech quality enhancement running computer program instructions stored in a memory, or may be implemented when computer instructions stored in the computer-readable storage medium of a computer program product according to an embodiment of the present invention are run by a computer.
In addition, according to an embodiment of the present invention, a computer program is further provided, which may be stored on a cloud or local storage medium. When run by a computer or a processor, it performs the corresponding steps of the deep learning-based speech quality enhancement method according to the embodiments of the present invention and implements the corresponding modules of the deep learning-based speech quality enhancement apparatus according to the embodiments.
The deep learning-based speech quality enhancement method, apparatus, system, storage medium, and computer program according to the embodiments of the present invention enhance low-quality speech based on a deep learning method, raising its quality to that of high-quality speech through deep neural network reconstruction and thereby achieving a quality improvement that traditional methods cannot. In addition, they can be conveniently deployed on the server side or the user side and can enhance speech quality efficiently.
Although example embodiments have been described herein with reference to the accompanying drawings, it should be understood that the above example embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Those of ordinary skill in the art may make various changes and modifications therein without departing from the scope and spirit of the present invention. All such changes and modifications are intended to fall within the scope of the present invention as claimed in the appended claims.
Those of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely schematic; for instance, the division of the units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed.
In the description provided here, numerous specific details are set forth. It is understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the description of exemplary embodiments of the present invention the various features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the method of this disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the appended claims reflect, the inventive aspect lies in that the corresponding technical problem may be solved with fewer than all the features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination, except where such features are mutually exclusive. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules according to embodiments of the present invention. The present invention may also be implemented as device programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and the like does not indicate any ordering; these words may be interpreted as names.
The above is merely a description of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any change or substitution readily conceivable to a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

  1. A deep learning-based speech quality enhancement method, characterized in that the method comprises:
    acquiring to-be-processed speech data, and performing feature extraction on the to-be-processed speech data to obtain features of the to-be-processed speech data; and
    based on the features of the to-be-processed speech data, reconstructing the to-be-processed speech data into output speech data using a trained speech reconstruction neural network, wherein the speech quality of the output speech data is higher than the speech quality of the to-be-processed speech data.
  2. The method according to claim 1, characterized in that the training of the speech reconstruction neural network comprises:
    acquiring a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample;
    performing feature extraction on the first speech sample and the second speech sample respectively to obtain features of the first speech sample and features of the second speech sample respectively; and
    using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, so as to train the speech reconstruction neural network.
  3. The method according to claim 2, characterized in that the first speech sample has a first bit rate and the second speech sample has a second bit rate, the first bit rate being higher than or equal to the second bit rate.
  4. The method according to claim 3, characterized in that the first speech sample has a first sampling frequency and the second speech sample has a second sampling frequency, the first sampling frequency being higher than or equal to the second sampling frequency.
  5. The method according to claim 1 or 2, characterized in that the features obtained by the feature extraction comprise frequency-domain amplitude and/or energy information.
  6. The method according to claim 5, characterized in that the features obtained by the feature extraction further comprise spectral phase information.
  7. The method according to claim 6, characterized in that the feature extraction comprises a short-time Fourier transform.
  8. The method according to claim 2, characterized in that the training of the speech reconstruction neural network further comprises:
    before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample respectively, wherein the feature extraction is performed frame by frame on the speech samples obtained after framing.
  9. The method according to claim 8, characterized in that the training of the speech reconstruction neural network further comprises:
    before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample respectively into time-domain waveform data, wherein the framing is performed on the time-domain waveform data obtained after decoding.
  10. The method according to claim 1, characterized in that the reconstructing of the to-be-processed speech data into output speech data using the trained speech reconstruction neural network comprises:
    using the features of the to-be-processed speech data as the input of the trained speech reconstruction neural network, the trained speech reconstruction neural network outputting reconstructed speech features; and
    generating a time-domain speech waveform based on the reconstructed speech features as the output speech data.
  11. A deep learning-based speech quality enhancement apparatus, characterized in that the apparatus comprises:
    a feature extraction module, configured to acquire to-be-processed speech data and perform feature extraction on the to-be-processed speech data to obtain features of the to-be-processed speech data; and
    a speech reconstruction module, configured to reconstruct, based on the features of the to-be-processed speech data extracted by the feature extraction module, the to-be-processed speech data into output speech data using a trained speech reconstruction neural network, wherein the speech quality of the output speech data is higher than the speech quality of the to-be-processed speech data.
  12. The apparatus according to claim 11, characterized in that the training of the speech reconstruction neural network comprises:
    acquiring a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample;
    performing feature extraction on the first speech sample and the second speech sample respectively to obtain features of the first speech sample and features of the second speech sample respectively; and
    using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, so as to train the speech reconstruction neural network.
  13. The apparatus according to claim 12, characterized in that the first speech sample has a first bit rate and the second speech sample has a second bit rate, the first bit rate being higher than or equal to the second bit rate.
  14. The apparatus according to claim 13, characterized in that the first speech sample has a first sampling frequency and the second speech sample has a second sampling frequency, the first sampling frequency being higher than or equal to the second sampling frequency.
  15. The apparatus according to claim 11 or 12, characterized in that the features obtained by the feature extraction comprise frequency-domain amplitude and/or energy information.
  16. The apparatus according to claim 15, characterized in that the features obtained by the feature extraction further comprise spectral phase information.
  17. The apparatus according to claim 16, characterized in that the feature extraction comprises a short-time Fourier transform.
  18. The apparatus according to claim 12, characterized in that the training of the speech reconstruction neural network further comprises:
    before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample respectively, wherein the feature extraction is performed frame by frame on the speech samples obtained after framing.
  19. The apparatus according to claim 18, characterized in that the training of the speech reconstruction neural network further comprises:
    before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample respectively into time-domain waveform data, wherein the framing is performed on the time-domain waveform data obtained after decoding.
  20. The apparatus according to claim 11, characterized in that the speech reconstruction module further comprises:
    a reconstruction module, configured to use the features of the to-be-processed speech data as the input of the trained speech reconstruction neural network, the trained speech reconstruction neural network outputting reconstructed speech features; and
    a generation module, configured to generate, based on the reconstructed speech features output by the reconstruction module, a time-domain speech waveform as the output speech data.
  21. A deep learning-based speech quality enhancement system, characterized in that the system comprises a storage device and a processor, the storage device storing a computer program run by the processor, wherein the computer program, when run by the processor, performs the deep learning-based speech quality enhancement method according to any one of claims 1-10.