WO2021196905A1 - Speech signal de-reverberation processing method, apparatus, computer device, and storage medium

Speech signal de-reverberation processing method, apparatus, computer device, and storage medium

Info

Publication number
WO2021196905A1
WO2021196905A1 (PCT/CN2021/076465)
Authority
WO
WIPO (PCT)
Prior art keywords
reverberation
amplitude spectrum
current frame
predictor
spectrum
Application number
PCT/CN2021/076465
Other languages
English (en)
French (fr)
Inventor
朱睿
李娟娟
王燕南
李岳鹏
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2021196905A1
Priority to US17/685,042 (published as US20220230651A1)

Classifications

    • G10L21/0208 — Speech enhancement (noise reduction or echo cancellation): noise filtering
    • G10L21/0232 — Noise filtering characterised by the method used for estimating noise: processing in the frequency domain
    • G10L21/0324 — Speech enhancement by changing the amplitude: details of processing therefor
    • G10L25/12 — Speech or voice analysis: the extracted parameters being prediction coefficients
    • G10L25/18 — Speech or voice analysis: the extracted parameters being spectral information of each sub-band
    • G10L25/21 — Speech or voice analysis: the extracted parameters being power information
    • G10L2021/02082 — Noise filtering: the noise being echo or reverberation of the speech
    • G10L25/30 — Speech or voice analysis characterised by the analysis technique: using neural networks

Definitions

  • This application relates to the field of communication technology, and in particular to a speech signal de-reverberation processing method, apparatus, computer device, and storage medium.
  • VoIP (Voice over Internet Protocol) is IP-based voice transmission.
  • In VoIP-based point-to-point calls or multi-person online conference calls, reverberation caused by the speaker's distance from the microphone or by a poor indoor acoustic environment makes voices unclear and degrades voice call quality.
  • In the related art, methods such as LPC prediction, autoregressive models, and statistical models predict the reverberation information of the current frame from historical frame information over a past period, so as to de-reverberate single-channel speech.
  • These methods are usually based on the statistical stationarity or short-term stationarity assumptions of speech reverberation components, and rely on historical frame information for reverberation estimation.
  • As a result, early reverberation, including early reflections, cannot be estimated accurately, and the estimated degree of reverberation carries a certain error, which lowers the accuracy of reverberation cancellation in the speech.
  • a voice signal de-reverberation processing method, including: obtaining an original voice signal; extracting the amplitude spectrum feature and phase spectrum feature corresponding to the current frame in the original voice signal; extracting the sub-band amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determining, by a first reverberation predictor, the reverberation intensity index corresponding to the current frame according to the sub-band amplitude spectrum; determining, by a second reverberation predictor, the pure speech sub-band spectrum corresponding to the current frame according to the sub-band amplitude spectrum and the reverberation intensity index; and performing signal conversion on the pure speech sub-band spectrum and the phase spectrum feature corresponding to the current frame to obtain a de-reverberated pure speech signal.
  • a speech signal de-reverberation processing device comprising:
  • a voice signal processing module used to obtain an original voice signal; extract the amplitude spectrum feature and phase spectrum feature corresponding to the current frame in the original voice signal;
  • the first reverberation prediction module, configured to extract the sub-band amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and to determine, by the first reverberation predictor, the reverberation intensity index corresponding to the current frame according to the sub-band amplitude spectrum;
  • the second reverberation prediction module is configured to determine the pure speech subband spectrum corresponding to the current frame according to the subband amplitude spectrum and the reverberation intensity index by the second reverberation predictor;
  • the speech signal conversion module is used for signal conversion of the pure speech subband spectrum and the phase spectrum characteristic corresponding to the current frame to obtain a dereverberated pure speech signal.
  • a computer device, including a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the above speech signal de-reverberation processing method, including: performing signal conversion on the pure speech sub-band spectrum and the phase spectrum feature corresponding to the current frame to obtain a de-reverberated pure speech signal.
  • a computer-readable storage medium, having a computer program stored thereon which, when executed by a processor, implements the steps of the above speech signal de-reverberation processing method, including: performing signal conversion on the pure speech sub-band spectrum and the phase spectrum feature corresponding to the current frame to obtain a de-reverberated pure speech signal.
  • FIG. 1 is an application environment diagram of a method for de-reverberation processing of a speech signal provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a conference interface provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an interface of a reverberation function setting page provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of an interface of a reverberation function setting page provided by another embodiment of the present application.
  • FIG. 5 is a schematic flow chart of a method for de-reverberation processing of a speech signal in an embodiment of the present application
  • Fig. 6 is a spectrogram of pure speech and speech with reverberation in an embodiment of the present application
  • Fig. 7 is a distribution diagram of reverberation intensity and a prediction distribution diagram of reverberation intensity of a speech signal in an embodiment of the present application;
  • FIG. 8 is a prediction distribution diagram of reverberation intensity using a traditional method and a prediction distribution diagram of reverberation intensity using a speech signal dereverberation processing method in an embodiment of the present application;
  • Fig. 9 is a speech time-domain waveform and spectrogram corresponding to a reverberant original speech signal in an embodiment of the present application.
  • Fig. 10 is a speech time-domain waveform and spectrogram corresponding to a pure speech signal in an embodiment of the present application
  • FIG. 11 is a schematic flowchart of a method for de-reverberation processing of a speech signal in another embodiment of the present application.
  • FIG. 12 is a schematic flowchart of steps for determining the pure speech subband spectrum of the current frame according to the subband amplitude spectrum and the reverberation intensity index by using the second reverberation predictor in an embodiment of the present application;
  • FIG. 13 is a schematic flowchart of a method for de-reverberation processing of a speech signal in another embodiment of the present application.
  • Fig. 14 is a structural block diagram of a speech signal de-reverberation processing device in an embodiment of the present application.
  • FIG. 15 is a structural block diagram of a speech signal de-reverberation processing device in an embodiment of the present application.
  • Figure 16 is an internal structure diagram of a computer device in an embodiment of the present application.
  • Fig. 17 is an internal structure diagram of a computer device in another embodiment of the present application.
  • the speech signal de-reverberation processing method provided in this application can be applied to the application environment as shown in FIG. 1.
  • the terminal 102 communicates with the server 104 through the network.
  • the terminal 102 collects the voice data recorded by the user.
  • The terminal 102 or the server 104 obtains the original voice signal and, after extracting the amplitude spectrum feature and phase spectrum feature of the current frame in the original voice signal, divides the amplitude spectrum feature of the current frame into frequency bands to extract the corresponding sub-band amplitude spectrum.
  • the first reverberation predictor performs reverberation intensity prediction on the subband amplitude spectrum based on the subband, which can accurately predict the reverberation intensity index of the current frame.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • the solutions provided by the embodiments of the present application involve technologies such as artificial intelligence speech enhancement.
  • Key speech technologies include speech separation (SS), speech enhancement (SE), and automatic speech recognition (ASR). Enabling computers to listen, see, speak, and feel is a future direction of human-computer interaction, and voice has become one of its most promising forms.
  • the voice signal de-reverberation processing method provided in the embodiments of the present application can also be applied to cloud conferences, which is an efficient, convenient, and low-cost conference form based on cloud computing technology. Users only need to perform simple and easy-to-use operations through the Internet interface to quickly and efficiently share voice, data files and videos with teams and customers all over the world.
  • Complex technologies such as data transmission and processing in a meeting are handled by the cloud conference service provider, which assists users in operating the conference.
  • the cloud conference system supports multi-server dynamic cluster deployment and provides multiple high-performance servers, which greatly improves conference stability, security, and availability.
  • Video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management; it is widely used in government, military, transportation, finance, telecom operators, education, enterprises, and other fields. The use of cloud computing in video conferencing is even more attractive in terms of convenience, speed, and ease of use, and will surely stimulate a new wave of video conferencing applications.
  • This application also provides an application scenario, which can be used in a voice call scenario, and can be specifically applied to a conference scenario.
  • the conference scenario can be a voice conference or a video conference scenario.
  • This application scenario applies the above-mentioned voice signal de-reverberation processing method.
  • the voice signal de-reverberation processing method of this scenario is applied to the user terminal, and the application of the voice signal de-reverberation processing method in this application scenario is as follows:
  • the user can initiate or participate in a voice conference through the corresponding user terminal. After the user enters the conference by using the user terminal, the conference starts.
  • Figure 2 it is a schematic diagram of a meeting interface in an embodiment. When the user terminal enters the meeting interface, the meeting starts.
  • The meeting interface includes some meeting options, as shown in Figure 2, which may include microphone, camera, screen sharing, members, settings, and exit-meeting options; these options are used to configure various functions of the meeting scene.
  • If the recipient user, listening to the other party's speech, finds that the other party's voice is muddy and heavily reverberant, the speech content will be unclear.
  • the recipient user can turn on the de-reverberation function through the setting option in the conference interface of the conference application of the user terminal.
  • the reverberation function setting interface in the conference interface is shown in Figure 3.
  • The user can click the "Settings" option (the settings option in the meeting interface shown in Figure 2) and, on the reverberation function setting page shown in Figure 3, check the "Audio Reverberation Cancellation" option to enable the audio de-reverberation function corresponding to the "Speaker".
  • the voice de-reverberation function built into the conference application is turned on, and the user terminal will perform de-reverberation processing on the received voice data.
  • The user terminal displays a communication configuration page in the conference interface; the communication configuration page includes a reverberation cancellation configuration option, which the user triggers to configure reverberation cancellation.
  • the user terminal obtains the reverberation cancellation request triggered by the reverberation cancellation configuration option, and performs de-reverberation processing on the currently acquired voice signal with reverberation based on the reverberation cancellation request.
  • The user terminal of the voice receiver receives the original voice signal sent by the sender terminal and, after preprocessing the original voice signal by framing and windowing, extracts the amplitude spectrum feature and phase spectrum feature of the current frame.
  • The user terminal then divides the amplitude spectrum feature of the current frame into frequency bands to extract the corresponding sub-band amplitude spectrum, and the first reverberation predictor predicts the reverberation intensity sub-band by sub-band, accurately yielding the reverberation intensity index of the current frame. The second reverberation predictor then combines this reverberation intensity index with the sub-band amplitude spectrum of the current frame to further predict the pure speech sub-band spectrum of the current frame, so that the pure speech amplitude spectrum of the current frame can be accurately extracted.
  • The user terminal performs signal conversion on the pure speech sub-band spectrum and phase spectrum feature to obtain the de-reverberated pure voice signal, and outputs it through the speaker device of the user terminal. Thus, when the user terminal receives voice data sent by the other party, it can remove the reverberation component from the other users' speech in the sound played through the speaker or earphone while retaining their pure speech, which effectively improves the accuracy and efficiency of voice de-reverberation and thus the conference call experience.
  • After the user enters the conference, the user may find while speaking that the environment is highly reverberant, or the other party may report that the speech is unclear.
  • The user can also configure the reverberation function through the setting options in the reverberation function setting interface, as shown in Figure 4, to enable the de-reverberation function. That is, in the reverberation function setting page shown in Figure 4, checking the "Audio Reverberation Cancellation" option turns on the audio de-reverberation function corresponding to the "Microphone". At this time, the voice de-reverberation function built into the conference application is turned on, and the user terminal corresponding to the sender de-reverberates the recorded voice data.
  • the de-reverberation process is the same as the above-mentioned process.
  • In this way, the user terminal eliminates the reverberation component in the sender's speech collected by the microphone and sends out the extracted pure speech signal, which effectively improves the accuracy and efficiency of speech de-reverberation and thus the conference call experience.
  • This application also provides another application scenario, which is applied to a voice call scenario, and specifically can still be applied to a voice conference or a video conference scenario.
  • This application scenario applies the above-mentioned voice signal de-reverberation processing method.
  • the application of the speech signal de-reverberation processing method in the application scenario is as follows:
  • In a multi-person conference, multiple user terminals communicate with the server for multi-terminal voice interaction.
  • the user terminal sends a voice signal to the server, and the server transmits the voice signal to the corresponding recipient user terminal.
  • Each user needs to receive the voice streams of all the other users; that is, in an N-person conference each user listens to the other N-1 channels of voice data, so stream-mixing control operations are required.
  • the speaker user can choose to enable de-reverberation, so that the voice signal sent by the sender user terminal has no reverberation.
  • the listener user can also enable the de-reverberation function on the corresponding receiver user terminal, so that the sound signal received by the receiver user terminal has no reverberation.
  • the server can also enable de-reverberation, so that the server performs de-reverberation processing on the passing voice data.
  • When the server or the recipient user terminal performs de-reverberation processing, it usually first mixes the multiple channels of voice data into one channel and then de-reverberates, in order to save computing resources.
  • the server may also perform de-reverberation processing on each stream before mixing, or automatically determine whether the stream has reverberation, and then determine whether to perform de-reverberation processing.
  • In one approach, the server sends all N-1 channels of data to the corresponding recipient user terminal, which mixes the received voice data into one channel, performs de-reverberation processing, and outputs the result through the speaker of the user terminal.
  • Alternatively, the server performs the mixing itself: it mixes the received N-1 channels of data into one channel, de-reverberates the mixed voice data, and then sends the de-reverberated voice data to the corresponding recipient user terminal. Specifically, after obtaining the original voice data uploaded by the sender's user terminal, the server obtains the corresponding original voice signal, preprocesses it by framing and windowing, and extracts the amplitude spectrum feature and the phase spectrum feature of the current frame.
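  • As a minimal sketch of the mix-then-de-reverberate flow described above (function names are illustrative; `dereverb` stands in for the whole pipeline of this application):

```python
import numpy as np

def mix_then_dereverb(streams, dereverb):
    """Mix the N-1 incoming voice streams into one channel first, then
    de-reverberate the mix once, so a single de-reverberation pass is
    paid instead of one pass per stream."""
    mixed = np.sum(streams, axis=0) / len(streams)  # simple average mix
    return dereverb(mixed)
```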
  • The server then divides the amplitude spectrum feature of the current frame into frequency bands to extract the corresponding sub-band amplitude spectrum, and the first reverberation predictor predicts the reverberation intensity sub-band by sub-band, accurately yielding the reverberation intensity index of the current frame. The second reverberation predictor then combines this reverberation intensity index with the sub-band amplitude spectrum of the current frame to further predict the pure speech sub-band spectrum of the current frame.
  • the server performs signal conversion on the sub-band spectrum and phase spectrum characteristics of the pure voice, so as to obtain the pure voice signal after de-reverberation.
  • The server then sends the de-reverberated pure voice signal to the corresponding recipient user terminals in the current conference, where it is output through the speaker device of each user terminal. A pure voice signal with a higher degree of reverberation cancellation is thereby obtained, which effectively improves the accuracy and efficiency of voice de-reverberation.
  • a method for processing speech signal de-reverberation is provided.
  • This embodiment mainly takes the method applied to a computer device as an example; the computer device may be the terminal 102 or the server 104 shown in FIG. 1.
  • the speech signal de-reverberation processing method includes the following steps:
  • Step S502 Obtain the original voice signal.
  • Step S504 Extract the amplitude spectrum feature and the phase spectrum feature of the current frame in the original speech signal.
  • Besides the direct sound, the microphone also receives sound waves from the sound source that arrive by other paths, as well as unwanted sound waves generated by other sound sources in the environment (i.e., background noise). In acoustics, a reflected wave with a delay of more than 50 ms is called an echo, and the combined effect of the remaining reflected waves is called reverberation.
  • The audio collection device can collect the original voice signal sent by the user through the audio channel; the original voice signal may be an audio signal with reverberation.
  • the voice signal de-reverberation processing method in this embodiment may be suitable for processing a single-channel original voice signal.
  • After the computer device obtains the original speech signal, it first preprocesses the original speech signal.
  • The preprocessing includes pre-emphasis, framing, and windowing. Specifically, the collected original speech signal is framed and windowed to obtain the preprocessed original speech signal, and the original speech signal of each frame is then processed.
  • Specifically, a triangular window or Hanning window can be used to divide the original speech signal into multiple frames with a frame length of 10-30 ms (milliseconds) and a frame shift of, for example, 10 ms, so that the original speech signal is divided into multi-frame speech signals, i.e., the voice signals corresponding to multiple voice frames.
  • Fourier transform can realize time-frequency conversion.
  • The variation of each component's amplitude with frequency is called the signal's amplitude spectrum, and the variation of each component's phase with angular frequency is called the signal's phase spectrum; both are obtained from the Fourier transform of the original speech signal.
  • After windowing and framing the original speech signal, the computer device obtains multiple speech frames. It then performs a fast Fourier transform on the windowed and framed signal to obtain the frequency spectrum of the original speech signal, from which it extracts the amplitude spectrum feature and phase spectrum feature corresponding to the current frame. It is understood that the current frame is the speech frame currently being processed by the computer device.
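  • As a minimal numpy sketch of this framing, windowing, and FFT step (frame length, hop, sampling rate, and function names are illustrative, not specified by the application):

```python
import numpy as np

def extract_frame_features(signal, sr=16000, frame_ms=20, hop_ms=10):
    """Frame and window a mono signal, then FFT each frame; returns
    per-frame amplitude and phase spectra (the two features used by
    the method)."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    window = np.hanning(frame_len)              # Hanning window, as in the text
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    amps, phases = [], []
    for t in range(n_frames):
        frame = signal[t * hop_len : t * hop_len + frame_len] * window
        spec = np.fft.rfft(frame)               # time-frequency conversion
        amps.append(np.abs(spec))               # amplitude spectrum feature
        phases.append(np.angle(spec))           # phase spectrum feature
    return np.array(amps), np.array(phases)
```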
  • Step S506 Extract the sub-band amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determine, by the first reverberation predictor, the reverberation intensity index corresponding to the current frame according to the sub-band amplitude spectrum.
  • The sub-band amplitude spectra are obtained by sub-band division of the amplitude spectrum of each speech frame; there are at least two sub-band amplitude spectra per frame.
  • the computer device may perform frequency band division on the amplitude spectrum feature, divide the amplitude spectrum of each frame of speech frame into multiple subband amplitude spectra, and obtain the subband amplitude spectrum corresponding to the amplitude spectrum feature of the current frame. The corresponding subband amplitude spectrum is calculated for each frame.
  • the first reverberation predictor may be a machine learning model.
  • a machine learning model is a model that has a certain ability after learning from samples.
  • It can be a neural network model, such as a CNN (Convolutional Neural Network) model, an RNN (Recurrent Neural Network), or an LSTM (Long Short-Term Memory network) model.
  • the first reverberation predictor may be a reverberation intensity predictor based on an LSTM neural network model.
  • the first reverberation predictor is a pre-trained neural network model with reverberation prediction ability.
  • the computer device performs frequency band division on the amplitude spectrum characteristics of the current frame to obtain multiple subband amplitude spectra. That is, the amplitude spectrum feature of each frame is divided into multiple subband amplitude spectra, and each subband amplitude spectrum includes a corresponding subband identifier.
  • the computer device further inputs the sub-band amplitude spectrum corresponding to the amplitude spectrum feature of the current frame to the first reverberation predictor.
  • the first reverberation predictor includes a multilayer neural network
  • The computer device uses the amplitude spectrum of each sub-band as the input feature of the network model; through the multilayer network structure of the first reverberation predictor and the corresponding network parameters and network weights, it analyzes the amplitude spectrum characteristics of each sub-band amplitude spectrum, predicts the pure speech energy ratio of each sub-band in the current frame, and then outputs, based on the pure speech energy ratio of each sub-band, the reverberation intensity index corresponding to the current frame.
  • step S508 the second reverberation predictor determines the pure speech subband spectrum corresponding to the current frame according to the subband amplitude spectrum and the reverberation intensity index.
  • the second reverberation predictor may be a reverberation intensity prediction algorithm model based on historical frames.
  • the reverberation intensity prediction algorithm may be a weighted recursive least square method, an autoregressive prediction model, a speech signal linear prediction algorithm, etc., which are not limited here.
  • The computer device also uses the second reverberation predictor to extract the steady-state noise amplitude spectrum and the steady-state reverberation amplitude spectrum contained in each sub-band of the current frame, computes the posterior signal-to-interference ratio from these spectra and each sub-band amplitude spectrum, computes the prior signal-to-interference ratio from the posterior signal-to-interference ratio and the reverberation intensity index output by the first reverberation predictor, and finally weights the sub-band amplitude spectrum with the prior signal-to-interference ratio, so that the estimated pure speech sub-band amplitude spectrum is obtained accurately and effectively.
  • Step S510 Perform signal conversion on the pure speech subband spectrum and phase spectrum characteristics corresponding to the current frame to obtain a dereverberated pure speech signal.
  • After the first reverberation predictor predicts the reverberation intensity index corresponding to the current frame, the computer device uses the second reverberation predictor to determine the pure speech sub-band spectrum of the current frame according to the sub-band amplitude spectrum and the reverberation intensity index, so the reverberation-free pure speech sub-band amplitude spectrum can be estimated accurately and effectively.
  • The computer device then performs an inverse constant-Q transform on the pure speech sub-band spectrum to obtain the transformed pure speech amplitude spectrum, and combines the pure speech amplitude spectrum with the phase spectrum feature to perform a time-domain transformation, thereby obtaining the de-reverberated pure speech signal.
  • By combining the neural-network-based first reverberation predictor with the historical-frame-based second reverberation predictor for reverberation estimation, the accuracy of reverberation intensity estimation is improved, which effectively improves the accuracy of reverberation cancellation for the speech signal and, in turn, the accuracy of speech recognition.
  • the original speech signal is obtained, and after the amplitude spectrum feature and phase spectrum feature of the current frame in the original speech signal are extracted, the amplitude spectrum feature of the current frame is divided into frequency bands to extract the corresponding subband amplitude spectrum .
  • the first reverberation predictor performs reverberation intensity prediction on the subband amplitude spectrum based on the subband, which can accurately predict the reverberation intensity index of the current frame.
  • the pure speech signal after de-reverberation is effectively obtained, thereby effectively improving the accuracy of the reverberation elimination of the speech signal.
  • A traditional reverberation predictor linearly superimposes the power spectra of historical frames to estimate the late-reverberation power spectrum, then subtracts that power spectrum from the current frame to obtain the de-reverberated power spectrum and, from it, the de-reverberated time-domain speech signal.
  • This method relies on the assumption of statistical stationarity or short-term stationarity of speech reverberation components, but it cannot accurately estimate early reverberation including early reflections.
  • The traditional method of directly predicting the amplitude spectrum with a neural network must cover a large dynamic range of amplitudes and is therefore hard to learn, which causes more speech damage; it also often requires a complex network structure to process the many frequency-bin features, demanding heavy computation and lowering processing efficiency.
  • a pure speech signal and a reverberated speech signal recorded in a reverberant environment are used for experimental testing.
  • The speech signal de-reverberation processing method of this embodiment is applied to the reverberant speech recorded in the reverberant environment.
  • The experimental tests compare the spectrogram of pure speech, the spectrogram of the reverberant speech recorded in a reverberant environment, and the reverberation intensity distribution. As shown in Figure 6(a), the spectrogram of pure speech has time on the horizontal axis and frequency on the vertical axis.
  • Figure 6(b) is the spectrogram of the reverberant speech recorded in a reverberant environment.
  • Comparing Figure 6(a) with Figure 6(b), the speech spectrum in 6(b) appears blurred and distorted.
  • Figure 7(a) shows the magnitude of distortion in different frequency bands at different times, that is, the intensity of reverberation interference. The brighter the color, the stronger the reverberation.
  • Figure 7(a) reflects the reverberation intensity of the reverberated speech, which is also the target predicted by the first reverberation predictor in this embodiment.
  • the first reverberation predictor based on the neural network is used to predict the reverberation intensity of the reverberated speech, and the obtained prediction result can be shown in Figure 7(b). It can be seen from Fig. 7(b) that the real reverberation intensity distribution in Fig. 7(a) is predicted more accurately by using the first reverberation predictor.
  • Fig. 8 compares a traditional prediction method with the method of this embodiment; the result obtained with the solution of this embodiment, shown in Fig. 8(b), is closer to the real reverberation intensity distribution, and the reverberation prediction accuracy for the reverberant speech signal is significantly improved.
  • Fig. 9 shows the speech time-domain waveform and spectrogram corresponding to a reverberant original speech signal in an embodiment; the spectral lines are blurred, and the overall intelligibility and clarity of the speech signal are low.
  • the speech time-domain waveform and spectrogram corresponding to the pure speech signal obtained are shown in FIG. 10.
  • the first reverberation predictor performs reverberation intensity prediction on the subband amplitude spectrum of the current frame based on the subband to obtain the reverberation intensity index of the current frame.
  • The second reverberation predictor is then used to further predict the pure speech sub-band spectrum of the current frame according to the obtained reverberation intensity index and the sub-band amplitude spectrum, so that the pure speech signal is extracted accurately, effectively improving the accuracy of reverberation cancellation for the speech signal.
  • In an embodiment, determining the reverberation intensity index corresponding to the current frame according to the sub-band amplitude spectrum by the first reverberation predictor includes: predicting, by the first reverberation predictor, the pure speech energy ratio corresponding to each sub-band amplitude spectrum; and determining the reverberation intensity index corresponding to the current frame according to the pure speech energy ratios.
  • the first reverberation predictor is a neural network model-based reverberation predictor that is trained in advance using a large amount of reverberated speech data and pure speech data.
  • the first reverberation predictor includes a multi-layer network structure, and each layer of the network includes corresponding network parameters and network weights to predict the proportion of pure voice in each subband in the original voice signal with reverberation.
  • After the computer device extracts the sub-band amplitude spectrum corresponding to the amplitude spectrum of the current frame, it inputs the sub-band amplitude spectrum of the current frame to the first reverberation predictor for spectrum analysis.
  • the first reverberation predictor takes the energy ratio of the reverberated original speech and the pure speech in each sub-band amplitude spectrum as the prediction target.
  • Through the network parameters and network weights of each network layer, the first reverberation predictor analyzes each sub-band amplitude spectrum to obtain its pure speech energy ratio, and then predicts the reverberation intensity distribution of the current frame from the pure speech energy ratios of all sub-band amplitude spectra, thereby obtaining the reverberation intensity index corresponding to the current frame.
  • the reverberation intensity index of the current frame can be accurately estimated.
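  • The application does not write out the pure speech energy ratio explicitly; one natural formalization, assuming $S_k(t)$ and $X_k(t)$ denote the clean and observed (reverberant) amplitudes of sub-band $k$ in frame $t$, is

$$\rho_k(t) = \frac{|S_k(t)|^2}{|X_k(t)|^2} \in [0, 1],$$

  • so a ratio near 1 means the sub-band is dominated by pure speech and a ratio near 0 means it is dominated by reverberation; the reverberation intensity index of the frame is then derived from the vector $(\rho_1(t), \ldots, \rho_K(t))$.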
  • a method for processing speech signal de-reverberation which includes the following steps:
  • Step S1102 Obtain the original voice signal; extract the amplitude spectrum feature and the phase spectrum feature of the current frame in the original voice signal.
  • Step S1104 extract the sub-band amplitude spectrum from the amplitude spectrum characteristics corresponding to the current frame; extract the dimensional characteristics of the sub-band amplitude spectrum through the input layer of the first reverberation predictor.
  • step S1106 the prediction layer of the first reverberation predictor extracts the characterization information of the sub-band amplitude spectrum according to the dimensional characteristics, and determines the pure speech energy ratio of the sub-band amplitude spectrum according to the characterization information.
  • step S1108 the output layer of the first reverberation predictor outputs the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio corresponding to the subband amplitude spectrum.
  • step S1110 the second reverberation predictor determines the pure speech subband spectrum corresponding to the current frame according to the subband amplitude spectrum and the reverberation intensity index.
  • Step S1112 Perform signal conversion on the pure speech subband spectrum and phase spectrum characteristics corresponding to the current frame to obtain a dereverberated pure speech signal.
  • the first reverberation predictor is a neural network model based on the LSTM long short-term memory network
  • the first reverberation predictor includes an input layer, a prediction layer, and an output layer.
  • the input layer and the output layer can be fully connected layers, the input layer is used to extract the feature dimensions of the model input data, and the output layer is used to normalize the mean value and range of values and output results.
  • the prediction layer may be a network layer with an LSTM structure, where the prediction layer includes at least two network layers with an LSTM structure.
  • The network structure of the prediction layer includes input gates, output gates, forget gates, and cell state units, which markedly improves the LSTM's temporal modeling ability: it can memorize more information and effectively capture long-term dependencies in the data, so that the characterization information of the input features is extracted accurately and effectively.
  • When the computer device uses the first reverberation predictor to predict the reverberation intensity index of the current frame, it inputs the amplitude spectrum of each sub-band of the current frame to the predictor, whose input layer first extracts the dimensional features of each sub-band amplitude spectrum.
  • the computer device can use the sub-band amplitude spectrum extracted from the constant Q frequency band as the input feature of the network.
  • The number of constant-Q bands, denoted K, is the input feature dimension of the first reverberation predictor; with K = 8, for example, the output is also an 8-dimensional feature, i.e., the predicted reverberation intensity on the 8 constant-Q bands.
  • the network structure of each layer of the first reverberation predictor may adopt a network layer of 1024 nodes.
  • the prediction layer is a two-layer 1024-node LSTM network.
  • As shown in FIG. 7, the network layer structure of the first reverberation predictor uses a two-layer 1024-node LSTM network.
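  • A minimal PyTorch sketch of this network shape (a fully connected input layer, a two-layer 1024-node LSTM prediction layer, and a fully connected output layer over K constant-Q bands; the layer sizes follow the text, everything else is illustrative):

```python
import torch
import torch.nn as nn

class ReverbIntensityPredictor(nn.Module):
    """Sketch of the first reverberation predictor: input layer (FC),
    two-layer 1024-node LSTM prediction layer, output layer (FC)."""
    def __init__(self, n_bands=8, hidden=1024):
        super().__init__()
        self.input_layer = nn.Linear(n_bands, hidden)     # feature-dimension extraction
        self.prediction_layer = nn.LSTM(hidden, hidden,
                                        num_layers=2, batch_first=True)
        self.output_layer = nn.Linear(hidden, n_bands)    # back to K bands

    def forward(self, subband_amps):
        # subband_amps: (batch, frames, K) sub-band amplitude spectra
        x = torch.relu(self.input_layer(subband_amps))
        x, _ = self.prediction_layer(x)                   # temporal modeling (LSTM)
        # sigmoid keeps each band's pure speech energy ratio in [0, 1]
        return torch.sigmoid(self.output_layer(x))
```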
  • the prediction layer is a network layer based on LSTM, and the LSTM network includes three gates, namely a forget gate, an input gate, and an output gate.
  • The forget gate determines how much of the previous state should be discarded; for example, it can output a value between 0 and 1 representing the portion of information to retain, taking the hidden-layer output of the previous time step as part of its input.
  • the input gate is used to decide which information should be kept in the cell state unit, and the parameters of the input gate can be obtained through training.
  • The forget gate computes how much information in the old cell state unit is discarded, and the input gate then adds the new result to the cell state, indicating how much of the newly input information is written into the cell state.
  • the output is calculated based on the cell state.
  • The input data passes through a sigmoid activation function to give the value of the output gate; the information in the cell state unit is then processed and combined with the output gate value to obtain the output of the cell state unit.
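  • For reference, the gate behavior just described corresponds to the standard LSTM equations (the textbook formulation, not formulas given in this application): with input $x_t$, previous hidden state $h_{t-1}$, and previous cell state $c_{t-1}$,

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$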
  • The prediction layer of the first reverberation predictor is used to extract the characterization information of each sub-band amplitude spectrum according to the dimensional features.
  • each network layer structure in the prediction layer extracts the characterization information of each subband amplitude spectrum through corresponding network parameters and network weights, and the characterization information may also include multi-level characterization information.
  • Each network layer extracts the corresponding characterization information of the sub-band amplitude spectra; after extraction through the multiple network layers, deep characterization information of each sub-band amplitude spectrum is obtained, so that the extracted characterization information can be used accurately for predictive analysis.
  • the computer device then outputs the pure speech energy ratio of each subband amplitude spectrum through the prediction layer according to the characterization information, and outputs the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio corresponding to each subband through the output layer.
  • the computer device further uses the second reverberation predictor to determine the pure speech subband spectrum of the current frame according to the subband amplitude spectrum and the reverberation intensity index. By performing signal conversion on the sub-band spectrum and phase spectrum characteristics of the pure speech, the de-reverberated pure speech signal is obtained.
  • In this embodiment, the pre-trained neural-network-based first reverberation predictor analyzes the amplitude spectrum of each sub-band accurately through the network parameters and network weights of each network layer, and the resulting pure speech energy ratios allow the reverberation intensity index of each speech frame to be estimated accurately and effectively.
  • In an embodiment, determining the pure speech sub-band spectrum corresponding to the current frame by the second reverberation predictor according to the sub-band amplitude spectrum and the reverberation intensity index includes: determining, by the second reverberation predictor, the posterior signal-to-interference ratio of the current frame according to the amplitude spectrum characteristics of the current frame; determining the prior signal-to-interference ratio of the current frame according to the posterior signal-to-interference ratio and the reverberation intensity index; and filtering and enhancing the sub-band amplitude spectrum of the current frame based on the prior signal-to-interference ratio to obtain the pure speech sub-band amplitude spectrum corresponding to each speech frame.
  • The signal-to-interference ratio refers to the ratio of signal energy to the sum of interference energy (such as co-channel interference, multipath, etc.) and additive noise energy.
  • the a priori signal-to-interference ratio refers to the signal-to-interference ratio obtained based on past experience and analysis
  • the posterior signal-to-interference ratio refers to the estimation of the signal-to-interference ratio that is closer to the actual situation obtained after correcting the original prior information based on new information.
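  • The application does not give explicit estimators; a common decision-directed formalization (in the style of Ephraim and Malah, stated here only as an assumed sketch) is, for sub-band $k$ and frame $t$, with $X_k$, $N_k$, $R_k$ the observed, steady-state noise, and steady-state reverberation amplitudes and $\hat{S}_k$ the estimated clean amplitude:

$$
\gamma_k(t) = \frac{X_k(t)^2}{N_k(t)^2 + R_k(t)^2}, \qquad
\xi_k(t) = \alpha\,\frac{\hat{S}_k(t-1)^2}{N_k(t)^2 + R_k(t)^2} + (1-\alpha)\max\{\gamma_k(t) - 1,\ 0\},
$$

  • where $\gamma_k$ is the posterior and $\xi_k$ the prior signal-to-interference ratio, $\alpha$ is a smoothing factor, and the reverberation intensity index from the first predictor can be folded in to scale $\xi_k$, dynamically adjusting the amount of de-reverberation.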
  • When predicting the reverberation of the sub-band amplitude spectrum, the computer device also uses the second reverberation predictor to estimate the stationary noise of each sub-band amplitude spectrum, and computes the posterior signal-to-interference ratio of the current frame according to the estimation result.
  • the second reverberation predictor further calculates the a priori signal-to-interference ratio of the current frame according to the posterior signal-to-interference ratio of the current frame and the reverberation intensity index predicted by the first reverberation predictor.
  • The sub-band amplitude spectrum of the current frame is then weighted and enhanced using the prior signal-to-interference ratio, so that the predicted pure speech sub-band spectrum of the current frame is obtained.
  • In this embodiment, the first reverberation predictor accurately predicts the reverberation intensity index of the current frame, and this index is used to dynamically adjust the amount of de-reverberation, so that the prior signal-to-interference ratio of the current frame is computed accurately and the pure speech sub-band spectrum is estimated accurately.
  • the step of determining the pure speech subband spectrum corresponding to the current frame by the second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index specifically includes the following content:
  • step S1202 the steady-state noise amplitude spectrum corresponding to each subband in the current frame is extracted by the second reverberation predictor.
  • step S1204 the steady-state reverberation amplitude spectrum corresponding to each subband in the current frame is extracted by the second reverberation predictor.
  • Step S1206 Determine the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the sub-band amplitude spectrum.
  • Step S1208 Determine the prior signal-to-interference ratio of the current frame according to the a posteriori signal-to-interference ratio and the reverberation strength index.
  • Step S1210 Perform filtering and enhancement processing on the sub-band amplitude spectrum of the current frame based on the prior signal-to-interference ratio to obtain the pure voice sub-band amplitude spectrum corresponding to the current frame.
  • steady-state noise refers to continuous noise whose noise intensity fluctuates within 5dB, or impulse noise whose repetition frequency is greater than 10Hz.
  • the steady-state noise amplitude spectrum indicates the amplitude spectrum of the noise amplitude distribution of the sub-band
  • the steady-state reverberation amplitude spectrum indicates the amplitude spectrum of the reverberation amplitude distribution of the sub-band.
  • When processing the sub-band amplitude spectrum of the current frame, the second reverberation predictor extracts the steady-state noise amplitude spectrum and the steady-state reverberation amplitude spectrum corresponding to each sub-band in the current frame. It then uses the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the sub-band amplitude spectrum of each sub-band to compute the posterior signal-to-interference ratio of the current frame, and further uses the posterior signal-to-interference ratio and the reverberation intensity index to compute the prior signal-to-interference ratio of the current frame.
  • the prior signal-to-interference ratio is used to filter and enhance the sub-band amplitude spectrum of the current frame.
  • Specifically, the prior signal-to-interference ratio can be used to weight the sub-band amplitude spectrum of the current frame to obtain the pure speech sub-band amplitude spectrum of the current frame.
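  • A minimal per-frame sketch of this second predictor (the decision-directed prior and Wiener-style gain are one plausible realization consistent with the formulas sketched above, not the application's exact estimators; all names are illustrative):

```python
import numpy as np

def second_predictor_frame(subband_amp, noise_amp, reverb_amp,
                           prev_clean_amp, reverb_index, alpha=0.98):
    """Estimate the pure speech sub-band amplitudes for one frame.
    All arrays have shape (K,) for K sub-bands; reverb_index is the
    first predictor's output in [0, 1] (higher = more clean speech)."""
    interference = noise_amp**2 + reverb_amp**2 + 1e-12
    # Posterior SIR: observed sub-band energy vs. interference energy.
    post_sir = subband_amp**2 / interference
    # Decision-directed prior SIR, reusing the previous frame's clean
    # amplitudes (the historical-frame information the text relies on).
    prior_sir = (alpha * prev_clean_amp**2 / interference
                 + (1.0 - alpha) * np.maximum(post_sir - 1.0, 0.0))
    # The reverberation intensity index dynamically scales the amount
    # of de-reverberation (an assumed way of folding it in).
    prior_sir = prior_sir * reverb_index
    # Wiener-style gain weights (filters/enhances) the sub-band spectrum.
    gain = prior_sir / (1.0 + prior_sir)
    return gain * subband_amp    # estimated pure speech sub-band amplitudes
```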
  • After the computer device divides the amplitude spectrum feature of the current frame into frequency bands and extracts the corresponding sub-band amplitude spectrum, the first reverberation predictor predicts the reverberation intensity index corresponding to the current frame; the second reverberation predictor may analyze and process the sub-band amplitude spectrum of the current frame at the same time, and the processing order of the first and second reverberation predictors is not limited here.
  • After the first reverberation predictor outputs the reverberation intensity index of the current frame and the second reverberation predictor computes the posterior signal-to-interference ratio of the current frame, the second reverberation predictor uses the posterior signal-to-interference ratio and the reverberation intensity index to compute the prior signal-to-interference ratio of the current frame, and uses the prior signal-to-interference ratio to filter and enhance the sub-band amplitude spectrum of the current frame, so as to accurately estimate the pure speech sub-band amplitude spectrum of the current frame.
  • the method further includes: obtaining the pure speech amplitude spectrum of the previous frame; based on the pure speech amplitude spectrum of the previous frame, using the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum to determine the current The posterior signal-to-interference ratio of the frame.
  • the second reverberation predictor is a reverberation intensity prediction algorithm model based on historical frame analysis.
  • If the current frame is frame p, the historical frames may be frame (p-1), frame (p-2), and so on.
  • the historical frame in this embodiment is the previous frame of the current frame
  • the current frame is a frame that the computer device currently needs to process.
  • When the computer device processed the speech frame preceding the current frame of the original speech signal, it obtained that frame's pure speech amplitude spectrum, which can now be reused directly.
  • When further processing the speech signal of the current frame, the computer device uses the first reverberation predictor to obtain the reverberation intensity index of the current frame and the second reverberation predictor to predict the pure speech sub-band spectrum of the current frame; the second reverberation predictor uses the pure speech amplitude spectrum of the previous frame, combined with the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the sub-band amplitude spectrum of the current frame, to compute the posterior signal-to-interference ratio of the current frame.
  • Because the second reverberation predictor analyzes the posterior signal-to-interference ratio of the current frame based on the historical frame and combines it with the reverberation intensity index of the current frame predicted by the first reverberation predictor, a more accurate posterior signal-to-interference ratio is computed, and the resulting value can be used to further accurately estimate the pure speech sub-band amplitude spectrum of the current frame.
  • In an embodiment, the method further includes: performing framing and windowing on the original speech signal to obtain the amplitude spectrum feature and phase spectrum feature corresponding to the current frame in the original speech signal; and obtaining a preset frequency band coefficient and dividing the amplitude spectrum feature of the current frame into frequency bands according to the frequency band coefficient to obtain the sub-band amplitude spectrum corresponding to the current frame.
  • the frequency band coefficient is used to divide each frame into a corresponding number of sub-bands according to the value of the frequency band coefficient
  • the frequency band coefficient may be a constant coefficient.
  • For example, a constant-Q band division (one in which the Q value is held constant) can be used to divide the amplitude spectrum feature of the current frame: the ratio of center frequency to bandwidth is the constant Q, and this constant Q value is the band coefficient.
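  Stated as a formula (restating the definition just given, with f_c^(q) the center frequency and B^(q) the bandwidth of subband q, and K the number of bands):

```latex
Q = \frac{f_c^{(q)}}{B^{(q)}} = \text{constant}, \qquad q = 1, \dots, K
```

  With Q fixed, band centers are spaced geometrically, so low-frequency bands are narrow and high-frequency bands are wide.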
  • After obtaining the original speech signal, the computer device performs windowing and framing on it and applies a fast Fourier transform to the windowed, framed signal, thereby obtaining the frequency spectrum of the original speech signal. The computer device then processes the spectrum of each frame of the original speech signal in turn: it first extracts the amplitude spectrum feature and phase spectrum feature of the current frame from the spectrum, and then performs constant-Q band division on the amplitude spectrum feature of the current frame to obtain the corresponding subband amplitude spectrum.
  • One subband corresponds to one sub-frequency-band, and a subband may cover a series of frequency bins; for example, subband 1 corresponds to 0-100 Hz, subband 2 to 100-300 Hz, and so on. The amplitude spectrum feature of a given subband is a weighted sum over the frequency bins it contains. Dividing each frame's amplitude spectrum into bands effectively reduces the feature dimensionality, and the constant-Q division matches the physiological hearing characteristic of the human ear, whose resolution is higher at low frequencies than at high frequencies. This effectively improves the precision of the amplitude spectrum analysis, so that reverberation prediction on the speech signal can be performed more accurately.
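  As an illustration of this band division, the following sketch windows one frame, takes its FFT, and groups the magnitude bins into constant-Q-style subbands. The band edges, the minimum frequency, the triangular weighting window, and the helper names are illustrative assumptions, not values given in the source:

```python
import numpy as np

def frame_features(frame, n_fft=512):
    """Window one speech frame and take its FFT -> amplitude and phase spectra."""
    win = np.hanning(len(frame))
    spec = np.fft.rfft(frame * win, n=n_fft)
    return np.abs(spec), np.angle(spec)                 # X(p, m), theta(p, m)

def constant_q_bands(sample_rate=16000, n_fft=512, n_bands=8, f_min=100.0):
    """Geometrically spaced band edges keep center-frequency/bandwidth roughly
    constant, i.e. a constant-Q style division (edges are illustrative)."""
    edges = np.geomspace(f_min, sample_rate / 2, n_bands + 1)
    bins = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return [(bins >= lo) & (bins < hi) for lo, hi in zip(edges[:-1], edges[1:])]

def subband_amplitudes(mag, bands):
    """Y(p, q): weighted sum of the bin magnitudes in each subband, using a
    triangular window as one possible weighting w_q."""
    out = []
    for mask in bands:
        n = int(mask.sum())
        w = np.bartlett(n) if n > 1 else np.ones(n)
        out.append(float(np.sum(w * mag[mask])))
    return np.asarray(out)                              # Y(p, q)
```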
  • In an embodiment, performing signal conversion on the pure speech subband spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal includes: performing an inverse constant-Q transform on the pure speech subband spectrum according to the band coefficient to obtain the pure speech amplitude spectrum corresponding to the current frame; and performing time-frequency conversion on the pure speech amplitude spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.
  • The computer device divides the amplitude spectrum of each frame into multiple subband amplitude spectra and uses the first reverberation predictor to perform reverberation prediction on each subband amplitude spectrum, obtaining the reverberation intensity index of the current frame. After calculating the pure speech subband spectrum of the current frame with the second reverberation predictor from the subband amplitude spectrum and the reverberation intensity index, the computer device performs an inverse constant-Q transform on the pure speech subband spectrum, converting the constant-Q subband spectrum, whose frequencies are unevenly spaced, back into an STFT amplitude spectrum with uniform frequency spacing, and thereby obtains the pure speech amplitude spectrum corresponding to the current frame. The computer device then combines the obtained pure speech amplitude spectrum with the phase spectrum corresponding to the current frame of the original speech signal and performs an inverse Fourier transform, realizing the time-frequency conversion of the speech signal. The converted signal is the dereverberated pure speech signal; in this way the pure speech signal can be extracted accurately, which effectively improves the accuracy of reverberation cancellation of the speech signal.
  • In an embodiment, the first reverberation predictor is trained through the following steps: acquiring reverberant speech data and pure speech data, and using them to generate training sample data; determining the energy ratio of reverberation to pure speech as the training target; extracting the reverberant-band amplitude spectrum corresponding to the reverberant speech data and the pure-speech-band amplitude spectrum of the pure speech data; and training the first reverberation predictor using the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target.
  • Before the computer device processes the original speech signal, it also needs to train the first reverberation predictor in advance; the first reverberation predictor is a neural network model.
  • Pure speech data refers to clean speech without reverberation noise, while reverberant speech data refers to speech containing reverberation noise, for example speech data recorded in a reverberant environment.
  • The computer device obtains reverberant speech data and pure speech data and uses them to generate training sample data, which is used to train a preset neural network. The training sample data may specifically be pairs of reverberant speech data and the corresponding pure speech data. The energy ratio of reverberation to pure speech, computed from the reverberant and pure speech data, serves as the training label, that is, the training target of model training. The training label is used to adjust the parameters after each training pass, so as to further train and optimize the neural network model.
  • After generating the training sample data, the computer device inputs it into the preset neural network model, which performs feature extraction and reverberation intensity prediction on the reverberant speech data to obtain the corresponding reverberation-to-pure-speech energy ratio. Specifically, the computer device takes the energy ratio of reverberation to pure speech, derived from the reverberant and pure speech data, as the prediction target, and trains the neural network model on the reverberant speech data through a preset function. The preset neural network model is trained over multiple iterations using the reverberant speech data and the training target, and a training result is obtained each time. The computer device then uses the training target to adjust the parameters of the preset neural network model according to the training result, and continues iterative training until the training condition is met, yielding the trained first reverberation predictor.
  • In an embodiment, training the first reverberation predictor using the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target includes: inputting the reverberant-band amplitude spectrum and the pure-speech-band amplitude spectrum into the preset network model to obtain a training result; and, based on the difference between the training result and the training target, adjusting the parameters of the preset neural network model and continuing training until the training condition is met, at which point training ends and the required first reverberation predictor is obtained.
  • The training condition is the condition under which model training is considered complete. It may be reaching a preset number of iterations, or it may be that the performance index of the predictor after parameter adjustment reaches a preset index.
  • Each time the computer device trains the preset neural network model on the reverberant speech data and obtains the corresponding training result, it compares the training result with the training target to obtain their difference. The computer device then adjusts the parameters of the preset neural network model with the aim of reducing this difference and continues training. If the training result of the adjusted model does not meet the training condition, the training label continues to be used to adjust the parameters and training continues; training ends when the training condition is met, and the required prediction model is obtained.
  • The difference between the training result and the training target can be measured by a cost function; a function such as a cross-entropy loss or mean square error may be chosen as the cost function. Training can be ended when the value of the cost function falls below a preset value, which improves the accuracy of predicting the reverberation in the reverberant speech data.
  • For example, the preset neural network model is based on an LSTM model; the minimum-mean-square-error criterion is selected to update the network weights, and once the loss has stabilized, the parameters of each LSTM layer are fixed. The training target is constrained to the [0, 1] range through a sigmoid activation function, so that when presented with new reverberant speech data, the network can predict the proportion of pure speech in each band of that speech.
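  A minimal PyTorch sketch of such a predictor is shown below, assuming the two 1024-node LSTM layers, sigmoid output, and minimum-mean-square-error criterion described in this document; the input dimension of 8 bands matches the constant-Q example given earlier, while the optimizer and learning rate are assumptions:

```python
import torch
import torch.nn as nn

class ReverbIntensityPredictor(nn.Module):
    """Input layer -> two LSTM layers -> output layer with sigmoid, so each of
    the K subband outputs lies in [0, 1] (predicted per-band proportion)."""
    def __init__(self, n_bands=8, hidden=1024):
        super().__init__()
        self.inp = nn.Linear(n_bands, hidden)      # input (feature-dimension) layer
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_bands)      # output layer

    def forward(self, y):                          # y: (batch, frames, n_bands)
        h, _ = self.lstm(torch.relu(self.inp(y)))
        return torch.sigmoid(self.out(h))          # eta(p, q) in [0, 1]

model = ReverbIntensityPredictor()
criterion = nn.MSELoss()                           # minimum-mean-square-error criterion
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```

  Training would stop once the loss stabilizes or drops below a preset value, matching the training condition described above.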
  • When the prediction model is trained, guiding the neural network model and optimizing its parameters through the training labels can effectively improve the accuracy of predicting the reverberation in reverberant speech data, thereby improving the prediction accuracy of the first reverberation predictor and, in turn, the accuracy of reverberation cancellation of the speech signal.
  • In a specific embodiment, the speech signal dereverberation method includes the following steps:
  • Step S1302 Obtain the original voice signal; extract the amplitude spectrum feature and phase spectrum feature of the current frame in the original voice signal.
  • Step S1304 Obtain preset frequency band coefficients, and perform frequency band division on the amplitude spectrum characteristics of the current frame according to the frequency band coefficients to obtain the subband amplitude spectrum corresponding to the current frame.
  • Step S1306 Extract the dimensional features of the subband amplitude spectrum through the input layer of the first reverberation predictor.
  • Step S1308 Through the prediction layer of the first reverberation predictor, extract the characterization information of the subband amplitude spectrum according to the dimensional features, and determine the pure speech energy ratio of the subband amplitude spectrum according to the characterization information.
  • Step S1310 Through the output layer of the first reverberation predictor, output the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio of the subband amplitude spectrum.
  • Step S1312 Extract the steady-state noise amplitude spectrum and the steady-state reverberation amplitude spectrum corresponding to each subband of the current frame through the second reverberation predictor.
  • Step S1314 Based on the pure speech amplitude spectrum of the previous frame, the posterior signal-to-interference ratio of the current frame is determined according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.
  • Step S1316 Determine the a priori signal-to-interference ratio of the current frame according to the a posteriori signal-to-interference ratio of the current frame and the reverberation strength index.
  • Step S1318 Perform filtering and enhancement processing on the sub-band amplitude spectrum of the current frame according to the prior signal-to-interference ratio to obtain the pure voice sub-band amplitude spectrum of the current frame.
  • Step S1320 Perform an inverse constant-Q transform on the pure speech subband spectrum according to the band coefficient to obtain the pure speech amplitude spectrum corresponding to the current frame.
  • Step S1322 Perform time-frequency conversion on the pure speech amplitude spectrum and phase spectrum characteristics corresponding to the current frame to obtain a dereverberated pure speech signal.
  • Specifically, the original speech signal can be expressed as x(n). After preprocessing the collected original speech signal by framing and windowing, the computer device extracts the amplitude spectrum feature X(p, m) and the phase spectrum feature θ(p, m) corresponding to the current frame p, where m is the frequency bin index and p is the current frame index.
  • The computer device then performs constant-Q band division on the amplitude spectrum feature X(p, m) of the current frame to obtain the subband amplitude spectrum Y(p, q). In the original filing the calculation is given as an equation image, not reproduced in this text; q is the constant-Q band index, that is, the subband index, and w_q is the weighting window of the q-th subband, for which, for example, a triangular window or a Hanning window may be used.
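  From the definitions just given (a weighted sum of the bin amplitudes X(p, m) over the set of bins Ω_q belonging to subband q), the missing equation plausibly has the following form; this reconstruction is an assumption, since the original formula appears only as an image:

```latex
Y(p, q) = \sum_{m \in \Omega_q} w_q(m)\, X(p, m)
```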
  • The computer device inputs the extracted subband amplitude spectrum Y(p, q) of each subband q of the current frame into the first reverberation intensity predictor, which analyzes and processes the subband amplitude spectrum of the current frame and outputs the reverberation intensity index η(p, q) of the current frame.
  • The computer device further uses the second reverberation intensity predictor to estimate the steady-state noise amplitude spectrum λ(p, q) and the steady-state reverberation amplitude spectrum l(p, q) contained in each subband, and computes the posterior signal-to-interference ratio γ(p, q) from λ(p, q), l(p, q), and the subband amplitude spectrum Y(p, q); this calculation is given as an equation image in the original filing.
  • The computer device then uses the posterior signal-to-interference ratio γ(p, q) and the reverberation intensity index η(p, q) output by the first reverberation intensity predictor to compute the prior signal-to-interference ratio ξ(p, q); this calculation is likewise given as an equation image in the original filing. The main function of η(p, q) is to dynamically adjust the amount of dereverberation: the larger the estimated η(p, q), the heavier the reverberation of subband q at time p and the larger the amount of dereverberation applied; conversely, the smaller the estimated η(p, q), the lighter the reverberation, the smaller the amount of dereverberation, and the smaller the damage to sound quality. G(p, q) is the predictive gain function, used to measure the proportion of pure speech energy in the reverberant speech.
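  The exact expressions for γ(p, q), ξ(p, q), and G(p, q) are images in the original filing. One standard instantiation consistent with the surrounding definitions (a posterior ratio against the noise-plus-reverberation floor, a decision-directed prior, and a Wiener-style gain) is sketched below; treat it as an assumption, not the patent's formula:

```latex
\gamma(p,q) = \frac{Y^{2}(p,q)}{\bigl(\lambda(p,q)+l(p,q)\bigr)^{2}}, \qquad
\xi(p,q) = \alpha\,\frac{S^{2}(p-1,q)}{\bigl(\lambda(p,q)+l(p,q)\bigr)^{2}}
  + (1-\alpha)\bigl(1-\eta(p,q)\bigr)\max\bigl\{\gamma(p,q)-1,\,0\bigr\}, \qquad
G(p,q) = \frac{\xi(p,q)}{1+\xi(p,q)}
```

  Here α is a smoothing constant, S(p-1, q) is the previous frame's pure speech subband spectrum, and the (1-η) factor realizes the behavior described above: a larger predicted η lowers ξ and hence the gain, that is, more dereverberation.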
  • The computer device then weights the input subband amplitude spectrum Y(p, q) by the prior signal-to-interference ratio ξ(p, q) to obtain the estimated pure speech subband amplitude spectrum S(p, q). An inverse constant-Q transform (given as an equation image in the original filing) is applied to the reverberation-free pure speech subband amplitude spectrum S(p, q), yielding Z(p, m), the pure speech amplitude spectrum feature. The computer device then combines Z(p, m) with the phase spectrum feature θ(p, m) of the current frame and performs an inverse STFT, realizing the conversion from the frequency domain to the time domain and obtaining the dereverberated time-domain speech signal S(n).
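  Putting these last steps into code, the sketch below applies a per-band gain, spreads each band gain back over its STFT bins (the inverse constant-Q step), and inverts the transform with the original phase. The gain formula is carried over from the assumed sketch above, and the helper names are illustrative, not taken from the patent:

```python
import numpy as np

def dereverb_subbands(Y, eta, lam, ell, prev_clean, alpha=0.98):
    """Assumed decision-directed estimate of S(p, q) = G(p, q) * Y(p, q)."""
    floor = np.maximum((lam + ell) ** 2, 1e-12)        # noise + reverberation floor
    gamma = Y ** 2 / floor                             # posterior SIR
    xi = alpha * prev_clean ** 2 / floor \
         + (1.0 - alpha) * (1.0 - eta) * np.maximum(gamma - 1.0, 0.0)
    G = xi / (1.0 + xi)                                # predictive gain G(p, q)
    return G * Y                                       # S(p, q)

def inverse_constant_q(S, Y, mag, bands):
    """Spread each per-band gain back over that band's STFT bins: an inverse
    constant-Q step returning the full-resolution amplitude spectrum Z(p, m)."""
    Z = mag.copy()
    for q, mask in enumerate(bands):
        Z[mask] *= S[q] / max(float(Y[q]), 1e-12)
    return Z

def frame_to_time(Z, phase, n_fft=512):
    """Recombine with the original phase and invert (per-frame inverse STFT)."""
    return np.fft.irfft(Z * np.exp(1j * phase), n=n_fft)
```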
  • In this embodiment, the first reverberation predictor performs reverberation intensity prediction on the subband amplitude spectra, which makes it possible to predict the reverberation intensity index of the current frame accurately. The second reverberation predictor then uses this reverberation intensity index, together with the subband amplitude spectrum of the current frame, to further predict the pure speech subband spectrum of the current frame, so that the pure speech amplitude spectrum of the current frame is extracted accurately and the accuracy of reverberation cancellation of the speech signal is effectively improved.
  • In an embodiment, as shown in FIG. 14, a speech signal dereverberation processing apparatus 1400 is provided. The apparatus may adopt software modules or hardware modules, or a combination of the two, as part of a computer device. The apparatus specifically includes: a speech signal processing module 1402, a first reverberation prediction module 1404, a second reverberation prediction module 1406, and a speech signal conversion module 1408, where:
  • the speech signal processing module 1402 is used to obtain the original speech signal; extract the amplitude spectrum characteristics and phase spectrum characteristics of the current frame in the original speech signal;
  • the first reverberation prediction module 1404 is configured to extract the sub-band amplitude spectrum from the amplitude spectrum characteristics corresponding to the current frame, and determine the reverberation intensity index corresponding to the current frame according to the sub-band amplitude spectrum through the first reverberation predictor.
  • the second reverberation prediction module 1406 is configured to determine the pure voice subband spectrum corresponding to the current frame according to the subband amplitude spectrum and the reverberation intensity index by the second reverberation predictor.
  • the speech signal conversion module 1408 is used to perform signal conversion on the pure speech subband spectrum and phase spectrum characteristics corresponding to the current frame to obtain a dereverberated pure speech signal.
  • the first reverberation prediction module 1404 is further configured to predict the pure speech energy ratio corresponding to the subband amplitude spectrum through the first reverberation predictor; and determine the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio.
  • In an embodiment, the first reverberation prediction module 1404 is further configured to extract the dimensional features of the subband amplitude spectrum through the input layer of the first reverberation predictor; extract the characterization information of the subband amplitude spectrum according to the dimensional features through the prediction layer of the first reverberation predictor, and determine the pure speech energy ratio of the subband amplitude spectrum according to the characterization information; and output the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio of the subband amplitude spectrum through the output layer of the first reverberation predictor.
  • In an embodiment, the second reverberation prediction module 1406 is further configured to determine the posterior signal-to-interference ratio of the current frame through the second reverberation predictor according to the amplitude spectrum feature of the current frame; determine the prior signal-to-interference ratio of the current frame from the posterior signal-to-interference ratio and the reverberation intensity index; and filter and enhance the subband amplitude spectrum of the current frame based on the prior signal-to-interference ratio to obtain the pure speech subband amplitude spectrum corresponding to the current frame.
  • In an embodiment, the second reverberation prediction module 1406 is further configured to extract the steady-state noise amplitude spectrum corresponding to each subband of the current frame through the second reverberation predictor; extract the steady-state reverberation amplitude spectrum corresponding to each subband of the current frame through the second reverberation predictor; and determine the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.
  • In an embodiment, the second reverberation prediction module 1406 is further configured to obtain the pure speech amplitude spectrum of the previous frame and, based on it, estimate the posterior signal-to-interference ratio of the current frame using the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.
  • In an embodiment, the speech signal processing module 1402 is further configured to perform framing and windowing on the original speech signal to obtain the amplitude spectrum feature and phase spectrum feature corresponding to the current frame of the original speech signal; and to obtain a preset band coefficient and divide the amplitude spectrum feature of the current frame into frequency bands according to the band coefficient, obtaining the subband amplitude spectrum corresponding to the current frame.
  • In an embodiment, the speech signal conversion module 1408 is further configured to perform an inverse constant-Q transform on the pure speech subband spectrum according to the band coefficient to obtain the pure speech amplitude spectrum corresponding to the current frame, and to perform time-frequency conversion on the pure speech amplitude spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.
  • In an embodiment, the apparatus further includes a reverberation predictor training module 1401, configured to: acquire reverberant speech data and pure speech data and use them to generate training sample data; determine the energy ratio of reverberation to pure speech of the reverberant and pure speech data as the training target; extract the reverberant-band amplitude spectrum corresponding to the reverberant speech data and the pure-speech-band amplitude spectrum of the pure speech data; and train the first reverberation predictor using the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target.
  • In an embodiment, the reverberation predictor training module 1401 is further configured to input the reverberant-band amplitude spectrum and the pure-speech-band amplitude spectrum into the preset network model to obtain a training result, and, based on the difference between the training result and the training target, adjust the parameters of the preset neural network model and continue training until the training condition is met, at which point training ends and the required first reverberation predictor is obtained.
  • Each module in the above speech signal dereverberation processing apparatus may be implemented in whole or in part by software, by hardware, or by a combination of the two. The above modules may be embedded in hardware form in, or independent of, the processor of the computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 16.
  • The computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store voice data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • The computer program, when executed by the processor, implements a speech signal dereverberation processing method.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 17.
  • the computer equipment includes a processor, a memory, a communication interface, a display screen, a microphone, a speaker, and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • The communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner; the wireless manner can be implemented through Wi-Fi, an operator network, NFC (near-field communication), or other technologies.
  • The computer program, when executed by the processor, implements a speech signal dereverberation processing method.
  • The display screen of the computer device may be a liquid crystal display or an electronic-ink display, and the input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 16 and FIG. 17 are merely block diagrams of part of the structure related to the solution of this application and do not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different component arrangement.
  • In an embodiment, a computer device is provided, including a memory and a processor; the memory stores a computer program, and the processor implements the steps in the foregoing method embodiments when executing the computer program.
  • In an embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the steps in the foregoing method embodiments.
  • In an embodiment, a computer program product or computer-readable instructions are provided; the computer-readable instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer-readable instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical storage.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM may be in various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.


Abstract

A speech signal dereverberation processing method, a processing apparatus, a computer device, and a readable storage medium. The method includes: acquiring an original speech signal (S502); extracting the amplitude spectrum feature and phase spectrum feature of the current frame of the original speech signal (S504); extracting the subband amplitude spectrum of the amplitude spectrum feature, inputting the subband amplitude spectrum into a first reverberation predictor, and outputting the reverberation intensity index corresponding to the current frame (S506); determining, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum of the current frame (S508); and performing signal conversion on the pure speech subband spectrum and the phase spectrum feature to obtain the dereverberated pure speech signal (S510).

Description

Speech signal dereverberation processing method and apparatus, computer device, and storage medium

This application claims priority to Chinese Patent Application No. 2020102500093, filed with the China National Intellectual Property Administration on April 1, 2020 and entitled "Speech signal dereverberation processing method and apparatus, computer device, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field

This application relates to the field of communication technologies, and in particular to a speech signal dereverberation processing method and apparatus, a computer device, and a storage medium.

Background

With the rapid development of computer communication technology, voice call technology based on VoIP (Voice over Internet Protocol) has emerged, in which communication takes place over the Internet to realize functions such as voice calls and multimedia conferencing. In VoIP point-to-point calls or multi-party online conference calls, reverberation caused by the speaker being far from the microphone or by poor indoor acoustics makes speech unclear and degrades voice call quality.

At present, the related art predicts the reverberation information of the current frame from historical frame information over a past period, using approaches such as LPC prediction, autoregressive models, and statistical models, to dereverberate single-channel speech. These approaches are usually based on the assumption that the reverberant components of speech are statistically stationary or short-time stationary, and they rely on historical frame information for reverberation estimation. Early reverberation, including early reflections, cannot be estimated accurately, and the degree of reverberation is estimated with some error, so the accuracy of removing reverberation from speech is low.
Summary

A speech signal dereverberation processing method includes:

acquiring an original speech signal;

extracting the amplitude spectrum feature and phase spectrum feature of the current frame of the original speech signal;

extracting a subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determining, by a first reverberation predictor according to the subband amplitude spectrum, the reverberation intensity index corresponding to the current frame;

determining, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame; and

performing signal conversion on the pure speech subband spectrum and the phase spectrum feature corresponding to the current frame to obtain a dereverberated pure speech signal.

A speech signal dereverberation processing apparatus includes:

a speech signal processing module, configured to acquire an original speech signal and extract the amplitude spectrum feature and phase spectrum feature corresponding to the current frame of the original speech signal;

a first reverberation prediction module, configured to extract a subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame and determine, by a first reverberation predictor according to the subband amplitude spectrum, the reverberation intensity index corresponding to the current frame;

a second reverberation prediction module, configured to determine, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame; and

a speech signal conversion module, configured to perform signal conversion on the pure speech subband spectrum and the phase spectrum feature corresponding to the current frame to obtain a dereverberated pure speech signal.

A computer device includes a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, implements the steps of: acquiring an original speech signal; extracting the amplitude spectrum feature and phase spectrum feature corresponding to the current frame of the original speech signal; extracting a subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determining, by a first reverberation predictor according to the subband amplitude spectrum, the reverberation intensity index corresponding to the current frame; determining, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame; and performing signal conversion on the pure speech subband spectrum and the phase spectrum feature corresponding to the current frame to obtain a dereverberated pure speech signal.

A computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of: acquiring an original speech signal; extracting the amplitude spectrum feature and phase spectrum feature corresponding to the current frame of the original speech signal; extracting a subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determining, by a first reverberation predictor according to the subband amplitude spectrum, the reverberation intensity index corresponding to the current frame; determining, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame; and performing signal conversion on the pure speech subband spectrum and the phase spectrum feature corresponding to the current frame to obtain a dereverberated pure speech signal.
Brief Description of the Drawings

FIG. 1 is a diagram of an application environment of the speech signal dereverberation processing method according to an embodiment of this application;

FIG. 2 is a schematic diagram of a conference interface according to an embodiment of this application;

FIG. 3 is a schematic diagram of a reverberation function settings page according to an embodiment of this application;

FIG. 4 is a schematic diagram of a reverberation function settings page according to another embodiment of this application;

FIG. 5 is a schematic flowchart of the speech signal dereverberation processing method in an embodiment of this application;

FIG. 6 shows spectrograms of pure speech and reverberant speech in an embodiment of this application;

FIG. 7 shows a reverberation intensity distribution and a predicted reverberation intensity distribution of a speech signal in an embodiment of this application;

FIG. 8 shows a reverberation intensity prediction distribution obtained by a conventional approach and one obtained by the speech signal dereverberation processing method in an embodiment of this application;

FIG. 9 shows the speech time-domain waveform and spectrogram of a heavily reverberant original speech signal in an embodiment of this application;

FIG. 10 shows the speech time-domain waveform and spectrogram of the corresponding pure speech signal in an embodiment of this application;

FIG. 11 is a schematic flowchart of the speech signal dereverberation processing method in another embodiment of this application;

FIG. 12 is a schematic flowchart of the steps of determining the pure speech subband spectrum of the current frame by the second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index in an embodiment of this application;

FIG. 13 is a schematic flowchart of the speech signal dereverberation processing method in another embodiment of this application;

FIG. 14 is a structural block diagram of a speech signal dereverberation processing apparatus in an embodiment of this application;

FIG. 15 is a structural block diagram of a speech signal dereverberation processing apparatus in an embodiment of this application;

FIG. 16 is an internal structure diagram of a computer device in an embodiment of this application;

FIG. 17 is an internal structure diagram of a computer device in another embodiment of this application.
Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.

The speech signal dereverberation processing method provided by this application can be applied in the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 over a network. The terminal 102 collects speech data recorded by a user; the terminal 102 or the server 104 acquires the original speech signal, extracts the amplitude spectrum feature and phase spectrum feature of the current frame of the original speech signal, and then divides the amplitude spectrum feature of the current frame into frequency bands to extract the corresponding subband amplitude spectrum. A first reverberation predictor performs reverberation intensity prediction on the subband-based subband amplitude spectrum, so that the reverberation intensity index of the current frame can be predicted accurately. A second reverberation predictor, combined with the obtained reverberation intensity index, then further predicts the pure speech subband spectrum of the current frame from the subband amplitude spectrum, so that the pure speech amplitude spectrum of the current frame can be extracted accurately and the corresponding pure speech signal obtained. The terminal 102 may be, but is not limited to, a personal computer, a laptop, a smartphone, a tablet, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
The solutions provided by the embodiments of this application involve technologies such as artificial-intelligence speech enhancement. Key speech technologies include speech separation (SS), speech enhancement (SE), and automatic speech recognition (ASR). Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is among the most promising modes of such interaction.

The speech signal dereverberation processing method provided by the embodiments of this application can also be applied to cloud conferencing. Cloud conferencing is an efficient, convenient, and low-cost form of conferencing based on cloud computing technology. Through a simple, easy-to-use Internet interface, users can quickly and efficiently share voice, data files, and video with teams and customers around the world, while the cloud conferencing service provider handles the complex technologies, such as data transmission and processing, within the conference.

At present, domestic cloud conferencing mainly focuses on service content centered on the SaaS (Software as a Service) model, including telephone, network, and video services; video conferencing based on cloud computing is called cloud conferencing. In the cloud conferencing era, data transmission, processing, and storage are all handled by the computing resources of the video conferencing vendor. Users no longer need to purchase expensive hardware or install cumbersome software; they only need to open a browser and log in to the corresponding interface to hold an efficient remote conference.

A cloud conferencing system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, greatly improving conference stability, security, and availability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management, and it is widely used in government, military, transportation, finance, operators, education, enterprises, and other fields. After video conferencing adopts cloud computing, it is undoubtedly more attractive in convenience, speed, and ease of use, and it will surely stimulate a new wave of video conferencing applications.
This application further provides an application scenario, applicable to voice call scenarios and specifically to conference scenarios, which may be voice conferences or video conferences. This scenario applies the above speech signal dereverberation processing method at the user terminal, as follows.

A user can initiate or join a voice conference through the corresponding user terminal; after the user enters the conference through the terminal, the conference starts. FIG. 2 is a schematic diagram of a conference interface in an embodiment: when the user terminal enters the conference interface, the conference starts. The conference interface includes conference options, as shown in FIG. 2, which may include microphone, camera, screen sharing, members, settings, and leave meeting; these options configure the various functions of the conference scenario.

When a receiving user listens to another party and finds that the other party's voice is muddy with heavy reverberation, the speech becomes hard to understand. The receiving user can enable the dereverberation function through the settings option in the conference interface of the conference application on the user terminal. The reverberation function settings page of the conference interface is shown in FIG. 3. By clicking the settings option of the conference interface in FIG. 2 and checking the "audio reverberation cancellation" option in the settings page of FIG. 3, the user turns on the audio dereverberation function corresponding to the speaker. The speech dereverberation function built into the conference application is then enabled, and the user terminal dereverberates the received speech data.

The user terminal displays a communication configuration page in the conference interface; the page includes a reverberation cancellation configuration option, through which the user triggers the reverberation cancellation setting. The user terminal then obtains the reverberation cancellation request triggered through this option and, based on the request, dereverberates the currently acquired reverberant speech signal. Specifically, the user terminal of the recipient receives the original speech signal sent by the sender's terminal, preprocesses it by framing and windowing, and extracts the amplitude spectrum feature and phase spectrum feature of the current frame. The terminal then divides the amplitude spectrum feature of the current frame into frequency bands to extract the corresponding subband amplitude spectrum, performs reverberation intensity prediction on the subband-based amplitude spectrum with the first reverberation predictor to accurately predict the reverberation intensity index of the current frame, and uses the second reverberation predictor together with the obtained reverberation intensity index to further predict the pure speech subband spectrum of the current frame, so that the pure speech amplitude spectrum of the current frame is extracted accurately. The terminal performs signal conversion on the pure speech subband spectrum and phase spectrum feature, obtains the dereverberated pure speech signal, and outputs it through the terminal's speaker. In this way, when receiving speech data sent by other parties, the user terminal can remove the reverberant components of the other users' speech from the sound played by the speaker or earphones while preserving the clean speech of their remarks, effectively improving the accuracy and efficiency of speech dereverberation and thereby the conference call experience.

In another application scenario, after entering the conference, a speaking user may find that the environment is heavily reverberant, or other parties may report that the speech is hard to understand. The user can also configure the reverberation function through the settings option of the reverberation function settings page to enable dereverberation: in the settings page shown in FIG. 4, checking the "audio reverberation cancellation" option turns on the audio dereverberation function corresponding to the microphone. The speech dereverberation function built into the conference application is then enabled, and the sender's user terminal dereverberates the recorded speech data in the same way as described above. The terminal can thus remove the reverberant components from the speaker's speech picked up by the microphone and extract the pure speech signal before sending it out, effectively improving the accuracy and efficiency of speech dereverberation and thereby the conference call experience.
This application additionally provides another application scenario, applied to voice call scenarios and specifically applicable to voice or video conferencing, which applies the above speech signal dereverberation processing method as follows.

In a multi-party conference, multiple user terminals communicate with a server for multi-endpoint voice interaction: a user terminal sends its speech signal to the server, which transmits it to the corresponding receiving terminals. Every user must receive the speech streams of all other users; that is, in an N-party conference each user listens to the other N-1 speech streams, so mixing and flow-control operations are required. In a multi-party conference, a speaking user can choose to enable dereverberation so that the speech signal sent by the sender's terminal is reverberation-free. A listening user can likewise enable dereverberation on the corresponding receiving terminal so that the sound signal it receives is reverberation-free. The server can also enable dereverberation so that it dereverberates the speech data passing through it. When the server or the receiving terminal performs dereverberation, the multiple speech streams are usually mixed into one stream first and then dereverberated, to reduce computing resources. Further, the server may instead dereverberate each stream before mixing, or automatically determine whether a given stream contains reverberation before deciding whether to dereverberate it.

In one embodiment, the server delivers all N-1 streams to the corresponding receiving terminal, which mixes the received streams into one, dereverberates the result, and outputs it through the terminal's speaker.

In another embodiment, the server performs the mixing of the received streams; that is, the server mixes the N-1 streams into one, dereverberates the mixed speech data, and then delivers the dereverberated speech data to the corresponding receiving terminals. Specifically, after obtaining the original speech data uploaded by the sending terminal, the server acquires the corresponding original speech signal, preprocesses it by framing and windowing, and extracts the amplitude spectrum feature and phase spectrum feature of the current frame. The server then divides the amplitude spectrum feature of the current frame into frequency bands to extract the corresponding subband amplitude spectrum, performs reverberation intensity prediction on the subband-based amplitude spectrum with the first reverberation predictor to accurately predict the reverberation intensity index of the current frame, and uses the second reverberation predictor together with the obtained index to further predict the pure speech subband spectrum of the current frame. The server performs signal conversion on the pure speech subband spectrum and phase spectrum feature to obtain the dereverberated pure speech signal, sends it to the corresponding receiving terminals in the current conference, and outputs it through their speakers, so that a pure speech signal with a high reverberation-cancellation rate is obtained and the accuracy and efficiency of speech dereverberation are effectively improved.
In an embodiment, as shown in FIG. 5, a speech signal dereverberation processing method is provided. This embodiment is described mainly by applying the method to a computer device, which may specifically be the terminal 102 or the server 104 above. Referring to FIG. 5, the method includes the following steps.

Step S502: Acquire an original speech signal.

Step S504: Extract the amplitude spectrum feature and phase spectrum feature of the current frame of the original speech signal.

Usually, during audio capture or recording, besides the directly arriving sound waves emitted by the desired source, a microphone also receives sound waves from the source that arrive via other paths, as well as unwanted sound waves produced by other sources in the environment (i.e., background noise). Acoustically, reflected waves with a delay of about 50 ms or more are called echoes, and the effect produced by the remaining reflected waves is called reverberation.

An audio capture apparatus can collect the original speech signal uttered by the user through an audio channel; the original speech signal may be a reverberant audio signal. Typically, reverberation arises because the speaker is far from the microphone or the indoor acoustics are poor, making the speech unclear and degrading communication quality. The reverberant original speech signal therefore needs to be dereverberated. The speech signal dereverberation processing method in this embodiment is applicable to processing a single-channel original speech signal.

After acquiring the original speech signal, the computer device first preprocesses it, including pre-emphasis, framing, and windowing. Specifically, the collected original speech signal is framed and windowed to obtain the preprocessed original speech signal, and the original speech signal of each frame is then processed. For example, a triangular window or a Hanning window may be used to divide the original speech signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, yielding multiple speech frames and their corresponding speech signals.

The Fourier transform realizes time-frequency conversion. In Fourier analysis, the variation of the component amplitudes with frequency is called the amplitude spectrum of the signal, and the variation of the component phases with angular frequency is called its phase spectrum. The Fourier transform of the original speech signal yields its amplitude spectrum and phase spectrum.

After windowing and framing the original speech signal, the computer device obtains multiple speech frames; it then applies a fast Fourier transform to the windowed, framed signal, thereby obtaining the frequency spectrum of the original speech signal, from which it can extract the amplitude spectrum feature and phase spectrum feature corresponding to the current frame. It should be understood that the current frame may be any one of the speech frames the computer device is processing.

Step S506: Extract a subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determine, by a first reverberation predictor according to the subband amplitude spectrum, the reverberation intensity index corresponding to the current frame.

The subband amplitude spectra are obtained by dividing the amplitude spectrum of each speech frame into subbands, of which there are at least two. Specifically, the computer device can divide the amplitude spectrum feature into frequency bands, splitting the amplitude spectrum of each frame into multiple subband amplitude spectra to obtain the subband amplitude spectra corresponding to the amplitude spectrum feature of the current frame; the corresponding subband amplitude spectra are computed for every frame.

The first reverberation predictor may be a machine learning model, that is, a model that acquires some capability through learning from samples; it may specifically be a neural network model, such as a CNN (convolutional neural network) model, an RNN (recurrent neural network), or an LSTM (long short-term memory) model. Specifically, the first reverberation predictor may be a reverberation intensity predictor based on an LSTM neural network model, pre-trained to have reverberation prediction capability.

Specifically, the computer device divides the amplitude spectrum feature of the current frame into frequency bands to obtain multiple subband amplitude spectra; that is, the amplitude spectrum feature of each frame is divided into multiple subband amplitude spectra, each with a corresponding subband identifier.

The computer device then inputs the subband amplitude spectra corresponding to the amplitude spectrum feature of the current frame into the first reverberation predictor. Specifically, the first reverberation predictor includes a multi-layer neural network. The computer device takes the amplitude spectrum feature of each subband as the input feature of the network model; the multi-layer network structure in the first reverberation intensity predictor, with its corresponding network parameters and weights, analyzes the amplitude spectrum features of the subbands, predicts the pure speech energy ratio of each subband of the current frame, and then outputs the reverberation intensity index corresponding to the current frame according to the per-subband pure speech energy ratios.

Step S508: Determine, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame.

The second reverberation predictor may be a reverberation intensity prediction algorithm model based on historical frames. For example, the reverberation intensity prediction algorithm may be weighted recursive least squares, an autoregressive prediction model, linear prediction of the speech signal, or a similar algorithm, without limitation here.

The computer device also uses the second reverberation predictor to extract the steady-state noise spectrum and the steady-state reverberation amplitude spectrum contained in each subband of the current frame, computes the posterior signal-to-interference ratio from the per-subband steady-state noise spectrum, steady-state reverberation amplitude spectrum, and subband amplitude spectrum, then computes the prior signal-to-interference ratio from the posterior ratio and the reverberation intensity index output by the first reverberation predictor, and finally weights the subband amplitude spectrum by the prior ratio, thereby obtaining the estimated pure speech subband amplitude spectrum accurately and effectively.

Step S510: Perform signal conversion on the pure speech subband spectrum and phase spectrum feature corresponding to the current frame to obtain a dereverberated pure speech signal.

After the first reverberation predictor predicts the reverberation intensity index corresponding to the current frame, the computer device uses the second reverberation predictor to determine the pure speech subband spectrum of the current frame from the subband amplitude spectrum and the reverberation intensity index, so that the reverberation-free pure speech subband amplitude spectrum is estimated accurately and effectively.

The computer device then applies an inverse constant-Q transform to the pure speech subband spectrum to obtain the transformed pure speech amplitude spectrum, combines the pure speech amplitude spectrum with the phase spectrum feature, and performs a time-domain transform to obtain the dereverberated pure speech signal. Combining the neural-network-based first reverberation predictor with the historical-frame-based second reverberation predictor for reverberation estimation improves the accuracy of reverberation intensity estimation, effectively improves the reverberation-cancellation accuracy of the speech signal, and can in turn improve speech recognition accuracy.

In the above speech signal dereverberation processing method, the original speech signal is acquired, the amplitude spectrum feature and phase spectrum feature of the current frame are extracted, and the amplitude spectrum feature of the current frame is divided into frequency bands to extract the corresponding subband amplitude spectrum. The first reverberation predictor performs reverberation intensity prediction on the subband-based amplitude spectrum, so that the reverberation intensity index of the current frame can be predicted accurately. The second reverberation predictor, combined with the obtained reverberation intensity index and the subband amplitude spectrum, then further predicts the pure speech subband spectrum of the current frame, so that the pure speech amplitude spectrum of each speech frame is obtained precisely and the dereverberated pure speech signal is derived accurately and effectively, which effectively improves the reverberation-cancellation accuracy of the speech signal.

In conventional speech signal dereverberation, a conventional reverberation predictor estimates the power spectrum of late reverberation by linear superposition of the power spectra of historical frames, subtracts the late-reverberation power spectrum from the current frame to obtain the dereverberated power spectrum, and thereby obtains the dereverberated time-domain speech signal. This approach depends on the assumption that the reverberant components of speech are statistically stationary or short-time stationary, but it cannot accurately estimate early reverberation, including early reflections. Conventional approaches that use a neural network to predict the amplitude spectrum directly suffer because the amplitude spectrum has a wide dynamic range and is hard to learn, causing considerable speech damage; they also often require complex network structures to process many frequency-bin features, which is computationally expensive and inefficient.

In this embodiment, an experimental test was carried out with a segment of pure speech and a segment of reverberant speech recorded in a reverberant environment, processing the reverberant recording with the speech signal dereverberation processing method of this embodiment. The test compares the spectrogram of the pure speech, the spectrogram of the reverberant speech recorded in the reverberant environment, and the reverberation intensity distributions. As shown in FIG. 6(a), the spectrogram of the pure speech has time on the horizontal axis and frequency on the vertical axis; FIG. 6(b) is the spectrogram of the same speech recorded with reverberation. Comparing FIG. 6(a) with 6(b) shows that the spectral lines in 6(b) are blurred and distorted. FIG. 7(a) shows the magnitude of the distortion in different bands at different times, i.e., the intensity of the reverberant interference; the brighter the color, the stronger the reverberation. FIG. 7(a) reflects the reverberation intensity of the reverberant speech, which is also the prediction target of the first reverberation predictor in this embodiment.

Using the neural-network-based first reverberation predictor to predict the reverberation intensity of the reverberant speech gives the result shown in FIG. 7(b), which shows that the predictor reproduces the true reverberation intensity distribution of FIG. 7(a) fairly accurately.

By contrast, predicting with only a conventional historical-frame-based reverberation predictor, without the neural-network-based first reverberation predictor of this solution, gives the result in FIG. 8(a), which shows that it cannot accurately estimate the details of the reverberation intensity distribution.

Further, combining the prediction of the neural-network-based first reverberation predictor with the historical-frame-based second reverberation predictor for reverberation intensity prediction gives the result in FIG. 8(b). Compared with the conventional approach, the result obtained with the solution of this embodiment is closer to the true reverberation intensity distribution, significantly improving the reverberation prediction accuracy for reverberant speech signals.

FIG. 9 shows the speech time-domain waveform and spectrogram of a heavily reverberant original speech signal in an embodiment. As FIG. 9 shows, because of the reverberation, the speech has long smearing, the waveforms of adjacent words run together, the spectrogram lines are blurred, and the overall intelligibility and clarity of the speech signal are low.

Processing the heavily reverberant original speech signal with the speech signal dereverberation processing method of this embodiment yields the pure speech signal whose speech time-domain waveform and spectrogram are shown in FIG. 10. The first reverberation predictor performs reverberation intensity prediction on the subband-based subband amplitude spectrum of the current frame to obtain its reverberation intensity index; the second reverberation predictor then further predicts the pure speech subband spectrum of the current frame from the obtained index and the subband amplitude spectrum, so that the pure speech signal is extracted accurately and the accuracy of reverberation cancellation of the speech signal is effectively improved.
In an embodiment, determining the reverberation intensity index corresponding to the current frame by the first reverberation predictor according to the subband amplitude spectrum includes: predicting, by the first reverberation predictor, the pure speech energy ratio corresponding to the subband amplitude spectrum; and determining the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio.

The first reverberation predictor is a neural-network-based reverberation predictor trained in advance on a large amount of reverberant speech data and pure speech data. It includes a multi-layer network structure, each layer with corresponding network parameters and weights, used to predict the proportion of pure speech in each subband of the reverberant original speech signal.

After extracting the subband amplitude spectra corresponding to the amplitude spectrum of the current frame, the computer device inputs them into the first reverberation predictor, whose network layers analyze each subband amplitude spectrum. The first reverberation predictor takes the energy ratio of the reverberant original speech to the pure speech in each subband amplitude spectrum as the prediction target; through the network parameters and weights of its layers, it derives the pure speech energy ratio of each subband amplitude spectrum, from which it can predict the reverberation intensity distribution of the current frame and thus obtain the reverberation intensity index corresponding to the current frame. Performing reverberation prediction on each subband amplitude spectrum with the pre-trained neural-network-based first reverberation predictor makes it possible to estimate the reverberation intensity index of the current frame accurately.
In an embodiment, as shown in FIG. 11, a speech signal dereverberation processing method is provided, including the following steps:

Step S1102: Acquire an original speech signal; extract the amplitude spectrum feature and phase spectrum feature of the current frame of the original speech signal.

Step S1104: Extract the subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame; extract the dimensional features of the subband amplitude spectrum through the input layer of the first reverberation predictor.

Step S1106: Through the prediction layer of the first reverberation predictor, extract the characterization information of the subband amplitude spectrum according to the dimensional features, and determine the pure speech energy ratio of the subband amplitude spectrum according to the characterization information.

Step S1108: Through the output layer of the first reverberation predictor, output the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio corresponding to the subband amplitude spectrum.

Step S1110: Determine, by the second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame.

Step S1112: Perform signal conversion on the pure speech subband spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.

The first reverberation predictor is a neural network model based on an LSTM long short-term memory network and includes an input layer, a prediction layer, and an output layer. The input and output layers may be fully connected layers: the input layer extracts the feature dimensions of the model's input data, and the output layer normalizes the means and value ranges and produces the output result. Specifically, the prediction layer may consist of LSTM-structured network layers, at least two of them. The prediction layer's network structure includes input gates, output gates, forget gates, and cell-state units, which markedly improve the LSTM's temporal modeling capability, allow it to memorize more information, and let it effectively capture long-term dependencies in the data, so that the characterization information of the input features is extracted accurately and effectively.

In predicting the reverberation intensity index of the current frame with the first reverberation predictor, after the subband amplitude spectra of the current frame are input, the input layer of the first reverberation intensity predictor first extracts the dimensional features of each subband amplitude spectrum. Specifically, the computer device can use the subband amplitude spectra extracted over constant-Q bands as the network's input features. For example, the number of Q bands may be denoted K, which is the input feature dimension of the first reverberation predictor; with a 16 kHz input sampling rate and a 20 ms frame length, after a 512-point STFT (short-time Fourier transform), K is 8. After the first reverberation predictor analyzes the input features, the output is also an 8-dimensional feature, representing the predicted reverberation intensity on the 8 constant-Q bands.

In one embodiment, each network layer of the first reverberation predictor may use 1024 nodes, with the prediction layer being two 1024-node LSTM layers; FIG. 7 is referenced in the original for the layer structure of a first reverberation predictor using two 1024-node LSTM layers.

The prediction layer is an LSTM-based network layer, and an LSTM network contains three gates: the forget gate, the input gate, and the output gate. The forget gate decides how much information from the previous state should be discarded, for example by outputting a value between 0 and 1 representing the retained portion; the hidden-layer output of the previous time step can serve as a parameter of the forget gate. The input gate decides which information should be kept in the cell-state unit, and its parameters can be obtained through training. The forget gate computes how much information in the old cell-state unit is discarded, then the input gate adds its result to the cell state, indicating how much of the new input information joins the cell state. After the cell-state unit is updated, the output is computed based on the cell state: the input data passes through a sigmoid activation function to give the output-gate value, and the processed cell-state information is combined with the output-gate value to produce the output result of the cell-state unit.

After the input layer of the first reverberation intensity predictor extracts the dimensional features of each subband amplitude spectrum, the computer device uses the prediction layer of the first reverberation predictor to extract the characterization information of each subband amplitude spectrum according to the dimensional features. Each network-layer structure in the prediction layer extracts the characterization information of each subband amplitude spectrum through its corresponding network parameters and weights, and this information may include multiple levels of representation: each layer extracts the representation of the corresponding subband amplitude spectrum, and after extraction through multiple layers, deep characterization information of each subband amplitude spectrum can be obtained, so that the extracted characterization information can be used for further accurate predictive analysis.

The prediction layer then outputs the pure speech energy ratio of each subband amplitude spectrum according to the characterization information, and the output layer outputs the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio corresponding to each subband. The computer device further uses the second reverberation predictor to determine the pure speech subband spectrum of the current frame from the subband amplitude spectrum and the reverberation intensity index, and performs signal conversion on the pure speech subband spectrum and phase spectrum feature to obtain the dereverberated pure speech signal.

In this embodiment, analyzing each subband amplitude spectrum with the network parameters and weights of each layer of the pre-trained neural-network-based first reverberation predictor makes it possible to derive the pure speech energy ratio of each subband amplitude spectrum precisely, so that the reverberation intensity index of each speech frame is estimated accurately and effectively.
In an embodiment, determining the pure speech subband spectrum corresponding to the current frame by the second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index includes: determining the posterior signal-to-interference ratio of the current frame by the second reverberation predictor according to the amplitude spectrum feature of the current frame; determining the prior signal-to-interference ratio of the current frame according to the posterior signal-to-interference ratio and the reverberation intensity index; and filtering and enhancing the subband amplitude spectrum of the current frame based on the prior signal-to-interference ratio to obtain the pure speech subband amplitude spectrum corresponding to each speech frame.

The signal-to-interference ratio (SIR) is the ratio of signal energy to the sum of interference energy (such as co-channel interference and multipath) and additive noise energy. The prior signal-to-interference ratio is the ratio obtained from prior experience and analysis, while the posterior signal-to-interference ratio is an estimate, closer to the actual situation, obtained by correcting the original prior information with new information.

When predicting the reverberation of the subband amplitude spectra, the computer device also uses the second reverberation predictor to perform stationary noise estimation on each subband amplitude spectrum and computes the posterior signal-to-interference ratio of the current frame from the estimate. The second reverberation predictor then computes the prior signal-to-interference ratio of the current frame from the posterior ratio, combined with the reverberation intensity index predicted by the first reverberation predictor. After obtaining the prior ratio of the current frame through the second reverberation predictor, the subband amplitude spectrum of the current frame is weighted and enhanced with the prior ratio, yielding the predicted pure speech subband spectrum of the current frame. The first reverberation predictor predicts the reverberation intensity index of the current frame precisely, and the index is then used to dynamically adjust the amount of dereverberation, so that the prior signal-to-interference ratio of the current frame is computed accurately and the pure speech subband spectrum within it is estimated precisely.

In an embodiment, as shown in FIG. 12, the steps of determining the pure speech subband spectrum corresponding to the current frame by the second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index specifically include the following:

Step S1202: Extract, through the second reverberation predictor, the steady-state noise amplitude spectrum corresponding to each subband of the current frame.

Step S1204: Extract, through the second reverberation predictor, the steady-state reverberation amplitude spectrum corresponding to each subband of the current frame.

Step S1206: Determine the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.

Step S1208: Determine the prior signal-to-interference ratio of the current frame according to the posterior signal-to-interference ratio and the reverberation intensity index.

Step S1210: Filter and enhance the subband amplitude spectrum of the current frame based on the prior signal-to-interference ratio to obtain the pure speech subband amplitude spectrum corresponding to the current frame.

Steady-state noise is continuous noise whose intensity fluctuates within 5 dB, or impulsive noise with a repetition frequency above 10 Hz. The steady-state noise amplitude spectrum represents the amplitude distribution of the noise in a subband, and the steady-state reverberation amplitude spectrum represents the amplitude distribution of the reverberation in a subband.

When the second reverberation predictor processes the subband amplitude spectrum of the current frame, it extracts the steady-state noise amplitude spectrum corresponding to each subband of the current frame, as well as the steady-state reverberation amplitude spectrum corresponding to each subband. The second reverberation predictor then computes the posterior signal-to-interference ratio of the current frame from the per-subband steady-state noise amplitude spectrum, steady-state reverberation amplitude spectrum, and subband amplitude spectrum, and further computes the prior signal-to-interference ratio from the posterior ratio and the reverberation intensity index. The subband amplitude spectrum of the current frame is then filtered and enhanced with the prior ratio; for example, the subband amplitude spectrum may be weighted by the prior ratio, yielding the pure speech subband amplitude spectrum of the current frame.

After the computer device divides the amplitude spectrum feature of the current frame into bands and extracts the corresponding subband amplitude spectrum, the first reverberation predictor predicts the reverberation intensity index of the current frame, and the second reverberation predictor may analyze the subband amplitude spectrum of the current frame at the same time; the processing order of the two predictors is not limited here. After the first reverberation predictor outputs the reverberation intensity index of the current frame and the second reverberation predictor computes the posterior signal-to-interference ratio of the current frame, the second predictor computes the prior signal-to-interference ratio from the posterior ratio and the index and filters and enhances the subband amplitude spectrum of the current frame with the prior ratio, so that the pure speech subband amplitude spectrum of the current frame is estimated precisely.
In an embodiment, the method further includes: obtaining the pure speech amplitude spectrum of the previous frame; and determining the posterior signal-to-interference ratio of the current frame, based on the pure speech amplitude spectrum of the previous frame, using the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.

The second reverberation predictor is a reverberation intensity prediction algorithm model based on historical-frame analysis. For example, if the current frame is frame p, the historical frames may be frame (p-1), frame (p-2), and so on.

Specifically, in this embodiment the historical frame is the frame preceding the current frame, and the current frame is the frame the computer device currently needs to process. After processing the speech signal of the frame preceding the current frame of the original speech signal, the computer device can directly obtain the pure speech amplitude spectrum of the previous frame. As the computer device goes on to process the speech signal of the current frame, after obtaining the current frame's reverberation intensity index with the first reverberation predictor, and while predicting the current frame's pure speech subband spectrum with the second reverberation predictor, the second predictor extracts the steady-state noise amplitude spectrum and steady-state reverberation amplitude spectrum corresponding to each subband of the current frame, and then computes the posterior signal-to-interference ratio of the current frame using the previous frame's pure speech amplitude spectrum combined with the current frame's steady-state noise amplitude spectrum, steady-state reverberation amplitude spectrum, and subband amplitude spectrum. Because the second predictor analyzes the current frame's posterior signal-to-interference ratio on the basis of the historical frame, combined with the current frame's reverberation intensity index predicted by the first predictor, a highly accurate posterior ratio is computed, with which the pure speech subband amplitude spectrum of the current frame can be estimated precisely.
In an embodiment, the method further includes: framing and windowing the original speech signal to obtain the amplitude spectrum feature and phase spectrum feature corresponding to the current frame of the original speech signal; and obtaining a preset band coefficient and dividing the amplitude spectrum feature of the current frame into frequency bands according to the band coefficient to obtain the subband amplitude spectrum corresponding to the current frame.

The band coefficient is used to divide each frame into the corresponding number of subbands according to its value, and it may be a constant coefficient. For example, constant-Q band division (with the Q value held constant) may be applied to the amplitude spectrum feature of the current frame, where the ratio of center frequency to bandwidth is the constant Q, and this constant Q value is the band coefficient.

Specifically, after acquiring the original speech signal, the computer device windows and frames it and applies a fast Fourier transform to the windowed, framed signal, obtaining the spectrum of the original speech signal; the computer device then processes the spectrum of each frame of the original speech signal in turn.

The computer device first extracts the amplitude spectrum feature and phase spectrum feature of the current frame from the spectrum of the original speech signal, and then performs constant-Q band division on the amplitude spectrum feature of the current frame to obtain the corresponding subband amplitude spectrum. One subband corresponds to one sub-frequency-band, which may contain a series of frequency bins; for example, subband 1 corresponds to 0-100 Hz, subband 2 to 100-300 Hz, and so on. The amplitude spectrum feature of a given subband is a weighted sum of the frequency bins it contains. Band division of each frame's amplitude spectrum effectively reduces the feature dimensionality of the amplitude spectrum, and the constant-Q division matches the physiological hearing characteristic of the human ear, whose frequency resolution is higher at low frequencies than at high frequencies; this effectively improves the precision of amplitude spectrum analysis, so that reverberation prediction of the speech signal can be performed more accurately.
In an embodiment, performing signal conversion on the pure speech subband spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal includes: applying an inverse constant-Q transform to the pure speech subband spectrum according to the band coefficient to obtain the pure speech amplitude spectrum corresponding to the current frame; and performing time-frequency conversion on the pure speech amplitude spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.

The computer device divides each frame's amplitude spectrum into multiple subband amplitude spectra and performs reverberation prediction on each with the first reverberation predictor to obtain the current frame's reverberation intensity index. After the second reverberation predictor computes the current frame's pure speech subband spectrum from the subband amplitude spectrum and the index, the computer device applies an inverse constant-Q transform to the pure speech subband spectrum, converting the constant-Q subband spectrum with its non-uniform frequency spacing back into an STFT amplitude spectrum with uniform frequency spacing, thereby obtaining the pure speech amplitude spectrum corresponding to the current frame. The computer device then combines this pure speech amplitude spectrum with the phase spectrum corresponding to the current frame of the original speech signal and performs an inverse Fourier transform, realizing the time-frequency conversion of the speech signal; the converted signal is the dereverberated pure speech signal. In this way the pure speech signal is extracted accurately, effectively improving the accuracy of reverberation cancellation of the speech signal.
In an embodiment, the first reverberation predictor is trained through the following steps: acquiring reverberant speech data and pure speech data, and generating training sample data from them; determining the energy ratio of reverberation to pure speech as the training target; extracting the reverberant-band amplitude spectrum corresponding to the reverberant speech data and the pure-speech-band amplitude spectrum of the pure speech data; and training the first reverberation predictor with the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target.

Before processing the original speech signal, the computer device must also train the first reverberation predictor in advance; the first reverberation predictor is a neural network model. Pure speech data is clean speech without reverberation noise, while reverberant speech data is speech containing reverberation noise, for example speech data recorded in a reverberant environment.

Specifically, the computer device acquires reverberant speech data and pure speech data and uses them to generate training sample data, which is used to train a preset neural network. The training sample data may specifically be pairs of reverberant speech data and the corresponding pure speech data. The energy ratio of reverberation to pure speech of the reverberant and pure speech data serves as the training label, i.e., the training target of model training; the training label is used to adjust the parameters after each training pass, to further train and optimize the neural network model.

After generating the training sample data from the reverberant and pure speech data, the computer device inputs it into the preset neural network model; feature extraction and reverberation intensity prediction on the reverberant speech data yield the corresponding reverberation-to-pure-speech energy ratio. Specifically, the computer device takes the energy ratio of reverberation to pure speech of the reverberant and pure speech data as the prediction target, and trains the neural network model on the reverberant speech data through a preset function.

During training of the prediction model, the preset neural network model is trained over multiple iterations with the reverberant speech data and the training target, producing a training result each time. The computer device then uses the training target to adjust the parameters of the preset neural network model according to each training result and continues iterative training until the training condition is met, obtaining the trained first reverberation predictor. Training the neural network on reverberant and pure speech data in this way effectively yields a first reverberation predictor with high reverberation prediction accuracy.

In an embodiment, training the first reverberation predictor with the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target includes: inputting the reverberant-band amplitude spectrum and the pure-speech-band amplitude spectrum into the preset network model to obtain a training result; and, based on the difference between the training result and the training target, adjusting the parameters of the preset neural network model and continuing training until the training condition is met, at which point training ends and the required first reverberation predictor is obtained.

The training condition is the condition for completing model training; it may be reaching a preset number of iterations, or it may be that the performance index of the predictor after parameter adjustment reaches a preset index.

Specifically, each time the computer device trains the preset neural network model on the reverberant speech data and obtains the corresponding training result, it compares the result with the training target to obtain their difference, then adjusts the parameters of the preset neural network model with the aim of reducing that difference and continues training. If the adjusted model's training result does not meet the training condition, the training label continues to be used for parameter adjustment and training continues; training ends when the condition is met, yielding the required prediction model.

The difference between the training result and the training target can be measured by a cost function, for which a cross-entropy loss function or mean square error may be chosen. Training may be ended when the value of the cost function falls below a preset value, improving the accuracy of predicting the reverberation in the reverberant speech data. For example, the preset neural network model is based on an LSTM model; the minimum-mean-square-error criterion is selected to update the network weights, and once the loss has stabilized, the parameters of each LSTM layer are determined, with the training target constrained to the [0, 1] range by a sigmoid activation function, so that when faced with new reverberant speech data, the network can predict the proportion of pure speech in each band of that speech.

In this embodiment, guiding the neural network model and tuning its parameters with the training labels during training of the prediction model effectively improves the precision of reverberation prediction on reverberant speech data, thereby improving the prediction accuracy of the first reverberation predictor and, in turn, the accuracy of reverberation cancellation of the speech signal.
As shown in FIG. 13, in a specific embodiment, the speech signal dereverberation method includes the following steps:

Step S1302: Acquire an original speech signal; extract the amplitude spectrum feature and phase spectrum feature of the current frame of the original speech signal.

Step S1304: Obtain a preset band coefficient, and divide the amplitude spectrum feature of the current frame into frequency bands according to the band coefficient to obtain the subband amplitude spectrum corresponding to the current frame.

Step S1306: Extract the dimensional features of the subband amplitude spectrum through the input layer of the first reverberation predictor.

Step S1308: Through the prediction layer of the first reverberation predictor, extract the characterization information of the subband amplitude spectrum according to the dimensional features, and determine the pure speech energy ratio of the subband amplitude spectrum according to the characterization information.

Step S1310: Through the output layer of the first reverberation predictor, output the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio of the subband amplitude spectrum.

Step S1312: Extract the steady-state noise amplitude spectrum and steady-state reverberation amplitude spectrum corresponding to each subband of the current frame through the second reverberation predictor.

Step S1314: Based on the pure speech amplitude spectrum of the previous frame, determine the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.

Step S1316: Determine the prior signal-to-interference ratio of the current frame according to the posterior signal-to-interference ratio of the current frame and the reverberation intensity index.

Step S1318: Filter and enhance the subband amplitude spectrum of the current frame according to the prior signal-to-interference ratio to obtain the pure speech subband amplitude spectrum of the current frame.

Step S1320: Apply an inverse constant-Q transform to the pure speech subband spectrum according to the band coefficient to obtain the pure speech amplitude spectrum corresponding to the current frame.

Step S1322: Perform time-frequency conversion on the pure speech amplitude spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.

Specifically, the original speech signal can be expressed as x(n). After preprocessing the collected original speech signal by framing and windowing, the computer device extracts the amplitude spectrum feature X(p, m) and the phase spectrum feature θ(p, m) corresponding to the current frame p, where m is the frequency bin index and p is the current frame index. The computer device then performs constant-Q band division on the amplitude spectrum feature X(p, m) of the current frame to obtain the subband amplitude spectrum Y(p, q); the calculation is given as an equation image in the original filing (not reproduced here), where q is the constant-Q band index, i.e., the subband index, and w_q is the weighting window of the q-th subband, for which, for example, a triangular or Hanning window may be used.

The computer device inputs the extracted subband amplitude spectrum Y(p, q) of subband q of the current frame into the first reverberation intensity predictor, which analyzes and processes the subband amplitude spectrum Y(p, q) of the current frame to obtain the reverberation intensity index η(p, q) of the current frame.

The computer device further uses the second reverberation intensity predictor to estimate the steady-state noise amplitude spectrum λ(p, q) and the steady-state reverberation amplitude spectrum l(p, q) contained in each subband, and computes the posterior signal-to-interference ratio γ(p, q) from λ(p, q), l(p, q), and the subband amplitude spectrum Y(p, q); this calculation is given as an equation image in the original filing.

The computer device then computes the prior signal-to-interference ratio ξ(p, q) from the posterior ratio γ(p, q) and the reverberation intensity index η(p, q) output by the first reverberation intensity predictor; this calculation is likewise given as an equation image in the original filing. The main function of η(p, q) is to dynamically adjust the amount of dereverberation: the larger the estimated η(p, q), the heavier the reverberation of subband q at time p and the larger the relative amount of dereverberation; conversely, the smaller the estimated η(p, q), the lighter the reverberation of subband q at time p, the smaller the amount of dereverberation, and the smaller the damage to sound quality. G(p, q) is the predictive gain function, used to measure the proportion of pure speech energy in the reverberant speech.

The computer device then weights the input subband amplitude spectrum Y(p, q) by the prior signal-to-interference ratio ξ(p, q), obtaining the estimated pure speech subband amplitude spectrum S(p, q). An inverse constant-Q transform (given as an equation image in the original filing) is applied to the reverberation-free pure speech subband amplitude spectrum S(p, q), where Z(p, m) denotes the pure speech amplitude spectrum feature. The computer device then combines Z(p, m) with the phase spectrum feature θ(p, m) of the current frame and performs an inverse STFT, realizing the conversion from the frequency domain to the time domain and yielding the dereverberated time-domain speech signal S(n).

In this embodiment, the first reverberation predictor performs reverberation intensity prediction on the subband-based subband amplitude spectra, so that the reverberation intensity index of the current frame is predicted accurately; the second reverberation predictor then uses the obtained index to further predict the pure speech subband spectrum of the current frame from its subband amplitude spectrum, so that the pure speech amplitude spectrum of the current frame is extracted accurately, effectively improving the accuracy of reverberation cancellation of the speech signal.
It should be understood that although the steps in the flowcharts of FIGS. 5, 11, 12, and 13 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 5, 11, 12, and 13 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In an embodiment, as shown in FIG. 14, a speech signal dereverberation processing apparatus 1400 is provided. The apparatus may adopt software modules or hardware modules, or a combination of both, as part of a computer device, and specifically includes a speech signal processing module 1402, a first reverberation prediction module 1404, a second reverberation prediction module 1406, and a speech signal conversion module 1408, where:

the speech signal processing module 1402 is configured to acquire an original speech signal and extract the amplitude spectrum feature and phase spectrum feature of the current frame of the original speech signal;

the first reverberation prediction module 1404 is configured to extract the subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame and determine, by the first reverberation predictor according to the subband amplitude spectrum, the reverberation intensity index corresponding to the current frame;

the second reverberation prediction module 1406 is configured to determine, by the second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame; and

the speech signal conversion module 1408 is configured to perform signal conversion on the pure speech subband spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.

In an embodiment, the first reverberation prediction module 1404 is further configured to predict the pure speech energy ratio corresponding to the subband amplitude spectrum through the first reverberation predictor, and determine the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio.

In an embodiment, the first reverberation prediction module 1404 is further configured to extract the dimensional features of the subband amplitude spectrum through the input layer of the first reverberation predictor; extract the characterization information of the subband amplitude spectrum according to the dimensional features through the prediction layer of the first reverberation predictor, and determine the pure speech energy ratio of the subband amplitude spectrum according to the characterization information; and output the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio through the output layer of the first reverberation predictor.

In an embodiment, the second reverberation prediction module 1406 is further configured to determine the posterior signal-to-interference ratio of the current frame through the second reverberation predictor according to the amplitude spectrum feature of each speech frame; determine the prior signal-to-interference ratio of the current frame from the posterior ratio and the reverberation intensity index; and filter and enhance the subband amplitude spectrum of the current frame based on the prior ratio to obtain the pure speech subband amplitude spectrum corresponding to the current frame.

In an embodiment, the second reverberation prediction module 1406 is further configured to extract, through the second reverberation predictor, the steady-state noise amplitude spectrum corresponding to each subband of the current frame; extract, through the second reverberation predictor, the steady-state reverberation amplitude spectrum corresponding to each subband of the current frame; and determine the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.

In an embodiment, the second reverberation prediction module 1406 is further configured to obtain the pure speech amplitude spectrum of the previous frame and, based on it, estimate the posterior signal-to-interference ratio of the current frame using the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.

In an embodiment, the speech signal processing module 1402 is further configured to frame and window the original speech signal to obtain the amplitude spectrum feature and phase spectrum feature corresponding to the current frame of the original speech signal; and to obtain a preset band coefficient and divide the amplitude spectrum feature of the current frame into frequency bands according to it, obtaining the subband amplitude spectrum corresponding to the current frame.

In an embodiment, the speech signal conversion module 1408 is further configured to apply an inverse constant-Q transform to the pure speech subband spectrum according to the band coefficient to obtain the pure speech amplitude spectrum corresponding to the current frame, and to perform time-frequency conversion on the pure speech amplitude spectrum and phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.

In an embodiment, as shown in FIG. 15, the apparatus further includes a reverberation predictor training module 1401, configured to acquire reverberant speech data and pure speech data and generate training sample data from them; determine the energy ratio of reverberation to pure speech of the reverberant and pure speech data as the training target; extract the reverberant-band amplitude spectrum corresponding to the reverberant speech data and the pure-speech-band amplitude spectrum of the pure speech data; and train the first reverberation predictor with the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target.

In an embodiment, the reverberation predictor training module 1401 is further configured to input the reverberant-band amplitude spectrum and the pure-speech-band amplitude spectrum into the preset network model to obtain a training result, and, based on the difference between the training result and the training target, adjust the parameters of the preset neural network model and continue training until the training condition is met, at which point training ends and the required first reverberation predictor is obtained.

For specific limitations of the speech signal dereverberation processing apparatus, refer to the limitations of the speech signal dereverberation processing method above. Each module of the above apparatus may be implemented in whole or in part by software, by hardware, or by a combination of the two. The modules may be embedded in hardware form in, or independent of, the processor of the computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In an embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in FIG. 16. The computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for running the operating system and computer program in the non-volatile storage medium. The database of the computer device is used to store speech data. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer program, when executed by the processor, implements a speech signal dereverberation processing method.

In an embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in FIG. 17. The computer device includes a processor, a memory, a communication interface, a display screen, a microphone, a speaker, and an input apparatus connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium storing an operating system and a computer program, and an internal memory that provides an environment for running them. The communication interface of the computer device is used to communicate with external terminals in a wired or wireless manner; the wireless manner can be realized through Wi-Fi, an operator network, NFC (near-field communication), or other technologies. The computer program, when executed by the processor, implements a speech signal dereverberation processing method. The display screen of the computer device may be a liquid crystal display or an electronic-ink display, and the input apparatus may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.

Those skilled in the art can understand that the structures shown in FIG. 16 and FIG. 17 are merely block diagrams of partial structures related to the solution of this application and do not constitute a limitation on the computer device to which the solution applies; a specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different component arrangement.
In an embodiment, a computer device is further provided, including a memory and a processor; the memory stores a computer program, and the processor implements the steps in the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, storing a computer program that implements the steps in the above method embodiments when executed by a processor.

In an embodiment, a computer program product or computer-readable instructions are provided; the computer-readable instructions are stored in a computer-readable storage medium. The processor of a computer device reads the computer-readable instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps in the above method embodiments.

Those of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be completed by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided by this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments have been described; however, as long as such combinations are not contradictory, they should be considered within the scope of this specification.

The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within its scope of protection. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A speech signal dereverberation processing method, executed by a computer device, the method comprising:
    acquiring an original speech signal;
    extracting an amplitude spectrum feature and a phase spectrum feature of a current frame of the original speech signal;
    extracting a subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determining, by a first reverberation predictor according to the subband amplitude spectrum, a reverberation intensity index corresponding to the current frame;
    determining, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, a pure speech subband spectrum corresponding to the current frame; and
    performing signal conversion on the pure speech subband spectrum and the phase spectrum feature corresponding to the current frame to obtain a dereverberated pure speech signal.
  2. The method according to claim 1, wherein determining, by the first reverberation predictor according to the subband amplitude spectrum, the reverberation intensity index corresponding to the current frame comprises:
    predicting, by the first reverberation predictor, a pure speech energy ratio corresponding to the subband amplitude spectrum; and
    determining the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio.
  3. The method according to claim 2, wherein predicting, by the first reverberation predictor, the pure speech energy ratio corresponding to the subband amplitude spectrum comprises:
    extracting dimensional features of the subband amplitude spectrum through an input layer of the first reverberation predictor; and
    extracting characterization information of the subband amplitude spectrum according to the dimensional features through a prediction layer of the first reverberation predictor, and determining the pure speech energy ratio of the subband amplitude spectrum according to the characterization information; and
    wherein determining the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio comprises:
    outputting, through an output layer of the first reverberation predictor, the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio corresponding to the subband amplitude spectrum.
  4. The method according to claim 1, wherein determining, by the second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, the pure speech subband spectrum corresponding to the current frame comprises:
    determining, by the second reverberation predictor according to the amplitude spectrum feature of the current frame, a posterior signal-to-interference ratio of the current frame;
    determining a prior signal-to-interference ratio of the current frame according to the posterior signal-to-interference ratio and the reverberation intensity index; and
    performing filtering and enhancement on the subband amplitude spectrum of the current frame based on the prior signal-to-interference ratio to obtain a pure speech subband amplitude spectrum corresponding to the current frame.
  5. The method according to claim 4, wherein determining, by the second reverberation predictor according to the amplitude spectrum feature of the current frame, the posterior signal-to-interference ratio of the current frame comprises:
    extracting, through the second reverberation predictor, a steady-state noise amplitude spectrum corresponding to each subband of the current frame;
    extracting, through the second reverberation predictor, a steady-state reverberation amplitude spectrum corresponding to each subband of the current frame; and
    determining the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.
  6. The method according to claim 5, wherein determining the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum comprises:
    obtaining a pure speech amplitude spectrum of a previous frame; and
    estimating, based on the pure speech amplitude spectrum of the previous frame, the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.
  7. The method according to claim 1, wherein extracting the amplitude spectrum feature and the phase spectrum feature corresponding to the current frame of the original speech signal comprises:
    performing framing and windowing on the original speech signal to obtain the amplitude spectrum feature and the phase spectrum feature corresponding to the current frame of the original speech signal; and
    wherein extracting the subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame comprises:
    obtaining a preset band coefficient, and performing band division on the amplitude spectrum feature of the current frame according to the band coefficient to obtain the subband amplitude spectrum corresponding to the current frame.
  8. The method according to claim 1, wherein performing signal conversion on the pure speech subband spectrum and the phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal comprises:
    performing an inverse constant-Q transform on the pure speech subband spectrum according to a preset band coefficient to obtain a pure speech amplitude spectrum corresponding to the current frame; and
    performing time-frequency conversion on the pure speech amplitude spectrum and the phase spectrum feature corresponding to the current frame to obtain the dereverberated pure speech signal.
  9. The method according to claim 1, wherein the first reverberation predictor is trained through the following steps:
    acquiring reverberant speech data and pure speech data, and generating training sample data from the reverberant speech data and the pure speech data;
    determining an energy ratio of reverberation to pure speech of the reverberant speech data and the pure speech data as a training target;
    extracting a reverberant-band amplitude spectrum corresponding to the reverberant speech data, and extracting a pure-speech-band amplitude spectrum of the pure speech data; and
    training the first reverberation predictor using the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target.
  10. The method according to claim 9, wherein training the first reverberation predictor using the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target comprises:
    inputting the reverberant-band amplitude spectrum and the pure-speech-band amplitude spectrum into a preset network model to obtain a training result; and
    adjusting, based on a difference between the training result and the training target, parameters of the preset neural network model and continuing training, ending training when the training condition is met, to obtain the required first reverberation predictor.
  11. A speech signal dereverberation processing apparatus, the apparatus comprising:
    a speech signal processing module, configured to acquire an original speech signal and extract an amplitude spectrum feature and a phase spectrum feature of a current frame of the original speech signal;
    a first reverberation prediction module, configured to extract a subband amplitude spectrum from the amplitude spectrum feature corresponding to the current frame, and determine, by a first reverberation predictor according to the subband amplitude spectrum, a reverberation intensity index corresponding to the current frame;
    a second reverberation prediction module, configured to determine, by a second reverberation predictor according to the subband amplitude spectrum and the reverberation intensity index, a pure speech subband spectrum corresponding to the current frame; and
    a speech signal conversion module, configured to perform signal conversion on the pure speech subband spectrum and the phase spectrum feature corresponding to the current frame to obtain a dereverberated pure speech signal.
  12. The apparatus according to claim 11, wherein the first reverberation prediction module is further configured to extract dimensional features of the subband amplitude spectrum through an input layer of the first reverberation predictor; extract characterization information of the subband amplitude spectrum according to the dimensional features through a prediction layer of the first reverberation predictor, and determine a pure speech energy ratio of the subband amplitude spectrum according to the characterization information; and output, through an output layer of the first reverberation predictor, the reverberation intensity index corresponding to the current frame according to the pure speech energy ratio corresponding to the subband amplitude spectrum.
  13. The apparatus according to claim 11, wherein the second reverberation prediction module is further configured to determine, by the second reverberation predictor according to the amplitude spectrum feature of the current frame, a posterior signal-to-interference ratio of the current frame; determine a prior signal-to-interference ratio of the current frame according to the posterior signal-to-interference ratio and the reverberation intensity index; and perform filtering and enhancement on the subband amplitude spectrum of the current frame based on the prior signal-to-interference ratio to obtain a pure speech subband amplitude spectrum corresponding to the current frame.
  14. The apparatus according to claim 13, wherein the second reverberation prediction module is further configured to extract, through the second reverberation predictor, a steady-state noise amplitude spectrum corresponding to each subband of the current frame; extract, through the second reverberation predictor, a steady-state reverberation amplitude spectrum corresponding to each subband of the current frame; and determine the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.
  15. The apparatus according to claim 14, wherein the second reverberation prediction module is further configured to obtain a pure speech amplitude spectrum of a previous frame; and estimate, based on the pure speech amplitude spectrum of the previous frame, the posterior signal-to-interference ratio of the current frame according to the steady-state noise amplitude spectrum, the steady-state reverberation amplitude spectrum, and the subband amplitude spectrum.
  16. The apparatus according to claim 11, wherein the speech signal conversion module is further configured to perform an inverse constant-Q transform on the pure speech subband spectrum according to a preset band coefficient to obtain a pure speech amplitude spectrum corresponding to the current frame; and perform time-frequency conversion on the pure speech amplitude spectrum and the phase spectrum feature to obtain the dereverberated pure speech signal.
  17. The apparatus according to claim 11, further comprising a reverberation predictor training module, configured to acquire reverberant speech data and pure speech data, and generate training sample data from the reverberant speech data and the pure speech data; determine an energy ratio of reverberation to pure speech of the reverberant speech data and the pure speech data as a training target; extract a reverberant-band amplitude spectrum corresponding to the reverberant speech data, and extract a pure-speech-band amplitude spectrum of the pure speech data; and train the first reverberation predictor using the reverberant-band amplitude spectrum, the pure-speech-band amplitude spectrum, and the training target.
  18. The apparatus according to claim 17, wherein the reverberation predictor training module is further configured to input the reverberant-band amplitude spectrum and the pure-speech-band amplitude spectrum into a preset network model to obtain a training result; and adjust, based on a difference between the training result and the training target, parameters of the preset neural network model and continue training, ending training when the training condition is met, to obtain the required first reverberation predictor.
  19. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 10.
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.
PCT/CN2021/076465 2020-04-01 2021-02-10 Speech signal dereverberation processing method and apparatus, computer device, and storage medium WO2021196905A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/685,042 US20220230651A1 (en) 2020-04-01 2022-03-02 Voice signal dereverberation processing method and apparatus, computer device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010250009.3 2020-04-01
CN202010250009.3A CN111489760B (zh) 2020-04-01 2020-04-01 Speech signal dereverberation processing method and apparatus, computer device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/685,042 Continuation US20220230651A1 (en) 2020-04-01 2022-03-02 Voice signal dereverberation processing method and apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2021196905A1 true WO2021196905A1 (zh) 2021-10-07

Family

ID=71797635

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/076465 WO2021196905A1 (zh) 2020-04-01 2021-02-10 语音信号去混响处理方法、装置、计算机设备和存储介质

Country Status (3)

Country Link
US (1) US20220230651A1 (zh)
CN (1) CN111489760B (zh)
WO (1) WO2021196905A1 (zh)


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489760B (zh) * 2020-04-01 2023-05-16 腾讯科技(深圳)有限公司 Speech signal dereverberation processing method and apparatus, computer device, and storage medium
CN112542176B (zh) * 2020-11-04 2023-07-21 北京百度网讯科技有限公司 Signal enhancement method and apparatus, and storage medium
CN112542177B (zh) * 2020-11-04 2023-07-21 北京百度网讯科技有限公司 Signal enhancement method and apparatus, and storage medium
CN112489668B (zh) * 2020-11-04 2024-02-02 北京百度网讯科技有限公司 Dereverberation method and apparatus, electronic device, and storage medium
CN113555032B (zh) * 2020-12-22 2024-03-12 腾讯科技(深圳)有限公司 Multi-speaker scenario recognition and network training method and apparatus
CN112687283B (zh) * 2020-12-23 2021-11-19 广州智讯通信***有限公司 Voice equalization method and apparatus based on a command-and-dispatch system, and storage medium
CN113345461A (zh) * 2021-04-26 2021-09-03 北京搜狗科技发展有限公司 Speech processing method and apparatus, and apparatus for speech processing
CN113112998B (zh) * 2021-05-11 2024-03-15 腾讯音乐娱乐科技(深圳)有限公司 Model training method, reverberation effect reproduction method, device, and readable storage medium
CN115481649A (zh) * 2021-05-26 2022-12-16 中兴通讯股份有限公司 Signal filtering method and apparatus, storage medium, and electronic apparatus
CN113823314B (zh) * 2021-08-12 2022-10-28 北京荣耀终端有限公司 Speech processing method and electronic device
CN113835065B (zh) * 2021-09-01 2024-05-17 深圳壹秘科技有限公司 Deep-learning-based sound source direction determination method, apparatus, device, and medium
CN114299977B (zh) * 2021-11-30 2022-11-25 北京百度网讯科技有限公司 Reverberant speech processing method and apparatus, electronic device, and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120082323A1 (en) * 2010-09-30 2012-04-05 Kenji Sato Sound signal processing device
CN102739886A (zh) * 2011-04-01 2012-10-17 中国科学院声学研究所 Stereophonic echo cancellation method based on echo spectrum estimation and speech presence probability
CN102750956A (zh) * 2012-06-18 2012-10-24 歌尔声学股份有限公司 Method and device for single-channel speech dereverberation
US20130231923A1 (en) * 2012-03-05 2013-09-05 Pierre Zakarauskas Voice Signal Enhancement
CN106157964A (zh) * 2016-07-14 2016-11-23 西安元智***技术有限责任公司 Method for determining system delay in echo cancellation
CN106340292A (zh) * 2016-09-08 2017-01-18 河海大学 Speech enhancement method based on continuous noise estimation
CN108986799A (zh) * 2018-09-05 2018-12-11 河海大学 Reverberation parameter estimation method based on cepstral filtering
CN109119090A (zh) * 2018-10-30 2019-01-01 Oppo广东移动通信有限公司 Speech processing method and apparatus, storage medium, and electronic device
CN109243476A (zh) * 2018-10-18 2019-01-18 电信科学技术研究院有限公司 Adaptive estimation method and apparatus for late-reverberation power spectrum in reverberant speech signals
CN110148419A (zh) * 2019-04-25 2019-08-20 南京邮电大学 Speech separation method based on deep learning
CN110211602A (zh) * 2019-05-17 2019-09-06 北京华控创为南京信息技术有限公司 Intelligent speech enhancement communication method and apparatus
CN111489760A (zh) * 2020-04-01 2020-08-04 腾讯科技(深圳)有限公司 Speech signal dereverberation processing method and apparatus, computer device, and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8848933B2 (en) * 2008-03-06 2014-09-30 Nippon Telegraph And Telephone Corporation Signal enhancement device, method thereof, program, and recording medium
US8218780B2 (en) * 2009-06-15 2012-07-10 Hewlett-Packard Development Company, L.P. Methods and systems for blind dereverberation
CN105792074B (zh) * 2016-02-26 2019-02-05 Northwestern Polytechnical University Speech signal processing method and apparatus
CN105931648B (zh) * 2016-06-24 2019-05-03 Baidu Online Network Technology (Beijing) Co., Ltd. Audio signal dereverberation method and apparatus
US11373667B2 (en) * 2017-04-19 2022-06-28 Synaptics Incorporated Real-time single-channel speech enhancement in noisy and time-varying environments
CN107346658B (zh) * 2017-07-14 2020-07-28 Shenzhen Yongshunzhi Information Technology Co., Ltd. Reverberation suppression method and apparatus
US10283140B1 (en) * 2018-01-12 2019-05-07 Alibaba Group Holding Limited Enhancing audio signals using sub-band deep neural networks
CN110136733B (zh) * 2018-02-02 2021-05-25 Tencent Technology (Shenzhen) Co., Ltd. Audio signal dereverberation method and apparatus

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120082323A1 (en) * 2010-09-30 2012-04-05 Kenji Sato Sound signal processing device
CN102739886A (zh) * 2011-04-01 2012-10-17 Institute of Acoustics, Chinese Academy of Sciences Stereo echo cancellation method based on echo spectrum estimation and speech presence probability
US20130231923A1 (en) * 2012-03-05 2013-09-05 Pierre Zakarauskas Voice Signal Enhancement
CN102750956A (zh) * 2012-06-18 2012-10-24 Goertek Inc. Single-channel speech dereverberation method and apparatus
CN106157964A (zh) * 2016-07-14 2016-11-23 Xi'an Yuanzhi System Technology Co., Ltd. Method for determining system delay in echo cancellation
CN106340292A (zh) * 2016-09-08 2017-01-18 Hohai University Speech enhancement method based on continuous noise estimation
CN108986799A (zh) * 2018-09-05 2018-12-11 Hohai University Reverberation parameter estimation method based on cepstral filtering
CN109243476A (zh) * 2018-10-18 2019-01-18 China Academy of Telecommunications Technology Adaptive estimation method and apparatus for the late reverberation power spectrum in reverberant speech signals
CN109119090A (zh) * 2018-10-30 2019-01-01 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Voice processing method and apparatus, storage medium and electronic device
CN110148419A (zh) * 2019-04-25 2019-08-20 Nanjing University of Posts and Telecommunications Deep learning-based speech separation method
CN110211602A (zh) * 2019-05-17 2019-09-06 Beijing Huakong Chuangwei Nanjing Information Technology Co., Ltd. Intelligent speech enhancement communication method and apparatus
CN111489760A (zh) * 2020-04-01 2020-08-04 Tencent Technology (Shenzhen) Co., Ltd. Voice signal dereverberation processing method and apparatus, computer device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116471A (zh) * 2022-04-28 2022-09-27 Tencent Technology (Shenzhen) Co., Ltd. Audio signal processing method and apparatus, training method, device and medium
CN115116471B (zh) * 2022-04-28 2024-02-13 Tencent Technology (Shenzhen) Co., Ltd. Audio signal processing method and apparatus, training method, device and medium

Also Published As

Publication number Publication date
CN111489760A (zh) 2020-08-04
CN111489760B (zh) 2023-05-16
US20220230651A1 (en) 2022-07-21

Similar Documents

Publication Publication Date Title
WO2021196905A1 (zh) Voice signal dereverberation processing method and apparatus, computer device and storage medium
US11456005B2 (en) Audio-visual speech separation
US20200066296A1 (en) Speech Enhancement And Noise Suppression Systems And Methods
CN104520925B (zh) 噪声降低增益的百分位滤波
CN110853664B (zh) 评估语音增强算法性能的方法及装置、电子设备
Westermann et al. Binaural dereverberation based on interaural coherence histograms
CN114203163A (zh) 音频信号处理方法及装置
CN111696567B (zh) 用于远场通话的噪声估计方法及***
US20230317096A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN114338623B (zh) 音频的处理方法、装置、设备及介质
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
WO2022166710A1 (zh) 语音增强方法、装置、设备及存储介质
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
US20240177726A1 (en) Speech enhancement
CN114898762A (zh) 基于目标人的实时语音降噪方法、装置和电子设备
US11380312B1 (en) Residual echo suppression for keyword detection
CN112151055A (zh) 音频处理方法及装置
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN114023352B (zh) 一种基于能量谱深度调制的语音增强方法及装置
WO2022166738A1 (zh) 语音增强方法、装置、设备及存储介质
Shankar et al. Real-time single-channel deep neural network-based speech enhancement on edge devices
US11404055B2 (en) Simultaneous dereverberation and denoising via low latency deep learning
WO2023287782A1 (en) Data augmentation for speech enhancement
CN114783455A (zh) 用于语音降噪的方法、装置、电子设备和计算机可读介质
CN114758668A (zh) 语音增强模型的训练方法和语音增强方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21780688

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS TO RULE 112(1) EPC - FORM 1205A (16.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21780688

Country of ref document: EP

Kind code of ref document: A1