WO2020042706A1 - A deep learning-based echo cancellation method - Google Patents

A deep learning-based echo cancellation method

Info

Publication number
WO2020042706A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
short
echo cancellation
neural network
long
Prior art date
Application number
PCT/CN2019/090528
Other languages
English (en)
French (fr)
Inventor
张�浩
马重
Original Assignee
大象声科(深圳)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大象声科(深圳)科技有限公司
Publication of WO2020042706A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10KSOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M9/00Arrangements for interconnection not involving centralised switching
    • H04M9/08Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic

Definitions

  • the present disclosure relates to the field of computer application technology, and in particular, to an echo cancellation method, device, electronic device, and storage medium based on deep learning.
  • when the loudspeaker and the microphone are coupled, the microphone picks up the signal emitted by the loudspeaker and its reverberation, thereby generating an echo.
  • conference calls, speakerphones, and mobile communications all suffer from echo problems.
  • Echo cancellation faces many problems, such as double talk, background noise and non-linear distortion.
  • double-talk is a typical conversation mode in a communication system, in which the speakers at both ends talk at the same time.
  • near-end speech signals will seriously affect the convergence of adaptive algorithms and may cause them to diverge.
  • the signal received at the microphone contains not only echo and near-end speech signals, but also background noise.
  • traditionally, echo cancellation adaptively estimates the acoustic impulse response between the loudspeaker and the microphone with a finite impulse response (FIR) filter, and then uses a post-filter to suppress the background noise and the residual echo left after cancellation.
  • FIR: finite impulse response
  • AEC: Acoustic Echo Cancellation
  • the present disclosure provides a method, device, electronic device, and storage medium for echo cancellation based on deep learning.
  • a method for echo cancellation based on deep learning including:
  • the step of extracting an acoustic feature from the received microphone signal includes:
  • the microphone signal includes a near-end signal and a far-end signal
  • the step of normalizing the spectral amplitude vectors to form acoustic features includes:
  • Acoustic features are formed by merging the spectral amplitude vectors of the current time frame and the past time frame and normalizing them.
  • the method for constructing the pre-trained recurrent neural network model with long short-term memory includes:
  • determining speakers' voices used for training as the near-end and far-end (reference) signals, and collecting these voices as the far-end and near-end signals to establish a speech training set, where the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;
  • training on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
  • the step of training on the speech training set through the recurrent neural network with long short-term memory to construct the model includes:
  • estimating the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, and constructing the recurrent neural network model with long short-term memory.
  • alternatively, the step of training on the speech training set to construct the model may include:
  • performing linear echo cancellation on the microphone signal with a traditional AEC algorithm, and estimating the ideal ratio mask through the recurrent neural network with long short-term memory from the far-end signal and the linear AEC output.
  • the method may further include:
  • an echo cancellation device based on deep learning including:
  • An acoustic feature extraction module configured to extract an acoustic feature from a received input signal, the input signal including a microphone signal and a far-end signal;
  • a ratio mask calculation module configured to iterate the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features;
  • a masking module configured to mask the acoustic features using the ratio mask;
  • a speech synthesis module configured to synthesize the masked acoustic features with the phase of the microphone signal to obtain a near-end signal after echo cancellation.
  • an ideal ratio mask is used as the training target of the recurrent neural network model with long short-term memory.
  • an electronic device including:
  • at least one processor; and
  • a memory connected in communication with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to the first aspect.
  • a computer-readable storage medium for storing a program that, when executed, causes an electronic device to perform the method according to the first aspect.
  • when echo cancellation is performed, acoustic features are extracted from the received microphone signal and iterated through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask, which is then used to mask the acoustic features. The masked acoustic features are synthesized with the phase of the microphone signal to achieve echo cancellation. Because the scheme uses a pre-trained recurrent neural network model with long short-term memory, it can achieve echo cancellation in the presence of background noise, double talk, and nonlinear distortion, which greatly improves the effect and broadens the applicable scenarios of echo cancellation. No post-filter is needed, which effectively simplifies the electronic device and reduces its cost.
  • Fig. 1 is a flow chart showing a method for echo cancellation based on deep learning according to an exemplary embodiment.
  • FIG. 2 is a flowchart of a specific implementation of step S110 in the method for echo cancellation based on deep learning according to the embodiment of FIG. 1.
  • Fig. 3 is a flowchart of a specific implementation of a method for constructing a recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 1.
  • Fig. 4 is a flow chart of echo cancellation according to an exemplary embodiment.
  • FIG. 5 is a flowchart of a specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment shown in FIG. 3.
  • FIG. 6 is another flowchart of a specific implementation of step S123 in the same method.
  • FIG. 7 is yet another flowchart of a specific implementation of step S123 in the same method.
  • Fig. 8 shows spectrograms of a microphone signal (a), a far-end (reference) signal (b), the linear echo cancellation output of a traditional AEC algorithm (c), and the LSTM3 output signal (d), collected with a smartphone, according to an exemplary embodiment.
  • Fig. 9 is a block diagram showing an apparatus for echo cancellation based on deep learning according to an exemplary embodiment.
  • Fig. 10 is a block diagram of an acoustic feature extraction module 110 in a deep learning-based echo cancellation device according to the embodiment shown in Fig. 9.
  • FIG. 11 is a block diagram of the ratio mask calculation module 120 according to the embodiment shown in FIG. 9.
  • Fig. 12 is a block diagram of a model construction sub-module 123 shown in the corresponding embodiment of Fig. 11.
  • FIG. 13 is another block diagram of the model construction sub-module 123 shown in the embodiment corresponding to FIG. 11.
  • FIG. 14 is another block diagram of the model construction sub-module 123 shown in the embodiment corresponding to FIG. 11.
  • Fig. 1 is a flow chart showing a method for echo cancellation based on deep learning according to an exemplary embodiment.
  • the method for echo cancellation based on deep learning can be used in electronic devices such as smart phones and computers.
  • the method for echo cancellation based on deep learning may include steps S110, S120, S130, and S140.
  • Step S110: extract acoustic features from the received microphone signal, the microphone signal including a near-end signal and a far-end signal (i.e., an echo signal).
  • the microphone signal is a sound signal received during echo cancellation.
  • a recording device such as a microphone collects the near-end signal and the echo signal; that is, the microphone signal includes a near-end signal and a far-end signal (i.e., an echo signal).
  • when an electronic device performs echo cancellation, it can receive sound signals collected by a recording device such as a microphone, receive sound signals sent by other electronic devices, or receive sound signals in other ways not enumerated here.
  • for example, during a conference call, a recording device such as a microphone collects sound signals.
  • the sound signals collected by the recording device include not only the near-end signal in the room where the microphone is located, but also the far-end signal transmitted from the far end and played through the loudspeaker.
  • optionally, the recording device collects the input signal at a sampling rate of 16 kHz.
  • Acoustic features are data features that can characterize a sound signal.
  • when extracting acoustic features from the received sound signal, the short-time Fourier transform (STFT) may be used, the wavelet transform may be used, or the acoustic features may be extracted in other forms.
  • step S110 may include steps S111, S112, and S113.
  • step S111 the received microphone signal is divided into time frames according to a preset time period.
  • the preset time period is a preset time interval period, and the sound signal is divided into multiple time frames according to the preset time period.
  • the received microphone signal is divided into time frames according to a preset time period, with an overlap of half the preset time period between every two adjacent time frames.
  • the received sound signal is divided into time frames of 20 milliseconds each, with a 10-millisecond overlap between every two adjacent time frames. A 320-point STFT is then applied to each time frame of the input signal, which yields 161 frequency bins.
  • step S112 a spectrum amplitude vector is extracted from the time frame.
  • step S113 the spectrum amplitude vector is normalized to form an acoustic feature.
  • STFT is applied to each time frame to extract a spectral amplitude vector, and each spectral amplitude vector is subjected to a normalization process to form an acoustic feature.
  • multiple consecutive frames centered on the current time frame are connected into larger vectors to form an acoustic feature to improve the effect of echo cancellation.
  • the spectrum amplitude vectors of the current time frame and the past time frame are combined and normalized to form an acoustic feature.
  • the previous 5 frames and the current time frame are spliced into a unified feature vector as the input of the present invention.
  • the number of past time frames can also be less than 5, improving the real-time performance of the application.
  • when acoustic features are extracted from the sound signal, the signal is divided into time frames according to a preset time period; setting an appropriate period lets the acoustic features extracted from each time frame provide the input of the echo cancellation processing, and selectively merging the spectral amplitude vectors of the current and past time frames into the acoustic features can improve echo cancellation performance.
  • Step S120: iterate the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute the ratio mask of the acoustic features.
  • the ratio mask characterizes the relationship between the input signal and the near-end signal, indicating the trade-off between suppressing the echo and retaining the near-end signal.
  • ideally, after the input signal is masked by the ratio mask, the echo is cancelled and the near-end signal is restored.
  • a recurrent neural network (RNN) with long short-term memory (LSTM, Long Short-Term Memory) is trained in advance.
  • the acoustic features obtained in step S110 are used as the input of the LSTM model, and an iterative operation is performed in the LSTM model to compute the ratio mask for the acoustic features.
  • in this step, the IRM (Ideal Ratio Mask) is used as the target; the IRM of each time-frequency (T-F) unit can be expressed as $\mathrm{IRM}(t,f) = S_{\mathrm{STFT}}(t,f)\,/\,Y_{\mathrm{STFT}}(t,f)$, where $S_{\mathrm{STFT}}(t,f)$ and $Y_{\mathrm{STFT}}(t,f)$ are the magnitudes of the near-end signal and of the microphone signal in that time-frequency unit, respectively.
  • the ideal ratio mask is predicted during supervised training, and the ratio mask is then used to mask the acoustic features to obtain the near-end signal after echo cancellation.
  • Step S130: mask the acoustic features using the ratio mask.
  • Step S140: synthesize the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.
  • after training, the trained LSTM model is used directly to suppress echo and background noise: the trained LSTM model operates on an input waveform to produce an estimated ratio mask, which is then used to weight (or mask) the echo-bearing acoustic features and produce an echo-cancelled near-end signal.
  • the masked spectral amplitude vectors are sent, together with the phase of the microphone signal, to the inverse Fourier transform to derive the near-end signal in the time domain.
  • in this way, acoustic features are extracted from the received input signal, iterated through a pre-trained recurrent neural network model with long short-term memory to compute their ratio mask, and masked with that ratio mask; the masked acoustic features are then synthesized with the phase of the microphone signal to achieve echo cancellation. Because the scheme uses a pre-trained recurrent neural network model with long short-term memory, it can achieve echo cancellation in the presence of background noise, double talk, and nonlinear distortion, greatly improving the effect and applicable scenarios of echo cancellation; no post-filter is needed, which effectively simplifies the electronic device and reduces its cost.
  • Fig. 3 is a flowchart of a specific implementation of a method for constructing a recurrent neural network model with long short-term memory according to the embodiment of Fig. 1.
  • the method for constructing the recurrent neural network model with long short-term memory may include steps S121, S122, and S123.
  • step S121 the speaker voice during training is determined as the near-end and far-end (reference) signals.
  • training is performed by using various male and female voices.
  • a preset number of speakers' voices are randomly selected from the TIMIT corpus (The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, built jointly by Texas Instruments, MIT, and SRI International).
  • the TIMIT dataset is sampled at 16 kHz and contains 6,300 sentences in total.
  • step S122 the spoken voice is collected as the near-end and far-end reference signals, and a speech training set is established based on the reference signals.
  • the echo signal is either actually recorded through a microphone from the far-end signal or artificially synthesized.
  • the speech training set consists of near-end, far-end reference, and microphone signals.
  • the microphone signal is a mixture of a near-end signal and an echo signal.
  • 100 pairs of speakers are randomly selected from the 630 speakers in the TIMIT dataset as near-end and far-end speakers (40 male-female, 30 male-male, and 30 female-female pairs).
  • 7 of each speaker's 10 utterances are used to generate multiple microphone signals.
  • the microphone signals are mixtures of randomly selected near-end utterances and the echo signals of randomly selected far-end utterances.
  • the remaining 3 utterances are used to generate 300 test microphone signals.
  • the entire training set lasts about 50 hours.
  • Step S123: train on the speech training set through a recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
  • LSTM is a temporal recurrent neural network, first published in 1997. Owing to its unique design, LSTM is suited to processing and predicting important events with very long intervals and delays in a time series.
  • LSTMs usually perform better than other temporal recurrent neural networks and Hidden Markov Models (HMM), for example in unsegmented continuous handwriting recognition.
  • in 2009, an artificial neural network model built with LSTM won the ICDAR handwriting recognition competition.
  • LSTM is also commonly used for automatic speech recognition; in 2013 it set a record 17.7% error rate on the TIMIT natural speech database.
  • as a nonlinear model, LSTM can be used as a complex nonlinear unit to construct larger deep neural networks.
  • LSTM is a specific type of RNN that can effectively capture long-term context. Compared with the traditional RNN, LSTM alleviates the vanishing and exploding gradient problems that arise as training unfolds over time.
  • the memory cell of an LSTM block has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much current information should be added to the memory cell, the forget gate controls how much previous information should be retained, and the output gate controls whether information is output.
  • the LSTM can be described by a mathematical formula as follows.
  • $i_t$, $f_t$, and $o_t$ are the outputs of the input gate, forget gate, and output gate, respectively.
  • $x_t$ and $h_t$ denote the input features and hidden activations at time $t$, respectively.
  • $z_t$ and $c_t$ denote the block input and the memory cell, respectively.
  • $b_i$, $b_f$, $b_o$, and $b_z$ are the biases of the input gate, forget gate, output gate, and block input, respectively.
  • the symbol $\odot$ denotes element-wise multiplication.
  • the input and forget gates are computed from the previous activations and the current input, and perform context-sensitive updates of the memory cell.
  • Fig. 4 is a flow chart of echo cancellation according to an exemplary embodiment.
  • the input is the received input signal, and the output is the near-end signal after echo cancellation.
  • "1" in the figure marks the steps involved only during training, "2" marks the steps of the prediction (inference) stage, and "3" marks the steps shared by training and prediction.
  • the present invention uses an ideal ratio mask (IRM) as the training target. The IRM is obtained by comparing the STFT of the microphone signal with the STFT of the corresponding near-end signal.
  • during training, the RNN with LSTM estimates the ratio mask for each input signal (comprising the microphone signal and the far-end signal), and the MSE (mean square error) between the estimate and the IRM is computed. Over repeated iterations, the MSE on the entire training set is minimized, and each training sample is used only once per iteration.
  • after training, the trained LSTM is used directly to suppress echo and background noise. Specifically, the trained LSTM processes the input signal and computes a ratio mask, the computed ratio mask is applied to the input signal, and the near-end signal after echo cancellation is finally resynthesized.
  • the output at the top passes through a sigmoid function (see Figure 4) to yield the prediction of the ratio mask, which is compared with the IRM; the comparison produces the MSE error used to adjust the LSTM weights.
  • FIG. 5 is a flowchart of a specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment shown in FIG. 3.
  • step S123 may include steps S1231 and S1232.
  • Step S1231: extract the acoustic features of the microphone signal and the far-end signal, respectively.
  • Step S1232: according to the acoustic features of the microphone signal and the far-end signal, estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
  • FIG. 6 is another flowchart of a specific implementation of step S123 according to the embodiment shown in FIG. 3.
  • step S123 may include steps S1233, S1234, and S1235.
  • Step S1233: perform linear echo cancellation on the microphone signal with the traditional AEC algorithm.
  • a conventional linear AEC algorithm processes the microphone signal in advance, and the AEC output is used as an input signal of the LSTM to construct the recurrent neural network model with long short-term memory.
  • Step S1234: extract acoustic features from the far-end signal and the linear AEC output, respectively.
  • Step S1235: according to the acoustic features of the far-end signal and the linear AEC output, estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
  • FIG. 7 is yet another flowchart of a specific implementation of step S123 according to the embodiment shown in FIG. 6.
  • step S123 may further include steps S1236 and S1237.
  • Step S1236: extract acoustic features from the far-end signal, the microphone signal, and the linear AEC output, respectively.
  • Step S1237: according to the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
  • through steps S1231 and S1232, the microphone signal and the far-end signal serve as the input signals, the recurrent neural network with long short-term memory estimates the ideal ratio mask for echo cancellation, and the resulting model is called LSTM1.
  • through steps S1233, S1234, and S1235, the microphone signal is first processed by the traditional AEC algorithm to obtain the linear AEC output; the linear AEC output and the far-end signal serve as the input signals, and the resulting model is called LSTM2.
  • through steps S1233, S1236, and S1237, the far-end signal, the microphone signal, and the linear AEC output serve as the input signals, and the resulting model is called LSTM3.
  • compared with LSTM1, LSTM3 further improves echo cancellation on the received input signal by using the output of the traditional AEC algorithm as an additional feature.
  • Table 1 shows the results of three performance metrics, STOI (Short-Time Objective Intelligibility), PESQ (Perceptual Evaluation of Speech Quality), and ERLE (Echo Return Loss Enhancement), when echo cancellation is performed with the LSTM1, LSTM2, and LSTM3 models.
  • the three models LSTM1, LSTM2, and LSTM3 used here all have two hidden layers with 512 units per layer. "None" is the result for the unprocessed signal; "Ideal" is the result for the ideal ratio mask, which can be regarded as an upper bound on the best achievable result.
  • Table 1: AEC results of the tested systems in STOI, PESQ, and ERLE
  • compared with the traditional AEC algorithm, the three models LSTM1, LSTM2, and LSTM3 achieve better echo cancellation. Combining the traditional AEC algorithm with deep learning further improves system performance, and LSTM3 improves STOI significantly more than LSTM2.
  • FIG. 8 shows spectrograms of the microphone signal and the near-end signal recorded with a smartphone according to an exemplary embodiment.
  • Figure 8(a) shows the spectrogram of the microphone signal;
  • Figure 8(b) shows the spectrogram of the corresponding near-end signal;
  • Figures 8(c) and 8(d) compare the results of echo cancellation with the traditional linear AEC algorithm and with the LSTM3 model:
  • FIG. 8(c) shows the spectrogram of the linear AEC output;
  • FIG. 8(d) shows the spectrogram of the near-end signal obtained by LSTM3 after echo cancellation. The LSTM3 output is very similar to the clean near-end signal, which shows that the proposed method preserves the near-end signal well, i.e., it can suppress echo with nonlinear distortion as well as background noise.
  • with the method described above, performing echo cancellation on the input signal through the constructed recurrent neural network model with long short-term memory can effectively improve echo cancellation performance.
  • the following are device embodiments of the present disclosure, which can be used to execute the embodiments of the deep learning-based echo cancellation method described above. For details not disclosed in the device embodiments, refer to the method embodiments.
  • Fig. 9 is a block diagram of a deep learning-based echo cancellation device according to an exemplary embodiment.
  • the device includes, but is not limited to, an acoustic feature extraction module 110, a ratio mask calculation module 120, a masking module 130, and a speech synthesis module 140.
  • An acoustic feature extraction module 110 configured to extract an acoustic feature from a received input signal, where the input signal includes a microphone signal and a far-end signal;
  • the ratio mask calculation module 120 is configured to iterate the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features;
  • a masking module 130 configured to mask the acoustic features using the ratio mask;
  • the speech synthesis module 140 is configured to synthesize the masked acoustic characteristics and the phase of the microphone signal to obtain a near-end signal after echo cancellation.
  • the acoustic feature extraction module 110 described in FIG. 9 includes, but is not limited to, a time frame division unit 111, a spectral amplitude vector extraction unit 112, and an acoustic feature formation unit 113.
  • the time frame dividing unit 111 is configured to divide the received microphone signal into time frames according to a preset time period
  • a spectrum amplitude vector extraction unit 112 configured to extract a spectrum amplitude vector from the time frame
  • the acoustic feature forming unit 113 is configured to perform normalization processing on the spectral amplitude vector to form an acoustic feature.
  • the time frame dividing unit 111 described in FIG. 10 includes, but is not limited to, a time frame dividing subunit.
  • a time frame division subunit is configured to divide the received microphone signal into time frames according to a preset time period, with an overlap of half the preset time period between every two adjacent time frames.
  • the acoustic feature forming unit 113 described in FIG. 10 includes, but is not limited to, a multi-time frame normalization sub-unit.
  • the multi-time frame normalization sub-unit is used for combining the current time frame and the past time frame's spectral amplitude vectors to perform normalization processing to form acoustic features.
  • the ratio mask calculation module 120 described in FIG. 9 further includes, but is not limited to, a human voice determination submodule 121, a speech training set establishment submodule 122, and a model construction submodule 123.
  • a human voice determination submodule 121 configured to determine the speakers' voices used for training as the near-end and far-end (reference) signals;
  • a speech training set establishment submodule 122 configured to collect the speakers' voices as the far-end and near-end signals and establish a speech training set from them, where the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;
  • a model construction submodule 123 configured to train on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
  • the model construction submodule 123 described in FIG. 11 further includes, but is not limited to, a first acoustic feature unit 1231 and a first model construction unit 1232.
  • a first acoustic feature unit 1231 configured to extract the acoustic features of the microphone signal and the far-end signal, respectively;
  • a first model construction unit 1232 configured to estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, according to the acoustic features of the microphone signal and the far-end signal, to construct the recurrent neural network model with long short-term memory.
  • the model construction submodule 123 described in FIG. 11 may further include, but is not limited to, a linear AEC processing unit 1233, a second acoustic feature unit 1234, and a second model construction unit 1235.
  • a linear AEC processing unit 1233 configured to process the microphone signal with a traditional AEC algorithm;
  • a second acoustic feature unit 1234 configured to extract the acoustic features of the far-end signal and the linear AEC output, respectively;
  • a second model construction unit 1235 configured to estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, according to the acoustic features of the far-end signal and the linear AEC output, to construct the recurrent neural network model with long short-term memory.
  • the model construction submodule 123 described in FIG. 11 may further include, but is not limited to, a third acoustic feature unit 1236 and a third model construction unit 1237.
  • a third acoustic feature unit 1236 configured to extract the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, respectively;
  • a third model construction unit 1237 configured to estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, according to the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, to construct the recurrent neural network model with long short-term memory.
  • the present invention further provides an electronic device that executes all or part of the steps of the method for echo cancellation based on deep learning as shown in any one of the above exemplary embodiments.
  • Electronic equipment includes:
  • a memory connected in communication with the processor; wherein,
  • the memory stores readable instructions that, when executed by the processor, implement the method according to any one of the foregoing exemplary embodiments.
  • a storage medium is also provided; the storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure provides a deep learning-based echo cancellation method, apparatus, electronic device, and storage medium, belonging to the field of computer technology. The method includes: extracting acoustic features from a received microphone signal, the microphone signal including a near-end signal and a far-end signal; iterating the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features; masking the acoustic features with the ratio mask; and synthesizing the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation. The above deep learning-based echo cancellation method and apparatus can achieve echo cancellation in the presence of background noise, double talk, and nonlinear distortion, greatly improving the effect and applicable scenarios of echo cancellation. No post-filter is needed, which effectively simplifies the electronic device and reduces its cost.

Description

A deep learning-based echo cancellation method
Technical Field
The present disclosure relates to the field of computer application technology, and in particular to a deep learning-based echo cancellation method, apparatus, electronic device, and storage medium.
Background
In a communication system, when the loudspeaker and the microphone are coupled, the microphone picks up the signal emitted by the loudspeaker together with its reverberation, thereby producing an echo. Conference calls, speakerphones, and mobile communications, for example, all suffer from echo.
Echo cancellation faces many difficulties, such as double talk, background noise, and nonlinear distortion. First, double talk is a typical conversation mode in a communication system: the speakers at the two ends sometimes talk at the same time, and the near-end speech signal severely affects the convergence of adaptive algorithms and may cause them to diverge. In addition, the signal received at the microphone contains not only the echo and the near-end speech signal but also background noise. Traditionally, echo cancellation adaptively estimates the acoustic impulse response between the loudspeaker and the microphone with a finite impulse response (FIR) filter to cancel the echo, and then suppresses the background noise and the residual echo with a post-filter.
The ultimate goal of AEC (Acoustic Echo Cancellation) is to remove the far-end signal completely and transmit only the near-end signal. However, traditional echo cancellation methods model the echo path as a linear system; because of the nonlinearity of components such as power amplifiers and loudspeakers, the far-end signal may suffer nonlinear distortion in practice, which severely degrades the echo cancellation result.
Summary
To solve the technical problems in the related art that echo cancellation performs poorly and requires a post-filter, the present disclosure provides a deep learning-based echo cancellation method, apparatus, electronic device, and storage medium.
In a first aspect, a deep learning-based echo cancellation method is provided, including:
extracting acoustic features from a received microphone signal, the microphone signal including a near-end signal and a far-end signal;
iterating the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features;
masking the acoustic features with the ratio mask;
synthesizing the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.
Optionally, the step of extracting acoustic features from the received microphone signal includes:
dividing the received microphone signal into time frames according to a preset time period, the microphone signal including a near-end signal and a far-end signal;
extracting spectral amplitude vectors from the time frames;
normalizing the spectral amplitude vectors to form the acoustic features.
Optionally, the step of normalizing the spectral amplitude vectors to form the acoustic features includes:
merging the spectral amplitude vectors of the current time frame and past time frames and normalizing them to form the acoustic features.
Optionally, the method for constructing the pre-trained recurrent neural network model with long short-term memory includes:
determining the speakers' voices used for training as the near-end and far-end (reference) signals;
collecting the speakers' voices as the far-end and near-end signals and establishing a speech training set from them, where the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;
training on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
Optionally, the step of training on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory includes:
extracting the acoustic features of the microphone signal and the far-end (echo) signal, respectively;
estimating, according to the acoustic features of the microphone signal and the far-end signal, the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, to construct the recurrent neural network model with long short-term memory.
Optionally, the step of training on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory may also include:
performing linear echo cancellation on the microphone signal with a traditional AEC algorithm;
extracting acoustic features from the far-end signal and from the linear AEC output produced by the traditional AEC algorithm, respectively;
estimating, according to the acoustic features of the far-end signal and the linear AEC output, the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, to construct the recurrent neural network model with long short-term memory.
Optionally, the method may further include:
extracting acoustic features from the far-end signal, the microphone signal, and the linear AEC output, respectively;
estimating, according to the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, to construct the recurrent neural network model with long short-term memory.
In a second aspect, a deep learning-based echo cancellation apparatus is provided, including:
an acoustic feature extraction module, configured to extract acoustic features from a received input signal, the input signal including a microphone signal and a far-end signal;
a ratio mask calculation module, configured to iterate the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features;
a masking module, configured to mask the acoustic features with the ratio mask;
a speech synthesis module, configured to synthesize the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.
Optionally, an ideal ratio mask is used as the training target of the recurrent neural network model with long short-term memory.
In a third aspect, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing a program that, when executed, causes an electronic device to perform the method according to the first aspect.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
When echo cancellation is performed, acoustic features are extracted from the received microphone signal and iterated through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features, which is then used to mask them; the masked acoustic features are synthesized with the phase of the microphone signal to achieve echo cancellation. Because the scheme uses a pre-trained recurrent neural network model with long short-term memory, echo cancellation can be achieved in the presence of background noise, double talk, and nonlinear distortion, greatly improving the effect and applicable scenarios of echo cancellation. No post-filter is needed, which effectively simplifies the electronic device and reduces its cost.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the scope of the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and serve, together with the description, to explain the principles of the present invention.
Fig. 1 is a flowchart of a deep learning-based echo cancellation method according to an exemplary embodiment.
Fig. 2 is a flowchart of a specific implementation of step S110 in the deep learning-based echo cancellation method of the embodiment corresponding to Fig. 1.
Fig. 3 is a flowchart of a specific implementation of the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 1.
Fig. 4 is a schematic flowchart of echo cancellation according to an exemplary embodiment.
Fig. 5 is a flowchart of a specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 3.
Fig. 6 is another flowchart of a specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 3.
Fig. 7 is yet another flowchart of a specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 3.
Fig. 8 shows spectrograms of a microphone signal (a), a far-end (reference) signal (b), the linear echo cancellation output of a traditional AEC algorithm (c), and the LSTM3 output signal (d), collected with a smartphone, according to an exemplary embodiment.
Fig. 9 is a block diagram of a deep learning-based echo cancellation apparatus according to an exemplary embodiment.
Fig. 10 is a block diagram of the acoustic feature extraction module 110 in the deep learning-based echo cancellation apparatus of the embodiment corresponding to Fig. 9.
Fig. 11 is a block diagram of the ratio mask calculation module 120 of the embodiment corresponding to Fig. 9.
Fig. 12 is a block diagram of the model construction submodule 123 of the embodiment corresponding to Fig. 11.
Fig. 13 is another block diagram of the model construction submodule 123 of the embodiment corresponding to Fig. 11.
Fig. 14 is yet another block diagram of the model construction submodule 123 of the embodiment corresponding to Fig. 11.
Detailed Description
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless indicated otherwise. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
Fig. 1 is a flowchart of a deep learning-based echo cancellation method according to an exemplary embodiment. The method can be used in electronic devices such as smartphones and computers. As shown in Fig. 1, the method may include steps S110, S120, S130, and S140.
Step S110: extract acoustic features from the received microphone signal; the microphone signal includes a near-end signal and a far-end signal (i.e., an echo signal).
The microphone signal is the sound signal received when echo cancellation is performed. A recording device such as a microphone collects the near-end signal as well as the echo signal; that is, the microphone signal contains a near-end signal and a far-end signal (i.e., an echo signal).
When performing echo cancellation, an electronic device can receive sound signals collected by a recording device such as a microphone, receive sound signals sent by other electronic devices, or receive sound signals in other ways, which are not enumerated here one by one.
For example, during a conference call, a recording device such as a microphone collects sound signals; the collected sound signals include not only the near-end signal in the room where the microphone is located but also the far-end signal transmitted from the far end and played through the loudspeaker.
Optionally, the recording device collects the input signal at a sampling rate of 16 kHz.
Acoustic features are data features that can characterize a sound signal.
When extracting acoustic features from the received sound signal, the STFT (Short-Time Fourier Transform) may be applied to the sound signal, the wavelet transform may be applied, or the acoustic features may be extracted from the received sound signal in other forms.
Optionally, as shown in Fig. 2, step S110 may include steps S111, S112, and S113.
Step S111: divide the received microphone signal into time frames according to a preset time period.
The preset time period is a preconfigured time interval; the sound signal is divided into multiple time frames according to it.
Optionally, the received microphone signal is divided into time frames according to the preset time period, with an overlap of half the preset time period between every two adjacent time frames.
In a specific exemplary embodiment, the received sound signal is divided into time frames of 20 milliseconds each, with a 10-millisecond overlap between every two adjacent frames. A 320-point STFT is then applied to each time frame of the input signal, which yields 161 frequency bins.
Step S112: extract spectral amplitude vectors from the time frames.
Step S113: normalize the spectral amplitude vectors to form the acoustic features.
In an exemplary embodiment, the STFT is applied to each time frame to extract a spectral amplitude vector, and each spectral amplitude vector is normalized to form an acoustic feature.
Optionally, several consecutive frames centered on the current time frame are concatenated into a larger vector to form the acoustic feature, improving the echo cancellation result.
For example, when normalizing the spectral amplitude vectors, the spectral amplitude vectors of the current time frame and past time frames are merged and normalized to form the acoustic feature. Specifically, the previous 5 frames and the current time frame are spliced into a unified feature vector as the input of the present invention. The number of past frames can also be fewer than 5, which improves the real-time performance of the application. This feature extraction pipeline is sketched below.
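As a concrete illustration of steps S111 to S113, the following Python sketch frames the 16 kHz signal into 20 ms frames with 10 ms overlap, applies a 320-point STFT to obtain 161-bin magnitudes, normalizes them, and splices the previous 5 frames with the current frame. The Hann window and the log-magnitude mean/variance normalization are assumptions added for illustration; the text does not fix the window type or the exact normalization scheme.

```python
import numpy as np

def extract_features(x, frame_len=320, hop=160, n_past=5):
    """Steps S111-S113: 20 ms frames with 10 ms overlap at 16 kHz,
    320-point STFT (161 bins), normalized log magnitudes, and the
    previous 5 frames spliced with the current frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    mags = np.stack([np.abs(np.fft.rfft(window * x[i*hop : i*hop + frame_len]))
                     for i in range(n_frames)])       # (n_frames, 161)
    feats = np.log1p(mags)                            # assumed compression
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    padded = np.vstack([np.zeros((n_past, feats.shape[1])), feats])
    # splice frames [t-5 ... t] into one vector per frame -> (n_frames, 6 * 161)
    return np.hstack([padded[i : i + n_frames] for i in range(n_past + 1)])
```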
Therefore, when extracting acoustic features from the sound signal, the signal is divided into time frames according to a preset time period; setting an appropriate period lets the acoustic features extracted from each time frame provide the input of the echo cancellation processing, and selectively merging the spectral amplitude vectors of the current and past time frames into the acoustic features can improve echo cancellation performance.
Step S120: iterate the acoustic features through the pre-trained recurrent neural network model with long short-term memory to compute the ratio mask of the acoustic features.
The ratio mask characterizes the relationship between the input signal and the near-end signal, indicating the trade-off between suppressing the echo and retaining the near-end signal.
Ideally, after the input signal is masked with the ratio mask, the echo is removed from the input signal and the near-end signal is restored.
The recurrent neural network (RNN, Recurrent Neural Network) with long short-term memory (LSTM, Long Short-Term Memory), hereinafter abbreviated as "LSTM", is trained in advance.
The acoustic features obtained in step S110 are fed as the input of the LSTM model, and an iterative operation is performed in the LSTM model to compute the ratio mask for the acoustic features.
In this step, the IRM (Ideal Ratio Mask) is used as the target of the iterative operation. The IRM of each T-F (time-frequency) unit of the spectrogram can be expressed by the following equation:
$$\mathrm{IRM}(t,f)=\frac{S_{\mathrm{STFT}}(t,f)}{Y_{\mathrm{STFT}}(t,f)}$$
where $S_{\mathrm{STFT}}(t,f)$ and $Y_{\mathrm{STFT}}(t,f)$ are the magnitudes of the near-end signal and of the microphone signal in that time-frequency unit, respectively.
By predicting the ideal ratio mask during supervised training, the ratio mask can then be applied to mask the acoustic features and obtain the near-end signal after echo cancellation; a small sketch of this target follows.
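Continuing the numpy sketch above, the training target of this step can be written as follows. The clipping to [0, 1] is an added assumption so the ratio can serve as a target for a sigmoid output; the text specifies only the ratio itself.

```python
def ideal_ratio_mask(near_mag, mic_mag, eps=1e-8):
    """IRM(t, f) = S_STFT(t, f) / Y_STFT(t, f) for each T-F unit."""
    return np.clip(near_mag / (mic_mag + eps), 0.0, 1.0)
```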
Step S130: mask the acoustic features with the ratio mask.
Step S140: synthesize the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.
After training, during inference or operation, the trained LSTM model is used directly to suppress echo and background noise. Specifically, the trained LSTM model operates on an input waveform to produce an estimated ratio mask, which is then used to weight (or mask) the echo-bearing acoustic features and produce the echo-cancelled near-end signal.
In an exemplary embodiment, the masked spectral amplitude vectors are sent, together with the phase of the microphone signal, to the inverse Fourier transform to derive the corresponding time-domain near-end signal, as sketched below.
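A minimal sketch of steps S130 and S140, continuing the numpy example above: the estimated mask weights the microphone magnitudes, the microphone phase is reattached, and overlap-add of the inverse FFT frames reconstructs the time-domain near-end signal. With Hann analysis windows at 50% overlap, plain overlap-add reconstructs unmodified frames exactly.

```python
def synthesize(mask, mic_stft, frame_len=320, hop=160):
    """Steps S130/S140: mask the magnitudes, reuse the microphone phase,
    and overlap-add the inverse FFT of each frame."""
    masked = mask * np.abs(mic_stft) * np.exp(1j * np.angle(mic_stft))
    frames = np.fft.irfft(masked, n=frame_len)        # (n_frames, 320)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i*hop : i*hop + frame_len] += frame       # overlap-add
    return out
```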
Using the method described above, when performing echo cancellation, acoustic features are extracted from the received input signal, iterated through the pre-trained recurrent neural network model with long short-term memory to compute their ratio mask, and masked with that mask; the masked acoustic features are then synthesized with the phase of the microphone signal to achieve echo cancellation. Because the scheme uses a pre-trained recurrent neural network model with long short-term memory, echo cancellation can be achieved in the presence of background noise, double talk, and nonlinear distortion, greatly improving the effect and applicable scenarios of echo cancellation; no post-filter is needed, which effectively simplifies the electronic device and reduces its cost.
Fig. 3 is a flowchart of a specific implementation of the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 1. As shown in Fig. 3, the method may include steps S121, S122, and S123.
Step S121: determine the speakers' voices used for training as the near-end and far-end (reference) signals.
The training voices can be selected in various ways: specific voices can be chosen in a predetermined manner, or the training voices can be selected at random.
To achieve echo cancellation that is not restricted to the training speakers, a wide variety of male and female voices are used for training.
In an exemplary embodiment, a preset number of speakers' voices are randomly selected from the TIMIT dataset (The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, built jointly by Texas Instruments, MIT, and SRI International).
The TIMIT dataset is sampled at 16 kHz and contains 6,300 sentences in total, with 630 speakers from eight major dialect regions of the United States each uttering 10 given sentences; all sentences are manually segmented and labeled at the phone level. 70% of the speakers are male, and most are white adults.
Step S122: collect the speakers' voices as the near-end and far-end reference signals, and establish a speech training set from them.
The echo signal is either actually recorded through a microphone from the far-end signal or artificially synthesized. The speech training set consists of the near-end, far-end reference, and microphone signals, where the microphone signal is a mixture of the near-end signal and the echo signal.
Optionally, 100 pairs of speakers are randomly selected from the 630 speakers of the TIMIT dataset as near-end and far-end speakers (40 male-female, 30 male-male, and 30 female-female pairs). The 10 utterances of each speaker are recorded at a 16 kHz sampling rate. Seven utterances of each voice are used to generate multiple microphone signals, each mixing a randomly chosen near-end utterance with the echo signal of a randomly chosen far-end utterance; the remaining 3 utterances are used to generate 300 test microphone signals. The entire training set lasts about 50 hours. To further improve generalization to unseen speakers, another 10 pairs of speakers (4 male-female, 3 male-male, and 3 female-female) are randomly selected from the remaining 430 speakers of the TIMIT dataset to generate 100 test mixtures with untrained voices. Echo signals are recorded with a smartphone in a 2.7 × 3 × 4.5 m room, and the recorded echo signals are added to the near-end signals to form the microphone signals.
Step S123: train on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
LSTM is a temporal recurrent neural network, first published in 1997. Owing to its unique design, LSTM is suited to processing and predicting important events separated by very long intervals and delays in a time series.
LSTM usually performs better than other temporal recurrent neural networks and Hidden Markov Models (HMM), for example in unsegmented continuous handwriting recognition. In 2009, an artificial neural network model built with LSTM won the ICDAR handwriting recognition competition. LSTM is also widely used in automatic speech recognition; in 2013 it achieved a record 17.7% error rate on the TIMIT natural speech database. As a nonlinear model, LSTM can serve as a complex nonlinear unit for constructing larger deep neural networks.
LSTM is a specific type of RNN that can effectively capture long-term context. Compared with the traditional RNN, LSTM alleviates the vanishing and exploding gradient problems that arise as training unfolds over time. The memory cell of an LSTM block has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much current information is added to the memory cell, the forget gate controls how much previous information is retained, and the output gate controls whether information is output. Specifically, the LSTM can be described by the following equations.
$$i_t=\sigma(W_{ix}x_t+W_{ih}h_{t-1}+b_i)$$
$$f_t=\sigma(W_{fx}x_t+W_{fh}h_{t-1}+b_f)$$
$$o_t=\sigma(W_{ox}x_t+W_{oh}h_{t-1}+b_o)$$
$$z_t=g(W_{zx}x_t+W_{zh}h_{t-1}+b_z)$$
$$c_t=f_t\odot c_{t-1}+i_t\odot z_t$$
$$h_t=o_t\odot g(c_t)$$
where $i_t$, $f_t$, and $o_t$ are the outputs of the input gate, forget gate, and output gate, respectively; $x_t$ and $h_t$ denote the input features and hidden activations at time $t$; $z_t$ and $c_t$ denote the block input and the memory cell; $\sigma$ is the sigmoid function, $\sigma(x)=1/(1+e^{-x})$; $g$ is the hyperbolic tangent, $g(x)=(e^{x}-e^{-x})/(e^{x}+e^{-x})$; $b_i$, $b_f$, $b_o$, and $b_z$ are the biases of the input gate, forget gate, output gate, and block input, respectively; and $\odot$ denotes element-wise multiplication. The input and forget gates are computed from the previous activations and the current input and perform context-sensitive updates of the memory cell.
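The six equations above map directly onto one step of the recursion; the following Python sketch mirrors them one-to-one. The dict-based weight containers W and b are illustrative, not part of the original text.

```python
def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the equations above; W and b are dicts of
    the input/recurrent weight matrices and gate biases."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))               # logistic sigmoid
    g = np.tanh                                              # block nonlinearity
    i_t = sigma(W['ix'] @ x_t + W['ih'] @ h_prev + b['i'])   # input gate
    f_t = sigma(W['fx'] @ x_t + W['fh'] @ h_prev + b['f'])   # forget gate
    o_t = sigma(W['ox'] @ x_t + W['oh'] @ h_prev + b['o'])   # output gate
    z_t = g(W['zx'] @ x_t + W['zh'] @ h_prev + b['z'])       # block input
    c_t = f_t * c_prev + i_t * z_t                           # element-wise products
    h_t = o_t * g(c_t)
    return h_t, c_t
```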
Fig. 4 is a schematic flowchart of echo cancellation according to an exemplary embodiment. As shown in Fig. 4, the input is the received input signal and the output is the near-end signal after echo cancellation; "1" in the figure marks the steps involved only during training, "2" marks the steps of the prediction (inference) stage, and "3" marks the steps shared by training and prediction. As a supervised learning method, the present invention uses the ideal ratio mask (IRM) as the training target; the IRM is obtained by comparing the STFT of the microphone signal with the STFT of the corresponding near-end signal. In the training stage, the RNN with LSTM estimates the IRM for each input signal (comprising the microphone signal and the far-end signal), and the MSE (Mean Square Error) between the estimate and the IRM is computed. The MSE over the entire training set is minimized through repeated iterations, with each training sample used only once per iteration. After training, during inference or operation, the trained LSTM is used directly to suppress echo and background noise: the trained LSTM processes the input signal and computes a ratio mask, the computed ratio mask is applied to the input signal, and the near-end signal after echo cancellation is finally resynthesized.
The output at the top passes through a sigmoid function (see Fig. 4) to obtain the prediction of the ratio mask, which is compared with the IRM; the comparison produces the MSE error used to adjust the LSTM weights. A compact sketch of this training setup is given below.
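As an illustration of this training setup, the sketch below uses PyTorch (the disclosure does not name a framework). The two 512-unit LSTM layers and the 161-unit sigmoid output follow the configuration reported for LSTM1, LSTM2, and LSTM3 later in the text; the input size of 966 assumes the 6-frame spliced 161-bin features described earlier, and the tensors standing in for a training batch are placeholders.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Two LSTM layers of 512 units and a sigmoid output layer of 161 units
    (one per frequency bin) that predicts the ratio mask per frame."""
    def __init__(self, in_dim=966, hidden=512, n_bins=161):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, feats):                  # feats: (batch, time, in_dim)
        h, _ = self.lstm(feats)
        return torch.sigmoid(self.out(h))      # mask prediction in [0, 1]

model = MaskEstimator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# one iteration with placeholder tensors standing in for a training batch:
feats = torch.randn(8, 100, 966)               # spliced input features
irm = torch.rand(8, 100, 161)                  # ideal ratio mask targets
loss = nn.functional.mse_loss(model(feats), irm)
opt.zero_grad(); loss.backward(); opt.step()   # adjust the LSTM weights
```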
Optionally, Fig. 5 is a flowchart of a specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 3. As shown in Fig. 5, step S123 may include steps S1231 and S1232.
Step S1231: extract the acoustic features of the microphone signal and the far-end signal, respectively.
Step S1232: according to the acoustic features of the microphone signal and the far-end signal, estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
Optionally, Fig. 6 is another flowchart of a specific implementation of step S123 in the same method according to the embodiment corresponding to Fig. 3. As shown in Fig. 6, step S123 may include steps S1233, S1234, and S1235.
Step S1233: perform linear echo cancellation on the microphone signal with the traditional AEC algorithm.
The microphone signal is processed in advance by a traditional linear AEC algorithm, and the AEC output is used as an input signal of the LSTM to construct the recurrent neural network model with long short-term memory.
Step S1234: extract the acoustic features of the far-end signal and the linear AEC output, respectively.
Step S1235: according to the acoustic features of the far-end signal and the linear AEC output, estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
Optionally, Fig. 7 is yet another flowchart of a specific implementation of step S123 in the same method according to the embodiment corresponding to Fig. 3. As shown in Fig. 7, in addition to steps S1233, S1234, and S1235, step S123 may include steps S1236 and S1237.
Step S1236: extract the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, respectively.
Step S1237: according to the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
Through steps S1231 and S1232, the microphone signal and the far-end signal are used as the input signals, and the recurrent neural network with long short-term memory estimates the ideal ratio mask for echo cancellation; the model constructed this way is called LSTM1.
Through steps S1233, S1234, and S1235, the microphone signal is first processed by the traditional AEC algorithm to obtain the AEC output; the linear AEC output and the far-end signal are used as the input signals, and the model constructed this way is called LSTM2.
Through steps S1233, S1236, and S1237, the far-end signal, the microphone signal, and the linear AEC output are used as the input signals, and the model constructed this way is called LSTM3.
Compared with LSTM1, LSTM3 further improves echo cancellation on the received input signal by using the output of the traditional AEC algorithm as an additional feature. The three input configurations are sketched below.
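The only difference among the three variants is which per-frame feature blocks are concatenated before the network. A minimal sketch, continuing the numpy examples above (the helper name is hypothetical):

```python
def assemble_input(mic_feat, far_feat, aec_feat, variant="LSTM3"):
    """Concatenate the per-frame features used by each model variant."""
    if variant == "LSTM1":
        parts = [mic_feat, far_feat]             # microphone + far-end
    elif variant == "LSTM2":
        parts = [aec_feat, far_feat]             # linear AEC output + far-end
    else:                                        # LSTM3: all three streams
        parts = [mic_feat, far_feat, aec_feat]
    return np.concatenate(parts, axis=-1)
```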
Table 1 shows the results of three performance metrics, STOI (Short-Time Objective Intelligibility), PESQ (Perceptual Evaluation of Speech Quality), and ERLE (Echo Return Loss Enhancement), when echo cancellation is performed with the LSTM1, LSTM2, and LSTM3 models. All three models have two hidden layers with 512 units per layer. "None" is the result for the unprocessed signal; "Ideal" is the result for the ideal ratio mask, which can be regarded as an upper bound on the best achievable result.
Table 1: AEC results of the tested systems in STOI, PESQ, and ERLE
[The body of Table 1 appears only as images in the original publication (PCTCN2019090528-appb-000002 and -000003); its numerical values are not reproducible here.]
As shown in Table 1, compared with the traditional AEC algorithm, the three models LSTM1, LSTM2, and LSTM3 achieve better echo cancellation. Combining the traditional AEC algorithm with deep learning further improves system performance, and LSTM3 improves STOI significantly more than LSTM2 does.
To further illustrate the linear AEC results, Fig. 8 shows spectrograms of the microphone signal and the near-end signal recorded with a smartphone according to an exemplary embodiment. Fig. 8(a) shows the spectrogram of the microphone signal; Fig. 8(b) shows the spectrogram of the corresponding near-end signal; Figs. 8(c) and 8(d) compare the spectra after echo cancellation with the LSTM3 model and with the traditional linear AEC algorithm, where Fig. 8(c) shows the spectrogram of the linear AEC output and Fig. 8(d) shows the spectrogram of the near-end signal obtained by LSTM3 after echo cancellation. The output of LSTM3 is very similar to the clean near-end signal, which shows that the proposed method preserves the near-end signal well; that is, it can suppress echo with nonlinear distortion as well as background noise.
With the method described above, performing echo cancellation on the input signal through the constructed recurrent neural network model with long short-term memory effectively improves echo cancellation performance.
The following are apparatus embodiments of the present disclosure, which can be used to execute the above embodiments of the deep learning-based echo cancellation method. For details not disclosed in the apparatus embodiments, refer to the embodiments of the deep learning-based echo cancellation method of the present disclosure.
Fig. 9 is a block diagram of a deep learning-based echo cancellation apparatus according to an exemplary embodiment. The apparatus includes, but is not limited to, an acoustic feature extraction module 110, a ratio mask calculation module 120, a masking module 130, and a speech synthesis module 140.
The acoustic feature extraction module 110 is configured to extract acoustic features from a received input signal, the input signal including a microphone signal and a far-end signal.
The ratio mask calculation module 120 is configured to iterate the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features.
The masking module 130 is configured to mask the acoustic features with the ratio mask.
The speech synthesis module 140 is configured to synthesize the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.
For how the functions of each module of the above apparatus are implemented, see the corresponding steps of the deep learning-based echo cancellation method above; the details are not repeated here.
Optionally, as shown in Fig. 10, the acoustic feature extraction module 110 in Fig. 9 includes, but is not limited to, a time frame division unit 111, a spectral amplitude vector extraction unit 112, and an acoustic feature formation unit 113.
The time frame division unit 111 is configured to divide the received microphone signal into time frames according to a preset time period.
The spectral amplitude vector extraction unit 112 is configured to extract spectral amplitude vectors from the time frames.
The acoustic feature formation unit 113 is configured to normalize the spectral amplitude vectors to form the acoustic features.
Optionally, the time frame division unit 111 in Fig. 10 includes, but is not limited to, a time frame division subunit.
The time frame division subunit is configured to divide the received microphone signal into time frames according to the preset time period, with an overlap of half the preset time period between every two adjacent time frames.
Optionally, the acoustic feature formation unit 113 in Fig. 10 includes, but is not limited to, a multi-time-frame normalization subunit.
The multi-time-frame normalization subunit is configured to merge the spectral amplitude vectors of the current time frame and past time frames and normalize them to form the acoustic features.
Optionally, as shown in Fig. 11, the ratio mask calculation module 120 in Fig. 9 further includes, but is not limited to, a human voice determination submodule 121, a speech training set establishment submodule 122, and a model construction submodule 123.
The human voice determination submodule 121 is configured to determine the speakers' voices used for training as the near-end and far-end (reference) signals.
The speech training set establishment submodule 122 is configured to collect the speakers' voices as the far-end and near-end signals and establish a speech training set from them, where the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal.
The model construction submodule 123 is configured to train on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
Optionally, as shown in Fig. 12, the model construction submodule 123 in Fig. 11 further includes, but is not limited to, a first acoustic feature unit 1231 and a first model construction unit 1232.
The first acoustic feature unit 1231 is configured to extract the acoustic features of the microphone signal and the far-end signal, respectively.
The first model construction unit 1232 is configured to estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, according to the acoustic features of the microphone signal and the far-end signal, to construct the recurrent neural network model with long short-term memory.
Optionally, as shown in Fig. 13, the model construction submodule 123 in Fig. 11 may further include, but is not limited to, a linear AEC processing unit 1233, a second acoustic feature unit 1234, and a second model construction unit 1235.
The linear AEC processing unit 1233 is configured to process the microphone signal with a traditional AEC algorithm.
The second acoustic feature unit 1234 is configured to extract the acoustic features of the far-end signal and the linear AEC output, respectively.
The second model construction unit 1235 is configured to estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, according to the acoustic features of the far-end signal and the linear AEC output, to construct the recurrent neural network model with long short-term memory.
Optionally, as shown in Fig. 14, the model construction submodule 123 in Fig. 11 may further include, but is not limited to, a third acoustic feature unit 1236 and a third model construction unit 1237.
The third acoustic feature unit 1236 is configured to extract the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, respectively.
The third model construction unit 1237 is configured to estimate the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, according to the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, to construct the recurrent neural network model with long short-term memory.
Optionally, the present invention further provides an electronic device that performs all or part of the steps of the deep learning-based echo cancellation method shown in any of the above exemplary embodiments. The electronic device includes:
a processor; and
a memory communicatively connected to the processor; wherein
the memory stores readable instructions that, when executed by the processor, implement the method according to any of the above exemplary embodiments.
The specific manner in which the processor of the terminal performs operations in this embodiment has been described in detail in the embodiments of the deep learning-based echo cancellation method and is not elaborated here.
In an exemplary embodiment, a storage medium is also provided; the storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

  1. A deep learning-based echo cancellation method, characterized in that the method includes:
    extracting acoustic features from a received microphone signal, the microphone signal including a near-end signal and a far-end signal;
    iterating the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features;
    masking the acoustic features with the ratio mask;
    synthesizing the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.
  2. The method according to claim 1, characterized in that the step of extracting acoustic features from the received microphone signal includes:
    dividing the received microphone signal into time frames according to a preset time period, the microphone signal including a near-end signal and a far-end signal;
    extracting spectral amplitude vectors from the time frames;
    normalizing the spectral amplitude vectors to form the acoustic features.
  3. The method according to claim 2, characterized in that the step of normalizing the spectral amplitude vectors to form the acoustic features includes:
    merging the spectral amplitude vectors of the current time frame and past time frames and normalizing them to form the acoustic features.
  4. The method according to claim 1, characterized in that the method for constructing the pre-trained recurrent neural network model with long short-term memory includes:
    determining the speakers' voices used for training as the near-end and far-end (reference) signals;
    collecting the speakers' voices as the far-end and near-end signals and establishing a speech training set from them, where the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;
    training on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
  5. The method according to claim 4, characterized in that the step of training on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory includes:
    extracting the acoustic features of the microphone signal and the far-end (echo) signal, respectively;
    estimating, according to the acoustic features of the microphone signal and the far-end signal, the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, to construct the recurrent neural network model with long short-term memory.
  6. The method according to claim 4, characterized in that the step of training on the speech training set through the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory may also include:
    performing linear echo cancellation on the microphone signal with a traditional AEC algorithm;
    extracting acoustic features from the far-end signal and from the linear AEC output produced by the linear echo cancellation of the traditional AEC algorithm, respectively;
    estimating, according to the acoustic features of the far-end signal and the linear AEC output, the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, to construct the recurrent neural network model with long short-term memory.
  7. The method according to claim 6, characterized in that the method may further include:
    extracting acoustic features from the far-end signal, the microphone signal, and the linear AEC output, respectively;
    estimating, according to the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, the ideal ratio mask for echo cancellation through the recurrent neural network with long short-term memory, to construct the recurrent neural network model with long short-term memory.
  8. A deep learning-based echo cancellation apparatus, characterized in that the apparatus includes:
    an acoustic feature extraction module, configured to extract acoustic features from a received input signal, the input signal including a microphone signal and a far-end signal;
    a ratio mask calculation module, configured to iterate the acoustic features through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask of the acoustic features;
    a masking module, configured to mask the acoustic features with the ratio mask;
    a speech synthesis module, configured to synthesize the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.
  9. An electronic device, characterized in that the electronic device includes:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.
  10. A computer-readable storage medium for storing a program, characterized in that the program, when executed, causes an electronic device to perform the method according to any one of claims 1-7.
PCT/CN2019/090528 2018-08-31 2019-06-10 A deep learning-based echo cancellation method WO2020042706A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811013935.8 2018-08-31
CN201811013935.8A CN109841206B (zh) A deep learning-based echo cancellation method

Publications (1)

Publication Number Publication Date
WO2020042706A1 true WO2020042706A1 (zh) 2020-03-05

Family

ID=66883031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090528 WO2020042706A1 (zh) A deep learning-based echo cancellation method

Country Status (2)

Country Link
CN (1) CN109841206B (zh)
WO (1) WO2020042706A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883155A (zh) * 2020-07-17 2020-11-03 海尔优家智能科技(北京)有限公司 Echo cancellation method, apparatus, and storage medium
CN112420073A (zh) * 2020-10-12 2021-02-26 北京百度网讯科技有限公司 Speech signal processing method and apparatus, electronic device, and storage medium
CN112750449A (zh) * 2020-09-14 2021-05-04 腾讯科技(深圳)有限公司 Echo cancellation method, apparatus, terminal, server, and storage medium
CN113077812A (zh) * 2021-03-19 2021-07-06 北京声智科技有限公司 Speech signal generation model training method, echo cancellation method, apparatus, and device
CN113096679A (zh) * 2021-04-02 2021-07-09 北京字节跳动网络技术有限公司 Audio data processing method and apparatus
CN113744748A (zh) * 2021-08-06 2021-12-03 浙江大华技术股份有限公司 Network model training method, echo cancellation method, and device
CN116778970A (zh) * 2023-08-25 2023-09-19 长春市鸣玺科技有限公司 Speech detection method for high-noise environments

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109841206B (zh) 2018-08-31 2022-08-05 大象声科(深圳)科技有限公司 A deep learning-based echo cancellation method
CN112055284B (zh) * 2019-06-05 2022-03-29 北京地平线机器人技术研发有限公司 Echo cancellation method, neural network training method, apparatus, medium, and device
CN110136737A (zh) * 2019-06-18 2019-08-16 北京拙河科技有限公司 A speech noise reduction method and apparatus
CN110473516B (zh) * 2019-09-19 2020-11-27 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus, and electronic device
CN110660406A (zh) * 2019-09-30 2020-01-07 大象声科(深圳)科技有限公司 Real-time speech noise reduction method for a dual-microphone mobile phone in close-talk scenarios
CN110944089A (zh) * 2019-11-04 2020-03-31 中移(杭州)信息技术有限公司 Double-talk detection method and electronic device
CN110956976B (zh) * 2019-12-17 2022-09-09 苏州科达科技股份有限公司 An echo cancellation method, apparatus, device, and readable storage medium
CN113012709B (zh) * 2019-12-20 2023-06-30 北京声智科技有限公司 An echo cancellation method and apparatus
CN111161752B (zh) * 2019-12-31 2022-10-14 歌尔股份有限公司 Echo cancellation method and apparatus
CN111353258A (zh) * 2020-02-10 2020-06-30 厦门快商通科技股份有限公司 Echo suppression method based on an encoder-decoder neural network, audio apparatus, and device
CN111343410A (zh) * 2020-02-14 2020-06-26 北京字节跳动网络技术有限公司 A mute prompt method, apparatus, electronic device, and storage medium
CN111370016B (zh) * 2020-03-20 2023-11-10 北京声智科技有限公司 An echo cancellation method and electronic device
CN111292759B (zh) * 2020-05-11 2020-07-31 上海亮牛半导体科技有限公司 A neural network-based stereo echo cancellation method and system
CN111654572A (zh) * 2020-05-27 2020-09-11 维沃移动通信有限公司 Audio processing method, apparatus, electronic device, and storage medium
CN111816177B (zh) * 2020-07-03 2021-08-10 北京声智科技有限公司 Voice barge-in control method and apparatus for an elevator, and elevator
CN111768796B (zh) * 2020-07-14 2024-05-03 中国科学院声学研究所 An acoustic echo cancellation and dereverberation method and apparatus
CN111883154B (zh) * 2020-07-17 2023-11-28 海尔优家智能科技(北京)有限公司 Echo cancellation method and apparatus, computer-readable storage medium, and electronic apparatus
CN111951819B (zh) * 2020-08-20 2024-04-09 北京字节跳动网络技术有限公司 Echo cancellation method, apparatus, and storage medium
CN112203180A (zh) * 2020-09-24 2021-01-08 安徽文香信息技术有限公司 An adaptive volume adjustment system and method for smart-classroom loudspeaker headsets
CN112259112A (zh) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 An echo cancellation method combining voiceprint recognition and deep learning
WO2022077305A1 (en) * 2020-10-15 2022-04-21 Beijing Didi Infinity Technology And Development Co., Ltd. Method and system for acoustic echo cancellation
CN112466318B (zh) * 2020-10-27 2024-01-19 北京百度网讯科技有限公司 Speech processing method and apparatus, and speech processing model generation method and apparatus
CN112489668B (zh) * 2020-11-04 2024-02-02 北京百度网讯科技有限公司 Dereverberation method, apparatus, electronic device, and storage medium
CN112786068B (zh) * 2021-01-12 2024-01-16 普联国际有限公司 An audio source separation method, apparatus, and storage medium
CN112634933B (zh) * 2021-03-10 2021-06-22 北京世纪好未来教育科技有限公司 An echo cancellation method, apparatus, electronic device, and readable storage medium
CN113179354B (zh) * 2021-04-26 2023-10-10 北京有竹居网络技术有限公司 Sound signal processing method, apparatus, and electronic device
CN113192527B (zh) * 2021-04-28 2024-03-19 北京达佳互联信息技术有限公司 Method, apparatus, electronic device, and storage medium for echo cancellation
CN113257267B (zh) * 2021-05-31 2021-10-15 北京达佳互联信息技术有限公司 Interference signal cancellation model training method, interference signal cancellation method, and device
CN114173259B (zh) * 2021-12-28 2024-03-26 思必驰科技股份有限公司 Echo cancellation method and system
CN115762552B (zh) * 2023-01-10 2023-06-27 阿里巴巴达摩院(杭州)科技有限公司 Method for training an echo cancellation model, echo cancellation method, and corresponding apparatus
CN116386655B (zh) * 2023-06-05 2023-09-08 深圳比特微电子科技有限公司 Echo cancellation model construction method and apparatus

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104157293A (zh) * 2014-08-28 2014-11-19 福建师范大学福清分校 A signal processing method for enhancing target speech signal pickup in acoustic environments
CN105225672A (zh) * 2015-08-21 2016-01-06 胡旻波 System and method for dual-microphone directional noise suppression fusing fundamental-frequency information
US20160358602A1 (en) * 2015-06-05 2016-12-08 Apple Inc. Robust speech recognition in the presence of echo and noise using multiple signals for discrimination
CN106373583A (zh) * 2016-09-28 2017-02-01 北京大学 Multi-audio-object encoding and decoding method based on the ideal soft-threshold mask (IRM)
CN107452389A (zh) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A universal single-channel real-time noise reduction method
CN107845389A (zh) * 2017-12-21 2018-03-27 北京工业大学 A speech enhancement method based on multi-resolution auditory cepstral coefficients and a deep convolutional neural network
CN109841206A (zh) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A deep learning-based echo cancellation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8189766B1 (en) * 2007-07-26 2012-05-29 Audience, Inc. System and method for blind subband acoustic echo cancellation postfiltering
CN101719969B (zh) * 2009-11-26 2013-10-02 美商威睿电通公司 Method and system for double-talk detection, and method and system for echo cancellation
US9936290B2 (en) * 2013-05-03 2018-04-03 Qualcomm Incorporated Multi-channel echo cancellation and noise suppression
CN104581516A (zh) * 2013-10-15 2015-04-29 清华大学 A dual-microphone denoising method and apparatus for medical acoustic signals
US10074380B2 (en) * 2016-08-03 2018-09-11 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883155A (zh) * 2020-07-17 2020-11-03 海尔优家智能科技(北京)有限公司 Echo cancellation method, apparatus, and storage medium
CN111883155B (zh) * 2020-07-17 2023-10-27 海尔优家智能科技(北京)有限公司 Echo cancellation method, apparatus, and storage medium
CN112750449A (zh) * 2020-09-14 2021-05-04 腾讯科技(深圳)有限公司 Echo cancellation method, apparatus, terminal, server, and storage medium
CN112750449B (zh) * 2020-09-14 2024-02-20 腾讯科技(深圳)有限公司 Echo cancellation method, apparatus, terminal, server, and storage medium
CN112420073A (zh) * 2020-10-12 2021-02-26 北京百度网讯科技有限公司 Speech signal processing method and apparatus, electronic device, and storage medium
CN112420073B (zh) * 2020-10-12 2024-04-16 北京百度网讯科技有限公司 Speech signal processing method and apparatus, electronic device, and storage medium
CN113077812A (zh) * 2021-03-19 2021-07-06 北京声智科技有限公司 Speech signal generation model training method, echo cancellation method, apparatus, and device
CN113096679A (zh) * 2021-04-02 2021-07-09 北京字节跳动网络技术有限公司 Audio data processing method and apparatus
CN113744748A (zh) * 2021-08-06 2021-12-03 浙江大华技术股份有限公司 Network model training method, echo cancellation method, and device
CN116778970A (zh) * 2023-08-25 2023-09-19 长春市鸣玺科技有限公司 Speech detection method for high-noise environments
CN116778970B (zh) * 2023-08-25 2023-11-24 长春市鸣玺科技有限公司 Speech detection model training method for high-noise environments

Also Published As

Publication number Publication date
CN109841206A (zh) 2019-06-04
CN109841206B (zh) 2022-08-05

Similar Documents

Publication Publication Date Title
WO2020042706A1 (zh) A deep learning-based echo cancellation method
WO2020042707A1 (zh) A single-channel real-time noise reduction method based on a convolutional recurrent neural network
CN111756942B (zh) Communication device and method for performing echo cancellation, and computer-readable medium
KR101934636B1 (ko) Method and apparatus for integrated removal of noise and echo based on deep neural networks
CN111653288B (zh) Target speaker speech enhancement method based on a conditional variational autoencoder
Qian et al. Speech Enhancement Using Bayesian Wavenet.
WO2021042870A1 (zh) Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN108172231B (zh) A Kalman-filter-based dereverberation method and system
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN108417224B (zh) Training and recognition method and system for bidirectional neural network models
KR101807961B1 (ko) Method and apparatus for speech signal processing based on LSTM and deep neural networks
CN112735456B (zh) A speech enhancement method based on a DNN-CLSTM network
CN110660406A (zh) Real-time speech noise reduction method for a dual-microphone mobile phone in close-talk scenarios
CN111986679A (zh) Speaker verification method, system, and storage medium for complex acoustic environments
CN112037809A (zh) Residual echo suppression method based on a deep neural network with a multi-feature-stream structure
González et al. MMSE-based missing-feature reconstruction with temporal modeling for robust speech recognition
Lv et al. A permutation algorithm based on dynamic time warping in speech frequency-domain blind source separation
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
Doclo et al. Multimicrophone noise reduction using recursive GSVD-based optimal filtering with ANC postprocessing stage
Dionelis et al. Modulation-domain Kalman filtering for monaural blind speech denoising and dereverberation
Nathwani et al. An extended experimental investigation of DNN uncertainty propagation for noise robust ASR
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
US20240135954A1 (en) Learning method for integrated noise echo cancellation system using multi-channel based cross-tower network
Sehr et al. Towards robust distant-talking automatic speech recognition in reverberant environments
KR102374166B1 (ko) Method and apparatus for echo signal removal using far-end signal information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19854024

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19854024

Country of ref document: EP

Kind code of ref document: A1