CN111261179A - Echo cancellation method and device and intelligent equipment

Info

Publication number: CN111261179A
Application number: CN201811453725.0A
Authority: CN
Inventors: 薛少飞; 陈梦喆
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2020-06-09

Abstract

The application discloses an echo cancellation method, an echo cancellation device, and a smart device. An echo cancellation model built from a multi-layer recursive network processes the input signals, which largely mitigates the influence of nonlinear factors, so that echo cancellation performs better and better meets practical requirements. Moreover, the application covers the case of multiple loudspeakers and multiple microphone arrays, so it is more widely and conveniently applicable.

Description

Echo cancellation method and device and intelligent equipment

Technical Field

The present application relates to, but is not limited to, artificial intelligence technologies, and in particular to an echo cancellation method and apparatus and a smart device.

Background

Echo feedback is a common problem in electro-acoustic devices such as telephones and hearing aids. It seriously degrades the quality of the speech signal, often causes noise problems such as howling and whistling, reduces the gain of the system, and changes the response of the system.

Adaptive Echo Control (AEC) exploits the correlation between the output signal of a loudspeaker and the multipath echo that this output signal produces: it subtracts an echo estimate from the input signal of the sound pickup device, thereby cancelling the echo.

Since the emergence of smart devices, a smart voice device has also needed to cancel the sound that it emits itself. As shown in fig. 1, after the reference signal in the smart device (i.e. the input signal entering the loudspeaker of the smart device) is played by the loudspeaker, it is picked up by the microphone array of the smart device together with the original signal, such as a human voice, forming the received signal (also called the radio signal below). For example, a smart speaker needs to cancel the music it is playing, and a smart television needs to cancel the sound of the television program it is playing. Scenarios such as voice wake-up and voice recognition therefore often require echo cancellation.

In the related art, echo cancellation mainly processes the multi-channel sound at the signal level before voice wake-up or voice recognition is performed. On one hand, the original sound signal and the reference signal are processed linearly, whereas in practice reverberation, the device structure, and similar factors introduce a large number of nonlinear transformations, and the influence of these nonlinear factors cannot be overcome. On the other hand, such methods can only judge the AEC effect by how the result sounds, and an improvement in perceived audio quality does not imply an improvement in voice wake-up or voice recognition performance.

Disclosure of Invention

The application provides an echo cancellation method and device and intelligent equipment, which can enable echo cancellation to generate a better effect, so that actual requirements can be better met.

The embodiment of the invention provides an echo cancellation method, which comprises the following steps:

respectively inputting all reference signals into an echo cancellation model according to the number of channels of the received signals, and calculating to obtain a reference signal estimation value;

subtracting the reference signal estimation value corresponding to each channel from the radio signal of each channel to obtain an original signal estimation value;

and carrying out normalization processing on the original signal estimation values corresponding to all channels to obtain original signals.

In one illustrative example, the method further comprises generating the echo cancellation model, comprising:

simulating a radio signal by using a preset original signal and a preset reference signal;

and training the network to be trained by taking the radio signal obtained by simulation and a preset reference signal as input and taking a preset original signal as a modeling target to obtain the echo cancellation model.

In one illustrative example, the echo cancellation model includes a multi-layer recursive network.

In one illustrative example, simulating the radio signal comprises:

applying a preset impulse response to the preset reference signal and then adding a preset environmental noise signal to obtain a first signal;

and superposing the first signal and the preset original signal to obtain the simulated radio signal.

In one illustrative example, the network to be trained includes at least one of:

a feedforward sequential memory network (FSMN), a deep feedforward sequential memory network (DFSMN), a long short-term memory unit (LSTM), a bidirectional long short-term memory unit (BLSTM), or a gated recurrent unit (GRU).

In one illustrative example, the reference signal comprises at least one channel, and the original signal comprises at least one channel.

In one illustrative example, the method further comprises:

performing joint training with a voice wake-up model and a voice recognition model using the obtained original signal.

The application also provides an echo cancellation processing method, which comprises the following steps:

In one illustrative example, simulating the radio signal comprises:

The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the echo cancellation method of any one of the above and/or performing the echo cancellation processing method of any one of the above.

The present application further provides an echo cancellation device, comprising a memory and a processor, wherein the memory stores the following instructions executable by the processor: for performing the steps of the echo cancellation method of any one of the above.

The application also provides an echo cancellation device, comprising a memory and a processor, wherein the memory stores the following instructions executable by the processor: for performing the steps of the echo cancellation processing method of any one of the above.

The present application further provides a smart device comprising a memory and a processor, wherein the memory has stored therein the following instructions executable by the processor:

according to the number of channels of the received radio signals, inputting all reference signals entering a loudspeaker into an echo cancellation model respectively, and calculating to obtain a plurality of reference signal estimation values;

for each channel, subtracting a reference signal estimation value corresponding to the channel from the radio signal of each channel to obtain a plurality of original signal estimation values;

and carrying out normalization processing on the original signal estimation values corresponding to all the channels to obtain original signals entering the microphone array.

In one illustrative example, the smart device comprises a smart speaker or a smart television.

According to the echo cancellation method and device of the application, an echo cancellation model with a multi-layer recursive network processes the input signals, which largely mitigates the influence of nonlinear factors, so that echo cancellation performs better and better meets practical requirements. Moreover, the application covers the case of multiple loudspeakers and multiple microphone arrays, so it is more widely and conveniently applicable.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.

FIG. 1 is a schematic diagram of radio signal generation in a smart device;

FIG. 2 is a flow chart of an echo cancellation processing method according to the present application;

FIG. 3 is a network architecture diagram of an embodiment of the present application in which echo cancellation and speech recognition are used jointly;

FIG. 4 is a schematic diagram of the structure of an echo cancellation processing apparatus according to the present application;

FIG. 5 is a flow chart of the echo cancellation method of the present application;

FIG. 6 is a schematic diagram of an embodiment of an echo cancellation network according to the present application;

FIG. 7 is a schematic diagram of a structure of an echo cancellation device according to the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.

The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.

Through research on echo cancellation technologies, the inventors of the application found that if AEC is implemented with a deep neural network, the strong nonlinear modeling capability of the deep neural network can be fully exploited to handle the influence of the nonlinear factors that arise in practice.

Fig. 2 is a flowchart of an echo cancellation processing method of the present application, which is used to train and generate an echo cancellation model. As shown in fig. 2, the method includes:

Step 200: Simulate the radio signal by using a preset original signal and a preset reference signal.

In one illustrative example, simulating the radio signal may include:

applying a preset impulse response to the preset reference signal and then adding a preset environmental noise signal to obtain a first signal;

and superposing the first signal and the preset original signal to obtain the simulated radio signal.

Taking a microphone array with 4 microphones as an example, far-field data is usually simulated from near-field data according to the following formulas:

y_1(t) = x(t) * h_s1(t) + n(t)
y_2(t) = x(t) * h_s2(t) + n(t)
y_3(t) = x(t) * h_s3(t) + n(t)
y_4(t) = x(t) * h_s4(t) + n(t)

where y_i(t) represents the simulated far-field data of the i-th microphone, x(t) represents the near-field data, h_si(t) represents the impulse response of the i-th microphone as determined by the room, the environment, and the microphone position, * represents the convolution operation, and n(t) represents the ambient noise. Here i is 1, 2, 3, or 4.
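The formulas above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `convolve` and `simulate_far_field`, the toy impulse responses, the Gaussian noise model, and the truncation of each output to the length of x(t) are all hypothetical choices that the application does not specify.

```python
import random

def convolve(x, h):
    """Full discrete convolution of two sequences (x * h)."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out

def simulate_far_field(x, impulse_responses, noise_std=0.0, seed=0):
    """Return one simulated far-field signal y_i(t) per microphone.

    Implements y_i(t) = x(t) * h_si(t) + n(t), truncated to len(x) samples.
    """
    rng = random.Random(seed)
    # n(t): one shared noise sequence, as in the formulas.
    # Gaussian noise is an assumption of this sketch.
    noise = [rng.gauss(0.0, noise_std) for _ in x]
    channels = []
    for h in impulse_responses:
        reverberant = convolve(x, h)[:len(x)]
        channels.append([s + n for s, n in zip(reverberant, noise)])
    return channels

# Hypothetical near-field signal and four toy impulse responses (one per microphone).
x = [1.0, 0.5, -0.25, 0.0]
h_s = [[1.0], [0.5, 0.5], [0.0, 1.0], [1.0, 0.0, 0.3]]
y = simulate_far_field(x, h_s)
```

With `noise_std=0.0` the first channel reproduces x(t) unchanged, because its impulse response is a unit impulse; a measured room impulse response and recorded ambient noise would replace the toy values in practice.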

Step 201: Train the network to be trained by taking the simulated radio signal and the preset reference signal as input and the preset original signal as the modeling target, to obtain the echo cancellation model.

In an exemplary embodiment, the modeling criterion may be a minimum mean square error criterion, or the modeling may take the form of a mask. The general goal is to establish a mapping from the received signal to the original signal.

In an exemplary example, the network to be trained may include a multi-layer recursive network such as a Feedforward Sequential Memory Network (FSMN), a Deep Feedforward Sequential Memory Network (DFSMN), a Long Short-Term Memory unit (LSTM), a Bidirectional Long Short-Term Memory unit (BLSTM), or a Gated Recurrent Unit (GRU). LSTM is a type of time-recursive Recurrent Neural Network (RNN).
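The two modeling criteria can be illustrated with a minimal Python sketch. The application does not specify the exact loss or mask definition, so the magnitude-ratio mask below, its clipping to [0, 1], and the names `mse`, `ratio_mask`, and `apply_mask` are assumptions made only for illustration.

```python
def mse(estimate, target):
    """Mean squared error between two equal-length sequences."""
    if len(estimate) != len(target):
        raise ValueError("sequences must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def ratio_mask(target_mag, received_mag, eps=1e-8):
    """Per-bin mask in [0, 1]: target magnitude over received magnitude.

    This particular mask form is an assumption of the sketch.
    """
    return [min(1.0, t / (r + eps)) for t, r in zip(target_mag, received_mag)]

def apply_mask(mask, received_mag):
    """Scale each received magnitude bin by its mask value."""
    return [m * r for m, r in zip(mask, received_mag)]

# Hypothetical values: a network output compared with the preset original signal.
loss = mse([1.0, 2.0], [1.0, 4.0])
mask = ratio_mask([1.0, 3.0], [2.0, 2.0])
masked = apply_mask(mask, [2.0, 2.0])
```

Under the mean square error criterion the network output is compared directly with the preset original signal; under the mask form the network would instead predict the mask, which is then applied to the received signal.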

It should be noted that the specific way in which the network to be trained is trained to obtain the echo cancellation model does not limit the scope of the present application. What the application emphasizes is that the simulated multi-channel radio signal and the multi-channel reference signal are used as input, and that the network structure of the application is adopted to train the multi-layer echo cancellation neural network.

In one illustrative example, the reference signal includes at least one channel and the original signal includes at least one channel.

In an exemplary embodiment, the multi-channel speech signal after echo cancellation can further be used to train the subsequent voice wake-up and speech recognition models, and these models can be trained jointly with the echo cancellation model.

In an exemplary example, take echo cancellation followed by a speech recognition model. Assume that the collected signals are 2 channels of original signals (such as wav1 and wav2 in fig. 3) and 2 channels of reference signals (such as ref1 and ref2 in fig. 3). The input of the network is then the 4 collected channels, on which echo cancellation is performed, as indicated by the dashed frame labeled NN Front-end in fig. 3, to implement the AEC function. The AEC-processed signals are combined with the reference signals and then input to the Acoustic Model (AM) portion. During training, the NN Front-end and the AM are first trained independently, and the two networks are then connected in series and trained jointly.

The structure shown in fig. 3 suits a wide range of application scenarios. For example, a multi-channel signal collected by a microphone array can be fed to the input of the neural network at the same time. As another example, it suits cases with different types of signals, such as a case where an internal signal is acquired at the same time as an external signal.

The echo cancellation model obtained by training is a multi-layer recursive network, which is well suited to overcoming the influence of nonlinear factors, so that echo cancellation performs better and better meets practical requirements.

The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the echo cancellation processing method of any one of the above.

The present application further provides an echo cancellation model generation apparatus, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the echo cancellation processing method according to any one of the above.

Fig. 4 is a schematic diagram of a structure of an echo cancellation processing apparatus according to the present application. As shown in fig. 4, the apparatus includes at least a signal processing module and a training module, wherein:

the signal processing module is used for simulating a radio signal by using the original signal and the reference signal;

and the training module is used for training to obtain the echo cancellation model by taking the radio signal and the reference signal obtained by simulation as input and the original signal as a modeling target.

In an illustrative example, the signal processing module is specifically configured to:

apply a preset impulse response to the reference signal and then add a preset environmental noise signal to obtain a first signal; and superpose the first signal and the original signal to obtain the simulated radio signal.

In an illustrative example, the network to be trained may include a multi-layer neural network such as FSMN, or DFSMN, or LSTM, or BLSTM, or GRU.

Fig. 5 is a flowchart of an echo cancellation method according to the present application. As shown in fig. 5, the method includes:

Step 500: According to the number of channels of the received signal, respectively input all reference signals into the echo cancellation model and calculate a reference signal estimation value.

In an exemplary embodiment, the echo cancellation model is a multi-layer recursive network trained on a radio signal simulated from an original signal and a reference signal.

In one illustrative example, the radio signal or the reference signal may be represented in forms including, but not limited to: a raw waveform (WAV) signal, a Fast Fourier Transform (FFT) signal obtained by Fourier transform, or the FilterBank (FBank) features commonly used in voice wake-up and speech recognition.

In an exemplary embodiment, the multi-layer recursive network that implements the echo cancellation model in the present application is divided into a plurality of sub-networks according to the number of channels of the received signal, and each sub-network is an echo cancellation model that takes all reference signals as input. The reference signal estimation value corresponding to a channel is obtained from the calculation of that channel's echo cancellation model.

In an exemplary embodiment, in the sub-network corresponding to each channel, all reference signals undergo nonlinear (including linear) processing by the multi-layer recursive network. The sub-network includes multiple layers, such as recursive network layers (for example FSMN or LSTM-based RNN layers), multiple normalization layers, and the direct connections of a residual network.

Step 501: Subtract the reference signal estimation value corresponding to each channel from the radio signal of that channel to obtain an original signal estimation value.

In one exemplary embodiment, this step includes: for each channel, subtracting the estimated reference signal from the received signal to obtain the original signal estimate.

Fig. 6 is a schematic diagram of an embodiment of an echo cancellation network according to the present application. As shown in fig. 6, this embodiment assumes 2 loudspeakers and 2 microphone arrays, so the input consists of sound reception (radio) signals of 2 channels and reference signals of 2 channels, labeled in fig. 6 as sound reception signal 1 (channel 1), sound reception signal 2 (channel 2), reference signal 1 (channel 1), and reference signal 2 (channel 2). As shown by the dashed frame in fig. 6, the multi-layer recursive network that implements the echo cancellation model in this embodiment is divided into 2 sub-networks according to the number of channels of the received signal. Each sub-network contains, for example, two recursive network layers (such as FSMN, DFSMN, LSTM, BLSTM, or GRU) and several normalization layers; in this embodiment, each sub-network has one normalization layer in the processing of all reference signals and one normalization layer in the processing of its reference signal estimate.

Step 502: Normalize the original signal estimation values corresponding to all channels to obtain the original signal.

In an exemplary embodiment, the echo cancellation processing of the present application is itself implemented as a network, which makes joint training of the echo cancellation processing with the back-end voice wake-up and voice recognition models more flexible.
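Steps 500 to 502 can be sketched end to end as follows. This is a hedged illustration, not the trained model of the application: `make_linear_stand_in` is a hypothetical placeholder for a trained sub-network (a fixed weighted sum of all reference signals), and the zero-mean, unit-variance `normalize` is only one assumed form of the normalization processing, which the text does not define.

```python
import math

def make_linear_stand_in(weights):
    """Hypothetical stand-in for one trained sub-network.

    It maps all reference signals to a reference signal estimate
    using a fixed weighted sum instead of a recursive network.
    """
    def model(references):
        n = len(references[0])
        return [sum(w * ref[t] for w, ref in zip(weights, references))
                for t in range(n)]
    return model

def normalize(channels, eps=1e-8):
    """Zero-mean, unit-variance scaling over all channels (assumed form)."""
    values = [v for ch in channels for v in ch]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) + eps
    return [[(v - mean) / std for v in ch] for ch in channels]

def cancel_echo(received, references, echo_models):
    """Run steps 500-502 with one sub-network per received channel."""
    estimates = []
    for rx, model in zip(received, echo_models):
        ref_estimate = model(references)                              # Step 500
        estimates.append([r - e for r, e in zip(rx, ref_estimate)])   # Step 501
    return normalize(estimates)                                       # Step 502

# Toy data: 2 reference channels and 2 received channels. Each received
# channel is the speech [0.5, -0.5, 0.5, -0.5] plus a mix of both references.
references = [[1.0, 0.0, -1.0, 0.0], [0.0, 2.0, 0.0, -2.0]]
received = [[1.0, 0.0, 0.0, -1.0], [0.75, 0.5, 0.25, -1.5]]
models = [make_linear_stand_in([0.5, 0.25]), make_linear_stand_in([0.25, 0.5])]
original = cancel_echo(received, references, models)
```

Because the stand-in weights match the toy mixing exactly, both channels recover the speech pattern, which the assumed normalization scales to approximately plus or minus one.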

The present application further provides a computer-readable storage medium having stored thereon computer-executable instructions for performing the echo cancellation method of any of the above.

The present application further provides an echo cancellation device, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the echo cancellation method of any one of the above.

Fig. 7 is a schematic structural diagram of the echo cancellation device according to the present application. As shown in fig. 7, the device includes at least a first estimation module, a second estimation module, and a processing module, wherein:

the first estimation module is used for respectively inputting all reference signals into the echo cancellation model according to the number of the channels of the received signals and calculating to obtain a reference signal estimation value;

the second estimation module is used for subtracting the reference signal estimation value obtained by calculation corresponding to each channel from the radio signal of each channel to obtain an original signal estimation value;

and the processing module is used for carrying out normalization processing on the original signal estimation value corresponding to each channel to obtain an expected original signal.

Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. An echo cancellation method, comprising:

2. The echo cancellation method of claim 1, further comprising, before the method, generating the echo cancellation model, comprising:

3. The echo cancellation method of claim 2, wherein the echo cancellation model comprises a multi-layer recursive network.

4. The echo cancellation method of claim 2, wherein simulating the radio signal comprises:

5. The echo cancellation method according to claim 2, wherein the network to be trained comprises at least one of:

6. The echo cancellation method according to claim 1, wherein the reference signal comprises at least one channel, and the original signal comprises at least one channel.

7. The echo cancellation method of claim 1, the method further comprising:

8. An echo cancellation processing method, comprising:

9. The echo cancellation processing method of claim 8, wherein simulating the radio signal comprises:

10. The echo cancellation processing method of claim 8, wherein the network to be trained comprises at least one of:

11. A computer-readable storage medium storing computer-executable instructions for performing the echo cancellation method of any one of claims 1 to 7 and/or performing the echo cancellation processing method of any one of claims 8 to 10.

12. An apparatus for echo cancellation comprising a memory and a processor, wherein the memory has stored therein the following instructions executable by the processor: steps for performing the echo cancellation method of any one of claims 1 to 7.

13. An apparatus for echo cancellation comprising a memory and a processor, wherein the memory has stored therein the following instructions executable by the processor: steps for performing the echo cancellation processing method of any one of claims 8 to 10.

14. A smart device comprising a memory and a processor, wherein the memory has stored therein the following instructions executable by the processor:

15. The smart device of claim 14, wherein the smart device comprises a smart speaker or a smart television.