CN112689056A

CN112689056A - Echo cancellation method and echo cancellation device using same

Info

Publication number: CN112689056A
Application number: CN202110271110.1A
Authority: CN
Inventors: 亢润龙; 陆金刚; 方伟
Original assignee: Zhejiang Xinsheng Electronic Technology Co Ltd
Current assignee: Zhejiang Xinsheng Electronic Technology Co Ltd
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2021-04-20
Anticipated expiration: 2041-03-12
Also published as: CN112689056B

Abstract

The invention provides an echo cancellation method and an echo cancellation device using the same. By combining a convolutional neural network with an adaptive filter, the method speeds up convergence of the echo path estimate while maintaining high convergence precision. In the initial stage, the trained convolutional neural network model provides a set of initial filter coefficients for the adaptive filter, which accelerates the convergence of the adaptive filter when estimating the echo path. In addition, when the external environment changes, the echo path estimate goes from a converged state to divergence again, and returning to a converged state by relying on the filter alone takes a long time. In the invention, if the echo path estimate diverges again, the convolutional neural network model is called to provide a new set of filter coefficients for the adaptive filter, which accelerates re-convergence, keeps echo cancellation continuous, and avoids the situation in which the filter fails to converge at some moment and the echo cannot be cancelled.

Description

Echo cancellation method and echo cancellation device using same

Technical Field

The present invention relates to the field of sound processing, and in particular, to an echo cancellation method and an echo cancellation device using the same.

Background

With the rapid development of modern information technology industry, various voice communication devices such as video conference systems, hands-free phones, mobile communication and hearing aids are continuously appearing, so that people can communicate more conveniently and comfortably. In voice communication devices, however, the presence of echo severely degrades the quality of voice communications, and acoustic echo cancellation is a significant challenge.

The echo phenomenon arises because an audio input device that is recording may pick up two audio sources: the audio signal that is actually meant to be recorded, and the audio signal played by the local audio output device. The presence of both sources causes the audio played by the output device to contain an echo. To avoid the echo phenomenon, the audio signal output by the audio output device needs to be eliminated at the audio input device, so that the signal recorded by the audio input device contains only the audio that is meant to be recorded. For example, when user A and user B hold a voice chat using an instant messaging tool, the voice data recorded by user A's local audio input device reaches the audio output device of user B's terminal. If echo cancellation is not performed, that voice data is recorded again by the audio input device of user B's terminal, so user A hears from the local audio output device not only what user B says but also what user A has just said. Similarly, if echo cancellation is not performed on user A's side, user B hears what user B said earlier. Therefore, for user A and user B each to hear only the other party's voice and not their own earlier words, echo cancellation must be implemented on both user A's side and user B's side.

The acoustic echo signal has the following characteristics: (1) the echo channel is influenced by factors such as temperature, pressure, humidity and the movement of objects, and is nonlinear, unstable and time-varying; (2) the echo path is multi-path, that is, the echo comprises not only the direct echo picked up by the microphone immediately after the voice is played by the loudspeaker, but also the indirect echo that undergoes one or more reflections after the voice is played by the loudspeaker; (3) the acoustic echo exhibits time delay and high gain and easily produces howling. Because of these characteristics, the development of high-performance acoustic echo cancellation algorithms remains a major difficulty in the field of speech signal processing. As requirements on voice communication quality increase, echo cancellation algorithms with a high convergence rate, high convergence precision and low computational complexity have become a research focus in the field of speech signal processing.

In echo cancellation, an adaptive filter is typically used to track and estimate the unknown echo channel by system identification, which is the most widely used echo cancellation method. An adaptive filter is an algorithm or device that, based on estimates of the statistical characteristics of the input and output signals, uses a specific algorithm to adjust its filter coefficients automatically so as to achieve optimal filtering characteristics. The adaptive filter updates and adjusts the weighting coefficients for each sample of the input signal sequence according to that algorithm, so that the mean square error between the output signal sequence and the desired output signal sequence is minimized. When the adaptive filter is used for echo cancellation, the following two points need to be satisfied: (1) during the convergence phase, the desired signal should consist entirely of echo and should not be mixed with near-end speech, since the near-end speech signal is uncorrelated with the far-end speech signal. This requires a fast convergence rate, so that the adaptive filter has already converged before any near-end speech arrives; (2) the echo path may change, and once it changes, the change needs to be detected and handled in time, at which point the adaptive filter needs a new convergence process to match the new echo path. To meet these two requirements, on the one hand, the coefficients need to remain stable after the adaptive filter converges, so that the adaptive filter is not disturbed by the near-end voice input signal; on the other hand, the adaptive filter needs to keep updating at all times so that a changing echo path can be tracked. Therefore, conventional adaptive filter algorithms require a trade-off between convergence speed and steady-state misadjustment according to the specific performance requirements of the system.

Therefore, there is a need for an echo cancellation method that resolves the trade-off between the convergence rate and the steady-state misadjustment of the conventional adaptive filter, and that achieves a fast convergence rate, low steady-state error and low computational complexity.

Disclosure of Invention

In order to solve the above technical problem, the present invention provides an echo cancellation method. The method selects a proper network training model to be combined with the adaptive filter, can eliminate echo, can meet the requirements of a real-time communication system, has strong anti-interference capability on external environment change, and has the advantages of high convergence speed, high convergence precision and low calculation complexity.

In order to achieve the above object, the present invention provides an echo cancellation method, which includes the following steps:

(1) acquiring a first frame audio signal of an audio sequence; the audio signals comprise far-end signals and near-end signals;

(2) outputting the far-end signal to a filter coefficient adjusting module;

(3) judging whether the audio signal is at the start position; if the audio signal is at the start position, executing step (4); if the audio signal is not at the start position, executing step (5);

(4) outputting the far-end signal to a trained convolutional neural network, outputting an initial filter coefficient by the convolutional neural network, and outputting the initial filter coefficient and the far-end signal to an adaptive filter, and executing the step (6);

(5) inputting the far-end signal to an adaptive filter;

(6) the self-adaptive filter obtains a synthesized echo according to the far-end signal; calculating an error signal by the near-end signal and the synthesized echo;

(7) judging whether the energy of the error signal is larger than the upper limit value of the energy multiple of the near-end signal, if so, returning to the step (4); if the energy of the error signal is not more than the upper limit value of the energy multiple of the near-end signal, transmitting the error signal to a filter coefficient adjusting module, judging whether a far-end signal and the error signal are received at the same time, if the far-end signal and the error signal are received at the same time, updating the filter coefficient by using the far-end signal and the error signal, and executing the step (8);

(8) transmitting the processed near-end signal to a far-end device;

(9) judging whether the audio sequence is the last frame, if not, acquiring a next frame signal, and returning to the step (2); and if the audio sequence is the last frame, ending echo cancellation.
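
The control flow of steps (1) to (9) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `predict_coeffs` is a hypothetical callable standing in for the trained convolutional neural network, the coefficient update is a simple block-normalized LMS step chosen for brevity, filter history is not carried across frames, and re-initialization is attempted at most once per frame so the loop always terminates. The default `ratio_limit` of 20 is one value inside the 15-25 range stated later in the text.

```python
from typing import Callable, List, Sequence, Tuple


def energy(signal: Sequence[float]) -> float:
    """Sum of squared samples of one frame."""
    return sum(s * s for s in signal)


def synthesize_echo(w: Sequence[float], x: Sequence[float]) -> List[float]:
    """FIR-filter one far-end frame (history across frames is ignored for brevity)."""
    return [sum(w[k] * x[n - k] for k in range(len(w)) if n - k >= 0)
            for n in range(len(x))]


def cancel_echo(
    far_frames: Sequence[Sequence[float]],
    near_frames: Sequence[Sequence[float]],
    predict_coeffs: Callable[[Sequence[float]], Sequence[float]],  # hypothetical CNN stand-in
    ratio_limit: float = 20.0,  # assumed value within the 15-25 range
    mu: float = 0.5,            # assumed step size
) -> Tuple[List[List[float]], List[float]]:
    w: List[float] = []
    outputs: List[List[float]] = []
    for i, (x, d) in enumerate(zip(far_frames, near_frames)):
        if i == 0:
            # Steps (3)-(4): at the start position the network supplies the coefficients.
            w = list(predict_coeffs(x))
        # Step (6): synthesized echo and error signal.
        y = synthesize_echo(w, x)
        e = [dn - yn for dn, yn in zip(d, y)]
        if energy(e) > ratio_limit * energy(d):
            # Step (7), divergence branch: go back to step (4) once and recompute.
            w = list(predict_coeffs(x))
            y = synthesize_echo(w, x)
            e = [dn - yn for dn, yn in zip(d, y)]
        else:
            # Step (7), normal branch: update coefficients from far-end and error signals.
            norm = energy(x) + 1e-8
            for k in range(len(w)):
                w[k] += mu * sum(e[n] * x[n - k] for n in range(k, len(x))) / norm
        # Step (8): the error signal is the processed near-end signal sent to the far end.
        outputs.append(e)
    # Step (9): the loop ends after the last frame.
    return outputs, w
```

In this sketch the network stand-in is called only for the first frame and whenever the energy test fails, which mirrors the intent that the cheaper adaptive update handles all other frames.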

By combining the convolutional neural network with the adaptive filter, the echo cancellation method speeds up convergence of the echo path estimate while maintaining high convergence precision. In the initial stage, the trained convolutional neural network model provides a set of initial filter coefficients for the adaptive filter, which accelerates the convergence of the adaptive filter when estimating the echo path. In addition, when the external environment changes, the echo path estimate goes from a converged state to divergence again, and returning to a converged state by relying on the filter alone takes a long time. In the invention, if the echo path estimate diverges again, the convolutional neural network model is called to provide a new set of filter coefficients for the adaptive filter, which accelerates re-convergence, keeps echo cancellation continuous, and avoids the situation in which the filter fails to converge at some moment and the echo cannot be cancelled.

Preferably, the trained neural network includes an input layer, a convolutional layer, an activation layer, a pooling layer, a fully-connected layer, and an output layer.

Preferably, the training sample of the trained neural network is the frequency-domain far-end signal, and the target sample is the frequency-domain near-end signal. In the frequency domain, the near-end signal and the echo signal are easier to distinguish within the mixed audio signal, which helps extract the characteristic information of the echo signal during training.

Preferably, step (1) comprises converting the audio signal from the time domain to the frequency domain, and step (8) comprises converting the processed near-end signal from the frequency domain to the time domain and transmitting the converted near-end signal to the far-end device. Because the adaptive filter processes the frequency-domain signal, the filter can be updated more finely and the residual echo is low; in addition, the near-end signal and the echo signal are easier to distinguish within the mixed audio signal in the frequency domain, which helps extract the characteristic information of the echo signal during training.

Preferably, said step (8) comprises suppressing the residual echo and transmitting the processed near-end signal to the far-end device.

Preferably, when suppressing the residual echo, the suppressing is performed according to a correlation between signals, the correlation between the signals including at least one of the following signal correlations: the correlation of the near-end signal and the far-end signal, the correlation of the near-end signal and the error signal, and the correlation of the far-end signal and the error signal.

Preferably, in suppressing the residual echo, the residual echo suppression is performed by a wiener filter or spectral subtraction.

Preferably, the adaptive filter is one of the following filters: FIR filters or IIR filters.

Preferably, the adaptive filter employs one of the following adaptive algorithms: LMS algorithm or NLMS algorithm.

Preferably, the upper limit value of the near-end signal energy multiple is 15-25 times of the near-end signal energy.

In another aspect of the present invention, there is provided an echo cancellation device, including: a first echo path generator, a second echo path generator and a decider, wherein:

the first echo path generator is used for receiving a far-end signal, updating a filter coefficient at an audio starting position by adopting a trained convolutional neural network model, and outputting an estimated echo path as a starting reference value of the second echo path generator; the first echo path generator is further used for updating the filter coefficient and reconfiguring the filter coefficient for the second echo path generator when the decision device determines that the error signal energy is far larger than the upper limit value of the near-end signal energy multiple;

the second echo path generator is used for receiving the far-end signal and the updated filter coefficient, adopting an adaptive filter model, selecting the updated filter coefficient as the coefficient of the adaptive filter model, and outputting the synthesized echo;

the decision device is used for receiving the near-end signal and the error signal, judging the energy of the error signal and the energy of the near-end signal, and calling the first echo path generator to reconfigure the filter coefficient for the second echo path generator if the energy of the error signal is greater than the upper limit of the energy multiple of the near-end signal; and if the energy of the error signal is not more than the upper limit of the energy multiple of the near-end signal, transmitting the output near-end signal to the far-end equipment, and updating the filter coefficient of the second echo path generator by using the far-end signal and the error signal.

Preferably, the echo cancellation device further comprises a residual echo suppressor for performing residual echo suppression on the output near-end signal when the decision device decides that the error signal energy is not greater than the upper limit of the multiple of the near-end signal energy.

Drawings

Fig. 1 is a flowchart of an echo cancellation method according to the present invention.

Fig. 2 is a flowchart of a convolutional neural network training model employed by the present invention.

Fig. 3 is a schematic diagram of an echo cancellation device according to the present invention.

Detailed Description

The technical means adopted by the invention to achieve the predetermined object of the invention are further described below with reference to the drawings and the preferred embodiments of the invention.

As shown in fig. 1, fig. 1 is a flowchart of an echo cancellation method provided in the present invention.

Step 110, obtaining a first frame of audio signal of an audio sequence, where the audio sequence includes a near-end signal and a far-end signal, where the near-end signal at least includes an echo signal and a noise signal, and the far-end signal is a far-end speech signal.

Optional step 120: the audio signal is converted from the time domain to the frequency domain. This step mainly determines the type of input signal for the adaptive filter. A time-domain adaptive filter responds quickly to the echo and can track the echo environment rapidly, but leaves more residual echo. A frequency-domain adaptive filter allows finer filter updates and leaves less residual echo; moreover, the near-end signal and the echo signal are easier to distinguish within the mixed audio signal in the frequency domain, which helps extract the characteristic information of the echo signal during training. Therefore, to achieve better echo cancellation, the audio signal may optionally be converted from the time domain to the frequency domain. After optional step 120 is performed, on the one hand, step 130 is performed to determine whether the obtained audio signal is at the start position; on the other hand, the far-end signal in the audio signal is output to the adaptive filter coefficient adjusting module so that the adaptive filter coefficients can be adjusted later in the process.

In step 130, it is determined whether the audio signal is at the start position. If the audio signal is at the start position, the far-end signal is input into the trained convolutional neural network and step 141 is performed; if the audio signal is not at the start position, step 140 is executed. The initial filter coefficient of the adaptive filter is 0, and the echo path is estimated using the near-end signal and the far-end signal at a plurality of time instants, so that the error between the output signal of the adaptive filter and the echo signal is minimized until the echo path estimate converges. Since the adaptive filter has not yet converged on the echo path at the start time, the echo is likely to go uncancelled. In order to accelerate the convergence of the adaptive filter and avoid the echo at the start time going uncancelled, in the invention, once the audio sequence is judged to be at the start position, the far-end signal is input into the trained convolutional neural network to obtain an estimated echo path, which is used as the starting filter coefficients of the adaptive filter. Therefore, in this embodiment, the trained convolutional neural network model provides an initial set of filter coefficients for the adaptive filter in the initial stage, which accelerates the convergence of the adaptive filter when estimating the echo path.

In step 141, when the audio signal is at the start position, the far-end signal is input into the trained convolutional neural network. In the present embodiment, a convolutional neural network is employed as the training model. Training a traditional neural network becomes harder as the network gets deeper, and optimizing it becomes harder as well: the deeper the network, the more it has to learn, the slower it converges, and the more pronounced the vanishing gradient phenomenon becomes, so that the training result is not as good as that of a relatively shallow network. The convolutional neural network reduces the complexity of the network model and the number of weights through strategies such as local receptive fields, weight sharing and down-sampling. Such a network structure is highly invariant to translation, scaling and other forms of deformation. The convolutional neural network takes the original sequence as input, so that the corresponding features can be learned effectively from a large number of samples and a complex feature extraction process is avoided. For these reasons, convolutional neural networks are widely used in fields such as object recognition and speech recognition. A common convolutional neural network consists of an input layer, a convolutional layer, an activation layer, a pooling layer, a fully-connected layer, and finally an output layer.
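
To make the layer order concrete, the following is a small forward pass written with plain Python lists. It is only a sketch: the kernel, pooling size and fully-connected weights are hypothetical placeholders rather than trained values, the layer sizes are not taken from the source, and ReLU and max pooling are assumed as the activation and pooling operations. The output vector plays the role of the filter coefficients that the network is meant to produce.

```python
from typing import List, Sequence


def conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Convolutional layer: 'valid' sliding dot product with a shared kernel."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


def relu(x: Sequence[float]) -> List[float]:
    """Activation layer (ReLU is an assumption; the source does not name one)."""
    return [max(0.0, v) for v in x]


def max_pool(x: Sequence[float], size: int) -> List[float]:
    """Pooling layer: non-overlapping max pooling, dropping any incomplete window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def fully_connected(x: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Fully-connected layer: one weighted sum plus bias per output."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(x: Sequence[float],
            kernel: Sequence[float],
            pool_size: int,
            fc_weights: Sequence[Sequence[float]],
            fc_bias: Sequence[float]) -> List[float]:
    """Input -> convolution -> activation -> pooling -> fully connected -> output."""
    return fully_connected(max_pool(relu(conv1d(x, kernel)), pool_size),
                           fc_weights, fc_bias)
```

Weight sharing is visible in `conv1d`, where the same short kernel is reused at every position, and down-sampling is visible in `max_pool`, which halves the feature length when `size` is 2.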

The present invention provides a training method using a convolutional neural network, as shown in fig. 2. The frequency domain data obtained by the near-end signal through FFT is used as a target sample (label), the frequency domain data obtained by the far-end signal through FFT is used as a training sample, and a group of coefficients of an N-order FIR filter are obtained through training, so that the error value of the estimated echo signal and the near-end signal is minimized at the initial moment, the convergence speed of the adaptive filter is improved, and the echo energy is effectively suppressed. In this embodiment, the training input data of the convolutional neural network is frequency domain data rather than time domain data, because the mixed audio signal can more easily distinguish the near-end signal from the echo signal in the frequency domain, which is beneficial to extracting the feature information of the echo signal in the training, so that the error of the estimated echo path model is minimized.
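
The pairing of training sample and target sample described above can be illustrated as follows. This is a sketch under assumptions: a naive DFT from the standard library stands in for the FFT, the function names are hypothetical, and the source does not specify how the complex spectra are encoded for the network (for example as magnitudes or as real and imaginary parts), so the raw complex bins are returned.

```python
import cmath
from typing import List, Sequence, Tuple


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform; an FFT computes the same values faster."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def make_training_pair(
    far_frame: Sequence[float],
    near_frame: Sequence[float],
) -> Tuple[List[complex], List[complex]]:
    """Return (training sample, target sample) for one frame.

    The far-end spectrum is the network input and the near-end spectrum is the label.
    """
    if len(far_frame) != len(near_frame):
        raise ValueError("far-end and near-end frames must have the same length")
    return dft(far_frame), dft(near_frame)
```

For a far-end impulse and a near-end signal that is the same impulse delayed by one sample, the sample spectrum is flat and the label spectrum differs only by a per-bin phase rotation, which is the kind of relationship the network is expected to learn from.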

After step 141 is performed, the initial filter coefficients obtained by the convolutional neural network are passed on together with the far-end signal, and step 140 is performed, in which the far-end signal and the initial filter coefficients (or the updated filter coefficients) are input to the adaptive filter. After step 140 is performed, step 150 follows.

Step 150: the far-end signal is input into the adaptive filter to obtain a synthesized echo, and the error signal is calculated from the near-end signal and the synthesized echo. The basic principle of echo cancellation with an adaptive filter is to use the adaptive filter to simulate the echo channel and to adjust the tap coefficients of the filter through a corresponding adaptive algorithm so that they gradually approximate the real echo path. The echo channel simulated by the adaptive filter is used to estimate the echo signal, and this estimate is subtracted from the signal collected by the microphone, thereby realizing echo cancellation. The adaptive process of the adaptive filter is as follows: the coefficients of an FIR filter (Finite Impulse Response filter) or an IIR filter (Infinite Impulse Response filter) are adjusted by an adaptive algorithm so that the error signal approaches 0. The type of filter is not limited in the present invention as long as the object of the present invention can be achieved; hereinafter, an FIR filter is used as an example. The adaptive algorithms may include the LMS algorithm (Least Mean Square adaptive filtering) and the NLMS algorithm (Normalized Least Mean Square adaptive filtering).

Specifically, the relationship between $d(n)$, $x(n)$, and $e(n)$ is as follows.

Denote the far-end speech signal input at time $n$ by $x(n)$. The speech signal input vector at time $n$ is represented by an $N \times 1$ vector $X(n)$:

$$X(n) = [x(n), x(n-1), \ldots, x(n-N+1)]^T$$

where the superscript $T$ denotes the matrix transpose.

In an echo cancellation system, the impulse response sequence of the echo system, which we call the echo path $H$, is written as:

$$H = [h_0, h_1, \ldots, h_{N-1}]^T$$

where $N$ represents the length of the vector. Thus, the generation of the echo signal can be simplified to passing the signal $x(n)$ through an FIR filter with impulse response $H$.

The near-end signal $d(n)$ includes not only the echo signal $y(n)$ but also the noise signal $v(n)$, so the near-end signal $d(n)$ can be represented as:

$$d(n) = y(n) + v(n) = H^T X(n) + v(n)$$

To estimate the echo signal, an $N$-order FIR filter $W(n)$ is used to simulate the echo channel, denoted as:

$$W(n) = [w_0(n), w_1(n), \ldots, w_{N-1}(n)]^T$$

The output signal $y'(n)$ of the FIR filter used to estimate the echo channel is the estimated echo signal, denoted as:

$$y'(n) = W^T(n) X(n)$$

Subtracting the estimated echo signal $y'(n)$ from the near-end signal $d(n)$ gives the error signal $e(n)$, which is the echo-removed signal:

$$e(n) = d(n) - W^T(n) X(n)$$

In a noisy system, once the echo cancellation system has undergone enough iterations that the FIR filter $W(n)$ simulating the echo channel is equal or approximately equal to the echo channel $H$, we have

$$e(n) \approx v(n)$$

and the echo energy is effectively suppressed.
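
The relationships above can be exercised with a short time-domain NLMS simulation. This is an illustrative sketch, not the patented procedure: the echo path values, step size `mu`, regularization `eps` and the random far-end signal are all assumptions, and the optional `w0` argument stands in for initial coefficients supplied from elsewhere, such as a trained network.

```python
import random
from typing import List, Optional, Sequence, Tuple


def simulate_near_end(h: Sequence[float], x: Sequence[float],
                      v: Optional[Sequence[float]] = None) -> List[float]:
    """d(n) = H^T X(n) + v(n), with samples before the start of x taken as 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            + (v[n] if v is not None else 0.0)
            for n in range(len(x))]


def nlms_cancel(x: Sequence[float], d: Sequence[float], num_taps: int,
                mu: float = 0.5, eps: float = 1e-8,
                w0: Optional[Sequence[float]] = None) -> Tuple[List[float], List[float]]:
    """Return the error signal e(n) and the final coefficient vector W."""
    w = list(w0) if w0 is not None else [0.0] * num_taps
    e: List[float] = []
    for n in range(len(x)):
        # X(n) = [x(n), x(n-1), ..., x(n-N+1)]^T
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y_hat = sum(wk * xk for wk, xk in zip(w, X))   # y'(n) = W^T(n) X(n)
        err = d[n] - y_hat                             # e(n) = d(n) - y'(n)
        norm = sum(xk * xk for xk in X) + eps          # normalization term of NLMS
        w = [wk + mu * err * xk / norm for wk, xk in zip(w, X)]
        e.append(err)
    return e, w


def demo(num_samples: int = 2000, seed: int = 0):
    """Noise-free example with a hypothetical three-tap echo path."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    h = [0.6, -0.3, 0.1]
    d = simulate_near_end(h, x)
    e, w = nlms_cancel(x, d, num_taps=len(h))
    return h, e, w
```

With no noise term, the coefficients returned by `demo()` approach the assumed echo path and the error signal decays toward zero, which corresponds to $e(n) \approx v(n)$ with $v(n) = 0$. Passing a good `w0` shortens the initial transient, which is the effect the network-supplied coefficients are intended to have.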

After step 150 is executed, the process proceeds to step 160, where the error signal energy is compared with the near-end signal energy. The error signal represents the difference between the near-end signal and the estimated echo; the smaller the difference, the more accurate the estimated echo path. If the echo path is estimated accurately, the energy of the error signal should always be smaller than that of the near-end signal; if the energy of the error signal is much larger than that of the near-end signal, the estimated echo path is considered inaccurate and divergent. In the energy calculation, n signal samples are converted from the time domain into frequency-domain signals, and the n frequency-domain signals are then accumulated. If the error signal energy is greater than the upper limit value of the near-end signal energy multiple, the estimated echo path cannot eliminate the echo, and step 141 is executed to recalculate the initial filter coefficients of the adaptive filter. In this determination, if the energy of the error signal is greater than 15 to 25 times the energy of the near-end signal, the energy of the error signal is considered much greater than the energy of the near-end signal, that is, greater than the upper limit value of the near-end signal energy multiple. During voice communication, the echo channel is influenced by factors such as external temperature and the motion of objects, and the indirect echo produced by one or more reflections after playback by the loudspeaker causes the echo path to change significantly under these objective factors, so that the energy of the error signal exceeds the upper limit value of the near-end signal energy multiple. If the error signal energy is not larger than the upper limit value of the near-end signal energy multiple, the echo path estimated by the adaptive filter is considered to meet the requirement of voice communication, and the echo can be eliminated. In that case, step 142 is executed to determine whether a far-end signal and an error signal are received simultaneously; if they are, step 143 is executed to update the coefficients of the adaptive filter using the far-end signal and the error signal, correcting the echo path with the features of the near-end signal and the far-end signal at the current time so as to provide a more accurate echo path estimate for the next time instant. Since step 143 takes less time than step 141, step 141 is executed, and the filter coefficients are calculated once using the convolutional neural network, only when step 160 determines that the error signal energy is much larger than the near-end signal energy (that is, larger than the upper limit value of the near-end signal energy multiple). With the echo cancellation method provided by the invention, the convolutional neural network provides the initial filter coefficients for the adaptive filter and accelerates the convergence of the echo path estimate; at the same time, the number of times the convolutional neural network is used is reduced, time consumption is kept to a minimum, and the real-time requirement of a voice communication system is met.
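
The decision of step 160 can be sketched as below. Two assumptions are made loudly here: the "accumulation" of frequency-domain signals is interpreted as summing the squared magnitudes of the spectrum bins, and a naive DFT from the standard library stands in for the time-to-frequency conversion. The function names and the default multiple of 20 (a value inside the stated 15-25 range) are illustrative only.

```python
import cmath
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def spectral_energy(frame: Sequence[float]) -> float:
    """Energy accumulated over all frequency bins (assumed: sum of squared magnitudes)."""
    return sum(abs(c) ** 2 for c in dft(frame))


def path_diverged(error_frame: Sequence[float], near_frame: Sequence[float],
                  ratio_limit: float = 20.0) -> bool:
    """True when the error energy exceeds ratio_limit times the near-end energy."""
    return spectral_energy(error_frame) > ratio_limit * spectral_energy(near_frame)
```

A `True` result corresponds to returning to step 141 for new coefficients, while `False` corresponds to continuing with step 142 and the ordinary coefficient update. Choosing a smaller multiple within the range triggers re-initialization sooner at the cost of calling the network more often.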

After step 160 is performed and step 142 is performed, optional step 170 is performed to perform residual echo suppression. In order to avoid the situation that the echo cannot be completely eliminated, the near-end signal output in step 160 is subjected to residual echo suppression to improve the quality of the near-end signal. When performing residual echo suppression, the correlation between signals, such as the correlation between a near-end signal and a far-end signal, the correlation between a near-end signal and an error signal, and the correlation between a far-end signal and an error signal, may be first calculated, and the output near-end signal may be subjected to residual echo suppression processing according to the strength of the correlation between the above signals. The purpose of residual echo suppression can be achieved by Wiener filters or spectral subtraction.
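
One possible form of the suppression stage is a Wiener-style gain applied per frequency bin, sketched below. The source does not give a formula, so everything here is an assumption: the residual echo power per bin is taken as a given input (in practice it might be derived from the signal correlations mentioned above), and `gain_floor` is a hypothetical lower bound that keeps bins from being zeroed completely.

```python
from typing import List, Sequence


def suppress_residual_echo(error_spectrum: Sequence[complex],
                           residual_power: Sequence[float],
                           gain_floor: float = 0.1) -> List[complex]:
    """Scale each bin of the error spectrum by max(1 - P_res / P_err, gain_floor)."""
    output: List[complex] = []
    for e_bin, p_res in zip(error_spectrum, residual_power):
        p_err = abs(e_bin) ** 2
        if p_err == 0.0:
            gain = 1.0  # nothing to attenuate in an empty bin
        else:
            gain = max(1.0 - p_res / p_err, gain_floor)
        output.append(gain * e_bin)
    return output
```

Bins where the estimated residual echo is small relative to the error signal pass almost unchanged, while bins dominated by residual echo are attenuated down to the floor.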

After step 170 is performed, steps 171 and 172 are performed to convert the near-end signal processed in step 170 from the frequency domain to the time domain and transmit it to the far-end device. At the same time, step 180 is executed to judge whether the current frame is the last frame of the audio sequence; if so, echo cancellation ends; if not, step 181 is executed to extract the next frame of the audio signal, and the procedure returns to step 120 to perform echo cancellation on that frame.

In summary, the echo cancellation method provided by the present invention has the following advantages:

(1) Combining a convolutional neural network with an adaptive filter accelerates convergence of the echo path estimate while maintaining high convergence accuracy. In the initial stage, the trained convolutional neural network model provides a set of initial filter coefficients for the adaptive filter, which speeds up the convergence of the adaptive filter when estimating the echo path. In addition, when the external environment changes, the echo path estimate goes from a converged state back to a diverged state (secondary divergence), and reaching the converged state again with the filter alone takes a long time. In the present invention, if the echo path diverges a second time, the convolutional neural network model is called to provide a new set of filter coefficients for the adaptive filter. This accelerates re-convergence, ensures the continuity of echo cancellation, and avoids the situation in which the filter cannot converge at a certain moment and the echo cannot be cancelled.

(2) If only the convolutional neural network were used for echo path estimation, the estimated echo path would inevitably be inaccurate at times, because current neural network training models still cannot achieve 100% accurate detection and recognition and the near-end features are obtained directly from the network. When the external environment changes, a trained neural network has poor correction and adaptation capability for the new environment and cannot estimate the echo path. Combining the convolutional neural network with the adaptive filter overcomes these defects: when the echo path estimated by the neural network training model is inaccurate, the adaptive filter corrects it, which reduces the steady-state misalignment, cancels the echo effectively, and improves the robustness of the echo cancellation method to complex environmental changes.

(3) If only the adaptive filter were used for echo cancellation, an adaptive filtering algorithm with a high convergence rate, low steady-state error and low computational complexity would have to be developed. Traditional algorithms such as LMS (least mean square adaptive filtering) have low computational complexity and are easy to implement, but are easily affected by gradient noise amplification. Traditional algorithms such as NLMS (normalized least mean square adaptive filtering) have high stability and low computational complexity, but the step size is fixed, so convergence speed must be traded off against steady-state misalignment. The echo path of each audio frame is affected by temperature, pressure, humidity and the motion of objects; it is nonlinear, non-stationary and time-varying, and it also has multipath characteristics. An optimal solution for such a complex mapping is difficult to reach with a traditional filtering algorithm alone. With the combination of the convolutional neural network and the adaptive filter, a better model can be obtained through neural network training, so that the output of the filter is equal to or approximately equal to the output of the echo channel, and an optimal echo path is provided for echo cancellation.

(4) The convolutional neural network used as the training model in the present invention has a simple network structure and low computational complexity. The convolutional neural network is called only when the echo path is in a diverged state; when the echo path is in a converged state, it is estimated using the adaptive filter. Compared with a method that cancels echo using only a convolutional neural network, the method provided by the invention takes less time and better meets the real-time requirement of a voice communication system. In the invention, the training samples and target samples of the convolutional neural network are preferably frequency-domain data rather than time-domain audio data, because the near-end signal and the echo signal in a mixed audio signal are easier to distinguish in the frequency domain, which helps extract the feature information of the echo signal during training.
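As an illustration of the fixed-step NLMS update mentioned in advantage (3) and of why a good set of initial coefficients helps, the following is a minimal sketch only. It works on time-domain samples with a simulated, noise-free echo path; the function names, the step size of 0.5 and the regularization constant are assumptions made for this example and are not taken from the invention.

```python
def nlms_step(weights, far_end_taps, near_end_sample, step_size=0.5, eps=1e-8):
    """One NLMS iteration for an FIR echo-path estimate (illustrative only).

    weights       -- current filter coefficients
    far_end_taps  -- most recent far-end samples, newest first, same length
    Returns (new_weights, error_sample).
    """
    # Synthesized echo y(n): FIR filter output for the current tap vector.
    y = sum(w * x for w, x in zip(weights, far_end_taps))
    # Error signal e(n) = d(n) - y(n).
    e = near_end_sample - y
    # Normalize the fixed step by the tap-vector energy; eps avoids division by zero.
    norm = sum(x * x for x in far_end_taps) + eps
    new_weights = [w + step_size * e * x / norm
                   for w, x in zip(weights, far_end_taps)]
    return new_weights, e


def identify(echo_path, far_end, initial_weights, step_size=0.5):
    """Run NLMS against a simulated echo path (assumed known, for testing only)."""
    weights = list(initial_weights)
    taps = [0.0] * len(echo_path)
    abs_errors = []
    for x in far_end:
        taps = [x] + taps[:-1]                            # shift the delay line
        d = sum(h * t for h, t in zip(echo_path, taps))   # near-end = echo only
        weights, e = nlms_step(weights, taps, d, step_size)
        abs_errors.append(abs(e))
    return weights, abs_errors
```

Calling `identify` once with all-zero initial weights and once with initial weights close to the simulated path shows the intended effect: both runs reach the same coefficients, but the run that starts near the true path has a much smaller error during the first samples. In the invention, the initial coefficients come from the trained convolutional neural network rather than from a known path.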

In another aspect of the present invention, an echo cancellation device is also provided, as shown in fig. 3. The echo cancellation device includes a first echo path generator, a second echo path generator, a decider, and a residual echo suppressor.

The first echo path generator is used for receiving the far-end signal, updating the filter coefficients at the start of the audio by means of a trained convolutional neural network model, and outputting an estimated echo path as the initial reference value of the second echo path generator. The first echo path generator is further configured to update the filter coefficients and reconfigure them for the second echo path generator when the decider determines that the error signal energy is substantially greater than the near-end signal energy.

The second echo path generator is used for receiving the far-end signal and the updated filter coefficients, taking the updated filter coefficients as the coefficients of its adaptive filter model, and outputting the synthesized echo. The updated filter coefficients are obtained as follows: at the initial stage of the audio, and whenever the decider determines that the error signal energy is greater than the upper limit of the near-end signal energy multiple, the first echo path generator is called to update the filter coefficients; when the decider determines that the error signal energy is less than the near-end signal energy, the filter coefficients are updated according to the error signal and the far-end signal.

The decider is used for receiving the near-end signal and the error signal and comparing the energy of the error signal with the energy of the near-end signal. If the error signal energy is greater than the upper limit of the near-end signal energy multiple, the decider calls the first echo path generator to reconfigure the filter coefficients for the second echo path generator. If the error signal energy is not greater than the upper limit of the near-end signal energy multiple, residual echo suppression is applied to the output near-end signal (which at this point has already undergone the echo cancellation operation), the result is transmitted to the far-end device, and at the same time the filter coefficients of the second echo path generator are updated using the far-end signal and the error signal.
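The energy comparison performed by the decider can be sketched as follows. This is a minimal illustration, assuming frame energy is the sum of squared samples; the function names are hypothetical, and the default multiple of 20 is simply one value inside the 15 to 25 range given in claim 10.

```python
def frame_energy(frame):
    """Energy of one frame, taken here as the sum of squared samples (assumption)."""
    return sum(s * s for s in frame)


def error_exceeds_limit(error_frame, near_end_frame, multiple=20.0):
    """Return True when the error energy exceeds the upper limit.

    True  -> call the first echo path generator to reconfigure the coefficients.
    False -> keep adapting the second echo path generator and pass the frame on
             to residual echo suppression.
    """
    upper_limit = multiple * frame_energy(near_end_frame)
    return frame_energy(error_frame) > upper_limit
```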

The residual echo suppressor is used for suppressing the residual echo according to the error signal output by the decider. When performing residual echo suppression, the correlations between signals may first be calculated, such as the correlation between the near-end signal and the far-end signal, the correlation between the near-end signal and the error signal, and the correlation between the far-end signal and the error signal, and residual echo suppression is then applied to the output near-end signal according to the strength of these correlations.
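One possible way to turn correlation strength into suppression is sketched below. The text does not specify how the correlations are mapped to a suppression amount, so the rule used here (a gain of one minus the normalized zero-lag correlation between the far-end and error frames, floored at `min_gain`) is purely an assumption for illustration, as are the function names.

```python
import math


def normalized_correlation(a, b, eps=1e-12):
    """Magnitude of the zero-lag normalized cross-correlation, in [0, 1]."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b)) + eps
    return abs(num) / den


def suppress_residual_echo(error_frame, far_end_frame, min_gain=0.1):
    """Attenuate the error frame more when it still resembles the far-end signal.

    Assumed rule: gain = max(min_gain, 1 - correlation). A strong far-end/error
    correlation suggests residual echo, so the frame is attenuated harder.
    Returns (suppressed_frame, gain).
    """
    rho = normalized_correlation(far_end_frame, error_frame)
    gain = max(min_gain, 1.0 - rho)
    return [gain * s for s in error_frame], gain
```

A practical suppressor would more likely work per frequency bin, for example with the Wiener filter or spectral subtraction named in claim 7; the single broadband gain above only shows the principle.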

The echo cancellation device provided by the invention is used together with the echo cancellation method. In operation, the device first judges whether the far-end signal is the first frame. If it is the first frame, the far-end signal is input into both the first echo path generator and the second echo path generator; the first echo path generator uses the trained convolutional neural network model to output an estimated echo path and updates the filter coefficients, which serve as the initial reference value of the second echo path generator. The second echo path generator processes the far-end signal according to the received initial reference value and outputs a synthesized echo y(n). If the input far-end signal is not the first frame, it is input directly into the second echo path generator, which processes it and outputs the synthesized echo y(n). The synthesized echo y(n) is subtracted from the near-end signal d(n) to obtain the error signal e(n). The decider receives the error signal e(n) and the near-end signal d(n) at the same time and compares the energy of the error signal with the energy of the near-end signal. If the error signal energy is greater than the upper limit of the near-end signal energy multiple, the first echo path generator is called to update the filter coefficients and reconfigure them for the second echo path generator. If the error signal energy is not greater than the upper limit of the near-end signal energy multiple, the filter coefficients of the second echo path generator are updated using the far-end signal and the error signal; meanwhile, the residual echo suppressor applies residual echo suppression to the output near-end signal, and the processed near-end signal is then transmitted to the far-end device.
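The control flow described above can be sketched as a per-frame loop. This is a simplified illustration under several assumptions: it works on time-domain frames rather than the preferred frequency-domain data, the first echo path generator is represented by a hypothetical callable `estimate_path` standing in for the trained convolutional neural network, the coefficient update is a block-normalized LMS step chosen for this example, and residual echo suppression is omitted. A frame that triggers reconfiguration is simply recorded and not reprocessed here.

```python
def process_frames(far_frames, near_frames, estimate_path, n_taps,
                   multiple=20.0, step_size=0.5, eps=1e-8):
    """Run the first-frame / decider / update flow over a sequence of frames.

    estimate_path -- hypothetical stand-in for the first echo path generator;
                     takes a far-end frame, returns n_taps filter coefficients.
    Returns (transmitted_frames, final_weights, reconfigured_frame_indices).
    """
    weights = [0.0] * n_taps
    taps = [0.0] * n_taps            # far-end delay line, newest sample first
    transmitted, reconfigured = [], []
    for index, (far, near) in enumerate(zip(far_frames, near_frames)):
        if index == 0:
            # First frame: initial coefficients come from the network model.
            weights = list(estimate_path(far))
        tap_history, error = [], []
        for x, d in zip(far, near):
            taps = [x] + taps[:-1]
            y = sum(w * t for w, t in zip(weights, taps))   # synthesized echo y(n)
            tap_history.append(taps)
            error.append(d - y)                             # e(n) = d(n) - y(n)
        error_energy = sum(e * e for e in error)
        near_energy = sum(d * d for d in near)
        if error_energy > multiple * near_energy:
            # Diverged: ask the first echo path generator for new coefficients.
            weights = list(estimate_path(far))
            reconfigured.append(index)
            continue
        # Converged: adapt using the far-end and error signals (block NLMS, assumed).
        norm = sum(t * t for vec in tap_history for t in vec) + eps
        for k in range(n_taps):
            grad = sum(e * vec[k] for e, vec in zip(error, tap_history))
            weights[k] += step_size * grad / norm
        transmitted.append(error)    # would go through residual echo suppression
    return transmitted, weights, reconfigured
```

With a stub `estimate_path` that returns a rough estimate of a simulated echo path, the loop never reconfigures and the adaptive update refines the coefficients frame by frame. If the near-end frames are silent while the filter still synthesizes echo, every frame exceeds the limit and the stub is called again each time.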

With this echo cancellation device, the filter parameters of the second echo path generator are configured by combining the convolutional neural network with the adaptive filter, which accelerates convergence of the echo path, ensures higher convergence accuracy, and achieves a better echo cancellation effect.

Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. An echo cancellation method, comprising the steps of:

(2) outputting the far-end signal to a filter coefficient adjusting module;

(5) inputting the far-end signal to an adaptive filter;

(7) judging whether the energy of the error signal is larger than the upper limit value of the energy multiple of the near-end signal, if so, returning to the step (4); if the energy of the error signal is not more than the energy of the near-end signal, transmitting the error signal to a filter coefficient adjusting module, judging whether a far-end signal and the error signal are received at the same time, if the far-end signal and the error signal are received at the same time, updating the filter coefficient by using the far-end signal and the error signal, and executing the step (8);

(8) transmitting the processed near-end signal to a far-end device;

2. The method of echo cancellation according to claim 1, wherein said trained neural network comprises an input layer, a convolutional layer, an activation layer, a pooling layer, a fully-connected layer, and an output layer.

3. The method of echo cancellation according to claim 2, wherein the training samples of the trained neural network are frequency-domain far-end signals and the target samples are frequency-domain near-end signals.

4. The echo cancellation method of claim 1, wherein said step (1) comprises converting the audio signal from the time domain to the frequency domain; and said step (8) comprises converting the processed near-end signal from the frequency domain to the time domain and transmitting the converted near-end signal to a far-end device.

5. An echo cancellation method according to claim 1, wherein said step (8) comprises suppressing residual echo and transmitting the processed near-end signal to the far-end device.

6. An echo cancellation method according to claim 5, wherein in suppressing residual echo, suppression is performed based on correlations between signals including at least one of the following signal correlations: the correlation of the near-end signal and the far-end signal, the correlation of the near-end signal and the error signal, and the correlation of the far-end signal and the error signal.

7. An echo cancellation method according to claim 5, characterized in that in suppressing the residual echo, the residual echo suppression is performed by means of a wiener filter or a spectral subtraction.

8. An echo cancellation method according to claim 1, characterized in that said adaptive filter is one of the following filters: FIR filters or IIR filters.

9. The echo cancellation method of claim 1, wherein said adaptive filter employs one of the following adaptive algorithms: LMS algorithm or NLMS algorithm.

10. An echo cancellation method according to claim 1, wherein the upper limit of the multiple of the energy of the near-end signal is 15 to 25 times the energy of the near-end signal.

11. An echo cancellation device using one of the echo cancellation methods of claims 1-10, wherein said echo cancellation device comprises: a first echo path generator, a second echo path generator and a decider, wherein:

the first echo path generator is used for receiving a far-end signal, updating a filter coefficient at an audio starting position by adopting a trained convolutional neural network model, and outputting an estimated echo path as a starting reference value of the second echo path generator; the first echo path generator is further configured to update the filter coefficient and reconfigure the filter coefficient for the second echo path generator when the decision device determines that the error signal energy is greater than the upper limit of the near-end signal energy multiple;

12. The echo cancellation device of claim 11, further comprising a residual echo suppressor for performing residual echo suppression on the output near-end signal when the decision device determines that the error signal energy is not greater than an upper limit of a multiple of the near-end signal energy.