CN114171061A

CN114171061A - Time delay estimation method, equipment and storage medium

Info

Publication number: CN114171061A
Application number: CN202111682261.2A
Authority: CN
Inventors: 朱赛男; 修平平; 严涛; 浦宏杰; 曹李军
Original assignee: Suzhou Keda Special Video Co ltd; Suzhou Keda Technology Co Ltd
Current assignee: Suzhou Keda Special Video Co ltd; Suzhou Keda Technology Co Ltd
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-03-11

Abstract

The application discloses a time delay estimation method, equipment and a storage medium, and belongs to the technical field of communication. The method comprises the steps of obtaining a current frame signal and a reference signal corresponding to the current frame signal; determining a signal search range of the reference signal to obtain n frames of target reference signals in the signal search range; determining a first voice activity state of a current frame signal and a second voice activity state of a target reference signal; determining the reliability of the current frame signal based on the first voice activity state and the second voice activity state; respectively determining a correlation value between a current frame signal and at least one frame of target reference signal under the condition that the value of the reliability is greater than a reliability threshold value; and determining a time delay estimation value corresponding to the current frame signal by using the correlation value. The problem that due to the existence of noise, the peak value of a correlation value is not obvious, and the result of time delay estimation is influenced by a pseudo peak, so that the result of time delay estimation is inaccurate can be solved; the accuracy of the time delay estimation is improved.

Description

Time delay estimation method, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a storage medium for time delay estimation.

Background

The time delay estimation refers to estimating the time difference between homologous signals received at different time instants. The time delay estimation is widely applied to echo cancellation scenes, sound source positioning scenes and other scenes.

The conventional delay estimation method includes: acquiring a current frame voice signal and a voice signal to be aligned; determining a signal search range of the voice signals to be aligned, wherein the signal search range comprises at least one frame of voice signals to be aligned; calculating a correlation value between the current frame voice signal and at least one frame voice signal to be aligned; and determining the time difference between the frame of speech signal to be aligned corresponding to the peak value of the correlation value and the current frame of speech signal as a time delay estimated value.

However, in an actual use environment, due to the existence of noise (not limited to stationary background noise, but also reverberant sound and speaking sound of other people during a conference, etc.), the peak value of the correlation value is not obvious, and the result of the delay estimation is affected by the pseudo peak, thereby causing a problem that the result of the delay estimation is not accurate.

Disclosure of Invention

The application provides a time delay estimation method, a time delay estimation device and a storage medium, which can solve the problem that in an actual use environment, due to the existence of noise, the peak value of a correlation value is not obvious, and the time delay estimation result is influenced by a pseudo peak, so that the time delay estimation result is not accurate. The application provides the following technical scheme:

In a first aspect, a time delay estimation method is provided, including: acquiring a current frame signal and a reference signal corresponding to the current frame signal; determining a signal search range of the reference signal to obtain n frames of target reference signals in the signal search range, wherein n is a positive integer; determining a first voice activity state of the current frame signal and a second voice activity state of the target reference signal; determining the reliability of the current frame signal based on the first voice activity state and the second voice activity state; respectively determining a correlation value between the current frame signal and at least one frame of the target reference signals under the condition that the value of the reliability is greater than a reliability threshold value; and determining the time delay estimation value corresponding to the current frame signal by using the correlation value.

Optionally, determining the reliability of the current frame signal based on the first voice activity state and the second voice activity state includes: under the condition that the first voice activity state indicates that a voice signal exists in the current frame signal, determining, from the n frames of target reference signals and based on the last determined time delay estimation value, a pre-aligned reference signal and nearby reference signals whose time difference from the pre-aligned reference signal is smaller than or equal to a preset threshold value, wherein the last determined time delay estimation value is initialized to 0; and determining the reliability of the current frame signal based on the second voice activity state of the pre-aligned reference signal and the second voice activity state of the nearby reference signals.

Optionally, determining the reliability of the current frame signal based on the second voice activity state of the pre-aligned reference signal and the second voice activity state of the nearby reference signal includes: weighting the second voice activity state of the pre-aligned reference signal by using a Hanning window with a preset length to obtain a first weighted value; weighting the second voice activity state of the nearby reference signal by using the Hanning window with the preset length to obtain a second weighted value; and determining the sum of the first weighted value and the second weighted value to obtain the reliability; wherein the preset length is determined based on the preset threshold.

Optionally, after determining the time delay estimation value corresponding to the current frame signal by using the correlation value, the method further includes: determining, from the search range, m verification reference signals corresponding to random verification time delays, wherein m is an integer larger than 2; calculating a correlation value between the current frame signal and each verification reference signal; determining the target random verification time delays corresponding to the two verification reference signals whose correlation values with the current frame signal are the largest, to obtain a time delay verification range, wherein the upper limit value of the time delay verification range is the larger of the two target random verification time delays and the lower limit value is the smaller of the two; and verifying the time delay estimation value by using the time delay verification range.

Optionally, verifying the time delay estimation value by using the time delay verification range includes: verifying that the time delay estimation value is correct under the condition that the time delay estimation value is within the time delay verification range; determining whether the correlation value corresponding to the time delay estimation value is greater than the correlation value corresponding to the target random verification time delay under the condition that the time delay estimation value is not within the time delay verification range; verifying that the time delay estimation value is correct under the condition that the correlation value corresponding to the time delay estimation value is larger than the correlation value corresponding to the target random verification time delay; and, under the condition that the correlation value corresponding to the time delay estimation value is smaller than the correlation value corresponding to the target random verification time delay, verifying that the time delay estimation value is wrong and determining the target random verification time delay with the maximum correlation value as the final time delay estimation value.
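The verification logic described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `verify_delay` and the dictionary `check_corrs` are hypothetical, the comparison is made against the larger of the two target correlation values, and a tie is resolved in favor of the existing estimate, none of which the text specifies.

```python
def verify_delay(estimate, estimate_corr, check_corrs):
    """Check a delay estimate against random verification delays.

    check_corrs maps each random verification delay to the correlation
    value between the current frame and the reference frame at that delay.
    Returns the delay that should be used as the final estimate.
    """
    if len(check_corrs) < 2:
        raise ValueError("at least two verification delays are required")

    # Target delays: the two verification delays with the largest correlation.
    top_two = sorted(check_corrs, key=check_corrs.get, reverse=True)[:2]
    lower, upper = min(top_two), max(top_two)

    # Inside the verification range: the estimate is accepted.
    if lower <= estimate <= upper:
        return estimate

    # Outside the range: compare correlation values (tie handling is an
    # assumption; the text only covers "greater" and "smaller").
    best_check = top_two[0]
    if estimate_corr >= check_corrs[best_check]:
        return estimate
    return best_check


# Hypothetical verification delays and correlation values.
checks = {3: 0.2, 10: 0.7, 14: 0.6, 30: 0.1}
inside = verify_delay(12, 0.5, checks)
outside_strong = verify_delay(20, 0.9, checks)
outside_weak = verify_delay(20, 0.3, checks)
```

In this example the verification range is [10, 14], so an estimate of 12 is accepted directly, while an estimate of 20 is kept only if its correlation value is not below that of the best verification delay.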

Optionally, determining, from the search range, m verification reference signals corresponding to random verification time delays includes: generating the m random verification time delays by using chaotic mapping; and extracting, from the search range, the verification reference signals whose time delays relative to the current frame signal equal the m random verification time delays.
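The text does not say which chaotic map is used. As one possible illustration, the sketch below draws m distinct delays from the logistic map x ← 4x(1 − x); the choice of map, the seed value, and the name `chaotic_delays` are all assumptions made here.

```python
def chaotic_delays(m, max_delay, seed=0.37):
    """Generate m distinct delays in [0, max_delay) from a logistic map.

    The logistic map is an assumed choice; the seed must lie strictly
    between 0 and 1 and should avoid fixed points such as 0.75.
    """
    if not 0.0 < seed < 1.0:
        raise ValueError("seed must lie strictly between 0 and 1")
    if m > max_delay:
        raise ValueError("cannot draw more distinct delays than max_delay")

    delays = []
    x = seed
    # Bound the number of iterations so a degenerate orbit cannot loop forever.
    for _ in range(1000 * m):
        x = 4.0 * x * (1.0 - x)
        delay = min(int(x * max_delay), max_delay - 1)
        if delay not in delays:
            delays.append(delay)
        if len(delays) == m:
            return delays
    raise RuntimeError("chaotic sequence did not yield enough distinct delays")


random_delays = chaotic_delays(5, 100)
```

Because the sequence wanders over the whole interval, the generated delays sample the full search range rather than only the neighborhood of the previous estimate.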

Optionally, the respectively determining a correlation value between the current frame signal and the at least one frame target reference signal includes: respectively carrying out band-pass weighting on the current frame signal and the target reference signal according to the distribution frequency of the voice signal; and calculating a correlation value between the current frame signal after band-pass weighting and the target reference signal after band-pass weighting.

Optionally, the signal search range is determined based on a preset maximum delay estimation value.

In a second aspect, an electronic device is provided, which includes a memory, a controller, and a computer program stored on the memory and executable on the controller, and the controller implements the steps of the time delay estimation method when executing the computer program.

In a third aspect, a computer-readable storage medium is provided, in which a program is stored, and the program is used for implementing the time delay estimation method provided in the first aspect when being executed by a processor.

The beneficial effects of this application include at least: acquiring a current frame signal and a reference signal corresponding to the current frame signal; determining a signal search range of the reference signal to obtain n frames of target reference signals in the signal search range; determining a first voice activity state of a current frame signal and a second voice activity state of a target reference signal; determining the reliability of the current frame signal based on the first voice activity state and the second voice activity state; respectively determining a correlation value between a current frame signal and at least one frame of target reference signal under the condition that the value of the reliability is greater than a reliability threshold value; and determining a time delay estimation value corresponding to the current frame signal by using the correlation value. The problem that due to the existence of noise, the peak value of a correlation value is not obvious, and the result of time delay estimation is influenced by a pseudo peak, so that the result of time delay estimation is inaccurate can be solved; the reliability of the current frame signal is determined based on the first voice activity state of the current frame signal and the second voice state of the target reference signal, and the time delay estimation value is determined through the correlation value under the condition that the reliability is greater than the reliability threshold value, so that the accuracy of time delay estimation can be improved. Meanwhile, under the condition that the first voice activity state indicates that the current frame signal has the voice signal, the reliability calculation is carried out, so that the calculation resource can be saved.

In addition, the reference signals used to calculate the reliability are selected according to the last determined time delay estimation value, so the reliability does not have to be calculated from the second voice activity states of all reference signals in the signal search range, which improves the calculation speed of the reliability.

In addition, weighting the second voice activity states with the Hanning window when calculating the reliability removes jitter, weakens the influence of missed detections and false detections in voice detection, and improves the accuracy of the reliability calculation.

In addition, when calculating the correlation value, respectively carrying out band-pass weighting on the current frame signal and the target reference signal according to the distribution frequency of the voice signal; calculating a correlation value between the current frame signal after band-pass weighting and the target reference signal after band-pass weighting; the influence of noise and interference sound on other frequency bands can be further avoided, and the accuracy of correlation value calculation is improved.

In addition, a correlation value between the current frame signal and the verification reference signal is calculated by determining m verification reference signals corresponding to random verification time delays in the search range; determining target random verification time delays corresponding to two verification reference signals with the maximum correlation value between the two verification reference signals and the current frame signal to obtain a time delay verification range; verifying the time delay estimation value by using a time delay verification range; the method can determine whether the delay jitter phenomenon occurs or not, can quickly determine the delay range and perform fine delay estimation in time once the delay jitter phenomenon occurs, can improve the accuracy of delay estimation, and can also ensure the speed of delay estimation.

In addition, the m random verification time delays are generated by chaotic mapping, so the global range can be monitored and the time delay estimate is kept from being trapped at a local peak. This reduces the cost of a global search, improves the accuracy of time delay estimation, and maintains the speed of time delay estimation while reducing calculation consumption.

In addition, the signal search range is determined based on the preset maximum time delay estimation value, voice detection is not needed to be carried out on all reference signals, and the time delay estimation speed can be guaranteed.

Drawings

In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the technical solutions in the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a method for delay estimation according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a delay estimation process according to an embodiment of the present application;

Fig. 3 is a block diagram of a delay estimation apparatus according to an embodiment of the present application;

Fig. 4 is a block diagram of an electronic device provided by an embodiment of the application.

Detailed Description

The technical solutions of the present application will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are only some embodiments of the present application, but not all embodiments. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

In this application, unless stated to the contrary, directional words such as "upper, lower, top and bottom" are generally used with respect to the orientation shown in the drawings, or with respect to the component itself in the vertical, perpendicular or gravitational direction; likewise, for ease of understanding and description, "inner and outer" refer to the inner and outer relative to the profile of the components themselves, but the above directional words are not intended to limit the application.

The delay estimation method provided by the present application is described in detail below.

Optionally, the execution main body of each embodiment in the application may be an electronic device, and the electronic device may be a terminal such as a computer, a mobile phone, a tablet computer, a personal computer, a video conference terminal, or may also be a server, and the implementation manner of the electronic device is not limited in this embodiment.

Optionally, the application scenarios of the delay estimation method provided by the present application include, but are not limited to:

First: echo cancellation scenarios. In this case, the delay estimation result is used to cancel the far-end echo.

Second: sound source localization scenarios. In this case, the delay estimation result is used to calculate the sound source position.

In practical implementation, the application scenarios of the delay estimation method may also be other scenarios, and the application scenarios of the delay estimation method are not listed one by one in this embodiment.

As shown in fig. 1, an embodiment of the present application provides a delay estimation method, which at least includes the following steps:

Step 101, acquiring a current frame signal and a reference signal corresponding to the current frame signal.

Schematically, an audio acquisition end acquires a local audio and a reference audio corresponding to the local audio, wherein a current frame signal is obtained by performing framing, windowing and time-frequency conversion processing on the local audio; the reference signal is obtained by performing framing, windowing and time-frequency conversion processing on the reference audio. The audio capturing end may be an electronic device or another device communicatively connected to the electronic device, and the implementation manner of the audio capturing end is not limited in this embodiment.

The reference audio and the local audio are audio data which are emitted by the same sound source and collected at different collecting time. That is, the present embodiment is used to estimate the time delay between the reference audio and the local audio.

Framing refers to dividing audio data into at least two short segments of equal length, each short segment being called a frame.

Audio signals usually need to be converted to the frequency domain, and time-frequency conversion requires the input signal to be stationary. A speech signal is non-stationary over long intervals but stationary over short ones; that is, it has short-time stationarity. Therefore, in the case that the application scenario of the time delay estimation includes speech output, the local audio and the reference audio are each framed with the short-time stationarity of speech in mind, including: framing according to a preset frame shift. That is, there is an overlapping portion between two adjacent frame signals.

Optionally, the difference between the start positions of two adjacent frames is referred to as frame shift, and the preset frame shift may be 50% or a value smaller than 50%, and the value of the preset frame shift is not limited in this embodiment.
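Framing with a frame shift can be sketched as follows. The function name `split_into_frames` is hypothetical, and the frame shift is expressed here in samples, so a 50% shift corresponds to half the frame length; dropping an incomplete trailing frame is also an assumption of this sketch.

```python
def split_into_frames(samples, frame_len, frame_shift):
    """Split a sequence into overlapping frames of equal length.

    frame_shift is the distance, in samples, between the start positions
    of two adjacent frames. Trailing samples that do not fill a complete
    frame are dropped.
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += frame_shift
    return frames


# Frame length 4 with a frame shift of 2 samples (50% of the frame length).
frames = split_into_frames(list(range(10)), frame_len=4, frame_shift=2)
```

Each frame shares its second half with the first half of the next frame, which is the overlap referred to above.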

Windowing refers to the process of multiplying a window function by the original time-domain signal.

Because the time-frequency conversion assumes that the input signal is periodic, the framed local audio signal and far-end reference audio signal also need to be windowed if the selected current frame signal is not a periodic signal. Windowing the local audio and the reference audio respectively includes: windowing using a Hanning window function w₀.

The Hanning window function is expressed by the formula:

w₀(n) = (1/2)[1 + cos(2πn/(N-1))], n ∈ [-(N-1)/2, (N-1)/2]

where N is the length of the Hanning window, for example 1024. In actual implementation, N may take other values, and this embodiment does not limit the value of N.

The windowed reference signal is represented by:

x(n) = x(n)w₀(n)

The windowed current frame signal is represented by:

d(n) = d(n)w₀(n)

After windowing, the middle part of each frame is emphasized and the two ends are attenuated. In this embodiment, the influence of the window function is weakened by letting adjacent frames share a section of the signal during framing, which preserves the integrity of the information.
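The window formula and the windowing step can be illustrated with a short sketch. The function names are hypothetical; the only step added here is mapping the array index i = 0, ..., N-1 to the symmetric index n = i − (N−1)/2 used in the formula.

```python
import math


def hanning_window(length):
    """Evaluate w0(n) = 0.5 * (1 + cos(2*pi*n / (N - 1))) for a window of N points.

    The symmetric index n runs from -(N-1)/2 to (N-1)/2; array position i
    corresponds to n = i - (N-1)/2.
    """
    if length < 2:
        raise ValueError("window length must be at least 2")
    half = (length - 1) / 2.0
    return [
        0.5 * (1.0 + math.cos(2.0 * math.pi * (i - half) / (length - 1)))
        for i in range(length)
    ]


def apply_window(frame, window):
    """Multiply a time-domain frame by the window, sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [sample * weight for sample, weight in zip(frame, window)]


window = hanning_window(5)
windowed = apply_window([2.0, 2.0, 2.0, 2.0, 2.0], window)
```

The window is 1 at the center of the frame and falls to 0 at both ends, which is why the overlap between adjacent frames is needed to keep the attenuated samples represented.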

The time-frequency conversion processing refers to the transformation of a complex time signal into a structural form represented by frequency components.

Performing time-frequency conversion processing on the local audio and the reference audio respectively includes: performing a fast Fourier transform on the windowed signal to obtain a frequency domain signal.

The fast Fourier transform of the reference audio is represented by:

where N is the length of the Hanning window, l is the frame index, k is the frequency bin of the l-th frame signal, and x(·) represents the time domain signal corresponding to the reference signal.

The fast Fourier transform of the local audio is represented by:

where N is the length of the Hanning window, l is the frame index, k is the frequency bin of the l-th frame signal, and d(·) represents the time domain signal corresponding to the current frame signal.
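Assuming a standard N-point discrete Fourier transform of each windowed frame, the two transforms can be written in the following form. The subscript l on the time domain signals, denoting the windowed samples of the l-th frame, is notation introduced here for illustration and does not come from the original text.

```latex
X(l,k) = \sum_{n=0}^{N-1} x_l(n)\, e^{-j 2\pi k n / N},
\qquad
D(l,k) = \sum_{n=0}^{N-1} d_l(n)\, e^{-j 2\pi k n / N}
```

Here X(l, k) is the frequency domain reference signal and D(l, k) is the frequency domain current frame signal, matching the symbols used in the later steps.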

The current frame signal is a frame frequency domain signal to be subjected to time delay estimation currently in the local audio, and the reference signal is each frame frequency domain signal in the reference audio.

Step 102, determining a signal search range of the reference signal, and obtaining n frames of target reference signals in the signal search range, wherein n is a positive integer.

Optionally, the signal search range is determined based on a preset maximum delay estimate. Specifically, the maximum value of the time delay between the target reference signal and the reference signal in the signal search range is greater than or equal to the maximum time delay estimation value.

In other embodiments, the signal search range may also be all reference signals, and the setting manner of the signal search range is not limited in this embodiment.

Step 103, determining a first voice activity state of the current frame signal and a second voice activity state of the target reference signal.

Optionally, the electronic device detects the first voice activity state of the current frame signal and the second voice activity state of the target reference signal using a voice activity detection (VAD) algorithm.

In one example, a voice activity state of 1 indicates that speech is present in the corresponding signal (the current frame signal for the first voice activity state, the target reference signal for the second voice activity state), and a voice activity state of 0 indicates that no speech is present.

After obtaining the second voice activity state of the target reference signal, the electronic device stores the second voice activity state to the first buffer queue QueueVad. Taking the upper limit value of the signal search range as the maximum time delay estimation value as an example, the length of the first buffer queue is the maximum time delay estimation value MAXDELAY, and the initialization value of each buffer position in the first buffer queue is 0.

The first buffer queue QueueVad is expressed by the following formula:

QueueVad = [vad_x QueueVad(1:(MAXDELAY-1))]

where vad_x represents the second voice activity state of the newest target reference signal; it is placed at the head of the queue, and the oldest entry is dropped.

In addition, the electronic device stores the n target reference signals to the second buffer queue QueueRef. Taking the upper limit value of the signal search range being the maximum time delay estimation value MAXDELAY as an example, the value of n is MAXDELAY; accordingly, the length of the second buffer queue QueueRef is MAXDELAY × N/2, and the initialization value of each buffer position in the second buffer queue QueueRef is 0.

The second buffer queue QueueRef is represented by:

QueueRef = [X(l,k) QueueRef(1:(MAXDELAY-1)*N/2)]

wherein, X (l, k) is used to represent a target reference signal, l represents a current frame, k represents data of the target reference signal corresponding to the current frame, MAXDELAY represents a maximum time delay estimation value, and N is a length of a hanning window.
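The two buffer queues behave like fixed-length queues in which the newest entry is pushed at the head and the oldest falls off the tail. A minimal sketch follows; the small sizes are hypothetical, `BINS_PER_FRAME` stands in for the N/2 values stored per frame, and QueueRef is held as a queue of per-frame lists rather than one flat array, which is a simplification made here.

```python
from collections import deque

MAXDELAY = 4        # hypothetical small value for illustration
BINS_PER_FRAME = 3  # stands in for N/2 spectrum values per frame

# Both queues are initialized to 0 at every position.
queue_vad = deque([0] * MAXDELAY, maxlen=MAXDELAY)
queue_ref = deque(
    [[0j] * BINS_PER_FRAME for _ in range(MAXDELAY)], maxlen=MAXDELAY
)


def push_reference_frame(vad_x, spectrum):
    """Insert the newest reference frame at the head of both queues.

    With maxlen set, appendleft discards the oldest entry at the tail,
    so index d always refers to the frame that is d frames old.
    """
    queue_vad.appendleft(vad_x)
    queue_ref.appendleft(list(spectrum))


push_reference_frame(1, [1 + 0j, 2 + 0j, 3 + 0j])
push_reference_frame(0, [4 + 0j, 5 + 0j, 6 + 0j])
```

Indexing the queues by delay in this way lets the later steps look up the reference frame and its voice activity state for any candidate delay directly.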

Step 104, determining the reliability of the current frame signal based on the first voice activity state and the second voice activity state.

The reliability indicates how trustworthy a correlation value calculated from the current frame signal is. When the reliability is low, the correlation value may not be accurate enough even if it is calculated from the current frame signal; in this case, the correlation value need not be used for delay estimation.

Optionally, determining the reliability of the current frame signal based on the first voice activity state and the second voice activity state includes the following cases:

First, in the case where the first voice activity state is 0, the reliability of the current frame signal is 0.

When the first voice activity state is 0, no voice signal is present in the current frame signal. Time delay estimation for the current frame signal is then meaningless, so the reliability of the current frame signal is determined to be 0.

Secondly, in the case that the first voice activity state is not 0, the voice state near the pre-aligned reference signal in the target reference signal is used to represent the confidence level of the current frame signal.

Specifically, determining the confidence level of the current frame signal based on the first voice activity state and the second voice activity state comprises: under the condition that the first voice activity state indicates that a voice signal exists in a current frame signal, determining a pre-alignment reference signal and a nearby reference signal of which the time difference with the pre-alignment reference signal is smaller than or equal to a preset threshold value from n frames of target reference signals based on a last determined time delay estimation value; the confidence level of the current frame signal is determined based on the second speech activity state of the pre-aligned reference signal and the second speech activity state of the nearby reference signal.

Wherein, the time delay estimated value determined last time is initialized to 0.

The pre-aligned reference signal is the frame, among the n frames of target reference signals, whose delay equals the last determined time delay estimation value.

The preset threshold is determined based on the selection length stored in the electronic device, and illustratively, the selection length is deltaT, and at this time, the preset threshold is (deltaT-1)/2. The selected length may be 5 or 6, or may be adjusted according to actual needs, and the embodiment does not limit the implementation manner of the selected length.

In one example, the electronic device computes a weighted sum of the second voice activity state of the pre-aligned reference signal and the second voice activity state of the nearby reference signal to obtain the reliability.

However, the detected voice activity states of a speech signal exhibit jitter. Based on this, in another example, determining the reliability of the current frame signal based on the second voice activity state of the pre-aligned reference signal and the second voice activity state of the nearby reference signal includes: weighting the second voice activity state of the pre-aligned reference signal by using a Hanning window with a preset length to obtain a first weighted value; weighting the second voice activity state of the nearby reference signal by using the Hanning window with the preset length to obtain a second weighted value; and determining the sum of the first weighted value and the second weighted value to obtain the reliability; wherein the preset length is determined based on the preset threshold.

Because the Hanning window removes jitter, it weakens the influence of missed detections and false detections in voice detection and improves the accuracy of the reliability calculation. The length of the Hanning window w₁ is the selection length.

Specifically, the reliability calculation process is expressed by the following formula:

where deltaT is the selection length, R is the reliability of the current frame signal, and lastdelay is the last determined time delay estimation value.
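The reliability calculation can be sketched as a Hanning-weighted sum of the buffered second voice activity states in a span of deltaT entries centered on lastdelay. The exact formula is not reproduced in this text, so the indexing below, the handling of positions outside the queue, and the function names are assumptions.

```python
import math


def hanning_window(length):
    """Symmetric Hanning window, 0 at both ends and 1 at the center."""
    half = (length - 1) / 2.0
    return [
        0.5 * (1.0 + math.cos(2.0 * math.pi * (i - half) / (length - 1)))
        for i in range(length)
    ]


def reliability(first_vad, queue_vad, last_delay, delta_t):
    """Weighted sum of reference VAD states around the pre-aligned frame.

    Returns 0 when the current frame has no speech. Positions that fall
    outside the queue are skipped (an assumption of this sketch).
    """
    if first_vad == 0:
        return 0.0
    window = hanning_window(delta_t)
    start = last_delay - (delta_t - 1) // 2
    total = 0.0
    for i in range(delta_t):
        idx = start + i
        if 0 <= idx < len(queue_vad):
            total += window[i] * queue_vad[idx]
    return total


vad_states = [1, 1, 1, 1, 1, 0, 0, 0]
r_centered = reliability(1, vad_states, last_delay=2, delta_t=5)
r_initial = reliability(1, vad_states, last_delay=0, delta_t=5)
r_silent = reliability(0, vad_states, last_delay=2, delta_t=5)
```

With deltaT = 5 the window weights are 0, 0.5, 1, 0.5, 0, so a single isolated detection at the pre-aligned frame yields at most 1, and the neighboring frames must also indicate speech before the reliability exceeds a threshold of 1.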

Step 105, respectively determining a correlation value between the current frame signal and at least one frame of target reference signal under the condition that the value of the reliability is greater than the reliability threshold value.

The value of the reliability threshold is 1 or other values, and the value of the reliability threshold is not limited in this embodiment.

And under the condition that the value of the reliability is greater than the reliability threshold value, the current frame signal and the target reference frame signal are both voice signals, and at the moment, the accuracy of the time delay estimation value determined based on the correlation value is higher. Therefore, in the embodiment, the correlation value between the current frame signal and at least one frame of target reference signal is calculated only when the value of the reliability is greater than the reliability threshold value, so as to perform the delay estimation, thereby avoiding the problem of inaccurate delay estimation caused by calculating an insignificant correlation value; not only can the computing resource be saved, but also the accuracy of the time delay estimation can be improved.

In one example, the electronic device may calculate the correlation value between the current frame signal and the target reference signal directly using a correlation function.

However, speech is mainly distributed in the range of 300 to 3400 Hz. Therefore, to further avoid the influence of noise and other interfering sounds in other frequency bands, in another example, the electronic device respectively determining a correlation value between the current frame signal and at least one frame of target reference signal includes: respectively carrying out band-pass weighting on the current frame signal and the target reference signal according to the distribution frequency of the voice signal; and calculating a correlation value between the current frame signal after band-pass weighting and the target reference signal after band-pass weighting.

In this embodiment, the current frame signal and the target reference signal are respectively subjected to band-pass weighting according to the distribution frequency of the voice signal, so that the influence of interference sound on other frequency bands can be further avoided, and the accuracy of delay estimation is improved.

The band-pass weighting of the current frame signal and the target reference signal according to the distribution frequency of the voice signal, and the calculation of the correlation value between the band-pass-weighted current frame signal and the band-pass-weighted target reference signal, are represented by the following formula:

wherein X(l, k) represents the target reference signal, l denotes the current frame, and k denotes the data of the target reference signal corresponding to the current frame; D(l, k) represents the current frame signal, l denotes the current frame, and k denotes the data corresponding to the current frame signal; lastdelay is the last delay estimation value; N is the length of the Hanning window; M is the sampling rate of the current frame signal; 300 is the minimum frequency of the speech signal distribution, and 3400 is the maximum frequency of the speech signal distribution.
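As an illustration of band-pass weighting, the sketch below windows both frames with a Hanning window, transforms them, and correlates only the frequency bins between 300 Hz and 3400 Hz. It is one plausible form under stated assumptions and is not the formula of this application, which is not reproduced in this text; the function names, the rectangular band weighting, and the normalization are all assumptions.

```python
import cmath
import math
from typing import List, Sequence


def hann_window(n: int) -> List[float]:
    """Symmetric Hanning window of length n (n >= 2)."""
    if n < 2:
        raise ValueError("window length must be at least 2")
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive DFT returning bins 0 .. n//2 (enough for a real signal)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def band_limited_correlation(
    d_frame: Sequence[float],
    x_frame: Sequence[float],
    sample_rate: float,
    f_lo: float = 300.0,
    f_hi: float = 3400.0,
) -> float:
    """Normalized zero-lag correlation restricted to the band [f_lo, f_hi]."""
    n = len(d_frame)
    if len(x_frame) != n:
        raise ValueError("frames must have the same length")
    window = hann_window(n)
    d_spec = dft([a * w for a, w in zip(d_frame, window)])
    x_spec = dft([a * w for a, w in zip(x_frame, window)])
    # Convert the band edges in Hz to DFT bin indices.
    k_lo = math.ceil(f_lo * n / sample_rate)
    k_hi = min(math.floor(f_hi * n / sample_rate), n // 2)
    cross = 0.0
    power_d = 0.0
    power_x = 0.0
    for k in range(k_lo, k_hi + 1):
        cross += (d_spec[k] * x_spec[k].conjugate()).real
        power_d += abs(d_spec[k]) ** 2
        power_x += abs(x_spec[k]) ** 2
    if power_d == 0.0 or power_x == 0.0:
        return 0.0  # no in-band energy, so no meaningful correlation
    return cross / math.sqrt(power_d * power_x)
```

Energy outside the speech band, such as low-frequency hum, contributes almost nothing to the result, which is the effect the band-pass weighting is meant to achieve. A production implementation would normally use an FFT rather than the naive DFT shown here.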

Step 106, determining a time delay estimation value corresponding to the current frame signal by using the correlation value.

The determining the time delay estimation value corresponding to the current frame signal by using the correlation value includes: determining a reference signal with the maximum correlation value with the current frame signal from the pre-aligned reference signal and the nearby reference signals; and determining the time difference between the reference signal with the maximum correlation value and the current frame signal as a time delay estimation value.

For example, within the range of deltaT centered on lastdelay in the second buffer queue QueuRef, the electronic device takes each reference signal in turn and calculates its correlation value with the current frame signal.

Within this range, the position of the maximum correlation value is taken as the delay estimation value Delay, and the correlation value Corr(l, Delay) is updated.
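The peak search over the window centered on lastdelay can be sketched as below. This is an illustrative sketch only: delays are counted in frames, the window is taken as [-(deltaT-1)/2, +(deltaT-1)/2] around the last estimate, and the clamping to [0, max_delay], the function name, and the `correlation_at` callback are assumptions.

```python
from typing import Callable, Tuple


def search_delay(
    last_delay: int,
    delta_t: int,
    max_delay: int,
    correlation_at: Callable[[int], float],
) -> Tuple[int, float]:
    """Return (Delay, Corr) for the largest correlation near last_delay."""
    half = (delta_t - 1) // 2
    # Clamp the search window so it stays inside the reference buffer.
    lo = max(0, last_delay - half)
    hi = min(max_delay, last_delay + half)
    best_delay = lo
    best_corr = correlation_at(lo)
    for delay in range(lo + 1, hi + 1):
        corr = correlation_at(delay)
        if corr > best_corr:
            best_delay, best_corr = delay, corr
    return best_delay, best_corr
```

Returning the correlation value together with the delay keeps Corr available for the later verification step without recomputing it.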

In summary, the delay estimation method provided in this embodiment obtains the current frame signal and the reference signal corresponding to the current frame signal; determines a signal search range of the reference signal to obtain n frames of target reference signals in the signal search range; determines a first voice activity state of the current frame signal and a second voice activity state of the target reference signal; determines the reliability of the current frame signal based on the first voice activity state and the second voice activity state; determines a correlation value between the current frame signal and each of at least one frame of target reference signal when the value of the reliability is greater than a reliability threshold; and determines the delay estimation value corresponding to the current frame signal by using the correlation value. This solves the problem that the delay estimation result obtained by existing delay estimation methods is inaccurate. The reliability is calculated from the voice activity states and indicates how trustworthy a correlation value calculated using the current frame signal is. When the reliability is low, the current frame signal and/or the reference signal is not a voice signal, and a correlation value calculated using the current frame signal may not be accurate enough, so delay estimation based on the correlation value can be skipped. Only when the reliability is high is the correlation value between the current frame signal and at least one frame of target reference signal calculated for delay estimation. This avoids inaccurate delay estimates caused by calculating meaningless correlation values, which both saves computing resources and improves the accuracy of the delay estimation.

To reduce the influence of false peaks, the search for the current delay estimation value is confined by an added range limitation. However, instability in the audio acquisition system often causes large delay jitter. In that case, the added search range limitation instead traps the delay estimation result in a local optimum, so that changes in the delay cannot be tracked in time and the echo signal cannot be aligned with the reference accurately and quickly. Subsequent echo cancellation or sound source localization may then be ineffective. For this reason, in this embodiment, the delay estimation value is also verified after it is calculated. Specifically, after step 106, the method further includes: determining, from the search range, verification reference signals corresponding to m random verification delays; calculating a correlation value between the current frame signal and each verification reference signal; determining the target random verification delays corresponding to the two verification reference signals having the largest correlation values with the current frame signal, to obtain a delay verification range; and verifying the delay estimation value by using the delay verification range.

Wherein m is an integer greater than 2; the upper limit value of the time delay verification range is the larger value of the target random verification time delay, and the lower limit value of the time delay verification range is the smaller value of the target random verification time delay.

Verifying the delay estimation value by using the delay verification range includes: when the delay estimation value lies within the delay verification range, verifying that the delay estimation value is correct; when the delay estimation value does not lie within the delay verification range, determining whether the correlation value corresponding to the delay estimation value is greater than the correlation value corresponding to the target random verification delay; when the correlation value corresponding to the delay estimation value is greater than the correlation value corresponding to the target random verification delay, verifying that the delay estimation value is correct; and when the correlation value corresponding to the delay estimation value is smaller than the correlation value corresponding to the target random verification delay, determining that the delay estimation value is wrong, and taking the delay at which the correlation value is largest within the range of the target random verification delays as the final delay estimation value.

Illustratively, determining the verification reference signals corresponding to m random verification delays from the search range includes: generating m random verification delays by chaotic mapping; and extracting, from the search range, the verification reference signals whose delays are the m random verification delays.

Optionally, generating the m random verification delays by chaotic mapping includes: generating the m random verification delays by using the logistic map (also known as the insect-population map). Specifically, the logistic map is represented by the following formula:

Z_{l,m+1} = μ · Z_{l,m} · (1 - Z_{l,m})

wherein μ ∈ [0, 4] and Z_{l,0} = Z_{l-1,m}.
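The recurrence above can be iterated to produce the m random verification delays, for example as in the following sketch. The source does not state how a chaotic value Z in [0, 1] is converted to a delay, so the linear scaling to [0, max_delay] is an assumption, as are the function name and parameters. The last chaotic value is returned so that it can seed the next frame, matching Z_{l,0} = Z_{l-1,m}.

```python
from typing import List, Tuple


def logistic_delays(
    z0: float, m: int, max_delay: int, mu: float = 4.0
) -> Tuple[List[int], float]:
    """Generate m delays in [0, max_delay] by iterating the logistic map."""
    if not 0.0 < z0 < 1.0:
        raise ValueError("z0 must lie strictly between 0 and 1")
    if not 0.0 <= mu <= 4.0:
        raise ValueError("mu must lie in [0, 4]")
    delays: List[int] = []
    z = z0
    for _ in range(m):
        z = mu * z * (1.0 - z)  # Z_{l,m+1} = mu * Z_{l,m} * (1 - Z_{l,m})
        # Assumed mapping: scale the chaotic value linearly to a delay index.
        delays.append(min(int(z * (max_delay + 1)), max_delay))
    return delays, z
```

With mu = 4 the sequence is chaotic for almost all seeds, but particular seeds are degenerate: z0 = 0.5, for instance, maps to 1 and then stays at 0, so such seeds should be avoided.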

For the manner of calculating the correlation value between the current frame signal and the verification reference signal, refer to step 105; details are not repeated here.

Assume that CorrMax1 and CorrMax2 are the two largest correlation values of the verification reference signals, and that the corresponding target random verification delays are T1 and T2. The electronic device determines whether the delay estimation value Delay lies in the range between T1 and T2; if so, it outputs Delay directly. Otherwise, it further determines whether the correlation value Corr corresponding to the delay estimation value is greater than CorrMax1. If so, the correlation value calculated for the delay estimation value is the larger one, the calculated delay estimation value is more accurate, and Delay is output directly. Otherwise, the range between T1 and T2 is determined to be the range of the delay jitter; the correlation values corresponding to the delay values in this range are calculated, and a peak search determines the delay value T12 corresponding to the maximum peak. The delay estimation result can then be expressed as DelayOut = T12, and the final delay estimation value DelayOut = T12 is output directly.
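The decision procedure just described can be sketched as follows. The sketch is illustrative only: the function name `verify_delay`, the `correlation_at` callback, and the integer delay grid are assumptions, and `probe_delays` stands for the m random verification delays.

```python
from typing import Callable, Sequence


def verify_delay(
    delay: int,
    corr: float,
    probe_delays: Sequence[int],
    correlation_at: Callable[[int], float],
) -> int:
    """Verify Delay against random probes and return the final delay."""
    if len(probe_delays) < 2:
        raise ValueError("at least two verification delays are required")
    # Rank the probes by correlation with the current frame, largest first.
    scored = sorted(((correlation_at(t), t) for t in probe_delays), reverse=True)
    corr_max1, t_a = scored[0]
    _, t_b = scored[1]
    t1, t2 = min(t_a, t_b), max(t_a, t_b)
    if t1 <= delay <= t2:
        return delay  # estimate lies inside the verification range
    if corr > corr_max1:
        return delay  # estimate correlates better than every probe
    # Delay jitter: fine peak search over every delay in [T1, T2].
    return max(range(t1, t2 + 1), key=correlation_at)
```

Only the last branch pays for a dense search, and that search is confined to [T1, T2] rather than the whole search range, which is how the verification stays fast.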

In other embodiments, the electronic device may also use other chaotic mapping algorithms to generate the m random verification delays; this embodiment does not limit the manner in which the m random verification delays are generated.

In this embodiment, verification reference signals corresponding to m random verification delays are determined from the search range; a correlation value between the current frame signal and each verification reference signal is calculated; the target random verification delays corresponding to the two verification reference signals having the largest correlation values with the current frame signal are determined to obtain a delay verification range; and the delay estimation value is verified by using the delay verification range. In this way, it can be determined whether delay jitter has occurred; once it occurs, the delay range can be determined quickly and a fine delay estimation performed in time, which improves the accuracy of the delay estimation while maintaining its speed.

For a clearer understanding of the delay estimation method provided in the present application, an example is described below. Referring to fig. 2, the delay estimation method includes three stages: a preprocessing stage, a delay estimation stage, and a delay estimation verification stage.

In the preprocessing stage, the electronic device locates the pre-aligned reference signal and the nearby reference signals in the far-end cache according to the last determined delay estimation value, obtains the reference signals x(n) in the range [-(deltaT-1)/2, +(deltaT-1)/2] around the pre-aligned reference signal, and determines the reliability R of the delay estimation value of the current frame from the voice activity detection results of x(n) and of the current frame signal d(n).

In the delay estimation stage, the electronic device determines whether to perform delay estimation according to the reliability R output by the preprocessing stage. Specifically, when R is smaller than 1, the last determined delay estimation value is kept unchanged and output to the next stage. Otherwise, for the reference signals in the range [-(deltaT-1)/2, +(deltaT-1)/2], the correlation value between the current frame signal and each reference signal is calculated using a correlation function, and a peak search over the correlation values determines the delay estimation value Delay.

In the delay estimation verification stage, the electronic device applies the idea of chaos to obtain random scatter points distributed globally within a preset MAXDELAY range, and calculates the correlation function of the signals corresponding to the scattered delays, thereby determining the scatter region [T1, T2] with the largest correlation values. It then determines whether the delay value output by the delay estimator lies in this region. If so, the current delay estimate is output. If not, delay jitter may have occurred at the current moment, and the device determines whether the correlation value Corr corresponding to the delay estimator output is greater than the maximum correlation value of the scattered delay points. If so, the current delay estimate is output and delay jitter is ruled out. If not, large delay jitter is confirmed; the correlation values within the scatter region [T1, T2] are calculated promptly, and the position corresponding to the maximum correlation value is output as the final delay estimation result.

The output of the preprocessing stage determines the reliability of the estimation result of the delay estimator. Meanwhile, the random delay scatter points generated in the delay estimation verification stage according to the chaos idea monitor both the delay estimation value of the previous stage and the global delay, which ensures the timeliness and accuracy of the delay output.

The present embodiment provides a delay estimation device, as shown in fig. 3. The device comprises at least the following modules: a signal acquisition module 310, a range determination module 320, a state determination module 330, a confidence determination module 340, a correlation value determination module 350, and a latency determination module 360.

A signal obtaining module 310, configured to obtain a current frame signal and a reference signal corresponding to the current frame signal;

a range determining module 320, configured to determine a signal search range of the reference signal, to obtain n frames of target reference signals within the signal search range;

a state determining module 330, configured to determine a first voice activity state of the current frame signal and a second voice activity state of the target reference signal;

a confidence level determination module 340, configured to determine a confidence level of the current frame signal based on the first voice activity state and the second voice activity state;

a correlation value determining module 350, configured to determine correlation values between the current frame signal and at least one frame of target reference signal respectively when a value of the confidence level is greater than a confidence level threshold;

and a delay determining module 360, configured to determine a delay estimation value corresponding to the current frame signal by using the correlation value.

For relevant details reference is made to the above-described method embodiments.

It should be noted that the delay estimation device provided in the above embodiment is illustrated only by the above division into functional modules. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the delay estimation device may be divided into different functional modules to complete all or part of the functions described above. In addition, the delay estimation device and the delay estimation method provided by the above embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.

The present embodiment provides an electronic device as shown in fig. 4. The electronic device comprises at least a processor 401 and a memory 402.

Processor 401 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 401 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the latency estimation method provided by the method embodiments herein.

In some embodiments, the electronic device may further include: a peripheral interface and at least one peripheral. The processor 401, memory 402 and peripheral interface may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface via a bus, signal line, or circuit board. Illustratively, peripheral devices include, but are not limited to: radio frequency circuit, touch display screen, audio circuit, power supply, etc.

Of course, the electronic device may include fewer or more components, which is not limited by the embodiment.

Optionally, the present application further provides a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the time delay estimation method of the foregoing method embodiment.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

It is to be understood that the above-described embodiments are only a few, but not all, of the embodiments described herein. Based on the embodiments in the present application, a person skilled in the art may make other variations or changes without creative efforts, and all of them should fall into the protection scope of the present application.

Claims

1. A method for delay estimation, the method comprising:

acquiring a current frame signal and a reference signal corresponding to the current frame signal;

determining a signal search range of the reference signal to obtain n frames of target reference signals in the signal search range, wherein n is a positive integer;

determining a first voice activity state of the current frame signal and a second voice activity state of the target reference signal;

determining a confidence level of the current frame signal based on the first voice activity state and the second voice activity state;

respectively determining a correlation value between the current frame signal and at least one frame of the target reference signals under the condition that the value of the confidence level is greater than a confidence level threshold;

and determining the time delay estimation value corresponding to the current frame signal by using the correlation value.

2. The method of claim 1, wherein said determining the confidence level of the current frame signal based on the first voice activity state and the second voice activity state comprises:

under the condition that the first voice activity state indicates that the current frame signal has a voice signal, determining a pre-alignment reference signal and a nearby reference signal of which the time difference with the pre-alignment reference signal is smaller than or equal to a preset threshold value from the n frames of target reference signals based on the last determined time delay estimation value; wherein, the time delay estimated value determined last time is initialized to 0;

determining the confidence level of the current frame signal based on the second voice activity state of the pre-aligned reference signal and the second voice activity state of the nearby reference signal.

3. The method of claim 2, wherein determining the confidence level of the current frame signal based on the second speech activity state of the pre-aligned reference signal and the second speech activity state of the nearby reference signal comprises:

weighting the second voice activity state of the pre-alignment reference signal by using a Hanning window with a preset length to obtain a first weighted value;

weighting a second voice activity state of the nearby reference signal by using the Hanning window with the preset length to obtain a second weighted value;

determining the sum of the first weighted value and the second weighted value to obtain the credibility; wherein the preset length is determined based on the preset threshold.

4. The method according to claim 1, wherein after determining the delay estimation value corresponding to the current frame signal by using the correlation value, further comprising:

determining m verification reference signals corresponding to random verification time delays from the search range, wherein m is an integer larger than 2;

calculating a correlation value between the current frame signal and the verification reference signal;

determining target random verification time delays corresponding to two verification reference signals with the maximum correlation value between the two verification reference signals and the current frame signal to obtain a time delay verification range; the upper limit value of the time delay verification range is the larger value of the target random verification time delay, and the lower limit value of the time delay verification range is the smaller value of the target random verification time delay;

and verifying the time delay estimated value by using the time delay verification range.

5. The method of claim 4, wherein the validating the delay estimate using the delay validation range comprises:

verifying that the time delay estimation value is correct under the condition that the time delay estimation value is within the time delay verification range;

determining whether a correlation value corresponding to the time delay estimation value is greater than a correlation value corresponding to the target random verification time delay or not under the condition that the time delay estimation value is not within the time delay verification range;

verifying that the time delay estimation value is correct under the condition that the correlation value corresponding to the time delay estimation value is larger than the correlation value corresponding to the target random verification time delay;

and under the condition that the correlation value corresponding to the time delay estimation value is smaller than the correlation value corresponding to the target random verification time delay, verifying that the time delay estimation value is wrong, and determining the maximum correlation value corresponding to the target random verification time delay as a final time delay estimation value.

6. The method of claim 4, wherein the determining m verification reference signals corresponding to random verification time delays from the search range comprises:

generating the m random verification delays by using a chaotic mapping mode;

and extracting verification reference signals with the time delay between the verification reference signals and the current frame signal as the m random verification time delays from the search range.

7. The method according to any one of claims 1 to 6, wherein said determining the correlation value between the current frame signal and the at least one frame target reference signal respectively comprises:

respectively carrying out band-pass weighting on the current frame signal and the target reference signal according to the distribution frequency of the voice signal;

and calculating a correlation value between the current frame signal after band-pass weighting and the target reference signal after band-pass weighting.

8. The method according to any one of claims 1 to 6, wherein the signal search range is determined based on a preset maximum delay estimate.

9. An electronic device, characterized in that the device comprises a processor and a memory; the memory has stored therein a program that is loaded and executed by the processor to implement the latency estimation method according to any one of claims 1 to 8.

10. A computer-readable storage medium, in which a program is stored, which, when being executed by a processor, is configured to carry out the latency estimation method according to any one of claims 1 to 8.