CN116312621A - Time delay estimation method, echo cancellation method, training method and related equipment - Google Patents


Info

Publication number
CN116312621A
CN116312621A (application CN202310199275.1A)
Authority
CN
China
Prior art keywords: audio, processing, filter, delay estimation, time delay
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310199275.1A
Other languages
Chinese (zh)
Inventor
马路
魏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority claimed from CN202310199275.1A
Published as CN116312621A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/06 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being correlation coefficients
    • G10L21/0208 — Speech enhancement, e.g. noise reduction or echo cancellation: noise filtering
    • G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech


Abstract

The invention discloses a time delay estimation method, an echo cancellation method, a training method, and related devices in the field of audio processing. The time delay estimation method comprises: processing a plurality of segments of a first audio with a plurality of filters, where each segment comprises a plurality of consecutive frames and each filter processes one segment of the first audio together with a second audio that contains content corresponding to the first audio; and processing the outputs of the plurality of filters with a neural network model to obtain a time delay estimate for the corresponding content of the first audio within the second audio. The embodiments improve the precision of time delay estimation, remain robust in scenes with noise and nonlinear echo, and offer high processing efficiency.

Description

Time delay estimation method, echo cancellation method, training method and related equipment
Technical Field
The present invention relates to the field of audio processing, and in particular to a time delay estimation method, an echo cancellation method, a training method, and related devices.
Background
In the related art, time delay estimation mainly relies on a cross-correlation algorithm: a cross-correlation function between the near-end signal and the far-end signal is computed, the cross-correlation coefficients of all candidate delays are traversed, and the candidate delay whose coefficient is maximal is taken as the actual delay.
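This cross-correlation baseline can be sketched as follows (an illustrative reconstruction, not the patent's code; function and parameter names are assumptions):

```python
import numpy as np

def xcorr_delay(far: np.ndarray, near: np.ndarray, max_delay: int) -> int:
    """Estimate delay by maximizing normalized cross-correlation over lags."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_delay + 1):
        a = far[:len(far) - lag]              # far-end signal
        b = near[lag:lag + len(a)]            # near-end signal shifted by lag
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        corr = float(a @ b) / denom if denom > 0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```

The loop over candidate lags is what makes this approach expensive: the cost grows with both the signal length and the number of candidate delays, which motivates the complexity concern raised below.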
Open-source WebRTC (Web Real-Time Communication) implements the correlation operation with audio fingerprint matching: the far-end and near-end speech signals undergo a fast Fourier transform (FFT), and binarized spectra of the two signals, i.e., audio fingerprints, are derived from the transformed spectra. The candidate far-end signal with the highest similarity is selected by computing the bitwise exclusive-OR of the two fingerprints, and the corresponding delay is calculated.
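The fingerprint idea can be sketched as below (a simplified illustration of the principle, not WebRTC's actual implementation; the mean-magnitude threshold is an assumption):

```python
import numpy as np

def spectrum_fingerprint(frame: np.ndarray) -> np.ndarray:
    """Binarize the magnitude spectrum of one frame: 1 where a frequency
    bin exceeds the frame's mean magnitude, else 0."""
    mag = np.abs(np.fft.rfft(frame))
    return (mag > mag.mean()).astype(np.uint8)

def fingerprint_distance(fp_a: np.ndarray, fp_b: np.ndarray) -> int:
    """Bitwise XOR then popcount: the number of differing spectrum bits.
    The candidate with the smallest distance is the best match."""
    return int(np.count_nonzero(fp_a ^ fp_b))
```

Because the comparison reduces to XOR on bit vectors, matching is cheap; the weakness noted below is that noise and nonlinear echo flip spectrum bits and corrupt the match.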
Disclosure of Invention
Upon analysis, the inventors found that time delay estimation by the cross-correlation algorithm has high computational complexity, making it unsuitable for systems with strict real-time requirements. The audio fingerprint matching algorithm in WebRTC is sensitive to nonlinear echo and environmental noise, and its estimation accuracy degrades severely in interference environments such as complex noise.
One technical problem addressed by the embodiments of the invention is: how to reduce the complexity of time delay estimation while improving its accuracy.
According to a first aspect of some embodiments of the present invention, there is provided a time delay estimation method, comprising: processing a plurality of segments of a first audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, and each filter processes one segment of the first audio together with a second audio comprising content corresponding to the first audio; and processing the processing results of the plurality of filters with a neural network model to obtain a time delay estimation result for the corresponding content of the first audio in the second audio.
In some embodiments, processing the plurality of segments of the first audio with the plurality of filters comprises: in a buffer storing frames of the first audio, extracting segments of a preset length starting from each of a plurality of positions, with a preset interval between adjacent starting points, to obtain the plurality of segments; and inputting each extracted segment, together with the second audio, into a respective one of the plurality of filters to obtain the processing result of each filter.
In some embodiments, the predetermined interval is less than the predetermined length.
In some embodiments, each filter comprises a plurality of filter blocks, the number of filter blocks being equal to the number of frames in each segment. Each filter block estimates the echo present in the second audio from the segment of the first audio, subtracts the estimated echo from the second audio to obtain an error, and updates the filter weights with the error; the sum of the energies of a filter block's weights is that block's output. The processing result of each filter comprises the set of outputs of its filter blocks.
In some embodiments, the filter is used to estimate matching information of the second audio to the segment of the first audio processed by the filter.
In some embodiments, the filter is a least mean square filter or a multi-delay block frequency domain adaptive filter.
In some embodiments, processing the processing results of the plurality of filters with the neural network model comprises: splicing the processing results of the filters to generate input information; and processing the input information with the neural network model.
In some embodiments, the neural network model predicts a probability for each candidate time delay, and the time delay estimation result is the delay with the maximum probability.
In some embodiments, the number of weight parameters of the neural network model is less than 10^6.
In some embodiments, the first audio is audio received from a transmitting end over a communication link and the second audio is audio captured by an audio input device of a receiving end.
According to a second aspect of some embodiments of the present invention, there is provided an echo cancellation method comprising: any one of the time delay estimation methods described above; and performing echo cancellation on the second audio based on the delay estimation result.
In some embodiments, the echo cancellation method further comprises: receiving audio from a transmitting end through a communication link as first audio; and collecting the audio through the audio input device of the receiving end as the second audio.
According to a third aspect of some embodiments of the present invention, there is provided a training method comprising: performing time delay processing on the third audio according to the target time delay to obtain a fourth audio; processing a plurality of segments of the third audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter for processing one segment of the third audio and a fourth audio comprising a corresponding content of the third audio; processing the processing results of the plurality of filters by using the neural network model to obtain a time delay estimation result of corresponding content of the third audio in the fourth audio; and adjusting parameters of the neural network model according to the time delay estimation result of the fourth audio and the target time delay.
In some embodiments, performing time delay processing on the third audio according to the target time delay to obtain the fourth audio comprises: processing a fifth audio to obtain a simulated audio-input-device capture signal; processing the third audio to obtain the echo signal that the transmitting-end signal produces at the receiving end; processing the simulated capture signal, the echo signal, and a noise signal to obtain a receiving-end mixed signal; and performing time delay processing on the receiving-end mixed signal according to the target time delay to obtain the fourth audio.
According to a fourth aspect of some embodiments of the present invention, there is provided a delay estimation apparatus, comprising: a first audio processing module configured to process a plurality of segments of the first audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter for processing one segment of the first audio and a second audio comprising a respective content of the first audio; and a first estimation module configured to process the processing results of the plurality of filters by using the neural network model to obtain a delay estimation result of the corresponding content of the first audio in the second audio.
According to a fifth aspect of some embodiments of the present invention, there is provided an echo cancellation device, comprising: the time delay estimation device; and an echo cancellation module configured to perform echo cancellation on the second audio based on the delay estimation result.
In some embodiments, the echo cancellation device further comprises: a receiver configured to receive audio from a transmitting end as first audio through a communication link; an audio input device configured to capture audio as a second audio.
According to a sixth aspect of some embodiments of the present invention, there is provided a training device comprising: a time delay module configured to perform time delay processing on the third audio according to a target time delay to obtain a fourth audio; a second audio processing module configured to process a plurality of segments of the third audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, and each filter processes one segment of the third audio together with the fourth audio, the fourth audio comprising content corresponding to the third audio; a second estimation module configured to process the processing results of the plurality of filters with the neural network model to obtain a time delay estimation result for the corresponding content of the third audio in the fourth audio; and a parameter adjustment module configured to adjust parameters of the neural network model according to the time delay estimation result of the fourth audio and the target time delay.
According to a seventh aspect of some embodiments of the present invention, there is provided an electronic device comprising: a memory; and a processor coupled to the memory, the processor configured to perform any one of the foregoing delay estimation methods, or echo cancellation methods, or training methods, based on instructions stored in the memory.
According to an eighth aspect of some embodiments of the present invention, there is provided a computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements any one of the foregoing delay estimation methods, or echo cancellation methods, or training methods.
Some of the above embodiments have the following advantages or benefits. They realize time delay estimation by combining a filter bank with a classification neural network: the filters accurately extract the correlation information between the first audio and the second audio, and the neural network model then handles the nonlinear factors. The embodiments therefore improve the precision of time delay estimation and remain robust in scenes with noise and nonlinear echo. In addition, the method has high processing efficiency.
Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
To describe the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; a person skilled in the art could obtain other drawings from them without inventive effort.
Fig. 1 illustrates a flow diagram of a delay estimation method according to some embodiments of the invention.
Fig. 2 illustrates a flow diagram of a filter processing method according to some embodiments of the invention.
Fig. 3 schematically shows data processed by the data buffer and the filter.
Fig. 4 illustrates a flow diagram of an echo cancellation method according to some embodiments of the invention.
Fig. 5 illustrates a flow diagram of a neural network model training method, according to some embodiments of the invention.
Fig. 6 illustrates a flow diagram for generating a receiving-end mixed signal according to some embodiments of the invention.
Fig. 7 illustrates a schematic structure of a delay estimation apparatus according to some embodiments of the present invention.
Fig. 8 illustrates a schematic structure of an echo cancellation device according to some embodiments of the present invention.
Fig. 9 illustrates a schematic diagram of a training device according to some embodiments of the invention.
Fig. 10 illustrates a schematic structure of an electronic device according to some embodiments of the invention.
Fig. 11 shows a schematic structural view of an electronic device according to further embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
An embodiment of the time delay estimation method of the present invention is described below with reference to Fig. 1. In this embodiment, the first audio serves as a reference, and time delay estimation is performed on the second audio. The second audio contains content corresponding to the first audio, for example the same speech. In some embodiments, the second audio may be obtained by processing the first audio: after the first audio is played back, it is captured by an audio input device such as a microphone to yield the second audio. Besides the (distorted) content of the first audio, the second audio also includes noise, other speakers' voices, and the like.
Fig. 1 illustrates a flow diagram of a delay estimation method according to some embodiments of the invention. As shown in fig. 1, the delay estimation method of this embodiment includes steps S102 to S104.
In step S102, a plurality of segments of the first audio are processed with a plurality of filters, where each segment comprises a plurality of consecutive frames and each filter processes one segment of the first audio together with the second audio. That is, each filter has two inputs: a segment of the first audio, and the second audio.
In some embodiments, the filter is used to estimate matching information of the second audio to the segment of the first audio processed by the filter.
In some embodiments, the filter is a least mean square (LMS) filter or a multi-delay block frequency-domain adaptive filter (MDF). Other types of filters may also be used as needed and are not detailed here.
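An adaptive filter of this kind can be sketched as follows (an illustrative normalized-LMS variant; the patent only names LMS/MDF as options, and the step size and tap count here are assumptions):

```python
import numpy as np

def nlms_filter(far: np.ndarray, near: np.ndarray,
                taps: int = 64, mu: float = 0.5, eps: float = 1e-8):
    """Normalized LMS: adaptively estimate the echo of `far` inside `near`.
    Returns (weights, errors); errors = near minus the estimated echo."""
    w = np.zeros(taps)
    err = np.zeros(len(near))
    for n in range(taps - 1, len(near)):
        x = far[n - taps + 1:n + 1][::-1]   # newest far-end sample first
        y_hat = w @ x                       # estimated echo sample
        e = near[n] - y_hat                 # residual after echo removal
        w += mu * e * x / (x @ x + eps)     # normalized weight update
        err[n] = e
    return w, err
```

When the near-end signal really contains a delayed copy of the far-end segment, the weights converge to the echo path and the residual error collapses; this is the matching information the filter bank feeds to the neural network.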
In some embodiments, the first audio is audio received over a communication link from a transmitting end (i.e., a far end) and the second audio is audio captured by an audio input device at a receiving end (i.e., a near end).
In step S104, the processing results of the plurality of filters are processed by using the neural network model, so as to obtain a delay estimation result of the corresponding content of the first audio in the second audio.
In some embodiments, the processing results of the plurality of filters are spliced to generate the input information, e.g., a vector containing the processing results of all the filters, and this input information is processed with the neural network model. In this way, the information of each segment is directly reflected in the model input.
In some embodiments, the neural network model predicts a probability for each candidate time delay, and the estimation result is the delay with the maximum probability. The neural network model is, for example, a classification model whose classes correspond to the candidate delays; the delay estimate is determined from the classification result. A suitable model can thus be chosen for time delay estimation as needed, improving the applicability of the scheme.
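The classification head described above can be sketched as (hypothetical shapes: one output per candidate delay; the softmax normalization is an assumption, any monotone mapping would give the same argmax):

```python
import numpy as np

def delay_from_logits(logits: np.ndarray, candidate_delays: np.ndarray) -> int:
    """Map classifier outputs (one per candidate delay) to a delay estimate:
    softmax for probabilities, then pick the delay of maximum probability."""
    z = logits - logits.max()                # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return int(candidate_delays[int(np.argmax(probs))])
```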
In some embodiments, the neural network model is a compact model, e.g., one with fewer than 10^6 weight parameters. This reduces computational complexity and makes the method applicable to real-time delay estimation.
The filters estimate the linear echo. Owing to the near-end microphone and nonlinear devices in the hardware circuit, the echo in the mixed signal received at the near end contains, besides the linear echo introduced by the environmental impulse response, a nonlinear echo introduced by the devices.
The above embodiment realizes time delay estimation by combining a filter bank with a classification neural network: the filters accurately extract the correlation information between the first audio and the second audio, and the neural network model then handles the nonlinear factors. The embodiment therefore improves the precision of time delay estimation and remains robust in scenes with noise and nonlinear echo. In addition, the method has high processing efficiency.
In some embodiments, the segments may be extracted from the first audio at the same intervals. An embodiment of the filter processing method of the present invention is described below with reference to fig. 2.
Fig. 2 illustrates a flow diagram of a filter processing method according to some embodiments of the invention. As shown in fig. 2, the filter processing method of this embodiment includes steps S202 to S204.
In step S202, in the buffer storing frames of the first audio, segments of a preset length are extracted starting from each of a plurality of positions, with a preset interval between adjacent starting points, to obtain a plurality of segments.
If the first audio is longer than the buffer, only part of its information can be stored in the buffer. For example, frames of the first audio are added to the buffer in order until it is full. Once all frames in the buffer have been processed, the first frame can be removed and the earliest unprocessed frame appended. The buffer may be implemented as a queue, i.e., a first-in first-out data structure.
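The first-in first-out frame buffer can be sketched with a bounded queue (illustrative class and method names; a frame is one array of samples):

```python
from collections import deque
import numpy as np

class FrameBuffer:
    """FIFO buffer holding at most `capacity` audio frames; when full,
    appending a new frame evicts the oldest one."""
    def __init__(self, capacity: int):
        self.frames = deque(maxlen=capacity)

    def push(self, frame: np.ndarray) -> None:
        self.frames.append(frame)          # oldest frame drops out when full

    def segment(self, start: int, length: int):
        """Return `length` consecutive frames beginning at index `start`."""
        return [self.frames[i] for i in range(start, start + length)]
```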
In some embodiments, the predetermined interval is less than the predetermined length. Therefore, adjacent segments are partially overlapped, so that the edge parts of the segments can be reasonably considered, and the processing accuracy is improved.
In step S204, each extracted segment and the second audio are input to each of the plurality of filters, respectively, to obtain a processing result of each filter.
Fig. 3 schematically shows the data buffer and the data processed by the filters. As shown in Fig. 3, there are M adaptive filters (numbered 0 to M-1), each of length N, with adjacent filters overlapping by L; i.e., the interval between the first frames of the segments processed by adjacent filters is N-L.
When the segments are extracted, data frames are taken from left to right at intervals of N-L frames and fed to the corresponding filters: data are first extracted in sequence starting from the 0th data frame position of the buffer and fed to the first filter (filter 0); then starting from the (N-L)-th position and fed to the second filter (filter 1); and so on, until data are finally extracted starting from the (M-1)·(N-L)-th position and fed to the M-th filter (the one labeled M-1 in the figure).
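The start positions above can be sketched as (M filters, segment length N frames, overlap L frames, hence stride N-L; function names are illustrative):

```python
def segment_starts(m: int, n: int, l: int):
    """Start frame index for each of m filters: filter i reads n frames
    beginning at i*(n-l), so adjacent segments overlap by l frames."""
    stride = n - l
    return [i * stride for i in range(m)]

def extract_segments(frames, m: int, n: int, l: int):
    """Slice m overlapping segments of n frames each out of the buffer."""
    return [frames[s:s + n] for s in segment_starts(m, n, l)]
```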
In some embodiments, each filter comprises a plurality of filter blocks, the number of filter blocks being equal to the number of frames in each segment. Each filter block estimates the echo present in the second audio from the segment of the first audio, subtracts the estimated echo from the second audio to obtain an error, and updates the filter weights with the error; the sum of the energies of a filter block's weights is that block's output. The processing result of each filter comprises the set of outputs of its filter blocks.
For example, each frame of the first audio has K sampling points, so each filter has length K×N. A filter is divided into N filter blocks in units of frames; each filter block computes the energy of its K weights as the block's total energy, so each filter yields N weight energies. Applying the same computation to all M filters yields N×M weight energies in total, which can then be spliced in sequence as the input to the neural network model.
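Turning filter weights into the N×M energy features described above can be sketched as (assuming filter j's weights are one length-K·N vector; names are illustrative):

```python
import numpy as np

def block_energies(weights: np.ndarray, n_blocks: int) -> np.ndarray:
    """Split one filter's weight vector into n_blocks frame-sized blocks
    and return the energy (sum of squares) of each block."""
    blocks = weights.reshape(n_blocks, -1)   # shape (N, K)
    return (blocks ** 2).sum(axis=1)         # shape (N,)

def feature_vector(all_weights, n_blocks: int) -> np.ndarray:
    """Concatenate the N energies of each of the M filters into one
    length-N*M input vector for the neural network."""
    return np.concatenate([block_energies(w, n_blocks) for w in all_weights])
```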
Through the above embodiment, frames of the first audio are temporarily stored in the buffer, and segments of the preset length are then extracted in sequence at the preset interval and fed into the filters. Each segment of the first audio can thus be processed quickly with multiple filters to obtain its matching information against the second audio, providing accurate input for the subsequent neural network model and improving both the accuracy and the efficiency of time delay estimation.
In an audio-video call scenario, the sound uttered by the sender (the far end) is transmitted to the receiver's device (the near end), played out, reflected by surrounding objects, and picked up by the receiver's audio input device (e.g., a microphone). If the receiver's device uses no echo cancellation algorithm, or the algorithm performs poorly, the uncancelled sound is transmitted back to the current speaker, i.e., the sender, who then hears their own voice. The time delay estimation method of the invention makes echo cancellation possible. An embodiment of the echo cancellation method is described below with reference to Fig. 4.
Fig. 4 illustrates a flow diagram of an echo cancellation method according to some embodiments of the invention. As shown in fig. 4, the echo cancellation method of this embodiment includes steps S402 to S406.
In step S402, a plurality of segments of the first audio are processed with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter being configured to process one segment of the first audio and a second audio comprising a corresponding content of the first audio.
In some embodiments, audio from a sender is received over a communication link as first audio; and collecting audio through the audio input equipment of the receiving end to serve as second audio.
In step S404, processing results of the plurality of filters are processed by using the neural network model to obtain a delay estimation result of corresponding content of the first audio in the second audio.
In step S406, echo cancellation is performed on the second audio based on the delay estimation result.
Specifically, the far-end signal (e.g., the sender's voice) is transmitted to the receiver's device, played through a speaker, reflected by surrounding objects to form an echo, and received by the near-end microphone. At the same time, the near-end speaker's (i.e., the receiver's) voice and the near-end environmental noise are also picked up by the microphone. Because the path from being played by the near-end device's speaker to being received by the near-end microphone and converted into a digital signal delays the far-end signal relative to the original, the near-end device buffers the far-end signal received over the communication link so that the far-end and near-end (double-ended) signals can be aligned. After echo estimation, the estimated echo can be subtracted from the near-end mixed signal to achieve echo cancellation.
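The align-then-subtract step can be sketched as (illustrative only; in a full canceller the subtracted signal is the adaptive filter's echo estimate, and the delay comes from the estimation method above):

```python
import numpy as np

def cancel_echo(near_mix: np.ndarray, echo_est: np.ndarray, delay: int) -> np.ndarray:
    """Align the echo estimate with the near-end mix using the estimated
    delay (in samples), then subtract it to leave the near-end speech."""
    shifted = np.zeros_like(near_mix)
    shifted[delay:] = echo_est[:len(near_mix) - delay]  # delay the estimate
    return near_mix - shifted
```

An accurate delay estimate matters here: if `delay` is off, the subtraction misaligns and can add distortion instead of removing echo.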
By accurately performing delay estimation, the embodiment can more accurately perform echo cancellation, and improves the quality and user experience of voice communication.
The invention also provides a training method of the neural network model.
Fig. 5 illustrates a flow diagram of a neural network model training method, according to some embodiments of the invention. As shown in fig. 5, the neural network model training method of this embodiment includes steps S502 to S508.
In step S502, delay processing is performed on the third audio according to the target delay, so as to obtain a fourth audio.
In some embodiments, processing the fifth audio to obtain an analog audio input device acquisition signal; processing the third audio to obtain an echo signal generated by the signal of the transmitting end at the receiving end; processing the analog audio input equipment acquisition signal, the echo signal and the noise signal to obtain a receiving end mixed signal; and performing time delay processing on the mixed signal of the receiving end according to the target time delay to obtain fourth audio.
An example of generating the receiving-end mixed signal is shown in fig. 6. As shown in fig. 6, audio from two different speakers is randomly selected from a clean audio library: the audio of speaker 1 is used as the far-end signal (S1), and the audio of speaker 2 is used as the near-end signal. Two room impulse responses, 1 and 2, are randomly selected from a room impulse response simulator and convolved with the near-end and far-end signals, respectively, to obtain the near-end speaker signal collected by the microphone and the echo signal produced by the far-end signal at the near end. The room impulse responses may be artificially simulated or actually measured; for example, they may be set according to the relative distances between a specific microphone, loudspeaker and speaker. The powers of the double-ended signals are then adjusted according to a set power ratio, which may be chosen randomly within a certain range, for example -10 to 30 dB. For the far-end signal, in order to simulate the nonlinear echo introduced by nonlinear devices such as loudspeakers, nonlinear processing may also be applied randomly before the audio of speaker 1 is fed into the convolution. Further, a noise is randomly selected from a noise library and its power is adjusted according to a set signal-to-noise ratio (relative to the near-end signal power); the noise signal, the echo signal and the speaker signal collected by the near-end microphone are then superimposed to obtain the near-end mixed signal S2, which is finally delayed according to the target delay Y.
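The fig. 6 pipeline can be sketched roughly as below; the function name, the unit impulse responses, and the fixed SNR handling are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_near_mix(near_speech, far_speech, rir_near, rir_far,
                      target_delay, snr_db=10.0):
    # Convolve each speaker with a room impulse response, scale noise
    # to the requested SNR relative to the near-end speech, sum the
    # three components, and finally delay the mixture by the target
    # delay that serves as the training label.
    mic_near = np.convolve(near_speech, rir_near)[:len(near_speech)]
    echo = np.convolve(far_speech, rir_far)[:len(far_speech)]
    noise = rng.standard_normal(len(mic_near))
    sig_pow = np.mean(mic_near ** 2) + 1e-12
    noise *= np.sqrt(sig_pow / (10 ** (snr_db / 10)
                                * (np.mean(noise ** 2) + 1e-12)))
    mix = mic_near + echo + noise
    return np.concatenate([np.zeros(target_delay), mix])

speech = rng.standard_normal(100)
mix = simulate_near_mix(speech, speech, np.ones(1), np.ones(1),
                        target_delay=5)
```

The returned mixture is `target_delay` samples longer than the input, with the leading samples zeroed, mimicking the delayed receiving-end signal paired with the label Y.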
In step S504, a plurality of segments of the third audio are processed by a plurality of filters, wherein each segment comprises a plurality of consecutive frames, and each filter is used for processing one segment of the third audio and the fourth audio, the fourth audio comprising corresponding content of the third audio.
In step S506, the processing results of the plurality of filters are processed by using the neural network model to obtain a delay estimation result for the corresponding content of the third audio in the fourth audio.
In step S508, parameters of the neural network model are adjusted according to the delay estimation result of the fourth audio and the target delay.
For example, the parameters of the neural network model are adjusted using a gradient descent method based on the difference between the delay estimation result and the target delay.
Through this embodiment, audio with a delay effect can be generated by simulation from existing audio, and the delayed audio can be processed with the filter bank and the neural network model to obtain a delay estimation result. The neural network model is then trained according to the delay estimation result and the preset target delay, so that the trained model performs delay estimation more accurately.
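A minimal gradient-descent sketch of step S508 is given below, assuming (purely for illustration) a single linear layer trained with cross-entropy against the known target delay; the patent does not specify the network architecture or loss function:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, features, target_delay, lr=0.1):
    # Forward pass: map the concatenated filter outputs to one
    # probability per candidate delay.
    probs = softmax(W @ features)
    # Cross-entropy gradient w.r.t. the logits, followed by a plain
    # gradient-descent update of the weights.
    grad = probs.copy()
    grad[target_delay] -= 1.0
    W -= lr * np.outer(grad, features)
    return W, -np.log(probs[target_delay] + 1e-12)

rng = np.random.default_rng(0)
W = np.zeros((8, 16))              # 8 candidate delays, 16-dim features
x = rng.standard_normal(16)
losses = [train_step(W, x, target_delay=3)[1] for _ in range(50)]
```

Repeating the update on simulated (filter output, target delay) pairs drives the loss down, i.e., the model assigns increasing probability to the correct delay class.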
An embodiment of the delay estimation device of the present invention is described below with reference to fig. 7.
Fig. 7 illustrates a schematic structure of a delay estimation apparatus according to some embodiments of the present invention. As shown in fig. 7, the delay estimation device 700 of this embodiment includes: a first audio processing module 7100 configured to process a plurality of segments of the first audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter for processing one segment of the first audio and a second audio comprising a respective content of the first audio; and a first estimation module 7200 configured to process the processing results of the plurality of filters using the neural network model to obtain a delay estimation result for the corresponding content of the first audio in the second audio.
In some embodiments, the first audio processing module 7100 is further configured to extract segments of a preset length from a plurality of positions as starting points in a buffer of frames storing the first audio, respectively, to obtain a plurality of segments, wherein a preset interval is provided between adjacent starting points; each extracted segment and the second audio are input to each of the plurality of filters, respectively, to obtain a processing result of each filter.
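The segment-extraction scheme described above can be sketched as follows (the function name and the toy frame buffer are hypothetical):

```python
def extract_segments(frames, seg_len, interval):
    # Segments of a preset length start at positions spaced by a
    # preset interval; when the interval is smaller than the length,
    # adjacent segments overlap, so no candidate delay falls in a gap.
    starts = range(0, len(frames) - seg_len + 1, interval)
    return [frames[s:s + seg_len] for s in starts]

buffer = list(range(10))                    # 10 buffered frames
segments = extract_segments(buffer, seg_len=4, interval=2)
```

Each extracted segment would then be paired with the second audio and fed to its own filter.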
In some embodiments, the predetermined interval is less than the predetermined length.
In some embodiments, each filter comprises a plurality of filter blocks, the number of filter blocks being equal to the number of frames in each segment. Each filter block is configured to estimate the echo present in the second audio from its segment of the first audio, subtract the estimated echo from the second audio to obtain an error, and update the weights of the filter with the error; the energy sum of each filter block's weights is that filter block's output. The processing result of each filter comprises the set of outputs of its filter blocks.
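As a hedged illustration of such a filter, the sketch below uses a plain NLMS adaptive filter (related to the least-mean-square variant mentioned below, though not the patent's exact design) whose weight vector is split into blocks; each block reports the energy sum of its weights, which peaks on the block spanning the true delay. The function and the toy signals are assumptions for demonstration:

```python
import numpy as np

def filter_block_outputs(ref, mic, n_blocks, block_len, mu=0.5, eps=1e-8):
    # Adapt weights that predict the echo in `mic` from the far-end
    # reference `ref`; the per-sample error is the microphone signal
    # minus the estimated echo, and it drives the NLMS weight update.
    L = n_blocks * block_len
    w = np.zeros(L)
    for n in range(L, len(mic)):
        x = ref[n - L:n][::-1]           # most recent L reference samples
        e = mic[n] - w @ x               # error: mic minus echo estimate
        w += mu * e * x / (x @ x + eps)  # NLMS update
    # Output: the energy sum of each block's weights.
    return [float(np.sum(w[b * block_len:(b + 1) * block_len] ** 2))
            for b in range(n_blocks)]

rng = np.random.default_rng(1)
ref = rng.standard_normal(2000)
mic = np.concatenate([np.zeros(6), ref[:-6]])   # echo delayed by 6 samples
outputs = filter_block_outputs(ref, mic, n_blocks=4, block_len=4)
```

The block whose taps cover the true echo delay accumulates nearly all of the weight energy, so the per-block outputs already carry coarse delay information for the neural network to refine.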
In some embodiments, the filter is used to estimate matching information of the second audio to the segment of the first audio processed by the filter.
In some embodiments, the filter is a least mean square filter or a multi-delay block frequency domain adaptive filter.
In some embodiments, the first estimation module 7200 is further configured to splice the processing results of the plurality of filters, generating the input information; and processing the input information by using a neural network model.
In some embodiments, the neural network model is used for predicting the probability corresponding to each time delay, and the time delay estimation result is the time delay corresponding to the maximum probability.
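For instance, if the model outputs one score per candidate delay, the estimate is simply the delay with maximum probability; the softmax step and the candidate grid below are illustrative assumptions:

```python
import numpy as np

def pick_delay(logits, candidate_delays):
    # Convert raw scores into probabilities, then return the delay
    # whose probability is largest, along with the full distribution.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidate_delays[int(np.argmax(probs))], probs

candidate_delays = [0, 40, 80, 120]           # hypothetical delays in ms
estimate, probs = pick_delay(np.array([0.1, 2.5, 0.3, -1.0]),
                             candidate_delays)
```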
In some embodiments, the number of weight parameters of the neural network model is less than 10^6.
In some embodiments, the first audio is audio received from a transmitting end over a communication link and the second audio is audio captured by an audio input device of a receiving end.
An embodiment of the echo cancellation device according to the present invention is described below with reference to fig. 8.
Fig. 8 illustrates a schematic structure of an echo cancellation device according to some embodiments of the present invention. As shown in fig. 8, the echo cancellation device 80 of this embodiment includes: the time delay estimation device 700; and an echo cancellation module 8300 configured to perform echo cancellation on the second audio based on the delay estimation result.
In some embodiments, the echo cancellation device 80 further comprises: a receiver 8400 configured to receive audio from a transmitting end as first audio over a communication link; an audio input device 8500 configured to capture audio as a second audio. The audio input device 8500 is, for example, a microphone.
An embodiment of the training device of the present invention is described below with reference to fig. 9.
Fig. 9 illustrates a schematic diagram of a training device according to some embodiments of the invention. As shown in fig. 9, the training apparatus 900 of this embodiment includes: a delay module 9100 configured to perform delay processing on the third audio to obtain a fourth audio; a second audio processing module 9200 configured to process a plurality of segments of the third audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter for processing one segment of the third audio and a fourth audio, wherein the fourth audio comprises corresponding content of the third audio; a second estimation module 9300 configured to process the processing results of the plurality of filters by using the neural network model, so as to obtain a delay estimation result of the corresponding content of the third audio in the fourth audio; the parameter adjustment module 9400 is configured to adjust parameters of the neural network model according to the delay estimation result of the fourth audio and the target delay.
In some embodiments, the delay module 9100 is further configured to process the fifth audio to obtain a simulated audio input device capture signal; process the third audio to obtain the echo signal that the transmitting-end signal produces at the receiving end; process the simulated capture signal, the echo signal and a noise signal to obtain a receiving-end mixed signal; and delay the receiving-end mixed signal according to the target delay to obtain the fourth audio.
Fig. 10 illustrates a schematic structure of an electronic device according to some embodiments of the invention. As shown in fig. 10, the electronic apparatus 100 of this embodiment includes: a memory 1010 and a processor 1020 coupled to the memory 1010, the processor 1020 being configured to perform the delay estimation method, or the echo cancellation method, or the training method of any of the foregoing embodiments based on instructions stored in the memory 1010.
The memory 1010 may include, for example, system memory, fixed nonvolatile storage media, and the like. The system memory stores, for example, an operating system, application programs, boot Loader (Boot Loader), and other programs.
Fig. 11 shows a schematic structure of an electronic device according to further embodiments of the present invention. As shown in fig. 11, the electronic device 110 of this embodiment includes a memory 1110 and a processor 1120, and may also include an input-output interface 1130, a network interface 1140, a storage interface 1150, and the like. These interfaces 1130, 1140, 1150 and the memory 1110 and the processor 1120 may be connected by, for example, a bus 1160. The input/output interface 1130 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 1140 provides a connection interface for a variety of networking devices. The storage interface 1150 provides a connection interface for external storage devices such as SD cards and USB flash drives.
The embodiment of the present invention also provides a computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements any one of the foregoing delay estimation methods, or echo cancellation methods, or training methods.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (20)

1. A method of delay estimation, comprising:
processing a plurality of segments of a first audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, and each filter is for processing one segment of the first audio and a second audio, the second audio comprising corresponding content of the first audio; and
and processing the processing results of the plurality of filters by using a neural network model to obtain a time delay estimation result of the corresponding content of the first audio in the second audio.
2. The method of delay estimation of claim 1, wherein the processing the plurality of segments of the first audio with the plurality of filters comprises:
extracting segments of a preset length from a buffer memory of a frame storing a first audio by taking a plurality of positions as starting points respectively to obtain a plurality of segments, wherein preset intervals are reserved between adjacent starting points;
each extracted segment and the second audio are input to each of the plurality of filters, respectively, to obtain a processing result of each filter.
3. The delay estimation method of claim 2, wherein the preset interval is smaller than the preset length.
4. The delay estimation method of claim 1, wherein:
each filter comprises a plurality of filter blocks, the number of the filter blocks being equal to the number of frames in each segment, each filter block being used for estimating the echo present in the second audio according to its segment of the first audio, subtracting the estimated echo from the second audio to obtain an error, and updating the weights of the filter with the error, the energy sum of each filter block's weights being that filter block's output; and
the processing result of each filter comprises a set of outputs of the respective filter blocks of the filter.
5. The delay estimation method of claim 1, wherein the filter is configured to estimate matching information of the second audio and the segment of the first audio processed by the filter.
6. The delay estimation method of claim 5, wherein the filter is a least mean square filter or a multi-delay block frequency domain adaptive filter.
7. The delay estimation method of claim 1, wherein the processing results of the plurality of filters using a neural network model comprises:
splicing the processing results of the filters to generate input information; and
and processing the input information by using the neural network model.
8. The delay estimation method of claim 1, wherein the neural network model is configured to predict a probability corresponding to each delay, and the delay estimation result is a delay corresponding to a maximum probability.
9. The delay estimation method of claim 8, wherein the number of weight parameters of the neural network model is less than 10^6.
10. The time delay estimation method of any one of claims 1 to 9, wherein the first audio is audio received from a transmitting end through a communication link, and the second audio is audio collected by an audio input device of a receiving end.
11. An echo cancellation method, comprising:
performing the time delay estimation method according to any one of claims 1 to 10; and
and performing echo cancellation on the second audio based on the time delay estimation result.
12. The echo cancellation method of claim 11, further comprising:
receiving audio from a transmitting end through a communication link as first audio; and
and collecting audio through the audio input equipment of the receiving end to serve as second audio.
13. A training method, comprising:
performing time delay processing on the third audio according to the target time delay to obtain a fourth audio;
processing a plurality of segments of the third audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter for processing one segment of the third audio and a fourth audio comprising a respective content of the third audio;
processing the processing results of the plurality of filters by using a neural network model to obtain a time delay estimation result of corresponding content of the third audio in the fourth audio; and
and adjusting parameters of the neural network model according to the time delay estimation result of the fourth audio and the target time delay.
14. The training method of claim 13, wherein the delay processing the third audio according to the target delay to obtain the fourth audio comprises:
processing the fifth audio to obtain a simulated audio input device capture signal;
processing the third audio to obtain an echo signal generated by a signal of a transmitting end at a receiving end;
processing the simulated audio input device capture signal, the echo signal and a noise signal to obtain a receiving end mixed signal; and
and performing time delay processing on the mixed signal of the receiving end according to the target time delay to obtain fourth audio.
15. A delay estimation apparatus comprising:
a first audio processing module configured to process a plurality of segments of a first audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter for processing one segment of the first audio and a second audio comprising a respective content of the first audio; and
and the first estimation module is configured to process the processing results of the plurality of filters by using a neural network model so as to obtain a time delay estimation result of corresponding content of the first audio in the second audio.
16. An echo cancellation device, comprising:
the delay estimation device of claim 15; and
and the echo cancellation module is configured to perform echo cancellation on the second audio based on the time delay estimation result.
17. The echo cancellation device of claim 16, further comprising:
a receiver configured to receive audio from a transmitting end as first audio through a communication link; and
an audio input device configured to capture audio as a second audio.
18. A training device, comprising:
the time delay module is configured to perform time delay processing on the third audio to obtain fourth audio;
a second audio processing module configured to process a plurality of segments of the third audio with a plurality of filters, wherein each segment comprises a plurality of consecutive frames, each filter for processing one segment of the third audio and a fourth audio comprising a respective content of the third audio;
a second estimation module configured to process the processing results of the plurality of filters by using a neural network model, so as to obtain a delay estimation result of corresponding content of the third audio in the fourth audio; and
and the parameter adjustment module is configured to adjust parameters of the neural network model according to the time delay estimation result of the fourth audio and the target time delay.
19. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to perform the delay estimation method of any one of claims 1 to 10, or the echo cancellation method of claim 11 or 12, or the training method of claim 13 or 14, based on instructions stored in the memory.
20. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the delay estimation method of any one of claims 1 to 10, or the echo cancellation method of claim 11 or 12, or the training method of claim 13 or 14.
CN202310199275.1A 2023-02-28 2023-02-28 Time delay estimation method, echo cancellation method, training method and related equipment Pending CN116312621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310199275.1A CN116312621A (en) 2023-02-28 2023-02-28 Time delay estimation method, echo cancellation method, training method and related equipment

Publications (1)

Publication Number Publication Date
CN116312621A true CN116312621A (en) 2023-06-23

Family

ID=86780941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310199275.1A Pending CN116312621A (en) 2023-02-28 2023-02-28 Time delay estimation method, echo cancellation method, training method and related equipment

Country Status (1)

Country Link
CN (1) CN116312621A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination