CN115116471B - Audio signal processing method and device, training method, training device and medium - Google Patents

Audio signal processing method and device, training method, training device and medium

Info

Publication number
CN115116471B
CN115116471B (application number CN202210459690.1A)
Authority
CN
China
Prior art keywords
audio signal
matrix
processed
real
imaginary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210459690.1A
Other languages
Chinese (zh)
Other versions
CN115116471A (en)
Inventor
马东鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210459690.1A
Publication of CN115116471A
Application granted
Publication of CN115116471B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L21/0232: Processing in the frequency domain
    • G10L2021/02082: Noise filtering, the noise being echo or reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Audio signal processing method and device, training method, training device and medium. The present disclosure provides an audio signal processing method for performing time-frequency domain echo cancellation processing using a neural network, including: acquiring a reference audio signal and an audio signal to be processed; obtaining an amplitude spectrum matrix and a phase spectrum matrix of the reference audio signal and the audio signal to be processed; performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network to generate a first processed audio signal; obtaining real and imaginary data matrices of the reference audio signal and real and imaginary data matrices of the first processed audio signal; and performing time domain echo cancellation processing on the first processed audio signal by using a second neural network to generate a second processed audio signal. The present disclosure also provides a neural network training method, an audio signal processing apparatus, a computing device, a computer-readable storage medium, and a computer program product.

Description

Audio signal processing method and device, training method, training device and medium
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to an audio signal processing method, an audio signal processing apparatus employing the method, and a training method for a neural network model, as well as a computing device, a computer-readable storage medium, and a computer program product.
Background
With the continuous development of audio signal processing technology, the target object using a terminal device places increasingly high requirements on audio quality. If echo occurs during a call, the call quality is severely affected. Echo is generated as follows: an audio signal is played through a loudspeaker, undergoes multiple reflections in a closed or semi-closed environment that distort it, and is finally picked up by a microphone together with the local audio, forming an echo. Echo interferes with the delivery of the local audio and severely degrades the communication experience.
In the conventional art, adaptive filters are typically used to cancel the linear portion of the echo; however, the nonlinear portion of the echo tends to be difficult to cancel. Recently, neural network models have begun to be applied to echo cancellation. A neural network model can be used in combination with an adaptive filter to post-process the audio signal already processed by the adaptive filter and remove the nonlinear residue and a small amount of linear residue in the signal; alternatively, a neural network model may be used in place of the adaptive filter to cancel both the linear and nonlinear portions of the echo. Current neural-network-based echo cancellation generally converts the audio signal into the frequency domain, obtains the corresponding amplitude spectrum and phase spectrum, performs echo cancellation processing based on the amplitude spectrum, and then converts the result back to the time domain in combination with the phase spectrum to generate the processed audio signal. However, in this processing method only the amplitude spectrum of the audio signal is processed and the phase is discarded, which adversely affects the echo cancellation quality of the audio signal.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided an audio signal processing method comprising: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal acquired through a local microphone; obtaining a magnitude spectrum matrix of the reference audio signal and obtaining a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed; based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal; obtaining a real data matrix and an imaginary data matrix of the reference audio signal and obtaining a real data matrix and an imaginary data matrix of the first processed audio signal; based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, performing a time domain echo cancellation process on the first processed audio signal using a second neural network and generating a second processed audio signal.
According to some exemplary embodiments of the present disclosure, the obtaining the amplitude spectrum matrix of the reference audio signal includes: performing fast fourier transform on the reference audio signal to obtain an amplitude spectrum matrix of the reference audio signal; and the obtaining of the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed comprises: and performing fast Fourier transform on the audio signal to be processed to obtain an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed.
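For concreteness, the following Python sketch (not taken from the patent text) illustrates one way a framed fast Fourier transform can yield the amplitude spectrum matrix and phase spectrum matrix referred to above; the frame length, hop size, window choice, and function names are assumptions made purely for illustration.

```python
import numpy as np

def magnitude_and_phase(signal, frame_len=512, hop=256):
    """Frame a waveform, apply an FFT per frame, and return amplitude and phase matrices."""
    n_frames = 1 + max(0, (len(signal) - frame_len + hop - 1) // hop)
    padded = np.zeros(n_frames * hop + frame_len)
    padded[:len(signal)] = signal
    frames = np.stack([padded[i * hop: i * hop + frame_len] for i in range(n_frames)])
    spec = np.fft.rfft(frames * np.hanning(frame_len), axis=1)  # complex spectrum, one row per frame
    return np.abs(spec), np.angle(spec)  # amplitude spectrum matrix, phase spectrum matrix

# Illustrative signals standing in for the reference signal and the signal to be processed.
ref = np.random.randn(16000)
mic = np.random.randn(16000)
ref_mag, _ = magnitude_and_phase(ref)
mic_mag, mic_phase = magnitude_and_phase(mic)
```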
According to some exemplary embodiments of the present disclosure, the performing frequency domain echo cancellation processing on the audio signal to be processed using the first neural network based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, and generating the first processed audio signal includes: splicing the amplitude spectrum matrix of the reference audio signal with the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix; inputting the spliced magnitude spectrum matrix into the first neural network to generate a magnitude spectrum filtering matrix; generating a filtered amplitude spectrum matrix based on the amplitude spectrum filtering matrix and the amplitude spectrum matrix of the audio signal to be processed; and taking the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed together as input parameters, and performing inverse fast Fourier transform to generate the first processed audio signal.
According to some exemplary embodiments of the present disclosure, the magnitude of the magnitude spectrum filter matrix is the same as the magnitude of the magnitude spectrum matrix of the audio signal to be processed, and wherein the generating the filtered magnitude spectrum matrix based on the magnitude spectrum filter matrix and the magnitude spectrum matrix of the audio signal to be processed comprises: and multiplying each element in the amplitude spectrum matrix of the audio signal to be processed with one element at a corresponding position in the amplitude spectrum filtering matrix to generate the filtered amplitude spectrum matrix.
According to some exemplary embodiments of the present disclosure, the generating a filtered magnitude spectrum matrix based on the magnitude spectrum filter matrix and the magnitude spectrum matrix of the audio signal to be processed comprises: convolving the magnitude spectrum matrix of the audio signal to be processed with the magnitude spectrum filtering matrix to generate the filtered magnitude spectrum matrix.
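As an illustration of the frequency domain stage described above, the sketch below (an assumption-laden example, not the patent's implementation) applies an amplitude spectrum filter matrix element-wise, reattaches the phase spectrum of the audio signal to be processed, and reconstructs a waveform by inverse FFT with overlap-add; in the convolution variant, the element-wise product would be replaced by a convolution.

```python
import numpy as np

def frequency_domain_filter(mic_mag, mic_phase, filter_matrix, frame_len=512, hop=256):
    """Element-wise amplitude filtering followed by inverse FFT and overlap-add reconstruction."""
    filtered_mag = mic_mag * filter_matrix            # multiply elements at corresponding positions
    spec = filtered_mag * np.exp(1j * mic_phase)      # recombine with the original phase spectrum
    frames = np.fft.irfft(spec, n=frame_len, axis=1)  # back to per-frame time-domain samples
    out = np.zeros(frames.shape[0] * hop + frame_len)
    for i, frame in enumerate(frames):                # simple overlap-add
        out[i * hop: i * hop + frame_len] += frame
    return out

# `filter_matrix` stands in for the first neural network's output and must match mic_mag's shape.
mic_mag = np.abs(np.random.randn(62, 257))
mic_phase = np.random.uniform(-np.pi, np.pi, (62, 257))
first_processed = frequency_domain_filter(mic_mag, mic_phase, np.ones_like(mic_mag))
```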
According to some exemplary embodiments of the present disclosure, the obtaining the real and imaginary data matrices of the reference audio signal and the obtaining the real and imaginary data matrices of the first processed audio signal comprises: performing a fast fourier transform on the reference audio signal to obtain a real data matrix and an imaginary data matrix of the reference audio signal; the first processed audio signal is subjected to a fast fourier transform to obtain a real data matrix and an imaginary data matrix of the first processed audio signal.
According to some exemplary embodiments of the present disclosure, the performing time domain echo cancellation processing on the first processed audio signal using a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating the second processed audio signal includes: splicing the real part data matrix of the reference audio signal with the real part data matrix of the first processed audio signal to generate a spliced real part data matrix, and splicing the imaginary part data matrix of the reference audio signal with the imaginary part data matrix of the first processed audio signal to generate a spliced imaginary part data matrix; inputting the spliced real part data matrix and the spliced imaginary part data matrix into the second neural network to generate a real part data filtering matrix and an imaginary part data filtering matrix; generating a filtered real data matrix based on the real data matrix and the real data filtering matrix of the first processed audio signal, generating a filtered imaginary data matrix based on the imaginary data matrix and the imaginary data filtering matrix of the first processed audio signal; and performing inverse fast fourier transform with the filtered real part data matrix and the filtered imaginary part data matrix together as input parameters to generate the second processed audio signal.
According to some exemplary embodiments of the present disclosure, the size of the real part data filtering matrix is the same as the size of the real part data matrix of the first processed audio signal, the size of the imaginary part data filtering matrix is the same as the size of the imaginary part data matrix of the first processed audio signal, and wherein the generating a filtered real part data matrix based on the real part data matrix of the first processed audio signal and the real part data filtering matrix, the generating a filtered imaginary part data matrix based on the imaginary part data matrix of the first processed audio signal and the imaginary part data filtering matrix comprises: multiplying each element in the real part data matrix of the first processed audio signal with an element in a corresponding position in the real part data filtering matrix to generate the filtered real part data matrix; each element in the imaginary data matrix of the first processed audio signal is multiplied with an element in the imaginary data filter matrix in a corresponding position to generate the filtered imaginary data matrix.
According to some exemplary embodiments of the present disclosure, the generating the filtered real data matrix based on the real data matrix of the first processed audio signal and the real data filtering matrix comprises: convolving the real data matrix of the first processed audio signal with the real data filtering matrix to generate the filtered real data matrix; and generating a filtered imaginary data matrix based on the imaginary data matrix of the first processed audio signal and the imaginary data filtering matrix comprises: convolving the imaginary data matrix of the first processed audio signal with the imaginary data filter matrix to generate the filtered imaginary data matrix.
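Analogously, the time domain stage can be pictured with the short sketch below (illustrative only; the array shapes, the concatenation axis, and the placeholder filter matrices are assumptions): the real and imaginary data matrices of the reference signal and of the first processed signal are spliced to form the second network's input, and the network's two filter matrices are applied element-wise before the inverse FFT.

```python
import numpy as np

n_frames, n_bins = 62, 257  # illustrative sizes only

# Real/imaginary data matrices of the reference signal and of the first processed signal.
ref_real, ref_imag = np.random.randn(n_frames, n_bins), np.random.randn(n_frames, n_bins)
first_real, first_imag = np.random.randn(n_frames, n_bins), np.random.randn(n_frames, n_bins)

# Splice (concatenate) along the feature axis to form the second neural network's input.
spliced_real = np.concatenate([ref_real, first_real], axis=1)
spliced_imag = np.concatenate([ref_imag, first_imag], axis=1)

# Placeholders for the real/imaginary filter matrices the second neural network would output.
real_filter = np.ones_like(first_real)
imag_filter = np.ones_like(first_imag)

filtered_real = first_real * real_filter   # multiply elements at corresponding positions
filtered_imag = first_imag * imag_filter
frames = np.fft.irfft(filtered_real + 1j * filtered_imag, axis=1)  # per-frame inverse FFT
```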
According to some exemplary embodiments of the present disclosure, the first neural network and the second neural network are both long short-term memory (LSTM) neural networks.
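Since both networks may be LSTM networks, a hypothetical PyTorch mask estimator is sketched below; the layer sizes, the sigmoid output, and the class name are assumptions rather than the patent's architecture.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Hypothetical LSTM-based filter-matrix estimator (layer sizes are illustrative)."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_dim, hidden_size=hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):                    # x: (batch, frames, features)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.proj(h))   # one filter value per time-frequency element

# e.g. a first network fed the spliced amplitude spectra (2 x 257 bins per frame):
net = MaskEstimator(in_dim=2 * 257, out_dim=257)
mask = net(torch.randn(1, 62, 2 * 257))      # shape (1, 62, 257)
```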
According to a second aspect of the present disclosure, there is provided a neural network training method, comprising: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is a mixed audio signal obtained by mixing an audio signal acquired by a local microphone with a real audio signal; obtaining a magnitude spectrum matrix of the reference audio signal and obtaining a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed; based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal; obtaining a real data matrix and an imaginary data matrix of the reference audio signal and obtaining a real data matrix and an imaginary data matrix of the first processed audio signal; performing time domain echo cancellation processing on the first processed audio signal using a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal; obtaining a first signal-to-noise ratio loss value based on the first processed audio signal and the real audio signal, and obtaining a second signal-to-noise ratio loss value based on the second processed audio signal and the real audio signal; at least one of the first neural network and the second neural network is adjusted based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value.
According to some example embodiments of the present disclosure, adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises: carrying out weighted summation on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value to obtain a comprehensive signal-to-noise ratio loss value; and adjusting at least one of the first neural network and the second neural network based on the comprehensive signal-to-noise ratio loss value.
According to some exemplary embodiments of the present disclosure, the weight of the first signal-to-noise ratio loss is greater than the weight of the second signal-to-noise ratio loss.
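A minimal sketch of the weighted combination just described is given below; the concrete weight values are assumptions (the text only requires the first weight to exceed the second), and the scalar tensors are placeholders for the two signal-to-noise ratio loss values.

```python
import torch

# Placeholders standing in for the first and second signal-to-noise ratio loss values,
# which in practice are computed from the processed signals and the real (clean) signal.
first_snr_loss = torch.tensor(1.3, requires_grad=True)
second_snr_loss = torch.tensor(0.9, requires_grad=True)

w1, w2 = 0.7, 0.3                              # assumed weights, with w1 > w2
combined_loss = w1 * first_snr_loss + w2 * second_snr_loss
combined_loss.backward()                       # gradients would flow into both neural networks
```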
According to some example embodiments of the present disclosure, wherein adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises:
inputting the first processed audio signal into an acoustic model based on CTC to obtain a CTC loss value; carrying out weighted summation on the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value and the CTC loss value to obtain a comprehensive loss value; and adjusting at least one of the first neural network and the second neural network based on the comprehensive loss value.
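The three-term objective can be pictured with the hedged sketch below; the CTC loss call follows PyTorch's nn.CTCLoss interface, while the tensor shapes, vocabulary size, and weights are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

# Log-probabilities that a CTC-based acoustic model might emit for the first processed signal.
T, N, C = 100, 4, 30                                   # time steps, batch size, characters (blank = 0)
log_probs = torch.randn(T, N, C).log_softmax(-1).requires_grad_()
targets = torch.randint(1, C, (N, 20))                 # reference transcriptions (20 characters each)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

# Placeholder SNR loss values; the weights are assumptions, only the weighted sum comes from the text.
first_snr_loss, second_snr_loss = torch.tensor(1.3), torch.tensor(0.9)
w1, w2, w3 = 0.6, 0.3, 0.1
comprehensive_loss = w1 * first_snr_loss + w2 * second_snr_loss + w3 * ctc_loss
```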
According to some exemplary embodiments of the present disclosure, the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value are both scale-invariant signal-to-noise ratio (SI-SNR) loss values.
According to a third aspect of the present disclosure, there is provided an audio signal processing apparatus comprising: an audio signal acquisition module configured to: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal acquired through a local microphone; an audio signal frequency domain information acquisition module configured to: obtaining a magnitude spectrum matrix of the reference audio signal and obtaining a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed; a frequency domain echo cancellation processing module configured to: based on the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal; an audio signal time domain information acquisition module configured to: obtaining a real data matrix and an imaginary data matrix of the reference audio signal and obtaining a real data matrix and an imaginary data matrix of the first processed audio signal; a time domain echo cancellation processing module configured to: based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, performing a time domain echo cancellation process on the first processed audio signal using a second neural network and generating a second processed audio signal.
According to a fourth aspect of the present disclosure, there is provided a computing device comprising a processor and a memory configured to store computer-executable instructions configured to, when executed on the processor, cause the processor to perform the audio signal processing method according to the first aspect of the present disclosure and its exemplary embodiments, or to perform the neural network training method according to the second aspect of the present disclosure and its exemplary embodiments.
According to a fifth aspect of the present disclosure, there is provided a computer readable storage medium configured to store computer executable instructions configured to, when executed on a processor, cause the processor to perform the audio signal processing method according to the first aspect of the present disclosure and its exemplary embodiments, or to perform the neural network training method according to the second aspect of the present disclosure and its exemplary embodiments.
According to a sixth aspect of the present disclosure, there is provided a computer program product comprising computer executable instructions configured to, when executed on a processor, cause the processor to perform the audio signal processing method according to the first aspect of the present disclosure and its exemplary embodiments, or to perform the neural network training method according to the second aspect of the present disclosure and its exemplary embodiments.
Therefore, the audio signal processing method provided by the present disclosure makes it possible to significantly improve the echo cancellation effect of an audio signal by performing a frequency domain echo cancellation process based on the amplitude spectrum of the audio signal in the frequency domain and then performing a time domain echo cancellation process based on the real part data and the imaginary part data of the audio signal in the time domain, taking into account the effect of phase information.
In addition, according to the neural network training method provided by the disclosure, the first neural network and/or the second neural network can be jointly trained based on both the frequency domain echo cancellation result and the time domain echo cancellation result, or based on the frequency domain echo cancellation result, the time domain echo cancellation result and the acoustic model evaluation result, so that the quality of the audio signal processing method for echo cancellation can be further improved.
Drawings
Specific embodiments of the present disclosure will be described in detail below with reference to the drawings so that more details, features, and advantages of the present disclosure can be more fully appreciated and understood; in the drawings:
fig. 1 schematically illustrates an application scenario of an audio signal processing method according to some exemplary embodiments of the present disclosure;
Fig. 2 schematically illustrates an echo cancellation model based on a neural network model in the related art;
fig. 3 schematically illustrates, in flow chart form, an audio signal processing method provided in accordance with some exemplary embodiments of the present disclosure;
FIG. 4 schematically illustrates some details of the audio signal processing method illustrated in FIG. 3, according to some exemplary embodiments of the present disclosure;
fig. 5 schematically illustrates some details of the audio signal processing method illustrated in fig. 3, according to some exemplary embodiments of the present disclosure;
FIG. 6 schematically illustrates an audio signal processing model provided in accordance with some exemplary embodiments of the present disclosure;
FIG. 7a schematically illustrates, in flow chart form, a neural network training method provided in accordance with some exemplary embodiments of the present disclosure;
FIG. 7b schematically illustrates some details of the neural network training method illustrated in FIG. 7a, according to some exemplary embodiments of the present disclosure;
FIG. 8 schematically illustrates a neural network training model provided in accordance with some example embodiments of the present disclosure;
FIG. 9 schematically illustrates, in flow chart form, another neural network training method provided in accordance with some exemplary embodiments of the present disclosure;
FIG. 10 schematically illustrates another neural network training model provided in accordance with some example embodiments of the present disclosure;
fig. 11 schematically illustrates in block diagram form a structure of an audio signal processing apparatus according to some exemplary embodiments of the present disclosure;
fig. 12 schematically illustrates, in block diagram form, the structure of a computing device in accordance with some embodiments of the present disclosure.
It should be understood that the matters shown in the drawings are merely illustrative and thus are not necessarily drawn to scale. Furthermore, the same or similar features are denoted by the same or similar reference numerals throughout the drawings.
Detailed Description
The following description provides specific details of various exemplary embodiments of the disclosure so that those skilled in the art may fully understand and practice the technical solutions according to the present disclosure.
First, some terms involved in exemplary embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art:
Artificial Intelligence (AI): theory, methods, techniques, and application systems that use digital computers or digital-computer-controlled machines to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Echo Cancellation (Acoustic Echo Cancellation, AEC) removes the noise that arises when the audio signal played by a speaker travels back to the microphone over acoustic return paths. The main task of AEC is to estimate the acoustic transfer function from the local speaker to the local microphone, including reflection paths, and to filter the incoming audio signal with the estimated transfer function to obtain an estimate of the echo signal. The estimated echo is then subtracted from the microphone signal to obtain an echo-free signal, and this echo-free signal, rather than the raw microphone signal, is transmitted over the channel to the far end. Conventional AEC often employs adaptive filtering; recently, with the development of artificial intelligence and machine learning, neural networks have been proposed for AEC and are particularly advantageous for removing nonlinear residues in audio signals.
Signal-to-Noise Ratio (SNR): the ratio of signal (or information) to noise in an electronic device or electronic system. The signal here refers to the signal, supplied from outside the device, that needs to be processed; the noise refers to irregular additional signals (or information) that are not present in the original signal but are generated during processing, and that do not vary as the original signal varies.
Scale-Invariant Signal-to-Noise Ratio (SI-SNR) refers to a signal-to-noise ratio that is not affected by changes in the scale of the signal. For example, after normalization to reduce the influence of signal scale, the SI-SNR may be calculated according to the following formula:
s_target = (<s_est, s> / ||s||^2) * s, e_noise = s_est - s_target, and SI-SNR = 10 * log10(||s_target||^2 / ||e_noise||^2), where s_est is the estimated (evaluation) signal, s is the clean signal, <., .> denotes the element-wise product followed by summation (i.e., the inner product), and ||.|| denotes the 2-norm.
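A small numerical check of this definition, written under the assumption of zero-mean normalization, could look as follows:

```python
import numpy as np

def si_snr(estimate, clean, eps=1e-8):
    """Scale-invariant SNR in dB, following the formula above (zero-mean normalization assumed)."""
    estimate = estimate - estimate.mean()
    clean = clean - clean.mean()
    s_target = np.dot(estimate, clean) / (np.dot(clean, clean) + eps) * clean
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps) + eps)

clean = np.random.randn(16000)
print(si_snr(clean + 0.1 * np.random.randn(16000), clean))  # nearly clean estimate -> high SI-SNR
```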
Character Error Rate (CER) is one of the most common metrics for speech recognition accuracy. CER can be calculated as CER = (S + D + I) / (S + D + C), where S is the number of substituted characters, D is the number of deleted characters, I is the number of inserted characters, and C is the number of correct characters.
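For example, with made-up counts, the formula works out as in this minimal sketch:

```python
def cer(substitutions, deletions, insertions, correct):
    """CER = (S + D + I) / (S + D + C), following the definition above."""
    return (substitutions + deletions + insertions) / (substitutions + deletions + correct)

print(cer(substitutions=3, deletions=1, insertions=2, correct=94))  # 6 errors / 98 characters, about 0.061
```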
Acoustic models based on Connectionist Temporal Classification (CTC) are well known in the art. CTC is a sequence classification method that does not require frame-level alignment in the time dimension, so predictions can be made directly from the input speech features. Compared with traditional acoustic models (e.g., acoustic models based on the deep neural network-hidden Markov model, DNN-HMM), CTC-based acoustic models greatly simplify the training procedure and thus the modeling process.
Referring to fig. 1, an application scenario of an audio signal processing method according to some exemplary embodiments of the present disclosure is schematically illustrated. As shown in fig. 1, the exemplary application scenario 100 includes a first terminal device 110, a server 120, and a second terminal device 130, and the first terminal device 110 and the second terminal device 130 can communicate with the server 120 directly or indirectly through a network 140 in a wired or wireless manner, respectively.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, which are not limited in this disclosure. The first terminal device 110 and the second terminal device 130 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. Examples of network 140 may include any combination of a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a communication network such as the internet, as the disclosure is not limited in this regard. Accordingly, each of the first terminal device 110, the server 120, and the second terminal device 130 may include at least one communication interface (not shown) capable of communicating over the network 140. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless (such as IEEE 802.11 Wireless LAN (WLAN)) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, etc. It should be appreciated that the first terminal device 110 and the second terminal device 130 included in the application scenario 100 shown in fig. 1 are merely exemplary, and thus, the application scenario 100 may include more terminal devices, for example, an application scenario having a real-time communication conference in which a plurality of target objects participate. Furthermore, it should also be appreciated that in other exemplary application scenarios, the first terminal device 110 and the second terminal device 130 may communicate directly with each other, thus eliminating the need for the server 120.
In the exemplary application scenario 100 shown in fig. 1, the second terminal device 130 may collect an audio signal and perform channel coding to obtain a reference audio signal, which it then transmits to the server 120; the server 120 forwards the reference audio signal to the first terminal device 110. After receiving the reference audio signal, the first terminal device 110 performs channel decoding on it and plays it through its speaker. Meanwhile, the first terminal device 110 may also collect a local audio signal through its microphone, and because the microphone also picks up the reference audio signal played through the speaker, an echo signal is generated; thus, the audio signal collected by the first terminal device 110 through the microphone includes the echo signal. Similarly, the second terminal device 130 may receive a reference audio signal originating from the first terminal device 110 and play it through its own speaker, while the second terminal device 130 also collects a local audio signal through its microphone and, in the process, generates an echo signal because the microphone picks up the reference audio signal played through the speaker. In this case, the audio signal collected by the second terminal device 130 through the microphone also includes an echo signal. Therefore, for the terminal devices in the exemplary application scenario shown in fig. 1, echo cancellation processing needs to be performed on the acquired audio signals.
In the conventional technology, an adaptive filter is generally used to cancel the linear portion in the echo, however, the nonlinear portion in the echo tends to be difficult to cancel. Recently, neural networks have begun to be applied to echo cancellation, which is capable of removing both linear and nonlinear portions of echoes possessed by audio signals, as compared to conventional adaptive filters.
Referring to fig. 2, there is schematically shown a neural network-based echo cancellation model in the related art. As shown in fig. 2, the echo cancellation model 200 first acquires a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local speaker, and the audio signal to be processed is an audio signal acquired through a local microphone. The reference audio signal is input into a first fast fourier transform module 210a for fast fourier transform to obtain an amplitude spectrum matrix 220 of the reference audio signal, and the audio signal to be processed is input into a second fast fourier transform module 210b for fast fourier transform to obtain an amplitude spectrum matrix 230 and a phase spectrum matrix 240 of the audio signal to be processed. Then, the amplitude spectrum matrix 220 of the reference audio signal and the amplitude spectrum matrix 230 of the audio signal to be processed are input to the splicing processing module 250 for splicing processing. The stitched amplitude spectrum matrix is input into a long and short term memory neural network (i.e., LSTM neural network) 260 to generate an amplitude spectrum mask matrix 270. The amplitude spectrum mask matrix 270 and the amplitude spectrum matrix 230 of the audio signal to be processed are input to the multiplication processing module 280 to multiply each element in the amplitude spectrum matrix 230 of the audio signal to be processed with one element at a corresponding position in the amplitude spectrum mask matrix 270 to generate a masked amplitude spectrum matrix. The masked amplitude spectrum matrix is input to an inverse fourier transform module 290 together with a phase spectrum matrix 240 of the audio signal to be processed to generate the processed audio signal. The processed audio signal is the audio signal processed by the echo cancellation model 200.
As can be seen from the above analysis, the echo cancellation model 200 converts the audio signal to be processed into the frequency domain, obtains the corresponding amplitude spectrum and phase spectrum, performs echo cancellation processing on the amplitude spectrum, and then converts the result back to the time domain in combination with the phase spectrum to generate the processed audio signal. In the echo cancellation model 200, the LSTM neural network 260 performs echo cancellation processing only on the amplitude spectrum of the audio signal and discards the phase spectrum. However, the phase information of an audio signal is in fact beneficial to the neural network's processing of the signal and can further improve its performance in cancelling echo. Thus, in the related-art echo cancellation model 200, discarding the phase spectrum during processing can adversely affect the quality of echo cancellation.
Referring to fig. 3, an audio signal processing method provided according to some exemplary embodiments of the present disclosure is schematically shown in the form of a flowchart. As shown in fig. 3, the audio signal processing method 300 includes steps 310, 320, 330, 340, and 350.
In step 310, a reference audio signal and an audio signal to be processed are obtained, wherein the reference audio signal is an audio signal played through a local speaker, and the audio signal to be processed is an audio signal collected through a local microphone. It should be understood that local speaker and local microphone mean that the speaker and microphone are in the same space, and thus, the audio signal played by the speaker (i.e., the reference audio signal) may be reflected multiple times in that space and may be picked up by the microphone, thereby generating an echo signal contained in the audio signal to be processed. The reference audio signal may have a variety of sources, as non-limiting examples, the reference audio signal may be an audio signal collected and transmitted by a remote terminal device, or may be an audio signal received from a server that may be played by a speaker, or may be any audio signal at a local terminal device that may be played by a speaker, so long as the audio signal is played by a local speaker and thus may be collected by a local microphone to form an echo signal. The present disclosure is not limited to the source of the reference audio signal. Since the audio signal to be processed contains an echo signal caused by playing the reference audio signal, the audio signal to be processed needs to be subjected to echo cancellation processing.
In step 320, a magnitude spectrum matrix of the reference audio signal is obtained, and a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed are obtained. It should be appreciated that any suitable method may be employed to obtain the amplitude spectrum matrix of the reference audio signal, as well as to obtain the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, as this disclosure is not limited in this regard. In some exemplary embodiments of the present disclosure, a reference audio signal may be subjected to a fast fourier transform to obtain an amplitude spectrum matrix of the reference audio signal, and an audio signal to be processed may also be subjected to a fast fourier transform to obtain an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed. By means of fast fourier transformation, frequency domain features, namely an amplitude spectrum and a phase spectrum, of the reference audio signal and the audio signal to be processed can be obtained quickly and conveniently. It should be appreciated that any suitable method of obtaining the amplitude spectrum and the phase spectrum of the audio signal is possible, which is not limiting to the present disclosure. For example, in other exemplary embodiments of the present disclosure, the reference audio signal and the audio signal to be processed may also be input into the trained neural network model, respectively, to obtain an amplitude spectrum matrix of the reference audio signal and an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed, respectively.
In step 330, a frequency domain echo cancellation process is performed on the audio signal to be processed using a first neural network based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, and a first processed audio signal is generated. In this step, by performing corresponding processing on the frequency domain characteristics of the audio signal to be processed, the echo cancellation processing of the audio signal to be processed can be realized in the frequency domain. As a non-limiting example, the first neural network may be configured to: an amplitude spectrum filter matrix is generated based on the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed. For example, the amplitude spectrum filter matrix may be a masking matrix of the same size as the amplitude spectrum matrix of the audio signal to be processed, or may be a convolution matrix capable of being convolved with the amplitude spectrum matrix of the audio signal to be processed. It follows that the first neural network functions as: based on the acquired frequency domain features of both the reference audio signal and the audio signal to be processed (i.e., the amplitude spectrum matrices of both), a respective filter matrix is adaptively generated, which may then be used for filtering the frequency domain features of the audio signal to be processed (i.e., the amplitude spectrum matrices of the audio signal to be processed). Thereby, based on the first neural network, the echo cancellation processing of the audio signal to be processed in the frequency domain can be realized. The function of the first neural network will be described in more detail below. Further, it should be appreciated that the first neural network may be any suitable neural network model, such as, but not limited to, fully connected neural networks, convolutional neural networks, recurrent neural networks, and LSTM neural networks, among others. The present disclosure is not limited in the types of neural networks that may be used as the first neural network.
Referring to fig. 4, details of step 330 of the audio signal processing method 300 shown in fig. 3 are schematically shown, according to some exemplary embodiments of the present disclosure. As shown in fig. 4, in some exemplary embodiments of the present disclosure, step 330 of audio signal processing method 300 includes steps 330-1, 330-2, 330-3, and 330-4.
In step 330-1, the amplitude spectrum matrix of the reference audio signal is spliced with the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix. In step 330-2, the spliced amplitude spectrum matrix is input to the first neural network to generate an amplitude spectrum filter matrix. In step 330-3, a filtered amplitude spectrum matrix is generated based on the amplitude spectrum filter matrix and the amplitude spectrum matrix of the audio signal to be processed. In step 330-4, an inverse fast Fourier transform is performed, using the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed together as input parameters, to generate the first processed audio signal. In some exemplary embodiments of the present disclosure, the size of the amplitude spectrum filter matrix generated in step 330-2 is the same as the size of the amplitude spectrum matrix of the audio signal to be processed, and thus, in step 330-3, each element in the amplitude spectrum matrix of the audio signal to be processed may be multiplied with the element at the corresponding position in the amplitude spectrum filter matrix to generate the filtered amplitude spectrum matrix. It should be understood that the term "corresponding position" here means that the position index of an element in the amplitude spectrum matrix of the audio signal to be processed is the same as the position index of the corresponding element in the amplitude spectrum filter matrix; for example, the element at position (m, n) in one matrix corresponds to the element at position (m, n) in the other, where m and n are any integers greater than 0. Furthermore, in accordance with other exemplary embodiments of the present disclosure, in step 330-3, the amplitude spectrum matrix of the audio signal to be processed may be convolved with the amplitude spectrum filter matrix to generate the filtered amplitude spectrum matrix. In this case, the size of the amplitude spectrum filter matrix generated in step 330-2 need not be the same as the size of the amplitude spectrum matrix of the audio signal to be processed.
Directly multiplying elements at corresponding positions is a simple operation with a relatively small amount of computation and low model complexity. The convolution operation involves a relatively large amount of computation, but it filters the signal better and is therefore very beneficial to improving echo cancellation quality. Further, it should be appreciated that the filtered amplitude spectrum matrix may be generated based on the amplitude spectrum filter matrix and the amplitude spectrum matrix of the audio signal to be processed in any suitable manner, which is not limiting of the present disclosure.
Referring back to fig. 3, in step 340, a real data matrix and an imaginary data matrix of the reference audio signal are obtained, and a real data matrix and an imaginary data matrix of the first processed audio signal are obtained. It should be appreciated that any suitable method may be employed to obtain the real and imaginary data matrices of the reference audio signal, and to obtain the real and imaginary data matrices of the audio signal to be processed, as this disclosure is not limited in this regard. In some exemplary embodiments of the present disclosure, the reference audio signal may be subjected to a fast fourier transform to obtain a real data matrix and an imaginary data matrix of the reference audio signal, and the audio signal to be processed may also be subjected to a fast fourier transform to obtain a real data matrix and an imaginary data matrix of the audio signal to be processed. By means of fast fourier transformation, the time domain characteristics of the reference audio signal and the audio signal to be processed, namely the real part data matrix and the imaginary part data matrix, can be obtained quickly and conveniently. It should be appreciated that any suitable method of obtaining the real and imaginary data matrices of the audio signal is possible, which is not limiting in this disclosure. For example, in other exemplary embodiments of the present disclosure, the reference audio signal and the audio signal to be processed may also be input into the trained neural network model, respectively, to obtain real and imaginary data matrices of the reference audio signal and real and imaginary data matrices of the audio signal to be processed, respectively.
In step 350, the first processed audio signal is time domain echo cancellation processed with a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and a second processed audio signal is generated. In this step, by performing corresponding processing on the time domain characteristics of the audio signal to be processed, the echo cancellation processing of the audio signal to be processed can be realized in the time domain. As a non-limiting example, the second neural network may be configured to: the real data filtering matrix and the imaginary data filtering matrix are generated based on the real data matrix and the imaginary data matrix of the reference audio signal and the real data matrix and the imaginary data matrix of the first processed audio signal, respectively. For example, the real and imaginary data filtering matrices may be masking matrices of the same size as the real and imaginary data matrices, respectively, of the first processed audio signal, or may be convolution matrices capable of being convolved with the real and imaginary data matrices, respectively, of the first processed audio signal. It follows that the second neural network functions in: based on the acquired time domain features of both the reference audio signal and the first processed audio signal (i.e., the real and imaginary data matrices of both), respective filter matrices are adaptively generated, which may then be used to filter the time domain features of the first processed audio signal (i.e., the real and imaginary data matrices of the first processed audio signal). Thus, based on the second neural network, the echo cancellation processing of the first processed audio signal in the time domain can be realized. The function of the second neural network will be described in more detail below. Furthermore, it should be understood that the second neural network may likewise be any suitable neural network model, such as, but not limited to, fully connected neural networks, convolutional neural networks, recurrent neural networks, and LSTM neural networks, among others. The present disclosure is not limited in any way with respect to the types of neural networks that may be used as the second neural network.
Referring to fig. 5, details of step 350 of the audio signal processing method 300 shown in fig. 3 are schematically shown, according to some exemplary embodiments of the present disclosure. As shown in fig. 5, in some exemplary embodiments of the present disclosure, step 350 of audio signal processing method 300 includes steps 350-1, 350-2, 350-3, and 350-4.
In step 350-1, the real data matrix of the reference audio signal is stitched with the real data matrix of the first processed audio signal to generate a stitched real data matrix, and the imaginary data matrix of the reference audio signal is stitched with the imaginary data matrix of the first processed audio signal to generate a stitched imaginary data matrix. In step 350-2, the stitched real data matrix and the stitched imaginary data matrix are input to the second neural network, generating a real data filtering matrix and an imaginary data filtering matrix. In step 350-3, a filtered real data matrix is generated based on the real data matrix and the real data filtering matrix of the first processed audio signal, and a filtered imaginary data matrix is generated based on the imaginary data matrix and the imaginary data filtering matrix of the first processed audio signal. In step 350-4, an inverse fast fourier transform is performed with the filtered real data matrix and the filtered imaginary data matrix together as input parameters to generate the second processed audio signal. In some exemplary embodiments of the present disclosure, the size of the real part data filtering matrix generated in the above step 350-2 is the same as the size of the real part data matrix of the first processed audio signal, and the size of the generated imaginary part data filtering matrix is the same as the size of the imaginary part data matrix of the first processed audio signal, whereby each element in the real part data matrix of the first processed audio signal may be multiplied with one element at a corresponding position in the real part data filtering matrix to generate a filtered real part data matrix, and each element in the imaginary part data matrix of the first processed audio signal may be multiplied with one element at a corresponding position in the imaginary part data filtering matrix to generate a filtered imaginary part data matrix, in step 350-3. As already explained in detail above, the term "corresponding position" here refers to that the position number of an element in the real data matrix of the first processed audio signal in this matrix is the same as the position number of a corresponding element in the real data filter matrix, and that the position number of an element in the imaginary data matrix of the first processed audio signal in this matrix is the same as the position number of a corresponding element in the imaginary data filter matrix. Furthermore, in accordance with further exemplary embodiments of the present disclosure, in step 350-3, the real data matrix of the first processed audio signal may be convolved with the real data filter matrix to generate a filtered real data matrix, and the imaginary data matrix of the first processed audio signal may be convolved with the imaginary data filter matrix to generate a filtered imaginary data matrix. In this case, the size of the real data filtering matrix generated in the above-described step 350-2 is not necessarily the same as the size of the real data matrix of the first processed audio signal, and the size of the imaginary data filtering matrix generated is not necessarily the same as the size of the imaginary data matrix of the first processed audio signal.
Directly multiplying the elements at corresponding positions is a simple operation with a relatively small amount of computation and a low model complexity. A convolution operation involves a relatively larger amount of computation, but provides a better filtering effect on the signal and is therefore very beneficial to improving the echo cancellation quality. It should be appreciated that the filtered real part data matrix may be generated based on the real part data filtering matrix and the real part data matrix of the first processed audio signal, and the filtered imaginary part data matrix may be generated based on the imaginary part data filtering matrix and the imaginary part data matrix of the first processed audio signal, in any suitable manner; this disclosure is not limited in this respect.
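By way of illustration only, the following non-limiting NumPy/SciPy sketch shows both filtering options of step 350-3 and the reconstruction of step 350-4, under the assumption that the real part and imaginary part data matrices are laid out as frames by frequency bins; the names used (apply_filter, reconstruct, real_mask, imag_mask) and the final overlap-add step are illustrative assumptions rather than part of the original disclosure.

```python
import numpy as np
from scipy.signal import convolve2d

def apply_filter(component, filt, mode="elementwise"):
    """Apply a data filtering matrix to a real part or imaginary part data matrix."""
    if mode == "elementwise":
        # multiplication variant of step 350-3: filt must have the same size as component
        return component * filt
    # convolution variant of step 350-3: the filter size need not match the component size
    return convolve2d(component, filt, mode="same", boundary="fill")

def reconstruct(real_filtered, imag_filtered):
    """Step 350-4: inverse FFT with the filtered real and imaginary parts as inputs."""
    spectrum = real_filtered + 1j * imag_filtered   # (frames, bins)
    frames = np.fft.ifft(spectrum, axis=-1).real    # per-frame inverse FFT
    return frames                                   # frames would then be overlap-added into a waveform

# toy usage with illustrative shapes (100 frames, 257 frequency bins)
real_y1, imag_y1 = np.random.randn(2, 100, 257)     # first processed audio signal
real_mask, imag_mask = np.random.rand(2, 100, 257)  # output of the second neural network
out = reconstruct(apply_filter(real_y1, real_mask),
                  apply_filter(imag_y1, imag_mask))
```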
Referring to fig. 6, an audio signal processing model corresponding to the audio signal processing methods shown in fig. 3 to 5, provided according to some exemplary embodiments of the present disclosure, is schematically shown. As shown in fig. 6, the audio signal processing model 400 first acquires a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local speaker, and the audio signal to be processed is an audio signal collected through a local microphone. The reference audio signal is input into a first fast Fourier transform module 410a for fast Fourier transform to obtain an amplitude spectrum matrix 420a of the reference audio signal, and the audio signal to be processed is input into a second fast Fourier transform module 410b for fast Fourier transform to obtain an amplitude spectrum matrix 430a and a phase spectrum matrix 440 of the audio signal to be processed. The amplitude spectrum matrix 420a of the reference audio signal and the amplitude spectrum matrix 430a of the audio signal to be processed are input into a first splicing processing module 450a for splicing processing to generate a spliced amplitude spectrum matrix. The spliced amplitude spectrum matrix is then input into a first LSTM neural network 460a to generate an amplitude spectrum filtering matrix 470a. The amplitude spectrum filtering matrix 470a and the amplitude spectrum matrix 430a of the audio signal to be processed are input into a first filtering processing module 480a. The first filtering processing module 480a may multiply each element in the amplitude spectrum matrix 430a of the audio signal to be processed by the element at the corresponding position in the amplitude spectrum filtering matrix 470a to generate a filtered amplitude spectrum matrix, or it may convolve the amplitude spectrum matrix 430a of the audio signal to be processed with the amplitude spectrum filtering matrix 470a to generate the filtered amplitude spectrum matrix. The filtered amplitude spectrum matrix and the phase spectrum matrix 440 of the audio signal to be processed are input into a first inverse Fourier transform module 490a for inverse Fourier transform to generate a first processed audio signal. The reference audio signal is also input into a third fast Fourier transform module 410c for fast Fourier transform to obtain a real part data matrix 420b and an imaginary part data matrix 420c of the reference audio signal. The first processed audio signal is input into a fourth fast Fourier transform module 410d for fast Fourier transform to obtain a real part data matrix 430b and an imaginary part data matrix 430c of the first processed audio signal. The real part data matrix 420b and the imaginary part data matrix 420c of the reference audio signal, and the real part data matrix 430b and the imaginary part data matrix 430c of the first processed audio signal, are input into a second splicing processing module 450b for splicing processing to generate a spliced real part data matrix and a spliced imaginary part data matrix, respectively. The spliced real part data matrix and the spliced imaginary part data matrix are then input into a second LSTM neural network 460b to generate a real part data filtering matrix 470b and an imaginary part data filtering matrix 470c, respectively. The real part data filtering matrix 470b and the real part data matrix 430b of the first processed audio signal are input into a second filtering processing module 480b.
The second filtering processing module 480b may multiply each element in the real part data matrix 430b of the first processed audio signal by the element at the corresponding position in the real part data filtering matrix 470b to generate a filtered real part data matrix, or it may convolve the real part data matrix 430b of the first processed audio signal with the real part data filtering matrix 470b to generate the filtered real part data matrix. The imaginary part data filtering matrix 470c and the imaginary part data matrix 430c of the first processed audio signal are input into a third filtering processing module 480c. The third filtering processing module 480c may multiply each element in the imaginary part data matrix 430c of the first processed audio signal by the element at the corresponding position in the imaginary part data filtering matrix 470c to generate a filtered imaginary part data matrix, or it may convolve the imaginary part data matrix 430c of the first processed audio signal with the imaginary part data filtering matrix 470c to generate the filtered imaginary part data matrix. The filtered real part data matrix and the filtered imaginary part data matrix are input into a second inverse Fourier transform module 490b for inverse Fourier transform to generate a second processed audio signal.
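By way of illustration only, the following non-limiting PyTorch sketch mirrors the two-stage data flow of the audio signal processing model 400 of fig. 6; the STFT parameters, layer sizes, sigmoid masking, and the names TwoStageAEC, lstm_freq and lstm_time are assumptions made for this sketch and do not correspond exactly to the modules 410a to 490b.

```python
import torch
import torch.nn as nn

class TwoStageAEC(nn.Module):
    def __init__(self, bins=257, hidden=256):
        super().__init__()
        # stage 1 (frequency domain): LSTM that emits an amplitude-spectrum mask
        self.lstm_freq = nn.LSTM(2 * bins, hidden, batch_first=True)
        self.mask_freq = nn.Linear(hidden, bins)
        # stage 2 (time domain refinement): LSTM that emits real/imaginary masks
        self.lstm_time = nn.LSTM(4 * bins, hidden, batch_first=True)
        self.mask_ri = nn.Linear(hidden, 2 * bins)

    def stft(self, x):
        w = torch.hann_window(512, device=x.device)
        # complex (batch, bins, frames) -> (batch, frames, bins)
        return torch.stft(x, 512, hop_length=256, window=w, return_complex=True).transpose(1, 2)

    def istft(self, spec, length):
        w = torch.hann_window(512, device=spec.real.device)
        return torch.istft(spec.transpose(1, 2), 512, hop_length=256, window=w, length=length)

    def forward(self, ref, mic):
        R, D = self.stft(ref), self.stft(mic)
        # stage 1: amplitude masking, keeping the phase of the signal to be processed
        m1, _ = self.lstm_freq(torch.cat([R.abs(), D.abs()], dim=-1))
        mag1 = torch.sigmoid(self.mask_freq(m1)) * D.abs()
        y1 = self.istft(torch.polar(mag1, D.angle()), mic.shape[-1])   # first processed signal
        # stage 2: real/imaginary masking of the first processed signal
        Y1 = self.stft(y1)
        m2, _ = self.lstm_time(torch.cat([R.real, Y1.real, R.imag, Y1.imag], dim=-1))
        mr, mi = self.mask_ri(m2).chunk(2, dim=-1)
        Y2 = torch.complex(mr * Y1.real, mi * Y1.imag)
        return y1, self.istft(Y2, mic.shape[-1])                       # second processed signal
```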
From the above analysis, it can be seen that frequency domain modeling has a coarser granularity: only the amplitude spectrum can be learned, so its modeling capability is weaker, but it consumes less computational power. Time domain modeling has a finer granularity and therefore a stronger modeling capability, but it requires a larger model and greater computational power. Therefore, the audio signal processing method and the audio signal processing model provided by the exemplary embodiments of the present disclosure adopt a mixed time-frequency domain approach, which combines the advantages of both domains while overcoming their respective drawbacks, achieving finer granularity and stronger modeling capability while consuming less computational power.
Therefore, according to the audio signal processing method and the audio signal processing model provided in the exemplary embodiments of the present disclosure, a frequency domain echo cancellation process based on the amplitude spectrum of the audio signal is performed first, and a time domain echo cancellation process based on the real part data and the imaginary part data of the audio signal is performed afterwards, so that the effect of the phase information is taken into account and the echo cancellation effect on the audio signal is significantly improved. For example, after the echo cancellation process, the processed voice signal may be subjected to voice recognition and compared with the content of the real voice signal to obtain a corresponding character error rate (CER), which is used to check the effect of the echo cancellation process. In a speech recognition test, performing echo cancellation on a speech signal using the audio signal processing method provided by the exemplary embodiments of the present disclosure relatively reduces the CER by about 16% compared to conventional echo cancellation processing, as shown in the following table:
Echo cancellation processing method Test set CER
Conventional AEC 9.42
Audio signal processing method according to the present disclosure 7.88
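By way of illustration only, the following sketch shows one common way to compute such a character error rate: the edit distance between the recognized text and the reference transcript divided by the reference length. The speech recognizer itself and the exact evaluation protocol used to obtain the values in the table above are outside the scope of this sketch.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent: edit distance / reference length * 100."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 100.0 * dp[len(r)][len(h)] / max(len(r), 1)
```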
Therefore, compared with conventional echo cancellation methods, the audio signal processing method and the audio signal processing scheme according to the exemplary embodiments of the present disclosure improve the echo cancellation effect and the quality of voice transmission, and thus improve the user experience. Furthermore, both the first neural network 460a and the second neural network 460b may be trained with specific audio signals so as to enable frequency domain and time domain echo cancellation processing for the corresponding audio signals, as will be described in more detail below.
Referring to fig. 7a, a neural network training method provided according to some exemplary embodiments of the present disclosure, which may be used to train the first and second neural networks used in the above-described audio signal processing method and audio signal processing model, is schematically shown in the form of a flowchart. As shown in fig. 7a, the neural network training method 500 includes steps 510, 520, 530, 540, 550, 560, and 570.
In step 510, a reference audio signal and an audio signal to be processed are obtained, wherein the reference audio signal is an audio signal played through a local speaker, and the audio signal to be processed is a mixed audio signal obtained by mixing an audio signal collected by a local microphone with a real audio signal. In the neural network training method 500, the audio signal collected by the local microphone is essentially an echo signal caused by the reference audio signal played by the local speaker; the mixed audio signal obtained by mixing this echo signal with the real audio signal therefore constitutes, together with the reference audio signal, the audio data that can be used to train the first neural network and the second neural network. Steps 520 to 550 are the same as steps 320 to 350 in the audio signal processing method 300 shown in fig. 3 and are therefore not described here again. It should be understood that in some exemplary embodiments, step 530 may also include those steps shown in fig. 4, and step 550 may also include those steps shown in fig. 5. In step 560, a first signal-to-noise ratio loss value is obtained based on the first processed audio signal and the real audio signal, and a second signal-to-noise ratio loss value is obtained based on the second processed audio signal and the real audio signal. In some exemplary embodiments of the present disclosure, both the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value may be scale-invariant signal-to-noise ratio loss values (i.e., SI-SNR loss). However, any other suitable index for calculating signal quality loss is possible; this disclosure is not limited in this respect. In step 570, at least one of the first neural network and the second neural network is adjusted based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value.
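By way of illustration only, the following non-limiting sketch shows a scale-invariant signal-to-noise ratio (SI-SNR) loss of the kind that may be used in step 560, assuming batched waveform tensors; the exact loss used in a given embodiment may differ.

```python
import torch

def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR, averaged over the batch."""
    # zero-mean both signals
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # project the estimate onto the target (scale-invariant target component)
    s_target = (estimate * target).sum(-1, keepdim=True) * target \
               / (target.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()   # negated so that minimizing the loss maximizes SI-SNR
```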
Referring to fig. 7b, step 570 is further detailed according to some exemplary embodiments of the present disclosure. As shown in fig. 7b, step 570 further comprises: step 570-1, performing a weighted summation of the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value to obtain a comprehensive signal-to-noise ratio loss value; and step 570-2, adjusting at least one of the first neural network and the second neural network based on the comprehensive signal-to-noise ratio loss value. In some exemplary embodiments of the present disclosure, the weight of the first signal-to-noise ratio loss value may be different from the weight of the second signal-to-noise ratio loss value, depending on the degree to which the application relies on the first and second neural networks; for example, the weight of the first signal-to-noise ratio loss value may be greater than the weight of the second signal-to-noise ratio loss value. In such exemplary embodiments, the first neural network and the second neural network may be jointly trained using the neural network training method according to the exemplary embodiments of the present disclosure, but in practical applications the echo cancellation processing may be performed using only the first neural network, without using the second neural network. In this way, the complexity of the echo cancellation processing, and thus the amount of computation, can be reduced in practical applications while the quality of the echo processing is still ensured. In other exemplary embodiments of the present disclosure, the weight of the first signal-to-noise ratio loss value may be less than, or equal to, the weight of the second signal-to-noise ratio loss value. It should be understood that the signal-to-noise ratio loss values or scale-invariant signal-to-noise ratio loss values recited in this disclosure are exemplary; in fact, any suitable other index may be used to train the neural networks, and this disclosure is not limited in this respect.
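By way of illustration only, steps 570-1 and 570-2 may be sketched as follows, reusing the si_snr_loss sketch above; the weights w1 and w2, the single optimizer over both networks, and the model interface (e.g., the TwoStageAEC sketch above) are illustrative assumptions.

```python
def training_step(model, optimizer, ref, mic, real_audio, w1=0.7, w2=0.3):
    """One joint update of the first and second neural networks (steps 570-1 and 570-2)."""
    first_processed, second_processed = model(ref, mic)   # e.g. the TwoStageAEC sketch above
    loss1 = si_snr_loss(first_processed, real_audio)
    loss2 = si_snr_loss(second_processed, real_audio)
    comprehensive = w1 * loss1 + w2 * loss2               # weighted summation of the two losses
    optimizer.zero_grad()
    comprehensive.backward()                              # gradients reach both networks
    optimizer.step()
    return comprehensive.item()
```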
Referring to fig. 8, a neural network training model corresponding to the neural network training method shown in fig. 7a and fig. 7b, provided in accordance with some exemplary embodiments of the present disclosure, is schematically illustrated. It should be appreciated that the neural network training model 600 shown in fig. 8 is substantially the same as the audio signal processing model 400 shown in fig. 6, except that the neural network training model 600 shown in fig. 8 further includes a first signal-to-noise ratio (SNR) loss calculation module 610a, a second SNR loss calculation module 610b, and a comprehensive SNR loss calculation module 620. Only these differences are described hereinafter; the points that are the same as those shown in fig. 6 are not described again.
The first SNR loss calculation module 610a receives the first processed audio signal from the first inverse Fourier transform module 490a and calculates a first SNR loss value. The second SNR loss calculation module 610b receives the second processed audio signal from the second inverse Fourier transform module 490b and calculates a second SNR loss value. The comprehensive SNR loss calculation module 620 receives the first SNR loss value and the second SNR loss value and performs a weighted summation of the two to obtain a comprehensive SNR loss value. As described above, the weight of the first SNR loss value may be different from the weight of the second SNR loss value, depending on the degree to which the application relies on the first and second neural networks. For example, in some exemplary embodiments of the present disclosure, the weight of the first SNR loss value may be greater than the weight of the second SNR loss value; in other exemplary embodiments, the weight of the first SNR loss value may be less than or equal to the weight of the second SNR loss value. At least one of the first neural network 460a and the second neural network 460b may then be adjusted based on the comprehensive SNR loss value.
Thus, the neural network training method 500 and the neural network training model 600 are capable of jointly training the first neural network and/or the second neural network based on both the result of the frequency domain echo cancellation and the result of the time domain echo cancellation. In addition, after the joint training, only the frequency domain echo cancellation part (i.e., the first neural network 460a and its related processing links) may be used in practical applications, without the time domain echo cancellation part (i.e., the second neural network 460b and its related processing links), so that the complexity of the neural network model used in the application is kept low while a high quality of echo cancellation is still ensured.
Referring to fig. 9, another neural network training method provided according to some exemplary embodiments of the present disclosure, which may be used to train the first and second neural networks used in the above-described audio signal processing method and audio signal processing model, is schematically shown in the form of a flowchart. As shown in fig. 9, the neural network training method 700 includes steps 710, 720, 730, 740, 750, 760, 770, 780, and 790.
Steps 710 to 760 in the neural network training method 700 are the same as steps 510 to 560 in the neural network training method 500 shown in fig. 7a and are therefore not described in detail herein. Furthermore, it should be understood that in some exemplary embodiments, step 730 may also include those steps shown in fig. 4, and step 750 may also include those steps shown in fig. 5. In step 770, the first processed audio signal is input into a CTC-based acoustic model to obtain a CTC loss value. In step 780, the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value are weighted and summed to obtain a comprehensive loss value. In step 790, at least one of the first neural network and the second neural network is adjusted based on the comprehensive loss value. In some exemplary embodiments of the present disclosure, both the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value may be scale-invariant signal-to-noise ratio loss values (i.e., SI-SNR loss). However, any other suitable index for calculating signal quality loss is possible; this disclosure is not limited in this respect. Further, in some exemplary embodiments of the present disclosure, the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value may each have a different weight. For example, the weight of the first signal-to-noise ratio loss value may be greater than both the weight of the second signal-to-noise ratio loss value and the weight of the CTC loss value. However, it should be appreciated that any other weighting of the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value is possible, depending on the actual situation.
Referring to fig. 10, another neural network training model, corresponding to the neural network training method shown in fig. 9 and provided in accordance with some exemplary embodiments of the present disclosure, is schematically shown. It should be appreciated that the neural network training model 800 shown in fig. 10 is substantially identical to the neural network training model 600 shown in fig. 8, except that the neural network training model 800 shown in fig. 10 further includes a comprehensive loss value calculation module 810 and a CTC-based acoustic model. Only these differences are described below; the points that are the same as those shown in fig. 8 are not described again.
As shown in fig. 10, the first processed audio signal from the first inverse Fourier transform module 490a is supplied to a filter bank (FBank) feature extraction module 820 to acquire frequency band features, then passes sequentially through a difference processing module 830 and a mean square value difference processing module 840 to acquire acoustic features, which are then passed through a convolution module 850, supplied to a third neural network module 860 for processing, and, after being linearized by a linearization module 870, supplied to a CTC loss value calculation module 880 to calculate a CTC loss value. As shown in fig. 10, in this exemplary embodiment, the third neural network module 860 may include an LSTM network and a batch normalization (BN) layer. However, the third neural network module 860 may also include other types of neural network models, such as fully connected neural networks, convolutional neural networks, recurrent neural networks, and the like, to which the present disclosure is not limited. In addition, the FBank feature extraction module 820, the difference processing module 830, the mean square value difference processing module 840, the convolution module 850, the third neural network module 860, the linearization module 870, and the CTC loss value calculation module 880 together form an exemplary CTC-based acoustic model, which is well known to those skilled in the art and is therefore not described in detail herein. It should be understood that other types of CTC-based acoustic models are also possible, and the present disclosure is not limited in this respect. The comprehensive loss value calculation module 810 receives the first SNR loss value, the second SNR loss value, and the CTC loss value and performs a weighted summation of the three to obtain a comprehensive loss value. Then, at least one of the first neural network 460a and the second neural network 460b, or at least one of the first neural network 460a, the second neural network 460b, and the third neural network module 860, may be adjusted based on the comprehensive loss value.
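By way of illustration only, the following rough PyTorch sketch mirrors the CTC branch described above, approximating the FBank front end with a log-Mel spectrogram, the difference processing with first- and second-order deltas, and the mean square value difference processing with a simple mean-variance normalization; all layer sizes, the vocabulary size, and the name CTCBranch are illustrative assumptions rather than the exact modules 820 to 880.

```python
import torch
import torch.nn as nn
import torchaudio

class CTCBranch(nn.Module):
    def __init__(self, n_mels=80, hidden=320, vocab=4233):
        super().__init__()
        self.fbank = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=512,
                                                          hop_length=160, n_mels=n_mels)
        self.conv = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
        self.lstm = nn.LSTM(32 * ((n_mels + 1) // 2), hidden, batch_first=True)
        self.bn = nn.BatchNorm1d(hidden)
        self.linear = nn.Linear(hidden, vocab)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, wav, targets, target_lengths):
        feats = (self.fbank(wav) + 1e-6).log().transpose(1, 2)                 # (batch, frames, mels)
        d1 = torchaudio.functional.compute_deltas(feats.transpose(1, 2)).transpose(1, 2)
        d2 = torchaudio.functional.compute_deltas(d1.transpose(1, 2)).transpose(1, 2)
        x = torch.stack([feats, d1, d2], dim=1)                                 # (batch, 3, frames, mels)
        x = (x - x.mean(dim=(2, 3), keepdim=True)) / (x.std(dim=(2, 3), keepdim=True) + 1e-6)
        x = torch.relu(self.conv(x))                                            # (batch, 32, frames/2, mels/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.lstm(x)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)                          # BN over the hidden features
        log_probs = self.linear(x).log_softmax(-1)                              # (batch, frames, vocab)
        input_lengths = torch.full((b,), t, dtype=torch.long)
        return self.ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

In the neural network training method 700, the loss value returned by such a branch would then be weighted together with the first and second SI-SNR loss values in step 780 before the backward pass.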
Thus, the neural network training method 700 and the neural network training model 800 are capable of jointly training the first neural network and/or the second neural network based on the result of the frequency domain echo cancellation, the result of the time domain echo cancellation, and the result of the acoustic model evaluation. In the joint training, gradients from the acoustic model can be propagated to the first and second neural networks for their further training, so that the performance of the trained neural networks can be further improved. In addition, because the acoustic model itself is not updated during training, other acoustic models that have not undergone joint training can be directly connected to the trained audio signal processing model when it is brought online, so that the trained audio signal processing model has strong generality.
Referring to fig. 11, a structure of an audio signal processing apparatus according to some exemplary embodiments of the present disclosure is schematically shown in block diagram form. As shown in fig. 11, the audio signal processing apparatus 900 includes: an audio signal acquisition module 910, an audio signal frequency domain information acquisition module 920, a frequency domain echo cancellation processing module 930, an audio signal time domain information acquisition module 940, and a time domain echo cancellation processing module 950.
The audio signal acquisition module 910 is configured to acquire a reference audio signal and an audio signal to be processed. The audio signal frequency domain information acquisition module 920 is configured to obtain an amplitude spectrum matrix of the reference audio signal and obtain an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed. The frequency domain echo cancellation processing module 930 is configured to perform frequency domain echo cancellation processing on the audio signal to be processed using a first neural network, based on the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed, and to generate a first processed audio signal. The audio signal time domain information acquisition module 940 is configured to obtain a real part data matrix and an imaginary part data matrix of the reference audio signal and obtain a real part data matrix and an imaginary part data matrix of the first processed audio signal. The time domain echo cancellation processing module 950 is configured to perform time domain echo cancellation processing on the first processed audio signal using a second neural network, based on the real part data matrix and the imaginary part data matrix of the reference audio signal and the real part data matrix and the imaginary part data matrix of the first processed audio signal, and to generate a second processed audio signal. The above-described modules relate to the operations of steps 310 to 350 described above with respect to fig. 3 and are therefore not described in detail herein.
Furthermore, each of the modules described above with respect to fig. 11 may be implemented in hardware, or in hardware in combination with software and/or firmware. For example, the modules may be implemented as computer-executable code/instructions configured to be executed on one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of these modules may be implemented together in a system on a chip (SoC). The SoC may include an integrated circuit chip, which includes one or more of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform its functions.
Referring to fig. 12, a structure of a computing device 1000 in accordance with some embodiments of the present disclosure is schematically shown in block diagram form. The computing device 1000 may be used in various application scenarios described in this disclosure, and it may implement the audio signal processing methods and/or neural network training methods described in this disclosure.
The computing device 1000 may include at least one processor 1002, memory 1004, communication interface(s) 1006, display device 1008, other input/output (I/O) devices 1010, and one or more mass storage 1012, capable of communicating with each other, such as through a system bus 1014 or other suitable means of connection.
The processor 1002 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 1002 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 1002 may be configured to, among other capabilities, obtain and execute computer-readable instructions stored in the memory 1004, mass storage 1012, or other computer-readable medium, such as program code for the operating system 1016, program code for the application programs 1018, program code for the other programs 1020, and so forth.
Memory 1004 and mass storage device 1012 are examples of computer-readable storage media for storing instructions that can be executed by the processor 1002 to implement the various functions described previously. For example, the memory 1004 may generally include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, the mass storage device 1012 may generally include hard drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network attached storage, storage area networks, and the like. The memory 1004 and the mass storage device 1012 may both be referred to herein collectively as a computer-readable memory or a computer-readable storage medium, and may be a non-transitory medium capable of storing computer-readable, processor-executable program instructions as computer-executable code that may be executed by the processor 1002, as a particular machine, to implement the operations and functions described in the various exemplary embodiments of the present disclosure (e.g., the audio signal processing method and model, the neural network training methods, and the models corresponding thereto, described above).
A number of program modules may be stored on the mass storage device 1012. These program modules include an operating system 1016, one or more application programs 1018, other programs 1020, and program data 1022, which can be executed by the processor 1002. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer-executable code or instructions) for implementing the following components/functions: an audio signal acquisition module 910, an audio signal frequency domain information acquisition module 920, a frequency domain echo cancellation processing module 930, an audio signal time domain information acquisition module 940, and a time domain echo cancellation processing module 950.
Although illustrated in fig. 12 as being stored in the memory 1004 of the computing device 1000, the audio signal processing methods and models, the neural network training methods and models, and the audio signal acquisition module 910, the audio signal frequency domain information acquisition module 920, the frequency domain echo cancellation processing module 930, the audio signal time domain information acquisition module 940, and the time domain echo cancellation processing module 950, or portions thereof, may be implemented using any form of computer readable media accessible by the computing device 1000. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computing device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. Computer storage media as defined by the present disclosure does not include communication media.
Computing device 1000 may also include one or more communication interfaces 1006 for exchanging data with other devices, such as via a network, direct connection, or the like. The communication interface 1006 may facilitate communications within a variety of network and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 1006 may also provide communication with external storage devices (not shown) such as in a storage array, network attached storage, storage area network, or the like.
In some examples, computing device 1000 may also include a display device 1008, such as a display, for displaying information and images. Other I/O devices 1010 may be devices that receive various inputs from and provide various outputs to a target object, including but not limited to touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
The present disclosure also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computing device to perform the audio signal processing method and/or the neural network training method provided in the various alternative embodiments described above.
The terminology used herein is for the purpose of describing embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this disclosure, specify the presence of stated features, but do not preclude the presence or addition of one or more other features. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various features, these features should not be limited by these terms. These terms are only used to distinguish one feature from another feature.
Unless defined otherwise, all terms (including technical and scientific terms) used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the description of the present specification, the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc. describe mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
The disclosure describes various techniques in the general context of software hardware elements or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used in this disclosure generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described in this disclosure are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a list of executable instructions for implementing the logic functions, may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. Furthermore, it should also be understood that the various steps of the methods shown in the flowcharts or otherwise described herein are merely exemplary and do not imply that the steps of the illustrated or described methods must be performed in accordance with the steps shown or described. Rather, the various steps of the methods shown in the flowcharts or otherwise described herein may be performed in a different order than in the present disclosure, or may be performed simultaneously. Furthermore, the methods represented in the flowcharts or otherwise described herein may include other additional steps as desired.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, it may be implemented using any one or combination of the following techniques, as known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be performed by hardware associated with program instructions, and the program may be stored in a computer readable storage medium, which when executed, includes performing one or a combination of the steps of the method embodiments.
Although the present disclosure has been described in detail in connection with some exemplary embodiments, it is not intended to be limited to the specific form set forth in the disclosure. Rather, the scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. An audio signal processing method, comprising:
acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal acquired through a local microphone;
obtaining an amplitude spectrum matrix of the reference audio signal and obtaining an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed;
based on the magnitude spectrum matrix of the reference audio signal, the magnitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal, including: splicing the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix, inputting the spliced amplitude spectrum matrix into the first neural network to generate an amplitude spectrum filtering matrix, generating a filtered amplitude spectrum matrix based on the amplitude spectrum filtering matrix and the amplitude spectrum matrix of the audio signal to be processed, and performing inverse fast Fourier transform by taking the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed together as input parameters to generate the audio signal after the first processing;
obtaining a real data matrix and an imaginary data matrix of the reference audio signal and obtaining a real data matrix and an imaginary data matrix of the first processed audio signal;
performing time domain echo cancellation processing on the first processed audio signal using a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal, comprising: splicing the real part data matrix of the reference audio signal with the real part data matrix of the first processed audio signal to generate a spliced real part data matrix, and splicing the imaginary part data matrix of the reference audio signal with the imaginary part data matrix of the first processed audio signal to generate a spliced imaginary part data matrix;
inputting the spliced real part data matrix and the spliced imaginary part data matrix into the second neural network to generate a real part data filtering matrix and an imaginary part data filtering matrix; generating a filtered real part data matrix based on the real part data matrix and the real part data filtering matrix of the first processed audio signal, and generating a filtered imaginary part data matrix based on the imaginary part data matrix and the imaginary part data filtering matrix of the first processed audio signal; and performing inverse fast fourier transform with the filtered real part data matrix and the filtered imaginary part data matrix together as input parameters to generate the second processed audio signal.
2. The audio signal processing method of claim 1, wherein the obtaining the amplitude spectrum matrix of the reference audio signal comprises: performing a fast fourier transform on the reference audio signal to obtain an amplitude spectrum matrix of the reference audio signal, and
the obtaining of the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed comprises: and performing fast Fourier transform on the audio signal to be processed to obtain an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed.
3. The audio signal processing method of claim 1, wherein the magnitude spectrum filter matrix is the same as the magnitude spectrum matrix of the audio signal to be processed, and wherein the generating a filtered magnitude spectrum matrix based on the magnitude spectrum filter matrix and the magnitude spectrum matrix of the audio signal to be processed comprises:
multiplying each element in the amplitude spectrum matrix of the audio signal to be processed with one element in a corresponding position in the amplitude spectrum filter matrix to generate the filtered amplitude spectrum matrix.
4. The audio signal processing method of claim 1, wherein the generating a filtered magnitude spectrum matrix based on the magnitude spectrum filter matrix and the magnitude spectrum matrix of the audio signal to be processed comprises:
convolving the magnitude spectrum matrix of the audio signal to be processed with the magnitude spectrum filtering matrix to generate the filtered magnitude spectrum matrix.
5. The audio signal processing method of claim 1, wherein the obtaining the real and imaginary data matrices of the reference audio signal and the obtaining the real and imaginary data matrices of the first processed audio signal comprises:
performing a fast fourier transform on the reference audio signal to obtain a real data matrix and an imaginary data matrix of the reference audio signal;
performing fast fourier transform on the first processed audio signal to obtain a real data matrix and an imaginary data matrix of the first processed audio signal.
6. The audio signal processing method of claim 1, wherein the real data filter matrix has a size identical to a size of a real data matrix of the first processed audio signal, the imaginary data filter matrix has a size identical to a size of an imaginary data matrix of the first processed audio signal, and wherein the generating a filtered real data matrix based on the real data matrix and the real data filter matrix of the first processed audio signal, the generating a filtered imaginary data matrix based on the imaginary data matrix and the imaginary data filter matrix of the first processed audio signal comprises:
Multiplying each element in the real part data matrix of the first processed audio signal with an element in a corresponding position in the real part data filtering matrix to generate the filtered real part data matrix;
multiplying each element of the imaginary data matrix of the first processed audio signal with an element of a corresponding position in the imaginary data filter matrix to generate the filtered imaginary data matrix.
7. The audio signal processing method of claim 1, wherein the generating a filtered real data matrix based on the real data matrix and the real data filtering matrix of the first processed audio signal comprises: convolving the real data matrix of the first processed audio signal with the real data filtering matrix to generate the filtered real data matrix, and
the generating a filtered imaginary data matrix based on the imaginary data matrix of the first processed audio signal and the imaginary data filtering matrix comprises: convolving an imaginary data matrix of the first processed audio signal with the imaginary data filtering matrix to generate the filtered imaginary data matrix.
8. The audio signal processing method of claim 1, wherein the first neural network and the second neural network are both long-short-term memory neural networks.
9. A neural network training method, comprising:
acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is a mixed audio signal obtained by mixing an audio signal acquired by a local microphone with a real audio signal;
obtaining an amplitude spectrum matrix of the reference audio signal and obtaining an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed;
based on the magnitude spectrum matrix of the reference audio signal, the magnitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal, including: splicing the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix, inputting the spliced amplitude spectrum matrix into the first neural network to generate an amplitude spectrum filtering matrix, generating a filtered amplitude spectrum matrix based on the amplitude spectrum filtering matrix and the amplitude spectrum matrix of the audio signal to be processed, and performing inverse fast Fourier transform by taking the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed together as input parameters to generate the audio signal after the first processing;
obtaining a real data matrix and an imaginary data matrix of the reference audio signal and obtaining a real data matrix and an imaginary data matrix of the first processed audio signal;
performing time domain echo cancellation processing on the first processed audio signal using a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal, comprising: splicing the real part data matrix of the reference audio signal with the real part data matrix of the first processed audio signal to generate a spliced real part data matrix, and splicing the imaginary part data matrix of the reference audio signal with the imaginary part data matrix of the first processed audio signal to generate a spliced imaginary part data matrix; inputting the spliced real part data matrix and the spliced imaginary part data matrix into the second neural network to generate a real part data filtering matrix and an imaginary part data filtering matrix; generating a filtered real part data matrix based on the real part data matrix and the real part data filtering matrix of the first processed audio signal, and generating a filtered imaginary part data matrix based on the imaginary part data matrix and the imaginary part data filtering matrix of the first processed audio signal; taking the filtered real part data matrix and the filtered imaginary part data matrix as input parameters together for performing inverse fast fourier transform to generate the second processed audio signal;
obtaining a first signal-to-noise ratio loss value based on the first processed audio signal and the real audio signal, and obtaining a second signal-to-noise ratio loss value based on the second processed audio signal and the real audio signal;
at least one of the first neural network and the second neural network is adjusted based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value.
10. The neural network training method of claim 9, wherein adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises:
carrying out weighted summation on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value to obtain a comprehensive signal-to-noise ratio loss value;
at least one of the first neural network and the second neural network is adjusted based on the integrated signal-to-noise ratio loss value.
11. The neural network training method of claim 10, wherein the first signal-to-noise ratio penalty is weighted more than the second signal-to-noise ratio penalty.
12. The neural network training method of claim 9, wherein adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises:
inputting the first processed audio signal into an acoustic model based on CTC to obtain a CTC loss value;
carrying out weighted summation on the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value and the CTC loss value to obtain a comprehensive loss value; and
at least one of the first neural network and the second neural network is adjusted based on the composite loss value.
13. The neural network training method of any of claims 9 to 12, wherein the first and second signal-to-noise ratio loss values are both scale-invariant signal-to-noise ratio loss values.
14. An audio signal processing apparatus comprising:
an audio signal acquisition module configured to: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal acquired through a local microphone;
an audio signal frequency domain information acquisition module configured to: obtaining an amplitude spectrum matrix of the reference audio signal and obtaining an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed;
a frequency domain echo cancellation processing module configured to: based on the magnitude spectrum matrix of the reference audio signal and the magnitude spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal, including: splicing the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix, inputting the spliced amplitude spectrum matrix into the first neural network to generate an amplitude spectrum filtering matrix, generating a filtered amplitude spectrum matrix based on the amplitude spectrum filtering matrix and the amplitude spectrum matrix of the audio signal to be processed, and performing inverse fast Fourier transform by taking the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed together as input parameters to generate the audio signal after the first processing;
an audio signal time domain information acquisition module configured to: obtaining a real data matrix and an imaginary data matrix of the reference audio signal and obtaining a real data matrix and an imaginary data matrix of the first processed audio signal;
a time domain echo cancellation processing module configured to: performing time domain echo cancellation processing on the first processed audio signal using a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal, comprising: splicing the real part data matrix of the reference audio signal with the real part data matrix of the first processed audio signal to generate a spliced real part data matrix, and splicing the imaginary part data matrix of the reference audio signal with the imaginary part data matrix of the first processed audio signal to generate a spliced imaginary part data matrix; inputting the spliced real part data matrix and the spliced imaginary part data matrix into the second neural network to generate a real part data filtering matrix and an imaginary part data filtering matrix; generating a filtered real part data matrix based on the real part data matrix and the real part data filtering matrix of the first processed audio signal, and generating a filtered imaginary part data matrix based on the imaginary part data matrix and the imaginary part data filtering matrix of the first processed audio signal; and performing inverse fast fourier transform with the filtered real part data matrix and the filtered imaginary part data matrix together as input parameters to generate the second processed audio signal.
15. A computing device comprising a processor and a memory configured to store computer-executable instructions configured to, when executed on the processor, cause the processor to perform the audio signal processing method according to any one of claims 1 to 8 or cause the processor to perform the neural network training method according to any one of claims 9 to 13.
16. A computer readable storage medium configured to store computer executable instructions configured to, when executed on a processor, cause the processor to perform the audio signal processing method according to any one of claims 1 to 8 or to perform the neural network training method according to any one of claims 9 to 13.
CN202210459690.1A 2022-04-28 2022-04-28 Audio signal processing method and device, training method, training device and medium Active CN115116471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210459690.1A CN115116471B (en) 2022-04-28 2022-04-28 Audio signal processing method and device, training method, training device and medium

Publications (2)

Publication Number Publication Date
CN115116471A CN115116471A (en) 2022-09-27
CN115116471B (en) 2024-02-13

Family

ID=83326289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210459690.1A Active CN115116471B (en) 2022-04-28 2022-04-28 Audio signal processing method and device, training method, training device and medium

Country Status (1)

Country Link
CN (1) CN115116471B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689878A (en) * 2021-07-26 2021-11-23 浙江大华技术股份有限公司 Echo cancellation method, echo cancellation device, and computer-readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140225A (en) * 2020-01-20 2021-07-20 腾讯科技(深圳)有限公司 Voice signal processing method and device, electronic equipment and storage medium
WO2021196905A1 (en) * 2020-04-01 2021-10-07 腾讯科技(深圳)有限公司 Voice signal dereverberation processing method and apparatus, computer device and storage medium
CN113744748A (en) * 2021-08-06 2021-12-03 浙江大华技术股份有限公司 Network model training method, echo cancellation method and device
CN113870888A (en) * 2021-09-24 2021-12-31 武汉大学 Feature extraction method and device based on time domain and frequency domain of voice signal, and echo cancellation method and device
CN114121031A (en) * 2021-12-08 2022-03-01 思必驰科技股份有限公司 Device voice noise reduction method, electronic device, and storage medium
CN114242098A (en) * 2021-12-13 2022-03-25 北京百度网讯科技有限公司 Voice enhancement method, device, equipment and storage medium
CN114283795A (en) * 2021-12-24 2022-04-05 思必驰科技股份有限公司 Training and recognition method of voice enhancement model, electronic equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Deep Learning for Joint Acoustic Echo and Noise Cancellation with Nonlinear Distortions; Hao Zhang, et al.; INTERSPEECH 2019; full text *
DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement; Xiaohuai Le, et al.; https://arxiv.org/pdf/2107.05429.pdf; full text *
Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression; Nils L. Westhausen, et al.; https://arxiv.org/pdf/2005.07551.pdf; full text *
F-T-LSTM based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement; Shimin Zhang, et al.; https://arxiv.org/pdf/2106.07577.pdf; full text *
Echo and Noise Suppression Algorithm Based on BLSTM Neural Network; Wang Dongxia, et al.; Journal of Signal Processing; full text *
Research and Implementation of Echo Cancellation Algorithms in Smart Speakers; Zhang Wei; China Masters' Theses Full-text Database; full text *

Similar Documents

Publication Publication Date Title
CN110288978B (en) Speech recognition model training method and device
CN110503971A Neural-network-based time-frequency mask estimation and beamforming for speech processing
CN110415686A Speech processing method, apparatus, medium and electronic device
CN111968658B (en) Speech signal enhancement method, device, electronic equipment and storage medium
WO2019227574A1 (en) Voice model training method, voice recognition method, device and equipment, and medium
CN109242092B (en) Image processing method and device, electronic equipment and storage medium
CN115116471B (en) Audio signal processing method and device, training method, training device and medium
CN114242044B (en) Voice quality evaluation method, voice quality evaluation model training method and device
CN109509010A Multimedia information processing method, terminal and storage medium
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
CN112491442A (en) Self-interference elimination method and device
CN113707167A (en) Training method and training device for residual echo suppression model
CN116013344A (en) Speech enhancement method under multiple noise environments
CN112486784A (en) Method, apparatus and medium for diagnosing and optimizing data analysis system
WO2024027295A1 (en) Speech enhancement model training method and apparatus, enhancement method, electronic device, storage medium, and program product
CN111414669B (en) Audio data processing method and device
CN116306672A (en) Data processing method and device
US20220358362A1 (en) Data processing method, electronic device and computer-readable medium
Wu et al. Optimal design of NLMS algorithm with a variable scaler against impulsive interference
WO2022077305A1 (en) Method and system for acoustic echo cancellation
CN111613211B (en) Method and device for processing specific word voice
US11689693B2 (en) Video frame interpolation method and device, computer readable storage medium
CN113763978A (en) Voice signal processing method, device, electronic equipment and storage medium
Huang et al. A method for extracting fingerprint feature of communication satellite signal
CN112329692A (en) Wireless sensing method and device for cross-scene human behavior under limited sample condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant