CN115862651A - Audio processing method and device


Info

Publication number
CN115862651A
Authority
CN
China
Prior art keywords
signal
transfer function
audio
spatial
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211436870.4A
Other languages
Chinese (zh)
Inventor
王少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202211436870.4A priority Critical patent/CN115862651A/en
Publication of CN115862651A publication Critical patent/CN115862651A/en
Pending legal-status Critical Current

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses an audio processing method and an audio processing device, and belongs to the field of communications technology. The method comprises the following steps: acquiring an audio signal, wherein the audio signal comprises a first audio sub-signal and a second audio sub-signal which are acquired by different microphones of the electronic equipment; constructing a voice covariance matrix and a noise covariance matrix corresponding to the audio signal according to the existence probability of the voice signal corresponding to each audio frequency point in the audio signal; obtaining a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, and inverting the mixing matrix to determine a de-mixing matrix of the audio signal, wherein the mixing matrix comprises a first spatial transfer function corresponding to the voice signal channel in the audio signal and a second spatial transfer function corresponding to the noise signal channel in the audio signal; and outputting, according to the de-mixing matrix and the audio signal, a first voice signal and a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal.

Description

Audio processing method and device
Technical Field
The present application belongs to the field of communications technologies, and in particular, to an audio processing method and apparatus.
Background
Human perception of sound includes not only the three elements of loudness, pitch and timbre, but also the spatial information of sound, such as its direction, distance and the surrounding environment.
Compared with a mono signal, stereo sound carries the spatial information of the sound signal. With the development of technology, portable devices with multiple microphones, such as mobile phones and tablets, have become widespread, and stereo recording is becoming a basic function.
In the related art, stereo output requires multiple speech signals that retain spatial information, whereas a conventional speech enhancement algorithm outputs only a single speech channel. Therefore, whether a beamforming algorithm or a blind source separation algorithm is adopted, producing the desired stereo generally requires multiple passes of spatial filtering, and the computational complexity of such repeated spatial filtering is high.
Therefore, how to better perform stereo output has become an urgent problem to be solved in the industry.
Disclosure of Invention
The embodiments of the present application provide an audio processing method and an audio processing apparatus, which can solve the problem that stereo output requires multiple passes of spatial filtering and therefore has high computational complexity.
In a first aspect, an embodiment of the present application provides an audio processing method, where the method includes:
acquiring audio signals, wherein the audio signals comprise a first audio sub-signal and a second audio sub-signal which are acquired by different microphones of the electronic equipment;
constructing a voice covariance matrix and a noise covariance matrix corresponding to the audio signals according to the existence probability of the voice signals corresponding to each audio frequency point in the audio signals;
obtaining a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, and inverting the mixing matrix to determine a de-mixing matrix of the audio signal; wherein the mixing matrix comprises a first spatial transfer function corresponding to a voice signal channel in the audio signal and a second spatial transfer function corresponding to a noise signal channel in the audio signal;
and respectively outputting, according to the de-mixing matrix and the audio signal, a first voice signal and a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal.
In a second aspect, an embodiment of the present application provides an audio processing apparatus, including:
the acquisition module is used for acquiring audio signals, wherein the audio signals comprise a first audio sub-signal and a second audio sub-signal which are acquired by different microphones of the electronic equipment;
the building module is used for building a voice covariance matrix and a noise covariance matrix corresponding to the audio signals according to the existence probability of the voice signals corresponding to each audio frequency point in the audio signals;
the processing module is used for obtaining a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, inverting the mixing matrix and determining a de-mixing matrix of the audio signal; wherein the mixing matrix comprises a first spatial transfer function corresponding to a voice signal channel in the audio signal and a second spatial transfer function corresponding to a noise signal channel in the audio signal;
and the output module is used for respectively outputting, according to the de-mixing matrix and the audio signal, a first voice signal and a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor and a memory, where the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, stored on a storage medium, for execution by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, after the audio signal is obtained, the existence probability of the voice signal corresponding to each audio frequency point in the audio signal can be used as supervision information, and the voice covariance matrix and the noise covariance matrix are constructed from this supervision information. The supervision information helps to identify which covariance matrix is the voice covariance matrix, thereby solving the channel selection problem in the blind source separation algorithm. The mixing matrix corresponding to the audio signal is then calculated through the spatial transfer functions, the de-mixing matrix is determined from the mixing matrix, and the first voice signal, the first noise signal, the second voice signal and the second noise signal are output according to the de-mixing matrix and the audio signal. Multiple passes of spatial filtering are not needed, which effectively reduces the computational complexity and improves the robustness of the algorithm.
Drawings
FIG. 1 is a schematic diagram of human voice enhancement in the related art;
FIG. 2 is a schematic flowchart of an audio processing method provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 5 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be implemented in orders other than those illustrated or described herein. Objects distinguished by "first", "second" and the like are generally of one class, and the number of objects is not limited; for example, the first object may be one or more than one. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
An audio processing method and an audio processing apparatus provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings.
In the related art, stereo human voice enhancement is an important application scenario of stereo. Fig. 1 is a schematic diagram of human voice enhancement in the related art. As shown in fig. 1, taking two microphones as an example and assuming that the sound is located in a noisy and reverberant environment, the signals collected by the microphones can be represented as $x_1(n)$ and $x_2(n)$ according to equation 1:

$$x_m(n) = a_m(n) * s(n) + r_m(n), \quad m = 1, 2, \tag{1}$$

where $s(n)$ denotes the sound source of the audio signal, $a_m(n)$ represents the Acoustic Transfer Function (ATF) of the source with respect to the $m$-th microphone, $*$ denotes convolution, and $r_m(n)$ represents the noise component at the $m$-th microphone. Stereo human voice enhancement means performing voice enhancement on these two signals: removing the noise components while retaining the corresponding voice components and spatial information, to obtain the enhanced signals $y_1(n)$ and $y_2(n)$, with the enhancement targets $y_1(n) \approx a_1(n) * s(n)$ and $y_2(n) \approx a_2(n) * s(n)$.
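As a minimal numerical illustration of this signal model (not part of the patent; the impulse responses, noise level and function names below are illustrative assumptions), the two-microphone mixture of equation 1 can be simulated as:

```python
import numpy as np

def simulate_two_mic_mixture(s, a1, a2, noise_std=0.05, seed=0):
    """Simulate x_m(n) = a_m(n) * s(n) + r_m(n), m = 1, 2 (equation 1).

    s        : mono source signal s(n)
    a1, a2   : acoustic impulse responses from the source to mic 1 / mic 2
    noise_std: std of the additive noise r_m(n), assumed white here
    """
    rng = np.random.default_rng(seed)
    x1 = np.convolve(s, a1)[: len(s)] + noise_std * rng.standard_normal(len(s))
    x2 = np.convolve(s, a2)[: len(s)] + noise_std * rng.standard_normal(len(s))
    return x1, x2
```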
Fig. 2 is a schematic flow chart of an audio processing method provided in an embodiment of the present application, as shown in fig. 2, including:
step 210, acquiring an audio signal, where the audio signal includes a first audio sub-signal and a second audio sub-signal collected by different microphones of an electronic device;
specifically, the audio signals are acquired by a plurality of microphones disposed in the electronic device, the microphones may be disposed together to form a microphone array, or may be disposed at different positions of the electronic device, each of the microphones may separately collect the audio sub-signals, for example, the microphone a may collect the first audio sub-signal, and the microphone B may collect the second audio sub-signal.
More specifically, the audio sub-signals collected by the different microphones may each include a speech signal from a human voice sound source and a noise signal from another sound source.
It is understood that, in the embodiment of the present application, after each audio sub-signal $x_m(n)$ is collected at the microphones, framing, windowing and Fourier transformation may further be performed to obtain $X_m(k,l)$, $m = 1, 2$, where $k$ denotes the audio frequency point and $l$ denotes the time frame of the audio signal. The audio signal is finally obtained as $X(k,l) = [X_1(k,l)\ X_2(k,l)]^T$.
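A sketch of this front-end, assuming an STFT with a Hann window (the frame length, hop size and use of NumPy are illustrative assumptions, not specified by the patent):

```python
import numpy as np

def stft_two_channel(x1, x2, frame_len=512, hop=256):
    """Frame, window and Fourier-transform both channels, yielding
    X(k, l) = [X_1(k, l), X_2(k, l)]^T with shape (bins, frames, 2)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x1) - frame_len) // hop
    X = np.empty((frame_len // 2 + 1, n_frames, 2), dtype=complex)
    for l in range(n_frames):
        seg = slice(l * hop, l * hop + frame_len)
        X[:, l, 0] = np.fft.rfft(win * x1[seg])
        X[:, l, 1] = np.fft.rfft(win * x2[seg])
    return X
```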
Step 220, constructing a voice covariance matrix and a noise covariance matrix corresponding to the audio signals according to the existence probability of the voice signals corresponding to each audio frequency point in the audio signals;
Specifically, an audio signal has a plurality of audio frequency points, and each audio frequency point may correspond to a voice signal or a noise signal. It can be understood that the voice signal existence probability vad(l) is the probability that the audio frequency point corresponds to a voice signal.
In the embodiment of the present application, the voice signal existence probability vad(l) corresponding to each audio frequency point may be obtained by a deep learning neural network: the audio signal is input into the network, and the network outputs the voice signal existence probability vad(l) for each audio frequency point of the audio signal.
It can be understood that the voice signal existence probability vad(l) in the embodiment of the present application may also be obtained in other conventional manners; the manner of obtaining vad(l) is not limited here. Moreover, vad(l) does not need to be very accurate and may be a relatively coarse piece of information.
The voice signal existence probability vad(l) in the embodiment of the present application may be used as supervision information to screen, in the expected sense, voice frames from the audio signal so as to construct the covariance matrices, and further to select which covariance matrix is the voice covariance matrix, thereby solving the channel selection problem in the blind source separation algorithm.
Further, a voice covariance matrix $\Phi_{XX}(k,l)$ and a noise covariance matrix $\Phi_{NN}(k,l)$ are respectively constructed based on the voice signal existence probability, specifically according to equations 2 and 3:

$$\Phi_{XX}(k,l) = (1-\alpha)\,\Phi_{XX}(k,l-1) + \alpha\,\mathrm{vad}(l)\,X(k,l)X^H(k,l), \tag{2}$$

$$\Phi_{NN}(k,l) = (1-\alpha)\,\Phi_{NN}(k,l-1) + \alpha\,(1-\mathrm{vad}(l))\,X(k,l)X^H(k,l), \tag{3}$$

where $\alpha$ is a smoothing factor, $k$ is the audio frequency point, $l$ is the time frame, and $\mathrm{vad}(l)$ is the voice signal existence probability.
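A minimal per-bin sketch of this recursive, VAD-weighted covariance update (the array shapes and the smoothing value are illustrative assumptions):

```python
import numpy as np

def update_covariances(Phi_xx, Phi_nn, X_kl, vad_l, alpha=0.05):
    """One-frame update of equations 2 and 3 for a single frequency bin k.

    Phi_xx, Phi_nn : (2, 2) complex voice / noise covariance estimates
    X_kl           : (2,) complex observation X(k, l)
    vad_l          : voice presence probability vad(l) in [0, 1]
    """
    outer = np.outer(X_kl, X_kl.conj())            # X(k,l) X^H(k,l)
    Phi_xx = (1 - alpha) * Phi_xx + alpha * vad_l * outer
    Phi_nn = (1 - alpha) * Phi_nn + alpha * (1 - vad_l) * outer
    return Phi_xx, Phi_nn
```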
Step 230, obtaining a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, and inverting the mixing matrix to determine an unmixing matrix of the audio signal; wherein the mixing matrix comprises a first spatial transfer function corresponding to a voice signal channel in the audio signal and a second spatial transfer function corresponding to a noise signal channel in the audio signal;
in the related art, usually, the unmixing matrix corresponding to the audio signal is directly solved, but the calculation method is complex and has a large calculation amount, and in the embodiment of the present application, the unmixing matrix of the audio signal can be obtained by directly calculating the mixing matrix corresponding to the audio signal and then inverting the mixing matrix.
More specifically, each column vector $a_i(k,l)$ of the mixing matrix described in the embodiments of the present application has a definite physical meaning, namely a spatial transfer function; the columns specifically comprise the first spatial transfer function $a_1(k,l)$ corresponding to the voice signal channel, and the second spatial transfer function $a_2(k,l)$ corresponding to the noise signal channel.
Therefore, it can be understood that, in the present application, the mixing matrix may be obtained by updating the first spatial transfer function and the second spatial transfer function; after both updates are completed, the updated mixing matrix $A(k,l)$ is obtained.
More specifically, in the embodiment of the present application, after the update of the mixing matrix $A(k,l)$ is completed, normalization processing is further performed to avoid the amplitude-uncertainty problem in the blind source separation algorithm. Finally, the normalized mixing matrix is inverted, and the unmixing matrix $W(k,l)$ is obtained according to equation 4:

$$W(k,l) = A^{-1}(k,l). \tag{4}$$
In the embodiment of the present application, it is assumed that the blind source separation algorithm outputs the voice signal channel $Y_1(k,l)$ and the noise signal channel $N_1(k,l)$ according to equation 5:

$$\begin{bmatrix} Y_1(k,l) \\ N_1(k,l) \end{bmatrix} = W(k,l)\,X(k,l). \tag{5}$$
For a stereo input, the unmixing matrix $W(k,l)$ is a $2 \times 2$ matrix that can be decomposed according to equation 6:

$$W(k,l) = [\,w_1(k,l)\ \ w_2(k,l)\,]^H, \tag{6}$$

where $w_i(k,l)$, $i = 1, 2$, is a 2-dimensional column vector.

According to equation 7, the mixing matrix can be decomposed as:

$$A(k,l) = [\,a_1(k,l)\ \ a_2(k,l)\,], \tag{7}$$

where $a_i(k,l)$, $i = 1, 2$, is a 2-dimensional column vector.
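A per-bin sketch of the relationship between $A(k,l)$ and $W(k,l)$ in equations 4, 6 and 7 (using the closed-form 2x2 inverse is an implementation choice, not mandated by the patent):

```python
import numpy as np

def demixing_from_mixing(A):
    """Invert the 2x2 mixing matrix A(k, l) = [a_1 a_2] to obtain
    W(k, l) = A^{-1}(k, l) (equation 4); the rows of W are w_1^H and
    w_2^H in the sense of equation 6."""
    det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
    return np.array([[A[1, 1], -A[0, 1]],
                     [-A[1, 0], A[0, 0]]], dtype=complex) / det
```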
Step 240, respectively outputting a first voice signal, a first noise signal corresponding to the first audio sub-signal, and a second voice signal, a second noise signal corresponding to the second audio sub-signal, according to the unmixing matrix and the audio signal.
In the embodiment of the present application, a first speech signal and a first noise signal corresponding to the first audio sub-signal, as well as a second speech signal and a second noise signal corresponding to the second audio sub-signal, can each be obtained according to the unmixing matrix and the audio signal.
Furthermore, the first voice signal, the first noise signal, the second voice signal and the second noise signal are each subjected to an inverse FFT, windowing and frame synthesis back to the time domain, and the first voice signal and first noise signal corresponding to the first audio sub-signal, and the second voice signal and second noise signal corresponding to the second audio sub-signal, are respectively output.
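A sketch of this synthesis step for one output channel, assuming the same Hann-window STFT parameters as above (overlap-add synthesis is an assumption; the patent only specifies transforming back to the time domain):

```python
import numpy as np

def istft_channel(Y, frame_len=512, hop=256):
    """Inverse-FFT each frame of Y (bins x frames), window, and
    overlap-add back to a time-domain signal."""
    win = np.hanning(frame_len)
    n_frames = Y.shape[1]
    y = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(y)
    for l in range(n_frames):
        seg = slice(l * hop, l * hop + frame_len)
        y[seg] += win * np.fft.irfft(Y[:, l], frame_len)
        norm[seg] += win ** 2
    return y / np.maximum(norm, 1e-12)
```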
In the embodiment of the application, after the audio signal is obtained, the existence probability of the voice signal corresponding to each audio frequency point in the audio signal can be used as supervision information, and the voice covariance matrix and the noise covariance matrix are constructed from this supervision information. The supervision information helps to identify which covariance matrix is the voice covariance matrix, thereby solving the channel selection problem in the blind source separation algorithm. The mixing matrix corresponding to the audio signal is then calculated through the spatial transfer functions, the de-mixing matrix is determined from the mixing matrix, and the first voice signal, the first noise signal, the second voice signal and the second noise signal are respectively output according to the de-mixing matrix and the audio signal. Multiple passes of spatial filtering are not needed, which effectively reduces the computational complexity and improves the robustness of the algorithm.
Optionally, obtaining a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix includes:
updating the first spatial transfer function and the second spatial transfer function according to the voice covariance matrix and the noise covariance matrix to obtain a first target spatial transfer function and a second target spatial transfer function;
respectively carrying out normalization processing on the first target space transfer function and the second target space transfer function according to the first space relative transfer function and the second space relative transfer function to obtain a mixing matrix corresponding to the audio signal;
wherein the first spatial relative transfer function is determined based on a ratio of a third spatial transfer function to a fourth spatial transfer function, and the second spatial relative transfer function is determined based on a ratio of a fifth spatial transfer function to a sixth spatial transfer function; the third spatial transfer function is a spatial transfer function of the speech signal with respect to the first microphone, the fourth spatial transfer function is a spatial transfer function of the speech signal with respect to the second microphone, the fifth spatial transfer function is a spatial transfer function of the noise signal with respect to the first microphone, and the sixth spatial transfer function is a spatial transfer function of the noise signal with respect to the second microphone.
In particular, the first spatial transfer function described in the embodiments of the present application may specifically be the transfer function of the voice signal source with respect to a microphone, and the second spatial transfer function may specifically be the transfer function of the noise signal source with respect to a microphone. Noise may theoretically come from multiple directions, but in the embodiments of the present application it is treated, in the expected sense, as coming from a single directional source.
More specifically, in one embodiment of the present application, both the first spatial transfer function and the second spatial transfer function are updated whenever any audio frequency point in the audio signal is detected.
In other embodiments, to reduce the number of updates and the amount of computation, the first spatial transfer function is updated only if the audio frequency point may correspond to a voice signal, and the second spatial transfer function is updated only if the audio frequency point may correspond to a noise signal.
In the embodiment of the application, after the first spatial transfer function and the second spatial transfer function are updated, the column vectors of the mixing matrix can be further calibrated, due to the amplitude-uncertainty problem in the blind source separation algorithm, to obtain spatial transfer functions with a definite physical meaning, which resolves the amplitude uncertainty of the blind source separation algorithm.
More specifically, the column vector calibration may be performed by normalizing the first spatial transfer function by the first spatial relative transfer function and normalizing the second spatial transfer function by the second spatial relative transfer function, finally obtaining the normalized mixing matrix.
The first spatial relative transfer function described in the embodiments of the present application refers to a transfer coefficient of a voice signal between the first microphone and the second microphone, and the second spatial relative transfer function refers to a transfer coefficient of a noise signal between the first microphone and the second microphone.
It can be understood that, according to equation 8, the normalization may specifically be performed by letting

$$a_1(k,l) = \begin{bmatrix} 1 \\ a_{1rtf}(k,l) \end{bmatrix}, \qquad a_2(k,l) = \begin{bmatrix} a_{2rtf}(k,l) \\ 1 \end{bmatrix}, \tag{8}$$

where, according to equation 9, the first spatial relative transfer function is

$$a_{1rtf}(k,l) = a_{21}(k,l)/a_{11}(k,l), \tag{9}$$

and, according to equation 10, the second spatial relative transfer function is

$$a_{2rtf}(k,l) = a_{12}(k,l)/a_{22}(k,l). \tag{10}$$
More specifically, in the embodiment of the present application, the spatial transfer coefficient of the voice signal source with respect to the first microphone is the third spatial transfer function $a_{11}(k,l)$, the spatial transfer coefficient of the voice signal source with respect to the second microphone is the fourth spatial transfer function $a_{21}(k,l)$, the spatial transfer coefficient of the noise signal source with respect to the first microphone is the fifth spatial transfer function $a_{12}(k,l)$, and the spatial transfer coefficient of the noise signal source with respect to the second microphone is the sixth spatial transfer function $a_{22}(k,l)$.
After traversing all audio frequency points in the audio signal through the updating and normalization cyclic processing, updating of the mixing matrix is completed, and the mixing matrix corresponding to the audio signal is obtained.
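A sketch of this column normalization (equations 8 to 10), assuming $A(k,l)$ is stored as a 2x2 array whose columns are $a_1$ and $a_2$:

```python
import numpy as np

def normalize_mixing_columns(A):
    """Rescale the voice column by a_11 and the noise column by a_22 so
    that A(k, l) = [[1, a_2rtf], [a_1rtf, 1]] (equations 8-10)."""
    a1rtf = A[1, 0] / A[0, 0]   # equation 9: a_21 / a_11
    a2rtf = A[0, 1] / A[1, 1]   # equation 10: a_12 / a_22
    A_norm = np.array([[1.0, a2rtf],
                       [a1rtf, 1.0]], dtype=complex)
    return A_norm, a1rtf, a2rtf
```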
In the embodiment of the application, the calculation process is effectively simplified by updating the first spatial transfer function and the second spatial transfer function and thereby updating and solving the mixing matrix. At the same time, normalizing the first spatial transfer function and the second spatial transfer function by the first spatial relative transfer function and the second spatial relative transfer function of the voice signal and the noise signal between the microphones effectively calibrates the column vectors of the mixing matrix and solves the amplitude-uncertainty problem in the blind source separation algorithm.
Optionally, updating the first spatial transfer function and the second spatial transfer function according to the speech covariance matrix and the noise covariance matrix to obtain a first target spatial transfer function and a second target spatial transfer function, including:
under the condition that a first target audio frequency point is detected in the audio signal, updating the first spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed to obtain a first target spatial transfer function;
under the condition that a second target audio frequency point is detected in the audio signal, updating the second spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed to obtain a second target spatial transfer function;
the first target audio frequency point is an audio frequency point of which the existence probability of voice signals in the audio signals exceeds a first preset threshold, and the second target audio frequency point is an audio frequency point of which the existence probability of noise signals in the audio signals exceeds a second preset threshold.
Specifically, in order to effectively reduce the number of times of updating and reduce the amount of computation, in the embodiment of the present application, the first spatial transfer function is updated only when the speech exists, and correspondingly, the second spatial transfer function may be updated only when the noise exists.
It can be understood that when the existence probability of the voice signal corresponding to the audio frequency point exceeds the first preset threshold, it indicates that the audio frequency point may correspond to the voice signal, and at this time, the first spatial transfer function is updated.
When the existence probability of the noise signal corresponding to the audio frequency point exceeds a second preset threshold, the audio frequency point is possibly corresponding to the noise signal, and at the moment, the second spatial transfer function is updated.
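A sketch of this per-frame gating (the threshold values and the use of 1 − vad(l) as the noise presence probability are illustrative assumptions):

```python
def select_updates(vad_l, thr1=0.7, thr2=0.7):
    """Decide which spatial transfer function to update in this frame:
    the voice column when voice presence is high, the noise column when
    noise presence (1 - vad) is high."""
    update_voice = vad_l > thr1
    update_noise = (1.0 - vad_l) > thr2
    return update_voice, update_noise
```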
More specifically, according to equation 11, the first spatial transfer function is updated in the embodiment of the present application as follows: when $\mathrm{vad}(l) > thr1$,

$$\begin{aligned}
ds_1 &= w_1^H(k,l)\,\Phi_{XX}(k,l)\,w_1(k,l),\\
ds_2 &= w_1^H(k,l)\,\Phi_{NN}(k,l)\,w_1(k,l),\\
us &= w_2^H(k,l)\,\Phi_{NN}(k,l)\,w_1(k,l),\\
v_s &= us/ds_2,
\end{aligned} \tag{11}$$

[equation image in the original: the closed-form update of the first spatial transfer function $a_1(k,l)$ from the quantities above]

where $thr1$ is the first preset threshold, $w_1(k,l)$ is the column vector corresponding to the voice signal channel in the de-mixing matrix, and $ds_1$, $ds_2$, $us$ and $v_s$ are intermediate quantities in the calculation process.
More specifically, according to equation 12, the second spatial transfer function is updated in the embodiment of the present application as follows: when $1 - \mathrm{vad}(l) > thr2$,

$$\begin{aligned}
un &= w_1^H(k,l)\,\Phi_{XX}(k,l)\,e_2(k,l),\\
dn_1 &= w_2^H(k,l)\,\Phi_{XX}(k,l)\,w_2(k,l),\\
dn_2 &= w_2^H(k,l)\,\Phi_{NN}(k,l)\,w_2(k,l),\\
v_n &= un/dn_1,
\end{aligned} \tag{12}$$

[equation image in the original: the closed-form update of the second spatial transfer function $a_2(k,l)$ from the quantities above]

where $un$, $dn_1$, $dn_2$ and $v_n$ are intermediate quantities in the calculation process, $thr2$ is the second preset threshold, and $w_2(k,l)$ is the column vector corresponding to the noise signal channel in the de-mixing matrix.
In the embodiment of the present application, after the traversal of all audio frequency points in an audio signal is completed, the updating of the first spatial transfer function and the second spatial transfer function is completed.
In the embodiment of the application, the first spatial transfer function is updated only when the first target audio frequency point is detected in the audio signal, and the second spatial transfer function is updated only when the second target audio frequency point is detected in the audio signal, so that the number of updates and the amount of computation can be effectively reduced while the quality of the updates is preserved.
Optionally, respectively outputting a first voice signal, a first noise signal corresponding to the first audio sub-signal, and a second voice signal, a second noise signal corresponding to the second audio sub-signal according to the unmixing matrix and the audio signal, including:
acquiring a first voice signal and a first noise signal corresponding to a first audio sub-signal according to the product of the unmixing matrix and the audio signal;
and acquiring a second voice signal and a second noise signal corresponding to the second audio sub-signal based on the first voice signal, the first noise signal, the first spatial relative transfer function and the second spatial relative transfer function.
Specifically, the first voice signal and the first noise signal corresponding to the first audio sub-signal are obtained according to equation 13:

$$\begin{bmatrix} Y_1(k,l) \\ N_1(k,l) \end{bmatrix} = W(k,l)\,X(k,l), \tag{13}$$

where $Y_1(k,l)$ is the first voice signal in the first audio sub-signal collected by the first microphone, and $N_1(k,l)$ is the first noise signal in the first audio sub-signal.
Further, after the first voice signal and the first noise signal are obtained, the relative transfer coefficients between the two microphones may further be combined to determine, according to equation 14, the second voice signal and the second noise signal in the second audio sub-signal collected by the second microphone:

$$Y_2(k,l) = a_{1rtf}(k,l)\,Y_1(k,l), \qquad N_2(k,l) = a_{2rtf}(k,l)\,N_1(k,l), \tag{14}$$

where $Y_2(k,l)$ is the second voice signal in the second audio sub-signal, $N_2(k,l)$ is the second noise signal in the second audio sub-signal, $a_{1rtf}(k,l)$ is the first spatial relative transfer function, and $a_{2rtf}(k,l)$ is the second spatial relative transfer function.
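A per-bin sketch combining equations 13 and 14 into the four-channel separation step (variable names are illustrative):

```python
import numpy as np

def separate_four_channels(W, X_kl, a1rtf, a2rtf):
    """[Y1, N1]^T = W(k,l) X(k,l) (equation 13), then the second channel
    via the spatial relative transfer functions (equation 14)."""
    Y1, N1 = W @ X_kl      # equation 13
    Y2 = a1rtf * Y1        # equation 14: voice at the second microphone
    N2 = a2rtf * N1        # equation 14: noise at the second microphone
    return Y1, N1, Y2, N2
```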
More specifically, after obtaining the first voice signal, the first noise signal, the second voice signal, and the second noise signal, the first voice signal and the second voice signal may be further enhanced.
In the embodiment of the application, the separated first voice signal and first noise signal are obtained by applying the de-mixing matrix to the audio signal, and the separated second voice signal and second noise signal are obtained according to the first spatial relative transfer function and the second spatial relative transfer function. Stereo four-channel output is thereby realized without performing spatial filtering twice, which reduces the complexity of the algorithm.
Optionally, the first spatial relative transfer function is causally constrained, wherein the causal constraint specifically comprises:
transforming the first space relative transfer function to a time domain to obtain a first time domain signal;
and performing truncation processing on the first time domain signal according to a preset time domain range to obtain a constrained first spatial relative transfer function, wherein the preset time domain range is determined based on a finite long impulse response corresponding to the first spatial transfer function.
Specifically, in the embodiment of the present application, only the first spatial relative transfer function of the voice channel is considered, and it is modeled as a finite-length impulse response. According to equation 15, the corresponding time-domain impulse response may be represented as

$$h_T(n) = \mathcal{F}^{-1}\{a_{1rtf}(k,l)\}(n), \qquad n \in [-K_L, K_R]. \tag{15}$$

In order to make the first spatial relative transfer function $a_{1rtf}(k,l)$ satisfy this structure, $a_{1rtf}(k,l)$ may be transformed to the time domain and the resulting time-domain signal truncated, retaining only the range $[-K_L, K_R]$, which causally constrains the relative transfer function. Thus, equation 16 is specifically

$$a_{1rtf}(k,l) \leftarrow \mathcal{F}\{\,h_T(n)\,\mathbb{1}_{[-K_L,\,K_R]}(n)\,\}, \tag{16}$$

that is, the result of applying the finite-length impulse response constraint to $a_{1rtf}(k,l)$ is re-assigned to $a_{1rtf}(k,l)$.
In the embodiment of the application, causally constraining the first spatial relative transfer function of the voice channel can effectively improve the noise suppression effect of the algorithm.
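A sketch of this causal constraint, assuming $a_{1rtf}$ is held as a vector over the full FFT bins so that negative lags wrap to the end of the IFFT buffer (the K_L and K_R values are illustrative):

```python
import numpy as np

def causal_constrain_rtf(a1rtf, K_L=16, K_R=64):
    """Transform a_1rtf to the time domain, keep only lags in [-K_L, K_R]
    (equations 15 and 16), and transform back to the frequency domain."""
    h = np.fft.ifft(a1rtf)        # time-domain relative impulse response
    K = len(h)
    mask = np.zeros(K)
    mask[: K_R + 1] = 1.0         # lags 0 .. K_R (causal part)
    mask[K - K_L:] = 1.0          # lags -K_L .. -1 (wrapped negative lags)
    return np.fft.fft(h * mask)
```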
Optionally, the scheme of the embodiment of the present application is applicable to human voice enhancement scenarios with multi-channel input and multi-channel output, such as recording and spatial audio scenarios. Noisy multi-channel voice can be enhanced by spatial filtering, realizing human voice enhancement for every channel. Compared with existing blind source separation and beamforming methods, the scheme of the present application has lower complexity and is suitable for time-varying acoustic scenes.
In the audio processing method provided by the embodiment of the present application, the execution subject may be an audio processing apparatus. The audio processing apparatus provided in the embodiment of the present application is described below, taking an audio processing apparatus that performs the audio processing method as an example.
Fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, as shown in fig. 3, including: an acquisition module 310, a construction module 320, a processing module 330, and an output module 340; the obtaining module 310 is configured to obtain an audio signal, where the audio signal includes a first audio sub-signal and a second audio sub-signal collected by different microphones of the electronic device; the constructing module 320 is configured to construct a voice covariance matrix and a noise covariance matrix corresponding to the audio signal according to a voice signal existence probability corresponding to each audio frequency point in the audio signal; the processing module 330 is configured to obtain a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, invert the mixing matrix, and determine a demixing matrix of the audio signal; wherein the mixing matrix comprises a first spatial transfer function corresponding to a voice signal channel in the audio signal and a second spatial transfer function corresponding to a noise signal channel in the audio signal; the output module 340 is configured to output a first voice signal, a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal, respectively, according to the unmixing matrix and the audio signal.
Optionally, the processing module is specifically configured to:
updating the first spatial transfer function and the second spatial transfer function according to the voice covariance matrix and the noise covariance matrix to obtain a first target spatial transfer function and a second target spatial transfer function;
respectively carrying out normalization processing on the first target space transfer function and the second target space transfer function according to the first space relative transfer function and the second space relative transfer function to obtain a mixing matrix corresponding to the audio signal;
wherein the first spatial relative transfer function is determined based on a ratio of a third spatial transfer function to a fourth spatial transfer function, and the second spatial relative transfer function is determined based on a ratio of a fifth spatial transfer function to a sixth spatial transfer function; the third spatial transfer function is a spatial transfer function of the speech signal with respect to the first microphone, the fourth spatial transfer function is a spatial transfer function of the speech signal with respect to the second microphone, the fifth spatial transfer function is a spatial transfer function of the noise signal with respect to the first microphone, and the sixth spatial transfer function is a spatial transfer function of the noise signal with respect to the second microphone.
Optionally, the processing module is specifically configured to:
under the condition that a first target audio frequency point is detected in the audio signal, updating the first spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed to obtain a first target spatial transfer function;
under the condition that a second target audio frequency point is detected in the audio signal, updating the second spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed to obtain a second target spatial transfer function;
the first target audio frequency point is an audio frequency point of which the existence probability of voice signals in the audio signals exceeds a first preset threshold, and the second target audio frequency point is an audio frequency point of which the existence probability of noise signals in the audio signals exceeds a second preset threshold.
Optionally, the output module is specifically configured to:
acquiring a first voice signal and a first noise signal corresponding to a first audio sub-signal according to the product of the unmixing matrix and the audio signal;
and acquiring a second voice signal and a second noise signal corresponding to the second audio sub-signal based on the first voice signal, the first noise signal, the first spatial relative transfer function and the second spatial relative transfer function.
Optionally, the first spatial relative transfer function is causally constrained, wherein the causally constraint is specifically:
transforming the first space relative transfer function to a time domain to obtain a first time domain signal;
and performing truncation processing on the first time domain signal according to a preset time domain range to obtain a constrained first spatial relative transfer function, wherein the preset time domain range is determined based on a finite long impulse response corresponding to the first spatial transfer function.
In the embodiment of the application, after the audio signal is obtained, the existence probability of the voice signal corresponding to each audio frequency point in the audio signal can be used as supervision information, a voice covariance matrix and a noise covariance matrix are constructed according to the supervision information, the supervision information can help to select the voice covariance matrix, the problem of channel selection in a blind source separation algorithm can be solved, a mixing matrix corresponding to the audio signal is calculated through a spatial transfer function, a de-mixing matrix is determined according to the mixing matrix, and a first voice signal, a first noise signal, a second voice signal and a second noise signal are output according to the de-mixing matrix and the audio information respectively, multiple spatial filtering is not needed, the operation complexity is effectively reduced, and the algorithm robustness is improved.
The audio processing apparatus in the embodiment of the present application may be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. For example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a Mobile Internet Device (MID), an Augmented Reality (AR)/Virtual Reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), or it may be a server, a Network Attached Storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, and the like; the embodiments of the present application are not specifically limited in this respect.
The audio processing apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system (Android), an iOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
The audio processing apparatus provided in the embodiment of the present application can implement each process implemented by the method embodiments in fig. 1 to fig. 2, and is not described herein again to avoid repetition.
Optionally, fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, and as shown in fig. 4, an electronic device 400 is further provided in an embodiment of the present application and includes a processor 401 and a memory 402, where the memory 402 stores a program or an instruction that can be executed on the processor 401, and when the program or the instruction is executed by the processor 401, the steps of the embodiment of the audio processing method are implemented, and the same technical effects can be achieved, and are not described again to avoid repetition.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 500 includes, but is not limited to: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 510, and the like.
Those skilled in the art will appreciate that the electronic device 500 may further include a power supply (e.g., a battery) for supplying power to various components, and the power supply may be logically connected to the processor 510 via a power management system, so as to implement functions of managing charging, discharging, and power consumption via the power management system. The electronic device structure shown in fig. 5 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
The input unit 504 is configured to acquire an audio signal, where the audio signal includes a first audio sub-signal and a second audio sub-signal collected by different microphones of the electronic device;
the processor 510 is configured to construct a voice covariance matrix and a noise covariance matrix corresponding to the audio signal according to a voice signal existence probability corresponding to each audio frequency point in the audio signal;
the processor 510 is configured to obtain a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, invert the mixing matrix, and determine a de-mixing matrix of the audio signal; wherein the mixing matrix comprises a first spatial transfer function corresponding to a voice signal channel in the audio signal and a second spatial transfer function corresponding to a noise signal channel in the audio signal;
the audio output unit 503 is configured to output a first voice signal and a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal, respectively, according to the de-mixing matrix and the audio signal.
Processor 510 is configured to update the first spatial transfer function and the second spatial transfer function according to the speech covariance matrix and the noise covariance matrix to obtain a first target spatial transfer function and a second target spatial transfer function;
respectively carrying out normalization processing on the first target space transfer function and the second target space transfer function according to the first space relative transfer function and the second space relative transfer function to obtain a mixing matrix corresponding to the audio signal;
wherein the first spatial relative transfer function is determined based on a ratio of a third spatial transfer function to a fourth spatial transfer function, and the second spatial relative transfer function is determined based on a ratio of a fifth spatial transfer function to a sixth spatial transfer function; the third spatial transfer function is a spatial transfer function of the speech signal with respect to the first microphone, the fourth spatial transfer function is a spatial transfer function of the speech signal with respect to the second microphone, the fifth spatial transfer function is a spatial transfer function of the noise signal with respect to the first microphone, and the sixth spatial transfer function is a spatial transfer function of the noise signal with respect to the second microphone.
The processor 510 is configured to, when a first target audio frequency point is detected in the audio signal, update the first spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed to obtain a first target spatial transfer function;
under the condition that a second target audio frequency point is detected in the audio signal, updating the second spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all the audio frequency points in the audio signal are traversed to obtain a second target spatial transfer function;
the first target audio frequency point is an audio frequency point of which the existence probability of voice signals in the audio signals exceeds a first preset threshold, and the second target audio frequency point is an audio frequency point of which the existence probability of noise signals in the audio signals exceeds a second preset threshold.
The processor 510 is configured to obtain a first speech signal and a first noise signal corresponding to the first audio sub-signal according to the product of the de-mixing matrix and the audio signal;
and acquiring a second voice signal and a second noise signal corresponding to the second audio sub-signal based on the first voice signal, the first noise signal, the first spatial relative transfer function and the second spatial relative transfer function.
Processor 510 is configured to transform the first spatial relative transfer function to a time domain to obtain a first time domain signal;
and performing truncation processing on the first time domain signal according to a preset time domain range to obtain a constrained first spatial relative transfer function, wherein the preset time domain range is determined based on a finite long impulse response corresponding to the first spatial transfer function.
In the embodiment of the application, after the audio signal is obtained, the existence probability of the voice signal corresponding to each audio frequency point in the audio signal can be used as supervision information, and the voice covariance matrix and the noise covariance matrix are constructed from this supervision information. The supervision information helps to identify which covariance matrix is the voice covariance matrix, thereby solving the channel selection problem in the blind source separation algorithm. The mixing matrix corresponding to the audio signal is then calculated through the spatial transfer functions, the de-mixing matrix is determined from the mixing matrix, and the first voice signal, the first noise signal, the second voice signal and the second noise signal are output according to the de-mixing matrix and the audio signal. Multiple passes of spatial filtering are not needed, which effectively reduces the computational complexity and improves the robustness of the algorithm.
It should be understood that in the embodiment of the present application, the input Unit 504 may include a Graphics Processing Unit (GPU) 5041 and a microphone 5042, and the Graphics processor 5041 processes image data of still pictures or videos obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 506 may include a display panel 5061, and the display panel 5061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 507 includes at least one of a touch panel 5071 and other input devices 5072. A touch panel 5071, also referred to as a touch screen. The touch panel 5071 may include two parts of a touch detection device and a touch controller. Other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in further detail herein.
The memory 509 may be used to store software programs as well as various data. The memory 509 may mainly include a first storage area storing programs or instructions and a second storage area storing data, wherein the first storage area may store an operating system, and application programs or instructions (such as a sound playing function and an image playing function) required by at least one function, and the like. Further, the memory 509 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synchlink DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). The memory 509 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
Processor 510 may include one or more processing units; optionally, the processor 510 integrates an application processor, which mainly handles operations related to the operating system, user interface, and applications, and a modem processor, which mainly handles wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into processor 510.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the embodiment of the audio processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a computer read only memory ROM, a random access memory RAM, a magnetic or optical disk, and the like.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the above-mentioned audio processing method embodiment, and can achieve the same technical effect, and is not described here again to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as a system-on-chip, or a system-on-chip.
Embodiments of the present application provide a computer program product, where the program product is stored in a storage medium, and the program product is executed by at least one processor to implement the processes of the foregoing audio processing method embodiments, and can achieve the same technical effects, and in order to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may also include performing the functions in a substantially simultaneous manner or in a reverse order depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the embodiments of the present application have been described above with reference to the accompanying drawings, the present application is not limited to the specific implementations described, which are intended to be illustrative rather than restrictive. Those skilled in the art may make various changes and modifications without departing from the scope of the appended claims.

Claims (10)

1. An audio processing method, comprising:
acquiring an audio signal, wherein the audio signal comprises a first audio sub-signal and a second audio sub-signal collected by different microphones of an electronic device;
constructing a voice covariance matrix and a noise covariance matrix corresponding to the audio signal according to a voice signal presence probability at each audio frequency point in the audio signal;
obtaining a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, and inverting the mixing matrix to determine a demixing matrix of the audio signal; wherein the mixing matrix comprises a first spatial transfer function corresponding to a voice signal channel in the audio signal and a second spatial transfer function corresponding to a noise signal channel in the audio signal;
and outputting, according to the demixing matrix and the audio signal, a first voice signal and a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal, respectively.
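For illustration only (the following sketch is not part of the claims), one possible NumPy rendering of the claim-1 pipeline for a single frequency bin is shown below. It assumes the spatial transfer functions are estimated as the principal eigenvectors of the two covariance matrices, which the claim does not prescribe; the names demix_bin, p_speech, and so on are hypothetical.

    import numpy as np

    def demix_bin(X, p_speech, eps=1e-9):
        # X: (2, frames) STFT coefficients of one bin from the two
        # microphones; p_speech: (frames,) voice presence probability.
        w_s = p_speech / (p_speech.sum() + eps)
        w_n = (1.0 - p_speech) / ((1.0 - p_speech).sum() + eps)
        R_s = (w_s * X) @ X.conj().T          # voice covariance matrix
        R_n = (w_n * X) @ X.conj().T          # noise covariance matrix

        # Assumption: take each spatial transfer function as the principal
        # eigenvector of the corresponding covariance matrix.
        a_s = np.linalg.eigh(R_s)[1][:, -1]   # first spatial transfer function
        a_n = np.linalg.eigh(R_n)[1][:, -1]   # second spatial transfer function

        # Mixing matrix formed from the two transfer functions; its
        # inverse is the demixing matrix of the claim.
        A = np.stack([a_s, a_n], axis=1)
        W = np.linalg.inv(A)

        S = W @ X                             # separated components
        return S[0], S[1]                     # voice estimate, noise estimate

Here X holds the two-microphone STFT coefficients of one bin across frames, and p_speech the per-frame voice presence probability for that bin; running the function over all bins yields the full-band separated signals.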
2. The audio processing method of claim 1, wherein obtaining a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix comprises:
updating the first spatial transfer function and the second spatial transfer function according to the voice covariance matrix and the noise covariance matrix to obtain a first target spatial transfer function and a second target spatial transfer function;
normalizing the first target spatial transfer function and the second target spatial transfer function respectively according to a first spatial relative transfer function and a second spatial relative transfer function, to obtain the mixing matrix corresponding to the audio signal;
wherein the first spatial relative transfer function is determined based on a ratio of a third spatial transfer function to a fourth spatial transfer function, and the second spatial relative transfer function is determined based on a ratio of a fifth spatial transfer function to a sixth spatial transfer function; the third spatial transfer function is a spatial transfer function of the voice signal with respect to the first microphone, the fourth spatial transfer function is a spatial transfer function of the voice signal with respect to the second microphone, the fifth spatial transfer function is a spatial transfer function of the noise signal with respect to the second microphone, and the sixth spatial transfer function is a spatial transfer function of the noise signal with respect to the first microphone.
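As a non-authoritative illustration of the normalization step (not part of the claims), one plausible reading is to scale each column of the mixing matrix to a reference microphone, so that the remaining entry carries the corresponding spatial relative transfer function; normalize_mixing and eps are hypothetical names.

    import numpy as np

    def normalize_mixing(a_s, a_n, eps=1e-9):
        # Scaling the voice column to its first-microphone entry yields
        # [1, 1/r1], where r1 = a_s[0]/a_s[1] is the first spatial relative
        # transfer function (third over fourth spatial transfer function);
        # scaling the noise column to its second-microphone entry yields
        # [1/r2, 1], where r2 = a_n[1]/a_n[0] (fifth over sixth).
        a_s_norm = a_s / (a_s[0] + eps)
        a_n_norm = a_n / (a_n[1] + eps)
        return np.stack([a_s_norm, a_n_norm], axis=1)  # normalized mixing matrix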
3. The audio processing method of claim 2, wherein updating the first spatial transfer function and the second spatial transfer function according to the voice covariance matrix and the noise covariance matrix to obtain a first target spatial transfer function and a second target spatial transfer function comprises:
in a case that a first target audio frequency point is detected in the audio signal, updating the first spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed, to obtain the first target spatial transfer function;
in a case that a second target audio frequency point is detected in the audio signal, updating the second spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed, to obtain the second target spatial transfer function;
wherein the first target audio frequency point is an audio frequency point at which the voice signal presence probability in the audio signal exceeds a first preset threshold, and the second target audio frequency point is an audio frequency point at which the noise signal presence probability in the audio signal exceeds a second preset threshold.
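The gated, bin-by-bin update of claim 3 could look like the following sketch (illustrative only; the threshold values, array shapes, and the eigenvector-based update rule are assumptions, not part of the claims).

    import numpy as np

    def update_transfer_functions(R_s, R_n, p_speech, a_s, a_n,
                                  tau_s=0.7, tau_n=0.7):
        # R_s, R_n: (bins, 2, 2) per-bin voice/noise covariance matrices;
        # p_speech: (bins,) voice presence probabilities; a_s, a_n:
        # (bins, 2) current transfer-function estimates; tau_s, tau_n:
        # placeholders for the first and second preset thresholds.
        for k in range(p_speech.shape[0]):     # traverse all frequency points
            if p_speech[k] > tau_s:            # first target audio frequency point
                a_s[k] = np.linalg.eigh(R_s[k])[1][:, -1]
            if (1.0 - p_speech[k]) > tau_n:    # second target audio frequency point
                a_n[k] = np.linalg.eigh(R_n[k])[1][:, -1]
        return a_s, a_n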
4. The audio processing method according to claim 2, wherein outputting, according to the demixing matrix and the audio signal, a first voice signal and a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal respectively comprises:
acquiring the first voice signal and the first noise signal corresponding to the first audio sub-signal according to a product of the demixing matrix and the audio signal;
and acquiring the second voice signal and the second noise signal corresponding to the second audio sub-signal based on the first voice signal, the first noise signal, the first spatial relative transfer function, and the second spatial relative transfer function.
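One plausible, non-authoritative reading of claim 4 in code: the demixing matrix recovers the first-microphone components, and the two spatial relative transfer functions then map them to the second microphone. The mapping direction below follows the ratio definitions of claim 2; separate_and_map is a hypothetical name.

    import numpy as np

    def separate_and_map(W, X, r1, r2):
        # W: (2, 2) demixing matrix for one bin; X: (2, frames) STFT of the
        # two microphones; r1 = h_s1/h_s2 and r2 = h_n2/h_n1 are the first
        # and second spatial relative transfer functions of claim 2.
        s1, n1 = W @ X        # first voice and first noise signals (mic 1)
        s2 = s1 / r1          # voice at mic 2: multiply by h_s2/h_s1
        n2 = n1 * r2          # noise at mic 2: multiply by h_n2/h_n1
        return s1, n1, s2, n2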
5. The audio processing method according to claim 2, wherein the first spatial relative transfer function is causally constrained, and the causal constraint comprises:
transforming the first spatial relative transfer function to the time domain to obtain a first time-domain signal;
and truncating the first time-domain signal according to a preset time-domain range to obtain a constrained first spatial relative transfer function, wherein the preset time-domain range is determined based on a finite-length impulse response corresponding to the first spatial transfer function.
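The causal constraint of claim 5 amounts to a time-domain truncation. A minimal sketch, assuming the relative transfer function is given as a full-length FFT spectrum and max_taps stands in for the preset time-domain range:

    import numpy as np

    def causal_constrain(rtf, max_taps):
        # rtf: (n_fft,) spectrum of the first spatial relative transfer
        # function; max_taps: length of the assumed finite impulse response.
        h = np.fft.ifft(rtf)        # first time-domain signal
        h[max_taps:] = 0.0          # truncate outside the preset range
        return np.fft.fft(h)        # constrained relative transfer function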
6. An audio processing apparatus, comprising:
an acquisition module, configured to acquire an audio signal, wherein the audio signal comprises a first audio sub-signal and a second audio sub-signal collected by different microphones of an electronic device;
a construction module, configured to construct a voice covariance matrix and a noise covariance matrix corresponding to the audio signal according to a voice signal presence probability at each audio frequency point in the audio signal;
a processing module, configured to obtain a mixing matrix corresponding to the audio signal according to the voice covariance matrix and the noise covariance matrix, and invert the mixing matrix to determine a demixing matrix of the audio signal; wherein the mixing matrix comprises a first spatial transfer function corresponding to a voice signal channel in the audio signal and a second spatial transfer function corresponding to a noise signal channel in the audio signal;
and an output module, configured to output, according to the demixing matrix and the audio signal, a first voice signal and a first noise signal corresponding to the first audio sub-signal, and a second voice signal and a second noise signal corresponding to the second audio sub-signal, respectively.
7. The audio processing device according to claim 6, wherein the processing module is specifically configured to:
updating the first spatial transfer function and the second spatial transfer function according to the voice covariance matrix and the noise covariance matrix to obtain a first target spatial transfer function and a second target spatial transfer function;
normalizing the first target spatial transfer function and the second target spatial transfer function respectively according to a first spatial relative transfer function and a second spatial relative transfer function, to obtain the mixing matrix corresponding to the audio signal;
wherein the first spatial relative transfer function is determined based on a ratio of a third spatial transfer function to a fourth spatial transfer function, and the second spatial relative transfer function is determined based on a ratio of a fifth spatial transfer function to a sixth spatial transfer function; the third spatial transfer function is a spatial transfer function of the voice signal with respect to the first microphone, the fourth spatial transfer function is a spatial transfer function of the voice signal with respect to the second microphone, the fifth spatial transfer function is a spatial transfer function of the noise signal with respect to the second microphone, and the sixth spatial transfer function is a spatial transfer function of the noise signal with respect to the first microphone.
8. The audio processing device according to claim 7, wherein the processing module is specifically configured to:
in a case that a first target audio frequency point is detected in the audio signal, update the first spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed, to obtain the first target spatial transfer function;
in a case that a second target audio frequency point is detected in the audio signal, update the second spatial transfer function based on the voice covariance matrix and the noise covariance matrix until all audio frequency points in the audio signal are traversed, to obtain the second target spatial transfer function;
wherein the first target audio frequency point is an audio frequency point at which the voice signal presence probability in the audio signal exceeds a first preset threshold, and the second target audio frequency point is an audio frequency point at which the noise signal presence probability in the audio signal exceeds a second preset threshold.
9. The audio processing apparatus according to claim 7, wherein the output module is specifically configured to:
acquire the first voice signal and the first noise signal corresponding to the first audio sub-signal according to a product of the demixing matrix and the audio signal;
and acquire the second voice signal and the second noise signal corresponding to the second audio sub-signal based on the first voice signal, the first noise signal, the first spatial relative transfer function, and the second spatial relative transfer function.
10. The audio processing device according to claim 7, wherein the first spatial relative transfer function is causally constrained, and the causal constraint comprises:
transforming the first spatial relative transfer function to the time domain to obtain a first time-domain signal;
and truncating the first time-domain signal according to a preset time-domain range to obtain a constrained first spatial relative transfer function, wherein the preset time-domain range is determined based on a finite-length impulse response corresponding to the first spatial transfer function.
CN202211436870.4A 2022-11-16 2022-11-16 Audio processing method and device Pending CN115862651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211436870.4A CN115862651A (en) 2022-11-16 2022-11-16 Audio processing method and device

Publications (1)

Publication Number Publication Date
CN115862651A true CN115862651A (en) 2023-03-28

Family

ID=85663830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211436870.4A Pending CN115862651A (en) 2022-11-16 2022-11-16 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN115862651A (en)

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
US12039995B2 (en) Audio signal processing method and apparatus, electronic device, and storage medium
US20220172737A1 (en) Speech signal processing method and speech separation method
JP2021519051A (en) Video repair methods and equipment, electronics, and storage media
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN107666638B (en) A kind of method and terminal device for estimating tape-delayed
CN111986691B (en) Audio processing method, device, computer equipment and storage medium
EP4254408A1 (en) Speech processing method and apparatus, and apparatus for processing speech
CN110188865B (en) Information processing method and device, electronic equipment and storage medium
US20240046955A1 (en) Voice extraction method and apparatus, and electronic device
CN111009257A (en) Audio signal processing method and device, terminal and storage medium
CN111091845A (en) Audio processing method and device, terminal equipment and computer storage medium
CN111179960A (en) Audio signal processing method and device and storage medium
US20240205634A1 (en) Audio signal playing method and apparatus, and electronic device
CN113763933A (en) Speech recognition method, and training method, device and equipment of speech recognition model
WO2022228067A1 (en) Speech processing method and apparatus, and electronic device
CN112752118B (en) Video generation method, device, equipment and storage medium
CN111615045B (en) Audio processing method, device, equipment and storage medium
US9351093B2 (en) Multichannel sound source identification and location
CN116208704A (en) Sound processing method and device
CN110069195B (en) Image dragging deformation method and device
CN112000251A (en) Method, apparatus, electronic device and computer readable medium for playing video
CN112133319A (en) Audio generation method, device, equipment and storage medium
WO2022222922A1 (en) Voice signal processing method and apparatus
CN114550728B (en) Method, device and electronic equipment for marking speaker

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination