CN114333876B - Signal processing method and device - Google Patents


Info

Publication number: CN114333876B
Application number: CN202111415175.5A
Authority: CN (China)
Legal status: Active (granted)
Prior art keywords: signal, matrix, sound source, unmixed, mixing matrix
Other versions: CN114333876A
Original language: Chinese (zh)
Inventors: 陈日林, 张兆奇
Assignee: Tencent Technology Shenzhen Co Ltd

Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202111415175.5A; published as CN114333876A; application granted and published as CN114333876B.


Abstract

The present application provides a signal processing method and device, which obtain an unmixed matrix from a mixing matrix containing relative transfer functions between microphones, thereby reducing the influence of reverberation on signal separation and improving signal separation performance. In the method, a first mixing matrix comprising relative transfer functions between microphones and a reverberated speech signal are obtained from the observed signal; an unmixed matrix of the observed signal is then obtained from the first mixing matrix and the reverberated speech signal; finally, a separated signal is obtained according to the unmixed matrix. The embodiments of the present application can be used in the field of audio processing, for example for enhancement of front-end speech signals.

Description

Signal processing method and device
Technical Field
The present application relates to the field of audio processing, and more particularly, to a method and apparatus for signal processing.
Background
The cocktail party effect reveals the masking effect of the human ear, i.e. the natural ability to extract a desired sound source from a complex, noisy auditory scene (an acoustic scene in which multiple sound sources are present at the same time). As voice interaction technology matures, target speech signals can be extracted by blind source separation. Blind source separation (Blind Source Separation, BSS) refers to the process of separating the source signals from a mixed signal (i.e., the observed signal) without knowledge of the source signals or the signal mixing system (transmission channel).
Independent vector analysis (Independent Vector Analysis, IVA) is a common blind source separation method: the received observed signal is decomposed into several statistically independent components, and these independent components are used as approximate estimates of the source signals. However, in existing IVA-based blind source separation methods, the mixing matrix is assumed to consist of room transfer functions, which makes separation performance dependent on the room reverberation conditions.
Disclosure of Invention
The embodiments of the present application provide a signal processing method and device, which can reduce the influence of reverberation on signal separation by obtaining an unmixed matrix from a mixing matrix containing relative transfer functions between microphones, thereby improving signal separation performance.
In a first aspect, a method of signal processing is provided, comprising:
acquiring an observation signal, wherein the observation signal comprises original sound source signals of at least two sources acquired through at least two microphones;
determining a first mixing matrix H and a reverberated speech signal ŝ from the observed signal, wherein the first mixing matrix H comprises first relative transfer functions between the at least two microphones, and the first mixing matrix H is used for representing the mapping relation between the observed signal and the reverberated speech signal ŝ;
inputting the first mixing matrix H and the reverberated speech signal ŝ into a signal processing model to obtain an unmixed matrix W of the observed signal, wherein the signal processing model is used for representing the mapping relation between the first mixing matrix H, the reverberated speech signal ŝ and the unmixed matrix W;
and obtaining a separated signal according to the unmixed matrix W and the observed signal.
In a second aspect, there is provided an apparatus for signal processing, comprising:
an acquisition unit configured to acquire an observed signal, wherein the observed signal includes original sound source signals of at least two sources acquired through at least two microphones;
a processing unit configured to determine, based on the observed signal, a first mixing matrix H and a reverberated speech signal ŝ, wherein the first mixing matrix H comprises first relative transfer functions between the at least two microphones, and the first mixing matrix H is used for representing the mapping relation between the observed signal and the reverberated speech signal ŝ;
the processing unit is further configured to input the first mixing matrix H and the reverberated speech signal ŝ into a signal processing model to obtain an unmixed matrix W of the observed signal, wherein the signal processing model is used for representing the mapping relation between the first mixing matrix H, the reverberated speech signal ŝ and the unmixed matrix W;
the processing unit is further configured to obtain a separated signal according to the unmixed matrix W and the observed signal.
In a third aspect, an electronic device is provided, comprising: a processor and a memory; the memory is used for storing a computer program; the processor is configured to execute the computer program to implement the method according to the first aspect.
In a fourth aspect, a chip is provided, comprising: a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method according to the first aspect.
In a fifth aspect, there is provided a computer readable storage medium comprising computer instructions which, when executed by a computer, cause the computer to implement the method of the first aspect.
In a sixth aspect, there is provided a computer program product comprising computer program instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
According to the method, a first mixing matrix comprising relative transfer functions between microphones and a reverberated speech signal are obtained from the observed signal; an unmixed matrix of the observed signal is then obtained from the first mixing matrix and the reverberated speech signal; finally, a separated signal is obtained from the observed signal according to the unmixed matrix. Since the first mixing matrix contains relative transfer functions between microphones rather than room transfer functions, and the relative transfer functions between microphones do not contain reverberation, obtaining the unmixed matrix from the first mixing matrix can reduce the influence of reverberation on signal separation, thereby improving signal separation performance.
Drawings
FIG. 1 is a schematic diagram of an application scenario suitable for use in embodiments of the present application;
FIG. 2 is a schematic diagram of a speech recognition system suitable for use in embodiments of the present application;
FIG. 3 is a schematic flow chart of a method of signal processing provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart of another method of signal processing provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart of another method of signal processing provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of another method of signal processing provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart of another method of signal processing provided by an embodiment of the present application;
FIG. 8 is a schematic flow chart of another method of signal processing provided by an embodiment of the present application;
FIG. 9 is a schematic diagram comparing the effect of the method of signal processing provided in the embodiments of the present application with the solution of sound source separation in the prior art;
FIG. 10 is an alternative schematic block diagram of an apparatus for signal processing of an embodiment of the present application;
fig. 11 is another alternative schematic block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without creative effort fall within the scope of the present application.
It should be understood that in the embodiments of the present application, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
In the description of the present application, unless otherwise indicated, "at least one" means one or more, and "a plurality" means two or more. In addition, "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or plural.
It should be further understood that the description of the first, second, etc. in the embodiments of the present application is for purposes of illustration and distinction only, and does not represent a specific limitation on the number of devices in the embodiments of the present application, and should not constitute any limitation on the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the application provides a signal processing scheme, which can enhance front-end voice signals, such as enhancing expected signals, suppressing interference signals and the like, and can be applied to various fields, such as intelligent home, video conference, intelligent traffic, auxiliary driving and the like, without limitation.
The application scenarios to which the technical solutions of the embodiments of the present application may be applied are briefly described below. It should be noted that the application scenarios described below are only for illustrating the embodiments of the present application and are not limiting. In a specific implementation, the technical solutions provided by the embodiments of the present application may be flexibly applied according to actual needs.
Fig. 1 is a schematic diagram of an application scenario suitable for use in embodiments of the present application. As shown in fig. 1, the application scenario may include a user terminal, which may be, for example, a mobile phone, an intelligent voice interaction device (such as a wearable device such as a smart watch, a smart glasses, etc.), a vehicle-mounted terminal, and an intelligent home appliance (such as a smart speaker, a coffee machine, a printer, etc.). Optionally, the application scenario may further include a computing device, which may be, for example, a cloud server, an intelligent portable device, or a home computing hub, which is not limited in this application. The smart portable device may be a smart phone, a computer, or the like, and the home computing hub may be a smart phone, a computer, a smart television, a router, or the like, without limitation. The user terminal and the computing device may be connected through a wireless network, or may be connected through a bluetooth pairing connection, which is not limited in the embodiments of the present application.
It should be noted that the user terminal in fig. 1 is merely illustrative, and the user terminal applicable to the present application is not limited thereto, and may be, for example, an electronic device in an internet of things (internet of things, ioT) system, or the like. In addition, the computing device in fig. 1 is merely illustrative, and the computing device to which the present application is applicable is not limited thereto, and may be, for example, a mobile internet device or the like. It should also be noted that the plurality of electronic devices shown in the embodiments of the present application are for better and more comprehensive description of the embodiments of the present application, but should not be construed as limiting the embodiments of the present application in any way.
As a specific example, when the system architecture shown in fig. 1 is applied to a home use scenario, the user terminal may be an end-side device such as a smart speaker, a smart home, etc., and the computing device may be a home computing hub, such as a mobile phone, a television, a router, etc., or may be a cloud device, such as a cloud server, etc., which is not limited in this embodiment of the present application.
As another specific example, when the system architecture shown in fig. 1 is applied to a personal wearable scenario, the user terminal may be, for example, a personal wearable device, such as a smart bracelet, a smart watch, a smart earphone, a smart glasses, etc., and the computing device may be a portable device, such as a mobile phone, etc., which is not limited in this embodiment of the present application.
In some embodiments, the signal processing method provided in the embodiments of the present application may be implemented by a user terminal. For example, after obtaining the observed signal, the user terminal may obtain an unmixed matrix according to the signal processing method provided in the embodiments of the present application, and obtain the separated signal according to the unmixed matrix.
In other embodiments, the signal processing method provided in the embodiments of the present application may be implemented cooperatively by a user terminal and a computing device. For example, after obtaining the observed signal, the user terminal may send the observed signal to the computing device; the computing device obtains an unmixed matrix according to the signal processing method provided by the embodiments of the present application and sends the unmixed matrix to the user terminal, and the user terminal obtains the separated signal according to the unmixed matrix. As another example, the computing device may obtain the unmixed matrix according to the signal processing method provided in the embodiments of the present application, obtain the separated signal according to the unmixed matrix, and then send the separated signal to the user terminal.
FIG. 2 is a schematic diagram of a speech recognition system suitable for use in embodiments of the present application. As shown in fig. 2, a front-end signal processing module 201 may be disposed before the speech recognition system 202. The target speech and the interfering speech are received by one or more sound pickup devices (for example, microphones), and the observed signals output by the microphones are input to the front-end signal processing module 201, where an enhanced, clean target speech signal (i.e., the separated signal) may be obtained after passing through echo cancellation, dereverberation, sound source separation (which may also be referred to as blind source separation), post-processing, and the like; the target speech signal may then be input to the speech recognition system 202 for speech recognition. The signal processing scheme provided by the embodiments of the present application can be applied in the sound source separation module, which obtains the unmixed matrix and performs signal separation on the observed signal to obtain the target speech signal.
The front-end signal processing module 201 in fig. 2 may be on the user terminal in fig. 1 or on the computing device in fig. 1, which is not limited in this application.
In the following, related terms related to embodiments of the present application are described.
1) Mixing matrix: the mapping relationship (such as the linear combination relationship of the frequency domain in the complex domain) between the observed signal and the original sound source signal is characterized. The mixing matrix may be a matrix of room transfer functions (Room Transfer Function, RTF) from each sound source to each microphone.
2) Unmixing matrix: the inverse of the mixing matrix, i.e. the target matrix to be solved, which characterizes the mapping relationship between the target speech signal and the observed signal (e.g. a frequency-domain linear combination relationship in the complex domain). The unmixed matrix may also be referred to as a separation matrix; the two terms have the same meaning.
3) Room transfer function: a function characterizing the frequency-domain propagation characteristics of sound from a sound source to a sound pickup device, such as a microphone.
4) Relative transfer function between sound pickup devices: a function characterizing the frequency-domain propagation characteristics of sound from one sound pickup device to another. When the sound pickup devices are microphones, the relative transfer function between sound pickup devices may be referred to as the relative transfer function between microphones.
At present, IVA-based blind source separation establishes a source signal model according to the mixing matrix to obtain an objective function, iteratively optimizes the objective function, and solves for the separation matrix until the model converges, thereby obtaining the estimated source signals. In this scheme, the mixing matrix is assumed to consist of room transfer functions, which makes the separation performance of the speech signal dependent on the room reverberation conditions, so dereverberation preprocessing must be performed in advance, increasing the complexity of the sound source separation algorithm. Second, this approach has difficulty estimating the variance of the source signals and requires pre-whitening of the observed signal, making real-time operation in a product difficult. Finally, the scheme uses the natural gradient method for parameter optimization, whose separation performance is limited by the step-size parameter; although many adaptive step-size techniques have been proposed, the gradient descent algorithm still incurs a large computational cost.
In view of the above problems, the embodiments of the present application provide a signal processing method, which transforms the mixing matrix into a mixing matrix containing relative transfer functions between microphones instead of room transfer functions. Since the relative transfer functions between microphones do not contain reverberation, obtaining the unmixed matrix from this mixing matrix reduces the influence of reverberation on signal separation, thereby improving signal separation performance.
Furthermore, according to the embodiments of the present application, a first parameter can be constructed from the above mixing matrix and the reverberated speech signal, and the unmixed matrix is determined from the mapping relation between the first parameter and the unmixed matrix. Estimating a speech signal model can thus be avoided during signal separation, no pre-whitening of the observed signal is needed, and parameter optimization by the natural gradient method is avoided, so the separation process is not constrained by a step-size parameter and the computational cost can be effectively reduced.
The technical solutions provided by the embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 3 shows a schematic flow chart of a method 300 of signal processing provided in an embodiment of the present application. The method 300 may be used for blind source separation, for example, may be applied to the application scenario shown in fig. 1, or may be applied to the speech recognition system shown in fig. 2, without limitation. As shown in fig. 3, method 300 includes steps 310 through 340.
An observation signal is acquired 310, wherein the observation signal comprises raw sound source signals of at least two sources acquired by at least two microphones.
For example, the user terminal may obtain the aforementioned observed signal via one or more sound pickup devices, such as microphones. The observed signal may contain speech signals from a plurality of sound sources, which may include the target speech signal, i.e. the speech signal from the desired sound source. The observed signal may also contain interfering speech signals, i.e. speech signals from undesired sound sources. In addition, the transmission channel or mixing system of the observed signal is unknown.
In some embodiments, a short-time Fourier transform (Short-Time Fourier Transform, STFT) may be applied to the observed signal to obtain the following equation (1):

x(f,t) = A_f s(f,t)    (1)

where x(f,t) denotes the observed signal at frequency bin f and time frame t, A_f denotes the mixing matrix at frequency bin f (i.e. an example of the second mixing matrix A), s(f,t) denotes the original sound source signals of the at least two sources at frequency bin f and time frame t, f is the frequency of the signal, and t is the time of the signal.
In the following description, a dual microphone, dual sound source scenario is taken as an example to describe the solution provided in the embodiments of the present application. It will be appreciated that the process may be extended to multi-microphone, multi-sound source situations, and specific reference may be made to the description of a dual-microphone, dual-sound source process, which may require some simple adaptations, which are within the scope of embodiments of the present application.
For example, in a dual-microphone, dual-source scenario, the observed signal x(f,t) may be expressed as:

x(f,t) = [x_1(f,t), x_2(f,t)]^T

and the original sound source signal s(f,t) may be expressed as:

s(f,t) = [s_1(f,t), s_2(f,t)]^T

The mixing matrix A_f generally consists of room transfer functions and can be expressed as:

A_f = [a_11(f), a_12(f); a_21(f), a_22(f)]    (2)

As can be seen from equation (2), A_f includes 4 parameters: a_11(f), a_12(f), a_21(f), a_22(f).

In the signal processing method 300 provided in the embodiments of the present application, the unmixed matrix W_f needs to be estimated such that:

y(f,t) = W_f x(f,t)    (3)

where y(f,t) denotes the estimated separated signal, which may also be referred to as the target speech signal, and should be as consistent as possible with s(f,t). In the dual-microphone, dual-source scenario, y(f,t) = [y_1(f,t), y_2(f,t)]^T.
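The per-bin mixing and separation relations of equations (1) and (3) can be sketched numerically. The matrix values below are hypothetical, and the ideal unmixing matrix is taken as the exact inverse of the mixing matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
# original sound source signals s(f, t) at one frequency bin f, over T frames
s = rng.standard_normal((2, T)) + 1j * rng.standard_normal((2, T))
# hypothetical 2x2 mixing matrix A_f (room transfer functions at bin f)
A_f = np.array([[1.0 + 0.2j, 0.6 - 0.1j],
                [0.5 + 0.4j, 1.1 + 0.3j]])
x = A_f @ s               # observed signal, equation (1)
W_f = np.linalg.inv(A_f)  # ideal unmixing matrix, for illustration only
y = W_f @ x               # separated signal, equation (3)
```

With the exact inverse, y recovers s up to floating-point error; the method below is concerned with estimating W_f when A_f is unknown.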
320, determining a first mixing matrix H and a reverberated speech signal ŝ from the observed signal, wherein the first mixing matrix H comprises first relative transfer functions between the at least two microphones. The first mixing matrix H is used for representing the mapping relation between the observed signal and the reverberated speech signal ŝ.
Illustratively, in step 320, equation (1) above may be transformed to obtain:

x(f,t) = H_f ŝ(f,t)    (4)

where H_f denotes the mixing matrix at frequency bin f, containing the relative transfer functions between the microphones, and ŝ(f,t) denotes the reverberated speech signal at frequency bin f and time frame t.
In some alternative embodiments, referring to FIG. 4, the reverberated speech signal ŝ may be determined according to the following steps 321 and 322:

321, determining a mapping relation between the second mixing matrix A and the first mixing matrix H.

322, determining the reverberated speech signal ŝ according to the mapping relation and the original sound source signals of the at least two sources.
Taking the dual-microphone, dual-source scenario as an example, the mixing matrix A_f in equation (1) can be transformed as follows:

A_f = [1, 1; h_1(f), h_2(f)] · diag(a_11(f), a_12(f))    (5)

where h_1(f) = a_21(f)/a_11(f) and h_2(f) = a_22(f)/a_12(f) are the relative transfer functions between the microphones, and together they constitute a new mixing matrix H_f = [1, 1; h_1(f), h_2(f)], i.e. an example of the first mixing matrix H. H_f includes 2 parameters, h_1(f) and h_2(f).

Further, substituting equation (5) into equation (1) yields:

x(f,t) = [1, 1; h_1(f), h_2(f)] [a_11(f) s_1(f,t); a_12(f) s_2(f,t)] = H_f ŝ(f,t)    (6)

From equation (6), the reverberated speech signal is ŝ(f,t) = [a_11(f) s_1(f,t), a_12(f) s_2(f,t)]^T.

In equation (6), the speech signal to be recovered changes from the original source signal s(f,t) to the reverberated speech signal ŝ(f,t), while the mixing matrix changes from A_f, consisting of room transfer functions, to H_f, consisting of relative transfer functions between microphones. Since the reverberation contained in the room transfer functions is transferred into the reverberated speech signal ŝ(f,t), the relative transfer functions between microphones do not contain reverberation.
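The factorization of equations (5) and (6) can be checked numerically. The room transfer function values below are hypothetical; the check verifies that the relative-transfer-function mixing matrix applied to the reverberated sources reproduces the original mixture:

```python
import numpy as np

# hypothetical room-transfer-function mixing matrix A_f at one frequency bin
A_f = np.array([[1.0 + 0.2j, 0.6 - 0.1j],
                [0.5 + 0.4j, 1.1 + 0.3j]])
# relative transfer functions between the two microphones, equation (5)
h1 = A_f[1, 0] / A_f[0, 0]
h2 = A_f[1, 1] / A_f[0, 1]
H_f = np.array([[1.0, 1.0],
                [h1, h2]])

rng = np.random.default_rng(2)
s = rng.standard_normal((2, 8)) + 1j * rng.standard_normal((2, 8))
# reverberated speech signal: each source scaled by its transfer function
# to the first microphone, per equation (6)
s_hat = np.diag([A_f[0, 0], A_f[0, 1]]) @ s
```

The identity H_f ŝ = A_f s holds exactly, since the per-source scale factors absorbed into ŝ are precisely those divided out of A_f.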
330, inputting the first mixing matrix H and the reverberated speech signal ŝ into a signal processing model to obtain the unmixed matrix W of the observed signal, wherein the signal processing model is used for representing the mapping relation between the first mixing matrix H, the reverberated speech signal ŝ and the unmixed matrix W.

That is, the signal processing model may obtain the unmixed matrix W of the observed signal from the input first mixing matrix H and reverberated speech signal ŝ, according to the mapping relation between the first mixing matrix H, the reverberated speech signal ŝ, and the unmixed matrix W.
In some alternative embodiments, referring to fig. 5, the unmixed matrix W of the observed signal may be determined according to steps 331 and 332.
331, determining a first parameter according to the first mixing matrix H, the reverberated speech signal ŝ, and the unmixed matrix W.
332, obtaining the unmixed matrix W according to the mapping relation between the first parameter and the unmixed matrix W.
In some embodiments, the first parameter described above may be constructed explicitly. As one possible implementation, referring to fig. 6, the first parameter may be determined according to the following steps 333 and 334:

333, determining a second parameter according to the first mixing matrix H and the reverberated speech signal ŝ.
334, determining the first parameter based on the second parameter and the unmixed matrix W.
The second parameter may be expressed as V_f(t). In the embodiments of the present application, the first parameter may be defined as

u_k(f,t) = V_f(t) w_k(f,t-1)^H / (w_k(f,t-1) V_f(t) w_k(f,t-1)^H)

and the second parameter as V_f(t) = E[x(f,t) x(f,t)^H], where E[·] represents the data expectation, w_k(f,t) denotes the k-th row of the unmixed matrix W_f(t), and different values of k correspond to different sound sources.

For the dual-microphone, dual-source scenario, because ŝ_1(f,t) and ŝ_2(f,t) are independent of each other, substituting equation (6) into the second parameter V_f(t) gives:

V_f(t) = H_f diag(λ_1(f), λ_2(f)) H_f^H    (7)

where λ_k(f) = E[|ŝ_k(f,t)|^2]. In addition, the unmixed matrix W_f and the mixing matrix H_f are mutually inverse matrices, satisfying W_f H_f = I, i.e.:

w_k(f,t) H_f(t) = e_k^T    (8)

where e_k is the k-th standard basis vector. In the embodiments of the present application, it can be considered that w_k(f,t-1) still approximately satisfies equation (8), where k takes the value 1 or 2 in the dual-source scenario, the two values corresponding to the two sound sources, and (t-1) denotes the time frame preceding time frame t.

For equation (7), right-multiplying each term on both sides by w_1(f,t-1)^H gives:

V_f(t) w_1(f,t-1)^H = H_f diag(λ_1(f), λ_2(f)) H_f^H w_1(f,t-1)^H    (9)

Since w_1(f,t-1) satisfies equation (8), i.e. H_f^H (w_1(f,t-1))^H ≈ e_1, equation (9) becomes:

V_f(t) w_1(f,t-1)^H ≈ λ_1(f) [1, h_1(f)]^T    (10)

where [1, h_1(f)]^T in equation (10) is the first column of H_f.

Similarly, for equation (7), right-multiplying each term on both sides by w_2(f,t-1)^H gives:

V_f(t) w_2(f,t-1)^H ≈ λ_2(f) [1, h_2(f)]^T    (11)

where [1, h_2(f)]^T in equation (11) is the second column of H_f.

As a specific implementation, the first parameters (e.g. u_1(f,t) and u_2(f,t)) are used to determine the mapping relation between the first parameter and the unmixed matrix W.

That is, it is possible to let:

u_k(f,t) = V_f(t) w_k(f,t-1)^H / (w_k(f,t-1) V_f(t) w_k(f,t-1)^H), k = 1, 2    (12)

According to equation (12) together with equations (10) and (11), u_1(f,t) and u_2(f,t) approximate the two columns of H_f, so the unmixed matrix can be expressed, for example, as:

W_f(t) = [u_1(f,t), u_2(f,t)]^{-1}    (13)

In some alternative embodiments, the scale of the unmixed matrix W may also be fixed according to the minimal distortion principle. Illustratively, the unmixed matrix W may be normalized according to the following equation (14):

W_f(t) = diag(diag((W_f(t))^{-1})) W_f(t)    (14)

In summary, as a possible implementation of step 330, the second parameter V_f(t) may first be determined according to the first mixing matrix H and the reverberated speech signal ŝ; the first parameters u_1(f,t) and u_2(f,t) are then determined according to the second parameter V_f(t) and the unmixed matrix W; finally, the unmixed matrix W is obtained according to the mapping relation between the first parameters and the unmixed matrix W, for example equations (13) and (14).
And 340, obtaining a separation signal according to the unmixed matrix W and the observation signal.
For example, the unmixed matrix W_f may be substituted, together with the observed signal x(f,t), into equation (3) above to obtain the separated signal y(f,t), i.e. the target speech signal.
Therefore, according to the embodiments of the present application, a first mixing matrix comprising relative transfer functions between microphones and a reverberated speech signal are obtained from the observed signal; an unmixed matrix of the observed signal is then obtained from the first mixing matrix and the reverberated speech signal; finally, a separated signal is obtained according to the unmixed matrix. Since the first mixing matrix contains relative transfer functions between microphones rather than room transfer functions, and the relative transfer functions between microphones do not contain reverberation, obtaining the unmixed matrix from the first mixing matrix can reduce the influence of reverberation on signal separation, thereby improving signal separation performance.
Further, according to the embodiment of the application, the first parameter can be constructed from the first mixing matrix and the reverberated speech signal, and the unmixing matrix determined according to the mapping relationship between the first parameter and the unmixing matrix. Estimation of a speech signal model can thus be avoided in the signal separation process, no pre-whitening of the observed signal is needed, and parameter optimization by the natural gradient method is avoided, so the separation process is not constrained by a step-size parameter; the amount of computation can be effectively reduced and the signal separation efficiency improved.
In some alternative embodiments, for example when the energy of a certain original sound source signal in the observed signal is weak, the denominator in the mapping relation between the first parameter and the unmixed matrix W (e.g., formula (13)) may become 0, which can make the above signal processing process unstable, for example causing a crash.
To ensure the stability of the above signal processing process and to improve the separation performance of the method 300, an auxiliary virtual sound source (Auxiliary Image Source, AuxIS) may be introduced to enhance the observed signal, so as to obtain the first mixing matrix H and the reverberated speech signal of the enhanced observed signal. The auxiliary virtual sound source reinforces the weaker of the original sound source signals, preventing an original sound source signal with too little energy from driving the denominator in the mapping relation between the first parameter and the unmixed matrix W (e.g., formula (13)) to 0; this improves both the stability of the signal processing process and the performance of signal separation.
Illustratively, referring to FIG. 7, in method 300, the first mixing matrix H and the reverberated speech signal of the enhanced observed signal may be obtained through the following steps 350 to 370.
350, determining the energy of the signal of the auxiliary virtual sound source according to the observed signal.
As a possible implementation, referring to fig. 8, the energy of the signal of the auxiliary virtual sound source may be determined through the following steps 351 and 352.
351, determining the amplitude spectrum of the signal of the auxiliary virtual sound source according to the observed signal.
352, determining the energy of the signal of the auxiliary virtual sound source according to the energy ratio of the observed signal to the signal of the auxiliary virtual sound source.
That is, the signal of the auxiliary virtual sound source may be decomposed into two parts, namely the amplitude spectrum of the signal of the auxiliary virtual sound source and the energy ratio of the observed signal to the signal of the auxiliary virtual sound source; see formula (15), where λ_dB is the energy ratio of the observed signal to the signal of the auxiliary virtual sound source, which can be given in advance, and the remaining factor is the amplitude spectrum of the auxiliary virtual sound source.
By way of example, the amplitude spectrum of the auxiliary virtual sound source may be defined as follows:
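Whatever exact definition formula (16) takes, the decomposition can be illustrated with a hedged numpy sketch. The choice of the channel-averaged magnitude for the amplitude spectrum and the dB convention below are assumptions for illustration only, not the patent's formulas:

```python
import numpy as np

def aux_source_energy(X, lambda_dB=20.0):
    """Hedged sketch: energy lambda(f, t) of the auxiliary virtual
    sound source, set lambda_dB decibels below the observed energy.
    X: observed STFT, shape (mics, freq, frames)."""
    mag = np.mean(np.abs(X), axis=0)          # assumed amplitude spectrum
    return (mag ** 2) * 10.0 ** (-lambda_dB / 10.0)

# With a unit-magnitude observation and a 20 dB ratio, the auxiliary
# source carries 1% of the observed energy.
lam = aux_source_energy(np.ones((2, 3, 4)))
print(lam[0, 0])                              # 0.01
```

The point of the split is that only the scalar ratio λ_dB needs to be fixed in advance; the amplitude spectrum follows the observation itself.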
360, obtaining a second related transfer function corresponding to the auxiliary virtual sound source
For example, an auxiliary virtual sound source may be introduced to enhance the k-th sound source (e.g., the weakest one of the original sound source signals). The second related transfer function corresponding to the auxiliary virtual sound source can be expressed accordingly, where k is a positive integer and different values of k correspond to different sound sources.
Optionally, the method of estimating the correlation transfer function may vary with the usage scenario. In some embodiments, the estimation can be performed by averaging multiple measurements made in advance (i.e., by actual measurement), which suits scenarios where the speaker's position is relatively fixed, such as inside a car. In other embodiments, an adaptive correlation-transfer-function estimation algorithm (e.g., a far-field approximation estimation algorithm) may be used, which suits scenarios where the speaker's location is unknown, such as a conference room.
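For a two-microphone array, the far-field approximation mentioned above reduces to a frequency-dependent phase ramp set by the inter-microphone delay. A hedged sketch follows; the geometry conventions (angle measured from the array axis, mic 2 expressed relative to mic 1) are assumptions for illustration:

```python
import numpy as np

def far_field_rtf(theta_deg, mic_spacing, freqs, c=343.0):
    """Hedged sketch of a far-field related-transfer-function estimate
    for a 2-mic array: mic 2 relative to mic 1, plane wave arriving
    from angle theta measured from the array axis."""
    tau = mic_spacing * np.cos(np.deg2rad(theta_deg)) / c  # delay (s)
    return np.exp(-2j * np.pi * freqs * tau)

freqs = np.array([250.0, 1000.0, 4000.0])
# Broadside (90 deg): zero inter-mic delay, so the RTF is 1 everywhere.
print(far_field_rtf(90.0, 0.1, freqs))
```

A far-field estimate like this needs only the array geometry, which is why it suits the unknown-speaker-position scenarios mentioned above.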
370, obtaining the first mixing matrix H and the reverberated speech signal according to the original sound source signals of the at least two sources, the energy of the signal of the auxiliary virtual sound source, and the second related transfer function, wherein the first mixing matrix H comprises the second related transfer function and the reverberated speech signal includes the energy of the signal of the auxiliary virtual sound source.
Illustratively, after deriving equation (6) above, this equation (6) may be further expanded to yield:
When the k-th sound source is enhanced by introducing the auxiliary virtual sound source, the enhanced observation signal, denoted x_k(f, t), can be expressed as the following formula:
That is, at this time the first mixing matrix H and the reverberated speech signal are both updated accordingly.
For example, for a two-sound-source scene, k takes the values 1 and 2, corresponding to the two different sound sources. When a virtual sound source is introduced to enhance the 1st sound source, the resulting observation signal x_1(f, t) is as follows:
When a virtual sound source is introduced to enhance the 2nd sound source, the resulting observation signal x_2(f, t) is as follows:
After the observed signal is enhanced by the auxiliary virtual sound source, the enhanced first mixing matrix H and reverberated speech signal can be input into the signal processing model to obtain the unmixed matrix W of the enhanced observation signal. Accordingly, the enhanced first mixing matrix H and reverberated speech signal can be used to determine the second parameter and the first parameter, and the unmixed matrix W is then obtained according to the mapping relationship between the first parameter and the unmixed matrix W.
Illustratively, the second parameter and the first parameter determined from the enhanced first mixing matrix H and the enhanced reverberated speech signal may be denoted accordingly.
Substituting formula (18) into the expression for the first parameter yields:
Illustratively, for a dual-microphone, dual-sound-source scene, after the first parameter is determined, substituting it into the above formulas (13) and (14) yields the unmixed matrix W. A separation signal can then be obtained according to the unmixed matrix W and the enhanced observation signal.
As a specific example, after the first parameters corresponding to the two enhanced observation signals are acquired, each can be substituted into formulas (13) and (14) to obtain the corresponding rows and the modulus of the unmixed matrix W. The target speech signal of the 1st sound source and the target speech signal of the 2nd sound source can then be obtained accordingly.
That is, when the auxiliary virtual sound source is introduced, the energy λ(f, t) of the auxiliary virtual sound source is first obtained from the observation signal, and the correlation transfer function between the microphones corresponding to the auxiliary virtual sound source is estimated. The energy λ(f, t) and the correlation transfer function are then used to determine the enhanced observed signal x_k(f, t), from which the second parameter and the first parameter are determined. Finally, according to the mapping relationship between the first parameter and the unmixed matrix W, for example formulas (13) and (14), the unmixed matrix W is obtained, and thus the separation signal y(f, t), i.e., the target speech signal.
Therefore, the embodiment of the application enhances the observed signal by introducing the auxiliary virtual sound source, so as to obtain the first mixing matrix corresponding to the enhanced observed signal and the reverberated speech signal, wherein the first mixing matrix comprises the second related transfer function corresponding to the auxiliary virtual sound source, and the reverberated speech signal includes the energy of the signal of the auxiliary virtual sound source. The unmixed matrix W to be solved may be regarded as a special beamforming matrix (i.e., a matrix designed by sound source independence rather than by direction information), and the added auxiliary virtual sound source reinforces the original speech signal in the observed signal, so the accuracy of the unmixed matrix W can be increased, ensuring the stability of the signal processing process on the one hand and improving the performance of signal separation on the other.
Fig. 9 is a schematic diagram comparing the effect of the signal processing method provided in the embodiment of the present application with prior-art sound source separation schemes. Panel (a) compares the improvement in Signal-to-Interference Ratio (SIR) of the separated signal obtained by each scheme, panel (b) compares the improvement in Signal-to-Distortion Ratio (SDR), and the X-axis of both panels represents the reverberation time.
For example, a mixed speech signal may be acquired in a dual-microphone, dual-source mixed scene. As a specific example, two microphones may be used to collect the sound signals of two persons speaking simultaneously in a room 4.45 m long, 3.55 m wide, and 2.5 m high. The two persons may each be 1 m from the microphones, at direction angles of 45° and 135° with respect to the microphones, and the distance between the two microphones may be 0.1 m. The reverberation time is adjusted from 150 ms to 300 ms in steps of 10 ms.
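The experimental geometry described above can be laid out in a few lines. This is a hedged sketch: the array is assumed to sit at the room centre with angles measured from the x-axis, which the description does not specify.

```python
import math

room = (4.45, 3.55, 2.5)                    # length, width, height (m)
center = (room[0] / 2, room[1] / 2)         # assumed array position
mic_spacing = 0.1
mics = [(center[0] - mic_spacing / 2, center[1]),
        (center[0] + mic_spacing / 2, center[1])]

def speaker_pos(angle_deg, dist=1.0):
    """Speaker at `dist` metres from the array, `angle_deg` off-axis."""
    a = math.radians(angle_deg)
    return (center[0] + dist * math.cos(a), center[1] + dist * math.sin(a))

speakers = [speaker_pos(45.0), speaker_pos(135.0)]
rt60s = [0.150 + 0.010 * i for i in range(16)]   # 150 ms .. 300 ms
print(len(rt60s))                                 # 16 reverberation conditions
```

Sweeping the 16 reverberation conditions with a room simulator over this layout would reproduce the evaluation grid used for Fig. 9.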
The speech signals received by the two microphones can be processed respectively by (1) the traditional AuxIVA technique; (2) the reference algorithm, geometrically constrained auxiliary-function IVA with VCD (Geometrically Constrained Auxiliary-function with VCD, GCAV-IVA); (3) the AuxIS-AuxIVA provided by the embodiments of the present application, with the related transfer function estimated by a steering vector; and (4) the AuxIS-AuxIVA provided by the embodiments of the present application, with the related transfer function estimated from pre-measured values. Here the steering vector is a far-field approximation of the related transfer function of the AuxIS, and the pre-measured value is the actually measured related transfer function of the AuxIS.
As can be seen from fig. 9, under different reverberation times, the separated signal obtained by the signal processing method provided by the embodiment of the present application shows a significant improvement in SIR and SDR compared with the existing methods, so the signal processing method provided by the embodiment of the present application can help improve the quality of the front-end signal.
The specific embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as disclosed herein.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application. It is to be understood that the numbers may be interchanged where appropriate such that the described embodiments of the application may be implemented in other sequences than those illustrated or described.
Method embodiments of the present application are described in detail above in connection with fig. 3-9, and apparatus embodiments of the present application are described in detail below in connection with fig. 10-11.
Fig. 10 is a schematic block diagram of an apparatus 700 for signal processing according to an embodiment of the present application. As shown in fig. 10, the apparatus 700 for signal processing may include an acquisition unit 710 and a processing unit 720.
An acquisition unit 710 for acquiring an observation signal, wherein the observation signal includes original sound source signals of at least two sources acquired by at least two microphones;
a processing unit 720 for determining a first mixing matrix H and a reverberated speech signal based on the observed signal, wherein the first mixing matrix H comprises a first correlation transfer function between the at least two microphones and is used for representing the mapping relationship between the observation signal and the reverberated speech signal;
the processing unit 720 is further configured to input the first mixing matrix H and the reverberated speech signal into a signal processing model to obtain an unmixed matrix W of the observed signal, wherein the signal processing model is used for representing the mapping relationship among the first mixing matrix H, the reverberated speech signal, and the unmixed matrix W;
the processing unit 720 is further configured to obtain a separation signal according to the unmixed matrix W and the observation signal.
Optionally, the processing unit 720 is specifically configured to:
according to the first mixing matrix H, the reverberated speech signal, and the unmixed matrix W, determining a first parameter;
and obtaining the unmixed matrix W according to the mapping relation between the first parameter and the unmixed matrix W.
Optionally, the processing unit 720 is specifically configured to:
according to the first mixing matrix H and the reverberated speech signal, determining a second parameter;
and determining the first parameter according to the second parameter and the unmixed matrix W.
Optionally, the processing unit 720 is further configured to:
and determining a mapping relation between the first parameter and the unmixed matrix W according to the null space of the first parameter.
Optionally, the processing unit 720 is further configured to:
and determining the modulus value of the unmixed matrix W according to a minimum distortion principle.
Optionally, the processing unit 720 is further configured to determine, according to the observed signal, energy of a signal of the auxiliary virtual sound source;
the obtaining unit 710 is further configured to obtain a second related transfer function corresponding to the auxiliary virtual sound source.
The processing unit 720 is specifically configured to:
based on the observed signal, the energy of the signal of the auxiliary virtual sound source, and the second correlation transfer function, obtain the first mixing matrix H and the reverberated speech signal, wherein the first mixing matrix H comprises the second correlation transfer function and the reverberated speech signal includes the energy of the signal of the auxiliary virtual sound source.
Optionally, the processing unit 720 is specifically configured to:
determining an amplitude spectrum of the signal of the auxiliary virtual sound source according to the observation signal;
and determining the energy of the signal of the auxiliary virtual sound source according to the energy ratio of the observed signal to the signal of the auxiliary virtual sound source.
Optionally, the obtaining unit 710 is specifically configured to determine the second correlation transfer function by averaging multiple measurements made in advance.
Optionally, the obtaining unit 710 is specifically configured to determine the second correlation transfer function using an adaptive correlation transfer function estimation algorithm.
Optionally, the processing unit 720 is specifically configured to:
determining a mapping relationship between the first mixing matrix H and a second mixing matrix A, wherein the second mixing matrix A is used for representing the mapping relationship between the observed signal and the original sound source signals of the at least two sources;
determining the reverberated speech signal according to the mapping relationship and the original sound source signals of the at least two sources.
Optionally, the second mixing matrix a comprises a room transfer function between a sound source of the observation signal and a microphone.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 700 for signal processing in this embodiment may correspond to a corresponding main body for performing the method 300 in this embodiment of the present application, and the foregoing and other operations and/or functions of each module in the apparatus 700 are respectively for implementing each method in fig. 3 to 8, or corresponding flow in each method, which are not described herein for brevity.
The apparatus and system of embodiments of the present application are described above in terms of functional modules in connection with the accompanying drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in software form, and the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 11 is a schematic block diagram of an electronic device 800 provided in an embodiment of the present application.
As shown in fig. 11, the electronic device 800 may include:
a memory 810 and a processor 820, the memory 810 being for storing a computer program and transmitting the program code to the processor 820. In other words, the processor 820 may call and run a computer program from the memory 810 to implement the signal processing method in the embodiments of the present application.
For example, the processor 820 may be used to perform the steps of the method 300 described above according to instructions in the computer program.
In some embodiments of the present application, the processor 820 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the present application, the memory 810 includes, but is not limited to:
volatile memory and/or nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules that are stored in the memory 810 and executed by the processor 820 to perform the signal processing methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing specified functions, which are used to describe the execution of the computer program in the electronic device 800.
Optionally, the electronic device 800 may further include:
a transceiver 830, the transceiver 830 being connectable to the processor 820 or the memory 810.
Processor 820 may control transceiver 830 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. Transceiver 830 may include a transmitter and a receiver. Transceiver 830 may further include antennas, the number of which may be one or more.
It should be appreciated that the various components in the electronic device 800 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
According to an aspect of the present application, there is provided a communication device comprising a processor and a memory for storing a computer program, the processor being adapted to invoke and run the computer program stored in the memory, such that the communication device performs the method of the above method embodiments.
According to an aspect of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the method of the above-described method embodiments.
In other words, when implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces, in whole or in part, a flow or function consistent with embodiments of the present application. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus, device, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of signal processing, comprising:
acquiring an observation signal, wherein the observation signal comprises original sound source signals of at least two sources acquired through at least two microphones;
determining a first mixing matrix H and a reverberated speech signal from the observed signal, wherein the first mixing matrix H comprises a first correlation transfer function between the at least two microphones and is used for representing the mapping relationship between the observation signal and the reverberated speech signal;
inputting the first mixing matrix H and the reverberated speech signal into a signal processing model to obtain an unmixed matrix W of the observed signal, wherein the signal processing model is used for representing the mapping relationship among the first mixing matrix H, the reverberated speech signal, and the unmixed matrix W;
and obtaining a separation signal according to the unmixed matrix W and the observation signal.
2. The method of claim 1, wherein the inputting the first mixing matrix H and the reverberated speech signal into a signal processing model to obtain an unmixed matrix W of the observation signal comprises:
according to the first mixing matrix H, the reverberated speech signal, and the unmixed matrix W, determining a first parameter;
and obtaining the unmixed matrix W according to the mapping relation between the first parameter and the unmixed matrix W.
3. The method of claim 2, wherein the determining a first parameter according to the first mixing matrix H, the reverberated speech signal, and the unmixed matrix W comprises:
according to the first mixing matrix H and the reverberated speech signal, determining a second parameter;
and determining the first parameter according to the second parameter and the unmixed matrix W.
4. A method according to claim 2 or 3, further comprising:
and determining a mapping relation between the first parameter and the unmixed matrix W according to the null space of the first parameter.
5. A method according to claim 2 or 3, further comprising:
and determining the modulus value of the unmixed matrix W according to a minimum distortion principle.
6. A method according to any one of claims 1-3, further comprising:
determining the energy of the signal of the auxiliary virtual sound source according to the observation signal;
acquiring a second related transfer function corresponding to the auxiliary virtual sound source;
wherein the determining a first mixing matrix H and a reverberated speech signal from the observed signal comprises:
obtaining the first mixing matrix H and the reverberated speech signal from the observed signal, the energy of the signal of the auxiliary virtual sound source, and the second related transfer function, wherein the first mixing matrix H comprises the second related transfer function and the reverberated speech signal includes the energy of the signal of the auxiliary virtual sound source.
7. The method of claim 6, wherein determining the energy of the signal of the auxiliary virtual sound source from the observed signal comprises:
determining an amplitude spectrum of the signal of the auxiliary virtual sound source according to the observation signal;
and determining the energy of the signal of the auxiliary virtual sound source according to the energy ratio of the observed signal to the signal of the auxiliary virtual sound source.
8. The method of claim 6, wherein the obtaining a second associated transfer function corresponding to the auxiliary virtual sound source comprises:
the second correlation transfer function is determined using a means of averaging of the pre-multipoint measurements.
9. The method of claim 6, wherein the obtaining a second associated transfer function corresponding to the auxiliary virtual sound source comprises:
the second correlation transfer function is determined using an adaptive correlation transfer function estimation algorithm.
10. A method according to any of claims 1-3, characterized in that the determining a first mixing matrix H and a reverberated speech signal from the observed signal comprises:
determining a mapping relationship between the first mixing matrix H and a second mixing matrix A, wherein the second mixing matrix A is used for representing the mapping relationship between the observed signal and the original sound source signals of the at least two sources;
determining the reverberated speech signal according to the mapping relationship and the original sound source signals of the at least two sources.
11. The method of claim 10, wherein the second mixing matrix a comprises a room transfer function between a sound source of the observation signal and a microphone.
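Claims 10 and 11 relate the observed signal to the original sources through the second mixing matrix A, built from room transfer functions between sources and microphones. In the STFT domain this mapping is a per-frequency matrix product; a minimal sketch (the shapes and names are assumptions for illustration, not the patent's notation):

```python
import numpy as np

def mix_observations(A, S):
    """Apply the second mixing matrix per frequency bin.

    A: (freq_bins, mics, sources) room transfer functions, source -> mic.
    S: (freq_bins, sources, frames) source STFTs.
    Returns the observed signal X: (freq_bins, mics, frames),
    with X[f] = A[f] @ S[f]."""
    return A @ S
```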
12. An apparatus for signal processing, comprising:
an acquisition unit configured to acquire an observation signal, wherein the observation signal includes original sound source signals of at least two sources acquired by at least two microphones;
a processing unit for determining a first mixing matrix H and a reverberated speech signal according to the observed signal, wherein the first mixing matrix H comprises a first correlation transfer function between the at least two microphones, and the first mixing matrix H is used for representing the mapping relationship between the observed signal and the reverberated speech signal;
the processing unit is further configured to input the first mixing matrix H and the reverberated speech signal into a signal processing model to obtain an unmixed matrix W of the observed signal, wherein the signal processing model is used for representing the mapping relationship among the first mixing matrix H, the reverberated speech signal, and the unmixed matrix W;
the processing unit is further configured to obtain a separation signal according to the unmixed matrix W and the observation signal.
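The final step of claim 12 obtains the separation signal by applying the unmixed matrix W to the observed signal. In the STFT domain this is again a per-frequency matrix product; a minimal numpy sketch under the same assumed shapes:

```python
import numpy as np

def separate(W, X):
    """Apply the unmixed (demixing) matrix per frequency bin.

    W: (freq_bins, sources, mics) unmixed matrix.
    X: (freq_bins, mics, frames) observed STFT.
    Returns separated signals Y with Y[f] = W[f] @ X[f]."""
    return np.einsum('fsm,fmt->fst', W, X)
```

When W is the exact inverse of the mixing matrix, this recovers the sources; in practice blind source separation determines them only up to the usual scaling and permutation ambiguities.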
13. An electronic device comprising a processor and a memory, the memory having instructions stored therein that when executed by the processor cause the processor to perform the method of any of claims 1-11.
14. A computer storage medium for storing a computer program, the computer program comprising instructions for performing the method of any one of claims 1-11.
CN202111415175.5A 2021-11-25 2021-11-25 Signal processing method and device Active CN114333876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111415175.5A CN114333876B (en) 2021-11-25 2021-11-25 Signal processing method and device


Publications (2)

Publication Number Publication Date
CN114333876A CN114333876A (en) 2022-04-12
CN114333876B (en) 2024-02-09

Family

ID=81046323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111415175.5A Active CN114333876B (en) 2021-11-25 2021-11-25 Signal processing method and device

Country Status (1)

Country Link
CN (1) CN114333876B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2863391A1 (en) * 2012-06-18 2015-04-22 Goertek Inc. Method and device for dereverberation of single-channel speech
CN109994120A (en) * 2017-12-29 2019-07-09 Fuzhou Rockchip Electronics Co., Ltd. Dual-microphone-based sound enhancement method, system, speaker and storage medium
CN110428852A (en) * 2019-08-09 2019-11-08 Nanjing Institute of Advanced Artificial Intelligence Co., Ltd. Speech separation method, device, medium and equipment
WO2020064089A1 (en) * 2018-09-25 2020-04-02 Huawei Technologies Co., Ltd. Determining a room response of a desired source in a reverberant environment
CN112435685A (en) * 2020-11-24 2021-03-02 Shenzhen Youjie Zhixin Technology Co., Ltd. Blind source separation method and device for strong reverberation environment, voice equipment and storage medium
CN113393857A (en) * 2021-06-10 2021-09-14 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Method, device and medium for eliminating human voice of music signal
CN113687307A (en) * 2021-08-19 2021-11-23 PLA Naval University of Engineering Adaptive beamforming method for low signal-to-noise-ratio and reverberant environments

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101434200B1 (en) * 2007-10-01 2014-08-26 Samsung Electronics Co., Ltd. Method and apparatus for identifying sound source from mixed sound


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments; Robert Aichner et al.; Signal Processing; Vol. 86, No. 6; full text *
Research on microphone array beamforming speech separation and acoustic echo cancellation methods; Ning Jun; China Master's Theses Full-text Database; full text *


Similar Documents

Publication Publication Date Title
CN107221336B (en) Device and method for enhancing target voice
US10522167B1 (en) Multichannel noise cancellation using deep neural network masking
US8204263B2 (en) Method of estimating weighting function of audio signals in a hearing aid
US7761291B2 (en) Method for processing audio-signals
US8958572B1 (en) Adaptive noise cancellation for multi-microphone systems
US20190132687A1 (en) Electronic device using a compound metric for sound enhancement
US11146897B2 (en) Method of operating a hearing aid system and a hearing aid system
CN111081267B (en) Multi-channel far-field speech enhancement method
US9378754B1 (en) Adaptive spatial classifier for multi-microphone systems
Marquardt et al. Coherence preservation in multi-channel Wiener filtering based noise reduction for binaural hearing aids
WO2019113253A1 (en) Voice enhancement in audio signals through modified generalized eigenvalue beamformer
US20150318001A1 (en) Stepsize Determination of Adaptive Filter For Cancelling Voice Portion by Combing Open-Loop and Closed-Loop Approaches
CN110265054A (en) Audio signal processing method, device, computer readable storage medium and computer equipment
US9443531B2 (en) Single MIC detection in beamformer and noise canceller for speech enhancement
US20170094421A1 (en) Dynamic relative transfer function estimation using structured sparse bayesian learning
CN111681665A (en) Omnidirectional noise reduction method, equipment and storage medium
US9646629B2 (en) Simplified beamformer and noise canceller for speech enhancement
CN112802490A (en) Beam forming method and device based on microphone array
US20150319528A1 (en) Noise Energy Controlling In Noise Reduction System With Two Microphones
WO2021243634A1 (en) Binaural beamforming microphone array
CN114333876B (en) Signal processing method and device
CN113889135A (en) Method for estimating direction of arrival of sound source, electronic equipment and chip system
CN108735228B (en) Voice beam forming method and system
Farmani et al. Sound source localization for hearing aid applications using wireless microphones
JP6479211B2 (en) Hearing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40069743

Country of ref document: HK

GR01 Patent grant